## High-level approach overview

1. Data Preparation:
    * Load the historical data
    * Convert the date column to datetime format and set it as index
2. Forecasting Algorithms:
    * Exponential Smoothing: Used for capturing trends and seasonality in the data.
    * ARIMA: Applied for modeling and forecasting time series data.
    * Prophet: Designed for handling time series data with seasonal effects and missing data.
3. Comparison:
    * Forecast CPU utilization for each service for the next 15 days using the three algorithms.
    * Compare the results using Root Mean Squared Error (RMSE) to determine the best algorithm for each service.
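
As a rough illustration of the comparison metric in step 3, RMSE is the square root of the mean squared difference between actual and forecast values, so a lower value means a closer fit. The sketch below uses made-up numbers rather than values from the dataset, and `rmse` is a hypothetical helper name, not part of the notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only (not taken from the CPU usage data)
actual = [100.0, 110.0, 120.0]
forecast_a = [102.0, 108.0, 121.0]  # errors: -2, 2, -1
forecast_b = [90.0, 115.0, 130.0]   # errors: 10, -5, -10

# The forecast with the smaller RMSE tracks the actual values more closely
print("forecast_a RMSE:", rmse(actual, forecast_a))
print("forecast_b RMSE:", rmse(actual, forecast_b))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, which is usually the desired behaviour when under-provisioning CPU is costly.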

> pip install pandas \
> pip install numpy \
> pip install statsmodels \
> pip install scikit-learn \
> pip install matplotlib \
> pip install prophet \
> pip install seaborn \
> pip install tbats

In [11]:

import pandas as pd
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA
#from fbprophet import Prophet  #package name before Prophet 1.0; newer releases are imported as `prophet`
from prophet import Prophet
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import seaborn as sns

%matplotlib inline
plt.style.use('fivethirtyeight')

In [3]:
#Load the dataset
data = pd.read_csv("test-1_cpu_usage_data.csv")

In [7]:
#Convert the date column to datetime format
data['date'] = pd.to_datetime(data['date'], format="%d-%m-%Y")

In [10]:
data.head()

| Unnamed: 0 | date | namespace | pods | cpu requested | cpu used | percentage of usage |
|---|---|---|---|---|---|---|
| 0 | 2024-09-07 | test-1 | 54 | 258.6 | 91.9 | 35.54 |
| 1 | 2024-09-08 | test-1 | 177 | 606.0 | 245.7 | 40.54 |
| 2 | 2024-09-09 | test-1 | 132 | 634.1 | 151.2 | 23.84 |
| 3 | 2024-09-10 | test-1 | 108 | 482.3 | 144.2 | 29.9 |
| 4 | 2024-09-11 | test-1 | 62 | 250.1 | 58.9 | 23.55 |


In [12]:
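#The original call below failed because `hue` expects one column name (or a vector as long as the data), not a list of columns:
#sns.histplot(data=data, x='date', hue=['cpu requested', 'cpu used'], multiple="dodge", shrink=.8)
#ValueError: Length of list vectors must match length of `data` when both are used, but `data` has length 181 and the vector passed to `hue` has length 2.
#Reshape to long format so a single 'metric' column names the series, then plot both over time
long_data = data.melt(id_vars='date', value_vars=['cpu requested', 'cpu used'], var_name='metric', value_name='cpu')
sns.lineplot(data=long_data, x='date', y='cpu', hue='metric')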

In [None]:
#Set the date column as index
data.set_index('date', inplace=True)

In [None]:
#Function to forecast using Exponential Smoothing
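def forecast_exponential_smoothing(data, periods=15, seasonal_periods=7):
    #The data is daily, so a weekly cycle (7) is assumed here; the original value of 12 suits monthly data
    model = ExponentialSmoothing(data, trend='add', seasonal='add', seasonal_periods=seasonal_periods)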
    fit = model.fit()
    forecast = fit.forecast(periods)
    return forecast
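
To make the idea behind the function above more concrete, here is a minimal pure-Python sketch of simple exponential smoothing, the level-only form without the trend and seasonal terms that Holt-Winters adds. It is an illustration under stated assumptions, not what statsmodels runs internally: `simple_exponential_smoothing`, the fixed `alpha`, and the sample values are all hypothetical.

```python
def simple_exponential_smoothing(values, alpha=0.5):
    """Return the smoothed level after each observation.

    Update rule: level_t = alpha * value_t + (1 - alpha) * level_{t-1}
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [values[0]]  # initialise the level with the first observation
    for value in values[1:]:
        levels.append(alpha * value + (1 - alpha) * levels[-1])
    return levels

# Illustrative values only (not taken from the CPU usage data)
cpu_used = [10.0, 20.0, 30.0]
levels = simple_exponential_smoothing(cpu_used, alpha=0.5)

# Without trend or seasonality, every future step is forecast as the last level
forecast = [levels[-1]] * 3
print(levels)
print(forecast)
```

A larger `alpha` weights recent observations more heavily; statsmodels estimates this parameter (and the trend and seasonal ones) from the data rather than fixing it by hand.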

In [None]:
#Function to forecast using ARIMA
def forecast_arima(data, periods=15):
    model = ARIMA(data, order=(5,1,0))
    fit = model.fit()
    forecast = fit.forecast(steps = periods)
    return forecast

In [None]:
#Function to forecast using Prophet
def forecast_prophet(data, periods=15):
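    #`data` is a Series indexed by date; Prophet expects a DataFrame with columns 'ds' (date) and 'y' (value)
    df = data.rename('y').reset_index().rename(columns={'date': 'ds'})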
    model = Prophet()
    model.fit(df)
    future = model.make_future_dataframe(periods=periods)
    forecast = model.predict(future)
    return forecast.set_index('ds')['yhat'][-periods:]

In [None]:
#Forecast for each service
services = data['namespace'].unique()
results = {}

for service in services:
    service_data = data[data['namespace'] == service]['cpu used']
    
    #Forecast using different algorithms
    es_forecast = forecast_exponential_smoothing(service_data)
    arima_forecast = forecast_arima(service_data)
    prophet_forecast = forecast_prophet(service_data)
    
    #store the results
    results[service] = {
        'Exponential Smoothing': es_forecast,
        'ARIMA': arima_forecast,
        'Prophet': prophet_forecast
    }

In [None]:
#Plot the results for comparison
for service in services:
    plt.figure(figsize=(12,6))
    plt.plot(data[data['namespace'] == service]['cpu used'], label='Historical Data')
    plt.plot(results[service]['Exponential Smoothing'], label='Exponential Smoothing Forecast')
    plt.plot(results[service]['ARIMA'], label='ARIMA Forecast')
    plt.plot(results[service]['Prophet'], label='Prophet Forecast')
    plt.title(f'CPU Utilization Forecast for {service}')
    plt.legend()
    plt.show()

In [None]:
#Determine the best algorithm based on RMSE
#The forecasts in `results` cover future dates that have no actual values yet,
#so RMSE is computed on a holdout: fit on all but the last 15 days, then score against those 15 days
best_algorithm = {}
horizon = 15

for service in services:
    service_data = data[data['namespace'] == service]['cpu used']
    train, test = service_data[:-horizon], service_data[-horizon:]

    rmse = {
        'Exponential Smoothing': np.sqrt(mean_squared_error(test, forecast_exponential_smoothing(train, horizon))),
        'ARIMA': np.sqrt(mean_squared_error(test, forecast_arima(train, horizon))),
        'Prophet': np.sqrt(mean_squared_error(test, forecast_prophet(train, horizon))),
    }

    #Keep the name of the algorithm with the lowest holdout RMSE
    best_algorithm[service] = min(rmse, key=rmse.get)

print("Best Algorithm for each service: ")
for service, algorithm in best_algorithm.items():
    print(f"{service}: {algorithm}")

## Handling new services without historical data

Forecasting for a new service with no historical data is challenging, but several strategies can produce reasonable estimates:

1. Use Similar Services: Identify services that have similar characteristics or usage patterns to the new service. You can use their historical data as a proxy to estimate the resource utilization for the new service.
2. Benchmarking: Establish benchmarks based on industry standards or similar services. This can provide a starting point for resource allocation.
3. Initial Allocation and Monitoring: Allocate resources based on initial estimates and closely monitor the utilization. Adjust the resources dynamically based on the observed usage patterns.
4. Machine Learning Models: Use machine learning models that can handle cold-start problems. For example, collaborative filtering techniques can help predict resource utilization based on similarities between services.
5. Expert Judgement: Consult with domain experts who have experience with similar services. Their insights can help in making informed estimates.
6. Hybrid Approach: Combine multiple methods to improve the accuracy of your forecasts. For example, you can start with benchmarks and adjust based on real-time monitoring and expert judgement.
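
The "Use Similar Services" strategy can be sketched roughly as follows: pick the existing service whose profile is closest to the new one and apply its observed usage ratio (CPU used divided by CPU requested) to the new service's request. Everything here is an assumption made for illustration, including the function names, the choice of `pods` and `cpu_requested` as similarity features, and the sample numbers.

```python
import math

def nearest_service(new_profile, known_profiles):
    """Return the name of the known service closest to the new one.

    Closeness is plain Euclidean distance on (pods, cpu_requested).
    In practice the features would likely need scaling first.
    """
    def distance(a, b):
        return math.hypot(a["pods"] - b["pods"],
                          a["cpu_requested"] - b["cpu_requested"])
    return min(known_profiles,
               key=lambda name: distance(new_profile, known_profiles[name]))

def estimate_cpu_used(new_profile, known_profiles):
    """Estimate CPU usage by borrowing the usage ratio of the nearest service."""
    proxy = nearest_service(new_profile, known_profiles)
    ratio = known_profiles[proxy]["usage_ratio"]
    return proxy, new_profile["cpu_requested"] * ratio

# Hypothetical profiles; usage_ratio = cpu used / cpu requested
known_profiles = {
    "svc-a": {"pods": 50, "cpu_requested": 250.0, "usage_ratio": 0.35},
    "svc-b": {"pods": 150, "cpu_requested": 600.0, "usage_ratio": 0.25},
}
new_profile = {"pods": 60, "cpu_requested": 300.0}

proxy, estimate = estimate_cpu_used(new_profile, known_profiles)
print(f"Proxy service: {proxy}, estimated CPU used: {estimate:.1f}")
```

An estimate like this is only a starting point; as strategy 3 suggests, it should be replaced by the service's own measurements once enough usage data has accumulated.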