# **Cloud Computing: Can you predict the monthly cloud bill for startups in Austin?**

**Cloud Computing**:  a model for enabling on-demand network access to a shared 
pool of computing resources (e.g., networks, servers, storage, applications, 
and services) that can be rapidly provisioned and released with minimal management 
effort or service provider interaction.  

**5 Key Characteristics of Cloud Computing**

> 1. Broad Network Access
> 2. Measured Service
> 3. On Demand Self Service
> 4. Shared Resource Pooling
> 5. Rapid Elasticity

**3 Service Models**

> 1. **Software as a Service (SaaS)**: Provider’s applications running on a cloud infrastructure

> 2. **Platform as a Service (PaaS)**: Consumer-created or acquired applications created using tools supported by the provider

> 3. **Infrastructure as a Service (IaaS)**: Consumer is able to deploy and run arbitrary software on storage, networks, etc.

**Amazon Web Services (AWS)**  
![image.png](attachment:18a6adaf-82e5-4d54-8af6-f128cc121254.png)

## Cloud Management

![image.png](attachment:a03a53f4-6180-4940-a994-67a01aa776d4.png)

**MultiCloudX**: Austin-based start-up that functions as a third-party cloud manager, handling cloud storage, operations, and costs

![image.png](attachment:a2e6d216-85d1-4bc0-928f-3771be97647a.png)

### Import Statements

In [1]:
# Data processing
import numpy as np
import pandas as pd

# Graphics libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Sklearn metrics and processing
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# Keras/Tensorflow modules
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Dropout
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint, TensorBoard

### Import Data

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
dfCCO = pd.read_csv(r".\Data\cco_cost_monthly.csv")

In [None]:
dfPYCO = pd.read_csv(r".\Data\pyco_cost_monthly.csv", low_memory=False)

In [None]:
dfXCO = pd.read_csv(r".\Data\xco_cost_monthly.csv")

In [None]:
dfCCO.head()

___
**Now we combine the datasets of all 3 companies to look at some statistics that we can generalize across them for our feature engineering.**

In [None]:
def plot_corr(dataframe,size=10):
    """
    Plots a correlation matrix as a heat map 
    """
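    # numeric_only=True skips text columns such as product_name, which recent pandas versions refuse to correlate
    corr = dataframe.corr(numeric_only=True)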
    fig, ax = plt.subplots(figsize=(size, size))
    im = ax.matshow(corr,vmin = -1.0, vmax = 1.0, cmap = "bwr")
    plt.xticks(range(len(corr.columns)), corr.columns, rotation = 90);
    plt.yticks(range(len(corr.columns)), corr.columns);
    plt.colorbar(im, orientation = 'vertical')
    plt.title('Correlation Matrix')

In [None]:
dfTotal = pd.concat([dfCCO, dfPYCO, dfXCO], axis=0)
plot_corr(dfTotal)

**We gain little information from this heatmap. It doesn't give us a strong idea of what's driving cost, so we will need to look at the individual charges more directly.**

In [None]:
dfPYCO.sort_values("total_cost", ascending=False)[["invoice_month", "product_name", "total_cost"]]

Here, we see some interesting behaviour: certain large charges appear to have been issued and then reversed. Note that the largest negative charges sit at row indices exactly one above those of the largest positive charges.
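
The charge-then-reversal pattern described above can be checked programmatically. The following is a minimal sketch on made-up line items, assuming a reversal shows up as a negative row immediately after a positive row of (nearly) the same size. The helper `find_reversals` and its `tol` parameter are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical line items; column names follow the notebook, values are made up.
charges = pd.DataFrame({
    "invoice_month": ["2021-01", "2021-01", "2021-02", "2021-02"],
    "product_name": ["Amazon EC2", "Amazon EC2", "Amazon S3", "Amazon S3"],
    "total_cost": [5000.0, -5000.0, 120.0, 80.0],
})

def find_reversals(df, tol=0.01):
    """Return index pairs (i, i+1) where row i+1 cancels row i to within a relative tolerance."""
    cost = df["total_cost"].to_numpy()
    pairs = []
    for i in range(len(cost) - 1):
        is_cancelled = cost[i] > 0 and cost[i + 1] < 0
        if is_cancelled and abs(cost[i] + cost[i + 1]) <= tol * cost[i]:
            pairs.append((df.index[i], df.index[i + 1]))
    return pairs

reversals = find_reversals(charges)
print(reversals)
```

On real data it may be worth restricting the comparison to rows that share a `product_name` and `invoice_month`, since adjacent rows are not guaranteed to belong together.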

____
## Plotting
Now, let's make some figures to get an idea of how the big hitter services are affecting costs month over month.

In [None]:
def hitter_monthly(df, product):
    """
    Grabs the total cost incurred by the big hitter (product) for each month of the DataFrame, df
    """
    dfFin = df.loc[df["product_name"] == product].groupby("invoice_month").agg({"total_cost": "sum"})
    return dfFin

In [None]:
def plot_n_hitters(n_hitters):
    """
    Plots the N biggest overall hitters and the aggregate cost
    """
    top_hitters = dfTotal.groupby("product_name").agg({"total_cost": "sum"}).sort_values("total_cost", ascending=False).index[:n_hitters]
    company_dfs = [dfXCO, dfPYCO, dfCCO]
    company_names = ["XCO", "PYCO", "CCO"]
    final_df_list = []
    fig, ax = plt.subplots(3,2, figsize=(14, 8.5))
    for index in range(len(company_names)):
        inter_list = []

        for hitter in range(len(top_hitters)):
            new_monthly_df = hitter_monthly(company_dfs[index], top_hitters[hitter])
            ax[index][0].plot(new_monthly_df)
            ax[index][0].tick_params("x", rotation=90)
            inter_list.append(new_monthly_df)


        ax[index][0].set_title(str(company_names[index]) + " Big Hitters Monthly Cost")
        ax[index][0].legend(top_hitters, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
        final_df_list.append(inter_list)
        agg_df = company_dfs[index].groupby("invoice_month").agg({"total_cost": "sum"})
        ax[index][1].plot(agg_df, color="red")
        ax[index][1].set_title(str(company_names[index]) + " Total Monthly Cost")
        ax[index][1].tick_params("x", rotation=90)

    fig.tight_layout()    
    fig.savefig(r".\figures\graphs_mod" + str(n_hitters) + ".png")
    return final_df_list

In [None]:
def plot_n_hitters_bycorp(n_hitters):
    """
    Plots the top N big hitters on a company-by-company basis (possibly different for each company)
    """
    company_dfs = [dfXCO, dfPYCO, dfCCO]
    company_names = ["XCO", "PYCO", "CCO"]
    final_df_list = []
    fig, ax = plt.subplots(3,2, figsize=(14, 8.5))
    for index in range(len(company_names)):
        inter_list = []
        top_hitters = company_dfs[index].groupby("product_name").agg({"total_cost": "sum"}).sort_values("total_cost", ascending=False).index[:n_hitters]
        
        for hitter in range(len(top_hitters)):
            new_monthly_df = hitter_monthly(company_dfs[index], top_hitters[hitter])
            ax[index][0].plot(new_monthly_df)
            ax[index][0].tick_params("x", rotation=90)
            inter_list.append(new_monthly_df)


        ax[index][0].set_title(str(company_names[index]) + " Big Hitters Monthly Cost")
        ax[index][0].legend(top_hitters, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
        final_df_list.append(inter_list)
        agg_df = company_dfs[index].groupby("invoice_month").agg({"total_cost": "sum"})
        ax[index][1].plot(agg_df, color="red")
        ax[index][1].set_title(str(company_names[index]) + " Total Monthly Cost")
        ax[index][1].tick_params("x", rotation=90)

    fig.tight_layout()    
    fig.savefig(r".\figures\company_hitters_" + str(n_hitters) + ".png")

____
The first plot demonstrates how the top 5 overall "big hitters" (left) relate to the aggregate cost (right) for each company. The overall top 5 aren't necessarily the biggest cost drivers for each individual company. Another interesting insight is that for XCO, the shape of the total-cost curve seems to be driven mostly by Amazon DocumentDB: even though it is not the largest charge in most months, it is the biggest contributor of variance.
____

In [None]:
top_5 = plot_n_hitters(5)
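
The distinction between "largest charge" and "biggest contributor of variance" can be made quantitative. One possible measure, which is an assumption here and not something the notebook computes, is each product's covariance with the total cost divided by the variance of the total. Because the total is the sum of the products, these shares add up to 1. The product names and numbers below are made up.

```python
import pandas as pd

# Hypothetical monthly cost per product (rows: months, columns: products).
monthly = pd.DataFrame(
    {
        "Steady service": [100.0, 101.0, 99.0, 100.0],
        "Spiky service": [10.0, 40.0, 5.0, 45.0],
    },
    index=["2021-01", "2021-02", "2021-03", "2021-04"],
)

def variance_share(product_costs):
    """Share of the total-cost variance attributable to each product."""
    total = product_costs.sum(axis=1)
    # cov(product, total) summed over all products equals var(total),
    # so dividing by var(total) gives shares that sum to 1.
    cov_with_total = product_costs.apply(lambda col: col.cov(total))
    return cov_with_total / total.var()

shares = variance_share(monthly)
print(shares)
```

In this toy data the steady service is the larger charge every month, yet almost all of the variance share belongs to the spiky one, which mirrors the DocumentDB observation.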

____
Here, we look at the top 5 "big hitters" on a company-by-company basis. Several costs that matter to a company with a smaller cloud footprint, like CCO, don't necessarily affect a relatively larger corporation like PYCO.
____

In [None]:
plot_n_hitters_bycorp(5)

## Data Processing

**Now we will process our data into usable features for a time series regression model. For each month, we take 3 main things:**
1. Total Cost
2. Total Cost by product
3. Total Usage by product

In [None]:
def process_dataframe(df, fillna=True):
    """
    Processes a dataFrame from what's initially given to a usable form for our model.
    """
    dfProcessed = df.groupby("invoice_month").agg({"total_cost": "sum"})
    products = dfTotal["product_name"].unique()
    # Loops through all of the unique product names
    for product in products:
        # Grabs the monthly total cost and usage for each product
        df_product_monthly = df.loc[df["product_name"] == product].groupby("invoice_month").agg({"total_cost": "sum", "usage_amount": "sum"}).rename(columns={"total_cost" : product + " cost", "usage_amount": product + " usage"})
        #Concatenates that onto the total dataframe we have
        dfProcessed = pd.concat([dfProcessed, df_product_monthly], axis=1)
    if fillna:
        dfProcessed = dfProcessed.fillna(0)
    return dfProcessed

In [None]:
dfCCO_processed = process_dataframe(dfCCO, fillna=False)
dfCCO_processed
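
To make the shape of the processed table concrete, here is a small sketch that builds the same three kinds of features (total cost, cost by product, usage by product) from made-up rows using `pivot_table`. It is an alternative formulation for illustration, not the notebook's `process_dataframe`. The helper `monthly_features` is hypothetical, its column order differs, and it only creates columns for products present in the given frame, whereas the notebook uses the product list of the combined dataset so that every company ends up with the same columns.

```python
import pandas as pd

# Hypothetical line items with the notebook's column names.
raw = pd.DataFrame({
    "invoice_month": ["2021-01", "2021-01", "2021-02", "2021-02", "2021-02"],
    "product_name": ["A", "B", "A", "A", "B"],
    "total_cost": [10.0, 5.0, 7.0, 3.0, 6.0],
    "usage_amount": [100.0, 1.0, 70.0, 30.0, 2.0],
})

def monthly_features(df):
    """One row per month: total cost plus per-product cost and usage columns."""
    total = df.groupby("invoice_month")["total_cost"].sum()
    wide = df.pivot_table(index="invoice_month", columns="product_name",
                          values=["total_cost", "usage_amount"], aggfunc="sum")
    # Flatten the (value, product) column pairs into "<product> cost" / "<product> usage"
    suffix = {"total_cost": "cost", "usage_amount": "usage"}
    wide.columns = [f"{product} {suffix[value]}" for value, product in wide.columns]
    return pd.concat([total, wide], axis=1)

features = monthly_features(raw)
print(features)
```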

### CCO 

Interesting things to note:

* The only negative correlations lie with the Amazon Elastic File System

* Cost variance seems to be driven strongly by many different services. Likely points to a ramping up of cloud costs, which was reflected in our earlier plots

In [None]:
# CCO Correlation Matrix
plot_corr(dfCCO_processed.dropna(axis=1, how="all"))

### PYCO

Interesting things to note:  

* There is a fairly large number of negative correlations with cost, but none are very strong.

* The strongest correlation seems to lie with Amazon Elastic File System costs

In [None]:
dfPYCO_processed = process_dataframe(dfPYCO, fillna=False)
# PYCO Correlation Matrix
plot_corr(dfPYCO_processed.dropna(axis=1))

### XCO
Interesting points to note:  

* Amazon Lightsail cost has a fairly strong negative correlation with total cost.  

* Amazon Relational DB Cost, Amazon simple storage, and Amazon DocumentDB are cost variance drivers. This seems to be a company that intakes a good amount of data and the data intake seems to correlate with increasing costs.

In [None]:
dfXCO_processed = process_dataframe(dfXCO, fillna=False)
plot_corr(dfXCO_processed.dropna(axis=1, how="all"))

# Modelling

### Why LSTM?

* AR/MA models require hyperparameters tuned for each dataset, so they would not be deployable in the field if we need to retune them for every new dataset
* FB Prophet is not well suited to data with many features beyond the target series itself
* Holt-Winters Exponential Smoothing **Someone needs to help with this**

The following functions respectively have the role of:
1. Converting dataframes into time slice tensors
2. Creating a test/train split of our time series data predicated on an array of months whose target and associated features we'd like to keep in testing
3. A custom class with important features of our data processing and processed data as attributes

In [None]:
def to_supervised (df_for_training, n_future, n_past):
    """
    Creates a tensor from time series slices of n months with all features 
    """
    trainX = []
    trainY = []

    for i in range(n_past, len(df_for_training) - n_future +1):
        trainX.append(df_for_training[i - n_past:i, 0:df_for_training.shape[1]])
        trainY.append(df_for_training[i + n_future - 1:i + n_future, 0])

    trainX, trainY = np.array(trainX), np.array(trainY)
    
    return trainX, trainY
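
To see what the windowing produces, the sketch below re-implements the same slicing logic under a hypothetical name, `make_windows`, so that it runs on its own, and applies it to a toy array of 6 months and 3 features. Column 0 plays the role of total cost, as in the notebook.

```python
import numpy as np

def make_windows(data, n_future, n_past):
    """Slice a (months, features) array into input windows and column-0 targets."""
    X, y = [], []
    for i in range(n_past, len(data) - n_future + 1):
        X.append(data[i - n_past:i, :])                   # the n_past months before month i
        y.append(data[i + n_future - 1:i + n_future, 0])  # column 0, n_future - 1 months after i
    return np.array(X), np.array(y)

# Toy data: 6 months, 3 features; column 0 holds 0, 3, 6, 9, 12, 15
toy = np.arange(18, dtype=float).reshape(6, 3)
X, y = make_windows(toy, n_future=1, n_past=2)

print(X.shape, y.shape)    # (4, 2, 3) (4, 1)
print(y.ravel().tolist())  # [6.0, 9.0, 12.0, 15.0]
```

With `n_past=2` and `n_future=1`, the first two months can only serve as inputs, so 6 months yield 4 slices and the first target is month index 2.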

In [None]:
def CustomSplit(X, y, locations):
    """
    Splits tensors in a custom list of locations
    """
    test_X = X[locations]
    train_X = np.delete(X, locations, axis=0)
    test_y = y[locations]
    train_y = np.delete(y, locations, axis=0)
    
    return train_X, test_X, train_y, test_y
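
The `locations` passed to the split are slice indices, not month indices. The sketch below, on made-up month labels, shows the same indexing and how a slice index maps back to the month being predicted. The general offset `n_past + n_future - 1` is derived from the windowing loop and is stated here as an assumption; for `n_future=1` it reduces to adding `n_past`.

```python
import numpy as np

months = np.array(["2021-01", "2021-02", "2021-03", "2021-04", "2021-05", "2021-06"])
n_past, n_future = 2, 1
n_slices = len(months) - n_past - n_future + 1  # 4 slices for 6 months

# Placeholder tensors with the right shapes: (slices, window depth, features)
X = np.zeros((n_slices, n_past, 3))
y = np.arange(n_slices, dtype=float).reshape(-1, 1)

locations = [1, 3]  # slices held out for testing
test_X, test_y = X[locations], y[locations]
train_X = np.delete(X, locations, axis=0)
train_y = np.delete(y, locations, axis=0)

# Slice k predicts the month at index k + n_past + n_future - 1
target_months = months[[k + n_past + n_future - 1 for k in locations]]

print(train_X.shape, test_X.shape)  # (2, 2, 3) (2, 2, 3)
print(target_months.tolist())       # ['2021-04', '2021-06']
```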

In [None]:
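class CustomTensorData:
    """
    Holds one company's processed data, its fitted scaler, and its train/test tensors
    """
    def __init__(self, df, locations, n_future, n_past, name):
        # Monthly feature table; missing product columns are filled with 0
        self.df = process_dataframe(df, fillna=True)
        self.months = self.df.index
        # Scaler is kept so predictions can be mapped back to dollar amounts later
        self.ss = StandardScaler()
        dfScaled = self.ss.fit_transform(self.df)
        X, y = to_supervised(dfScaled, n_future, n_past)
        self.train_X, self.test_X, self.train_y, self.test_y = CustomSplit(X, y, locations)
        self.name = name
        self.split_locations = locations
        self.n_past = n_past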

___
Now we instantiate an object for each dataset and can observe the shape of each tensor, which is in the format **(slices, window depth, features)**. We held out a few points for each company as a test set to see how the model performs on each of them.
___

In [None]:
PYCO_custom = CustomTensorData(dfPYCO, [1, 3, 6], 1, 2, "PYCO")
CCO_custom = CustomTensorData(dfCCO, [1, 4], 1, 2, "CCO")
XCO_custom = CustomTensorData(dfXCO, [1, 5, 8], 1, 2, "XCO")

___
We combine all 3 companies into a single train_X and a single train_y so that the model can train on the trends of all 3 companies. Additionally, we found improved accuracy when each company's data was fed into the model twice during training.
___

In [None]:
train_X_combined = np.concatenate((PYCO_custom.train_X, XCO_custom.train_X, CCO_custom.train_X, PYCO_custom.train_X, XCO_custom.train_X, CCO_custom.train_X), axis = 0)
train_Y_combined = np.concatenate((PYCO_custom.train_y, XCO_custom.train_y, CCO_custom.train_y, PYCO_custom.train_y, XCO_custom.train_y, CCO_custom.train_y), axis = 0)
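
The concatenation only works because every company's tensor has the same window depth and feature count. The sketch below shows the resulting shapes for two made-up companies, each included twice. It also mimics a 20% tail split to illustrate one consequence of the duplication: Keras documents `validation_split` as taking the last fraction of the samples before shuffling, so with duplicated inputs the validation rows can be exact copies of training rows. The arrays and the manual split here are illustrative assumptions, not the notebook's data.

```python
import numpy as np

# Two hypothetical companies with the same window depth (2) and feature count (5)
a_X = np.arange(4 * 2 * 5, dtype=float).reshape(4, 2, 5)
b_X = 100.0 + np.arange(6 * 2 * 5, dtype=float).reshape(6, 2, 5)
a_y, b_y = np.zeros((4, 1)), np.ones((6, 1))

# Each company appears twice, as in the notebook
X_all = np.concatenate((a_X, b_X, a_X, b_X), axis=0)
y_all = np.concatenate((a_y, b_y, a_y, b_y), axis=0)
print(X_all.shape, y_all.shape)  # (20, 2, 5) (20, 1)

# Mimic a 20% tail split: the last fifth of the rows becomes the validation set
n_val = len(X_all) // 5
train_part, val_part = X_all[:-n_val], X_all[-n_val:]

# Every validation window also occurs in the training part
duplicated = all(any(np.array_equal(v, t) for t in train_part) for v in val_part)
print(duplicated)  # True
```

If this applies, the validation loss would tend to look better than performance on genuinely unseen months, so the held-out test slices remain the more trustworthy check.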

___
Now, we build our LSTM model. Additional details on our layers and processes for ensuring a reasonable train time can be given upon request.
___

In [None]:
def build_model (trainX, trainY, epoch, bs):
    
    
    model = Sequential()
    model.add(LSTM(16, activation='tanh', input_shape=(trainX.shape[1], trainX.shape[2]), return_sequences=True))
    model.add(LSTM(8, activation='tanh', return_sequences=False))
    model.add(Dropout(0.2))
    model.add(Dense(trainY.shape[1]))

    model.compile(optimizer='adam', loss='mse') #custom loss function, l2/l1 regularization
    model.summary()
    
    es = EarlyStopping(monitor='val_loss', min_delta=1e-10, patience=20, verbose=1)
    rlr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=15, verbose=1)
    mcp = ModelCheckpoint(filepath='weights.h5', monitor='val_loss', verbose=1, save_best_only=True, save_weights_only=True)

    tb = TensorBoard('logs')

    history = model.fit(trainX, trainY, shuffle=True, epochs= epoch, callbacks=[es, rlr, mcp, tb], validation_split=0.2, verbose=1, batch_size= bs)
    plt.plot(history.history['loss'], label='Training loss')
    plt.plot(history.history['val_loss'], label='Validation loss')
    plt.legend()
    
    return model

In [None]:
model_HUGE = build_model(train_X_combined, train_Y_combined, 200, 16)

___
Now that we have a model built and trained, we can run it on our test sets and see how the performance looks.
___

In [None]:
def custom_evaluate_predictions(model, company):
    """
    Takes final predictions and compares them graphically and statistically against the true values
    """
    
    forecast = model.predict(company.test_X)
    
    forecast_copies = np.repeat(forecast, company.test_X.shape[2], axis=-1)
    forecast = company.ss.inverse_transform(forecast_copies)[:,0]
    
    y_copies = np.repeat(company.test_y, company.test_X.shape[2], axis=-1)
    y = company.ss.inverse_transform(y_copies)[:,0]
    future_time = company.months[[x + company.n_past for x in company.split_locations]]

    
    print(future_time)
    
    plt.title(company.name + " Test Set Validation")
    plt.xticks(rotation=90)
    plt.plot(company.df.total_cost, '-o', label="Original")
    plt.scatter(future_time, forecast, label="Predictions", color="red")
    plt.legend()
    plt.show();

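    # scikit-learn's convention is (y_true, y_pred); both metrics are symmetric, so the values are unchanged
    print("MSE:", mean_squared_error(y, forecast))
    print("MAE:", mean_absolute_error(y, forecast))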
    
    return
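
The evaluation function tiles the single predicted column to full feature width before calling `inverse_transform`, because the scaler was fitted on all features. The sketch below shows, with standardization done by hand in NumPy (population standard deviation, as `StandardScaler` uses), that this gives the same result as undoing the scaling with the first column's statistics alone. The data is randomly generated, and with a fitted `StandardScaler` the equivalent statistics would be its `mean_[0]` and `scale_[0]` attributes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature table: 12 months, 3 features, column 0 is total cost
data = rng.normal(loc=[1000.0, 5.0, 20.0], scale=[200.0, 1.0, 4.0], size=(12, 3))

mean, std = data.mean(axis=0), data.std(axis=0)  # per-column statistics, ddof=0
scaled = (data - mean) / std

# Stand-in for model output: scaled total cost for 3 months, shape (3, 1)
pred_scaled = scaled[:3, [0]]

# Route 1 (as in the notebook): tile to full width, invert, keep column 0
tiled = np.repeat(pred_scaled, data.shape[1], axis=-1)
route1 = (tiled * std + mean)[:, 0]

# Route 2: invert using only the statistics of column 0
route2 = pred_scaled[:, 0] * std[0] + mean[0]

print(np.allclose(route1, route2))       # True
print(np.allclose(route1, data[:3, 0]))  # True
```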

For PYCO, we generally overestimate. This can possibly be attributed to the general upward trend of the other two companies, which gives the model a slightly positive bias that does not apply to PYCO. Even so, our model is off by only around 10% of the average monthly cost.

In [None]:
custom_evaluate_predictions(model_HUGE, PYCO_custom)

Our model performs very well on the XCO dataset, achieving an average error of less than 5%.

In [None]:
custom_evaluate_predictions(model_HUGE, XCO_custom)

Finally, we also have decent predictions for the CCO dataset, with an error in the 10-20% range.

In [None]:
custom_evaluate_predictions(model_HUGE, CCO_custom)
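
The percentages quoted for each company can be read as the mean absolute error relative to the average monthly cost. That reading is an assumption, since the notebook prints only raw MSE and MAE. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical held-out months: true cost and model prediction
actual = np.array([1000.0, 1200.0, 900.0])
predicted = np.array([1100.0, 1150.0, 1000.0])

# Hypothetical full history of monthly total cost for the same company
all_months = np.array([1000.0, 1100.0, 1200.0, 950.0, 900.0, 1050.0])

mae = np.mean(np.abs(predicted - actual))
relative = mae / all_months.mean()

print(f"MAE: {mae:.2f}, relative to mean monthly cost: {relative:.1%}")
# MAE: 83.33, relative to mean monthly cost: 8.1%
```

Reporting the error this way makes the three companies comparable even though their bills differ in size.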

# Future Forecast