# Objective

Generate features for machine learning algorithms. The goal is to use unsupervised machine learning algorithms to cluster similar instances together, e.g., highly volatile instances and/or the cheapest, least volatile instances.

This will help us migrate between spot instances: when the spot instance we are running on is about to expire, the clusters should tell us which instance to move to next.

# Code

## Load libs

In [None]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('..')

import random
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from src.data.helpers import load_aws_dataset
from src.data.helpers import remove_consecutive_repeated_price_entries
from src.data.helpers import calc_pdf_price_update_interval_seconds
from src.data.helpers import generate_price_update_interval

## Input params

In [None]:
compression = 'zip'

interim_dir = '../data/interim'
in_fname = 'step_1_aws_filtered_sample.csv.zip'

processed_dir = '../data/processed'
processed_out_fname_pricing = 'step_5_features_pricing_var.csv.zip'
processed_out_fname_updist = 'step_5_features_updist.csv.zip'

In [None]:
# Papermill parameters injection ... do not delete!

## Load data

In [None]:
file = f'{interim_dir}/{in_fname}'
data = load_aws_dataset(file)
print(data.shape)
data.head()

### Pivot table to wide format

The wide format gives each instance type its own price column.

In [None]:
%%time

df = data.query('AvailabilityZone == "us-east-1a"')\
         .drop('AvailabilityZone', axis=1)

print(df.shape)

# Pivot the data to a wide format (one column per instance type) so that
# instances without any price update can be removed.
# Drop the 'SpotPrice' level of the column MultiIndex, as it is not needed.
pvt = df.pivot_table(index=['Timestamp'], 
                     columns=['InstanceType'])\
        .droplevel(0, axis=1)

pvt.head()
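
To make the reshaping concrete, here is a minimal sketch on made-up data. The toy values and the `toy`/`toy_pvt` names are assumptions for illustration only; the column names mirror the ones used in the cell above. Because no `values` argument is passed, `pivot_table` aggregates the remaining numeric column (`SpotPrice`) with its default `mean`, and timestamps at which an instance type has no record become `NaN`.

```python
import pandas as pd

# Toy long-format data (hypothetical values), one row per price record.
toy = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2021-01-01 00:00', '2021-01-01 00:00',
                                 '2021-01-01 01:00', '2021-01-01 02:00']),
    'InstanceType': ['m5.large', 'c5.large', 'm5.large', 'c5.large'],
    'SpotPrice': [0.040, 0.030, 0.042, 0.031],
})

# Same reshaping as above: one row per timestamp, one column per instance type.
# The columns come back as a MultiIndex ('SpotPrice', <InstanceType>), so the
# outer level is dropped to keep only the instance type names.
toy_pvt = toy.pivot_table(index=['Timestamp'],
                          columns=['InstanceType'])\
             .droplevel(0, axis=1)

print(toy_pvt)
```

Each instance type only has a value at the timestamps where it was actually recorded, which is why the later cells call `dropna()` per column before computing anything.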

## Feature Engineering

As features, we will extract the following:
1. Price variation function: the probability density function (pdf) of the price variation;
2. Volatility curve: the pdf of the number of price changes for a given instance;
3. Price update interval curve: the pdf of the interval between price updates.

We will use these three pdfs to cluster our instances together.
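
As a reminder of what an empirical pdf looks like in practice, one common approximation is a normalized histogram. The sketch below uses made-up interval values and NumPy's `np.histogram` with `density=True`; it is an illustration of the general idea under those assumptions, not the method used by the helper functions in this notebook.

```python
import numpy as np

# Hypothetical sample: seconds between consecutive price updates.
samples = np.array([300.0, 320.0, 310.0, 900.0, 305.0, 1200.0, 315.0, 600.0])

# density=True rescales the bin counts so that the histogram integrates to 1,
# which makes it an estimate of the pdf rather than a raw count.
density, edges = np.histogram(samples, bins=4, density=True)

# Check: the sum of (bin height * bin width) is 1 for a proper density.
area = float(np.sum(density * np.diff(edges)))
print(density, edges, area)
```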

### Price variation (PDF)

In [None]:
var_list = [{'Instance': cname, 'price_var': cdf.dropna().var()} for cname, cdf in pvt.items()]
var_df = pd.DataFrame(var_list).set_index('Instance').T
var_df
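
The cell above can be checked on a toy wide table. The data and the `toy_*` names below are made up for illustration. Note that pandas' `Series.var()` computes the sample variance (`ddof=1`) and already skips `NaN` by default, so the explicit `dropna()` is harmless but not strictly required.

```python
import numpy as np
import pandas as pd

# Toy wide-format prices (hypothetical): one column per instance type.
toy_pvt = pd.DataFrame({'a': [1.0, 2.0, np.nan, 3.0],
                        'b': [0.5, 0.5, 0.5, np.nan]})

# Same construction as above: one variance per column, laid out as a single row.
toy_var_list = [{'Instance': cname, 'price_var': cdf.dropna().var()}
                for cname, cdf in toy_pvt.items()]
toy_var_df = pd.DataFrame(toy_var_list).set_index('Instance').T

print(toy_var_df)
```

The result has a single `price_var` row with one column per instance type, which is the layout the merge step later in the notebook relies on.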

### Volatility (PDF)



In [None]:
# Iterate over each instance type, drop NaN and compute summary statistics of its prices.
volatility_list = [cdf.dropna().describe(include='all').to_frame() for _, cdf in pvt.items()]
volatility_pdf = pd.concat(volatility_list, axis=1).round(3)
volatility_pdf

### Merge Price and Volatility PDFs

In [None]:
res_df = pd.concat([volatility_pdf, var_df], axis=0).round(3)
res_df
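
To show the shape of the merged table, here is a small sketch on made-up data (the `toy_*` names are assumptions for illustration). For a numeric column, `describe()` returns `count`, `mean`, `std`, `min`, the three quartiles and `max`; stacking the variance row underneath with `axis=0` gives nine feature rows per instance type.

```python
import numpy as np
import pandas as pd

# Toy wide-format prices (hypothetical): one column per instance type.
toy_pvt = pd.DataFrame({'a': [1.0, 2.0, np.nan, 3.0],
                        'b': [0.5, 0.5, 0.5, np.nan]})

# Summary statistics per column, concatenated side by side.
toy_stats = pd.concat([cdf.dropna().describe().to_frame()
                       for _, cdf in toy_pvt.items()], axis=1)

# Variance as a single row with the same columns.
toy_var = toy_pvt.var().to_frame('price_var').T

# Stack the statistics and the variance row, as in the cell above.
toy_res = pd.concat([toy_stats, toy_var], axis=0).round(3)
print(toy_res)
```

Since `price_var` is the square of `std`, the two rows carry overlapping information; whether to keep both is a modelling choice for the clustering step.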

### Export to csv

In [None]:
res_df.to_csv(f'{processed_dir}/{processed_out_fname_pricing}', compression=compression)

### Price Update Interval (PDF)

In [None]:
res_df = generate_price_update_interval(pvt)
res_df
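
The implementation of `generate_price_update_interval` lives in `src.data.helpers` and is not shown here. As a rough sketch of the underlying idea only, the hypothetical function below (`update_intervals_seconds`, not part of the project) measures the seconds elapsed between consecutive price changes of a single instance type. It assumes a price series indexed by timestamp, as in a column of `pvt`.

```python
import pandas as pd

def update_intervals_seconds(prices: pd.Series) -> pd.Series:
    """Seconds between consecutive price changes (hypothetical helper)."""
    s = prices.dropna()
    # Keep the first record and every record whose price differs from the
    # previous one, i.e. drop consecutive repeated prices.
    changed = s[s != s.shift()]
    # Differences between the remaining timestamps, converted to seconds.
    return changed.index.to_series().diff().dropna().dt.total_seconds()

# Toy price history (made-up values) indexed by timestamp.
idx = pd.to_datetime(['2021-01-01 00:00', '2021-01-01 01:00',
                      '2021-01-01 02:00', '2021-01-01 04:00'])
toy_prices = pd.Series([0.040, 0.040, 0.042, 0.041], index=idx)

toy_intervals = update_intervals_seconds(toy_prices)
print(toy_intervals.tolist())
```

A distribution over these intervals, computed per instance type, is the kind of feature this section exports; the actual helper may bin or normalize the values differently.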

### Export to csv

In [None]:
res_df.to_csv(f'{processed_dir}/{processed_out_fname_updist}', compression=compression)