# Objective

Preprocess the raw pricing data for further analysis.

1. The analysis covers only Linux machines, so all other instances (e.g. Windows) can be removed from the pricing list.
2. Drop unnecessary columns, strip the timezone from the timestamps, and convert columns to the correct types.

# Code

## Load libs

In [1]:
import os
import pandas as pd

## Input params

These defaults are needed only when the notebook is run on its own, i.e. without parameter injection from a tool such as papermill.

In [2]:
compression = 'zip'

raw_dir = '../data/raw'
in_fname = 'aws_pricing_sample.csv.zip'

interim_dir = '../data/interim'
out_fname = 'step_1_aws_filtered_sample.csv.zip'

In [3]:
# Papermill parameters injection ... do not delete!

## Load data

- Check the data types: `Timestamp` must be loaded as datetime so that it can be filtered and sorted later
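As a minimal sketch of why `parse_dates` matters, the snippet below reads a two-row in-memory CSV twice, once without and once with date parsing. The rows are copied from the sample output of this notebook; the variable names (`csv_text`, `without_parsing`, `with_parsing`, `recent`) are illustrative only.

```python
import io

import pandas as pd

# Two rows in the same layout as the raw file (unnamed index column first).
csv_text = (
    ",Timestamp,AvailabilityZone,InstanceType,ProductDescription,SpotPrice\n"
    "0,2020-08-31 23:59:59+00:00,us-east-1c,r5dn.12xlarge,Linux/UNIX,0.9618\n"
    "1,2020-08-31 23:59:58+00:00,ap-northeast-2a,m4.2xlarge,Linux/UNIX,0.1199\n"
)

# Without parse_dates the column stays textual and cannot be compared as a date.
without_parsing = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With parse_dates the column becomes a timezone-aware datetime column.
with_parsing = pd.read_csv(io.StringIO(csv_text), index_col=0,
                           parse_dates=['Timestamp'])

print(without_parsing['Timestamp'].dtype)
print(with_parsing['Timestamp'].dtype)

# Datetime values support time-based filtering directly.
cutoff = pd.Timestamp('2020-08-31 23:59:59', tz='UTC')
recent = with_parsing[with_parsing['Timestamp'] >= cutoff]
print(recent.shape)
```

Because the offsets in the file are all `+00:00`, the parsed column is timezone-aware, which is why the comparison value also carries a timezone.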

In [4]:
# check if file exists
filename = f'{raw_dir}/{in_fname}'

if not os.path.exists(filename):
    raise FileNotFoundError(f'File "{filename}" not found!')

# parse `Timestamp` as datetime; the first CSV column holds the row index
data = pd.read_csv(filename,
                   parse_dates=['Timestamp'],
                   compression=compression,
                   index_col=0)

print(data.dtypes)
print(data.shape)
data.head()


Timestamp             datetime64[ns, UTC]
AvailabilityZone                   object
InstanceType                       object
ProductDescription                 object
SpotPrice                         float64
dtype: object
(6033671, 5)


Unnamed: 0,Timestamp,AvailabilityZone,InstanceType,ProductDescription,SpotPrice
0,2020-08-31 23:59:59+00:00,us-east-1c,r5dn.12xlarge,Linux/UNIX,0.9618
1,2020-08-31 23:59:59+00:00,us-east-1c,r5dn.12xlarge,Red Hat Enterprise Linux,1.0918
2,2020-08-31 23:59:59+00:00,us-east-1c,r5dn.12xlarge,SUSE Linux,1.1118
3,2020-08-31 23:59:58+00:00,ap-northeast-2a,m4.2xlarge,Linux/UNIX,0.1199
4,2020-08-31 23:59:58+00:00,ap-northeast-2c,m4.2xlarge,Linux/UNIX,0.1199


## Data Prep

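- keep only rows whose `ProductDescription` is exactly `Linux/UNIX` (this also drops the Red Hat Enterprise Linux and SUSE Linux rows);
- remove the `ProductDescription` column, since after filtering it holds a single unique value;
- sort the dataframe by `Timestamp` in ascending order (for timeline comparison later on);
- remove the timezone from `Timestamp`, since it is not used here (the values are already in UTC, so the wall-clock times are unchanged);
- assign the result to a new variable (`df`) so that `data` stays untouched and the cell can be re-run safely (idempotency).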
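The steps above can be checked on a toy frame before running them on the full dataset. The sketch below is illustrative: the `prepare` helper and the three made-up rows (including the Windows price) are assumptions for the example, not part of the original pipeline.

```python
import pandas as pd

# Toy input: made-up rows in the same layout as the raw data.
raw = pd.DataFrame({
    'Timestamp': pd.to_datetime(
        ['2020-06-01 00:00:50+00:00',
         '2020-06-01 00:00:04+00:00',
         '2020-06-01 00:00:04+00:00'],
        utc=True),
    'AvailabilityZone': ['us-west-2c', 'us-east-1f', 'us-east-1c'],
    'InstanceType': ['r5.2xlarge', 'r5d.large', 'r5d.large'],
    'ProductDescription': ['Linux/UNIX', 'Linux/UNIX', 'Windows'],
    'SpotPrice': [0.156, 0.0356, 0.1276],
})


def prepare(frame):
    """Hypothetical helper: filter, drop, sort, and strip the timezone."""
    out = (frame.query('ProductDescription == "Linux/UNIX"')
                .drop(columns='ProductDescription')
                .sort_values(by='Timestamp', ascending=True)
                .reset_index(drop=True))
    # tz_localize(None) drops the timezone and keeps the UTC wall-clock time.
    out['Timestamp'] = out['Timestamp'].dt.tz_localize(None)
    return out


first = prepare(raw)
second = prepare(raw)

# Re-running gives the same result, and the input frame is left unchanged.
print(first.equals(second))
print(raw.columns.tolist())
print(first)
```

Returning a new frame instead of overwriting the input is what makes the step repeatable: a second run starts from the same raw data and therefore produces the same output.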

In [5]:
df = data.query('ProductDescription == "Linux/UNIX"')\
         .drop('ProductDescription', axis=1)\
         .sort_values(by='Timestamp', ascending=True)\
         .reset_index(drop=True)

df['Timestamp'] = df['Timestamp'].dt.tz_localize(None)
print(df.shape)
df.head()

(1666418, 4)


Unnamed: 0,Timestamp,AvailabilityZone,InstanceType,SpotPrice
0,2020-06-01 00:00:04,us-east-1f,r5d.large,0.0356
1,2020-06-01 00:00:04,us-east-1c,r5d.large,0.0356
2,2020-06-01 00:00:04,us-east-1d,r5d.large,0.0356
3,2020-06-01 00:00:04,us-east-1b,r5d.large,0.0356
4,2020-06-01 00:00:50,us-west-2c,r5.2xlarge,0.156


## Create output file

Save the filtered data to the `interim` folder.
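A possible hardening of the save step, sketched under the assumption that the output folder may not exist yet: create it first, then read the file back to confirm that it round-trips. The `save_interim` helper, the toy frame, and the file name `toy_sample.csv.zip` are hypothetical; the example writes to a temporary directory so that it does not touch the project data.

```python
import os
import tempfile

import pandas as pd


def save_interim(frame, out_dir, out_fname, compression='zip'):
    """Hypothetical helper: create the folder if needed, then write the CSV."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, out_fname)
    frame.to_csv(out_path, compression=compression)
    return out_path


# Toy frame in the same layout as the filtered data.
sample = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2020-06-01 00:00:04',
                                 '2020-06-01 00:00:50']),
    'AvailabilityZone': ['us-east-1f', 'us-west-2c'],
    'InstanceType': ['r5d.large', 'r5.2xlarge'],
    'SpotPrice': [0.0356, 0.156],
})

with tempfile.TemporaryDirectory() as tmp:
    path = save_interim(sample, os.path.join(tmp, 'interim'),
                        'toy_sample.csv.zip')
    # Read the file back the same way the next step would load it.
    restored = pd.read_csv(path, parse_dates=['Timestamp'],
                           compression='zip', index_col=0)

print(restored.shape)
print(restored.dtypes)
```

Since the index is written as the first CSV column, the reader passes `index_col=0`, mirroring how this notebook loads the raw file.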

In [6]:
df.to_csv(f'{interim_dir}/{out_fname}', 
          compression=compression)