# Objective

Preprocess the raw pricing data for further analysis.

1. We analyze only Linux machines, so all other instances (e.g., Windows) can be removed from the pricing list.
2. Filter out unnecessary columns, remove the timezone from the timestamps, and convert columns to the correct types.

# Code

## Load libs

In [None]:
import os
import pandas as pd

## Input params

This section is needed only when the notebook is run standalone, i.e., without parameter injection from a tool such as papermill.

In [None]:
compression = 'zip'

raw_dir = '../data/raw'
in_fname = 'aws_pricing_sample.csv.zip'

interim_dir = '../data/interim'
out_fname = 'step_1_aws_filtered_sample.csv.zip'

In [None]:
# Papermill parameters injection ... do not delete!

In [None]:
# Parameters
raw_dir = "../data/raw"
interim_dir = "../data/interim"
processed_dir = "../data/processed"
report_dir = "../reports"
compression = "zip"
nb_name = "step_1_data_preprocessing.ipynb"
in_fname = "aws_pricing_sample.csv.zip"
out_fname = "step_1_aws_filtered_sample.csv.zip"


## Load data

- Check the data types: `Timestamp` must be loaded as a datetime for later filtering.

In [None]:
# Fail early with a clear message if the input file is missing
filename = f'{raw_dir}/{in_fname}'

if not os.path.exists(filename):
    raise FileNotFoundError(f'File "{filename}" not found!')

# Parse `Timestamp` as datetime; the first CSV column is used as the index
data = pd.read_csv(filename,
                   parse_dates=['Timestamp'],
                   compression=compression,
                   index_col=0)

print(data.dtypes)
print(data.shape)
data.head()
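
The dtype check above can be tried on a tiny in-memory sample before loading the full file. This is a minimal sketch, not the notebook's actual data: the rows are made up, and the `SpotPrice` column is a hypothetical stand-in for whatever price column the real file contains.

```python
import io

import pandas as pd

# Hypothetical sample rows; only `Timestamp` and `ProductDescription`
# are known to exist in the real file, `SpotPrice` is an assumption.
sample_csv = io.StringIO(
    ",Timestamp,ProductDescription,SpotPrice\n"
    "0,2021-01-01 00:00:00+00:00,Linux/UNIX,0.10\n"
    "1,2021-01-01 01:00:00+00:00,Windows,0.25\n"
)

# Same options as the notebook, minus compression (the sample is plain text)
sample = pd.read_csv(sample_csv, parse_dates=["Timestamp"], index_col=0)

# True when the column was parsed as a datetime (tz-aware or naive)
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Timestamp"])

print(sample.dtypes)
print("Timestamp parsed as datetime:", is_datetime)
```

If the check reports `False`, the column was most likely left as plain strings, and the `.dt` accessor used later would fail.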

## Data Prep

- keep only `Linux/UNIX` machines;
- remove the `ProductDescription` column, since after filtering it holds a single unique value;
- sort the dataframe by `Timestamp` in ascending order (for timeline comparison later on);
- remove the timezone from `Timestamp`, as we won't use it here;
- assign the result to a new variable so the cell stays idempotent.

In [None]:
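# Build a new dataframe so that `data` stays untouched and the cell can be re-run
df = (
    data.query('ProductDescription == "Linux/UNIX"')
        .drop(columns='ProductDescription')
        .sort_values(by='Timestamp', ascending=True)
        .reset_index(drop=True)
)

# Drop the timezone information from the timestamps
df['Timestamp'] = df['Timestamp'].dt.tz_localize(None)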
print(df.shape)
df.head()
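
To see what the sorting and timezone removal do in isolation, the following sketch applies them to two made-up, timezone-aware timestamps. The values are hypothetical. Calling `tz_localize(None)` on timezone-aware data drops the timezone and keeps the displayed wall-clock time.

```python
import pandas as pd

# Hypothetical tz-aware timestamps, deliberately out of order
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2021-01-01 02:00:00+00:00", "2021-01-01 01:00:00+00:00"]
    ),
    "ProductDescription": ["Linux/UNIX", "Linux/UNIX"],
})

# Sort into a new dataframe; `toy` itself is not modified
toy_sorted = toy.sort_values(by="Timestamp").reset_index(drop=True)

# Remove the timezone, keeping the wall-clock time
toy_sorted["Timestamp"] = toy_sorted["Timestamp"].dt.tz_localize(None)

tz_removed = toy_sorted["Timestamp"].dt.tz is None
is_ascending = toy_sorted["Timestamp"].is_monotonic_increasing
first_ts = toy_sorted["Timestamp"].iloc[0]
source_still_tz_aware = toy["Timestamp"].dt.tz is not None

print(toy_sorted)
```

Because the result lives in a separate variable, the source dataframe keeps its timezone-aware column, which is what lets the cell be re-run safely.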

## Create output file

Save the filtered file to the `interim` folder.

In [None]:
df.to_csv(f'{interim_dir}/{out_fname}', 
          compression=compression)
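
A quick way to gain confidence in the saved file is to read it back and compare it with what was written. The sketch below does this with a made-up dataframe in a temporary directory; the file name `toy_filtered.csv.zip` and the `SpotPrice` column are hypothetical, not part of the notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical filtered data with timezone-naive timestamps
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2021-01-01 01:00:00", "2021-01-01 02:00:00"]),
    "SpotPrice": [0.10, 0.12],  # assumed column name
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_filtered.csv.zip")

    # Write with the same options as the notebook
    toy.to_csv(out_path, compression="zip")
    file_written = os.path.exists(out_path)

    # Read back the way a later step would load the interim file
    restored = pd.read_csv(out_path,
                           parse_dates=["Timestamp"],
                           compression="zip",
                           index_col=0)

same_shape = restored.shape == toy.shape
same_timestamps = list(restored["Timestamp"]) == list(toy["Timestamp"])

print("round trip ok:", file_written and same_shape and same_timestamps)
```

If the `interim` folder might not exist yet, calling `os.makedirs(interim_dir, exist_ok=True)` before writing avoids a missing-directory error.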