# Preprocessing

In this notebook, we preprocess the data for the machine learning models.

***

# Import libraries + Scripts

In [1]:
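# Setting PYTHONHASHSEED
# Note: hash randomization is fixed when the interpreter starts, so this
# assignment only affects child processes, not the current kernel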
import os
print('Make sure the following says None: ', os.environ.get('PYTHONHASHSEED'))
os.environ['PYTHONHASHSEED'] = '0'
print('Make sure the following says 0: ', os.environ.get('PYTHONHASHSEED'))

Make sure the following says None:  None
Make sure the following says 0:  0


In [2]:
# Importing libraries
import sys
import os
from pathlib import Path

import pandas as pd
import numpy as np
import math

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import missingno as msno

# Setting seed
np.random.seed(42)

# To import the script
sys.path.append('./py_scripts')
import A_eda as A

# Importing functions from A
from A_eda import plt_show

In [3]:
# Importing data + variables
adclicks = A.adclicks
test_set = A.test_set
categoricals = A.categoricals

***

# Data Cleaning

We will be performing necessary data transformations and handling missing data in this section.

In [None]:
# Checking all the missing values
missing_values = adclicks[adclicks.isnull().any(axis=1)]
missing_values

Almost all of our rows (7361 out of 8000) are missing one or more features. One possible imputation method is MICE (Multivariate Imputation by Chained Equations), which is best suited to data that is MAR (Missing At Random), so this assumption should be checked before imputing.
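
One rough way to probe the MAR assumption is to test whether the missingness of a column is associated with the observed values of other columns. The sketch below is only an illustration on toy data, not the `adclicks` set: the helper `missingness_association` and the columns `age` and `income` are hypothetical names. A strong association argues against MCAR and is consistent with MAR, but no check on observed data can rule out MNAR.

```python
import numpy as np
import pandas as pd


def missingness_association(df):
    """Correlate each column's missingness indicator with the other numeric columns."""
    numeric = df.select_dtypes(include="number")
    rows = {}
    for col in df.columns[df.isnull().any()]:
        indicator = df[col].isnull().astype(float)
        # Pearson correlation between "col is missing" and every other numeric column
        others = numeric.drop(columns=col, errors="ignore")
        rows[f"{col}_missing"] = others.corrwith(indicator)
    return pd.DataFrame(rows).T


# Toy data (hypothetical): `income` is missing whenever `age` is above 55
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "age": rng.uniform(18, 70, 500),
    "income": rng.normal(50, 10, 500),
})
toy.loc[toy["age"] > 55, "income"] = np.nan

assoc = missingness_association(toy)
print(assoc)
```

On the real data, the same idea would be applied as `missingness_association(adclicks)`, assuming `adclicks` is a pandas DataFrame as in the cells above.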

***

## Imputing with MICE

Next, we impute the missing values with MICE, using scikit-learn's `IterativeImputer`.

In [None]:
categoricals

In [None]:
# Function to encode each categorical column (training and testing)
from sklearn import preprocessing

def encode_training(df, encode_columns):
    encoders = {col: preprocessing.LabelEncoder() for col in encode_columns}
    labels = {}

    # Encode each categorical variable and add it to df as a new `<col>_en` column
    for col, encoder in encoders.items():
        encode_col = f'{col}_en'
        encoded_col = encoder.fit_transform(df.loc[:, col])
        
        # Print label mapping
        labels[col] = (dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
        print(labels[col])
        df[encode_col] = encoded_col
    
    return encoders, labels

def encode_test(df, encoders):
    # Reuse the encoders fitted on the training set so the codes match
    for col, encoder in encoders.items():
        encode_col = f'{col}_en'
        df[encode_col] = encoder.transform(df.loc[:, col])

    return df


In [None]:
# Encode training set (the test set can reuse `encoders` through encode_test)
encoders, labels = encode_training(adclicks, categoricals)
adclicks.head()

In [None]:
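# Map the code that LabelEncoder assigned to NaN back to NaN
for col in encoders:
    encode_col = f'{col}_en'
    # Look the NaN label up with pd.isna: indexing the dict with np.nan relies on
    # object identity and raises KeyError for columns without missing values
    nan_code = next((code for label, code in labels[col].items() if pd.isna(label)), None)
    if nan_code is not None:
        adclicks[encode_col] = adclicks[encode_col].where(adclicks[encode_col] != nan_code)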
    
adclicks.head()
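
As an alternative to encoding NaN and then mapping its code back, the categories can be encoded with pandas directly, which leaves missing values out of the code table. This is a swapped-in technique rather than the `LabelEncoder` approach used above; the helper `encode_keep_nan` and the toy `device` values are hypothetical.

```python
import numpy as np
import pandas as pd


def encode_keep_nan(series, categories=None):
    """Integer-encode a column; missing values stay NaN instead of getting a code."""
    cat = pd.Categorical(series, categories=categories)
    codes = pd.Series(cat.codes, index=series.index, dtype="float64")
    # pandas uses -1 for missing values and for values outside `categories`
    return codes.replace(-1, np.nan), list(cat.categories)


# Toy columns (hypothetical values)
train = pd.Series(["mobile", np.nan, "desktop", "mobile"])
test = pd.Series(["desktop", "tablet", np.nan])

# Fit the categories on the training column, then reuse them for the test column
train_codes, cats = encode_keep_nan(train)
test_codes, _ = encode_keep_nan(test, categories=cats)

print(cats)
print(train_codes.tolist())
print(test_codes.tolist())
```

Note that a label unseen during training (`tablet` here) also becomes NaN under this scheme, so it would be imputed rather than raising an error as `LabelEncoder.transform` does.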


In [None]:
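# Imputing with IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# IterativeImputer only accepts numeric input, so drop the raw categorical
# columns and keep their encoded `_en` versions (assumes the rest is numeric)
numeric_df = adclicks.drop(columns=categoricals)

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed_df = pd.DataFrame(imputer.fit_transform(numeric_df),
                          columns=numeric_df.columns, index=numeric_df.index)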
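
Because `IterativeImputer` returns continuous values, imputed entries in an encoded categorical column are generally not valid integer codes. A minimal sketch of one way to map them back, assuming rounding to the nearest code is acceptable: the toy frame, the `device_en` column and the `categories` list are hypothetical, and treating label codes as ordered numbers is itself a simplification.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy frame (hypothetical): one numeric feature and one encoded categorical (codes 0..2)
toy = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 51.0, 64.0, 29.0],
    "device_en": [0.0, 1.0, 1.0, np.nan, 2.0, 0.0],
})
categories = ["desktop", "mobile", "tablet"]

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# Imputed codes are continuous, so round them and clip to the valid code range
codes = imputed["device_en"].round().clip(0, len(categories) - 1).astype(int)
imputed["device"] = [categories[c] for c in codes]

print(imputed)
```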
