# Recap: Intermediate Machine Learning

The aim of this notebook is to practice running through the workflows covered in the Intermediate Machine Learning course.

## Scope
1. Pre-processing (missing and categorical data)
    - Drop missing and/or categorical data
    - Impute missing data (optional: label imputed columns)
    - Label encoding
    - One-hot encoding
2. Pipelines
3. Cross-validation
4. XGBoost

## Components and their methods
```
    ├── Missing values
    │   ├── (1) Remove values
    │   │   ├── isnull().any()
    │   │   ├── isnull().sum()
    │   │   ├── dropna(axis)
    │   │   └── drop(colsList, axis)
    │   ├── (2) Impute values
    │   │   └── SimpleImputer()
    │   │       ├── fit_transform(X)
    │   │       └── transform(X)
    │   └── (3) Impute values & track imputed values
    │       ├── SimpleImputer()
    │       └── astype() --- convert (0, 1) back to bool
    ├── Categorical variables
    │   ├── (1) Remove values
    │   │   ├── dtypes
    │   │   └── select_dtypes(exclude/include)
    │   ├── (2) Label encoding
    │   │   ├── LabelEncoder()
    │   │   └── set() --- generates unique values
    │   └── (3) One-hot encoding
    │       ├── OneHotEncoder()
    │       ├── pd.concat([pdList], axis=1)
    │       └── nunique() --- unique().count(), cardinality
    ├── Cross-validation
    │   └── cross_val_score(pipeline, X, y, cv, scoring)
    │       └── scoring = 'neg_mean_absolute_error'
    └── XGBoost
```

## Modules and their methods
```
├── sklearn
│   ├── impute
│   │   └── SimpleImputer(strategy, *fill_value)
│   │       └── strategy = 'mean', 'median', 'most_frequent', 'constant'
│   ├── preprocessing
│   │   ├── LabelEncoder()
│   │   └── OneHotEncoder(handle_unknown='ignore', sparse=False)
│   ├── pipeline
│   │   └── Pipeline(steps) --- 'vertical' integration
│   │       ├── steps = [('name', object), ...]
│   │       ├── .fit(X, y)
│   │       └── .predict(X)
│   ├── compose
│   │   └── ColumnTransformer(transformers) --- 'horizontal' integration
│   │       └── transformers = [('name', object, colList), ...]
│   └── model_selection
│       └── cross_val_score(pipeline, X, y, cv, scoring)
│           └── scoring = 'neg_mean_absolute_error'
└── xgboost
    └── XGBRegressor(n_estimators, learning_rate)
        └── .fit(X_train, y_train, args)
            ├── early_stopping_rounds: number of consecutive rounds without improvement before halt
            └── eval_set = [(X_valid, y_valid)]
```

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from xgboost import XGBRegressor

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Step 1: Read and explore data

In [None]:
train_filepath = '../input/home-data-for-ml-course/train.csv'
test_filepath = '../input/home-data-for-ml-course/test.csv'

X_full = pd.read_csv(train_filepath, index_col='Id', parse_dates=True)
X_test_full = pd.read_csv(test_filepath, index_col='Id', parse_dates=True)

# Test data has no 'target' column
f_string = f'''The difference between train_data and test_data is {set(X_full.columns) - 
set(X_test_full.columns)}.\n'''
print(f_string)

# Remove rows from train_data with empty target, and check
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
# print(X_full.SalePrice.isnull().any())

# Separate target (y) from predictors (X)
y = X_full.SalePrice
X_full.drop('SalePrice', axis=1, inplace=True)

# At this point, we want to identify our numerical and categorical columns, for later reference.
print(f"Unique datatypes: {set([X_full[col].dtype for col in X_full.columns])}\n")

# Identify numerical cols
numerical_cols = [col for col in X_full.columns
                 if X_full[col].dtypes in ['int64', 'float64']]
print('Number of numerical columns:', len(numerical_cols))

# Identify categorical cols (upper-bound on cardinality)
cardinality = 10
categorical_cols = [col for col in X_full.columns
                   if X_full[col].dtypes == 'object'
                   and X_full[col].nunique() <= cardinality]

removed_cols = list(set(X_full.columns) - set(categorical_cols) - set(numerical_cols))

print('Number of categorical columns:', len(categorical_cols), \
      f"(Removed {removed_cols})\n")

# Remove columns with high cardinality from X_full and X_test_full
X = pd.concat([X_full[numerical_cols], X_full[categorical_cols]], axis=1)
X_test = X_test_full.drop(removed_cols, axis=1)
print('Total columns:', len(X.columns))

For interest, here we display the cardinality of the categorical columns. This could be used to define a function to limit the number of additional columns that one-hot encoding would add to the dataset.

In [None]:
print('Column Name\t  No. unique values\n-----------------------------------')
print(X_full.select_dtypes(include=['object']).nunique().sort_values(ascending=True))
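A minimal sketch of such a function, assuming a hypothetical helper named `low_cardinality_cols` and a made-up toy DataFrame (neither is part of the course material). It keeps only the object columns whose number of unique values is at or below a threshold, and counts how many extra columns one-hot encoding would then add.

```python
import pandas as pd

def low_cardinality_cols(df, max_unique=10):
    """Return object columns with at most `max_unique` distinct values (hypothetical helper)."""
    counts = df.select_dtypes(include=['object']).nunique()
    return list(counts[counts <= max_unique].index)

# Toy data, for illustration only (explicit object dtype, like the course data)
toy = pd.DataFrame({
    'Street': pd.Series(['Pave', 'Grvl', 'Pave', 'Pave'], dtype=object),
    'Neighborhood': pd.Series(['A', 'B', 'C', 'D'], dtype=object),
    'LotArea': [8450, 9600, 11250, 9550],
})

kept = low_cardinality_cols(toy, max_unique=2)

# One-hot encoding replaces each kept column with one column per unique value
added = int(toy[kept].nunique().sum()) - len(kept)
print(kept, added)
```

Raising or lowering `max_unique` trades information from the categorical columns against the width of the encoded dataset.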

## Step 2: Deal with missing values

At this point, we have the following issues with our data:
* Missing values (numerical, categorical)
* Categorical values (categorical only)

Let's start with missing values.

There are 3 options:
* Drop the data
* Impute missing values
* Impute missing values and track which values were imputed

### Dropping columns with missing values
Dropping rows with missing values might not be a good idea: values may be missing systematically rather than at random, so removing those rows could bias the data.

Additionally, some columns have many missing values, so dropping rows might leave us with little data. We therefore drop columns instead.


In [None]:
# Let's list the number of missing values to see if it's worth it to drop the whole column!
print(f"We're losing {len(X.isnull().sum()[X.isnull().any()].index)} columns.\n")
print('Column name\tNumber of missing values\n----------------------------------------')
print(f'No. of entries:\t{len(X_full.index)}\n----------------------------------------')
print(X.isnull().sum()[X.isnull().any()].sort_values(ascending=False))

In [None]:
# Drop columns with missing values
X_drop = X.dropna(axis=1)
print(f"{len(set(X.columns) - set(X_drop.columns))} columns dropped.")

### Impute columns with missing values
This can be done using either `sklearn.impute.SimpleImputer(strategy, *fill_value)`:
```
strategy = 'mean', 'median', 'most_frequent', 'constant'
```
or `pd.DataFrame.fillna(value, method, axis)`:
```
value = str, int64, float64; dict
method = 'pad'/'ffill', 'bfill'/'backfill'
```
#### Note:
Imputation removes column names; we need to add them back.
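As a small illustration of the two routes, the sketch below imputes a toy column (made-up values) with the column mean using both `SimpleImputer` and `fillna`. It assumes scikit-learn's default output configuration, in which `SimpleImputer` returns a NumPy array that has to be wrapped back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, for illustration only
toy = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0]}, index=[101, 102, 103])

# Route 1: SimpleImputer returns a plain NumPy array (no column names, no index)
imputed_array = SimpleImputer(strategy='mean').fit_transform(toy)
via_imputer = pd.DataFrame(imputed_array, columns=toy.columns, index=toy.index)

# Route 2: fillna keeps the DataFrame structure
via_fillna = toy.fillna(toy.mean())

print(via_imputer.equals(via_fillna))
```

The advantage of `SimpleImputer` is that the fitted object remembers the training means, so `transform` can later apply the same values to validation or test data.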

In [None]:
# Create instance of SimpleImputer object
imputer = SimpleImputer(strategy='mean')

# Impute missing values of categorical columns with 'Unknown'
X_impute_cat = X.select_dtypes(include=['object']).fillna('Unknown')
X_impute_cat.head()

# Impute missing values of numerical columns with mean value
X_impute_num = pd.DataFrame(imputer.fit_transform(X[numerical_cols]))

# Add back column names and index for numerical columns
X_impute_num.columns = numerical_cols
X_impute_num.set_index(X.index, inplace=True)
X_impute_num.head()

# Rejoin X_impute
X_impute = X_impute_num.join(X_impute_cat)

# Check that order of columns is preserved after imputation
print(f"Columns in the right order: {(X_impute.columns == X.columns).all()}")

# Inspect X_impute
X_impute[numerical_cols].describe()
X_impute[categorical_cols].describe()
X_impute.head()

### Impute and indicate columns with missing values 

In [None]:
# Copy X to avoid adjusting original
X_impute_plus = X.copy()

# Create list containing the columns with missing values
cols_with_missing = list(X.columns[X.isnull().any()])

# Add columns to indicate imputation
for col in cols_with_missing:
    X_impute_plus[col + '_was_imputed'] = X[col].isnull()

print(len(set(X_impute_plus.columns) - set(X.columns)), 'columns were added:\n')
print(list(set(X_impute_plus.columns) - set(X.columns)))

# Impute categorical columns with 'Unknown'
X_impute_plus_cat = X.select_dtypes(include=['object']).fillna('Unknown')
X_impute_plus_cat.head()

# Create instance of SimpleImputer
imputer_plus = SimpleImputer(strategy='mean')

# Impute numerical columns with mean value
X_impute_plus_num = pd.DataFrame(imputer_plus.fit_transform(X[numerical_cols]))

# Add back index and column names for numerical columns
X_impute_plus_num.set_index(X.index, inplace=True)
X_impute_plus_num.columns = numerical_cols
X_impute_plus_num.head()

# Rejoin X_impute_plus
X_impute_plus = pd.concat([X_impute_plus_num, X_impute_plus_cat, \
                           X_impute_plus[list(set(X_impute_plus.columns) - set(X.columns))]], axis=1)

# New datatypes in X_impute_plus
print('\nDatatypes in X_impute_plus:', list(X_impute_plus.apply(lambda df: df.dtypes).unique()))

# Inspect X_impute_plus
X_impute_plus.select_dtypes(include=['object']).describe()
X_impute_plus.select_dtypes(include=['int64', 'float64']).describe()
X_impute_plus.head()
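As an aside, `SimpleImputer` also accepts `add_indicator=True`, which appends one indicator column for each feature that had missing values during `fit`. The sketch below uses made-up data and is an alternative to the manual `_was_imputed` columns, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, for illustration only
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0],
    'LotArea': [8450.0, 9600.0, 11250.0],   # no missing values
})

imputer = SimpleImputer(strategy='mean', add_indicator=True)
imputed = imputer.fit_transform(toy)

# Two imputed features plus one indicator column for 'LotFrontage'
print(imputed.shape)
print(imputed[:, 2])
```

Only `LotFrontage` gets an indicator, because `LotArea` had nothing missing when the imputer was fitted.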

## Step 3: Deal with categorical values
Again, we have 3 options:
* Drop categorical values
* Label encoding
* One-hot encoding

### Drop columns with categorical values


In [None]:
# Select cols_was_imputed for categorical columns
categorical_cols_was_imputed = [col + '_was_imputed' for col in categorical_cols]
categorical_cols_was_imputed = sorted(set(categorical_cols_was_imputed) & set(X_impute_plus.columns))

# Count all _was_imputed columns instead of hard-coding the total
n_was_imputed = sum(col.endswith('_was_imputed') for col in X_impute_plus.columns)
print(f"Out of {n_was_imputed} _was_imputed cols, {len(categorical_cols_was_imputed)} were categorical.")

# Drop categorical columns from X_drop
X_drop_drop = X_drop.select_dtypes(exclude=['object'])
X_drop_drop.head()

# Drop categorical columns from X_impute
X_impute_drop = X_impute.select_dtypes(exclude=['object'])
X_impute_drop.head()

# Drop categorical columns from X_impute_plus
X_impute_plus_drop = X_impute_plus.drop(categorical_cols_was_imputed, axis=1)
X_impute_plus_drop = X_impute_plus_drop.select_dtypes(exclude=['object'])
X_impute_plus_drop.head()

### Before encoding: choose good columns
We only want to choose columns where each value in the test data is represented in the training data. For this, we compare the `categorical_cols` in `X` with `X_test`.

In [None]:
'''
Results:
There are NaN values in X_test. Do we need to replicate the whole workflow for the
X_test data as well?

Actually, this is fine, because the difference is at most {nan}.

Once we choose the best pre-processing workflow, we can apply the missing data workflow
to the X_test data, which removes this issue.

If we did find values in the X_test categorical columns that are absent from the
corresponding X columns, then we would need to remove the column entirely, or choose
a new X_test.
'''

# Investigate: NaN has not yet been removed from the categorical cols in X_test
for col in categorical_cols:
    if not (set(X_test[col].unique()) <= set(X[col].unique())):
        print(col)
        print('Difference:\t', set(X_test[col].unique()) - set(X[col].unique()))
        print('Set of X_test\t', set(X_test[col].unique()))
        print('Set of X\t', set(X[col].unique()), '\n')
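The same check can be turned into a split of the categorical columns into ones that are safe to encode and ones that are not. This is a minimal sketch on made-up data; the names `good_cols` and `bad_cols` are illustrative and not used elsewhere in this notebook.

```python
import pandas as pd

# Toy data, for illustration only
train = pd.DataFrame({'Condition': ['Norm', 'Feedr', 'Norm'],
                      'RoofMatl': ['CompShg', 'CompShg', 'Metal']})
test = pd.DataFrame({'Condition': ['Norm', 'Feedr', 'Feedr'],
                     'RoofMatl': ['CompShg', 'Roll', 'Metal']})   # 'Roll' is unseen in train

cat_cols = ['Condition', 'RoofMatl']

# Columns whose test values are a subset of the training values
good_cols = [col for col in cat_cols if set(test[col]) <= set(train[col])]
bad_cols = sorted(set(cat_cols) - set(good_cols))
print(good_cols, bad_cols)
```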

### Label encode columns with categorical values

A `LabelEncoder()` object only encodes one column at a time, so we use a `for` loop.

> Note: Each call to `fit_transform` overwrites what the encoder learned from the previous column. This is not a problem: once we have chosen a training dataset, we re-fit the label encoder on each of its columns and transform the same column of the test data within the same iteration. Something like:

```
for col in categorical_cols:
    X_MV_removed_LE[col] = label_encoder.fit_transform(X_MV_removed[col])
    X_test_MV_removed_LE[col] = label_encoder.transform(X_test_MV_removed[col])
```
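If the fitted encoders are needed again later (for example, to decode predictions or to transform another dataset), one option is to keep one encoder per column in a dictionary. This is a sketch on made-up data; the `encoders` dictionary is an illustrative assumption, not part of the course workflow.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, for illustration only
train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                      'Alley': ['None', 'Grvl', 'Pave']})
test = pd.DataFrame({'Street': ['Grvl', 'Pave'],
                     'Alley': ['Pave', 'None']})

encoders = {}
train_LE, test_LE = train.copy(), test.copy()
for col in train.columns:
    encoders[col] = LabelEncoder()
    # Learn the labels from the training column, then reuse them on the test column
    train_LE[col] = encoders[col].fit_transform(train[col])
    test_LE[col] = encoders[col].transform(test[col])

print(test_LE)
```

`LabelEncoder` assigns integers in sorted order of the labels, so the numbers carry no meaning beyond identifying the category.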

In [None]:
# Create instance of LabelEncoder object
label_encoder = LabelEncoder()

# Label encode X_impute_plus
X_impute_plus_LE = X_impute_plus.copy()

for col in X_impute_plus_LE:
    if col in categorical_cols:
        X_impute_plus_LE[col] = label_encoder.fit_transform(X_impute_plus[col])

X_impute_plus_LE.head()
X_impute_plus_LE.apply(lambda df: df.dtypes).unique()

# Label encode X_impute
X_impute_LE = X_impute.copy()

for col in X_impute_LE:
    if col in categorical_cols:
        X_impute_LE[col] = label_encoder.fit_transform(X_impute[col])

X_impute_LE.head()
X_impute_LE.apply(lambda df: df.dtypes).unique()

# Label encode X_drop
X_drop_LE = X_drop.copy()

for col in X_drop_LE:
    if col in categorical_cols:
        X_drop_LE[col] = label_encoder.fit_transform(X_drop[col])

X_drop_LE.head()
X_drop_LE.apply(lambda df: df.dtypes).unique()

### One-hot encoding

One-hot encoding returns an array without the original index, so we add the index back.

In [None]:
# Create instance of OneHotEncoder object
# Note: scikit-learn 1.2 and later name this argument `sparse_output` instead of `sparse`
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

# One-hot encode X_impute_plus
X_impute_plus_OH = X_impute_plus.copy()
new_col_number = sum(X_impute_plus_OH[categorical_cols].apply(lambda df: df.nunique()))
print(f"We currently have {len(X_impute_plus_OH.columns)} columns.")
print(f"We will end up with {len(X_impute_plus_OH.columns) - len(categorical_cols) + new_col_number} columns.")

X_impute_plus_one_hot = pd.DataFrame(OH_encoder.fit_transform(X_impute_plus_OH[categorical_cols]))
X_impute_plus_one_hot.index = X_impute_plus_OH.index # Add back the index
X_impute_plus_OH.drop(categorical_cols, axis=1, inplace=True) # Remove categorical columns
X_impute_plus_OH = pd.concat([X_impute_plus_OH, X_impute_plus_one_hot], axis=1) # Add one-hot columns
print(f"We ended up with {len(X_impute_plus_OH.columns)} columns.\n")

# One-hot encode X_impute
X_impute_OH = X_impute.copy()
new_col_number = sum(X_impute_OH[categorical_cols].apply(lambda df: df.nunique()))
print(f"We currently have {len(X_impute_OH.columns)} columns.")
print(f"We will end up with {len(X_impute_OH.columns) - len(categorical_cols) + new_col_number} columns.")

X_impute_one_hot = pd.DataFrame(OH_encoder.fit_transform(X_impute_OH[categorical_cols]))
X_impute_one_hot.index = X_impute_OH.index # Add back the index
X_impute_OH.drop(categorical_cols, axis=1, inplace=True) # Remove categorical columns
X_impute_OH = pd.concat([X_impute_OH, X_impute_one_hot], axis=1) # Add one-hot columns
print(f"We ended up with {len(X_impute_OH.columns)} columns.\n")

# One-hot encode X_drop
X_drop_OH = X_drop.copy()
new_col_number = sum(X_drop_OH.select_dtypes(include=['object']).apply(lambda df: df.nunique()))
print(f"We currently have {len(X_drop_OH.columns)} columns.")
print(f"We will end up with {len(X_drop_OH.columns) - len(X_drop_OH.select_dtypes(include=['object']).columns) + new_col_number} columns.")

X_drop_one_hot = pd.DataFrame(OH_encoder.fit_transform(X_drop_OH.select_dtypes(include=['object'])))
X_drop_one_hot.index = X_drop_OH.index # Add back the index
X_drop_OH.drop(X_drop_OH.select_dtypes(include=['object']).columns, axis=1, inplace=True) # Remove categorical columns
X_drop_OH = pd.concat([X_drop_OH, X_drop_one_hot], axis=1) # Add one-hot columns
print(f"We ended up with {len(X_drop_OH.columns)} columns.")

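## Step 4: Cross-validation
Instead of a single `train_test_split`, we can use `cross_val_score`, which splits the data into `cv` folds and, for each fold in turn, trains on the remaining folds and scores the model on that fold as the 'validation data'. Since scikit-learn scorers follow the convention that higher is better, the MAE is returned as a negative number (`'neg_mean_absolute_error'`), which we negate below.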

In [None]:
# Create a quick model:
model = RandomForestRegressor(n_estimators=100, random_state=1)

# List containing all the datasets:
datasets = [X_drop_drop, X_impute_drop, X_impute_plus_drop,
           X_drop_LE, X_impute_LE, X_impute_plus_LE,
           X_drop_OH, X_impute_OH, X_impute_plus_OH]

# Find the cross validation score:
scores = list()

for dataset in datasets:
    scores.append(-cross_val_score(model, dataset, y, cv=5, verbose=1, scoring='neg_mean_absolute_error'))

In [None]:
score_dict = dict(zip(['X_drop_drop', 'X_impute_drop', 'X_impute_plus_drop',
                       'X_drop_LE', 'X_impute_LE', 'X_impute_plus_LE',
                       'X_drop_OH', 'X_impute_OH', 'X_impute_plus_OH'], scores))

# sort(score_dict)

print(f"{'-'*31}\n{'Dataset' : <20}{'Score' : ^14}\n{'-'*31}")
for key in score_dict.keys():
    print(f"{key : <20}{score_dict[key].mean() : ^14.6}")
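To see what `cross_val_score` does internally, the sketch below reproduces it with an explicit `KFold` loop. The synthetic data and the choice of `LinearRegression` are assumptions made to keep the example small and deterministic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression()
folds = KFold(n_splits=5)

manual_scores = []
for train_idx, valid_idx in folds.split(X_toy):
    # Train on four folds, validate on the remaining one
    model.fit(X_toy[train_idx], y_toy[train_idx])
    preds = model.predict(X_toy[valid_idx])
    manual_scores.append(mean_absolute_error(y_toy[valid_idx], preds))

auto_scores = -cross_val_score(LinearRegression(), X_toy, y_toy, cv=folds,
                               scoring='neg_mean_absolute_error')
print(np.allclose(manual_scores, auto_scores))
```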

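## Step 5: Pipelines
Using pipelines, we can simplify the process further. The pipeline below does not reproduce the manual `_was_imputed` indicator columns from Step 2 (although `SimpleImputer` has an `add_indicator` parameter for this purpose).

To recap, we start with `X` and `X_test` as our uncleaned datasets. The resulting dataset is similar to `X_impute_OH`, except that missing categorical values are filled with the most frequent value instead of 'Unknown'.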

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Handle numerical and categorical predictors separately
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both into a preprocessor step
preprocessing = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Define our model
model = RandomForestRegressor(n_estimators=100, random_state=1)

# Create our pipeline
pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('model', model)
])

In [None]:
# Cross-validate our pipeline (it is refit on the training part of each fold)
results = -cross_val_score(pipeline, X, y, cv=5, verbose=1, scoring='neg_mean_absolute_error')

print(f"{'-'*31}\n{'Dataset' : <20}{'Score' : ^14}\n{'-'*31}")
print(f"{'Pipeline' : <20}{results.mean() : ^14.6}")
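Because the preprocessing lives inside the pipeline, a fitted pipeline can be applied directly to uncleaned test data. The sketch below shows this on made-up data with a missing value and an unseen category; the column values and the small `n_estimators` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, for illustration only
train = pd.DataFrame({'LotArea': [8450.0, np.nan, 11250.0, 9550.0],
                      'Street': pd.Series(['Pave', 'Grvl', np.nan, 'Pave'], dtype=object)})
y_train = pd.Series([208500, 181500, 223500, 140000])
test = pd.DataFrame({'LotArea': [np.nan, 10000.0],
                     'Street': pd.Series(['Pave', 'Dirt'], dtype=object)})  # 'Dirt' is unseen

preprocessing = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), ['LotArea']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('one_hot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['Street']),
])
pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('model', RandomForestRegressor(n_estimators=10, random_state=1)),
])

# Imputation values and categories are learned from the training data only
pipeline.fit(train, y_train)
preds = pipeline.predict(test)
print(preds.shape)
```

With `handle_unknown='ignore'`, the unseen category is encoded as all zeros instead of raising an error.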

## What's next: XGBoost ensemble estimators

An XGBoost regressor is an ensemble of estimators built by gradient boosting: its prediction is the sum of the predictions of estimators that are added one at a time, each trained to correct the errors of the ensemble so far. The workflow goes something like:

1. Start with a single naive model and use the current ensemble to make predictions
2. Calculate the loss function
3. Use the gradient of the loss function to fit a new estimator that would most reduce the loss
4. Add the new estimator to the ensemble, and repeat from step 1
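To make the loop concrete, here is a minimal sketch of gradient boosting for squared-error loss, built from scikit-learn decision trees rather than XGBoost itself. For squared error, the negative gradient of the loss with respect to the current predictions is the residual, so each new tree is fit to the residuals. The synthetic data, tree depth and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_estimators = 50

# Start with a naive model: predict the mean everywhere
prediction = np.full_like(y_toy, y_toy.mean())
ensemble = []
losses = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_estimators):
    residuals = y_toy - prediction          # negative gradient of 1/2 * squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)              # fit a new estimator to reduce the loss
    prediction += learning_rate * tree.predict(X_toy)   # add it to the ensemble
    ensemble.append(tree)
    losses.append(np.mean((y_toy - prediction) ** 2))

print(losses[0], losses[-1])
```

XGBoost adds regularization and many performance optimizations on top of this basic idea, so this is only a conceptual model of what it does.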

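The parameters are:
* `n_estimators`: the number of estimators in the ensemble, which equals the number of boosting iterations; usually between 100 and 1000
* `learning_rate`: each new estimator's predictions are multiplied by this factor before being added to the ensemble (set to 0.10 below); a smaller value combined with more estimators tends to give a more accurate model, at the cost of a longer training time

The early-stopping parameters passed to `XGBRegressor().fit()` in this notebook are:
* `early_stopping_rounds`: number of consecutive rounds without improvement in the validation score before training halts
* `eval_set = [(X_valid, y_valid)]`: the data to evaluate against to determine whether to halt
* `eval_metric`: 'mae', etc.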
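The early-stopping rule can be sketched with a hand-written boosting loop. This is not XGBoost code: it uses scikit-learn decision trees, synthetic data, and a hypothetical `patience` variable that plays the role of `early_stopping_rounds`, with the validation MAE standing in for `eval_metric='mae'`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

learning_rate, max_rounds, patience = 0.3, 500, 5
pred_tr = np.full_like(y_tr, y_tr.mean())
pred_val = np.full_like(y_val, y_tr.mean())

best_mae, best_round, rounds_without_improvement = np.inf, 0, 0
for round_number in range(1, max_rounds + 1):
    # Fit the next tree to the training residuals and update both prediction sets
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr - pred_tr)
    pred_tr += learning_rate * tree.predict(X_tr)
    pred_val += learning_rate * tree.predict(X_val)

    mae = np.mean(np.abs(y_val - pred_val))   # score on the held-out eval set
    if mae < best_mae:
        best_mae, best_round, rounds_without_improvement = mae, round_number, 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement >= patience:
            break   # halt: no improvement for `patience` consecutive rounds

print(best_round, round_number)
```

Early stopping lets us set a generous upper bound on the number of rounds and let the validation score decide where to stop.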

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Make a copy to avoid changing the original
X_xgb = X.copy()

xgb_OH_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

# XGBRegressor can't handle categoricals easily, one-hot encode them first
X_xgb_OH = pd.DataFrame(xgb_OH_encoder.fit_transform(X_xgb[categorical_cols]))

# One-hot encoding removes indexes; add them back
X_xgb_OH.index = X_xgb.index

# Drop categorical columns from X_xgb, we'll add the one-hot encoded version after
X_xgb.drop(categorical_cols, axis=1, inplace=True)

# Add X_xgb_OH
X_xgb = pd.concat([X_xgb, X_xgb_OH], axis=1)

# Split training data and validation data, for the purpose of training our model
X_train, X_valid, y_train, y_valid = train_test_split(X_xgb, y, train_size=0.8, test_size=0.2,\
                                                     random_state=1)
# Create instance of XGBRegressor object
xgb = XGBRegressor(n_estimators=200, learning_rate=0.10)

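# Fit the data
# Note: XGBoost 2.0 and later expect early_stopping_rounds in the XGBRegressor() constructor instead of fit()
xgb.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_valid, y_valid)], verbose=0)

# Cross-validation score (cross_val_score refits clones of xgb, so the fit above does not affect it)
results = -cross_val_score(xgb, X_xgb, y, cv=5, scoring='neg_mean_absolute_error')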

print(f"{'-'*31}\n{'Dataset' : <20}{'Score' : ^14}\n{'-'*31}")
print(f"{'XGBRegressor' : <20}{results.mean() : ^14.6}")