# Preprocessing

In this notebook, we review preprocessing in scikit-learn.

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intro/blob/master/notebooks/03-preprocessing.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies for google colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intro/master/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.0"), "Please install scikit-learn 1.0"

In [None]:
import seaborn as sns
sns.set_theme(context="notebook", font_scale=1.4, rc={"figure.constrained_layout.use": True, "figure.figsize": [10, 6]})

In [None]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
housing_df = housing.frame

In [None]:
housing.frame

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 4, figsize=(24, 9))

for name, ax in zip(housing_df.columns.drop("MedHouseVal"), axes.ravel()):
    sns.scatterplot(x=name, y='MedHouseVal', ax=ax, data=housing_df)

In [None]:
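from pathlib import Path
housing_df.drop("MedHouseVal", axis=1).plot(kind='box');
Path("images").mkdir(exist_ok=True)  # savefig raises if the directory is missing
plt.gcf().savefig("images/housing_box.svg")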

## Model without scaling

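Split the data into the feature matrix `X` and the target `y`. The California housing features are all numerical, so there are no categorical columns to remove.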

In [None]:
X = housing.data
y = housing.target

In [None]:
feature_names = X.columns

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor().fit(X_train, y_train)
knr.score(X_train, y_train)

In [None]:
knr.score(X_test, y_test)

## Model with scaling

### Scale first!

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
import pandas as pd
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=feature_names)

In [None]:
X_train_scaled_df.plot(kind='box', xticks=[]);
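
To see what `StandardScaler` does to each column, the cell below is a minimal sketch on synthetic data. The arrays and the names `X_demo`, `scaler_demo`, and `X_manual` are made up for illustration and stand in for the housing features. It checks that the transform matches subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the housing features: two columns on very different scales
X_demo = np.column_stack([
    rng.normal(loc=3.0, scale=2.0, size=200),
    rng.normal(loc=1400.0, scale=1100.0, size=200),
])

scaler_demo = StandardScaler().fit(X_demo)
X_demo_scaled = scaler_demo.transform(X_demo)

# Same result by hand: subtract the column mean, divide by the column standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_demo_scaled, X_manual))
print(X_demo_scaled.mean(axis=0).round(6), X_demo_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the box plot above shows all features on a comparable range.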

### Train model on scaled data

In [None]:
knr = KNeighborsRegressor().fit(X_train_scaled, y_train)
knr.score(X_train_scaled, y_train)

In [None]:
X_test_scaled = scaler.transform(X_test)
knr.score(X_test_scaled, y_test)
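
One way to build intuition for why scaling helps a nearest-neighbors model is a toy problem where only a small-scale feature carries signal. The sketch below uses hypothetical synthetic data (`X_toy`, `y_toy`, and the other names are not part of the workshop) and assumes the default Euclidean distance of `KNeighborsRegressor`. Without scaling, the large-scale noise column dominates the distances.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 1000

# Hypothetical data: only the small-scale feature carries signal
informative = rng.uniform(0, 1, size=n_samples)
noise_large = rng.uniform(0, 1000, size=n_samples)
X_toy = np.column_stack([informative, noise_large])
y_toy = 10 * informative + rng.normal(scale=0.1, size=n_samples)

Xt_train, Xt_test, yt_train, yt_test = train_test_split(X_toy, y_toy, random_state=42)

# Unscaled: Euclidean distances are dominated by the large-scale noise column
score_raw = KNeighborsRegressor().fit(Xt_train, yt_train).score(Xt_test, yt_test)

# Scaled: fit the scaler on the training split only, then reuse it on the test split
toy_scaler = StandardScaler().fit(Xt_train)
knr_toy = KNeighborsRegressor().fit(toy_scaler.transform(Xt_train), yt_train)
score_scaled = knr_toy.score(toy_scaler.transform(Xt_test), yt_test)

print(f"unscaled R^2: {score_raw:.3f}")
print(f"scaled R^2:   {score_scaled:.3f}")
```

As in the cells above, the scaler is fit on the training split and only applied to the test split, so no test-set statistics leak into the model.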

## Exercise 1

1. Train a `sklearn.linear_model.SGDRegressor` model on the unscaled training data and evaluate on the unscaled test data.
2. Train the same model on the scaled data and evaluate on the scaled test data.
3. Does scaling the data change the performance of the model?

In [None]:
from sklearn.linear_model import SGDRegressor
import numpy as np

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intro/blob/master/notebooks/solutions/03-ex1-solution.py). 

In [None]:
# %load solutions/03-ex1-solution.py

## Tree-based models

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree = DecisionTreeRegressor(random_state=0, max_depth=2).fit(X_train, y_train)
tree.score(X_test, y_test)

In [None]:
tree_scaled = DecisionTreeRegressor(random_state=0, max_depth=2).fit(X_train_scaled, y_train)
tree_scaled.score(X_test_scaled, y_test)

### Why are the scores the same?

In [None]:
from sklearn.tree import plot_tree
sns.reset_orig()
fig, ax = plt.subplots(figsize=(20, 10))
_ = plot_tree(tree, ax=ax, fontsize=20, feature_names=feature_names)

In [None]:
from sklearn.tree import plot_tree
sns.reset_orig()
fig, ax = plt.subplots(figsize=(20, 10))
_ = plot_tree(tree_scaled, ax=ax, fontsize=20, feature_names=feature_names)
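
The two plots suggest that the trees split on the same features and that only the thresholds differ. A decision tree only compares a feature to a threshold, and standardization preserves the ordering of values within each column, so the same partitions of the samples are available before and after scaling. The sketch below checks this on hypothetical synthetic data (all names such as `X_syn` and `tree_raw` are made up for illustration): the thresholds of the unscaled tree, passed through the scaler's `mean_` and `scale_`, should match the thresholds of the scaled tree.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Hypothetical data with features on very different scales
X_syn = np.column_stack([
    rng.normal(3.0, 2.0, size=300),
    rng.normal(1400.0, 1100.0, size=300),
])
y_syn = 2.0 * X_syn[:, 0] + 0.001 * X_syn[:, 1] + rng.normal(scale=0.1, size=300)

syn_scaler = StandardScaler().fit(X_syn)
X_syn_scaled = syn_scaler.transform(X_syn)

tree_raw = DecisionTreeRegressor(random_state=0, max_depth=2).fit(X_syn, y_syn)
tree_std = DecisionTreeRegressor(random_state=0, max_depth=2).fit(X_syn_scaled, y_syn)

# Internal nodes have a non-negative feature index; leaves are marked with -2
is_split = tree_raw.tree_.feature >= 0
split_features = tree_raw.tree_.feature[is_split]

# Map the unscaled thresholds through the scaler: (threshold - mean) / scale
mapped = (
    tree_raw.tree_.threshold[is_split] - syn_scaler.mean_[split_features]
) / syn_scaler.scale_[split_features]

print(np.array_equal(tree_raw.tree_.feature, tree_std.tree_.feature))
print(np.allclose(mapped, tree_std.tree_.threshold[is_split], atol=1e-4))
print(np.allclose(tree_raw.predict(X_syn), tree_std.predict(X_syn_scaled)))
```

Because both trees partition the samples identically, their predictions agree, which is consistent with the identical test scores above.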