# Preprocessing

In this notebook, we review preprocessing in scikit-learn, focusing on how feature scaling affects distance-based and tree-based models.

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intro-v2/blob/main/notebooks/03-preprocessing.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [None]:
# Install dependencies when running on Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install -r https://raw.githubusercontent.com/thomasjpfan/ml-workshop-intro-v2/main/requirements.txt

In [None]:
import sklearn
assert sklearn.__version__.startswith("1.2"), "Please install scikit-learn 1.2"

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.constrained_layout.use'] = True
sns.set_theme(context="talk", font_scale=1.2)

## Loading Housing Dataset

In [None]:
import pandas as pd

if IN_COLAB:
    HOUSING_PATH = ""
else:
    HOUSING_PATH = "data/housing.csv"
    
housing_df = pd.read_csv(HOUSING_PATH)
X = housing_df.drop("MEDV", axis="columns")
y = housing_df["MEDV"]

In [None]:
feature_names = X.columns

In [None]:
# Scatter plot of each feature against the target MEDV (up to 2 x 4 = 8 panels)
fig, axes = plt.subplots(2, 4, figsize=(24, 9))

for name, ax in zip(feature_names, axes.ravel()):
    sns.scatterplot(x=name, y='MEDV', ax=ax, data=housing_df)

In [None]:
X.plot(kind='box', rot=90);

## Model without scaling

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor().fit(X_train, y_train)

In [None]:
knr.score(X_test, y_test)

## Model with scaling

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
X_train_scaled
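
As a minimal sketch of what `StandardScaler` does, the cell below uses hypothetical toy data (not the housing dataset) and compares `fit_transform` with a manual `(x - mean) / std` computation. The names `X_toy`, `toy_scaler`, and `X_manual` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.normal(loc=5.0, scale=1.0, size=100),
    rng.normal(loc=300.0, scale=50.0, size=100),
])

toy_scaler = StandardScaler()
X_toy_scaled = toy_scaler.fit_transform(X_toy)

# fit stores the per-column mean and standard deviation
print(toy_scaler.mean_, toy_scaler.scale_)

# transform applies (x - mean) / std column by column
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(np.allclose(X_toy_scaled, X_manual))

# After scaling, each column has mean close to 0 and standard deviation close to 1
print(X_toy_scaled.mean(axis=0), X_toy_scaled.std(axis=0))
```

Note that the output here is a plain NumPy array, so the column names of a DataFrame input would not be carried over.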

### Pandas output!

In [None]:
scaler = StandardScaler()
scaler.set_output(transform="pandas")
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
X_train_scaled.plot(kind='box', rot=45);

## Train model on scaled data

In [None]:
knr = KNeighborsRegressor()
knr.fit(X_train_scaled, y_train)

In [None]:
X_test_scaled = scaler.transform(X_test)
knr.score(X_test_scaled, y_test)
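
One reason scaling can matter for a distance-based model such as `KNeighborsRegressor` is that a feature with a large numeric range dominates the distance computation. The sketch below illustrates this on synthetic data (an assumption for illustration, not the housing dataset): one informative feature on a small scale and one uninformative feature on a large scale. As in the cells above, the scaler is fit on the training split only and then applied to the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends only on the small-scale feature
rng = np.random.default_rng(0)
n = 400
signal = rng.uniform(0, 1, size=n)     # informative, small scale
noise = rng.normal(0, 1000, size=n)    # uninformative, large scale
X_syn = np.column_stack([signal, noise])
y_syn = 10 * signal

Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=0)

# Unscaled: distances are dominated by the large-scale noise feature
score_raw = KNeighborsRegressor().fit(Xs_train, ys_train).score(Xs_test, ys_test)

# Scaled: fit the scaler on the training split, reuse it for the test split
syn_scaler = StandardScaler().fit(Xs_train)
knr_syn = KNeighborsRegressor().fit(syn_scaler.transform(Xs_train), ys_train)
score_scaled = knr_syn.score(syn_scaler.transform(Xs_test), ys_test)

print(f"unscaled R^2: {score_raw:.3f}")
print(f"scaled R^2:   {score_scaled:.3f}")
```

The size of the gap depends on how the data were constructed; on real datasets the effect of scaling can be smaller or larger.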

## Exercise 1

1. Train a `sklearn.svm.SVR` model on the unscaled training data and evaluate on the unscaled test data.
2. Train the same model on the scaled data and evaluate on the scaled test data.
3. Does scaling the data change the performance of the model?

In [None]:
from sklearn.svm import SVR
import numpy as np

**If you are running locally**, uncomment the line in the following cell and run it to load the solution. On **Google Colab**, [see the solution here](https://github.com/thomasjpfan/ml-workshop-intro-v2/blob/main/notebooks/solutions/03-ex1-solution.py).

In [None]:
# %load solutions/03-ex1-solution.py

## Tree-based models

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree = DecisionTreeRegressor(random_state=0, max_depth=2)
tree.fit(X_train, y_train)
tree.score(X_test, y_test)

In [None]:
tree_scaled = DecisionTreeRegressor(random_state=0, max_depth=2)
tree_scaled.fit(X_train_scaled, y_train)
tree_scaled.score(X_test_scaled, y_test)
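
Decision trees split on thresholds of individual features, so standardizing a feature (shifting it and dividing by a positive constant) preserves the ordering of its values and should leave the learned partition unchanged. The following sketch checks this on synthetic data; the data and the names `tree_raw` and `tree_std` are assumptions for illustration. Barring floating-point ties, both checks are expected to print `True`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with two features on very different scales
rng = np.random.default_rng(0)
X_syn = np.column_stack([
    rng.normal(0, 1, size=300),
    rng.normal(0, 1000, size=300),
])
y_syn = 3 * X_syn[:, 0] + 0.002 * X_syn[:, 1] + rng.normal(0, 0.1, size=300)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=0)
syn_scaler = StandardScaler().fit(Xs_train)

# Same hyperparameters, one tree on raw features and one on standardized features
tree_raw = DecisionTreeRegressor(random_state=0, max_depth=2).fit(Xs_train, ys_train)
tree_std = DecisionTreeRegressor(random_state=0, max_depth=2).fit(
    syn_scaler.transform(Xs_train), ys_train
)

pred_raw = tree_raw.predict(Xs_test)
pred_std = tree_std.predict(syn_scaler.transform(Xs_test))

# Compare which feature each node splits on, and the test predictions
print(np.array_equal(tree_raw.tree_.feature, tree_std.tree_.feature))
print(np.allclose(pred_raw, pred_std))
```

Only the split thresholds differ between the two trees, because they are expressed in the units of the features each tree was trained on.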

In [None]:
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(12, 6))
_ = plot_tree(tree, ax=ax, fontsize=20, feature_names=feature_names)

In [None]:
# Same plot for the tree trained on scaled data (plot_tree was imported in the previous cell)
fig, ax = plt.subplots(figsize=(12, 6))
_ = plot_tree(tree_scaled, ax=ax, fontsize=20, feature_names=feature_names)