<img src='images/gdd-logo.png' width='300px' align='right' style="padding: 15px">

# Transformers

In the previous notebook, you developed a baseline model. However, that model ignored every feature that contained text and every feature that contained missing data.

In this notebook, you will use sklearn Transformers to preprocess the data and investigate how this can improve model performance.

- [Baseline model](#baseline)
- [Sklearn Objects](#sklearn)
- [Sklearn Transformers]()
    - [Treating categorical columns](#cat)
    - [Using `ColumnTransformer` for a subset of columns](#ct)
    - [Using `SimpleImputer` to treat missing values](#impute)
- [The sklearn `Pipeline`](#pipeline)
- [Conclusion and next steps](#conc)

<a id=baseline></a>

## Recreating the Baseline Model

Let's make sure to have access to the baseline model during this notebook so you can compare performance.

Import the relevant packages and modules:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

In [None]:
stroke = pd.read_csv('data/stroke.csv').rename(columns=str.lower)
stroke.head()

Create X and y:

In [None]:
categorical_cols = ['work_type', 'smoking_status', 'who', 'gender', 'residence_type']
missing_cols = ['bmi', 'age']
drop_cols = ['id','address']

target = 'stroke'

# Function to split data into X and y
def create_Xy(df, drop_cols, target_col):
    df = df.drop(columns=drop_cols)
    return (
        df.drop(columns=target_col),
        df[target_col]
    )

X_baseline, y = stroke.pipe(create_Xy, 
                        drop_cols=drop_cols
                        + categorical_cols
                        + missing_cols, 
                        target_col='stroke')

Split the data into training and test sets:

In [None]:
train_test_params = {
    'test_size':0.25, 
    'random_state':42, 
    'stratify':y
}

X_baseline_train, X_baseline_test, y_train, y_test = train_test_split(
                                                                      X_baseline, 
                                                                      y, 
                                                                      **train_test_params)

Recreate and train the base model (Decision Tree Classifier):

In [None]:
# Step 1: Import model
from sklearn.tree import DecisionTreeClassifier

# Step 2: Instantiate model and set parameters
base_model = DecisionTreeClassifier(max_depth=3, 
                                    class_weight='balanced',
                                    random_state=42)

# Step 3: Train model
base_model.fit(X_baseline_train, y_train)

In [None]:
# Step 4: Evaluate model
y_baseline_train_probs = base_model.predict_proba(X_baseline_train)[:,1]
y_baseline_test_probs = base_model.predict_proba(X_baseline_test)[:,1]

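# Report train and test ROC-AUC separately instead of printing a raw tuple
print(f'AUC train: {roc_auc_score(y_train, y_baseline_train_probs):.3f}, '
      f'test: {roc_auc_score(y_test, y_baseline_test_probs):.3f}')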

In this notebook, you will try to improve on the ROC-AUC scores achieved with your simple models through the use of sklearn **Transformers**.

<a id=sklearn></a>

## Sklearn Objects: Estimators, Predictors, Models and Transformers

Sklearn is built up of different types of [Objects](https://scikit-learn.org/stable/developers/develop.html). 

- An **Estimator** implements a `.fit()` method to learn from data.
- A **Predictor** makes predictions using the `.predict()` method.
- A **Model** can give a goodness of fit measure or a likelihood of unseen data using a `.score()` method.
- A **Transformer** can be used for filtering or modifying the data with the `.transform()` and `.fit_transform()` methods.

The Decision Tree algorithm is an example of an **Estimator**: its `.fit()` method applies the decision tree algorithm to the given data. Once fitted, it is also a **Predictor** and a **Model**, since it can make predictions and supply a measure of goodness of fit.

The great thing about sklearn is that all model algorithms follow this pattern. 
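
As a minimal sketch of these roles, the snippet below calls `.fit()`, `.predict()` and `.score()` on a single decision tree. The data comes from `make_classification`, which is an assumption made purely for illustration and has nothing to do with the stroke dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data (an assumption for illustration; not the stroke data)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)

tree.fit(X_toy, y_toy)                 # Estimator: learns from data
toy_preds = tree.predict(X_toy)        # Predictor: returns class labels
toy_score = tree.score(X_toy, y_toy)   # Model: mean accuracy for classifiers

print(toy_preds[:5], toy_score)
```

Any other sklearn classifier could be swapped in for `DecisionTreeClassifier` here without changing the three method calls.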

### Transformers

You will now use **Transformers** to *transform* the data (who would have guessed). 

Transformers help you:
- deal with missing values (e.g., by imputation)
- deal with categorical and string features.
- create new features from existing ones (e.g., by adding polynomial features and interactions)
- and much more...


Let's recreate X and y without dropping the categorical features or features with missing values.

In [None]:
drop_cols = ['id','address']

X, y = stroke.pipe(create_Xy, 
                   drop_cols=drop_cols, 
                   target_col='stroke')

X_train, X_test, y_train, y_test = train_test_split(X, y, **train_test_params)

In [None]:
X_train.head()

<a id=cat></a>

### Treating categorical data

The data now contains categorical and string features.

We will have to use **One-hot Encoding** or **Ordinal Encoding** to transform the categories into numerical features.

####  <mark>**Exercise**</mark> 

1. Which features do you consider categorical?
2. Sklearn implements the [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) to deal with categorical data. Find out what they do.
3. For which ones would you use one-hot encoding and for which ones ordinal encoding?

In [None]:
# Find columns of DType "object"
stroke.select_dtypes('O').columns

In [None]:
stroke['who'].value_counts()

<details>
  <summary><span style="color:blue">Show solution</span></summary>
    
  You can use `stroke.select_dtypes('O').nunique()` to see how many unique categories each feature contains.
  
  None of the columns have an ordinal relationship (`smoking_status` may be a candidate but where would `unknown` fit?).
  We should therefore use `OneHotEncoder`.

</details>
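
If you want to see the difference between the two encoders side by side, the sketch below applies both to one made-up column. The `size` column and the order `small < medium < large` are assumptions for illustration; they are not part of the stroke data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column with a natural order
toy_sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# One-hot: one 0/1 column per category, no order implied
toy_onehot = OneHotEncoder().fit_transform(toy_sizes).toarray()

# Ordinal: a single integer column, following the order given in `categories`
ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']])
toy_ordinal = ordinal.fit_transform(toy_sizes)

print(toy_onehot)
print(toy_ordinal.ravel())
```

The one-hot output has three columns, while the ordinal output keeps a single column whose values respect the order you passed in.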

<a id=ct></a>

### Using `ColumnTransformer` to select columns

Of course, we only want to transform the categorical columns and leave the numerical columns as they are.

What happens when you use `OneHotEncoder` on the entire dataset?

In [None]:
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()
onehot.fit_transform(X_train).shape

Let's look at the names of the output columns.

*Note: On sklearn 1.0+ the method is `get_feature_names_out()`. For earlier versions use `get_feature_names()`.*

In [None]:
onehot.get_feature_names_out()[:10]

<mark>**Question:**</mark> What happened?

<details>
  <summary><span style="color:blue">Show solution</span></summary>
    
  You have created a new column for each new unique value of each column, regardless of whether the data was categorical or not. Instead you want to apply the `OneHotEncoder` and `OrdinalEncoder` only on the categorical columns.

</details>



To select only a subset of features, you can use the `ColumnTransformer` object with the following syntax:
```python
onehot = ColumnTransformer([
    ('name_of_step_1', Transformer_1, list_of_cols_1),
    ('name_of_step_2', Transformer_2, list_of_cols_2),
    ...
    ('name_of_step_n', Transformer_n, list_of_cols_n),
], remainder='passthrough')
```

You can implement a `ColumnTransformer` to only select the categorical features like this:

In [None]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(), categorical_cols)],
    remainder='passthrough')
ct

Running the `.fit_transform()` and `.transform()` methods, we get:

In [None]:
X_train_encoded = ct.fit_transform(X_train)
X_test_encoded = ct.transform(X_test)

In [None]:
X_train.shape, X_train_encoded.shape

<mark>**Question**:</mark> There were 11 columns and now there are 22. Does that make sense?

In [None]:
ct.get_feature_names_out()

<details>
<summary><span style="color:blue">Show solution</span></summary>

Using `ct.get_feature_names_out()`, you can see that it created a column for each category: `onehot__gender_Male`, `onehot__gender_Female`, `onehot__work_type_Govt_job`, and so on... There were 17 categories and 5 remaining numerical features that were not one-hot encoded.

</details>

### <mark>Exercise:</mark> Drop-first One-Hot Encoding

It is not necessary to add a column for each categorical value.

There are two options to add to the `drop=` parameter in the `OneHotEncoder`:

- `'first'`
- `'if_binary'`

**Question 1:** What does each of these options do? Why would you drop a column?

<details>
  <summary><span style="color:blue">Show solution</span></summary>
    
`drop='first'` will remove one column for each feature

`drop='if_binary'` will only remove one column for the binary features

**Why drop a column in the first place?** Because you need one fewer column than there are categories to fully encode all the information. E.g., if `gender_Male = 1` you know that `gender_Female` must be zero. So you can drop one column and still have all the information. ML practice follows the *principle of parsimony*: if a simpler model (e.g., fewer features) works as well as a more complex model (e.g., more features), you will prefer the simpler model.


</details>

**Question 2:** Rebuild the `ColumnTransformer` with the correct parameter!

In [None]:
new_ct = ColumnTransformer([
    # your code here
])

In [None]:
# %load answers/02-ohe.py

<a id='impute'></a>

### Imputing missing values

Recall that there were two numeric features with missing values. 

In [None]:
missing_cols = list(stroke.columns[stroke.isnull().any()])
missing_cols

You can use the [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) from sklearn to fill in these missing values.

It will fill the missing values in each column with a single value, e.g., the *median* of that feature column.
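
As a quick illustration on a made-up array (an assumption, not the stroke data), the sketch below shows that `SimpleImputer` learns one fill value per column during `.fit()` and stores it in `statistics_`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values with one missing entry per column
toy_values = np.array([[1.0, 10.0],
                       [np.nan, 30.0],
                       [3.0, np.nan],
                       [8.0, 50.0]])

toy_impute = SimpleImputer(strategy='median')
toy_filled = toy_impute.fit_transform(toy_values)

print(toy_impute.statistics_)  # the learned per-column medians
print(toy_filled)
```

Because the medians are learned in `.fit()`, calling `.transform()` on new data reuses the values learned from the training data.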

In [None]:
from sklearn.impute import SimpleImputer

impute = SimpleImputer(strategy='median')

In [None]:
X_train_imputed = impute.fit_transform(X_train_encoded)
X_test_imputed = impute.transform(X_test_encoded)

Applying it directly to the (encoded) data means the same imputation strategy (here the median, computed per column) is used for every column.

In [None]:
X_train['age'].median()

In [None]:
X_train['bmi'].median()

However, if you employed the `ColumnTransformer` you could apply different strategies to the columns.

*Note: If the strategy is mean or median, this transformer will only work when all columns are numeric (so you need to impute after the one-hot transformer has been applied).*
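
One possible way to apply different strategies per column is sketched below on a made-up, numeric-only table. The values are invented and the choice of median for `age` and mean for `bmi` is an assumption for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up numeric data with one missing value per column
toy_health = pd.DataFrame({
    'age': [20.0, np.nan, 40.0, 90.0],
    'bmi': [22.0, 30.0, np.nan, 26.0],
})

toy_imputer = ColumnTransformer([
    ('median_age', SimpleImputer(strategy='median'), ['age']),
    ('mean_bmi', SimpleImputer(strategy='mean'), ['bmi']),
])
toy_imputed = toy_imputer.fit_transform(toy_health)

print(toy_imputed)
```

The missing `age` is filled with the median of the observed ages, and the missing `bmi` with the mean of the observed BMI values.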

<a id=pipeline></a>

## Sklearn Pipeline

To do this all in one go, you can use the sklearn `Pipeline` object.


<img src="images/sklearn-pipe.png" style="display: block;margin-left: auto;margin-right: auto;width: 400px" align='right'/>

**Pipelines** can encapsulate all the preprocessing steps (feature selections, scaling, encoding of variables and so on), as well as the final model, into a single Scikit-Learn estimator, thereby simplifying and automating many steps.

Pipelines are defined as a **list of steps**, with each step being a `(name, object)` **tuple**:

```Python
pipe = Pipeline(steps=[
    ('name_of_step_1', Transformer/Estimator/Model/Pipeline_1),
    ('name_of_step_2', Transformer/Estimator/Model/Pipeline_2),
    ...
    ('name_of_step_n', Transformer/Estimator/Model/Pipeline_n),
])
```
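
Before building the pipeline for the stroke data, here is a minimal sketch of the syntax on synthetic data. The step names, the `StandardScaler` step and the data from `make_classification` are assumptions chosen for illustration. A decision tree does not need scaling, so the scaler is only there to show a preprocessing step.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data (not the stroke data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)

demo_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),                                      # transformer
    ('model', DecisionTreeClassifier(max_depth=3, random_state=42)),  # final estimator
])

# fit() runs fit_transform on each transformer, then fits the final estimator
demo_pipe.fit(X_demo, y_demo)
demo_probs = demo_pipe.predict_proba(X_demo)[:, 1]

print(demo_probs[:5])
```

The fitted pipeline behaves like a single estimator: one call to `.fit()` and one call to `.predict_proba()` cover all steps.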

In [None]:
ct = ColumnTransformer([
    ('onehot', OneHotEncoder(drop="if_binary", sparse_output=False), categorical_cols),
    ], 
    remainder='passthrough', 
    )

#### <mark>Exercise:</mark> Build a `Pipeline`

Build a pipeline called `preprocessing`.
1. In the first step, you should add the `ct` `ColumnTransformer`.
2. In the second step, add the `SimpleImputer` with `strategy='mean'`.
3. Check if the data was correctly transformed by using the `.fit_transform(X_train)` method on your preprocessing pipeline. Does the output make sense?
4. Create a new pipeline that adds the Decision Tree Classifier after the preprocessing steps.

*Note: You can output a pandas dataframe by using `preprocessing.set_output(transform='pandas')` before calling the transform method.*

In [None]:
from sklearn.pipeline import Pipeline

# Your code here

In [None]:
# %load answers/02-pipeline.py

You can access parts of the pipeline by indexing on their names:

In [None]:
pipeline['preprocessing']['ct']

This way, you can still access the estimators/transformers/models and their parameters. 
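
As a sketch of how parameters can be read and changed, the toy pipeline below uses hypothetical step names (`scale` and `model`). Parameters of a step can be addressed as `<step name>__<parameter>`.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy pipeline with hypothetical step names
param_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', DecisionTreeClassifier(max_depth=3)),
])

# Two ways to read the same parameter
depth_by_index = param_pipe['model'].max_depth
depth_by_name = param_pipe.get_params()['model__max_depth']

# Change a nested parameter without rebuilding the pipeline
param_pipe.set_params(model__max_depth=5)

print(depth_by_index, depth_by_name, param_pipe['model'].max_depth)
```

Anything a fitted step exposes, such as its methods and learned attributes, can be reached through the same indexing.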

E.g. the feature names after one-hot encoding:

In [None]:
all_features = pipeline['preprocessing']['ct'].get_feature_names_out()
all_features

### Model creation and evaluation

Let's fit the new pipeline to the original `X_train` and `y_train` data and compare it to the baseline you achieved earlier.

In [None]:
pipeline.fit(X_train, y_train)

Now, let's look at the ROC-AUC:

In [None]:
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots(1, 2, figsize=(16,6))

RocCurveDisplay.from_estimator(base_model, X_baseline_train, y_train, ax=ax[0], name='Baseline')
RocCurveDisplay.from_estimator(pipeline, X_train, y_train, ax=ax[0], name='Improved Model')
ax[0].set(title='Train');

RocCurveDisplay.from_estimator(base_model, X_baseline_test, y_test, ax=ax[1], name='Baseline')
RocCurveDisplay.from_estimator(pipeline, X_test, y_test, ax=ax[1], name='Improved Model')
ax[1].set(title='Test')

A good improvement!

Let's see what features the Decision Tree Classifier now used for its splitting rules:

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

fig,ax = plt.subplots(figsize=(20,20))

plot_tree(pipeline.named_steps['model'], 
          feature_names=list(all_features),
          ax=ax);

---

<img src='images/gdd-logo.png' width='300px' align='right' style="padding: 15px">
<a id=conc></a>

# Conclusion and next steps

This notebook covered the main objects in scikit-learn: Estimator, Predictor, Model, Transformer. Two transformers were used to preprocess the data to treat categorical features and features with missing values.

Pipelines were also used as an elegant way to write your code.

Next up you will work to improve this model even further by focussing on the model algorithm.