# Advanced scikit-learn

Authors: [Alexandre Gramfort](http://alexandre.gramfort.net), [Thomas Moreau](https://tommoral.github.io/about.html), and [Pedro L. C. Rodrigues](https://plcrodrigues.github.io/).

The aims of this notebook are:

  - To explain the **full scikit-learn API** (estimators, transformers, classifiers, regressors, splitters).
  - To explain how to assemble these objects in complex **pipelines with mixed data types** (numerical, categorical etc.) using `Pipeline` and `ColumnTransformer` objects.
  - To have you **write your own transformer, splitter and classifier**.
  
## Table of contents

* [1 Working only with numerical data](#workingnumerical)
    * [1.1 Pandas preprocessing](#workingnumerical_pandas)
    * [1.2 Making it less error prone using scikit-learn](#workingnumerical_errorprone)    
* [2 Working only with categorical data](#workingcategorical)
* [3 Combining both categorical and numerical data in the pipeline](#combining)
* [4 From one split to cross-validation](#crossvalidation)

To explain these concepts we will start from a full working code based on the Titanic dataset. 

Then, we will deconstruct all the blocks and start writing our own Python classes.

First, let's fetch the Titanic dataset directly from [OpenML](https://openml.org/).

In [None]:
import pandas as pd
from sklearn.datasets import fetch_openml

In [None]:
X_df, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X_df.head()

The classification task is to predict whether or not a person will survive the Titanic disaster.

In [None]:
y

We will split the data into a training and a testing set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, random_state=42
)

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
        <li>What would happen if you tried to fit a <tt>RandomForestClassifier</tt>?</li>
    </ul>
</div>

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
# TODO

# 1 Working only with numerical data <a class="anchor" id="workingnumerical"></a> [↑](#Table-of-contents)

Let's start with a model using only numerical columns.

In [None]:
X_df.dtypes

## 1.1 Pandas preprocessing  <a class="anchor" id="workingnumerical_pandas"></a> [↑](#Table-of-contents)

Before using scikit-learn, we will do some simple preprocessing using pandas. First, let's select only a few of the numerical columns:

In [None]:
num_cols = ['pclass', 'age', 'parch', 'fare']

X_train_num = X_train[num_cols]
X_test_num = X_test[num_cols]

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
        <li>And now, what would happen if you tried to fit a <tt>RandomForestClassifier</tt>?</li>
    </ul>
</div>

In [None]:
model = RandomForestClassifier(n_estimators=100)
# TODO

We might want to look into a summary of the data that we try to fit.

In [None]:
X_train_num.info()

Since some values are missing, we can replace them with the mean of the corresponding column.

In [None]:
X_train_num_imputed = X_train_num.fillna(X_train_num.mean())
X_train_num_imputed.info()

In [None]:
model.fit(X_train_num_imputed, y_train)

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
    <li>What should we do if there are also missing values in the test set?</li>
    <li>Process the test set so as to be able to compute the test score of the model.</li>
    </ul>
</div>

Solution is in `solutions/01-pandas_fillna_test.py`

In [None]:
# TODO

## 1.2 Making it less error prone using scikit-learn <a class="anchor" id="workingnumerical_errorprone"></a> [↑](#Table-of-contents)

Scikit-learn provides some "transformers" to preprocess the data. `sklearn.impute.SimpleImputer` is a transformer that does the same job as the preprocessing done with pandas. However, we will see later that it integrates well with other scikit-learn components.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")

Like any estimator in scikit-learn, a transformer has a `fit` method which should be called on the training data to learn the required statistics. In the case of a mean imputer, we need to compute the mean for each feature.

In [None]:
imputer.fit(X_train_num)

In [None]:
imputer.statistics_

To impute the values by the mean, we can use the `transform` method.

In [None]:
imputer.transform(X_train_num)

As previously mentioned, we should impute with the values computed in `fit` when imputing the test set.
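
The point can be checked on a tiny example. The sketch below uses made-up toy arrays (not the Titanic data) and names such as `toy_imputer` that are introduced only for illustration: the imputer learns the mean on the training array and reuses it on the test array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): one feature with missing values.
X_toy_train = np.array([[1.0], [3.0], [np.nan]])
X_toy_test = np.array([[np.nan], [10.0]])

toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(X_toy_train)  # learns the mean of the observed training values

# The missing test value is filled with the training mean (2.0),
# not with a statistic computed on the test set.
X_toy_test_imputed = toy_imputer.transform(X_toy_test)
print(toy_imputer.statistics_)
print(X_toy_test_imputed)
```

Calling `fit` again on the test array would instead learn a new mean from the test data, which is exactly what we want to avoid.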

<div class="alert alert-warning">
<b>What is a "Transformer"?</b>: <br/>

A scikit-learn transformer should implement at least these methods:

<ul>
    <li>fit(X, y=None)</li>
    <li>transform(X)</li>
    <li>get_params()</li>
    <li>set_params(**kwargs)</li>  
</ul>
</div>
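
As an illustration of this contract, here is a minimal sketch of a custom transformer. The class `MeanCenterer` and its `with_mean` parameter are hypothetical and only serve the example; `get_params` and `set_params` come from `BaseEstimator`, and `fit_transform` comes from `TransformerMixin`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts the per-column training mean."""

    def __init__(self, with_mean=True):
        # Parameters are stored unchanged so that get_params/set_params work.
        self.with_mean = with_mean

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Estimated quantities are stored in attributes ending with "_".
        self.mean_ = X.mean(axis=0) if self.with_mean else np.zeros(X.shape[1])
        return self  # fit returns the estimator itself

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


X_toy = np.array([[1.0, 10.0], [3.0, 30.0]])
centerer = MeanCenterer()
print(centerer.get_params())
print(centerer.fit_transform(X_toy))
```

Only `__init__`, `fit` and `transform` are written by hand here; the remaining methods are inherited.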

In [None]:
params = imputer.get_params()
params

In [None]:
imputer.fit?

In [None]:
imputer.transform?

Let's look at the attributes of our `imputer`:

In [None]:
public_attributes = [attr for attr in dir(imputer) if not attr.startswith('_')]
public_attributes

We have among these attributes:

- **parameters** (keys in get_params method output)
- **methods** (fit, transform, etc.)
- **estimated quantities** that appear after a `fit` (ending with `_`)

In [None]:
public_methods = [
    attr for attr in dir(imputer)
    if not attr.startswith('_') and
    not attr.endswith('_') and
    attr not in params]
public_methods

In [None]:
imputer.inverse_transform?

Estimated quantities:

In [None]:
fit_attributes = [
    attr for attr in dir(imputer)
    if not attr.startswith('_') and
    attr.endswith('_')]
fit_attributes

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
        <li>What are the attributes of a RandomForestClassifier? Decompose them into the 3 categories above.</li>
    </ul>
</div>

### Using a Pipeline

We saw earlier that we should be careful when preprocessing data to avoid any "data leak" (i.e. letting information from the test set influence the fitted model). Scikit-learn provides the `Pipeline` class to chain successive transformations. In addition, it ensures that the right operations are applied at the right time.

In [None]:
from sklearn import set_config

set_config(display='diagram')

In [None]:
from sklearn.pipeline import make_pipeline

model = make_pipeline(SimpleImputer(strategy='mean'),
                      RandomForestClassifier(n_estimators=200))
model.fit(X_train_num, y_train)

Alternative syntax using named "steps".

In [None]:
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("imputer", SimpleImputer(strategy='mean')),
    ("classifier", RandomForestClassifier(n_estimators=200))    
])
model.fit(X_train_num, y_train)

In [None]:
model.score(X_test_num, y_test)

Saving your estimator in HTML for presentations, blog posts etc.

In [None]:
from sklearn.utils import estimator_html_repr

with open('model.html', 'w') as fid:
    fid.write(estimator_html_repr(model))

# !open model.html

### Manipulating Pipeline steps

A pipeline is a sequence of `steps`. Each `step` is a scikit-learn estimator. All steps except the last one are typically **transformers** (fit, fit_transform, transform methods) and the last step is a **classifier** or a **regressor**.

In [None]:
model.steps  # accessing steps as a list

In [None]:
model.named_steps  # accessing steps with their names as a dict

In [None]:
model[:1]  # slicing a pipeline

In [None]:
model[-1]

In [None]:
from sklearn.base import is_classifier

is_classifier(model[-1])

Let's decompose the pipeline and chain the operations manually (mimicking what the Pipeline object does internally):

In [None]:
preprocessor = model[:-1]
classifier = model[-1]

X_train_preproc = preprocessor.fit_transform(X_train_num, y_train)
X_test_preproc = preprocessor.transform(X_test_num)

classifier.fit(X_train_preproc, y_train)
classifier.score(X_test_preproc, y_test)
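
To check that the manual chaining matches what a pipeline does, here is a small sketch on synthetic data. The data, the variable names and the choice of `LogisticRegression` (picked because its fit is deterministic) are assumptions made for this example only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data, made up for illustration, with a few missing values.
rng = np.random.RandomState(0)
X_syn = rng.normal(size=(60, 3))
y_syn = (X_syn[:, 0] > 0).astype(int)
X_syn[rng.rand(60, 3) < 0.1] = np.nan
X_syn_train, X_syn_test = X_syn[:40], X_syn[40:]
y_syn_train, y_syn_test = y_syn[:40], y_syn[40:]

pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
pipe.fit(X_syn_train, y_syn_train)

# Same operations chained by hand, on unfitted copies of the two steps.
imputer_manual = clone(pipe[0])
clf_manual = clone(pipe[-1])
X_train_imp = imputer_manual.fit_transform(X_syn_train)  # fit on train only
clf_manual.fit(X_train_imp, y_syn_train)
X_test_imp = imputer_manual.transform(X_syn_test)  # reuse the training means

same = np.array_equal(pipe.predict(X_syn_test), clf_manual.predict(X_test_imp))
print(same)
```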

### Towards a ColumnTransformer

If we want to directly fit the model on `X_train`, we can select the numerical columns using  a `ColumnTransformer` object:

In [None]:
from sklearn.compose import ColumnTransformer

ColumnTransformer?

In [None]:
numerical_preprocessing = ColumnTransformer([
    ("imputer", SimpleImputer(strategy='mean'), num_cols)
])
    
model = Pipeline([
    ("numerical preproc.", numerical_preprocessing),
    ("classifier", RandomForestClassifier(n_estimators=100)),
])

model.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)
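
One detail worth seeing on a small example: by default (`remainder="drop"`), a `ColumnTransformer` only outputs the columns it was asked to process. The toy frame below is made up for illustration and its column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame, made up for illustration: two numerical columns, one string column.
toy_df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "fare": [7.5, 8.5, np.nan],
    "name": ["a", "b", "c"],
})

toy_ct = ColumnTransformer([
    ("imputer", SimpleImputer(strategy="mean"), ["age", "fare"]),
])
out = toy_ct.fit_transform(toy_df)

# Only the two selected columns survive; "name" is dropped
# because the default is remainder="drop".
print(out.shape)
print(out)
```

This is why the pipeline above can be fitted on the full `X_train` even though it still contains string columns.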

# 2 Working only with categorical data <a class="anchor" id="workingcategorical"></a> [↑](#Table-of-contents)

Categorical columns (and string data types even more so) are not natively supported by most machine-learning algorithms and require a preprocessing step usually called encoding. The most classical categorical encoders are the `OrdinalEncoder` and the `OneHotEncoder`. Let's first see the `OrdinalEncoder`.

In [None]:
X_train.head()

In [None]:
cat_cols = ['sex', 'embarked', 'pclass']

In [None]:
X_train_cat = X_train[cat_cols]

In [None]:
X_train_cat.info()

In [None]:
from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='constant', fill_value='missing')),
    ("ordinal_encoder", OrdinalEncoder())
])

categorical_preprocessing = ColumnTransformer([
    ("categorical_preproc", cat_pipeline, cat_cols)
])

model = Pipeline([
    ("categorical_preproc", categorical_preprocessing),
    ("classifier", RandomForestClassifier(n_estimators=100))
])
model.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)
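
To make the difference between the two encoders mentioned at the start of this section concrete, here is a short sketch on a made-up single column (the values are chosen only for illustration).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column, made up for illustration.
X_toy_cat = np.array([["S"], ["C"], ["Q"], ["S"]])

ordinal = OrdinalEncoder().fit(X_toy_cat)
print(ordinal.categories_)           # categories are sorted: C, Q, S
print(ordinal.transform(X_toy_cat))  # one integer code per category

onehot = OneHotEncoder().fit(X_toy_cat)
# OneHotEncoder returns a sparse matrix by default; densify it for display.
print(onehot.transform(X_toy_cat).toarray())  # one 0/1 column per category
```

The ordinal codes imply an order between categories, while the one-hot columns do not; tree-based models such as random forests usually cope well with either.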

### Accessing and updating parameters of a Pipeline

A `Pipeline` is yet another scikit-learn estimator. It therefore has the methods `set_params` and `get_params`.

In [None]:
model.get_params()

In [None]:
model.set_params(classifier__n_estimators=666)
model.get_params()
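
The naming convention used by `set_params` is `<step name>__<parameter name>`, with a double underscore. The sketch below rebuilds a small pipeline (the name `toy_pipe` is introduced for this example only) to show it on two steps at once.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

toy_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("classifier", RandomForestClassifier(n_estimators=100)),
])

# Parameters of a step are addressed as "<step name>__<parameter name>".
toy_pipe.set_params(imputer__strategy="median", classifier__n_estimators=10)

print(toy_pipe.get_params()["imputer__strategy"])
print(toy_pipe.named_steps["classifier"].n_estimators)

# A whole step can be replaced by using its name alone.
toy_pipe.set_params(imputer=SimpleImputer(strategy="most_frequent"))
print(toy_pipe.named_steps["imputer"].strategy)
```

The same naming scheme is what a parameter grid uses when a pipeline is tuned with a grid search.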

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    There are many other ways to encode categorical variables.
    <ul>
        <li>Write your own categorical encoder CountEncoder. The idea is to replace categorical variables with their count in the train set.</li>
        <li>Change the ordinal encoder in the pipeline above with an instance of your CountEncoder.</li>
    </ul>
</div>

Your class will need to inherit from `BaseEstimator` and `TransformerMixin`, both found in the `sklearn.base` submodule.
You will use the class `Counter` from the collections module in the standard library.
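
As a reminder of how `Counter` behaves (this is not the solution to the exercise), here is a tiny sketch on made-up values:

```python
from collections import Counter

# Made-up values standing in for one categorical column.
values = ["S", "C", "S", "Q", "S"]
counts = Counter(values)

print(counts["S"])
print(counts["unseen"])  # a missing key returns 0 instead of raising KeyError
```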

You will test your code on this toy example:

```python
>>> import numpy as np
>>> X = np.array([
...    [0, 2],
...    [1, 3],
...    [1, 1],
...    [1, 1],
... ])
>>> ce = CountEncoder()
>>> print(ce.fit_transform(X))
[[1 1]
 [3 1]
 [3 2]
 [3 2]]
```

Solution is in `solutions/01-count_encoder.py`.

# 3 Combining both categorical and numerical data in the pipeline <a class="anchor" id="combining"></a> [↑](#Table-of-contents)

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
    <li>Try to combine the numerical and categorical pipelines into a single <tt>ColumnTransformer</tt></li>
        <li>Fit a <tt>RandomForestClassifier</tt> on the output of this feature engineering. How does the test score evolve?</li>
    </ul>
</div>

Solution is in `solutions/01b-full_column_transformer.py`

# 4 From one split to cross-validation <a class="anchor" id="crossvalidation"></a> [↑](#Table-of-contents)

Cross-validation (CV) objects are parametrized to split the data into multiple train/test splits.

A splitter should implement a `split` method.

Given a `model`, some data `X, y` and a splitter one can fit and score on
all requested data splits. Functions to do this are `cross_val_score`
(historical way) and `cross_validate` (more modern way).
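
The difference between the two functions is easiest to see on their return values. The sketch below uses synthetic data and a `LogisticRegression`, both chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, cross_validate

# Synthetic data, made up for illustration.
X_syn, y_syn = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression()
splitter = KFold(n_splits=4, shuffle=True, random_state=0)

# cross_val_score returns only an array with one test score per split.
scores = cross_val_score(clf, X_syn, y_syn, cv=splitter)
print(scores)

# cross_validate returns a dict that also holds fit and score times.
results = cross_validate(clf, X_syn, y_syn, cv=splitter)
print(sorted(results))
```

`cross_validate` can also return training scores and the fitted estimators through its `return_train_score` and `return_estimator` options.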

Let's first define a model:

In [None]:
from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ordinal_encoder", OrdinalEncoder())
])

categorical_preprocessing = ColumnTransformer([
    ("categorical_preproc", cat_pipeline, cat_cols)
])

model = Pipeline([
    ("categorical_preproc", categorical_preprocessing),
    ("classifier", RandomForestClassifier(n_estimators=100))
])
model

Let's now use a default 5-fold CV. You will see that the `test_score` values vary a lot from one fold to another.

In [None]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, X_df, y, cv=5)
pd.DataFrame(cv_results)

In [None]:
pd.DataFrame(cv_results).agg(['mean', 'std'])

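The reason is that the default CV object (here a 5-fold CV, which for a classifier is a `StratifiedKFold` without shuffling) is deterministic and takes the samples in their original order, while the distribution of "survivors" is not uniform across the dataset. See: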

In [None]:
y.astype(int).to_frame().groupby(y.index.values // 100).mean().plot()

To fix this, one needs to shuffle the data while keeping stratified folds (so that the fraction of "survivors" is the same in each fold).

In [None]:
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(model, X_df, y, cv=cv)
pd.DataFrame(cv_results)

In [None]:
pd.DataFrame(cv_results).agg(['mean', 'std'])

The variance across folds is now much smaller, which is great!
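
To see what stratification does on its own, here is a small sketch on made-up labels sorted by class. It compares the fraction of positive samples in each test fold for a plain `KFold` and for a `StratifiedKFold`, both without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up labels sorted by class: 8 zeros followed by 4 ones.
y_toy = np.array([0] * 8 + [1] * 4)
X_toy = np.zeros((12, 1))  # the features are irrelevant for the split itself

for splitter in (KFold(n_splits=4), StratifiedKFold(n_splits=4)):
    # Fraction of class 1 in each test fold.
    fractions = [y_toy[test_idx].mean()
                 for _, test_idx in splitter.split(X_toy, y_toy)]
    print(type(splitter).__name__, np.round(fractions, 2))
```

With `KFold`, some test folds contain a single class, whereas `StratifiedKFold` keeps the class proportions of the full label vector in every fold.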

Let's look at the cross-validation scheme with a pretty plot:

In [None]:
from plotting_utils import plot_cv_indices
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(9, 4))

for shuffle, ax in zip([False, True], axes):
    cv = StratifiedKFold(n_splits=5, shuffle=shuffle)
    plot_cv_indices(cv, X_df, y, ax=ax)

fig.tight_layout()

See https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html for more details and a list of the different CV objects.

### Writing your own cross-validation object

A splitter should implement the `split` and `get_n_splits` methods. The `split` method should return an iterable of `(train_indices, test_indices)` tuples, and `get_n_splits` an integer corresponding to the number of splits/folds. If you know about `yield` and Python generators, you can use them.
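
As a sketch of this interface (and not the solution to the exercise below), here is a hypothetical splitter, `HalfSwapSplit`, that trains on one half of the samples and tests on the other, then swaps the two halves. The synthetic data and the classifier are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate


class HalfSwapSplit:
    """Hypothetical splitter: train on one half, test on the other, then swap."""

    def get_n_splits(self, X=None, y=None, groups=None):
        return 2

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        half = len(X) // 2
        first, second = indices[:half], indices[half:]
        yield first, second  # (train indices, test indices)
        yield second, first


X_syn, y_syn = make_classification(n_samples=40, random_state=0)
results = cross_validate(LogisticRegression(), X_syn, y_syn, cv=HalfSwapSplit())
print(len(results["test_score"]))
```

Because `split` is a generator, the index arrays are produced lazily, one split at a time.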

Let's first see what we get with the `cv` object above.

In [None]:
cv.get_n_splits()

In [None]:
for train_idx, test_idx in cv.split(X_df, y):
    print(train_idx[:5], test_idx[:5])
    print(f"N. samples train: {len(train_idx)}  -- N. samples test: {len(test_idx)}")

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
        <li>Imagine that the index of <code>y</code> gives you some provenance about the sample (e.g. which cohort of subjects in a clinical study). Write a splitter that allows you to test the performance of a model on a left-out cohort. In other words, you will do as many splits as the number of unique values in <code>y.index.values</code>, and predict on each left-out cohort. To simulate this, we will modify the index of the variable <code>y</code>, just for educational purposes.
        </li>
    </ul>
</div>

Solution is in `solutions/01c-splitter.py`

In [None]:
y_with_provenance = y.copy()
y_with_provenance.index = y_with_provenance.index.values // 200  # blocks of 200 consecutive samples mimic cohorts
n_splits = y_with_provenance.index.nunique()
print(n_splits)

### When you're done with this notebook you can do the assignments on scikit-learn.