# Scikit Learn (`sklearn`)

- Henry Webel at [NNF CPR](https://www.cpr.ku.dk/staff/rasmussen-group/?pure=en/persons/662319)
- Python Tsunami 2020 at [SUND](https://healthsciences.ku.dk/)
- Session: `Day 2, 13:00-17:00` (Track 2)
  - Pre-requisites: Python Intro, NumPy, minimal Pandas, matplotlib
  
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pythontsunami/teaching/blob/sklearn/sklearn_intro.ipynb)

### Saving the notebook in Drive
Save a copy in your drive if you want to save your changes: `File` -> `Save a copy in Drive`


![Save Colab Notebook in Google Drive](figures/colab_save_in_drive.png)

or 

![Save Colab Notebook in Google Drive](figures/colab_save_in_drive_2.png)


**Table of Contents in Colab**
> Allows easier navigation

![Table of content in Colab](figures/colab_toc.png)

## Contents

1. Scikit-learn API introduction.
2. If needed: Machine Learning
3. Use-Case with different objects from scikit-learn
    - this includes some exercises
4. Further material

## Scikit-learn

Library of algorithms for Data Science with a unified interface.

This notebook is based on the available [tutorials](https://scikit-learn.org/stable/tutorial/index.html), which are interesting to read but unfortunately not based on executable notebooks.

### Resources


- [Glossary](https://scikit-learn.org/stable/glossary.html#glossary)
- [examples](https://github.com/scikit-learn/scikit-learn/tree/master/examples)
- [API design for machine learning software: experiences from the scikit-learn project](https://arxiv.org/abs/1309.0238)
- [Géron, Aurélien (2019): Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed., Ch. 1-9](https://github.com/ageron/handson-ml2)

## Scikit Learn API main principles
> Géron (2019): 64f. and [scikit-learn-paper](https://arxiv.org/abs/1309.0238)

First some theory and names

#### Consistency
- `Estimators`: Interface for building and fitting models
    - `fit` method returns fitted models
    - supervised: `fit(X_train, y_train)`
    - unsupervised: `fit(X_train)`
    - factory to produce model objects
- `Predictors(Estimator)`: Interface for making predictions
    - `fit`, `predict` and `score`
    - supervised and unsupervised: `predict(X_test)`
    - performance assessment: `score` (the higher, the better)
    - clustering: `fit_predict` exists
    - extends `Estimator`
- `Transformers(Estimator)`: Interface for converting data
    - `fit`, `transform`, and `fit_transform`
    - extends `Estimator`
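
The three interfaces above can be sketched on toy data. This is a minimal illustration, assuming synthetic random features (not the data used later in this notebook); all variable names are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))                   # toy features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0   # noise-free linear target

# Transformer: fit learns mean and std, transform applies them
scaler = StandardScaler()
X_scaled = scaler.fit(X_demo).transform(X_demo)      # same result as fit_transform

# Predictor (supervised): fit, predict and score
model = LinearRegression().fit(X_scaled, y_demo)
y_pred = model.predict(X_scaled)
r2 = model.score(X_scaled, y_demo)                   # R^2 for regressors

# Predictor (unsupervised): clustering estimators offer fit_predict
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```

Note that `transform` returns a new representation of the features, while `predict` returns values in the space of the target (or cluster labels).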

    
> Can a transformer also be a predictor? What is the difference between `transform` and `predict`?

#### Composition  
- `Pipeline` objects are built from a sequence of `Transformers` and, optionally, a final `Predictor`
- `FeatureUnion` objects run two or more `Transformers` or `Pipeline`s in parallel, yielding concatenated outputs.
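
A small sketch of both composition objects, assuming synthetic data and an arbitrary choice of steps (`StandardScaler` and `PCA` in parallel, followed by a linear model). The step names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1]

# two transformers run in parallel; outputs are concatenated column-wise
union = FeatureUnion([
    ("scaled", StandardScaler()),    # contributes 4 columns
    ("pca", PCA(n_components=2)),    # contributes 2 columns
])

# transformers first, a predictor as the final step
pipe = Pipeline([
    ("features", union),
    ("regressor", LinearRegression()),
])
pipe.fit(X_demo, y_demo)

n_features = pipe.named_steps["features"].transform(X_demo).shape[1]
score = pipe.score(X_demo, y_demo)
```

The fitted pipeline behaves like a single predictor: `fit`, `predict` and `score` pass the data through all steps in order.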

#### Inspection
- learned attributes have an underscore suffix `_`, e.g. `coef_` or `best_estimator_`
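
As a quick illustration, assuming a tiny hand-made data set: attributes with a trailing underscore only exist after `fit` has been called.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * X_demo.ravel() + 1.0

model = LinearRegression()
has_coef_before = hasattr(model, "coef_")   # learned attributes are missing before fit
model.fit(X_demo, y_demo)

print(model.coef_, model.intercept_)        # learned slope and intercept
```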

#### Sensible defaults
 - get your first models running quickly
 - sensible defaults for construction of `Estimators`

> Side Note: "A _hyperparameter_ is a parameter of a learning algorithm (not of the model).   
> As such, it is not affected by the learning algorithm itself;   
> it must be set prior to training and remains constant during training." (Géron 2019: 29)  
> Constructor parameters of scikit-learn objects are hyperparameters
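
Constructor parameters can be inspected and changed through `get_params` and `set_params`. A minimal sketch, using `DecisionTreeRegressor` only as an example estimator:

```python
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_depth=3, random_state=0)

params = tree.get_params()      # dict of all constructor (hyper)parameters
print(params["max_depth"])

tree.set_params(max_depth=5)    # change a hyperparameter before (re)fitting
```

Search tools such as `GridSearchCV` rely on this mechanism to try different hyperparameter values.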

### Website

In [None]:
from IPython.display import HTML, IFrame, display

# extra keyword arguments of IFrame are appended to the URL as query parameters, so none are passed
display(IFrame(src='https://scikit-learn.org', width=1024, height=1024))

## User Guide

Some parts of the [User Guide](https://scikit-learn.org/stable/user_guide.html) will be discussed.

> The User Guide is an overall reference which can be followed in different orders.

- [Different estimators](https://scikit-learn.org/stable/supervised_learning.html)
- [preprocessing data](https://scikit-learn.org/stable/data_transforms.html): `sklearn.impute`, `sklearn.preprocessing`
- [model selection (incl. metrics)](https://scikit-learn.org/stable/model_selection.html): `sklearn.model_selection`
- [Pipeline](https://scikit-learn.org/stable/data_transforms.html): `sklearn.pipeline`

### Some things to look at

In [None]:
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.compose import ColumnTransformer

from sklearn.metrics import mean_squared_error

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Pipeline?

In [None]:
# import sklearn.base
# sklearn.base??

## Example: CustomTransformer
> scikit-learn is based on duck-typing, although we inherit some additional features for the interface from base classes.

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    """Don't use this. This is an example."""
    def __init__(self, my_bias=0):  # no *args or **kwargs
        """Add a bias/intercept column."""
        self.my_bias = my_bias
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        # append a constant column holding my_bias to X
        return np.c_[X, np.array([self.my_bias] * len(X))]

In [None]:
import pandas as pd
import numpy as np
np.random.seed(42)

In [None]:
X = pd.DataFrame(range(10))
custom_transformer = CustomTransformer(my_bias=10)
custom_transformer.transform(X)

> Scikit-learn uses the underlying NumPy array of a DataFrame.

In [None]:
X.values

## Machine Learning Tutorial

1. Supervised
    1. Regression
    2. Classification
2. Unsupervised

Adapt Machine Learning Tutorial

### Classification vs Regression

What is the difference?

### Unsupervised

## Case Study: Age-prediction
> Thanks to [Sam Bradley](https://www.dtu.dk/english/service/phonebook/person?id=145074&cpid=266426&tab=0)
> for telling me, and to [Denis Shepelin](https://www.dtu.dk/english/service/phonebook/person?id=126180&tab=2&qt=dtupublicationquery)
> for telling him. That is where I stopped tracking :)

A paper presenting age predictions based on RNA measurements made its data available:
- [paper](https://www.sciencedirect.com/science/article/pii/S1872497317301643)
- [data](https://zenodo.org/record/2545213/#.X43R0dAzb-g)

> It's a set of features and labels.  
> For first predictions you do not need to understand the biology,  
> but to explain _odd_ things, more knowledge is usually helpful.

### Feel free to re-implement your own paper of interest 

> If you are interested in a paper which you have the data for, go on and try to adapt the following code.

In [None]:
url_train_data = "https://zenodo.org/record/2545213/files/train_rows.csv"
url_test_data = "https://zenodo.org/record/2545213/files/test_rows_labels.csv"

# additional data not used for now
url_train_normal = "https://zenodo.org/record/2545213/files/training_data_normal.tsv"
url_test_data_wo_labels = "https://zenodo.org/record/2545213/files/test_rows.csv"

In [None]:
train_data = pd.read_csv(url_train_data, sep='\t')
train_data

In [None]:
test_data = pd.read_table(url_test_data)
test_data

In [None]:
# train_normal = pd.read_csv(url_train_normal, sep='\t')
# train_normal

In [None]:
# test_data_wo_label = pd.read_table(url_test_data_wo_labels) # tab-separated data is often stored in tsv format
# test_data_wo_label

In [None]:
TARGET_COLUMN = 'Age'

y_train = train_data[TARGET_COLUMN]
y_test  = test_data[TARGET_COLUMN] # pop() if you want to modify test_data inplace
y_test

In [None]:
test_data

In [None]:
X_test  = test_data.drop(TARGET_COLUMN, axis=1)
X_train = train_data.drop(TARGET_COLUMN, axis=1)
X_train

In [None]:
_df = X_train  # from here it's easy to write a function that displays what you are interested in
n_na = _df.isna().sum().sum()
print(f"Found # NAs: {n_na}")
if n_na:
    row_with_nas = _df.isna().any(axis=1)
    display(_df.loc[row_with_nas])

In [None]:
_ = X_train.hist(figsize=(15,15), sharex=True, sharey=True)
# _ = X_test.hist(figsize=(15,15), sharex=True, sharey=True)

### A first model

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg = lin_reg.fit(X_train, y_train) 

> `fit` returns the estimator itself, so the unfitted "factory" is replaced by the fitted model. Calling `fit` again first discards the previously fitted parameters.
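
This behaviour can be checked on a tiny example. The sketch below uses hand-made data and separate variable names, so it does not touch `lin_reg` from above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_a = np.array([[0.0], [1.0], [2.0]])

lin = LinearRegression()
fitted = lin.fit(X_a, 2.0 * X_a.ravel())
same_object = fitted is lin          # fit returns the estimator itself
slope_first = lin.coef_[0]

lin.fit(X_a, -3.0 * X_a.ravel())     # refitting overwrites the learned parameters
slope_second = lin.coef_[0]
```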

In [None]:
y_test_pred = lin_reg.predict(X_test)
y_test_pred 

In [None]:
y_test

In [None]:
from sklearn.metrics import mean_squared_error

lin_mse = mean_squared_error(y_true=y_test, y_pred=y_test_pred)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

## Exercise: Replace the model and see if this improves your results.

1. Select a different [model](https://scikit-learn.org/stable/supervised_learning.html)
2. Copy the first code block of the section above (model creation and fitting) below and adapt only that block

## Exercise: Model Selection

1. On the combined data set, split the data into a balanced train and test data set of 80/20 (i.e. 80% of the data goes into the training data set). 
2. Perform cross-validation
3. 

#### Hints
- `from sklearn.model_selection import StratifiedShuffleSplit` and [model-selection tutorial](https://scikit-learn.org/stable/model_selection.html)

> The aim is to get you started reading the documentation and understand the function signatures  
> while you are able to ask as many questions as you like:)
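
As a starting point, here is a hedged sketch of a stratified 80/20 split. It assumes a synthetic toy table with an `Age` column and arbitrary bin edges; because `StratifiedShuffleSplit` needs discrete classes, the continuous target is binned first.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=200),
    "Age": rng.integers(18, 90, size=200),
})

# bin the continuous target; the bin edges are an arbitrary choice
age_bins = pd.cut(toy["Age"], bins=[0, 30, 50, 70, 120], labels=False)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, age_bins))
toy_train, toy_test = toy.iloc[train_idx], toy.iloc[test_idx]

# share of each age bin in both parts; these should be close to each other
train_share = age_bins.iloc[train_idx].value_counts(normalize=True).sort_index()
test_share = age_bins.iloc[test_idx].value_counts(normalize=True).sort_index()
print(pd.concat({"train": train_share, "test": test_share}, axis=1))
```

`train_test_split(..., stratify=age_bins)` would be a shorter alternative for a single split.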

In [None]:
data = pd.concat([train_data, test_data])
old_index = pd.Series(data.index)
data.index = old_index.index
data

In [None]:
X = data.drop(TARGET_COLUMN, axis=1)
y = data[TARGET_COLUMN]

## Exercise 2: Check target variable

1. Check if the distribution of `Age` is the same in the predefined test and train set (there are several possibilities to do that)
2. How to stratify the data?
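
One possible way to compare two target distributions is sketched below, assuming synthetic ages in place of `y_train` and `y_test`. It combines summary statistics with a two-sample Kolmogorov-Smirnov test from SciPy; other approaches (histograms, Q-Q plots) work as well.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
age_train = pd.Series(rng.normal(50, 15, size=300), name="Age")  # stand-in for y_train
age_test = pd.Series(rng.normal(50, 15, size=100), name="Age")   # stand-in for y_test

# summary statistics side by side
summary = pd.concat(
    {"train": age_train.describe(), "test": age_test.describe()}, axis=1
)
print(summary)

# a small p-value would hint at different distributions
result = ks_2samp(age_train, age_test)
print(result.statistic, result.pvalue)
```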

## Example: Custom `Transformer`

Create a custom Transformer that adds the square $x^2$ of each feature $x$ to the training data.

> scikit-learn is based on duck-typing, although we inherit some additional features for the interface from base classes.

##### !!! Don't use this
To add interaction effects, please use [`sklearn.preprocessing.PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).
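
For comparison, a minimal sketch of what `PolynomialFeatures` produces for a single hand-made sample with two features:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_small)

# columns: x0, x1, x0^2, x0*x1, x1^2
print(X_poly)  # → [[2. 3. 4. 6. 9.]]
```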

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_moment=None):  # no *args or **kwargs
        self.add_moment = add_moment

    def fit(self, X, y=None):
        return self  # nothing to fit

    def transform(self, X):
        # your code here
        return X

attr_adder = CombinedAttributesAdder(add_moment=True)
extended_data = attr_adder.transform(X_train.values)

> Can you think of a better transformation?

## Exercise 3: Simple pipeline

Let's add a standardiser.

In [None]:
std_scaler = StandardScaler(copy=True, with_mean=True, with_std=True)

If you like, mask some data and add an imputer to the pipeline

In [None]:
mask_keep = np.random.random(size=X.shape) > 0.1
X.where(mask_keep) # X itself has not changed yet. Assign the result to a new reference!

In [None]:
X
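
A possible shape for such a pipeline is sketched here. It assumes a synthetic toy DataFrame instead of `X`, a median imputation strategy, and illustrative step names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# mask roughly 10% of the values; where() returns a new DataFrame
keep = rng.random(size=X_toy.shape) > 0.1
X_masked = X_toy.where(keep)

prep = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill NaNs with column medians
    ("std_scaler", StandardScaler()),               # then standardise each column
])
X_prepared = prep.fit_transform(X_masked)
```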

## Exercise 4: Combining pipelines

What if we would have an additional category?

In [None]:
# num_attribs = 
# cat_attribs = 

# num_pipeline = Pipeline([
#         ('selector', DataFrameSelector(num_attribs)),
#         ('imputer', SimpleImputer(strategy="median")),
#         ('attribs_adder', AttributesAdder()),
#         ('std_scaler', StandardScaler()),
#     ])

# cat_pipeline = Pipeline([
#         ('selector', DataFrameSelector(cat_attribs)),
#         ('cat_encoder', OneHotEncoder(sparse=False)),
#     ])

# from sklearn.pipeline import FeatureUnion

# full_pipeline = FeatureUnion(transformer_list=[
#         ("num_pipeline", num_pipeline),
#         ("cat_pipeline", cat_pipeline),
#     ])

In [None]:
# ToDo
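
An alternative to the commented-out `FeatureUnion` approach is `ColumnTransformer`, which is already imported above. The sketch below assumes a hypothetical toy table with two numeric columns and one categorical column `sex`; none of these columns come from the age data set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "marker_1": [0.1, 0.4, None, 0.9],   # contains one missing value
    "marker_2": [1.0, 2.0, 3.0, 4.0],
    "sex": ["f", "m", "m", "f"],         # hypothetical additional category
})
num_cols = ["marker_1", "marker_2"]
cat_cols = ["sex"]

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

full = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", OneHotEncoder(), cat_cols)],
    sparse_threshold=0,   # always return a dense array
)
prepared = full.fit_transform(toy)   # 2 scaled numeric + 2 one-hot columns
```

Selecting columns by name inside `ColumnTransformer` removes the need for a custom `DataFrameSelector`.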

## Cross-Validation

- meta-estimators `GridSearchCV` and `RandomizedSearchCV`
- `best_estimator_` attribute

- [Diabetes example](https://scikit-learn.org/stable/auto_examples/exercises/plot_cv_diabetes.html#sphx-glr-auto-examples-exercises-plot-cv-diabetes-py)

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
cross_validate(lin_reg, X, y=y, groups=y, cv=StratifiedKFold(5), scoring=None)

### Exercise

- Replace `scoring=None` by other metrics by reading the documentation.
- Extend this to several estimators and record the results
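
One way to approach both points is sketched below, assuming synthetic regression data and two arbitrarily chosen estimators. `scoring` accepts a list of scorer names, and the results are collected in a small DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 5))
y_toy = X_toy @ np.array([3.0, -1.0, 0.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=120)

estimators = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["neg_mean_squared_error", "r2"]

rows = {}
for name, est in estimators.items():
    res = cross_validate(est, X_toy, y_toy, cv=cv, scoring=scoring)
    rows[name] = {
        # scorers follow "higher is better", so the MSE is negated
        "rmse": np.sqrt(-res["test_neg_mean_squared_error"]).mean(),
        "r2": res["test_r2"].mean(),
    }
results = pd.DataFrame(rows).T
print(results)
```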

## Finetuning

`GridSearchCV` and `RandomizedSearchCV` on model hyperparameters.

> Side Note: "A _hyperparameter_ is a parameter of a learning algorithm (not of the model).   
> As such, it is not affected by the learning algorithm itself;   
> it must be set prior to training and remains constant during training." (Géron 2019: 29)  
> Constructor parameters of scikit-learn objects are hyperparameters
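
A minimal grid search sketch, assuming synthetic data, a `Ridge` model and an arbitrary grid for its `alpha` hyperparameter:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = X_toy @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}   # candidate hyperparameter values
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

best_alpha = search.best_params_["alpha"]
best_model = search.best_estimator_   # refitted on the full data with the best alpha
print(best_alpha, search.best_score_)
```

`RandomizedSearchCV` has the same interface but samples a fixed number of candidates (`n_iter`) instead of trying every combination.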

## Easy exercise: Image Classification 

Run [image-classification example](https://github.com/scikit-learn/scikit-learn/tree/master/examples/classification) and exchange the classifier.

## Extended exercise: automated stratified Cross-Validation 
Goals:
- Understanding documentation of [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) function
- apply stratified KFold data splitting for imbalanced data

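Stratified splitting is the default for [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) when the estimator is a classifier, the target is binary or multiclass, and `cv` is an integer or `None`.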

In [None]:
OLDER_THAN = 60
print(f"Binary (Dummy) Variable assigning 1 if someone is older than {OLDER_THAN} years old.")
y_binary = (y > OLDER_THAN).astype(int)
y_binary.value_counts()

This will result in an imbalanced classification problem, where the aim is to predict if someone is older than `OLDER_THAN`.
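
A sketch of stratified cross-validation for such a problem follows. It assumes synthetic ages and features, a hypothetical threshold of 75 years, and `LogisticRegression` as an arbitrary classifier; with the real data, `X` and `y_binary` would take the place of the toy arrays.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)
age_toy = rng.integers(18, 90, size=300)
X_toy = (age_toy[:, None] / 100.0) + rng.normal(scale=0.2, size=(300, 4))
y_toy_binary = (age_toy > 75).astype(int)   # minority class: older than 75

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# the share of the positive class is nearly identical in every test fold
fold_pos_rates = [
    y_toy_binary[test].mean() for _, test in cv.split(X_toy, y_toy_binary)
]

res = cross_validate(
    LogisticRegression(max_iter=1000), X_toy, y_toy_binary,
    cv=cv, scoring=["accuracy", "balanced_accuracy"],
)
print(fold_pos_rates)
print(res["test_accuracy"].mean(), res["test_balanced_accuracy"].mean())
```

For imbalanced classes, plain accuracy can look good even if the minority class is never predicted, which is why `balanced_accuracy` is reported next to it.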