# Scikit Learn (`sklearn`)

A collection of data science algorithms with a unified interface.

- [Homepage](https://scikit-learn.org/stable/index.html)

This notebook is based on the available [tutorials](https://scikit-learn.org/stable/tutorial/index.html), which are interesting to read but 
unfortunately not provided as executable notebooks.

## Contents

Reordered [User Guide](https://scikit-learn.org/stable/user_guide.html)

> The User Guide is an overall reference which can be followed in different orders.

- [preprocessing data](https://scikit-learn.org/stable/data_transforms.html): `sklearn.impute`, `sklearn.preprocessing`
- [model selection (incl. metrics)](https://scikit-learn.org/stable/model_selection.html): `sklearn.model_selection`
- [Pipeline](https://scikit-learn.org/stable/data_transforms.html): `sklearn.pipeline`

In [None]:
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.compose import ColumnTransformer

from sklearn.metrics import mean_squared_error

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

### Resources

- [Glossary](https://scikit-learn.org/stable/glossary.html#glossary)
- [examples](https://github.com/scikit-learn/scikit-learn/tree/master/examples)
- [API design for machine learning software: experiences from the scikit-learn project](https://arxiv.org/abs/1309.0238)
- [Géron, Aurélien (2019): Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed., Ch. 1-9](https://github.com/ageron/handson-ml2)

In [None]:
# import sklearn.base
# sklearn.base??

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    """Don't use this. This is an example."""
    def __init__(self, my_bias=0):  # no *args or **kwargs
        """Add a bias/intercept column."""
        self.my_bias = my_bias  # attribute name must match the argument name
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        # append a constant column holding my_bias to X
        return np.c_[X, np.full(len(X), self.my_bias)]

## Scikit Learn API main principles
> Géron (2019): 64f. and [scikit-learn-paper](https://arxiv.org/abs/1309.0238)


#### Consistency
- `Estimators`: Interface for building and fitting models
    - `fit` method returns fitted models
    - supervised: `fit(X_train, y_train)`
    - unsupervised: `fit(X_train)`
    - factory to produce model objects
- `Predictors(Estimator)`: Interface for making predictions
    - `fit`, `predict` and `score`
    - supervised and unsupervised: `predict(X_test)`
    - performance assessment: `score` (the higher, the better)
    - clustering: `fit_predict` exists
    - extends `Estimator`
- `Transformers(Estimator)`: Interface for converting data
    - `fit`, `transform`, and `fit_transform`
    - extends `Estimator`
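
The three interfaces above can be sketched on toy data. This is a minimal illustration, not part of the original tutorial; the data is made up (a noise-free line `y = 2 * x + 1`), and `StandardScaler` and `LinearRegression` are chosen only as familiar examples of a transformer and a predictor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): y = 2 * x + 1 without noise
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2 * X_train.ravel() + 1
X_test = np.array([[4.0], [5.0]])
y_test = 2 * X_test.ravel() + 1

# Transformer: fit learns mean and scale, transform applies them
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # same result as fit(X).transform(X)
X_test_scaled = scaler.transform(X_test)        # reuse the statistics of the training set

# Predictor: fit returns the estimator itself, so calls can be chained
model = LinearRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
r2 = model.score(X_test_scaled, y_test)  # R^2 for regressors, the higher the better
```

Note that the scaler is fitted on the training data only and then applied unchanged to the test data.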

    
> Can a transformer also be a predictor? And what is the difference between `transform` and `predict`?
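
One way to explore this question is an estimator that offers both methods. `KMeans` is such a case: `predict` returns a cluster label per sample, while `transform` returns new features (the distance to each cluster centre). The sketch below uses made-up points and is meant as an illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of points (made up for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = kmeans.predict(X)       # one cluster index per sample, shape (4,)
distances = kmeans.transform(X)  # distance to each cluster centre, shape (4, 2)
```

Here `predict` answers "which cluster?", whereas `transform` produces a new representation of the data that a later step could consume.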

#### Composition  
- `Pipeline` objects are built from a sequence of `Transformers` and optionally a final `Predictor`
- `FeatureUnion` objects run two or more `Pipeline`s in parallel, yielding concatenated outputs.
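
A minimal sketch of both composition tools, assuming random toy data and arbitrarily chosen steps (`StandardScaler`, `PCA`, `LinearRegression`). For brevity the `FeatureUnion` combines plain transformers here rather than full `Pipeline`s; the step names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])  # target is a linear function of X

# Two transformers run in parallel; their outputs are concatenated column-wise
union = FeatureUnion([
    ("scaled", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

# Sequence of steps: the union as transformer, then a final predictor
pipe = Pipeline([
    ("features", union),
    ("regressor", LinearRegression()),
])
pipe.fit(X, y)
y_pred = pipe.predict(X)

n_features_out = union.fit_transform(X).shape[1]  # 3 scaled columns + 2 components
```

The fitted pipeline behaves like a single predictor, so it can be passed to cross-validation or grid search as one object.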

#### Inspection
- learned attributes (e.g. `coef_`) have an underscore suffix `_`
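
As a small illustration (toy data, `LinearRegression` picked as an example), attributes with a trailing underscore only exist after `fit` has been called:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])  # y = 2 * x + 1

model = LinearRegression()
has_coef_before_fit = hasattr(model, "coef_")  # learned attributes are missing before fit

model.fit(X, y)
slope, intercept = model.coef_[0], model.intercept_  # available after fit
```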

#### Sensible defaults
 - get your first models running quickly
 - sensible defaults for construction of `Estimators`

> Side Note: "A _hyperparameter_ is a parameter of a learning algorithm (not of the model).   
> As such, it is not affected by the learning algorithm itself;   
> it must be set prior to training and remains constant during training." (Géron 2019: 29)  
> Constructor parameters of scikit-learn objects are hyperparameters
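
A short sketch of how constructor parameters can be read and changed without training anything, using `RandomForestRegressor` as an arbitrary example:

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters are set in the constructor; everything else keeps its default
forest = RandomForestRegressor(n_estimators=10, random_state=0)
params = forest.get_params()        # dict of all constructor parameters

forest.set_params(n_estimators=20)  # change a hyperparameter before training
```

`get_params` and `set_params` are what tools like `GridSearchCV` rely on to try different settings.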

### Website

In [None]:
from IPython.display import IFrame, display

# extra keyword arguments to IFrame are appended to the URL as query parameters, so none are passed
display(IFrame(src='https://scikit-learn.org', width=1024, height=1024))

## Adapted Machine Learning Tutorial

## Classifiers
> We won't discuss what each classifier does, although this is important to know 
> in order to trust and assess the model predictions.
- general design of classifiers
- 

## Cross-Validation example

- meta-estimators `GridSearchCV` and `RandomizedSearchCV`
- `best_estimator_` attribute

- [Diabetes example](https://scikit-learn.org/stable/auto_examples/exercises/plot_cv_diabetes.html#sphx-glr-auto-examples-exercises-plot-cv-diabetes-py)
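
A minimal sketch of a grid search on the diabetes data, loosely following the linked example. The `Lasso` model and the candidate values for `alpha` are assumptions made for illustration, not a recommendation.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Candidate values for the regularisation strength (chosen arbitrarily here)
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}

# The meta-estimator wraps a model and evaluates every candidate with 5-fold CV
search = GridSearchCV(Lasso(max_iter=10_000), param_grid, cv=5)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_model = search.best_estimator_  # refitted on all data with the best alpha
```

`RandomizedSearchCV` has the same interface but samples a fixed number of candidates instead of trying all of them.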

## Easy exercise: Image Classification 

Run [image-classification example](https://github.com/scikit-learn/scikit-learn/tree/master/examples/classification) and exchange the classifier.
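
As a starting point, exchanging the classifier could look like the sketch below. It assumes the bundled digits dataset and compares `SVC` with `RandomForestClassifier`; both the choice of classifiers and their settings are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X = digits.images.reshape(len(digits.images), -1)  # flatten each 8x8 image to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.5, shuffle=False)

# All classifiers share fit and score, so exchanging one is a one-line change
scores = {}
for name, clf in [("svc", SVC(gamma=0.001)),
                  ("forest", RandomForestClassifier(random_state=0))]:
    scores[name] = clf.fit(X_train, y_train).score(X_test, y_test)  # accuracy
```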

## Extended exercise: automated stratified Cross-Validation 
Goals:
- Understanding documentation of [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) function
- apply stratified KFold data splitting for imbalanced data

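Stratified splitting is the default for [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) when the estimator is a classifier, the target is binary or multiclass, and `cv` is an integer or `None`.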
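
To see what stratification does on imbalanced data, here is a self-contained sketch. The toy data (90 vs. 10 samples), the `LogisticRegression` classifier and the two scoring names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)
# Imbalanced toy data: 90 samples of class 0, 10 samples of class 1
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)
X[y == 1] += 3.0  # shift the minority class so it is learnable

cv = StratifiedKFold(n_splits=5)
# Class counts per test fold; stratification keeps the 9:1 ratio in every fold
fold_counts = [np.bincount(y[test_idx]) for _, test_idx in cv.split(X, y)]

results = cross_validate(LogisticRegression(), X, y, cv=cv,
                         scoring=["accuracy", "balanced_accuracy"])
```

`results` is a dict with one array per metric (keys such as `test_accuracy`), holding one value per fold.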

In [None]:
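from sklearn.model_selection import StratifiedKFold, cross_validate

# clf_dict, X, target and scoring are expected to be defined beforehand
cv_results = {}
for key, clf in clf_dict.items(): 
    # StratifiedKFold stratifies on y; it ignores `groups`, so none are passed
    cv_results[key] = cross_validate(clf, X, y=target, cv=StratifiedKFold(5), scoring=scoring)
    cv_results[key]['num_feat'] = X.shape[-1]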

## Case Study: Age-prediction
> Thanks to [Sam Bradley](https://www.dtu.dk/english/service/phonebook/person?id=145074&cpid=266426&tab=0)
> for telling me, and to [Denis Shepelin](https://www.dtu.dk/english/service/phonebook/person?id=126180&tab=2&qt=dtupublicationquery)
> for telling him. That is where I stop the tracking :) 

The authors of a paper presenting age predictions based on RNA measurements uploaded their data:
- [paper](https://www.sciencedirect.com/science/article/pii/S1872497317301643)
- [data](https://zenodo.org/record/2545213/#.X43R0dAzb-g)

> It's a set of features and labels.  
> For first predictions you do not need to understand the biology,  
> but to explain _odd_ things, more knowledge usually helps.

### Feel free to re-implement your own paper of interest 

> If you are interested in a paper for which you have the data, go and try that instead.

In [None]:
url_train_data = "https://zenodo.org/record/2545213/files/train_rows.csv"
url_train_normal = "https://zenodo.org/record/2545213/files/training_data_normal.tsv"

url_test_data = "https://zenodo.org/record/2545213/files/test_rows.csv"

url_test_labels = "https://zenodo.org/record/2545213/files/test_rows_labels.csv"
