# Table of Contents
- The *data leakage* pitfall
- The *inconsistent preprocessing* pitfall
- scikit-learn pipelines: Chaining estimators
    - Building and using a pipeline
    - Accessing the steps of a pipeline
    - Pipeline and cross-validation
    - Imbalanced classification and cross-validation
- Classifiers comparison
    - Wilcoxon signed-rank test


The **data mining workflow** involves several steps, including (but not limited to) feature selection, normalization, sampling and classifier training.

The need to combine these steps in a program paves the way to possible **methodological errors**; the most common ones are **data leakage** and **inconsistent preprocessing**.

## The *Data Leakage* pitfall ([docs](https://scikit-learn.org/stable/common_pitfalls.html#data-leakage))

Data leakage occurs when information that would not be available at prediction time is used when building the model. 

This results in overly optimistic performance estimates, for example from cross-validation, and thus poorer performance when the model is used on actually novel data, for example during production. 

A common cause is not keeping the test and train data subsets separate. **Test data should never be used to make choices about the model**. The general rule is to never call `fit` on the test data.
Note that the same considerations apply for cross-validation, which can be viewed as an iteration of train-test splits.
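
To make the cross-validation remark concrete, here is a minimal sketch (not the only way to do it) that treats each fold as its own train-test split and fits the scaler on the training part only. The choice of 5 folds, the seed and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # illustrative settings

fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    # Each fold is a train-test split: fit the scaler on the training part only
    scaler = StandardScaler().fit(X[train_idx])
    clf = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    # The test part is only transformed and predicted, never fitted
    y_pred = clf.predict(scaler.transform(X[test_idx]))
    fold_scores.append(accuracy_score(y[test_idx], y_pred))

print(np.mean(fold_scores))
```

Writing the loop by hand is error-prone, which is one reason to prefer a pipeline passed to a cross-validation function.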

## The *Inconsistent Preprocessing* pitfall ([docs](https://scikit-learn.org/stable/common_pitfalls.html#inconsistent-preprocessing))

Both train and test data subsets should receive the same preprocessing transformation: if data transformation operations (normalization, unsupervised dimensionality reduction, feature extraction, ...)  are used when training a model, they also must be used on subsequent datasets, whether it’s test data or data in a production system. Otherwise, the feature space will change, and the model will not be able to perform effectively. 

It is important that these transformations are only learnt from the training data. For example, if you have a normalization step where you divide by the average value, the average should be the average of the train subset, not the average of all the data. If the test subset is included in the average calculation, information from the test subset is influencing the model.
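
The divide-by-the-average example can be sketched with NumPy alone. The data below is synthetic and the 75/25 split is an arbitrary assumption; the point is only which values contribute to the average.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=2.0, size=100)  # synthetic feature, illustration only
train, test = values[:75], values[75:]

train_mean = train.mean()        # statistic learnt from the train subset only
train_norm = train / train_mean
test_norm = test / train_mean    # the same train statistic is reused on the test subset

leaky_mean = values.mean()       # wrong: the test subset contributes to this average
print(train_mean, leaky_mean)
```

The two averages differ, so normalizing with `leaky_mean` would let the test subset influence the features the model is trained on.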



In [None]:
from sklearn import datasets
breast = datasets.load_breast_cancer()
X, y = breast.data, breast.target
print(breast.DESCR)

### <font color = red> Wrong #1: data leakage </font>

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [None]:
X_new = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=42)
y_pred = LogisticRegression().fit(X_train,y_train).predict(X_test)
accuracy_score(y_test, y_pred)

<font color = red> What's wrong here: parameters for scaling (mean and variance) are estimated on the whole dataset (that is, also on the test set)</font>



### <font color = red> Wrong #2: inconsistent preprocessing </font>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_new = scaler.fit_transform(X_train)

y_pred = LogisticRegression().fit(X_train_new,y_train).predict(X_test)
accuracy_score(y_test, y_pred)

<font color = red>What's wrong here: the train dataset is scaled, whereas the test dataset is not, so model performance on the test dataset is worse than expected.</font>


### <font color = green> Right</font>
- <font color = green> split your data into train and test **first**</font>
- <font color = green> learn your transformation on the training data and apply it to *all* data</font>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_new = scaler.fit_transform(X_train) # fit only on training set
X_test_new = scaler.transform(X_test) # transform also the test

y_pred = LogisticRegression().fit(X_train_new,y_train).predict(X_test_new)
accuracy_score(y_test, y_pred)

We have illustrated data leakage and inconsistent preprocessing with a standardization transformation. 

The same methodological risk applies, however, to almost all transformations in scikit-learn, including (but not limited to) feature selection, imputation (`SimpleImputer`) and `PCA`.

### <font color = red>Feature selection: Wrong (data leakage)</font>

In [None]:
from sklearn.feature_selection import SelectKBest
X_selected = SelectKBest(k=2).fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

### <font color = green> Feature selection: Right </font>

In [None]:
from sklearn.feature_selection import SelectKBest
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
selector = SelectKBest(k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
clf = LogisticRegression()
clf.fit(X_train_selected, y_train)
X_test_selected = selector.transform(X_test)
y_pred = clf.predict(X_test_selected)
accuracy_score(y_test, y_pred)

[Recap](https://scikit-learn.org/stable/common_pitfalls.html#how-to-avoid-data-leakage) - below are some tips on avoiding data leakage:

- Always split the data into train and test subsets first, particularly before any preprocessing steps.

- Never include test data when using the `fit` and `fit_transform` methods. Using all the data, e.g., `fit(X)`, can result in overly optimistic scores. Conversely, the `transform` method should be used on both train and test subsets as the same preprocessing should be applied to all the data. This can be achieved by using `fit_transform` on the train subset and `transform` on the test subset.

- The **scikit-learn pipeline** is a great way to prevent data leakage as it ensures that the appropriate method is performed on the correct data subset. The pipeline is ideal for use in cross-validation and hyper-parameter tuning functions.

## `sklearn` Pipelines: Chaining estimators

One or more transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator.

Consider the following example.

**Standardization** ➡️ **Feature Selection** ➡️ **Classification**

In [None]:
from sklearn.feature_selection import SelectKBest
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

selector = SelectKBest(k=2)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
clf = LogisticRegression()
clf.fit(X_train_selected, y_train)
        
X_test_selected = selector.transform(scaler.transform(X_test))
y_pred = clf.predict(X_test_selected)
accuracy_score(y_test, y_pred)

The code becomes trickier, and it is easier to make mistakes.

**Pipelines can be used to chain multiple estimators into one**, and serve multiple purposes:
- **Convenience and encapsulation**: you only have to call **fit and predict once on your data** to fit a whole sequence of estimators.
- Joint parameter selection: you can **grid search over parameters of all estimators** in the pipeline at once.
- Safety: pipelines help **avoid data leakage from your test data into the trained model**, by ensuring that the same samples are used to train the transformers and predictors.

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a `transform` method). The last estimator may be any type (transformer, classifier, etc.).



### Building and using a `pipeline`
The Pipeline is built using a list of **(key, value) pairs**, where 
- the key is a string containing the name you want to give this step 
- the value is an estimator object

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

estimators = [('scaling', StandardScaler()),
              ('feature-selection', SelectKBest(k=2)),
              ('clf', LogisticRegression())]
pipe = Pipeline(estimators)
pipe

The utility function `make_pipeline` is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:


```python
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), SelectKBest(k=2), LogisticRegression())
```
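
As a small check of the automatic naming, the sketch below lists the step names that `make_pipeline` generates; the variable `step_names` is introduced here only for illustration.

```python
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), SelectKBest(k=2), LogisticRegression())

# Each step is named after the lowercased class name of its estimator
step_names = [name for name, _ in pipe.steps]
print(step_names)
# → ['standardscaler', 'selectkbest', 'logisticregression']
```

Explicit names, as in `Pipeline([...])`, are usually easier to read when steps are accessed by name later on.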
   

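Calling `fit` on the pipeline is equivalent to, for each step in order:
- calling `fit` on the estimator
- calling `transform` on its input and passing the result on to the next step (the last estimator is only fitted, its input is not transformed further).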

Furthermore, the pipeline has **all the methods that the last estimator in the pipeline has**, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. 
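
The equivalence between a pipeline and the manual chain of calls can be checked directly. The following is a sketch under the same settings used elsewhere in this notebook (breast cancer data, `random_state=42`, `k=2`); the names `pipe_pred`, `manual_pred` and `same` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pipeline version: one fit, one predict
pipe = Pipeline([
    ("scaling", StandardScaler()),
    ("feature-selection", SelectKBest(k=2)),
    ("clf", LogisticRegression()),
])
pipe_pred = pipe.fit(X_train, y_train).predict(X_test)

# Manual version: fit and transform each transformer in turn, then fit the classifier
scaler = StandardScaler()
selector = SelectKBest(k=2)
clf = LogisticRegression()
X_tr = selector.fit_transform(scaler.fit_transform(X_train), y_train)
clf.fit(X_tr, y_train)
manual_pred = clf.predict(selector.transform(scaler.transform(X_test)))

# Both routes should give identical predictions
same = np.array_equal(pipe_pred, manual_pred)
print(same)
```

Because the last step is a classifier, the pipeline also exposes classifier methods such as `predict_proba` and `score`.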


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
pipe.fit(X_train,y_train)

In [None]:
y_pred = pipe.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)
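
The joint parameter selection mentioned among the purposes of a pipeline can be sketched with `GridSearchCV`. Parameters of a step are addressed as `<step name>__<parameter name>`. The step names and the candidate values for `k` and `C` below are arbitrary assumptions, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("scaling", StandardScaler()),
    ("selection", SelectKBest()),
    ("clf", LogisticRegression()),
])

# Illustrative grid: one parameter of the selector, one of the classifier
param_grid = {
    "selection__k": [2, 5, 10],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)  # every candidate is refit on the training folds only

print(search.best_params_)
print(search.score(X_test, y_test))  # held-out test set, untouched during the search
```

Since the whole pipeline is refit inside each fold, the scaler and the selector never see the validation part of that fold.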

### Accessing the steps of a `pipeline`

We can access the steps of a `pipeline` in three ways:
- through the `steps` attribute
- by index
- by name



In [None]:
pipe.steps[0]

In [None]:
pipe[0]

In [None]:
pipe['scaling']

In [None]:
pipe[0].mean_, pipe[0].var_ # mean_ and var_ are learnt by StandardScaler during fit
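
Accessing a fitted step is also a way to inspect what the pipeline learnt. The sketch below, which rebuilds and fits the pipeline so that it is self-contained, uses the `named_steps` attribute and pipeline slicing to see which features the selector kept; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42)

pipe = Pipeline([
    ("scaling", StandardScaler()),
    ("feature-selection", SelectKBest(k=2)),
    ("clf", LogisticRegression()),
]).fit(X_train, y_train)

# named_steps gives access to the fitted steps by name
selector = pipe.named_steps["feature-selection"]
mask = selector.get_support()              # boolean mask over the input features
selected_names = data.feature_names[mask]
print(selected_names)

# Slicing returns a sub-pipeline, here every step except the classifier
preprocessing = pipe[:-1]
X_test_reduced = preprocessing.transform(X_test)
print(X_test_reduced.shape)
```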

### Pipeline and cross-validation

Our pipeline (*scaling* --> *feature-selection* --> *clf*) can be evaluated on the breast cancer dataset.

In [None]:
import pandas as pd
pd.Series(y).value_counts()

In [None]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(10, shuffle = True, random_state = 21)


In [None]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, f1_score, make_scorer

results_fsel = cross_validate(pipe, # In the previous notebook we had just a classifier here
                         X,
                         y,
                         scoring = {'fscore': make_scorer(f1_score),
                                    'accuracy': make_scorer(accuracy_score)},
                         return_estimator = True,
                         cv = skf,
                         n_jobs = -1) 
results_fsel
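
The dictionary returned by `cross_validate` holds one value per fold for each metric, which can be hard to read at a glance. One possible way to summarise it, sketched here with pandas, is shown below. The scorer names `'f1'` and `'accuracy'` are scikit-learn's built-in string aliases, used in place of `make_scorer` for brevity; `per_fold` and `summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SelectKBest(k=2), LogisticRegression())
skf = StratifiedKFold(10, shuffle=True, random_state=21)

results = cross_validate(pipe, X, y, cv=skf,
                         scoring={"fscore": "f1", "accuracy": "accuracy"})

# One row per fold; keep only the test metrics and summarise them
per_fold = pd.DataFrame(results)[["test_fscore", "test_accuracy"]]
summary = per_fold.agg(["mean", "std"])
print(summary)
```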

### Imbalanced classification and cross-validation

`scikit-learn` does not natively provide resampling methods for *imbalanced learning*, but `imblearn` (imbalanced-learn) does.

The pipelines of `imblearn` are used in exactly the same way as those of scikit-learn. Samplers such as SMOTE are applied only when the pipeline is fitted, so in cross-validation the rebalancing affects the training folds and never the test folds.

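```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Template: each `...` stands for steps or arguments to fill in
pipe = Pipeline([
        ('sampling', SMOTE()),
        ... ,
        ('classification', LogisticRegression())
    ])

results = cross_val_score(pipe, X, y, ...)
```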
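
To show what resampling inside cross-validation means, the sketch below does by hand what a resampling pipeline is meant to automate. It deliberately swaps SMOTE for plain random oversampling (duplicating minority samples) so that it needs only NumPy and scikit-learn; `oversample_minority` is a hypothetical helper written for this illustration, not part of any library.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def oversample_minority(X, y, rng):
    """Hypothetical helper: duplicate random minority samples until classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    extra_idx = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    return np.vstack([X, X[extra_idx]]), np.concatenate([y, y[extra_idx]])

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # illustrative settings

scores = []
for train_idx, test_idx in skf.split(X, y):
    # Resample the training part only; the test part keeps its original class balance
    X_res, y_res = oversample_minority(X[train_idx], y[train_idx], rng)
    scaler = StandardScaler().fit(X_res)
    clf = LogisticRegression().fit(scaler.transform(X_res), y_res)
    y_pred = clf.predict(scaler.transform(X[test_idx]))
    scores.append(f1_score(y[test_idx], y_pred))

print(np.mean(scores))
```

Resampling the whole dataset before splitting would be another form of data leakage, because copies of (or points synthesised from) test samples could end up in the training folds.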

## Classifiers comparison


It is often the case that we want to compare different classifiers, or different configurations of parameters.
We can approach this problem using statistical analysis.


In this exercise we want to assess whether using feature selection is beneficial or not.

**1. Choose the metric to look at: f-score is an option (an averaging strategy must be adopted if we have multiple classes)**

**2. Evaluate the metric distributions fairly (same test splits)**
- we already have results for the *feature-selection* experiment
- we use the same `skf` object for the *no-feature-selection* experiment

In [None]:
estimators = [('scaling',StandardScaler()), ('clf', LogisticRegression())]
pipe_noFsel = Pipeline(estimators)
pipe_noFsel

In [None]:
results_nofsel = cross_validate(pipe_noFsel,
                         X,
                         y,
                         scoring = {'fscore': make_scorer(f1_score),
                                    'accuracy': make_scorer(accuracy_score)},
                         return_estimator = True,
                         cv = skf,
                         n_jobs = -1) 

In [None]:
results_fsel['test_fscore']

In [None]:
results_nofsel['test_fscore']

**3. Look at the results**

In [None]:
metrics = pd.DataFrame({'fsel':results_fsel['test_fscore'],
                        'nofsel': results_nofsel['test_fscore']})
metrics



In [None]:
from matplotlib import pyplot as plt
ax = metrics.boxplot(figsize = (3,3))
ax.set_ylabel('f-score')
plt.show()

**4. Apply a statistical test**

#### Statistical tests
- [t-test for paired samples](https://en.wikipedia.org/wiki/Student%27s_t-test#Dependent_t-test_for_paired_samples) ([scipy ref](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html))
- [Wilcoxon signed-rank test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) ([scipy ref](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html))
    - serves the same purpose as Student's *t*-test for paired samples, but does not assume that the paired differences are normally distributed
    - tests the **null hypothesis** that **two related paired samples come from the same distribution**; in particular, that the distribution of the paired differences is symmetric about zero.
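
For completeness, the paired *t*-test listed above can be run with `scipy.stats.ttest_rel`. The sketch below recomputes the per-fold f-scores with `cross_val_score` so that it is self-contained; the names `fsel`, `nofsel`, `t_stat` and `p_value` are illustrative, and with only 10 folds the normality assumption of the test is hard to verify.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(10, shuffle=True, random_state=21)  # same folds for both pipelines

pipe_fsel = make_pipeline(StandardScaler(), SelectKBest(k=2), LogisticRegression())
pipe_nofsel = make_pipeline(StandardScaler(), LogisticRegression())

fsel = cross_val_score(pipe_fsel, X, y, scoring="f1", cv=skf)
nofsel = cross_val_score(pipe_nofsel, X, y, scoring="f1", cv=skf)

# Paired test: the i-th score of each array comes from the same test fold
t_stat, p_value = ttest_rel(fsel, nofsel)
print(t_stat, p_value)
```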


In [None]:
metrics

### Wilcoxon signed-rank test

In [None]:
from scipy.stats import wilcoxon
wilcoxon(metrics.fsel, metrics.nofsel)

The ***p*-value** is the probability, under the assumption that the null hypothesis is correct, of obtaining a value of the test statistic W at least as extreme as the one actually observed.

**A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.**

In other words, given a significance level $\alpha$ (typically 0.05), we can conclude that:

- if ***p*-value $ \leq \alpha$**, we reject the null hypothesis at significance level $\alpha$: the result is said to be statistically significant. 
- if ***p*-value $ > \alpha$**, we cannot reject the null hypothesis.
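
The decision rule can be written as a tiny function. `decide` is a hypothetical helper introduced for illustration, and the p-values passed to it below are made-up inputs, not results of the experiment.

```python
def decide(p_value, alpha=0.05):
    """Hypothetical helper: apply the decision rule to a p-value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "cannot reject the null hypothesis"

# Illustrative p-values, not results from the experiment
print(decide(0.002))  # → reject the null hypothesis
print(decide(0.30))   # → cannot reject the null hypothesis
```

In this notebook it could be applied to the test result as `decide(wilcoxon(metrics.fsel, metrics.nofsel).pvalue)`.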

# <font color='blue'><ins>TASK</ins></font>
- Carry out a **classification analysis** considering the following setting.
    - Apply a 10-fold cross-validation procedure on the **Breast cancer wisconsin (diagnostic) dataset** to identify the most suitable classifier among the following
        - Logistic Regression (default params)
        - Logistic Regression (default params) after oversampling with SMOTE
    - Report and discuss the results, motivating the choice of the most suitable model
    