# Logistic Regression and Linear SVM

We will draw a couple of plots during the lecture, so we activate matplotlib to show the plots inline in the notebook.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

`scikit-learn` provides state-of-the-art machine learning algorithms. 
These algorithms, however, cannot be directly used on raw data. Raw data needs to be preprocessed beforehand. Thus, besides machine learning algorithms, `scikit-learn` provides a set of preprocessing methods. Furthermore, `scikit-learn` provides connectors for pipelining these estimators (i.e., transformer, regressor, classifier, clusterer, etc.).

In this lecture, we will present the set of `scikit-learn` functionalities for pipelining estimators, evaluating those pipelines, tuning them with hyper-parameter optimization, and creating complex preprocessing steps.

## 1. Basic use-case: train and test a classifier

For this first example, we will train and test a classifier on a dataset. We will use this example to recall the API of `scikit-learn`.

We will use the `digits` dataset which is a dataset of hand-written digits.

In [None]:
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

Each row in `X` contains the intensities of the 64 image pixels. For each sample in `X`, we get the ground-truth `y` indicating the digit written.

In [None]:
X[0]

In [None]:
plt.imshow(X[0].reshape(8, 8), cmap='gray');
plt.axis('off')
print('The digit in the image is {}'.format(y[0]))

In machine learning, we should evaluate our model by training and testing it on distinct sets of data. `train_test_split` is a utility function that splits the data into two independent sets. The `stratify` parameter ensures that the class distribution in the training and testing sets is the same as in the entire dataset.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
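As a quick check of what `stratify` does, the following sketch compares the fraction of each digit in the full dataset with the fractions in the two splits. The variable names (`X_demo`, `train_frac`, and so on) are illustrative only and are not used elsewhere in the lecture.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_digits(return_X_y=True)
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42)

# Fraction of samples per digit in the full, training, and testing sets.
full_frac = np.bincount(y_demo) / len(y_demo)
train_frac = np.bincount(y_tr) / len(y_tr)
test_frac = np.bincount(y_te) / len(y_te)

for digit in range(10):
    print('digit {}: full={:.3f} train={:.3f} test={:.3f}'.format(
        digit, full_frac[digit], train_frac[digit], test_frac[digit]))
```

With stratification, the three columns should agree up to rounding, since each class can only be split into whole samples.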

Once we have independent training and testing sets, we can learn a machine learning model using the `fit` method. We will use the `score` method to test this model, relying on the default accuracy metric.

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', C=1.0, multi_class='auto', max_iter=5000, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Accuracy score of the {} is {:.3f}'.format(clf.__class__.__name__, accuracy))

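Regularization strength increases with $\lambda$. The slides use $\lambda$, whereas `scikit-learn` uses a different parameter depending on the task:

regression: $\alpha = \lambda$

classification: $C = 1/\lambda$ (a smaller `C` means stronger regularization)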
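To see the effect of `C` in practice, here is a minimal sketch that fits `LogisticRegression` with three values of `C` and prints the norm of the learned coefficients. The values of `C` and the division by 16 (the maximum pixel intensity in `digits`, used here only to help the solver converge) are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0  # pixel intensities range from 0 to 16

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    # Stronger regularization (small C) shrinks the coefficients.
    coef_norms[C] = np.linalg.norm(model.coef_)
    print('C={:<6} ||coef||={:.2f}'.format(C, coef_norms[C]))
```

The coefficient norm should grow with `C`, which is consistent with `C` acting as the inverse of the regularization strength.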

In [None]:
clf.coef_.shape

In [None]:
clf.coef_[0]

In [None]:
clf.intercept_

The API of `scikit-learn` is consistent across classifiers. Thus, we can easily replace the `LogisticRegression` classifier with a `LinearSVC` classifier. The changes are minimal and only concern the creation of the classifier instance.

In [None]:
from sklearn.svm import LinearSVC

clf = LinearSVC(C=1.0, max_iter=500000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Accuracy score of the {} is {:.3f}'.format(clf.__class__.__name__, accuracy))

## 2. More advanced use-case: preprocess the data before training and testing a classifier

### 2.1 Standardize your data

Preprocessing might be required before learning a model. For instance, a user could be interested in creating hand-crafted features, or an algorithm might make some a priori assumptions about the data.

In our case, the solver used by `LogisticRegression` expects the data to be normalized. Thus, we need to standardize the data before training the model. To illustrate this, we will check the number of iterations required to train the model.

In [None]:
clf1 = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=5000, random_state=42)

clf1.fit(X_train, y_train)
accuracy = clf1.score(X_test, y_test)

print('Accuracy score of the {} is {:.3f}'.format(clf1.__class__.__name__, accuracy))
print('{} required {} iterations to be fitted'.format(clf1.__class__.__name__, clf1.n_iter_[0]))

clf2 = LinearSVC(max_iter=500000)

clf2.fit(X_train, y_train)
accuracy = clf2.score(X_test, y_test)

print('Accuracy score of the {} is {:.3f}'.format(clf2.__class__.__name__, accuracy))

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)

# fit_transform is equivalent to the two steps below:
# scaler.fit(X_train)  # find the min and max of each feature in the training set
# X_train_scaled = scaler.transform(X_train)  # normalize the training set (using the min and max found above)

X_test_scaled = scaler.transform(X_test)  # normalize the testing set (using the min and max found above)

clf1.fit(X_train_scaled, y_train)
accuracy = clf1.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.3f}'.format(clf1.__class__.__name__, accuracy))
print('{} required {} iterations to be fitted'.format(clf1.__class__.__name__, clf1.n_iter_[0]))

clf2.fit(X_train_scaled, y_train)
accuracy = clf2.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.3f}'.format(clf2.__class__.__name__, accuracy))

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf1.fit(X_train_scaled, y_train)
accuracy = clf1.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.3f}'.format(clf1.__class__.__name__, accuracy))
print('{} required {} iterations to be fitted'.format(clf1.__class__.__name__, clf1.n_iter_[0]))

clf2.fit(X_train_scaled, y_train)
accuracy = clf2.score(X_test_scaled, y_test)
print('Accuracy score of the {} is {:.3f}'.format(clf2.__class__.__name__, accuracy))

The `MinMaxScaler` and `StandardScaler` transformers are used to normalize the data. Other scalers include `RobustScaler` and `Normalizer`. A scaler should be applied in the following way: learn (i.e., `fit` method) the statistics on the training set, then standardize (i.e., `transform` method) both the training and testing sets. Finally, we train and test the model on the scaled datasets.

After scaling the data, the model converged much faster than with the unscaled data.

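For any feature $x$ (column-wise scaling):

`MinMaxScaler`: $(x - \min) / (\max - \min)$

`StandardScaler`: $(x - \text{mean}) / \text{standard deviation}$

`RobustScaler`: $(x - \text{median}) / (\text{75th percentile} - \text{25th percentile})$

For any observation (row-wise scaling):

`Normalizer`: rescales each sample to unit norm (the L2 norm by default)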
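The formulas above can be checked against the transformers on a small random array. This is only a sketch: `X_toy` is hypothetical data generated for the check, and the transformers are used with their default parameters.

```python
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, Normalizer, RobustScaler,
                                   StandardScaler)

rng = np.random.RandomState(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(20, 3))  # hypothetical toy data

# Column-wise (per-feature) scalers, written out by hand.
col_min, col_max = X_toy.min(axis=0), X_toy.max(axis=0)
minmax_manual = (X_toy - col_min) / (col_max - col_min)
standard_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
q25, q75 = np.percentile(X_toy, [25, 75], axis=0)
robust_manual = (X_toy - np.median(X_toy, axis=0)) / (q75 - q25)

# Row-wise (per-observation) scaler: divide each row by its L2 norm.
normalizer_manual = X_toy / np.linalg.norm(X_toy, axis=1, keepdims=True)

print(np.allclose(minmax_manual, MinMaxScaler().fit_transform(X_toy)))
print(np.allclose(standard_manual, StandardScaler().fit_transform(X_toy)))
print(np.allclose(robust_manual, RobustScaler().fit_transform(X_toy)))
print(np.allclose(normalizer_manual, Normalizer().fit_transform(X_toy)))
```

Note that `StandardScaler` divides by the population standard deviation, which matches the default of `np.std`.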

<img src="images/scaler_comparison_scatter.png">

### 2.2 The wrong preprocessing patterns

We highlighted how to preprocess and adequately train a machine learning model. It is also interesting to spot what would be the wrong way of preprocessing data. There are two potential mistakes which are easy to make but easy to spot.

The first pattern is to standardize the data before splitting the full set into training and testing sets.

In [None]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_train_prescaled, X_test_prescaled, y_train_prescaled, y_test_prescaled = train_test_split(
    X_scaled, y, stratify=y, random_state=42)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train_prescaled)
accuracy = clf.score(X_test_prescaled, y_test_prescaled)
print('Accuracy score of the {} is {:.3f}'.format(clf.__class__.__name__, accuracy))

The second pattern is to standardize the training and testing sets independently. This amounts to calling the `fit` method on both the training and testing sets. Thus, the training and testing sets are standardized differently.

In [None]:
scaler = MinMaxScaler()
X_train_prescaled = scaler.fit_transform(X_train)
X_test_prescaled = scaler.fit_transform(X_test)

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000, random_state=42)
clf.fit(X_train_prescaled, y_train)
accuracy = clf.score(X_test_prescaled, y_test)
print('Accuracy score of the {} is {:.3f}'.format(clf.__class__.__name__, accuracy))
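A tiny one-feature example makes the problem with independent scaling visible. The arrays `X_train_toy` and `X_test_toy` are hypothetical values chosen so that the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_toy = np.array([[0.0], [10.0]])  # training range: 0 to 10
X_test_toy = np.array([[5.0], [10.0]])   # testing range: 5 to 10

# Correct: reuse the min and max learned on the training set.
scaler = MinMaxScaler().fit(X_train_toy)
correct = scaler.transform(X_test_toy)

# Wrong: learn a new min and max on the testing set.
wrong = MinMaxScaler().fit_transform(X_test_toy)

print(correct.ravel())  # [0.5 1. ]
print(wrong.ravel())    # [0. 1.]
```

The raw value 5 is mapped to 0.5 with the training statistics but to 0 with the testing statistics, so the model would see the same input at two different positions.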

<img src="images/no_separate_scaling.png">

### 2.3 Keep it simple, stupid: use the pipeline connector from `scikit-learn`

The two previous patterns are data-leakage issues. However, such mistakes are difficult to prevent when the preprocessing has to be done by hand.

Thus, `scikit-learn` introduced the `Pipeline` object. It sequentially connects several transformers and a classifier (or a regressor). We can create a pipeline as:

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[('scaler', MinMaxScaler()),
                       ('clf', LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42))])

We see that this pipeline contains the parameters of both the scaler and the classifier. The general pipeline can join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification. 

Sometimes, it can be tedious to give a name to each estimator in the pipeline. `make_pipeline` will give a name automatically to each estimator which is the lower case of the class name.

In [None]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', multi_class='auto', random_state=42, max_iter=1000))
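The automatically generated names can be listed from the `steps` attribute of the pipeline. The name `demo_pipe` is used here only to avoid overwriting `pipe`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

demo_pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Each step is a (name, estimator) pair; the name is the lowercased class name.
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)  # ['minmaxscaler', 'logisticregression']
```

These names matter later, because they are the prefixes used to address the parameters of each step.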

The pipeline has an identical API. We use `fit` to train the classifier and `score` to check the accuracy. However, calling `fit` calls `fit_transform` on every transformer in the pipeline before fitting the final estimator. Calling `score` (or `predict` and `predict_proba`) internally calls `transform` on every transformer in the pipeline.

In [None]:
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print('Accuracy score of the {} is {:.3f}'.format(pipe.__class__.__name__, accuracy))
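To make the internal calls concrete, this sketch compares a pipeline with the same steps written out by hand. It assumes default parameters apart from `max_iter`, and all names (`demo_pipe`, `manual_scaler`, `manual_clf`) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42)

# Pipeline version: scaling and classification in a single object.
demo_pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
demo_pipe.fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

# Manual version of the same steps.
manual_scaler = MinMaxScaler()
manual_clf = LogisticRegression(max_iter=1000)
manual_clf.fit(manual_scaler.fit_transform(X_tr), y_tr)          # what fit does
manual_pred = manual_clf.predict(manual_scaler.transform(X_te))  # what predict does

print(np.array_equal(pipe_pred, manual_pred))
```

Both versions run the same computations, so the predictions should be identical; the pipeline simply removes the opportunity to call `fit` on the wrong data.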

We can check all the parameters of the pipeline using `get_params()`.

In [None]:
pipe.get_params()

## 3. Cross-validation

`scikit-learn` provides three functions: `cross_val_score`, `cross_val_predict`, and [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html). The latter provides more information, including fitting times and training and testing scores. It can also return multiple scores at once.

In [None]:
from sklearn.model_selection import cross_validate

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='lbfgs', multi_class='auto',
                                        max_iter=1000, random_state=42))
scores = cross_validate(pipe, X, y, cv=3, return_train_score=True)

Using the `cross_validate` function, we can quickly check the training and testing scores and make a quick plot using `pandas`.

In [None]:
import pandas as pd

df_scores = pd.DataFrame(scores)
df_scores

In [None]:
print("Mean times and scores:\n", df_scores.mean())

In [None]:
df_scores[['train_score', 'test_score']].boxplot()
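As an illustration of returning multiple scores at once, `cross_validate` accepts a list of scorer names through its `scoring` parameter. The two metrics chosen here (`accuracy` and `f1_macro`) are just an example.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_digits(return_X_y=True)
demo_pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# One test_<metric> entry is returned for each requested scorer.
scores_multi = cross_validate(demo_pipe, X_demo, y_demo, cv=3,
                              scoring=['accuracy', 'f1_macro'])
print(pd.DataFrame(scores_multi))
```

The resulting table has one row per fold and the columns `fit_time`, `score_time`, `test_accuracy`, and `test_f1_macro`.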

## 4. Hyper-parameters optimization: fine-tune the inside of a pipeline using GridSearchCV

Sometimes you would like to find the parameters of a component of the pipeline which lead to the best accuracy. We already saw that we could check the parameters of a pipeline using `get_params()`.

In [None]:
pipe.get_params()

Hyper-parameters can be optimized by an exhaustive search. [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) provides such utility and does a cross-validated grid-search over a parameter grid.

Let's give an example in which we would like to optimize the `C` and `penalty` parameters of the `LogisticRegression` classifier.

In [None]:
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto',
                                        random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)
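The keys of the parameter grid follow the pattern `<step name>__<parameter name>`, so parameters of any step can be tuned. The sketch below uses `ParameterGrid` to enumerate the candidates that a grid-search would try; the grid itself (`param_grid_demo`, including the `feature_range` values) is a hypothetical example, not the one used above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

demo_pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid touching both steps of the pipeline.
param_grid_demo = {'logisticregression__C': [0.1, 1.0, 10],
                   'minmaxscaler__feature_range': [(0, 1), (-1, 1)]}

candidates = list(ParameterGrid(param_grid_demo))
print('{} candidates'.format(len(candidates)))  # 3 x 2 = 6 candidates

# Each candidate is a dict that can be passed directly to set_params.
demo_pipe.set_params(**candidates[0])
current = demo_pipe.get_params()
print({key: current[key] for key in param_grid_demo})
```

With `cv=3`, a grid-search over these 6 candidates would fit 18 models, plus one final refit on the whole training set.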

When fitting the grid-search object, it finds the best possible parameter combination on the training set (using cross-validation). We can introspect the results of the grid-search by accessing the attribute `cv_results_`. It allows us to check the effect of the parameters on the model performance.

In [None]:
df_grid = pd.DataFrame(grid.cv_results_)
df_grid

In [None]:
param_grid

In [None]:
res = pd.pivot_table(pd.DataFrame(grid.cv_results_),
                     values='mean_test_score',
                     index='param_logisticregression__C',
                     columns='param_logisticregression__penalty')
pd.set_option("display.precision", 3)
res = res.set_index(res.index.values.round(4))

In [None]:
res

In [None]:
import seaborn as sns
sns.heatmap(res, annot=True, fmt=".3g", vmin=0.6)

By default (`refit=True`), the grid-search object also behaves as an estimator. Once the search is done, it refits the pipeline on the whole training set with the best parameters found, and calling `score` or `predict` uses this refitted model.

In [None]:
grid.best_params_

In [None]:
print("Best estimator:\n{}".format(grid.best_estimator_))

In [None]:
print("Logistic regression step:\n{}".format(
      grid.best_estimator_.named_steps["logisticregression"]))

In [None]:
print("Logistic regression coefficients:\n{}".format(
      grid.best_estimator_.named_steps["logisticregression"].coef_))

Besides, it is possible to use the grid-search like any other classifier to make predictions.

In [None]:
accuracy = grid.score(X_test, y_test)
print('Accuracy score of the {} is {:.3f}'.format(grid.__class__.__name__, accuracy))
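A small check can confirm that scoring the grid-search is the same as scoring its `best_estimator_`. This sketch uses a reduced grid over `C` only and the default solver to keep the run short; these choices are assumptions of the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42)

demo_grid = GridSearchCV(
    make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000)),
    param_grid={'logisticregression__C': [0.1, 1.0, 10]}, cv=3)
demo_grid.fit(X_tr, y_tr)

# With refit=True, the grid delegates to the refitted best pipeline.
grid_score = demo_grid.score(X_te, y_te)
best_score = demo_grid.best_estimator_.score(X_te, y_te)
print(demo_grid.best_params_, grid_score, best_score)
```

The two scores should match, since `score` on the grid-search is forwarded to the refitted pipeline.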

Up to now, we have only fitted the grid-search on a single split. However, as previously stated, we might be interested in an outer cross-validation to estimate the performance of the model on different samples of data and check the potential variation in performance. Since the grid-search is an estimator, we can use it directly within the `cross_validate` function.

In [None]:
scores = cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True)
df_scores = pd.DataFrame(scores)
df_scores
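In such a nested cross-validation, each outer fold runs its own grid-search, so the selected hyper-parameters may differ between folds. One way to inspect this, sketched here with a reduced grid over `C` only, is to pass `return_estimator=True` to `cross_validate` and read `best_params_` from each fitted grid-search.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_digits(return_X_y=True)

demo_grid = GridSearchCV(
    make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000)),
    param_grid={'logisticregression__C': [0.1, 1.0, 10]}, cv=3)

# The outer loop estimates performance; the inner loop (the grid) tunes C.
outer = cross_validate(demo_grid, X_demo, y_demo, cv=3, return_estimator=True)

for fold, fitted_grid in enumerate(outer['estimator']):
    print('fold {}: best params {} / test score {:.3f}'.format(
        fold, fitted_grid.best_params_, outer['test_score'][fold]))
```

If the best parameters change a lot from fold to fold, the tuning is unstable and a single reported value of `C` should be read with caution.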

## 5. Summary: my scikit-learn pipeline in less than 10 lines of code (skipping the import statements)

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate

pipe = make_pipeline(MinMaxScaler(),
                     LogisticRegression(solver='saga', multi_class='auto', random_state=42, max_iter=5000))
param_grid = {'logisticregression__C': [0.1, 1.0, 10],
              'logisticregression__penalty': ['l2', 'l1']}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1)
scores = pd.DataFrame(cross_validate(grid, X, y, cv=3, n_jobs=-1, return_train_score=True))
scores[['train_score', 'test_score']].boxplot()