# About me
![Flexper, innovation on demand](img/flexper_logo_large.jpg)

My name is **Tanguy Racinet** and I currently work at **Flexper**, a technological accelerator for startups.

### How to reach me
**mail**: tanguy@flexper.fr

**linkedin**: https://www.linkedin.com/in/tanguyracinet/

### Examples of projects I worked on
 * Movie rentals estimation and expected return on investment for movie streaming platform
 * Recommendation system for playlist generation in a streaming music platform 
 * Vehicle expected return and pricing over time in car renting service
 * Pollution evolution in air traffic study for the DGAC

# Topics covered in the course - what you should know about by the end
1. Generalities about Machine Learning
2. Classification algorithms
3. Recommender systems
4. Linear regression
5. Decision trees

# Course overview - what we're actually going to do
1. Day 1
    * open Q&A
    * Practice 1 - Generalities about Machine Learning
    * Practice 2 - Classification algorithms
2. Day 2
    * open Q&A
    * Practice - Recommender systems
3. Day 3
    * open Q&A
    * Practice 1 - Decision trees
    * Practice 2 - Linear regression

# Machine Learning concepts

Let's take a concrete example to better understand the concepts and definitions you will need when developing a classification engine as a Machine Learning engineer.

### Your first day at the fair
You have just arrived at the slaughterfest in super hero city, the biggest showdown of the year for powered people. You have been tasked by city officials to develop a model that can automatically tell superheroes from supervillains in order to hopefully mitigate casualties.

You quickly gather your notes on all the powered people you've heard of and assemble your very own dataset. You still have a lot to do in order to get your data ready before you can train your first model.
![dataset: super people]()

### Data splitting: Training, validation, test
You know that you should **never** evaluate your model on the data you used to train it, since an overfitted model would look perfect there while failing on new data. You therefore start by dividing your data into 2 distinct sets:
 * the *training set* to train your model
 * the *test set* to evaluate your trained model and check how well it generalises to unknown super persons.
![data splitting]()
 
After careful consideration, you end up deciding against splitting your test set in **test** and **validation** given how small your dataset is. 
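
As a rough illustration of the split described above, here is a minimal pure-Python sketch. The function name `train_test_split_indices` and its default values are made up for this example; it only shows the idea of shuffling the rows once and cutting them into two disjoint sets.

```python
import random


def train_test_split_indices(n_samples, test_size=0.3, seed=7):
    """Shuffle row indices and cut them into disjoint train and test lists.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(10, test_size=0.3)
print(len(train_idx), len(test_idx))  # 7 3
```

No row ends up in both sets, so the test score is always computed on super persons the model has never seen.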

### Cross validation
Since you plan on using k-fold cross validation to automatically split your training set anyway, you should be covered. Your training set is cut into k folds and the model is trained k times, each time on k-1 folds while the remaining fold is held out for validation. Every sample is used for both training and validation, which helps you find a model with good generalisation capabilities.
![cross validation with k-fold]()
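
To make the fold rotation concrete, here is a small sketch of how k-fold index splitting could work. `k_fold_indices` is a hypothetical helper written for this example (no shuffling, contiguous folds), not the routine used later in the course.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists, one pair per fold.

    Hypothetical helper: folds are contiguous and as equal in size as possible.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample lands in the validation fold exactly once and in the training folds k-1 times.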


### Encoding
Despite being a proficient Machine Learning engineer, you're still human and you used words instead of sweet hard numbers when creating your dataset. You decide to quickly review in your head the different encoding techniques you could use on your data:
 * Label encoding for ordered (ordinal) variables
 * One-hot encoding for independent categories with no natural order
 * Cyclical encoding for... Well, cyclical variables...
![encoding]()
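
The three techniques listed above can be sketched in a few lines of plain Python. The function names and the example values (threat levels, mask colors, hours of the day) are invented for this illustration.

```python
import math


def ordinal_encode(value, ordered_levels):
    """Label encoding: replace a level by its rank in an ordered list."""
    return ordered_levels.index(value)


def one_hot_encode(value, categories):
    """One-hot encoding: one 0/1 flag per category, no order implied."""
    return [1 if value == category else 0 for category in categories]


def cyclical_encode(value, period):
    """Cyclical encoding: map a value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


print(ordinal_encode("high", ["low", "medium", "high"]))  # 2
print(one_hot_encode("red", ["black", "red", "blue"]))    # [0, 1, 0]
print(cyclical_encode(0, 24))                             # (0.0, 1.0)
```

The point of the cyclical version is that 23:00 and 00:00 end up close to each other on the circle, while a plain label encoding would place them 23 units apart.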

### Missing value imputation
Another problem is that, because of your memory problems, some values are missing from your dataset. That black and white picture you collected isn't helping you decide what color this person's mask is. You're now stuck trying to figure out how to fill in the missing values.
![data imputation]()
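
One simple way out, sketched below under the assumption that missing entries are stored as `None`, is to fill numeric gaps with the mean of the known values and categorical gaps with the most frequent value. The `impute` helper and its strategy names are hypothetical.

```python
from statistics import mean, mode


def impute(values, strategy="mean"):
    """Replace None entries with the mean or the most frequent known value.

    Hypothetical helper for illustration only.
    """
    known = [v for v in values if v is not None]
    fill = mean(known) if strategy == "mean" else mode(known)
    return [fill if v is None else v for v in values]


print(impute([1, None, 3]))                                      # [1, 2, 3]
print(impute(["red", None, "red", "blue"], "most_frequent"))    # ['red', 'red', 'red', 'blue']
```

The fill value should be computed on the training set only and then reused on the test set, otherwise information from the test set leaks into training.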

### Confusion matrix
Now that you're done with data processing, you can finally train your model. So let's get cracking and build your confusion matrix, which counts how many samples of each true class were assigned to each predicted class, to evaluate your model's performance:
![confusion matrix with log reg curve]()
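
Here is a minimal sketch of how the four cells of a binary confusion matrix can be counted. The convention that `1` means supervillain and `0` means superhero, as well as the toy labels, are assumptions made for this example.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, prediction in zip(y_true, y_pred):
        if prediction == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


# Assumed convention: 1 = supervillain, 0 = superhero.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
print(cm)        # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
print(accuracy)  # 0.75
```

Accuracy only uses the diagonal of the matrix; the off-diagonal cells tell you which kind of mistake the model makes.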

### ROC curve
Alright, you've got a somewhat functional model. But now that you're getting close to the end, you're receiving offers from both the superheroes and the supervillains, who are all trying to outbid each other for your algorithm.
You now have the power to sell it to either side, so you'd better figure out who would profit most from it. The ROC curve is the perfect tool for you to illustrate your algorithm's constraints.
![roc curve construction with confusion matrices]()
If you intend to provide your solution to the superheroes, they are probably interested in identifying supervillains to arrest them and send them to jail. But it would be problematic if they were to send other superheroes to jail, because that would reduce their forces and, let's face it, it wouldn't be very heroic of them to condemn innocent people.

On the other hand, villains would probably not be as prudent with it and might not have an issue with taking out a few of their own, just in case they might be vigilantes.
The ROC curve will help you identify the best threshold or hyperparameters for your algorithm, depending on your problem constraints.
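
To show where the points of a ROC curve come from, the sketch below sweeps a decision threshold over the scores of a model and records the false positive rate and true positive rate at each step. The helper name, the toy scores and the convention `1` = supervillain are assumptions for this example.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr, threshold) triples, from strictest to loosest threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Start above every score so the first point is (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        flagged = [truth for truth, score in zip(y_true, scores) if score >= threshold]
        tp = sum(flagged)
        fp = len(flagged) - tp
        points.append((fp / n_neg, tp / n_pos, threshold))
    return points


# Assumed convention: 1 = supervillain, scores = predicted probability of being one.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr, threshold in points:
    print(f"threshold={threshold:.1f}  fpr={fpr:.2f}  tpr={tpr:.2f}")
```

With these toy scores, a threshold of 0.8 flags two of the three villains without accusing any hero, which is the kind of operating point the superheroes would prefer; lowering the threshold catches every villain at the cost of jailing heroes.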

### Area Under the Curve
After selling your first algorithm for a very nice profit, you now have enough funds to investigate other potential algorithms that might be even more efficient than your first attempt.
![AUC for different models]()
You compute the AUC for all the different models you trained in order to identify the ultimate model that will definitely help you make a difference in super hero city. (and/or get filthy rich, that's your story after all)
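
A handy way to think about the AUC is as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below computes it that way, by counting correctly ordered pairs, for two made-up models; the scores and model names are purely illustrative.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count for half a pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    ranked = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                ranked += 1.0
            elif p == n:
                ranked += 0.5
    return ranked / (len(positives) * len(negatives))


y_true = [1, 1, 0, 1, 0, 0]
model_a_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical model A
model_b_scores = [0.9, 0.4, 0.7, 0.6, 0.8, 0.2]  # hypothetical model B

auc_a = auc(y_true, model_a_scores)
auc_b = auc(y_true, model_b_scores)
print(round(auc_a, 3), round(auc_b, 3))  # 0.889 0.556
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, so model A would be the one to keep here.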

# Classification algorithms on Heart Diseases
Download the [data](https://canvas.supinfo.com/courses/85/files/7110/download?wrap=1) directly from [UCI]()'s website and save it under ./data of your current directory on your host. (/home/jovyan/data in the docker container)
## Importing required libraries
Always put your imports at the top: it will greatly simplify your work when turning your notebooks into actual Python code.

In [1]:
# Data manipulation:
import pandas as pd

# Data exploration:
!pip install pandas_profiling
import pandas_profiling as pp

# scikit learn ML models
# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
# cross validation
from sklearn.model_selection import GridSearchCV


## Loading the data

In [2]:
heart_disease_df = pd.read_csv('../data/HeartDiseaseUCI.csv')

## Exploring the data
### What you need to understand
A very good tutorial on data exploration: [kaggle tutorial](https://www.kaggle.com/pavansanagapati/a-simple-tutorial-on-exploratory-data-analysis/notebook)

### And the quick win version

In [3]:
directory = '../data'

# Build a statistics report of the heart disease data
data_report = pp.ProfileReport(heart_disease_df)
data_report.to_file(output_file='{}/heart_disease_report.html'.format(directory))

In [4]:
data_report.to_widgets()

## Preparing the data
We replace the varying degree of disease with a binary classification: healthy/sick

In [5]:
categorical_num_df = heart_disease_df.copy()

categorical_num_df['num'] = categorical_num_df['num'].apply(
    lambda x: 'healthy' if x == 0 else 'sick'
)
categorical_num_df.head(2)

Unnamed: 0.1,Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,1,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,healthy
1,2,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,sick


Because scikit learn models only work with numerical values, we need to replace our *healthy* and *sick* labels. There are two potential ways of turning them into numerical values:
 * Label encoding

In [6]:
label_encoded_df = categorical_num_df.copy()

le = LabelEncoder()
le.fit(label_encoded_df['num'])
label_encoded_df['num'] = le.transform(label_encoded_df['num'])

label_encoded_df.head(2)

Unnamed: 0.1,Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,1,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,2,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,1


 * One-hot encoding

In [7]:
pd.get_dummies(categorical_num_df).head(2)

Unnamed: 0.1,Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num_healthy,num_sick
0,1,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,1,0
1,2,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,0,1


### Quick win:
One-liner that we could have used directly, after understanding the data.

In [8]:
heart_disease_df['num'] = heart_disease_df['num'].apply(lambda x: 0 if x == 0 else 1)
heart_disease_df.head(2)

Unnamed: 0.1,Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,1,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,2,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,1


### Data splitting

In [9]:
X = heart_disease_df[[ 
    col for col in heart_disease_df.columns if col not in ['num', 'Unnamed: 0']
]]
y = heart_disease_df['num']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

In [10]:
display(
    [
        'X_train: {}'.format(X_train.shape), 'y_train: {}'.format(y_train.shape), 
        'X_test: {}'.format(X_test.shape), 'y_test: {}'.format(y_test.shape)
    ],
    X_train.tail(),
    y_train.tail()
)

['X_train: (212, 13)', 'y_train: (212,)', 'X_test: (91, 13)', 'y_test: (91,)']

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
211,38,1,1,120,231,0,0,182,1,3.8,2,0.0,7.0
67,54,1,3,150,232,0,2,165,0,1.6,1,0.0,7.0
25,50,0,3,120,219,0,0,158,0,1.6,2,0.0,3.0
196,69,1,1,160,234,1,2,131,0,0.1,2,1.0,3.0
175,57,1,4,152,274,0,0,88,1,1.2,2,1.0,7.0


211    1
67     0
25     0
196    0
175    1
Name: num, dtype: int64

## Using a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
Pipelines in scikit learn are a very important concept. They encapsulate a number of **transformers** and **estimators**.

**Transformers** apply a function over your data to modify it, commonly used to apply preprocessing steps over the data before training a model.

**Estimators** apply a ML model to make predictions.

These are the steps we'll take when [preprocessing our data with scikit learn](https://scikit-learn.org/stable/modules/preprocessing.html) before we can apply our model:
 * [Fill in missing values](https://scikit-learn.org/stable/modules/impute.html). We will use the [Simple Imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).
 * Scale and center the data with the [Standard scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to reduce the impact of feature ranges on Euclidean distances. 
 * We can also conduct a [Principal Component Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to reduce the number of dimensions and extract meaningful information from the dataset.
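
To make the transformer/estimator distinction more tangible, here is a toy sketch of how a pipeline chains its steps: every step but the last is fitted and applied to the data in turn, and the last one is trained on the result. The classes `MeanCenterer`, `SignClassifier` and `MiniPipeline` are invented for this illustration and work on a single numeric feature; this is not scikit learn's actual implementation.

```python
class MeanCenterer:
    """Toy transformer: learns the mean in fit, subtracts it in transform."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Toy estimator: predicts 1 for positive inputs, 0 otherwise."""

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class MiniPipeline:
    """Chains (name, step) pairs: transformers first, one estimator last."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # New data goes through the already fitted transformers, unchanged.
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


pipe = MiniPipeline([("centerer", MeanCenterer()), ("classifier", SignClassifier())])
pipe.fit([1, 2, 3, 4], [0, 0, 1, 1])
predictions = pipe.predict([2, 3])
print(predictions)  # [0, 1]
```

The useful property is that the mean is learned from the training data only and then reused at prediction time, which is what keeps test data from influencing the preprocessing.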

## Training a classification model: [K-Nearest Neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
Simple pipeline example:

In [11]:
knn_pipe_example = Pipeline(
    steps = [
        ('imputer', SimpleImputer()),
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=12)),
        ('knn_classifier', KNeighborsClassifier(n_neighbors=10))
    ],
    verbose=True
)

knn_pipe_example.fit(X_train, y_train)
knn_pipe_example.score(X_test, y_test)

[Pipeline] ........... (step 1 of 4) Processing imputer, total=   0.0s
[Pipeline] ............ (step 2 of 4) Processing scaler, total=   0.0s
[Pipeline] ............... (step 3 of 4) Processing pca, total=   0.0s
[Pipeline] .... (step 4 of 4) Processing knn_classifier, total=   0.0s


0.8021978021978022

And now the same, using Python *list* concatenation and *dict* unpacking to reuse the same preprocessing steps for every model we'll test, along with a **cross_validate** utility function to perform a grid search over multiple hyperparameters.

In [12]:
preprocessing_steps = [
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=12))
]
preprocessing_params = {
    'imputer__strategy': ['mean', 'median'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'pca__n_components': [11, 12, 13]
}

def cross_validate(model, params):
    # create a grid search over the pipeline, then find the best hyperparameters
    grid_param = { **preprocessing_params, **params }
    gridsearch = GridSearchCV(model, grid_param, cv=5, verbose=1, n_jobs=-1)
    gridsearch_result = gridsearch.fit(X_train, y_train)

    display(gridsearch_result.best_estimator_)
    display('Best model accuracy over previously unseen data: {}'.format(
        gridsearch_result.score(X_test, y_test)
    ))

In [13]:
knn_pipe = Pipeline(
    preprocessing_steps + [('classifier', KNeighborsClassifier())]
)
knn_params = {
    'classifier__n_neighbors': [5, 10, 15, 20],
    'classifier__weights': ['uniform', 'distance'],
    'classifier__algorithm': ['brute', 'ball_tree']
}

cross_validate(knn_pipe, knn_params)

Fitting 5 folds for each of 384 candidates, totalling 1920 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done 1920 out of 1920 | elapsed:    6.6s finished


Pipeline(memory=None,
         steps=[('imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=12,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('classifier',
                 KNeighborsClassifier(algorithm='brute', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=10, p=2,
                                      weights='uniform'))],
         verbose=False)

'Best model accuracy over previously unseen data: 0.8021978021978022'
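
The 384 candidates and 1920 fits reported in the log above can be recovered by hand: a parameter grid is the Cartesian product of all the value lists, and each candidate is fitted once per fold. The sketch below recounts them with `itertools.product`, reusing the two dictionaries from the cells above; it is only a sanity check and not how the search enumerates candidates internally. Each key follows the `<step name>__<parameter>` convention, so `classifier__n_neighbors` targets `n_neighbors` of the step named `classifier`.

```python
from itertools import product

preprocessing_params = {
    'imputer__strategy': ['mean', 'median'],
    'scaler__with_mean': [True, False],
    'scaler__with_std': [True, False],
    'pca__n_components': [11, 12, 13],
}
knn_params = {
    'classifier__n_neighbors': [5, 10, 15, 20],
    'classifier__weights': ['uniform', 'distance'],
    'classifier__algorithm': ['brute', 'ball_tree'],
}
grid = {**preprocessing_params, **knn_params}

# One candidate = one value picked for every key of the grid.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * 5  # cv=5 folds per candidate

print(n_candidates, n_fits)  # 384 1920
```

The count grows multiplicatively with every added value, which is worth keeping in mind before widening a grid.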

## Training a classification model: [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

In [14]:
svm_pipe = Pipeline(
    steps = preprocessing_steps + [('classifier', LinearSVC(random_state=0))]
)
svm_params = {
    'classifier__C': [0.5, 1, 1.5],
    'classifier__loss': ['hinge', 'squared_hinge']
}

cross_validate(svm_pipe, svm_params)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.


Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Done 100 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:    2.9s finished


Pipeline(memory=None,
         steps=[('imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=12,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('classifier',
                 LinearSVC(C=0.5, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=0,
                           tol=0.0001, verbose=0))],
         verbose=False)

'Best model accuracy over previously unseen data: 0.8131868131868132'

## Training a classification model: [Linear Discriminant Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)

In [15]:
lda_pipe = Pipeline(
    steps = preprocessing_steps + [('classifier', LinearDiscriminantAnalysis())]
)
lda_params = {
    'classifier__solver': ['svd', 'lsqr', 'eigen']
}

cross_validate(lda_pipe, lda_params)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.


Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Done 100 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:    1.1s finished


Pipeline(memory=None,
         steps=[('imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=False)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=11,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('classifier',
                 LinearDiscriminantAnalysis(n_components=None, priors=None,
                                            shrinkage=None, solver='svd',
                                            store_covariance=False,
                                            tol=0.0001))],
         verbose=False)

'Best model accuracy over previously unseen data: 0.8131868131868132'

## Training a classification model: [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)

In [16]:
gnb_pipe = Pipeline(steps = preprocessing_steps + [('classifier', GaussianNB())])
gnb_params = {}

cross_validate(gnb_pipe, gnb_params)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.


Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Done  83 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    0.4s finished


Pipeline(memory=None,
         steps=[('imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=False)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=11,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('classifier', GaussianNB(priors=None, var_smoothing=1e-09))],
         verbose=False)

'Best model accuracy over previously unseen data: 0.8571428571428571'

## Training a classification model: [Logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)

In [17]:
lr_pipe = Pipeline(preprocessing_steps + [('classifier', LogisticRegression())])
lr_params = {
    'classifier__C': [0.1, 0.5, 1, 1.5],
    'classifier__solver': ['lbfgs', 'liblinear']
}

cross_validate(lr_pipe, lr_params)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.


Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=-1)]: Done 100 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 949 out of 960 | elapsed:    4.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done 960 out of 960 | elapsed:    4.1s finished


Pipeline(memory=None,
         steps=[('imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=12,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('classifier',
                 LogisticRegression(C=0.5, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbo

'Best model accuracy over previously unseen data: 0.8131868131868132'

## Testing out different setups: [Cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) over multiple models

In [18]:
# Create a pipeline
pipe = Pipeline(preprocessing_steps + [('classifier', LogisticRegression())])

# Create list with candidate learning algorithms and their hyperparameters
grid_param = [
    { 
        'classifier': [KNeighborsClassifier()], 
        **preprocessing_params, 
        **knn_params 
    },
    { 
        'classifier': [LinearSVC()], 
        **preprocessing_params, 
        **svm_params 
    },
    { 
        'classifier': [LinearDiscriminantAnalysis()], 
        **preprocessing_params, 
        **lda_params
    },
    { 
        'classifier': [GaussianNB()], 
        **preprocessing_params, 
        **gnb_params 
    },
    { 
        'classifier': [LogisticRegression()], 
        **preprocessing_params, 
        **lr_params 
    }
]

# create a gridsearch of the pipeline, then fit the best model
gridsearch = GridSearchCV(pipe, grid_param, cv=10, verbose=1, n_jobs=-1)
gridsearch_result = gridsearch.fit(X_train, y_train)

display(gridsearch_result.best_estimator_)
display('Best model accuracy over previously unseen data: {}'.format(
    gridsearch_result.score(X_test, y_test)
))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.


Fitting 10 folds for each of 816 candidates, totalling 8160 fits


[Parallel(n_jobs=-1)]: Done 100 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 2420 tasks      | elapsed:    6.8s
[Parallel(n_jobs=-1)]: Done 6420 tasks      | elapsed:   20.2s
[Parallel(n_jobs=-1)]: Done 8160 out of 8160 | elapsed:   27.2s finished


Pipeline(memory=None,
         steps=[('imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=False)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=11,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('classifier',
                 LogisticRegression(C=1, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='liblinear', tol=0.0001, ve

'Best model accuracy over previously unseen data: 0.8131868131868132'