## Intro

I am following [this tutorial](https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9). 

TPOT is an open-source Python library for **automating machine learning.**

Its way of finding the best pipeline is inspired by genetics: "TPOT tries a pipeline, evaluates its performance, and randomly changes parts of the pipeline in search of better performing algorithms". In fact, some of its parameters have "biological" names: generations, population_size, offspring_size, etc. Here is what these mean according to the [docs](http://epistasislab.github.io/tpot/api/): 
+ generations: number of iterations to optimize the ENTIRE pipeline.
+ population_size: number of individuals to retain in the GP population every generation.
+ offspring_size: number of offspring to produce in each GP generation. By default it equals population_size.
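
To make the three parameters concrete, here is a minimal toy sketch of an evolutionary loop. It is not TPOT's implementation: the "individuals" are plain floats instead of pipelines, the only variation operator is a random mutation, and the `evolve` function and its fitness function are hypothetical names made up for this illustration.

```python
import random


def evolve(fitness, population_size=10, offspring_size=10, generations=3, seed=0):
    """Toy evolutionary search over floats (a stand-in for pipelines)."""
    rng = random.Random(seed)
    # Initial population: random candidates, like TPOT's first random pipelines.
    population = [rng.uniform(-10, 10) for _ in range(population_size)]
    best_per_generation = [max(fitness(x) for x in population)]
    for _ in range(generations):
        # Each offspring is a randomly perturbed copy of a randomly chosen parent.
        offspring = [rng.choice(population) + rng.gauss(0, 1)
                     for _ in range(offspring_size)]
        # Retain only the population_size fittest individuals for the next round.
        population = sorted(population + offspring, key=fitness,
                            reverse=True)[:population_size]
        best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation


# Hypothetical fitness: the closer to 3, the better.
best, history = evolve(lambda x: -(x - 3) ** 2)
print(best, history)
```

Because the survivors are picked from parents and offspring together, the best fitness in `history` can never get worse from one generation to the next.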

Keep in mind that TPOT's search has to cover preprocessing (missing value imputation, scaling, PCA, feature selection, etc.), multiple machine learning algorithms, the hyperparameters of both the algorithms and the preprocessing steps, and multiple ways of combining the algorithms. This makes the search expensive: "TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total". 

That is, TPOT first evaluates POPULATION_SIZE pipelines. Then, in each generation, it produces OFFSPRING_SIZE new pipelines by randomly changing parameters or algorithms of the existing ones, and evaluates them. It repeats this GENERATIONS times.
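
The formula quoted above can be checked with a few lines of Python. The helper name `total_pipelines` is made up for this sketch; the default for `offspring_size` follows the docs quoted earlier (it equals `population_size` when not given).

```python
def total_pipelines(population_size, generations, offspring_size=None):
    """Number of pipelines evaluated: POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE."""
    if offspring_size is None:
        # By default offspring_size equals population_size.
        offspring_size = population_size
    return population_size + generations * offspring_size


# Settings used in Example 1 of these notes: population_size=10, generations=3.
print(total_pipelines(population_size=10, generations=3))  # 10 + 3 * 10 = 40
```

The result, 40, matches the `max=40` shown in the progress output of Example 1.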

Its strengths are: 
+ **Integrated with scikit-learn**.
+ **Efficiently evaluates many different pipelines**. 

Note that **it does not replace a data scientist (at least for now)**, but it helps find a good algorithm faster. 

Also, **its default search space (listed below) does not contain neural network algorithms.**

### Search space


##### 1. Classifiers

*‘sklearn.naive_bayes.BernoulliNB’: { ‘alpha’: [1e-3, 1e-2, 1e-1, 1., 10., 100.], ‘fit_prior’: [True, False] },*

*‘sklearn.naive_bayes.MultinomialNB’: { ‘alpha’: [1e-3, 1e-2, 1e-1, 1., 10., 100.], ‘fit_prior’: [True, False] },* 

*‘sklearn.tree.DecisionTreeClassifier’: { ‘criterion’: [“gini”, “entropy”], ‘max_depth’: range(1, 11), ‘min_samples_split’: range(2, 21), ‘min_samples_leaf’: range(1, 21) },*

*‘sklearn.ensemble.ExtraTreesClassifier’: { ‘n_estimators’: [100], ‘criterion’: [“gini”, “entropy”], ‘max_features’: np.arange(0.05, 1.01, 0.05), ‘min_samples_split’: range(2, 21), ‘min_samples_leaf’: range(1, 21), ‘bootstrap’: [True, False] },*

*‘sklearn.ensemble.RandomForestClassifier’: { ‘n_estimators’: [100], ‘criterion’: [“gini”, “entropy”], ‘max_features’: np.arange(0.05, 1.01, 0.05), ‘min_samples_split’: range(2, 21), ‘min_samples_leaf’: range(1, 21), ‘bootstrap’: [True, False] },*

*‘sklearn.ensemble.GradientBoostingClassifier’: { ‘n_estimators’: [100], ‘learning_rate’: [1e-3, 1e-2, 1e-1, 0.5, 1.], ‘max_depth’: range(1, 11), ‘min_samples_split’: range(2, 21), ‘min_samples_leaf’: range(1, 21), ‘subsample’: np.arange(0.05, 1.01, 0.05), ‘max_features’: np.arange(0.05, 1.01, 0.05) },*

*‘sklearn.neighbors.KNeighborsClassifier’: { ‘n_neighbors’: range(1, 101), ‘weights’: [“uniform”, “distance”], ‘p’: [1, 2] },*

*‘sklearn.svm.LinearSVC’: { ‘penalty’: [“l1”, “l2”], ‘loss’: [“hinge”, “squared_hinge”], ‘dual’: [True, False], ‘tol’: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], ‘C’: [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.] },*

*‘sklearn.linear_model.LogisticRegression’: { ‘penalty’: [“l1”, “l2”], ‘C’: [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.], ‘dual’: [True, False] },*

*‘xgboost.XGBClassifier’: { ‘n_estimators’: [100], ‘max_depth’: range(1, 11), ‘learning_rate’: [1e-3, 1e-2, 1e-1, 0.5, 1.], ‘subsample’: np.arange(0.05, 1.01, 0.05), ‘min_child_weight’: range(1, 21), ‘nthread’: [1] }*
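
To get a feeling for how large this space is, the sketch below counts the hyperparameter combinations for two of the entries above. The entries are copied from the list; the `grid_size` helper is a hypothetical name introduced only for this illustration. As far as I know, a dictionary of this shape is also what TPOT accepts through its `config_dict` parameter, but that is not exercised here.

```python
from math import prod

# Two entries copied from the classifier list above.
decision_tree_space = {
    "criterion": ["gini", "entropy"],
    "max_depth": range(1, 11),
    "min_samples_split": range(2, 21),
    "min_samples_leaf": range(1, 21),
}
knn_space = {
    "n_neighbors": range(1, 101),
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}


def grid_size(space):
    """Number of distinct hyperparameter combinations in one search-space entry."""
    return prod(len(values) for values in space.values())


print(grid_size(decision_tree_space))  # 2 * 10 * 19 * 20 = 7600
print(grid_size(knn_space))            # 100 * 2 * 2 = 400
```

Even a single classifier already has thousands of possible settings, which is why TPOT samples and mutates pipelines instead of trying every combination.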

**In addition, classifiers can be stacked one on top of another, using the predictions of the first as inputs for the second.**
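
The stacking idea can be sketched with scikit-learn alone. This is a simplified illustration, not TPOT's own stacking code: `PredictionAppender` is a hypothetical helper written for this example, and it appends only the predicted label of the first classifier as one extra feature for the second classifier. The classifiers and dataset are chosen to resemble Example 1; the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


class PredictionAppender(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fit an estimator and append its predictions as a feature."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a fresh copy so the estimator passed in is left untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # Original features plus one column holding the first model's predictions.
        predictions = self.estimator_.predict(X).reshape(-1, 1)
        return np.hstack([X, predictions])


X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=34)

# Decision tree first; k-nearest neighbors then sees the tree's predictions too.
stacked = make_pipeline(
    PredictionAppender(DecisionTreeClassifier(max_depth=4, random_state=0)),
    KNeighborsClassifier(n_neighbors=13),
)
stacked.fit(X_train, y_train)
score = stacked.score(X_test, y_test)
print(score)
```

This mirrors the nested form `KNeighborsClassifier(DecisionTreeClassifier(input_matrix, ...), ...)` that appears in TPOT's output in Example 1.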

##### 2. Preprocessors

*‘sklearn.preprocessing.Binarizer’: { ‘threshold’: np.arange(0.0, 1.01, 0.05) },*

*‘sklearn.decomposition.FastICA’: { ‘tol’: np.arange(0.0, 1.01, 0.05) },* 

*‘sklearn.cluster.FeatureAgglomeration’: { ‘linkage’: [‘ward’, ‘complete’, ‘average’], ‘affinity’: [‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’, ‘cosine’] },* 

*‘sklearn.preprocessing.MaxAbsScaler’: { },* 

*‘sklearn.preprocessing.MinMaxScaler’: { },* 

*‘sklearn.preprocessing.Normalizer’: { ‘norm’: [‘l1’, ‘l2’, ‘max’] },* 

*‘sklearn.kernel_approximation.Nystroem’: { ‘kernel’: [‘rbf’, ‘cosine’, ‘chi2’, ‘laplacian’, ‘polynomial’, ‘poly’, ‘linear’, ‘additive_chi2’, ‘sigmoid’], ‘gamma’: np.arange(0.0, 1.01, 0.05), ‘n_components’: range(1, 11) },* 

*‘sklearn.decomposition.PCA’: { ‘svd_solver’: [‘randomized’], ‘iterated_power’: range(1, 11) },*

*‘sklearn.preprocessing.PolynomialFeatures’: { ‘degree’: [2], ‘include_bias’: [False], ‘interaction_only’: [False] },*

*‘sklearn.kernel_approximation.RBFSampler’: { ‘gamma’: np.arange(0.0, 1.01, 0.05) },*

*‘sklearn.preprocessing.RobustScaler’: { },*

*‘sklearn.preprocessing.StandardScaler’: { },*

*‘tpot.builtins.ZeroCount’: { },*

*‘tpot.builtins.OneHotEncoder’: { ‘minimum_fraction’: [0.05, 0.1, 0.15, 0.2, 0.25], ‘sparse’: [False] } (emphasis mine)*

## Run TPOT

In [None]:
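from tpot import TPOTClassifier

# X_train, y_train, X_test and y_test are assumed to exist already (Example 1 below builds them)
tpot = TPOTClassifier(verbosity=3,  # print progress while the search runs
                      periodic_checkpoint_folder='folder_for_pipelines',  # folder where Python code for some pipelines will be stored
                      n_jobs=-1)  # use all cores
tpot.fit(X_train, y_train)
tpot.score(X_test, y_test)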

## Reproducibility


TPOT does not always return the same results, even if we set a seed at the beginning! That is because some algorithms have their own random_state parameters, which we cannot set. 
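
As a small illustration of what such a per-algorithm `random_state` does, the sketch below fits a scikit-learn random forest twice with the same `random_state` and twice with none. It only demonstrates the behavior of a single estimator, not of a TPOT search; the `fit_predict` helper and the choice of `n_estimators=10` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)


def fit_predict(random_state):
    """Fit a small forest and return its class probabilities on the training data."""
    model = RandomForestClassifier(n_estimators=10, random_state=random_state)
    return model.fit(X, y).predict_proba(X)


# Same random_state: the two fits are identical.
same_seed = np.array_equal(fit_predict(0), fit_predict(0))
print("same random_state, identical output:", same_seed)

# random_state=None: the forest draws from global random state, so two fits may differ.
print("random_state=None, identical output:",
      np.array_equal(fit_predict(None), fit_predict(None)))
```

Any estimator inside a pipeline whose `random_state` is left unset behaves like the second case, whatever seed is set elsewhere.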

## Example 1

In [6]:
# import the usual stuff
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time
# import TPOT and sklearn stuff
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import sklearn.metrics
# create train and test sets
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=34)
tpot = TPOTClassifier(verbosity=3, 
                      scoring="balanced_accuracy", 
                      random_state=23, 
                      periodic_checkpoint_folder="tpot_mnst1", 
                      n_jobs=-1, 
                      generations=3, 
                      population_size=10)
# run three iterations and time them
times=[]
winning_pipes = []
scores = []

for x in range(3):
    start_time = time.time()
    tpot.fit(X_train, y_train)
    elapsed = time.time() - start_time
    times.append(elapsed)
    winning_pipes.append(tpot.fitted_pipeline_)
    scores.append(tpot.score(X_test, y_test))
    tpot.export('tpot_mnist_pipeline.py')
times = [t / 60 for t in times]  # convert seconds to minutes (t avoids reusing the name of the time module)
print('Times:', times)
print('Scores:', scores)   
print('Winning pipelines:', winning_pipes)

30 operators have been imported by TPOT.


HBox(children=(IntProgress(value=0, description='Optimization Progress', max=40, style=ProgressStyle(descripti…

Generation 1 - Current Pareto front scores:
-1	0.9754358374956306	RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.5, RandomForestClassifier__min_samples_leaf=4, RandomForestClassifier__min_samples_split=7, RandomForestClassifier__n_estimators=100)
-2	0.9839803080877585	KNeighborsClassifier(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=4, DecisionTreeClassifier__min_samples_leaf=7, DecisionTreeClassifier__min_samples_split=15), KNeighborsClassifier__n_neighbors=13, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform)

Generation 2 - Current Pareto front scores:
-1	0.9754358374956306	RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.5, RandomForestClassifier__min_samples_leaf=4, RandomForestC

HBox(children=(IntProgress(value=0, description='Optimization Progress', max=40, style=ProgressStyle(descripti…

Generation 1 - Current Pareto front scores:
-1	0.9754358374956306	RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.5, RandomForestClassifier__min_samples_leaf=4, RandomForestClassifier__min_samples_split=7, RandomForestClassifier__n_estimators=100)
-2	0.9839803080877585	KNeighborsClassifier(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=4, DecisionTreeClassifier__min_samples_leaf=7, DecisionTreeClassifier__min_samples_split=15), KNeighborsClassifier__n_neighbors=13, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform)

Generation 2 - Current Pareto front scores:
-1	0.9754358374956306	RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.5, RandomForestClassifier__min_samples_leaf=4, RandomForestC

HBox(children=(IntProgress(value=0, description='Optimization Progress', max=40, style=ProgressStyle(descripti…

Generation 1 - Current Pareto front scores:
-1	0.9754358374956306	RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.5, RandomForestClassifier__min_samples_leaf=4, RandomForestClassifier__min_samples_split=7, RandomForestClassifier__n_estimators=100)
-2	0.9839803080877585	KNeighborsClassifier(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=4, DecisionTreeClassifier__min_samples_leaf=7, DecisionTreeClassifier__min_samples_split=15), KNeighborsClassifier__n_neighbors=13, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=uniform)

Generation 2 - Current Pareto front scores:
-1	0.9754358374956306	RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.5, RandomForestClassifier__min_samples_leaf=4, RandomForestC