# Genetic Algorithms (TPOT Classifier)

Genetic algorithms apply the mechanisms of natural selection to machine learning problems.

Imagine we create a population of N machine learning models, each with some predefined hyperparameters. We compute the accuracy of each model and keep only the best-performing half. We then generate offspring whose hyperparameters are similar to those of the surviving models, so that the population is back to N models. At this point we compute the accuracy of each model again and repeat the cycle for a fixed number of generations. In this way, only the best models survive at the end of the process.
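The cycle described above can be sketched in plain Python. This is only an illustration, not how TPOT is implemented: the search space, the `fitness` function (a stand-in for a model's accuracy) and the mutation rule are all hypothetical choices made for this example.

```python
import random

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "max_depth": list(range(1, 21)),
    "min_samples_leaf": list(range(1, 11)),
}


def fitness(individual):
    """Stand-in for model accuracy: highest at max_depth=8, min_samples_leaf=3."""
    return -abs(individual["max_depth"] - 8) - abs(individual["min_samples_leaf"] - 3)


def random_individual(rng):
    """Draw one value per hyperparameter at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def mutate(parent, rng):
    """Copy a parent and move one hyperparameter to a neighbouring value."""
    child = dict(parent)
    name = rng.choice(list(SEARCH_SPACE))
    values = SEARCH_SPACE[name]
    idx = values.index(child[name]) + rng.choice([-1, 1])
    child[name] = values[max(0, min(len(values) - 1, idx))]
    return child


def evolve(population_size=10, generations=15, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        # Selection: keep the better half of the population.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        history.append(fitness(survivors[0]))
        # Reproduction: refill the population with mutated copies of survivors.
        offspring = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + offspring
    return max(population, key=fitness), history


best, history = evolve()
print(best, history)
```

Because the survivors are carried over unchanged, the best fitness can never get worse from one generation to the next; it either stays flat or improves.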

In [17]:
!pip install tpot


In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import pandas as pd
df=pd.read_csv('/content/drive/MyDrive/Datasets/diabetes.csv')
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
import numpy as np
df['Glucose']=np.where(df['Glucose']==0,df['Glucose'].median(),df['Glucose'])
df['Insulin']=np.where(df['Insulin']==0,df['Insulin'].median(),df['Insulin'])
df['SkinThickness']=np.where(df['SkinThickness']==0,df['SkinThickness'].median(),df['SkinThickness'])
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72,35.0,30.5,33.6,0.627,50,1
1,1,85.0,66,29.0,30.5,26.6,0.351,31,0
2,8,183.0,64,23.0,30.5,23.3,0.672,32,1
3,1,89.0,66,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40,35.0,168.0,43.1,2.288,33,1
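Note that the cell above takes the median over the whole column, zeros included, so the placeholder zeros pull the fill value down (the Insulin zeros are replaced by 30.5). A possible alternative, not what the notebook does, is to compute the median from the non-zero entries only. The sketch below shows the difference on the five Insulin values displayed by the first `df.head()`; it uses the standard library so it runs without the dataset.

```python
from statistics import median

# The five Insulin values shown by df.head() before imputation.
insulin = [0, 0, 0, 94, 168]

# Median over everything, zeros included (the notebook's approach).
median_with_zeros = median(insulin)

# Median over the observed (non-zero) values only (an alternative).
median_without_zeros = median(v for v in insulin if v != 0)

# Replace the zero placeholders with the alternative fill value.
imputed = [median_without_zeros if v == 0 else v for v in insulin]

print(median_with_zeros, median_without_zeros)
print(imputed)
```

Which variant is preferable depends on whether a zero is a plausible measurement for the column; for Insulin, Glucose and SkinThickness the notebook treats it as missing.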


In [5]:
#### Independent And Dependent features

X=df.drop('Outcome',axis=1)
y=df['Outcome']

In [6]:
#### Train Test Split

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
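As a quick sanity check on the split sizes: the class counts printed further down add up to 768 rows, and with `test_size=0.2` the test share is rounded up, to the best of my knowledge of how scikit-learn handles a fractional `test_size`. A minimal sketch of that arithmetic:

```python
import math

n_samples = 500 + 268  # class counts reported by y.value_counts()
test_size = 0.2

# Assumption: the fractional test share is rounded up, the rest is training data.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The resulting test size agrees with the support of 154 shown in the classification report.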

In [7]:
from sklearn.ensemble import RandomForestClassifier

In [8]:
rf_classifier=RandomForestClassifier(n_estimators=10).fit(X_train,y_train)
prediction=rf_classifier.predict(X_test)

In [9]:
y.value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [10]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
print('Accuracy score:',accuracy_score(y_test,prediction))
print(confusion_matrix(y_test,prediction))
print(classification_report(y_test,prediction))



Accuracy score: 0.7987012987012987
[[94 13]
 [18 29]]
              precision    recall  f1-score   support

           0       0.84      0.88      0.86       107
           1       0.69      0.62      0.65        47

    accuracy                           0.80       154
   macro avg       0.76      0.75      0.76       154
weighted avg       0.79      0.80      0.80       154
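The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels) and uses only the four counts printed above.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[94, 13],
      [18, 29]]
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)   # of the predicted positives, how many are right
recall_pos = tp / (tp + fn)      # of the actual positives, how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Accuracy of always predicting the majority class (0) on this test set.
majority_baseline = (tn + fp) / total

print(round(accuracy, 4), round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
print(round(majority_baseline, 4))
```

Comparing the accuracy with the majority-class baseline is useful here because the classes are imbalanced: the model misses 18 of the 47 positive cases even though the overall accuracy looks good.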



### The main parameters used by a Random Forest Classifier are:
1. criterion = the function used to evaluate the quality of a split.
2. max_depth = maximum number of levels allowed in each tree.
3. max_features = maximum number of features considered when splitting a node.
4. min_samples_leaf = minimum number of samples required at a leaf node.
5. min_samples_split = minimum number of samples necessary in a node to cause node splitting.
6. n_estimators = number of trees in the ensemble.

In [14]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; drop it from the list on newer versions)
max_features = ['auto', 'sqrt','log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]
# Create the random grid
param = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']}
print(param)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}
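To get a feel for how large this search space is, the number of distinct hyperparameter combinations can be counted directly from the dictionary printed above. The sketch copies the printed values so that it runs on its own.

```python
import math

# Search space exactly as printed above.
param = {
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000],
    'min_samples_split': [2, 5, 10, 14],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'criterion': ['entropy', 'gini'],
}

# Number of candidate values per hyperparameter.
sizes = {name: len(values) for name, values in param.items()}

# An exhaustive grid search would have to try the product of these sizes.
n_combinations = math.prod(sizes.values())

print(sizes)
print(n_combinations)
```

Evaluating every combination with cross-validation would be expensive, which is the motivation for a search strategy that samples only part of the space.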


In [18]:
from tpot import TPOTClassifier

tpot_classifier=TPOTClassifier(generations=5,population_size=24,offspring_size=12,verbosity=2,early_stop=12,
                               config_dict={'sklearn.ensemble.RandomForestClassifier':param},cv=4,scoring='accuracy')

tpot_classifier.fit(X_train,y_train)



Generation 1 - Current best internal CV score: 0.7541061879297173

Generation 2 - Current best internal CV score: 0.7589550971903913

Generation 3 - Current best internal CV score: 0.7589550971903913

Generation 4 - Current best internal CV score: 0.7589550971903913

Generation 5 - Current best internal CV score: 0.7605466428995841

Best pipeline: RandomForestClassifier(RandomForestClassifier(CombineDFs(input_matrix, CombineDFs(CombineDFs(input_matrix, input_matrix), input_matrix)), criterion=gini, max_depth=120, max_features=log2, min_samples_leaf=8, min_samples_split=10, n_estimators=600), criterion=gini, max_depth=560, max_features=sqrt, min_samples_leaf=1, min_samples_split=14, n_estimators=1000)
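The cost of this run can be estimated from the settings passed to `TPOTClassifier`. According to the TPOT documentation, a run evaluates `population_size + generations * offspring_size` pipelines in total; the sketch below applies that formula to the values used above. The count of cross-validation fits is a rough estimate, on the assumption that every pipeline is fitted once per fold and none is skipped.

```python
# Settings copied from the TPOTClassifier call above.
generations = 5
population_size = 24
offspring_size = 12
cv = 4

# Initial population plus the offspring created in each generation.
n_pipelines = population_size + generations * offspring_size

# Rough estimate: each pipeline is fitted once per cross-validation fold.
n_cv_fits = n_pipelines * cv

print(n_pipelines, n_cv_fits)
```

With only 5 generations, `early_stop=12` never comes into play, since it would require 12 generations without improvement.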



In [19]:
accuracy = tpot_classifier.score(X_test, y_test)
print(accuracy)

0.8441558441558441
