
# AutoML using TPOT
TPOT searches for machine-learning pipelines with a genetic algorithm, an optimisation method based on natural selection (survival of the fittest).

Steps of a genetic algorithm:
 - **Selection:** score every candidate and keep the fittest
 - **Crossover:** combine parts of the fittest candidates to breed a new generation
 - **Mutation:** apply small random changes to the offspring, then repeat the cycle until a candidate is good enough or the generation limit is reached
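
The three steps can be sketched on a toy problem. The example below is not TPOT's implementation: it evolves a bit string towards all ones (the "OneMax" problem), and every name in it (`fitness`, `select`, `crossover`, `mutate`, `evolve`) as well as the tournament selection and elitism choices are assumptions made for illustration.

```python
import random


def fitness(bits):
    # Toy fitness: the number of 1-bits (higher is fitter).
    return sum(bits)


def select(population, rng, k=3):
    # Selection: tournament of k random candidates, keep the fittest.
    return max(rng.sample(population, k), key=fitness)


def crossover(parent_a, parent_b, rng):
    # Crossover: splice the head of one parent onto the tail of the other.
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(bits, rate, rng):
    # Mutation: flip each bit independently with probability `rate`.
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = [max(fitness(p) for p in population)]
    for _ in range(generations):
        # Elitism (an illustrative choice): carry the best candidate over
        # unchanged so the best fitness never decreases.
        children = [max(population, key=fitness)]
        while len(children) < pop_size:
            child = crossover(select(population, rng),
                              select(population, rng), rng)
            children.append(mutate(child, mutation_rate, rng))
        population = children
        history.append(max(fitness(p) for p in population))
    return max(population, key=fitness), history


best, history = evolve()
print("best fitness per generation:", history)
```

TPOT applies the same loop to whole scikit-learn pipelines instead of bit strings, with cross-validated score playing the role of `fitness`.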

#### **Packages**
`pip install tpot`

#### **Dependencies**
scikit-learn and NumPy

#### **NB**
 - Remove (or impute) missing values before fitting
 - Features and target must be numeric, so encode categorical values as numbers
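
A minimal sketch of both preparation steps with pandas. The DataFrame `raw` and its column names are made up for illustration and are not part of this notebook's data; dropping rows and using category codes are just one of several reasonable choices.

```python
import pandas as pd

# Hypothetical toy data with gaps and text columns.
raw = pd.DataFrame({
    "length_cm": [5.1, None, 6.3, 5.8, 4.9],
    "colour": ["red", "blue", "blue", None, "red"],
    "label": ["a", "b", "b", "a", "a"],
})

# 1. Remove rows that contain any missing value.
clean = raw.dropna().reset_index(drop=True)

# 2. Encode the text columns as integer codes
#    (categories are numbered in sorted order).
for column in ["colour", "label"]:
    clean[column] = clean[column].astype("category").cat.codes

print(clean)
print(clean.dtypes)
```

Integer codes imply an ordering between categories; for nominal features with more than two values, one-hot encoding (for example `pd.get_dummies`) may be the safer option.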

In [4]:
# Load libraries

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
import pandas as pd 
import numpy as np

In [5]:
# Data sources
data1 = "https://raw.githubusercontent.com/sanketpadwal/GCDAI_INSAID_JAN20/main/Misc/TPOT/Iris.csv"
data2 = ""

In [6]:
# Load data
df1 = pd.read_csv(data1)

In [7]:
df1.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [8]:
df1.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [10]:
df1.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [12]:
# unique class values
df1['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [14]:
# Map the categorical target classes to integer labels

ref = {value:index for index,value in enumerate(df1['Species'].unique())}

In [15]:
ref

{'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

In [16]:
# Create new target label in numerical format
df1['Iris_label'] = df1['Species'].map(ref)

In [17]:
df1.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,Iris_label
0,1,5.1,3.5,1.4,0.2,Iris-setosa,0
1,2,4.9,3.0,1.4,0.2,Iris-setosa,0
2,3,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5,5.0,3.6,1.4,0.2,Iris-setosa,0


In [18]:
x = df1[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = df1['Iris_label']

In [19]:
from sklearn.model_selection import cross_val_score

In [21]:
cv_scores = cross_val_score(LogisticRegression(),x,y,cv=10)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
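
The warning suggests two remedies: raise `max_iter` or scale the features. A minimal sketch that does both is shown here. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`) instead of the CSV loaded above, and the names `X_demo`, `y_demo`, `scaled_logreg`, and `scaled_scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the CSV used in this notebook.
X_demo, y_demo = load_iris(return_X_y=True)

# Standardise the features inside the pipeline so the scaler is re-fitted on
# each training fold, and give the solver a larger iteration budget.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scaled_scores = cross_val_score(scaled_logreg, X_demo, y_demo, cv=10)
print(scaled_scores.mean())
```

Putting the scaler inside the pipeline, rather than scaling the whole dataset first, keeps information from the validation fold out of the fitted scaler.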

In [22]:
cv_scores

array([1.        , 0.93333333, 1.        , 1.        , 0.93333333,
       0.93333333, 0.93333333, 1.        , 1.        , 1.        ])

In [23]:
print(np.mean(cv_scores))

0.9733333333333334


In [24]:
# Individual algorithm: random forest with default settings
rf_cv_scores = cross_val_score(RandomForestClassifier(),x,y,cv=10)

In [25]:
rf_cv_scores

array([1.        , 0.93333333, 1.        , 0.93333333, 0.93333333,
       0.93333333, 0.86666667, 1.        , 1.        , 1.        ])

In [26]:
print(np.mean(rf_cv_scores))

0.96


In [28]:
# Individual algorithm: random forest with explicit hyperparameters
rf_cv_scores2 = cross_val_score(RandomForestClassifier(n_estimators=100,max_depth=2),x,y,cv=10)

In [29]:
rf_cv_scores2

array([1.        , 0.93333333, 1.        , 0.93333333, 0.93333333,
       0.93333333, 0.86666667, 1.        , 1.        , 1.        ])

In [30]:
print(np.mean(rf_cv_scores2))

0.96


# AutoML with TPOT

In [33]:
# Install TPOT libraries
!pip install -q tpot

In [40]:
# import tpot library
import tpot as tp

In [41]:
# Methods and Attributes
dir(tp)

['TPOTClassifier',
 'TPOTRegressor',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_version',
 'base',
 'builtins',
 'config',
 'decorators',
 'driver',
 'export_utils',
 'gp_deap',
 'gp_types',
 'main',
 'metrics',
 'operator_utils',
 'tpot']

In [42]:
# Split in train and test
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=42)

In [44]:
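# Init: TPOTClassifier lives in the tpot module, imported above as `tp`.
# The bare name TPOTClassifier raises a NameError, and assigning the result
# to `tp` would shadow the module, so use a separate variable name.
tpot_clf = tp.TPOTClassifier(generations=5,verbosity=2,population_size=20)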