# Genetic programming (TPOT) - Use & Generate best ML Algo

### What is [TPOT][1]?
> Consider TPOT your Data Science Assistant. TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.

In this Kernel I explore the TPOT library, which searches for the best-performing pipeline for a dataset and then exports the Python code for that pipeline.

*Note: TPOT may take some time to select the best algorithm with the limited resources of a Kaggle Kernel. For testing, it is best to download this notebook to your local machine and experiment there.*

[1]: http://rhiever.github.io/tpot/

In [None]:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd 
import numpy as np

### Load and check basic statistics 

In [None]:
data = pd.read_csv('../input/diabetes.csv')
data.describe()

### Data Pre-processing

Certain attributes (like BloodPressure and Glucose) contain zero-valued entries, which are not plausible measurements and most likely stand for missing data. We therefore replace each zero with the mean of the non-zero values of the same attribute.

Idea taken from ManasviKundalia's Kernel.

https://www.kaggle.com/manasvi22k/d/uciml/pima-indians-diabetes-database/prediction-using-multiple-models-76-77

In [None]:
def replaceZeroWithMean(column):
    """Replace zero entries in data[column] with the mean of its non-zero values."""
    if column is None:
        return

    print('Column: ', column)

    is_zero = data[column] == 0
    print('Number of zero entries: ', is_zero.sum())

    # Mean over the non-zero entries only; zeros are treated as missing.
    mean = data.loc[~is_zero, column].mean()
    print('Mean: ', mean)
    # Cast to float first so the mean can be stored in an integer column.
    data[column] = data[column].astype(float)
    data.loc[is_zero, column] = mean
    print('---------------------------------------')

In [None]:
columns_zero_to_mean = ['Glucose', 'BloodPressure', 'BMI']

for column in columns_zero_to_mean:
    replaceZeroWithMean(column)

Verify that the zero values have been replaced for Glucose, BloodPressure and BMI (the `min` row should no longer be 0).

In [None]:
data.describe()[columns_zero_to_mean] 
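The same replacement can also be written without an explicit loop. The sketch below is one possible vectorized version, shown on a small made-up frame (`toy` and its values are illustrative, not taken from the diabetes data): zeros are first masked to `NaN`, then each column is filled with the mean of its remaining values.

```python
import pandas as pd

# Toy frame standing in for the diabetes data (values are made up).
toy = pd.DataFrame({
    "Glucose": [148.0, 0.0, 183.0, 89.0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})
cols = ["Glucose", "BMI"]

# Treat zeros as missing, then fill each column with the mean
# of its non-zero values (mean() skips NaN by default).
non_zero = toy[cols].mask(toy[cols] == 0)
toy[cols] = non_zero.fillna(non_zero.mean())

# Count of remaining zeros per column.
print((toy[cols] == 0).sum())
```

Masking first keeps the mean from being pulled down by the zeros themselves, which is the same idea as in `replaceZeroWithMean` above.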

## TPOT Implementation

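TPOT's tutorials rename the target (response) column to `class` as a first step, so we do the same here. With the `fit(X, y)` API used in this Kernel the target is passed separately, so the name is a convention rather than a requirement.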

In [None]:
data.rename(columns = {'Outcome': 'class'}, inplace=True)

At present, TPOT requires all the data to be in numerical format, so verify that first.

Every column in our data is `int64` or `float64`, so no encoding is needed.

In [None]:
data.dtypes
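Reading `dtypes` by eye works for a handful of columns. For wider datasets, a programmatic check may be easier. The sketch below uses pandas' `is_numeric_dtype`; the helper name `non_numeric_columns` and the toy frame are hypothetical and only there for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(frame):
    """Return the names of columns whose dtype is not numeric."""
    return [name for name in frame.columns if not is_numeric_dtype(frame[name])]


# Toy frame with one text column (made-up data).
toy = pd.DataFrame({"Age": [50, 31], "BMI": [33.6, 26.6], "Sex": ["F", "M"]})
bad = non_numeric_columns(toy)
print(bad)
```

An empty list means every column is numeric. Any column that is listed would need to be encoded (or dropped) before being passed to TPOT.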

Store the features (X) and the target (y) in separate variables, just for simplicity.

In [None]:
data_y = data['class']
data_X = data.drop(['class'], axis=1)

Divide dataset into training and validation sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y,
                                                    train_size=0.75, test_size=0.25)
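The split above is random, so the score will vary from run to run. A possible variation, sketched here on made-up arrays, fixes `random_state` for reproducibility and passes `stratify` so both parts keep roughly the same class balance. The seed value `42` and the toy data are arbitrary assumptions, not part of the original Kernel.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 35% positive class (made-up values).
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 35 + [0] * 65)

# stratify=y keeps the class ratio similar in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fraction of positives in each part.
print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when one class is much rarer than the other, since a plain random split can then leave the test set with very few positive examples.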

**Initialize `TPOTClassifier` and run the `fit()` method on the training set**

Increase `max_time_mins` to give the search more time, which usually yields a better pipeline, and set `verbosity=2` to see progress output during the run.

In [None]:
tpot = TPOTClassifier(verbosity=1, max_time_mins=15) 
tpot.fit(X_train, y_train)

In [None]:
print(tpot.score(X_test, y_test))

**Export the best pipeline found by TPOT as a Python script**

In [None]:
tpot.export('tpot_script.py')

The generated file can be found in the Output tab:

https://www.kaggle.com/usersumit/d/uciml/pima-indians-diabetes-database/genetic-programming-generate-best-ml-algo/output