# TPOT tutorial on the Titanic dataset 

The Titanic machine learning competition on [Kaggle](https://www.kaggle.com/c/titanic) is one of the most popular beginner competitions on the platform. We use it here to demonstrate how to apply TPOT.

In [1]:
# import os
# os.system('python3.6 -m pip install --upgrade --upgrade-strategy "eager" --force-reinstall --ignore-installed --compile --process-dependency-links --no-binary :all: deap')

In [2]:
# Import required libraries
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd 
import numpy as np



In [3]:
# Load the data
titanic = pd.read_csv('../data/titanic/train.csv')
titanic.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Data Exploration 

In [4]:
titanic.groupby('Sex').Survived.value_counts()

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64

In [5]:
titanic.groupby(['Pclass','Sex']).Survived.value_counts()

Pclass  Sex     Survived
1       female  1            91
                0             3
        male    0            77
                1            45
2       female  1            70
                0             6
        male    0            91
                1            17
3       female  0            72
                1            72
        male    0           300
                1            47
Name: Survived, dtype: int64

In [6]:
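# Survival counts per (Pclass, Sex) group, normalized so that each row sums to 1
survival_counts = pd.crosstab([titanic.Pclass, titanic.Sex], titanic.Survived.astype(float))
survival_counts.div(survival_counts.sum(1).astype(float), 0)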

| Pclass | Sex | Survived = 0.0 | Survived = 1.0 |
|---|---|---|---|
| 1 | female | 0.031915 | 0.968085 |
| 1 | male | 0.631148 | 0.368852 |
| 2 | female | 0.078947 | 0.921053 |
| 2 | male | 0.842593 | 0.157407 |
| 3 | female | 0.5 | 0.5 |
| 3 | male | 0.864553 | 0.135447 |


## Data Munging 

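We start by renaming the target column `Survived` to `class`, the name TPOT conventionally uses for the response variable. The rename is not strictly required in this tutorial, because the features and the target are passed to `fit` as separate arrays later on.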

In [7]:
titanic.rename(columns={'Survived': 'class'}, inplace=True)

At present, TPOT requires all the data to be in numerical format. As we can see below, our data set has 5 categorical variables which contain non-numerical values: `Name`, `Sex`, `Ticket`, `Cabin` and `Embarked`.

In [8]:
titanic.dtypes

PassengerId      int64
class            int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
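The non-numeric columns can also be listed programmatically instead of being read off the `dtypes` output. The following is a minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the Titanic data), using pandas' `select_dtypes`.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Embarked": ["S", "C", "S"],
})

# Columns whose dtype is not numeric must be encoded or dropped before fitting.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Applied to the real frame, the same call should return the five columns named above.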

We then check the number of levels in each of the five categorical variables.

In [9]:
for cat in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
    # unique() counts NaN as a level of its own
    print("Number of levels in category '{0}': {1}".format(cat, titanic[cat].unique().size))

Number of levels in category 'Name': 891
Number of levels in category 'Sex': 2
Number of levels in category 'Ticket': 681
Number of levels in category 'Cabin': 148
Number of levels in category 'Embarked': 4


As we can see, `Sex` and `Embarked` have few levels. Let's find out what they are.

In [10]:
for cat in ['Sex', 'Embarked']:
    print("Levels for category '{0}': {1}".format(cat, titanic[cat].unique()))

Levels for category 'Sex': ['male' 'female']
Levels for category 'Embarked': ['S' 'C' 'Q' nan]
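`Embarked` shows four levels because `unique()` treats the missing value as a level of its own. The sketch below, on a made-up series (not the real column), contrasts this with `nunique()`, which ignores missing values by default.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative data only).
embarked = pd.Series(["S", "C", "Q", np.nan, "S"])

levels_with_nan = embarked.unique().size   # NaN is counted as a level
levels_without_nan = embarked.nunique()    # NaN is ignored by default
print(levels_with_nan, levels_without_nan)
```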


We then encode these levels manually as numerical values. Missing values (`nan`) are replaced with a placeholder value (-999), and we apply this replacement to the entire data set.

In [11]:
titanic['Sex'] = titanic['Sex'].map({'male':0,'female':1})
titanic['Embarked'] = titanic['Embarked'].map({'S':0,'C':1,'Q':2})

In [12]:
titanic = titanic.fillna(-999)
pd.isnull(titanic).any()

PassengerId    False
class          False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool
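The two steps interact: `map` returns `NaN` for any value that has no key in the dictionary, and `fillna` then turns those into the placeholder. A minimal sketch on a made-up column (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# Values absent from the mapping (including NaN) come out as NaN.
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Replace the remaining missing values with the placeholder.
toy = toy.fillna(-999)
print(toy["Embarked"].tolist())
```

Note that a model treats -999 as an ordinary number, so the placeholder is a simplification rather than a principled imputation.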

Since `Name` and `Ticket` have so many levels, we drop them from our analysis for the sake of simplicity. For `Cabin`, we encode each level as a binary indicator column using Scikit-learn's [`MultiLabelBinarizer`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) and treat these columns as new features.

In [13]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
CabinTrans = mlb.fit_transform([{str(val)} for val in titanic['Cabin'].values])

In [14]:
CabinTrans

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])
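To make the encoding concrete, here is a small sketch on made-up cabin values (the `cabins` list and the name `mlb_toy` are assumptions for illustration). Each sample is wrapped in a one-element set, so every row of the result contains exactly one 1.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy cabin values; "-999" stands for a missing cabin after the fillna step.
cabins = ["-999", "C85", "-999", "C123"]

mlb_toy = MultiLabelBinarizer()
# One-element sets give one indicator column per distinct cabin label.
encoded = mlb_toy.fit_transform([{c} for c in cabins])

print(mlb_toy.classes_)  # column labels, in sorted order
print(encoded)
```

Because the labels are sorted, the placeholder `-999` lands in the first column, which is consistent with the leading 1s in the `CabinTrans` output above.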

We drop the unused features from the feature set, along with the target column `class`, which must not be used as a predictor.

In [15]:
titanic_new = titanic.drop(['Name','Ticket','Cabin','class'], axis=1)

In [16]:
assert len(titanic['Cabin'].unique()) == len(mlb.classes_), "Not Equal"  # check that every cabin level got its own column

We then add the encoded features to form the final dataset to be used with TPOT. 

In [17]:
titanic_new = np.hstack((titanic_new.values,CabinTrans))

In [18]:
np.isnan(titanic_new).any()

False

Keeping in mind that the final dataset is in the form of a numpy array, we can check the number of features in the final dataset as follows.

In [19]:
titanic_new[0].size

156
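The figure of 156 is the sum of the 8 columns that remain after the drop (`PassengerId`, `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, `Embarked`) and the 148 cabin indicator columns. A minimal sketch with placeholder arrays of the same widths (the zero-filled arrays are stand-ins, not the real data):

```python
import numpy as np

n_rows = 5
base = np.zeros((n_rows, 8))      # stands in for the 8 retained columns
cabin = np.zeros((n_rows, 148))   # stands in for the 148 cabin indicators

# hstack concatenates column-wise, so the widths add up.
combined = np.hstack((base, cabin))
print(combined.shape)
```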

Finally, we store the class labels, which we need to predict, in a separate variable.

In [20]:
titanic_class = titanic['class'].values

## Data Analysis using TPOT

To begin our analysis, we need to divide our training data into training and validation sets. The validation set is just to give us an idea of the test set error. TPOT handles model selection and tuning internally through cross-validation on the training data, so creating this validation set is optional.

In [21]:
training_indices, validation_indices = train_test_split(titanic.index, stratify=titanic_class, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size

(668, 223)
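Passing `stratify` keeps the class proportions the same in both subsets. The sketch below illustrates this on made-up labels (the `labels` array and the fixed `random_state` are assumptions added for reproducibility, not part of the tutorial's split).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 60 negatives and 40 positives (illustrative only).
labels = np.array([0] * 60 + [1] * 40)
indices = np.arange(labels.size)

train_idx, valid_idx = train_test_split(
    indices, stratify=labels, train_size=0.75, test_size=0.25, random_state=0
)

# Both subsets keep the original 40% share of positives.
print(labels[train_idx].mean(), labels[valid_idx].mean())
```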

After that, we call the `fit`, `score` and `export` functions on our training dataset. To get a better idea of how these functions work, refer to the TPOT documentation [here](http://epistasislab.github.io/tpot/api/).

An important TPOT parameter is the number of generations (`generations`, 100 by default). On a standard laptop with 4GB RAM, a generation takes roughly 5 minutes, so the default of 100 generations could run for roughly 8 hours. Since our aim is just to illustrate the use of TPOT, the cell below does not set `generations` at all. Instead it caps the total run time at 2 minutes with `max_time_mins`, limits each pipeline evaluation with `max_eval_time_mins`, and uses a population size of 40.
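The run-time estimate is simple arithmetic on the two rough figures quoted above. A short sketch (both inputs are the tutorial's approximations, not measurements):

```python
minutes_per_generation = 5   # rough figure quoted above
generations = 100            # default value quoted above

total_minutes = minutes_per_generation * generations
total_hours = total_minutes / 60
print(f"{total_minutes} minutes is about {total_hours:.1f} hours")
```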

In [22]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=2, max_eval_time_mins=0.04, population_size=40)
tpot.fit(titanic_new[training_indices], titanic_class[training_indices])


Generation 1 - Current best internal CV score: 0.8083010178846786
Generation 2 - Current best internal CV score: 0.8097935551981115

2.00457195 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: LogisticRegression(input_matrix, C=10.0, dual=False, penalty=l2)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=1000000,
        max_eval_time_mins=0.04, max_time_mins=2, memory=None,
        mutation_rate=0.9, n_jobs=1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=40,
        random_state=None, scoring=None, subsample=1.0, use_dask=False,
        verbosity=2, warm_start=False)

In [23]:
tpot.score(titanic_new[validation_indices], titanic.loc[validation_indices, 'class'].values)

0.7982062780269058

In [24]:
pipelines = pd.DataFrame(tpot.evaluated_individuals_).T
pipelines[:8]

pipeline,crossover_count,generation,internal_cv_score,mutation_count,operator_count,predecessor
"BernoulliNB(GaussianNB(input_matrix), BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=True)",0,0,0.720081,0,2,"(ROOT,)"
"GaussianNB(RBFSampler(input_matrix, RBFSampler__gamma=0.30000000000000004))",0,0,0.567412,0,2,"(ROOT,)"
"GaussianNB(PCA(input_matrix, PCA__iterated_power=9, PCA__svd_solver=randomized))",0,0,0.649561,0,2,"(ROOT,)"
"GradientBoostingClassifier(Normalizer(input_matrix, Normalizer__norm=l2), GradientBoostingClassifier__learning_rate=0.001, GradientBoostingClassifier__max_depth=1, GradientBoostingClassifier__max_features=1.0, GradientBoostingClassifier__min_samples_leaf=6, GradientBoostingClassifier__min_samples_split=11, GradientBoostingClassifier__n_estimators=100, GradientBoostingClassifier__subsample=0.8500000000000001)",0,0,0.616768,0,2,"(ROOT,)"
"LogisticRegression(input_matrix, LogisticRegression__C=0.5, LogisticRegression__dual=False, LogisticRegression__penalty=l1)",0,0,0.796371,0,1,"(ROOT,)"
"ExtraTreesClassifier(BernoulliNB(input_matrix, BernoulliNB__alpha=100.0, BernoulliNB__fit_prior=True), ExtraTreesClassifier__bootstrap=False, ExtraTreesClassifier__criterion=entropy, ExtraTreesClassifier__max_features=0.1, ExtraTreesClassifier__min_samples_leaf=18, ExtraTreesClassifier__min_samples_split=9, ExtraTreesClassifier__n_estimators=100)",0,0,0.691688,0,2,"(ROOT,)"
"DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=2, DecisionTreeClassifier__min_samples_leaf=4, DecisionTreeClassifier__min_samples_split=3)",0,0,0.766486,0,1,"(ROOT,)"
"GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=0.01, GradientBoostingClassifier__max_depth=9, GradientBoostingClassifier__max_features=0.7500000000000001, GradientBoostingClassifier__min_samples_leaf=4, GradientBoostingClassifier__min_samples_split=8, GradientBoostingClassifier__n_estimators=100, GradientBoostingClassifier__subsample=0.35000000000000003)",0,0,-inf,0,1,"(ROOT,)"
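The last row above has a score of `-inf`, which appears to mark a pipeline that could not be scored (for example, one that exceeded `max_eval_time_mins`; this interpretation is an assumption). To rank the evaluated pipelines, such rows can be filtered out first. A minimal sketch on a made-up frame shaped like `pipelines` (names and scores are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for pd.DataFrame(tpot.evaluated_individuals_).T
toy_pipelines = pd.DataFrame(
    {"internal_cv_score": [0.72, -np.inf, 0.80, 0.77], "operator_count": [2, 1, 1, 1]},
    index=["pipeline_a", "pipeline_b", "pipeline_c", "pipeline_d"],
)

# Cast to float in case the column has object dtype, then keep finite scores.
scores = toy_pipelines["internal_cv_score"].astype(float)
valid = toy_pipelines[np.isfinite(scores)]

# Highest internal CV score first.
ranked = valid.sort_values("internal_cv_score", ascending=False)
print(ranked.index.tolist())
```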


In [25]:
tpot.export('tpot_titanic_pipeline.py')

The exported file contains the Python code for the best pipeline that TPOT found. In this run, that pipeline is a logistic regression classifier (`LogisticRegression` with `C=10.0` and an L2 penalty), as reported in the output above. Because the search is stochastic and was capped at 2 minutes, a different run may select a different model, and a longer run may improve the score further.
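The best pipeline reported in the TPOT output above is `LogisticRegression(input_matrix, C=10.0, dual=False, penalty=l2)`. The sketch below fits a scikit-learn model with the same `C` on synthetic data. It is not the contents of the exported file: the data from `make_classification`, the fixed `random_state`, and the raised `max_iter` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared Titanic features (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=0
)

# C matches the reported pipeline; the L2 penalty and dual=False are
# scikit-learn's defaults. max_iter is raised only to help convergence.
model = LogisticRegression(C=10.0, max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_valid, y_valid)
print(accuracy)
```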