# Random Forest CLF on the Titanic Dataset

![Imgur](https://i.imgur.com/8pw2dfj.jpg)

Photo by <a href="https://unsplash.com/@kmitchhodge?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">K. Mitch Hodge</a> on <a href="https://unsplash.com/s/photos/titanic?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  

## Index

1. Basic EDA
    - 1.1 Comparative piecharts
    - 1.2 Stacked bar charts
    

2. Feature Engineering 
    - 2.1 Extracting Salutations from Names
    - 2.2 Making a 'Family' column
    - 2.3 Using Mean Salutation ages to compute missing values in Age
    
  
3. Data Preprocessing
    - 3.1 Onehot and Label Encoding 
    

4. Random Forest CLF
    - 4.1 Train test split
    - 4.2 Model Training
    - 4.3 Model Score
    



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import warnings
warnings.filterwarnings("ignore")
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
    

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Combining Training and Test set. We'll separate them later.

train_path = '/kaggle/input/titanic/train.csv'
test_path = '/kaggle/input/titanic/test.csv'
train_df = pd.read_csv(train_path, encoding = "utf-8")
test_df = pd.read_csv(test_path, encoding="utf-8")
main_df = pd.concat([train_df, test_df])

In [None]:
main_df.info()

## 1. Exploratory Data Analysis  

Let's split the dataset into survivors and victims and see what the data looks like.

![Imgur](https://i.imgur.com/dduSGYD.png)
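
As a numeric complement to the charts that follow, survival rates per group can be read straight off a `groupby`. The sketch below uses a small made-up frame (`toy`, a hypothetical name), not the Kaggle data. It also illustrates that rows with a missing `Survived` value, which is what the test rows look like after concatenation, end up in neither group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN in Survived mimics rows that came from test.csv.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, np.nan],
})

# Same filters as in the notebook: NaN matches neither condition.
toy_survived = toy[toy["Survived"] == 1]
toy_died = toy[toy["Survived"] == 0]

# The mean of a 0/1 column is the survival rate; NaN rows are skipped by default.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(len(toy_survived), len(toy_died))
print(rate_by_sex)
```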

In [None]:
survived = main_df[main_df["Survived"] == 1]
died = main_df[main_df["Survived"] == 0]

### 1.1 Comparative Pie Charts

In [None]:
labels = ['Survived', 'Died']
values = [len(survived),
        len(died)]
# One pie chart and one bar chart side by side, each spanning two of the four grid columns.
fig = make_subplots(
    rows=1, cols=4,
    specs=[[{'rowspan': 1, 'colspan': 2, 'type': 'domain'}, {}, {'rowspan': 1, 'colspan': 2, 'type': 'bar'}, {}]],
    print_grid=False)


fig.add_trace(go.Pie(labels=labels, values=values, hole=.5, marker={"colors": ['#035E7B', '#DB2B39']}), row=1, col=1)
fig.add_trace(go.Bar(x=labels, y=values, marker_color=['#035E7B', '#DB2B39']), row=1, col=3)
fig.update_layout(title = "Fate of Titanic Passengers")

fig.show()

In [None]:
#pie charts

def getLen(dataset, category, value):
    return len(dataset[dataset[category] == value])


label1 = ["Male", "Female"]
label2 = label1
val1 = [getLen(survived, "Sex", "male"), getLen(survived, "Sex", "female")]
val2 = [getLen(died, "Sex", "male"), getLen(died, "Sex", "female")]


label3 = ["Pclass 1", "Pclass 2", "Pclass 3"]
label4 = label3
val3 = [getLen(survived, "Pclass", 1), getLen(
    survived, "Pclass", 2), getLen(survived, "Pclass", 3)]
val4 = [getLen(died, "Pclass", 1), getLen(
    died, "Pclass", 2), getLen(died, "Pclass", 3)]


label5 = ["Cherbourg", "Queenstown", "Southampton"]
label6 = label5
val5 = [getLen(survived, "Embarked", "C"), getLen(
    survived, "Embarked", "Q"), getLen(survived, "Embarked", "S")]
val6 = [getLen(died, "Embarked", "C"), getLen(
    died, "Embarked", "Q"), getLen(died, "Embarked", "S")]
#### Configuring Subplot Grid ##################################

fig = make_subplots(
    rows=9, cols=4,
    specs=[[{'rowspan': 3, 'colspan' : 2, 'type':'domain'},{}, {'rowspan': 3, 'colspan' : 2, 'type':'domain'}, {}],
           [{'type':'domain'},{'type':'domain'}, {'type':'domain'}, {'type':'domain'}],
           [{'type':'domain'}, {'type':'domain'}, {'type':'domain'}, {'type':'domain'}],
           [{'rowspan': 3, 'colspan' : 2, 'type':'domain'}, {'type':'domain'}, {'rowspan': 3, 'colspan' : 2, 'type':'domain'}, {'type':'domain'}],
           [{'type':'domain'}, {'type':'domain'}, {'type':'domain'}, {'type':'domain'}],
          [{'type':'domain'},{'type':'domain'},{'type':'domain'}, {'type':'domain'}],
          [{'rowspan': 3, 'colspan' : 2, 'type':'domain'},{'type':'domain'},{'rowspan': 3, 'colspan' : 2, 'type':'domain'}, {'type':'domain'}],
          [{'type':'domain'},{'type':'domain'},{'type':'domain'}, {'type':'domain'}],
          [{'type':'domain'},{'type':'domain'},{'type':'domain'}, {'type':'domain'}]],
    print_grid=False)



############# Adding Subplots ##################################

fig.add_trace(go.Pie(labels=label1, values=val1,textinfo='label+percent', insidetextorientation='radial'),
              row = 1, col = 1) 
fig.add_trace(go.Pie(labels=label2, values=val2, textinfo='label+percent', insidetextorientation='radial'),
              row = 1, col = 3)
fig.add_trace(go.Pie(labels=label3, values=val3,textinfo='label+percent', insidetextorientation='radial'),
              4, 1)
fig.add_trace(go.Pie(labels=label4, values=val4, textinfo='label+percent', insidetextorientation='radial'),
              4, 3)
fig.add_trace(go.Pie(labels=label5, values=val5,textinfo='label+percent', insidetextorientation='radial'),
              7, 1)
fig.add_trace(go.Pie(labels=label6, values=val6, textinfo='label+percent', insidetextorientation='radial'),
              7, 3)

fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_traces(hole=.25, hoverinfo="label+percent+name")
fig.update_layout(height=1000, width=1000, showlegend=False, title_text="Titanic at first glance.",
                  annotations=[dict(text='Survived', x=0.02, y=1, font_size=20, showarrow=False),
                               dict(text='Died', x=0.98, y=1,
                                    font_size=20, showarrow=False),
                               dict(text='Sex', x=0.495, y=0.88,
                                    font_size=20, showarrow=False),
                               dict(text='Pass. Class', x=0.495, y=0.5,
                                    font_size=20, showarrow=False),
                               dict(text='Embarked', x=0.495, y=0.15, font_size=20, showarrow=False)],
                  font=dict(
                      family="Fira Code, monospace",  # I like this font.
                      size=12,
                      color="black"
                  ))
fig.show()



### 1.2 Class to generate Stacked Bar Charts.

In [None]:
def get_em_data(dataset, town):
    return dataset[dataset["Embarked"] == town]

def get_pclass(dataset, cl):
    return len(dataset[dataset["Pclass"] == cl])/len(dataset) * 100

def get_fare_bins(dataset, bin):
    return len(dataset[dataset["Fare_Bins"] == bin])/len(dataset) * 100


class StackedBars:
    def __init__(self, dataset, get_function, x_labels, y_labels, config, colors):
        self.dataset = dataset
        self.get_function = get_function
        self.x_labels, self.y_labels = x_labels, y_labels
        self.colors = colors
        self.config = config

    def generate_figure(self, startIndex):
        fig = go.Figure()
        # One trace per category. The category value passed to get_function
        # starts at startIndex (1 for Pclass, 0 for the bin columns).
        for offset, label in enumerate(self.x_labels):
            fig.add_trace(go.Bar(
                y=self.y_labels,
                x=[self.get_function(subset, startIndex + offset) for subset in self.dataset],
                name=label,
                orientation='h',
                marker=dict(
                    color=self.colors[offset],
                    line=dict(color='black', width=1)
                )
            ))

        fig.update_layout(
            barmode='stack', title=self.config["title"], xaxis=self.config["xaxis"], font={'family': 'Fira Code, monospace', 'size': 12, 'color': 'black'})
        fig.show()


In [None]:
y_labels = ["Southampton", "Queenstown", "Cherbourg"]


colors_survived = ['rgba(20, 111, 163, 0.3)',
                   'rgba(20, 111, 163, 0.7)', 'rgba(20, 111, 163, 1)']
colors_died = ['rgba(240, 45, 58, 0.3)',
               'rgba(240, 45, 58, 0.7)', 'rgba(240, 45, 58, 1)']

config_survived = {"title": "Survived",
          "xaxis": {'title': {'text': "% of Passengers"}}}

config_died = {"title": "Died",
                "xaxis": {'title': {'text': "% of Passengers"}}}


s_survived, q_survived, c_survived = get_em_data(
    survived, "S"), get_em_data(survived, "Q"), get_em_data(survived, "C")
s_died, q_died, c_died = get_em_data(died, "S"), get_em_data(
    died, "Q"), get_em_data(died, "C")

x_labels = ["Pclass 1", "Pclass 2", "Pclass 3"]  
dataset_survived = [s_survived, q_survived, c_survived]
dataset_died = [s_died, q_died, c_died]


pclass_bars = StackedBars(dataset = dataset_survived, x_labels = x_labels, y_labels = y_labels, get_function= get_pclass, colors = colors_survived, config = config_survived)


pclass_bars2 = StackedBars(dataset=dataset_died, x_labels=x_labels, y_labels=y_labels,
                          get_function=get_pclass, colors=colors_died, config=config_died)
                          
pclass_bars.generate_figure(1)
pclass_bars2.generate_figure(1)


## 2. Feature Engineering
Let's take a look at our preprocessing map!

![Imgur](https://i.imgur.com/WNGbNOn.png)

### 2.1 Custom Preprocessing/Feature Engineering functions.

In [None]:
import re
from sklearn.preprocessing import Normalizer, StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

def label_sal(x): #assigns weighted labels to Salutations.
    if x in ["Mr.", "Jonkheer."]:
        return 0
    elif x in ["Miss.", "Ms."]:
        return 0
    elif x == "Mrs.":
        return 0
    elif x == "Master.":
        return 3  # own group (young boys), matching the 3.0 case in the Age imputation
    elif x == "Rev.":
        return 1
    elif x == "Dr.":
        return 1
    elif x == "Don.":
        return 2.5
    elif x == "Dona.":
        return 2.5
    elif x == "Sir.":
        return 2.5
    elif x in ["Lady.", "Mlle."]:
        return 2.5
    elif x == "Capt.":
        return 1
    elif x == "Major.":
        return 2
    elif x == "Col.":
        return 2
    else:
        return 0

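def cap_sal(x): #Extracts the Salutation from a Name and returns its weight.
    # Matches e.g. ", Mr." in "Braund, Mr. Owen Harris": a comma, then everything up to the next space.
    pattern = re.search(r',.+?(?= )', x)
    sal = pattern.group()
    x = label_sal(sal[2:]) # drop the leading ", "
    return x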


def bin_fare(x): #Creates bins for Fare
    if x < 50:
        return 0
    elif 50 <= x < 100:
        return 1
    elif 100 <= x < 150:
        return 2
    elif 150 <= x < 200:
        return 3
    else:
        return 4


def one_hot_gender(x):#Binary encoding for Sex: male -> 0, anything else -> 1
    if x == "male":
        return 0
    else:
        return 1


def age_bin(x): #Bins for Age groups.
    if x < 10:
        return 0
    elif 10 <= x < 18:
        return 1
    elif 18 <= x < 30:
        return 2
    elif 30 <= x < 40:
        return 3
    elif 40 <= x < 50:
        return 4
    elif 50 <= x < 60:
        return 5
    else:
        return 6


### 2.2 Let's apply our functions to raw data.

We'll go column by column and understand why we are applying preprocessing functions to them.

In [None]:
main_df.head()

### Salutations

Salutations have been extracted from the Name column of the dataset. Salutations imply social hierarchy: someone addressed as "Mr." is more likely to have travelled in 3rd Class than someone addressed as "Don.". I have assigned different weights to the Salutations so that there is a clear separation.
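
To make the extraction step concrete, here is a minimal standalone sketch. It assumes names follow the "Surname, Title. Given names" layout; `TITLE_RE` and `extract_title` are hypothetical names used only for illustration, and the pattern is a slightly stricter variant of the one in `cap_sal`, not a replacement for it.

```python
import re

# Assumption: the title is the first token after the comma and ends with a period.
TITLE_RE = re.compile(r",\s*([A-Za-z]+\.)")

def extract_title(name):
    """Return the title (e.g. 'Mr.') or None if the name has no such token."""
    match = TITLE_RE.search(name)
    return match.group(1) if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Palsson, Master. Gosta Leonard",
]
titles = [extract_title(n) for n in names]
print(titles)
```

Returning `None` for names that do not fit the layout avoids an `AttributeError` on a failed match, at the cost of having to handle `None` downstream.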

In [None]:
main_df["Salutations"] = [cap_sal(x) for x in main_df["Name"]]
main_df = main_df.drop(columns= ["Name","Cabin", "Ticket"])

### Parch and SibSp
Let's combine Parch and SibSp to give a family count. We add 1 so that the passenger is counted as well.

In [None]:
main_df["Family"] = main_df["SibSp"] + main_df["Parch"] + 1
main_df = main_df.drop(columns=["Parch", "SibSp"])

## 3. Preprocessing

### 3.1 Dealing with Age

The Age column has many missing values, so we need to impute them before training. Initially I filled the gaps with the global mean age using scikit-learn's imputer. A better approach is to use the mean age of each Salutation group. For example, passengers addressed as "Master." have a mean age of 5, so imputing their ages with the global mean is bound to give bad results. I therefore fill every missing Age with the mean age of that passenger's Salutation group.
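
The idea can be sketched on toy data as follows. The notebook itself hardcodes rounded group means; this alternative derives them with `groupby(...).transform("mean")`. The frame `toy`, its values, and the column `Age_imputed` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 3.0 stands in for a salutation group of children.
toy = pd.DataFrame({
    "Salutations": [0.0, 0.0, 0.0, 3.0, 3.0, 3.0],
    "Age": [28.0, 32.0, np.nan, 4.0, 6.0, np.nan],
})

# Mean age of each group, broadcast back to every row of that group.
group_mean_age = toy.groupby("Salutations")["Age"].transform("mean")

# Fill only the missing entries; known ages are left untouched.
toy["Age_imputed"] = toy["Age"].fillna(group_mean_age)
print(toy)
```

With a global mean, both missing values would have been filled with 17.5, which fits neither the adults nor the children in this toy frame.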

In [None]:
sal_grp = main_df.groupby("Salutations").mean(numeric_only=True) # numeric_only skips text columns such as Sex and Embarked
sal_grp.head()

In [None]:
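# Fill each missing Age with the rounded mean age of the passenger's Salutations group (see sal_grp above).
age_na = main_df[main_df["Age"].isna()].copy() # .copy() so the assignments below act on an independent frame
age_na.loc[age_na["Salutations"] == 0.0, "Age"] = 30
age_na.loc[age_na["Salutations"] == 1.0, "Age"] = 44
age_na.loc[age_na["Salutations"] == 2.0, "Age"] = 52
age_na.loc[age_na["Salutations"] == 2.5, "Age"] = 37
age_na.loc[age_na["Salutations"] == 3.0, "Age"] = 5
sep_df = main_df[main_df["Age"] > 0.0] # rows whose Age was already known
main_df = pd.concat([age_na, sep_df]) # note: imputed rows now come first, so the original row order is lost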

In [None]:
main_df["Age_Bin"] = [age_bin(x) for x in main_df["Age"]] # Making an Age bin.
main_df = main_df.drop(columns=["Age"])
px.histogram(main_df, x = "Age_Bin")

### 3.2 Why do we need bins for Fare?

Fare has a long right tail: most tickets cost under 50, while a few cost several hundred. These outliers can hurt model accuracy, and binning the fares limits their influence.
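
As a sketch of the same binning done in a vectorised way, `pd.cut` with `right=False` reproduces the left-closed brackets used by `bin_fare`. The sample fares and the names `fares`, `edges` and `fare_bins` are made up for illustration.

```python
import numpy as np
import pandas as pd

fares = pd.Series([7.25, 50.0, 99.9, 150.0, 512.33, np.nan])

# Left-closed intervals: [-inf, 50), [50, 100), [100, 150), [150, 200), [200, inf)
edges = [-np.inf, 50, 100, 150, 200, np.inf]

# labels=False returns the integer bin index instead of an Interval label.
fare_bins = pd.cut(fares, bins=edges, right=False, labels=False)
print(fare_bins.tolist())
```

One difference is worth knowing: `bin_fare` sends a missing fare to the last bin, because every comparison with NaN is false and the `else` branch runs, whereas `pd.cut` leaves it as NaN so it can be handled explicitly.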

In [None]:
px.histogram(main_df, x = "Fare")

In [None]:
main_df["Fare_Bins"] = [bin_fare(x) for x in main_df["Fare"]]
main_df = main_df.drop(columns = ["Fare"])
px.histogram(main_df, x = "Fare_Bins")

### 3.3 Onehot and Label Encoding.

We'll encode the categorical columns: Sex becomes a 0/1 flag and Embarked is label encoded.
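
For comparison, here is a small sketch of label encoding next to true one-hot encoding on a made-up `ports` series. Label encoding assigns one integer per category in sorted order, which implies an ordering that does not really exist. One-hot encoding avoids that by creating one indicator column per category. Tree-based models usually cope with either, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of embarkation ports, including the "Unknown" placeholder.
ports = pd.Series(["S", "C", "Q", "S", "Unknown"], name="Embarked")

# Label encoding: one integer per category, assigned in sorted order of the labels.
encoder = LabelEncoder()
codes = encoder.fit_transform(ports)
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(ports, prefix="Embarked")
print(one_hot.columns.tolist())
```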

In [None]:
main_df["Sex"] = [one_hot_gender(x) for x in main_df["Sex"]] #Encoding Sex as 0 (male) / 1 (female)

In [None]:
from sklearn.preprocessing import LabelEncoder
main_df["Embarked"] = main_df["Embarked"].replace(np.nan, "Unknown")
lenc = LabelEncoder()
main_df["Embarked"] = lenc.fit_transform(main_df["Embarked"])


In [None]:
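# Separating preprocessed test data. The Age imputation reordered the rows, so the test rows
# are selected by their missing Survived label rather than by position.
main_df_test = main_df[main_df["Survived"].isna()]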
main_df_1 = main_df.drop(columns = ["PassengerId"])
main_df_1 = main_df_1.dropna() # Preprocessed Training Data

## Some last minute EDA on Processed Data

In [None]:
survived = main_df_1[main_df_1["Survived"] == 1]
died = main_df_1[main_df_1["Survived"] == 0]

In [None]:
config = {"title": "Fare Brackets",
          "xaxis": {'domain': [0.0, 1.0], 'title': {'text': "% of Passengers"}},
          "font": {'family': 'Fira Code, monospace', 'size': 12, 'color': 'black'}}
colors = ['#F8B4C4', '#EF577B', '#E01544', '#950E2D', '#5D091C']
bars = StackedBars(dataset = [survived, died], get_function = get_fare_bins, x_labels=[
                   "<50", "50 <= x < 100", "100 <= x < 150", "150 <= x < 200", ">200"],
                   y_labels = ["Survived", 'Died'], config = config, colors = colors)
bars.generate_figure(0)

In [None]:
survived_male = survived[survived["Sex"] == 0]
survived_female = survived[survived["Sex"] == 1]

died_male = died[died["Sex"] == 0]
died_female = died[died["Sex"] == 1]


def get_age_bins(dataset, bin):
    return len(dataset[dataset["Age_Bin"] == bin])


In [None]:
config = {"title": "Age Brackets",
          "xaxis": {'domain': [0.0, 1.0], 'title': {'text': "Count of Passengers"}}}
          
colors = ['#FFC2C2', '#FF7070', '#FF3333',
          '#F50000', '#B80000', '#8F0000', '#520000']

bars = StackedBars(dataset=[survived_male, survived_female, died_male, died_female], get_function=get_age_bins, x_labels=[
                   "<10", "10 <= Age < 18", "18 <= Age < 30", "30 <= Age < 40", "40 <= Age < 50", "50 <= Age < 60", "> 60"],
                   y_labels=["Male Survivors", "Female Survivors", "Male Victims", 'Female Victims'], config=config, colors=colors)
bars.generate_figure(0)


## 4. Machine Learning.


![Imgur](https://i.imgur.com/8HaxQfM.png)

### 4.1 Train Test Split.

In [None]:
from sklearn.model_selection import train_test_split
X, y = main_df_1.drop(columns=["Survived"]), main_df_1["Survived"] # features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

X_train.shape

In [None]:
X_train.head()

### 4.2 Model Training

In [None]:
# Hyperparameter grid searched by GridSearchCV below
param_forest= {'max_depth': [5, 6, 7],
         'n_estimators': [75, 80, 85, 90],
         'max_features': [5, 3, 2, 1, None],
         'criterion': ['gini'],
         'bootstrap': [True, False]}

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

search_clf = GridSearchCV(RandomForestClassifier(), param_forest, cv = 5)
search_clf.fit(X_train, y_train)
y_pred  = search_clf.predict(X_test)

In [None]:
search_clf.best_params_
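
The sketch below shows the same search pattern on synthetic data, with two optional additions that are not part of the notebook: `stratify=y` keeps the class balance equal in both splits, and `random_state` makes the run repeatable. The data, the small grid and all variable names are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Titanic features (7 columns, binary target).
X, y = make_classification(n_samples=300, n_features=7, n_informative=4, random_state=0)

# Stratified, repeatable 70/30 split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

small_grid = {"max_depth": [3, 5], "n_estimators": [20, 40]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_  # mean accuracy over the 5 validation folds
holdout_accuracy = accuracy_score(y_te, search.predict(X_te))
print(search.best_params_, round(cv_accuracy, 3), round(holdout_accuracy, 3))
```

Comparing `best_score_` with the hold-out accuracy gives a rough check on whether the tuned model generalises beyond the folds it was selected on.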

### 4.3 Model Accuracy

In [None]:
print("Hold-out (test split) accuracy for model is", accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
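
To read the matrix printed above: for binary labels `[0, 1]`, scikit-learn lays it out as `[[TN, FP], [FN, TP]]`. The following sketch, using made-up labels, unpacks the four counts and derives accuracy, precision and recall from them.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = survived, 0 = died.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted survivors, how many really survived
recall = tp / (tp + fn)     # of the actual survivors, how many were predicted
print(tn, fp, fn, tp, accuracy, precision, recall)
```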

### Final Thoughts
I spent almost two weeks understanding the data, preprocessing it and tuning the hyperparameters. A lot of that time went into reading the discussion forums. Over the course of 10+ submissions, I reached the Top 17%. Let's see if I can do better next time!