# Assignment 2: **Machine learning with tree-based models**

In this assignment, you will work on the **Titanic** dataset and use machine learning to create a model that predicts which passengers survived the **Titanic** shipwreck. 

---
## About the dataset:
---
* The column named `Survived` is the label; the remaining columns are features.
* Some of the features are described below:
  <table>
  <thead>
    <tr>
      <th>Variable</th>
      <th>Definition </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Pclass</td>
      <td>Ticket class</td>
    </tr>
    <tr>
      <td>SibSp</td>
      <td>Number of siblings / spouses aboard the Titanic</td>
    </tr>
    <tr>
      <td>Parch</td>
      <td>Number of parents / children aboard the Titanic</td>
    </tr>
    <tr>
      <td>Ticket</td>
      <td>Ticket number</td>
    </tr>
    <tr>
      <td>Embarked</td>
      <td>Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton</td>
    </tr>
  </tbody>
  </table>

---
## Instructions
---
* Apply suitable data pre-processing techniques, if needed. 
* Implement a few classifiers and compare their performance using metrics and plots such as the ROC curve (ROC AUC) and the confusion matrix.

In [0]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [4]:
titanic_data = pd.read_csv('titanic.csv')
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
titanic_data.shape

(891, 12)

In [6]:
print(titanic_data.isna().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [7]:
# Handle missing data by filling each column with its most frequent value
from sklearn.impute import SimpleImputer
data = titanic_data.copy(deep=True)
imputer = SimpleImputer(strategy="most_frequent")
data.iloc[:,:] = imputer.fit_transform(titanic_data)
print(data.isna().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
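
The cell above fills every column, numeric or not, with its most frequent value. As a rough sketch of an alternative, numeric and categorical columns can be imputed separately, for example the median for `Age` and the most frequent value for `Embarked`. The toy frame and the names `toy`, `age_imputer`, `port_imputer` and `filled` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for two Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

# Median for the numeric column, most frequent value for the categorical one
age_imputer = SimpleImputer(strategy="median")
port_imputer = SimpleImputer(strategy="most_frequent")

filled = toy.copy()
filled["Age"] = age_imputer.fit_transform(toy[["Age"]]).ravel()
filled["Embarked"] = port_imputer.fit_transform(toy[["Embarked"]]).ravel()

print(filled)
```

Whether the median or the mode suits `Age` better is a modelling choice; the sketch only shows how to apply different strategies per column.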


In [0]:
labels = data.Survived
train = data.drop(columns=["Survived"])

## Preprocessing the data 

In [9]:
# Data conversion from objects and One-Hot Encoding
# converting Pclass
train["Pclass"] = train["Pclass"].astype("category")
train = pd.get_dummies(train, columns = ["Pclass"],prefix="Pclass")
train = pd.get_dummies(train, columns=["Embarked"] , prefix="Embarked")

# converting Sex
train = pd.get_dummies(train,columns=["Sex"])
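# Note: drop() returns a new DataFrame and the result is not assigned here,
# so Sex_female stays in train (it still appears in the output below)
train.drop("Sex_female",axis=1)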

# Sibsp and Parch
train['Travelpeople']=train["SibSp"]+train["Parch"]
train['TravelAlone']=np.where(train['Travelpeople']>0, 0, 1)

# Treating the names
# Raw string so that \. is passed to the regex engine as a literal period
train["Title"] = train["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
train['Title'] = train['Title'].replace([ 'Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona', 'Countess', 'Lady', 'Sir'], 'Rare')
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
title = pd.get_dummies(train["Title"])
train = pd.get_dummies(train,columns=["Title"])

# Replace each name with its length, then bin the lengths into short/medium/long
train_name = train["Name"]  # reference to the original name strings
train["Name"] = train["Name"].str.len()
    
bins = [0,25,40, np.inf]
mylabels = ['s_name', 'm_name', 'l_name',]
train["Name_len"] = pd.cut(train["Name"], bins, labels = mylabels)
Name_mapping = {'s_name': 1, 'm_name': 2 , 'l_name': 3}
train['Name_len'] = train['Name_len'].map(Name_mapping)


# working with the age of passengers
bins = [0, 5, 12, 18, 24, 35, 60, np.inf]
mylabels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels = mylabels)
age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3, 'Student': 4, 'Young Adult':5 , 'Adult': 6, 'Senior':7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)

# Fare feature
train['FareBand'] = pd.qcut(train['Fare'], 8, labels = [1, 2, 3, 4,5,6,7,8])

train = train.drop(columns=["Name","SibSp","Parch","Travelpeople"])

# Group tickets by the first character of the ticket string
train['Ticket1']=train['Ticket'].str[:1]
train['Ticket1']=train['Ticket1'].replace(['1', '2', '3', '4', '5', '6', '7', '8', '9'], 'N')
train['Ticket1']=train['Ticket1'].replace(['S','P', 'C', 'N'], ['S_Ticket', 'P_Ticket', 'C_Ticket','NumberTicket'])
train['Ticket1']=train['Ticket1'].replace(['A','W', 'F', 'L'], 'OtherTicket')

train = pd.get_dummies(train,columns=["Ticket1"])
train.drop("Ticket",inplace=True,axis=1)
train.head()

Unnamed: 0,PassengerId,Age,Fare,Cabin,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Sex_female,Sex_male,TravelAlone,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Rare,Name_len,AgeGroup,FareBand,Ticket1_C_Ticket,Ticket1_NumberTicket,Ticket1_OtherTicket,Ticket1_P_Ticket,Ticket1_S_Ticket
0,1,22.0,7.25,B96 B98,0,0,1,0,0,1,0,1,0,0,0,1,0,0,1,4,1,0,0,1,0,0
1,2,38.0,71.2833,C85,1,0,0,1,0,0,1,0,0,0,0,0,1,0,3,6,8,0,0,0,1,0
2,3,26.0,7.925,B96 B98,0,0,1,0,0,1,1,0,1,0,1,0,0,0,1,5,3,0,0,0,0,1
3,4,35.0,53.1,C123,1,0,0,0,0,1,1,0,0,0,0,0,1,0,3,5,7,0,1,0,0,0
4,5,35.0,8.05,B96 B98,0,0,1,0,0,1,0,1,1,0,0,1,0,0,1,5,3,0,1,0,0,0
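
To make the title, name-length and age-group steps easier to follow, here is a small sketch that applies the same regular expression and the same bin edges to three rows from the `head()` output above. The frame `toy` and the variables `name_bins` and `age_bins` are illustrative names, and the integer labels are passed to `pd.cut` directly instead of being mapped afterwards.

```python
import numpy as np
import pandas as pd

# Three names and ages taken from the head() output shown earlier
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "Age": [22.0, 26.0, 35.0],
})

# Title: the word that directly precedes the first period
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Name length binned with the same edges as above; right edges are inclusive
name_bins = [0, 25, 40, np.inf]
toy["Name_len"] = pd.cut(toy["Name"].str.len(), name_bins, labels=[1, 2, 3])

# Age groups with the same edges, labelled 1..7 directly
age_bins = [0, 5, 12, 18, 24, 35, 60, np.inf]
toy["AgeGroup"] = pd.cut(toy["Age"], age_bins, labels=[1, 2, 3, 4, 5, 6, 7])

print(toy[["Title", "Name_len", "AgeGroup"]])
```

The resulting `Name_len` and `AgeGroup` values agree with the corresponding rows of the output above.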


## Model development: Logistic Regression, Decision Tree and Random Forest

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train.drop(['PassengerId',"Cabin"], axis=1,inplace=True)
x_train, x_test, y_train, y_test = train_test_split(train, labels, test_size = 0.20, random_state = 0)
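
The split above is random with a fixed seed. One optional variation, sketched here on made-up labels rather than the Titanic data, is to pass `stratify` so that the class ratio is the same in the train and test parts. The arrays `X` and `y` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 zeros and 40 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 40)

# stratify=y keeps the 60/40 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test: ", y_te.mean())
```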

In [0]:
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression(random_state= 0, solver='lbfgs', max_iter=10000,C=10)
logmodel.fit(x_train, y_train)
y_pred = logmodel.predict(x_test)

In [12]:
from sklearn.metrics import roc_auc_score,confusion_matrix,classification_report

print("ROC score : \n",roc_auc_score(y_test,y_pred))
print("Accuracy score : \n",accuracy_score(y_test,y_pred))
print("Confusion Matrix : \n",confusion_matrix(y_test,y_pred))
print("Precision and Recall : \n",classification_report(y_test,y_pred))

ROC score : 
 0.7805006587615284
Accuracy score : 
 0.7932960893854749
Confusion Matrix : 
 [[92 18]
 [19 50]]
Precision and Recall : 
               precision    recall  f1-score   support

           0       0.83      0.84      0.83       110
           1       0.74      0.72      0.73        69

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179
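
The ROC score above is computed from hard 0/1 predictions, which reduces the ROC curve to a single operating point. A minimal sketch of the more common approach, scoring with predicted probabilities and extracting the points of the curve, is given below. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic features), so the numbers it prints are not comparable to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Titanic features)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class, one score per test row
scores = model.predict_proba(X_te)[:, 1]

auc_from_scores = roc_auc_score(y_te, scores)
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# Points of the ROC curve; plt.plot(fpr, tpr) would draw it
fpr, tpr, thresholds = roc_curve(y_te, scores)

print(f"AUC from probabilities: {auc_from_scores:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

With the notebook's own variables, the equivalent call would be `roc_auc_score(y_test, logmodel.predict_proba(x_test)[:, 1])`.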



In [0]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

dec_tree = DecisionTreeClassifier()
rf = RandomForestClassifier(n_estimators=1000)

In [14]:
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print("ROC score : \n",roc_auc_score(y_test,y_pred))
print("Accuracy score : \n",accuracy_score(y_test,y_pred))
print("Confusion Matrix : \n",confusion_matrix(y_test,y_pred))
print("Precision and Recall : \n",classification_report(y_test,y_pred))

ROC score : 
 0.7996706192358366
Accuracy score : 
 0.8268156424581006
Confusion Matrix : 
 [[101   9]
 [ 22  47]]
Precision and Recall : 
               precision    recall  f1-score   support

           0       0.82      0.92      0.87       110
           1       0.84      0.68      0.75        69

    accuracy                           0.83       179
   macro avg       0.83      0.80      0.81       179
weighted avg       0.83      0.83      0.82       179



In [15]:
dec_tree.fit(x_train, y_train)
y_pred = dec_tree.predict(x_test)
print("ROC score : \n",roc_auc_score(y_test,y_pred))
print("Accuracy score : \n",accuracy_score(y_test,y_pred))
print("Confusion Matrix : \n",confusion_matrix(y_test,y_pred))
print("Precision and Recall : \n",classification_report(y_test,y_pred))

ROC score : 
 0.7823451910408431
Accuracy score : 
 0.7988826815642458
Confusion Matrix : 
 [[94 16]
 [20 49]]
Precision and Recall : 
               precision    recall  f1-score   support

           0       0.82      0.85      0.84       110
           1       0.75      0.71      0.73        69

    accuracy                           0.80       179
   macro avg       0.79      0.78      0.79       179
weighted avg       0.80      0.80      0.80       179
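
Each of the three models above is evaluated on a single 80/20 split, and the two tree-based models are created without a `random_state`, so their scores can change from run to run. As a hedged sketch of a steadier comparison, the loop below scores the same three model families with 5-fold cross-validation. It runs on synthetic data from `make_classification` (an assumption, not the Titanic features), and the hyperparameters are illustrative rather than the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic features)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Fixed seeds make the tree-based results repeatable
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean ROC AUC over 5 folds, computed from predicted scores
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```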

