# Titanic Survival Prediction

## Framing the problem

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some element of luck, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Obtain Data

#### Importing the basic required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import seaborn as sns
# import missingno as ms
%matplotlib inline

### Reading the data from CSV file

In [2]:
data = pd.read_csv('titanic.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'titanic.csv'

## Analyze Data

#### Obtaining a glimpse of data

In [None]:
data.head(3)

In [None]:
data.tail(3)

In [None]:
type(data)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data['Age'].hist()

In [None]:
data['Fare'].hist()

## Cleaning of data

#### Fill the missing values in the obtained data

In [None]:
data.groupby('Pclass')['Age'].median()


The median age for each passenger class is estimated as follows:
  
  * For **Class 1** - The median age is 37
  * For **Class 2** - The median age is 29
  * For **Class 3** - The median age is 24
  
Let's impute these values into the age column.



In [None]:
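def impute_age(row):
    """Return the passenger's age, or the class median when the age is missing."""
    # Index by label; positional access such as row[0] is deprecated in recent pandas
    age = row['Age']
    pclass = row['Pclass']

    if pd.isnull(age):
        # Class 1
        if pclass == 1:
            return 37
        # Class 2
        elif pclass == 2:
            return 29
        # Class 3
        else:
            return 24

    return age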



Applying the function.

In [None]:
data['Age'] = data[['Age','Pclass']].apply(impute_age,axis=1)
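
As an alternative sketch, the same per-class median imputation can be written without hard-coding the medians. The small frame `toy` below is made-up illustration data, not the Titanic file; only the column names `Age` and `Pclass` are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Age':    [40.0, 34.0, np.nan, 29.0, np.nan, 20.0, 28.0, np.nan],
})

# Median age of each class (NaN ignored), broadcast back to every row of that class
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Fill only the missing ages; existing ages are left untouched
toy['Age'] = toy['Age'].fillna(class_median)

print(toy)
```

This keeps the medians in sync with the data if rows are added or removed, at the cost of being slightly less explicit than the hand-written function.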

The Age column is imputed successfully.

Let's drop the Cabin column.

In [None]:
data.drop('Cabin', axis = 1,inplace=True)

In [None]:
data.head()

In [None]:
data.info()

### Categorical value conversion

In [None]:
data['Sex'].unique()

In [None]:
data['Sex'].value_counts()

In [None]:
sex_df = pd.get_dummies(data['Sex'],drop_first=True)
sex_df.head()

In [None]:
data['Embarked'].unique()

In [None]:
data['Embarked'].value_counts()

In [None]:
embark_df = pd.get_dummies(data['Embarked'],drop_first=True)
embark_df.head()
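
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up series of embarkation codes (the series `ports` is hypothetical; `dtype=int` is passed only so the output shows 0/1 rather than booleans).

```python
import pandas as pd

# Hypothetical sample of embarkation codes
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One indicator column per category: C, Q, S
full = pd.get_dummies(ports, dtype=int)

# The first category (C) is dropped; a row of all zeros now means C
reduced = pd.get_dummies(ports, drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column removes a redundant feature, since the dropped category is implied whenever all remaining indicators are zero.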

In [None]:
data.drop(['Sex','Embarked','Name','Ticket','PassengerId'],axis=1,inplace=True)

In [None]:
data = pd.concat([data,sex_df,embark_df],axis=1)

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

## Model Selection

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.drop('Survived',axis=1), 
                                                    data['Survived'], test_size=0.30, 
                                                    random_state=101)

In [None]:
X_train.head()

In [None]:
X_train.shape

### KNN Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'n_neighbors':[3,5,7,11,19], 'weights':['uniform', 'distance'], 'metric': ['minkowski', 'euclidean', 'manhattan']}

knn = KNeighborsClassifier()
knn_clf = GridSearchCV(knn, parameters)

knn_clf.fit(X_train, y_train)

In [None]:
knn_clf.best_estimator_
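
KNN relies on distances, so features on large scales (such as fare) can dominate features on small ones (such as passenger class). One possible variation, sketched here on synthetic data rather than the Titanic frame, is to standardize the features inside a pipeline before the grid search. The names `X_demo`, `y_demo`, `pipe`, and `search` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaling happens inside each cross-validation fold, so no information
# leaks from the validation part of a fold into the scaler
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    'knn__n_neighbors': [3, 5, 7],
    'knn__weights': ['uniform', 'distance'],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```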

### Support Vector Machine Classifier

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

svc = SVC()
svc_clf = GridSearchCV(svc, parameters)

svc_clf.fit(X_train, y_train)

In [None]:
svc_clf.best_estimator_

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
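# 'auto' is omitted from max_features: it was removed in scikit-learn 1.3
# and was equivalent to 'sqrt' for classifiers
parameters = {'n_estimators': [100, 200, 300, 500], 'max_features': ['sqrt', 'log2'],
              'max_depth' : [4,5,6,7,8], 'criterion' :['gini', 'entropy']}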

rfc = RandomForestClassifier(random_state=42)
rfc_clf = GridSearchCV(rfc, parameters)
rfc_clf.fit(X_train, y_train)

In [None]:
rfc_clf.best_estimator_

### Predicting on the test set

In [None]:
svc_predicted = svc_clf.predict(X_test)
rfc_predicted = rfc_clf.predict(X_test)
knn_predicted = knn_clf.predict(X_test)

## Evaluate the predictions

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
print(confusion_matrix(y_test, svc_predicted))

In [None]:
print(confusion_matrix(y_test, rfc_predicted))

In [None]:
print(confusion_matrix(y_test, knn_predicted))

#### Precision Score

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The best value is 1 and the worst value is 0.
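
As a small worked example of the definition, the sketch below counts true and false positives by hand on made-up labels (`y_true` and `y_pred` here are illustrative lists, not the notebook's test set). Returning 0.0 when nothing is predicted positive is a choice made for this sketch.

```python
def precision(y_true, y_pred):
    """Compute tp / (tp + fp) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Assumption for this sketch: report 0.0 when no sample is predicted positive
    return tp / (tp + fp) if (tp + fp) else 0.0

# Made-up labels: 3 true positives, 2 false positives
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

print(precision(y_true, y_pred))  # 3 / (3 + 2) = 0.6
```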



In [None]:
from sklearn.metrics import precision_score

In [None]:
print(precision_score(y_test,svc_predicted))

In [None]:
print(precision_score(y_test, rfc_predicted))

In [None]:
print(precision_score(y_test, knn_predicted))

#### Recall score

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The best value is 1 and the worst value is 0.
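
The definition can be checked by hand on a small made-up example (the lists `y_true` and `y_pred` are illustrative, not the notebook's test set). Returning 0.0 when there are no positive samples is a choice made for this sketch.

```python
def recall(y_true, y_pred):
    """Compute tp / (tp + fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumption for this sketch: report 0.0 when there are no positive samples
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up labels: 4 actual positives, 3 of them found, 1 missed
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

print(recall(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```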



In [None]:
from sklearn.metrics import recall_score

In [None]:
print(recall_score(y_test,svc_predicted))

In [None]:
print(recall_score(y_test, rfc_predicted))

In [None]:
print(recall_score(y_test, knn_predicted))

#### f1_score

The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0, and precision and recall contribute equally to it. The formula for the F1 score is:

    F1 = 2 * (precision * recall) / (precision + recall)
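
A short sketch with made-up precision and recall values shows how the formula behaves: when the two are far apart, F1 stays close to the smaller one, unlike a plain arithmetic average. The helper `f1` is written for this illustration and is not the scikit-learn function.

```python
def f1(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0  # assumption for this sketch: avoid division by zero
    return 2 * precision * recall / (precision + recall)

# Similar precision and recall: F1 lands between them
balanced = f1(0.6, 0.75)

# Very different precision and recall: F1 is pulled toward the smaller value
lopsided = f1(1.0, 0.1)
arithmetic = (1.0 + 0.1) / 2

print(round(balanced, 3))
print(round(lopsided, 3), 'versus arithmetic mean', arithmetic)
```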

In [None]:
from sklearn.metrics import f1_score

In [None]:
print(f1_score(y_test,svc_predicted))

In [None]:
print(f1_score(y_test,rfc_predicted))

In [None]:
print(f1_score(y_test,knn_predicted))

### Classification Report

To get all of the above metrics in one go, use the following function:

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test,svc_predicted ))

In [None]:
print(classification_report(y_test,rfc_predicted ))

In [None]:
print(classification_report(y_test,knn_predicted ))
