## Titanic Survival Prediction
<img src = "https://mcdn.wallpapersafari.com/medium/37/71/qtAKe3.jpg">

> **UPDATE**: 
- Version 16 gave the best results i.e. Top 4% (0.80622). Later versions are just some of my experiments.
- So if you are here for the best-scoring notebook, use **version 16**. Otherwise, if you are here to **learn different types of models and experiments**, use the later versions, i.e. **version 17 and ahead.**
_________________________________________________________________________________________________________________________________________

- This notebook deals with the popular problem statement from the Kaggle competition [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic). 
- Initially we talk about how the features affect the target (prediction about whether the passenger would survive or not).
- If you are here for the code, feel free to skip the Visualisation part and directly go to [Summary based on visuals](#Summary-based-on-visuals). 
- Table of Contents:
    1. [Overview](#Overview)
    2. [Visualisation](#Visualising-the-data)
    3. [Dealing with missing values.](#Dealing-with-missing-values)
    4. [Transforming the data.](#Transforming-the-data)
    5. [Defining Models.](#Defining-Models)
        1. Ensembling.
        2. Using the best performing model.
    6. Evaluation.

- Classifiers used: 
    - For Submission: [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
    - For Evaluation: 
        1. [Support Vector Machine(SVM)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
            1. SVM with Linear kernel.
            2. SVM with Radial kernel.
        2. [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
        3. [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
        4. [XGBoostClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)
        5. [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- Evaluation: accuracy, estimated with the [K-Folds cross-validator](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) (5 folds)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## **Overview**
* `PassengerId` is the unique id of the row and it doesn't have any effect on target
* `Survived` is the target variable we are trying to predict (**0** or **1**):
    - **1 = Survived**
    - **0 = Not Survived**
* `Pclass` (Passenger Class) is the socio-economic status of the passenger and it is a categorical ordinal feature which has **3** unique values (**1, 2 or 3**):
    - **1 = Upper Class**
    - **2 = Middle Class**
    - **3 = Lower Class**
* `Name`, `Sex` and `Age` are self-explanatory
* `SibSp` is the total number of the passengers' siblings and spouse
* `Parch` is the total number of the passengers' parents and children
* `Ticket` is the ticket number of the passenger
* `Fare` is the passenger fare
* `Cabin` is the cabin number of the passenger
* `Embarked` is port of embarkation and it is a categorical feature which has **3** unique values (**C**, **Q** or **S**):
    - **C = Cherbourg**
    - **Q = Queenstown**
    - **S = Southampton**

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')

## Visualising the data

### 1. SibSp and Parch

In [None]:
sns.catplot(x="SibSp", col = 'Survived', data=train, kind = 'count', palette='pastel')
sns.catplot(x="Parch", col = 'Survived', data=train, kind = 'count', palette='pastel')
plt.show()

Together, the columns `SibSp` and `Parch` tell us whether a passenger travelled with family. So we will create a new column `Is_alone` which records whether the passenger was accompanied (**1**) or not (**0**). Note that, despite its name, `Is_alone` equals **1** for passengers who were *not* alone.

In [None]:
def is_alone(x):
    if  (x['SibSp'] + x['Parch'])  > 0:
        return 1
    else:
        return 0

train['Is_alone'] = train.apply(is_alone, axis = 1)
test['Is_alone'] = test.apply(is_alone, axis = 1)

g = sns.catplot(x="Is_alone", col = 'Survived', data=train, kind = 'count', palette='deep')

As the plot shows, a passenger has a better chance of surviving if they were not alone. This might prove to be a useful feature for prediction.
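The same flag can be computed without a row-wise `apply`, and the effect seen in the plot can be checked as a number. The sketch below uses a small made-up frame (the values are invented for illustration and are not Titanic rows) and keeps the notebook's convention that **1** means accompanied.

```python
import pandas as pd

# Toy data, invented for illustration (not the real Titanic rows).
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 2, 0, 1],
    "Parch":    [0, 0, 0, 1, 0, 2],
    "Survived": [0, 1, 0, 1, 1, 1],
})

# Same convention as the notebook: 1 = accompanied, 0 = travelling alone.
toy["Is_alone"] = ((toy["SibSp"] + toy["Parch"]) > 0).astype(int)

# The mean of a 0/1 target within a group is that group's survival rate.
survival_rate = toy.groupby("Is_alone")["Survived"].mean()
print(survival_rate)
```

On the real data, running the same `groupby` on `train` would give the survival rate of each group directly.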

### 2. Age

In [None]:
g = sns.FacetGrid(train, col='Survived')
g = g.map(sns.distplot, "Age")

We notice that the age distributions are not the same in the survived and not-survived subpopulations. There is a peak corresponding to young passengers who survived. We also see that fewer passengers aged 60-80 survived.

So, even if "Age" is not strongly correlated with "Survived", we can see that there are age categories of passengers that have a higher or lower chance of surviving. It seems that very young passengers have a better chance of surviving.
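One way to make the idea of "age categories" concrete is to bin ages with `pd.cut` and compare survival rates per bin. This is only a sketch: the toy values, the bin edges and the band labels are assumptions chosen for illustration, and the notebook itself does not bin `Age`.

```python
import pandas as pd

# Toy data, invented for illustration (not the real Titanic rows).
toy = pd.DataFrame({
    "Age":      [2, 5, 19, 24, 33, 41, 58, 65, 71, 30],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Hypothetical bin edges and labels; intervals are right-closed, e.g. (0, 12].
bins = [0, 12, 18, 40, 60, 80]
labels = ["child", "teen", "adult", "middle-aged", "senior"]
toy["Age_band"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# observed=True drops bands that contain no passengers (here: "teen").
rate_by_band = toy.groupby("Age_band", observed=True)["Survived"].mean()
print(rate_by_band)
```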

### 3. Fare

In [None]:
f, axes = plt.subplots(2, 1, figsize = (10, 6))

g1 = sns.distplot(train["Fare"], color="red", label="Skewness : %.2f"%(train["Fare"].skew()), ax=axes[0])
axes[0].title.set_text('Before \'log\' Transformation')
axes[0].legend()

train_fare = train["Fare"].map(lambda i: np.log(i) if i > 0 else 0)

g2 = sns.distplot(train_fare, color="blue", label="Skewness : %.2f"%(train_fare.skew()), ax=axes[1])
axes[1].title.set_text('After \'log\' Transformation')
axes[1].legend()

plt.tight_layout()

As we can see, the `Fare` distribution is heavily right-skewed. This can lead the model to overweight very high values, even if the feature is scaled.

In this case, it is better to transform it with the log function to reduce this skew.

As we can see, Skewness is clearly reduced after the log transformation.
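The effect of a log transform on skewness can be reproduced on synthetic data. The sketch below draws log-normal "fares" (an assumption made purely for illustration) and uses `np.log1p`, which is a swapped-in alternative to the notebook's approach of mapping zero fares to 0 by hand: `log1p(x) = log(1 + x)` is already defined at zero.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "fares" (log-normal draws), invented for illustration.
rng = np.random.default_rng(0)
fare = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000), name="Fare")
fare.iloc[:5] = 0.0  # a few zero fares, to exercise the zero case

# log1p(x) = log(1 + x) is defined at zero, so no special-casing is needed.
log_fare = np.log1p(fare)

skew_before = fare.skew()
skew_after = log_fare.skew()
print(f"Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```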

### 4. Sex, Pclass and Embarked

In [None]:
sns.catplot(x="Sex", y="Survived", col="Pclass", data=train, saturation=.5, kind="bar", ci=None, aspect=0.8, palette='deep')
sns.catplot(x="Sex", y="Survived", col="Embarked", data=train, saturation=.5, kind="bar", ci=None, aspect=0.8, palette='deep')

### Inference using the above graphs:
1. **Sex**: A passenger has a higher chance of surviving if the passenger is **Female**.

2. **Pclass**: Passengers with passenger class (`Pclass`) as **1** have higher chances of surviving than the others.

3. **Embarked**: Passengers that boarded at **Cherbourg(C)** have survived more than those who boarded at **Queenstown(Q)** and **Southampton(S)**.
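The bar plots above show group-wise means of `Survived`; the same numbers can be read off a pivot table. The sketch below uses a small invented frame (not the Titanic data) to show the shape of such a table.

```python
import pandas as pd

# Toy data, invented for illustration (not the real Titanic rows).
toy = pd.DataFrame({
    "Sex":      ["female"] * 5 + ["male"] * 5,
    "Pclass":   [1, 1, 2, 3, 3, 1, 1, 2, 3, 3],
    "Survived": [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# Rows: Sex, columns: Pclass, cells: mean of Survived (the survival rate).
rates = toy.pivot_table(index="Sex", columns="Pclass",
                        values="Survived", aggfunc="mean")
print(rates)
```

Replacing `toy` with `train` (and `Pclass` with `Embarked` for the second plot) would give the values behind the bars.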

### Summary based on visuals

1. Column `PassengerId` won't help us.
2. I've seen people use column `Name` cleverly like [here](https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial) but I won't be using it in this notebook because:
    - It is not important from the perspective of our main objective.
    - It requires extra effort.
    - It might not bring a huge change.
3. Now that we have created a new feature `Is_alone` using features `SibSp` and `Parch`, we can delete them from our dataset.
4. The probability of surviving was highest for:
    1. Passengers who were **young**.
    2. Passengers who paid more **Fare**.
    3. Passengers who were **Female**, had passenger class as **1** and boarded at **Cherbourg**.

In [None]:
train = train.drop(['PassengerId','Name','SibSp','Parch'], axis = 1)
test = test.drop(['Name','SibSp','Parch'], axis = 1)

### Explore

In [None]:
print("Train columns:", ', '.join(map(str, train.columns))) 
display(train.head())
print("\nTest columns:",  ', '.join(map(str, test.columns)))
display(test.head())

### Checking for missing values

In [None]:
print("TRAIN DATA:")
train.isnull().sum()

In [None]:
print("TEST DATA:")
test.isnull().sum()

#### Observations:
- **177** values missing from `Age` from training data.
- **687** values missing from `Cabin` from training data.
- **2** values missing from `Embarked` from training data.

- **86** values missing from `Age` from testing data.
- **327** values missing from `Cabin` from testing data.
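Raw counts are easier to compare across the two files when shown alongside percentages. The sketch below builds such a summary on a tiny invented frame; the column names mirror the dataset, but the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not the real Titanic rows).
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 35.0, np.nan],
    "Cabin":    [None, None, "C85", None],
    "Embarked": ["S", "C", None, "S"],
    "Fare":     [7.25, 71.28, 8.05, 53.1],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "count": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["count"] > 0].sort_values("count", ascending=False)
print(missing)
```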

## Dealing with missing values

- We have two types of missing values:
    - Integer/Float (int64/float64)
    - Text (object)

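- We will use [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html) to fill in missing numerical values and [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to fill in missing categorical values, which are then encoded with [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).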
- Let's find out which are numerical and categorical columns in our dataset.

In [None]:
train.dtypes

#### Observation
`Pclass, Age, Is_alone, Fare` are numerical columns.

`Sex, Ticket, Cabin, Embarked` are categorical columns.

In [None]:
numerical = ['Pclass','Age','Is_alone','Fare']
categorical = ['Sex','Ticket','Cabin', 'Embarked']

In [None]:
features = numerical + categorical
target = ['Survived']
print('Features:', features, '\nTarget:', target)

## Transforming the data
We will use a combination of [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to carry out the necessary transformations on our data.

Transformers we are going to use:

|Data type|Transformer|
|:---|:---|
|Numerical|[IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html) & [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)|
|Categorical|[SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) & [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)|


We will use the **most_frequent** strategy for categorical columns.

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

numerical_transformer = Pipeline(
    steps=[('iterative', IterativeImputer(max_iter = 10, random_state=0)),
           ('scaler', StandardScaler())])

categorical_transformer = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='most_frequent')),
           ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, numerical),
                  ('cat', categorical_transformer, categorical)])
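
To see what such a preprocessor produces, the sketch below rebuilds the same two sub-pipelines on a tiny invented frame and transforms one "test" row with an unseen port. The toy values, the `toy_` names and the reduced column lists are assumptions for illustration; `sparse_threshold=0` is added only so the output is a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Toy data, invented for illustration (not the real Titanic rows).
toy_train = pd.DataFrame({
    "Age":      [22.0, np.nan, 35.0, 54.0],
    "Fare":     [7.25, 71.28, 8.05, 51.86],
    "Sex":      ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})
# "Q" never appears in toy_train, so it is an unknown category at transform time.
toy_test = pd.DataFrame({
    "Age": [30.0], "Fare": [10.0], "Sex": ["male"], "Embarked": ["Q"],
})

num_cols, cat_cols = ["Age", "Fare"], ["Sex", "Embarked"]
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("iterative", IterativeImputer(max_iter=10, random_state=0)),
                          ("scaler", StandardScaler())]), num_cols),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array (illustration only)
)

X_train = toy_preprocessor.fit_transform(toy_train)
X_test = toy_preprocessor.transform(toy_test)

# 2 scaled numeric columns + 2 Sex columns + 2 Embarked columns (C, S).
print(X_train.shape, X_test.shape)
# One-hot part of the test row: the unseen port "Q" becomes all zeros.
print(X_test[0, 2:])
```

With `handle_unknown='ignore'`, a category seen only at prediction time is encoded as all zeros instead of raising an error. This matters for high-cardinality columns such as `Ticket` and `Cabin`, where the test set can contain values that never occur in training.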

## Defining Models

Here, we are going to try two approaches:

1. Ensembling.
2. Random Forest Classifier (used for submission).

### 1.1 Ensembling
[Ensemble](https://scikit-learn.org/stable/modules/ensemble.html) methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model would. The models used to create such ensemble models are called ‘base models’.

We will do ensembling with the [Voting Ensemble](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html). Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and average the predictions of the sub-models when asked to make predictions for new data.

We will be using a weighted Voting Classifier. We will assign weights to the classifiers according to their accuracies, so the classifier with the highest accuracy will be assigned the highest weight, and so on.
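As a worked example of weighted soft voting, the sketch below averages class probabilities from three hypothetical classifiers for a single passenger. The probabilities are invented for illustration; the weights `[2, 3, 2]` are the ones used later in this notebook.

```python
import numpy as np

# Hypothetical [P(not survived), P(survived)] from three classifiers
# for one passenger. The numbers are invented for illustration.
probas = np.array([
    [0.60, 0.40],  # classifier A
    [0.35, 0.65],  # classifier B (given the highest weight below)
    [0.60, 0.40],  # classifier C
])
weights = np.array([2, 3, 2])

# Soft voting: average the probabilities, then pick the most probable class.
unweighted_avg = probas.mean(axis=0)
weighted_avg = np.average(probas, axis=0, weights=weights)

predicted_unweighted = int(np.argmax(unweighted_avg))
predicted_weighted = int(np.argmax(weighted_avg))
print(unweighted_avg, predicted_unweighted)
print(weighted_avg, predicted_weighted)
```

Here the plain average favours class 0, while the weighted average favours class 1 because the more trusted classifier B is counted more heavily.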

But before directly moving to using Voting Classifier, let's take a look at how the above mentioned classification algorithms work individually.

We will be following a pipeline in the next code. So please pay attention to each and every line.

<center>Numerical Imputation.</center>
<center>↓</center>
<center>Categorical Imputation.</center>
<center>↓</center>
<center>Transforming the dataset after the imputations.</center>
<center>↓</center>
<center>Use the models mentioned in `classifiers` list.</center>
<center>↓</center>
<center>Use K-Fold cross validation (5 folds).</center>

In [None]:
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score

observations = pd.DataFrame()
classifiers = ['Linear SVM', 'Radial SVM', 'LogisticRegression', 
               'RandomForestClassifier', 'AdaBoostClassifier', 
               'XGBoostClassifier', 'KNeighborsClassifier']
models = [svm.SVC(kernel='linear'),
          svm.SVC(kernel='rbf'),
          LogisticRegression(),
          RandomForestClassifier(n_estimators=200, random_state=0),
          AdaBoostClassifier(random_state = 0),
          xgb.XGBClassifier(n_estimators=100),
          KNeighborsClassifier()
         ]
# The same shuffled 5-fold split is used for every model (and reused in later cells).
cv = KFold(n_splits=5, random_state=0, shuffle=True)
for name, model in zip(classifiers, models):
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])
    observations[name] = cross_val_score(pipe, train[features], np.ravel(train[target]),
                                         scoring='accuracy', cv=cv)

In [None]:
mean = pd.DataFrame(observations.mean(), index= classifiers)
observations = pd.concat([observations,mean.T])
observations.index=['Fold 1','Fold 2','Fold 3','Fold 4','Fold 5','Mean']
observations.T.sort_values(by=['Mean'], ascending = False)

### 1.2 Voting Ensemble
We will select the top 3 models based on their scores.

In [None]:
from sklearn.ensemble import VotingClassifier

# Note: gamma has no effect when kernel='linear'.
linear_svm = svm.SVC(kernel='linear', C=0.1, gamma=10, probability=True)
pipe_linear = Pipeline(steps=[('preprocessor', preprocessor),  ('model', linear_svm)])

rand = RandomForestClassifier(n_estimators=200, random_state=0)
pipe_rand = Pipeline(steps=[('preprocessor', preprocessor),  ('model', rand)])

log = LogisticRegression()
pipe_log = Pipeline(steps=[('preprocessor', preprocessor),  ('model', log)])

ensemble_all = VotingClassifier(
    estimators=[('Linear_svm', pipe_linear),
                ('Random Forest Classifier', pipe_rand),
                ('Log', pipe_log)],
    voting='soft', weights=[2, 3, 2])

### Evaluation of model with 3 classifiers.

In [None]:
cross_validation_score = cross_val_score(ensemble_all, train[features], np.ravel(train[target]), scoring='accuracy', cv=cv)
print("K-Fold scores:", cross_validation_score,
      "\nMean:", round(cross_validation_score.mean(), 3),
      "\nMax:", round(cross_validation_score.max(), 3))

### 2. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

model = RandomForestClassifier(n_estimators=200, random_state = 0)

pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])

### Evaluation of model with the best classifier.

In [None]:
cross_validation_score_rand = cross_val_score(pipe, train[features], np.ravel(train[target]), scoring='accuracy', cv=cv)
print("K-Fold scores:", cross_validation_score_rand,
      "\nMean:", round(cross_validation_score_rand.mean(), 3),
      "\nMax:", round(cross_validation_score_rand.max(), 3))