All explanations are taken from the scikit-learn website (https://scikit-learn.org/stable/index.html).

### Table of Contents
##### 1. Let's look at the data
##### 2. Replacing Null values
##### 3. Deriving new parameters
##### 4. Correlation plot
##### 5. Comparison between different classifiers
##### 6. Parameter Tuning
##### 7. Conclusion

## Let's look at the data

In [None]:
#Importing basic libraries
import pandas as pd
import numpy as np
import re

In [None]:
train_df = pd.read_csv("../input/titanic/train.csv")
test_df = pd.read_csv("../input/titanic/test.csv")

In [None]:
print(train_df.columns.values)

The training set has 12 columns.

In [None]:
print(train_df.shape)
train_df.head(5)

In [None]:
print(test_df.shape)
test_df.head(5)

Let's get some information regarding our data

In [None]:
train_df.info()
print('_'*40)
test_df.info()

The following columns have missing values (the number in brackets is the count of missing values):

Train dataset - Age(177), Cabin(687), Embarked(2)

Test dataset - Age(86), Fare(1), Cabin(327)
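
The counts above can also be read off directly instead of scanning the `info()` output. A minimal sketch on a small made-up frame (`toy_df` and its values are assumptions for illustration, not rows from the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; values are illustrative only
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column and keep only columns that have any
missing_counts = toy_df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Running the same two lines on `train_df` and `test_df` should reproduce the numbers listed above.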

In [None]:
train_df.describe()

In [None]:
#Let's create new variables for further use
y_train = train_df['Survived']
x_test = test_df
x_train = train_df
full_data = [x_train, x_test]

## Replacing Null values

In [None]:
#Replacing 1 missing Fare value with the training-set median
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(x_train['Fare'].median())

In [None]:
#Now we replace 2 missing Embarked values
x_train[x_train['Embarked'].isnull()]

We create box plots to see the distribution of Fare for each Embarked port, split by Pclass

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
fig, ax = plt.subplots(figsize=(16,12),ncols=2)
ax1 = sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=x_train, ax = ax[0]);
ax2 = sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=x_test, ax = ax[1]);
ax1.set_title("Training Set", fontsize = 18)
ax2.set_title('Test Set',  fontsize = 18)

plt.show()

Both passengers with a missing Embarked value paid a fare of 80. We can observe that the median fare for Embarked 'C' is the closest to that value.
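
The same comparison can be made numerically rather than by eye. The sketch below is a rough illustration on made-up fares (`toy_df`, `target_fare` and all values are assumptions, not the real data): it computes the median fare per port and class, then picks the port whose first-class median is closest to 80.

```python
import pandas as pd

# Toy data standing in for x_train; fares are made up for illustration
toy_df = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "Q", "Q"],
    "Pclass":   [1, 1, 1, 1, 1, 1, 1, 1],
    "Fare":     [75.0, 80.0, 90.0, 50.0, 52.0, 60.0, 90.0, 90.0],
})

target_fare = 80.0  # the fare discussed above

# Median fare per (Embarked, Pclass) group
median_fare = toy_df.groupby(["Embarked", "Pclass"])["Fare"].median()

# Port whose first-class median fare is closest to the target fare
first_class = median_fare.xs(1, level="Pclass")
closest_port = (first_class - target_fare).abs().idxmin()

print(median_fare)
print(closest_port)
```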

In [None]:
# Replacing the null values in the Embarked column with 'C', the port whose median fare is closest to 80.
x_train['Embarked'] = x_train['Embarked'].fillna("C")

In [None]:
#We fill missing ages with random integers drawn from [mean - std, mean + std)
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # randint expects integer bounds; the upper bound is exclusive
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    # Assign through .loc to avoid chained assignment, which may not modify the original frame
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)

## Let's derive a few new parameters

In [None]:
# Create new feature FamilySize as a combination of SibSp and Parch
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# Create new feature IsAlone from FamilySize
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

In [None]:
for dataset in full_data:
    dataset['Has_Cabin'] = dataset["Cabin"].notna().astype(int)

In [None]:
# Define function to extract titles from passenger names
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
# Create a new feature Title, containing the titles of passenger names
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
# Group all non-common titles into one single grouping "Rare"
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
    # Mapping titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
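
To see what the title pattern captures, here is a small standalone check. `extract_title` is a hypothetical helper that mirrors `get_title` above, and the third name is a made-up string with no title:

```python
import re

def extract_title(name):
    # Capture the run of letters that follows a space and ends with a period
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

print(extract_title("Braund, Mr. Owen Harris"))   # Mr
print(extract_title("Heikkinen, Miss. Laina"))    # Miss
print(repr(extract_title("No title here")))       # ''
```

Names without a matching title return an empty string, which the mapping step above then turns into 0 through `fillna(0)`.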

In [None]:
# Mapping Name length
for dataset in full_data:
    dataset['Name_length'] = dataset['Name'].apply(len)

In [None]:
# Separating Ticket Numbers and Letters
for dataset in full_data:
    dataset['TicketNumbers'] = dataset.Ticket.apply(lambda x:int(x) if x.isnumeric() else 0 if x == 'LINE' else int(x.split(' ')[-1]))
    dataset['TicketLetters'] = dataset.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.', '').replace('/', '').lower()  if len(x.split(' ')[:-1]) > 0 else x.lower() if x == 'LINE' else 'none')

In [None]:
# Mapping Embarked
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

In [None]:
# Mapping Sex
for dataset in full_data:
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1}).astype(int)

In [None]:
x_train.head(5)

In [None]:
x_test.head(5)

In [None]:
test_id = x_test['PassengerId']

Let's drop columns which are not required

In [None]:
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin']
for dataset in full_data:
    dataset.drop(drop_elements, axis = 1, inplace = True)

## Correlation plot between different parameters

In [None]:
x_train_plot = x_train.drop('TicketLetters', axis = 1)
colormap = plt.cm.RdBu
plt.figure(figsize = (14, 12))
plt.title('Pearson correlation of features', y = 1.05, size = 15)
sns.heatmap(x_train_plot.astype(float).corr(), linewidths = 0.1, vmax = 1.0, square = True, cmap = colormap, linecolor = 'white', annot = True)

In [None]:
train_test_cleaning = pd.concat([x_train, x_test], keys = ['train', 'test'], axis = 0)
train_test_cleaning

In [None]:
train_test_cleaning = pd.get_dummies(train_test_cleaning)
train_test_cleaning

In [None]:
train_test_cleaning.drop('Survived', axis = 1, inplace = True)

In [None]:
x_train = train_test_cleaning.loc['train']
x_test = train_test_cleaning.loc['test']
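
The concatenate, encode, split pattern used in the cells above guarantees that train and test end up with the same dummy columns, even when a category appears in only one of them. A minimal sketch with made-up frames (`toy_train`, `toy_test` and their values are assumptions):

```python
import pandas as pd

# Toy frames; the category "pc" appears only in the test frame
toy_train = pd.DataFrame({"Fare": [7.25, 71.28], "TicketLetters": ["none", "a5"]})
toy_test = pd.DataFrame({"Fare": [8.05, 53.10], "TicketLetters": ["pc", "none"]})

# Stack both frames under an outer index level, then one-hot encode once
combined = pd.concat([toy_train, toy_test], keys=["train", "test"], axis=0)
encoded = pd.get_dummies(combined)

# Split back using the outer index level
enc_train = encoded.loc["train"]
enc_test = encoded.loc["test"]

print(list(enc_train.columns))
print(list(enc_test.columns))
```

Encoding the two frames separately would give them different columns here, and a model fitted on one could not predict on the other.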

## Classification 

We compare six classification models with 10-fold cross-validation on the training set and see which performs best. Since our dataset consists of both categorical and numeric data, we include ensemble models (Random Forest and Extra Trees), which are known to perform well on such data. If you don't know what ensemble models are, here's some info...

##### Ensemble Method of Classification
The goal of "ensemble methods" is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

Two families of ensemble methods are usually distinguished:

I. In "averaging methods", the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.

Examples: Bagging methods, Forests of randomized trees, …

II. By contrast, in "boosting methods", base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

Examples: AdaBoost, Gradient Tree Boosting, …
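
The two families can be tried side by side. The sketch below is only an illustration on synthetic data from `make_classification` (not the Titanic set); the model choices and settings are assumptions picked for brevity. It cross-validates a single decision tree, a bagged ensemble of trees (averaging) and AdaBoost (boosting).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; not the Titanic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Averaging: many trees fitted independently on bootstrap samples
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=50, random_state=0),
    # Boosting: weak learners fitted one after another
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validation accuracy for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The exact numbers depend on the data and settings, so treat the output as a rough comparison rather than a ranking.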

In [None]:
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier


from sklearn.model_selection import GridSearchCV

In [None]:
lr = LogisticRegression()
dt = DecisionTreeClassifier(random_state = 1)
rf = RandomForestClassifier(random_state = 1)
svc = make_pipeline(StandardScaler(), SVC(probability = True))
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
exttree = ExtraTreesClassifier(random_state=1)


estimators = [lr, dt, rf, svc, knn, exttree]
labels = ['Logistic Regression', 
            'Decision Tree', 
            'Random Forest Classifier', 
            'SVC', 
            'k Nearest Neighbour',
            'Extra Tree Classifier']

In [None]:
def estimate(x_train, y_train, estimators, labels):
    df_result = pd.DataFrame()
    
    row_index = 0
    for est, est_name in zip(estimators, labels):
        cv_results = cross_validate(est, x_train, y_train, n_jobs = -1, cv = 10)
        df_result.loc[row_index, 'Model name'] = est_name
        df_result.loc[row_index, 'Test_accuracy'] = cv_results['test_score'].mean()
        df_result.loc[row_index, 'Standard Deviation'] = cv_results['test_score'].std()
        df_result.loc[row_index, 'Fit_time'] = cv_results['fit_time'].mean()
        
        row_index +=1
        
    df_result.sort_values(by=['Test_accuracy'], ascending = False, inplace = True, ignore_index = True)
    
    return df_result

In [None]:
estimate(x_train, y_train, estimators, labels)
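
For reference, this is roughly what `cross_validate` returns for a single estimator, which is where `estimate` reads `test_score` and `fit_time` from. The sketch uses synthetic data and a decision tree purely as stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_train / y_train
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

cv_results = cross_validate(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# One entry per fold under each key
print(sorted(cv_results.keys()))
print(cv_results["test_score"].shape)
print(cv_results["test_score"].mean(), cv_results["test_score"].std())
```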

Random Forest performs the best among all classifiers. Here's some info on the Random Forest classifier:

#### Random Forest Classifier
In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.

Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size max_features.

The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yields decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice the variance reduction is often significant, hence yielding an overall better model.
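
The first source of randomness, the bootstrap sample, can be illustrated with NumPy alone. In the sketch below (the sample size and seed are arbitrary assumptions), drawing as many rows as the training set has, with replacement, leaves each tree with only about 63% of the distinct rows (1 - 1/e in the limit), so different trees see different data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # arbitrary size for the illustration

# A bootstrap sample: draw n_samples row indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct training rows that made it into the sample
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"{unique_fraction:.3f}")
```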

## Parameter Tuning

We will tune the parameters with GridSearchCV. It performs an exhaustive search over the specified parameter values, scoring every combination with cross-validation, and returns the combination that maximizes the mean cross-validation score. I ran the grid search on several ranges of values before narrowing it down to the ones you can see.
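
To get a feel for what an exhaustive search costs, the number of candidate combinations can be counted with `ParameterGrid` before fitting anything. The sketch below repeats the grid defined in the next cell so that it runs on its own:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the one passed to GridSearchCV in this notebook
rf_params = {'random_state': [1],
             'max_depth': [16, 17, 18],
             'max_features': [19, 20, 21],
             'min_samples_leaf': [1, 2],
             'min_samples_split': [2, 3, 4, 5],
             'n_estimators': [42, 43, 44]}

n_candidates = len(ParameterGrid(rf_params))  # 1 * 3 * 3 * 2 * 4 * 3
n_folds = 10

print(n_candidates)            # 216
print(n_candidates * n_folds)  # 2160
```

So this grid trains 2160 forests during the search, plus one final refit on the full training set with the default `refit=True`.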

In [None]:
rf_params = {'random_state': [1],
             'max_depth': [16, 17, 18],
             'max_features': [19, 20, 21],
             'min_samples_leaf': [1,2],
             'min_samples_split': [2, 3, 4, 5],
             'n_estimators': [42,43,44]}

grid = GridSearchCV(rf, 
                    rf_params,
                    cv = 10,   
                    n_jobs = -1)

grid.fit(x_train, y_train)

In [None]:
grid.best_params_

In [None]:
rf = RandomForestClassifier(**grid.best_params_)

cv_results = cross_val_score(rf, x_train, y_train, n_jobs = -1, cv = 10)

In [None]:
print(f'All results: {cv_results} \n\n' +
      f'Mean: {cv_results.mean()} \n\n' +
      f'Std: {cv_results.std()}')

In [None]:
rf.fit(x_train, y_train)
predictions = rf.predict(x_test)

In [None]:
submission = pd.DataFrame({'PassengerId': test_id,
                           'Survived': predictions})
submission.head(10)

In [None]:
submission.to_csv('submission.csv', index = False)

## Conclusion

I still have several ideas that might improve the accuracy and the visualizations. I will be updating this notebook in the future.

I appreciate any comments that might help me make this better. Also, if someone wants to team up with me, feel free to drop me an email at keyurpethad1996@gmail.com.