# MY APPROACH TO A LEADING SCORE

Hi, thanks for reading. Please **UPVOTE** if you enjoy this. Below I set out my procedure for predicting which passengers survived the famous Titanic sinking. My approach to building a prediction model is as follows:

1. Load the data
2. Explore the data
3. Modify the data
    3.1 Impute
    3.2 Remove outliers
    3.3 Scale
    3.4 Drop where too many missing values
4. Create new features
5. Build an Sklearn Pipeline
6. Train and Test several models
7. Generate submission file
    

## Import basic packages

In [None]:
import numpy as np
import pandas as pd 
import os
import math
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

## Load the data

In [None]:
train = pd.read_csv("../input/titanic/train.csv").set_index('PassengerId')
test = pd.read_csv("../input/titanic/test.csv").set_index('PassengerId')

y = train['Survived']
train = train.drop('Survived',axis=1)

display(train.head())
display(test.head())


## View the data

In [None]:
X = pd.concat([train,test])
X.head()

In [None]:
X.dtypes

In [None]:
X.describe()

Pclass is stored as an integer but is really a category (class 1, 2 or 3), so convert it to a categorical type.

In [None]:
#Pclass is actually categorical
X['Pclass'] = X['Pclass'].astype(object)

num_col = X.select_dtypes(include=['float64','int64']).columns
cat_col = X.select_dtypes(include=['object']).columns

## View the Seaborn Pairplot for Numerical Data

In [None]:
sns.pairplot(X[num_col],corner=True)

From this, I would make the following points:
* Most of the data is positively skewed
* There are a couple of outliers in Fare
* There are no glaringly obvious strong correlations here

In [None]:
plt.figure(figsize=(8,6))
correlation = X[num_col].corr()
# Mask on the absolute value so strong negative correlations are shown too
sns.heatmap(correlation, mask=correlation.abs() < 0.4, cmap='Blues')

As noted above, there are no strong correlations between our numerical features, so **multicollinearity** is not a concern here. What does this mean? In essence, a feature that is strongly related to another feature adds little new information to the model. It may not hurt the predictions, but it can make the model harder to interpret and, for linear models in particular, make the fitted coefficients unstable. Redundant features can also contribute to overfitting, in which case the fit appears to be good on the training data but gives poorer results when you come to submit.
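
As a rough companion to the heatmap, the sketch below lists feature pairs whose absolute correlation reaches a threshold. The helper name `strongly_correlated_pairs` and the toy data are assumptions for illustration only; the 0.4 threshold simply mirrors the heatmap mask.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(df, threshold=0.4):
    """Return (col_a, col_b, corr) for each column pair with |corr| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Toy data (not the Titanic data): b is a noisy copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = strongly_correlated_pairs(toy, threshold=0.4)
print(pairs)
```

On the notebook's data, calling it as `strongly_correlated_pairs(X[num_col])` should give the same picture as the heatmap in list form.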

## Imputation for Numerical Columns


In [None]:
X[num_col].isnull().sum()

### Age

To deal with the missing values above, we explore a few approaches. The most troublesome at the moment are the missing values in Age. Simply imputing the median has a huge effect on the distribution, see the charts below:

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(X['Age'], kde=True).set_title("Age Before Imputing")

Now let's see the median imputed results...

In [None]:
imputer = SimpleImputer(strategy='median')
imputed = imputer.fit_transform(X[['Age']])

# imputed is a 2D array of shape (n, 1); flatten it before plotting
sns.histplot(imputed.ravel(), kde=True).set_title("Age After Median Imputing")

This isn't great, as our model may end up thinking that being 28 is super important in determining the chance of survival, which it probably wasn't. Instead, I have created a random imputer that draws integer ages in the range 20 to 55, to better retain the distribution. This is a somewhat arbitrary choice and a more scientific method would be preferred, but for our purposes here it should be fine:

In [None]:
from random import randint

def replace_with_random(a):
    """
    a: Value or NaN to be replaced

    Returns a unchanged, or a random integer in [20, 55] if a is NaN.

    No random state is set inside this function, as seeding here would generate
    the same value each time it is called. This is unlikely to be the desired behaviour.
    """
    if pd.isnull(a):
        return randint(20, 55)
    return a

In [None]:
randimpute = X['Age'].apply(replace_with_random)

sns.histplot(randimpute, kde=True).set_title("Age After Random Imputing")

And this is the distribution of ages we will use.

In [None]:
# For now I will use my random approach for Age
X['Age'] = randimpute
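
As one possible "more scientific" variant, the sketch below fills missing values by sampling from the observed, non-missing values, so the imputed ages follow the empirical distribution rather than a uniform 20 to 55 range. This is not the method used in this notebook; the helper name `impute_from_observed` and the toy ages are assumptions. Because the seed is applied once per column rather than once per value, the result is reproducible without every gap receiving the same number.

```python
import numpy as np
import pandas as pd

def impute_from_observed(series, seed=None):
    """Fill NaNs by sampling (with replacement) from the non-missing values."""
    rng = np.random.default_rng(seed)  # one generator for the whole column
    out = series.copy()
    missing = out.isna()
    observed = out[~missing].to_numpy()
    out[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return out

# Toy ages (not the Titanic data) with two gaps
toy_age = pd.Series([22.0, np.nan, 35.0, 4.0, np.nan, 58.0, 29.0])
filled = impute_from_observed(toy_age, seed=81)
print(filled)
```

On the real data this would be called before the assignment above, for example `impute_from_observed(X['Age'], seed=81)`.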

### Fare

There is only one value to impute in Fare, so using the median seems sensible.

In [None]:
imputer = SimpleImputer(strategy='median')
X['Fare'] = imputer.fit_transform(X[['Fare']])

That should be it for missing data in the numerical columns.

In [None]:
X[num_col].isnull().sum()

In [None]:
X.describe()

### ... beautiful!

## Imputation for categorical columns

In [None]:
cat_col = X.select_dtypes(include=['object']).columns
X[cat_col]

In [None]:
X[cat_col].isnull().sum()

That's a lot of missing data for Cabin. Let's extract the deck from it and explore:

In [None]:
# Do something about cabin feature, at least extract deck where possible
plt.figure(figsize=(8,6))
X['Deck'] = X['Cabin'].str[0]
sns.countplot(x='Deck',data=X,palette="husl")

What does survival look like by Deck?

In [None]:
temp_data = pd.merge(X['Deck'],y,on='PassengerId')
temp_data = temp_data.groupby('Deck').sum()
sns.barplot(x=temp_data.index,y=temp_data['Survived'],palette='husl')

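So on the limited data we have, the upper decks appear to have more survivors. Note that the chart shows survivor counts rather than survival rates, so decks with more recorded passengers will naturally show more survivors. It would be nice to include this feature, but 1,014 missing values is just too many to impute. For now I will drop Deck and Cabin.

An idea would be to try to infer Deck from Ticket, as there seems to be some information in there that might help.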
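
To separate "more survivors" from "better odds", survivor counts and survival rates can be computed side by side. The sketch below does this on a small made-up frame; the values are illustrative and not taken from the Titanic data.

```python
import pandas as pd

# Toy data: the passenger with no recorded deck is dropped by groupby
toy = pd.DataFrame({
    "Deck": ["A", "A", "B", "B", "B", "C", None],
    "Survived": [1, 0, 1, 1, 0, 0, 1],
})

by_deck = toy.groupby("Deck")["Survived"].agg(
    survivors="sum",     # what the bar chart above plots
    passengers="count",  # how many passengers have a known deck
    rate="mean",         # share of those passengers who survived
)
print(by_deck)
```

Plotting the `rate` column instead of `survivors` would show whether a deck really has better odds or simply more recorded passengers.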

In [None]:
X = X.drop(['Cabin','Deck'],axis=1)

In [None]:
cat_col = X.select_dtypes(include=['object']).columns
X[cat_col].isnull().sum()

There are still two missing values for "Embarked", but we will let the pipeline handle them; see below for the embedded imputation using Sklearn's pipeline functions. With only 2 missing, the "most frequent" method should be OK.

## Remove outliers
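Back to the numerical data. Let's clean up those Fare outliers from before by setting them to missing. The pipeline imputation will then take care of any missing values. Note that the numerical imputer in the pipeline below uses `strategy='constant'`, which fills numeric gaps with 0.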

In [None]:
# Keep fares below 400; anything else becomes NaN and is imputed in the pipeline
X['Fare'] = X['Fare'].where(X['Fare'] < 400)
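
The line above turns every fare of 400 or more into a missing value. For comparison, the sketch below shows that masking next to one alternative, capping the value at the threshold. Clipping is a swapped-in technique, not what this notebook does, and the toy fares are made up.

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 71.28, 512.33, 8.05], name="Fare")

masked = fare.where(fare < 400)  # outlier becomes NaN, to be imputed later
clipped = fare.clip(upper=400)   # outlier is capped at 400 instead

print(pd.DataFrame({"original": fare, "masked": masked, "clipped": clipped}))
```

Masking leaves the replacement value up to the imputer, whereas clipping keeps the passenger marked as having paid a very high fare.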

## Feature Engineering

To simplify our features, let's create a family variable that combines Parch and SibSp. It would be preferable to have 1 dimension with only 60% zeros over 2 dimensions with at least 70% zeros each. This is OK to do because the Parent/Child and Sibling/Spouse data are of the same kind: they are counts of people.

In [None]:
X['FamilySize'] = X['SibSp'] + X['Parch']
X = X.drop(['SibSp','Parch'],axis=1)

num_col = X.select_dtypes(include=['float64','int64']).columns
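
The share of zeros mentioned above can be checked directly. The sketch below does so on a made-up frame; on the real data the same expression would be run on `X` before SibSp and Parch are dropped.

```python
import pandas as pd

# Toy counts (not the Titanic data)
toy = pd.DataFrame({"SibSp": [0, 1, 0, 0, 2], "Parch": [0, 0, 2, 0, 1]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# Fraction of rows equal to zero in each column
zero_share = (toy == 0).mean()
print(zero_share)
```

A combined column can only be zero where both inputs are zero, so its zero share is never higher than that of either input.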

## Pipeline

Here we prepare the pipeline. See the sklearn documentation for further information.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='constant')),  # fills numeric NaNs with 0 (the default fill_value)
    ('scaler',StandardScaler())
    ])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_col),
        ('cat', categorical_transformer, cat_col)
    ])
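
To see what this preprocessor does to a frame, the sketch below runs the same two-branch structure on a tiny made-up table with one numerical and one categorical column. The names `toy` and `toy_preprocessor` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (not the Titanic data) with one gap in each column
toy = pd.DataFrame({
    "Fare": [7.25, np.nan, 8.05, 53.10],
    "Embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant")),  # numeric NaN -> 0
        ("scaler", StandardScaler()),
    ]), ["Fare"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN -> "S" here
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Embarked"]),
])

out = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalise to dense
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)
print(out.shape)
print(out)
```

The first output column is the scaled Fare and the remaining columns are the one-hot encoded Embarked categories.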

Split the data back into train and test datasets and split for modelling purposes.

In [None]:
# IMPORTANT: Now the data is cleaned, split it back into train and test sets, then split X and y.
test = X.loc[test.index]
X = X.loc[train.index]
y = y.loc[train.index]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75,random_state=81)

In [None]:
# Import models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import Perceptron
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

### Random Forest

In [None]:
# Train the RF model. I did a Grid Search CV on this, and it yielded the following set of parameters:
RandomForest = RandomForestClassifier(n_estimators=500,
                                      min_samples_split=5,
                                      min_samples_leaf=1,
                                      random_state=81)

RF_pipeline = Pipeline(steps=[('preprocessor', preprocessor),('model', RandomForest)])

RF_pipeline.fit(X_train, y_train)

y_pred = RF_pipeline.predict(X_test)

RF_accuracy = accuracy_score(y_test,y_pred)

print("Accuracy:",RF_accuracy)

In [None]:
# from sklearn.model_selection import GridSearchCV
# from sklearn.metrics import make_scorer

# parameters = {'model__n_estimators':[100,500,750,1000],
#               'model__min_samples_split':[2,5,10],
#               'model__min_samples_leaf':[1,2,5,10],
#               'model__max_depth':[1,3,5,10,20]}

# scorer = make_scorer(accuracy_score,greater_is_better=True)

# grid = GridSearchCV(RF_pipeline,parameters,scoring=scorer)

# grid.fit(X_train,y_train)

# y_pred = grid.predict(X_test)

# accuracy = accuracy_score(y_test,y_pred)

# final_params = grid.best_params_

# print("Accuracy:",accuracy)
# print(final_params)

### XGBoost

In [None]:
XGB = XGBClassifier(eta=0.0001,max_depth = 12, gamma = 3,random_state=81)

XGB_pipe = Pipeline(steps=[('preprocessor', preprocessor),('model', XGB)])

XGB_pipe.fit(X_train, y_train)

y_pred = XGB_pipe.predict(X_test)

XGB_accuracy = accuracy_score(y_test,y_pred)

print("Accuracy:",XGB_accuracy)

In [None]:
"""
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

parameters = {'model__eta':[0.0001,0.0005,0.001,0.01],
              'model__max_depth':[8,10,12,15],
              'model__gamma':[1,2,3,4,5]}

scorer = make_scorer(accuracy_score,greater_is_better=True)

grid = GridSearchCV(XGB_pipe,parameters,scoring=scorer)

grid.fit(X_train,y_train)

y_pred = grid.predict(X_test)

accuracy = accuracy_score(y_test,y_pred)

final_params = grid.best_params_

print("Accuracy:",accuracy)
print(final_params)
"""

### Perceptron

In [None]:
# Use a lowercase name so the imported Perceptron class is not overwritten
perceptron = Perceptron()

Perc_pipeline = Pipeline(steps=[('preprocessor',preprocessor),('model',perceptron)])

Perc_pipeline.fit(X_train,y_train)

y_pred = Perc_pipeline.predict(X_test)

Perceptron_accuracy = accuracy_score(y_test,y_pred)

print("Accuracy:",Perceptron_accuracy)

### Logistic Regression

In [None]:
LogRegCV = LogisticRegressionCV(cv=5)

LR_pipeline = Pipeline(steps=[('preprocessor',preprocessor),('model',LogRegCV)])

LR_pipeline.fit(X_train,y_train)

y_pred = LR_pipeline.predict(X_test)

LogReg_accuracy = accuracy_score(y_test,y_pred)

print("Accuracy:",LogReg_accuracy)

### Ada Boost

In [None]:
ADA = AdaBoostClassifier()

ADA_pipeline = Pipeline(steps=[('preprocessor',preprocessor),('model',ADA)])

ADA_pipeline.fit(X_train,y_train)

y_pred = ADA_pipeline.predict(X_test)

ADA_accuracy = accuracy_score(y_test,y_pred)

print("Accuracy:",ADA_accuracy)

### Stacking for all models
This combines all the models to see whether together they predict better than they do individually. Logistic Regression (the default final estimator in `StackingClassifier`) is used to choose the overall result from amongst the underlying models.

In [None]:
%%time

from sklearn.ensemble import StackingClassifier

estimators = [('RF',RF_pipeline),
              ('Perceptron',Perc_pipeline),
              ('ADA',ADA_pipeline),
              ('LogReg',LR_pipeline),
              ('XGB',XGB_pipe)]

stack = StackingClassifier(estimators=estimators)
stack.fit(X_train,y_train)
y_pred = stack.predict(X_test)
stack_accuracy = accuracy_score(y_test,y_pred)
print("Accuracy:",stack_accuracy)
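
For readers new to stacking, here is a minimal sketch on synthetic data with the defaults written out explicitly. The two base models and the dataset are assumptions chosen to keep the example small; the notebook's stack uses five pipelines instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split

# Synthetic data, not the Titanic data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=81)
Xa, Xb, ya, yb = train_test_split(X_toy, y_toy, train_size=0.75, random_state=81)

toy_stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=81)),
        ("perc", Perceptron(random_state=81)),
    ],
    final_estimator=LogisticRegression(),  # the default, written out explicitly
    cv=5,  # the final estimator is trained on out-of-fold base predictions
)
toy_stack.fit(Xa, ya)
toy_score = toy_stack.score(Xb, yb)
print("Accuracy:", toy_score)
```

The final estimator never sees the raw features by default, only the base models' predictions, which is why weak or highly similar base models limit what the stack can add.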

In [None]:
%%time

from sklearn.ensemble import VotingClassifier

estimators = [('RF',RF_pipeline),
              ('Perceptron',Perc_pipeline),
              ('ADA',ADA_pipeline),
              ('LogReg',LR_pipeline),
              ('XGB',XGB_pipe)]

vote = VotingClassifier(estimators=estimators)
vote.fit(X_train,y_train)
y_pred = vote.predict(X_test)
vote_accuracy = accuracy_score(y_test,y_pred)
print("Accuracy:",vote_accuracy)

In [None]:
results = pd.DataFrame({'Model':['Random Forest','Perceptron','Logistic Regression','ADA Boost','XGBoost','Stacked Model','Vote Model'],
                        'Accuracy':[RF_accuracy, Perceptron_accuracy,LogReg_accuracy,ADA_accuracy,XGB_accuracy,stack_accuracy,vote_accuracy]}).set_index('Model')

In [None]:
results.sort_values('Accuracy',ascending=False)

So the RF model performs best. It is possible that running a GridSearchCV on AdaBoost and Perceptron would lead to better results, and the Stack may improve by extension.
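
A grid search over AdaBoost could follow the same pattern as the commented-out Random Forest search. The sketch below is a minimal version on synthetic data: the `StandardScaler` step stands in for the notebook's preprocessor, and the grid values are hypothetical, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, not the Titanic data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=81)

ada_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),  # stand-in for the notebook's preprocessor
    ("model", AdaBoostClassifier(random_state=81)),
])

# Hypothetical grid; the "model__" prefix routes each parameter to the pipeline step
param_grid = {
    "model__n_estimators": [25, 50],
    "model__learning_rate": [0.5, 1.0],
}

ada_grid = GridSearchCV(ada_pipe, param_grid, scoring="accuracy", cv=3)
ada_grid.fit(X_toy, y_toy)
print(ada_grid.best_params_)
print("CV accuracy:", ada_grid.best_score_)
```

In the notebook, `ADA_pipeline` would take the place of `ada_pipe`, fitted on `X_train` and `y_train`.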

## Generate Submission

Thank you for reading. Please **UPVOTE** if you have enjoyed this, and leave a comment with any suggestions for improvement, either to my approach or to my code.

Thanks again

**Jon**

![UPVOTE](https://i.imgur.com/RVyQY7r.png)

In [None]:
test_pred = RF_pipeline.predict(test)

submission = pd.DataFrame(test_pred,index=test.index,columns=['Survived'])

submission.to_csv("./submission.csv")