# Titanic: A Beginner's Tutorial

I created a GitHub repo for quick reference to some common ML techniques, many of which I use in this notebook.  
Link: https://github.com/ZacharyJWyman/ML-Techniques

Feel free to ask questions in the comment section if any topic is unclear. This dataset is a great way to get your feet wet in predictive modeling and machine learning.

## Table of Contents:
* [Cleaning the Dataframe](#Cleaning-the-Dataframe)  
    * [Age](#Age)  
    * [Cabin](#Cabin)  
    * [Embarked](#Embarked)  
* [Exploratory Analysis](#Exploratory-Analysis)
* [Feature Engineering](#Feature-Engineering)  
* [One Hot Encoding](#One-Hot-Encoding)
* [Preparing the Data](#Preparing-the-Data)
* [Tuning Hyperparameters](#Tuning-Hyperparameters)
* [Stacking](#Implementing-Stacking)
* [Submission to CSV](#Submission-To-CSV)

In [None]:
# import libraries
import pandas as pd
import numpy as np

In [None]:
train = pd.read_csv(r'/kaggle/input/titanic/train.csv')
test = pd.read_csv(r'/kaggle/input/titanic/test.csv')

In [None]:
train.head(5)

## Cleaning the Dataframe

Before we can train a model on our data, we must deal with any missing values. There are several techniques for handling them, including:
* Fill with either the median or the mean. The median is often preferable because it is more robust to outliers. If you have extreme values at either end, the mean may be affected severely (e.g. the mean income of the district that Bill Gates lives in).
* Drop the column if the majority of data is missing.
* Fill with 0 if appropriate. Many times a missing value may signify "no item". This is why it is important to examine the columns with missing data closely.
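
To make the first and third options concrete, here is a minimal sketch on made-up numbers. The `income` and `garage_area` columns are hypothetical and not part of the Titanic data. It shows how a single extreme value drags the mean while barely moving the median, and how each fill strategy is applied.

```python
import numpy as np
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    'income': [40_000, 45_000, 50_000, np.nan, 10_000_000],  # one extreme earner
    'garage_area': [400.0, np.nan, 350.0, 0.0, np.nan],      # NaN here means "no garage"
})

mean_income = toy['income'].mean()      # pulled far upward by the outlier
median_income = toy['income'].median()  # stays near the typical value

# fillna returns a new Series, so assign the result back to the column
toy['income'] = toy['income'].fillna(median_income)
toy['garage_area'] = toy['garage_area'].fillna(0)

print(mean_income, median_income)
print(toy)
```

Filling with the mean here would label the missing person a millionaire, which is exactly the distortion the median avoids.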

In [None]:
train.isna().sum()

### Age 

In [None]:
dfs = [train ,test]

for df in dfs:
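    #assign the result back; calling fillna with inplace on a selected column is unreliable in newer pandas
    df['Age'] = df['Age'].fillna(df['Age'].median())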

In the above code we iterate through each dataframe and fill its missing Age values with that dataframe's own median. Kinda cool, right? If we check, we should see that Age now has 0 missing values.

In [None]:
train.isna().sum()

### Cabin

In [None]:
train['Cabin'].value_counts()

I chose to fill the missing Cabin values with 0 instead of dropping the column because cabin may be associated with passenger class! We will have a look at a correlation matrix that includes categorical columns once we have used One Hot Encoding!

In [None]:
for df in dfs:
    df['Cabin'] = df['Cabin'].fillna(0)

The most important part of each value is what cabin letter they are in. We will aim to pull only the first character (letter) from each row.

In [None]:
#keep only the first character of each Cabin value (the cabin letter);
#the missing cabins filled with 0 above become '0'
for df in dfs:
    df['Cabin'] = df['Cabin'].astype(str).str[0]

In [None]:
train['Cabin'].head()

It worked! We have grabbed the first letter from each row.

### Embarked

In [None]:
train['Embarked'].value_counts()

We will fill the missing values with the mode of the column, which is 'S', as this will alter our data the least.

In [None]:
for df in dfs:
    #fillna returns a new column, so it has to be assigned back
    df['Embarked'] = df['Embarked'].fillna('S')
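
Rather than hard-coding 'S', the mode can also be computed from the column itself. The sketch below uses a small made-up Series in place of the real Embarked column. It also shows why the result of `fillna` has to be assigned back: without the assignment, the original column is left unchanged.

```python
import numpy as np
import pandas as pd

# toy stand-in for the Embarked column
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S', np.nan])

most_common = embarked.mode()[0]  # mode() returns a Series; take its first entry

embarked.fillna(most_common)             # result discarded: embarked still has NaN
still_missing = embarked.isna().sum()

embarked = embarked.fillna(most_common)  # assigned back: missing values are filled
missing_after = embarked.isna().sum()

print(most_common, still_missing, missing_after)
```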

## Exploratory Analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
#seaborn & matplotlib are excellent python libraries to perform clean visualizations.
#I highly suggest you get familiar with them!

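#correlation matrix
#numeric_only skips text columns such as Name and Sex, which newer pandas versions refuse to correlate
corr_matrix = train.corr(numeric_only = True)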
fig, ax = plt.subplots(figsize = (10,8))
sns.heatmap(corr_matrix, annot = True, fmt='.2g', vmin = -1,
            vmax = 1, center = 0, cmap = 'coolwarm')

In [None]:
train.dtypes

In [None]:
#boxplot
numeric_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
fig, ax = plt.subplots(figsize = (10,5))
sns.boxplot(data = train[numeric_cols], orient = 'h', palette = 'Set2')

The boxplot shows an extreme outlier in Fare. That is why the median is a safer fill value than the mean, and it is what we use for the missing Fare value in the test set later on.

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(train[numeric_cols], figsize= (12,8))

In [None]:
train.hist(bins = 20, figsize = (12,8))

In [None]:
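#pass the column as x explicitly; newer seaborn versions no longer accept it as a positional argument
sns.countplot(x = train[train['Survived'] == 1]['Pclass']).set_title('Count Survived for each Class')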

In [None]:
len(train[train['Pclass'] == 1]), len(train[train['Pclass'] == 2]), len(train[train['Pclass'] == 3])

In [None]:
train[train['Pclass'] == 1]['Survived'].sum(), train[train['Pclass'] == 2]['Survived'].sum(), train[train['Pclass'] == 3]['Survived'].sum()   

In [None]:
percentages = []
first = 136 / 216
second = 87/ 184
third = 119/491
percentages.append(first)
percentages.append(second)
percentages.append(third)

In [None]:
percents = pd.DataFrame(percentages)
percents.index+=1

In [None]:
percents['PClass'] = ['1', '2', '3']
cols= ['Percent', 'PClass']
percents.columns = [i for i in cols]
sns.barplot(y = 'Percent', x = 'PClass', data = percents).set_title('Percent Survived for Passenger Class')

The majority of first class passengers survived (about 63%), slightly fewer than half of second class passengers survived (about 47%), and the majority of third class passengers did not survive (only about 24% did). Passenger class clearly affected your chance of survival aboard the Titanic.
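
The same percentages can be computed without typing the counts in by hand. This is a minimal sketch, assuming a dataframe with Pclass and Survived columns like train. The rows below are made up so that the block runs on its own.

```python
import pandas as pd

# toy stand-in for train[['Pclass', 'Survived']]
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Survived is 0/1, so its mean within each class is the survival rate
survival_rate = toy.groupby('Pclass')['Survived'].mean()
print(survival_rate)
```

On the real data, `train.groupby('Pclass')['Survived'].mean()` should reproduce the three fractions used above.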

## Feature Engineering

In [None]:
#family size aboard = siblings/spouses + parents/children
train['Family'] = train['SibSp'] + train['Parch']
test['Family'] = test['SibSp'] + test['Parch']

In [None]:
#dropping columns from the dataframe 
train.drop(['SibSp', 'Parch', 'Name', 'Ticket'], axis = 1, inplace = True)
test.drop(['SibSp', 'Parch', 'Name', 'Ticket'], axis = 1, inplace = True)

Keep in mind that model accuracy could be improved by extracting each passenger's title (Mr, Mrs, Miss and so on) from the Name column. To keep this tutorial simple we won't cover that, but if enough people request it I will add it in a later version.
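
For readers who want to experiment anyway, here is a rough sketch of how titles might be pulled out of a Name column. It would have to run before Name is dropped. The regular expression is an assumption about the 'Surname, Title. Given names' layout, and the handful of names is only there to make the block runnable on its own.

```python
import pandas as pd

# a few names in the 'Surname, Title. Given names' layout
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# capture the text between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```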

In [None]:
train.head(5)

### Check Test DataFrame For Any Missing Values Too!

In [None]:
test.isna().sum()

In [None]:
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

Now all of our missing data is filled in so we can go ahead with our model!

## One Hot Encoding

One Hot Encoding is one of the most useful techniques a data scientist can know. It turns each categorical column into a set of 0/1 columns, one per category: for every row, the column matching that row's category takes the value 1 and the others take 0. Take for example our 'Embarked' column. One hot encoding replaces it with Embarked_S, Embarked_C, and Embarked_Q columns, so a passenger who embarked at 'S' gets a 1 in Embarked_S and a 0 in the other two. This is crucial to preparing data for our model as it won't take kindly to non-numerics.
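
Here is a tiny illustration on a made-up frame (the Fare values are only placeholders). Passing `dtype=int` is a choice made for this sketch so that the output shows 1/0. Recent pandas versions show True/False by default, which scikit-learn treats the same way.

```python
import pandas as pd

# toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Fare': [7.25, 71.28, 8.05, 53.1],
})

# numeric columns pass through untouched; Embarked becomes three 0/1 columns
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```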

In [None]:
train_df = pd.get_dummies(train)
test_df = pd.get_dummies(test)

In [None]:
#axis 1 refers to columns!
train_df.drop('PassengerId', axis = 1, inplace = True)

Displaying train_df.head() shows how one hot encoding alters the dataframe.

## Preparing the Data

In [None]:
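#separate the target from the features
y = train_df['Survived']
train_df.drop('Survived', axis = 1, inplace = True)
#Cabin_T has no counterpart in test_df, so drop it to keep the train and test columns identical
train_df.drop('Cabin_T', axis = 1, inplace = True)
test_df.drop('PassengerId', axis = 1, inplace = True)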

In [None]:
X_test = test_df
X_train = train_df

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
rfc = RandomForestClassifier()

## Tuning Hyperparameters

GridSearchCV and RandomizedSearchCV are excellent tools for determining the best hyperparameters for your models! This can increase your model accuracy significantly. The only downside is that a grid search takes quite some time to run, so on very large datasets a smaller grid or RandomizedSearchCV, which samples a fixed number of combinations instead of trying every one, is usually the more practical choice. The dataset in this competition is fairly small so we won't bother with this.

In [None]:
param_grid = {
    'n_estimators': [200, 500, 1000],
    'max_features': ['sqrt'],  #'auto' meant the same thing for classifiers but was removed in newer scikit-learn
    'max_depth': [6, 7, 8],
    'criterion': ['entropy']
}

Our param grid is set up as a dictionary so that GridSearchCV can read the parameters. The search below tries 3 x 1 x 3 x 1 = 9 different combinations and fits each one 5 times (cv = 5), for 45 fits in total, plus one final refit of the best combination on the full training set. After fitting, best_estimator_ and best_params_ give us the model and the settings that performed best.
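
The arithmetic above can be checked by enumerating the grid. This is a small sketch using only the standard library. The dictionary is a grid with the same shape as ours, and `cv` is set to 5 to match the search.

```python
from itertools import product

# a grid with the same shape as the one in this section
param_grid = {
    'n_estimators': [200, 500, 1000],
    'max_features': ['sqrt'],
    'max_depth': [6, 7, 8],
    'criterion': ['entropy'],
}
cv = 5

# every combination is one value picked from each list
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

n_candidates = len(combinations)  # 3 * 1 * 3 * 1
n_cv_fits = n_candidates * cv     # each candidate is fitted once per fold

print(n_candidates, n_cv_fits)
```

Adding one more value to any list multiplies the work, which is why grids grow expensive so quickly.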

In [None]:
CV = GridSearchCV(estimator = rfc, param_grid = param_grid, cv = 5)
CV.fit(X_train, y)
CV.best_estimator_

In [None]:
#import some classifiers
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier

In [None]:
rfc = RandomForestClassifier(criterion = 'entropy', max_depth = 8, n_estimators = 500, random_state = 42)
gbc = GradientBoostingClassifier()
ada = AdaBoostClassifier()

In [None]:
rfc.fit(X_train, y)
gbc.fit(X_train, y)
ada.fit(X_train, y)

## Implementing Stacking

This is a very powerful ensemble technique that often results in increased model accuracy. Stacking fits several base models and then trains a meta-classifier on their predictions, so the final prediction can draw on the strengths of each base model. If you want to achieve even higher accuracy than this notebook, tune the hyperparameters of each model before fitting.

In [None]:
from mlxtend.classifier import StackingCVClassifier
stack_gen = StackingCVClassifier(classifiers = (rfc, gbc, ada),
                                        meta_classifier = rfc,
                                        use_features_in_secondary = True)
stack_gen.fit(X_train.values, y)
y_pred = stack_gen.predict(X_test.values)
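
For comparison, scikit-learn ships its own `StackingClassifier`, which can stand in for mlxtend if that package is not available. The sketch below is not the method used in this notebook. It runs on synthetic data from `make_classification` so that it works without the Kaggle files, and the estimator settings are arbitrary choices for illustration. `passthrough=True` is roughly analogous to `use_features_in_secondary = True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split

# synthetic data so the block runs on its own
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ('rfc', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('gbc', GradientBoostingClassifier(random_state=42)),
        ('ada', AdaBoostClassifier(random_state=42)),
    ],
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    cv=5,              # base-model predictions for the meta-classifier come from 5-fold CV
    passthrough=True,  # also give the meta-classifier the original features
)
stack.fit(X_tr, y_tr)
preds = stack.predict(X_te)
print(stack.score(X_te, y_te))
```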

In [None]:
#if we also wanted to implement blending, we could do so like this, although I don't recommend doing so on this dataset.
def blend(X_test):
    y_pred = (0.25 * gbc.predict(X_test) +
              0.25 * ada.predict(X_test) +
              0.5 * stack_gen.predict(np.array(X_test)))
    return y_pred

#y_pred = np.round(blend(X_test))

## Submission To CSV

In [None]:
#reshape array so that it can be used in a dataframe for easy submission!
submission = y_pred.reshape(-1, 1)

Take a look at your submission object now by running a cell that contains just submission. It should print out the reshaped array. I won't include it here because the output is long and the reader would have to scroll quite a bit to get past it.

In [None]:
sub_df = pd.DataFrame(submission)

In [None]:
sub_df['PassengerId'] = test['PassengerId']
sub_df['Survived'] = submission
cols = ['PassengerId',
       'Survived']
sub_df.drop(0, axis = 1, inplace = True)
sub_df.columns = [i for i in cols]
sub_df = sub_df.set_index('PassengerId')

In [None]:
sub_df.head(10)

In [None]:
#put file path in string!
sub_df.to_csv(r'submission.csv')
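
As an alternative to the steps above, the submission frame can be assembled in one go from a dictionary. This is a minimal sketch: `passenger_ids` and `y_pred` are toy stand-ins for test['PassengerId'] and the real predictions, and the CSV is written to an in-memory buffer instead of a file so that the block runs on its own.

```python
import io

import numpy as np
import pandas as pd

# toy stand-ins for test['PassengerId'] and the model predictions
passenger_ids = pd.Series([892, 893, 894])
y_pred = np.array([0, 1, 0])

sub_df = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': y_pred.astype(int),  # the competition expects 0/1 integers
})

# index=False keeps the row index out of the file
buffer = io.StringIO()
sub_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Replacing the buffer with a file name such as 'submission.csv' writes the same content to disk.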

Thanks for following along. Make sure to give this notebook an upvote if it was helpful!