__Reference__

This notebook referenced the following Kaggle Kernels:
-  [Nadin Tamer, Titanic Survival Predictions (Beginner)](https://www.kaggle.com/nadintamer/titanic-survival-predictions-beginner)
-  [Omar El Gabry, A Journey through Titanic](https://www.kaggle.com/omarelgabry/a-journey-through-titanic)
-  [Anisotropic, Introduction to Ensembling/Stacking in Python](https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python)
- [Sina, Titanic best working Classifier](https://www.kaggle.com/sinakhorami/titanic-best-working-classifier)

## Introduction to Machine Learning through Titanic Project

###  Critical Steps
1. Importing & Exploring Necessary Libraries
2. Read in Data
3. Feature Exploration
4. Data Manipulation
5. Running Machine Learning Algorithms
6. Creating Submission File to Kaggle

In [7]:
!pip3 install numpy
!pip3 install pandas
!pip3 install seaborn
!pip3 install matplotlib
!pip3 install scikit-learn


# 1. Import Libraries

In [8]:
# Data Analysis Libraries
import numpy as np
import pandas as pd

# Data Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Machine Learning Libraries
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier

# 2. Read in Data

In [10]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# 3. Feature Exploration

In this step, we will get a basic sense of the data and visualize the features to figure out which ones are relevant for the analysis.

In [11]:
# A basic look at the training data
train.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
555,556,0,1,"Wright, Mr. George",male,62.0,0,0,113807,26.55,,S
572,573,1,1,"Flynn, Mr. John Irwin (""Irving"")",male,36.0,0,0,PC 17474,26.3875,E25,S
156,157,1,3,"Gilnagh, Miss. Katherine ""Katie""",female,16.0,0,0,35851,7.7333,,Q
81,82,1,3,"Sheerlinck, Mr. Jan Baptist",male,29.0,0,0,345779,9.5,,S
58,59,1,2,"West, Miss. Constance Mirium",female,5.0,1,2,C.A. 34651,27.75,,S


In [12]:
# Summary of the training data
train.describe(include = "all")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891,891.0,204,889
unique,,,,891,2,,,,681,,147,3
top,,,,"Johanson, Mr. Jakob Alfred",male,,,,CA. 2343,,C23 C25 C27,S
freq,,,,1,577,,,,7,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [None]:
# Get a clearer understanding of data types and missing values
train.info()
print('***************************************************')
test.info()
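
As a complement to `info()`, a per-column count of missing values can be easier to scan. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `train`) and only relies on `isnull()`, `sum()` and `mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number and share of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "share_missing": toy.isnull().mean(),
})
print(missing)
```

Running the same two expressions on `train` and `test` should give the missing-value picture in one table per dataset.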

## 3.1 Pclass

In [None]:
# Explore if survival rate depends on passenger class
sns.barplot(x = "Pclass", y = "Survived", data = train)
train[["Pclass", "Survived"]].groupby(["Pclass"], as_index = False).mean()

There seems to be a significant difference in survival rate for passengers in different classes. This feature should go into the model.
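
Because "Survived" is coded as 0/1, the group mean used above is the survival rate of each group. A minimal illustration on made-up data (the `toy` frame is an assumption, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data: the mean of a 0/1 column is the share of 1s
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Survival rate per passenger class
rate = toy.groupby("Pclass")["Survived"].mean()
print(rate)
```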

## 3.2 Sex

In [None]:
# Explore if survival rate depends on passenger gender
sns.barplot(x = "Sex", y = "Survived", data = train)
train[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean()

Sex should definitely go into the model as well.

## 3.3 Age 

In [None]:
# Age is a continuous variable with 20% of the data missing. 
# We will first look at the distribution
sns.distplot(train["Age"].dropna(), bins = 70, kde = False)

- Age is not normally distributed, so we cannot simply fill in the missing values with random numbers drawn from a normal distribution.
- Instead of treating age as a continuous variable, it might be better to group it into intervals, since a one-year difference in age would probably not determine whether a person survived.
- In the next section, we will come up with ways to fill in the missing values and categorize age.
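
One possible way to turn a continuous age into intervals is `pd.cut`. The bin edges and labels below are illustrative assumptions only, not the groups this notebook ends up using.

```python
import pandas as pd

ages = pd.Series([2, 10, 17, 30, 70])  # hypothetical ages

# Right-closed bins: (0, 12], (12, 18], (18, 65], (65, 120]
age_group = pd.cut(
    ages,
    bins=[0, 12, 18, 65, 120],
    labels=["Child", "Teenager", "Adult", "Senior"],
)
print(age_group.tolist())
```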

## 3.3 SibSp

In [None]:
# Explore if survival rate depends on the number of siblings/spouses aboard the Titanic
sns.barplot(x = "SibSp", y = "Survived", data = train)
sibsp = pd.DataFrame()
sibsp["Survived Mean"] = train[["SibSp", "Survived"]].groupby(["SibSp"], as_index = False).mean()["Survived"]
sibsp["Count"] = train[["SibSp", "Survived"]].groupby(["SibSp"], as_index = False).count()["Survived"]
sibsp["STD"] = train[["SibSp", "Survived"]].groupby(["SibSp"], as_index = False).std()["Survived"]
print(sibsp)
train[(train["SibSp"] == 5)|(train["SibSp"] == 8)]

- In the next step, we will group "SibSp" into [0, 1, 2 or more].
- It is surprising that none of the members of the two families with 5 and 8 SibSp survived. Judging from the available "Age" data points, most of them appear to be children, so it would be reasonable to fill in their missing ages as "teenagers" or "kids". However, only 7 records would need to be filled in this way, so in this analysis we will not treat them differently.
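
The [0, 1, 2 or more] grouping can be sketched compactly by capping the count at 2 and mapping the result to labels. The series below is made up, and this is only one way to write the grouping.

```python
import pandas as pd

sibsp = pd.Series([0, 1, 2, 5, 8])  # hypothetical SibSp values

# Cap counts at 2, then name the three resulting groups
sib_group = sibsp.clip(upper=2).map({0: "None", 1: "One", 2: "2 or More"})
print(sib_group.tolist())
```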

## 3.4 Parch

In [None]:
# Explore if survival rate depends on the number of parents/children aboard the Titanic
sns.barplot(x = "Parch", y = "Survived", data = train)
# Use a separate frame so this summary does not depend on the SibSp cell
parch = pd.DataFrame()
parch["Survived Mean"] = train[["Parch", "Survived"]].groupby(["Parch"], as_index = False).mean()["Survived"]
parch["Count"] = train[["Parch", "Survived"]].groupby(["Parch"], as_index = False).count()["Survived"]
parch["STD"] = train[["Parch", "Survived"]].groupby(["Parch"], as_index = False).std()["Survived"]
print(parch)

- In the next step, we will group "Parch" into [0, 1, 2 or more]

## 3.5 Fare

In [None]:
# See the distribution of Fare
#sns.distplot(train["Fare"][train["Pclass"]==1].dropna(), bins = 10, kde = False)
print(train[["Fare", "Survived"]].dropna().groupby(["Survived"]).count())
fare_hist = sns.FacetGrid(train, col="Survived")
fare_hist = fare_hist.map(plt.hist, "Fare")

train[["Fare", "Survived"]].dropna().groupby(["Survived"]).median()

- The outputs above indicate that the fare distributions of the passengers who survived and of those who did not are different, so we will include fare in the model.
- We will also categorize fare.

## 3.6 Cabin

In [None]:
# There are many missing values in this column; compute the ratio of missing to available values
(train["Survived"][train["Cabin"].isnull()].count())/(train["Cabin"].count())

- Since there are many more missing values than available values, we will leave this variable out of the model.
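
The cell above returns the ratio of missing to available entries, which is not the same as the share of rows that are missing. Using the counts from the `describe()` output (891 rows, 204 non-null "Cabin" values), the two numbers work out as follows:

```python
# Counts taken from the describe() output: 891 rows, 204 non-null Cabin values
total = 891
available = 204
missing = total - available

ratio_missing_to_available = missing / available  # what the cell above computes
share_missing = missing / total                   # fraction of all rows

print(missing, round(ratio_missing_to_available, 2), round(share_missing, 3))
```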

## 3.7 Embarked

In [None]:
# Explore if survival rate depends on the port where the passenger embarked
sns.barplot(x = "Embarked", y = "Survived", data = train)
train[["Survived", "Embarked"]].groupby(["Embarked"]).mean()

- We will keep this variable in our model.

## Insights from Feature Exploration & Next Steps
- Some variables may not have valuable information and can be dropped from the dataset.
- Missing values in both the training dataset and testing dataset should be addressed.
- Continuous variables should be categorized.

# 4. Data Manipulation

## 4.1 Dropping Unnecessary Variables

From the outputs above, we get a basic sense of the variables, and it is intuitive that "PassengerId", "Name" and "Ticket" are not likely to be valuable for the analysis. Therefore, we will drop these variables from both the training and testing datasets.

From the summary statistics, we also see that the column "Cabin" has too many missing values to draw information from. We will also exclude this column from the datasets.

In [None]:
PassengerId = test['PassengerId']
train = train.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis = 1)
test = test.drop(["PassengerId","Name", "Ticket", "Cabin"], axis = 1)

In [None]:
#train.info()
#print('***************************************************')
#test.info()

## 4.2 Dealing with Missing Values

There are two variables with missing values in the training dataset: "Age" and "Embarked".

## *4.2.1 Embarked -- categorical data*

### It's common to replace missing values of a categorical variable with the mode. We will find the count for each unique value and replace the missing ones with the most frequent value.

In [None]:
print(train["Embarked"].unique())
print(train.groupby(["Embarked"])["Survived"].count().reset_index())
train["Embarked"] = train["Embarked"].fillna("S")
train.groupby(["Embarked"])["Survived"].count().reset_index()
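
The cell above hard-codes "S" after reading it off the counts. As a hedged alternative, the most frequent value can be looked up with `Series.mode()`. The column below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

most_common = embarked.mode()[0]      # mode() returns a Series; take the first entry
filled = embarked.fillna(most_common)
print(most_common, filled.isnull().sum())
```

If several values tie for most frequent, `mode()` returns all of them, and `[0]` simply picks the first one in sorted order.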

## *4.2.2 Age*

### About 20% of the "Age" column is missing. As inspired by *A Journey through Titanic*, we will replace missing values with random integers between (mean - std) and (mean + std).

In [None]:
# Calculate mean and standard deviation of "Age" column
train_mean = train["Age"].mean()
train_std = train["Age"].std()
test_mean = test["Age"].mean()
test_std = test["Age"].std()

# Count missing values
count_na_train = train["Age"].isnull().sum()
count_na_test = test["Age"].isnull().sum()

# Generate random integers (bounds cast to int, since randint expects integer limits)
np.random.seed(66)
train_rand = np.random.randint(int(train_mean - train_std), int(train_mean + train_std), size = count_na_train)
test_rand = np.random.randint(int(test_mean - test_std), int(test_mean + test_std), size = count_na_test)

# Fill missing values with random numbers
# (.loc avoids chained assignment, which can fail to update the original frame)
train.loc[train["Age"].isnull(), "Age"] = train_rand
test.loc[test["Age"].isnull(), "Age"] = test_rand

# Convert into int
train["Age"] = train["Age"].astype(int)
test["Age"] = test["Age"].astype(int)
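
The same idea can be wrapped in a small helper to make the bounds explicit. This is a sketch on made-up ages; `fill_age_random` is a hypothetical name, and it uses a local `RandomState` instead of the global seed used above.

```python
import numpy as np
import pandas as pd

def fill_age_random(ages, seed=0):
    """Fill NaNs with random integers in [low, high), where
    low = int(mean - std) and high = int(mean + std). Hypothetical helper."""
    ages = ages.copy()
    low = int(ages.mean() - ages.std())
    high = int(ages.mean() + ages.std())
    mask = ages.isnull()
    rng = np.random.RandomState(seed)   # local generator, global state untouched
    ages.loc[mask] = rng.randint(low, high, size=mask.sum())
    return ages.astype(int), low, high

toy_age = pd.Series([20.0, np.nan, 30.0, 40.0, np.nan])  # made-up values
filled, low, high = fill_age_random(toy_age)
print(filled.tolist(), low, high)
```

Note that `randint` excludes the upper bound, so the filled values never reach `int(mean + std)` itself.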


## *4.2.3 Fare*

### There is one missing value for fare in the test data. We will simply replace it with the median.

In [None]:
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

In [None]:
# Confirm that all missing values are taken care of
#train.info()
#print('***************************************************')
#test.info()

## 4.3 Categorize Numeric Values

## 4.3.1 Age

In [None]:
# Map Age to categorical groups (same bins for train and test)
for df in (train, test):
    df.loc[df["Age"] <= 3, "age_c"] = "Infant & Toddler"
    df.loc[(df["Age"] > 3)&(df["Age"] <= 12), "age_c"] = "Child"
    df.loc[(df["Age"] > 12)&(df["Age"] <= 18), "age_c"] = "Teenager"
    df.loc[(df["Age"] > 18)&(df["Age"] <= 25), "age_c"] = "Young Adult"
    df.loc[(df["Age"] > 25)&(df["Age"] <= 50), "age_c"] = "Adult"
    df.loc[(df["Age"] > 50)&(df["Age"] <= 65), "age_c"] = "Middle-aged"
    df.loc[df["Age"] > 65, "age_c"] = "Senior"

train[["age_c","Survived"]].groupby(["age_c"]).mean()

In [None]:
# Set up two new dataframes for the final model
# (copies, so later changes to train/test do not affect them)
m_train = train.copy()
m_test = test.copy()

#m_train.info()
#print('***************************************************')
#m_test.info()

In [None]:
# Generate Dummy Variable for Age
# Dropped the first one to avoid multicollinearity
age_dummy_train = pd.get_dummies(train["age_c"], drop_first = True)
age_dummy_test = pd.get_dummies(test["age_c"], drop_first = True)

# Concatenate Age dummy with the original training dataset
m_train = pd.concat([m_train, age_dummy_train], axis = 1)
m_test = pd.concat([m_test, age_dummy_test], axis = 1)

# Drop original Age and age_c
m_train = m_train.drop(["age_c", "Age"], axis = 1)
m_test = m_test.drop(["age_c", "Age"], axis = 1)

m_train.sample(5)
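
Because the dummies are generated separately for train and test, the two frames can end up with different columns when a category is absent from one of them. A hedged sketch of one way to guard against this, on made-up category columns, is to reindex the test dummies to the training columns:

```python
import pandas as pd

# Hypothetical toy columns: "Senior" never occurs in the test column
train_cat = pd.Series(["Adult", "Senior", "Teenager"])
test_cat = pd.Series(["Adult", "Teenager", "Teenager"])

train_d = pd.get_dummies(train_cat, drop_first=True)
test_d = pd.get_dummies(test_cat, drop_first=True)

# Give the test dummies exactly the training columns, filling absent ones with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(train_d.columns), list(test_d.columns))
```

Note that `drop_first` removes whichever category sorts first in each frame, so this only lines up when both frames share that first category.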

## 4.3.2 SibSp

Categorize into 0,1,2 or more

In [None]:
# Map SibSp into categories
train.loc[train["SibSp"] == 0, "sib_c"] = "None"
train.loc[train["SibSp"] == 1, "sib_c"] = "One"
train.loc[train["SibSp"] >1 , "sib_c"] = "2 or More"

test.loc[test["SibSp"] == 0, "sib_c"] = "None"
test.loc[test["SibSp"] == 1, "sib_c"] = "One"
test.loc[test["SibSp"] >1 , "sib_c"] = "2 or More"

# Generate Dummy Variable
sib_dummy_train = pd.get_dummies(train["sib_c"], drop_first = True)
sib_dummy_test = pd.get_dummies(test["sib_c"], drop_first = True)


# Append sib_dummy to m-train
m_train = pd.concat([m_train, sib_dummy_train], axis = 1)
m_train = m_train.drop(["SibSp"], axis = 1)

m_test = pd.concat([m_test, sib_dummy_test], axis = 1)
m_test = m_test.drop(["SibSp"], axis = 1)

m_train.sample(5)

## 4.3.3 Parch

Categorize into 0,1, 2 and more

In [None]:
# Map Parch into categories
train.loc[train["Parch"] == 0, "pc_c"] = "None_pc"
train.loc[train["Parch"] == 1, "pc_c"] = "One_pc"
train.loc[train["Parch"] >1 , "pc_c"] = "2 or More_pc"

test.loc[test["Parch"] == 0, "pc_c"] = "None_pc"
test.loc[test["Parch"] == 1, "pc_c"] = "One_pc"
test.loc[test["Parch"] >1 , "pc_c"] = "2 or More_pc"

# Generate Dummy Variable
pc_dummy_train = pd.get_dummies(train["pc_c"], drop_first = True)
pc_dummy_test = pd.get_dummies(test["pc_c"], drop_first = True)


# Append pc_dummy to m_train/m_test
m_train = pd.concat([m_train, pc_dummy_train], axis = 1)
m_train = m_train.drop(["Parch"], axis = 1)

m_test = pd.concat([m_test, pc_dummy_test], axis = 1)
m_test = m_test.drop(["Parch"], axis = 1)

m_train.sample(5)

## 4.3.4 Fare

In [None]:
# Map fare values into categories
train["fare_c"] = pd.qcut(train["Fare"], 4, labels = ["fare25%", "fare50%", "fare75%","fare100%"])
test["fare_c"] = pd.qcut(test["Fare"], 4, labels = ["fare25%", "fare50%", "fare75%","fare100%"])

# Generate dummy variables for both train and test
fare_dummy_train = pd.get_dummies(train["fare_c"], drop_first = True)
fare_dummy_test = pd.get_dummies(test["fare_c"], drop_first = True)

# Append dummy variables to the original data frames
m_train = pd.concat([m_train, fare_dummy_train], axis = 1)
m_train = m_train.drop(["Fare"], axis = 1)

m_test = pd.concat([m_test, fare_dummy_test], axis = 1)
m_test = m_test.drop(["Fare"], axis = 1)

m_train.sample(5)
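
The cell above computes quartiles separately for train and test, so the same label can cover different fare ranges in the two datasets. One possible alternative, sketched here on made-up fares, is to learn the bin edges from the training data with `retbins=True` and reuse them for the test data with `pd.cut`:

```python
import numpy as np
import pandas as pd

labels = ["fare25%", "fare50%", "fare75%", "fare100%"]
train_fare = pd.Series([5.0, 7.0, 8.0, 10.0, 15.0, 30.0, 60.0, 100.0])  # made-up fares
test_fare = pd.Series([6.0, 12.0, 500.0])                               # made-up fares

# Quartile edges are learned from the training fares only
train_c, edges = pd.qcut(train_fare, 4, labels=labels, retbins=True)

# Open the outer edges so unseen extreme fares still fall into a bin
edges[0], edges[-1] = -np.inf, np.inf
test_c = pd.cut(test_fare, bins=edges, labels=labels)
print(test_c.tolist())
```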

## 4.4 Assign numerical values to categorical variables

## 4.4.1 Sex

In [None]:
# Generate dummy variables for both train and test
sex_dummy_train = pd.get_dummies(train["Sex"], drop_first = True)
sex_dummy_test = pd.get_dummies(test["Sex"], drop_first = True)

# Append dummy variables to the original data frames
m_train = pd.concat([m_train, sex_dummy_train], axis = 1)
m_train = m_train.drop(["Sex"], axis = 1)

m_test = pd.concat([m_test, sex_dummy_test], axis = 1)
m_test = m_test.drop(["Sex"], axis = 1)

m_train.sample(5)

## 4.4.2 Embarked

In [None]:
# Generate dummy variables for both train and test
emk_dummy_train = pd.get_dummies(train["Embarked"], drop_first = True)
emk_dummy_test = pd.get_dummies(test["Embarked"], drop_first = True)

# Append dummy variables to the original data frames
m_train = pd.concat([m_train, emk_dummy_train], axis = 1)
m_train = m_train.drop(["Embarked"], axis = 1)

m_test = pd.concat([m_test, emk_dummy_test], axis = 1)
m_test = m_test.drop(["Embarked"], axis = 1)

m_train.sample(5)

## 4.4.3 Pclass

In [None]:
# Map Pclass into categories
train.loc[train["Pclass"] == 1, "class_c"] = "class1"
train.loc[train["Pclass"] == 2, "class_c"] = "class2"
train.loc[train["Pclass"] == 3, "class_c"] = "class3"

# The masks for test must come from test, not train
test.loc[test["Pclass"] == 1, "class_c"] = "class1"
test.loc[test["Pclass"] == 2, "class_c"] = "class2"
test.loc[test["Pclass"] == 3, "class_c"] = "class3"

# Generate dummy variables for both train and test
class_dummy_train = pd.get_dummies(train["class_c"], drop_first = True)
class_dummy_test = pd.get_dummies(test["class_c"], drop_first = True)

# Append dummy variables to the original data frames
m_train = pd.concat([m_train, class_dummy_train], axis = 1)
m_train = m_train.drop(["Pclass"], axis = 1)

m_test = pd.concat([m_test, class_dummy_test], axis = 1)
m_test = m_test.drop(["Pclass"], axis = 1)

m_train.sample(5)

In [None]:
# Check the status of the modelling datasets before modelling
m_train.info()
print('***************************************************')
m_test.info()

# 5. Running Machine Learning Algorithms

The datasets are finally ready for modeling!!!

We will explore the following models:
- Gaussian Naive Bayes
- Logistic Regression
- Support Vector Machine
- Decision Tree Classifier
- Random Forest Classifier
- K-Nearest Neighbors

Note: all model parameters are left at their defaults as of 1/10/2018; they are to be tuned later.

In [None]:
# As inspired by Nadin, we will use 80% of the data for training,
# and the remaining 20% to test the accuracy of the model

predictors = m_train.drop(["Survived"], axis = 1)
target = m_train["Survived"]
x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size = 0.2, random_state = 0)
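
To see what the 80/20 split produces, here is a small sketch on made-up arrays (`X` and `y` are hypothetical stand-ins for `predictors` and `target`): with 10 rows and `test_size = 0.2`, 8 rows go to training and 2 to validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 made-up rows, 2 features
y = np.array([0, 1] * 5)           # made-up binary target

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr.shape, X_va.shape)
```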

In [None]:
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(x_train, y_train)
# y_pred = gaussian.predict(x_val)
acc_gaussian = gaussian.score(x_val, y_val)
acc_gaussian

In [None]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
#y_pred = logreg.predict(x_val)
acc_logreg = logreg.score(x_val, y_val)
acc_logreg

In [None]:
# Support Vector Machine
svc = SVC()
svc.fit(x_train, y_train)
#y_pred = svc.predict(x_val)
acc_svc = svc.score(x_val, y_val)
acc_svc

In [None]:
# Decision Tree Classifier
decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train, y_train)
#y_pred = decisiontree.predict(x_val)
acc_decisiontree = decisiontree.score(x_val, y_val)
acc_decisiontree

In [None]:
# Random Forest Classifier
randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)
#y_pred = randomforest.predict(x_val)
acc_randomforest = randomforest.score(x_val, y_val)
acc_randomforest

In [None]:
# K-Nearest Neighbors
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
#y_pred = knn.predict(x_val)
acc_knn = knn.score(x_val, y_val)
acc_knn

Based on the outputs above, Decision Tree seems to work the best.
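
Accuracy on a single 20% hold-out can vary with the split, so a ranking based on it may not be stable. One possible way to make the comparison less dependent on a single split is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for `predictors` and `target`, so the printed scores say nothing about the Titanic models.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (predictors, target); not the Titanic data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Accuracy on each of 5 folds, then mean and spread across folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```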

# 6. Create a Submission File


In [None]:
# Generate Predictions
prediction = decisiontree.predict(m_test)

submission_titanic = pd.DataFrame({ 'PassengerId': PassengerId,
                            'Survived': prediction })
submission_titanic.to_csv("submission_titanic.csv", index = False)
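
Before uploading, it may be worth checking that the file has exactly the two expected columns and one row per test passenger. The sketch below uses made-up IDs and predictions and writes to an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Hypothetical stand-ins for PassengerId and prediction
passenger_id = pd.Series([892, 893, 894])
prediction = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_id, "Survived": prediction})

buffer = io.StringIO()               # in-memory file instead of a CSV on disk
submission.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()
print(lines)
```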
