**Subject**: 

Kaggle - Titanic (https://www.kaggle.com/c/titanic)

**Goal:** 

To predict whether a passenger will survive. Because the target is known, this is a supervised learning problem. I will use the labeled data to build two models, Logistic Regression and Decision Tree, apply them to the validation dataset, and score each record to make the best prediction.

For this project:

I am predicting a binary, categorical target (outcome) variable.

Each row is a record/case (here, one passenger).

Each column is a variable/feature.

**Main packages I use**
- pandas
- matplotlib.pyplot
- numpy
- sklearn
- yellowbrick

I'm sharing two methods to upload the data.

|| First method: Upload files to colab ||

In [None]:
# import package
# upload files to colab

from google.colab import files
uploaded = files.upload()

# Store dataset in a Pandas Dataframe

import pandas as pd
import io
data = pd.read_csv(io.BytesIO(uploaded['train.csv']))

|| Second method: use google drive ||

In [None]:
# import package
import pandas as pd

# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization
# The reason for the force_remount is that I run multiple times.
drive.mount("/content/drive", force_remount=True)

path = "/content/drive/MyDrive/Colab Notebooks/pJ1_Titanic/train.csv"
data = pd.read_csv(path)


|| A glance at what the dataset looks like ||
- Any missing values?
- Any outliers?
- How many unique values does each variable have?
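
The first and last questions can also be answered directly by counting missing and distinct values per column. Here is a minimal sketch on a tiny made-up frame (`toy` and its values are hypothetical, not the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset standing in for the real `data`
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

missing_counts = toy.isnull().sum()   # missing values per column
unique_counts = toy.nunique()         # distinct non-missing values per column

print(missing_counts)
print(unique_counts)
```

The same two calls should work on `data` once it is loaded.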

In [None]:
# Check the data shape and content
print("The dimension of the table is:", data.shape)
pd.DataFrame(data.head(3))

In [None]:
# Check what the variables look like
# Numerical variable result

print(data.describe())

# Categorical variable result - "unique"-show # of categories in each variable

data.describe(include=['O'])

***Categorical variable results***

**Cabin**: Contains far too many missing values and many unique values, so I may exclude it.

**Embarked**: Only two missing values. I will decide how to handle them later.

|| Pre-processing Data ||

**80% of work happens here. There are some questions we can ask ourselves.**

1) What is the target variable, and what are the features?
  - What's the distribution?
  - How many in different categories?

2) Are the numerical variables correlated?

3) Is the target variable correlated with the numeric features? (e.g. Scatter plot - younger, more likely to survive?)

4) Does the survival rate differ across categories? (e.g. Women more likely to survive?)

____________


STEP 1_1) What is the target variable, and what are the features?

- What's the distribution?

In [None]:
# import visualization package
import matplotlib.pyplot as plt

# set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (20,10)

# make subplots - one panel per feature, in a 3x2 grid
fig, axes = plt.subplots(nrows = 3, ncols = 2)

# Specify the features of interest - define xaxes & yaxes
num_features = ['Age', 'SibSp', 'Parch', 'Fare', 'Ticket', 'Cabin']
xaxes = num_features
yaxes = ['Counts', 'Counts', 'Counts', 'Counts', 'Counts', 'Counts']

# draw histograms
# flatten the 3x2 grid of axes so each panel can be indexed with [idx] in the loop
axes = axes.ravel()
for idx, ax in enumerate(axes):
    ax.hist(data[num_features[idx]].dropna(), bins=20)
    ax.set_xlabel(xaxes[idx], fontsize=10)
    ax.set_ylabel(yaxes[idx], fontsize=10)
    ax.tick_params(axis='both', labelsize=5)

**Features analysis**

**Age**: The distribution looks reasonable. "data.describe()" shows there are 177 missing values. As age is a key element for our predictive model, I'll deal with them later.

**Fare**: The distribution is seriously right-skewed. I may need to exclude outliers (if there are not too many) and/or normalize this feature.

**SibSp** and **Parch** look like numerical variables but have the traits of categorical variables, so I do not consider them skewed.

**Ticket** and **Cabin** are of no use for this model, so I exclude them.

In [None]:
# Update my features set for later use
num_features = ['Age', 'SibSp', 'Parch', 'Fare']


STEP 1_2) What is the target variable, and what are the features?

- How many in different categories?

In [None]:
#||Create Bar plot to explore Categorical data||
# Each bar plot is drawn separately here rather than in a loop.

# set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (20,10)

# make subplots
fig, axes = plt.subplots(nrows = 2, ncols = 2)

# - - make the data ready to feed into the visualizer
X_Survived = data.replace({'Survived':{1:'yes', 0:'no'}}).groupby('Survived').size().reset_index(name='Counts')['Survived']
Y_Survived = data.replace({'Survived':{1:'yes', 0:'no'}}).groupby('Survived').size().reset_index(name='Counts')['Counts']

# Make the bar plot -- upper left axes[0,0]
axes[0,0].bar(X_Survived, Y_Survived)
axes[0,0].set_title('Survived', fontsize=25)
axes[0,0].set_ylabel('Counts', fontsize=20)
axes[0,0].tick_params(axis='both', labelsize=15)

# - - make the data ready to feed into the visualizer
X_Pclass = data.replace({'Pclass':{1:'1st', 2:'2nd', 3:'3rd'}}).groupby('Pclass').size().reset_index(name='Counts')['Pclass']
Y_Pclass = data.replace({'Pclass':{1:'1st', 2:'2nd', 3:'3rd'}}).groupby('Pclass').size().reset_index(name='Counts')['Counts']

# Make the bar plot -- upper right axes[0,1]
axes[0,1].bar(X_Pclass, Y_Pclass)
axes[0,1].set_title('Pclass', fontsize=25)
axes[0,1].set_ylabel('Counts', fontsize=20)
axes[0,1].tick_params(axis='both', labelsize=15)

# - - make the data ready to feed into the visualizer
X_Sex = data.groupby('Sex').size().reset_index(name='Counts')['Sex']
Y_Sex = data.groupby('Sex').size().reset_index(name='Counts')['Counts']

# Make the bar plot -- lower left axes[1,0]
axes[1,0].bar(X_Sex, Y_Sex)
axes[1,0].set_title('Sex', fontsize=25)
axes[1,0].set_ylabel('Counts', fontsize=20)
axes[1,0].tick_params(axis='both', labelsize=15)

# - - make the data ready to feed into the visualizer
X_Embarked = data.groupby('Embarked').size().reset_index(name='Counts')['Embarked']
Y_Embarked = data.groupby('Embarked').size().reset_index(name='Counts')['Counts']

# Make the bar plot -- lower right axes[1,1]
axes[1,1].bar(X_Embarked, Y_Embarked)
axes[1,1].set_title('Embarked', fontsize=25)
axes[1,1].set_ylabel('Counts', fontsize=20)
axes[1,1].tick_params(axis='both', labelsize=15)

# Note:
# Survived and Pclass are stored as numbers, so I replace them with labels before grouping to get readable bar names.

STEP 2) Are the numerical variables correlated?
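
Before drawing anything, the same question can be answered numerically with a Pearson correlation matrix. This is a minimal sketch on made-up numbers (the frame `toy_num` is hypothetical, and `Parch` is deliberately set to twice `SibSp` so the two columns are perfectly correlated):

```python
import pandas as pd

# Hypothetical values, not taken from the Titanic data
toy_num = pd.DataFrame({
    "SibSp": [1, 0, 3, 0, 1],
    "Parch": [2, 0, 6, 0, 2],   # exactly 2 * SibSp
    "Fare": [7.25, 71.28, 8.05, 53.10, 8.46],
})

# Pairwise Pearson correlation between all numeric columns
corr_matrix = toy_num.corr(method="pearson")
print(corr_matrix.round(2))
```

Values close to 1 or -1 suggest that two features carry overlapping information.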

In [None]:
# Create Pearson Ranking visualization

# Set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (15,7)

# import the package for visualization of the correlation
from yellowbrick.features import Rank2D

# extract the numpy arrays from the data frame
x = data[num_features].to_numpy()

# instantiate the 2D visualizer with the Pearson ranking algorithm
visualizer = Rank2D(features=num_features, algorithm='pearson')
visualizer.fit(x)             # Fit the data to the visualizer
visualizer.transform(x)       # Transform the data
visualizer.poof()             # Finalize and show the figure

STEP 3) Correlation between the target variable and numeric features? (Scatter plot - younger, more likely to survive?)

- compare the distributions of numerical variables between passengers that survived and those that did not survive 

In [None]:
# Create ParallelCoordinates

%matplotlib inline
plt.rcParams['figure.figsize'] = (15,7)
plt.rcParams['font.size'] = 50

# setup the color for yellowbrick visualizer
from yellowbrick.style import set_palette
set_palette('reset')

# import packages
from yellowbrick.features import ParallelCoordinates

# specify the features of interest and the classes of the target
classes = ['Not-survived', 'Survived']
num_features = ['Age', 'SibSp', 'Parch', 'Fare']

# copy data to a new DataFrame
data_norm = data.copy()

# normalize data to 0-1 range (min-max scaling)
# Tips to normalize data by using Pandas:
# (1) standardization: normalized_df=(df-df.mean())/df.std()
# (2) min-max scaling: normalized_df=(df-df.min())/(df.max()-df.min())

for feature in num_features:
    data_norm[feature] = (data[feature] - data[feature].min()) / (data[feature].max() - data[feature].min())

# Extract the Numpy arrays from the DataFrame
X = data_norm[num_features].to_numpy()
y = data.Survived.to_numpy()

# Instantiate the visualizer
visualizer = ParallelCoordinates(classes=classes, features=num_features)

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the normalized data (X, not the earlier x)
visualizer.poof()         # Finalize and show the visualizer

STEP 4) Does the survival rate differ across categories?

(Correlation between the target variable and categorical features)

(Women more likely to survive?)
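
A survival rate per category can also be computed directly: the mean of a 0/1 column is the share of 1s. Here is a rough sketch on invented rows (`toy_sex` is hypothetical and only illustrates the groupby pattern):

```python
import pandas as pd

# Hypothetical passengers, not the real dataset
toy_sex = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column within each group = survival rate of that group
survival_rate = toy_sex.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

On the real data, the same pattern should also work with 'Pclass' or 'Embarked' as the grouping column.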

In [None]:
# Create bar plots
# set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (20,10)

# make subplots
fig, axes = plt.subplots(nrows = 2, ncols = 2)

# - - make the data ready to feed into the visualizer
Sex_survived = data.replace({'Survived':{1: 'Survived', 0: 'Not-survived'}})[data['Survived']==1]['Sex'].value_counts()
Sex_not_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}})[data['Survived']==0]['Sex'].value_counts()
Sex_not_survived = Sex_not_survived.reindex(index = Sex_survived.index) # sex_survived as index at the bottom

# Make the bar plot
p1 = axes[0, 0].bar(Sex_survived.index, Sex_survived.values)
p2 = axes[0, 0].bar(Sex_not_survived.index, Sex_not_survived.values, bottom=Sex_survived.values)
axes[0, 0].set_title('Sex', fontsize=25)
axes[0, 0].set_ylabel('Counts', fontsize=20)
axes[0, 0].tick_params(axis='both', labelsize=15)
axes[0, 0].legend((p1[0], p2[0]), ('Survived', 'Not-survived'), fontsize = 15) #display legend categories

# - - make the data ready to feed into the visualizer
Pclass_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}}).replace({'Pclass':{1: '1st', 2: '2nd', 3:'3rd'}})[data['Survived']==1]['Pclass'].value_counts()
Pclass_not_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}}).replace({'Pclass':{1: '1st', 2: '2nd', 3:'3rd'}})[data['Survived']==0]['Pclass'].value_counts()
Pclass_not_survived = Pclass_not_survived.reindex(index = Pclass_survived.index) # Pclass_survived as index at the bottom

# Make the bar plot
p3 = axes[0, 1].bar(Pclass_survived.index, Pclass_survived.values)
p4 = axes[0, 1].bar(Pclass_not_survived.index, Pclass_not_survived.values, bottom=Pclass_survived.values)
axes[0, 1].set_title('Pclass', fontsize=25)
axes[0, 1].set_ylabel('Counts', fontsize=20)
axes[0, 1].tick_params(axis='both', labelsize=15)
axes[0, 1].legend((p3[0], p4[0]), ('Survived', 'Not-survived'), fontsize = 15) #display legend categories

# - - make the data ready to feed into the visualizer
Embarked_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}})[data['Survived']==1]['Embarked'].value_counts()
Embarked_not_survived = data.replace({'Survived': {1: 'Survived', 0: 'Not-survived'}})[data['Survived']==0]['Embarked'].value_counts()
Embarked_not_survived = Embarked_not_survived.reindex(index = Embarked_survived.index) # Embarked_survived as index at the bottom

# Make the bar plot
p5 = axes[1, 0].bar(Embarked_survived.index, Embarked_survived.values)
p6 = axes[1, 0].bar(Embarked_not_survived.index, Embarked_not_survived.values, bottom=Embarked_survived.values)
axes[1, 0].set_title('Embarked', fontsize=25)
axes[1, 0].set_ylabel('Counts', fontsize=20)
axes[1, 0].tick_params(axis='both', labelsize=15)
axes[1, 0].legend((p5[0], p6[0]), ('Survived', 'Not-survived'), fontsize = 15) #display legend categories


**Feature selection and feature engineering**

In this step, we prepare the data for modeling:

1) Drop more features if we think they are not helpful to our model from the analysis above.

2) Deal with missing values.

3) Deal with the highly skewed dataset.

4) One Hot Encoding for the categorical features.

@ @ @

1) Drop features

In [None]:
# Drop features from the dataset
# columns= is required; without it, drop() looks for row labels and raises a KeyError

data = data.drop(columns=['PassengerId', 'Ticket', 'Cabin', 'Name'])
data.head()

2) Deal with missing values

- Omission: If only a small number of records have missing values, we can omit them. This is not practical if many records have missing values, or if the cost of omitting a record is too high, as in a medical study's dataset.

- Imputation: Make a best guess. For example, replace the missing values with the average.

  We will impute the missing values for this project. The fill-in function can be applied to other datasets too.
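
The choice between mean and median matters when a column has extreme values. Here is a small illustration with invented ages (`toy_ages` is hypothetical). The single high value pulls the mean up, while the median stays near the bulk of the data:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one extreme value
toy_ages = pd.Series([2.0, 30.0, 32.0, np.nan, 80.0])

filled_mean = toy_ages.fillna(toy_ages.mean())      # mean of 2, 30, 32, 80 is 36.0
filled_median = toy_ages.fillna(toy_ages.median())  # median of 2, 30, 32, 80 is 31.0

print(filled_mean[3], filled_median[3])
```

This is one reason a median fill can be a reasonable default for a column like 'Age'.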

In [None]:
# Filling in missing values for 'Age', 'Embarked'
# Age - use 'median value'
def fill_na(series):
    """Return a copy of the Series with missing values replaced by its median."""
    return series.fillna(series.median())

# Assign the result back: fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions
data['Age'] = fill_na(data['Age'])

# Check the result
data['Age'].describe()

# Embarked - use 'S' since there are only 2 missing values and S is the most common port.
def fill_na_2(series, value='S'):
    """Return a copy of the Series with missing values replaced by `value`."""
    return series.fillna(value)

data['Embarked'] = fill_na_2(data['Embarked'])

# Check the result
data['Embarked'].describe()


3) Deal with the highly skewed dataset

- Normalizing or transforming a feature helps when variables with the largest scales would otherwise dominate and skew the results.
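
To see why a log transformation helps with a right-skewed feature, the skewness can be compared before and after. This is a sketch with invented fares (`toy_fares` is hypothetical). `np.log1p` computes log(1 + x), so a fare of 0 stays defined:

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one very large value
toy_fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 512.33])

skew_before = toy_fares.skew()            # sample skewness of the raw values
skew_after = np.log1p(toy_fares).skew()   # sample skewness after log(1 + x)

print(skew_before, skew_after)
```

A smaller positive skewness after the transformation means the long right tail has been compressed.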



In [None]:
# Log-transformation of the fare
# 'Fare' is highly right-skewed

# import package
import numpy as np

# log-transformation
def log_transformation(data):
    return data.apply(np.log1p)

data['Fare_log1p']=log_transformation(data['Fare'])

# Check the result
data.describe()

In [None]:
# Use the distribution to check the transformed 'Fare'

# Set up the figure size
%matplotlib inline
plt.rcParams['figure.figsize'] = (10,5)

# Draw histograms
plt.hist(data['Fare_log1p'], bins = 40)
plt.xlabel('Fare_log1p', fontsize = 20)
plt.ylabel('Counts', fontsize = 20)
plt.tick_params(axis = 'both', labelsize=15)

4) One Hot Encoding for the categorical features:

- Convert categorical data to numerical data (create binary dummies or indicator variables)

In [None]:
# Get the categorical data

cat_features = ['Pclass', 'Sex', 'Embarked']
data_cat = data[cat_features]
print(data_cat.head(3))

# Pclass looks numeric but is actually a categorical variable, so replace the numbers with labels first.

data_cat = data_cat.replace({'Pclass':{1:'1st', 2:'2nd', 3:'3rd'}})
print(data_cat.head(3))

# One hot encoding
data_cat_dummies = pd.get_dummies(data_cat)
data_cat_dummies.head()


**The introduction to the goal of this project and the methodology for preprocessing the data ends here. After the data preprocessing, we can move on to partitioning the data in order to build and assess the model.**

The next steps:

- Partitioning the dataset
  * Problem: How well will our model perform with new data?
  * Solution: Separate the data into two parts: Training and Validation. (Or into three parts: Training, Validation, and Test.)
- Train the model
- Evaluate the model
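
The three-part option can be built by calling `train_test_split` twice. The sketch below uses synthetic arrays (`X_demo`, `y_demo`, and the 60/20/20 proportions are my own assumptions for illustration, not settings used in this project):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, balanced binary target
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# First hold out 20% of the rows as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# Then split the remaining 80% so that 25% of it (20% overall) is validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

print(len(X_tr), len(X_va), len(X_te))
```

`stratify` keeps the class proportions similar in each part, which can help when one class is rarer than the other.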

In [None]:
# Create the full feature dataset: combine the numerical features and the dummy features (data_cat_dummies)
# The original num_features list can't be used here because the model needs the log-transformed fare
# Another test model could use the original fare to compare the results.

features_model = ['Age', 'SibSp', 'Parch', 'Fare_log1p']
dataset_features_x = pd.concat([data[features_model], data_cat_dummies], axis=1)

# Create the full target dataset
# Our target variable is 'Survived', which is stored as 0 and 1, so map it to labels

dataset_targets_y = data.replace({'Survived':{1:'Survived', 0: 'Not_survived'}})['Survived']

# Dataset is ready above
# Partitioning dataset into training and validation

# Import packages

from sklearn.model_selection import train_test_split

# Partitioning the dataset
# X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=?, random_state=?)

x_train, x_val, y_train, y_val = train_test_split(dataset_features_x, dataset_targets_y, test_size = 0.3, random_state=11)

# Number of samples in each dataset - Make sure the shape of x,y dataset matches

print("For features-No. of samples in training set: ", x_train.shape)
print("For features-No. of samples in validation set: ", x_val.shape)
print("For targets-No. of samples in training set:", y_train.shape)
print("For targets-No. of samples in validation set: ", y_val.shape)

# Details of target dataset (Survived and not-survived details)

print("\n")
print("No. of survived and not-survived in the training set: ")
print(y_train.value_counts())

print("\n")
print("No. of survived and not-survived in the validation set: ")
print(y_val.value_counts())


Predicting whether a passenger survived is a classification problem.
- We are using two models: **Logistic regression** and **Decision tree**

To evaluate binary classification performance:

- We are using the **Confusion Matrix, precision, recall, F1 score,** and **ROC Curve**

In [None]:
# LogisticRegression
# Import packages

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ROCAUC

# Instantiate the classification model

model = LogisticRegression()

# The ConfusionMatrix Visualizer takes a model

cm = ConfusionMatrix(model, classes=['Not_survived', 'Survived'])

# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model

cm.fit(x_train, y_train)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.

cm.score(x_val, y_val)

# change fontsize of the labels in the figure

for label in cm.ax.texts:
    label.set_size(25)

# How did we do?
cm.poof()


# Logistic Regression

|| Confusion Matrix can tell us the accuracy of our prediction ||

**True Positive**: If the Actual Outcome is 1 (Survived), and the Predicted result is also 1.

**True Negative**: If the Actual Outcome is 0 (Not_Survived), and the Predicted result is also 0.

**False Positive**: If the Actual Outcome is 0, but the Predicted result is 1.

**False Negative**: If the Actual Outcome is 1, but the Predicted result is 0.
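
These four cells can be read straight out of scikit-learn's `confusion_matrix`. Here is a minimal sketch with invented labels (`actual` and `predicted` are hypothetical, with 1 = Survived and 0 = Not_survived):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = Survived, 0 = Not_survived
actual    = [1, 0, 0, 1, 1, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 1, 0]

# For binary 0/1 labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tn, fp, fn, tp)
```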

||Total Accuracy||

To get the accuracy of the model, we need the percentage of "**the number of correct predictions (True Positive and True Negative)**" over "**the number of all predictions**".

Number of correct predictions: 158+68=226

Total predictions: 158+18+68+24=268

Total Accuracy: 226/268= 84.33% 

The accuracy is not that bad.
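
The arithmetic can be double-checked in a few lines, using the four counts quoted above (the variable names are mine):

```python
# Counts quoted from the logistic regression confusion matrix
correct = 158 + 68     # the two cells where prediction and actual outcome agree
errors = 18 + 24       # the two cells where they disagree
total = correct + errors

accuracy = correct / total
print(f"{accuracy:.2%}")
```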


In [None]:
# Logistic Regression
# Precision, Recall, and F1 Score
# Set the size of the figure and the font size

%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 7)
plt.rcParams['font.size'] = 20

# Instantiate the visualizer

visualizer = ClassificationReport(model, classes=['Not_survived', 'Survived'])
visualizer.fit(x_train, y_train)          # Fit the training dataset to the visualizer
visualizer.score(x_val, y_val)            # Evaluate the model on the validation dataset
visualizer.poof()

# Logistic Regression

|| Precision, Recall, and F1 Score ||

The precision, recall, and F1 scores also show that the performance is pretty good.
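
For reference, the three scores follow from the confusion matrix counts. The sketch below uses made-up counts (tp=30, fp=10, fn=20 are hypothetical, not this model's results), and the helper `precision_recall_f1` is my own name:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp)   # of all predicted positives, how many were right
    recall = tp / (tp + fn)      # of all actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for illustration only
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, f1)
```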

In [None]:
# Logistic Regression
# ROC Curve and AUC

# Instantiate the visualizer
visualizer = ROCAUC(model)

visualizer.fit(x_train, y_train)   # Fit the training dataset to the visualizer
visualizer.score(x_val, y_val)     # Evaluate the model on the validation dataset
visualizer.poof()                  

# Logistic Regression

**ROC Curve**:

The dotted diagonal line corresponds to random ranking. The ROC Curve of a strong model will be far above that line.

**Area Under the Curve (AUC)**:
- Measures model performance from 0.5 (no better than random ranking) to 1.0 (perfect ranking)
- As a rule of thumb, AUC < 0.6 is a weak model
- As a rule of thumb, AUC > 0.7 is a strong model
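
AUC can also be computed without a plot. It equals the share of (positive, negative) pairs in which the positive case gets the higher score. Here is a tiny sketch with invented scores (`toy_actual`, `toy_scores`, and `perfect_scores` are hypothetical):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical outcomes and predicted probabilities of the positive class
toy_actual = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc_value = roc_auc_score(toy_actual, toy_scores)

# Every positive outranks every negative here
perfect_scores = [0.1, 0.2, 0.8, 0.9]
auc_perfect = roc_auc_score(toy_actual, perfect_scores)

print(auc_value, auc_perfect)
```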

In [None]:
# Decision Tree
# Import packages

from sklearn.tree import DecisionTreeClassifier

# Instantiate the classification model

model_tree = DecisionTreeClassifier()

# The ConfusionMatrix Visualizer takes a model

cm = ConfusionMatrix(model_tree, classes=['Not_survived', 'Survived'])

# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model

cm.fit(x_train, y_train)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.

cm.score(x_val, y_val)

# change fontsize of the labels in the figure

for label in cm.ax.texts:
    label.set_size(25)

# How did we do?
cm.poof()

# Decision Tree

||Total Accuracy||

Number of correct predictions: 148+65=213

Total predictions: 148+28+65+27=268

Total Accuracy: 213/268= 79.48% 

This accuracy is lower than the Logistic regression's (84.33%).


In [None]:
# Decision Tree
# ROC Curve and AUC

# Instantiate the visualizer
visualizer = ROCAUC(model_tree)

visualizer.fit(x_train, y_train)   # Fit the training dataset to the visualizer
visualizer.score(x_val, y_val)     # Evaluate the model on the validation dataset
visualizer.poof() 

# Decision Tree

**ROC Curve and AUC**:

Compared to the Logistic regression's ROC Curve, the Decision tree's curve is clearly weaker and closer to the dotted diagonal line.

If I have to choose between the two models, I'll use the Logistic regression model for this project.


# Thanks

This is my first project. I referred to several resources and course materials, and I appreciate how they helped me grow.

Reference:

*How to Start Your First Data Science Project*
https://www.districtdatalabs.com/how-to-start-your-first-data-science-project By Juan L. Kehoe

*UConn Predictive Modeling course* by Jennifer Eigo

*scikit-learn* https://scikit-learn.org/stable/index.html

And many other discussion boards.