# Lab: Titanic

### Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew (32% survival rate). This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this lab, complete the analysis of what sorts of people were likely to survive. In particular, apply the tools of machine learning to predict which passengers survived the tragedy.

### Overview

Split the data into training and test sets.

The **training** set should be used to build your machine learning models. Your model will be based on “features” like the passengers’ gender and class. You can also use feature engineering to create new features. Use your trained model to predict whether or not an individual survived the sinking of the Titanic. The **test** set should be used to see how well your model performs on unseen data. 
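The split described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `train_test_split`; the `toy` frame is a made-up stand-in for the real data, and the 80/20 ratio and `random_state` are arbitrary choices rather than requirements of the lab.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Titanic data (values are invented for illustration)
toy = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 3, 2, 3, 1, 3],
    "Sex":      [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "Survived": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
})

X = toy.drop(columns="Survived")   # features
y = toy["Survived"]                # target

# Hold out 20% of the rows; stratify keeps the survival ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

The test rows are set aside before any model fitting so that the final score reflects data the model has never seen.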

### Data Dictionary

Survived: 0 = No, 1 = Yes

Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd

Sex: gender

Age: Age in years 

SibSp: # of siblings / spouse aboard the Titanic

Parch: # of parents / children aboard the Titanic 

Ticket: ticket number

Fare: Passenger fare 

Cabin: Cabin number 

Embarked: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton


**Variable Notes**

Pclass: A proxy for socio-economic status (SES):
1st = Upper,
2nd = Middle,
3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

SibSp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister;  Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
Parent = mother, father;
Child = daughter, son, stepdaughter, stepson;
Some children travelled only with a nanny, therefore Parch=0 for them.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("titanic.csv")
df.head()

## Inspect the features
Note which are numerical and which are categorical.

### Distribution of numerical features
Check for outliers.

In [None]:
df.describe()
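One common way to flag outliers is the 1.5 × IQR rule. The sketch below applies it to a handful of made-up fare values; the `fares` Series and the 1.5 multiplier are assumptions for illustration, and other thresholds may suit the real data better.

```python
import pandas as pd

# Made-up fares; the last value is deliberately extreme
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 52.0, 71.28, 512.33], name="Fare")

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidate outliers
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

A flagged value is not necessarily an error; a very high fare may be a genuine first-class ticket.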

### Check for missing values
Which features can be dropped?
Which features may we want to complete/impute?

In [None]:
df.info()
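A per-column count of missing values can make the drop-or-impute decision easier. The following is a small sketch on a made-up `toy` frame (the column names mirror the dataset, the values do not).

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in three of its four columns
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 38.0, np.nan],
    "Cabin":    [None, None, "C85", None],
    "Embarked": ["S", "C", None, "S"],
    "Fare":     [7.25, 71.28, 8.05, 13.0],
})

# Count and percentage of missing values per column, worst first
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)
print(summary)
```

Columns that are mostly empty are candidates for dropping, while columns with only a few gaps are usually worth imputing.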

### Distribution of categorical features

In [None]:
df.describe(include=["O"])

How many passengers survived?

In [None]:
df["Survived"].sum()

What percentage of passengers traveled without a parent or child?

In [None]:
# Share of passengers with Parch == 0, expressed as a percentage
(df["Parch"] == 0).mean() * 100

## Exploratory Data Analysis

What are the unique values for "Parch"?

In [None]:
df["Parch"].unique()

How many passengers travelled with a combined total of 6 parents and children?

In [None]:
df.loc[df["Parch"]==6, "Parch"].count()

### Visualize the "Parch" and "SibSp" distributions

In [None]:
fig, ax = plt.subplots(1,2, figsize=(12,5))

ax[0].hist(df['Parch'],  alpha=.3, color = 'red', edgecolor='black', bins=[0,1,2,3,4,5,6,7])
ax[0].set(title="Titanic", xlabel='Parch')
ax[1].hist(df.loc[df.Age.notnull(),'SibSp'],  alpha=.3, edgecolor='black', bins=[0,1,2,3,4,5,6,7,8])
ax[1].set(title="Titanic", xlabel='SibSp')
plt.show()

### Analyze survival rate by grouping (pivoting) features

In [None]:
df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)


In [None]:
df[['Sex', 'Survived']].groupby(['Sex']).mean().sort_values(by='Survived', ascending=False)

In [None]:
df[["SibSp", "Survived"]].groupby(['SibSp']).mean().sort_values(by='Survived', ascending=False)
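The single-feature groupings above can be extended to two features at once. As a sketch, `pivot_table` lays the survival rate out with one feature on the rows and another on the columns; the `toy` frame below is invented for illustration.

```python
import pandas as pd

# Toy data: two classes, two sexes, made-up outcomes
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the survival rate for each (Pclass, Sex) cell
rates = toy.pivot_table(index="Pclass", columns="Sex",
                        values="Survived", aggfunc="mean")
print(rates)
```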


### Visualize feature relationships

In [None]:
g = sns.FacetGrid(df, col='Survived')
g.map(plt.hist, 'Fare', bins=20)

In [None]:
#grid = sns.FacetGrid(df, col='Pclass', hue='Survived')
grid = sns.FacetGrid(df, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Fare', alpha=.5, bins=20)
grid.add_legend();

### Relating categorical features

In [None]:
# grid = sns.FacetGrid(df, col='Embarked')
grid = sns.FacetGrid(df, row='Embarked', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep', order=None, hue_order=None)
grid.add_legend()
plt.show()

### Relating categorical and numerical features

In [None]:
#grid = sns.FacetGrid(df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
grid = sns.FacetGrid(df, row='Embarked', col='Survived', height=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None, order=["male", "female"], palette="muted")
grid.add_legend()
plt.show()

### Drop poor features

In [None]:
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

### Impute missing values (Embarked)

In [None]:
df.Embarked.value_counts()

In [None]:
most_common_port = df.Embarked.value_counts().idxmax()
most_common_port

In [None]:
df['Embarked'] = df['Embarked'].fillna(most_common_port)
    
df[['Embarked', 'Survived']].groupby(['Embarked']).mean().sort_values(by='Survived', ascending=False)

### One-hot encoding "Embarked"

In [None]:
df.head()

In [None]:
# use pandas to one-hot encode "Embarked"

# DEFAULTS:
    # prefix_sep='_' 
    # columns=None   ... will encode all columns with categorical variables
    # drop_first=False
# returns a DataFrame

df = pd.get_dummies(df, columns=["Embarked"])
df.head()
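The `drop_first` default noted in the comments above matters for some models. The sketch below, on a made-up `toy` frame, compares the default encoding with `drop_first=True`, which omits the first category because it is implied when all the other indicator columns are 0.

```python
import pandas as pd

# Toy frame with the three embarkation ports
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare":     [7.25, 71.28, 7.75, 8.05],
})

# Default: one indicator column per category
full = pd.get_dummies(toy, columns=["Embarked"])

# drop_first=True: the first category (C) is left out as the baseline
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one column avoids perfectly collinear features, which can matter for linear models without regularization; tree-based models are generally unaffected either way.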

### Convert "Sex" to a binary value

In [None]:
# Converting a categorical feature to a binary one
df["Sex"] = df["Sex"].map({'male':0, 'female':1})

### Impute missing values (Age)

Note: There is a relationship among Age, Sex, and Pclass.

Perhaps use the mean Age within each combination of Pclass and Sex to impute missing Ages. Alternatively, a random value within 1 standard deviation of the group's mean Age can be used, or the group's median.
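The group-based option can be sketched with `groupby(...).transform(...)`, which returns one value per row so it lines up with the original frame. The `toy` data is made up, and the median is used here only as an example; passing `"mean"` instead gives the mean-based variant.

```python
import numpy as np
import pandas as pd

# Toy frame: Sex already coded 0/1, two (Sex, Pclass) groups, one gap in each
toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median Age of each row's own (Sex, Pclass) group, aligned to the rows
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing Ages with their group's median
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```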

### Imputing with a randomly selected age within 1 standard deviation of its group's mean

In [None]:
# Males (coded as Sex=0) in First Class (coded as Pclass=1)
Age01_mean = df.loc[(df['Sex']==0) & (df['Pclass']==1), 'Age'].mean()
Age01_std = df.loc[(df['Sex']==0) & (df['Pclass']==1), 'Age'].std()
Age01_mean, Age01_std

In [None]:
# Males in Second Class
Age02_mean = df.loc[(df['Sex']==0) & (df['Pclass']==2), 'Age'].mean()
Age02_std = df.loc[(df['Sex']==0) & (df['Pclass']==2), 'Age'].std()
Age02_mean, Age02_std

In [None]:
# Males in Third Class
Age03_mean = df.loc[(df['Sex']==0) & (df['Pclass']==3), 'Age'].mean()
Age03_std = df.loc[(df['Sex']==0) & (df['Pclass']==3), 'Age'].std()
Age03_mean, Age03_std

Do the same as above for females (coded as Sex=1).

In [None]:
# use the mean and std of Males in First Class
# to randomly generate an Age within 1 standard deviation of the mean
Age01_impute = round(np.random.uniform(Age01_mean - Age01_std, Age01_mean + Age01_std))
Age01_impute

In [None]:
# replace the null values with the imputed age
# "Sex" was mapped to 0/1 earlier, so males are selected with Sex == 0
df.loc[(df["Age"].isnull()) & (df["Sex"] == 0) & (df["Pclass"] == 1), "Age"] = Age01_impute
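Repeating the cells above for every Sex and Pclass combination gets long. One possible way to cover all groups in a single loop is sketched below on a made-up `toy` frame. Unlike the cell above, it draws a separate random age for each missing row; that is a design choice of this sketch, not something the lab requires.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Toy frame: Sex coded 0/1, one missing Age in each group
toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Age":    [30.0, 40.0, 50.0, np.nan, 18.0, 22.0, 26.0, np.nan],
})

# Mean and standard deviation of Age per (Sex, Pclass) group
stats = toy.groupby(["Sex", "Pclass"])["Age"].agg(["mean", "std"])

for (sex, pclass), row in stats.iterrows():
    mask = toy["Age"].isnull() & (toy["Sex"] == sex) & (toy["Pclass"] == pclass)
    if mask.any():
        low, high = row["mean"] - row["std"], row["mean"] + row["std"]
        # One uniform draw per missing row, rounded to whole years
        toy.loc[mask, "Age"] = rng.uniform(low, high, size=mask.sum()).round()

print(toy)
```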

## Feature Engineering
Perhaps create an "AgeBand" feature by grouping Age within bands (discretization).

In [None]:
df["Age"].head()

In [None]:
# Create "AgeBand" feature

df['AgeBand'] = pd.cut(df['Age'], 5)
#df['AgeBand'] = pd.cut(df['Age'], 4)
#df['AgeBand'] = pd.cut(df['Age'], [0,20,40,60,80])
#df['AgeBand'] = pd.cut(df['Age'], [0,20,40,60,80], labels=["child","adult","middle age","elder"])
#df['AgeBand'] = pd.cut(df['Age'], [0,20,40,60,80], labels=[1,2,3,4])
df["AgeBand"].head()

In [None]:

df[['AgeBand', 'Survived', "Sex"]].groupby(['AgeBand', "Sex"]).mean().sort_values(by='AgeBand', ascending=True)

Perhaps create a **"FamilySize"** feature, combining "SibSp" and "Parch"

In [None]:
# Create "FamilySize" feature  and perhaps drop "SibSp" and "Parch"
 
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

df[['FamilySize', 'Survived']].groupby(['FamilySize']).mean().sort_values(by='Survived', ascending=False)

Perhaps create an **"IsAlone"** feature using "FamilySize"

In [None]:
# Create IsAlone feature
df['IsAlone'] = 0
df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1

df[['IsAlone', 'Survived']].groupby(['IsAlone']).mean()

In [None]:
df.head()

**Data Analysis:** 
It appears that women, children, the upper class, and those traveling with at least one other person, but no more than 2, had the best chances of surviving the Titanic tragedy.

# Model and predict

**Note:** Most scikit-learn estimators will raise an error if your data contains any NaNs. You must impute or drop them before fitting.

In [None]:
# ...
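As a starting point, the sketch below checks for NaNs, fills them, and fits a classifier. It is only an illustration: the `toy` frame is made up, `LogisticRegression` is one arbitrary choice of algorithm, and the model is scored on its own training rows purely to show the calls, not to measure performance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the cleaned data, with two missing Ages left in
toy = pd.DataFrame({
    "Sex":      [1, 0, 1, 0, 1, 0, 1, 0],
    "Pclass":   [1, 3, 2, 3, 1, 2, 3, 3],
    "Age":      [29.0, np.nan, 35.0, 40.0, np.nan, 54.0, 4.0, 22.0],
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Check for NaNs first, then impute (median used here as a simple example)
print(toy.isnull().sum())
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

X = toy[["Sex", "Pclass", "Age"]]
y = toy["Survived"]

# Fit and predict; max_iter is raised to give the solver room to converge
model = LogisticRegression(max_iter=1000).fit(X, y)
predictions = model.predict(X)
print(predictions)
```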


# Lab Homework 2

Using either the Titanic dataset or the Pima Indian dataset, build the best machine learning model that you can to predict whether an individual survived or has diabetes.

***Things to consider as you go through the process:***

• Clean and transform the data as you desire

• Summarize, group and/or visualize the data to better understand it

• Split the data into ***training*** and ***test*** sets

• Standardize the data

• Choose 2-5 algorithms and perform 10-fold cross validation on the training set

• Display a boxplot and select the best performing model

• Refine the model -- tune its hyperparameters (manually or using grid search)

• Train the refined model on your **full training set** (no cross validation): `model.fit(X_train_std, y_train)`

• Test the model on your unseen **test** set: `model.score(X_test_std, y_test)`

• Display the Precision, Recall, and F1 score metrics along with a confusion matrix 

• Be able to explain what the scores and confusion matrix mean regarding your results 
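The steps above can be sketched end to end as follows. This is a hedged outline, not a model answer: the data comes from `make_classification` rather than either real dataset, the two algorithms and the small parameter grid are arbitrary picks, and the scaler sits inside a pipeline (an assumption of this sketch) instead of producing separate `X_train_std` and `X_test_std` arrays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the cleaned features
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler inside each pipeline is re-fit on every training fold,
# so the validation fold never influences the standardization.
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# 10-fold cross validation on the training set only
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=10)
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")

# Tune one candidate with a small grid; GridSearchCV then refits the best
# setting on the full training set (refit=True is the default)
grid = GridSearchCV(
    candidates["knn"],
    param_grid={"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]},
    cv=10,
)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_

# Evaluate once on the held-out test set
y_pred = best_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("test accuracy:", best_model.score(X_test, y_test))
print(cm)
print(classification_report(y_test, y_pred))
```

Each array in `cv_scores` holds the ten fold scores for one algorithm, which is the input a boxplot comparison needs. In the confusion matrix, rows are the true classes and columns are the predicted classes, so the off-diagonal cells are the misclassifications.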

**A few things to keep in mind to improve classification performance:**

• Impute or remove missing values

• Feature selection

• Feature engineering

• Try different algorithms