# Lab 3: Titanic

### Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew (32% survival rate). This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

### Overview

The data has been split into two groups:

• training set (titanic_train.csv)

• test set (titanic_test.csv)

The training set should be used to build your machine learning models. The training set includes the target (outcome) value for each passenger. Your model will be based on “features” like the passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. The test set lacks the target variable. It is the task of your model to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

### Data Dictionary

Survived: 0 = No, 1 = Yes

Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)

Sex: gender

Age: Age in years 

SibSp: # of siblings / spouses aboard the Titanic

Parch: # of parents / children aboard the Titanic 

Ticket: ticket number

Fare: Passenger fare 

Cabin: Cabin number 

Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


**Variable Notes**

Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

SibSp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore Parch=0 for them.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("titanic_train.csv")
df.head()

## Inspect the features
Note which are numerical and which are categorical.

### Distribution of numerical features
Check for outliers.

In [None]:
df.describe()

### Check for missing values

In [None]:
df.info()
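
As a complement to `df.info()`, the count and fraction of missing entries per column can be tabulated directly. The sketch below is only an illustration: `toy` is a tiny frame with made-up values that borrows a few of the lab's column names, and `missing` is a hypothetical helper table.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the lab data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and fraction of missing entries per column, most-missing first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "frac_missing": toy.isnull().mean(),
}).sort_values("frac_missing", ascending=False)
print(missing)
```

Running the same two expressions on `df` should show which columns are candidates for imputation and which are too sparse to keep.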

### Distribution of categorical features
Which features can be dropped?
Which features may we want to complete/impute?

In [None]:
df.describe(include=["O"])

How many passengers survived?

What percentage of passengers traveled without a parent or child?
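
One possible way to approach the two questions above is to use the fact that the mean of a 0/1 column (or of a boolean mask) is a proportion. The frame `toy` below is made up for illustration and is not taken from the dataset; the same expressions should carry over to `df`.

```python
import pandas as pd

# Made-up rows; only the column names match the lab data.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Parch":    [0, 0, 2, 1, 0],
})

# Survived is coded 0/1, so its sum is a count and its mean is a rate.
n_survived = int(toy["Survived"].sum())
pct_survived = 100 * toy["Survived"].mean()

# A boolean mask averages to the share of rows where the condition holds.
pct_no_parch = 100 * (toy["Parch"] == 0).mean()

print(n_survived, pct_survived, pct_no_parch)
```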

## Exploratory Data Analysis

In [None]:
df["Parch"].unique()

In [None]:
fig, ax = plt.subplots(1,2, figsize=(12,5))
ax[0].hist(df.loc[df.Age.notnull(),'Age'],  alpha=.3, edgecolor='black', bins=[0,10,20,30,40,50,60,70,80,90])
ax[1].hist(df[['Parch', "SibSp"]],  alpha=.3, edgecolor='black', bins=[0,1,2,3,4,5,6])
plt.show()

### Analyze by grouping (pivoting) features

In [None]:
df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)


In [None]:
df[['Sex', 'Survived']].groupby(['Sex']).mean().sort_values(by='Survived', ascending=False)

In [None]:
df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)


### Visualize feature relationships

In [None]:
g = sns.FacetGrid(df, col='Survived')
g.map(plt.hist, 'Fare', bins=20)  # use age once imputed

In [None]:
#grid = sns.FacetGrid(df, col='Pclass', hue='Survived')
grid = sns.FacetGrid(df, col='Survived', row='Pclass', height=2.2, aspect=1.6)  # 'size' was renamed to 'height' in seaborn 0.9
grid.map(plt.hist, 'Fare', alpha=.5, bins=20)    # use age once imputed
grid.add_legend();

### Relating categorical features

In [None]:
# grid = sns.FacetGrid(df, col='Embarked')
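grid = sns.FacetGrid(df, row='Embarked', height=2.2, aspect=1.6)  # 'size' was renamed to 'height' in seaborn 0.9
# Fix the category order so every facet plots the same levels in the same positions
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep', order=[1, 2, 3], hue_order=['male', 'female'])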
grid.add_legend()

### Relating categorical and numerical features

In [None]:
#grid = sns.FacetGrid(df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
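grid = sns.FacetGrid(df, row='Embarked', col='Survived', height=2.2, aspect=1.6)  # 'size' was renamed to 'height' in seaborn 0.9
# Fix the category order so every facet plots the same levels in the same positions
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None, order=['male', 'female'])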
grid.add_legend()

### Drop poor features

In [None]:
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

### Impute missing values (Embarked and Age)

***Impute 'Embarked'***

In [None]:
most_common_port = df.Embarked.value_counts().idxmax()
most_common_port

In [None]:
df['Embarked'] = df['Embarked'].fillna(most_common_port)
    
df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# Converting a categorical feature
df['Embarked'] = df['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} )

***Impute 'Age'***

Note: There is a correlation among Age, Gender, and Pclass. 

Perhaps impute Age values across sets of Pclass and Gender feature combinations. The median value can be used or alternatively a random value within 1 standard deviation of the mean Age.
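
The median variant of this idea can be sketched with a grouped `transform`, which computes one median per (Pclass, Sex) combination and aligns it back to the rows. The frame `toy` and its values are made up for illustration; the sketch assumes the same column names as the lab data and works whether `Sex` is stored as strings or as codes.

```python
import numpy as np
import pandas as pd

# Made-up rows with two missing ages, one in each group.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median Age within each (Pclass, Sex) group, broadcast back to every row.
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched.
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Compared with drawing a single random value per group, the median is deterministic, which makes results easier to reproduce, at the cost of slightly shrinking the variance of `Age`.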

In [None]:
# Converting a categorical feature
df["Sex"] = df["Sex"].map({'male':0, 'female':1})

In [None]:
# Males (coded as Sex=0) in First Class (coded as Pclass=1)
Age01_mean = df.loc[(df['Sex']==0) & (df['Pclass']==1), 'Age'].mean()
Age01_std = df.loc[(df['Sex']==0) & (df['Pclass']==1), 'Age'].std()
Age01_mean, Age01_std

In [None]:
# Males in Second Class
Age02_mean = df.loc[(df['Sex']==0) & (df['Pclass']==2), 'Age'].mean()
Age02_std = df.loc[(df['Sex']==0) & (df['Pclass']==2), 'Age'].std()
Age02_mean, Age02_std

In [None]:
# using the mean and std of Males in First Class computed above,
# randomly generate an Age within 1 standard deviation of the mean
Age01_impute = round(np.random.uniform(Age01_mean - Age01_std, Age01_mean + Age01_std))
Age01_impute

In [None]:
# replace the null values with the imputed age (Sex is coded 0 = male at this point)
df.loc[(df["Age"].isnull()) & (df["Sex"] == 0) & (df["Pclass"] == 1), 'Age'] = Age01_impute

### Feature Engineering
Perhaps create a new feature by creating Age or Fare bands (discretization).
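
For a skewed feature such as Fare, quantile-based bands are one option worth considering, because equal-width bins would put most passengers in the lowest band. The sketch below uses `pd.qcut` on a made-up frame `toy`; the fare values and the choice of four bands are assumptions for illustration only.

```python
import pandas as pd

# Made-up fares with a long right tail.
toy = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 26.0, 31.0, 52.0, 71.28, 263.0]})

# Quantile-based bands: each band holds roughly the same number of passengers.
toy["FareBand"] = pd.qcut(toy["Fare"], 4, labels=[1, 2, 3, 4])

print(toy)
print(toy["FareBand"].value_counts().sort_index())
```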

In [None]:
# Create "AgeBand" feature

#df['AgeBand'] = pd.cut(df['Age'], 5)
#pd.cut(df['Age'], 4)
#pd.cut(df['Age'], [0,20,40,60,80])
#pd.cut(df['Age'], [0,20,40,60,80], labels=["child","adult","middle age","elder"])
df['AgeBand'] = pd.cut(df['Age'], [0,20,40,60,80], labels=[1,2,3,4])
df["AgeBand"].head()

In [None]:
# manually creating AgeBands

# Note: this overwrites Age in place; unlike pd.cut above, each band includes its lower edge
df.loc[df['Age'] < 20, 'Age'] = 1
df.loc[(df['Age'] >= 20) & (df['Age'] < 40), 'Age'] = 2
df.loc[(df['Age'] >= 40) & (df['Age'] < 60), 'Age'] = 3
df.loc[(df['Age'] >= 60), 'Age'] = 4

In [None]:

df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)

In [None]:
# Create "FamilySize" feature  and perhaps drop "SibSp"/"Parch"
 
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# Create IsAlone feature
df['IsAlone'] = 0
df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1

df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()

In [None]:
df.head()

# Model and predict

In [None]:
# ...
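
As a starting point only, the sketch below shows one way this step could look. It is not the lab's solution: the data is synthetic (generated with a fixed seed and loosely tied to `Sex` and `Pclass`), logistic regression is an arbitrary choice of algorithm, and the validation split is an assumption made because the test file has no `Survived` column to score against.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned frame: made-up rows, lab column names.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),          # 0 = male, 1 = female
    "FamilySize": rng.integers(1, 6, n),
})
# Made-up outcome loosely tied to Sex and Pclass so there is a signal to learn.
score = 1.5 * toy["Sex"] - 0.5 * toy["Pclass"] + rng.normal(0, 1, n)
toy["Survived"] = (score > 0).astype(int)

X = toy.drop(columns="Survived")
y = toy["Survived"]

# Hold out part of the labelled data to estimate performance on unseen rows.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
val_accuracy = model.score(X_val, y_val)
print("validation accuracy:", val_accuracy)
```

With the real data, `X` would be the engineered feature columns of `df`, and `model.predict` would then be applied to the test set after giving it the same cleaning and encoding steps.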


# Lab Homework 2

Using either the Titanic dataset or the Pima Indian dataset, build the best machine learning model that you can to predict whether an individual survived/has diabetes.

***Things to consider:***

• Clean and transform the data as you desire

• Summarize and/or visualize the data 

• Standardize the data

• Choose 2-5 algorithms and perform 10-fold cross validation

• Display a boxplot and select the best performing model

• Tune its hyperparameters (manually or using grid search)

• Train the same algorithm on your full training set (no cross validation): model.fit(X_std_train, y_train)

• Test the model on your test set, standardized with the scaler fitted on the training set: model.score(X_std_test, y_test)

• Display the Precision, Recall, and F1 score metrics along with a confusion matrix 

• Be able to explain what the scores and confusion matrix mean regarding your results 
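
The workflow in the list above could be sketched roughly as follows. Everything here is an assumption for illustration: the data comes from `make_classification` rather than either dataset, the three algorithms are arbitrary picks, and the scaler is placed inside a pipeline (rather than applied once up front) so that it is re-fit on each training fold.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the prepared features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling lives inside each pipeline so it is re-fit on every training fold.
candidates = {
    "LogReg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Tree": make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0)),
}

# 10-fold cross validation on the training set only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}

# One box per algorithm: the spread of the 10 fold accuracies.
fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(list(results.values()))
ax.set_xticks(range(1, len(results) + 1))
ax.set_xticklabels(list(results.keys()))
ax.set_ylabel("accuracy")

# Pick the highest mean CV accuracy, refit on the full training set, test once.
best_name = max(results, key=lambda name: results[name].mean())
best_model = candidates[best_name].fit(X_train, y_train)
y_pred = best_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

print(best_name, best_model.score(X_test, y_test))
print(cm)  # rows are true classes, columns are predicted classes
print(classification_report(y_test, y_pred))
```

In the confusion matrix, the diagonal holds correct predictions; precision and recall for class 1 are the bottom-right count divided by its column total and row total, respectively.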

***A few things to possibly try to improve classification performance:***

• Clean the data

• Impute or remove missing values

• Feature selection

• Feature engineering

• Try a different scaler (standardization vs. normalization)

• Try different algorithms

• Tune the hyperparameters (grid search)
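
Two of these suggestions, trying a different scaler and tuning hyperparameters, can be explored in a single grid search by treating the scaler as one more parameter of a pipeline. The sketch below is a hedged illustration: the data is synthetic, k-nearest neighbours is an arbitrary choice, and the grid values are assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data standing in for the prepared features.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])

# The "scaler" entry swaps the whole step: standardization vs. normalization.
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "clf__n_neighbors": [3, 5, 11],
    "clf__weights": ["uniform", "distance"],
}

# Every combination is scored with 10-fold cross validation on the training set.
search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(search.best_score_, test_accuracy)
```

After fitting, `search` behaves like the best pipeline refit on the full training set, so `search.predict` and `search.score` can be used directly on the test set.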