# Project 4: Starter

In this project, you will combine and apply your knowledge from all three past unit projects to create a complete data science workflow on a new dataset. We will use the Kaggle Titanic competition dataset for this project.

In [None]:
import numpy as np
import pandas as pd
# model_selection supersedes cross_validation and grid_search (removed in scikit-learn 0.20)
from sklearn import model_selection
from sklearn import neighbors
from sklearn import metrics
from sklearn import linear_model

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", font_scale=1)
%matplotlib inline

![](../assets/images/workflow/data-science-workflow-01.png)

## Part 1. Identify the Problem

Using the competition description on [Kaggle](https://www.kaggle.com/c/titanic), write a short paragraph summarizing the problem, your goals and hypothesis.

**NOTE**: This section can be less rigorous for a Kaggle competition, where the problem, goals, and hypothesis are identified for you.

**Problem**: `*** FILL IN ***`

**Goals**: `*** FILL IN ***`

**Hypothesis**: `*** FILL IN ***`

![](../assets/images/workflow/data-science-workflow-02.png)

## Part 2. Acquire the Data

Kaggle has provided two files for this dataset:  
_train.csv_: Use for building a model (contains target variable "Survived")  
_test.csv_: Use for submission file (fill in for target variable "Survived")

Both files have been downloaded and added to your datasets folder. Read the files into a Pandas DataFrame.

**HINT**: You can further split _train.csv_ to create your own validation set for cross-validation. However, use all of _train.csv_ to train your final model, since Kaggle has already separated the test set for you.

In [None]:
# Load data
df = pd.read_csv("../assets/datasets/titanic/train.csv")

# Check head
df.head()

![](../assets/images/workflow/data-science-workflow-03-05.png)

## Part 3. Parse, Mine, and Refine the data

Perform exploratory data analysis and verify the quality of the data.

### Check columns and counts to drop any non-generic or near-empty columns

In [None]:
# Check columns
*** FILL IN ***

In [None]:
# Check counts
*** FILL IN ***

### Check for missing values and drop or impute

In [None]:
# Check counts for missing values in each column
*** FILL IN ***
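
As an illustration of the missing-value check described above, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are hypothetical and are not the Titanic data). It assumes only standard pandas calls: `isnull()`, `sum()`, and `mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fraction of rows missing in each column, handy for spotting near-empty columns
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

A column with a high missing fraction is a candidate for dropping, while one with only a few gaps is usually a candidate for imputation.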

### Write a function to wrangle the data to address any issues from above checks

In [None]:
def wrangler(df):
    # Drop non-generic columns PassengerId, Name, Ticket, and near-empty column Cabin
    df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

    # Replace missing values for Age using median value (median() already skips NaN)
    df.loc[df['Age'].isnull(), 'Age'] = df['Age'].median()
    
    # Replace missing values for Fare using median value (there are some missing in Kaggle's test set)
    df.loc[(df['Fare'].isnull()), 'Fare'] = *** FILL IN ***
    
    # Replace missing values for Embarked using mode value
    df.loc[(df['Embarked'].isnull()), 'Embarked'] = *** FILL IN ***
    
    return df

In [None]:
# Apply wrangler() to DF
df = wrangler(df)

# Check data
df.head()
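
The two imputation strategies mentioned in `wrangler()` (median for a numeric column, mode for a categorical one) can be sketched on made-up data as follows. The `toy` frame is hypothetical, and `fillna` is just one possible way to write the replacement; the `.loc` pattern used in the function above works equally well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Titanic data
toy = pd.DataFrame({
    "Fare": [10.0, np.nan, 30.0, 20.0],
    "Embarked": ["S", "C", None, "S"],
})

# median() ignores NaN by default
fare_median = toy["Fare"].median()

# mode() returns a Series because ties are possible, so take the first entry
embarked_mode = toy["Embarked"].mode()[0]

imputed = toy.copy()
imputed["Fare"] = imputed["Fare"].fillna(fare_median)
imputed["Embarked"] = imputed["Embarked"].fillna(embarked_mode)
print(imputed)
```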

### Perform exploratory data analysis

In [None]:
# Get summary statistics for data
*** FILL IN ***

In [None]:
# Get pair plot for data
*** FILL IN ***

In [None]:
# Analyze Unnormalized and Normalized Survival by Sex

# Get group by counts for Unnormalized Survival by Sex
df_by_sex_unnorm = pd.DataFrame()
df_by_sex_unnorm['male'] = df[df['Sex']=='male']['Survived'].value_counts()
df_by_sex_unnorm['female'] = df[df['Sex']=='female']['Survived'].value_counts()

# Get group by counts for Normalized Survival by Sex
df_by_sex_normed = pd.DataFrame()
df_by_sex_normed['male'] = df[df['Sex']=='male']['Survived'].value_counts(normalize=True)
df_by_sex_normed['female'] = df[df['Sex']=='female']['Survived'].value_counts(normalize=True)

# Plot Unnormalized and Normalized Survival by Sex
fig = plt.figure(figsize=(10,4))
ax1 = fig.add_subplot(121)
df_by_sex_unnorm.plot(ax=ax1, kind='bar', rot=0, title="Survival by Sex")
ax2 = fig.add_subplot(122)
df_by_sex_normed.plot(ax=ax2, kind='bar', rot=0, title="Normalized Survival by Sex")

In [None]:
# What other exploratory analysis can you perform?
*** FILL IN ***

### Check and convert all data types to numerical

In [None]:
# Check data types
*** FILL IN ***

### Write a function to pre-process data for building a model

In [None]:
def pre_proc(df):
    # Create dummy variables for all non-numerical columns
    
    # Get dummy variables for Sex
    *** FILL IN ***
    # Remove Sex column
    *** FILL IN ***

    # What other columns need dummy variables?
    *** FILL IN ***
    
    return df

In [None]:
# Apply pre_proc() to DF
df = pre_proc(df)

# Check cleaned data
df.head()
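
For reference, dummy-variable encoding of a text column might look like the sketch below. It runs on a hypothetical `toy` frame and uses `pd.get_dummies` with a `prefix`, which produces column names of the form `Sex_female` and `Sex_male`, the names the plotting cells later in this notebook expect.

```python
import pandas as pd

# Hypothetical toy frame, not the Titanic data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# One indicator column per category, prefixed with the source column name
dummies = pd.get_dummies(toy["Sex"], prefix="Sex")

# Attach the indicators and drop the original text column
encoded = pd.concat([toy.drop("Sex", axis=1), dummies], axis=1)
print(encoded.columns.tolist())
```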

![](../assets/images/workflow/data-science-workflow-06.png)

## Part 4. Build a Model

Create a cross-validation split, select and build a model, evaluate the model, and refine the model.

### Create cross validation sets

In [None]:
# Set target variable name
target = 'Survived'

# Set X and y
X = df.drop([target], axis=1)
y = df[target]

In [None]:
# Create separate training and test sets with 60/40 train/test split
X_train, X_test, y_train, y_test = *** FILL IN ***
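
A 60/40 hold-out split can be sketched on synthetic arrays as shown below. This assumes a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection`. The `random_state` and `stratify` arguments are optional choices made here for reproducibility and class balance; the project does not require them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 features, balanced binary target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.4 keeps 60% of the rows for training and 40% for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```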

### Build a model

In [None]:
# Instantiate logistic regression classifier using default params
lm = *** FILL IN ***

# Train logistic regression classifier on training set
*** FILL IN ***

### Evaluate the model

In [None]:
# Check model accuracy on test set
print("Accuracy: %0.3f" % lm.score(*** FILL IN ***))

In [None]:
# Get confusion matrix on test set
y_pred = lm.predict(X_test)
cm = metrics.confusion_matrix(*** FILL IN ***)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

ax = plt.axes()
sns.heatmap(cm_normalized, annot=True)
ax.set_ylabel('True')
ax.set_xlabel('Pred')
plt.show()

print("Confusion Matrix:")
print(cm)
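
To make the row normalization in the cell above concrete, here is a small worked example on made-up labels (the `y_true` and `y_hat` arrays are hypothetical). Each row of the confusion matrix corresponds to a true class, so dividing by the row sums turns counts into per-class rates.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_hat)

# Divide each row by its total so every row sums to 1
cm_norm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]

print(cm)
print(cm_norm)
```

In the normalized matrix, the diagonal entries are the fraction of each true class that was predicted correctly.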

In [None]:
# Plot ROC curve and get AUC score
y_pred_proba = lm.predict_proba(X_test)[:,1]

# Determine the false positive and true positive rates
fpr, tpr, t = metrics.roc_curve(y_test, y_pred_proba)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()

# Get ROC AUC score
print('ROC AUC: %0.3f' % metrics.roc_auc_score(y_test, y_pred_proba))
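
As a sanity check on what the AUC measures, the sketch below computes it two ways on four made-up scores: directly with `roc_auc_score`, and as the area under the points returned by `roc_curve`. The labels and scores are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# ROC points at each decision threshold
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)

# Area under those points, and the same quantity computed directly
auc_from_curve = metrics.auc(fpr, tpr)
auc_direct = metrics.roc_auc_score(y_true, scores)

print("AUC from curve: %0.3f" % auc_from_curve)
print("AUC direct:     %0.3f" % auc_direct)
```

Three of the four positive/negative pairs are ranked correctly here (only 0.35 versus 0.4 is not), which is why both values come out to 0.75.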

In [None]:
# What other metrics can you calculate?
*** FILL IN ***

### Tune the model

In [None]:
# Set list of values to grid search over
c = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
s = ['newton-cg', 'lbfgs', 'liblinear', 'sag']
params = {'C': c, 'solver':s}

# Perform grid search using list of values
gs = *** FILL IN ***
gs.fit(X_train, y_train)

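# Get best value to use
print("Best Params:")
print(gs.*** FILL IN ***)

# Get improvement (note: best_score_ is the mean cross-validated score on the
# training folds, while lm.score below is measured on the held-out test set)
print("Accuracy of current model (test set): %0.3f" % lm.score(X_test, y_test))
print("Mean CV score using best params: %0.3f" % gs.best_score_)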
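
A grid search over the regularization strength can be sketched on synthetic data as follows. This assumes a recent scikit-learn, where `GridSearchCV` lives in `sklearn.model_selection`. The data from `make_classification`, the reduced grid, and the choices `cv=5` and `scoring="accuracy"` are illustrative assumptions, not settings prescribed by the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data as a stand-in for the real features
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate values to try for C (inverse regularization strength)
param_grid = {"C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validated search, scored by accuracy
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print("Mean CV accuracy: %0.3f" % search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all of the data passed to `fit` using the best parameter combination.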

### Update parameters

In [None]:
# Current model params
print(lm)
print("Accuracy of current model: %0.3f" % lm.score(X_test, y_test))

# Update model params
lm.set_params(C=*** FILL IN ***)
lm.set_params(solver=*** FILL IN ***)

# Retrain model on new params
lm.fit(X_train, y_train)

# Updated model params
print(lm)
print("Accuracy of updated model: %0.3f" % lm.score(X_test, y_test))

![](../assets/images/workflow/data-science-workflow-07.png)

## Part 5: Present the Results

Generate a summary of findings and a Kaggle submission file.

**NOTE**: For the purposes of generating summary narratives and the Kaggle submission, we can train the model on the entire training data provided in _train.csv_.

### Load Kaggle training data and use entire data to train tuned model

In [None]:
# Load data
df_train = pd.read_csv("../assets/datasets/titanic/train.csv")

# Wrangle data
df_train = wrangler(df_train)

# Pre-process data
df_train = pre_proc(df_train)

In [None]:
# Set target variable name
target = 'Survived'

# Set X_train and y_train
X_train = df_train.drop([target], axis=1)
y_train = df_train[target]

In [None]:
# Build tuned model (parameters not listed keep their scikit-learn defaults)
lm = linear_model.LogisticRegression(C=0.1, solver='newton-cg')

# Train tuned model
lm.fit(X_train, y_train)

# Score tuned model (accuracy on the training data, so an optimistic estimate)
print("Accuracy: %0.3f" % lm.score(X_train, y_train))

### Use trained model to generate a few summary findings

In [None]:
# Generate probability of survival using trained model
df_train['Probability'] = lm.predict_proba(X_train)[:,1]

In [None]:
# Plot Probability of Survival Based on Sex and Age
ax = df_train[df_train['Sex_male']==1].plot(x='Age', y='Probability', kind='scatter', color='b', label='Male')
df_train[df_train['Sex_female']==1].plot(ax=ax, x='Age', y='Probability', kind='scatter', color='m', label='Female')
ax.set(title='Probability of Survival\n Based on Sex and Age')

In [None]:
# Plot Probability of Survival Based on Pclass and Age
sns.lmplot(x="Age", y="Probability", hue="Pclass", data=df_train)

In [None]:
# What other summary visualizations can you make?
*** FILL IN ***

**Summary of findings**: `*** FILL IN ***`

### Load Kaggle test data, make predictions using model, and generate submission file

In [None]:
# Load data
df_test = pd.read_csv("../assets/datasets/titanic/test.csv")

# Create DF for submission (copy so adding a column later does not trigger SettingWithCopyWarning)
df_sub = df_test[['PassengerId']].copy()

# Wrangle data
df_test = wrangler(df_test)

# Pre-process data
df_test = pre_proc(df_test)

# Check data
df_sub.head()

# Predict using tuned model
df_sub['Survived'] = lm.predict(df_test)

# Write submission file
df_sub.to_csv("../assets/datasets/titanic/mysubmission.csv", index=False)

print("Check ../assets/datasets/titanic/ for submission file")
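
Before uploading, it can help to confirm that the submission has the two-column layout written by the cell above (`PassengerId` and `Survived`, with no index column). The sketch below checks that layout on a hypothetical three-row frame, writing to an in-memory buffer instead of a file; the ids and predictions are made up.

```python
import io
import pandas as pd

# Hypothetical submission rows, not real predictions
sub = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# Write to an in-memory buffer; index=False keeps the row index out of the file
buffer = io.StringIO()
sub.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines[0])
print(len(csv_lines) - 1, "data rows")
```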

**Kaggle score** _(if submitted to Kaggle)_: `*** FILL IN ***`

**HINT**: Try transforming or combining features and creating additional features to improve your score. This is a popular introductory dataset, so search online for additional feature engineering hints!
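
As one possible illustration of combining features, the sketch below derives a family-size feature from the `SibSp` (siblings/spouses aboard) and `Parch` (parents/children aboard) columns of the Titanic data. The `toy` frame is made up, and the names `FamilySize` and `IsAlone` are hypothetical choices; whether these features improve your score is something to test.

```python
import pandas as pd

# Hypothetical toy rows using two real Titanic column names
toy = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
})

# Combine two features into one: relatives aboard plus the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Transform the combined feature into a binary indicator
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```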