# Logistic Regression Lab

In the previous lab we constructed a processing pipeline using `sklearn` for the Titanic dataset. At this point you should have a set of features ready for consumption by a logistic regression model.

In this lab we will combine the pre-processing pipeline you created with a classification model.


We have imported the Titanic data into our PostgreSQL instance, which you can connect to as follows:

    psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student titanic
    password: gastudents

First of all let's load a few things:

- standard packages
- the training set from lab 2.3
- the union we have saved in lab 2.3

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sqlalchemy import create_engine
engine = create_engine('postgresql://dsi_student:gastudents@dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com/titanic')

df = pd.read_sql('SELECT * FROM train', engine)

In [2]:
import gzip
import dill

with gzip.open('../../../2.3-lab/assets/datasets/union.dill.gz') as fin:
    union = dill.load(fin)

Then, let's create the training and test sets:

In [3]:
X = df[[u'Pclass', u'Sex', u'Age', u'SibSp', u'Parch', u'Fare', u'Embarked']]
y = df['Survived']

In [4]:
# train_test_split and cross_val_score live in sklearn.model_selection since scikit-learn 0.18
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## 1. Model Pipeline

Combine the union you created in the previous lab with a `LogisticRegression` instance. Note that a `sklearn.pipeline.Pipeline` can have an arbitrary number of transformation steps, but only one (optional) estimator step, which must come last in the chain.
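The structure described above can be sketched on toy data. This is a minimal illustration, not the lab's pipeline: `X_toy`, `y_toy` and `toy_pipe` are hypothetical names, and a `StandardScaler` stands in for the union from the previous lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, label is 1 when the feature is large.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# One transformation step followed by a single estimator as the last step.
toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())
toy_pipe.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in toy_pipe.steps]
print(step_names)

# predict() runs the transformers first, then the final estimator.
toy_pred = toy_pipe.predict(X_toy)
print(toy_pred)
```

The automatically generated step names matter later on, because they are the prefixes used to address parameters during tuning.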

In [5]:
from sklearn import pipeline, linear_model, metrics

whatever_pipe = pipeline.make_pipeline(union, linear_model.LogisticRegression())

## 2. Train the model
Use `X_train` and `y_train` to fit the model.
Use `X_test` to generate predicted values for the target variable and save those in a new variable called `y_pred`.

In [6]:
whatever_pipe.fit(X=X_train, y=y_train)

Pipeline(steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('age_pipe', Pipeline(steps=[('columnselector', ColumnSelector(columns='Age')), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=Tr...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [7]:
y_pred = whatever_pipe.predict(X=X_test)

## 3. Evaluate the model accuracy

1. Use the `confusion_matrix` and `classification_report` functions to assess the quality of the model.
2. Embed the results of `confusion_matrix` in a Pandas DataFrame with appropriate column names and index, so that it is easier to see what kind of errors the model makes.
3. Are there more false positives or false negatives? (Remember that we are trying to predict survival.)
4. How does that relate to what the `classification_report` shows?

In [8]:
print(metrics.classification_report(y_true=y_test, y_pred=y_pred))

             precision    recall  f1-score   support

          0       0.81      0.88      0.84       175
          1       0.80      0.69      0.74       120

avg / total       0.80      0.80      0.80       295



In [9]:
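# confusion_matrix orders rows and columns by sorted label, so class 0 (did not survive) comes first
pd.DataFrame(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred), columns=["Pred False", "Pred True"], index=["Actual False", "Actual True"])

| | Pred False | Pred True |
|---|---|---|
| Actual False | 154 | 21 |
| Actual True | 37 | 83 |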
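To relate the confusion matrix to the classification report, the class 1 (survived) metrics can be recomputed by hand from the four counts. The sketch below assumes scikit-learn's convention of ordering rows and columns by sorted label, so the row containing 154 and 21 is class 0 (its support in the report is 175). The variable names are illustrative.

```python
# Counts copied from the confusion matrix output.
tn, fp = 154, 21   # actual 0: predicted 0, predicted 1
fn, tp = 37, 83    # actual 1: predicted 0, predicted 1

precision = tp / float(tp + fp)   # of predicted survivors, how many survived
recall = tp / float(tp + fn)      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / float(tp + tn + fp + fn)

print("precision=%.2f recall=%.2f f1=%.2f accuracy=%.2f"
      % (precision, recall, f1, accuracy))
# precision=0.80 recall=0.69 f1=0.74 accuracy=0.80
```

These values match the row for class 1 in the report. There are more false negatives (37) than false positives (21), which is why recall for survivors is lower than precision.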


## 4. Improving the model

Can we improve the accuracy of the model?

One way to do this is to tune the parameters that control it.

You can get a list of all the model parameters using `model.get_params().keys()`.

Discuss with your team which parameters you could try to change.
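As a rough guide to reading the parameter names, nested parameters follow the pattern `<step name>__<parameter name>`. The sketch below uses a hypothetical two-step pipeline (`demo_pipe`) rather than the lab's pipeline, so the exact keys will differ from yours.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the lab pipeline.
demo_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Keep only the keys that belong to the logistic regression step.
lr_params = sorted(
    key for key in demo_pipe.get_params()
    if key.startswith("logisticregression__")
)
print(lr_params)

# The same keys can be passed to set_params to change a nested parameter.
demo_pipe.set_params(logisticregression__C=0.5, standardscaler__with_mean=False)
print(demo_pipe.named_steps["logisticregression"].C)
```

These are the same keys a grid search expects in its parameter grid.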

In [10]:
whatever_pipe.get_params().keys()

['featureunion__gender_pipe__truefalsetransformer__flag',
 'featureunion__fare_pipe__standardscaler__with_mean',
 'featureunion__transformer_weights',
 'logisticregression__random_state',
 'featureunion',
 'logisticregression__max_iter',
 'featureunion__fare_pipe',
 'featureunion__transformer_list',
 'featureunion__age_pipe__imputer',
 'featureunion__age_pipe__columnselector__columns',
 'logisticregression__multi_class',
 'logisticregression__verbose',
 'featureunion__age_pipe__standardscaler',
 'logisticregression',
 'featureunion__fare_pipe__standardscaler__copy',
 'featureunion__age_pipe__standardscaler__with_std',
 'logisticregression__solver',
 'featureunion__age_pipe__standardscaler__with_mean',
 'featureunion__fare_pipe__standardscaler__with_std',
 'featureunion__age_pipe__columnselector',
 'logisticregression__dual',
 'steps',
 'logisticregression__intercept_scaling',
 'featureunion__age_pipe__imputer__axis',
 'logisticregression__n_jobs',
 'featureunion__age_pipe__imputer__str

You can systematically probe parameter combinations by using `GridSearchCV`. Implement a new classifier that searches for the best parameter combination.

1. How will you choose the grid granularity?
1. How can you prevent the grid from growing exponentially?
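One way to reason about grid growth is to count the candidates before fitting anything. The sketch below uses the grid that this lab tries in the next cell. The names `example_grid` and `cv_folds`, and the extra `class_weight` entry, are illustrative assumptions.

```python
from itertools import product

example_grid = {
    "featureunion__age_pipe__standardscaler__with_mean": [True, False],
    "logisticregression__C": [0.01, 0.1, 0.5, 1.0, 10.0],
    "logisticregression__penalty": ["l1", "l2"],
}
cv_folds = 10

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = len(list(product(*example_grid.values())))
n_fits = n_candidates * cv_folds
print("%d candidates, %d fits" % (n_candidates, n_fits))
# 20 candidates, 200 fits

# Adding one more parameter with 3 values multiplies the work by 3.
bigger_grid = dict(example_grid)
bigger_grid["logisticregression__class_weight"] = [None, "balanced", {0: 1, 1: 2}]
n_bigger = len(list(product(*bigger_grid.values())))
print("%d candidates" % n_bigger)
# 60 candidates
```

Because the count is a product of the list lengths, keeping each list short and tuning only the parameters you expect to matter is usually the simplest way to keep the search tractable.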

In [33]:
# GridSearchCV lives in sklearn.model_selection since scikit-learn 0.18
from sklearn import model_selection

# Keys follow the <step name>__<parameter name> pattern reported by get_params()
param_grid = {
    "featureunion__age_pipe__standardscaler__with_mean": [True, False],
    "logisticregression__C": [0.01, 0.1, 0.5, 1.0, 10.0],
    "logisticregression__penalty": ['l1','l2']
}

gscv = model_selection.GridSearchCV(whatever_pipe, param_grid=param_grid, n_jobs=1, cv=10, verbose=1, scoring='f1')

In [34]:
gscv.fit(X, y)

Fitting 10 folds for each of 20 candidates, totalling 200 fits

[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    1.8s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    6.8s
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    6.9s finished


GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('age_pipe', Pipeline(steps=[('columnselector', ColumnSelector(columns='Age')), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=Tr...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'featureunion__age_pipe__standardscaler__with_mean': [True, False], 'logisticregression__penalty': ['l1', 'l2'], 'logisticregression__C': [0.01, 0.1, 0.5, 1.0, 10.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring='f1', verbose=1)

In [35]:
gscv.best_params_

{'featureunion__age_pipe__standardscaler__with_mean': True,
 'logisticregression__C': 0.5,
 'logisticregression__penalty': 'l2'}

In [36]:
gscv.best_score_

0.72266751156565745

## 5. Assess the tuned model

A fitted grid search stores the best parameter combination and the best estimator in its `best_params_` and `best_estimator_` attributes.

1. Use these to generate a new prediction vector `y_pred`.
2. Use `confusion_matrix` and `classification_report` to assess the accuracy of the new model.
3. How does the new model compare with the old one?
4. What else could you do to improve the accuracy?
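The attributes mentioned above can be explored on synthetic data first. This is a minimal sketch assuming a scikit-learn version that provides `sklearn.model_selection`. The data and the names `demo_search`, `Xd_train` and so on are hypothetical stand-ins, not the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

demo_search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression()),
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
# Searching on the training split only keeps the test split unseen.
demo_search.fit(Xd_train, yd_train)

print(demo_search.best_params_)
print(demo_search.best_score_)

# With the default refit=True, best_estimator_ has already been refit
# on all the data passed to fit(), so it can predict directly.
best_model = demo_search.best_estimator_
demo_pred = best_model.predict(Xd_test)
print(demo_pred.shape)
```

Running the search on the training split alone is a design choice: it means the held-out score is not influenced by the parameter selection.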

In [37]:
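# Take the pipeline configured with the best parameters found by the grid search
lm = gscv.best_estimator_

# Refit on the training split only, so the final model is not trained on X_test
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)

print(metrics.classification_report(y_true=y_test, y_pred=y_pred))
# Class 0 (did not survive) comes first in the confusion matrix
pd.DataFrame(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred), columns=["Pred False", "Pred True"], index=["Actual False", "Actual True"])

             precision    recall  f1-score   support

          0       0.81      0.88      0.84       175
          1       0.80      0.69      0.74       120

avg / total       0.80      0.80      0.80       295



| | Pred False | Pred True |
|---|---|---|
| Actual False | 154 | 21 |
| Actual True | 37 | 83 |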


## Bonus

What would happen if we used a different scoring function? Would our results change?
Choose one or two classification metrics from the [sklearn provided metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) and repeat the grid search. Do your results change?
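One possible way to set up the comparison is to loop over several `scoring` strings and record the winner for each. The sketch below runs on synthetic, mildly imbalanced data. The names `X_bonus`, `y_bonus` and `best_by_metric` are hypothetical, and the winning `C` on the Titanic data may well differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with roughly 70% negatives.
X_bonus, y_bonus = make_classification(
    n_samples=300, n_features=6, weights=[0.7, 0.3], random_state=1)

best_by_metric = {}
for metric in ["f1", "accuracy", "recall", "roc_auc"]:
    search = GridSearchCV(
        LogisticRegression(),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=5,
        scoring=metric,
    )
    search.fit(X_bonus, y_bonus)
    # Record which C won under this metric and its mean cross-validated score.
    best_by_metric[metric] = (search.best_params_["C"], search.best_score_)

for metric, (best_c, score) in sorted(best_by_metric.items()):
    print("%-8s best C=%-5s mean CV score=%.3f" % (metric, best_c, score))
```

Scores from different metrics are not directly comparable with each other. The point of the comparison is whether the selected parameters change.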