### Connect Intensive - Machine Learning Nanodegree
# Lesson 03: Building and evaluating models with `sklearn`
# Part 02: Building Predictive Models

## Objectives
  - Experiment with building predictive models using the [`sklearn` library](http://scikit-learn.org/stable/).
  - Learn about evaluation metrics for classification (`sklearn.metrics`).
  - Confusion matrix, precision and recall.
  - Save cleaned datasets (so that we don't lose our hard preprocessing work!).
  
## Prerequisites
  - You should have the following python packages installed:
    - [matplotlib](http://matplotlib.org/index.html) (not a prerequisite for this part)
    - [numpy](http://www.scipy.org/scipylib/download.html)
    - [pandas](http://pandas.pydata.org/getpandas.html)
    - [sklearn](http://scikit-learn.org/stable/install.html)
  - If you're rusty on exploratory data analysis using `pandas`, you may want to check out lessons 01 and 02 in the [ConnectIntensive repo](https://github.com/nickypie/ConnectIntensive)


## Acknowledgements
  - This lesson is adapted from an introductory tutorial on pandas and scikit-learn from the Kaggle Titanic Competition website
  (https://www.kaggle.com/c/titanic/forums/t/10125/pandas-and-scikit-learn-introduction-via-kaggle-titanic-competition).
  - It also draws on material from part 1 of this lesson 
  

## Getting Started
As usual, we start by importing some useful libraries and modules. Don't worry if you get a warning message when importing `matplotlib` -- it just needs to build the font cache, and the warning is just to alert you that this may take a while the first time the cell is run.

**Run** the cell below to import useful libraries for this notebook.

In [None]:
%matplotlib inline
try:
    import matplotlib
    import matplotlib.pyplot as plt
    plt.style.use('ggplot')
    print("Successfully imported matplotlib.pyplot! (Version {})".format(matplotlib.__version__))
except ImportError:
    print("Could not import matplotlib.pyplot!")
    
try:
    import numpy as np
    print("Successfully imported numpy! (Version {})".format(np.version.version))
except ImportError:
    print("Could not import numpy!")
    
try:
    import pandas as pd
    print("Successfully imported pandas! (Version {})".format(pd.__version__))
    pd.options.display.max_rows = 10
except ImportError:
    print("Could not import pandas!")

try:
    from IPython.display import display
    print("Successfully imported display from IPython.display!")
except ImportError:
    print("Could not import display from IPython.display")
    
try:
    import sklearn
    print("Successfully imported sklearn! (Version {})".format(sklearn.__version__))
    # Minor version number, e.g. 18 for "0.18.1" (also works for versions such as "1.2.0")
    skversion = int(sklearn.__version__.split('.')[1])
except ImportError:
    print("Could not import sklearn!")

## Reload the pre-processed data
We need to reload the data we saved in the previous session.

**Run** the cell below (**click** on the cell to highlight it, then press **shift + enter** or **shift + return** to run it) to read the cleaned training data into a `pandas` `DataFrame` object.

In [None]:
train_df = pd.read_csv("lesson-03-data/titanic_train_cleaned.csv")
print("Pre-processed Titanic data sets loaded!")

In [None]:
# Just to be sure we grabbed the right data, let's print a summary
train_df.head(3)

In [None]:
# Looks like we inadvertently added an index column - let's take it out
if u'Unnamed: 0' in train_df.columns:
    print('Dropping "unnamed" column from train_df')
    train_df = train_df.drop(u'Unnamed: 0', axis=1)
train_df.head(3)
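
The stray `Unnamed: 0` column typically appears when a `DataFrame` is written with `to_csv` using its default `index=True` and then read back. The sketch below illustrates this on a small made-up frame (not the Titanic data), assuming Python 3 and using an in-memory buffer instead of a file; the `toy_`-style names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for a cleaned data set
toy_df = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]})

# Default behaviour: the row index is written as an unnamed first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the row index out of the file
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

Saving with `index=False` avoids the extra column in the first place; dropping it after loading, as in the cell above, works equally well for files that were already written.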

## Making some basic predictions

Recall that the key feature we will attempt to predict is the `'Survived'` feature, which is equal to 0 or 1 for a passenger who died or survived, respectively, from the Titanic sinking. 

We'll try several sets of predictions and calculate some metrics to evaluate our 'model'.

A commonly used metric for classification is `accuracy_score`, which is simply the proportion of correct predictions. If a model predicts m of n samples correctly, then the accuracy score is m / n.

Accuracy treats every wrong prediction the same way: it does not distinguish false positives from false negatives. In some situations we care about which kind of mistake the model makes; the F1 score combines precision and recall (both defined later in this notebook) for a chosen positive class into a single number.
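
To make the two metrics concrete, here is a minimal sketch on eight made-up labels (not the Titanic data). The `toy_` names are hypothetical and chosen so they do not clash with variables used elsewhere in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels: 1 = survived, 0 = died
toy_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
toy_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# Accuracy is the fraction of positions where the prediction matches the truth
toy_manual_accuracy = np.mean(toy_true == toy_pred)
toy_accuracy = accuracy_score(toy_true, toy_pred)

# F1 is computed for one "positive" class (pos_label=1 by default)
toy_f1_survivors = f1_score(toy_true, toy_pred)
toy_f1_non_survivors = f1_score(toy_true, toy_pred, pos_label=0)

print("Accuracy: {:.2f} (manual: {:.2f})".format(toy_accuracy, toy_manual_accuracy))
print("F1 (survivors): {:.2f}".format(toy_f1_survivors))
print("F1 (non-survivors): {:.2f}".format(toy_f1_non_survivors))
```

Six of the eight predictions are correct, so the accuracy is 0.75, while the F1 score depends on which class is treated as positive.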


In [None]:
from sklearn.metrics import accuracy_score, f1_score

no_survivors = np.array([0]*891)

# sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
print("Accuracy score: {:.2f}".format(accuracy_score(train_df['Survived'], no_survivors)))

print("Number perished: {}".format(sum(train_df['Survived'] == 0)))
print("Manual accuracy: {:.2f}".format(549.0/891))

# f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
print("F1 score (survivors): {:.2f}".format(f1_score(train_df['Survived'], no_survivors)))
print("F1 score (non-survivors): {:.2f}".format(f1_score(train_df['Survived'], no_survivors, pos_label=0)))


Here is another "model" -- we predict everyone survived

In [None]:
all_survivors = np.array([1]*891)
print("Accuracy score: {:.2f}".format(accuracy_score(train_df['Survived'], all_survivors)))

print("Number survived: {}".format(sum(train_df['Survived'] == 1)))

print("Manual accuracy: {:.2f}".format(sum(train_df['Survived'] == 1)/891.0))
print("F1 score (survivors): {:.2f}".format(f1_score(train_df['Survived'], all_survivors)))
print("F1 score (non-survivors): {:.2f}".format(f1_score(train_df['Survived'], all_survivors, pos_label=0)))


## Question 1
Why are the two F1 scores different? Is one of them _correct_? If so, which one?

## Question 2
Construct a model that predicts all females survived. What is the accuracy_score and F1 score for this model?

## Question 3

Try some other model, e.g., all females and all males travelling first class survived. Calculate the accuracy score.


## Confusion Matrix

A confusion matrix for binary classes is often used to provide a compact summary of correct and incorrect predictions. The ground truth is listed down the side and the predicted values are listed along the top. The value in each cell of the grid is the count of cases for which both the ground truth and the predicted value hold.

 Total Pop | Predicted cond is negative | Predicted cond is positive 
 -------- | -------- | -------- 
 Ground cond is False |  True Negative (TN) | False Positive (FP) 
 Ground cond is True | False Negative (FN) | True Positive (TP)
 
 Some commonly used terms:
 - Precision = True Positives / All Predicted Positives = TP / (TP + FP)
 - Recall = True Positives / All Actual Positives = TP / (TP + FN)
 - Accuracy = (True Positives + True Negatives) / Total Population = (TP + TN) / (TP + TN + FP + FN)
 
 There are many other terms and ratios described in the lecture videos 
 (also this Wiki article https://en.wikipedia.org/wiki/Precision_and_recall )
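
As a rough illustration of what each cell of the table counts, the sketch below tallies the four cells by hand with boolean masks on made-up labels (not the Titanic data) and compares them with `confusion_matrix`. All `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 1 = positive, 0 = negative
toy_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
toy_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# Each cell counts the cases where a ground-truth value and a predicted value coincide
toy_tn = int(np.sum((toy_true == 0) & (toy_pred == 0)))
toy_fp = int(np.sum((toy_true == 0) & (toy_pred == 1)))
toy_fn = int(np.sum((toy_true == 1) & (toy_pred == 0)))
toy_tp = int(np.sum((toy_true == 1) & (toy_pred == 1)))

# sklearn lays the matrix out as [[TN, FP], [FN, TP]] for labels 0 and 1
toy_cm = confusion_matrix(toy_true, toy_pred)

toy_precision = toy_tp / float(toy_tp + toy_fp)
toy_recall = toy_tp / float(toy_tp + toy_fn)
toy_acc = (toy_tp + toy_tn) / float(toy_tp + toy_tn + toy_fp + toy_fn)

print(toy_cm)
print("Precision = {:.2f}, Recall = {:.2f}, Accuracy = {:.2f}".format(
    toy_precision, toy_recall, toy_acc))
```

The row and column order of `toy_cm` matches the table above: rows are the ground truth, columns are the predictions, with the negative class first.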

In [None]:
# Demo code for calculating a confusion matrix for the case where everyone survived

from sklearn.metrics import confusion_matrix
# confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)

cc = confusion_matrix(train_df['Survived'], all_survivors)
print("Confusion matrix:\n{}".format(cc))
precision = cc[1,1]*1.0/sum(cc[:,1])
recall = cc[1,1]*1.0/sum(cc[1,:])
print("Precision = {:.2f}".format(precision))
print("Recall = {:.2f}".format(recall))
print("Manual F1-score = {:.2f}".format(2.0*precision*recall/(precision+recall)))

## Question 4

What are the precision and recall for your answer to Question 2? The F1 score is defined as $ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $

## Question 5

Some other commonly used measures that can be calculated from the confusion matrix are:
- Specificity = TN / (TN + FP)
- Positive Likelihood Ratio (LR+) = (TP/(TP+FN)) / (FP/(FP+TN)) 
- Negative Likelihood Ratio (LR-) = ( FN/(FN+TP) ) / (TN/(TN+FP))
- Odds Ratio = LR+/LR-

Calculate the Specificity and LR+ for the "all survived" model
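
The measures above can be sketched as a small helper that takes the four confusion-matrix counts. The function name `diagnostic_ratios` and the toy counts are hypothetical (they are not the "all survived" counts), and specificity uses the standard definition TN / (TN + FP).

```python
def diagnostic_ratios(tn, fp, fn, tp):
    """Hypothetical helper: derive ratios from the four confusion-matrix counts.

    Raises ZeroDivisionError if a denominator is zero, which can happen
    for degenerate models that only ever predict one class.
    """
    sensitivity = tp / float(tp + fn)        # same as recall
    specificity = tn / float(tn + fp)
    lr_plus = sensitivity / (fp / float(fp + tn))
    lr_minus = (fn / float(fn + tp)) / specificity
    odds_ratio = lr_plus / lr_minus
    return specificity, lr_plus, lr_minus, odds_ratio


# Made-up counts for illustration only
toy_spec, toy_lr_plus, toy_lr_minus, toy_odds = diagnostic_ratios(tn=4, fp=1, fn=1, tp=2)
print("Specificity = {:.2f}".format(toy_spec))
print("LR+ = {:.2f}, LR- = {:.2f}, Odds ratio = {:.2f}".format(
    toy_lr_plus, toy_lr_minus, toy_odds))
```

For a model that predicts a single class for everyone, some of these ratios have a zero denominator, so it is worth checking the counts before dividing.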

## Model Fitting

There is a relatively simple recipe you can use for most of the modelling algorithms in sklearn

### 1. Select the set of features in your model

In [None]:
# Put the label ('Survived') first, followed by the features we want to use
df = train_df[['Survived','Pclass','Sex','Age','SibSp','Parch']]
# Convert to a NumPy array: column 0 holds the labels, columns 1 onward hold the features
train_data = df.values

### 2. Select the appropriate classifier or regressor you want to use, then import it from `sklearn`

In [None]:
from sklearn.tree import DecisionTreeClassifier

### 3. Create an instance of the classifier or regressor, with appropriate model parameters

A learning algorithm adjusts the weights and other parameters during the fitting process. However, many algorithms also define and use some additional parameters that are in some sense part of the definition of the model we are trying to fit. These are sometimes called hyper-parameters. These need to be specified when the instance of the classifier or regressor is created. Most learners have defaults built in but it is important to know what those defaults are.

We will learn about the subtleties of many different algorithms in the coming weeks. 

In [None]:
model = DecisionTreeClassifier(max_depth=5)
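
One way to check which defaults a learner uses is the `get_params()` method that `sklearn` estimators provide. The sketch below compares a default tree with one that sets `max_depth`; the variable names are hypothetical and chosen so they do not overwrite `model`.

```python
from sklearn.tree import DecisionTreeClassifier

# A tree with all default hyper-parameters, and one with max_depth set explicitly
default_tree = DecisionTreeClassifier()
shallow_tree = DecisionTreeClassifier(max_depth=5)

# get_params() returns a dict of hyper-parameter names and their current values
default_params = default_tree.get_params()
shallow_params = shallow_tree.get_params()

print("Default max_depth: {}".format(default_params['max_depth']))
print("Shallow max_depth: {}".format(shallow_params['max_depth']))
print("Default criterion: {}".format(default_params['criterion']))
```

A default `max_depth` of `None` means the tree keeps splitting until other stopping rules apply, which is one reason to know the defaults before fitting.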

### 4. Fit the model to the training data
The `fit` method generally takes two arguments: the training features and, for a supervised learner such as a classifier, the corresponding labels.

In [None]:
model = model.fit(train_data[0:,1:],train_data[0:,0])

### 5. Predict the labels of the training data (or test data)

Normally we would split the available data into a test and a training set and predict the labels of the test data set. We do have a test set, so can we use that here as we normally would? 

In [None]:
p=model.predict(train_data[0:,1:])

### 6. Evaluate the performance of the model (i.e., score the predicted results)

In [None]:
# accuracy_score expects (y_true, y_pred): true labels first, predictions second
accuracy_score(train_data[0:,0], p)
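
The six steps above can be sketched end to end with a held-out split. This is only an illustration on synthetic data from `make_classification` (not the Titanic set), and it assumes `sklearn` 0.18 or later, where `train_test_split` lives in `sklearn.model_selection`. All `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split  # assumes sklearn >= 0.18
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 500 samples, 5 numeric features, binary labels
demo_X, demo_y = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out 25% of the samples; the model never sees them during fitting
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0)

# Create and fit a fully grown tree (no max_depth limit) on the training part only
demo_model = DecisionTreeClassifier(random_state=0)
demo_model.fit(demo_X_train, demo_y_train)

# Score on both parts: true labels first, predictions second
demo_train_acc = accuracy_score(demo_y_train, demo_model.predict(demo_X_train))
demo_test_acc = accuracy_score(demo_y_test, demo_model.predict(demo_X_test))

print("Training accuracy: {:.2f}".format(demo_train_acc))
print("Held-out accuracy: {:.2f}".format(demo_test_acc))
```

Comparing the two printed scores gives a sense of how much a score computed on the training data can overstate performance on unseen data.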

## Question 6

Pick your own set of features and classifier and calculate the accuracy score of your predictions. Note that in this exercise we are doing this on the training set only and we can potentially get really good accuracy scores for the training set (why?).



#####  That's all for today !!