### Connect Intensive - Machine Learning Nanodegree
# Lesson 03: Classification Metrics and Cross-Validation

## Objectives
  - Experiment with building predictive models using the [`sklearn` library](http://scikit-learn.org/stable/).
  - Learn about evaluation metrics for classification (`sklearn.metrics`)
  - Confusion matrix, precision and recall
  - Review model fitting, prediction and scoring
  - Cross-validation and hyper-parameter tuning
  
## Prerequisites
  - You should have the following python packages installed:
    - [matplotlib](http://matplotlib.org/index.html) (not a prerequisite for this part)
    - [numpy](http://www.scipy.org/scipylib/download.html)
    - [pandas](http://pandas.pydata.org/getpandas.html)
    - [sklearn](http://scikit-learn.org/stable/install.html)

  

## Getting Started
As usual, we start by importing some useful libraries and modules. Don't worry if you get a warning message when importing `matplotlib` -- it just needs to build the font cache, and the warning is just to alert you that this may take a while the first time the cell is run.

**Run** the cell below to import useful libraries for this notebook.

In [1]:
%matplotlib inline
try:
    import matplotlib
    import matplotlib.pyplot as plt
    plt.style.use('ggplot')
    print("Successfully imported matplotlib.pyplot! (Version {})".format(matplotlib.__version__))
except ImportError:
    print("Could not import matplotlib.pyplot!")
    
try:
    import numpy as np
    print("Successfully imported numpy! (Version {})".format(np.version.version))
except ImportError:
    print("Could not import numpy!")
    
try:
    import pandas as pd
    print("Successfully imported pandas! (Version {})".format(pd.__version__))
    pd.options.display.max_rows = 10
except ImportError:
    print("Could not import pandas!")

try:
    from IPython.display import display
    print("Successfully imported display from IPython.display!")
except ImportError:
    print("Could not import display from IPython.display")
    
try:
    import sklearn
    print("Successfully imported sklearn! (Version {})".format(sklearn.__version__))
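    # Encode major.minor as one integer (0.18 -> 18, 1.0 -> 100) so the check below also works for releases >= 1.0
    skversion = int(sklearn.__version__.split('.')[0]) * 100 + int(sklearn.__version__.split('.')[1])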
except ImportError:
    print("Could not import sklearn!")

Successfully imported matplotlib.pyplot! (Version 2.0.2)
Successfully imported numpy! (Version 1.13.1)
Successfully imported pandas! (Version 0.20.3)
Successfully imported display from IPython.display!
Successfully imported sklearn! (Version 0.18.2)


## 1 | Classification Metrics

#### Loading the data from the Titanic project

#### *You will need to specify the proper path to the dataset file or copy the file into the same directory as this notebook*


**Run** the cell below (**click** on the cell to highlight it, then press **shift + enter** or **shift + return** to run it) to read the Titanic data into a `pandas` `DataFrame` object.

In [8]:
train_df = pd.read_csv("./titanic_data.csv")

print("Titanic data sets loaded!")

Titanic data sets loaded!


As we are pretty familiar with the Titanic dataset, I can just run through some quick pre-processing steps that should be familiar to you. If you like, you can split the cell below after any line and use appropriate print functions to see what each step is doing.

In [9]:
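# Run this cell ONCE ONLY: get_dummies replaces the "Sex" and "Embarked" columns, so they are gone on a second run.
# The first line fills missing ages with the mean age; the second one-hot encodes the two categorical columns.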
train_df.loc[train_df.Age.isnull(),'Age'] = train_df.Age.mean()
train_df = pd.get_dummies(train_df, columns=["Sex", "Embarked"], prefix=["sex", "port"])
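
To see what these two preprocessing steps do, here is a minimal sketch on a made-up three-row frame. The values are hypothetical and only the column names mirror the Titanic data.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Step 1: replace missing ages with the mean of the known ages (26.0 here).
toy.loc[toy.Age.isnull(), "Age"] = toy.Age.mean()

# Step 2: one-hot encode the categorical columns; the originals are dropped.
toy = pd.get_dummies(toy, columns=["Sex", "Embarked"], prefix=["sex", "port"])

print(list(toy.columns))
```

The resulting column names follow the `prefix_value` pattern (`sex_female`, `port_C`, and so on), which is where the names in the feature list used later come from.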

In [10]:
# Starting with scikit-learn version 0.18, the model_selection module replaces the cross_validation module,
# so we should import train_test_split from the appropriate module depending on the version number.
if skversion >= 18:
    from sklearn.model_selection import train_test_split
else:
    from sklearn.cross_validation import train_test_split

# A list of the desired feature names for model building
# We are skipping Passenger ID as that is too specific and is more a row label than a feature
# Same arguments lead us to exclude `Cabin` and `Ticket`
desired_features = ['Pclass', 'sex_female', 'sex_male', 'Age','SibSp','Parch', 'Fare', 'port_C', 'port_S', 'port_Q']

# X is our pandas DataFrame object with the features from which we will predict the 'Survived' feature.
X = pd.DataFrame(train_df[desired_features])

# y is our pandas Series object with the 'Survived' feature to be predicted.
y = pd.Series(train_df['Survived'])

# Split the data into training and validation (test) data sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=3)

# Verify that the data sets were split 70% training and 30% testing
print("The original data ({} instances) was split into training ({} instances) and testing ({} instances) data sets"
      .format(len(X), len(X_train), len(X_test)))

The original data (891 instances) was split into training (623 instances) and testing (268 instances) data sets
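
The instance counts above can be checked by hand. This is a small sketch that assumes `train_test_split` rounds the test count up and gives the remainder to training; that assumption reproduces the reported numbers.

```python
import math

n_samples = 891   # rows in the Titanic data, per the output above
test_size = 0.3   # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```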


## Making some basic predictions

Recall that the key feature we will attempt to predict is the `'Survived'` feature, which is equal to 0 or 1 for a passenger who died or survived, respectively, from the Titanic sinking. We have separated it out as the `y_train` and `y_test` variables.

We'll try several sets of predictions and calculate some metrics to evaluate our 'model'.

A commonly used metric for classification is `accuracy_score`, which is simply the proportion of correct predictions. If a model predicts m of n samples correctly, then the accuracy score will be m / n.

The accuracy score treats all wrong predictions alike: it does not distinguish false positives from false negatives. In some situations we care about which kind of mistake is made; the F1 score is a measure that combines precision and recall, and so accounts for both kinds of error.

Let's import the metrics from `sklearn`.

In [56]:
from sklearn.metrics import accuracy_score, f1_score, fbeta_score, precision_score, recall_score, confusion_matrix
from sklearn.metrics import classification_report

### Useful Formulas for Classification Scores

$$ Precision (p) = {True\space Positive \over (All\space Predicted\space Positive)} $$

$$ Precision (p) = {True\space Positive \over (True\space Positive + False\space Positive)} $$

$$ Recall (r) = {True\space Positive \over (True\space Positive + False\space Negative)} $$

$$ Accuracy = {True\space Positive + True \space Negative \over (All\space Samples)} $$



$$ F_1\space score = {2 \cdot p \cdot r \over (p + r)} $$

$$ F_{\beta}\space score = (1 + \beta^2) {p \cdot r \over (\beta^2 p + r)} $$
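
As a worked illustration of the formulas above, here is a short sketch that evaluates them from the four confusion-matrix counts. The function name `classification_scores` and the counts are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def classification_scores(tp, fp, fn, tn, beta=1.0):
    """Return precision, recall, accuracy and F-beta from the four counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, accuracy, f_beta

# Hypothetical counts, not taken from the Titanic data.
p, r, acc, f1 = classification_scores(tp=8, fp=2, fn=4, tn=6)
_, _, _, f2 = classification_scores(tp=8, fp=2, fn=4, tn=6, beta=2.0)

print("precision={:.4f} recall={:.4f} accuracy={:.4f}".format(p, r, acc))
print("F1={:.4f} F2={:.4f}".format(f1, f2))
```

With $\beta = 1$ the general formula reduces to the $F_1$ score. Values of $\beta$ greater than 1 weight recall more heavily, so here, where recall is lower than precision, $F_2$ comes out below $F_1$.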

## Confusion Matrix

A confusion matrix for binary classes is often used to provide a compact summary of correct and incorrect predictions. The ground truth is listed down the side and the predicted values are listed along the top. The value in each cell of the grid is the count of cases for which both that ground truth and that predicted value hold.

 Total population | Predicted condition is negative | Predicted condition is positive 
 -------- | -------- | -------- 
 Actual condition is negative | True Negative (TN) | False Positive (FP) 
 Actual condition is positive | False Negative (FN) | True Positive (TP)
 
 Some commonly used terms:
 - Precision = True Positive / All Predicted Positive = TP / (TP + FP)
 - Recall = True Positive / All Actual Positive = TP / (TP + FN)
 - Accuracy = (True Positive + True Negative) / Total Population = (TP + TN) / (TP + TN + FP + FN)
 
 There are many other terms and ratios described in the lecture videos 
 (also this Wiki article https://en.wikipedia.org/wiki/Precision_and_recall )
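
The layout described above (ground truth down the side, predictions along the top) can be sketched in a few lines of plain Python. The helper name `confusion_counts` and the label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1   # row = ground truth, column = prediction
    return matrix

# Hypothetical labels, not taken from the Titanic data.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]

(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
print(tn, fp)
print(fn, tp)
```

The four counts always sum to the number of samples, which is a quick sanity check when computing them by hand.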

Let's try some simple predictions and calculate some of these by hand (a couple of times) using our training set.
We'll start with predicting no one survived. So the predictions array will be as long as y_train and filled with 0.

**Exercise:** Implement the four functions below. Use them to calculate the counts of true positives, false negatives, false positives, and true negatives. Finally, calculate the accuracy.

In [94]:
no_survivors = np.zeros(len(y_train))

def true_positives(yt, ypred): 
    x = pd.DataFrame({'yt': yt, 'pred':ypred})
    return x[(x.yt==1) & (x.pred==1)].shape[0]

def false_positives(yt, ypred): 
    x = pd.DataFrame({'yt': yt, 'pred':ypred})
    return x[(x.yt==0) & (x.pred == 1)].shape[0]

def true_negatives(yt, ypred): 
    x = pd.DataFrame({'yt': yt, 'pred':ypred})
    return x[(x.yt==0) & (x.pred==0)].shape[0]

def false_negatives(yt, ypred): 
    x = pd.DataFrame({'yt': yt, 'pred':ypred})
    return x[(x.yt==1) & (x.pred==0)].shape[0]

TP = true_positives(y_train, no_survivors)
FN = false_negatives(y_train, no_survivors)
FP = false_positives(y_train, no_survivors)
TN = true_negatives(y_train, no_survivors)

print(y_train.sum())
#precision = TP / float(TP + FP)
accuracy = (TP + TN) / float(TP+FN+FP+TN)
#print("Precision score is {:.4f}".format(precision))
print("Accuracy score is {:.4f}".format(accuracy))

238
Accuracy score is 0.6180


Now use the appropriate functions from `sklearn` (already imported) to compare with your results. 

In [95]:
#print("Precision score: {:.3f}".format(precision_score(y_train, no_survivors)))
print("Accuracy score: {:.4f}".format(accuracy_score(y_train, no_survivors)))


Accuracy score: 0.6180


The accuracy scores should match! We couldn't calculate the precision score because we need _some_ positive predictions to get a meaningful value. 
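
To make the 0/0 problem concrete, here is a small sketch using the counts for the "no survivors" prediction: 238 of the 623 training passengers survived and every prediction is 0. The helper `precision_or_none` is hypothetical and simply reports the undefined case as `None`.

```python
def precision_or_none(tp, fp):
    """Precision, or None when there are no positive predictions (0/0)."""
    if tp + fp == 0:
        return None
    return tp / float(tp + fp)

# Counts for the all-zero prediction on the training set.
tp, fp, fn, tn = 0, 0, 238, 385

accuracy = (tp + tn) / float(tp + fp + fn + tn)
recall = tp / float(tp + fn)

print("Accuracy: {:.4f}".format(accuracy))
print("Precision: {}".format(precision_or_none(tp, fp)))
print("Recall: {:.4f}".format(recall))
```

The accuracy of about 0.62 looks respectable, yet recall is 0 because not a single survivor is identified. As far as I know, `precision_score` in `sklearn` reports 0.0 with a warning in this situation rather than raising an error.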

**Exercise:** Once you have that working, let's calculate the precision (by hand) where our prediction is that all female passengers survived and all males did not. Remember to use X_train and y_train.

In [52]:
#TODO
all_females = 0 

TP = true_positives(y_train, all_females)
FN = false_negatives(y_train, all_females)
FP = false_positives(y_train, all_females)
TN = true_negatives(y_train, all_females)

#TODO - implement the expression for precision
precision = 0 # Remember to make the denominator a float 
accuracy = (TP + TN) / float(TP+FN+FP+TN)

print("Precision score is {:.4f}".format(precision))
print("Precision score: {:.4f}".format(precision_score(y_train, all_females)))


Precision score is 0.7560
Accuracy score is 0.7897
334 51
80 158


Let's look at a few other ways we can evaluate errors in a classification problem. The confusion matrix should be the same as the first two lines we printed.

In [61]:
print(TN, FP)
print(FN, TP)
print("Accuracy score: {:.4f}".format(accuracy_score(y_train, all_females)))
print("f1 score: {:.4f}".format(f1_score(y_train, all_females)))
print(confusion_matrix(y_true=y_train, y_pred=all_females))

334 51
80 158
Accuracy score: 0.7897
f1 score: 0.7069
[[334  51]
 [ 80 158]]


Notice how the precision we calculate for the `all_females` prediction is lower than the accuracy score. This often happens when the classes are _imbalanced_, i.e., the number of cases of one class is much larger than the number of the other class.

In some cases, e.g., detecting fraudulent credit card transactions, we may be much more interested in correctly predicting the rare cases than in correctly predicting the common ones. We can look at the $F_1$ score (the harmonic mean of precision and recall) or the $F_{\beta}$ score. We can also flip our class labels. The `classification_report` provides a quick way to look at precision, recall, f1-score and support (the number of samples of that class) to get a better feel for how well we did with the classification.

In [62]:
print(classification_report(y_train, all_females))

             precision    recall  f1-score   support

          0       0.81      0.87      0.84       385
          1       0.76      0.66      0.71       238

avg / total       0.79      0.79      0.79       623
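
The per-class rows of the report can be reproduced by hand from the confusion matrix counts printed earlier (334, 51, 80, 158). The sketch below uses a hypothetical helper `prf`; treating class 0 as "positive" is what flipping the class labels amounts to.

```python
# Confusion-matrix counts for the all_females prediction, copied from the output above.
tn, fp, fn, tp = 334, 51, 80, 158

def prf(hits, wrongly_predicted, missed):
    """Precision, recall and F1 for one class treated as 'positive'."""
    precision = hits / float(hits + wrongly_predicted)
    recall = hits / float(hits + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 (survived): hits are TP, wrong predictions are FP, misses are FN.
survived = prf(tp, fp, fn)
# Class 0 (died): flip the labels, so TN plays the role of TP and FN/FP swap.
died = prf(tn, fn, fp)

print("class 0: {:.2f} {:.2f} {:.2f}".format(*died))
print("class 1: {:.2f} {:.2f} {:.2f}".format(*survived))
print("support: {} {}".format(tn + fp, fn + tp))
```

Rounded to two decimals, these agree with the precision, recall and f1-score columns of the report.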



## 2 | Cross-Validation

### Build an optimized Decision Tree Classifier

For supervised learning problems, the model building `sklearn` workflow is pretty similar, regardless of the type of classifier you'd like to build. You should be getting used to this pattern:
  1. **Create** a classifier object.
  2. **Train** the classifier on the training data set.
  3. **Predict** with the classifier on the validation (test) data set.
  4. **Assess** the quality of the predictions of the classifier
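
The four steps above can be sketched as follows. This is a minimal illustration on synthetic stand-in data from `make_classification` (an assumption made so the snippet runs on its own), and the `_demo` names are hypothetical; with the Titanic data you would use `X_train`, `y_train`, `X_test` and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0)  # 1. create
clf_demo.fit(X_tr, y_tr)                                        # 2. train
y_hat = clf_demo.predict(X_te)                                  # 3. predict
demo_accuracy = accuracy_score(y_te, y_hat)                     # 4. assess
print("Accuracy: {:.4f}".format(demo_accuracy))
```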

Last week we built a [Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and looked at a couple of models using `max_depth` as a (hyper)parameter. We used that as an illustration of model complexity and a segue for introducing complexity curves. In your project, you will need to use `GridSearchCV`. Our objective here is to work through some of the details.

I have already completed the imports for you.

**Exercise:** Complete the TODO sections in the code cell below to **create** a Decision Tree Classifier and **train** it on the training data, using k-fold cross-validation to find the optimal DecisionTree model for this problem.

In [91]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# TODO: Pick a value between 3 and 10 for the number of folds for k-fold cross-validation
nfolds = 1

# TODO: Create a decision tree classifier object
clf = None

# TODO: Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
params = {}

# TODO: Create the grid search object
grid = GridSearchCV(estimator=clf, param_grid=params, cv=nfolds)

# TODO: Fit the grid search object to the data to compute the optimal model



GridSearchCV(cv=3, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': array([  0.1    ,   0.31623,   1.     ,   3.16228,  10.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

`GridSearchCV` returns a lot of intermediate results. You can view them all through `cv_results_`. Within these, `mean_test_score` and `mean_train_score` give you the data you would need to produce a complexity curve. `rank_test_score` gives the rank (1 is best) for each combination of parameters (we only have one parameter here).
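
Here is a hedged sketch of reading those entries out of `cv_results_`. It runs on synthetic stand-in data and tunes `min_samples_leaf`, a different hyper-parameter from the one in the exercise; both choices are assumptions made for illustration. Newer `sklearn` releases only record training scores when `return_train_score=True` is passed, so the sketch sets it explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20]},
    cv=5,
    return_train_score=True,  # newer releases omit train scores unless asked
)
search.fit(X_demo, y_demo)

res = search.cv_results_
rows = zip(res["param_min_samples_leaf"], res["mean_train_score"],
           res["mean_test_score"], res["rank_test_score"])
for leaf, train, test, rank in rows:
    print("min_samples_leaf={:2d} train={:.3f} cv={:.3f} rank={}".format(
        int(leaf), train, test, rank))

print("Best parameters:", search.best_params_)
print("Best CV score: {:.3f}".format(search.best_score_))
```

Plotting `mean_train_score` and `mean_test_score` against the parameter values gives the complexity curve mentioned above.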

**Naive Question:** We didn't pass in a test set, so how did we get a test score?

**Question:** What are the accuracy score and f1-score for the best classifier on the training set? How do these compare to the scores on the test set? How would you interpret the difference?

## k-fold Cross-validation exposed

In the cell below, I have exposed how cross-validation works inside `GridSearchCV` (and other classes implementing CV in `sklearn`). The `KFold.split` method is a generator that yields, for each fold, the arrays of row indices for the training and test sets in the context of CV. We have referred to these as the train and validation sets elsewhere. Add print statements to see how the splitting works.

The average of `results` would be the reported CV score.

In [134]:
# Simple k-fold cross-validation.
from sklearn.model_selection import KFold

cv = KFold(n_splits=5)
results = []

# Refit the depth-4 tree on each training fold and score it (accuracy) on the held-out fold
model = DecisionTreeClassifier(max_depth=4)
for traincv, testcv in cv.split(X_train):
    kscore = model.fit(X_train.iloc[traincv], y_train.iloc[traincv]).score(X_train.iloc[testcv], y_train.iloc[testcv])
    results.append(kscore)
print(results)

[0.752, 0.82399999999999995, 0.82399999999999995, 0.782258064516129, 0.84677419354838712]
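
To see where the fold boundaries come from, here is a plain-Python sketch of contiguous, unshuffled k-fold splitting. The function `kfold_indices` is a hypothetical stand-in written for illustration, not the `sklearn` implementation; it assumes the first `n_samples % n_splits` folds receive one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 623 training rows split into 5 folds, as in the cell above.
fold_sizes = [len(test) for _, test in kfold_indices(623, 5)]
print(fold_sizes)

# Fold scores copied from the output above, and their plain average.
fold_scores = [0.752, 0.824, 0.824, 0.782258064516129, 0.846774193548387]
mean_score = sum(fold_scores) / len(fold_scores)
print("Mean CV accuracy: {:.4f}".format(mean_score))
```

The fold sizes of 125 and 124 are consistent with the scores above: 0.752 is 94/125, for example, and 0.7823 is 97/124.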


##### That's all for this notebook!