# Evaluating Models
*by Evgeny Sushko*
<img width="50%" height="50%" src="http://i.piccy.info/i9/666d78be04fbcf04fdb321d5953d1fa5/1492256847/123248/1137898/ua_parrots.jpg">

---
## <a name="0"></a> Table of Contents:
1. [Model evaluation applications](#1)    
    1. [Generalization performance](#1.1)
    2. [Model selection](#1.2)
    3. [Algorithm selection](#1.3)   
2. [Model evaluation techniques](#2)
   1. [Holdout method (simple train/test split)](#2.1)
       1. [exercise 1](#2.1.1)
   2. [K-fold cross-validation](#2.2)   
3. [Classification metrics](#3)
   1. [Accuracy](#3.1)
   2. [Confusion matrix](#3.2)
   3. [Precision & Recall](#3.3)
       1. [exercise 2](#3.3.1)
   4. [F1-score](#3.4)
   5. [Classification report](#3.5)
4. [Choice of metrics](#4)
5. [Summary](#5)
---
### Requirements
1. Python 3.x (or Anaconda3 for Python 3.5, https://www.continuum.io/downloads)
2. Scikit-learn 0.18.x (pip install scikit-learn==0.18.1, http://scikit-learn.org/)
3. Pandas latest (http://pandas.pydata.org/)
4. For datasets with more than 1M reviews: at least 8 GB of RAM
---

Model evaluation is not just the end point of our machine learning pipeline. Before we handle any data, we want to plan ahead and use techniques and metrics that are suited for our purposes.

In this tutorial we will go over some of these techniques and metrics and we will see how they fit into a typical machine learning workflow.

### <a name="1"></a> 1. Model Evaluation Applications
Let's start with a question: **"Why do we care about performance estimates at all?"**

<a name="1.1"></a>**Generalization performance** - We want to estimate the predictive performance of our model on future (unseen) data.
- Ideally, the estimated performance of a model tells how well it performs on unseen data – making predictions on future data is often the main problem we want to solve.

<a name="1.2"></a>**Model selection** - We want to increase the predictive performance by tweaking the learning algorithm and selecting the best performing model from a given hypothesis space.
- Typically, machine learning involves a lot of experimentation. Running a learning algorithm over a training dataset with different hyperparameter settings and different features will result in different models. Since we are typically interested in selecting the best-performing model from this set, we need to find a way to estimate their respective performances in order to rank them against each other.

<a name="1.3"></a>**Algorithm selection** - We want to compare different ML algorithms, selecting the best-performing one.
- We are usually not only experimenting with the one single algorithm that we think would be the “best solution” under the given circumstances. More often than not, we want to compare different algorithms to each other, oftentimes in terms of predictive and computational performance.

Although all three sub-tasks involve estimating the performance of a model, each of them requires a different approach.

This tutorial will focus on **supervised learning**, a subcategory of machine learning where our target values are known in our available dataset. Although many concepts also apply to regression analysis, we will focus on **classification**, the assignment of categorical target labels to the samples.

[To the table of contents](#0)

---

### <a name="2"></a>2. Model Evaluation Techniques
#### <a name="2.1"></a>Holdout method (simple train/test split)
The holdout method is the simplest model evaluation technique. We take our labeled dataset and split it randomly into two parts: a **training set** and a **test set**.
<img src="https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part1/testing_01.png" width="500">
Then, we fit a model to the training data and predict the labels of the test set.
<img src="https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part1/testing_02.png" width="500">
And the fraction of correct predictions constitutes our estimate of the prediction accuracy.
<img src="https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part1/testing_03.png" width="500">
We really don't want to train and evaluate our model on the same dataset, because the resulting estimate would be optimistically biased. If the model has **overfit**, that is, simply memorized the training data, its score on that same data would still look excellent, so we could not tell whether it generalizes well to new, unseen data.

##### Pros:
    + Simple
    + Fast

##### Cons:
    - Less precise estimate of out-of-sample performance compared to more advanced techniques
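
To make the point about evaluating on the training data concrete, here is a minimal sketch on synthetic data. The dataset, the decision tree classifier and the `*_demo` / `Xd_*` names are assumptions made for illustration only; they are not part of the movie review example used in this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some randomly flipped labels (illustrative assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     flip_y=0.1, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# An unpruned decision tree is flexible enough to memorize its training set
tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)

train_acc = tree.score(Xd_train, yd_train)  # measured on data the model has seen
test_acc = tree.score(Xd_test, yd_test)     # measured on held-out data

print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

The training accuracy looks close to perfect, while the held-out accuracy is noticeably lower. Only the second number says something about unseen data.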

[To the table of contents](#0)

In [1]:
# import data
import pandas as pd

df = pd.read_csv('../data/movie_reviews.csv')

df_new = pd.read_csv('../data/test.csv')
X_test_new, y_test_new = df_new.text, df_new.label

# check number of rows & columns
df.shape

(152610, 2)

In [14]:
# split dataset to Train and Test parts
from sklearn.model_selection import train_test_split

X, y = df.text, df.label
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

In [3]:
# fit a model to the training data
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

vectorizer = CountVectorizer(binary=True)
classifier = LogisticRegression(class_weight='balanced')

pipeline = Pipeline([('vectorizer', vectorizer),
                     ('classifier', classifier)])

%time model = pipeline.fit(X_train, y_train)

CPU times: user 47.8 s, sys: 27.7 s, total: 1min 15s
Wall time: 45.9 s


In [4]:
# predict the labels of the test set
%time y_pred = model.predict(X_test)

CPU times: user 3.58 s, sys: 50 ms, total: 3.63 s
Wall time: 3.6 s


In [5]:
# compute prediction accuracy
from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.819376187668


#### <a name="2.1.1"></a>Exercise 1: Holdout validation
- Split the dataset into train and test parts.
- Hold out 30% of the data as the test part.
- Keep `random_state=42`.
- Check the accuracy on the test set.

In [6]:
# my_X, my_y = df.text, df.label

#### YOUR CODE STARTS HERE ###

# my_X_train, my_X_test, my_y_train, my_y_test = ...

#### YOUR CODE ENDS HERE ###

# vectorizer = CountVectorizer(binary=True)
# classifier = LogisticRegression(class_weight='balanced', C=0.1)
# my_pipeline = Pipeline([('vectorizer', vectorizer),
#                      ('classifier', classifier)])
# my_model = my_pipeline.fit(my_X_train, my_y_train)
# my_y_pred = my_model.predict(my_X_test)
# print("My Accuracy:", metrics.accuracy_score(my_y_test, my_y_pred))

<details>
  <summary>Click to see answer</summary>
  <p align='left'>Are you sure you tried to solve it on your own?</p>
          <code>
my_X_train, my_X_test, my_y_train, my_y_test = train_test_split(X, y, test_size=0.3, random_state=42)
          </code>
</details>

#### <a name="2.2"></a>K-fold Cross-validation
K-fold cross-validation is probably the most common technique for model evaluation and model selection.
- We split the dataset into *K* parts (folds) and iterate over them *K* times.
- In each round, one part is used for validation, and the remaining *K-1* parts are merged into a training subset for model fitting.
- We compute the cross-validation performance as the arithmetic mean over the *K* performance estimates from the validation sets.
<img src="https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part3/kfold.png" width="500">
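
The three steps above can be sketched as an explicit loop. This is a minimal illustration on synthetic data, assuming a plain `LogisticRegression` and 5 folds; the `*_demo` names are hypothetical. It also checks the result against scikit-learn's `cross_val_score` helper, which runs the same procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (illustrative assumption, not the movie review dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Split the data into K=5 folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    # Fit on the K-1 merged folds, validate on the remaining fold
    clf = LogisticRegression()
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(clf.score(X_demo[val_idx], y_demo[val_idx]))

# Cross-validation performance is the arithmetic mean over the K estimates
manual_mean = np.mean(fold_scores)

# The same computation using the built-in helper and the same folds
helper_mean = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=kfold).mean()

print("Per-fold accuracy:", fold_scores)
print("Manual mean:", manual_mean)
print("cross_val_score mean:", helper_mean)
```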

##### Pros:
    + Better estimate of out-of-sample performance than simple train/test split

##### Cons:
    - Runs about "K" times slower than a simple train/test split, because K models have to be fit

If we have **little data** and **enough time**, it's better to always do cross-validation for a more precise estimate of performance.

In the following example we will apply K-fold cross-validation for model selection using the *GridSearchCV* class.

> #### GridSearchCV main parameters
>*sklearn.model_selection.GridSearchCV*

>**param_grid**: dict or list of dictionaries.
Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

>**cv**: int, cross-validation generator or an iterable, optional.
Determines the cross-validation splitting strategy.

>**scoring**: string, callable or None, default=None.
Controls which metric is applied to the evaluated estimators. If None, the estimator's default `score` method is used (accuracy for classifiers).
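
As a rough sketch of how these three parameters fit together, the snippet below runs a small grid search on synthetic data. The dataset, the grid values and the choice of `scoring='f1'` are assumptions for illustration; the `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (illustrative assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# param_grid: parameter name -> list of settings to try
demo_grid = {'C': [0.01, 0.1, 1.0]}

# cv=3 means 3-fold cross-validation; scoring='f1' replaces the default accuracy
demo_search = GridSearchCV(LogisticRegression(), param_grid=demo_grid,
                           cv=3, scoring='f1')
demo_search.fit(X_demo, y_demo)

# One mean validation score per parameter setting, in grid order
mean_scores = demo_search.cv_results_['mean_test_score']

print("Mean F1 per setting:", mean_scores)
print("Best parameters:", demo_search.best_params_)
print("Best mean F1:", demo_search.best_score_)
```

`best_score_` is the highest of the per-setting mean validation scores, and `best_params_` is the setting that produced it.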

[To the table of contents](#0)

In [21]:
# fit model 
from sklearn.model_selection import GridSearchCV

params = dict(classifier__C=[0.1, 0.01])
grid_search = GridSearchCV(pipeline, param_grid=params, cv=3)

%time grid_search.fit(X,y)

CPU times: user 4min 13s, sys: 42.1 s, total: 4min 56s
Wall time: 4min 10s


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(steps=[('vectorizer', CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        ...ty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'classifier__C': [0.1, 0.01]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring=None, verbose=0)

In [22]:
# Best parameters found:
grid_search.best_params_

{'classifier__C': 0.1}

In [19]:
# Average accuracy over K folds for best parameters set
print("Validation Accuracy", grid_search.best_score_)

Validation Accuracy 0.762164995741


In [20]:
# Let's check how our model will generalize to unseen data
y_pred_new = grid_search.predict(X_test_new)

print("Test Accuracy:", metrics.accuracy_score(y_test_new, y_pred_new))

Test Accuracy: 0.80469043152


### <a name="3"></a>3. Classification metrics overview
Classification problems are probably the most common type of ML problem and as such there are many metrics that can be used to evaluate predictions for these problems. We will review some of them.

### <a name="3.1"></a>Accuracy
Accuracy simply measures *what percent of your predictions were correct*. It's the ratio between the number of correct predictions and the total number of predictions.

$$accuracy = {\frac{\#\ correct}{\#\ predictions}}$$

[To the table of contents](#0)

In [6]:
# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred))

0.819376187668


Accuracy is also the most misused metric. It is really **only suitable** when there is an *equal number of observations in each class* (which is rarely the case) and when all *predictions and prediction errors are equally important* (which is often not the case either).
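
A small made-up example shows why class imbalance matters here. The labels below are invented for illustration (95 negatives, 5 positives), and the "model" is a hypothetical one that always predicts the majority class.

```python
import numpy as np
from sklearn import metrics

# Invented, heavily imbalanced labels: 95 negatives and 5 positives
y_true_demo = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class
y_pred_demo = np.zeros_like(y_true_demo)

acc_demo = metrics.accuracy_score(y_true_demo, y_pred_demo)
positives_found = int(((y_true_demo == 1) & (y_pred_demo == 1)).sum())

print("Accuracy:", acc_demo)                   # 0.95
print("Positives detected:", positives_found)  # 0
```

The accuracy is 95%, yet not a single positive case is detected.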

### <a name="3.2"></a>Confusion Matrix
The confusion matrix is a handy presentation of the predictions of a model with 2 or more classes. In scikit-learn's layout, the **rows correspond to the true classes** and the **columns to the predicted classes**. Each cell holds the number of predictions that fall into that combination of true and predicted class.

In [7]:
# first argument is true values, second argument is predicted values
# this produces a 2x2 numpy array (matrix)
conf = metrics.confusion_matrix(y_test, y_pred)
print(conf)

[[10013  2534]
 [ 2979 14996]]


|                | Predicted Negative | Predicted Positive |
|:--------------:|--------------------|--------------------|
| **Negative Cases** |      TN: 10013     |      FP: 2534      |
| **Positive Cases** |      FN: 2979      |      TP: 14996     |

- **True Positives (TP)**:
We correctly predicted that the reviews are positive: **14996**
- **True Negatives (TN)**:
We correctly predicted that the reviews are negative: **10013**
- **False Positives (FP)**:
We incorrectly predicted that the reviews are positive: **2534**
- **False Negatives (FN)**:
We incorrectly predicted that the reviews are negative: **2979**



The confusion matrix allows you to compute various classification metrics, and these metrics can guide your model selection.

In [8]:
# slice confusion matrix into four pieces for future use
TP = conf[1, 1]
TN = conf[0, 0]
FP = conf[0, 1]
FN = conf[1, 0]
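
To double-check which cell is which, it can help to run `confusion_matrix` on a handful of hand-made labels where the counts are easy to verify by eye. The labels below are invented for illustration; `ravel()` is simply another way to unpack the four cells of a 2x2 matrix.

```python
from sklearn import metrics

# Invented labels: 4 negatives and 4 positives
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
conf_demo = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(conf_demo)
# [[3 1]
#  [1 3]]

# Flattening row by row yields TN, FP, FN, TP for the binary case
tn, fp, fn, tp = conf_demo.ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```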

You can learn more in the [Wikipedia article on the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

[To the table of contents](#0)

### <a name="3.3"></a>Precision & Recall
Precision and recall are two separate metrics, but they are often used together.

**Precision** answers the question: *What percent of positive predictions were correct?*

$$precision = {\frac{\#\ true\ positive}{\#\ true\ positive + \#\ false\ positive}}$$

**Recall** answers the question: *What percent of the positive cases did you catch?*


$$recall = {\frac{\#\ true\ positive}{\#\ true\ positive + \#\ false\ negative}}$$

![](http://www.kdnuggets.com/images/precision-recall-relevant-selected.jpg)

See also a very good explanation of [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) in Wikipedia.

[To the table of contents](#0)

In [9]:
# calculate precision
precision = TP / float(TP + FP)

print(precision)
print(metrics.precision_score(y_test, y_pred))

0.855447803765
0.855447803765


#### <a name="3.3.1"></a>Exercise 2: Compute Recall
- Compute the recall metric in two different ways.
- Check that the two values are equal.

In [14]:
# compute recall

#### YOUR CODE STARTS HERE ###

# recall = ...

# print(recall)
# print(...)

#### YOUR CODE ENDS HERE ###

<details>
  <summary>Click to see answer</summary>
  <p align='left'>Are you sure you tried to solve it on your own?</p>
          <code>
              recall = TP / float(FN + TP)
              
              print(recall)
              print(metrics.recall_score(y_test, y_pred))
          </code>

</details>

### <a name="3.4"></a>F1-score
The F1-score (also known as the balanced F-score, that is, the F-beta score with beta = 1) is a single metric that combines both precision and recall via their harmonic mean:

$$F_1 = 2 {\frac{precision * recall}{precision + recall}}$$

Unlike the arithmetic mean, the harmonic mean tends toward the smaller of the two elements. Hence the F1 score will be small if either precision or recall is small.
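
A quick numeric illustration of that difference, using invented values for precision and recall (the `*_demo` names are hypothetical):

```python
# Invented values: very high precision, very low recall
precision_demo, recall_demo = 0.9, 0.1

arithmetic_mean = (precision_demo + recall_demo) / 2
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(round(arithmetic_mean, 2))  # 0.5
print(round(f1_demo, 2))          # 0.18
```

The arithmetic mean hides the poor recall, while the F1-score stays close to the weaker of the two values.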

[To the table of contents](#0)

In [11]:
# calculate f1-score (uses `recall` computed in Exercise 2)
f1 = 2 * precision * recall / (precision + recall)

print(f1)
print(metrics.f1_score(y_test, y_pred))

0.844726094916
0.844726094916


### <a name="3.5"></a>Classification Report
Scikit-learn provides a convenience report for classification problems that gives you a quick idea of the quality of a model across several measures.

The **classification_report()** function displays the precision, recall, f1-score and support for each class. (*support* is the number of occurrences of each class in *y_true*)
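
If you need the same per-class numbers as values rather than as printed text, `metrics.precision_recall_fscore_support` returns them as arrays. The sketch below uses a few invented labels so the results can be verified by hand; the `*_demo` names are hypothetical.

```python
from sklearn import metrics

# Invented labels: 4 negatives and 4 positives
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1, 0]

# Each returned array has one entry per class, ordered by class label
prec_demo, rec_demo, f1_per_class, support_demo = \
    metrics.precision_recall_fscore_support(y_true_demo, y_pred_demo)

for label, p, r, f, s in zip([0, 1], prec_demo, rec_demo, f1_per_class, support_demo):
    print("class", label, "precision", p, "recall", r, "f1", f, "support", s)
```

Here every class has 3 correct predictions out of 4 predicted and 4 actual cases, so precision, recall and F1 are all 0.75 and the support is 4 for both classes.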

[To the table of contents](#0)

In [12]:
# print a report on the binary classification problem
print(metrics.classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.77      0.80      0.78     12547
          1       0.86      0.83      0.84     17975

avg / total       0.82      0.82      0.82     30522



### <a name="4"></a>4. Choice of Metrics
Depending on your application, you may want to consider different performance metrics. Choice of metric depends on your business objective and on the data you have at hand.

**Accuracy** alone can be enough when the data is balanced (equal number of observations in each class) and when minimizing *False Positives* and *False Negatives* is equally important.

If that is not the case:

- Identify whether FP or FN is more important to reduce
- Choose a metric that has the relevant variable (FP or FN) in its equation

##### Case 1: Spam filter (positive class is "spam")
FN (spam goes to the inbox) are more acceptable than FP (non-spam is caught by the spam filter) => Choose **FP** as a variable, optimize for **precision**

##### Case 2: Fraudulent transaction detector (positive class is "fraud")
FP (normal transactions that are flagged as possible fraud) are more acceptable than FN (fraudulent transactions that are not detected) => Choose **FN** as a variable, optimize for **recall**
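
The two cases can be illustrated with a pair of hypothetical models scored on the same invented labels: a cautious model that rarely predicts the positive class, and a liberal one that predicts it often. All values below are made up for illustration.

```python
from sklearn import metrics

# Invented ground truth: 4 positives, 6 negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical cautious model: flags few cases, no false positives
pred_cautious = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical liberal model: flags many cases, no false negatives
pred_liberal = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

scores = {}
for name, pred in [("cautious", pred_cautious), ("liberal", pred_liberal)]:
    scores[name] = (metrics.precision_score(y_true_demo, pred),
                    metrics.recall_score(y_true_demo, pred))
    print(name, "precision:", scores[name][0], "recall:", scores[name][1])
```

The cautious model wins on precision (1.0 against about 0.67), which would suit the spam filter, while the liberal model wins on recall (1.0 against 0.5), which would suit the fraud detector.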

[To the table of contents](#0)

---
### <a name="5"></a>Summary
In this tutorial we briefly explored applications, techniques and metrics of model evaluation. We learned about 3 tasks of model evaluation:
- Estimating model performance
- Model selection
- Algorithm selection

2 common techniques:
- Simple train/test split (Holdout validation)
- K-fold cross-validation

About 4 classification metrics:
- Accuracy
- Precision
- Recall
- F1-score

Also 2 convenience methods for classification prediction results:
- Confusion matrix
- Classification report

And about choosing the right metric.

#### Thank you!

[To the table of contents](#0)

---
#### References
- Sebastian Raschka, [Model evaluation, model selection, and algorithm selection in machine learning](https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html)
- Jason Brownlee, [Metrics To Evaluate Machine Learning Algorithms in Python](http://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/)
- Ritchie Ng, [Evaluating a Classification Model](http://www.ritchieng.com/machine-learning-evaluate-classification-model/)
- [Turi Machine Learning Platform User Guide](https://turi.com/learn/userguide/evaluation/classification.html)
- Gregory Piatetsky, [21 Must-Know Data Science Interview Questions and Answers](http://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers.html/2)