# Model Evaluation & Metrics - Quantifying the Quality of Predictions

http://scikit-learn.org/stable/modules/model_evaluation.html

There are three different approaches to evaluating the quality of a model's predictions:

- **Estimator Score Method**: Estimators have a *score* method providing a default evaluation criterion for the problem they are designed to solve. This info can be found in each estimator's documentation.

- **Scoring Parameter**: Model-evaluation tools that use ***cross-validation*** (such as *model_selection.cross_val_score* and *model_selection.GridSearchCV*) rely on an internal scoring strategy.

- **Metric Functions**: The **metrics** module implements functions assessing prediction error for specific purposes, including *Classification Metrics*, *Multilabel ranking metrics*, *Regression metrics* and *Clustering metrics*.

Additionally, ***Dummy Estimators*** are useful to get a baseline value of those metrics for random predictions.
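
As a rough illustration of the baseline idea, the sketch below compares a `DummyClassifier` with an `SVC` under the same scoring. The bundled iris data, the `most_frequent` strategy (chosen here because it is deterministic, unlike the random strategies) and `cv=5` are assumptions made for the example, not choices from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Baseline: always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X, y, scoring="accuracy", cv=5)

# A real model, scored in exactly the same way for comparison.
model_scores = cross_val_score(SVC(), X, y, scoring="accuracy", cv=5)

print("baseline accuracy:", baseline_scores.mean())
print("SVC accuracy:     ", model_scores.mean())
```

A model is only worth keeping if it clearly beats this kind of baseline on the chosen metric.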


## 1 The `scoring` Parameter: Defining Model Evaluation Rules

Model selection and evaluation tools such as **`model_selection.GridSearchCV`** and **`model_selection.cross_val_score`** take a `scoring` parameter that controls which metric they apply to the estimators being evaluated.
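
A minimal sketch of passing `scoring` to `GridSearchCV` might look like this. The parameter grid, the `"f1_macro"` choice and the bundled iris data are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grid: three values of the SVC regularization parameter C.
param_grid = {"C": [0.1, 1, 10]}

# `scoring` decides which metric ranks the candidate settings.
search = GridSearchCV(SVC(), param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated macro-F1 of the best candidate
```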

### 1.1 Common Cases: Predefined Values

For the most common use cases, we can designate a scorer object with the `scoring` parameter. All possible values are listed in the table below. All scorer objects follow the convention that 

***"higher return values are better than lower return values"***. 

Thus metrics which measure the distance between the model and the data, like **`metrics.mean_squared_error`**, are available as **`neg_mean_squared_error`**, which returns the negated value of the metric.
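
To make the sign convention concrete, here is a small sketch using synthetic regression data (the dataset and the `LinearRegression` model are assumptions for illustration). The scorer returns negated errors, so flipping the sign recovers the usual MSE.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only to have something to score.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=5)

print(neg_mse)   # every entry is <= 0, so "higher is better" still holds
print(-neg_mse)  # flip the sign to read the values as ordinary MSE
```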

| Scoring       |Function               |Comment           |
|--|--|--|
|**Classification**|   | |
|'accuracy' |`metrics.accuracy_score` | |
|'average_precision' |`metrics.average_precision_score` |It corresponds to the area under the precision-recall curve |
|'f1' |`metrics.f1_score` |For binary targets |
|'f1_micro' |`metrics.f1_score` | micro-averaged|
|'f1_macro' |`metrics.f1_score` | macro-averaged|
|'f1_weighted' |`metrics.f1_score` | weighted average |
|'f1_samples' |`metrics.f1_score` |by multilabel sample |
|'neg_log_loss' |`metrics.log_loss` | requires `predict_proba` support |
|'precision' etc. |`metrics.precision_score` |suffixes apply as with 'f1' |
|'recall' etc. |`metrics.recall_score` |suffixes apply as with 'f1' |
|'roc_auc' |`metrics.roc_auc_score` | |
|**Clustering**|   | |
| 'adjusted_rand_score'|`metrics.adjusted_rand_score`| |
|**Regression**|   | |
| 'r2'| `metrics.r2_score`| |
| 'neg_mean_absolute_error'| `metrics.mean_absolute_error`| |
| 'neg_mean_squared_error'| `metrics.mean_squared_error`| |
| 'neg_median_absolute_error'| `metrics.median_absolute_error`| |

```python
import pandas as pd
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Load the iris data: the first four columns are features, `Species` is the target.
iris = pd.read_csv("data/iris.csv")
X, y = iris.iloc[:, 0:4], iris.Species

# neg_log_loss needs predict_proba, so probability estimates must be enabled.
model_1 = svm.SVC(probability=True, random_state=0)
print(cross_val_score(model_1, X, y, scoring="neg_log_loss", cv=10))

# Accuracy only needs hard class predictions.
model_2 = svm.SVC()
print(cross_val_score(model_2, X, y, scoring="accuracy", cv=10))
```

```
[-0.05362861 -0.09135357 -0.0441815  -0.06721635 -0.14272846 -0.15222303
 -0.19022653 -0.07655709 -0.04210015 -0.04571209]
[ 1.          0.93333333  1.          1.          1.          0.93333333
  0.93333333  1.          1.          1.        ]
```


We can also list all the scorer objects with the commands below:

```python
from sklearn import metrics

# Dictionary mapping each scoring name to its scorer object.
# Note: this attribute exists in older scikit-learn releases; newer ones
# provide metrics.get_scorer_names() for listing the names instead.
metrics.SCORERS
```

```
{'accuracy': make_scorer(accuracy_score),
 'adjusted_rand_score': make_scorer(adjusted_rand_score),
 'average_precision': make_scorer(average_precision_score, needs_threshold=True),
 'f1': make_scorer(f1_score),
 'f1_macro': make_scorer(f1_score, average=macro, pos_label=None),
 'f1_micro': make_scorer(f1_score, average=micro, pos_label=None),
 'f1_samples': make_scorer(f1_score, average=samples, pos_label=None),
 'f1_weighted': make_scorer(f1_score, average=weighted, pos_label=None),
 'log_loss': make_scorer(log_loss, greater_is_better=False, needs_proba=True),
 'mean_absolute_error': make_scorer(mean_absolute_error, greater_is_better=False),
 'mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False),
 'median_absolute_error': make_scorer(median_absolute_error, greater_is_better=False),
 'neg_log_loss': make_scorer(log_loss, greater_is_better=False, needs_proba=True),
 'neg_mean_absolute_error': make_scorer(mean_absolute_error, greater_is_better=False),
 'neg_mean
```

### 1.2 Defining Your Scoring Strategy from Metric Functions

http://scikit-learn.org/stable/modules/model_evaluation.html#defining-your-scoring-strategy-from-metric-functions
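
As a sketch of what this section covers, a metric function can be wrapped with `metrics.make_scorer` and passed wherever a `scoring` value is accepted. The choice of `fbeta_score` with `beta=2` and macro averaging is an illustrative assumption, as is the iris data.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Extra keyword arguments (beta, average) are forwarded to fbeta_score.
# For a loss function, pass greater_is_better=False so the result is negated.
f2_macro = make_scorer(fbeta_score, beta=2, average="macro")

scores = cross_val_score(SVC(), X, y, scoring=f2_macro, cv=5)
print(scores)
```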

### 1.3 Implementing Your Own Scoring Object

http://scikit-learn.org/stable/modules/model_evaluation.html#implementing-your-own-scoring-object
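
For cases `make_scorer` does not fit, a scorer can also be any callable with the signature `scorer(estimator, X, y)` that returns a number where higher is better. The function `min_class_recall` below is a hypothetical example written for this note, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def min_class_recall(estimator, X, y):
    """Hypothetical scorer: recall of the worst-predicted class."""
    y_pred = estimator.predict(X)
    # Per-class recall: fraction of samples of each class predicted correctly.
    recalls = [np.mean(y_pred[y == label] == label) for label in np.unique(y)]
    return float(min(recalls))


X, y = load_iris(return_X_y=True)

# The callable is passed directly as the scoring argument.
scores = cross_val_score(SVC(), X, y, scoring=min_class_recall, cv=5)
print(scores)
```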

## 2 Classification Metrics

The **`sklearn.metrics`** module implements several loss, score, and utility functions to measure classification performance. Some metrics may require probability estimates of the positive class, confidence values, or binary decision values. Most implementations allow each sample to provide a weighted contribution to the overall score, through the *`sample_weight`* parameter.
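
The effect of `sample_weight` can be seen on a tiny hand-made example. The labels and weights below are made up for illustration.

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1]
y_pred = [0, 1, 0]
weights = [1, 1, 2]  # illustrative: the last (misclassified) sample counts double

# Unweighted: 2 of 3 samples are correct.
plain = accuracy_score(y_true, y_pred)

# Weighted: correct weight 1 + 1 out of total weight 1 + 1 + 2.
weighted = accuracy_score(y_true, y_pred, sample_weight=weights)

print(plain, weighted)
```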

Some of these are restricted to the **binary classification case**:

|                         FUNCTION|                COMMENT         |
|                         --|                --      |
| `matthews_corrcoef(y_true, y_pred[, ...])`|                Compute the Matthews Correlation Coefficient (MCC) for binary classes       |
|`precision_recall_curve(y_true, probas_pred)`|Compute precision-recall pairs for different probability thresholds|
|`roc_curve(y_true, y_score[, pos_label, ...])`|Compute Receiver Operating Characteristic (ROC)|
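
A short sketch of the three binary-only functions on a four-sample toy problem follows. The labels, the scores and the 0.5 decision threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, precision_recall_curve, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # e.g. predicted probability of class 1

# MCC works on hard decisions, so threshold the scores first.
y_pred = (y_score >= 0.5).astype(int)
mcc = matthews_corrcoef(y_true, y_pred)

# The curve functions take the raw scores and sweep the threshold themselves.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

print("MCC:", mcc)
print("precision:", precision, "recall:", recall)
print("fpr:", fpr, "tpr:", tpr)
```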



Others also work in the **multiclass case**:

|                         FUNCTION|                COMMENT         |
|                         --|                --      |
|`cohen_kappa_score(y1, y2[, labels, weights])`| Cohen's kappa: a statistic that measures inter-annotator agreement|
|`confusion_matrix(y_true, y_pred[, labels, ...])`|Compute Confusion Matrix to evaluate the accuracy of a classification|
|`hinge_loss(y_true, pred_decision[, labels, ...])`|Average hinge loss (non-regularized)|
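
For a three-class toy problem, the confusion matrix and Cohen's kappa can be computed as sketched below. The label vectors are made up for illustration.

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Agreement between the two label vectors, corrected for chance agreement.
kappa = cohen_kappa_score(y_true, y_pred)
print(kappa)
```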

Some also work in the **multilabel case**:

|                         FUNCTION|                COMMENT         |
|                         --|                --      |
|`accuracy_score(y_true, y_pred[, normalize, ...])`|Accuracy classification score.|
|`classification_report(y_true, y_pred[,...])`| Build a text report showing the main classification metrics|
|`f1_score(y_true, y_pred[, labels, ...])`|Compute the F1 score, also known as balanced F-score or F-measure|
|`fbeta_score(y_true, y_pred, beta[, labels, ...])`|Compute the F-beta score|
|`hamming_loss(y_true, y_pred[, labels, ...])`|Compute the average Hamming loss.|
|`jaccard_similarity_score(y_true, y_pred[,...])`|Jaccard similarity coefficient score|
|`log_loss(y_true, y_pred[, eps, normalize, ...])`|Log loss, aka logistic loss or cross-entropy loss.|
|`precision_recall_fscore_support(y_true, y_pred)`|Compute precision, recall, F-measure and support for each class|
|`precision_score(y_true, y_pred[, labels, ...])`|Compute the precision|
|`recall_score(y_true, y_pred[, labels, ...])`|Compute the recall|
|`zero_one_loss(y_true, y_pred[, normalize, ...])`|Zero-one classification loss.|
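
In the multilabel case the targets are binary indicator matrices, one column per label. The sketch below (with made-up arrays) shows how the same predictions score under three of the functions above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Each row is a sample, each column a label (1 = label present).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0]])

# Subset accuracy: a sample counts only if every label matches (2 of 3 rows).
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss: fraction of individual label entries that are wrong (1 of 9).
ham = hamming_loss(y_true, y_pred)

# Micro-averaged F1 pools true/false positives over all labels.
f1_micro = f1_score(y_true, y_pred, average="micro")

print(subset_acc, ham, f1_micro)
```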

And some work with **binary and multilabel (but not multiclass)** problems:

|                         FUNCTION|                COMMENT         |
|                         --|                --      |
|`average_precision_score(y_true, y_score[,...])`|Compute average precision (AP) from prediction scores|
|`roc_auc_score(y_true, y_score[, average, ...])`|Compute Area Under the Curve (AUC) from prediction scores|
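
Both functions take continuous scores rather than hard predictions. A minimal binary sketch, with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC: 3 of the 4 (positive, negative) pairs are ranked correctly.
auc = roc_auc_score(y_true, y_score)

# AP summarizes the precision-recall curve as a single number.
ap = average_precision_score(y_true, y_score)

print(auc, ap)
```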