# Overview
There are three different approaches to evaluating the quality of a model's predictions:
1. **Estimator score method**: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. This is not discussed on this page, but in each estimator’s documentation.
2. **Scoring parameter**: Model-evaluation tools using cross-validation (such as cross_validation.cross_val_score and grid_search.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section The scoring parameter: defining model evaluation rules.
3. **Metric functions**: The metrics module implements functions assessing prediction error for specific purposes. These metrics are detailed in sections on Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics.
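
As a rough sketch, the three approaches can be put side by side on the iris data. This assumes a scikit-learn release where the cross-validation helpers live in `sklearn.model_selection` (0.18 or later) rather than in the `cross_validation` module used elsewhere in these notes; the choice of a linear `SVC` and `cv=3` is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel='linear', C=1).fit(X_train, y_train)

# 1. Estimator score method: for classifiers the default criterion is accuracy.
score_method = clf.score(X_test, y_test)

# 2. Scoring parameter: a string tells the cross-validation tool which scorer to use.
cv_scores = cross_val_score(SVC(kernel='linear', C=1), X, y, cv=3, scoring='accuracy')

# 3. Metric function: called directly on true labels and predictions.
metric_value = accuracy_score(y_test, clf.predict(X_test))

print(score_method, cv_scores, metric_value)
```

For a classifier, approaches 1 and 3 give the same number here because `score` defaults to accuracy.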

## Relevant functions and classes
- `sklearn.metrics.make_scorer`
- `sklearn.metrics.classification_report`
- `sklearn.metrics.confusion_matrix`
- `sklearn.metrics.roc_curve`
- `sklearn.metrics.roc_auc_score`
- [`sklearn.dummy.DummyClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)

In [2]:
import pandas as pd
import numpy as np
from sklearn import grid_search, cross_validation, svm
from pprint import pprint
from pandas import DataFrame as DF
from pandas import Series as SR
from tak.tak import myprint, pd_underscore, pd_setdiff

# 3.3.1. The scoring parameter: defining model evaluation rules

For most cases, we just use the following predefined values:

http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values

(I don't think I need to worry about coming up with my own score function, do I?)

In [4]:
from sklearn import svm, cross_validation, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = svm.SVC(probability=True, random_state=0)
cross_validation.cross_val_score(clf, X, y, scoring='log_loss') 

array([-0.07475338, -0.16911634, -0.0698804 ])
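
A minimal sketch of how the predefined scoring strings are swapped in and out, assuming a recent scikit-learn (0.18 or later), where `cross_val_score` is imported from `sklearn.model_selection` and the log-loss scorer is named `'neg_log_loss'`. The `results` dictionary is just a name chosen for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(probability=True, random_state=0)

# Same estimator and folds, three different predefined scoring strings.
results = {}
for name in ['accuracy', 'f1_macro', 'neg_log_loss']:
    results[name] = cross_val_score(clf, X, y, cv=3, scoring=name)
    print(name, results[name])
```

Loss-based scorers such as `'neg_log_loss'` return negated values, so that higher is always better regardless of which string is used.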

## 3.3.1.2. Defining your scoring strategy from metric functions (make_scorer)
Use [`sklearn.metrics.make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer) to build a scorer object from a metric function. It can take several parameters:

- the Python function you want to use (`my_custom_loss_func` in the example below)
- whether the Python function returns a **score** (`greater_is_better=True`, the default) or a **loss** (`greater_is_better=False`).
    - If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
- for **classification metrics only**: whether the Python function you provided requires continuous decision certainties (`needs_threshold=True`).
    - The default value is `False`.
- any additional parameters, such as `beta` in `fbeta_score`.

In [10]:
# Here's an example
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier

def my_custom_loss_func(ground_truth, predictions):
    diff = np.abs(ground_truth - predictions).max()
    return np.log(1 + diff)

# With greater_is_better=False, the scorer negates the return value of
# my_custom_loss_func (so that the higher-is-better convention is followed).
# With greater_is_better=True, the return value is passed through unchanged.
loss  = make_scorer(my_custom_loss_func, greater_is_better=False)
score = make_scorer(my_custom_loss_func, greater_is_better=True)

# A scorer is called as scorer(estimator, X, y): it predicts from X and
# compares those predictions with y.
X = [[1], [1]]
y = [0, 1]

# The dummy classifier predicts a single class for both samples, so the
# maximum absolute difference from y is 1 and my_custom_loss_func returns
# np.log(2) = 0.693.
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf = clf.fit(X, y)

print(loss(clf, X, y))   # negated value
print(score(clf, X, y))  # unchanged value

-0.69314718056
0.69314718056
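
The "additional parameters" option can be sketched the same way. The snippet below is an illustration under assumptions, not part of the original notes: it forwards `beta=2` to `fbeta_score` and wraps `zero_one_loss` as a loss. The toy labels and the all-ones constant classifier are made up so the numbers can be checked by hand.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import fbeta_score, make_scorer, zero_one_loss

# Toy data (hypothetical): 4 positives, 2 negatives; features are irrelevant.
y = np.array([0, 1, 1, 0, 1, 1])
X = np.zeros((len(y), 1))

# A classifier that always predicts the positive class.
clf = DummyClassifier(strategy='constant', constant=1).fit(X, y)

# Extra keyword arguments (here beta=2) are forwarded to the metric function.
f2_scorer = make_scorer(fbeta_score, beta=2)
# A loss is wrapped with greater_is_better=False, so the scorer returns its negation.
neg_zero_one = make_scorer(zero_one_loss, greater_is_better=False)

f2 = f2_scorer(clf, X, y)           # precision 4/6, recall 1 -> F2 = 10/11
neg_loss = neg_zero_one(clf, X, y)  # 2 of 6 wrong -> -(1/3)
print(f2, neg_loss)
```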


# 3.3.2. Classification metrics (to measure classification performance)
- Some metrics might require probability estimates of the positive class, confidence values, or binary decision values.
- Most implementations allow each sample to provide a weighted contribution to the overall score, through the `sample_weight` parameter.
- See the full list here: http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
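
A small sketch of what `sample_weight` does, using `accuracy_score` on made-up labels and weights (the data below is purely illustrative):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]   # only the third sample is wrong
weights = [1, 1, 3, 1]  # hypothetical weights: the misclassified sample counts three times

unweighted = accuracy_score(y_true, y_pred)                       # 3 / 4
weighted = accuracy_score(y_true, y_pred, sample_weight=weights)  # (1 + 1 + 1) / 6

print(unweighted, weighted)
```

The weighted score is the total weight of correctly classified samples divided by the total weight, so up-weighting an error lowers the score.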

## 3.3.2.4. Classification report

In [11]:
# Example
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      1.00      1.00         2

avg / total       0.67      0.80      0.72         5
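
The per-class numbers in the report can be recovered from the confusion matrix. The following is a sketch of that calculation, not how `classification_report` is implemented internally; treating an undefined precision as 0 is an assumption that happens to match the `0.00` shown for class 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

tp = np.diag(cm).astype(float)  # correctly classified samples per class
predicted = cm.sum(axis=0)      # how often each class was predicted
actual = cm.sum(axis=1)         # support: how often each class truly occurs

# Class 1 is never predicted, so its precision is 0/0; report 0 in that case.
precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)

print(cm)
print(precision, recall, actual)
```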





## Examples for binary classification

In [22]:
from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
myprint(metrics.precision_score(y_true, y_pred))
myprint(metrics.recall_score(y_true, y_pred))
myprint(metrics.f1_score(y_true, y_pred))  
myprint(metrics.fbeta_score(y_true, y_pred, beta=0.5))  
myprint(metrics.fbeta_score(y_true, y_pred, beta=1))  
myprint(metrics.fbeta_score(y_true, y_pred, beta=2)) 
myprint(metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5))



import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, threshold = precision_recall_curve(y_true, y_scores)

print("\nprecision={}\nrecall={}\nthreshold={}".format(precision, recall, threshold))

average_precision_score(y_true, y_scores)  

metrics.precision_score(y_true, y_pred) = 1.0
metrics.recall_score(y_true, y_pred) = 0.5
metrics.f1_score(y_true, y_pred) = 0.666666666667
metrics.fbeta_score(y_true, y_pred, beta=0.5) = 0.833333333333
metrics.fbeta_score(y_true, y_pred, beta=1) = 0.666666666667
metrics.fbeta_score(y_true, y_pred, beta=2) = 0.555555555556
metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5) = (array([ 0.66666667,  1.        ]), array([ 1. ,  0.5]), array([ 0.71428571,  0.83333333]), array([2, 2]))

precision=[ 0.66666667  0.5         1.          1.        ]
recall=[ 1.   0.5  0.5  0. ]
threshold=[ 0.35  0.4   0.8 ]


0.79166666666666663
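
The F-beta values above follow from precision and recall through F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). As a sanity check, here is a sketch that recomputes them by hand; `fbeta_from_pr` is a helper written for this illustration, not a scikit-learn function.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

def fbeta_from_pr(precision, recall, beta):
    """F-beta from precision and recall (hypothetical helper for illustration)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid 0/0 when both are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]

p = precision_score(y_true, y_pred)  # 1 true positive, 0 false positives
r = recall_score(y_true, y_pred)     # 1 of the 2 positives is found

manual = {beta: fbeta_from_pr(p, r, beta) for beta in (0.5, 1, 2)}
print(manual)
```

With beta < 1 precision counts for more, with beta > 1 recall counts for more, which is why the score drops as beta grows in this high-precision, low-recall example.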

## 3.3.2.11. Receiver operating characteristic (ROC)

In [30]:
import numpy as np
from sklearn.metrics import roc_curve
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
myprint(fpr)
myprint(tpr)
myprint(thresholds)

fpr = [ 0.   0.5  0.5  1. ]
tpr = [ 0.5  0.5  1.   1. ]
thresholds = [ 0.8   0.4   0.35  0.1 ]


**AUC SCORE**

In [31]:
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)

0.75
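
The AUC value can be cross-checked in two ways: as the area under the curve returned by `roc_curve` (via `sklearn.metrics.auc`), and as the fraction of positive/negative pairs that the scores rank correctly. The pairwise computation below is a sketch written for these notes, with ties counted as half.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# 1. Trapezoidal area under the ROC curve.
fpr, tpr, _ = roc_curve(y_true, y_scores)
area_trapezoid = auc(fpr, tpr)

# 2. Ranking interpretation: share of (positive, negative) pairs where the
#    positive sample receives the higher score (ties count as 0.5).
pos = y_scores[y_true == 1]
neg = y_scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
area_ranking = np.mean((diff > 0) + 0.5 * (diff == 0))

area_direct = roc_auc_score(y_true, y_scores)
print(area_trapezoid, area_ranking, area_direct)
```

Here three of the four pairs are ordered correctly (only 0.35 vs 0.4 is not), which gives 0.75.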

# 3.3.6. Dummy estimators
- When doing supervised learning, a simple sanity check consists of comparing one’s estimator against simple rules of thumb. 
- [DummyClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier) implements four such simple strategies for classification:
    1. `stratified` generates random predictions by respecting the training set class distribution.
    2. `most_frequent` always predicts the most frequent label in the training set.
    3. `uniform` generates predictions uniformly at random.
    4. `constant` always predicts a constant label that is provided by the user.
        - A major motivation of the `constant` strategy is F1-scoring when the positive class is in the minority.

**Note that with all these strategies, the predict method completely ignores the input data!**
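
To make the F1-scoring motivation concrete, here is a sketch on made-up imbalanced labels. It assumes a scikit-learn version whose `f1_score` accepts `zero_division` (0.22 or later); the data and the `results` dictionary are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: 8 negatives (0) and 2 positives (1).
y = np.array([0] * 8 + [1] * 2)
X = np.zeros((len(y), 1))  # features do not matter: dummy estimators ignore them

results = {}
for name, clf in [
    ('most_frequent', DummyClassifier(strategy='most_frequent')),
    ('constant=1', DummyClassifier(strategy='constant', constant=1)),
]:
    y_pred = clf.fit(X, y).predict(X)
    results[name] = (
        accuracy_score(y, y_pred),
        # zero_division=0 silences the warning when no positives are predicted.
        f1_score(y, y_pred, zero_division=0),
    )
    print(name, results[name])
```

`most_frequent` looks good on accuracy (0.8) but has an F1 of 0, while always predicting the minority class gives a non-trivial F1 baseline (1/3) that a real model has to beat.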

[DummyRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html#sklearn.dummy.DummyRegressor) also implements four simple rules of thumb for regression:
1. **mean** always predicts the mean of the training targets.
2. **median** always predicts the median of the training targets.
3. **quantile** always predicts a user-provided quantile of the training targets.
4. **constant** always predicts a constant value that is provided by the user.
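
A short sketch of the four `DummyRegressor` strategies on made-up targets (the values and the `X_new` inputs are arbitrary; the inputs are deliberately unrelated to the training data to show that they are ignored):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

X_train = np.zeros((4, 1))
y_train = np.array([1.0, 2.0, 3.0, 10.0])
X_new = np.array([[100.0], [-5.0]])  # arbitrary inputs; predictions do not depend on them

regressors = {
    'mean': DummyRegressor(strategy='mean'),
    'median': DummyRegressor(strategy='median'),
    'quantile': DummyRegressor(strategy='quantile', quantile=0.5),
    'constant': DummyRegressor(strategy='constant', constant=0.0),
}

predictions = {
    name: reg.fit(X_train, y_train).predict(X_new)
    for name, reg in regressors.items()
}
for name, pred in predictions.items():
    print(name, pred)
```

With `quantile=0.5` the quantile strategy coincides with the median, and the single large target (10.0) pulls the mean well above it.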

In [None]:
# To illustrate DummyClassifier, first let’s create an imbalanced dataset:
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
y[y != 1] = -1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [34]:
# Next, let’s compare the accuracy of SVC and most_frequent:
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
myprint(clf.score(X_test, y_test))

clf.score(X_test, y_test) = 0.631578947368


In [35]:
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf.fit(X_train, y_train)

myprint(clf.score(X_test, y_test))

clf.score(X_test, y_test) = 0.578947368421


In [36]:
# We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:
clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
myprint(clf.score(X_test, y_test) )

clf.score(X_test, y_test)  = 0.973684210526
