#  Model Selection and Performance Evaluation

**What is a good estimator?** 

<div class= "alert alert-warning">
Model Selection refers to the process of tuning parameters and hyper-parameters in order to find a good machine learning estimator. 
</div>

It requires two things:

- Selection of an error measure
- Definition of a test procedure that measures the generalisation performance over unseen examples


## Error Measures

### Accuracy
The most basic measure for a binary classification task is the accuracy, i.e. the fraction of examples that are classified correctly:

$$
accuracy = \frac{|\{x_i \in X : h(x_i) = y_i\}|}{|X|}
$$

where $h(x_i)$ is the label predicted by the estimator for example $x_i$ and $y_i$ is the true label of $x_i$.


For classifiers, `sklearn` provides a `score` method that calculates this measure:

In [2]:
#svc.score(iris_X_test,iris_y_test)
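To make the formula concrete, here is a minimal sketch that computes the accuracy by hand. The labels in `y_true` and `y_pred` are made up, and the helper name `accuracy` is an illustrative choice, not part of the original notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for y, h in zip(y_true, y_pred) if y == h)
    return correct / len(y_true)


# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]

# 4 of the 6 predictions agree with the true labels
print(accuracy(y_true, y_pred))
```

For a fitted scikit-learn classifier, `score(X, y)` returns this same quantity on the given data.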

Accuracy is often not an ideal measure, for several reasons:

- For unbalanced data sets, i.e. far more negative (positive) examples than positive (negative) ones, it does not reflect the real performance of an estimator.
- For multilabel data, it is unclear how to evaluate the equality of the predicted labels and the true labels.
- It does not provide any insight into whether the estimator makes errors due to accepting too many examples for a concept or rejecting too many of them.
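The first point can be illustrated with a small sketch. It assumes a made-up data set with 95 negative and 5 positive examples and a trivial "estimator" that rejects everything; both are hypothetical and only serve to show the effect.

```python
# Made-up, heavily unbalanced labels: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5

# A trivial "estimator" that assigns every example to the negative class
y_pred = [0] * len(y_true)

accuracy = sum(y == h for y, h in zip(y_true, y_pred)) / len(y_true)
positives_found = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 1)

print("accuracy:", accuracy)                # 0.95
print("positives found:", positives_found)  # 0
```

The accuracy looks high although not a single positive example is recognised.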

### Precision and Recall

Precision and Recall are two measures from Information Retrieval that give insight into the kind of errors an estimator makes. They are built on the concept of a contingency table.

**Precision** defines the **accuracy** of an estimator **wrt. a certain classification decision**

**Recall** defines the **completeness** of an estimator **wrt. a certain classification decision**

To illustrate this, we introduce the concept of a contingency table.

### Contingency Table


**Formal definitions**

- estimator $h(x)$
- examples $X=\{\langle x_1,Y_1\rangle,\ldots,\langle x_n,Y_n\rangle\}$, with $Y_i \subseteq Y$ and $Y$ being the **set of possible classes** an example can belong to (in the single-label case, $Y_i=\{y_i\}$)

For every class $a\in Y$ we define the contingency table as:

|Contingency Table for Class $a$| $y_i == a$ | $y_i \neq a$|
|---|---|---|
|$h(x_i)==a$|True Positive (TP)|False Positive (FP)|
|$h(x_i)\neq a$|False Negative (FN)|True Negative (TN)|

- `{False|True}` reflects whether the outcome was correct or not
- `{Positive|Negative}` reflects whether the decision was to assign it to $a$ (positive) or not (negative)
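The four cells of the table can be counted directly from the true and predicted labels. The following is a minimal sketch for the single-label case; the function name `contingency_counts` and the animal labels are made up for illustration.

```python
def contingency_counts(y_true, y_pred, a):
    """Return (TP, FP, FN, TN) for class `a` in a single-label setting."""
    tp = fp = fn = tn = 0
    for y, h in zip(y_true, y_pred):
        if h == a and y == a:
            tp += 1      # assigned to a, and correctly so
        elif h == a and y != a:
            fp += 1      # assigned to a, but belongs elsewhere
        elif h != a and y == a:
            fn += 1      # belongs to a, but was not assigned to it
        else:
            tn += 1      # neither assigned to a nor belonging to a
    return tp, fp, fn, tn


# Made-up labels, for illustration only
y_true = ["cat", "dog", "cat", "bird", "dog", "cat"]
y_pred = ["cat", "cat", "cat", "bird", "dog", "dog"]

for a in sorted(set(y_true)):
    print(a, contingency_counts(y_true, y_pred, a))
```

Each class gets its own table, and the four counts of every table sum to the number of examples.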



**Precision** now estimates the probability that an assignment made by $h$ is correct:

$$
Precision_a = \frac{TP_a}{TP_a+FP_a}
$$

**Recall** now estimates the probability that an example belonging to class $a$ is found by $h$:

$$
Recall_a = \frac{TP_a}{TP_a+FN_a}
$$

**F1** defines a combined measure, the harmonic mean of Precision and Recall:

$$
F1_a = \frac{2 \cdot Precision_a \cdot Recall_a}{Precision_a + Recall_a}
$$
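As a worked example, the three formulas can be evaluated for hypothetical counts of one class. The counts below are made up, and returning `0.0` when a denominator is zero is a convention chosen for this sketch, not something the definitions prescribe.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 by convention if nothing was assigned to the class."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """TP / (TP + FN); 0.0 by convention if the class has no examples."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical counts for one class a
tp, fp, fn = 8, 2, 4

p = precision(tp, fp)  # 8 / 10
r = recall(tp, fn)     # 8 / 12
print("precision %.3f | recall %.3f | F1 %.3f" % (p, r, f1(p, r)))
```

Because the harmonic mean is dominated by the smaller value, F1 is only high if both Precision and Recall are high.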


#### Microaveraging vs. Macroaveraging

Precision and Recall measure the quality of an estimator with respect to a specific class $a$.

Given multiple classes $a\in Y$, there are two ways of combining the per-class values into a single measure:

- **Microaveraging** sums over all single decisions, i.e.

$$
Precision^\mu = \frac{\sum_{a\in Y}TP_a}{\sum_{a\in Y}TP_a+\sum_{a\in Y}FP_a}
$$

- **Macroaveraging** averages over the per-class measures, i.e.

$$
Precision^\nu = \frac{1}{|Y|}\sum_{a\in Y}Precision_a
$$


<div class="alert alert-info">
Note that in the case of microaveraging in a single-label classification:
$Precision^\mu = Recall^\mu = Accuracy$.

The reason is that every misclassified example counts as an FN for its true class and as an FP for the predicted class, so the total number of FNs equals the total number of FPs.
</div>
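The difference between the two averages, and the identity stated in the note, can be checked on a small example. This is a sketch on made-up single-label data; all variable names are illustrative.

```python
# Made-up single-label data, for illustration only
y_true = ["cat", "dog", "cat", "bird", "dog", "cat"]
y_pred = ["cat", "cat", "cat", "bird", "dog", "dog"]
classes = sorted(set(y_true) | set(y_pred))
pairs = list(zip(y_true, y_pred))

# Per-class counts taken from the contingency table of each class
tp = {a: sum(y == a and h == a for y, h in pairs) for a in classes}
fp = {a: sum(y != a and h == a for y, h in pairs) for a in classes}
fn = {a: sum(y == a and h != a for y, h in pairs) for a in classes}

# Microaveraging: pool the counts of all classes, then divide once
micro_precision = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
micro_recall = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))

# Macroaveraging: compute precision per class, then average.
# Every class is predicted at least once here, so no denominator is zero.
macro_precision = sum(tp[a] / (tp[a] + fp[a]) for a in classes) / len(classes)

accuracy = sum(y == h for y, h in pairs) / len(pairs)

print("micro precision %.3f | micro recall %.3f | accuracy %.3f"
      % (micro_precision, micro_recall, accuracy))
print("macro precision %.3f" % macro_precision)
```

Macroaveraging gives every class the same weight, so small classes influence the result as much as large ones, whereas microaveraging weights every single decision equally.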


Further details can be found in [1] and [2].

[1] http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html

[2] http://scikit-learn.org/stable/auto_examples/plot_precision_recall.html


## Test Procedure and Parameter Estimation


In the examples above we have always used a fraction of the examples for training and the remaining examples for testing.

- Why?
- What is the impact of a random split?
- Can we do better?



Note that we want to generalize beyond our given data. So we need to split the data into a training and a test set to get an unbiased estimate of the model quality on **unseen examples**.

A single **random split** can, by chance, be unusually easy or unusually hard, so the score measured on it may be too optimistic or too pessimistic. We would need several random splits to average out this effect.

A better, more efficient method is called **cross-validation**.
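How much a score depends on the split can be observed by repeating the random split with different seeds. The following sketch assumes a scikit-learn version that provides `sklearn.model_selection.train_test_split`; the 70/30 split, the ten seeds and the `max_iter` setting are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

scores = []
for seed in range(10):
    # A different seed yields a different random 70/30 split
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

print("min %.3f | max %.3f | mean %.3f | std %.3f"
      % (min(scores), max(scores), np.mean(scores), np.std(scores)))
```

The spread between the minimum and the maximum shows how misleading a single split can be.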


### k-fold Cross-Validation

Given our examples $X$, we split them into $k$ folds of approximately equal size.

1. For every fold $i \in \{1,\ldots,k\}$ we

    1. Fit our model on all examples except those in the $i^{th}$ fold
    2. Estimate the quality of our model on the examples from the $i^{th}$ fold

2. Average the obtained quality indicators (e.g. precision/recall) over all $k$ folds
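The splitting step of the procedure can be sketched in plain Python. The generator `k_fold_indices` is a hypothetical helper written for this illustration; it uses contiguous folds without shuffling and gives the first folds one extra example when $n$ is not divisible by $k$.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over range(n)."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        # The first `extra` folds get one additional example
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


for train, test in k_fold_indices(7, 3):
    print("train:", train, "| test:", test)
```

Every index appears in exactly one test fold, so each example is used for testing exactly once.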




<div class="alert alert-info">
When averaging over quality indicators like accuracy, always investigate the standard deviation in addition to the average value. The standard deviation tells you how robust your estimated quality indicator is wrt. different runs.
</div>



In [3]:
# sklearn provides cross-validation splitters that generate train/test indices
import numpy as np
from sklearn.model_selection import KFold

k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(np.arange(6)):
    print('Train: %s | test: %s' % (train_indices, test_indices))

Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]




#### Cross-Validation on the iris data set

In [7]:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import KFold

iris = datasets.load_iris()

# 8 contiguous folds, taken in the original order of the data (no shuffling)
k_fold = KFold(n_splits=8)

scores = []
for train_indices, test_indices in k_fold.split(iris.data):
    # Fit on all folds but one, then score on the held-out fold
    logistic = linear_model.LogisticRegression(C=1e5)
    logistic.fit(iris.data[train_indices], iris.target[train_indices])
    scores.append(logistic.score(iris.data[test_indices],
                                 iris.target[test_indices]))
    print('Score: %f' % scores[-1])

print("Average model accuracy %f +/- %f" % (np.mean(scores), np.std(scores)))

Score: 1.000000
Score: 1.000000
Score: 1.000000
Score: 0.947368
Score: 0.947368
Score: 0.947368
Score: 0.944444
Score: 0.888889
Average model accuracy 0.959430 +/- 0.036357
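The manual loop above can also be written more compactly with `cross_val_score`. This is a sketch that assumes a scikit-learn version providing `sklearn.model_selection`; the `max_iter` setting is an arbitrary choice made here, so the scores need not match the output shown above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()
clf = LogisticRegression(max_iter=1000)

# One accuracy value per fold; the estimator is refitted for every fold
scores = cross_val_score(clf, iris.data, iris.target, cv=KFold(n_splits=8))

print("Average model accuracy %f +/- %f" % (np.mean(scores), np.std(scores)))
```

Passing a splitter object as `cv` keeps the fold definition explicit, which makes it easy to swap in a different splitting strategy later.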


**Some Questions**

- Run the above code with $n=3$. What is wrong and why?
- Research the concept of stratified cross validation
- What is the highest possible $k$?
- Research the concepts of leave-one-out testing and leave-one-label-out testing

Try to answer them when working on the next exercise.
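As a starting point for these questions, the following sketch shows how the class distribution of each test fold can be inspected and how many splits leave-one-out testing produces. It assumes a scikit-learn version providing `StratifiedKFold` and `LeaveOneOut` in `sklearn.model_selection`; the choice of three folds is arbitrary.

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import LeaveOneOut, StratifiedKFold

iris = datasets.load_iris()

# Stratified folds try to preserve the class proportions in every fold
skf = StratifiedKFold(n_splits=3)
for train_indices, test_indices in skf.split(iris.data, iris.target):
    counts = Counter(iris.target[test_indices].tolist())
    print("test fold class counts:", sorted(counts.items()))

# Leave-one-out uses every single example as its own test fold
loo = LeaveOneOut()
print("leave-one-out splits:", loo.get_n_splits(iris.data))
```

Running the same inspection with a plain `KFold` splitter makes the differences between the strategies visible.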

# Exercise: Text Classification with scikit-learn
Load the 20 Newsgroups data set via the `sklearn.datasets.fetch_20newsgroups` function, which downloads it on first use. The data set contains 20 newsgroups with roughly 1000 posts each and comes with a predefined training/test split.

Conduct the following tasks:
1. Preprocess the data set using a scikit-learn vectorizer
2. Train two classifiers of your choice on the training set
3. Tune the parameters of those classifiers to obtain the best results
4. Evaluate the performance on the test set and compare both classifiers

#### Solution

The solution can be found at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
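The overall workflow of the four tasks can be sketched on a tiny made-up corpus, so that it runs without downloading the data set. The documents, labels, the choice of `MultinomialNB` and the parameter grid are all assumptions for illustration; for the exercise, replace the toy lists with the data returned by `fetch_20newsgroups` and repeat the procedure for a second classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus standing in for the newsgroup posts
train_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "the satellite orbits the earth",
    "astronauts aboard the space station",
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players",
    "fans cheered at the stadium",
]
train_labels = ["space"] * 4 + ["sport"] * 4

# Tasks 1 and 2: vectorizer and classifier combined in one pipeline
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Task 3: tune a parameter with cross-validation on the training data only
search = GridSearchCV(pipeline, {"clf__alpha": [0.1, 1.0]}, cv=2)
search.fit(train_docs, train_labels)

# Task 4: apply the tuned model to held-out documents
test_docs = ["a new mission to orbit the moon", "the players scored a goal"]
predictions = search.predict(test_docs)

print("best parameters:", search.best_params_)
print("predictions:", list(predictions))
```

Wrapping the vectorizer and the classifier in one pipeline ensures that the vocabulary is learned only from the training part of each cross-validation fold.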

# References and Further Reading

- Tom Mitchell, Machine Learning, McGraw-Hill 1997 [chapter slides](http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html)
- [Scikit learn](http://scikit-learn.org/stable/)