Note: This is incomplete. I am having trouble calculating the F1 using `precision_recall_fscore_support` for kfold.
I think this is because when using kfold, the crosstab does not yield categories with 0 counts. 

# Unit 4 Lesson 8: Evaluating classifier performance

#### Estimated time: 5 - 6 hours

In this lesson, we're going to dive a little deeper into evaluating estimator performance. We're familiar with some common measure of estimator performance like accuracy, precision, recall, and F1 scores.

## Unit 4 Lesson 8 Assignment 1: Evaluating Classifier Performance Overview

Throughout the unit we've been splitting our data into training, test, and validation sets. Let's take a moment and discuss why this is necessary. By now you can probably see that learning an estimator and testing that estimator's performance on the same data is a methodological mistake. It's like if a professor administered a test with the exact same questions as the practice test. All a student would have to do to get 100% would be to memorize all the solutions to the practice test; they wouldn't actually have to learn anything. If you test your estimator on the data used to train it, it knows all the answers, and thus can achieve a perfect score, even though it very well could fail to predict anything on new data. This is called overfitting. Predicting on never-before-seen data is kind of the whole point, so knowing how our estimator performs on data it has already seen isn't really useful.

Holding out a subset of your data for testing, i.e., excluding a subset of your data from your training set, gives you some never-before-seen data to test your estimator's performance. The scikit-learn library has a train_test_split helper function to randomly split data into training and test sets.

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and we can't make claims about how it will generalize (i.e., how it will perform) on never-before-seen data.

To resolve this problem, we can hold out yet another subset of our data for validation. Training proceeds on the training set, evaluation is done on the validation set, and when it seems like we have a good model, we can perform our final evaluation on the test set.

## Comparing Classifiers

| Classifier      | Advantages              | Shortcomings           | Uses                                  |
| --------------- | ----------------------- | -----------------------|---------------------------------------|
|Regression 	  | Fast, Easy to interpret |	Accuracy is average  |	Predicting continuous numeric values |
| Decision Trees  |	Fast, Easy to understand|	Accuracy is average, Cannot handle large sets of predictors |	Used when "explaining" the decisions is important |
| Random Forests |	Better accuracy, Scalable |	Slow model building, complex |	Classification problems with large number of predictors |
| Naive Bayes |	Simple, Works with less data | Accuracy, Missing data handling |	Text classification |
| K-Nearest Neighbors |	Simple, No model building step, less computational costs |	Higher storage requirements during prediction, Cannot handle large datasets |	Stock market forecasting, Credit rating,Customer profiling |
| Support Vector Machines |	Kernel method, works well in fuzzy situations |	Not easy to explain predictions |	Face recognition, handwriting recognition |

## Try it!

In [207]:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.cross_validation import KFold
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_recall_fscore_support

### Use the `cross_validation.train_test_split()` helper function to split the Iris dataset into training and test sets, holding out 40% of the data for testing. 

In [208]:
iris = pd.read_csv('/home/MZ/Documents/CODE/THINKFUL/projects/kaggle/ARCHIVE/iris_wiki.csv')
feature_names = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width']
features = iris.columns[:4] # Index(['Sepal length', 'Sepal width', 'Petal length', 'Petal width'], dtype='object')

y, _ = pd.factorize(iris['Species']) # Encode input values as an enumerated type or categorical variable
iris["y"] = y
train, test = train_test_split(iris, test_size=0.4)

### How many points do you have in your training set? In your test set?

In [209]:
print(len(train), len(test))

90 60


### Fit a linear Support Vector Classifier to the training set and evaluate its performance on the test set.

In [210]:
clf = svm.SVC()
clf.fit(train[features], train["y"])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [211]:
target_names = np.array(['setosa', 'versicolor', 'virginica'])
preds = target_names[clf.predict(test[features])] # array(['setosa', 'versicolor', 'setosa', 'setosa', 'setos...
pd.crosstab(test['Species'], preds)

col_0,setosa,versicolor,virginica
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
setosa,23,0,0
versicolor,0,19,0
virginica,0,0,18


### What is the score?

In [213]:
# target_names[y]
# preds
precision_recall_fscore_support(target_names[test["y"]], preds, average='weighted')

(1.0, 1.0, 1.0, None)

### ...How does it compare to the score in the Support Vector Machine lesson?

# Unit 4 Lesson 8 Project 2: Cross Validation
### Estimated Time: 1 - 2 hours
The more data we set aside for testing and validation, the less data we have for training, and this will negatively impact estimator performance. To resolve this problem, we can use cross validation (see lesson 4.1.5) to "recycle" data over different folds. In this assignment, we're going to implement 5-fold cross-validation on the Iris dataset to train and test a Support Vector Machine classifier.
Try it!

### Compute the 5-fold cross-validation score of the SVC from the last assignment [in this unit].

In [260]:
def f1(ct, preds):
    '''Returns the F1 score given the confusion matrix generated with a set of predictions'''
#     index = list(pd.crosstab(iris['y'], preds))
    total = 0
    for i in list(ct):
        total += sum(list(ct[i]))

    true_positives, false_positives, true_negatives, false_negatives = 0, 0, 0, 0

    for i in zip(list(ct), range(1, len(ct)+1)): 
        row = list(ct.ix[i[0]])
        column = list(ct[i[0]])
    
        tp = row[i[1]-1]
        true_positives  += tp
        false_negatives += sum(row)-tp
        false_positives += sum(column) - tp
        true_negatives = total - tp
    
    return float(2*true_positives) / (2 * true_positives + false_positives + false_negatives)

In [262]:
# Code adapted from u4_l1_p5

k = 5
n = len(iris)
kf = KFold(n, n_folds=k) 

clf = svm.SVC()
target_names_numeric = np.array([0, 1, 2])

for train_index, test_index in kf:
    train, test = iris.ix[train_index], iris.ix[test_index]
    clf.fit(train[features], train["y"])
    preds = target_names[clf.predict(test[features])] # array(['setosa', 'versicolor', 'setosa', 'setosa', 'setos...
    
    ct = pd.crosstab(test['Species'], preds)
    print(f1(ct, preds))
    
    print()

1.0

1.0

0.928571428571

0.966666666667



KeyError: 'versicolor'

### Compute the mean score and the standard deviation of the scores.

As the sklean documentation notes, the default score computed at each cross-validation iteration is the estimator's accuracy. We could tell it to return the F1 score, precision, or recall instead.


### How do the accuracy scores compare to the F1 scores for this dataset?

For more detail on the different scoring parameters, see The scoring parameter: defining model evaluation rules.
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

## Submission

Submit your code as "cv.py" via the link below.

# Unit 4 Lesson 8 Assignment 3: Discussion
### Estimated Time 1 - 2 hours

"How do I know if my model fits the data well?"

This is an important question that you should be asking yourself. A lot. And while it may feel like a silly thing to ask, evaluating the fit of a model is not as straightforward as it may seem. Many evaluation metrics have been developed to assess the fit of models, but it's important to keep in mind that each of these metrics was designed with a subset of models in mind, and probably evaluates different aspects of a model than another metric as a result.

A really good example of this are the metrics used to evaluate the fit of binary classification models. One way of evaluating the fit of a binary model is with a contingency table. Contingency tables break up the predicted classes against the actual classes like so:

| type |	predict 0 |	predicted 1 |
| ---- | ------------ | ----------- |
| actual 0 |	2 |	1 |
| actual 1 |	2 |	1 |

In this example contingency table, 3 observations were correctly predicted and 3 observations were incorrectly predicted. This gives us a percentage correctly predicted (PCP) score (another metric for evaluating binary classifiers) of 50%. As you might recall from the Classification, Regression, and Prediction lesson, that these metrics depend on the probability threshold we set to coerce continuous predictions to categorical predictions. If we change that threshold, our contingency table and out PCP score could change too. Additionally, we can lose information about the continuous predictions of the model itself. There's a difference between classifying an observation as 1 because the predicted probability was 0.51 or because the predicted probability was 0.99, but there's no distinction between these two cases in a contingency table or a PCP score.

A Receiver Operating Characteristic (ROC) curve is another way of evaluating the fit of a binary model. The ROC curve shows, graphically, the trade off between false-positives and false-negatives as the probability threshold for classifying observations is varied. The overall predictive power of a model is captured in the area under the ROC curve, termed the AUC ("Area Under Curve") score, and falls somewhere between 0 and 1. A model that performs no better than a coin toss would have an AUC score of 0.5. The limitation of the ROC curve in evaluating a model is that the shape of the curve doesn't tell us a lot about our model. All we have to go on is the area underneath it. How would we compare a ROC curve that's high on the left side with a ROC curve that's high on the right side if they both have the same AUC?

### Do a little research on Brier Scores, Expected Percentage of Correct Predictions, and Pseudo R2 measures. What features of binary model fit do they capture? What do they miss?


The best way to evaluate a model isn't written in stone anywhere, but computing multiple metrics tells you more than one single metric will. Computing these metrics is part of the practice of performing model diagnostics, and is useful not only for you as you're testing hypotheses and tuning models, but also for others to see your iterative modeling process.

# Reference / scaffolding

### Reference: code from # Code adapted from u4_l1_p5

In [None]:
    # TRAIN
    it_train = pd.Series(intrate).iloc[min(train):max(train)] # interest rate
    la_train = loanamt.iloc[min(train):max(train)] # loan amount
    fi_train = fico.iloc[min(train):max(train)] # fico

    x1_train = np.matrix(fi_train).transpose()     # fico is a series; transpose converts vertically
    x2_train = np.matrix(la_train).transpose()

    train_X = np.column_stack([x1_train,x2_train])
    train_y = np.matrix(it_train).transpose()

    X_train = sm.add_constant(train_X)

    model = sm.OLS(train_y,X_train)
    f = model.fit()   
    # print f.summary()    

    # TEST 
    it_test = pd.Series(intrate).iloc[min(test):max(test)]
    la_test = loanamt.iloc[min(test):max(test)]
    fi_test = fico.iloc[min(test):max(test)]

    x1_test = np.matrix(fi_test).transpose()
    x2_test = np.matrix(la_test).transpose()

    test_X = np.column_stack([x1_test,x2_test])

    X_test = sm.add_constant(test_X)    

    y_actual = test_y = np.matrix(intrate).transpose()[test[0]:test[-1]] # is this right?

    y_predicted = f.predict(X_test)
    
    mse += mean_squared_error(y_actual, y_predicted) 
    mas += mean_absolute_error(y_actual, y_predicted)
    r2  += r2_score(y_actual, y_predicted)

In [125]:
y_true = np.array(['cat', 'dog', 'pig', 'cat', 'dog', 'pig'])
y_pred = np.array(['cat', 'pig', 'dog', 'cat', 'cat', 'dog'])
print(precision_recall_fscore_support(y_true, y_pred, average='macro'))

(0.22222222222222221, 0.33333333333333331, 0.26666666666666666, None)
