# Model Performance

Get ready to put on your data science and ML engineering hats! In the next few lessons, you are going to learn how to evaluate the performance of ML models. Recall that after you train an ML model on past data, you can use that model to make predictions on new or previously unseen data. But how do you know if that model is useful? In ML, when you hear the phrase "model performance", I want you to think about evaluating the quality of a model's predictions, commonly referred to as its **forecast skill** or **prediction skill**. This first lesson will focus on evaluating predictive models in the context of supervised learning, including both classification and regression problems.

## Classification Accuracy and Error Rate

Classification is about predicting a label, typically a discrete value. For example, an image of an animal may be classified as being a picture of a "cat" or "dog". There are many ways to measure the prediction skill of a classification model, but **accuracy** and **error rate** are the de facto standard.

### Accuracy
Accuracy is the ratio of the correct predictions to the total number of predictions made.
* Accuracy = Correct Predictions / Total Predictions

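An accuracy of 90% or above is often treated as a good result, and it is common practice to aim for that level. What counts as good still depends on the problem: how hard the task is, how balanced the classes are, and how well a trivial baseline already performs.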

### Error Rate
You can also summarize model performance in terms of the error rate.
* Error Rate = Incorrect Predictions / Total Predictions

Accuracy and error rates are complements of each other and therefore you can calculate one from the other as follows:
* Accuracy = 1 - Error Rate
* Error Rate = 1 - Accuracy

Consider a classifier that labels pictures as either cats or dogs and that, when tested on 12 pictures (8 cats and 4 dogs), produces the following results:
* 9 Correct Predictions   = (9/12) = 0.75 
* 3 Incorrect Predictions = (3/12) = 0.25
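
The arithmetic above can be sketched in a few lines of plain Python. This is only an illustration of the two formulas; the variable names are chosen for this sketch and are not part of any library.

```python
# Counts taken from the cat/dog example: 12 test pictures, 9 predicted correctly.
total_predictions = 12
correct_predictions = 9
incorrect_predictions = total_predictions - correct_predictions

accuracy = correct_predictions / total_predictions
error_rate = incorrect_predictions / total_predictions

print(f"Accuracy:   {accuracy:.2f}")    # 0.75
print(f"Error rate: {error_rate:.2f}")  # 0.25

# The two metrics are complements, so either one determines the other.
assert accuracy == 1 - error_rate
```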

Knowing that the classifier has an accuracy of 0.75, or 75%, does not provide any insight into where the classifier is performing poorly.

Is it mistaking cats for dogs more often, dogs for cats more often, or both about equally?

This is where a **confusion matrix** may prove useful. 

### Confusion Matrix
A confusion matrix, or table of confusion, allows you to easily visualize classification performance and see where the model may be confusing two or more classes. 

<img src="img/confusion_matrix.png" width="200">

In this confusion matrix, of the 8 cat pictures, the model predicted that 2 were dogs, and of the 4 dog pictures, it predicted that 1 was a cat. All correct predictions are located in the diagonal of the table (highlighted in bold), so it is easy to visually inspect the table for prediction errors, as they are represented by values outside the diagonal.
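
To make the layout of the table concrete, here is a minimal sketch that tallies the same matrix by hand with the standard library. The label lists are reconstructed from the counts described above (their order is an assumption), and the variable names are illustrative only.

```python
from collections import Counter

# Labels reconstructed from the example: 8 cats (2 mistaken for dogs)
# and 4 dogs (1 mistaken for a cat).
actual = ["cat"] * 8 + ["dog"] * 4
predicted = ["dog"] * 2 + ["cat"] * 6 + ["dog"] * 3 + ["cat"]

# Count how often each (actual, predicted) pair occurs.
pair_counts = Counter(zip(actual, predicted))

classes = ["cat", "dog"]
# Rows are actual classes, columns are predicted classes.
matrix = [[pair_counts[(a, p)] for p in classes] for a in classes]

print("rows = actual, columns = predicted:", classes)
for label, row in zip(classes, matrix):
    print(label, row)

# Correct predictions sit on the diagonal; everything else is an error.
correct = sum(matrix[i][i] for i in range(len(classes)))
errors = sum(map(sum, matrix)) - correct
```

The off-diagonal cells answer the earlier question directly: in this example the model mistakes cats for dogs twice and dogs for cats once.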

## Classification Metrics in Scikit-learn: A First Look

Scikit-learn is a free ML library for the Python programming language. It has 3 different programming interfaces for evaluating the quality of a model's predictions:

* Estimator Score Method
* Scoring Parameter
* Metrics Functions
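
As a rough illustration of how the three interfaces differ, the sketch below measures accuracy in each of the three ways. The synthetic data and the choice of `LogisticRegression` are assumptions made only for this example; they are not part of the lesson's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; not part of the lesson's dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression().fit(X_train, y_train)

# 1. Estimator score method: classifiers report mean accuracy by default.
score_from_estimator = clf.score(X_test, y_test)

# 2. Scoring parameter: model-selection tools accept a metric name.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")

# 3. Metrics functions: compare true labels with predictions directly.
score_from_metric = accuracy_score(y_test, clf.predict(X_test))

print(score_from_estimator, score_from_metric)
print(cv_scores)
```

For a classifier, the first and third approaches return the same number, because `score` computes accuracy on the data it is given.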

In this lesson, you'll get hands-on experience using the scikit-learn metrics functions to measure prediction skill in the context of the cat or dog classification example. 

Start by importing the scikit-learn metrics module:

In [14]:
from sklearn import metrics

Assume that the actual and predicted values from the example are defined as follows, where cats belong to the class 0 and dogs belong to the class 1.

In [15]:
actual_values = [0,0,0,0,0,0,0,0,1,1,1,1]
predictions =   [1,1,0,0,0,0,0,0,1,1,1,0]

Now you can use the metrics functions to calculate the accuracy and error rate and to print the confusion matrix.

In [19]:
# Compute accuracy once and derive the error rate from it.
accuracy = metrics.accuracy_score(actual_values, predictions)

print(f'Accuracy: {accuracy * 100} % ')
print(f'Error Rate: {(1 - accuracy) * 100} % ')

# Rows are actual classes and columns are predicted classes (0 = cat, 1 = dog).
print('Confusion Matrix:')
print(metrics.confusion_matrix(actual_values, predictions))

Accuracy: 75.0 % 
Error Rate: 25.0 % 
Confusion Matrix:
[[6 2]
 [1 3]]


## Precision, Recall, and F-Measure

As a performance measure, classification accuracy has its limitations. One example where accuracy may be an inadequate performance measure is in the presence of class imbalance. For example, imagine a situation where a dataset of cat and dog images contains a large number of cat examples (majority class) and a small number of dog examples (minority class). On such a dataset, even an unskillful model may achieve high accuracy if the large number of examples from the majority class overwhelms those in the minority class.
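
A small sketch makes the problem visible. The 95/5 split below is a hypothetical dataset invented for this illustration, and the "model" simply predicts the majority class for every picture.

```python
# Hypothetical imbalanced test set: 95 cats (class 0) and 5 dogs (class 1).
actual = [0] * 95 + [1] * 5

# An unskillful "model" that ignores its input and always predicts cat.
predicted = [0] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Fraction of the actual dogs that the model found.
dogs_found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
dog_recall = dogs_found / actual.count(1)

print(f"Accuracy:   {accuracy:.2f}")    # 0.95
print(f"Dog recall: {dog_recall:.2f}")  # 0.00
```

The model scores 95% accuracy without ever identifying a single dog, which is why accuracy alone can be misleading on imbalanced data.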

An alternative to using classification accuracy is to use **precision** and **recall** metrics. 

However, prior to getting into precision and recall, it is important to dive deeper into the confusion matrix as it provides insight into both the performance of the model and the types of errors being made.

### Confusion Matrix Reloaded

The results summary displayed in the confusion matrix consists of true predictions and false predictions.

<img src="img/confusion_matrix_reloaded.png" width="200">

True Predictions: 
  * TP: True Positives. Model predicted Yes, and actual value is Yes.
  * TN: True Negatives. Model predicted No, and actual value is No.
  
False Predictions: 
  * FP: False Positives. Model predicted Yes, but actual value is No.
  * FN: False Negatives. Model predicted No, but actual value is Yes.


Precision and recall are defined using the terms in the "reloaded" confusion matrix.
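
As a minimal sketch, the four counts and the metrics built from them can be tallied by hand for the cat/dog example. It assumes that dog (class 1) is the positive class and uses the standard definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two.

```python
# Labels from the cat/dog example, with dog (class 1) as the positive class.
actual_values = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

pairs = list(zip(actual_values, predictions))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # dogs predicted as dogs
tn = sum(a == 0 and p == 0 for a, p in pairs)  # cats predicted as cats
fp = sum(a == 0 and p == 1 for a, p in pairs)  # cats predicted as dogs
fn = sum(a == 1 and p == 0 for a, p in pairs)  # dogs predicted as cats

precision = tp / (tp + fp)  # how many predicted dogs really are dogs
recall = tp / (tp + fn)     # how many actual dogs were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```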

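### Precision

Precision is the fraction of positive predictions that are actually positive.
* Precision = TP / (TP + FP)

### Recall

Recall is the fraction of actual positives that the model correctly identifies.
* Recall = TP / (TP + FN)

### F-Measure

The F-measure combines precision and recall into a single score. Its most common form, the F1 score, is their harmonic mean.
* F1 = 2 * (Precision * Recall) / (Precision + Recall)

The scikit-learn classification report lists all three metrics for each class: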

In [17]:
print(metrics.classification_report(actual_values, predictions))

              precision    recall  f1-score   support

           0       0.86      0.75      0.80         8
           1       0.60      0.75      0.67         4

    accuracy                           0.75        12
   macro avg       0.73      0.75      0.73        12
weighted avg       0.77      0.75      0.76        12
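
The two average rows in the report can be reproduced by hand. The sketch below does this for the precision column only, using per-class counts taken from the cat/dog example; the dictionary layout is just a convenience for this illustration.

```python
# Per-class counts from the cat/dog example (class 0 = cat, class 1 = dog).
# correct: correct predictions for the class
# predicted: how many times the class was predicted
# support: how many actual examples of the class exist
per_class = {
    0: {"correct": 6, "predicted": 7, "support": 8},
    1: {"correct": 3, "predicted": 5, "support": 4},
}

precisions = {c: v["correct"] / v["predicted"] for c, v in per_class.items()}
supports = {c: v["support"] for c, v in per_class.items()}
total = sum(supports.values())

# Macro average: every class counts equally.
macro_precision = sum(precisions.values()) / len(precisions)

# Weighted average: each class counts in proportion to its support.
weighted_precision = sum(precisions[c] * supports[c] for c in precisions) / total

print(f"macro avg precision:    {macro_precision:.2f}")
print(f"weighted avg precision: {weighted_precision:.2f}")
```

The weighted average sits closer to the cat precision because cats make up two thirds of the examples.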



### Precision Score

In [None]:
# By default the positive class is 1, so this is the precision for the dog class.
print(f'Precision Score is: {metrics.precision_score(actual_values, predictions)}')

### Recall Score

In [None]:
# By default the positive class is 1, so this is the recall for the dog class.
print(f'Recall Score is: {metrics.recall_score(actual_values, predictions)}')

### Precision - Recall Curve

In [None]:
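import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the penguins dataset and drop rows with missing values.
data = pd.read_csv('./data/penguins_size.csv')
data = data.dropna()

# Drop the columns that are not used as features, and keep only two species
# so that the problem is binary.
data = data.drop(['sex', 'island', 'flipper_length_mm', 'body_mass_g'], axis=1)
data = data[data['species'] != 'Chinstrap']

X = data.drop(['species'], axis=1).values

# Encode the labels: Adelie is the negative class (-1), Gentoo the positive class (1).
species_to_label = {'Adelie': -1, 'Gentoo': 1}
y = np.array([species_to_label[item] for item in data['species']])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=33)

# Fit a decision stump (a tree of depth 1) and predict on the held-out test set.
dt_model = DecisionTreeClassifier(max_depth=1)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)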

In [None]:
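# plot_precision_recall_curve was removed in scikit-learn 1.2;
# PrecisionRecallDisplay.from_estimator replaces it (scikit-learn 1.0 and later).
disp = metrics.PrecisionRecallDisplay.from_estimator(dt_model, X_test, y_test, color='orange')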

### F1 Score

In [None]:
print('F1 Score:', metrics.f1_score(actual_values, predictions))

### AUC-ROC

This section reuses `dt_model`, `X_test`, and `y_test` from the precision-recall curve example above.

In [None]:
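# plot_roc_curve was removed in scikit-learn 1.2;
# RocCurveDisplay.from_estimator replaces it (scikit-learn 1.0 and later).
metrics.RocCurveDisplay.from_estimator(dt_model, X_test, y_test, color='orange')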

In [None]:
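# roc_auc_score is normally given scores or probabilities. With hard 0/1 labels
# it reduces to the average of the true positive rate and the true negative rate.
print('AUC-ROC:', metrics.roc_auc_score(actual_values, predictions))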
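
One way to build intuition for the number above is the ranking view of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The helper below, `pairwise_auc`, is a hypothetical function written for this sketch and is not part of scikit-learn.

```python
def pairwise_auc(actual, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0  # positive ranked above negative
            elif pos == neg:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


actual_values = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

auc_from_labels = pairwise_auc(actual_values, predictions)

# With hard labels, AUC equals the mean of the true positive and true negative rates.
tpr = 3 / 4  # 3 of 4 dogs found
tnr = 6 / 8  # 6 of 8 cats found
print(auc_from_labels, (tpr + tnr) / 2)
```

Because hard labels produce many ties, they give only a coarse AUC. Predicted scores or probabilities let the metric use the full ranking.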

### LOGLOSS

In [None]:
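# log_loss expects predicted probabilities. Hard 0/1 labels are clipped internally,
# so every wrong prediction is penalized very heavily; this call only shows the API.
print('LOGLOSS:', metrics.log_loss(actual_values, predictions))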
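
Log loss is more informative when it is computed from predicted probabilities. The sketch below works through the binary formula by hand. The probabilities are hypothetical values invented for this illustration, chosen so that thresholding them at 0.5 reproduces the hard predictions from the cat/dog example, and `binary_log_loss` is a helper written for this sketch.

```python
import math

actual_values = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical predicted probabilities of "dog" (class 1); not from the lesson.
# Thresholding them at 0.5 reproduces the hard predictions used earlier.
dog_probabilities = [0.6, 0.7, 0.1, 0.2, 0.3, 0.1, 0.4, 0.2, 0.9, 0.8, 0.7, 0.4]
hard_predictions = [int(p >= 0.5) for p in dog_probabilities]


def binary_log_loss(actual, probabilities):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(actual, probabilities):
        # Probability the model assigned to the label that actually occurred.
        p_true = p if y == 1 else 1 - p
        total += -math.log(p_true)
    return total / len(actual)


loss = binary_log_loss(actual_values, dog_probabilities)
print(f"Log loss: {loss:.3f}")
```

Confident wrong predictions, such as a cat given a 0.7 probability of being a dog, contribute the most to the total, while confident correct predictions contribute very little.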