# Fundamentals of machine learning using Python 
## Evaluation

***
<br>

## What is model evaluation?

* Model evaluation is the process of using different evaluation metrics to understand a machine learning model’s performance, as well as its strengths and weaknesses.
* Model evaluation is important to assess the efficacy of a model during initial research phases, and it also plays a role in model monitoring.

## Evaluating the classifier

* The most popular metrics for measuring classification performance include accuracy, precision, confusion matrix, log-loss, and AUC (area under the ROC curve).

#### Classification Accuracy

* It is the simplest and most commonly used classification metric.
* It is the ratio of correct predictions to the total number of predictions.
* While it can give you a quick idea of how your classifier is performing, it is most informative when the number of observations/examples in each class is roughly equal.
* Because class distributions are often imbalanced in practice, accuracy is usually best reported alongside other metrics.
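
The weakness of accuracy on imbalanced data can be illustrated with a minimal sketch. The labels below are hypothetical (95 negatives, 5 positives) and the "classifier" simply predicts the majority class every time; neither comes from the datasets used in this notebook.

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95 negative and 5 positive examples
y_true = [0] * 95 + [1] * 5

# A trivial "classifier" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
```

The score looks high even though not a single positive example was detected, which is why accuracy alone can be misleading here.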

#### Precision

* It measures the proportion of predicted Positives that are truly Positive.
* Precision is a good choice of evaluation metrics when you want to be very sure of your prediction.
* For example, if you are building a system to predict whether to decrease the credit limit on a particular account, you want to be very sure about the prediction or it may result in customer dissatisfaction.
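
As a rough illustration, precision can be computed with `precision_score` from Scikit-Learn. The label vectors are made up for this sketch and are not taken from any dataset in this notebook.

```python
from sklearn.metrics import precision_score

# Hypothetical ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# 4 examples were predicted positive, 3 of them are truly positive
precision = precision_score(y_true, y_pred)
print(f"precision = {precision:.2f}")  # precision = 0.75
```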

#### Confusion matrix

* The confusion matrix (or confusion table) shows a more detailed breakdown of correct and incorrect classifications for each class.
* Using a confusion matrix is useful when you want to understand the distinction between classes, particularly when the cost of misclassification might differ for the two classes, or you have a lot more test data on one class than the other.
* For example, the consequences of making a false positive or false negative in a cancer diagnosis are very different.

<img src="img/confusion-matrix.jpeg" style="width:500px">
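
A small sketch of how the matrix can be obtained with Scikit-Learn, again on made-up labels. Note that `confusion_matrix` expects the true labels first and the predictions second; rows then correspond to true classes and columns to predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Rows: true class, columns: predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For a binary problem the four cells can be unpacked directly
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3
```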

#### Logarithmic Loss

* Logarithmic Loss, or LogLoss, evaluates the probabilities a classifier assigns to its predictions rather than only the predicted labels.
* For each example, LogLoss takes the negative logarithm of the probability that the classifier assigned to the true class, and averages these values over all examples.
* Predicted probabilities run from 0 to 1. A confident and correct prediction (probability of the true class close to 1) contributes almost nothing to the loss, while a confident but wrong prediction is penalized heavily.
* The loss is a non-negative number, with 0 representing a perfect classifier, so smaller values are better.
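
The following sketch computes the binary log-loss by hand and compares it with Scikit-Learn's `log_loss`, which returns a non-negative value where 0 is a perfect score. The labels and predicted probabilities are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 0])
p_pos = np.array([0.9, 0.1, 0.6, 0.4])

# Binary log-loss: average negative log-probability of the true class
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

# The same quantity computed by scikit-learn
sk = log_loss(y_true, p_pos)

print(f"manual  = {manual:.4f}")
print(f"sklearn = {sk:.4f}")
```

Both values agree and are greater than zero; they would approach zero only as the probabilities assigned to the true classes approach 1.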

#### Area Under ROC Curve (AUC)

* This metric is used primarily for binary classification problems (multiclass extensions exist, for example by averaging one-vs-rest scores).
* The area under the curve represents the model's ability to properly discriminate between negative and positive examples, between one class or another.
* An AUC of 1.0, with all of the area falling under the curve, represents a perfect classifier.
* An AUC of 0.5 is no better than random guessing.
* The ROC curve plots sensitivity (true positive rate/recall) against the false positive rate (1 - specificity, where specificity is the true negative rate) across all classification thresholds.

<img src="img/auc.png" style="width:400px">
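
A minimal sketch of computing the ROC curve and its AUC with Scikit-Learn. The scores are hypothetical classifier outputs (for example, predicted probabilities of the positive class), not results from this notebook.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and classifier scores for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# AUC: probability that a random positive is scored above a random negative
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.2f}")  # AUC = 0.75

# Points of the ROC curve: false positive rate vs. true positive rate
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("FPR:", fpr)
print("TPR:", tpr)
```

Here three of the four positive/negative pairs are ranked correctly, which gives an AUC of 0.75.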

#### Classification Report

* The classification report is a built-in Scikit-Learn utility that summarizes the main classification metrics for each class.
* Using the classification report can give you a quick intuition of how your model is performing.
* Recall is the percentage of examples actually belonging to Class A (some given class) that your model labeled as Class A (true positives against true positives plus false negatives), and this is represented in the report.
* The report also returns precision, f1-score, and support (the number of true examples of each class).
* Precision is the percentage of examples your model labeled as Class A which actually belonged to Class A (true positives against true positives plus false positives), and f1-score is the harmonic mean of precision and recall.

<img src="img/classification-report.png" style="width:450px">
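
To make the relation between the three numbers concrete, the sketch below computes precision, recall and f1-score on made-up labels and checks that f1 is the harmonic mean of the other two, not their arithmetic mean.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 6
f1 = f1_score(y_true, y_pred)

# Harmonic mean of precision and recall
harmonic = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.2f}")  # precision = 0.75
print(f"recall    = {recall:.2f}")     # recall    = 0.50
print(f"f1        = {f1:.2f}")         # f1        = 0.60
```

The arithmetic mean of 0.75 and 0.50 would be 0.625; the harmonic mean (0.60) is pulled toward the lower of the two values.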

## Example of classifier evaluation

Comparison of the quality of SVM and logistic regression classifiers applied to the `data/pima-indians-diabetes.csv` dataset.

In [1]:
# create and train classifiers

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

dataset = pd.read_csv("data/pima-indians-diabetes.csv")
X = dataset.iloc[:,0:8].values
y = dataset.iloc[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# standardize features using statistics computed on the training set only
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(X_train, y_train)

clf_lr = LogisticRegression(random_state=0)
clf_lr.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [2]:
# make predictions and evaluate - SVM classifier

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

train_predictions_svm = clf_svm.predict(X_train)
test_predictions_svm = clf_svm.predict(X_test)

print("Train set evaluation")
# scikit-learn metrics expect the true labels first, then the predictions
print(confusion_matrix(y_train, train_predictions_svm))
print(classification_report(y_train, train_predictions_svm))

print()
print("Test set evaluation")
print(confusion_matrix(y_test, test_predictions_svm))
print(classification_report(y_test, test_predictions_svm))

Train set evaluation
[[313  36]
 [ 81 107]]
              precision    recall  f1-score   support

           0       0.79      0.90      0.84       349
           1       0.75      0.57      0.65       188

    accuracy                           0.78       537
   macro avg       0.77      0.73      0.74       537
weighted avg       0.78      0.78      0.77       537


Test set evaluation
[[123  28]
 [ 30  50]]
              precision    recall  f1-score   support

           0       0.80      0.81      0.81       151
           1       0.64      0.62      0.63        80

    accuracy                           0.75       231
   macro avg       0.72      0.72      0.72       231
weighted avg       0.75      0.75      0.75       231



In [3]:
# make predictions and evaluate - logistic regression classifier

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

train_predictions_lr = clf_lr.predict(X_train)
test_predictions_lr = clf_lr.predict(X_test)

print("Train set evaluation")
# scikit-learn metrics expect the true labels first, then the predictions
print(confusion_matrix(y_train, train_predictions_lr))
print(classification_report(y_train, train_predictions_lr))

print()
print("Test set evaluation")
print(confusion_matrix(y_test, test_predictions_lr))
print(classification_report(y_test, test_predictions_lr))

Train set evaluation
[[311  38]
 [ 79 109]]
              precision    recall  f1-score   support

           0       0.80      0.89      0.84       349
           1       0.74      0.58      0.65       188

    accuracy                           0.78       537
   macro avg       0.77      0.74      0.75       537
weighted avg       0.78      0.78      0.77       537


Test set evaluation
[[120  31]
 [ 30  50]]
              precision    recall  f1-score   support

           0       0.80      0.79      0.80       151
           1       0.62      0.62      0.62        80

    accuracy                           0.74       231
   macro avg       0.71      0.71      0.71       231
weighted avg       0.74      0.74      0.74       231



By comparing the evaluation metrics, we can conclude that both classifiers are of similar quality: test accuracy is 0.75 for the SVM and 0.74 for logistic regression, and the per-class differences between them are small.

## Evaluating the regressor

* For regression models, four evaluation metrics are mainly used: R-Squared, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).

#### R-Squared

* R-Squared is a statistical measure of fit that indicates how much variation of a dependent variable is explained by the independent variable(s) in a regression model.
* In investing, R-squared is generally interpreted as the percentage of a fund or security's movements that can be explained by movements in a benchmark index.
* An R-squared of 100% means that all movements of a security (or other dependent variables) are completely explained by movements in the index (or the independent variable(s) you are interested in).

$R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$
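
The formula can be checked numerically. In this sketch the unexplained variation is taken as the sum of squared residuals and the total variation as the sum of squared deviations from the mean, which is the definition used by Scikit-Learn's `r2_score`. The values are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Unexplained variation: sum of squared residuals
ss_res = np.sum((y_true - y_pred) ** 2)

# Total variation: sum of squared deviations from the mean of the actual values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(f"manual  R^2 = {r2_manual:.4f}")   # manual  R^2 = 0.9486
print(f"sklearn R^2 = {r2_sklearn:.4f}")  # sklearn R^2 = 0.9486
```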

#### Mean Absolute Error (MAE)

* To compute it, we subtract the predicted values from the actual values to obtain the errors, take the absolute value of each error, and average them.
* This metric gives a notion of the average size of the error made by the model per prediction; the smaller (closer to 0), the better.

$mae = \frac{1}{n} \sum_{i=1}^{n}|\text{Actual}_i - \text{Predicted}_i|$

#### Mean Squared Error (MSE)

* It is similar to the MAE metric, but it squares the errors instead of taking their absolute values.
* Also, as with MAE, the smaller, or closer to 0, the better.
* The errors are squared so that large errors weigh disproportionately more than small ones.
* One thing to pay close attention to is that MSE is usually hard to interpret, because its values are expressed in squared units and are therefore not on the same scale as the data.

$mse = \frac{1}{n} \sum_{i=1}^{n}(\text{Actual}_i - \text{Predicted}_i)^2$

#### Root Mean Squared Error (RMSE)

* RMSE addresses the interpretation problem raised with the MSE by taking the square root of its final value, which scales it back to the same units as the data.
* It is easier to interpret and useful when we need to report the error alongside the actual values of the data.
* It indicates the typical size of the error: with an RMSE of 4.35, the model's predictions typically deviate from the actual values by about 4.35 units, either above or below.
* As with the other error metrics, the closer to 0, the better.

$rmse = \sqrt{mse} = \sqrt{\frac{1}{n} \sum_{i=1}^{n}(\text{Actual}_i - \text{Predicted}_i)^2}$
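
The three error formulas above can be sketched directly with NumPy and compared against Scikit-Learn. The actual and predicted values are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values
actual = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5, 0.0, 2.0, 8.0])

errors = actual - predicted        # [0.5, -0.5, 0.0, -1.0]

mae = np.mean(np.abs(errors))      # (0.5 + 0.5 + 0 + 1) / 4
mse = np.mean(errors ** 2)         # (0.25 + 0.25 + 0 + 1) / 4
rmse = np.sqrt(mse)                # back in the units of the data

print(f"mae  = {mae:.4f}")   # mae  = 0.5000
print(f"mse  = {mse:.4f}")   # mse  = 0.3750
print(f"rmse = {rmse:.4f}")  # rmse = 0.6124

# The manual values agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(actual, predicted))
assert np.isclose(mse, mean_squared_error(actual, predicted))
```

The single largest error (1.0) accounts for two thirds of the MSE but only half of the MAE, which shows how squaring emphasizes large errors.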

## Example of regressor evaluation

Computation of evaluation metrics for a linear regression model built on the `data/petrol_consumption.csv` dataset.

In [4]:
# load dataset and train regressor

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

petrol_dataset = pd.read_csv("data/petrol_consumption.csv")

X = petrol_dataset[['Average_income', 'Paved_Highways', 'Population_Driver_licence(%)', 'Petrol_tax']]
y = petrol_dataset['Petrol_Consumption']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

In [5]:
# make predictions and evaluate

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

predictions_train = regressor.predict(X_train)
predictions_test = regressor.predict(X_test)

print("Train set evaluation")
r2 = r2_score(y_train, predictions_train)
mae = mean_absolute_error(y_train, predictions_train)
mse = mean_squared_error(y_train, predictions_train)
rmse = np.sqrt(mse)
print("r2 =", r2)
print("mae =", mae)
print("mse =", mse)
print("rmse = ", rmse)

print()
print("Test set evaluation")
r2 = r2_score(y_test, predictions_test)
mae = mean_absolute_error(y_test, predictions_test)
mse = mean_squared_error(y_test, predictions_test)
rmse = np.sqrt(mse)
print("r2 =", r2)
print("mae =", mae)
print("mse =", mse)
print("rmse = ", rmse)

Train set evaluation
r2 = 0.7068781342155135
mae = 49.377699313187556
mse = 4015.2628907647795
rmse =  63.366102063838355

Test set evaluation
r2 = 0.3913664001430558
mae = 53.46854128290795
mse = 4083.2558717442553
rmse =  63.900358932828034


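The error metrics are similar for the training and test sets (MAE of about 49.4 vs. 53.5, RMSE of about 63.4 vs. 63.9), but R-squared drops from about 0.71 on the training set to about 0.39 on the test set. The model therefore explains much less of the variation in unseen data, so its generalisation properties should be judged with caution rather than assumed to be satisfactory.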

## --- Exercise ---

Find the best classifier for the `data/chronic_kidney_disease.csv` dataset. Hold out 20% of the dataset as the test set and evaluate the classifiers on it.

In [6]:
import pandas as pd

dataset = pd.read_csv("data/chronic_kidney_disease.csv")
dataset.head()

Unnamed: 0,Bp,Sg,Al,Su,Rbc,Bu,Sc,Sod,Pot,Hemo,Wbcc,Rbcc,Htn,Class
0,80.0,1.02,1.0,0.0,1.0,36.0,1.2,137.53,4.63,15.4,7800.0,5.2,1.0,1
1,50.0,1.02,4.0,0.0,1.0,18.0,0.8,137.53,4.63,11.3,6000.0,4.71,0.0,1
2,80.0,1.01,2.0,3.0,1.0,53.0,1.8,137.53,4.63,9.6,7500.0,4.71,0.0,1
3,70.0,1.005,4.0,0.0,1.0,56.0,3.8,111.0,2.5,11.2,6700.0,3.9,1.0,1
4,80.0,1.01,2.0,0.0,1.0,26.0,1.4,137.53,4.63,11.6,7300.0,4.6,0.0,1


In [None]:
# Write your code here