# Practical Guide to Machine Learning Model Evaluation and Error Metrics

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In machine learning, we mainly deal with two types of tasks: classification and regression. In classification, a predictive model is trained to assign data to discrete classes, for example predicting whether a loan applicant will default or not. In regression, a model is built to predict a continuous variable, for example house prices for the upcoming year.

In both tasks we do basic data processing and then split the data into training and testing sets. The training data is used to fit the model, whereas the testing data is used to compute predictions that can be checked against the known answers. Many different algorithms can be used for classification as well as regression, but the idea is to choose the one that works best on our data. This is done by evaluating the model with error metrics such as the confusion matrix, accuracy score, classification report and mean squared error.

This notebook demonstrates one classification and one regression problem: we first build the models and then evaluate them to check their performance. We will use the Pima Indians Diabetes data for the classification task, where we need to classify whether a patient is diabetic or not. We will also explore the wine dataset for the regression task, where we need to predict the quality of the wine.

### What will you learn from this notebook?
* How to build a classification model
* How to build a regression model
* How to check model performance using different error metrics

### Classification Model 
We will first import the required libraries and load the data, followed by basic preprocessing. In this experiment, we use the Pima Indians Diabetes dataset, which is publicly available on Kaggle. The dataset has 768 rows and 9 columns, with no missing values. The code below does this.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,classification_report,f1_score

pima = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

print(pima.head(10))
print(pima.shape)
print(pima.isnull().any())

In [None]:
pima.info()

After that, we will define the independent features X and the dependent variable y. We then split the data into training and testing sets and scale the features. Finally, we fit the models on the training data and make predictions for the testing data.
We will use the Random Forest classifier and the Support Vector Machine algorithm to build two models. The code below does this.

In [None]:
X = pima.drop('Outcome', axis = 1)
y = pima['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

std = StandardScaler()

# Fit the scaler on the training data only, then apply the same scaling to the test data
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)

rfcl = RandomForestClassifier()
rfcl.fit(X_train,y_train)

y_pred_rfcl = rfcl.predict(X_test)

svc = svm.SVC()
svc.fit(X_train,y_train)

y_pred_svc = svc.predict(X_test)

We have stored the test-set predictions in the variables y_pred_rfcl and y_pred_svc. We will use these variables to evaluate the models. We will now compute different metrics to check model performance: accuracy_score, confusion_matrix, f1_score and classification_report. The code below computes them.

In [None]:
# scikit-learn metrics take the true labels first, then the predictions
print(accuracy_score(y_test, y_pred_rfcl))          # Random Forest classifier
print(accuracy_score(y_test, y_pred_svc))           # Support Vector Machine classifier
print(confusion_matrix(y_test, y_pred_rfcl))        # Random Forest classifier
print(confusion_matrix(y_test, y_pred_svc))         # Support Vector Machine classifier
print(f1_score(y_test, y_pred_rfcl))                # Random Forest classifier
print(f1_score(y_test, y_pred_svc))                 # Support Vector Machine classifier
print(classification_report(y_test, y_pred_rfcl))   # Random Forest classifier
print(classification_report(y_test, y_pred_svc))    # Support Vector Machine classifier

### Regression Model 
We will first import the required libraries and load the dataset. We will use the wine dataset for this problem, which can be downloaded directly from Kaggle, and then pre-process it. The dataset has 1599 rows and 12 columns, with no missing values. The code below does this.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error,mean_absolute_error

wdata = pd.read_csv("../input/winedata/winequality_red.csv")

print(wdata.head(10))
print(wdata.shape)
print(wdata.isnull().any())

After that, we will define the independent features X and the dependent variable y. We then split the data into training and testing sets and scale the features. Finally, we fit the models on the training data and make predictions for the testing data. We will use Linear Regression and Random Forest Regression to build two models. The code below does this.

In [None]:
X = wdata.drop('quality', axis =1)
y = wdata['quality']

std = StandardScaler()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the scaler on the training data only, then apply the same scaling to the test data
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)

lr = LinearRegression()
rfr = RandomForestRegressor()

lr.fit(X_train,y_train)
rfr.fit(X_train,y_train)

y_pred_lr = lr.predict(X_test)
y_pred_rfr = rfr.predict(X_test)

We have stored the test-set predictions in y_pred_lr and y_pred_rfr. We will use these variables to evaluate the models. We will now compute two error metrics to check model performance: mean squared error and mean absolute error.

In [None]:
# True values first, then predictions, as in the scikit-learn signatures
print("Linear Regression MSE:", mean_squared_error(y_test, y_pred_lr))
print("Random Forest MSE:", mean_squared_error(y_test, y_pred_rfr))
print("Linear Regression MAE:", mean_absolute_error(y_test, y_pred_lr))
print("Random Forest MAE:", mean_absolute_error(y_test, y_pred_rfr))


### Conclusion
We have computed evaluation metrics for both the classification and the regression problem. Model performance can usually be improved further with feature engineering and hyperparameter tuning. I hope you now understand how to build classification and regression models and how to evaluate them using the metrics discussed above.

Read more about error metrics in the section below, “Model Evaluation Techniques For Machine Learning”.

### Model Evaluation Techniques For Machine Learning
Model evaluation plays a crucial role in developing a predictive machine learning model. A model that has been built but never checked cannot be called a fit model, whereas a model that scores well on suitable metrics can. For this, you need to check the metrics and make improvements until you reach the accuracy you need.

### 1| Chi-Square
The χ^2 test is used to test hypotheses about two or more groups, typically to check whether two variables are independent. It is applied to categorical data arranged in a bivariate table. Related tests for categorical data include Fisher’s exact test and the binomial test. The formula for calculating a Chi-Square statistic is given as
![image](https://github.com/VivekN147/datascience-pro/blob/main/chi-square-gof.png?raw=true)
*Formula for the Chi-Squared Goodness-of-Fit Test*

Where O represents the observed frequency, E represents the expected frequency.
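
As a rough illustration of the statistic, the sketch below computes it for a test of independence on a small contingency table. The function name `chi_square_statistic` and the example counts are made up for this illustration; the expected frequency of each cell is taken as row total × column total / grand total, which is the usual assumption under independence.

```python
def chi_square_statistic(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the row and column variables were independent
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    return statistic


# Hypothetical 2x2 table: rows are two groups, columns are outcome yes / no
table = [[30, 10], [20, 40]]
chi2 = chi_square_statistic(table)
print(round(chi2, 3))
```

A larger statistic means the observed counts sit further from what independence would predict. Turning it into a p-value additionally needs the chi-square distribution with the right degrees of freedom, which is not shown here.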

### 2| Confusion Matrix
The confusion matrix is also known as Error matrix and is represented by a table which describes the performance of a classification model on a set of test data in machine learning. 
![image](https://www.analyticsindiamag.com/wp-content/uploads/2019/04/confusion-matrix.png)
In the above table, Class 1 is the positive class and Class 2 is the negative class. It is a two-dimensional matrix where each row represents the instances in a predicted class while each column represents the instances in an actual class, or the other way around. Here, TP (True Positive) means the observation is positive and is predicted as positive, FP (False Positive) means the observation is negative but is predicted as positive, TN (True Negative) means the observation is negative and is predicted as negative, and FN (False Negative) means the observation is positive but is predicted as negative.
![image](https://cdn.shortpixel.ai/client/q_glossy,ret_img,w_398,h_196/https://financetrain.com/wp-content/uploads/confusion-matrix.png)
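
The four cells can be counted directly from the true and predicted labels. The sketch below is a minimal pure-Python version; the helper name `confusion_counts` and the toy labels are assumptions for illustration. A false positive is a negative observation predicted as positive, and a false negative is a positive observation predicted as negative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # positive observation, predicted positive
        elif predicted == positive:
            fp += 1   # negative observation, predicted positive
        elif actual == positive:
            fn += 1   # positive observation, predicted negative
        else:
            tn += 1   # negative observation, predicted negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Toy labels, made up for the example
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```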

### 3| Concordant-Discordant Ratio
In a pair of cases, when one case is higher than the other on both variables, the pair is known as a concordant pair. On the other hand, when one case is higher than the other on one variable but lower on the other variable, the pair is known as a discordant pair.

Suppose, there are a pair of observations (Xa, Ya) and (Xb, Yb)

Then, the pair is concordant if Xa>Xb and Ya>Yb or Xa < Xb and Ya < Yb.

And the pair is discordant if Xa>Xb and Ya<Yb or Xa < Xb and Ya > Yb.
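
The two conditions above can be checked for every pair of observations with a short loop. This is a sketch with made-up data; the function name `concordant_discordant` is hypothetical, and ties (equal X or equal Y values) are counted as neither concordant nor discordant, which is one common convention.

```python
from itertools import combinations


def concordant_discordant(xs, ys):
    """Count concordant and discordant pairs among the observations (x, y)."""
    concordant = discordant = 0
    for (xa, ya), (xb, yb) in combinations(zip(xs, ys), 2):
        # Same sign of both differences -> concordant, opposite signs -> discordant
        product = (xa - xb) * (ya - yb)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return concordant, discordant


xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
c, d = concordant_discordant(xs, ys)
print(c, d)  # 5 1
```

The ratio of the two counts then follows directly, provided there is at least one discordant pair.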

### 4| Confidence Interval
Confidence Interval or CI is a range of values, computed from a sample, that is expected to contain an unknown population parameter at a chosen confidence level. The factors affecting the width of the confidence interval are the confidence level, the size of the sample and the variability of the sample.
A CI is described by a range and a probability. The range gives the lower and upper limits on the skill that can be expected of the model. The probability is the confidence level, that is, how often intervals constructed this way would contain the true skill.
![image](https://github.com/VivekN147/datascience-pro/blob/main/the_construction_of_a_ci.gif?raw=true)
Source:[Construction Of Confidence Interval](http://www.sumsar.net/blog/2013/12/an-animation-of-the-construction-of-a-confidence-interval/)

The half-width of a CI is often referred to as the margin of error, and CIs may be used to graphically depict the uncertainty of an estimate through the use of error bars.

### For Classification Accuracy In Machine Learning
A machine learning algorithm is well understood by the data scientists and engineers who develop it, but when the product needs to be pitched, the only thing that counts is its performance. So a metric to gauge the performance of a model is necessary.

Classification accuracy is used to assess the efficacy of a classification algorithm. Reporting the classification accuracy of a model on its own, however, is not best practice.
```
Classification Accuracy = correct predictions/ total predictions
```
It is common to use classification accuracy or classification error (the complement of accuracy, that is, 1 - accuracy) to describe the skill of a classification predictive model. For example, a model that makes correct predictions of the class outcome variable 75% of the time has a classification accuracy of 75%, calculated as:
```
accuracy = total correct predictions / total predictions made * 100
```
Classification accuracy or classification error is a proportion or a ratio. It describes the proportion of correct or incorrect predictions made by the model. Each prediction is a binary decision that could be correct or incorrect. Technically, this is called a Bernoulli trial, named for Jacob Bernoulli. The number of correct predictions in a series of Bernoulli trials follows a specific distribution called a binomial distribution.
![image](https://www.analyticsindiamag.com/wp-content/uploads/2019/01/ci.png)
We can use the assumption of a Gaussian distribution of the proportion (i.e. the classification accuracy or error) to easily calculate the confidence interval.

In the case of classification error, the radius of the interval can be calculated as:
```
interval = z * sqrt( (error * (1 - error)) / n)
```
In the case of classification accuracy, the radius of the interval can be calculated as:
```
interval = z * sqrt( (accuracy * (1 - accuracy)) / n)
```
Where interval is the radius of the confidence interval, error and accuracy are classification error and classification accuracy respectively, n is the size of the sample, sqrt is the square root function, and z is a critical value from the Gaussian distribution. Technically, this is called the Binomial proportion confidence interval.
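
The formula can be sketched in a few lines of Python. The function name `accuracy_interval` and the example numbers (75% accuracy on 100 test samples) are assumptions for illustration; z = 1.96 is the usual critical value for a 95% confidence level, and the bounds are clipped to [0, 1] because accuracy is a proportion.

```python
from math import sqrt


def accuracy_interval(accuracy, n, z=1.96):
    """Binomial proportion confidence interval using the Gaussian approximation."""
    radius = z * sqrt(accuracy * (1 - accuracy) / n)
    lower = max(0.0, accuracy - radius)
    upper = min(1.0, accuracy + radius)
    return lower, upper


# Hypothetical result: 75% accuracy measured on 100 test samples
lower, upper = accuracy_interval(0.75, 100)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Because the radius shrinks with the square root of n, quadrupling the test set only halves the width of the interval.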

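A code snippet that attaches an uncertainty estimate to regression predictions, first with a single constant standard deviation and then with a second model that predicts the error for each sample:
```
# Assumes X_train, y_train and X_test from the regression example above.
# The model choices are illustrative; any scikit-learn regressors would do.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

base_model, error_model = RandomForestRegressor(), RandomForestRegressor()

# Approach 1: a constant standard deviation equal to the validation RMSE
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5)
base_model.fit(X1, y1)
base_prediction = base_model.predict(X2)
error = mean_squared_error(y2, base_prediction) ** 0.5
mean = base_model.predict(X_test)
st_dev = error

# Approach 2: a second model that predicts the squared error for each sample
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.5)
base_model.fit(X1, y1)
base_prediction = base_model.predict(X2)
validation_error = (base_prediction - y2) ** 2
error_model.fit(X2, validation_error)
mean = base_model.predict(X_test)
st_dev = np.sqrt(error_model.predict(X_test))  # square root of the predicted squared error
```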
### Common Misconceptions About Confidence Intervals
A 95% confidence interval does not mean that for a given realised interval there is a 95% probability that the population parameter lies within the interval. The 95% probability relates to the reliability of the estimation procedure, not to a specific calculated interval.

A confidence interval is not a definitive range of plausible values for the sample parameter, though it may be understood as an estimate of plausible values for the population parameter.

A particular confidence interval of 95% calculated from an experiment does not mean that there is a 95% probability of a sample parameter from a repeat of the experiment falling within this interval. So, it is essential to remember that:
* 95% confidence is confidence that in the long-run 95% of the CIs will include the population mean. It is a confidence in the algorithm and not a statement about a single CI.
* In frequentist terms, the CI either contains the population mean or it does not.
* There is no relationship between a sample’s variance and its mean. Therefore we cannot infer that a single narrow CI is more accurate. In this context “accuracy” refers to the long-run coverage of the population mean. Look at the visualisation above and note how much the widths of the CIs vary. A CI can be narrow and still be far away from the true mean.

### Result
A confidence interval is different from a tolerance interval, which describes the bounds of data sampled from the distribution. A CI provides bounds on a population parameter, such as a mean or standard deviation, and is a way to deal with the uncertainty inherent in results derived from data that are themselves only a randomly selected subset of a population.

It is often said that preferring confidence intervals and estimation to hypothesis testing will lead to fewer statistical misinterpretations. However, confidence intervals can be unintuitive and are sometimes as misunderstood as p-values and null hypothesis significance testing. Moreover, CIs are often used to perform hypothesis tests and are therefore prone to the same misuses as p-values.

Real-world data is noisy, inconsistent and non-linear. So a single “significant” CI can be very useful for drawing conclusions that would otherwise be cumbersome to reach.

### 5| Gini Co-efficient
The Gini coefficient or Gini Index is a popular metric for imbalanced class values. It is a statistical measure of distribution developed by the Italian statistician Corrado Gini in 1912. The coefficient ranges from 0 to 1, where 0 represents perfect equality and 1 represents perfect inequality. The higher the value of the index, the more dispersed the data.
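
As a sketch of the statistical definition, the coefficient can be computed from the mean absolute difference between all pairs of values, divided by twice the mean. The function name `gini_coefficient` and the sample values are made up for illustration, and the values are assumed to be non-negative.

```python
def gini_coefficient(values):
    """Gini coefficient of a list of non-negative values."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs of values
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    # Equivalent to (mean absolute difference) / (2 * mean)
    return abs_diff_sum / (2 * n * total)


print(gini_coefficient([1, 1, 1, 1]))    # 0.0  (perfect equality)
print(gini_coefficient([0, 0, 0, 10]))   # 0.75 (one item holds everything)
```

With a finite number of items n, the largest value this formula can reach is (n - 1) / n, so it approaches 1 only as n grows.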

### 6| Gain and Lift Chart
This method is generally used to evaluate the performance of a classification model and is calculated as the ratio between the results obtained with and without the model. Here, gain is defined as the ratio of the cumulative number of targets captured up to a given point to the total number of targets in the entire dataset, and lift is defined as how many times better the model is than a random choice of cases.
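
A minimal sketch of these two quantities for one cut-off is shown below. It assumes binary labels (1 for a target) and model scores where a higher score means a more likely target; the function name `gain_and_lift`, the toy data and the 30% cut-off are all made up for illustration.

```python
def gain_and_lift(y_true, scores, fraction):
    """Gain and lift when contacting the top `fraction` of cases ranked by score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))

    targets_in_top = sum(label for _, label in ranked[:k])
    total_targets = sum(y_true)

    gain = targets_in_top / total_targets   # share of all targets captured
    lift = gain / (k / len(ranked))         # relative to random selection
    return gain, lift


y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gain, lift = gain_and_lift(y_true, scores, 0.3)
print(gain, round(lift, 2))
```

Repeating the calculation for a range of fractions (10%, 20%, and so on) gives the points of the gain and lift charts.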
    
### 7| Kolmogorov-Smirnov Chart
This non-parametric statistical test measures the performance of classification models as the degree of separation between the score distributions of the positive and negative classes. More generally, the KS test is used to check whether two samples come from the same distribution.
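
The separation can be sketched as the largest gap between the empirical cumulative distributions of the scores for the two classes. The function name `ks_statistic` and the toy data are assumptions for illustration; both classes are assumed to be present in `y_true`.

```python
def ks_statistic(y_true, scores):
    """Maximum distance between the score CDFs of the positive and negative class."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]

    best = 0.0
    for t in sorted(set(scores)):
        # Fraction of each class with a score at or below the threshold t
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Perfectly separated scores give the maximum value of 1.0
print(ks_statistic([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```

A value near 0 means the model scores positives and negatives almost identically, while a value near 1 means the two classes are well separated.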

### 8| Predictive Power
Predictive Power is a synthetic metric with some interesting properties: it always lies between 0 and 1, where 0 means that the feature subset has no predictive power and 1 means that the feature subset has maximum predictive power. It is used to select a good subset of features in any machine learning project.

### 9| AUC-ROC Curve
ROC or Receiver Operating Characteristics Curve is one of the most popular evaluation metrics for checking the performance of a classification model. The curve plots two parameters, True Positive Rate (TPR) and False Positive Rate (FPR). Area Under ROC curve is basically used as a measure of the quality of a classification model. Hence, the AUC-ROC curve is the performance measurement for the classification problem at various threshold settings.

The True Positive Rate or Recall is defined as
![image](https://www.analyticsindiamag.com/wp-content/uploads/2019/04/TPR.jpg)
The False Positive Rate is defined as
![image](https://www.analyticsindiamag.com/wp-content/uploads/2019/04/FPR.jpg)
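
In terms of the confusion matrix, TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below computes both at every distinct score threshold and integrates the resulting curve with the trapezoidal rule. The function names `roc_points` and `auc` and the toy data are made up for this illustration, and both classes are assumed to be present.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by thresholding at each distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives

    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)
print(round(area, 3))  # 0.889
```

The area equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, here 8 out of 9.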

### 10| Root Mean Square Error
Root Mean Squared Error or RMSE is a measure of the differences between the values predicted by a model and the values actually observed. It is the square root of the Mean Squared Error (MSE), which is the average of the squared errors and is used as the loss function for least squares regression.

Specifically, the RMSE is defined as
![image](https://www.analyticsindiamag.com/wp-content/uploads/2019/04/RMSE.jpg)
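
Written out in plain Python, the calculation is short. The function name `rmse` and the example values are made up for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 1, 0, -2 and 0, so the MSE is (1 + 0 + 4 + 0) / 4 = 1.25
value = rmse([3, 5, 2, 7], [2, 5, 4, 7])
print(round(value, 3))  # 1.118
```

Because RMSE is in the same units as the target, it is usually easier to interpret than the MSE itself.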

### Fundamental Techniques For Model Evaluation
ML models are deployed for purposes as trivial as purchasing a toothbrush and as gravely significant as cancer detection or the safety of nuclear plants. Whatever the application, the foundations of an ML model are rooted in mathematics: dot products, partial differentiation, network topology, etc.

Since the ground truth is very well known, a few techniques can be devised to keep an eye on how these models ultimately behave.

The following are a few traditionally used methodologies to evaluate the performance of a machine learning model:

### With The Use Of Loss Functions
To keep a check on how accurate the solution is, loss functions are used. These functions are mathematical expressions whose results show by how much the algorithm has missed the target. An example would be a self-driving car whose on-board camera misidentifies a cyclist as lane markings; that would be a bad day for the cyclist. Loss functions help avoid these kinds of misses by quantifying the errors so that training can reduce them.

For a classification problem, hinge loss and logistic loss achieve almost equal convergence rates, and both are better than the rate for squared loss.

The squared loss function, which rests on statistical assumptions about the mean, is more sensitive to outliers because it penalises them intensely. This results in slower convergence rates when compared to hinge loss or cross-entropy functions.

The hinge loss function penalises the data points lying on the wrong side of the hyperplane in a linear way. Hinge loss is not differentiable at the margin, so gradient-based methods such as stochastic gradient descent (SGD) have to fall back on a subgradient at that point. Alternatively, cross-entropy (log loss) can be used. This function is convex like hinge loss, is differentiable everywhere and can be minimised using SGD.
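
To make the comparison concrete, the sketch below evaluates the three losses for a single example. It assumes labels encoded as -1 or +1 and a real-valued model score; the function names are chosen for this illustration only.

```python
from math import exp, log


def hinge_loss(y, score):
    """Zero once the margin y * score reaches 1, linear below that."""
    return max(0.0, 1.0 - y * score)


def logistic_loss(y, score):
    """Log loss (cross-entropy) written in terms of the margin y * score."""
    return log(1.0 + exp(-y * score))


def squared_loss(y, score):
    """Penalises any distance from the label, even on the correct side."""
    return (y - score) ** 2


# A confidently correct prediction (score 3.0 for a positive example)
for loss in (hinge_loss, logistic_loss, squared_loss):
    print(loss.__name__, round(loss(1, 3.0), 4))
```

For this confidently correct example, hinge loss is zero and logistic loss is close to zero, while squared loss is 4.0. This shows how squared loss reacts strongly to points far from the label value, including ones that are classified correctly.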

### With Similarity Scores
The similarity score is designed to detect potential drops in predictive performance by using the training probability values associated with the patterns seen at inference, so that it flags the poor training coverage scenario and not the good training coverage scenario. In addition to the score itself, the system infrastructure must support such feedback, provide an alert mechanism and ideally handle the diversity of engines and languages typically used for ML applications (Spark, Python, R, etc.), so that the score can be used to generate alerts in production deployments.

There are three pressing issues that Similarity scores aim to address:

* Low number of samples: since the Similarity score is calculated based on the parameters of the multinoulli distribution and does not rely on the inference distribution, it is agnostic of the number of samples.
* The Similarity score relies on the probability values associated with this narrow range of the distribution and hence does not penalize the fact that the inference distribution does not cover the entire range of categories observed during training.
* The subset of patterns seen during inference might either have poor training coverage or good training coverage.

Engineers at ParallelM took this approach in proposing MLHealth, a model that monitors the change in patterns of incoming features in comparison to the ones observed during training, and argued that such a change could indicate the fitness of an ML model to inference data.

Though there exist techniques such as KL-divergence, the Wasserstein metric, etc. that provide a score for the divergence between the two distributions, they rely on the fact that the inference distribution is available and representative of the inference data. This implies there are enough samples to form a representative distribution.
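
For reference, KL-divergence between two categorical distributions can be sketched as below. The function name `kl_divergence` and the example probabilities are made up, and the sketch assumes `q` is non-zero wherever `p` is non-zero. It also takes both distributions as given, which is exactly the assumption that breaks down when only a few inference samples are available.

```python
from math import log


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same categories."""
    # Terms with p_i == 0 contribute nothing to the sum
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


training = [0.5, 0.5]    # category frequencies seen during training
inference = [0.9, 0.1]   # category frequencies seen at inference

print(round(kl_divergence(training, inference), 4))
print(kl_divergence(training, training))  # identical distributions give 0.0
```

Note that the measure is not symmetric: swapping the two arguments generally gives a different value.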

### Testing For NLP Tasks
Natural Language Processing (NLP) applications have become a top priority for machine learning developers. From QA systems at Google to chatbots to speech assistants like Alexa, NLP is essential. As companies look to move from AI towards AGI, understanding very sophisticated human language will stay on the radar for quite some time.

To test models implementing NLP tasks, there are techniques like F-score and BLEU score.

F-score is a measure of a test’s accuracy. The precision P and the recall R of the test are calculated, and the F-score is their harmonic mean. Values closer to 1 are considered better, and values closer to 0 indicate an inaccurate model.
![image](https://www.analyticsindiamag.com/wp-content/uploads/2019/04/fscore.png)
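
The harmonic mean of precision and recall can be computed as in the short sketch below. The function name `f_score` and the example values are assumptions for illustration; the result is defined as 0 when both precision and recall are 0, to avoid a division by zero.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_score(0.75, 0.6), 4))  # 0.6667
print(f_score(1.0, 0.0))             # 0.0, one weak component drags the score down
```

Unlike the arithmetic mean, the harmonic mean stays low when either precision or recall is low, which is why it is preferred here.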

### ML Is Not A Substitute For Crystal Ball
The typical life cycle of a deployed machine learning model involves a training phase, where a data scientist develops a model with good predictive performance on historical data. This model is then put into production with the hope that it will continue to have similar predictive performance during the course of its deployment.

But there can be problems with what gets deployed or with the data fed to the model, such as:

* an incorrect model gets pushed
* incoming data is corrupted
* incoming data changes and no longer resembles datasets used during training.

Whether it is a market crash or a wrong diagnosis, the after-effects can be irreversible. Tracking a machine learning model throughout its life cycle therefore becomes crucial.