<a href="https://colab.research.google.com/github/raj-vijay/ml/blob/master/12_Model_Evaluation_kNN_Diabetes_Prediction_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Model Evaluation Metrics**

Here, we fit a k-nearest neighbors (k-NN) classifier to the Pima Indians Diabetes dataset and evaluate it with a confusion matrix and a classification report (precision, recall and F1 score) rather than with accuracy alone.

**Classification Accuracy** 

Classification accuracy is the ratio of the number of correct predictions to the total number of predictions made.

Accuracy as a metric for model evaluation works well when the distribution of samples belonging to each class is even.

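In real-world applications, relying on accuracy as the only performance metric becomes a real problem when the classes are imbalanced and the risk or cost of misclassifying minority-class samples is very high.

A 99% accuracy can be excellent, good, mediocre, poor or terrible depending upon the problem. For example, if only 1% of the samples are positive, a model that always predicts the negative class reaches 99% accuracy while detecting no positive cases at all.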
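To make the point about imbalanced classes concrete, here is a minimal sketch in plain Python. The label counts (990 negatives, 10 positives) and the names `toy_true` and `toy_pred` are made up for illustration and are not part of the diabetes data used later.

```python
# Hypothetical imbalanced labels: 990 negatives (0) and 10 positives (1).
toy_true = [0] * 990 + [1] * 10

# A "classifier" that ignores its input and always predicts the majority class.
toy_pred = [0] * len(toy_true)

# Accuracy = correct predictions / total predictions.
correct = sum(t == p for t, p in zip(toy_true, toy_pred))
toy_accuracy = correct / len(toy_true)

# Number of actual positives that were also predicted positive.
positives_found = sum(t == 1 and p == 1 for t, p in zip(toy_true, toy_pred))

print(f"accuracy: {toy_accuracy:.2f}")         # accuracy: 0.99
print(f"positives found: {positives_found}")   # positives found: 0
```

The accuracy looks excellent even though the model never identifies a single positive sample, which is why the metrics below are needed.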

**Confusion Matrix**

A confusion matrix is a tool for assessing the performance of a classifier.

A confusion matrix, also known as an error matrix, is a matrix that allows visualization of the performance of a supervised learning algorithm. In unsupervised learning it is usually called a matching matrix.

In scikit-learn's convention, which this notebook follows, each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class (some references use the transposed layout). The matrix helps determine whether the machine learning model, here a binary classifier, is confusing two classes as it carries out its predictions.

![alt text](https://scikit-learn.org/stable/_images/sphx_glr_plot_confusion_matrix_001.png)

**Definition of the Terms**

• **Positive (P)** : Observation is positive.

• **Negative (N)** : Observation is not positive.

• **True Positive (TP)** : Observation is positive, and is predicted to be positive.

• **False Negative (FN)** : Observation is positive, but is predicted negative.

• **True Negative (TN)** : Observation is negative, and is predicted to be negative.

• **False Positive (FP)** : Observation is negative, but is predicted positive.
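The four terms above can be sketched as a simple counting routine. This is an illustrative helper, not something used elsewhere in the notebook; the function name `count_outcomes` and the example labels are hypothetical, and label `1` is assumed to be the positive class.

```python
def count_outcomes(actual, predicted, positive=1):
    """Return (TP, FN, TN, FP) counts for two equal-length label sequences."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # positive, predicted positive
        elif a == positive:
            fn += 1   # positive, predicted negative
        elif p == positive:
            fp += 1   # negative, predicted positive
        else:
            tn += 1   # negative, predicted negative
    return tp, fn, tn, fp


# Made-up labels, for illustration only.
example_actual    = [1, 1, 1, 0, 0, 0, 0, 1]
example_predicted = [1, 0, 1, 0, 1, 0, 0, 0]

tp, fn, tn, fp = count_outcomes(example_actual, example_predicted)
print(tp, fn, tn, fp)  # 2 2 3 1
```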

**Recall**

Recall is defined as the ratio of the number of correctly classified positive instances to the total number of actual positive instances.

High recall indicates that most positive instances are correctly recognized (a small number of FN).

Recall = True Positives / (True Positives + False Negatives)
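As a small sketch of the formula, the helper below computes recall from raw counts. The function name `recall` and the example counts are hypothetical, and returning `0.0` when there are no actual positives is a convention chosen here, since the ratio is undefined in that case.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN)."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        # Undefined ratio; 0.0 is an assumed convention for this sketch.
        return 0.0
    return true_positives / actual_positives


# Made-up counts: 8 positives found, 2 positives missed.
print(recall(true_positives=8, false_negatives=2))  # 0.8
```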


**Precision**

Precision is defined as the ratio of the number of correctly classified positive instances to the total number of instances predicted as positive.

High precision indicates that an instance labeled as positive is indeed positive (a small number of FP).

Precision = True Positives / (True Positives + False Positives)
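The sketch below applies the precision formula to a deliberately extreme case: a model that labels every sample as positive. The function name `precision` and the counts (10 actual positives, 90 actual negatives) are made up for illustration, and returning `0.0` when nothing is predicted positive is an assumed convention.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP)."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        # Undefined ratio; 0.0 is an assumed convention for this sketch.
        return 0.0
    return true_positives / predicted_positives


# Hypothetical "always positive" model on 10 positives and 90 negatives:
# every positive is found (FN = 0), but every negative becomes a false positive.
tp, fp, fn = 10, 90, 0

always_positive_precision = precision(tp, fp)
always_positive_recall = tp / (tp + fn)

print(always_positive_precision)  # 0.1
print(always_positive_recall)     # 1.0
```

Recall is perfect while precision is poor, which is why the two metrics are usually read together.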


**Pima Indians Diabetes Database**

The Pima Indians Diabetes dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. 

In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset is imported from Kaggle.

https://www.kaggle.com/uciml/pima-indians-diabetes-database

Install the Kaggle package to access the Pima Indians Diabetes dataset from Kaggle.

In [26]:
!pip install kaggle



Make a .kaggle directory under the home directory (/root on Colab) to hold the Kaggle authentication JSON file.

In [27]:
!mkdir ~/.kaggle

mkdir: cannot create directory ‘/root/.kaggle’: File exists


Copy the Kaggle JSON file to ~/.kaggle/kaggle.json

In [0]:
!cp /content/kaggle.json ~/.kaggle/kaggle.json

Protect the Kaggle JSON file for security reasons.

chmod 600 (equivalent to chmod a+rwx,u-x,g-rwx,o-rwx) sets the permissions so that the (U)ser / owner can read and write but not execute, while the (G)roup and (O)thers cannot read, write or execute.

In [0]:
!chmod 600 /root/.kaggle/kaggle.json
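As an optional check on what the octal mode `600` means, the sketch below decodes it with Python's standard `stat` module. It does not touch any file; it only inspects the permission bits, and the variable names are illustrative.

```python
import stat

mode = 0o600  # the permission bits set by `chmod 600`

owner_can_read = bool(mode & stat.S_IRUSR)
owner_can_write = bool(mode & stat.S_IWUSR)
owner_can_execute = bool(mode & stat.S_IXUSR)

# True if group or others have any of read/write/execute.
group_or_others_have_access = bool(mode & (stat.S_IRWXG | stat.S_IRWXO))

# Render the bits the way `ls -l` would for a regular file.
permission_string = stat.filemode(stat.S_IFREG | mode)

print(owner_can_read, owner_can_write, owner_can_execute)  # True True False
print(group_or_others_have_access)                         # False
print(permission_string)                                   # -rw-------
```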

Download the Pima Indians Diabetes dataset

In [30]:
!kaggle datasets download -d uciml/pima-indians-diabetes-database

pima-indians-diabetes-database.zip: Skipping, found more recently modified local copy (use --force to force download)


In [31]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the Pima Indians Diabetes file into a DataFrame: df
df = pd.read_csv('pima-indians-diabetes-database.zip', compression='zip', header=0, sep=',', quotechar='"')
print(df)

     Pregnancies  Glucose  ...  Age  Outcome
0              6      148  ...   50        1
1              1       85  ...   31        0
2              8      183  ...   32        1
3              1       89  ...   21        0
4              0      137  ...   33        1
..           ...      ...  ...  ...      ...
763           10      101  ...   63        0
764            2      122  ...   27        0
765            5      121  ...   30        0
766            1      126  ...   47        1
767            1       93  ...   23        0

[768 rows x 9 columns]


In [0]:
X = df.drop('Outcome', axis = 1)
y = df['Outcome']

In [0]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [0]:
# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [0]:
# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

In [0]:
# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

In [38]:
# Generate the confusion matrix
print(confusion_matrix(y_test, y_pred))


[[176  30]
 [ 56  46]]
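To read the matrix printed above, the sketch below copies its four values by hand and names each cell. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, labels in sorted order 0, 1) and treats `Outcome = 1` as the positive class.

```python
# Values copied from the confusion matrix printed above.
cm = [[176, 30],
      [56, 46]]

# Row 0 = actual negatives, row 1 = actual positives;
# column 0 = predicted negative, column 1 = predicted positive.
(tn, fp), (fn, tp) = cm

print(f"TN={tn}  FP={fp}")  # TN=176  FP=30
print(f"FN={fn}  TP={tp}")  # FN=56  TP=46

# Each row sum is the number of test samples that truly belong to that class.
actual_negatives = tn + fp
actual_positives = fn + tp
print(actual_negatives, actual_positives)  # 206 102
```

So the model misses 56 of the 102 positive test samples, which the overall accuracy figure alone does not reveal.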


In [39]:
# Generate the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.85      0.80       206
           1       0.61      0.45      0.52       102

    accuracy                           0.72       308
   macro avg       0.68      0.65      0.66       308
weighted avg       0.71      0.72      0.71       308
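The per-class figures in the report can be re-derived by hand from the four confusion matrix counts. The sketch below does this in plain Python; the counts are copied from the matrix printed earlier, the variable names are illustrative, and the F1 score is computed as the harmonic mean of precision and recall.

```python
# Counts copied from the confusion matrix (class 1 treated as positive).
tn, fp, fn, tp = 176, 30, 56, 46

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1: the usual precision and recall.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0: the same formulas with the roles of the two classes swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The rounded values agree with the report: the overall accuracy of 0.72 hides a recall of only 0.45 on the diabetic class.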

