# K-Nearest Neighbors

## Table of Contents

1. [**Introduction**](#Intro)   
2. [**Data Normalization**](#NormStand)
3. [**Model Development Procedure**](#ModelDev)
4. [**Binary Classification**](#BinaryClass)

    4.1 [**Model Development**](#BinModelDev)

    4.2 [**Model Evaluation**](#BinModelEval)

5. [**Multiclass Classification**](#MultiClass)

    5.1 [**Model Development**](#MultModelDev)

    5.2 [**Model Evaluation**](#MultModelEval)
  
6. [**Final Comments**](#FinCom)


## 1 Introduction <a name="Intro"></a>

To implement the nearest neighbor rule, we first need to quantify the similarity between two data points. There are different ways to measure similarity or, equivalently, dissimilarity (distance). The algorithm works as follows:

1. Pick a value for k.
2. Calculate the distance from the unknown case to every case in the training data.
3. Select the k points in the training data that are "nearest" to the unknown data point.
4. Make a prediction using the most common target class among the k nearest neighbors.
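
The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the scikit-learn implementation: the function name `knn_predict`, the toy data, and the choice of Euclidean distance are assumptions made for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict the class of one query point by majority vote of its k nearest neighbors."""
    # Step 2: Euclidean distance from the query point to every training point
    distances = np.sqrt(((x_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: most common class among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters
x_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

# Step 1: pick k, then classify two unknown points
pred_a = knn_predict(x_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(x_toy, y_toy, np.array([4.9, 5.0]), k=3)
print(pred_a, pred_b)
```

Each query point is assigned the class of the cluster it sits in, because all three of its nearest neighbors belong to that cluster.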

## 2 Data Normalization <a name="NormStand"></a>

Many modeling algorithms perform better when numerical input variables are scaled to a standard range.

The most popular technique for scaling numerical data prior to modeling is normalization. Normalization scales each input variable separately to the range 0-1. You can normalize your data using: $x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$

You can normalize your variables using the scikit-learn class <font color='blue'>MinMaxScaler</font>.
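
As a quick check, the normalization formula can be applied by hand and compared with `MinMaxScaler`. The small array below is made up for illustration; each column is scaled separately using its own minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features (columns) on different scales
x = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [40.0, 300.0]])

# Manual min-max normalization, column by column
x_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same operation with scikit-learn
x_sklearn = MinMaxScaler().fit_transform(x)

print(x_manual)
print(np.allclose(x_manual, x_sklearn))
```

After scaling, the smallest value in each column maps to 0 and the largest maps to 1.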

<font color='orange'>__Note__</font>: There are multiple methods for normalization and scaling the data. You can see a list of these methods and how they affect the data here: 

Article: https://drive.google.com/file/d/19alt-Le4m6SQKFURmT8gt5Rfr0JIbAZp/view?usp=sharing

Code: https://www.kaggle.com/discdiver/guide-to-scaling-and-standardizing/notebook 


## 3 Model Development Procedure <a name="ModelDev"></a>

The following steps develop a KNN model using the <font color='blue'>scikit-learn</font> library:

__1.__ Import NumPy, along with `KNeighborsClassifier`, `MinMaxScaler`, and `train_test_split` from the <font color='blue'>scikit-learn</font> library:
```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler  # For normalization
import numpy as np
```

__2.__ Define the dependent variable (target variable) and the independent variables (features) from the data set:
```python
x_data=np.array(df[['feature1','feature2','feature3',...]])
y_data=df['target variable']
```

__3.__ Normalize your data using <font color='blue'>MinMaxScaler</font> (Optional but advised)
```python
MinMaxscaler = MinMaxScaler()  # define min max scaler object
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data
```

__4.__ Split the data into train and test sets: `x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data, test_size=[size of the test data])`

__5.__ Create the KNN object using the constructor with known value of k: `neigh = KNeighborsClassifier(n_neighbors = k)`

__6.__ Use the fit function to fit the model to the training data: `neigh.fit(x_train,y_train)`

__7.__ Then, make predictions using the test data and the training data:
```python
yhatTest = neigh.predict(x_test)
yhatTrain= neigh.predict(x_train)
```

As always, some exploratory data analysis is needed at the beginning. One useful step is to look at the target variable to see how many categories it has and how balanced the data is across those categories. This can be done using
```python
# To see the list of unique values for our target variables.
df['target variable'].unique()
# To see the distribution of the target variable in terms of different categories. 
# This gives you bar chart showing the distribution.
df['target variable'].value_counts().plot(kind='barh')
```


We are going to use part of the Fuel Economy Dataset, which is produced by the Office of Energy Efficiency and Renewable Energy of the U.S. Department of Energy. Fuel economy data are the result of vehicle testing done at the Environmental Protection Agency's National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA. This dataset can be accessed from here: https://github.com/MasoudMiM/ME_364/blob/main/EPA_Green_Vehicle_Guide/EPA_2020_Fuel_Economy.csv and a description of the data is provided at this link: https://www.fueleconomy.gov/feg/EPAGreenGuide/GreenVehicleGuideDocumentation.pdf

In [None]:
import pandas as pd

url = ('https://raw.githubusercontent.com/MasoudMiM/ME_364/main/EPA_Green_Vehicle_Guide/EPA_2020_Fuel_Economy.csv')
df = pd.read_csv(url)           

df.drop(columns='Unnamed: 0', inplace=True)
df.head()

Check to see if we have any missing values

In [None]:
df.isnull().sum().sum()

## 4 Binary Classification <a name="BinaryClass"></a>


If the target variable has two possible values (classes), the classification is a binary classification. Here, we first develop a model that classifies the cars by fuel type using City MPG and Comb CO2 as features. Fuel type in this dataset has only two possible classes, `Gasoline` and `Diesel`, so this is a binary classification.

### 4.1 Model Development <a name="BinModelDev"></a>

Step __1__, importing required libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler  # For normalization
import numpy as np

Step __2__, selecting our features and target variables

In [None]:
x_data=np.array(df[['City MPG','Comb CO2']])   # Feature variables
y_data=df['Fuel']                              # Target variable

Let's also take a look at the possible values for our target variable as well as the distribution of the category. This helps us find out how imbalanced our dataset is in terms of target variable.

In [None]:
df['Fuel'].unique()

There are two types of fuels.

In [None]:
print('Target variable distribution:')
print( df['Fuel'].value_counts() )

df['Fuel'].value_counts().plot(kind='barh')

Our data is imbalanced: there are many more cars with the Gasoline fuel type than cars using Diesel. This can affect model development and prediction, because a classifier can score well simply by favoring the majority class. To see how to deal with imbalanced data, see this: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
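
One simple precaution with imbalanced classes, which this notebook does not use, is to keep the class proportions the same in the training and test sets by passing `stratify` to `train_test_split`. The sketch below uses made-up labels (90 `Gasoline`, 10 `Diesel`) rather than the real dataset, and `random_state=0` is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 'Gasoline' and 10 'Diesel'
y_demo = np.array(['Gasoline'] * 90 + ['Diesel'] * 10)
x_demo = np.arange(100).reshape(-1, 1)   # placeholder feature column

# stratify=y_demo keeps the 90/10 class ratio in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

diesel_train = (y_tr == 'Diesel').sum()
diesel_test = (y_te == 'Diesel').sum()
print(diesel_train, diesel_test)
```

Without stratification, a random split of a small minority class can leave very few (or no) minority examples in the test set, which makes the evaluation scores hard to interpret.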

Step __3__, normalizing our feature variables

In [None]:
MinMaxscaler = MinMaxScaler()  # define min max scaler
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data 

Before we go further, let's take a look at our normalized features and compare them to the original features.

In [None]:
import matplotlib.pyplot as plt

fig=plt.figure(figsize=(15,6))
fig.add_subplot(1,2,1)
plt.scatter(x_data[:,0],x_data[:,1])
plt.xlabel('City MPG'),plt.ylabel('Comb CO2')
plt.title('Original Data')

fig.add_subplot(1,2,2)
plt.scatter(x_data_scaled[:,0],x_data_scaled[:,1])
plt.xlabel('City MPG'),plt.ylabel('Comb CO2')
plt.title('Normalized Data');

Step **4**, splitting our data into training and test sets

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data,test_size=0.2)

Steps __5__ and __6__, creating the KNN classifier and fitting the model to the training data. Here we use the _euclidean_ distance, but you can choose any distance metric from the list in the documentation page: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html 
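
To make the choice of metric concrete, here is a small sketch comparing the Euclidean distance with the Manhattan distance (another option accepted by `KNeighborsClassifier` as `metric='manhattan'`). The two points are made up for illustration.

```python
import numpy as np

# Two made-up points in a 2D feature space
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(((a - b) ** 2).sum())

# Manhattan distance: sum of absolute differences
manhattan = np.abs(a - b).sum()

print(euclidean, manhattan)   # 5.0 7.0
```

Different metrics can rank neighbors differently, so the set of k nearest neighbors, and therefore the prediction, may change with the metric.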

In [None]:
neigh = KNeighborsClassifier(n_neighbors = 4, metric='euclidean')

In [None]:
neigh.fit(x_train,y_train) 

Step __7__, making predictions using test data and training data

In [None]:
yhatTest = neigh.predict(x_test)
yhatTrain= neigh.predict(x_train)

<font color='orange'>__Note__:</font> Keep in mind that if you want to make a prediction for a specific new value, you need to scale it first with the same fitted scaler (`MinMaxscaler.transform`) before passing it to your model.
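
A minimal sketch of predicting for a single new value is shown below. The training data, class names (`low`, `high`), and variable names are made up for illustration; the key point is that the new point goes through `transform` of the already fitted scaler and is not used to re-fit it.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Made-up training data: two features on very different scales
x_raw = np.array([[15.0, 600.0], [18.0, 500.0], [20.0, 450.0],
                  [30.0, 300.0], [35.0, 250.0], [40.0, 220.0]])
y_raw = np.array(['low', 'low', 'low', 'high', 'high', 'high'])

# Fit the scaler on the training data, then train on the scaled features
scaler = MinMaxScaler().fit(x_raw)
model = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(x_raw), y_raw)

# New, unscaled observation: scale it with the SAME fitted scaler
new_point = np.array([[33.0, 260.0]])
new_point_scaled = scaler.transform(new_point)   # transform only, no re-fitting
prediction = model.predict(new_point_scaled)
print(prediction[0])
```

Passing the raw `new_point` directly to `model.predict` would compare unscaled values against scaled training data, so the distances would be meaningless.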

### 4.2 Model Evaluation <a name="BinModelEval"></a>

We can evaluate the performance of our KNN classifier using the Jaccard index, the F-score, or accuracy. We can also generate the confusion matrix to get a better sense of how the model performs in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
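
To show how these scores relate to TP, TN, FP, and FN, the sketch below computes them by hand from a confusion matrix and compares them with the scikit-learn functions. The labels are made up for illustration, and the scikit-learn calls use the default binary averaging (positive class `1`) rather than the `average='micro'` setting used in the cells of this notebook. With micro averaging on single-label data, the F-score equals the accuracy.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, jaccard_score)

# Made-up labels for illustration: 1 is the positive class, 0 the negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
jaccard = tp / (tp + fp + fn)                # overlap of true and predicted positives
f1 = 2 * tp / (2 * tp + fp + fn)             # harmonic mean of precision and recall

print(tn, fp, fn, tp)
print(accuracy, accuracy_score(y_true, y_pred))
print(jaccard, jaccard_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```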

In [None]:
from sklearn.metrics import accuracy_score 
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
acc_scoreTrain = accuracy_score(y_train,yhatTrain)
acc_scoreTest = accuracy_score(y_test,yhatTest)
print(f'accuracy score for training data is {acc_scoreTrain:0.3f} and accuracy score for test data is {acc_scoreTest:0.3f}')

In [None]:
J_scoreTrain = jaccard_score(y_train,yhatTrain, average='micro')
J_scoreTest = jaccard_score(y_test,yhatTest, average='micro')
print(f'Jaccard index for training data is {J_scoreTrain:0.3f} and Jaccard index for test data is {J_scoreTest:.3f}')

In [None]:
F_scoreTrain = f1_score(y_train,yhatTrain, average='micro')
F_scoreTest = f1_score(y_test,yhatTest, average='micro')
print(F_scoreTrain,F_scoreTest)

You can also calculate _precision_ and _recall_ using their corresponding functions. See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html and https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
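
As a short sketch of those two functions, the example below uses made-up string labels and treats `Diesel` as the positive class through the `pos_label` argument. This choice of positive class is an assumption made for this example.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 4 true 'Diesel' cars and 6 true 'Gasoline' cars
y_true = ['Diesel'] * 4 + ['Gasoline'] * 6
y_pred = ['Diesel'] * 3 + ['Gasoline'] * 5 + ['Diesel'] * 2

# Precision: of the cars predicted 'Diesel', how many really are 'Diesel'?
precision = precision_score(y_true, y_pred, pos_label='Diesel')

# Recall: of the cars that really are 'Diesel', how many did the model find?
recall = recall_score(y_true, y_pred, pos_label='Diesel')

print(precision, recall)
```

Here 3 of the 5 `Diesel` predictions are correct (precision 0.6), and 3 of the 4 actual `Diesel` cars are found (recall 0.75).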

In [None]:
print('Confusion matrix for training data')
CM_scoreTrain = confusion_matrix(y_train,yhatTrain)   # possible option normalize='true'
print(CM_scoreTrain)

print(40*'-')

print('Confusion matrix for test data')
CM_scoreTest = confusion_matrix(y_test,yhatTest)   # possible option normalize='true'
print(CM_scoreTest)

Plotting the confusion matrices for the training and test data:

In [None]:
dispTr = ConfusionMatrixDisplay(CM_scoreTrain,display_labels=neigh.classes_)
dispTr.plot()

dispTs = ConfusionMatrixDisplay(CM_scoreTest,display_labels=neigh.classes_)
dispTs.plot()

We used k=4. However, we can run the model for different values of k and see whether another value performs better.

In [None]:
Ks = 25
Jacc_Test = np.zeros((Ks-1))
Jacc_Train= np.zeros((Ks-1))

F_Loop_Test = np.zeros((Ks-1))
F_Loop_Train= np.zeros((Ks-1))

for n in range(1,Ks):
    
    #Train Model and Predict  
    neighLoop = KNeighborsClassifier(n_neighbors = n).fit(x_train,y_train)
    yhatTestLoop  = neighLoop.predict(x_test)
    yhatTrainLoop = neighLoop.predict(x_train)
    Jacc_Test[n-1] = jaccard_score(y_test, yhatTestLoop, average='micro')
    Jacc_Train[n-1] = jaccard_score(y_train, yhatTrainLoop, average='micro')

    F_Loop_Test[n-1] = f1_score(y_test, yhatTestLoop, average='micro')
    F_Loop_Train[n-1] = f1_score(y_train, yhatTrainLoop, average='micro')


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(13,5))
plt.subplot(1,2,1)
plt.plot(range(1,Ks),Jacc_Test,'r-o',range(1,Ks),Jacc_Train,'g-o')
plt.legend(['Test Data', 'Train Data'])
plt.xlabel('K value')
plt.ylabel('Jaccard Index');

plt.subplot(1,2,2)
plt.plot(range(1,Ks),F_Loop_Test,'r-o',range(1,Ks),F_Loop_Train,'g-o')
plt.legend(['Test Data', 'Train Data'])
plt.xlabel('K value')
plt.ylabel('F-score');

## 5 Multiclass Classification <a name="MultiClass"></a>

Let's develop the model using a target variable that has more than two possible values. In this case, we use `SmartWay`, which can be 'Yes', 'No', or 'Elite'. This is a multi-class classification.

### 5.1 Model Development <a name="MultModelDev"></a>

In [None]:
df['SmartWay'].unique()

Looking at the distribution:

In [None]:
print('Target variable distribution:')
print( df['SmartWay'].value_counts() )

df['SmartWay'].value_counts().plot(kind='barh')

Let's develop the model following the same steps as before. This time the target variable has three possible classes, so this will be a multiclass classifier.

In [None]:
x_dataM=np.array(df[['City MPG','Hwy MPG']])
y_dataM=df['SmartWay']

In [None]:
MinMaxscalerM = MinMaxScaler()  # define min max scaler
x_data_scaledM = MinMaxscalerM.fit_transform(x_dataM)  # transform data with the new scaler

In [None]:
x_trainM,x_testM,y_trainM,y_testM=train_test_split(x_data_scaledM,y_dataM,test_size=0.25)

In [None]:
neighM = KNeighborsClassifier(n_neighbors = 4, metric='euclidean')

In [None]:
neighM.fit(x_trainM,y_trainM)

In [None]:
yhatTestM = neighM.predict(x_testM)
yhatTrainM= neighM.predict(x_trainM)

### 5.2 Model Evaluation <a name="MultModelEval"></a>
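
With more than two classes, the `average` argument decides how per-class scores are combined. The sketch below uses made-up `SmartWay` labels to contrast `average='micro'` (used in the cells of this section) with `average='macro'`, which is shown only for comparison.

```python
from sklearn.metrics import f1_score

# Made-up labels: 'No' is the majority class
y_true = ['No'] * 8 + ['Yes'] * 2 + ['Elite'] * 2
y_pred = ['No'] * 8 + ['Yes', 'No'] + ['Elite', 'No']

# Micro: pools TP, FP, and FN over all classes (equals accuracy for single-label data)
f1_micro = f1_score(y_true, y_pred, average='micro')

# Macro: computes F1 per class, then takes the unweighted mean
f1_macro = f1_score(y_true, y_pred, average='macro')

print(round(f1_micro, 3), round(f1_macro, 3))
```

The macro score is lower here because mistakes on the small `Yes` and `Elite` classes count as much as the well-predicted majority class, which makes it a useful complement when the classes are imbalanced.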

In [None]:
J_scoreTrainM = jaccard_score(y_trainM,yhatTrainM, average='micro')
J_scoreTestM = jaccard_score(y_testM,yhatTestM, average='micro')
print(J_scoreTrainM,J_scoreTestM)

In [None]:
F_scoreTrainM = f1_score(y_trainM,yhatTrainM, average='micro')
F_scoreTestM = f1_score(y_testM,yhatTestM, average='micro')
print(F_scoreTrainM,F_scoreTestM)

In [None]:
CM_scoreTrainM = confusion_matrix(y_trainM,yhatTrainM)   # possible option normalize='true'
CM_scoreTestM = confusion_matrix(y_testM,yhatTestM)   # possible option normalize='true'

dispTrM=ConfusionMatrixDisplay(CM_scoreTrainM, display_labels=neighM.classes_)
dispTrM.plot()
dispTsM=ConfusionMatrixDisplay(CM_scoreTestM, display_labels=neighM.classes_) 
dispTsM.plot()

In [None]:
Ks = 25
Jacc_TestM = np.zeros((Ks-1))
Jacc_TrainM= np.zeros((Ks-1))

F_Loop_TestM = np.zeros((Ks-1))
F_Loop_TrainM= np.zeros((Ks-1))

for n in range(1,Ks):
    
    #Train Model and Predict  
    neighLoopM = KNeighborsClassifier(n_neighbors = n).fit(x_trainM,y_trainM)
    yhatTestLoopM  = neighLoopM.predict(x_testM)
    yhatTrainLoopM = neighLoopM.predict(x_trainM)
    Jacc_TestM[n-1] = jaccard_score(y_testM, yhatTestLoopM, average='micro')
    Jacc_TrainM[n-1] = jaccard_score(y_trainM, yhatTrainLoopM, average='micro')

    F_Loop_TestM[n-1] = f1_score(y_testM, yhatTestLoopM, average='micro')
    F_Loop_TrainM[n-1] = f1_score(y_trainM, yhatTrainLoopM, average='micro')

import matplotlib.pyplot as plt

plt.figure(figsize=(13,5))
plt.subplot(1,2,1)
plt.plot(range(1,Ks),Jacc_TestM,'r-o',range(1,Ks),Jacc_TrainM,'g-o')
plt.legend(['Test Data', 'Train Data'])
plt.xlabel('K value')
plt.ylabel('Jaccard Index');

plt.subplot(1,2,2)
plt.plot(range(1,Ks),F_Loop_TestM,'r-o',range(1,Ks),F_Loop_TrainM,'g-o')
plt.legend(['Test Data', 'Train Data'])
plt.xlabel('K value')
plt.ylabel('F-score');

## 6 Final Comments <a name="FinCom"></a>

In this notebook, we learned how to develop a KNN classifier and use it to make predictions. We also learned about data normalization by scaling the data to the [0, 1] range. Finally, we used several functions to evaluate a classifier. A key step in developing a KNN classifier is finding an appropriate value for _k_.
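
As an alternative to the manual loop over k used in this notebook, the search for k can be sketched with cross-validation through `GridSearchCV`. This is a different technique from the one used above: it scores each k on several train/validation folds instead of a single test set. The synthetic data from `make_classification`, the 5-fold setting, and the accuracy scoring are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data for illustration only (not the Fuel Economy dataset)
x_syn, y_syn = make_classification(n_samples=300, n_features=4,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)

# Pipeline: the scaler is re-fit on the training part of every fold
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())

# Try k = 1, ..., 24, the same range as the loops above
param_grid = {'kneighborsclassifier__n_neighbors': list(range(1, 25))}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(x_syn, y_syn)

best_k = search.best_params_['kneighborsclassifier__n_neighbors']
print(best_k, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means the minimum and maximum are learned only from the training folds, so no information from the validation fold leaks into the scaling.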


