<center> <h1>👪 K-Nearest-Neighbours 👪 </h1> </center>

<p> <center> This notebook is in <span style="color: green"> <b> Active </b> </span> state of development! </center> </p>  
<p> <center> Be sure to check out my other notebooks for <span style="color: blue"> <b> knowledge, insight and laughter </b> </span>! 🧠💡😂</center> </p> 

<center> <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSfDjqlgh8f1Py0JpOj7GKmGHLeawf4TKsBeQ&usqp=CAU" width="400" height="400" /> </center>

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

# Aim

The aim is to provide, from scratch, code implementations of the k-nearest-neighbours (kNN) algorithm for classification problems. This will involve both the main functions needed to build a kNN classifier and some additional utility functions as well.

**Note**: We will not be diving into in-depth exploratory data analysis, feature engineering etc... in these notebooks and so will not be commenting extensively on things such as skewness, kurtosis, homoscedasticity etc...

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

# Background

The kNN algorithm can be used both for classification and regression. 

1. First, it calculates the distance from a given point $x$ to every other point in the data set. 
2. Then, it finds the _k_ points nearest to $x$ and assigns $x$ to the majority class among them _(classification)_. 

For example, if two of the _k_=3 closest points to $x$ are red and one is blue, $x$ is classified as red.

In _regression_, on the other hand, we treat the labels as continuous variables and assign to a data point $x$ the mean of the labels of its _k_ nearest neighbours. 
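
The two prediction rules can be sketched on a toy example. The neighbour labels below are hypothetical values chosen for illustration, not data from this notebook.

```python
from collections import Counter
from statistics import mean

# Hypothetical labels of the k = 3 nearest neighbours of a point x
neighbour_classes = ["red", "red", "blue"]   # classification labels
neighbour_values = [5.0, 6.0, 7.0]           # regression labels

# Classification: x takes the most common class among its neighbours
predicted_class = Counter(neighbour_classes).most_common(1)[0][0]

# Regression: x takes the mean of its neighbours' labels
predicted_value = mean(neighbour_values)

print(predicted_class, predicted_value)
```

Only the final aggregation step differs between the two cases; the neighbour search itself is the same.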

<center> <img src="https://machinelearningknowledge.ai/wp-content/uploads/2018/08/KNN-Classification.gif" width="600" height="600" /> </center>

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

## Import Modules

In [None]:
# Importing standard packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import copy
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Data Collection

In [None]:
# Import dataset
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
# Display dataframe
df

# Data Processing

In [None]:
# Check for Nulls and dtype of dataframe
df.info()

In [None]:
# Check for NaNs
df.isna().sum()

In [None]:
# Check for duplicates
df.duplicated().sum()

In [None]:
# Drop NaNs and duplicates
df = df.dropna().drop_duplicates().reset_index(drop=True)

Let us look at the data **before** applying any further pre-processing steps (such as feature scaling):

In [None]:
# Overall statistics
df.describe()

In [None]:
# Bar chart of class ratio 
quality_labels = pd.DataFrame(np.sort(df['quality'].unique()),columns= ['Quality'])
quality_num = pd.DataFrame(np.bincount(df['quality'])[np.bincount(df['quality'])>0], columns= ['Quantity'])
quality_percent = quality_num/len(df)*100
quality_percent.columns = ['Percentage']
target_pd = pd.concat((quality_labels, quality_num, quality_percent), axis=1)
# Plot barchart
fig = plt.figure(figsize = (10, 5))
plt.bar(list(target_pd['Quality']), target_pd['Quantity'], color= ["Green", "Red", "Blue", "Maroon", "Purple", "Orange"], width = 0.4)
plt.xlabel("Wine Quality")
plt.ylabel("Number of quality wine")
plt.title("Distribution of wine qualities");
# Print the dataframe
target_pd

It is clear that the target variable, which is multi-class, has a skewed distribution across the wine quality classes. 

In [None]:
# Histogram of features (check for skew)
fig=plt.figure(figsize=(20,20))
for i, feature in enumerate(df.columns):
    ax=fig.add_subplot(8,4,i+1)
    df[feature].hist(bins=20,ax=ax,facecolor='black')
    ax.set_title(feature+" Distribution",color='DarkRed')
    ax.set_yscale('log')
fig.tight_layout()  

In [None]:
# Check for correlation
fig=plt.figure(figsize=(20,20))
sns.heatmap(df.corr(), annot = True, cmap="tab20c");
fig.tight_layout()  

## Splitting dataset

For most machine learning models, we would like low bias and low variance: the model should perform well on the training set (low bias) and also on the test set, along with other new random test sets (low variance). Therefore, to test the bias and variance of our model, we shall split the dataset into a training set and a test set. We will not be tuning any hyperparameters (and thus do not need a validation set). 

For these functions, the $X$ dataset (of features) does not need a column of 1's as its first column, as there is no bias term. One should check the order of magnitude of the features: if they differ hugely, feature scaling must be applied, because otherwise the features with the largest values dominate the distance calculation. Having looked at the data, it is clear that the orders of magnitude of some of the features are very different, so we must perform feature scaling. 
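
To see why differing magnitudes matter for a distance-based method, here is a minimal sketch on made-up numbers. `X_toy` and `nearest_to_first` are hypothetical names used only for this illustration.

```python
import numpy as np

# Hypothetical data: column 0 is of order 1, column 1 is of order 10-100
X_toy = np.array([[3.0, 30.0],
                  [3.8, 32.0],
                  [3.1, 60.0],
                  [3.4, 45.0]])

def nearest_to_first(data):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    dist = np.sqrt(np.sum((data[1:] - data[0]) ** 2, axis=1))
    return int(np.argmin(dist)) + 1

# Standardise each column to zero mean and unit standard deviation
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

nearest_raw = nearest_to_first(X_toy)
nearest_scaled = nearest_to_first(X_scaled)
print(nearest_raw, nearest_scaled)
```

On the raw values the nearest neighbour is decided almost entirely by the second column. After scaling, both columns contribute comparably and a different row becomes the nearest.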

In [None]:
# Create X (features) and y (target) dataset
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

In [None]:
def feature_scaling(X: pd.DataFrame) -> pd.DataFrame:
    
    """ Normalises the features in X (dataframe) and returns a normalized version of X where the mean value of each feature is 0 and the standard deviation is 1. """
    
    # Return normalised data
    return (X - np.mean(X, axis=0))/np.std(X, axis=0, ddof=0)

In [None]:
# Create normalised data
X = feature_scaling(X)
# Split the dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/4, random_state=42, stratify=y)
# Re-index
X_train = X_train.reset_index(drop=True) 
y_train = y_train.reset_index(drop=True) 
X_test = X_test.reset_index(drop=True) 
y_test = y_test.reset_index(drop=True) 

# Euclidean Distance

We know that the kNN algorithm is based on computing distances between data points. For simplicity, we will only work with **Euclidean distances** in this notebook, but other distances can of course be used interchangeably, e.g. Hamming, Minkowski or Manhattan.

**Note:** Choosing a different distance metric can change which points count as nearest neighbours, and therefore the accuracy. 

The Euclidean distance $d$, is defined as
$$
d(\boldsymbol p, \boldsymbol q) = \sqrt{\sum_{i=1}^D{(q_i-p_i)^2}} \, ,
$$
where $\boldsymbol p$ and $\boldsymbol q$ are the two points in our $D$-dimensional Euclidean space.
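
As a worked instance of the formula, and of how it relates to the other metrics mentioned above, here is a small sketch of the Minkowski distance. `minkowski_distance` and its `order` parameter are names chosen here for illustration; order 2 gives the Euclidean distance and order 1 the Manhattan distance.

```python
import numpy as np

def minkowski_distance(p, q, order=2):
    """Minkowski distance between two 1-D points p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.abs(q - p) ** order) ** (1.0 / order))

p = [0.0, 0.0]
q = [3.0, 4.0]

euclidean = minkowski_distance(p, q, order=2)  # sqrt(3^2 + 4^2)
manhattan = minkowski_distance(p, q, order=1)  # |3| + |4|
print(euclidean, manhattan)
```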

In [None]:
def euclidean_distance(p: pd.Series, q: pd.DataFrame) -> pd.Series:

    """ Return the Euclidean distance from the point p to every row of q. """
        
    return np.sqrt(np.sum((p-q)**2, axis=1))

Due to the varying labels in the target variable, we need to decide whether to treat the problem as binary classification or multi-class classification. 

With binary classification, we choose a threshold on wine quality, e.g. any wine quality equal to or above 6 is encoded as 1 (good quality) and anything below 6 is encoded as 0 (bad quality). With multi-class classification, the task is to predict the quality label itself, relying on wines that are "close" to each other in multi-dimensional space sharing similar qualities. 

# K-Nearest-Neighbours

For every test data point, we find its _k_ nearest neighbours in the training set. The majority label among these _k_ closest training points determines the label of the test point. 
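
A minimal NumPy sketch of this neighbour search, assuming plain arrays rather than the dataframes used in this notebook. `k_nearest_indices` and the toy data are hypothetical and only meant to show the idea.

```python
import numpy as np

def k_nearest_indices(train, test, k):
    """Indices of the k nearest training rows for each test row."""
    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features)
    diff = test[:, None, :] - train[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))
    # Stable sort, so equal distances are ordered by the lower training index
    return np.argsort(dist, axis=1, kind="stable")[:, :k]

train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
test = np.array([[0.2, 0.1], [5.4, 5.0]])

neighbours = k_nearest_indices(train, test, k=2)
print(neighbours)
```

Each row of `neighbours` lists the training indices closest to the corresponding test point, nearest first.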

In [None]:
def kNN(X_train: pd.DataFrame, X_test: pd.DataFrame, k: int, return_distance: bool) -> np.ndarray:
    
    """ Return the indices (and optionally the distances) of the k nearest training points for each test point. """
    
    dist = []
    neigh_ind = []
    # Compute distance from each point x_test in X_test to all points in X_train 
    point_dist = [euclidean_distance(X_test.loc[i], X_train) for i in range(len(X_test))]  
    # Determine which k training points are closest to each test point
    for row in point_dist:
        enum_neigh = enumerate(row)
        sorted_neigh = sorted(enum_neigh, key=lambda x: x[1])[:k]
        ind_list = [tup[0] for tup in sorted_neigh]
        dist_list = [tup[1] for tup in sorted_neigh]
        dist.append(dist_list)
        neigh_ind.append(ind_list)
    # Return distances together with indices of k nearest neighbours
    if return_distance:
        return np.array(dist), np.array(neigh_ind)
    return np.array(neigh_ind)

Once we know which _k_ neighbours are closest to our test points (from the training set), we can predict the labels of these test points.
The `predict` function determines how any point $x_\text{test}$ in the test set is classified. Here, we only consider the case where each of the *k* neighbours contributes equally to the classification of $x_\text{test}$.
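
The equal-weight vote can be sketched on a toy example using `np.bincount` and `np.argmax`. The labels and neighbour indices below are hypothetical.

```python
import numpy as np

# Hypothetical training labels and neighbour indices (k = 4) for two test points
y_toy = np.array([5, 6, 6, 7, 5, 7])
neighbour_idx = np.array([[1, 2, 3, 0],   # labels 6, 6, 7, 5: clear majority
                          [0, 4, 3, 5]])  # labels 5, 5, 7, 7: a tie

# Count the votes per label and pick the label with the most votes
y_pred_toy = np.array([np.argmax(np.bincount(y_toy[idx])) for idx in neighbour_idx])
print(y_pred_toy)
```

Note that `np.argmax` returns the first maximum, so a tied vote resolves to the smallest label. Choosing an odd _k_ avoids ties in the binary case, but not in the multi-class case.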

# Classification Prediction

In [None]:
def predict(X_train: pd.DataFrame, X_test: pd.DataFrame, y_train: pd.Series, k: int) -> np.array:
  
    """ Return label predictions for test set. """
    
    # Each of the k neighbours contributes equally to the classification of any data point in X_test  
    neighbours = kNN(X_train, X_test, k, False)
    # Count the occurrences of each label among the neighbours and take the most common one
    y_pred = np.array([np.argmax(np.bincount(y_train[neighbour])) for neighbour in neighbours]) 
    return y_pred

We now create an accuracy function that calculates the percentage of labels we have correctly classified.

# Classification Score

In [None]:
def score(X_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.Series, y_train: pd.Series, k: int) -> float:
    
    """ Return mean accuracy of test set. """
    
    y_pred = predict(X_train, X_test, y_train, k=k) 
    return 100*float(np.sum(y_pred==y_test))/float(len(y_test))

# Full K-Nearest-Neighbours Model

In [None]:
class K_Nearest_Neighbours():
    
    def __init__(self):
    
        """ Initialise parameters. """
       
        self.neighbours = None
        
    def fit(self, X_train: pd.DataFrame, X_test: pd.DataFrame, k: int, return_distance: bool) -> np.array:
    
        """ Fit kNN model. """
        
        dist = []
        neigh_ind = []
        # Compute distance from each point x_test in X_test to all points in X_train 
        point_dist = [self.euclidean_distance(X_test.loc[i], X_train) for i in range(len(X_test))] 
        # Determine which k training points are closest to each test point
        for row in point_dist:
            enum_neigh = enumerate(row)
            sorted_neigh = sorted(enum_neigh, key=lambda x: x[1])[:k]
            ind_list = [tup[0] for tup in sorted_neigh]
            dist_list = [tup[1] for tup in sorted_neigh]
            dist.append(dist_list)
            neigh_ind.append(ind_list)
        # Store the indices of the k nearest neighbours for use in predict
        self.neighbours = np.array(neigh_ind)
        # Optionally return the distances together with the indices
        if return_distance:
            return np.array(dist), self.neighbours
    
    def predict(self, y_train: pd.Series) -> np.array:
  
        """ Return label predictions for test set. """

        # Count the occurrences of each label among the neighbours and take the most common one
        y_pred = np.array([np.argmax(np.bincount(y_train[neighbour])) for neighbour in self.neighbours]) 
        return y_pred
    
    def euclidean_distance(self, p: pd.Series, q: pd.DataFrame) -> pd.Series:

        """ Return the Euclidean distance from the point p to every row of q. """
        
        return np.sqrt(np.sum((p-q)**2, axis=1))

    def score(self, y_true: pd.Series, y_pred: pd.Series) -> float:
    
        """ Return mean accuracy test set. """
    
        return 100*float(np.sum(y_pred==y_true))/float(len(y_true))

# Model Testing and Results

## Binary Classification

For binary, we will classify wine quality into two categories:
1. Quality of 6 or below is Bad.
2. Quality above 6 is Good.

In [None]:
# Create new y target labels which are binary
y_bin_train = np.where(y_train > 6, 1, 0)
y_bin_test = np.where(y_test > 6, 1, 0)

In [None]:
# Instantiate model
wine_model_bin = K_Nearest_Neighbours()
# Fit model to training and test dataset 
k_neighbours = np.arange(1,51,1)
accuracy_score_train_bin = []
accuracy_score_test_bin = []
for k in k_neighbours:
    # Train data
    wine_model_bin.fit(X_train, X_train, k, return_distance=False)
    y_pred_train = wine_model_bin.predict(y_train)
    y_pred_train_bin = np.where(y_pred_train > 6, 1, 0)
    accuracy_score_train_bin.append(wine_model_bin.score(y_bin_train, y_pred_train_bin))
    # Test data
    wine_model_bin.fit(X_train, X_test, k, return_distance=False)
    y_pred_test = wine_model_bin.predict(y_train)
    y_pred_test_bin = np.where(y_pred_test > 6, 1, 0)
    accuracy_score_test_bin.append(wine_model_bin.score(y_bin_test, y_pred_test_bin))

In [None]:
# Plot accuracy scores for training set
plt.plot(k_neighbours, accuracy_score_train_bin, marker = 'o',  mfc = 'r', mec = 'b')
plt.xlabel("K-Neighbours")
plt.ylabel("Accuracy Score (%)")
plt.title("Accuracy Score for varying k-neighbours in training set");

In [None]:
# Plot accuracy scores for test set
plt.plot(k_neighbours, accuracy_score_test_bin, marker = 'o',  mfc = 'r', mec = 'b')
plt.xlabel("K-Neighbours")
plt.ylabel("Accuracy Score (%)")
plt.title("Accuracy Score for varying k-neighbours in test set");

In [None]:
print(f"Optimal K-Neighbours is: {[i+1 for i, j in enumerate(accuracy_score_test_bin) if j == max(accuracy_score_test_bin)]}, with a mean accuracy of {round(accuracy_score_test_bin[np.argmax(accuracy_score_test_bin)],1)}%")

We will take _k_ = 5 as our answer, since the smallest of the best-scoring values of _k_ will provide us with the quickest runtime. 
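
When several values of _k_ share the best score, the smallest one can be picked programmatically. The sketch below uses made-up accuracies (not results from this notebook) and hypothetical variable names.

```python
import numpy as np

k_values = np.arange(1, 11)
# Hypothetical test accuracies (%), one per k
scores = np.array([80.0, 81.5, 83.0, 82.0, 84.5, 84.5, 83.0, 84.5, 82.5, 81.0])

# All values of k that reach the best score
tied_k = k_values[np.flatnonzero(scores == scores.max())]

# np.argmax returns the first maximum, i.e. the smallest tied k
best_k = int(k_values[np.argmax(scores)])
print(tied_k, best_k)
```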

In [None]:
# Print classification report for optimal k-nearest-neighbours baseline model
opt_baseline_wine_model_bin = K_Nearest_Neighbours()
opt_baseline_wine_model_bin.fit(X_train, X_test, k=5, return_distance=False)
y_pred_test = opt_baseline_wine_model_bin.predict(y_train)
y_pred_test_bin = np.where(y_pred_test > 6, 1, 0)
pd.DataFrame(classification_report(y_bin_test, y_pred_test_bin, output_dict=True, zero_division=0))

## Multi-Class Classification

For multi, we will classify wine quality into their unique respective wine quality categories i.e. Quality 3,4,5,6,7 and 8. In order to find the optimal _k_ neighbours, we will return the _k_ which gives us the largest accuracy score. 

In [None]:
# Instantiate model
wine_model_multi = K_Nearest_Neighbours()
# Fit model to training and test dataset 
k_neighbours = np.arange(1,51,1)
accuracy_score_train_multi = []
accuracy_score_test_multi = []
for k in k_neighbours:
    # Train data
    wine_model_multi.fit(X_train, X_train, k, return_distance=False)
    y_pred_train = wine_model_multi.predict(y_train)
    accuracy_score_train_multi.append(wine_model_multi.score(y_train, y_pred_train))
    # Test data
    wine_model_multi.fit(X_train, X_test, k, return_distance=False)
    y_pred_test = wine_model_multi.predict(y_train)
    accuracy_score_test_multi.append(wine_model_multi.score(y_test, y_pred_test))

In [None]:
# Plot accuracy scores for training set
plt.plot(k_neighbours, accuracy_score_train_multi, marker = 'o',  mfc = 'r', mec = 'b')
plt.xlabel("K-Neighbours")
plt.ylabel("Accuracy Score (%)")
plt.title("Accuracy Score for varying k-neighbours in training set");

This graph makes a lot of sense. We compute the distances of each training set data point to every other training set data point - thus it is obvious why for _k_=1, we have 100% accuracy because every training data point is its own nearest neighbour. When we have _k_=2, then we are accounting for the first two closest neighbours (which will be the original data point + the next closest) etc...
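
The _k_=1 case can be checked on synthetic data. This is a sketch with randomly generated features and labels (hypothetical, not the wine data), and it assumes that no two rows with identical features carry different labels.

```python
import numpy as np

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(30, 3))        # hypothetical features
y_syn = rng.integers(0, 3, size=30)     # hypothetical labels in {0, 1, 2}

# Distance from every training point to every training point
dist = np.sqrt(((X_syn[:, None, :] - X_syn[None, :, :]) ** 2).sum(axis=2))
order = np.argsort(dist, axis=1, kind="stable")

def train_accuracy(k):
    """Accuracy (%) when the training set is used to predict itself."""
    pred = np.array([np.argmax(np.bincount(y_syn[row[:k]])) for row in order])
    return 100 * np.mean(pred == y_syn)

acc_k1 = train_accuracy(1)
print(acc_k1)
```

Every point is at distance zero from itself, so its first neighbour is always itself and the _k_=1 prediction simply reproduces its own label.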

In [None]:
# Plot accuracy scores for test set
plt.plot(k_neighbours, accuracy_score_test_multi, marker = 'o',  mfc = 'r', mec = 'b')
plt.xlabel("K-Neighbours")
plt.ylabel("Accuracy Score (%)")
plt.title("Accuracy Score for varying k-neighbours in test set");

We are more concerned with how the model does on the test set, so we look for the optimal _k_ neighbours from these accuracies. As mentioned before, we have not applied any hyperparameter tuning, so this baseline model can be improved! 

In [None]:
print(f"Optimal K-Neighbours is: {[i+1 for i, j in enumerate(accuracy_score_test_multi) if j == max(accuracy_score_test_multi)]}, with a mean accuracy of {round(accuracy_score_test_multi[np.argmax(accuracy_score_test_multi)],1)}%")

In [None]:
# Print classification report for optimal k-nearest-neighbours baseline model
opt_baseline_wine_model_multi = K_Nearest_Neighbours()
opt_baseline_wine_model_multi.fit(X_train, X_test, k=8, return_distance=False)
y_pred_test_multi = opt_baseline_wine_model_multi.predict(y_train)
pd.DataFrame(classification_report(y_test, y_pred_test_multi, output_dict=True, zero_division=0))

**Note:** The inbuilt KNeighborsClassifier() from sklearn provides the exact same results (when using Euclidean distance)! 
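
One way such a comparison could be carried out is sketched below on synthetic data (hypothetical, not the wine data). It uses two classes and an odd _k_ so that no vote ties arise, since tie handling is one place where a from-scratch version and a library implementation might differ.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X_tr = rng.normal(size=(60, 4))          # hypothetical training features
y_tr = rng.integers(0, 2, size=60)       # hypothetical binary labels
X_te = rng.normal(size=(15, 4))          # hypothetical test features
k = 5

# From-scratch prediction: Euclidean distances, k nearest, majority vote
dist = np.sqrt(((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2))
idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
scratch_pred = np.array([np.argmax(np.bincount(y_tr[row])) for row in idx])

# scikit-learn prediction with the same k and metric
sk_model = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
sk_pred = sk_model.fit(X_tr, y_tr).predict(X_te)

agreement = float(np.mean(scratch_pred == sk_pred))
print(agreement)
```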

# Summary

- A big assumption of kNN is that the label of a data point can be inferred from its neighbours, which is not always accurate if some data points are anomalous or outliers. 
- It is clear in multi-class classification that when the test set contains few samples of certain classes, the kNN model does not perform well on them (the same holds in the binary case). 

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

## Extra

Some comments about the code implementations:

1. If dealing with arrays rather than dataframes, some of the functions may need altering to account for dimension/shape issues. For example, the Euclidean distance is implemented with dataframes, so the axis might need to be altered (or even removed) when working with arrays, and the `fit` method would need `.loc` removed from `X_test` in the `point_dist` calculation. 
2. To debug this, it helps to print out `point_dist` together with `neigh_ind` so you can cross-reference them against their respective target labels. 
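
Both comments can be illustrated with a short array-based sketch. The toy data and the names `X_toy`, `y_toy` and `x_query` are hypothetical; `point_dist` and `neigh_ind` mirror the names used in the functions above.

```python
import numpy as np

X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [0.5, 0.0]])
y_toy = np.array([5, 6, 7, 5])
x_query = np.array([0.4, 0.1])   # a single test point as a 1-D array
k = 3

# With plain arrays, broadcasting subtracts x_query from every row;
# axis=1 then sums over the features, giving one distance per training row
point_dist = np.sqrt(np.sum((X_toy - x_query) ** 2, axis=1))
neigh_ind = np.argsort(point_dist, kind="stable")[:k]

# Cross-reference each neighbour's index, distance and label
for i in neigh_ind:
    print(f"index={i}  distance={point_dist[i]:.3f}  label={y_toy[i]}")
```

If both inputs were single 1-D points, the `axis=1` argument would have to be dropped, which is the kind of shape adjustment the first comment refers to.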

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

Thanks for reading this notebook. If there are any mistakes or things that need more clarity, feel free to respond in the comment section and I will be happy to reply.

As always, please leave an upvote - it would also be helpful if you cite this documentation if you are going to use any of the code. 😊

#CodeWithSid