# Supervised Learning with KNNs and the Iris Dataset

In this demo we will be using the **[Iris Dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)** to build a supervised learning model for a multiclass classification task.

As discussed in the session, the dataset consists of 
- 3 different types of iris flowers (setosa, versicolour, and virginica), with 
- 4 features (petal length, petal width, sepal length, and sepal width), and 
- 50 samples for each flower, 
- stored in a 150x4 numpy.ndarray

For this demo, we'll be using the **K-Nearest Neighbour (KNN)** model. 

The following will be covered in this demo: 
- Initial setup 
- Loading and exploring the data
- Splitting the dataset into train and test datasets
- Creating a KNN classifier
- Evaluating the performance of the classifier
- Improving the classifier

***

## Step 0) Initial Setup

Install the required libraries if you don't already have them installed. To install them, uncomment the following lines from the code block and run it.

In [None]:
# !pip install numpy==1.26.*
# !pip install pandas==2.1.*
# !pip install matplotlib==3.8.*
# !pip install seaborn==0.12.*
# !pip install scikit-learn==1.3.*

Import required libraries. We will be using the following: 
- [NumPy](https://numpy.org/doc/stable/index.html)
- [pandas](https://pandas.pydata.org/)
- [matplotlib](https://matplotlib.org/stable/tutorials/pyplot.html)
- [seaborn](https://seaborn.pydata.org/)
- [scikit-learn](https://scikit-learn.org/stable/index.html)

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

# Print the versions of the libraries you've just imported
print(f'numpy: {np.__version__}')
print(f'pandas: {pd.__version__}')
print(f'matplotlib: {matplotlib.__version__}')
print(f'seaborn: {sns.__version__}')
print(f'sklearn: {sklearn.__version__}')

Set a constant for the [random state](https://medium.com/mlearning-ai/what-the-heck-is-random-state-24a7a8389f3d). 

This is not required, but since some algorithms rely on random number generators, their outputs can vary across runs. Passing the same random state to those functions makes the outputs reproducible.

Any guesses on why it's set to [42](https://grsahagian.medium.com/what-is-random-state-42-d803402ee76b)?

In [None]:
RANDOM_STATE = 42
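
As a minimal sketch of what a fixed seed buys us, the snippet below creates two NumPy random generators with the same seed and checks that they produce the same numbers. Using `np.random.default_rng` is an illustrative choice for this sketch; the scikit-learn functions in this demo receive the seed through their `random_state` parameter instead.

```python
import numpy as np

RANDOM_STATE = 42

# Two generators seeded with the same value produce the same sequence
rng_a = np.random.default_rng(RANDOM_STATE)
rng_b = np.random.default_rng(RANDOM_STATE)
draw_a = rng_a.integers(0, 100, size=5)
draw_b = rng_b.integers(0, 100, size=5)
print(draw_a, draw_b)

# A generator with a different seed will usually produce different numbers
rng_c = np.random.default_rng(RANDOM_STATE + 1)
draw_c = rng_c.integers(0, 100, size=5)
print(draw_c)
```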

***

## Step 1) Load and explore the data

We will be using the [Iris Dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) for this demo. It can be loaded directly from [scikit-learn](https://scikit-learn.org/stable/index.html).

In [None]:
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()

In [None]:
type(iris)

The dataset comes with lots of useful information, including things like the description of the dataset, names of the categories (targets), and names of the features.

In [None]:
iris.keys()

In [None]:
print(iris['DESCR'])

In [None]:
iris['target_names']

In [None]:
iris['feature_names']

Create a pandas DataFrame from the dataset. This way we can visualise the data more easily. 
Since the 'targets' in the dataset are numeric by default (integer class labels 0, 1, and 2; note that this is not a one-hot encoding), we can apply some logic to create a column with text categories so it's more readable. 

In [None]:
iris_df = pd.DataFrame(data = iris['data'], columns = iris['feature_names'])
iris_df['Iris type'] = iris['target']
iris_df['Iris name'] = iris_df['Iris type'].apply(lambda x: 'Setosa' if x == 0 else ('Versicolour' if x == 1 else 'Virginica'))
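
As an alternative sketch, assuming the lowercase names stored in the dataset itself are acceptable, the integer targets can be used to index `target_names` directly. The `labels_df` DataFrame is a hypothetical name used only for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# target_names is an array of class names; indexing it with the
# integer targets gives one name per sample
names = iris['target_names'][iris['target']]

labels_df = pd.DataFrame({'Iris type': iris['target'], 'Iris name': names})
print(labels_df['Iris name'].value_counts())
```

This avoids hard-coding the label-to-name mapping, at the cost of using the dataset's own spelling of the names.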

In [None]:
iris_df.sample(100)

In [None]:
sns.FacetGrid(iris_df, hue="Iris name", height=4, aspect=1.5) \
   .map(plt.scatter, "sepal length (cm)", "sepal width (cm)") \
   .add_legend()
plt.show()

In [None]:
sns.FacetGrid(iris_df, hue="Iris name", height=4, aspect=1.5) \
   .map(plt.scatter, "petal length (cm)", "petal width (cm)") \
   .add_legend()
plt.show()

***

## Step 2) Split the dataset into train and test datasets

Since we have a single dataset, we will need to split it into train and test datasets. 

It is very **important** to make the split before applying any transformations to the dataset or feeding it into the model. This is because we do not want any information from the test dataset to influence any part of the training process.

First, let's create X (samples/inputs) and y (targets/outputs) from the iris dataset.
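
A minimal sketch of this ordering, assuming feature standardisation as the transformation (`StandardScaler` is an illustrative choice and not part of this demo): the scaler's statistics are learned from the training split only and then reused on the test split. The `train_test_split` function used here is introduced later in this step.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1) Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2) Fit the transformation on the training split only
scaler = StandardScaler().fit(X_train)

# 3) Apply the already-fitted transformation to both splits
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training features are centred at 0; test features are usually only close to 0,
# because the test data did not contribute to the mean and scale
print(np.round(X_train_scaled.mean(axis=0), 3))
print(np.round(X_test_scaled.mean(axis=0), 3))
```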

In [None]:
X = iris.data
y = iris.target

The easiest way to create train and test datasets is to use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from sklearn.

In [None]:
from sklearn.model_selection import train_test_split

We can create a test dataset using 20% of the entire dataset. That means 80% of the data will be used for training. If we need a validation split as well, we can pass X_train and y_train to the same function to create another split.

Notice that we are setting the random state here. This is so that we can reproduce the same train test split each time we call this function - makes things easier in demos! 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)
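
The validation split mentioned above could be sketched as follows. The names `X_train_full` and `X_val` and the 0.25 fraction are assumptions made for this illustration; any fraction that suits the task would work.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42
X, y = load_iris(return_X_y=True)

# First split: hold out 20% of all samples as the test set
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

# Second split: carve a validation set out of the remaining training data.
# test_size=0.25 of the remaining 80% is 20% of the original dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=RANDOM_STATE
)

print(len(X_train), len(X_val), len(X_test))
```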

Let's create a simple function to count the frequency of classes in each split. 

In [None]:
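def count_frequency_of_classes(labels):
    """Print how many samples of each class appear in an array of integer labels."""
    print('# Samples of 0: ', np.count_nonzero(labels == 0))
    print('# Samples of 1: ', np.count_nonzero(labels == 1))
    print('# Samples of 2: ', np.count_nonzero(labels == 2))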

In [None]:
count_frequency_of_classes(y_train)

In [None]:
count_frequency_of_classes(y_test)

We can see that the number of samples from each class is not the same across this train/test split. In some cases this might be desired; in others it might not be. 

If we want each split to keep the same class proportions as the full dataset (equal counts here, since the iris classes are balanced), we can use the **stratify** parameter of the train_test_split function.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)

In [None]:
count_frequency_of_classes(y_train)

In [None]:
count_frequency_of_classes(y_test)
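
As a compact cross-check of the stratified split, the sketch below counts the classes with `np.bincount` instead of the helper function above. The variable names `y_train_s` and `y_test_s` are hypothetical and used only here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# np.bincount counts the occurrences of each non-negative integer label
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print('train:', train_counts)
print('test: ', test_counts)
```

Because the full dataset has 50 samples per class, a stratified 80/20 split should give 40 samples per class for training and 10 per class for testing.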

***

## Step 3) Create the KNN classifier

We can import the KNN classifier from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

The classifier takes a few hyperparameters. The main ones are: 
- **n_neighbors**: Value of K for the classifier. Number of neighbors to use by default for kneighbors queries
- **weights**: Weight functions used for prediction. 
    - uniform: All points in each neighborhood are weighted equally.
    - distance: Points are weighted by the inverse of their distance. In this case, closer neighbors of a query point have a greater influence than neighbors that are further away.
- **p**: Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
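
To make the **p** parameter concrete, here is a small sketch of the Minkowski distance written with NumPy. The function `minkowski_distance` below is a hypothetical helper for illustration, not the routine scikit-learn uses internally.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D points."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d1 = minkowski_distance(a, b, p=1)  # Manhattan: |3| + |4|
d2 = minkowski_distance(a, b, p=2)  # Euclidean: sqrt(3^2 + 4^2)
print(d1, d2)
```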

In [None]:
knn = KNeighborsClassifier(n_neighbors=2, weights='uniform')

**Training** the model is incredibly simple! All we need to do is call the **fit()** function on the classifier object that we just created, and pass in the train data (X_train) and targets (y_train). 

Since this is a supervised learning model, we need to pass the targets. We'll see in future sessions that training unsupervised learning models follows the same process without the need to pass in any targets.

In [None]:
knn.fit(X_train, y_train)

Similarly, once we have trained the model, we can simply call the **predict()** function with the test samples (X_test) to get an array of predictions. 

In [None]:
y_pred = knn.predict(X_test)
y_pred
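
To make the fit()/predict() calls less of a black box, here is a rough from-scratch sketch of a single KNN prediction, assuming Euclidean distance. The function `knn_predict_one`, the toy data, and the small epsilon used for distance weighting are all assumptions made for this illustration; scikit-learn's implementation is more elaborate.

```python
import numpy as np

def knn_predict_one(X_train, y_train, x, k, weights='uniform'):
    """Predict the class of a single point x by (weighted) majority vote."""
    # Euclidean distance from x to every training sample
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    if weights == 'uniform':
        vote_weights = np.ones(k)
    else:
        # 'distance': closer neighbours get larger votes
        # (the epsilon is a simplification to avoid division by zero)
        vote_weights = 1.0 / (distances[nearest] + 1e-12)
    # Sum the votes per class and return the class with the largest total
    n_classes = y_train.max() + 1
    votes = np.bincount(y_train[nearest], weights=vote_weights, minlength=n_classes)
    return int(np.argmax(votes))

# Toy 1-D data: one close point of class 1, two farther points of class 0
X_toy = np.array([[1.0], [4.0], [5.0]])
y_toy = np.array([1, 0, 0])
query = np.array([2.0])

pred_uniform = knn_predict_one(X_toy, y_toy, query, k=3)
pred_distance = knn_predict_one(X_toy, y_toy, query, k=3, weights='distance')
print(pred_uniform, pred_distance)
```

With uniform weights the two class-0 points outvote the single class-1 point. With distance weights the nearby class-1 point (vote 1) outweighs the two farther class-0 points (votes 1/2 + 1/3).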

We can get a bit of insight into how the model made its predictions by calling the **predict_proba()** function, which returns the estimated probability of each class for every test sample. More information on how this classifier breaks ties can be found [here](https://stats.stackexchange.com/questions/144718/how-does-scikit-learn-resolve-ties-in-the-knn-classification).

In [None]:
pred_prob = knn.predict_proba(X_test)
pred_prob
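
The sketch below, which repeats the stratified split and K = 2 classifier from this demo so that it runs on its own, shows what these probabilities look like for uniform weights: each of the two neighbours contributes one half, so ties show up as rows containing 0.5. The names `proba` and `tie_rows` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=2, weights='uniform').fit(X_train, y_train)
proba = knn.predict_proba(X_test)

# With K = 2 and uniform weights, each neighbour contributes 1/2,
# so every entry is 0, 0.5 or 1 and each row sums to 1
print(np.unique(proba))

# Rows containing 0.5 are ties between two classes
tie_rows = np.flatnonzero((proba == 0.5).any(axis=1))
print('Tied test samples:', tie_rows)
```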

***

## Step 4) Evaluate the Model Performance

We will evaluate the performance of the model primarily by looking at the **accuracy**. For this demo, we will also create a confusion matrix from the predictions. 

Fortunately, sklearn provides a lot of performance evaluation [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) for all sorts of machine learning models, including the accuracy and confusion matrix. 
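
To make the two metrics concrete before using the sklearn versions, here is a hand-rolled sketch with NumPy. The labels `y_true` and `y_hat` are made up for illustration and are not outputs of the model in this demo.

```python
import numpy as np

# Hypothetical labels, for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

# Accuracy: fraction of predictions that match the true label
accuracy = np.mean(y_true == y_hat)

# Confusion matrix: rows are true classes, columns are predicted classes
n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_hat):
    cm[t, p] += 1

print(accuracy)
print(cm)
```

The diagonal of the matrix holds the correct predictions, so the accuracy is the sum of the diagonal divided by the total number of samples.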

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=iris['target_names'])

Let's look at the individual predictions to see where any misclassifications in the confusion matrix came from. 

In [None]:
print('True\tPredict\tProbability')
for _y_test, _y_pred, _pred_prob in zip(y_test, y_pred, pred_prob): 
    print(f'{_y_test}\t{_y_pred}\t{_pred_prob}')

Because the KNN model is quite simple and intuitive, it is possible for us to do this sort of analysis. However, this is not always possible for all classifiers, especially when working with large amounts of data.
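
Going one step further, the `kneighbors()` method of the fitted classifier exposes the neighbours behind each prediction. The sketch below repeats the stratified split and the K = 2 classifier so that it runs on its own; the name `neighbour_labels` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# kneighbors returns the distances to, and training-set indices of,
# the K nearest neighbours of each query sample
distances, indices = knn.kneighbors(X_test)
neighbour_labels = y_train[indices]  # shape: (n_test_samples, K)

# Show the neighbours behind each misclassified test sample (if any)
for i in np.flatnonzero(y_pred != y_test):
    print(f'true={y_test[i]} pred={y_pred[i]} '
          f'neighbour labels={neighbour_labels[i]} '
          f'distances={np.round(distances[i], 2)}')
```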

***

## Step 5) Improve model performance

So far we have only tried K = 2 for the KNN classifier. Changing this value gives a different classifier, which will generally perform differently on the test dataset. 

Rather than tuning K against the test set, we can use K-fold cross validation on the training data to compare different hyperparameter values (in this case, values of K). 

We can start by importing the [Grid Search Cross Validator](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) from sklearn. 

In [None]:
from sklearn.model_selection import GridSearchCV

In this case, we'll try values of K from 1 to 11 (recall the rule of thumb for an optimal value of K from the slides!), and we'll use 10 splits in the cross validation process. That means:

- The train set will be split into 10 subsets. 
- The cross validator will loop over the subsets 10 times. Each time, it will pick a different subset to use as the validation set and use the other 9 to train the model. 
- The mean validation performance across the 10 models will be the output. 
- The process will be repeated for each hyperparameter value (in this case, each value of K).
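
The steps above can be sketched by hand for a single value of K. This sketch assumes `StratifiedKFold` for the splitting, which is what scikit-learn documents as the default for classifiers when `cv` is an integer; the names `fold_scores`, `manual_mean`, and `helper_mean` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

k = 5  # one candidate hyperparameter value
folds = StratifiedKFold(n_splits=10)

fold_scores = []
for train_idx, val_idx in folds.split(X_train, y_train):
    model = KNeighborsClassifier(n_neighbors=k)
    # Train on 9 subsets, validate on the held-out one
    model.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))

manual_mean = np.mean(fold_scores)

# The same procedure via scikit-learn's helper function
helper_mean = cross_val_score(
    KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=folds
).mean()
print(manual_mean, helper_mean)
```

A grid search repeats this loop for every candidate value and keeps the one with the best mean score.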

In [None]:
knn = KNeighborsClassifier()
cv = GridSearchCV(knn, {'n_neighbors': [1,2,3,4,5,6,7,8,9,10,11]}, cv=10)

In [None]:
cv.fit(X_train, y_train)

In [None]:
cv.cv_results_

In [None]:
cv.best_params_

In [None]:
plt.plot(range(1,12), cv.cv_results_['mean_test_score'])
plt.xticks(range(1,12))
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.grid()
plt.show()

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=iris['target_names'])
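
Instead of typing the chosen K in by hand, the selected value and the refitted model can be read from the grid search object. The sketch below repeats the split and the search so that it runs on its own; the names `grid`, `best_k`, and `best_knn` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {'n_neighbors': list(range(1, 12))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
grid.fit(X_train, y_train)

# Read the selected K instead of hard-coding it
best_k = grid.best_params_['n_neighbors']

# With the default refit=True, best_estimator_ has already been retrained
# on the whole training set using the best parameters
best_knn = grid.best_estimator_
test_accuracy = accuracy_score(y_test, best_knn.predict(X_test))
print(best_k, test_accuracy)
```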