## K-Nearest Neighbors classification

The K-Nearest Neighbors (KNN) algorithm is a supervised learning algorithm. It is a _non-parametric_ method used for classification and regression. In both cases, the prediction for an input sample is based on the $K$ training examples closest to it in the feature space. How those neighbors are turned into an output depends on whether KNN is used for classification or regression. Here, we focus on the classification case.

### Ingredients & transparency
For all machine learning models covered in this course, we aim to talk about their ingredients and transparency in a standard way to facilitate understanding their similarities and differences. For transparency, we will focus on the system transparency, i.e. system logic. Process transparency is not specific to ML models and it will be discussed when we cover the ML process/lifecycle.

A KNN model essentially consists of its stored training data. Its transparency comes from the distances between a test point and the training points: a prediction can be explained by pointing to the nearest neighbors that produced it.

```{admonition} Ingredients
- Input: features of data samples
- Output: class labels of data samples
- Model: assigns to an input sample the dominant (most frequent) class label among its $K$ nearest neighbors
- Hyperparameter(s): the number of nearest neighbors $K$
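- Loss function: none is minimised explicitly; neighbors are selected by smallest distance to the input sample
- Learning algorithm: store the training samples, then sort the distances between the input sample and the training samples at prediction time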
```
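
The ingredients above can be sketched in a few lines of NumPy. This is a minimal illustration only, assuming Euclidean distance and an unweighted majority vote; the function name `knn_predict` and the toy data are made up for this example and are not part of scikit-learn.

```python
import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k training samples with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbors and return the most frequent one
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]


# Toy data (made up for illustration): two well-separated clusters in 2D
X_train = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)  # 0 1
```

Note that there is no training step beyond storing `X_train` and `y_train`; all the work happens when a prediction is requested.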

```{admonition} Transparency
System logic
- Condition to produce a certain output: that label is the dominant one among the $K$ nearest neighbors of the input sample
```
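
One way to inspect this system logic in scikit-learn is the `kneighbors` method of a fitted `KNeighborsClassifier`, which returns the distances to, and the indices of, the nearest training samples. The sketch below uses made-up one-dimensional toy data, so the numbers are easy to check by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (made up for illustration): one feature, two classes
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

x_query = np.array([[9.0]])
prediction = clf.predict(x_query)[0]

# Distances to, and indices of, the 3 nearest training samples of the query
distances, indices = clf.kneighbors(x_query)
neighbor_labels = y_train[indices[0]]

print("predicted label:", prediction)
print("neighbor indices:", indices[0])
print("neighbor distances:", distances[0])
print("neighbor labels:", neighbor_labels)
```

Here all three neighbors carry label 1, which is why the query is assigned label 1.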

### Example: Iris classification

We adapt the [KNN example from scikit-learn](https://scikit-learn.org/1.0/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py) to illustrate the use of KNN for classification. 

To do so, we use the Iris dataset, a classic dataset in machine learning and statistics that is included in scikit-learn.

```{admonition} Launch 
Click the rocket symbol (<i class="fas fa-rocket"></i>) to launch this page as an interactive notebook in Google Colab (faster but requiring a Google account) or Binder.
```

##### Libraries

Get ready by importing the application programming interfaces (APIs) needed from their respective libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

##### Load data

Let us work on one of the classical datasets in machine learning: the [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). It can be loaded directly from the scikit-learn library.

In [None]:
iris = datasets.load_iris()
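
As a quick sanity check, the loaded object can be inspected before modelling. The sketch below looks only at the class names and the number of samples per class, using the `target_names` and `target` attributes of the object returned by `load_iris`.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Names of the classes, in the order of their integer labels 0, 1, 2
class_names = list(iris.target_names)
# Number of samples carrying each integer label
class_counts = np.bincount(iris.target)

print(class_names)
print(class_counts)
```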

##### Set hyperparameter

Here, we set the only hyperparameter listed above, the number of neighbors.

In [None]:
n_neighbors = 15
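
To see why this hyperparameter matters, the following sketch uses a made-up one-dimensional toy dataset in which the same query point receives a different label for different values of $K$. It is an illustration under these assumptions, not a recipe for choosing $K$.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up for illustration): one class-0 sample, three class-1 samples
X_train = np.array([[0.0], [1.0], [1.2], [1.4]])
y_train = np.array([0, 1, 1, 1])
x_query = np.array([[0.2]])

predictions_by_k = {}
for k in [1, 3]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    predictions_by_k[k] = int(clf.predict(x_query)[0])

print(predictions_by_k)  # {1: 0, 3: 1}
```

With $K=1$ only the closest sample (class 0) decides; with $K=3$ the two class-1 samples outvote it.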

##### Visualisation in 2D

To build intuition, we work with the first two features only so that the data samples can be visualised in a 2D plot. We can take the first two features by [indexing and slicing](https://pykale.github.io/transparentML/00-prereq/basic-python.html#indexing-and-slicing) as below.

In [None]:
X = iris.data[:, :2]
y = iris.target
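
The slicing pattern `[:, :2]` can be tried on a small array first. The array below is made up for illustration: the first `:` keeps every row (sample), and `:2` keeps the columns (features) at positions 0 and 1.

```python
import numpy as np

# A made-up array with 3 samples (rows) and 5 features (columns)
data = np.array(
    [
        [1.0, 2.0, 3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0, 9.0, 10.0],
        [11.0, 12.0, 13.0, 14.0, 15.0],
    ]
)

# Keep all rows, but only the first two columns
first_two = data[:, :2]

print(first_two.shape)  # (3, 2)
print(first_two)
```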

In [None]:
h = 0.02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(["orange", "cyan", "cornflowerblue"])
cmap_bold = ["darkorange", "c", "darkblue"]

for weights in ["uniform", "distance"]:
    # Create a KNeighborsClassifier instance and fit it to the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    sns.scatterplot(
        x=X[:, 0],
        y=X[:, 1],
        hue=iris.target_names[y],
        palette=cmap_bold,
        alpha=1.0,
        edgecolor="black",
    )
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title(
        "3-Class classification (k = %i, weights = '%s')" % (n_neighbors, weights)
    )
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])

plt.show()
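
The loop above compares two settings of the `weights` parameter. With `"uniform"`, every neighbor gets one vote; with `"distance"`, each vote is weighted by the inverse of the neighbor's distance, so closer neighbors count more. The sketch below, on made-up one-dimensional toy data, shows a query for which the two settings disagree.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up for illustration)
X_train = np.array([[0.0], [2.0], [2.1]])
y_train = np.array([0, 1, 1])
x_query = np.array([[0.5]])

predictions = {}
for weights in ["uniform", "distance"]:
    # With n_neighbors=3, all three training samples are neighbors of the query
    clf = KNeighborsClassifier(n_neighbors=3, weights=weights)
    clf.fit(X_train, y_train)
    predictions[weights] = int(clf.predict(x_query)[0])

print(predictions)  # {'uniform': 1, 'distance': 0}
```

Under uniform weights, class 1 wins by two votes to one. Under distance weights, the single class-0 neighbor at distance 0.5 contributes 1/0.5 = 2, which exceeds the combined weight 1/1.5 + 1/1.6 ≈ 1.29 of the two class-1 neighbors.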

### Exercises


Q1. How many features are there in total for the iris dataset? Write code below to find out or verify.