# k-Nearest Neighbours (k-NN)
**If you have not yet read chapters 4-6 of https://course.elementsofai.com/ and done their exercises, please do so now.**

Chapter 4.2 of https://course.elementsofai.com/ discussed the Nearest Neighbour (NN) classifier, in which an unclassified item receives the label of the "nearest" known training datapoint. The k-NN classifier works very similarly: instead of using the single nearest known neighbour of an unknown datapoint, it uses the $k$ nearest neighbours and takes the most common class among those neighbours. Here $k$ is the number of neighbours considered, so with $k = 1$ k-NN does exactly the same as the NN classifier.

Consider the following example, where the test sample (green circle) should be classified either as a blue square or as a red triangle:

<img src="knn_classification.png" alt="k-NN" style="width: 400px;"/>

In the case of $k = 1$, k-NN would assign the test sample (green circle) to the class of red triangles, because the nearest datapoint is a red triangle. If $k = 3$ (solid circle), it is still assigned to the red triangles, because there are 2 triangles and only 1 square inside the inner circle. If $k = 5$ (dashed circle), it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).
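
The effect of $k$ in the figure can be reproduced with a minimal sketch. The list `labels_by_distance` is an assumption: it lists the classes of the five nearest neighbours from nearest to farthest, in an order that is consistent with the description above (the exact order of the second and third neighbour is not given). The helper name `vote` is hypothetical.

```python
from collections import Counter

# Assumed order of the five nearest neighbours, nearest first
labels_by_distance = ["triangle", "square", "triangle", "square", "square"]

def vote(labels, k):
    """Return the most common label among the first k labels."""
    counts = Counter(labels[:k])
    return counts.most_common(1)[0][0]

for k in (1, 3, 5):
    print(k, vote(labels_by_distance, k))
```

With these assumed labels, the vote goes to the triangles for $k = 1$ and $k = 3$, and flips to the squares for $k = 5$.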

Why or when would you use this? Well, there are times when we have data in which specific entries belong to specific classes. For example: a credit card company receives many applications for new cards. These applications contain information about a number of attributes, such as annual salary, any outstanding debts, age, etc. The applications need to be classified into those with good credit and those with bad credit. Typically, the number of applications is too large to be checked manually, which is where classification algorithms come in. It would even be possible to create a "gray" area, where our algorithm is unsure whether to categorize a credit card application as good or bad credit.

This notebook will guide you through the process of classification using k-NN.

## Nearest Neighbour

First we will try to tackle the problem of finding the nearest neighbour. We will use the geometric distance (also known as straight-line distance, or Euclidean distance) to decide which is the nearest, most similar, item.

#### Euclidean Distance

The Euclidean distance between point $p$ with coordinates $(p_x, p_y)$ and point $q$ with coordinates $(q_x, q_y)$ is defined as:

$$ dist(p,q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2} $$

This can be read as: the straight-line distance between $p$ and $q$ is equal to the square root of the sum of the squared differences in the $x$ and $y$ dimensions (which is still quite abstract).

To make this more clear, consider the following example:

![euclidean_distance.png](euclidean_distance.png)

The Euclidean distance between point $p$ and $q$ is depicted as the blue line ($C$). As you can see, we can use the Pythagorean Theorem ($A^2 + B^2 = C^2$) to calculate the length of $C$. To do this, we first need the length of $A$ and $B$. The length of $A$ is equal to $(q_x - p_x)$, while the length of $B$ is equal to $(q_y - p_y)$. Because both lengths are squared, the order of subtraction does not matter: $(q_x - p_x)^2 = (p_x - q_x)^2$. Now, applying the Pythagorean Theorem we end up with:

$$ C^2 = A^2 + B^2 $$

$$ C = \sqrt{A^2 + B^2} = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2} = dist(p,q) $$

If we now define the $x$ and $y$ dimensions to be $1$ and $2$, we can even rewrite our formula for the distance like this:

$$ dist(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2} $$ 

Which can be rewritten more generally, using the sum notation:

$$ dist(p,q) = \sqrt{\sum^2_{i=1}(p_i - q_i)^2} $$

Remember that we will use this to determine the class that our datapoint might belong to. Most real-life datasets have more than two features. Fortunately, this formula works for any number of dimensions, let's say a variable number of $d$ dimensions, while still being a valid measure of distance! This only needs a very small adaptation in the formula:

$$ dist(p,q) = \sqrt{\sum^d_{i=1}(p_i - q_i)^2} $$

Here we assume that the nearest item is also the most similar, simply because its feature values are, taken together, the closest to those of the item we compare it with.
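
As a quick worked example (the numbers are chosen purely for illustration), take $d = 3$ with $p = (1, 2, 3)$ and $q = (4, 6, 3)$:

$$ dist(p,q) = \sqrt{(1 - 4)^2 + (2 - 6)^2 + (3 - 3)^2} = \sqrt{9 + 16 + 0} = \sqrt{25} = 5 $$

The third feature is identical for both points, so it contributes nothing to the distance.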

__Write a piece of code that calculates the Euclidean distance between two lists/tuples of coordinates (`p` and `q`) and stores the solution in the variable `distance`.__ Write it such that it works for any `p` and `q`, as you will need this code later. You do not need to write it such that it works for any other $d$ than $d=2$.

*Disclaimer: normally you would write functions to prevent duplicate code, but for this exercise "copy and pasting" your own code is okay. If, however, you would like to try to solve it using functions, you are free to do so.*

In [None]:
import math

p = (1, 2)
q = (3, 4)

### YOUR SOLUTION HERE

In [None]:
# Testing cell
# Compare with a tolerance, since floating-point results can differ in the last digits
assert math.isclose(distance, math.sqrt(8)), "Something is wrong in your calculation."

#### Finding the closest neighbour
Now that we can define the distance between two points, we can also find the closest neighbouring point of any given set of coordinates. This can be done by looping over every available point one by one, and saving the point that is the closest and its distance at every step.
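
The "keep the best so far" pattern described above can be sketched on a simpler, one-dimensional problem: finding the number in a list that is closest to a target. The names `numbers`, `target`, `closest_number` and `smallest_difference` are made up for this illustration and are not part of the exercise.

```python
import math

numbers = [12, 7, 30, 18]
target = 16

# Start with "nothing found yet" and an infinitely large difference
closest_number = None
smallest_difference = math.inf

for number in numbers:
    difference = abs(number - target)
    # Only update when this number beats the best one seen so far
    if difference < smallest_difference:
        closest_number = number
        smallest_difference = difference

print(closest_number, smallest_difference)
```

Starting from `math.inf` guarantees that the first number examined always replaces the initial placeholder.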

We have provided you with a dictionary `points` that maps the coordinates of each point in our dataset to its class. There are two possible class-values: $1$ and $-1$. So for some new, unknown point we are trying to predict whether it belongs to class $-1$ or $1$, which could stand for the "bad credit" or "good credit" from the example used in our introduction.

For this exercise you do not need the class labels, only the coordinates of the points. The plot below shows how the points are distributed.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt


points = {(20, 14): 1, (9, 23): -1, (2, 4): -1, (35, 20): 1, (39, 9): 1, (34, 5): 1, (18, 18): -1, (22, 4): 1, (3, 30): -1, (26, 35): 1, (16, 38): -1}

def plot_points(points):
    # Loop over all points and assign red when the class is -1, blue when 1
    for point, class_value in points.items():
        if class_value == -1:
            plt.scatter(*point, color="red")
        else:
            plt.scatter(*point, color="blue")

    plt.show()
    
plot_points(points)

__Write code that, for some point `p`, can find the nearest point in a dictionary of `points`, and save the result in `nearest_point` and the distance in `nearest_distance`.__ You may copy and reuse your solution from the previous assignment to compute the distance.

*Disclaimer: normally you would write functions to prevent duplicate code, but for this exercise "copy and pasting" your own code is okay. If, however, you would like to try to solve it using functions, you are free to do so.*

In [None]:
p = (24, 9)

# Set closest point to something very far away
nearest_point = None
nearest_distance = math.inf

# Loop over the coordinates in our `points` dictionary
for q in points:
    ### YOUR SOLUTION HERE
    

In [None]:
# Testing cell
assert nearest_point == (22, 4), "This is not the closest point."
assert math.isclose(nearest_distance, math.sqrt(29)), "This is not the correct distance."

#### Classifying
Finally, we need to classify our unknown point. We know what the nearest point is, so all we need to know now is what class it had. As we have discussed before, we have provided you with a dictionary that holds the values for the classes of the `points`. There are two possible class-values; $1$ and $-1$. Getting a value from the dictionary can be done as follows:

In [None]:
# Example using direct coordinates
print(f'The class-value for point {(9, 23)} is: {points[(9, 23)]}')

# Example using the nearest_point found in the last exercise
print(f'The class-value for point {nearest_point} is: {points[nearest_point]}')

__Write code that for some unknown point `p` and dictionary of known `points`, finds the closest known point to `p` and stores its class in `p_class`.__ Copy the code from the last assignment and use it here to find the nearest neighbour.

In [None]:
def nn(p, points):
    p_class = 0

    # Set closest point to something very far away
    nearest_point = None
    nearest_distance = math.inf

    ### YOUR SOLUTION HERE
    
    ### END SOLUTION

    return points[nearest_point]

In [None]:
# Testing cell
p = (9, 12)

p_class = nn(p, points)

assert p_class == -1, "You have not found the correct class."

#### Visualising the result 

It is possible to visualise the class label that NN predicts for any given point, given a dataset, using something called a Voronoi diagram. A Voronoi diagram is a partition of a plane into regions determined by the datapoints. The idea is that each region consists of the area that is closer to that region's datapoint than to any other datapoint. This means each region indicates what part of the entire input plane would "belong" to that datapoint if we classified it using Nearest Neighbours. The Voronoi diagram that results from our datapoints is shown below:

![voronoi.PNG](voronoi.PNG)

In this image, the red areas are areas where the NN is one of the datapoints that belong to the negative class, and the blue areas are areas where the NN is one of the datapoints that belong to the positive class. The black line shows what we call the Decision Boundary: the region of a problem space in which the output label of our classifier is ambiguous. Any point that lies exactly on this line cannot be classified unambiguously by the NN algorithm, so usually a random choice is made.

Building an exact Voronoi diagram can be quite complex, but we can also make an approximation by sampling a lot of points in the plot and determining their nearest neighbour. Below, we have provided you with the code to display how your NN algorithm classifies the points within the plot. __Call the function you made in the previous exercise for every point in the area, each time storing the result in `p_class`. The plot of all of these points should look similar to the image above.__

In [None]:
import numpy as np

N = 40

grid_points = {}

# Loop over a grid of points
for x in range(N):
    for y in range(N):
        p = (x, y)
        
        ### YOUR SOLUTION HERE

        ### END SOLUTION

        # Add point with value to our new dictionary
        grid_points[p] = p_class

plot_points(grid_points)

## k-NN
Now that we have a method of finding the Nearest Neighbour, the step to finding $k$-Nearest Neighbours is fairly small. We can just repeat our previous method $k$ times while ignoring the points that we have already found.

There are several methods that you can use to ignore points that you have already determined were the closest. The simplest method, however, is using the Python keyword `in`:

In [None]:
example_list = [(10, 12), (13, 4), (20, 5)]

if (20, 5) in example_list:
    print("The coordinate (20, 5) is in our list!")
else:
    print("The coordinate (20, 5) is not in our list...")
          
if (30, 3) not in example_list:
    print("The coordinate (30, 3) is not in our list...")

As you can see, you can use the keyword `in` to see if a tuple of coordinates is in a `list`.

Below, we have provided you with a framework of code that saves which sets of coordinates were already found to be the closest. __Copy your code that finds the nearest neighbour. Then, add a piece of code that makes sure that `nearest_point` can not become a point that is already in `neighbours`.__

In [None]:
k = 3

p = (9, 12)
p_class = 0

neighbours = []

for i in range(k):
    # Set closest point to something very far away
    nearest_point = None
    nearest_distance = math.inf
    
    ### YOUR SOLUTION HERE

    ### END SOLUTION
    
    # Add the point to our list of neighbours
    neighbours.append(nearest_point)
    
print(neighbours)

In [None]:
assert neighbours == [(2, 4), (18, 18), (9, 23)], "You have not found the correct neighbours."

Of course, this is not the most efficient method, as we have to re-calculate the distances every single time we try to find a new Nearest Neighbour. If you would like an added challenge, try to find a method that does not require you to re-calculate the distances. _(This is not a required exercise)_
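
One possible direction for this challenge, sketched on plain numbers rather than on `points`: compute each distance once while sorting, then take the first $k$ items. The list `numbers` and the value `target` are made up for illustration; the built-in `sorted` calls the `key` function once per item.

```python
numbers = [12, 7, 30, 18]
target = 16
k = 3

# Sort by absolute difference to the target, then keep the k closest
by_closeness = sorted(numbers, key=lambda number: abs(number - target))
nearest_three = by_closeness[:k]

print(nearest_three)
```

Adapting this idea to two-dimensional coordinates is left as part of the challenge.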

### Determining the class-value

Next we need to add code to determine the class of an unknown datapoint. In k-NN this is done by taking the most common class among the Nearest Neighbours. Since we have defined our two possible class-values as $-1$ and $1$, this can easily be done by just taking the sum of the class labels of our `neighbours`, and seeing if this value is negative, positive, or zero. When the value of this sum is negative, there were more points with the class-value of $-1$ than there were points with the class-value of $1$. When the value of this sum is positive, the opposite is the case. When the value is zero, there were just as many points with a class-value of $-1$ as there were points with a class-value of $1$.
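
The sign rule above can be illustrated with a minimal sketch on hand-written label lists. The helper name `majority_sign` is hypothetical, and returning $0$ for a tie is an assumption made for this illustration only, not a prescribed way of handling ties.

```python
def majority_sign(labels):
    """Return 1 or -1 for the majority class, or 0 when the classes are tied."""
    total = sum(labels)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return 0  # as many 1s as -1s

print(majority_sign([1, -1, 1]))    # two votes for 1, one for -1
print(majority_sign([-1, -1, 1]))   # two votes for -1, one for 1
print(majority_sign([1, -1]))       # one vote each
```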

__Write code that loops over the list `neighbours` and sums all class-values. Then, determine the resulting class value for our unknown datapoint and store the result in `class_outcome`.__

In [None]:
class_outcome = 0

### YOUR SOLUTION HERE


In [None]:
assert class_outcome == -1, "The outcome is not correct."

### Combining everything
Finally, we get to combine everything into one! 

__Fill in the blanks in the cell below using your code from the exercises above.__ The result of the `knn` function should be a prediction for the point `p`, given a dictionary of known `points` and some value of `k`.

In [None]:

def knn(p, points, k):
    p_class = 0
    class_outcome = 0

    neighbours = []

    ### YOUR SOLUTION HERE
    
    ### END SOLUTION    
        
    return class_outcome

In [None]:
k = 5
p = (20,17)

class_outcome = knn(p, points, k)

assert class_outcome == 1, "The point was classified incorrectly."

#### Visualising k-NN
Communicating your scientific results in an effective way is an important step in research. Often, when exploring a dataset or when you’re evaluating some statistical analysis, creating insightful visualizations of the data is an essential part of the process. Interactive data visualization allows a user to interact with the data in question. In some cases, this might lead to a greater understanding of the data compared to static visualizations.

Below, we have provided you with an interactive version of our earlier plot using `ipywidgets`. __Call the function you made in the previous exercise, save the result in `class_outcome`, and see whether the result looks like you expected.__

__NOTE:__ There might be a delay of a couple of seconds before you will see the graph. This is caused by the multitude of times the k-NN algorithm must be run to produce each of the dots in the graph.

In [None]:
from ipywidgets import interact, fixed
import ipywidgets as widgets

def plot_knn(k, points): 
    N = 40
    grid_points = {}

    # Loop over a grid of points
    for x in range(N):
        for y in range(N):
            p = (x, y)

            ### YOUR SOLUTION HERE

            ### END SOLUTION

            # Assign the class to the point while adding it to our new dictionary
            grid_points[p] = class_outcome

    plot_points(grid_points)

interact(plot_knn, k=widgets.IntSlider(value=1, min=1, max=11, step=2, continuous_update=False), points=fixed(points));

## Questions
Answer the following questions about this notebook. Write your answers below each question in this cell.

__So far, in this notebook, we have only used odd values for $k$. What potential problem might occur when we use an even $k$? E.g. for $k=2$, $k=4$, $k=6$, etc.__

YOUR ANSWER HERE

__Think of a possible solution to this problem, and argue why this solution would work.__

YOUR ANSWER HERE

__What seems to happen to the classifications when $k=11$? Why does this happen?__

YOUR ANSWER HERE

__What is the tradeoff you are making when you are increasing the value of $k$?__

YOUR ANSWER HERE

__When would you want to use a small value for $k$ and when would you use a larger value for $k$?__

YOUR ANSWER HERE