<a href="https://colab.research.google.com/github/rahiakela/hands-on-explainable-ai-xai-with-python/blob/main/1-explaining-artificial-intelligence-with-python/1_exploring_simple_knn_algorithm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exploring a simple KNN algorithm

We will begin by exploring a simple KNN algorithm that can predict a disease from a few symptoms. We will limit our study to detecting the flu, a cold, or pneumonia, and we will limit the symptoms to a cough, fever, headache, and colored sputum.

From the doctor's perspective, the symptoms are generally viewed as follows:

- A mild headache and a fever could be a cold
- A cough and a fever could be the flu
- A fever and a cough with colored sputum could be pneumonia

Notice the verb is "could" and not "must." **A medical diagnosis remains a probability in the very early stages of a disease. The diagnosis becomes certain only after a few minutes to a few days, and sometimes even weeks.**

## k-nearest neighbors

The KNN algorithm is best explained with a real-life example. Imagine you are in a supermarket. The supermarket is the dataset. You are at point $p_n$ in an aisle of the supermarket. You are looking for bottled water. You see many brands of bottled water spread over a few yards (or meters). You are also tempted by some cans of soda you see right next to you; however, you want to avoid sugar.

In terms of what's best for your diet, we will use a scale from 1 (very good for your health) to 10 (very bad for your health). $p_n$ is at point (0, 0) in a Euclidean space in which the first coordinate is $x$ and the second is $y$.

The many brands of bottled water have health features between (0, 1) and (2, 2). The many brands of soda, which are generally bad for your health, have features between (3, 3) and (10, 10).

To find the nearest neighbors in terms of health features, the KNN algorithm will calculate the Euclidean distance between $p_n$ and all the other points in our dataset. The calculation will run from $p_1$ to $p_{n-1}$ using the Euclidean distance formula. The $k$ in KNN represents the number of "nearest neighbors" the algorithm will consider for classification purposes.

The Euclidean distance ($d_1$) between two given points, such as $p_n(x_1, y_1)$ and $p_1(x_2, y_2)$, is as follows:

$$ d_1(p_n, p_1) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $$
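The formula can be checked with a few lines of Python. This is a minimal sketch: the function name `euclidean_distance` and the two sample points are illustrative assumptions, chosen at the edges of the water and soda ranges described above.

```python
import math


def euclidean_distance(p, q):
    """Return the Euclidean distance between two 2D points given as (x, y) tuples."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


p_n = (0, 0)    # our own position in the health-feature space
water = (2, 2)  # hypothetical bottled-water point (farthest edge of its range)
soda = (3, 3)   # hypothetical soda point (nearest edge of its range)

d_water = euclidean_distance(p_n, water)
d_soda = euclidean_distance(p_n, soda)

print(f"distance to water: {d_water:.3f}")
print(f"distance to soda:  {d_soda:.3f}")
```

Even when comparing the farthest water point with the closest soda point, the water point is nearer to (0, 0).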

Intuitively, we know that the data points located between (0, 1) and (2, 2) are closer to our point (0, 0) than the data points located between (3, 3) and (10, 10). The nearest neighbors of our point (0, 0) are the bottled water data points.

Note that these are representations of the closest features to us, not the physical points in the supermarket. The fact that the soda is close to us in the real world of the supermarket does not bring it any closer to our need in terms of our health requirements.
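To make the idea concrete, here is a small from-scratch sketch of the nearest-neighbor vote in plain Python. The dataset values, the labels, and the helper `knn_predict` are hypothetical and serve only to illustrate the supermarket example; this is not the implementation used later in this notebook.

```python
import math
from collections import Counter

# Hypothetical labeled points in the health-feature space (illustrative values only).
dataset = [
    ((0, 1), "water"),
    ((1, 1), "water"),
    ((2, 2), "water"),
    ((3, 3), "soda"),
    ((6, 7), "soda"),
    ((10, 10), "soda"),
]


def knn_predict(query, data, k=3):
    """Return the majority label among the k points closest to `query`, plus those points."""
    # Sort every labeled point by its Euclidean distance to the query point.
    by_distance = sorted(
        data,
        key=lambda item: math.hypot(query[0] - item[0][0], query[1] - item[0][1]),
    )
    nearest = by_distance[:k]
    # Each of the k nearest neighbors casts one vote for its own label.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest


label, neighbors = knn_predict((0, 0), dataset, k=3)
print(label)
print(neighbors)
```

With `k=3`, the three closest points to (0, 0) in this toy dataset are all bottled water, so the vote returns `"water"`.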

Considering the number of distances to calculate, a library implementation such as the one provided by `sklearn.neighbors` is the practical choice. We will now go back to our medical diagnosis program and build a KNN in Python.

## Setup

```python
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
```