# k-Nearest Neighbors

In [None]:
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## Chronic kidney disease 

Let's work through an example. We're going to work with a data set that was collected to help doctors diagnose chronic kidney disease (CKD). Each row in the data set represents a single patient who was treated in the past and whose diagnosis is known. For each patient, we have a bunch of measurements from a blood test. We'd like to find which measurements are most useful for diagnosing CKD, and develop a way to classify future patients as "has CKD" or "doesn't have CKD" based on their blood test results.

In [None]:
ckd = Table.read_table('https://raw.githubusercontent.com/data-8/textbook/gh-pages/data/ckd.csv').relabeled('Blood Glucose Random', 'Glucose')
ckd

We won't need everything in this table, and it will be easier to work with the numerical data if it's in standard units, so the following code will update the table by dropping the data columns that aren't needed and converting the remaining columns into standard units.

In [None]:
def standard_units(array_of_numbers):
    "Convert any array of numbers to standard units."
    return (array_of_numbers - np.mean(array_of_numbers)) / np.std(array_of_numbers)  
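
As a quick sanity check, any array converted to standard units should end up with mean 0 and standard deviation 1. The sketch below repeats the function so that it runs on its own, and uses made-up numbers (the `example` array is an assumption for illustration, not data from the CKD table).

```python
import numpy as np

def standard_units(array_of_numbers):
    "Convert any array of numbers to standard units."
    return (array_of_numbers - np.mean(array_of_numbers)) / np.std(array_of_numbers)

# Made-up example values, not taken from the CKD table.
example = np.array([2.0, 4.0, 6.0, 8.0])
example_su = standard_units(example)

print(np.mean(example_su))  # approximately 0
print(np.std(example_su))   # approximately 1
```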

In [None]:
ckd = Table().with_columns(
    'Hemoglobin', standard_units(ckd.column('Hemoglobin')),
    'Glucose', standard_units(ckd.column('Glucose')),
    'White Blood Cell Count', standard_units(ckd.column('White Blood Cell Count')),
    'Blood Urea', standard_units(ckd.column('Blood Urea')),
    'Class', ckd.column('Class')
)

In [None]:
ckd

We'll draw a scatter plot to visualize the relation between two variables. But which two? Try different pairings to see whether any of them produce a scatter plot in which a boundary line could be drawn between the two classes of points.

In [None]:
ckd.scatter(..., ..., group='Class')

We should now figure out how to classify a new patient using the variables we've agreed upon.

## Alice

Alice is a new patient who has the following characteristics:
* Hemoglobin is 0
* Glucose is 1.5

(Remember, these are in standard units.)

We can represent Alice as an array with two elements, one for each attribute:

In [None]:
alice = make_array(0, 1.5)

Now let's write a function that will find the distance between two arrays of equal length.

In [None]:
def distance(arr1, arr2):
    ...

In [None]:
# Use this to test your distance function.
distance(alice, make_array(1.1, 2))
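
The notebook leaves `distance` as an exercise and does not say which distance to use. One common choice is the straight-line (Euclidean) distance, sketched below in plain NumPy. The name `euclidean_distance` is hypothetical, and this is only one possible approach, not necessarily the intended solution.

```python
import numpy as np

def euclidean_distance(arr1, arr2):
    """Straight-line distance between two equal-length arrays.

    Euclidean distance is an assumed choice of metric here.
    """
    arr1 = np.asarray(arr1, dtype=float)
    arr2 = np.asarray(arr2, dtype=float)
    # Square the coordinate-wise differences, add them up, take the root.
    return np.sqrt(np.sum((arr1 - arr2) ** 2))

alice = np.array([0, 1.5])
print(euclidean_distance(alice, np.array([1.1, 2])))  # sqrt(1.1**2 + 0.5**2)
```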

Let's use this function and the `.apply` method to calculate the distance from `alice` to each point in the data set.
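
To see what "the distance to each point" looks like, here is a minimal sketch that uses plain NumPy as a stand-in for the table's `.apply` method. The `points` array holds made-up patients (an assumption for illustration, not rows from the CKD table), and Euclidean distance is assumed as the metric.

```python
import numpy as np

alice = np.array([0, 1.5])

# Hypothetical patients in standard units: columns are (Hemoglobin, Glucose).
points = np.array([
    [0.0, 1.5],
    [3.0, 5.5],
    [-1.0, 1.5],
])

# Subtract Alice from every row, then compute each row's Euclidean length.
distances = np.sqrt(np.sum((points - alice) ** 2, axis=1))
print(distances)  # one distance per hypothetical patient
```

The result has one entry per row, so it could be stored as a new column alongside the original attributes and sorted to find Alice's nearest neighbors.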

## Classification

In [None]:
def classify(row, k, train):
    ...
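
The `classify` function above is left as an exercise. As a rough illustration of the idea, the sketch below finds the `k` nearest training points and returns their majority class. It uses NumPy arrays instead of a table, so the name `knn_classify`, its signature, and the training data are all hypothetical, and Euclidean distance is an assumed metric.

```python
import numpy as np

def knn_classify(point, k, train_points, train_labels):
    """Return the majority class among the k training points nearest to `point`."""
    # Euclidean distance from `point` to every training row.
    distances = np.sqrt(np.sum((train_points - point) ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count how often each label appears among those neighbors.
    labels, counts = np.unique(train_labels[nearest], return_counts=True)
    # On a tie, argmax picks the smaller label; an odd k avoids ties for two classes.
    return labels[np.argmax(counts)]

# Made-up training data in standard units, with 1 and 0 as stand-in class labels.
train_points = np.array([[0.0, 1.0], [0.2, 1.4], [-0.1, 1.6], [2.0, -1.0], [2.2, -0.8]])
train_labels = np.array([1, 1, 1, 0, 0])

alice = np.array([0, 1.5])
print(knn_classify(alice, 3, train_points, train_labels))
```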

## Training and Testing

In [None]:
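# Shuffle the rows so the split below is random.
shuffled_ckd = ckd.sample(with_replacement=False)
# Use the first 79 shuffled rows for training and the next 79 for testing.
training = shuffled_ckd.take(np.arange(79))
testing = shuffled_ckd.take(np.arange(79, 158))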