# k-Nearest Neighbors, Part 2

In [None]:
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## Picking Back Up

Let's pick up where we left off yesterday in class. Run the following set of cells to load up everything from where we stopped.

Get the original table:

In [None]:
ckd = Table.read_table('https://raw.githubusercontent.com/data-8/textbook/gh-pages/data/ckd.csv').relabeled('Blood Glucose Random', 'Glucose')

Load in these helper functions:

In [None]:
def standard_units(array_of_numbers):
    "Convert any array of numbers to standard units."
    return (array_of_numbers - np.mean(array_of_numbers)) / np.std(array_of_numbers)  

def distance(arr1, arr2):
    "Euclidean distance between two arrays of the same length."
    return np.sqrt(np.sum((arr2 - arr1)**2))
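
As a quick sanity check, here is a minimal sketch of what these two helpers compute. It uses plain NumPy and small made-up arrays rather than the `ckd` data; the names `values`, `p`, and `q` are hypothetical and exist only for this illustration.

```python
import numpy as np

def standard_units(array_of_numbers):
    "Convert any array of numbers to standard units."
    return (array_of_numbers - np.mean(array_of_numbers)) / np.std(array_of_numbers)

def distance(arr1, arr2):
    "Euclidean distance between two arrays of the same length."
    return np.sqrt(np.sum((arr2 - arr1) ** 2))

# Made-up measurements, for illustration only.
values = np.array([2.0, 4.0, 6.0])
in_su = standard_units(values)   # has mean 0 and standard deviation 1

# Two made-up points that form a 3-4-5 right triangle.
p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])
d = distance(p, q)               # 5.0
```

Converting to standard units first matters because `distance` treats every feature equally; without it, a feature measured on a larger scale would dominate the result.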

Reduce the table down to just the columns we care about and in standard units:

In [None]:
ckd = Table().with_columns(
    'Hemoglobin', standard_units(ckd.column('Hemoglobin')),
    'Glucose', standard_units(ckd.column('Glucose')),
    'Class', ckd.column('Class')
)

In [None]:
ckd

Here's Alice, stored as an array of her Hemoglobin and Glucose values, in that order. (*Note: I've modified her Glucose level to be 1.1 instead of the 1.5 value we used in class on Thursday.*)

In [None]:
alice = make_array(0, 1.1)

### New Helper Function

Here's a new function `row_to_array` that will help us turn a row from a table into an array that we can use with our `distance` function. You give it an entire row of a table, which can include ALL attributes for that row, and it will return an array that only contains the features that are specified in an input array.

In [None]:
def row_to_array(row, features):
    """Returns an array of the features specified in the array named features"""
    arr = make_array()
    for feature in features:
        arr = np.append(arr, row.item(feature))
    return arr

For example, normally calling a row from `ckd` would include the data on `Class`, which we don't want when calculating distance:

In [None]:
ckd.row(3)

We could first `.drop()` Class from the table `ckd` like this:

In [None]:
ckd.drop('Class').row(3)

and then convert the row to an array using `np.array()`:

In [None]:
patient3_np = np.array( ckd.drop('Class').row(3) )
patient3_np

That works, but it's kind of messy: we have to drop every column that isn't a feature before we convert the row.

Another way is to use the helper function `row_to_array()`, which can handle a lot of this extra work for us:

In [None]:
features = make_array('Hemoglobin', 'Glucose')
patient3 = row_to_array( ckd.row(3), features )

Both ways are fine (you get the same array when you're done), but one may be a bit easier to read and understand. Either way, we need to convert a row of information into an array before we can calculate the distance between that row and a test point.

In [None]:
distance( alice, patient3_np )

In [None]:
distance( alice, patient3 )

## Calculating Distances

There are two ways you can calculate the distances between a given point (in this context, a patient) and all the other rows in a table.

1. A `for` loop
2. Using the `.apply` method.

We'll take a look at both, because they each have their advantages.

### A `for` loop

You can use a `for` loop to iterate through each row in a table, and then calculate the distance between that row and a provided test point. First, let's make sure we've identified the features we care about (Hemoglobin and Glucose) and select a patient as our test point. We'll keep using Alice.

In [None]:
features = make_array('Hemoglobin', 'Glucose')
test_point = alice

Then, we'll make an empty array named `distances` that will collect the distance from Alice (`test_point`) to each of the patients in table `ckd`. The code `for row in ckd.rows:` will sequentially select an individual row from the table `ckd` and store it in the variable `row`. The code inside the loop will convert the current row into an array named `row_as_array` (keeping only the features specified earlier), calculate the distance between `test_point` and the current row, `row_as_array`, and then append that distance to the array `distances`. You can see that the result is an array of distances from Alice to each patient in the table.

In [None]:
distances = make_array()
for row in ckd.rows:
    row_as_array = row_to_array(row, features)
    one_distance = distance(test_point, row_as_array)
    distances = np.append(distances, one_distance)

distances

We can now augment the table `ckd` so it includes these distances in addition to the original columns.

In [None]:
ckd_with_distances = ckd.with_column('Distance to Alice', distances)
ckd_with_distances
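
For comparison, the same loop can be collapsed into a single NumPy expression. This is only a sketch, under the assumption that the feature values are already stacked in a two-dimensional array with one row per patient; `points` holds made-up values, not the `ckd` data.

```python
import numpy as np

# Made-up standardized (Hemoglobin, Glucose) pairs, one row per patient.
points = np.array([
    [0.0, 1.1],
    [3.0, 5.1],
    [-1.0, 1.1],
])
test_point = np.array([0.0, 1.1])

# Loop version, mirroring the cell above.
loop_distances = np.array([])
for row in points:
    one_distance = np.sqrt(np.sum((row - test_point) ** 2))
    loop_distances = np.append(loop_distances, one_distance)

# Vectorized version: broadcasting subtracts test_point from every row at once,
# and axis=1 sums the squared differences within each row.
vectorized_distances = np.sqrt(np.sum((points - test_point) ** 2, axis=1))
```

Both arrays hold the same three distances (0, 5, and 1 for this made-up data); the vectorized form simply skips the explicit loop and the repeated `np.append` calls.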

### The `.apply` method

As a refresher, the `.apply` method can be called on a table, and it will create an array of values that are calculated by using the specified function on a given column of the table. For example, the following code will apply the `np.square` function using the column labeled `Hemoglobin` as the input to the function.

In [None]:
ckd.apply( np.square, 'Hemoglobin')

If no column is specified, the entire row is passed to the function as its input. We'll need to make sure that the table only has the features that make sense for the function to work on. For example, in the table `ckd`, it wouldn't make sense to include the column `Class` in any calculations, since it's not a feature/attribute of a patient. We should drop this column before using the `.apply` method. Here's an example where the code will add the values in the `Hemoglobin` and `Glucose` columns. This doesn't really mean much in context, but it's meant to illustrate how the `.apply` method works when no columns are specified.

In [None]:
ckd.drop('Class').apply( sum )

One limitation to the `.apply` method is that it requires you to provide a function that only takes in one input. To calculate all of our distances, we would need two inputs: one for the test point, and one for the current row. It turns out we can write a function that only takes in one input *inside* another function that takes in two. Take a look.

In [None]:
def all_distances(new_point, data):
    
    data = data.drop('Class')
    
    def one_distance(row):
        arr = np.array( row )
        return distance(new_point, arr)
    
    return data.apply(one_distance)

The function `all_distances` will take in a single person (`new_point`) and a table of all the patients who are classified (`data`). It drops the `Class` column from the table, and then creates the `one_distance` function, which only takes in a single row and calculates the distance between `new_point` and that row. Lastly, the `apply` method will create an array of values by calling `one_distance` on every row in the table `data`; each call uses the `new_point` that was passed to `all_distances`.
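
The function-inside-a-function pattern may be easier to see in isolation. Below is a minimal sketch with plain NumPy and made-up data; `make_distance_to` is a hypothetical name chosen for this example, not part of the `datascience` library. The inner function takes only one input, yet it can still use `new_point` from the enclosing function.

```python
import numpy as np

def make_distance_to(new_point):
    """Return a one-input function that measures distance to new_point."""
    def one_distance(row):
        # new_point is remembered from the enclosing call.
        return np.sqrt(np.sum((np.array(row) - new_point) ** 2))
    return one_distance

# Made-up test point and rows, for illustration only.
distance_to_origin = make_distance_to(np.array([0.0, 0.0]))
rows = [(3.0, 4.0), (6.0, 8.0)]
results = [distance_to_origin(row) for row in rows]   # [5.0, 10.0]
```

This is the same trick `all_distances` relies on: `.apply` only ever hands `one_distance` a row, and the test point comes along from the outer function.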

Calling the function results in the following:

In [None]:
all_distances(alice, ckd)

We can now augment the table `ckd` so it includes these distances in addition to the original columns. This should be identical to the table we created using the `for` loop.

In [None]:
ckd_with_distances = ckd.with_column('Distance to Alice', all_distances(alice, ckd) )
ckd_with_distances

### Choosing `for` loop vs. `.apply` method?

How do you decide to use one technique over the other?

|         | Pros | Cons |
|---------|------|------|
|`for`    | No need to create extra helper functions  | Can take a lot longer when working with big Tables     |
|`.apply` | No need to know about loops. <br>Can be a lot faster working with big data sets    | Need to write a function in a function to get the final array     |

At the end of the day, choose the approach that makes more sense to you!

## Classification

Now that we have a way to quickly calculate the distance from any patient to the other patients in a table, we need to figure out how to classify the patient easily. We should start by sorting our table by `Distance to Alice`.

In [None]:
ckd_with_distances.sort('Distance to Alice')

Let's assume that `k=5`, meaning we'll use the 5 nearest neighbors to make a decision on our new patient. The following code keeps only the nearest 5 patients and stores the new table in `nearest_neighbors`.

In [None]:
nearest_neighbors = ckd_with_distances.sort('Distance to Alice').take(np.arange(5))
nearest_neighbors

We could count by hand how many neighbors fall into each Class, but that sounds like a job better suited for the following code:

In [None]:
nearest_neighbors.group('Class')

If we sort this new table by `count` (largest at the top) and then take the first item from the column `Class`, it will tell us how we classified the new patient.

In [None]:
nearest_neighbors.group('Class').sort('count', descending=True).column('Class').item(0)

Alice has been determined to belong to class `1`, meaning our classifier has predicted that she is likely to have kidney disease.
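
Putting the steps together (calculate the distances, sort, keep the nearest `k`, take the majority class), the whole procedure might be sketched as a single function. This version uses plain NumPy arrays rather than a `Table`; the function name `classify` and all of the data are made up for illustration.

```python
import numpy as np

def classify(test_point, points, classes, k):
    """Predict a class for test_point by majority vote among its k nearest rows."""
    distances = np.sqrt(np.sum((points - test_point) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]     # row indices of the k smallest distances
    labels, counts = np.unique(classes[nearest], return_counts=True)
    return labels[np.argmax(counts)]        # the label with the most votes

# Made-up standardized features and classes (1 = kidney disease, 0 = none).
points = np.array([
    [0.1, 1.0], [-0.2, 1.3], [0.3, 0.9],     # close to the test point
    [2.0, -1.0], [2.2, -0.8], [1.9, -1.2],   # far from the test point
])
classes = np.array([1, 1, 0, 0, 0, 0])
prediction = classify(np.array([0.0, 1.1]), points, classes, k=3)
```

With this made-up data, the three nearest rows have classes 1, 1, and 0, so the prediction is 1. With an even `k` a tie is possible, and `np.argmax` would then return the smaller label, so an odd `k` is the safer choice when there are two classes.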

## How Accurate is it?

Next week, we'll look further into how to determine how accurate our approach is in classifying people into a particular group.