In [None]:
from datascience import *
import numpy as np
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Classifying Patients ##

Ch 17.1: Nearest Neighbors
We’re going to work with a data set that was collected to help doctors diagnose chronic kidney disease (CKD). Each row in the data set represents a single patient who was treated in the past and whose diagnosis is known. For each patient, we have a bunch of measurements from a blood test. We’d like to find which measurements are most useful for diagnosing CKD, and develop a way to classify future patients as “has CKD” or “doesn’t have CKD” based on their blood test results.

In [None]:
ckd = Table.read_table('data/ckd.csv').relabeled('Blood Glucose Random', 'Glucose')
ckd.show(3)

Is Red Blood Cells a categorical or quantitative variable?


What about Age? Categorical/Quantitative?


What about Class? Categorical/Quantitative?
Class = 0 means no Chronic Kidney Disease
Class = 1 means the patient has Chronic Kidney Disease



In [None]:
ckd.group('Class')


In [None]:
ckd.scatter('White Blood Cell Count', 'Glucose', group = "Class")

From the graph above, if a patient has a white blood cell count of 25,000,
do you predict they'll have CKD or not?



What about glucose level: if a patient has a glucose level of 100,
do you predict they'll have CKD or not?



In [None]:
ckd.scatter('Hemoglobin', 'Glucose', group='Class')

In [None]:
# we want a way to predict someone's class
# without having to plot & eyeball this graph every time.
#
# one way to do this is to put some thresholds into code

max_glucose_for_0 = ckd.where('Class',are.equal_to(0)).column('Glucose').max()
# return max glucose amount from blue points (class = 0)

min_hemoglobin_for_0 = ckd.where('Class',are.equal_to(0)).column('Hemoglobin').min()
# return min Hemoglobin level from blue points (class = 0)

In [None]:
def classify(hemoglobin, glucose):
    # if hemoglobin is less than the min hemoglobin level seen for class 0
    # or
    # if glucose level is greater than the max glucose level seen for class 0
    # return 1, has CKD
    # else return 0, no CKD
    if hemoglobin < min_hemoglobin_for_0 or glucose > max_glucose_for_0:
        return 1
    else:
        return 0

In [None]:
# Let's try our classifier!
classify(15, 100)

In [None]:
classify(10, 300)
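
To see what the threshold rule does without loading the CSV, here is a minimal self-contained sketch on a few made-up (hemoglobin, glucose, class) rows. The numbers are for illustration only and are not taken from `ckd.csv`; `toy_rows` and `toy_classify` are hypothetical names. It derives the two thresholds the same way as above and then checks the rule against the toy rows.

```python
# Made-up rows for illustration only: (hemoglobin, glucose, class).
# These values are NOT from ckd.csv.
toy_rows = [
    (15.0, 100, 0),
    (14.2, 120, 0),
    (13.5, 95, 0),
    (9.8, 140, 1),
    (11.0, 300, 1),
    (13.8, 110, 1),  # a class-1 row that sits inside the class-0 range
]

# Thresholds come only from the class-0 rows, as in the cell above.
class0 = [row for row in toy_rows if row[2] == 0]
max_glucose = max(glucose for _, glucose, _ in class0)
min_hemoglobin = min(hemoglobin for hemoglobin, _, _ in class0)

def toy_classify(hemoglobin, glucose):
    """Return 1 (has CKD) if the point falls outside the class-0 range, else 0."""
    if hemoglobin < min_hemoglobin or glucose > max_glucose:
        return 1
    return 0

predictions = [toy_classify(h, g) for h, g, _ in toy_rows]
correct = sum(p == c for p, (_, _, c) in zip(predictions, toy_rows))
accuracy = correct / len(toy_rows)
print(predictions, accuracy)
```

By construction, every class-0 row used to set the thresholds is classified as 0. The only mistakes this rule can make on those rows are class-1 patients whose measurements fall inside the class-0 range, like the last toy row.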

## Classifying Banknotes ##

This time we’ll look at predicting whether a banknote (e.g., a $20 bill) is counterfeit or legitimate. Researchers have put together a data set for us, based on photographs of many individual banknotes: some counterfeit, some legitimate. They computed a few numbers from each image, using techniques that we won’t worry about for this course. So, for each banknote, we know a few numbers that were computed from a photograph of it as well as its class (whether it is counterfeit or not). Let’s load it into a table and take a look.

In [None]:
# Can we tell if a bank note is real or not?
banknotes = Table.read_table('data/banknote.csv')
banknotes

In [None]:
banknotes.group('Class')
# Class 0 = real bank note
# Class 1 = fake bank note

In [None]:
# Let's compare the WaveletVar and WaveletCurt columns
banknotes.scatter('WaveletVar', 'WaveletCurt', group='Class')
# Class 0 = real bank note (BLUE)
# Class 1 = fake bank note (GOLD)

Pretty interesting! Those two measurements do seem helpful for predicting whether the banknote is counterfeit or not. However, in this example you can now see that there is some overlap between the blue cluster and the gold cluster. This indicates that there will be some images where it’s hard to tell whether the banknote is legitimate based on just these two numbers. Still, you could use a 𝑘-nearest neighbor classifier to predict the legitimacy of a banknote.

Take a minute and think it through: Suppose we used 𝑘=11 (say). What parts of the plot would the classifier get right, and what parts would it make errors on? What would the decision boundary look like?
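
As a rough sketch of what a 𝑘-nearest neighbor classifier does, the function below finds the 𝑘 training points closest to a new point and takes a majority vote of their classes. It uses only the Python standard library, the name `knn_classify` is hypothetical, and the toy points are made up rather than taken from `banknote.csv`. Because the toy set has only seven points, the example uses 𝑘=3 instead of 𝑘=11.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, new_point, k):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from new_point to each training point, paired with its label.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the neighbors is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2D points and classes, for illustration only (not from banknote.csv).
toy_points = [(-3.0, 5.0), (-2.5, 4.0), (-2.0, 6.0),
              (2.0, 1.0), (3.0, 0.5), (2.5, -1.0), (0.0, 2.0)]
toy_labels = [1, 1, 1, 0, 0, 0, 1]

first = knn_classify(toy_points, toy_labels, (-2.8, 4.5), k=3)
second = knn_classify(toy_points, toy_labels, (2.4, 0.0), k=3)
print(first, second)
```

With two classes, an odd 𝑘 avoids tied votes. Points deep inside one cluster have neighbors that all agree; the uncertain predictions are for points in the region where the two clusters overlap.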


In [None]:
# here’s what we’d get if we used a different pair of
# measurements from the images:

banknotes.scatter('WaveletSkew', 'Entropy', group='Class')

There does seem to be a pattern, but it’s a pretty complex one. Nonetheless, the 𝑘-nearest neighbors classifier can still be used and will effectively “discover” the pattern from the data. This illustrates how powerful machine learning can be: it can take advantage even of patterns that we would not have anticipated, or that we would not have thought to “program into” the computer.



In [None]:
fig = plots.figure(figsize=(8,8))
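# Axes3D(fig) no longer attaches the axes to the figure in recent matplotlib
# releases, so create the 3D axes through the figure instead.
ax = fig.add_subplot(projection='3d')
ax.scatter(banknotes.column('WaveletSkew'),
           banknotes.column('WaveletVar'),
           banknotes.column('WaveletCurt'),
           c=banknotes.column('Class'),
           cmap='viridis',
           s=50);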