In [1]:
import pandas as pd
import numpy as np

Iris is a famous dataset that has 3 species of iris along with 4 measurements. Learn more here:

https://en.wikipedia.org/wiki/Iris_flower_data_set

Lots of beginner tutorials use it because the flower types ("classes") aren't all linearly separable: one species separates cleanly from the other two, but those two often have similar measurements. That overlap makes this a good dataset for machine learning.

In [2]:
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data
y = iris.target

Above, we're taking advantage of methods built into this sklearn dataset. In a real-world project you'd be loading a csv or some other data file, and then specifying your x and y columns. It'd look something like:
```
data = pd.read_csv('myfile.csv')
x = data.drop('target_column', axis=1)
y = data['target_column']
```

We'll skip the exploratory data analysis (EDA) in this notebook, but you should typically perform it!

Next we need to create training and testing sets. Get used to the following code; you'll use it a lot.

In [3]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest =\
train_test_split(x, y, test_size=.2, random_state=10)

# Remember that we use random_state for teaching and
# reproducibility purposes. You probably won't use it
# in your own code.

In [4]:
xtrain.shape, ytrain.shape

((120, 4), (120,))

In [5]:
xtest.shape, ytest.shape

((30, 4), (30,))

The next step is to see how well a "dummy classifier" does when it always predicts the most common $y$ value. Here's how I do this using sklearn.

In [6]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

dum = DummyClassifier()

dum.fit(xtrain, ytrain)
preds = dum.predict(xtest)

accuracy_score(ytest, preds)

0.43333333333333335

Those 4 lines of code above are like 80% of machine learning! Get used to those, too.
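One thing to watch: the default `strategy` of `DummyClassifier` depends on your scikit-learn version (older releases guessed at random in proportion to the class frequencies), so the score you get from a bare `DummyClassifier()` can differ from the one shown here. If you want the "always predict the most common class" baseline described in the text, a minimal sketch is to ask for it explicitly. The name `baseline_acc` is just for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=.2, random_state=10)

# strategy='most_frequent' always predicts the most common class in ytrain
dum = DummyClassifier(strategy='most_frequent')
dum.fit(xtrain, ytrain)
preds = dum.predict(xtest)

# Every prediction is the same label, so this is the share of
# test rows that happen to belong to that class
baseline_acc = accuracy_score(ytest, preds)
print(baseline_acc)
```

Any real model should beat this number comfortably; if it doesn't, something is wrong with the features or the pipeline.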

Now let's use a k-nearest neighbors algorithm, which is simple but powerful. The intuition behind KNN is that you should find the data point(s) most similar to the one you're predicting, and make that same prediction.

Here's an example:

Let's say I know someone who lives in SF, drives a Prius, is 30 years old, and is not religious. Do you think she's a Democrat or Republican?

Well, you'd probably predict she's a Democrat! Because you probably know people in a similar demographic and most (if not all of them) are liberal. That's fundamentally how k-nearest neighbors works -- it asks, "Is this like something I've seen before?"
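To make the "find the most similar points and copy their answer" idea concrete, here's a bare-bones sketch of KNN written with plain NumPy. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are made up for illustration, and sklearn's real implementation is faster and handles ties and other details differently.

```python
from collections import Counter

import numpy as np


def knn_predict(xtrain, ytrain, point, k=5):
    """Predict the label of `point` by majority vote of its k nearest training rows."""
    # Euclidean distance from `point` to every training row
    distances = np.linalg.norm(xtrain - point, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(ytrain[nearest].tolist())
    return votes.most_common(1)[0][0]


# Tiny made-up dataset: two clusters with labels 0 and 1
toy_x = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_x, toy_y, np.array([1.1, 1.0]), k=3))  # prints 0
print(knn_predict(toy_x, toy_y, np.array([4.9, 5.0]), k=3))  # prints 1
```

Notice there's no real "training" step: the model just stores the data and does all of its work at prediction time.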

In [7]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
clf.fit(xtrain, ytrain)

preds = clf.predict(xtest)

accuracy_score(ytest, preds)

0.9666666666666667

Our KNN model can predict species with 96.7% accuracy -- much better than our Dummy Classifier!

**Hyperparameter tuning**

We're going to keep it simple for now and test our hyperparameters using a for-loop. The main parameter you care about with KNN is "k" -- how many similar data points you should compare the new one to. Should you only look at the one most similar? Or let the top 3 take a vote? Top 5? 9?

In [8]:
results = []

for n in range(1, 21, 2):
    clf = KNeighborsClassifier(n_neighbors=n)
    clf.fit(xtrain, ytrain)
    preds = clf.predict(xtest)
    acc = accuracy_score(ytest, preds)
    results.append([n, acc])
    
results

[[1, 0.9666666666666667],
 [3, 0.9666666666666667],
 [5, 0.9666666666666667],
 [7, 0.9666666666666667],
 [9, 1.0],
 [11, 1.0],
 [13, 1.0],
 [15, 1.0],
 [17, 0.9666666666666667],
 [19, 1.0]]

You can see that our model does really well with 9 neighbors, so you'd probably use that setting in your final model.

Note that no real-world model ever does *this* well. It's probably the result of having a tiny dataset.
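With only 30 test rows, each prediction is worth 1/30 of the score (about 3.3 percentage points), so a single mistake is the whole gap between 1.0 and 0.9667. One rough way to see how much the score depends on the particular split is to repeat the experiment with several different `random_state` values. This is only a sketch; the choice of 20 seeds is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

accs = []
for seed in range(20):
    # Same split size as before, but a different shuffle each time
    xtr, xte, ytr, yte = train_test_split(
        x, y, test_size=.2, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=9).fit(xtr, ytr)
    accs.append(accuracy_score(yte, clf.predict(xte)))

accs = np.array(accs)
print('min: ', accs.min())
print('mean:', accs.mean())
print('max: ', accs.max())
```

If the minimum and maximum are far apart, a single split isn't telling you much about which `k` is actually best.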

In [9]:
clf = KNeighborsClassifier(n_neighbors=9)

clf.fit(xtrain, ytrain)

preds = clf.predict(xtest)

accuracy_score(ytest, preds)

1.0

Aaand there it is again, just so you can see that this is our current model of choice.

# Cross validation

The more advanced ML workflow is to split your data into **3** parts: training, testing, and validation.

Technically, what we were calling testing data before is actually *validation* data. The main difference is that you use validation data to "check your work" and tune your model, while testing data is the "final exam" to see how good your model actually is.

In an ideal scenario, you evaluate the results of your testing data **only once**. It represents real-world data your model has never seen before. And, as such, your model's performance will usually be at least slightly worse on it.
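If you wanted a literal three-way split, one simple way is to call `train_test_split` twice. The sketch below uses a 60/20/20 split and new variable names (`x_train`, `x_val`, `x_test`, and so on); both are assumptions for illustration. The rest of this notebook takes a different route and uses cross-validation on the training data instead of one fixed validation set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# First carve off the "final exam" test set (20% of all rows)
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Then split what's left into training and validation sets.
# 0.25 of the remaining 80% is 20% of the original data.
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0)

print(x_train.shape, x_val.shape, x_test.shape)
# (90, 4) (30, 4) (30, 4)
```

You'd tune on `x_val` as often as you like and touch `x_test` once at the very end.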

In [10]:
# Separate out the testing data
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=.1, random_state=35)

In [11]:
from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier(n_neighbors=1)

cvs = cross_val_score(clf, xtrain, ytrain, cv=5, scoring='accuracy')
# cv=5 means we're going to run this model 5 times,
# each time validating on a different 20%
# of the data

# cvs is an array of the accuracy scores you obtained:
cvs

array([0.89285714, 0.96296296, 0.96296296, 1.        , 1.        ])

So `cvs` itself isn't all that useful to you. What you really want is its mean and standard deviation:

In [12]:
print('Mean accuracy:     ', cvs.mean())
print('Standard deviation:', cvs.std())

Mean accuracy:      0.9637566137566138
Standard deviation: 0.03912840612589273


In [13]:
from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier(n_neighbors=3)

cvs = cross_val_score(clf, xtrain, ytrain, cv=5, scoring='accuracy')

print('Mean accuracy:     ', cvs.mean())
print('Standard deviation:', cvs.std())

Mean accuracy:      0.9711640211640212
Standard deviation: 0.04169835865372268


In [14]:
from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier(n_neighbors=5)

cvs = cross_val_score(clf, xtrain, ytrain, cv=5, scoring='accuracy')

print('Mean accuracy:     ', cvs.mean())
print('Standard deviation:', cvs.std())

Mean accuracy:      0.9783068783068783
Standard deviation: 0.02870827506084769


From the 3 we tried above, `n_neighbors=5` looks to be the best.
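Copy-pasting a cell for each value gets old fast. Here's a sketch of the same comparison as a loop, in the spirit of the for-loop from the hyperparameter section. The names `cv_results` and `best_n` are just for illustration, and picking the highest mean is only one possible rule (you might also weigh the standard deviation).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=.1, random_state=35)

cv_results = []
for n in [1, 3, 5]:
    clf = KNeighborsClassifier(n_neighbors=n)
    cvs = cross_val_score(clf, xtrain, ytrain, cv=5, scoring='accuracy')
    cv_results.append((n, cvs.mean(), cvs.std()))

for n, mean, std in cv_results:
    print(f'n_neighbors={n}: mean={mean:.3f}, std={std:.3f}')

# Pick the setting with the highest mean cross-validated accuracy
best_n = max(cv_results, key=lambda row: row[1])[0]
print('Best n_neighbors:', best_n)
```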

Time for the final exam!

In [15]:
# Note that you can combine the first 2 lines of code if you prefer!
clf = KNeighborsClassifier(n_neighbors=5).fit(xtrain, ytrain)

preds = clf.predict(xtest)

accuracy_score(ytest, preds)

0.9333333333333333

That's the basic machine learning workflow. The one major piece you're still missing is how to automate finding the best hyperparameters (the knobs you turn -- in this case, just `n_neighbors`).

I'll add that to this notebook shortly. Hope this helps for now!
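In the meantime, one common way to automate the search in sklearn is `GridSearchCV`, which runs cross-validation for every setting in a grid and remembers the best one. This is a minimal sketch, not necessarily the approach a later version of this notebook will take; the grid of odd values from 1 to 19 is borrowed from the for-loop earlier.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=.1, random_state=35)

# Every value in this list gets its own 5-fold cross-validation run
param_grid = {'n_neighbors': list(range(1, 21, 2))}

search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring='accuracy')
search.fit(xtrain, ytrain)

print('Best setting:    ', search.best_params_)
print('Best CV accuracy:', search.best_score_)

# By default (refit=True), the best setting is refit on all of xtrain,
# so the search object can be used directly for the final exam
test_acc = search.score(xtest, ytest)
print('Test accuracy:   ', test_acc)
```

The test set still gets used only once, at the very end, after the search has already chosen `n_neighbors`.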