# ML Tutorial for CHEGs!

Welcome to the in-class portion of the ML tutorial! 

Don't worry: you don't need any knowledge of Python to do this. We'll walk you through every step.

Just follow the comments!

**The first step in creating a robot that can do your homework is to teach it to recognize the numbers on your problem set.**

How would we go about teaching a computer to recognize digits?

Well, it would probably be pretty hard to write an algorithm to do it directly. Maybe your teacher has messy handwriting, or writes really lightly, or any number of other things get in the way. Instead, we will try to teach the computer the same way you might teach a kindergartener: by showing it many examples of each digit and telling it what digit that is.

This *data-driven* approach requires a *training set* of many handwritten numbers to teach the computer. Luckily, we have one below.

In [None]:
# I'm a comment, hi there!
# Read through the code until the end and follow any instructions in the comments!
# It's ok if you don't understand what everything does.

# Imports the datasets, classifier, and metrics
from sklearn import datasets, neighbors, metrics

# Plotting and math packages for Python
import matplotlib.pyplot as plt 
import numpy as np

# Saves the datasets containing the labeled digits
digits = datasets.load_digits()

# Click on this box and press Ctrl+Enter on your keyboard to run this code!
# (the "Ctrl" and "Enter" keys, not the "+" key)

Note that each digit in the dataset is stored as an array of pixel values, as demonstrated below.

In [None]:
# We want to see how much data we have
print('(Datapoints, Attributes)')
print(digits.data.shape, '\n')

# An example of the data in the set
print('For example, this is the array form of the first digit in the set.')
print('Note it is composed of 8 separate 1x8 arrays.')
print('We can think of it as an 8x8 pixel image.')
print(digits.images[0], '\n')

# Reshapes the data so we can use it for training, validation, and testing later
# Data goes from an array of arrays to just one array
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# The example reshaped into a form that can be used by our algorithm
print('The first digit in the set has been reformatted so it can be used by our ML algorithm.')
print('Note that the one datapoint now is a single 1x64 array.')
print(data[0, :], '\n')

# Setting up our images and labels to be shown
print('And these are the first few digits in the dataset with their corresponding labels')
images_and_labels = list(zip(digits.images, digits.target))

# Visualizing the first few images in the dataset
# Runs through the first 10 images and labels
for index, (image, label) in enumerate(images_and_labels[:10]):
    
    plt.subplot(2, 5, index + 1) # Initializes plot
    plt.axis('off') # No axes
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Shows each image in grayscale on white background
    plt.title('Label: %i' % label) # Sets labels
    
# Press Ctrl+Enter again!

So from the above exploration of the data, we see we have 1797 handwritten digits (rows of data samples), each represented as an 8x8 array of pixels.

In the dataset we will use, we have reformatted each 8x8 array of pixels as a 1x64 array of 64 different variables, or *attributes*. Here, each row represents a handwritten digit and each column represents one of the pixels of that handwritten digit.
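If the reshaping step feels mysterious, here is a tiny sketch of the same idea using plain Python lists. The 2x3 "image" and its pixel values are made up for illustration; the point is only that flattening reads the pixels row by row, so a 2x3 image becomes 6 attributes the same way an 8x8 image becomes 64.

```python
# A toy 2x3 "image" stored as a list of rows (made-up pixel values)
image = [
    [0, 5, 9],
    [3, 0, 7],
]

# Flatten row by row: all of row 0 first, then all of row 1
flat = [pixel for row in image for pixel in row]

print(flat)       # [0, 5, 9, 3, 0, 7]
print(len(flat))  # 6 attributes for a 2x3 image; an 8x8 image would give 64
```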

**Machine learning techniques such as this one can be applied to many different cases - you just need to change what attributes are in the columns of your data!**

For instance, if you wanted to predict what clear liquid you had in a bottle, you might have attributes for density, vapor pressure, whether it dissolves in water, or others.

Or, if you wanted to predict the probability of Dr. Burkey's chemical plant blowing up, you might have attributes for how peppy Emily is, how annoying Victor is being, whether Charles has to take care of his wife, and how much pressure Wanda is putting on you.

**The labeled digits dataset above will be used to train our model, specifically the k-Nearest Neighbors classifier.** This classifier will be able to tell us what digit any new, unlabeled input is, even one it has never seen before, as long as it is in the same 8x8 format as the training samples. We hope that most of the classifier's predictions line up with what the digits actually are, the *ground truth*.
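To give a feel for what a k-Nearest Neighbors classifier does under the hood, here is a minimal sketch in plain Python. It is not scikit-learn's implementation; the function name `knn_predict` and the toy two-attribute data are made up for illustration. The idea: measure the distance from the new sample to every stored training sample, keep the k closest, and let them vote.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Predict a label for new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every stored training point
    distances = [math.dist(p, new_point) for p in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # The most common label among those neighbors wins
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set (made up): two attributes per sample, two classes
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # A
print(knn_predict(train_points, train_labels, (3.9, 4.1), k=3))  # B
```

For the digits, each sample would simply have 64 attributes instead of 2. In this sketch, a tied vote goes to whichever tied label shows up first among the neighbors; a real library may handle ties differently.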

Before training our model, we will need to separate our dataset into a *training set* (which the classifier will learn from), a *validation set* (which we can use to test our initial results and fine-tune the value of k), and a *testing set* (which we will hide from the classifier until we are confident we have a good model). Splitting up our dataset like this helps us avoid *overfitting*, where the model has great performance on this dataset, but does much worse when exposed to new inputs. 
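As a rough illustration of how such a split works, the sketch below cuts a stand-in dataset of 20 samples into three consecutive pieces. The helper `split_points` and the 60/20/20 percentages are hypothetical choices for this example, not something the tutorial prescribes.

```python
def split_points(n_samples, train_pct, val_pct):
    """Return (train_end, val_end) for a sequential train/validation/test split."""
    train_end = n_samples * train_pct // 100
    val_end = n_samples * (train_pct + val_pct) // 100
    return train_end, val_end

samples = list(range(20))  # stand-in for 20 datapoints
train_end, val_end = split_points(len(samples), train_pct=60, val_pct=20)

training = samples[:train_end]           # datapoints 0 up to train_end
validation = samples[train_end:val_end]  # datapoints train_end up to val_end
testing = samples[val_end:]              # everything that is left

print(len(training), len(validation), len(testing))  # 12 4 4
```

Note that every datapoint lands in exactly one of the three sets. A sequential split like this only makes sense when the samples are not sorted by label; otherwise the classifier might never see some digits during training.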

In [None]:
# ALERT: you have to change some of the numbers here to get the code to work

# Remember there are 1797 datapoints to distribute among all these categories!
# We will set the training set as datapoints 0 through trainEnd (you specify trainEnd below)
# The validation set will be datapoints trainEnd through valEnd (you specify valEnd below)
# Datapoints valEnd till the end will be the testing set

# Aim for around 40-80% training data, ~10% validation data, and 10-50% testing data.

# Training datapoints
trainEnd = #Put number here
trainingDigits = data[:trainEnd] # Sets the images
trainingLabels = digits.target[:trainEnd] # Sets the corresponding labels


# Validation datapoints
valEnd = #Put number here
validationDigits = data[trainEnd:valEnd]
validationLabels = digits.target[trainEnd:valEnd]

# Testing datapoints
testingDigits = data[valEnd:]
testingLabels = digits.target[valEnd:]

# Press Ctrl+Enter when you've adjusted the sets to your liking!

Now we can train the k-Nearest Neighbors classifier. This classifier has a number of settings that can be optimized to make it recognize digits better. These settings are called *hyperparameters*. As we discussed, the hyperparameter you have to optimize is k.

k represents the number of neighbors that we consider when deciding what digit to assign a new input. **We will try tweaking the value of k to figure out what works best on the validation set.** Then, once you are satisfied with a value of k, we can run the classifier on the testing set and report our accuracy.

Be careful! Do not try to optimize k on the testing set, or else you are effectively training your model on that set as well. Then the results of our training might not be *generalizable*, say if we wrote our own digits and had the classifier try to categorize those.
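Here is a small sketch of that workflow on made-up one-attribute data: compare several values of k using only the validation set, pick the best one, and then touch the testing set exactly once. The helpers `knn_predict` and `accuracy`, the data, and the candidate k values are all hypothetical, and the classifier is a stripped-down stand-in rather than the scikit-learn one used in this tutorial.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training samples closest to x (1-D toy version)."""
    nearest = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, held_out, k):
    """Fraction of held-out samples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == label for x, label in held_out)
    return correct / len(held_out)

# Made-up toy data: (attribute, label) pairs, with two "noisy" training points
train = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.6, "high"),
         (5.4, "low"), (6.0, "high"), (6.5, "high"), (7.0, "high")]
validation = [(1.2, "low"), (2.5, "low"), (5.5, "high"), (6.8, "high")]
test = [(1.7, "low"), (6.6, "high")]

# Step 1: compare candidate k values using ONLY the validation set
val_acc = {k: accuracy(train, validation, k) for k in [1, 3, 5]}
best_k = max(val_acc, key=val_acc.get)  # first k with the highest accuracy

# Step 2: use the testing set once, with the k chosen above
test_acc = accuracy(train, test, best_k)

print(val_acc, best_k, test_acc)  # {1: 0.5, 3: 1.0, 5: 1.0} 3 1.0
```

With k = 1 the two noisy training points fool the classifier on the validation set; a larger k lets the other neighbors outvote them. The testing accuracy is only computed after k is locked in.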

In [None]:
# ALERT: you have to change some of the numbers here to get the code to work

# We want to test a bunch of values 
# For instance, [1, 3, 5, 100]
for k in [#Put numbers here]:
    
    # Sets up our model with the appropriate k
    neigh = neighbors.KNeighborsClassifier(n_neighbors = k)
  
    # Fits the classifier and has it predict labels
    neigh.fit(trainingDigits, trainingLabels)
    validationPredictions = neigh.predict(validationDigits)
    
    # Calculating and printing some of the metrics for our data
    acc = np.mean(validationPredictions == validationLabels) # Calculates accuracy
    print('k Value: %i' % k)
    print('Accuracy: %f' % acc, '\n')
    print(metrics.classification_report(validationLabels, validationPredictions))
    # A summary of some testing metrics

# Press Ctrl+Enter when you're ready!

Once you've picked your favorite k-value, use it for the testing below! **Make sure you run the testing set only once, or you risk overfitting your hyperparameter k!**

In [None]:
# ALERT: you have to change some of the numbers here to get the code to work

# Pick the k-value you think did best above
k = # Put number here

# Sets up the classifier
neigh = neighbors.KNeighborsClassifier(n_neighbors = k)

# Fits the classifier and has it predict labels
neigh.fit(trainingDigits, trainingLabels)
testingPredictions = neigh.predict(testingDigits)

# Calculating and printing some of the metrics for our data
acc = np.mean(testingPredictions == testingLabels) # Calculates accuracy
print('k Value: %i' % k) 
print('Accuracy: %f' % acc, '\n')
print(metrics.classification_report(testingLabels, testingPredictions))
# A summary of some testing metrics

# Press Ctrl+Enter when you're ready!

Great job! Hopefully that was pretty good accuracy! Some of the numbers that the classifier predicted are visualized below.

In [None]:
# Pairs each testing image with the label our classifier predicted for it
print('Here are some examples of what our classifier predicted!')
images_and_predictions = list(zip(digits.images[valEnd:], testingPredictions))

# Visualizing the first few images in the testing set
# Runs through the first 10 images and predictions
for index, (image, prediction) in enumerate(images_and_predictions[:10]):
    plt.subplot(2, 5, index + 1) # Initializes plot
    plt.axis('off') # No axes
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Shows each image in grayscale on white background
    plt.title('Predict: %i' % prediction) # Sets the title to the predicted digit

# Press Ctrl+Enter when you're ready!

k-Nearest Neighbors was a pretty good choice for this task, but like any method it has pros and cons.

**Pros:**
- Easy to understand
- Little training time, since all you have to do is store all the datapoints

**Cons:**
- Takes more time to classify since you need to compare every new testing sample to all of the training samples



Also, many similar ML classification algorithms suffer from some of the same pitfalls.
- Turning an image upside down, shifting it, distorting it, or darkening it could make it fail miserably (after all, it's just learning from pixel values).

This could make the classifier unreliable, or even dangerous to rely on, when the quality of your data changes!
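To see why shifting hurts so much, consider this sketch. It builds a made-up 8x8 image of a vertical bar and compares it, pixel by pixel, with a shifted copy, a fainter copy, and a completely empty image. The helper names and the 0 to 16 brightness scale are assumptions chosen for this illustration.

```python
import math

def blank(size=8):
    """An empty size x size image (all pixels 0)."""
    return [[0] * size for _ in range(size)]

def vertical_bar(column, brightness=16, size=8):
    """An image that is empty except for one vertical bar in the given column."""
    image = blank(size)
    for row in image:
        row[column] = brightness
    return image

def pixel_distance(a, b):
    """Euclidean distance between two images, compared pixel by pixel."""
    return math.sqrt(sum((pa - pb) ** 2
                         for row_a, row_b in zip(a, b)
                         for pa, pb in zip(row_a, row_b)))

original = vertical_bar(column=3)
dimmed = vertical_bar(column=3, brightness=12)  # same bar, slightly fainter
shifted = vertical_bar(column=4)                # same bar, moved one pixel right
empty = blank()

print(pixel_distance(original, dimmed))   # about 11.3
print(pixel_distance(original, empty))    # about 45.3
print(pixel_distance(original, shifted))  # 64.0
```

To a person, the shifted bar is obviously the same shape. To a pixel-by-pixel comparison, it looks even less like the original than an empty image does, because none of the bright pixels line up anymore.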



**Still, you've now got your homework-doing robot able to read numbers! Now all you have to do is teach it to do a Laplace transform!**

Thanks for trying our tutorial!

*Adapted from cs231n and Gael Varoquaux's example on the scikit-learn website. (Thanks to them for the inspiration and some code).*