# A Gentle Introduction to Machine Learning with sklearn

## Basic Tenets

> [Machine Learning] gives computers the ability to learn without being explicitly programmed
>~Arthur Samuel

Machine learning is about finding ways to make a machine predict an outcome based on data.
Examples:
1. Given newspaper articles, determine their sentiment, that is, is the article overall positive or negative?
2. Given the amount of time a student studied last night, predict their test score (0-100)
3. Given a whole bunch of pictures, try to name what they are (is it a car?  is it red?  is it alive?)

**Supervised Learning**
Using data with predetermined outcomes to train a model. (e.g. 1,2)

**Unsupervised Learning**
Given data without predetermined outcomes or labels, try to find some pattern (this is harder). (e.g. 3)

**Classification**
The outcomes are discrete labels (e.g. 1, 3)

**Regression**
The outcomes are continuous values (e.g. 2)

Data comes in the form of features, the measurable inputs the model gets to see.  For the examples above,
1. Word counts in a newspaper article
2. Number of hours the student studied
3. The pixel values

## The Machine Learning Pipeline:
- **Preprocessing/Feature Extraction** - Obtain the data, clean it, and try to get it into a usable form.  This can include imputation (filling in missing data), scaling the data, etc.  For example, the so-called "bag of words" representation of text consists of a vocabulary of words and, for each text document, the number of times each word appears in it
- **Feature Selection** - Find out what features are relevant (e.g. which words tell you the most about the sentiment of a newspaper article?)  Words like "the" are likely to be irrelevant.  Words like "happy" might mean something.
- **Fitting/Training** - Models are actually mathematical formulas that have adjustable parameters.  These parameters can change to yield different outcomes given input data.  In this step, the machine will read in the data and find parameters that maximize some measure that quantifies the quality of its prediction (for example, compute a probability that a data point is predicted correctly and then try to maximize this probability; this is called an MLE - Maximum Likelihood Estimate)
- **Model Selection/Tuning** - Many models have hyperparameters, that is, parameters that YOU the programmer must set before training, chosen to maximize the quality of the model's predictions.  Sometimes you need to do a little guess and check: try one set of hyperparameters, train on one portion of the data and score on a held-out portion, then try another set.  Repeating this over several different train/held-out splits is called Cross Validation
- **Prediction** - Using your tuned and trained model, you can predict test data (equivalent to plugging in values into a massive equation and observing the result)
- **Model Evaluation** - Look at some measures of evaluation (e.g. accuracy/precision of classification, mean squared error for regression, etc.)
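
The steps above can be sketched end to end with sklearn. This is only one possible arrangement, under stated assumptions: the small built-in iris data set stands in for "your data", and the particular scaler, feature selector, classifier (K-Nearest Neighbors, which is used later in this tutorial), and candidate hyperparameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the built-in iris data set stands in for "your data"
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

pipe = Pipeline([
    ("scale", StandardScaler()),              # preprocessing
    ("select", SelectKBest(f_classif, k=2)),  # feature selection: keep the 2 highest-scoring features
    ("knn", KNeighborsClassifier()),          # the model to fit
])

# Model selection/tuning: 5-fold cross validation over a few values of k
search = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)                    # fitting/training

y_pred = search.predict(X_test)                 # prediction
test_accuracy = accuracy_score(y_test, y_pred)  # model evaluation
print("Best k:", search.best_params_["knn__n_neighbors"])
print("Test accuracy:", test_accuracy)
```

Wrapping the steps in a `Pipeline` means the scaler and feature selector are re-fit inside every cross validation fold, so the held-out portion never influences the preprocessing.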

Ok, let's play with some data!

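## Playing with the Digits Data Set

We will use one of sklearn's many pre-provided data sets, digits: 1797 handwritten digits stored as 8x8 grayscale images.  It is similar in spirit to the well-known MNIST data set, but it is a smaller, lower-resolution set that ships with sklearn rather than a sample of MNIST itself.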

In [None]:
from sklearn import datasets
import numpy as np

digits = datasets.load_digits()
print(digits['DESCR'])
print("Data shape: {}".format(digits.data.shape))
print("Target shape: {}".format(digits.target.shape))
np.column_stack([digits.data, digits.target])

Let's get a feel for what kind of data we're dealing with here.

In [None]:
import matplotlib.pyplot as plt

def display_digit(digit_data):
    # Pixel intensities range from 0 to 16; rescale them to [0, 1] for plotting
    dig = digit_data.reshape(8, 8) / 16.0
    plt.imshow(dig, cmap="gray")
    plt.show()

sample = 0 # change this to any index between 0 and 1796 (the data set has 1797 samples) to view a plot of that data point
print("Sample label: {}".format(digits.target[sample]))
display_digit(digits.data[sample])

Now let's pick a model and train it with this data!

In [None]:
from sklearn.linear_model import LogisticRegression

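# Step 1: Create the classifier, specifying parameters (or use defaults)
# max_iter is raised above the default of 100 so the solver can converge on these unscaled pixel values
clf = LogisticRegression(max_iter=5000)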

# Step 2: Fit the data!
clf.fit(digits.data, digits.target)

# Step 3: Try predicting something
pred_num = 0 # Change this to any index between 0 and 1796 (the data set has 1797 samples) to predict a different data point
pred = clf.predict([digits.data[pred_num]]) # predict expects a 2-D array, hence the extra brackets
print("Classifier predicted {}".format(pred[0]))
print("Actual value: {}".format(digits.target[pred_num]))
display_digit(digits.data[pred_num])

Now let's evaluate how well our model performs.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# predict everything we originally trained on
pred = clf.predict(digits.data)
print("Accuracy: {}".format(accuracy_score(digits.target, pred)))
print(classification_report(digits.target, pred))

This is remarkably accurate!  However, in this case we've tested our model on the same data we trained it with, so the score says little about how the model handles data it has never seen.  To illustrate why this matters, we will split the data into a training set and a test set, and use a different model.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.33, random_state=1)

print("Train X shape: {}".format(X_train.shape))
print("Test X shape: {}".format(X_test.shape))


Let's try a different model this time, called K-Nearest Neighbors (KNN).

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Picture each data point plotted in an n-dimensional space, where n is the number of features.
# KNN assigns a label by majority vote among the k training points closest to the point we wish to predict.
clf = KNeighborsClassifier(n_neighbors=5) # Here we let k=5
clf.fit(X_train, y_train)
y_pred = clf.predict(X_train) # predict using the training data
accuracy_score(y_train, y_pred)
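
To make the description of KNN in the comments above concrete, here is a minimal NumPy sketch of the idea. It is not how sklearn implements `KNeighborsClassifier`; the function name `knn_predict`, the Euclidean distance, the tie-breaking rule, and the toy points are all assumptions chosen for illustration.

```python
import numpy as np

def knn_predict(X_known, y_known, x, k=5):
    """Predict the label of one point x by majority vote among its k nearest known points."""
    # Euclidean distance from x to every known point
    dists = np.sqrt(((X_known - x) ** 2).sum(axis=1))
    # Indices of the k closest known points
    nearest = np.argsort(dists)[:k]
    # Majority vote; on a tie, argmax returns the first maximum, i.e. the smallest label
    labels, votes = np.unique(y_known[nearest], return_counts=True)
    return labels[np.argmax(votes)]

# Tiny made-up 2-D example: one cluster near the origin, one near (5, 5)
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_X, toy_y, np.array([0.5, 0.5]), k=3))  # point near the first cluster
print(knn_predict(toy_X, toy_y, np.array([5.5, 5.5]), k=3))  # point near the second cluster
```

Note that "training" here amounts to storing the data; all the work happens at prediction time.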

Here, k is a hyperparameter, meaning we could change it to get better performance.  Can we do better?

In [None]:
clf = KNeighborsClassifier(n_neighbors=1) # now k=1
clf.fit(X_train, y_train)
y_pred = clf.predict(X_train) # predict using the training data
accuracy_score(y_train, y_pred)

The prediction on training data was perfect (why?).  But how will it do on the test data set?

In [None]:
y_pred = clf.predict(X_test) # predict using the test data
accuracy_score(y_test, y_pred)

Let's try to find a k that will give us the best results on the test data set.

In [None]:
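ks = list(range(1, 20))
k_results = []
for k in ks:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    k_results.append(accuracy_score(y_test, y_pred))

# Plot accuracy against the actual k values rather than the list index
plt.plot(ks, k_results)
plt.xlabel("k (number of neighbors)")
plt.ylabel("Test accuracy")
plt.show()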
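
Choosing k by looking at test accuracy lets information from the test set leak into the choice of model, so the final score can look better than it should.  One common alternative, tying back to the Cross Validation step in the pipeline section, is to pick k using only the training data and touch the test set once at the end.  The sketch below assumes the same split as above and 5 folds; the fold count and the variable names are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as earlier in this tutorial
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

ks = list(range(1, 20))
cv_means = []
for k in ks:
    # 5-fold cross validation on the training data only
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    cv_means.append(scores.mean())

# Pick the k with the best average cross validation accuracy
best_k = ks[int(np.argmax(cv_means))]

# Refit on all of the training data, then evaluate once on the untouched test set
final_clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final_clf.score(X_test, y_test)
print("Best k: {}".format(best_k))
print("Test accuracy: {}".format(test_acc))
```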