![ADSA Logo](http://i.imgur.com/BV0CdHZ.png?2 "ADSA Logo")

# ADSA Workshop 6 - Introduction to Scikit-Learn

Workshop content created by Aditya Bhargava and Dongyup Lee, with some content adapted from:
* http://scikit-learn.org/stable/tutorial/basic/tutorial.html
* http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html

In this workshop we will learn the basics of machine learning using Scikit-Learn. We will cover loading sample datasets, learning and predicting with support vector classification, and model persistence.

## What is machine learning?

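While a lot of people like to make it sound really complex, machine learning is quite simple at its core: we give the machine examples, and it learns rules from them that it can apply to new data. Many tasks, including the ones in this workshop, come down to classification, where the machine assigns each input to a category. When we guide that learning by telling the machine the right answers for some of the examples, we are supervising it.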

There are two types of machine learning:
* Supervised learning
* Unsupervised learning

In supervised learning, we, the scientists, guide the learning process: we tell the machine what some of the data is (the labels) and ask it to work out the rest. In unsupervised learning, the machine gets no labels and has to find structure in the data on its own.

Within supervised learning, we have classification, where the categories are defined in advance. An example is the digit recognition tutorial below: we have a set of images of numbers with known labels, and an unknown image that we want to fit into one of the pre-defined categories.
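To make the difference between the two types concrete, here is a minimal sketch on made-up toy data. The points, the labels, and the choice of KMeans as the unsupervised algorithm are assumptions for illustration only; they are not part of the workshop material.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups of 2-D points.
points = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                   [5.0, 5.0], [5.3, 4.8], [4.9, 5.4]])
labels = np.array([0, 0, 0, 1, 1, 1])  # known answers, used only when supervising

# Supervised: the classifier sees both the points and their labels.
classifier = SVC(kernel='linear').fit(points, labels)
print(classifier.predict([[0.3, 0.3], [5.1, 5.1]]))

# Unsupervised: the clustering algorithm sees only the points and has to
# discover the two groups itself (the cluster ids it picks are arbitrary).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(clusters)
```

The classifier can name the category of a new point because it was told the categories up front; the clustering step can only say which points belong together.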

## Support Vector Machine (SVM) example with character learning

In this example we are going to use a pre-existing dataset that ships with the sklearn library in its datasets module.
We are also going to import svm, which provides support vector machines, a family of machine learning algorithms. What an SVM does is illustrated below.
Given labeled training data, the SVM algorithm outputs an optimal hyperplane which categorizes new examples.

 ![SVM](http://docs.opencv.org/2.4/_images/separating-lines.png "SVM")

 ![SVM2](http://docs.opencv.org/2.4/_images/optimal-hyperplane.png "SVM2")
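The idea in the figures can be sketched in a few lines. This is only an illustration under stated assumptions: the four toy points and the two new points are made up, and a linear kernel is chosen so that the separating hyperplane can be read directly from the fitted model.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data: class 0 near the origin, class 1 further out.
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])

# A linear kernel makes the decision boundary a straight line
# (a hyperplane in two dimensions).
linear_clf = svm.SVC(kernel='linear').fit(X_toy, y_toy)

# The hyperplane is the set of points x where w . x + b = 0.
w = linear_clf.coef_[0]
b = linear_clf.intercept_[0]
print("w =", w, "b =", b)

# The support vectors are the training points closest to the hyperplane.
print(linear_clf.support_vectors_)

# The sign of the decision function says on which side a new point falls.
new_points = np.array([[0.5, 1.0], [3.5, 3.0]])
decision = linear_clf.decision_function(new_points)
print(decision)
print(linear_clf.predict(new_points))
```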

We are going to make our machine guess a blurred out number. To do this, we are first going to train it on a bunch of labeled examples.

In [28]:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm




The digits dataset contains an already structured and labeled set of data: pixel information for images of the digits 0 through 9 that we can use for training and testing.

In [29]:
# Load the digits dataset and store it in the digits variable.
digits = datasets.load_digits()

This is what the dataset looks like. It includes labels.
digits.data is the actual data (features): one row of pixel values per image.

In [30]:
print(digits.data)

[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]


digits.target holds the label (the digit each image represents) for every row of digits.data.

In [31]:
print(digits.target)

[0 1 2 ..., 8 9 8]


In [32]:
print(len(digits.data))

1797


In [33]:
print(digits.images[0])

[[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]
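The outputs above suggest that digits.data and digits.images hold the same pixels in two different layouts. A short check along these lines should confirm it; the variable name `flattened` is introduced here only for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

print(digits.images.shape)  # one 8x8 image per sample
print(digits.data.shape)    # the same images, one 64-value row per sample

# Flattening every image row by row should reproduce digits.data.
flattened = digits.images.reshape((len(digits.images), -1))
print(np.array_equal(flattened, digits.data))
```

The classifier works on the flat rows in digits.data, while digits.images is convenient for plotting.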


Now that we have our data, we can carry out machine learning.

In [38]:
clf = svm.SVC(gamma=0.001, C=100)

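This creates a support vector classifier (SVC) and sets two of its parameters: gamma, which controls how far the influence of a single training example reaches, and C, which controls how strongly misclassified training examples are penalized.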
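To get a feeling for why gamma matters, the sketch below trains the same classifier twice and compares the accuracy on held-out images. The second gamma value (1.0), the 75/25 split, and the use of `sklearn.model_selection` (available in recent scikit-learn releases) are assumptions made for this illustration, not settings from the workshop.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Train one classifier per gamma value and score it on unseen images.
scores = {}
for gamma in (0.001, 1.0):
    model = svm.SVC(gamma=gamma, C=100).fit(X_train, y_train)
    scores[gamma] = model.score(X_test, y_test)
    print("gamma=%s: test accuracy %.3f" % (gamma, scores[gamma]))
```

A gamma that is too large makes each training example influence only its immediate neighbourhood, so the model memorizes the training set and generalizes poorly.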

With that done, we are ready to train. We first assign the training data to X (uppercase) and the matching labels to y.

In [39]:
X,y = digits.data[:-10], digits.target[:-10]

This takes everything in the dataset except the last 10 data points for learning, so the last 10 points can be used for testing.

Next we will train the classifier using clf.fit().

In [40]:
clf.fit(X,y)

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Now that we have trained it, we can predict what the number would be!

In [41]:
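# predict() expects a 2-D array of samples, so reshape the single image into one row
print("Prediction is: %s" % clf.predict(digits.data[-6].reshape(1, -1)))

Prediction is: [4]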




It predicts the digit shown in the 6th-from-last image, which was not part of the training data.
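The same idea extends to all ten held-out images. The following is a minimal sketch that repeats the training step so it can run on its own; `accuracy_score` and the variable names `predicted` and `expected` are additions for illustration.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100)
clf.fit(digits.data[:-10], digits.target[:-10])

# digits.data[-10:] is already 2-D (10 rows of 64 pixels), so it can be
# passed to predict() directly.
predicted = clf.predict(digits.data[-10:])
expected = digits.target[-10:]

print("predicted:", predicted)
print("expected: ", expected)
print("accuracy: ", accuracy_score(expected, predicted))
```

Ten images are far too few for a reliable accuracy estimate, but comparing the two printed rows is a quick sanity check.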

To check whether our machine has predicted correctly, we are going to use matplotlib to plot the blurred image in question.

In [25]:
plt.imshow(digits.images[-6], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
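The introduction lists model persistence as a topic. One common way to keep a trained classifier around is Python's built-in pickle module; the sketch below is an assumption about what that step could look like, and it retrains the classifier so that it can run on its own.

```python
import pickle

from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100)
clf.fit(digits.data[:-10], digits.target[:-10])

# Serialize the trained classifier to a byte string and load it back.
saved = pickle.dumps(clf)
restored = pickle.loads(saved)

# The restored model should make the same predictions as the original.
original_predictions = clf.predict(digits.data[-10:])
restored_predictions = restored.predict(digits.data[-10:])
print(original_predictions)
print(restored_predictions)
```

To keep the model between sessions, the same bytes can be written to a file opened in binary mode. Only unpickle files that come from a source you trust.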

## Face recognition example using eigenfaces and SVMs

Eigenfaces are a set of eigenvectors used in the computer vision problem of human face recognition. They are computed by applying principal component analysis (PCA) to a collection of face images.
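The same construction can be tried on the small digits dataset, which needs no download. This is a sketch for illustration only: the name `eigendigits` and the choice of 16 components are assumptions, and the faces example below uses its own settings.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()

# Keep the 16 directions along which the digit images vary the most.
pca = PCA(n_components=16).fit(digits.data)

# Each component is an eigenvector with one entry per pixel, so it can be
# reshaped back into an 8x8 image, just like an eigenface.
eigendigits = pca.components_.reshape((16, 8, 8))
print(eigendigits.shape)

# The components are orthonormal: their Gram matrix is the identity.
gram = pca.components_.dot(pca.components_.T)
print(np.allclose(gram, np.eye(16)))
```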

In [42]:
from __future__ import print_function
from functools import partial
from time import time
import logging
import matplotlib.pyplot as plt
# train_test_split and GridSearchCV live in sklearn.model_selection in
# current scikit-learn releases (formerly cross_validation and grid_search)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# RandomizedPCA was removed from scikit-learn; PCA with the randomized
# solver is its replacement, so keep the old name as an alias
RandomizedPCA = partial(PCA, svd_solver='randomized')


In [48]:
#Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')

In [49]:
#Download data
lfw_people = fetch_lfw_people(min_faces_per_person=150, resize=0.4)

In [50]:
#introspect the images arrays to find the shapes for plotting
n_samples, h, w = lfw_people.images.shape

In [51]:
# For machine learning we use the flattened pixel data directly
# (relative pixel position information is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]

In [52]:
#label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

In [53]:
print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)

Total dataset size:
n_samples: 766
n_features: 1850
n_classes: 2


In [54]:
# Split at random into a training set (75%) and a test set (25%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
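The call above splits the samples at random, so the class balance of the two parts can drift from that of the full dataset. If a stratified split is wanted, train_test_split accepts a stratify argument in recent scikit-learn releases. A minimal sketch on made-up labels (all names ending in `_demo`, `_tr` and `_te` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
X_demo = np.arange(200).reshape((100, 2))
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

# With stratify, both parts keep the 80/20 class ratio of the full dataset.
print(np.bincount(y_tr))
print(np.bincount(y_te))
```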

In [55]:
# Compute a PCA (eigenfaces) on the face dataset
# (treated as an unlabeled dataset): unsupervised feature
# extraction / dimensionality reduction

n_components = 300

In [56]:
print("Extracting the top %d eigenfaces from %d faces"
      % (n_components, X_train.shape[0]))

Extracting the top 300 eigenfaces from 574 faces


In [57]:
t0 = time()

In [58]:
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)

In [59]:
print("done in %0.3fs" % (time() - t0))

done in 13.660s


In [60]:
eigenfaces = pca.components_.reshape((n_components, h, w))

In [61]:
print("Projecting the input data on the eigenfaces orthonormal basis")

Projecting the input data on the eigenfaces orthonormal basis


In [62]:
t0 = time()

In [63]:
X_train_pca = pca.transform(X_train)

In [64]:
X_test_pca = pca.transform(X_test)
print("done in %0.3fs" % (time() - t0))

done in 14.645s


In [65]:
#Train an SVM classification model
print("Fitting the classifier to the training set")
t0 = time()

Fitting the classifier to the training set


In [66]:
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }

In [67]:
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)

In [68]:
#This is where the program actually learns from the datasets
clf = clf.fit(X_train_pca, y_train)

In [69]:
print("done in %0.3fs" % (time() - t0))
print("Best estimator found by grid search:")
print(clf.best_estimator_)

done in 29.733s
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0005, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
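Grid search simply tries every combination in the parameter grid with cross-validation and keeps the best one. A small, quick-running sketch of the same idea on the digits dataset follows; the subset size, the grid values, and cv=3 are assumptions chosen to keep it fast.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = datasets.load_digits()
# Use a subset so the search finishes quickly.
X_small, y_small = digits.data[:500], digits.target[:500]

# 3 values of C times 3 values of gamma gives 9 candidate models.
small_grid = {'C': [1, 10, 100], 'gamma': [0.0001, 0.001, 1.0]}
search = GridSearchCV(SVC(kernel='rbf'), small_grid, cv=3)
search.fit(X_small, y_small)

# best_params_ holds the combination with the highest mean
# cross-validated accuracy, and best_score_ holds that accuracy.
print(search.best_params_)
print(search.best_score_)
```

After fitting, the search object behaves like the best model it found, which is why predict can be called on it directly.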


In [70]:
#Quantitative evaluation of the model quality on the test set
print("Predicting people's names on the test set")
t0 = time()

Predicting people's names on the test set


In [71]:
y_pred = clf.predict(X_test_pca)

In [72]:
print("done in %0.3fs" % (time() - t0))

done in 14.029s


In [73]:
print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))

               precision    recall  f1-score   support

 Colin Powell       0.82      0.90      0.85        59
George W Bush       0.95      0.91      0.93       133

  avg / total       0.91      0.91      0.91       192

[[ 53   6]
 [ 12 121]]
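The numbers in the report can be recomputed by hand from the confusion matrix, assuming the usual scikit-learn layout in which rows are the true classes and columns are the predicted classes:

```python
import numpy as np

# Confusion matrix from the output above: rows are the true person,
# columns are the predicted person (Colin Powell, George W Bush).
cm = np.array([[53, 6],
               [12, 121]])

correct = np.diag(cm).astype(float)
precision = correct / cm.sum(axis=0)  # correct / all predicted as that person
recall = correct / cm.sum(axis=1)     # correct / all that truly are that person
accuracy = correct.sum() / cm.sum()

print(np.round(precision, 2))  # Powell 53/65, Bush 121/127
print(np.round(recall, 2))     # Powell 53/59, Bush 121/133
print(round(float(accuracy), 2))
```

The rounded values match the precision and recall columns of the report: 0.82 and 0.95 for precision, 0.90 and 0.91 for recall.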


In [74]:
#Qualitative evaluation of the predictions using matplotlib

def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

In [75]:
#Build the plot title (predicted vs. true last name) for test sample i
def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)
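The rsplit call in the title function keeps only the last name so that the plot titles stay short. A quick illustration with one of the names from the report (the variable names are made up):

```python
full_name = 'George W Bush'

# rsplit(' ', 1) splits once, starting from the right, and [-1] keeps the
# part after that last space.
parts = full_name.rsplit(' ', 1)
print(parts)      # ['George W', 'Bush']
print(parts[-1])  # Bush
```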

In [76]:
prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]

In [77]:
plot_gallery(X_test, prediction_titles, h, w)

In [78]:
# plot the gallery of the most significant eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

In [79]:
plt.show()