**Name:** \_\_\_\_\_

**EID:** \_\_\_\_\_

# CS4487 - Tutorial 10
## Stochastic Gradient Descent

In this tutorial you will use stochastic gradient descent to train classifiers quickly.

First we need to initialize Python.  Run the cell below.

In [1]:
%matplotlib inline
import IPython.core.display         
# setup output image format (Chrome works best)
IPython.core.display.set_matplotlib_formats("svg")
import matplotlib.pyplot as plt
import matplotlib
from numpy import *
from sklearn import *
import glob
import os
import IPython.utils.warn as warn
import cPickle, gzip, numpy
import time

random.seed(100)
rbow = plt.get_cmap('rainbow')



We will use a larger version of the MNIST digits dataset.  Download "mnist.pkl.gz" from Canvas and put it in the same directory as this ipynb file. The training set has 50,000 images and the test set has 10,000 images.

In [2]:
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()

trainX,trainY = train_set
valX,valY = valid_set
testX,testY = test_set

print trainX.shape
print testX.shape

(50000, 784)
(10000, 784)
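
The loading cell above is Python 2 code (`cPickle`). If you are running Python 3, a loader along the following lines may work instead. This is only a sketch: the helper name `load_pickle_gz` is hypothetical, and `encoding='latin1'` assumes the file was pickled under Python 2 with NumPy arrays inside.

```python
import gzip
import pickle
import sys

def load_pickle_gz(path):
    """Load a gzip-compressed pickle on either Python 2 or Python 3."""
    with gzip.open(path, 'rb') as f:
        if sys.version_info[0] >= 3:
            # assumption: the file was written by Python 2, so byte strings
            # inside it need an explicit encoding when read by Python 3
            return pickle.load(f, encoding='latin1')
        return pickle.load(f)
```

With the file in place, `train_set, valid_set, test_set = load_pickle_gz('mnist.pkl.gz')` would replace the `gzip.open` / `cPickle.load` lines.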


Now we will train a linear SVM using the standard algorithm, and time how long it takes.  Run the code below.  It may take a few minutes to finish.

In [3]:
starttime = time.clock()
clfo = svm.LinearSVC(C=1.)
clfo.fit(trainX, trainY)
print "elapsed time (sec):", time.clock() - starttime

elapsed time (sec): 99.210545


Here are the training and test accuracies.

In [4]:
Ypred = clfo.predict(trainX)
trainacc_svm = metrics.accuracy_score(Ypred, trainY)

Ypred = clfo.predict(testX)
testacc_svm = metrics.accuracy_score(Ypred, testY)
print "SVM accuracies:", trainacc_svm, testacc_svm

SVM accuracies: 0.92626 0.9158


## SGD Classifier
Now train an SGD classifier using the SVM (hinge) loss and an L2 penalty.  Measure how long it takes to fit the classifier (use the `fit` function), then calculate the training and test accuracy of the SGD classifier.  Use `alpha=0.1`.  Remember, alpha = 1/C.
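
To see what SGD with the SVM loss and L2 penalty does at each step, here is a minimal sketch in plain Python for a two-class problem with labels in {-1, +1}. The function names `sgd_hinge_l2` and `predict`, the fixed step size `eta`, and the toy data are assumptions made for illustration. `SGDClassifier` itself uses a decaying step size by default and handles the ten digit classes one-vs-rest, so this is not its actual implementation.

```python
import random

def sgd_hinge_l2(X, y, alpha=0.1, n_epochs=5, eta=0.1, seed=0):
    """SGD on  alpha/2 * ||w||^2 + mean_i max(0, 1 - y_i * (w.x_i + b)).

    X is a list of feature lists, y holds labels in {-1, +1}.
    eta is a fixed step size (an assumption made to keep the sketch short).
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(n_epochs):
        rng.shuffle(order)                    # visit samples in random order
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # the L2 penalty shrinks the weights on every step
            w = [wj * (1.0 - eta * alpha) for wj in w]
            if margin < 1.0:                  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def predict(w, b, x):
    """Sign of the decision value w.x + b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# toy linearly separable data (hypothetical, for illustration only)
X = [[2.0, 1.0], [1.5, 2.0], [3.0, 2.5], [-2.0, -1.0], [-1.5, -2.5], [-3.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = sgd_hinge_l2(X, y)
```

Each update touches one sample only, which is why a pass over the data is cheap. A larger `alpha` shrinks the weights harder on every step.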

In [5]:
### INSERT YOUR CODE HERE

In [7]:
clf = linear_model.SGDClassifier(
    loss='hinge',  # SVM loss (change to 'log' for logistic regression)
    penalty='l2',  # standard penalty (change to 'l1' for feature selection)
    alpha=0.1,     # penalty parameter: C=1/alpha 
    n_iter=5,
    average=True)  # use a running average for classifier weights
                   #   makes classifier more stable between batches

starttime = time.clock()
clf.fit(trainX, trainY)
print "elapsed time (sec):", time.clock() - starttime


elapsed time (sec): 5.322864


In [9]:
Ypred = clf.predict(trainX)
trainacc_sgd = metrics.accuracy_score(Ypred, trainY)

Ypred = clf.predict(testX)
testacc_sgd = metrics.accuracy_score(Ypred, testY)
print trainacc_sgd, testacc_sgd

0.86476 0.8754


_How do the speed and the accuracy compare with the original SVM?_
- **INSERT YOUR ANSWER HERE**
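- Much faster to fit (5.3 s vs. 99.2 s, roughly 19 times), but accuracy drops by about 4 to 6 percentage points (test 0.8754 vs. 0.9158; train 0.8648 vs. 0.9263).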

## Parallel SGD Classifier
Now train a parallel SGD classifier using IPython clusters, and measure the fitting time.  Use the same value for alpha as your SGD Classifier. Try different batch sizes (B) and numbers of processes (K).  Calculate the training and test accuracy.

First start the IPython cluster using the "IPython Clusters" tab in Jupyter.  If the tab says "Clusters tab now provided by IPython parallel", then run `ipcluster nbextension enable` to enable it.  Alternatively, you can run `ipcluster start -n 4` on the command line to directly start 4 engines.

In [10]:
# load the client interface
import ipyparallel

clients = ipyparallel.Client()
clients.block = True   # wait for calculations to finish
print clients.ids      # engine ids

# get the load-balanced scheduler
lview = clients.load_balanced_view()

[0, 1, 2, 3]


In [11]:
%%px
# load libraries on all clients
from numpy import *
from sklearn import *

In [12]:
### INSERT YOUR CODE HERE

In [13]:
def par_sgd(data, param):
    # run SGD on a dataset
    clf = linear_model.SGDClassifier(
        loss='hinge', 
        penalty='l2',
        alpha=param['alpha'],
        average=False)  # don't use averaging, since we will do it later
    clf.fit(data['trainX'], data['trainY'])
    return clf

def combine_sgd(clfs):
    # combine sgd classifiers
    
    # make a copy of the first one
    import copy
    clfout = copy.deepcopy(clfs[0])
    K = len(clfs)

    # add all the remaining ones to it
    for i in range(1,K):
        clfout.coef_ += clfs[i].coef_
        clfout.intercept_ += clfs[i].intercept_

    # take the average
    clfout.coef_ /= K
    clfout.intercept_ /= K

    return clfout
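
Averaging `coef_` and `intercept_` as `combine_sgd` does is valid because the classifiers are linear: the decision value of the averaged classifier equals the average of the individual decision values. A small sketch with made-up numbers (the three weight/intercept pairs and the helper `score` are hypothetical, and a two-class classifier is used for brevity):

```python
def score(w, b, x):
    """Decision value of a linear classifier: w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# three hypothetical binary classifiers (weights, intercept), e.g. one per batch
clfs_toy = [([1.0, -2.0], 0.5), ([0.5, -1.0], -0.25), ([1.5, 0.0], 0.25)]
K_toy = len(clfs_toy)

# average the parameters, as combine_sgd does with coef_ and intercept_
w_avg = [sum(w[j] for w, _ in clfs_toy) / K_toy for j in range(2)]
b_avg = sum(b for _, b in clfs_toy) / K_toy

x = [2.0, 1.0]
mean_of_scores = sum(score(w, b, x) for w, b in clfs_toy) / K_toy
score_of_mean = score(w_avg, b_avg, x)
# the two values agree up to floating-point rounding
```

For the ten-class digit problem the same argument applies to each row of `coef_` separately.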

In [14]:
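param = {'alpha': 0.01}  # note: smaller than the alpha=0.1 used for the SGD classifier above

K = 20           # number of batches (one SGD task per batch)
N = len(trainX)  # dataset size
B = int(0.05*N)  # batch size (5% of the training set)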

random.seed(612)

# draw K random batches of B samples each (batches may overlap)
starttime = time.clock()
data_batches = []
for i in range(K):
    rp = random.permutation(N)
    idx = rp[:B]   # first B indices of a fresh permutation
    data_batches.append({'trainX': trainX[idx], 'trainY': trainY[idx]})
print "elapsed time (sec):", time.clock() - starttime

# run par_sgd on each batch of data
lview.block = True
starttime = time.clock()
clfs = lview.map(par_sgd, data_batches, [param]*K)

# combine classifiers
clf = combine_sgd(clfs)
print "elapsed time (sec):", time.clock() - starttime

# without load-balanced view (for testing)
#clfs = map(par_sgd, data_batches, [param]*K)

# training and test accuracy
Ypred = clf.predict(trainX)
trainacc_psgd = metrics.accuracy_score(Ypred, trainY)

Ypred = clf.predict(testX)
testacc_psgd = metrics.accuracy_score(Ypred, testY)

print "psgd accuracy:", trainacc_psgd, testacc_psgd

# compare with individual classifiers
#errs = []
#for myclf in clfs:
#    mypred = myclf.predict(trainX)
#    errs.append( mean(mypred != trainY) )
#
#    
#print "individual clf errors:", errs


elapsed time (sec): 0.312453
elapsed time (sec): 0.836902
psgd accuracy: 0.89064 0.8984
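
A caution about the timings: the cells above use `time.clock()`. Under Python 2 on Unix this reports CPU time of the notebook process only, so work done inside the engine processes is not counted, and the parallel figure may understate how long you actually waited. A minimal sketch of a wall-clock alternative follows. The helper name `timed` is hypothetical, and `timeit.default_timer` is used because it is a wall-clock timer on both Python 2 and 3.

```python
import time
from timeit import default_timer   # wall-clock timer on Python 2 and 3

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = default_timer()
    result = fn(*args, **kwargs)
    return result, default_timer() - start

# example: time a short sleep, which uses almost no CPU but takes real time
_, elapsed = timed(time.sleep, 0.05)
```

For the parallel cell, something like `clfs, elapsed = timed(lview.map, par_sgd, data_batches, [param]*K)` would capture the time spent waiting for the engines.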


_How do the speed and the accuracy compare with the original SVM?_
- **INSERT YOUR ANSWER HERE**
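- Faster still (0.84 s to fit and combine, plus 0.31 s to draw the batches, vs. 99.2 s), and accuracy drops by only about 2 to 4 percentage points (test 0.8984 vs. 0.9158; train 0.8906 vs. 0.9263).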