# Training a Classifier Using Gradients

By completing this notebook, you should get a good sense of how to use a gradient-based method to train a classifier. We're going to build modules for linear transformations of data, componentwise application of the "sigmoid" function to a vector, and the mean-squared error loss. Once we have these modules, we will combine them to create a logistic regression classifier, which we will train on some randomly generated data to see how training changes the decisions the classifier makes.

In [None]:
%pylab inline
import numpy as np
from sklearn import datasets

## Modules

The `Linear` module computes the transformation $f(x,W,b) = Wx + b$ for a given input $x$ and parameters $W$ and $b$. The main goal of "training" our classifier will be to move from the randomly initialized $W$ and $b$ that we get by instantiating a module to the "best" $W$ and $b$ for a given data set.

The `__init__` method should take an input dimension `n_in` and an output dimension `n_out`, and should create the following instance variables:

1. The `n_out` by `n_in` matrix $W$.
2. The `n_out` vector $b$.
3. The `n_out` vector `output`, representing the output of the module.
4. The `n_in` vector `gradInput`, representing $\frac{\partial L}{\partial x}$.
5. The `n_out` by `n_in` matrix `gradW`, representing $\frac{\partial L}{\partial W}$.
6. The `n_out` vector `gradb`, representing $\frac{\partial L}{\partial b}$.

The matrix $W$ and vector $b$ should be initialized randomly. The other objects can be initialized to arrays of the correct shape filled with zeros.
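
As a rough illustration of the shapes listed above, the following sketch builds the six arrays as standalone variables rather than as instance variables of a class. The dimensions and the `0.1` scale on the random values are assumptions made for the example, not requirements of the assignment.

```python
import numpy as np

n_in, n_out = 3, 2  # example dimensions, chosen arbitrarily

# Randomly initialized parameters (the 0.1 scale is an assumption)
W = 0.1 * np.random.randn(n_out, n_in)
b = 0.1 * np.random.randn(n_out)

# Everything else starts as zeros of the matching shape
output = np.zeros(n_out)
gradInput = np.zeros(n_in)
gradW = np.zeros((n_out, n_in))
gradb = np.zeros(n_out)
```

Note that `gradW` has the same shape as `W` and `gradb` the same shape as `b`, which is what lets the gradient descent update subtract one from the other directly.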

The `updateOutput` function should update the value of the `output` variable and should return that `output` value.

The `updateGradInput` function should update the `gradInput` variable and return the new value of `gradInput`.

The `updateGradWeight` function should update _both_ `gradW` and `gradb` and should return a tuple consisting of `gradW` and `gradb`.

In case you're rusty on your chain rule, recall that, writing $y = Wx + b$ for the output of the module,

$$\frac{\partial L}{\partial x} = W^T \frac{\partial L}{\partial y}$$

and

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} x^T$$

and

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}.$$
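
One way to convince yourself that these formulas are right (and, later, that your implementation matches them) is a finite-difference check. The sketch below is only an illustration under stated assumptions: it uses a made-up loss $L = g \cdot (Wx + b)$ for a fixed random vector `g`, so that $\frac{\partial L}{\partial y} = g$ exactly, and the names `toy_loss`, `g`, and `eps` are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_out = 3, 2
W = rng.randn(n_out, n_in)
b = rng.randn(n_out)
x = rng.randn(n_in)
g = rng.randn(n_out)  # stands in for dL/dy

def toy_loss(W, b, x):
    # Hypothetical loss L = g . (Wx + b), chosen so that dL/dy is exactly g
    return g.dot(W.dot(x) + b)

# Analytic gradients from the chain-rule formulas
grad_x = W.T.dot(g)      # dL/dx = W^T dL/dy
grad_W = np.outer(g, x)  # dL/dW = dL/dy x^T
grad_b = g.copy()        # dL/db = dL/dy

# Central finite differences for dL/dx and dL/dW
eps = 1e-6
num_grad_x = np.zeros(n_in)
for i in range(n_in):
    step = np.zeros(n_in)
    step[i] = eps
    num_grad_x[i] = (toy_loss(W, b, x + step) - toy_loss(W, b, x - step)) / (2 * eps)

num_grad_W = np.zeros_like(W)
for i in range(n_out):
    for j in range(n_in):
        step = np.zeros_like(W)
        step[i, j] = eps
        num_grad_W[i, j] = (toy_loss(W + step, b, x) - toy_loss(W - step, b, x)) / (2 * eps)

print(np.allclose(grad_x, num_grad_x, atol=1e-5))
print(np.allclose(grad_W, num_grad_W, atol=1e-5))
```

The same pattern can be pointed at your own `Linear` module once it is written, by comparing the values it stores in `gradInput` and `gradW` against numerical estimates.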

In [None]:
class Linear(object):
    def __init__(self, n_in, n_out):
        # TODO
        pass
        
    def updateOutput(self, input):
        # TODO
        pass
    
    def updateGradInput(self, gradOutput):
        # TODO
        pass
    
    def updateGradWeight(self, gradOutput):
        # TODO
        pass

The `Sigmoid` module will be responsible for calculating $\sigma(x) = \frac{1}{1 + \exp(-x)}$. Since there are no tunable parameters to this module, we only have to implement `updateOutput` and `updateGradInput` here.

The `__init__` function takes an integer size parameter (the number of dimensions of the input) and should create two member variables:

1. An `output` variable which represents the output of the module.
2. A `gradInput` variable which represents $\frac{\partial L}{\partial x}$.

The `updateOutput` function should compute the output of the module, store that in the `output` member variable, and return that value.

The `updateGradInput` function should update the `gradInput` member variable and return that value.  

Remember that we have the following identity:

$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} * \sigma(x) * (1 - \sigma(x))$$
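
As a quick sanity check of this identity, the sketch below compares it against a central finite-difference estimate of the derivative of $\sigma$. The input values and the vector standing in for $\frac{\partial L}{\partial y}$ are arbitrary, and the multiplications are componentwise.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
grad_output = np.array([1.0, -0.5, 2.0])  # stands in for dL/dy

# Identity from the text, applied componentwise
s = sigmoid(x)
grad_input = grad_output * s * (1.0 - s)

# Numerical estimate of the same quantity
eps = 1e-6
numeric = grad_output * (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.allclose(grad_input, numeric, atol=1e-6))
```

Because the identity only needs $\sigma(x)$, a module can reuse the `output` value it already stored during `updateOutput` instead of recomputing the exponential.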

In [None]:
class Sigmoid(object):
    def __init__(self, n_in):
        # TODO
        pass
    
    def updateOutput(self, input):
        # TODO
        pass
    
    def updateGradInput(self, gradOutput):
        # TODO
        pass

To estimate our performance and perform training, we will use a loss function based on the "mean squared error". For a model output $y$ and known correct answer $t$ (both vectors in some finite-dimensional vector space $\mathbb{R}^n$), this loss can be computed as

$$ L(y,t) = \frac{1}{n} \sum_{i=1}^n (y_i - t_i)^2 $$

Your module's `__init__` function should create the following values:

1. An `output` variable, initialized to some reasonable scalar value (like 0.0).
2. A `gradOutput` variable, which should be an `ndarray` of size $n$ (the number of dimensions of the input vector).

The `updateOutput` function should accept two arguments: `input` is the value produced by your model, and `target` is the known correct value corresponding to the model's input. The `updateOutput` function should produce the error obtained for `input` and `target`. This will be a scalar value.

The `updateGradInput` function should calculate $\frac{\partial L}{\partial y}$ using the following relation:

$$ \frac{\partial L}{\partial y} = \frac{2}{n} * (y - t) $$
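
To make the two formulas concrete, here is a small worked example on hand-picked vectors with $n = 3$. The values of `y` and `t` are made up for illustration.

```python
import numpy as np

y = np.array([0.9, 0.2, 0.4])  # model output (made-up values)
t = np.array([1.0, 0.0, 0.0])  # known correct answer
n = y.size

# L(y, t) = (1/n) * sum_i (y_i - t_i)^2
loss = np.sum((y - t) ** 2) / n  # (0.01 + 0.04 + 0.16) / 3, about 0.07

# dL/dy = (2/n) * (y - t), one entry per component of y
grad = 2.0 / n * (y - t)

print(loss)
print(grad)
```

The loss is a scalar, while the gradient has the same shape as `y`: each entry says how the loss changes when the corresponding output component moves.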

In [None]:
class MSELoss(object):
    def __init__(self, n_in):
        # TODO
        pass
        
    def updateOutput(self, input, target):
        # TODO
        pass
    
    def updateGradInput(self, input, target):
        # TODO
        pass

The `Classifier` object links together the modules defined above for the purposes of classification and training.

The `__init__` function should take an integer `n_in` giving the number of dimensions of an input point. The `__init__` function should create member variables for a `Linear` module, a `Sigmoid` module, and an `MSELoss` module, all with the appropriate dimensions.

The `forward` function should accept an input that is to be classified, and optionally a `target` argument. If there is no target, the `forward` function should just return the result of the classifier's `Sigmoid` module. If there is a `target` argument, the `forward` function should call the `updateOutput` function of the classifier's `MSELoss` before returning the output of the `Sigmoid` module.

The `backward` function should take an `input` and the corresponding `target`, and should propagate derivatives backwards through the model to obtain $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$. The `backward` function should return a tuple whose components are those derivatives.

The `predict` function should take an `input` argument and should produce a label for that point. The resulting label should be either 0 or 1.
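
The text does not say how to turn the sigmoid output into a label. One common choice, assumed here rather than prescribed, is to threshold at 0.5. The helper name `label_from_probability` is hypothetical and is not part of the `Classifier` interface.

```python
import numpy as np

def label_from_probability(p, threshold=0.5):
    # Hypothetical helper: map a sigmoid output in (0, 1) to a 0/1 label.
    # Accepts either a scalar or a length-1 array.
    return int(np.ravel(p)[0] >= threshold)

print(label_from_probability(np.array([0.73])))
print(label_from_probability(0.2))
```

Since $\sigma(z) \geq 0.5$ exactly when $z \geq 0$, thresholding at 0.5 corresponds to checking the sign of $Wx + b$, which is why the decision boundary in the plane is a straight line.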

In [None]:
class Classifier(object):
    def __init__(self, n_in):
        # TODO
        pass
        
    def forward(self, input, target=None):
        # TODO
        pass
    
    def backward(self, input, target):
        # Assumes forward has already been called with input
        # TODO
        pass
    
    def predict(self, input):
        # TODO
        pass

To see if our system learns anything, we're going to do a 2D classification problem (just like kNN). The next cell generates and visualizes the data. 

In [None]:
(easy_x, easy_y) = datasets.make_classification(n_samples=400, n_features = 2, n_informative = 2,
                             n_redundant = 0, n_repeated = 0, n_clusters_per_class=1, class_sep=2)
scatter(easy_x[:,0], easy_x[:,1], c = easy_y, cmap = 'cool')

Since we're doing a classification problem in the plane, the next line should produce exactly what we need.

In [None]:
classifier = Classifier(2)

The next cell shows what an untrained classifier does. Unless you happen to get lucky, you should see no correlation between the line drawn by your classifier and the distribution of points.

We also print out the $W$ and $b$ parameters for our model. These will be modified by training.

In [None]:
h = 0.05 # mesh size
x_min, x_max = easy_x[:, 0].min() - 1, easy_x[:,0].max() + 1
y_min, y_max = easy_x[:, 1].min() - 1, easy_x[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = []
for pt in np.c_[xx.ravel(), yy.ravel()]:
    Z.append(classifier.predict(pt))
Z = np.asarray(Z)
Z = Z.reshape(xx.shape)
figure()
pcolormesh(xx,yy, Z, cmap='cool')

print(classifier.linear.W, classifier.linear.b)

scatter(easy_x[:,0], easy_x[:,1], c=easy_y, cmap='cool')

The process of training should be the following:

- Loop over the entire dataset a fixed number of times (say, 1000). On each pass:
  - Loop over each point in the dataset. For a given point in the dataset (consisting of an input $x$ and a label $t$):
    - Run the classifier's `forward` method to compute an error for the given $(x,t)$ pair.
    - Then run the classifier's `backward` method to get the necessary gradients.
    - Then update the model parameters using the gradient descent update rule described in class. With step sizes (learning rates) $\alpha_W$ and $\alpha_b$, and with $W^{(k)}$ and $b^{(k)}$ denoting the parameters before the update, the update rules you should use are

  $$ W^{(k+1)} \leftarrow W^{(k)} - \alpha_W * \frac{\partial L}{\partial W} $$ 
  
  and

  $$ b^{(k+1)} \leftarrow b^{(k)} - \alpha_b * \frac{\partial L}{\partial b} $$
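
The steps above can be sketched on a tiny one-dimensional problem without using the module classes at all. This is a minimal illustration under stated assumptions: the four data points, the step sizes `alpha_W` and `alpha_b`, the zero initialization, and the number of passes are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny hand-made 1-D data set (hypothetical): negative inputs are class 0
xs = np.array([[-2.0], [-1.0], [1.0], [2.0]])
ts = np.array([[0.0], [0.0], [1.0], [1.0]])

W = np.zeros((1, 1))
b = np.zeros(1)
alpha_W = alpha_b = 0.5  # assumed step sizes

def mean_loss(W, b):
    # Average MSE over the data set for the current parameters
    return np.mean([(sigmoid(W.dot(x) + b) - t) ** 2 for x, t in zip(xs, ts)])

loss_before = mean_loss(W, b)
for epoch in range(100):
    for x, t in zip(xs, ts):
        y = sigmoid(W.dot(x) + b)        # forward pass
        grad_y = 2.0 * (y - t) / y.size  # dL/dy for the MSE loss
        grad_z = grad_y * y * (1.0 - y)  # back through the sigmoid
        grad_W = np.outer(grad_z, x)     # dL/dW
        grad_b = grad_z                  # dL/db
        W = W - alpha_W * grad_W         # gradient descent updates
        b = b - alpha_b * grad_b
loss_after = mean_loss(W, b)

print(loss_before, loss_after)
```

In the notebook itself, the forward and backward lines would be replaced by calls to your classifier's `forward` and `backward` methods, with the updates applied to the `W` and `b` stored in its `Linear` module.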

In [None]:
# training
for i in range(1000):
    for j in range(easy_x.shape[0]):
        # TODO
        pass

Once properly trained, the decision line determined by your model should "line up" correctly with your data, so we re-run the mesh visualization to see how we do. We will learn about quantifying our error later in the course.

In [None]:
h = 0.05 # mesh size
x_min, x_max = easy_x[:, 0].min() - 1, easy_x[:,0].max() + 1
y_min, y_max = easy_x[:, 1].min() - 1, easy_x[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = []
for pt in np.c_[xx.ravel(), yy.ravel()]:
    Z.append(classifier.predict(pt))
Z = np.asarray(Z)
Z = Z.reshape(xx.shape)
figure()
pcolormesh(xx,yy, Z, cmap='cool')

print(classifier.linear.W, classifier.linear.b)

scatter(easy_x[:,0], easy_x[:,1], c=easy_y, cmap='cool')