## Logistic Regression for the MNIST dataset

In this notebook we will continue using PyTorch. This time we will create a shallow model (logistic regression), together with the full training pipeline needed to train and evaluate it on the MNIST dataset.

In [49]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

Next, we will define some hyperparameters and configuration options.

In [50]:
# Set the random seed, so the experiment is reproducible
seed = 0
torch.manual_seed(seed)
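# For the moment, we will just train on CPU, so no cuda
use_cuda = False
# The device and the extra DataLoader arguments are used in later cells
device = torch.device("cuda" if use_cuda else "cpu")
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}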
# We use a batch size of 32 examples while training, and 1000 while testing
batch_size = 32
test_batch_size = 1000
# We will use SGD with a momentum term
momentum = 0.5
# The learning rate
lr = 0.01
# The number of epochs
epochs = 10
# The size of the input. MNIST are greyscale images, 28x28 pixels each
im_size = 28*28

Now, we are ready to define our PyTorch model. This is done by creating a class that extends nn.Module, with two methods:

* __init__ : used to define and initialize the parameters of the model. For the LR, this is just a weight matrix of size 28*28 x 10 (plus a bias vector of size 10), since we want to project to a 10-dimensional space to predict the digit. We could have defined this explicitly, but PyTorch provides nn.Linear, which is a shorthand for a linear projection.

* forward: used to define the computation in the model. We just want to apply the linear projection to the input, and then the softmax transformation (so that the output of the model can be interpreted as a probability distribution over the 10 digits).
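As a quick check of the parameter shapes described above, the following sketch inspects a linear layer with the same dimensions. Note that `nn.Linear` adds a bias vector by default and stores its weight as (out_features, in_features), so the matrix appears as 10 x 784. The variable names `lin` and `n_params` are only illustrative.

```python
import torch.nn as nn

im_size = 28 * 28
lin = nn.Linear(im_size, 10)

# PyTorch stores the weight as (out_features, in_features)
print(lin.weight.shape)  # torch.Size([10, 784])
print(lin.bias.shape)    # torch.Size([10])

# Total number of trainable parameters: 784 * 10 weights + 10 biases
n_params = sum(p.numel() for p in lin.parameters())
print(n_params)  # 7850
```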

Two things to note:

1. Since each image is a tensor of size 28x28, and the Linear operator expects a 1-d vector per example (since it computes Wx), we need to flatten the image using the method view (which acts as a reshape).
2. Since the loss function is nll_loss (see next cell), it expects its input to be in log space; that's why we use log_softmax instead of softmax (see https://pytorch.org/docs/stable/nn.functional.html#log-softmax).
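The two notes above can be checked on made-up data. This is a minimal sketch, assuming a fake minibatch of random tensors in place of real MNIST images: it shows the shape produced by view, and that nll_loss applied to log_softmax outputs gives the same value as `F.cross_entropy` applied to the raw scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
im_size = 28 * 28

# A fake minibatch of 4 greyscale "images": (batch, channels, height, width)
x = torch.randn(4, 1, 28, 28)
x_flat = x.view(-1, im_size)
print(x_flat.shape)  # torch.Size([4, 784])

# Fake scores (logits) for the 10 digits, and made-up target labels
logits = torch.randn(4, 10)
target = torch.tensor([3, 0, 7, 1])

log_probs = F.log_softmax(logits, dim=1)
# Exponentiating the log-probabilities gives a distribution: each row sums to 1
row_sums = log_probs.exp().sum(dim=1)

# nll_loss on log-probabilities equals cross_entropy on the raw logits
loss_a = F.nll_loss(log_probs, target)
loss_b = F.cross_entropy(logits, target)
print(torch.allclose(loss_a, loss_b))  # True
```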

In [51]:
class LogisticRegression(nn.Module):
    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.lin = nn.Linear(im_size, 10)

    def forward(self, x):
        x_flat = x.view(-1, im_size)
        return F.log_softmax(self.lin(x_flat), dim=1)

Next, we define the training loop.
For each minibatch (obtained using enumerate(train_loader); see later for how the dataset is loaded), we do the following:

1. Move x, y (data, target) to the GPU if necessary (using .to(device))
2. Reset the gradients to zero (from previous iterations) using .zero_grad()
3. Forward pass: we compute the predictions using model(data) (which is y = f(x) with f being the logistic regression)
4. Backward pass: we compute the gradients of the loss using .backward(), and then we apply one step of the optimizer, that is, performing $w \leftarrow w - lr \cdot \nabla_w \, loss(f(x), y)$ in the case of plain gradient descent.
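To make the update rule concrete, here is a small pure-Python sketch on a toy one-parameter loss $0.5 w^2$, whose gradient is simply $w$. It assumes the momentum form that optim.SGD uses with its default settings, $v \leftarrow \mu v + g$ followed by $w \leftarrow w - lr \cdot v$; setting the momentum to 0 recovers the plain update above. The helper name `sgd_momentum_step` is hypothetical.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.5):
    """One SGD step with momentum on a single scalar parameter."""
    v = momentum * v + grad   # update the velocity
    w = w - lr * v            # move against the velocity
    return w, v

w, v = 1.0, 0.0
history = []
for _ in range(2):
    grad = w                  # gradient of the toy loss 0.5 * w**2
    w, v = sgd_momentum_step(w, v, grad)
    history.append(w)

print(history)  # approximately [0.99, 0.9751]
```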

In [42]:
def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 10 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

We can also create a function to evaluate our model. It is very similar to the previous one; the main differences are:

1. We use torch.no_grad() to disable gradient tracking, since here we only evaluate the model and never update its weights.
2. Once we have the predicted distribution, we use .argmax to compute the most probable digit. We also count the correct predictions to compute the accuracy of our model.
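The two points above can be tried in isolation. This sketch uses hypothetical model outputs for 3 examples and only 4 classes (instead of 10 digits) to keep the numbers readable, and then shows that results computed inside torch.no_grad() do not track gradients.

```python
import torch

# Hypothetical model outputs (log-probabilities) for 3 examples and 4 classes
output = torch.log(torch.tensor([[0.1, 0.7, 0.1, 0.1],
                                 [0.6, 0.2, 0.1, 0.1],
                                 [0.2, 0.2, 0.5, 0.1]]))
target = torch.tensor([1, 0, 3])

pred = output.argmax(dim=1, keepdim=True)              # shape (3, 1)
correct = pred.eq(target.view_as(pred)).sum().item()   # number of matches
accuracy = 100. * correct / len(target)
print(pred.squeeze(1).tolist(), correct)  # [1, 0, 2] 2

# Inside torch.no_grad(), results do not track gradients
w = torch.ones(2, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()
print(y.requires_grad)  # False
```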

In [53]:
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

Now, we are going to load the dataset. Since the MNIST dataset is very popular, PyTorch (through torchvision) already includes it as datasets.MNIST, and will automatically download it the first time. We can also specify some transformations, like converting the image to a tensor and normalizing the pixels.
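To see what the normalization does to a single pixel, here is a plain-Python sketch. It assumes that ToTensor has already scaled pixels to the range [0, 1], and uses the constants 0.1307 and 0.3081 from the loader cell, which are commonly quoted as the mean and standard deviation of the MNIST training pixels. The helper `normalize` is hypothetical and only mirrors the arithmetic.

```python
def normalize(pixel, mean=0.1307, std=0.3081):
    """Map a pixel in [0, 1] to (pixel - mean) / std."""
    return (pixel - mean) / std

black = normalize(0.0)   # darkest pixel after ToTensor
white = normalize(1.0)   # brightest pixel after ToTensor
print(round(black, 4), round(white, 4))  # -0.4242 2.8215
```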

Note that we create two data loaders: one for the training set and another one for the testing set. Since this is just a quick experiment, we are not going to validate the initial hyperparameters, so we don't create another data loader for a validation set.

The main benefit of the DataLoader class is that it creates an object which can be used in a for loop, as we did in the previous train function, to iterate over the minibatches.
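The iteration behaviour can be tried without downloading anything. The sketch below assumes a synthetic stand-in for MNIST (100 random tensors with random labels, wrapped in a TensorDataset) and lists the minibatch sizes that a loader with a batch size of 32 produces.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 100 random "images" with random labels
images = torch.randn(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# Each iteration yields one (data, target) minibatch
batch_sizes = [len(data) for data, target in loader]
print(len(loader), batch_sizes)  # 4 [32, 32, 32, 4]
```

The last minibatch is smaller because 100 is not a multiple of 32, which is why the evaluation function divides the summed loss by the dataset size rather than by the number of batches.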

In [45]:
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True, **kwargs)

test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=test_batch_size, shuffle=True, **kwargs)

Now, we are almost done! We just instantiate the model and specify which optimizer we are going to use: in this case, stochastic gradient descent (with momentum).

In [55]:
model = LogisticRegression().to(device)
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

And finally, a simple loop over the number of epochs. This should take a few minutes on a typical laptop CPU.

In [47]:
for epoch in range(1, epochs + 1):
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)

Test set: Average loss: 0.2997, Accuracy: 9142/10000 (91%)

Test set: Average loss: 0.2880, Accuracy: 9189/10000 (92%)

Test set: Average loss: 0.2820, Accuracy: 9187/10000 (92%)

Test set: Average loss: 0.2968, Accuracy: 9162/10000 (92%)

Test set: Average loss: 0.2784, Accuracy: 9216/10000 (92%)

Test set: Average loss: 0.2759, Accuracy: 9238/10000 (92%)

Test set: Average loss: 0.2719, Accuracy: 9214/10000 (92%)

Test set: Average loss: 0.2766, Accuracy: 9234/10000 (92%)

Test set: Average loss: 0.2749, Accuracy: 9215/10000 (92%)

Test set: Average loss: 0.2743, Accuracy: 9223/10000 (92%)

As our first attempt, we obtained about 92% accuracy, which is not bad, but soon we will see how to vastly improve on this result!