# Task Overview

In this task, your goal is to investigate how the level of noise in the data affects neural network training.
You should use an MLP architecture trained on the MNIST dataset (as in previous lab exercises).


We will experiment with two setups:
1. Pick X. Take X% of training examples and reassign their labels to random ones. Note that we don't change anything in the test set.
2. Pick X. During each training step, for each sample, change values of X% randomly selected pixels to random values. Note that we don't change anything in the test set.

For both setups, check the impact of various levels of noise (various values of X%) on model performance. Show plots comparing cross-entropy (log-loss) and accuracy with varying X%, and also comparing the two setups with each other.
Prepare a short report briefly explaining the results and observed trends. Consider questions like "why does accuracy/loss increase/decrease so quickly/slowly", "why is Z higher in setup 1/2", and any potentially surprising things you see on the charts.

### Potential questions, clarifications
* Q: Can I still use sigmoid/MSE loss?
  * A: You should train your network with softmax and cross-entropy loss (log-loss), especially since you should report cross-entropy loss.
* Q: When I pick X% of pixels/examples, does it have to be exactly X% or can it be X% in expectation?
  * A: It's fine either way.
* Q: When I randomize pixels, should I randomize them again each time a particular example is drawn (each training step/epoch) or only once before training?
  * A: Each training step/epoch.
* Q: When I randomize labels, should I randomize them again each time a particular example is drawn (each training step/epoch) or only once before training?
  * A: Only once before training.
* Q: What is the expected length of report/explanation?
  * A: There is no minimum/maximum, but between 5 (concise) and 20 sentences should be good. Don't forget about plots.
* Q: When I replace labels/pixels with random values, what random distribution should I use?
  * A: A distribution reasonably similar to the data. However, you don't need to match dataset's distribution exactly - approximation will be totally fine, especially if it's faster or easier to get.
* Q: Can I use something different than Colab/Jupyter Notebook? E.g. just Python files.
  * A: Yes, although a notebook is encouraged; please include both the code and a PDF in your solution.
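
The difference between "exactly X%" and "X% in expectation" from the questions above can be sketched in plain Python. This is only an illustration: the helper names `choose_in_expectation` and `choose_exactly` are made up for this example and are not part of the solution code below, which uses the "in expectation" variant.

```python
import random

def choose_in_expectation(n, frac, rng):
    """Select each index independently with probability frac."""
    return [i for i in range(n) if rng.random() < frac]

def choose_exactly(n, frac, rng):
    """Select exactly round(frac * n) distinct indices."""
    k = round(frac * n)
    return sorted(rng.sample(range(n), k))

rng = random.Random(0)
n, frac = 784, 0.25  # 784 pixels in a 28x28 image
approx = choose_in_expectation(n, frac, rng)
exact = choose_exactly(n, frac, rng)

# The first count fluctuates around 196; the second is always 196.
print(len(approx), len(exact))
```

The independent (Bernoulli) variant is usually easier to vectorize, which is presumably why either choice is accepted.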

# Model definition and training.

I will use the same MLP as in Lab 4, for the sake of consistency. The only changes are (a) an option to disable logging (by passing `log_interval <= 0`) and (b) returning the accuracy metric.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms
import pandas as pd
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import random

def set_seed(seed):
  torch.manual_seed(seed)
  random.seed(seed)
  np.random.seed(seed)

seed = 1
set_seed(seed)

In [None]:
batch_size = 256
epochs = 5
lr = 1e-2
use_cuda = True

In [None]:
class Net(nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    # After flattening an image of size 28x28 we have 784 inputs
    self.fc1 = nn.Linear(784, 128)
    self.fc2 = nn.Linear(128, 128)
    self.fc3 = nn.Linear(128, 10)

  def forward(self, x):
    x = torch.flatten(x, 1)
    x = self.fc1(x)
    x = F.relu(x)
    x = self.fc2(x)
    x = F.relu(x)
    x = self.fc3(x)
    output = F.log_softmax(x, dim=1)
    return output

In order to speed up the prototyping, I will use [a better version of the MNIST dataset class](https://github.com/y0ast/pytorch-snippets/tree/main/fast_mnist), along with [some modifications](https://gist.github.com/y0ast/f69966e308e549f013a92dc66debeeb4).

In [None]:
class FastMNIST(datasets.MNIST):
  def __init__(self, device, train=True, download=False):
    super().__init__("data", train=train, download=download)

    self.data = self.data.unsqueeze(1).float().div(255)
    self.data = self.data.sub_(0.1307).div_(0.3081)
    self.data, self.targets = self.data.to(device), self.targets.to(device)
  
  def __getitem__(self, index):
    img, target = self.data[index], self.targets[index]
    if self.transform:
      img = self.transform(img)
    return img, target

In [None]:
def train(model, train_dataset, optimizer):
    model.train()
    train_loss = 0
    correct = 0

    ds_data, ds_targets = train_dataset.data, train_dataset.targets
    perm = torch.randperm(ds_data.shape[0])
    ds_data, ds_targets = ds_data[perm], ds_targets[perm]

    if train_dataset.transform:
      ds_data = train_dataset.transform(ds_data)

    for batch_off in np.arange(0, len(ds_data), batch_size):
      data = ds_data[batch_off:batch_off+batch_size]
      target = ds_targets[batch_off:batch_off+batch_size]

      optimizer.zero_grad()
      output = model(data)
      loss = F.nll_loss(output, target)

      train_loss += loss.item() * data.shape[0]
      pred = output.argmax(dim=1, keepdim=True)
      correct += pred.eq(target.view_as(pred)).sum().item()

      loss.backward()
      optimizer.step()

    train_loss /= len(ds_data)
    train_acc = correct / len(ds_data)
    return train_loss, train_acc

def test(model, test_dataset):
    model.eval()
    test_loss = 0
    correct = 0

    data, target = test_dataset.data, test_dataset.targets
    with torch.no_grad():
      output = model(data)
      test_loss = F.nll_loss(output, target, reduction='sum').item()
      pred = output.argmax(dim=1, keepdim=True)
      correct = pred.eq(target.view_as(pred)).sum().item()
    
    test_loss /= len(data)
    test_acc = correct / len(data)

    return test_loss, test_acc

For the purposes of displaying the transformed data, I will also keep the regular versions of the datasets around.

In [None]:
default_transform = transforms.Compose([
  transforms.ToTensor(),
  transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST("data", train=True, transform=default_transform,
                              download=True)

test_dataset = datasets.MNIST("data", train=False, transform=default_transform)

The two experiments essentially modify the dataset objects in various ways, so I thought that the cleanest way of writing the code would be to provide these modifications as context manager objects (see `RandomizedLabelsContext` and `RandomizedPixelsContext` later for more details). Thanks to this, we need not worry about manually reverting the changes made for a particular experiment.
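
The idea of a temporary, self-reverting modification can be sketched on a toy object. This is a minimal illustration, not the code used below: the helper `temporarily_set` and the `toy_dataset` stand-in are hypothetical, and the sketch uses the generator-based `contextlib.contextmanager` form rather than the class-based `__enter__`/`__exit__` form used in this notebook.

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def temporarily_set(obj, attr, new_value):
    """Set obj.attr to new_value inside the block, restoring it afterwards."""
    original = getattr(obj, attr)
    setattr(obj, attr, new_value)
    try:
        yield obj
    finally:
        # Runs on normal exit and also when the block raises.
        setattr(obj, attr, original)

# A stand-in for a dataset object; only the attribute swap matters here.
toy_dataset = SimpleNamespace(targets=[3, 1, 4])

with temporarily_set(toy_dataset, "targets", [0, 0, 0]):
    inside = list(toy_dataset.targets)

after = list(toy_dataset.targets)
print(inside, after)  # [0, 0, 0] [3, 1, 4]
```

Either form guarantees that the original attribute is restored even if training fails halfway through.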

The function `run_experiment` will do the training for `epochs` epochs, and will return the train loss/acc and test loss/acc for each of them as a `pd.DataFrame`.

In [None]:
# Note: we pass the context as a factory, not the context object in itself
def run_experiment(make_ctx, train_dataset, test_dataset, device):
  results = []
  
  model = Net().to(device)
  optimizer = optim.Adam(model.parameters(), lr=lr)

  # Here we enter the context of a modified dataset
  with make_ctx():
    for epoch in range(1, epochs+1):
      train_loss, train_acc = train(model, train_dataset, optimizer)
      test_loss, test_acc = test(model, test_dataset)
      
      results.append({
          "Epoch": epoch,
          "Train loss": train_loss,
          "Train acc": train_acc,
          "Test loss": test_loss,
          "Test acc": test_acc
      })
    
  return model, pd.DataFrame.from_dict(results)

We will also add a helper procedure, which shall allow us to look at the modified images and labels.

In [None]:
def peek_at_images(dataset=train_dataset, nrows=5, ncols=5):
  image_indices = np.random.choice(len(dataset), size=(nrows, ncols),
                                   replace=False).reshape((nrows, ncols))

  titles = [None] * (nrows * ncols)
  for flat_idx, (row_idx, col_idx) in enumerate(np.ndindex(nrows, ncols)):
    image_idx = image_indices[row_idx,col_idx]
    image, pred = dataset[image_idx]
    titles[flat_idx] = str(pred)

  fig = make_subplots(rows=nrows, cols=ncols, subplot_titles=titles)

  for (row_idx, col_idx) in np.ndindex(nrows, ncols):
    image_idx = image_indices[row_idx,col_idx]
    image, pred = dataset[image_idx]
    image = torch.tile(image, (3, 1, 1))
    image = 255 * (image * 0.3081 + 0.1307)
    image = image.permute(1, 2, 0)
    trace = go.Image(z=image)
    fig.add_trace(trace, row=1+row_idx, col=1+col_idx)
  
  fig.update_xaxes(visible=False)
  fig.update_yaxes(visible=False)

  fig.show()

# Training models in setup 1: with randomized labels.

Let's start with the implementation of the context object: in it, we will replace (in expectation) `frac` of the labels with random values.

In [None]:
class RandomizedLabelsContext:
  def __init__(self, frac: float, dataset=train_dataset, 
               device=torch.device("cpu")):
    self.frac = frac
    self.dataset = dataset
    self.device = device
  
  def __enter__(self):
    # Here we save the original targets for us to restore at exit
    self.orig_targets = self.dataset.targets

    # We replace frac labels with random values
    pr_matrix = self.frac * torch.ones_like(self.dataset.targets).float()
    chosen = torch.bernoulli(pr_matrix).bool()
    rand_labels = torch.randint_like(self.dataset.targets, 10)
    chosen, rand_labels = chosen.to(self.device), rand_labels.to(self.device)
    self.dataset.targets = torch.where(chosen, rand_labels, self.dataset.targets)

    return self
  
  def __exit__(self, exc_type, exc_value, exc_tb):
    # Restore the original targets
    self.dataset.targets = self.orig_targets

Now, to check if it works, let's test it for some values of `frac`.

In [None]:
with RandomizedLabelsContext(0):
  peek_at_images()

In [None]:
with RandomizedLabelsContext(0.5):
  peek_at_images()

In [None]:
with RandomizedLabelsContext(1):
  peek_at_images()

It seems to work fine: we get progressively more incorrect labels as `frac` grows.
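
One detail worth keeping in mind: a label chosen for replacement is redrawn uniformly from all 10 classes, so it keeps its original value one time in ten. The fraction of labels that actually change is therefore about `0.9 * frac`, not `frac`. The following pure-Python sketch mimics the same procedure (a Bernoulli choice followed by a uniform draw); the helper `corrupt_labels` is hypothetical and only serves this check.

```python
import random

def corrupt_labels(labels, frac, num_classes, rng):
    """Replace each label, with probability frac, by a uniformly random class."""
    return [rng.randrange(num_classes) if rng.random() < frac else y
            for y in labels]

rng = random.Random(0)
labels = [rng.randrange(10) for _ in range(100_000)]
noisy = corrupt_labels(labels, frac=0.5, num_classes=10, rng=rng)

changed = sum(a != b for a, b in zip(labels, noisy)) / len(labels)
print(f"fraction actually changed: {changed:.3f}")  # close to 0.5 * 0.9 = 0.45
```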

# Training models in setup 2: with randomized pixels.

Now we shall replace `frac` of the pixels (again, in expectation) with random values. Assuming that the original pixel values are roughly normally distributed, and because we use `transforms.Normalize((0.1307,), (0.3081,))` (which gives the data approximately zero mean and unit standard deviation), a reasonable distribution from which to draw the random pixels is the normal distribution $\mathcal{N}(0, 1)$. As far as the implementation is concerned, the pixel replacement is done as a `torchvision` transform; inside the context, the transform of the train dataset is replaced by one to which the pixel replacement has been appended.
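
As a sanity check on this choice, the sketch below computes the range that normalized pixels can take and estimates how often a draw from $\mathcal{N}(0, 1)$ falls outside it (roughly a third of the time, mostly below the minimum). The optional clamping is an assumption of this example and is not applied in the notebook; since only an approximate match to the data distribution is required, the unclamped version should be acceptable.

```python
import random

MEAN, STD = 0.1307, 0.3081  # normalization constants used in the notebook

# Raw pixels lie in [0, 1], so normalized pixels lie in [low, high].
low = (0.0 - MEAN) / STD
high = (1.0 - MEAN) / STD
print(f"normalized pixel range: [{low:.3f}, {high:.3f}]")

def random_pixel(rng, clamp=False):
    """Draw one replacement pixel from N(0, 1), optionally clamped to [low, high]."""
    value = rng.gauss(0.0, 1.0)
    return min(max(value, low), high) if clamp else value

rng = random.Random(0)
draws = [random_pixel(rng) for _ in range(10_000)]
outside = sum(not (low <= v <= high) for v in draws) / len(draws)
print(f"fraction of N(0, 1) draws outside the range: {outside:.2f}")

clamped = [random_pixel(rng, clamp=True) for _ in range(1_000)]
```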

In [None]:
class RandomizedPixels:
  def __init__(self, frac: float, device):
    self.frac = frac
    self.device = device
  
  def __call__(self, image):
    chosen = torch.bernoulli(self.frac * torch.ones_like(image)).bool()
    rand_pixels = torch.normal(mean=0, std=1, size=image.shape)
    chosen, rand_pixels = chosen.to(self.device), rand_pixels.to(self.device)
    return torch.where(chosen, rand_pixels, image)

class RandomizedPixelsContext:
  def __init__(self, frac: float, dataset=train_dataset,
               device=torch.device("cpu")):
    self.frac = frac
    self.dataset = dataset
    self.device = device
  
  def __enter__(self):
    self.orig_transform = self.dataset.transform
    
    if self.dataset.transform is not None:
      self.dataset.transform = transforms.Compose([
        self.dataset.transform,
        RandomizedPixels(self.frac, self.device),
      ])
    else:
      self.dataset.transform = RandomizedPixels(self.frac, self.device)

    return self
  
  def __exit__(self, exc_type, exc_value, exc_tb):
    self.dataset.transform = self.orig_transform

As before, let's look at the changes done with varying levels of `frac`.

In [None]:
with RandomizedPixelsContext(0):
  peek_at_images()

In [None]:
with RandomizedPixelsContext(0.5):
  peek_at_images()

In [None]:
with RandomizedPixelsContext(0.7):
  peek_at_images()

In [None]:
with RandomizedPixelsContext(1):
  peek_at_images()

# Plots and report.

Let's start with evenly spaced values of `frac` (I *suspect* there are some special ranges which we will want to investigate, but for now let's not assume anything about what the results will look like).

In [None]:
use_cuda = use_cuda and torch.cuda.is_available()
print(f"Using CUDA: {use_cuda}")
device = torch.device("cuda" if use_cuda else "cpu")

Using CUDA: True


In [None]:
from timeit import default_timer as timer
from datetime import timedelta

def results_for(frac_values):
  all_results_df = pd.DataFrame()

  fast_train_dataset = FastMNIST(device, train=True, download=True)
  fast_test_dataset = FastMNIST(device, train=False, download=False)

  for experiment_name in ("Randomized labels", "Randomized pixels"):
    if experiment_name == "Randomized labels":
      ctx = RandomizedLabelsContext
    else:
      ctx = RandomizedPixelsContext
    
    make_ctx = lambda: ctx(frac, fast_train_dataset, device)
    
    for frac in frac_values:
      start = timer()
      _, results_df = run_experiment(make_ctx, fast_train_dataset, fast_test_dataset,
                                  device)
      end = timer()
      dur = timedelta(seconds=end-start)
      print(f"{experiment_name}, with frac = {frac}, in {dur}")

      results_df["Experiment"] = experiment_name
      results_df["Frac"] = frac
      # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
      all_results_df = pd.concat([all_results_df, results_df])
  
  return all_results_df
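
A note on `make_ctx` in `results_for`: the lambda mentions `frac` before the loop has assigned it. This works because Python looks up a closure variable when the function is called, not when it is defined. The toy sketch below demonstrates the behavior; the names `make_factories`, `factory` and `value` are illustrative only.

```python
def make_factories():
    # The lambda mentions `value` before the loop assigns it; that is fine,
    # because the name is only looked up when the lambda is called.
    factory = lambda: f"context(frac={value})"

    seen = []
    for value in (0.0, 0.5, 1.0):
        seen.append(factory())
    return factory, seen

factory, seen = make_factories()
print(seen)       # ['context(frac=0.0)', 'context(frac=0.5)', 'context(frac=1.0)']
print(factory())  # context(frac=1.0), the last value the loop assigned
```

If the factory had to be stored and called after the loop, binding the current value explicitly (for example with a default argument such as `lambda frac=frac: ...`) would be the safer pattern.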

In [None]:
frac_values = np.around(np.linspace(0, 1, 11), decimals=1)
all_results_df = results_for(frac_values)

Randomized labels, with frac = 0.0, in 0:00:02.600413
Randomized labels, with frac = 0.1, in 0:00:02.578623
Randomized labels, with frac = 0.2, in 0:00:02.583024
Randomized labels, with frac = 0.3, in 0:00:02.580737
Randomized labels, with frac = 0.4, in 0:00:02.571950
Randomized labels, with frac = 0.5, in 0:00:02.573543
Randomized labels, with frac = 0.6, in 0:00:02.566769
Randomized labels, with frac = 0.7, in 0:00:02.547805
Randomized labels, with frac = 0.8, in 0:00:02.566993
Randomized labels, with frac = 0.9, in 0:00:02.581072
Randomized labels, with frac = 1.0, in 0:00:02.579974
Randomized pixels, with frac = 0.0, in 0:00:04.818285
Randomized pixels, with frac = 0.1, in 0:00:04.728836
Randomized pixels, with frac = 0.2, in 0:00:04.738587
Randomized pixels, with frac = 0.3, in 0:00:04.787211
Randomized pixels, with frac = 0.4, in 0:00:04.791179
Randomized pixels, with frac = 0.5, in 0:00:04.775331
Randomized pixels, with frac = 0.6, in 0:00:04.749492
Randomized pixels, with frac

(I didn't measure precisely how much faster the training is than the ordinary version from Lab 4, but it is *much* faster: less than 5 seconds in each case.)

In [None]:
all_results_df.to_csv("all_results_df.csv")

In [None]:
all_results_df

| | Epoch | Train loss | Train acc | Test loss | Test acc | Experiment | Frac |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.264131 | 0.916100 | 0.158217 | 0.9511 | Randomized labels | 0.0 |
| 1 | 2 | 0.127277 | 0.960983 | 0.144722 | 0.9556 | Randomized labels | 0.0 |
| 2 | 3 | 0.106885 | 0.967733 | 0.124078 | 0.9651 | Randomized labels | 0.0 |
| 3 | 4 | 0.094409 | 0.971950 | 0.119594 | 0.9666 | Randomized labels | 0.0 |
| 4 | 5 | 0.091567 | 0.973133 | 0.152974 | 0.9636 | Randomized labels | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 0 | 1 | 2.303726 | 0.109883 | 2.301042 | 0.1134 | Randomized pixels | 1.0 |
| 1 | 2 | 2.301982 | 0.112283 | 2.301071 | 0.1135 | Randomized pixels | 1.0 |
| 2 | 3 | 2.301645 | 0.110917 | 2.301501 | 0.1135 | Randomized pixels | 1.0 |
| 3 | 4 | 2.301739 | 0.111083 | 2.301654 | 0.1135 | Randomized pixels | 1.0 |


Let's look at the test accuracy in the final epoch first.

In [None]:
import plotly.express as px
final_epoch_df = all_results_df[all_results_df["Epoch"] == all_results_df["Epoch"].max()]
px.line(final_epoch_df, x="Frac", y="Test acc", color="Experiment")

We can observe that even with a fairly high degree of randomization (such as `frac = 0.8`) we get fairly good results. In the case of randomized pixels this is somewhat understandable, since even we can still recognize the digits to some degree, as the following example demonstrates:

In [None]:
with RandomizedPixelsContext(0.6):
  peek_at_images()

As for the randomization of labels, I suspect that the model learns that, the randomization notwithstanding, the true label is observed more often (with probability `(1-frac) + frac/10`) than any incorrect label (with probability `frac/10` each), but that with enough noise it can no longer do so. (I suppose one could check this by running enough epochs and seeing whether the model eventually learns the bias, however slight.)
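
The probabilities in the paragraph above can be tabulated directly. The sketch below assumes 10 classes and the replacement scheme used in `RandomizedLabelsContext`; the function name `noisy_label_distribution` is made up for this illustration.

```python
def noisy_label_distribution(frac, num_classes=10):
    """Return (P(observed label is the true one), P(observed label is one given wrong class))."""
    wrong = frac / num_classes
    true = (1.0 - frac) + wrong
    return true, wrong

for frac in (0.0, 0.5, 0.8, 0.9, 1.0):
    true, wrong = noisy_label_distribution(frac)
    print(f"frac={frac:.1f}  P(true)={true:.2f}  "
          f"P(each wrong)={wrong:.2f}  margin={true - wrong:.2f}")
```

The margin between the true label and any single wrong label equals `1 - frac`, so the true label remains the most frequent one for every `frac < 1` and the signal disappears only at `frac = 1`.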

Let's now look at the cross-entropy loss plots, as requested.

In [None]:
px.line(final_epoch_df, x="Frac", y="Test loss", color="Experiment")

We can see that the variant with the randomized pixels has a consistently lower loss (which, I believe, we can interpret as a distance of sorts between the true and the learned distribution of labels). I don't really have any clever explanation for it at this moment, though.
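
One possible contributing factor, offered as an assumption rather than something verified in this notebook: a model trained on noisy labels that reproduces the noisy label distribution exactly would assign only `(1-frac) + frac/10` to the true class. Evaluated against clean test labels, this costs cross-entropy even when the argmax is still correct. The sketch below computes this idealized value; `clean_test_loss_if_calibrated` is a hypothetical helper.

```python
import math

def clean_test_loss_if_calibrated(frac, num_classes=10):
    """Cross-entropy on a clean label for a model that outputs the noisy-label
    distribution exactly (an idealized assumption, not a measured result)."""
    p_true = (1.0 - frac) + frac / num_classes
    return math.log(1.0 / p_true)

for frac in (0.0, 0.5, 0.8, 1.0):
    loss = clean_test_loss_if_calibrated(frac)
    print(f"frac={frac:.1f}  idealized test loss={loss:.3f}")

# With frac = 1 the model can do no better than a uniform guess,
# whose cross-entropy is ln(10).
print(f"ln(10) = {math.log(10):.4f}")
```

The value `ln(10) ≈ 2.3026` is also what the results table shows for a model that has effectively nothing to learn from (`frac = 1.0`, loss around 2.30).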

Let's now look at how quickly the models stabilize.

In [None]:
rand_labels_df = all_results_df[all_results_df["Experiment"] == "Randomized labels"]
px.line(rand_labels_df, x="Epoch", y="Train loss", color="Frac")

In [None]:
rand_pixels_df = all_results_df[all_results_df["Experiment"] == "Randomized pixels"]
px.line(rand_pixels_df, x="Epoch", y="Train loss", color="Frac")

It would probably be wise to also look at the loss within the first epoch, since the model seems to learn the distribution by the end of it, but I couldn't code this in time.
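
A minimal sketch of how such within-epoch logging could be organized is shown below. It is deliberately framework-agnostic: `step_fn` is a hypothetical callable standing in for one optimizer step (in this notebook it would wrap the body of the batch loop in `train` and return `loss.item()`), and the toy `fake_step` only imitates a decreasing loss.

```python
def run_epoch_logging_batches(batches, step_fn):
    """Run step_fn on every batch and record the loss after each step.

    step_fn is a hypothetical callable that performs one optimizer step on a
    batch and returns the loss as a float.
    """
    history = []
    for step, batch in enumerate(batches, start=1):
        loss = step_fn(batch)
        history.append({"Step": step, "Batch loss": loss})
    return history

# Toy stand-in for a training step: the "loss" just shrinks with each call.
state = {"loss": 2.3}

def fake_step(batch):
    state["loss"] *= 0.9
    return state["loss"]

history = run_epoch_logging_batches(range(5), fake_step)
print([round(row["Batch loss"], 3) for row in history])
```

The resulting list of rows could be turned into a `pd.DataFrame` and plotted with `px.line(..., x="Step", y="Batch loss")`, in the same way as the per-epoch results above.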

In [None]:
fast_train_dataset = FastMNIST(device, train=True, download=False)
fast_test_dataset = FastMNIST(device, train=False, download=False)
frac = 0.7
make_ctx = lambda: RandomizedPixelsContext(frac, fast_train_dataset, device)
trained_model, _ = run_experiment(make_ctx, fast_train_dataset, fast_test_dataset,
                                  device)
idx = np.random.choice(len(fast_test_dataset.data))
F.softmax(trained_model(fast_test_dataset.data[[idx]]), dim=1), fast_test_dataset.targets[[idx]]

(tensor([[3.8968e-08, 6.2470e-07, 1.8869e-05, 8.6307e-07, 9.9434e-01, 1.9110e-06,
          3.8354e-06, 2.4136e-04, 1.1867e-03, 4.2090e-03]], device='cuda:0',
        grad_fn=<SoftmaxBackward0>), tensor([4], device='cuda:0'))

In [None]:
px.line(final_epoch_df, x="Frac", y="Train loss", color="Experiment")