# Label and Data Noise in Machine Learning Datasets
This notebook gives a simple example of label noise and data noise in common machine learning datasets. <br/>
The walkthrough uses MNIST and covers three cases: (CASE A) no noise, (CASE B) symmetric label noise, and (CASE C) salt&pepper data noise.

In [None]:
# IMPORTS
# Standard library
import glob
import json
import os

# Third-party
import torch
import torch.nn as nn
import torchvision

# Project modules
from src import data_load, noise, train, utils, test

root = os.getcwd()
print(root)

## CASE A: No Noise
In this case no further noise is injected, so it is the default MNIST classification example. <br/>
First, the settings are read from a JSON file. <br/>
Second, given these settings, the data is loaded via data_load.py, which also handles label and data noise. <br/>
Third, a model is trained to classify MNIST and evaluated on the test set.
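
The settings step can be pictured with a minimal sketch. `utils.Params` is not shown in this notebook, so the `load_params` helper below is hypothetical; it only assumes that a settings file is a flat JSON object whose keys (such as `net` and `dataset_class_name`, both used later in this notebook) become attributes. The example values are made up for illustration.

```python
import json
import os
import tempfile
from types import SimpleNamespace


def load_params(json_path):
    """Hypothetical stand-in for utils.Params: expose JSON keys as attributes."""
    with open(json_path) as json_file:
        return SimpleNamespace(**json.load(json_file))


# Illustrative settings only; the real files under params/ may contain other keys.
example_settings = {"net": "resnet18", "dataset_class_name": "MNIST"}

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "no_noise.json")
    with open(example_path, "w") as json_file:
        json.dump(example_settings, json_file)
    example_params = load_params(example_path)

print(example_params.net, example_params.dataset_class_name)
```

Keeping one settings file per case makes it possible to switch between the noise scenarios without touching the training code.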

In [None]:
# Load the settings for MNIST and build a simple ResNet-18 model
dataset = 'MNIST'
setting = 'no_noise'
hp_file = os.path.join(root, 'params', dataset, setting + '.json')
with open(hp_file) as json_file:
    hp = json.load(json_file)
for k, v in hp.items():
    print(f"{k:21}: {v}")
params_a = utils.Params(hp_file)

if params_a.net == 'resnet18':
    model_a = torchvision.models.resnet18(num_classes=10)
    # MNIST images have one channel, so replace the default 3-channel input convolution
    model_a.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
else:
    raise ValueError(f"Unsupported network: {params_a.net}")

In [None]:
# Get dataloader
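if params_a.dataset_class_name == 'MNIST':
    # Workaround: LeCun's website currently returns an error, so fetch MNIST from a mirror.
    # -nc skips the download if MNIST.tar.gz is already present.
    !wget -nc www.di.ens.fr/~lelarge/MNIST.tar.gz
    !tar -zxvf MNIST.tar.gz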
train_loader_a, val_loader_a, test_loader_a = data_load.dataloader(params_a)

In [None]:
# Train a model
model_a = train.get_trained_model(params_a, model_a, train_loader_a, val_loader_a)

In [None]:
# Evaluate the trained model on the test set
acc_a = test.evaluate(params_a, model_a, test_loader_a)

## CASE B: Symmetric Label Noise
In this case symmetric label noise is injected: a fraction of the labels is reassigned uniformly at random across the classes. <br/>
First, the settings are read from a JSON file. <br/>
Second, given these settings, the data is loaded via data_load.py, which also handles label and data noise. <br/>
Third, a model is trained to classify MNIST and evaluated on the test set.
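
To make the idea concrete, here is a minimal, library-free sketch of symmetric label noise. It is not the implementation in `src/noise.py`, which is not shown here; the function name `add_symmetric_label_noise` and the parameter `noise_rate` are assumptions for illustration. This variant always flips to a different class, whereas other definitions also allow the original class to be redrawn.

```python
import random


def add_symmetric_label_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label with probability `noise_rate` to a different class.

    The replacement class is drawn uniformly from the other classes,
    so a flipped label never equals the original one.
    """
    rng = random.Random(seed)
    noisy_labels = []
    for label in labels:
        if rng.random() < noise_rate:
            other_classes = [c for c in range(num_classes) if c != label]
            noisy_labels.append(rng.choice(other_classes))
        else:
            noisy_labels.append(label)
    return noisy_labels


# Toy labels: 1000 samples spread evenly over the 10 MNIST classes.
clean_labels = [i % 10 for i in range(1000)]
noisy_labels = add_symmetric_label_noise(clean_labels, num_classes=10, noise_rate=0.2)
flipped = sum(c != n for c, n in zip(clean_labels, noisy_labels))
print(f"flipped {flipped} of {len(clean_labels)} labels")
```

Because every wrong class is equally likely, the noise carries no systematic bias towards particular class pairs.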

In [None]:
# Load the settings for MNIST and build a simple ResNet-18 model
dataset = 'MNIST'
setting = 'symmetric_noise'
hp_file = os.path.join(root, 'params', dataset, setting + '.json')
with open(hp_file) as json_file:
    hp = json.load(json_file)
for k, v in hp.items():
    print(f"{k:21}: {v}")
params_b = utils.Params(hp_file)

if params_b.net == 'resnet18':
    model_b = torchvision.models.resnet18(num_classes=10)
    # MNIST images have one channel, so replace the default 3-channel input convolution
    model_b.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
else:
    raise ValueError(f"Unsupported network: {params_b.net}")

In [None]:
# Get dataloader
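if params_b.dataset_class_name == 'MNIST':
    # Workaround: LeCun's website currently returns an error, so fetch MNIST from a mirror.
    # -nc skips the download if MNIST.tar.gz is already present.
    !wget -nc www.di.ens.fr/~lelarge/MNIST.tar.gz
    !tar -zxvf MNIST.tar.gz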
train_loader_b, val_loader_b, test_loader_b = data_load.dataloader(params_b)

In [None]:
# Train a model
model_b = train.get_trained_model(params_b, model_b, train_loader_b, val_loader_b)

In [None]:
# Evaluate the trained model on the test set
acc_b = test.evaluate(params_b, model_b, test_loader_b)

## CASE C: Salt&Pepper Data Noise
In this case salt&pepper data noise is injected: randomly selected pixels are set to the maximum (salt) or minimum (pepper) intensity. <br/>
First, the settings are read from a JSON file. <br/>
Second, given these settings, the data is loaded via data_load.py, which also handles label and data noise. <br/>
Third, a model is trained to classify MNIST and evaluated on the test set.
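
As an illustration, salt&pepper noise can be sketched in a few lines of plain Python. This is not the code in `src/noise.py`, which is not shown here; the function name `add_salt_pepper_noise`, the parameter `amount`, and the intensity range of 0.0 to 1.0 are assumptions made for this example.

```python
import random


def add_salt_pepper_noise(image, amount, low=0.0, high=1.0, seed=0):
    """Return a noisy copy of `image`, given as a list of rows of floats.

    Each pixel is corrupted independently with probability `amount`;
    a corrupted pixel becomes `high` (salt) or `low` (pepper) with
    equal probability. The input image is left unchanged.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        noisy_row = []
        for pixel in row:
            if rng.random() < amount:
                noisy_row.append(high if rng.random() < 0.5 else low)
            else:
                noisy_row.append(pixel)
        noisy.append(noisy_row)
    return noisy


# Uniform gray image with the MNIST resolution of 28x28 pixels.
gray_image = [[0.5] * 28 for _ in range(28)]
noisy_image = add_salt_pepper_noise(gray_image, amount=0.1)
corrupted = sum(p != 0.5 for row in noisy_image for p in row)
print(f"corrupted {corrupted} of {28 * 28} pixels")
```

Unlike label noise, this corruption leaves the targets intact and only degrades the inputs.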

In [None]:
# Load the settings for MNIST and build a simple ResNet-18 model
dataset = 'MNIST'
setting = 'salt_pepper_noise'
hp_file = os.path.join(root, 'params', dataset, setting + '.json')
with open(hp_file) as json_file:
    hp = json.load(json_file)
for k, v in hp.items():
    print(f"{k:21}: {v}")
params_c = utils.Params(hp_file)

if params_c.net == 'resnet18':
    model_c = torchvision.models.resnet18(num_classes=10)
    # MNIST images have one channel, so replace the default 3-channel input convolution
    model_c.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
else:
    raise ValueError(f"Unsupported network: {params_c.net}")

In [None]:
# Get dataloader
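if params_c.dataset_class_name == 'MNIST':
    # Workaround: LeCun's website currently returns an error, so fetch MNIST from a mirror.
    # -nc skips the download if MNIST.tar.gz is already present.
    !wget -nc www.di.ens.fr/~lelarge/MNIST.tar.gz
    !tar -zxvf MNIST.tar.gz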
train_loader_c, val_loader_c, test_loader_c = data_load.dataloader(params_c)

In [None]:
# Train a model
model_c = train.get_trained_model(params_c, model_c, train_loader_c, val_loader_c)

In [None]:
# Evaluate the trained model on the test set
acc_c = test.evaluate(params_c, model_c, test_loader_c)

## Conclusion

In [None]:
print('Default test accuracy: {:.2%}'.format(acc_a))
print('Symmetric label noise test accuracy: {:.2%}'.format(acc_b))
print('Salt&Pepper test accuracy: {:.2%}'.format(acc_c))

## Next Steps
Having observed how noise in labels and data can affect performance, several open questions arise. How can such noisy environments be detected? Uncertainty estimation is a popular technique to capture noise. How can the model remain robust? Diverse approaches exist, such as adversarial training. How can high performance still be achieved, for example in practical applications? Selective prediction, in which the model abstains on inputs it is unsure about, is one method to improve performance on the inputs it does answer.
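
As one possible starting point, selective prediction can be sketched with the maximum softmax probability as the confidence score. This is only one of several confidence measures and is not part of this repository; the helper names `selective_predict` and `selective_accuracy`, the threshold values, and the toy logits below are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def selective_predict(logits, threshold):
    """Return the predicted class index, or None to abstain.

    The maximum softmax probability serves as the confidence score;
    the model abstains when it falls below `threshold`.
    """
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return None
    return probs.index(confidence)


def selective_accuracy(all_logits, targets, threshold):
    """Return (coverage, accuracy on the accepted samples)."""
    predictions = [selective_predict(logits, threshold) for logits in all_logits]
    accepted = [(p, t) for p, t in zip(predictions, targets) if p is not None]
    coverage = len(accepted) / len(targets)
    if not accepted:
        return coverage, float("nan")
    accuracy = sum(p == t for p, t in accepted) / len(accepted)
    return coverage, accuracy


# Made-up logits for three 3-class samples: confident and correct,
# uncertain and wrong, confident and correct.
toy_logits = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 0.0, 5.0]]
toy_targets = [0, 1, 2]
print(selective_accuracy(toy_logits, toy_targets, threshold=0.0))
print(selective_accuracy(toy_logits, toy_targets, threshold=0.8))
```

On these toy inputs, raising the threshold from 0.0 to 0.8 rejects the uncertain sample, so coverage drops from all three samples to two of three while accuracy on the accepted samples rises from two of three to all correct.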