# Preprocessing and data statistics
This notebook summarizes the statistics of the data used in the experiments: the MNIST data itself and the data generated for the subtraction task.

In [1]:
import os
# switch to the project root so that the `data` package can be imported
os.chdir('/app/dpl/')
from data.pre_processing import MNISTTrain, MNISTest, MNISTDiffTrain, MNISTDiffTest
import torchvision.transforms as transforms

## Dataset for MNIST subtraction
For the MNIST subtraction task, we first generated pairs of MNIST images. Each pair is labeled with the difference of its two digits.

In [2]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
mnist_train = MNISTTrain(transform=transform)
mnist_test = MNISTest(transform=transform)
mnist_test.drop_samples(0.8)

# generate MNIST pairs and their corresponding labels
diff_train = MNISTDiffTrain(mnist_train)
diff_test = MNISTDiffTest(mnist_test)
print(f'We generate {len(diff_train)} samples for the training set and {len(diff_test)} samples for the test set.')

We generate 59999 samples for the training set and 1998 samples for the test set.


The next cell shows the distribution of classes in the training set. The further a class is from zero, the fewer samples it has, because fewer digit combinations produce it: class 9 arises from a single combination (9-0), whereas class 8 arises from two (9-1, 8-0).

In [3]:
diff_train.print_data_statistics()

class_label
 0    5850
 1    5811
-1    5513
-2    4790
 2    4676
-3    4110
 3    4079
 4    3536
-4    3506
 5    2910
-5    2886
 6    2523
-6    2421
-7    1902
 7    1858
-8    1216
 8    1194
-9     628
 9     590
Name: count, dtype: int64
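
The trend in these counts can be checked by enumerating all ordered digit pairs and grouping them by their difference. The following is a minimal sketch in plain Python that does not use the project's dataset classes; the variable names are illustrative. The sign convention (`x2 - x1`) is an assumption, but the counts are symmetric around zero, so it does not affect the result.

```python
from collections import Counter
from itertools import product

# Count ordered digit pairs (x1, x2) by their difference.
# Assumption: the class label is x2 - x1; the counts are symmetric either way.
pair_counts = Counter(x2 - x1 for x1, x2 in product(range(10), repeat=2))

# List classes from the centre outwards: the count shrinks as |diff| grows.
for diff in sorted(pair_counts, key=abs):
    print(f"{diff:+d}: {pair_counts[diff]} combinations")
```

Each class `d` is reachable through `10 - |d|` combinations, so class 0 has ten and classes 9 and -9 have one each. Because the MNIST digits are themselves only roughly balanced, the sample counts follow this pattern approximately rather than exactly.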


As mentioned above, several combinations can result in the same class. The next cell shows the number of samples per combination for class 1.

In [4]:
diff_train.print_class_statistics(1)

   x1_label  x2_label  count
0         0         1    714
1         1         2    701
2         2         3    645
3         3         4    652
4         4         5    548
5         5         6    607
6         6         7    650
7         7         8    638
8         8         9    656


To test how sample-efficient the neural-based and the NeSy approaches are, we reduced the number of combinations per class to 1.

In [5]:
diff_train.set_num_class_samples(1)
print('Statistics of class 1 and its combinations:')
diff_train.print_class_statistics(1)
print('\nStatistics of the overall training dataset:')
diff_train.print_data_statistics()
print(f'\nWe reduced the number of samples for the original dataset (59999 samples) to only {len(diff_train)} samples.')

Statistics of class 1 and its combinations:
   x1_label  x2_label  count
0         0         1    714

Statistics of the overall training dataset:
class_label
6     726
1     714
-1    689
-2    671
-8    639
-5    638
-9    628
-7    606
-3    603
7     602
9     590
5     576
-4    569
8     557
-6    556
0     555
2     552
4     535
3     524
Name: count, dtype: int64

We reduced the number of samples for the original dataset (59999 samples) to only 11530 samples.
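
The idea behind this reduction can be illustrated on toy data. The following is a hypothetical sketch, not the implementation of `set_num_class_samples`: the helper name and the rule of keeping the first combination seen for each class are assumptions, as is the label convention `x2 - x1`, which is read off the class 1 table above.

```python
from collections import Counter

def keep_one_combination_per_class(samples):
    """Keep only samples whose (x1, x2) pair is the first one seen for its class.

    `samples` is a list of (x1_label, x2_label) tuples; the class is x2 - x1.
    Keeping the first-seen pair is an arbitrary, hypothetical selection rule.
    """
    chosen = {}  # class label -> the single (x1, x2) pair kept for it
    kept = []
    for x1, x2 in samples:
        label = x2 - x1
        # Register the pair for this class if none has been chosen yet.
        pair = chosen.setdefault(label, (x1, x2))
        if pair == (x1, x2):
            kept.append((x1, x2))
    return kept

# Toy data: class 1 occurs via (0, 1) and (1, 2); class 0 via (3, 3).
toy = [(0, 1), (1, 2), (0, 1), (3, 3), (1, 2)]
reduced = keep_one_combination_per_class(toy)
print(Counter(x2 - x1 for x1, x2 in reduced))
```

After the reduction, every class is still present, but each one is represented by a single digit combination. This is why the per-class counts in the output above are much closer to each other than in the full dataset.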


## Dataset for MNIST task
For pretraining the LeNet, we used only 50% of the available MNIST training data, because this is sufficient to reach a satisfactory result.

In [6]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
mnist_train = MNISTTrain(transform=transform)
mnist_train.drop_samples(0.5)
mnist_test = MNISTest(transform=transform)
mnist_test.drop_samples(0.8)

print(f'The training set has {len(mnist_train)} samples. Samples are (approximately) evenly distributed between the different classes.')
print(f'The test set has {len(mnist_test)} samples. Samples are (approximately) evenly distributed between the different classes.')

The training set has 30000 samples. Samples are (approximately) evenly distributed between the different classes.
The test set has 1999 samples. Samples are (approximately) evenly distributed between the different classes.
