# 2 Design pattern for data generators

In [2]:
import torch

A typical setup for machine learning is that we have a number of observations, each with multiple dimensions. Let's say we have images of 28 by 28 pixels with three color channels, so every observation has shape (3, 28, 28).

In [3]:
observations = (50, )
datasize = (3, 28, 28)

dim = observations + datasize

X = torch.rand(dim)
X.shape

torch.Size([50, 3, 28, 28])

To avoid clusters of observations that are highly correlated, we would want to shuffle the data. While we could shuffle the data itself, it is better to use an index and shuffle the index.

This approach is especially useful if your data is too big to fit into memory. Also take into account the model that runs transformations on your data: its intermediate results can take up a multiple of the memory of the original image.

In that case you would want to feed the model a list of paths to images, files, etc., and load only one batch of images at a time while training.
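A minimal sketch of this idea, assuming a hypothetical list of file names and a hypothetical `load_image` function that stands in for real disk access (here it just returns a random tensor). Only the index is shuffled, and only one batch of images is held in memory at a time.

```python
import torch

# hypothetical file names; in practice these would point to real images
paths = [f"img_{i:03d}.png" for i in range(50)]

def load_image(path):
    # placeholder for real image loading (an assumption for this sketch):
    # returns a random tensor with the shape of a single image
    return torch.rand(3, 28, 28)

# shuffle the index, not the data
shuffled = torch.randperm(len(paths))

# load only the first batch of 8 images into memory
batch_paths = [paths[i] for i in shuffled[:8].tolist()]
batch = torch.stack([load_image(p) for p in batch_paths])
print(batch.shape)
```

The list of paths is small, so shuffling its index is cheap, no matter how large the images themselves are.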

In [4]:
index_list = torch.randperm(len(X))
index_list

tensor([22, 48, 43, 38,  9, 46,  1, 39,  0, 12,  8, 23, 34, 10, 49, 30,  7, 26,
        19,  4, 44, 36, 37, 45, 16, 20, 41, 17, 47, 18, 33,  3, 28, 42, 40, 21,
        11, 14, 27, 13, 35,  2, 25,  5, 24, 31, 32, 15, 29,  6])
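As a small check, assuming data with the same shape as `X` above: indexing the data with the shuffled index returns the observations in shuffled order without modifying the original tensor, and sorting the index shows that every position occurs exactly once.

```python
import torch

X = torch.rand(50, 3, 28, 28)
index_list = torch.randperm(len(X))

# indexing with the shuffled index returns a shuffled copy; X itself is unchanged
X_shuffled = X[index_list]
print(X_shuffled.shape)

# the first shuffled observation is the original observation at index_list[0]
print(torch.equal(X_shuffled[0], X[index_list[0]]))

# every index from 0 to 49 occurs exactly once
print(torch.equal(index_list.sort().values, torch.arange(len(X))))
```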

With `torch.randperm(n)` we obtain a random permutation of the integers from 0 to $n-1$. The next step is to use a generator. Simple generators were introduced with [PEP 255](https://www.python.org/dev/peps/pep-0255/).

In [6]:
gen = (i for i in range(10))
gen

<generator object <genexpr> at 0x7f158a9c64d0>

We can call `next` on a generator:

In [7]:
next(gen)

0

And a second time:

In [8]:
next(gen)

1
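A finite generator eventually runs out. A short sketch of what happens then, using a hypothetical two-item generator `short_gen`: `next` raises `StopIteration`, unless a default value is passed as the second argument.

```python
short_gen = (i for i in range(2))

print(next(short_gen))  # 0
print(next(short_gen))  # 1

# the generator is now exhausted: a default value avoids the exception
print(next(short_gen, None))  # None

# without a default, next raises StopIteration
try:
    next(short_gen)
except StopIteration:
    print("generator is exhausted")
```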

Generators are useful if we want to generate sequences that are infinite:

In [9]:
def fib():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b

In [10]:
fibonacci = fib()
for i in range(10):
    print(next(fibonacci))

1
1
2
3
5
8
13
21
34
55
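
Instead of calling `next` in a loop, the standard library's `itertools.islice` can take a finite slice from an infinite generator. A small sketch, repeating the `fib` definition from above so the example is self-contained:

```python
from itertools import islice

def fib():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b

# take the first ten values; the generator itself never terminates
first_ten = list(islice(fib(), 10))
print(first_ten)
```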


We can use the generator pattern to yield a dataset in batches of observations:

In [11]:
def naive_generator(data, batchsize):
    index_list = torch.randperm(len(data))
    i = 0
    while True:
        index = index_list[i:i+batchsize]
        X = data[index]
        yield X
        i += batchsize

This code shuffles an index and yields the observations that belong to a batch-sized chunk of these indices.

In [12]:
gen = naive_generator(X, 32)
for i in range(3):
    batch = next(gen)
    print(f"Shape: {batch.shape}")

Shape: torch.Size([32, 3, 28, 28])
Shape: torch.Size([18, 3, 28, 28])
Shape: torch.Size([0, 3, 28, 28])


However, we run into a problem: after two batches we have run out of data, and every following batch is an empty tensor.
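
No error is raised, because slicing beyond the end of a tensor (just like slicing a list) silently returns a shorter or empty result. A short illustration, using an unshuffled index of 50 items:

```python
import torch

index_list = torch.arange(50)

print(index_list[32:64].shape)  # only 18 items are left
print(index_list[64:96].shape)  # nothing left: an empty slice, no error
```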

To solve this problem we will use a nested index:
- `i` loops over the positions within a batch, from 0 to `batchsize - 1`
- `index` loops through the `index_list`
- every time the `index` would go beyond the datasize, we reset it to zero and reshuffle the `index_list`
- the data is collected in a tensor `X` with the nested `index_list[index]` approach

In [13]:
def generator(data, batchsize):
    size = len(data)
    shape = data.shape[1:]
    index_list = torch.randperm(size)
    index = 0

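    while True:
        # allocate a fresh tensor for every batch, so that batches
        # yielded earlier are not overwritten in place
        X = torch.zeros((batchsize, ) + shape)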
        for i in range(batchsize):
            # i will always run from 0 to batchsize - 1,
            # regardless of how many items you have left
            if index >= size:
                # if your index goes beyond the amount of data
                index = 0
                # we reset it to zero
                index_list = torch.randperm(size)
                # and shuffle the index_list

            # we use the index (that goes from 0 to size)
            # to grab the next (shuffled) index_list item
            # and fill batch i with it
            X[i] = data[index_list[index]]
            index += 1
        yield X

In [14]:
gen = generator(X, 32)
for i in range(3):
    batch = next(gen)
    print(f"Shape: {batch.shape}")

Shape: torch.Size([32, 3, 28, 28])
Shape: torch.Size([32, 3, 28, 28])
Shape: torch.Size([32, 3, 28, 28])


Even though we have only 50 observations, we don't run out of data. The index is reshuffled every time we have passed through all observations, so we can generate an unlimited number of batches.
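
To check that every pass over the data is complete, here is a sketch with a hypothetical, more compact variant (`sliced_generator`, which is not the function defined above). It slices the shuffled index and appends a fresh permutation whenever it runs short. With 50 labelled observations, the first 50 items drawn should contain every observation exactly once.

```python
import torch

def sliced_generator(data, batchsize):
    # hypothetical variant: keep a queue of shuffled indices and
    # append a fresh permutation whenever the queue runs short
    size = len(data)
    queue = torch.randperm(size)
    while True:
        while len(queue) < batchsize:
            queue = torch.cat([queue, torch.randperm(size)])
        index, queue = queue[:batchsize], queue[batchsize:]
        yield data[index]

# label every observation with its own position, so we can track it
data = torch.arange(50)
gen = sliced_generator(data, 32)

# two batches of 32 give 64 items; the first 50 form one full pass
drawn = torch.cat([next(gen) for _ in range(2)])
first_pass = drawn[:50]
print(torch.equal(first_pass.sort().values, torch.arange(50)))
```

Slicing the index for a whole batch at once avoids the Python loop over single items, at the cost of concatenating index tensors now and then.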