# Training Data Preparation

Before introducing the model, let’s assume we will use a neural network to train a language model. The question now is how to read minibatches of examples and labels at random. We describe how to do this for both the random
sampling and the sequential partitioning strategies below.

## Random Sampling

The following code randomly generates a minibatch from the data each time. Here, batch_size is the
number of examples in each minibatch and num_steps is the length of the sequence (or the number of time
steps, for a time series) in each example. In random sampling, each example is a subsequence taken from
an arbitrary position in the original sequence. Two adjacent random minibatches are not necessarily
adjacent in the original sequence. The goal is to predict the next character from the characters
seen so far, so the labels are the original sequence shifted by one character.
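To make the shift concrete, here is a minimal sketch on a toy sequence. The values of `seq`, `num_steps`, and `pos` are assumptions chosen for illustration only and are not part of the code that follows.

```python
# Toy illustration; all values here are chosen for this sketch only.
seq = list(range(10))   # stand-in for a tokenized corpus
num_steps = 4           # length of each example
pos = 2                 # arbitrary starting position

X = seq[pos:pos + num_steps]          # input subsequence
Y = seq[pos + 1:pos + 1 + num_steps]  # labels: the same window shifted by one

print('X:', X)  # X: [2, 3, 4, 5]
print('Y:', Y)  # Y: [3, 4, 5, 6]
```

Every label in Y is the token that follows the input at the same position in X, which is why one extra token must always be available past the end of the input window.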

In [8]:
import torch
import random
import utils

In [9]:
corpus, vocab = utils.load_corpus_time_machine(max_tokens=30)

In [10]:
corpus[:10]

[3, 9, 2, 1, 3, 5, 13, 2, 1, 13]

In [11]:
len(vocab)

28

In [12]:
def seq_data_iter_random(corpus, batch_size, num_steps):
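    # Start from a random offset in [0, num_steps - 1]; random.randint is
    # inclusive at both ends, so an upper bound of num_steps would repeat
    # the alignment of offset 0
    corpus = corpus[random.randint(0, num_steps - 1):]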
    # Subtract 1 extra since we need to account for label
    num_examples = (len(corpus) - 1)//num_steps
    example_indices = list(range(0, num_examples*num_steps,
                                num_steps))
    random.shuffle(example_indices)
    # This returns a sequence of the length num_steps starting from pos
    data = lambda pos: corpus[pos:pos+num_steps]
    # Discard half empty batches
    num_batches = num_examples // batch_size
    for i in range(0, batch_size * num_batches, batch_size):
        # Each minibatch takes batch_size examples from the shuffled indices
        batch_indices = example_indices[i:i+batch_size]
        X = [data(j) for j in batch_indices]
        Y = [data(j+1) for j in batch_indices]
        yield torch.tensor(X), torch.tensor(Y)

Let us generate an artificial sequence from 0 to 29 and read it with a batch size of 2 and 4 time steps.

In [16]:
my_seq = list(range(30))
for X, Y in seq_data_iter_random(my_seq, batch_size=2,
                                num_steps=4):
    print('X:', X)
    print('Y:', Y)

X: tensor([[ 3,  4,  5,  6],
        [ 7,  8,  9, 10]])
Y: tensor([[ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
X: tensor([[23, 24, 25, 26],
        [19, 20, 21, 22]])
Y: tensor([[24, 25, 26, 27],
        [20, 21, 22, 23]])
X: tensor([[15, 16, 17, 18],
        [11, 12, 13, 14]])
Y: tensor([[16, 17, 18, 19],
        [12, 13, 14, 15]])
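The number of minibatches in the output above follows from simple bookkeeping. The sketch below mirrors the arithmetic of seq_data_iter_random; `count_random_batches` is a hypothetical helper introduced here, and the offset of 3 is inferred from the smallest starting value in the printed output rather than stated by the source.

```python
def count_random_batches(seq_len, offset, batch_size, num_steps):
    """Mirror the bookkeeping of seq_data_iter_random (hypothetical helper)."""
    remaining = seq_len - offset                 # tokens left after dropping the offset
    num_examples = (remaining - 1) // num_steps  # reserve one token for the last label
    num_batches = num_examples // batch_size     # incomplete batches are discarded
    return num_examples, num_batches

print(count_random_batches(30, 3, 2, 4))  # (6, 3)
```

With 27 tokens left, 26 are usable as inputs, which gives 6 examples of length 4 and therefore 3 minibatches of 2 examples each.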


## Sequential Partitioning

In addition to random sampling of the original sequence, we can also make the positions of two adjacent
minibatches adjacent in the original sequence. This strategy is called sequential partitioning.

In [17]:
def seq_data_iter_consecutive(corpus, batch_size, num_steps):
    # Start from a random offset so that the partition varies between epochs
    offset = random.randint(0, num_steps)
    # Keep the largest number of tokens divisible by batch_size,
    # reserving one token for the final label
    num_indices = ((len(corpus) - offset - 1)//batch_size
                   *batch_size)
    Xs = torch.tensor(corpus[offset: offset+num_indices])
    Ys = torch.tensor(corpus[offset+1: offset+1+num_indices])
    Xs, Ys = Xs.reshape(batch_size, -1), Ys.reshape(batch_size, -1)
    num_batches = Xs.shape[1] // num_steps
    for i in range(0, num_batches * num_steps, num_steps):
        X = Xs[:, i:(i+num_steps)]
        Y = Ys[:, i:(i+num_steps)]
        yield X, Y
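The key step in seq_data_iter_consecutive is the reshape to batch_size rows: each row becomes one long contiguous stretch of the sequence, and minibatches are then cut as windows along the columns. The following pure-Python sketch reproduces only that layout. `consecutive_layout` is a hypothetical helper, and the offset is fixed at an arbitrary value instead of being drawn at random.

```python
def consecutive_layout(seq, batch_size, offset=0):
    """Split seq into batch_size contiguous rows, as reshape(batch_size, -1) does.

    Hypothetical helper for illustration; the offset is fixed, not random.
    """
    # Largest token count divisible by batch_size, leaving one token for the last label
    num_indices = (len(seq) - offset - 1) // batch_size * batch_size
    row_len = num_indices // batch_size
    tokens = seq[offset:offset + num_indices]
    return [tokens[r * row_len:(r + 1) * row_len] for r in range(batch_size)]

rows = consecutive_layout(list(range(30)), batch_size=2, offset=2)
for row in rows:
    print(row)
```

For a sequence of 30 tokens and an offset of 2, the first row holds tokens 2 to 14 and the second row holds tokens 15 to 27. Sliding a window of num_steps columns across both rows yields minibatches whose rows continue where the previous minibatch stopped.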

Using the same settings, print input X and label Y for each minibatch of examples read by sequential partitioning.

In [18]:
my_seq = list(range(30))
for X, Y in seq_data_iter_consecutive(my_seq, batch_size=2,
                                     num_steps=4):
    print('X:', X)
    print('Y:', Y)

X: tensor([[ 2,  3,  4,  5],
        [15, 16, 17, 18]])
Y: tensor([[ 3,  4,  5,  6],
        [16, 17, 18, 19]])
X: tensor([[ 6,  7,  8,  9],
        [19, 20, 21, 22]])
Y: tensor([[ 7,  8,  9, 10],
        [20, 21, 22, 23]])
X: tensor([[10, 11, 12, 13],
        [23, 24, 25, 26]])
Y: tensor([[11, 12, 13, 14],
        [24, 25, 26, 27]])
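The adjacency property can be checked directly on the printed output. The sketch below copies the values shown above into plain lists and verifies that each row of a minibatch starts at the last label of the same row in the previous minibatch; `is_contiguous` is a hypothetical helper written for this check.

```python
# Inputs and labels copied from the printed output above.
Xs = [
    [[2, 3, 4, 5], [15, 16, 17, 18]],
    [[6, 7, 8, 9], [19, 20, 21, 22]],
    [[10, 11, 12, 13], [23, 24, 25, 26]],
]
Ys = [
    [[3, 4, 5, 6], [16, 17, 18, 19]],
    [[7, 8, 9, 10], [20, 21, 22, 23]],
    [[11, 12, 13, 14], [24, 25, 26, 27]],
]

def is_contiguous(prev_Y, next_X):
    """True if every row of next_X starts with the last label of the same row in prev_Y."""
    return all(y_row[-1] == x_row[0] for y_row, x_row in zip(prev_Y, next_X))

checks = [is_contiguous(Ys[i], Xs[i + 1]) for i in range(len(Xs) - 1)]
print(checks)  # [True, True]
```

The same check would generally fail on the random sampling output, where consecutive minibatches come from unrelated positions.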


Now we wrap the two sampling functions above in a class so that we can use it like a regular PyTorch data iterator later.

In [19]:
class SeqDataLoader(object):
    """An iterator to load sequence data"""
    def __init__(self, batch_size, num_steps, use_random_iter, max_tokens):
        if use_random_iter:
            data_iter_fn = seq_data_iter_random
        else:
            data_iter_fn = seq_data_iter_consecutive
        self.corpus, self.vocab = utils.load_corpus_time_machine(max_tokens)
        self.get_iter = lambda: data_iter_fn(self.corpus,
                                             batch_size,
                                             num_steps)
    def __iter__(self):
        return self.get_iter()
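One likely reason the class stores a lambda rather than the generator itself is that a generator can be consumed only once, whereas training usually loops over the data for several epochs. The sketch below illustrates this with stand-ins: `toy_iter` and `ToyLoader` are hypothetical names that follow the same pattern as SeqDataLoader but take the corpus directly, so they run without utils or torch.

```python
def toy_iter(corpus, num_steps):
    """Hypothetical stand-in for the sampling functions: yields (X, Y) pairs."""
    for pos in range(0, len(corpus) - num_steps, num_steps):
        yield corpus[pos:pos + num_steps], corpus[pos + 1:pos + 1 + num_steps]

class ToyLoader:
    """Same pattern as SeqDataLoader, but takes the corpus directly."""
    def __init__(self, corpus, num_steps):
        self.get_iter = lambda: toy_iter(corpus, num_steps)

    def __iter__(self):
        # Build a new generator on every call so each epoch starts fresh.
        return self.get_iter()

loader = ToyLoader(list(range(9)), num_steps=4)
epoch1 = list(loader)
epoch2 = list(loader)      # works: __iter__ created a second generator

gen = toy_iter(list(range(9)), num_steps=4)
first_pass = list(gen)
second_pass = list(gen)    # a bare generator is exhausted after one pass

print(len(epoch1), len(epoch2))           # 2 2
print(len(first_pass), len(second_pass))  # 2 0
```

Because `__iter__` calls the stored function each time, every `for X, Y in loader` loop sees the full data again, and with the random samplers it also draws a new offset.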

Lastly, we define a function load_data_time_machine that returns both the data iterator and the vocabulary,
so we can use it like the other functions with the load_data prefix.

In [20]:
def load_data_time_machine(batch_size, num_steps,
                           use_random_iter=False, max_tokens=10000):
    data_iter = SeqDataLoader(batch_size, num_steps,
                              use_random_iter, max_tokens)
    return data_iter, data_iter.vocab