# Exercise 1 -- The Dataset

The PTB (Penn Treebank) dataset is a collection of text snippets from Wall Street Journal stories.
The dataset is divided into three splits for training, validation and testing, respectively.
In this exercise we prepare the data for neural network training and inference.

## Downloading the dataset

We have downloaded the PTB data from https://github.com/tomsercu/lstm/tree/master/data and put the files `train.txt`, `test.txt` and `valid.txt` into the subfolder `ptb_data/`.

## Some sample phrases

Let's print a few example lines from the training set:

In [1]:
print_lines = (7, 34, 67, 93, 114)      # randomly chosen integers
with open('ptb_data/train.txt') as fh:
    for il, line in enumerate(fh):
        if il in print_lines:
            print('Line %i:\n"%s"\n' %(il, line))

Line 7:
" although preliminary findings were reported more than a year ago the latest results appear in today 's new england journal of medicine a forum likely to bring new attention to the problem 
"

Line 34:
" yields on money-market mutual funds continued to slide amid signs that portfolio managers expect further declines in interest rates 
"

Line 67:
" in the new position he will oversee mazda 's u.s. sales service parts and marketing operations 
"

Line 93:
" south korea 's economic boom which began in N stopped this year because of prolonged labor disputes trade conflicts and sluggish exports 
"

Line 114:
" new england electric system bowed out of the bidding for public service co. of new hampshire saying that the risks were too high and the potential <unk> too far in the future to justify a higher offer 
"



We see that the text has been preprocessed:
- the special term `<unk>` stands in for rare words
- there is no punctuation (except in abbreviations such as "u.s." and "co.")
- words are in lower case
- numbers have been replaced by "N"
- spaces have been inserted before some terms such as "’ll", "’s", "n’t"

Furthermore, each line still ends with the newline character `\n` and starts with an additional space.
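
The observations above can be checked programmatically. The following is a minimal sketch that uses a string literal copied from line 67 of the output instead of reading the file; the variable names (`raw`, `tokens`, and the boolean flags) are chosen here for illustration only.

```python
# Sample copied from line 67 of the output above, including the leading
# space and the trailing space + newline found in the raw file.
raw = " in the new position he will oversee mazda 's u.s. sales service parts and marketing operations \n"

# repr() makes the otherwise invisible whitespace visible
print(repr(raw[:8]), repr(raw[-12:]))

has_leading_space = raw.startswith(' ')
ends_with_newline = raw.endswith('\n')
is_lower_case = raw == raw.lower()

# str.split() without arguments drops leading/trailing whitespace
# (including the newline) and splits on runs of whitespace
tokens = raw.split()
print(tokens)
```

Since `split()` already discards the surrounding whitespace, no separate `strip()` call is needed before tokenizing a line.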

## Adding `<eos>`

To represent the text as arrays, we need one further preprocessing step: each newline character is replaced with the special token `<eos>`, which marks the end of a sentence. We do this for our example sentences below:

In [2]:
print_lines = (7, 34, 67, 93, 114)      # randomly chosen integers
with open('ptb_data/train.txt') as fh:
    for il, line in enumerate(fh):
        if il in print_lines:
            print('Line %i:\n"%s"\n' %(il, line.strip('\n') + '<eos>'))

Line 7:
" although preliminary findings were reported more than a year ago the latest results appear in today 's new england journal of medicine a forum likely to bring new attention to the problem <eos>"

Line 34:
" yields on money-market mutual funds continued to slide amid signs that portfolio managers expect further declines in interest rates <eos>"

Line 67:
" in the new position he will oversee mazda 's u.s. sales service parts and marketing operations <eos>"

Line 93:
" south korea 's economic boom which began in N stopped this year because of prolonged labor disputes trade conflicts and sluggish exports <eos>"

Line 114:
" new england electric system bowed out of the bidding for public service co. of new hampshire saying that the risks were too high and the potential <unk> too far in the future to justify a higher offer <eos>"



## Assessing the size of the splits

Here we count the number of words in each split of the dataset. For that purpose we define a function that performs the word counting for us:

In [3]:
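def count_words(fname):
    """Count the tokens in a file, including one <eos> per line."""
    n_words = 0
    with open(fname) as fh:
        for line in fh:
            # split the line into words and append the end-of-sentence token
            tokens = line.strip().split() + ['<eos>']
            n_words += len(tokens)
    return n_words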

Now we can easily determine the number of words in each of the splits of the PTB data:

In [4]:
print('The training data contains:', count_words('ptb_data/train.txt'), 'words')
print('The validation data contains:', count_words('ptb_data/valid.txt'), 'words')
print('The test data contains:', count_words('ptb_data/test.txt'), 'words')

The training data contains: 929589 words
The validation data contains: 73760 words
The test data contains: 82430 words
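
As a quick sanity check of the counting rule (whitespace-separated tokens plus one `<eos>` per line), the sketch below applies it to a small temporary file whose count can be verified by hand. The helper name `count_tokens` and the sample text are made up for this illustration and are not part of the PTB data.

```python
import os
import tempfile

def count_tokens(fname):
    """Count whitespace-separated tokens plus one <eos> per line."""
    with open(fname) as fh:
        return sum(len(line.split()) + 1 for line in fh)

# Two hypothetical lines in the same layout as the PTB files:
# leading space, trailing space, newline.
sample = " the cat sat \n a dog barked loudly \n"

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

n_tokens = count_tokens(path)
os.remove(path)

print(n_tokens)   # (3 + 1) + (4 + 1) = 9
```

Note that the reported "word" counts therefore include one `<eos>` token per line of the file.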


## Building a dictionary

We use the example code from https://github.com/pytorch/examples/blob/main/word_language_model/data.py as our data loading routine, since it implements everything we need, including a dictionary of the words in the data.

In [5]:
import data
corpus = data.Corpus('ptb_data')
len(corpus.dictionary)

10000

We see that there are 10000 unique words in the dataset in total (counting special tokens such as `<unk>` and `<eos>`). The `Corpus` object has a `dictionary` attribute, which assigns a unique integer label to each unique word. As an illustration we print every 1000th word in the dictionary:

In [6]:
words = corpus.dictionary.idx2word[::1000]
print(words)

['aer', 'dec.', 'polled', 'furniture', 'sort', 'root', 'geneva', 'tale', 'interpreted', 'gte']


Furthermore, the dictionary also contains a lookup table that maps each word to its integer label. We illustrate this by printing the integers corresponding to the words in the variable `words` assigned above:

In [7]:
tmpstr = []
for word in words:
    tmpstr.append(corpus.dictionary.word2idx[word])
print(tmpstr)

[0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]
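
To make the two lookup directions concrete, here is a minimal sketch of how such a dictionary might work. The class `TinyDictionary` is a hypothetical stand-in written for this illustration; it mirrors the `word2idx` and `idx2word` attributes used above but is not the referenced `data.py` code itself. The sample line is a shortened version of line 93 from the output above.

```python
class TinyDictionary:
    """Hypothetical stand-in for a word <-> integer dictionary."""

    def __init__(self):
        self.word2idx = {}   # word -> integer label
        self.idx2word = []   # integer label -> word

    def add_word(self, word):
        # assign the next free integer to words we have not seen yet
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


vocab = TinyDictionary()
line = " south korea 's economic boom which began in N stopped this year \n"

# encode: words (plus <eos>) -> integer labels
ids = [vocab.add_word(word) for word in line.split() + ['<eos>']]
# decode: integer labels -> words
decoded = [vocab.idx2word[i] for i in ids]

print(ids)
print(decoded)
```

Because labels are handed out in order of first appearance, looking up a word via `word2idx` and then indexing `idx2word` with the result returns the original word, which is the round trip shown in the two cells above.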


With these preparations we are ready to embark on the task of training an Elman RNN on the data.