Build a simple Sequential Model from an LSTM. 

* Given a sequence of tokens, the classifier predicts the following token as a class.
* The input of the neural network is an array of token sequences (the design matrix), and the output is a vector of labels that correspond to the target tokens.


### Motivation

In the case of the n-gram language model, we used the probability of each n-gram in the input sentence to calculate the perplexity. Our current model does not rely on n-grams, but on the probability that a sequence of tokens is followed by a given token. We need to adapt the perplexity formula for n-gram language models to sequence-based language models.

If we consider the sentence of N tokens:

$$w_{1},\cdots, w_N$$

Then we can calculate the probability of that sentence as the product of probabilities of all the padded subsequences. Let’s take an example of a 3-token sentence with a context of two tokens, where 0 denotes the padding token.

$$ P(w_1, w_2, w_3) = P(w_3 \mid w_1, w_2) \times P(w_2 \mid w_1, 0) \times P(w_1 \mid 0, 0) $$

In general, for a sentence of N tokens and a sequence length S,

$$ P(w_1,\cdots, w_N) = \prod_{k = 1}^{N} P(w_k \mid \text{padded}_S(w_{1}, \cdots, w_{k-1})) $$

where $\text{padded}_S(w_{1}, \cdots, w_{k-1})$ is the preceding context padded or truncated to exactly S tokens, and $P(w_{k} \mid \text{padded}_S(w_{1}, \cdots, w_{k-1}))$ is precisely the probability given by the classification model. The product has one factor per token of the sentence, as in the 3-token example.

We can therefore compute the perplexity of a sentence of length N with

$$PP(w_{1},\cdots, w_N) = \exp \left[ - \frac{1}{N} \sum_{k = 1}^{N} \log P(w_{k} \mid \text{padded}_S(w_{1}, \cdots, w_{k-1})) \right]$$
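
The perplexity formula can be sketched in plain Python as follows. This is a minimal illustration, not the expected solution: `padded_context`, `sentence_perplexity`, and `prob_fn` are hypothetical names, `prob_fn` stands in for the trained classifier, the sum runs over the N tokens of the sentence, and the padding is placed on the right only to mirror the 3-token example.

```python
import math

PAD = 0  # assumed padding index, matching the 0 used in the 3-token example


def padded_context(tokens, k, S, pad=PAD):
    """Return the S-token context used to predict tokens[k] (0-based index)."""
    context = list(tokens[max(0, k - S):k])       # keep at most the last S tokens
    return context + [pad] * (S - len(context))   # pad on the right (an assumption)


def sentence_perplexity(tokens, S, prob_fn):
    """exp(-1/N * sum over k of log P(w_k | padded context))."""
    log_prob = 0.0
    for k, token in enumerate(tokens):
        log_prob += math.log(prob_fn(padded_context(tokens, k, S), token))
    return math.exp(-log_prob / len(tokens))


def uniform_model(context, token):
    """Toy stand-in for the classifier: uniform over a 4-token vocabulary."""
    return 1 / 4


contexts = [padded_context([7, 8, 9], k, S=2) for k in range(3)]
ppl = sentence_perplexity([7, 8, 9], S=2, prob_fn=uniform_model)
```

A uniform model over V tokens has perplexity V, which makes a convenient sanity check before plugging in the real network.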

### Workflow

#### Preparing the data

1. Load the dataset that was prepared in task 1.
2. The original dataset is too large and needs to be reduced. To reduce it, you can, for instance,
   + filter out items that have too many or too few tokens,
   + select items of a certain type: posts, comments, or titles, or
   + subsample items randomly.
3. Build the vocabulary as the set of all unique tokens to construct the list of token indexes.
  + Filtering on token frequency is one way to reduce the overall size of the vocabulary.
4. Set a fixed sequence length and build sequences of token indexes from the corpus. (See, for instance, Keras `pad_sequences`.)
5. Split the sequences into predictors and labels (see, for instance, `keras.utils.to_categorical`).
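
Steps 3 to 5 could look roughly like the sketch below, written in plain Python so that it does not depend on any framework. The names `build_vocab`, `make_examples`, and `min_count`, as well as the `<pad>` and `<unk>` entries, are illustrative assumptions and not part of the task. Padding is on the right to match the example in the Motivation section, whereas Keras `pad_sequences` pads on the left by default.

```python
from collections import Counter


def build_vocab(token_lists, min_count=2):
    """Map each token seen at least min_count times to an integer index."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<pad>": 0, "<unk>": 1}      # reserved indexes (an assumption)
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def make_examples(token_lists, vocab, seq_len):
    """Build (context, next token) pairs with fixed-length contexts."""
    X, y = [], []
    for toks in token_lists:
        ids = [vocab.get(t, vocab["<unk>"]) for t in toks]
        for k in range(len(ids)):
            context = ids[max(0, k - seq_len):k]          # last seq_len tokens
            X.append(context + [vocab["<pad>"]] * (seq_len - len(context)))
            y.append(ids[k])                               # label: the next token
    return X, y


corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(corpus, min_count=2)
X, y = make_examples(corpus, vocab, seq_len=2)
```

Here `y` holds integer class indexes. With PyTorch's `NLLLoss` they can be used directly; one-hot encoding is only needed for losses that expect it.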

#### The model

The data is now ready to be used to fit a neural network.

1. Define a simple sequential model with an embedding layer, one or more LSTM layers, and a dense layer with softmax activation. Feel free to experiment with dropout and different optimizers. You can use any deep learning framework you want; for example, Keras, TensorFlow, or PyTorch.
2. Specify the number of epochs, the batch size, and other fitting parameters.
3. Fit the network.
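
As a rough illustration of these three steps in PyTorch, the sketch below trains a small embedding, LSTM, and dense stack on random stand-in data. Everything here is an assumption made for the example: the `TinyLM` name, the layer sizes, the Adam optimizer and learning rate, and the per-position targets (one label for every position of the sequence rather than one label per sequence).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class TinyLM(nn.Module):
    """Embedding -> LSTM -> dense, returning log-probabilities per position."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                    # x: (seq_len, batch_size)
        out, _ = self.lstm(self.emb(x))      # (seq_len, batch_size, hidden_dim)
        return F.log_softmax(self.dense(out), dim=2)


vocab_size, seq_len, batch_size = 20, 5, 8
model = TinyLM(vocab_size, embedding_dim=16, hidden_dim=12)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.NLLLoss()                       # expects log-probabilities as input

# Random stand-in data: inputs and next-token targets, both (seq_len, batch_size).
inputs = torch.randint(0, vocab_size, (seq_len, batch_size))
targets = torch.randint(0, vocab_size, (seq_len, batch_size))

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    log_probs = model(inputs)                # (seq_len, batch_size, vocab_size)
    # Flatten positions so each row is one classification example.
    loss = loss_fn(log_probs.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

On real data the loop would iterate over mini-batches from the prepared sequences instead of reusing a single random batch.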

#### Assessing the results

1. Write a function that generates text.
2. Generate some text and take note of:
  - Token repetitions
  - Missing punctuation
  - Other anomalies
3. Write a function that calculates **perplexity of a sentence** and apply it to a subset of sentences to evaluate the model.
4. Define a validation set; for instance, 1000 titles.
5. Transform that validation set into sequences of tokens using the training vocabulary.
6. Tune the neural net and the parameters of the preprocessing phase to improve the model’s perplexity score.
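
One possible shape for the text generation function in step 1 is sketched below. It samples each new token from the predicted distribution. The names `generate` and `next_token_probs` are hypothetical, and `next_token_probs` stands in for the trained network: it takes the tokens generated so far and returns a dictionary of token probabilities.

```python
import random


def generate(next_token_probs, seed_tokens, max_new_tokens, rng=None):
    """Extend seed_tokens by sampling one token at a time."""
    rng = rng or random.Random()
    tokens = list(seed_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)          # dict: token -> probability
        choices, weights = zip(*probs.items())
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens


def toy_model(tokens):
    """Toy stand-in: always follows "a" with "b" and anything else with "a"."""
    return {"b": 1.0} if tokens[-1] == "a" else {"a": 1.0}


sample = generate(toy_model, ["a"], max_new_tokens=4, rng=random.Random(0))
```

Sampling rather than always taking the most probable token is one way to reduce the token repetitions mentioned in step 2, at the cost of occasionally picking unlikely tokens.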

### Preparing the data

In [2]:
import pandas as pd

In [3]:
TOK_FILE= 'stackexchange_tokenized.csv'
df = pd.read_csv(f'data/{TOK_FILE}')

In [None]:
# reduce the dataset

In [None]:
# build the vocabulary from the set of unique tokens

In [None]:
# pick the right fixed sequence length

In [None]:
# split the sequence into predictors and labels

### The model

In [None]:
# define the model

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F


In [15]:
%load_ext autoreload
%autoreload 2

from rnn_model import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [33]:
### https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
### views in pytorch - https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
### Maybe skip this one - https://gist.github.com/williamFalcon/f27c7b90e34b4ba88ced042d9ef33edd
### https://pytorch.org/docs/master/generated/torch.nn.LSTM.html
### https://github.com/pytorch/examples/tree/master/word_language_model
### https://stackoverflow.com/a/42482819 reshaping tensors with "view"


class Net(nn.Module):
    """Embedding -> LSTM -> dense layer; returns log-probabilities over the vocabulary."""

    def __init__(self, seq_len, vocab_size, embedding_dim, hidden_dim):
        # TODO: do I need seq_len here? Doubt it: none of the layers below use it.
        super().__init__()
        self.vocab_size = vocab_size
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # batch_first defaults to False, so the LSTM expects (seq_len, batch_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # LEFT OFF HERE - decide how/when to reset and share hidden params? Consider bptt approach.
        # When you process inputs, will batches be contiguous? Are there any helper functions in pytorch you can use?

        # x: (seq_len, batch_size) tensor of token indexes
        x = self.word_embeddings(x)  # -> (seq_len, batch_size, embedding_dim)
        x, states = self.lstm(x)     # -> (seq_len, batch_size, hidden_dim); no state is passed in, so it starts from zeros
        x = self.dense(x)            # -> (seq_len, batch_size, vocab_size)
        x = F.log_softmax(x, dim=2)  # log-probabilities over the vocabulary at each position
        return x
        
        

In [34]:
# Specify the number of epochs, the batch size, and other fitting parameters

# TODO: these numbers are all placeholders for testing. 
seq_len = 50
vocab_size = 4000
embedding_dim = 200
hidden_dim = 40
batch_size = 15
epochs = 25

In [35]:
net = Net(seq_len, vocab_size, embedding_dim, hidden_dim)

In [36]:
# run `forward` here to make sure that you have the expected dimensions for inputs and outputs.

#fake input
tsr = torch.randint(0,vocab_size, (seq_len, batch_size))
print(tsr.shape)  

out_tensor = net(tsr)
print(out_tensor.shape) 

torch.Size([50, 15])
torch.Size([50, 15, 4000])


In [None]:
# fit the model

### Assessing the results

In [None]:
#Write a function that generates text.

In [None]:
#Test the function for errors (missing text, punctuation, etc.)

In [None]:
# Calculate Perplexity

In [None]:
# Define a validation set; for instance, 1000 titles.

In [None]:
# Transform that validation set into sequences of tokens using the training vocabulary.

In [None]:
# Tune the neural net and the parameters of the preprocessing phase to improve the model’s perplexity score.

#### Testing

In [11]:

emb = nn.Embedding(400, 30)
tsr = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
ts2 = emb(tsr)
ts2.shape

torch.Size([2, 4, 30])

In [16]:
#how to get random tensors that are integers
torch.randint(0,10, (3,4,))

tensor([[5, 6, 9, 7],
        [3, 6, 8, 7],
        [0, 8, 9, 0]])

In [17]:
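# copied code from https://github.com/pytorch/examples/tree/master/word_language_model
# testing model dimensions.
# Assuming the copied RNNModel keeps the upstream argument order
# (rnn_type, ntoken, ninp, nhid, nlayers, ...), ninp is the embedding size and nhid the hidden size.
rnn_lm = RNNModel('LSTM', vocab_size, embedding_dim, hidden_dim, 1, dropout=0)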

In [19]:
hidden = rnn_lm.init_hidden(batch_size)
out_tensor, _ = rnn_lm(tsr, hidden)
out_tensor.shape  # (seq_len * batch_size, vocab_size): my model's output with the first two dimensions flattened. **update** they do this intentionally for easy loss calculations.

torch.Size([750, 4000])
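
To see why the flattened shape makes the loss calculation easy, here is a small check that is not part of the original notebook run. It uses a random stand-in for the model output and assumes per-position targets of shape `(seq_len, batch_size)`. `F.nll_loss` takes a `(N, C)` input and a `(N,)` target, which is exactly what flattening provides.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, batch_size, vocab_size = 50, 15, 4000

# Stand-in for the model output: log-probabilities, (seq_len, batch_size, vocab_size).
log_probs = F.log_softmax(torch.randn(seq_len, batch_size, vocab_size), dim=2)
targets = torch.randint(0, vocab_size, (seq_len, batch_size))

flat_log_probs = log_probs.reshape(-1, vocab_size)   # (750, 4000)
flat_targets = targets.reshape(-1)                   # (750,)
loss = F.nll_loss(flat_log_probs, flat_targets)      # mean over all 750 positions

# The same number computed by hand: mean of -log p(target) over all positions.
manual = -log_probs.gather(2, targets.unsqueeze(2)).mean()
```

The two values agree up to floating-point error, so returning the flattened tensor from the model only saves a reshape before the loss.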

In [31]:
lstm_out = torch.randint(0, vocab_size, (seq_len, batch_size, vocab_size)).float()
# lstm_out.view(seq_len, batch_size, -1) would be a no-op: the tensor already has that shape.
rszd_out = F.softmax(lstm_out, dim=2)  # softmax instead of log_softmax, so each row should sum to 1
rszd_out.shape


torch.Size([50, 15, 4000])

In [32]:
#sanity check
summed = torch.sum(rszd_out, dim=2)
summed[0,0]

tensor(1.0000)

In [24]:
torch.randint(0,vocab_size, (2, 3, 4))

tensor([[[2152,  453, 2541, 3116],
         [1831, 3638, 2561, 1046],
         [1450, 2765, 3805, 1294]],

        [[3281, 2037,  266, 3607],
         [3907, 2355, 3675, 1584],
         [3511, 2891, 3937, 2514]]])