## Notebook Summary
* Train a language model on the ROCStories dataset: https://cs.rochester.edu/nlp/rocstories/
> Available here: https://drive.google.com/file/d/1eJINcSbC3JLl0hTNbhh5G94zTuXinpC-/view?usp=sharing

* Generate text with this language model using several decoding techniques
* Evaluate the language model using perplexity and the BLEU score. 

In [None]:
import pandas as pd
import tensorflow as tf
import numpy as np
import re
import json

### 1. Load and Preprocess the Dataset

In [None]:
data_path = "/content/drive/MyDrive/12_Teaching/UM6P-NLP-Jan2022/notebooks/ROCStories_winter2017.csv"
df = pd.read_csv(data_path)

In [None]:
print(df.head())
print(len(df))

                                storyid  ...                                          sentence5
0  8bbe6d11-1e2e-413c-bf81-eaea05f4f1bd  ...  After a few weeks, he started to feel much bet...
1  0beabab2-fb49-460e-a6e6-f35a202e3348  ...  Tom sat on his couch filled with regret about ...
2  87da1a22-df0b-410c-b186-439700b70ba6  ...  Marcus was happy to have the right clothes for...
3  2d16bcd6-692a-4fc0-8e7c-4a6f81d9efa9  ...  He ended up buying the truck he wanted despite...
4  c71bb23b-7731-4233-8298-76ba6886cee1  ...      His congregation was delighted and so was he.

[5 rows x 7 columns]
52665


In [None]:
def get_sentences(df, max_samples=None):
    df["sentence_1_2"] = df.sentence1 + " " + df.sentence2
    sentences = df.sentence_1_2
    sentences_1, sentences_2 = df["sentence1"], df["sentence2"]
    if max_samples is not None:
        sentences = sentences[:max_samples]
        sentences_1 = sentences_1[:max_samples]
        sentences_2 = sentences_2[:max_samples]
    return sentences, sentences_1, sentences_2

In [None]:
sentences, sentences_1, sentences_2 = get_sentences(df)

In [None]:
# print sentences example:
sentences[np.random.randint(len(sentences))]

"I was on duty at my work when i noticed someone staring at me. I didn't mind him but the day after that i saw him again."

**Exercise 1**: Create a function `clean_text` that cleans the sentences:
* Split words joined by "-"
* Split numbers from text using a regular expression and the function `re.split`
* Replace the token "&" with the token "and". 
* Lowercase all letters
* Tip: create lambda functions and apply them to the dataframe using the `.apply` method. 

In [None]:
def clean_text(sentences):
    clean_func1 = lambda t: ' '.join(t.split("-")) # .replace("-", " ")
    clean_func2 = lambda t: ' '.join(re.split(r"([0-9]+)([a-z]+)", t, flags=re.I)) # "9st" => "9 st" 
    clean_func3 = lambda t: ' '.join(re.split(r"([a-z]+)([0-9]+)", t, flags=re.I))
    clean_func4 = lambda t: t.lower().replace("&", "and")
    sentences = sentences.apply(clean_func1)
    sentences = sentences.apply(clean_func2)
    sentences = sentences.apply(clean_func3)
    sentences = sentences.apply(clean_func4)
    return sentences

In [None]:
sentences = clean_text(sentences)
sentences_1 = clean_text(sentences_1)
sentences_2 = clean_text(sentences_2)
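To see what the four cleaning steps do, it can help to run them on a single string. The sketch below is a simplified variant, not the notebook's implementation: it uses `re.sub` rather than `re.split`, and the name `clean_one` is hypothetical.

```python
import re

def clean_one(text):
    """Apply the four cleaning steps of Exercise 1 to one string (illustrative variant)."""
    text = text.replace("-", " ")                        # split hyphenated words
    text = re.sub(r"(\d+)([a-zA-Z]+)", r"\1 \2", text)   # digits followed by letters: "9th" -> "9 th"
    text = re.sub(r"([a-zA-Z]+)(\d+)", r"\1 \2", text)   # letters followed by digits: "B12" -> "B 12"
    return text.lower().replace("&", "and")              # lowercase, then "&" -> "and"

example = clean_one("My 9th-grade class read Tom & Jerry in room B12.")
print(example)
```

Unlike the `re.split` plus `' '.join` approach, `re.sub` does not leave extra spaces around the split pieces, which makes the result easier to inspect by eye.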

**Exercise 2**: Build the vocab by removing some punctuation and adding the special tokens. 

In [None]:
import nltk
from nltk import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
def get_vocab(sentences, tokens_to_remove=["$", "%", "'", "''"], special_tokens=["<PAD>", "<SOS>", "<EOS>"]):
    print("Building vocab....")
    # tokenize sentences
    tokenized_sentences = sentences.apply(word_tokenize)
    tokenized_sentences = tokenized_sentences.values # nested list 
    tokens = [w for s in tokenized_sentences for w in s] # flatten list 

    # build vocab
    unique_tokens = set(tokens)
    unique_tokens -= set(tokens_to_remove)  # set difference: no error if a token is absent
    unique_tokens = sorted(unique_tokens)
    vocab = {v: k for k, v in enumerate(special_tokens + unique_tokens)}
    print("vocab length:", len(vocab))
    print("saving vocab...")
    with open("vocab.json", "w") as f:
        json.dump(vocab, f)
    return tokens, vocab

In [None]:
tokens, vocab = get_vocab(sentences)

Building vocab....
vocab length: 20093
saving vocab...
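The vocab layout is easier to see on a toy corpus: the special tokens take the first ids, then the remaining unique tokens follow in sorted order. This is a minimal sketch under that assumption; `build_vocab` and the toy sentences are hypothetical and skip the NLTK tokenization and the JSON export.

```python
def build_vocab(token_lists, tokens_to_remove=(), special_tokens=("<PAD>", "<SOS>", "<EOS>")):
    """Map each token to an integer id: special tokens first, then sorted unique tokens."""
    unique = {tok for sent in token_lists for tok in sent} - set(tokens_to_remove)
    return {tok: idx for idx, tok in enumerate(list(special_tokens) + sorted(unique))}

toy_sentences = [["the", "cat", "sat", "."], ["the", "dog", "ran", "%"]]
toy_vocab = build_vocab(toy_sentences, tokens_to_remove=["%", "$"])
print(toy_vocab)
```

Because `<PAD>` comes first, it gets id 0, which is the value used for padding and the one masked by `mask_zero=True` in the embedding layer later on.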


In [None]:
def tokenize(sentences, vocab):
    # map each sentence to token ids using the vocab (tokens not in the vocab are dropped)
    # add <SOS> and <EOS> at the beginning and end of each sentence
    # split into input data and target data (the sentence shifted by one token)
    # pad sequences with 0 (the <PAD> id) so they all have the same length
    tok_func = lambda t: [vocab["<SOS>"]] + [vocab[w] for w in t if w in vocab] + [vocab["<EOS>"]]
    tokens_id = sentences.apply(word_tokenize)
    tokens_id = tokens_id.apply(tok_func)
    df = pd.DataFrame()
    df['input_sentence'] = tokens_id.apply(lambda t: t[:-1])
    df['target_sentence'] = tokens_id.apply(lambda t: t[1:])
    len_sentences = tokens_id.apply(len)
    max_len = np.max(len_sentences)
    pad_func = lambda t: t + [0] * (max_len - len(t))
    df["input_sentence"] = df.input_sentence.apply(pad_func)
    df["target_sentence"] = df.target_sentence.apply(pad_func)
    return df, max_len

In [None]:
df, max_len = tokenize(sentences, vocab)

In [None]:
import pprint
pprint.pprint(df.iloc[0])

input_sentence     [1, 4725, 12150, 8137, 7930, 13993, 12354, 326...
target_sentence    [4725, 12150, 8137, 7930, 13993, 12354, 326, 1...
Name: 0, dtype: object
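The input/target split is a one-token shift: the model reads tokens up to position t and must predict the token at position t+1. A minimal sketch on a toy vocab (the helper `make_example` and the toy ids are hypothetical):

```python
def make_example(tokens, vocab, max_len):
    """Return (input_ids, target_ids), shifted by one token and padded to max_len."""
    ids = [vocab["<SOS>"]] + [vocab[t] for t in tokens if t in vocab] + [vocab["<EOS>"]]
    inp, tgt = ids[:-1], ids[1:]          # input drops <EOS>, target drops <SOS>
    pad = vocab["<PAD>"]
    inp = inp + [pad] * (max_len - len(inp))
    tgt = tgt + [pad] * (max_len - len(tgt))
    return inp, tgt

toy_vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "the": 3, "cat": 4, "sat": 5}
inp, tgt = make_example(["the", "cat", "sat"], toy_vocab, max_len=6)
print(inp)  # [1, 3, 4, 5, 0, 0]
print(tgt)  # [3, 4, 5, 2, 0, 0]
```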


In [None]:
def tokenize_test(sentences, vocab):
    tokenize_func = lambda t: word_tokenize(t)
    tok_to_id_func = lambda t: [vocab["<SOS>"]]+[vocab[w] for w in t if w in vocab.keys()]+[vocab["<EOS>"]]
    tokenized_sentences = sentences.apply(tokenize_func)
    tokens_id = tokenized_sentences.apply(tok_to_id_func)
    len_sentences = tokens_id.apply(len)
    return tokens_id, len_sentences

In [None]:
def split_train_test(sentences, sentences_1_and_2, val_size=5000, test_size=3000):
    train_size = len(sentences) - (val_size + test_size)
    train_sentences = sentences[:train_size]
    val_sentences = sentences[train_size:train_size + val_size]
    test_sentences = sentences_1_and_2[train_size + val_size:train_size + val_size + test_size]
    return train_sentences, val_sentences, test_sentences

In [None]:
def preprocess_data(data_path):
    df = pd.read_csv(data_path) # read the file
    sentences, sentences_1, sentences_2 = get_sentences(df) # select the first 2 sentences
    sentences, sentences_1, sentences_2 = clean_text(sentences), clean_text(sentences_1), clean_text(sentences_2) # text cleaning
    tokens, vocab = get_vocab(sentences) # Build vocab 
    padded_sentences, max_len = tokenize(sentences, vocab) # tokenize, split input/target, pad sequences
    print("dataset set length:", len(padded_sentences))
    sentences_1, len_sentences_1 = tokenize_test(sentences_1, vocab)
    sentences_2, len_sentences_2 = tokenize_test(sentences_2, vocab)
    sentences_1_and_2 = pd.concat([sentences_1, sentences_2], axis=1) # dataframe with 2 sentences 
    train_sentences, val_sentences, test_sentences = split_train_test(padded_sentences, sentences_1_and_2)
    print("train dataset size", len(train_sentences))
    print("val dataset size", len(val_sentences))
    print("test dataset size", len(test_sentences))
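    return train_sentences, val_sentences, test_sentences, vocab  # also return the vocab that produced these ids

In [None]:
data_path = "/content/drive/MyDrive/12_Teaching/UM6P-NLP-Jan2022/notebooks/ROCStories_winter2017.csv"
# Rebind `vocab` here: preprocess_data rebuilds it, and a stale vocab from an earlier cell would mis-decode ids and mis-size the model.
train_sentences, val_sentences, test_sentences, vocab = preprocess_data(data_path)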

Building vocab....
vocab length: 20094
saving vocab...
dataset set length: 52665
train dataset size 44665
val dataset size 5000
test dataset size 3000


In [None]:
def get_dataloader(dataset, max_samples, batch_size):
    # transform 2 columns of dataframe to numpy arrays
    input_sentence = np.array([seq for seq in dataset.input_sentence.values])
    target_sentence = np.array([seq for seq in dataset.target_sentence.values])
    if max_samples is not None:
        input_sentence = input_sentence[:max_samples]
        target_sentence = target_sentence[:max_samples]
    # tensorflow dataset
    tfdataset = tf.data.Dataset.from_tensor_slices((input_sentence, target_sentence))
    # tensorflow dataloader
    dataloader = tfdataset.batch(batch_size, drop_remainder=True)
    return dataloader
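With `drop_remainder=True`, the last incomplete batch is discarded so that every batch has exactly `batch_size` rows. The plain-Python sketch below mimics that behaviour on a list (the helper `batch` is hypothetical, not the `tf.data` implementation) and counts the batches for the 44665 training examples reported above with a batch size of 64.

```python
def batch(items, batch_size, drop_remainder=True):
    """Group items into consecutive batches, optionally dropping a short last batch."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

examples = list(range(44665))
n_full = len(batch(examples, 64, drop_remainder=True))
n_all = len(batch(examples, 64, drop_remainder=False))
dropped = len(examples) - n_full * 64
print(n_full, n_all, dropped)
```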

In [None]:
def get_test_dataloader(data):
    inputs, targets = data.sentence1, data.sentence2
    inputs = inputs.to_list()
    targets = targets.to_list()
    inputs = [tf.constant(inp, dtype=tf.int32) for inp in inputs]
    targets = [tf.constant(tar, dtype=tf.int32) for tar in targets]
    return (inputs, targets)

In [None]:
batch_size = 64
train_loader = get_dataloader(train_sentences, batch_size=batch_size, max_samples=None)
print(next(iter(train_loader)))
val_loader = get_dataloader(val_sentences, batch_size=batch_size, max_samples=None)
print(next(iter(val_loader)))
test_loader = get_test_dataloader(test_sentences)
inputs, targets = test_loader
print(inputs[0])
print(targets[0])

(<tf.Tensor: shape=(64, 38), dtype=int64, numpy=
array([[    1,  4726, 12151, ...,     0,     0,     0],
       [    1, 18245,  7931, ...,     0,     0,     0],
       [    1, 10910, 11919, ...,     0,     0,     0],
       ...,
       [    1,  9261,  7931, ...,     0,     0,     0],
       [    1,  9503, 19398, ...,     0,     0,     0],
       [    1,  9683, 12758, ...,     0,     0,     0]])>, <tf.Tensor: shape=(64, 38), dtype=int64, numpy=
array([[ 4726, 12151,  8138, ...,     0,     0,     0],
       [18245,  7931,   327, ...,     0,     0,     0],
       [10910, 11919,  3640, ...,     0,     0,     0],
       ...,
       [ 9261,  7931,   829, ...,     0,     0,     0],
       [ 9503, 19398, 18206, ...,     0,     0,     0],
       [ 9683, 12758,  8257, ...,     0,     0,     0]])>)
(<tf.Tensor: shape=(64, 38), dtype=int64, numpy=
array([[    1,  9261,   890, ...,     0,     0,     0],
       [    1, 11337,   829, ...,     0,     0,     0],
       [    1, 19693, 19436, ...,     0,

***Exercise 4***: 
Create a `decode` function that decodes a list of token ids into text using the vocab. 

In [None]:
def decode(seq_idx, vocab, delim=' ', ignored=["<SOS>", "<PAD>", "<EOS>"]):
  # inv vocab
  inv_vocab = {token_id: token for token, token_id in vocab.items()}
  # decode sent
  decoded_sentence = [inv_vocab[token_id] 
                      for token_id in seq_idx
                      if inv_vocab[token_id] not in ignored]
  # join tokens
  return delim.join(decoded_sentence)
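Decoding is only meaningful when `decode` receives the same vocab that produced the ids. Since the vocab is sorted, a mismatch of a single entry is enough to shift every id and turn each word into an alphabetical neighbour. The sketch below illustrates this with two hypothetical toy vocabs, `vocab_a` and `vocab_b`, that differ by one token.

```python
def decode(seq_idx, vocab, delim=" ", ignored=("<SOS>", "<PAD>", "<EOS>")):
    """Turn a list of token ids back into text using the inverse vocab."""
    inv_vocab = {token_id: token for token, token_id in vocab.items()}
    return delim.join(inv_vocab[i] for i in seq_idx if inv_vocab[i] not in ignored)

special = ["<PAD>", "<SOS>", "<EOS>"]
words = ["a", "david", "davidson", "he", "head", "noticed", "notices"]
vocab_a = {t: i for i, t in enumerate(special + words)}          # vocab without the extra token
vocab_b = {t: i for i, t in enumerate(special + ["!"] + words)}  # same words plus one token sorted first

ids = [vocab_b[w] for w in ["david", "noticed", "he"]]           # ids encoded with vocab_b
right = decode(ids, vocab_b)
wrong = decode(ids, vocab_a)
print(right)
print(wrong)
```

If decoded training sentences look like near-misses of real words, checking that the ids and the vocab come from the same preprocessing run is a reasonable first step.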

In [None]:
inputs, targets = (next(iter(train_loader)))
decode(inputs[0].numpy(), vocab)

"davidson notices head haddie putrid ona a'neial lotion off weightlifting reception . head examining hispanic habitual toad trying anderson figured outage theater reasonable ."

**Exercise 5**: Build an LSTM network using `tf.keras.Model` or `tf.keras.Sequential` with: 
* An embedding layer
* An LSTM layer. There is a subtlety: you need to output the whole sequence of hidden states using the `return_sequences` argument: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
* A dropout layer after the LSTM layer
* A dense layer that projects the hidden state over the vocabulary. 
> What is the size of the NN output? 

In [None]:
# Build Model
def build_LSTM(vocab_size, emb_size, output_size, rnn_units, dropout_rate, rnn_drop_rate=0.0):
  model = tf.keras.Sequential()
  e = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=emb_size, mask_zero=True)
  model.add(e)
  lstm = tf.keras.layers.LSTM(rnn_units, recurrent_dropout=rnn_drop_rate, return_sequences=True) # to output each hidden representation of the sequence
  model.add(lstm)
  model.add(tf.keras.layers.Dropout(dropout_rate))
  model.add(tf.keras.layers.Dense(output_size)) # we compute only logits. 
  return model

In [None]:
lstm_model = build_LSTM(vocab_size=len(vocab), emb_size=32, output_size=len(vocab), rnn_units=64, dropout_rate=0.1)
lr = 0.001
optimizer = tf.keras.optimizers.Adam(lr,
                                     beta_1=0.9,
                                     beta_2=0.98,
                                     epsilon=1e-9)
EPOCHS = 10 # 10 for debugging. generally, in Language modelling, we take between 30 and 50 epochs. 

In [None]:
# test a forward pass on the lstm
for (inputs, targets) in train_loader.take(1):
    preds = lstm_model(inputs)

In [None]:
print(preds.shape)

(64, 38, 20093)


**Exercise 6:**  
* Create a function that trains the LSTM (as in the day 2 notebook) 
> Use `tf.keras.losses.SparseCategoricalCrossentropy`: https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy
 

* Compute the perplexity over the train and validation sets. Note that the perplexity is the exponential of the cross-entropy! 
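The relation between cross-entropy and perplexity can be checked on a few hand-picked probabilities. This is a minimal sketch: `perplexity` is a hypothetical helper that takes the probabilities the model assigned to the correct tokens, which is not how Keras exposes the loss.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability assigned to the correct tokens)."""
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

uniform = perplexity([0.25] * 8)          # model hesitates between 4 equally likely tokens
confident = perplexity([0.9, 0.8, 0.95])  # model puts most mass on the correct token
print(uniform, confident)
```

A model that is uniformly unsure among k tokens has perplexity k, so perplexity can be read as an effective number of choices per token; lower is better.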

In [None]:
import time

In [None]:
def train_LSTM(model, optimizer, EPOCHS, train_dataset, val_dataset, checkpoint_path):
    LSTM_ckpt_path = checkpoint_path + '/' + 'LSTM-{epoch}'

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(
            filepath=LSTM_ckpt_path,
            monitor='val_loss',
            save_best_only=True,
            save_weights_only=True,
            verbose=1)
    ]
    model.compile(optimizer=optimizer,
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

    print(model.summary())

    # Save the weights using the `checkpoint_path` format
    #model.save_weights(checkpoint_path.format(epoch=0))

    # --- starting the training ... -----------------------------------------------
    start_training = time.time()
    rnn_history = model.fit(train_dataset,
                            epochs=EPOCHS,
                            validation_data=val_dataset,
                            callbacks=callbacks,
                            verbose=2)

    train_loss_history_rnn = rnn_history.history['loss']
    val_loss_history_rnn = rnn_history.history['val_loss']
    train_ppl_history = np.exp(train_loss_history_rnn)
    val_ppl_history = np.exp(val_loss_history_rnn)
    train_history = [train_loss_history_rnn, val_loss_history_rnn, train_ppl_history, val_ppl_history]

    print('Training time for {} epochs: {}'.format(EPOCHS, time.time() - start_training))

    return train_history # [list_train_loss, list_val_loss, list_train_perplexity, list_val_perplexity]

In [None]:
import os
import time
checkpoint_path = "/checkpoints"
if not os.path.isdir(checkpoint_path):
  os.makedirs(checkpoint_path)
train_history = train_LSTM(model=lstm_model, optimizer=optimizer, EPOCHS=EPOCHS, train_dataset=train_loader, val_dataset=val_loader, checkpoint_path=checkpoint_path)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          642976    
                                                                 
 lstm (LSTM)                 (None, None, 64)          24832     
                                                                 
 dropout (Dropout)           (None, None, 64)          0         
                                                                 
 dense (Dense)               (None, None, 20093)       1306045   
                                                                 
Total params: 1,973,853
Trainable params: 1,973,853
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10


InvalidArgumentError: ignored
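The parameter counts in the model summary above can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch assumes the standard Keras LSTM parameterization (four gates, each with input weights, recurrent weights and a bias) and the sizes used in this notebook.

```python
vocab_size, emb_size, rnn_units = 20093, 32, 64

embedding_params = vocab_size * emb_size                            # one vector per token
lstm_params = 4 * (rnn_units * (emb_size + rnn_units) + rnn_units)  # 4 gates: W, U and bias
dense_params = rnn_units * vocab_size + vocab_size                  # weights + bias over the vocab
total_params = embedding_params + lstm_params + dense_params

print(embedding_params, lstm_params, dense_params, total_params)
```

Most parameters sit in the embedding and the output projection, both of which scale with the vocab size.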

**Exercise 7**:
Create a function that generates text at inference time with the trained LSTM. 
This function uses either: 
* greedy decoding, using `tf.math.argmax`
* sampling with temperature, using `tf.random.categorical`

In [None]:
def generate_text(lstm, inputs, seq_len=10,
                            decoding="sampling", temp=1):
  # Loop over number of decoding timesteps: (equal to seq_len)

      # pass forward on the lstm on inputs

      # get the last prediction (logits)

      # if decoding = sampling 
        # divide logits by temperature 
        # sample a word

        # if decoding == "greedy"
        # find the greedy word (argmax)

      # compute the inputs of the next timestep by concatenating inputs and the predicted token using tf.concat

  # return the final inputs (complete sequence of word ids)
  return inputs
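The decoding loop described in the comments above can be sketched without TensorFlow to make the control flow explicit. This is not the exercise solution: `pick_next_token`, `generate` and `toy_model` are hypothetical names, the "model" is a stand-in function returning logits, and in the notebook `tf.math.argmax`, `tf.random.categorical` and `tf.concat` would play the roles of `max`, `random.choices` and `list.append`.

```python
import math
import random

def pick_next_token(logits, decoding="sampling", temp=1.0, rng=random):
    """Choose the next token id from a list of unnormalised logits."""
    if decoding == "greedy":
        return max(range(len(logits)), key=lambda i: logits[i])  # argmax
    scaled = [l / temp for l in logits]                          # temperature scaling
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]                  # unnormalised softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(next_logits_fn, inputs, seq_len=10, decoding="sampling", temp=1.0, rng=random):
    """Extend `inputs` by `seq_len` tokens, one decoding step at a time."""
    seq = list(inputs)
    for _ in range(seq_len):
        logits = next_logits_fn(seq)   # logits for the last position only
        seq.append(pick_next_token(logits, decoding, temp, rng))
    return seq

def toy_model(seq):
    """Stand-in for the LSTM: strongly prefers token (last + 1) mod 5."""
    return [5.0 if i == (seq[-1] + 1) % 5 else 0.0 for i in range(5)]

greedy = generate(toy_model, [1], seq_len=4, decoding="greedy")
sampled = generate(toy_model, [1], seq_len=4, decoding="sampling", temp=0.5, rng=random.Random(0))
print(greedy, sampled)
```

A temperature below 1 sharpens the distribution towards the greedy choice, while a temperature above 1 flattens it and yields more varied samples.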

**Exercise 8**:
* Take one `inputs` example from the test dataset, generate text from it, and decode the result with the `decode` function

#### Measuring the BLEU score 
(between the true sentence and the generated sentence on the test dataset)  
Use `sentence_bleu` from NLTK: https://www.nltk.org/_modules/nltk/translate/bleu_score.html

In [None]:
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

In [None]:
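def BLEU_score(true_sentence, generated_sentence, split_str=False):
    # Both sentences are lists of tokens, or strings if split_str=True.
    if split_str:
        true_sentence = true_sentence.split(sep=' ')
        generated_sentence = generated_sentence.split(sep=' ')
    # sentence_bleu takes a list of reference token lists and a single hypothesis:
    # the true sentence is the reference, the generated sentence is the hypothesis.
    score = sentence_bleu(references=[true_sentence], hypothesis=generated_sentence, smoothing_function=SmoothingFunction().method2)
    return score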
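BLEU is built on clipped n-gram precision: the fraction of n-grams in the hypothesis that also appear in the reference, where each n-gram counts at most as often as it occurs in the reference. The sketch below implements only that building block for a single reference, not the full BLEU score with brevity penalty and smoothing; the helper names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of `hypothesis` against a single `reference`."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    if not hyp_counts:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

reference = "the cat sat on the mat".split()
hypothesis = "the the the cat".split()
p1 = modified_precision(reference, hypothesis, 1)
p2 = modified_precision(reference, hypothesis, 2)
p1_swapped = modified_precision(hypothesis, reference, 1)
print(p1, p2, p1_swapped)
```

The swapped call gives a different value, which shows that the score is not symmetric: which sentence is passed as the reference and which as the hypothesis matters.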

**Exercise 9**: Create a function that: 
* Loops over the test set 
* Generates text for each input of the test set
* Decodes it using the `decode` function 
* Evaluates the BLEU score between the true decoded sentence (from the test set) and the decoded generated sentence 
* Computes the average BLEU score on the test set. 
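The overall shape of such an evaluation loop can be sketched independently of the model. Everything here is a hypothetical stand-in: `average_score` takes the generation, decoding and scoring steps as functions, and the toy scorer is an exact-match check rather than BLEU.

```python
def average_score(pairs, generate_fn, decode_fn, score_fn):
    """Mean of score_fn(true_text, generated_text) over (input_ids, target_ids) pairs."""
    scores = []
    for input_ids, target_ids in pairs:
        generated_ids = generate_fn(input_ids)
        scores.append(score_fn(decode_fn(target_ids), decode_fn(generated_ids)))
    return sum(scores) / len(scores) if scores else 0.0

# Toy stand-ins for the real components
inv_vocab = {0: "a", 1: "b", 2: "c"}
decode_fn = lambda ids: " ".join(inv_vocab[i] for i in ids)
generate_fn = lambda ids: [0, 1]                    # always generates "a b"
score_fn = lambda true, gen: float(true == gen)     # exact match instead of BLEU

pairs = [([2], [0, 1]), ([2], [0, 2])]
avg = average_score(pairs, generate_fn, decode_fn, score_fn)
print(avg)
```

In the notebook, the three callables would wrap `generate_text`, `decode` and `BLEU_score`, with the `(inputs, targets)` lists from `get_test_dataloader` zipped into pairs.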