# Language Modelling with RNNs
For this notebook, you do not need to write any code. Instead, follow along with the notebook to familiarize yourself with using Keras to build RNNs.

## Setup
Just as in the other notebooks, we will begin with importing the needed modules and reading in the training data. We'll also be borrowing some code from the other notebook to remove infrequent words and cut down our vocabulary size.

If you do not have Keras or TensorFlow, you can install them from the terminal by following these instructions.

[Install Keras](https://keras.io/#installation)

[Install TensorFlow](https://www.tensorflow.org/install/install_mac)

In [None]:
from keras.callbacks import LambdaCallback
from keras.layers import Dense, SimpleRNN, Activation
from keras.models import Sequential
from keras.optimizers import RMSprop
from keras.utils import to_categorical

import numpy as np
import random
import re
import string

# Reading in the training data; we'll be taking a smaller set to reduce training time
with open("headlines.train", 'r') as f:
    headlines_train = f.readlines()[:100000]

# Removing excess punctuation and newline
regex = re.compile('[%s]' % re.escape(string.punctuation))
headlines_train = [regex.sub('', h.split("\n")[0]) for h in headlines_train]

# Define the unk, start and stop tokens
UNK_TOKEN = "<UNK>"
START_TOKEN = "<START>"
STOP_TOKEN = "<STOP>"

# We'll be borrowing some code from the other notebook to trim down the vocabulary a bit
def count_unigrams(text, unigram_dict):
    """
    :param text: A headline, consisting of a string of words
    :param unigram_dict: A dictionary containing unigrams as keys and their respective counts as values
    """
    tokens = [START_TOKEN] + text.split(" ") + [STOP_TOKEN]
    for i in range(len(tokens)):
        unigram = tokens[i]
        if unigram not in unigram_dict:
            unigram_dict[unigram] = 1
        else:
            unigram_dict[unigram] += 1

min_freq = 3 # Words that occur min_freq times or fewer are replaced with UNK

# The following are used to keep track of and remove infrequent words
low_freq = set()
all_words = {}

def replace_text_train(text):
    return " ".join([UNK_TOKEN if t in low_freq else t for t in text.split()])

# Finding all words with low frequency
for h in headlines_train:
    count_unigrams(h, all_words)
for word, count in all_words.items():
    if count <= min_freq:
        low_freq.add(word)
# Replacing low frequency words from training dataset with UNK
headlines_train_clean = [replace_text_train(h) for h in headlines_train]

# Build vocabulary and make a mapping from index to word for generation
vocab = set([item for sublist in map(lambda x: x.split(" "), headlines_train_clean) for item in sublist])
vocab.add(STOP_TOKEN)
vocab_list = list(vocab)
word_to_index = {vocab_list[i]: i for i in range(len(vocab_list))}
index_to_word = {v: k for k, v in word_to_index.items()}
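
To make the vocabulary trimming concrete, here is a minimal sketch on three made-up headlines. The names `toy_headlines` and `toy_min_freq` are hypothetical and exist only for this illustration; the sketch mirrors the `count <= min_freq` rule used above but skips the start and stop tokens for brevity.

```python
from collections import Counter

# Made-up headlines, for illustration only
toy_headlines = ["stocks rise again", "stocks fall again", "rain expected"]
toy_min_freq = 1  # hypothetical threshold: words seen this often or less become <UNK>

# Count how often each word occurs across all headlines
counts = Counter(w for h in toy_headlines for w in h.split(" "))
rare = {w for w, c in counts.items() if c <= toy_min_freq}

# Replace rare words with the unknown token
toy_clean = [" ".join("<UNK>" if w in rare else w for w in h.split(" "))
             for h in toy_headlines]
print(toy_clean)

# Map each remaining word to an integer index and back again
toy_vocab = sorted({w for h in toy_clean for w in h.split(" ")})
toy_word_to_index = {w: i for i, w in enumerate(toy_vocab)}
toy_index_to_word = {i: w for w, i in toy_word_to_index.items()}
print(toy_word_to_index)
```

Only `stocks` and `again` appear more than once here, so every other word collapses into the single `<UNK>` entry, which is how the real vocabulary shrinks.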

For our RNN, we will first convert our text into GloVe word embeddings before giving it as input. We'll also need to define some parameters which will be used in our model.

In [None]:
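# Read in the GloVe embeddings and save them as a dictionary of float vectors
with open("glove_embeddings.txt", 'r') as f:
    # Strip the trailing newline so the last component parses cleanly
    gloves = [t.rstrip().split(" ") for t in f.readlines()]
    # Convert the components from strings to floats so they can be fed to the model
    gloves_dict = {t[0]: np.array(t[1:], dtype='float32') for t in gloves}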
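
As a small illustration of the file layout the cell above expects (one word per line, followed by its space-separated vector components), the sketch below parses two invented lines. The words and numbers are made up, and plain Python lists stand in for NumPy arrays to keep the sketch dependency-free.

```python
import io

# Two made-up lines: a word followed by its vector components
toy_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -0.7\n")

toy_gloves = {}
for line in toy_file:
    parts = line.rstrip().split(" ")
    # parts[0] is the word; the rest are vector components stored as text
    toy_gloves[parts[0]] = [float(x) for x in parts[1:]]

print(toy_gloves["cat"])
print(len(toy_gloves["the"]))  # the embedding dimension of this toy file
```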

In [None]:
# Parameters used for the batch generator and model
vocab_size = len(index_to_word.keys())
sent_len = max([len(h.split(" ")) for h in headlines_train_clean]) + 1
glove_dim = next(iter(gloves_dict.values())).size

## Creating Data Batches
First we'll need to turn the headlines into data samples (where each sample is an output word given the entire history of previous words in the headline). To do this, we will iterate through all the headlines and through each word within the headline to get (history, word) pairs as our inputs and labels.

In [None]:
data = []
for h in headlines_train_clean:
    # Pad the text in the beginning with start tokens
    text = [START_TOKEN for _ in range(sent_len)] + h.split(" ") + [STOP_TOKEN]
    for i in range(len(text) - sent_len):
        data.append((text[i:i+sent_len], text[i+sent_len]))
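
To see what these (history, word) pairs look like, the following sketch applies the same windowing to a single made-up headline. The names `toy_sent_len`, `toy_words`, and `toy_data` are hypothetical, and the window is kept at three words so the output stays readable.

```python
toy_sent_len = 3  # hypothetical window size; the notebook uses sent_len
toy_words = "markets rally today".split(" ")

# Pad the front with start tokens so the first real word has a full history
toy_text = ["<START>"] * toy_sent_len + toy_words + ["<STOP>"]

toy_data = []
for i in range(len(toy_text) - toy_sent_len):
    toy_data.append((toy_text[i:i + toy_sent_len], toy_text[i + toy_sent_len]))

for history, word in toy_data:
    print(history, "->", word)
```

Each headline contributes one sample per word plus one for the stop token, so this three-word headline yields four samples.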

Keras allows batches of data to be fed into the RNN through a generator, so we'll make such a generator to process the data and package it nicely for the model to use during the training steps.

In [None]:
# Parameters for the data generator and model
batch_size = 512
num_batches = -(-len(data) // batch_size)

def sample_generator():
    while True:
        # Reshuffle at the start of every pass over the data
        random.shuffle(data)
        for i in range(num_batches):
            # Slice out this batch; the last one may hold fewer than batch_size samples
            batch = data[i*batch_size:(i+1)*batch_size]
            batch_input = np.zeros((len(batch), sent_len, glove_dim))
            batch_label = np.zeros((len(batch), vocab_size))
            for j, (history, word) in enumerate(batch):
                for k in range(len(history)):
                    # Words without a GloVe embedding keep an all-zero vector
                    if history[k] in gloves_dict:
                        batch_input[j,k,:] = gloves_dict[history[k]]
                batch_label[j,word_to_index[word]] = 1
            yield batch_input, batch_label
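
The expression `-(-len(data) // batch_size)` is a ceiling division, which is what guarantees that every sample lands in some batch. A minimal sketch with hypothetical stand-in values (`toy_items`, `toy_batch_size`) shows the effect:

```python
toy_items = list(range(10))  # stand-in for the list of (history, word) samples
toy_batch_size = 4

# -(-a // b) rounds up: floor division of the negated count, negated again
toy_num_batches = -(-len(toy_items) // toy_batch_size)

toy_batches = [toy_items[i * toy_batch_size:(i + 1) * toy_batch_size]
               for i in range(toy_num_batches)]
print(toy_num_batches)
print([len(b) for b in toy_batches])
```

When the sample count is not a multiple of the batch size, the final batch simply holds the leftover samples.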

## Building the model
Keras is a high-level machine learning library that greatly simplifies building and training neural networks. It handles the forward and backward passes, as well as other implementation details, for you; you just need to declare what kind of layers you would like to add. To read the API and view some tutorials on how to use Keras, visit https://keras.io/.

In [None]:
hidden_neurons = 128
model = Sequential()
model.add(SimpleRNN(hidden_neurons, input_shape=(sent_len, glove_dim)))
model.add(Dense(vocab_size, activation='softmax'))
model.summary()  # summary() prints the table itself and returns None
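
The parameter counts reported in the model summary can be checked by hand. The sketch below uses the standard counts for a simple recurrent layer and a dense layer; `toy_glove_dim` and `toy_vocab_size` are assumed values chosen for illustration, so the numbers will differ from your actual summary.

```python
def simple_rnn_params(input_dim, units):
    # input-to-hidden weights + hidden-to-hidden weights + one bias per unit
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    # one weight per (input, output) pair + one bias per output
    return input_dim * units + units

toy_glove_dim = 50      # assumed embedding size
toy_vocab_size = 10000  # assumed vocabulary size
hidden = 128            # matches hidden_neurons above

rnn_count = simple_rnn_params(toy_glove_dim, hidden)
dense_count = dense_params(hidden, toy_vocab_size)
print(rnn_count, dense_count, rnn_count + dense_count)
```

With these assumed sizes, almost all of the parameters sit in the output layer, since it needs one column of weights per vocabulary word.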

The next block of code defines a few functions to generate sentences from the RNN. The `sample` and `generate_headline` functions behave similarly to the `sample_word` function that you implemented and the `generate_headline` function provided for you in the other notebook.

In [None]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def generate_headline():
    end_sentence = False
    sent = np.zeros((sent_len, glove_dim))
    
    generated = []
    curr_len = 0
    while not end_sentence:
        sent_input = np.expand_dims(sent[-sent_len:sent.shape[0]], axis=0)
        word_probs = model.predict(sent_input, verbose=0)
        next_word = sample(np.squeeze(word_probs, axis=0))
        if next_word == word_to_index[STOP_TOKEN] or curr_len == sent_len:
            end_sentence = True
            print(' '.join(generated))
        else:
            if index_to_word[next_word] in gloves_dict:
                word_embedded = gloves_dict[index_to_word[next_word]]
            else:
                word_embedded = np.zeros(glove_dim)
            sent = np.concatenate((sent, np.expand_dims(word_embedded, axis=0)), axis=0)
            generated.append(index_to_word[next_word])
            curr_len += 1
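
The `temperature` argument of `sample` controls how adventurous the sampling is. The sketch below applies the same rescaling to a made-up distribution, using only the standard library; `apply_temperature` and `toy_probs` are hypothetical names introduced for this illustration.

```python
import math

def apply_temperature(probs, temperature):
    # Same rescaling as in `sample`: divide log-probabilities by the
    # temperature, exponentiate, then renormalize so they sum to 1
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

toy_probs = [0.7, 0.2, 0.1]  # made-up distribution
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(toy_probs, t)])
```

A temperature below 1 concentrates probability on the most likely word, a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it toward uniform.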

Before we begin training, let's first look at what kind of headlines an untrained RNN generates.

In [None]:
for _ in range(5):
    generate_headline()

## RNN Training
Now that we've constructed our RNN, we can begin training. Be aware that this may take upwards of half an hour to train!

In [None]:
def on_epoch_end(epoch, logs):
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    for i in range(3):
        generate_headline()
        print()

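# Note: older Keras releases spell these `lr=` and `model.fit_generator(...)`
optimizer = RMSprop(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
# Train for 3 epochs, each consisting of num_batches batches from the generator
model.fit(sample_generator(), steps_per_epoch=num_batches, epochs=3,
          callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])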
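
As a rough reference point for the loss that Keras reports during training, the sketch below computes the categorical cross-entropy of a model that spreads its probability evenly over the vocabulary. The value `toy_vocab_size` is an assumption for illustration, and `cross_entropy` is a hypothetical helper, not a Keras function.

```python
import math

def cross_entropy(predicted_probs, true_index):
    # With a one-hot label, only the probability of the true word contributes
    return -math.log(predicted_probs[true_index])

toy_vocab_size = 10000  # assumed vocabulary size
uniform = [1.0 / toy_vocab_size] * toy_vocab_size

# A model that spreads probability evenly scores log(vocab size)
baseline = cross_entropy(uniform, true_index=0)
print(round(baseline, 2))
```

A training loss well below this uniform baseline suggests the model has picked up at least some structure in the headlines.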

We've finally finished training our RNN! Let's see what kind of headlines we can generate now. 

Run the cell below and list the three generated headlines in the RNN Language Model section of the `bigram_language_model.ipynb` notebook. You do not need to submit this notebook; submit only `bigram_language_model.ipynb` and its executed PDF.

In [None]:
for i in range(3):
    generate_headline()