# Introduction

<center><h3>**Welcome to the Language modeling Notebook.**</h3></center>

In this assignment, you are going to train a neural network to **generate news headlines**.
To reduce computational needs, we have restricted the dataset to headlines about technology and a handful of tech giants.
In this assignment you will:
- Learn to preprocess raw text so it can be fed into an LSTM.
- Use TensorFlow's LSTM library to train a language model that generates headlines.
- Use your network to generate headlines, and judge which headlines are likely or not.




**What is a language model?**

Language modeling is the task of assigning a probability to sentences in a language. Besides assigning a probability to each sequence of words, the language models also assigns a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words.
Source: page 105, __[Neural Network Methods in Natural Language Processing](https://www.amazon.com/Language-Processing-Synthesis-Lectures-Technologies/dp/1627052984/)__, 2017.

In neural-network terms, we are training a network to produce probabilities (a classification) over a fixed vocabulary of words.
Concretely, we are training a neural network to produce:
$$ P(w_{i+1} \mid w_1, w_2, w_3, \ldots, w_i), \quad \forall i \in \{1, \ldots, n\} $$
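
To make the factorization concrete, here is a minimal sketch that scores a sentence by multiplying next-word probabilities. The table `toy_probs` and its numbers are invented for illustration, and it uses a bigram simplification that conditions only on the previous word; a trained network would supply these values instead, conditioning on the full prefix.

```python
import math

# Hypothetical next-word probabilities P(next | previous); the values are made up.
toy_probs = {
    ("<START>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sleeps"): 0.1,
}

def sentence_log_prob(words, probs):
    """Sum log P(word | previous word) over the sentence, starting from <START>."""
    log_prob = 0.0
    previous = "<START>"
    for word in words:
        log_prob += math.log(probs[(previous, word)])
        previous = word
    return log_prob

log_p = sentence_log_prob(["the", "cat", "sleeps"], toy_probs)
print(math.exp(log_p))  # product of the three conditionals: 0.5 * 0.2 * 0.1
```

Working in log space, as above, avoids numerical underflow when sentences get long.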

**Why is language modeling important?**

Language modeling is a core problem in NLP.

Language models can be used on their own to produce new text that matches the distribution of the text they were trained on, or as the front end of a more sophisticated model to produce better results.

Recently, for example, the __[BERT](https://arxiv.org/abs/1810.04805)__ paper showed that pretraining a large neural network on a language modeling task can improve the state of the art on many NLP tasks.

**How good can the text generated by a language model be?**

If you have not seen the latest post by OpenAI, you should read some of the samples they generated from their language model __[here](https://blog.openai.com/better-language-models/#sample1)__.
Because of computational restrictions, we will not generate text of that quality, but the same algorithm is at the core; they simply use more data and compute.

# Library imports

Before starting, make sure you have all these libraries.

In [1]:
from segtok import tokenizer
from collections import Counter
import tensorflow as tf
import numpy as np
import json
import os

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

root_folder = ""

# Loading the datasets

Make sure the dataset files are all in the `dataset` folder of the assignment.

 - If you are using this notebook locally: You should run the `download_data.sh` script.
 - If you are using the Colab version of the notebook, make sure that your Google Drive is mounted, and verify from the file explorer in Colab that the files are visible within `/content/gdrive/CS182_HW03/dataset/`.
 


In [2]:
# This cell loads the data for the model
# Run this before working on loading any of the additional data

with open(root_folder+"dataset/headline_generation_dataset_processed.json", "r") as f:
    d_released = json.load(f)

with open(root_folder+"dataset/headline_generation_vocabulary.txt", "r") as f:
    vocabulary = f.read().split("\n")
w2i = {w: i for i, w in enumerate(vocabulary)} # Word to index
unkI, padI, start_index = w2i['UNK'], w2i['PAD'], w2i['<START>']
startI = start_index

vocab_size = len(vocabulary)
input_length = len(d_released[0]['numerized']) # The length of the first element in the dataset, they are all of the same length
d_train = [d for d in d_released if d['cut'] == 'training']
d_valid = [d for d in d_released if d['cut'] == 'validation']

print("Number of training samples:",len(d_train))
print("Number of validation samples:",len(d_valid))

Number of training samples: 88568
Number of validation samples: 946


Now that we have loaded the data, let's inspect one of the elements. Each sample in our dataset has a `numerized` vector that contains the preprocessed headline as a list of token indices. This vector is what we will feed into the neural network. The already loaded list `vocabulary` maps each token index to the actual string. Use these elements to recover the `title` key of entry 1001 in the training dataset.

**TODO**: Write the numerized2text function and inspect element 1001 in the training dataset (`entry = d_train[1001]`).



In [3]:
def numerized2text(numerized):
    """ Converts an integer sequence in the vocabulary into a string corresponding to the title.
    
        Arguments:
            numerized: List[int]  -- The list of vocabulary indices corresponding to the string
        Returns:
            title: str -- The string corresponding to the numerized input, without padding.
    """
    #####
    # BEGIN YOUR CODE HERE 
    # Recover each word from the vocabulary in the list of indices in numerized, using the vocabulary variable
    # Hint: Use the string.join() function to reconstruct a single string
    #####
    
    words = [vocabulary[i] for i in numerized]
    converted_string = ' '.join(words)
    
    #####
    # END YOUR CODE HERE
    #####
    
    return converted_string

entry = d_train[1001]
print("Reversing the numerized: "+numerized2text(entry['numerized']))
print("From the `title` entry: "+ entry['title'])

Reversing the numerized: microsoft donates cloud computing ' worth $ 1 bn ' PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
From the `title` entry: Microsoft donates cloud computing 'worth $1 bn'
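
The recovered string above still carries the trailing `PAD` tokens. If you want the title without padding, one option is to skip the padding index while mapping back to words. The sketch below uses a small stand-in vocabulary (`toy_vocabulary`, `toy_pad_index` and the function name are hypothetical); with the notebook's data you would pass `vocabulary` and `padI` instead.

```python
# Toy stand-ins for the notebook's `vocabulary` list; the real one is loaded from disk.
toy_vocabulary = ["PAD", "UNK", "<START>", "microsoft", "donates", "cloud"]
toy_pad_index = toy_vocabulary.index("PAD")

def numerized_to_text_no_pad(numerized, vocab, pad_index):
    """Map indices back to words, skipping every padding index."""
    words = [vocab[i] for i in numerized if i != pad_index]
    return " ".join(words)

text = numerized_to_text_no_pad([3, 4, 5, 0, 0, 0], toy_vocabulary, toy_pad_index)
print(text)  # microsoft donates cloud
```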


In language modeling, we train a model to produce the next word in the sequence given all previously generated words. In practice, this takes two steps:


1. Adding a special `<START>` token to the start of the sequence for the input. This "shifts" the input to the right by one. We call this the "source" sequence.
2. Making the network predict the original, unshifted version (we call this the "target" sequence).

    
Let's take an example. Say we want to train the network on the sentence: "The cat is great."
The input to the network will be "`<START>` The cat is great." The target will be "The cat is great."
    
Therefore the first prediction is to select the word "The" given the `<START>` token.
The second prediction is to produce the word "cat" given the two tokens "`<START>` The".
At each step, the network learns to predict the next word, given all previous ones.
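
The shift described above can be sketched on a toy sequence. The index values here are hypothetical (2 for `<START>`, 0 for `PAD`); in the notebook they come from `w2i`.

```python
# Hypothetical indices: 2 stands for <START>, 0 for PAD.
toy_start, toy_pad = 2, 0
target = [7, 8, 9, 10, toy_pad, toy_pad]   # "The cat is great" followed by padding

# Prepend <START>, then drop the last token so source and target have equal length.
source = ([toy_start] + target)[:-1]

# At step t the network sees source[:t + 1] and should predict target[t].
for step in range(len(target)):
    print(source[: step + 1], "->", target[step])
```

Because the sequences are padded, the token dropped from the end of the source is usually a `PAD`, so no real word is lost.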
    
---

Your next step is to write the build_batch function. Given a dataset, we select a random subset of samples and build the "inputs" and the "targets" of the batch, following the procedure described above.

**TODO**: Write the build_batch function. We give you the structure, and you have to fill in the places we have left as `None`.


In [4]:
def build_batch(dataset, batch_size):
    """ Builds a batch of source and target elements from the dataset.
    
        Arguments:
            dataset: List[db_element] -- A list of dataset elements
            batch_size: int -- The size of the batch that should be created
        Returns:
            batch_input: List[List[int]] -- List of source sequences
            batch_target: List[List[int]] -- List of target sequences
            batch_target_mask: List[List[int]] -- List of target batch masks
    """
    
    #####
    # BEGIN YOUR CODE HERE 
    #####
    
    
    # We get a list of indices we will choose from the dataset.
    # The randint function uses a uniform distribution, giving equal probability to any entry
    # for each batch (entries are drawn with replacement)
    indices = list(np.random.randint(0, len(dataset), size=batch_size))
    
    # Recover what the entries for the batch are
    batch = [dataset[i] for i in indices]
    
    # Get the raw numerized for this input, each element of the dataset has a 'numerized' key
    batch_numerized = [e["numerized"] for e in batch]

    # Create an array of start_index that will be concatenated at position 0 for the input.
    # Should be of shape (batch_size, 1)
    start_tokens = [[start_index] for _ in range(batch_size)]

    # Concatenate the start_tokens with the rest of the input
    # The np.concatenate function should be useful
    # The output should now be [batch_size, sequence_length+1]
    batch_input = [start_tokens[i] + e for i, e in enumerate(batch_numerized)]

    # Remove the last word from each element in the batch
    # To restore the [batch_size, sequence_length] size
    batch_input = [e[:-1] for e in batch_input]
    
    # The target should be the un-shifted numerized input
    batch_target = batch_numerized

    # The target-mask is a 0 or 1 filter to note which tokens are
    # padding or not, to give the loss, so the model doesn't get rewarded for
    # predicting PAD tokens.
    batch_target_mask = np.array([a['mask'] for a in batch])
    
    #####
    # END YOUR CODE HERE 
    #####
        
    return batch_input, batch_target, batch_target_mask
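
The function above reads the mask from each entry's `mask` field; how that field was built is not shown in this notebook. As a rough sketch, assuming the mask is simply 1 for real tokens and 0 for padding, it could be derived from the numerized sequence like this (`build_mask` and `toy_pad_index` are hypothetical names):

```python
def build_mask(numerized, pad_index):
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if token == pad_index else 1 for token in numerized]

toy_pad_index = 0  # hypothetical; the notebook uses w2i['PAD']
mask = build_mask([12, 5, 9, 0, 0], toy_pad_index)
print(mask)  # [1, 1, 1, 0, 0]
```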

# Creating the language model

Now that we've written the data pipeline, we are ready to write the neural network.

The steps to setting up a neural network to do Language modeling are:
- Creating the placeholders for the model, where we can feed in our inputs and targets.
- Creating an RNN of our choice, size, and with optional parameters
- Using the RNN on our placeholder inputs.
- Getting the output from the RNN, and projecting it into a vocabulary sized dimension, so that we can make word predictions.
- Setting up the loss on the outputs so that the network learns to produce the correct words.
- Finally, choosing an optimizer, and defining a training operation: using the optimizer to minimize the loss.
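
The shape bookkeeping behind these steps can be traced with a small NumPy sketch. All sizes are toy values chosen for illustration, and `np.tanh` is only a stand-in for the RNN (it keeps the shape but has no recurrence); the point is how `[batch, time]` indices become `[batch, time, vocab]` logits.

```python
import numpy as np

# Toy sizes, chosen only for illustration.
batch_size, seq_len, vocab, hidden = 2, 5, 11, 4
rng = np.random.default_rng(0)

token_ids = rng.integers(0, vocab, size=(batch_size, seq_len))  # [batch, time]
embedding = rng.normal(size=(vocab, hidden))                     # [vocab, hidden]
embedded = embedding[token_ids]                                  # [batch, time, hidden]

# Stand-in for the RNN: any map that keeps the [batch, time, hidden] shape.
rnn_outputs = np.tanh(embedded)

projection = rng.normal(size=(hidden, vocab))                    # [hidden, vocab]
logits = rnn_outputs @ projection                                # [batch, time, vocab]

print(embedded.shape, logits.shape)
```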

We provide skeleton code for the model; you fill in the `None` sections. If you are unfamiliar with TensorFlow, the comments suggest which functions to look for, and you should consult the TensorFlow online documentation.

**TODO**: Replace the `None` variables with their respective code elements in the LanguageModel Class


In [5]:
# Using a basic RNN/LSTM for Language modeling
class LanguageModel():
    def __init__(self, input_length, vocab_size, rnn_size, learning_rate=1e-4, lm_cell_num=1, dropout=False, lr_schedule=False):
        
        # Create the placeholders for the inputs:
        # All three placeholders should be of size [None, input_length]
        # Where None represents a variable batch_size, and input_length is the
        # maximal length of a sequence of words, after being padded.
        self.input_num = tf.placeholder(tf.int32, shape=[None, input_length])
        self.targets = tf.placeholder(tf.int32, shape=[None, input_length])
        self.targets_mask = tf.placeholder(tf.int32, shape=[None, input_length])
        self.learning_rate = tf.placeholder(tf.float32, shape=[]) if lr_schedule else learning_rate

        # Create an embedding variable of shape [vocab_size, rnn_size]
        # That will map each word in our vocab into a vector of rnn_size size.
        embedding = tf.get_variable("embedding", shape=[vocab_size, rnn_size])
        # Use the tensorflow embedding_lookup function
        # To embed the input_num, using the embedding variable we've created
        input_emb = tf.nn.embedding_lookup(embedding, self.input_num)

        # Create an RNN or LSTM cell of rnn_size size.
        # Look into the tf.nn.rnn_cell documentation
        # You can optionally use Tensorflow Add-ons such as the MultiRNNCell, or the DropoutWrapper
        cells = [tf.nn.rnn_cell.LSTMCell(rnn_size) for _ in range(lm_cell_num)]
        if dropout:
            cells = [tf.nn.rnn_cell.DropoutWrapper(cell,
                                                   input_keep_prob=0.8,
                                                   output_keep_prob=0.8,
                                                   state_keep_prob=0.8,
                                                  ) for cell in cells
                    ]
        lm_cell = tf.nn.rnn_cell.MultiRNNCell(cells)    # Stack of cells as a single lm_cell
        
        # Use the dynamic_rnn function of Tensorflow to run the embedded inputs
        # using the lm_cell you've created, and obtain the outputs of the RNN cell.
        # You have created a cell, which represents a single block (column) of the RNN.
        # dynamic_rnn will "copy" the cell for each element in your sequence, runs the input you provide through the cell,
        # and returns the outputs and the states of the cell.
        outputs, states = tf.nn.dynamic_rnn(lm_cell, input_emb, dtype=tf.float32)

        # Use a dense layer to project the outputs of the RNN cell into the size of the
        # vocabulary (vocab_size).
        # output_logits should be of shape [None,input_length,vocab_size]
        # You can look at the tf.layers.dense function
        self.output_logits = tf.layers.dense(outputs, vocab_size)

        # Setup the loss: using the sparse_softmax_cross_entropy.
        # The logits are the output_logits we've computed.
        # The targets are the gold labels we are trying to match
        # Don't forget to use the targets_mask we have, so your loss is not off,
        # And your model doesn't get rewarded for predicting PAD tokens
        # You might have to cast the masks into float32. Look at the tf.cast function.
        self.loss = tf.losses.sparse_softmax_cross_entropy(self.targets,
                                                           self.output_logits,
                                                           weights=self.targets_mask
                                                          )

        # Setup an optimizer (SGD, RMSProp, Adam), you can find a list under tf.train.*
        # And provide it with a start learning rate.

        optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)

        # We create a train_op that requires the optimizer we've created to minimize the
        # loss we've defined.
        # Look for the optimizer.minimize function and define what should be minimized.
        # You can also pass it an optional global_step parameter, which keeps track of how many
        # optimization steps have been run.
        
        self.global_step = tf.train.get_or_create_global_step()
        self.train_op = optimizer.minimize(self.loss, global_step=self.global_step)
        self.saver = tf.train.Saver()
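
To see what the masked loss computes, here is a plain-Python sketch of cross-entropy averaged over non-padding positions. It assumes the loss is reduced by averaging over the positions with non-zero weight; check the `reduction` argument in the TensorFlow documentation if you need the exact behavior. The function name and toy values are illustrative only.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Average negative log-likelihood over positions where mask == 1.

    logits:  list of per-position score lists (one score per vocabulary entry)
    targets: list of gold indices, one per position
    mask:    list of 0/1 flags, 0 marking padding
    """
    total, count = 0.0, 0
    for scores, gold, keep in zip(logits, targets, mask):
        if not keep:
            continue
        # Numerically stable log-softmax: subtract the max before exponentiating.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[gold]
        count += 1
    return total / max(count, 1)

# Three positions over a 3-word toy vocabulary; the last position is padding.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [9.0, 9.0, 9.0]]
loss = masked_cross_entropy(toy_logits, targets=[0, 1, 2], mask=[1, 1, 0])
print(loss)
```

Changing the logits at the padded position leaves the loss unchanged, which is exactly why the model is not rewarded for predicting `PAD`.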

Once you have written the model class, instantiate it. The line `tf.reset_default_graph()` resets the graph for the Jupyter notebook, so multiple models aren't floating around. If you have trouble with redefinition of variables, it may be worth re-running the cell below.

In [10]:
# We can create our model,
# with parameters of our choosing.

tf.reset_default_graph() # This is so that when you debug, you reset the graph each time you run this, in essence, cleaning the board
model = LanguageModel(input_length=input_length, vocab_size=vocab_size, rnn_size=256, learning_rate=1e-3)

# Training the model

Your objective is to train the language model on the provided dataset to reach a **validation loss <= 5.50**.

**TODO**: Train your model so that it achieves a validation loss of <= 5.5. 

**Careful**: we will be testing this loss on an unreleased test set, so make sure to evaluate properly on a validation set and not overfit. You must save the model you want us to test under: models/final_language_model (the .index, .meta and .data files)
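
To get a feel for these thresholds, it can help to convert the loss into perplexity. The sketch below assumes the reported loss is an average per-token cross-entropy in nats. The vocabulary size of 10,000 is a hypothetical value, not one printed in this notebook; it is used only because a model that spreads probability uniformly over that many words would score about 9.21, close to the loss the untrained model reports later in this notebook.

```python
import math

def perplexity(cross_entropy_nats):
    """Convert an average per-token cross-entropy (natural log) to perplexity."""
    return math.exp(cross_entropy_nats)

def uniform_loss(num_classes):
    """Cross-entropy of a model that spreads probability evenly over all classes."""
    return math.log(num_classes)

print(perplexity(5.5))        # perplexity at the required validation loss
print(perplexity(5.0))        # perplexity at the extra-credit threshold
print(uniform_loss(10000))    # 10000 is a hypothetical vocabulary size
```

Read this way, a loss of 5.5 means the model is, on average, about as uncertain as if it were choosing uniformly among a few hundred words at each step.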

**Advice**:
- It should be possible to attain loss <= 5.50 with a 1-layer LSTM of size 256 or less.
- You should not need more than 10 epochs to attain the threshold. More passes over the data can however give you a better model.
- You can however try using:
    - LSTM dropout (Tensorflow has a layer for that)
    - Multi-layer RNN cell (Tensorflow has a layer for that)
    - Change your optimizers, tune your learning_rate, use a learning rate schedule.
    
**Extra credit**:

Get the **validation loss <= 5.00** and earn 5 points of extra credit on this assignment. Get creative, but remember: what you do should work on our held-out test set to get the points.
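
One simple form of learning rate schedule is a piecewise-constant step decay. The sketch below is one possible choice, not a recommended setting: the boundaries and factors are assumptions. With `lr_schedule=True`, the value it returns could be fed to the model's `learning_rate` placeholder at each training step.

```python
def step_decay(epoch, initial_rate=1e-3, boundaries=(10, 15), factors=(1.0, 0.5, 0.2)):
    """Piecewise-constant learning rate: one factor per interval between boundaries."""
    for boundary, factor in zip(boundaries, factors):
        if epoch < boundary:
            return initial_rate * factor
    # Past the last boundary, use the final factor.
    return initial_rate * factors[-1]

for epoch in (0, 9, 10, 14, 15, 19):
    print(epoch, step_decay(epoch))
```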

In [11]:
# Skeleton code
# You have to write your own training process to obtain a
# Good performing model on the validation set, and save it.

experiment = root_folder+"models/magic_model"

with tf.Session() as sess:
    # Here is how you initialize weights of the model according to their
    # Initialization parameters.
    sess.run(tf.global_variables_initializer())
    
    # Here is how you obtain a batch:
    batch_size = 16
    batch_input, batch_target, batch_target_mask = build_batch(d_train, batch_size)
    # Map the values to each tensor in a `feed_dict`
    feed = {model.input_num: batch_input, model.targets: batch_target, model.targets_mask: batch_target_mask}

    # Obtain a single value of the loss for that batch.
    # !IMPORTANT! Don't forget to include the train_op when using a batch from the training dataset
    # (d_train)
    # !MORE IMPORTANT! Don't use the train_op if you evaluate the loss on the validation set,
    # Otherwise, your network will overfit on your validation dataset.
    
    step, train_loss, _ = sess.run([model.global_step, model.loss, model.train_op], feed_dict=feed)
    
    # Here is how you save the model weights
    model.saver.save(sess, experiment)
    
    # Here is how you restore the weights previously saved
    model.saver.restore(sess, experiment)

INFO:tensorflow:Restoring parameters from models/magic_model


In [18]:
# Create our model
tf.reset_default_graph() # This is so that when you debug, you reset the graph each time you run this, in essence, cleaning the board
model = LanguageModel(input_length=input_length, vocab_size=vocab_size, rnn_size=256, learning_rate=0.5e-3, lm_cell_num=2)

# Path to model
experiment_1 = root_folder+"models/language_model_1"

assert 0 == 1, "STOP! You are about to re-train the model. Comment out this line if you want to"

with tf.Session() as sess:
    # Initialize weights of the model according to their Initialization parameters.
    sess.run(tf.global_variables_initializer())
    
    # Set training parameters
    batch_size = 64
    epoch_num = 20
    iter_per_epoch = len(d_train) // batch_size + 1
    
    # Evaluate model before training
    eval_input, eval_target, eval_target_mask = build_batch(d_valid, 500)
    feed = {model.input_num: eval_input, model.targets: eval_target, model.targets_mask: eval_target_mask}
    validate_loss = sess.run([model.loss], feed_dict=feed)[0]
    print("Before traning. Validation loss: {}".format(validate_loss))
    print("Start training ...")
    print()
    
    for epoch in range(epoch_num):
        print("---- Epoch {} / {} ----".format(epoch + 1, epoch_num))
        
        for it in range(iter_per_epoch):
            # Obtain a batch:
            batch_input, batch_target, batch_target_mask = build_batch(d_train, batch_size)
            
            # Map the values to each tensor in a `feed_dict`
            feed = {model.input_num: batch_input, model.targets: batch_target, model.targets_mask: batch_target_mask}

            # Train
            step, train_loss, _ = sess.run([model.global_step, model.loss, model.train_op], feed_dict=feed)
            
            if (it + 1) % 20 == 0:
                print("\tIteration {}/{}. Train loss: {}".format(it + 1, iter_per_epoch, train_loss))
            
            if (it + 1) % 200 == 0 or (it + 1) == iter_per_epoch:
                eval_input, eval_target, eval_target_mask = build_batch(d_valid, 500)
                feed = {model.input_num: eval_input, model.targets: eval_target, model.targets_mask: eval_target_mask}
                validate_loss = sess.run([model.loss], feed_dict=feed)[0]
                print("\tValidation loss: {}".format(validate_loss))
                
        print()
    
    # Here is how you save the model weights
    model.saver.save(sess, experiment_1)
    
    # Restore the weights previously saved
    model.saver.restore(sess, experiment_1)
    
    # Validate our model
    print("Finished training. Validating model ...")
    eval_input, eval_target, eval_target_mask = build_batch(d_valid, 500)
    feed = {model.input_num: eval_input, model.targets: eval_target, model.targets_mask: eval_target_mask}
    validate_loss = sess.run([model.loss], feed_dict=feed)[0]
    print("Validation loss: {}".format(validate_loss))
    print()

Before traning. Validation loss: 9.210343360900879
Start training ...

---- Epoch 1 / 20 ----
	Iteration 20/1384. Train loss: 8.73129653930664
	Iteration 40/1384. Train loss: 7.525349140167236
	Iteration 60/1384. Train loss: 7.2386064529418945
	Iteration 80/1384. Train loss: 7.319820404052734
	Iteration 100/1384. Train loss: 7.251431465148926
	Iteration 120/1384. Train loss: 7.27501106262207
	Iteration 140/1384. Train loss: 6.928033351898193
	Iteration 160/1384. Train loss: 7.138749599456787
	Iteration 180/1384. Train loss: 6.99680757522583
	Iteration 200/1384. Train loss: 7.329260349273682
	Validation loss: 7.0617475509643555
	Iteration 220/1384. Train loss: 7.118978023529053
	Iteration 240/1384. Train loss: 6.841607093811035
	Iteration 260/1384. Train loss: 6.887425422668457
	Iteration 280/1384. Train loss: 6.749744892120361
	Iteration 300/1384. Train loss: 7.107308387756348
	Iteration 320/1384. Train loss: 6.886488437652588
	Iteration 340/1384. Train loss: 6.937403678894043
	[... output truncated ...]

	Validation loss: 6.347458362579346
	Iteration 220/1384. Train loss: 6.217479705810547
	Iteration 240/1384. Train loss: 6.3772125244140625
	Iteration 260/1384. Train loss: 6.2790846824646
	Iteration 280/1384. Train loss: 6.2849812507629395
	Iteration 300/1384. Train loss: 6.2419538497924805
	Iteration 320/1384. Train loss: 6.243693828582764
	Iteration 340/1384. Train loss: 6.369756698608398
	Iteration 360/1384. Train loss: 6.285290241241455
	Iteration 380/1384. Train loss: 6.3315887451171875
	Iteration 400/1384. Train loss: 6.257084369659424
	Validation loss: 6.277031898498535
	Iteration 420/1384. Train loss: 6.215978622436523
	Iteration 440/1384. Train loss: 6.1365580558776855
	Iteration 460/1384. Train loss: 6.026968479156494
	Iteration 480/1384. Train loss: 6.050325870513916
	Iteration 500/1384. Train loss: 6.345302581787109
	Iteration 520/1384. Train loss: 6.318019390106201
	Iteration 540/1384. Train loss: 6.131617069244385
	Iteration 560/1384. Train loss: 6.20449161529541
	[... output truncated ...]

	Iteration 420/1384. Train loss: 5.705564022064209
	Iteration 440/1384. Train loss: 5.501356601715088
	Iteration 460/1384. Train loss: 5.8056321144104
	Iteration 480/1384. Train loss: 5.708396911621094
	Iteration 500/1384. Train loss: 5.737717151641846
	Iteration 520/1384. Train loss: 5.6619086265563965
	Iteration 540/1384. Train loss: 5.440262317657471
	Iteration 560/1384. Train loss: 5.802973747253418
	Iteration 580/1384. Train loss: 5.674111366271973
	Iteration 600/1384. Train loss: 5.823302745819092
	Validation loss: 5.854322910308838
	Iteration 620/1384. Train loss: 5.82112979888916
	Iteration 640/1384. Train loss: 5.698856830596924
	Iteration 660/1384. Train loss: 5.8136115074157715
	Iteration 680/1384. Train loss: 5.687065124511719
	Iteration 700/1384. Train loss: 5.521208763122559
	Iteration 720/1384. Train loss: 5.6823225021362305
	Iteration 740/1384. Train loss: 5.804106712341309
	Iteration 760/1384. Train loss: 5.613536834716797
	Iteration 780/1384. Train loss: 5.69329881668

	Iteration 640/1384. Train loss: 5.273184299468994
	Iteration 660/1384. Train loss: 5.469831466674805
	Iteration 680/1384. Train loss: 5.486926078796387
	Iteration 700/1384. Train loss: 5.3761186599731445
	Iteration 720/1384. Train loss: 5.413712501525879
	Iteration 740/1384. Train loss: 5.560020446777344
	Iteration 760/1384. Train loss: 5.330338954925537
	Iteration 780/1384. Train loss: 5.4764485359191895
	Iteration 800/1384. Train loss: 5.402155876159668
	Validation loss: 5.60810661315918
	Iteration 820/1384. Train loss: 5.305099964141846
	Iteration 840/1384. Train loss: 5.3039374351501465
	Iteration 860/1384. Train loss: 5.335104942321777
	Iteration 880/1384. Train loss: 5.108603477478027
	Iteration 900/1384. Train loss: 5.367897987365723
	Iteration 920/1384. Train loss: 5.370744228363037
	Iteration 940/1384. Train loss: 5.433781147003174
	Iteration 960/1384. Train loss: 5.273314476013184
	Iteration 980/1384. Train loss: 5.254038333892822
	Iteration 1000/1384. Train loss: 5.21150302

	Iteration 860/1384. Train loss: 5.2305145263671875
	Iteration 880/1384. Train loss: 5.046619892120361
	Iteration 900/1384. Train loss: 5.0588860511779785
	Iteration 920/1384. Train loss: 5.009200096130371
	Iteration 940/1384. Train loss: 4.9908857345581055
	Iteration 960/1384. Train loss: 5.03995943069458
	Iteration 980/1384. Train loss: 5.11441707611084
	Iteration 1000/1384. Train loss: 5.016565799713135
	Validation loss: 5.562745571136475
	Iteration 1020/1384. Train loss: 5.159538745880127
	Iteration 1040/1384. Train loss: 4.880504131317139
	Iteration 1060/1384. Train loss: 5.212449550628662
	Iteration 1080/1384. Train loss: 5.113317489624023
	Iteration 1100/1384. Train loss: 5.140244007110596
	Iteration 1120/1384. Train loss: 5.095072269439697
	Iteration 1140/1384. Train loss: 5.219786643981934
	Iteration 1160/1384. Train loss: 5.277318477630615
	Iteration 1180/1384. Train loss: 5.114001274108887
	Iteration 1200/1384. Train loss: 5.106595993041992
	Validation loss: 5.45903253555297

	Iteration 1080/1384. Train loss: 4.81841516494751
	Iteration 1100/1384. Train loss: 4.973952770233154
	Iteration 1120/1384. Train loss: 4.856415271759033
	Iteration 1140/1384. Train loss: 4.985759735107422
	Iteration 1160/1384. Train loss: 4.866040229797363
	Iteration 1180/1384. Train loss: 4.923847675323486
	Iteration 1200/1384. Train loss: 4.704346179962158
	Validation loss: 5.412441253662109
	Iteration 1220/1384. Train loss: 4.850836277008057
	Iteration 1240/1384. Train loss: 5.053105354309082
	Iteration 1260/1384. Train loss: 4.888153076171875
	Iteration 1280/1384. Train loss: 4.7072367668151855
	Iteration 1300/1384. Train loss: 5.153251647949219
	Iteration 1320/1384. Train loss: 5.0537824630737305
	Iteration 1340/1384. Train loss: 4.940372943878174
	Iteration 1360/1384. Train loss: 4.934412479400635
	Iteration 1380/1384. Train loss: 4.921279430389404
	Validation loss: 5.404284477233887

---- Epoch 12 / 20 ----
	Iteration 20/1384. Train loss: 4.767512321472168
	[... output truncated ...]

	Iteration 1300/1384. Train loss: 4.774343967437744
	Iteration 1320/1384. Train loss: 4.687753200531006
	Iteration 1340/1384. Train loss: 4.590094089508057
	Iteration 1360/1384. Train loss: 4.557735919952393
	Iteration 1380/1384. Train loss: 4.824099063873291
	Validation loss: 5.3212385177612305

---- Epoch 14 / 20 ----
	Iteration 20/1384. Train loss: 4.734081745147705
	Iteration 40/1384. Train loss: 4.613382816314697
	Iteration 60/1384. Train loss: 4.866113662719727
	Iteration 80/1384. Train loss: 4.77431583404541
	Iteration 100/1384. Train loss: 4.608621120452881
	Iteration 120/1384. Train loss: 4.935647010803223
	Iteration 140/1384. Train loss: 4.640763282775879
	Iteration 160/1384. Train loss: 4.726688861846924
	Iteration 180/1384. Train loss: 4.643087863922119
	Iteration 200/1384. Train loss: 4.66223669052124
	Validation loss: 5.366191387176514
	Iteration 220/1384. Train loss: 4.720583915710449
	Iteration 240/1384. Train loss: 4.770028591156006
	[... output truncated ...]

	Iteration 120/1384. Train loss: 4.677751541137695
	Iteration 140/1384. Train loss: 4.521937370300293
	Iteration 160/1384. Train loss: 4.805904865264893
	Iteration 180/1384. Train loss: 4.739128589630127
	Iteration 200/1384. Train loss: 4.475789546966553
	Validation loss: 5.30163049697876
	Iteration 220/1384. Train loss: 4.7238545417785645
	Iteration 240/1384. Train loss: 4.629750728607178
	Iteration 260/1384. Train loss: 4.579588890075684
	Iteration 280/1384. Train loss: 4.355923652648926
	Iteration 300/1384. Train loss: 4.793821811676025
	Iteration 320/1384. Train loss: 4.512328147888184
	Iteration 340/1384. Train loss: 4.579394817352295
	Iteration 360/1384. Train loss: 4.656285762786865
	Iteration 380/1384. Train loss: 4.598205089569092
	Iteration 400/1384. Train loss: 4.445307731628418
	Validation loss: 5.375027656555176
	Iteration 420/1384. Train loss: 4.620642185211182
	Iteration 440/1384. Train loss: 4.628066062927246
	Iteration 460/1384. Train loss: 4.221531391143799
	[... output truncated ...]

	Iteration 340/1384. Train loss: 4.6742963790893555
	Iteration 360/1384. Train loss: 4.371867656707764
	Iteration 380/1384. Train loss: 4.606247425079346
	Iteration 400/1384. Train loss: 4.397864818572998
	Validation loss: 5.378758430480957
	Iteration 420/1384. Train loss: 4.498430252075195
	Iteration 440/1384. Train loss: 4.480257987976074
	Iteration 460/1384. Train loss: 4.50527811050415
	Iteration 480/1384. Train loss: 4.441649913787842
	Iteration 500/1384. Train loss: 4.603725433349609
	Iteration 520/1384. Train loss: 4.4395575523376465
	Iteration 540/1384. Train loss: 4.375431537628174
	Iteration 560/1384. Train loss: 4.507632732391357
	Iteration 580/1384. Train loss: 4.463362693786621
	Iteration 600/1384. Train loss: 4.556092739105225
	Validation loss: 5.354894638061523
	Iteration 620/1384. Train loss: 4.316308498382568
	Iteration 640/1384. Train loss: 4.401506423950195
	Iteration 660/1384. Train loss: 4.460450172424316
	Iteration 680/1384. Train loss: 4.516157627105713
	[... output truncated ...]

	Iteration 560/1384. Train loss: 4.165859699249268
	Iteration 580/1384. Train loss: 4.524913787841797
	Iteration 600/1384. Train loss: 4.515651702880859
	Validation loss: 5.346724033355713
	Iteration 620/1384. Train loss: 4.434627056121826
	Iteration 640/1384. Train loss: 4.366643905639648
	Iteration 660/1384. Train loss: 4.351452350616455
	Iteration 680/1384. Train loss: 4.459339618682861
	Iteration 700/1384. Train loss: 4.221236705780029
	Iteration 720/1384. Train loss: 4.260685443878174
	Iteration 740/1384. Train loss: 4.339695930480957
	Iteration 760/1384. Train loss: 4.547367095947266
	Iteration 780/1384. Train loss: 4.435733318328857
	Iteration 800/1384. Train loss: 4.550050735473633
	Validation loss: 5.351284503936768
	Iteration 820/1384. Train loss: 4.399274826049805
	Iteration 840/1384. Train loss: 4.552123069763184
	Iteration 860/1384. Train loss: 4.392159461975098
	Iteration 880/1384. Train loss: 4.399227619171143
	Iteration 900/1384. Train loss: 4.255939960479736
	[... output truncated ...]
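
The training loop above estimates the validation loss from 500 entries drawn with replacement by `build_batch`, so the reported number fluctuates from one evaluation to the next. A steadier alternative is to visit every validation entry exactly once and weight each chunk by its number of real (non-padding) tokens. The sketch below shows only the aggregation; `chunk_loss_fn` is a hypothetical callable that you would implement by running `model.loss` on a chunk and counting the ones in its mask.

```python
def full_validation_loss(dataset, chunk_size, chunk_loss_fn):
    """Average per-token loss over the whole dataset, visiting each entry once.

    chunk_loss_fn is a hypothetical callable that takes a list of entries and
    returns (mean_loss_over_real_tokens, number_of_real_tokens) for that chunk.
    """
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(dataset), chunk_size):
        chunk = dataset[start:start + chunk_size]
        mean_loss, n_tokens = chunk_loss_fn(chunk)
        # Re-weight each chunk by its token count so every token counts equally.
        total_loss += mean_loss * n_tokens
        total_tokens += n_tokens
    return total_loss / max(total_tokens, 1)

# Toy check with a fake loss function: each entry is a list of per-token losses.
def fake_chunk_loss(chunk):
    losses = [value for entry in chunk for value in entry]
    return sum(losses) / len(losses), len(losses)

toy_data = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
result = full_validation_loss(toy_data, chunk_size=2, chunk_loss_fn=fake_chunk_loss)
print(result)  # equals the mean of all six toy losses
```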

In [27]:
# Path to final model
my_experiment = root_folder+"models/final_language_model"

with tf.Session(config=tf.ConfigProto(device_count={'GPU': 0})) as sess:
    # Load language_model_1
    model.saver.restore(sess, experiment_1)
    eval_input, eval_target, eval_target_mask = build_batch(d_valid, 500)
    feed = {model.input_num: eval_input, model.targets: eval_target, model.targets_mask: eval_target_mask}
    eval_loss = sess.run(model.loss, feed_dict=feed)  # fetch the scalar directly instead of a one-element list
    print("Evaluation set loss:", eval_loss)
    
    # Save language_model_1 into final_model
    model.saver.save(sess, my_experiment)
    
# # Create my model
# tf.reset_default_graph()
# config=tf.ConfigProto(device_count={'GPU': 0})    # Comment this line if wanna train on GPU
# my_model = LanguageModel(input_length=input_length,
#                          vocab_size=vocab_size,
#                          rnn_size=256,
#                          learning_rate=1e-3,
#                          lm_cell_num=2,
#                          dropout=True,
#                          lr_schedule=True)
    
# with tf.Session(config=config) as sess:
#     # Initialize weights of the model according to their Initialization parameters.
#     sess.run(tf.global_variables_initializer())
    
#     # Set training parameters
#     batch_size = 32
#     epoch_num = 20
#     iter_per_epoch = len(d_train) // batch_size + 1
#     initial_learning_rate = 1e-3
    
#     # Evaluate model before training
#     eval_input, eval_target, eval_target_mask = build_batch(d_valid, 500)
#     feed = {my_model.input_num: eval_input, my_model.targets: eval_target, my_model.targets_mask: eval_target_mask}
#     validate_loss = sess.run([my_model.loss], feed_dict=feed)[0]
#     print("Before training. Validation loss: {}".format(validate_loss))
#     print("Start training ...")
#     print()
    
#     for epoch in range(epoch_num):
#         print("---- Epoch {} / {} ----".format(epoch + 1, epoch_num))
        
#         for it in range(iter_per_epoch):
#             # Obtain a batch:
#             batch_input, batch_target, batch_target_mask = build_batch(d_train, batch_size)
            
#             if epoch < 10:
#                 learning_rate = initial_learning_rate
#             elif epoch < 15:
#                 learning_rate = initial_learning_rate / 2
#             else:
#                 learning_rate = initial_learning_rate / 5
            
#             # Map the values to each tensor in a `feed_dict`
#             feed = {my_model.input_num: batch_input,
#                     my_model.targets: batch_target,
#                     my_model.targets_mask: batch_target_mask,
#                     my_model.learning_rate: learning_rate,
#                    }

#             # Train
#             step, train_loss, _ = sess.run([my_model.global_step, my_model.loss, my_model.train_op], feed_dict=feed)
            
#             if (it + 1) % 100 == 0:
#                 print("\tIteration {}/{}. Train loss: {}".format(it + 1, iter_per_epoch, train_loss))
            
#             if (it + 1) % 500 == 0 or (it + 1) == iter_per_epoch:
#                 eval_input, eval_target, eval_target_mask = build_batch(d_valid, 500)
#                 feed = {my_model.input_num: eval_input,
#                         my_model.targets: eval_target,
#                         my_model.targets_mask: eval_target_mask}
#                 validate_loss = sess.run([my_model.loss], feed_dict=feed)[0]
#                 print("\tValidation loss: {}".format(validate_loss))
                
#         print()
    
#     # Here is how you save the model weights
#     my_model.saver.save(sess, my_experiment)
    
#     # Restore the weights previously saved
#     my_model.saver.restore(sess, my_experiment)
    
#     # Validate our model
#     print("Finished training. Validating model ...")
#     eval_input, eval_target, eval_target_mask = build_batch(d_valid, 500)
#     feed = {my_model.input_num: eval_input, my_model.targets: eval_target, my_model.targets_mask: eval_target_mask}
#     validate_loss = sess.run([my_model.loss], feed_dict=feed)[0]
#     print("Validation loss: {}".format(validate_loss))
#     print()

INFO:tensorflow:Restoring parameters from models/language_model_1
Evaluation set loss: [5.220231]


# Using the language model

Congratulations, you have now trained a language model! We can now use it to evaluate likely news headlines, as well as generate our very own headlines.

**TODO**: Complete the three parts below, using the model you have trained.

## (1) Evaluation loss

To evaluate the language model, we measure its loss (how well it predicts the next word) on held-out data that it has not seen during training.
Your first evaluation is to load the model you trained and obtain its loss on this data.
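
A raw loss value is often easier to interpret as a perplexity. The sketch below assumes that `model.loss` is the mean cross-entropy per non-padded word, measured in nats; the loss definition is not shown in this section, so treat that as an assumption. The helper name `perplexity` is illustrative and not part of the notebook.

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean per-word cross-entropy (in nats) into a perplexity."""
    return math.exp(mean_cross_entropy)

# Reference point: a model that guesses uniformly over a 20,000-word
# vocabulary has loss ln(20000), so its perplexity is the vocabulary size.
uniform_loss = math.log(20000)
print(round(perplexity(uniform_loss)))
```

Under the same assumption, a loss of about 5.40 corresponds to a perplexity of roughly 221, meaning the model is on average about as uncertain as if it were choosing uniformly among 221 words.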

In [19]:
# Your best performing model should go here.
model_file = root_folder+"models/language_model_1"

In [20]:
# We will evaluate your model in the model_file above
# In a very similar way as the code below.
# Make sure your validation loss is below the threshold we specified
# and that you didn't train using the validation set, as you would
# get penalized.

with tf.Session() as sess:
    model.saver.restore(sess, model_file)
    eval_input, eval_target, eval_target_mask = build_batch(d_valid, 500)
    feed = {model.input_num: eval_input, model.targets: eval_target, model.targets_mask: eval_target_mask}
    eval_loss = sess.run([model.loss], feed_dict=feed)
    print("Evaluation set loss:", eval_loss)

INFO:tensorflow:Restoring parameters from models/language_model_1
Evaluation set loss: [5.3999333]


## (2) Evaluation of likelihood of data

One use of a language model is to judge which of several texts is more likely to have come from the same distribution as the training data. Because we have trained our model on news headlines, we can see which of these headlines is more likely:

``Apple to release another iPhone in September``


 ``Apple and Samsung resolve all lawsuits amicably``
 
**TODO**: Use the model to obtain the loss the neural network assigns to each sentence.
Because the neural network assigns probability to the words appearing in a sequence, this loss can be used as a proxy to measure how likely the sentence is to have occurred in the dataset.
Once you have the loss for each headline, write down which sentence was judged to be more likely, and explain why/if you think this is coherent.
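
To see why a lower loss means a more likely sentence, here is a minimal NumPy sketch of a masked mean cross-entropy. It assumes the loss averages the negative log-probability of each target word over the non-padded positions; the actual `model.loss` may differ in its details. The function name `masked_sequence_loss` and all probabilities are made up for illustration.

```python
import numpy as np

def masked_sequence_loss(target_probs, mask):
    """Mean negative log-probability over the positions where mask is True.

    target_probs[i] is the probability the model gave to the correct word
    at position i; mask[i] is False for padding positions.
    """
    target_probs = np.asarray(target_probs, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    token_losses = -np.log(target_probs[mask])
    return token_losses.mean()

# Toy example: two 3-word sentences padded to length 5 (hypothetical numbers).
mask = [True, True, True, False, False]
likely = masked_sequence_loss([0.5, 0.4, 0.5, 1.0, 1.0], mask)
unlikely = masked_sequence_loss([0.5, 0.01, 0.02, 1.0, 1.0], mask)

# The sentence whose words received higher probabilities has the lower loss.
print(likely < unlikely)

# The loss can be turned back into a sentence probability:
# exp(-loss * number_of_words) is the product of the word probabilities.
print(np.exp(-likely * 3))
```

Because the loss is an average per word, it also lets you compare headlines of different lengths, which a raw sentence probability would not.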

**Your answer:**


In [23]:
headline1 = "Apple to release new iPhone in July"
headline2 = "Apple and Samsung resolve all lawsuits"

headlines = [headline1, headline2]

with tf.Session() as sess:
    model.saver.restore(sess, model_file)

    for headline in headlines:
        headline = headline.lower() # Our LSTM is trained on lower-cased headlines
    
        # From the code in the Preprocessing section at the end of the notebook
        # Find out how to tokenize the headline
        tokenized = tokenizer.word_tokenizer(headline)
        
        # Find out how to numerize the tokenized headline
        numerized = numerize_sequence(tokenized)

        # Learn how to pad and obtain the mask of the sequence.
        padded, mask = pad_sequence(numerized, padI, input_length)
        
        # Obtain the loss of the sequence, and print it
        
        loss = sess.run([model.loss], feed_dict={model.input_num: [[startI] + padded[:-1]], 
                                                 model.targets: [padded],
                                                 model.targets_mask: [mask]
                                                }
                       )
        print("----------------------------------------")
        print("Headline:",headline)
        print("Loss of the headline:", loss)

# Important check: one headline should be more likely (and have lower loss)
# than the other headline. You should know which headline should have lower loss.

INFO:tensorflow:Restoring parameters from models/language_model_1
----------------------------------------
Headline: apple to release new iphone in july
Loss of the headline: [3.1056914]
----------------------------------------
Headline: apple and samsung resolve all lawsuits
Loss of the headline: [5.165652]


## (3) Generation of headlines

We can use our language model to generate text according to the distribution of our training data.
The way generation works is the following:

We seed the model with a beginning of sequence, and obtain the distribution for the next word.
We select the most likely word (argmax) and add it to our sequence of words.
Now our sequence is one word longer, and we can feed it in again as an input, for the network to produce the next word.
We do this a fixed number of times (up to 20 words), and obtain automatically generated headlines!
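
The loop described above is usually called greedy decoding. The sketch below replaces the trained LSTM with a hypothetical lookup table of next-word scores (`toy_logits`), so it runs without TensorFlow. Unlike the real model, this toy only looks at the last word rather than the whole prefix; only the control flow mirrors the notebook.

```python
import numpy as np

# Hypothetical 5-word vocabulary and a made-up "model": row i holds the
# scores for the word that follows word i.
vocab = ["<START>", "apple", "releases", "new", "iphone"]
toy_logits = np.array([
    [0.0, 3.0, 0.1, 0.1, 0.1],   # after <START>  -> apple
    [0.0, 0.1, 3.0, 0.1, 0.1],   # after apple    -> releases
    [0.0, 0.1, 0.1, 3.0, 0.1],   # after releases -> new
    [0.0, 0.1, 0.1, 0.1, 3.0],   # after new      -> iphone
    [0.0, 0.1, 0.1, 3.0, 0.1],   # after iphone   -> new
])

def greedy_decode(seed, max_length):
    """Repeatedly append the highest-scoring next word until max_length."""
    sequence = list(seed)
    while len(sequence) < max_length:
        next_scores = toy_logits[sequence[-1]]        # scores given the last word
        sequence.append(int(np.argmax(next_scores)))  # greedy choice
    return sequence

generated = greedy_decode(seed=[0, 1], max_length=5)
print(" ".join(vocab[i] for i in generated))
```

Since argmax always makes the same choice, greedy decoding is deterministic and can get stuck in loops: with a larger `max_length`, this toy model would alternate between `new` and `iphone` forever.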


We have provided a few headline starters that should produce interesting generated headlines.

**TODO:** Get creative and find at least 2 more headline_starters that produce interesting headlines.

In [39]:
with tf.Session() as sess:
    model.saver.restore(sess, model_file)

    # Here are some headline starters.
    # They're all about tech companies, because
    # that is what is in our dataset
    headline_starters = ["apple has released", "google has released", "amazon", "tesla to", "facebook gained", "microsoft lost"]
    
    for headline_starter in headline_starters:
        print("===================")
        print("Generating headline starting with: "+headline_starter)

        # Tokenize and numerize the headline. Put the numerized headline
        # beginning in `current_build`
        tokenized = tokenizer.word_tokenizer(headline_starter)
        current_build = [startI] + numerize_sequence(tokenized)

        while len(current_build) < input_length:
            # Pad the current_build into an input_length vector.
            # We do this so that it can be processed by our LanguageModel class
            current_padded = current_build[:input_length] + [padI] * (input_length - len(current_build))
            current_padded = np.array([current_padded])

            # Obtain the logits for the current padded sequence
            # This involves obtaining the output_logits from our model,
            # and not the loss like we have done so far
            logits = sess.run(model.output_logits, feed_dict={model.input_num: current_padded})[0]

            # Obtain the row of logits that interests us: the logits for the last
            # non-pad input
            last_logits = logits[len(current_build) - 1, :]
            
            # Find the highest scoring word in the last_logits
            # array. The np.argmax function should be useful.
            # Append this word to our current build
            current_build.append(np.argmax(last_logits))
        
        # Go from the current_build of word_indices
        # To the headline (string) produced. This should involve
        # the vocabulary, and a string merger.
        produced_sentence = numerized2text(current_build)
        print(produced_sentence)

INFO:tensorflow:Restoring parameters from models/language_model_1
Generating headline starting with: apple has released
<START> apple has released a new iphone app store that could be a UNK UNK for the iphone x -
Generating headline starting with: google has released
<START> google has released a UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK
Generating headline starting with: amazon
<START> amazon to buy whole foods for $ 13.7 bln in UNK - UNK UNK , UNK , UNK ,
Generating headline starting with: tesla to
<START> tesla to sell apple pay in china , says gartner says it is not to blame for UNK UNK
Generating headline starting with: facebook gained
<START> facebook gained for UNK , but not to be UNK , says UNK UNK says they are talking about
Generating headline starting with: microsoft lost
<START> microsoft lost its $ UNK million investment in UNK , UNK says it will be UNK to UNK UNK


## All done

You are done with the first part of the HW.

The next notebook deals with text summarization!


# Preprocessing (read only)


**You can skip this section; however, you may find these functions useful later in the assignment.**

We have provided this code so you can see how the dataset was generated. You will have to come back to some of these functions later in the assignment, so feel free to read through it to get familiar.
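
To get a feel for what the numerizing and padding helpers in this section do, here is a small self-contained sketch on a toy vocabulary. The seven-word vocabulary and the example sentence are hypothetical, and a plain whitespace split stands in for the notebook's `tokenizer.word_tokenizer`; the two helper functions are copied from the cell that follows.

```python
# Toy vocabulary (hypothetical); the real one has 20,000 entries.
vocabulary = ["<START>", "UNK", "PAD", "apple", "to", "release", "iphone"]
w2i = {w: i for i, w in enumerate(vocabulary)}
unkI, padI, startI = w2i["UNK"], w2i["PAD"], w2i["<START>"]

def numerize_sequence(tokenized):
    # Words missing from the vocabulary are mapped to the UNK index
    return [w2i.get(w, unkI) for w in tokenized]

def pad_sequence(numerized, pad_index, to_length):
    # Truncate to to_length, then right-pad with pad_index
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    mask = [w != pad_index for w in padded]
    return padded, mask

# Whitespace split is a stand-in for the real tokenizer (assumption).
tokens = "apple to release foldable iphone".split()
numerized = numerize_sequence(tokens)
padded, mask = pad_sequence(numerized, padI, 8)

print(numerized)  # "foldable" is not in the toy vocabulary, so it becomes UNK
print(padded)
print(mask)
```

The mask marks which positions hold real words, so that padding positions can be excluded when the loss is averaged.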

In [22]:
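def numerize_sequence(tokenized):
    # Map each token to its vocabulary index; out-of-vocabulary tokens map to UNK
    return [w2i.get(w, unkI) for w in tokenized]


def pad_sequence(numerized, pad_index, to_length):
    # Truncate to `to_length`, then right-pad with `pad_index`
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    # The mask is True for real tokens and False for padding
    mask = [w != pad_index for w in padded]
    return padded, mask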

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

for a in dataset:
    a['tokenized'] = tokenizer.word_tokenizer(a['title'].lower())

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

word_counts = Counter()
for a in dataset:
    word_counts.update(a['tokenized'])

print(word_counts.most_common(30))

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

# Creating the vocab
vocab_size = 20000
special_words = ["<START>", "UNK", "PAD"]
vocabulary = special_words + [w for w, c in word_counts.most_common(vocab_size-len(special_words))]
w2i = {w: i for i, w in enumerate(vocabulary)}

# Numerizing and padding
input_length = 20
unkI, padI, startI = w2i['UNK'], w2i['PAD'], w2i['<START>']

for a in dataset:
    a['numerized'] = numerize_sequence(a['tokenized']) # Change words to IDs
    a['numerized'], a['mask'] = pad_sequence(a['numerized'], padI, input_length) # Append appropriate PAD tokens
    
# Compute fraction of words that are UNK:
word_counters = Counter([w for a in dataset for w in a['numerized'] if w != padI])

print("Fraction of UNK words:", float(word_counters[unkI]) / sum(word_counters.values()))

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

d_released_processed   = [d for d in dataset if d['cut'] != 'testing']
d_unreleased_processed = [d for d in dataset if d['cut'] == 'testing']

with open("dataset/headline_generation_dataset_processed.json", "w") as f:
    json.dump(d_released_processed, f)

# This file is purposefully left out of the assignment, we will use it to evaluate your model.
with open("dataset/headline_generation_dataset_unreleased_processed.json", "w") as f:
    json.dump(d_unreleased_processed, f)
    
with open("dataset/headline_generation_vocabulary.txt", "w", encoding="utf8") as f:
    f.write("\n".join(vocabulary))