# Introduction

<center><h3>**Welcome to the Language modeling Notebook.**</h3></center>

In this assignment, you are going to train a neural network to **generate news headlines**.
To reduce computational needs, the dataset is restricted to headlines about technology and a handful of tech giants.
In this assignment you will:
- Learn to preprocess raw text so it can be fed into an LSTM.
- Use PyTorch's LSTM module to train a language model that generates headlines.
- Use your network to generate headlines, and judge which headlines are likely or not.




**What is a language model?**

Language modeling is the task of assigning a probability to sentences in a language. Besides assigning a probability to each sequence of words, the language models also assigns a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words.
— Page 105, __[Neural Network Methods in Natural Language Processing](https://www.amazon.com/Language-Processing-Synthesis-Lectures-Technologies/dp/1627052984/)__, 2017.

In neural network terms, we are training a network to produce probabilities (classification) over a fixed vocabulary of words.
Concretely, we are training a neural network to produce:
$$ P(w_{i+1} \mid w_1, w_2, w_3, \ldots, w_i), \quad \forall i \in \{1, \ldots, n\} $$
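
To make the formula concrete, here is a minimal sketch of how per-step conditional probabilities combine into the probability of a whole sentence through the chain rule. The probability values are made up for illustration; in this assignment they would come from the network's output distribution.

```python
import math

# Hypothetical values of P(next word | context), hand-picked for illustration only.
steps = [
    ("<START>", "the", 0.20),
    ("<START> the", "cat", 0.05),
    ("<START> the cat", "is", 0.30),
    ("<START> the cat is", "great", 0.10),
]


def sentence_log_prob(steps):
    """Chain rule: the log-probability of a sentence is the sum of the
    log-probabilities of each word given the words before it."""
    return sum(math.log(p) for _context, _word, p in steps)


log_p = sentence_log_prob(steps)
prob = math.exp(log_p)
print(f"log P(sentence) = {log_p:.4f}")
print(f"P(sentence)     = {prob:.6f}")
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.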

**Why is language modeling important?**

Language modeling is a core problem in NLP.

Language models can be used stand-alone to produce new text that matches the distribution of the text they were trained on, or as the front end of a more sophisticated model to produce better results.

Recently, for example, the __[BERT](https://arxiv.org/abs/1810.04805)__ paper showed that pretraining a large neural network on a language modeling task can help improve the state of the art on many NLP tasks.

**How good can the generation of a language model be?**

If you have not seen the post about GPT-2 by OpenAI, you should read some of the samples they generated from their language model __[here](https://blog.openai.com/better-language-models/#sample1)__.
Because of computational restrictions, our generated text will not be as good, but the same algorithm is at the core. They just use more data and compute.

# Library imports

Before starting, make sure you have all these libraries.

In [1]:
!pip install segtok



Run the first of the following two cells if you are running the homework locally, and run the second cell if you are running the homework in Colab.

In [2]:
DRIVE=False
root_folder = ""
dataset_folder = "dataset/"

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')
# root_folder = "/content/drive/My Drive/cs182_hw3/"
# dataset_folder = "/content/drive/My Drive/cs182_hw3_public/dataset/"

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
%pip install "numpy<2"
%pip install "torch<2"
import os
import sys
sys.path.append(root_folder)
from segtok import tokenizer
from collections import Counter
import torch as th
from torch import nn
import torch.nn.functional as F
import torch.optim as optim

import numpy as np
import json
from utils import validate_to_array

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


# Loading the datasets

Make sure the dataset files are all in the `dataset` folder of the assignment.

 - If you are using this notebook locally, run the `download_data.sh` script.
 - If you are using the Colab version of the notebook, make sure that your Google Drive is mounted, and verify from the file explorer in Colab that the files are viewable within `/content/drive/My Drive/cs182_hw3_public/dataset/`.
 


In [6]:
# This cell loads the data for the model
# Run this before working on loading any of the additional data

with open(dataset_folder+"headline_generation_dataset_processed.json", "r") as f:
    d_released = json.load(f)

with open(dataset_folder+"headline_generation_vocabulary.txt", "r",encoding='utf8') as f:
    vocabulary = f.read().split("\n")
w2i = {w: i for i, w in enumerate(vocabulary)} # Word to index
i2w = {i: w for i, w in enumerate(vocabulary)} # Index to word
unkI, padI, startI = w2i['UNK'], w2i['PAD'], w2i['<START>']

vocab_size = len(vocabulary)
input_length = len(d_released[0]['numerized']) # The length of the first element in the dataset, they are all of the same length
d_train = [d for d in d_released if d['cut'] == 'training']
d_valid = [d for d in d_released if d['cut'] == 'validation']

print("Number of training samples:",len(d_train))
print("Number of validation samples:",len(d_valid))

Number of training samples: 88568
Number of validation samples: 946


Now that we have loaded the data, let's inspect one of the elements. Each sample in our dataset has a `numerized` vector that contains the preprocessed headline as a list of token indices. This vector is what we will feed into the neural network. The already loaded list `vocabulary` maps each token index to its string. Use these elements to recover the `title` key of entry 1001 in the training dataset.

**TODO**: Write the numerized2text function in notebook_utils and inspect element 1001 in the training dataset (`entry = d_train[1001]`).



In [7]:
def numerize_sequence(tokenized):
    return [w2i.get(w, unkI) for w in tokenized]
def pad_sequence(numerized, pad_index, to_length):
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    mask = [w != pad_index for w in padded]
    return padded, mask
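
As a quick illustration of what these two helpers do, here is a minimal, self-contained sketch on a toy vocabulary. The vocabulary and the names `toy_vocabulary`, `toy_w2i`, `toy_unk`, and `toy_pad` are hypothetical, and unlike the notebook version, `numerize_sequence` here takes the lookup table and unknown index as explicit arguments so the example runs on its own.

```python
# Toy vocabulary for illustration; the real one is loaded from the dataset folder.
toy_vocabulary = ["PAD", "UNK", "<START>", "microsoft", "donates", "cloud"]
toy_w2i = {w: i for i, w in enumerate(toy_vocabulary)}
toy_unk, toy_pad = toy_w2i["UNK"], toy_w2i["PAD"]


def numerize_sequence(tokenized, w2i, unk_index):
    # Map each token to its index, falling back to UNK for unseen words
    return [w2i.get(w, unk_index) for w in tokenized]


def pad_sequence(numerized, pad_index, to_length):
    # Truncate to to_length, then right-pad with pad_index
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    # The mask is True for real tokens and False for padding
    mask = [w != pad_index for w in padded]
    return padded, mask


tokens = ["microsoft", "donates", "servers"]  # "servers" is not in the toy vocabulary
numerized = numerize_sequence(tokens, toy_w2i, toy_unk)
padded, mask = pad_sequence(numerized, toy_pad, to_length=5)
print(numerized)  # [3, 4, 1]
print(padded)     # [3, 4, 1, 0, 0]
print(mask)       # [True, True, True, False, False]
```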

In [8]:
def numerized2text(numerized):
    """Converts an integer sequence in the vocabulary into a string corresponding to the title.

    Arguments:
        numerized: List[int]  -- The list of vocabulary indices corresponding to the string
    Returns:
        title: str -- The string corresponding to the numerized input, without padding.
    """
    #####
    # BEGIN YOUR CODE HERE
    # Recover each word from the vocabulary in the list of indices in numerized, using the vocabulary variable
    # Hint 1: Use the string.join() function to reconstruct a single string
    # Hint 2: The objects and/or functions defined in above cells may be useful.
    #####
    converted_string = " ".join(np.array(vocabulary)[numerized])
    #####
    # END YOUR CODE HERE
    #####

    return converted_string


entry = d_train[1001]
print("Reversing the numerized: " + numerized2text(entry["numerized"]))
validate_to_array(numerized2text, (entry["numerized"],), "numerized2text", root_folder)
print("From the `title` entry: " + entry["title"])

Reversing the numerized: microsoft donates cloud computing ' worth $ 1 bn ' PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
From the `title` entry: Microsoft donates cloud computing 'worth $1 bn'


In language modeling, we train a model to produce the next word in the sequence given all previously generated words. In practice, this has two steps:

1. Adding a special `<START>` token to the start of the sequence for the input. This "shifts" the input to the right by one. We call this the "source" sequence.
2. Making the network predict the original, unshifted version (we call this the "target" sequence).

    
Let's take an example. Say we want to train the network on the sentence: "The cat is great."
The input to the network will be "`<START>` The cat is great." The target will be: "The cat is great."
    
Therefore the first prediction is to select the word "The" given the `<START>` token.
The second prediction is to produce the word "cat" given the two tokens "`<START>` The".
At each step, the network learns to predict the next word, given all previous ones.
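
The shift can be sketched in a few lines of plain Python. This is only an illustration on word strings; the tokenization of the example sentence is an assumption, and the notebook itself works on integer indices. The last token of the source is dropped so that source and target keep the same length.

```python
tokens = ["The", "cat", "is", "great", "."]

# Source: prepend <START> and drop the last token to keep the length unchanged
source = ["<START>"] + tokens[:-1]
# Target: the original, unshifted sequence
target = tokens

# At step i the model sees source[: i + 1] and must predict target[i]
pairs = [(source[: i + 1], target[i]) for i in range(len(target))]
for context, next_word in pairs:
    print(" ".join(context), "->", next_word)
```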
    
---

Your next step is to write the `build_batch` function. Given a dataset and a list of indices selecting a subset of samples, it builds the "inputs" and the "targets" of the batch, following the procedure described above.

**TODO**: Write the `build_batch` function. We give you the structure, and you have to fill in the places marked `your_code`.


In [9]:
def build_batch(dataset, indices):
    """Builds a batch of source and target elements from the dataset.

    Arguments:
        dataset: List[db_element] -- A list of dataset elements
        indices: List[int] -- A list of indices of the dataset to sample
    Returns:
        batch_input: List[List[int]] -- List of source sequences
        batch_target: List[List[int]] -- List of target sequences
        batch_target_mask: List[List[int]] -- List of target batch masks
    """
    #####
    # BEGIN YOUR CODE HERE
    #####

    # We get a list of indices we will choose from the dataset.
    # indices = range(iteration*batch_size,(iteration+1)*batch_size)

    # Recover what the entries for the batch are
    batch = np.array(dataset)[indices]

    # Get the raw numerized for this input, each element of the dataset has a 'numerized' key
    batch_numerized = np.array([x["numerized"] for x in batch])

    # Create an array of startI that will be concatenated at position 1 for the input.
    # Should be of shape (batch_size, 1)
    start_tokens = np.full((len(indices), 1), startI)

    # Concatenate the start_tokens with the rest of the input
    # The np.concatenate function should be useful
    # The output should now be [batch_size, sequence_length+1]
    batch_input = np.concatenate([start_tokens, batch_numerized], axis=1)

    # Remove the last word from each element in the batch
    # To restore the [batch_size, sequence_length] size
    batch_input = batch_input[:, :-1]

    # The target should be the un-shifted numerized input
    batch_target = batch_numerized

    # The target-mask is a 0 or 1 filter to note which tokens are
    # padding or not, to give the loss, so the model doesn't get rewarded for
    # predicting PAD tokens.
    batch_target_mask = np.array([a["mask"] for a in batch])

    #####
    # END YOUR CODE HERE
    #####

    return batch_input, batch_target, batch_target_mask


validate_to_array(build_batch, (d_train, range(100)), "build_batch", root_folder)
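
The target mask matters when the per-token losses are averaged. The sketch below uses made-up loss values in plain Python to show why: once padded positions are zeroed out, dividing by the total number of positions underestimates the loss, while dividing by the number of real tokens does not. All numbers here are hypothetical.

```python
# Hypothetical per-token losses for 2 sequences of length 4.
# The 9.9 entries sit at padded positions and should not count.
token_losses = [[2.0, 4.0, 6.0, 9.9], [3.0, 5.0, 9.9, 9.9]]
mask = [[1.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]

# Zero out the loss at padded positions
masked = [[l * m for l, m in zip(l_row, m_row)] for l_row, m_row in zip(token_losses, mask)]

total = sum(sum(row) for row in masked)
n_real = sum(sum(row) for row in mask)       # number of real (non-PAD) tokens
n_all = sum(len(row) for row in mask)        # number of positions, padding included

masked_mean = total / n_real  # average over real tokens only
naive_mean = total / n_all    # padding inflates the denominator

print("masked mean:", masked_mean)
print("naive mean: ", naive_mean)
```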

# Creating the language model

Now that we've written the data pipeline, we are ready to write the neural network.

The steps to set up a neural network for language modeling are:
- Preparing the input and target tensors that we feed into the model.
- Creating an RNN of our choice, with a chosen size and optional parameters.
- Applying the RNN to the inputs.
- Taking the output of the RNN and projecting it into a vocabulary-sized dimension, so that we can make word predictions.
- Setting up the loss on the outputs so that the network learns to produce the correct words.
- Finally, choosing an optimizer, and defining a training operation: using the optimizer to minimize the loss.
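
One possible shape for such a model is sketched below, assuming an embedding layer followed by an LSTM and a linear projection. The class name `TinyLanguageModel` and its layer names are hypothetical and are not the course's `LanguageModel`; the sizes in the usage lines are arbitrary.

```python
import torch as th
from torch import nn


class TinyLanguageModel(nn.Module):
    """Hypothetical sketch: embedding -> LSTM -> projection to the vocabulary."""

    def __init__(self, vocab_size, rnn_size, num_layers=1, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, rnn_size)
        self.lstm = nn.LSTM(
            rnn_size,
            rnn_size,
            num_layers=num_layers,
            # nn.LSTM only applies dropout between stacked layers
            dropout=dropout if num_layers > 1 else 0.0,
            batch_first=True,
        )
        self.output = nn.Linear(rnn_size, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)     # (batch, seq_len, rnn_size)
        hidden, _ = self.lstm(embedded)  # (batch, seq_len, rnn_size)
        return self.output(hidden)       # (batch, seq_len, vocab_size) logits


toy_model = TinyLanguageModel(vocab_size=50, rnn_size=16)
toy_tokens = th.randint(0, 50, (4, 7))  # batch of 4 sequences of length 7
toy_logits = toy_model(toy_tokens)
print(toy_logits.shape)
```

The output keeps the vocabulary on the last axis, so an `argmax` over that axis gives one predicted word index per position.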

We provide skeleton code for the model; you can fill in the `your_code` sections. If you are unfamiliar with PyTorch, we give some hints about which functions to look for, and you should consult the PyTorch online documentation.

**TODO**: Fill in the LanguageModel in the language_model file.


In [10]:
from language_model import LanguageModel

# Training the model

Your objective is to train the language model on the dataset you are provided to reach a **validation loss <= 5.50**.

**TODO**: Train your model so that it achieves a validation loss of <= 5.5. 

**Careful**: we will be testing this loss on an unreleased test set, so make sure to evaluate properly on a validation set and not overfit. You must save the model you want us to test under: models/final_language_model (the .index, .meta and .data files)
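
One way to read these thresholds is to convert the loss to a perplexity. This assumes the loss is the mean per-token cross-entropy in natural-log units; the function name `perplexity` is chosen here for illustration.

```python
import math


def perplexity(mean_token_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_token_loss)


for loss in (5.50, 5.00):
    print(f"loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```

Roughly speaking, a perplexity of `k` means the model is on average as uncertain as if it were choosing uniformly among `k` words at each step.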

**Advice**:
- It should be possible to attain loss <= 5.50 with a 1-layer LSTM of size 256 or less.
- You should not need more than 10 epochs to attain the threshold. More passes over the data can, however, give you a better model.
- You can also try:
    - LSTM dropout (PyTorch has a layer for that)
    - A multi-layer RNN cell (PyTorch has a layer for that)
    - Changing your optimizer, tuning your learning rate, or using a learning rate schedule.
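
As one example of a learning rate schedule, cosine annealing decays the rate from its starting value toward a minimum along half a cosine wave. The sketch below evaluates that closed form in plain Python; `base_lr`, `eta_min`, and `t_max` are illustrative values. In PyTorch, a schedule of this shape is available as `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math


def cosine_lr(step, base_lr=1e-3, eta_min=0.0, t_max=100):
    """Closed-form cosine annealing for 0 <= step <= t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


for step in (0, 25, 50, 75, 100):
    print(f"step {step:3d}: lr = {cosine_lr(step):.6f}")
```

Note that the schedule advances every time it is stepped, so `t_max` counts scheduler steps: epochs if it is stepped once per epoch, batches if it is stepped once per batch.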
    
**Extra credit**:

Get the **validation loss <= 5.00** and get 5 points of extra credit on this assignment. Get creative, but remember: what you do should work on our held-out test set to get the points.

In [29]:
# We can create our model,
# with parameters of our choosing.
hidden_size = 256
num_layers = 2
dropout = 0.5

# Setup the loss using cross-entropy loss.
# The logits are the output_logits we've computed,
# look at the pytorch docs for `CrossEntropyLoss` and `permute`
# to align the axes correctly and to account for the masking properly.
# The targets are the goal labels we are trying to match.
# Note that if you directly take the mean of the loss tensor,
# it will underestimate your loss! (why would that be?)
# Lastly, there are a few valid forms of averaging token losses,
# here we will take the mean of all non-mask tokens together.
criterion = nn.CrossEntropyLoss(reduction="none")


def loss_fn(pred, target, mask):
    pred = pred.permute(0, 2, 1)  # put the class probabilities in the middle
    loss_tensor = criterion(pred, target)
    loss_masked = loss_tensor * mask
    loss_per_sample = loss_masked.sum() / mask.sum()
    return loss_per_sample


# The build_batch function outputs numpy, but our model is built in pytorch,
# so you need to convert numpy to pytorch.
# You also have to cast the masks into float32, target into long, and input into long.
# Look at the `float` and `long` function.
batch_to_torch = lambda b_in, b_target, b_mask: (
    th.from_numpy(b_in).long(),
    th.from_numpy(b_target).long(),
    th.from_numpy(b_mask).float(),
)


# Look at the docs for torch.optim and pick an optimizer
# And provide it with a start learning rate.
optimizer_class = optim.AdamW
lr = 1e-3
epochs = 300
batch_size = 128

model_id = "test1"
os.makedirs(root_folder + "models/part1/", exist_ok=True)

device = th.device("cuda" if th.cuda.is_available() else "cpu")
print(device)
list_to_device = lambda th_obj: [tensor.to(device) for tensor in th_obj]

cuda


In [30]:
model = LanguageModel(
    vocab_size=vocab_size, rnn_size=hidden_size, num_layers=num_layers, dropout=dropout
)
optimizer = optimizer_class(model.parameters(), lr=lr, weight_decay=2e-3)

In [38]:
# Skeleton code
# You have to write your own training process to obtain a
# Good performing model on the validation set, and save it.

model.train()
losses = []
accuracies = []
from pathlib import Path
from tqdm import tqdm

p = Path(root_folder + "models/part1/" + f"model_{model_id}.pt")
best_valid_loss = float("inf")
model.to(device)  # Move the model first so it lives on the same device as the batches
if p.exists():
    save_dict = th.load(p, map_location="cpu")
    model.load_state_dict(save_dict["model_state_dict"])
    # Score the restored checkpoint on the full validation set
    model.eval()
    batch = build_batch(d_valid, range(len(d_valid)))
    (batch_input, batch_target, batch_target_mask) = batch_to_torch(*batch)
    (batch_input, batch_target, batch_target_mask) = list_to_device(
        (batch_input, batch_target, batch_target_mask)
    )
    with th.no_grad():  # No gradients are needed for evaluation
        prediction = model(batch_input)
        best_valid_loss = loss_fn(prediction, batch_target, batch_target_mask).item()
# Cosine annealing; scheduler.step() runs once per batch below, so T_max counts iterations
scheduler = th.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(epochs):
    if best_valid_loss <= 5.00:
        break
    model.train()
    indices = np.random.permutation(range(len(d_train)))
    t = tqdm(range(0, (len(d_train) // batch_size) + 1))
    for i in t:
        # Here is how you obtain a batch:
        batch = build_batch(d_train, indices[i * batch_size : (i + 1) * batch_size])
        (batch_input, batch_target, batch_target_mask) = batch_to_torch(*batch)
        (batch_input, batch_target, batch_target_mask) = list_to_device(
            (batch_input, batch_target, batch_target_mask)
        )

        prediction = model(batch_input)
        loss = loss_fn(prediction, batch_target, batch_target_mask)
        losses.append(loss.item())
        accuracy = (
            th.eq(prediction.argmax(dim=2, keepdim=False), batch_target).float()
            * batch_target_mask
        ).sum() / batch_target_mask.sum()
        accuracies.append(accuracy.item())

        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
        scheduler.step()
        if i % 10 == 0:
            t.set_description(
                f"Epoch: {epoch} Iteration: {i} Loss: {np.mean(losses[-10:])} Accuracy: {np.mean(accuracies[-10:])}"
            )

    model.eval()
    batch = build_batch(d_valid, range(len(d_valid)))
    (batch_input, batch_target, batch_target_mask) = batch_to_torch(*batch)
    (batch_input, batch_target, batch_target_mask) = list_to_device(
        (batch_input, batch_target, batch_target_mask)
    )
    with th.no_grad():  # No gradients are needed for evaluation
        prediction = model(batch_input)
        # Store a plain Python float rather than a tensor
        valid_loss = loss_fn(prediction, batch_target, batch_target_mask).item()
    print("Epoch:", epoch, "Validation Loss:", valid_loss)
    # Save the model if the validation loss decreases
    if best_valid_loss > valid_loss:
        best_valid_loss = valid_loss
        # save your latest model
        save_dict = dict(
            kwargs=dict(
                vocab_size=vocab_size,
                rnn_size=hidden_size,
                num_layers=num_layers,
                dropout=dropout,
            ),
            model_state_dict=model.state_dict(),
            notes="",
            optimizer_class=optimizer_class,
            lr=lr,
            epochs=epochs,
            batch_size=batch_size,
        )
        th.save(save_dict, root_folder + f"models/part1/model_{model_id}.pt")
        print(f"Saved new best model with val loss {valid_loss:.4f}")

Epoch: 0 Iteration: 690 Loss: 6.1319678783416744 Accuracy: 0.12138126268982888: 100%|██████████| 692/692 [00:19<00:00, 34.90it/s]


Epoch: 0 Validation Loss: 5.8736724853515625
Saved new best model with val loss 5.8737


Epoch: 1 Iteration: 690 Loss: 6.0703388214111325 Accuracy: 0.12148561477661132: 100%|██████████| 692/692 [00:21<00:00, 32.77it/s]


Epoch: 1 Validation Loss: 5.835305690765381
Saved new best model with val loss 5.8353


Epoch: 2 Iteration: 690 Loss: 6.033824491500854 Accuracy: 0.12844372913241386: 100%|██████████| 692/692 [00:21<00:00, 32.36it/s] 


Epoch: 2 Validation Loss: 5.794264316558838
Saved new best model with val loss 5.7943


Epoch: 3 Iteration: 690 Loss: 5.994955539703369 Accuracy: 0.12547899633646012: 100%|██████████| 692/692 [00:22<00:00, 30.50it/s] 


Epoch: 3 Validation Loss: 5.760225296020508
Saved new best model with val loss 5.7602


Epoch: 4 Iteration: 690 Loss: 5.9718162536621096 Accuracy: 0.1294025346636772: 100%|██████████| 692/692 [00:21<00:00, 32.18it/s] 


Epoch: 4 Validation Loss: 5.729812145233154
Saved new best model with val loss 5.7298


Epoch: 5 Iteration: 690 Loss: 5.964282035827637 Accuracy: 0.1339194618165493: 100%|██████████| 692/692 [00:22<00:00, 31.43it/s]  


Epoch: 5 Validation Loss: 5.703686714172363
Saved new best model with val loss 5.7037


Epoch: 6 Iteration: 690 Loss: 5.923934745788574 Accuracy: 0.13271935135126114: 100%|██████████| 692/692 [00:20<00:00, 33.64it/s] 


Epoch: 6 Validation Loss: 5.676356792449951
Saved new best model with val loss 5.6764


Epoch: 7 Iteration: 690 Loss: 5.858956670761108 Accuracy: 0.1324831284582615: 100%|██████████| 692/692 [00:23<00:00, 29.52it/s]  


Epoch: 7 Validation Loss: 5.652425765991211
Saved new best model with val loss 5.6524


Epoch: 8 Iteration: 690 Loss: 5.878982305526733 Accuracy: 0.1368647575378418: 100%|██████████| 692/692 [00:26<00:00, 26.24it/s]  


Epoch: 8 Validation Loss: 5.632537841796875
Saved new best model with val loss 5.6325


Epoch: 9 Iteration: 690 Loss: 5.766369581222534 Accuracy: 0.14264036044478418: 100%|██████████| 692/692 [00:28<00:00, 24.58it/s] 


Epoch: 9 Validation Loss: 5.60797643661499
Saved new best model with val loss 5.6080


Epoch: 10 Iteration: 690 Loss: 5.826354312896728 Accuracy: 0.14417929351329803: 100%|██████████| 692/692 [00:27<00:00, 25.09it/s] 


Epoch: 10 Validation Loss: 5.58839225769043
Saved new best model with val loss 5.5884


Epoch: 11 Iteration: 690 Loss: 5.775570011138916 Accuracy: 0.14372445344924928: 100%|██████████| 692/692 [00:25<00:00, 27.05it/s] 


Epoch: 11 Validation Loss: 5.568085193634033
Saved new best model with val loss 5.5681


Epoch: 12 Iteration: 690 Loss: 5.763145399093628 Accuracy: 0.14018619507551194: 100%|██████████| 692/692 [00:26<00:00, 26.45it/s] 


Epoch: 12 Validation Loss: 5.551076889038086
Saved new best model with val loss 5.5511


Epoch: 13 Iteration: 690 Loss: 5.7234416007995605 Accuracy: 0.14127732962369918: 100%|██████████| 692/692 [00:25<00:00, 27.08it/s]


Epoch: 13 Validation Loss: 5.531661510467529
Saved new best model with val loss 5.5317


Epoch: 14 Iteration: 690 Loss: 5.713720846176147 Accuracy: 0.14280691742897034: 100%|██████████| 692/692 [00:23<00:00, 29.11it/s] 


Epoch: 14 Validation Loss: 5.513602256774902
Saved new best model with val loss 5.5136


Epoch: 15 Iteration: 690 Loss: 5.705002641677856 Accuracy: 0.1458980545401573: 100%|██████████| 692/692 [00:23<00:00, 29.43it/s]  


Epoch: 15 Validation Loss: 5.499938011169434
Saved new best model with val loss 5.4999


Epoch: 16 Iteration: 690 Loss: 5.667625427246094 Accuracy: 0.14761812537908553: 100%|██████████| 692/692 [00:24<00:00, 28.18it/s] 


Epoch: 16 Validation Loss: 5.486828804016113
Saved new best model with val loss 5.4868


Epoch: 17 Iteration: 690 Loss: 5.713454627990723 Accuracy: 0.1415018692612648: 100%|██████████| 692/692 [00:23<00:00, 28.90it/s]  


Epoch: 17 Validation Loss: 5.466463565826416
Saved new best model with val loss 5.4665


Epoch: 18 Iteration: 690 Loss: 5.629257535934448 Accuracy: 0.15099240243434905: 100%|██████████| 692/692 [00:24<00:00, 28.44it/s] 


Epoch: 18 Validation Loss: 5.456913471221924
Saved new best model with val loss 5.4569


Epoch: 19 Iteration: 690 Loss: 5.653025484085083 Accuracy: 0.1487760990858078: 100%|██████████| 692/692 [00:23<00:00, 29.46it/s]  


Epoch: 19 Validation Loss: 5.440999984741211
Saved new best model with val loss 5.4410


Epoch: 20 Iteration: 690 Loss: 5.618740892410278 Accuracy: 0.15524579286575318: 100%|██████████| 692/692 [00:23<00:00, 29.21it/s] 


Epoch: 20 Validation Loss: 5.4317169189453125
Saved new best model with val loss 5.4317


Epoch: 21 Iteration: 690 Loss: 5.647732877731324 Accuracy: 0.14403168261051177: 100%|██████████| 692/692 [00:24<00:00, 28.49it/s] 


Epoch: 21 Validation Loss: 5.42484188079834
Saved new best model with val loss 5.4248


Epoch: 22 Iteration: 690 Loss: 5.5970314025878904 Accuracy: 0.1524325877428055: 100%|██████████| 692/692 [00:24<00:00, 28.71it/s] 


Epoch: 22 Validation Loss: 5.410187244415283
Saved new best model with val loss 5.4102


Epoch: 23 Iteration: 690 Loss: 5.588820600509644 Accuracy: 0.15065210908651352: 100%|██████████| 692/692 [00:22<00:00, 30.43it/s] 


Epoch: 23 Validation Loss: 5.398757457733154
Saved new best model with val loss 5.3988


Epoch: 24 Iteration: 690 Loss: 5.639509296417236 Accuracy: 0.14225512892007827: 100%|██████████| 692/692 [00:23<00:00, 29.52it/s] 


Epoch: 24 Validation Loss: 5.385733604431152
Saved new best model with val loss 5.3857


Epoch: 25 Iteration: 690 Loss: 5.544202184677124 Accuracy: 0.15183613300323487: 100%|██████████| 692/692 [00:24<00:00, 28.39it/s] 


Epoch: 25 Validation Loss: 5.377673625946045
Saved new best model with val loss 5.3777


Epoch: 26 Iteration: 690 Loss: 5.583007526397705 Accuracy: 0.15098831802606583: 100%|██████████| 692/692 [00:25<00:00, 27.59it/s] 


Epoch: 26 Validation Loss: 5.367212295532227
Saved new best model with val loss 5.3672


Epoch: 27 Iteration: 690 Loss: 5.530154991149902 Accuracy: 0.1534997895359993: 100%|██████████| 692/692 [00:24<00:00, 28.82it/s]  


Epoch: 27 Validation Loss: 5.358710289001465
Saved new best model with val loss 5.3587


Epoch: 28 Iteration: 690 Loss: 5.562793159484864 Accuracy: 0.15360087156295776: 100%|██████████| 692/692 [00:29<00:00, 23.32it/s] 


Epoch: 28 Validation Loss: 5.34564208984375
Saved new best model with val loss 5.3456


Epoch: 29 Iteration: 690 Loss: 5.5287987232208256 Accuracy: 0.1548133447766304: 100%|██████████| 692/692 [00:24<00:00, 28.65it/s] 


Epoch: 29 Validation Loss: 5.336002826690674
Saved new best model with val loss 5.3360


Epoch: 30 Iteration: 690 Loss: 5.498895120620728 Accuracy: 0.1580825001001358: 100%|██████████| 692/692 [00:19<00:00, 34.95it/s]  


Epoch: 30 Validation Loss: 5.331347942352295
Saved new best model with val loss 5.3313


Epoch: 31 Iteration: 690 Loss: 5.52697434425354 Accuracy: 0.15342593491077422: 100%|██████████| 692/692 [00:20<00:00, 33.88it/s]  


Epoch: 31 Validation Loss: 5.31801700592041
Saved new best model with val loss 5.3180


Epoch: 32 Iteration: 690 Loss: 5.519826745986938 Accuracy: 0.15557309240102768: 100%|██████████| 692/692 [00:20<00:00, 33.44it/s] 


Epoch: 32 Validation Loss: 5.310035228729248
Saved new best model with val loss 5.3100


Epoch: 33 Iteration: 690 Loss: 5.51539158821106 Accuracy: 0.15191710293292998: 100%|██████████| 692/692 [00:19<00:00, 34.92it/s]  


Epoch: 33 Validation Loss: 5.302761554718018
Saved new best model with val loss 5.3028


Epoch: 34 Iteration: 690 Loss: 5.478180646896362 Accuracy: 0.1621485620737076: 100%|██████████| 692/692 [00:26<00:00, 26.02it/s]  


Epoch: 34 Validation Loss: 5.297129154205322
Saved new best model with val loss 5.2971


Epoch: 35 Iteration: 690 Loss: 5.476265907287598 Accuracy: 0.1556912049651146: 100%|██████████| 692/692 [00:23<00:00, 29.95it/s]  


Epoch: 35 Validation Loss: 5.286157608032227
Saved new best model with val loss 5.2862


Epoch: 36 Iteration: 690 Loss: 5.447385311126709 Accuracy: 0.15582208931446076: 100%|██████████| 692/692 [00:20<00:00, 34.50it/s] 


Epoch: 36 Validation Loss: 5.280440330505371
Saved new best model with val loss 5.2804


Epoch: 37 Iteration: 690 Loss: 5.43580117225647 Accuracy: 0.15921226143836975: 100%|██████████| 692/692 [00:20<00:00, 34.13it/s]  


Epoch: 37 Validation Loss: 5.274022102355957
Saved new best model with val loss 5.2740


Epoch: 38 Iteration: 690 Loss: 5.401480770111084 Accuracy: 0.15850600749254226: 100%|██████████| 692/692 [00:20<00:00, 32.96it/s] 


Epoch: 38 Validation Loss: 5.264351844787598
Saved new best model with val loss 5.2644


Epoch: 39 Iteration: 690 Loss: 5.388906860351563 Accuracy: 0.1620214179158211: 100%|██████████| 692/692 [00:20<00:00, 34.50it/s]  


Epoch: 39 Validation Loss: 5.257831573486328
Saved new best model with val loss 5.2578


Epoch: 40 Iteration: 690 Loss: 5.415353918075562 Accuracy: 0.15919750034809113: 100%|██████████| 692/692 [00:19<00:00, 35.76it/s] 


Epoch: 40 Validation Loss: 5.251143455505371
Saved new best model with val loss 5.2511


Epoch: 41 Iteration: 690 Loss: 5.44761381149292 Accuracy: 0.15558430105447768: 100%|██████████| 692/692 [00:19<00:00, 34.98it/s]  


Epoch: 41 Validation Loss: 5.245546817779541
Saved new best model with val loss 5.2455


Epoch: 42 Iteration: 690 Loss: 5.36662540435791 Accuracy: 0.16321303844451904: 100%|██████████| 692/692 [00:20<00:00, 34.44it/s]  


Epoch: 42 Validation Loss: 5.2395219802856445
Saved new best model with val loss 5.2395


Epoch: 43 Iteration: 690 Loss: 5.392172718048096 Accuracy: 0.1592481553554535: 100%|██████████| 692/692 [00:20<00:00, 33.81it/s]  


Epoch: 43 Validation Loss: 5.235273838043213
Saved new best model with val loss 5.2353


Epoch: 44 Iteration: 690 Loss: 5.362756204605103 Accuracy: 0.16221764534711838: 100%|██████████| 692/692 [00:19<00:00, 35.28it/s] 


Epoch: 44 Validation Loss: 5.230124473571777
Saved new best model with val loss 5.2301


Epoch: 45 Iteration: 690 Loss: 5.395526504516601 Accuracy: 0.1574579268693924: 100%|██████████| 692/692 [00:19<00:00, 35.03it/s]  


Epoch: 45 Validation Loss: 5.227705955505371
Saved new best model with val loss 5.2277


Epoch: 46 Iteration: 690 Loss: 5.39258713722229 Accuracy: 0.15831895619630815: 100%|██████████| 692/692 [00:19<00:00, 34.92it/s]  


Epoch: 46 Validation Loss: 5.216131687164307
Saved new best model with val loss 5.2161


Epoch: 47 Iteration: 690 Loss: 5.359222793579102 Accuracy: 0.16452469378709794: 100%|██████████| 692/692 [00:19<00:00, 35.07it/s] 


Epoch: 47 Validation Loss: 5.210713863372803
Saved new best model with val loss 5.2107


Epoch: 48 Iteration: 690 Loss: 5.3504383087158205 Accuracy: 0.16837919801473616: 100%|██████████| 692/692 [00:19<00:00, 35.28it/s]


Epoch: 48 Validation Loss: 5.2068986892700195
Saved new best model with val loss 5.2069


Epoch: 49 Iteration: 690 Loss: 5.381691789627075 Accuracy: 0.1671860173344612: 100%|██████████| 692/692 [00:20<00:00, 34.09it/s]  


Epoch: 49 Validation Loss: 5.198848724365234
Saved new best model with val loss 5.1988


Epoch: 50 Iteration: 690 Loss: 5.379554700851441 Accuracy: 0.16103825569152833: 100%|██████████| 692/692 [00:20<00:00, 34.58it/s] 


Epoch: 50 Validation Loss: 5.1927971839904785
Saved new best model with val loss 5.1928


Epoch: 51 Iteration: 690 Loss: 5.307805156707763 Accuracy: 0.16701227575540542: 100%|██████████| 692/692 [00:20<00:00, 34.01it/s] 


Epoch: 51 Validation Loss: 5.189753532409668
Saved new best model with val loss 5.1898


Epoch: 52 Iteration: 690 Loss: 5.362269306182862 Accuracy: 0.15853222608566284: 100%|██████████| 692/692 [00:19<00:00, 35.17it/s] 


Epoch: 52 Validation Loss: 5.181978225708008
Saved new best model with val loss 5.1820


Epoch: 53 Iteration: 690 Loss: 5.334455060958862 Accuracy: 0.1679845407605171: 100%|██████████| 692/692 [00:19<00:00, 35.03it/s]  


Epoch: 53 Validation Loss: 5.178476333618164
Saved new best model with val loss 5.1785


Epoch: 54 Iteration: 690 Loss: 5.335708713531494 Accuracy: 0.15897155702114105: 100%|██████████| 692/692 [00:20<00:00, 34.08it/s] 


Epoch: 54 Validation Loss: 5.175323963165283
Saved new best model with val loss 5.1753


Epoch: 55 Iteration: 690 Loss: 5.315512180328369 Accuracy: 0.16441715359687806: 100%|██████████| 692/692 [00:20<00:00, 34.17it/s] 


Epoch: 55 Validation Loss: 5.17097282409668
Saved new best model with val loss 5.1710


Epoch: 56 Iteration: 690 Loss: 5.329663801193237 Accuracy: 0.1663459375500679: 100%|██████████| 692/692 [00:19<00:00, 34.84it/s]  


Epoch: 56 Validation Loss: 5.169686794281006
Saved new best model with val loss 5.1697


Epoch: 57 Iteration: 690 Loss: 5.3691082954406735 Accuracy: 0.1621215224266052: 100%|██████████| 692/692 [00:19<00:00, 34.96it/s] 


Epoch: 57 Validation Loss: 5.161103248596191
Saved new best model with val loss 5.1611


Epoch: 58 Iteration: 690 Loss: 5.312686777114868 Accuracy: 0.16163143515586853: 100%|██████████| 692/692 [00:20<00:00, 33.92it/s] 


Epoch: 58 Validation Loss: 5.16357946395874


Epoch: 59 Iteration: 690 Loss: 5.270036029815674 Accuracy: 0.1735748454928398: 100%|██████████| 692/692 [00:19<00:00, 34.99it/s]  


Epoch: 59 Validation Loss: 5.153796195983887
Saved new best model with val loss 5.1538


Epoch: 60 Iteration: 690 Loss: 5.304565525054931 Accuracy: 0.16843605786561966: 100%|██████████| 692/692 [00:19<00:00, 35.22it/s] 


Epoch: 60 Validation Loss: 5.154327392578125


Epoch: 61 Iteration: 690 Loss: 5.275889253616333 Accuracy: 0.16428953260183335: 100%|██████████| 692/692 [00:20<00:00, 34.08it/s] 


Epoch: 61 Validation Loss: 5.146648406982422
Saved new best model with val loss 5.1466


Epoch: 62 Iteration: 690 Loss: 5.2510308742523195 Accuracy: 0.16653278321027756: 100%|██████████| 692/692 [00:19<00:00, 35.01it/s]


Epoch: 62 Validation Loss: 5.144391059875488
Saved new best model with val loss 5.1444


Epoch: 63 Iteration: 690 Loss: 5.317184638977051 Accuracy: 0.1613687828183174: 100%|██████████| 692/692 [00:19<00:00, 35.62it/s]  


Epoch: 63 Validation Loss: 5.13632345199585
Saved new best model with val loss 5.1363


Epoch: 64 Iteration: 690 Loss: 5.238944959640503 Accuracy: 0.1706140384078026: 100%|██████████| 692/692 [00:19<00:00, 35.23it/s]  


Epoch: 64 Validation Loss: 5.137294292449951


Epoch: 65 Iteration: 690 Loss: 5.229941892623901 Accuracy: 0.167493237555027: 100%|██████████| 692/692 [00:19<00:00, 34.99it/s]   


Epoch: 65 Validation Loss: 5.129665374755859
Saved new best model with val loss 5.1297


Epoch: 66 Iteration: 690 Loss: 5.250402450561523 Accuracy: 0.16840819120407105: 100%|██████████| 692/692 [00:19<00:00, 34.90it/s] 


Epoch: 66 Validation Loss: 5.12377405166626
Saved new best model with val loss 5.1238


Epoch: 67 Iteration: 690 Loss: 5.259288930892945 Accuracy: 0.16743122339248656: 100%|██████████| 692/692 [00:20<00:00, 33.97it/s] 


Epoch: 67 Validation Loss: 5.123744010925293
Saved new best model with val loss 5.1237


Epoch: 68 Iteration: 690 Loss: 5.258086729049682 Accuracy: 0.16135288625955582: 100%|██████████| 692/692 [00:19<00:00, 34.79it/s] 


Epoch: 68 Validation Loss: 5.114820957183838
Saved new best model with val loss 5.1148


Epoch: 69 Iteration: 690 Loss: 5.2620518684387205 Accuracy: 0.17078315764665603: 100%|██████████| 692/692 [00:19<00:00, 34.94it/s]


Epoch: 69 Validation Loss: 5.114995956420898


Epoch: 70 Iteration: 690 Loss: 5.194170665740967 Accuracy: 0.171940778195858: 100%|██████████| 692/692 [00:19<00:00, 35.45it/s]   


Epoch: 70 Validation Loss: 5.110679626464844
Saved new best model with val loss 5.1107


Epoch: 71 Iteration: 690 Loss: 5.237292766571045 Accuracy: 0.1709013268351555: 100%|██████████| 692/692 [00:19<00:00, 35.75it/s]  


Epoch: 71 Validation Loss: 5.109731674194336
Saved new best model with val loss 5.1097


Epoch: 72 Iteration: 690 Loss: 5.240435457229614 Accuracy: 0.17283084094524384: 100%|██████████| 692/692 [00:19<00:00, 34.66it/s] 


Epoch: 72 Validation Loss: 5.106288909912109
Saved new best model with val loss 5.1063


Epoch: 73 Iteration: 690 Loss: 5.224604940414428 Accuracy: 0.17080100923776625: 100%|██████████| 692/692 [00:20<00:00, 34.21it/s] 


Epoch: 73 Validation Loss: 5.102226734161377
Saved new best model with val loss 5.1022


Epoch: 74 Iteration: 690 Loss: 5.193443679809571 Accuracy: 0.1721683621406555: 100%|██████████| 692/692 [00:19<00:00, 35.33it/s]  


Epoch: 74 Validation Loss: 5.101139068603516
Saved new best model with val loss 5.1011


Epoch: 75 Iteration: 690 Loss: 5.168806505203247 Accuracy: 0.16913099884986876: 100%|██████████| 692/692 [00:19<00:00, 36.24it/s] 


Epoch: 75 Validation Loss: 5.100174427032471
Saved new best model with val loss 5.1002


Epoch: 76 Iteration: 690 Loss: 5.205361557006836 Accuracy: 0.1693915992975235: 100%|██████████| 692/692 [00:18<00:00, 37.08it/s]  


Epoch: 76 Validation Loss: 5.091855049133301
Saved new best model with val loss 5.0919


Epoch: 77 Iteration: 690 Loss: 5.190276479721069 Accuracy: 0.1733582556247711: 100%|██████████| 692/692 [00:18<00:00, 36.76it/s]  


Epoch: 77 Validation Loss: 5.087942600250244
Saved new best model with val loss 5.0879


Epoch: 78 Iteration: 690 Loss: 5.1544183731079105 Accuracy: 0.1757098063826561: 100%|██████████| 692/692 [00:18<00:00, 37.10it/s]


Epoch: 78 Validation Loss: 5.083929061889648
Saved new best model with val loss 5.0839


Epoch: 79 Iteration: 690 Loss: 5.197062730789185 Accuracy: 0.17282797992229462: 100%|██████████| 692/692 [00:19<00:00, 35.85it/s] 


Epoch: 79 Validation Loss: 5.081719398498535
Saved new best model with val loss 5.0817


Epoch: 80 Iteration: 690 Loss: 5.203112173080444 Accuracy: 0.17010480165481567: 100%|██████████| 692/692 [00:19<00:00, 36.29it/s] 


Epoch: 80 Validation Loss: 5.079762935638428
Saved new best model with val loss 5.0798


Epoch: 81 Iteration: 690 Loss: 5.189410400390625 Accuracy: 0.17284712046384812: 100%|██████████| 692/692 [00:18<00:00, 37.74it/s] 


Epoch: 81 Validation Loss: 5.073680877685547
Saved new best model with val loss 5.0737


Epoch: 82 Iteration: 690 Loss: 5.115394020080567 Accuracy: 0.17907176464796065: 100%|██████████| 692/692 [00:19<00:00, 35.52it/s] 


Epoch: 82 Validation Loss: 5.076968193054199


Epoch: 83 Iteration: 690 Loss: 5.160304594039917 Accuracy: 0.17556862980127336: 100%|██████████| 692/692 [00:18<00:00, 37.37it/s] 


Epoch: 83 Validation Loss: 5.07266902923584
Saved new best model with val loss 5.0727


Epoch: 84 Iteration: 690 Loss: 5.192904996871948 Accuracy: 0.1704835742712021: 100%|██████████| 692/692 [00:18<00:00, 38.16it/s]  


Epoch: 84 Validation Loss: 5.075275421142578


Epoch: 85 Iteration: 690 Loss: 5.147973728179932 Accuracy: 0.1755061998963356: 100%|██████████| 692/692 [00:19<00:00, 36.21it/s]  


Epoch: 85 Validation Loss: 5.066590309143066
Saved new best model with val loss 5.0666


Epoch: 86 Iteration: 690 Loss: 5.139589071273804 Accuracy: 0.17305618524551392: 100%|██████████| 692/692 [00:19<00:00, 35.85it/s] 


Epoch: 86 Validation Loss: 5.063330173492432
Saved new best model with val loss 5.0633


Epoch: 87 Iteration: 690 Loss: 5.100084447860718 Accuracy: 0.17856552600860595: 100%|██████████| 692/692 [00:19<00:00, 36.39it/s] 


Epoch: 87 Validation Loss: 5.061854362487793
Saved new best model with val loss 5.0619


Epoch: 88 Iteration: 690 Loss: 5.086931228637695 Accuracy: 0.1848041146993637: 100%|██████████| 692/692 [00:18<00:00, 36.81it/s]  


Epoch: 88 Validation Loss: 5.05889368057251
Saved new best model with val loss 5.0589


Epoch: 89 Iteration: 690 Loss: 5.19562873840332 Accuracy: 0.16556521207094194: 100%|██████████| 692/692 [00:19<00:00, 36.08it/s]  


Epoch: 89 Validation Loss: 5.0569634437561035
Saved new best model with val loss 5.0570


Epoch: 90 Iteration: 690 Loss: 5.168718099594116 Accuracy: 0.17122071981430054: 100%|██████████| 692/692 [00:18<00:00, 37.47it/s] 


Epoch: 90 Validation Loss: 5.055038928985596
Saved new best model with val loss 5.0550


Epoch: 91 Iteration: 690 Loss: 5.126359510421753 Accuracy: 0.18240851908922195: 100%|██████████| 692/692 [00:19<00:00, 36.36it/s] 


Epoch: 91 Validation Loss: 5.0518879890441895
Saved new best model with val loss 5.0519


Epoch: 92 Iteration: 690 Loss: 5.077395582199097 Accuracy: 0.18193917125463485: 100%|██████████| 692/692 [00:18<00:00, 36.45it/s] 


Epoch: 92 Validation Loss: 5.048563480377197
Saved new best model with val loss 5.0486


Epoch: 93 Iteration: 690 Loss: 5.14219069480896 Accuracy: 0.17767405956983567: 100%|██████████| 692/692 [00:18<00:00, 37.04it/s]  


Epoch: 93 Validation Loss: 5.049846172332764


Epoch: 94 Iteration: 690 Loss: 5.120662593841553 Accuracy: 0.18115674704313278: 100%|██████████| 692/692 [00:18<00:00, 37.42it/s] 


Epoch: 94 Validation Loss: 5.045170783996582
Saved new best model with val loss 5.0452


Epoch: 95 Iteration: 690 Loss: 5.117073106765747 Accuracy: 0.17258478850126266: 100%|██████████| 692/692 [00:18<00:00, 37.06it/s] 


Epoch: 95 Validation Loss: 5.042312145233154
Saved new best model with val loss 5.0423


Epoch: 96 Iteration: 690 Loss: 5.083605003356934 Accuracy: 0.17361707538366317: 100%|██████████| 692/692 [00:18<00:00, 37.10it/s] 


Epoch: 96 Validation Loss: 5.039815902709961
Saved new best model with val loss 5.0398


Epoch: 97 Iteration: 690 Loss: 5.104813909530639 Accuracy: 0.1747190147638321: 100%|██████████| 692/692 [00:18<00:00, 36.73it/s]  


Epoch: 97 Validation Loss: 5.038727283477783
Saved new best model with val loss 5.0387


Epoch: 98 Iteration: 690 Loss: 5.084809398651123 Accuracy: 0.17381160259246825: 100%|██████████| 692/692 [00:19<00:00, 35.42it/s] 


Epoch: 98 Validation Loss: 5.033129692077637
Saved new best model with val loss 5.0331


Epoch: 99 Iteration: 690 Loss: 5.128803777694702 Accuracy: 0.17265601307153702: 100%|██████████| 692/692 [00:19<00:00, 35.40it/s] 


Epoch: 99 Validation Loss: 5.034183025360107


Epoch: 100 Iteration: 690 Loss: 5.097299289703369 Accuracy: 0.18209814876317978: 100%|██████████| 692/692 [00:18<00:00, 36.85it/s] 


Epoch: 100 Validation Loss: 5.029355049133301
Saved new best model with val loss 5.0294


Epoch: 101 Iteration: 690 Loss: 5.0477681159973145 Accuracy: 0.18178484290838243: 100%|██████████| 692/692 [00:18<00:00, 36.68it/s]


Epoch: 101 Validation Loss: 5.031768798828125


Epoch: 102 Iteration: 690 Loss: 5.063678789138794 Accuracy: 0.1765623062849045: 100%|██████████| 692/692 [00:20<00:00, 34.24it/s]  


Epoch: 102 Validation Loss: 5.028562068939209
Saved new best model with val loss 5.0286


Epoch: 103 Iteration: 690 Loss: 5.135699319839477 Accuracy: 0.17221090495586394: 100%|██████████| 692/692 [00:19<00:00, 35.68it/s] 


Epoch: 103 Validation Loss: 5.025163650512695
Saved new best model with val loss 5.0252


Epoch: 104 Iteration: 690 Loss: 5.1306990623474125 Accuracy: 0.1759599193930626: 100%|██████████| 692/692 [00:19<00:00, 35.08it/s] 


Epoch: 104 Validation Loss: 5.024940013885498
Saved new best model with val loss 5.0249


Epoch: 105 Iteration: 690 Loss: 5.077016830444336 Accuracy: 0.17874182611703873: 100%|██████████| 692/692 [00:19<00:00, 35.40it/s] 


Epoch: 105 Validation Loss: 5.019539833068848
Saved new best model with val loss 5.0195


Epoch: 106 Iteration: 690 Loss: 5.097570991516113 Accuracy: 0.17861033231019974: 100%|██████████| 692/692 [00:18<00:00, 37.11it/s] 


Epoch: 106 Validation Loss: 5.022732734680176


Epoch: 107 Iteration: 690 Loss: 5.090868616104126 Accuracy: 0.17380513101816178: 100%|██████████| 692/692 [00:18<00:00, 37.04it/s] 


Epoch: 107 Validation Loss: 5.018594264984131
Saved new best model with val loss 5.0186


Epoch: 108 Iteration: 690 Loss: 5.051522302627563 Accuracy: 0.1832921177148819: 100%|██████████| 692/692 [00:18<00:00, 36.59it/s]  


Epoch: 108 Validation Loss: 5.020631313323975


Epoch: 109 Iteration: 690 Loss: 5.075745105743408 Accuracy: 0.17875033468008042: 100%|██████████| 692/692 [00:19<00:00, 35.67it/s] 


Epoch: 109 Validation Loss: 5.016787528991699
Saved new best model with val loss 5.0168


Epoch: 110 Iteration: 690 Loss: 5.0380988121032715 Accuracy: 0.17348500341176987: 100%|██████████| 692/692 [00:20<00:00, 33.71it/s]


Epoch: 110 Validation Loss: 5.018561363220215


Epoch: 111 Iteration: 690 Loss: 5.041338729858398 Accuracy: 0.1729518309235573: 100%|██████████| 692/692 [00:20<00:00, 33.62it/s]  


Epoch: 111 Validation Loss: 5.01122522354126
Saved new best model with val loss 5.0112


Epoch: 112 Iteration: 690 Loss: 5.063911724090576 Accuracy: 0.17614165246486663: 100%|██████████| 692/692 [00:19<00:00, 34.60it/s] 


Epoch: 112 Validation Loss: 5.012146472930908


Epoch: 113 Iteration: 690 Loss: 5.072928810119629 Accuracy: 0.1794956922531128: 100%|██████████| 692/692 [00:20<00:00, 34.47it/s]  


Epoch: 113 Validation Loss: 5.006381034851074
Saved new best model with val loss 5.0064


Epoch: 114 Iteration: 690 Loss: 5.042406225204468 Accuracy: 0.17670904844999313: 100%|██████████| 692/692 [00:20<00:00, 33.12it/s] 


Epoch: 114 Validation Loss: 5.007034778594971


Epoch: 115 Iteration: 690 Loss: 5.048744201660156 Accuracy: 0.18068117499351502: 100%|██████████| 692/692 [00:20<00:00, 34.07it/s] 


Epoch: 115 Validation Loss: 5.004249572753906
Saved new best model with val loss 5.0042


Epoch: 116 Iteration: 690 Loss: 5.035372018814087 Accuracy: 0.1800599843263626: 100%|██████████| 692/692 [00:26<00:00, 26.33it/s]  


Epoch: 116 Validation Loss: 5.001808166503906
Saved new best model with val loss 5.0018


Epoch: 117 Iteration: 690 Loss: 5.0279004096984865 Accuracy: 0.18048894554376602: 100%|██████████| 692/692 [00:20<00:00, 34.05it/s]


Epoch: 117 Validation Loss: 5.001772880554199
Saved new best model with val loss 5.0018


Epoch: 118 Iteration: 690 Loss: 5.081632804870606 Accuracy: 0.17443606108427048: 100%|██████████| 692/692 [00:20<00:00, 33.31it/s] 


Epoch: 118 Validation Loss: 4.996859550476074
Saved new best model with val loss 4.9969


# Using the language model

Congratulations, you have now trained a language model! We can now use it to evaluate likely news headlines, as well as generate our very own headlines.

**TODO**: Complete the three parts below, using the model you have trained.

## (1) Evaluation loss

To evaluate the language model, we measure its loss (its ability to predict the next word) on unseen data that is reserved for evaluation.
Your first task is to load the model you trained and compute its loss on the validation set. If you are running this evaluation without having trained in the current session, run the setup cell above the training loop first.

In [39]:
model_id = "test1"
save_dict = th.load(root_folder+'models/part1/'+f"model_{model_id}.pt",map_location='cpu')
model = LanguageModel(**save_dict['kwargs'])
model.load_state_dict(save_dict['model_state_dict'])
model.eval()

LanguageModel(
  (embedding): Embedding(10000, 256)
  (embed_dropout): Dropout(p=0.35, inplace=False)
  (lstm): LSTM(256, 256, num_layers=2, batch_first=True, dropout=0.5)
  (layernorm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (output): Linear(in_features=256, out_features=10000, bias=True)
)

In [40]:
# We will evaluate your model in the best_models folder
# In a very similar way as the code below.
# Make sure your validation loss is below the threshold we specified
# and that you didn't train using the validation set

batch = build_batch(d_valid, range(len(d_valid)))
(batch_input, batch_target, batch_target_mask) = batch_to_torch(*batch)
with th.no_grad():  # gradients are not needed for evaluation
    prediction = model(batch_input.long())
    loss = loss_fn(prediction, batch_target, batch_target_mask)
print("Evaluation set loss:", loss.item())

Evaluation set loss: 4.996857643127441
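
A cross-entropy loss is often easier to interpret as a perplexity. The following is a minimal sketch, assuming `loss_fn` returns the mean per-token negative log-likelihood in nats (the usual convention for cross-entropy in PyTorch, but not confirmed by this section). The `perplexity` helper is introduced here for illustration only.

```python
import math

def perplexity(mean_nll_nats: float) -> float:
    """Convert a mean per-token negative log-likelihood (in nats) to perplexity."""
    return math.exp(mean_nll_nats)

# Validation loss printed by the cell above.
val_loss = 4.996857643127441
print(f"Perplexity: {perplexity(val_loss):.1f}")

# Reference point: a uniform guess over a 10,000-word vocabulary
# (the size of the output layer shown above) has perplexity 10,000.
uniform_loss = math.log(10000)
print(f"Uniform-guess loss: {uniform_loss:.3f}")
```

Under that assumption, a loss of about 5.0 corresponds to a perplexity of roughly 148: on average the model is about as uncertain as if it were choosing uniformly among 148 words, compared with 10,000 for a model that has learned nothing.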


In [41]:
# Your best performing model should go here.
os.makedirs(root_folder+"best_models",exist_ok=True)
best_model_file = root_folder+"best_models/part1_best_model.pt"
th.save(save_dict,best_model_file)

## (2) Evaluation of likelihood of data

One use of a language model is to judge how likely a piece of text is under the distribution of the training data. Because we have trained our model on news headlines, we can see which of these headlines is more likely:

``Apple to release new iPhone in July``

``Apple and Samsung resolve all lawsuits``

**TODO**: Use the model to obtain the loss the neural network assigns to each sentence.
Because the network assigns a probability to each word in a sequence, this loss can be used as a proxy for how likely the sentence is to have occurred in the dataset: a lower loss means a more likely sentence.
Once you have the loss for each headline, write down which sentence was judged more likely, and explain whether you think this is coherent and why.
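
To see why the loss works as a likelihood proxy, here is a small sketch on made-up numbers. It does not use the trained model: `step_probs` is a hypothetical stand-in for the per-position distributions the network would produce, and `sequence_loss` is a helper written for this example, assuming the loss is a masked mean of per-token negative log-probabilities.

```python
import math

def sequence_loss(step_probs, targets, mask):
    """Mean negative log-likelihood over the unmasked positions.

    step_probs[t] maps each candidate word to P(word | words before t);
    targets[t] is the word that actually appears at position t;
    mask[t] is 1 for real tokens and 0 for padding.
    """
    total = sum(-math.log(step_probs[t][targets[t]]) * mask[t]
                for t in range(len(targets)))
    return total / sum(mask)

# Toy distributions, made up for illustration (not model output).
step_probs = [
    {"apple": 0.6, "samsung": 0.3, "PAD": 0.1},
    {"releases": 0.7, "sues": 0.2, "PAD": 0.1},
    {"PAD": 1.0},
]
mask = [1, 1, 0]  # the last position is padding and is ignored

likely = sequence_loss(step_probs, ["apple", "releases", "PAD"], mask)
unlikely = sequence_loss(step_probs, ["samsung", "sues", "PAD"], mask)
print("loss of likely sequence:  ", likely)
print("loss of unlikely sequence:", unlikely)
```

The sequence made of higher-probability words gets the lower loss, which is the comparison you are asked to make between the two headlines.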

**Your answer:**


In [42]:
def raw_sample_pred(headline, model):
    #####
    # BEGIN YOUR CODE HERE 
    #####
    # From the code in the Preprocessing section at the end of the notebook
    # Find out how to tokenize the headline
    tokenized = your_code

    # Find out how to numerize the tokenized headline
    numerized = your_code

    # Learn how to pad and obtain the mask of the sequence.
    padded, mask = your_code

    # Obtain the predicted headline and target headline
    input_headline = your_code
    pred_headline = your_code
    target_headline = your_code
    mask = your_code

    #####
    # END YOUR CODE HERE 
    #####

    return pred_headline,target_headline,mask

In [43]:
model.eval()

headline1 = "Apple to release new iPhone in July"
headline2 = "Apple and Samsung resolve all lawsuits"

headlines = [headline1.lower(), headline2.lower()] # Our LSTM is trained on lower-cased headlines
for headline in headlines:
    pred_headline,target_headline,mask = raw_sample_pred(headline, model)
    loss = your_code # Obtain the loss
    
    print("----------------------------------------")
    print("Headline:", headline)
    print("Loss of the headline:", loss)
validate_to_array(raw_sample_pred,zip(headlines,[model]*2),'raw_sample_pred',root_folder,multi=True)
# Important check: one headline should be more likely (and have lower loss)
# than the other headline. You should know which headline should have lower loss.

NameError: name 'your_code' is not defined

## (3) Generation of headlines

We can use our language model to generate text according to the distribution of our training data.
The way generation works is the following:

We seed the model with the beginning of a sequence, and obtain the distribution over the next word.
We select the most likely word (argmax) and append it to our sequence of words.
Now our sequence is one word longer, and we can feed it in again as input for the network to produce the next word.
We repeat this a fixed number of times (up to 20 words), and obtain automatically generated headlines!
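
The loop described above can be sketched on a toy example. This is not the assignment solution: `toy_model` is a hypothetical table of next-word probabilities that looks only at the last word, whereas the LSTM conditions on the whole sequence, and the `<END>` token is an assumption of this sketch that the notebook's vocabulary does not define.

```python
def greedy_generate(starter, next_word_probs, max_words=20, end_token="<END>"):
    """Repeatedly append the most likely next word (argmax) to the sequence."""
    words = list(starter)
    while len(words) < max_words:
        # Toy stand-in for the LSTM: condition only on the last word.
        dist = next_word_probs.get(words[-1])
        if dist is None:
            break
        best = max(dist, key=dist.get)  # argmax over the distribution
        if best == end_token:
            break
        words.append(best)
    return " ".join(words)

# Hypothetical bigram probabilities, made up for illustration.
toy_model = {
    "apple": {"has": 0.5, "to": 0.4, "<END>": 0.1},
    "has": {"released": 0.8, "<END>": 0.2},
    "released": {"a": 0.6, "<END>": 0.4},
    "a": {"phone": 0.7, "<END>": 0.3},
    "phone": {"<END>": 1.0},
}
print(greedy_generate(["apple"], toy_model))  # apple has released a phone
```

Because argmax is deterministic, the same starter always yields the same headline.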


We have provided a few headline starters that should produce interesting generated headlines.

**TODO:** Get creative and find at least 2 more headline_starters that produce interesting headlines.

In [None]:
def generate_sentence(headline_starter, model):
    # Tokenize and numerize the headline. Put the numerized headline
    # beginning in `current_build`
    tokenized = tokenizer.word_tokenizer(headline_starter.lower())
    current_build = [startI] + numerize_sequence(tokenized)

    while len(current_build) < input_length:
        # Pad the current_build into a input_length vector.
        # We do this so that it can be processed by our LanguageModel class
        current_padded, _m = pad_sequence(current_build, padI, input_length)

        # Obtain the logits for the current padded sequence
        # This involves obtaining the output_logits from our model,
        # and not the loss like we have done so far
        logits = your_code
        logits_np = logits.detach().cpu().numpy()

        # Obtain the row of logits that interests us: the logits for the last
        # non-pad input
        last_logits = your_code

        # Find the highest scoring words in the last_logits
        # array, or sample from the softmax.
        # The np.argmax function may be useful for first option,
        # sp.special.softmax and np.random.choice may be useful for second option.
        # Append this word to our current build
        current_build.append(your_code)

    # Go from the current_build of word_indices
    # To the headline (string) produced. This should involve
    # the vocabulary, and a string merger.
    produced_sentence = your_code
    return produced_sentence
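
The comments in the cell above mention two ways to pick the next word: argmax, or sampling from the softmax. The sketch below contrasts them on made-up logits using only the standard library (the cell suggests `np.argmax`, `sp.special.softmax` and `np.random.choice` instead). `softmax` and `pick_next` are hypothetical helpers for this example.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next(logits, greedy=True, rng=random):
    """Return the index of the next word: argmax, or a sample from the softmax."""
    if greedy:
        return max(range(len(logits)), key=lambda i: logits[i])
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-word vocabulary
print("greedy pick:", pick_next(toy_logits))

rng = random.Random(0)  # seeded so the run is repeatable
samples = [pick_next(toy_logits, greedy=False, rng=rng) for _ in range(10)]
print("sampled picks:", samples)
```

Greedy decoding always returns the highest-scoring index, while sampling can return any index in proportion to its probability, which tends to give more varied headlines.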

In [None]:
model.eval()
# Here are some headline starters.
# They're all about tech companies, because
# that is what is in our dataset
headline_starters = ["apple has released", "google has released", "amazon", "tesla to"]
for headline_starter in headline_starters:
    print("===================")
    print("Generating headline starting with: "+headline_starter)

    produced_sentence = generate_sentence(headline_starter, model)
    print(produced_sentence)
validate_to_array(generate_sentence,zip(headline_starters,[model]*len(headline_starters)),"generate_sentence",root_folder,multi=True)

## All done

You are done with the first part of the homework.

The next notebook deals with text summarization!


# Preprocessing (read only)


**You can skip this section; however, you may find these functions useful in later sections of this notebook.**

We have provided this code so you can see how the dataset was generated. You will have to come back to some of these functions later in the assignment, so feel free to read through it to get familiar.

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

for a in dataset:
    a['tokenized'] = tokenizer.word_tokenizer(a['title'].lower())

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

word_counts = Counter()
for a in dataset:
    word_counts.update(a['tokenized'])

print(word_counts.most_common(30))

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

# Creating the vocab
vocab_size = 20000
special_words = ["<START>", "UNK", "PAD"]
vocabulary = special_words + [w for w, c in word_counts.most_common(vocab_size-len(special_words))]
w2i = {w: i for i, w in enumerate(vocabulary)}

# Numerizing and padding
input_length = 20
unkI, padI, startI = w2i['UNK'], w2i['PAD'], w2i['<START>']

for a in dataset:
    a['numerized'] = numerize_sequence(a['tokenized']) # Change words to IDs
    a['numerized'], a['mask'] = pad_sequence(a['numerized'], padI, input_length) # Append appropriate PAD tokens
    
# Compute fraction of words that are UNK:
word_counters = Counter([w for a in dataset for w in a['numerized'] if w != padI])

print("Fraction of UNK words:", float(word_counters[unkI]) / sum(word_counters.values()))
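
The helpers `numerize_sequence` and `pad_sequence` are defined elsewhere in the notebook. As a rough sketch of what the cell above does, here is the same pipeline on a toy corpus. The `_sketch` helpers and their signatures are assumptions written for this example, not the notebook's actual implementations.

```python
from collections import Counter

def numerize_sequence_sketch(tokens, w2i, unk_id):
    """Map tokens to vocabulary IDs, falling back to UNK for unknown words."""
    return [w2i.get(tok, unk_id) for tok in tokens]

def pad_sequence_sketch(ids, pad_id, length):
    """Truncate or right-pad `ids` to `length`; the mask is 1 for real tokens."""
    ids = ids[:length]
    n_pad = length - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad

# Toy corpus, already tokenized and lower-cased.
corpus = [["apple", "releases", "iphone"], ["apple", "sues", "samsung"]]
counts = Counter(w for sent in corpus for w in sent)

# Same vocabulary construction as the cell above, with a tiny vocab_size.
vocab_size = 6
special_words = ["<START>", "UNK", "PAD"]
vocabulary = special_words + [w for w, _ in counts.most_common(vocab_size - len(special_words))]
w2i = {w: i for i, w in enumerate(vocabulary)}
unkI, padI = w2i["UNK"], w2i["PAD"]

# "google" never appears in the corpus, so it maps to UNK.
ids = numerize_sequence_sketch(["apple", "sues", "google"], w2i, unkI)
padded, mask = pad_sequence_sketch(ids, padI, 5)
print(vocabulary)
print(padded, mask)
```

The mask marks which positions hold real tokens, so that padding can be excluded when the loss is computed.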

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

d_released_processed   = [d for d in dataset if d['cut'] != 'testing']
d_unreleased_processed = [d for d in dataset if d['cut'] == 'testing']

with open("dataset/headline_generation_dataset_processed.json", "w") as f:
    json.dump(d_released_processed, f)

# This file is purposefully left out of the assignment, we will use it to evaluate your model.
with open("dataset/headline_generation_dataset_unreleased_processed.json", "w") as f:
    json.dump(d_unreleased_processed, f)
    
with open("dataset/headline_generation_vocabulary.txt", "w", encoding="utf8") as f:
    f.write("\n".join(vocabulary))