# Introduction

<center><h3>**Welcome to the Summarization Notebook.**</h3></center>

In this assignment, you are going to train a neural network to summarize news articles.
Your neural network is going to learn from example, as we provide you with (article, summary) pairs.
We provide you with a **toy dataset** made of only articles about police related news.
Usual datasets can be 20x larger in size, but we have reduced it for computational purposes.

You will do this using a Transformer network, from the __[Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)__ paper.
In this assignment you will:
- Learn to process text into sub-word tokens, to avoid fixed vocabulary sizes, and UNK tokens.
- Implement the key conceptual blocks of a Transformer.
- Use a Transformer to read a news article, and produce a summary.
- Perform operations on learned word-vectors to examine what the model has learned.

    
** Before you start **

You should read the Attention is all you need paper.
We are providing you with skeleton code for the Transformer, but there will have to implement 5 conceptual blocks of the transformer yourself:
-  AttentionQKV: the Query, Key, Value attention mechanism at the center of the Transformer
- MultiHeadAttention: the multiple heads that enable each input to attend at many places at once.
- PositionEmbedding: the sinusoid-based position embedding of the Transformer.
- Encoder & Decoder: The encoder (that reads inputs, such as news articles), the decoder (that produces the output summary, one token at a time)
- Full Transformer: piecing it all together.

All dataset files should be placed in the `dataset/` folder of this assignment.

If you are using Google Colab, follow the instructions to mount your Google Drive onto the remote machine.

# Library imports

In [1]:
!pip install segtok
!pip install sentencepiece

Collecting sentencepiece
  Using cached sentencepiece-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Using cached sentencepiece-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.2.0


Run the first of the following two cells if you are running the homework locally, and run the second cell if you are running the homework in Colab

In [2]:
DRIVE=False
root_folder = ""
dataset_folder = "dataset/"

In [None]:
from google.colab import drive
drive.mount('/content/drive')
root_folder = "/content/drive/My Drive/cs182_hw3/"
dataset_folder = "/content/drive/My Drive/cs182_hw3_public/dataset/"

In [3]:
# This cell autoreloads the notebook when you change you python file code.
# If you think the notebook did not reload, rerun this cell.
%load_ext autoreload
%autoreload 2

In [4]:
import os
import sys
sys.path.append(root_folder)
# from transformer import Transformer
import sentencepiece as spm
import torch as th
from torch import nn
from torch.nn import functional as F
from torch import optim
import numpy as np
import json
import capita
import os
from transformer_utils import set_device
import gc
from utils import validate_to_array, model_out_to_list
device = th.device("cuda" if th.cuda.is_available() else "cpu")
list_to_device = lambda th_obj: [tensor.to(device) for tensor in th_obj]

In [5]:
# Load the word piece model that will be used to tokenize the texts into
# word pieces with a vocabulary size of 10000
sp = spm.SentencePieceProcessor()
sp.Load(root_folder+"dataset/wp_vocab10000.model")

vocab = [line.split('\t')[0] for line in open(root_folder+"dataset/wp_vocab10000.vocab", "r")]
pad_index = vocab.index('#')

def pad_sequence(numerized, pad_index, to_length):
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    mask = [w != pad_index for w in padded]
    return padded, mask

# Building blocks of a Transformer


**TODO**:

Implement the 5 blocks of the Transformer. In order to finish this section, you should get very small error <1e-7 on each of the 5 checks in this section.


The Transformer is split into 3 files: transformer_attention.py, transformer_utils.py and transformer.py

Each section below gives you directions and a way to verify your code works properly.

You do not need to modify the rest of the code provided, but should read it to understand overall architecture.

Our Transformer is built as a Pytorch model, a standard that is good for you to get accustomed to.



## (1) Implementing the Query-Key-Value Attention (AttentionQKV)

This part is located in AttentionQKV in transformer_attention.py. You must implement the call function of the class.
You will need to implement the mathematical procedure of AttentionQKV that is described in the [Attention is all you need paper](https://arxiv.org/pdf/1706.03762.pdf).

In [6]:
from transformer_attention import AttentionQKV

batch_size = 2;
n_queries = 3;
n_keyval = 5;
depth_k = 2;
depth_v = 2

with open(root_folder+"transformer_checks/attention_qkv_io.json", "r") as f:
    io = json.load(f)
    queries = th.tensor(io['queries'])
    keys = th.tensor(io['keys'])
    values = th.tensor(io['values'])
    expected_output  = th.tensor(io['output'])
    expected_weights = th.tensor(io['weights'])

attn_qkv = AttentionQKV()
output, weights = attn_qkv(queries, keys, values)
# validate_to_array(model_out_to_list,((queries,keys,values),attn_qkv),'attentionqkv', root_folder)
print("Total error on the output:",th.sum(th.abs(expected_output-output)).item(), "(should be 0.0 or close to 0.0)")
print("Total error on the weights:",th.sum(th.abs(expected_weights-weights)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 2.8312206268310547e-07 (should be 0.0 or close to 0.0)
Total error on the weights: 2.849847078323364e-07 (should be 0.0 or close to 0.0)


## (2) Implementing Multi-head attention

This part is located in the class MultiHeadProjection in transformer_attention.py.
You must implement the call, \_split_heads, and \_combine_heads functions.

**Procedure**

The objective is to leverage the AttentionQKV class you already wrote.

Your input are the queries, keys, values as 3-d tensors (batch_size, sequence_length, feature_size).

Split them into 4-d tensors (batch_size, n_heads, sequence_length, new_feature_size). Where:
$$feature\_size = n\_heads * new_{feature\_size}.$$

You can then feed the split qkv to your implemented AttentionQKV, which will treat each head as an independent attention function.

Then the output must be combined back into a 3-d tensor.
You can test the validity of your implementation in the cell below.

In [7]:
from transformer_attention import MultiHeadProjection

batch_size = 2;
n_queries = 3;
n_heads = 4
n_keyval = 5;
depth_k = 8;
depth_v = 8;

with open(root_folder+"transformer_checks/multihead_io.json", "r") as f:
    io = json.load(f)
    queries = th.tensor(io['queries'])
    keys = th.tensor(io['keys'])
    values = th.tensor(io['values'])
    expected_output  = th.tensor(io['output'])

mhp = MultiHeadProjection(n_heads, (depth_k,depth_v))
multihead_output = mhp((queries, keys, values))
#validate_to_array(model_out_to_list,(((queries,keys,values),),mhp),'multihead', root_folder)
print("Total error on the output:",th.sum(th.abs(expected_output-multihead_output)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 1.5934929251670837e-06 (should be 0.0 or close to 0.0)


## (3) Position Embedding 

You must implement the FeedForward and PositionEmbedding classes in transformer.py.


The cell below helps you verify the validity of your implementation


In [8]:
from transformer import PositionEmbedding

batch_size = 2;
sequence_length = 3;
dim = 4;

with open(root_folder+"transformer_checks/position_embedding_io.json", "r") as f:
    io = json.load(f)
    inputs = th.tensor(io['inputs'])
    expected_output  = th.tensor(io['output'])

pos_emb = PositionEmbedding(dim)
(inputs,expected_output,pos_emb) = list_to_device((inputs,expected_output,pos_emb))
output_t = pos_emb(inputs)
# validate_to_array(model_out_to_list,((inputs,),pos_emb),'position_embedding', root_folder)
print("Total error on the output:",th.sum(th.abs(expected_output-output_t)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 2.980232238769531e-07 (should be 0.0 or close to 0.0)


## (4) Transformer Encoder / Transformer Decoder

You now have all the blocks needed to implement the Transformer.
For this part, you have to fill in 2 classes in the transformer.py file: TransformerEncoderBlock, TransformerDecoderBlock.

The code below will verify the accuracy of each block

In [9]:
from transformer import TransformerEncoderBlock

batch_size = 2
sequence_length = 5
hidden_size = 6
filter_size = 12
n_heads = 2

with open(root_folder+"transformer_checks/transformer_encoder_block_io_new.json", "r") as f:
    io = json.load(f)
    inputs = th.tensor(io['inputs'])
    expected_output = th.tensor(io['output'])
enc_block = TransformerEncoderBlock(input_size=6, n_heads=n_heads, filter_size=filter_size, hidden_size=hidden_size)
# th.save(enc_block.state_dict(),root_folder+"transformer_checks/transformer_encoder_block")
enc_block.load_state_dict(th.load(root_folder+"transformer_checks/transformer_encoder_block"))
(inputs,expected_output,enc_block) = list_to_device((inputs,expected_output,enc_block))
output_t = enc_block(inputs)
# validate_to_array(model_out_to_list,((inputs,),enc_block),'encoder_block', root_folder)
print("Total error on the output:",th.sum(th.abs(expected_output-output_t)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 4.999339580535889e-06 (should be 0.0 or close to 0.0)


  WeightNorm.apply(module, name, dim)


In [10]:
from transformer import TransformerDecoderBlock
batch_size = 2
encoder_length = 5
decoder_length = 3
hidden_size = 6
filter_size = 12
n_heads = 2

with open(root_folder+"transformer_checks/transformer_decoder_block_io_new.json", "r") as f:
    io = json.load(f)
    decoder_inputs = th.tensor(io['decoder_inputs'])
    encoder_output = th.tensor(io['encoder_output'])
    expected_output = th.tensor(io['output'])

dec_block = TransformerDecoderBlock(input_size=6, n_heads=n_heads, filter_size=filter_size, hidden_size=hidden_size)
dec_block.load_state_dict(th.load(root_folder+"transformer_checks/transformer_decoder_block"))
(decoder_inputs,encoder_output,expected_output,dec_block) = list_to_device((decoder_inputs,encoder_output,expected_output,dec_block))
output_t = dec_block(decoder_inputs, encoder_output)
validate_to_array(model_out_to_list,((decoder_inputs, encoder_output),dec_block),'decoder_block', root_folder)
print("Total error on the output:",th.sum(th.abs(expected_output-output_t)).item(), "(should be 0.0 or close to 0.0)")


Total error on the output: 3.2186508178710938e-06 (should be 0.0 or close to 0.0)


## (5) Transformer

This is the final high-level function that pieces it all together.

You have to implement the call function of the Transformer class in the `transformer.py` file.

The block below verifies your implementation is correct.

In [11]:
from transformer import Transformer

batch_size = 2
vocab_size = 11
n_layers = 3
n_heads = 4
d_model = 8
d_filter = 16
input_length = 5
output_length = 3

with open(root_folder+"transformer_checks/transformer_io_new.json", "r") as f:
    io = json.load(f)
    enc_input = th.tensor(io['enc_input'])
    dec_input = th.tensor(io['dec_input'])
    enc_mask = th.tensor(io['enc_mask'])
    dec_mask = th.tensor(io['dec_mask'])
    expected_output = th.tensor(io['output'])
transformer = Transformer(vocab_size=vocab_size, n_layers=n_layers, n_heads=n_heads, d_model=d_model, d_filter=d_filter)
transformer.load_state_dict(th.load(root_folder+"transformer_checks/transformer"))
(enc_input,dec_input,enc_mask,dec_mask,expected_output,transformer) \
    = list_to_device((enc_input,dec_input,enc_mask,dec_mask,expected_output,transformer))
output_t = transformer(enc_input, target_sequence=dec_input, encoder_mask=enc_mask, decoder_mask=dec_mask)
validate_to_array(model_out_to_list, ((enc_input, dec_input, enc_mask, dec_mask),transformer),'transformer', root_folder)
print("Total error on the output:",th.sum(th.abs(expected_output-output_t)).item(), "(should be 0.0 or close to 0.0)")

Total error on the output: 5.602836608886719e-05 (should be 0.0 or close to 0.0)


# Training the model

Your objective is to train the Language on the dataset you are provided to reach a **validation loss <= 6.50**

Careful: we will be testing this loss on an unreleased test set, so make sure to evaluate properly on a validation set and not overfit.

You must save the model you want us to test under: models/final_transformer_summarization (the .index, .meta and .data files)

**Advice**:
- It should be possible to attain validation loss <= 6.50 with the model dimensions we've specified (n_layers=6, d_model=104, d_filter=416), but you can tune these hyperparameters. Increasing d_model will yield better model, at the cost of longer training time.
- You should try tuning the learning rate, as well as what optimizer you use.
- You might need to train for a few (up to 2 hours) to obtain our expected loss. Remember to tune your hyperparameters first, once you find ones that work well, let it train for longer.

**Dataset**: as in the previous notebook, make sure the dataset files are in the `dataset` folder. These can be found on the Google Drive.


In [12]:
with open(root_folder+"dataset/summarization_dataset_preprocessed.json", "r") as f:
    dataset = json.load(f)

# We load the dataset, and split it into 2 sub-datasets based on if they are training or validation.
# Feel free to split this dataset another way, but remember, a validation set is important, to have an idea of 
# the amount of overfitting that has occurred!

d_train = [d for d in dataset if d['cut'] == 'training']
d_valid = [d for d in dataset if d['cut'] == 'evaluation']

len(d_train), len(d_valid)

(61055, 1558)

In [13]:
# An example (article, summary) pair in the training data:

print(d_train[145]['story'])
print("=======================\n=======================")
print(d_train[145]['summary'])

Tbilisi, Georgia (CNN)Police have shot and killed a white tiger that killed a man Wednesday in Tbilisi, Georgia, a Ministry of Internal Affairs representative said, after severe flooding allowed hundreds of wild animals to escape the city zoo. 
The tiger attack happened at a warehouse in the city center. The animal had been unaccounted for since the weekend floods destroyed the zoo premises.
The man killed, who was 43, worked in a company based in the warehouse, the Ministry of Internal Affairs said. Doctors said he was attacked in the throat and died before reaching the hospital. 
Experts are still searching the warehouse, the ministry said, adding that earlier reports that the tiger had injured a second man were unfounded. 
The zoo administration said Wednesday that another tiger was still missing. It was unable to confirm if the creature was dead or had escaped alive.
Georgian Prime Minister Irakli Garibashvili apologized to the public, saying he had been misinformed by the zoo's ma

Similarly to the previous assignment, we create a function to get a random batch to train on, given a dataset.

In [16]:
def build_batch(dataset, batch_size):
    indices = list(np.random.randint(0, len(dataset), size=batch_size))
    
    batch = [dataset[i] for i in indices]
    batch_input = np.array([a['input'] for a in batch])
    batch_input_mask = np.array([a['input_mask'] for a in batch])
    batch_output = np.array([a['output'] for a in batch])
    batch_output_mask = np.array([a['output_mask'] for a in batch])
    
    return batch_input, batch_input_mask, batch_output, batch_output_mask

We now instantiate the Transformer with our sets of hyperparameters specific to the task of summarization.
In summarization, we are going to go from documents with up to 400 words, to documents with up to 100 words.
The vocabulary size is set for you, and is of 10,000 words (we are using WordPieces, [here is a paper about subword encoding](http://aclweb.org/anthology/P18-1007), if you are interested).

In [15]:
# Use this trainer to train a Transformer model

class TransformerTrainer(nn.Module):
    def __init__(self, vocab_size, d_model, input_length, output_length, n_layers, d_filter, dropout=0, learning_rate=1e-3):
        super().__init__()
        self.model = Transformer(vocab_size=vocab_size, d_model=d_model, n_layers=n_layers, d_filter=d_filter)

        # Summarization loss
        criterion = nn.CrossEntropyLoss(reduce='none')
        self.loss_fn = lambda pred,target,mask: (criterion(pred.permute(0,2,1),target)*mask).sum()/mask.sum()
        self.learning_rate = learning_rate
        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
    def forward(self,batch,optimize=True):
        pred_logits = self.model(**batch)
        target,mask = batch['target_sequence'],batch['decoder_mask']
        loss = self.loss_fn(pred_logits,target,mask)
        accuracy = (th.eq(pred_logits.argmax(dim=2,keepdim=False),target).float()*mask).sum()/mask.sum()
        
        if optimize:
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
                
        return loss, accuracy

In [25]:
# Dataset related parameters
vocab_size = len(vocab)
ilength = 400 # Length of the article
olength  = 100 # Length of the summaries

# Model related parameters, feel free to modify these.
n_layers = 6
d_model  = 160
d_filter = 4*d_model
batch_size = 16
# device = th.device("cuda" if th.cuda.is_available() else "cpu")
device = th.device("cuda")

print(device)
set_device(device)
dropout = 0
learning_rate = 1e-3
trainer = TransformerTrainer(vocab_size, d_model, ilength, olength, n_layers, d_filter, dropout)
model_id = 'test1'
os.makedirs(root_folder+'models/part2/',exist_ok=True)



cuda




In [None]:
# Skeleton code, as in the previous notebook.
# Write code training code and save your best performing model on the
# validation set. We will be testing the loss on a held-out test dataset.
import math
from tqdm import tqdm
gc.collect()
trainer.model.to(device)
trainer.model.train()
losses,accuracies = [],[]
t = tqdm(range(int(1e3)+1))

# t = tqdm(range(int(1e2)+1))
for i in t:
    # Create a random mini-batch from the training dataset
    batch = build_batch(d_train, batch_size)
    # Build the feed-dict connecting placeholders and mini-batch
    batch_input, batch_input_mask, batch_output, batch_output_mask = [th.tensor(tensor) for tensor in batch]
    batch_input, batch_input_mask, batch_output, batch_output_mask \
                = list_to_device([batch_input, batch_input_mask, batch_output, batch_output_mask])
    batch = {'source_sequence': batch_input, 'target_sequence': batch_output,
            'encoder_mask': batch_input_mask, 'decoder_mask': batch_output_mask}

    # Obtain the loss. Be careful when you use the train_op and not, as previously.
    train_loss, accuracy = trainer(batch)
    losses.append(train_loss.item()),accuracies.append(accuracy.item())
    if i % 10 == 0:
        t.set_description(f"Iteration: {i} Loss: {np.mean(losses[-10:])} Accuracy: {np.mean(accuracies[-10:])}")
    if i % 100 == 0:
        save_dict = dict(
            kwargs = dict(
                vocab_size=vocab_size,
                d_model=d_model,
                n_layers=n_layers, 
                d_filter=d_filter
            ),
            model_state_dict = trainer.model.state_dict(),
            notes = ""
        )
        th.save(save_dict, root_folder+f'models/part2/model_{model_id}.pt')

  0%|          | 1/1001 [00:00<04:20,  3.84it/s]

Iteration: 0 Loss: 6.897461891174316 Accuracy: 0.15285995602607727


  1%|          | 12/1001 [00:01<02:13,  7.43it/s]

Iteration: 10 Loss: 5.885189771652222 Accuracy: 0.16042315810918809


  2%|▏         | 22/1001 [00:03<02:12,  7.36it/s]

Iteration: 20 Loss: 5.962117290496826 Accuracy: 0.15752334892749786


  3%|▎         | 32/1001 [00:04<02:11,  7.35it/s]

Iteration: 30 Loss: 6.211309719085693 Accuracy: 0.15759579837322235


  4%|▍         | 42/1001 [00:05<02:10,  7.35it/s]

Iteration: 40 Loss: 5.9583018779754635 Accuracy: 0.1642598956823349


  5%|▌         | 52/1001 [00:07<02:08,  7.39it/s]

Iteration: 50 Loss: 6.185847663879395 Accuracy: 0.15902172923088073


  6%|▌         | 62/1001 [00:08<02:07,  7.38it/s]

Iteration: 60 Loss: 6.217063665390015 Accuracy: 0.1581038698554039


  7%|▋         | 72/1001 [00:09<02:06,  7.35it/s]

Iteration: 70 Loss: 5.908007335662842 Accuracy: 0.15574225783348083


  8%|▊         | 82/1001 [00:11<02:04,  7.37it/s]

Iteration: 80 Loss: 6.218970537185669 Accuracy: 0.15532099902629853


  9%|▉         | 92/1001 [00:12<02:04,  7.32it/s]

Iteration: 90 Loss: 5.741907835006714 Accuracy: 0.15387760251760482


 10%|█         | 101/1001 [00:13<02:23,  6.29it/s]

Iteration: 100 Loss: 5.565541362762451 Accuracy: 0.1537403032183647


 11%|█         | 112/1001 [00:15<02:02,  7.29it/s]

Iteration: 110 Loss: 5.746829605102539 Accuracy: 0.15545573681592942


 12%|█▏        | 122/1001 [00:16<01:59,  7.34it/s]

Iteration: 120 Loss: 6.166535329818726 Accuracy: 0.14950157552957535


 13%|█▎        | 132/1001 [00:18<01:57,  7.38it/s]

Iteration: 130 Loss: 6.055329990386963 Accuracy: 0.15697616785764695


 14%|█▍        | 142/1001 [00:19<01:56,  7.36it/s]

Iteration: 140 Loss: 6.311853313446045 Accuracy: 0.15491567105054854


 15%|█▌        | 152/1001 [00:20<01:55,  7.33it/s]

Iteration: 150 Loss: 6.154069566726685 Accuracy: 0.1587525337934494


 16%|█▌        | 162/1001 [00:22<01:55,  7.28it/s]

Iteration: 160 Loss: 6.447346925735474 Accuracy: 0.1623397797346115


 17%|█▋        | 172/1001 [00:23<01:53,  7.29it/s]

Iteration: 170 Loss: 5.930704736709595 Accuracy: 0.15575879216194152


 18%|█▊        | 182/1001 [00:24<01:52,  7.31it/s]

Iteration: 180 Loss: 6.111931991577149 Accuracy: 0.1575032353401184


 19%|█▉        | 192/1001 [00:26<01:51,  7.29it/s]

Iteration: 190 Loss: 6.250297021865845 Accuracy: 0.1558724030852318


 20%|██        | 202/1001 [00:27<02:00,  6.65it/s]

Iteration: 200 Loss: 6.12143383026123 Accuracy: 0.15925669223070144


 21%|██        | 212/1001 [00:29<01:48,  7.26it/s]

Iteration: 210 Loss: 5.671660852432251 Accuracy: 0.1531708925962448


 22%|██▏       | 222/1001 [00:30<01:46,  7.30it/s]

Iteration: 220 Loss: 6.595358180999756 Accuracy: 0.1608447551727295


 23%|██▎       | 232/1001 [00:31<01:45,  7.26it/s]

Iteration: 230 Loss: 6.118797397613525 Accuracy: 0.155203241109848


 24%|██▍       | 242/1001 [00:33<01:43,  7.32it/s]

Iteration: 240 Loss: 6.099196290969848 Accuracy: 0.15630883127450942


 25%|██▌       | 252/1001 [00:34<01:42,  7.31it/s]

Iteration: 250 Loss: 6.561116361618042 Accuracy: 0.15390486866235734


 26%|██▌       | 262/1001 [00:36<01:41,  7.27it/s]

Iteration: 260 Loss: 5.861220455169677 Accuracy: 0.1555054858326912


 27%|██▋       | 272/1001 [00:37<01:40,  7.27it/s]

Iteration: 270 Loss: 5.828981018066406 Accuracy: 0.16363717019557952


 28%|██▊       | 282/1001 [00:38<01:38,  7.30it/s]

Iteration: 280 Loss: 5.847439956665039 Accuracy: 0.1599757507443428


 29%|██▉       | 292/1001 [00:40<01:37,  7.27it/s]

Iteration: 290 Loss: 6.00897479057312 Accuracy: 0.16394879966974257


 30%|███       | 301/1001 [00:41<01:50,  6.32it/s]

Iteration: 300 Loss: 5.841495704650879 Accuracy: 0.16228505074977875


 31%|███       | 312/1001 [00:42<01:35,  7.21it/s]

Iteration: 310 Loss: 6.3340555191040036 Accuracy: 0.16279821395874022


 32%|███▏      | 322/1001 [00:44<01:33,  7.28it/s]

Iteration: 320 Loss: 6.267124652862549 Accuracy: 0.1657020255923271


 33%|███▎      | 332/1001 [00:45<01:32,  7.21it/s]

Iteration: 330 Loss: 6.040697479248047 Accuracy: 0.15978287756443024


 34%|███▍      | 342/1001 [00:47<01:31,  7.24it/s]

Iteration: 340 Loss: 5.709396648406982 Accuracy: 0.1561113029718399


 35%|███▌      | 352/1001 [00:48<01:29,  7.23it/s]

Iteration: 350 Loss: 6.252426290512085 Accuracy: 0.14649442136287688


 36%|███▌      | 362/1001 [00:49<01:27,  7.26it/s]

Iteration: 360 Loss: 6.154591608047485 Accuracy: 0.15382705107331276


 37%|███▋      | 372/1001 [00:51<01:27,  7.18it/s]

Iteration: 370 Loss: 6.301575326919556 Accuracy: 0.15841923654079437


 38%|███▊      | 382/1001 [00:52<01:24,  7.29it/s]

Iteration: 380 Loss: 6.36583456993103 Accuracy: 0.156681852042675


 39%|███▉      | 392/1001 [00:54<01:23,  7.29it/s]

Iteration: 390 Loss: 5.842637586593628 Accuracy: 0.16053735315799714


 40%|████      | 402/1001 [00:55<01:29,  6.68it/s]

Iteration: 400 Loss: 6.045001745223999 Accuracy: 0.16406423151493071


 41%|████      | 412/1001 [00:56<01:20,  7.30it/s]

Iteration: 410 Loss: 6.18634934425354 Accuracy: 0.14606865346431733


 42%|████▏     | 422/1001 [00:58<01:20,  7.20it/s]

Iteration: 420 Loss: 5.741339015960693 Accuracy: 0.16648395359516144


 43%|████▎     | 432/1001 [00:59<01:17,  7.32it/s]

Iteration: 430 Loss: 6.127590227127075 Accuracy: 0.1549356147646904


 44%|████▍     | 442/1001 [01:00<01:17,  7.26it/s]

Iteration: 440 Loss: 5.973372316360473 Accuracy: 0.15804422795772552


 45%|████▌     | 452/1001 [01:02<01:15,  7.23it/s]

Iteration: 450 Loss: 6.5278678894042965 Accuracy: 0.1603325754404068


 46%|████▌     | 462/1001 [01:03<01:14,  7.19it/s]

Iteration: 460 Loss: 6.091510057449341 Accuracy: 0.15185993760824204


 47%|████▋     | 472/1001 [01:05<01:12,  7.28it/s]

Iteration: 470 Loss: 6.401440382003784 Accuracy: 0.15277106687426567


 48%|████▊     | 482/1001 [01:06<01:11,  7.26it/s]

Iteration: 480 Loss: 6.061173009872436 Accuracy: 0.15148424953222275


 49%|████▉     | 492/1001 [01:07<01:10,  7.23it/s]

Iteration: 490 Loss: 6.000705814361572 Accuracy: 0.15525390207767487


 50%|█████     | 501/1001 [01:09<01:19,  6.31it/s]

Iteration: 500 Loss: 6.372032976150512 Accuracy: 0.16361222863197328


 51%|█████     | 512/1001 [01:10<01:08,  7.11it/s]

Iteration: 510 Loss: 5.849400186538697 Accuracy: 0.15703681409358977


 52%|█████▏    | 522/1001 [01:12<01:06,  7.19it/s]

Iteration: 520 Loss: 5.949140930175782 Accuracy: 0.1633294403553009


 53%|█████▎    | 532/1001 [01:13<01:04,  7.26it/s]

Iteration: 530 Loss: 6.100977516174316 Accuracy: 0.1570553719997406


 54%|█████▍    | 542/1001 [01:14<01:03,  7.19it/s]

Iteration: 540 Loss: 6.358716726303101 Accuracy: 0.15930524170398713


 55%|█████▌    | 552/1001 [01:16<01:02,  7.24it/s]

Iteration: 550 Loss: 6.053978967666626 Accuracy: 0.15688012391328812


 56%|█████▌    | 562/1001 [01:17<01:01,  7.13it/s]

Iteration: 560 Loss: 5.996196937561035 Accuracy: 0.1552640199661255


 57%|█████▋    | 572/1001 [01:19<00:59,  7.26it/s]

Iteration: 570 Loss: 6.081537866592408 Accuracy: 0.15747737884521484


 58%|█████▊    | 582/1001 [01:20<00:57,  7.25it/s]

Iteration: 580 Loss: 6.041585969924927 Accuracy: 0.15628893822431564


 59%|█████▉    | 592/1001 [01:21<00:56,  7.26it/s]

Iteration: 590 Loss: 5.813946676254273 Accuracy: 0.15322296023368837


 60%|██████    | 602/1001 [01:23<00:59,  6.67it/s]

Iteration: 600 Loss: 6.279171371459961 Accuracy: 0.15685813277959823


 61%|██████    | 612/1001 [01:24<00:53,  7.27it/s]

Iteration: 610 Loss: 5.789444351196289 Accuracy: 0.15654205530881882


 62%|██████▏   | 622/1001 [01:25<00:52,  7.23it/s]

Iteration: 620 Loss: 6.674302959442139 Accuracy: 0.15860032439231872


 63%|██████▎   | 632/1001 [01:27<00:51,  7.20it/s]

Iteration: 630 Loss: 6.16577525138855 Accuracy: 0.16085085421800613


 64%|██████▍   | 642/1001 [01:28<00:49,  7.28it/s]

Iteration: 640 Loss: 5.9531481742858885 Accuracy: 0.15940554738044738


 65%|██████▌   | 652/1001 [01:30<00:48,  7.25it/s]

Iteration: 650 Loss: 5.941096687316895 Accuracy: 0.15558053851127623


 66%|██████▌   | 662/1001 [01:31<00:46,  7.27it/s]

Iteration: 660 Loss: 5.820159673690796 Accuracy: 0.15998611748218536


 67%|██████▋   | 672/1001 [01:32<00:45,  7.27it/s]

Iteration: 670 Loss: 5.919001960754395 Accuracy: 0.1638287141919136


 68%|██████▊   | 682/1001 [01:34<00:43,  7.26it/s]

Iteration: 680 Loss: 5.9608124732971195 Accuracy: 0.1627848356962204


 69%|██████▉   | 692/1001 [01:35<00:42,  7.27it/s]

Iteration: 690 Loss: 6.112608623504639 Accuracy: 0.16379142701625823


 70%|███████   | 701/1001 [01:36<00:48,  6.24it/s]

Iteration: 700 Loss: 5.861233949661255 Accuracy: 0.15983981490135193


 71%|███████   | 712/1001 [01:38<00:40,  7.22it/s]

Iteration: 710 Loss: 6.104278612136841 Accuracy: 0.16065119802951813


 72%|███████▏  | 722/1001 [01:39<00:38,  7.27it/s]

Iteration: 720 Loss: 5.984081077575683 Accuracy: 0.16039149314165116


 73%|███████▎  | 732/1001 [01:41<00:37,  7.26it/s]

Iteration: 730 Loss: 6.396000337600708 Accuracy: 0.15263855457305908


 74%|███████▍  | 742/1001 [01:42<00:35,  7.27it/s]

Iteration: 740 Loss: 5.849219799041748 Accuracy: 0.15385426431894303


 75%|███████▌  | 752/1001 [01:43<00:34,  7.26it/s]

Iteration: 750 Loss: 6.262934684753418 Accuracy: 0.15199289470911026


 76%|███████▌  | 762/1001 [01:45<00:33,  7.15it/s]

Iteration: 760 Loss: 6.0449306011199955 Accuracy: 0.1567722514271736


 77%|███████▋  | 772/1001 [01:46<00:31,  7.28it/s]

Iteration: 770 Loss: 5.980852127075195 Accuracy: 0.15881458222866057


 78%|███████▊  | 782/1001 [01:48<00:30,  7.18it/s]

Iteration: 780 Loss: 5.8938703536987305 Accuracy: 0.15996347814798356


 79%|███████▉  | 792/1001 [01:49<00:28,  7.23it/s]

Iteration: 790 Loss: 5.922591686248779 Accuracy: 0.1554047018289566


 80%|████████  | 801/1001 [01:50<00:31,  6.28it/s]

Iteration: 800 Loss: 6.5822831153869625 Accuracy: 0.15318797379732133


 81%|████████  | 812/1001 [01:52<00:26,  7.19it/s]

Iteration: 810 Loss: 5.752757024765015 Accuracy: 0.16616491675376893


 82%|████████▏ | 822/1001 [01:53<00:25,  7.11it/s]

Iteration: 820 Loss: 5.9558521747589115 Accuracy: 0.15847859680652618


 83%|████████▎ | 832/1001 [01:55<00:23,  7.25it/s]

Iteration: 830 Loss: 6.231389951705933 Accuracy: 0.15047113597393036


 84%|████████▍ | 842/1001 [01:56<00:21,  7.24it/s]

Iteration: 840 Loss: 5.884429454803467 Accuracy: 0.15376947224140167


 85%|████████▌ | 852/1001 [01:57<00:20,  7.27it/s]

Iteration: 850 Loss: 6.330794143676758 Accuracy: 0.15848029851913453


 86%|████████▌ | 862/1001 [01:59<00:19,  7.28it/s]

Iteration: 860 Loss: 5.93193416595459 Accuracy: 0.1562215968966484


 87%|████████▋ | 872/1001 [02:00<00:17,  7.25it/s]

Iteration: 870 Loss: 6.467763996124267 Accuracy: 0.158534038066864


 88%|████████▊ | 882/1001 [02:02<00:16,  7.23it/s]

Iteration: 880 Loss: 5.854572486877442 Accuracy: 0.1522786423563957


 89%|████████▉ | 892/1001 [02:03<00:15,  7.20it/s]

Iteration: 890 Loss: 6.1437867164611815 Accuracy: 0.1531238429248333


 90%|█████████ | 901/1001 [02:04<00:16,  6.25it/s]

Iteration: 900 Loss: 6.139829778671265 Accuracy: 0.14796368032693863


 91%|█████████ | 912/1001 [02:06<00:12,  7.18it/s]

Iteration: 910 Loss: 5.957414150238037 Accuracy: 0.16387099623680115


 92%|█████████▏| 922/1001 [02:07<00:10,  7.19it/s]

Iteration: 920 Loss: 6.221318960189819 Accuracy: 0.15900123864412308


 93%|█████████▎| 932/1001 [02:09<00:09,  7.22it/s]

Iteration: 930 Loss: 6.080341386795044 Accuracy: 0.15279445350170134


 94%|█████████▍| 942/1001 [02:10<00:08,  7.22it/s]

Iteration: 940 Loss: 6.018937683105468 Accuracy: 0.1518075428903103


 95%|█████████▌| 952/1001 [02:11<00:06,  7.21it/s]

Iteration: 950 Loss: 6.221824741363525 Accuracy: 0.15383128672838212


 96%|█████████▌| 962/1001 [02:13<00:05,  7.26it/s]

Iteration: 960 Loss: 6.368556308746338 Accuracy: 0.15599415674805642


 97%|█████████▋| 972/1001 [02:14<00:03,  7.26it/s]

Iteration: 970 Loss: 6.046928834915161 Accuracy: 0.15406258553266525


 98%|█████████▊| 982/1001 [02:15<00:02,  7.25it/s]

Iteration: 980 Loss: 5.909612417221069 Accuracy: 0.1604288101196289


 99%|█████████▉| 992/1001 [02:17<00:01,  7.19it/s]

Iteration: 990 Loss: 5.971125221252441 Accuracy: 0.15959894508123398


100%|██████████| 1001/1001 [02:18<00:00,  7.22it/s]

Iteration: 1000 Loss: 6.124185943603516 Accuracy: 0.15861453413963317





# Using the Summarization model

Now that you have trained a Transformer to perform Summarization, we will use the model on news articles from the wild.

The three subsections below explore what the model has learned.

## The validation loss

Measure the validation loss of your model. This part could be used, as in our previous notebook, in deciding what is a likely, vs. unlikely summary for an article.

We will use the code here with the unreleased test-set to evaluate your model.

In [42]:
gc.collect()
model_id = "test1"
save_dict = th.load(root_folder+'models/part2/'+f"model_{model_id}.pt", map_location='cpu')
model = Transformer(**save_dict['kwargs'])
model.load_state_dict(save_dict['model_state_dict'])
# device = th.device("cuda" if th.cuda.is_available() else "cpu")
# print(device)
# set_device(device)
model.eval()
trainer.model = model

In [44]:
gc.collect()
device = th.device("cuda" if th.cuda.is_available() else "cpu")
# print(device)

losses = []
for i in tqdm(range(100)):
    batch = build_batch(d_valid, 1)
    # Build the feed-dict connecting placeholders and mini-batch
    batch_input, batch_input_mask, batch_output, batch_output_mask = [th.tensor(tensor) for tensor in batch]
    batch = {'source_sequence': batch_input, 'target_sequence': batch_output,
            'encoder_mask': batch_input_mask, 'decoder_mask': batch_output_mask}
    # print(device)
    valid_loss, accuracy = trainer(batch,optimize=False)
    # print(device)

    #losses.append(float(valid_loss.to(device).item()))
    
print("Validation loss:", np.mean(losses))

100%|██████████| 100/100 [00:06<00:00, 15.88it/s]

Validation loss: nan





In [45]:
# Your best performing model should go here.
os.makedirs(root_folder+"best_models",exist_ok=True)
best_model_file = root_folder+"best_models/part2_best_model.pt"
th.save(save_dict,best_model_file)

## Generating an article's summary

This model we have built is meant to be used to generate summaries for new articles we do not have summaries for.
We got a [news article](https://www.chicagotribune.com/news/local/breaking/ct-met-officer-shot-20190309-story.html) from the Chicago Tribune about a police shooting, and want to use our model to produce a summary.

As you will see, our model is still limited in its ability, and will most likely not produce an interpretible summary, however, with more data and training, this model would be able to produce good summaries.

In [49]:
article_text = "A 34-year-old Chicago police officer has been shot in the shoulder during the execution of a search warrant in the Humboldt Park neighborhood, police say. The alleged shooter, a 19-year-old woman, was in custody. The shooting happened about 7:20 p.m. in the 2700 block of West Potomac Avenue, police said. The officer, part of the Grand Central District tactical unit, was taken to Stroger Hospital. While officers were serving a \"typical\" search warrant for \"narcotics and illegal weapons\" and were attempting to reach a rear door, \"a shot was fired,\" striking the tactical officer in the shoulder, said Chicago police Superintendent Eddie Johnson during a news briefing outside the hospital. He said the officer, who has about four or five years on the job, was \"stable\" but in critical condition. \"His family is here,\" Johnson said. \"He’s talking a lot and just wants the ordeal to be over.\" He said this incident serves as just another reminder of how dangerous a police officer’s job is. At the scene of the shooting, crime tape closed Potomac from Washtenaw Avenue to California Avenue and encompassed the alley west of the brick apartment building, south of Potomac. Dozens of officers stood in the alley, while even more walked up and down the street. Neighbors gathered at the edge of the yellow tape on the sidewalk along California and watched them work. Standing next to a man, a woman talked to police in the crime scene, across the street. \"We're not under arrest? We can go?\" the woman checked with officers. They told her she could go, and she and the man walked underneath the yellow tape and out of the crime scene."
input_length = 400
output_length = 100

# Process the capitalization with the preprocess_capitalization of the capita package.
article_text = capita.preprocess_capitalization(article_text)

# Numerize the tokens of the processed text using the loaded sentencepiece model.
numerized = sp.EncodeAsIds(article_text)
# Pad the sequence and keep the mask of the input
padded, mask = pad_sequence(numerized, pad_index, input_length)

# Making the news article into a batch of size one, to be fed to the neural network.
encoder_input = np.array([padded])
encoder_mask = np.array([mask])

decoded_so_far = [0]
device = th.device("cuda" if th.cuda.is_available() else "cpu")

for j in range(output_length):
    padded_decoder_input, decoder_mask = pad_sequence(decoded_so_far, pad_index, output_length)
    padded_decoder_input = [padded_decoder_input]
    decoder_mask = [decoder_mask]
    # print("========================")
    # print(padded_decoder_input)
    # Use the model to find the distrbution over the vocabulary for the next word
    batch = (encoder_input,encoder_mask,padded_decoder_input,decoder_mask)
    batch_input, batch_input_mask, batch_output, batch_output_mask = [th.tensor(tensor) for tensor in batch]
    batch = {'source_sequence': batch_input, 'target_sequence': batch_output,
            'encoder_mask': batch_input_mask, 'decoder_mask': batch_output_mask}
    # print(device)
    logits = trainer.model(**batch).detach().numpy()

    chosen_words = np.argmax(logits, axis=2) # Take the argmax, getting the most likely next word
    decoded_so_far.append(int(chosen_words[0, j])) # We add it to the summary so far


print("The final summary:")
print("".join([vocab[i] for i in decoded_so_far]).replace("▁", " "))

The final summary:
<unk>  ↑↑ tissue teacheryardevyardhoyardfinyard ,yard lutheryard kingdomyardister stewart suffocat accent virginia praise imperial defeat imperial casino imperial recent imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial investigate imperial


## Word vectors

The model we train learns word representations for each word in our vocabulary. A word represention is a vector of **dim** size.

It is common in NLP to inspect the word vectors, as some properties of language often appear in the embedding structure.


We are going to load the word embeddings learned by our model, and inspect it.
Because our network was not trained for long, we are going for the simplest patterns, but if we let the network train longer, it learns more complex, semantic patterns.

In [50]:
# We help you load the matrix, as it is hidden within the Transformer structure.
E = trainer.model.encoder.embedding_layer.embedding.weight.cpu().detach().numpy()

print("The embedding matrix has shape:", E.shape)
print("The vocabulary has length:", len(vocab))

The embedding matrix has shape: (10000, 160)
The vocabulary has length: 10000


Pronouns serve very similar purposes, therefore we should expect the representation of "he" and "she" to be similar, and have cosine similarity.

- **TODO**:  Find the cosine similarity between the vectors that represent words "she" and "he".
- **TODO**:  Find the cosine similarity between the vectors that represent words "more" and "less".

We can contrast that with the cosine similarity to a random, non-related word, like "ball", or "gorilla".
- **TODO**: Compute the cosine similarity between "she" and "ball".
- **TODO**: Compute the cosine similarity between "more" and "protest".



In [52]:
def cosine_sim(v1, v2):
    # TODO: Implement the cosine similarity of 2 vectors. Careful: the words might not have unit norm.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

for w1, w2 in [("she", "he"), ("more", "less"), ("she", "ball"), ("more", "gorilla")]:
    w1_index = vocab.index('▁'+w1) # The index of the first  word in our vocabulary
    w2_index = vocab.index('▁'+w2) # The index of the second word in our vocabulary
    w1_vec = E[w1_index] # Get the embedding vector of the first  word
    w2_vec = E[w2_index] # Get the embedding vector of the second word
    
    print(w1," vs. ", w2, "similarity:",cosine_sim(w1_vec, w2_vec))
#validate_to_array(lambda f,i: (f(*i),i), (cosine_sim,tuple(20*np.random.random((2,1000))-1)),'cosine_sim') 

she  vs.  he similarity: 0.10949755
more  vs.  less similarity: 0.029119322
she  vs.  ball similarity: 0.01745215
more  vs.  gorilla similarity: 0.020972772


These effects are unfortunately small, as we have only trained the network on a few hours on a few thousand articles.
However, the same model trained for longer on more data exhibits many interesting semantic and syntactic patterns, such as:

- Words vectors with high cosine similarity usually represent words that have semantic similarity (such as duck and pigeon)
- Analogies can occur, a famous case is that of: woman - man + king ≈ queen. Or france - paris + rome ≈ italy.

- Looking at top-k similar words can help find synonyms.

To read examples of more complex patterns that appear in word embedding spaces, read [this blog](https://explosion.ai/blog/sense2vec-with-spacy). To play with a live demo and try similarities on rich word embeddings, [go here.](https://explosion.ai/demos/sense2vec)