## IRE Major Project - Summarization using Sequence to Sequence Model with Attention 


Contributed by:
<br>
**Vasu Singhal** (2018101074)
<br>
**Tanish Lad** (2018114005)

Importing all the required libraries.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext import data
from torchtext.data import Field
from torchtext.data import BucketIterator


import spacy
import numpy as np
import os
import random
import pandas as pd
import time
import math

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)


cuda


### Downloading Dataset

We are using the CNN/DailyMail dataset, which is available through the Hugging Face `datasets` library. We install the library and then load the `cnn_dailymail` dataset (version 3.0.0). It already contains train, validation, and test splits, so we simply extract them.

In [None]:
!pip3 install datasets
from datasets import list_datasets, load_dataset, list_metrics, load_metric
dataset = load_dataset("cnn_dailymail", '3.0.0')
train=dataset['train']
test=dataset['test']
validate=dataset['validation']



Downloading and preparing dataset cnn_dailymail/3.0.0 (download: 558.32 MiB, generated: 1.28 GiB, post-processed: Unknown size, total: 1.82 GiB) to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602...



Dataset cnn_dailymail downloaded and prepared to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602. Subsequent calls will reuse this data.


Fixing seeds for reproducibility

In [None]:
SEED = 42

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

We use the `TabularDataset.splits` function from `torchtext.data`, which expects the dataset as JSON or CSV files. The next few cells therefore store the dataset in JSON format: each split is first converted into a DataFrame and then written out as JSON.

In [None]:
stories_train = []
summaries_train = []

stories_test = []
summaries_test = []

stories_validate = []
summaries_validate = []

num = 10
for i in range(len(train)):
    story = train[i]['article']
    summary = train[i]['highlights']
    stories_train.append(story)
    summaries_train.append(summary)
    # if i == num:
    #   break
for i in range(len(test)):
    story = test[i]['article']
    summary = test[i]['highlights']
    stories_test.append(story)
    summaries_test.append(summary)
    # if i == num:
    #   break
for i in range(len(validate)):
    story = validate[i]['article']
    summary = validate[i]['highlights']
    stories_validate.append(story)
    summaries_validate.append(summary)
    # if i == num:
    #   break
      
DATA_TRAIN = {"story": [story for story in stories_train], "summary": [summary for summary in summaries_train]}
DATA_TEST = {"story": [story for story in stories_test], "summary": [summary for summary in summaries_test]}
DATA_VALIDATE = {"story": [story for story in stories_validate], "summary": [summary for summary in summaries_validate]}

In [None]:
df_train = pd.DataFrame(DATA_TRAIN, columns=["story", "summary"])
df_test = pd.DataFrame(DATA_TEST, columns=["story", "summary"])
df_validate = pd.DataFrame(DATA_VALIDATE, columns=["story", "summary"])

df_train.to_json("train_data.json", orient="records", lines=True)
df_test.to_json("test_data.json", orient="records", lines=True)
df_validate.to_json("validation_data.json", orient="records", lines=True)
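
With `orient="records"` and `lines=True`, each row is written as one JSON object on its own line. The sketch below imitates that layout using only the standard `json` module; the two toy records and the helper names `to_json_lines` and `from_json_lines` are made up for illustration and are not part of the notebook.

```python
import json

# Toy records standing in for (article, highlights) pairs; not real dataset rows.
records = [
    {"story": "first article text", "summary": "first summary"},
    {"story": "second article text", "summary": "second summary"},
]

def to_json_lines(rows):
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)

def from_json_lines(text):
    """Parse JSON Lines text back into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

jsonl_text = to_json_lines(records)
round_trip = from_json_lines(jsonl_text)
print(jsonl_text)
```

Keeping one record per line means a reader can process the file line by line, with the `story` and `summary` keys matching the field names used later.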

We use the spaCy English tokenizer to tokenize our text. We also limit the stories (articles) to a maximum of 400 tokens: any story longer than that is truncated to its first 400 tokens. Because the summary field uses the same function, the same cap applies to summaries.

In [None]:
spacy_en = spacy.load('en')
MAX_LENGTH = 400

def tokenize(text):
    # Tokenize with spaCy and keep at most MAX_LENGTH tokens.
    return [tok.text for tok in spacy_en.tokenizer(text)][:MAX_LENGTH]



Corresponding fields are created for the stories and summaries which determine the sos, eos tokens and other details such as tokenize function and whether or not to lowercase the data, etc.

In [None]:
story_field = Field(tokenize = tokenize, init_token = '<sos>', eos_token = '<eos>', lower = True, include_lengths = True)
summary_field = Field(tokenize =tokenize, init_token = '<sos>', eos_token = '<eos>', lower = True)
fields = {'story': ('story', story_field), 'summary': ('summary', summary_field)}


We finally get the train, validation and test data in the appropriate format required by the BucketIterator.

In [None]:
train_data, validation_data, test_data = data.TabularDataset.splits(path = '',
                                        train = 'train_data.json',
                                        validation = 'validation_data.json',
                                        test = 'test_data.json',
                                        format = 'json',
                                        fields = fields)

BucketIterator is a very helpful class that splits the dataset into batches of the given batch size and moves each batch to the GPU if one is being used. It also sorts the stories within each batch by length in descending order. This is necessary for the padding and packing that will come shortly, because `pack_padded_sequence` expects sorted sequences by default.

In [None]:
BATCH_SIZE = 16

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, validation_data, test_data), 
     batch_size = BATCH_SIZE,
     sort_within_batch = True,
     sort_key = lambda x : len(x.story),
     device = device)
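
What sorting and padding a batch amounts to can be sketched in plain Python. This is only an illustration under simplifying assumptions, not how `BucketIterator` is implemented: the helper `sort_and_pad` and the toy batch are hypothetical, and the rows are kept batch-first for readability, whereas the tensors used in this notebook are shaped (max_story_length, batch_size).

```python
# Toy tokenized stories of different lengths (hypothetical data).
batch = [["a", "b"], ["c", "d", "e", "f"], ["g", "h", "i"]]
PAD = "<pad>"

def sort_and_pad(examples, pad_token=PAD):
    """Sort by length (longest first) and pad every example to the longest one."""
    ordered = sorted(examples, key=len, reverse=True)
    lengths = [len(ex) for ex in ordered]
    max_len = lengths[0]
    padded = [ex + [pad_token] * (max_len - len(ex)) for ex in ordered]
    return padded, lengths

padded, lengths = sort_and_pad(batch)
print(lengths)  # [4, 3, 2]
for row in padded:
    print(row)
```

The `lengths` list plays the role of the lengths returned alongside the stories when `include_lengths = True`.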


We build the vocabularies from the training data, dropping tokens that occur only once in the corpus (`min_freq = 2`).

In [None]:
story_field.build_vocab(train_data, min_freq = 2)
summary_field.build_vocab(train_data, min_freq = 2)

print(len(story_field.vocab))
print(len(summary_field.vocab))

137
68
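
The effect of `min_freq = 2` can be sketched with a `Counter`. This is a simplified stand-in, not torchtext's implementation: the helper `build_vocab`, the toy corpus, and the alphabetical ordering of kept tokens are all assumptions made for the example.

```python
from collections import Counter

# Toy corpus (hypothetical); the real stories come from train_data.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]

def build_vocab(tokenized_texts, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Keep tokens seen at least `min_freq` times; specials always come first."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab(corpus)
unk = stoi["<unk>"]
# Tokens below the frequency cutoff fall back to the <unk> index.
encoded = [stoi.get(tok, unk) for tok in ["the", "dog", "sat"]]
print(itos)     # ['<unk>', '<pad>', '<sos>', '<eos>', 'cat', 'sat', 'the']
print(encoded)  # [6, 0, 5]
```

Rare tokens such as `dog` here are all mapped to the same unknown-token index, which keeps the embedding tables smaller.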


We then create the Encoder class. The Encoder encodes our stories into vectors. It returns the outputs at each time step (used for attention) and also the last hidden state, which serves as the initial hidden state of the decoder.

Note that we pass the whole story at once in the Encoder but we generate summaries only 1 word at a time in the decoder. 

**Shapes:**
<br>
Input - (max_story_length, batch_size)
<br>
Embeddings - (max_story_length, batch_size, embedding_dimension)

The embedded input is packed so that the GRU skips the pad tokens. After passing it through the GRU, we pad the outputs again.

Outputs - (max_story_length, batch_size, 2 * hidden_dimension) (2 because the GRU is bidirectional)
<br>
Last Hidden State - (2, batch_size, hidden_dimension)

We then concatenate the final hidden states from both directions and pass them through a feed-forward layer (followed by tanh) that reduces the size from (2 x hidden_dimension) to (1 x hidden_dimension).

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_len):
        embedded = self.dropout(self.embedding(src))              
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len)
        packed_outputs, hidden = self.rnn(packed_embedded)
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs) 
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
 
        return outputs, hidden

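The following attention module is a central piece of the whole model. At every decoding step it scores each encoder output against the previous decoder hidden state, so the decoder can focus on different parts of the story instead of relying on a single fixed-size vector.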

**Shapes:** <br>
Previous Hidden State of Decoder - (batch_size, hidden_dimension) <br>
Outputs of Encoder - (max_story_length, batch_size, 2 * hidden_dimension) <br>
Attention Scores - (batch_size, max_story_length, 1), out of which the dimension containing 1 is squeezed out.

In [None]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs, mask):
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)     
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        attention = self.v(energy).squeeze(2)      
        attention = attention.masked_fill(mask == 0, -1e10)
        
        return F.softmax(attention, dim = 1)
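
The masking step can be illustrated without tensors. The sketch below is a plain-Python rendering of the idea for a single story, assuming toy scores and a hypothetical helper name `masked_softmax`: padded positions are filled with a very negative value so that the softmax gives them essentially zero weight.

```python
import math

def masked_softmax(scores, mask, fill=-1e10):
    """Softmax over `scores`, with positions where mask == 0 forced to ~0 weight."""
    filled = [s if m else fill for s, m in zip(scores, mask)]
    top = max(filled)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

# One story of true length 3, padded to length 5 (toy numbers).
scores = [0.5, 1.5, 1.0, 0.2, 0.9]
mask = [1, 1, 1, 0, 0]
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

The weights over the three real tokens still sum to 1, so the weighted sum of encoder outputs is not diluted by padding.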

The next is the Decoder Class. We give it the input word, the encoder outputs and the previous hidden state as input and we get the prediction probabilities, the new hidden state and the attention scores as outputs.

**Shapes:** <br>
Input - (batch_size) - this is changed to (1, batch_size) because the embedding and GRU layers expect a leading sequence-length dimension. <br>
Embeddings - (1, batch_size, embedding_dimension) <br>
Attention Scores - (batch_size, max_story_length) <br>
Decoder Output - (1, batch_size, hidden_dimension) <br>
Decoder Hidden State - (1, batch_size, hidden_dimension). Note that Decoder's GRU is not bidirectional but rather unidirectional, unlike the Encoder. <br>
Predictions - (batch_size, output_vocab_size)

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)  
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs, mask):
                   
        input = input.unsqueeze(0)  
        embedded = self.dropout(self.embedding(input))  
        a = self.attention(hidden, encoder_outputs, mask)
        a = a.unsqueeze(1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
        rnn_input = torch.cat((embedded, weighted), dim = 2)         
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        return prediction, hidden.squeeze(0), a.squeeze(1)

The next class is the one that combines all the models to make a complete model.

Masking is also set up here: `create_mask` marks the non-pad positions of each story so that attention ignores pad tokens. (Packing the padded stories for the GRU happens inside the Encoder.)

The decoder then starts with the sos token as its first input word and the last hidden state of the encoder as its first hidden state and, using the attention mechanism, predicts the summary one word at a time. During training, a teacher forcing ratio sets the probability that the next input word to the decoder is the actual ground-truth word from the summary rather than the decoder's own previous output.

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.device = device
        
    def create_mask(self, src):
        mask = (src != self.src_pad_idx).permute(1, 0)
        return mask
        
    def forward(self, src, src_len, trg, teacher_forcing_ratio = 0.5):
                            
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        encoder_outputs, hidden = self.encoder(src, src_len)       
        input = trg[0,:]
        mask = self.create_mask(src)
                
        for t in range(1, trg_len):

            output, hidden, _ = self.decoder(input, hidden, encoder_outputs, mask)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1) 
            input = trg[t] if teacher_force else top1
            
        return outputs
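
The choice made at each step by teacher forcing can be isolated in a small plain-Python sketch. Everything here is illustrative: `step_fn` is a hypothetical stand-in for one decoder step, and the toy decoder and target sentence are made up.

```python
import random

def decode_with_teacher_forcing(step_fn, target, teacher_forcing_ratio, rng):
    """Run one decoder step per target position, choosing each next input.

    step_fn: hypothetical stand-in for one decoder step (token -> predicted token).
    target:  ground-truth tokens, starting with '<sos>'.
    """
    decoder_input = target[0]
    predictions, used_teacher = [], []
    for t in range(1, len(target)):
        predicted = step_fn(decoder_input)
        predictions.append(predicted)
        teacher_force = rng.random() < teacher_forcing_ratio
        used_teacher.append(teacher_force)
        # Ground truth when teacher forcing, otherwise the model's own prediction.
        decoder_input = target[t] if teacher_force else predicted
    return predictions, used_teacher

# A deliberately bad toy "decoder" that always predicts 'the'.
always_the = lambda token: "the"
target = ["<sos>", "police", "arrest", "suspect", "<eos>"]

preds, flags = decode_with_teacher_forcing(always_the, target, 0.5, random.Random(0))
print(preds)
print(flags)
```

With a ratio of 1.0 the decoder always sees the ground truth, and with 0.0 it always consumes its own predictions, which is the setting used at evaluation time.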

Declaring all the variables/hyperparameters and initializing the model.

In [None]:
INPUT_DIM = len(story_field.vocab)
OUTPUT_DIM = len(summary_field.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
SRC_PAD_IDX = story_field.vocab.stoi[story_field.pad_token]

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, SRC_PAD_IDX, device).to(device)

Initializing the weights. The weights are drawn from a normal distribution (mean 0, standard deviation 0.01) and the biases are initialized to 0.

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(137, 5)
    (rnn): GRU(5, 5, bidirectional=True)
    (fc): Linear(in_features=10, out_features=5, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=15, out_features=5, bias=True)
      (v): Linear(in_features=5, out_features=1, bias=False)
    )
    (embedding): Embedding(68, 5)
    (rnn): GRU(15, 5)
    (fc_out): Linear(in_features=20, out_features=68, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 3,283 trainable parameters


We use the Adam optimizer. The loss function is the cross-entropy loss. We ignore the padded positions in the target, since otherwise they would contribute to the loss unnecessarily.

In [None]:
optimizer = optim.Adam(model.parameters())
PAD_IDX = summary_field.vocab.stoi[summary_field.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)

This is the train function. It trains our model for one complete epoch. We clip the gradients to a maximum norm of 1 to guard against the exploding gradient problem. The vanishing gradient problem is mitigated by using GRUs, whose gates help gradients flow across time steps, instead of vanilla RNNs.

A note about the way we calculate the loss: the cross-entropy criterion requires the predictions to be 2-dimensional and the targets to be 1-dimensional, but our model outputs are 3-dimensional and our targets are 2-dimensional, because both contain the batch_size dimension. We therefore flatten them, combining the max_summary_length and batch_size dimensions into one. The first time step, which corresponds to the sos token, is dropped before flattening.

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src, src_len = batch.story
        trg = batch.summary      
        optimizer.zero_grad()    
        output = model(src, src_len, trg)
        output_dim = output.shape[-1]      
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        loss = criterion(output, trg)       
        loss.backward()   
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
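
Clipping by norm rescales all gradients together rather than clamping each value on its own. The sketch below shows the arithmetic on plain lists; it follows the general idea behind `clip_grad_norm_` but is not PyTorch's code, and the helper name `clip_by_global_norm` and the toy gradients are assumptions.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Two toy parameter gradients whose joint norm is 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before)            # 5.0
print(round(norm_after, 4))   # 1.0
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its size is limited. Gradients whose norm is already below the threshold are left unchanged.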

The evaluate function is very similar to train. The differences are that we run the model in eval mode (which turns off dropout), wrap the loop in `torch.no_grad()` so that no gradients are computed, and turn off teacher forcing, because in the real world we won't know the ground truths.

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    epoch_loss = 0
    
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src, src_len = batch.story
            trg = batch.summary
            output = model(src, src_len, trg, 0) 
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
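
Both `train` and `evaluate` drop the first time step and flatten the time and batch dimensions before calling the criterion. The sketch below reproduces that reshaping on nested lists so the resulting order is easy to inspect; the helper `flatten_for_loss` and the toy sizes are hypothetical.

```python
def flatten_for_loss(outputs, targets):
    """Drop time step 0 (the <sos> slot) and merge the time and batch dimensions.

    outputs: nested list shaped [trg_len][batch_size][vocab_size]
    targets: nested list shaped [trg_len][batch_size]
    """
    flat_outputs = [scores for step in outputs[1:] for scores in step]
    flat_targets = [tok for step in targets[1:] for tok in step]
    return flat_outputs, flat_targets

trg_len, batch_size, vocab_size = 4, 2, 3
outputs = [[[0.0] * vocab_size for _ in range(batch_size)] for _ in range(trg_len)]
# Encode (time step, batch index) into each target id so the order is visible.
targets = [[t * 10 + b for b in range(batch_size)] for t in range(trg_len)]

flat_outputs, flat_targets = flatten_for_loss(outputs, targets)
print(len(flat_outputs), len(flat_outputs[0]))  # 6 3
print(flat_targets)                              # [10, 11, 20, 21, 30, 31]
```

Predictions and targets are flattened in the same order, so row `k` of the flattened predictions is always compared with entry `k` of the flattened targets.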

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

We train our model for 15 epochs on Ada and record the time required for each epoch. After every epoch we calculate the loss on the validation set, and we save the model only when it achieves the lowest validation loss so far.

In [None]:
N_EPOCHS = 15
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'seq2seq_attention_padding-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 0s
	Train Loss: 4.220 | Train PPL:  68.003
	 Val. Loss: 4.218 |  Val. PPL:  67.888
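
The PPL columns are simply the exponential of the mean cross-entropy loss. A small sketch of that relationship follows, with `perplexity` as a hypothetical helper name. Values computed from the rounded losses printed above differ slightly from the logged PPL because the log uses the unrounded loss.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# Rounded validation loss from the log above.
print(round(perplexity(4.218), 1))

# A model that guesses uniformly over V tokens has loss ln(V) and perplexity V.
uniform = perplexity(math.log(68))
print(round(uniform, 3))
```

Read this way, a perplexity close to 68 with the 68-entry summary vocabulary printed earlier is consistent with a model that is still close to uniform guessing.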


We then evaluate our trained best model on the test set and calculate the loss and perplexity.

In [None]:
model.load_state_dict(torch.load('seq2seq_attention_padding-model.pt'))
test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 4.218 | Test PPL:  67.895 |


This is the summarization function, `make_summary`. Given an article, it returns the predicted summary together with the attention weights for each word generated by the decoder. We predict a maximum of 100 words and stop early if the decoder predicts an eos token.

In [None]:
def make_summary(sentence, src_field, trg_field, model, device, max_len = 100):
    model.eval()
        
    if isinstance(sentence, str):
        # Reuse the spaCy model loaded earlier instead of reloading it on every call.
        tokens = [token.text.lower() for token in spacy_en.tokenizer(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]      
    src_indexes = [src_field.vocab.stoi[token] for token in tokens]  
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)
    src_len = torch.LongTensor([len(src_indexes)]).to(device)
    
    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(src_tensor, src_len)

    mask = model.create_mask(src_tensor)     
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    attentions = torch.zeros(max_len, 1, len(src_indexes)).to(device)
    
    for i in range(max_len):

        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
                
        with torch.no_grad():
            output, hidden, attention = model.decoder(trg_tensor, hidden, encoder_outputs, mask)

        attentions[i] = attention  
        pred_token = output.argmax(1).item()    
        trg_indexes.append(pred_token)

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    
    return trg_tokens[1:], attentions[:len(trg_tokens)-1]
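
The decoding loop in `make_summary` is a greedy search: at each step it keeps the single highest-scoring token and feeds it back in. The sketch below strips that loop down to plain Python. The transition table is a made-up stand-in for the trained decoder and `greedy_decode` is a hypothetical name.

```python
def greedy_decode(step_fn, sos="<sos>", eos="<eos>", max_len=100):
    """Repeatedly feed the last predicted token back in until <eos> or max_len.

    step_fn is a hypothetical stand-in for one decoder step; it maps the
    previous token to a dict of {candidate_token: score}.
    """
    tokens = [sos]
    for _ in range(max_len):
        scores = step_fn(tokens[-1])
        best = max(scores, key=scores.get)   # argmax over the vocabulary
        tokens.append(best)
        if best == eos:
            break
    return tokens[1:]                        # drop <sos>, keep <eos> if produced

# Toy transition table (made up) playing the role of the trained decoder.
table = {
    "<sos>": {"police": 0.7, "the": 0.3},
    "police": {"arrest": 0.6, "the": 0.4},
    "arrest": {"suspect": 0.9, "<eos>": 0.1},
    "suspect": {"<eos>": 0.8, "the": 0.2},
    "the": {"the": 1.0},
}
summary = greedy_decode(table.__getitem__)
print(summary)  # ['police', 'arrest', 'suspect', '<eos>']
```

If the model keeps preferring the same token and never emits eos, the loop runs until `max_len`, which produces a long run of one repeated word.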

In [None]:
# def display_attention(article, summary, attention):
    
#     fig = plt.figure(figsize=(100,100))
#     ax = fig.add_subplot(111)
    
#     attention = attention.squeeze(1).cpu().detach().numpy()
    
#     cax = ax.matshow(attention, cmap='bone')
   
#     ax.tick_params(labelsize=15)
#     ax.set_xticklabels(['']+['<sos>']+[t.lower() for t in article]+['<eos>'], 
#                        rotation=45)
#     ax.set_yticklabels(['']+summary)

#     ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
#     ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

#     plt.show()
#     plt.close()

In [None]:
example_idx = 2

src = vars(train_data.examples[example_idx])['story']
trg = vars(train_data.examples[example_idx])['summary']

print(f'src = {src}')
print(f'trg = {trg}')

src = ['kansas', 'city', ',', 'missouri', '(', 'cnn', ')', '--', 'the', 'general', 'services', 'administration', ',', 'already', 'under', 'investigation', 'for', 'lavish', 'spending', ',', 'allowed', 'an', 'employee', 'to', 'telecommute', 'from', 'hawaii', 'even', 'though', 'he', 'is', 'based', 'at', 'the', 'gsa', "'s", 'kansas', 'city', ',', 'missouri', ',', 'office', ',', 'a', 'cnn', 'investigation', 'has', 'found', '.', 'it', 'cost', 'more', 'than', '$', '24,000', 'for', 'the', 'business', 'development', 'specialist', 'to', 'travel', 'to', 'and', 'from', 'the', 'mainland', 'united', 'states', 'over', 'the', 'past', 'year', '.', 'he', 'is', 'among', 'several', 'hundred', 'gsa', '"', 'virtual', '"', 'workers', 'who', 'also', 'travel', 'to', 'various', 'conferences', 'and', 'their', 'home', 'offices', ',', 'costing', 'the', 'agency', 'millions', 'of']
trg = ['the', 'employee', 'in', 'agency', "'s", 'kansas', 'city', 'office', 'is', 'among', 'hundreds', 'of', '"', 'virtual', '"', 'worke

In [None]:
summary, attention = make_summary(src, story_field, summary_field, model, device)

print(f'predicted trg = {summary}')

predicted trg = ['\n', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the', 'the']


In [None]:
# import matplotlib.pyplot as plt
# import matplotlib.ticker as ticker
# display_attention(src, summary, attention)

In [None]:
example_idx = 1

src = vars(validation_data.examples[example_idx])['story']
trg = vars(validation_data.examples[example_idx])['summary']

print(f'src = {src}')
print(f'trg = {trg}')

src = ['(', 'cnn)sigma', 'alpha', 'epsilon', 'is', 'under', 'fire', 'for', 'a', 'video', 'showing', 'party', '-', 'bound', 'fraternity', 'members', 'singing', 'a', 'racist', 'chant', '.', 'sae', "'s", 'national', 'chapter', 'suspended', 'the', 'students', ',', 'but', 'university', 'of', 'oklahoma', 'president', 'david', 'boren', 'took', 'it', 'a', 'step', 'further', ',', 'saying', 'the', 'university', "'s", 'affiliation', 'with', 'the', 'fraternity', 'is', 'permanently', 'done', '.', 'the', 'news', 'is', 'shocking', ',', 'but', 'it', "'s", 'not', 'the', 'first', 'time', 'sae', 'has', 'faced', 'controversy', '.', 'sae', 'was', 'founded', 'march', '9', ',', '1856', ',', 'at', 'the', 'university', 'of', 'alabama', ',', 'five', 'years', 'before', 'the', 'american', 'civil', 'war', ',', 'according', 'to', 'the', 'fraternity', 'website', '.', 'when']
trg = ['sigma', 'alpha', 'epsilon', 'is', 'being', 'tossed', 'out', 'by', 'the', 'university', 'of', 'oklahoma', '.', '\n', 'it', "'s", 'also',

In [None]:
summary, attention = make_summary(src, story_field, summary_field, model, device)

print(f'predicted trg = {summary}')

# display_attention(src, summary, attention)

In [None]:
example_idx = 1

src = vars(test_data.examples[example_idx])['story']
trg = vars(test_data.examples[example_idx])['summary']

print(f'src = {src}')
print(f'trg = {trg}')

src = ['(', 'cnn)the', 'attorney', 'for', 'a', 'suburban', 'new', 'york', 'cardiologist', 'charged', 'in', 'what', 'authorities', 'say', 'was', 'a', 'failed', 'scheme', 'to', 'have', 'another', 'physician', 'hurt', 'or', 'killed', 'is', 'calling', 'the', 'allegations', 'against', 'his', 'client', '"', 'completely', 'unsubstantiated', '.', '"', 'appearing', 'saturday', 'morning', 'on', 'cnn', "'s", '"', 'new', 'day', ',', '"', 'randy', 'zelin', 'defended', 'his', 'client', ',', 'dr.', 'anthony', 'moschetto', ',', 'who', 'faces', 'criminal', 'solicitation', ',', 'conspiracy', ',', 'burglary', ',', 'arson', ',', 'criminal', 'prescription', 'sale', 'and', 'weapons', 'charges', 'in', 'connection', 'to', 'what', 'prosecutors', 'called', 'a', 'plot', 'to', 'take', 'out', 'a', 'rival', 'doctor', 'on', 'long', 'island', '.', '"', 'none', 'of', 'anything', 'in', 'this', 'case']
trg = ['a', 'lawyer', 'for', 'dr.', 'anthony', 'moschetto', 'says', 'the', 'charges', 'against', 'him', 'are', 'baseles

In [None]:


summary, attention = make_summary(src, story_field, summary_field, model, device)

print(f'predicted trg = {summary}')

# display_attention(src, summary, attention)

In [None]:
# !pip3 install pyrouge --upgrade
# !pip3 install https://github.com/bheinzerling/pyrouge/archive/master.zip
# !pip3 install pyrouge
# !pip3 show pyrouge
# !git clone https://github.com/andersjo/pyrouge.git
# from pyrouge import Rouge155
# !pyrouge_set_rouge_path 'pyrouge/tools/ROUGE-1.5.5'



pyrouge relies on a Perl script that can break if our text contains tokens that happen to look like HTML tags. Hence we make our texts safe by converting all the less-than and greater-than symbols into their HTML entities.

In [None]:
def make_html_safe(s):
    # str.replace returns a new string, so the result must be kept and returned.
    return s.replace("<", "&lt;").replace(">", "&gt;")
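
Python strings are immutable, so `str.replace` returns a new string and never changes the original in place. The sketch below shows the escaping on a toy sentence, next to the standard library's `html.escape` as one possible alternative. The helper name `escape_angle_brackets` and the sample text are made up for this example.

```python
import html

def escape_angle_brackets(text):
    """Return a copy of `text` with '<' and '>' replaced by HTML entities."""
    # str.replace returns a new string, so the result has to be kept.
    return text.replace("<", "&lt;").replace(">", "&gt;")

sample = "price < 5 and <unk> tokens"
print(escape_angle_brackets(sample))      # price &lt; 5 and &lt;unk&gt; tokens
print(html.escape(sample, quote=False))   # same here; html.escape also rewrites '&'
```

Tokens such as `<unk>` are a likely source of tag-like text in predicted summaries, so escaping them matters before the files are handed to the ROUGE script.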

In [None]:
!rm -rf predicted
!rm -rf target
!mkdir predicted
!mkdir target

The below function helps in storing all the predicted summaries and the gold standard summaries which will be further used to calculate ROUGE scores using Pyrouge. It also makes sure that the outputs are in appropriate format as required by PyRouge (e.g. one sentence per line, etc.)

In [None]:
def store_data(data, src_field, trg_field, model, device, max_len = 100):
    
    trgs = []
    pred_trgs = []
    idx = 1
    cnt = 0
    for datum in data:
        cnt += 1
        if cnt == 10000:
          break
        src = vars(datum)['story']
        trg = vars(datum)['summary']
        
        src = ' '.join(src)
        trg = ' '.join(trg)
        pred_trg, _ = make_summary(src, src_field, trg_field, model, device, max_len)
        pred_trg = pred_trg[:-1]
        pred_trg = ' '.join(pred_trg)

        trg = trg.replace("\n", "")
        pred_trg = pred_trg.replace("\n", "")
        trg = trg.split(".")
        pred_trg = pred_trg.split(".")

        trg = "\n".join(trg)
        pred_trg = "\n".join(pred_trg)
        trg = make_html_safe(trg)
        pred_trg = make_html_safe(pred_trg)

        # Context managers close both files even if a write fails.
        with open("./predicted/file." + str(idx) + ".txt", "w") as pred_file:
            pred_file.write(pred_trg)
        with open("./target/file." + str(idx) + ".txt", "w") as trg_file:
            trg_file.write(trg)
        idx += 1

In [None]:
store_data(test_data, story_field, summary_field, model, device)

In [None]:
# output = r.convert_and_evaluate()
# print(output)
# output_dict = r.output_to_dict(output)

2020-10-24 20:03:11,996 [MainThread  ] [INFO ]  Writing summaries.
2020-10-24 20:03:11,998 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpuon6g_3d/system and model files to /tmp/tmpuon6g_3d/model.
2020-10-24 20:03:11,999 [MainThread  ] [INFO ]  Processing files in /tmp/tmp488n_fd3/system.
2020-10-24 20:03:12,000 [MainThread  ] [INFO ]  Processing file.1.txt.
2020-10-24 20:03:12,001 [MainThread  ] [INFO ]  Processing file.1.
2020-10-24 20:03:12,005 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpuon6g_3d/system.
2020-10-24 20:03:12,007 [MainThread  ] [INFO ]  Processing files in /tmp/tmp488n_fd3/model.
2020-10-24 20:03:12,008 [MainThread  ] [INFO ]  Processing file.1.txt.
2020-10-24 20:03:12,011 [MainThread  ] [INFO ]  Processing file.1.
2020-10-24 20:03:12,013 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpuon6g_3d/model.
2020-10-24 20:03:12,014 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmptu_73m6m/rouge_conf.xml
2020-1

CalledProcessError: ignored
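
To make the target metric concrete, here is a simplified ROUGE-1 written in plain Python. It only counts clipped unigram overlap between two token lists and is not a replacement for the official ROUGE-1.5.5 script, which has further options (such as stemming) that this sketch omits. The function name `rouge_1` and the toy sentences are assumptions.

```python
from collections import Counter

def rouge_1(candidate_tokens, reference_tokens):
    """Unigram-overlap precision, recall, and F1 between two token lists."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())      # clipped count of shared unigrams
    precision = overlap / len(candidate_tokens) if candidate_tokens else 0.0
    recall = overlap / len(reference_tokens) if reference_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "the employee is among hundreds of virtual workers".split()
candidate = "the the the the".split()
p, r, f = rouge_1(candidate, reference)
print(p, r)  # 0.25 0.125
```

Clipping the overlap counts means a degenerate summary that repeats one word gets credit for that word only as often as it appears in the reference.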

In [None]:
# !cat ./target/file.1.txt

james best , who played the sheriff on " the dukes of hazzard , " died monday at 88 
  " hazzard " ran from 1979 to 1985 and was among the most popular shows on tv 
