# **Headline Generation via Adversarial Training**
## **Project for Statistical Natural Language Processing (COMP0087)**
## **University College London**

<hr>

**File: Group 22 - Demo.ipynb**

**Collaborators:**
  - Daniel Stancl (ucabds7@ucl.ac.uk)
  - Guoliang HE (ucabggh@ucl.ac.uk)
  - Dorota Jagnesakova (ucabdj1@ucl.ac.uk)
  - Zakhar Borok (zcabzbo@ucl.ac.uk)

<hr>

### **Description:** ***To be added***

<hr>

### **README:** ***To be added***

# **1 Setup**

<hr>

- install and import libraries
- remove and clone the most recent version of git repository
- run auxiliary Python scripts

## **1.1 GitHub stuff**

In [1]:
%%time

## Clone GitHub repo to the desired folder
# Mount drive
from google.colab import drive
drive.mount("/content/drive", force_remount = True)
%cd "drive/My Drive"
!rm -r Group22
%mkdir "Group22"
%cd "Group22"

# Remove the repository folder if present and clone the up-to-date repo
!rm -r GeneratingHeadline_GANs
!git clone https://github.com/stancld/GeneratingHeadline_GANs.git

# Go to the repository folder
%cd GeneratingHeadline_GANs

Mounted at /content/drive
/content/drive/My Drive
/content/drive/My Drive/Group22
rm: cannot remove 'GeneratingHeadline_GANs': No such file or directory
Cloning into 'GeneratingHeadline_GANs'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 2357 (delta 26), reused 31 (delta 12), pack-reused 2311
Receiving objects: 100% (2357/2357), 19.62 MiB | 20.20 MiB/s, done.
Resolving deltas: 100% (1527/1527), done.
Checking out files: 100% (153/153), done.
/content/drive/My Drive/Group22/GeneratingHeadline_GANs
CPU times: user 180 ms, sys: 67.7 ms, total: 247 ms
Wall time: 10.4 s


## **1.2 General stuff**

### **1.2.1 Install and import packages**

In [2]:
pip install rouge==1.0.0



In [0]:
import os
import sys
import time
import gc
import copy
import json
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

import re
import unicodedata
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from rouge import Rouge
from termcolor import colored

%matplotlib inline

### **1.2.2 Set Torch device**

In [4]:
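# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# get_device_name() raises an error without a GPU, so guard the call
print(torch.cuda.get_device_name() if torch.cuda.is_available() else "CPU")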

Tesla P100-PCIE-16GB


### **1.2.3 Run auxiliary Python scripts**

In [0]:
# code for transforming data to padded array
run Code/data2PaddedArray.py

In [0]:
# code for the generator
run Code/Models/Attention_seq2seq.py

In [0]:
# code for the training class (generator)
run Code/Models/generator_training_class.py

In [0]:
# code for the discriminator
run Code/Models/CNN_text_clf.py

In [0]:
# code for the training class (discriminator)
run Code/Models/discriminator_training_class.py

In [0]:
# Adversarial training
run Code/Models/Adversarial_Training.py

### **1.2.4 Necessary class for opening text & headline dictionaries**

In [0]:
# Class Language Dictionary
class LangDict:
  """
  Source: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
  """
  def __init__(self):
    self.word2index = {}
    self.word2count = {}
    self.index2word = {0: "sos", 1: "eos"}
    self.n_words = 2

  def add_article(self, article):
    for word in article:
      self.add_word(word)

  def add_word(self, word):
    if word not in self.word2index:
      self.word2index[word] = self.n_words
      self.word2count[word] = 1
      self.index2word[self.n_words] = word
      self.n_words += 1
    else:
      self.word2count[word] += 1
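
As a quick illustration of how `LangDict` assigns indices and counts, the sketch below builds a tiny dictionary from a made-up tokenized article. The class body is repeated only so the snippet runs on its own; the sample tokens are hypothetical and are not taken from the WikiHow data.

```python
class LangDict:
    """Same word <-> index bookkeeping as the class defined above."""

    def __init__(self):
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "sos", 1: "eos"}
        self.n_words = 2  # indices 0 and 1 are reserved for sos / eos

    def add_article(self, article):
        for word in article:
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1


# Hypothetical tokenized article (illustration only)
demo_dict = LangDict()
demo_dict.add_article(["eat", "a", "healthy", "diet", "and", "eat", "slowly"])

# New words get the next free index; repeated words only bump their count
print(demo_dict.word2index["eat"], demo_dict.word2count["eat"], demo_dict.n_words)
```

Because the saved dictionaries are pickled instances of this class, the class has to be defined in the session before `pickle.load` is called on them.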

# **2. Load the data and models**

<hr>

- Text_data & headline_data (split into train, dev and test sets)
- Pretrained GloVe embeddings
- text and headline dictionaries

In [12]:
# Download all the data as a ZIP file from Google Drive, unzip it, and remove the archive
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1kv6gOwsHGE_H8wpYfjrIcQSRQuunzF2V' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1kv6gOwsHGE_H8wpYfjrIcQSRQuunzF2V" -O DemoFiles.zip && rm -rf /tmp/cookies.txt
!unzip DemoFiles.zip
!rm DemoFiles.zip

--2020-03-29 12:00:07--  https://docs.google.com/uc?export=download&confirm=5RRJ&id=1kv6gOwsHGE_H8wpYfjrIcQSRQuunzF2V
Resolving docs.google.com (docs.google.com)... 74.125.20.139, 74.125.20.113, 74.125.20.138, ...
Connecting to docs.google.com (docs.google.com)|74.125.20.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-08-70-docs.googleusercontent.com/docs/securesc/m6vl4ta55t70e3j9cgb6ku9u08lo8cc3/tk8r37rj9oa00oo3deftl3fl6soali60/1585483200000/01710675870877940700/11259274077832818668Z/1kv6gOwsHGE_H8wpYfjrIcQSRQuunzF2V?e=download [following]
--2020-03-29 12:00:07--  https://doc-08-70-docs.googleusercontent.com/docs/securesc/m6vl4ta55t70e3j9cgb6ku9u08lo8cc3/tk8r37rj9oa00oo3deftl3fl6soali60/1585483200000/01710675870877940700/11259274077832818668Z/1kv6gOwsHGE_H8wpYfjrIcQSRQuunzF2V?e=download
Resolving doc-08-70-docs.googleusercontent.com (doc-08-70-docs.googleusercontent.com)... 74.125.142.132, 2607:f8b0:400e:c08::84
Connecting to 

## **2.1 WikiHow data**

### **2.1.1 Input and target data**

In [0]:
# Paragraphs
text_test = np.load(
    'DemoFiles/text_test.npy',
    allow_pickle = True
)

# Headlines
headline_test = np.load(
    'DemoFiles/headline_test.npy',
    allow_pickle = True
)


### **2.1.2 Lengths of the input and target data**

In [0]:
text_lengths_test = np.load(
    'DemoFiles/text_lengths_test.npy',
    allow_pickle = True
)
headline_lengths_test = np.load(
    'DemoFiles/headline_lengths_test.npy',
    allow_pickle = True
)

## **2.2 Filtered GloVe embeddings**

In [0]:
# Embeddings for the text dictionary
pre_train_weight = np.load(
    'DemoFiles/embedding.npy'
)

# Embeddings for the headline dictionary
pre_train_weight_head = np.load(
    'DemoFiles/embedding_headline.npy'
)

## **2.3 Headline & text dictionary**

In [0]:
# text_dictionary
with open('DemoFiles/text.dictionary', 'rb') as text_dictionary_file:
  text_dictionary = pickle.load(text_dictionary_file)

# headline_dictionary
with open('DemoFiles/headline.dictionary', 'rb') as headline_dictionary_file:
  headline_dictionary = pickle.load(headline_dictionary_file)

# **3 Evaluation**

## **3.0 Backend ROUGE computation**

In [0]:
# Initialise Rouge() and define a helper function
rouge = Rouge()
def rouge_get_scores(hyp, ref, n):
  try:
    return float(rouge.get_scores(hyp, ref)[0]['rouge-{}'.format(n)]['f'])
  except Exception:
    # Some summaries are flawed (there is nothing to compute), so they are dropped.
    # Only a negligible number of examples is affected; the average scores
    # change by only a few thousandths.
    return "drop"

# Define indices of <pad> and <eos> tokens
pad_idx = headline_dictionary.word2index['<pad>']
eos_idx = headline_dictionary.word2index['eos']

def push_to_repo():
  """
  Helper function that pushes saved files to the GitHub repo.
  """
  !git remote rm origin
  !git remote add origin https://<your_username>:<your_password>@github.com/stancld/GeneratingHeadline_GANs.git
  !git checkout master
  !git pull origin master
  !git checkout models_branch
  !git add .
  !git commit -m "model state update"
  !git checkout master
  !git merge models_branch
  !git push -u origin master
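
For intuition about what the F-score returned by `rouge_get_scores` measures, the following is a minimal, standard-library sketch of an n-gram overlap F1 in the spirit of ROUGE-N. It is not the `rouge` package's implementation, which may differ in details such as tokenisation or n-gram de-duplication. The names `rouge_n_f1` and `mean_valid` are hypothetical, and `None` plays the role of the `"drop"` marker.

```python
from collections import Counter


def rouge_n_f1(hypothesis, reference, n=1):
    """Clipped n-gram overlap F1; returns None when the score is undefined."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    if not hyp or not ref:
        return None  # nothing to compare, analogous to the "drop" case
    overlap = sum((hyp & ref).values())  # matches, clipped by reference counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_valid(scores):
    """Average only the defined scores."""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else float("nan")


pairs = [
    ("get enough sleep .", "get a good night s rest ."),
    ("", "use ice for injuries ."),  # empty hypothesis is skipped
]
scores = [rouge_n_f1(h, r) for h, r in pairs]
print(scores, mean_valid(scores))
```

In the first pair, two of the four hypothesis unigrams appear among the seven reference unigrams, so precision is 2/4, recall is 2/7, and the F1 is 4/11.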

In [0]:
def baseline_hypotheses(model_size, text_dictionary = text_dictionary, headline_dictionary = headline_dictionary,
                        pre_train_weight = pre_train_weight, pre_train_weight_head = pre_train_weight_head,
                        generator = generator, device = device, push_to_repo = push_to_repo):
  grid = {'max_epochs': 25,
          'batch_size': 32,
          'learning_rate': 3e-4,
          'clip': 10,
          'l2_reg': 1e-4,
          'model_name': "generator{:.0f}".format(model_size)
        }

  OUTPUT_DIM = len(headline_dictionary.index2word.keys()) # number of output classes
  ENC_EMB_DIM = pre_train_weight.shape[1] # embedding dimension
  ENC_HID_DIM = model_size # size of the RNN layer
  DEC_HID_DIM = model_size # size of the RNN layer

  enc_num_layers = 1 # number of layers in RNN
  dec_num_layers = 1 # number of layers in RNN

  ENC_DROPOUT = 0.1 # probability used for dropout in the encoder RNN unit
  DEC_DROPOUT = 0.1 # probability used for dropout in the decoder RNN unit
  ##########################################


  # Initialise the model and load its state
  Generator = generator(model = _Seq2Seq, loss_function = nn.CrossEntropyLoss, optimiser = optim.Adam, l2_reg = grid['l2_reg'], batch_size = grid['batch_size'],
                      text_dictionary = text_dictionary, embeddings = pre_train_weight, max_epochs = grid['max_epochs'], learning_rate = grid['learning_rate'],
                      clip = grid['clip'], teacher_forcing_ratio = 1, OUTPUT_DIM = OUTPUT_DIM, ENC_HID_DIM = ENC_HID_DIM, ENC_EMB_DIM = ENC_EMB_DIM,
                      DEC_HID_DIM = DEC_HID_DIM, ENC_DROPOUT = ENC_DROPOUT, DEC_DROPOUT = DEC_DROPOUT, enc_num_layers = enc_num_layers, dec_num_layers = dec_num_layers,
                      device = device, model_name = grid['model_name'], push_to_repo = push_to_repo)
  Generator.load(demo = True)


  ##### ROUGE scores #####
  # Generate hypotheses and convert them into text
  hypotheses = Generator.generate_summaries(text_test, text_lengths_test, headline_test, headline_lengths_test)
  return hypotheses, Generator.n_batches

In [0]:
model_sizes = [128, 256, 512]

In [0]:
# Convert references into text
references = [' '.join([headline_dictionary.index2word[index] for index in headline_test[:, sentence] if (index != pad_idx) & (index != eos_idx)][1:]) for sentence in range(headline_test.shape[1])]
first_run = 1
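
The one-line comprehension above is dense, so here is a minimal sketch of the same decoding step on a toy grid: read one column of a (sequence length x batch) index array, skip the padding and end-of-sequence indices, drop the leading start-of-sequence token, and join the remaining words. The helper name `decode_column`, the toy vocabulary, and the toy batch are hypothetical; plain lists stand in for the NumPy arrays.

```python
def decode_column(index_rows, column, index2word, pad_idx, eos_idx):
    """Turn one column of a (seq_len x batch) index grid into a string."""
    kept = [row[column] for row in index_rows
            if row[column] != pad_idx and row[column] != eos_idx]
    # The first remaining token is the start-of-sequence marker; drop it
    return " ".join(index2word[i] for i in kept[1:])


# Hypothetical toy vocabulary and batch (illustration only)
index2word = {0: "sos", 1: "eos", 2: "<pad>", 3: "ask", 4: "for", 5: "help", 6: "."}
batch = [  # rows are time steps, columns are examples
    [0, 0],
    [3, 3],
    [4, 6],
    [5, 1],
    [6, 2],
    [1, 2],
]
decoded = [decode_column(batch, c, index2word, pad_idx=2, eos_idx=1) for c in range(2)]
print(decoded)
```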

## **3.1 Baseline models**

In [21]:
for model_size in model_sizes:
  hypotheses, n_batches = baseline_hypotheses(model_size)
  hypotheses = sum(
      [[' '.join([headline_dictionary.index2word[index] for index in batch[:, hypothesis] if (index != pad_idx) & (index != eos_idx)][1:]) for hypothesis in range(batch.shape[1])] for batch in hypotheses], []
  )
  
  # trim (only full batches are generated)
  lim = int(32 * n_batches)
  if first_run == 1:
    references = references[:lim]
 
  # Compute average rouge scores
  rouge1 = [rouge_get_scores(hyp, ref, '1') for hyp, ref in zip(hypotheses, references)]
  rouge1 = np.array([x for x in rouge1 if x != 'drop']).mean()
  rouge2 = [rouge_get_scores(hyp, ref, '2') for hyp, ref in zip(hypotheses, references)]
  rouge2 = np.array([x for x in rouge2 if x != 'drop']).mean()
  rougel = [rouge_get_scores(hyp, ref, 'l') for hyp, ref in zip(hypotheses, references)]
  rougel = np.array([x for x in rougel if x != 'drop']).mean()
  
  # cleaning
  del hypotheses
  gc.collect()

  # Print rouge scores
  print('Model size = {:.0f}.'.format(model_size))
  print('ROUGE-1: {:.3f} on test data.'.format(100*np.array(rouge1)))
  print('ROUGE-2: {:.3f} on test data.'.format(100*np.array(rouge2)))
  print('ROUGE-l: {:.3f} on test data.'.format(100*np.array(rougel)))
  print('---------------')
  first_run += 1

Model size = 128.
ROUGE-1: 14.197 on test data.
ROUGE-2: 3.168 on test data.
ROUGE-l: 14.210 on test data.
---------------
Model size = 256.
ROUGE-1: 15.283 on test data.
ROUGE-2: 3.960 on test data.
ROUGE-l: 15.312 on test data.
---------------
Model size = 512.
ROUGE-1: 15.399 on test data.
ROUGE-2: 4.353 on test data.
ROUGE-l: 15.385 on test data.
---------------


## **3.2 SumGAN**

### **3.2.1 Backend Model initialization**

**Discriminator initialization**

<hr>

This step is required; it is an undesired artifact of our non-ideal implementation.

In [0]:
##### Model & Training specification #####
param = {'max_epochs': 80,
         'learning_rate': 5e-4,
         'batch_size': 32,
         'seq_len': 68,         # length of the summary
         'embed_dim': 200,
         'drop_out': 0.5,
         'kernel_num': 50,      # number of feature maps
         'in_channel': 1,       # should be 1 for text classification
         'parallel_layer': 3,   # number of conv nets used in parallel
         'model_name': 'n_{:.0f}_d_{:.0f}'.format(50, 10*0.5),
         'device': 'cuda'}
##########################################


# Initialise the model and load its state
Discriminator = Discriminator_utility(pre_train_weight_head,**param)
Discriminator.load()
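
To make the roles of `seq_len`, `kernel_num`, and `parallel_layer` more concrete, here is a shape-only sketch for a text CNN with parallel convolution branches. It assumes, without confirmation from `Code/Models/CNN_text_clf.py`, that each branch convolves over a window of consecutive tokens without padding and is followed by max-over-time pooling. The window sizes 3, 4, and 5 and the helper name `cnn_feature_shapes` are hypothetical.

```python
def cnn_feature_shapes(seq_len, kernel_num, window_sizes):
    """Return per-branch conv output lengths and the pooled feature size."""
    # A window of k tokens slides over seq_len positions without padding
    conv_lengths = [seq_len - k + 1 for k in window_sizes]
    # Max-over-time pooling keeps one value per feature map and branch
    pooled_features = kernel_num * len(window_sizes)
    return conv_lengths, pooled_features


# seq_len and kernel_num mirror `param`; the three window sizes are assumed
conv_lengths, pooled_features = cnn_feature_shapes(68, 50, window_sizes=(3, 4, 5))
print(conv_lengths, pooled_features)
```

Under these assumptions, three parallel branches with 50 feature maps each would feed a 150-dimensional vector into the final classification layer.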

**Generator initialization**

In [0]:
model_size = 512

grid = {'max_epochs': 25,
        'batch_size': 32,
        'learning_rate': 3e-4,
        'clip': 10,
        'l2_reg': 1e-4,
        'model_name': "generator{:.0f}".format(model_size)
      }

OUTPUT_DIM = len(headline_dictionary.index2word.keys()) # number of output classes
ENC_EMB_DIM = pre_train_weight.shape[1] # embedding dimension
ENC_HID_DIM = model_size # size of the RNN layer
DEC_HID_DIM = model_size # size of the RNN layer

enc_num_layers = 1 # number of layers in RNN
dec_num_layers = 1 # number of layers in RNN

ENC_DROPOUT = 0.1 # probability used for dropout in the encoder RNN unit
DEC_DROPOUT = 0.1 # probability used for dropout in the decoder RNN unit
##########################################


# Initialise the model and load its state
Generator = generator(model = _Seq2Seq, loss_function = nn.CrossEntropyLoss, optimiser = optim.Adam, l2_reg = grid['l2_reg'], batch_size = grid['batch_size'],
                    text_dictionary = text_dictionary, embeddings = pre_train_weight, max_epochs = grid['max_epochs'], learning_rate = grid['learning_rate'],
                    clip = grid['clip'], teacher_forcing_ratio = 1, OUTPUT_DIM = OUTPUT_DIM, ENC_HID_DIM = ENC_HID_DIM, ENC_EMB_DIM = ENC_EMB_DIM,
                    DEC_HID_DIM = DEC_HID_DIM, ENC_DROPOUT = ENC_DROPOUT, DEC_DROPOUT = DEC_DROPOUT, enc_num_layers = enc_num_layers, dec_num_layers = dec_num_layers,
                    device = device, model_name = grid['model_name'], push_to_repo = push_to_repo)
Generator.load(demo = True)

**GAN initialization**

In [0]:
##### Model & Training specification #####
grid = {'max_epochs': 2,
        'batch_size': 32,
        'learning_rate_D': 5e-4,
        'learning_rate_G': 1e-3,
        'G_multiple': 2,
        'l2_reg': 0.1,      # applied only to the discriminator
        'clip': 10,
        'model_name': 'Adversarial_v02',
        'text_dictionary': text_dictionary,
        'headline_dictionary': headline_dictionary,
        'device': 'cuda',
        'noise_std': 0.00, 
        'optim_d_prob': 0.15 # probability with which the discriminator takes an optimisation step
        }

##########################################


# Initialise the model and load its state
GAN = AdversarialTraining(Generator, Discriminator, optim.Adam, optim.SGD, **grid)
GAN.load(demo = True)  # if any

### **3.2.2 ROUGE**


In [25]:
##### ROUGE scores #####
# Generate hypotheses and convert them into text
hypotheses = GAN.generate_summaries(text_test, text_lengths_test, headline_test, headline_lengths_test)
hypotheses = sum(
    [[' '.join([headline_dictionary.index2word[index] for index in batch[:, hypothesis] if (index != pad_idx) & (index != eos_idx)][1:]) for hypothesis in range(batch.shape[1])] for batch in hypotheses], []
)

# Convert references into text
references = [' '.join([headline_dictionary.index2word[index] for index in headline_test[:, sentence] if (index != pad_idx) & (index != eos_idx)][1:]) for sentence in range(headline_test.shape[1])]

# trim (only full batches are generated)
lim = int(32 * n_batches)
references = references[:lim]

# Compute average rouge scores
rouge1 = [rouge_get_scores(hyp, ref, '1') for hyp, ref in zip(hypotheses, references)]
rouge1 = np.array([x for x in rouge1 if x != 'drop']).mean()
rouge2 = [rouge_get_scores(hyp, ref, '2') for hyp, ref in zip(hypotheses, references)]
rouge2 = np.array([x for x in rouge2 if x != 'drop']).mean()
rougel = [rouge_get_scores(hyp, ref, 'l') for hyp, ref in zip(hypotheses, references)]
rougel = np.array([x for x in rougel if x != 'drop']).mean()

# cleaning
del hypotheses, references
gc.collect()

# Print rouge scores
print('Model size = {:.0f}.'.format(model_size))
print('ROUGE-1: {:.3f} on test data.'.format(100*np.array(rouge1)))
print('ROUGE-2: {:.3f} on test data.'.format(100*np.array(rouge2)))
print('ROUGE-l: {:.3f} on test data.'.format(100*np.array(rougel)))
print('---------------')

Model size = 512.
ROUGE-1: 19.382 on test data.
ROUGE-2: 6.209 on test data.
ROUGE-l: 19.343 on test data.
---------------


## **Backend sampling**

**Baseline**

In [0]:
Generator.load(demo = True)

hypotheses_2 = Generator.generate_summaries(text_test, text_lengths_test, headline_test, headline_lengths_test)
hypotheses_2 = sum(
    [[' '.join([headline_dictionary.index2word[index] for index in batch[:, hypothesis] if (index != pad_idx) & (index != eos_idx)][1:]) for hypothesis in range(batch.shape[1])] for batch in hypotheses_2], []
)

**SumGAN**

In [0]:
GAN.load(demo = True)

hypotheses_1 = GAN.generate_summaries(text_test, text_lengths_test, headline_test, headline_lengths_test)
hypotheses_1 = sum(
    [[' '.join([headline_dictionary.index2word[index] for index in batch[:, hypothesis] if (index != pad_idx) & (index != eos_idx)][1:]) for hypothesis in range(batch.shape[1])] for batch in hypotheses_1], []
)

**Take reference headlines**

In [0]:
references = [' '.join([headline_dictionary.index2word[index] for index in headline_test[:, sentence] if (index != pad_idx) & (index != eos_idx)][1:]) for sentence in range(headline_test.shape[1])]
# trim
n_batches = len(references) // grid['batch_size']
lim = n_batches * grid['batch_size']
references = references[:lim]
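
The trimming step exists because only full batches are generated, so the reference list must be cut to the same length before it is zipped with the hypotheses. A minimal sketch of that arithmetic, using a hypothetical helper `trim_to_full_batches` and made-up data:

```python
def trim_to_full_batches(items, batch_size):
    """Drop the trailing items that do not fill a complete batch."""
    n_full = len(items) // batch_size
    return items[: n_full * batch_size], n_full


references_demo = [f"headline {i}" for i in range(70)]  # hypothetical data
trimmed, n_full_batches = trim_to_full_batches(references_demo, batch_size=32)
print(len(trimmed), n_full_batches)
```

With 70 items and a batch size of 32, two full batches fit, so the last six items are discarded.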

## **Print random samples**

In [29]:
np.random.seed(42)

# Take a random sample
samples = np.random.choice(lim, 50, replace = False)
hypotheses_1, hypotheses_2, references = (
    np.array(hypotheses_1)[samples],
    np.array(hypotheses_2)[samples],
    np.array(references)[samples]
)

# Print and save
for hyp_1, hyp_2, ref in zip(
    hypotheses_1,
    hypotheses_2,
    references):
  print(f'Reference: {ref}')
  print(colored(f'Baseline: {hyp_2}', 'blue'))
  print(colored(f'SumGAN: {hyp_1}', 'green')) 
  print('----------')

Reference: rely on friends and family .
Baseline: be positive .
SumGAN: ask for help .
----------
Reference: fuel your body with nutritious foods .
Baseline: eat a healthy diet .
SumGAN: eat a healthy diet .
----------
Reference: have your vet trim your guinea pigs nails .
Baseline: get your guinea pig .
SumGAN: get your guinea pig .
----------
Reference: use washable bedding on the bed .
Baseline: clean your bedding .
SumGAN: clean your bedding .
----------
Reference: get a good night s rest .
Baseline: get plenty of sleep .
SumGAN: get enough sleep .
----------
Reference: consider your family wishes and lifestyle .
Baseline: be aware of your family .
SumGAN: consider your family history .
----------
Reference: add additional front matter .
Baseline: write your first draft .
SumGAN: write a brief .
----------
Reference: use ice for injuries .
Baseline: ice

# **CLEANING**

In [30]:
%cd ..
%cd .. 
!rm -r Group22

/content/drive/My Drive/Group22
/content/drive/My Drive
