# **Headline Generation via Adversarial Training**
## **Project for Statistical Natural Language Processing (COMP0087)**
## **University College London**

<hr>

**File: Generating fake summaries for the discriminator.ipynb**

**Collaborators:**
  - Daniel Stancl (ucabds7@ucl.ac.uk)
  - Guoliang HE (ucabggh@ucl.ac.uk)
  - Dorota Jagnesakova (ucabdj1@ucl.ac.uk)
  - Zakhar Borok (zcabzbo@ucl.ac.uk)

<hr>

### **Description:** Colab notebook intended for generating fake summaries with our pre-trained seq2seq models. These fake summaries are then used to pre-train our discriminator.

# **1 Setup**

<hr>

- install and import libraries
- remove any existing clone and clone the most recent version of the Git repository

## **1.1 GitHub stuff**

### **1.1.1 Set GitHub credentials and username of repo owner**

In [0]:
# credentials
user_email = '<your_email>'
user = '<your_username>'
user_password = "<your_password>"

# username of repo owner
owner_username = 'stancld'
# reponame
reponame = 'GeneratingHeadline_GANs'

# build the remote URL (with credentials) used for git push
add_origin_link = (
    'https://{}:{}@github.com/{}/{}.git'.format(
        user, user_password, owner_username, reponame)
)

print("Link used for git cooperation:\n{}".format(add_origin_link))

Link used for git cooperation:
https://<your_username>:<your_password>@github.com/stancld/GeneratingHeadline_GANs.git
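
Credentials that contain characters such as `@` or `/` would break a remote URL assembled with plain string formatting. The following is a minimal sketch of one way to guard against that, assuming a hypothetical helper `build_remote_url` that is not part of the repository; it percent-encodes the username and password with the standard library before inserting them.

```python
from urllib.parse import quote


def build_remote_url(user, password, owner, repo):
    """Return an HTTPS remote URL with percent-encoded credentials.

    Hypothetical helper for illustration; not part of the project code.
    """
    # safe="" makes quote() also encode "/" which it leaves alone by default
    return "https://{}:{}@github.com/{}/{}.git".format(
        quote(user, safe=""), quote(password, safe=""), owner, repo
    )


# Example with a made-up password containing reserved characters
example_url = build_remote_url(
    "alice", "p@ss/word", "stancld", "GeneratingHeadline_GANs"
)
print(example_url)
```

Storing a password in a notebook is convenient on Colab but leaves it in the saved cell; a personal access token entered at runtime would be a safer choice.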


### **1.1.2 Clone GitHub repo on the personal drive**

In [0]:
%%time

## Clone GitHub repo to the desired folder
# Mount drive
from google.colab import drive
drive.mount("/content/drive", force_remount = True)
%cd "drive/My Drive/projects"

# Remove any existing clone and fetch an up-to-date copy of the repo
!rm -rf GeneratingHeadline_GANs
!git clone https://github.com/stancld/GeneratingHeadline_GANs.git

# Go to the repo folder
%cd GeneratingHeadline_GANs

# Configure the global user email and re-add origin with credentials so that push commands work
!git config --global user.email "{user_email}"
!git remote rm origin
!git remote add origin "https://{user}:{user_password}@github.com/{owner_username}/{reponame}.git"

Mounted at /content/drive
/content/drive/My Drive/projects
Cloning into 'GeneratingHeadline_GANs'...
remote: Enumerating objects: 2311, done.
remote: Total 2311 (delta 0), reused 0 (delta 0), pack-reused 2311
Receiving objects: 100% (2311/2311), 19.33 MiB | 15.44 MiB/s, done.
Resolving deltas: 100% (1501/1501), done.
Checking out files: 100% (150/150), done.
/content/drive/My Drive/projects/GeneratingHeadline_GANs
CPU times: user 221 ms, sys: 75 ms, total: 296 ms
Wall time: 15.1 s


### **1.1.3 Helper function: push_to_repo**

In [0]:
def push_to_repo():
  """
  Helper function that pushes saved files to the GitHub repo.
  """
  !git remote rm origin
  # the credentials come from the variables set in section 1.1.1
  !git remote add origin "https://{user}:{user_password}@github.com/{owner_username}/{reponame}.git"
  !git checkout master
  !git pull origin master
  !git checkout models_branch
  !git add .
  !git commit -m "model state update"
  !git checkout master
  !git merge models_branch
  !git push -u origin master

## **1.2 General stuff**

### **1.2.1 Install and import packages**

In [0]:
!pip install rouge==1.0.0



In [0]:
import os
import sys
import time
import gc
import copy
import json
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

import re
import unicodedata
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from rouge import Rouge
from termcolor import colored

%matplotlib inline

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### **1.2.2 Set Torch device**

In [0]:
# Use the GPU if available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.get_device_name() if torch.cuda.is_available() else "CPU")

Tesla T4


### **1.2.3 Run auxiliary Python scripts**

In [0]:
# code for transforming data to padded array
%run Code/data2PaddedArray.py

In [0]:
# code for the generator
%run Code/Models/Attention_seq2seq.py

In [0]:
# code for the training class (generator)
%run Code/Models/generator_training_class.py

### **1.2.4 Necessary class for opening text & headline dictionaries**

In [0]:
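# Class Language Dictionary
class LangDict:
  """
  Vocabulary that maps words to integer indices and back, and counts word occurrences.
  The class must be defined here so that the pickled dictionaries can be loaded.
  Source: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
  """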
  def __init__(self):
    self.word2index = {}
    self.word2count = {}
    self.index2word = {0: "sos", 1: "eos"}
    self.n_words = 2

  def add_article(self, article):
    for word in article:
      self.add_word(word)

  def add_word(self, word):
    if word not in self.word2index:
      self.word2index[word] = self.n_words
      self.word2count[word] = 1
      self.index2word[self.n_words] = word
      self.n_words += 1
    else:
      self.word2count[word] += 1
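
To make the behaviour of `LangDict` concrete, here is a small usage sketch. The class body is repeated so the snippet runs on its own, and the toy article is made up for illustration. Note that `"sos"` and `"eos"` are registered only in `index2word`, so new words start at index 2.

```python
class LangDict:
    """Compact copy of the class above, repeated so this snippet is standalone."""

    def __init__(self):
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "sos", 1: "eos"}
        self.n_words = 2

    def add_article(self, article):
        for word in article:
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1


# Toy tokenised article (made-up data)
article = ["how", "to", "bake", "bread", "how"]

toy_dictionary = LangDict()
toy_dictionary.add_article(article)

# Encode the article as indices and decode it back to words
encoded = [toy_dictionary.word2index[word] for word in article]
decoded = [toy_dictionary.index2word[index] for index in encoded]

print(encoded)
print(toy_dictionary.word2count["how"], toy_dictionary.n_words)
```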

# **2. Load the data**

<hr>

- Text_data & headline_data (split into train, dev and test sets)
- Pretrained GloVe embeddings
- text and headline dictionaries

## **2.1 WikiHow data**

### **2.1.1 Input and target data**

In [0]:
# Train set
text_train = np.load(
    '../data/text_train.npy',
    allow_pickle = True
)
headline_train = np.load(
    '../data/headline_train.npy',
    allow_pickle = True
)

# Dev set
text_val = np.load(
    '../data/text_val.npy',
    allow_pickle = True
)
headline_val = np.load(
    '../data/headline_val.npy',
    allow_pickle = True
)

# Test set
text_test = np.load(
    '../data/text_test.npy',
    allow_pickle = True
)
headline_test = np.load(
    '../data/headline_test.npy',
    allow_pickle = True
)

### **2.1.2 Lengths of the input and target data**

In [0]:
# Train set
text_lengths_train = np.load(
    '../data/text_lengths_train.npy',
    allow_pickle = True
)
headline_lengths_train = np.load(
    '../data/headline_lengths_train.npy',
    allow_pickle = True
)

# Dev set
text_lengths_val = np.load(
    '../data/text_lengths_val.npy',
    allow_pickle = True
)
headline_lengths_val = np.load(
    '../data/headline_lengths_val.npy',
    allow_pickle = True
)

# Test set
text_lengths_test = np.load(
    '../data/text_lengths_test.npy',
    allow_pickle = True
)
headline_lengths_test = np.load(
    '../data/headline_lengths_test.npy',
    allow_pickle = True
)
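
The twelve `np.load` calls in sections 2.1.1 and 2.1.2 all follow the same `<name>_<split>.npy` naming pattern, so they could also be written as a loop. The sketch below assumes a hypothetical helper `load_splits` (not part of the repository) that returns the arrays in a dictionary keyed by `(name, split)`.

```python
import os

import numpy as np


def load_splits(data_dir, names, splits=("train", "val", "test")):
    """Load '<name>_<split>.npy' files into a dict keyed by (name, split).

    Hypothetical helper for illustration; not part of the project code.
    """
    return {
        (name, split): np.load(
            os.path.join(data_dir, "{}_{}.npy".format(name, split)),
            allow_pickle=True,
        )
        for name in names
        for split in splits
    }


# Intended usage with the file layout used in this notebook:
# data = load_splits(
#     "../data",
#     ["text", "headline", "text_lengths", "headline_lengths"],
# )
# text_train = data[("text", "train")]
```

The explicit variables used in the notebook are easier to read in later cells; the dictionary form mainly reduces repetition.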

## **2.2 Filtered GloVe embeddings**

In [0]:
# Embeddings for the text dictionary
pre_train_weight = np.load(
    '../data/embedding.npy'
)

# Embeddings for the headline dictionary
pre_train_weight_head = np.load(
    '../data/embedding_headline.npy'
)

## **2.3 Headline & text dictionary**

In [0]:
# text_dictionary
with open('../data/text.dictionary', 'rb') as text_dictionary_file:
  text_dictionary = pickle.load(text_dictionary_file)

# headline_dictionary
with open('../data/headline.dictionary', 'rb') as headline_dictionary_file:
  headline_dictionary = pickle.load(headline_dictionary_file)

# **3 Summary generation**

**Helper function**

In [0]:
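def padded_hypotheses(x, threshold, pad_idx):
  """
  Pad a batch of generated summaries along the sequence axis.

  :param x:
    type: numpy.ndarray
    description: batch of token indices, shape [seq_len, batch_size]
  :param threshold:
    type: int
    description: target sequence length after padding
  :param pad_idx:
    type: int
    description: index of the '<pad>' token

  :return x:
    type: numpy.ndarray
    description: x extended with rows of pad_idx, shape [threshold, batch_size]
  """
  if x.shape[0] == threshold:
    return x
  else:
    # take the batch width from x so that a smaller final batch is padded correctly
    padding = np.full((threshold - x.shape[0], x.shape[1]), pad_idx)
    return np.r_[x, padding]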
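
As a worked example of the padding step, the sketch below pads a toy batch of two sequences from length 2 to length 4. It uses a standalone variant with the hypothetical name `pad_to_length`, which takes the batch width from the input instead of a fixed 32; the pad index 9 is made up for illustration.

```python
import numpy as np


def pad_to_length(x, threshold, pad_idx):
    """Append rows of pad_idx until x has `threshold` rows.

    x has shape [seq_len, batch_size]; hypothetical variant for illustration.
    """
    n_missing = threshold - x.shape[0]
    if n_missing <= 0:
        # already long enough, nothing to pad
        return x
    padding = np.full((n_missing, x.shape[1]), pad_idx, dtype=x.dtype)
    return np.concatenate([x, padding], axis=0)


# Toy batch: 2 time steps, 2 sequences (made-up token indices)
batch = np.array([[0, 0],
                  [5, 6]])
padded = pad_to_length(batch, threshold=4, pad_idx=9)
print(padded.shape)  # (4, 2)
print(padded)
```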

**Generation**

In [0]:
# specify the size of the RNN layer
best_model_size = 512

##### Training specification #####
grid = {'max_epochs': 25,
        'batch_size': 32,
        'learning_rate': 3e-4,
        'clip': 10,
        'l2_reg': 1e-4,
        'model_name': "generator{:.0f}".format(best_model_size)
      }
#################################

##### Model specification ######
OUTPUT_DIM = len(headline_dictionary.index2word.keys()) # number of output classes
ENC_EMB_DIM = pre_train_weight.shape[1] # embedding dimension
ENC_HID_DIM = best_model_size # size of the RNN layer
DEC_HID_DIM = best_model_size # size of the RNN layer

enc_num_layers = 1 # number of layers in RNN
dec_num_layers = 1 # number of layers in RNN

ENC_DROPOUT = 0.1 # probability used for dropout in the encoder RNN unit
DEC_DROPOUT = 0.1 # probability used for dropout in the decoder RNN unit
#################################

# Initialization
Generator = generator(model = _Seq2Seq, loss_function = nn.CrossEntropyLoss, optimiser = optim.Adam, l2_reg = grid['l2_reg'], batch_size = grid['batch_size'],
                    text_dictionary = text_dictionary, embeddings = pre_train_weight, max_epochs = grid['max_epochs'], learning_rate = grid['learning_rate'],
                    clip = grid['clip'], teacher_forcing_ratio = 1, OUTPUT_DIM = OUTPUT_DIM, ENC_HID_DIM = ENC_HID_DIM, ENC_EMB_DIM = ENC_EMB_DIM,
                    DEC_HID_DIM = DEC_HID_DIM, ENC_DROPOUT = ENC_DROPOUT, DEC_DROPOUT = DEC_DROPOUT, enc_num_layers = enc_num_layers, dec_num_layers = dec_num_layers,
                    device = device, model_name = grid['model_name'], push_to_repo = push_to_repo)

# Load the model
Generator.load()

##### TRAINING DATA #####
hypotheses = Generator.generate_summaries(text_train, text_lengths_train, headline_train, headline_lengths_train)
# Pad hypotheses to the length of the real headlines
hypotheses = np.concatenate(
    [padded_hypotheses(hypothesis, headline_train.shape[0], headline_dictionary.word2index['<pad>']) for hypothesis in hypotheses], axis = 1
)
# Correct the 'sos' symbol
hypotheses[0, :] = 0
# Concatenate real and fake summaries + transpose
real_fake_train = np.concatenate((headline_train, hypotheses), axis = 1)
real_fake_train = np.swapaxes(real_fake_train, 0, 1) # shape [n_examples, seq_len]
# add labels as the first column - 1 = Real, 0 = Generated
real_fake_train = np.c_[np.vstack((np.ones((headline_train.shape[1], 1)), np.zeros((hypotheses.shape[1], 1)))), real_fake_train]
# save
np.save('../data/real_fake_train.npy', real_fake_train)
del hypotheses
##########################


##### VALIDATION DATA #####
hypotheses = Generator.generate_summaries(text_val, text_lengths_val, headline_val, headline_lengths_val)
# Pad hypotheses
hypotheses = np.concatenate(
    [padded_hypotheses(hypothesis, headline_val.shape[0], headline_dictionary.word2index['<pad>']) for hypothesis in hypotheses], axis = 1
)
# Correct the 'sos' symbol
hypotheses[0, :] = 0
# Concatenate real and fake summaries + transpose
real_fake_val = np.concatenate((headline_val, hypotheses), axis = 1)
real_fake_val = np.swapaxes(real_fake_val, 0, 1) # shape [n_examples, seq_len]
# add labels as the first column - 1 = Real, 0 = Generated
real_fake_val = np.c_[np.vstack((np.ones((headline_val.shape[1], 1)), np.zeros((hypotheses.shape[1], 1)))), real_fake_val]
# reshuffle
np.random.shuffle(real_fake_val)
# save
np.save('../data/real_fake_val.npy', real_fake_val)
##########################
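
The layout of the saved discriminator data can be checked on toy arrays. The sketch below assumes a hypothetical helper `label_real_fake` that mirrors the concatenate, transpose and label steps above; the array contents are made up. Each row of the result is one example, with the label (1 = real, 0 = generated) in the first column followed by the token indices.

```python
import numpy as np


def label_real_fake(real, fake):
    """Stack real and generated sequences and prepend a label column.

    real, fake: arrays of shape [seq_len, n_examples].
    Returns an array of shape [n_real + n_fake, 1 + seq_len].
    Hypothetical helper for illustration; not part of the project code.
    """
    # join along the example axis, then transpose to [n_examples, seq_len]
    both = np.concatenate((real, fake), axis=1).T
    # 1 = real headline, 0 = generated headline
    labels = np.concatenate((np.ones(real.shape[1]), np.zeros(fake.shape[1])))
    return np.c_[labels, both]


seq_len, n_real, n_fake = 4, 3, 2
real = np.arange(seq_len * n_real).reshape(seq_len, n_real)  # toy "real" headlines
fake = np.full((seq_len, n_fake), 7)                         # toy "generated" headlines

dataset = label_real_fake(real, fake)
print(dataset.shape)  # (5, 5)
print(dataset[:, 0])  # label column
```

Because `np.random.shuffle` permutes a 2-D array along its first axis only, shuffling `dataset` keeps every label attached to its own sequence.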

'\n\nfor model_size in model_sizes:\n  \n  ##### Training specification #####\n  grid = {\'max_epochs\': 25,\n          \'batch_size\': 32,\n          \'learning_rate\': 3e-4,\n          \'clip\': 10,\n          \'l2_reg\': 1e-4,\n          \'model_name\': "generator{:.0f}".format(model_size)\n        }\n  #################################\n\n  ##### Model specification ######\n  OUTPUT_DIM = len(headline_dictionary.index2word.keys()) # number of output classes\n  ENC_EMB_DIM = pre_train_weight.shape[1] # embedding dimension\n  ENC_HID_DIM = model_size # size of the RNN layer\n  DEC_HID_DIM = model_size # size of the RNN layer\n\n  enc_num_layers = 1 # number of layers in RNN\n  dec_num_layers = 1 # number of layers in RNN\n\n  ENC_DROPOUT = 0.1 # probability used for dropout in the encoder RNN unit\n  DEC_DROPOUT = 0.1 # probability used for dropout in the decoder RNN unit\n\n  # Initialization\n  Generator = generator(model = _Seq2Seq, loss_function = nn.CrossEntropyLoss, optimiser =