# **Headline Generation via Adversarial Training**
## **Project for Statistical Natural Language Processing (COMP0087)**
## **University College London**

<hr>

**File: Data Preprocessing.ipynb**

**Collaborators:**
  - Daniel Stancl (ucabds7@ucl.ac.uk)
  - Guoliang HE (ucabggh@ucl.ac.uk)
  - Dorota Jagnesakova (ucabdj1@ucl.ac.uk)
  - Zakhar Borok (zcabzbo@ucl.ac.uk)

<hr>

### **Description:** Colab notebook which downloads WikiHow data and GloVe embeddings. Then, input data are preprocessed, GloVe embeddings is trimmed/filtered to the words which are present in the training data. Eventually, input data are transformed to continuous representation using this pre-trained embeddings.

# **1 Setup**

<hr>

- set GitHub credentials, clone repository, and define helper *push* function
- install and import all required libraries
- run auxiliary python scripts
- download pre-traind embeddings (GloVe)
- define function for filtering only those embedding vectors which are needed for our training data
- load the WikiHow data

## **1.1 GitHub stuff**

### **1.1.1 Set GitHub credentials and username of repo owner**

In [0]:
# credentials
user_email = '<your_email>'
user = '<your_username>'
user_password = "<your_password>"

# username of repo owner
owner_username = 'stancld'
# reponame
reponame = 'GeneratingHeadline_GANs'

# generate 
add_origin_link = (
    'https://{}:{}github@github.com/{}/{}.git'.format(
    user, user_password, owner_username, reponame)
)

print("Link used for git cooperation:\n{}".format(add_origin_link))

Link used for git cooperation:
https://<your_username>:<your_password>github@github.com/stancld/GeneratingHeadline_GANs.git


### **1.1.2 Clone GitHub repo on the personal drive**

In [0]:
%%time

## Clone GitHub repo to the desired folder
# Mount drive
from google.colab import drive
drive.mount("/content/drive", force_remount = True)
%cd "drive/My Drive/projects"

# Remove NLP_Project if presented and clone up-to-date repo
!rm -r GeneratingHeadline_GANs
!git clone https://github.com/stancld/GeneratingHeadline_GANs.git

# Go to the NLP_Project folder
%cd GeneratingHeadline_GANs

# Config global user and add origin enabling us to execute push commands
!git config --global user.email user_email
!git remote rm origin
!git remote add origin https://<your_username>:<your_password>@github.com/stancld/GeneratingHeadline_GANs.git

'\n\n%%time\n\n## Clone GitHub repo to the desired folder\n# Mount drive\nfrom google.colab import drive\ndrive.mount("/content/drive", force_remount = True)\n%cd "drive/My Drive/projects"\n\n# Remove NLP_Project if presented and clone up-to-date repo\n!rm -r GeneratingHeadline_GANs\n!git clone https://github.com/stancld/GeneratingHeadline_GANs.git\n\n# Go to the NLP_Project folder\n%cd GeneratingHeadline_GANs\n\n# Config global user and add origin enabling us to execute push commands\n!git config --global user.email user_email\n!git remote rm origin\n!git remote add origin https://<your_username>:<your_password>@github.com/stancld/GeneratingHeadline_GANs.git\n\n'

### **1.1.3 Helper function: push_to_repo**

In [0]:
def push_to_repo():
  """
  Helper function that pushes saved fils to github repo.
  """
  !git remote rm origin
  !git remote add origin https://<your_username>:<your_password>@github.com/stancld/GeneratingHeadline_GANs.git
  !git checkout master
  !git pull origin master
  !git checkout models_branch
  !git add .
  !git commit -m "model state update"
  !git checkout master
  !git merge models_branch
  !git push -u origin master

## **1.2 General stuff**

### **1.2.1 Install and import packages**

In [0]:
pip install rouge==1.0.0

Collecting rouge==1.0.0
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [0]:
import os
import sys
import time
import gc
import copy
import json
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

import re
import unicodedata
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from rouge import Rouge
from termcolor import colored

%matplotlib inline

ERROR! Session/line number was not unique in database. History logging moved to new session 59


In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### **1.2.2 Run auxiliary Python scripts**

In [0]:
# Contractions
run Code/contractions.py

In [0]:
# code for text_preprocessing()
run Code/text_preprocessing.py

In [0]:
# code for transforming data to padded array
run Code/data2PaddedArray.py

### **1.2.3 Download pre-trained embeddings**

In [0]:
# Set desired dimension of embeddings from [50, 100, 200, 300]
embed_dim = 200

# Download and unzip GloVe embedding
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# input your pre-train txt path and parse the data
path = '../data/glove.6B.{:.0f}d.txt'.format(embed_dim)

embed_dict = {}
with open(path,'r') as f:
  lines = f.readlines()
  for l in lines:
    w = l.split()[0]
    v = np.array(l.split()[1:]).astype('float')
    embed_dict[w] = v

embed_dict['@@_unknown_@@'] = np.random.random(embed_dim)

# remove all the unnecesary files
!rm -rf glove.6B.zip
!rm -rf glove.6B.50d.txt
!rm -rf glove.6B.100d.txt
!rm -rf glove.6B.200d.txt
!rm -rf glove.6B.300d.txt

# check the length of the dictionary
len(embed_dict.keys())

--2020-03-27 08:09:27--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-03-27 08:09:27--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-03-27 08:09:27--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

400001

### **1.2.4 Helper function extracting only those embeddings vectors which are presented in our training data**

In [0]:
def extract_weight(text_dictionary):
  """
  Helper function extracting only those embeddings vectors which are presented in our training data
  
  :param text_dictionary:
    type: Dictionary
    description: Pre-trained embeddings

  :return pre_train_weight:
    type: Dictionary
    description: Filtered pre-trained embeddings
  """
  pre_train_weight = []
  for word_index in text_dictionary.index2word.keys():
    if word_index != 0:
      word = text_dictionary.index2word[word_index]
      try:
        word_vector = embed_dict[word].reshape(1,-1)
      except:
        word_vector = embed_dict['@@_unknown_@@'].reshape(1,-1) # handle unknown word
      pre_train_weight = np.vstack([pre_train_weight,word_vector])
    
    # add for padding
    elif word_index == len(text_dictionary.index2word.keys()):  
      pre_train_weight = np.r_[pre_train_weight, np.zeros((1, embed_dim))]
    
    else:
      word = text_dictionary.index2word[word_index]
      try:
        word_vector = embed_dict[word].reshape(1,-1)
      except:
        word_vector = embed_dict['@@_unknown_@@'].reshape(1,-1) # handle unknown word
      pre_train_weight = word_vector
  return pre_train_weight

## **1.3 Load the data**

<hr>

**Source:**  https://ucsb.app.box.com/s/7yq601ijl1lzvlfu4rjdbbxforzd2oag

In [0]:
%%time
# Open
data = pd.read_csv('../data/wikihowSep.csv',
                    error_bad_lines = False).astype(str)
print(data.shape)

(1585695, 5)
CPU times: user 13.2 s, sys: 1.32 s, total: 14.5 s
Wall time: 17.8 s


# **2 Preprocess the data**

## **2.1 Text Preprocessing using our predefined function**

In [0]:
%%time
# Preprocess
for item in ['text', 'headline']:
  exec("""{}_data = text_preprocessing(data=data,
    item = '{}',
    contraction_map=CONTRACTION_MAP,
    drop_digits=False,
    remove_stopwords=False,
    stemming=False)""".format(item, item), locals(), globals()
)

# Cleaning
del data
gc.collect()

CPU times: user 4min 14s, sys: 8.49 s, total: 4min 22s
Wall time: 4min 22s


## **2.2 Clean the data according to the rules specified within the report**

**Print some statistics**

In [0]:
# get lengths of input articles and target summaries
text_len = [len(t) for t in text_data]
head_len = [len(h) for h in headline_data]

# Print some statistics of the uncleansed data
print('Some statistics')

print('Average length of articles is {:.2f}.'.format(np.array(text_len).mean()))
print('Min = {:.0f}, Max = {:.0f}, Std = {:.2f}'.format(min(text_len), max(text_len), np.array(text_len).std()))

print('-----')

print('Average length of summaries is {:.2f}.'.format(np.array(head_len).mean()))
print('Min = {:.0f}, Max = {:.0f}, Std = {:.2f}'.format(min(head_len), max(head_len), np.array(head_len).std()))

Some statistics
Average length of articles is 65.62.
Min = 2, Max = 2967, Std = 58.83
-----
Average length of summaries is 11.08.
Min = 2, Max = 2945, Std = 6.89


In [0]:
# specified the maximum number of examples and the maximum threshold of lentghts of input articles and target summaries
max_examples = 150000
max_threshold = 0.75

# drop examples with an invalid ratio of length of text and headline
text_len = [len(t) for t in text_data]
head_len = [len(h) for h in headline_data]

ratio = [h/t for t, h in zip(text_len, head_len)]

problems1 = [problem for problem, r in enumerate(ratio) if (r > max_threshold)]
print(len(problems1))
text_data, headline_data = np.delete(text_data, problems1), np.delete(headline_data, problems1)
print("Number of examples after filtering: {:.0f}".format(text_data.shape[0]))

# drop too long articles (to avoid struggles with CUDA memory) and too short
text_len = [len(t) for t in text_data]

problems2 = [problem for problem, text_length in enumerate(text_len) if ((text_length > 200) | (text_length < 10) )]
print(len(problems2))
text_data, headline_data = np.delete(text_data, problems2), np.delete(headline_data, problems2)
print("Number of examples after filtering: {:.0f}".format(text_data.shape[0]))

# drop too pairs with too short/long summaries
head_len = [len(h) for h in headline_data]

problems3 = [problem for problem, headline_len in enumerate(head_len) if ( (headline_len > 75) | (headline_len < 2) )]
print(len(problems3))
text_data, headline_data = np.delete(text_data, problems3), np.delete(headline_data, problems3)
print("Number of examples after filtering: {:.0f}".format(text_data.shape[0]))

# some cleaning
del text_len, head_len, ratio, problems1, problems2, problems3
gc.collect()

# trim the data to have only a subset of the data for our project
try:
  text_data, headline_data = text_data[:max_examples], headline_data[:max_examples]
except:
  pass

print(text_data.shape, headline_data.shape)

326422
Number of examples after filtering: 1259273
44706
Number of examples after filtering: 1214567
32
Number of examples after filtering: 1214535
(150000,) (150000,)


**Print some statistics**

In [0]:
# get lengths of input articles and target summaries
text_len = [len(t) for t in text_data]
head_len = [len(h) for h in headline_data]

print('Some statistics')

print('Average length of articles is {:.2f}.'.format(np.array(text_len).mean()))
print('Min = {:.0f}, Max = {:.0f}, Std = {:.2f}'.format(min(text_len), max(text_len), np.array(text_len).std()))

print('-----')

print('Average length of summaries is {:.2f}.'.format(np.array(head_len).mean()))
print('Min = {:.0f}, Max = {:.0f}, Std = {:.2f}'.format(min(head_len), max(head_len), np.array(head_len).std()))

Some statistics
Average length of articles is 87.47.
Min = 10, Max = 200, Std = 42.66
-----
Average length of summaries is 9.45.
Min = 3, Max = 68, Std = 4.49


## **2.3 Split the data**

<hr>

- Trainin data - ~90%
- Validation/dev data - ~5%
- Test data - ~5%

In [0]:
np.random.seed(222)

split = np.random.uniform(0, 1, size = text_data.shape[0])

# Train set
text_train, headline_train = text_data[split <= 0.9], headline_data[split <= 0.9]
# Validation set
text_val, headline_val = text_data[(split > 0.9) & (split <= 0.95)], headline_data[(split > 0.9) & (split <= 0.95)]
# Test set
text_test, headline_test = text_data[split > 0.95], headline_data[split > 0.95]

**Print some statistics**

In [0]:
print('Average lengths of articles is {:.2f}'.format(np.array([len(text) for text in text_train]).mean()))

print('Average lengths of sumaries is {:.2f}'.format(np.array([len(text) for text in headline_train]).mean()))

Average lengths of articles is 87.39
Average lengths of sumaries is 9.45


## **2.4 Sort the data**

In [0]:
def sort_data(text, headline):
  """
  Function sorting data w.r.t. to the lengths of input articles, from the longest to the shortes one

  :param text:
    type: Numpy.Object
    description: Input articles
  :param headline:
    type: Numpy.Object
    descritpion: Target summaries

  :return text:  
    type: Numpy.Object
    description: Sorted input articles from the longest one to the shortest one
  :return headline:
    type: Numpy.Object
    description: Rearranged target summaries w.r.t. input articles
  """
  headline = np.array(
      [y for x,y in sorted(zip(text, headline), key = lambda pair: len(pair[0]), reverse = True)]
  )
  text = list(text)
  text.sort(key = lambda x: len(x), reverse = True)
  text = np.array(text)

  return text, headline

In [0]:
# Train set
text_train, headline_train = sort_data(text_train, headline_train)
# Validation set
text_val, headline_val = sort_data(text_val, headline_val)
# Test set
text_test, headline_test = sort_data(text_test, headline_test)

# **3 Text & Headline dictionary**

## **3.1 Extract dictionaries**

In [0]:
# Class Language Dictionary
class LangDict:
  """
  Source: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
  """
  def __init__(self):
    self.word2index = {}
    self.word2count = {}
    self.index2word = {0: "sos", 1: "eos"}
    self.n_words = 2

  def add_article(self, article):
    for word in article:
      self.add_word(word)

  def add_word(self, word):
    if word not in self.word2index:
      self.word2index[word] = self.n_words
      self.word2count[word] = 1
      self.index2word[self.n_words] = word
      self.n_words += 1
    else:
      self.word2count[word] += 1

In [0]:
# Create dictionary based on the training data
text_dictionary = LangDict()
headline_dictionary = LangDict()

for article in text_train:
  text_dictionary.add_article(article)

for article in headline_train:
  headline_dictionary.add_article(article)

## **3.2 Print some statistics and drop infrequent words**

In [0]:
print("There are {:.0f} distinct words in the untrimmed text dictionary".format(len(text_dictionary.word2index.keys())))
print("There are {:.0f} distinct words in the untrimmed headline dictionary".format(len(headline_dictionary.word2index.keys())))

# Trim a dictionary to the words with at least 10 occurences within the text
text_min_count = 1
head_min_count = 2

## TEXT DICTIONARY
subset_words = [word for (word, count) in text_dictionary.word2count.items() if count >= text_min_count]
text_dictionary.word2index = {word: i for (word, i) in zip(subset_words, range(len(subset_words)))}
text_dictionary.index2word = {i: word for (word, i) in zip(subset_words, range(len(subset_words)))}
text_dictionary.word2count = {word: count for (word, count) in text_dictionary.word2count.items() if count >= text_min_count}

## HEADLINE DICTIONARY
subset_words = [word for (word, count) in headline_dictionary.word2count.items() if count >= head_min_count]
headline_dictionary.word2index = {word: i for (word, i) in zip(subset_words, range(len(subset_words)))}
headline_dictionary.index2word = {i: word for (word, i) in zip(subset_words, range(len(subset_words)))}
headline_dictionary.word2count = {word: count for (word, count) in headline_dictionary.word2count.items() if count >= head_min_count}

print("There are {:.0f} distinct words in the trimmed text dictionary, where only word with at least {:.0f} occurences are retained".format(len(text_dictionary.word2index.keys()), text_min_count))
print("There are {:.0f} distinct words in the trimmed headline dictionary, where only word with at least {:.0f} occurences are retained".format(len(headline_dictionary.word2index.keys()), head_min_count))
del text_min_count, head_min_count, subset_words

There are 67860 distinct words in the untrimmed text dictionary
There are 23368 distinct words in the untrimmed headline dictionary
There are 67860 distinct words in the trimmed text dictionary, where only word with at least 1 occurences are retained
There are 15049 distinct words in the trimmed headline dictionary, where only word with at least 2 occurences are retained


## **3.3 Add pad token**

In [0]:
## TEXT DICTIONARY
pad_idx = max(list(text_dictionary.index2word.keys())) + 1

text_dictionary.word2index['<pad>'] = pad_idx
text_dictionary.index2word[pad_idx] = '<pad>'

print(len(text_dictionary.index2word.keys()))

## HEADLINE DICTIONARY
pad_idx = max(list(headline_dictionary.index2word.keys())) + 1

headline_dictionary.word2index['<pad>'] = pad_idx
headline_dictionary.index2word[pad_idx] = '<pad>'

print(len(headline_dictionary.index2word.keys()))

67861
15050


## **3.4 Save dictionaries**

In [0]:
# text_dictionary
with open('../data/text.dictionary', 'wb') as text_dictionary_file:
  pickle.dump(text_dictionary, text_dictionary_file)

# headline_dictionary
with open('../data/headline.dictionary', 'wb') as headline_dictionary_file:
  pickle.dump(headline_dictionary, headline_dictionary_file)

## **3.5 Extract embedding vectors and save them**

In [0]:
%%time
pre_train_weight = extract_weight(text_dictionary)
pre_train_weight = np.array(pre_train_weight, dtype = np.float32)
np.save('../data/embedding.npy', pre_train_weight)

pre_train_weight_head = extract_weight(headline_dictionary)
pre_train_weight_head = np.array(pre_train_weight_head, dtype = np.float32)
np.save('../data/embedding_headline.npy', pre_train_weight_head)

del embed_dict
gc.collect()

CPU times: user 15min 33s, sys: 31.3 s, total: 16min 4s
Wall time: 16min 9s


# **4. Transform the data into the sequence of indexed words**

In [0]:
# Train set
text_train, text_lengths_train, headline_train, headline_lengths_train = data2PaddedArray(text_train, headline_train, {'text_dictionary': text_dictionary,
                                                                                                                       'headline_dictionary': headline_dictionary},
                                                                                          pre_train_weight)

# Validation set
text_val, text_lengths_val, headline_val, headline_lengths_val = data2PaddedArray(text_val, headline_val, {'text_dictionary': text_dictionary,
                                                                                                           'headline_dictionary': headline_dictionary},
                                                                                  pre_train_weight)

# Test set
text_test, text_lengths_test, headline_test, headline_lengths_test = data2PaddedArray(text_test, headline_test, {'text_dictionary': text_dictionary,
                                                                                                                 'headline_dictionary': headline_dictionary},
                                                                                       pre_train_weight)

## **4.1 Save the preprocessed and transformed data**

In [0]:
# Training data
np.save(
    '../data/text_train.npy',
    text_train
)
np.save(
    '../data/headline_train.npy',
    headline_train
)

# Validation/dev data
np.save(
    '../data/text_val.npy',
    text_val
)
np.save(
    '../data/headline_val.npy',
    headline_val
)

# Test data
np.save(
    '../data/text_test.npy',
    text_test
)
np.save(
    '../data/headline_test.npy',
    headline_test
)

## **4.2 Save the lengths of input and target sequences**

In [0]:
# Training data
np.save(
    '../data/text_lengths_train.npy',
    text_lengths_train
)
np.save(
    '../data/headline_lengths_train.npy',
    headline_lengths_train
)

# Validation/dev data
np.save(
    '../data/text_lengths_val.npy',
    text_lengths_val
)
np.save(
    '../data/headline_lengths_val.npy',
    headline_lengths_val
)

# Test data
np.save(
    '../data/text_lengths_test.npy',
    text_lengths_test
)
np.save(
    '../data/headline_lengths_test.npy',
    headline_lengths_test
)