**Function keeping Colab running**

<hr>

1. Ctrl + Shift + I
2. Put the code below into the console

In [0]:
"""

function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("colab-connect-button").click()
}
setInterval(ClickConnect,60000)

"""

'\n\nfunction ClickConnect(){\n    console.log("Clicked on connect button"); \n    document.querySelector("colab-connect-button").click()\n}\nsetInterval(ClickConnect,60000)\n\n'

# **GANs for Abstractive Text Summarization**
## **NLP Group Project**
## **Statistical Natural Language Processing (COMP0087), University College London**

<hr>

**Project description**

A lot of endeavours have already been devoted to NLP text summarization techniques, and abstractive methods have proved to be more proficient in generating human-like sentences. At the same time, GANs has been enjoying considerable success in the area of real-valued data such as an image generation. Recently, researchers have begun to come up with ideas on how to overcome various obstacles during training GAN models for discrete data, though not a lot of work seemed to be directly dedicated to the text summarization itself. We, therefore, would like to pursue to tackle the issue of text summarization using the GAN techniques inspired by sources enlisted below.

<hr>

**Collaborators**

- Daniel Stancl (daniel.stancl.19@ucl.ac.uk)
- Dorota Jagnesakova (dorota.jagnesakova.19@ucl.ac.uk)
- Guolinag HE (guoliang.he.19@ucl.ac.uk)
- Zakhar Borok

# **1 Setup**

<hr>

- install and import libraries
- download stopwords
- remove and clone the most recent version of git repository
- run a script with a CONTRACTION_MAP
- run a script with a function for text preprocessing

### **GitHub stuff**

**Set GitHub credentials and username of repo owner**

In [0]:
# credentials
user_email = 'dannyi@seznam.cz'
user = "gansforlife"
user_password = "dankodorkamichaelzak"

# username of repo owner
owner_username = 'stancld'
# reponame
reponame = 'GeneratingHeadline_GANs'

# generate 
add_origin_link = (
    'https://{}:{}github@github.com/{}/{}.git'.format(
    user, user_password, owner_username, reponame)
)

print("Link used for git cooperation:\n{}".format(add_origin_link))

Link used for git cooperation:
https://gansforlife:dankodorkamichaelzakgithub@github.com/stancld/GeneratingHeadline_GANs.git


**Clone GitHub repo on the personal drive**

In [0]:
%%time

## Clone GitHub repo to the desired folder
# Mount drive
from google.colab import drive
drive.mount("/content/drive", force_remount = True)
%cd "drive/My Drive/projects"

# Remove NLP_Project if presented and clone up-to-date repo
!rm -r GeneratingHeadline_GANs
!git clone https://github.com/stancld/GeneratingHeadline_GANs.git

# Go to the NLP_Project folder
%cd GeneratingHeadline_GANs

# Config global user and add origin enabling us to execute push commands
!git config --global user.email user_email
!git remote rm origin
!git remote add origin https://gansforlife:dankodorkamichaelzakgithub@github.com/stancld/GeneratingHeadline_GANs.git

Mounted at /content/drive
/content/drive/My Drive/projects
Cloning into 'GeneratingHeadline_GANs'...
remote: Enumerating objects: 377, done.[K
remote: Counting objects: 100% (377/377), done.[K
remote: Compressing objects: 100% (225/225), done.[K
remote: Total 377 (delta 161), reused 294 (delta 78), pack-reused 0[K
Receiving objects: 100% (377/377), 16.66 MiB | 19.06 MiB/s, done.
Resolving deltas: 100% (161/161), done.
/content/drive/My Drive/projects/GeneratingHeadline_GANs
CPU times: user 210 ms, sys: 85.7 ms, total: 295 ms
Wall time: 9.49 s


**Function push_to_repo**

In [0]:
def push_to_repo():
  """
  models_branch
  """
  !git remote rm origin
  !git remote add origin https://gansforlife:dankodorkamichaelzak@github.com/stancld/GeneratingHeadline_GANs.git
  !git checkout master
  !git pull origin master
  !git checkout models_branch
  !git add .
  !git commit -m "model state update"
  !git checkout master
  !git merge models_branch
  !git push -u origin master

### **General stuff**

**Import essential libraries and load necessary conditionalities**

In [0]:
import os
import sys
import time
import gc
import copy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


import re
import unicodedata
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from gensim.models import Word2Vec

%matplotlib inline

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Set essential parameters**

In [0]:
# Set torch.device to use GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.get_device_name())

Tesla P100-PCIE-16GB


**Run python files from with classes used throughtout the document**

In [0]:
run Code/contractions.py

In [0]:
# code for text_preprocessing()
run Code/text_preprocessing.py

In [0]:
# code for transforming data to padded array
run Code/data2PaddedArray.py

In [0]:
# code for the baseline model class _Seq2Seq()
run Code/Models/Attention_seq2seq.py

In [0]:
# code for the training class
run Code/Models/generator_training_class.py

### **Pretrained embeddings**

<hr>

**TODO:** *Put a comment which kind of embeddings we used. Add some references and so on*

In [0]:
embed_dim = 200

# Download and unzip GloVe embedding
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip glove.6B.zip


# input your pre-train txt path and parse the data
#path = '../data/glove.6B.100d.txt'
path = '../data/glove.6B.{:.0f}d.txt'.format(embed_dim)
embed_dict = {}
with open(path,'r') as f:
  lines = f.readlines()
  for l in lines:
    w = l.split()[0]
    v = np.array(l.split()[1:]).astype('float')
    embed_dict[w] = v

embed_dict['@@_unknown_@@'] = np.random.random(embed_dim) # if we use 100 dimension embeddings

# remove all the unnecesary files
#!rm -rf glove.6B.zip
#!rm -rf glove.6B.50d.txt
#!rm -rf glove.6B.100d.txt
#!rm -rf glove.6B.200d.txt
#!rm -rf glove.6B.300d.txt

# check the length of the dictionary
len(embed_dict.keys())

400001

**Function for extracting relevant matrix of pretrained weights** 

In [0]:
def extract_weight(text_dictionary):
  """
  :param text_dictionary:
  """
  pre_train_weight = []
  for word_index in text_dictionary.index2word.keys():
    if word_index != 0:
      word = text_dictionary.index2word[word_index]
      try:
        word_vector = embed_dict[word].reshape(1,-1)
      except:
        word_vector = embed_dict['@@_unknown_@@'].reshape(1,-1) # handle unknown word
      pre_train_weight = np.vstack([pre_train_weight,word_vector])
    
    # add for padding
    elif word_index == len(text_dictionary.index2word.keys()):  
      pre_train_weight = np.r_[pre_train_weight, np.zeros((1, embed_dim))]
    
    else:
      word = text_dictionary.index2word[word_index]
      try:
        word_vector = embed_dict[word].reshape(1,-1)
      except:
        word_vector = embed_dict['@@_unknown_@@'].reshape(1,-1) # handle unknown word
      pre_train_weight = word_vector
  return pre_train_weight

# **2 Load and process the data**

<hr>

**Source of the data:** https://ucsb.app.box.com/s/7yq601ijl1lzvlfu4rjdbbxforzd2oag

##### *Download and open data*

In [0]:
data = pd.read_csv('../data/wikihowSep.csv',
                   error_bad_lines = False).astype(str)
print(data.shape)

(1585695, 5)


##### *Pre-process data*

In [0]:
%time

for item in ['text', 'headline']:
  exec("""{}_data = text_preprocessing(data=data, item = '{}', contraction_map=CONTRACTION_MAP,
                                  drop_digits=False, remove_stopwords=False, stemming=False)""".format(item, item),
       locals(), globals()
  )

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs


##### *Clean flawed examples*

<hr>

- drop examples based on the threshold

In [0]:
# drop examples with an invalid ratio of length of text and headline
text_len = [len(t) for t in text_data]
head_len = [len(h) for h in headline_data]

print('Some statistics')

print('Average length of articles is {:.2f}.'.format(np.array(text_len).mean()))
print('Min = {:.0f}, Max = {:.0f}, Std = {:.2f}'.format(min(text_len), max(text_len), np.array(text_len).std()))

print('-----')

print('Average length of summaries is {:.2f}.'.format(np.array(head_len).mean()))
print('Min = {:.0f}, Max = {:.0f}, Std = {:.2f}'.format(min(head_len), max(head_len), np.array(head_len).std()))

Some statistics
Average length of articles is 90.06.
Min = 2, Max = 1660, Std = 59.56
-----
Average length of summaries is 10.29.
Min = 2, Max = 226, Std = 6.45


In [0]:
max_examples = 150000
max_threshold = 0.75

# drop examples with an invalid ratio of length of text and headline
text_len = [len(t) for t in text_data]
head_len = [len(h) for h in headline_data]

ratio = [h/t for t, h in zip(text_len, head_len)]

problems1 = [problem for problem, r in enumerate(ratio) if (r > max_threshold)]
text_data, headline_data = np.delete(text_data, problems1), np.delete(headline_data, problems1)
print("Number of examples after filtering: {:.0f}".format(text_data.shape[0]))

# drop too long articles (to avoid struggles with CUDA memory) and too short
text_len = [len(t) for t in text_data]

problems2 = [problem for problem, text_length in enumerate(text_len) if ((text_length > 200) | (text_length < 10) )]
text_data, headline_data = np.delete(text_data, problems2), np.delete(headline_data, problems2)
print("Number of examples after filtering: {:.0f}".format(text_data.shape[0]))

# drop too pairs with too short/long summaries
head_len = [len(h) for h in headline_data]

problems3 = [problem for problem, headline_len in enumerate(head_len) if ( (headline_len > 75) | (headline_len < 2) )]
text_data, headline_data = np.delete(text_data, problems3), np.delete(headline_data, problems3)
print("Number of examples after filtering: {:.0f}".format(text_data.shape[0]))

# some cleaning
del text_len, head_len, ratio, problems1, problems2, problems3
gc.collect()

"""
# trim the data to have only a subset of the data for our project
try:
  data = data[:max_examples]
except:
  pass
"""

Number of examples after filtering: 138821
Number of examples after filtering: 132047
Number of examples after filtering: 132041
(150000, 5)


*Print some statistics*

In [0]:
# drop examples with an invalid ratio of length of text and headline
text_len = [len(t) for t in text_data]
head_len = [len(h) for h in headline_data]

print('Some statistics')

print('Average length of articles is {:.2f}.'.format(np.array(text_len).mean()))
print('Min = {:.0f}, Max = {:.0f}, Std = {:.2f}'.format(min(text_len), max(text_len), np.array(text_len).std()))

print('-----')

print('Average length of summaries is {:.2f}.'.format(np.array(head_len).mean()))
print('Min = {:.0f}, Max = {:.0f}, Std = {:.2f}'.format(min(head_len), max(head_len), np.array(head_len).std()))

Some statistics
Average length of articles is 88.48.
Min = 10, Max = 200, Std = 42.77
-----
Average length of summaries is 9.41.
Min = 3, Max = 68, Std = 4.53


##### *Split data into train/val/test set*

<hr>

It's crucial to do this split in this step so that a dictionary that will be created for our model won't contain any words from validation/test set which are not presented in the training data.

In [0]:
np.random.seed(222)

split = np.random.uniform(0, 1, size = text_data.shape[0])

# Train set
text_train, headline_train = text_data[split <= 0.9], headline_data[split <= 0.9]
# Validation set
text_val, headline_val = text_data[(split > 0.9) & (split <= 0.95)], headline_data[(split > 0.9) & (split <= 0.95)]
# Test set
text_test, headline_test = text_data[split > 0.95], headline_data[split > 0.95]

del data
gc.collect()

344

*Print some statistics*

In [0]:
print('Average lengths of articles is {:.2f}'.format(np.array([len(text) for text in text_train]).mean()))

print('Average lengths of sumaries is {:.2f}'.format(np.array([len(text) for text in headline_train]).mean()))

Average lengths of articles is 88.41
Average lengths of sumaries is 9.40


##### *Sort dataset from the longest sequence to the shortest one*

In [0]:
def sort_data(text, headline):
  """
  """
  headline = np.array(
      [y for x,y in sorted(zip(text, headline), key = lambda pair: len(pair[0]), reverse = True)]
  )
  text = list(text)
  text.sort(key = lambda x: len(x), reverse = True)
  text = np.array(text)

  return text, headline

In [0]:
# Train set
text_train, headline_train = sort_data(text_train, headline_train)
# Validation set
text_val, headline_val = sort_data(text_val, headline_val)
# Test set
text_test, headline_test = sort_data(text_test, headline_test)

### **Prepare dictionary and embeddings**

##### *Create a dictionary and prepare a digestible representation of the data*

In [0]:
class LangDict:
  """
  Source: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
  """
  def __init__(self):
    self.word2index = {}
    self.word2count = {}
    self.index2word = {0: "sos", 1: "eos"}
    self.n_words = 2

  def add_article(self, article):
    for word in article:
      self.add_word(word)

  def add_word(self, word):
    if word not in self.word2index:
      self.word2index[word] = self.n_words
      self.word2count[word] = 1
      self.index2word[self.n_words] = word
      self.n_words += 1
    else:
      self.word2count[word] += 1

In [0]:
# Create dictionary based on the training data
text_dictionary = LangDict()
headline_dictionary = LangDict()

for article in text_train:
  text_dictionary.add_article(article)

for article in headline_train:
  headline_dictionary.add_article(article)

In [0]:
print("There are {:.0f} distinct words in the untrimmed text dictionary".format(len(text_dictionary.word2index.keys())))
print("There are {:.0f} distinct words in the untrimmed headline dictionary".format(len(headline_dictionary.word2index.keys())))

# Trim a dictionary to the words with at least 10 occurences within the text
text_min_count = 1
head_min_count = 2

## TEXT DICTIONARY
subset_words = [word for (word, count) in text_dictionary.word2count.items() if count >= text_min_count]
text_dictionary.word2index = {word: i for (word, i) in zip(subset_words, range(len(subset_words)))}
text_dictionary.index2word = {i: word for (word, i) in zip(subset_words, range(len(subset_words)))}
text_dictionary.word2count = {word: count for (word, count) in text_dictionary.word2count.items() if count >= text_min_count}

## HEADLINE DICTIONARY
subset_words = [word for (word, count) in headline_dictionary.word2count.items() if count >= head_min_count]
headline_dictionary.word2index = {word: i for (word, i) in zip(subset_words, range(len(subset_words)))}
headline_dictionary.index2word = {i: word for (word, i) in zip(subset_words, range(len(subset_words)))}
headline_dictionary.word2count = {word: count for (word, count) in headline_dictionary.word2count.items() if count >= head_min_count}

print("There are {:.0f} distinct words in the trimmed text dictionary, where only word with at least {:.0f} occurences are retained".format(len(text_dictionary.word2index.keys()), text_min_count))
print("There are {:.0f} distinct words in the trimmed headline dictionary, where only word with at least {:.0f} occurences are retained".format(len(headline_dictionary.word2index.keys()), head_min_count))
del text_min_count, head_min_count, subset_words

There are 62271 distinct words in the untrimmed text dictionary
There are 21634 distinct words in the untrimmed headline dictionary
There are 62271 distinct words in the trimmed text dictionary, where only word with at least 1 occurences are retained
There are 13930 distinct words in the trimmed headline dictionary, where only word with at least 2 occurences are retained


*Add pad token*

In [0]:
## TEXT DICTIONARY
pad_idx = max(list(text_dictionary.index2word.keys())) + 1

text_dictionary.word2index['<pad>'] = pad_idx
text_dictionary.index2word[pad_idx] = '<pad>'

print(len(text_dictionary.index2word.keys()))

## HEADLINE DICTIONARY
pad_idx = max(list(headline_dictionary.index2word.keys())) + 1

headline_dictionary.word2index['<pad>'] = pad_idx
headline_dictionary.index2word[pad_idx] = '<pad>'

print(len(headline_dictionary.index2word.keys()))

62272
13931


##### *Extract embedding vectors for words we need*

In [0]:
%%time
pre_train_weight = extract_weight(text_dictionary)
pre_train_weight = np.array(pre_train_weight, dtype = np.float32)

del embed_dictl
gc.collect()

CPU times: user 11min 29s, sys: 32.5 s, total: 12min 2s
Wall time: 12min 2s


### **Transform the data**

In [0]:
# Train set
text_train, text_lengths_train, headline_train, headline_lengths_train = data2PaddedArray(text_train, headline_train, {'text_dictionary': text_dictionary,
                                                                                                                       'headline_dictionary': headline_dictionary},
                                                                                          pre_train_weight)
# Validation set
text_val, text_lengths_val, headline_val, headline_lengths_val = data2PaddedArray(text_val, headline_val, {'text_dictionary': text_dictionary,
                                                                                                           'headline_dictionary': headline_dictionary},
                                                                                  pre_train_weight)
# Test set
text_test, text_lengths_test, headline_test, headline_lengths_test = data2PaddedArray(text_test, headline_test, {'text_dictionary': text_dictionary,
                                                                                                                 'headline_dictionary': headline_dictionary},
                                                                                       pre_train_weight)

# **3 Training**


## **3.1 Generator - Pretraining**

<hr>

**Description**

In [0]:
grid = {'max_epochs': 3,
        'batch_size': 32,
        'learning_rate': 5e-4,
        'clip': 10,
        'l2_reg': 1e-4,
        'model_name': "generator031"
      }

##### model ######
OUTPUT_DIM = len(headline_dictionary.index2word.keys())
ENC_EMB_DIM = pre_train_weight.shape[1]
ENC_HID_DIM = 512
DEC_HID_DIM = 512

enc_num_layers = 1 # number of layers in RNN
dec_num_layers = 1 # number of layers in RNN

ENC_DROPOUT = 0.1
DEC_DROPOUT = 0.1

In [0]:
Generator = generator(model = _Seq2Seq, loss_function = nn.CrossEntropyLoss, optimiser = optim.Adam, l2_reg = grid['l2_reg'], batch_size = grid['batch_size'],
                      text_dictionary = text_dictionary, embeddings = pre_train_weight, max_epochs = grid['max_epochs'], learning_rate = grid['learning_rate'],
                      clip = grid['clip'], teacher_forcing_ratio = 1, OUTPUT_DIM = OUTPUT_DIM, ENC_HID_DIM = ENC_HID_DIM, ENC_EMB_DIM = ENC_EMB_DIM,
                      DEC_HID_DIM = DEC_HID_DIM, ENC_DROPOUT = ENC_DROPOUT, DEC_DROPOUT = DEC_DROPOUT, enc_num_layers = enc_num_layers, dec_num_layers = dec_num_layers,
                      device = device, model_name = grid['model_name'], push_to_repo = push_to_repo)

In [0]:
Generator.load()

In [0]:
Generator.model

_Seq2Seq(
  (encoder): _Encoder(
    (rnn): GRU(200, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder): _Decoder(
    (attention): _Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (rnn): GRU(1224, 512)
    (fc_out): Linear(in_features=1736, out_features=13931, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

In [0]:
Generator.train(X_train = text_train,
                y_train = headline_train,
                X_val = text_val,
                y_val = headline_val,
                X_train_lengths = text_lengths_train,
                y_train_lengths = headline_lengths_train,
                X_val_lengths = text_lengths_val,
                y_val_lengths = headline_lengths_val)

Epoch 1 - Intermediate loss 3.292 after 1.97 % of training examples.
Total time 60.1 s.
Epoch 1 - Intermediate loss 3.038 after 4.33 % of training examples.
Total time 120.8 s.
Epoch 1 - Intermediate loss 2.925 after 6.84 % of training examples.
Total time 181.6 s.
Epoch 1 - Intermediate loss 2.856 after 9.34 % of training examples.
Total time 242.2 s.
Epoch 1 - Intermediate loss 2.801 after 11.95 % of training examples.
Total time 304.0 s.
Epoch 1 - Intermediate loss 2.756 after 14.40 % of training examples.
Total time 364.6 s.
Epoch 1 - Intermediate loss 2.724 after 17.04 % of training examples.
Total time 425.2 s.
Epoch 1 - Intermediate loss 2.682 after 19.65 % of training examples.
Total time 485.9 s.
Epoch 1 - Intermediate loss 2.646 after 22.29 % of training examples.
Total time 546.3 s.
Epoch 1 - Intermediate loss 2.611 after 24.85 % of training examples.
Total time 607.2 s.
Epoch 1 - Intermediate loss 2.579 after 27.40 % of training examples.
Total time 667.6 s.
Epoch 1 - Inter

## **3.2 Generator - Generating summaries**

<hr>

**Description**

In [0]:
"""
!git pull origin master
"""

remote: Enumerating objects: 9, done.[K
remote: Counting objects:  11% (1/9)[Kremote: Counting objects:  22% (2/9)[Kremote: Counting objects:  33% (3/9)[Kremote: Counting objects:  44% (4/9)[Kremote: Counting objects:  55% (5/9)[Kremote: Counting objects:  66% (6/9)[Kremote: Counting objects:  77% (7/9)[Kremote: Counting objects:  88% (8/9)[Kremote: Counting objects: 100% (9/9)[Kremote: Counting objects: 100% (9/9), done.[K
remote: Compressing objects: 100% (1/1)[Kremote: Compressing objects: 100% (1/1), done.[K
remote: Total 5 (delta 4), reused 5 (delta 4), pack-reused 0[K
Unpacking objects:  20% (1/5)   Unpacking objects:  40% (2/5)   Unpacking objects:  60% (3/5)   Unpacking objects:  80% (4/5)   Unpacking objects: 100% (5/5)   Unpacking objects: 100% (5/5), done.
From https://github.com/stancld/GeneratingHeadline_GANs
 * branch            master     -> FETCH_HEAD
   32913f2..769693e  master     -> origin/master
Updating 32913f2..769693e
Fast-forward
 Co

In [0]:
"""
# code for the training class
run Code/Models/generator_training_class.py
"""

In [0]:
o = Generator.generate_summaries(input_val = text_train,
                             input_val_lengths = text_lengths_train,
                             target_val = headline_train,
                             target_val_lengths = headline_lengths_train)

In [0]:
pad_ix = headline_dictionary.word2index['<pad>']

In [0]:
for k in range(10):
  print(' '.join([headline_dictionary.index2word[i] for i in o[:, k] if i != pad_ix]))

suckle tap the red . eos
suckle integrate the 4 . eos
suckle dial 1 4 1 . eos
suckle cut the brim . eos
suckle choose a circle . eos
suckle wash your hands . eos
suckle wash your hair . eos
suckle tap the three . eos
suckle add the water . eos
suckle done . eos


In [0]:
for k in range(10):
  print(' '.join([headline_dictionary.index2word[i] for i in headline_train[:, k] if i != pad_ix]))

sos click on the sound again with your right mouse button and click on modify lip sync mapping . eos
sos identify all common prime factors . eos
sos input the column headings input base t to three . eos
sos cut the wire mesh large enough to fold around the box .be sure you allow for some overlap at the seam . eos
sos cut out another 5 items that are sold by quantity . eos
sos look out when the bear comes to eat the three three ! eos
sos mavis s hair is a bob kind of style with bangs . eos
sos do not overdo three . eos
sos add beer to the glass . eos
sos if you get accepted at your three . eos


In [0]:
`Generator.text_dictionary.word2index['sos']