#### Text Summarization for BBC News - Seq2Seq with Attention using Bi-LSTM and/or Transformer BERT Embeddings
<br>
<br>
This project demonstrates one of the most common natural language processing (NLP) tasks, Text Summarization (TS), on a well-known public dataset of 2,225 BBC News articles (available on Kaggle: <a href="https://www.kaggle.com/pariza/bbc-news-summary">https://www.kaggle.com/pariza/bbc-news-summary</a>). The articles typically run to around 300 to 500 words, depending on the topic (e.g. shorter for business or entertainment news, longer for political texts). The following three approaches were attempted: <br>
<ul>
<li>Seq2Seq with Bidirectional-LSTM embeddings and Luong Attention</li>
<li>Seq2Seq with Transformer (positional embeddings + multi-head attention)</li>
<li>Seq2Seq with pre-trained BERT models for preprocessing and encoding
    <ul><li>Preprocessor:  A Lite BERT (ALBERT)</li>
        <li>Encoder:  BERT with Talking-Heads Attention and Gated GELU</li>
    </ul>
</li>
</ul>
<br>
There are two approaches to text summarization: Extractive Text Summarization (ETS) and Abstractive Text Summarization (ATS). <br>
 <br>
The extractive approach is easier to implement. The abstractive approach normally needs a large number of training epochs and abundant samples before its output is qualitatively comparable to human-written summaries, and the cloud resources this requires are hard to justify for a small-scale task. Moreover, the summaries in this dataset are essentially a selection of key sentences taken from the full article and reordered, so the sentence is the natural unit of tokenization here. The notebook therefore follows the extractive approach.  <br>
 <br>
Four evaluation metrics are reported: 1) accuracy measures whether the model extracts the correct key sentences in the same order as the reference summary; 2) precision is the proportion of predicted sentences that were correctly extracted, i.e. the ability of the model to select the key content; 3) recall is the proportion of sentences in the reference summary that also appear in the predicted text, i.e. the ability of the model to cover what should actually be present; 4) BLEU score, borrowed from machine translation, indicates how closely the generated summary matches the reference summary after excluding the padding, unknown, begin-of-sentence and end-of-sentence tags, and can be treated as a modified precision score.<br>
 <br>
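The first three metrics can be illustrated with a minimal sketch for a single article. It assumes each summary is represented as a list of sentence indices; the function name `extraction_scores` is hypothetical, and the notebook's own evaluation routine may differ in details such as how reserved tags are handled.

```python
def extraction_scores(predicted, reference):
    """Score one predicted summary against its reference.

    Both arguments are lists of sentence indices (an assumption of this sketch).
    """
    pred_set, ref_set = set(predicted), set(reference)
    hits = len(pred_set & ref_set)

    # precision: share of predicted sentences that belong to the reference
    precision = hits / len(pred_set) if pred_set else 0.0
    # recall: share of reference sentences that were recovered
    recall = hits / len(ref_set) if ref_set else 0.0

    # accuracy: the same sentence must sit at the same position
    n_slots = max(len(predicted), len(reference))
    matches = sum(p == r for p, r in zip(predicted, reference))
    accuracy = matches / n_slots if n_slots else 0.0
    return accuracy, precision, recall


acc, prec, rec = extraction_scores(predicted=[4, 7, 9, 5], reference=[4, 9, 7])
print(acc, prec, rec)
```

In this example every reference sentence is recovered, so recall is perfect, while the extra sentence lowers precision and the swapped order lowers accuracy. This is the same pattern of low accuracy alongside higher precision and recall that appears in the results table.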

<table>
    <thead>
        <tr>
          <th>News Topics</th>
          <th colspan="4">BiLSTM Embedding + Luong Attention</th>
          <th colspan="4">BERT Embedding + Luong Attention</th>
          <th colspan="4">Transformer Encoding + Self-Attention</th>
        </tr>
        <tr>
          <th></th>
          <th>Mean Accuracy</th>
          <th>Mean Precision</th>
          <th>Mean Recall</th>
          <th>Mean BLEU</th>
          <th>Mean Accuracy</th>
          <th>Mean Precision</th>
          <th>Mean Recall</th>
          <th>Mean BLEU</th>
          <th>Mean Accuracy</th>
          <th>Mean Precision</th>
          <th>Mean Recall</th>
          <th>Mean BLEU</th>
        </tr>
    </thead>
    <tbody>
      <tr>
        <td>Business</td>
        <td>0.3217</td>	
        <td>0.3718</td>	
        <td>0.2072</td>	
        <td>0.4628</td>	
        <td>0.3971</td>	
        <td>0.7884</td>	
        <td>0.4582</td>	
        <td>0.5418</td>	
        <td>0.3654</td>	
        <td>0.3630</td>	
        <td>0.6087</td>	
        <td>0.4963</td>
      </tr>
      <tr>
        <td>Entertainment</td>
        <td>0.0612</td>	
        <td>0.2890</td>	
        <td>0.6172</td>	
        <td>0.4894</td>	
        <td>0.0939</td>	
        <td>0.6401</td>	
        <td>0.5576</td>	
        <td>0.4933</td>	
        <td>0.0776</td>
        <td>0.3446</td>	
        <td>0.5730</td>	
        <td>0.4895</td>
      </tr>
      <tr>
        <td>Politics</td>
        <td>0.0569</td>	
        <td>0.3220</td>	
        <td>0.7384</td>	
        <td>0.4530</td>	
        <td>0.0668</td>	
        <td>0.5675</td>	
        <td>0.6667</td>	
        <td>0.4923</td>	
        <td>0.0875</td>	
        <td>0.4047</td>	
        <td>0.7103</td>	
        <td>0.4945</td>
      </tr>
      <tr>
        <td>Sports</td>
        <td>0.0583</td>	
        <td>0.3120</td>	
        <td>0.4673</td>	
        <td>0.4674</td>	
        <td>0.1902</td>	
        <td>0.4850</td>	
        <td>0.4583</td>	
        <td>0.4777</td>	
        <td>0.2375</td>	
        <td>0.3132</td>	
        <td>0.8737</td>	
        <td>0.4897</td>
      </tr>
      <tr>
        <td>Technology</td>
        <td>0.0801</td>	
        <td>0.3266</td>	
        <td>0.3471</td>	
        <td>0.4107</td>	
        <td>0.1856</td>	
        <td>0.7495</td>	
        <td>0.3611</td>	
        <td>0.4504</td>	
        <td>0.1091</td>	
        <td>0.3617</td>	
        <td>0.7109</td>	
        <td>0.4278</td>
      </tr>
      <tr>
      <td><b>Overall</b></td>
      <td><b>0.1156</b></td>	
      <td><b>0.3243</b></td>	
      <td><b>0.4754</b></td>	
      <td><b>0.4567</b></td>	
      <td><b>0.1867</b></td>	
      <td><b>0.6461</b></td>	
      <td><b>0.5004</b></td>	
      <td><b>0.4911</b></td>	
      <td><b>0.1754</b></td>	
      <td><b>0.3574</b></td>	
      <td><b>0.6953</b></td>	
      <td><b>0.4796</b></td>
      </tr>
    </tbody>
</table>

<br>
The table above gathers the results of all models fitted in this notebook. The Transformer-based encoders with positional embeddings, whether trained from scratch or taken from the pre-trained BERT models on TensorFlow Hub, outperformed the Bidirectional LSTM embeddings on overall mean accuracy, precision and recall. The BLEU score was above 0.4 for every topic and model, which is generally an acceptable level. <br>
<br>
Between the two models trained from scratch, the Transformer with positional embeddings gave generally slight gains in average precision over the Bi-LSTM embeddings, and greatly boosted the average recall for Business, Sports and Technology news, from roughly 0.2 - 0.5 to 0.6 - 0.9 (overall from 0.475 to 0.695). The accuracy for Sports was also highest with the custom Transformer model. <br>
<br>
The pre-trained BERT model with talking-heads attention and gated GELU, loaded from TensorFlow Hub, projects each sentence to a (1 x 128 x 768) array using the weights released by its developers. These embeddings were passed directly to the decoder model with Luong Attention as the encoder outputs. The results differ more markedly from those of the custom models: the average precision of the predicted summaries rose to over 0.64 from about 0.35, but the average recall dropped compared to the custom Transformer model. Precision is thus favoured over recall, which may mean that the models with pre-trained BERT embeddings are more conservative: the sentences they select tend to be correct key content, but they cover fewer of the sentences that are needed.

In [None]:
!unzip -uq "/content/drive/My Drive/Colab Notebooks/NLP/bbc_news_archive.zip"

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
import os
import re

summary_dir = sorted(os.listdir("./BBC News Summary/Summaries"))
text_dir = sorted(os.listdir("./BBC News Summary/News Articles"))

def read_topic_files(base_dir, topic):
    ## read every file of one topic as a single-line string
    docs = list()
    ## sort the file names so that articles and summaries stay aligned by index
    for name in sorted(os.listdir(os.path.join(base_dir, topic))):
        path = os.path.join(base_dir, topic, name)
        try:
            with open(path, 'r') as fh:
                f = fh.read()
        except UnicodeDecodeError:
            ## fall back to a lossy decode for files that are not valid in the default encoding
            with open(path, 'rb') as fh:
                f = fh.read().decode(errors='replace')
        docs.append(f.replace('\n', ' '))
    return docs

summary = {i: read_topic_files("./BBC News Summary/Summaries", i) for i in summary_dir}
raw = {i: read_topic_files("./BBC News Summary/News Articles", i) for i in text_dir}

In [None]:
from random import sample, seed, random, randint
seed(42)
txt_idx = [len(summary[x]) for x in summary]
train_idx = [sample(range(0, x), int(x*0.95)) for x in txt_idx]
test_idx = [[x for x in range(0, txt_idx[y]) if x not in train_idx[y]] for y in range(len(train_idx))]

In [None]:
train_summary = [[summary[list(summary.keys())[x]][tr] for tr in train_idx[x]] for x in range(len(summary))]
train_raw = [[raw[list(raw.keys())[x]][tr] for tr in train_idx[x]] for x in range(len(raw))]
test_summary = [[summary[list(summary.keys())[x]][ts] for ts in test_idx[x]] for x in range(len(summary))]
test_raw = [[raw[list(raw.keys())[x]][ts] for ts in test_idx[x]] for x in range(len(raw))]

In [None]:
text_dir

['business', 'entertainment', 'politics', 'sport', 'tech']

In [None]:
summary_dir

['business', 'entertainment', 'politics', 'sport', 'tech']

In [None]:
list(raw.keys())

['business', 'entertainment', 'politics', 'sport', 'tech']

In [None]:
list(summary.keys())

['business', 'entertainment', 'politics', 'sport', 'tech']

In [None]:
import nltk
import re
from nltk.util import ngrams
from nltk import word_tokenize
from nltk import sent_tokenize

nltk.download('stopwords')
nltk.download('punkt')
stop_words = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
## Word tokenization
"""
def word_tokenization(doc):
    doc = doc.lower()
    doc = re.sub('<br/>', ' ', doc)
    doc = re.sub('<br />', ' ', doc)
    doc = re.sub("'s|'m|'re|'ve|'ll|'d|n't", ' ', doc)
    doc = re.sub(r'"', ' ', doc)
    doc = re.sub(r'[^a-zA-Z\s]\W+|\d+|[!?@#%^&*\[\]\\(){}<>]|[.]|[/]|[$]|[-;:,`~=_+]', ' ', doc)
    doc = doc.strip()
    tokens = word_tokenize(doc)
    tokens = [x for x in tokens if x not in stop_words]
    return tokens

train_raw_token_categorized = list()
train_summary_token_categorized = list()
for i in range(len(train_summary)):
  train_raw_token_categorized.append([word_tokenization(x) for x in train_raw[i]])
  train_summary_token_categorized.append([word_tokenization(x) for x in train_summary[i]])

test_raw_token_categorized = list()
test_summary_token_categorized = list()
for i in range(len(test_summary)):
  test_raw_token_categorized.append([word_tokenization(x) for x in test_raw[i]])
  test_summary_token_categorized.append([word_tokenization(x) for x in test_summary[i]])
"""

In [None]:
## Trigram tokenization
"""
def ngram_tokenization(doc, n):
    doc = doc.lower()
    doc = re.sub('<br/>', ' ', doc)
    doc = re.sub('<br />', ' ', doc)
    doc = re.sub("'s|'m|'re|'ve|'ll|'d|n't", ' ', doc)
    doc = re.sub(r'"', ' ', doc)
    doc = re.sub(r'[^a-zA-Z\s]\W+|\d+|[!?@#%^&*\[\]\\(){}<>]|[.]|[/]|[$]|[-;:,`~=_+]', ' ', doc)
    doc = doc.strip()
    tokens = ngrams(doc.split(), n)
    return tokens

train_raw_token_categorized = list()
train_summary_token_categorized = list()
for i in range(len(train_summary)):
  train_raw_token_categorized.append([ngram_tokenization(x, 3) for x in train_raw[i]])
  train_summary_token_categorized.append([ngram_tokenization(x, 3) for x in train_summary[i]])

test_raw_token_categorized = list()
test_summary_token_categorized = list()
for i in range(len(test_summary)):
  test_raw_token_categorized.append([ngram_tokenization(x, 3) for x in test_raw[i]])
  test_summary_token_categorized.append([ngram_tokenization(x, 3) for x in test_summary[i]])
"""

In [None]:
## Sentence tokenization
def summary_sentence_tokenization(doc):
    doc = re.sub('<br/>', ' ', doc)
    doc = re.sub('<br />', ' ', doc)
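    ## the summaries glue sentences together without a space ("...ended.The next..."),
    ## so split where end punctuation is directly followed by a capital letter or a quote
    tokens = re.split(r'([.!?][A-Z"])', doc)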
    re_tokens = []
    if len(tokens) > 1:
        re_tokens.append(tokens[0] + tokens[1][0])
        reserve_letter = tokens[1][1]
        for n in range(2, len(tokens), 2):
            if n + 1 < len(tokens):
                re_tokens.append(reserve_letter + tokens[n] + tokens[n+1][0])
            else:
                re_tokens.append(reserve_letter + tokens[n])
            try:
                reserve_letter = tokens[n+1][1]
            except IndexError:
                continue
        re_tokens = " ".join(re_tokens)
        tokens = sent_tokenize(re_tokens)
    tokens = [x.strip() for x in tokens if len(x.strip()) > 0]
    tokens = [x.lower() for x in tokens]
    return tokens

def raw_sentence_tokenization(doc):
    doc = doc.lower()
    doc = re.sub('<br/>', ' ', doc)
    doc = re.sub('<br />', ' ', doc)
    tokens = sent_tokenize(doc)
    tokens = [z for y in [x.split("  ") for x in tokens] for z in y]
    tokens = [x.strip() for x in tokens if len(x.strip()) > 0]
    return tokens

train_raw_token_categorized = list()
train_summary_token_categorized = list()
for i in range(len(train_summary)):
  train_raw_token_categorized.append([raw_sentence_tokenization(x) for x in train_raw[i]])
  train_summary_token_categorized.append([summary_sentence_tokenization(x) for x in train_summary[i]])

test_raw_token_categorized = list()
test_summary_token_categorized = list()
for i in range(len(test_summary)):
  test_raw_token_categorized.append([raw_sentence_tokenization(x) for x in test_raw[i]])
  test_summary_token_categorized.append([summary_sentence_tokenization(x) for x in test_summary[i]])
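The reason the summaries need their own tokenizer is that their sentences are concatenated without a separating space. The sketch below shows the same idea in a more compact form, using zero-width lookarounds instead of the split-and-rejoin loop. It is an alternative illustration and not the notebook's function: `split_glued_sentences` is a hypothetical name, and it uses a plain regex split where the notebook calls `sent_tokenize`.

```python
import re

def split_glued_sentences(doc):
    """Split text whose sentences may be glued together as '...end.Next...'."""
    # insert a space between end punctuation and a following capital letter or quote
    spaced = re.sub(r'(?<=[.!?])(?=[A-Z"])', ' ', doc)
    # split on whitespace that follows end punctuation, then normalise case
    parts = re.split(r'(?<=[.!?])\s+', spaced)
    return [p.strip().lower() for p in parts if p.strip()]


print(split_glued_sentences("Profits rose.Shares fell!Why?Nobody knows."))
```

Like the notebook's version, this rule also fires inside abbreviations such as "U.S." and before a closing quote that directly follows a full stop, so some sentences may be cut too early.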

In [None]:
## gather tokens of all topics
train_raw_token = [y for x in train_raw_token_categorized for y in x]
train_summary_token = [y for x in train_summary_token_categorized for y in x]
test_raw_token = [y for x in test_raw_token_categorized for y in x]
test_summary_token = [y for x in test_summary_token_categorized for y in x]

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [None]:
## convert encoder dictionary
def process_seq2seq_train_encoder_input(encoder):
    ## reserved key-index for padding sequence length, out-of-dictionary words
    reserved = {'<PAD>': 0, '<UNK>': 1}
    enc_list = [w for i in encoder for w in i]
    enc_dict = {e:i+2 for i,e in enumerate(set(enc_list))}
    enc_dict = {**reserved, **enc_dict}
    ## map every sentence of every document to its index, falling back to <UNK>
    enc_seq = [[enc_dict.get(se, reserved['<UNK>']) for se in doc] for doc in encoder]
    return enc_dict, enc_seq
    
## convert decoder dictionary
def process_seq2seq_train_decoder_input(decoder):
    reserved = {'<PAD>': 0, '<UNK>': 1, '<BOS>':2, '<EOS>':3}
    dec_list = [w for i in decoder for w in i]
    dec_dict = {e:i+4 for i,e in enumerate(set(dec_list))}
    dec_dict = {**reserved, **dec_dict}
    dec_seq= []
    ## pad <BOS> and <EOS> at the beginning and ending of decoder inputs as indicator for teacher forcing in 3-D outputs
    for f in range(len(decoder)):
        dec_sub_seq = []
        dec_sub_seq.append(dec_dict.get('<BOS>'))
        for sf in decoder[f]:
            dec_sub_seq.append(dec_dict.get(sf))
        dec_sub_seq.append(dec_dict.get('<EOS>'))
        dec_seq.append(dec_sub_seq)
    return dec_dict, dec_seq

## create an one-hot encoded vector for each position of the sequence length
def process_seq2seq_train_decoder_y(decoder_text, decoder_dict):
    # define length
    max_length_de = max([len(x) for x in decoder_text])
    len_de = len(decoder_dict)
    # initialize matrix
    decoder_output_label = np.zeros((len(decoder_text), max_length_de, len_de), dtype="float32")
    ## decoder output data would be ahead of decoder input data by one timestep
    for i, s1 in enumerate(decoder_text):
      for j, s2 in enumerate(s1):
        if j > 0:
          decoder_output_label[i][j-1][s2] = 1
    return decoder_output_label

"""  Setting a maximum limit on available decoding units. """
def process_seq2seq_train_decoder_y_modified(encoder_text, decoder_text, max_length_de=None, max_length_de_cat=None):

    if max_length_de == None:
        max_length_de = max([len(x) for x in encoder_text]) + 2
    if max_length_de_cat == None:
        max_length_de_cat = max_length_de + 2

    decoder_output_label = np.zeros((len(decoder_text), max_length_de, max_length_de_cat), dtype="float32")
    decoder_n = 0
    
    for d,e in zip(decoder_text, encoder_text):
        reserved = {'<PAD>': 0, '<UNK>': 1, '<BOS>':2, '<EOS>':3 }
        enc_list = [w for w in e]
        enc_dict = {j:i+4 for i,j in enumerate(set(enc_list))}
        enc_dict = {**reserved, **enc_dict}
        for n in range(len(d)):
            if d[n] in enc_dict.keys():
                ind = enc_dict.get(d[n])
            else:
                ind = enc_dict.get('<PAD>')
            decoder_output_label[decoder_n, n + 1, ind] = 1
        decoder_n += 1

    for k in range(len(decoder_text)):
        decoder_output_label[k, 0, 2] = 1
        decoder_output_label[k, max_length_de - 1, 3] = 1
            
    return decoder_output_label
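To make the target layout concrete, here is a minimal sketch of the labels for a single article, written as class indices rather than one-hot rows. The name `document_labels` is hypothetical. The sketch de-duplicates sentences in reading order with `dict.fromkeys`, which is an assumption made here for a reproducible example; the function above uses `set()`, whose ordering is arbitrary.

```python
RESERVED = {'<PAD>': 0, '<UNK>': 1, '<BOS>': 2, '<EOS>': 3}

def document_labels(article_sentences, summary_sentences, max_len):
    """Return one class index per decoder step for a single article.

    Assumes max_len >= len(summary_sentences) + 2 so that <BOS> and <EOS> fit.
    """
    # every distinct article sentence gets its own class, after the 4 reserved ones
    index_of = {s: i + 4 for i, s in enumerate(dict.fromkeys(article_sentences))}

    labels = [RESERVED['<PAD>']] * max_len
    labels[0] = RESERVED['<BOS>']
    for step, sentence in enumerate(summary_sentences, start=1):
        # a summary sentence not found verbatim in the article falls back to <PAD>
        labels[step] = index_of.get(sentence, RESERVED['<PAD>'])
    labels[-1] = RESERVED['<EOS>']
    return labels


article = ["a.", "b.", "c.", "d."]
summary = ["c.", "a."]
print(document_labels(article, summary, max_len=len(article) + 2))
```

With the longest article holding N sentences, the decoder therefore needs N + 2 steps and N + 4 classes, which matches the defaults `max_length_de` and `max_length_de + 2` above.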

In [None]:
def get_encoder_decoder_inputs(raw_set, summary_set):
  raw_dict, raw_seq = process_seq2seq_train_encoder_input(raw_set)
  summary_dict, summary_seq = process_seq2seq_train_decoder_input(summary_set)
  # summary_seq_y = process_seq2seq_train_decoder_y(summary_seq, summary_dict)
  summary_seq_y = process_seq2seq_train_decoder_y_modified(raw_set, summary_set)
  return raw_dict, raw_seq, summary_dict, summary_seq, summary_seq_y

In [None]:
## "Business" topic news:
## generate encoder and decoder sequence
raw_dict, raw_seq = process_seq2seq_train_encoder_input(train_raw_token_categorized[0])
summary_dict, summary_seq = process_seq2seq_train_decoder_input(train_summary_token_categorized[0])
## pad encoder and decoder sequence
raw_seq = pad_sequences(raw_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
# summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in summary_seq]), padding='post')
summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in raw_seq]) + 2, padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
# y = process_seq2seq_train_decoder_y(summary_seq, summary_dict)
y = process_seq2seq_train_decoder_y_modified(train_raw_token_categorized[0], train_summary_token_categorized[0])
## texts for inputs into bert
bert_inputs = train_raw_token_categorized[0]
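
The padding step in the cell above can be illustrated without Keras. This is a rough pure-Python stand-in for `pad_sequences(..., padding='post')`, under the assumption that `maxlen` is at least 1; the helper name `pad_post` is hypothetical. The `+ 2` on the decoder side leaves room for the `<BOS>` and `<EOS>` markers.

```python
def pad_post(seq, maxlen, value=0):
    """Right-pad a list of ids to maxlen; over-long inputs lose their first items."""
    # Keras truncates from the front by default (truncating='pre')
    trimmed = list(seq[-maxlen:]) if len(seq) > maxlen else list(seq)
    return trimmed + [value] * (maxlen - len(trimmed))


encoder_ids = [[5, 6, 7], [8, 9]]
decoder_ids = [[2, 4, 3], [2, 4, 5, 3]]        # each starts with <BOS>=2 and ends with <EOS>=3

enc_len = max(len(s) for s in encoder_ids)
dec_len = enc_len + 2                           # longest article plus <BOS> and <EOS>

print([pad_post(s, enc_len) for s in encoder_ids])
print([pad_post(s, dec_len) for s in decoder_ids])
```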

In [None]:
## "Entertainment" topic news:
## generate encoder and decoder sequence
raw_dict, raw_seq = process_seq2seq_train_encoder_input(train_raw_token_categorized[1])
summary_dict, summary_seq = process_seq2seq_train_decoder_input(train_summary_token_categorized[1])
## pad encoder and decoder sequence
raw_seq = pad_sequences(raw_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
# summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in summary_seq]), padding='post')
summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in raw_seq]) + 2, padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
# y = process_seq2seq_train_decoder_y(summary_seq, summary_dict)
y = process_seq2seq_train_decoder_y_modified(train_raw_token_categorized[1], train_summary_token_categorized[1])
## texts for inputs into bert
bert_inputs = train_raw_token_categorized[1]

In [None]:
## "Politics" topic news:
## generate encoder and decoder sequence
raw_dict, raw_seq = process_seq2seq_train_encoder_input(train_raw_token_categorized[2])
summary_dict, summary_seq = process_seq2seq_train_decoder_input(train_summary_token_categorized[2])
## pad encoder and decoder sequence
raw_seq = pad_sequences(raw_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
# summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in summary_seq]), padding='post')
summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in raw_seq]) + 2, padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
# y = process_seq2seq_train_decoder_y(summary_seq, summary_dict)
y = process_seq2seq_train_decoder_y_modified(train_raw_token_categorized[2], train_summary_token_categorized[2])
## texts for inputs into bert
bert_inputs = train_raw_token_categorized[2]

In [None]:
## "Sports" topic news:
## generate encoder and decoder sequence
raw_dict, raw_seq = process_seq2seq_train_encoder_input(train_raw_token_categorized[3])
summary_dict, summary_seq = process_seq2seq_train_decoder_input(train_summary_token_categorized[3])
## pad encoder and decoder sequence
raw_seq = pad_sequences(raw_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
# summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in summary_seq]), padding='post')
summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in raw_seq]) + 2, padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
# y = process_seq2seq_train_decoder_y(summary_seq, summary_dict)
y = process_seq2seq_train_decoder_y_modified(train_raw_token_categorized[3], train_summary_token_categorized[3])
## texts for inputs into bert
bert_inputs = train_raw_token_categorized[3]

In [None]:
## "Tech" topic news:
## generate encoder and decoder sequence
raw_dict, raw_seq = process_seq2seq_train_encoder_input(train_raw_token_categorized[4])
summary_dict, summary_seq = process_seq2seq_train_decoder_input(train_summary_token_categorized[4])
## pad encoder and decoder sequence
raw_seq = pad_sequences(raw_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
# summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in summary_seq]), padding='post')
summary_seq = pad_sequences(summary_seq, maxlen = max([len(x) for x in raw_seq]) + 2, padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
# y = process_seq2seq_train_decoder_y(summary_seq, summary_dict)
y = process_seq2seq_train_decoder_y_modified(train_raw_token_categorized[4], train_summary_token_categorized[4])
## texts for inputs into bert
bert_inputs = train_raw_token_categorized[4]

In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import LSTM, Embedding, Bidirectional, Dense
from tensorflow.keras.layers import TimeDistributed, Dropout, Activation, Concatenate, Dot
from tensorflow.keras.optimizers import Adam, RMSprop
from nltk.translate.bleu_score import sentence_bleu
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import model_from_json

In [None]:
from sklearn.utils import resample

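## fitting with 5 random 90/10 train/validation splits
## (drawn without replacement, so each is a hold-out subsample rather than a true bootstrap)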
random_state = [42, 101, 111, 123, 999]

def bootstrap_samples(num_training_samples, self_defined_random_state, 
                      encoder_training_samples, 
                      decoder_training_samples, decoder_training_samples_output):
  
  sample_index = list(range(0, num_training_samples))
  boot = resample(sample_index, replace=False, 
                  n_samples = int(num_training_samples*0.9), 
                  random_state = self_defined_random_state)
  
  enc_train = [encoder_training_samples[ref] for ref in boot]
  enc_val = [encoder_training_samples[ref] for ref in range(0, len(encoder_training_samples)) if ref not in boot]
  
  dec_train_in = [decoder_training_samples[ref] for ref in boot]
  dec_val_in = [decoder_training_samples[ref] for ref in range(0, len(decoder_training_samples)) if ref not in boot]
  
  dec_train_out = [decoder_training_samples_output[ref] for ref in boot]
  dec_val_out = [decoder_training_samples_output[ref] for ref in range(0, len(decoder_training_samples_output)) if ref not in boot]
  
  enc_train = np.array(enc_train)
  enc_val = np.array(enc_val)
  dec_train_in = np.array(dec_train_in)
  dec_val_in = np.array(dec_val_in)
  dec_train_out = np.array(dec_train_out)
  dec_val_out = np.array(dec_val_out)
  
  return enc_train, enc_val, dec_train_in, dec_val_in, dec_train_out, dec_val_out
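
Since `resample` is called with `replace=False`, each run is a 90/10 hold-out split. The sketch below contrasts that with a bootstrap draw in the usual sense, using only the standard library; both function names are hypothetical.

```python
import random

def holdout_split(n, train_frac, seed):
    """Sample indices without replacement; the rest form the validation set."""
    rng = random.Random(seed)
    train = rng.sample(range(n), int(n * train_frac))
    train_set = set(train)
    val = [i for i in range(n) if i not in train_set]
    return train, val

def bootstrap_split(n, seed):
    """Sample n indices with replacement; unseen (out-of-bag) indices validate."""
    rng = random.Random(seed)
    train = rng.choices(range(n), k=n)
    train_set = set(train)
    out_of_bag = [i for i in range(n) if i not in train_set]
    return train, out_of_bag


train, val = holdout_split(100, 0.9, seed=42)
boot, oob = bootstrap_split(100, seed=42)
print(len(train), len(val), len(set(boot)), len(oob))
```

Converting the sampled indices to a set before filtering keeps the membership test cheap. The same change could be applied to the `ref not in boot` checks above if the dataset grows.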

In [None]:
import tensorflow as tf
tf.config.run_functions_eagerly(True)

In [None]:
def seq2seq_bilstm_luong(encoder_dict, decoder_dict, encoder, decoder):

    len_en = len(encoder_dict)
    len_de = len(decoder_dict)
    max_length_en = max([len(x) for x in encoder])
    max_length_de = max([len(x) for x in decoder])

    ## Encoder structure with Bi-LSTM
    encoder_inputs = Input(shape=(None, ))
    encoder_embed = Embedding(input_dim=len_en, output_dim=500)(encoder_inputs)
    encoder_LSTM = Bidirectional(LSTM(250, return_state=True, return_sequences=True))
    encoder_hidden_vec, forward_last_h, forward_last_c, backward_last_h, backward_last_c = encoder_LSTM(encoder_embed)
    enc_state_last_h = Concatenate()([forward_last_h, backward_last_h])
    enc_state_last_c = Concatenate()([forward_last_c, backward_last_c])
    encoder_states = [enc_state_last_h, enc_state_last_c]

    ## Decoder structure with 2-layer stacked LSTM
    decoder_inputs = Input(shape=(None, ))
    decoder_embed = Embedding(input_dim=len_de, output_dim=1000)(decoder_inputs)
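    decoder_LSTM = LSTM(units=500, return_state=True, return_sequences=True)
    ## the first layer returns [sequence, h, c]; unpack it so the hand-over to the second layer is explicit
    decoder_seq1, dec_state1_h, dec_state1_c = decoder_LSTM(decoder_embed, initial_state = encoder_states)
    decoder_LSTM2 = LSTM(units=500, return_state=True, return_sequences=True)
    decoder_hidden_vec, dec_state_last_h, dec_state_last_c = decoder_LSTM2(decoder_seq1, initial_state = [dec_state1_h, dec_state1_c])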
        
    ## Attention mechanism
    attention_score = Dot([2,2])([decoder_hidden_vec, encoder_hidden_vec])
    attention_weight = Activation('softmax')(attention_score)
    context = Dot([2,1])([attention_weight, encoder_hidden_vec])
    decoder_outputs_combined_context = Concatenate()([context, decoder_hidden_vec])

    hidden_state_outputs = TimeDistributed(Dense((max_length_de + 2) * 2, activation='tanh'))(decoder_outputs_combined_context)
    outputs = TimeDistributed(Dense(max_length_de + 2, activation='softmax'))(hidden_state_outputs)

    model = Model([encoder_inputs, decoder_inputs], outputs)

    return model
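
The attention block in the model is the dot-product form of Luong attention: a score for every decoder/encoder pair of timesteps, a softmax over the encoder axis, then a weighted sum of encoder states. A minimal sketch on plain Python lists, with toy 2-dimensional states instead of the 500-dimensional ones used in the model (function names are hypothetical):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def luong_dot_attention(decoder_states, encoder_states):
    """decoder_states: T_dec x d, encoder_states: T_enc x d (lists of lists)."""
    weights, contexts = [], []
    for h_t in decoder_states:
        # score(h_t, h_s) = h_t . h_s for every encoder timestep s
        scores = [sum(a * b for a, b in zip(h_t, h_s)) for h_s in encoder_states]
        alpha = softmax(scores)
        # context vector: attention-weighted sum of the encoder states
        context = [sum(alpha[s] * encoder_states[s][k] for s in range(len(encoder_states)))
                   for k in range(len(h_t))]
        weights.append(alpha)
        contexts.append(context)
    return weights, contexts


encoder_states = [[1.0, 0.0], [0.0, 1.0]]
decoder_states = [[10.0, 0.0]]                  # strongly aligned with the first encoder state
weights, contexts = luong_dot_attention(decoder_states, encoder_states)
print(weights[0], contexts[0])
```

In the Keras model the context vector is then concatenated with the decoder state and passed through the `tanh` and `softmax` dense layers. The sketch stops before that step and ignores padding masks, as the model above does.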

In [None]:
def training_seq2seq_bilstm_luong(lr, epoch, batch):

    optimizer_learning_rate = lr
    num_epochs = epoch
    num_batch = batch

    model = seq2seq_bilstm_luong(raw_dict, summary_dict, raw_seq, y)
    model.compile(optimizer=Adam(learning_rate = optimizer_learning_rate), loss='categorical_crossentropy', metrics=['acc'])    

    for b in range(len(random_state)):
        enc_train, enc_val, dec_train_in, dec_val_in, dec_train_out, dec_val_out = \
          bootstrap_samples(len(raw_seq), random_state[b], raw_seq, summary_seq, y)
        # training the main model
        model.fit([enc_train, dec_train_in], dec_train_out, 
                  batch_size = num_batch, epochs = num_epochs, validation_data=([enc_val, dec_val_in], dec_val_out))
        print("\n")
    
    return model

In [None]:
## Business
busi_model = training_seq2seq_bilstm_luong(0.0001, 10, 8)

In [None]:
## Entertainment
entt_model = training_seq2seq_bilstm_luong(0.0001, 40, 8)

In [None]:
## Politics
polit_model = training_seq2seq_bilstm_luong(0.0001, 40, 8)

In [None]:
## Sports
sport_model = training_seq2seq_bilstm_luong(0.0001, 20, 8)

In [None]:
## Tech
tech_model = training_seq2seq_bilstm_luong(0.0001, 30, 8)

In [None]:
## Business
## generate encoder and decoder sequence
test_dict, test_seq = process_seq2seq_train_encoder_input(test_raw_token_categorized[0])
test_summary_dict, test_summary_seq = process_seq2seq_train_decoder_input(test_summary_token_categorized[0])
## pad encoder and decoder sequence
test_raw_seq = pad_sequences(test_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
test_summary_seq = pad_sequences(test_summary_seq, maxlen = max([len(x) for x in y]), padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
# y = process_seq2seq_train_decoder_y(summary_seq, summary_dict)
test_y = process_seq2seq_train_decoder_y_modified(test_raw_token_categorized[0], test_summary_token_categorized[0],
                                                  max_length_de = y.shape[1], max_length_de_cat = y.shape[2])
## bert inputs
bert_inputs_test = test_raw_token_categorized[0]

In [None]:
## Entertainment
## generate encoder and decoder sequence
test_dict, test_seq = process_seq2seq_train_encoder_input(test_raw_token_categorized[1])
test_summary_dict, test_summary_seq = process_seq2seq_train_decoder_input(test_summary_token_categorized[1])
## pad encoder and decoder sequence
test_raw_seq = pad_sequences(test_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
test_summary_seq = pad_sequences(test_summary_seq, maxlen = max([len(x) for x in y]), padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
test_y = process_seq2seq_train_decoder_y_modified(test_raw_token_categorized[1], test_summary_token_categorized[1],
                                                  max_length_de = y.shape[1], max_length_de_cat = y.shape[2])
## bert inputs
bert_inputs_test = test_raw_token_categorized[1]

In [None]:
## Politics
## generate encoder and decoder sequence
test_dict, test_seq = process_seq2seq_train_encoder_input(test_raw_token_categorized[2])
test_summary_dict, test_summary_seq = process_seq2seq_train_decoder_input(test_summary_token_categorized[2])
## pad encoder and decoder sequence
test_raw_seq = pad_sequences(test_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
test_summary_seq = pad_sequences(test_summary_seq, maxlen = max([len(x) for x in y]), padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
test_y = process_seq2seq_train_decoder_y_modified(test_raw_token_categorized[2], test_summary_token_categorized[2],
                                                  max_length_de = y.shape[1], max_length_de_cat = y.shape[2])
## bert inputs
bert_inputs_test = test_raw_token_categorized[2]

In [None]:
## Sports
## generate encoder and decoder sequence
test_dict, test_seq = process_seq2seq_train_encoder_input(test_raw_token_categorized[3])
test_summary_dict, test_summary_seq = process_seq2seq_train_decoder_input(test_summary_token_categorized[3])
## pad encoder and decoder sequence
test_raw_seq = pad_sequences(test_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
test_summary_seq = pad_sequences(test_summary_seq, maxlen = max([len(x) for x in y]), padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
test_y = process_seq2seq_train_decoder_y_modified(test_raw_token_categorized[3], test_summary_token_categorized[3],
                                                  max_length_de = y.shape[1], max_length_de_cat = y.shape[2])
## bert inputs
bert_inputs_test = test_raw_token_categorized[3]

In [None]:
## Tech
## generate encoder and decoder sequence
test_dict, test_seq = process_seq2seq_train_encoder_input(test_raw_token_categorized[4])
test_summary_dict, test_summary_seq = process_seq2seq_train_decoder_input(test_summary_token_categorized[4])
## pad encoder and decoder sequence
test_raw_seq = pad_sequences(test_seq, maxlen = max([len(x) for x in raw_seq]), padding='post')
test_summary_seq = pad_sequences(test_summary_seq, maxlen = max([len(x) for x in y]), padding='post')
## dictionary for each document, list of positional sequence, 3d-array one-hot matrix
test_y = process_seq2seq_train_decoder_y_modified(test_raw_token_categorized[4], test_summary_token_categorized[4],
                                                  max_length_de = y.shape[1], max_length_de_cat = y.shape[2])
## bert inputs
bert_inputs_test = test_raw_token_categorized[4]
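
The test sequences above are padded to the maximum length seen in training. The sketch below illustrates what `padding='post'` does to short and over-long sequences in plain Python. It assumes the Keras default `truncating='pre'` for sequences longer than `maxlen`; `pad_post` is a hypothetical helper for illustration, not part of the notebook.

```python
def pad_post(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding='post')."""
    padded = []
    for seq in sequences:
        # Assumed default truncating='pre': keep only the last `maxlen` items.
        seq = list(seq)[-maxlen:]
        # padding='post': append the padding value after the real tokens.
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


short_and_long = [[5, 6], [1, 2, 3, 4]]
padded = pad_post(short_and_long, maxlen=3)
```

A test document with more sentences than any training document would therefore lose its leading sentences rather than its trailing ones.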

In [None]:
def predict_sequence_embedding(model, input_encoder_seq, n_steps_in_seq):
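    """Greedily decode one sequence, feeding each predicted index back as the next decoder input."""
    # start from an all-padding decoder input (index 0 is <PAD>)
    dec_input = np.zeros((1, n_steps_in_seq))
    # position 0 holds the <BOS> tag (index 2) of the generated sequence
    dec_input[0, 0] = 2
    # decode one position per step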
    output = []
    for t in range(n_steps_in_seq):
        dec_output = model.predict([input_encoder_seq, dec_input])
        output.append(dec_output[0,t,:])
        activated_index = np.argmax(dec_output[0,t,:])
        if t + 1 < n_steps_in_seq:
            dec_input[0, t + 1] = activated_index
    
    return np.array(output)
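
The greedy decoding loop above can be tried without a trained network. The sketch below is a minimal illustration under stated assumptions: `ToyModel` is a hypothetical stand-in whose `predict` always favours "previous token + 1", and `greedy_decode` mirrors the loop in `predict_sequence_embedding` but returns the chosen indices instead of the probability rows.

```python
import numpy as np


class ToyModel:
    """Hypothetical stand-in for a trained seq2seq model (not from the notebook)."""

    def __init__(self, n_classes):
        self.n_classes = n_classes

    def predict(self, inputs):
        _, dec_input = inputs
        steps = dec_input.shape[1]
        probs = np.zeros((1, steps, self.n_classes))
        for t in range(steps):
            # Toy rule: the most likely class is the decoder token at t, plus one.
            nxt = (int(dec_input[0, t]) + 1) % self.n_classes
            probs[0, t, nxt] = 1.0
        return probs


def greedy_decode(model, encoder_seq, n_steps, bos_index=2):
    dec_input = np.zeros((1, n_steps))
    dec_input[0, 0] = bos_index  # start with the <BOS> tag
    chosen = []
    for t in range(n_steps):
        probs = model.predict([encoder_seq, dec_input])
        idx = int(np.argmax(probs[0, t, :]))  # most likely class at step t
        chosen.append(idx)
        if t + 1 < n_steps:
            dec_input[0, t + 1] = idx  # feed the choice back in
    return chosen


decoded = greedy_decode(ToyModel(n_classes=6), np.zeros((1, 4)), n_steps=5)
```

Each step calls `predict` on the whole decoder input, so decoding a sequence of `n_steps` positions costs `n_steps` forward passes.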

In [None]:
def evaluate_sequence_batch(n_steps, decoder_model, encoder_text_samples, bert=False, bert_inputs=None):

    # make predictions using the inference models
    # (relies on the globals test_raw_seq and test_y prepared for the current category)
    n_steps_in_seq = n_steps
    inference_seq = []
    for t in range(len(test_raw_seq)):
        if not bert:
            y_estimated = predict_sequence_embedding(decoder_model, test_raw_seq[t].reshape(1, test_raw_seq[t].shape[0]), n_steps_in_seq)
        else:
            y_estimated = predict_sequence_embedding(decoder_model, bert_inputs[t], n_steps_in_seq)
        inference_seq.append(y_estimated)
    
    ## initialize metric lists
    predicted_seq = []
    validated_seq = []
    bleu = []
    bleu_sample = []
    avg_acc = []
    accuracy_per_run = []
    avg_precision = []
    precision_per_run = []
    avg_recall = []
    recall_per_run = []

    for samples in range(len(inference_seq)):
        pred = []
        actual = []

        if not bert:
            reserved = {'<PAD>': 0, '<UNK>': 1, '<BOS>':2, '<EOS>':3 }
        else:
            reserved = {'pad': 0, 'UNK': 1, 'BOS':2, 'EOS':3 }
            
        enc_list = [w for w in encoder_text_samples[samples]]
        enc_dict = {j:i+4 for i,j in enumerate(set(enc_list))}
        enc_dict = {**reserved, **enc_dict}

        ## get evaluations
        acc_score = 0
        total = 0
        prec_score = 0
        recall_score = 0
        
        for p in range(len(inference_seq[samples])):
            total += 1
            try:
                predicted_token_index = np.argmax(inference_seq[samples][p])
                predicted_token = list(enc_dict.keys())[list(enc_dict.values()).index(predicted_token_index)]
            except (ValueError, IndexError):
                # index not in this document's dictionary: fall back to the unknown tag (index 1)
                predicted_token_index = 1
                predicted_token = list(enc_dict.keys())[list(enc_dict.values()).index(predicted_token_index)]
            try:
                validated_token_index = np.argmax(test_y[samples][p])
                validated_token = list(enc_dict.keys())[list(enc_dict.values()).index(validated_token_index)]
            except (ValueError, IndexError):
                # same fallback for the reference sequence
                validated_token_index = 1
                validated_token = list(enc_dict.keys())[list(enc_dict.values()).index(validated_token_index)]

            if predicted_token_index == validated_token_index:
                acc_score += 1
            pred.append(predicted_token)
            actual.append(validated_token)

        predicted_seq.append(pred)
        validated_seq.append(actual)

        ## unique predicted / reference sentences without the position tags
        ## (only the '<...>' forms are listed, so the BERT-style tags 'pad', 'BOS', 'EOS' are not filtered)
        special_tags = ['<EOS>', '<BOS>', '<PAD>']
        pred_unique = [x for x in set(pred) if x not in special_tags]
        actual_unique = [x for x in set(actual) if x not in special_tags]
        overlap = set(pred_unique) & set(actual_unique)

        ## NOTE: sentence_bleu(references, hypothesis) expects token lists;
        ## the joined strings passed here are iterated character by character
        bleu_sample.append(sentence_bleu(", ".join(pred_unique), ", ".join(actual_unique)))
        accuracy = acc_score / total
        accuracy_per_run.append(accuracy)
        ## precision: share of predicted sentences that also appear in the reference summary
        precision = len(overlap) / len(pred_unique) if len(pred_unique) != 0 else 0
        precision_per_run.append(precision)
        ## recall: share of reference sentences that were also predicted
        recall = len(overlap) / len(actual_unique) if len(actual_unique) != 0 else 0
        recall_per_run.append(recall)
      
    avg_acc.append(np.mean(np.array(accuracy_per_run)))
    avg_precision.append(np.mean(np.array(precision_per_run)))
    avg_recall.append(np.mean(np.array(recall_per_run)))
    bleu.append(np.mean(np.array(bleu_sample)))

    return avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq
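
The per-document metrics computed above can be checked by hand on a toy example. The sketch below assumes sentences are represented by short labels (`s1`, `s2`, ...) and that accuracy is position-wise while precision and recall are set-based, as in `evaluate_sequence_batch`; `sentence_set_metrics` is a hypothetical helper used only for illustration.

```python
def sentence_set_metrics(predicted, reference, ignore=("<PAD>", "<BOS>", "<EOS>")):
    # Position-wise accuracy: tags are included, as in the loop above.
    matches = sum(p == r for p, r in zip(predicted, reference))
    accuracy = matches / len(reference) if reference else 0.0

    # Set-based precision and recall: tags removed, order and repeats ignored.
    pred_set = {s for s in predicted if s not in ignore}
    ref_set = {s for s in reference if s not in ignore}
    overlap = pred_set & ref_set
    precision = len(overlap) / len(pred_set) if pred_set else 0.0
    recall = len(overlap) / len(ref_set) if ref_set else 0.0
    return accuracy, precision, recall


predicted = ["<BOS>", "s1", "s4", "<EOS>", "<PAD>"]
reference = ["<BOS>", "s1", "s2", "s3", "<EOS>"]
accuracy, precision, recall = sentence_set_metrics(predicted, reference)
```

Here two of five positions match (accuracy 0.4), one of the two predicted sentences is correct (precision 0.5), and one of the three reference sentences is recovered (recall 1/3). This shows why accuracy can be low while precision or recall stays moderate: a correct sentence placed at the wrong position counts toward the set metrics but not toward accuracy.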

**Business:**

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), busi_model, test_raw_token_categorized[0]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.3217
Test Result: Mean Precision per Token Position for each document = 0.3718
Test Result: Mean Recall per Token Position for each document = 0.2072
Test Result: BLEU score = 0.4628


In [None]:
## print an example inference
n = randint(0,len(inference_seq)-1)
print("### Encoder inputs:")
print("\n".join(test_raw_token_categorized[0][n]))
print("\n")
print("### Decoder outputs:")
print("\n".join(test_summary_token_categorized[0][n]))
print("\n")
print("### Predicted decoder:")
print("\n".join([x for x in list(set(predicted_seq[n])) if x not in ['<EOS>', '<BOS>','<PAD>']]))

### Encoder inputs:
mystery surrounds new yukos owner
the fate of russia's yuganskneftegas - the oil firm sold to a little-known buyer on sunday - is the subject of frantic speculation in moscow.
baikal finance group emerged as the auction winner, agreeing to pay 260.75bn roubles (£4.8bn; $9.4bn).
russia's newspapers claimed that baikal was a front for gas monopoly gazprom, which had been expected to win.
the sale has destroyed yukos, once the owner of yuganskneftegas, said founder mikhail khodorkovsky.
"yuganskneftegas has been sold in the best traditions of the 90s.
the authorities have made themselves a wonderful christmas present - russia's most efficient oil company has been destroyed," the interfax news agency quoted mr khodorkovsky as saying via his lawyers.
gazprom had been expected to win the auction but is thought to have failed to get finance for the deal after a us court injunction barred it from taking part.
last week, yukos filed for chapter 11 bankruptcy protection in th

**Entertainment:**

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), entt_model, test_raw_token_categorized[1]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.0612
Test Result: Mean Precision per Token Position for each document = 0.289
Test Result: Mean Recall per Token Position for each document = 0.6172
Test Result: BLEU score = 0.4894


In [None]:
## print an example inference
n = randint(0,len(inference_seq)-1)
print("### Encoder inputs:")
print("\n".join(test_raw_token_categorized[1][n]))
print("\n")
print("### Decoder outputs:")
print("\n".join(test_summary_token_categorized[1][n]))
print("\n")
print("### Predicted decoder:")
print("\n".join([x for x in list(set(predicted_seq[n])) if x not in ['<EOS>', '<BOS>','<PAD>']]))

### Encoder inputs:
ray dvd beats box office takings
oscar-nominated film biopic ray has surpassed its us box office takings with a combined tally of $80m (£43m) from dvd and video sales and rentals.
ray's success on dvd outstripped its $74m (£40m) us box office total, earning more than $40m (£22m) on the first day of the dvd's release alone.
ray has been nominated in six oscar categories including best film and best actor for jamie foxx.
the film recounts the life of blues singer ray charles, who died in 2004. in its first week on home entertainment release the film was the number one selling dvd, with the limited edition version coming in at number 11. sony horror film the grudge, starring michelle gellar, was the us' second best-selling dvd, with jennifer lopez and richard gere's romantic comedy shall we dance?
at number three.
foxx's critically acclaimed performance as ray has already earned him a screen actors guild award for best actor, as well as a prestigious golden globe.
ray 

**Politics:**

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), polit_model, test_raw_token_categorized[2]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.0569
Test Result: Mean Precision per Token Position for each document = 0.322
Test Result: Mean Recall per Token Position for each document = 0.7384
Test Result: BLEU score = 0.453


In [None]:
## print an example inference
n = randint(0,len(inference_seq)-1)
print("### Encoder inputs:")
print("\n".join(test_raw_token_categorized[2][n]))
print("\n")
print("### Decoder outputs:")
print("\n".join(test_summary_token_categorized[2][n]))
print("\n")
print("### Predicted decoder:")
print("\n".join([x for x in list(set(predicted_seq[n])) if x not in ['<EOS>', '<BOS>','<PAD>']]))

### Encoder inputs:
anti-terror plan faces first test
plans to allow home secretary charles clarke to place terror suspects under house arrest without trial are set for their first real test in parliament.
tories, lib dems and some labour mps are poised to vote against the plans.
mr clarke says the powers are needed to counter terror threats.
opponents say only judges, not politicians, should be able to order detention of uk citizens.
the government is expected to win wednesday's vote in the commons, but faces a battle in the house of lords.
the prevention of terrorism bill was published on tuesday.
it proposes "control orders", which would mean house arrest in the most serious cases, and curfews, electronic tagging and limits on telephone and internet access for other suspects.
the two opposition parties are particularly worried that the control orders would initially be imposed on the say-so of the home secretary, rather than a judge.
tory shadow home secretary david davis warned of 

**Sports:** 

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), sport_model, test_raw_token_categorized[3]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.0583
Test Result: Mean Precision per Token Position for each document = 0.312
Test Result: Mean Recall per Token Position for each document = 0.4673
Test Result: BLEU score = 0.4674


In [None]:
## print an example inference
n = randint(0,len(inference_seq)-1)
print("### Encoder inputs:")
print("\n".join(test_raw_token_categorized[3][n]))
print("\n")
print("### Decoder outputs:")
print("\n".join(test_summary_token_categorized[3][n]))
print("\n")
print("### Predicted decoder:")
print("\n".join([x for x in list(set(predicted_seq[n])) if x not in ['<EOS>', '<BOS>','<PAD>']]))

### Encoder inputs:
britain boosted by holmes double
athletics fans endured a year of mixed emotions in 2004 as stunning victories went hand-in-hand with disappointing defeats and more drugs scandals.
kelly holmes finally fulfilled her potential by storming to double gold on the track at the olympic games.
holmes helped erase the gloom hanging over team gb after their biggest medal hope, paula radcliffe, dropped out of the marathon and then the 10,000m.
britain's men's 4x100m relay team also did their bit by taking a shock gold.
holmes had started the year in disappointing style, falling over in the final of 1500m at the world indoor championships where she was favourite.
her olympic build-up was clouded by self doubt but that proved unfounded as she overhauled rival maria mutola to win the 800m - her first global title.
just five days later, the 34-year-old made it double gold in the 1500m.
it was the first time in 84 years a briton has achieved the olympic middle-distance double.
whi

**Technology:**

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), tech_model, test_raw_token_categorized[4]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.0801
Test Result: Mean Precision per Token Position for each document = 0.3266
Test Result: Mean Recall per Token Position for each document = 0.3471
Test Result: BLEU score = 0.4107


In [None]:
## print an example inference
n = randint(0,len(inference_seq)-1)
print("### Encoder inputs:")
print("\n".join(test_raw_token_categorized[4][n]))
print("\n")
print("### Decoder outputs:")
print("\n".join(test_summary_token_categorized[4][n]))
print("\n")
print("### Predicted decoder:")
print("\n".join([x for x in list(set(predicted_seq[n])) if x not in ['<EOS>', '<BOS>','<PAD>']]))

### Encoder inputs:
'no re-draft' for eu patent law
a proposed european law on software patents will not be re-drafted by the european commission (ec) despite requests by meps.
the law is proving controversial and has been in limbo for a year.
some major tech firms say it is needed to protect inventions, while others fear it will hurt smaller tech firms the ec says the council of ministers will adopt a draft version that was agreed upon last may but said it would review "all aspects of the directive".
the directive is intended to offer patent protection to inventions that use software to achieve their effect, in other words, "computer implemented invention".
in a letter, ec president josé manuel barroso told the president of the european parliament, josep borrell, that the commission "did not intend to refer a new proposal to the parliament and the council (of ministers)" as it had supported the agreement reached by ministers in may 2004.
if the european council agrees on the draf

In [None]:
!pip install tensorflow_text

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

In [None]:
class seq2seq_with_bert:

    def __init__(self, encoder_text, max_length_en, decoder_dict, decoder_vector, seq_length):
        self.decoder_dict = decoder_dict
        self.decoder_vector =  decoder_vector
        self.seq_length = seq_length
        self.len_de = len(decoder_dict)
        self.max_length_de = max([len(x) for x in decoder_vector])
        self.encoder_text = encoder_text
        self.max_length_en = max_length_en

    def get_bert_preprocessor_outputs(self):
        ## loaded model
        tfhub_albert_preprocess = "https://tfhub.dev/tensorflow/albert_en_preprocess/3"
        albert_preprocess = hub.load(tfhub_albert_preprocess)
        ## padding the texts to same length
        len_en = [self.max_length_en - len(d) for d in self.encoder_text]
        pad_encoder_text = [[" ".join(self.encoder_text[x] + ['pad'] * len_en[x])] for x in range(len(self.encoder_text))]
        ## get outputs preprocessed
        preprocess_encoder_text = [albert_preprocess(x) for x in pad_encoder_text]
        return preprocess_encoder_text

    def get_bert_encoder_embeddings(self, preprocess_encoder_text):
        ## loaded model
        tfhub_bert_encoder = "https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/2"
        ggelu_bert_encoder = hub.load(tfhub_bert_encoder)
        bert_encoder_outputs = [ggelu_bert_encoder(x)['sequence_output'] for x in preprocess_encoder_text]
        return bert_encoder_outputs

    def build_model(self):

        ## BERT encoder
        bert_inputs = Input(shape=(None, 768))

        ## Decoder 2-layer stacked LSTM
        decoder_inputs = Input(shape=(None, ))
        decoder_embed = Embedding(input_dim = self.len_de, output_dim = 768)(decoder_inputs)
        decoder_LSTM = LSTM(units=768, return_state=True, return_sequences=True)
        decoder_LSTM_layer = decoder_LSTM(decoder_embed)
        decoder_LSTM2 = LSTM(units=768, return_state=True, return_sequences=True)
        decoder_hidden_vec, dec_state_last_h, dec_state_last_c = decoder_LSTM2(decoder_LSTM_layer)
        
        ## Attention mechanism
        attention_score = Dot([2,2])([decoder_hidden_vec, bert_inputs])
        attention_weight = Activation('softmax')(attention_score)
        context = Dot([2,1])([attention_weight, bert_inputs])
        decoder_outputs_combined_context = Concatenate()([context, decoder_hidden_vec])
        hidden_state_outputs = TimeDistributed(Dense((self.max_length_de + 2) * 2, activation='tanh'))(decoder_outputs_combined_context)
        outputs = TimeDistributed(Dense(self.max_length_de + 2, activation='softmax'))(hidden_state_outputs)

        model = tf.keras.Model([bert_inputs, decoder_inputs], outputs)

        return model
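
The attention mechanism in `build_model` is a dot-product (Luong-style) attention between the decoder hidden states and the BERT sequence output. The NumPy sketch below illustrates the same three steps (score, softmax, context) for a single example without the batch axis; the array sizes are arbitrary assumptions chosen for readability, not the notebook's 768-dimensional vectors.

```python
import numpy as np


def luong_dot_attention(decoder_states, encoder_states):
    # scores[t, s] = decoder_states[t] . encoder_states[s]   (Dot over the feature axis)
    scores = decoder_states @ encoder_states.T
    # softmax over the encoder positions (last axis), shifted for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # context[t] = weighted sum of encoder states
    context = weights @ encoder_states
    return weights, context


rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(3, 8))   # 3 decoder steps, 8 features
encoder_states = rng.normal(size=(5, 8))   # 5 encoder positions, 8 features
weights, context = luong_dot_attention(decoder_states, encoder_states)
# the model concatenates the context with the decoder state before the dense layers
combined = np.concatenate([context, decoder_states], axis=-1)
```

Because the score is a plain dot product, the decoder hidden size has to equal the encoder feature size, which is why the decoder LSTMs use 768 units.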

In [None]:
def training_seq2seq_with_bert(lr, epoch, batch):

    optimizer_learning_rate = lr
    num_epochs = epoch
    num_batch = batch

    bert = seq2seq_with_bert(bert_inputs, len(raw_seq[0]), summary_dict, y, 128)
    bert_preprocessed = bert.get_bert_preprocessor_outputs()
    bert_encoded = bert.get_bert_encoder_embeddings(bert_preprocessed)
    bert_seq2seq = bert.build_model()
    bert_seq2seq.compile(optimizer=Adam(learning_rate = optimizer_learning_rate), loss='categorical_crossentropy', metrics=['acc'])

    for b in range(len(random_state)):
        enc_train, enc_val, dec_train_in, dec_val_in, dec_train_out, dec_val_out = \
          bootstrap_samples(len(bert_encoded), random_state[b], 
                            [tf.reshape(x, [128,768]) for x in bert_encoded], summary_seq, y)
        # training the main model
        bert_seq2seq.fit([enc_train, dec_train_in], dec_train_out, 
                         batch_size = num_batch, epochs = num_epochs, validation_data=([enc_val, dec_val_in], dec_val_out))
        print("\n")

    return bert_seq2seq

In [None]:
## Business
business_bert_seq2seq = training_seq2seq_with_bert(0.00001, 20, 2)

In [None]:
## Entertainment
entertain_bert_seq2seq = training_seq2seq_with_bert(0.00001, 40, 2)

In [None]:
## Politics
politics_bert_seq2seq = training_seq2seq_with_bert(0.00001, 40, 2)

In [None]:
## Sports
sports_bert_seq2seq = training_seq2seq_with_bert(0.00001, 20, 2)

In [None]:
## Tech
tech_bert_seq2seq = training_seq2seq_with_bert(0.00001, 30, 2)

In [None]:
## Business
## Encoding vectors for the testing set
business_bert_test = seq2seq_with_bert(bert_inputs_busi_test, len(test_raw_seq[0]), test_summary_dict, test_y, 128)
business_bert_preprocessed_test = business_bert_test.get_bert_preprocessor_outputs()
business_bert_encoded_test = business_bert_test.get_bert_encoder_embeddings(business_bert_preprocessed_test)

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), business_bert_seq2seq, test_raw_token_categorized[0], 
    bert=True, bert_inputs=[x for x in business_bert_encoded_test]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.3971
Test Result: Mean Precision per Token Position for each document = 0.7884
Test Result: Mean Recall per Token Position for each document = 0.4582
Test Result: BLEU score = 0.5418


In [None]:
## Entertainment
## Encoding vectors for the testing set
entertain_bert_test = seq2seq_with_bert(bert_inputs_entt_test, len(test_raw_seq[0]), test_summary_dict, test_y, 128)
entertain_bert_preprocessed_test = entertain_bert_test.get_bert_preprocessor_outputs()
entertain_bert_encoded_test = entertain_bert_test.get_bert_encoder_embeddings(entertain_bert_preprocessed_test)

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), entertain_bert_seq2seq, test_raw_token_categorized[1], 
    bert=True, bert_inputs=[x for x in entertain_bert_encoded_test]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.0939
Test Result: Mean Precision per Token Position for each document = 0.6401
Test Result: Mean Recall per Token Position for each document = 0.5576
Test Result: BLEU score = 0.4933


In [None]:
## Politics
## Encoding vectors for the testing set
politics_bert_test = seq2seq_with_bert(bert_inputs_polit_test, len(test_raw_seq[0]), test_summary_dict, test_y, 128)
politics_bert_preprocessed_test = politics_bert_test.get_bert_preprocessor_outputs()
politics_bert_encoded_test = politics_bert_test.get_bert_encoder_embeddings(politics_bert_preprocessed_test)

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), politics_bert_seq2seq, test_raw_token_categorized[2], 
    bert=True, bert_inputs=[x for x in politics_bert_encoded_test]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.0668
Test Result: Mean Precision per Token Position for each document = 0.5675
Test Result: Mean Recall per Token Position for each document = 0.6667
Test Result: BLEU score = 0.4923


In [None]:
## Sports
## Encoding vectors for the testing set
sports_bert_test = seq2seq_with_bert(bert_inputs_sport_test, len(test_raw_seq[0]), test_summary_dict, test_y, 128)
sports_bert_preprocessed_test = sports_bert_test.get_bert_preprocessor_outputs()
sports_bert_encoded_test = sports_bert_test.get_bert_encoder_embeddings(sports_bert_preprocessed_test)

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), sports_bert_seq2seq, test_raw_token_categorized[3], 
    bert=True, bert_inputs=[x for x in sports_bert_encoded_test]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.1902
Test Result: Mean Precision per Token Position for each document = 0.485
Test Result: Mean Recall per Token Position for each document = 0.4583
Test Result: BLEU score = 0.4777


In [None]:
## Tech
## Encoding vectors for the testing set
tech_bert_test = seq2seq_with_bert(bert_inputs_tech_test, len(test_raw_seq[0]), test_summary_dict, test_y, 128)
tech_bert_preprocessed_test = tech_bert_test.get_bert_preprocessor_outputs()
tech_bert_encoded_test = tech_bert_test.get_bert_encoder_embeddings(tech_bert_preprocessed_test)

In [None]:
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), tech_bert_seq2seq, test_raw_token_categorized[4], 
    bert=True, bert_inputs=[x for x in tech_bert_encoded_test]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.1856
Test Result: Mean Precision per Token Position for each document = 0.7495
Test Result: Mean Recall per Token Position for each document = 0.3611
Test Result: BLEU score = 0.4504


In [None]:
class TransformerEncoder(tf.keras.layers.Layer):

    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)

        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads

        self.attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = tf.keras.Sequential(
            [tf.keras.layers.Dense(dense_dim, activation="relu"), 
             tf.keras.layers.Dense(embed_dim)]
        )
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()

        self.supports_masking = True

    def call(self, inputs, mask=None):

        attention_output = self.attention(query=inputs, value=inputs, key=inputs)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)

        return self.layernorm_2(proj_input + proj_output)
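
The encoder layer above follows the usual pattern of self-attention, residual connection, layer normalization, and a position-wise feed-forward network. The NumPy sketch below illustrates that data flow under simplifying assumptions: a single attention head, no output projection, no biases, and a layer normalization without learned scale and offset. All weight names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # normalize each position over its feature axis
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def encoder_block(x, w_q, w_k, w_v, w_1, w_2):
    # single-head scaled dot-product self-attention
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    # residual connection + normalization (layernorm_1 in the class above)
    h = layer_norm(x + attn)
    # feed-forward projection with ReLU (dense_proj in the class above)
    ff = np.maximum(h @ w_1, 0.0) @ w_2
    # second residual connection + normalization (layernorm_2)
    return layer_norm(h + ff)


rng = np.random.default_rng(0)
steps, embed_dim, dense_dim = 3, 4, 8
x = rng.normal(size=(steps, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(embed_dim, embed_dim)) for _ in range(3))
w_1 = rng.normal(size=(embed_dim, dense_dim))
w_2 = rng.normal(size=(dense_dim, embed_dim))
encoded = encoder_block(x, w_q, w_k, w_v, w_1, w_2)
```

The output keeps the input shape, so several such blocks could be stacked.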

In [None]:
class PositionalEmbedding(tf.keras.layers.Layer):
  
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super(PositionalEmbedding, self).__init__(**kwargs)

        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

        self.token_embeddings = tf.keras.layers.Embedding(input_dim = vocab_size, output_dim = embed_dim)
        self.position_embeddings = tf.keras.layers.Embedding(input_dim = sequence_length, output_dim = embed_dim)

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)
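
`PositionalEmbedding` adds a learned vector per token index to a learned vector per position, and marks index 0 as padding. The NumPy sketch below illustrates the lookup and the mask with random tables standing in for the trained embedding weights; the sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sequence_length, embed_dim = 10, 6, 4
token_table = rng.normal(size=(vocab_size, embed_dim))         # stands in for token_embeddings
position_table = rng.normal(size=(sequence_length, embed_dim))  # stands in for position_embeddings


def positional_embedding(token_ids):
    positions = np.arange(token_ids.shape[-1])
    # the position vectors broadcast over the batch axis
    return token_table[token_ids] + position_table[positions]


token_ids = np.array([[2, 7, 3, 0, 0]])   # one sequence, two trailing <PAD> entries
embedded = positional_embedding(token_ids)
padding_mask = token_ids != 0              # same rule as compute_mask
```

The position table has `sequence_length` rows, so inputs longer than that length cannot be embedded.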

In [None]:
class TransformerDecoder(tf.keras.layers.Layer):

    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)

        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads

        self.attention_1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = tf.keras.Sequential(
            [tf.keras.layers.Dense(latent_dim, activation="relu"), 
             tf.keras.layers.Dense(embed_dim)]
        )
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        self.layernorm_3 = tf.keras.layers.LayerNormalization()

        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):

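        # mask future positions so each step attends only to itself and earlier steps
        causal_mask = self.get_causal_attention_mask(inputs)
        attention_output_1 = self.attention_1(query=inputs, value=inputs, key=inputs, attention_mask=causal_mask)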
        out_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(query=out_1, value=encoder_outputs, key=encoder_outputs)
        out_2 = self.layernorm_2(out_1 + attention_output_2)
        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat([tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)
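
The mask built by `get_causal_attention_mask` is a lower-triangular matrix repeated once per batch element: row `i` has ones in columns `0..i`, so position `i` may attend only to itself and to earlier positions. A NumPy sketch of the same construction, with `causal_attention_mask` as a hypothetical illustrative helper:

```python
import numpy as np


def causal_attention_mask(batch_size, sequence_length):
    i = np.arange(sequence_length)[:, np.newaxis]   # query positions (rows)
    j = np.arange(sequence_length)                  # key positions (columns)
    mask = (i >= j).astype("int32")                 # 1 where the key is not in the future
    # repeat the same mask for every element of the batch
    return np.tile(mask[np.newaxis, :, :], (batch_size, 1, 1))


mask = causal_attention_mask(batch_size=2, sequence_length=4)
```

For a sequence of length 4, `mask[0]` has the rows `1 0 0 0`, `1 1 0 0`, `1 1 1 0`, `1 1 1 1`.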

In [None]:
def Transformer_Model():

    vocab_size_encoder = len(raw_dict)
    vocab_size_decoder = len(summary_dict)
    sequence_length_encoder = raw_seq.shape[1]
    sequence_length_decoder = summary_seq.shape[1]
    embed_dim = round(vocab_size_encoder // 20, -1)
    latent_dim = round(vocab_size_encoder // 2, -3)
    num_heads = 8

    encoder_inputs = tf.keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
    encoder_x = PositionalEmbedding(sequence_length_encoder, vocab_size_encoder , embed_dim)(encoder_inputs)
    encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(encoder_x)

    decoder_inputs = tf.keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
    decoder_x = PositionalEmbedding(sequence_length_decoder, vocab_size_decoder, embed_dim)(decoder_inputs)
    decoder_x = TransformerDecoder(embed_dim, latent_dim, num_heads)(decoder_x, encoder_outputs)
    decoder_outputs = tf.keras.layers.Dense((sequence_length_decoder + 2) * 2, activation="relu")(decoder_x)
    decoder_outputs = tf.keras.layers.Dense(sequence_length_decoder + 2, activation="softmax")(decoder_outputs)

    transformer = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs, name="transformer")

    return transformer
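
`Transformer_Model` derives `embed_dim` and `latent_dim` from the encoder vocabulary size with integer division followed by `round` with a negative `ndigits`. The sketch below reproduces only that arithmetic; the function name `transformer_sizes` and the vocabulary sizes 4321 and 80 are hypothetical values for illustration, not figures from the dataset.

```python
def transformer_sizes(vocab_size_encoder):
    """Reproduce the embed_dim / latent_dim arithmetic from Transformer_Model."""
    # About 1/20 of the vocabulary, rounded to the nearest 10
    embed_dim = round(vocab_size_encoder // 20, -1)
    # About 1/2 of the vocabulary, rounded to the nearest 1000
    latent_dim = round(vocab_size_encoder // 2, -3)
    return embed_dim, latent_dim

# Hypothetical vocabulary sizes, not taken from the dataset
print(transformer_sizes(4321))  # (220, 2000)
print(transformer_sizes(80))    # (0, 0)
```

Under this rounding a small vocabulary (below roughly 1,000 entries for `latent_dim`) yields a size of 0, so it may be worth checking both values before building the layers.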

In [None]:
def training_Transformer(lr, epoch, batch):
    # lr: Adam learning rate; epoch: epochs per bootstrap split; batch: batch size

    tfm = Transformer_Model()
    tfm.compile(optimizer=Adam(learning_rate=lr), loss='categorical_crossentropy', metrics=['acc'])

    # The same model keeps training across every bootstrap split
    for seed in random_state:
        enc_train, enc_val, dec_train_in, dec_val_in, dec_train_out, dec_val_out = \
          bootstrap_samples(len(raw_seq), seed, raw_seq, summary_seq, y)
        # training the main model
        tfm.fit([enc_train, dec_train_in], dec_train_out,
                batch_size=batch, epochs=epoch, validation_data=([enc_val, dec_val_in], dec_val_out))
        print("\n")

    return tfm

In [None]:
## Business
busi_tfm = training_Transformer(0.00001, 10, 4)

In [None]:
## Entertainment
entt_tfm = training_Transformer(0.00001, 10, 4)

In [None]:
## Politics
polit_tfm = training_Transformer(0.00001, 10, 4)

In [None]:
## Sports
sport_tfm = training_Transformer(0.00001, 10, 4)

In [None]:
## Tech
tech_tfm = training_Transformer(0.00001, 10, 4)

In [None]:
## Business
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), busi_tfm, test_raw_token_categorized[0]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.3654
Test Result: Mean Precision per Token Position for each document = 0.363
Test Result: Mean Recall per Token Position for each document = 0.6087
Test Result: BLEU score = 0.4963
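
For reference, precision and recall as described in the introduction can be computed over sentence position labels. The sketch below assumes a set-based comparison between predicted and reference positions. It is not the notebook's `evaluate_sequence_batch`, whose implementation is defined elsewhere, and the function name `precision_recall` and the sample positions are hypothetical.

```python
def precision_recall(predicted, reference):
    """Set-based precision and recall over sentence position labels."""
    predicted_set, reference_set = set(predicted), set(reference)
    hits = len(predicted_set & reference_set)
    # Share of predicted sentences that also appear in the reference summary
    precision = hits / len(predicted_set) if predicted_set else 0.0
    # Share of reference sentences that the prediction covers
    recall = hits / len(reference_set) if reference_set else 0.0
    return precision, recall

# Hypothetical sentence positions, not taken from the dataset
predicted = [1, 3, 5, 7]
reference = [1, 3, 4]
precision, recall = precision_recall(predicted, reference)
print(round(precision, 4), round(recall, 4))  # 0.5 0.6667
```

Under this reading, recall well above precision would suggest that the model tends to select more sentences than the reference summary contains.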


In [None]:
## Entertainment
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), entt_tfm, test_raw_token_categorized[1]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.0776
Test Result: Mean Precision per Token Position for each document = 0.3446
Test Result: Mean Recall per Token Position for each document = 0.573
Test Result: BLEU score = 0.4895


In [None]:
## Politics
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), polit_tfm, test_raw_token_categorized[2]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.0875
Test Result: Mean Precision per Token Position for each document = 0.4047
Test Result: Mean Recall per Token Position for each document = 0.7103
Test Result: BLEU score = 0.4945


In [None]:
## Sports
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), sport_tfm, test_raw_token_categorized[3]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.2375
Test Result: Mean Precision per Token Position for each document = 0.3132
Test Result: Mean Recall per Token Position for each document = 0.8737
Test Result: BLEU score = 0.4897


In [None]:
## Tech
avg_acc, avg_precision, avg_recall, bleu, inference_seq, predicted_seq = evaluate_sequence_batch(
    len(summary_seq[0]), tech_tfm, test_raw_token_categorized[4]
)

In [None]:
print("Test Result: Mean Accuracy per Token Position for each document = " + str(round(avg_acc[0], ndigits=4)))
print("Test Result: Mean Precision per Token Position for each document = " + str(round(avg_precision[0], ndigits=4)))
print("Test Result: Mean Recall per Token Position for each document = " + str(round(avg_recall[0], ndigits=4)))
print("Test Result: BLEU score = " + str(round(bleu[0], ndigits=4)))

Test Result: Mean Accuracy per Token Position for each document = 0.1091
Test Result: Mean Precision per Token Position for each document = 0.3617
Test Result: Mean Recall per Token Position for each document = 0.7109
Test Result: BLEU score = 0.4278
