#**TLDR** (Transfer Learning: Deep Recap) by *Ramazan Ramazanov*

TLDR: BERT Encoder for Dialogue Summarization

##Introduction

Businesses record customer service calls for "quality and training" purposes, and speech-to-text technology is now mostly reliable. From my professional background, I know how important it is to monitor these calls to ensure good customer service, proper training and compliance. Most calls are lengthy and filled with template greetings and irrelevant chit-chat. Combined with thousands of calls per day, this means quality checks cover less than 1% of total call volume. That's why I decided to tackle the problem of dialogue summarization.

We briefly covered the Transformer architecture but never explored it in detail. Here I take the Transformer from the TensorFlow language-understanding tutorial **[1]** and train it for dialogue summarization. Then I replace its encoder with a BERT model while keeping the decoder the same. BERT has achieved state-of-the-art results on a wide range of NLP tasks, and I wanted to use its power to summarize dialogues. Finally, I use a pre-trained T5 transformer to compare the results.

The dataset I'm using is the SAMSum corpus **[2]**, "A Human-annotated Dialogue Dataset for Abstractive Summarization". Abstractive summarization is an interpretative technique: it can paraphrase the text instead of quoting it directly. This differs from extractive summarization, which highlights and quotes essential parts of the text.

For both models, tokenizers from Hugging Face were used **[3]**. The BERT model is also loaded through the *transformers* package. As mentioned below, a small BERT model was used to avoid out-of-memory issues and long runtimes.

##Preprocessing

We start by installing the tensorflow_text package and transformers from Hugging Face:

In [None]:
pip install -q tensorflow_text

In [None]:
pip install transformers

Importing packages

In [None]:
import collections
import logging
import os
import pathlib
import re
import string
import sys
import time
import json

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
import tensorflow_text as text
from transformers import BertTokenizer, TFBertModel, BertConfig

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Open the SAMSum dataset. It ships training, validation and test splits as JSON files. To save memory, we only load the training and test splits into pandas DataFrames:

In [None]:
data_dir = '/content/gdrive/My Drive/3546 Project/SAMSum/'

with open(data_dir + 'train.json', encoding="utf8") as f:
    train = json.load(f)
train_df = pd.DataFrame(train)
del train

with open(data_dir + 'test.json', encoding="utf8") as f:
    test = json.load(f)
test_df = pd.DataFrame(test)
del test

'''
with open(data_dir + 'val.json', encoding="utf8") as f:
    val = json.load(f)
val_df = pd.DataFrame(val)
del val
'''

Some pre-processing:

In [None]:
#Drop id column from the dataframe, and remove end-of-line characters
train_df.drop(columns='id', inplace=True)
train_df['summary'] = train_df['summary'].str.replace('\r', ' ')
train_df['summary'] = train_df['summary'].str.replace('\n', ' ')
train_df['dialogue'] = train_df['dialogue'].str.replace('\r', ' ')
train_df['dialogue'] = train_df['dialogue'].str.replace('\n', ' ')

test_df.drop(columns='id', inplace=True)
test_df['summary'] = test_df['summary'].str.replace('\r', ' ')
test_df['summary'] = test_df['summary'].str.replace('\n', ' ')
test_df['dialogue'] = test_df['dialogue'].str.replace('\r', ' ')
test_df['dialogue'] = test_df['dialogue'].str.replace('\n', ' ')

'''
val_df['summary'] = val_df['summary'].str.replace('\r', ' ')
val_df['summary'] = val_df['summary'].str.replace('\n', ' ')
val_df['dialogue'] = val_df['dialogue'].str.replace('\r', ' ')
val_df['dialogue'] = val_df['dialogue'].str.replace('\n', ' ')
'''

We can use the pre-trained BERT tokenizer from Hugging Face:

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Define a function which will create tokenized dialogues and their corresponding summaries as tensors:

In [None]:
def tokenize_pairs(dialg, summ):
    dialg = tf.dtypes.cast(tokenizer(dialg, padding=True, truncation=True, return_tensors="tf")['input_ids'], tf.int64)
    summ = tf.dtypes.cast(tokenizer(summ, padding=True, truncation=True, return_tensors="tf")['input_ids'], tf.int64)
    return dialg, summ
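
To make the effect of `padding=True` and `truncation=True` more concrete, here is a minimal pure-Python sketch of what happens to a batch of token-id lists. The helper `pad_and_truncate` and the toy ids are hypothetical and only mimic the behaviour; the real work happens inside the Hugging Face tokenizer, which also keeps the closing special token when truncating (this sketch ignores that detail).

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    return [seq + [pad_id] * (longest - len(seq)) for seq in truncated]

# Toy ids: 101 and 102 stand in for the [CLS] and [SEP] special tokens
batch = [[101, 7, 8, 9, 102], [101, 5, 102]]
padded = pad_and_truncate(batch, max_length=4)
print(padded)
```

Padding with id 0 matters later on: the masking functions treat every 0 as a position to ignore.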

Getting the final tokenized dataset ready:

In [None]:
#Divide the tokenized tensors into batches of size 50
BATCH_SIZE = 50

dialg, summ = tokenize_pairs(train_df.iloc[:, 1].to_list(), train_df.iloc[:, 0].to_list())

#Combine tokens and make them into batches
train_ds = tf.data.Dataset.from_tensor_slices((dialg, summ))
train_ds = train_ds.batch(BATCH_SIZE)

#Delete dataframes to save memory
del train_df
del dialg
del summ

##Building the Transformer Architecture

As mentioned previously, I modified the Transformer model from the TensorFlow tutorials **[1]**.

For positional encoding, we build our custom functions:

In [None]:
def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

In [None]:
def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

  pos_encoding = angle_rads[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)
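
As a quick sanity check of the positional encoding, the same computation can be sketched in plain NumPy. The function name `positional_encoding_np` is made up for this illustration; the sizes 600 and 128 match the values used later in this notebook.

```python
import numpy as np

def positional_encoding_np(position, d_model):
    pos = np.arange(position)[:, np.newaxis]   # (position, 1)
    i = np.arange(d_model)[np.newaxis, :]      # (1, d_model)
    angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # even indices: sine
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # odd indices: cosine
    return angle_rads[np.newaxis, ...]         # (1, position, d_model)

pe = positional_encoding_np(600, 128)
print(pe.shape)
print(pe[0, 0, :4])  # position 0: sin(0) on even indices, cos(0) on odd ones
```

Every value lies in [-1, 1], and each position gets a distinct pattern that is simply added to the token embeddings.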

Creating functions to mask padding and hide future tokens in a sequence:

In [None]:
def create_padding_mask(seq):
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

  # add extra dimensions to add the padding
  # to the attention logits.
  return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

In [None]:
def create_look_ahead_mask(size):
  mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
  return mask  # (seq_len, seq_len)
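
The two masks are easier to picture on a tiny input. Below is a NumPy sketch of the same logic (the `_np` function names are hypothetical); a value of 1 marks a position that attention should ignore.

```python
import numpy as np

def padding_mask_np(seq):
    # 1.0 wherever the token id equals the padding id 0
    mask = (np.asarray(seq) == 0).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]  # (batch_size, 1, 1, seq_len)

def look_ahead_mask_np(size):
    # Strictly upper triangle: position i may not attend to positions j > i
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

pad = padding_mask_np([[101, 7, 102, 0, 0]])
ahead = look_ahead_mask_np(3)
print(pad[0, 0, 0])
print(ahead)
```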

For q (query), k (key) and v (value), create dot product attention:

In [None]:
def scaled_dot_product_attention(q, k, v, mask):
  """Calculate the attention weights.
  q, k, v must have matching leading dimensions.
  k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
  The mask has different shapes depending on its type(padding or look ahead)
  but it must be broadcastable for addition.

  Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.

  Returns:
    output, attention_weights
  """

  matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

  # scale matmul_qk
  dk = tf.cast(tf.shape(k)[-1], tf.float32)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

  # add the mask to the scaled tensor.
  if mask is not None:
    scaled_attention_logits += (mask * -1e9)

  # softmax is normalized on the last axis (seq_len_k) so that the scores
  # add up to 1.
  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

  output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

  return output, attention_weights
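
For intuition, here is a small NumPy sketch of the same computation on a single, unbatched example. The function `attention_np` and the random inputs are assumptions made for illustration; it shows that each row of attention weights sums to 1 and that a masked key receives effectively zero weight.

```python
import numpy as np

def attention_np(q, k, v, mask=None):
    logits = q @ k.T / np.sqrt(k.shape[-1])      # (seq_len_q, seq_len_k)
    if mask is not None:
        logits = logits + mask * -1e9            # push masked keys towards -inf
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 5))
mask = np.array([[0.0, 0.0, 1.0]])               # hide the third key
out, w = attention_np(q, k, v, mask)
print(w.round(3))
```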

Multi-head attention generalizes the dot-product attention above: q, k and v are projected, split into several heads that attend in parallel, and the results are concatenated:

In [None]:
class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadAttention, self).__init__()
    self.num_heads = num_heads
    self.d_model = d_model

    assert d_model % self.num_heads == 0

    self.depth = d_model // self.num_heads

    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)

    self.dense = tf.keras.layers.Dense(d_model)

  def split_heads(self, x, batch_size):
    """Split the last dimension into (num_heads, depth).
    Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
    return tf.transpose(x, perm=[0, 2, 1, 3])

  def call(self, v, k, q, mask):
    batch_size = tf.shape(q)[0]

    q = self.wq(q)  # (batch_size, seq_len, d_model)
    k = self.wk(k)  # (batch_size, seq_len, d_model)
    v = self.wv(v)  # (batch_size, seq_len, d_model)

    q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
    k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
    v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

    # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
    # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
    scaled_attention, attention_weights = scaled_dot_product_attention(
        q, k, v, mask)

    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

    concat_attention = tf.reshape(scaled_attention,
                                  (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

    output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

    return output, attention_weights
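
The head splitting is just a reshape followed by a transpose. The NumPy sketch below uses made-up helper names and the sizes chosen later in this notebook (d_model = 128, 8 heads, so 16 dimensions per head); it shows the shapes involved and that merging the heads again restores the original tensor.

```python
import numpy as np

def split_heads_np(x, num_heads):
    batch, seq_len, d_model = x.shape
    depth = d_model // num_heads
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
    return x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

def merge_heads_np(x):
    batch, num_heads, seq_len, depth = x.shape
    # (batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * depth)

x = np.arange(2 * 5 * 128, dtype=np.float32).reshape(2, 5, 128)
heads = split_heads_np(x, num_heads=8)
restored = merge_heads_np(heads)
print(heads.shape, restored.shape)
```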

Both the encoder and the decoder apply a feed-forward network (FFN) after multi-head attention. The FFN consists of two fully connected layers: the first expands to dff units with a ReLU activation, and the second projects back to d_model:

In [None]:
def point_wise_feed_forward_network(d_model, dff):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
  ])
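
"Point-wise" means the same two layers are applied to every position independently. A small NumPy sketch with random, untrained weights (all names and sizes here are illustrative assumptions) makes this visible: running the FFN on the whole sequence gives the same result at position 3 as running it on position 3 alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, dff = 8, 32
w1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
w2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)

def ffn_np(x):
    hidden = np.maximum(x @ w1 + b1, 0.0)  # plays the role of Dense(dff, relu)
    return hidden @ w2 + b2                # plays the role of Dense(d_model)

x = rng.normal(size=(2, 5, d_model))       # (batch_size, seq_len, d_model)
full = ffn_np(x)
one_position = ffn_np(x[:, 3, :])
print(full.shape, one_position.shape)
```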

Define the encoder layer from the multi-head attention and FFN above, adding residual connections with layer normalization, plus dropout for regularization:

In [None]:
class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(EncoderLayer, self).__init__()

    self.mha = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

    ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
    ffn_output = self.dropout2(ffn_output, training=training)
    out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

    return out2
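
The `layernorm(x + sublayer_output)` pattern can be sketched in NumPy as follows. This is a simplified illustration: `layer_norm_np` is a hypothetical helper that omits the learnable scale and offset that `tf.keras.layers.LayerNormalization` also carries.

```python
import numpy as np

def layer_norm_np(x, epsilon=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 5, 128))   # layer input
sublayer_out = rng.normal(size=(2, 5, 128))            # e.g. attention output
out1 = layer_norm_np(x + sublayer_out)                 # residual, then normalize
print(out1.mean(axis=-1).round(6)[0], out1.std(axis=-1).round(3)[0])
```

The residual connection lets the input pass through unchanged next to the sub-layer, and the normalization keeps activations on a stable scale from layer to layer.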

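The decoder layer applies masked multi-head attention to the output embeddings, a second multi-head attention over the encoder outputs, and then an FFN. Each of these three sub-layers has its own normalization and dropout layer, which is why there are three of each. The softmax that turns the output into probabilities is applied later, on the logits of the final dense layer: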

In [None]:
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(DecoderLayer, self).__init__()

    self.mha1 = MultiHeadAttention(d_model, num_heads)
    self.mha2 = MultiHeadAttention(d_model, num_heads)

    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):
    # enc_output.shape == (batch_size, input_seq_len, d_model)

    attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)

    attn2, attn_weights_block2 = self.mha2(
        enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

    return out3, attn_weights_block1, attn_weights_block2

Build the full Encoder and Decoder. Note that the Encoder is used only in the first model; in the second model the BERT encoder replaces it:

In [None]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding,
                                            self.d_model)

    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)

In [None]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

    self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                       for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):

    seq_len = tf.shape(x)[1]
    attention_weights = {}

    x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                             look_ahead_mask, padding_mask)

      attention_weights[f'decoder_layer{i+1}_block1'] = block1
      attention_weights[f'decoder_layer{i+1}_block2'] = block2

    # x.shape == (batch_size, target_seq_len, d_model)
    return x, attention_weights

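This is where transfer learning comes in. We define two Transformers: one with the custom-built encoder defined above, the other with a BERT model as the encoder. Note that building `TFBertModel` from a `BertConfig`, as done here, reuses BERT's architecture with freshly initialized weights rather than loading a pre-trained checkpoint: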

In [None]:
class TransformerCustom(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               target_vocab_size, pe_input, pe_target, rate=0.1):
    super(TransformerCustom, self).__init__()

    # Custom-built encoder defined above
    self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                           input_vocab_size, pe_input, rate)

    self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                           target_vocab_size, pe_target, rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inp, tar, training, enc_padding_mask,
           look_ahead_mask, dec_padding_mask):

    enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

    # dec_output.shape == (batch_size, tar_seq_len, d_model)
    dec_output, attention_weights = self.decoder(
        tar, enc_output, training, look_ahead_mask, dec_padding_mask)

    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

    return final_output, attention_weights

In [None]:
class TransformerBERT(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               target_vocab_size, pe_input, pe_target, rate=0.1):
    super(TransformerBERT, self).__init__()

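    # BERT-architecture encoder built from a config. The weights start out
    # randomly initialized; no pre-trained checkpoint is loaded here.
    self.encoder = TFBertModel(BertConfig(hidden_size=d_model,
                                          num_hidden_layers=num_layers,
                                          num_attention_heads=num_heads))

    self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                           target_vocab_size, pe_target, rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inp, tar, training, enc_padding_mask,
           look_ahead_mask, dec_padding_mask):

    # BERT expects an attention mask of shape (batch_size, inp_seq_len) with
    # 1 for real tokens and 0 for padding, so it does not attend to padding.
    attention_mask = tf.cast(tf.math.not_equal(inp, 0), tf.int32)

    # We need the sequence of hidden states at the output of the last BERT layer
    enc_output = self.encoder(inp, attention_mask=attention_mask,
                              training=training).last_hidden_state  # (batch_size, inp_seq_len, d_model)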

    # dec_output.shape == (batch_size, tar_seq_len, d_model)
    dec_output, attention_weights = self.decoder(
        tar, enc_output, training, look_ahead_mask, dec_padding_mask)

    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

    return final_output, attention_weights

Due to limited resources, the model cannot be very complex. The default parameters worked fine, while increasing them led to out-of-memory issues. For instance, I could not train with a bert-base-sized encoder, which has 12 layers, a hidden size of 768 and 12 attention heads.

In [None]:
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1

Create Adam optimizer with a custom learning schedule:

In [None]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    # Keep d_model as a float so it can be passed to rsqrt below
    self.d_model = tf.cast(d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    # The optimizer may pass an integer step; rsqrt needs a float
    step = tf.cast(step, tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

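Adam adapts the step size of each parameter from running estimates of the gradient moments. The schedule above sets the base learning rate, which rises linearly during warm-up and then decays with the inverse square root of the step: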

In [None]:
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)
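
To see what the schedule does numerically, the same formula can be written as a plain Python function. The name `transformer_lr` is hypothetical; the defaults mirror the values used in this notebook (d_model = 128, 4000 warm-up steps).

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    arg1 = step ** -0.5                   # inverse square-root decay
    arg2 = step * warmup_steps ** -1.5    # linear warm-up
    return d_model ** -0.5 * min(arg1, arg2)

for step in (1, 1000, 4000, 16000):
    print(step, f"{transformer_lr(step):.6f}")
```

The rate peaks at the end of warm-up (step 4000, roughly 1.4e-3 with these settings) and has dropped to half of that peak by step 16000.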

In [None]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

Define the loss and accuracy metrics. Both are masked so that padding positions (token id 0) do not contribute:

In [None]:
def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_sum(loss_)/tf.reduce_sum(mask)


def accuracy_function(real, pred):
  accuracies = tf.equal(real, tf.argmax(pred, axis=2))

  mask = tf.math.logical_not(tf.math.equal(real, 0))
  accuracies = tf.math.logical_and(mask, accuracies)

  accuracies = tf.cast(accuracies, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)
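
The effect of the mask can be checked with a small NumPy sketch. `masked_loss_np` is an illustrative re-implementation, not the function used for training. With uniform logits over a toy vocabulary of 10 tokens, every real position contributes log(10), and the padded positions are ignored.

```python
import numpy as np

def masked_loss_np(real, logits):
    real = np.asarray(real)
    # Log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the true token at each position
    nll = -np.take_along_axis(log_probs, real[..., np.newaxis], axis=-1)[..., 0]
    mask = (real != 0).astype(nll.dtype)   # 0 for padding, 1 for real tokens
    return (nll * mask).sum() / mask.sum()

real = np.array([[5, 2, 0, 0]])            # two real tokens, two pads
logits = np.zeros((1, 4, 10))              # uniform prediction over 10 tokens
loss = masked_loss_np(real, logits)
print(loss, np.log(10))
```

Dividing by the number of real tokens, rather than by the padded sequence length, keeps the loss comparable between batches with different amounts of padding.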

In [None]:
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

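Initialize the transformer classes with the above parameters. According to the BERT documentation, the English (uncased) vocabulary has 30,522 tokens. The `bert-base-cased` tokenizer loaded earlier has a smaller vocabulary of 28,996 tokens, so 30,522 is a safe upper bound for the embedding and output layers: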

In [None]:
transformerCustom = TransformerCustom(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=30522,
    target_vocab_size=30522,
    pe_input=600, #this is max positional encoding
    pe_target=600, #this is max positional encoding
    rate=dropout_rate)

In [None]:
transformerBERT = TransformerBERT(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=30522,
    target_vocab_size=30522,
    pe_input=600, #this is max positional encoding
    pe_target=600, #this is max positional encoding
    rate=dropout_rate)

Create encoder, decoder and look ahead masks:

In [None]:
def create_masks(inp, tar):
  # Encoder padding mask
  enc_padding_mask = create_padding_mask(inp)

  # Used in the 2nd attention block in the decoder.
  # This padding mask is used to mask the encoder outputs.
  dec_padding_mask = create_padding_mask(inp)

  # Used in the 1st attention block in the decoder.
  # It is used to pad and mask future tokens in the input received by
  # the decoder.
  look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
  dec_target_padding_mask = create_padding_mask(tar)
  combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

  return enc_padding_mask, combined_mask, dec_padding_mask
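
The combined decoder mask is the element-wise maximum of the look-ahead mask and the target padding mask. Here is a NumPy sketch on a toy target of length 4 whose last position is padding (the ids are made up):

```python
import numpy as np

tar = np.array([[101, 7, 102, 0]])   # last position is padding
size = tar.shape[1]

look_ahead = np.triu(np.ones((size, size), dtype=np.float32), k=1)
tar_padding = (tar == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

# Broadcasting (1, 1, 1, 4) against (4, 4) gives (1, 1, 4, 4)
combined = np.maximum(tar_padding, look_ahead)
print(combined[0, 0])
```

Each row is one decoder position: it may attend only to itself and earlier positions, and never to the padded last column.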

Set up checkpoint managers for both models. Checkpoints are saved every 5 epochs during training, and the 5 most recent ones are kept:

In [None]:
checkpoint_path_cust = "./checkpoints/cust/train"
checkpoint_path_bert = "./checkpoints/bert/train"

ckptCustom = tf.train.Checkpoint(transformer=transformerCustom,
                           optimizer=optimizer)
ckptBERT = tf.train.Checkpoint(transformer=transformerBERT,
                           optimizer=optimizer)

ckpt_manager_cust = tf.train.CheckpointManager(ckptCustom, 
                                               checkpoint_path_cust, 
                                               max_to_keep=5)

ckpt_manager_bert = tf.train.CheckpointManager(ckptBERT, 
                                               checkpoint_path_bert, 
                                               max_to_keep=5)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager_cust.latest_checkpoint:
  ckptCustom.restore(ckpt_manager_cust.latest_checkpoint)
  print('Latest checkpoint restored!!')

if ckpt_manager_bert.latest_checkpoint:
  ckptBERT.restore(ckpt_manager_bert.latest_checkpoint)
  print('Latest checkpoint restored!!')

In [None]:
EPOCHS = 20

In [None]:
# The @tf.function trace-compiles train_step into a TF graph for faster
# execution. The function specializes to the precise shape of the argument
# tensors. To avoid re-tracing due to the variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64)
]


#Create a function to train with the custom encoder
@tf.function(input_signature=train_step_signature)
def train_step_custom(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]

  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)

  #GradientTape uses Custom Encoder
  with tf.GradientTape() as tape:
      predictions, _ = transformerCustom(inp, tar_inp,
                                  True,
                                  enc_padding_mask,
                                  combined_mask,
                                  dec_padding_mask)
      loss = loss_function(tar_real, predictions)

  gradients = tape.gradient(loss, transformerCustom.trainable_variables)
  optimizer.apply_gradients(zip(gradients, transformerCustom.trainable_variables))
      
  train_loss(loss)
  train_accuracy(accuracy_function(tar_real, predictions))


#Create a function to train with the BERT encoder
@tf.function(input_signature=train_step_signature)
def train_step_bert(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]

  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)

  #Call Transformer with BERT encoder
  with tf.GradientTape() as tape:
      predictions, _ = transformerBERT(inp, tar_inp,
                                  True,
                                  enc_padding_mask,
                                  combined_mask,
                                  dec_padding_mask)
      loss = loss_function(tar_real, predictions)
      
  gradients = tape.gradient(loss, transformerBERT.trainable_variables)
  optimizer.apply_gradients(zip(gradients, transformerBERT.trainable_variables))
      
  train_loss(loss)
  train_accuracy(accuracy_function(tar_real, predictions))
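
Both training steps rely on teacher forcing: the decoder input `tar_inp` is the target without its last token, and the label `tar_real` is the target without its first token. A tiny NumPy sketch with made-up token ids shows the alignment:

```python
import numpy as np

# Toy ids; 101 and 102 stand in for the [CLS] and [SEP] special tokens
tar = np.array([[101, 11, 12, 13, 102]])

tar_inp = tar[:, :-1]    # what the decoder is fed
tar_real = tar[:, 1:]    # what it should predict at each position

for seen, expected in zip(tar_inp[0], tar_real[0]):
    print(f"decoder sees {int(seen):>3} -> should predict {int(expected)}")
```

Together with the look-ahead mask, this means that at every position the decoder predicts the next token from the true previous tokens only.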

After we train the transformer, we define the following function to generate summaries for the input sentences:

In [None]:
def evaluate(sentence, max_length=50, use_bert=False):
  # The input sentence is the dialogue; the tokenizer adds the [CLS] and
  # [SEP] special tokens around it.

  sentence = tf.dtypes.cast(tokenizer(sentence, padding=True, truncation=True, return_tensors="tf")['input_ids'], tf.int64)

  encoder_input = sentence

  # The decoder starts from the tokenizer's start token ([CLS]) and stops
  # once it predicts the end token ([SEP]). Tokenizing an empty string
  # yields exactly these two ids.
  start, end = tokenizer([''])['input_ids'][0]
  output = tf.convert_to_tensor([start])
  output = tf.dtypes.cast(tf.expand_dims(output, 0), tf.int64)

  for i in range(max_length):
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
        encoder_input, output)

    # predictions.shape == (batch_size, seq_len, vocab_size)
    if use_bert:
        predictions, attention_weights = transformerBERT(encoder_input,
                                                    output,
                                                    False,
                                                    enc_padding_mask,
                                                    combined_mask,
                                                    dec_padding_mask)
    else:
        predictions, attention_weights = transformerCustom(encoder_input,
                                                    output,
                                                    False,
                                                    enc_padding_mask,
                                                    combined_mask,
                                                    dec_padding_mask)

    # select the last word from the seq_len dimension
    predictions = predictions[:, -1:, :]  # (batch_size, 1, vocab_size)

    predicted_id = tf.argmax(predictions, axis=-1)

    # concatenate the predicted_id to the output, which is fed back to the
    # decoder as its input on the next iteration.
    output = tf.concat([output, predicted_id], axis=-1)

    # return the result if the predicted_id is equal to the end token
    if predicted_id == end:
      break

  text = tokenizer.decode(output[0])

  return text
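
Stripped of the model details, `evaluate` is a greedy decoding loop: feed the tokens generated so far, take the most likely next token, append it, and stop at the end token or at `max_length`. The sketch below shows just that control flow; `greedy_decode` and `toy_model` are hypothetical stand-ins, with `toy_model` replaying a fixed script instead of calling a transformer.

```python
def greedy_decode(next_token_fn, start_id, end_id, max_length=50):
    output = [start_id]
    for _ in range(max_length):
        predicted_id = next_token_fn(output)  # most likely next token
        output.append(predicted_id)
        if predicted_id == end_id:            # stop at the end token
            break
    return output

script = [11, 12, 13, 102]  # toy ids; 102 stands in for [SEP]

def toy_model(prefix):
    # Hypothetical model: returns the next scripted id for the current prefix
    return script[min(len(prefix) - 1, len(script) - 1)]

decoded = greedy_decode(toy_model, start_id=101, end_id=102)
print(decoded)
```

Greedy decoding commits to the single best token at every step, so one early mistake propagates through the rest of the summary.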

##Custom Encoder

Train the transformer with the custom encoder, using a custom training loop:

In [None]:
for epoch in range(EPOCHS):
  start = time.time()

  train_loss.reset_states()
  train_accuracy.reset_states()

  # inp -> dialogue, tar -> summary
  for (batch, (inp, tar)) in enumerate(train_ds):
    train_step_custom(inp, tar)

    if batch % 50 == 0:
      print(f'Epoch {epoch + 1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

  if (epoch + 1) % 5 == 0:
    ckpt_save_path = ckpt_manager_cust.save()
    print(f'Saving checkpoint for epoch {epoch+1} at {ckpt_save_path}')

  print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

  print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')

Epoch 1 Batch 0 Loss 10.3286 Accuracy 0.0000
Epoch 1 Batch 50 Loss 10.2825 Accuracy 0.0258
Epoch 1 Batch 100 Loss 10.1952 Accuracy 0.0512
Epoch 1 Batch 150 Loss 10.0671 Accuracy 0.0598
Epoch 1 Batch 200 Loss 9.8963 Accuracy 0.0642
Epoch 1 Batch 250 Loss 9.6873 Accuracy 0.0670
Epoch 1 Loss 9.4806 Accuracy 0.0685
Time taken for 1 epoch: 71.99 secs

Epoch 2 Batch 0 Loss 8.0209 Accuracy 0.0773
Epoch 2 Batch 50 Loss 7.7415 Accuracy 0.0773
Epoch 2 Batch 100 Loss 7.4768 Accuracy 0.0772
Epoch 2 Batch 150 Loss 7.2567 Accuracy 0.0772
Epoch 2 Batch 200 Loss 7.0935 Accuracy 0.0814
Epoch 2 Batch 250 Loss 6.9667 Accuracy 0.0882
Epoch 2 Loss 6.8728 Accuracy 0.0929
Time taken for 1 epoch: 59.13 secs

Epoch 3 Batch 0 Loss 6.2913 Accuracy 0.1255
Epoch 3 Batch 50 Loss 6.1904 Accuracy 0.1446
Epoch 3 Batch 100 Loss 6.0847 Accuracy 0.1566
Epoch 3 Batch 150 Loss 5.9855 Accuracy 0.1639
Epoch 3 Batch 200 Loss 5.8959 Accuracy 0.1703
Epoch 3 Batch 250 Loss 5.8114 Accuracy 0.1761
Epoch 3 Loss 5.7488 Accuracy 0.18

Let's test a few sentences:

In [None]:
print('############# FIRST TEST #############')
sentence = test_df.iloc[6]['dialogue']
print('DIALOGUE: ' + sentence)
print('TRUE SUMMARY: ' + test_df.iloc[6]['summary'])
print('CUSTOM SUMMARY : ' + evaluate(sentence=sentence))
print('\n')

print('############# SECOND TEST #############')
sentence = test_df.iloc[10]['dialogue']
print('DIALOGUE: ' + sentence)
print('TRUE SUMMARY: ' + test_df.iloc[10]['summary'])
print('CUSTOM SUMMARY : ' + evaluate(sentence=sentence))
print('\n')

print('############# THIRD TEST #############')
sentence = test_df.iloc[11]['dialogue']
print('DIALOGUE: ' + sentence)
print('TRUE SUMMARY: ' + test_df.iloc[11]['summary'])
print('CUSTOM SUMMARY : ' + evaluate(sentence=sentence))
print('\n')

############# FIRST TEST #############
DIALOGUE: Max: Know any good sites to buy clothes from?  Payton: Sure :) <file_other> <file_other> <file_other> <file_other> <file_other> <file_other> <file_other>  Max: That's a lot of them!  Payton: Yeah, but they have different things so I usually buy things from 2 or 3 of them.  Max: I'll check them out. Thanks.   Payton: No problem :)  Max: How about u?  Payton: What about me?  Max: Do u like shopping?  Payton: Yes and no.  Max: How come?  Payton: I like browsing, trying on, looking in the mirror and seeing how I look, but not always buying.  Max: Y not?  Payton: Isn't it obvious? ;)  Max: Sry ;)  Payton: If I bought everything I liked, I'd have nothing left to live on ;)  Max: Same here, but probably different category ;)  Payton: Lol  Max: So what do u usually buy?  Payton: Well, I have 2 things I must struggle to resist!  Max: Which are?  Payton: Clothes, ofc ;)  Max: Right. And the second one?  Payton: Books. I absolutely love reading!  M

The summaries are highly inaccurate and mostly don't even mention the people in the dialogue. The model appears to have overfit the training data: it reproduces memorized sentences in completely unrelated contexts, as the following examples show:

In [None]:
print('############# FIRST TEST #############')
sentence = test_df.iloc[1]['dialogue']
print('DIALOGUE: ' + sentence)
print('TRUE SUMMARY: ' + test_df.iloc[1]['summary'])
print('CUSTOM SUMMARY : ' + evaluate(sentence=sentence))
print('\n')

print('############# SECOND TEST #############')
sentence = test_df.iloc[7]['dialogue']
print('DIALOGUE: ' + sentence)
print('TRUE SUMMARY: ' + test_df.iloc[7]['summary'])
print('CUSTOM SUMMARY : ' + evaluate(sentence=sentence))
print('\n')

############# FIRST TEST #############
DIALOGUE: Eric: MACHINE!  Rob: That's so gr8!  Eric: I know! And shows how Americans see Russian ;)  Rob: And it's really funny!  Eric: I know! I especially like the train part!  Rob: Hahaha! No one talks to the machine like that!  Eric: Is this his only stand-up?  Rob: Idk. I'll check.  Eric: Sure.  Rob: Turns out no! There are some of his stand-ups on youtube.  Eric: Gr8! I'll watch them now!  Rob: Me too!  Eric: MACHINE!  Rob: MACHINE!  Eric: TTYL?  Rob: Sure :)
TRUE SUMMARY: Eric and Rob are going to watch a stand-up on youtube.
CUSTOM SUMMARY : [CLS] There's a bomb threat at the university. He got a new worker from the choir next month. [SEP]


############# SECOND TEST #############
DIALOGUE: Rita: I'm so bloody tired. Falling asleep at work. :-(  Tina: I know what you mean.  Tina: I keep on nodding off at my keyboard hoping that the boss doesn't notice..  Rita: The time just keeps on dragging on and on and on....   Rita: I keep on looking a

A terrifying result.
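The repetition noted above can be checked more systematically. The snippet below is only a sketch: `repeated_sentences` is a hypothetical helper (not part of the notebook), and the `generated` list holds placeholder strings for illustration. In practice the list would be filled with the outputs of `evaluate(sentence=...)` over several test dialogues.

```python
import re
from collections import defaultdict
from typing import Dict, List

def repeated_sentences(summaries: List[str]) -> Dict[str, List[int]]:
    """Map each sentence found in more than one summary to the indices of those summaries."""
    where = defaultdict(set)
    for idx, summary in enumerate(summaries):
        # Drop the BERT tokenizer's special tokens before splitting into sentences.
        text = summary.replace("[CLS]", "").replace("[SEP]", "")
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                where[sentence].add(idx)
    return {s: sorted(ids) for s, ids in where.items() if len(ids) > 1}

# Placeholder outputs for illustration only (assumption, not real model results).
generated = [
    "[CLS] There's a bomb threat at the university. He got a new worker from the choir next month. [SEP]",
    "[CLS] There's a bomb threat at the university. [SEP]",
    "[CLS] Some other summary. [SEP]",
]
print(repeated_sentences(generated))
```

A large share of sentences shared between unrelated dialogues would support the suspicion that the decoder memorized training summaries instead of conditioning on the input.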

This suggests that the architecture is better suited to machine translation. It also shows why summarization is a hard problem to deal with: translators are much more successful nowadays, but good summarizers are still hard to find.

##BERT Encoder

Now we use BERT as the encoder:

In [None]:
for epoch in range(EPOCHS):
  start = time.time()

  train_loss.reset_states()
  train_accuracy.reset_states()

  # inp -> dialogue, tar -> summary
  for (batch, (inp, tar)) in enumerate(train_ds):
    train_step_bert(inp, tar)

    if batch % 50 == 0:
      print(f'Epoch {epoch + 1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

  if (epoch + 1) % 5 == 0:
    ckpt_save_path = ckpt_manager_bert.save()
    print(f'Saving checkpoint for epoch {epoch+1} at {ckpt_save_path}')

  print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

  print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')

Epoch 1 Batch 0 Loss 10.3367 Accuracy 0.0000
Epoch 1 Batch 50 Loss 10.2868 Accuracy 0.0286
Epoch 1 Batch 100 Loss 10.1932 Accuracy 0.0526
Epoch 1 Batch 150 Loss 10.0635 Accuracy 0.0607
Epoch 1 Batch 200 Loss 9.8934 Accuracy 0.0650
Epoch 1 Batch 250 Loss 9.6858 Accuracy 0.0676
Epoch 1 Loss 9.4796 Accuracy 0.0690
Time taken for 1 epoch: 92.72 secs

Epoch 2 Batch 0 Loss 8.0212 Accuracy 0.0773
Epoch 2 Batch 50 Loss 7.7415 Accuracy 0.0773
Epoch 2 Batch 100 Loss 7.4769 Accuracy 0.0772
Epoch 2 Batch 150 Loss 7.2572 Accuracy 0.0777
Epoch 2 Batch 200 Loss 7.0945 Accuracy 0.0848
Epoch 2 Batch 250 Loss 6.9692 Accuracy 0.0909
Epoch 2 Loss 6.8770 Accuracy 0.0947
Time taken for 1 epoch: 83.82 secs

Epoch 3 Batch 0 Loss 6.2938 Accuracy 0.1182
Epoch 3 Batch 50 Loss 6.1904 Accuracy 0.1402
Epoch 3 Batch 100 Loss 6.0773 Accuracy 0.1520
Epoch 3 Batch 150 Loss 5.9735 Accuracy 0.1606
Epoch 3 Batch 200 Loss 5.8837 Accuracy 0.1676
Epoch 3 Batch 250 Loss 5.8017 Accuracy 0.1737
Epoch 3 Loss 5.7409 Accuracy 0.17

Runtimes are slightly longer, but the model accuracy is considerably higher than before (58% vs. 40%) and the loss is lower (1.7 vs. 2.8). These results are impressive.

Let's take a look at a few generated summaries below. Here's my personal favourite summary:

In [None]:
sentence = test_df.iloc[2]['dialogue']
true_summ = test_df.iloc[2]['summary']
pred_summ_bert = evaluate(sentence=sentence,use_bert=True)

In [None]:
sentence

"Lenny: Babe, can you help me with something?  Bob: Sure, what's up?  Lenny: Which one should I pick?  Bob: Send me photos  Lenny:  <file_photo>  Lenny:  <file_photo>  Lenny:  <file_photo>  Bob: I like the first ones best  Lenny: But I already have purple trousers. Does it make sense to have two pairs?  Bob: I have four black pairs :D :D  Lenny: yeah, but shouldn't I pick a different color?  Bob: what matters is what you'll give you the most outfit options  Lenny: So I guess I'll buy the first or the third pair then  Bob: Pick the best quality then  Lenny: ur right, thx  Bob: no prob :)"

In [None]:
true_summ

"Lenny can't decide which trousers to buy. Bob advised Lenny on that topic. Lenny goes with Bob's advice to pick the trousers that are of best quality."

In [None]:
pred_summ_bert

'[CLS] Bob has a new trousers. He will pick up Bob a pink trousers. [SEP]'

The result is not accurate, but it has one interesting aspect: it talks about pink trousers, although "pink" is never mentioned in the original dialogue. The abstractive model generated information that is not in the source.
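A quick way to spot this kind of invented content is to list the summary words that never occur in the dialogue. This is a minimal sketch under simple assumptions: `novel_words` is a hypothetical helper, tokenization is a plain regex on lowercase letters, and the dialogue is shortened to two turns from the example above. Function words will show up too, so the output needs a human eye.

```python
import re
from typing import List

def novel_words(dialogue: str, summary: str) -> List[str]:
    """Return distinct summary words (lowercased, in order) that never occur in the dialogue."""
    def tokenize(text: str) -> List[str]:
        return re.findall(r"[a-z']+", text.lower())

    source_vocab = set(tokenize(dialogue))
    seen = set()
    novel = []
    for word in tokenize(summary):
        if word not in source_vocab and word not in seen:
            seen.add(word)
            novel.append(word)
    return novel

# Two turns taken from the dialogue above; special tokens removed from the summary.
dialogue = "Lenny: But I already have purple trousers. Bob: Pick the best quality then"
summary = "Bob has a new trousers. He will pick up Bob a pink trousers."
print(novel_words(dialogue, summary))
```

Filtering the result against a stop-word list would leave mostly content words such as "pink", which are the ones worth checking for invented facts.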

Let's see the summaries for other dialogues:

In [None]:
print('############# FIRST TEST #############')
sentence = test_df.iloc[0]['dialogue']
print('DIALOGUE: ' + sentence)
print('TRUE SUMMARY: ' + test_df.iloc[0]['summary'])
print('BERT SUMMARY : ' + evaluate(sentence=sentence,use_bert=True))
print('\n')

print('############# SECOND TEST #############')
sentence = test_df.iloc[1]['dialogue']
print('DIALOGUE: ' + sentence)
print('TRUE SUMMARY: ' + test_df.iloc[1]['summary'])
print('BERT SUMMARY : ' + evaluate(sentence=sentence,use_bert=True))
print('\n')

print('############# THIRD TEST #############')
sentence = test_df.iloc[3]['dialogue']
print('DIALOGUE: ' + sentence)
print('TRUE SUMMARY: ' + test_df.iloc[3]['summary'])
print('BERT SUMMARY : ' + evaluate(sentence=sentence,use_bert=True))
print('\n')

############# FIRST TEST #############
DIALOGUE: Hannah: Hey, do you have Betty's number? Amanda: Lemme check Hannah: <file_gif> Amanda: Sorry, can't find it. Amanda: Ask Larry Amanda: He called her last time we were at the park together Hannah: I don't know him well Hannah: <file_gif> Amanda: Don't be shy, he's very nice Hannah: If you say so.. Hannah: I'd rather you texted him Amanda: Just text him 🙂 Hannah: Urgh.. Alright Hannah: Bye Amanda: Bye bye
TRUE SUMMARY: Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.
BERT SUMMARY : [CLS] Amanda is worried about Amanda's number to Amanda's advice on it. Hannah and Hannah's number is very older and he's very well. Hannah and Hannah have a number of HRr D's number of useful on a


############# SECOND TEST #############
DIALOGUE: Eric: MACHINE!  Rob: That's so gr8!  Eric: I know! And shows how Americans see Russian ;)  Rob: And it's really funny!  Eric: I know! I especially like the train part!  Rob: Hahaha

We clearly see the difference compared to the previous model: the BERT model actually refers to the people in the dialogue and gives some context on what the dialogue is about (such as dinner, movies, or a phone number). It's flawed, but we have a winner!
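Whether a summary refers to the people in the dialogue can be checked automatically. The sketch below assumes that speaker tags look like `Name:` (a capitalized word followed by a colon), which holds for the dialogues shown here but is only a heuristic. `speaker_coverage` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Dict

def speaker_coverage(dialogue: str, summary: str) -> Dict[str, bool]:
    """Map each speaker name found in the dialogue to whether the summary mentions it."""
    # Heuristic: a speaker tag is a capitalized word directly followed by a colon.
    speakers = sorted(set(re.findall(r"\b([A-Z][a-z]+):", dialogue)))
    return {
        name: re.search(rf"\b{re.escape(name)}\b", summary) is not None
        for name in speakers
    }

# Shortened dialogue and summaries taken from the test examples above.
dialogue = "Hannah: Hey, do you have Betty's number? Amanda: Lemme check Amanda: Ask Larry"
true_summary = "Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry."
custom_summary = "There's a bomb threat at the university."

print(speaker_coverage(dialogue, true_summary))
print(speaker_coverage(dialogue, custom_summary))
```

Averaging the share of mentioned speakers over the test set would give a rough number for the difference between the two models described above.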

##T5 Summarizer

While researching for my project, I came across the T5 model **[4]** and realized it had summarization capabilities, so I wanted to briefly compare it to my models. It uses PyTorch:

In [None]:
!pip install transformers==2.8.0
!pip install torch==1.4.0

In [None]:
import torch
import json 
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

An example:

In [None]:
t5model = T5ForConditionalGeneration.from_pretrained('t5-small')
t5tokenizer = T5Tokenizer.from_pretrained('t5-small')
device = torch.device('cpu')

In [None]:
text ="""
The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.
The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.
At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors.
"We'll be the comeback kids, all of us," he said. "We want to get our country back."
The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy earlier than that.
"""


preprocess_text = text.strip().replace("\n","")
t5_prepared_Text = "summarize: "+preprocess_text
print ("original text preprocessed: \n", preprocess_text)

tokenized_text = t5tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)


# summarize
summary_ids = t5model.generate(tokenized_text,
                                    num_beams=4,
                                    no_repeat_ngram_size=2,
                                    min_length=30,
                                    max_length=100,
                                    early_stopping=True)

output = t5tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print ("\n\nSummarized text: \n",output)

original text preprocessed: 
 The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors."We'll be the comeback kids, all of us," he said. "We want to get our country back."The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy earlier than that.


Summarized text: 
 the us has over 637,000 confirmed Covid-19 cases and over 30,826 deaths. president Donald Trump predicts some states will reopen the country in april, he said. "we'll be the comeback kids, all of us," the president says.


Works well with an article. Let's see how it fares with dialogues:

In [None]:
for i in range(4):
    preprocess_text = test_df.iloc[i]['dialogue']
    t5_prepared_Text = "summarize: "+preprocess_text
    print ("\n\noriginal text preprocessed: \n", preprocess_text)

    tokenized_text = t5tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)


    # summarize
    summary_ids = t5model.generate(tokenized_text,
                                        num_beams=4,
                                        no_repeat_ngram_size=2,
                                        min_length=5,
                                        max_length=40,
                                        early_stopping=True)

    output = t5tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    print ("\nSummarized text: \n",output)



original text preprocessed: 
 Hannah: Hey, do you have Betty's number? Amanda: Lemme check Hannah: <file_gif> Amanda: Sorry, can't find it. Amanda: Ask Larry Amanda: He called her last time we were at the park together Hannah: I don't know him well Hannah: <file_gif> Amanda: Don't be shy, he's very nice Hannah: If you say so.. Hannah: I'd rather you texted him Amanda: Just text him 🙂 Hannah: Urgh.. Alright Hannah: Bye Amanda: Bye bye

Summarized text: 
 Amanda: Lemme check Hannah: file_gif> Amanda. he called her last time we were at the park together Hannah. I don't know him well Hannah


original text preprocessed: 
 Eric: MACHINE!  Rob: That's so gr8!  Eric: I know! And shows how Americans see Russian ;)  Rob: And it's really funny!  Eric: I know! I especially like the train part!  Rob: Hahaha! No one talks to the machine like that!  Eric: Is this his only stand-up?  Rob: Idk. I'll check.  Eric: Sure.  Rob: Turns out no! There are some of his stand-ups on youtube.  Eric: Gr8! I'll 

It behaves more like an extractive summarizer, just isolating the parts it finds most relevant. The results are underwhelming (and slow to generate), but the extractive nature could be very useful for customer calls when someone wants to check whether a specific topic or term was discussed. Ultimately, the BERT model performs much better.
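How extractive a summary is can be estimated by the share of its n-grams that are copied verbatim from the source. The following is a rough sketch, assuming whitespace tokenization and trigrams. `copied_ngram_ratio` is a hypothetical helper, and the `abstractive` string is a made-up contrast example, not a model output.

```python
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def copied_ngram_ratio(source: str, summary: str, n: int = 3) -> float:
    """Fraction of distinct summary n-grams that appear verbatim in the source."""
    src = ngrams(source.lower().split(), n)
    summ = ngrams(summary.lower().split(), n)
    if not summ:
        return 0.0
    return len(summ & src) / len(summ)

# Fragment of the first test dialogue and a span that T5 copied from it.
source = ("Amanda: Ask Larry Amanda: He called her last time we were "
          "at the park together Hannah: I don't know him well")
t5_like = "he called her last time we were at the park together"
abstractive = "Hannah should ask Larry for the number"  # made-up contrast example

print(copied_ngram_ratio(source, t5_like))
print(copied_ngram_ratio(source, abstractive))
```

A ratio close to 1 indicates copying, while a ratio close to 0 indicates paraphrasing (or, as with the custom model, unrelated text), so the number is best read together with the summaries themselves.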

##Further Discussion, Further Improvements


Firstly, as we see above, the BERT model was very impressive compared to our custom-built encoder. Its abstractions are mostly relevant to the dialogue, and its training accuracy grew visibly from epoch to epoch. T5, on the other hand, is better suited to article/news summarization.

Personally, I learnt that there are more resources using PyTorch than TensorFlow for this problem (Hugging Face tutorials are mostly in TensorFlow).


Here is some future work that could be carried out:
1.   Combining a state-of-the-art model such as BERT with a vanilla decoder is a big disservice to BERT. Although we still see impressive results, we need to find a better way to incorporate BERT's power into the transformer's decoder part.
2.   We only used a "tiny" BERT model, but there are bigger models with more hidden layers and attention heads. We can easily use pre-trained models to save computation power. However, BERT is trained on Wikipedia data, and it might not be as powerful when applied to dialogues.
3.   We only used accuracy as a measure during this project - **[2]** uses ROUGE metrics to measure the transformer-generated summaries' quality.
4.   Similar to **[2]**, we could have added news data to the model. However, the CNN dataset they used is large, and we would need more computational power.
5.   Although the computationally heavy transformer model constrained us and left little room for more hyperparameter tuning, there is an opportunity for more pre-processing.
6.   We could explore breaking down the dialogues into first and second-person texts while keeping the positional encodings. This would give us more context about who said what because we observed that the model could get confused.
7.   It might also be interesting to compare BERT performance to the OpenNMT library.
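
For item 3, the idea behind ROUGE can be sketched in plain Python. This is a simplified illustration, not the official scorer: it assumes whitespace tokenization with no stemming, and the function names are hypothetical. For results comparable to **[2]**, a maintained ROUGE implementation should be used instead.

```python
from collections import Counter
from typing import List

def _f1(overlap: int, ref_len: int, cand_len: int) -> float:
    """Harmonic mean of precision (overlap/cand_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge_1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 with clipped counts."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return _f1(overlap, len(ref), len(cand))

def _lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference: str, candidate: str) -> float:
    """Longest-common-subsequence F1."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return _f1(_lcs_len(ref, cand), len(ref), len(cand))

# Reference and custom-model summary from the "Eric and Rob" test above.
reference = "Eric and Rob are going to watch a stand-up on youtube."
candidate = "There's a bomb threat at the university."
print(f"ROUGE-1 F1: {rouge_1(reference, candidate):.3f}")
print(f"ROUGE-L F1: {rouge_l(reference, candidate):.3f}")
```

Unlike token-level training accuracy, these scores compare the generated summary to the human reference as a whole, so they would expose the unrelated summaries that the accuracy numbers hide.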

##References

**[1]** *Tensorflow: "Transformer model for language understanding"* [https://www.tensorflow.org/tutorials/text/transformer]

Used to build the transformer architecture

**[2]** *SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization* [https://arxiv.org/abs/1911.12237]

Dialogue Dataset used in the models

**[3]** *Hugging Face* [https://huggingface.co/]

Used BERT tokenizer for both models, BERT encoder to replace the trainable encoder from the original architecture, and T5 summarizer to compare it to these results

**[4]** *Abstractive Text Summarization using T5* [https://github.com/faiztariq/FzLabs/blob/master/abstractive-text-summarization-t5.ipynb]

Checking the performance when applied to dialogues

Some other reads:

https://towardsdatascience.com/practical-nlp-summarising-short-and-long-speeches-with-hugging-faces-pipeline-bc7df76bd366

https://towardsdatascience.com/building-a-multi-label-text-classifier-using-bert-and-tensorflow-f188e0ecdc5d

https://medium.com/rocket-mortgage-technology-blog/conversational-summarization-with-natural-language-processing-c073a6bcaa3a

https://www.tensorflow.org/tutorials/text/classify_text_with_bert

https://skimai.com/fine-tuning-bert-for-sentiment-analysis/
