<a href="https://colab.research.google.com/github/rahiakela/transfer-learning-for-natural-language-processing/blob/main/2-getting-started-with-baselines/5_BERT_based_models_for_email_sentiment_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT-based models for Email Sentiment Classification

Our goal is to establish a set of baselines for a pair of concrete NLP problems, which we will later be able to use to measure progressive improvements gained from leveraging increasingly sophisticated transfer learning
approaches. In the process of doing this, we aim to advance your general NLP instincts and refresh your understanding of typical procedures involved in setting up problem-solving pipelines for such problems. You will review techniques ranging from tokenization to data structure and model selection. We first train some traditional machine learning models from scratch to establish some preliminary baselines for these problems.

We will focus on a pair of important representative example NLP problems – spam
classification of email, and sentiment classification of movie reviews. This exercise will arm you with a number of important skills, including some tips for obtaining, visualizing and preprocessing data. 

Three major model classes will be covered, namely linear models such as logistic regression, decision-tree-based models such as random forests, and neural-network-based models such as ELMo. These classes are additionally represented by support vector machines (SVMs) with linear kernels, gradient-boosting machines (GBMs) and BERT respectively. 

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/content-classification-supervised-models.png?raw=1' width='800'/>



## Setup

Ref: https://stackoverflow.com/questions/57742410/error-on-scope-variable-while-using-tensorflow-hub

In [None]:
%%shell

pip install keras==2.2.4 # critical dependency
pip install tensorflow==1.15
pip install "tensorflow_hub>=0.6.0"
pip install -q bert-tensorflow

In [2]:
import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import email        # email package for processing email messages
import random
import re
import time
from tqdm import tqdm


import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import backend as K

from bert.tokenization import FullTokenizer

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

import matplotlib.pyplot as plt

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
# Initialize tensorflow session
sess = tf.Session()

In [4]:
from google.colab import files
files.upload() # upload kaggle.json file

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [5]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 /root/.kaggle/kaggle.json

# download dataset from kaggle
kaggle datasets download -d wcukierski/enron-email-dataset
unzip -qq enron-email-dataset.zip

kaggle datasets download -d rtatman/fraudulent-email-corpus
unzip -qq fraudulent-email-corpus.zip

rm -rf enron-email-dataset.zip fraudulent-email-corpus.zip

kaggle.json
Downloading enron-email-dataset.zip to /content
 94% 338M/358M [00:04<00:00, 61.1MB/s]
100% 358M/358M [00:04<00:00, 78.6MB/s]
Downloading fraudulent-email-corpus.zip to /content
 91% 5.00M/5.52M [00:00<00:00, 26.9MB/s]
100% 5.52M/5.52M [00:00<00:00, 27.0MB/s]




In [6]:
def extract_messages(df):
  messages = []
  for item in df["message"]:
    # Return a message object structure from a string
    e = email.message_from_string(item)
    # get message body
    message_body = e.get_payload()
    messages.append(message_body)
  print("Successfully retrieved message body from e-mails!")
  return messages

## Preprocessing Email Spam Data

Here, we are interested in developing an algorithm that can detect whether any given email is spam or not, at scale. To do this, we will build a dataset from two separate sources – the popular Enron email corpus as a proxy for email that is not spam, and a collection of “419” fraudulent emails as a proxy for email that is spam.

We will view this as a supervised classification task, where we will first train a classifier on a collection of emails labeled as either spam or not spam. 

In particular, we will sample the Enron corpus, the largest public email collection, related to the notorious Enron financial scandal, as a proxy for emails that are not spam, and sample “419” fraudulent emails, representing the best-known type of spam, as a proxy for emails that are spam. Both datasets are openly available on Kaggle: the [Enron email dataset](https://www.kaggle.com/wcukierski/enron-email-dataset) and the [fraudulent email corpus](https://www.kaggle.com/rtatman/fraudulent-email-corpus).

The Enron corpus contains about half a million emails written by employees of the Enron Corporation, as collected by the Federal Energy Regulatory Commission for the purposes of investigating the collapse of the company. It has been used extensively in the literature to study machine learning methods for email applications and is often the first data source researchers working with emails look to for initial experimentation with algorithm prototypes. On Kaggle, it is available as a .csv file with one email per row (a `file` column and a `message` column). Note that this data is still cleaner than one can typically expect to find in many practical applications in the wild.

<img src='https://github.com/rahiakela/img-repo/blob/master/transfer-learning-for-natural-language-processing/spam-email-preprocessing.png?raw=1' width='800'/>

The body of the email will first be separated from the headers of the email, some statistics about the dataset will be teased out to get a sense for the data, stopwords will be removed from the email, and it will then be classified as either spam or not spam.

### Loading and Visualizing the Fraudulent Email Corpus

Let’s load the “419” fraudulent email corpus, so that we can have some example data in our training set representing the “spam” class.

> Since this dataset comes as a .txt file, versus a .csv, the preprocessing steps are slightly different. First
of all, we have to specify the encoding when reading the file as latin1, otherwise the default encoding option of
utf-8 will fail. It is often the case in practice that one needs to experiment with a number of different encodings,
with the aforementioned two being the most popular ones, to get some datasets to read correctly. Additionally,
note that because this .txt file is one big column of emails (with headers) separated by line breaks and white
space, and is not separated nicely into rows with one email per row – as was the case for the Enron corpus – we
can’t use Pandas to neatly load it as was done before. We will read all the emails into a single string, and split
the string on a code word that appears close to the beginning of each email’s header, i.e., “From r”.

In [7]:
filepath = "./fradulent_emails.txt"
with open(filepath, "r", encoding="latin1") as file:
  data = file.read()

Split on the code word `From r`, which appears close to the beginning of each email:

In [8]:
fraud_emails = data.split("From r")
print("Successfully loaded {} spam emails!".format(len(fraud_emails)))

Successfully loaded 3978 spam emails!
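To make the split-then-parse step concrete, here is a minimal sketch on a hypothetical two-message string (the addresses, dates and bodies are made up for illustration and are not taken from the corpus). It assumes each message starts with a `From r` line, as described above, and shows why the first chunk produced by `split` is not a real email.

```python
import email

# Hypothetical mbox-style text with two messages; NOT real corpus content.
raw = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "From: alice@example.com\n"
    "Subject: hello\n"
    "\n"
    "First body\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "From: bob@example.com\n"
    "Subject: again\n"
    "\n"
    "Second body\n"
)

chunks = raw.split("From r")

# chunks[0] holds whatever preceded the first delimiter (an empty string here),
# so only chunks[1:] correspond to actual messages.
bodies = [email.message_from_string(chunk).get_payload() for chunk in chunks[1:]]

print(len(chunks))
print(bodies)
```

The leftover first chunk is the reason the first element is skipped when the bodies are later placed into a DataFrame.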


Now that the fraudulent data is loaded as a list, we can convert it into a Pandas DataFrame.

In [9]:
fraud_bodies = extract_messages(pd.DataFrame(fraud_emails, columns=["message"], dtype=str))
fraud_bodies_df = pd.DataFrame(fraud_bodies[1:])

fraud_bodies_df.head()

Successfully retrieved message body from e-mails!


Unnamed: 0,0
0,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...
1,"Dear Friend,\n\nI am Mr. Ben Suleman a custom ..."
2,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...
3,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...
4,"Dear sir, \n \nIt is with a heart full of hope..."


### Loading and Visualizing the Enron Corpus

The first thing we need to do is load the data with the popular Pandas library, and to take a peek at a slice of the data to make sure we have a good sense of what it looks like.

In [10]:
filepath = "./emails.csv"

# Read the enron data into a pandas.DataFrame called emails
emails = pd.read_csv(filepath)
print("Successfully loaded {} rows and {} columns!".format(emails.shape[0], emails.shape[1]))
print(emails.head())

Successfully loaded 517401 rows and 2 columns!
                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


In [11]:
# take a closer look at the first email
print(emails.loc[0]["message"])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 


We see that the messages are contained within the `message` column of the resulting DataFrame. The extra fields at the beginning of each message (Message-ID, To, From, etc.) are referred to as the message’s header information, or simply the header.

Traditional spam classification methods derive features from the header information for classifying the message as spam or not. Here, we would like to perform the same task based on the content of the message only. One possible motivation for this approach is the fact that email training data may often be de-identified in practice due to privacy concerns and regulations, thereby making header info unavailable. Thus, we need to separate the headers from the messages in our dataset.

In [12]:
bodies = extract_messages(emails)

Successfully retrieved message body from e-mails!


In [13]:
# We then can display some processed emails
bodies_df = pd.DataFrame(bodies)
print(bodies_df.head())

                                                   0
0                          Here is our forecast\n\n 
1  Traveling to have a business meeting takes the...
2                     test successful.  way to go!!!
3  Randy,\n\n Can you send me a schedule of the s...
4                Let's shoot for Tuesday at 11:45.  


In [14]:
# extract random 10000 enron email bodies for building dataset
bodies_df = pd.DataFrame(random.sample(bodies, 10000))

# expand default pandas display options to make emails more clearly visible when printed
pd.set_option("display.max_colwidth", 300)
# you could do print(bodies_df.head()), but Jupyter displays this nicer for pandas DataFrames
bodies_df.head()

Unnamed: 0,0
0,"hotel info included at the bottom\n---------------------- Forwarded by Kay Mann/Corp/Enron on 02/26/2001 05:03 \nPM ---------------------------\n\n\nLaura R Pena@ECT\n02/26/2001 05:00 PM\nTo: Jeffery Ader/HOU/ECT@ECT, Heather Kroll/HOU/ECT@ECT, Kay \nMann/Corp/Enron@Enron, Steve Montovano/NA/Enr..."
1,"Drew, here's my first stab at a rehearing argument. I did not go as far as I \ncould have because I really think we should argue that there should be no \nlimitation at all and that TW should be able to decide for itself what \ncapacity it can sell as firm. Instead, I've proposed a solution th..."
2,"From Midland\n\ngo west on I-20\ntake the first Monahans exit (should be Business 20 exit)\ngo thru town, thru the one stop light\n4 blocks past the light is a Kent Tire Center on the left\n\nTW office is behind the Tire Center\n\naddress: 109 South Gail Street\nphone: 915-943-3297\n\nKevin Hyat..."
3,"Mark, my question has been answered, so you can disregard my previous e-mail request to you for information. Jim"
4,"17:56:06 Synchronizing Mailbox 'Perlingiere, Debra'\n17:56:06 Synchronizing Hierarchy\n17:56:06 Synchronizing Favorites\n17:56:06 Synchronizing Folder 'Inbox'\n17:56:13 \t 7 item(s) added to offline folder\n17:56:13 \t 2 item(s) updated in offline folder\n17:56:13 \t 9 item(s) deleted in o..."


The following (commented out) code is arguably the more "pythonic" way of achieving the extraction of bodies from messages. It is only 2 lines long and achieves the same result.

In [15]:
#messages = emails["message"].apply(email.message_from_string)
#bodies_df = messages.apply(lambda x: x.get_payload()).sample(10000)

### Email text preprocessing

Having loaded both datasets, we are now ready to sample emails from each one into a single DataFrame that will represent the overall dataset covering both classes of emails. Before doing this, we must decide how many samples to draw from each class. Ideally, the number of samples in each class will represent the natural distribution of emails in the wild, i.e., if we expect our classifier to encounter 60% spam emails and 40% nonspam emails when deployed, then a ratio such as 600 to 400 respectively might make sense.

**Note that with a severe imbalance in the data, such as 99% nonspam and 1% spam, a classifier may overfit and simply predict nonspam most of the time, an issue that needs to be considered when building datasets.** Since this is an idealized experiment, and we do not have any information on natural distributions of classes, we will assume a 50/50 split.
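As a rough illustration of why imbalance matters, the sketch below uses a hypothetical 99%/1% label distribution and a degenerate "classifier" that always answers nonspam. The numbers are made up for the example; the point is only that accuracy alone can look excellent while no spam is caught at all.

```python
# Hypothetical label distribution: 990 nonspam (0) and 10 spam (1) emails.
labels = [0] * 990 + [1] * 10

# A degenerate "classifier" that ignores its input and always answers nonspam.
predictions = [0] * len(labels)

# Fraction of all emails labeled correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of the actual spam emails that were flagged as spam.
spam_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
spam_recall = spam_caught / sum(labels)

print(f"accuracy = {accuracy:.2%}, spam recall = {spam_recall:.2%}")
```

With a 50/50 split, the same always-nonspam strategy would score only 50% accuracy, which makes this failure mode much easier to spot.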

We also need to give some thought to how we are going to tokenize the emails, i.e., split emails into subunits of text - words, sentences, etc. To start off, we will tokenize into words, as this is the most common approach. 

We must also decide the maximum number of tokens per email, and the maximum length of each token, to ensure that the occasional “extremely long” email does not bog down the performance of our classifier. 

We do all these by specifying the following general hyperparameters, which will later be tuned experimentally to enhance performance as needed:

In [16]:
n_sample = 1000   # number of samples to generate in each class - 'spam', 'not spam'
maxtokens = 200    # the maximum number of tokens per document
maxtokenlen = 100  # the maximum length of each token

With these hyperparameters specified, we can now create a single DataFrame for the overarching training dataset. Let’s take the opportunity to also perform remaining preprocessing tasks, namely removing stop words, punctuations and tokenizing.

#### Tokenization

Let’s proceed by defining a function to tokenize emails by splitting them into words.

In [17]:
def tokenize(row):
  if row is None or row == "":  # compare by value; `is ""` tests object identity and is unreliable
    tokens = ""
  else:
    tokens = str(row).split(" ")[:maxtokens]
  return tokens

#### Remove punctuation and unnecessary characters

Taking another look at the emails displayed above, we see that they contain a lot of punctuation characters, and the spam emails tend to be capitalized. 

**In order to ensure that classification is done based on language content only, we have to remove punctuation marks and other non-word characters from the emails.** We do this by employing regular expressions with Python's built-in `re` module. The pattern used below also strips digits. We also normalize words by turning them into lower case.

In [18]:
def reg_expressions(row):
  tokens = []
  try:
    for token in row:
      token = token.lower()          # make all characters lower case
      token = re.sub(r"[\W\d]", "", token)
      token = token[:maxtokenlen]    # truncate all tokens to hyperparameter maxtokenlen
      tokens.append(token)
  except (TypeError, AttributeError):  # row is not iterable, or a token is not a string
    token = ""
    tokens.append(token)
  return tokens

#### Stop-word removal

Finally, let’s define a function to remove stopwords - words that occur so frequently in language that they offer no useful information for classification. This includes words such as “the” and “are”, and the popular library NLTK provides a heavily used list that we will employ.

In [19]:
stop_words = stopwords.words("english")

def stop_word_removal(row):
  token = [token for token in row if token not in stop_words]
  token = list(filter(None, token))  # drop empty tokens; return a list rather than a lazy filter object

  return token
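The three preprocessing steps can be tried end to end on a single sentence. The sketch below is self-contained and makes a few assumptions for illustration: it uses a tiny hand-written stop-word set instead of the NLTK list, the helper names are hypothetical, and it lowercases before filtering stop words so that capitalized stop words such as "The" are also caught. That order differs from the one used in the assembly cell of this notebook.

```python
import re

MAXTOKENS, MAXTOKENLEN = 200, 100
STOP_WORDS = {"the", "are", "is", "a", "to"}  # tiny stand-in for NLTK's English list

def split_into_tokens(text):
    # whitespace tokenization, truncated to MAXTOKENS tokens
    return str(text).split(" ")[:MAXTOKENS]

def normalize(tokens):
    # lowercase, strip non-word characters and digits, truncate long tokens
    return [re.sub(r"[\W\d]", "", t.lower())[:MAXTOKENLEN] for t in tokens]

def remove_stop_words(tokens):
    # drop empty strings and stop words
    return [t for t in tokens if t and t not in STOP_WORDS]

text = "The funds are READY to transfer, call 555-1234!"
tokens = remove_stop_words(normalize(split_into_tokens(text)))
print(tokens)
```

The phone number disappears entirely because every one of its characters is either a digit or a non-word character.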

### Assemble both Datasets

We are now going to put all these functions together to build the single dataset representing both classes. Most methods expect this dataset to be a Numpy array in order to process it, so we convert it to that form after combining the emails.

Now, putting all the preprocessing steps together we assemble our dataset...

In [20]:
# Convert everything to lower-case, truncate to maxtokens and truncate each token to maxtokenlen

# Apply predefined processing functions
enron_emails = bodies_df.iloc[:, 0].apply(tokenize)
enron_emails = enron_emails.apply(stop_word_removal)
enron_emails = enron_emails.apply(reg_expressions)
# sample the right number of emails from each class.
enron_emails = enron_emails.sample(n_sample)

# Apply predefined processing functions
spam_emails = fraud_bodies_df.iloc[:, 0].apply(tokenize)
spam_emails = spam_emails.apply(stop_word_removal)
spam_emails = spam_emails.apply(reg_expressions)
# sample the right number of emails from each class.
spam_emails = spam_emails.sample(n_sample)

# convert to Numpy array
raw_data = pd.concat([enron_emails, spam_emails], axis=0).values

Now, let’s take a peek at the result to make sure things are proceeding as expected:

In [21]:
print("Shape of combined data is:", raw_data.shape)
print("Data is:")
print(raw_data)

Shape of combined data is: (2000,)
Data is:
[list(['peggyis', '', 'points', 'eau', 'claire', 'sheraton', 'im', 'trying', 'book', 'room', 'tuesday', 'eveningthankspatti', 'xpeggy', 'hedstrom', '', 'amto', 'sally', 'beckhouectectcc', 'subject', 'operational', 'risk', 'meetingwe', 'set', 'meeting', 'operational', 'risk', 'february', 'th', '', 'pm', 'we', 'room', 'booked', 'sheraton', 'set', 'similar', 'houston', 'meeting', 'i', 'invited', 'three', 'people', 'accounting', 'department', 'to', 'attend', 'well', 'laura', 'scott', 'cheryl', 'dawes', 'dave', 'hanslip', 'attendees', 'accounting', 'everyone', 'looking', 'forward', 'meeting', 'to', 'discuss', 'issue', 'so', 'far', 'weather', 'still', 'mild', 'talk', 'soonpeggy'])
 list(['everyone', 'list', '', 'likely', 'actively', 'involved', 'negotiating', 'swap', 'docs', 'highlghted', 'bluebill', 'bradforddebbie', 'bracketttanya', 'rohauermolly', 'harrissue', 'vasanpaul', 'radousjane', 'wilhitewendy', 'conwelled', 'sackstracy', 'ngojay', 'willi

We see that the resulting array has divided the text into word units, as intended.

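Let’s create the labels corresponding to these emails. Because `raw_data` was built by concatenating the Enron (non-spam) emails first and the fraudulent (spam) emails second, the labels must follow the same order: `n_sample=1000` non-spam labels (0) followed by `n_sample=1000` spam labels (1):

In [22]:
categories = ["spam", "notspam"]
# label order must match the pd.concat order above: Enron (not spam = 0) first, then fraud (spam = 1)
header = ([0] * n_sample)
header.extend(([1] * n_sample)) 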

We are now ready to convert this Numpy array into numerical features that can actually be fed to the algorithms for classification.

### Converting the data to the form expected by Bert

Before training a BERT model, we will need to adapt our preprocessed data a bit for this model architecture.

We use the below function to combine each such list into a single text string. This is the format in which the BERT TensorFlow hub model expects the input, and we are glad to oblige.

> **NOTE**: The combined string in this case has stopwords removed – steps that are often not required in deep learning practice due to the uncanny ability of artificial neural networks to figure out what is important and isn’t,
i.e., feature engineering, automatically. In our case, since we are trying to compare the strengths and weaknesses of the different model types for this problem, applying the same kind of preprocessing for all algorithms makes sense and is arguably the right approach. We note however that ELMo was pretrained on a corpus containing stopwords, as was BERT.

With the dataset assembled, we must remember that it is not shuffled with respect to classes, i.e., it contains `n_sample = 1000` non-spam emails followed by an equal number of spam emails. Depending on how this dataset is split, in our case by picking the first 70% for training and the remainder for testing, this could lead to a badly skewed split: the training set would hold all 1000 emails of one class but only 400 of the other, and the test set would contain a single class only, which would obviously make the evaluation meaningless. In order to create a randomized ordering of class samples in the dataset, we will need to shuffle the data in unison with the header/list of labels.


In [23]:
# shuffle raw data first
def unison_shuffle_data(data, header):
  p = np.random.permutation(len(header))
  data = data[p]
  header = np.asarray(header)[p]

  return data, header
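The following sketch illustrates the problem that shuffling solves, using placeholder strings instead of real emails. It is written with the standard library only (an assumption made to keep it self-contained; the notebook itself uses a NumPy permutation), and all names in it are hypothetical.

```python
import random

n_per_class = 1000
labels = [0] * n_per_class + [1] * n_per_class       # class-ordered, like the unshuffled dataset
data = [f"email-{i}" for i in range(len(labels))]    # placeholder documents

cut = len(labels) * 7 // 10                          # index separating the first 70% from the rest

# Without shuffling, the held-out 30% contains a single class only.
unshuffled_test_classes = set(labels[cut:])
print("classes in unshuffled test split:", unshuffled_test_classes)

# Apply ONE permutation to both lists so that every document keeps its own label.
rng = random.Random(0)                               # fixed seed for reproducibility
perm = list(range(len(labels)))
rng.shuffle(perm)
shuffled_data = [data[i] for i in perm]
shuffled_labels = [labels[i] for i in perm]

shuffled_test_classes = set(shuffled_labels[cut:])
print("classes in shuffled test split:", shuffled_test_classes)
```

Shuffling the data and the labels with two independent permutations would silently scramble the document-to-label pairing, which is why a single shared permutation is used.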

In [24]:
# we expect a single string per email here, versus a list of tokens for the sklearn models previously explored
def convert_data(raw_data, header):
  converted_data, labels = [], []
  for i in range(raw_data.shape[0]):
    # combine list of tokens representing each email into single string
    out = " ".join(raw_data[i])
    converted_data.append(out)
    labels.append(header[i])
  converted_data = np.array(converted_data, dtype=object)[:, np.newaxis]

  return converted_data, np.array(labels)

As the very last step of preparing the email dataset for training by our baseline classifiers, we split it into independent training and testing or validation sets. This will allow us to evaluate the performance of the classifier on a set of data that was not used for training, an important thing
to ensure in machine learning practice. We elect to use 70% of the data for training, and 30% for testing/validation afterwards.

In [25]:
raw_data, header = unison_shuffle_data(raw_data, header)

# split into independent 70% training and 30% testing sets
idx = int(0.7 * raw_data.shape[0])  # get 70% index value

# 70% of data for training
train_x, train_y = convert_data(raw_data[:idx], header[:idx])

# remaining 30% for testing
test_x, test_y = convert_data(raw_data[idx:], header[idx:])

print("train_x/train_y list details, to make sure they are of the right form:")
print(len(train_x))
print(train_x)
print(len(train_y))
print(train_y[:5])

train_x/train_y list details, to make sure they are of the right form:
1400
[['emailmessagemessage object xfeef']
 ['my dearestwith due regard respect i emailing although wrong seeksuch assistance without prior knowledge person involved i hope iam mistaking i also believe end day willnever betray i family projecti princess fayad w bolkiah wife prince jefri bolkiah formerfinance minister brunei tiny oilrich sultanate gulfisland i mean amplify extended royal family history whichhas already disseminated international media thecontroversial dispute erupted husband stepbrotherthe sultan brunei his majesty sultan hassanal bolkiah muizzadinwaddaulahas may know international media sultan his majestysultan hassanal bolkiah muizzadin waddaulah accused husband offinancial mismanagement impropriety us million dollars thiswas result asian financial crisis made husbandscompany amedeo development company government owned brunei investmentcompany declared bankrupt tenure office however myhusband jail 

Since 70% of 2000 is 1400, looks good! (for n_sample=1000)

## Neural Network Models

Neural networks are arguably the most important class of machine learning algorithms for handling perceptual problems such as computer vision and NLP.

We will train two representative pretrained neural network language models on the two illustrative example problems we have been baselining.

The two models we will consider here are:

- **ELMo** – Embeddings from Language Models, and
- **BERT** – Bidirectional Encoder Representations from Transformers.

ELMo combines convolutional and recurrent (specifically LSTM) elements, while the appropriately named BERT is transformer-based.

The simplest form of transfer learning fine-tuning will be employed, where a single dense classification layer is trained on top of the corresponding pretrained embedding over our dataset of labels.


### Bidirectional Encoder Representations from Transformers (BERT)

The **Bidirectional Encoder Representations from Transformers (BERT)** model was also named after a popular Sesame Street character, as a nod to the trend started by ELMo. BERT variants achieve some of the best performance in transferring pretrained language model knowledge to downstream NLP tasks. The model was similarly trained to predict words in a sequence of words, although the exact masking procedure is somewhat different. This training can also be done in an unsupervised manner on very large corpora, and the resulting weights similarly generalize to a variety of other NLP tasks. **Arguably, to become familiar with transfer learning in NLP, it is indispensable to become familiar with BERT.**

It will suffice to mention here that the model splits text into WordPiece subword tokens and looks up learned embeddings for them, followed by transformer-based encoders with self-attention layers that provide the model with the context of surrounding words. **The transformer functionally replaces the bidirectional LSTMs employed by ELMo. Transformers also have some advantages over LSTMs with respect to training scalability.** Again, we will use Keras with a TensorFlow backend to build our model.


### Define BERT layer

The BERT model is also available through the Tensorflow Hub. In order to
make the hub model usable by Keras, we similarly define a custom Keras layer that instantiates it in the right format.

In [26]:
class BertLayer(tf.keras.layers.Layer):

  def __init__(self, n_fine_tune_layers=10, pooling="mean", bert_path="https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1", **kwargs):
    self.n_fine_tune_layers = n_fine_tune_layers    # Default number of top layers to unfreeze for training
    self.trainable = True
    self.output_size = 768      # BERT embedding dimension, i.e., size of resulting output semantic vectors
    self.pooling = pooling      # Pooling type: "first" (pooled [CLS] output) or "mean" (masked average of token vectors)
    self.bert_path = bert_path  # Pretrained model to use; the default is the uncased BERT-Base model (12 layers, hidden size 768, 12 heads)

    if self.pooling not in ["first", "mean"]:
      raise NameError(f"Undefined pooling type (must be either first or mean, but is {self.pooling})")

    super(BertLayer, self).__init__(**kwargs)

  def build(self, input_shape):
    """function for building BERT embedding"""
    # Download pretrained BERT model from Tensorflow Hub
    self.bert = hub.Module(self.bert_path, trainable=self.trainable, name=f"{self.name}_module") 

    # Remove unused layers
    trainable_vars = self.bert.variables
    if self.pooling == "first":
      trainable_vars = [var for var in trainable_vars if not "/cls/" in var.name]
      trainable_layers = ["pooler/dense"]
    elif self.pooling == "mean":
      trainable_vars = [var for var in trainable_vars if not "/cls/" in var.name and not "/pooler/" in var.name]
      trainable_layers = []
    else:
      raise NameError(f"Undefined pooling type (must be either first or mean, but is {self.pooling})")
    
    # Select how many layers to fine tune, counting down from the top encoder layer (layer_11)
    for i in range(self.n_fine_tune_layers):
      trainable_layers.append(f"encoder/layer_{str(11 - i)}")

    # Update trainable vars to contain only the specified layers
    trainable_vars = [var for var in trainable_vars if any([layer in var.name for layer in trainable_layers])]

    # Add to trainable weights
    for var in trainable_vars:
      self._trainable_weights.append(var)

    for var in self.bert.variables:
      if var not in self._trainable_weights:
        self._non_trainable_weights.append(var)

    super(BertLayer, self).build(input_shape)

  def call(self, inputs):
    """specify function for calling embedding"""
    inputs = [K.cast(x, dtype="int32") for x in inputs]
    input_ids, input_mask, segment_ids = inputs
    # Inputs to BERT take a very specific triplet form
    bert_inputs = dict(
        input_ids = input_ids,
        input_mask = input_mask,
        segment_ids = segment_ids
    )
    
    if self.pooling == "first":
      pooled = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)["pooled_output"]
    elif self.pooling == "mean":
      result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)["sequence_output"]
      # Average the token vectors over real (non-padding) positions only, using the input mask
      mul_mask = lambda x, m: x * tf.expand_dims(m, axis=-1)
      masked_reduce_mean = lambda x, m: tf.reduce_sum(mul_mask(x, m), axis=1) / (tf.reduce_sum(m, axis=1, keepdims=True) + 1e-10)
      input_mask = tf.cast(input_mask, tf.float32)
      pooled = masked_reduce_mean(result, input_mask)
    else:
      raise NameError(f"Undefined pooling type (must be either first or mean, but is {self.pooling})")

    return pooled

  def compute_output_shape(self, input_shape):
    """specify output shape"""
    return (input_shape[0], self.output_size)
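The "mean" pooling branch of `call` averages token vectors while ignoring padded positions. The sketch below reproduces that idea in plain Python on a tiny hand-made example; the function name, the two-dimensional vectors and the large padding values are assumptions chosen only to make the effect visible, and this is not the TensorFlow implementation used by the layer.

```python
def masked_mean(token_vectors, mask, eps=1e-10):
    """Average the vectors whose mask entry is 1; padded positions are ignored."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, m in zip(token_vectors, mask):
        for j in range(dim):
            total[j] += vec[j] * m          # padded rows are multiplied by 0
    count = sum(mask) + eps                 # eps guards against division by zero
    return [t / count for t in total]

# Two real tokens followed by one padded position (hypothetical values).
sequence_output = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
input_mask = [1, 1, 0]

pooled = masked_mean(sequence_output, input_mask)

# For comparison: a plain average that wrongly includes the padded row.
naive = [sum(col) / len(col) for col in zip(*sequence_output)]

print("masked mean:", pooled)
print("naive mean: ", naive)
```

Because emails are padded to a common length, averaging without the mask would let padding dominate the representation of short emails.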

### Build BERT model

Similarly to what we did for ELMo, we performed a sequence of analogous post-processing steps on the data from the prior sections to put it into the format required by the BERT model. In addition to what was done to concatenate the bag-of-words token representations into a list of strings, we subsequently need to convert each concatenated string into 3 arrays – **input ids, input masks and segment ids** – prior to feeding them to the BERT model.

In [29]:
def build_model(max_seq_length):
  input_ids = tf.keras.layers.Input(shape=(max_seq_length, ), name="input_ids")
  input_masks = tf.keras.layers.Input(shape=(max_seq_length, ), name="input_masks")
  segment_ids = tf.keras.layers.Input(shape=(max_seq_length, ), name="segment_ids")
  bert_inputs = [input_ids, input_masks, segment_ids]

  # We do not retrain any BERT layers, but rather use the pretrained model as an embedding and retrain some new
  # layers on top of it so just extract BERT features, don't fine-tune
  bert_output = BertLayer(n_fine_tune_layers=0)(bert_inputs)

  # train dense classification layer on top of extracted features
  dense = tf.keras.layers.Dense(256, activation="relu")(bert_output)      # new layer outputting 256-dimensional feature vectors
  prediction = tf.keras.layers.Dense(1, activation="sigmoid")(dense)

  # a single sigmoid output unit handles this binary (spam / not spam) problem;
  # a 2-unit softmax with sparse_categorical_crossentropy would be an alternative formulation
  model = tf.keras.models.Model(inputs=bert_inputs, outputs=prediction)
  # binary_crossentropy works directly with the 0/1 integer labels, so no one-hot encoding is needed
  model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

  model.summary()

  return model
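The model above is compiled with the `binary_crossentropy` loss on a single sigmoid output. As a rough illustration of what that loss measures, here is a plain-Python sketch of the standard binary cross-entropy formula. The function and the clipping constant are assumptions for illustration, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    losses = []
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)       # clip to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

labels = [1, 0, 1]                          # hypothetical spam / not spam labels

# Predicted spam probabilities that agree with the labels...
confident = binary_crossentropy(labels, [0.9, 0.1, 0.9])
# ...and probabilities that confidently contradict them.
wrong = binary_crossentropy(labels, [0.1, 0.9, 0.1])

print(f"loss when mostly right: {confident:.4f}")
print(f"loss when mostly wrong: {wrong:.4f}")
```

The loss grows quickly as the predicted probability moves away from the true label, which is what pushes the dense layers toward confident, correct outputs during training.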

In [30]:
# initialize tensorflow variables correctly
def initialize_vars(sess):
  sess.run(tf.local_variables_initializer())
  sess.run(tf.global_variables_initializer())
  sess.run(tf.tables_initializer())
  K.set_session(sess)

Next, let's define the helper classes and functions that convert raw text into the inputs the BERT model expects.

In [31]:
class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    """
    Constructs an InputExample.
    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only needs to be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train examples, but not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label

In [32]:
def create_tokenizer_from_hub_module(bert_path):
  """Get the vocab file and casing info from the Hub module."""
  bert_module = hub.Module(bert_path)
  tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
  # NOTE: relies on the global TF1 session `sess` having been created before this function is called
  vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"], tokenization_info["do_lower_case"]])

  return FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

In [33]:
def convert_single_example(tokenizer, example, max_seq_length=256):
  """Converts a single `InputExample` into a single `InputFeatures`."""
  tokens_a = tokenizer.tokenize(example.text_a)
  if len(tokens_a) > max_seq_length - 2:
    tokens_a = tokens_a[0: (max_seq_length - 2)]

  tokens = []
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")
  segment_ids.append(0)

  input_ids = tokenizer.convert_tokens_to_ids(tokens)

  # The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
  input_mask = [1] * len(input_ids)

  # Zero-pad up to the sequence length.
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

  assert len(input_ids) == max_seq_length
  assert len(input_mask) == max_seq_length
  assert len(segment_ids) == max_seq_length

  return input_ids, input_mask, segment_ids, example.label
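
To make the padding and masking concrete, here is a minimal, library-free sketch of the same steps. The `toy_vocab` dictionary, the `toy_convert` helper, and the whitespace tokenization are hypothetical stand-ins for the real WordPiece `FullTokenizer`; only the `[CLS]`/`[SEP]` wrapping, truncation, and zero-padding logic mirror `convert_single_example`.

```python
# Hypothetical toy vocabulary; the real ids come from BERT's vocab file.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "free": 3, "money": 4, "now": 5}


def toy_convert(text, max_seq_length=8):
    """Mirror the [CLS]/[SEP] wrapping, truncation and zero-padding steps."""
    # Whitespace split stands in for WordPiece; leave room for [CLS] and [SEP].
    tokens = text.lower().split()[: max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [toy_vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)                   # 1 marks a real token
    padding = [0] * (max_seq_length - len(input_ids))   # 0 marks a padding position
    segment_ids = [0] * max_seq_length                  # single-sequence task: all zeros
    return input_ids + padding, input_mask + padding, segment_ids


ids, mask, segments = toy_convert("free money now")
print(ids)       # [1, 3, 4, 5, 2, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

All three lists always have length `max_seq_length`, which is what lets the features be stacked into fixed-shape arrays later on.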

In [35]:
def convert_examples_to_features(tokenizer, examples, max_seq_length=256):
  """Convert a set of `InputExample`s to a list of `InputFeatures`."""
  input_ids, input_masks, segment_ids, labels = [], [], [], []
  for example in tqdm(examples, desc="Converting examples to features"):
    input_id, input_mask, segment_id, label = convert_single_example(tokenizer, example, max_seq_length)
    input_ids.append(input_id)
    input_masks.append(input_mask)
    segment_ids.append(segment_id)
    labels.append(label)

  return (np.array(input_ids), np.array(input_masks), np.array(segment_ids), np.array(labels).reshape(-1, 1))

In [36]:
def convert_text_to_examples(texts, labels):
  InputExamples = []
  for text, label in zip(texts, labels):
    # append to the list being built, not to the InputExample class itself
    InputExamples.append(InputExample(guid=None, text_a=" ".join(text), text_b=None, label=label))

  return InputExamples
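
Note that `" ".join(text)` implies each element of `texts` is a list of tokens rather than a raw string. The sketch below illustrates this under stated assumptions: the `Example` namedtuple is a stand-in for the `InputExample` class, `to_examples` is a hypothetical helper, and the tokens and labels are made up.

```python
from collections import namedtuple

# Stand-in for the InputExample class (assumption, for self-containment).
Example = namedtuple("Example", ["guid", "text_a", "text_b", "label"])


def to_examples(texts, labels):
    """Join each token list into one string and pair it with its label."""
    return [Example(None, " ".join(tokens), None, label)
            for tokens, label in zip(texts, labels)]


# Made-up tokenized emails; 1 and 0 are illustrative labels.
texts = [["win", "a", "free", "prize"], ["meeting", "at", "noon"]]
labels = [1, 0]
examples = to_examples(texts, labels)
print(examples[0].text_a)   # win a free prize

# Passing a raw string instead of a token list splits it into characters.
print(" ".join("noon"))     # n o o n
```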

### Train BERT model

We note that most of the trainable parameters in this case come from the layers we added on top of the pretrained BERT model; the `model.summary()` call in `build_model` reports the exact count. In other words, this is an instance of transfer learning: we learn a pair of new layers on top of the pretrained model shared by BERT's creators.
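
As a rough check on how many parameters the two added dense layers contribute, the count can be worked out by hand. This is only a sketch: `feature_dim = 768` is an assumption (the hidden size of BERT-Base), and `dense_params` is a hypothetical helper; the authoritative number is the one printed by `model.summary()`.

```python
def dense_params(n_in, n_out):
    """Number of weights plus biases in a fully connected layer."""
    return n_in * n_out + n_out


feature_dim = 768                         # assumption: BERT-Base feature size; check your Hub module
hidden = dense_params(feature_dim, 256)   # Dense(256, activation="relu")
output = dense_params(256, 1)             # Dense(1, activation="sigmoid")
print(hidden, output, hidden + output)    # 196864 257 197121
```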

In practice, one can increase the batch size until the speed of convergence of a typical problem instance no longer benefits from the increase, or until the GPU memory is no longer large enough to hold a single data batch during an iteration of the algorithm, whichever happens first. Additionally, in a multi-GPU scenario, some evidence has been presented that the optimal schedule for scaling up the batch size is linear in the number of GPUs.
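
The linear scaling rule mentioned above can be sketched as follows. The helper name `scaled_batch_size` and the per-GPU batch size of 32 are illustrative assumptions, not values taken from this notebook.

```python
def scaled_batch_size(per_gpu_batch_size, n_gpus):
    """Linear scaling: the global batch grows in proportion to the GPU count."""
    if per_gpu_batch_size < 1 or n_gpus < 1:
        raise ValueError("batch size and GPU count must be positive")
    return per_gpu_batch_size * n_gpus


# Hypothetical per-GPU batch size of 32, chosen so one batch fits in memory.
for n_gpus in (1, 2, 4):
    print(n_gpus, scaled_batch_size(32, n_gpus))
```

The per-GPU value is still bounded by what fits in a single device's memory, so it has to be found empirically first.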

In [None]:
df_history = pd.DataFrame(history.history)

fig,ax = plt.subplots()
plt.plot(range(df_history.shape[0]),df_history['val_acc'],'bs--',label='validation')
plt.plot(range(df_history.shape[0]),df_history['acc'],'r^--',label='training')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('BERT Email Classification Training')
plt.legend(loc='best')
plt.grid()
plt.show()
# Save figures
fig.savefig('BERTConvergence.eps', format='eps')
fig.savefig('BERTConvergence.pdf', format='pdf')
fig.savefig('BERTConvergence.png', format='png')
fig.savefig('BERTConvergence.svg', format='svg')

We see that a validation accuracy of approximately 98.83% is attained at the 4th epoch, i.e., in under a minute. This is comparable to the performance of the logistic regression approach, which reached 98.8%. We note that the behavior of the algorithm is stochastic, i.e., it behaves differently from run to run.

Finally, we note that the divergence of training and validation accuracies, as indicated in the figure, is suggestive of the beginning of overfitting. This lends credence to the hypothesis that increasing the amount of signal by increasing the length of tokens, as specified by the hyperparameter `maxtokenlen`, and the number of tokens per email, as specified by `maxtokens`, may increase performance further. Naturally, increasing the number of samples per class by cranking up `Nsamp` should also improve performance.

Each epoch again takes approximately 10 seconds and a validation accuracy of approximately 70% is achieved in under a minute at the 2nd epoch.

**Note that some evidence of overfitting can be observed at the 3rd and later epochs, as the training accuracy continues to improve, i.e., the fit to the data improves, while the validation accuracy remains lower.**
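
One simple way to quantify this divergence is to track the gap between training and validation accuracy per epoch and to pick the epoch with the best validation accuracy. The sketch below assumes a Keras-style history dictionary with `acc` and `val_acc` keys, as used in the plotting code; the helper names are hypothetical, and the numbers in `toy_history` are made up for illustration and are not results from this notebook.

```python
def best_epoch(history):
    """Return the 1-based epoch with the highest validation accuracy."""
    val = history["val_acc"]
    return max(range(len(val)), key=val.__getitem__) + 1


def generalization_gaps(history):
    """Training minus validation accuracy per epoch; a growing gap hints at overfitting."""
    return [round(a - v, 4) for a, v in zip(history["acc"], history["val_acc"])]


# Made-up accuracies for illustration only.
toy_history = {
    "acc":     [0.62, 0.71, 0.80, 0.88, 0.93],
    "val_acc": [0.60, 0.70, 0.69, 0.68, 0.68],
}
print(best_epoch(toy_history))           # 2
print(generalization_gaps(toy_history))  # [0.02, 0.01, 0.11, 0.2, 0.25]
```

In an actual run, `history.history` from `model.fit` could be passed in place of `toy_history`, and the best epoch could guide early stopping.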