## BERT

BERT [[1]](#fn1) is a successful neural architecture which excels at solving a variety of natural language processing tasks. Training an entire model from scratch requires a lot of computing resources, but luckily a variety of pre-trained BERT models are available which can be fine-tuned for particular downstream language processing tasks.

We use the `transformers` library and experiment with the provided pre-trained BERT models using the [IMDB Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/). The following libraries are required to run this notebook:

- transformers==3.5.1
- torch==1.7.0
- tensorflow==2.3.1
- gensim==3.8.3
- scikit-learn==0.23.2
- matplotlib==3.3.2
- tensorflow-datasets==3.2.1

----
<span id="fn1"> [1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019. </span>

----

The IMDB dataset can be downloaded from the internet, but it is also available in the `tensorflow-datasets` package, which is the version we use. First, we load the supervised part of the dataset and convert it from `tf.Tensor` into Python data structures. We use only the supervised part (i.e., reviews with ratings), which contains 25,000 samples in each of the training and test sets. A movie review is categorized as negative when its $score \le 4$ and as positive when $score \ge 7$; reviews with neutral ratings are skipped. In the example below, the `score` variable already holds the resulting binary label (0 for negative, 1 for positive).

In [1]:
import tensorflow_datasets
train, test = tensorflow_datasets.load('imdb_reviews', split=['train', 'test'], as_supervised=True)
train = [(text.decode('utf-8'), int(score)) for text, score in tensorflow_datasets.as_numpy(train)]
test = [(text.decode('utf-8'), int(score)) for text, score in tensorflow_datasets.as_numpy(test)]
X_train, y_train = [x[0] for x in train], [x[1] for x in train]
X_test, y_test = [x[0] for x in test], [x[1] for x in test]
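
The labelling rule described above can be made concrete with a small sketch. The function name `rating_to_label` is hypothetical and is not part of `tensorflow-datasets`, which already ships the binary labels; only the thresholds are taken from the text.

```python
from typing import Optional


def rating_to_label(score: int) -> Optional[int]:
    """Map a star rating (1-10) to a binary sentiment label.

    Returns 0 (negative) for score <= 4, 1 (positive) for score >= 7,
    and None for neutral ratings, which are skipped.
    """
    if score <= 4:
        return 0
    if score >= 7:
        return 1
    return None


ratings = [1, 4, 5, 6, 7, 10]
labels = [rating_to_label(r) for r in ratings]
print(labels)  # [0, 0, None, None, 1, 1]
```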

We first compute a baseline against which we can compare BERT. A typical baseline in text mining is a _tf-idf_ representation combined with a support vector machine for prediction. We keep the _tf-idf_ computation simple and efficient by using gensim helper functions. The test data is processed separately and is not used when computing the dictionary and _idf_ weights:

In [2]:
import gensim
def to_tfidf(documents, dic=None, tfidf_model=None):
    documents = [gensim.parsing.preprocessing.preprocess_string(doc) for doc in documents]
    if dic is None:
        dic = gensim.corpora.Dictionary(documents)
        dic.filter_extremes()
    bows = [dic.doc2bow(doc) for doc in documents]
    if tfidf_model is None:
        tfidf_model = gensim.models.tfidfmodel.TfidfModel(dictionary=dic)
    tfidf_vectors = tfidf_model[bows]
    return tfidf_vectors, dic, tfidf_model

X_train_tfidf, dic, tfidf_model = to_tfidf(X_train)
X_test_tfidf, _, __ = to_tfidf(X_test, dic, tfidf_model)
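
To illustrate what the _tf-idf_ weights look like, here is a minimal pure-Python sketch on a toy corpus. It assumes the weighting $tf \cdot \log_2(N / df)$ followed by normalization to unit length, which we believe corresponds to gensim's default settings; the toy documents and the helper name `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus of already preprocessed (tokenized) documents.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]

n_docs = len(docs)
# Document frequency: the number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return a unit-length {term: weight} vector for one document."""
    tf = Counter(doc)
    weights = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights
    return {t: w / norm for t, w in weights.items()}


vectors = [tfidf(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second document, the rare term `bad` receives a higher weight than the more common term `movie`, which is the intended effect of the _idf_ factor.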

In order to use gensim's sparse _tf-idf_ vectors with classifiers from scikit-learn, we convert them to scipy sparse matrices and transpose them so that documents become the rows of the matrix. We can then train the SVM classifier and use it for prediction. Finally, we compute the baseline accuracy, which turns out to be very good: the authors of the dataset reported an accuracy of 88.89%, less than 4 percentage points higher, even though they used extensive text preprocessing and also the unlabeled part of the data.

In [3]:
from sklearn.svm import LinearSVC
from sklearn import metrics

svc = LinearSVC()
svc.fit(gensim.matutils.corpus2csc(X_train_tfidf).T, y_train)
y_predicted = svc.predict(gensim.matutils.corpus2csc(X_test_tfidf).T)
print('Accuracy: {:.3f}'.format(metrics.accuracy_score(y_test, y_predicted)))

Accuracy: 0.851


Let's now focus on BERT. We first apply BERT's subword tokenizer, which encodes the input tokens as integers, inserts the special tokens, and computes the attention mask, which marks the positions the model should attend to (as opposed to padding).

In [4]:
from transformers import BertTokenizer
from pprint import pprint
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = X_train[0][:100]
tokenized = tokenizer(sentence, return_tensors='pt')  # return arrays as PyTorch tensors
pprint(sentence, compact=True)
pprint(tokenized, compact=True)

("This was an absolutely terrible movie. Don't be lured in by Christopher "
 'Walken or Michael Ironside. ')
{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[  101,  2023,  2001,  2019,  7078,  6659,  3185,  1012,  2123,  1005,
          1056,  2022, 26673,  1999,  2011,  5696,  3328,  2368,  2030,  2745,
          3707,  7363,  1012,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
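
For a single sentence the attention mask consists only of ones. It becomes informative when sequences of different lengths are padded to a common length. The sketch below shows the idea in plain Python; the helper `pad_batch` is hypothetical (in practice the tokenizer does this itself, e.g., when called with `padding=True`), and the token IDs are taken from the output above.

```python
PAD_ID = 0  # ID of the [PAD] token in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad ID sequences to the longest one and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask


batch = [
    [101, 2023, 2001, 6659, 102],  # [CLS] this was terrible [SEP]
    [101, 3185, 102],              # [CLS] movie [SEP]
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```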


To see the actual token strings instead of integer IDs we convert IDs to tokens:

In [5]:
pprint(tokenizer.convert_ids_to_tokens(tokenized['input_ids'].tolist()[0]), compact=True)

['[CLS]', 'this', 'was', 'an', 'absolutely', 'terrible', 'movie', '.', 'don',
 "'", 't', 'be', 'lured', 'in', 'by', 'christopher', 'walk', '##en', 'or',
 'michael', 'iron', '##side', '.', '[SEP]']
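
The splits `walk` + `##en` and `iron` + `##side` result from WordPiece's greedy longest-match-first strategy. The following sketch imitates it with a tiny, made-up vocabulary; it is a simplified illustration, not the tokenizer's actual implementation, and both `wordpiece` and `toy_vocab` are hypothetical names.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest matching vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink it step by step.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"walk", "##en", "iron", "##side", "lured"}
print(wordpiece("walken", toy_vocab))    # ['walk', '##en']
print(wordpiece("ironside", toy_vocab))  # ['iron', '##side']
print(wordpiece("xyz", toy_vocab))       # ['[UNK]']
```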


BERT can be fine-tuned for a variety of NLP tasks. Another use of BERT is feature extraction, where different combinations of layers are used to extract contextual word embeddings. Although this approach cannot compete with fine-tuned models in terms of predictive performance, the extracted vectors can be used for, e.g., visualization or similarity computations. Contextual embeddings are obtained by applying the model to the output of the tokenizer. By default, the model returns only the sequence of hidden states at the output layer, but the other hidden states can be obtained by setting the parameter `output_hidden_states=True`. The model then returns the 12 hidden states of the `bert-base` model along with the input embedding, 13 tensors in total. The shape of a hidden state is determined by the batch size (1), the number of tokens (24), and the hidden size of the BERT variant (768 for `bert-base`).

In [6]:
from transformers import BertModel
model = BertModel.from_pretrained('bert-base-uncased', return_dict=True, output_hidden_states=True)
output = model(**tokenized)
print(len(output.hidden_states))
print(output.last_hidden_state.shape)

13
torch.Size([1, 24, 768])


According to the authors of BERT, the best contextual token embeddings for the named entity recognition task are obtained by concatenating the last four hidden layers. The resulting embedding vectors can be used in the same way as, e.g., word2vec embeddings, with the additional benefit that the context is taken into account. The snippet below concatenates the last four layers for a single token into a 1D tensor of 4 × 768 = 3072 values.

In [7]:
import torch
token_id = 1 # word "this"
# hidden_states[-4], ..., hidden_states[-1] are the last four layers
# (index -0 would select hidden_states[0], i.e., the input embedding)
layers = [output.hidden_states[-i][0][token_id] for i in [4,3,2,1]]
embedding = torch.cat(layers)
print(embedding.shape)

torch.Size([3072])
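
One of the uses mentioned above is similarity computation. A common choice for comparing two embedding vectors is cosine similarity, sketched here in plain Python on toy 4-dimensional vectors that stand in for the 3072-dimensional embeddings. The vectors are made up for illustration; with PyTorch tensors, `torch.nn.functional.cosine_similarity(a, b, dim=0)` should give the same quantity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


emb_a = [0.2, -0.4, 0.1, 0.7]
emb_b = [0.4, -0.8, 0.2, 1.4]   # points in the same direction as emb_a
emb_c = [-0.7, 0.1, 0.4, 0.2]   # orthogonal to emb_a

print(round(cosine_similarity(emb_a, emb_b), 3))
print(round(cosine_similarity(emb_a, emb_c), 3))
```

Values close to 1 indicate vectors pointing in nearly the same direction, while values close to 0 indicate unrelated directions.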


The `transformers` package provides a ready-to-use pipeline for binary text classification called `sentiment-analysis`. It is based on DistilBERT, a variant of BERT which preserves 95% of BERT's performance while running 60% faster and having 40% fewer parameters. The version in the pipeline is fine-tuned for sequence classification (the sentiment of the text).

In [8]:
from transformers import pipeline
from transformers import BertForSequenceClassification, BertTokenizer, BertModel

sa_classifier = pipeline('sentiment-analysis')
type(sa_classifier.model)

transformers.modeling_distilbert.DistilBertForSequenceClassification

We run the pipeline on the test set and compute the accuracy. The DistilBERT model used in the pipeline limits its input to 512 tokens, which is too short for some documents. As the pipeline does not truncate the input automatically, we have to check and truncate it manually. Since BERT-based methods use subword tokenization, there is no easy way to find where to cut the original text to obtain a token sequence of length 512. We therefore limit all inputs to a conservative 1024 characters, which for ordinary English text should stay well below 512 tokens. This also speeds up the computation significantly with, hopefully, little loss of quality.

In [9]:
predicted_sentiment = []
MAX_CHARS = 1024  # conservative character limit to stay below 512 tokens
for doc in X_test:
    doc = doc[:MAX_CHARS]
    prediction = sa_classifier(doc)[0]  # the pipeline returns a list with one dict per input
    decision = 1 if prediction['label'] == 'POSITIVE' else 0
    predicted_sentiment.append(decision)
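
Cutting at a fixed character position may split the last word in the middle. A possible refinement, shown here only as a sketch and not used for the results in this notebook, is to cut at the last space before the limit. The helper name `truncate_at_word` is hypothetical.

```python
def truncate_at_word(text, max_chars):
    """Truncate text to at most max_chars characters without splitting a word."""
    if len(text) <= max_chars:
        return text
    # Look for the last space at or before position max_chars.
    cut = text.rfind(" ", 0, max_chars + 1)
    if cut <= 0:
        return text[:max_chars]  # no usable space: fall back to a hard cut
    return text[:cut]


review = "a terrible movie"
print(truncate_at_word(review, 10))  # a terrible
print(truncate_at_word(review, 8))   # a
```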

Using the predicted sentiment, we can evaluate the quality of this BERT-based classifier. It turns out that it is already better than our baseline classifier and very close to the best result reported by the authors of the dataset. Taking into account that the model was fine-tuned on different data, such a result is outstanding.

In [10]:
from sklearn import metrics
print('Accuracy: {:.3f}'.format(metrics.accuracy_score(y_test, predicted_sentiment)))

Accuracy: 0.872


Finally, we fine-tune BERT on the actual IMDB dataset. Because the details of the fine-tuning procedure are outside the scope of this notebook, we use the provided `bert_finetune_classification.py` script, which provides functions for basic fine-tuning. We can experiment with a number of parameters, e.g., the number of epochs, the maximal length of the input, the batch size, etc. We use the default values wherever possible and set the maximal length of the input to 128 in order to speed up the computation.

In [3]:
import bert_finetune_classification as bft
bft.set_seed(123)
data = {'X_train': X_train,
        'X_val': X_test,
        'y_train': y_train,
        'y_val': y_test}
inputs = bft.preprocess(data, max_len=128)
loaders = bft.make_dataloaders(data, inputs)
bert_classifier, optimizer, scheduler = bft.initialize_model(loaders, epochs=2)
bft.train(bert_classifier, optimizer, scheduler, loaders, epochs=2, evaluation=False)

There are 1 GPU(s) available.
Device name: GeForce RTX 2060 SUPER
Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |   300   |   0.375713   |     -      |     -     |  124.97  
   1    |   600   |   0.307289   |     -      |     -     |  126.03  
   1    |   781   |   0.288199   |     -      |     -     |   75.95  
----------------------------------------------------------------------


 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   2    |   300   |   0.168120   |     -      |     -     |  127.00  
   2    |   600   |   0.149514   |     -      |     -     |  126.97  
   2    |   781   |   0.143595   |     -      |     -     |   76.32  
----------------------------------------------------------------------


Training complete!


Using the fine-tuned model, we can predict the ratings for the reviews in the test set. The resulting probabilities are converted into 0/1 labels using 0.5 as the threshold. The accuracy of the fine-tuned model (89.3%) is slightly better than the best result reported by the authors of the dataset (88.89%). We believe that it can be improved further by optimizing the parameters of the model. Try it for yourself!

In [6]:
import numpy as np
predicted_probs = bft.bert_predict(bert_classifier, loaders['val_dataloader'])
y_predicted = np.where(predicted_probs[:,0] >= 0.5, 0, 1)
print('Accuracy: {:.3f}'.format(metrics.accuracy_score(y_test, y_predicted)))

Accuracy: 0.893
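
The thresholding step can be illustrated on a few made-up probability rows. The sketch assumes that each row holds the probabilities of class 0 and class 1 and that they sum to one, in which case thresholding the first column at 0.5 gives the same labels as taking the index of the larger probability. The values below are invented for illustration and are not outputs of the model.

```python
# Each row: [P(negative), P(positive)]; made-up values.
probs = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]

# Thresholding the probability of class 0, as in the cell above.
labels_threshold = [0 if p[0] >= 0.5 else 1 for p in probs]

# Equivalent formulation: index of the largest probability (ties go to class 0).
labels_argmax = [max(range(len(p)), key=p.__getitem__) for p in probs]

y_true = [0, 1, 1]
accuracy = sum(t == p for t, p in zip(y_true, labels_threshold)) / len(y_true)

print(labels_threshold)  # [0, 1, 0]
print(labels_argmax)     # [0, 1, 0]
print(round(accuracy, 3))
```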
