<a href="https://colab.research.google.com/github/victorviro/Deep_learning_python/blob/master/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In a previous [notebook](https://github.com/victorviro/Deep_learning_python/blob/master/NLP_Attention_and_Transformer_architecture.ipynb) we saw the attention module and the Transformer architecture. In this notebook, we dive into *Bidirectional Encoder Representations from Transformers* ([BERT](https://arxiv.org/abs/1810.04805)), a deep learning model that achieved state-of-the-art results on a wide variety of natural language processing tasks when it was released. Soon after publishing the paper, the team also open-sourced the code and released versions of the model that were already pre-trained on massive datasets.

BERT builds on top of several clever ideas that have been bubbling up in the NLP community recently – including but not limited to [Semi-supervised Sequence Learning](https://arxiv.org/abs/1511.01432), [ELMo](https://arxiv.org/abs/1802.05365), [ULMFiT](https://arxiv.org/abs/1801.06146), the [OpenAI transformer](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf), and the [Transformer](https://arxiv.org/abs/1706.03762).

BERT is basically a trained Transformer Encoder stack. You can check the previous [notebook](https://github.com/victorviro/Deep_learning_python/blob/master/NLP_Attention_and_Transformer_architecture.ipynb) where we explain the Transformer model.

The paper presents two model sizes for BERT:

- BERT BASE: Comparable in size to the OpenAI Transformer, so that their performance can be compared.
- BERT LARGE: A ridiculously huge model that achieved the state-of-the-art results reported in the paper.

![](https://i.ibb.co/rGx9Ytq/bert-models.png)

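Both BERT model sizes have a large number of encoder layers (which the paper calls *Transformer Blocks*): 12 for the base version and 24 for the large version. They also have larger feed-forward networks (768 and 1024 hidden units, respectively) and more attention heads (12 and 16, respectively) than the default configuration in the reference implementation of the Transformer in the original paper (6 encoder layers, 512 hidden units, and 8 attention heads).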
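
To get a feeling for these sizes, here is a rough sketch that counts the trainable parameters of a BERT encoder from its configuration. The vocabulary size (30522), the maximum sequence length (512), and the feed-forward inner size (4 times the hidden size) are assumptions taken from the released English uncased checkpoints, and the function name `bert_parameter_count` is just illustrative. The count covers the embeddings, the encoder layers, and the pooling layer, but not the pretraining heads.

```python
def bert_parameter_count(n_layers, hidden, vocab_size=30522,
                         max_positions=512, n_segments=2):
    """Approximate number of trainable parameters in a BERT encoder."""
    ff_size = 4 * hidden           # inner size of the feed-forward network
    layer_norm = 2 * hidden        # scale and shift vectors

    # Token, position and segment embedding tables plus one layer norm.
    embeddings = (vocab_size + max_positions + n_segments) * hidden + layer_norm

    # Query, key, value and output projections (weights + biases).
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ff_size -> hidden (weights + biases).
    feed_forward = (hidden * ff_size + ff_size) + (ff_size * hidden + hidden)
    encoder_layer = attention + layer_norm + feed_forward + layer_norm

    # Dense layer applied to the first position's output.
    pooler = hidden * hidden + hidden

    return embeddings + n_layers * encoder_layer + pooler


base = bert_parameter_count(n_layers=12, hidden=768)
large = bert_parameter_count(n_layers=24, hidden=1024)
print(f"BERT BASE:  {base:,}")   # 109,482,240
print(f"BERT LARGE: {large:,}")  # 335,141,888
```

Under these assumptions the totals land close to the 110M and 340M parameters quoted in the paper. Most of the weight sits in the encoder layers, and roughly two thirds of each layer belongs to the feed-forward network.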

Many current NLP models share this architecture of stacked Transformer blocks.

Just like the vanilla Transformer encoder, BERT takes a sequence of tokens as input, which flows up the stack. Each layer applies self-attention, passes the result through a feed-forward network, and hands it off to the next encoder. The output vector at each position is therefore a deeply contextualized representation of the corresponding word in the document.
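
The flow through the stack can be sketched in plain Python. This is a toy illustration, not BERT's implementation: it uses a single attention head, random untrained weights, no biases, ReLU instead of the GELU activation BERT uses, and tiny dimensions. All function names are made up for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-12):
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, per position."""
    return [layer_norm([a + b for a, b in zip(rx, rs)])
            for rx, rs in zip(x, sublayer_out)]


def encoder_layer(x, w):
    # Self-attention: every position attends to every other position.
    q, k, v = matmul(x, w["q"]), matmul(x, w["k"]), matmul(x, w["v"])
    k_transposed = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(q[0]))
    attn = [softmax([s / scale for s in row]) for row in matmul(q, k_transposed)]
    x = add_and_norm(x, matmul(attn, v))
    # Position-wise feed-forward network.
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w["ff1"])]
    return add_and_norm(x, matmul(hidden, w["ff2"]))


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, d_ff, n_layers = 4, 8, 32, 3
layers = [{"q": random_matrix(rng, d_model, d_model),
           "k": random_matrix(rng, d_model, d_model),
           "v": random_matrix(rng, d_model, d_model),
           "ff1": random_matrix(rng, d_model, d_ff),
           "ff2": random_matrix(rng, d_ff, d_model)} for _ in range(n_layers)]

# Stand-in for the embedded input tokens: one vector per position.
output = random_matrix(rng, seq_len, d_model)
for weights in layers:
    output = encoder_layer(output, weights)  # each layer feeds the next one

print(len(output), len(output[0]))  # one d_model-sized vector per position
```

The shape never changes on the way up: the stack takes one vector per token and returns one vector per token, each now mixed with information from all the other positions.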

How can we use it? There are two steps:

- First, pretrain a language model on a large dataset such as Wikipedia.

- Second, once the model is pretrained, apply it in another context, such as a text classification task (spam classification in the image below).



![](https://i.ibb.co/GQ55F4c/bert-usage.png)

# BERT pretraining

Finding the right task to train a stack of Transformer encoders is a complex hurdle that BERT resolves by adopting a "masked language model" objective. About 15% of the tokens in the text are selected at random and replaced by the `[MASK]` token. A "language model head" is appended to the output of the last BERT hidden layer, and training minimizes the cross-entropy loss of predicting the original word at each `[MASK]` position. The procedure resembles a typical foreign-language exercise in which a sentence is given with a missing word and you are asked which word fits best. The task makes sense because, to decide which word belongs at the mask, the model has to look at all the other words in the document, both to the left and to the right. In the example below, the context suggests that the word is "improvisation".

![](https://i.ibb.co/k8T3zDx/bert-pretaining.png)


Beyond masking 15% of the input, BERT also mixes things up a bit to improve how the model later fine-tunes. Of the selected tokens, 80% are replaced with `[MASK]`, 10% are replaced with a random word, and 10% are left unchanged. In every case the model is asked to predict the correct word at that position.
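
The corruption step can be sketched as follows. The 80% / 10% / 10% split between `[MASK]`, a random token, and the unchanged token comes from the BERT paper. Everything else is a simplification for illustration: tokens are whitespace-separated words rather than WordPiece tokens, each token is selected independently with probability 0.15, and the function name `mask_tokens` is made up.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token where a prediction is required
    and None everywhere else.
    """
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            corrupted.append(token)   # not selected: keep the token, no loss here
            labels.append(None)
            continue
        labels.append(token)          # selected: the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(token)               # 10%: leave unchanged
    return corrupted, labels


rng = random.Random(0)
tokens = "the band played a long jazz improvisation last night".split()
vocab = sorted(set(tokens))  # toy vocabulary; BERT uses about 30,000 WordPiece tokens
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels)
```

The cross-entropy loss is computed only at the positions where `labels` is not `None`. Because some selected tokens are left unchanged or swapped for a random one, the model cannot assume that a token it sees is correct just because it is not `[MASK]`.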

To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A, or not?

![](https://i.ibb.co/bgkZqpx/bert-pretaining2.png)
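
Building training pairs for this second task (called next sentence prediction in the paper) might look like the sketch below. Per the paper, half of the pairs use the true next sentence and half use a random one. The rest is simplified: the random sentence is drawn from the same small list instead of another document in the corpus, tokens are whitespace-separated words, and `make_nsp_examples` is an illustrative name.

```python
import random


def make_nsp_examples(sentences, rng):
    """Build (tokens, segment_ids, is_next) triples from consecutive sentences.

    Needs at least three sentences so a random alternative always exists.
    """
    examples = []
    for i in range(len(sentences) - 1):
        first = sentences[i].split()
        if rng.random() < 0.5:
            second, is_next = sentences[i + 1].split(), True
        else:
            # Any sentence that is neither A itself nor its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            second, is_next = rng.choice(candidates).split(), False
        tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
        # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
        segment_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
        examples.append((tokens, segment_ids, is_next))
    return examples


sentences = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]
examples = make_nsp_examples(sentences, random.Random(0))
for tokens, segment_ids, is_next in examples:
    print(is_next, " ".join(tokens))
```

The yes/no answer is predicted from the output at the `[CLS]` position, which is why that position later serves as a summary vector for sentence-level tasks.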

This pretraining process is computationally expensive, so in practice we download a model that someone else has already pretrained.

# BERT as transfer learning

BERT can be used as a transfer learning tool to obtain better embeddings for our documents. We can train a small network on BERT's outputs for our particular task. For example, for a spam classifier:

![](https://i.ibb.co/8rtrGZc/BERT-spam-classifier.png)

Each position outputs a vector. For this sentence classification example, we focus only on the output of the first position (the one where we passed the special `[CLS]` token). This vector can now be used as the input to a classifier of our choosing. The paper achieves strong results using just a single-layer neural network as the classifier.
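
As a minimal sketch of that last step, the snippet below picks the vector at the `[CLS]` position and feeds it to a single-layer binary classifier (a dot product, a bias, and a sigmoid). The hidden states, weights, and bias are made-up numbers with 4 dimensions so the arithmetic stays readable. A real BERT BASE output has 768 dimensions per position, and the weights would be learned during fine-tuning.

```python
import math


def classify_cls(cls_vector, weights, bias):
    """Single-layer binary classifier: sigmoid(w . h + b)."""
    logit = sum(w * h for w, h in zip(weights, cls_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Stand-in for BERT's output: one vector per input position.
hidden_states = [
    [0.2, -1.0, 0.5, 0.1],   # position 0: [CLS]
    [0.9, 0.3, -0.4, 0.0],   # position 1: first word
    [-0.7, 0.8, 0.2, 0.6],   # position 2: second word
]

cls_vector = hidden_states[0]        # only the first position is used
weights = [0.5, -0.25, 1.0, 0.0]     # hypothetical classifier weights
bias = -0.1

p_spam = classify_cls(cls_vector, weights, bias)
print(f"P(spam) = {p_spam:.3f}")
```

The outputs at the other positions are simply ignored for this task. They become useful in token-level tasks such as named entity recognition or question answering.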

The BERT paper shows several ways to use BERT for different tasks.

![](https://i.ibb.co/s60fg8Z/use-cases-BERT.png)

# Transformers library

[Transformers](https://github.com/huggingface/transformers) is a state-of-the-art NLP library for PyTorch and TensorFlow 2.0. It provides thousands of pretrained models for tasks such as text classification, information extraction, question answering, summarization, translation, and text generation, in many languages. It also provides APIs to quickly download those pretrained models, use them on a given text, and fine-tune them on your own datasets.
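
As a hedged sketch of what this looks like in practice, the helper below downloads a pretrained BERT checkpoint and returns the `[CLS]` output vector for each input text. It assumes that `transformers` and PyTorch are installed and that the `bert-base-uncased` checkpoint can be downloaded. The helper name `cls_embeddings` is made up for this example and is not part of the library.

```python
def cls_embeddings(texts, model_name="bert-base-uncased"):
    """Return one [CLS] output vector per text from a pretrained BERT model."""
    # Imported here so that defining the function does not require
    # the libraries or a network connection.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    # The tokenizer adds [CLS] and [SEP] and pads the batch to equal length.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, sequence_length, hidden_size);
    # index 0 along the sequence axis is the [CLS] position.
    return outputs.last_hidden_state[:, 0, :]
```

Calling `cls_embeddings(["Win a free prize now!", "See you at the meeting tomorrow."])` with the default checkpoint should give a tensor with one 768-dimensional row per text, which can then be passed to a small classifier like the spam example above.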


# References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

- [The Illustrated BERT, ELMo, and co.](http://jalammar.github.io/illustrated-bert/)

- [Understanding transformers](https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/)

- [Transformers library](https://huggingface.co/transformers/)