## The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)

![Screen Shot 2021-08-22 at 2.58.46 PM.png](attachment:5dd789b1-3b4a-4f9d-a57c-88949a21df07.png)

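As the figure above shows, the recently released BERT is a milestone model for NLP, and its release is bound to usher in a new era for the field. BERT broke records on a large number of natural language processing tasks. Soon after the paper was published, Google’s team also open-sourced the model’s code and made available for download versions of the model that were already pre-trained on massive datasets. Because the model is open source and comes with pre-trained versions, anyone can build an NLP model on top of it, saving the time, effort, knowledge, and resources that training a language model from scratch would take.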

![Screen Shot 2021-08-22 at 3.00.48 PM.png](attachment:02070b0e-b996-4e67-9952-4e9f19bf942c.png)

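There are a number of concepts one needs to be aware of to properly understand what BERT is. So let’s start by looking at ways you can use BERT before looking at the concepts involved in the model itself.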

### Example: Sentence Classification

The most straightforward way to use BERT is to classify a single piece of text. Such a model would look like this:

![Screen Shot 2021-08-22 at 3.03.35 PM.png](attachment:2b3be60d-2f9c-4c85-8313-f092b5b2db3d.png)

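To train such a model, you mainly have to train the classifier; the BERT model itself changes very little during the training phase. This training process is called fine-tuning.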

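To make this easier to follow, let’s take a classifier as an example. Classifiers belong to the domain of supervised learning, which means you need labeled data to train them. For the spam classifier example, the labeled dataset consists of two parts: the content of each email and its category (“spam” or “not spam”).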
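To make the idea of labeled data concrete, here is a minimal sketch of what such a dataset could look like in code. The class name `LabeledEmail`, the label order, and the two example emails are all made up for illustration; a real dataset would hold thousands of examples.

```python
from dataclasses import dataclass

# Hypothetical label set for the spam example; the order is an arbitrary choice.
LABELS = ["not spam", "spam"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}


@dataclass
class LabeledEmail:
    text: str   # the email content (the model input)
    label: str  # the human-assigned category (the training target)


# Made-up examples, only for illustration.
dataset = [
    LabeledEmail("Win a free cruise, click now!", "spam"),
    LabeledEmail("Meeting moved to 3pm tomorrow.", "not spam"),
]

# Supervised training consumes (input, target) pairs.
texts = [example.text for example in dataset]
targets = [LABEL_TO_ID[example.label] for example in dataset]
print(targets)  # [1, 0]
```

The classifier never sees the label strings themselves, only the integer ids, which is why the mapping is built once and reused.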

Other examples for such a use-case include:

* **Sentiment analysis**
    * Input: Movie/Product review. Output: is the review positive or negative?
    * Example dataset: SST

* **Fact-checking**
    * Input: sentence. Output: “Claim” or “Not Claim”
    * More ambitious/futuristic example:
        * Input: Claim sentence. Output: “True” or “False”

### Model Architecture

![Screen Shot 2021-08-22 at 3.13.45 PM.png](attachment:7846ab6c-3d22-4055-aa1f-e9970d1324ae.png)

The paper presents two model sizes for BERT:

* BERT BASE – Comparable in size to the OpenAI Transformer in order to compare performance     
* BERT LARGE – A ridiculously huge model which achieved the state of the art results reported in the paper     

BERT is basically a trained Transformer Encoder stack. This is a good time to direct you to my earlier post, The Illustrated Transformer, which explains the Transformer model, a foundational concept for BERT and for the concepts we’ll discuss next.

![Screen Shot 2021-08-22 at 3.46.26 PM.png](attachment:648aa627-71ad-44fe-9a25-5efc27f58772.png)

Both BERT model sizes have a large number of encoder layers (which the paper calls Transformer Blocks): twelve for the Base version and twenty-four for the Large version. They also have larger hidden sizes (768 and 1024 hidden units respectively) and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the Transformer in the initial paper (6 encoder layers, 512 hidden units, and 8 attention heads).
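The three configurations quoted above can be put side by side in a small sketch. The `EncoderConfig` class and its field names are my own choice for illustration, not names from the paper's code; the numbers are the ones given in the paragraph above. Dividing the hidden size by the number of heads gives the slice of the vector each attention head works on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    name: str
    num_layers: int   # encoder layers ("Transformer Blocks")
    hidden_size: int  # width of the vector at each position
    num_heads: int    # attention heads per layer

    @property
    def head_size(self) -> int:
        # Each head works on an equal slice of the hidden vector.
        assert self.hidden_size % self.num_heads == 0
        return self.hidden_size // self.num_heads


CONFIGS = [
    EncoderConfig("Transformer (reference)", 6, 512, 8),
    EncoderConfig("BERT Base", 12, 768, 12),
    EncoderConfig("BERT Large", 24, 1024, 16),
]

for cfg in CONFIGS:
    print(f"{cfg.name}: {cfg.num_layers} layers, hidden size {cfg.hidden_size}, "
          f"{cfg.num_heads} heads of size {cfg.head_size}")
```

One thing the arithmetic shows: all three configurations end up with heads of size 64, so the bigger models add more heads rather than wider ones.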

### Model Inputs

![Screen Shot 2021-08-22 at 6.08.52 PM.png](attachment:3f3d48fe-d6f9-4297-a2ec-a90ab5c98f98.png)

The first input position is filled with a special \[CLS\] token, for reasons that will become apparent later on. CLS here stands for Classification.

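BERT encodes its input the same way the Transformer encoder does. It takes a token sequence of bounded length as input, and the data flows up the stack from bottom to top. Each layer applies self-attention, passes the result through a feed-forward network, and then hands it off to the next encoder.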

![Screen Shot 2021-08-22 at 6.11.49 PM.png](attachment:1c0df70d-3485-47d0-83d7-abda354f8b65.png)
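The flow through the stack can be sketched in a few lines of NumPy. This is a deliberately simplified toy, not BERT's actual implementation: it uses a single attention head, no biases, no learned layer-norm parameters, ReLU instead of the GELU that BERT uses, random weights, and tiny sizes. All function and parameter names are my own.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def encoder_layer(x, p):
    """One simplified encoder layer: self-attention, then a feed-forward network."""
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(k.shape[-1])  # every position attends to every position
    attended = softmax(scores) @ v
    x = layer_norm(x + attended @ p["wo"])   # residual connection, then normalization
    hidden = np.maximum(0.0, x @ p["w1"])    # ReLU here for brevity
    return layer_norm(x + hidden @ p["w2"])


def init_layer(rng, d_model, d_ff):
    shapes = {
        "wq": (d_model, d_model), "wk": (d_model, d_model),
        "wv": (d_model, d_model), "wo": (d_model, d_model),
        "w1": (d_model, d_ff), "w2": (d_ff, d_model),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}


rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_layers = 5, 16, 64, 3  # toy sizes, not BERT's
layers = [init_layer(rng, d_model, d_ff) for _ in range(num_layers)]

x = rng.normal(size=(seq_len, d_model))  # stand-in for the input token embeddings
for p in layers:                         # the output of each layer feeds the next one
    x = encoder_layer(x, p)

print(x.shape)  # (5, 16)
```

Note that the shape never changes on the way up: every layer takes one vector per position and returns one vector per position, which is what lets the layers stack.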

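In terms of architecture, this looks identical to the Transformer so far (apart from the number of layers, which is just a parameter we can set). So where does BERT differ from the Transformer? The first hints show up at the model’s output.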

### Model Outputs

Each position outputs a vector of size hidden_size (768 in BERT Base). For the sentence classification example we’ve looked at above, we focus on the output of only the first position (that we passed the special \[CLS\] token to).

![Screen Shot 2021-08-28 at 12.57.46 PM.png](attachment:220c22de-bf80-4ba6-94dd-140110963d24.png)

That vector can now be used as the input for a classifier of our choosing. The paper achieves great results by just using a single-layer neural network as the classifier.

![Screen Shot 2021-08-28 at 1.01.50 PM.png](attachment:cf476392-5c19-4545-ae87-8bd5542b9bd7.png)

If you have more labels (for example if you’re an email service that tags emails with “spam”, “not spam”, “social”, and “promotion”), you just tweak the classifier network to have more output neurons that then pass through softmax.
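Here is a rough sketch of that classification head, under some loud assumptions: the "BERT output" is random noise standing in for real encoder output, the classifier weights are untrained, and the function name `classify` is mine. What it does show is the mechanics described above: take the vector at the first position, apply a single linear layer, and let softmax turn the scores into probabilities. Going from two labels to four only changes the number of output columns.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


def classify(cls_vector, weights, bias):
    """A single linear layer followed by softmax over the labels."""
    return softmax(cls_vector @ weights + bias)


rng = np.random.default_rng(0)
hidden_size = 768                                     # output size in BERT Base
encoder_output = rng.normal(size=(10, hidden_size))   # stand-in for real BERT output, 10 positions
cls_vector = encoder_output[0]                        # only the first position ([CLS]) is used

for labels in (["spam", "not spam"],
               ["spam", "not spam", "social", "promotion"]):
    # Untrained weights: one output neuron (column) per label.
    weights = rng.normal(scale=0.02, size=(hidden_size, len(labels)))
    bias = np.zeros(len(labels))
    probs = classify(cls_vector, weights, bias)
    print(dict(zip(labels, probs.round(3))))
```

With untrained weights the probabilities are meaningless, of course; fine-tuning is what would make them useful.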

### Parallels with Convolutional Nets

For those with a background in computer vision, this vector hand-off should be reminiscent of what happens between the convolution part of a network like VGGNet and the fully-connected classification portion at the end of the network.

![Screen Shot 2021-08-28 at 1.05.28 PM.png](attachment:fd8acac4-190d-40e4-a377-a926a2f03b82.png)

### A New Age of Embedding

The open-sourcing of BERT brings with it a shift in how words are embedded. Up until now, word-embeddings have been a major force in how leading NLP models deal with language. Methods like Word2Vec and GloVe have been widely used for such tasks. Let’s recap how those are used before pointing to what has now changed.

### Word Embedding Recap

For words to be processed by machine learning models, they need some form of numeric representation that models can use in their calculation. Word2Vec showed that we can use a vector (a list of numbers) to properly represent words in a way that captures semantic or meaning-related relationships (e.g. the ability to tell if words are similar, or opposites, or that a pair of words like “Stockholm” and “Sweden” have the same relationship between them as “Cairo” and “Egypt” have between them) as well as syntactic, or grammar-based, relationships (e.g. the relationship between “had” and “has” is the same as that between “was” and “is”).
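The “Stockholm is to Sweden as Cairo is to Egypt” idea can be sketched with vector arithmetic. The vectors below are hand-made toys that I constructed so the relationship holds exactly; they are not real Word2Vec output, and the helper names are mine. Real embeddings have hundreds of dimensions and the relationship only holds approximately.

```python
import numpy as np

# Hand-made toy vectors (not real Word2Vec output).
# Dimensions, by construction: [is-a-capital, Sweden-related, Egypt-related]
vectors = {
    "Sweden":    np.array([0.0, 1.0, 0.0]),
    "Stockholm": np.array([1.0, 1.0, 0.0]),
    "Egypt":     np.array([0.0, 0.0, 1.0]),
    "Cairo":     np.array([1.0, 0.0, 1.0]),
    "banana":    np.array([0.2, 0.1, 0.1]),
}


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c):
    """Return the word d such that a is to b as c is to d."""
    query = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


answer = analogy("Sweden", "Stockholm", "Egypt")
print(answer)  # Cairo
```

Subtracting “Sweden” from “Stockholm” leaves only the capital-city direction, and adding “Egypt” lands on the vector for “Cairo”.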

The field quickly realized it’s a great idea to use embeddings that were pre-trained on vast amounts of text data instead of training them alongside the model on what was frequently a small dataset. So it became possible to download a list of words and their embeddings generated by pre-training with Word2Vec or GloVe. This is an example of the GloVe embedding of the word “stick” (with an embedding vector size of 200).

![Screen Shot 2021-08-28 at 1.17.30 PM.png](attachment:d979b3a0-ecd2-4868-a91b-bb6b618500bd.png)

Since these are large and full of numbers, I use the following basic shape in the figures in my posts to show vectors:

![Screen Shot 2021-08-28 at 1.18.20 PM.png](attachment:9c204390-d95e-4a00-931d-d3c9d0ff2d9b.png)
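Pre-trained GloVe vectors are distributed as plain text, one word per line followed by its space-separated numbers. A minimal sketch of reading that format is below. The function name `load_embeddings` is mine, and the two 4-dimensional rows are made-up values standing in for a real download, so the block runs without any file on disk.

```python
import io

import numpy as np


def load_embeddings(lines):
    """Parse 'word v1 v2 ... vn' lines into a {word: vector} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        word, values = parts[0], parts[1:]
        embeddings[word] = np.array([float(v) for v in values], dtype=np.float32)
    return embeddings


# Made-up 4-dimensional values; a real file would have e.g. 200 numbers per word.
fake_file = io.StringIO(
    "stick -0.34 0.21 0.08 -0.55\n"
    "branch -0.30 0.25 0.02 -0.48\n"
)
embeddings = load_embeddings(fake_file)
print(embeddings["stick"].shape)  # (4,)
```

For a real download you would pass an open file handle instead, for example `with open(path, encoding="utf-8") as f: load_embeddings(f)`, where `path` points at wherever you saved the vectors.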

### ELMo: Context Matters

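The embedding approach described above has an obvious problem: because it relies on a pre-trained word-vector model, every word has exactly one fixed, stored vector no matter what context it appears in. Take Chinese as an example: the character “长” refers to a measurement in the word “长度” (length), but means to grow in “长高” (grow taller). So why not look at whether “长” is next to “度” or “高” to decide its pronunciation or its meaning? This question gave rise to contextualized word-embedding models.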

![Screen Shot 2021-08-28 at 1.23.36 PM.png](attachment:7274e93d-535a-415f-b3ea-73533f46a8f4.png)

**Contextualized word-embeddings** can give words different embeddings based on the meaning they carry in the context of the sentence.

Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding. It uses a bi-directional LSTM trained on a specific task to be able to create those embeddings.

![Screen Shot 2021-08-28 at 1.28.11 PM.png](attachment:848b5d03-653f-4516-8c64-61012d0d89ed.png)
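The difference between a fixed embedding and a contextualized one can be shown with a toy. To be clear about the assumptions: the “contextual” function below simply mixes each word’s vector with the average of its sentence, which is my own stand-in and nothing like ELMo’s bi-directional LSTM. The vocabulary, sentences, and random vectors are also made up. The point is only the behaviour: the same word gets the same vector everywhere in the first case and a sentence-dependent vector in the second.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["let's", "stick", "to", "the", "plan", "he", "threw", "a"]
static = {word: rng.normal(size=4) for word in vocab}  # one fixed vector per word


def static_embed(sentence):
    """Fixed lookup: the sentence around a word is ignored."""
    return [static[word] for word in sentence]


def toy_contextual_embed(sentence):
    """Toy stand-in for a contextual model: mix each word with the sentence mean."""
    mean = np.mean([static[word] for word in sentence], axis=0)
    return [0.5 * static[word] + 0.5 * mean for word in sentence]


s1 = ["let's", "stick", "to", "the", "plan"]
s2 = ["he", "threw", "a", "stick"]

# Compare the vector for "stick" (index 1 in s1, index 3 in s2).
same_static = bool(np.allclose(static_embed(s1)[1], static_embed(s2)[3]))
same_contextual = bool(np.allclose(toy_contextual_embed(s1)[1],
                                   toy_contextual_embed(s2)[3]))
print(same_static, same_contextual)  # True False
```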

ELMo provided a significant step towards pre-training in the context of NLP. The ELMo LSTM would be trained on a massive dataset in the language of our dataset, and then we can use it as a component in other models that need to handle language.

What’s ELMo’s secret?

ELMo gained its language understanding from being trained to predict the next word in a sequence of words, a task called Language Modeling (the word suggestions of a predictive keyboard work on the same idea). This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.

![Screen Shot 2021-08-28 at 1.35.02 PM.png](attachment:b28e5259-da30-48a7-8336-f806fab26296.png)

A step in the pre-training process of ELMo: Given “Let’s stick to” as input, predict the next most likely word – a language modeling task. When trained on a large dataset, the model starts to pick up on language patterns. It’s unlikely it’ll accurately guess the next word in this example. More realistically, after a word such as “hang”, it will assign a higher probability to a word like “out” (to spell “hang out”) than to “camera”.
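To see why raw text is enough for this task, here is about the simplest language model there is: count which word follows which. This bigram counter is a swapped-in toy, not ELMo’s LSTM, and the four-sentence corpus is made up. It does capture the “hang out” intuition from the caption above: every adjacent pair of words in the text is a free training example.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; the raw text is all the "labeling" the task needs.
corpus = [
    "let's hang out tonight",
    "we hang out every weekend",
    "hang the camera strap on the hook",
    "let's stick to the plan",
]

# Count how often each word follows each previous word.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        next_word_counts[prev][nxt] += 1


def next_word_probs(prev):
    """Estimated probability of each word seen after `prev`."""
    counts = next_word_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


probs = next_word_probs("hang")
print(probs)
```

In this corpus “out” follows “hang” twice and “the” once, while “camera” never follows it directly, so “out” gets the highest probability. A neural language model replaces the counting with learned parameters and can use much more than one previous word.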

ELMo actually goes a step further and trains a bi-directional LSTM – so that its language model doesn’t only have a sense of the next word, but also the previous word.

![Screen Shot 2021-08-28 at 1.41.06 PM.png](attachment:5f332ba9-eaf3-4005-aec1-1d53a67aa2e4.png)

ELMo comes up with the contextualized embedding through grouping together the hidden states (and initial embedding) in a certain way (concatenation followed by weighted summation).

![Screen Shot 2021-08-28 at 1.41.47 PM.png](attachment:6526af20-3ad7-40e2-98a7-757d12f5f7e4.png)
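The “concatenation followed by weighted summation” step can be sketched as follows. The hidden states here are random stand-ins, the sizes are toy sizes, and the function and argument names are mine. In ELMo the layer weights and the overall scale are learned by the downstream task; in this sketch they are simply passed in as `layer_logits` and `gamma`.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


def combine_layers(forward_states, backward_states, layer_logits, gamma=1.0):
    """
    forward_states, backward_states: arrays of shape (num_layers, seq_len, dim).
    Step 1: concatenate both directions per layer -> (num_layers, seq_len, 2 * dim).
    Step 2: weighted sum over layers, weights normalized with softmax, scaled by gamma.
    """
    per_layer = np.concatenate([forward_states, backward_states], axis=-1)
    weights = softmax(np.asarray(layer_logits, dtype=float))
    return gamma * np.tensordot(weights, per_layer, axes=1)


rng = np.random.default_rng(0)
num_layers, seq_len, dim = 3, 4, 8  # layer 0 = initial embedding, layers 1-2 = LSTM layers
fwd = rng.normal(size=(num_layers, seq_len, dim))
bwd = rng.normal(size=(num_layers, seq_len, dim))
bwd[0] = fwd[0]  # the initial embedding is the same in both directions

elmo = combine_layers(fwd, bwd, layer_logits=[0.0, 0.0, 0.0])
print(elmo.shape)  # (4, 16)
```

With all logits equal, every layer gets weight 1/3 and the result is a plain average over layers. Unequal logits would let a task lean more on the lower or the higher layers.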

