<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/09_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##BERT

BERT, which stands for Bidirectional Encoder Representations from Transformers,
is a model based on a Transformer encoder.

The original BERT model was trained on two huge corpora: BookCorpus
(composed of 800M words in 11,038 unpublished books) and English Wikipedia
(2.5B words). It has twelve "layers" (the original Transformer had only six), twelve
attention heads, and 768 hidden dimensions, totaling 110 million parameters.
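
As a rough sanity check on the 110 million figure, the parameter count can be rebuilt from the architecture's dimensions. The sketch below is a back-of-the-envelope calculation, not an official breakdown: the vocabulary size (30,522), the 512 positions, and the 3,072-wide feed-forward layer are taken from the `bert-base-uncased` configuration, and the final pooler layer is assumed to be included in the count.

```python
# Rough parameter count for a BERT-base-sized encoder (weights + biases).
vocab_size = 30522       # WordPiece vocabulary of bert-base-uncased
max_positions = 512      # learned position embeddings
type_vocab_size = 2      # segment (sentence A / sentence B) embeddings
hidden = 768
intermediate = 3072      # inner dimension of the feed-forward network
n_layers = 12

layer_norm = 2 * hidden  # one scale and one shift vector


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # Q, K, V, and output projections
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
pooler = linear(hidden, hidden)                      # assumed dense layer on top of [CLS]

total = embeddings + n_layers * (attention + feed_forward) + pooler
print(f"{total:,}")  # 109,482,240
```

Under these assumptions the total is roughly 109.5 million, which rounds to the quoted 110 million. The number of attention heads does not change the count, since the heads only split the 768 dimensions among themselves.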

If that’s too large for your GPU, though, don’t worry: There are many different
versions of BERT for all tastes and budgets, and you can find them in Google
Research’s BERT repository.



We’ll start our NLP journey by following the steps of Alice and Dorothy, from
[Alice’s Adventures in Wonderland](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/1476) by Lewis Carroll and [The Wonderful Wizard of Oz](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/1740) by L. Frank Baum.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/alice_dorothy.png?raw=1)

*Left: "Alice and the Baby Pig" illustration by John Tenniel, from "Alice's Adventures in Wonderland" (1865).*

*Right: "Dorothy meets the Cowardly Lion" illustration by W.W. Denslow, from "The Wonderful Wizard of Oz" (1900).*


##Setup

In [1]:
try:
    import google.colab
    import requests
    url = 'https://raw.githubusercontent.com/dvgodoy/PyTorchStepByStep/master/config.py'
    r = requests.get(url, allow_redirects=True)
    open('config.py', 'wb').write(r.content)
except ModuleNotFoundError:
    pass

from config import *
config_chapter11()
# This is needed to render the plots in this chapter
from plots.chapter11 import *

Downloading files from GitHub repo to Colab...
Finished!


In [2]:
%%capture

!pip install datasets

In [3]:
import os
import json
import errno
import requests
import numpy as np
from copy import deepcopy
from operator import itemgetter

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, Dataset

from data_generation.nlp import ALICE_URL, WIZARD_URL, download_text
from stepbystep.v4 import StepByStep
# These are the classes we built in Chapter 10
from seq2seq import *

In [4]:
from datasets import load_dataset, Split
from transformers import (
    DataCollatorForLanguageModeling,
    BertModel, BertTokenizer, BertForSequenceClassification,
    DistilBertModel, DistilBertTokenizer,
    DistilBertForSequenceClassification,
    AutoModelForSequenceClassification,
    AutoModel, AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, pipeline, TextClassificationPipeline
)
from transformers.pipelines import SUPPORTED_TASKS

##Downloading Books

In [None]:
!rm -rf data

In [None]:
# let's download data
HOME_DIR = "data"
download_text(ALICE_URL, HOME_DIR)
download_text(WIZARD_URL, HOME_DIR)

In [None]:
# let's see the downloaded data
#!cat data/alice28-1476.txt

In [None]:
#!cat data/wizoz10-1740.txt

The downloaded files contain extra lines before and after the actual text of each book, so we need to remove these additions:

In [None]:
alice_file = os.path.join(HOME_DIR, "alice28-1476.txt")
with open(alice_file, "r") as f:
  # The slice below keeps lines 105 to 3704 of the file (counting from 1)
  alice_text = "".join(f.readlines()[104:3704])

wizard_file = os.path.join(HOME_DIR, "wizoz10-1740.txt")
with open(wizard_file, "r") as f:
  # The slice below keeps lines 311 to 5100 of the file (counting from 1)
  wizard_text = "".join(f.readlines()[310:5100])

In [None]:
print(alice_text[:500])
print("\n", "#"*70, "\n")
print(wizard_text[:500])

                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 2.8




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `w

 ###################################################################### 

                    THE WONDERFUL WIZARD OF OZ


                          1.  The Cyclone


    Dorothy lived in the midst of the great Kansas prairies, with
Uncle Henry, who was a farmer, and Aunt Em, who was the farmer's
wife.  Their house was small, for the lumber to build it had to be
carried by wagon many miles.  There were four walls, a floor and a
roof, which made one room; and this room contained a rusty looking

We can partially automate the removal of the extra lines by setting the real start and end lines of each text in a configuration file.

In [None]:
text_cfg = """fname,start,end
alice28-1476.txt,104,3704
wizoz10-1740.txt,310,5100"""
bytes_written = open(os.path.join(HOME_DIR, 'lines.cfg'), 'w').write(text_cfg)

##Sentence Tokenization

A token is a piece of a text, and to tokenize a text means to split
it into pieces; that is, into a list of tokens.

The most common kind of piece is a word.

So, tokenizing a text usually means splitting it into words, using white space as the separator.

In [None]:
sentence = "I'm following the white rabbit"
tokens = sentence.split(" ")
tokens

["I'm", 'following', 'the', 'white', 'rabbit']

Let's do sentence tokenization, which means to split a text into its sentences.

In [None]:
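import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")  # Punkt sentence models; newer NLTK may need "punkt_tab"

corpus_alice = sent_tokenize(alice_text)
corpus_wizard = sent_tokenize(wizard_text)

len(corpus_alice), len(corpus_wizard)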

(1612, 2240)

Let’s check one sentence from the first corpus of text.

In [None]:
corpus_alice[2]

'There was nothing so VERY remarkable in that; nor did Alice\nthink it so VERY much out of the way to hear the Rabbit say to\nitself, `Oh dear!'

Let’s check one sentence from the second corpus of text.

In [None]:
corpus_wizard[30]

'"There\'s a cyclone coming, Em," he called to his wife.'

Our dataset is going to be a collection of CSV files, one file for each book, with each
CSV file containing one sentence per line.

Therefore, we need to:

* clean the line breaks to make sure each sentence is on one line only;
* define an appropriate quote char to "wrap" the sentence such that the original commas and semicolons in the original text do not get misinterpreted as separation chars of the CSV file; and
* add a second column to the CSV file to identify the original source of each sentence, since we’ll be concatenating and shuffling the sentences before training a model on our corpora.

The sentence above should end up looking like this:
```log
\"There's a cyclone coming, Em," he called to his wife.\,wizoz10-1740.txt
```
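
The three requirements above can be checked on a single sentence with Python's standard `csv` module. This is a minimal sketch: the line break inside the sample sentence is added here only for illustration, and the backslash is used as the quote char to match the example line above.

```python
import csv
import io

# Sample sentence with an artificial line break (illustrative assumption)
sentence = '"There\'s a cyclone coming,\nEm," he called to his wife.'
source = "wizoz10-1740.txt"

# 1. Clean the line breaks so the sentence fits on a single line
clean = sentence.replace("\n", " ").replace("\r", "")
# 2. Wrap the sentence in the quote char and 3. append the source column
line = f"\\{clean}\\,{source}\n"

# Reading it back: commas and double quotes inside the sentence survive intact
reader = csv.reader(io.StringIO("sentence,source\n" + line), quotechar="\\")
header, row = list(reader)
print(header)
print(row)
```

Because the quote char is a backslash, the commas and double quotes that occur naturally in the sentence are treated as plain characters rather than as separators or quotes.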

The function below does the grunt work of cleaning, splitting, and saving the
sentences to a CSV file for us:

In [None]:
def sentence_tokenize(source, quote_char="\\", sep_char=",", include_header=True, include_source=True, extensions=("txt",), **kwargs):
  # If source is a folder, goes through all files inside it that match the desired extensions ('txt' by default)
  if os.path.isdir(source):
    filenames = [f for f in os.listdir(source) if os.path.isfile(os.path.join(source, f)) and os.path.splitext(f)[1][1:] in extensions]
  elif isinstance(source, str):
    filenames = [source]

  # If there is a configuration file, builds a dictionary with the corresponding start and end lines of each text file
  config_file = os.path.join(source, "lines.cfg")
  config = {}
  if os.path.exists(config_file):
    with open(config_file, "r") as f:
      rows = f.readlines()
    for r in rows[1:]:
      fname, start, end = r.strip().split(",")
      config.update({fname: (int(start), int(end))})

  new_fnames = []
  # For each file of text
  for fname in filenames:
    # If there's a start and end line for that file, use it
    try:
        start, end = config[fname]
    except KeyError:
        start = None
        end = None

    # Opens the file, slices the configured lines (if any),
    # cleans line breaks, and uses the sentence tokenizer
    with open(os.path.join(source, fname), 'r') as f:
        contents = (''.join(f.readlines()[slice(start, end, None)]).replace('\n', ' ').replace('\r', ''))
    corpus = sent_tokenize(contents, **kwargs)

    # Builds a CSV file containing tokenized sentences
    base = os.path.splitext(fname)[0]
    new_fname = f'{base}.sent.csv'
    new_fname = os.path.join(source, new_fname)
    with open(new_fname, 'w') as f:
        # Header of the file
        if include_header:
            if include_source:
                f.write('sentence,source\n')
            else:
                f.write('sentence\n')
        # Writes one line for each sentence
        for sentence in corpus:
            if include_source:
                f.write(f'{quote_char}{sentence}{quote_char}{sep_char}{fname}\n')
            else:
                f.write(f'{quote_char}{sentence}{quote_char}\n')
    new_fnames.append(new_fname)

  # Returns list of the newly generated CSV files
  return sorted(new_fnames)

In [None]:
new_fnames = sentence_tokenize(HOME_DIR)
new_fnames

['data/alice28-1476.sent.csv', 'data/wizoz10-1740.sent.csv']

##spaCy Sentence Tokenization

In [None]:
import spacy

# A blank English pipeline with only the rule-based sentence splitter
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

sentences = []
for doc in nlp.pipe(corpus_alice):
  sentences.extend(sent.text for sent in doc.sents)

len(sentences), sentences[2]

(1615,
 'There was nothing so VERY remarkable in that; nor did Alice\nthink it so VERY much out of the way to hear the Rabbit say to\nitself, `Oh dear!')

##HuggingFace’s Dataset

In [None]:
# let's load from local files using HuggingFace
dataset = load_dataset(path="csv", data_files=new_fnames, quotechar="\\", split=Split.TRAIN)

In [None]:
# let's see attributes, like features, num_columns, and shape
dataset.features, dataset.num_columns, dataset.shape

({'sentence': Value(dtype='string', id=None),
  'source': Value(dtype='string', id=None)},
 2,
 (3852, 2))

In [None]:
dataset[2]

{'sentence': 'There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, `Oh dear!',
 'source': 'alice28-1476.txt'}

In [None]:
dataset["sentence"][:3]

["                ALICE'S ADVENTURES IN WONDERLAND                            Lewis Carroll                 THE MILLENNIUM FULCRUM EDITION 2.8                                 CHAPTER I                        Down the Rabbit-Hole     Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do:  once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'",
 'So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.',
 'There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, `Oh dear!']

In [None]:
dataset["source"][:3]

['alice28-1476.txt', 'alice28-1476.txt', 'alice28-1476.txt']

In [None]:
# check the unique sources
dataset.unique("source")

['alice28-1476.txt', 'wizoz10-1740.txt']

In [None]:
# let's create a new "label" column: 1 for sentences from Alice, 0 otherwise
def is_alice_label(row):
  is_alice = int(row["source"] == "alice28-1476.txt")
  return {"label": is_alice}

In [None]:
dataset = dataset.map(is_alice_label)

In [None]:
dataset[2]

{'sentence': 'There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, `Oh dear!',
 'source': 'alice28-1476.txt',
 'label': 1}

In [None]:
# Now, we can finally shuffle the dataset and split it into training and test sets
shuffled_dataset = dataset.shuffle(seed=42)
split_dataset = shuffled_dataset.train_test_split(test_size=0.2)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'source', 'label'],
        num_rows: 3081
    })
    test: Dataset({
        features: ['sentence', 'source', 'label'],
        num_rows: 771
    })
})
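
The sizes of the two splits follow from simple arithmetic on the 3,852 rows. The sketch below mimics a seeded shuffle followed by an 80/20 split in plain Python; it is not the library's implementation, and rounding the test size up is an assumption that happens to reproduce the numbers shown above.

```python
import math
import random

n_rows = 3852      # total number of sentences in the dataset
test_size = 0.2

# Seeded shuffle of the row indices, so the split is reproducible
indices = list(range(n_rows))
random.Random(42).shuffle(indices)

# Assumption: the test set size is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

train_idx, test_idx = indices[:n_train], indices[n_train:]
print(len(train_idx), len(test_idx))  # 3081 771
```

Shuffling before splitting matters here because the rows were loaded one book after the other; without it, the test set would contain sentences from a single book only.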

In [None]:
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

In [None]:
train_dataset[0]

{'sentence': 'Here was another puzzling question; and as Alice could not think of any good reason, and as the Caterpillar seemed to be in a VERY unpleasant state of mind, she turned away.',
 'source': 'alice28-1476.txt',
 'label': 1}

##Tokenization

In [5]:
# loading the pre-trained weights
bert_model = BertModel.from_pretrained("bert-base-uncased")

In [6]:
# let's inspect the pre-trained model’s configuration
bert_model.config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [7]:
# Let’s create our first real BERT tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [8]:
len(bert_tokenizer.vocab)

30522

In [9]:
# Let’s tokenize a pair of sentences using BERT’s WordPiece tokenizer
sentence1 = "Alice is inexplicably following the white rabbit"
sentence2 = "Follow the white rabbit, Neo"

tokens = bert_tokenizer(sentence1, sentence2, return_tensors="pt")
tokens

{'input_ids': tensor([[  101,  5650,  2003,  1999, 10288, 24759,  5555,  6321,  2206,  1996,
          2317, 10442,   102,  3582,  1996,  2317, 10442,  1010,  9253,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [10]:
# To actually see the word pieces, it’s easier to convert the input IDs back into tokens
print(bert_tokenizer.convert_ids_to_tokens(tokens["input_ids"][0]))

['[CLS]', 'alice', 'is', 'in', '##ex', '##pl', '##ica', '##bly', 'following', 'the', 'white', 'rabbit', '[SEP]', 'follow', 'the', 'white', 'rabbit', ',', 'neo', '[SEP]']
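
The way "inexplicably" gets broken into pieces can be reproduced with a greedy, longest-match-first search over the vocabulary. The function below is a simplified sketch of that idea rather than the tokenizer's actual code, and `toy_vocab` is a hypothetical vocabulary containing only the pieces needed for this sentence (the real one has 30,522 entries).

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"alice", "is", "in", "##ex", "##pl", "##ica", "##bly",
             "following", "the", "white", "rabbit"}

sentence = "alice is inexplicably following the white rabbit"
word_pieces = [p for w in sentence.split() for p in wordpiece(w, toy_vocab)]
print(word_pieces)
```

Words that are in the vocabulary stay whole, while an out-of-vocabulary word is covered by the longest pieces available, with `##` flagging every piece that continues a word.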


##Input Embeddings

In [11]:
# let’s take a look under BERT’s hood
input_embeddings = bert_model.embeddings
input_embeddings

BertEmbeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [12]:
# Let’s go over each of them
token_embeddings = input_embeddings.word_embeddings
token_embeddings

Embedding(30522, 768, padding_idx=0)

In [13]:
# As usual, the layer returns one embedding for each token ID in the input
input_token_emb = token_embeddings(tokens["input_ids"])
input_token_emb

tensor([[[ 1.3630e-02, -2.6490e-02, -2.3503e-02,  ...,  8.6805e-03,
           7.1340e-03,  1.5147e-02],
         [-6.9710e-02, -8.8202e-02,  5.0619e-03,  ...,  1.4105e-02,
           2.1815e-02, -1.3769e-02],
         [-3.6044e-02, -2.4606e-02, -2.5735e-02,  ...,  3.3691e-03,
          -1.8300e-03,  2.6855e-02],
         ...,
         [ 5.2089e-05, -1.0468e-02, -9.9103e-03,  ...,  1.4558e-02,
           1.3217e-02,  2.2406e-02],
         [-3.5037e-02, -7.2933e-02, -3.6124e-02,  ..., -5.7723e-02,
          -5.5074e-03,  7.2688e-03],
         [-1.4521e-02, -9.9615e-03,  6.0263e-03,  ..., -2.5035e-02,
           4.6379e-03, -1.5378e-03]]], grad_fn=<EmbeddingBackward0>)

In [14]:
# Since each input may have up to 512 tokens, the position embedding layer has exactly that number of entries
position_embeddings = input_embeddings.position_embeddings
position_embeddings

Embedding(512, 768)

In [15]:
# Each sequentially numbered position, up to the total length of the input, will return its corresponding embedding
position_ids = torch.arange(512).expand((1, -1))
position_ids

tensor([[  0,   1,   2,  ..., 509, 510, 511]])

In [16]:
seq_length = tokens["input_ids"].size(1)
input_pos_emb = position_embeddings(position_ids[:, :seq_length])
input_pos_emb

tensor([[[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
           6.8312e-04,  1.5441e-02],
         [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
           2.9753e-02, -5.3247e-03],
         [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
           1.8741e-02, -7.3140e-03],
         ...,
         [-9.2809e-03,  8.3268e-03, -4.1643e-03,  ...,  3.4903e-02,
          -1.8319e-02, -2.9017e-03],
         [-8.5999e-03,  3.2205e-04, -2.1249e-03,  ...,  2.7744e-02,
          -7.2760e-03, -2.0280e-03],
         [-3.4622e-04, -8.3709e-04, -2.2228e-02,  ...,  2.3493e-02,
          -4.5198e-04, -5.7741e-04]]], grad_fn=<EmbeddingBackward0>)

In [17]:
# Then, since there can only be either one or two sentences in the input, the segment embedding layer has only two entries:
segment_embeddings = input_embeddings.token_type_embeddings
segment_embeddings

Embedding(2, 768)

In [18]:
# For these embeddings, BERT will use the token_type_ids returned by the tokenizer
input_seg_emb = segment_embeddings(tokens["token_type_ids"])
input_seg_emb

tensor([[[ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
         [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
         [ 0.0004,  0.0110,  0.0037,  ..., -0.0066, -0.0034, -0.0086],
         ...,
         [ 0.0011, -0.0030, -0.0032,  ...,  0.0047, -0.0052, -0.0112],
         [ 0.0011, -0.0030, -0.0032,  ...,  0.0047, -0.0052, -0.0112],
         [ 0.0011, -0.0030, -0.0032,  ...,  0.0047, -0.0052, -0.0112]]],
       grad_fn=<EmbeddingBackward0>)
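
The `token_type_ids` themselves are easy to derive from the token list: everything up to and including the first `[SEP]` belongs to the first sentence, and everything after it belongs to the second. A minimal sketch, using a hypothetical helper named `segment_ids` and the tokens from the sentence pair above:

```python
def segment_ids(tokens, sep_token="[SEP]"):
    """Returns 0 for tokens of the first sentence and 1 for the second."""
    ids, segment = [], 0
    for token in tokens:
        ids.append(segment)
        if token == sep_token:
            segment = 1  # tokens after the first [SEP] belong to sentence B
    return ids


pair_tokens = ['[CLS]', 'alice', 'is', 'in', '##ex', '##pl', '##ica', '##bly',
               'following', 'the', 'white', 'rabbit', '[SEP]',
               'follow', 'the', 'white', 'rabbit', ',', 'neo', '[SEP]']
print(segment_ids(pair_tokens))
```

The result has thirteen zeros followed by seven ones, matching the `token_type_ids` returned by the tokenizer for this pair.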

Finally, BERT adds up all three embeddings (token, position, and segment):

In [19]:
input_emb = input_token_emb + input_pos_emb + input_seg_emb
input_emb

tensor([[[ 0.0316, -0.0411, -0.0564,  ...,  0.0021,  0.0044,  0.0219],
         [-0.0615, -0.0750, -0.0107,  ...,  0.0364,  0.0482, -0.0277],
         [-0.0469, -0.0156, -0.0336,  ...,  0.0117,  0.0135,  0.0109],
         ...,
         [-0.0081, -0.0051, -0.0172,  ...,  0.0542, -0.0103,  0.0083],
         [-0.0425, -0.0756, -0.0414,  ..., -0.0252, -0.0180, -0.0060],
         [-0.0138, -0.0138, -0.0194,  ...,  0.0032, -0.0011, -0.0133]]],
       grad_fn=<AddBackward0>)

BERT still applies layer normalization and dropout to this sum, but that’s it: these are the inputs BERT uses.
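
To make the whole pipeline concrete, here is a dependency-free sketch of the same computation on a toy scale: three lookups, an element-wise sum, and layer normalization. All table values and the hidden size of four are made up for illustration; the learned scale and shift of the layer normalization are left out, and dropout is skipped, as it would be at inference time.

```python
import math


def layer_norm(vector, eps=1e-12):
    """Normalizes a vector to zero mean and unit variance (no scale/shift)."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


# Toy lookup tables with hidden size 4 (hypothetical values)
token_table = {101: [0.1, 0.2, 0.3, 0.4], 5650: [0.5, -0.1, 0.0, 0.2]}
position_table = [[0.01, 0.02, 0.03, 0.04], [0.04, 0.03, 0.02, 0.01]]
segment_table = [[0.0, 0.1, 0.0, 0.1], [0.1, 0.0, 0.1, 0.0]]

input_ids = [101, 5650]
token_type_ids = [0, 0]

embedded = []
for position, (token_id, type_id) in enumerate(zip(input_ids, token_type_ids)):
    # Element-wise sum of token, position, and segment embeddings
    summed = [t + p + s for t, p, s in zip(token_table[token_id],
                                           position_table[position],
                                           segment_table[type_id])]
    embedded.append(layer_norm(summed))

print(embedded)
```

Each position ends up with a single vector of the hidden size, no matter how many embedding tables contributed to it.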

##Masked Language Model (MLM)

Let’s see an example of MLM, starting with an input sentence.

In [20]:
sentence = "Alice is inexplicably following the white rabbit"
tokens = bert_tokenizer(sentence)
tokens["input_ids"]

[101, 5650, 2003, 1999, 10288, 24759, 5555, 6321, 2206, 1996, 2317, 10442, 102]

Then, let’s create an instance of the data collator and apply it to our mini-batch of one.

In [21]:
torch.manual_seed(41)

data_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer, mlm_probability=0.15)
mlm_tokens = data_collator([tokens])
mlm_tokens

{'input_ids': tensor([[  101,  5650,  2003,  1999, 10288, 24759,   103,  6321,  2206,  1996,
          2317, 10442,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[-100, -100, -100, -100, -100, -100, 5555, -100, -100, -100, -100, -100,
         -100]])}

It’s actually easier to visualize
the difference if we convert the IDs back to tokens:

In [22]:
print(bert_tokenizer.convert_ids_to_tokens(tokens["input_ids"]))
print(bert_tokenizer.convert_ids_to_tokens(mlm_tokens["input_ids"][0]))

['[CLS]', 'alice', 'is', 'in', '##ex', '##pl', '##ica', '##bly', 'following', 'the', 'white', 'rabbit', '[SEP]']
['[CLS]', 'alice', 'is', 'in', '##ex', '##pl', '[MASK]', '##bly', 'following', 'the', 'white', 'rabbit', '[SEP]']
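
The masking procedure can be sketched in plain Python. Each non-special token is selected with probability `mlm_probability`; a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. Labels hold the original ID at selected positions and `-100` everywhere else, so that the loss ignores unselected tokens. The function name `mask_tokens` and its arguments are hypothetical, and its random draws will not match the collator's output above.

```python
import random


def mask_tokens(input_ids, vocab_size, mask_id, special_ids,
                mlm_probability=0.15, seed=None):
    """Simplified MLM masking: returns (masked input IDs, labels)."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions to ignore
    for i, token_id in enumerate(input_ids):
        # Special tokens are never selected; others with prob. mlm_probability
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


input_ids = [101, 5650, 2003, 1999, 10288, 24759, 5555, 6321,
             2206, 1996, 2317, 10442, 102]
masked, labels = mask_tokens(input_ids, vocab_size=30522, mask_id=103,
                             special_ids={101, 102}, seed=41)
print(masked)
print(labels)
```

Keeping some selected tokens unchanged, or swapping them for random ones, means the model cannot rely on `[MASK]` being the only position that needs a prediction.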


##Next Sentence Prediction (NSP)