#### Setup Code

In [1]:
%load_ext autoreload
%autoreload 2

##### Google Colab Setup
We need to run a few commands to set up our environment on Google Colab. If you are running this notebook on a local machine, you can skip this section. Run the following cell to mount your Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
import sys

# TODO: Fill in the Google Drive path where you uploaded the assignment
# Example: If you create a 'Test' folder and put all the files under 'example' folder, then 'Test/example'
# GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = 'Test/example'
GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = 'GIT/tutorials/utils/'
GOOGLE_DRIVE_PATH = os.path.join('drive', 'My Drive', GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)
sys.path.append(GOOGLE_DRIVE_PATH)

print(os.listdir(GOOGLE_DRIVE_PATH))

['__pycache__', 'for_knn.py', 'linear_classifier.py', 'custom_model_utils', 'Convolutional_Neural_Network', '_utils.py', 'save.py', '_word_processing.py', '_layers.py', 'enc2dec', 'data', 'models', 'colab_utils']


##### NLP Setup Code

In [None]:
!pip install 'portalocker>=2.0.0'

In [None]:
!pip install datasets

In [6]:
import torch
import torchtext
import torchdata

print(f'torch version: {torch.__version__}')
print(f'torchtext version: {torchtext.__version__}')
print(f'torchtext data: {torchdata.__version__}')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch version: 2.2.1+cu121
torchtext version: 0.17.1+cpu
torchtext data: 0.7.1


# BERT

Reference: https://github.com/codertimo/BERT-pytorch


In [7]:
import torch.nn as nn

# custom package
from models.transformer.utils import get_positional_encoding
from models.transformer.transformer import EncoderLayer

## Model Architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original Transformer. In this work, we denote the number of layers as $L$, the hidden size as $H$, and the number of self-attention heads as $A$. We primarily report results on two model sizes: $\text{BERT}_{\text{BASE}}$ ($L=12$, $H=768$, $A=12$, total parameters $=110\text{M}$) and $\text{BERT}_{\text{LARGE}}$ ($L=24$, $H=1024$, $A=16$, total parameters $=340\text{M}$). For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
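As a rough check on the quoted model sizes, the sketch below counts parameters for a BERT-style encoder from $L$ and $H$ alone. It assumes a feed-forward size of $4H$, learned position embeddings for 512 positions, two segment types, a pooler layer on `[CLS]`, and the 30,522-token vocabulary of the released uncased checkpoints. The function name `approx_encoder_params` and its arguments are illustrative only, and the model built in this notebook may differ in these details.

```python
def approx_encoder_params(n_layers, hidden, vocab_size=30522,
                          max_positions=512, n_segments=2):
    """Rough parameter count for a BERT-style encoder (assumed layout, see text)."""
    ffn = 4 * hidden
    # token + position + segment embedding tables, plus one LayerNorm (scale and bias)
    embeddings = (vocab_size + max_positions + n_segments) * hidden + 2 * hidden
    # query, key, value and output projections, each with a bias
    attention = 4 * (hidden * hidden + hidden)
    # two dense layers of the position-wise feed-forward network
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # two LayerNorms per block
    layer_norms = 2 * (2 * hidden)
    # dense layer applied to the [CLS] hidden state
    pooler = hidden * hidden + hidden
    return embeddings + n_layers * (attention + feed_forward + layer_norms) + pooler


base = approx_encoder_params(n_layers=12, hidden=768)
large = approx_encoder_params(n_layers=24, hidden=1024)
print(f"BERT-base  : {base / 1e6:.1f}M parameters")
print(f"BERT-large : {large / 1e6:.1f}M parameters")
```

Note that the number of heads $A$ does not appear: splitting the hidden size across heads changes how the projections are used, not how many weights they hold.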

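> The paper calls the Transformer's EncoderLayer a Transformer Block. Only the name differs; it is the same as the Transformer's EncoderLayer. The BERT architecture itself is simple. What the paper emphasizes is Masked LM and Next Sentence Prediction. The implementation here reuses the model built in the previous Transformer tutorial.

> There are in fact a few differences in the details. The paper uses GELU instead of ReLU.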
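To make the ReLU/GELU difference concrete, here is a minimal scalar sketch using only the standard library. It uses the exact form $\text{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF; the helper names `relu` and `gelu` are illustrative and are not part of the `models.transformer` package used in this notebook.

```python
import math


def relu(x: float) -> float:
    # hard threshold: negative inputs are zeroed
    return max(0.0, x)


def gelu(x: float) -> float:
    # exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative values through. In PyTorch the same activation is available as `nn.GELU`.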


In [8]:
class BERT(nn.Module):
  def __init__(self, vocab_size,
               d_model=768,
               n_layers=12,
               n_head=12,
               dropout=0.1,
               PAD_IDX=2,
               device='cpu'):
      super().__init__()

      self.device = device
      self.PAD_IDX=PAD_IDX
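      self.token_embed = nn.Embedding(vocab_size, d_model, padding_idx=PAD_IDX)
      # segment ids: 0 = padding, 1 = sentence A, 2 = sentence B
      # (padding_idx=PAD_IDX would zero out sentence B, since PAD_IDX defaults to 2)
      self.segment_embed = nn.Embedding(3, d_model, padding_idx=0)
      # fixed (non-learned) positional encoding; detach so it never receives gradients
      self.position_embed = get_positional_encoding(d_model).detach().to(device)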

      self.dropout = nn.Dropout(dropout)
      dim_feedforward = d_model * 4

      self.transformer_blocks = nn.ModuleList([
          EncoderLayer(d_model, dim_feedforward, n_head, dropout) for _ in range(n_layers)])

  def forward(self, x, segment=None):

      # pad mask
      mask = (x != self.PAD_IDX).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1).to(self.device)

      # sum token and position embeddings, then add segment embeddings when given
      embed = self.token_embed(x) + self.position_embed[:,:x.size(1),:]
      if segment is not None:
        embed = embed + self.segment_embed(segment)

      embed = self.dropout(embed)

      for _blocks in self.transformer_blocks:
          embed = _blocks(embed, mask)

      return embed

In [9]:
class BertForPretrain(nn.Module):
  def __init__(self,vocab_size,
               d_model=768,
               n_layers=12,
               n_head=12,
               dropout=0.1,
               PAD_IDX=2,
               device='cpu'):
    super().__init__()

    self.bert = BERT(vocab_size, d_model, n_layers, n_head, dropout, PAD_IDX, device)

    # Masked LM
    self.masked_LM = nn.Linear(d_model, vocab_size)
    self.LM_softmax = nn.LogSoftmax(dim=-1)

    # Next Sentence Prediction
    self.next_sentence_prediction = nn.Linear(d_model, 2)
    self.next_softmax = nn.LogSoftmax(dim=-1)

  def forward(self, x, segment):

    embed = self.bert(x, segment)

    mask_lm_output = self.masked_LM(embed)
    mask_lm_output = self.LM_softmax(mask_lm_output)

    # only use CLS token
    next_sent_output = self.next_sentence_prediction(embed)[:, 0]
    next_sent_output = self.next_softmax(next_sent_output)


    return mask_lm_output, next_sent_output


## Input/Output Representation

We use WordPiece embeddings with a 30,000 token vocabulary.

- The first token of every sequence is always a special classification token ($[\text{CLS}]$). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
- Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways.
  - We separate them with a special token ($[\text{SEP}]$).
  - We add a learned embedding to every token indicating whether it belongs to sentence $\text{A}$ or $\text{B}$.

We denote input embedding as $E$, the final hidden vector of the special $[\text{CLS}]$ token as $C \in \mathbb{R}^H$, and the final hidden vector for the $i^{th}$ input token as $T^i \in \mathbb{R}^H$.
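The packing rules above can be sketched in a few lines. This is a minimal illustration that assumes already-tokenized sentences and the segment ids used in this notebook (1 for sentence A, 2 for sentence B); the helper name `pack_pair` and the example tokens are hypothetical.

```python
def pack_pair(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Pack two token lists into one sequence with segment ids (1 = A, 2 = B)."""
    tokens = [cls] + tokens_a + [sep] + tokens_b + [sep]
    # [CLS] and the first [SEP] count as sentence A; the final [SEP] as sentence B
    segments = [1] * (len(tokens_a) + 2) + [2] * (len(tokens_b) + 1)
    return tokens, segments


tokens, segments = pack_pair(["my", "dog", "is", "cute"],
                             ["he", "likes", "play", "##ing"])
for tok, seg in zip(tokens, segments):
    print(f"{tok:>8}  segment {seg}")
```

The hidden state at position 0 (the `[CLS]` slot) is the vector $C$ used for sentence-level predictions such as Next Sentence Prediction.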

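- This is the most important part of training BERT. The model itself is simple; what matters is how the inputs and targets are represented for training.

- Masked language modeling masks words at random, and the paper emphasizes that this is what lets the model learn each word from both its left and right context.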
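For intuition about how often each kind of corruption occurs, the simulation below applies the masking rule from the BERT paper to 100,000 hypothetical token positions: 15% of tokens are selected for prediction, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The function `masking_action` is an illustrative stand-in and is separate from the dataset class used in this notebook.

```python
import random
from collections import Counter


def masking_action(rng: random.Random) -> str:
    """Decide what happens to one token under the 15% / 80-10-10 rule."""
    if rng.random() >= 0.15:
        return "untouched"    # not selected, so not a prediction target
    p = rng.random()
    if p < 0.8:
        return "mask"         # replaced by [MASK]
    if p < 0.9:
        return "random"       # replaced by a random vocabulary token
    return "unchanged"        # left as-is, but still a prediction target


rng = random.Random(0)        # fixed seed so the run is repeatable
n_tokens = 100_000
counts = Counter(masking_action(rng) for _ in range(n_tokens))
selected = counts["mask"] + counts["random"] + counts["unchanged"]

print(f"selected for prediction: {selected / n_tokens:.1%}")
for name in ("mask", "random", "unchanged"):
    print(f"  {name:>9}: {counts[name] / selected:.1%} of selected")
```

Only about 12% of all positions actually show `[MASK]`, which is why the paper mixes in random and unchanged tokens: the model cannot rely on `[MASK]` always marking the positions it must predict.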

In [None]:
import random
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

# custom packages
from data.wikitext2 import load_WikiText2
from data.word_piece_vocab import build_wordpiece_vocab

In [9]:
datasets, _, _ = load_WikiText2()
tokenizer, vocab = build_wordpiece_vocab(datasets, vocab_file=None, vocab_size=30000)

print(f"size of vocab :{len(vocab)}")
print(f"size of datasets : {len(datasets)}")

size of vocab :30000
size of datasets : 79603


In [10]:
class BERTDataset(Dataset):
  def __init__(self, corpus, tokenizer, vocab, max_len=512,
               CLS='[CLS]', SEP='[SEP]', MASK='[MASK]', PAD='[PAD]'):

    self.CLS, self.SEP = CLS, SEP
    self.MASK, self.PAD = MASK, PAD

    self.max_len = max_len
    self.corpus = corpus
    self.corpus_index = list(range(len(corpus)))
    self.tokenizer = tokenizer
    self.vocab = vocab

  def __len__(self,):
    return len(self.corpus) - 1

  def __getitem__(self, idx):

    # random sentence
    t1, t2, is_next_label = self.get_random_sent(idx)

    # random masking
    masked_t1, t1_label = self.random_masking(t1)
    masked_t2, t2_label = self.random_masking(t2)


    # segment_label
    segment_label = ([1 for _ in range(len(masked_t1) + 2)] + [2 for _ in range(len(masked_t2) + 1)])[:self.max_len]
    input = ([self.CLS] + masked_t1 + [self.SEP] + masked_t2 + [self.SEP])[:self.max_len]
    label = ([self.CLS] + t1_label + [self.SEP] + t2_label + [self.SEP])[:self.max_len]


    # print(f"Is Next label? {is_next_label}\n")
    # print(f"Input Token: {input}")
    # print(f"Label Token: {label}")
    # print(f"Segment Label: {segment_label}")


    # convert token to idx
    input = self.convert_token_to_idx(input)
    label = self.convert_token_to_idx(label)
    segment_label = torch.tensor(segment_label, dtype=torch.int64)

    return input, label, segment_label, is_next_label


  def get_random_sent(self, idx):

    t1 = self.corpus[idx]
    t2 = self.corpus[idx + 1]

    #print(f"Sentence 1: {t1}")

    if random.random() > 0.5:
      #print(f"Sentence 2: {t2}")
      return t1, t2, True

    else:
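      # draw a random sentence index other than the true next sentence
      # (resampling avoids copying the whole index list for every item)
      rnd_idx = random.randrange(len(self.corpus))
      while rnd_idx == idx + 1:
        rnd_idx = random.randrange(len(self.corpus))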

      t2 = self.corpus[rnd_idx]
      #print(f"Sentence 2: {t2}")
      return t1, t2, False

  def random_masking(self, sentence):

    # convert sentence to tokens
    tokens = self.tokenizer(sentence)
    new_token = tokens.copy()
    label = []

    # select 15% of all tokens in each sequence at random for prediction
    for idx in range(len(tokens)):
      label.append(self.PAD)

      if random.random() < 0.15:
        prob = random.random()
        # every selected token is a prediction target, whatever replaces it
        label[idx] = tokens[idx]

        # (1) the [MASK] token 80% of the time
        if prob < 0.8:
          new_token[idx] = self.MASK

        # (2) a random token 10% of the time
        elif prob < 0.9:
          rnd_idx = random.randint(0, len(self.vocab)-1)
          new_token[idx] = self.vocab.itos[rnd_idx]

        # (3) the unchanged token 10% of the time

    return new_token, label

  def convert_token_to_idx(self, tokens):
    tokens = self.vocab.convert_tokens_to_indices(tokens)
    return torch.tensor(tokens, dtype=torch.int64)

In [11]:
class BERTCollate(object):
    def __init__(self, PAD_IDX=2, batch_first=True):
        self.PAD_IDX = PAD_IDX
        self.batch_first = batch_first

    def __call__(self, batch):
        inputs, labels, segments, nexts = zip(*batch)
        inputs = pad_sequence(inputs, padding_value=self.PAD_IDX, batch_first=self.batch_first)
        labels = pad_sequence(labels, padding_value=self.PAD_IDX, batch_first=self.batch_first)
        # pad segment ids with 0 so padding is not confused with sentence B (id 2)
        segments = pad_sequence(segments, padding_value=0, batch_first=self.batch_first)
        return inputs, labels, segments, torch.tensor(nexts, dtype=torch.int64)


## Sanity Check

In [12]:
sample_data = ['Once upon a time, a young boy moved to a small village.',
               'He always had a smile on his face as he made friends around.',
               'One day, a large snake appeared near the village, causing fear and panic among the villagers.',
               'They spread rumors and everyone felt uneasy.']

In [13]:
sample_dataset = BERTDataset(sample_data, tokenizer, vocab)
collate_fn = BERTCollate(PAD_IDX=vocab.PAD_IDX)
loader = DataLoader(sample_dataset, batch_size=1, collate_fn=collate_fn, shuffle=True)

In [15]:
input, label, segment_label, is_next_label = next(iter(loader))

print(f"\nConverted Input: {input.tolist()}")
print(f"Converted Label: {label.tolist()}")

Sentence 1: He always had a smile on his face as he made friends around.
Sentence 2: Once upon a time, a young boy moved to a small village.
Is Next label? False

Input Token: ['[CLS]', '[UNK]', 'always', 'had', 'a', 'smile', '[MASK]', 'his', 'face', 'as', 'he', 'made', 'friends', '[UNK]', '[SEP]', '[MASK]', 'upon', 'a', '[UNK]', 'a', 'young', 'boy', 'moved', 'to', 'a', 'small', '[UNK]', '[SEP]']
Label Token: ['[CLS]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', 'on', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[SEP]', '[UNK]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[SEP]']
Segment Label: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

Converted Input: [[0, 1, 6606, 755, 25566, 3515, 4, 26006, 28062, 27652, 2630, 19327, 25947, 1, 3, 4, 15585, 25566, 1, 25566, 12737, 25347, 21482, 14493, 25566, 7413, 1, 3]]
Converted Label: [[0, 2, 2, 2, 2, 2, 14871, 2, 2, 2, 2, 2, 2, 

# Sanity check

In [14]:
import torch.optim as optim

# custom package
import colab_utils.bert as utils

In [15]:
train_datasets = BERTDataset(datasets, tokenizer, vocab, max_len=64)
collate_fn = BERTCollate(PAD_IDX=vocab.PAD_IDX)
data_loader = DataLoader(train_datasets, batch_size=128, collate_fn=collate_fn, shuffle=True)

In [16]:
model = BertForPretrain(vocab_size=len(vocab),
                        d_model=768,
                        n_layers=12,
                        n_head=12,
                        dropout=0.1,
                        PAD_IDX=vocab.PAD_IDX,
                        device=device)

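# The model already returns log-probabilities (LogSoftmax). CrossEntropyLoss applies
# log_softmax again, which leaves log-probabilities unchanged, so nn.NLLLoss would be equivalent.
criterion = nn.CrossEntropyLoss(ignore_index=vocab.PAD_IDX)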
optimizer = optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)

In [17]:
model = model.to(device)

inputs, labels, segments, nexts = next(iter(data_loader))
inputs = inputs.to(device)
labels = labels.to(device)
segments = segments.to(device)
nexts = nexts.to(device)

optimizer.zero_grad()
mask_lm_output, next_sent_output = model(inputs, segments)
mask_loss = criterion(mask_lm_output.transpose(2,1), labels)
next_loss = criterion(next_sent_output, nexts)
loss = mask_loss + next_loss

print(f"loss : {loss.item()}")

loss.backward()
optimizer.step()

loss : 11.127464294433594
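A quick plausibility check on this first-step loss: if an untrained model predicted roughly uniformly, the masked-LM cross-entropy would be about $\ln V$ for a vocabulary of size $V$, and the next-sentence cross-entropy about $\ln 2$. The sketch below computes that baseline under the uniform-prediction assumption; a freshly initialized network is only approximately uniform, so the observed value should be close to the baseline rather than equal to it.

```python
import math

vocab_size = 30000    # size of the WordPiece vocab built above
n_nsp_classes = 2     # IsNext / NotNext

mlm_baseline = math.log(vocab_size)      # cross-entropy of a uniform guess over the vocab
nsp_baseline = math.log(n_nsp_classes)   # cross-entropy of a coin flip
total_baseline = mlm_baseline + nsp_baseline

print(f"MLM ~ {mlm_baseline:.3f}, NSP ~ {nsp_baseline:.3f}, total ~ {total_baseline:.3f}")
# prints: MLM ~ 10.309, NSP ~ 0.693, total ~ 11.002
```

The printed loss of about 11.13 sits close to this baseline, which suggests the loss wiring (shapes, `ignore_index`, summing the two terms) is behaving as intended before any training.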


# Pre-train Net

In [18]:
history = utils.runner(model, criterion, optimizer, data_loader, 3, mode='train')
torch.save(model.bert.state_dict(), 'bert.pth')

Train using cuda
Epoch [1/3]          time: 0:05:12          Loss: 5.7094          
Epoch [2/3]          time: 0:05:12          Loss: 5.5664          
Epoch [3/3]          time: 0:05:12          Loss: 5.5658          

Finished Training
Toral Training Time: 0:15:37


# Finetune

In [19]:
from torch.utils.data import DataLoader, TensorDataset

# custom package
from data.word_processing import build_transform
from data.sst_2 import load_SST2, build_SST2_vocab, SST2Collate

In [20]:
_, dev_datasets, _ = load_SST2(root='.')
collate_fn = SST2Collate(build_transform(tokenizer, vocab),
                         PAD_IDX=vocab.PAD_IDX,
                         batch_first=True)

dev_dataloader = DataLoader(dev_datasets, batch_size=32, collate_fn=collate_fn)

print(f"size of dataset :{len(dev_datasets)}")

size of dataset :872


In [22]:
bert = BERT(vocab_size=len(vocab),
            d_model=768,
            n_layers=12,
            n_head=12,
            dropout=0.1,
            device=device)

bert.load_state_dict(torch.load("bert.pth"))

<All keys matched successfully>

In [23]:
classifier = nn.Linear(768, 2)
model = nn.Sequential(bert, classifier)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-5)

In [24]:
history = utils.runner(model, criterion, optimizer, dev_dataloader, 3, mode='finetune')

Train using cuda
Epoch [1/3]          time: 0:00:02          Loss: 0.7122          ACC: 48.62%          
Epoch [2/3]          time: 0:00:02          Loss: 0.6965          ACC: 50.23%          
Epoch [3/3]          time: 0:00:02          Loss: 0.6987          ACC: 50.46%          

Finished Training
Toral Training Time: 0:00:07


### Use the Transformers Library

Training BERT from scratch takes a lot of time and GPU resources, so pre-training it on Colab is impractical. The sections above walked through BERT's pre-training procedure and a simple fine-tuning run to show how it works. To load an actual pre-trained model and fine-tune it, see below.

> Information about this API is available from Hugging Face.

- Hugging Face: https://huggingface.co/docs/tokenizers/api/tokenizer
- GitHub: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/tokenization_bert.py

In [25]:
from transformers import BertTokenizer, BertForSequenceClassification

In [None]:
inputs, labels = zip(*dev_datasets)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_texts = tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")

dataset = TensorDataset(tokenized_texts["input_ids"], tokenized_texts["attention_mask"], torch.tensor(labels))
train_dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = optim.Adam(model.parameters(), lr=5e-5)

In [27]:
history = utils.runner(model, None, optimizer, train_dataloader, 3, mode='finetune', custom=False)

Train using cuda
Epoch [1/3]          time: 0:00:02          Loss: 0.4926          ACC: 74.77%          
Epoch [2/3]          time: 0:00:02          Loss: 0.1228          ACC: 95.99%          
Epoch [3/3]          time: 0:00:02          Loss: 0.0397          ACC: 99.08%          

Finished Training
Toral Training Time: 0:00:08
