
# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

## 0. Paper

### Info
* Title : BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
* Author : Jacob Devlin et al.
* Publication : NAACL 2019, [link](https://arxiv.org/abs/1810.04805)

### Summary
* Learns bidirectional representations with Transformer encoder layers
* Pre-trains on a large corpus with a masked language model objective
* Fine-tunes the pre-trained model with minimal architectural changes and achieves strong performance
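
This notebook skips pre-training, so the masked language model objective never appears in the code below. As a rough illustration only, the masking rule described in the paper (select about 15% of positions; replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged) could be sketched as follows. The function name `mask_tokens`, its parameters, and the toy tokens are assumptions made for this example, not part of the notebook.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); a label is None where no prediction is required.

    Hypothetical helper for illustration; not used elsewhere in this notebook.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        # Never mask [CLS]; otherwise select a position with probability mask_prob
        if tok == '[CLS]' or rng.random() >= mask_prob:
            masked.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked.append('[MASK]')           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(tok)                # 10%: keep the token unchanged
    return masked, labels

tokens = ['[CLS]', 'the', 'movie', 'was', 'fun']  # toy tokens
masked, labels = mask_tokens(tokens, vocab=['the', 'movie', 'was', 'fun'])
print(masked, labels)
```

The loss would then be computed only at positions whose label is not `None`.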

### Differences
* dataset : GLUE -> Naver sentiment movie corpus, [link](https://github.com/e9t/nsmc/)
* pre-train : True -> False
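* positional embedding : learned embedding layer, as in the paper (sinusoidal encodings belong to the original Transformer)
* tokenization : WordPiece -> morpheme (Komoran POS-tagged tokens)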
* max_len : 512 -> 64
* num_layer : 12 -> 4
* model_dim : 768 -> 128
* dff : 3072 -> 128
* num_head : 12 -> 4

## 1. Env

In [0]:
!pip install -q konlpy

In [0]:
# Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [0]:
# Libraries
import os
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from konlpy.tag import Komoran

import tensorflow as tf

In [0]:
# GPU Setting
!nvidia-smi

print(f'tensorflow version : {tf.__version__}')
print(f'available GPU list : {tf.config.list_physical_devices("GPU")}')

Wed May 27 13:23:36 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
# Hyperparameters
config = {
    'base_dir' : '/content/drive/Shared drives/Yoon/Project/Doing/Deep Learning Paper Implementation',
    'max_len' : 64,
    'num_layer' : 4,
    'model_dim' : 128,
    'dff' : 128,
    'num_head' : 4,
    'drop_rate' : 0.2,
    'batch_size' : 256,
    'epoch_size' : 3,

}

## 2. Data

In [0]:
def tokenize(string, tokenizer):
    # tokenizer.pos returns (morpheme, POS tag) pairs; join each pair as 'morpheme_TAG'
    tokens = tokenizer.pos(string)
    tokens = ['_'.join(i) for i in tokens]
    return tokens

class TextDataset(tf.keras.utils.Sequence):
    def __init__(self, data, vocab, shuffle, config):
        self.data = data
        self.vocab = vocab
        self.token2id = {j:i for i,j in enumerate(vocab)}
        self.data_len = len(data)
        self.shuffle = shuffle
        self.config = config
        self.idx = 0
        self.on_epoch_end()
    
    def __len__(self):
        # Number of batches per epoch (np.int was removed from recent NumPy releases)
        return int(np.ceil(len(self.data) / self.config['batch_size']))
    
    def on_epoch_end(self):
        if self.shuffle:
            self.indices = np.random.permutation(self.data_len)
        else:
            self.indices = np.arange(self.data_len)


    def __getitem__(self, idx):
        batch_size, max_len = self.config['batch_size'], self.config['max_len']
        batch_idx = self.indices[batch_size*idx : batch_size*(idx+1)]
        x = self.data.iloc[batch_idx]['tokens']
        x = [['[CLS]'] + i for i in x]  # prepend the classification token
        x = [i[:max_len] for i in x]  # truncate to max_len
        # Map tokens to ids; out-of-vocabulary tokens fall back to [UNK]
        unk_id = self.token2id['[UNK]']
        x = [[self.token2id.get(j, unk_id) for j in i] for i in x]
        # Right-pad with 0, the id of [PAD]
        x = [np.pad(i, (0, max_len-len(i)), 'constant') for i in x]
        x = np.array(x).astype(np.int64)
        y = self.data.iloc[batch_idx]['label']
        y = np.array(y).astype(np.int64)
        return x, y

    def next(self):
        if self.idx == self.__len__():
            self.on_epoch_end()
            self.idx = 0
        x, y = self.__getitem__(self.idx)
        self.idx += 1
        return x, y
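
To make the per-sentence preprocessing in `__getitem__` easier to follow, here is a minimal pure-Python sketch of the same steps for a single sentence: prepend `[CLS]`, truncate, map tokens to ids with an `[UNK]` fallback, and right-pad with the `[PAD]` id. The helper name `encode` and the toy vocabulary are assumptions for illustration; the special tokens follow the order this notebook uses when it builds `vocab`.

```python
def encode(tokens, token2id, max_len):
    """Hypothetical single-sentence version of the batch logic in TextDataset."""
    tokens = (['[CLS]'] + tokens)[:max_len]       # prepend [CLS], then truncate
    unk_id = token2id['[UNK]']
    ids = [token2id.get(t, unk_id) for t in tokens]  # unknown tokens map to [UNK]
    pad_id = token2id['[PAD]']
    return ids + [pad_id] * (max_len - len(ids))  # right-pad up to max_len

# Toy vocabulary: the four special tokens first, then two made-up morpheme_TAG tokens
toy_vocab = ['[PAD]', '[CLS]', '[MASK]', '[UNK]', 'movie_NNG', 'fun_VA']
toy_token2id = {tok: i for i, tok in enumerate(toy_vocab)}

print(encode(['movie_NNG', 'boring_VA'], toy_token2id, max_len=6))
# [1, 4, 3, 0, 0, 0]
```

Because `[PAD]` sits at index 0, the padding mask in the model section can detect padded positions simply by comparing ids with 0.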

In [0]:
data_path = os.path.join(config['base_dir'], 'data/naver_sentiment_movie_corpus.zip')

In [0]:
!unzip "{data_path}" -d "/content/data"

In [0]:
train_data = pd.read_table('/content/data/ratings_train.txt')
test_data = pd.read_table('/content/data/ratings_test.txt')

In [0]:
train_data.head()

| | id | document | label |
|---|---|---|---|
| 0 | 9976970 | 아 더빙.. 진짜 짜증나네요 목소리 | 0 |
| 1 | 3819312 | 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 | 1 |
| 2 | 10265843 | 너무재밓었다그래서보는것을추천한다 | 0 |
| 3 | 9045019 | 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 | 0 |
| 4 | 6483659 | 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... | 1 |


In [0]:
train_data.isnull().sum()

id          0
document    5
label       0
dtype: int64

In [0]:
train_data = train_data.dropna()
test_data = test_data.dropna()

In [0]:
tokenizer = Komoran()

In [0]:
train_data['tokens'] = train_data['document'].apply(lambda x : tokenize(x, tokenizer))
test_data['tokens'] = test_data['document'].apply(lambda x : tokenize(x, tokenizer))

In [0]:
vocab = list(itertools.chain(*train_data['tokens']))
vocab = list(set(vocab))
vocab = ['[PAD]', '[CLS]', '[MASK]', '[UNK]'] + vocab

In [0]:
print(len(vocab))

64916


In [0]:
train_dataset = TextDataset(train_data, vocab, True, config)
test_dataset = TextDataset(test_data, vocab, False, config)

In [0]:
x, y = train_dataset.next()

In [0]:
x.shape, y.shape

((256, 64), (256,))

## 3. Model

In [0]:
def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

def gelu(x):
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))
    return x * cdf

def scaled_dot_product_attention(q, k, v, mask):
    matmul_qk = tf.matmul(q, k, transpose_b=True) 
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        scaled_attention_logits += (mask * -1e9)  

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1) 
    output = tf.matmul(attention_weights, v)  
    return output, attention_weights


class Embedding(tf.keras.layers.Layer):
    def __init__(self, num_token, model_dim, max_len, drop_rate):
        super(Embedding, self).__init__()
        self.token_embedding = tf.keras.layers.Embedding(num_token, model_dim, mask_zero=False, input_length=max_len)
        self.pos_embedding = tf.keras.layers.Embedding(max_len, model_dim, mask_zero=False, input_length=max_len)
        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout = tf.keras.layers.Dropout(drop_rate)
        self.pos = tf.range(0, max_len)
    
    def call(self, x, training):
        token_embedded = self.token_embedding(x)
        pos_embedded = self.pos_embedding(self.pos)
        pos_embedded = pos_embedded[None,:,:]
        embedded = pos_embedded + token_embedded
        embedded = self.layernorm(embedded)
        embedded = self.dropout(embedded, training=training)
        return embedded


class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, model_dim, num_head):
        super(MultiHeadAttention, self).__init__()
        self.num_head = num_head
        self.model_dim = model_dim
        self.depth = model_dim // num_head
        assert model_dim % num_head == 0

        self.wq = tf.keras.layers.Dense(model_dim)
        self.wk = tf.keras.layers.Dense(model_dim)
        self.wv = tf.keras.layers.Dense(model_dim)
        self.dense = tf.keras.layers.Dense(model_dim)
        
    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_head, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, q, k, v, mask):
        batch_size = tf.shape(q)[0]
        
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
        
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
        
        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.model_dim))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)            
        return output, attention_weights


class PointWiseFeedForward(tf.keras.layers.Layer):
    def __init__(self, model_dim, dff):
        super(PointWiseFeedForward, self).__init__()
        self.dense1 = tf.keras.layers.Dense(dff)
        self.dense2 = tf.keras.layers.Dense(model_dim)
    
    def call(self, x):
        x = self.dense1(x)
        x = gelu(x)
        x = self.dense2(x)
        return x


class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, model_dim, num_head, dff, drop_rate):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(model_dim, num_head)
        self.pwff = PointWiseFeedForward(model_dim, dff)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(drop_rate)
        self.dropout2 = tf.keras.layers.Dropout(drop_rate)
        
    def call(self, x, mask, training):
        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)
        
        pwff_output = self.pwff(out1)  # (batch_size, input_seq_len, d_model)
        pwff_output = self.dropout2(pwff_output, training=training)
        out2 = self.layernorm2(out1 + pwff_output)  # (batch_size, input_seq_len, d_model)
        return out2

class OutputLayer(tf.keras.layers.Layer):
    def __init__(self, model_dim, drop_rate):
        super(OutputLayer, self).__init__()
        self.dense1 = tf.keras.layers.Dense(model_dim // 2, activation='relu')  # units must be an integer
        self.dense2 = tf.keras.layers.Dense(1, activation='sigmoid')
        self.dropout = tf.keras.layers.Dropout(drop_rate)
    
    def call(self, x, training):
        x = x[:, 0]
        x = self.dense1(x)
        x = self.dropout(x, training=training)
        x = self.dense2(x)
        return x

class Network(tf.keras.Model):
    def __init__(self, num_token, config):
        super(Network, self).__init__()
        self.config = config
        self.embedding = Embedding(num_token, config['model_dim'], config['max_len'], config['drop_rate'])
        self.encoder_layers = [EncoderLayer(config['model_dim'], config['num_head'], config['dff'], config['drop_rate']) for _ in range(config['num_layer'])]
        self.output_layer = OutputLayer(config['model_dim'], config['drop_rate'])
    
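    def call(self, x, training=None):
        mask = create_padding_mask(x)  # 1.0 where the token id is 0 ([PAD])
        x = self.embedding(x, training=training)
        for i in range(self.config['num_layer']):
            x = self.encoder_layers[i](x, mask, training=training)
        x = self.output_layer(x, training=training)  # classify from the [CLS] position
        return x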
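
The effect of the padding mask inside `scaled_dot_product_attention` can be checked by hand on one row of attention logits. The following pure-Python sketch mirrors the `mask * -1e9` trick and the erf-based `gelu` defined above; the helper names `softmax` and `masked_attention_weights` and the toy numbers are assumptions for illustration, not code from the notebook.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, token_ids):
    """scores: one row of scaled q.k logits; token_ids: key-side ids (0 = [PAD])."""
    mask = [1.0 if t == 0 else 0.0 for t in token_ids]
    # Adding -1e9 drives padded logits so low that softmax assigns them zero weight
    logits = [s + m * -1e9 for s, m in zip(scores, mask)]
    return softmax(logits)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

weights = masked_attention_weights([0.5, 1.0, 0.2, 0.9], token_ids=[1, 7, 0, 0])
print(weights)    # the last two (padded) positions receive weight 0.0
print(gelu(1.0))  # roughly 0.8413
```

The two real tokens share all of the attention mass, so padded positions never contribute to the weighted sum over `v`.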

In [0]:
network = Network(len(vocab), config)
network.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['acc']
)

_ = network(x, training=False)
network.summary()  # summary() prints the table itself and returns None

Model: "network_14"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_42 (Embedding)     multiple                  8317696   
_________________________________________________________________
encoder_layer_66 (EncoderLay multiple                  99584     
_________________________________________________________________
encoder_layer_67 (EncoderLay multiple                  99584     
_________________________________________________________________
encoder_layer_68 (EncoderLay multiple                  99584     
_________________________________________________________________
encoder_layer_69 (EncoderLay multiple                  99584     
_________________________________________________________________
output_layer_12 (OutputLayer multiple                  8321      
Total params: 8,724,353
Trainable params: 8,724,353
Non-trainable params: 0
______________________________________________
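
The parameter counts in the summary above can be reproduced by hand from the configuration and the vocabulary size of 64,916. The sketch below is plain arithmetic; the helper names are assumptions for this example, and it counts only what the layers in this notebook define (Dense weights and biases, embedding tables, LayerNormalization scale and offset).

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix plus bias

def layernorm_params(dim):
    return 2 * dim  # gamma and beta

def embedding_params(num_token, model_dim, max_len):
    # token table + position table + one LayerNormalization
    return num_token * model_dim + max_len * model_dim + layernorm_params(model_dim)

def encoder_layer_params(model_dim, dff):
    attention = 4 * dense_params(model_dim, model_dim)  # wq, wk, wv, output dense
    feed_forward = dense_params(model_dim, dff) + dense_params(dff, model_dim)
    return attention + feed_forward + 2 * layernorm_params(model_dim)

def output_layer_params(model_dim):
    return dense_params(model_dim, model_dim // 2) + dense_params(model_dim // 2, 1)

num_token, max_len, num_layer, model_dim, dff = 64916, 64, 4, 128, 128
total = (embedding_params(num_token, model_dim, max_len)
         + num_layer * encoder_layer_params(model_dim, dff)
         + output_layer_params(model_dim))
print(total)  # 8724353
```

Roughly 95% of the parameters sit in the token embedding table, which follows from the large morpheme vocabulary relative to the small `model_dim`.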

## 4. Train

In [0]:
hist = network.fit(
    train_dataset,
    epochs = config['epoch_size']
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


## 5. Test

In [0]:
loss, acc = network.evaluate(test_dataset)
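
With a single sigmoid output and a binary cross-entropy loss, the `'acc'` metric is expected to behave as binary accuracy with a 0.5 threshold. A minimal sketch of that computation, assuming predicted probabilities are already available as a list (the function name and the toy numbers are made up for illustration):

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(p == t) for p, t in zip(preds, y_true))
    return correct / len(y_true)

# Toy labels and probabilities, not results from this model
print(binary_accuracy([0, 1, 1, 0], [0.2, 0.7, 0.4, 0.1]))  # 0.75
```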

