<a href="https://colab.research.google.com/github/respect5716/Deep-Learning-Paper-Implementation/blob/master/03_NLP/DIET%20_%20Lightweight%20Language%20Understanding%20for%20Dialogue%20Systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DIET _ Lightweight Language Understanding for Dialogue Systems

## 0. Paper

### Info
* TItle : DIET _ Lightweight Language Understanding for Dialogue Systems
* Author : Tanja Bunk et al.
* Publication : [link](https://arxiv.org/abs/2004.09936)

### Summary
* Intent Classification과 Entity Recognition을 Multi-tasking하는 Transformer 기반 모델

### Differences
* Dataset : NLU Benchmarks -> ATIS, [link](https://www.kaggle.com/siddhadev/atis-dataset-clean/data#)
* Featurization : Sparse and Pretrained Dense -> Unpretrained Dense
* Loss : Similarity -> Crossentropy

## 1. Setting

In [None]:
# Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [2]:
# Libraries
import os
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import tensorflow as tf
import tensorflow_addons as tfa

  import pandas.util.testing as tm


In [3]:
# GPU Setting
!nvidia-smi

print(f'tensorflow version : {tf.__version__}')
print(f'available GPU list : {tf.config.list_physical_devices("GPU")}')

Mon Aug  3 04:48:33 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
# Hyperparameters
CONFIG = {
    'base_dir' : '/content/drive/Shared drives/Yoon/Project/Doing/Deep Learning Paper Implementation',
    'ffn_dim' : 128,
    'model_dim' : 128,
    'num_head' : 4,
    'num_layer' : 2,
    'seq_len' : 30,
    'drop_rate' : 0.2,
    'batch_size' : 64,
    'epoch_size' : 100
}

## 2. Data

In [None]:
data_path = os.path.join(CONFIG['base_dir'], 'data/atis.zip')
!unzip "{data_path}" -d /content/data

In [6]:
train_data = pd.read_csv('/content/data/atis.train.csv')
dev_data = pd.read_csv('/content/data/atis.dev.csv')
test_data = pd.read_csv('/content/data/atis.test.csv')

train_data.head()

Unnamed: 0,id,tokens,slots,intent
0,train-00001,BOS what is the cost of a round trip flight fr...,O O O O O O O B-round_trip I-round_trip O O B-...,atis_airfare
1,train-00002,BOS now i need a flight leaving fort worth and...,O O O O O O O B-fromloc.city_name I-fromloc.ci...,atis_flight
2,train-00003,BOS i need to fly from kansas city to chicago ...,O O O O O O B-fromloc.city_name I-fromloc.city...,atis_flight
3,train-00004,BOS what is the meaning of meal code s EOS,O O O O O O B-meal_code I-meal_code I-meal_code O,atis_abbreviation
4,train-00005,BOS show me all flights from denver to pittsbu...,O O O O O O B-fromloc.city_name O B-toloc.city...,atis_flight


### Preprocess

In [7]:
train_data['tokens'] = train_data['tokens'].apply(lambda x : x.split(' ')[1:-1])
dev_data['tokens'] = dev_data['tokens'].apply(lambda x : x.split(' ')[1:-1])
test_data['tokens'] = test_data['tokens'].apply(lambda x : x.split(' ')[1:-1])

train_data['slots'] = train_data['slots'].apply(lambda x : x.split(' ')[1:-1])
dev_data['slots'] = dev_data['slots'].apply(lambda x : x.split(' ')[1:-1])
test_data['slots'] = test_data['slots'].apply(lambda x : x.split(' ')[1:-1])

train_data['slots'] = train_data['slots'].apply(lambda x : [i.split('-')[1] if '-' in i else i for i in x])
dev_data['slots'] = dev_data['slots'].apply(lambda x : [i.split('-')[1] if '-' in i else i for i in x])
test_data['slots'] = test_data['slots'].apply(lambda x : [i.split('-')[1] if '-' in i else i for i in x])

In [8]:
token_vocab = list(set(itertools.chain(*train_data['tokens'])))
slot_vocab = list(set(itertools.chain(*train_data['slots'])))
intent_vocab = list(set(train_data['intent']))

In [9]:
token_vocab = ['[PAD]', '[CLS]', '[MASK]', '[UNK]'] + token_vocab

### Data Loader

In [10]:
class Tokenizer(object):
    def __init__(self, token_vocab, slot_vocab, intent_vocab):
        self.token_vocab = token_vocab
        self.slot_vocab = slot_vocab
        self.intent_vocab = intent_vocab
        self.prepare_dict()

    def prepare_dict(self):
        self.token2id = {j:i for i,j in enumerate(self.token_vocab)}
        self.slot2id = {j:i for i,j in enumerate(self.slot_vocab)}
        self.intent2id = {j:i for i,j in enumerate(self.intent_vocab)}

    def encode_token(self, token):
        return [self.token2id.get(i, self.token2id['[UNK]']) for i in token]

    def encode_slot(self, slot):
        return [self.slot2id[i] for i in slot]

    def encode_intent(self, intent):
        return self.intent2id[intent]

In [11]:
class Dataloader(object):
    def __init__(self, data, tokenizer, mode):
        self.data = data
        self.tokenizer = tokenizer
        self.mode = mode
        self.on_epoch_end()
    
    def __len__(self):
        return np.ceil(len(self.data) / CONFIG['batch_size'])
    
    def on_epoch_end(self):
        self.idx = 0
        if self.mode == 'test':
            self.indices = np.arange(len(self.data))
        else:
            self.indices = np.random.permutation(len(self.data))
    
    def __getitem__(self, idx):
        batch_idx = self.indices[CONFIG['batch_size']*idx : CONFIG['batch_size']*(idx+1)]
        batch_data = self.data.iloc[batch_idx]
        batch_tokens = batch_data['tokens']
        batch_masked_tokens = [[self.apply_mask(j) for j in i] for i in batch_tokens]

        batch_tokens = [self.tokenizer.encode_token(i) for i in batch_tokens]
        batch_tokens = np.array([self.pad_seq(i) for i in batch_tokens])
        batch_masked_tokens = [self.tokenizer.encode_token(i) for i in batch_masked_tokens]
        batch_masked_tokens = np.array([self.pad_seq(i) for i in batch_masked_tokens])
        batch_masked_tokens[:,-1] = self.tokenizer.token2id['[CLS]']

        batch_slots = batch_data['slots']
        batch_slots = [self.tokenizer.encode_slot(i) for i in batch_slots]
        batch_slots = np.array([self.pad_seq(i, self.tokenizer.slot2id['O']) for i in batch_slots])

        batch_intent = batch_data['intent']
        batch_intent = [self.tokenizer.encode_intent(i) for i in batch_intent]
        batch_intent = np.array(batch_intent)

        padding_mask = np.where(batch_tokens==0, 1, 0)[:,None,None,:].astype(np.float32)
        mlm_loss_mask = np.where(batch_masked_tokens==2, 1, 0).astype(np.float32)

        return batch_masked_tokens, (batch_tokens, batch_slots, batch_intent), (padding_mask, mlm_loss_mask)

    def apply_mask(self, token):
        r = np.random.rand()
        if r < 0.105:
            return '[MASK]'
        elif r < 0.12:
            return self.tokenizer.token_vocab[np.random.randint(5, len(self.tokenizer.token_vocab))]
        else:
            return token
    
    def pad_seq(self, seq, value=0):
        seq = seq[:CONFIG['seq_len']]
        return np.pad(seq, (0, CONFIG['seq_len']-len(seq)), 'constant', constant_values=value)

In [12]:
tokenizer = Tokenizer(token_vocab, slot_vocab, intent_vocab)
train_loader = Dataloader(train_data, tokenizer, 'train')

In [13]:
x, y, mask = train_loader.__getitem__(1)
x.shape, y[0].shape, y[1].shape, mask[0].shape, mask[1].shape

((64, 30), (64, 30), (64, 30), (64, 1, 1, 30), (64, 30))

## 3. Model

In [16]:
def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)

def gelu(x):
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))
    return x * cdf


class Embedding(tf.keras.layers.Layer):
    def __init__(self, token_vocab_size):
        super(Embedding, self).__init__()
        self.token_embedding = tf.keras.layers.Embedding(token_vocab_size, CONFIG['model_dim'])
        self.position_embedding = tf.keras.layers.Embedding(CONFIG['seq_len'], CONFIG['model_dim'])
        self.pos = tf.range(0, CONFIG['seq_len'])

    def call(self, x):
        x = self.token_embedding(x)
        pos = self.position_embedding(self.pos)
        x += pos
        return x


class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, model_dim, num_head):
        super(MultiHeadAttention, self).__init__()
        self.model_dim = model_dim
        self.num_head = num_head
        self.projection_dim = self.model_dim // self.num_head
        assert self.model_dim % self.num_head == 0

        self.qw = tf.keras.layers.Dense(self.model_dim)
        self.kw = tf.keras.layers.Dense(self.model_dim)
        self.vw = tf.keras.layers.Dense(self.model_dim)
        self.w = tf.keras.layers.Dense(self.model_dim)
    
    def attention(self, q, k ,v, mask):
        dim = tf.cast(tf.shape(q)[-1], tf.float32)
        score = tf.matmul(q, k, transpose_b=True)
        scaled_score = score / tf.math.sqrt(dim)

        if mask is not None:
            scaled_score += (mask * -1e9)

        attention_weights = tf.nn.softmax(scaled_score)
        attention_outputs = tf.matmul(attention_weights, v)
        return attention_outputs, attention_weights
    
    def split_heads(self, x):
        batch_size = tf.shape(x)[0]
        x = tf.reshape(x, (batch_size, -1, self.num_head, self.projection_dim))
        x = tf.transpose(x, perm=[0, 2, 1, 3])
        return x
    
    def combine_heads(self, x):
        batch_size = tf.shape(x)[0]
        x = tf.transpose(x, perm=[0, 2, 1, 3])
        x = tf.reshape(x, (batch_size, -1, self.model_dim))
        return x
    
    def call(self, q, k, v, mask):
        q, k, v = self.qw(q), self.kw(k), self.vw(v)
        q, k, v = self.split_heads(q), self.split_heads(k), self.split_heads(v)
        outputs, weights = self.attention(q, k, v, mask)
        outputs = self.combine_heads(outputs)
        outputs = self.w(outputs)
        return outputs

class FeedForwardNetwork(tf.keras.layers.Layer):
    def __init__(self, model_dim, ffn_dim):
        super(FeedForwardNetwork, self).__init__()
        self.dense1 = tf.keras.layers.Dense(ffn_dim)
        self.dense2 = tf.keras.layers.Dense(model_dim)

    def call(self, x):
        x = self.dense1(x)
        x = gelu(x)
        x = self.dense2(x)
        return x

class TransformerLayer(tf.keras.layers.Layer):
    def __init__(self):
        super(TransformerLayer, self).__init__()
        self.mha = MultiHeadAttention(CONFIG['model_dim'], CONFIG['num_head'])
        self.ffn = FeedForwardNetwork(CONFIG['model_dim'], CONFIG['ffn_dim'])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(CONFIG['drop_rate'])
        self.dropout2 = tf.keras.layers.Dropout(CONFIG['drop_rate'])

    def call(self, x, training, mask=None):
        out1 = self.mha(x, x, x, mask)
        out1 = self.dropout1(out1, training=training)
        out1 = self.layernorm1(x + out1)
        out2 = self.ffn(out1)
        out2 = self.dropout2(out2, training=training)
        out2 = self.layernorm2(out1 + out2)
        return out2

class CRFLayer(tf.keras.layers.Layer):
    def __init__(self, num_tag):
        super(CRFLayer, self).__init__()
        self.dense = tf.keras.layers.Dense(num_tag)
        self.transition_params = None

    def call(self, x, y, mask):
        shape = mask[:,0,0,:].sum(axis=-1)
        x = self.dense(x)
        outputs, self.transition_params = tfa.text.crf_log_likelihood(x, y, shape)
        preds, _ = tfa.text.crf.crf_decode(x, self.transition_params, shape)
        return outputs, preds

class Network(tf.keras.Model):
    def __init__(self, token_size, slot_size, intent_size):
        super(Network, self).__init__()
        self.embedding = Embedding(token_size)
        self.transformers = [TransformerLayer() for _ in range(CONFIG['num_layer'])]
        self.mlm_outputs = tf.keras.layers.Dense(token_size, activation='softmax')
        self.slot_outputs = CRFLayer(slot_size)
        self.intent_outputs = tf.keras.layers.Dense(intent_size, activation='softmax')
        self.optimizer = tf.keras.optimizers.Adam()

    def call(self, inputs, training):
        x, y, (padding_mask, mlm_loss_mask) = inputs
        x = self.embedding(x)
        for trm in self.transformers:
            x = trm(x, mask=padding_mask, training=training)
        mlm = self.mlm_outputs(x)
        slot, slot_pred = self.slot_outputs(x, y[1], mask[0])
        intent = self.intent_outputs(x[:,-1])
        return mlm, intent, slot, slot_pred

In [17]:
token_size = len(tokenizer.token_vocab)
slot_size = len(tokenizer.slot_vocab)
intent_size = len(tokenizer.intent_vocab)

In [19]:
network = Network(token_size, slot_size, intent_size)

## 4. Train

In [20]:
def train_step(network, data):
    x, y, mask = data

    with tf.GradientTape() as g:
        mlm, intent, slot, _ = network((x, y, mask), True)
        mlm_loss = tf.keras.losses.sparse_categorical_crossentropy(y[0], mlm)
        mlm_loss = tf.reduce_mean(mlm_loss * mask[1])
        slot_loss = tf.reduce_sum(-slot) / tf.cast(x.shape[0], tf.float32)
        intent_loss = tf.keras.losses.sparse_categorical_crossentropy(y[2], intent)
        intent_loss = tf.reduce_mean(intent_loss)
        total_loss = mlm_loss + slot_loss + intent_loss
        
    gradients = g.gradient(total_loss, network.trainable_variables)
    network.optimizer.apply_gradients(zip(gradients, network.trainable_variables))
    return {'total':total_loss, 'mlm':mlm_loss, 'slot':slot_loss, 'intent':intent_loss}

In [21]:
for ep in range(CONFIG['epoch_size']):
    for step, data in enumerate(train_loader):
        if len(data[0]) != CONFIG['batch_size']:
            continue
        loss = train_step(network, data)

    if ep % 10 == 0:
        print(f"EP : {str(ep).zfill(3)} | Total : {loss['total'].numpy():.3f} | MLM : {loss['mlm'].numpy():.3f} | Intent : {loss['intent'].numpy():.3f} | Slot : {loss['slot'].numpy():.3f}")

print(f"EP : {str(ep+1).zfill(3)} | Total : {loss['total'].numpy():.3f} | MLM : {loss['mlm'].numpy():.3f} | Intent : {loss['intent'].numpy():.3f} | Slot : {loss['slot'].numpy():.3f}")

Instructions for updating:
Use tf.identity instead.
EP : 000 | Total : 6.997 | MLM : 0.186 | Intent : 0.632 | Slot : 6.180
EP : 010 | Total : 0.974 | MLM : 0.122 | Intent : 0.057 | Slot : 0.795
EP : 020 | Total : 1.088 | MLM : 0.097 | Intent : 0.118 | Slot : 0.872
EP : 030 | Total : 0.717 | MLM : 0.082 | Intent : 0.026 | Slot : 0.609
EP : 040 | Total : 0.575 | MLM : 0.096 | Intent : 0.062 | Slot : 0.417
EP : 050 | Total : 0.648 | MLM : 0.073 | Intent : 0.010 | Slot : 0.565
EP : 060 | Total : 0.318 | MLM : 0.074 | Intent : 0.019 | Slot : 0.225
EP : 070 | Total : 0.608 | MLM : 0.062 | Intent : 0.021 | Slot : 0.525
EP : 080 | Total : 0.610 | MLM : 0.088 | Intent : 0.106 | Slot : 0.416
EP : 090 | Total : 0.353 | MLM : 0.039 | Intent : 0.006 | Slot : 0.308
EP : 100 | Total : 0.214 | MLM : 0.062 | Intent : 0.016 | Slot : 0.136


## 5. Test

In [22]:
def test_step(network, data):
    x, y, mask = data
    mlm, intent, _, slot = network((x, y, mask), False)
    
    mlm_acc = y[0] == np.argmax(mlm, axis=-1)
    mlm_acc = np.sum(mlm_acc * mask[1]) / np.sum(mask[1])

    slot_acc = tf.cast((slot == y[1]), tf.int32)
    slot_mask = (1- mask[0][:,0,0,:])
    slot_acc = np.sum(slot_acc * slot_mask) / np.sum(slot_mask)

    intent_acc = y[2] == np.argmax(intent, axis=-1)
    intent_acc = np.mean(intent_acc)
    return mlm_acc, slot_acc, intent_acc

In [23]:
test_loader = Dataloader(test_data, tokenizer, 'test')

In [24]:
acc = {'mlm':[], 'slot':[], 'intent':[]}
for step, data in enumerate(test_loader):
    if len(data[0]) != CONFIG['batch_size']:
        break
        
    mlm_acc, slot_acc, intent_acc = test_step(network, data)
    acc['mlm'].append(mlm_acc)
    acc['slot'].append(slot_acc)
    acc['intent'].append(intent_acc)

print(f"MLM : {np.mean(acc['mlm']):.3f} | Intent : {np.mean(acc['intent']):.3f} | Slot : {np.mean(acc['slot']):.3f}")

MLM : 0.534 | Intent : 0.958 | Slot : 0.944
