## Kernel Explanation
This kernel builds the model in two different ways:

1. BiLSTM (PyTorch + GloVe)
2. BERT (Keras + TF Hub)

This is the first competition I have participated in. I hope the kernel gives novices some intuition for building their own deep learning models.

**Please note that you have to run method 2 first; otherwise there will be a resource allocation issue on the GPU.**

## Package Import

In [None]:
import pandas as pd
import numpy as np
import os
!pip install sacremoses
import sacremoses
import tqdm
import re
import string
!pip install scikit-learn
from sklearn.model_selection import train_test_split

## Data Preprocessing

In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

In [None]:
train.head()

In [None]:
train.describe()

### Text Cleaning

In [None]:
# https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub('', text)
def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub('', text)
def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)
def text_cleaning(x):
    return remove_punct(remove_html(remove_URL(remove_emoji(x))))
train['text'] = train['text'].apply(text_cleaning)
test['text'] = test['text'].apply(text_cleaning)
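
To make the effect of the cleaning chain concrete, here is a minimal self-contained sketch on a made-up tweet. The helper names `strip_urls`, `strip_html`, and `strip_punct` are hypothetical stand-ins that reuse the same regular expressions as the functions above; the sample text is invented for illustration.

```python
import re
import string

def strip_urls(text):
    # same pattern as remove_URL above
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def strip_html(text):
    # same pattern as remove_html above
    return re.sub(r'<.*?>', '', text)

def strip_punct(text):
    # same translation table as remove_punct above
    return text.translate(str.maketrans('', '', string.punctuation))

# made-up sample tweet (assumption, not taken from the dataset)
sample = "Forest fire near <b>La Ronge</b> Sask. Canada! http://t.co/abc123 #wildfire"

cleaned = strip_punct(strip_html(strip_urls(sample)))
print(repr(cleaned))

# If punctuation is removed first, the URL no longer matches the URL pattern
wrong_order = strip_urls(strip_punct(sample))
print(repr(wrong_order))
```

The order matters: URLs and HTML tags are recognised by their punctuation, so they have to be removed before the punctuation itself. Note also that removing a URL leaves a double space behind and that the `#` of a hashtag is dropped.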

Split the data into training, validation, and test sets:

In [None]:
x_test = test.text.values
x_train, x_val, y_train, y_val = train_test_split(
    train.text.values, train.target.values, test_size=0.3, random_state=101
)

In [None]:
print("shape of test set:", x_test.shape)
print("shape of train set:", x_train.shape)
print("shape of val set:", x_val.shape)
print("true disaster rate in the train set:", round(sum(y_train)/len(y_train), 2))
print("true disaster rate in the val set:", round(sum(y_val)/len(y_val), 2))

In [None]:
number = 101
print("sample text:", x_train[number])
print("sample ans:", "true disaster" if y_train[number] else "not a disaster")

## Method 1: GloVe + BiLSTM with Pytorch
The code below is adapted from the NYU [Natural Language Understanding](https://cims.nyu.edu/~sbowman/teaching.shtml) course taught by [Sam Bowman](https://cims.nyu.edu/~sbowman/index.shtml).

### GloVe:

GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Further reading:
* [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)
* [NLP — Word Embedding & GloVe](https://medium.com/@jonathan_hui/nlp-word-embedding-glove-5e7f523999f6)
* [GloVe详解 - 范永勇](www.fanyeong.com/2018/02/19/glove-in-detail/) (written in Mandarin; a good source if you can read it)

In [None]:
def load_glove(path, embedding_dim):
    with open(path) as f:
        pad_token, unk_token = "<PAD>", "<UNK>"
        token_li = [pad_token, unk_token] # note that index 1 is <UNK>, which means unknown word
        embedding_ls = [np.zeros(embedding_dim), np.random.rand(embedding_dim)]
        for line in f:
            token, raw_embedding = line.split(maxsplit=1)
            token_li.append(token)
            embedding_ls.append(np.array([float(i) for i in raw_embedding.split()]))
    return token_li, np.array(embedding_ls)
path, embedding_dim = "../input/glove6b300d-50k/glove.6B.300d__50k.txt", 300
vocab, embeddings = load_glove(path, embedding_dim)
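
Each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below parses two made-up lines the same way `load_glove` does, so the resulting layout is easy to inspect. The 3-dimensional vectors are invented for illustration; real GloVe vectors here have 300 dimensions.

```python
import numpy as np

# two made-up lines in the GloVe text format: "<token> <v1> <v2> ..."
lines = ["fire 0.1 -0.2 0.3", "flood 0.0 0.5 -0.4"]
dim = 3

tokens = ["<PAD>", "<UNK>"]                      # index 0 = padding, index 1 = unknown word
vectors = [np.zeros(dim), np.random.rand(dim)]   # zero vector for <PAD>, random vector for <UNK>

for line in lines:
    token, raw = line.split(maxsplit=1)          # split off the token, keep the rest as one string
    tokens.append(token)
    vectors.append(np.array([float(v) for v in raw.split()]))

matrix = np.array(vectors)
print(tokens.index("fire"), matrix.shape)  # 2 (4, 3)
```

Because the two special tokens are inserted first, every real word is shifted by two positions, and row `i` of the matrix is the vector of `tokens[i]`.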

[Tokenization](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html):

sentence -> list of tokens

"Coronavirus: WHO advises to wear masks in public areas" -> ['Coronavirus', ':', 'WHO', 'advises', 'to', 'wear', 'masks', 'in', 'public', 'areas']

In [None]:
def featurize(data, labels, tokenizer, vocab, max_seq_len = 128):
    # map each token to its vocabulary index; unknown tokens fall back to index 1 (<UNK>)
    voc_to_idx = {word:idx for idx, word in enumerate(vocab)}
    text_data = [[voc_to_idx.get(token, 1) for token in tokenizer.tokenize(text.lower())] for text in tqdm.tqdm_notebook(data)]
    label_data = labels
    return text_data, label_data
tokenizer = sacremoses.MosesTokenizer()
train_idc, train_lab = featurize(x_train, y_train, tokenizer, vocab)
val_idc, val_lab = featurize(x_val, y_val, tokenizer, vocab)

In [None]:
print("\nTrain text first 5 examples:\n", train_idc[:5])
print("\nTrain label first 5 examples:\n", train_lab[:5])
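
The token-to-index lookup can be tried in isolation. This is a minimal sketch with a toy vocabulary; the whitespace split is only a stand-in for `MosesTokenizer` so that the example runs without extra packages.

```python
# toy vocabulary (assumption); index 0 is <PAD>, index 1 is <UNK>
toy_vocab = ["<PAD>", "<UNK>", "wear", "masks", "in", "public"]
voc_to_idx = {word: idx for idx, word in enumerate(toy_vocab)}

sentence = "Please wear masks in public areas"
tokens = sentence.lower().split()          # stand-in for tokenizer.tokenize(text.lower())

# words missing from the vocabulary map to index 1
ids = [voc_to_idx.get(token, 1) for token in tokens]
print(ids)  # [1, 2, 3, 4, 5, 1]
```

Both "please" and "areas" collapse to the same index, which is why a larger pretrained vocabulary usually helps.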

### PyTorch:

If this is your first time using PyTorch, I highly recommend working through [DEEP LEARNING WITH PYTORCH: A 60 MINUTE BLITZ](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) before scrolling down.

In [None]:
import torch
from torch.utils.data import DataLoader, Dataset

class data_loader(Dataset):
    def __init__(self, data_li, target_li, max_sen_len=128):
        self.data_li = data_li
        self.target_li = target_li
        self.max_sen_len = max_sen_len
        assert (len(self.data_li) == len(self.target_li))
    def __len__(self):
        return len(self.data_li)
    def __getitem__(self, key, max_sen_len=None):
        if not max_sen_len: max_sen_len = self.max_sen_len
        token_idx = self.data_li[key][:max_sen_len]
        label = self.target_li[key]
        return [token_idx, label]
    def collate_func(self, batch):
        data_list = [] # store padded sequences
        label_list = []
        max_batch_seq_len = min(len(max(batch, key=lambda x: len(x[0]))[0]), 128)
        for row in batch:
            sen_len = len(row[0])
            data_list.append(row[0]+[0]*(max_batch_seq_len-sen_len) if sen_len < max_batch_seq_len else row[0][:max_batch_seq_len])
            label_list.append(row[1])
        return [torch.tensor(np.array(data_list)), torch.tensor(np.array(label_list))]
batch_size, max_sen_len = 64, 60
train_dataset = data_loader(train_idc, train_lab, max_sen_len)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, collate_fn=train_dataset.collate_func, shuffle=True)
val_dataset = data_loader(val_idc, val_lab, train_dataset.max_sen_len)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset, batch_size=batch_size, collate_fn=val_dataset.collate_func, shuffle=False)

Check data loader:

In [None]:
data_batch, labels = next(iter(train_loader))
print("data_batch\n", "shape: ", data_batch.shape, "\n", "sample:", data_batch)
print("labels\n", "shape: ", labels.shape, "\n", "sample:", labels)
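
The padding logic of `collate_func` can be sketched in plain Python. The helper `pad_batch` is a hypothetical name used only for this illustration: each batch is padded with index 0 up to its longest sequence, capped at a maximum length.

```python
def pad_batch(seqs, max_len, pad_idx=0):
    """Pad (or truncate) every sequence to the longest one in the batch, capped at max_len."""
    batch_len = min(max(len(s) for s in seqs), max_len)
    # a negative repeat count yields an empty list, so long sequences are only truncated
    return [s[:batch_len] + [pad_idx] * (batch_len - len(s)) for s in seqs]

batch = [[5, 8, 2], [7], [4, 4, 9, 1, 3]]
padded = pad_batch(batch, max_len=4)
print(padded)  # [[5, 8, 2, 0], [7, 0, 0, 0], [4, 4, 9, 1]]
```

Padding per batch rather than to a global length keeps short batches small, which saves computation.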

### BiLSTM model

BiLSTM is the bidirectional LSTM model. It reads the sentence in both directions, so the representation of each token captures both the preceding and the following context.

More for BiLSTM:
* [Bi-LSTM](https://medium.com/@raghavaggarwal0089/bi-lstm-bc3d68da8bd0)
* [The Bidirectional Language Model](https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27)
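
The idea of bidirectionality can be sketched with NumPy alone. In the sketch below a simple tanh recurrence stands in for the LSTM cell (an assumption made for brevity, not the real LSTM equations), and all weights are random. One pass runs left to right, a second pass runs right to left, and the two state sequences are concatenated per token before max-pooling over time.

```python
import numpy as np

def run_rnn(x, W, U):
    """Toy tanh recurrence standing in for an LSTM cell (illustration only)."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, emb_dim, hidden = 5, 4, 3
x = rng.normal(size=(seq_len, emb_dim))    # one sentence of 5 embedded tokens

W_f, U_f = rng.normal(size=(hidden, emb_dim)), rng.normal(size=(hidden, hidden))
W_b, U_b = rng.normal(size=(hidden, emb_dim)), rng.normal(size=(hidden, hidden))

forward = run_rnn(x, W_f, U_f)                 # state t has seen tokens 0..t
backward = run_rnn(x[::-1], W_b, U_b)[::-1]    # state t has seen tokens t..end

bi_out = np.concatenate([forward, backward], axis=1)   # shape (seq_len, 2*hidden)
sentence_vec = bi_out.max(axis=0)                      # max-pool over time, shape (2*hidden,)
print(bi_out.shape, sentence_vec.shape)  # (5, 6) (6,)
```

The factor 2 in the concatenated width is the reason a classifier layer on top of a bidirectional model takes `2*hidden_size` inputs.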

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class LSTMClassifier(nn.Module):
    def __init__(self, embeddings, hidden_size, num_layers, num_classes, bidirectional, dropout_prob=0.3):
        super().__init__()
        self.embedding_layer = self.load_pretrain_embeddings(embeddings)
        self.embeddings_dim = embeddings.shape[1]
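        # batch_first=True because the data loader yields tensors of shape (batch, seq_len)
        self.bilstm = nn.LSTM(input_size=embeddings.shape[1], hidden_size=hidden_size, num_layers=num_layers, dropout=dropout_prob, bidirectional=bidirectional, batch_first=True)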
        self.dropout = nn.Dropout(p=dropout_prob)
        self.non_linear = nn.ReLU()
        self.clf = nn.Linear(2*hidden_size, num_classes) # classifier layer: 2 is due to bidirectional
        self.maxpool2 = nn.MaxPool2d(kernel_size=1)
    def load_pretrain_embeddings(self, embeddings):
        embedding_layer = nn.Embedding(embeddings.shape[0], embeddings.shape[1], padding_idx=0)
        embedding_layer.weight.data = torch.Tensor(embeddings).float()
        return embedding_layer
    def forward(self, inputs):
        X = self.embedding_layer(inputs)
        bilstm_out, (h_n, c_n) = self.bilstm(X)
        out = torch.max(input=bilstm_out, dim=1)
        out = self.non_linear(out.values)
        logits = self.clf(out)
        return logits

Define the evaluation function

In [None]:
def evaluate(model, dataloader, device):
    model.eval()
    with torch.no_grad():
        all_preds, all_labels = [], []
        for batch_text, batch_label in dataloader:
            y_preds = model(batch_text.to(device))
            all_preds.append(y_preds.detach().cpu().numpy())
            all_labels.append(batch_label.numpy())
    # concatenate the per-batch arrays directly; the last batch may be smaller than the others
    preds, labels = np.concatenate(all_preds, axis=0), np.concatenate(all_labels, axis=0)
    return (preds.argmax(-1)==labels).mean()
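
The last line of `evaluate` turns logits into an accuracy value. A small sketch with made-up logits shows the computation step by step:

```python
import numpy as np

# made-up logits for 4 examples and 2 classes (illustration only)
logits = np.array([[2.0, -1.0],
                   [0.1,  0.3],
                   [-0.5, 1.5],
                   [1.2,  0.9]])
labels = np.array([0, 1, 0, 0])

preds = logits.argmax(-1)             # index of the larger logit per row
accuracy = (preds == labels).mean()   # fraction of correct predictions
print(preds, accuracy)  # [0 1 1 0] 0.75
```

No softmax is needed for this step, because the largest logit is also the largest probability.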

Set the hyperparameters

In [None]:
hidden_size = 32
num_layers = 1
num_classes = 2
bidirectional = True
torch.manual_seed(1234)
device = torch.device('cpu')
# device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device('cpu')
# we will use cpu and leave the GPU resource for method 2
print("device: ", device)
model1 = LSTMClassifier(embeddings, hidden_size, num_layers, num_classes, bidirectional)
model1.to(device)
criterion = nn.CrossEntropyLoss()
learning_rate = 0.005
optimizer = optim.Adam(model1.parameters(), lr=learning_rate)

In [None]:
print("model structure:\n", model1)

Train the model, validating after every epoch and stopping early once the tolerance is exceeded

In [None]:
train_loss_history = []
val_acc_history = []
best_val_accuracy = 0
tolerance = 0
early_stop_patience = 2
num_epochs = 10
  
for epoch in tqdm.tqdm_notebook(range(num_epochs)):
    model1.train()
    for i, (batch_data, batch_label) in enumerate(train_loader):
        y_preds = model1(batch_data.to(device))
        loss = criterion(y_preds, batch_label.to(device)) # note that the predictions must come before the true labels
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        train_loss_history.append(loss.item())
    val_acc = evaluate(model=model1, dataloader=val_loader, device=device)
    val_acc_history.append(val_acc)
    print("epoch: {}; val_accuracy: {}".format(epoch, val_acc))
    if val_acc > best_val_accuracy:
        best_val_accuracy = val_acc
        torch.save(model1, "pytorch_bilstm_best.pt")  # checkpoint only when validation accuracy improves
    else:
        tolerance += 1  # count epochs without improvement
    if tolerance > early_stop_patience:
        break
print("Best validation accuracy is: ", best_val_accuracy)
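
The stopping rule can be studied without training anything. The sketch below replays a made-up list of validation accuracies through the same logic; `early_stop_epoch` is a hypothetical helper. As in the loop above, the counter is cumulative over all non-improving epochs, whereas many implementations reset it after each improvement.

```python
def early_stop_epoch(val_accs, patience=2):
    """Return (epoch at which training stops, best accuracy seen so far)."""
    best, misses = 0.0, 0
    for epoch, acc in enumerate(val_accs):
        if acc > best:
            best = acc
        else:
            misses += 1          # cumulative count, never reset
        if misses > patience:
            return epoch, best
    return len(val_accs) - 1, best

# made-up validation accuracies per epoch (illustration only)
history = [0.70, 0.78, 0.77, 0.80, 0.79, 0.79, 0.81]
stop_epoch, best_acc = early_stop_epoch(history, patience=2)
print(stop_epoch, best_acc)  # 5 0.8
```

In this made-up run the rule stops at epoch 5 and never reaches the 0.81 of epoch 6, which is the usual trade-off of early stopping.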

### Evaluate the test set

Construct test data loader

In [None]:
test_idc, test_lab = featurize(x_test, [0]*len(x_test), tokenizer, vocab) # test label is fake data
test_dataset = data_loader(test_idc, test_lab, max_sen_len)
# shuffle must be False so that the predictions stay aligned with the test ids
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, collate_fn=test_dataset.collate_func, shuffle=False)

Test prediction

In [None]:
model1.eval()
with torch.no_grad():
    all_preds = []
    for batch_text, _ in test_loader:
        preds = model1(batch_text.to(device))
        all_preds.append(preds.detach().cpu().numpy())
    all_preds = np.concatenate(all_preds, axis=0)  # the last batch may be smaller, so concatenate the list directly

pred_res1 = all_preds.argmax(-1)

## Method 2: BERT from TF Hub with Keras

The following code is based on these pages:
* [bert_en_uncased_L-24_H-1024_A-16](https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/2)
* [Disaster NLP: Keras BERT using TFHub](https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub)

If this is your first time using Keras, I highly recommend the [TensorFlow in Practice Specialization](https://www.coursera.org/specializations/tensorflow-in-practice) on Coursera.

If you want to know more about the Transformer and BERT, I have listed two videos below. They are the best tutorials I have found so far.
* [Transformer Neural Networks - EXPLAINED! (Attention is all you need)](https://www.youtube.com/watch?v=TQQlZhbC5ps&t=2s)
* [BERT Neural Network - EXPLAINED!](https://www.youtube.com/watch?v=xI0HHN5XKDo)

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
# gpus = tf.config.experimental.list_physical_devices('GPU')
# tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])

!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
import tokenization

In [None]:
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1", trainable=True)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

Check the tokenization result below:

In [None]:
text = "Coronavirus: WHO advises to wear masks in public areas"
tokenize_data = tokenizer.tokenize(text)
print("Text after tokenization:\n", tokenize_data, "\n")

max_len = 20
text = tokenize_data[:max_len-2]
input_seq = ["[CLS]"] + text + ["[SEP]"]
print("After adding [CLS] and [SEP]:\n", input_seq, "\n")

pad_len = max_len-len(input_seq)
tokens = tokenizer.convert_tokens_to_ids(input_seq)+[0]*pad_len
print("After converting Tokens to Id and adding the pad:\n", tokens, "\n")

pad_masks = [1]*len(input_seq) + [0]*pad_len
print("Pad Masking:\n", pad_masks, "\n")

Preprocessing: tokenize each text, add [CLS] and [SEP], then pad to a fixed length

In [None]:
def pre_process(context_data, tokenizer, max_len=128):
    all_tokens = []
    all_masks = []
    all_segments = []
    for text in context_data:
        input_seq = ["[CLS]"]+ tokenizer.tokenize(text)[:max_len-2] + ["[SEP]"] # 2 for [CLS] and [SEP]
        pad_len = max_len-len(input_seq)
        tokens = tokenizer.convert_tokens_to_ids(input_seq) + [0]*pad_len
        pad_masks = [1]*len(input_seq) + [0]*pad_len
        segment_ids = [0]*max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

In [None]:
max_len = 128
train_input = pre_process(train.text.values, tokenizer, max_len)
train_label = train.target.values

Model construction

In [None]:
input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
# Fine-tuning: classification head on the [CLS] token (position 0) of the sequence output
clf_output = sequence_output[:, 0, :]
out = Dense(1, activation='sigmoid')(clf_output)

model2 = Model(
    inputs=[input_word_ids, input_mask, segment_ids],
    outputs=out
)
model2.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
model2.summary()

In [None]:
checkpoint = ModelCheckpoint('keras_bert_model.h5', monitor='val_loss', save_best_only=True)

train_history = model2.fit(
    train_input, train_label,
    validation_split=0.3,
    epochs=3,
    callbacks=[checkpoint],
    batch_size=16
)

Load the best weights and predict on the test set

In [None]:
model2.load_weights('keras_bert_model.h5')
test_input = pre_process(test.text.values, tokenizer, max_len)
pred_res2 = model2.predict(test_input).round().astype(int).flatten()  # flatten (n, 1) to (n,)
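
Rounding the sigmoid output is what turns probabilities into class labels. A minimal sketch with made-up probabilities of the same `(n, 1)` shape that a single sigmoid unit produces:

```python
import numpy as np

# made-up sigmoid outputs, shape (4, 1) (illustration only)
probs = np.array([[0.91], [0.08], [0.55], [0.49]])

# round() acts as a 0.5 threshold; ravel() flattens (n, 1) to (n,)
labels = probs.round().astype(int).ravel()
print(labels)  # [1 0 1 0]
```

A flat array of integers is the shape a one-column `target` field expects.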

## Submit the result

In [None]:
pred_res = pred_res2 # pred_res1 or pred_res2

pred_res = np.array(sorted(np.array([list(test["id"])]+[list(pred_res)]).T, key=lambda x: x[0]))

submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
assert(list(submission.id)==list(pred_res[:,0])) # check that the ids match the sample submission
submission["target"] = pred_res[:,1]
submission.to_csv('submission.csv', index=False)
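
When the predictions are already in the same order as the test rows, the submission table can also be assembled directly. This is a sketch with made-up ids and predictions, shown as an alternative rather than as the method used above:

```python
import pandas as pd

# made-up ids and predictions in matching order (illustration only)
test_ids = [0, 2, 3, 9]
preds = [1, 0, 0, 1]

submission = pd.DataFrame({"id": test_ids, "target": preds})
print(submission)
```

This only works if the prediction order was never shuffled, so the test data loader must keep the original row order.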