## HW10: BERT fine-tuning

In this exercise, you will learn how to fine-tune a transformer-based model. First, we provide a tutorial on fine-tuning distilBERT (https://arxiv.org/abs/1910.01108) on the Large Movie Review Dataset (IMDB dataset). After that, you will complete the exercise by fine-tuning on the TRUE call-center dataset (HW6). This homework is based on the Hugging Face tutorial (https://huggingface.co/transformers/custom_datasets.html).

### 1. Install the transformers library from Hugging Face

In [1]:
# !pip install torch==1.4.0
_ = !pip install transformers
_ = !pip install pythainlp
_ = !pip install sentencepiece

### 2. Download the Large Movie Review Dataset

In [2]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2021-04-19 14:35:39--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-04-19 14:35:40 (56.6 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



### 3. Preprocess the dataset  
The Large Movie Review Dataset is a dataset for binary sentiment classification. Each input is a movie review, and its sentiment (positive or negative) is the ground-truth label.

In [3]:
from pathlib import Path
from sklearn.model_selection import train_test_split
import numpy as np

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Compare strings with ==; `is` tests object identity, not equality
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [4]:
print("Unique label is {}, nb. of train data = {}, test_data = {}".format(np.unique(train_labels), len(train_texts), len(test_texts)))
for i in range(5):
  print("Data = {}".format(train_texts[i]))
  print("Label = {}".format(train_labels[i]))

Unique label is [0 1], nb. of train data = 20000, test_data = 25000
Data = This is a film that takes some digesting. On the one hand, we are offered a tough outward shell, a story that does not only derive the Catholic Church, but does so foolishly, and uninformed. On an inner layer, we are offered a story of orthodoxy over orthopraxis, and what happens when people follow blindly a faith that they must not understand.<br /><br />At first glance, it appeared this was supposed to be a comedy. If so, then Mr. Durang needs to open a dictionary, because he clearly does not know the meaning of the word. The jokes are pale; the humor is awkward and poorly delivered. In particular, Ms. Keaton's performance is flighty and over the top, well below the quality of her Annie Hall and Sleeper days. Jennifer Tilly is again the model of stridence, with her hi-pitched voice and whining style. All of this could be forgiven if it weren't for the last 20 minutes of this movie, that evidently was a controv

After the dataset is loaded, we tokenize each input sentence. The tokenizer adds a start token '[CLS]' (id 101) at the beginning and a separator token '[SEP]' (id 102) at the end of each sentence. An out-of-vocabulary (OOV) word is mapped to token id 100. The tokenized output has the following format:

```python
{
  'input_ids': List[List[Int]]. List of tokenized input sentence.
  'attention_mask' : List[List[Int]].  List of masked token. See cell [7] for example.
}
```
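
To make this format concrete, here is a minimal sketch of how the special tokens, the OOV id, and padding combine. The `toy_vocab` dictionary and the `toy_encode` helper are hypothetical and only mimic the behaviour described above with whitespace splitting; the real tokenizer uses subword tokenization and a far larger vocabulary. The ids are copied from the outputs of cell [7].

```python
# Hypothetical mini-vocabulary; the ids are copied from the notebook outputs.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102
toy_vocab = {"a": 1037, "b": 1038, "pine": 7222, "apple": 6207, "pen": 7279}


def toy_encode(sentences):
    """Whitespace-split, lowercase, add [CLS]/[SEP], then pad to the longest sentence."""
    encoded = []
    for sentence in sentences:
        word_ids = [toy_vocab.get(word.lower(), UNK_ID) for word in sentence.split()]
        encoded.append([CLS_ID] + word_ids + [SEP_ID])

    max_len = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_encode(["Pine apple apple pen หมา ไก่", "a b"])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The two Thai words are not in the toy vocabulary, so they both map to id 100, and the shorter sentence is padded with zeros that the attention mask marks as ignorable.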

In [5]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [6]:
tokenizer([ '[CLS] a' ], truncation=True, padding=True)

{'input_ids': [[101, 101, 1037, 102]], 'attention_mask': [[1, 1, 1, 1]]}

In [7]:
tokenizer( ['Pine apple apple pen  หมา ไก่', 'a b'], truncation=True, padding=True)

{'input_ids': [[101, 7222, 6207, 6207, 7279, 100, 100, 102], [101, 1037, 1038, 102, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]]}

In [8]:
a = tokenizer(train_texts[:2], truncation=True, padding=True)
print(a)

{'input_ids': [[101, 2023, 2003, 1037, 2143, 2008, 3138, 2070, 17886, 2075, 1012, 2006, 1996, 2028, 2192, 1010, 2057, 2024, 3253, 1037, 7823, 15436, 5806, 1010, 1037, 2466, 2008, 2515, 2025, 2069, 18547, 1996, 3234, 2277, 1010, 2021, 2515, 2061, 13219, 2135, 1010, 1998, 4895, 2378, 29021, 1012, 2006, 2019, 5110, 6741, 1010, 2057, 2024, 3253, 1037, 2466, 1997, 26582, 2058, 2030, 2705, 7361, 2527, 9048, 2015, 1010, 1998, 2054, 6433, 2043, 2111, 3582, 25734, 1037, 4752, 2008, 2027, 2442, 2025, 3305, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2012, 2034, 6054, 1010, 2009, 2596, 2023, 2001, 4011, 2000, 2022, 1037, 4038, 1012, 2065, 2061, 1010, 2059, 2720, 1012, 22959, 2290, 3791, 2000, 2330, 1037, 9206, 1010, 2138, 2002, 4415, 2515, 2025, 2113, 1996, 3574, 1997, 1996, 2773, 1012, 1996, 13198, 2024, 5122, 1025, 1996, 8562, 2003, 9596, 1998, 9996, 5359, 1012, 1999, 3327, 1010, 5796, 1012, 17710, 22436, 1005, 1055, 2836, 2003, 3462, 2100, 1998, 2058, 1996, 2327, 1010, 2092, 2917, 19

In [9]:
# Pad or truncate every review to 512 tokens so that all arrays share one shape.
# padding='max_length' replaces the deprecated pad_to_max_length=True.
train_encodings = tokenizer(train_texts, add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True, truncation=True
        )
val_encodings = tokenizer(val_texts, add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True, truncation=True
        )
test_encodings = tokenizer(test_texts, add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True, truncation=True
        )



Convert the dataset into the training format. The input format expected by distilBERT is documented at https://huggingface.co/transformers/model_doc/distilbert.html.

In [10]:
train_data = [np.array(train_encodings['input_ids']), np.array(train_encodings['attention_mask'])]
val_data = [np.array(val_encodings['input_ids']), np.array(val_encodings['attention_mask'])]
test_data = [np.array(test_encodings['input_ids']), np.array(test_encodings['attention_mask'])]

### 4. Model fine-tuning
The model we fine-tune is distilBERT (https://arxiv.org/abs/1910.01108), a smaller model distilled from the original BERT. Knowledge distillation is a well-known technique for improving a small (student) model: instead of learning only from hard labels, the student is trained to match the soft output distribution of a larger (teacher) model, which carries the teacher's uncertainty over the classes. If you want to know more about knowledge distillation, read https://arxiv.org/abs/1503.02531.
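
As a rough illustration of the soft-label idea, the sketch below combines a temperature-softened teacher/student term with the usual hard-label cross-entropy, in the spirit of the knowledge distillation paper linked above. The function names, the logits, and the values of `temperature` and `alpha` are assumptions chosen for illustration; this is not the exact training objective used for distilBERT.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-label term and the usual hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Cross-entropy between the softened teacher and student distributions.
    soft_term = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    # Standard cross-entropy against the ground-truth class (temperature 1).
    hard_term = -math.log(softmax(student_logits)[hard_label])
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    return alpha * temperature ** 2 * soft_term + (1 - alpha) * hard_term


teacher = [4.0, 1.0]        # hypothetical teacher logits, fairly confident in class 0
good_student = [3.5, 1.2]   # close to the teacher
bad_student = [0.5, 3.0]    # disagrees with both the teacher and the label
loss_good = distillation_loss(good_student, teacher, hard_label=0)
loss_bad = distillation_loss(bad_student, teacher, hard_label=0)
print(loss_good, loss_bad)
```

A student whose softened distribution is close to the teacher's receives a lower loss than one that disagrees, even though both are scored against the same hard label.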

#### Model Initialization

In [23]:
from transformers import DistilBertForSequenceClassification
import torch

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels= 2)
model = torch.nn.DataParallel(model.cuda(), device_ids=[0])

LEARNING_RATE =  1e-5
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)



#### Set up training generator

In contrast to model.fit, which you used in the previous lab, a more common way to feed data is to use a generator. A generator can be more memory-efficient than passing the whole dataset to model.fit, because each batch is only produced when the iterator is advanced. For example, you can set the generator to load images from a folder when called instead of storing all of them in RAM. The example below creates a simple generator that aggregates data points into batches. Both PyTorch and TensorFlow also have utility modules for building such input pipelines (torch.utils.data.DataLoader for PyTorch and tf.data.Dataset for TensorFlow).

In [None]:
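from sklearn.utils import shuffle

def batch_data_generator(data, label, bs = 8, training = True):
  """Yield ([input_ids, attention_mask], labels) batches of size `bs`.

  In training mode the data is shuffled and the generator loops forever.
  Otherwise it makes a single pass and then yields None as an end marker.
  """
  # Unpack once, outside the loop, so that ids, masks and labels are always
  # reshuffled together and stay aligned on every pass over the data.
  ids, masks = data[0], data[1]
  while(True):
    X1, X2, Y = [], [], []
    if(training):
      ids, masks, label = shuffle(ids, masks, label, random_state = 42)
    for a, b, c in zip(ids, masks, label):
      X1.append(a)
      X2.append(b)
      Y.append(c)
      if(len(X1) == bs):
        yield [np.array(X1), np.array(X2)], np.array(Y)
        X1, X2, Y = [], [], []
    # Emit the final, smaller batch when the data size is not a multiple of bs
    if(len(X1) > 0):
      yield [np.array(X1), np.array(X2)], np.array(Y)
    if(not training):
      yield None
      break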


In [None]:
# np.int was removed in recent NumPy releases; use an explicit integer dtype
train_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int64), training = True)

In [None]:
dummy_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int64), training = True)
X_dummy, Y_dummy = next(dummy_generator)
print(X_dummy[0].shape, X_dummy[1].shape, Y_dummy.shape)

(8, 512) (8, 512) (8,)
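
The memory argument made earlier (load items only when a batch is requested) can be illustrated with a small sketch that reads text files from disk lazily. The `lazy_text_batches` helper and the temporary toy files are hypothetical and are not part of the homework pipeline.

```python
import tempfile
from pathlib import Path


def lazy_text_batches(file_paths, bs=2):
    """Yield lists of file contents, reading each file only when its batch is requested."""
    batch = []
    for path in file_paths:
        batch.append(Path(path).read_text(encoding="utf-8"))  # disk read happens here
        if len(batch) == bs:
            yield batch
            batch = []
    if batch:  # final, smaller batch
        yield batch


# Hypothetical toy data: five tiny text files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for i in range(5):
        path = Path(tmp_dir) / f"review_{i}.txt"
        path.write_text(f"review number {i}", encoding="utf-8")
        paths.append(path)

    batches = lazy_text_batches(paths, bs=2)  # nothing has been read yet
    first_batch = next(batches)               # reads only the first two files
    remaining = list(batches)                 # reads the rest on demand

print(first_batch)
print([len(b) for b in remaining])
```

Only the file paths are held in memory up front; the contents of a file are loaded at the moment its batch is consumed.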


#### Start Fine-tuning

In [None]:
device = "cuda:0"
# tqdm.tqdm_notebook is deprecated in favour of tqdm.notebook.tqdm
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score
from collections import deque

# Running statistics over the last 100 training steps
train_acc_stat = deque(maxlen = 100)
train_loss_stat = deque(maxlen = 100)

for step in tqdm(range(1000)):
    X, Y = next(train_generator)
    ids = torch.tensor(X[0], dtype = torch.long, device = device)
    mask = torch.tensor(X[1], dtype = torch.long, device = device)
    targets = torch.tensor(Y, dtype = torch.long).to(device)

    optimizer.zero_grad()
    outputs = model(ids, mask)
    loss = loss_fn(outputs['logits'], targets)
    
    loss.backward()
    optimizer.step()

    with torch.no_grad():
      train_acc = accuracy_score(Y, outputs['logits'].argmax(axis = 1).cpu().detach().numpy() )
      train_loss = loss.cpu().detach().numpy()
      train_acc_stat.append(train_acc)
      train_loss_stat.append(train_loss)

    if (step + 1) %100==0:
      print("iter = {} train_acc = {}".format(step, np.array(train_acc_stat).mean()))
      print("iter = {} train_loss = {}".format(step, np.array(train_loss_stat).mean()))


    if (step + 1) %500==0:
      #validation step
      with torch.no_grad():
        val_generator = batch_data_generator(val_data, np.array(val_labels, dtype = np.int64), training = False)
        y_true = []
        y_pred = []
        while(True):
          d = next(val_generator)
          if(d is None): break
          X, Y = d
          ids = torch.tensor(X[0], dtype = torch.long, device = device)
          mask = torch.tensor(X[1], dtype = torch.long, device = device)
          outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
          y_true.append(Y)
          y_pred.append(outputs_cls)
        y_true = np.concatenate(y_true)
        y_pred = np.concatenate(y_pred)
        print("val acc", accuracy_score(y_true, y_pred))

iter = 99 train_acc = 0.78625
iter = 99 train_loss = 0.5005409121513367
iter = 199 train_acc = 0.84875
iter = 199 train_loss = 0.3588992655277252
iter = 299 train_acc = 0.88875
iter = 299 train_loss = 0.3059362471103668
iter = 399 train_acc = 0.89125
iter = 399 train_loss = 0.28195422887802124
iter = 499 train_acc = 0.88375
iter = 499 train_loss = 0.2709669768810272
val acc 0.8662
iter = 599 train_acc = 0.8875
iter = 599 train_loss = 0.26880982518196106
iter = 699 train_acc = 0.91625
iter = 699 train_loss = 0.22099053859710693
iter = 799 train_acc = 0.8925
iter = 799 train_loss = 0.273038387298584
iter = 899 train_acc = 0.90875
iter = 899 train_loss = 0.2313779890537262
iter = 999 train_acc = 0.92125
iter = 999 train_loss = 0.2166145294904709
val acc 0.9026



## TODO 
Compare the classification performance of your non-transformer model with that of a model fine-tuned from pretrained WangchanBERTa on the TRUE call-center dataset (HW6). WangchanBERTa (https://arxiv.org/abs/2101.09635) is a RoBERTa (https://arxiv.org/abs/1907.11692) model trained on Thai text. RoBERTa is also supported in Hugging Face (https://huggingface.co/transformers/model_doc/roberta.html).

For this homework, you may focus only on the object tag.
To successfully fine-tune WangchanBERTa on the TRUE call-center dataset, you should:

1. Preprocess the dataset into the same format as the tutorial.
2. Tokenize the input from 1. See (https://colab.research.google.com/drive/1Kbk6sBspZLwcnOE61adAQo30xxqOQ9ko?usp=sharing&fbclid=IwAR23b8ZEoP6YxlUx7wWEu7dRCrVcyTFrZb3YSgI-nsxe_t4gy-bh8Rv5R9E#scrollTo=kAcpAdkddVQ8) for more details.
3. Process the tokenized input from step 2 into a format that can be fed to the model.
4. Initialize WangchanBERTa (<b> you should choose the pretrained weight w.r.t. the tokenizer in 2.</b>)
5. Fine-tune the pretrained model.
6. (Optional) Before fine-tuning (that is, before step 5), domain adaptation is often performed first by training a masked language model (maskLM) on in-domain text. You can train the maskLM by following this guideline (https://huggingface.co/transformers/model_doc/bert.html#bertformaskedlm).
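
For the optional step 6, the sketch below shows the kind of input corruption a masked language model is trained on, assuming the 15% selection rate and the 80/10/10 replacement recipe from the original BERT paper. The `mask_tokens` helper and all token ids are hypothetical; in practice the mask id comes from the tokenizer (for example `tokenizer.mask_token_id`), and Hugging Face's `DataCollatorForLanguageModeling` can handle the masking for you.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_prob=0.15, seed=0):
    """Return (masked_ids, labels) following the BERT 80/10/10 masking recipe."""
    rng = random.Random(seed)
    masked_ids, labels = list(input_ids), [IGNORE_INDEX] * len(input_ids)
    for pos, token_id in enumerate(input_ids):
        if token_id in special_ids or rng.random() >= mlm_prob:
            continue  # position is not selected for prediction
        labels[pos] = token_id  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked_ids[pos] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            masked_ids[pos] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked_ids, labels


# Toy ids only; real values come from the tokenizer of the chosen checkpoint.
CLS_ID, SEP_ID, MASK_ID, VOCAB_SIZE = 1, 2, 3, 1000
example = [CLS_ID] + list(range(10, 60)) + [SEP_ID]
masked, labels = mask_tokens(example, MASK_ID, VOCAB_SIZE, {CLS_ID, SEP_ID})
n_selected = sum(label != IGNORE_INDEX for label in labels)
print(n_selected, "of", len(example), "positions selected for prediction")
```

The loss is computed only at the selected positions; every other position carries the ignore label, so the model is never penalised for tokens it was allowed to see.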

# Preprocess TRUE Dataset

In [11]:
_ = !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

In [12]:
import pandas as pd
import numpy as np

In [13]:
true_data_df = pd.read_csv('clean-phone-data-for-students.csv')

In [14]:
true_data_df.head()

| | Sentence Utterance | Action | Object |
|---|---|---|---|
| 0 | <PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte... | enquire | payment |
| 1 | internet ยังความเร็วอยุ่เท่าไหร ครับ | enquire | package |
| 2 | ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้... | report | suspend |
| 3 | พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ... | enquire | internet |
| 4 | ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ... | report | phone_issues |


In [15]:
display(true_data_df['Object'].unique())
display(true_data_df.describe())

array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

| | Sentence Utterance | Action | Object |
|---|---|---|---|
| count | 16175 | 16175 | 16175 |
| unique | 13389 | 10 | 33 |
| top | บริการอื่นๆ | enquire | service |
| freq | 97 | 10377 | 2525 |


In [16]:
true_data_df['Object']=true_data_df['Object'].str.lower().copy()
display(true_data_df['Object'].unique())
display(true_data_df.describe())

array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nontruemove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd', 'garbage',
       'ringtone', 'rate', 'loyalty_card', 'contact', 'officer'],
      dtype=object)

| | Sentence Utterance | Action | Object |
|---|---|---|---|
| count | 16175 | 16175 | 16175 |
| unique | 13389 | 10 | 26 |
| top | บริการอื่นๆ | enquire | service |
| freq | 97 | 10377 | 2528 |


In [17]:
unique_object = true_data_df['Object'].unique()

object_2_num_map = dict(zip(unique_object, range(len(unique_object))))
num_2_object_map = dict(zip(range(len(unique_object)), unique_object))

In [18]:
display(object_2_num_map)
display(num_2_object_map)

{'balance': 7,
 'balance_minutes': 18,
 'bill': 9,
 'contact': 24,
 'credit': 10,
 'detail': 8,
 'garbage': 20,
 'idd': 19,
 'information': 16,
 'internet': 3,
 'iservice': 13,
 'lost_stolen': 17,
 'loyalty_card': 23,
 'mobile_setting': 12,
 'nontruemove': 6,
 'officer': 25,
 'package': 1,
 'payment': 0,
 'phone_issues': 4,
 'promotion': 11,
 'rate': 22,
 'ringtone': 21,
 'roaming': 14,
 'service': 5,
 'suspend': 2,
 'truemoney': 15}

{0: 'payment',
 1: 'package',
 2: 'suspend',
 3: 'internet',
 4: 'phone_issues',
 5: 'service',
 6: 'nontruemove',
 7: 'balance',
 8: 'detail',
 9: 'bill',
 10: 'credit',
 11: 'promotion',
 12: 'mobile_setting',
 13: 'iservice',
 14: 'roaming',
 15: 'truemoney',
 16: 'information',
 17: 'lost_stolen',
 18: 'balance_minutes',
 19: 'idd',
 20: 'garbage',
 21: 'ringtone',
 22: 'rate',
 23: 'loyalty_card',
 24: 'contact',
 25: 'officer'}

In [19]:
print("Before Mappings")
display(true_data_df['Object'])
true_data_df['Object'] = np.vectorize(object_2_num_map.get)(true_data_df['Object'])

print("After Mappings")
display(true_data_df['Object'])

Before Mappings


0             payment
1             package
2             suspend
3            internet
4        phone_issues
             ...     
16170        internet
16171             idd
16172         balance
16173         balance
16174         package
Name: Object, Length: 16175, dtype: object

After Mappings


0         0
1         1
2         2
3         3
4         4
         ..
16170     3
16171    19
16172     7
16173     7
16174     1
Name: Object, Length: 16175, dtype: int64

# Tokenize

In [24]:
# Install thai2transformer 
# _ = !pip -q install torch==1.4.0 torchtext==0.4.0 torchvision==0.6.0
# _ = !pip -q install transformers==3.5.0 thai2transformers==0.1.2

_ = !pip -q install torch==1.4.0 transformers==3.5.0 thai2transformers==0.1.2

In [25]:
import numpy as np
from tqdm.auto import tqdm
import torch

#datasets
from datasets import load_dataset

#transformers
import transformers
from transformers import (
    CamembertTokenizer,
    AutoTokenizer,
    AutoModel,
    AutoModelForMaskedLM,
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    pipeline,
)

#thai2transformers
import thai2transformers
from thai2transformers.preprocess import process_transformers
from thai2transformers.metrics import (
    classification_metrics, 
    multilabel_classification_metrics,
)
from thai2transformers.tokenizers import (
    ThaiRobertaTokenizer,
    ThaiWordsNewmmTokenizer,
    ThaiWordsSyllableTokenizer,
    FakeSefrCutTokenizer,
    SEFR_SPLIT_TOKEN
)

from transformers import DistilBertForSequenceClassification, RobertaForSequenceClassification


In [None]:
# model_names = [
#     'wangchanberta-base-att-spm-uncased',
#     'xlm-roberta-base',
#     'bert-base-multilingual-cased',
#     'wangchanberta-base-wiki-newmm',
#     'wangchanberta-base-wiki-ssg',
#     'wangchanberta-base-wiki-sefr',
#     'wangchanberta-base-wiki-spm',
# ]

# tokenizers = {
#     'wangchanberta-base-att-spm-uncased': AutoTokenizer,
#     'xlm-roberta-base': AutoTokenizer,
#     'bert-base-multilingual-cased': AutoTokenizer,
#     'wangchanberta-base-wiki-newmm': ThaiWordsNewmmTokenizer,
#     'wangchanberta-base-wiki-ssg': ThaiWordsSyllableTokenizer,
#     'wangchanberta-base-wiki-sefr': FakeSefrCutTokenizer,
#     'wangchanberta-base-wiki-spm': ThaiRobertaTokenizer,
# }
# public_models = ['xlm-roberta-base', 'bert-base-multilingual-cased'] 
# #@title Choose Pretrained Model
# model_name = "xlm-roberta-base" #@param ["wangchanberta-base-att-spm-uncased", "xlm-roberta-base", "bert-base-multilingual-cased", "wangchanberta-base-wiki-newmm", "wangchanberta-base-wiki-syllable", "wangchanberta-base-wiki-sefr", "wangchanberta-base-wiki-spm"]

# #create tokenizer
# tokenizer = tokenizers[model_name].from_pretrained(
#                 f'airesearch/{model_name}' if model_name not in public_models else f'{model_name}',
#                 revision='main',
#                 model_max_length=416,)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")

# padding='max_length' replaces the deprecated pad_to_max_length=True
true_X_encodings = tokenizer(list(true_data_df['Sentence Utterance']), add_special_tokens=True,
            max_length=512,
            padding='max_length',
            return_token_type_ids=True, truncation=True
        )



In [None]:
true_X = [np.array(true_X_encodings['input_ids']), np.array(true_X_encodings['attention_mask'])]

# Model initialization

In [None]:
num_labels = len(object_2_num_map)

# tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
model = AutoModelForSequenceClassification.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased", num_labels= num_labels)
model = torch.nn.DataParallel(model.cuda(), device_ids=[0])

LEARNING_RATE =  1e-5
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

Some weights of the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased were not used when initializing CamembertForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at airesearch/wa

In [None]:
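from sklearn.utils import shuffle

def batch_data_generator(data, label, bs = 8, training = True):
  """Yield ([input_ids, attention_mask], labels) batches of size `bs`.

  In training mode the data is shuffled and the generator loops forever.
  Otherwise it makes a single pass and then yields None as an end marker.
  """
  # Unpack once, outside the loop, so that ids, masks and labels are always
  # reshuffled together and stay aligned on every pass over the data.
  ids, masks = data[0], data[1]
  while(True):
    X1, X2, Y = [], [], []
    if(training):
      ids, masks, label = shuffle(ids, masks, label, random_state = 42)
    for a, b, c in zip(ids, masks, label):
      X1.append(a)
      X2.append(b)
      Y.append(c)
      if(len(X1) == bs):
        yield [np.array(X1), np.array(X2)], np.array(Y)
        X1, X2, Y = [], [], []
    # Emit the final, smaller batch when the data size is not a multiple of bs
    if(len(X1) > 0):
      yield [np.array(X1), np.array(X2)], np.array(Y)
    if(not training):
      yield None
      break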

In [None]:
true_X2 = np.swapaxes(np.array(true_X), 0, 1)

In [None]:
from sklearn.model_selection import train_test_split

true_train_X, true_val_X, true_train_Y, true_val_Y = train_test_split(true_X2, true_data_df['Object'], test_size=.4)
true_test_X, true_val_X, true_test_Y, true_val_Y = train_test_split(true_val_X, true_val_Y, test_size=.5)

In [None]:
true_train_X = np.swapaxes(np.array(true_train_X), 0, 1)
true_test_X = np.swapaxes(np.array(true_test_X,), 0, 1)
true_val_X = np.swapaxes(np.array(true_val_X), 0, 1)

In [None]:
train_generator = batch_data_generator(true_train_X, true_train_Y, training = True)

In [None]:
dummy_generator = batch_data_generator(true_train_X, true_train_Y, training = True)
X_dummy, Y_dummy = next(dummy_generator)
print(X_dummy[0].shape, X_dummy[1].shape, Y_dummy.shape)

(8, 512) (8, 512) (8,)


In [None]:
device = "cuda:0"
# tqdm.tqdm_notebook is deprecated in favour of tqdm.notebook.tqdm
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score
from collections import deque

# Running statistics over the last 100 training steps
train_acc_stat = deque(maxlen = 100)
train_loss_stat = deque(maxlen = 100)

for step in tqdm(range(1000)):
    X, Y = next(train_generator)
    ids = torch.tensor(X[0], dtype = torch.long, device = device)
    mask = torch.tensor(X[1], dtype = torch.long, device = device)
    targets = torch.tensor(Y, dtype = torch.long).to(device)

    optimizer.zero_grad()
    outputs = model(ids, mask)
    # loss = loss_fn(outputs['logits'], targets)
    loss = loss_fn(outputs[0], targets)
    
    loss.backward()
    optimizer.step()

    with torch.no_grad():
      # train_acc = accuracy_score(Y, outputs['logits'].argmax(axis = 1).cpu().detach().numpy() )
      train_acc = accuracy_score(Y, outputs[0].argmax(axis = 1).cpu().detach().numpy() )
      train_loss = loss.cpu().detach().numpy()
      train_acc_stat.append(train_acc)
      train_loss_stat.append(train_loss)

    if (step + 1) %100==0:
      print("iter = {} train_acc = {}".format(step, np.array(train_acc_stat).mean()))
      print("iter = {} train_loss = {}".format(step, np.array(train_loss_stat).mean()))


    if (step + 1) %500==0:
      #validation step
      with torch.no_grad():
        val_generator = batch_data_generator(true_val_X, true_val_Y, training = False)
        y_true = []
        y_pred = []
        while(True):
          d = next(val_generator)
          if(d is None): break
          X, Y = d
          ids = torch.tensor(X[0], dtype = torch.long, device = device)
          mask = torch.tensor(X[1], dtype = torch.long, device = device)
          # outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
          outputs_cls = model(ids, mask)[0].argmax(axis = 1).cpu().detach().numpy()
          y_true.append(Y)
          y_pred.append(outputs_cls)
        y_true = np.concatenate(y_true)
        y_pred = np.concatenate(y_pred)
        print("val acc", accuracy_score(y_true, y_pred))

iter = 99 train_acc = 0.15125
iter = 99 train_loss = 2.9082915782928467
iter = 199 train_acc = 0.2775
iter = 199 train_loss = 2.5786659717559814
iter = 299 train_acc = 0.32875
iter = 299 train_loss = 2.3735148906707764
iter = 399 train_acc = 0.36375
iter = 399 train_loss = 2.219857692718506
iter = 499 train_acc = 0.355
iter = 499 train_loss = 2.243892192840576
iter = 599 train_acc = 0.4
iter = 599 train_loss = 2.10455584526062
iter = 699 train_acc = 0.4625
iter = 699 train_loss = 1.8929330110549927
iter = 799 train_acc = 0.5225
iter = 799 train_loss = 1.6740680932998657
iter = 899 train_acc = 0.53625
iter = 899 train_loss = 1.6326984167099
iter = 999 train_acc = 0.54375
iter = 999 train_loss = 1.6138421297073364
val acc 0.4970633693972179



In [None]:
test_generator = batch_data_generator(true_test_X, true_test_Y, training = False)
y_true = []
y_pred = []
while(True):
  d = next(test_generator)
  if(d is None): break
  X, Y = d
  ids = torch.tensor(X[0], dtype = torch.long, device = device)
  mask = torch.tensor(X[1], dtype = torch.long, device = device)
  outputs_cls = model(ids, mask)[0].argmax(axis = 1).cpu().detach().numpy()
  y_true.append(Y)
  y_pred.append(outputs_cls)
y_true = np.concatenate(y_true)
y_pred = np.concatenate(y_pred)
print("test acc", accuracy_score(y_true, y_pred))

test acc 0.4843894899536321


**<font color='red'>Compared with my non-transformer model from HW6, which reached 30.8% accuracy, the model fine-tuned from pretrained WangchanBERTa achieves a significantly higher test accuracy of 48.4%.</font>**
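
To put both accuracies in context, it can help to compare them with trivial baselines. The short calculation below uses only the counts reported by `describe()` after lowercasing the labels (16175 utterances, 26 object classes, 2528 of them labelled 'service'). It assumes the random test split has roughly the same class proportions as the full dataset, which is not guaranteed.

```python
# Counts copied from the describe() output after lowercasing the labels.
n_examples = 16175
n_most_frequent = 2528   # utterances labelled 'service'
n_classes = 26

# Accuracy of always predicting the most frequent class.
majority_baseline = n_most_frequent / n_examples
# Expected accuracy of guessing uniformly at random among all classes.
uniform_baseline = 1 / n_classes

print(f"majority-class baseline: {majority_baseline:.3f}")
print(f"uniform-guess baseline:  {uniform_baseline:.3f}")
```

Under that assumption, always predicting 'service' would score about 15.6%, so both the HW6 model and the fine-tuned WangchanBERTa model are well above the trivial baselines.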