# Simplistic T5 model with no fancy tricks

The previous example of Chinese-English machine translation has the following problems: 

<ul>
    <li>Dataset is trash</li>
    <li>Includes too many tricks (scheduler, parameter freezing, callback, metrics) that I cannot handle</li>
</ul>

Now we write a T5 Chinese-English translator with better data and no fancy tricks. 

In [1]:
import pandas as pd
from tokenizers import SentencePieceBPETokenizer
import torch
from torch.utils.data import Dataset, DataLoader
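# transformers.AdamW is deprecated in favor of the PyTorch optimizer
# (note that torch.optim.AdamW defaults to weight_decay = 0.01)
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration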
import pytorch_lightning as pl
import time
from datetime import datetime

## Load data

The full dataset is too large to load into memory at once. For now, we only load the first `nLine` lines.  

If more data is needed later, learn to stream it with a PyTorch `DataLoader`. 
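
One possible way to avoid loading everything is to read the two files in lockstep and yield sentence pairs lazily. The sketch below uses only the standard library; the helper name `iter_pairs` is hypothetical, and it assumes the two files are line-aligned. A generator like this could later be wrapped in a `torch.utils.data.IterableDataset` if streaming during training turns out to be necessary.

```python
from itertools import islice


def iter_pairs(zh_path, en_path, limit=None):
    """Yield (zh, en) sentence pairs lazily from two line-aligned files.

    `iter_pairs` is a hypothetical helper, not part of any library.
    Iteration stops at the end of the shorter file, or after `limit`
    pairs when a limit is given.
    """
    with open(zh_path, 'r', encoding='utf-8') as zh_file, \
         open(en_path, 'r', encoding='utf-8') as en_file:
        pairs = zip(zh_file, en_file)  # reads one line from each file at a time
        for zh_line, en_line in islice(pairs, limit):
            yield zh_line.strip(), en_line.strip()
```

Calling `list(iter_pairs('./en-zh/UNv1.0.en-zh.zh', './en-zh/UNv1.0.en-zh.en', limit=10000))` would collect the same first 10000 pairs as the cell below, while a plain `for` loop over the generator keeps only one pair in memory at a time.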

In [2]:
%%time
nLine = 10000

dataMatrix = []

# Open both files in a with-block so they are closed once reading is done
with open('./en-zh/UNv1.0.en-zh.en', 'r', encoding = 'utf-8') as enFile, \
     open('./en-zh/UNv1.0.en-zh.zh', 'r', encoding = 'utf-8') as zhFile:
    for i in range(nLine): 
        zhLine = zhFile.readline().strip()
        enLine = enFile.readline().strip()
        dataMatrix.append([zhLine, enLine])
    
df_UN = pd.DataFrame(dataMatrix, columns = ['zh', 'en']).sample(frac=1).reset_index(drop=True) # Shuffle the data
df_UN

# Note: appending rows to a DataFrame one at a time is notoriously slow, so the rows are collected in a list first

Wall time: 17.9 ms


| | zh | en |
|---|---|---|
| 0 | 但是，鉴于ERS-1号仍在充分运转，这两颗卫星现由欧洲航天局以串联方式使用，这为许多应用提供... | However, since ERS-1 is still fully operationa... |
| 1 | 67. 由于世界各国，特别是拉丁美洲，为了促进私人投资，实现基础设施现代化，以便吸收迅速变化... | . Satellite communications had evolved very ra... |
| 2 | 卡塔尔 第九次报告 1993年5月16日 - | Qatar Ninth report 16 May 1993 - |
| 3 | 1994年以激光雷达进行了五个阶段的测量。 | Five periods of measurement with lidar instrum... |
| 4 | 选择这些国家的主要依据是它们的需要和它们显然有能力提供使该项目能成功执行和维持的必要物质和政... | They had been selected mainly on the basis of ... |
| ... | ... | ... |
| 9995 | 597. 委员会在1995年8月10日举行的第1115次会议(见CERD/C/SR.1115... | 597. At its 1115th meeting, held on 10 August ... |
| 9996 | (m) 禁止重新开始已经由可强制执行的判决了结的诉讼程序； | (m) The prohibition of the reopening of procee... |
| 9997 | A．发展当地能力 | A. Development of indigenous capability |
| 9998 | 1995年1月24日发射了Tsikada系列的一颗人造地球卫星。 | 24 January 1995 saw the launch of an artificia... |


## Tokenization and PyTorch `Dataset`

We first instantiate SentencePiece tokenizers and train them on our data. 

<b style="color:red;">Warning!</b> For some reason I can no longer find the API documentation for `SentencePieceBPETokenizer`. Did Hugging Face deprecate the old version of the tokenizer? 

In [3]:
# Need to store all texts in file before training tokenizer
pathAllZh = './en-zh/allZh.txt'
pathAllEn = './en-zh/allEn.txt'

zhTextsUN = df_UN['zh'].tolist()
enTextsUN = df_UN['en'].tolist()

# The with-block closes each file automatically, so no explicit close() is needed
with open(pathAllZh, 'w', encoding = 'utf-8') as file:
    for line in zhTextsUN:
        file.write(line + '\n')
    
with open(pathAllEn, 'w', encoding = 'utf-8') as file: 
    for line in enTextsUN:
        file.write(line + '\n')

In [4]:
# Instantiate and train tokenizers 
# vocab_size is kept below the 32128 rows of t5-small's embedding table; 
# token ids beyond that range would index outside the pretrained embeddings
zhTokenizer = SentencePieceBPETokenizer()
zhTokenizer.train([pathAllZh], vocab_size = 32000, special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

enTokenizer = SentencePieceBPETokenizer()
enTokenizer.train([pathAllEn], vocab_size = 32000, special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

For more details about the tokenizer, see `Bo-Eng-Machine-Transation/warm_up_Chinese_English/01_practice_ch_en_tranlation.ipynb`. 
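
To give a feel for what training a BPE tokenizer does, here is a toy sketch of the merge-learning loop in plain Python. It is not how `SentencePieceBPETokenizer` is implemented: the function names are hypothetical, and the real tokenizer also handles whitespace markers, special tokens, and a vocabulary-size limit. The sketch only illustrates the idea of repeatedly merging the most frequent adjacent pair of symbols.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words; return the most frequent one."""
    pair_counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return pair_counts.most_common(1)[0][0] if pair_counts else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from whitespace-separated text lines."""
    # Every word starts out as a tuple of single characters
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges
```

For example, `learn_bpe(['ab ab ab ac'], 5)` learns the merge `('a', 'b')` first, since that pair occurs three times, and stops early once no adjacent pairs remain.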

Now define a PyTorch `Dataset`. 

In [5]:
class MyDataset(Dataset): 
    def __init__(self, zhTexts, enTexts, zhTokenizer, enTokenizer, zhMaxLen, enMaxLen): 
        super().__init__()
        self.zhTexts = zhTexts 
        self.enTexts = enTexts
        self.zhTokenizer = zhTokenizer
        self.enTokenizer = enTokenizer 
        
        # Enable padding and truncation
        self.zhTokenizer.enable_padding(length = zhMaxLen)
        self.zhTokenizer.enable_truncation(max_length = zhMaxLen)
        self.enTokenizer.enable_padding(length = enMaxLen)
        self.enTokenizer.enable_truncation(max_length = enMaxLen)
        
    '''
    Return the size of dataset
    '''
    def __len__(self):
        return len(self.zhTexts)
    
    '''
    -- The routine for querying one data entry 
    -- The index of the entry must be specified as an argument
    -- Returns a dictionary of token ids and attention masks
    '''
    def __getitem__(self, idx): 
        # Apply tokenizer 
        zhOutputs = self.zhTokenizer.encode(self.zhTexts[idx])
        enOutputs = self.enTokenizer.encode(self.enTexts[idx])
        
        # Get numerical tokens
        zhEncoding = zhOutputs.ids
        enEncoding = enOutputs.ids
        
        # Get attention mask 
        zhMask = zhOutputs.attention_mask
        enMask = enOutputs.attention_mask
        
        return {
            'source_ids': torch.tensor(zhEncoding), 
            'source_mask': torch.tensor(zhMask), 
            'target_ids': torch.tensor(enEncoding), 
            'target_mask': torch.tensor(enMask)
        }
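
As a rough illustration of what the padding and truncation settings do to each sentence, the sketch below reproduces the effect in plain Python. The helper `pad_or_truncate` is hypothetical, and `pad_id = 0` is an assumption based on what I believe is the default of `enable_padding`.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return (ids, attention_mask), both exactly `max_len` long.

    Hypothetical helper that mimics enable_truncation followed by
    enable_padding with a fixed length; pad_id=0 is an assumption.
    """
    ids = list(ids)[:max_len]                # truncation: drop tokens past max_len
    n_pad = max_len - len(ids)               # how many pad tokens are needed
    mask = [1] * len(ids) + [0] * n_pad      # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, mask
```

Because every entry ends up with the same length, the default collate function of `DataLoader` can stack the tensors returned by `__getitem__` into a batch without extra work.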

## Define model class

Use PyTorch Lightning.

In [6]:
class T5FineTuner(pl.LightningModule): 
    ''' Part 1: Define the architecture of model in init '''
    def __init__(self, hparams):
        super(T5FineTuner, self).__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(
            hparams['pretrainedModelName'], 
            return_dict = True    # I set return_dict true so that outputs  are presented as dictionaries
        )
        self.zhTokenizer = hparams['zhTokenizer']
        self.enTokenizer = hparams['enTokenizer']
        self.hparams = hparams
        
        
    ''' Part 2: Define the forward propagation '''
    def forward(self, input_ids, attention_mask = None, decoder_input_ids = None, decoder_attention_mask = None, labels = None):  
        return self.model(
            input_ids, 
            attention_mask = attention_mask, 
            decoder_input_ids = decoder_input_ids, 
            decoder_attention_mask = decoder_attention_mask, 
            labels = labels
        )
    
    
    ''' Part 3: Configure optimizer (no scheduler in this simple version) '''
    def configure_optimizers(self): 
        optimizer = AdamW(self.parameters(), lr = self.hparams['learning_rate'])
        return optimizer
    
    
    ''' Part 4.1: Training logic '''
    def training_step(self, batch, batch_idx):         
        loss = self._step(batch)
        self.log('train_loss', loss)
        return loss 
    
    
    def _step(self, batch): 
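        # Clone so that the batch tensor is not modified in place
        labels = batch['target_ids'].clone()
        # Padding uses id 0 (the default of enable_padding). Setting those positions to -100, 
        # the index that the cross-entropy loss ignores, keeps padding out of the loss
        labels[labels == 0] = -100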
        
        outputs = self(
            input_ids = batch['source_ids'], 
            attention_mask = batch['source_mask'], 
            labels = labels, 
            decoder_attention_mask = batch['target_mask']
        )
        
        return outputs.loss

    
    ''' Part 4.2: Validation logic '''
    def validation_step(self, batch, batch_idx):  
        loss = self._step(batch)
        self.log('val_loss', loss)
        
        
    ''' Part 4.3: Test logic '''
    def test_step(self, batch, batch_idx): 
        loss = self._step(batch)
        self.log('test_loss', loss)
    
    
    ''' Part 5: Data loaders '''
    def _get_dataloader(self, start_idx, end_idx): 
        dataset = MyDataset(
            zhTexts = zhTextsUN[start_idx:end_idx], 
            enTexts = enTextsUN[start_idx:end_idx], 
            zhTokenizer = self.hparams['zhTokenizer'], 
            enTokenizer = self.hparams['enTokenizer'], 
            zhMaxLen = self.hparams['max_input_len'], 
            enMaxLen = self.hparams['max_output_len']
        )
        
        return DataLoader(dataset, batch_size = self.hparams['batch_size'])
    
    
    def train_dataloader(self): 
        start_idx = 0
        end_idx = int(self.hparams['train_percentage'] * len(zhTextsUN))
        return self._get_dataloader(start_idx, end_idx)
        
    
    def val_dataloader(self): 
        start_idx = int(self.hparams['train_percentage'] * len(zhTextsUN))
        end_idx = int((self.hparams['train_percentage'] + self.hparams['val_percentage']) * len(zhTextsUN))
        return self._get_dataloader(start_idx, end_idx)
    
    
    def test_dataloader(self): 
        start_idx = int((self.hparams['train_percentage'] + self.hparams['val_percentage']) * len(zhTextsUN))
        end_idx = len(zhTextsUN)
        return self._get_dataloader(start_idx, end_idx)
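
The `-100` replacement that appears in `_step` exists because PyTorch's cross-entropy loss skips targets equal to `-100` by default, so padded positions stop contributing to the loss. The sketch below shows the arithmetic in plain Python rather than with tensors; the function names are hypothetical, and a pad id of 0 is assumed.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(label_ids, pad_id=0):
    """Replace pad ids with IGNORE_INDEX so they are excluded from the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]


def cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padded position: contributes nothing
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[label]  # -log softmax(row)[label]
        count += 1
    return total / count if count else 0.0


# Three positions over a 3-token vocabulary; the last position is padding
logits = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.0, 0.0, 0.0]]
labels = mask_padding([1, 2, 0])
loss = cross_entropy(logits, labels)
```

With masking, `loss` is the average over the two real tokens only. Without it, the padded position would be averaged in as if `0` were a genuine target token.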

In [7]:
hparams = {
    'zhTokenizer': zhTokenizer,
    'enTokenizer': enTokenizer,
    'pretrainedModelName': 't5-small', 
    'train_percentage': 0.85, 
    'val_percentage': 0.13, 
    'learning_rate': 3e-4, 
    'max_input_len': 100, 
    'max_output_len': 100, 
    'batch_size': 32, 
    'num_train_epochs': 2, 
    'num_gpu': 1
}
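
To see which rows end up in each split, the index arithmetic from the data-loader methods can be pulled out into a small function. This is only a sketch: `split_bounds` is a hypothetical name, and the test slice simply takes whatever remains after the training and validation percentages.

```python
def split_bounds(n, train_pct, val_pct):
    """Return (start, end) index pairs for the train, validation, and test slices.

    Hypothetical helper mirroring the int(...) arithmetic used in
    train_dataloader, val_dataloader, and test_dataloader.
    """
    train_end = int(train_pct * n)
    val_end = int((train_pct + val_pct) * n)
    return (0, train_end), (train_end, val_end), (val_end, n)


# Same settings as hparams, applied to the 10000 loaded sentence pairs
train_bounds, val_bounds, test_bounds = split_bounds(10000, 0.85, 0.13)
```

The three slices are contiguous and non-overlapping, so every sentence pair lands in exactly one split. Since `df_UN` was shuffled when it was built, taking consecutive slices should still give a random split.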

## Training and testing

In [8]:
train_params = dict(
    gpus = hparams['num_gpu'], 
    max_epochs = hparams['num_train_epochs'], 
    progress_bar_refresh_rate = 20, 
)

model = T5FineTuner(hparams)

trainer = pl.Trainer(**train_params)

trainer.fit(model)

# Save model for later use
now = datetime.now()
trainer.save_checkpoint('t5simple_' + now.strftime("%Y-%m-%d--%H=%M=%S") + '.ckpt')

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 60 M  



0 8500



RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 2.63 GiB already allocated; 5.90 MiB free; 2.71 GiB reserved in total by PyTorch)

In [None]:
def _get_dataloader( start_idx, end_idx): 
    dataset = MyDataset(
        zhTexts = zhTextsUN[start_idx:end_idx], 
        enTexts = enTextsUN[start_idx:end_idx], 
        zhTokenizer = hparams['zhTokenizer'], 
        enTokenizer = hparams['enTokenizer'], 
        zhMaxLen = hparams['max_input_len'], 
        enMaxLen = hparams['max_output_len']
    )

    return DataLoader(dataset, batch_size = hparams['batch_size'])


def train_dataloader(): 
    start_idx = 0
    end_idx = int(hparams['train_percentage'] * len(zhTextsUN))
    return _get_dataloader(start_idx, end_idx)
