# Simplistic T5 model with no fancy tricks

The previous example of Chinese-English machine translation has the following problems: 

<ul>
    <li>Dataset is trash</li>
    <li>Includes too many tricks (scheduler, parameter freezing, callback, metrics) that I cannot handle</li>
</ul>

Now we write a T5 Chinese-English translator with better data and no fancy tricks. 

In [1]:
import pandas as pd
from tokenizers import SentencePieceBPETokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import SparseAdam
from transformers import (
    T5Config, 
    T5ForConditionalGeneration, 
    AdamW,
)
import pytorch_lightning as pl
import time
from datetime import datetime
import textwrap

device = torch.device(
    'cuda:0' if torch.cuda.is_available() else 'cpu'
)
print(f'device = {device}')

device = cuda:0


## Load data

The entire dataset is too large to load directly into memory. For now, only the first `nLine` lines are loaded.  

To do: learn to handle larger-than-memory data with a PyTorch `DataLoader` if needed. 

In [2]:
%%time
nSkip = 0
nLine = 100000

dataMatrix = []

# Open both files in a `with` block so that they are closed automatically
with open('./en-zh/UNv1.0.en-zh.en', 'r', encoding = 'utf-8') as enFile, \
     open('./en-zh/UNv1.0.en-zh.zh', 'r', encoding = 'utf-8') as zhFile:
    # Skip the first nSkip line pairs
    for i in range(nSkip): 
        zhFile.readline()
        enFile.readline()

    # Read the next nLine aligned line pairs
    for i in range(nLine): 
        zhLine = zhFile.readline().strip()
        enLine = enFile.readline().strip()
        dataMatrix.append([zhLine, enLine])
    
df_UN = pd.DataFrame(dataMatrix, columns = ['zh', 'en']).sample(frac=1).reset_index(drop=True) # Shuffle the data
df_UN

# Note: appending rows to a DataFrame one by one is notoriously slow, so the rows are collected in a list first

Wall time: 248 ms


Unnamed: 0,zh,en
0,"我们赞扬和鼓励国际组织--包括联合国各专门机构的努力,它们为满足这些需要和愿望而工作。",We applaud and encourage the efforts of intern...
1,这些课程的开设除其他之外，主要利用本地区大学研究所、瑞典航天公司卫星图像公司和瑞典土地调查局...,"The courses are based on the resources of, int..."
2,在秘鲁，只要他们行为端正和胜任本职工作，法官的任职期安全和服务的稳定性就得到保障。,"In Peru, judges were guaranteed security of te..."
3,4. 在巴拿马，体现敌对行动或武装冲突中儿童权利的规范与对待国际人道主义法规范的态度相符，以...,4. In Panama the norms enshrining the rights o...
4,50. 广播、函授教育当然不能与教师引导的交互式学习对话相提并论，但由于至少有50%的非洲儿...,50. Distance learning was certainly not equiva...
...,...,...
99995,12.3 最初保健护理组对调查人口提供的妇女保健服务 43,Coverage of the census population by primary h...
99996,"一些非政府组织已经检查和批评过这方面的条件,并已将此记录在案。",Those conditions have been examined and critic...
99997,·废物管理(包括污水处理),● Waste management (including sewage treatment)
99998,"这一过程将分阶段进行,最多由9个身份查验中心同时操作。",The process will be conducted in successive ph...


## Tokenization and PyTorch `Dataset`

We first instantiate SentencePiece tokenizers and train them on our data. 

<b style="color:red;">Warning!</b> For some reason I can no longer find the API documentation for `SentencePieceBPETokenizer`. Did Hugging Face deprecate the old version of the tokenizer? 

In [3]:
# Need to store all texts in file before training tokenizer
pathAllZh = './en-zh/allZh.txt'
pathAllEn = './en-zh/allEn.txt'

zhTextsUN = df_UN['zh'].tolist()
enTextsUN = df_UN['en'].tolist()

# The `with` block closes each file automatically
with open(pathAllZh, 'w', encoding = 'utf-8') as file:
    for line in zhTextsUN:
        file.write(line + '\n')
    
with open(pathAllEn, 'w', encoding = 'utf-8') as file: 
    for line in enTextsUN:
        file.write(line + '\n')

In [4]:
# Instantiate and train tokenizers 
# Warning: the pretrained T5 model config has vocab size 32128. We should make sure the vocab size of the tokenizers matches it. 

zhTokenizer = SentencePieceBPETokenizer()
zhTokenizer.train([pathAllZh], vocab_size = 32128, special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

enTokenizer = SentencePieceBPETokenizer()
enTokenizer.train([pathAllEn], vocab_size = 32128, special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

print('Chinese tokenizer vocab size:', zhTokenizer.get_vocab_size())
print('English tokenizer vocab size:', enTokenizer.get_vocab_size())

Chinese tokenizer vocab size: 32128
English tokenizer vocab size: 32128


For more details about the tokenizer, see `Bo-Eng-Machine-Transation/warm_up_Chinese_English/01_practice_ch_en_tranlation.ipynb`. 

Now define a PyTorch `Dataset`. 

In [5]:
class MyDataset(Dataset): 
    def __init__(self, zhTexts, enTexts, zhTokenizer, enTokenizer, zhMaxLen, enMaxLen): 
        super().__init__()
        self.zhTexts = zhTexts 
        self.enTexts = enTexts
        self.zhTokenizer = zhTokenizer
        self.enTokenizer = enTokenizer 
        
        # Enable padding and truncation
        self.zhTokenizer.enable_padding(length = zhMaxLen)
        self.zhTokenizer.enable_truncation(max_length = zhMaxLen)
        self.enTokenizer.enable_padding(length = enMaxLen)
        self.enTokenizer.enable_truncation(max_length = enMaxLen)
        
    def __len__(self):
        '''Return the size of the dataset.'''
        return len(self.zhTexts)
    
    def __getitem__(self, idx): 
        '''
        Query one data entry.
        -- The index of the entry must be given as an argument
        -- Returns a dictionary of tensors
        '''
        # Apply tokenizer 
        zhOutputs = self.zhTokenizer.encode(self.zhTexts[idx])
        enOutputs = self.enTokenizer.encode(self.enTexts[idx])
        
        # Get numerical tokens
        zhEncoding = zhOutputs.ids
        enEncoding = enOutputs.ids
        
        # Get attention mask 
        zhMask = zhOutputs.attention_mask
        enMask = enOutputs.attention_mask
        
        return {
            'source_ids': torch.tensor(zhEncoding), 
            'source_mask': torch.tensor(zhMask), 
            'target_ids': torch.tensor(enEncoding), 
            'target_mask': torch.tensor(enMask)
        }
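
What the padding and truncation settings in `__init__` do to each sequence can be illustrated in plain Python. This is only a sketch: `pad_or_truncate` is a hypothetical stand-in, not the tokenizer's implementation, and `pad_id=0` is assumed because the `_step` method further down treats id 0 as padding.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return (ids, attention_mask), both of length max_len.

    Illustrative stand-in for the tokenizer's padding and truncation.
    """
    ids = list(ids)[:max_len]      # truncation: keep at most max_len tokens
    mask = [1] * len(ids)          # 1 marks a real token
    n_pad = max_len - len(ids)     # number of pad positions to append
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


print(pad_or_truncate([5, 8, 13], max_len=5))
# ([5, 8, 13, 0, 0], [1, 1, 1, 0, 0])
print(pad_or_truncate([5, 8, 13, 21, 34, 55], max_len=5))
# ([5, 8, 13, 21, 34], [1, 1, 1, 1, 1])
```

Because every entry ends up with the same length, the default `DataLoader` collation can stack the tensors into a batch.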

## Define model class

Use PyTorch Lightning. 

In [8]:
class T5FineTuner(pl.LightningModule): 
    ''' Part 1: Define the architecture of model in init '''
    def __init__(self, hparams):
        super(T5FineTuner, self).__init__()
     
        self.model = T5ForConditionalGeneration.from_pretrained(
            hparams['pretrainedModelName'], 
            return_dict = True,     # I set return_dict true so that outputs  are presented as dictionaries
        )
        
        self.zhTokenizer = hparams['zhTokenizer']
        self.enTokenizer = hparams['enTokenizer']
        self.hparams = hparams
        
        
    ''' Part 2: Define the forward propagation '''
    def forward(self, input_ids, attention_mask = None, decoder_input_ids = None, decoder_attention_mask = None, labels = None):  
        return self.model(
            input_ids, 
            attention_mask = attention_mask, 
            decoder_input_ids = decoder_input_ids, 
            decoder_attention_mask = decoder_attention_mask, 
            labels = labels
        )
    
    
    ''' Part 3: Configure optimizer and scheduler '''
    def configure_optimizers(self): 
        # Split the parameters into two groups: biases and LayerNorm weights are
        # conventionally excluded from weight decay, everything else receives it
        optimizer_grouped_parameters = [
            {
                # parameters with weight decay 
                'params': [param for name, param in self.model.named_parameters() if ('bias' not in name and 'LayerNorm.weight' not in name)], 
                'weight_decay': self.hparams['weight_decay'], 
            }, 
            {
                # parameters without weight decay 
                'params': [param for name, param in self.model.named_parameters() if ('bias' in name or 'LayerNorm.weight' in name)], 
                'weight_decay': 0.0, 
            }
        ]
        
        optimizer = AdamW(optimizer_grouped_parameters, lr = self.hparams['learning_rate'])
        return optimizer
    
    
    ''' Part 4.1: Training logic '''
    def training_step(self, batch, batch_idx):         
        loss = self._step(batch)
        self.log('train_loss', loss)
        return loss 
    
    
    def _step(self, batch): 
        # Clone so that the batch itself is not modified in place
        labels = batch['target_ids'].clone()
        # Replace the pad id 0 with -100: the cross-entropy loss used by the model ignores label -100, so padded positions do not contribute to the loss
        labels[labels == 0] = -100
        
        outputs = self(
            input_ids = batch['source_ids'], 
            attention_mask = batch['source_mask'], 
            labels = labels, 
            decoder_attention_mask = batch['target_mask']
        )
        
        return outputs.loss

    
    ''' Part 4.2: Validation logic '''
    def validation_step(self, batch, batch_idx):        
        loss = self._step(batch)
        self.log('val_loss', loss)
        
        
    ''' Part 4.3: Test logic '''
    def test_step(self, batch, batch_idx): 
        loss = self._step(batch)
        self.log('test_loss', loss)
    
    
    ''' Part 5: Data loaders '''
    def _get_dataloader(self, start_idx, end_idx): 
        dataset = MyDataset(
            zhTexts = zhTextsUN[start_idx:end_idx], 
            enTexts = enTextsUN[start_idx:end_idx], 
            zhTokenizer = self.hparams['zhTokenizer'], 
            enTokenizer = self.hparams['enTokenizer'], 
            zhMaxLen = self.hparams['max_input_len'], 
            enMaxLen = self.hparams['max_output_len']
        )
        
        return DataLoader(dataset, batch_size = self.hparams['batch_size'])
    
    
    def train_dataloader(self): 
        start_idx = 0
        end_idx = int(self.hparams['train_percentage'] * len(zhTextsUN))
        return self._get_dataloader(start_idx, end_idx)
        
    
    def val_dataloader(self): 
        start_idx = int(self.hparams['train_percentage'] * len(zhTextsUN))
        end_idx = int((self.hparams['train_percentage'] + self.hparams['val_percentage']) * len(zhTextsUN))
        return self._get_dataloader(start_idx, end_idx)
    
    
    def test_dataloader(self): 
        start_idx = int((self.hparams['train_percentage'] + self.hparams['val_percentage']) * len(zhTextsUN))
        end_idx = len(zhTextsUN)
        return self._get_dataloader(start_idx, end_idx)
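
The reason `_step` rewrites pad labels to -100 is that the cross-entropy loss skips positions carrying that label. The sketch below imitates this in plain Python; `masked_nll` is a hypothetical simplification that takes the probability assigned to the correct token at each position instead of full logits.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def masked_nll(token_probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    token_probs[i] is the probability the model assigned to the correct
    token at position i (a simplification of a full softmax output).
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != ignore_index]
    return sum(kept) / len(kept)


labels = [17, 42, 0, 0]                                   # two real tokens, two pads (id 0)
masked = [y if y != 0 else IGNORE_INDEX for y in labels]  # same rewrite as in _step
probs = [0.5, 0.25, 0.01, 0.01]                           # the model does badly on the pads

print(masked)                                                  # [17, 42, -100, -100]
print(masked_nll(probs, masked) < masked_nll(probs, labels))   # True
```

Without the rewrite, the many pad positions in a sequence padded to `max_output_len` would dominate the average.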

In [9]:
hparams = {
    'zhTokenizer': zhTokenizer,
    'enTokenizer': enTokenizer,
    'pretrainedModelName': 't5-small', 
    'train_percentage': 0.85, 
    'val_percentage': 0.13, 
    'learning_rate': 3e-4, 
    'max_input_len': 100, 
    'max_output_len': 100, 
    'batch_size': 8, 
    'num_train_epochs': 2, 
    'num_gpu': 0, 
    'weight_decay': 0, 
}
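
The `train_percentage` and `val_percentage` entries are turned into contiguous index ranges by the three dataloader methods of `T5FineTuner`; because the data were shuffled when loaded, contiguous slices amount to a random split. A minimal sketch of that arithmetic, with the hypothetical helper name `split_bounds`:

```python
def split_bounds(n_rows, train_percentage, val_percentage):
    """Return (start, end) index pairs for the train, val and test slices."""
    train_end = int(train_percentage * n_rows)
    val_end = int((train_percentage + val_percentage) * n_rows)
    # Whatever is left after train and val becomes the test slice
    return (0, train_end), (train_end, val_end), (val_end, n_rows)


train, val, test = split_bounds(100, 0.5, 0.25)
print(train, val, test)  # (0, 50) (50, 75) (75, 100)
```

Since `int()` truncates a floating-point product, a boundary can in principle land one row lower than expected, but the three slices always stay contiguous and cover every row.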

## Training and testing

In [None]:
torch.cuda.empty_cache()

train_params = dict(
    gpus = hparams['num_gpu'], 
    max_epochs = hparams['num_train_epochs'], 
    progress_bar_refresh_rate = 20, 
)

model = T5FineTuner(hparams)

trainer = pl.Trainer(**train_params)

trainer.fit(model)

# Save model for later use
now = datetime.now()
trainer.save_checkpoint('t5simple_' + now.strftime("%Y-%d-%m-%Y--%H=%M=%S") + '.ckpt')

trainer.test()
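
The timestamp format used for the checkpoint name above, `"%Y-%d-%m-%Y--%H=%M=%S"`, writes the year twice and puts the day before the month. As a hedged alternative, the sketch below uses a year-month-day order so that file names sort chronologically; `checkpoint_name` is a hypothetical helper and the date is an arbitrary example.

```python
from datetime import datetime


def checkpoint_name(prefix, when):
    """Build a checkpoint filename whose timestamps sort chronologically."""
    return prefix + when.strftime('%Y-%m-%d--%H=%M=%S') + '.ckpt'


print(checkpoint_name('t5simple_', datetime(2020, 12, 6, 19, 55, 7)))
# t5simple_2020-12-06--19=55=07.ckpt
```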

In [11]:
# Load a previously saved model

torch.cuda.empty_cache()

modelLoaded = T5FineTuner.load_from_checkpoint(checkpoint_path='__01_t5simple_2020-06-12-2020--19=55=07.ckpt').to(device)

In [12]:
# Testing without the help of `pl.LightningModule`
# start_idx = int((hparams['train_percentage'] + hparams['val_percentage']) * len(zhTextsUN))
# end_idx = len(zhTextsUN)
start_idx = 0
end_idx = 8

testset = MyDataset(
    zhTexts = zhTextsUN[start_idx:end_idx], 
    enTexts = enTextsUN[start_idx:end_idx], 
    zhTokenizer = hparams['zhTokenizer'], 
    enTokenizer = hparams['enTokenizer'], 
    zhMaxLen = hparams['max_input_len'], 
    enMaxLen = hparams['max_output_len']
)

test_dataloader = DataLoader(testset, batch_size = hparams['batch_size'])
testit = iter(test_dataloader)

# Take one batch from testset 
batch = next(testit)

# Generate target ids
outs = modelLoaded.model.generate(
    batch['source_ids'].to(device), 
    attention_mask = batch['source_mask'].to(device), 
    use_cache = True, 
    max_length = hparams['max_output_len'], 
#     num_beams = 2, 
#     repetition_penalty = 2.5, 
#     length_penalty = 1.0, 
#     early_stopping = True
)

pred_texts = [enTokenizer.decode(ids) for ids in outs.tolist()]
source_texts = [zhTokenizer.decode(ids) for ids in batch['source_ids'].tolist()]
target_texts = [enTokenizer.decode(ids) for ids in batch['target_ids'].tolist()]

for i in range(len(pred_texts)): 
    lines = textwrap.wrap("Chinese Text:\n%s\n" % source_texts[i], width=100)
    print("\n".join(lines))
    print("\nActual translation: %s" % target_texts[i])
    print("\nPredicted translation: %s" % pred_texts[i])
    print('=' * 50 + '\n')

Chinese Text: 我们赞和鼓励国际组织--包括联合国各专门机构的努力,它们为满足这些需要和愿望而工作。

Actual translation: We applaud and encourage the efforts of international organizations — including United Nations specialized agencies, which work to satisfy these needs and aspirations.

Predicted translation: We are also being made to the United Nations and the United Nations system for the United Nations system for the United Nations system for the United Nations system and the United Nations system for the United Nations system for the United Nations system for the United Nations system and the United Nations system for the United Nations system. It is also a number of human rights in the field of human rights and fundamental freedoms. The United Nations. It is also a number of human rights in the field of human rights and fundamental freedoms

Chinese Text: 这些课程的开设除其他之外,主要利用本地区大学研究所、瑞典航天公司卫星图公司和瑞典土地调查局的资源。

Actual translation: The courses are based on the resources of, inter alia, university institutes, the SSC Satellitbil

In [None]:
# %tensorboard --logdir lightning_logs/

In [None]:
print(T5FineTuner(hparams).parameters())