# Simplistic T5 model with no fancy tricks

The previous example of Chinese-English machine translation has the following problems: 

<ul>
    <li>The dataset is of poor quality</li>
    <li>Includes too many tricks (scheduler, parameter freezing, callback, metrics) that I cannot handle</li>
</ul>

Now write a T5 Chinese-English translator with better data and no fancy tricks. 

In [2]:
import pandas as pd
from tokenizers import SentencePieceBPETokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    T5Model, 
    T5ForConditionalGeneration, 
    AdamW,
)
import pytorch_lightning as pl
import time

## Load data

The full dataset is too large to load directly into memory. For now, load only the first `nLine` lines.  

If the full corpus is needed later, learn to stream it with the PyTorch data-loading utilities. 
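One way to avoid holding the whole corpus in memory is to stream the two files in lockstep and stop after a fixed number of pairs. The sketch below uses only the standard library. `iter_parallel_pairs` is a hypothetical helper name, not part of this notebook, and it assumes the two files are line-aligned. The same generator could later be wrapped in a `torch.utils.data.IterableDataset` if the full corpus is needed.

```python
from itertools import islice


def iter_parallel_pairs(zh_path, en_path, n_lines=None):
    """Lazily yield (zh, en) line pairs from two line-aligned files.

    n_lines=None streams until the shorter file is exhausted.
    """
    with open(zh_path, 'r', encoding='utf-8') as zh_file, \
         open(en_path, 'r', encoding='utf-8') as en_file:
        # zip reads one line from each file at a time, so memory use stays flat
        pairs = zip(zh_file, en_file)
        for zh_line, en_line in islice(pairs, n_lines):
            yield zh_line.strip(), en_line.strip()
```

For example, `list(iter_parallel_pairs(zhPath, enPath, 1000))` would give the first 1000 pairs as a list of tuples, where `zhPath` and `enPath` stand for the two corpus files.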

In [3]:
%%time
nLine = 100000

dataMatrix = []

# Read the two line-aligned files in parallel; `with` closes them afterwards
with open('./en-zh/UNv1.0.en-zh.en', 'r', encoding = 'utf-8') as enFile, \
     open('./en-zh/UNv1.0.en-zh.zh', 'r', encoding = 'utf-8') as zhFile:
    for i in range(nLine): 
        zhLine = zhFile.readline().strip()
        enLine = enFile.readline().strip()
        dataMatrix.append([zhLine, enLine])

# Notice: build the DataFrame once from a list, because appending rows
# to a DataFrame one at a time is notoriously slow
df_UN = pd.DataFrame(dataMatrix, columns = ['zh', 'en'])
df_UN

Wall time: 243 ms


| | zh | en |
|---|---|---|
| 0 | 第918(1994)号决议 | RESOLUTION 918 (1994) |
| 1 | 1994年5月17日安全理事会第3377次会议通过 | Adopted by the Security Council at its 3377th ... |
| 2 | 安全理事会， | The Security Council, |
| 3 | 重申其以往关于卢旺达局势的所有决议，特别是成立联合国卢旺达援助团(联卢援助团)的1993年1... | Reaffirming all its previous resolutions on th... |
| 4 | 回顾安理会主席以安理会名义在1994年4月7日发表的声明(S/PRST/ 1994/16)和... | Recalling the statements made by the President... |
| ... | ... | ... |
| 99995 | 135. 关于粮食首脑会议，应高度重视有关会议方针和执行会议决议的问题。 | 135. The World Food Summit should give careful... |
| 99996 | 首脑会议所通过的行动计划，应在为执行联合国其他重要会议和首脑会议的一整套方针而建立的机构中进... | The plan of action to be adopted at the Summit... |
| 99997 | 发言人还特别强调指出，各感兴趣的组织应共同合作，有效支持各国实现世界粮食安全的倡议。 | Improved coordination and cooperation among al... |
| 99998 | 这并不妨碍联合国粮农组织在执行首脑会议决议中的领导作用。 | That did not prevent FAO from playing a leadin... |


## Tokenization and PyTorch `Dataset`

We first instantiate SentencePiece tokenizers and train them on our data. 

<b style="color:red;">Warning!</b> For some reason I can no longer find the API documentation for `SentencePieceBPETokenizer`. Did Hugging Face deprecate the older tokenizer classes? 

In [6]:
# Need to store all texts in file before training tokenizer
pathAllZh = './en-zh/allZh.txt'
pathAllEn = './en-zh/allEn.txt'

# The `with` block closes each file automatically, so no explicit close() is needed
with open(pathAllZh, 'w', encoding = 'utf-8') as file:
    for line in df_UN['zh']:
        file.write(line + '\n')
    
with open(pathAllEn, 'w', encoding = 'utf-8') as file: 
    for line in df_UN['en']:
        file.write(line + '\n')

In [34]:
# Instantiate and train tokenizers 
zhTokenizer = SentencePieceBPETokenizer()
zhTokenizer.train([pathAllZh], vocab_size = 500000, special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

enTokenizer = SentencePieceBPETokenizer()
enTokenizer.train([pathAllEn], vocab_size = 500000, special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

For more details about the tokenizer, see `Bo-Eng-Machine-Transation/warm_up_Chinese_English/01_practice_ch_en_tranlation.ipynb`. 
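As a quick sanity check of what a trained tokenizer returns, here is a minimal sketch on a toy corpus. It assumes a version of `tokenizers` that provides `train_from_iterator`. The names `toy_corpus` and `toy_tokenizer` and the small `vocab_size` are illustrative and not taken from this notebook. It also shows padding with the `<pad>` special token explicitly, since the library default pads with `[PAD]` and id 0.

```python
from tokenizers import SentencePieceBPETokenizer

# Toy corpus standing in for the UN data (illustration only)
toy_corpus = [
    'The Security Council,',
    'Adopted by the Security Council at its meeting',
    'The Security Council adopted the resolution',
]
special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>']

toy_tokenizer = SentencePieceBPETokenizer()
toy_tokenizer.train_from_iterator(
    toy_corpus,
    vocab_size=200,        # tiny vocabulary, just for the demo
    min_frequency=1,
    special_tokens=special_tokens,
    show_progress=False,
)

# Special tokens are part of the vocabulary, so their ids can be looked up
pad_id = toy_tokenizer.token_to_id('<pad>')

# Pad and truncate every encoding to a fixed length of 32 tokens
toy_tokenizer.enable_padding(length=32, pad_id=pad_id, pad_token='<pad>')
toy_tokenizer.enable_truncation(max_length=32)

encoded = toy_tokenizer.encode('The Security Council,')
print(encoded.tokens)          # subword pieces followed by '<pad>' entries
print(encoded.ids)             # integer ids, always 32 of them
print(encoded.attention_mask)  # 1 for real tokens, 0 for padding
```

The `ids` and `attention_mask` lists are the fields that a dataset class typically converts to tensors.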

Now define a PyTorch `Dataset`. 

In [38]:
class MyDataset(Dataset): 
    def __init__(self, zhTexts, enTexts, zhTokenizer, enTokenizer, zhMaxLen, enMaxLen): 
        super().__init__()
        # Store as lists so that integer positions work even if a pandas Series is passed in
        self.zhTexts = list(zhTexts)
        self.enTexts = list(enTexts)
        self.zhTokenizer = zhTokenizer
        self.enTokenizer = enTokenizer 
        
        # Enable padding and truncation.
        # Pad with the '<pad>' special token the tokenizers were trained with;
        # the library default is pad_token = '[PAD]' with pad_id = 0.
        zhPadId = self.zhTokenizer.token_to_id('<pad>')
        enPadId = self.enTokenizer.token_to_id('<pad>')
        self.zhTokenizer.enable_padding(length = zhMaxLen, pad_id = zhPadId, pad_token = '<pad>')
        self.zhTokenizer.enable_truncation(max_length = zhMaxLen)
        self.enTokenizer.enable_padding(length = enMaxLen, pad_id = enPadId, pad_token = '<pad>')
        self.enTokenizer.enable_truncation(max_length = enMaxLen)
        
    def __len__(self):
        '''Return the size of the dataset.'''
        return len(self.zhTexts)
    
    def __getitem__(self, idx): 
        '''
        Query one data entry.
        -- The index of the entry must be specified as an argument
        -- Return a dictionary of tensors
        '''
        # Apply tokenizers (use the instance attributes, not global variables)
        zhOutputs = self.zhTokenizer.encode(self.zhTexts[idx])
        enOutputs = self.enTokenizer.encode(self.enTexts[idx])
        
        # Get numerical tokens
        zhEncoding = zhOutputs.ids
        enEncoding = enOutputs.ids
        
        # Get attention mask 
        zhMask = zhOutputs.attention_mask
        enMask = enOutputs.attention_mask
        
        return {
            'source_ids': torch.tensor(zhEncoding), 
            'source_mask': torch.tensor(zhMask), 
            'target_ids': torch.tensor(enEncoding), 
            'target_mask': torch.tensor(enMask)
        }
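To see how a dataset that returns such dictionaries is consumed, here is a minimal sketch that wraps one in a `DataLoader`. `ToyPairDataset` is a hypothetical stand-in that yields dummy ids under the same four keys, so the example runs without the trained tokenizers. The lengths and batch size are arbitrary.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyPairDataset(Dataset):
    """Stand-in dataset with the same output keys (dummy ids, illustration only)."""

    def __init__(self, n_items=10, src_len=8, tgt_len=6):
        self.n_items = n_items
        self.src_len = src_len
        self.tgt_len = tgt_len

    def __len__(self):
        return self.n_items

    def __getitem__(self, idx):
        # Every item has a fixed length, as it would after padding and truncation
        return {
            'source_ids': torch.full((self.src_len,), idx, dtype=torch.long),
            'source_mask': torch.ones(self.src_len, dtype=torch.long),
            'target_ids': torch.full((self.tgt_len,), idx, dtype=torch.long),
            'target_mask': torch.ones(self.tgt_len, dtype=torch.long),
        }


# The default collate function stacks each dictionary entry along a new batch axis
loader = DataLoader(ToyPairDataset(), batch_size=4, shuffle=False)
batch = next(iter(loader))
print(batch['source_ids'].shape)  # torch.Size([4, 8])
print(batch['target_ids'].shape)  # torch.Size([4, 6])
```

Because every item is padded to a fixed length, the default collation is enough and no custom `collate_fn` is required.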

## Define model class

Use PyTorch Lightning.