# Practice round: Chinese-English translation

Huggingface transformer doc: https://huggingface.co/transformers/

Huggingface tokenizer doc: https://huggingface.co/transformers/

Useful resources from huggingface -- fine-tuning a model from scratch: https://huggingface.co/blog/how-to-train

The code I wrote before might be helpful: https://github.com/submal/ctec-lambus/blob/master/xprmt/xprmt_06.ipynb

A code example of fine-tuning T5 for text summarization: https://towardsdatascience.com/fine-tuning-a-t5-transformer-for-any-summarization-task-82334c64c81 

LighningModule API
https://pytorch-lightning.readthedocs.io/en/latest/lightning_module.html#lightningmodule-apihttps://pytorch-lightning.readthedocs.io/en/latest/lightning_module.html#lightningmodule-api

In [1]:
import json
import pandas as pd
import jieba
from tokenizers import SentencePieceBPETokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    T5Model, 
    T5ForConditionalGeneration, 
    AdamW, 
    get_linear_schedule_with_warmup
)
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
import argparse
import time
import numpy as np
import nlp
import logging
import os

# Enable GPU if possible 
device = torch.device(
    'cuda:0' if torch.cuda.is_available() else 'cpu'
)
print(f'device = {device}')

device = cuda:0


## Load data

In [2]:
with open('./cn_en_weibo_data/data.cn-en.json', 'r', encoding = 'utf-8') as myfile:
    raw = myfile.read().split('\n')  

# Turn raw strings into a list of dictionaries
weiboDict = [json.loads(line) for line in raw]

# Load and shuffle data
weiboDf= pd.DataFrame(weiboDict).sample(frac=1).reset_index(drop=True)

weiboDf.tail()

Unnamed: 0,id,source,target
1998,3456748923243064,如果你真要梦想成真，就先从梦中醒来。,"If you really want to dream come true, first w..."
1999,3434266673334383,如果别人朝你扔石头，就不要扔回去了，留着作你建高楼的基石。大家早晨好,"If they throw stones at you, don't throw back,..."
2000,3507444192051677,这么早起，我都不是我了,good morning everybody
2001,10783626359,外国人眼中的中国禁烟,'This is China' no excuse for defying smoking ban
2002,15861401364,舞蹈是隐藏在灵魂中的一种语言。,is the hidden language of the soul.


It seems the data is far from clean. However, for prototyping purpose, we will not focus too much on cleaning right now. 

## Parsing and tokenizing Chinese texts

We use `jieba` library (结巴分词) for parsing Chinese text. For more information, see https://github.com/fxsjy/jieba/

In [3]:
chTexts = weiboDf['source']
enTexts = weiboDf['target']

# Tokenize all Chinese texts in the dataframe and store as a list
chTokensGen = [jieba.cut(sentence) for sentence in chTexts]

# Output a sample tokenization
print(list(chTokensGen[0]))

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\presu\AppData\Local\Temp\jieba.cache
Loading model cost 0.544 seconds.
Prefix dict has been built successfully.


['哪怕', '是', '世界末日', ',', '我', '都', '会', '爱', '你', '。', '[', '心', ']', ' ', '喜欢', '就', '关注']


It turns out with tokenizers based on `sentencePiece`, the tokenization happens at sentence level, and the tokenizer is trained recognize subwords. Therefore we will not use other parsers for now. 

In [4]:
pathAllCh = './cn_en_weibo_data/allCh.txt'
pathAllEn = './cn_en_weibo_data/allEn.txt'

# Store all Chinese text in a single file 
with open(pathAllCh, 'w', encoding = 'utf-8') as file: 
    for line in chTexts:
        file.write(line + '\n')
    file.close()
    
# Store all English text in a single file 
with open(pathAllEn, 'w', encoding = 'utf-8') as file: 
    for line in enTexts: 
        file.write(line + '\n')
    file.close()

My feeling is that we cannot use a pretrained tokenizer to train it from scratch. Instead, we might need to import Byte-Pair Encoding, or WordPiece, or SentencePiece by scratch. 

https://huggingface.co/transformers/tokenizer_summary.html#sentencepiece

https://github.com/huggingface/tokenizers

In the following cell, we train a `SentencePiece` tokenizer. 

In [5]:
chTokenizer = SentencePieceBPETokenizer()

chTokenizer.train([pathAllCh], 
                vocab_size = 20000, 
                special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# Show an example of tokenizer works
output = chTokenizer.encode(chTexts[0])
print(output.ids, output.tokens, output.offsets, output.attention_mask)

# We shall save the tokenizer to disk 
chTokenizer.save_model('.', 'myTokenizer')

[4520, 474, 560, 6049, 6014, 1094, 1582, 1311, 406, 1035] ['▁哪', '怕', '是', '世界末日', ',我都会', '爱你', '。[心]', '▁喜欢', '就', '关注'] [(0, 1), (1, 2), (2, 3), (3, 7), (7, 11), (11, 13), (13, 17), (17, 20), (20, 21), (21, 23)] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


['.\\myTokenizer-vocab.json', '.\\myTokenizer-merges.txt']

We also have the option to encode a list of texts as a batch. 

In [6]:
output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)

['▁哪', '怕', '是', '世界末日', ',我都会', '爱你', '。[心]', '▁喜欢', '就', '关注']
['▁距离', '并不', '可怕', ',可', '怕', '的是', '心', '越来越', '远']
['▁不', '敢', '相信', ',我', '直到', '现在', '才', '翻唱', '了一', '首', 'J', 'B', '的歌', '。我', '非常', '开心', '你们', '喜欢']


## Demo: padding and truncation

Huggingface tokenizer allows us to pad or truncate according to a length. The following are common utilities for padding and truncation: 

`Tokenizer.enable_padding(**args)` -- Enable padding

`Tokenizer.padding` -- Info about padding

`Tokenizer.no_padding()` -- Disable padding

`Tokenizer.enable_truncation(**args)` -- Enable truncation 

`Tokenizer.truncation` -- Info about truncation 

`Tokenizer.no_truncation()` -- Disable truncation 

In [7]:
# With Padding
chTokenizer.enable_padding(length = 15)

output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)
    
print(chTokenizer.padding)

['▁哪', '怕', '是', '世界末日', ',我都会', '爱你', '。[心]', '▁喜欢', '就', '关注', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
['▁距离', '并不', '可怕', ',可', '怕', '的是', '心', '越来越', '远', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
['▁不', '敢', '相信', ',我', '直到', '现在', '才', '翻唱', '了一', '首', 'J', 'B', '的歌', '。我', '非常', '开心', '你们', '喜欢']
{'length': 15, 'pad_to_multiple_of': None, 'pad_id': 0, 'pad_token': '[PAD]', 'pad_type_id': 0, 'direction': 'right'}


In [8]:
# No padding
chTokenizer.no_padding()

output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)
    
print(chTokenizer.padding)

['▁哪', '怕', '是', '世界末日', ',我都会', '爱你', '。[心]', '▁喜欢', '就', '关注']
['▁距离', '并不', '可怕', ',可', '怕', '的是', '心', '越来越', '远']
['▁不', '敢', '相信', ',我', '直到', '现在', '才', '翻唱', '了一', '首', 'J', 'B', '的歌', '。我', '非常', '开心', '你们', '喜欢']
None


In [9]:
# With truncation
chTokenizer.enable_truncation(max_length = 3)

output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)
    
print(chTokenizer.truncation)

['▁哪', '怕', '是']
['▁距离', '并不', '可怕']
['▁不', '敢', '相信']
{'max_length': 3, 'stride': 0, 'strategy': 'longest_first'}


In [10]:
# No truncation
chTokenizer.no_truncation()

output_batch = chTokenizer.encode_batch(chTexts[:3])

for output in output_batch:
    print(output.tokens)
    
print(chTokenizer.truncation)

['▁哪', '怕', '是', '世界末日', ',我都会', '爱你', '。[心]', '▁喜欢', '就', '关注']
['▁距离', '并不', '可怕', ',可', '怕', '的是', '心', '越来越', '远']
['▁不', '敢', '相信', ',我', '直到', '现在', '才', '翻唱', '了一', '首', 'J', 'B', '的歌', '。我', '非常', '开心', '你们', '喜欢']
None


<span style="color:red;">Pending problem.</span> As I tried to follow the tutorial and load the tokenizer saved on disk, unexpected error was reported. For now, skip loading saved tokenizer and proceed with other important steps.  

<span style="color:red;">Bottleneck for now.</span> Do we need special token for T5? If yes, how to insert special T5 tokens into our tokenization? Similar to `tokenizers.processors.BertProcessing`, do we have `tokenizers.processors.T5Processing`? 

<span style="color:red;">Solution.</span> 1. Thoroughtly read documentation for T5 model in huggingface doc; 2. Explore `huggingface/tokenizers` library on github. 

For now, halt with tokenizer and proceed with language model until bumping into problems. Keep in mind the confusion about special token. 

## Tokenizing English text

In [11]:
enTokenizer = SentencePieceBPETokenizer()

enTokenizer.train([pathAllEn], 
                vocab_size = 20000, 
               special_tokens = ['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# Show an example of tokenizer works
output = enTokenizer.encode(enTexts[0])
print(output.ids, output.tokens, output.offsets, output.attention_mask)

# We shall save the tokenizer to disk 
# tokenizer.save_model('.', 'myTokenizer')

[1338, 4066, 910, 874, 918, 959, 918, 875, 872, 3267, 1080, 1422, 884, 6148, 1869, 2896, 47, 961, 1319] ['▁In', '▁spite', '▁of', '▁you', '▁and', '▁me', '▁and', '▁the', '▁s', 'illy', '▁world', '▁going', '▁to', '▁pieces', '▁around', '▁us,', 'I', '▁love', '▁you.'] [(0, 2), (2, 8), (8, 11), (11, 15), (15, 19), (19, 22), (22, 26), (26, 30), (30, 32), (32, 36), (36, 42), (42, 48), (48, 51), (51, 58), (58, 65), (65, 69), (69, 70), (70, 75), (75, 80)] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


## Preprocess before training

To utilize PyTorch and GPU computation, we need to create instances of `Dataset` object. 

`DataLoader` class allows us to iterate a dataset with given batch size. We will define dataloaders in `T5FineTuner` class. 

The following cell overwrites `Dataset` class. 

In [12]:
class MyDataset(Dataset):
    
    '''
    Load original data and util from memory or file
    Texts must be passed as lists 
    '''
    def __init__(self, 
                 chTexts, enTexts, # Suppose the two colums are of the same length
                 chTokenizer, enTokenizer, 
                 chMaxLen, enMaxLen): 
        super().__init__()
        self.chTexts = chTexts 
        self.enTexts = enTexts
        self.chTokenizer = chTokenizer
        self.enTokenizer = enTokenizer 
        
        # Enable padding and truncation
        self.chTokenizer.enable_padding(length = chMaxLen)
        self.chTokenizer.enable_truncation(max_length = chMaxLen)
        self.enTokenizer.enable_padding(length = enMaxLen)
        self.enTokenizer.enable_truncation(max_length = enMaxLen)
        
    '''
    Return the size of dataset
    '''
    def __len__(self):
        return len(self.chTexts)
    
    '''
    -- The routine for querying one data entry 
    -- The index of must be specified as an argument
    -- Return a dictionary 
    '''
    def __getitem__(self, idx): 
        # Apply tokenizer 
        chOutputs = chTokenizer.encode(chTexts[idx])
        enOutputs = enTokenizer.encode(enTexts[idx])
        
        # Get numerical tokens
        chEncoding = chOutputs.ids
        enEncoding = enOutputs.ids
        
        # Get attention mask 
        chMask = chOutputs.attention_mask
        enMask = enOutputs.attention_mask
        
        return {
            'source_ids': torch.tensor(chEncoding).to(device), 
            'source_mask': torch.tensor(chMask).to(device), 
            'target_ids': torch.tensor(enEncoding).to(device), 
            'target_mask': torch.tensor(enMask).to(device)
        }
    

Now we test `Dataset`. 

In [13]:
chMaxLen = 100
enMaxLen = 100

dataset = MyDataset(chTexts[:1500], enTexts[:1500], 
                    chTokenizer, enTokenizer, 
                    chMaxLen = chMaxLen, enMaxLen = enMaxLen)

print(len(dataset))
# print(dataset.__getitem__(0))

# dataloader = DataLoader(dataset, batch_size = 16, num_workers = 0)

1500


## Define model class

PyTorch native, despite its great flexibility, may trap you in detailed errors that mess up the entire code. For example, you may forget important details like `optimizer.zero_grad()` or `tensor.to(device)` in PyTorch native. For both learning purpose and clarity in the long run, we use `pytorch_lightning` to define model class. 

In [14]:
class T5FineTuner(pl.LightningModule): 
    
    ''' Part 1: Define the architecture of model in init '''
    def __init__(self, hparams): 
        super(T5FineTuner, self).__init__()
        self.hparams = hparams
        self.model = T5ForConditionalGeneration.from_pretrained(
            hparams.pretrainedModelName, 
            return_dict = True    # I set return_dict true so that outputs  are presented as dictionaries
        )
        self.chTokenizer = hparams.chTokenizer
        self.enTokenizer = hparams.enTokenizer
        # self.rouge_metric = nlp.load_metric('rouge')
        
        # No idea what the "freeze" is doing
        if self.hparams.freeze_embed:
            self.freeze_embeds()
        if self.hparams.freeze_encoder:
            self.freeze_params(self.model.get_encoder())
            assert_all_frozen(self.model.get_encoder())
            
            
            
    ''' Part 2: Define the forward propagation '''
    def forward(self, 
                input_ids, 
                attention_mask = None, 
                decoder_input_ids = None, 
                decoder_attention_mask = None, 
                lm_labels = None
               ): 
        # Type `Seq2SeqLMOutput`
        return self.model(
            input_ids, 
            attention_mask = attention_mask, 
            decoder_input_ids = decoder_input_ids, 
            decoder_attention_mask = decoder_attention_mask, 
            lm_labels = lm_labels
        )
    
    
    ''' Part 3: Prepare optimizer and scheduler '''
    def configure_optimizers(self): 
        model = self.model 
        no_decay = ['bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {
                # model.named_parameters() can't find doc?
                'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 
                'weight_decay': self.hparams.weight_decay
            }, 
            {
                'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                'weight_decay': 0.0
            }
        ]
        optimizer = AdamW(
            optimizer_grouped_parameters, 
            lr = self.hparams.learning_rate
        )
        self.opt = optimizer
        return [optimizer]
    
    
    # Override this method to adjust how Trainer calls each optimizer 
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure = None, using_native_amp = False): 
        optimizer.step()
        optimizer.zero_grad()    # Why do we set zero_grad at this moment? 
        self.lr_scheduler.step()
    
    
    '''
    -- Part 4: Define training logic
    -- In PyTorch native, we have to manually define the epoch loop, define the batch loop, and manually perform model.train(), loss.backward(), optimizer.step(), optimizer.zero_grad()
    -- In pytorch_lightening, the training_step() method only needs to return the loss of the batch
    '''
    def training_step(self, batch, batch_idx): 
        loss = self._step(batch)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}
    
    
    # subroutine for training_step()
    def _step(self, batch): 
        lm_labels = batch['target_ids']    # !! Does not apply! 
        lm_labels[lm_lables[:, ] == 0] = -100    # !! Verify that id for pad is 0 
         
        # !! There is a `__call__` method associated with self ?! 
        outputs = self(
            input_ids = batch['source_ids'],    # !! Does not apply! 
            attention_mask = batch['source_mask'], 
            lm_labels = lm_labels, 
            decoder_attention_mask = batch['target_mask']
        )
        
        return outputs.loss    # !! Or should it be outputs[0] ? 
    
    
    # Called at the end of training epoch 
    # Do something with all the outputs from every training step 
    def training_epoch_end(self, outputs): 
        avg_train_loss = torch.stack([x['loss'] for x in outputs]).mean()
        tensorboard_logs = {'avg_train_loss': avg_train_loss}
        return {
            'avg_train_loss': avg_train_loss, 
            'log': tensorboard_logs, 
            'progress_bar': tensorboard_logs
        }
    
    
    
    '''
    -- Part 5: Define validation logic
    -- In PyTorch native, we have to define the batch loop, and manually perform model.eval(), torch.no_grad()
    -- In pytorch_lightening, the training_step() method only needs to return the loss of the batch
    '''
    def validation_step(self, batch, batch_idx): 
        print('val', end = ', ')
        return self._generative_step(batch)
    
    # subroutine for validation_step()
    def _generative_step(self, batch): 
        t0 = time.time()
        
        # !! model.generate() Can't find doc !
        generated_ids = self.model.generate(
            batch['source_ids'],    # !! Does not apply ! 
            attention_mask = batch['source_mask'], 
            use_cache = True, 
            decoder_attention_mask = batch['target_mask'],     # !! Does not apply! 
            max_length = self.hparams['max_output_len'],
            num_beams = 2,     # ?? What is this?
            repetition_penalty = 2.5,     # ?? What is this?
            length_penalty = 1.0,     # ?? What is this?
            early_stopping = True    # ?? What is this?
        )
        
        preds = self.ids_to_clean_text(generated_ids)    # translation predicted by model 
        target = self.ids_to_clean_text(batch['target_ids'])    # !! Does not apply! 
        
        gen_time = (time.time() - t0) / batch['source_ids'].shape[0]    # !! Does not apply
        
        loss = self._step(batch)
        
        # Compute metrics
        # ?? What is the deal with "rouge" in the code example? 
        base_metrics = {'val_loss': loss}
        trans_len = np.mean(list(map(len, generated_ids)))
        base_metrics.update(
            gen_time = gen_time, 
            gen_len = trans_len, 
            preds = preds, 
            target = target
        )
        # self.rouge_metric.add_batch(preds, target)
        
        return base_metrics
        
    
    #
    def validation_epoch_end(self, outputs): 
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        
        # rouge_results = self.rouge_metric.compute()
        # rouge_dict = self.parse_score(rouge_results)
        
        tensorboard_logs.update(rouge1 = rouge_dict['rouge1'], rougeL = rouge_dict['rougeL'])
        
        # Clear out the lists for next epoch 
        # !! I don't see those two variables defined anywhere 
        self.target_gen = []
        self.prediction_gen = []
        
        return {
            'avg_val_loss': avg_loss, 
            # 'rouge1': rouge_results['rouge1'], 
            # 'rougeL': rouge_results['rougeL'], 
            'log': tensorboard_logs,
            'progress_bar': tensorboard_logs
        }
        
        
        
    '''Part 6: Define dataloaders'''
    def train_dataloader(self): 
        train_dataset = get_dataset(
            chTexts = chTexts[:1800], 
            enTexts = enTexts[:1800], 
            chTokenizer = self.chTokenizer, 
            enTokenizer = self.enTokenizer, 
            chMaxLen = self.hparams.max_input_len, 
            enMaxLen = self.hparams.max_output_len
        )
        
        dataloader = DataLoader(
            train_dataset, 
            batch_size = self.hparams.train_batch_size, 
            drop_last = True, 
            shuffle = True, 
            num_workers = 0 
        )
        
        # The code below deals with scheduler. And I have no idea what the code is doing 
        t_total = (
            (len(dataloader.dataset) // (self.hparams.train_batch_size * max(1, self.hparams.n_gpu)))
            // self.hparams.gradient_accumulation_steps
            * float(self.hparams.num_train_epochs)
        )
        scheduler = get_linear_schedule_with_warmup(
            self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=t_total
        )
        self.lr_scheduler = scheduler
        return dataloader

    
    def val_dataloader(self):
        val_dataset = get_dataset(
            chTexts = chTexts[1800:1950], 
            enTexts = enTexts[1800:1950], 
            chTokenizer = self.chTokenizer, 
            enTokenizer = self.enTokenizer, 
            chMaxLen = self.hparams.max_input_len, 
            enMaxLen = self.hparams.max_output_len
        )
        
        return DataLoader(
            val_dataset, 
            batch_size = self.hparams.eval_batch_size, 
            num_workers = 0
        )
    
    
    def test_dataloader(self):
        test_dataset = get_dataset(
            chTexts = chTexts[1950:], 
            enTexts = enTexts[1950:], 
            chTokenizer = self.chTokenizer, 
            enTokenizer = self.enTokenizer, 
            chMaxLen = self.hparams.max_input_len, 
            enMaxLen = self.hparams.max_output_len
        )
        
        return DataLoader(
            test_dataset, 
            batch_size = self.hparams.eval_batch_size, 
            num_workers = 0
        )
    
    
    
    ''' ==================================
    # Collection of helper functions 
    # Not predefined by LightningModule
    ===================================='''
    
    # Decode a batch of ids and return a list of strings 
    def ids_to_clean_text(self, tokenizer, ids_batch): 
        ids_batch_tensor = torch.tensor(ids_batch)
        # Make sure that the ids come as a batch and that decode_batch() method will work properly 
        assert (id_batch_tensor.ndim >= 2), 'Ids do not form a batch'
        return tokenizer.decode_batch(ids_batch.tolist())
            
        
    # tqdm is a library for showing progress bar 
    # Retrieve info needed for progresse bar 
    def get_tqdm_dict(self): 
        # !! What is self.trainer? I never saw it defined 
        tqdm_dict = {
            'loss': '{:.3f}'.format(self.trainer.avg_loss), 
            'lr': self.lr_scheduler.get_last_lr()[-1]
        }
        return tqdm_dict

In [15]:
hparamsDict = {
    'chTokenizer': chTokenizer, 
    'enTokenizer': enTokenizer, 
    'pretrainedModelName': 't5-small', 
    'weight_decay': 0.0, 
    'learning_rate': 3e-4, 
    'max_input_len': 100, 
    'max_output_len': 100, 
    'train_batch_size': 8, 
    'eval_batch_size': 8, 
    'num_train_epochs': 2, 
    'n_gpu': 1
    # For now, we do train-test split manually when defining dataloader, instead of loading the following param 
    # 'n_train': 2000
    # 'n_val': 150
    # 'n_test': 50
}

# ?? What are these hyperparameters ? 
hparamsDictNoUnderstand = {
    'freeze_encoder': False, 
    'freeze_embed': False, 
    'adam_epsilon': 1e-8,
    'warmup_stpes': 0,
    'gradient_accumulation_steps': 8,
    'resume_from_checkpoint': None, 
    'val_check_interval': 0.05,
    'early_stop_callback': False, 
    'fp_16': False, 
    'opt_level': 'O1', 
    'max_grad_norm': 1.0, 
    'seed': 42
}

hparamsDict.update(hparamsDictNoUnderstand)

hparams = argparse.Namespace(**hparamsDict)

I have no idea what the following code cells are doing. 

In [16]:
logger = logging.getLogger(__name__)

class LoggingCallback(pl.Callback):
    def on_validation_end(self, trainer, pl_module):
        logger.info("***** Validation results *****")
        if pl_module.is_logger():
            metrics = trainer.callback_metrics
            # Log results
            for key in sorted(metrics):
                if key not in ["log", "progress_bar"]:
                    logger.info("{} = {}\n".format(key, str(metrics[key])))

    def on_test_end(self, trainer, pl_module):
        logger.info("***** Test results *****")

        if pl_module.is_logger():
            metrics = trainer.callback_metrics

            # Log and save results to file
            output_test_results_file = "./test_results.txt"
            with open(output_test_results_file, "w") as writer:
                for key in sorted(metrics):
                    if key not in ["log", "progress_bar"]:
                        logger.info("{} = {}\n".format(key, str(metrics[key])))
                        writer.write("{} = {}\n".format(key, str(metrics[key])))

In [17]:
## Set up wandb 
os.environ["WANDB_API_KEY"] = '4a50a6213c69c6a669deb96f81ced074ecac908a'
wandb_logger = WandbLogger(project='ch-en-try')

## Define Checkpoint function
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    filepath='./', prefix="checkpoint", monitor="val_loss", mode="min", save_top_k=3
)

## If resuming from checkpoint, add an arg resume_from_checkpoint
train_params = dict(
    accumulate_grad_batches=hparams.gradient_accumulation_steps,
    gpus=hparams.n_gpu,
    max_epochs=hparams.num_train_epochs,
    # early_stop_callback=False,
    precision= 16 if hparams.fp_16 else 32,
    amp_level=hparams.opt_level,
    resume_from_checkpoint=hparams.resume_from_checkpoint,
    gradient_clip_val=hparams.max_grad_norm,
    checkpoint_callback=checkpoint_callback,
    val_check_interval=hparams.val_check_interval,
    # logger=wandb_logger,
    callbacks=[LoggingCallback()],
)



## Train model

In [18]:
def get_dataset(chTexts, enTexts, chTokenizer, enTokenizer, chMaxLen, enMaxLen):
    return MyDataset(chTexts, enTexts, chTokenizer, enTokenizer, chMaxLen, enMaxLen)

In [19]:
model = T5FineTuner(hparams)
trainer = pl.Trainer(**train_params)
trainer.fit(model)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 60 M  


HBox(children=(HTML(value='Validation sanity check'), FloatProgress(value=1.0, bar_style='info', layout=Layout…

val, 

TypeError: ids_to_clean_text() missing 1 required positional argument: 'ids_batch'