# Homework Description
- English to Chinese (Traditional) Translation
  - Input: an English sentence         (e.g.		tom is a student .)
  - Output: the Chinese translation  (e.g. 		湯姆 是 個 學生 。)

- TODO
    - Train a simple RNN seq2seq to achieve translation
    - Switch to transformer model to boost performance
    - Apply Back-translation to further boost performance

In [1]:
!nvidia-smi

Fri Apr 14 12:34:02 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 528.49       Driver Version: 528.49       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   38C    P8    24W / 275W |    699MiB / 11264MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Download and import required packages

In [2]:
# !pip install 'torch>=1.6.0' editdistance matplotlib sacrebleu sacremoses sentencepiece tqdm wandb
# !pip install --upgrade jupyter ipywidgets

In [3]:
# !git clone https://github.com/pytorch/fairseq.git
# !cd fairseq && git checkout 9a1c497
# !pip install --upgrade ./fairseq/

In [4]:
import matplotlib.pyplot as plt
import sys
import pdb
import pprint
import logging
import os
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import data
import numpy as np
import tqdm.auto as tqdm
from pathlib import Path
from argparse import Namespace
from fairseq import utils



  from .autonotebook import tqdm as notebook_tqdm


In [5]:
torch.cuda.is_available()

True

In [6]:
np.version.version

'1.23.5'

# Fix random seed

In [7]:
seed = 33
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  
np.random.seed(seed)  
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Dataset

## En-Zh Bilingual Parallel Corpus
* TED2020
    - Raw: 400,726 (sentences)   
    - Processed: 394,052 (sentences)
    

## Testdata
- Size: 4,000 (sentences)
- **Chinese translation is undisclosed. The provided (.zh) file is a pseudo translation; each line is a '。'**

## Dataset Download

In [8]:
data_dir = './DATA/rawdata'
dataset_name = 'ted2020'
urls = (
    "https://github.com/figisiwirf/ml2023-hw5-dataset/releases/download/v1.0.1/ml2023.hw5.data.tgz",
    "https://github.com/figisiwirf/ml2023-hw5-dataset/releases/download/v1.0.1/ml2023.hw5.test.tgz"
)
file_names = (
    'ted2020.tgz', # train & dev
    'test.tgz', # test
)
prefix = Path(data_dir).absolute() / dataset_name

# prefix.mkdir(parents=True, exist_ok=True)
# for u, f in zip(urls, file_names):
#     path = prefix/f
#     if not path.exists():
#         !wget {u} -O {path}
#     if path.suffix == ".tgz":
#         !tar -xvf {path} -C {prefix}
#     elif path.suffix == ".zip":
#         !unzip -o {path} -d {prefix}
!mv {prefix/'raw.en'} {prefix/'train_dev.raw.en'}
!mv {prefix/'raw.zh'} {prefix/'train_dev.raw.zh'}
!mv {prefix/'test.en'} {prefix/'test.raw.en'}
!mv {prefix/'test.zh'} {prefix/'test.raw.zh'}

## Language

In [9]:
src_lang = 'en'
tgt_lang = 'zh'

data_prefix = f'{prefix}/train_dev.raw'
test_prefix = f'{prefix}/test.raw'

In [10]:
!head {data_prefix+'.'+src_lang} -n 5
!head {data_prefix+'.'+tgt_lang} -n 5

Thank you so much, Chris.
And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.
I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night.
And I say that sincerely, partly because I need that.
Put yourselves in my position.
非常謝謝你，克里斯。能有這個機會第二度踏上這個演講台
真是一大榮幸。我非常感激。
這個研討會給我留下了極為深刻的印象，我想感謝大家 對我之前演講的好評。
我是由衷的想這麼說，有部份原因是因為 —— 我真的有需要!
請你們設身處地為我想一想！


## Preprocess files

In [11]:
import re

def strQ2B(ustring):
    """Full width -> half width"""
    # reference:https://ithelp.ithome.com.tw/articles/10233122
    ss = []
    for s in ustring:
        rstring = ""
        for uchar in s:
            inside_code = ord(uchar)
            if inside_code == 12288:  # Full width space: direct conversion
                inside_code = 32
            elif (inside_code >= 65281 and inside_code <= 65374):  # Full width chars (except space) conversion
                inside_code -= 65248
            rstring += chr(inside_code)
        ss.append(rstring)
    return ''.join(ss)
                
def clean_s(s, lang):
    if lang == 'en':
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace('-', '') # remove '-'
        s = re.sub('([.,;!?()\"])', r' \1 ', s) # keep punctuation
    elif lang == 'zh':
        s = strQ2B(s) # Q2B
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace(' ', '')
        s = s.replace('—', '')
        s = s.replace('“', '"')
        s = s.replace('”', '"')
        s = s.replace('_', '')
        s = re.sub('([。,;!?()\"~「」])', r' \1 ', s) # keep punctuation
    s = ' '.join(s.strip().split())
    return s

def len_s(s, lang):
    if lang == 'zh':
        return len(s)
    return len(s.split())

def clean_corpus(prefix, l1, l2, ratio=9, max_len=1000, min_len=1):
    if Path(f'{prefix}.clean.{l1}').exists() and Path(f'{prefix}.clean.{l2}').exists():
        print(f'{prefix}.clean.{l1} & {l2} exists. skipping clean.')
        return
    with open(f'{prefix}.{l1}', 'r') as l1_in_f:
        with open(f'{prefix}.{l2}', 'r') as l2_in_f:
            with open(f'{prefix}.clean.{l1}', 'w') as l1_out_f:
                with open(f'{prefix}.clean.{l2}', 'w') as l2_out_f:
                    for s1 in l1_in_f:
                        s1 = s1.strip()
                        s2 = l2_in_f.readline().strip()
                        s1 = clean_s(s1, l1)
                        s2 = clean_s(s2, l2)
                        s1_len = len_s(s1, l1)
                        s2_len = len_s(s2, l2)
                        if min_len > 0: # remove short sentence
                            if s1_len < min_len or s2_len < min_len:
                                continue
                        if max_len > 0: # remove long sentence
                            if s1_len > max_len or s2_len > max_len:
                                continue
                        if ratio > 0: # remove by ratio of length
                            if s1_len/s2_len > ratio or s2_len/s1_len > ratio:
                                continue
                        print(s1, file=l1_out_f)
                        print(s2, file=l2_out_f)

In [12]:
clean_corpus(data_prefix, src_lang, tgt_lang)
clean_corpus(test_prefix, src_lang, tgt_lang, ratio=-1, min_len=-1, max_len=-1)

In [13]:
!head {data_prefix+'.clean.'+src_lang} -n 5
!head {data_prefix+'.clean.'+tgt_lang} -n 5

Thank you so much , Chris .
And it's truly a great honor to have the opportunity to come to this stage twice ; I'm extremely grateful .
I have been blown away by this conference , and I want to thank all of you for the many nice comments about what I had to say the other night .
And I say that sincerely , partly because I need that .
Put yourselves in my position .
非常謝謝你 , 克里斯 。 能有這個機會第二度踏上這個演講台
真是一大榮幸 。 我非常感激 。
這個研討會給我留下了極為深刻的印象 , 我想感謝大家對我之前演講的好評 。
我是由衷的想這麼說 , 有部份原因是因為我真的有需要 !
請你們設身處地為我想一想 !
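
The full-width to half-width step in `strQ2B` explains why `！` becomes `!` in the cleaned output while `。` is left alone. The sketch below restates that mapping in hexadecimal; `to_half_width` is a hypothetical name used only for this illustration, not a function from the notebook.

```python
def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to plain ASCII."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic (full-width) space, 12288
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!' .. '~', 65281..65374
            code -= 0xFEE0                # same offset as 65248 in strQ2B
        out.append(chr(code))
    return "".join(out)


sample = "請你們設身處地為我想一想！"
converted = to_half_width(sample)
print(converted)

# The ideographic full stop (U+3002) lies outside the converted range,
# so it is kept and later matched by the punctuation regex in clean_s.
print(to_half_width("。") == "。")
```

Characters outside the two ranges, including all CJK ideographs, pass through unchanged.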


## Split into train/valid

In [14]:
valid_ratio = 0.01 # 3000~4000 would suffice
train_ratio = 1 - valid_ratio

In [15]:
if (prefix/f'train.clean.{src_lang}').exists() \
and (prefix/f'train.clean.{tgt_lang}').exists() \
and (prefix/f'valid.clean.{src_lang}').exists() \
and (prefix/f'valid.clean.{tgt_lang}').exists():
    print(f'train/valid splits exists. skipping split.')
else:
    line_num = sum(1 for line in open(f'{data_prefix}.clean.{src_lang}'))
    labels = list(range(line_num))
    random.shuffle(labels)
    for lang in [src_lang, tgt_lang]:
        train_f = open(os.path.join(data_dir, dataset_name, f'train.clean.{lang}'), 'w')
        valid_f = open(os.path.join(data_dir, dataset_name, f'valid.clean.{lang}'), 'w')
        count = 0
        for line in open(f'{data_prefix}.clean.{lang}', 'r'):
            if labels[count]/line_num < train_ratio:
                train_f.write(line)
            else:
                valid_f.write(line)
            count += 1
        train_f.close()
        valid_f.close()
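
The split above shuffles a list of labels once and reuses it for both languages, so line `i` of the English file and line `i` of the Chinese file always land in the same split. A minimal sketch of that logic on line indices only, assuming a hypothetical helper `split_indices` and a toy corpus size:

```python
import random


def split_indices(n_lines, valid_ratio, seed=33):
    """Return (train, valid) line indices using one shuffled label per line."""
    rng = random.Random(seed)
    labels = list(range(n_lines))
    rng.shuffle(labels)
    train_ratio = 1 - valid_ratio
    train, valid = [], []
    for line_no, label in enumerate(labels):
        # Same test as in the notebook: the label's rank decides the split.
        if label / n_lines < train_ratio:
            train.append(line_no)
        else:
            valid.append(line_no)
    return train, valid


train_idx, valid_idx = split_indices(1000, valid_ratio=0.01)
print(len(train_idx), len(valid_idx))
```

Because the assignment depends only on the shuffled labels, applying the same index lists to every language file keeps the sentence pairs aligned.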

## Subword Units 
Out-of-vocabulary (OOV) words have been a major problem in machine translation. This can be alleviated by using subword units.
- We will use the [sentencepiece](#kudo-richardson-2018-sentencepiece) package
- Select either the 'unigram' or the 'byte-pair encoding (BPE)' algorithm
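
To see why subword units help with OOV words, here is a toy sketch that splits a word into pieces from a fixed vocabulary by greedy longest match, falling back to single characters. This is an illustration only: it is neither the unigram nor the BPE algorithm that sentencepiece implements, and `segment` and `toy_vocab` are made-up names.

```python
def segment(word, vocab):
    """Greedy longest-match split; unknown spans fall back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Accept the longest piece found in vocab, or a single character.
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


toy_vocab = {"extreme", "ly", "grate", "ful", "un"}
print(segment("extremely", toy_vocab))
print(segment("ungrateful", toy_vocab))
```

A word never seen as a whole can still be represented, because in the worst case it decomposes into characters that are all in the vocabulary.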

In [16]:
import sentencepiece as spm
vocab_size = 8000
if (prefix/f'spm{vocab_size}.model').exists():
    print(f'{prefix}/spm{vocab_size}.model exists. skipping spm_train.')
else:
    spm.SentencePieceTrainer.train(
        input=','.join([f'{prefix}/train.clean.{src_lang}',
                        f'{prefix}/valid.clean.{src_lang}',
                        f'{prefix}/train.clean.{tgt_lang}',
                        f'{prefix}/valid.clean.{tgt_lang}']),
        model_prefix=prefix/f'spm{vocab_size}',
        vocab_size=vocab_size,
        character_coverage=1,
        model_type='unigram', # 'bpe' works as well
        input_sentence_size=1e6,
        shuffle_input_sentence=True,
        normalization_rule_name='nmt_nfkc_cf',
    )

In [17]:
spm_model = spm.SentencePieceProcessor(model_file=str(prefix/f'spm{vocab_size}.model'))
in_tag = {
    'train': 'train.clean',
    'valid': 'valid.clean',
    'test': 'test.raw.clean',
}
for split in ['train', 'valid', 'test']:
    for lang in [src_lang, tgt_lang]:
        out_path = prefix/f'{split}.{lang}'
        if out_path.exists():
            print(f"{out_path} exists. skipping spm_encode.")
        else:
            with open(prefix/f'{split}.{lang}', 'w') as out_f:
                with open(prefix/f'{in_tag[split]}.{lang}', 'r') as in_f:
                    for line in in_f:
                        line = line.strip()
                        tok = spm_model.encode(line, out_type=str)
                        print(' '.join(tok), file=out_f)

In [18]:
!head {data_dir+'/'+dataset_name+'/train.'+src_lang} -n 5
!head {data_dir+'/'+dataset_name+'/train.'+tgt_lang} -n 5

▁thank ▁you ▁so ▁much ▁, ▁chris ▁.
▁and ▁it ' s ▁tr u ly ▁a ▁great ▁ho n or ▁to ▁have ▁the ▁ op port un ity ▁to ▁come ▁to ▁this ▁st age ▁ t wi ce ▁ ; ▁i ' m ▁ex t re me ly ▁gr ate ful ▁.
▁i ▁have ▁been ▁ bl own ▁away ▁by ▁this ▁con fer ence ▁, ▁and ▁i ▁want ▁to ▁thank ▁all ▁of ▁you ▁for ▁the ▁many ▁ ni ce ▁ com ment s ▁about ▁what ▁i ▁had ▁to ▁say ▁the ▁other ▁night ▁.
▁and ▁i ▁say ▁that ▁since re ly ▁, ▁part ly ▁because ▁i ▁need ▁that ▁.
▁put ▁your s el ve s ▁in ▁my ▁ position ▁.
▁ 非常 謝 謝 你 ▁, ▁ 克 里 斯 ▁。 ▁ 能 有 這個 機會 第二 度 踏 上 這個 演講 台
▁ 真 是 一 大 榮 幸 ▁。 ▁我 非常 感 激 ▁。
▁這個 研 討 會 給我 留 下 了 極 為 深 刻 的 印 象 ▁, ▁我想 感 謝 大家 對我 之前 演講 的 好 評 ▁。
▁我 是由 衷 的 想 這麼 說 ▁, ▁有 部份 原因 是因為 我 真的 有 需要 ▁ !
▁ 請 你們 設 身 處 地 為 我想 一 想 ▁ !


## Binarize the data with fairseq
Prepare the files in pairs for both the source and target languages.
In case a pair is unavailable, generate a pseudo pair to facilitate binarization.

In [19]:
binpath = Path('./DATA/data-bin', dataset_name)
if binpath.exists():
    print(binpath, "exists, will not overwrite!")
else:
    !python -m fairseq_cli.preprocess \
        --source-lang {src_lang}\
        --target-lang {tgt_lang}\
        --trainpref {prefix/'train'}\
        --validpref {prefix/'valid'}\
        --testpref {prefix/'test'}\
        --destdir {binpath}\
        --joined-dictionary\
        --workers 2

2023-04-14 12:35:49 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='DATA\\data-bin\\ted2020', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='en', srcdict=None, suppress_crashes=False, target_lang='zh', task='translation', tensorboard_logdir=None, testpref='c:\\Users\\william\\Desktop\\graddescope\\DATA\\rawdata\\ted2020\\test', tgtdict=None, thresh

# Configuration for experiments

In [20]:
config = Namespace(
    datadir = "./DATA/data-bin/ted2020",
    # savedir = "./checkpoints/rnn",
    savedir = "./checkpoints/transformer",
    source_lang = src_lang,
    target_lang = tgt_lang,
    
    # cpu threads when fetching & processing data.
    num_workers=2,  
    # batch size in terms of tokens. gradient accumulation increases the effective batchsize.
    max_tokens=8192,
    accum_steps=2,
    
    # the lr is calculated from the Noam lr scheduler. you can tune the maximum lr by this factor.
    lr_factor=2.,
    lr_warmup=4000,
    
    # clipping gradient norm helps alleviate gradient exploding
    clip_norm=1.0,
    
    # maximum epochs for training
    max_epoch=60,
    start_epoch=1,
    
    # beam size for beam search
    beam=5, 
    # generate sequences of maximum length ax + b, where x is the source length
    max_len_a=1.2, 
    max_len_b=10, 
    # when decoding, post process sentence by removing sentencepiece symbols and jieba tokenization.
    post_process = "sentencepiece",
    
    # checkpoints
    keep_last_epochs=5,
    resume=None, # if resume from checkpoint name (under config.savedir)
    
    # logging
    use_wandb=False,
)
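
Three of the settings above reduce to simple arithmetic. The sketch below uses the standard Noam schedule formula, with `d_model=256` as an assumed model dimension (it is not set in this config). The helper names are hypothetical, and the optimizer this notebook actually uses may differ in detail.

```python
def noam_lr(step, d_model=256, factor=2.0, warmup=4000):
    """Noam schedule: linear warmup, then inverse square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first update
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def max_target_len(src_len, max_len_a=1.2, max_len_b=10):
    """Longest sequence the decoder may generate: a * x + b."""
    return int(max_len_a * src_len + max_len_b)


# Gradient accumulation multiplies the number of tokens per optimizer update.
effective_tokens = 8192 * 2  # max_tokens * accum_steps

for s in (1, 2000, 4000, 16000):
    print(s, noam_lr(s))
print(max_target_len(16), effective_tokens)
```

The learning rate peaks at `step == lr_warmup`, and `lr_factor` scales the whole curve, which is why it controls the maximum lr.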

# Logging
- logging package logs ordinary messages
- wandb logs the loss, bleu, etc. in the training process

In [21]:
logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level="INFO", # "DEBUG" "WARNING" "ERROR"
    stream=sys.stdout,
)
proj = "hw5.seq2seq"
logger = logging.getLogger(proj)
if config.use_wandb:
    import wandb
    wandb.init(project=proj, name=Path(config.savedir).stem, config=config)

# CUDA Environments

In [22]:
cuda_env = utils.CudaEnvironment()
utils.CudaEnvironment.pretty_print_cuda_env_list([cuda_env])
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

2023-04-14 12:37:16 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2023-04-14 12:37:16 | INFO | fairseq.utils | rank   0: capabilities =  6.1  ; total memory = 11.000 GB ; name = NVIDIA GeForce GTX 1080 Ti              
2023-04-14 12:37:16 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************


# Dataloading

## We borrow the TranslationTask from fairseq
* used to load the binarized data created above
* well-implemented data iterator (dataloader)
* built-in task.source_dictionary and task.target_dictionary are also handy
* well-implemented beam search decoder

In [23]:
from fairseq.tasks.translation import TranslationConfig, TranslationTask

## setup task
task_cfg = TranslationConfig(
    data=config.datadir,
    source_lang=config.source_lang,
    target_lang=config.target_lang,
    train_subset="train",
    required_seq_len_multiple=8,
    dataset_impl="mmap",
    upsample_primary=1,
)
task = TranslationTask.setup_task(task_cfg)

2023-04-14 12:37:16 | INFO | fairseq.tasks.translation | [en] dictionary: 7992 types
2023-04-14 12:37:16 | INFO | fairseq.tasks.translation | [zh] dictionary: 7992 types


In [24]:
logger.info("loading data for epoch 1")
task.load_dataset(split="train", epoch=1, combine=True) # combine if you have back-translation data.
task.load_dataset(split="valid", epoch=1)

2023-04-14 12:37:16 | INFO | hw5.seq2seq | loading data for epoch 1
2023-04-14 12:37:16 | INFO | fairseq.data.data_utils | loaded 390,112 examples from: ./DATA/data-bin/ted2020\train.en-zh.en
2023-04-14 12:37:16 | INFO | fairseq.data.data_utils | loaded 390,112 examples from: ./DATA/data-bin/ted2020\train.en-zh.zh
2023-04-14 12:37:16 | INFO | fairseq.tasks.translation | ./DATA/data-bin/ted2020 train en-zh 390112 examples
2023-04-14 12:37:16 | INFO | fairseq.data.data_utils | loaded 3,940 examples from: ./DATA/data-bin/ted2020\valid.en-zh.en
2023-04-14 12:37:16 | INFO | fairseq.data.data_utils | loaded 3,940 examples from: ./DATA/data-bin/ted2020\valid.en-zh.zh
2023-04-14 12:37:16 | INFO | fairseq.tasks.translation | ./DATA/data-bin/ted2020 valid en-zh 3940 examples


In [25]:
sample = task.dataset("valid")[1]
pprint.pprint(sample)
pprint.pprint(
    "Source: " + \
    task.source_dictionary.string(
        sample['source'],
        config.post_process,
    )
)
pprint.pprint(
    "Target: " + \
    task.target_dictionary.string(
        sample['target'],
        config.post_process,
    )
)

{'id': 1,
 'source': tensor([  24,   64,    5,   85, 1299,  142,  144,  190,  274,   37,    8,   88,
         237,   11,   78,   55,   12,  372,   20,  154,   62, 1012,  100,  484,
          75,  268,    6,  100, 1463,    7,    2]),
 'target': tensor([ 162,  125, 3759,  359,  157, 3058, 2923,    9, 2550,    4,  598,  123,
        1515,  551,  664,   65,  406,  570,   77, 1907, 3793,  189,   10,    2])}
('Source: you can throw out crazy theories and not have to back it up with '
 'data or graphs or research .')
'Target: 你能拋開這些瘋狂的理論 , 不用數據圖表、或研究來支撐它 。'


# Dataset iterator

* Controls every batch to contain no more than N tokens, which optimizes GPU memory efficiency
* Shuffles the training set for every epoch
* Ignores sentences exceeding the maximum length
* Pads all sentences in a batch to the same length, which enables parallel computing by GPU
* Adds eos and shifts by one token
    - teacher forcing: to train the model to predict the next token based on prefix, we feed the right shifted target sequence as the decoder input.
    - generally, prepending bos to the target would do the job (as shown below)
![seq2seq](https://i.imgur.com/0zeDyuI.png)
    - in fairseq however, this is done by moving the eos token to the beginning. Empirically, this has the same effect. For instance:
    ```
    # output target (target) and Decoder input (prev_output_tokens): 
                   eos = 2
                target = 419,  711,  238,  888,  792,   60,  968,    8,    2
    prev_output_tokens = 2,  419,  711,  238,  888,  792,   60,  968,    8
    ```
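
The shift shown above can be written in a couple of lines. This is a minimal sketch on a single unpadded sequence; `shift_right` is a hypothetical helper, and fairseq's own collation code additionally handles padding.

```python
EOS = 2


def shift_right(target, eos=EOS):
    """Build the decoder input by moving the trailing eos to the front."""
    assert target[-1] == eos, "target is expected to end with eos"
    return [eos] + target[:-1]


target = [419, 711, 238, 888, 792, 60, 968, 8, 2]
prev_output_tokens = shift_right(target)
print(prev_output_tokens)
```

At position `t` the decoder sees `prev_output_tokens[t]` and is trained to predict `target[t]`, so it never sees the token it has to produce.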



In [26]:
def load_data_iterator(task, split, epoch=1, max_tokens=4000, num_workers=1, cached=True):
    batch_iterator = task.get_batch_iterator(
        dataset=task.dataset(split),
        max_tokens=max_tokens,
        max_sentences=None,
        max_positions=utils.resolve_max_positions(
            task.max_positions(),
            max_tokens,
        ),
        ignore_invalid_inputs=True,
        seed=seed,
        num_workers=num_workers,
        epoch=epoch,
        disable_iterator_cache=not cached,
        # With disable_iterator_cache=False (cached=True) iteration is faster, but changing
        # max_tokens after the first call of this method has no effect.
    )
    return batch_iterator

demo_epoch_obj = load_data_iterator(task, "valid", epoch=1, max_tokens=20, num_workers=1, cached=False)
demo_iter = demo_epoch_obj.next_epoch_itr(shuffle=True)
sample = next(demo_iter)
sample



{'id': tensor([3381]),
 'nsentences': 1,
 'ntokens': 12,
 'net_input': {'src_tokens': tensor([[  11,  259,  289,    4,   15, 1088,   19,   14,   36,  230,    4,  259,
           1600,  122,    7,    2]]),
  'src_lengths': tensor([16]),
  'prev_output_tokens': tensor([[   2, 1607,    4, 2475, 1799,    4,  242,  467, 2208, 2831,   34,   10,
              1,    1,    1,    1]])},
 'target': tensor([[1607,    4, 2475, 1799,    4,  242,  467, 2208, 2831,   34,   10,    2,
             1,    1,    1,    1]])}

* each batch is a python dict, with string key and Tensor value. Contents are described below:
```python
batch = {
    "id": id, # id for each example 
    "nsentences": len(samples), # batch size (sentences)
    "ntokens": ntokens, # batch size (tokens)
    "net_input": {
        "src_tokens": src_tokens, # sequence in source language
        "src_lengths": src_lengths, # sequence length of each example before padding
        "prev_output_tokens": prev_output_tokens, # right shifted target, as mentioned above.
    },
    "target": target, # target sequence
}
```
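
The two iterator properties described earlier, a token budget per batch and padding to a common length, can be sketched without fairseq. This is a simplification under stated assumptions: `batch_by_tokens` and `pad_batch` are hypothetical helpers, the cost of a batch is taken as (number of sentences) x (longest sentence), and the pad index 1 follows the sample output above. fairseq's real batching is more elaborate.

```python
def batch_by_tokens(lengths, max_tokens):
    """Group sentence indices so that count * longest stays within max_tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, longest = [], [], 0
    for i in order:
        new_longest = max(longest, lengths[i])
        if current and new_longest * (len(current) + 1) > max_tokens:
            batches.append(current)          # budget exceeded: close the batch
            current, new_longest = [], lengths[i]
        current.append(i)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


def pad_batch(sentences, pad=1):
    """Right-pad every sentence to the longest one in the batch."""
    width = max(len(s) for s in sentences)
    return [s + [pad] * (width - len(s)) for s in sentences]


lengths = [5, 3, 9, 4, 8, 2]
batches = batch_by_tokens(lengths, max_tokens=12)
print(batches)
print(pad_batch([[4, 5, 2], [7, 2]]))
```

Sorting by length before grouping keeps sentences of similar length together, which reduces the number of padding tokens per batch.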

# Model Architecture
* We again inherit fairseq's encoder, decoder and model, so that in the testing phase we can directly leverage fairseq's beam search decoder.

In [27]:
from fairseq.models import (
    FairseqEncoder, 
    FairseqIncrementalDecoder,
    FairseqEncoderDecoderModel
)

# Encoder

- The Encoder is an RNN or Transformer Encoder. The following description is for the RNN. For every input token, the Encoder generates an output vector and a hidden state vector, and the hidden state is passed on to the next step. In other words, the Encoder reads the input sequence sequentially, emits one output vector at each timestep, and at the last timestep emits the final hidden states, also called the context vector.
- Parameters:
  - *args*
      - encoder_embed_dim: the dimension of embeddings, this compresses the one-hot vector into fixed dimensions, which achieves dimension reduction
      - encoder_ffn_embed_dim is the dimension of hidden states and output vectors
      - encoder_layers is the number of layers for Encoder RNN
      - dropout determines the probability of a neuron's activation being set to 0, in order to prevent overfitting. Generally this is applied in training, and removed in testing.
  - *dictionary*: the dictionary provided by fairseq. it's used to obtain the padding index, and in turn the encoder padding mask. 
  - *embed_tokens*: an instance of token embeddings (nn.Embedding)

- Inputs: 
    - *src_tokens*: integer sequence representing English e.g. 1, 28, 29, 205, 2 
- Outputs: 
    - *outputs*: the output of RNN at each timestep, can be further processed by Attention
    - *final_hiddens*: the hidden states of each layer at the last timestep, will be passed to decoder for decoding
    - *encoder_padding_mask*: this tells the decoder which position to ignore


In [28]:
class RNNEncoder(FairseqEncoder):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens
        
        self.embed_dim = args.encoder_embed_dim
        self.hidden_dim = args.encoder_ffn_embed_dim
        self.num_layers = args.encoder_layers
        
        self.dropout_in_module = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            self.embed_dim, 
            self.hidden_dim, 
            self.num_layers, 
            dropout=args.dropout, 
            batch_first=False, 
            bidirectional=True
        )
        self.dropout_out_module = nn.Dropout(args.dropout)
        
        self.padding_idx = dictionary.pad()
        
    def combine_bidir(self, outs, bsz: int):
        out = outs.view(self.num_layers, 2, bsz, -1).transpose(1, 2).contiguous()
        return out.view(self.num_layers, bsz, -1)

    def forward(self, src_tokens, **unused):
        bsz, seqlen = src_tokens.size()
        
        # get embeddings
        x = self.embed_tokens(src_tokens)
        x = self.dropout_in_module(x)

        # B x T x C -> T x B x C
        x = x.transpose(0, 1)
        
        # pass thru bidirectional RNN
        h0 = x.new_zeros(2 * self.num_layers, bsz, self.hidden_dim)
        x, final_hiddens = self.rnn(x, h0)
        outputs = self.dropout_out_module(x)
        # outputs = [sequence len, batch size, hid dim * directions]
        # hidden =  [num_layers * directions, batch size  , hid dim]
        
        # Since Encoder is bidirectional, we need to concatenate the hidden states of two directions
        final_hiddens = self.combine_bidir(final_hiddens, bsz)
        # hidden =  [num_layers x batch x num_directions*hidden]
        
        encoder_padding_mask = src_tokens.eq(self.padding_idx).t()
        return tuple(
            (
                outputs,  # seq_len x batch x hidden
                final_hiddens,  # num_layers x batch x num_directions*hidden
                encoder_padding_mask,  # seq_len x batch
            )
        )
    
    def reorder_encoder_out(self, encoder_out, new_order):
        # This is used by fairseq's beam search. How and why is not particularly important here.
        return tuple(
            (
                encoder_out[0].index_select(1, new_order),
                encoder_out[1].index_select(1, new_order),
                encoder_out[2].index_select(1, new_order),
            )
        )
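
Two reshaping steps in `RNNEncoder.forward` are easy to misread: merging the two directions in `combine_bidir`, and building the time-major padding mask. The sketch below reproduces both on nested Python lists instead of tensors. It assumes the GRU orders its final hidden states as layer 0 forward, layer 0 backward, layer 1 forward, and so on, which is what the `view(num_layers, 2, bsz, -1)` call relies on. The left-padded toy batch is an assumption for illustration.

```python
PAD = 1


def combine_bidir(hiddens, num_layers):
    """(num_layers*2, batch, hid) -> (num_layers, batch, 2*hid) on nested lists."""
    combined = []
    for layer in range(num_layers):
        fwd, bwd = hiddens[2 * layer], hiddens[2 * layer + 1]
        # Concatenate forward and backward features for each example.
        combined.append([f + b for f, b in zip(fwd, bwd)])
    return combined


def padding_mask_time_major(src_tokens, pad=PAD):
    """True where a position is padding; result is seq_len x batch."""
    batch_major = [[tok == pad for tok in row] for row in src_tokens]
    return [list(col) for col in zip(*batch_major)]


hiddens = [
    [[1, 1], [2, 2]],  # layer 0, forward
    [[3, 3], [4, 4]],  # layer 0, backward
    [[5, 5], [6, 6]],  # layer 1, forward
    [[7, 7], [8, 8]],  # layer 1, backward
]
print(combine_bidir(hiddens, num_layers=2))
print(padding_mask_time_major([[1, 1, 15, 2], [11, 259, 7, 2]]))
```

After merging, each layer's hidden state is twice as wide as the encoder hidden size, which is why the decoder is required to use `2 * encoder_ffn_embed_dim`.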

## Attention

- When the input sequence is long, the "context vector" alone cannot accurately represent the whole sequence; an attention mechanism gives the Decoder more information.
- At each timestep, the **Decoder embeddings** are matched against the **Encoder outputs** to determine their correlation, and the Encoder outputs, summed with weights given by that correlation, become the input to the **Decoder** RNN.
- Common attention implementations use a neural network or a dot product to score the correlation between the **query** (decoder embeddings) and the **keys** (Encoder outputs), apply **softmax** to obtain a distribution, and finally take the **weighted sum** of the **values** (Encoder outputs) under that distribution.

- Parameters:
  - *input_embed_dim*: dimensionality of the query, should be that of the vector in decoder to attend others
  - *source_embed_dim*: dimensionality of the key, should be that of the vector to be attended to (encoder outputs)
  - *output_embed_dim*: dimensionality of the output, should be that of the vector after attention, expected by the next layer

- Inputs: 
    - *inputs*: is the query, the vector to attend to others
    - *encoder_outputs*:  is the key/value, the vector to be attended to
    - *encoder_padding_mask*: this tells the decoder which position to ignore
- Outputs: 
    - *output*: the context vector after attention
    - *attention score*: the attention distribution
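
The core of the computation, score, mask, softmax, weighted sum, can be sketched for a single query in plain Python. This is a simplified illustration with a hypothetical `attend` function: it uses a dot product as the score, omits the learned input and output projections, and assumes at least one position is not padding.

```python
import math


def attend(query, keys, values, pad_mask):
    """Dot-product attention for one query over a padded source sequence."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Padding positions get -inf so that softmax assigns them zero weight.
    scores = [float("-inf") if masked else s for s, masked in zip(scores, pad_mask)]
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # last position is padding
context, weights = attend([1.0, 0.0], encoder_outputs, encoder_outputs,
                          pad_mask=[False, False, True])
print(weights)
print(context)
```

The encoder outputs serve as both keys and values here, matching the description above; the masked third position contributes nothing even though its raw score would be the largest.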


In [29]:
class AttentionLayer(nn.Module):
    def __init__(self, input_embed_dim, source_embed_dim, output_embed_dim, bias=False):
        super().__init__()

        self.input_proj = nn.Linear(input_embed_dim, source_embed_dim, bias=bias)
        self.output_proj = nn.Linear(
            input_embed_dim + source_embed_dim, output_embed_dim, bias=bias
        )

    def forward(self, inputs, encoder_outputs, encoder_padding_mask):
        # inputs: T, B, dim
        # encoder_outputs: S x B x dim
        # padding mask:  S x B
        
        # convert all to batch first
        inputs = inputs.transpose(1,0) # B, T, dim
        encoder_outputs = encoder_outputs.transpose(1,0) # B, S, dim
        encoder_padding_mask = encoder_padding_mask.transpose(1,0) # B, S
        
        # project to the dimensionality of encoder_outputs
        x = self.input_proj(inputs)

        # compute attention
        # (B, T, dim) x (B, dim, S) = (B, T, S)
        attn_scores = torch.bmm(x, encoder_outputs.transpose(1,2))

        # cancel the attention at positions corresponding to padding
        if encoder_padding_mask is not None:
            # leveraging broadcast  B, S -> (B, 1, S)
            encoder_padding_mask = encoder_padding_mask.unsqueeze(1)
            attn_scores = (
                attn_scores.float()
                .masked_fill_(encoder_padding_mask, float("-inf"))
                .type_as(attn_scores)
            )  # FP16 support: cast to float and back

        # softmax on the dimension corresponding to source sequence
        attn_scores = F.softmax(attn_scores, dim=-1)

        # shape (B, T, S) x (B, S, dim) = (B, T, dim) weighted sum
        x = torch.bmm(attn_scores, encoder_outputs)

        # (B, T, dim)
        x = torch.cat((x, inputs), dim=-1)
        x = torch.tanh(self.output_proj(x)) # concat + linear + tanh
        
        # restore shape (B, T, dim) -> (T, B, dim)
        return x.transpose(1,0), attn_scores

# Decoder

* The hidden states of **Decoder** will be initialized by the final hidden states of **Encoder** (the context vector)
* At the same time, **Decoder** will change its hidden states based on the input of the current timestep (the outputs of previous timesteps), and generate an output
* Attention improves the performance
* The seq2seq steps are implemented in decoder, so that later the Seq2Seq class can accept RNN and Transformer, without further modification.
- Parameters:
  - *args*
      - decoder_embed_dim: is the dimensionality of the decoder embeddings, similar to encoder_embed_dim
      - decoder_ffn_embed_dim: is the dimensionality of the decoder RNN hidden states, similar to encoder_ffn_embed_dim
      - decoder_layers: number of layers of RNN decoder
      - share_decoder_input_output_embed: usually, the projection matrix of the decoder will share weights with the decoder input embeddings
  - *dictionary*: the dictionary provided by fairseq
  - *embed_tokens*: an instance of token embeddings (nn.Embedding)
- Inputs: 
    - *prev_output_tokens*: integer sequence representing the right-shifted target e.g. 1, 28, 29, 205, 2 
    - *encoder_out*: encoder's output.
    - *incremental_state*: in order to speed up decoding during test time, we will save the hidden state of each timestep. see forward() for details.
- Outputs: 
    - *outputs*: the logits (before softmax) output of decoder for each timesteps
    - *extra*: unused
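
The role of *incremental_state* is easiest to see with a toy recurrence. In the sketch below, `toy_step` is a made-up stand-in for a GRU cell, and the cache is a plain dict rather than fairseq's incremental-state API. With a cached hidden state only the newest token needs processing, and the result matches a full pass over the prefix.

```python
def toy_step(hidden, token):
    """Stand-in for one recurrent update (not a real GRU)."""
    return (31 * hidden + token) % 1009


def decode_full(tokens, init_hidden):
    """Re-run the recurrence over the whole prefix."""
    hidden = init_hidden
    for tok in tokens:
        hidden = toy_step(hidden, tok)
    return hidden


def decode_incremental(tokens, init_hidden, incremental_state):
    """Reuse the cached hidden state and consume only the last token."""
    if incremental_state:
        hidden = incremental_state["prev_hiddens"]
        tokens = tokens[-1:]
    else:
        hidden = init_hidden  # first step: start from the encoder's state
    for tok in tokens:
        hidden = toy_step(hidden, tok)
    incremental_state["prev_hiddens"] = hidden
    return hidden


prefix = [2, 1607, 4, 2475]
state = {}
results = []
for t in range(1, len(prefix) + 1):
    results.append((decode_incremental(prefix[:t], 7, state),
                    decode_full(prefix[:t], 7)))
print(results)
```

Each generation step then costs one recurrent update instead of one per prefix token, which is the speed-up the decoder aims for at test time.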

In [30]:
class RNNDecoder(FairseqIncrementalDecoder):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens
        
        assert args.decoder_layers == args.encoder_layers, f"""seq2seq rnn requires that encoder 
        and decoder have same layers of rnn. got: {args.encoder_layers, args.decoder_layers}"""
        assert args.decoder_ffn_embed_dim == args.encoder_ffn_embed_dim*2, f"""seq2seq-rnn requires 
        that decoder hidden to be 2*encoder hidden dim. got: {args.decoder_ffn_embed_dim, args.encoder_ffn_embed_dim*2}"""
        
        self.embed_dim = args.decoder_embed_dim
        self.hidden_dim = args.decoder_ffn_embed_dim
        self.num_layers = args.decoder_layers
        
        
        self.dropout_in_module = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            self.embed_dim, 
            self.hidden_dim, 
            self.num_layers, 
            dropout=args.dropout, 
            batch_first=False, 
            bidirectional=False
        )
        self.attention = AttentionLayer(
            self.embed_dim, self.hidden_dim, self.embed_dim, bias=False
        ) 
        # self.attention = None
        self.dropout_out_module = nn.Dropout(args.dropout)
        
        if self.hidden_dim != self.embed_dim:
            self.project_out_dim = nn.Linear(self.hidden_dim, self.embed_dim)
        else:
            self.project_out_dim = None
        
        if args.share_decoder_input_output_embed:
            self.output_projection = nn.Linear(
                self.embed_tokens.weight.shape[1],
                self.embed_tokens.weight.shape[0],
                bias=False,
            )
            self.output_projection.weight = self.embed_tokens.weight
        else:
            # after project_out_dim the decoder features have size embed_dim
            self.output_projection = nn.Linear(
                self.embed_dim, len(dictionary), bias=False
            )
            nn.init.normal_(
                self.output_projection.weight, mean=0, std=self.embed_dim ** -0.5
            )
        
    def forward(self, prev_output_tokens, encoder_out, incremental_state=None, **unused):
        # extract the outputs from encoder
        encoder_outputs, encoder_hiddens, encoder_padding_mask = encoder_out
        # outputs:          seq_len x batch x num_directions*hidden
        # encoder_hiddens:  num_layers x batch x num_directions*encoder_hidden
        # padding_mask:     seq_len x batch
        
        if incremental_state is not None and len(incremental_state) > 0:
            # if the information from last timestep is retained, we can continue from there instead of starting from bos
            prev_output_tokens = prev_output_tokens[:, -1:]
            cache_state = self.get_incremental_state(incremental_state, "cached_state")
            prev_hiddens = cache_state["prev_hiddens"]
        else:
            # incremental state does not exist, either this is training time, or the first timestep of test time
            # prepare for seq2seq: pass the encoder_hidden to the decoder hidden states
            prev_hiddens = encoder_hiddens
        
        bsz, seqlen = prev_output_tokens.size()
        
        # embed tokens
        x = self.embed_tokens(prev_output_tokens)
        x = self.dropout_in_module(x)

        # B x T x C -> T x B x C
        x = x.transpose(0, 1)
                
        # decoder-to-encoder attention
        if self.attention is not None:
            x, attn = self.attention(x, encoder_outputs, encoder_padding_mask)
                        
        # pass thru unidirectional RNN
        x, final_hiddens = self.rnn(x, prev_hiddens)
        # outputs = [sequence len, batch size, hid dim]
        # hidden =  [num_layers * directions, batch size  , hid dim]
        x = self.dropout_out_module(x)
                
        # project to embedding size (needed when the hidden size differs from the embedding size
        # and share_embedding is True)
        if self.project_out_dim is not None:
            x = self.project_out_dim(x)
        
        # project to vocab size
        x = self.output_projection(x)
        
        # T x B x C -> B x T x C
        x = x.transpose(1, 0)
        
        # if incremental, record the hidden states of current timestep, which will be restored in the next timestep
        cache_state = {
            "prev_hiddens": final_hiddens,
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)
        
        return x, None
    
    def reorder_incremental_state(
        self,
        incremental_state,
        new_order,
    ):
        # This is used by fairseq's beam search. How and why is not particularly important here.
        cache_state = self.get_incremental_state(incremental_state, "cached_state")
        prev_hiddens = cache_state["prev_hiddens"]
        prev_hiddens = [p.index_select(0, new_order) for p in prev_hiddens]
        cache_state = {
            "prev_hiddens": torch.stack(prev_hiddens),
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)
        return

## Seq2Seq
- Composed of **Encoder** and **Decoder**
- Receives inputs and passes them to the **Encoder** 
- Passes the outputs from the **Encoder** to the **Decoder**
- The **Decoder** decodes according to the outputs of previous timesteps as well as the **Encoder** outputs  
- Once decoding is done, returns the **Decoder** outputs
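
During training the "outputs of previous timesteps" are not the model's own predictions but the reference shifted right by one position (teacher forcing). A minimal sketch, assuming fairseq's convention of moving the end-of-sentence token to the front of the decoder input; `shift_right` and the token ids are hypothetical and only for illustration:

```python
EOS = 2  # assumed id for </s>; the real value comes from the task dictionary

def shift_right(target, eos=EOS):
    """Build the decoder input from a target ending in eos by moving eos to the front."""
    assert target[-1] == eos, "target is expected to end with eos"
    return [eos] + target[:-1]

target = [11, 12, 13, EOS]            # reference tokens followed by eos
prev_output_tokens = shift_right(target)

# at step t the decoder reads prev_output_tokens[t] and is trained to predict target[t]
pairs = list(zip(prev_output_tokens, target))
print(pairs)
```

At test time no reference exists, so the decoder consumes its own previous outputs one token at a time instead.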

In [31]:
class Seq2Seq(FairseqEncoderDecoderModel):
    def __init__(self, args, encoder, decoder):
        super().__init__(encoder, decoder)
        self.args = args
    
    def forward(
        self,
        src_tokens,
        src_lengths,
        prev_output_tokens,
        return_all_hiddens: bool = True,
    ):
        """
        Run the forward pass for an encoder-decoder model.
        """
        encoder_out = self.encoder(
            src_tokens, src_lengths=src_lengths, return_all_hiddens=return_all_hiddens
        )
        logits, extra = self.decoder(
            prev_output_tokens,
            encoder_out=encoder_out,
            src_lengths=src_lengths,
            return_all_hiddens=return_all_hiddens,
        )
        return logits, extra

# Model Initialization

In [32]:
# # HINT: transformer architecture
from fairseq.models.transformer import (
    TransformerEncoder, 
    TransformerDecoder,
)

def build_model(args, task):
    """ build a model instance based on hyperparameters """
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary

    # token embeddings
    encoder_embed_tokens = nn.Embedding(len(src_dict), args.encoder_embed_dim, src_dict.pad())
    decoder_embed_tokens = nn.Embedding(len(tgt_dict), args.decoder_embed_dim, tgt_dict.pad())
    
    # encoder decoder
    # HINT: TODO: switch to TransformerEncoder & TransformerDecoder
    # encoder = RNNEncoder(args, src_dict, encoder_embed_tokens)
    # decoder = RNNDecoder(args, tgt_dict, decoder_embed_tokens)
    encoder = TransformerEncoder(args, src_dict, encoder_embed_tokens)
    decoder = TransformerDecoder(args, tgt_dict, decoder_embed_tokens)

    # sequence to sequence model
    model = Seq2Seq(args, encoder, decoder)
    
    # initialization for seq2seq model is important, requires extra handling
    def init_params(module):
        from fairseq.modules import MultiheadAttention
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        if isinstance(module, MultiheadAttention):
            module.q_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.k_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.v_proj.weight.data.normal_(mean=0.0, std=0.02)
        if isinstance(module, nn.RNNBase):
            for name, param in module.named_parameters():
                if "weight" in name or "bias" in name:
                    param.data.uniform_(-0.1, 0.1)
            
    # weight initialization
    model.apply(init_params)
    return model

## Architecture Related Configuration

For the strong baseline, refer to the *transformer-base* hyperparameters in Table 3 of [Attention is all you need](#vaswani2017)
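
As a point of reference, the *base* row of that table (N=6, d_model=512, d_ff=2048, h=8, P_drop=0.1) could be written in the same `Namespace` style as the cell below. This is a sketch only: `transformer_base_args` is a hypothetical name, and whether these sizes fit the available GPU memory and training budget is not verified here.

```python
from argparse import Namespace

# Values from the "base" row of Table 3 in "Attention Is All You Need":
# N=6 layers, d_model=512, d_ff=2048, h=8 heads, P_drop=0.1.
transformer_base_args = Namespace(
    encoder_embed_dim=512,
    encoder_ffn_embed_dim=2048,
    encoder_layers=6,
    encoder_attention_heads=8,
    decoder_embed_dim=512,
    decoder_ffn_embed_dim=2048,
    decoder_layers=6,
    decoder_attention_heads=8,
    share_decoder_input_output_embed=True,
    dropout=0.1,
)

# each attention head works on d_model / h dimensions
head_dim = (
    transformer_base_args.encoder_embed_dim
    // transformer_base_args.encoder_attention_heads
)
```

In this notebook the number of attention heads is set inside `add_transformer_args`, so it would have to be changed there as well.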

In [33]:
arch_args = Namespace(
    encoder_embed_dim=256,
    encoder_ffn_embed_dim=1024,
    encoder_layers=4,
    decoder_embed_dim=256,
    decoder_ffn_embed_dim=1024,
    decoder_layers=4,
    share_decoder_input_output_embed=True,
    dropout=0.15,
)

# HINT: these parameters are patched in for the Transformer
def add_transformer_args(args):
    args.encoder_attention_heads=4
    args.encoder_normalize_before=True
    
    args.decoder_attention_heads=4
    args.decoder_normalize_before=True
    
    args.activation_fn="relu"
    args.max_source_positions=1024
    args.max_target_positions=1024
    
    # fill in defaults for the Transformer parameters not set above
    from fairseq.models.transformer import base_architecture
    base_architecture(args)

add_transformer_args(arch_args)

In [34]:
if config.use_wandb:
    wandb.config.update(vars(arch_args))

In [35]:
model = build_model(arch_args, task)
logger.info(model)

2023-04-14 12:37:19 | INFO | hw5.seq2seq | Seq2Seq(
  (encoder): TransformerEncoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(7992, 256, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0-3): 4 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=256, out_features=256, bias=True)
          (v_proj): Linear(in_features=256, out_features=256, bias=True)
          (q_proj): Linear(in_features=256, out_features=256, bias=True)
          (out_proj): Linear(in_features=256, out_features=256, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=256, out_features=1024, bias=True)
        (fc2): Linear(in_features=1024, out_features=25

# Optimization

## Loss: Label Smoothing Regularization
* lets the model learn a less concentrated output distribution, which prevents over-confidence
* the ground truth may not be the only valid answer, so when calculating the loss we reserve some probability mass for the other labels
* helps reduce overfitting

code [source](https://fairseq.readthedocs.io/en/latest/_modules/fairseq/criterions/label_smoothed_cross_entropy.html)

In [36]:
class LabelSmoothedCrossEntropyCriterion(nn.Module):
    def __init__(self, smoothing, ignore_index=None, reduce=True):
        super().__init__()
        self.smoothing = smoothing
        self.ignore_index = ignore_index
        self.reduce = reduce
    
    def forward(self, lprobs, target):
        if target.dim() == lprobs.dim() - 1:
            target = target.unsqueeze(-1)
        # nll: negative log-likelihood, the cross-entropy when the target is one-hot. The following line is the same as F.nll_loss
        nll_loss = -lprobs.gather(dim=-1, index=target)
        # reserve some probability for the other labels; in the cross-entropy this is
        # equivalent to summing the log probs of all labels
        smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
        if self.ignore_index is not None:
            pad_mask = target.eq(self.ignore_index)
            nll_loss.masked_fill_(pad_mask, 0.0)
            smooth_loss.masked_fill_(pad_mask, 0.0)
        else:
            nll_loss = nll_loss.squeeze(-1)
            smooth_loss = smooth_loss.squeeze(-1)
        if self.reduce:
            nll_loss = nll_loss.sum()
            smooth_loss = smooth_loss.sum()
        # when calculating cross-entropy, add the loss of other labels
        eps_i = self.smoothing / lprobs.size(-1)
        loss = (1.0 - self.smoothing) * nll_loss + eps_i * smooth_loss
        return loss

# generally, 0.1 is good enough
criterion = LabelSmoothedCrossEntropyCriterion(
    smoothing=0.1,
    ignore_index=task.target_dictionary.pad(),
)
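
The criterion above can be checked by hand on a single token. The sketch below is plain Python with a made-up three-label distribution (`label_smoothed_nll` is a hypothetical helper, not part of fairseq). It shows that the loss equals the cross-entropy against a target that puts `1 - smoothing` on the true label and spreads `smoothing` uniformly over all labels.

```python
import math

def label_smoothed_nll(log_probs, target, smoothing):
    """Label-smoothed loss for one token, mirroring the criterion above."""
    nll = -log_probs[target]        # loss against the one-hot target
    smooth = -sum(log_probs)        # summed loss over every label
    eps_i = smoothing / len(log_probs)
    return (1.0 - smoothing) * nll + eps_i * smooth

probs = [0.7, 0.2, 0.1]             # toy model distribution over 3 labels
log_probs = [math.log(p) for p in probs]

plain = label_smoothed_nll(log_probs, target=0, smoothing=0.0)
smoothed = label_smoothed_nll(log_probs, target=0, smoothing=0.1)

# the same value, written as cross-entropy against the smoothed target distribution
q = [0.9 + 0.1 / 3, 0.1 / 3, 0.1 / 3]
cross_entropy = -sum(qi * lp for qi, lp in zip(q, log_probs))
assert abs(smoothed - cross_entropy) < 1e-12
```

With `smoothing=0.0` the function reduces to the ordinary negative log-likelihood of the target label.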

## Optimizer: Adam + lr scheduling
Inverse square root scheduling is important for stable Transformer training, and was later applied to RNNs as well.
The learning rate is updated according to the following equation: it increases linearly during the warmup stage, then decays proportionally to the inverse square root of the step number.
$$lrate = d_{\text{model}}^{-0.5}\cdot\min({step\_num}^{-0.5},{step\_num}\cdot{warmup\_steps}^{-1.5})$$

In [37]:
def get_rate(d_model, step_num, warmup_step):
    # TODO: Change lr from constant to the equation shown above
    # lr = 0.001
    lr = (d_model**(-0.5) * min(step_num**(-0.5), step_num*(warmup_step**(-1.5))))
    return lr
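
To get a feel for the schedule, the formula can be evaluated at a few steps. A small sketch in plain Python; `noam_rate` is a hypothetical stand-alone copy of the formula, and the warmup of 4000 steps is an assumption (the value used in this run comes from `config.lr_warmup`):

```python
def noam_rate(d_model, step_num, warmup_steps):
    """Learning rate from the formula above, before any extra scaling factor."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

d_model, warmup = 256, 4000          # warmup value is an assumption
for step in (1, 1000, 4000, 16000, 64000):
    print(f"step {step:>6}: lr = {noam_rate(d_model, step, warmup):.6f}")

# the two branches of min() meet at step_num == warmup_steps, where the rate peaks
peak = noam_rate(d_model, warmup, warmup)
```

Before the peak the rate grows linearly with the step number; after it, quadrupling the step number halves the rate.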

In [38]:
class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0
    
    @property
    def param_groups(self):
        return self.optimizer.param_groups
        
    def multiply_grads(self, c):
        """Multiplies grads by a constant *c*."""                
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.data.mul_(c)
        
    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()
        
    def rate(self, step = None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return 0 if not step else self.factor * get_rate(self.model_size, step, self.warmup)

## Scheduling Visualized

In [39]:
optimizer = NoamOpt(
    model_size=arch_args.encoder_embed_dim, 
    factor=config.lr_factor, 
    warmup=config.lr_warmup, 
    optimizer=torch.optim.AdamW(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9, weight_decay=0.0001))
# plt.plot(np.arange(1, 100000), [optimizer.rate(i) for i in range(1, 100000)])
# plt.legend([f"{optimizer.model_size}:{optimizer.warmup}"])
None

# Training Procedure

## Training

In [40]:
from fairseq.data import iterators
from torch.cuda.amp import GradScaler, autocast
gnorms = []
def train_one_epoch(epoch_itr, model, task, criterion, optimizer, accum_steps=1):
    itr = epoch_itr.next_epoch_itr(shuffle=True)
    itr = iterators.GroupedIterator(itr, accum_steps) # gradient accumulation: update every accum_steps samples
    
    stats = {"loss": []}
    scaler = GradScaler() # automatic mixed precision (amp) 
    
    model.train()
    progress = tqdm.tqdm(itr, desc=f"train epoch {epoch_itr.epoch}", leave=False)
    for samples in progress:
        model.zero_grad()
        accum_loss = 0
        sample_size = 0
        # gradient accumulation: update every accum_steps samples
        for i, sample in enumerate(samples):
            if i == 1:
                # emptying the CUDA cache after the first step can reduce the chance of OOM
                torch.cuda.empty_cache()

            sample = utils.move_to_cuda(sample, device=device)
            target = sample["target"]
            sample_size_i = sample["ntokens"]
            sample_size += sample_size_i
            
            # mixed precision training: autocast wraps only the forward pass and the loss
            with autocast():
                net_output = model.forward(**sample["net_input"])
                lprobs = F.log_softmax(net_output[0], -1)            
                loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1))
                
            # logging
            accum_loss += loss.item()
            # back-prop (outside autocast, as recommended by PyTorch)
            scaler.scale(loss).backward()                
        
        scaler.unscale_(optimizer)
        optimizer.multiply_grads(1 / (sample_size or 1.0)) # (sample_size or 1.0) guards against division by zero when sample_size is 0
        gnorm = nn.utils.clip_grad_norm_(model.parameters(), config.clip_norm) # grad norm clipping prevents exploding gradients
        with open('gnorm.txt', 'a') as gtf:
            gtf.write(f"{gnorm.item()}\n")
        gnorms.append(gnorm.cpu().item())
        
        scaler.step(optimizer)
        scaler.update()
        
        # logging
        loss_print = accum_loss/sample_size
        stats["loss"].append(loss_print)
        progress.set_postfix(loss=loss_print)
        if config.use_wandb:
            wandb.log({
                "train/loss": loss_print,
                "train/grad_norm": gnorm.item(),
                "train/lr": optimizer.rate(),
                "train/sample_size": sample_size,
            })
        
    loss_print = np.mean(stats["loss"])
    logger.info(f"training loss: {loss_print:.4f}")
    return stats
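
The loop above sums the loss over tokens, accumulates gradients across `accum_steps` batches, and rescales once by the total token count. The sketch below illustrates why that rescaling is used, with a one-parameter least-squares model in plain Python (`summed_grad` and the data are made up for illustration): the result matches the gradient of the mean loss over the combined batch, whereas averaging per-batch means would weight tokens in small batches more heavily.

```python
def summed_grad(w, batch):
    """Gradient of the summed squared error sum((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)

w = 0.5
micro_batches = [
    [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)],   # 3 "tokens"
    [(4.0, 7.0)],                            # 1 "token"
]

# accumulate un-normalised gradients, counting tokens along the way
accum_grad, sample_size = 0.0, 0
for batch in micro_batches:
    accum_grad += summed_grad(w, batch)
    sample_size += len(batch)

# one rescale at the end, like optimizer.multiply_grads(1 / sample_size)
accum_grad /= sample_size or 1.0

# reference: gradient of the mean loss over the concatenated batch
full_batch = [pair for batch in micro_batches for pair in batch]
reference = summed_grad(w, full_batch) / len(full_batch)
assert abs(accum_grad - reference) < 1e-12

# for contrast: averaging the per-batch mean gradients gives a different value
mean_of_means = sum(summed_grad(w, b) / len(b) for b in micro_batches) / len(micro_batches)
```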

## Validation & Inference
To detect overfitting, the model is validated every epoch on unseen data.
- the procedure is essentially the same as training, with the addition of an inference step
- after validation we can save the model weights

Validation loss alone cannot describe the actual performance of the model
- Directly produce translation hypotheses with the current model, then calculate BLEU against the reference translations
- We can also manually examine the quality of the hypotheses
- We use fairseq's sequence generator, which performs beam search, to generate the translation hypotheses
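
Beam search keeps the `beam_size` best partial hypotheses at every step instead of committing to the single best token. The following is a toy sketch in plain Python, not fairseq's implementation: the bigram table `LOGP` is invented, and length normalisation and batching are left out. It shows a case where a wider beam finds a higher-scoring translation than greedy decoding.

```python
import math

# Toy next-token log-probabilities keyed by the previous token (a bigram "model").
# Purely illustrative; a real decoder conditions on the source and the full prefix.
LOGP = {
    "<s>": {"a": math.log(0.5), "b": math.log(0.4), "</s>": math.log(0.1)},
    "a":   {"a": math.log(0.1), "b": math.log(0.3), "</s>": math.log(0.6)},
    "b":   {"a": math.log(0.1), "b": math.log(0.1), "</s>": math.log(0.8)},
}

def beam_search(beam_size, max_len=5):
    """Return finished hypotheses as (score, tokens), best first."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        # expand every live hypothesis by every possible next token
        candidates = []
        for score, toks in beams:
            for tok, lp in LOGP[toks[-1]].items():
                candidates.append((score + lp, toks + [tok]))
        # keep the top beam_size candidates; finished ones leave the beam
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, toks in candidates[:beam_size]:
            if toks[-1] == "</s>":
                finished.append((score, toks))
            else:
                beams.append((score, toks))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[0], reverse=True)

greedy = beam_search(beam_size=1)[0]   # follows the locally best token "a"
wide = beam_search(beam_size=3)[0]     # also explores "b", which ends better
```

Here greedy decoding commits to `a` (probability 0.5) and finishes with total probability 0.5 × 0.6, while the wider beam keeps `b` alive and finishes with 0.4 × 0.8.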

In [41]:
# fairseq's beam search generator
# given a model and an input sequence, produce translation hypotheses by beam search
sequence_generator = task.build_generator([model], config)

def decode(toks, dictionary):
    # convert from Tensor to human readable sentence
    s = dictionary.string(
        toks.int().cpu(),
        config.post_process,
    )
    return s if s else "<unk>"

def inference_step(sample, model):
    gen_out = sequence_generator.generate([model], sample)
    srcs = []
    hyps = []
    refs = []
    for i in range(len(gen_out)):
        # for each sample, collect the input, hypothesis and reference, later used to calculate BLEU
        srcs.append(decode(
            utils.strip_pad(sample["net_input"]["src_tokens"][i], task.source_dictionary.pad()), 
            task.source_dictionary,
        ))
        hyps.append(decode(
            gen_out[i][0]["tokens"], # 0 indicates using the top hypothesis in beam
            task.target_dictionary,
        ))
        refs.append(decode(
            utils.strip_pad(sample["target"][i], task.target_dictionary.pad()), 
            task.target_dictionary,
        ))
    return srcs, hyps, refs

In [42]:
import shutil
import sacrebleu

def validate(model, task, criterion, log_to_wandb=True):
    logger.info('begin validation')
    itr = load_data_iterator(task, "valid", 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False)
    
    stats = {"loss":[], "bleu": 0, "srcs":[], "hyps":[], "refs":[]}
    srcs = []
    hyps = []
    refs = []
    
    model.eval()
    progress = tqdm.tqdm(itr, desc="validation", leave=False)
    with torch.no_grad():
        for i, sample in enumerate(progress):
            # validation loss
            sample = utils.move_to_cuda(sample, device=device)
            net_output = model.forward(**sample["net_input"])

            lprobs = F.log_softmax(net_output[0], -1)
            target = sample["target"]
            sample_size = sample["ntokens"]
            loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1)) / sample_size
            progress.set_postfix(valid_loss=loss.item())
            stats["loss"].append(loss)
            
            # do inference
            s, h, r = inference_step(sample, model)
            srcs.extend(s)
            hyps.extend(h)
            refs.extend(r)
            
    tok = 'zh' if task.cfg.target_lang == 'zh' else '13a'
    stats["loss"] = torch.stack(stats["loss"]).mean().item()
    stats["bleu"] = sacrebleu.corpus_bleu(hyps, [refs], tokenize=tok) # compute the BLEU score
    stats["srcs"] = srcs
    stats["hyps"] = hyps
    stats["refs"] = refs
    
    if config.use_wandb and log_to_wandb:
        wandb.log({
            "valid/loss": stats["loss"],
            "valid/bleu": stats["bleu"].score,
        }, commit=False)
    
    showid = np.random.randint(len(hyps))
    logger.info("example source: " + srcs[showid])
    logger.info("example hypothesis: " + hyps[showid])
    logger.info("example reference: " + refs[showid])
    
    # show bleu results
    logger.info(f"validation loss:\t{stats['loss']:.4f}")
    logger.info(stats["bleu"].format())
    return stats
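
The BLEU line that `validate` logs also reports `BP`, `ratio`, `hyp_len` and `ref_len`. As a small worked example, the brevity penalty can be recomputed from the two lengths with the standard BLEU definition. The sketch is plain Python; `brevity_penalty` is a hypothetical helper rather than a sacrebleu function, and the lengths are copied from the epoch-1 validation log of this run.

```python
import math

def brevity_penalty(hyp_len, ref_len):
    """BLEU brevity penalty: 1 if the hypothesis is long enough, else exp(1 - ref/hyp)."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

# lengths copied from the epoch-1 validation log of this run
hyp_len, ref_len = 92426, 110430
print(f"ratio = {hyp_len / ref_len:.3f}, BP = {brevity_penalty(hyp_len, ref_len):.3f}")
# prints: ratio = 0.837, BP = 0.823
```

A penalty below 1 means the hypotheses are shorter overall than the references, so part of the BLEU gap in the logs comes from output length rather than n-gram precision.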

# Save and Load Model Weights


In [43]:
def validate_and_save(model, task, criterion, optimizer, epoch, save=True):   
    stats = validate(model, task, criterion)
    bleu = stats['bleu']
    loss = stats['loss']
    if save:
        # save epoch checkpoints
        savedir = Path(config.savedir).absolute()
        savedir.mkdir(parents=True, exist_ok=True)
        
        check = {
            "model": model.state_dict(),
            "stats": {"bleu": bleu.score, "loss": loss},
            "optim": {"step": optimizer._step}
        }
        torch.save(check, savedir/f"checkpoint{epoch}.pt")
        shutil.copy(savedir/f"checkpoint{epoch}.pt", savedir/f"checkpoint_last.pt")
        logger.info(f"saved epoch checkpoint: {savedir}/checkpoint{epoch}.pt")
    
        # save epoch samples
        with open(savedir/f"samples{epoch}.{config.source_lang}-{config.target_lang}.txt", "w") as f:
            for s, h in zip(stats["srcs"], stats["hyps"]):
                f.write(f"{s}\t{h}\n")

        # get best valid bleu    
        if getattr(validate_and_save, "best_bleu", 0) < bleu.score:
            validate_and_save.best_bleu = bleu.score
            torch.save(check, savedir/"checkpoint_best.pt")
            
        del_file = savedir / f"checkpoint{epoch - config.keep_last_epochs}.pt"
        if del_file.exists():
            del_file.unlink()
    
    return stats

def try_load_checkpoint(model, optimizer=None, name=None):
    name = name if name else "checkpoint_last.pt"
    checkpath = Path(config.savedir)/name
    if checkpath.exists():
        check = torch.load(checkpath)
        model.load_state_dict(check["model"])
        stats = check["stats"]
        step = "unknown"
        if optimizer is not None:
            optimizer._step = step = check["optim"]["step"]
        logger.info(f"loaded checkpoint {checkpath}: step={step} loss={stats['loss']} bleu={stats['bleu']}")
    else:
        logger.info(f"no checkpoints found at {checkpath}!")
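
The clean-up rule in `validate_and_save` keeps a sliding window of epoch checkpoints: after saving epoch `e`, it deletes `checkpoint{e - keep_last_epochs}.pt` if it exists. A stand-alone sketch of just that rotation, using a temporary directory and placeholder files instead of `torch.save`; `rotate_checkpoints` is a hypothetical helper and `keep_last_epochs=5` is an assumed value:

```python
import tempfile
from pathlib import Path

def rotate_checkpoints(savedir, epoch, keep_last_epochs):
    """Write a placeholder checkpoint for `epoch`, then drop the one that left the window."""
    savedir.mkdir(parents=True, exist_ok=True)
    (savedir / f"checkpoint{epoch}.pt").write_text(f"epoch {epoch}")  # stands in for torch.save
    stale = savedir / f"checkpoint{epoch - keep_last_epochs}.pt"
    if stale.exists():
        stale.unlink()
    return sorted(p.name for p in savedir.glob("checkpoint*.pt"))

with tempfile.TemporaryDirectory() as tmp:
    for epoch in range(1, 8):
        remaining = rotate_checkpoints(Path(tmp) / "ckpt", epoch, keep_last_epochs=5)
    print(remaining)
```

After seven epochs only the five most recent epoch checkpoints remain. In the notebook, `checkpoint_last.pt` and `checkpoint_best.pt` are separate files and are never removed by this rule.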

# Main
## Training loop

In [44]:
model = model.to(device=device)
criterion = criterion.to(device=device)

In [45]:
logger.info("task: {}".format(task.__class__.__name__))
logger.info("encoder: {}".format(model.encoder.__class__.__name__))
logger.info("decoder: {}".format(model.decoder.__class__.__name__))
logger.info("criterion: {}".format(criterion.__class__.__name__))
logger.info("optimizer: {}".format(optimizer.__class__.__name__))
logger.info(
    "num. model params: {:,} (num. trained: {:,})".format(
        sum(p.numel() for p in model.parameters()),
        sum(p.numel() for p in model.parameters() if p.requires_grad),
    )
)
logger.info(f"max tokens per batch = {config.max_tokens}, accumulate steps = {config.accum_steps}")

2023-04-14 12:37:19 | INFO | hw5.seq2seq | task: TranslationTask
2023-04-14 12:37:19 | INFO | hw5.seq2seq | encoder: TransformerEncoder
2023-04-14 12:37:19 | INFO | hw5.seq2seq | decoder: TransformerDecoder
2023-04-14 12:37:19 | INFO | hw5.seq2seq | criterion: LabelSmoothedCrossEntropyCriterion
2023-04-14 12:37:19 | INFO | hw5.seq2seq | optimizer: NoamOpt
2023-04-14 12:37:19 | INFO | hw5.seq2seq | num. model params: 11,465,728 (num. trained: 11,465,728)
2023-04-14 12:37:19 | INFO | hw5.seq2seq | max tokens per batch = 8192, accumulate steps = 2


In [47]:
epoch_itr = load_data_iterator(task, "train", config.start_epoch, config.max_tokens, config.num_workers)
try_load_checkpoint(model, optimizer, name=config.resume)
while epoch_itr.next_epoch_idx <= config.max_epoch:
    # train for one epoch
    train_one_epoch(epoch_itr, model, task, criterion, optimizer, config.accum_steps)
    stats = validate_and_save(model, task, criterion, optimizer, epoch=epoch_itr.epoch)
    logger.info("end of epoch {}".format(epoch_itr.epoch))    
    epoch_itr = load_data_iterator(task, "train", epoch_itr.next_epoch_idx, config.max_tokens, config.num_workers)

2023-04-14 12:37:48 | INFO | hw5.seq2seq | no checkpoints found at checkpoints\transformer\checkpoint_last.pt!

2023-04-14 12:41:22 | INFO | hw5.seq2seq | training loss: 6.9079
2023-04-14 12:41:22 | INFO | hw5.seq2seq | begin validation

2023-04-14 12:41:53 | INFO | hw5.seq2seq | example source: being a diplomat , then and now , is an incredible job , and i loved every minute of it i enjoyed the status of it .
2023-04-14 12:41:53 | INFO | hw5.seq2seq | example hypothesis: 美國 , 是 , 是 , 大大大大大的 , 我很棒的 , 我很棒的 , 我很棒 。
2023-04-14 12:41:53 | INFO | hw5.seq2seq | example reference: 過去和今日相較 , 外交官是個超棒的工作 , 我超愛在裡面工作的分分秒秒 。 我享受著所有的事件 。
2023-04-14 12:41:53 | INFO | hw5.seq2seq | validation loss:	5.7053
2023-04-14 12:41:53 | INFO | hw5.seq2seq | BLEU = 1.26 16.6/3.2/0.7/0.2 (BP = 0.823 ratio = 0.837 hyp_len = 92426 ref_len = 110430)
2023-04-14 12:41:53 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint1.pt
2023-04-14 12:41:53 | INFO | hw5.seq2seq | end of epoch 1

2023-04-14 12:45:30 | INFO | hw5.seq2seq | training loss: 5.2460
2023-04-14 12:45:30 | INFO | hw5.seq2seq | begin validation

2023-04-14 12:45:57 | INFO | hw5.seq2seq | example source: and now we know that there are hundreds of other sorts of cells , which can be very , very specific .
2023-04-14 12:45:57 | INFO | hw5.seq2seq | example hypothesis: 現在 , 我們知道有數百個細胞 , 可以非常複雜 。
2023-04-14 12:45:57 | INFO | hw5.seq2seq | example reference: 現在我們知道有上百種不同的細胞 , 負責非常特定的功能 。
2023-04-14 12:45:57 | INFO | hw5.seq2seq | validation loss:	4.7360
2023-04-14 12:45:57 | INFO | hw5.seq2seq | BLEU = 9.72 42.3/17.5/7.8/3.6 (BP = 0.813 ratio = 0.829 hyp_len = 91513 ref_len = 110430)
2023-04-14 12:45:58 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint2.pt
2023-04-14 12:45:58 | INFO | hw5.seq2seq | end of epoch 2

2023-04-14 12:49:34 | INFO | hw5.seq2seq | training loss: 4.6213
2023-04-14 12:49:34 | INFO | hw5.seq2seq | begin validation

2023-04-14 12:50:02 | INFO | hw5.seq2seq | example source: the big brands are some of the most important powers , powerful powers , in this country .
2023-04-14 12:50:02 | INFO | hw5.seq2seq | example hypothesis: 最大的勇敢是最重要的力量 , 在這個國家 。
2023-04-14 12:50:02 | INFO | hw5.seq2seq | example reference: 有名的速食品牌名列這個國家最有勢力 , 最有影響力的名單中 , 超市也是 。
2023-04-14 12:50:02 | INFO | hw5.seq2seq | validation loss:	4.3132
2023-04-14 12:50:02 | INFO | hw5.seq2seq | BLEU = 14.06 45.5/21.3/10.8/5.8 (BP = 0.895 ratio = 0.901 hyp_len = 99443 ref_len = 110430)
2023-04-14 12:50:02 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint3.pt
2023-04-14 12:50:02 | INFO | hw5.seq2seq | end of epoch 3

2023-04-14 12:53:40 | INFO | hw5.seq2seq | training loss: 4.3203
2023-04-14 12:53:40 | INFO | hw5.seq2seq | begin validation

2023-04-14 12:54:09 | INFO | hw5.seq2seq | example source: we'll pay you three dollars for it . "
2023-04-14 12:54:09 | INFO | hw5.seq2seq | example hypothesis: 我們會付你三塊錢 。 」
2023-04-14 12:54:09 | INFO | hw5.seq2seq | example reference: 我們會用$3.00跟你買 。 」
2023-04-14 12:54:09 | INFO | hw5.seq2seq | validation loss:	4.1355
2023-04-14 12:54:09 | INFO | hw5.seq2seq | BLEU = 15.41 49.2/23.9/12.7/7.0 (BP = 0.855 ratio = 0.865 hyp_len = 95469 ref_len = 110430)
2023-04-14 12:54:09 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint4.pt
2023-04-14 12:54:09 | INFO | hw5.seq2seq | end of epoch 4

2023-04-14 12:57:54 | INFO | hw5.seq2seq | training loss: 4.1342
2023-04-14 12:57:54 | INFO | hw5.seq2seq | begin validation

2023-04-14 12:58:21 | INFO | hw5.seq2seq | example source: [protect your selfesteem] we have to catch our unhealthy psychological habits and change them .
2023-04-14 12:58:21 | INFO | hw5.seq2seq | example hypothesis: 「 私人自我利益 」 , 我們必須抓到我們的心理習慣 , 改變他們 。
2023-04-14 12:58:21 | INFO | hw5.seq2seq | example reference: 「 保護你的自尊心 」 我們需要改變不健康的心理習慣 。
2023-04-14 12:58:21 | INFO | hw5.seq2seq | validation loss:	3.9546
2023-04-14 12:58:21 | INFO | hw5.seq2seq | BLEU = 17.50 53.3/27.1/14.7/8.4 (BP = 0.853 ratio = 0.863 hyp_len = 95268 ref_len = 110430)
2023-04-14 12:58:21 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint5.pt
2023-04-14 12:58:21 | INFO | hw5.seq2seq | end of epoch 5

2023-04-14 13:01:59 | INFO | hw5.seq2seq | training loss: 3.9797
2023-04-14 13:01:59 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:02:28 | INFO | hw5.seq2seq | example source: and then you get to the finishes , the subject of all of those " go green " articles , and on the scale of a house they almost make no difference at all .
2023-04-14 13:02:28 | INFO | hw5.seq2seq | example hypothesis: 然後你到最後一家 , 所有這些 「 去綠色 」 的主題 , 和一間房子都差不多 。
2023-04-14 13:02:28 | INFO | hw5.seq2seq | example reference: 再來是表面處理 。 所有那些談 「 綠化 」 的文章 , 都以此為主題 。 以一個房子的規模來看 , 表面處理幾乎沒有什麼影響 。
2023-04-14 13:02:28 | INFO | hw5.seq2seq | validation loss:	3.8325
2023-04-14 13:02:28 | INFO | hw5.seq2seq | BLEU = 19.90 51.6/26.7/14.8/8.6 (BP = 0.974 ratio = 0.974 hyp_len = 107569 ref_len = 110430)
2023-04-14 13:02:28 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint6.pt
2023-04-14 13:02:28 | INFO | hw5.seq2seq | end of epoch 6

2023-04-14 13:06:08 | INFO | hw5.seq2seq | training loss: 3.8378
2023-04-14 13:06:08 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:06:37 | INFO | hw5.seq2seq | example source: i'd like to try something new .
2023-04-14 13:06:37 | INFO | hw5.seq2seq | example hypothesis: 我想嘗試新的東西 。
2023-04-14 13:06:37 | INFO | hw5.seq2seq | example reference: 我想嘗試一個新東西 ,
2023-04-14 13:06:37 | INFO | hw5.seq2seq | validation loss:	3.7154
2023-04-14 13:06:37 | INFO | hw5.seq2seq | BLEU = 21.05 53.0/28.1/15.8/9.3 (BP = 0.973 ratio = 0.973 hyp_len = 107475 ref_len = 110430)
2023-04-14 13:06:37 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint7.pt
2023-04-14 13:06:37 | INFO | hw5.seq2seq | end of epoch 7

2023-04-14 13:10:12 | INFO | hw5.seq2seq | training loss: 3.7396
2023-04-14 13:10:12 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:10:39 | INFO | hw5.seq2seq | example source: but unfortunately , the same day in fact , shortly after birth the calf died .
2023-04-14 13:10:39 | INFO | hw5.seq2seq | example hypothesis: 但不幸的是 , 相同的一天 , 事實上 , 在出生後 , 幾乎沒時間就死了 。
2023-04-14 13:10:39 | INFO | hw5.seq2seq | example reference: 但 , 不幸的是 , 同一天事實上 , 是才出生後不久幼鯨就死了 。
2023-04-14 13:10:39 | INFO | hw5.seq2seq | validation loss:	3.6564
2023-04-14 13:10:39 | INFO | hw5.seq2seq | BLEU = 21.64 55.7/30.1/17.3/10.4 (BP = 0.924 ratio = 0.926 hyp_len = 102301 ref_len = 110430)
2023-04-14 13:10:39 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint8.pt
2023-04-14 13:10:39 | INFO | hw5.seq2seq | end of epoch 8

2023-04-14 13:14:13 | INFO | hw5.seq2seq | training loss: 3.6672
2023-04-14 13:14:13 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:14:40 | INFO | hw5.seq2seq | example source: why are we conscious ?
2023-04-14 13:14:40 | INFO | hw5.seq2seq | example hypothesis: 為什麼我們有意識 ?
2023-04-14 13:14:40 | INFO | hw5.seq2seq | example reference: 為何我們擁有意識 ?
2023-04-14 13:14:40 | INFO | hw5.seq2seq | validation loss:	3.6092
2023-04-14 13:14:40 | INFO | hw5.seq2seq | BLEU = 22.31 57.1/31.3/18.2/11.1 (BP = 0.911 ratio = 0.915 hyp_len = 100998 ref_len = 110430)
2023-04-14 13:14:40 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint9.pt
2023-04-14 13:14:40 | INFO | hw5.seq2seq | end of epoch 9

2023-04-14 13:18:16 | INFO | hw5.seq2seq | training loss: 3.6110
2023-04-14 13:18:16 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:18:42 | INFO | hw5.seq2seq | example source: we who are diplomats , we are trained to deal with conflicts between states and issues between states .
2023-04-14 13:18:42 | INFO | hw5.seq2seq | example hypothesis: 我們是外交人 , 我們被訓練要處理國家和國家之間的衝突 。
2023-04-14 13:18:42 | INFO | hw5.seq2seq | example reference: 這就是我們:這些外交官 , 我們受過訓練以應付國家之間的衝突及問題
2023-04-14 13:18:42 | INFO | hw5.seq2seq | validation loss:	3.5725
2023-04-14 13:18:42 | INFO | hw5.seq2seq | BLEU = 21.62 58.3/32.2/18.6/11.4 (BP = 0.862 ratio = 0.870 hyp_len = 96115 ref_len = 110430)
2023-04-14 13:18:42 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint10.pt
2023-04-14 13:18:42 | INFO | hw5.seq2seq | end of epoch 10
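
Note on reading the BLEU lines: in epoch 10 the validation loss and all four n-gram precisions improved, yet BLEU fell from 22.31 to 21.62. The sketch below reproduces the logged score from its parts. It assumes the logged value is standard corpus-level BLEU with uniform weights over 1- to 4-grams. The helper names `brevity_penalty` and `bleu_from_parts` are illustrative and not part of the homework code. Because the logged precisions are rounded to one decimal, the result only approximately matches the log.

```python
import math

def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty: 1.0 unless the hypothesis is shorter than the reference."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

def bleu_from_parts(precisions_pct, hyp_len: int, ref_len: int) -> float:
    """Combine n-gram precisions (in percent) and lengths into a BLEU score in percent."""
    if min(precisions_pct) <= 0:
        return 0.0
    # Geometric mean of the precisions, computed in log space.
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)

# Values copied from the epoch 10 BLEU line above.
bp = brevity_penalty(96115, 110430)
bleu = bleu_from_parts([58.3, 32.2, 18.6, 11.4], 96115, 110430)
print(f"BP = {bp:.3f}, BLEU = {bleu:.2f}")
```

The brevity penalty of about 0.862 accounts for the drop: the hypotheses in this epoch are roughly 13% shorter than the references, and that outweighs the higher precisions.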

2023-04-14 13:22:19 | INFO | hw5.seq2seq | training loss: 3.5673
2023-04-14 13:22:19 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:22:43 | INFO | hw5.seq2seq | example source: this is actual 3d points with two to three millimeter accuracy .
2023-04-14 13:22:43 | INFO | hw5.seq2seq | example hypothesis: 這是實際的3d點 , 有2到3毫米的準確度 。
2023-04-14 13:22:43 | INFO | hw5.seq2seq | example reference: 實際上這是由高達兩到三百萬個雷射光點所呈現的精確效果 。
2023-04-14 13:22:43 | INFO | hw5.seq2seq | validation loss:	3.5367
2023-04-14 13:22:43 | INFO | hw5.seq2seq | BLEU = 22.45 59.5/33.1/19.5/11.9 (BP = 0.863 ratio = 0.872 hyp_len = 96248 ref_len = 110430)
2023-04-14 13:22:43 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint11.pt
2023-04-14 13:22:43 | INFO | hw5.seq2seq | end of epoch 11

2023-04-14 13:26:20 | INFO | hw5.seq2seq | training loss: 3.5292
2023-04-14 13:26:20 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:26:49 | INFO | hw5.seq2seq | example source: you're walking around ; your car has 12 microprocessors .
2023-04-14 13:26:49 | INFO | hw5.seq2seq | example hypothesis: 你在四處走動 , 你的車子有12個微處理器 。
2023-04-14 13:26:49 | INFO | hw5.seq2seq | example reference: 四處走走 , 你的汽車里就有12個微處理器 。
2023-04-14 13:26:49 | INFO | hw5.seq2seq | validation loss:	3.5160
2023-04-14 13:26:49 | INFO | hw5.seq2seq | BLEU = 22.94 56.9/31.4/18.3/11.2 (BP = 0.933 ratio = 0.935 hyp_len = 103273 ref_len = 110430)
2023-04-14 13:26:49 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint12.pt
2023-04-14 13:26:49 | INFO | hw5.seq2seq | end of epoch 12

2023-04-14 13:30:26 | INFO | hw5.seq2seq | training loss: 3.5029
2023-04-14 13:30:26 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:30:52 | INFO | hw5.seq2seq | example source: they want to the only way for them to survive is to get a printing press .
2023-04-14 13:30:52 | INFO | hw5.seq2seq | example hypothesis: 他們希望他們能存活下來的唯一方法就是讓印刷機 。
2023-04-14 13:30:52 | INFO | hw5.seq2seq | example reference: 他們存活下去的唯一出路 , 目的就是要買一台印刷機
2023-04-14 13:30:52 | INFO | hw5.seq2seq | validation loss:	3.4789
2023-04-14 13:30:52 | INFO | hw5.seq2seq | BLEU = 23.29 58.6/32.6/19.1/11.7 (BP = 0.911 ratio = 0.915 hyp_len = 100991 ref_len = 110430)
2023-04-14 13:30:53 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint13.pt
2023-04-14 13:30:53 | INFO | hw5.seq2seq | end of epoch 13

2023-04-14 13:34:31 | INFO | hw5.seq2seq | training loss: 3.4727
2023-04-14 13:34:31 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:34:57 | INFO | hw5.seq2seq | example source: now about half the audience has their left hand up . why is that ?
2023-04-14 13:34:57 | INFO | hw5.seq2seq | example hypothesis: 大約有一半的觀眾有左手 , 為什麼會這樣 ?
2023-04-14 13:34:57 | INFO | hw5.seq2seq | example reference: 大約有一半的人舉起來的是左手 , 那<unk>阿<unk><unk> ?
2023-04-14 13:34:57 | INFO | hw5.seq2seq | validation loss:	3.4701
2023-04-14 13:34:57 | INFO | hw5.seq2seq | BLEU = 23.98 57.7/32.2/19.0/11.8 (BP = 0.945 ratio = 0.946 hyp_len = 104505 ref_len = 110430)
2023-04-14 13:34:58 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint14.pt
2023-04-14 13:34:58 | INFO | hw5.seq2seq | end of epoch 14

2023-04-14 13:38:35 | INFO | hw5.seq2seq | training loss: 3.4527
2023-04-14 13:38:35 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:39:01 | INFO | hw5.seq2seq | example source: god , the one who rules the entire universe , wants my bread ? "
2023-04-14 13:39:01 | INFO | hw5.seq2seq | example hypothesis: 上帝 , 統治整個宇宙的人 , 想要我的麵包 ? 」
2023-04-14 13:39:01 | INFO | hw5.seq2seq | example reference: 上帝 , 一個掌管全宇宙的神 , 要我的麵包 ? 」
2023-04-14 13:39:01 | INFO | hw5.seq2seq | validation loss:	3.4573
2023-04-14 13:39:01 | INFO | hw5.seq2seq | BLEU = 23.47 58.9/33.0/19.5/12.1 (BP = 0.902 ratio = 0.907 hyp_len = 100113 ref_len = 110430)
2023-04-14 13:39:01 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint15.pt
2023-04-14 13:39:01 | INFO | hw5.seq2seq | end of epoch 15

2023-04-14 13:42:39 | INFO | hw5.seq2seq | training loss: 3.4308
2023-04-14 13:42:39 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:43:06 | INFO | hw5.seq2seq | example source: and not only in the states , but in any country , in any economy .
2023-04-14 13:43:06 | INFO | hw5.seq2seq | example hypothesis: 不僅在美國 , 在任何國家 , 在任何國家 , 任何經濟體中 。
2023-04-14 13:43:06 | INFO | hw5.seq2seq | example reference: 而且不止在美國 , 在任何國家、任何經濟體系裏都有這樣的效益
2023-04-14 13:43:06 | INFO | hw5.seq2seq | validation loss:	3.4412
2023-04-14 13:43:06 | INFO | hw5.seq2seq | BLEU = 24.40 57.3/32.0/18.9/11.7 (BP = 0.966 ratio = 0.966 hyp_len = 106701 ref_len = 110430)
2023-04-14 13:43:06 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint16.pt
2023-04-14 13:43:07 | INFO | hw5.seq2seq | end of epoch 16

2023-04-14 13:46:44 | INFO | hw5.seq2seq | training loss: 3.4149
2023-04-14 13:46:44 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:47:11 | INFO | hw5.seq2seq | example source: we learned that thousands of people wanted to tell us their prices .
2023-04-14 13:47:11 | INFO | hw5.seq2seq | example hypothesis: 我們發現成千上萬人想要告訴我們他們的價格 。
2023-04-14 13:47:11 | INFO | hw5.seq2seq | example reference: 我們發現 , 有上千人想要告訴我們他們的價格 。
2023-04-14 13:47:11 | INFO | hw5.seq2seq | validation loss:	3.4366
2023-04-14 13:47:11 | INFO | hw5.seq2seq | BLEU = 24.35 58.2/32.7/19.4/12.1 (BP = 0.942 ratio = 0.943 hyp_len = 104157 ref_len = 110430)
2023-04-14 13:47:11 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint17.pt
2023-04-14 13:47:11 | INFO | hw5.seq2seq | end of epoch 17

2023-04-14 13:50:48 | INFO | hw5.seq2seq | training loss: 3.3979
2023-04-14 13:50:48 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:51:13 | INFO | hw5.seq2seq | example source: we're in christchurch , where people have lived through a devastating natural disaster and recovered .
2023-04-14 13:51:13 | INFO | hw5.seq2seq | example hypothesis: 我們在基督教堂 , 在那裡 , 人們經歷了嚴重的自然災難 , 並復原 。
2023-04-14 13:51:13 | INFO | hw5.seq2seq | example reference: 我們在基督城 , 這裡的人度過了嚴重的天然災難且從中恢復了 。
2023-04-14 13:51:13 | INFO | hw5.seq2seq | validation loss:	3.4179
2023-04-14 13:51:13 | INFO | hw5.seq2seq | BLEU = 24.16 59.2/33.3/19.7/12.3 (BP = 0.918 ratio = 0.921 hyp_len = 101757 ref_len = 110430)
2023-04-14 13:51:13 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint18.pt
2023-04-14 13:51:13 | INFO | hw5.seq2seq | end of epoch 18

2023-04-14 13:54:53 | INFO | hw5.seq2seq | training loss: 3.3830
2023-04-14 13:54:53 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:55:20 | INFO | hw5.seq2seq | example source: four years ago , a security researcher , or , as most people would call it , a hacker , found a way to literally make atms throw money at him .
2023-04-14 13:55:20 | INFO | hw5.seq2seq | example hypothesis: 四年前 , 一位安全研究員 , 或者 , 如大多數人所說的 , 駭客 , 發現了一種方法 , 能讓ms扔錢給他 。
2023-04-14 13:55:20 | INFO | hw5.seq2seq | example reference: 四年前 , 一位安全研究員 , 或者 , 大部分人會稱之為駭客 , 找到一個讓自動提款機向他吐鈔的方法 ,
2023-04-14 13:55:20 | INFO | hw5.seq2seq | validation loss:	3.4101
2023-04-14 13:55:20 | INFO | hw5.seq2seq | BLEU = 24.57 57.6/32.3/19.1/11.8 (BP = 0.966 ratio = 0.966 hyp_len = 106710 ref_len = 110430)
2023-04-14 13:55:20 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint19.pt
2023-04-14 13:55:20 | INFO | hw5.seq2seq | end of epoch 19

2023-04-14 13:58:57 | INFO | hw5.seq2seq | training loss: 3.3700
2023-04-14 13:58:57 | INFO | hw5.seq2seq | begin validation

2023-04-14 13:59:23 | INFO | hw5.seq2seq | example source: when i got to guatemala in 1995 , i heard of a case of a massacre that happened on may 14 , 1982 , where the army came in , killed the men , and took the women and children in helicopters to an unknown location .
2023-04-14 13:59:23 | INFO | hw5.seq2seq | example hypothesis: 當我在1995年到瓜地馬拉時 , 我聽到一個大屠殺案例 , 發生在1982年5月14日 , 軍隊進來 , 殺死了男性 , 帶著直升機的孩子到未知的地點 。
2023-04-14 13:59:23 | INFO | hw5.seq2seq | example reference: 當我1995年到達瓜地馬拉時我聽說有一件在1982年5月14日發生的屠殺軍隊去到那裡 , 殺掉男人將女人和小孩子們用直升機載到不明地點
2023-04-14 13:59:23 | INFO | hw5.seq2seq | validation loss:	3.4039
2023-04-14 13:59:23 | INFO | hw5.seq2seq | BLEU = 24.28 59.2/33.4/19.8/12.4 (BP = 0.919 ratio = 0.922 hyp_len = 101854 ref_len = 110430)
2023-04-14 13:59:23 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint20.pt
2023-04-14 13:59:23 | INFO | hw5.seq2seq | end of epoch 20

2023-04-14 14:03:00 | INFO | hw5.seq2seq | training loss: 3.3573
2023-04-14 14:03:00 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:03:26 | INFO | hw5.seq2seq | example source: so , where do we look for inspiration ? we've still got bill clinton .
2023-04-14 14:03:26 | INFO | hw5.seq2seq | example hypothesis: 所以 , 我們要找什麼靈感 ? 我們仍然有比爾·克林頓 。
2023-04-14 14:03:26 | INFO | hw5.seq2seq | example reference: 那麼 , 我們還能從何處尋求啟發 ? 我們總還有比爾.柯林頓 。
2023-04-14 14:03:26 | INFO | hw5.seq2seq | validation loss:	3.3982
2023-04-14 14:03:26 | INFO | hw5.seq2seq | BLEU = 24.64 58.9/33.3/19.8/12.4 (BP = 0.936 ratio = 0.938 hyp_len = 103533 ref_len = 110430)
2023-04-14 14:03:26 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint21.pt
2023-04-14 14:03:26 | INFO | hw5.seq2seq | end of epoch 21

2023-04-14 14:07:04 | INFO | hw5.seq2seq | training loss: 3.3453
2023-04-14 14:07:04 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:07:31 | INFO | hw5.seq2seq | example source: jf: you know , i was thinking this morning , i don't even know what i would do without my women friends .
2023-04-14 14:07:31 | INFO | hw5.seq2seq | example hypothesis: jf:你知道嗎 ? 我今天早上在想 , 我甚至不知道我會怎麼做 , 沒有我的女朋友 。
2023-04-14 14:07:31 | INFO | hw5.seq2seq | example reference: jf:今早我在想我根本無法想像沒有我的女性朋友會怎樣
2023-04-14 14:07:31 | INFO | hw5.seq2seq | validation loss:	3.3829
2023-04-14 14:07:31 | INFO | hw5.seq2seq | BLEU = 24.86 58.7/33.3/19.8/12.5 (BP = 0.943 ratio = 0.944 hyp_len = 104280 ref_len = 110430)
2023-04-14 14:07:31 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint22.pt
2023-04-14 14:07:31 | INFO | hw5.seq2seq | end of epoch 22

2023-04-14 14:11:06 | INFO | hw5.seq2seq | training loss: 3.3363
2023-04-14 14:11:06 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:11:34 | INFO | hw5.seq2seq | example source: and i said , " yeah . "
2023-04-14 14:11:34 | INFO | hw5.seq2seq | example hypothesis: 我說: 「 是啊 。 」
2023-04-14 14:11:34 | INFO | hw5.seq2seq | example reference: 我說: 「 對啊 。 」
2023-04-14 14:11:34 | INFO | hw5.seq2seq | validation loss:	3.3866
2023-04-14 14:11:34 | INFO | hw5.seq2seq | BLEU = 24.91 58.1/32.8/19.5/12.2 (BP = 0.959 ratio = 0.960 hyp_len = 106033 ref_len = 110430)
2023-04-14 14:11:34 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint23.pt
2023-04-14 14:11:34 | INFO | hw5.seq2seq | end of epoch 23

2023-04-14 14:15:11 | INFO | hw5.seq2seq | training loss: 3.3274
2023-04-14 14:15:11 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:15:38 | INFO | hw5.seq2seq | example source: now i can feel a sensation of delight and beauty if i look at that eye .
2023-04-14 14:15:38 | INFO | hw5.seq2seq | example hypothesis: 如果我看著那眼睛 , 我可以感受到喜悅和美麗的感覺 。
2023-04-14 14:15:38 | INFO | hw5.seq2seq | example reference: 看著眼睛我感受到快樂和美 。
2023-04-14 14:15:38 | INFO | hw5.seq2seq | validation loss:	3.3824
2023-04-14 14:15:38 | INFO | hw5.seq2seq | BLEU = 25.22 57.6/32.5/19.4/12.2 (BP = 0.977 ratio = 0.977 hyp_len = 107936 ref_len = 110430)
2023-04-14 14:15:38 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint24.pt
2023-04-14 14:15:38 | INFO | hw5.seq2seq | end of epoch 24

2023-04-14 14:19:15 | INFO | hw5.seq2seq | training loss: 3.3199
2023-04-14 14:19:15 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:19:42 | INFO | hw5.seq2seq | example source: and he has a sweetheart , but she is american woman , not chinese .
2023-04-14 14:19:42 | INFO | hw5.seq2seq | example hypothesis: 他有甜美的 , 但她是美國女性 , 而不是中國人 。
2023-04-14 14:19:42 | INFO | hw5.seq2seq | example reference: 他有一個女朋友 , 但她不是中國人 , 而是一個美國女孩 。
2023-04-14 14:19:42 | INFO | hw5.seq2seq | validation loss:	3.3669
2023-04-14 14:19:42 | INFO | hw5.seq2seq | BLEU = 25.03 58.9/33.4/19.9/12.5 (BP = 0.945 ratio = 0.947 hyp_len = 104555 ref_len = 110430)
2023-04-14 14:19:42 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint25.pt
2023-04-14 14:19:42 | INFO | hw5.seq2seq | end of epoch 25

2023-04-14 14:23:20 | INFO | hw5.seq2seq | training loss: 3.3118
2023-04-14 14:23:20 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:23:46 | INFO | hw5.seq2seq | example source: that day , i discovered the power of fashion , and i've been in love with it ever since .
2023-04-14 14:23:46 | INFO | hw5.seq2seq | example hypothesis: 那天 , 我發現時尚的力量 , 我從此就愛上了它 。
2023-04-14 14:23:46 | INFO | hw5.seq2seq | example reference: 那天 , 我發現了時尚的力量 , 我從此就愛上了它 。
2023-04-14 14:23:46 | INFO | hw5.seq2seq | validation loss:	3.3641
2023-04-14 14:23:46 | INFO | hw5.seq2seq | BLEU = 25.00 59.2/33.7/20.3/12.8 (BP = 0.932 ratio = 0.934 hyp_len = 103139 ref_len = 110430)
2023-04-14 14:23:47 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint26.pt
2023-04-14 14:23:47 | INFO | hw5.seq2seq | end of epoch 26

2023-04-14 14:27:23 | INFO | hw5.seq2seq | training loss: 3.3020
2023-04-14 14:27:23 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:27:50 | INFO | hw5.seq2seq | example source: now , the island of mauritius is a small island off the east coast of madagascar in the indian ocean , and it is the place where the dodo bird was discovered and extinguished , all within about 150 years .
2023-04-14 14:27:50 | INFO | hw5.seq2seq | example hypothesis: 模里西斯島是一座小島 , 位於印度海岸的馬達加斯加東岸 , 在那裡 , dodododododo鳥被發現並且出名 , 大約在150年內 。
2023-04-14 14:27:50 | INFO | hw5.seq2seq | example reference: 馬里提斯島是一個小島位於馬達加斯加島東部海域處在印度洋中 , 就是在這裡渡渡鳥被發現也滅絕了 , 僅在短短一百五十年間
2023-04-14 14:27:50 | INFO | hw5.seq2seq | validation loss:	3.3628
2023-04-14 14:27:50 | INFO | hw5.seq2seq | BLEU = 24.86 58.5/33.1/19.6/12.3 (BP = 0.952 ratio = 0.953 hyp_len = 105234 ref_len = 110430)
2023-04-14 14:27:50 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint27.pt
2023-04-14 14:27:50 | INFO | hw5.seq2seq | end of epoch 27

2023-04-14 14:31:26 | INFO | hw5.seq2seq | training loss: 3.2964
2023-04-14 14:31:26 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:31:53 | INFO | hw5.seq2seq | example source: and the people that build things developers and governments they're naturally afraid of innovation , and they'd rather just use those forms that they know you'll respond to .
2023-04-14 14:31:53 | INFO | hw5.seq2seq | example hypothesis: 那些建造東西的人 , 他們自然害怕創新 , 他們寧願用他們知道的形式來回應 。
2023-04-14 14:31:53 | INFO | hw5.seq2seq | example reference: 蓋東西的人-開發者和政府-他們在本質上害怕創新他們寧願用這些他們知道你會如何回應的形式
2023-04-14 14:31:53 | INFO | hw5.seq2seq | validation loss:	3.3652
2023-04-14 14:31:53 | INFO | hw5.seq2seq | BLEU = 24.70 59.6/34.0/20.3/12.8 (BP = 0.917 ratio = 0.920 hyp_len = 101598 ref_len = 110430)
2023-04-14 14:31:53 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint28.pt
2023-04-14 14:31:53 | INFO | hw5.seq2seq | end of epoch 28

2023-04-14 14:35:31 | INFO | hw5.seq2seq | training loss: 3.2906
2023-04-14 14:35:31 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:35:57 | INFO | hw5.seq2seq | example source: so you're probably all wondering: the cave .
2023-04-14 14:35:57 | INFO | hw5.seq2seq | example hypothesis: 你可能都想知道:洞穴 。
2023-04-14 14:35:57 | INFO | hw5.seq2seq | example reference: 所以你大概會猜想:那個洞穴中
2023-04-14 14:35:57 | INFO | hw5.seq2seq | validation loss:	3.3499
2023-04-14 14:35:57 | INFO | hw5.seq2seq | BLEU = 25.08 59.2/33.6/20.1/12.7 (BP = 0.938 ratio = 0.940 hyp_len = 103835 ref_len = 110430)
2023-04-14 14:35:57 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint29.pt
2023-04-14 14:35:57 | INFO | hw5.seq2seq | end of epoch 29

2023-04-14 14:39:33 | INFO | hw5.seq2seq | training loss: 3.2816
2023-04-14 14:39:33 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:40:00 | INFO | hw5.seq2seq | example source: and that will get you very far .
2023-04-14 14:40:00 | INFO | hw5.seq2seq | example hypothesis: 這會使你非常遙遠 。
2023-04-14 14:40:00 | INFO | hw5.seq2seq | example reference: 這將對你受益無窮 。
2023-04-14 14:40:00 | INFO | hw5.seq2seq | validation loss:	3.3500
2023-04-14 14:40:00 | INFO | hw5.seq2seq | BLEU = 25.29 58.4/33.2/19.9/12.5 (BP = 0.960 ratio = 0.961 hyp_len = 106086 ref_len = 110430)
2023-04-14 14:40:00 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint30.pt
2023-04-14 14:40:00 | INFO | hw5.seq2seq | end of epoch 30

2023-04-14 14:43:36 | INFO | hw5.seq2seq | training loss: 3.2760
2023-04-14 14:43:36 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:44:02 | INFO | hw5.seq2seq | example source: one study asked people to estimate several statistics related to the scope of climate change .
2023-04-14 14:44:02 | INFO | hw5.seq2seq | example hypothesis: 一項研究要求人們估計和氣候變遷的範圍有關的幾個統計數據 。
2023-04-14 14:44:02 | INFO | hw5.seq2seq | example reference: 有一項研究要求受測者去估計幾項和氣候變遷範圍相關的統計數字 。
2023-04-14 14:44:02 | INFO | hw5.seq2seq | validation loss:	3.3530
2023-04-14 14:44:02 | INFO | hw5.seq2seq | BLEU = 24.67 59.2/33.7/20.1/12.7 (BP = 0.924 ratio = 0.927 hyp_len = 102337 ref_len = 110430)
2023-04-14 14:44:02 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint31.pt
2023-04-14 14:44:02 | INFO | hw5.seq2seq | end of epoch 31

2023-04-14 14:47:40 | INFO | hw5.seq2seq | training loss: 3.2700
2023-04-14 14:47:40 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:48:05 | INFO | hw5.seq2seq | example source: it's not such an ominous thing .
2023-04-14 14:48:05 | INFO | hw5.seq2seq | example hypothesis: 這並不是一件不可思議的事 。
2023-04-14 14:48:05 | INFO | hw5.seq2seq | example reference: 這個任務也不是太糟糕
2023-04-14 14:48:05 | INFO | hw5.seq2seq | validation loss:	3.3607
2023-04-14 14:48:05 | INFO | hw5.seq2seq | BLEU = 24.60 60.3/34.6/20.8/13.1 (BP = 0.895 ratio = 0.901 hyp_len = 99445 ref_len = 110430)
2023-04-14 14:48:05 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint32.pt
2023-04-14 14:48:05 | INFO | hw5.seq2seq | end of epoch 32

2023-04-14 14:51:42 | INFO | hw5.seq2seq | training loss: 3.2644
2023-04-14 14:51:42 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:52:09 | INFO | hw5.seq2seq | example source: so . . .
2023-04-14 14:52:09 | INFO | hw5.seq2seq | example hypothesis: 所以......
2023-04-14 14:52:09 | INFO | hw5.seq2seq | example reference: 所以......
2023-04-14 14:52:09 | INFO | hw5.seq2seq | validation loss:	3.3428
2023-04-14 14:52:09 | INFO | hw5.seq2seq | BLEU = 25.04 59.2/33.7/20.1/12.8 (BP = 0.936 ratio = 0.938 hyp_len = 103597 ref_len = 110430)
2023-04-14 14:52:09 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint33.pt
2023-04-14 14:52:09 | INFO | hw5.seq2seq | end of epoch 33

2023-04-14 14:55:44 | INFO | hw5.seq2seq | training loss: 3.2585
2023-04-14 14:55:44 | INFO | hw5.seq2seq | begin validation

2023-04-14 14:56:11 | INFO | hw5.seq2seq | example source: and it's not just water that this works with .
2023-04-14 14:56:11 | INFO | hw5.seq2seq | example hypothesis: 這不僅適用於水 。
2023-04-14 14:56:11 | INFO | hw5.seq2seq | example reference: 然後它還不只是對水才有作用而已
2023-04-14 14:56:11 | INFO | hw5.seq2seq | validation loss:	3.3376
2023-04-14 14:56:11 | INFO | hw5.seq2seq | BLEU = 25.68 58.8/33.6/20.3/12.9 (BP = 0.958 ratio = 0.959 hyp_len = 105929 ref_len = 110430)
2023-04-14 14:56:11 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint34.pt
2023-04-14 14:56:11 | INFO | hw5.seq2seq | end of epoch 34

2023-04-14 14:59:47 | INFO | hw5.seq2seq | training loss: 3.2542
2023-04-14 14:59:47 | INFO | hw5.seq2seq | begin validation

2023-04-14 15:00:14 | INFO | hw5.seq2seq | example source: ok , i'm not the only one whistling here .
2023-04-14 15:00:14 | INFO | hw5.seq2seq | example hypothesis: 好 , 我不是唯一吹口哨的人 。
2023-04-14 15:00:14 | INFO | hw5.seq2seq | example reference: 好 , 我不是這裡唯一會吹口哨的人 。
2023-04-14 15:00:14 | INFO | hw5.seq2seq | validation loss:	3.3304
2023-04-14 15:00:14 | INFO | hw5.seq2seq | BLEU = 25.44 59.3/33.9/20.3/12.8 (BP = 0.946 ratio = 0.947 hyp_len = 104626 ref_len = 110430)
2023-04-14 15:00:14 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint35.pt
2023-04-14 15:00:14 | INFO | hw5.seq2seq | end of epoch 35

2023-04-14 15:03:51 | INFO | hw5.seq2seq | training loss: 3.2497
2023-04-14 15:03:51 | INFO | hw5.seq2seq | begin validation

2023-04-14 15:04:17 | INFO | hw5.seq2seq | example source: my father was literally born again .
2023-04-14 15:04:17 | INFO | hw5.seq2seq | example hypothesis: 我父親真的再次出生 。
2023-04-14 15:04:17 | INFO | hw5.seq2seq | example reference: 我爸爸真的重生了 。
2023-04-14 15:04:17 | INFO | hw5.seq2seq | validation loss:	3.3352
2023-04-14 15:04:17 | INFO | hw5.seq2seq | BLEU = 25.68 58.9/33.6/20.2/12.8 (BP = 0.960 ratio = 0.960 hyp_len = 106059 ref_len = 110430)
2023-04-14 15:04:17 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint36.pt
2023-04-14 15:04:18 | INFO | hw5.seq2seq | end of epoch 36

2023-04-14 15:07:53 | INFO | hw5.seq2seq | training loss: 3.2466
2023-04-14 15:07:53 | INFO | hw5.seq2seq | begin validation

2023-04-14 15:08:19 | INFO | hw5.seq2seq | example source: so what this means is that nature doesn't have to continually redesign the brain .
2023-04-14 15:08:19 | INFO | hw5.seq2seq | example hypothesis: 這意味著大自然不需要不斷重新設計大腦 。
2023-04-14 15:08:19 | INFO | hw5.seq2seq | example reference: 這意味著大自然不需持續重新設計大腦 。
2023-04-14 15:08:19 | INFO | hw5.seq2seq | validation loss:	3.3456
2023-04-14 15:08:19 | INFO | hw5.seq2seq | BLEU = 25.21 60.4/34.8/21.0/13.4 (BP = 0.909 ratio = 0.913 hyp_len = 100856 ref_len = 110430)
2023-04-14 15:08:19 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint37.pt
2023-04-14 15:08:19 | INFO | hw5.seq2seq | end of epoch 37

2023-04-14 15:11:55 | INFO | hw5.seq2seq | training loss: 3.2418
2023-04-14 15:11:55 | INFO | hw5.seq2seq | begin validation

2023-04-14 15:12:21 | INFO | hw5.seq2seq | example source: it releases cortisol that raises your heart rate , it modulates adrenaline levels and it clouds your thinking .
2023-04-14 15:12:21 | INFO | hw5.seq2seq | example hypothesis: 它會釋放皮質醇 , 提高你的心跳 , 它會調節腎上腺素的濃度 , 會讓你的思考 。
2023-04-14 15:12:21 | INFO | hw5.seq2seq | example reference: 我知道它會釋出皮質醇 , 增加你的心跳、調解腎上腺素、並讓你思緒渾沌不清 。
2023-04-14 15:12:21 | INFO | hw5.seq2seq | validation loss:	3.3209
2023-04-14 15:12:21 | INFO | hw5.seq2seq | BLEU = 25.68 59.1/33.7/20.2/12.8 (BP = 0.957 ratio = 0.958 hyp_len = 105814 ref_len = 110430)
2023-04-14 15:12:22 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint38.pt
2023-04-14 15:12:22 | INFO | hw5.seq2seq | end of epoch 38

2023-04-14 15:15:59 | INFO | hw5.seq2seq | training loss: 3.2378
2023-04-14 15:15:59 | INFO | hw5.seq2seq | begin validation

2023-04-14 15:16:25 | INFO | hw5.seq2seq | example source: it's not true . how could it be ?
2023-04-14 15:16:25 | INFO | hw5.seq2seq | example hypothesis: 這不是真的 。 怎麼可能呢 ?
2023-04-14 15:16:25 | INFO | hw5.seq2seq | example reference: 這怎麼可能呢 ? 這是完全錯誤的 。
2023-04-14 15:16:25 | INFO | hw5.seq2seq | validation loss:	3.3252
2023-04-14 15:16:25 | INFO | hw5.seq2seq | BLEU = 25.47 59.3/33.8/20.4/12.9 (BP = 0.946 ratio = 0.947 hyp_len = 104603 ref_len = 110430)
2023-04-14 15:16:25 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint39.pt
2023-04-14 15:16:25 | INFO | hw5.seq2seq | end of epoch 39

2023-04-14 15:20:01 | INFO | hw5.seq2seq | training loss: 3.2331
2023-04-14 15:20:01 | INFO | hw5.seq2seq | begin validation

2023-04-14 15:20:28 | INFO | hw5.seq2seq | example source: if you made that decision in 1965 , the down side of that is the next year we have the cultural revolution .
2023-04-14 15:20:28 | INFO | hw5.seq2seq | example hypothesis: 假如你在1965年做了那個決定 , 下一年我們有了文化革命 。
2023-04-14 15:20:28 | INFO | hw5.seq2seq | example reference: 如果你在1965年作出了這個決定 , 弊處是第二年爆發了文化大革命 。
2023-04-14 15:20:28 | INFO | hw5.seq2seq | validation loss:	3.3298
2023-04-14 15:20:28 | INFO | hw5.seq2seq | BLEU = 25.81 59.0/33.8/20.5/13.0 (BP = 0.956 ratio = 0.956 hyp_len = 105624 ref_len = 110430)
2023-04-14 15:20:28 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint40.pt
2023-04-14 15:20:28 | INFO | hw5.seq2seq | end of epoch 40

2023-04-14 15:24:04 | INFO | hw5.seq2seq | training loss: 3.2297
2023-04-14 15:24:04 | INFO | hw5.seq2seq | begin validation

2023-04-14 15:24:31 | INFO | hw5.seq2seq | example source: and an especially important challenge that i've had to face is the great shortage of mental health professionals , such as psychiatrists and psychologists , particularly in the developing world .
2023-04-14 15:24:31 | INFO | hw5.seq2seq | example hypothesis: 我所面臨的重大挑戰是心理健康專業人士的大缺陷 , 例如精神科醫生和心理學家 , 尤其是在發展中國家 。
2023-04-14 15:24:31 | INFO | hw5.seq2seq | example reference: 而我們所要面對的一個特別重要的挑戰就是心理衛生專業人員的嚴重不足例如精神病學家與心理學家特別是在開發中世界
2023-04-14 15:24:31 | INFO | hw5.seq2seq | validation loss:	3.3358
2023-04-14 15:24:31 | INFO | hw5.seq2seq | BLEU = 25.47 59.3/33.8/20.3/12.8 (BP = 0.948 ratio = 0.949 hyp_len = 104834 ref_len = 110430)
2023-04-14 15:24:31 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint41.pt
2023-04-14 15:24:31 | INFO | hw5.seq2seq | end of epoch 41


                                                                            

2023-04-14 15:28:08 | INFO | hw5.seq2seq | training loss: 3.2269
2023-04-14 15:28:08 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 15:28:35 | INFO | hw5.seq2seq | example source: if you ask men why they did a good job , they'll say , " i'm awesome .
2023-04-14 15:28:35 | INFO | hw5.seq2seq | example hypothesis: 如果你問男人為什麼他們做得很好 , 他們會說: 「 我很棒 。
2023-04-14 15:28:35 | INFO | hw5.seq2seq | example reference: 如果你問男人 , 為什麼他們的工作做得不錯 , 他們會說 , " 我棒極了 。
2023-04-14 15:28:35 | INFO | hw5.seq2seq | validation loss:	3.3160
2023-04-14 15:28:35 | INFO | hw5.seq2seq | BLEU = 25.45 59.1/33.7/20.3/12.9 (BP = 0.947 ratio = 0.949 hyp_len = 104778 ref_len = 110430)
2023-04-14 15:28:35 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint42.pt
2023-04-14 15:28:35 | INFO | hw5.seq2seq | end of epoch 42


                                                                            

2023-04-14 15:32:12 | INFO | hw5.seq2seq | training loss: 3.2229
2023-04-14 15:32:12 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 15:32:39 | INFO | hw5.seq2seq | example source: this is a brain from a lamprey eel .
2023-04-14 15:32:39 | INFO | hw5.seq2seq | example hypothesis: 這是來自一個燈光eel的大腦 。
2023-04-14 15:32:39 | INFO | hw5.seq2seq | example reference: 這是一個從鰻魚取出的大腦
2023-04-14 15:32:39 | INFO | hw5.seq2seq | validation loss:	3.3234
2023-04-14 15:32:39 | INFO | hw5.seq2seq | BLEU = 25.92 58.5/33.4/20.1/12.7 (BP = 0.974 ratio = 0.975 hyp_len = 107625 ref_len = 110430)
2023-04-14 15:32:40 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint43.pt
2023-04-14 15:32:40 | INFO | hw5.seq2seq | end of epoch 43


                                                                            

2023-04-14 15:36:15 | INFO | hw5.seq2seq | training loss: 3.2184
2023-04-14 15:36:15 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 15:36:42 | INFO | hw5.seq2seq | example source: and this was something that was really brought home to me a year ago when i found out i was pregnant and the first scan revealed that my baby had a birth defect associated with exposure to estrogenic chemicals in the womb and the second scan revealed no heartbeat .
2023-04-14 15:36:42 | INFO | hw5.seq2seq | example hypothesis: 一年前 , 我發現我懷孕了 , 第一次掃描顯示 , 我的嬰兒出生缺陷與子宮內的雌激素化學物質和第二次掃描完全沒有心跳 。
2023-04-14 15:36:42 | INFO | hw5.seq2seq | example reference: 一年前發生在我身上的事 , 讓我更加如此確信 。 當時我懷孕了 , 第一次的掃描檢查就發現 , 我的寶寶有先天性缺陷 。 這和我子宮中含有雌激素化學物質有關 。 第二次掃描時 , 胎兒已經沒有心跳 ,
2023-04-14 15:36:42 | INFO | hw5.seq2seq | validation loss:	3.3173
2023-04-14 15:36:42 | INFO | hw5.seq2seq | BLEU = 25.47 59.5/33.9/20.4/12.9 (BP = 0.943 ratio = 0.944 hyp_len = 104298 ref_len = 110430)
2023-04-14 15:36:42 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint44.pt
2023-04-14 15:36:42 | INFO | hw5.seq2seq

                                                                            

2023-04-14 15:40:19 | INFO | hw5.seq2seq | training loss: 3.2146
2023-04-14 15:40:19 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 15:40:46 | INFO | hw5.seq2seq | example source: so one meeting tends to lead to another meeting , which leads to another meeting .
2023-04-14 15:40:46 | INFO | hw5.seq2seq | example hypothesis: 所以一個會議會導致另一場會議 , 會議導致另一場會議 。
2023-04-14 15:40:46 | INFO | hw5.seq2seq | example reference: 然後沒完沒了而且開會的人多
2023-04-14 15:40:46 | INFO | hw5.seq2seq | validation loss:	3.3181
2023-04-14 15:40:46 | INFO | hw5.seq2seq | BLEU = 25.74 59.6/34.2/20.7/13.1 (BP = 0.944 ratio = 0.945 hyp_len = 104383 ref_len = 110430)
2023-04-14 15:40:46 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint45.pt
2023-04-14 15:40:46 | INFO | hw5.seq2seq | end of epoch 45


                                                                            

2023-04-14 15:44:23 | INFO | hw5.seq2seq | training loss: 3.2135
2023-04-14 15:44:23 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 15:44:50 | INFO | hw5.seq2seq | example source: how did he learn them ?
2023-04-14 15:44:50 | INFO | hw5.seq2seq | example hypothesis: 他是怎麼學的 ?
2023-04-14 15:44:50 | INFO | hw5.seq2seq | example reference: 他是怎麼學會的 ?
2023-04-14 15:44:50 | INFO | hw5.seq2seq | validation loss:	3.3205
2023-04-14 15:44:50 | INFO | hw5.seq2seq | BLEU = 25.80 59.4/34.1/20.5/13.0 (BP = 0.951 ratio = 0.952 hyp_len = 105144 ref_len = 110430)
2023-04-14 15:44:50 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint46.pt
2023-04-14 15:44:50 | INFO | hw5.seq2seq | end of epoch 46


                                                                            

2023-04-14 15:48:25 | INFO | hw5.seq2seq | training loss: 3.2087
2023-04-14 15:48:25 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 15:48:50 | INFO | hw5.seq2seq | example source: we stay up all night focusing the lights , programming the lights , trying to find new ways to sculpt and carve light .
2023-04-14 15:48:50 | INFO | hw5.seq2seq | example hypothesis: 我們一整晚都聚焦在燈光上 , 編寫燈光 , 試圖找出新的方法來雕塑和雕刻光 。
2023-04-14 15:48:50 | INFO | hw5.seq2seq | example reference: 我們熬夜討論、設定燈光 , 試圖找到新的方式去雕塑、刻劃光 。
2023-04-14 15:48:50 | INFO | hw5.seq2seq | validation loss:	3.3249
2023-04-14 15:48:50 | INFO | hw5.seq2seq | BLEU = 25.43 60.2/34.5/20.8/13.2 (BP = 0.925 ratio = 0.928 hyp_len = 102449 ref_len = 110430)
2023-04-14 15:48:50 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint47.pt
2023-04-14 15:48:50 | INFO | hw5.seq2seq | end of epoch 47


                                                                            

2023-04-14 15:52:12 | INFO | hw5.seq2seq | training loss: 3.2059
2023-04-14 15:52:12 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 15:52:37 | INFO | hw5.seq2seq | example source: and here's the thing: our kids are ready for this kind of work .
2023-04-14 15:52:37 | INFO | hw5.seq2seq | example hypothesis: 重點是:我們的孩子已經準備好了這項工作了 。
2023-04-14 15:52:37 | INFO | hw5.seq2seq | example reference: 重點是 , 我們的孩子已經準備好做這些事 。
2023-04-14 15:52:37 | INFO | hw5.seq2seq | validation loss:	3.3110
2023-04-14 15:52:37 | INFO | hw5.seq2seq | BLEU = 25.82 59.2/34.0/20.5/13.0 (BP = 0.955 ratio = 0.956 hyp_len = 105537 ref_len = 110430)
2023-04-14 15:52:37 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint48.pt
2023-04-14 15:52:37 | INFO | hw5.seq2seq | end of epoch 48


                                                                            

2023-04-14 15:55:59 | INFO | hw5.seq2seq | training loss: 3.2024
2023-04-14 15:55:59 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 15:56:25 | INFO | hw5.seq2seq | example source: it forces air through a venturi force if there's no wind .
2023-04-14 15:56:25 | INFO | hw5.seq2seq | example hypothesis: 如果沒有風 , 它會強迫空氣 。
2023-04-14 15:56:25 | INFO | hw5.seq2seq | example reference: 它把空氣押進來 , 如果沒有風的話 , 就採用機器鼓風 。
2023-04-14 15:56:25 | INFO | hw5.seq2seq | validation loss:	3.3043
2023-04-14 15:56:25 | INFO | hw5.seq2seq | BLEU = 25.76 59.0/33.8/20.4/13.0 (BP = 0.956 ratio = 0.957 hyp_len = 105669 ref_len = 110430)
2023-04-14 15:56:25 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint49.pt
2023-04-14 15:56:25 | INFO | hw5.seq2seq | end of epoch 49


                                                                            

2023-04-14 15:59:47 | INFO | hw5.seq2seq | training loss: 3.2002
2023-04-14 15:59:47 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 16:00:11 | INFO | hw5.seq2seq | example source: the queen of clubs !
2023-04-14 16:00:11 | INFO | hw5.seq2seq | example hypothesis: 俱樂部的皇后 !
2023-04-14 16:00:11 | INFO | hw5.seq2seq | example reference: 梅花后 !
2023-04-14 16:00:11 | INFO | hw5.seq2seq | validation loss:	3.3104
2023-04-14 16:00:11 | INFO | hw5.seq2seq | BLEU = 25.71 59.8/34.4/20.8/13.2 (BP = 0.937 ratio = 0.939 hyp_len = 103642 ref_len = 110430)
2023-04-14 16:00:12 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint50.pt
2023-04-14 16:00:12 | INFO | hw5.seq2seq | end of epoch 50


                                                                            

2023-04-14 16:03:33 | INFO | hw5.seq2seq | training loss: 3.1975
2023-04-14 16:03:33 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 16:03:58 | INFO | hw5.seq2seq | example source: most of them are in other buildings not designed as schools .
2023-04-14 16:03:58 | INFO | hw5.seq2seq | example hypothesis: 大部分在其他的建築物裡沒有被設計成學校 。
2023-04-14 16:03:58 | INFO | hw5.seq2seq | example reference: 其他則隱身於非學校的場地
2023-04-14 16:03:58 | INFO | hw5.seq2seq | validation loss:	3.3065
2023-04-14 16:03:58 | INFO | hw5.seq2seq | BLEU = 25.66 59.2/33.8/20.3/12.9 (BP = 0.954 ratio = 0.955 hyp_len = 105432 ref_len = 110430)
2023-04-14 16:03:58 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint51.pt
2023-04-14 16:03:58 | INFO | hw5.seq2seq | end of epoch 51


                                                                            

2023-04-14 16:07:20 | INFO | hw5.seq2seq | training loss: 3.1943
2023-04-14 16:07:20 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 16:07:45 | INFO | hw5.seq2seq | example source: in fact , it's possible that reflecting just one or two percent more sunlight from the atmosphere could offset two degrees celsius or more of warming .
2023-04-14 16:07:45 | INFO | hw5.seq2seq | example hypothesis: 事實上 , 只要反射出大氣中的1%或2%的陽光 , 就能抵禦攝氏2度或更暖化 。
2023-04-14 16:07:45 | INFO | hw5.seq2seq | example reference: 事實上 , 有可能 , 只要從大氣再多反射1%或2%的陽光 , 就能抵消掉攝氏兩度以上的暖化 。
2023-04-14 16:07:45 | INFO | hw5.seq2seq | validation loss:	3.3143
2023-04-14 16:07:45 | INFO | hw5.seq2seq | BLEU = 25.33 60.1/34.5/20.8/13.2 (BP = 0.922 ratio = 0.925 hyp_len = 102182 ref_len = 110430)
2023-04-14 16:07:45 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint52.pt
2023-04-14 16:07:45 | INFO | hw5.seq2seq | end of epoch 52


                                                                            

2023-04-14 16:11:08 | INFO | hw5.seq2seq | training loss: 3.1936
2023-04-14 16:11:08 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 16:11:35 | INFO | hw5.seq2seq | example source: thank you .
2023-04-14 16:11:35 | INFO | hw5.seq2seq | example hypothesis: 謝謝
2023-04-14 16:11:35 | INFO | hw5.seq2seq | example reference: 謝謝各位 。
2023-04-14 16:11:35 | INFO | hw5.seq2seq | validation loss:	3.3196
2023-04-14 16:11:35 | INFO | hw5.seq2seq | BLEU = 25.60 60.2/34.6/20.9/13.4 (BP = 0.926 ratio = 0.929 hyp_len = 102590 ref_len = 110430)
2023-04-14 16:11:35 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint53.pt
2023-04-14 16:11:35 | INFO | hw5.seq2seq | end of epoch 53


                                                                            

2023-04-14 16:15:05 | INFO | hw5.seq2seq | training loss: 3.1896
2023-04-14 16:15:05 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 16:15:31 | INFO | hw5.seq2seq | example source: yes , there's issues about how money should be distributed , and that's still being refigured out .
2023-04-14 16:15:31 | INFO | hw5.seq2seq | example hypothesis: 是的 , 有關於金錢如何分配的問題 , 這仍然被改造出來 。
2023-04-14 16:15:31 | INFO | hw5.seq2seq | example reference: 是的 , 我們有資金分配的問題我們正在重新估算
2023-04-14 16:15:31 | INFO | hw5.seq2seq | validation loss:	3.3132
2023-04-14 16:15:31 | INFO | hw5.seq2seq | BLEU = 25.46 60.6/34.9/21.1/13.4 (BP = 0.916 ratio = 0.919 hyp_len = 101473 ref_len = 110430)
2023-04-14 16:15:31 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint54.pt
2023-04-14 16:15:31 | INFO | hw5.seq2seq | end of epoch 54


                                                                            

2023-04-14 16:19:07 | INFO | hw5.seq2seq | training loss: 3.1878
2023-04-14 16:19:07 | INFO | hw5.seq2seq | begin validation


                                                                            

2023-04-14 16:19:33 | INFO | hw5.seq2seq | example source: therefore , in this context , of course , it makes sense to dedicate all this time to spelling .
2023-04-14 16:19:33 | INFO | hw5.seq2seq | example hypothesis: 因此 , 在這個情境中 , 當然 , 拼字是合理的 。
2023-04-14 16:19:33 | INFO | hw5.seq2seq | example reference: 因此 , 在這種情境下 , 當然 , 把所有的時間花在拼字上是合理的 。
2023-04-14 16:19:33 | INFO | hw5.seq2seq | validation loss:	3.3042
2023-04-14 16:19:33 | INFO | hw5.seq2seq | BLEU = 25.84 59.2/33.9/20.5/13.0 (BP = 0.956 ratio = 0.957 hyp_len = 105654 ref_len = 110430)
2023-04-14 16:19:33 | INFO | hw5.seq2seq | saved epoch checkpoint: c:\Users\william\Desktop\graddescope\checkpoints\transformer/checkpoint55.pt
2023-04-14 16:19:33 | INFO | hw5.seq2seq | end of epoch 55


train epoch 56:  30%|██▉       | 238/797 [01:05<02:21,  3.96it/s, loss=3.21]

# Submission

In [None]:
# averaging a few checkpoints can have a similar effect to ensemble
checkdir=config.savedir
!python C:/Users/william/fairseq/scripts/average_checkpoints.py \
--inputs {checkdir} \
--num-epoch-checkpoints 5 \
--output {checkdir}/avg_last_5_checkpoint.pt
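
As a rough illustration of what checkpoint averaging amounts to, the sketch below takes the element-wise mean of several parameter dictionaries. Plain Python lists stand in for tensors, and `average_state_dicts` is a hypothetical helper written for this note, not the code of the fairseq script.

```python
def average_state_dicts(state_dicts):
    """Element-wise mean of parameter dicts (hypothetical helper; lists stand in for tensors)."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for key in state_dicts[0].keys():
        # gather the same parameter from every checkpoint, position by position
        columns = zip(*(sd[key] for sd in state_dicts))
        averaged[key] = [sum(vals) / len(state_dicts) for vals in columns]
    return averaged


# toy "checkpoints" from three consecutive epochs (made-up numbers)
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [2.0, 4.0], "b": [3.0]},
    {"w": [3.0, 6.0], "b": [6.0]},
]
avg = average_state_dicts(ckpts)
print(avg)
```

Unlike a true ensemble, the result is a single model, so inference costs the same as for one checkpoint.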

## Confirm model weights used to generate submission

In [None]:
# checkpoint_last.pt : latest epoch
# checkpoint_best.pt : highest validation bleu
# avg_last_5_checkpoint.pt: the average of last 5 epochs
try_load_checkpoint(model, name="avg_last_5_checkpoint.pt")
validate(model, task, criterion, log_to_wandb=False)
None

## Generate Prediction

In [None]:
def generate_prediction(model, task, split="test", outfile="./prediction-2.txt"):
    task.load_dataset(split=split, epoch=1)
    itr = load_data_iterator(task, split, 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False)

    idxs = []
    hyps = []

    model.eval()
    progress = tqdm.tqdm(itr, desc="prediction")
    with torch.no_grad():
        for i, sample in enumerate(progress):
            # move the batch to the target device
            sample = utils.move_to_cuda(sample, device=device)

            # do inference
            s, h, r = inference_step(sample, model)

            hyps.extend(h)
            idxs.extend(list(sample['id']))

    # sort based on the order before preprocess
    hyps = [x for _, x in sorted(zip(idxs, hyps))]

    # write UTF-8 explicitly; the platform default encoding may not handle Chinese text
    with open(outfile, "w", encoding="utf-8") as f:
        for h in hyps:
            f.write(h + "\n")
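
The re-ordering step near the end of `generate_prediction` can be sketched in isolation. The ids and strings below are made up, and `restore_order` is a hypothetical helper. It sorts on the id alone, which is a small variation on the notebook's `sorted(zip(idxs, hyps))`.

```python
def restore_order(ids, hyps):
    """Return hypotheses sorted by their sample id (hypothetical helper)."""
    # sort on the id only, so equal ids never fall back to comparing the strings
    return [h for _, h in sorted(zip(ids, hyps), key=lambda pair: pair[0])]


# predictions arrive in batch order, not in file order
ids = [2, 0, 3, 1]
hyps = ["c", "a", "d", "b"]
ordered = restore_order(ids, hyps)
print(ordered)
```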

In [None]:
generate_prediction(model, task)

# gradescope1


In [None]:
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

In [None]:
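from torch.nn.functional import cosine_similarity as cs
pos_emb = model.decoder.embed_positions.weights.cpu().detach()
print(pos_emb.size())
# pairwise cosine similarity between every pair of positions
ret = cs(pos_emb.unsqueeze(1), pos_emb, dim=2)
plt.figure(figsize=(8, 8))
# fignum=0 draws into the figure created above; by default matshow opens a new one
plt.matshow(ret, fignum=0)
plt.show()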
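
For intuition about what such a plot tends to show, here is a small pure-Python sketch. It assumes standard sinusoidal positional embeddings with the sine terms followed by the cosine terms. The notebook does not confirm the exact layout of `embed_positions`, and `sinusoidal` and `cosine` are hypothetical helpers.

```python
import math


def sinusoidal(pos, dim):
    """Sinusoidal embedding for one position: sin terms followed by cos terms."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    return [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


dim = 64
emb = [sinusoidal(p, dim) for p in range(20)]
# sim[i][j] is the similarity between positions i and j
sim = [[cosine(u, v) for v in emb] for u in emb]
print(round(sim[5][5], 3), round(sim[5][6], 3), round(sim[5][15], 3))
```

With this layout the similarity depends only on the offset between two positions, which is why the heat map typically shows bands parallel to the diagonal.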

# gradescope2


In [None]:
import matplotlib.pyplot as plt
gnorm_list = []
with open('gnorm.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line and line[0].isdigit():
            # each numeric line holds one gradient norm
            gnorm_list.append(round(float(line), 3))
        else:
            # placeholder for lines that are not plain numbers
            gnorm_list.append(-1)
print(gnorm_list[:100])

# plt.plot(range(1, len(gnorms)+1), gnorms)
# plt.plot(range(1, len(gnorms)+1), [config.clip_norm] * len(gnorms), "-")
plt.plot(range(len(gnorm_list)), gnorm_list)
plt.title('Grad norm vs. step')
plt.xlabel('step')
plt.ylabel('Grad norm')
plt.show()
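
The commented-out line above compares the norms against `config.clip_norm`. As a reminder of what norm clipping does, here is a minimal sketch on a flat list of gradients. `clip_by_norm` is a hypothetical helper, not the routine used in the training loop.

```python
import math


def clip_by_norm(grads, max_norm):
    """Scale a flat list of gradients so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, gnorm = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(gnorm, clipped)
```

The value worth plotting is the norm before clipping. After clipping it can never exceed `max_norm`, so the curve would just be flattened at that level.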

In [None]:
# halt "Run All" here so the cells below are not executed automatically
raise

# Back-translation

## Train a backward translation model

1. Switch the source_lang and target_lang in **config** 
2. Change the savedir in **config** (e.g. "./checkpoints/transformer-back")
3. Train model
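
The first two steps can be sketched as follows, assuming `config` is an `argparse.Namespace`. Only the three fields touched here are shown, and `make_backward_config` is a hypothetical helper.

```python
import copy
from argparse import Namespace


def make_backward_config(config, savedir):
    """Return a copy of config with the translation direction reversed (hypothetical helper)."""
    back = copy.copy(config)
    back.source_lang, back.target_lang = config.target_lang, config.source_lang
    back.savedir = savedir
    return back


# toy config; the real one carries many more fields
forward = Namespace(source_lang="en", target_lang="zh", savedir="./checkpoints/transformer")
backward = make_backward_config(forward, "./checkpoints/transformer-back")
print(backward)
```

Working on a copy keeps the forward config intact, which makes it easier to switch back for the final training run.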

## Generate synthetic data with backward model 

### Download monolingual data

In [None]:
# mono_dataset_name = 'mono'

In [None]:
# mono_prefix = Path(data_dir).absolute() / mono_dataset_name
# mono_prefix.mkdir(parents=True, exist_ok=True)

# urls = (
#     "https://github.com/figisiwirf/ml2023-hw5-dataset/releases/download/v1.0.1/ted_zh_corpus.deduped.gz",
# )
# file_names = (
#     'ted_zh_corpus.deduped.gz',
# )

# for u, f in zip(urls, file_names):
#     path = mono_prefix/f
#     if not path.exists():
#         !wget {u} -O {path}
#     else:
#         print(f'{f} already exists, skipping download')
#     if path.suffix == ".tgz":
#         !tar -xvf {path} -C {prefix}
#     elif path.suffix == ".zip":
#         !unzip -o {path} -d {prefix}
#     elif path.suffix == ".gz":
#         !gzip -fkd {path}

### TODO: clean corpus

1. remove sentences that are too long or too short
2. unify punctuation

hint: you can use clean_s() defined above to do this
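
A minimal sketch of the two cleaning steps, independent of the notebook's own `clean_s()` and `len_s()`. The punctuation table is an illustrative subset, length is counted in characters ignoring spaces (an assumption), and both helper names are hypothetical.

```python
import re


def normalize_punct(s):
    """Map a few full-width punctuation marks to ASCII and collapse whitespace."""
    table = str.maketrans({"，": ",", "！": "!", "？": "?", "：": ":", "；": ";"})
    s = s.translate(table)
    return re.sub(r"\s+", " ", s).strip()


def filter_by_length(sentences, min_len=1, max_len=1000):
    """Keep normalized sentences whose character count (spaces ignored) is in [min_len, max_len]."""
    kept = []
    for s in sentences:
        s = normalize_punct(s)
        n = len(s.replace(" ", ""))
        if min_len <= n <= max_len:
            kept.append(s)
    return kept


raw = ["你好，世界！", "", "   ", "這是  一個測試？", "長" * 20]
cleaned = filter_by_length(raw, min_len=1, max_len=10)
print(cleaned)
```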

In [None]:
# mono_prefix


In [None]:
# def clean_mono_corpus(mono_prefix, l1, l2, max_len=1000, min_len=1):
#     if Path(f'{mono_prefix}/ted_zh_corpus.deduped.clean.{l1}').exists() and Path(f'{mono_prefix}/ted_zh_corpus.deduped.clean.{l2}').exists():
#         print(f'{mono_prefix}/ted_zh_corpus.deduped.clean.{l1} & {l2} exists. skipping clean.')
#         return
#     with open(f'{mono_prefix}/ted_zh_corpus.deduped', 'r') as l1_in_f:
#         with open(f'{mono_prefix}/ted_zh_corpus.deduped.clean.{l1}', 'w') as l1_out_f:
#             with open(f'{mono_prefix}/ted_zh_corpus.deduped.clean.{l2}', 'w') as l2_out_f:
#                 for s1 in l1_in_f:
#                     s1 = s1.strip()
#                     s1 = clean_s(s1, l1)
#                     s1_len = len_s(s1, l1)
#                     if min_len > 0: # remove short sentence
#                         if s1_len < min_len:
#                             continue
#                     if max_len > 0: # remove long sentence
#                         if s1_len > max_len:
#                             continue
#                     print(s1, file=l1_out_f)
#                     print('.', file=l2_out_f)

In [None]:
# clean_mono_corpus(mono_prefix, 'zh','en')


In [None]:

# !head {data_prefix+'.clean.'+'zh'} -n 5
# !head {data_prefix+'.clean.'+'en'} -n 5

### TODO: Subword Units

Use the spm model of the backward model to tokenize the data into subword units

hint: spm model is located at DATA/raw-data/\[dataset\]/spm\[vocab_num\].model

In [None]:
# for lang in ['zh','en']:
#     out_path = mono_prefix/f'mono.tok.{lang}'
#     if out_path.exists():
#         print(f"{out_path} exists. skipping spm_encode.")
#     else:
#         with open(mono_prefix/f'mono.tok.{lang}', 'w') as out_f:
#             with open(mono_prefix/f'ted_zh_corpus.deduped.clean.{lang}', 'r') as in_f:
#                 for line in in_f:
#                     line = line.strip()
#                     tok = spm_model.encode(line, out_type=str)
#                     print(' '.join(tok), file=out_f)

### Binarize

use fairseq to binarize data

In [None]:
# binpath = Path('./DATA/data-bin', mono_dataset_name)
# src_dict_file = './DATA/data-bin/ted2020/dict.en.txt'
# tgt_dict_file = src_dict_file
# monopref = str(mono_prefix/"mono.tok") # whatever filepath you get after applying subword tokenization
# if binpath.exists():
#     print(binpath, "exists, will not overwrite!")
# else:
#     !python -m fairseq_cli.preprocess\
#         --source-lang 'zh'\
#         --target-lang 'en'\
#         --trainpref {monopref}\
#         --destdir {binpath}\
#         --srcdict {src_dict_file}\
#         --tgtdict {tgt_dict_file}\
#         --workers 2

### TODO: Generate synthetic data with backward model

Add binarized monolingual data to the original data directory, and name it with "split_name"

ex. ./DATA/data-bin/ted2020/\[split_name\].zh-en.\["en", "zh"\].\["bin", "idx"\]

then you can use 'generate_prediction(model, task, split="split_name")' to generate translation prediction
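
The naming convention above can be spelled out in a short sketch that only builds the source and destination paths and copies nothing. The directory names follow the example paths in this notebook, and both helpers are hypothetical.

```python
from pathlib import Path


def split_file_names(split_name, src="zh", tgt="en"):
    """List the four binarized file names for one split, following the pattern above."""
    pair = f"{src}-{tgt}"
    return [f"{split_name}.{pair}.{lang}.{ext}" for lang in (src, tgt) for ext in ("bin", "idx")]


def copy_plan(src_dir, dst_dir, src_split="train", dst_split="mono"):
    """Pair each binarized file with its renamed destination path."""
    src_names = split_file_names(src_split)
    dst_names = split_file_names(dst_split)
    return [(Path(src_dir) / s, Path(dst_dir) / d) for s, d in zip(src_names, dst_names)]


plan = copy_plan("./DATA/data-bin/mono", "./DATA/data-bin/ted2020")
for s, d in plan:
    print(s, "->", d)
```

Each pair could then be passed to `shutil.copy(s, d)`, which also works on systems where `cp` is unavailable.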

In [None]:
# # Add binarized monolingual data to the original data directory, and name it with "split_name"
# # ex. ./DATA/data-bin/ted2020/\[split_name\].zh-en.\["en", "zh"\].\["bin", "idx"\]
# !cp ./DATA/data-bin/mono/train.zh-en.zh.bin ./DATA/data-bin/ted2020/mono.zh-en.zh.bin
# !cp ./DATA/data-bin/mono/train.zh-en.zh.idx ./DATA/data-bin/ted2020/mono.zh-en.zh.idx
# !cp ./DATA/data-bin/mono/train.zh-en.en.bin ./DATA/data-bin/ted2020/mono.zh-en.en.bin
# !cp ./DATA/data-bin/mono/train.zh-en.en.idx ./DATA/data-bin/ted2020/mono.zh-en.en.idx

In [None]:
# # hint: use the backward model to predict on split='mono' and generate prediction_file
# generate_prediction(model, task, split="mono", outfile="./DATA/rawdata/mono/mono_prediction.txt")


In [None]:
# hint: do prediction on split='mono' to create prediction_file
# generate_prediction( ... ,split=... ,outfile=... )

### TODO: Create new dataset

1. Combine the prediction data with monolingual data
2. Use the original spm model to tokenize data into Subword Units
3. Binarize data with fairseq
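
Step 1 relies on the prediction file and the monolingual file being line-aligned. A minimal sketch of pairing them, with made-up sentences. Dropping pairs with an empty side is an extra safeguard assumed here, not something the assignment prescribes, and `pair_parallel` is a hypothetical helper.

```python
def pair_parallel(src_lines, tgt_lines):
    """Zip synthetic source lines with original target lines, skipping pairs with an empty side."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("line counts differ; the files are not aligned")
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# synthetic English from the backward model, paired with the original Chinese
synthetic_en = ["thank you .\n", "\n", "how did he learn them ?\n"]
mono_zh = ["謝謝 。\n", "你好\n", "他是怎麼學會的 ?\n"]
pairs = pair_parallel(synthetic_en, mono_zh)
print(pairs)
```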

In [None]:
# Combine prediction_file (.en) and mono.zh (.zh) into a new dataset.
# 
# hint: tokenize prediction_file with the spm model
# spm_model.encode(line, out_type=str)
# output: ./DATA/rawdata/mono/mono.tok.en & mono.tok.zh
#
# hint: use fairseq to binarize these two files again
# binpath = Path('./DATA/data-bin/synthetic')
# src_dict_file = './DATA/data-bin/ted2020/dict.en.txt'
# tgt_dict_file = src_dict_file
# monopref = './DATA/rawdata/mono/mono.tok' # or whatever path after applying subword tokenization, w/o the suffix (.zh/.en)
# if binpath.exists():
#     print(binpath, "exists, will not overwrite!")
# else:
#     !python -m fairseq_cli.preprocess\
#         --source-lang 'zh'\
#         --target-lang 'en'\
#         --trainpref {monopref}\
#         --destdir {binpath}\
#         --srcdict {src_dict_file}\
#         --tgtdict {tgt_dict_file}\
#         --workers 2

In [None]:
# # Combine the prediction_file (.en) generated above with the Chinese mono.zh (.zh)
# # 
# # hint: tokenize prediction_file with the same spm model used earlier
# # spm_model.encode(line, out_type=str)
# # output: ./DATA/rawdata/mono/mono.tok.en & mono.tok.zh
# #
# with open(mono_prefix/f'mono.tok.en', 'w') as out_f:
#     with open('./DATA/rawdata/mono/mono_prediction.txt', 'r') as in_f:
#         for line in in_f:
#             line = line.strip()
#             tok = spm_model.encode(line, out_type=str)
#             print(' '.join(tok), file=out_f)

In [None]:
# # hint: use fairseq to binarize these files again
# binpath = Path('./DATA/data-bin/synthetic')
# src_dict_file = './DATA/data-bin/ted2020/dict.en.txt'
# tgt_dict_file = src_dict_file
# monopref = Path('./DATA/rawdata/mono/mono.tok') # or whatever path after applying subword tokenization, w/o the suffix (.zh/.en)
# if binpath.exists():
#     print(binpath, "exists, will not overwrite!")
# else:
#     !python -m fairseq_cli.preprocess\
#          --source-lang 'zh'\
#          --target-lang 'en'\
#          --trainpref {monopref}\
#          --destdir {binpath}\
#          --srcdict {src_dict_file}\
#          --tgtdict {tgt_dict_file}\
#          --workers 2

In [None]:
# # create a new dataset from all the files prepared above
# !cp -r ./DATA/data-bin/ted2020/ ./DATA/data-bin/ted2020_with_mono/

# !cp ./DATA/data-bin/synthetic/train.zh-en.zh.bin ./DATA/data-bin/ted2020_with_mono/train1.en-zh.zh.bin
# !cp ./DATA/data-bin/synthetic/train.zh-en.zh.idx ./DATA/data-bin/ted2020_with_mono/train1.en-zh.zh.idx
# !cp ./DATA/data-bin/synthetic/train.zh-en.en.bin ./DATA/data-bin/ted2020_with_mono/train1.en-zh.en.bin
# !cp ./DATA/data-bin/synthetic/train.zh-en.en.idx ./DATA/data-bin/ted2020_with_mono/train1.en-zh.en.idx

In [None]:
# config = Namespace(
#     datadir = "./DATA/data-bin/ted2020_with_mono",
#     savedir = "/content/drive/MyDrive/ML2021-hw5/checkpoints/transformer-big",
#     source_lang = "en",
#     target_lang = "zh",
    
#     # cpu threads when fetching & processing data.
#     num_workers=2,  
#     # batch size in terms of tokens. gradient accumulation increases the effective batchsize.
#     max_tokens=4096,
#     accum_steps=4,
    
#     # the lr is calculated from the Noam lr scheduler. you can tune the maximum lr by this factor.
#     lr_factor=2.,
#     lr_warmup=4000,
    
#     # clipping gradient norm helps alleviate gradient exploding
#     clip_norm=1.0,
    
#     # maximum epochs for training
#     max_epoch=35,
#     start_epoch=1,
    
#     # beam size for beam search
#     beam=5, 
#     # generate sequences of maximum length ax + b, where x is the source length
#     max_len_a=1.2, 
#     max_len_b=10,
#     # when decoding, post process sentence by removing sentencepiece symbols.
#     post_process = "sentencepiece",
    
#     # checkpoints
#     keep_last_epochs=15,
#     resume=None, # if resume from checkpoint name (under config.savedir)
    
#     # logging
#     use_wandb=False,
# )

In [None]:
# ## setup task
# task_cfg = TranslationConfig(
#     data=config.datadir,
#     source_lang=config.source_lang,
#     target_lang=config.target_lang,
#     train_subset="train",
#     required_seq_len_multiple=8,
#     dataset_impl="mmap",
#     upsample_primary=1,
# )
# task = TranslationTask.setup_task(task_cfg)

In [None]:
# logger.info("loading data for epoch 1")
# task.load_dataset(split="train", epoch=1, combine=True) # combine if you have back-translation data.
# task.load_dataset(split="valid", epoch=1)

In [None]:
# sample = task.dataset("valid")[1]
# pprint.pprint(sample)
# pprint.pprint(
#     "Source: " + \
#     task.source_dictionary.string(
#         sample['source'],
#         config.post_process,
#     )
# )
# pprint.pprint(
#     "Target: " + \
#     task.target_dictionary.string(
#         sample['target'],
#         config.post_process,
#     )
# )

In [None]:
# demo_epoch_obj = load_data_iterator(task, "valid", epoch=1, max_tokens=20, num_workers=1, cached=False)
# demo_iter = demo_epoch_obj.next_epoch_itr(shuffle=True)
# sample = next(demo_iter)
# sample

In [None]:
# # transformer-big
# arch_args = Namespace(
#     encoder_embed_dim=1024,
#     encoder_ffn_embed_dim=4096,
#     encoder_layers=6,
#     decoder_embed_dim=1024,
#     decoder_ffn_embed_dim=4096,
#     decoder_layers=6,
#     share_decoder_input_output_embed=True,
#     dropout=0.3,
# )

# # # HINT: fill in the parameters used by the Transformer
# def add_transformer_args(args):
#     args.encoder_attention_heads=16
#     args.encoder_normalize_before=True
    
#     args.decoder_attention_heads=16
#     args.decoder_normalize_before=True
    
#     args.activation_fn="relu"
#     args.max_source_positions=1024
#     args.max_target_positions=1024
    
#     # fill in the Transformer default parameters we did not set above
#     from fairseq.models.transformer import base_architecture 
#     base_architecture(args)

# add_transformer_args(arch_args)

In [None]:
# model = build_model(arch_args, task)
# logger.info(model)

In [None]:
# # averaging a few checkpoints can have a similar effect to ensemble
# checkdir=config.savedir
# !python ./fairseq/scripts/average_checkpoints.py \
# --inputs {checkdir} \
# --num-epoch-checkpoints 5 \
# --output {checkdir}/avg_last_5_checkpoint.pt

In [None]:
# # checkpoint_last.pt : latest epoch
# # checkpoint_best.pt : highest validation bleu
# # avg_last_5_checkpoint.pt: the average of last 5 epochs
# try_load_checkpoint(model, name="avg_last_5_checkpoint.pt")
# validate(model, task, criterion, log_to_wandb=False)
# None

In [None]:
# generate_prediction(model, task, outfile="./prediction-2.txt")


Created new dataset "ted2020_with_mono"

1. Change the datadir in **config** ("./DATA/data-bin/ted2020_with_mono")
2. Switch back the source_lang and target_lang in **config** ("en", "zh")
3. Change the savedir in **config** (e.g. "./checkpoints/transformer-bt")
4. Train model

# References

See: https://github.com/pai4451/ML2021/blob/main/hw5/hw5.ipynb