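In this chapter, we build a neural machine translation model using the Kyoto Free Translation Task (KFTT), a Japanese-English parallel corpus. Use existing tools such as fairseq, Hugging Face Transformers, or OpenNMT-py to build the model.
## knock90 Data preparation
Download a machine translation dataset. Format the training, development, and evaluation data, and apply preprocessing such as tokenization as needed. At this stage, use morphemes (Japanese) and words (English) as the token units.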

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# download KFTT data and unzip
!wget http://www.phontron.com/kftt/download/kftt-data-1.0.tar.gz
!tar -zxvf kftt-data-1.0.tar.gz

In [None]:
# install MeCab (-y answers the confirmation prompt automatically)
!apt install -y mecab libmecab-dev mecab-ipadic-utf8

In [None]:
# install CRF++ (needed to run)
FILE_ID = '0B4y35FiV1wh7QVR6VXJ5dWExSTQ'
FILE_NAME = 'crfpp.tar.gz'
!wget 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O $FILE_NAME
!tar xvf crfpp.tar.gz
%cd CRF++-0.58
!./configure && make && make install && ldconfig
!pwd
%cd /content

In [None]:
# use MeCab to tokenize the Japanese data
!cat kftt-data-1.0/data/orig/kyoto-train.ja | mecab > train.mecab.ja
!cat kftt-data-1.0/data/orig/kyoto-dev.ja | mecab > dev.mecab.ja
!cat kftt-data-1.0/data/orig/kyoto-test.ja | mecab > test.mecab.ja

In [None]:
with open('/content/kftt-data-1.0/data/tok/kyoto-train.cln.en') as f:
  data = f.readlines()
  for line in data[:9]:
    print(line)

In [None]:
for src, dst in [
    ('train.mecab.ja', 'train.spacy.ja'),
    ('dev.mecab.ja', 'dev.spacy.ja'),
    ('test.mecab.ja', 'test.spacy.ja'),
]:
    with open(src) as f:
        lst = []
        tmp = []
        for x in f:
            x = x.strip()
            if x == 'EOS':
                lst.append(' '.join(tmp))
                tmp = []
            elif x != '':
                tmp.append(x.split('\t')[0])
    with open(dst, 'w') as f:
        for line in lst:
            print(line, file=f)
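
The conversion above relies on MeCab's default output layout: one token per line as `surface<TAB>features`, with a line containing only `EOS` closing each sentence. A minimal sketch of the same logic as a function on an in-memory string follows; the function name `mecab_to_sentences` and the sample text (including its feature strings) are illustrative assumptions, not output copied from KFTT.

```python
def mecab_to_sentences(mecab_output):
    """Turn MeCab's one-token-per-line output into space-separated sentences."""
    sentences, tokens = [], []
    for line in mecab_output.splitlines():
        line = line.strip()
        if line == "EOS":
            # EOS marks the end of a sentence: flush the collected tokens.
            sentences.append(" ".join(tokens))
            tokens = []
        elif line:
            # Keep only the surface form, i.e. the text before the first tab.
            tokens.append(line.split("\t")[0])
    return sentences


# Illustrative input in the surface<TAB>features layout.
sample = (
    "京都\t名詞,固有名詞,地域,一般,*,*,京都,キョウト,キョート\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "古い\t形容詞,自立,*,*,形容詞・アウオ段,基本形,古い,フルイ,フルイ\n"
    "EOS\n"
)
print(mecab_to_sentences(sample))
```

An input line that is empty produces an empty sentence here, just as in the cell above, so line counts stay aligned with the English side.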

In [None]:
import re

# Tokenize the English data with simple rules: put spaces around parentheses,
# commas, periods, and double quotes, then collapse runs of whitespace.
# (spaCy itself is not used in this cell, despite the output file names.)
for src, dst in [
    ('kftt-data-1.0/data/orig/kyoto-train.en', 'train.spacy.en'),
    ('kftt-data-1.0/data/orig/kyoto-dev.en', 'dev.spacy.en'),
    ('kftt-data-1.0/data/orig/kyoto-test.en', 'test.spacy.en'),
]:
  with open(src) as f, open(dst, 'w') as g:
    for x in f:
      x = x.strip()
      x = x.replace("(", "( ")
      x = x.replace(")", " )")
      x = x.replace(",", " ,")
      x = x.replace(".", " .")
      x = x.replace("\"", " \" ")
      x = re.sub(r'\s+', ' ', x)
      print(x, file=g)


In [None]:
with open('train.spacy.ja') as f:
  data = f.readlines()
  for line in data[:9]:
    print(line)

In [None]:
!pip install fairseq
# doc: https://github.com/facebookresearch/fairseq/tree/main/examples/translation

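## knock91 Training a machine translation model
Using the data prepared in knock90, train a neural machine translation model (any suitable architecture, such as a Transformer or an LSTM, may be used).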

In [None]:
# preprocess/binarize the data
# src:ja trg:en; preprocessed data is saved to data91
!fairseq-preprocess -s ja -t en \
    --trainpref train.spacy \
    --validpref dev.spacy \
    --destdir data91  \
    --thresholdsrc 5 \
    --thresholdtgt 5 \
    --workers 20
# running time: 3min28s
# tokenizer=None,bpe=None,cpu=False, criterion='cross_entropy',lr_scheduler='fixed',min_loss_scale=0.0001,optimizer=None,scoring='bleu', seed=1
# [ja] Dictionary: 49320 types
# [ja] train.spacy.ja: 440288 sents, 11412336 tokens, 1.0% replaced (by <unk>)
# [ja] dev.spacy.ja: 1166 sents, 26014 tokens, 1.04% replaced (by <unk>)
# [en] Dictionary: 61944 types
# [en] train.spacy.en: 440288 sents, 11763358 tokens, 2.68% replaced (by <unk>)
# [en] dev.spacy.en: 1166 sents, 25042 tokens, 3.95% replaced (by <unk>)

In [None]:
# train a Transformer translation model
# set the environment variable CUDA_VISIBLE_DEVICES=0 to choose which GPU to use
# epochs=3
!fairseq-train data91 \
    --fp16 \
    --save-dir save91 \
    --max-epoch 3 \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --clip-norm 1.0 \
    --lr 1e-3 --lr-scheduler inverse_sqrt --warmup-updates 2000 \
    --update-freq 1 \
    --dropout 0.2 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 8000 > 91.log
# epoch 001: 100% 1758/1759 [10:52<00:00,  2.65it/s, loss=7.646, nll_loss=6.351, ppl=81.61, wps=18200.4, ups=2.67, wpb=6827.2, bsz=233.9, num_updates=1700, lr=0.00085, gnorm=0.886, clip=22, loss_scale=8, train_wall=37, gb_free=5.8, wall=633]
# epoch 002: 100% 1758/1759 [10:53<00:00,  2.71it/s, loss=6.818, nll_loss=5.399, ppl=42.21, wps=18216, ups=2.69, wpb=6772.2, bsz=242.2, num_updates=3500, lr=0.000755929, gnorm=0.583, clip=1, loss_scale=8, train_wall=37, gb_free=5.6, wall=1324]
# epoch 003: 100% 1758/1759 [10:52<00:00,  2.76it/s, loss=6.398, nll_loss=4.913, ppl=30.12, wps=18330.8, ups=2.7, wpb=6782.8, bsz=279.4, num_updates=5200, lr=0.000620174, gnorm=0.624, clip=4, loss_scale=8, train_wall=37, gb_free=7.3, wall=1979]
# running time: 34min10s
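
The `lr` values in the training log follow from `--lr 1e-3 --lr-scheduler inverse_sqrt --warmup-updates 2000`. The sketch below is a plain-Python approximation of that schedule, assuming a linear warmup starting from 0 and a decay proportional to the inverse square root of the update count; the function name `inverse_sqrt_lr` is made up for illustration and is not part of fairseq.

```python
import math


def inverse_sqrt_lr(step, peak_lr=1e-3, warmup_updates=2000):
    """Approximate learning rate after `step` updates (assumed warmup from 0)."""
    if step < warmup_updates:
        # Linear warmup up to the peak learning rate.
        return peak_lr * step / warmup_updates
    # After warmup, decay with the inverse square root of the update count.
    return peak_lr * math.sqrt(warmup_updates / step)


# Update counts taken from the three log lines above.
for step in (1700, 3500, 5200):
    print(step, inverse_sqrt_lr(step))
```

Under these assumptions the values agree with the logged `lr=0.00085`, `lr=0.000755929`, and `lr=0.000620174` at 1700, 3500, and 5200 updates.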

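## knock92 Applying the machine translation model
Using the neural machine translation model trained in knock91, implement a program that translates a given (arbitrary) Japanese sentence into English.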

In [None]:
!fairseq-interactive --path save91/checkpoint3.pt data91 < test.spacy.ja | grep '^H' | cut -f3 > 92.out
# [ja] dictionary: 49320 types
# [en] dictionary: 61944 types
# Total time: 301.925 seconds; translation time: 289.058

In [None]:
with open('92.out') as f:
  data = f.readlines()
  print(len(data))   # 1160
  for line in data[:9]:
    print(line)
# <unk>
# <unk> ( <unk> ) was a priest of the Rinzai sect in the late Kamakura period .
# He was the founder of the Soto sect .
# He was also known as <unk> .
# He was the founder of the sect .
# His posthumous Buddhist name was <unk> .
# It is also called <unk> .
# It is said to be the origin of the word <unk> or <unk> in Japan .
# It is said that he was a disciple of <unk> .

## knock93 Measuring the BLEU score
To assess the quality of the neural machine translation model trained in knock91, measure its BLEU score on the evaluation data.

In [None]:
!pip install tensorboardX

In [None]:
!fairseq-score --sys 92.out --ref test.spacy.en > 93.bleu

In [None]:
with open('93.bleu') as f:
  data = f.readlines()
  print(len(data))   # 2
  for line in data[:9]:
    print(line)
    
# Namespace(ignore_case=False, order=4, ref='test.spacy.en', sacrebleu=False, sentence_bleu=False, sys='92.out')
# BLEU4 = 5.15, 25.8/6.9/2.9/1.4 (BP=1.000, ratio=1.015, syslen=26553, reflen=26155)

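## knock94 Beam search
Introduce beam search when decoding translations with the neural machine translation model trained in knock91. Vary the beam width from 1 up to about 100 and plot how the BLEU score on the development set changes.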

In [None]:
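%%bash
# decode the test set with beam widths 5 to 11
# (a narrower sweep than the 1-100 on the development set that the task suggests)
for N in `seq 5 11`; do
  fairseq-interactive --path save91/checkpoint_best.pt --beam $N data91 < test.spacy.ja | grep '^H' | cut -f3 > 94.$N.out
  done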

In [None]:
%%bash
for N in `seq 5 11`; do
  fairseq-score --sys 94.$N.out --ref test.spacy.en > 94.$N.score
  done

In [None]:
import re
import matplotlib.pyplot as plt

def read_score(filename):
  # The second line of the fairseq-score output holds "BLEU4 = <score>, ..."
  with open(filename) as f:
    x = f.readlines()[1]
    x = re.search(r'(?<=BLEU4 = )\d*\.\d*(?=,)', x)
  return float(x.group())

xs = range(5, 12)
ys = [read_score(f'94.{x}.score') for x in xs]
plt.plot(xs, ys)
plt.xlabel('beam width')
plt.ylabel('BLEU')
plt.show()
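
To see why the beam width can change the output at all, here is a toy beam search over a hand-written probability table. It is a sketch of the general technique, not of fairseq's decoder; `TOY_MODEL` and its probabilities are invented so that greedy decoding (beam width 1) misses the most probable sequence.

```python
import math

# Invented conditional distributions P(next token | prefix).
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}


def beam_search(model, beam_size, length):
    """Return the best (prefix, log-probability) after `length` steps."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the beam_size highest-scoring prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


greedy, greedy_score = beam_search(TOY_MODEL, beam_size=1, length=2)
wider, wider_score = beam_search(TOY_MODEL, beam_size=2, length=2)
print(greedy, math.exp(greedy_score))
print(wider, math.exp(wider_score))
```

With beam width 1 the search commits to `A` (probability 0.6) and ends with a sequence of probability 0.3, while beam width 2 keeps `B` alive and finds `B x` with probability 0.36.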

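## knock95 Subword tokenization
Change the token unit from words and morphemes to subwords, and rerun the experiments of knock91-94.
## 1. Build a subword model for each of the source and target languages from the training data (SentencePiece for Japanese, subword-nmt BPE for English) and segment the text into subwords

[Reference article](https://note.com/npaka/n/n90f97543ec4b)

[SentencePiece](https://github.com/google/sentencepiece)

When processing language, text is first split into tokens, which are then converted into vector representations.
* Using the words obtained by morphological analysis as tokens has a drawback:

The vocabulary is huge, so it is usually restricted to high-frequency words; low-frequency words are discarded and treated as unknown words.
* How SentencePiece works (simplified):

First, split the text into words and count the frequency of each word.
Next, keep high-frequency words as single vocabulary items and split low-frequency words into shorter pieces.
Repeat the splitting until the vocabulary reaches the size specified in advance.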
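
The idea of a fixed-size subword vocabulary can also be reached from the opposite direction with byte-pair encoding (BPE), the merge-based algorithm that subword-nmt applies to the English side in this notebook. The sketch below is a toy BPE learner and is not SentencePiece's own algorithm (its default model type is unigram); the word counts are made up for illustration.

```python
import collections
import re


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Join every occurrence of the symbol pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Made-up word counts; "</w>" marks the end of a word.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(toy_vocab, num_merges=3)
print(merges)
print(segmented)
```

Each merge adds one symbol to the vocabulary, so the number of merges plays the role of the vocabulary-size setting (`-s 16000` for subword-nmt in this notebook).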

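## 2. Rerun the experiments of knock91-94
* Train the SentencePiece model (unsupervised learning)
* Convert the data to binary form with preprocess
* Vary beam_size at inference time, compute the BLEU score, and visualize it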

In [None]:
!pip install sentencepiece

In [None]:
import sentencepiece as spm
import re

# Train the SentencePiece model on the Japanese training data
spm.SentencePieceTrainer.Train(
    '--input=/content/kftt-data-1.0/data/orig/kyoto-train.ja --model_prefix=kyoto_ja --vocab_size=16000 --character_coverage=0.9995')

# Tokenize the Japanese text with the trained segmentation model
sp = spm.SentencePieceProcessor()
sp.Load('kyoto_ja.model')

for src, dst in [
  ('kftt-data-1.0/data/orig/kyoto-train.ja', 'train.sub.ja'),
  ('kftt-data-1.0/data/orig/kyoto-dev.ja', 'dev.sub.ja'),
  ('kftt-data-1.0/data/orig/kyoto-test.ja', 'test.sub.ja')
]:
  with open(src, 'r') as rf, open(dst, 'w') as wf:
    for x in rf:
      x = x.strip()
      x = re.sub(r'\s+', ' ', x)
      x = sp.encode_as_pieces(x)
      x = ' '.join(x)
      print(x, file=wf)

In [None]:
# Tokenize the English text with BPE (subword-nmt)
!pip install subword-nmt

!subword-nmt learn-bpe -s 16000 < kftt-data-1.0/data/orig/kyoto-train.en > kyoto_en.codes
!subword-nmt apply-bpe -c kyoto_en.codes < kftt-data-1.0/data/orig/kyoto-train.en > train.sub.en
!subword-nmt apply-bpe -c kyoto_en.codes < kftt-data-1.0/data/orig/kyoto-dev.en > dev.sub.en
!subword-nmt apply-bpe -c kyoto_en.codes < kftt-data-1.0/data/orig/kyoto-test.en > test.sub.en

# running time: 1min56sec
# vocab_size:16000

In [None]:
!fairseq-preprocess -s ja -t en \
    --trainpref train.sub \
    --validpref dev.sub \
    --testpref test.sub \
    --tokenizer space \
    --thresholdsrc 3 \
    --thresholdtgt 3 \
    --task translation \
    --workers 20 \
    --destdir knock95_subwords_sp

# Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='knock95_subwords_sp', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='ja', srcdict=None, suppress_crashes=False, target_lang='en', task='translation', tensorboard_logdir=None, testpref='test.sub', tgtdict=None, threshold_loss_scale=None, thresholdsrc=3, thresholdtgt=3, tokenizer='space', tpu=False, trainpref='train.sub', use_plasma_view=False, user_dir=None, validpref='dev.sub', wandb_project=None, workers=20)
# [ja] Dictionary: 17048 types
# [ja] train.sub.ja: 440288 sents, 10462018 tokens, 0.0147% replaced (by <unk>)
# [ja] Dictionary: 17048 types
# [ja] dev.sub.ja: 1166 sents, 24223 tokens, 0.0206% replaced (by <unk>)
# [ja] Dictionary: 17048 types
# [ja] test.sub.ja: 1160 sents, 26130 tokens, 0.0153% replaced (by <unk>)
# [en] Dictionary: 18656 types
# [en] train.sub.en: 440288 sents, 13280091 tokens, 0.022% replaced (by <unk>)
# [en] Dictionary: 18656 types
# [en] dev.sub.en: 1166 sents, 29011 tokens, 0.0103% replaced (by <unk>)
# [en] Dictionary: 18656 types
# [en] test.sub.en: 1160 sents, 31468 tokens, 0.0254% replaced (by <unk>)
# Wrote preprocessed data to knock95_subwords_sp

In [None]:
!fairseq-train knock95_subwords_sp \
    --fp16 \
    --save-dir save95 \
    --max-epoch 3 \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --clip-norm 1.0 \
    --lr 1e-3 --lr-scheduler inverse_sqrt --warmup-updates 2000 \
    --update-freq 1 \
    --dropout 0.2 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 8000 > 95.log

In [None]:
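# Feed the subword-segmented test set (test.sub.ja) to the subword model and
# strip the BPE continuation markers (@@ ) from the hypotheses.
!fairseq-interactive --path save95/checkpoint_best.pt knock95_subwords_sp < test.sub.ja | grep '^H' | cut -f3 | sed -r 's/(@@ )|(@@ ?$)//g' > 95.out

# Log of the recorded run, which read test.spacy.ja (word-level input) and did not strip the markers:
# [ja] dictionary: 17048 types
# [en] dictionary: 18656 types
# Total time: 379.424 seconds; translation time: 367.011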

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')
def spacy_tokenize(src, dst):
    with open(src) as f, open(dst, 'w') as g:
        for x in f:
            x = x.strip()
            x = ' '.join([doc.text for doc in nlp(x)])
            print(x, file=g)
spacy_tokenize('95.out', '95.out.spacy')


In [None]:
!fairseq-score --sys 95.out.spacy --ref test.spacy.en

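# Score recorded for a run that decoded test.spacy.ja (word-level input) with the subword model
# and left the @@ markers in the output, which likely explains the low BLEU and the ratio of 1.350:
# Namespace(ignore_case=False, order=4, ref='test.spacy.en', sacrebleu=False, sentence_bleu=False, sys='95.out.spacy')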
# BLEU4 = 1.70, 16.7/2.8/0.7/0.3 (BP=1.000, ratio=1.350, syslen=35302, reflen=26155)
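
Because the English side was segmented with subword-nmt, hypotheses from the subword model contain `@@ ` continuation markers, and these have to be merged back into words before scoring against the word-level reference `test.spacy.en`. A minimal sketch of that step in Python, using the same pattern as the `sed` expression found elsewhere in this notebook; `remove_bpe` is an illustrative name and the sample sentence is made up.

```python
import re


def remove_bpe(line):
    """Merge subword-nmt pieces back into words by dropping '@@ ' markers."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


# Made-up hypothesis in subword-nmt's output format.
print(remove_bpe("He was the foun@@ der of the Rin@@ zai sect ."))
```

The second alternative in the pattern handles a marker left dangling at the end of a line.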

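## knock96 Visualizing the training process
Use a tool such as TensorBoard to visualize how the neural machine translation model is trained. Items to visualize include the loss and BLEU score on the training data and the loss and BLEU score on the development data.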

In [None]:
!fairseq-train knock95_subwords_sp \
    --fp16 \
    --tensorboard-logdir log96 \
    --save-dir save96 \
    --max-epoch 5 \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --clip-norm 1.0 \
    --lr 1e-3 --lr-scheduler inverse_sqrt --warmup-updates 2000 \
    --dropout 0.2 --weight-decay 0.0001 \
    --update-freq 1 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 8000 > 96.log

## knock97 Hyperparameter tuning

While changing the neural network model and its hyperparameters, find the model and hyperparameters that maximize the BLEU score on the development data.

In [None]:
%%bash
# Training: sweep the dropout rate (seq 0.1 0.2 0.5 yields 0.1, 0.3, 0.5)
for N in `seq 0.1 0.2 0.5`; do
  fairseq-train knock95_subwords_sp \
        --save-dir checkpoints/train.sub.dropout_$N \
        --arch transformer --share-decoder-input-output-embed \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
        --dropout $N --weight-decay 0.0001 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 4096 \
        --max-epoch 5
  done

# Inference (create the output directory first, then strip the BPE markers with sed)
mkdir -p out97
for N in `seq 0.1 0.2 0.5` ; do
  fairseq-interactive knock95_subwords_sp \
    --path checkpoints/train.sub.dropout_$N/checkpoint_best.pt \
    < test.sub.ja | grep '^H' | cut -f3 | sed -r 's/(@@ )|(@@ ?$)//g' > out97/dropout_$N.out
  done

# Compute the BLEU score
for N in `seq 0.1 0.2 0.5` ; do
    echo dropout=$N >> out97/score97.out 
    fairseq-score --sys out97/dropout_$N.out --ref test.spacy.en >> out97/score97.out
  done

## knock98 Domain adaptation
Using translation data such as the Japanese-English Subtitle Corpus (JESC) or JParaCrawl, try to improve performance on the KFTT test data.

In [None]:
!wget http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/release/3.0/bitext/en-ja.tar.gz

In [None]:
import tarfile

with tarfile.open('en-ja.tar.gz') as tar:
    for f in tar.getmembers():
        if f.name.endswith('txt'):
            text = tar.extractfile(f).read().decode('utf-8')
            break

data = text.splitlines()
data = [x.split('\t') for x in data]
data = [x for x in data if len(x) == 4]
data = [[x[3], x[2]] for x in data]

with open('jparacrawl.ja', 'w') as f, open('jparacrawl.en', 'w') as g:
    for j, e in data:
        print(j, file=f)
        print(e, file=g)


In [None]:
with open('jparacrawl.ja') as f, open('train.jparacrawl.ja', 'w') as g:
    for x in f:
        x = x.strip()
        x = re.sub(r'\s+', ' ', x)
        x = sp.encode_as_pieces(x)
        x = ' '.join(x)
        print(x, file=g)

In [None]:
# execute subword
!subword-nmt apply-bpe -c kyoto_en.codes < jparacrawl.en > train.jparacrawl.en

In [None]:
!fairseq-preprocess -s ja -t en \
    --trainpref train.jparacrawl \
    --validpref dev.sub \
    --destdir data98  \
    --workers 20


In [None]:
%%bash
# Training: sweep the learning rate.
# Dropout is held at 0.2 (the knock91 value, an assumption) so that only the learning rate changes.
for N in 0.001 0.0005; do
  fairseq-train data98 \
        --fp16 \
        --save-dir checkpoints/train.jparacrawl.lr_$N \
        --arch transformer --share-decoder-input-output-embed \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr $N --lr-scheduler inverse_sqrt --warmup-updates 4000 \
        --dropout 0.2 --weight-decay 0.0001 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 4096 \
        --max-epoch 5
  done

In [None]:
%%bash
# Inference (create the output directory first, then strip the BPE markers with sed)
mkdir -p out98
for N in 0.001 0.0005 ; do
  fairseq-interactive data98 \
    --path checkpoints/train.jparacrawl.lr_$N/checkpoint_best.pt \
    < test.sub.ja | grep '^H' | cut -f3 | sed -r 's/(@@ )|(@@ ?$)//g' > out98/lr_$N.out
  done


In [None]:
%%bash
# Compute the BLEU score
for N in 0.001 0.0005 ; do
    echo lr=$N >> out98/score98.out 
    fairseq-score --sys out98/lr_$N.out --ref test.spacy.en >> out98/score98.out
  done

## knock99 Building a translation server
Build a demo system in which a user enters a sentence to translate and the translation is displayed in the web browser.
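
One possible starting point is a small server built on Python's standard library, sketched below. The `translate` function is a hypothetical placeholder that only echoes its input; the model call has to be wired in separately (fairseq's `TransformerModel.from_pretrained(...)` hub interface with its `translate` method is one candidate, but that wiring is neither shown nor tested here). The port number and page layout are likewise assumptions.

```python
import html
import urllib.parse
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = """<!doctype html>
<meta charset="utf-8">
<title>ja-en translation demo</title>
<form method="get" action="/">
  <input name="q" size="60" value="{query}">
  <button type="submit">Translate</button>
</form>
<p>{result}</p>
"""


def translate(sentence):
    # Hypothetical placeholder: replace with a call to the trained model.
    return "[translation of] " + sentence


def render_page(query):
    """Build the HTML page, escaping both user input and model output."""
    result = translate(query) if query else ""
    return PAGE.format(query=html.escape(query, quote=True), result=html.escape(result))


class TranslateHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The sentence arrives as the ?q=... query parameter.
        params = urllib.parse.parse_qs(urllib.parse.urlparse(self.path).query)
        body = render_page(params.get("q", [""])[0]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port=8000):
    """Serve the demo until interrupted."""
    HTTPServer(("", port), TranslateHandler).serve_forever()
```

Calling `run()` and opening `http://localhost:8000/` shows the form; this assumes the code runs on a machine whose port is reachable from the browser, which is not the case for a hosted notebook without extra setup. For a subword model, the input would also need the same SentencePiece segmentation used in training before being passed to the model.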