# Practice round: Chinese-English translation

Useful resource from Hugging Face: https://huggingface.co/blog/how-to-train

In [3]:
import json
import pandas as pd
import jieba
from tokenizers import SentencePieceBPETokenizer

## Load data

In [4]:
with open('./cn_en_weibo_data/data.cn-en.json', 'r', encoding='utf-8') as myfile:
    # One JSON object per line; skip blank lines (e.g. a trailing newline)
    raw = [line for line in myfile.read().split('\n') if line.strip()]

# Turn raw strings into a list of dictionaries
weiboDict = [json.loads(line) for line in raw]

weiboDf = pd.DataFrame(weiboDict)

weiboDf.tail()

| | id | source | target |
|---|---|---|---|
| 1998 | 3477898759956095 | 能发现自己的错误是智慧，能改正自己的错误是勇敢。喜欢请关注 | Can find their own mistakes is wisdom, to corr... |
| 1999 | 3526354320111701 | 我的死后日願望是，可以有世界末日啦！@browNsugaR 我在: | Very good music in Suns [good] //@browNsugaR:E... |
| 2000 | 3558553027867596 | 年后，让你觉得更失望的不是你做过的事情，而是你没有做过的事情。所以，一直想做的事情，不要再拖了 | online#Twenty years from now you will be more ... |
| 2001 | 3482173493644453 | 说：“在小寨军区侧门，有人维权。过来个军车就给人家看。不过貌似四医大不归省军区管吧 | It's hard for her to win,she should explicit m... |
| 2002 | 3562778084139688 | 没有受伤，不懂坚强；不犯错误，难以成长；未曾失败，何来成功。 | Day78：You'll never be brave if you don't get h... |


The data is clearly far from clean: several source/target pairs above do not even correspond to each other. However, for prototyping purposes, we will not focus on cleaning right now.
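The loading cell assumes every line of the file is valid JSON. Given how noisy the data looks, a more defensive loader could skip blank or unparseable lines and report how many were dropped. A minimal sketch using only the standard library; `load_json_lines` is a hypothetical helper, and the sample rows are made up for illustration:

```python
import io
import json

def load_json_lines(handle):
    """Parse one JSON object per line, skipping blank or malformed lines."""
    records, skipped = [], 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped += 1  # count malformed lines instead of crashing
    return records, skipped

# Tiny in-memory stand-in for the real data file (made-up rows)
sample = io.StringIO('{"source": "你好", "target": "hello"}\n\nnot json\n')
records, skipped = load_json_lines(sample)
print(len(records), skipped)
```

The returned list can be passed to `pd.DataFrame` exactly like `weiboDict`.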

## Parsing and tokenizing Chinese texts

We use the `jieba` library (结巴分词, "Jieba word segmentation") to segment Chinese text into words. For more information, see https://github.com/fxsjy/jieba/

In [5]:
chTexts = weiboDf['source']
enTexts = weiboDf['target']

# Segment every Chinese sentence. jieba.cut returns a lazy generator,
# so this is a list of generators (one per sentence), not of token lists
chTokensGen = [jieba.cut(sentence) for sentence in chTexts]

# Output a sample tokenization
print(list(chTokensGen[0]))

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\presu\AppData\Local\Temp\jieba.cache
Loading model cost 0.683 seconds.
Prefix dict has been built successfully.


['好', '的', '爱情', '使', '你', '通过', '一个', '人', '看到', '整个', '世界', '，', '坏', '的', '爱情', '使', '你', '为了', '一个', '人', '舍弃', '整个', '世界', '。']


It turns out that tokenizers based on `SentencePiece` operate on whole sentences and are trained to recognize subwords, so a separate word-segmentation step such as `jieba` is not required. We therefore will not use other parsers for now.
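To make "trained to recognize subwords" concrete, here is a toy sketch of the byte-pair-encoding idea: start from single characters and repeatedly merge the most frequent adjacent pair. This is a simplified illustration on a made-up three-sentence corpus, not the actual `SentencePiece` or `tokenizers` implementation; the helper names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair across all words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each sentence starts as a tuple of single characters
corpus = ["好的爱情", "坏的爱情", "爱情"]
words = Counter(tuple(s) for s in corpus)

merges = []
for _ in range(2):  # learn two merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
```

After two merges the toy vocabulary contains the subwords `爱情` and `的爱情`, learned purely from pair frequencies.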

In [6]:
pathAllCh = './cn_en_weibo_data/allCh.txt'
pathAllEn = './cn_en_weibo_data/allEn.txt'

# Store all Chinese text in a single file, one sentence per line
# (the with-block closes the file automatically)
with open(pathAllCh, 'w', encoding='utf-8') as file:
    for line in weiboDf['source']:
        file.write(line + '\n')

My feeling is that we cannot reuse a pretrained tokenizer if we want to train one from scratch. Instead, we probably need to train a Byte-Pair Encoding, WordPiece, or SentencePiece tokenizer from scratch on our own corpus.

https://huggingface.co/transformers/tokenizer_summary.html#sentencepiece

https://github.com/huggingface/tokenizers

In the following cell, we train a SentencePiece-style BPE tokenizer (`SentencePieceBPETokenizer`) on the Chinese text.

In [16]:
tokenizer = SentencePieceBPETokenizer()

tokenizer.train([pathAllCh],
                vocab_size=20000,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# Show an example of how the tokenizer works
output = tokenizer.encode(chTexts[0])
print(output.ids, output.tokens, output.offsets)

# Save the vocabulary and merges files to disk
tokenizer.save_model('.', 'myTokenizer')

[6950, 1491, 2254, 1036, 1158, 2196, 3013, 340, 2036, 1491, 1197, 1036, 5761, 4103, 94] ['▁好的爱情', '使你', '通过', '一个人', '看到', '整个', '世界,', '坏', '的爱情', '使你', '为了', '一个人', '舍弃', '整个世界', '。'] [(0, 4), (4, 6), (6, 8), (8, 11), (11, 13), (13, 15), (15, 18), (18, 19), (19, 22), (22, 24), (24, 26), (26, 29), (29, 31), (31, 35), (35, 36)]


['.\\myTokenizer-vocab.json', '.\\myTokenizer-merges.txt']

<span style="color:red;">Pending problem.</span> When I tried to follow the tutorial and load the tokenizer saved on disk, an unexpected error was reported. For now, we skip loading the saved tokenizer and proceed with the other important steps.
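One thing that might be worth trying is the single-file JSON format instead of the separate vocab and merges files. The sketch below does not diagnose the error above. It only shows a save/load round trip on a toy corpus, assuming a `tokenizers` release that provides `train_from_iterator`, `save`, and `Tokenizer.from_file`; the file name and corpus are placeholders.

```python
import os
import tempfile

from tokenizers import SentencePieceBPETokenizer, Tokenizer

# Toy corpus standing in for allCh.txt (assumption: any small text works here)
corpus = [
    "好的爱情使你通过一个人看到整个世界",
    "坏的爱情使你为了一个人舍弃整个世界",
]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=100,
    min_frequency=1,
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'],
    show_progress=False,
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "myTokenizer.json")  # placeholder file name
    tokenizer.save(path)                  # whole pipeline in one JSON file
    reloaded = Tokenizer.from_file(path)  # generic loader for that format

# The reloaded tokenizer should produce the same ids as the original
before = tokenizer.encode(corpus[0]).ids
after = reloaded.encode(corpus[0]).ids
print(before == after)
```

Note that `Tokenizer.from_file` returns a generic `Tokenizer` rather than a `SentencePieceBPETokenizer`, but it carries the same normalizer, pre-tokenizer, and model.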

<span style="color:red;">Bottleneck for now.</span> Do we need special tokens for T5? If so, how do we insert T5's special tokens into our tokenization? `tokenizers.processors.BertProcessing` exists; is there an analogous `tokenizers.processors.T5Processing`?

<span style="color:red;">Solution.</span> 1. Read the T5 model documentation in the Hugging Face docs thoroughly; 2. Explore the `huggingface/tokenizers` library on GitHub.

<span style="color:red;">Question.</span> Is there documentation for `huggingface/tokenizers`?!
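On the T5 special-token question above: in the Hugging Face implementation, T5 inputs end with `</s>` and are padded with `<pad>`. As far as I know there is no `T5Processing` class, but the generic `tokenizers.processors.TemplateProcessing` can append `</s>`. A minimal sketch on a toy corpus, using a plain `Tokenizer` with a `BPE` model rather than the tokenizer trained above; the corpus, vocabulary size, and token order are assumptions for illustration, and sentinel tokens such as `<extra_id_0>` are not covered.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

# Toy corpus standing in for the real training file
corpus = [
    "好的爱情使你通过一个人看到整个世界",
    "坏的爱情使你为了一个人舍弃整个世界",
]

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
trainer = BpeTrainer(
    vocab_size=100,
    special_tokens=["<pad>", "</s>", "<unk>"],  # ids 0, 1, 2 in this order
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Append </s> to every encoded sequence, as T5 expects for its inputs
eos_id = tokenizer.token_to_id("</s>")
tokenizer.post_processor = TemplateProcessing(
    single="$A </s>",
    pair="$A </s> $B </s>",
    special_tokens=[("</s>", eos_id)],
)

encoding = tokenizer.encode("好的爱情")
print(encoding.tokens)
```

Listing `<pad>`, `</s>`, `<unk>` first in `special_tokens` gives them ids 0, 1, 2, which matches the ids used by the pretrained T5 vocabularies.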

For now, pause work on the tokenizer and proceed with the language model until we run into problems. Keep the open question about special tokens in mind.