## Tokenizing a Chinese Corpus for Transformer Models

The tokenization steps in the article we originally followed, [Train GPT-2 in your own language](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171), do not work for Chinese, so we consulted several other articles ([[1]](https://clay-atlas.com/blog/2020/06/30/pytorch-%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8-hugging-face-%E6%89%80%E6%8F%90%E4%BE%9B%E7%9A%84-transformers-%E4%BB%A5-bert-%E7%82%BA%E4%BE%8B/), [[2]](https://zhuanlan.zhihu.com/p/120315111), [[3]](https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html), [[4]](https://towardsdatascience.com/working-with-hugging-face-transformers-and-tf-2-0-89bf35e3555a)) to carry out our own tokenization.

According to [Working with Hugging Face Transformers and TF 2.0](https://towardsdatascience.com/working-with-hugging-face-transformers-and-tf-2-0-89bf35e3555a), working with a transformer model generally follows these steps:

> Tokenizer definition → Tokenization of Documents → Model Definition → Model Training → Inference

We therefore start below with the definition of a Chinese tokenizer.

### References
1. [如何使用 hugging face 所提供的 transformers 以 bert / PyTorch 為例](https://clay-atlas.com/blog/2020/06/30/pytorch-%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8-hugging-face-%E6%89%80%E6%8F%90%E4%BE%9B%E7%9A%84-transformers-%E4%BB%A5-bert-%E7%82%BA%E4%BE%8B/)

2. [Huggingface简介及BERT代码浅析](https://zhuanlan.zhihu.com/p/120315111)

3. [進擊的 BERT：NLP 界的巨人之力與遷移學習](https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html)

4. [Working with Hugging Face Transformers and TF 2.0](https://towardsdatascience.com/working-with-hugging-face-transformers-and-tf-2-0-89bf35e3555a)

5. [中文GPT2预训练实战](https://finisky.github.io/2020/05/01/pretrainchinesegpt/)

## A Brief Look at Tokenization

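[Tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is one part of [lexical analysis](https://en.wikipedia.org/wiki/Lexical_analysis): the process of splitting an input string into tokens and then classifying those tokens. The resulting tokens are then passed on to syntactic analysis. Depending on the goal of the analysis, tokenization can be done in many different ways. [Word segmentation](https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation), for example, turns each word into one token.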

Taking an English sentence as an example, word-based tokenization of `The quick brown fox jumps over the lazy dog`, written in XML format, gives:
```
<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>
```

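Word segmentation is of course not the only way to tokenize. We can treat each character as its own token (character-based tokenization), treat each sentence as one token (sentence-based tokenization), or even treat each byte as its own token (byte-based tokenization).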
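
As a rough illustration of these granularities, the sketch below tokenizes the same English sentence by word, by character, and by byte using only the Python standard library. The whitespace split is a simplification that happens to work for this sentence; it is not how a production tokenizer segments text.

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Word-based: split on whitespace (good enough for this simple sentence)
word_tokens = sentence.split()

# Character-based: every character, spaces included, is a token
char_tokens = list(sentence)

# Byte-based: every byte of the UTF-8 encoding is a token
byte_tokens = list(sentence.encode("utf-8"))

print(len(word_tokens), word_tokens[:3])
print(len(char_tokens), char_tokens[:3])
print(len(byte_tokens), byte_tokens[:3])
```

For pure ASCII text such as this sentence, the character count and the byte count are the same; they differ as soon as the text contains characters that need more than one byte in UTF-8.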

Hugging Face's [`transformers`](https://huggingface.co/transformers/index.html) package ships with many tokenizers ([Tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html)). The [GPT2](https://huggingface.co/transformers/model_doc/gpt2.html) model has its own dedicated [GPT2Tokenizer](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizer), which is a [BPE tokenizer (Byte-Pair-Encoding)](https://medium.com/@pierre_guillou/byte-level-bpe-an-universal-tokenizer-but-aff932332ffe). In theory a [BPE tokenizer](https://medium.com/@pierre_guillou/byte-level-bpe-an-universal-tokenizer-but-aff932332ffe) is not tied to any particular language, but for languages with multi-byte characters, such as Chinese, Japanese, and Korean, it easily runs into trouble with "exceptional characters", so word-level or character-level tokenization is a better fit.
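
To see why multi-byte characters matter for a byte-level tokenizer, the sketch below inspects the UTF-8 encoding of a short Chinese string with the standard library. It only shows the byte layout; it does not reproduce the merges that GPT2Tokenizer actually performs.

```python
text = "今天天氣"

encoded = text.encode("utf-8")
print(len(text), "characters ->", len(encoded), "bytes")

# Each of these characters occupies three bytes in UTF-8
for ch in text:
    print(ch, list(ch.encode("utf-8")))

# Cutting the byte sequence in the middle of a character leaves a fragment
# that is not valid UTF-8 on its own
fragment = encoded[:4]
print(fragment.decode("utf-8", errors="replace"))
```

A tokenizer that works on whole characters never produces such partial fragments, which is one reason character-level tokenization is convenient for Chinese.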

The [Transformers](https://huggingface.co/transformers/index.html) package does not provide Chinese word segmentation, and the Chinese GPT2 examples available online all use the [BertTokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer) released by Google. Next, we run a few simple tests with the tools built into the [transformers](https://huggingface.co/transformers/index.html) package:

```python
# coding: utf-8
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer that belongs to the pretrained Chinese BERT model
tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
# embedding = AutoModel.from_pretrained('bert-base-chinese')  # model itself is not needed here

# Tokenize one sentence, adding BERT's [CLS] and [SEP] markers by hand
sent = '今天天氣真 Good。'
sent_token = ['[CLS]'] + tokenizer.tokenize(sent) + ['[SEP]']

# Map tokens to vocabulary ids and back again
sent_token_encode = tokenizer.convert_tokens_to_ids(sent_token)
sent_token_decode = tokenizer.convert_ids_to_tokens(sent_token_encode)

print('sent:', sent)
print('sent_token:', sent_token)
print('encode:', sent_token_encode)
print('decode:', sent_token_decode)
```

```
sent: 今天天氣真 Good。
sent_token: ['[CLS]', '今', '天', '天', '氣', '真', '[UNK]', '。', '[SEP]']
encode: [101, 791, 1921, 1921, 3706, 4696, 100, 511, 102]
decode: ['[CLS]', '今', '天', '天', '氣', '真', '[UNK]', '。', '[SEP]']
```
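
The `[UNK]` in the output above appears because the tokenizer could not map `Good` to any vocabulary entry. The sketch below mimics that lookup with a toy vocabulary whose ids are copied from the output above. `toy_tokenize` and `toy_vocab` are hypothetical helpers written for this illustration, and they are far simpler than the WordPiece logic inside the real BertTokenizer.

```python
# Toy vocabulary: ids copied from the output above, nothing else is known here
toy_vocab = {
    '[UNK]': 100, '[CLS]': 101, '[SEP]': 102,
    '今': 791, '天': 1921, '氣': 3706, '真': 4696, '。': 511,
}


def toy_tokenize(text):
    """Split CJK characters and punctuation one by one; keep other runs whole."""
    tokens, buf = [], ""
    for ch in text:
        is_cjk = "\u4e00" <= ch <= "\u9fff"
        is_punct = ch in "。，！？"
        if is_cjk or is_punct or ch.isspace():
            if buf:  # flush a pending non-CJK run such as "Good"
                tokens.append(buf)
                buf = ""
            if not ch.isspace():
                tokens.append(ch)
        else:
            buf += ch
    if buf:
        tokens.append(buf)
    return tokens


tokens = ['[CLS]'] + toy_tokenize('今天天氣真 Good。') + ['[SEP]']
# Anything missing from the vocabulary falls back to the [UNK] id
ids = [toy_vocab.get(tok, toy_vocab['[UNK]']) for tok in tokens]

print(tokens)
print(ids)
```

Once a token has been replaced by the `[UNK]` id, the original text cannot be recovered from the ids, which is why the decoded output shows `[UNK]` instead of `Good`.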


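## Testing on a Slightly Larger Corpus

The single-sentence test worked, so next we test on a slightly larger corpus: 500 Chinese Wikipedia articles, each longer than 500 characters.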

```python
from pathlib import Path

from transformers import BertTokenizerFast

corpus_dir = '../data/test_wiki500/'
model_path = '../data/tokenizer_bert_base_chinese'


def encode_corpus(corpus_path):
    """Encode every .txt file under corpus_path into lists of BERT token ids."""
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
    # tokenizer.save_pretrained(model_path)
    paths = [str(x) for x in Path(corpus_path).glob("**/*.txt")]
    print('Documents to encode: ' + str(len(paths)))
    data = []
    for furi in paths:
        with open(furi, 'r', encoding="utf8") as f:
            text = f.readlines()
        if not text:
            continue  # skip empty files instead of failing on text[0]
        # Only the first line is read: each file is assumed to hold one article
        # on a single line, with sentences separated by single spaces.
        for sent in text[0].split(' '):
            sent_token = ['[CLS]'] + tokenizer.tokenize(sent) + ['[SEP]']
            sent_token_encode = tokenizer.convert_tokens_to_ids(sent_token)
            data.append(sent_token_encode)
    print('Sentences encoded: ' + str(len(data)))
    return data


data = encode_corpus(corpus_dir)
```

```
Documents to encode: 500
Sentences encoded: 672846
```
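
One detail of `split(' ')` is worth keeping in mind. Unlike `split()` with no argument, it yields an empty string wherever two spaces are adjacent or the line ends with a space, and in `encode_corpus` each empty string would become a `[CLS]`, `[SEP]` pair with nothing in between. Whether the corpus used here contains such cases is not known; the line below is a made-up example.

```python
# Made-up line with a double space in the middle and a trailing space
line = "算術  加減乘除 "

by_single_space = line.split(' ')   # keeps empty strings
by_whitespace = line.split()        # collapses runs of whitespace

print(by_single_space)
print(by_whitespace)

# Dropping empty pieces keeps only real sentences
non_empty = [s for s in by_single_space if s.strip()]
print(non_empty)
```

Filtering like this would change the sentence count reported above, so it is shown only as an option and not as what the original run did.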


```python
# Decode a few encoded sentences back to tokens as a sanity check
# (`tokenizer` is the one loaded in the first cell)
for row in data[100:120]:
    sent_token_encode = row
    sent_token_decode = tokenizer.convert_ids_to_tokens(sent_token_encode)
    print('encode:', sent_token_encode)
    print('decode:', sent_token_decode)
```

```
encode: [101, 2119, 1558, 102]
decode: ['[CLS]', '學', '問', '[SEP]']
encode: [101, 3229, 7279, 4638, 7269, 4764, 5023, 2853, 6496, 4638, 3149, 7030, 7302, 913, 102]
decode: ['[CLS]', '時', '間', '的', '長', '短', '等', '抽', '象', '的', '數', '量', '關', '係', '[SEP]']
encode: [101, 3683, 1963, 3229, 7279, 1606, 855, 3300, 3189, 102]
decode: ['[CLS]', '比', '如', '時', '間', '單', '位', '有', '日', '[SEP]']
encode: [101, 2108, 5059, 1469, 2399, 5023, 102]
decode: ['[CLS]', '季', '節', '和', '年', '等', '[SEP]']
encode: [101, 5050, 6123, 102]
decode: ['[CLS]', '算', '術', '[SEP]']
encode: [101, 1217, 3938, 733, 7370, 102]
decode: ['[CLS]', '加', '減', '乘', '除', '[SEP]']
encode: [101, 738, 5632, 4197, 5445, 4197, 1765, 4496, 4495, 749, 102]
decode: ['[CLS]', '也', '自', '然', '而', '然', '地', '產', '生', '了', '[SEP]']
encode: [101, 3644, 1380, 677, 3295, 3300, 6882, 6258, 1914, 679, 1398, 4638, 6250, 3149, 5143, 5186, 102]
decode: ['[CLS]', '歷', '史', '上', '曾', '有', '過', '許', '多', '不', '同', '的', '記', '數', '系', '統', '[SEP]']
...
```

```python
import pickle

# Save the encoded corpus so it can be reloaded without tokenizing again
with open('../data/encoded_wiki500.pkl', 'wb') as f:
    pickle.dump(data, f)
```
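
Reading the file back is the mirror image of the dump above. The sketch below round-trips a small stand-in list through a temporary file so that it runs without the corpus. The two sample rows are copied from the output shown earlier, and the temporary path and file name are only for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for `data`: two encoded sentences taken from the output above
sample = [[101, 2119, 1558, 102], [101, 5050, 6123, 102]]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "encoded_sample.pkl")
    # Write the list of id lists in binary mode
    with open(pkl_path, "wb") as f:
        pickle.dump(sample, f)
    # Read it back; the result is an ordinary list of lists again
    with open(pkl_path, "rb") as f:
        restored = pickle.load(f)

print(restored == sample)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code when given untrusted input.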