
Code Table Build Process for This Schema

sgalal edited this page Apr 27, 2020 · 3 revisions

For the full dictionary build process of this schema, see the build branch of this repository.

Dependencies

Install sgalal/opencc-python

$ git clone https://github.com/sgalal/opencc-python.git
$ cd opencc-python
$ python setup.py install

Install dependencies

$ pip install unihan-etl pandas sortedcontainers

Running the Scripts

$ unihan-etl -f kCantonese -F json --destination build/single_char/data/0-Unihan.json
$ build/build.py

Character Pronunciation Collection Process

  1. Export the Cantonese pronunciation data in the Unihan kCantonese field to build/single_char/data/0-Unihan.json
  2. Download and process the five data files mentioned above, saving them to build/single_char/data/0-*
  3. Sanitize the five data files and save the results to build/single_char/data/1-*
  4. Generate the result according to the principles, then store it in the variable d_single_char
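
As a rough illustration of step 1's output feeding into d_single_char, the sketch below collects kCantonese readings per character. It is not the code in build/build.py. The record shape (a list of objects with "char" and "kCantonese" keys, where the readings are either a list or a space-separated string) is an assumption about the JSON that unihan-etl writes, and the function names load_unihan_records and build_single_char are hypothetical. The sample records are illustrative and not copied from Unihan.

```python
import json
from collections import defaultdict


def load_unihan_records(path):
    """Load the exported JSON file (assumed to hold a list of records)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_single_char(records):
    """Map each character to the set of its Cantonese readings."""
    d_single_char = defaultdict(set)
    for record in records:
        char = record.get("char")
        readings = record.get("kCantonese")
        if not char or not readings:
            continue  # skip records without a Cantonese reading
        if isinstance(readings, str):
            # Assumption: an unexpanded field is space-separated.
            readings = readings.split()
        for reading in readings:
            d_single_char[char].add(reading)
    return dict(d_single_char)


# Illustrative in-memory records; a real run would call
# load_unihan_records("build/single_char/data/0-Unihan.json") instead.
sample = [
    {"char": "一", "kCantonese": ["jat1"]},
    {"char": "行", "kCantonese": "hang4 hong4"},
    {"char": "丂"},
]
d_single_char = build_single_char(sample)
```

Keeping the readings in a set means a reading that appears in more than one source is stored only once, which would matter if the other data files were merged into the same dictionary.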

Word Collection Process

  1. Download the LSHK Word List to build/word/data/香港語言學學會粵拼詞表.txt
  2. Read the file, discard its single-character entries, and save the remaining data to the variable d_word
  3. Write d_single_char and d_word to the output file
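
Steps 2 and 3 can be sketched roughly as follows. This is a minimal sketch, not the code in build/build.py. It assumes each line of the word list is a word and its romanization separated by a tab, and that the output is one tab-separated entry per line. Neither format is confirmed by this page, and the function names read_words and write_entries are hypothetical.

```python
import io


def read_words(lines):
    """Collect multi-character words and their readings."""
    d_word = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: "<word>\t<romanization>" on each line.
        word, _, reading = line.partition("\t")
        if len(word) <= 1 or not reading:
            continue  # discard single characters and malformed lines
        d_word.setdefault(word, set()).add(reading)
    return d_word


def write_entries(out, d_single_char, d_word):
    """Write single characters first, then words, in sorted order."""
    for table in (d_single_char, d_word):
        for entry in sorted(table):
            for reading in sorted(table[entry]):
                out.write(f"{entry}\t{reading}\n")


# Illustrative input; a real run would iterate over the downloaded file.
sample_lines = ["一\tjat1\n", "一定\tjat1 ding6\n", "\n"]
d_word = read_words(sample_lines)

buffer = io.StringIO()
write_entries(buffer, {"一": {"jat1"}}, d_word)
result = buffer.getvalue()
```

Writing to a file-like object rather than a fixed path keeps the sketch easy to check in memory. Passing an opened file instead of the StringIO buffer would write the same lines to disk.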