97 changes: 97 additions & 0 deletions CNN.md
@@ -0,0 +1,97 @@
## Convolutional Neural Networks

Author: Shenghong Liu (uni.liushenghong@gmail.com)

The existing LSTM-based segmenter runs in O(n) time but processes characters sequentially, which is not ideal for content-heavy platforms. Hence, I introduce a new model architecture for faster word segmentation of Southeast Asian languages such as Thai and Burmese.

<img src="Figures/cnn.jpg" width="30%"/>

The convolutional neural network (CNN) architecture developed in this project achieved faster inference speeds with comparable accuracy for Thai. Not only was the sequential-inference bottleneck resolved; the use of dilated convolutions also helped maintain a high level of accuracy by capturing a wider context of the surrounding text.
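
To make the architecture concrete, the sketch below builds a toy dilated-CNN tagger in Keras. It is an illustration only, not the exact model trained in this repository: the function name, layer count, kernel size, dilation rates, and the four-class output are assumptions, while `filters` and `edim` correspond to the hyperparameters described later on this page.

```python
import tensorflow as tf

def build_cnn_segmenter(vocab_size: int, edim: int = 40, filters: int = 32) -> tf.keras.Model:
    """Toy dilated-CNN tagger: one boundary label per input unit (sketch only)."""
    inputs = tf.keras.Input(shape=(None,), dtype="int32")
    x = tf.keras.layers.Embedding(vocab_size, edim)(inputs)
    # Stacked convolutions with growing dilation rates widen the receptive
    # field (the context each prediction sees) without many extra parameters.
    for rate in (1, 2, 4):
        x = tf.keras.layers.Conv1D(filters, kernel_size=3, padding="same",
                                   dilation_rate=rate, activation="relu")(x)
    outputs = tf.keras.layers.Dense(4, activation="softmax")(x)  # 4 boundary classes
    return tf.keras.Model(inputs, outputs)
```

With kernel size 3 and dilation rates 1, 2, and 4, each output position sees roughly 15 neighbouring units, which is how a shallow, fully parallelisable stack can still capture word-level context.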

| Model | F1-Score | Model Size | CPU Inference Speed |
|----------|:--------:|:---------:|---------:|
| LSTM Medium | 90.1 | 36 KB | 9.29 ms |
| LSTM Small | 86.7 | 12 KB | 6.68 ms |
| CNN Medium | 90.4 | 28 KB | 3.76 ms |
| ICU | 86.4 | 126 KB | ~0.2 ms|

### Examples

**Test Case 1**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | พระราชประสงค์ของพระบาทสมเสด็จพระเจ้าอยู่หัวในรัชกาลปัจจุบันคือ |
| Manually Segmented | พระราชประสงค์_ของ_พระบาทสมเสด็จพระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |
| CNN | พระราชประสงค์_ของ_พระบาทสมเสด็จพระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |
| ICU | พระ_ราช_ประสงค์_ของ_พระบาท_สม_เสด็จ_พระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |
| LSTM | พระราชประสงค์_ของ_พระบาทสมเสด็จ_พระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |

**Test Case 2**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | ในขณะเดียวกันผู้ที่ต้องการเงินเพื่อนำไปลงทุนหรือประกอบกิจการอื่นใด |
| Manually Segmented | ใน_ขณะ_เดียว_กัน_ผู้_ที่_ต้องการ_เงิน_เพื่อ_นำ_ไป_ลง_ทุน _หรือ_ประกอบ_กิจการ_อื่น_ใด |
| CNN | ใน_ขณะ_เดียว_กัน_ผู้_ที่_ต้องการ_เงิน_เพื่อ_นำ_ไป_ลง_ทุน _หรือ_ประกอบ_กิจการ_อื่น_ใด |
| ICU | ใน_ขณะ_เดียวกัน_ผู้_ที่_ต้องการ_เงิน_เพื่อน_ำ_ไป_ลงทุน_หรือ_ประกอบ_กิจการ_อื่น_ใด |
| LSTM | ใน_ขณะ_เดียว_กัน_ผู้_ที่_ต้อง_การ_เงิน_เพื่อ_นำ_ไป_ลง_ทุน _หรือ_ประกอบ_กิจการ_อื่น_ใด |

**Test Case 3**

| Algorithm | Output |
|----------|:---------|
| Unsegmented | เพราะเพียงกรดนิวคลีอิคของไวรัสอย่างเดียวก็สร้างไวรัสสมบูรณ์ |
| Manually Segmented | เพราะ_เพียง_กรด_นิวคลีอิค_ของ_ไวรัส_อย่าง_เดียว_ก็_สร้าง_ไวรัส_สมบูรณ์ |
| CNN | เพราะ_เพียง_กรด_นิว_คลี_อิค_ของ_ไวรัส_อย่าง_เดียว_ก็_สร้าง_ไวรัส_สมบูรณ์ |
| ICU | เพราะ_เพียง_กรด_นิ_วค_ลี_อิค_ของ_ไวรัส_อย่าง_เดียว_ก็_สร้าง_ไวรัส_สมบูรณ์ |
| LSTM | เพราะ_เพียง_กรดนิว_คลีอิค_ของ_ไวรัสอย่าง_เดียว_ก็_สร้าง_ไวรัสสมบูรณ์ |

### Hyperparameters

In Vertex AI Custom Training, you need to specify the following hyperparameters:

```
--path=gs://bucket_name/Data/
--language=Thai
--input-type=BEST
--model-type=cnn
--epochs=200
--filters=32
--name=Thai_codepoints_32
--edim=40
--embedding=codepoints
```

* **name:** This is the model name.
* **path:** This is the Google Cloud Storage bucket path that holds the dataset.
* **language:** This is the language you'd like to train, such as ```Thai``` or ```Burmese```.
* **input-type:** This is the dataset type, such as ```BEST``` for Thai and ```my``` for Burmese. Refer to [Cloud Usage](Cloud%20Usage.md) for more details.
* **model-type:** This is the model architecture type, such as ```lstm``` or ```cnn```.
* **epochs:** This is the number of training epochs. A value of at least 200 is recommended because the model trains on only 10% of the dataset in each epoch. The model will output the epoch that gives the best validation loss.
* **filters:** This is the number of filters in each Conv1D layer; it plays a significant role in model size, accuracy, and inference speed.
* **edim:** This is the embedding dimension (`embedding_dim`), i.e. the length of each embedding vector; it plays a significant role in model size, accuracy, and inference speed.
* **embedding:** This determines which type of embedding is used to train the model (a short illustration of the two units follows this list) and can be one of the following:
  * `"grapheme_clusters_tf"`: Use this option when grapheme clusters are the embedding unit.
  * `"codepoints"`: Use this option when the embedding is based on code points.
* **learning-rate:** This determines the model's learning rate. The default is 0.001.
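
To make the difference between the two embedding units concrete, the snippet below splits a short Thai string both ways. It is an illustration only and uses the third-party `regex` package, which is not a dependency of this repository:

```python
import regex  # third-party "regex" package; the standard re module does not support \X

text = "พระเจ้าอยู่หัว"
codepoints = list(text)                 # one embedding unit per Unicode code point
graphemes = regex.findall(r"\X", text)  # one embedding unit per grapheme cluster
print(len(codepoints), len(graphemes))  # grapheme clusters are fewer, larger units
```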


### Model Performance Comparison
**Codepoints**
| Filters | Accuracy | F1 Score | Model Size |
|----------|:---------:|:---------:|----------:|
| 8 | 93.1 | 84.8 | 13 KB |
| 16 | 94.5 | 87.7 | 16 KB |
| 32 | 95.7 | 90.4 | 28 KB |
| 64 | 96.6 | 92.5 | 52 KB |
| 128 | 97.3 | 94.0 | 95 KB |

**Grapheme Clusters**
| Filters | Accuracy | F1 Score | Model Size |
|----------|:---------:|:---------:|----------:|
| 8 | 94.1 | 89.3 | 13 KB |
| 16 | 95.2 | 91.2 | 14 KB |
| 32 | 95.9 | 92.6 | 24 KB |
| 64 | 96.6 | 93.7 | 34 KB |
| 128 | 97.1 | 94.7 | 55 KB |

[Embeddings Discussion](Embeddings%20Discussion.md) gives a detailed comparison of the embedding types.
49 changes: 49 additions & 0 deletions Cloud Usage.md
@@ -0,0 +1,49 @@
To connect the repository to Google Cloud Platform, follow the steps below:

1. In Google Cloud API & Services, enable
- Secret Manager API
- Artifact Registry API
- Vertex AI API
- Cloud Pub/Sub API
- Cloud Build API

2. Create a Google Cloud Storage bucket and upload the dataset to gs://bucket_name/Data/ with the following directory structure (a short upload example follows the tree):
```
Data/
├── Best/
│   ├── article/
│   ├── encyclopedia/
│   ├── news/
│   └── novel/
├── my_test_segmented.txt
├── my_train.txt
└── my_valid.txt
```
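
For example, one file from the layout above could be uploaded with the `google-cloud-storage` Python client. This is a sketch only; `bucket_name` is a placeholder, and the Cloud Console or `gsutil` work just as well:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("bucket_name")
# Mirrors the layout above: local Data/my_train.txt -> gs://bucket_name/Data/my_train.txt
bucket.blob("Data/my_train.txt").upload_from_filename("Data/my_train.txt")
```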

3. In Artifact Registry, create a repository in the same region as the storage bucket.

4. In Cloud Build, create a trigger in the same region as the Artifact Registry repository.
- Choose a suitable event (e.g. Push to a branch)
- Select 2nd gen as the repository generation
- Link the GitHub repository
- Select Dockerfile (for Configuration) and Repository (for Location)
- Dockerfile name: ```Dockerfile```, image name: ```us-central1-docker.pkg.dev/project-name/registry-name/image:latest```
- Enable "Require approval before build executes"
- To build the image manually, press Enable/Run on the created trigger

5. After the image is built and stored in Artifact Registry, select "Train new model" under the Training tab in Vertex AI.
- Training method: keep the default (Custom training) and continue
- Model details: fill in a name and continue
- Training container: select Custom container, browse for the latest built image, link the storage bucket, and under Arguments, modify and paste the following:
```
--path=gs://bucket_name/Data/
--language=Thai
--input-type=BEST
--model-type=cnn
--epochs=200
--filters=32
--name=Thai_codepoints_32
--edim=40
--embedding=codepoints
```
- Hyperparameters: leave unselected and continue
- Compute and pricing: choose existing resources or deploy to a new worker pool
- Prediction container: select "No prediction container" and start training (a programmatic alternative using the Vertex AI SDK is sketched below)
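
As an alternative to the console flow above, the same custom job can be submitted with the Vertex AI Python SDK. This is a sketch under assumed project, region, display-name, and machine-type values, not code that ships with this repository; the arguments mirror the ones listed in step 5:

```python
from google.cloud import aiplatform  # pip install google-cloud-aiplatform

aiplatform.init(project="project-name", location="us-central1",
                staging_bucket="gs://bucket_name")

job = aiplatform.CustomContainerTrainingJob(
    display_name="thai-cnn-training",  # assumed name
    container_uri="us-central1-docker.pkg.dev/project-name/registry-name/image:latest",
)
job.run(
    args=[
        "--path=gs://bucket_name/Data/",
        "--language=Thai",
        "--input-type=BEST",
        "--model-type=cnn",
        "--epochs=200",
        "--filters=32",
        "--name=Thai_codepoints_32",
        "--edim=40",
        "--embedding=codepoints",
    ],
    replica_count=1,
    machine_type="n1-standard-8",  # assumed machine type
)
```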
16 changes: 16 additions & 0 deletions Dockerfile
@@ -0,0 +1,16 @@
FROM us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-16.py310:latest

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
    pkg-config libicu-dev && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /

COPY . /

RUN python3 -m pip install --upgrade pip && \
    pip install -r requirements.txt

ENTRYPOINT ["python3", "train.py"]
Binary file added Figures/cnn.jpg
2 changes: 1 addition & 1 deletion README.md
@@ -11,7 +11,7 @@ In this project, we develop a bi-directional LSTM model for word segmentation. F
train_data="exclusive BEST", eval_data="exclusive BEST")
```

You need to specify three hyper-parameters: `embedding`, `train_data`, and `eval_data`. Please refer to [Models Specicitaions](https://github.com/SahandFarhoodi/word_segmentation/blob/work/Models%20Specifications.md) for a detailed explanation of these hyper-parameters, and also for a list of trained models ready to be used in this repository and their specifications. If you don't have time to do that, just pick one of the trained models and make sure that name of the embedding you choose appears in the model name (`train_data` and `eval-data` doesn't affect segmentation of arbitrary inputs). Next, you can use the following commands to specify your input and segment it:
You need to specify three hyper-parameters: `embedding`, `train_data`, and `eval_data`. Please refer to [Models Specifications](https://github.com/SahandFarhoodi/word_segmentation/blob/work/Models%20Specifications.md) for a detailed explanation of these hyper-parameters, and also for a list of trained models ready to be used in this repository and their specifications. If you don't have time to do that, just pick one of the trained models and make sure that the name of the embedding you choose appears in the model name (`train_data` and `eval_data` don't affect segmentation of arbitrary inputs). Next, you can use the following commands to specify your input and segment it:

```python
line = "ทำสิ่งต่างๆ ได้มากขึ้นขณะที่อุปกรณ์ล็อกและชาร์จอยู่ด้วยโหมดแอมเบียนท์"
51 changes: 51 additions & 0 deletions adaboost_cjk_segmenter/README.md
@@ -0,0 +1,51 @@
## AdaBoost for Cantonese

Relative to BudouX’s n-gram model, the new [radical](https://en.wikipedia.org/wiki/Chinese_character_radicals)-based AdaBoost model reaches comparable accuracy with under half the model size. The radical of a Chinese character is typically the character's semantic component. Moreover, there are only 214 radicals in [kRSUnicode](https://en.wikipedia.org/wiki/Kangxi_radicals), which makes them suitable for lightweight models. Another benefit of using radicals is that, even though the model is trained only on zh-hant data, the radical-based model generalises better, which makes it more suitable for deployment in zh-hant variants such as zh-tw and zh-hk (Cantonese).
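
For reference, the radical of a character is read from the Unihan `kRSUnicode` field via the `cihai` package, exactly as `parser.py` in this directory does; the snippet below is a minimal illustration (the example character is arbitrary):

```python
from cihai.core import Cihai

c = Cihai()
if not c.unihan.is_bootstrapped:
    c.unihan.bootstrap()  # downloads and caches the Unihan database on first use

# kRSUnicode has the form "radical.extra_strokes";
# e.g. 茶 belongs to Kangxi radical 140 (艸, "grass").
print(c.unihan.lookup_char("茶").first().kRSUnicode)
```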

**CITYU Test Dataset (zh-hant)**
| Model | F1-Score | Model Size |
|----------|:--------:|:---------:|
| BudouX | 86.27 | 64 KB |
| Radical-based | 85.82 | 31 KB |
| ICU | 89.46 | 2 MB |

**UDCantonese Dataset (zh-hk)**
| Model | F1-Score | Model Size |
|----------|:--------:|:---------:|
| BudouX | 73.51 | 64 KB |
| Radical-based | 89.76 | 31 KB |
| [PyCantonese](https://github.com/jacksonllee/pycantonese) | 94.98 | 1.3 MB |
| ICU | 79.14 | 2 MB |

### Examples

**Test Case 1 (zh-hant)**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | 一名浙江新昌的茶商說正宗龍井產量有限需求量大價格高而貴州茶品質不差混雜在中間根本分不出來 |
| Manually Segmented | 一 . 名 . 浙江 . 新昌 . 的 . 茶商 . 說 . 正宗 . 龍井 . 產量 . 有限 . 需求量 . 大 . 價格 . 高 . 而 . 貴州茶 . 品質 . 不 . 差 . 混雜 . 在 . 中間 . 根本 . 分 . 不 . 出來 |
| Radical-based | 一 . 名 . 浙江 . 新昌 . 的 . 茶商 . 說 . 正宗 . 龍 . 井 . 產量 . 有限 . 需求 . 量 . 大 . 價格 . 高 . 而 . 貴州 . 茶 . 品質 . 不差 . 混雜 . 在 . 中間 . 根本 . 分 . 不 . 出來 |
| BudouX | 一 . 名 . 浙江 . 新昌 . 的 . 茶商 . 說 . 正宗 . 龍井 . 產量 . 有限 . 需求 . 量 . 大 . 價格 . 高 . 而 . 貴州 . 茶品質 . 不差 . 混雜 . 在 . 中間 . 根本 . 分 . 不 . 出來 |
| ICU | 一名 . 浙江 . 新 . 昌 . 的 . 茶商 . 說 . 正宗 . 龍井 . 產量 . 有限 . 需求量 . 大 . 價格 . 高 . 而 . 貴州 . 茶 . 品質 . 不差 . 混雜 . 在中 . 間 . 根本 . 分 . 不出來 |

**Test Case 2 (zh-hk)**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | 點解你唔將呢句說話-點解你同我講,唔同你隔籬嗰啲人講呀? |
| Manually Segmented | 點解 . 你 . 唔 . 將 . 呢 . 句 . 說話 . - . 點解 . 你 . 同 . 我 . 講 . , . 唔 . 同 . 你 . 隔籬 . 嗰啲 . 人 . 講 . 呀 . ? |
| Radical-based | 點解 . 你 . 唔 . 將 . 呢句 . 說話 . - . 點解 . 你 . 同 . 我 . 講 . , . 唔同 . 你 . 隔籬 . 嗰啲 . 人 . 講 . 呀 . ? |
| BudouX | 點解你 . 唔 . 將 . 呢句 . 說話 . - . 點解你 . 同 . 我 . 講 . , . 唔同 . 你 . 隔籬 . 嗰啲人 . 講呀 . ? |
| ICU | 點 . 解 . 你 . 唔 . 將 . 呢 . 句 . 說話 . - . 點 . 解 . 你 . 同 . 我 . 講 . , . 唔 . 同 . 你 . 隔 . 籬 . 嗰 . 啲 . 人 . 講 . 呀 . ? |
| PyCantonese | 點解 . 你 . 唔 . 將 . 呢 . 句 . 說話 . - . 點解 . 你 . 同 . 我 . 講 . , . 唔同 . 你 . 隔籬 . 嗰啲 . 人 . 講 . 呀 . ? |

### Usage

Set up the environment using ```pip3 install -r requirements.txt```

```python
import json

# AdaBoostSegmenter is defined in parser.py in this directory
from parser import AdaBoostSegmenter

with open('model.json', encoding="utf-8") as f:
    model = json.load(f)
parser = AdaBoostSegmenter(model)
output = parser.predict("一名浙江新昌的茶商說") # [一, 名, 浙江, 新昌, 的, 茶商, 說]
```
1 change: 1 addition & 0 deletions adaboost_cjk_segmenter/model.json

58 changes: 58 additions & 0 deletions adaboost_cjk_segmenter/parser.py
@@ -0,0 +1,58 @@
from cihai.core import Cihai
import json

# Bootstrap the Unihan database (downloaded and cached on first run).
c = Cihai()
if not c.unihan.is_bootstrapped:
    c.unihan.bootstrap()

def get_radical(ch1: str):
    """Return the Kangxi radical number of ch1 from kRSUnicode, or 0 if the character is unknown."""
    char1 = c.unihan.lookup_char(ch1).first()
    if char1 is None:
        return 0
    # kRSUnicode has the form "radical.extra_strokes"; an apostrophe after the
    # radical number marks a simplified-form radical (e.g. "120'.3").
    r1 = char1.kRSUnicode.split(" ")[0]
    if '\'' in r1:
        return r1.split('\'')[0]
    else:
        return r1.split('.')[0]

class AdaBoostSegmenter:
    def __init__(self, model):
        self.model = model

    def predict(self, sentence):
        """Segment a sentence: insert a boundary before position i whenever the weighted feature score is positive."""
        if sentence == '':
            return []
        chunks = [sentence[0]]
        # Decision threshold: minus half of the total feature weight, so a
        # position with no matching features scores negative (no split).
        base_score = -sum(sum(g.values()) for g in self.model.values()) * 0.5

        for i in range(1, len(sentence)):
            score = base_score
            # Chunk-length term: grows quickly with the current chunk length,
            # favouring a split after long chunks.
            L = len(chunks[-1])
            score += 32**L
            # Radical features: radical of the right character paired with the
            # left character, radical of the left character paired with the
            # right character, and the radical pair itself.
            rad4 = get_radical(sentence[i])
            if rad4:
                score += self.model.get('RSRID', {}).get(f'{sentence[i-1]}:{rad4}', 0)
            rad3 = get_radical(sentence[i-1])
            if rad3:
                score += self.model.get('LSRID', {}).get(f'{rad3}:{sentence[i]}', 0)
            if rad3 and rad4:
                score += self.model.get('RAD', {}).get(f'{rad3}:{rad4}', 0)

            # Character features around the candidate boundary: the bigram that
            # spans it and the unigrams at offsets -2 .. +1.
            score += self.model.get('BW2', {}).get(sentence[i - 1:i + 1], 0)
            if i > 1:
                score += self.model.get('UW2', {}).get(sentence[i - 2], 0)
            score += self.model.get('UW3', {}).get(sentence[i - 1], 0)
            score += self.model.get('UW4', {}).get(sentence[i], 0)
            if i + 1 < len(sentence):
                score += self.model.get('UW5', {}).get(sentence[i + 1], 0)

            if score > 0:
                chunks.append(sentence[i])   # word boundary: start a new chunk
            else:
                chunks[-1] += sentence[i]    # no boundary: extend the current chunk
        return chunks

if __name__ == '__main__':
    with open('model.json', encoding="utf-8") as f:
        model = json.load(f)
    parser = AdaBoostSegmenter(model)
    print("_".join(parser.predict("在香港實施「愛國者治港」的過程中,反對派人士被拘捕,獨立媒體停止運作,監察與匿名舉報現象日益增多。")))
13 changes: 13 additions & 0 deletions adaboost_cjk_segmenter/requirements.txt
@@ -0,0 +1,13 @@
appdirs==1.4.4
cihai==0.35.0
greenlet==3.2.4
mypy==1.17.1
mypy_extensions==1.1.0
pathspec==0.12.1
PyYAML==6.0.2
SQLAlchemy==2.0.43
tomli==2.2.1
typing_extensions==4.15.0
unicodecsv==0.14.1
unihan-etl==0.37.0
zhon==2.1.1