Thai word segmentation with bi-directional RNN

This repository contains code for preprocessing data, training a model, and inferring word boundaries in Thai text with a bi-directional recurrent neural network. The model achieves a precision of 98.94%, a recall of 99.28%, and an F1 score of 99.11%. See the blog post for a detailed description of the model.
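As a rough illustration of how such scores relate to one another, the sketch below computes precision, recall and F1 over per-character word-start labels. The README does not say whether the reported figures are measured per boundary or per word, so the per-boundary convention here is an assumption, and the function names `boundary_scores` and `f1_from` are hypothetical and not part of this repository.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def boundary_scores(true_labels, pred_labels):
    """Return (precision, recall, f1) over word-start labels.

    Assumption: each label is 1 if a word starts at that character, else 0.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # boundary found correctly
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # spurious boundary
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed boundary
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


if __name__ == "__main__":
    # Toy labels, not taken from any corpus.
    print(boundary_scores([1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0]))
    # Consistency check on the figures reported in this README.
    print(round(f1_from(0.9894, 0.9928), 4))
```

The reported figures are internally consistent: the harmonic mean of 98.94% and 99.28% rounds to 99.11%.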

Requirements

Python 3.4
TensorFlow 1.4
NumPy 1.13
scikit-learn 0.18

Files

preprocess.py: Preprocess corpus for model training
train.py: Train the Thai word segmentation model
predict_example.py: Example usage of the model to segment Thai words
saved_model: Pretrained model weights
thainlplib/labeller.py: Methods for preprocessing the corpus
thainlplib/model.py: Methods for training the model

Note that the InterBEST 2009 corpus is not included, but can be downloaded from the NECTEC website.
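The preprocessing step can be pictured as turning segmented text into one label per character. The following is a minimal sketch of that idea, not the code in thainlplib/labeller.py: it assumes words are separated by a `|` delimiter and that a label of 1 marks the first character of a word. The function name `label_boundaries` is hypothetical.

```python
def label_boundaries(segmented, delimiter="|"):
    """Convert delimiter-separated words into (raw_text, labels).

    Assumption: labels[i] is 1 if raw_text[i] starts a word, else 0.
    """
    chars = []
    labels = []
    for word in segmented.split(delimiter):
        if not word:
            continue  # skip empty pieces from leading, trailing or doubled delimiters
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == 0 else 0)
    return "".join(chars), labels


if __name__ == "__main__":
    text, labels = label_boundaries("ฉัน|กิน|ข้าว|")
    print(text)
    print(labels)
```

Labelling at the character level means the model never needs a dictionary: it only has to decide, for each character, whether a new word begins there.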

Usage

To try the prediction demo, run python3 predict_example.py. To preprocess the data and then train and save the model, put the data files under the data directory, then run python3 preprocess.py followed by python3 train.py.
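Once a model has produced a boundary decision for every character, recovering the words is a simple grouping step. The sketch below assumes the output has already been reduced to one 0/1 label per character (1 meaning a word starts there); the README does not describe the actual output format of predict_example.py, and `split_by_boundaries` is a hypothetical helper.

```python
def split_by_boundaries(text, labels):
    """Group characters into words using per-character word-start labels."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    words = []
    for ch, label in zip(text, labels):
        if label == 1 or not words:
            words.append(ch)   # start a new word (the first character always does)
        else:
            words[-1] += ch    # extend the current word
    return words


if __name__ == "__main__":
    # Toy input with hand-written labels, not real model output.
    print("|".join(split_by_boundaries("abcde", [1, 0, 1, 0, 0])))
```

Treating the first character as a word start even when its label is 0 keeps the output well-formed if the model misses the initial boundary.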

Bug fixes and updates

3/10/2019: Switched license to MIT
1/6/2018: Fixed a bug in preprocess.py that split the data incorrectly. The retrained model achieves precision 98.94%, recall 99.28% and F1 score 99.11%. Thanks to Ekkalak Thongthanomkul for the bug report.
1/6/2018: predict_example.py now loads the model variables by signature name.

Contributors

Jussi Jousimo
Natsuda Laokulrat
Ben Carr
Ekkalak Thongthanomkul
Vee Satayamas

License

MIT

sertiscorp/thai-word-segmentation
