Vietnamese Word Tokenize

A Vietnamese word segmentation tool developed by underthesea, a Vietnamese natural language processing research team. The repository gives an end-to-end working example of reading datasets, training machine learning models, and evaluating model performance. It can easily be extended to train your own custom models.

Table of contents

1. Installation
2. Usage
3. References

1. Installation

1.1 Requirements

  • Operating Systems: Linux (Ubuntu, CentOS), Mac
  • Python 3.6
  • Anaconda
  • languageflow==1.1.7

1.2 Download and Setup Environment

Clone project using git

$ git clone https://github.com/undertheseanlp/word_tokenize.git

Create environment and install requirements

$ cd word_tokenize
$ conda create -n word_tokenize python=3.6
$ source activate word_tokenize
$ pip install -r requirements.txt

2. Usage

Make sure you are in the word_tokenize folder and have activated the word_tokenize environment

$ cd word_tokenize
$ source activate word_tokenize

2.1 Using a pre-trained model

$ python word_tokenize.py --text "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò"
$ python word_tokenize.py --fin tmp/input.txt --fout tmp/output.txt
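
If you want to call the tokenizer from your own Python code, the same functionality is also published in the underthesea package. The sketch below is a minimal example under the assumption that the installed package bundles a pre-trained word segmentation model and supports the format="text" option.

# Minimal sketch using the published underthesea package instead of the CLI above
# (assumes the installed package ships a pre-trained word segmentation model).
from underthesea import word_tokenize

text = "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò"

# Returns a list of word tokens; multi-syllable words come back as single items.
tokens = word_tokenize(text)
print(tokens)

# format="text" is assumed to return one string with word-internal spaces
# replaced by underscores, mirroring the CLI output.
print(word_tokenize(text, format="text"))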

2.2 Train a new model

Preprocess the data, then train and test the model

$ python util/preprocess_vlsp2013.py
$ python train.py \
    --train tmp/vlsp2013/train.txt \
    --model tmp/model.bin

Predict with the trained model

$ python word_tokenize.py \
    --fin tmp/input.txt --fout tmp/output.txt \
    --model tmp/model.bin
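
For batch processing without the CLI, a hedged library-based equivalent of the --fin/--fout mode is sketched below. It assumes tmp/input.txt holds one sentence per line and uses the published underthesea package's default model rather than the locally trained tmp/model.bin.

# Hedged sketch of file-based tokenization with the underthesea package.
# Assumptions: tmp/input.txt has one sentence per line, and the package's
# default pre-trained model is acceptable (tmp/model.bin is not loaded here).
from underthesea import word_tokenize

with open("tmp/input.txt", encoding="utf-8") as fin, \
        open("tmp/output.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        sentence = line.strip()
        if not sentence:
            continue
        # Write each sentence back with multi-syllable words joined by "_".
        fout.write(word_tokenize(sentence, format="text") + "\n")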

3. References

To be updated

Last update: May 2018