vnTagger - POS Tagging for Vietnamese
- vnTagger is an automatic part-of-speech tagger for Vietnamese texts with high accuracy (around 95%)
- The program is developed in the Java programming language and is platform-independent
- Reference: An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts
About this repository
Changes may have been made compared to the original version.
On a Unix/Linux system, use the provided script
vnTagger.sh to run the program; on MS Windows, use
Tag a text file
Provide two arguments to the program: an input text file to be tagged (option
-i) and an output file for the program to write the result to (option -o):
./vnTagger.sh -i samples/0.txt -o samples/0.tagged.xml
Note that the file
0.txt must exist and contain Vietnamese text encoded in UTF-8. The result file
0.tagged.xml is a text file in a simple XML format; it is created by the program and is always encoded in UTF-8.
- By default, syllables of compound words are separated by spaces; use the option
-u to separate them by underscores.
- If you want the result file to be plain text instead of XML, use the option -p.
- If the input text is already tokenized, you can tell vnTagger to skip tokenization by using the
Thus, the command
./vnTagger.sh -i samples/0.txt -o samples/0.tagged.xml -u
will produce output with syllables separated by underscore characters, while the command
./vnTagger.sh -i samples/0.txt -o samples/0.tagged.txt -u -p
will produce output with syllables separated by underscore characters and write it to a plain text file instead of an XML file.
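When the tagger is driven from another Java program, the command line described above can be assembled programmatically. The following is only a minimal sketch: the class `VnTaggerCommand` and its `build` method are hypothetical helpers, not part of vnTagger, and the script path assumes the program is started from the vnTagger directory. Only the options shown in the commands above (-i, -o, -u, -p) are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of vnTagger: it only assembles the arguments.
class VnTaggerCommand {

    static List<String> build(String inputFile, String outputFile,
                              boolean underscores, boolean plainText) {
        List<String> command = new ArrayList<>();
        command.add("./vnTagger.sh"); // assumed to be in the working directory
        command.add("-i");
        command.add(inputFile);
        command.add("-o");
        command.add(outputFile);
        if (underscores) {
            command.add("-u"); // separate syllables of compound words by underscores
        }
        if (plainText) {
            command.add("-p"); // plain text output instead of XML
        }
        return command;
    }

    public static void main(String[] args) {
        List<String> command = build("samples/0.txt", "samples/0.tagged.txt", true, true);
        System.out.println(String.join(" ", command));
        // To actually launch the script (main would then need "throws Exception"):
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Keeping the argument list separate from the process launch makes it easy to log or check the command before running it.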
Test a tagged file
If you want to test the accuracy of the tagger on a correctly tagged file, use the argument
-t on the file to test, for example:
./vnTagger.sh -t samples/1.tagged.txt
Results of the test will be printed to the standard output. Note that the test file needs to be a plain text file in which syllables are separated by underscores and words are separated by spaces.
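For readers who want to see what an accuracy figure of this kind measures, here is a minimal sketch of per-token tagging accuracy: the fraction of tokens whose predicted tag equals the reference tag. The class `TagAccuracy` is hypothetical and this is not necessarily how vnTagger computes its own test results; reading the tags out of the test file is left out because the exact word/tag notation is not described here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not vnTagger's own evaluation code.
class TagAccuracy {

    // Fraction of positions where the predicted tag equals the gold tag.
    static double accuracy(List<String> goldTags, List<String> predictedTags) {
        if (goldTags.size() != predictedTags.size()) {
            throw new IllegalArgumentException("tag sequences must have the same length");
        }
        if (goldTags.isEmpty()) {
            return 0.0; // nothing to score
        }
        int correct = 0;
        for (int i = 0; i < goldTags.size(); i++) {
            if (goldTags.get(i).equals(predictedTags.get(i))) {
                correct++;
            }
        }
        return (double) correct / goldTags.size();
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("Np", "V", "N", "A");
        List<String> predicted = Arrays.asList("Np", "V", "N", "R");
        System.out.println(accuracy(gold, predicted)); // 3 of 4 tags match
    }
}
```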
API for Developers
The main class of the tagger is
vn.hus.nlp.tagger.VietnameseMaxentTagger. This class provides three instance methods to tag text:
Tag a text and return a tagged string:
public String tagText(String text)
Tag an input text file and write the result to an output file, using an outputer:
public void tagFile(String inputFile, String outputFile, IOutputer outputer)
Tag an input text file and write the result to an output file, using a default plain outputer:
public void tagFile(String inputFile, String outputFile)
And a method for testing a tagged file:
public void testFile(String filename)
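The sketch below shows one way calling code might use `tagText`. To keep it runnable without vnTagger on the classpath, it goes through a small adapter interface, `TextTagger`, which is hypothetical and mirrors only the `tagText` signature listed above. The stub tagger and its word/X output are placeholders, not the real model or its output format, and the no-argument constructor in the comment is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical adapter: mirrors only the documented tagText(String) signature.
interface TextTagger {
    String tagText(String text);
}

class TagSentences {

    // Tags each sentence with the given tagger and collects the results in order.
    static List<String> tagAll(TextTagger tagger, List<String> sentences) {
        List<String> tagged = new ArrayList<>();
        for (String sentence : sentences) {
            tagged.add(tagger.tagText(sentence));
        }
        return tagged;
    }

    public static void main(String[] args) {
        // Stand-in tagger so the sketch runs on its own; it labels every
        // whitespace-separated token with X and is NOT the real model.
        TextTagger stub = text -> text.trim().replaceAll("\\s+", "/X ") + "/X";

        // With vnTagger available, and assuming a no-argument constructor:
        // VietnameseMaxentTagger real = new VietnameseMaxentTagger();
        // TextTagger tagger = real::tagText;

        for (String line : tagAll(stub, Arrays.asList("xin chao", "cam on"))) {
            System.out.println(line);
        }
    }
}
```

Routing calls through an interface like this also lets unit tests replace the tagger with a stub, since the real model may be slow to load.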
The tagset in use contains 17 main lexical tags, listed below together with X for unknown words:
- Np - Proper noun
- Nc - Classifier
- Nu - Unit noun
- N - Common noun
- V - Verb
- A - Adjective
- P - Pronoun
- R - Adverb
- L - Determiner
- M - Numeral
- E - Preposition
- C - Subordinating conjunction
- CC - Coordinating conjunction
- I - Interjection
- T - Auxiliary, modal words
- Y - Abbreviation
- Z - Bound morphemes
- X - Unknown
There are also tags for delimiters and punctuation.
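When post-processing tagged output, it can be handy to turn a tag code into its description. The following sketch stores the tags listed above in a lookup table. The class `VnTagset` and its `describe` method are hypothetical and not part of vnTagger; tags for delimiters and punctuation are not enumerated here, so they fall through to a generic label.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table built from the tag list in this document.
class VnTagset {

    private static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();

    static {
        DESCRIPTIONS.put("Np", "Proper noun");
        DESCRIPTIONS.put("Nc", "Classifier");
        DESCRIPTIONS.put("Nu", "Unit noun");
        DESCRIPTIONS.put("N", "Common noun");
        DESCRIPTIONS.put("V", "Verb");
        DESCRIPTIONS.put("A", "Adjective");
        DESCRIPTIONS.put("P", "Pronoun");
        DESCRIPTIONS.put("R", "Adverb");
        DESCRIPTIONS.put("L", "Determiner");
        DESCRIPTIONS.put("M", "Numeral");
        DESCRIPTIONS.put("E", "Preposition");
        DESCRIPTIONS.put("C", "Subordinating conjunction");
        DESCRIPTIONS.put("CC", "Coordinating conjunction");
        DESCRIPTIONS.put("I", "Interjection");
        DESCRIPTIONS.put("T", "Auxiliary, modal words");
        DESCRIPTIONS.put("Y", "Abbreviation");
        DESCRIPTIONS.put("Z", "Bound morphemes");
        DESCRIPTIONS.put("X", "Unknown");
    }

    // Returns the description of a listed tag, or a generic label otherwise.
    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "Delimiter, punctuation, or unlisted tag");
    }

    public static void main(String[] args) {
        System.out.println("Np: " + describe("Np"));
        System.out.println("CC: " + describe("CC"));
    }
}
```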
- Added option to POS tag pre-tokenized text (skip tokenization).
- Upgrade to use with Stanford Tagger 2.0.
- Upgrade the tokenizer module to vnTokenizer 4.1.1.
- Update resources
- Upgrade the tokenizer module to vnTokenizer 4.1.0 (with a minor bug fixed).
- Update resources, use a richer feature set for tagging texts, especially Vietnamese-specific features.
- Much better tagging results.
- Upgrade the tokenizer module to vnTokenizer 4.1
- Update resources.
See the LICENSE file.