- vnTagger is an automatic part-of-speech tagger for Vietnamese texts, with high accuracy (around 95%)
- The program is developed in the Java programming language and is platform-independent
- Reference: An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts
This is an unofficial fork of vnTagger, originally written by Le Hong Phuong.
The source code in this repository is currently updated to vnTagger 4.2.0b, released on 05/08/2010. It uses vnTokenizer version 4.1.1c to tokenize texts before tagging.
Changes may have been made compared to the original version.
On a Unix/Linux system, use the provided script vnTagger.sh to run the program; on MS Windows, use vnTagger.bat.
You should provide two arguments to the program: an input text file to be tagged (option -i) and an output file for the program to write the result to (option -o).
For example:
./vnTagger.sh -i samples/0.txt -o samples/0.tagged.xml
Note that the file 0.txt must exist and contain Vietnamese text encoded in UTF-8. The result file 0.tagged.xml is a text file (in a simple XML format) created by the program, and it is always encoded in UTF-8.
- By default, syllables of compound words are separated by spaces; use the -u option to separate them with the _ character instead.
- If you want the result file to be plain text instead of XML, use the -p option.
- If the input text is already tokenized, you can tell vnTagger to skip tokenization by passing the -st option.
Thus, the command
./vnTagger.sh -i samples/0.txt -o samples/0.tagged.xml -u
will produce output with syllables separated by underscore characters.
The command
./vnTagger.sh -i samples/0.txt -o samples/0.tagged.txt -u -p
will produce output with syllables separated by underscore characters and use a plain text output file instead of an XML file.
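If the script needs to be driven from Java code, the command line can be assembled programmatically. The following is a minimal sketch, not part of vnTagger: the class VnTaggerCommand and its build method are hypothetical names, and only the options described above (-i, -o, -u, -p, -st) are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of vnTagger) that assembles a vnTagger.sh command line. */
final class VnTaggerCommand {

    /** Builds the argument list for tagging one file with the documented options. */
    static List<String> build(String input, String output,
                              boolean underscores, boolean plainText,
                              boolean skipTokenization) {
        List<String> cmd = new ArrayList<>();
        cmd.add("./vnTagger.sh");
        cmd.add("-i");
        cmd.add(input);
        cmd.add("-o");
        cmd.add(output);
        if (underscores) cmd.add("-u");        // join syllables with '_'
        if (plainText) cmd.add("-p");          // plain text instead of XML
        if (skipTokenization) cmd.add("-st");  // input is already tokenized
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("samples/0.txt", "samples/0.tagged.txt", true, true, false);
        System.out.println(String.join(" ", cmd));
        // To actually run the tagger from the repository root, one could use:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

With the arguments shown in main, the assembled command is the same as the second example above. On MS Windows, the first element would be vnTagger.bat instead.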
If you want to test the accuracy of the tagger on a correctly tagged file, use the argument -t on the file to test, for example:
./vnTagger.sh -t samples/1.tagged.txt
Results of the test will be written to standard output. Note that the test file needs to be a plain text file in which syllables are separated by underscores and words are separated by spaces.
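To illustrate what an accuracy figure over such a file means, here is a rough sketch of token-level accuracy for one line. It is not vnTagger's own test code and rests on two assumptions: that each token is written as word/TAG (the separator is not specified in this document), and that accuracy is the fraction of tokens whose predicted tag matches the reference tag. The class name, method names, and the sample tokens (written without diacritics to keep the example ASCII-only) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of token-level tagging accuracy; not vnTagger's own code. */
final class TaggedLineAccuracy {

    /** Extracts the tags from one line, assuming space-separated "word/TAG" tokens. */
    static List<String> tagsOf(String line) {
        List<String> tags = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/'); // last slash, so a word may itself contain '/'
            if (slash >= 0) {
                tags.add(token.substring(slash + 1));
            }
        }
        return tags;
    }

    /** Fraction of positions where the predicted tag equals the reference tag. */
    static double accuracy(String goldLine, String predictedLine) {
        List<String> gold = tagsOf(goldLine);
        List<String> predicted = tagsOf(predictedLine);
        if (gold.isEmpty() || gold.size() != predicted.size()) {
            throw new IllegalArgumentException(
                    "lines must contain the same, non-zero number of tokens");
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }

    public static void main(String[] args) {
        String gold = "sinh_vien/N doc/V sach/N";
        String predicted = "sinh_vien/N doc/V sach/V";
        // Two of the three tags agree in this made-up sample.
        System.out.println(accuracy(gold, predicted));
    }
}
```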
The main class of the tagger is vn.hus.nlp.tagger.VietnameseMaxentTagger. This class provides three instance methods to tag text:
- Tag a text and return a tagged string:
  public String tagText(String text)
- Tag an input text file and write the result to an output file, using an outputer:
  public void tagFile(String inputFile, String outputFile, IOutputer outputer)
- Tag an input text file and write the result to an output file, using a default plain outputer:
  public void tagFile(String inputFile, String outputFile)
- It also provides a method for testing a tagged file:
  public void testFile(String filename)
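As a rough sketch of how tagText from the list above might be called, the example below loads the tagger class by name. Reflection is used only so that the sketch compiles and runs without the vnTagger classes on the classpath; in a real project the methods would normally be called directly on a VietnameseMaxentTagger instance. The wrapper class TagTextSketch is hypothetical, and the public no-argument constructor is an assumption that this document does not confirm.

```java
import java.lang.reflect.Method;
import java.util.Optional;

/** Hypothetical wrapper around tagText; not part of vnTagger. */
final class TagTextSketch {

    static final String TAGGER_CLASS = "vn.hus.nlp.tagger.VietnameseMaxentTagger";

    /** Returns the tagged text, or Optional.empty() if the tagger class is not on the classpath. */
    static Optional<String> tag(String text) {
        try {
            Class<?> cls = Class.forName(TAGGER_CLASS);
            // Assumption: the class has a public no-argument constructor.
            Object tagger = cls.getDeclaredConstructor().newInstance();
            // Signature from the list above: public String tagText(String text)
            Method tagText = cls.getMethod("tagText", String.class);
            return Optional.ofNullable((String) tagText.invoke(tagger, text));
        } catch (ClassNotFoundException e) {
            return Optional.empty();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("vnTagger was found but could not be called", e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java TagTextSketch \"<text to tag>\"");
            return;
        }
        System.out.println(tag(args[0]).orElse("vnTagger is not on the classpath"));
    }
}
```

The two tagFile variants and testFile could be invoked the same way, using the parameter types given in their signatures.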
The tagset in use contains 17 main lexical tags:
- Np - Proper noun
- Nc - Classifier
- Nu - Unit noun
- N - Common noun
- V - Verb
- A - Adjective
- P - Pronoun
- R - Adverb
- L - Determiner
- M - Numeral
- E - Preposition
- C - Subordinating conjunction
- CC - Coordinating conjunction
- I - Interjection
- T - Auxiliary, modal words
- Y - Abbreviation
- Z - Bound morphemes
- X - Unknown
There are also tags for delimiters and punctuation.
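When post-processing tagged output, it can be convenient to map a tag back to its description. The enum below is a hypothetical sketch that simply mirrors the entries listed above; it is not a type provided by vnTagger, and the delimiter and punctuation tags are deliberately left out because they are not enumerated in this document.

```java
import java.util.Optional;

/** The lexical tags listed above, as a hypothetical enum (not provided by vnTagger). */
enum LexicalTag {
    Np("Proper noun"),
    Nc("Classifier"),
    Nu("Unit noun"),
    N("Common noun"),
    V("Verb"),
    A("Adjective"),
    P("Pronoun"),
    R("Adverb"),
    L("Determiner"),
    M("Numeral"),
    E("Preposition"),
    C("Subordinating conjunction"),
    CC("Coordinating conjunction"),
    I("Interjection"),
    T("Auxiliary, modal words"),
    Y("Abbreviation"),
    Z("Bound morphemes"),
    X("Unknown");

    private final String description;

    LexicalTag(String description) {
        this.description = description;
    }

    String description() {
        return description;
    }

    /** Case-sensitive lookup; returns empty for tags outside the list (e.g. punctuation tags). */
    static Optional<LexicalTag> fromTag(String tag) {
        for (LexicalTag t : values()) {
            if (t.name().equals(tag)) {
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }
}
```

For example, LexicalTag.fromTag("CC") yields the constant whose description is "Coordinating conjunction", while a tag that is not in the list yields an empty Optional rather than an exception.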
- Added option to POS tag pre-tokenized text (skip tokenization).
- Upgrade to use with Stanford Tagger 2.0.
- Upgrade the tokenizer module to vnTokenizer 4.1.1.
- Update resources
- Upgrade the tokenizer module to vnTokenizer 4.1.0 (with a minor bug fixed).
- Update resources, use a richer feature set for tagging texts, especially Vietnamese-specific features.
- Much better tagging results.
- Upgrade the tokenizer module to vnTokenizer 4.1
- Update resources.
See the LICENSE file.