vnTagger - POS Tagging for Vietnamese

Introduction

vnTagger is an automatic part-of-speech tagger for Vietnamese text with high accuracy (around 95%).
The program is written in Java and is platform-independent.
Reference: An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts

About this repository

This is an unofficial fork of vnTagger, originally written by Le Hong Phuong.

The source code in this repository is currently updated to vnTagger 4.2.0b, released on 05/08/2010. It uses vnTokenizer version 4.1.1c to tokenize texts before tagging.

This fork may contain changes compared to the original version.

Usage

On a Unix/Linux system, use the provided script vnTagger.sh to run the program. On MS Windows, use vnTagger.bat.

Tag a text file

You should provide two arguments to the program: an input text file to be tagged (option -i) and an output file for the program to write the result to (option -o).

For example:

./vnTagger.sh -i samples/0.txt -o samples/0.tagged.xml

Note that the file 0.txt must exist and contain Vietnamese text encoded in UTF-8. The result file 0.tagged.xml is a text file in a simple XML format, created by the program and always encoded in UTF-8.

By default, syllables of compound words are separated by spaces. Use the option -u to separate them with the _ character instead.
If you want the result file to be plain text instead of XML, use the option -p.
If the input text is already tokenized, you can tell vnTagger to skip tokenization by passing the -st option.

Thus, the command

./vnTagger.sh -i samples/0.txt -o samples/0.tagged.xml -u

will produce output with syllables separated by underscore characters.

The command

./vnTagger.sh -i samples/0.txt -o samples/0.tagged.txt -u -p

will produce output with syllables separated by underscore characters and use a plain text output file instead of an XML file.
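
If you need to launch the script from another Java program, the options described above can be assembled into an argument list. The following is only a minimal sketch: the class VnTaggerCommand and its build method are hypothetical helpers, not part of vnTagger, and it assumes that vnTagger.sh sits in the working directory and that the flags can be combined freely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of vnTagger): builds the argument list
// for vnTagger.sh from the options documented in this README.
class VnTaggerCommand {

    static List<String> build(String input, String output,
                              boolean underscores, boolean plainText,
                              boolean skipTokenization) {
        List<String> cmd = new ArrayList<>();
        cmd.add("./vnTagger.sh");
        cmd.add("-i");
        cmd.add(input);
        cmd.add("-o");
        cmd.add(output);
        if (underscores) cmd.add("-u");        // join syllables with _
        if (plainText) cmd.add("-p");          // plain text instead of XML
        if (skipTokenization) cmd.add("-st");  // input is already tokenized
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("samples/0.txt", "samples/0.tagged.txt",
                                 true, true, false);
        System.out.println(String.join(" ", cmd));
        // To actually run the script (assumes it is in the working directory):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the list to ProcessBuilder rather than concatenating a single string avoids quoting problems when file names contain spaces.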

Test a tagged file

If you want to test the accuracy of the tagger on a correctly tagged file, use the argument -t on the file to test, for example:

./vnTagger.sh -t samples/1.tagged.txt

Results of the test are printed to the standard output. Note that the test file needs to be a plain text file in which syllables are separated by underscores and words are separated by spaces.
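
To make the notion of tagging accuracy concrete, the sketch below compares a gold line with a predicted line token by token. It is not the tagger's own evaluation code. It assumes each token is written as word/TAG (the tag delimiter is an assumption, as this README does not specify it), and the class TagAccuracy and the sample lines are purely illustrative.

```java
// Illustrative sketch (not vnTagger's evaluation code): token-level accuracy
// between a gold line and a predicted line in an ASSUMED word/TAG format.
class TagAccuracy {

    // Returns the text after the last '/', or "" if the token has no '/'.
    static String tagOf(String token) {
        int slash = token.lastIndexOf('/');
        return slash < 0 ? "" : token.substring(slash + 1);
    }

    // Fraction of tokens whose predicted tag equals the gold tag.
    static double accuracy(String goldLine, String predictedLine) {
        String[] gold = goldLine.trim().split("\\s+");
        String[] pred = predictedLine.trim().split("\\s+");
        if (gold.length != pred.length) {
            throw new IllegalArgumentException("Token counts differ");
        }
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (tagOf(gold[i]).equals(tagOf(pred[i]))) {
                correct++;
            }
        }
        return (double) correct / gold.length;
    }

    public static void main(String[] args) {
        // Made-up sample: syllables joined by underscores, words by spaces.
        String gold = "ba/M anh/N tham_gia/V kinh_doanh/N";
        String pred = "ba/M anh/N tham_gia/V kinh_doanh/V";
        System.out.println(accuracy(gold, pred)); // prints 0.75
    }
}
```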

API for Developers

The main class of the tagger is vn.hus.nlp.tagger.VietnameseMaxentTagger. This class provides three instance methods to tag text:

Tag a text and return a tagged string:
```
 public String tagText(String text)
```
Tag an input text file and write the result to an output file, using an outputer:
```
 public void tagFile(String inputFile, String outputFile, IOutputer outputer)
```
Tag an input text file and write the result to an output file, using a default plain outputer:
```
 public void tagFile(String inputFile, String outputFile)
```
It also provides a method to test a tagged file:
```
 public void testFile(String filename)
```
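
As a rough illustration of calling tagText, the sketch below looks the tagger class up by name so that it compiles and runs even when the vnTagger jar is not on the classpath. It assumes the class has a public no-argument constructor, which this README does not confirm. The class TagTextDemo and its tag method are hypothetical wrappers.

```java
import java.lang.reflect.Method;
import java.util.Optional;

// Hypothetical wrapper: calls tagText reflectively so this sketch still
// compiles and runs when the vnTagger classes are not available.
class TagTextDemo {

    static final String TAGGER_CLASS = "vn.hus.nlp.tagger.VietnameseMaxentTagger";

    // Returns the tagged string, or empty if the tagger could not be used.
    static Optional<String> tag(String text) {
        try {
            Class<?> cls = Class.forName(TAGGER_CLASS);
            // ASSUMPTION: a public no-argument constructor exists.
            Object tagger = cls.getDeclaredConstructor().newInstance();
            Method tagText = cls.getMethod("tagText", String.class);
            return Optional.ofNullable((String) tagText.invoke(tagger, text));
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class missing, constructor mismatch, or tagger failure.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<String> tagged = tag("Some Vietnamese text encoded in UTF-8");
        System.out.println(tagged.orElse("vnTagger is not on the classpath"));
    }
}
```

When the jar is on the compile-time classpath, you would normally import the class and call tagText directly instead of going through reflection.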

Tagset

The tagset in use contains 17 main lexical tags, plus X for unknown words:

Np - Proper noun
Nc - Classifier
Nu - Unit noun
N - Common noun
V - Verb
A - Adjective
P - Pronoun
R - Adverb
L - Determiner
M - Numeral
E - Preposition
C - Subordinating conjunction
CC - Coordinating conjunction
I - Interjection
T - Auxiliary, modal words
Y - Abbreviation
Z - Bound morphemes
X - Unknown
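
When post-processing tagger output, it can be handy to have the list above available as a lookup table. The following is a minimal sketch that copies the tags and descriptions from this list. The class Tagset and the describe method are hypothetical names and are not part of the vnTagger API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table (not part of vnTagger) built from the tag list
// in this README; insertion order matches the order of the list.
class Tagset {

    static final Map<String, String> TAGS = new LinkedHashMap<>();

    static {
        TAGS.put("Np", "Proper noun");
        TAGS.put("Nc", "Classifier");
        TAGS.put("Nu", "Unit noun");
        TAGS.put("N", "Common noun");
        TAGS.put("V", "Verb");
        TAGS.put("A", "Adjective");
        TAGS.put("P", "Pronoun");
        TAGS.put("R", "Adverb");
        TAGS.put("L", "Determiner");
        TAGS.put("M", "Numeral");
        TAGS.put("E", "Preposition");
        TAGS.put("C", "Subordinating conjunction");
        TAGS.put("CC", "Coordinating conjunction");
        TAGS.put("I", "Interjection");
        TAGS.put("T", "Auxiliary, modal words");
        TAGS.put("Y", "Abbreviation");
        TAGS.put("Z", "Bound morphemes");
        TAGS.put("X", "Unknown");
    }

    // Returns the description of a tag, or a fallback for tags that are
    // not in the list (for example delimiter and punctuation tags).
    static String describe(String tag) {
        return TAGS.getOrDefault(tag, "Not in the main tag list");
    }

    public static void main(String[] args) {
        System.out.println("Np = " + describe("Np")); // prints Np = Proper noun
        System.out.println("CC = " + describe("CC")); // prints CC = Coordinating conjunction
    }
}
```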

There are also tags for delimiters and punctuation.

Change Log

13/03/2012

Added option to POS tag pre-tokenized text (skip tokenization).

01/04/2010

Upgrade to work with Stanford Tagger 2.0.

25/12/2009

Upgrade the tokenizer module to vnTokenizer 4.1.1.
Update resources.

30/11/2009

Upgrade the tokenizer module to vnTokenizer 4.1.0 (with a minor bug fixed).
Update resources, use a richer feature set for tagging texts, especially Vietnamese-specific features.
Much better tagging results.

18/07/2009

Upgrade the tokenizer module to vnTokenizer 4.1.
Update resources.

LICENSE

See the LICENSE file.
