JTCC is a Java library for tokenizing Thai text into a list of TCCs. The rules used to determine TCC boundaries are implemented as a grammar using ANTLR.
What is TCC?
TCC, or Thai Character Cluster (proposed in "Character Cluster Based Thai Information Retrieval"), is a group of inseparable Thai characters. This inseparability derives from the Thai writing system itself and is independent of any context. As a result, TCCs can be determined by a simple list of rules describing, e.g., which characters must follow or precede other characters.
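To give a feel for what such rules look like, here is a minimal toy sketch. It is not the ANTLR grammar that JTCC uses, and the class and method names are made up for illustration. It assumes only two kinds of rules: a leading vowel (เ แ โ ใ ไ) is grouped with the character that follows it, and an above/below vowel or tone mark is grouped with the character before it. Thai strings are written as Unicode escapes so the source compiles regardless of file encoding. For the input "ถุงให้" this toy version produces the clusters ถุ, ง, ให้, but it will not match JTCC in general (for example, it ignores vowels that require a following consonant).

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT the grammar used by JTCC.
class ToyTccSketch {

    // Leading vowels U+0E40..U+0E44 are written before their consonant.
    static boolean isLeadingVowel(char c) {
        return c >= '\u0E40' && c <= '\u0E44';
    }

    // Above/below vowels and tone marks that attach to the preceding
    // character: U+0E31, U+0E34..U+0E3A and U+0E47..U+0E4E.
    static boolean isDependentMark(char c) {
        return c == '\u0E31'
            || (c >= '\u0E34' && c <= '\u0E3A')
            || (c >= '\u0E47' && c <= '\u0E4E');
    }

    static List<String> tokenize(String text) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean awaitingConsonant = false; // true right after a leading vowel
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (current.length() > 0 && (awaitingConsonant || isDependentMark(c))) {
                // c cannot be separated from the cluster being built
                current.append(c);
                awaitingConsonant = false;
            } else {
                // c starts a new cluster; flush the previous one first
                if (current.length() > 0) {
                    clusters.add(current.toString());
                }
                current.setLength(0);
                current.append(c);
                awaitingConsonant = isLeadingVowel(c);
            }
        }
        if (current.length() > 0) {
            clusters.add(current.toString());
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Input is the six code points U+0E16 U+0E38 U+0E07 U+0E43 U+0E2B U+0E49
        String input = "\u0E16\u0E38\u0E07\u0E43\u0E2B\u0E49";
        // Print with a delimiter at the end of each cluster
        System.out.println(String.join("|", tokenize(input)) + "|");
    }
}
```

The real rule set is larger than this; the sketch only shows that each decision depends on neighbouring characters and not on word-level context.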
Output TCCs: ฉัน|ฝา|ก|ข|ว|ด|ขี้|ผึ้|ง|ใส่|ถุ|ง|ให้|เศ|ร|ษ|ฐี|
Output TCCs: สะ|ช้ะ|มา|บ้า|กิ|ถิ้|บี|ดี้|ขึ|ง|ทึ่|ง|รือ|ขื่อ|กุ|ตุ้|บ|สู|ตู่|เละ|เส๊ะ|เข|เป้|
Note that we only put the delimiter at the end of each TCC.
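Because the delimiter appears only at the end of each TCC, the output can be turned back into a list by splitting on "|". The helper below is a small sketch of that, with a hypothetical class name; it is not part of JTCC. It relies on the fact that Java's String.split discards trailing empty strings, so the final delimiter does not produce an empty cluster. The sample string is "ถุ|ง|ให้|", written with Unicode escapes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for reading JTCC-style delimited output.
class TccOutputParser {

    // Splits "tcc|tcc|...|" into its clusters.
    static List<String> parse(String delimited) {
        if (delimited.isEmpty()) {
            // "".split(...) would return one empty string, so handle it here
            return Collections.emptyList();
        }
        // split() drops trailing empty strings, so the last "|" is harmless
        return Arrays.asList(delimited.split("\\|"));
    }

    public static void main(String[] args) {
        List<String> tccs = parse("\u0E16\u0E38|\u0E07|\u0E43\u0E2B\u0E49|");
        System.out.println(tccs.size() + " clusters");
        // Joining the clusters without a delimiter restores the input text
        System.out.println(String.join("", tccs));
    }
}
```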
Applications of TCCs
TCCs by themselves are of no direct use to end users. They are mostly used inside a larger natural language processing system, as the first step in processing input text. An obvious merit of TCCs is that they can be used to eliminate impossible word boundary positions in running text, since a word boundary can never fall inside a TCC.
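As a rough sketch of how a downstream word segmenter might use this, the code below computes the character offsets at which TCCs end and checks whether a candidate word segmentation only cuts at those offsets. The class and method names are hypothetical and not part of JTCC. The example uses the clusters ถุ, ง, ให้ (written as Unicode escapes), whose end offsets are 2, 3 and 6.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: uses TCC ends to filter word boundary candidates.
class TccBoundaries {

    // Returns the offsets (in chars) at which a TCC ends.
    static Set<Integer> allowedBoundaries(List<String> tccs) {
        Set<Integer> offsets = new TreeSet<>();
        int pos = 0;
        for (String tcc : tccs) {
            pos += tcc.length();
            offsets.add(pos);
        }
        return offsets;
    }

    // True if every word in the candidate segmentation ends at a TCC end.
    static boolean isCompatible(List<String> tccs, List<String> words) {
        Set<Integer> allowed = allowedBoundaries(tccs);
        int pos = 0;
        for (String word : words) {
            pos += word.length();
            if (!allowed.contains(pos)) {
                return false; // this word would end inside a TCC
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> tccs = Arrays.asList("\u0E16\u0E38", "\u0E07", "\u0E43\u0E2B\u0E49");
        // Candidate that cuts after the third character (a TCC end)
        List<String> good = Arrays.asList("\u0E16\u0E38\u0E07", "\u0E43\u0E2B\u0E49");
        // Candidate that cuts after the first character (inside a TCC)
        List<String> bad = Arrays.asList("\u0E16", "\u0E38\u0E07\u0E43\u0E2B\u0E49");
        System.out.println(isCompatible(tccs, good));
        System.out.println(isCompatible(tccs, bad));
    }
}
```

A segmenter can discard any candidate for which isCompatible returns false before doing more expensive scoring.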
Calling JTCC from the command line is as simple as running any executable JAR file. The command-line interface has three modes:
- Tokenize input from stdin
- Tokenize the content in a file
- Tokenize the string passed as a command line argument
The general usage format is
java -jar JTCC-x.x.jar <mode_keyword> [argument]
Replace x.x with the version of JTCC in use.
Tokenize input from stdin
echo "Some input here" | java -jar JTCC-x.x.jar stdin
This tokenizes the input passed via stdin and writes the result to stdout (the screen by default).
Tokenize the content in a file
java -jar JTCC-x.x.jar file C:/thaitext.txt
This tokenizes the content of the file at C:/thaitext.txt and outputs the result to the screen.
Tokenize a specified input string
java -jar JTCC-x.x.jar content "ตรงนี้เป็นเนื้อหาที่ต้องการตัด TCC. Content to tokenize into TCCs here."
This tokenizes whatever string comes after the keyword "content" and outputs the result to the screen.
JTCC is not a mature project, nor does it provide a standard way of grouping inseparable Thai characters.
The term "inseparable" is, in fact, ambiguous in some cases. For example, given the input "ถุงให้", the original definition of TCC yields the output "ถุ|ง|ให้|". However, some might argue that the delimiter after "ถุ" can be removed without much effort, giving "ถุง|ให้|". One way to do so might be to look ahead one more character, which in this case is "ใ". Since "ใ" cannot be grouped with "ง" (i.e., it is impossible to have "งใ"), it might be tempting to attach "ง" to the previous TCC, thus forming "ถุง".
I agree that this argument makes sense. However, the goal of this project is to create a library capable of tokenizing input text into TCCs. The idea mentioned above seems to go beyond TCCs (probably to the syllable level). Therefore, we will stick with the global, context-independent TCC tokenizing rules for now. At the very least, the look-ahead strategy described above will not be implemented in the near future.
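For readers who want to experiment with the look-ahead idea anyway, it could be applied as a post-processing step outside JTCC. The following is a rough sketch under that assumption; the class and method names are hypothetical, and nothing like this ships with JTCC. It attaches a TCC consisting of a single consonant to the previous TCC whenever the next TCC starts with a leading vowel (เ แ โ ใ ไ). Whether that merge is linguistically correct in every case is not guaranteed. Strings are written as Unicode escapes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing step; not part of JTCC.
class LookAheadMergeSketch {

    // Thai consonants occupy U+0E01..U+0E2E.
    static boolean isConsonant(char c) {
        return c >= '\u0E01' && c <= '\u0E2E';
    }

    // Leading vowels occupy U+0E40..U+0E44.
    static boolean isLeadingVowel(char c) {
        return c >= '\u0E40' && c <= '\u0E44';
    }

    static List<String> mergeWithLookAhead(List<String> tccs) {
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < tccs.size(); i++) {
            String current = tccs.get(i);
            boolean loneConsonant =
                current.length() == 1 && isConsonant(current.charAt(0));
            boolean nextStartsWithLeadingVowel =
                i + 1 < tccs.size()
                && !tccs.get(i + 1).isEmpty()
                && isLeadingVowel(tccs.get(i + 1).charAt(0));
            if (loneConsonant && nextStartsWithLeadingVowel && !merged.isEmpty()) {
                // The consonant cannot join the next TCC, so attach it backwards
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + current);
            } else {
                merged.add(current);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The three TCCs of the example discussed in this section
        List<String> tccs = Arrays.asList("\u0E16\u0E38", "\u0E07", "\u0E43\u0E2B\u0E49");
        System.out.println(String.join("|", mergeWithLookAhead(tccs)) + "|");
    }
}
```

Keeping this outside the tokenizer leaves the TCC rules themselves context-independent, which is the position taken above.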
JTCC is a Java package for tokenizing Thai text into a list of TCCs.
Copyright (C) 2010 Wittawat Jitkrittum

JTCC is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.