Moved to https://codeberg.org/mekong-lang/wordcut-engine

wordcut-engine

Word segmentation library in Rust

Example

use wordcut_engine::load_dict;
use wordcut_engine::Wordcut;
use std::path::Path;

fn main() {
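    // Locate dict.txt next to Cargo.toml; the path is resolved at compile time.
    let dict_path = Path::new(concat!(
        env!("CARGO_MANIFEST_DIR"),
        "/dict.txt"
    ));
    // Load the dictionary; unwrap() panics if loading fails.
    let dict = load_dict(dict_path).unwrap();
    let wordcut = Wordcut::new(dict);
    // Print the input with "|" inserted at each word boundary.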
    println!("{}", wordcut.put_delimiters("หมากินไก่", "|"));
}

Algorithm

wordcut-engine has three steps:

Identifying clusters, which are substrings that must not be split
Identifying the edges of the split directed acyclic graph (split-DAG); edges that would break a cluster are not added to the graph
Tokenizing a string by finding the shortest path in the split-DAG
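
The third step can be illustrated with a minimal sketch. It assumes edges are given as character-index ranges and that every edge costs 1, so the shortest path is the one with the fewest tokens; the cost function actually used by wordcut-engine may weigh edges differently. The names Edge, shortest_path, and tokens are hypothetical and are not part of the wordcut-engine API.

```rust
/// A hypothetical split-DAG edge: one candidate token covering chars[start..end].
#[derive(Debug, Clone, Copy, PartialEq)]
struct Edge {
    start: usize,
    end: usize,
}

/// Finds a path from position 0 to `len` that uses the fewest edges.
/// Returns None when `len` cannot be reached with the given edges.
fn shortest_path(len: usize, edges: &[Edge]) -> Option<Vec<Edge>> {
    // cost[i] = fewest edges needed to reach position i, if reachable.
    let mut cost: Vec<Option<usize>> = vec![None; len + 1];
    // back[i] = the edge that ends at i on the best path found so far.
    let mut back: Vec<Option<Edge>> = vec![None; len + 1];
    cost[0] = Some(0);

    // Every edge points rightwards, so visiting edges in order of their end
    // position guarantees cost[start] is final before it is read.
    let mut sorted = edges.to_vec();
    sorted.sort_by_key(|e| e.end);
    for e in sorted {
        if e.start >= e.end || e.end > len {
            continue; // ignore malformed edges
        }
        if let Some(c) = cost[e.start] {
            if cost[e.end].map_or(true, |old| c + 1 < old) {
                cost[e.end] = Some(c + 1);
                back[e.end] = Some(e);
            }
        }
    }

    // Walk the back-pointers from the end of the string to position 0.
    cost[len]?;
    let mut path = Vec::new();
    let mut pos = len;
    while pos > 0 {
        let e = back[pos]?;
        path.push(e);
        pos = e.start;
    }
    path.reverse();
    Some(path)
}

/// Cuts `text` into tokens along the given path (positions are char indices).
fn tokens(text: &str, path: &[Edge]) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    path.iter()
        .map(|e| chars[e.start..e.end].iter().collect::<String>())
        .collect()
}
```

For the text "abcd" with edges 0..1, 1..2, 0..2, and 2..4, the sketch picks the two-edge path 0..2, 2..4 over the three-edge alternative.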

Identifying clusters

Cluster identification finds the substrings that must not be split. It proceeds in the following steps:

Wrapping regular expressions with parentheses

For example,

[ก-ฮ]็
[ก-ฮ][่-๋]
[ก-ฮ][่-๋][ะาำ]

The above rules are wrapped with parentheses as shown below:

([ก-ฮ]็)
([ก-ฮ][่-๋])
([ก-ฮ][่-๋][ะาำ])

Joining regular expressions with vertical bars (|)

For example,

([ก-ฮ]็)|([ก-ฮ][่-๋])|([ก-ฮ][่-๋][ะาำ])

Building a DFA from the joined regular expression using regex-automata
Creating a directed acyclic graph (DAG) by adding edges using the DFA
Identifying clusters by following the shortest path of the DAG from the previous step
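
The wrapping and joining steps are plain string manipulation and can be sketched as below. The function name join_rules is hypothetical and is not taken from wordcut-engine; building the DFA from the joined expression is left out because it depends on the regex-automata crate.

```rust
/// Wraps each rule in parentheses and joins the wrapped rules with `|`,
/// producing a single regular expression that matches any of them.
fn join_rules(rules: &[&str]) -> String {
    rules
        .iter()
        .map(|rule| format!("({})", rule))
        .collect::<Vec<_>>()
        .join("|")
}
```

Calling join_rules with the three Thai cluster rules from the example above would yield the joined expression shown there.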

Note: wordcut-engine does not allow context-sensitive rules, since they hurt performance too much. Moreover, instead of longest matching, wordcut-engine uses a DAG and its shortest path, so the boundary of one cluster is constrained by another cluster; therefore, newmm-style context-sensitive rules are not required.

Identifying split-DAG edges

In contrast to identifying clusters, identifying split-DAG edges identifies what must be split. wordcut-engine has three types of split-DAG edge makers:

Dictionary-based maker
Rule-based maker
Default maker (Unk edge builder)

The dictionary-based maker traverses a prefix tree, which is a trie in wordcut-engine, and creates an edge for each word matched in the prefix tree. The rule-based maker uses regex-automata's Regex matcher, built from split rules, to find the longest matching substrings and adds the corresponding edges to the graph. wordcut-engine removes edges that break clusters. Examples of split rules are shown below:

[\r\t\n ]+
[A-Za-z]+
[0-9]+
[๐-๙]+
[\(\)"'`\[\]{}\\/]
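
As a rough illustration of the dictionary-based maker and of dropping edges that break clusters, here is a minimal sketch. It uses a simple HashMap-based trie and plain (start, end) character ranges; the types and functions (Trie, dict_edges, drop_cluster_breaking) are assumptions made for this example and do not reflect wordcut-engine's internal data structures.

```rust
use std::collections::HashMap;

/// A minimal character trie (prefix tree), for illustration only.
#[derive(Default)]
struct Trie {
    children: HashMap<char, Trie>,
    is_word: bool,
}

impl Trie {
    /// Adds one dictionary word to the trie.
    fn insert(&mut self, word: &str) {
        let mut node = self;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.is_word = true;
    }
}

/// Walks the trie from every start position and returns one (start, end)
/// edge for each dictionary word found in `chars`.
fn dict_edges(trie: &Trie, chars: &[char]) -> Vec<(usize, usize)> {
    let mut edges = Vec::new();
    for start in 0..chars.len() {
        let mut node = trie;
        for (offset, ch) in chars[start..].iter().enumerate() {
            match node.children.get(ch) {
                Some(next) => node = next,
                None => break, // no dictionary word continues with this char
            }
            if node.is_word {
                edges.push((start, start + offset + 1));
            }
        }
    }
    edges
}

/// Keeps only edges whose endpoints do not fall strictly inside a cluster.
fn drop_cluster_breaking(
    edges: Vec<(usize, usize)>,
    clusters: &[(usize, usize)],
) -> Vec<(usize, usize)> {
    let breaks = |pos: usize| clusters.iter().any(|&(s, e)| s < pos && pos < e);
    edges
        .into_iter()
        .filter(|&(s, e)| !breaks(s) && !breaks(e))
        .collect()
}
```

With the words "ab", "abc", and "c" and the text "abc", dict_edges finds the ranges 0..2, 0..3, and 2..3. If positions 1..3 form a cluster, only 0..3 survives the filter, because the other two edges have an endpoint at position 2, inside the cluster.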

If no edge exists yet for a character index, the default maker creates an edge that connects it to the rightmost known boundary.
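
One possible reading of the default maker is sketched below. It assumes that "no edge for a character index" means no other maker produced an edge ending at that position, and that the "rightmost known boundary" is the closest position to the left where such an edge ends (or the start of the string). The function default_edges is hypothetical; the real Unk edge builder may track boundaries differently.

```rust
/// Returns fallback (start, end) edges for every end position in 1..=len
/// that is not already the end of an edge in `edges`.
fn default_edges(len: usize, edges: &[(usize, usize)]) -> Vec<(usize, usize)> {
    // Mark the positions where some existing edge ends.
    let mut has_edge = vec![false; len + 1];
    for &(_, end) in edges {
        if end <= len {
            has_edge[end] = true;
        }
    }

    let mut fallback = Vec::new();
    let mut left_boundary = 0; // rightmost known boundary seen so far
    for end in 1..=len {
        if has_edge[end] {
            left_boundary = end;
        } else {
            // Unknown span: connect this position back to the last boundary.
            fallback.push((left_boundary, end));
        }
    }
    fallback
}
```

Under these assumptions, a run of characters not covered by the dictionary or the rules ends up reachable as a single unknown token that starts at the last known boundary.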
