
Description

This module is a word tokenizer for CJK text that supports n-gram tokenization. It is designed to be used with Xapian (http://xapian.org) and relies on Xapian's Unicode routines.
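As a rough illustration of what n-gram tokenization means for CJK text, the sketch below splits each run of CJK characters into overlapping n-grams. It is written in Python for brevity and is not this module's code or API: the names `is_cjk` and `cjk_ngrams`, the code point ranges, and the handling of runs shorter than `n` are all assumptions made for the example.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """Rough CJK check (assumed ranges, not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def cjk_ngrams(text: str, n: int = 2) -> List[str]:
    """Return overlapping n-grams for every run of CJK characters in text.

    Non-CJK characters only act as separators here; a run shorter than n
    is emitted as a single token (an assumption of this sketch).
    """
    tokens: List[str] = []
    run: List[str] = []

    def flush() -> None:
        # Emit the n-grams of the current run, then reset it.
        if not run:
            return
        if len(run) < n:
            tokens.append("".join(run))
        else:
            for i in range(len(run) - n + 1):
                tokens.append("".join(run[i:i + n]))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        else:
            flush()
    flush()
    return tokens


if __name__ == "__main__":
    print(cjk_ngrams("中文分詞"))
```

With `n = 2` the input `中文分詞` yields the bigrams `中文`, `文分`, and `分詞`. Overlapping n-grams let a search engine match substrings without a dictionary-based word segmenter, at the cost of a larger index.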

Currently, there is no documentation other than the source code.

Authors

Features

  • N-gram tokenization on CJK texts.
  • Conversion from Traditional Chinese to Simplified Chinese, and vice versa.
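The conversion feature can be pictured as a per-character table lookup. The following Python sketch assumes a tiny hand-picked table (`TRAD_TO_SIMP`) and a hypothetical helper `convert`; neither comes from this module, and a real table covers thousands of characters.

```python
# Hypothetical sample table for illustration only.
TRAD_TO_SIMP = {
    "漢": "汉",
    "語": "语",
    "體": "体",
    "國": "国",
    "學": "学",
}

# Reverse table built by inverting the sample. This only works because the
# sample is one-to-one; real Simplified-to-Traditional mappings are not.
SIMP_TO_TRAD = {simp: trad for trad, simp in TRAD_TO_SIMP.items()}


def convert(text: str, table: dict) -> str:
    """Replace each character found in table; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(convert("漢語", TRAD_TO_SIMP))
    print(convert("汉语", SIMP_TO_TRAD))
```

Normalizing both indexed text and queries to one script lets a query in either script match the same documents. Note that some Simplified characters correspond to more than one Traditional character, so a plain reverse lookup like the one above is only an approximation.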

History

This project was taken from http://code.google.com/p/cjk-tokenizer/ and then modified to use Xapian's internal Unicode routines.
