A Chinese Word Segmentation(中文分词) routine in pure Ruby

Introduction
========
Rseg is a Chinese word segmentation (中文分词) routine written in pure Ruby.

The algorithm is based on this article: http://xiecc.blog.163.com/blog/static/14032200671110224190/
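
The linked article is not reproduced here, so the following is not Rseg's actual algorithm. It is a minimal sketch of forward maximum matching, a common dictionary-based baseline for Chinese word segmentation, included only to illustrate what a dictionary-driven segmenter does. The names `DICT`, `MAX_WORD_LEN` and `fmm_segment` and the tiny word list are assumptions made for this example.

```ruby
require 'set'

# Hypothetical toy dictionary for illustration only; Rseg ships much
# larger dictionaries (see the License section).
DICT = Set.new(%w[需要 分词 的 文章 中文])
MAX_WORD_LEN = DICT.map(&:length).max

# Greedy forward maximum matching: at each position take the longest
# dictionary word, falling back to a single character if none matches.
def fmm_segment(text, dict = DICT, max_len = MAX_WORD_LEN)
  words = []
  i = 0
  while i < text.length
    len = [max_len, text.length - i].min
    # Shrink the candidate until it is a known word or one character long.
    len -= 1 while len > 1 && !dict.include?(text[i, len])
    words << text[i, len]
    i += len
  end
  words
end

p fmm_segment("需要分词的文章")
```

Greedy matching is simple and fast but can pick the wrong split for ambiguous text, which is why real segmenters usually add further disambiguation rules.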

Usage
========

Rseg now supports two modes: inline mode and client/server (C/S) mode.

1. Inline mode

> require 'rubygems'
> require 'rseg'
> Rseg.segment("需要分词的文章")
['需要', '分词', '的', '文章']

The first call to Rseg#segment takes about 30 seconds because it has to load the dictionaries; subsequent calls are very fast. You can also call Rseg#load to load the dictionaries manually ahead of time.
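
The load-once behaviour described above can be sketched roughly as follows. This is not Rseg's source code; `LazySegmenter`, `load_dictionary`, `known?` and `build_dictionary` are hypothetical names, and the two-word hash stands in for the slow step of reading the real dictionary files.

```ruby
# Hypothetical stand-in, not Rseg's implementation: shows how a
# dictionary can be loaded on first use and then reused.
module LazySegmenter
  @dict = nil
  @loads = 0

  class << self
    # Number of times the dictionary was actually built.
    attr_reader :loads

    # Explicit preload; safe to call more than once.
    def load_dictionary
      @dict ||= begin
        @loads += 1
        build_dictionary
      end
    end

    # Looks up a word, loading the dictionary only if it is not loaded yet.
    def known?(word)
      load_dictionary.key?(word)
    end

    private

    # Stand-in for the expensive step of reading dictionary files.
    def build_dictionary
      { '需要' => true, '分词' => true }
    end
  end
end

LazySegmenter.known?('需要')   # first call pays the loading cost
LazySegmenter.known?('分词')   # later calls reuse the loaded dictionary
puts LazySegmenter.loads
```

In a long-running process, preloading at startup moves the delay out of the first real request.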

2. C/S mode

$ rseg_server
== Sinatra/0.9.4 has taken the stage on 4100

This starts the rseg server at http://localhost:4100

You can visit it in your browser or query it with the rseg command.

$ rseg '需要分词的文章'
需要 分词 的 文章

You can also access the server from Ruby with Rseg#remote_segment:

$ irb
> require 'rubygems'
> require 'rseg'
> Rseg.remote_segment("需要分词的文章") # This will be very fast
['需要', '分词', '的', '文章']

Performance
========
About 5M characters/s on my MacBook (Intel Core 2 Duo 2 GHz, 4 GB memory).

License
========

Rseg includes two built-in dictionaries:

* CC-CEDICT (http://cc-cedict.org/wiki/), under the Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/)
* Wikipedia Chinese article title list (http://download.wikimedia.org/zhwiki/), under the Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/)

The code and all other files in Rseg are licensed under the MIT license.

Feedback
========
All feedback is welcome: Yuanyi Zhang (zhangyuanyi#gmail.com)