Accurate Bayesian sentence tokenizer in Ruby.

TactfulTokenizer is a Ruby library for high-quality sentence tokenization. It uses a Naive Bayes statistical model and is based on Splitta, adding support for '?' and '!' as sentence terminators as well as primitive handling of XHTML markup. Better support for XHTML parsing is coming shortly.

It also supports tokenization of Unicode text.


require "tactful_tokenizer"
m = TactfulTokenizer::Model.new
m.tokenize_text("Here in the U.S. Senate we prefer to eat our friends. Is it easier that way? <em>Yes.</em> <em>Maybe</em>!")
#=> ["Here in the U.S. Senate we prefer to eat our friends.", "Is it easier that way?", "<em>Yes.</em>", "<em>Maybe</em>!"]

The input text is expected to consist of paragraphs delimited by line breaks.
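
If your source text is hard-wrapped, with blank lines between paragraphs, it may help to reflow it so that each paragraph sits on a single line before tokenizing. The following is a minimal sketch of that preprocessing step. The helper name one_paragraph_per_line is hypothetical and not part of TactfulTokenizer, and the blank-line paragraph convention is an assumption about your input.

```ruby
# Hypothetical helper (not part of TactfulTokenizer): reflow hard-wrapped
# text so that each paragraph occupies exactly one line.
# Assumes paragraphs in the input are separated by one or more blank lines.
def one_paragraph_per_line(text)
  text
    .gsub(/\r\n?/, "\n")        # normalize CRLF and bare CR line endings to LF
    .split(/\n[ \t]*\n+/)       # split on blank (or whitespace-only) lines
    .map do |paragraph|
      # join the wrapped lines of a paragraph with single spaces
      paragraph.split("\n").map(&:strip).reject(&:empty?).join(" ")
    end
    .reject(&:empty?)           # drop paragraphs that contained only whitespace
    .join("\n")                 # one paragraph per line
end

raw = "First sentence\nwraps here.\r\n\r\nSecond paragraph.\n"
puts one_paragraph_per_line(raw)
```

The resulting string can then be passed to tokenize_text as in the usage example above.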


gem install tactful_tokenizer


Copyright © 2010 Matthew Bunday. All rights reserved. Released under the GNU GPL v3.