

term_extract - Term Extract


term_extract extracts proper nouns (named things like 'Manchester United') and ordinary nouns (like 'event') from text documents.


An example extracting terms from a piece of content:

require 'term_extract'

content = <<DOC
Business Secretary Vince Cable will stay in cabinet despite
"declaring war" on Rupert Murdoch, says Downing Street.
DOC

terms = TermExtract.extract(content)


The #extract method takes an optional options hash that modifies the term extractor's behaviour. The following options are available:

  • min_occurance - The minimum number of times a single word term must occur to be included in the results, default 3
  • min_terms - Always include multiword terms that comprise more than min_terms words, default 2
  • types - Extract proper nouns (:nnp) or nouns (:nn) or both (:all), default :all
  • include_tags - Include the extracted POS tags in the results, default false
  • collapse_terms - Remove shorter terms that are part of larger ones, default true

Sample usage:

terms = TermExtract.extract(content, :types => :nnp, :include_tags => true)
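To make the min_occurance, min_terms and collapse_terms options more concrete, here is a minimal sketch in plain Ruby. It is not the gem's own code: filter_terms, part_of? and the counts hash are hypothetical names used only for illustration, and the sketch assumes the min_terms threshold is inclusive, so that two-word terms pass with the default of 2.

```ruby
# Illustrative sketch in plain Ruby; not the gem's own code.

# True when the words of `short` appear, adjacent and in order, inside `long`.
def part_of?(short, long)
  short_words = short.split
  long_words = long.split
  short_words.size < long_words.size &&
    long_words.each_cons(short_words.size).include?(short_words)
end

# `counts` is a hypothetical Hash of candidate term => number of occurrences.
def filter_terms(counts, min_occurance: 3, min_terms: 2, collapse_terms: true)
  kept = counts.select do |term, occurrences|
    word_count = term.split.size
    if word_count == 1
      occurrences >= min_occurance # single words must occur often enough
    else
      word_count >= min_terms      # assumption: the threshold is inclusive
    end
  end

  return kept unless collapse_terms

  # Drop any term that is contained in a longer surviving term.
  kept.reject do |term, _count|
    kept.keys.any? { |other| part_of?(term, other) }
  end
end

counts = { "Vince Cable" => 1, "Cable" => 3, "cabinet" => 1, "war" => 4 }
p filter_terms(counts)
p filter_terms(counts, collapse_terms: false)
```

With these made-up counts, 'cabinet' is dropped for occurring only once, and 'Cable' is dropped only when collapsing is on, because it is part of 'Vince Cable'.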

Term Extraction Types

By default, the term extractor attempts to extract both ordinary nouns and proper nouns. This behaviour can be configured with the :types option by specifying :all (for both), :nn (for ordinary nouns) or :nnp (for proper nouns). These codes correspond to the relevant POS tags used during the term extraction process. Sample usage is shown below:

terms = TermExtract.extract(content, :types => :nnp)
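As a rough picture of how a :types value could map onto POS tags, the sketch below filters a list of already tagged words. It is plain Ruby and not the gem's implementation: the TAGGED data, the WANTED_TAGS table and select_by_type are all assumptions made for illustration, using Penn Treebank style tags.

```ruby
# Hypothetical tagged tokens as [word, POS tag] pairs.
TAGGED = [
  ["Vince", "NNP"], ["Cable", "NNP"], ["will", "MD"],
  ["stay", "VB"], ["in", "IN"], ["cabinet", "NN"]
].freeze

# Maps each :types value to the tags it lets through (assumed for illustration).
WANTED_TAGS = {
  nn: ["NN"],
  nnp: ["NNP"],
  all: ["NN", "NNP"]
}.freeze

# Returns the words whose tag is allowed by the given type.
def select_by_type(tagged, types = :all)
  wanted = WANTED_TAGS.fetch(types)
  tagged.select { |_word, tag| wanted.include?(tag) }.map(&:first)
end

p select_by_type(TAGGED, :nnp)
p select_by_type(TAGGED, :nn)
p select_by_type(TAGGED)
```

Here :nnp keeps only 'Vince' and 'Cable', :nn keeps only 'cabinet', and the default keeps all three.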

Command Line Tool

There is a command line tool that can be used for testing the term extractor. It is best used in conjunction with another tool that extracts the relevant content (e.g. pismo):

pismo body | ruby -rubygems -e 'puts YAML.parse($[:body].value' | ./term-extract nnp | ruby -rubygems -e 'puts YAML.load($'

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add tests for it. This is important so I don't break it in a future version unintentionally.
  • Commit. Do not touch the Rakefile, version, or history, as these are handled by Jeweler.
  • Send me a pull request. I may or may not accept it.


The algorithm and extraction code are based on the original Python code at:

Copyright and License

GPL v3 - See LICENSE.txt for details. Copyright (c) 2010, Rob Lee
