Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
term extraction gem
Ruby
branch: master

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.
bin
lib
test
.document
.gitignore
Gemfile
Gemfile.lock Bumping rbtagger, for tests to pass on ruby192.
LICENSE.txt Renamed to term-extract to avoid gem naming confusion
README.markdown
Rakefile
VERSION
term-extract.gemspec Regenerate gemspec for version 0.5.2

README.markdown

term_extract - Term Extract

Description:

term_extract extracts proper nouns (named things like 'Manchester United') and ordinary nouns (like 'event') from text documents.

Usage:

An example extracting terms from a piece of content:

require 'term_extract'

content = <<DOC
Business Secretary Vince Cable will stay in cabinet despite
"declaring war" on Rupert Murdoch, says Downing Street.
DOC

terms = TermExtract.extract(content)

Options

The #extract method takes an (optional) options hash, that allows the term extractor behaviour to be modified. The following options are available:

  • min_occurance - The minimum number of times a single word term must occur to be included in the results, default 3
  • min_terms - Always include multiword terms that comprise more than @min_terms words, default 2
  • types - Extract proper nouns (:nnp) or nouns (:nn) or both (:all), default :all
  • include_tags - Include the extracted POS tags in the results, default false
  • collapse_terms - Remove shorter terms that are part of larger ones, default true

Sample usage:

terms = TermExtract.extract(content, :types => :nnp, :include_tags => true)

Term Extraction Types

By default, the term extractor attempts to extract both ordinary nouns and proper nouns, this behaviour can be configured using the #types option and specifying :all (for both), :nn (for ordinary nouns) or :nnp (for proper nouns). These codes correspond to the relevent POS tags used during the term extraction process. Sample usage is shown below:

terms = TermExtract.extract(content, :types => :nnp)

Command Line Tool

There is a command line tool that can be used for testing the term extractor. It is best used in conjunction with another tool to extract the relevent content (e.g. pismo) :

pismo http://www.bbc.co.uk/news/uk-politics-12085506 body | ruby -rubygems -e 'puts YAML.parse($stdin.read)[:body].value' | ./term-extract nnp | ruby -rubygems -e 'puts YAML.load($stdin.read)'

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add tests for it. This is important so I don't break it in a future version unintentionally.
  • Commit, do not mess with Rakefile, version, or history as it's handled by Jeweler.
  • Send me a pull request. I may or may not accept it.

Acknowledgements

The algorithm and extraction code is based on the original python code at:

http://pypi.python.org/pypi/topia.termextract/

Copyright and License

GPL v3 - See LICENSE.txt for details. Copyright (c) 2010, Rob Lee

Something went wrong with that request. Please try again.