Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
This package determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength. NOTE: This is a fork by Joseph Turian of topia.termextract 1.1.0 CONTRIBUTIONS: * Unicode alphabetic characters are tokenized correctly. I changed TERM_SPEC in topic.termextract.tag: Old = [u'S', u'\xe3o', u'Paulo', u'was', u'home', u'to'] New = [u'S\xe3o', u'Paulo', u'was', u'home', u'to'] * extractor.extract() now has a parameter KEEP_ORIGINAL_SPACING=True, which allows you to keep the original spacing of the term: Old = [u'Mr . Smith'] New = [u'Mr. Smith'] * Fixed a bug where a term wouldn't be found if it was literally the last token of the sentence. * Fixed a bug (?) where unigram terms were included even if their tokens were part of a multiterm.