Skip to content
N-Gram-Based Text Categorization
Python
Branch: master
Clone or download

Latest commit

Fetching latest commit…
Cannot retrieve the latest commit at this time.

Files

Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
example_data
langdata
paper
.gitignore
LICENSE
README.md
example.py
ngramfreq.py

README.md

ngrambased-textcategorizer

Python implementation (PoC) of "N-Gram-Based Text Categorization (1994)" paper by William B. Cavnar and John M. Trenkle just for fun, learn and researching purposes.

How it works

To perform N-Gram-Based Text Categorization we need to compute N-grams (with N=1 to 5) for each word - and apostrophes - found in the text, doing something like (being the word "TEXT"):

  • bi-grams: _T, TE, EX, XT, T_
  • tri-grams: _TE, TEX, EXT, XT_, T_ _
  • quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _

With every N-Gram computed we just keep top 300 with most occurrences and save them as a "text category profile":

$ tail -n 10 langdata/spanish.dat 
  se   6826
nc  6788
su	6770
mi	6665
 con	6590
  con	6590
er 	6459
er   	6459
er  	6459
z	6379

Now, to check/guess text category we only need to generate N-grams in previous way and match against pre-computed profiles to calculate distance for every N-gram, choosing the nearest one (the profile with smallest total "distance")

Language detection example

I have included three profiles - spanish, english and french - for a quick demo:

$ python example.py -i example_data/El\ Hobbit\ -\ Una\ Tertulia\ Inesperada.txt 
[*] example_data/El Hobbit - Una Tertulia Inesperada.txt seems to be written in spanish

Creating category profile

>>> import ngramfreq
>>> text_categorizer = ngramfreq.NGramBasedTextCategorizer()
>>> lesmiserables_words = open('./example_data/lesmiserables.txt', mode='r').read()
>>> text_categorizer.generate_ngram_frequency_profile_from_raw_text(lesmiserables_words, 'french.dat')
>>> # or we could just do
>>> text_categorizer.generate_ngram_frequency_profile_from_file('./example_data/uuee_const.txt', 'english.dat')

Categorizing text

>>> import ngramfreq
>>> # create just one object for multiple checks
>>> text_categorizer = ngramfreq.NGramBasedTextCategorizer()
>>> text_categorizer.guess_language("Le temps est un grand maître, dit-on, le malheur est qu'il tue ses élèves.")
'french'
>>> # or for only one quick check
>>> ngramfreq.guess_language("The only thing necessary for the triumph of evil is that good men do nothing.")
'english'
You can’t perform that action at this time.