Skip to content
Fast Rust-based toolkit for ngram processing
Rust Shell
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
fixtures
scripts
src
.gitignore
.gitmodules
Cargo.toml
README.md
build.rs

README.md

The ngrams tool

Fast rust-based command-line tool for processing text files, extracting ngrams, and making a statistical baseline (rough "language model").

  • command-line tool (ngrams) for Windows, Linux, Mac OS
  • normalizes ascii text (lowercases; removes non-alphabetical text; recognizes word and sentence boundaries)
  • takes windowed word slices as n-grams
  • tallies the occurrences of n-grams
  • optional client/server architecture for fast look-up of in-memory n-gram tallies

Installing

Install rust & cargo, then:

cargo build --release

Usage

ngrams

The ngrams tool takes an ascii text file and outputs tallied n-grams. The tally comes first and is space-padded, followed by a tab separator, and lastly the n-gram.

Input is taken to be STDIN if no files are given; otherwise, each given ascii file is processed in turn.

You can optionally specify the window size, i.e. the value of n (--number) in the n-grams (e.g. 3 for 3-grams, 4 for 4-grams, etc.): ngrams -n4 [FILES].

You can also optionally request the output to be sorted with the -s (--sort) flag. Alphabetical sorting is given by -sa and numerical sorting is given with -sn: ngrams -n4 -sn [FILES].

Example:

$ ./ngrams -n4 -sa ../bomdb/bom.txt
     1398	it came to pass
     1299	came to pass that
     1155	and it came to
      257	i say unto you
      234	to pass that the
      169	to pass that when
      159	now it came to
...

If more than one file is given on the command line, all of the results will be merged (accumulatd) in the output tally.

Saving off and reloading results

If you send the tallied results to a file, you can load them back again with the -t (--tallied-input) flag:

# Redirect tallied, unsorted output to file `bom.4grams.tallied`:
$ ./ngrams -n4 ../bomdb/bom.txt >bom.4grams.tallied

# Re-read tallied data, then sort it and show the result:
$ ./ngrams -t -sn bom.4grams.tallied
     1398	it came to pass
     1299	came to pass that
     1155	and it came to
      257	i say unto you
      234	to pass that the
      169	to pass that when
      159	now it came to
...

This capability is particularly useful if you've created a large aggregate file (i.e. baseline) that is the result of processing many files.

Client / Server

For advanced usage, you can also set ngrams up as a server, storing a large amount of data in-memory. This significantly speeds things up if you have a large baseline that you need to compare many books with.

Setting up the server looks like this:

$ ./ngrams --server -t baseline.4grams.tallied
Server mode enabled (unix:/tmp/ngramd.sock)
  -> serving 17417138 ngrams

Now, you can take an ascii file (i.e. an English book, in text format) and score it using the baseline:

$ ./ngrams --client -n4 book.txt
Client mode enabled (unix:/tmp/ngramd.sock)
  -> requesting ngrams
Score: 3.182188199
You can’t perform that action at this time.