A Python implementation of Venkataraman's model for word segmentation
This is a Python reimplementation of the maximum-likelihood estimation approach proposed by Anand Venkataraman for unsupervised word segmentation.

The paper describing the approach can be found at the link here

As of 10/17/18, this repo only implements backoffs for unigrams and phonemes. To run, clone the repo:

git clone

cd into the directory, and run:


The results of the segmentation will be stored in results/result.txt.


Performance is measured using precision, recall, and F-score.

Precision is the number of correct words found out of all words found, recall is the number of correct words found out of all correct words, and F-score is the harmonic mean of precision and recall: (2 * precision * recall) / (precision + recall).
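The three measures can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the scoring script in this repo: the function name `score_lexicon` is hypothetical, and it assumes each lexicon is compared as a set of unique words.

```python
def score_lexicon(found_words, gold_words):
    """Return (precision, recall, f_score) for two collections of words.

    Hypothetical helper: both inputs are reduced to sets of unique words,
    which is an assumption about how the lexicons are compared.
    """
    found = set(found_words)
    gold = set(gold_words)
    correct = len(found & gold)  # words that appear in both lexicons

    # Guard against empty inputs to avoid dividing by zero.
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: 3 of the 4 words found are correct, out of 5 correct words.
precision, recall, f_score = score_lexicon(
    ["the", "dog", "gy", "see"],
    ["the", "dog", "doggy", "see", "you"],
)
print(precision, recall, f_score)
```

In the toy example, precision is 3/4 and recall is 3/5, so the F-score comes out to about 0.667.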

To score a segmented lexicon, run [segmented lexicon] [model lexicon] where segmented lexicon and model lexicon are text files listing the words in each lexicon. An example command is:

python results/lexicon.txt data/Bernstein-Ratner87-lexicon

The results for this example are:

Precision: 0.526275115919629
Recall: 0.5155185465556397
f1 Score: 0.5208413001912047
