yczeng/venkataraman-approach

venkataraman-approach

This is a Python reimplementation of the maximum-likelihood estimation approach proposed by Anand Venkataraman for unsupervised word segmentation.

The approach is described in Venkataraman (2001), "A Statistical Model for Word Discovery in Transcribed Speech", Computational Linguistics 27(3).

As of 10/17/18, this repo only implements backoffs for unigrams and phonemes. To run, clone the repo:

git clone https://github.com/yczeng/venkataraman-approach.git

cd into the directory, and run:

python search.py

The results of the segmentation will be stored in results/result.txt.

Scoring

Performance is measured using precision, recall, and F-score.

Precision is the number of correct words found out of all words found, recall is the number of correct words found out of all correct words, and the F-score is the harmonic mean of precision and recall: (2 * precision * recall) / (precision + recall).
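As a rough illustration of these definitions, here is a minimal sketch that scores one list of words against another. It is not the code in score.py: the function name score_lexicon and the toy word lists are made up for this example, and it assumes each lexicon is treated as a set of distinct words.

```python
def score_lexicon(found, gold):
    """Return (precision, recall, f_score) for two collections of words.

    Hypothetical helper for illustration; both inputs are treated as sets,
    so repeated words count once.
    """
    found, gold = set(found), set(gold)
    correct = len(found & gold)  # words that appear in both lexicons

    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data (invented): "doggy" was wrongly split into "dog" and "gy".
found_words = ["the", "dog", "gy", "you", "want"]
gold_words = ["the", "doggy", "you", "want", "to", "see"]

p, r, f = score_lexicon(found_words, gold_words)
print(f"Precision: {p:.3f}")
print(f"Recall: {r:.3f}")
print(f"F-score: {f:.3f}")
```

In this toy case three of the five found words are correct and three of the six gold words are recovered, so precision is 3/5 and recall is 3/6.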

To score a segmented lexicon, run python score.py [segmented lexicon] [model lexicon], where segmented lexicon and model lexicon are text files containing the words in each lexicon. An example command is:

python score.py results/lexicon.txt data/Bernstein-Ratner87-lexicon

With the current implementation, the results are:

Precision: 0.526275115919629
Recall: 0.5155185465556397
f1 Score: 0.5208413001912047
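As a quick sanity check, the reported F1 score can be recomputed from the reported precision and recall with the formula 2 * precision * recall / (precision + recall). The snippet below is only a sketch for verification; the variable names are chosen for this example and are not taken from score.py.

```python
# Values copied from the results above.
precision = 0.526275115919629
recall = 0.5155185465556397
reported_f1 = 0.5208413001912047

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# The recomputed value should agree with the reported one
# up to floating-point rounding.
print(abs(f1 - reported_f1) < 1e-6)
```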

About

A Python implementation of Venkataraman's model for word segmentation
