pos-tagger

A part-of-speech tagger built on Hidden Markov Models and trained on the Brown Corpus (.zip file).

This repository also includes the processed data files brown.train and brown.counts. brown.train contains the whole Brown corpus concatenated into a single file (with some cleaning up). brown.counts contains n-gram and word-tag counts obtained from brown.train.

The brown.counts file is formatted as follows:

2 WORDTAG jj inward
150 3-GRAM cs dt nn

The first line means that the word "inward" is paired with the "jj" tag 2 times in the corpus.

The second line means that the trigram (3-GRAM) ["cs" "dt" "nn"] appears 150 times in the corpus. There are also lines with counts for 1- and 2-grams.
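To make the format concrete, here is a minimal sketch of a parser for a single brown.counts line. The function name `parse-count-line` and the shape of the returned map are assumptions made for illustration; they are not taken from the pos-tagger.brown-counts namespace.

```clojure
(require '[clojure.string :as str])

(defn parse-count-line
  "Parse one line of brown.counts into a map.
  Hypothetical helper, not part of this repository."
  [line]
  (let [[n kind & more] (str/split (str/trim line) #"\s+")
        n (Long/parseLong n)]
    (if (= kind "WORDTAG")
      ;; WORDTAG lines are: <count> WORDTAG <tag> <word>
      (let [[tag word] more]
        {:kind :wordtag :count n :tag tag :word word})
      ;; n-gram lines are: <count> <N>-GRAM <tag> ... <tag>
      {:kind  :ngram
       :count n
       :n     (Long/parseLong (first (str/split kind #"-")))
       :tags  (vec more)})))

(parse-count-line "2 WORDTAG jj inward")
(parse-count-line "150 3-GRAM cs dt nn")
```

The two calls at the end use the example lines shown above; the second one yields a map whose `:tags` value is the vector `["cs" "dt" "nn"]`.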

The code used to produce the brown.counts and brown.train files is in the pos-tagger.brown-counts namespace.
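Counts of this kind are the usual inputs to the maximum-likelihood estimates an HMM tagger relies on: an emission probability count(tag, word) / count(tag) and a bigram transition probability count(prev, tag) / count(prev). The sketch below shows those two ratios only. The function names, the map layout and the numbers in the example are all made up for illustration, and whether pos-tagger.hmm computes its probabilities this way is not stated in this README.

```clojure
(defn ratio-or-zero
  "Divide num by denom as doubles, returning 0.0 when denom is zero."
  [num denom]
  (if (zero? denom)
    0.0
    (/ (double num) denom)))

(defn emission-prob
  "Estimate P(word | tag). wordtag-counts maps [tag word] to a count,
  unigram-counts maps [tag] to a count. Both layouts are assumptions."
  [wordtag-counts unigram-counts tag word]
  (ratio-or-zero (get wordtag-counts [tag word] 0)
                 (get unigram-counts [tag] 0)))

(defn transition-prob
  "Estimate P(tag | prev) from bigram and unigram tag counts."
  [bigram-counts unigram-counts prev tag]
  (ratio-or-zero (get bigram-counts [prev tag] 0)
                 (get unigram-counts [prev] 0)))

;; Illustrative numbers only; the unigram and bigram counts are invented.
(emission-prob {["jj" "inward"] 2} {["jj"] 100} "jj" "inward")
(transition-prob {["at" "nn"] 30} {["at"] 120} "at" "nn")
```

Unseen words get probability 0.0 under these plain ratios, which is the problem the smoothing item in the TODO list is meant to address.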

The Brown corpus tagset is described here.

Usage

First start a REPL inside the project dir:

lein repl

Then,

(require '[pos-tagger.hmm :refer :all])

(tag-sequence ["the" "man" "saw" "a" "dog" "."])
;; ("at" "nn" "vbd" "at" "nn" ".")
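The example passes a vector of lowercase tokens with punctuation as its own token, and a tokenizer is still on the TODO list. Until one exists, a rough stand-in could look like the sketch below. `simple-tokenize` is a hypothetical helper, not part of this repository, and lowercasing the input is an assumption based only on the example above.

```clojure
(require '[clojure.string :as str])

(defn simple-tokenize
  "Lowercase a sentence and split it into word and punctuation tokens.
  Hypothetical helper; only handles ASCII letters, digits and apostrophes."
  [sentence]
  (re-seq #"[a-z0-9']+|[^\sa-z0-9']" (str/lower-case sentence)))

;; Wrap the result in vec to match the vector used in the example above.
(vec (simple-tokenize "The man saw a dog."))
```

This keeps contractions such as "don't" as a single token, which may not match how the Brown corpus splits them, so treat it as a convenience for experimenting at the REPL.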

TODO

  • Tokenizer
  • Account for rare words, proper nouns, typos (smoothing).
  • Try a trigram language model.
  • Evaluate performance.

License

Copyright © 2013 Samrat Man Singh

Distributed under the Eclipse Public License, either version 1.0 or (at your option) any later version.
