An implementation of the Viterbi algorithm that finds the most likely sequence of hidden events that would explain a sequence of observed events. In the context of Natural Language Processing (NLP), it is often applied to the Part-of-Speech (POS) tagging task, where the goal is to assign a part-of-speech tag (such as noun, verb, adjective, etc.) to each word in a sentence.
Problem setup:
- Observation sequence: the sequence of words in a sentence (e.g., "The cat sleeps").
- Hidden states: the POS tags (e.g., `DET` for determiner, `NOUN` for noun, `VERB` for verb).
- The task is to find the most likely sequence of POS tags that best explains the observed words in the sentence.
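To make this concrete, here is a minimal sketch (plain Python, not taken from the repository) of how the observed words and candidate hidden states for the example sentence could be represented:

```python
# Observed sequence: the words of the sentence.
observations = ["The", "cat", "sleeps"]

# Hidden states: the POS tags the model can assign.
hidden_states = ["DET", "NOUN", "VERB"]

# Goal: recover the most likely tag sequence for the observed words,
# e.g. ["DET", "NOUN", "VERB"] for "The cat sleeps".
```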
Steps of the Viterbi algorithm:
- Initialization:
- Start with the first word of the sentence and compute the probability of each POS tag given the first word.
- This step uses the emission probabilities, i.e., the probability of a word given a tag, $P(w_i|t_i)$, and the initial state probabilities, i.e., the probability of a tag being the first tag in a sentence.
- Recursion:
- For each subsequent word in the sentence, calculate the most likely tag for that word based on the previous word's tag.
- Use the transition probabilities, i.e., the probability of one tag following another, $P(t_i|t_{i-1})$, and the emission probabilities $P(w_i|t_i)$.
- For each word, compute the probability of each possible tag, considering the previous tag and the word's emission probability.
- Store the highest probability and the corresponding tag for each word.
- Termination:
- After processing all words, choose the tag with the highest probability for the last word.
- Trace back through the stored probabilities to find the most likely sequence of tags for the entire sentence.
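The three steps above can be combined into a short, self-contained sketch. This is a minimal illustration in plain Python, not the repository's implementation; the parameter names (`start_p`, `trans_p`, `emit_p`) and the log-probability floor for unseen events are assumptions made here for readability and numerical stability:

```python
from math import log

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words`.

    start_p[t]    : probability that a sentence starts with tag t
    trans_p[s][t] : probability of tag t following tag s, P(t_i | t_{i-1})
    emit_p[t][w]  : probability of word w given tag t, P(w_i | t_i)
    (These names are illustrative; the repository may use different ones.)
    """
    EPS = 1e-12  # floor for unseen events so log() stays defined

    # Initialization: score each tag for the first word.
    V = [{t: log(start_p.get(t, EPS)) + log(emit_p.get(t, {}).get(words[0], EPS))
          for t in tags}]
    back = [{}]

    # Recursion: for every later word, keep the best previous tag per current tag.
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev, best_score = max(
                ((s, V[i - 1][s]
                     + log(trans_p.get(s, {}).get(t, EPS))
                     + log(emit_p.get(t, {}).get(words[i], EPS)))
                 for s in tags),
                key=lambda pair: pair[1])
            V[i][t] = best_score
            back[i][t] = best_prev

    # Termination: best tag for the last word, then trace back.
    last_tag = max(V[-1], key=V[-1].get)
    path = [last_tag]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

Given probability tables estimated from training data (see the parameter section below), calling `viterbi(["The", "cat", "sleeps"], hidden_states, start_p, trans_p, emit_p)` would return one tag per word.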
- Installation:
I'll give you a script; just paste it into your terminal:
git clone https://github.com/steevewu/nlp-viterbi-algorithm.git && cd nlp-viterbi-algorithm
# install pip dependencies
pip install -r requirements.txt
Note: I highly recommend using a Python virtual environment to prevent dependency conflicts between projects that require different package versions and to keep your global Python installation clean.
Here are the definitions of our model's parameters:
- `train_states`: set of sequences [s1, s2, ..., sk] where each $s_i$ = [w1, w2, ..., wn].
- `train_observations`: corresponding set of tags [t1, t2, ..., tk] for `train_states` where each $t_i$ = [p1, p2, ..., pn].
Example:
train_states = [
['South', 'Korea', "'s", 'government', 'Tuesday', 'also', 'unveiled', 'a', 'so-called', 'Green', 'New', 'Job', 'Creation', 'Plan', ',', 'expected', 'to', 'create', '9,60,000', 'new', 'jobs', '.'],
['When', 'the', 'Lion', 'found', 'that', 'he', 'could', 'not', 'escape', ',', 'he', 'flew', 'upon', 'the', 'sheep', 'and', 'killed', 'them', ',', 'and', 'then', 'attacked', 'the', 'oxen', '.'],
['The', 'cost', 'of', 'major', 'food', 'commodities', 'has', 'doubled', 'over', 'the', 'past', 'two', 'years', 'with', 'rice', ',', 'corn', 'and', 'wheat', 'at', 'record', 'highs', '.'],
['Argentina', "'s", 'Lionel', 'Messi', 'tied', 'the', 'match', 'six', 'minutes', 'later', '.'],
['In', 'addition', 'to', '65,000', 'regular', 'H1-B', 'visas', ',', 'It', 'also', 'has', 'approved', 'an', 'additional', '20,000', 'work', 'permits', 'each', 'year', 'for', 'foreigners', 'who', 'hold', 'an', 'advanced', 'degree', '(', 'master', "'s", 'or', 'doctorate', ')', 'from', 'an', 'American', 'college', 'or', 'university', '.'],
['Palestinian', 'President', 'Mahmoud', 'Abbas', 'is', 'reported', 'to', 'have', 'told', 'his', 'security', 'forces', 'to', 'use', '"', 'an', 'iron', 'fist', '"', 'to', 'maintain', 'the', 'current', 'cease-fire', 'with', 'Israel', '.'],
['The', 'Discovery', 'Channel', 'and', 'NASA', 'have', 'restored', '50', 'years', 'of', 'history', 'for', 'the', 'U.S.', 'space', 'agency', "'s", 'archives', '.'],
['Afghan', 'President', 'Hamid', 'Karzai', 'led', 'his', 'country', "'s", 'Independence', 'Day', 'celebrations', 'Sunday', 'with', 'a', 'call', 'for', 'young', 'Afghans', 'to', 'educate', 'themselves', 'to', 'preserve', 'their', 'freedom', '.'],
['Mr.', 'Blair', 'also', 'stressed', 'that', 'his', 'decisions', 'to', 'back', 'U.S.-led', 'military', 'actions', 'in', 'Afghanistan', 'and', 'Iraq', 'had', 'nothing', 'to', 'do', 'with', 'religion', '.'],
['Diplomats', 'reported', 'some', 'progress', 'during', 'Sunday', "'s", 'talks', ',', 'and', 'are', 'holding', 'out', 'hopes', 'for', 'a', 'breakthrough', '.']
]
train_observations = [
['NNP', 'NNP', 'POS', 'NN', 'NNP', 'RB', 'VBD', 'DT', 'JJ', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP', ',', 'VBD', 'TO', 'VB', 'CD', 'JJ', 'NNS', '.'],
['WRB', 'DT', 'NN', 'VBD', 'IN', 'PRP', 'MD', 'RB', 'VB', ',', 'PRP', 'VBD', 'IN', 'DT', 'NNS', 'CC', 'VBD', 'PRP', ',', 'CC', 'RB', 'VBD', 'DT', 'NNS', '.'],
['DT', 'NN', 'IN', 'JJ', 'NN', 'NNS', 'VBZ', 'VBN', 'IN', 'DT', 'JJ', 'CD', 'NNS', 'IN', 'NN', ',', 'NN', 'CC', 'NN', 'IN', 'NN', 'NNS', '.'],
['NNP', 'POS', 'NNP', 'NNP', 'VBD', 'DT', 'NN', 'CD', 'NNS', 'RB', '.'],
['IN', 'NN', 'TO', 'CD', 'JJ', 'JJ', 'NNS', ',', 'PRP', 'RB', 'VBZ', 'VBN', 'DT', 'JJ', 'CD', 'NN', 'VBZ', 'DT', 'NN', 'IN', 'NNS', 'WP', 'VBP', 'DT', 'JJ', 'NN', 'LRB', 'NN', 'POS', 'CC', 'JJ', 'RRB', 'IN', 'DT', 'JJ', 'NN', 'CC', 'NN', '.'],
['JJ', 'NNP', 'NNP', 'NNP', 'VBZ', 'VBN', 'TO', 'VB', 'VBN', 'PRP$', 'NN', 'NNS', 'TO', 'VB', '``', 'DT', 'NN', 'NN', '``', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NNP', '.'],
['DT', 'NNP', 'NNP', 'CC', 'NNP', 'VBP', 'VBN', 'CD', 'NNS', 'IN', 'NN', 'IN', 'DT', 'NNP', 'NN', 'NN', 'POS', 'NNS', '.'],
['JJ', 'NNP', 'NNP', 'NNP', 'VBD', 'PRP$', 'NN', 'POS', 'NNP', 'NNP', 'NNS', 'NNP', 'IN', 'DT', 'NN', 'IN', 'JJ', 'NNS', 'TO', 'VB', 'PRP', 'TO', 'VB', 'PRP$', 'NN', '.'],
['NNP', 'NNP', 'RB', 'VBD', 'IN', 'PRP$', 'NNS', 'TO', 'VB', 'JJ', 'JJ', 'NNS', 'IN', 'NNP', 'CC', 'NNP', 'VBD', 'DT', 'TO', 'VB', 'IN', 'NN', '.'],
['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNP', 'POS', 'NNS', ',', 'CC', 'VBP', 'VBG', 'RP', 'NNS', 'IN', 'DT', 'NN', '.']
]
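Before the Viterbi sketch above can run, the initial, transition, and emission probabilities have to be estimated from `train_states` and `train_observations`. Here is a minimal sketch of how that could be done by simple relative-frequency counting; the function name `estimate_hmm_parameters` and the returned dictionary layout are illustrative assumptions, not necessarily what the repository does:

```python
from collections import Counter, defaultdict

def estimate_hmm_parameters(train_states, train_observations):
    """Estimate initial, transition, and emission probabilities by counting.

    train_states       : list of word sequences (sentences)
    train_observations : list of the corresponding tag sequences
    """
    start_counts = Counter()
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)

    for words, tags in zip(train_states, train_observations):
        start_counts[tags[0]] += 1            # first tag of the sentence
        for word, tag in zip(words, tags):
            emit_counts[tag][word] += 1       # counts for P(w_i | t_i)
        for prev_tag, tag in zip(tags, tags[1:]):
            trans_counts[prev_tag][tag] += 1  # counts for P(t_i | t_{i-1})

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    start_p = normalize(start_counts)
    trans_p = {tag: normalize(counts) for tag, counts in trans_counts.items()}
    emit_p = {tag: normalize(counts) for tag, counts in emit_counts.items()}
    return start_p, trans_p, emit_p
```

The returned dictionaries match the layout expected by the `viterbi` sketch shown earlier, so the two pieces can be combined for a quick end-to-end test on the example data above.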