Application of Viterbi Algorithm in POS Tagging

An implementation of the Viterbi algorithm, which finds the most likely sequence of hidden events that would explain a sequence of observed events. In Natural Language Processing (NLP), it is often applied to the Part-of-Speech (POS) tagging task, where the goal is to assign a part-of-speech tag (such as noun, verb, adjective, etc.) to each word in a sentence.

How the algorithm works

  1. Problem setup:

    • Observation sequence: the sequence of words in a sentence (e.g., "The cat sleeps").
    • Hidden states: the POS tags (e.g., DET for determiner, NOUN for noun, VERB for verb).
    • The task is to find the most likely sequence of POS tags that best explains the observed words in the sentence.
  2. Steps of the Viterbi algorithm:

    • Initialization:
      • Start with the first word of the sentence and compute the probability of each POS tag given that word.
      • This step uses the emission probabilities, i.e., the probability of a word given a tag, $P(w_i|t_i)$, and the initial state probabilities, i.e., the probability of a tag being the first tag in a sentence.
    • Recursion:
      • For each subsequent word in the sentence, calculate the most likely tag for that word based on the previous word's tag.
      • Use the transition probabilities, i.e., the probability of one tag following another, $P(t_i|t_{i-1})$ and the emission probabilities $P(w_i|t_i)$.
      • For each word, compute the probability of each possible tag, considering the previous tag and the word's emission probability.
      • Store the highest probability and the corresponding tag for each word.
    • Termination:
      • After processing all words, choose the tag with the highest probability for the last word.
      • Trace back through the stored tags to recover the most likely sequence of tags for the entire sentence (see the code sketch after this list).
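
Below is a minimal, self-contained Python sketch of these three steps on the example sentence "The cat sleeps". The probability tables are made-up toy values used only for illustration; they are not the numbers this repository estimates from training data.

# Toy initial state probabilities P(t_1) -- illustrative values only
initial = {'DET': 0.6, 'NOUN': 0.3, 'VERB': 0.1}

# Toy transition probabilities P(t_i | t_{i-1})
transition = {
    'DET':  {'DET': 0.05, 'NOUN': 0.85, 'VERB': 0.10},
    'NOUN': {'DET': 0.10, 'NOUN': 0.30, 'VERB': 0.60},
    'VERB': {'DET': 0.40, 'NOUN': 0.40, 'VERB': 0.20},
}

# Toy emission probabilities P(w_i | t_i)
emission = {
    'DET':  {'The': 0.7, 'cat': 0.0, 'sleeps': 0.0},
    'NOUN': {'The': 0.0, 'cat': 0.6, 'sleeps': 0.1},
    'VERB': {'The': 0.0, 'cat': 0.0, 'sleeps': 0.5},
}

def viterbi(words, tags, initial, transition, emission):
    # Initialization: trellis[i][t] is the best probability of any tag
    # sequence ending in tag t at word i; backpointer[i][t] remembers the
    # previous tag on that best path.
    trellis = [{t: initial[t] * emission[t].get(words[0], 0.0) for t in tags}]
    backpointer = [{}]

    # Recursion: extend the best paths one word at a time.
    for i in range(1, len(words)):
        trellis.append({})
        backpointer.append({})
        for t in tags:
            best_prev, best_prob = max(
                ((p, trellis[i - 1][p] * transition[p][t]) for p in tags),
                key=lambda pair: pair[1],
            )
            trellis[i][t] = best_prob * emission[t].get(words[i], 0.0)
            backpointer[i][t] = best_prev

    # Termination: pick the best final tag, then trace back through the
    # backpointers to recover the full tag sequence.
    last_tag = max(trellis[-1], key=trellis[-1].get)
    path = [last_tag]
    for i in range(len(words) - 1, 0, -1):
        path.append(backpointer[i][path[-1]])
    return list(reversed(path))

print(viterbi(['The', 'cat', 'sleeps'], ['DET', 'NOUN', 'VERB'],
              initial, transition, emission))
# ['DET', 'NOUN', 'VERB']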

Installation

Paste the following into your terminal to clone the repository and install its dependencies:

git clone https://github.com/steevewu/nlp-viterbi-algorithm.git && cd nlp-viterbi-algorithm

# install pip dependencies
pip install -r requirements.txt

Note: I highly recommend using a Python virtual environment; it prevents dependency conflicts between projects that require different package versions and keeps your global Python installation clean.

Usage

Here are the definitions of our model's parameters:

  1. train_states: a set of word sequences $[s_1, s_2, \ldots, s_k]$, where each $s_i = [w_1, w_2, \ldots, w_n]$ is a tokenized sentence.
  2. train_observations: the corresponding set of tag sequences $[t_1, t_2, \ldots, t_k]$, where each $t_i = [p_1, p_2, \ldots, p_n]$ gives the POS tags for the words of $s_i$.

Example:

train_states = [
    ['South', 'Korea', "'s", 'government', 'Tuesday', 'also', 'unveiled', 'a', 'so-called', 'Green', 'New', 'Job', 'Creation', 'Plan', ',', 'expected', 'to', 'create', '9,60,000', 'new', 'jobs', '.'],
    ['When', 'the', 'Lion', 'found', 'that', 'he', 'could', 'not', 'escape', ',', 'he', 'flew', 'upon', 'the', 'sheep', 'and', 'killed', 'them', ',', 'and', 'then', 'attacked', 'the', 'oxen', '.'],
    ['The', 'cost', 'of', 'major', 'food', 'commodities', 'has', 'doubled', 'over', 'the', 'past', 'two', 'years', 'with', 'rice', ',', 'corn', 'and', 'wheat', 'at', 'record', 'highs', '.'],
    ['Argentina', "'s", 'Lionel', 'Messi', 'tied', 'the', 'match', 'six', 'minutes', 'later', '.'],
    ['In', 'addition', 'to', '65,000', 'regular', 'H1-B', 'visas', ',', 'It', 'also', 'has', 'approved', 'an', 'additional', '20,000', 'work', 'permits', 'each', 'year', 'for', 'foreigners', 'who', 'hold', 'an', 'advanced', 'degree', '(', 'master', "'s", 'or', 'doctorate', ')', 'from', 'an', 'American', 'college', 'or', 'university', '.'],
    ['Palestinian', 'President', 'Mahmoud', 'Abbas', 'is', 'reported', 'to', 'have', 'told', 'his', 'security', 'forces', 'to', 'use', '"', 'an', 'iron', 'fist', '"', 'to', 'maintain', 'the', 'current', 'cease-fire', 'with', 'Israel', '.'],
    ['The', 'Discovery', 'Channel', 'and', 'NASA', 'have', 'restored', '50', 'years', 'of', 'history', 'for', 'the', 'U.S.', 'space', 'agency', "'s", 'archives', '.'],
    ['Afghan', 'President', 'Hamid', 'Karzai', 'led', 'his', 'country', "'s", 'Independence', 'Day', 'celebrations', 'Sunday', 'with', 'a', 'call', 'for', 'young', 'Afghans', 'to', 'educate', 'themselves', 'to', 'preserve', 'their', 'freedom', '.'],
    ['Mr.', 'Blair', 'also', 'stressed', 'that', 'his', 'decisions', 'to', 'back', 'U.S.-led', 'military', 'actions', 'in', 'Afghanistan', 'and', 'Iraq', 'had', 'nothing', 'to', 'do', 'with', 'religion', '.'],
    ['Diplomats', 'reported', 'some', 'progress', 'during', 'Sunday', "'s", 'talks', ',', 'and', 'are', 'holding', 'out', 'hopes', 'for', 'a', 'breakthrough', '.']
]

train_observations = [
    ['NNP', 'NNP', 'POS', 'NN', 'NNP', 'RB', 'VBD', 'DT', 'JJ', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP', ',', 'VBD', 'TO', 'VB', 'CD', 'JJ', 'NNS', '.'],
    ['WRB', 'DT', 'NN', 'VBD', 'IN', 'PRP', 'MD', 'RB', 'VB', ',', 'PRP', 'VBD', 'IN', 'DT', 'NNS', 'CC', 'VBD', 'PRP', ',', 'CC', 'RB', 'VBD', 'DT', 'NNS', '.'],
    ['DT', 'NN', 'IN', 'JJ', 'NN', 'NNS', 'VBZ', 'VBN', 'IN', 'DT', 'JJ', 'CD', 'NNS', 'IN', 'NN', ',', 'NN', 'CC', 'NN', 'IN', 'NN', 'NNS', '.'],
    ['NNP', 'POS', 'NNP', 'NNP', 'VBD', 'DT', 'NN', 'CD', 'NNS', 'RB', '.'],
    ['IN', 'NN', 'TO', 'CD', 'JJ', 'JJ', 'NNS', ',', 'PRP', 'RB', 'VBZ', 'VBN', 'DT', 'JJ', 'CD', 'NN', 'VBZ', 'DT', 'NN', 'IN', 'NNS', 'WP', 'VBP', 'DT', 'JJ', 'NN', 'LRB', 'NN', 'POS', 'CC', 'JJ', 'RRB', 'IN', 'DT', 'JJ', 'NN', 'CC', 'NN', '.'],
    ['JJ', 'NNP', 'NNP', 'NNP', 'VBZ', 'VBN', 'TO', 'VB', 'VBN', 'PRP$', 'NN', 'NNS', 'TO', 'VB', '``', 'DT', 'NN', 'NN', '``', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NNP', '.'],
    ['DT', 'NNP', 'NNP', 'CC', 'NNP', 'VBP', 'VBN', 'CD', 'NNS', 'IN', 'NN', 'IN', 'DT', 'NNP', 'NN', 'NN', 'POS', 'NNS', '.'],
    ['JJ', 'NNP', 'NNP', 'NNP', 'VBD', 'PRP$', 'NN', 'POS', 'NNP', 'NNP', 'NNS', 'NNP', 'IN', 'DT', 'NN', 'IN', 'JJ', 'NNS', 'TO', 'VB', 'PRP', 'TO', 'VB', 'PRP$', 'NN', '.'],
    ['NNP', 'NNP', 'RB', 'VBD', 'IN', 'PRP$', 'NNS', 'TO', 'VB', 'JJ', 'JJ', 'NNS', 'IN', 'NNP', 'CC', 'NNP', 'VBD', 'DT', 'TO', 'VB', 'IN', 'NN', '.'],
    ['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNP', 'POS', 'NNS', ',', 'CC', 'VBP', 'VBG', 'RP', 'NNS', 'IN', 'DT', 'NN', '.']
]
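
For reference, the three probability tables the tagger needs (initial, transition, and emission) can be estimated from train_states and train_observations by simple relative-frequency counting, as in the sketch below. This is an illustration of the standard counting approach, not necessarily the exact code used in this repository (for example, it applies no smoothing for unseen words).

from collections import Counter, defaultdict

def estimate_parameters(train_states, train_observations):
    # Uses the train_states / train_observations lists defined in the
    # example above: words come from train_states, tags from train_observations.
    initial_counts = Counter()                # tag counts at sentence start
    transition_counts = defaultdict(Counter)  # counts of tag -> next tag
    emission_counts = defaultdict(Counter)    # counts of tag -> word

    for words, tags in zip(train_states, train_observations):
        initial_counts[tags[0]] += 1
        for word, tag in zip(words, tags):
            emission_counts[tag][word] += 1
        for prev_tag, tag in zip(tags, tags[1:]):
            transition_counts[prev_tag][tag] += 1

    # Turn raw counts into relative-frequency probabilities.
    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    initial = normalize(initial_counts)
    transition = {tag: normalize(c) for tag, c in transition_counts.items()}
    emission = {tag: normalize(c) for tag, c in emission_counts.items()}
    return initial, transition, emission

initial, transition, emission = estimate_parameters(train_states, train_observations)
print(emission['DT']['The'])  # fraction of DT tokens that are the word 'The'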
