An implementation of the Viterbi algorithm that finds the most likely sequence of hidden events that would explain a sequence of observed events. In the context of Natural Language Processing (NLP), it is often applied to the Part-of-Speech (POS) tagging task, where the goal is to assign a part-of-speech tag (such as noun, verb, adjective, etc.) to each word in a sentence.
Problem setup:
- Observation sequence: the sequence of words in a sentence (e.g., "The cat sleeps").
- Hidden states: the POS tags (e.g., `DET` for determiner, `NOUN` for noun, `VERB` for verb).
- The task is to find the most likely sequence of POS tags that best explains the observed words in the sentence.
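To make this concrete, here is a minimal sketch (plain Python, not taken from the repository) of how the observed words and candidate hidden states for the example sentence could be represented:

```python
# Observed sequence: the words of the sentence.
observations = ["The", "cat", "sleeps"]

# Hidden states: the POS tags the model can assign.
hidden_states = ["DET", "NOUN", "VERB"]

# Goal: recover the most likely tag sequence for the observed words,
# e.g. ["DET", "NOUN", "VERB"] for "The cat sleeps".
```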
Steps of the Viterbi algorithm:
- Initialization:
- Start with the first word of the sentence and compute the probability of each POS tag given the first word.
- This step uses the emission probabilities, i.e., the probability of a word given a tag, $P(w_i|t_i)$, and the initial state probabilities, i.e., the probability of a tag being the first tag in a sentence.
- Recursion:
- For each subsequent word in the sentence, calculate the most likely tag for that word based on the previous word's tag.
- Use the transition probabilities, i.e., the probability of one tag following another, $P(t_i|t_{i-1})$, and the emission probabilities $P(w_i|t_i)$.
- For each word, compute the probability of each possible tag, considering the previous tag and the word's emission probability.
- Store the highest probability and the corresponding tag for each word.
- Termination:
- After processing all words, choose the tag with the highest probability for the last word.
- Trace back through the stored probabilities to find the most likely sequence of tags for the entire sentence.
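The three steps above can be combined into a short, self-contained sketch. This is a minimal illustration in plain Python, not the repository's implementation; the parameter names (`start_p`, `trans_p`, `emit_p`) and the log-probability floor for unseen events are assumptions made here for readability and numerical stability:

```python
from math import log

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words`.

    start_p[t]    : probability that a sentence starts with tag t
    trans_p[s][t] : probability of tag t following tag s, P(t_i | t_{i-1})
    emit_p[t][w]  : probability of word w given tag t, P(w_i | t_i)
    (These names are illustrative; the repository may use different ones.)
    """
    EPS = 1e-12  # floor for unseen events so log() stays defined

    # Initialization: score each tag for the first word.
    V = [{t: log(start_p.get(t, EPS)) + log(emit_p.get(t, {}).get(words[0], EPS))
          for t in tags}]
    back = [{}]

    # Recursion: for every later word, keep the best previous tag per current tag.
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev, best_score = max(
                ((s, V[i - 1][s]
                     + log(trans_p.get(s, {}).get(t, EPS))
                     + log(emit_p.get(t, {}).get(words[i], EPS)))
                 for s in tags),
                key=lambda pair: pair[1])
            V[i][t] = best_score
            back[i][t] = best_prev

    # Termination: best tag for the last word, then trace back.
    last_tag = max(V[-1], key=V[-1].get)
    path = [last_tag]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

Given probability tables estimated from training data (see the parameter section below), calling `viterbi(["The", "cat", "sleeps"], hidden_states, start_p, trans_p, emit_p)` would return one tag per word.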
- Installation:
I'll give you a script; just paste it into your terminal:
git clone https://github.com/steevewu/nlp-viterbi-algorithm.git && cd nlp-viterbi-algorithm
# install pip dependencies
pip install -r requirements.txt
Note: I highly recommend using a Python virtual environment to prevent dependency conflicts between projects that require different package versions and to keep your global Python installation clean.
Here are the definitions of our model's parameters:
- `train_states`: set of sequences [s1, s2, ..., sk] where each $s_i$ = [w1, w2, ..., wn].
- `train_observations`: corresponding set of tags [t1, t2, ..., tk] for `train_states` where each $t_i$ = [p1, p2, ..., pn].
Example:
train_states = [
['South', 'Korea', "'s", 'government', 'Tuesday', 'also', 'unveiled', 'a', 'so-called', 'Green', 'New', 'Job', 'Creation', 'Plan', ',', 'expected', 'to', 'create', '9,60,000', 'new', 'jobs', '.'],
['When', 'the', 'Lion', 'found', 'that', 'he', 'could', 'not', 'escape', ',', 'he', 'flew', 'upon', 'the', 'sheep', 'and', 'killed', 'them', ',', 'and', 'then', 'attacked', 'the', 'oxen', '.'],
['The', 'cost', 'of', 'major', 'food', 'commodities', 'has', 'doubled', 'over', 'the', 'past', 'two', 'years', 'with', 'rice', ',', 'corn', 'and', 'wheat', 'at', 'record', 'highs', '.'],
['Argentina', "'s", 'Lionel', 'Messi', 'tied', 'the', 'match', 'six', 'minutes', 'later', '.'],
['In', 'addition', 'to', '65,000', 'regular', 'H1-B', 'visas', ',', 'It', 'also', 'has', 'approved', 'an', 'additional', '20,000', 'work', 'permits', 'each', 'year', 'for', 'foreigners', 'who', 'hold', 'an', 'advanced', 'degree', '(', 'master', "'s", 'or', 'doctorate', ')', 'from', 'an', 'American', 'college', 'or', 'university', '.'],
['Palestinian', 'President', 'Mahmoud', 'Abbas', 'is', 'reported', 'to', 'have', 'told', 'his', 'security', 'forces', 'to', 'use', '"', 'an', 'iron', 'fist', '"', 'to', 'maintain', 'the', 'current', 'cease-fire', 'with', 'Israel', '.'],
['The', 'Discovery', 'Channel', 'and', 'NASA', 'have', 'restored', '50', 'years', 'of', 'history', 'for', 'the', 'U.S.', 'space', 'agency', "'s", 'archives', '.'],
['Afghan', 'President', 'Hamid', 'Karzai', 'led', 'his', 'country', "'s", 'Independence', 'Day', 'celebrations', 'Sunday', 'with', 'a', 'call', 'for', 'young', 'Afghans', 'to', 'educate', 'themselves', 'to', 'preserve', 'their', 'freedom', '.'],
['Mr.', 'Blair', 'also', 'stressed', 'that', 'his', 'decisions', 'to', 'back', 'U.S.-led', 'military', 'actions', 'in', 'Afghanistan', 'and', 'Iraq', 'had', 'nothing', 'to', 'do', 'with', 'religion', '.'],
['Diplomats', 'reported', 'some', 'progress', 'during', 'Sunday', "'s", 'talks', ',', 'and', 'are', 'holding', 'out', 'hopes', 'for', 'a', 'breakthrough', '.']
]
train_observations = [
['NNP', 'NNP', 'POS', 'NN', 'NNP', 'RB', 'VBD', 'DT', 'JJ', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP', ',', 'VBD', 'TO', 'VB', 'CD', 'JJ', 'NNS', '.'],
['WRB', 'DT', 'NN', 'VBD', 'IN', 'PRP', 'MD', 'RB', 'VB', ',', 'PRP', 'VBD', 'IN', 'DT', 'NNS', 'CC', 'VBD', 'PRP', ',', 'CC', 'RB', 'VBD', 'DT', 'NNS', '.'],
['DT', 'NN', 'IN', 'JJ', 'NN', 'NNS', 'VBZ', 'VBN', 'IN', 'DT', 'JJ', 'CD', 'NNS', 'IN', 'NN', ',', 'NN', 'CC', 'NN', 'IN', 'NN', 'NNS', '.'],
['NNP', 'POS', 'NNP', 'NNP', 'VBD', 'DT', 'NN', 'CD', 'NNS', 'RB', '.'],
['IN', 'NN', 'TO', 'CD', 'JJ', 'JJ', 'NNS', ',', 'PRP', 'RB', 'VBZ', 'VBN', 'DT', 'JJ', 'CD', 'NN', 'VBZ', 'DT', 'NN', 'IN', 'NNS', 'WP', 'VBP', 'DT', 'JJ', 'NN', 'LRB', 'NN', 'POS', 'CC', 'JJ', 'RRB', 'IN', 'DT', 'JJ', 'NN', 'CC', 'NN', '.'],
['JJ', 'NNP', 'NNP', 'NNP', 'VBZ', 'VBN', 'TO', 'VB', 'VBN', 'PRP$', 'NN', 'NNS', 'TO', 'VB', '``', 'DT', 'NN', 'NN', '``', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NNP', '.'],
['DT', 'NNP', 'NNP', 'CC', 'NNP', 'VBP', 'VBN', 'CD', 'NNS', 'IN', 'NN', 'IN', 'DT', 'NNP', 'NN', 'NN', 'POS', 'NNS', '.'],
['JJ', 'NNP', 'NNP', 'NNP', 'VBD', 'PRP$', 'NN', 'POS', 'NNP', 'NNP', 'NNS', 'NNP', 'IN', 'DT', 'NN', 'IN', 'JJ', 'NNS', 'TO', 'VB', 'PRP', 'TO', 'VB', 'PRP$', 'NN', '.'],
['NNP', 'NNP', 'RB', 'VBD', 'IN', 'PRP$', 'NNS', 'TO', 'VB', 'JJ', 'JJ', 'NNS', 'IN', 'NNP', 'CC', 'NNP', 'VBD', 'DT', 'TO', 'VB', 'IN', 'NN', '.'],
['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNP', 'POS', 'NNS', ',', 'CC', 'VBP', 'VBG', 'RP', 'NNS', 'IN', 'DT', 'NN', '.']
]
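Before the Viterbi sketch above can run, the initial, transition, and emission probabilities have to be estimated from `train_states` and `train_observations`. Here is a minimal sketch of how that could be done by simple relative-frequency counting; the function name `estimate_hmm_parameters` and the returned dictionary layout are illustrative assumptions, not necessarily what the repository does:

```python
from collections import Counter, defaultdict

def estimate_hmm_parameters(train_states, train_observations):
    """Estimate initial, transition, and emission probabilities by counting.

    train_states       : list of word sequences (sentences)
    train_observations : list of the corresponding tag sequences
    """
    start_counts = Counter()
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)

    for words, tags in zip(train_states, train_observations):
        start_counts[tags[0]] += 1            # first tag of the sentence
        for word, tag in zip(words, tags):
            emit_counts[tag][word] += 1       # counts for P(w_i | t_i)
        for prev_tag, tag in zip(tags, tags[1:]):
            trans_counts[prev_tag][tag] += 1  # counts for P(t_i | t_{i-1})

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    start_p = normalize(start_counts)
    trans_p = {tag: normalize(counts) for tag, counts in trans_counts.items()}
    emit_p = {tag: normalize(counts) for tag, counts in emit_counts.items()}
    return start_p, trans_p, emit_p
```

The returned dictionaries match the layout expected by the `viterbi` sketch shown earlier, so the two pieces can be combined for a quick end-to-end test on the example data above.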