# Speech-Tag-Identifier

This repository implements part-of-speech (POS) tagging
using a hidden Markov model (HMM). This file (`notebook.ipynb`) walks you through how to use the repository and how it works.

### Table of Contents

1. <a href="#section1">Reading the data</a>
1. <a href="#section2">Tagset</a>
1. <a href="#section3">Taggers</a>
1. <a href="#section4">Baseline Tagger</a>
1. <a href="#section5">Viterbi: HMM Tagger</a>
1. <a href="#section6">Baseline Tagger vs Viterbi</a>
1. <a href="#section7">Identifying through Audio</a>


<a id='section1'></a>
## Reading the data
The dataset consists of thousands of sentences with ground-truth POS tags. 

The provided `load_dataset` function reads the data into a nested list: the outer dimension indexes the sentences and the inner dimension indexes the tagged words of each sentence. The following cells walk through this representation.

The provided code converts all words to lowercase. It also adds a START and END tag for each sentence when it loads the sentence. These tags are just for standardization. They will not be considered in accuracy computation.
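
As a rough illustration of this representation, the sketch below builds a tiny hand-written dataset in the same nested-list shape. The data is invented for the example, and `strip_tags_sketch` and `count_scored_tokens` are hypothetical helpers, not the functions in `utils`.

```python
# Hand-built toy data in the same nested-list shape; not real loader output.
toy_set = [
    [("START", "START"), ("the", "DET"), ("jury", "NOUN"),
     ("said", "VERB"), (".", "PERIOD"), ("END", "END")],
    [("START", "START"), ("no", "DET"), ("evidence", "NOUN"),
     (".", "PERIOD"), ("END", "END")],
]


def strip_tags_sketch(sentences):
    """Hypothetical stand-in for utils.strip_tags: keep only the words."""
    return [[word for word, _tag in sentence] for sentence in sentences]


def count_scored_tokens(sentences):
    """Count the tokens that would be scored, skipping START and END."""
    return sum(
        1
        for sentence in sentences
        for _word, tag in sentence
        if tag not in ("START", "END")
    )


words_only = strip_tags_sketch(toy_set)
print(words_only[0])
print(count_scored_tokens(toy_set))
```

Indexing works the same way as in the cells that follow: `toy_set[0][1]` is the first real word of the first sentence, because position 0 holds the START marker.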

In [2]:
import utils
train_set = utils.load_dataset('data/brown-training.txt')
dev_set = utils.load_dataset('data/brown-test.txt')

In [3]:
print('training set has {} sentences'.format(len(train_set)))
print('dev set has {} sentences'.format(len(dev_set)))
print('The first sentence of training set has {} words'.format(len(train_set[0])))
print('The 10th word of the first sentence in the training set is "{}" with ground-truth tag "{}"'.format(train_set[0][9][0], train_set[0][9][1]))

training set has 35655 sentences
dev set has 9912 sentences
The first sentence of training set has 27 words
The 10th word of the first sentence in the training set is "investigation" with ground-truth tag "NOUN"


In [4]:
print('Here is a sample sentence from the training set:\n', train_set[0])

Here is a sample sentence from the training set:
 [('START', 'START'), ('the', 'DET'), ('fulton', 'NOUN'), ('county', 'NOUN'), ('grand', 'ADJ'), ('jury', 'NOUN'), ('said', 'VERB'), ('friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'IN'), ("atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', 'PUNCT'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", 'PUNCT'), ('that', 'CONJ'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', 'PERIOD'), ('END', 'END')]


<a id='section2'></a>
<h2>Tagset</h2>

<p>
  The following set of 16 part-of-speech tags is the tagset used in the
  provided Brown corpus data.
</p>

<ul>

<li> ADJ adjective
<li> ADV adverb
<li> IN preposition
<li> PART particle (e.g. after verb, looks like a preposition)

<li> PRON pronoun
<li> NUM number
<li> CONJ conjunction
<li> UH filler, exclamation

<li> TO infinitive
<li> VERB verb
<li> MODAL modal verb
<li> DET determiner

<li> NOUN noun
<li> PERIOD end of sentence punctuation
<li> PUNCT  other punctuation
<li> X miscellaneous hard-to-classify items
</ul>

<a id='section3'></a>
<h2>Taggers</h2>

There are two main types of tagging functions:

<ul>
<li> Baseline tagger
<li> Viterbi: HMM tagger
</ul>

The training data is used to estimate the parameters of each model, and the test set is used to measure its accuracy.

<a id='section4'></a>
<h2>Baseline Tagger</h2>

The baseline tagger considers each word independently, ignoring previous words and tags. For each word w, it counts how many times w occurs with each tag in the training data. When processing the test data, it always gives w the tag it was seen with most often. For unseen words, it guesses the tag that occurs most often in the training data overall.
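
The idea can be sketched in a few lines. This is a minimal illustration of the description above, not the code in `algorithm.py`; the name `baseline_sketch` and the toy sentences are assumptions made for the example.

```python
from collections import Counter, defaultdict


def baseline_sketch(train, test):
    """Tag each word with its most frequent training tag (illustrative only)."""
    word_tag_counts = defaultdict(Counter)  # word -> Counter of its tags
    tag_counts = Counter()                  # overall tag frequencies
    for sentence in train:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1

    # Fallback for unseen words: the most frequent tag in the training data.
    default_tag = tag_counts.most_common(1)[0][0]

    tagged = []
    for sentence in test:
        tagged.append([
            (word,
             word_tag_counts[word].most_common(1)[0][0]
             if word in word_tag_counts else default_tag)
            for word in sentence
        ])
    return tagged


# Invented toy data: "run" is a VERB three times and a NOUN once.
toy_train = [
    [("START", "START"), ("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
     ("the", "DET"), ("cat", "NOUN"), ("END", "END")],
    [("START", "START"), ("dogs", "NOUN"), ("run", "VERB"), ("END", "END")],
    [("START", "START"), ("the", "DET"), ("run", "NOUN"), ("tired", "VERB"),
     ("the", "DET"), ("dog", "NOUN"), ("END", "END")],
    [("START", "START"), ("cats", "NOUN"), ("run", "VERB"), ("home", "NOUN"),
     ("END", "END")],
    [("START", "START"), ("dogs", "NOUN"), ("run", "VERB"), ("END", "END")],
]
toy_test = [["START", "the", "run", "zebra", "END"]]
toy_predicted = baseline_sketch(toy_train, toy_test)
print(toy_predicted)
```

In this toy run, "run" gets VERB because that is its most frequent tag, and the unseen word "zebra" falls back to NOUN, the most frequent tag overall.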

In [5]:
import algorithm
import importlib
importlib.reload(algorithm)
print(algorithm.__doc__)

None


In [6]:
help(algorithm.baseline)

Help on function baseline in module submitted:

baseline(train, test)
    Implementation for the baseline tagger.
    input:  training data (list of sentences, with tags on the words)
            test data (list of sentences, no tags on the words, use utils.strip_tags to remove tags from data)
    output: list of sentences, each sentence is a list of (word,tag) pairs.
            E.g., [[(word1, tag1), (word2, tag2)], [(word3, tag3), (word4, tag4)]]



In [7]:
import time
importlib.reload(algorithm)
train_set = utils.load_dataset('data/brown-training.txt')
dev_set = utils.load_dataset('data/brown-test.txt')
start_time = time.time()
predicted = algorithm.baseline(train_set, utils.strip_tags(dev_set))
time_spend = time.time() - start_time
accuracy, _, _ = utils.evaluate_accuracies(predicted, dev_set)
multi_tag_accuracy, unseen_words_accuracy, = utils.specialword_accuracies(train_set, predicted, dev_set)

print("time spent: {0:.4f} sec".format(time_spend))
print("accuracy: {0:.4f}".format(accuracy))
print("multi-tag accuracy: {0:.4f}".format(multi_tag_accuracy))
print("unseen word accuracy: {0:.4f}".format(unseen_words_accuracy))

time spent: 0.6091 sec
accuracy: 0.9387
multi-tag accuracy: 0.9019
unseen word accuracy: 0.6782


<a id='section5'></a>
<h2>Viterbi: HMM Tagger</h2>
<p>
The Viterbi tagger implements the HMM trellis (Viterbi) decoding algorithm. The probability of each
tag depends only on the previous tag, and the probability of each word depends
only on the corresponding tag. This model estimates
three sets of probabilities:

<ul>
<li>  Initial probabilities (How often does each tag occur at the start of
a sentence?)
<li>  Transition probabilities (How often does tag \(t_b\)  follow tag
\(t_a\)?)
<li>  Emission probabilities (How often does tag t yield word w?)
</ul>
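
The raw counts behind these three sets of probabilities could be gathered as in the sketch below. It is an assumption-laden illustration: `count_events` is a hypothetical name, and treating the tag right after START as the "initial" tag is one possible convention, not necessarily the one used in this repository.

```python
from collections import Counter


def count_events(train):
    """Count initial tags, tag-to-tag transitions, and tag/word emissions."""
    initial = Counter()     # tag of the first real word in each sentence
    transition = Counter()  # (previous tag, tag) pairs
    emission = Counter()    # (tag, word) pairs
    for sentence in train:
        tags = [tag for _word, tag in sentence]
        if len(tags) > 1:
            initial[tags[1]] += 1  # position 0 is the START marker
        transition.update(zip(tags, tags[1:]))
        emission.update((tag, word) for word, tag in sentence)
    return initial, transition, emission


# Invented toy data in the same shape as the training set.
toy_train = [
    [("START", "START"), ("the", "DET"), ("dog", "NOUN"), ("END", "END")],
    [("START", "START"), ("dogs", "NOUN"), ("run", "VERB"), ("END", "END")],
]
initial, transition, emission = count_events(toy_train)
print(initial)
print(transition[("START", "DET")], emission[("NOUN", "dog")])
```

Dividing each count by the appropriate total (with smoothing) turns these counters into probability estimates.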

<p>
All sentences must start with a START token, whose tag is START, and must end with an END token, whose tag is END.

<p>
The algorithm is divided into five main steps:

<ul>
<li> Count occurrences of tags, tag pairs, tag/word pairs.
<li> Compute smoothed probabilities
<li> Take the log of each probability
<li> Construct the trellis.   Notice that
for each tag/time pair, you must store not only
the probability of the best path but also a pointer to the
previous tag/time pair in that path.
<li> Return the best path through the trellis.
</ul>
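
The last two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's `viterbi` function: `viterbi_decode` is a hypothetical name, the probability tables are hand-written toy values, the input sentence is assumed to be non-empty, and every log-probability that is looked up is assumed to exist.

```python
import math


def viterbi_decode(words, tags, log_init, log_trans, log_emit):
    """Return the most probable tag sequence for a non-empty word list."""
    # trellis[i][t] = (best log-probability of a path ending in tag t at
    #                  position i, previous tag on that path)
    trellis = [{t: (log_init[t] + log_emit[(t, words[0])], None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            best_prev = max(
                tags, key=lambda p: trellis[i - 1][p][0] + log_trans[(p, t)]
            )
            score = (trellis[i - 1][best_prev][0]
                     + log_trans[(best_prev, t)]
                     + log_emit[(t, words[i])])
            column[t] = (score, best_prev)
        trellis.append(column)

    # Follow the back-pointers from the best final tag to the first position.
    tag = max(tags, key=lambda t: trellis[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = trellis[i][tag][1]
        path.append(tag)
    return list(reversed(path))


def to_logs(table):
    """Convert a table of probabilities to log-probabilities."""
    return {key: math.log(p) for key, p in table.items()}


# Hand-written toy probabilities, chosen only for illustration.
toy_tags = ["NOUN", "VERB"]
toy_init = to_logs({"NOUN": 0.8, "VERB": 0.2})
toy_trans = to_logs({("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
                     ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4})
toy_emit = to_logs({("NOUN", "dogs"): 0.9, ("VERB", "dogs"): 0.1,
                    ("NOUN", "run"): 0.3, ("VERB", "run"): 0.7})
best_path = viterbi_decode(["dogs", "run"], toy_tags, toy_init, toy_trans, toy_emit)
print(best_path)
```

Storing the previous tag next to each score is what makes the final backward pass possible; without the back-pointers only the probability of the best path would be known, not the path itself.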

<p>
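Laplace smoothing is used to improve performance, because unsmoothed transition and emission
probabilities are zero for tag pairs and tag/word pairs that never occur in the training data,
and a zero probability has no finite logarithm.
Make sure that your code for computing transition and emission probabilities
never returns zero.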
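
A minimal sketch of Laplace smoothing for one tag's emission counts is shown below. The function name `laplace_log_prob`, the smoothing constant `alpha=1.0`, the toy counts, and the choice to reserve one extra outcome for unseen words are all assumptions made for illustration; the repository may use different values.

```python
import math


def laplace_log_prob(count, total, num_outcomes, alpha=1.0):
    """Return log((count + alpha) / (total + alpha * num_outcomes))."""
    return math.log((count + alpha) / (total + alpha * num_outcomes))


# Toy emission counts for one tag: NOUN emitted "dog" 3 times and "cat" once.
noun_counts = {"dog": 3, "cat": 1}
total = sum(noun_counts.values())
vocabulary = ["dog", "cat", "run"]
num_outcomes = len(vocabulary) + 1  # one extra slot for any unseen word

log_probs = {
    w: laplace_log_prob(noun_counts.get(w, 0), total, num_outcomes)
    for w in vocabulary
}
log_probs["<unseen>"] = laplace_log_prob(0, total, num_outcomes)
print({w: round(math.exp(lp), 3) for w, lp in log_probs.items()})
```

Because `alpha` is added to every count, even a tag/word pair with a count of zero gets a small positive probability, so its logarithm is finite.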

<p>
Viterbi is expected to perform worse than the baseline on unseen words
in the Brown development set; in the run recorded below, unseen-word accuracy is 0.2563 for Viterbi versus 0.6782 for the baseline.

In [8]:
help(algorithm.viterbi)

Help on function viterbi in module submitted:

viterbi(train, test)
    Implementation for the viterbi tagger.
    input:  training data (list of sentences, with tags on the words)
            test data (list of sentences, no tags on the words)
    output: list of sentences with tags on the words
            E.g., [[(word1, tag1), (word2, tag2)], [(word3, tag3), (word4, tag4)]]



<a id='section6'></a>
## Comparing accuracy of Baseline Tagger vs Viterbi

You can compare the accuracy of both functions here. Remember to change the `train_set` and `dev_set` to your desired datasets.

In [17]:
import time
importlib.reload(algorithm)
train_set = utils.load_dataset('data/brown-training.txt')
dev_set = utils.load_dataset('data/brown-test.txt')
start_time = time.time()
print(len(utils.strip_tags(dev_set)))
print(utils.strip_tags(dev_set)[0])
print(len(utils.strip_tags(dev_set)[0]))

predicted = algorithm.viterbi(train_set, utils.strip_tags(dev_set))
time_spend = time.time() - start_time
accuracy, _, _ = utils.evaluate_accuracies(predicted, dev_set)
multi_tag_accuracy, unseen_words_accuracy, = utils.specialword_accuracies(train_set, predicted, dev_set)

print("stats for viterbi:")
print("time spent: {0:.4f} sec".format(time_spend))
print("accuracy: {0:.4f}".format(accuracy))
print("multi-tag accuracy: {0:.4f}".format(multi_tag_accuracy))
print("unseen word accuracy: {0:.4f} \n".format(unseen_words_accuracy))

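# Note: start_time is not reset here, so the baseline time printed below also includes the Viterbi run above.
predicted = algorithm.baseline(train_set, utils.strip_tags(dev_set))
time_spend = time.time() - start_time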
accuracy, _, _ = utils.evaluate_accuracies(predicted, dev_set)
multi_tag_accuracy, unseen_words_accuracy, = utils.specialword_accuracies(train_set, predicted, dev_set)

print("stats for baseline:")
print("time spent: {0:.4f} sec".format(time_spend))
print("accuracy: {0:.4f}".format(accuracy))
print("multi-tag accuracy: {0:.4f}".format(multi_tag_accuracy))
print("unseen word accuracy: {0:.4f}".format(unseen_words_accuracy))

9912
['START', 'appointment', 'of', 'william', 's.', 'pfaff', 'jr.', ',', '41', ',', 'as', 'promotion', 'manager', 'of', 'the', 'times-picayune', 'publishing', 'company', 'was', 'announced', 'saturday', 'by', 'john', 'f.', 'tims', ',', 'president', 'of', 'the', 'company', '.', 'END']
32
stats for viterbi:
time spent: 17.6231 sec
accuracy: 0.9388
multi-tag accuracy: 0.9387
unseen word accuracy: 0.2563 

stats for baseline:
time spent: 18.8243 sec
accuracy: 0.9387
multi-tag accuracy: 0.9019
unseen word accuracy: 0.6782



<a id='section7'></a>
## Identifying through Audio

Now let us see whether we can identify the POS tags of spoken words using `speech.py`.

