# Webinar 4 - Project 1 Walkthrough

A **part-of-speech tagger** reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word in that text.

You must complete Steps 1-3 below to pass the project. We will go through each step and pick up some useful tips for working on the projects.
- **Step 1:** Review the provided interface to load and access the text corpus
- **Step 2:** Build a Most Frequent Class tagger to use as a baseline
- **Step 3:** Build an HMM Part of Speech tagger and compare to the MFC baseline

In [None]:
# Import the libraries
import numpy as np
from helpers import Dataset

<hr>

## 1. Read and preprocess the dataset

### 1.1. Load the dataset

The dataset used in this project is a copy of the **Brown corpus** that has already been pre-processed to include only the **universal tagset**.

In [3]:
# Load the dataset
data = Dataset(tagfile = "./tags-universal.txt", 
               datafile = "./brown-universal.txt", 
               train_test_split = 0.8)

In [None]:
# Take a look at the dataset
data

The data has the following structure:

<p style="line-height: 25px;">
Dataset(sentences={<strong>'b100-5507'</strong>: Sentence(<strong>words</strong>=('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.'), <strong>tags</strong>=('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')), 
...)})
</p>


- **"b100-5507"**: ID of the sentence
- **words**: Preprocessed tokens of the sentence
- **tags**: The label of each token in the sentence

### 1.2. Explore the dataset

In [29]:
# Check the number of sentences in the dataset
print("Number of sentences in the whole dataset: ", len(data.sentences))
print("Number of sentences in the training set: ", len(data.training_set.sentences))
print("Number of sentences in the test set: ", len(data.testing_set.sentences))

Number of sentences in the whole dataset:  57340
Number of sentences in the training set:  45872
Number of sentences in the test set:  11468


In [36]:
# Check the number of unique words (vocabulary size) in the dataset
print("Number of unique vocabularies in the whole dataset: ", len(data.vocab))
print("Number of unique vocabularies in the training set: ", len(data.training_set.vocab))
print("Number of unique vocabularies in the test set: ", len(data.testing_set.vocab))

Number of unique vocabularies in the whole dataset:  56057
Number of unique vocabularies in the training set:  50536
Number of unique vocabularies in the test set:  25112


### 1.3. Tags / Labels

In this project we will work with **12 tags**, which serve as our labels. The tags are as follows:

In [40]:
# Check the whole tagset (labels or outputs)
print("Total number of labels: ", len(data.tagset))
print("Labels: ", data.tagset)

Total number of labels:  12
Labels:  frozenset({'.', 'VERB', 'ADV', 'NUM', 'X', 'DET', 'NOUN', 'PRT', 'CONJ', 'ADJ', 'ADP', 'PRON'})


### 1.4. Get a specific sentence

You can get the tokens and tags of a specific sentence by its ID. Let's see how to do it.

In [59]:
# Get a sentence by its ID
sentence_identifier = "b100-5507"

print("The whole sentence: \n", data.sentences[sentence_identifier])
print("-----------------------------------------------------------------------------")
print("Words: \n", data.sentences[sentence_identifier].words)
print("-----------------------------------------------------------------------------")
print("Tags: \n", data.sentences[sentence_identifier].tags)

The whole sentence: 
 Sentence(words=('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.'), tags=('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.'))
-----------------------------------------------------------------------------
Words: 
 ('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.')
-----------------------------------------------------------------------------
Tags: 
 ('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')


### 1.5. Get words and tags more easily

An easier way to get the words and tags of sentences is to use **.X** or **.Y** on **data**, the **training set**, or the **test set**.

In [61]:
# Get the words in the first sentence
print(data.X[0])

('Mr.', 'Podger', 'had', 'thanked', 'him', 'gravely', ',', 'and', 'now', 'he', 'made', 'use', 'of', 'the', 'advice', '.')


In [62]:
# Get the tags in the first sentence
print(data.Y[0])

('NOUN', 'NOUN', 'VERB', 'VERB', 'PRON', 'ADV', '.', 'CONJ', 'ADV', 'PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', '.')


### 1.6. Use .stream()

**.stream()** is another way to get the words and tags, with the difference that it can only be used for iteration: it yields one (word, tag) pair at a time.

In [70]:
for index, (i_word, i_tag) in enumerate(data.stream()):
    print("Word: ", i_word)
    print("Tag: ", i_tag)
    print("")
    if index == 5:
        break

Word:  Mr.
Tag:  NOUN

Word:  Podger
Tag:  NOUN

Word:  had
Tag:  VERB

Word:  thanked
Tag:  VERB

Word:  him
Tag:  PRON

Word:  gravely
Tag:  ADV



<hr>

## 2. Build a Most Frequent Class tagger

There are **3 TODOs** in this part that you have to complete:
1. Write the **pair_counts** function
2. Apply **pair_counts** to our training set
3. Create a **mfc_table**

### 2.1. Task 1 - Write the pair_counts function

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:
- Create an empty dictionary at the start; you will add all of the results to it.
- Iterate through the sequence of tags, keeping track of the index of the current iteration.
    - Get the corresponding word from the sequence of words using the index.
    - Check whether the tag is already a key in the dictionary you initialized at the start.
        - If it is not, set its value to an empty dictionary
        - If it is, get the dictionary stored under that tag
    - Check whether the current word is in that inner dictionary
        - If it is not, set its count to 1
        - If it is, add 1 to its count
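
The tips above can be sketched roughly as follows. This is only one possible version, not the reference solution: the parameter names `sequences_a` and `sequences_b` are assumptions, and the toy lists stand in for the real corpus.

```python
def pair_counts(sequences_a, sequences_b):
    """Count how often each item of sequences_b occurs together with each item of sequences_a.

    Returns a nested dictionary: counts[a][b] is the number of (a, b) pairs seen.
    """
    counts = {}
    for index, a in enumerate(sequences_a):
        b = sequences_b[index]            # item at the same position in the other sequence
        inner = counts.setdefault(a, {})  # create the inner dictionary the first time a is seen
        inner[b] = inner.get(b, 0) + 1    # start at 1, or add 1 to the existing count
    return counts


# Toy data (illustrative only)
tags = ["NOUN", "NOUN", "VERB", "NOUN"]
words = ["Mr.", "Podger", "had", "Mr."]
print(pair_counts(tags, words))
```

Swapping the two arguments gives counts keyed by word first, which is the shape needed later for the most-frequent-class table.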

### 2.2. Task 2 - Apply pair_counts to our training set

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:
- Iterate through all tags and append all of them to one list
- Iterate through all words and append all of them to one list
- Apply the pair_counts function to the tags and words you have just collected

**The input should look similar to the following (not exactly):**

tags = ['NOUN',
 'NOUN',
 'VERB',
 'VERB',
 'PRON',
 'ADV',
 '.', ...]
 
words = ['Mr.',
 'Podger',
 'had',
 'thanked',
 'him',
 'gravely',
 ',', ...]
 
**The output should look similar to the following (not exactly):**

{'NOUN': {'Mr.': 845,
  'Podger': 22,
  'use': 353,
  'advice': 51,
  'difference': 149,
  'opinion': 95,
  'board': 166,
  'instrument': 45,
  'elasticity': 6,
  'pastes': 2, ...}, ...}
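
A minimal sketch of the flattening step, assuming the per-sentence words and tags come as tuples of tuples (as `.X` and `.Y` return them). The tiny `X` and `Y` here are hypothetical stand-ins for `data.training_set.X` and `data.training_set.Y`.

```python
from itertools import chain

# Hypothetical stand-ins for data.training_set.X and data.training_set.Y
X = (("Mr.", "Podger", "had"), ("He", "thanked", "him"))
Y = (("NOUN", "NOUN", "VERB"), ("PRON", "VERB", "PRON"))

# Flatten the per-sentence tuples into one long list each
words = list(chain.from_iterable(X))
tags = list(chain.from_iterable(Y))

print(words)
print(tags)
```

The two flat lists stay aligned position by position, so they can be passed straight to a pair-counting function.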

### 2.3. Task 3 - Create a mfc_table

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:
- Initialize an empty dictionary for your mfc_table
- Iterate through the items of your pair counts, where each key is a word and each value holds the counts of its tags
    - Use the word as a key in the mfc_table dictionary
    - Since a word can have multiple tags, take the tag with the highest count and assign it as the value for that key (word) in your dictionary.

**The input should look similar to the following (not exactly):**

dict_items([('Whenever', {'ADV': 13}), ('artists', {'NOUN': 35}), (',', {'.': 46500, 'X': 2}), ...])
 

**The output should look similar to the following (not exactly):**

{'Whenever': 'ADV',
 'artists': 'NOUN',
 ',': '.',
 'indeed': 'ADV',
 'turned': 'VERB',
 'to': 'PRT',
 'actual': 'ADJ',
 'representations': 'NOUN',
 'or': 'CONJ', ...}
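
One way to pick the most frequent tag per word is sketched here. The name `word_counts` is an assumption for the word-to-tag counts; the three entries are taken from the sample input shown above.

```python
# Word -> {tag: count}, as in the sample input
word_counts = {
    "Whenever": {"ADV": 13},
    "artists": {"NOUN": 35},
    ",": {".": 46500, "X": 2},
}

mfc_table = {}
for word, tag_counts in word_counts.items():
    # max over the tags, compared by their counts
    mfc_table[word] = max(tag_counts, key=tag_counts.get)

print(mfc_table)
```

If two tags tie, `max` returns the first one encountered, so the result then depends on insertion order.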

<hr>

## 3. Build an HMM tagger

There are **10 TODOs** in this part that you have to complete:
1. Build the **unigram_counts** function
2. Apply **unigram_counts** function to tags on training set
3. Build the **bigram_counts** function
4. Apply **bigram_counts** function to tags on training set
5. Build the **starting_counts** function
6. Apply **starting_counts** to tags on training set
7. Build the **ending_counts** function
8. Apply **ending_counts** function to tags on training set
9. Create states with **emission probability**
10. **Add edges** or **transition probabilities** between states

### 3.1. Build the unigram_counts function

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:

- Initialize an empty dictionary
- Iterate through the sequence
    - At each iteration, check whether the item is already in the dictionary
        - If it is, add 1 to its count
        - If it is not, set its count to 1
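
The tips above translate almost line by line into the following sketch. The parameter name `sequence` is an assumption, and the short tag list is illustrative only.

```python
def unigram_counts(sequence):
    """Return a dictionary mapping each item of the sequence to its number of occurrences."""
    counts = {}
    for item in sequence:
        if item in counts:
            counts[item] += 1   # seen before: add 1
        else:
            counts[item] = 1    # first occurrence
    return counts


print(unigram_counts(["NOUN", "NOUN", "VERB", "PRON", "VERB", "NOUN"]))
```

The standard library's `collections.Counter` produces the same counts in a single call, if you prefer not to write the loop yourself.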

### 3.2. Apply unigram_counts function to tags on training set

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:

- Iterate through data.stream() and collect the tag of each pair
- Apply unigram_counts to the collected tags

**The input should look similar to the following (not exactly):**

['NOUN',
 'NOUN',
 'VERB',
 'VERB',
 'PRON',
 'ADV',
 '.',
 'CONJ',
 'ADV',
 'PRON',
 'VERB',
 'NOUN',
 'ADP',
 'DET',
 'NOUN', ...]

**The output should look similar to the following (not exactly):**

{'NOUN': 275558,
 'VERB': 182750,
 'PRON': 49334,
 'ADV': 56239,
 '.': 147565,
 'CONJ': 38151,
 ...}

### 3.3. Build the bigram_counts function

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:

- Initialize an empty dictionary
- Get the bigrams of the sequence using the nltk library
- Iterate through the bigrams you have just created
    - At each iteration, check whether the bigram is already in the dictionary
        - If it is, add 1 to its count
        - If it is not, set its count to 1
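
A rough sketch of the counting loop follows. To keep it free of dependencies, it builds the bigrams with `zip` from the standard library as a stand-in for the nltk helper mentioned in the tips; the parameter name `sequence` is an assumption, and the input must support slicing (for example a list or tuple).

```python
def bigram_counts(sequence):
    """Return a dictionary mapping each pair of adjacent items to its number of occurrences."""
    counts = {}
    # zip pairs every item with its successor: (s[0], s[1]), (s[1], s[2]), ...
    for pair in zip(sequence, sequence[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


print(bigram_counts(["NOUN", "NOUN", "VERB", "NOUN", "NOUN"]))
```

Note that when this is applied to one flat list of tags for the whole corpus, pairs that span a sentence boundary are counted as well.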

### 3.4. Apply bigram_counts function to tags on training set

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:

- Iterate through data.stream() and save the tag of each pair in a list
- Apply bigram_counts to that list

**The input should look similar to the following (not exactly):**

['NOUN',
 'NOUN',
 'VERB',
 'VERB',
 'PRON',
 'ADV',
 '.',
 'CONJ',
 'ADV',
 'PRON',
 'VERB',
 'NOUN',
 'ADP',
 'DET',
 'NOUN', ...]

**The output should look similar to the following (not exactly):**

{('NOUN', 'NOUN'): 41295,
 ('NOUN', 'VERB'): 43802,
 ('VERB', 'VERB'): 33668,
 ('VERB', 'PRON'): 10075,
 ('PRON', 'ADV'): 2665,
 ('ADV', '.'): 9570,
 ('.', 'CONJ'): 12992, ...}

### 3.5. Build the starting_counts function

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:

- Initialize an empty list (for collecting the start tags)
- Iterate through the sequences
    - Append the first tag of the current sequence to that list
- Initialize an empty dictionary
- Iterate through the list of start tags
    - If the tag is already in the dictionary, add 1 to its count
    - If the tag is not in the dictionary, set its count to 1
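
As a sketch of the steps above, assuming the input is a sequence of tag sequences (one per sentence). The parameter name `sequences` and the toy sentences are assumptions for illustration.

```python
def starting_counts(sequences):
    """Count how often each tag appears as the first tag of a sequence."""
    # Collect the first tag of every non-empty sequence
    start_tags = [sequence[0] for sequence in sequences if sequence]

    counts = {}
    for tag in start_tags:
        counts[tag] = counts.get(tag, 0) + 1
    return counts


sentences = (("ADV", "NOUN", "."), ("DET", "NOUN", "."), ("ADV", "VERB"))
print(starting_counts(sentences))
```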

### 3.6. Apply starting_counts to tags on training set

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:

- Get the tags inside the training set
- Apply starting_counts to them

**The input should look similar to the following (not exactly):**

(('ADV',
  'NOUN',
  '.',
  'ADV',
  '.',
  'VERB',
  'ADP',
  'ADJ',
  'NOUN', ...), ...)

**The output should look similar to the following (not exactly):**

{'ADV': 4185,
 'ADP': 5583,
 'ADJ': 1582,
 'PRT': 1718,
 'DET': 9763,
 'PRON': 7318,
 'NOUN': 6469, ...}

### 3.7. Build the ending_counts function

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:

- Initialize an empty list (for collecting the end tags)
- Iterate through the sequences
    - Append the last tag of the current sequence to that list
- Initialize an empty dictionary
- Iterate through the list of end tags
    - If the tag is already in the dictionary, add 1 to its count
    - If the tag is not in the dictionary, set its count to 1
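
The same idea, sketched for the last tag of each sequence. As before, the parameter name `sequences` and the toy sentences are assumptions for illustration.

```python
def ending_counts(sequences):
    """Count how often each tag appears as the last tag of a sequence."""
    # Collect the last tag of every non-empty sequence
    end_tags = [sequence[-1] for sequence in sequences if sequence]

    counts = {}
    for tag in end_tags:
        counts[tag] = counts.get(tag, 0) + 1
    return counts


sentences = (("ADV", "NOUN", "."), ("DET", "NOUN", "."), ("PRON", "VERB"))
print(ending_counts(sentences))
```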

### 3.8. Apply ending_counts function to tags on training set

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:

- Get the tags inside the training set
- Apply ending_counts to them

**The input should look similar to the following (not exactly):**

(('ADV',
  'NOUN',
  '.',
  'ADV',
  '.',
  'VERB',
  'ADP',
  'ADJ',
  'NOUN', ...), ...)

**The output should look similar to the following (not exactly):**

{'.': 44936,
 'NOUN': 722,
 'NUM': 63,
 'VERB': 75,
 'ADJ': 25,
 'ADV': 16,
 'ADP': 7, ...}

### 3.9. Create states with emission probability

There are many ways to approach this problem; here we will discuss only one of them. Below are some tips on how to solve it:

- Initialize an empty dictionary for states
- Iterate through the unique tags
    - Initialize an empty dictionary for the emission probabilities of that tag
    - Iterate through the words and their occurrence counts for that tag (in pair_counts).
        - At each iteration, divide the occurrence count by the unigram count of the current tag, and store the result in the emission dictionary with the word as its key.
    - Build a discrete distribution from these probabilities
    - Add the distribution to a state
    - Add the state to the states dictionary you initialized at the start
    - Add the state to the model
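
The arithmetic behind the emission probabilities can be sketched with plain dictionaries. The names `tag_unigrams` and `emission_counts` and the toy counts are assumptions; wrapping the result in distribution and state objects depends on the HMM library used in the project and is not shown here.

```python
# Toy counts (illustrative only)
tag_unigrams = {"NOUN": 4, "VERB": 2}            # tag -> number of occurrences
emission_counts = {                              # tag -> {word: count}
    "NOUN": {"time": 3, "flies": 1},
    "VERB": {"flies": 2},
}

emission_probs = {}
for tag, word_counts in emission_counts.items():
    # P(word | tag) = C(tag, word) / C(tag)
    emission_probs[tag] = {
        word: count / tag_unigrams[tag] for word, count in word_counts.items()
    }

print(emission_probs)
```

For each tag the probabilities should sum to 1, which is a quick sanity check before building the distributions.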

### 3.10. Add edges or transition probabilities between states

Add the start and end edges:
- Iterate through the unique tags.
    - Get the state for a specific tag from the states dictionary you created before
    - Calculate the start probability by dividing the tag's count in tag_starts (which you created before) by the sum of all values in tag_starts
    - Add the start probability to the model as the transition from the start of the model to this state
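    - Calculate the end probability by dividing the tag's count in tag_ends (which you created before) by the unigram count of that tag; this estimates P(end | tag), the probability of the sentence ending after this tag
    - Add the end probability to the model as the transition from this state to the end of the model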

Add in between edges:
- Iterate through the unique tags
    - Get the state for the tag from the states dictionary you created before
    - Initialize a sum of probabilities to 0
    - Iterate through the unique tags a second time
        - Get the state for this second tag from the states dictionary
        - Get the bigram of the two tags: one from the first loop, the other from the second loop
        - Calculate the transition probability by dividing the count of that bigram by the unigram count of the first tag
        - Add the transition probability to sum_of_probabilities
        - Add the transition to the model
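
The numbers behind the start edges and the in-between edges can be sketched with plain dictionaries. All names and counts here are illustrative assumptions, and adding the edges to the model depends on the HMM library used in the project, so only the probabilities are computed.

```python
# Toy counts (illustrative only)
tags = ["NOUN", "VERB"]
tag_unigrams = {"NOUN": 4, "VERB": 2}
tag_bigrams = {("NOUN", "NOUN"): 1, ("NOUN", "VERB"): 2, ("VERB", "NOUN"): 1}
tag_starts = {"NOUN": 3, "VERB": 1}

# P(tag | start) = C(start, tag) / total number of sequence starts
total_starts = sum(tag_starts.values())
start_probs = {tag: tag_starts.get(tag, 0) / total_starts for tag in tags}

# P(tag2 | tag1) = C(tag1, tag2) / C(tag1); unseen bigrams get probability 0
transition_probs = {}
for tag1 in tags:
    for tag2 in tags:
        count = tag_bigrams.get((tag1, tag2), 0)
        transition_probs[(tag1, tag2)] = count / tag_unigrams[tag1]

print(start_probs)
print(transition_probs)
```

Using `.get(..., 0)` avoids a `KeyError` for tag pairs that never occur next to each other in the training data.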