<a href="https://colab.research.google.com/github/thedatadj/natural-language-processing/blob/main/part_of_speech_tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A model that predicts the part-of-speech (POS) tag of each word.

# Dataset
The dataset consists of lists whose items are strings, each containing a word and its POS tag.

* Load training data
    * Load the vocabulary
    * Create a `{word: index}` dictionary
* Load test data
    * List of words in test data


## Load training data

In [122]:
# Download file
!gdown 18yqFMkgBIjKOiDZ43s9JXfClO3YNZotf

Downloading...
From: https://drive.google.com/uc?id=18yqFMkgBIjKOiDZ43s9JXfClO3YNZotf
To: /content/WSJ_02-21.pos
  0% 0.00/8.28M [00:00<?, ?B/s]100% 8.28M/8.28M [00:00<00:00, 185MB/s]


In [123]:
# Load training corpus
with open("/content/WSJ_02-21.pos", "r") as file0:
    training_data = file0.readlines()

In [124]:
# First item of the list
print(training_data[:1])

['In\tIN\n']
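
Each line holds a word and its tag separated by a tab, and the text ends with a newline. As a minimal sketch of how such a line can be taken apart (the helper name `split_line` is hypothetical and is not part of `utils_pos`):

```python
def split_line(line):
    """Split a 'word<TAB>tag' line into (word, tag); return None otherwise."""
    parts = line.split()
    if len(parts) != 2:
        return None  # blank separator line or malformed entry
    return parts[0], parts[1]


print(split_line("In\tIN\n"))  # ('In', 'IN')
print(split_line("\n"))        # None
```

Blank lines are returned as `None` here so the caller can decide how to treat sentence boundaries.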


### Load vocabulary

In [125]:
# Download file
!gdown 1FtzoPTuRqF6DIgvWSRIJZLnlp959uiAr

Downloading...
From: https://drive.google.com/uc?id=1FtzoPTuRqF6DIgvWSRIJZLnlp959uiAr
To: /content/hmm_vocab.txt
  0% 0.00/197k [00:00<?, ?B/s]100% 197k/197k [00:00<00:00, 85.0MB/s]


In [126]:
# Load vocabulary
with open("/content/hmm_vocab.txt", 'r') as file1:
    vocab = file1.read().split('\n')

In [127]:
# An item in vocab
vocab[23]

"'ve"

### `{word: index}` dictionary

In [128]:
# {word: index} dictionary
wid = {}
for i, word in enumerate(sorted(vocab)):
    wid[word] = i

In [129]:
# Index of a word in wid
wid['the']

22320
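
The same mapping can also be written as a dictionary comprehension. The sketch below uses a small made-up vocabulary (`toy_vocab` is an assumption for illustration, not the contents of `hmm_vocab.txt`) to show how sorting decides the indices:

```python
# Toy vocabulary standing in for the real one (assumed values).
toy_vocab = ["the", "'ve", "cat", "a"]

# Each word gets its position in the sorted vocabulary.
toy_wid = {word: i for i, word in enumerate(sorted(toy_vocab))}

print(toy_wid)  # {"'ve": 0, 'a': 1, 'cat': 2, 'the': 3}
```

Because Python sorts strings by code point, entries starting with punctuation such as `'ve` come before alphabetic words.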

## Test data

In [130]:
# Download file
!gdown 1f-tIhCz9A6Kj9kqrpnhNbma4qPrJDnZL

Downloading...
From: https://drive.google.com/uc?id=1f-tIhCz9A6Kj9kqrpnhNbma4qPrJDnZL
To: /content/WSJ_24.pos
  0% 0.00/286k [00:00<?, ?B/s]100% 286k/286k [00:00<00:00, 113MB/s]


In [155]:
# Load file
with open('/content/WSJ_24.pos', 'r') as file2:
    testdata = file2.readlines()

In [156]:
# An item in testdata
testdata[65]

'by\tIN\n'

### Words list
List of words from `testdata`.

In [133]:
# Download and import preprocess function
!gdown 1fes2W5p9zRVvJxpr9IE7MsIUd459N5BE
from utils_pos import preprocess

Downloading...
From: https://drive.google.com/uc?id=1fes2W5p9zRVvJxpr9IE7MsIUd459N5BE
To: /content/utils_pos.py
  0% 0.00/8.09k [00:00<?, ?B/s]100% 8.09k/8.09k [00:00<00:00, 20.2MB/s]


In [134]:
# Download file
!gdown 1jBel8t5KpXi0tXXFoB6rD9cf6NCcDu5X

Downloading...
From: https://drive.google.com/uc?id=1jBel8t5KpXi0tXXFoB6rD9cf6NCcDu5X
To: /content/test.words
  0% 0.00/180k [00:00<?, ?B/s]100% 180k/180k [00:00<00:00, 101MB/s]


In [157]:
# Remove tags from the corpus and preprocess the words
_, testcorp = preprocess(vocab, "/content/test.words")

In [158]:
# Example of a word in testcorp
testcorp[0]

'The'

# Training

## Transition counts
A dictionary where:
* `key`: pairs of tags
* `value`: the frequency of the pair in the training corpus

In [137]:
# Use the defaultdict class
from collections import defaultdict

In [138]:
# Helper function
from utils_pos import get_word_tag

In [139]:
# Initialize dictionary
tcounts = defaultdict(int)

# Initialize the previous tag with the sentence-boundary tag
# ('--s--' is the boundary tag that appears in the tag set)
prev_tag = '--s--'

# Loop over each item in the corpus
for wordtag in training_data:

    # Get word and tag
    word, tag = get_word_tag(wordtag, wid)

    # Update count
    tcounts[(prev_tag, tag)] += 1

    # Update prev_tag
    prev_tag = tag

In [140]:
tcounts[('IN', 'DT')]

32364
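
To make the counting concrete, here is a small sketch on a made-up corpus. It assumes that a blank line marks a sentence boundary and is mapped to the `--s--` tag; the names `toy_lines`, `toy_tcounts` and `START` are hypothetical:

```python
from collections import defaultdict

# Toy corpus in the same 'word<TAB>tag' format; the blank line separates sentences.
toy_lines = ["The\tDT\n", "dog\tNN\n", "barks\tVBZ\n", "\n", "The\tDT\n", "cat\tNN\n"]

START = "--s--"  # assumed sentence-boundary tag

toy_tcounts = defaultdict(int)
prev_tag = START
for line in toy_lines:
    parts = line.split()
    # A blank line is treated as the boundary tag (assumption).
    tag = parts[1] if len(parts) == 2 else START
    toy_tcounts[(prev_tag, tag)] += 1
    prev_tag = tag

print(toy_tcounts[("DT", "NN")])   # 2
print(toy_tcounts[(START, "DT")])  # 2
```

Both toy sentences start with a determiner followed by a noun, so each of those two transitions is counted twice.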

## Emission counts
Dictionary where:
* `key`: pairs of tags and words
* `value`: frequency of that pair in training set

In [141]:
# Initialize dictionary
ecounts = defaultdict(int)

for wordtag in training_data:
    word, tag = get_word_tag(wordtag, wid)
    ecounts[(tag, word)] += 1

In [142]:
ecounts[('NN', 'decrease')]

7

## Tag counts
Dictionary where:
* `key`: tag
* `value`: frequency of the tag

In [143]:
tagcounts = defaultdict(int)
for wordtag in training_data:
    word, tag = get_word_tag(wordtag, wid)
    tagcounts[tag] += 1

In [144]:
tagcounts['NN']

132935
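
Emission counts and tag counts are related: summing the emission counts of a tag over all words should give that tag's count. A minimal sketch of this sanity check on made-up data (`toy_pairs` and the other `toy_` names are hypothetical):

```python
from collections import defaultdict

# Toy (word, tag) pairs standing in for the parsed training corpus.
toy_pairs = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("the", "DT"), ("cat", "NN")]

toy_ecounts = defaultdict(int)
toy_tagcounts = defaultdict(int)
for word, tag in toy_pairs:
    toy_ecounts[(tag, word)] += 1
    toy_tagcounts[tag] += 1

# Sum the emission counts per tag and compare with the tag counts.
summed = defaultdict(int)
for (tag, _word), n in toy_ecounts.items():
    summed[tag] += n

print(dict(summed) == dict(toy_tagcounts))  # True
```

The same loop could be run over `ecounts` and `tagcounts` to check the real dictionaries.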

## States
Sorted list of all part-of-speech tags seen in the training data.

In [145]:
states = sorted(tagcounts.keys())
states[5:15]

[',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN']

# Testing
Assign a POS tag to every word in the test corpus `testcorp` by picking the tag the word was most often paired with in the training data.

In [198]:
prep = testcorp
y = testdata

In [203]:
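# Predicted tag for each distinct word (preprocessed form)
predictions = defaultdict(int)
# True tag for each distinct word (raw form, last occurrence wins)
truepair = defaultdict(int)

for word, y_tup in zip(prep, y):
    # Skip lines that do not hold exactly one word and one tag
    y_tup_l = y_tup.split()
    if len(y_tup_l) != 2:
        continue

    # Pick the tag with the highest emission count for this word;
    # words with no emission count keep the placeholder 0
    count_final = 0
    pos_final = 0
    if word in wid:
        for pos in states:
            key = (pos, word)
            if key in ecounts:
                count = ecounts[key]
                if count > count_final:
                    count_final = count
                    pos_final = pos
    predictions[word] = pos_final
    truepair[y_tup_l[0]] = y_tup_l[1]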

In [207]:
predictions['the']

'DT'

In [208]:
truepair['the']

'DT'

The model assigns the correct POS tag, `DT`, to the word "the" in the test data.
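
The prediction rule used above, picking the tag with the highest emission count for a word, can be isolated in a small function. This is a sketch under assumptions: `most_frequent_tag` is a hypothetical helper, and the toy counts are made up:

```python
def most_frequent_tag(word, tags, emission_counts, default=None):
    """Return the tag with the highest (tag, word) count, or `default` if none."""
    best_tag, best_count = default, 0
    for tag in tags:
        # .get avoids adding new keys when emission_counts is a defaultdict
        count = emission_counts.get((tag, word), 0)
        if count > best_count:
            best_tag, best_count = tag, count
    return best_tag


toy_ecounts = {("DT", "the"): 5, ("NN", "run"): 1, ("VB", "run"): 3}
toy_states = ["DT", "NN", "VB"]

print(most_frequent_tag("the", toy_states, toy_ecounts))  # DT
print(most_frequent_tag("run", toy_states, toy_ecounts))  # VB
print(most_frequent_tag("xyz", toy_states, toy_ecounts))  # None
```

Returning an explicit `default` for unseen words makes it easier to tell a missing prediction apart from a real tag.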

# Evaluation
Calculate accuracy of the model.

In [209]:
count = 0
total = 0
for word in predictions:
    pos = predictions[word]
    truepos = truepair[word]
    if pos == truepos:
        count += 1
    total += 1
accuracy = count/total
accuracy

0.893070835745995

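This model has an accuracy of about 89%. Because `predictions` is keyed by word, the score counts each distinct word once rather than each token.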
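
The loop above iterates over the keys of `predictions`, so every distinct word contributes once to the score. If a per-token accuracy is wanted instead, one possible sketch is shown below. The helper `token_accuracy` and the toy data are hypothetical, and this is an alternative measure, not the one used for the number above:

```python
def token_accuracy(words, gold_tags, predict):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for word, gold in zip(words, gold_tags):
        if predict(word) == gold:
            correct += 1
        total += 1
    return correct / total if total else 0.0


# Toy word-to-tag model and a toy test sequence.
toy_model = {"the": "DT", "run": "VB"}
toy_words = ["the", "run", "the", "run"]
toy_gold = ["DT", "VB", "DT", "NN"]

print(token_accuracy(toy_words, toy_gold, toy_model.get))  # 0.75
```

Here the word "run" is tagged `VB` both times, which is wrong for its second occurrence, so three of four tokens are correct.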

<table>
    <tr>
        <td>
            Based on
        </td>
        <td>
            Assignment from the Natural Language Processing Specialization on Coursera.
        </td>
    </tr>
</table>