<a href="https://colab.research.google.com/github/ronenbendavid/IDC_NLP/blob/master/Ronen_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2

This assignment is about training and evaluating a POS tagger with some real data. The dataset is available through NLTK, a Python NLP package.

**Part 1** (no actions required)

The dataset is composed of a set of sentences. Each sentence is a list of tuples of a word and a tag, as provided by human annotators.
You should split the data to train and test sets in the following way:


In [0]:
import numpy as np
import operator
import nltk
from nltk.corpus import treebank 
nltk.download('treebank')
print(f"Number of sentences: {len(treebank.tagged_sents())}")

train_data = treebank.tagged_sents()[:3000] 
test_data = treebank.tagged_sents()[3000:] 
print(train_data[0])
print(test_data[0])

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
Number of sentences: 3914
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
[('At', 'IN'), ('Tokyo', 'NNP'), (',', ','), ('the', 'DT'), ('Nikkei', 'NNP'), ('index', 'NN'), ('of', 'IN'), ('225', 'CD'), ('selected', 'VBN'), ('issues', 'NNS'), (',', ','), ('which', 'WDT'), ('*T*-1', '-NONE-'), ('gained', 'VBD'), ('132', 'CD'), ('points', 'NNS'), ('Tuesday', 'NNP'), (',', ','), ('added', 'VBD'), ('14.99', 'CD'), ('points', 'NNS'), ('to', 'TO'), ('35564.43', 'CD'), ('.', '.')]


**Part 2**

Write a class simple_tagger, with methods *train* and *evaluate*. The method *train* receives the data as a list of treebank sentences, as presented below, and use it for training the tagger. In this case, it should learn a simple map of words to tags, defined as the most frequent tag for every word (in case there is more than one tag, select one randomly). The map should be stored as a class member for evaluation.

The method evaluate receives the data as a list of treebanl sentences, as presented above, and use it to evaluate the tagger performance. Specifically, it should calculate the word and sentence level accuracy.
The evaluation process is simply going word by word, querying the map (created by the train method) for each word’s tag and compare it to the true tag of that word. The word-level accuracy is the number of successes divided by the number of words. For OOV (out of vocabulary, or unknown) words, the tagger should assign the most frequent tag in the entire training set. The function should return the two numbers: word level accuracy and sentence level accuracy.


In [0]:
from collections import Counter, defaultdict

class simple_tagger:
  def __init__(self):
    self.mft = None
    self.mfw = None
    self.words = Counter()
    self.tags = Counter()
    self.tags_dict = None

  def train(self, data):
    # counter all pairs
    counter = Counter()
    tagged_data = list(data)
    for sentence in tagged_data: #range(len(tagged_data)):
      for pair in sentence:
          self.words.update([pair[0]])
          self.tags.update([pair[1]])
      counter.update(Counter(sentence))

    # most frequent tag/word
    n=1
    self.most_frequent_word = self.words.most_common()[0]
    self.most_frequent_tag = self.tags.most_common()[0]

    self.tags_dict = defaultdict(lambda: 0)
    for word in self.words:
      for tag in self.tags.keys():
        key = (word, tag)
        count = counter[key]
        if count > 0:
          if key in self.tags_dict:
            if count > tags_dict[key]:
              self.tags_dict[word] = tag
          else:
              self.tags_dict[word] = tag

  def evaluate(self, data):
    total_words = 0
    tagged_data = list(data)
    total_sentences = len(tagged_data)
    success_words = 0
    sentences_stats = 0
    counter = Counter()
    mean_sentence_accuracy = 0
    for sentence in tagged_data:
      total_words += len(sentence)
      success_sentence = 0
      for pair in sentence:
        word = pair[0]
        tag = pair[1]
        if word not in self.words:
          word = self.most_frequent_word
        if tag not in self.tags:
          tag = self.most_frequent_tag
        tagfromdict = self.tags_dict[word]
        if tag == tagfromdict:
          success_words += 1
          success_sentence += 1
      sentences_stats += success_sentence/len(sentence)
    word_level_accuarcy = success_words/total_words
    sentence_level_accuarcy = sentences_stats/total_sentences
    return word_level_accuarcy,sentence_level_accuarcy


In [0]:
st = simple_tagger()
st.train(train_data)
print(st.tags_dict)
print(len(st.tags_dict))


10779


In [0]:

wla,sla = st.evaluate(test_data)
print ("Word level accuarcy is : {:f}; Sentence level accuarcy is {:f}".format(wla,sla))

Word level accuarcy is : 0.733348; Sentence level accuarcy is 0.727978


**Part 3**

Similar to part 2, write the class hmm_tagger, which implements HMM tagging. The method *train* should build the matrices A, B and Pi, from the data as discussed in class. The method *evaluate* should find the best tag sequence for every input sentence, using he Viterbi decoding algorithm, and then calculate the word and sentence level accuracy using the gold-standard tags. I implemented the Viterbi algorithm for you in the next block, so you can should either plug it into your code or write your own Viterbi version.

Additional guidance:
1. The matrix B represents the probabilities of seeing a word within each POS label.
Since B is a matrix, you should build a dictionary that maps every unique word in the corpus to a serial numeric id (starting with 0). This way, every column in B represents the word that it’s id matches the index of the column.
2. During evaluation, you should first convert each word into it’s index and create the observation array to be given to Viterbi, as a list of ids. OOV words should be assigned with a random tag. To make sure Viterbi works appropriately, you can simply break the sentence into multiple segments every time you see OOV word, and decode every segment individually by Viterbi.


In [0]:
class hmm_tagger:
EPS = 1e-4
ZERO = 1e-300
LOG_ZERO = math.log(ZERO)
  def __init__(self, N, M, A=None, B=None, pi=None):
    self.N = N
    self.M = M
    self.__obs_seq = None
    self.__viterby = None
    self.__log_p_vit = None
    self.__log_p_fwd = None
    self.A = np.array(A, dtype=np.float64)
    self.B = np.array(B, dtype=np.float64)
    self.pi = np.array(pi, dtype=np.float64)
    # Remove zero
    self.A += ((self.A < ZERO) * ZERO).astype(np.float64)
    self.B += ((self.B < ZERO) * ZERO).astype(np.float64)
    self.pi += ((self.pi < ZERO) * ZERO).astype(np.float64)    
    self.log_A = np.log(self.A)
    self.log_B = np.log(self.B)
    self.log_pi = np.log(self.pi)

  def train(self, data):
    # TODO

  def evaluate(self, data):
    # TODO

In [0]:
# Viterbi
def viterbi (word_list, A, B, Pi):

    # initialization
    T = len(word_list)
    N = A.shape[0] # number of tags

    delta_table = np.zeros((N, T)) # initialise delta table
    psi = np.zeros((N, T))  # initialise the best path table

    delta_table[:,0] = B[:, word_list[0]] * Pi

    for t in range(1, T):
        for s in range (0, N):
            trans_p = delta_table[:, t-1] * A[:, s]
            psi[s, t], delta_table[s, t] = max(enumerate(trans_p), key=operator.itemgetter(1))
            delta_table[s, t] = delta_table[s, t] * B[s, word_list[t]]

    # Back tracking
    seq = np.zeros(T);
    seq[T-1] =  delta_table[:, T-1].argmax()
    for t in range(T-1, 0, -1):
      print(seq[t])
      seq[t-1] = psi[int(seq[t]),t]

    return seq

# A simple example to run the algorithm:

# A = np.array([[0.3, 0.7], [0.2, 0.8]])
# B = np.array([[0.1, 0.1, 0.3, 0.5], [0.3, 0.3, 0.2, 0.2]])
# Pi = np.array([0.4, 0.6])
# print(viterbi([3, 3, 3, 3], A, B, Pi))

**Part 4**

Compare the results obtained from both taggers and a MEMM tagger, implemented by NLTK, over the test data. To train the NLTK MEMM tagger you should execute the following lines (it may take some time to train...):

In [0]:
from nltk.tag import tnt 

tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)
print(tnt_pos_tagger.evaluate(test_data))

TODO: Print both, word level and sentence level accuracy for all the three taggers in a table.