### Named Entity Recognition over medical journal corpus

##### Homework 3, Fall, 2021
##### Prof. James H. Martin
###### author: Sushma Akoju

Notebook to train a hidden Markov model (HMM) tagger with Viterbi decoding to perform named entity recognition (NER).

Required features:
  - Sentence id
  - Word
  - Tag

### This notebook includes
- Viterbi dynamic programming with a trellis and back pointers.
- HMM approach with start, transition, and emission probabilities for observations, using bigrams. This can be adapted to trigrams, which are not explored in this notebook given the time available and the scope of the homework.
- Replacing rare words with a `_RARE_` keyword to account for out-of-vocabulary and rare words in the test dataset.

### About the dataset split
- The dataset is split 80/20 into training and test sets.

#### References
- [Sequence Labeling for Parts of Speech and Named Entities:  ](https://web.stanford.edu/~jurafsky/slp3/8.pdf)
- [Hidden Markov Models:  ](https://web.stanford.edu/~jurafsky/slp3/A.pdf)
- [Viterbi algorithm: ](https://en.wikipedia.org/wiki/Viterbi_algorithm#Pseudocode)

#### Analysis:
- Based on the results in this notebook, continuing with the HMM and Viterbi approach for named entity recognition is not recommended. The main pitfall of the HMM is that it does not handle unknown or new vocabulary words flexibly. Tag-order regularities can help: for example, after a B tag the next tag is more likely to be an I or O tag than another B tag. However, this consideration alone is not sufficient to make HMM with Viterbi generalize.
- Additional features could help: whether the first letter is capitalized, whether all letters are capitalized, whether the previous word is a hyphen, whether the next word is a number, or whether the previous word plus the next word is alphanumeric. Counting how often each of these features co-occurs with each tag, for example under a 5-gram approach, seems a reasonable direction to explore, although it would remain limited to this approach alone.
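
The features listed above can be sketched as a small function. This is only an illustration: the name `word_shape_features` and the feature keys are hypothetical and are not used elsewhere in this notebook.

```python
def word_shape_features(words, i):
    """Return simple orthographic features for words[i] and its neighbours.

    Hypothetical helper: the feature names are illustrative assumptions.
    """
    word = words[i]
    prev_word = words[i - 1] if i > 0 else ""
    next_word = words[i + 1] if i + 1 < len(words) else ""
    return {
        "init_cap": word[:1].isupper(),          # first letter is a capital
        "all_caps": word.isupper(),              # all letters are capitals
        "prev_is_hyphen": prev_word == "-",      # previous word is a hyphen
        "next_is_number": next_word.isdigit(),   # next word is a number
        # previous word + next word is alphanumeric
        "prev_next_alnum": (prev_word + next_word).isalnum(),
    }


sentence = ["and", "5", "-", "nucleotidase", "."]
print(word_shape_features(sentence, 3))
```

Each feature could then be counted per tag in the same way as the word/tag pairs, which keeps the model inside the counting framework used here.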


In [None]:
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd '/content/drive/MyDrive/Colab Notebooks/nlp-hw3'
!pwd

/content/drive/MyDrive/Colab Notebooks/nlp-hw3
/content/drive/MyDrive/Colab Notebooks/nlp-hw3


Preprocess the Data and analyze. 


In [None]:
all_lines = {}
ent_tags = {}
tokens = []
with open("S21-gene-train.txt", "r", encoding="utf8", newline="\n") as file:
  lines = file.readlines()

new_line_counter = 0
all_raw_lines = []
sentences = []
sentence = []
word_tags = {}
this_sentence_tag_pairs = []
for i,line in enumerate(lines):
  if line != "\n":
    this_line = line.split("\t")
    sentence.append(this_line[1])
    #print(this_line)
    ent_tags[this_line[1].strip()] = this_line[2].strip()
    this_pair = this_line[1].strip(), this_line[2].strip()
    this_sentence_tag_pairs.append(this_pair)
    tokens.append(this_line[1].strip())
    if this_line[1].strip() not in word_tags.keys():
      word_tags[this_line[1].strip()] = {"O": 0, "I":0, "B":0 }
    word_tags[this_line[1].strip()][this_line[2].strip()] += 1
    all_raw_lines.append({"Sentence #":new_line_counter,"Line":int(this_line[0].strip()),"Word":this_line[1].strip(), "Tag":this_line[2].strip()})
  else:
    new_line_counter += 1
    all_lines[i] = this_sentence_tag_pairs
    this_sentence_tag_pairs = []
    sentences.append(" ".join(sentence))
    sentence = []

In [None]:
all_lines

{9: [('Comparison', 'O'),
  ('with', 'O'),
  ('alkaline', 'B'),
  ('phosphatases', 'I'),
  ('and', 'O'),
  ('5', 'B'),
  ('-', 'I'),
  ('nucleotidase', 'I'),
  ('.', 'O')],
 16: [('Pharmacologic', 'O'),
  ('aspects', 'O'),
  ('of', 'O'),
  ('neonatal', 'O'),
  ('hyperbilirubinemia', 'O'),
  ('.', 'O')],
 62: [('When', 'O'),
  ('CSF', 'O'),
  ('[', 'O'),
  ('HCO3', 'O'),
  ('-]', 'O'),
  ('is', 'O'),
  ('shown', 'O'),
  ('as', 'O'),
  ('a', 'O'),
  ('function', 'O'),
  ('of', 'O'),
  ('CSF', 'O'),
  ('PCO2', 'O'),
  ('the', 'O'),
  ('data', 'O'),
  ('of', 'O'),
  ('K', 'O'),
  ('-', 'O'),
  ('depleted', 'O'),
  ('rats', 'O'),
  ('are', 'O'),
  ('no', 'O'),
  ('longer', 'O'),
  ('displaced', 'O'),
  ('when', 'O'),
  ('compared', 'O'),
  ('to', 'O'),
  ('controls', 'O'),
  ('but', 'O'),
  ('still', 'O'),
  ('have', 'O'),
  ('a', 'O'),
  ('significantly', 'O'),
  ('greater', 'O'),
  ('slope', 'O'),
  ('(', 'O'),
  ('1', 'O'),
  ('.', 'O'),
  ('21', 'O'),
  ('+/-', 'O'),
  ('0', 'O'),
  ('.

#### About the data
- Total number of sentences: 13795
- Total number of words/tokens in dataset: 308229
- Max number of words in a sentence: ~102
- Vocabulary size: 27282
- Total number of B tags: 13304
- Total number of I tags: 19527
- Total number of O tags: 276009
- The 10 most common words are: '.', 'the', 'of', '-', ',', 'and', 'in', 'a', '(', 'to'.
- The 10 least common words are: 'K713', 'hypercholesterolemic', 'lutein', 'P69', 'conference', 'Talk', 'Tele', 'cruciform', 'TE105', 'probing'.

In [None]:
train, test = train_test_split(list(all_lines.values()), test_size=0.2)
print(len(train), len(test)), train[2]

11036 2759


(None,
 [('Therefore', 'O'),
  ('the', 'O'),
  ('prevalences', 'O'),
  ('of', 'O'),
  ('total', 'O'),
  ('diabetes', 'O'),
  ('and', 'O'),
  ('GDM', 'O'),
  ('were', 'O'),
  ('1', 'O'),
  ('.', 'O'),
  ('19', 'O'),
  ('%', 'O'),
  ('and', 'O'),
  ('0', 'O'),
  ('.', 'O'),
  ('56', 'O'),
  ('%,', 'O'),
  ('respectively', 'O'),
  ('.', 'O')])

In [None]:
word_tag_pairs = [this_tuple for sent in train for this_tuple in sent]
all_words = [this_word_tag[0] for this_word_tag in word_tag_pairs]
all_tags = [this_word_tag[1] for this_word_tag in word_tag_pairs]
all_tag_string = ''.join(all_tags)
vocab = list(set(all_words))
unique_tags = set(all_tags)
tag_counter = Counter(all_tags)
word_tag_pair_counts = Counter(word_tag_pairs)

In [None]:
data = pd.DataFrame.from_records(all_raw_lines, index=range(1,len(all_raw_lines)+1) )
sent_word_count = data.groupby(['Sentence #']).count()['Word']
df = pd.DataFrame.from_dict(word_tags)
df_wt = pd.DataFrame.from_dict(word_tags).T
most_common = Counter(" ".join(data["Word"]).split()).most_common(10)
least_common = Counter(" ".join(data["Word"]).split()).most_common()[:-100-1:-1] 
data.head(5)

| | Sentence # | Line | Word | Tag |
|---|---|---|---|---|
| 1 | 0 | 1 | Comparison | O |
| 2 | 0 | 2 | with | O |
| 3 | 0 | 3 | alkaline | B |
| 4 | 0 | 4 | phosphatases | I |
| 5 | 0 | 5 | and | O |


In [None]:
word_tag_pairs

[('In', 'O'),
 ('conclusion', 'O'),
 (',', 'O'),
 ('these', 'O'),
 ('studies', 'O'),
 ('indicate', 'O'),
 ('that', 'O'),
 ('LiCl', 'O'),
 ('(', 'O'),
 ('1', 'O'),
 (')', 'O'),
 ('decreases', 'O'),
 ('histamine', 'O'),
 ('-', 'O'),
 ('stimulated', 'O'),
 ('gastric', 'O'),
 ('acid', 'O'),
 ('secretion', 'O'),
 (',', 'O'),
 ('and', 'O'),
 ('(', 'O'),
 ('2', 'O'),
 (')', 'O'),
 ('diminishes', 'O'),
 ('bile', 'O'),
 ('-', 'O'),
 ('induced', 'O'),
 ('disruption', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('gastric', 'O'),
 ('mucosal', 'O'),
 ('barrier', 'O'),
 ('in', 'O'),
 ('the', 'O'),
 ('canine', 'O'),
 ('Heidenhain', 'O'),
 ('pouch', 'O'),
 ('.', 'O'),
 ('Two', 'O'),
 ('new', 'O'),
 ('glucosidase', 'B'),
 ('inhibitors', 'O'),
 ('(', 'O'),
 ('BAY', 'O'),
 ('m', 'O'),
 ('1099', 'O'),
 ('and', 'O'),
 ('BAY', 'O'),
 ('o', 'O'),
 ('1248', 'O'),
 (')', 'O'),
 ('were', 'O'),
 ('studied', 'O'),
 ('in', 'O'),
 ('volunteers', 'O'),
 ('and', 'O'),
 ('type', 'O'),
 ('II', 'O'),
 ('diabetics', 'O'),
 ('und

In [None]:
print(len(word_tags.keys()), len(word_tags.values()))

31328 31328


In [None]:
print("Summary of this Dataset:")
print("Total number of sentences: %d"%(len(all_lines)))
print("Total number of words/tokens in dataset: %d" %len(tokens))
print("Max number of words in a sentence: %d"%(max(sent_word_count)))
print("Vocabulary size: %d"% (len(vocab)))
print("Total number of B tags:", tag_counter['B'])
print("Total number of I tags:", tag_counter['I'])
print("Total number of O tags:", tag_counter['O'])
print("The most common top 10 words are: ",most_common)
print("The least common top 10 words are: ",least_common)

Summary of this Dataset:
Total number of sentences: 13795
Total number of words/tokens in dataset: 386201
Max number of words in a sentence: 255
Vocabulary size: 27523
Total number of B tags: 13276
Total number of I tags: 19696
Total number of O tags: 276997
The most common top 10 words are:  [('.', 15510), ('the', 15508), ('of', 14771), ('-', 13150), (',', 12190), ('and', 9980), ('in', 8065), ('a', 5498), ('(', 5375), ('to', 5177)]
The least common top 10 words are:  [('K713', 1), ('hypercholesterolemic', 1), ('lutein', 1), ('P69', 1), ('conference', 1), ('Talk', 1), ('Tele', 1), ('cruciform', 1), ('TE105', 1), ('probing', 1), ('nonoverlapping', 1), ('monochannel', 1), ('Termination', 1), ('SMemb', 1), ('Tie2', 1), ('Ang1', 1), ('accumulations', 1), ('Ang2', 1), ('INCENP', 1), ('dictating', 1), ('satMa', 1), ('SAF', 1), ('FBT', 1), ('fluorobenzyltrozamicol', 1), ('](+)-', 1), ('Closeup', 1), ('USH1', 1), ('5G', 1), ('IVS51', 1), ('M1281', 1), ('domoic', 1), ('sardine', 1), ('managing'

Function to find transition probabilities a_ij : p(tag2|tag1)
##### Steps
  - Input: tag2, tag1, and all train_word_tag_pairs
  - Get all tags for the train_word_tag_pairs
  - count_tag1 = count of tag1 in train_word_tag_pairs
  - sequence_counts = 0
  - For all word_tag pairs:
    - for every occurrence of the sequence tag1, tag2:
      - sequence_counts += 1
  - Return sequence_counts, count_tag1

#### Create Transition probabilities matrix for sequences

###### Steps:
  - Create a tag matrix of size tag_count * tag_count
  - Fill each cell (i, j) of the tag matrix with the transition probability p(tag_j | tag_i) = sequence_counts / count_tag1 for that pair of tags
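
As a minimal sketch of these steps, adjacent tag pairs can be counted one sentence at a time with `zip`. The function name `transition_probs` and the `"*"` start symbol are assumptions made for this illustration. Counting pairs this way includes overlapping bigrams and never pairs the last tag of one sentence with the first tag of the next, whereas `str.count` on a joined tag string counts only non-overlapping matches (`"OOO".count("OO")` is 1).

```python
from collections import Counter


def transition_probs(tagged_sentences, start="*"):
    """MLE estimate of p(tag2 | tag1), counted within each sentence.

    Hypothetical helper; `start` marks the beginning of a sentence.
    """
    bigram_counts = Counter()
    context_counts = Counter()
    for sent in tagged_sentences:
        tags = [start] + [tag for _, tag in sent]
        # zip pairs every tag with its successor, so overlapping pairs count
        for tag1, tag2 in zip(tags, tags[1:]):
            bigram_counts[(tag1, tag2)] += 1
            context_counts[tag1] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


toy = [
    [("alkaline", "B"), ("phosphatases", "I"), ("and", "O")],
    [("of", "O"), ("the", "O"), ("gene", "O")],
]
probs = transition_probs(toy)
print(probs)
```

With this counting, the probabilities leaving each tag sum to 1 over the transitions observed from that tag.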

#### Create Emission Probabilities
###### Steps
  - Get count of tags that match the tag in word_tag_pairs
  - Get all word_tag_pairs that match the tag
  - Get count of words that match the word in word_tag_pairs that match the tag
  - return word_count_for_matched_tag, tag_count
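
Written as a formula, the ratio these steps produce is the maximum-likelihood emission estimate. The notation $e(w \mid t)$ and $\mathrm{count}(\cdot)$ is introduced here for illustration only.

$$
e(w \mid t) = \frac{\mathrm{count}(w, t)}{\mathrm{count}(t)}
$$

For the run recorded in this notebook, `tag_counter['B']` is 13276, and the value reported for the pair ("VirD2", "B") is consistent with a pair count of 3: $3 / 13276 \approx 0.000226$.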

In [None]:
def get_emission_prob(tag, word, word_tag_pair_counts, tag_counter:Counter):
  #print(tag, word, word_tag_pair_counts)
  tag_count = tag_counter[tag]
  word_count_for_this_tag = word_tag_pair_counts[(word, tag)]
  return (word_count_for_this_tag, tag_count)

In [None]:
num, denom = get_emission_prob("B", "VirD2",word_tag_pair_counts, tag_counter)
num/denom

0.00022597167821633023

In [None]:
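def get_transition_prob(tag2, tag1, all_tags_string, tag_counter):
  # Returns (count of the tag1 -> tag2 sequence, count of tag1); the caller divides.
  count1 = tag_counter[tag1]
  # Note: str.count counts non-overlapping matches, so a run such as "OOO"
  # contributes one "OO" rather than two, and pairs that straddle sentence
  # boundaries in the joined string are counted as well.
  sequence_count = all_tags_string.count(tag1+tag2)
  return sequence_count, count1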

In [None]:
def create_tag_transition_matrix(unique_tags:list, all_tags_string:str, tag_counter:Counter):
  tag_matrix = np.zeros((len(unique_tags), len(unique_tags)), dtype="float32")
  for i,tag1 in enumerate(list(unique_tags)):
    for j,tag2 in enumerate(list(unique_tags)):
      seq_count, count1 = get_transition_prob( tag2, tag1, all_tags_string, tag_counter)
      tag_matrix[i, j] = seq_count/count1
  return tag_matrix

In [None]:
bigram_trans_prob_matrix = create_tag_transition_matrix(unique_tags, all_tag_string, tag_counter)
utags = list(unique_tags)
transition_df = pd.DataFrame(bigram_trans_prob_matrix, columns = utags, index=utags)
transition_df

| | I | B | O |
|---|---|---|---|
| I | 0.393684 | 0.0 | 0.394293 |
| B | 0.584965 | 0.0 | 0.415035 |
| O | 0.0 | 0.047928 | 0.486283 |


Implement Viterbi with numpy and a DP table (trellis)
#### Viterbi Algorithm
- Inputs: Transition probabilities, emission probabilities, start probabilities, result_probabilities
###### Inputs/Variables
  - K is number of unique tags K = 3 (B,I,O)
  - Bigram case: the possible tag pairs are BI, BB, BO, IO, II, IB, OI, OO, OB
  - Trigram case: the possible tag triples are 
    - BIB IIO OBO IBO BOO OBB BBO IOO IOI 
    - OOO IBI OBI IOB BBI OOB BII BOI IBB 
    - BOB BBB IIB OIB BIO OII OIO III OOI
  - N = 31328, the number of distinct words in this dataset (`len(word_tags)`)
  - Start probabilities are the set of initial probabilities, with size K * 1
  - Transition probabilities matrix T with size K * K
  - Emission matrix with size N * K
  - End transition scores with size K * 1

###### Most Likely Hidden state Sequence
  - a hidden state sequence K * 1

###### Return value
  - Score of best sequence
  - Array of size N with integers (best sequence)
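
For comparison, here is a minimal sketch of textbook Viterbi decoding for a single sentence, working in log space so that probabilities are multiplied rather than added. It is not the implementation used in this notebook: the function name `viterbi`, the dictionary-based inputs, the `floor` value for unseen events, and the toy probabilities are all assumptions made for illustration.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return (best log score, best tag path) for one sentence.

    Hypothetical sketch: probabilities are dicts, and missing entries are
    replaced by `floor` to avoid log(0).
    """
    def lp(p):
        return math.log(max(p, floor))

    # trellis[t][tag]: best log score of any path ending in `tag` at position t
    trellis = [{
        tag: lp(start_p.get(tag, 0.0)) + lp(emit_p.get((words[0], tag), 0.0))
        for tag in tags
    }]
    backpointer = [{}]
    for t in range(1, len(words)):
        trellis.append({})
        backpointer.append({})
        for tag in tags:
            best_prev = max(
                tags,
                key=lambda prev: trellis[t - 1][prev] + lp(trans_p.get((prev, tag), 0.0)),
            )
            trellis[t][tag] = (
                trellis[t - 1][best_prev]
                + lp(trans_p.get((best_prev, tag), 0.0))
                + lp(emit_p.get((words[t], tag), 0.0))
            )
            backpointer[t][tag] = best_prev

    # Follow the back pointers from the best final tag to the first position
    last = max(tags, key=lambda tag: trellis[-1][tag])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    path.reverse()
    return trellis[-1][last], path


# Toy probabilities, invented for this example
toy_tags = ["B", "I", "O"]
toy_start = {"B": 0.1, "O": 0.9}
toy_trans = {("B", "I"): 0.6, ("B", "O"): 0.4, ("I", "I"): 0.6,
             ("I", "O"): 0.4, ("O", "B"): 0.1, ("O", "O"): 0.9}
toy_emit = {("with", "O"): 0.5, ("alkaline", "B"): 0.5,
            ("phosphatases", "I"): 0.5, ("phosphatases", "O"): 0.01}
toy_score, toy_path = viterbi(["with", "alkaline", "phosphatases"],
                              toy_tags, toy_start, toy_trans, toy_emit)
print(toy_path)
```

Here the trellis has one column per word of the sentence rather than one per vocabulary entry, and the returned path has one tag per word.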


In [None]:
K = len(unique_tags) #total unique tags : 3
N = len(word_tags.keys()) # number of distinct words : 31328
start_p = np.zeros((K)) # 3x1
trans_p = np.zeros((K,K)) # 3 x 3
end_scores =  np.zeros((K))
emission_p = np.zeros((N,K)) # 31328 x 3
tag_order = ['B', 'O', 'I']
tot_tags = len(all_tags)

In [None]:
tag_counter, utags, tag_order,tot_tags, 

(Counter({'B': 13276, 'I': 19696, 'O': 276997}),
 ['I', 'B', 'O'],
 ['B', 'O', 'I'],
 309969)

In [None]:
start_p.shape

(3,)

In [None]:
count = 0
start_tags = []
for sent in train:
  tuples = [t[1] for t in sent[:1]]
  tag_seq = "".join(tuples)
  start_tags.append(tag_seq)
start_tag_counter = Counter(start_tags)
start_tag_counter['I'] = 0
tot_start_tags = sum(start_tag_counter.values())
start_tag_counter

Counter({'B': 575, 'I': 0, 'O': 10461})

In [None]:
for i,tag in enumerate(utags):
    # Normalize by the number of sentence-initial tags so that start_p sums to about 1
    start_p[i] = np.round(start_tag_counter[tag]/tot_start_tags, 3)

In [None]:
trans_p = np.asarray(bigram_trans_prob_matrix)
print(transition_df.head())

          I         B         O
I  0.393684  0.000000  0.394293
B  0.584965  0.000000  0.415035
O  0.000000  0.047928  0.486283


In [None]:
for i,word in enumerate(vocab):
  for j,tag in enumerate(utags):
    word_count_for_this_tag, tag_count = get_emission_prob(tag,word, word_tag_pair_counts, tag_counter)
    if tag_count != 0:
      emission_p[i][j] = word_count_for_this_tag / tag_count
    else:
      emission_p[i][j] = 0.0

In [None]:
e =  (emission_p - emission_p.min()) / (np.ptp(emission_p))
np.linalg.norm(emission_p/np.linalg.norm(emission_p)), e

(0.9999999999999994, array([[0.00000000e+00, 0.00000000e+00, 2.00466488e-05],
        [0.00000000e+00, 0.00000000e+00, 6.01399464e-05],
        [0.00000000e+00, 0.00000000e+00, 2.00466488e-05],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00]]))

In [None]:
for i, tag in enumerate(utags):
  num, denom = get_emission_prob(tag, ".", word_tag_pair_counts, tag_counter)
  end_scores[i] = num/denom

In [None]:
end_scores

array([0.00690496, 0.        , 0.04436149])

In [None]:
from past.builtins import xrange
def viterbi_dp(start_p, trans_p, emission_p, end_scores):
  pred = []
  # trellis[j, i]: best score of any path ending in tag j at position i
  trellis = np.zeros((K, N))
  # backpointer[j, i]: previous tag on that best path (-1 in the first column)
  backpointer = np.full((K, N), -1, dtype=int)

  # Note: scores are summed as raw probabilities here; standard Viterbi
  # multiplies them (or sums log-probabilities).
  for i in range(N):
      if i == 0:
          trellis[:,0] = np.add(start_p, end_scores)
          print(trellis[:,0])
      else:
          for j in range(K):
              scores = np.add(trans_p[:,j], trellis[:,i-1])
              trellis[j,i] = np.max(scores) + emission_p[i,j]
              backpointer[j,i] = np.argmax(scores)
  
  final_score = np.max(np.add(trellis[:, i], end_scores ))
  index = np.argmax(np.add(trellis[:, i], end_scores ))
  print(trellis)
  # Follow the back pointers from the last column back to the first
  while index != -1:
      pred.append(index)
      index = backpointer[index, i]
      i -= 1
  
  pred.reverse()
  return (final_score, pred)

In [None]:
score, pred = viterbi_dp(start_p, trans_p, emission_p, end_scores)

[0.00690496 0.002      0.07836149]
[[6.90495532e-03 5.86965348e-01 9.80649348e-01 ... 1.52335613e+04
  1.52340475e+04 1.52345338e+04]
 [2.00000000e-03 1.26289810e-01 6.12583883e-01 ... 1.52334626e+04
  1.52339489e+04 1.52344351e+04]
 [7.83614913e-02 5.64655564e-01 1.05094242e+00 ... 1.52339009e+04
  1.52343872e+04 1.52348735e+04]]


In [None]:
score

15234.917863516936

In [None]:
sum(pred) == len(word_tags.keys())

False

In [None]:
preds = np.asarray(pred)
preds[preds==1],preds[preds==2], utags, len(test),len(preds)

(array([], dtype=int64),
 array([2, 2, 2, ..., 2, 2, 2]),
 ['I', 'B', 'O'],
 2759,
 31328)

In [None]:
len(sentences[:len(train)]) ==len(train)

True

In [None]:
ngrams = None
def get_ngrams(train, n):
  for sent in train:
    #print(sent)
    word_boundary = (n-1) * [(None, "*")]
    word_boundary.extend(sent)
    word_boundary.append((None, "STOP"))
    ngrams = (tuple(word_boundary[i:i+n]) for i in range(len(word_boundary)-n+1))
    #print(ngrams)
    for ngram in ngrams:
      yield ngram

In [None]:
from collections import defaultdict

def train_hmm(train, n):
  emission_counts = defaultdict(int)
  ngram_counts = [defaultdict(int) for i in range(n)]
  for ngram in get_ngrams(train, n):
      
      assert len(ngram) == n, "n = %i, expected %i" %(len(ngram), n)
      #sent = [[t] for t in ngram]
      tags = tuple([tag for word, tag in ngram])

      for i in range(2, n+1):
        ngram_counts[i-1][tags[-i:]] += 1

      if ngram[-1][0] is not None:
        ngram_counts[0][tags[-1:]] += 1
        emission_counts[ngram[-1]] += 1
      
      if ngram[-(n-1)][0] is None:
        ngram_counts[n-2][tuple((n-1) * ["*"])] += 1
      
  return ngram_counts, emission_counts

In [None]:
bigram_counts, emit_counts = train_hmm(train, 2)
unigram_counts, uemit_counts = train_hmm(train, 1)

In [None]:
unigram_counts, tag_counter,bigram_counts, len(word_tag_pair_counts), len(emit_counts), len(uemit_counts)

([defaultdict(int, {(): 11036, ('B',): 13276, ('I',): 19696, ('O',): 276997})],
 Counter({'B': 13276, 'I': 19696, 'O': 276997}),
 [defaultdict(int,
              {('*',): 11036, ('B',): 13276, ('I',): 19696, ('O',): 276997}),
  defaultdict(int,
              {('*', 'B'): 575,
               ('*', 'O'): 10461,
               ('B', 'I'): 7766,
               ('B', 'O'): 5508,
               ('B', 'STOP'): 2,
               ('I', 'I'): 11930,
               ('I', 'O'): 7758,
               ('I', 'STOP'): 8,
               ('O', 'B'): 12701,
               ('O', 'O'): 253270,
               ('O', 'STOP'): 11026})],
 30470,
 30470,
 30470)

In [None]:
word_tag_pair_counts[('cancer', 'O')], emit_counts[('cancer', 'O')], uemit_counts[('cancer', 'O')],unigram_counts

(76,
 76,
 76,
 [defaultdict(int, {(): 11036, ('B',): 13276, ('I',): 19696, ('O',): 276997})])

In [None]:
tag_counter,word_tag_pair_counts

(Counter({'B': 13276, 'I': 19696, 'O': 276997}),
 Counter({('In', 'O'): 617,
          ('conclusion', 'O'): 14,
          (',', 'O'): 9720,
          ('these', 'O'): 395,
          ('studies', 'O'): 189,
          ('indicate', 'O'): 117,
          ('that', 'O'): 2295,
          ('LiCl', 'O'): 2,
          ('(', 'O'): 4120,
          ('1', 'O'): 945,
          (')', 'O'): 2579,
          ('decreases', 'O'): 17,
          ('histamine', 'O'): 8,
          ('-', 'O'): 6992,
          ('stimulated', 'O'): 78,
          ('gastric', 'O'): 35,
          ('acid', 'O'): 284,
          ('secretion', 'O'): 43,
          ('and', 'O'): 7889,
          ('2', 'O'): 668,
          ('diminishes', 'O'): 2,
          ('bile', 'O'): 11,
          ('induced', 'O'): 330,
          ('disruption', 'O'): 13,
          ('of', 'O'): 11782,
          ('the', 'O'): 12439,
          ('mucosal', 'O'): 7,
          ('barrier', 'O'): 8,
          ('in', 'O'): 6486,
          ('canine', 'O'): 9,
          ('Heidenhain',

In [None]:
d = defaultdict(float)
d_rare = set()
for w,c in Counter(all_words).items():
  if c < 5:
    d_rare.add(w)  # collect words seen fewer than 5 times

In [None]:
Counter(all_words).most_common()[::-1]

[('Bacteriol', 1),
 ('neoplasia', 1),
 ('dysregulation', 1),
 ('morphologically', 1),
 ('p24', 1),
 ('DAo', 1),
 ('conduits', 1),
 ('valved', 1),
 ('SMV', 1),
 ('FY', 1),
 ('Funding', 1),
 ('sporting', 1),
 ('Epidemiological', 1),
 ('appressoria', 1),
 ('melanized', 1),
 ('21q22', 1),
 ('lividomycin', 1),
 ('Rv', 1),
 ('scd1', 1),
 ('Chc', 1),
 ('Cytological', 1),
 ('anyone', 1),
 ('Ps', 1),
 ('lettuce', 1),
 ('Aer', 1),
 ('predominately', 1),
 ('click', 1),
 ('ITDs', 1),
 ('interaural', 1),
 ('alkali', 1),
 ('disappears', 1),
 ('digestions', 1),
 ('dm3', 1),
 ('ichthiomycin', 1),
 ('carps', 1),
 ('PACs', 1),
 ('NPAT', 1),
 ('2N', 1),
 ('french', 1),
 ('prescriptions', 1),
 ('Rab3', 1),
 ('Bartter', 1),
 ('alternation', 1),
 ('Na3', 1),
 ('trigonal', 1),
 ('gal11', 1),
 ('PYK1', 1),
 ('menarche', 1),
 ('Waf1', 1),
 ('H2O', 1),
 ('Mucin', 1),
 ('CD19', 1),
 ('bronchodilator', 1),
 ('augments', 1),
 ('nebulized', 1),
 ('resumption', 1),
 ('convulsion', 1),
 ('fasciculations', 1),
 ('meth

In [None]:
RARE_KEYWORD = "_RARE_"
word_tag_pair_counts_r = word_tag_pair_counts.copy()
for wt, count in word_tag_pair_counts.items():
  word, tag = wt
  if word in d_rare:
    # Move the counts of each rare word onto the shared _RARE_ entry for its tag
    word_tag_pair_counts_r[(RARE_KEYWORD, tag)] += count
    del word_tag_pair_counts_r[wt]


In [None]:
for wt, count in word_tag_pair_counts_r.items():
  word, tag = wt
  p = count*  1.0 /tag_counter[tag]
  d[(word, tag)] = p
d

defaultdict(float,
            {('Bacteriol', 'O'): 3.6101474023184367e-06,
             ('In', 'O'): 0.0022274609472304756,
             ('conclusion', 'O'): 5.054206363245811e-05,
             (',', 'O'): 0.03509063275053521,
             ('these', 'O'): 0.0014260082239157825,
             ('studies', 'O'): 0.0006823178590381845,
             ('indicate', 'O'): 0.0004223872460712571,
             ('that', 'O'): 0.008285288288320812,
             ('LiCl', 'O'): 7.220294804636873e-06,
             ('(', 'O'): 0.01487380729755196,
             ('1', 'O'): 0.0034115892951909225,
             (')', 'O'): 0.009310570150579248,
             ('decreases', 'O'): 6.137250583941343e-05,
             ('histamine', 'O'): 2.8881179218547494e-05,
             ('-', 'O'): 0.025242150637010508,
             ('stimulated', 'O'): 0.00028159149738083805,
             ('gastric', 'O'): 0.0001263551590811453,
             ('acid', 'O'): 0.001025281862258436,
             ('secretion', 'O'): 0.000155236338

### Approach for RARE words
Each word that occurs fewer than 5 times in the training data is replaced with *_RARE_*. A word in the test dataset that is not in the vocabulary would otherwise have zero frequency; it is scored with the *_RARE_* statistics instead, so new words are largely accounted for.
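
A minimal sketch of this idea follows. The helper names `emission_probs_with_rare` and `lookup`, and the toy data, are hypothetical; the threshold of 5 is the one described above.

```python
from collections import Counter

RARE = "_RARE_"


def emission_probs_with_rare(word_tag_pairs, min_count=5):
    """Emission estimates in which words seen fewer than `min_count` times
    are pooled into a single _RARE_ entry per tag (hypothetical helper)."""
    word_counts = Counter(word for word, _ in word_tag_pairs)
    tag_counts = Counter(tag for _, tag in word_tag_pairs)
    pair_counts = Counter(
        (word if word_counts[word] >= min_count else RARE, tag)
        for word, tag in word_tag_pairs
    )
    return {pair: c / tag_counts[pair[1]] for pair, c in pair_counts.items()}


def lookup(emissions, word, tag, tags=("B", "I", "O")):
    """Use the _RARE_ entry for any word that has no entry of its own."""
    known = any((word, t) in emissions for t in tags)
    return emissions.get((word if known else RARE, tag), 0.0)


toy_pairs = [("the", "O")] * 5 + [("Tie2", "B"), ("Ang1", "B"), ("lutein", "O")]
emissions = emission_probs_with_rare(toy_pairs)
print(lookup(emissions, "Ang2", "B"))  # unseen word, scored as _RARE_
```

Pooling the counts, rather than keeping only the last rare word's count, lets the *_RARE_* entry reflect how often rare words as a group carry each tag.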

In [None]:
def get_rare_word_counts(d):
  max_e = 0.0
  max_label = ""
  if (RARE_KEYWORD,'I') in d:
    if d[(RARE_KEYWORD, 'I')] > max_e:
      max_e = d[(RARE_KEYWORD,'I')]
      max_label = 'I'
  if (RARE_KEYWORD,'B') in d:
    if d[(RARE_KEYWORD, 'B')] > max_e:
      max_e = d[(RARE_KEYWORD,'B')]
      max_label = 'B'
  if (RARE_KEYWORD,'O') in d:
    if d[(RARE_KEYWORD, 'O')] > max_e:
      max_e = d[(RARE_KEYWORD,'O')]
      max_label = 'O'
  return max_e, max_label

In [None]:
import math
preds_new = []
full_preds = []
max_e_rare, max_label_rare = get_rare_word_counts(d)
for sent in test:
  this_sent = []
  counter = 1
  for wt in sent:

    w,_ = wt
    max_e = 0.0
    max_label = ""
    if (w,'I') in d:
      if d[(w, 'I')] > max_e:
        max_e = d[(w,'I')]
        max_label = 'I'
    if (w,'B') in d:
      if d[(w, 'B')] > max_e:
        max_e = d[(w,'B')]
        max_label = 'B'
    if (w,'O') in d:
      if d[(w, 'O')] > max_e:
        max_e = d[(w,'O')]
        max_label = 'O'

    if max_e == 0.0 and max_label == "":
      max_e = max_e_rare
      max_label = max_label_rare
    
    #print(max_e, max_e_rare, w, d)
    full_preds.append((w, max_label, math.log(max_e, 2)))
    this_sent.append('\t'.join([str(counter), w, max_label])+'\n' )
    counter += 1
  preds_new.append(this_sent)

In [None]:
full_preds[:5], preds_new[:10]

([('The', 'O', -7.080920396025297),
  ('main', 'O', -13.83158331232704),
  ('aim', 'O', -14.379071107629533),
  ('of', 'O', -4.555213988108001),
  ('the', 'O', -4.476927937684343)],
 [['1\tThe\tO\n',
   '2\tmain\tO\n',
   '3\taim\tO\n',
   '4\tof\tO\n',
   '5\tthe\tO\n',
   '6\tcontribution\tO\n',
   '7\t,\tO\n',
   '8\twhich\tO\n',
   '9\topens\tO\n',
   '10\tan\tO\n',
   '11\tarena\tO\n',
   '12\tfor\tO\n',
   '13\tdiscussion\tO\n',
   '14\ton\tO\n',
   '15\tthe\tO\n',
   '16\tRivista\tO\n',
   '17\tdell\tO\n',
   "18\t'\tO\n",
   '19\tInfermiere\tO\n',
   '20\tis\tO\n',
   '21\tto\tO\n',
   '22\tcritically\tO\n',
   '23\tappraise\tO\n',
   '24\tpublished\tO\n',
   '25\tresearch\tO\n',
   '26\tworks\tO\n',
   '27\tfocusing\tO\n',
   '28\tboth\tO\n',
   '29\ton\tO\n',
   '30\tstrengths\tO\n',
   '31\tand\tO\n',
   '32\tnovelty\tO\n',
   '33\tand\tO\n',
   '34\tweaknesses\tO\n',
   '35\tin\tO\n',
   '36\tthe\tO\n',
   '37\thypothesis\tO\n',
   '38\tformulation\tO\n',
   '39\t,\tO\n',
 

In [None]:
len(preds_new) == len(test)

True

In [None]:
with open("preds_hmm.csv", 'w') as f:
  for s in preds_new:
    print("".join(s))
    f.write("".join(s))
f

1	The	O
2	main	O
3	aim	O
4	of	O
5	the	O
6	contribution	O
7	,	O
8	which	O
9	opens	O
10	an	O
11	arena	O
12	for	O
13	discussion	O
14	on	O
15	the	O
16	Rivista	O
17	dell	O
18	'	O
19	Infermiere	O
20	is	O
21	to	O
22	critically	O
23	appraise	O
24	published	O
25	research	O
26	works	O
27	focusing	O
28	both	O
29	on	O
30	strengths	O
31	and	O
32	novelty	O
33	and	O
34	weaknesses	O
35	in	O
36	the	O
37	hypothesis	O
38	formulation	O
39	,	O
40	methods	O
41	and	O
42	instruments	O
43	used	O
44	,	O
45	discussion	O
46	of	O
47	results	O
48	.	O

1	Examination	O
2	of	O
3	DNA	I
4	:	O
5	protein	I
6	binding	I
7	complexes	I
8	by	O
9	gel	O
10	-	I
11	shift	O
12	analysis	O
13	indicated	O
14	that	O
15	nuclear	B
16	factors	I
17	from	O
18	both	O
19	proliferative	O
20	and	O
21	growth	I
22	-	I
23	arrested	O
24	cells	O
25	bound	O
26	to	O
27	the	O
28	DNA	I
29	fragment	I
30	spanning	O
31	-	I
32	949	O
33	-	I
34	-	I
35	722	O
36	bp	O
37	.	O

1	Demonstration	O
2	of	O
3	fine	O
4	structure	O
5	of	O
6	the	O
7	guinea	B
8	pig	I
9	org

<_io.TextIOWrapper name='preds_hmm.csv' mode='w' encoding='UTF-8'>

#### The results and analysis
The results are not satisfactory, which motivated trying other approaches to the NER task.