### Named Entity Recognition over medical journal corpus

##### Homework 3, Fall, 2021
##### Prof. James H. Martin
###### author: Sushma Akoju

Notebook to train a hidden Markov model (HMM) for named entity recognition (NER) and to decode tags with the Viterbi algorithm.

Required features:
  - Sentence id
  - Word
  - Tag

### This notebook includes
- A Viterbi dynamic-programming decoder with a trellis and back pointers.
- An HMM approach with start, transition, and emission probabilities estimated from tag bigrams. It can be adapted to trigrams, which are not explored in this notebook in the interest of time and the scope of the homework.
- Replacing rare words with a `_RARE_` keyword to account for out-of-vocabulary and rare words in the test dataset.

### About the dataset split
- The dataset is split 80/20 into training and test sets.

#### References
- [Sequence Labeling for Parts of Speech and Named Entities](https://web.stanford.edu/~jurafsky/slp3/8.pdf)
- [Hidden Markov Models](https://web.stanford.edu/~jurafsky/slp3/A.pdf)
- [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm#Pseudocode)
- [https://courses.engr.illinois.edu/cs447/fa2018/Slides/Lecture06.pdf](https://courses.engr.illinois.edu/cs447/fa2018/Slides/Lecture06.pdf)
- [https://courses.engr.illinois.edu/cs447/fa2018/Slides/Lecture07.pdf](https://courses.engr.illinois.edu/cs447/fa2018/Slides/Lecture07.pdf)
- [https://www.coursera.org/lecture/dna-mutations/viterbi-learning-sM0CP](https://www.coursera.org/lecture/dna-mutations/viterbi-learning-sM0CP)
- [https://www.coursera.org/lecture/language-processing/viterbi-algorithm-what-are-the-most-probable-tags-FMAba](https://www.coursera.org/lecture/language-processing/viterbi-algorithm-what-are-the-most-probable-tags-FMAba)
- [https://github.com/hmmlearn/hmmlearn/blob/master/lib/hmmlearn/hmm.py](https://github.com/hmmlearn/hmmlearn/blob/master/lib/hmmlearn/hmm.py)

In [1]:
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd '/content/drive/MyDrive/Colab Notebooks/nlp-hw3'
!pwd

/content/drive/MyDrive/Colab Notebooks/nlp-hw3
/content/drive/MyDrive/Colab Notebooks/nlp-hw3


Preprocess the data and analyze it.


In [3]:
all_lines = {}
ent_tags = {}
tokens = []
with open("S21-gene-train.txt", "r", encoding="utf8", newline="\n") as file:
  lines = file.readlines()

new_line_counter = 0
all_raw_lines = []
sentences = []
sentence = []
word_tags = {}
this_sentence_tag_pairs = []
for i,line in enumerate(lines):
  if line != "\n":
    this_line = line.split("\t")
    sentence.append(this_line[1])
    #print(this_line)
    ent_tags[this_line[1].strip()] = this_line[2].strip()
    this_pair = this_line[1].strip(), this_line[2].strip()
    this_sentence_tag_pairs.append(this_pair)
    tokens.append(this_line[1].strip())
    if this_line[1].strip() not in word_tags.keys():
      word_tags[this_line[1].strip()] = {"O": 0, "I":0, "B":0 }
    word_tags[this_line[1].strip()][this_line[2].strip()] += 1
    all_raw_lines.append({"Sentence #":new_line_counter,"Line":int(this_line[0].strip()),"Word":this_line[1].strip(), "Tag":this_line[2].strip()})
  else:
    new_line_counter += 1
    all_lines[i] = this_sentence_tag_pairs
    this_sentence_tag_pairs = []
    sentences.append(" ".join(sentence))
    sentence = []

In [4]:
all_lines

{9: [('Comparison', 'O'),
  ('with', 'O'),
  ('alkaline', 'B'),
  ('phosphatases', 'I'),
  ('and', 'O'),
  ('5', 'B'),
  ('-', 'I'),
  ('nucleotidase', 'I'),
  ('.', 'O')],
 16: [('Pharmacologic', 'O'),
  ('aspects', 'O'),
  ('of', 'O'),
  ('neonatal', 'O'),
  ('hyperbilirubinemia', 'O'),
  ('.', 'O')],
 62: [('When', 'O'),
  ('CSF', 'O'),
  ('[', 'O'),
  ('HCO3', 'O'),
  ('-]', 'O'),
  ('is', 'O'),
  ('shown', 'O'),
  ('as', 'O'),
  ('a', 'O'),
  ('function', 'O'),
  ('of', 'O'),
  ('CSF', 'O'),
  ('PCO2', 'O'),
  ('the', 'O'),
  ('data', 'O'),
  ('of', 'O'),
  ('K', 'O'),
  ('-', 'O'),
  ('depleted', 'O'),
  ('rats', 'O'),
  ('are', 'O'),
  ('no', 'O'),
  ('longer', 'O'),
  ('displaced', 'O'),
  ('when', 'O'),
  ('compared', 'O'),
  ('to', 'O'),
  ('controls', 'O'),
  ('but', 'O'),
  ('still', 'O'),
  ('have', 'O'),
  ('a', 'O'),
  ('significantly', 'O'),
  ('greater', 'O'),
  ('slope', 'O'),
  ('(', 'O'),
  ('1', 'O'),
  ('.', 'O'),
  ('21', 'O'),
  ('+/-', 'O'),
  ('0', 'O'),
  ('.

#### About the data
- Total number of sentences: 13795
- Total number of words/tokens in dataset: 308229
- Max number of words in a sentence: ~102
- Vocabulary size: 27282
- Total number of B tags: 13304
- Total number of I tags: 19527
- Total number of O tags: 276009
- The 10 most common words are: '.', 'the', 'of', '-', ',', 'and', 'in', 'a', '(', 'to'.
- Ten of the least common words (each occurs once) are: 'K713', 'hypercholesterolemic', 'lutein', 'P69', 'conference', 'Talk', 'Tele', 'cruciform', 'TE105', 'probing'.

In [5]:
train, test = train_test_split(list(all_lines.values()), test_size=0.2)
print(len(train), len(test)), train[2]

11036 2759


(None,
 [('During', 'O'),
  ('insulin', 'B'),
  ('infusion', 'O'),
  (',', 'O'),
  ('a', 'O'),
  ('20', 'O'),
  ('%', 'O'),
  ('dextrose', 'O'),
  ('solution', 'O'),
  ('was', 'O'),
  ('infused', 'O'),
  ('by', 'O'),
  ('a', 'O'),
  ('Biostator', 'O'),
  ('in', 'O'),
  ('order', 'O'),
  ('to', 'O'),
  ('maintain', 'O'),
  ('the', 'O'),
  ('patient', 'O'),
  ("'", 'O'),
  ('s', 'O'),
  ('glycemia', 'O'),
  ('at', 'O'),
  ('90', 'O'),
  ('mg', 'O'),
  ('/', 'O'),
  ('dl', 'O'),
  ('.', 'O')])

In [6]:
word_tag_pairs = [this_tuple for sent in train for this_tuple in sent]
all_words = [this_word_tag[0] for this_word_tag in word_tag_pairs]
all_tags = [this_word_tag[1] for this_word_tag in word_tag_pairs]
all_tag_string = ''.join(all_tags)
vocab = list(set(all_words))
unique_tags = set(all_tags)
tag_counter = Counter(all_tags)
word_tag_pair_counts = Counter(word_tag_pairs)

In [7]:
data = pd.DataFrame.from_records(all_raw_lines, index=range(1,len(all_raw_lines)+1) )
sent_word_count = data.groupby(['Sentence #']).count()['Word']
df = pd.DataFrame.from_dict(word_tags)
df_wt = pd.DataFrame.from_dict(word_tags).T
most_common = Counter(" ".join(data["Word"]).split()).most_common(10)
least_common = Counter(" ".join(data["Word"]).split()).most_common()[:-100-1:-1] 
data.head(5)

| | Sentence # | Line | Word | Tag |
|---|---|---|---|---|
| 1 | 0 | 1 | Comparison | O |
| 2 | 0 | 2 | with | O |
| 3 | 0 | 3 | alkaline | B |
| 4 | 0 | 4 | phosphatases | I |
| 5 | 0 | 5 | and | O |


In [8]:
word_tag_pairs

[('CONCLUSIONS', 'O'),
 (':', 'O'),
 ('A', 'O'),
 ('comparison', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('LysU', 'B'),
 ('crystal', 'O'),
 ('structure', 'O'),
 ('with', 'O'),
 ('the', 'O'),
 ('structures', 'O'),
 ('of', 'O'),
 ('seryl', 'B'),
 ('-', 'I'),
 ('and', 'I'),
 ('aspartyl', 'I'),
 ('-', 'I'),
 ('tRNA', 'I'),
 ('synthetases', 'I'),
 ('enables', 'O'),
 ('a', 'O'),
 ('conserved', 'O'),
 ('core', 'O'),
 ('to', 'O'),
 ('be', 'O'),
 ('identified', 'O'),
 ('.', 'O'),
 ('Male', 'B'),
 ('-', 'I'),
 ('enhanced', 'I'),
 ('antigen', 'I'),
 ('gene', 'I'),
 ('is', 'O'),
 ('phylogenetically', 'O'),
 ('conserved', 'O'),
 ('and', 'O'),
 ('expressed', 'O'),
 ('at', 'O'),
 ('late', 'O'),
 ('stages', 'O'),
 ('of', 'O'),
 ('spermatogenesis', 'O'),
 ('.', 'O'),
 ('During', 'O'),
 ('insulin', 'B'),
 ('infusion', 'O'),
 (',', 'O'),
 ('a', 'O'),
 ('20', 'O'),
 ('%', 'O'),
 ('dextrose', 'O'),
 ('solution', 'O'),
 ('was', 'O'),
 ('infused', 'O'),
 ('by', 'O'),
 ('a', 'O'),
 ('Biostator', 'O'),
 ('in', 'O'

In [9]:
print(len(word_tags.keys()), len(word_tags.values()))

31328 31328


In [10]:
print("Summary of this Dataset:")
print("Total number of sentences: %d"%(len(all_lines)))
print("Total number of words/tokens in dataset: %d" %len(tokens))
print("Max number of words in a sentence: %d"%(max(sent_word_count)))
print("Vocabulary size: %d"% (len(vocab)))
print("Total number of B tags:", tag_counter['B'])
print("Total number of I tags:", tag_counter['I'])
print("Total number of O tags:", tag_counter['O'])
print("The most common top 10 words are: ",most_common)
print("The least common top 10 words are: ",least_common)

Summary of this Dataset:
Total number of sentences: 13795
Total number of words/tokens in dataset: 386201
Max number of words in a sentence: 255
Vocabulary size: 27620
Total number of B tags: 13343
Total number of I tags: 19862
Total number of O tags: 276790
The most common top 10 words are:  [('.', 15510), ('the', 15508), ('of', 14771), ('-', 13150), (',', 12190), ('and', 9980), ('in', 8065), ('a', 5498), ('(', 5375), ('to', 5177)]
The least common top 10 words are:  [('K713', 1), ('hypercholesterolemic', 1), ('lutein', 1), ('P69', 1), ('conference', 1), ('Talk', 1), ('Tele', 1), ('cruciform', 1), ('TE105', 1), ('probing', 1), ('nonoverlapping', 1), ('monochannel', 1), ('Termination', 1), ('SMemb', 1), ('Tie2', 1), ('Ang1', 1), ('accumulations', 1), ('Ang2', 1), ('INCENP', 1), ('dictating', 1), ('satMa', 1), ('SAF', 1), ('FBT', 1), ('fluorobenzyltrozamicol', 1), ('](+)-', 1), ('Closeup', 1), ('USH1', 1), ('5G', 1), ('IVS51', 1), ('M1281', 1), ('domoic', 1), ('sardine', 1), ('managing'

Function to find transition probabilities a_ij = p(tag2|tag1)
##### Steps
  - Input: tag2, tag1, and the tags of all training word/tag pairs
  - count_tag1 = number of occurrences of tag1 in the training tags
  - sequence_count = number of times tag1 is immediately followed by tag2
  - Return sequence_count and count_tag1; their ratio is p(tag2|tag1)

#### Create Transition probabilities matrix for sequences

###### Steps:
  - Create a tag matrix of size tag_count * tag_count
  - Fill cell (i, j) of the matrix with the transition probability p(tag_j|tag_i)
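
The steps above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `transition_probs` and the second toy sentence are hypothetical, and it assumes bigrams are counted within each sentence and normalized by the number of times the first tag has a successor. Pairing adjacent tags with `zip` counts every overlapping pair, whereas `str.count` on a joined tag string counts only non-overlapping matches (`"IIII".count("II")` is 2, although the run contains three I-to-I transitions).

```python
from collections import Counter

def transition_probs(tagged_sentences):
    """Estimate p(tag2 | tag1) from adjacent tag pairs inside each sentence."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        # zip yields every adjacent pair, including overlapping ones in runs like I-I-I
        for t1, t2 in zip(tags, tags[1:]):
            bigram_counts[(t1, t2)] += 1
            context_counts[t1] += 1
    return {pair: c / context_counts[pair[0]] for pair, c in bigram_counts.items()}

# Toy data for illustration only
toy = [
    [("alkaline", "B"), ("phosphatases", "I"), ("and", "O")],
    [("the", "O"), ("insulin", "B"), ("receptor", "I"), ("gene", "I"), (".", "O")],
]
probs = transition_probs(toy)
print(probs)
```

With this normalization the probabilities leaving each tag sum to 1, which is a quick sanity check on any transition matrix.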

#### Create Emission Probabilities
###### Steps
  - tag_count = number of training word/tag pairs that have the given tag
  - word_count_for_matched_tag = number of those pairs whose word matches the given word
  - Return word_count_for_matched_tag and tag_count; their ratio is p(word|tag)
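
In equation form, the two returned counts give the maximum-likelihood emission estimate. The symbols $b_t(w)$ and $C(\cdot)$ are notation chosen for this note, not names used in the notebook:

$$
b_t(w) = P(w \mid t) = \frac{C(w, t)}{C(t)}
$$

Here $C(w, t)$ corresponds to `word_count_for_matched_tag` and $C(t)$ to `tag_count`. For example, with $C(\text{B}) = 13343$ in the training split, a word seen twice with tag B gets $2 / 13343 \approx 0.00015$.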

In [11]:
def get_emission_prob(tag, word, word_tag_pair_counts, tag_counter:Counter):
  #print(tag, word, word_tag_pair_counts)
  tag_count = tag_counter[tag]
  word_count_for_this_tag = word_tag_pair_counts[(word, tag)]
  return (word_count_for_this_tag, tag_count)

In [12]:
num, denom = get_emission_prob("B", "VirD2",word_tag_pair_counts, tag_counter)
num/denom

0.00014989132878662968

In [13]:
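def get_transition_prob( tag2, tag1, all_tags_string, tag_counter):
  # Returns (count of tag1 followed by tag2, count of tag1); the caller divides them.
  count1 = tag_counter[tag1]
  # str.count counts non-overlapping matches, so runs such as "III" or "OOO"
  # yield fewer tag1+tag2 pairs than actually occur.
  sequence_count = all_tags_string.count(tag1+tag2)
  return sequence_count, count1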

In [14]:
def create_tag_transition_matrix(unique_tags:list,all_tags_string:str, tag_counter:Counter):
  # Cell (i, j) holds p(tag_j | tag_i), with tags in the iteration order of unique_tags.
  tag_matrix = np.zeros((len(unique_tags), len(unique_tags)), dtype="float32")
  for i,tag1 in enumerate(list(unique_tags)):
    for j,tag2 in enumerate(list(unique_tags)):
      seq_count, count1 = get_transition_prob( tag2, tag1, all_tags_string, tag_counter)
      tag_matrix[i, j] = seq_count/count1
  return tag_matrix

In [15]:
bigram_trans_prob_matrix = create_tag_transition_matrix(unique_tags, all_tag_string, tag_counter)
utags = list(unique_tags)
transition_df = pd.DataFrame(bigram_trans_prob_matrix, columns = utags, index=utags)
transition_df

| | B | I | O |
|---|---|---|---|
| B | 0.0 | 0.582028 | 0.417972 |
| I | 0.000101 | 0.395328 | 0.390897 |
| O | 0.048199 | 0.0 | 0.486062 |


Implement Viterbi with numpy and a DP table (trellis)
#### Viterbi Algorithm
- Inputs: transition probabilities, emission probabilities, start probabilities, end scores
###### Inputs/Variables
  - K is the number of unique tags: K = 3 (B, I, O)
  - Bigram case: the tag pairs are BI, BB, BO, IO, II, IB, OI, OO, OB
  - Trigram case: the tag triples are
    - BIB IIO OBO IBO BOO OBB BBO IOO IOI 
    - OOO IBI OBI IOB BBI OOB BII BOI IBB 
    - BOB BBB IIB OIB BIO OII OIO III OOI
  - Number of distinct words N = 31328 for this dataset
  - Start probabilities are the set of initial tag probabilities, with size K * 1
  - Transition probabilities matrix T with a size K * K
  - Emission matrix with a size N * K
  - End transition scores with a size of K * 1

###### Most Likely Hidden state Sequence
  - a hidden state sequence K * 1

###### Return value
  - Score of best sequence
  - Array of size N with integers (best sequence)
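
One common way to turn these inputs into a decoder is sketched below for a single sentence. It is an illustration under stated assumptions rather than the notebook's `viterbi_dp`: the function name `viterbi_log`, the dictionary-based tables, the probability floor for unseen events, and all toy probabilities are hypothetical. Scores are log-probabilities, so products of probabilities become sums, and no end scores are used.

```python
import math

def viterbi_log(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return (best log score, best tag path) for one sentence."""
    def lg(p):
        # Floor zero probabilities so the logarithm stays finite (an assumption).
        return math.log(max(p, floor))

    # trellis[t][s]: best log score of any tag path ending in tag s at position t
    trellis = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p.get((words[0], s), 0.0))
                for s in tags}]
    backptr = [{}]
    for t in range(1, len(words)):
        scores, ptrs = {}, {}
        for s in tags:
            # Best previous tag r for reaching tag s at position t
            prev = max(tags, key=lambda r: trellis[t - 1][r] + lg(trans_p.get((r, s), 0.0)))
            scores[s] = (trellis[t - 1][prev] + lg(trans_p.get((prev, s), 0.0))
                         + lg(emit_p.get((words[t], s), 0.0)))
            ptrs[s] = prev
        trellis.append(scores)
        backptr.append(ptrs)

    # Backtrack from the best final tag to the first position
    last = max(tags, key=lambda s: trellis[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return trellis[-1][last], path

# Toy tables, invented for illustration only
tags = ["B", "I", "O"]
start_p = {"B": 0.05, "O": 0.95}
trans_p = {("B", "I"): 0.58, ("B", "O"): 0.42, ("I", "I"): 0.61,
           ("I", "O"): 0.39, ("O", "B"): 0.05, ("O", "O"): 0.95}
emit_p = {("the", "O"): 0.04, ("insulin", "B"): 0.01, ("receptor", "I"): 0.02,
          ("receptor", "O"): 0.0001, ("binds", "O"): 0.001}
score, path = viterbi_log(["the", "insulin", "receptor", "binds"],
                          tags, start_p, trans_p, emit_p)
print(path)
```

On the toy tables the decoded path is O, B, I, O. Decoding one sentence at a time keeps the trellis at size K by sentence length, so the position index always refers to a word in the sentence.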


In [16]:
K = len(unique_tags) #total unique tags : 3
N = len(word_tags.keys()) # number of distinct words : 31328
start_p = np.zeros((K)) # 3x1
trans_p = np.zeros((K,K)) # 3 x 3
end_scores =  np.zeros((K))
emission_p = np.zeros((N,K)) # 31328 x 3
tag_order = ['B', 'O', 'I']
tot_tags = len(all_tags)

In [17]:
tag_counter, utags, tag_order,tot_tags, 

(Counter({'B': 13343, 'I': 19862, 'O': 276790}),
 ['B', 'I', 'O'],
 ['B', 'O', 'I'],
 309995)

In [18]:
start_p.shape

(3,)

In [19]:
count = 0
start_tags = []
for sent in train:
  tuples = [t[1] for t in sent[:1]]
  tag_seq = "".join(tuples)
  start_tags.append(tag_seq)
start_tag_counter = Counter(start_tags)
start_tag_counter['I'] = 0
tot_start_tags = sum(start_tag_counter.values())
start_tag_counter

Counter({'B': 594, 'I': 0, 'O': 10442})

In [20]:
for i,tag in enumerate(utags):
    # Divides by tot_tags (all training tags), not tot_start_tags, so start_p does not sum to 1.
    start_p[i] = np.round(start_tag_counter[tag]/tot_tags, 3)

In [21]:
trans_p = np.asarray(bigram_trans_prob_matrix)
print(transition_df.head())

          B         I         O
B  0.000000  0.582028  0.417972
I  0.000101  0.395328  0.390897
O  0.048199  0.000000  0.486062


In [22]:
for i,word in enumerate(vocab):
  for j,tag in enumerate(utags):
    word_count_for_this_tag, tag_count = get_emission_prob(tag,word, word_tag_pair_counts, tag_counter)
    if tag_count != 0:
      emission_p[i][j] = word_count_for_this_tag / tag_count
    else:
      emission_p[i][j] = 0.0

In [23]:
e =  (emission_p - emission_p.min()) / (np.ptp(emission_p))
np.linalg.norm(emission_p/np.linalg.norm(emission_p)), e

(1.0, array([[0.00000000e+00, 0.00000000e+00, 4.01557766e-05],
        [0.00000000e+00, 0.00000000e+00, 2.00778883e-05],
        [0.00000000e+00, 0.00000000e+00, 4.01557766e-05],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00]]))

In [24]:
for i, tag in enumerate(utags):
  num, denom = get_emission_prob(tag, ".", word_tag_pair_counts, tag_counter)
  end_scores[i] = num/denom

In [25]:
end_scores

array([0.        , 0.00674655, 0.04429351])

In [29]:
from past.builtins import xrange
def viterbi_dp(start_p, trans_p, emission_p, end_scores):
  # Relies on the globals N (rows of emission_p) and K (number of tags).
  pred = []
  # trellis[j, i]: best score of a path that ends in tag j at step i
  trellis = np.array([[0.0 for j in range(N)] for i in range(K)])
  # backpointer[j, i]: tag index at step i-1 on that best path (-1 at step 0)
  backpointer = np.array([[-1 for j in range(N)] for i in range(K)])

  for i in xrange(N):
      if i == 0:
          trellis[:,0] = np.add(start_p, end_scores)
          print(trellis[:,0])
      else:
          for j in xrange(K):
              # Scores are combined by addition; step i reads row i of emission_p
              trellis[j,i] = np.max(np.add(trans_p[:,j], trellis[:,i-1] )) + emission_p[i,j]
              backpointer[j,i] = np.argmax(np.add(trans_p[:,j], trellis[:,i-1]))
  
  final_score = np.max(np.add(trellis[:, i], end_scores ))
  index = np.argmax(np.add(trellis[:, i], end_scores ))
  print(trellis)
  # Follow the back pointers from the best final tag to step 0
  while index != -1:
      pred.append(index)
      index = backpointer[index, i]
      i -= 1
  
  pred.reverse()
  return (final_score, pred)

In [30]:
score, pred = viterbi_dp(start_p, trans_p, emission_p, end_scores)

[0.002      0.00674655 0.07829351]
[[2.00000000e-03 1.26492502e-01 6.12557747e-01 ... 1.52265211e+04
  1.52270071e+04 1.52274932e+04]
 [6.74655120e-03 5.84028031e-01 9.79355778e-01 ... 1.52266170e+04
  1.52271031e+04 1.52275892e+04]
 [7.82935077e-02 5.64358753e-01 1.05042761e+00 ... 1.52269589e+04
  1.52274450e+04 1.52279311e+04]]


In [31]:
score

15227.975345293375

In [32]:
sum(pred) == len(word_tags.keys())

False

In [33]:
preds = np.asarray(pred)
preds[preds==1],preds[preds==2], utags, len(test),len(preds)

(array([], dtype=int64),
 array([2, 2, 2, ..., 2, 2, 2]),
 ['B', 'I', 'O'],
 2759,
 31328)

In [34]:
len(sentences[:len(train)]) ==len(train)

True

In [35]:
ngrams = None
def get_ngrams(train, n):
  for sent in train:
    #print(sent)
    word_boundary = (n-1) * [(None, "*")]
    word_boundary.extend(sent)
    word_boundary.append((None, "STOP"))
    ngrams = (tuple(word_boundary[i:i+n]) for i in xrange(len(word_boundary)-n+1))
    #print(ngrams)
    for ngram in ngrams:
      yield ngram

In [36]:
from collections import defaultdict

def train_hmm(train, n):
  emission_counts = defaultdict(int)
  ngram_counts = [defaultdict(int) for i in xrange(n)]
  for ngram in get_ngrams(train, n):
      
      assert len(ngram) == n, "n = %i, expected %i" %(len(ngram), n)
      #sent = [[t] for t in ngram]
      tags = tuple([tag for word, tag in ngram])

      for i in xrange(2, n+1):
        ngram_counts[i-1][tags[-i:]] += 1

      if ngram[-1][0] is not None:
        ngram_counts[0][tags[-1:]] += 1
        emission_counts[ngram[-1]] += 1
      
      if ngram[-(n-1)][0] is None:
        ngram_counts[n-2][tuple((n-1) * ["*"])] += 1
      
  return ngram_counts, emission_counts

In [37]:
bigram_counts, emit_counts = train_hmm(train, 2)
unigram_counts, uemit_counts = train_hmm(train, 1)

In [38]:
unigram_counts, tag_counter,bigram_counts, len(word_tag_pair_counts), len(emit_counts), len(uemit_counts)

([defaultdict(int, {(): 11036, ('B',): 13343, ('I',): 19862, ('O',): 276790})],
 Counter({'B': 13343, 'I': 19862, 'O': 276790}),
 [defaultdict(int,
              {('*',): 11036, ('B',): 13343, ('I',): 19862, ('O',): 276790}),
  defaultdict(int,
              {('*', 'B'): 594,
               ('*', 'O'): 10442,
               ('B', 'I'): 7766,
               ('B', 'O'): 5576,
               ('B', 'STOP'): 1,
               ('I', 'I'): 12096,
               ('I', 'O'): 7756,
               ('I', 'STOP'): 10,
               ('O', 'B'): 12749,
               ('O', 'O'): 253016,
               ('O', 'STOP'): 11025})],
 30600,
 30600,
 30600)

In [40]:
word_tag_pair_counts[('cancer', 'O')], emit_counts[('cancer', 'O')], uemit_counts[('cancer', 'O')],unigram_counts

(84,
 84,
 84,
 [defaultdict(int, {(): 11036, ('B',): 13343, ('I',): 19862, ('O',): 276790})])

In [41]:
tag_counter,word_tag_pair_counts

(Counter({'B': 13343, 'I': 19862, 'O': 276790}),
 Counter({('CONCLUSIONS', 'O'): 26,
          (':', 'O'): 705,
          ('A', 'O'): 606,
          ('comparison', 'O'): 61,
          ('of', 'O'): 11721,
          ('the', 'O'): 12389,
          ('LysU', 'B'): 1,
          ('crystal', 'O'): 11,
          ('structure', 'O'): 149,
          ('with', 'O'): 2745,
          ('structures', 'O'): 31,
          ('seryl', 'B'): 1,
          ('-', 'I'): 3574,
          ('and', 'I'): 114,
          ('aspartyl', 'I'): 1,
          ('tRNA', 'I'): 14,
          ('synthetases', 'I'): 5,
          ('enables', 'O'): 6,
          ('a', 'O'): 4391,
          ('conserved', 'O'): 130,
          ('core', 'O'): 38,
          ('to', 'O'): 4128,
          ('be', 'O'): 656,
          ('identified', 'O'): 232,
          ('.', 'O'): 12260,
          ('Male', 'B'): 1,
          ('enhanced', 'I'): 1,
          ('antigen', 'I'): 48,
          ('gene', 'I'): 685,
          ('is', 'O'): 1866,
          ('phylogenetical

In [42]:
d = defaultdict(float)
d_rare = []
for w,c in Counter(all_words).items():
  if c < 5:
    d_rare.append(w)

In [43]:
Counter(all_words).most_common()[::-1]

[('pE1', 1),
 ('JNK2', 1),
 ('FPCL', 1),
 ('lattices', 1),
 ('populated', 1),
 ('TIS1', 1),
 ('hprt', 1),
 ('skipped', 1),
 ('nucleated', 1),
 ('fibrils', 1),
 ('Amyloid', 1),
 ('Premature', 1),
 ('CDK6', 1),
 ('CDK4', 1),
 ('Pctr1', 1),
 ('congenic', 1),
 ('squirrel', 1),
 ('raccoon', 1),
 ('glabrous', 1),
 ('mechanoreceptor', 1),
 ('Slowly', 1),
 ('signficant', 1),
 ('223', 1),
 ('BBScV', 1),
 ('scorch', 1),
 ('blueberry', 1),
 ('adiposity', 1),
 ('*;', 1),
 ('polynucleotides', 1),
 ('Poly', 1),
 ('medialis', 1),
 ('PDE3A', 1),
 ('PDH', 1),
 ('annually', 1),
 ('TAH', 1),
 ('oesophageal', 1),
 ('Symptoms', 1),
 ('ABRs', 1),
 ('xylP', 1),
 ('Yang', 1),
 ('pentafluoropropionyl', 1),
 ('trifluoroethyl', 1),
 ('hydroxyphenylacetic', 1),
 ('phenylacetic', 1),
 ('implicating', 1),
 ('economic', 1),
 ('helminthiases', 1),
 ('sanitation', 1),
 ('explosively', 1),
 ('malnutrition', 1),
 ('poverty', 1),
 ('aggravated', 1),
 ('Echinostomiasis', 1),
 ('Pendrys', 1),
 ('Fluorosis', 1),
 ('carriage

In [44]:
RARE_KEYWORD = "_RARE_"
word_tag_pair_counts_r = word_tag_pair_counts.copy()
for wt, count in word_tag_pair_counts.items():
  word, tag = wt
  if word in d_rare:
    word_tag_pair_counts_r[(RARE_KEYWORD, tag)] += count
  #print(tag_counter[tag], tag, word, count)


In [45]:
for wt, count in word_tag_pair_counts_r.items():
  word, tag = wt
  p = count*  1.0 /tag_counter[tag]
  d[(word, tag)] = p
d

defaultdict(float,
            {('CONCLUSIONS', 'O'): 9.39340294085769e-05,
             (':', 'O'): 0.002547057335886412,
             ('A', 'O'): 0.0021893854546768308,
             ('comparison', 'O'): 0.0002203836843816612,
             ('of', 'O'): 0.042346183026843454,
             ('the', 'O'): 0.04475956501318689,
             ('LysU', 'B'): 7.494566439331484e-05,
             ('crystal', 'O'): 3.974132013439792e-05,
             ('structure', 'O'): 0.0005383142454568445,
             ('with', 'O'): 0.009917265797174753,
             ('structures', 'O'): 0.00011199826583330323,
             ('seryl', 'B'): 7.494566439331484e-05,
             ('-', 'I'): 0.1799415970194341,
             ('and', 'I'): 0.005739603262511328,
             ('aspartyl', 'I'): 5.034739703957305e-05,
             ('tRNA', 'I'): 0.0007048635585540227,
             ('synthetases', 'I'): 0.00025173698519786525,
             ('enables', 'O'): 2.1677083709671593e-05,
             ('a', 'O'): 0.01586401242819

### Approach for RARE Words
Each word that occurs fewer than 5 times in the training data is replaced with *_RARE_*. A word in the test dataset that is not in the vocabulary, and would otherwise have zero frequency, is then largely accounted for by the *_RARE_* statistics.
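
A minimal sketch of this idea follows, assuming rare words are mapped to the placeholder before emission counts are taken and that unseen test words use the same mapping. The names `rare_emission_probs` and `lookup`, the `min_count` parameter, and the toy sentences are hypothetical and are not part of the notebook.

```python
from collections import Counter

RARE = "_RARE_"  # same placeholder string as the notebook's RARE_KEYWORD

def rare_emission_probs(tagged_sentences, min_count=5):
    """Emission estimates p(word | tag) with infrequent words pooled under RARE."""
    word_freq = Counter(w for sent in tagged_sentences for w, _ in sent)
    pair_counts = Counter()
    tag_counts = Counter()
    for sent in tagged_sentences:
        for w, t in sent:
            # Words below the threshold share one pooled entry per tag
            key = w if word_freq[w] >= min_count else RARE
            pair_counts[(key, t)] += 1
            tag_counts[t] += 1
    probs = {(w, t): c / tag_counts[t] for (w, t), c in pair_counts.items()}
    return probs, word_freq

def lookup(word, tag, probs, word_freq, min_count=5):
    """Emission probability for a possibly unseen word."""
    key = word if word_freq[word] >= min_count else RARE
    return probs.get((key, tag), 0.0)

# Toy data for illustration only, with a low threshold so the effect is visible
toy = [
    [("the", "O"), ("LysU", "B"), ("the", "O")],
    [("the", "O"), ("seryl", "B")],
]
probs, word_freq = rare_emission_probs(toy, min_count=2)
print(lookup("JNK2", "B", probs, word_freq, min_count=2))
```

Because the pooled counts are accumulated over all rare words, an unseen word such as `JNK2` in the toy example receives the combined rare-word estimate for each tag instead of zero.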

In [46]:
def get_rare_word_counts(d):
  max_e = 0.0
  max_label = ""
  if (RARE_KEYWORD,'I') in d:
    if d[(RARE_KEYWORD, 'I')] > max_e:
      max_e = d[(RARE_KEYWORD,'I')]
      max_label = 'I'
  if (RARE_KEYWORD,'B') in d:
    if d[(RARE_KEYWORD, 'B')] > max_e:
      max_e = d[(RARE_KEYWORD,'B')]
      max_label = 'B'
  if (RARE_KEYWORD,'O') in d:
    if d[(RARE_KEYWORD, 'O')] > max_e:
      max_e = d[(RARE_KEYWORD,'O')]
      max_label = 'O'
  return max_e, max_label

In [47]:
import math
preds_new = []
full_preds = []
test_data = []
max_e_rare, max_label_rare = get_rare_word_counts(d)
for sent in test:
  this_sent = []
  test_sent = []
  counter = 1
  for wt in sent:

    w,tl = wt
    max_e = 0.0
    max_label = ""
    if (w,'I') in d:
      if d[(w, 'I')] > max_e:
        max_e = d[(w,'I')]
        max_label = 'I'
    if (w,'B') in d:
      if d[(w, 'B')] > max_e:
        max_e = d[(w,'B')]
        max_label = 'B'
    if (w,'O') in d:
      if d[(w, 'O')] > max_e:
        max_e = d[(w,'O')]
        max_label = 'O'

    if max_e == 0.0 and max_label == "":
      max_e = max_e_rare
      max_label = max_label_rare
    
    #print(max_e, max_e_rare, w, d)
    full_preds.append((w, max_label, math.log(max_e, 2)))
    this_sent.append('\t'.join([str(counter), w, max_label,'\n']) )
    
    test_sent.append('\t'.join([str(counter), w, tl,'\n']) )
    
    counter += 1
  this_sent.append( '\n')
  test_sent.append( '\n')
  preds_new.append(this_sent)
  test_data.append(test_sent)

In [48]:
full_preds[:5], preds_new[:10]

([('Finally', 'O', -13.990969454734024),
  (',', 'O', -4.843165963227503),
  ('the', 'O', -4.481660173933329),
  ('stability', 'O', -13.686114873205604),
  ('of', 'O', -4.561624255079534)],
 [['1\tFinally\tO\t\n',
   '2\t,\tO\t\n',
   '3\tthe\tO\t\n',
   '4\tstability\tO\t\n',
   '5\tof\tO\t\n',
   '6\tthe\tO\t\n',
   '7\tnucleotide\tI\t\n',
   '8\tbinding\tI\t\n',
   '9\tfunction\tO\t\n',
   '10\tof\tO\t\n',
   '11\tthe\tO\t\n',
   '12\ttwo\tO\t\n',
   '13\tproteins\tI\t\n',
   '14\tis\tO\t\n',
   '15\tsimilar\tO\t\n',
   '16\tas\tO\t\n',
   '17\tassessed\tO\t\n',
   '18\tby\tO\t\n',
   '19\tsensitivity\tO\t\n',
   '20\tto\tO\t\n',
   '21\turea\tO\t\n',
   '22\t.\tO\t\n',
   '\n'],
  ['1\tHowever\tO\t\n',
   '2\t,\tO\t\n',
   '3\tdivision\tO\t\n',
   '4\tof\tO\t\n',
   '5\tthe\tO\t\n',
   '6\tchest\tO\t\n',
   '7\twall\tO\t\n',
   '8\tmuscles\tO\t\n',
   '9\t,\tO\t\n',
   '10\tusually\tO\t\n',
   '11\twith\tO\t\n',
   '12\tdiathermy\tO\t\n',
   '13\t,\tO\t\n',
   '14\tcontributes\tO\t

In [49]:
len(preds_new) == len(test)

True

In [50]:
with open("preds_hmm.csv", 'w') as f:
  for s in preds_new:
    print("".join(s))
    f.write("".join(s))
f

1	Finally	O	
2	,	O	
3	the	O	
4	stability	O	
5	of	O	
6	the	O	
7	nucleotide	I	
8	binding	I	
9	function	O	
10	of	O	
11	the	O	
12	two	O	
13	proteins	I	
14	is	O	
15	similar	O	
16	as	O	
17	assessed	O	
18	by	O	
19	sensitivity	O	
20	to	O	
21	urea	O	
22	.	O	


1	However	O	
2	,	O	
3	division	O	
4	of	O	
5	the	O	
6	chest	O	
7	wall	O	
8	muscles	O	
9	,	O	
10	usually	O	
11	with	O	
12	diathermy	O	
13	,	O	
14	contributes	O	
15	to	O	
16	prolonged	O	
17	pain	O	
18	and	O	
19	morbidity	O	
20	.	O	


1	We	O	
2	found	O	
3	that	O	
4	lung	O	
5	cancer	O	
6	tissues	O	
7	of	O	
8	positive	O	
9	67Ga	O	
10	scan	O	
11	expressed	O	
12	TFR	O	
13	,	O	
14	but	O	
15	those	O	
16	of	O	
17	a	O	
18	negative	O	
19	scan	O	
20	did	O	
21	not	O	
22	.	O	


1	In	O	
2	literature	O	
3	,	O	
4	the	O	
5	HBE	O	
6	has	O	
7	been	O	
8	displayed	O	
9	by	O	
10	application	O	
11	of	O	
12	the	O	
13	averaging	O	
14	method	O	
15	.	O	


1	EGV	O	
2	had	O	
3	no	O	
4	detectable	O	
5	effect	O	
6	on	O	
7	PP	O	
8	secretion	O	
9	under	O	
10	basal	O	
11	or	

<_io.TextIOWrapper name='preds_hmm.csv' mode='w' encoding='UTF-8'>

In [51]:
with open("test_hmm.csv", 'w') as f:
  for s in test_data:
    print("".join(s))
    f.write("".join(s))
f

1	Finally	O	
2	,	O	
3	the	O	
4	stability	O	
5	of	O	
6	the	O	
7	nucleotide	O	
8	binding	O	
9	function	O	
10	of	O	
11	the	O	
12	two	O	
13	proteins	O	
14	is	O	
15	similar	O	
16	as	O	
17	assessed	O	
18	by	O	
19	sensitivity	O	
20	to	O	
21	urea	O	
22	.	O	


1	However	O	
2	,	O	
3	division	O	
4	of	O	
5	the	O	
6	chest	O	
7	wall	O	
8	muscles	O	
9	,	O	
10	usually	O	
11	with	O	
12	diathermy	O	
13	,	O	
14	contributes	O	
15	to	O	
16	prolonged	O	
17	pain	O	
18	and	O	
19	morbidity	O	
20	.	O	


1	We	O	
2	found	O	
3	that	O	
4	lung	O	
5	cancer	O	
6	tissues	O	
7	of	O	
8	positive	O	
9	67Ga	O	
10	scan	O	
11	expressed	O	
12	TFR	B	
13	,	O	
14	but	O	
15	those	O	
16	of	O	
17	a	O	
18	negative	O	
19	scan	O	
20	did	O	
21	not	O	
22	.	O	


1	In	O	
2	literature	O	
3	,	O	
4	the	O	
5	HBE	O	
6	has	O	
7	been	O	
8	displayed	O	
9	by	O	
10	application	O	
11	of	O	
12	the	O	
13	averaging	O	
14	method	O	
15	.	O	


1	EGV	O	
2	had	O	
3	no	O	
4	detectable	O	
5	effect	O	
6	on	O	
7	PP	B	
8	secretion	O	
9	under	O	
10	basal	O	
11	or	

<_io.TextIOWrapper name='test_hmm.csv' mode='w' encoding='UTF-8'>

In [57]:
filepath = r'/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/hmm/F21-gene-test.txt'
with open(filepath, "r", encoding="utf8", newline="\n") as file:
  lines = file.readlines()
test_data_f21 = []
sent = []
new_line_counter = 1
print(len(lines))
for i,line in enumerate(lines):
  if line != "\n":
    this_line = line.split("\t")

    #print(this_line)
    
    sent.append(this_line[1].strip())
  else:
    new_line_counter += 1
    test_data_f21.append(sent)
    sent = []

15228


In [59]:
len(test_data_f21),new_line_counter

(508, 509)

In [60]:
preds_new = []
full_preds = []
max_e_rare, max_label_rare = get_rare_word_counts(d)
for sent in test_data_f21:
  this_sent = []
  #test_sent = []
  counter = 1
  for w in sent:

    
    max_e = 0.0
    max_label = ""
    if (w,'I') in d:
      if d[(w, 'I')] > max_e:
        max_e = d[(w,'I')]
        max_label = 'I'
    if (w,'B') in d:
      if d[(w, 'B')] > max_e:
        max_e = d[(w,'B')]
        max_label = 'B'
    if (w,'O') in d:
      if d[(w, 'O')] > max_e:
        max_e = d[(w,'O')]
        max_label = 'O'

    if max_e == 0.0 and max_label == "":
      max_e = max_e_rare
      max_label = max_label_rare
    
    #print(max_e, max_e_rare, w, d)
    full_preds.append((w, max_label, math.log(max_e, 2)))
    this_sent.append('\t'.join([str(counter), w, max_label,'\n']) )
    
    #test_sent.append('\t'.join([str(counter), w, tl,'\n']) )
    
    counter += 1
  this_sent.append( '\n')
  #test_sent.append( '\n')
  preds_new.append(this_sent)
  #test_data.append(test_sent)

In [61]:
preds_new

[['1\tBACKGROUND\tO\t\n',
  '2\t:\tO\t\n',
  '3\tIschemic\tO\t\n',
  '4\theart\tO\t\n',
  '5\tdisease\tO\t\n',
  '6\tis\tO\t\n',
  '7\tthe\tO\t\n',
  '8\tprimary\tO\t\n',
  '9\tcause\tO\t\n',
  '10\tof\tO\t\n',
  '11\tmorbidity\tO\t\n',
  '12\tand\tO\t\n',
  '13\tmortality\tO\t\n',
  '14\tamong\tO\t\n',
  '15\tdiabetics\tO\t\n',
  '16\t,\tO\t\n',
  '17\tespecially\tO\t\n',
  '18\tthose\tO\t\n',
  '19\twho\tO\t\n',
  '20\tbecame\tO\t\n',
  '21\till\tO\t\n',
  '22\tat\tO\t\n',
  '23\ta\tO\t\n',
  '24\tyoung\tO\t\n',
  '25\tage\tO\t\n',
  '26\t.\tO\t\n',
  '\n'],
 ['1\tMore\tO\t\n',
  '2\timportantly\tO\t\n',
  '3\t,\tO\t\n',
  '4\tthis\tO\t\n',
  '5\tfusion\tI\t\n',
  '6\tconverted\tO\t\n',
  '7\ta\tO\t\n',
  '8\tless\tO\t\n',
  '9\teffective\tO\t\n',
  '10\tvaccine\tO\t\n',
  '11\tinto\tO\t\n',
  '12\tone\tO\t\n',
  '13\twith\tO\t\n',
  '14\tsignificant\tO\t\n',
  '15\tpotency\tO\t\n',
  '16\tagainst\tO\t\n',
  '17\testablished\tO\t\n',
  '18\tE7\tB\t\n',
  '19\t-\tI\t\n',
  '20\texpres

In [62]:
with open("preds_hmm_f21.csv", 'w') as f:
  for s in preds_new:
    print("".join(s))
    f.write("".join(s))
f

1	BACKGROUND	O	
2	:	O	
3	Ischemic	O	
4	heart	O	
5	disease	O	
6	is	O	
7	the	O	
8	primary	O	
9	cause	O	
10	of	O	
11	morbidity	O	
12	and	O	
13	mortality	O	
14	among	O	
15	diabetics	O	
16	,	O	
17	especially	O	
18	those	O	
19	who	O	
20	became	O	
21	ill	O	
22	at	O	
23	a	O	
24	young	O	
25	age	O	
26	.	O	


1	More	O	
2	importantly	O	
3	,	O	
4	this	O	
5	fusion	I	
6	converted	O	
7	a	O	
8	less	O	
9	effective	O	
10	vaccine	O	
11	into	O	
12	one	O	
13	with	O	
14	significant	O	
15	potency	O	
16	against	O	
17	established	O	
18	E7	B	
19	-	I	
20	expressing	O	
21	metastatic	O	
22	tumors	O	
23	.	O	


1	Reverse	B	
2	transcription	I	
3	-	I	
4	PCR	O	
5	analysis	O	
6	of	O	
7	mRNA	I	
8	from	O	
9	patients	O	
10	shows	O	
11	that	O	
12	each	O	
13	of	O	
14	these	O	
15	five	O	
16	mutations	I	
17	results	O	
18	in	O	
19	aberrant	O	
20	splicing	O	
21	.	O	


1	Using	O	
2	the	O	
3	postural	O	
4	and	O	
5	force	O	
6	data	O	
7	as	O	
8	input	O	
9	to	O	
10	a	O	
11	3	I	
12	-	I	
13	D	I	
14	biomechanical	O	
15	model	O	
16	,	O	
1

<_io.TextIOWrapper name='preds_hmm_f21.csv' mode='w' encoding='UTF-8'>

#### The results and analysis
The results were not satisfactory, which motivated attempting other approaches to the NER task.