# Named Entity Recognition
Named entity recognition (NER) is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing.

- [GeeksforGeeks: Named Entity Recognition](https://www.geeksforgeeks.org/named-entity-recognition/)


## With TensorFlow
- [NER with TensorFlow (Kaggle notebook)](https://www.kaggle.com/code/naseralqaydeh/named-entity-recognition-ner-with-tensorflow)
- [Using CRF](https://www.kaggle.com/code/bavalpreet26/ner-using-crf)
- [BERT Model for NER](https://www.kaggle.com/code/abhishek/entity-extraction-model-using-bert-pytorch)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

In [2]:
path = "./resources/ner/ner_dataset.csv"
data = pd.read_csv(path, encoding="unicode_escape")

# Forward-fill missing values; fillna(method="ffill") is deprecated in recent pandas
data.ffill(inplace=True)
data.head(5)


Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O


In [3]:
ready_dist_path = "./resources/ner/ner_corpus.csv"

ready_data = pd.read_csv(ready_dist_path)
ready_data.head(5)

Unnamed: 0,Sentence #,Sentence,POS,Tag
0,Sentence: 1,Thousands of demonstrators have marched throug...,"['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...","['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
4,Sentence: 5,The protest comes on the eve of the annual con...,"['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."


In [4]:
data.shape

(1048575, 4)

In [5]:
# Unique Sentences
print("Unique Sentences: ", data['Sentence #'].nunique())

Unique Sentences:  47959


In [6]:
# Unique Words and Tags
print("Unique Words: ", data['Word'].nunique())
print("Unique Tags: ", data['Tag'].nunique())

Unique Words:  35177
Unique Tags:  17


In [7]:
tags = data['Tag'].unique()
tags = list(tags)
tags

['O',
 'B-geo',
 'B-gpe',
 'B-per',
 'I-geo',
 'B-org',
 'I-org',
 'B-tim',
 'B-art',
 'I-art',
 'I-per',
 'I-gpe',
 'I-tim',
 'B-nat',
 'B-eve',
 'I-eve',
 'I-nat']
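The tags above follow the BIO scheme: `B-` marks the first token of an entity, `I-` a continuation of the same entity, and `O` a token outside any entity. As a minimal sketch of how such tags group into entities, the hypothetical helper `bio_to_spans` below (not part of the notebook) collects consecutive `B-`/`I-` tokens into `(text, type)` pairs, using the third sentence of the corpus as input.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # Start a new entity; an I- tag with no matching B- is treated as a start.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = "They marched from the Houses of Parliament to a rally in Hyde Park .".split()
tags = ["O"] * 11 + ["B-geo", "I-geo", "O"]
print(bio_to_spans(tokens, tags))  # [('Hyde Park', 'geo')]
```

The same helper can be applied to a predicted tag list to turn per-token output into readable entities.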


In [9]:
X = list(ready_data['Sentence'])
Y = list(ready_data['Tag'])

In [10]:
X[:3]

['Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .',
 'Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings . "',
 'They marched from the Houses of Parliament to a rally in Hyde Park .']

In [11]:
from ast import literal_eval

Y_ready = []

for sen_tags in Y:
    Y_ready.append(literal_eval(sen_tags))

In [12]:
Y_ready[:3]

[['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-geo',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-geo',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-gpe',
  'O',
  'O',
  'O',
  'O',
  'O'],
 ['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-per',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O'],
 ['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-geo',
  'I-geo',
  'O']]

In [13]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [14]:
maxlen = 110
max_words = 36000

# Tokenizing the words
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
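One thing worth checking at this step: with its default settings, the Keras `Tokenizer` lowercases text and strips the punctuation characters listed in its `filters` argument, while the tag lists contain one tag per whitespace-separated token (including tokens such as `.`). The sketch below approximates that default splitting in pure Python to count sentences whose token count no longer matches their tag count. The helpers `keras_like_split` and `count_misaligned` are illustrative assumptions, not Keras functions.

```python
# Default value of the Tokenizer `filters` argument, as given in the Keras documentation.
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def keras_like_split(text, filters=KERAS_DEFAULT_FILTERS):
    """Approximate the default splitting: lowercase, turn filter chars into spaces, split."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()


def count_misaligned(sentences, tag_lists, split_fn):
    """Count sentences whose number of tokens differs from their number of tags."""
    return sum(len(split_fn(s)) != len(t) for s, t in zip(sentences, tag_lists))


sentence = "They marched from the Houses of Parliament to a rally in Hyde Park ."
sentence_tags = ["O"] * 11 + ["B-geo", "I-geo", "O"]

print(len(sentence.split()), len(keras_like_split(sentence)), len(sentence_tags))
print(count_misaligned([sentence], [sentence_tags], str.split))
print(count_misaligned([sentence], [sentence_tags], keras_like_split))
```

If the counts differ on the real data, one option is to construct the tokenizer with `filters=''` so that punctuation tokens are kept and each token lines up with its tag.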

In [15]:
word_index = tokenizer.word_index
print("Found %s unique tokens." % len(word_index))

ind2word = dict([(value, key) for (key, value) in word_index.items()])

Found 27953 unique tokens.


In [16]:
word2ind = word_index

In [17]:
id2word = {}

for key, value in word2ind.items():
    id2word[value] = key

In [18]:
# Padding
X_preprocessed = pad_sequences(sequences, maxlen=maxlen, padding='post')

### Preprocessing Tags

In [19]:
# Assign unique identifiers for each tag and pad the tag list

tag2ind = {}

for i, tag in enumerate(tags):
    tag2ind[tag] = i

In [20]:
tag2ind

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'B-art': 8,
 'I-art': 9,
 'I-per': 10,
 'I-gpe': 11,
 'I-tim': 12,
 'B-nat': 13,
 'B-eve': 14,
 'I-eve': 15,
 'I-nat': 16}

In [21]:
# Mapping identifier to the tag
id2tag = {}

for key, value in tag2ind.items():
    id2tag[value] = key

In [22]:
def preprocess_tags(tags2id, Y_ready):
    
    Y_preprocessed = []
    maxlen = 110
    # for each sentence's tag list
    for y in Y_ready:
        
        # placeholder to store the new preprocessed tag list
        Y_place_holder = []
        
        # for each tag in the tag list
        for tag in y:
            # append the id of the tag to the placeholder list
            Y_place_holder.append(tags2id[tag])
        
        # find the length of the new preprocessed tag list
        len_new_tag_list = len(Y_place_holder)
        # find the difference in length between the tag list and the padded sentences
        num_O_to_add = maxlen - len_new_tag_list
        
        # add 'O' ids to pad the tag list up to maxlen
        padded_tags = Y_place_holder + ([tags2id['O']] * num_O_to_add)
        Y_preprocessed.append(padded_tags)
        
    return Y_preprocessed

In [23]:
Y_preprocessed = preprocess_tags(tag2ind, Y_ready)

In [24]:
X_preprocessed.shape

(47959, 110)

In [25]:
X_preprocessed = np.asarray(X_preprocessed)
Y_preprocessed = np.asarray(Y_preprocessed)

In [26]:
# 70% of the data will be used for training
training_samples = 0.7
# 15% of the data will be used for validation
validation_samples = 0.15
# 15% of the data will be used for testing
testing_samples = 0.15

In [27]:
indices = np.arange(len(Y_preprocessed))

In [28]:
np.random.seed(seed=666)
np.random.shuffle(indices)

In [29]:
X_preprocessed = X_preprocessed[indices]
Y_preprocessed = Y_preprocessed[indices]

In [30]:
# Split boundaries: 70% train, then 15% (+1 example) validation, remainder test
train_end = int(0.7 * len(X_preprocessed))
val_end = train_end + int(0.15 * len(X_preprocessed)) + 1

X_train, Y_train = X_preprocessed[:train_end], Y_preprocessed[:train_end]
print("Number of training examples: {}".format(len(X_train)))

X_val, Y_val = X_preprocessed[train_end:val_end], Y_preprocessed[train_end:val_end]
print("Number of validation examples: {}".format(len(X_val)))

X_test, Y_test = X_preprocessed[val_end:], Y_preprocessed[val_end:]
print("Number of testing examples: {}".format(len(X_test)))

print("Total number of examples after shuffling and splitting: {}".format(len(X_train) + len(X_val) + len(X_test)))

Number of training examples: 33571
Number of validation examples: 7194
Number of testing examples: 7194
Total number of examples after shuffling and splitting: 47959
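The shuffle-then-slice steps above can also be wrapped in one function. The sketch below is an alternative, not the notebook's code: `shuffled_split` is a hypothetical helper that applies one permutation to both arrays and cuts them with `np.split`. It uses `np.random.default_rng`, so the ordering (and the exact validation size, which lacks the `+ 1`) will differ from the cells above.

```python
import numpy as np


def shuffled_split(X, Y, train_frac=0.7, val_frac=0.15, seed=666):
    """Shuffle X and Y with the same permutation, then cut into train/val/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    train_end = int(train_frac * len(X))
    val_end = train_end + int(val_frac * len(X))
    X_parts = np.split(X[order], [train_end, val_end])
    Y_parts = np.split(Y[order], [train_end, val_end])
    # Returns [(X_train, Y_train), (X_val, Y_val), (X_test, Y_test)]
    return list(zip(X_parts, Y_parts))


# Toy data: row i of X_toy is [2*i, 2*i + 1], and Y_toy[i] == i
X_toy = np.arange(20).reshape(10, 2)
Y_toy = np.arange(10)

(X_tr, Y_tr), (X_va, Y_va), (X_te, Y_te) = shuffled_split(X_toy, Y_toy)
print(len(X_tr), len(X_va), len(X_te))
```

Using a single permutation for both arrays is what keeps each sentence paired with its own tag list after shuffling.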


In [31]:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
val_dataset = tf.data.Dataset.from_tensor_slices((X_val, Y_val))
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, Y_test))

In [32]:
BATCH_SIZE = 132
SHUFFLE_BUFFER_SIZE = 132

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
val_dataset = val_dataset.batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

In [33]:
embedding_dim = 300
maxlen = 110
num_tags = len(tags)
max_words = 36000

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_words, embedding_dim, input_length=maxlen),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=100, activation = 'tanh', return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=100, activation = 'tanh', return_sequences=True)),
    # tf.keras.layers.Dropout(0.5),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_tags, activation='softmax'))
])



In [34]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [37]:
history = model.fit(train_dataset, validation_data=val_dataset, epochs=1)

255/255 ━━━━━━━━━━━━━━━━━━━━ 146s 571ms/step - accuracy: 0.9721 - loss: 0.1018 - val_accuracy: 0.9767 - val_loss: 0.0762


In [38]:
model.summary()

In [39]:
model.evaluate(test_dataset)

55/55 ━━━━━━━━━━━━━━━━━━━━ 11s 193ms/step - accuracy: 0.9764 - loss: 0.0772


[0.07576122134923935, 0.9768720865249634]
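The accuracy reported here is computed over all 110 positions of every sequence, including the padded positions whose target is always `O`, so it may overstate how well real tokens are tagged. A minimal sketch of an accuracy restricted to non-padding positions follows; `masked_token_accuracy` is a hypothetical helper, and it assumes token id `0` marks padding, as produced by `pad_sequences`.

```python
import numpy as np


def masked_token_accuracy(y_true, y_pred, token_ids, pad_id=0):
    """Accuracy over real tokens only; positions where token_ids == pad_id are ignored."""
    mask = token_ids != pad_id
    return float((y_true[mask] == y_pred[mask]).mean())


# Toy batch: one sentence of three real tokens followed by three padding positions
token_ids = np.array([[5, 8, 2, 0, 0, 0]])
y_true = np.array([[1, 0, 4, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0]])

plain = float((y_true == y_pred).mean())
masked = masked_token_accuracy(y_true, y_pred, token_ids)
print(plain, masked)  # 5 of 6 positions correct vs. 2 of 3 real tokens correct
```

On the notebook's data, this could be called as `masked_token_accuracy(Y_test, np.argmax(model.predict(X_test), axis=-1), X_test)`.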

In [40]:
def make_prediction(model, preprocessed_sentence, id2word, id2tag):
    
    # ensure a batch dimension: the model expects input of shape (1, 110)
    preprocessed_sentence = preprocessed_sentence.reshape((1, 110))
     
    # recover the original sentence from the non-padding token ids
    sentence = preprocessed_sentence[preprocessed_sentence > 0]
    word_list = []
    for word in list(sentence):
        word_list.append(id2word[word])
    orginal_sententce = ' '.join(word_list)
    
    len_orginal_sententce = len(word_list)
    
    # make prediction
    prediction = model.predict(preprocessed_sentence)
    prediction = np.argmax(prediction[0], axis=1)
    
    # trim the prediction to the length of the original sentence
    prediction = list(prediction)[ : len_orginal_sententce] 
    
    pred_tag_list = []
    for tag_id in prediction:
        pred_tag_list.append(id2tag[tag_id])
    
    return orginal_sententce,  pred_tag_list

In [41]:
orginal_sententce,  pred_tag_list = make_prediction(model=model,
                                                    preprocessed_sentence=X_test[520],
                                                    id2word=id2word,
                                                    id2tag=id2tag)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 497ms/step


In [42]:
orginal_sententce

"india and burma have signed several new agreements to build stronger economic and defense ties during a visit to india by burma 's reclusive military ruler"

In [43]:
pred_tag_list

['B-geo',
 'O',
 'B-geo',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-geo',
 'O',
 'O',
 'O',
 'O']

In [47]:
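# Predicting on a custom sentence

sentence = "india and pakistan are two countries in south asia"
# Reuse the fitted tokenizer so the text is processed as in training;
# out-of-vocabulary words are dropped instead of raising a KeyError
sentence = tokenizer.texts_to_sequences([sentence])
sentence = pad_sequences(sentence, maxlen=110, padding='post')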

orginal_sententce,  pred_tag_list = make_prediction(model=model,
                                                    preprocessed_sentence=sentence,
                                                    id2word=id2word,
                                                    id2tag=id2tag)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step


In [49]:
orginal_sententce

'india and pakistan are two countries in south asia'

In [50]:
pred_tag_list

['B-geo', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'B-geo', 'B-geo']

## HMMLearn Module

In [51]:
%pip install seqeval
%pip install hmmlearn

from hmmlearn import hmm
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import numpy as np 
from sklearn.model_selection import train_test_split
import spacy

nlp = spacy.load("en_core_web_sm")


In [52]:
# Same dataset as in the TensorFlow section, loaded via the relative path used there
data = pd.read_csv("./resources/ner/ner_dataset.csv", encoding="unicode_escape")

In [53]:
# Copy the slice so the assignments below modify an independent DataFrame
# (this avoids pandas' SettingWithCopyWarning)
preprocessed_data = data[['Word', 'Tag']].copy()

word_encoder = LabelEncoder()
pos_encoder = LabelEncoder()

preprocessed_data['Word'] = word_encoder.fit_transform(preprocessed_data['Word'])
# Note: despite its name, the 'POS' column holds the integer-encoded NER tag
preprocessed_data['POS'] = pos_encoder.fit_transform(preprocessed_data['Tag'])


In [54]:
num_states = 5
num_features = 2  

model = hmm.MultinomialHMM(n_components=num_states, n_iter=100)

MultinomialHMM has undergone major changes. The previous version was implementing a CategoricalHMM (a special case of MultinomialHMM). This new implementation follows the standard definition for a Multinomial distribution (e.g. as in https://en.wikipedia.org/wiki/Multinomial_distribution). See these issues for details:
https://github.com/hmmlearn/hmmlearn/issues/335
https://github.com/hmmlearn/hmmlearn/issues/340


In [55]:
X = preprocessed_data[['Word', 'POS']]
lengths = [len(preprocessed_data)]

model.fit(X, lengths)

In [56]:
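# Note: the HMM is fit without labels, so its hidden states (0 to 4) are arbitrary
# cluster indices; inverse_transform only maps them onto the first five tag names
predicted_labels = model.predict(X)
predicted_tags = pos_encoder.inverse_transform(predicted_labels)
preprocessed_data['Predicted_Tag'] = predicted_tags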

In [57]:
print(preprocessed_data.head())

    Word Tag  POS Predicted_Tag
0  15076   O   16         B-art
1  27699   O   16         B-art
2  20968   O   16         B-art
3  24217   O   16         B-art
4  26433   O   16         B-art


In [58]:
transition_probs = model.transmat_ 
emission_probs = model.emissionprob_  
initial_probs = model.startprob_  

In [59]:
doc = nlp("my name is aniruth")
named_entities = [(ent.text, ent.label_) for ent in doc.ents]

In [60]:
named_entities

[]