<a href="https://colab.research.google.com/github/rahulbhoyar1995/NER-Case-Study/blob/main/ner_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author : Rahul Bhoyar

### Named Entity Recognition (NER)

Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories like "Person" (PER), "Location" (GEO), "Organization" (ORG), etc.

### The Algorithm: BiLSTM for NER
In this example, we use a Bidirectional Long Short-Term Memory (BiLSTM) network for NER. Let's understand the key concepts.

#### 1. Long Short-Term Memory (LSTM)
LSTM: A type of Recurrent Neural Network (RNN) designed to retain information across long sequences. Plain RNNs struggle to learn long-range dependencies because gradients tend to vanish over many time steps; LSTMs mitigate this with gated memory cells, which makes them effective for sequence prediction tasks.
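For reference, one common formulation of a single LSTM step is sketched below. The notation is introduced here and does not come from the notebook's code: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ element-wise multiplication, and $W$, $U$, $b$ the learned parameters of each gate.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

When the forget gate stays close to 1, the cell state $c_t$ is carried forward almost unchanged, which is how information can survive across many time steps.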

#### 2. Bidirectional LSTM (BiLSTM)
Bidirectional: In a BiLSTM, two LSTMs run over the same sequence, one from the start to the end (forward direction) and the other from the end to the start (backward direction). Their outputs are combined at each time step, so the model has both past and future context for every word, which helps when deciding that word's tag.
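The idea can be illustrated with a toy recurrence in plain Python. This is only a sketch: `run_rnn` is a hypothetical stand-in for an LSTM (a simple decaying sum, not a real gated cell), and all names are made up for illustration.

```python
def run_rnn(inputs, decay=0.5):
    """Toy recurrent pass: each state mixes the previous state with the current input."""
    state, states = 0.0, []
    for x in inputs:
        state = decay * state + x
        states.append(state)
    return states


def run_bidirectional(inputs):
    """Pair a forward pass with a backward pass at every position."""
    forward = run_rnn(inputs)
    # Run over the reversed sequence, then flip back so position t lines up
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))


toy_outputs = run_bidirectional([1.0, 2.0, 4.0])
print(toy_outputs)  # [(1.0, 3.0), (2.5, 4.0), (5.25, 4.0)]
```

Each position ends up with one value summarizing what came before it and one summarizing what comes after it. In the model built later, each direction has 100 units, so the concatenated output has 200 features per token.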


### The Process: Training a BiLSTM Model for NER

**(A) Data Preprocessing**

(1) Tokenization:

Splitting text into individual words.


(2) Mapping to Indices:

Converting words and tags into numerical indices that the model can understand.

(3) Padding:

Ensuring all sentences have the same length by adding "padding" tokens to shorter sentences and truncating longer ones.
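The three preprocessing steps above can be sketched on a toy example in plain Python. The helper `encode_and_pad` and the toy sentences are hypothetical; the notebook itself uses Keras `pad_sequences` for this. One difference worth noting: this sketch truncates long sentences by keeping the first `max_len` tokens, whereas `pad_sequences` by default (`truncating="pre"`) drops tokens from the start.

```python
PAD = "ENDPAD"


def encode_and_pad(sentences, max_len):
    """Tokenize on whitespace, map tokens to indices, and pad/truncate to max_len."""
    # Build a vocabulary from the tokens, reserving the last index for padding
    vocab = sorted({w for s in sentences for w in s.split()})
    word2idx = {w: i for i, w in enumerate(vocab)}
    word2idx[PAD] = len(vocab)

    encoded = []
    for s in sentences:
        ids = [word2idx[w] for w in s.split()][:max_len]  # truncate long sentences
        ids += [word2idx[PAD]] * (max_len - len(ids))     # pad short ones
        encoded.append(ids)
    return word2idx, encoded


toy_word2idx, toy_encoded = encode_and_pad(["London is big", "Iraq"], max_len=4)
print(toy_word2idx)  # {'Iraq': 0, 'London': 1, 'big': 2, 'is': 3, 'ENDPAD': 4}
print(toy_encoded)   # [[1, 3, 2, 4], [0, 4, 4, 4]]
```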

**(B) Model Building**

(1) Embedding Layer:

Converts each word into a dense vector of fixed size. These vectors capture semantic information about the words.

(2) BiLSTM Layer:

Processes the input sequences in both forward and backward directions.

(3) TimeDistributed Layer:

Applies a dense layer to each time step (word) independently, predicting the tag for each word.
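To make the last step concrete, here is a minimal plain-Python sketch of what "applying a dense layer to each time step" means: the same weights and bias are reused at every position, followed by a softmax over the tags. The function names and the toy weights are hypothetical and chosen by hand, not learned.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_distributed_dense(sequence, weights, bias):
    """Apply the same dense layer + softmax to every time step independently."""
    outputs = []
    for features in sequence:
        scores = [sum(f * w for f, w in zip(features, row)) + b
                  for row, b in zip(weights, bias)]
        outputs.append(softmax(scores))
    return outputs


# Hypothetical toy values: 3 time steps, 2 features per step, 3 tags
toy_sequence = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
toy_weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # one weight row per tag
toy_bias = [0.0, 0.0, 1.0]

toy_probs = time_distributed_dense(toy_sequence, toy_weights, toy_bias)
toy_tag_ids = [p.index(max(p)) for p in toy_probs]
print(toy_tag_ids)  # [0, 1, 2]
```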

**(C) Model Training**

(1) Compilation:

Setting up the model with an optimizer (e.g., Adam), loss function (e.g., categorical crossentropy), and evaluation metric (e.g., accuracy).


(2) Training: Fitting the model to the training data, adjusting weights to minimize the loss.
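As a rough illustration of the loss being minimized, the sketch below computes categorical crossentropy for a single token in plain Python. The function and the example probabilities are hypothetical; the framework computes the same quantity and averages it over the tokens in each batch.

```python
import math


def categorical_crossentropy(one_hot, probs):
    """Loss for one token: negative log-probability assigned to the true tag."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


true_tag = [0, 1, 0]  # the second of three tags is correct
confident = categorical_crossentropy(true_tag, [0.05, 0.9, 0.05])
unsure = categorical_crossentropy(true_tag, [0.4, 0.2, 0.4])

print(round(confident, 4))  # 0.1054
print(round(unsure, 4))     # 1.6094
```

The loss is small when the model puts most of its probability on the correct tag and grows as that probability shrinks, which is what pushes the weights toward better predictions.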


**(D) Prediction and Evaluation**

(1) Prediction: Using the trained model to predict tags for new sentences.

(2) Evaluation: Assessing the model’s performance on a test dataset.


### The Code

Here's the full code with explanations.


#### (A) Data Preprocessing

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [3]:
ner_data = pd.read_csv("ner_dataset.csv",  encoding='latin1')
ner_data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


Understanding the dataframe.

In [4]:
ner_data.shape

(1048575, 4)

### Divide the Dataset:

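Group the rows into sentences, then split the dataset into training, validation, and test sets. The code below holds out 10% of the sentences for testing (`test_size=0.1`) and a further 10% of the training portion for validation (`validation_split=0.1`); set `test_size=0.2` if a test set of at least 20% is required.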

In [17]:
class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                     s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

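In [None]:
# Only the first row of each sentence carries a "Sentence #" label;
# forward-fill missing values so every row is labelled before grouping
ner_data = ner_data.ffill()
getter = SentenceGetter(ner_data)
sentences = getter.sentences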
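The grouping step relies on every row knowing which sentence it belongs to, yet the `head()` output above shows the "Sentence #" column filled only on the first row of each sentence. The plain-Python sketch below illustrates the forward-fill-then-group idea on hypothetical toy rows; `group_sentences` is an illustrative helper, not part of the notebook.

```python
def group_sentences(rows):
    """rows: (sentence_label_or_None, word, tag) tuples in file order."""
    grouped, current = {}, None
    for label, word, tag in rows:
        if label is not None:  # a new sentence starts here
            current = label
        # Rows without a label inherit the most recent one (forward fill)
        grouped.setdefault(current, []).append((word, tag))
    return list(grouped.values())


toy_rows = [
    ("Sentence: 1", "London", "B-geo"),
    (None, "calling", "O"),
    ("Sentence: 2", "Iraq", "B-geo"),
]
toy_sentences = group_sentences(toy_rows)
print(toy_sentences)
# [[('London', 'B-geo'), ('calling', 'O')], [('Iraq', 'B-geo')]]
```

Unlike this sketch, which keeps file order, pandas `groupby` sorts the group keys by default, so string labels such as "Sentence: 10" come before "Sentence: 2" in the resulting list.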

In [38]:
sentences

[[('Thousands', 'O'),
  ('of', 'O'),
  ('demonstrators', 'O'),
  ('have', 'O'),
  ('marched', 'O'),
  ('through', 'O'),
  ('London', 'B-geo'),
  ('to', 'O'),
  ('protest', 'O'),
  ('the', 'O'),
  ('war', 'O'),
  ('in', 'O'),
  ('Iraq', 'B-geo'),
  ('and', 'O'),
  ('demand', 'O'),
  ('the', 'O'),
  ('withdrawal', 'O'),
  ('of', 'O'),
  ('British', 'B-gpe'),
  ('troops', 'O'),
  ('from', 'O'),
  ('that', 'O'),
  ('country', 'O'),
  ('.', 'O')],
 [('Iranian', 'B-gpe'),
  ('officials', 'O'),
  ('say', 'O'),
  ('they', 'O'),
  ('expect', 'O'),
  ('to', 'O'),
  ('get', 'O'),
  ('access', 'O'),
  ('to', 'O'),
  ('sealed', 'O'),
  ('sensitive', 'O'),
  ('parts', 'O'),
  ('of', 'O'),
  ('the', 'O'),
  ('plant', 'O'),
  ('Wednesday', 'B-tim'),
  (',', 'O'),
  ('after', 'O'),
  ('an', 'O'),
  ('IAEA', 'B-org'),
  ('surveillance', 'O'),
  ('system', 'O'),
  ('begins', 'O'),
  ('functioning', 'O'),
  ('.', 'O')],
 [('Helicopter', 'O'),
  ('gunships', 'O'),
  ('Saturday', 'B-tim'),
  ('pounded', 'O'

In [21]:
# Extract unique words and tags

In [40]:
words = list(set(ner_data["Word"].values))
words.append("ENDPAD")
len(words)

35179

In [41]:
tags = list(set(ner_data["Tag"].values))
len(tags)

17

In [43]:
# Dictionary mapping words and tags to indices


In [44]:
word2idx = {w: i for i, w in enumerate(words)}
word2idx

{'Baiji': 0,
 'Pelosi': 1,
 'Carreno': 2,
 'tissue': 3,
 '34-day': 4,
 'holocaust': 5,
 'Takatoshi': 6,
 'Saengprathum': 7,
 'Kissem': 8,
 'Shahar': 9,
 'snow-shortened': 10,
 'sexy': 11,
 'therapy': 12,
 'energy-saving': 13,
 'boyfriend': 14,
 'Plata': 15,
 'distinguish': 16,
 'deepwater': 17,
 'Chee-hwa': 18,
 'Rivalry': 19,
 'fright': 20,
 'paces': 21,
 'Tula': 22,
 'Dynamics': 23,
 'weather-related': 24,
 'Maharashtra': 25,
 'disinformation': 26,
 'engaging': 27,
 'sector': 28,
 'accordance': 29,
 'stove': 30,
 'neediest': 31,
 'Inarritu': 32,
 'legitimize': 33,
 'borrower': 34,
 'villas': 35,
 'Alliance': 36,
 'pace': 37,
 'Tarantino': 38,
 'courtroom': 39,
 'advised': 40,
 'krill': 41,
 'movement': 42,
 'Dissel': 43,
 'traumatic': 44,
 'Bodman': 45,
 'technicians': 46,
 'contenders': 47,
 'sunk': 48,
 'semblance': 49,
 '153.3': 50,
 'Although': 51,
 'Sajedinia': 52,
 'cracks': 53,
 'Donovan': 54,
 'Tabriz': 55,
 'drowned': 56,
 'refusals': 57,
 'kicker': 58,
 'Foy': 59,
 'Kumarat

In [45]:
tag2idx = {t: i for i, t in enumerate(tags)}
tag2idx

{'B-per': 0,
 'B-geo': 1,
 'B-org': 2,
 'B-tim': 3,
 'I-per': 4,
 'I-eve': 5,
 'B-eve': 6,
 'B-nat': 7,
 'I-nat': 8,
 'I-org': 9,
 'I-gpe': 10,
 'B-art': 11,
 'I-art': 12,
 'O': 13,
 'B-gpe': 14,
 'I-tim': 15,
 'I-geo': 16}

In [46]:
# Prepare data for the model
max_len = 50

In [55]:
X = [[word2idx[w[0]] for w in s] for s in sentences]
len(X)


47959

In [59]:
X[0:2]

[[25535,
  10470,
  29503,
  9656,
  5998,
  20545,
  32494,
  12958,
  2089,
  28395,
  5376,
  7799,
  29990,
  1565,
  27699,
  28395,
  5965,
  10470,
  25923,
  9980,
  20226,
  5131,
  5825,
  6558],
 [24053,
  18805,
  2527,
  9854,
  31295,
  12958,
  24730,
  19530,
  12958,
  31063,
  16557,
  33232,
  10470,
  28395,
  11249,
  16172,
  34689,
  20168,
  32984,
  15014,
  28960,
  29985,
  34812,
  5185,
  6558]]

In [48]:
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=word2idx["ENDPAD"])
X

array([[25535, 10470, 29503, ..., 35178, 35178, 35178],
       [24053, 18805,  2527, ..., 35178, 35178, 35178],
       [ 6969, 27815, 18981, ..., 35178, 35178, 35178],
       ...,
       [ 4621,   849,  4652, ..., 35178, 35178, 35178],
       [28705, 17591, 34689, ..., 35178, 35178, 35178],
       [24847, 29394, 26221, ..., 35178, 35178, 35178]], dtype=int32)

In [49]:
X.shape

(47959, 50)

In [56]:
y = [[tag2idx[w[1]] for w in s] for s in sentences]
len(y)

47959

In [51]:
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y

array([[13, 13, 13, ..., 13, 13, 13],
       [14, 13, 13, ..., 13, 13, 13],
       [13, 13,  3, ..., 13, 13, 13],
       ...,
       [13,  1, 13, ..., 13, 13, 13],
       [13, 13, 13, ..., 13, 13, 13],
       [13,  2,  9, ..., 13, 13, 13]], dtype=int32)

In [60]:
y = [to_categorical(i, num_classes=len(tags)) for i in y]
len(y)

47959

In [26]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

Loading Data:

Reads the CSV file into a DataFrame and fills missing values.

SentenceGetter:

Groups words and tags by sentence.


Mapping to Indices:

Creates dictionaries to map words and tags to numerical indices.

Padding and Encoding:

Converts sentences to fixed-length sequences of indices and encodes tags as one-hot vectors.

Splitting Data:

Splits the dataset into training and test sets.
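One-hot encoding of tags, as done by `to_categorical` above, can be sketched in plain Python as follows. The helper `one_hot` and the three-tag toy mapping are hypothetical and only meant to show the shape of the result.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


toy_tag2idx = {"O": 0, "B-geo": 1, "I-geo": 2}
toy_tag_ids = [toy_tag2idx[t] for t in ["O", "B-geo", "I-geo", "O"]]
toy_y = [one_hot(i, len(toy_tag2idx)) for i in toy_tag_ids]

print(toy_y)
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

In the notebook, each sentence becomes a 50 x 17 matrix this way: one row per padded position and one column per tag.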

#### (B) Model Building

In [27]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional

# Define the model
model = Sequential([
    Embedding(input_dim=len(words), output_dim=50, input_length=max_len),
    Dropout(0.1),
    Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)),
    TimeDistributed(Dense(len(tags), activation="softmax"))
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 50)            1758950   
                                                                 
 dropout (Dropout)           (None, 50, 50)            0         
                                                                 
 bidirectional (Bidirection  (None, 50, 200)           120800    
 al)                                                             
                                                                 
 time_distributed (TimeDist  (None, 50, 17)            3417      
 ributed)                                                        
                                                                 
Total params: 1883167 (7.18 MB)
Trainable params: 1883167 (7.18 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Embedding Layer:

Converts words to dense vectors.


BiLSTM Layer:

Processes sequences in both forward and backward directions.

TimeDistributed Layer:

Applies a dense layer to each word to predict its tag.

Compilation:

Sets up the optimizer, loss function, and metrics.
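The parameter counts in the model summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The arithmetic below uses the sizes from this notebook (35179 words, 50-dimensional embeddings, 100 LSTM units per direction, 17 tags); the variable names are just for illustration.

```python
vocab_size, embed_dim = 35179, 50  # len(words), Embedding output_dim
lstm_units, num_tags = 100, 17     # units per direction, len(tags)

# Embedding: one vector of size embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# Each LSTM has 4 gates; each gate has input weights, recurrent weights and a bias
one_lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bilstm_params = 2 * one_lstm_params  # forward + backward

# Dense layer on the concatenated (2 * lstm_units) BiLSTM output
dense_params = (2 * lstm_units) * num_tags + num_tags

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# 1758950 120800 3417 1883167
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.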

#### (C) Training the Model

This step will take some time.

In [28]:
history = model.fit(X_train, np.array(y_train), batch_size=32, epochs=5, validation_split=0.1, verbose=1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Training:

Fits the model to the training data, using a batch size of 32 and training for 5 epochs.

In [29]:
# Evaluate the model
model.evaluate(X_test, np.array(y_test))



[0.046321723610162735, 0.9860008358955383]
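The two numbers are the test loss and the test accuracy. Because every sentence is padded to 50 positions and the padded positions are labelled `O`, this accuracy also counts padding, which is easy to predict. The sketch below shows, on a hypothetical toy batch, how token accuracy could be computed over real tokens only; `token_accuracy` is an illustrative helper and would need the original sentence lengths.

```python
def token_accuracy(true_ids, pred_ids, lengths):
    """Accuracy over the first `length` positions of each sentence only."""
    correct = total = 0
    for truth, pred, n in zip(true_ids, pred_ids, lengths):
        correct += sum(t == p for t, p in zip(truth[:n], pred[:n]))
        total += n
    return correct / total


# Hypothetical toy batch: tag 13 is "O"; sentences of length 2 and 3, padded to 5
toy_true = [[1, 13, 13, 13, 13], [0, 4, 13, 13, 13]]
toy_pred = [[13, 13, 13, 13, 13], [0, 13, 13, 13, 13]]

padded_acc = token_accuracy(toy_true, toy_pred, [5, 5])  # counts padding
real_acc = token_accuracy(toy_true, toy_pred, [2, 3])    # real tokens only
print(padded_acc, real_acc)  # 0.8 0.6
```

Since most real tokens are also tagged `O`, entity-level precision and recall would give a more informative picture than either accuracy figure.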

#### (D) Prediction

In [69]:
from IPython.display import display, HTML


def predict_tags(sentence, tags, word2idx, max_len, model):
    # Whitespace tokenization: punctuation stays attached to the preceding word
    words = sentence.split()
    # Words missing from the vocabulary fall back to the "ENDPAD" index
    seq = pad_sequences([[word2idx.get(w, word2idx["ENDPAD"]) for w in words]], maxlen=max_len, padding="post", value=word2idx["ENDPAD"])
    preds = model.predict(seq)
    # Most probable tag index at each position
    preds = np.argmax(preds, axis=-1)
    predicted_tags = [tags[i] for i in preds[0]]
    predictions = list(zip(words, predicted_tags[:len(words)]))
    df_predictions = pd.DataFrame(predictions, columns=["Word", "Tag"])

    # Display the DataFrame as a table
    display(HTML(df_predictions.to_html(index=False)))
    # Return the table so callers can keep the result
    return df_predictions


Predict Tags:

Splits the input sentence on whitespace, converts each word to its index (words not in the vocabulary fall back to the `ENDPAD` index), and pads the sequence to the maximum length. The model predicts a tag index for each position, which is then mapped back to its tag name.

Display Results:

Creates a DataFrame from the predictions and displays it as a nicely formatted table in Jupyter.
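One limitation of splitting on whitespace is that punctuation stays attached to words, so "London." is looked up as a single token, while the training data stores "." as its own token. That mismatch may be why a sentence-final place name ends up tagged `O`. The sketch below shows one possible tokenizer that separates punctuation; `simple_tokenize` and its regular expression are assumptions for illustration, not what the notebook uses.

```python
import re


def simple_tokenize(sentence):
    """Split into words and standalone punctuation marks.

    Hyphens, apostrophes and dots are kept inside a token only when they
    sit between word characters (e.g. "34-day", "153.3").
    """
    return re.findall(r"\w+(?:[-'.]\w+)*|[^\w\s]", sentence)


toy_tokens = simple_tokenize("Mark and John are good friends from London.")
print(toy_tokens)
# ['Mark', 'and', 'John', 'are', 'good', 'friends', 'from', 'London', '.']
```

Passing such tokens to the model instead of `sentence.split()` would let "London" be found in `word2idx`.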

Let's make some predictions on new sentences.

In [67]:
sentence = "Mark and John are good friends from London."
predictions = predict_tags(sentence, tags, word2idx, max_len, model)

[ 0 13  0 13 13]


Word,Tag
Mark,B-per
and,O
John,B-per
are,O
good,O
friends,O
from,O
London.,O


In [73]:
sentence = "Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country."

predictions = predict_tags(sentence, tags, word2idx, max_len, model)


[13 13 13 13 13 13  1 13 13 13]


Word,Tag
Thousands,O
of,O
demonstrators,O
have,O
marched,O
through,O
London,B-geo
to,O
protest,O
the,O


In [71]:
sentence = "London is the capital of England."
predictions = predict_tags(sentence, tags, word2idx, max_len, model)


[ 1 13 13 13 13 13 13 13 13 13]


Word,Tag
London,B-geo
is,O
the,O
capital,O
of,O
England.,O


In [72]:
sentence = "Hyde Park is a good place in parliament."
predictions = predict_tags(sentence, tags, word2idx, max_len, model)


[ 0  4 13 13 13 13 13 13 13 13]


Word,Tag
Hyde,B-per
Park,I-per
is,O
a,O
good,O
place,O
in,O
parliament.,O
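The tags follow a B-/I- scheme: `B-per` marks the first word of a person entity and `I-per` a continuation of it. A possible way to merge such word-level tags into whole entities is sketched below in plain Python; `extract_entities` is a hypothetical helper, shown here on the prediction from the last example.

```python
def extract_entities(pairs):
    """Merge (word, tag) pairs tagged B-x / I-x into (text, type) spans."""
    entities, current_words, current_type = [], [], None
    for word, tag in list(pairs) + [("", "O")]:  # sentinel flushes the last span
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_words and not continues:
            # The open span ends here
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if tag.startswith("B-") or (tag.startswith("I-") and not current_words):
            # Start a new span (a stray I- tag is treated as a beginning)
            current_type = tag[2:]
            current_words = [word]
        elif continues:
            current_words.append(word)
    return entities


toy_pairs = [("Hyde", "B-per"), ("Park", "I-per"), ("is", "O"), ("a", "O"),
             ("good", "O"), ("place", "O"), ("in", "O"), ("parliament.", "O")]
toy_entities = extract_entities(toy_pairs)
print(toy_entities)  # [('Hyde Park', 'per')]
```

Here the model groups "Hyde Park" into one entity but labels it as a person rather than a location, which shows that high token accuracy does not guarantee correct entity types.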


#### Summary

Preprocessing: Prepare the data by tokenizing, encoding, and padding sentences.

Model Building: Build a BiLSTM model using TensorFlow.

Training: Train the model on the preprocessed data.

Prediction: Predict NER tags for new sentences and display the results in a tabular format.

By following these steps, we can use a BiLSTM model for Named Entity Recognition to identify and classify entities in text.