<a href="https://colab.research.google.com/github/rahulbhoyar1995/NER-Case-Study/blob/main/ner_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author : Rahul Bhoyar

### Named Entity Recognition (NER)

Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories like "Person" (PER), "Location" (GEO), "Organization" (ORG), etc.

## Data Preparation

In [18]:
import pandas as pd

In [19]:
data = pd.read_csv("ner_dataset.csv", encoding='latin1')
data

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
...,...,...,...,...
1048570,,they,PRP,O
1048571,,responded,VBD,O
1048572,,to,TO,O
1048573,,the,DT,O


There are 10,48,575 records divided in 4 columns.

As a part oof our problem statement we want only two columns : "Word" and "Tag".

In [21]:
data = data[["Word","Tag"]]
data

Unnamed: 0,Word,Tag
0,Thousands,O
1,of,O
2,demonstrators,O
3,have,O
4,marched,O
...,...,...
1048570,they,O
1048571,responded,O
1048572,to,O
1048573,the,O


Let's see how many null values are there.

In [24]:
missing_values_count = data.isnull().sum()
print(missing_values_count)

Word    10
Tag      0
dtype: int64


Here there are 10 records in Word column with null values.

In [22]:
null_values_df = data[data['Word'].isnull() | data['Tag'].isnull()]

# Display the rows with null values in 'Word' or 'Tag' columns
print("Rows with null values in 'Word' or 'Tag' columns:")
print(null_values_df)

Rows with null values in 'Word' or 'Tag' columns:
        Word Tag
197658   NaN   O
256026   NaN   O
257069   NaN   O
571211   NaN   O
613777   NaN   O
747019   NaN   O
901758   NaN   O
903054   NaN   O
944880   NaN   O
1003438  NaN   O


Removing the null values.

In [25]:
df = data.dropna(subset=['Word', 'Tag'])

# Display the cleaned DataFrame
print("\nDataFrame after removing rows with null values in 'Word' or 'Tag' columns:")
df


DataFrame after removing rows with null values in 'Word' or 'Tag' columns:


Unnamed: 0,Word,Tag
0,Thousands,O
1,of,O
2,demonstrators,O
3,have,O
4,marched,O
...,...,...
1048570,they,O
1048571,responded,O
1048572,to,O
1048573,the,O


Checking the uniques tags.

In [27]:
unique_tags = list(df["Tag"].unique())

In [28]:
print("Unique tags are :", unique_tags)

Unique tags are : ['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim', 'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve', 'I-eve', 'I-nat']


In [29]:
print("Total number of unique tags are :", len(unique_tags))

Total number of unique tags are : 17


Checking unique number of words.

In [31]:
unique_words = list(df["Word"].unique())

In [32]:
print("Total number of unique words are :", len(unique_words))

Total number of unique words are : 35177


Final Dataframe for modelling

In [33]:
df.shape

(1048565, 2)

In [34]:
df.head()

Unnamed: 0,Word,Tag
0,Thousands,O
1,of,O
2,demonstrators,O
3,have,O
4,marched,O


In [35]:
df.to_csv("data.csv")

### Approach 1: Traditional Machine Learning Algorithms.

It is classification problem.

Step 1: Dividing the dataset into training, testing and validation dataset.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('data.csv')  # Assuming the dataset is in CSV format

# Split the data into train+validation and test sets
train_val_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Further split train+validation into train and validation sets
train_data, val_data = train_test_split(train_val_data, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

print(f"Train size: {len(train_data)}, Validation size: {len(val_data)}, Test size: {len(test_data)}")


Train size: 629139, Validation size: 209713, Test size: 209713


In [2]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

# Function to create features for each word
def word_to_features(sentence, i):
    word = sentence[i]
    features = {
        'word': word,
        'is_upper': word.isupper(),
        'is_title': word.istitle(),
        'is_digit': word.isdigit(),
    }
    if i > 0:
        word1 = sentence[i-1]
        features.update({
            '-1:word': word1,
            '-1:is_upper': word1.isupper(),
            '-1:is_title': word1.istitle(),
        })
    else:
        features['BOS'] = True

    if i < len(sentence)-1:
        word1 = sentence[i+1]
        features.update({
            '+1:word': word1,
            '+1:is_upper': word1.isupper(),
            '+1:is_title': word1.istitle(),
        })
    else:
        features['EOS'] = True

    return features

In [3]:
# Convert dataset into features and labels
def preprocess_data(data):
    sentences = data.groupby(data.index // 10)['Word'].apply(list).values
    labels = data.groupby(data.index // 10)['Tag'].apply(list).values

    X, y = [], []
    for sentence, label_seq in zip(sentences, labels):
        for i in range(len(sentence)):
            X.append(word_to_features(sentence, i))
            y.append(label_seq[i])

    return X, y

In [4]:
# Preprocess the datasets
X_train, y_train = preprocess_data(train_data)
X_val, y_val = preprocess_data(val_data)
X_test, y_test = preprocess_data(test_data)

In [6]:
X_train[0], y_train[0]

({'word': 'through',
  'is_upper': False,
  'is_title': False,
  'is_digit': False,
  'BOS': True,
  '+1:word': 'demonstrators',
  '+1:is_upper': False,
  '+1:is_title': False},
 'O')

In [None]:
# Vectorize features
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train)
X_val = vec.transform(X_val)
X_test = vec.transform(X_test)



In [None]:
# Encode labels
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_val = le.transform(y_val)
y_test = le.transform(y_test)


### The Algorithm: BiLSTM for NER
In this example, we use a Bidirectional Long Short-Term Memory (BiLSTM) network for NER. Let's understand the key concepts.

#### 1. Long Short-Term Memory (LSTM)
LSTM: A type of Recurrent Neural Network (RNN) designed to remember information for long periods. Unlike regular RNNs, LSTMs can learn and retain long-range dependencies, making them effective for sequence prediction tasks.

#### 2. Bidirectional LSTM (BiLSTM)
Bidirectional: In a BiLSTM, we have two LSTMs for each time step, one processing the sequence from the start to the end (forward direction) and the other from the end to the start (backward direction). This allows the model to have both past and future context, which is useful for understanding the meaning of each word in a sentence.


### The Process: Training a BiLSTM Model for NER

**(A) Data Preprocessing**

(1) Tokenization:

Splitting text into individual words.


(2) Mapping to Indices:

Converting words and tags into numerical indices that the model can understand.

(3)Padding:

Ensuring all sentences have the same length by adding "padding" tokens to shorter sentences and truncating longer ones.

**(B) Model Building**

(2) Embedding Layer:

Converts each word into a dense vector of fixed size. These vectors capture semantic information about the words.

(2) BiLSTM Layer:

Processes the input sequences in both forward and backward directions.

(3) TimeDistributed Layer:

Applies a dense layer to each time step (word) independently, predicting the tag for each word.

**(C) Model Training**

(1) Compilation:

Setting up the model with an optimizer (e.g., Adam), loss function (e.g., categorical crossentropy), and evaluation metric (e.g., accuracy).


(B) Training: Fitting the model to the training data, adjusting weights to minimize the loss.


**(D) Prediction and Evaluation**

(1) Prediction: Using the trained model to predict tags for new sentences.

(2) Evaluation: Assessing the model’s performance on a test dataset.


### The Code

Here's the full code with explanations.


#### (A) Data Preprocessing

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [None]:
ner_data = pd.read_csv("ner_dataset.csv",  encoding='latin1')
ner_data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


Understanding the dataframe.

In [None]:
ner_data.shape

(1048575, 4)

Group the senetences with its tags.

In [None]:
class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                     s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

In [None]:
getter = SentenceGetter(ner_data)
sentences = getter.sentences

In [None]:
len(sentences)

47959

In [None]:
# Extract unique words and tags

In [None]:
words = list(set(ner_data["Word"].values))
words.append("ENDPAD")
len(words)

35179

In [None]:
tags = list(set(ner_data["Tag"].values))
len(tags)

17

In [None]:
# Dictionary mapping words and tags to indices


In [None]:
word2idx = {w: i for i, w in enumerate(words)}
len(word2idx)

35179

In [None]:
tag2idx = {t: i for i, t in enumerate(tags)}
len(tag2idx)

17

In [None]:
# Prepare data for the model
max_len = 50

In [None]:
X = [[word2idx[w[0]] for w in s] for s in sentences]
len(X)


47959

In [None]:
X[0:2]

[[21075], [744]]

In [None]:
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=word2idx["ENDPAD"])
X

array([[21075, 35178, 35178, ..., 35178, 35178, 35178],
       [  744, 35178, 35178, ..., 35178, 35178, 35178],
       [ 1521, 35178, 35178, ..., 35178, 35178, 35178],
       ...,
       [ 7598, 35178, 35178, ..., 35178, 35178, 35178],
       [29157, 35178, 35178, ..., 35178, 35178, 35178],
       [ 2981, 35178, 35178, ..., 35178, 35178, 35178]], dtype=int32)

In [None]:
X.shape

(47959, 50)

In [None]:
y = [[tag2idx[w[1]] for w in s] for s in sentences]
len(y)

47959

In [None]:
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y

array([[3, 3, 3, ..., 3, 3, 3],
       [6, 3, 3, ..., 3, 3, 3],
       [3, 3, 3, ..., 3, 3, 3],
       ...,
       [3, 3, 3, ..., 3, 3, 3],
       [3, 3, 3, ..., 3, 3, 3],
       [3, 3, 3, ..., 3, 3, 3]], dtype=int32)

In [None]:
y = [to_categorical(i, num_classes=len(tags)) for i in y]
len(y)

47959

In [None]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

Loading Data:

Read the CSV file into a DataFrame and fill missing values.
SentenceGetter: Groups words and tags by sentences.


Mapping to Indices:

Creates dictionaries to map words and tags to numerical indices.

Padding and Encoding:

Converts sentences to fixed-length sequences of indices and encodes tags as one-hot vectors.

Splitting Data:

 Splits the dataset into training and test sets.

#### (B) Model Building

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional

# Define the model
model = Sequential([
    Embedding(input_dim=len(words), output_dim=50, input_length=max_len),
    Dropout(0.1),
    Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)),
    TimeDistributed(Dense(len(tags), activation="softmax"))
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 50)            1758950   
                                                                 
 dropout (Dropout)           (None, 50, 50)            0         
                                                                 
 bidirectional (Bidirection  (None, 50, 200)           120800    
 al)                                                             
                                                                 
 time_distributed (TimeDist  (None, 50, 17)            3417      
 ributed)                                                        
                                                                 
Total params: 1883167 (7.18 MB)
Trainable params: 1883167 (7.18 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Embedding Layer:

Converts words to dense vectors.


BiLSTM Layer:

Processes sequences in both forward and backward directions.

TimeDistributed Layer:

Applies a dense layer to each word to predict its tag.

Compilation:

Sets up the optimizer, loss function, and metrics.

#### (C) Training the Model

This step will take some time.

In [None]:
history = model.fit(X_train, np.array(y_train), batch_size=32, epochs=5, validation_split=0.1, verbose=1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Training:

Fits the model to the training data, using a batch size of 32 and training for 5 epochs.

In [None]:
# Evaluate the model
model.evaluate(X_test, np.array(y_test))



[0.004684843588620424, 0.9982485175132751]

Loss (0.046321723610162735):

This value represents the model's loss on the test set. In this context, the loss is calculated using the categorical cross-entropy loss function, which measures the difference between the predicted and true probability distributions. A lower loss value indicates that the model's predictions are closer to the actual tags. The value 0.0463 indicates that the model has a relatively low error in its predictions on the test set.
Accuracy (0.9860008358955383):

This value represents the model's accuracy on the test set. Accuracy is the fraction of correct predictions made by the model. In this case, the value 0.9860 indicates that the model correctly predicted the NER tags for 98.60% of the words in the test set. This is a high accuracy, suggesting that the model is performing well.

#### (D) Prediction

In [None]:
from IPython.display import display, HTML


def predict_tags(sentence, tags, word2idx, max_len, model):
    words = sentence.split()
    seq = pad_sequences([[word2idx.get(w, word2idx["ENDPAD"]) for w in words]], maxlen=max_len, padding="post", value=word2idx["ENDPAD"])
    preds = model.predict(seq)
    preds = np.argmax(preds, axis=-1)
    predicted_tags = [tags[i] for i in preds[0]]
    predictions=  list(zip(words, predicted_tags[:len(words)]))
    df_predictions = pd.DataFrame(predictions, columns=["Word", "Tag"])

    # Display the DataFrame as a table
    display(HTML(df_predictions.to_html(index=False)))


Predict Tags:

Tokenizes the input sentence, converts it to indices, and pads it to the maximum length. The model predicts tags for each word, which are then converted back to their original form.

Display Results:

Creates a DataFrame from the predictions and displays it as a nicely formatted table in Jupyter.

Let's make some predictions on new sentences.

In [None]:
sentence = "India is the best place to live."
predictions = predict_tags(sentence, tags, word2idx, max_len, model)



Word,Tag
India,B-geo
is,O
the,O
best,O
place,O
to,O
live.,O


In [None]:
sentence_2 = "European Union is the biggest organisation."

predictions = predict_tags(sentence_2, tags, word2idx, max_len, model)




Word,Tag
European,B-org
Union,O
is,O
the,O
biggest,O
organisation.,O


In [None]:
sentence_3 = "In Germany and Nigeria, there are lot of other things which are not that good."

predictions = predict_tags(sentence_3, tags, word2idx, max_len, model)




Word,Tag
In,O
Germany,B-org
and,O
"Nigeria,",O
there,O
are,O
lot,O
of,O
other,O
things,O


In [None]:
sentence_4 = "Mosul and Suresh were best friends when they were in Baghdad."

predictions = predict_tags(sentence_4, tags, word2idx, max_len, model)




Word,Tag
Mosul,B-geo
and,O
Suresh,O
were,O
best,O
friends,O
when,O
they,O
were,O
in,O


### Summary :

Preprocessing:

Prepare data by tokenizing, encoding, and padding sentences.
Model Building: Build a BiLSTM model using Tensorflow.

Training: Train the model on the preprocessed data.

Prediction: Predict NER tags for new sentences and display results in a tabular format.

By following these steps, we can effectively use a BiLSTM model for Named Entity Recognition, enabling us to identify and classify entities in text.

### Future Steps

1. Use Pre-trained Embeddings:

Incorporate GloVe or BERT embeddings to improve performance.

2. Hyperparameter Tuning:

Experiment with different hyperparameters like batch size, learning rate, number of LSTM units, etc.


3. Ensemble Methods:

Combine predictions from multiple models to improve accuracy.


4. Error Analysis:

Analyze errors to understand common failure cases and address them

### Approach 1 : Use Pre-trained Embeddings

We'll start by incorporating pre-trained GloVe embeddings into our model to improve its performance.

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2024-06-15 11:33:38--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-06-15 11:33:38--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-06-15 11:33:38--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [None]:
# Load the embeddings
embedding_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype="float32")
        embedding_index[word] = coefs

embedding_dim = 100
embedding_matrix = np.zeros((len(words), embedding_dim))
for word, i in word2idx.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Build Model with GloVe Embeddings

In [None]:
from tensorflow.keras.layers import Embedding

model_glove = Sequential([
    Embedding(input_dim=len(words), output_dim=embedding_dim, input_length=max_len, weights=[embedding_matrix], trainable=False),
    Dropout(0.1),
    Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)),
    TimeDistributed(Dense(len(tags), activation="softmax"))
])

model_glove.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model_glove.summary()


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 50, 100)           3517900   
                                                                 
 dropout_1 (Dropout)         (None, 50, 100)           0         
                                                                 
 bidirectional_1 (Bidirecti  (None, 50, 200)           160800    
 onal)                                                           
                                                                 
 time_distributed_1 (TimeDi  (None, 50, 17)            3417      
 stributed)                                                      
                                                                 
Total params: 3682117 (14.05 MB)
Trainable params: 164217 (641.47 KB)
Non-trainable params: 3517900 (13.42 MB)
_________________________________________________________________


Train the Model with GloVe Embeddings

In [None]:
history_glove = model_glove.fit(X_train, np.array(y_train), batch_size=32, epochs=5, verbose=1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
# Evaluate the model
model_glove.evaluate(X_test, np.array(y_test))



[0.02051382325589657, 0.994320273399353]

In [None]:
sentence_1 = "Germany is one of the main economy in the world."
predictions = predict_tags(sentence_1, tags, word2idx, max_len, model_glove)



Word,Tag
Germany,O
is,O
one,O
of,O
the,O
main,O
economy,O
in,O
the,O
world.,O


In [None]:
sentence_2 = "In Germany and Nigeria, there are lot of other things which are not that good."
predictions = predict_tags(sentence_2, tags, word2idx, max_len, model_glove)



Word,Tag
In,O
Germany,O
and,O
"Nigeria,",O
there,O
are,O
lot,O
of,O
other,O
things,O


In [None]:
sentence_3 = "London is in England."
predictions = predict_tags(sentence_3, tags, word2idx, max_len, model_glove)



Word,Tag
London,O
is,O
in,O
England.,O


In [None]:
sentence_4 = "India is the best place to live."
predictions = predict_tags(sentence_4, tags, word2idx, max_len, model_glove)



Word,Tag
India,O
is,O
the,O
best,O
place,O
to,O
live.,O


### Approach 2 : Hyperparameter Tuning



We will tune hyperparameters such as batch size, learning rate, and the number of LSTM units. We can use tools like Keras Tuner, but for simplicity, let's manually experiment with different configurations.



Define a Function to Build the Model with Hyperparameters

In [None]:
from tensorflow.keras.optimizers import Adam

def build_model(embedding_matrix, lstm_units=100, dropout_rate=0.1, learning_rate=0.001):
    model = Sequential([
        Embedding(input_dim=len(words), output_dim=embedding_dim, input_length=max_len, weights=[embedding_matrix], trainable=False),
        Dropout(dropout_rate),
        Bidirectional(LSTM(units=lstm_units, return_sequences=True, recurrent_dropout=dropout_rate)),
        TimeDistributed(Dense(len(tags), activation="softmax"))
    ])
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    return model


Train and Evaluate Models with Different Hyperparameters

In [None]:
# Example configuration 1
model_hp1 = build_model(embedding_matrix, lstm_units=50, dropout_rate=0.2, learning_rate=0.001)
history_hp1 = model_hp1.fit(X_train, np.array(y_train), batch_size=64, epochs=5, verbose=1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


We can have multiple configurations like this.

In [None]:
# Evaluate the model
model_hp1.evaluate(X_test, np.array(y_test))



[0.02063002437353134, 0.994320273399353]

In [None]:
sentence_1 = "Germany is one of the main economy in the world."
predictions = predict_tags(sentence_1, tags, word2idx, max_len, model_hp1)



Word,Tag
Germany,O
is,O
one,O
of,O
the,O
main,O
economy,O
in,O
the,O
world.,O


In [None]:
sentence_2 = "India is a tropical country, away from London."
predictions = predict_tags(sentence_2, tags, word2idx, max_len, model_hp1)



Word,Tag
India,O
is,O
a,O
tropical,O
"country,",O
away,O
from,O
London.,O


In [None]:
sentence_3 = "Mosul, a plaxce nera Baghdad, is believed to have wonders."
predictions = predict_tags(sentence_3, tags, word2idx, max_len, model_hp1)



Word,Tag
"Mosul,",O
a,O
plaxce,O
nera,O
"Baghdad,",O
is,O
believed,O
to,O
have,O
wonders.,O


Repeat this for other configurations and compare the validation performance.

### Approach 3 : Ensemble Methods




Combining predictions from multiple models can improve accuracy. We'll average the probabilities from different models.

Train Multiple Models

In [None]:
# Example model 1
model1 = build_model(embedding_matrix, lstm_units=100, dropout_rate=0.1, learning_rate=0.001)
history1 = model1.fit(X_train, np.array(y_train), batch_size=32, epochs=5, verbose=1)



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5

In [None]:
# Example model 2
model2 = build_model(embedding_matrix, lstm_units=150, dropout_rate=0.2, learning_rate=0.001)
history2 = model2.fit(X_train, np.array(y_train), batch_size=32, epochs=5, verbose=1)



Ensemble Predictions

In [None]:
def ensemble_predict(models, sentence, tags, word2idx, max_len):
    words = sentence.split()
    seq = pad_sequences([[word2idx.get(w, word2idx["ENDPAD"]) for w in words]], maxlen=max_len, padding="post", value=word2idx["ENDPAD"])

    # Sum predictions from all models
    total_preds = np.zeros((1, max_len, len(tags)))
    for model in models:
        preds = model.predict(seq)
        total_preds += preds

    # Average predictions
    avg_preds = total_preds / len(models)
    avg_preds = np.argmax(avg_preds, axis=-1)
    predicted_tags = [tags[i] for i in avg_preds[0]]
    return list(zip(words, predicted_tags[:len(words)]))




In [None]:
# Ensemble prediction
sentence = "Mark and John are good friends from London."
models = [model1, model2]
predictions = ensemble_predict(models, sentence, tags, word2idx, max_len)

# Display results
df_predictions = pd.DataFrame(predictions, columns=["Word", "Tag"])
from IPython.display import display, HTML
display(HTML(df_predictions.to_html(index=False)))

### Approach 4 : Error Analysis

##### Identify errors

In [None]:
def evaluate_and_analyze(model, X_test, y_test, idx2tag):
    preds = model.predict(X_test)
    preds = np.argmax(preds, axis=-1)
    y_true = np.argmax(y_test, axis=-1)

    errors = []
    for i in range(len(y_true)):
        for j in range(len(y_true[i])):
            if y_true[i][j] != preds[i][j] and y_true[i][j] != 0:
                errors.append((i, j, idx2tag[y_true[i][j]], idx2tag[preds[i][j]]))

    return errors

idx2tag = {i: t for t, i in tag2idx.items()}
errors = evaluate_and_analyze(model_glove, X_test, y_test, idx2tag)

# Display errors
error_df = pd.DataFrame(errors, columns=["Sentence Index", "Word Index", "True Tag", "Predicted Tag"])
display(HTML(error_df.to_html(index=False)))


### Summary

**1. Pre-trained Embeddings:**

Improved the model using GloVe embeddings.

**2. Hyperparameter Tuning:**

Experimented with different hyperparameters to find the best configuration.

**3. Ensemble Methods:**

Combined predictions from multiple models for better performance.

**4. Error Analysis:**

Analyzed errors to understand the model's limitations and guide future improvements.
