<a href="https://colab.research.google.com/github/rahulbhoyar1995/NER-Case-Study/blob/main/ner_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author : Rahul Bhoyar

### Named Entity Recognition (NER)

Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories like "Person" (PER), "Location" (GEO), "Organization" (ORG), etc.

In [5]:
pip install chardet

Collecting chardet
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Downloading chardet-5.2.0-py3-none-any.whl (199 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.4/199.4 kB 10.6 MB/s eta 0:00:00
Installing collected packages: chardet
Successfully installed chardet-5.2.0
Note: you may need to restart the kernel to use updated packages.


## Data Preparation

In [6]:
import chardet

# Detect the encoding of the file
with open('ner_dataset.csv', 'rb') as f:
    result = chardet.detect(f.read())
    encoding = result['encoding']

print(f"Detected encoding: {encoding}")

Detected encoding: Windows-1252


In [9]:
import pandas as pd

data = pd.read_csv("ner_dataset.csv", encoding=encoding)
data

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
...,...,...,...,...
723383,,.,.,O
723384,Sentence: 33049,He,PRP,O
723385,,is,VBZ,O
723386,,the,DT,O


There are 723388 records across 4 columns.

As part of our problem statement, we need only two columns: "Word" and "Tag".

In [10]:
data = data[["Word","Tag"]]
data

Unnamed: 0,Word,Tag
0,Thousands,O
1,of,O
2,demonstrators,O
3,have,O
4,marched,O
...,...,...
723383,.,O
723384,He,O
723385,is,O
723386,the,O


Let's see how many null values are there.

In [11]:
missing_values_count = data.isnull().sum()
print(missing_values_count)

Word    5
Tag     1
dtype: int64


There are 5 null values in the Word column and 1 in the Tag column.

In [12]:
null_values_df = data[data['Word'].isnull() | data['Tag'].isnull()]

# Display the rows with null values in 'Word' or 'Tag' columns
print("Rows with null values in 'Word' or 'Tag' columns:")
print(null_values_df)

Rows with null values in 'Word' or 'Tag' columns:
       Word  Tag
197658  NaN    O
256026  NaN    O
257069  NaN    O
571211  NaN    O
613777  NaN    O
723387   16  NaN


Removing the null values.

In [13]:
df = data.dropna(subset=['Word', 'Tag'])

# Display the cleaned DataFrame
print("\nDataFrame after removing rows with null values in 'Word' or 'Tag' columns:")
df


DataFrame after removing rows with null values in 'Word' or 'Tag' columns:


Unnamed: 0,Word,Tag
0,Thousands,O
1,of,O
2,demonstrators,O
3,have,O
4,marched,O
...,...,...
723382,attack,O
723383,.,O
723384,He,O
723385,is,O


Checking the unique tags.

In [14]:
unique_tags = list(df["Tag"].unique())

In [15]:
print("Unique tags are :", unique_tags)

Unique tags are : ['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim', 'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve', 'I-eve', 'I-nat']


In [16]:
print("Total number of unique tags are :", len(unique_tags))

Total number of unique tags are : 17
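The 17 tags follow a B-/I-/O scheme: `B-` marks the first token of an entity, `I-` marks a continuation of the same entity, and `O` marks tokens outside any entity. As a minimal sketch of how such tags map back to entity spans (the helper `bio_to_spans` and the toy sentence are illustrative and not part of the notebook):

```python
def bio_to_spans(words, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then start a new one
            # (a stray I- tag is treated as the beginning of an entity).
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-"):
            current_words.append(word)
        else:  # "O": close any open entity
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


toy_words = ["Hyde", "Park", "is", "in", "London"]
toy_tags = ["B-geo", "I-geo", "O", "O", "B-geo"]
spans = bio_to_spans(toy_words, toy_tags)
print(spans)
```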


Checking the number of unique words.

In [17]:
unique_words = list(df["Word"].unique())

In [18]:
print("Total number of unique words are :", len(unique_words))

Total number of unique words are : 29650


Final Dataframe for modelling

In [19]:
df.shape

(723382, 2)

In [20]:
df.head()

Unnamed: 0,Word,Tag
0,Thousands,O
1,of,O
2,demonstrators,O
3,have,O
4,marched,O


In [21]:
df.to_csv("data.csv")

### Approach 1: Traditional Machine Learning Algorithms.

This is a token-level classification problem: each word is assigned one of the 17 tags.

Step 1: Dividing the dataset into training, validation and test sets.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('data.csv')  # Assuming the dataset is in CSV format


In [2]:
data = data[["Word","Tag"]]
data

Unnamed: 0,Word,Tag
0,Thousands,O
1,of,O
2,demonstrators,O
3,have,O
4,marched,O
...,...,...
723377,attack,O
723378,.,O
723379,He,O
723380,is,O


In [3]:
unique_words = data["Word"].unique()


print("Total number of unique words are :", len(unique_words))

Total number of unique words are : 29650


In [4]:
data = data.drop_duplicates(subset=['Word', 'Tag'])

# Print the unique DataFrame
print("DataFrame after removing duplicates:")
data

DataFrame after removing duplicates:


Unnamed: 0,Word,Tag
0,Thousands,O
1,of,O
2,demonstrators,O
3,have,O
4,marched,O
...,...,...
723222,Hajj,B-org
723223,Ismail,I-org
723224,Jabber,I-org
723354,Junaid,B-per


In [5]:

# Split the data into train+validation and test sets
train_val_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Further split train+validation into train and validation sets
train_data, val_data = train_test_split(train_val_data, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

print(f"Train size: {len(train_data)}, Validation size: {len(val_data)}, Test size: {len(test_data)}")


Train size: 21911, Validation size: 7304, Test size: 7304


In [6]:
def spliting_features_dependent_var(data):
    X = data["Word"]
    y = data["Tag"]
    return X, y

In [7]:
# Preprocess the datasets
X_train, y_train = spliting_features_dependent_var(train_data)
X_val, y_val = spliting_features_dependent_var(val_data)
X_test, y_test = spliting_features_dependent_var(test_data)



In [8]:
X_train.shape, y_train.shape

((21911,), (21911,))

In [9]:
X_val.shape, y_val.shape

((7304,), (7304,))

In [10]:
X_test.shape, y_test.shape

((7304,), (7304,))

In [11]:
train_data = pd.DataFrame({'Word': X_train, 'Tag': y_train})
val_data = pd.DataFrame({'Word': X_val, 'Tag': y_val})
test_data = pd.DataFrame({'Word': X_test, 'Tag': y_test})

In [14]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

def preprocess_and_vectorize(data):
    X, y = [], []
    
    # Iterate over each row in the dataframe
    for index, row in data.iterrows():
        word = row['Word']
        tag = row['Tag']
        
        # Create a dictionary of features (only 'Word' in this case)
        features = {'Word': word}
        
        # Append the feature dictionary and corresponding label to X and y
        X.append(features)
        y.append(tag)
    
    # Vectorize features using DictVectorizer
    vec = DictVectorizer(sparse=False)
    X_vectorized = vec.fit_transform(X)
    
    return X_vectorized, y, vec
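With only the word itself as a feature, the classifier cannot generalise to unseen words or use context. A common remedy in feature-based NER is to add word-shape and neighbouring-word features. The sketch below is an illustration only: `token_features` and its feature names are hypothetical, and it assumes tokens are available in sentence order, which the de-duplicated word list used in this approach does not preserve.

```python
def token_features(words, i):
    """Build a feature dictionary for the token at position i of a sentence."""
    word = words[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalised words are often entities
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        # Neighbouring words, with markers at the sentence boundaries
        "prev.lower": words[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }


toy_sentence = ["He", "visited", "Germany", "yesterday"]
toy_features = [token_features(toy_sentence, i) for i in range(len(toy_sentence))]
print(toy_features[2])
```

A list of dictionaries like `toy_features` can be passed to `DictVectorizer` in the same way as the single-feature dictionaries above: string values are one-hot encoded and boolean values are treated as numbers.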

In [15]:
# Preprocess and vectorize training data
X_train_vec, y_train_vec, vec = preprocess_and_vectorize(train_data)


In [17]:
# Transform validation and test data using the same vectorizer
# Pass only the 'Word' column so the label is never offered as a feature
X_val_vec = vec.transform(val_data[['Word']].to_dict('records'))
X_test_vec = vec.transform(test_data[['Word']].to_dict('records'))

In [18]:
from sklearn.linear_model import LogisticRegression

# Instantiate Logistic Regression model
model = LogisticRegression()

# Train the model on vectorized training data
model.fit(X_train_vec, y_train_vec)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [19]:
from sklearn.metrics import classification_report

# Predict on validation data
y_pred_val = model.predict(X_val_vec)

# Evaluate model performance on validation data
print("Classification Report on Validation Data:")
print(classification_report(y_val, y_pred_val))


Classification Report on Validation Data:
              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00        42
       B-eve       0.00      0.00      0.00        11
       B-geo       0.00      0.00      0.00       519
       B-gpe       0.00      0.00      0.00       102
       B-nat       0.00      0.00      0.00         6
       B-org       0.00      0.00      0.00       465
       B-per       0.00      0.00      0.00       504
       B-tim       0.00      0.00      0.00       189
       I-art       0.00      0.00      0.00        43
       I-eve       0.00      0.00      0.00        15
       I-geo       0.00      0.00      0.00       171
       I-gpe       0.00      0.00      0.00         8
       I-nat       0.00      0.00      0.00         1
       I-org       0.00      0.00      0.00       476
       I-per       0.00      0.00      0.00       626
       I-tim       0.00      0.00      0.00       107
           O       0.55      1.00      

  _warn_prf(average, modifier, msg_start, len(result))


In [20]:
# Predict on test data
y_pred_test = model.predict(X_test_vec)

# Evaluate model performance on test data
print("Classification Report on Test Data:")
print(classification_report(y_test, y_pred_test))


Classification Report on Test Data:
              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00        46
       B-eve       0.00      0.00      0.00        13
       B-geo       0.00      0.00      0.00       474
       B-gpe       0.00      0.00      0.00        79
       B-nat       0.00      0.00      0.00         8
       B-org       0.00      0.00      0.00       465
       B-per       0.00      0.00      0.00       500
       B-tim       0.00      0.00      0.00       161
       I-art       0.00      0.00      0.00        38
       I-eve       0.00      0.00      0.00        14
       I-geo       0.00      0.00      0.00       170
       I-gpe       0.00      0.00      0.00        10
       I-nat       0.00      0.00      0.00         3
       I-org       0.00      0.00      0.00       493
       I-per       0.00      0.00      0.00       639
       I-tim       0.00      0.00      0.00       118
           O       0.56      1.00      0.72  

  _warn_prf(average, modifier, msg_start, len(result))


In [21]:
import pandas as pd

def predict_tags(model, vec, new_words):
    """
    Function to predict Named Entity Recognition tags for new words.

    Parameters:
    - model: Trained machine learning model (e.g., Logistic Regression)
    - vec: DictVectorizer instance used for vectorizing training data
    - new_words: List or Series of new words to predict tags for

    Returns:
    - DataFrame containing 'Word' and 'Predicted_Tag' columns
    """
    # Prepare new data in a DataFrame format
    new_data = pd.DataFrame({'Word': new_words})
    
    # Vectorize the new data using the same DictVectorizer instance
    X_new = vec.transform(new_data.to_dict('records'))
    
    # Make predictions using the trained model
    y_pred_new = model.predict(X_new)
    
    # Create a DataFrame to display predictions
    predictions_df = pd.DataFrame({'Word': new_words, 'Predicted_Tag': y_pred_new})
    
    return predictions_df

# Example usage:
# Assuming `model` is your trained Logistic Regression model
# Assuming `vec` is the DictVectorizer instance used for training data

# List of new words to predict tags for
new_words = ['Germany', 'is', 'a', 'beautiful', 'country']

# Predict tags for the new words
predictions = predict_tags(model, vec, new_words)

# Display predictions
print(predictions)


        Word Predicted_Tag
0    Germany             O
1         is             O
2          a             O
3  beautiful             O
4    country             O
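One likely reason every word is predicted as "O": duplicate (Word, Tag) pairs were removed before the random split, so a word in the validation or test split appears in the training split only if it also occurs there with a different tag. Words the vectorizer has not seen become all-zero feature vectors, and the model falls back to the majority class. A minimal sketch for checking this (the helper `seen_fraction` and the toy lists are hypothetical stand-ins for `X_train` and `X_val`):

```python
def seen_fraction(train_words, eval_words):
    """Fraction of evaluation tokens whose word also occurs in the training split."""
    vocabulary = set(train_words)
    seen = sum(1 for w in eval_words if w in vocabulary)
    return seen / len(eval_words)


# Toy stand-ins for X_train and X_val (illustrative values only)
toy_train = ["Thousands", "of", "London", "marched"]
toy_val = ["Germany", "of", "Iraq", "protest"]
coverage = seen_fraction(toy_train, toy_val)
print(f"{coverage:.0%} of validation words were seen in training")
```

In the notebook, the same check could be run as `seen_fraction(X_train, X_val)`.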


### Approach 2: Deep Learning Algorithms.

### The Algorithm: BiLSTM for NER
In this example, we use a Bidirectional Long Short-Term Memory (BiLSTM) network for NER. Let's understand the key concepts.

#### 1. Long Short-Term Memory (LSTM)
LSTM: A type of Recurrent Neural Network (RNN) designed to remember information for long periods. Unlike regular RNNs, LSTMs can learn and retain long-range dependencies, making them effective for sequence prediction tasks.

#### 2. Bidirectional LSTM (BiLSTM)
Bidirectional: A BiLSTM runs two LSTMs over the same sequence, one processing it from start to end (forward direction) and the other from end to start (backward direction). Their outputs are combined at each time step, so the model has both past and future context for every word, which is useful for understanding the meaning of each word in a sentence.
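To make the bidirectional idea concrete, here is a small NumPy sketch. It uses a plain tanh RNN cell rather than a real LSTM cell, and all weights are random placeholders, so it only illustrates how forward and backward states are combined, not how Keras implements `Bidirectional`.

```python
import numpy as np


def run_rnn(inputs, W_x, W_h):
    """Run a plain tanh RNN over inputs of shape (time, features); return all hidden states."""
    hidden = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        hidden = np.tanh(W_x @ x_t + W_h @ hidden)
        states.append(hidden)
    return np.stack(states)


rng = np.random.default_rng(0)
time_steps, n_features, n_units = 5, 3, 4
inputs = rng.normal(size=(time_steps, n_features))

# Separate, randomly initialised weights for each direction (placeholders)
W_x_fwd = rng.normal(size=(n_units, n_features))
W_h_fwd = rng.normal(size=(n_units, n_units))
W_x_bwd = rng.normal(size=(n_units, n_features))
W_h_bwd = rng.normal(size=(n_units, n_units))

forward = run_rnn(inputs, W_x_fwd, W_h_fwd)
# Process the reversed sequence, then flip the states back into reading order
backward = run_rnn(inputs[::-1], W_x_bwd, W_h_bwd)[::-1]

# Each time step now carries context from both directions
bidirectional = np.concatenate([forward, backward], axis=-1)
print(bidirectional.shape)  # (5, 8)
```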


### The Process: Training a BiLSTM Model for NER

**(A) Data Preprocessing**

(1) Tokenization:

Splitting text into individual words.


(2) Mapping to Indices:

Converting words and tags into numerical indices that the model can understand.

(3) Padding:

Ensuring all sentences have the same length by adding "padding" tokens to shorter sentences and truncating longer ones.
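The mapping and padding steps can be sketched in plain Python. The helpers `build_vocab` and `encode_and_pad` are hypothetical and only illustrate the idea. Note that this sketch truncates from the end of a sentence, whereas Keras `pad_sequences` truncates from the start unless `truncating="post"` is passed.

```python
def build_vocab(sentences, pad_token="ENDPAD"):
    """Map every distinct word to an index, with the padding token last."""
    vocab = sorted({w for s in sentences for w in s})
    vocab.append(pad_token)
    return {w: i for i, w in enumerate(vocab)}


def encode_and_pad(sentence, word2idx, max_len, pad_token="ENDPAD"):
    """Convert words to indices, truncate to max_len, then pad on the right."""
    ids = [word2idx[w] for w in sentence][:max_len]
    return ids + [word2idx[pad_token]] * (max_len - len(ids))


toy_sentences = [["He", "is", "here"], ["Thousands", "marched"]]
toy_word2idx = build_vocab(toy_sentences)
toy_padded = [encode_and_pad(s, toy_word2idx, max_len=4) for s in toy_sentences]
print(toy_padded)
```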

**(B) Model Building**

(1) Embedding Layer:

Converts each word into a dense vector of fixed size. These vectors capture semantic information about the words.

(2) BiLSTM Layer:

Processes the input sequences in both forward and backward directions.

(3) TimeDistributed Layer:

Applies a dense layer to each time step (word) independently, predicting the tag for each word.

**(C) Model Training**

(1) Compilation:

Setting up the model with an optimizer (e.g., Adam), loss function (e.g., categorical crossentropy), and evaluation metric (e.g., accuracy).


(2) Training: Fitting the model to the training data, adjusting weights to minimize the loss.
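As a worked example of the loss being minimized, categorical cross-entropy for a single token is $-\sum_k y_k \log p_k$, where $y$ is the one-hot true tag and $p$ the predicted probabilities. The sketch below uses made-up probabilities over three tags to show that a confident correct prediction gives a small loss and an unsure one a larger loss.

```python
import math


def categorical_crossentropy(true_one_hot, predicted_probs):
    """Loss for one token: -sum(y_k * log(p_k)) over the classes."""
    return -sum(
        y * math.log(p) for y, p in zip(true_one_hot, predicted_probs) if y > 0
    )


true_tag = [0, 1, 0]                 # the second tag is the correct one
confident = [0.05, 0.90, 0.05]       # illustrative probabilities
unsure = [0.40, 0.30, 0.30]

loss_confident = categorical_crossentropy(true_tag, confident)  # -ln(0.9)
loss_unsure = categorical_crossentropy(true_tag, unsure)        # -ln(0.3)
print(round(loss_confident, 3), round(loss_unsure, 3))
```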


**(D) Prediction and Evaluation**

(1) Prediction: Using the trained model to predict tags for new sentences.

(2) Evaluation: Assessing the model’s performance on a test dataset.


### The Code

Here's the full code with explanations.


#### (A) Data Preprocessing

In [25]:
pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow)
  Downloading astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow)
  Downloading flatbuffers-24.3.25-py2.py3-none-any.whl.metadata (850 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow)
  Downloading gast-0.5.4-py3-none-any.whl.metadata (1.3 kB)
Collecting google-pasta>=0.1.1 (from tensorflow)
  Downloading google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)
Collecting h5py>=3.10.0 (from tensorflow)
  Downloading h5py-3.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.5 kB)
Collecting libclang>=13.0.0 (from tensorflow)
  Downloading libclang-18.1.1-py2.py3-none-manylinux2010_x86_64.whl.metadata (5.2 kB)
Collecting ml-dtypes~=0.3.1 (from tensorflow)
  Downloading ml_dtypes-0.3.2-cp310-cp310-manylinux_2_17_x

In [26]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

2024-06-19 06:15:06.332317: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [27]:
ner_data = pd.read_csv("ner_dataset.csv",  encoding='latin1')
ner_data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


Understanding the dataframe.

In [28]:
ner_data.shape

(1048575, 4)

Group the sentences with their tags.

In [29]:
class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                     s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

In [30]:
getter = SentenceGetter(ner_data)
sentences = getter.sentences

In [31]:
len(sentences)

47959

In [None]:
# Extract unique words and tags

In [32]:
words = list(set(ner_data["Word"].values))
words.append("ENDPAD")
len(words)

35179

In [33]:
tags = list(set(ner_data["Tag"].values))
len(tags)

17

In [None]:
# Dictionary mapping words and tags to indices


In [34]:
word2idx = {w: i for i, w in enumerate(words)}
len(word2idx)

35179

In [35]:
tag2idx = {t: i for i, t in enumerate(tags)}
len(tag2idx)

17

In [36]:
# Prepare data for the model
max_len = 50

In [37]:
X = [[word2idx[w[0]] for w in s] for s in sentences]
len(X)


47959

In [38]:
X[0:2]

[[5374], [1536]]
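Each encoded sentence above contains a single index. This is consistent with `Sentence #` being filled only on the first row of each sentence: `groupby` drops rows whose key is missing, so every group keeps just its first word. A minimal sketch of forward-filling the label before grouping, using a hypothetical toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "He", "left"],
    "Tag": ["O", "O", "O", "O", "O"],
})

# Without forward-filling, rows with a missing sentence label are dropped by groupby
lengths_before = toy.groupby("Sentence #").size().tolist()

# Propagate each sentence label down to the rows that belong to it
toy["Sentence #"] = toy["Sentence #"].ffill()
lengths_after = toy.groupby("Sentence #").size().tolist()

print(lengths_before, lengths_after)
```

In the notebook, this would mean forward-filling `ner_data["Sentence #"]` before constructing `SentenceGetter`. The outputs recorded in this section appear to have been produced without that step.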

In [39]:
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=word2idx["ENDPAD"])
X

array([[ 5374, 35178, 35178, ..., 35178, 35178, 35178],
       [ 1536, 35178, 35178, ..., 35178, 35178, 35178],
       [11248, 35178, 35178, ..., 35178, 35178, 35178],
       ...,
       [ 5309, 35178, 35178, ..., 35178, 35178, 35178],
       [21838, 35178, 35178, ..., 35178, 35178, 35178],
       [18083, 35178, 35178, ..., 35178, 35178, 35178]], dtype=int32)

In [40]:
X.shape

(47959, 50)

In [41]:
y = [[tag2idx[w[1]] for w in s] for s in sentences]
len(y)

47959

In [42]:
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y

array([[ 7,  7,  7, ...,  7,  7,  7],
       [16,  7,  7, ...,  7,  7,  7],
       [ 7,  7,  7, ...,  7,  7,  7],
       ...,
       [ 7,  7,  7, ...,  7,  7,  7],
       [ 7,  7,  7, ...,  7,  7,  7],
       [ 7,  7,  7, ...,  7,  7,  7]], dtype=int32)

In [43]:
y = [to_categorical(i, num_classes=len(tags)) for i in y]
len(y)

47959

In [44]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

Loading Data:

Reads the CSV file into a DataFrame. The `Sentence #` column is populated only on the first row of each sentence, so it needs to be forward-filled before grouping; otherwise each group contains just the first word of its sentence.

SentenceGetter:

Groups words and tags by sentence.

Mapping to Indices:

Creates dictionaries to map words and tags to numerical indices.

Padding and Encoding:

Converts sentences to fixed-length sequences of indices and encodes tags as one-hot vectors.

Splitting Data:

Splits the dataset into training and test sets.

#### (B) Model Building

In [45]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional

# Define the model
model = Sequential([
    Embedding(input_dim=len(words), output_dim=50, input_length=max_len),
    Dropout(0.1),
    Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)),
    TimeDistributed(Dense(len(tags), activation="softmax"))
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()




Embedding Layer:

Converts words to dense vectors.


BiLSTM Layer:

Processes sequences in both forward and backward directions.

TimeDistributed Layer:

Applies a dense layer to each word to predict its tag.

Compilation:

Sets up the optimizer, loss function, and metrics.

#### (C) Training the Model

This step will take some time.

In [46]:
history = model.fit(X_train, np.array(y_train), batch_size=32, epochs=5, validation_split=0.1, verbose=1)


Epoch 1/5
1214/1214 ━━━━━━━━━━━━━━━━━━━━ 74s 57ms/step - accuracy: 0.9894 - loss: 0.0900 - val_accuracy: 0.9977 - val_loss: 0.0073
Epoch 2/5
1214/1214 ━━━━━━━━━━━━━━━━━━━━ 68s 56ms/step - accuracy: 0.9983 - loss: 0.0058 - val_accuracy: 0.9985 - val_loss: 0.0048
Epoch 3/5
1214/1214 ━━━━━━━━━━━━━━━━━━━━ 68s 56ms/step - accuracy: 0.9990 - loss: 0.0035 - val_accuracy: 0.9985 - val_loss: 0.0045
Epoch 4/5
1214/1214 ━━━━━━━━━━━━━━━━━━━━ 68s 56ms/step - accuracy: 0.9991 - loss: 0.0030 - val_accuracy: 0.9986 - val_loss: 0.0045
Epoch 5/5
1214/1214 ━━━━━━━━━━━━━━━━━━━━ 68s 56ms/step - accuracy: 0.9991 - loss: 0.0028 - val_accuracy: 0.9986 - val_loss: 0.0045


Training:

Fits the model to the training data, using a batch size of 32 and training for 5 epochs.

In [47]:
# Evaluate the model
model.evaluate(X_test, np.array(y_test))

150/150 ━━━━━━━━━━━━━━━━━━━━ 2s 11ms/step - accuracy: 0.9986 - loss: 0.0045


[0.0044597783125936985, 0.9985813498497009]

Loss (0.0044597783125936985):

This value is the model's loss on the test set, computed with the categorical cross-entropy loss function, which measures the difference between the predicted and true probability distributions. A lower loss means the predicted tag probabilities are closer to the actual tags.

Accuracy (0.9985813498497009):

This value is the model's accuracy on the test set, i.e. the fraction of positions whose tag was predicted correctly: about 99.86%. The figure should be read with care. It is averaged over all 50 positions of every padded sequence, and padded positions are labelled "O", so a model that predicts "O" almost everywhere would also score very high. Per-entity precision, recall and F1 give a more reliable picture of NER quality.
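Because the reported accuracy is averaged over every position of the padded sequences, it can help to recompute it on real tokens only. The sketch below uses hypothetical toy arrays; `masked_accuracy` is an illustrative helper, not a Keras function.

```python
import numpy as np


def masked_accuracy(true_ids, pred_ids, word_ids, pad_word_id):
    """Token accuracy computed only where the input is a real word, not padding."""
    mask = word_ids != pad_word_id
    return float((true_ids[mask] == pred_ids[mask]).mean())


# Toy example: one sentence of 3 real tokens padded to length 6 (made-up ids)
PAD, O, B_GEO = 9, 0, 1
word_ids = np.array([[4, 7, 2, PAD, PAD, PAD]])
true_ids = np.array([[B_GEO, O, B_GEO, O, O, O]])
pred_ids = np.array([[O, O, O, O, O, O]])  # a model that always predicts "O"

plain = float((true_ids == pred_ids).mean())                   # 4 of 6 positions
masked = masked_accuracy(true_ids, pred_ids, word_ids, PAD)    # 1 of 3 real tokens
print(round(plain, 3), round(masked, 3))
```

In the notebook, `true_ids` and `pred_ids` would come from `np.argmax(..., axis=-1)` applied to `np.array(y_test)` and `model.predict(X_test)`, with `X_test` as `word_ids` and `word2idx["ENDPAD"]` as the padding id.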

#### (D) Prediction

In [48]:
from IPython.display import display, HTML


def predict_tags(sentence, tags, word2idx, max_len, model):
    # Whitespace tokenization: punctuation stays attached to words (e.g. "live.")
    words = sentence.split()
    # Words not in the vocabulary fall back to the padding index
    seq = pad_sequences([[word2idx.get(w, word2idx["ENDPAD"]) for w in words]], maxlen=max_len, padding="post", value=word2idx["ENDPAD"])
    preds = model.predict(seq)
    # Most probable tag index at each position
    preds = np.argmax(preds, axis=-1)
    predicted_tags = [tags[i] for i in preds[0]]
    predictions = list(zip(words, predicted_tags[:len(words)]))
    df_predictions = pd.DataFrame(predictions, columns=["Word", "Tag"])

    # Display the DataFrame as a table
    display(HTML(df_predictions.to_html(index=False)))
    return df_predictions


Predict Tags:

Tokenizes the input sentence, converts it to indices, and pads it to the maximum length. The model predicts tags for each word, which are then converted back to their original form.

Display Results:

Creates a DataFrame from the predictions and displays it as a nicely formatted table in Jupyter.
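Since `predict_tags` splits on whitespace, a token such as "live." keeps its full stop and is unlikely to match any vocabulary entry. A minimal regex-based tokenizer that separates punctuation might look like the sketch below. `simple_tokenize` is a hypothetical helper, and the dataset's own tokenization may differ, for example for contractions or hyphenated words.

```python
import re


def simple_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = simple_tokenize("India is the best place to live.")
print(tokens)
```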

Let's make some predictions on new sentences.

In [49]:
sentence = "India is the best place to live."
predictions = predict_tags(sentence, tags, word2idx, max_len, model)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 564ms/step


Word,Tag
India,B-geo
is,O
the,O
best,O
place,O
to,B-per
live.,O


In [50]:
sentence_2 = "European Union is the biggest organisation."

predictions = predict_tags(sentence_2, tags, word2idx, max_len, model)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step


Word,Tag
European,O
Union,O
is,O
the,O
biggest,O
organisation.,O


In [None]:
sentence_3 = "In Germany and Nigeria, there are lot of other things which are not that good."

predictions = predict_tags(sentence_3, tags, word2idx, max_len, model)




Word,Tag
In,O
Germany,B-org
and,O
"Nigeria,",O
there,O
are,O
lot,O
of,O
other,O
things,O


In [51]:
sentence_4 = "Mosul and Suresh were best friends when they were in Baghdad."

predictions = predict_tags(sentence_4, tags, word2idx, max_len, model)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step


Word,Tag
Mosul,B-geo
and,B-org
Suresh,O
were,O
best,O
friends,O
when,O
they,O
were,O
in,O


In [53]:
sentence_5 = "Germany is a beautiful country.."

predictions = predict_tags(sentence_5, tags, word2idx, max_len, model)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step


Word,Tag
Germany,B-geo
is,O
a,O
beautiful,O
country..,O


In [54]:
sentence_6 = "The Eiffel Tower in Paris, France, is a famous tourist attraction."
predictions = predict_tags(sentence_6, tags, word2idx, max_len, model)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step


Word,Tag
The,O
Eiffel,O
Tower,O
in,O
"Paris,",O
"France,",O
is,O
a,O
famous,O
tourist,O


### Summary :

Preprocessing: Prepare data by tokenizing, encoding, and padding sentences.

Model Building: Build a BiLSTM model using TensorFlow.

Training: Train the model on the preprocessed data.

Prediction: Predict NER tags for new sentences and display results in a tabular format.

By following these steps, we can effectively use a BiLSTM model for Named Entity Recognition, enabling us to identify and classify entities in text.

In [56]:
sentence_8 = "Russia and China signed a new trade agreement."
predictions = predict_tags(sentence_8, tags, word2idx, max_len, model)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step


Word,Tag
Russia,B-geo
and,B-org
China,B-geo
signed,B-org
a,B-org
new,O
trade,B-per
agreement.,O


### Future Steps

1. Use Pre-trained Embeddings:

Incorporate GloVe or BERT embeddings to improve performance.

2. Hyperparameter Tuning:

Experiment with different hyperparameters like batch size, learning rate, number of LSTM units, etc.


3. Ensemble Methods:

Combine predictions from multiple models to improve accuracy.


4. Error Analysis:

Analyze errors to understand common failure cases and address them.

### Approach 3 : Use Pre-trained Embeddings

We'll start by incorporating pre-trained GloVe embeddings into our model to improve its performance.

In [57]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2024-06-19 06:29:15--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-06-19 06:29:15--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-06-19 06:29:16--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [58]:
# Load the embeddings
embedding_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype="float32")
        embedding_index[word] = coefs

embedding_dim = 100
embedding_matrix = np.zeros((len(words), embedding_dim))
for word, i in word2idx.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
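The `glove.6B` vectors use a lowercased vocabulary, so an exact-match lookup for a capitalised word such as "Germany" finds nothing and its row stays all zeros. The sketch below is a hedged variant that falls back to the lowercased form and reports how much of the vocabulary was covered. `build_embedding_matrix` and the toy dictionaries are hypothetical.

```python
import numpy as np


def build_embedding_matrix(word2idx, embedding_index, embedding_dim):
    """Fill one row per word, trying the exact form first and then its lowercase form."""
    matrix = np.zeros((len(word2idx), embedding_dim))
    hits = 0
    for word, i in word2idx.items():
        key = str(word)
        vector = embedding_index.get(key)
        if vector is None:
            vector = embedding_index.get(key.lower())
        if vector is not None:
            matrix[i] = vector
            hits += 1
    return matrix, hits / len(word2idx)


# Toy stand-ins for the GloVe index and the notebook's word2idx (made-up values)
toy_glove = {
    "germany": np.ones(3, dtype="float32"),
    "is": np.full(3, 2.0, dtype="float32"),
}
glove_toy_word2idx = {"Germany": 0, "is": 1, "ENDPAD": 2}
toy_matrix, glove_coverage = build_embedding_matrix(glove_toy_word2idx, toy_glove, embedding_dim=3)
print(f"Vocabulary coverage: {glove_coverage:.0%}")
```

A low coverage figure would suggest that many rows of the frozen embedding layer carry no information.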

Build Model with GloVe Embeddings

In [59]:
from tensorflow.keras.layers import Embedding

model_glove = Sequential([
    Embedding(input_dim=len(words), output_dim=embedding_dim, input_length=max_len, weights=[embedding_matrix], trainable=False),
    Dropout(0.1),
    Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)),
    TimeDistributed(Dense(len(tags), activation="softmax"))
])

model_glove.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model_glove.summary()


Train the Model with GloVe Embeddings

In [60]:
history_glove = model_glove.fit(X_train, np.array(y_train), batch_size=32, epochs=5, verbose=1)


Epoch 1/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 66s 46ms/step - accuracy: 0.9885 - loss: 0.1691
Epoch 2/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 62s 46ms/step - accuracy: 0.9943 - loss: 0.0210
Epoch 3/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 82s 46ms/step - accuracy: 0.9942 - loss: 0.0212
Epoch 4/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 63s 47ms/step - accuracy: 0.9943 - loss: 0.0206
Epoch 5/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 63s 47ms/step - accuracy: 0.9943 - loss: 0.0206


In [61]:
# Evaluate the model
model_glove.evaluate(X_test, np.array(y_test))

150/150 ━━━━━━━━━━━━━━━━━━━━ 2s 12ms/step - accuracy: 0.9945 - loss: 0.0201


[0.02022641897201538, 0.9944704174995422]

In [62]:
sentence_1 = "Germany is one of the main economy in the world."
predictions = predict_tags(sentence_1, tags, word2idx, max_len, model_glove)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 547ms/step


Word,Tag
Germany,O
is,O
one,O
of,O
the,O
main,O
economy,O
in,O
the,O
world.,O


In [63]:
sentence_2 = "In Germany and Nigeria, there are lot of other things which are not that good."
predictions = predict_tags(sentence_2, tags, word2idx, max_len, model_glove)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step


Word,Tag
In,O
Germany,O
and,O
"Nigeria,",O
there,O
are,O
lot,O
of,O
other,O
things,O


### Approach 4 : Hyperparameter Tuning



We will tune hyperparameters such as batch size, learning rate, and the number of LSTM units. We can use tools like Keras Tuner, but for simplicity, let's manually experiment with different configurations.



Define a Function to Build the Model with Hyperparameters

In [64]:
from tensorflow.keras.optimizers import Adam

def build_model(embedding_matrix, lstm_units=100, dropout_rate=0.1, learning_rate=0.001):
    model = Sequential([
        Embedding(input_dim=len(words), output_dim=embedding_dim, input_length=max_len, weights=[embedding_matrix], trainable=False),
        Dropout(dropout_rate),
        Bidirectional(LSTM(units=lstm_units, return_sequences=True, recurrent_dropout=dropout_rate)),
        TimeDistributed(Dense(len(tags), activation="softmax"))
    ])
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    return model


Train and Evaluate Models with Different Hyperparameters

In [65]:
# Example configuration 1
model_hp1 = build_model(embedding_matrix, lstm_units=50, dropout_rate=0.2, learning_rate=0.001)
history_hp1 = model_hp1.fit(X_train, np.array(y_train), batch_size=64, epochs=5, verbose=1)


Epoch 1/5
675/675 ━━━━━━━━━━━━━━━━━━━━ 31s 40ms/step - accuracy: 0.9838 - loss: 0.3712
Epoch 2/5
675/675 ━━━━━━━━━━━━━━━━━━━━ 27s 40ms/step - accuracy: 0.9943 - loss: 0.0260
Epoch 3/5
675/675 ━━━━━━━━━━━━━━━━━━━━ 27s 40ms/step - accuracy: 0.9943 - loss: 0.0223
Epoch 4/5
675/675 ━━━━━━━━━━━━━━━━━━━━ 27s 40ms/step - accuracy: 0.9943 - loss: 0.0210
Epoch 5/5
675/675 ━━━━━━━━━━━━━━━━━━━━ 27s 40ms/step - accuracy: 0.9944 - loss: 0.0205


We can have multiple configurations like this.

In [66]:
# Evaluate the model
model_hp1.evaluate(X_test, np.array(y_test))

150/150 ━━━━━━━━━━━━━━━━━━━━ 2s 8ms/step - accuracy: 0.9945 - loss: 0.0202


[0.020287295803427696, 0.9944704174995422]

In [67]:
sentence_1 = "Germany is one of the main economy in the world."
predictions = predict_tags(sentence_1, tags, word2idx, max_len, model_hp1)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 553ms/step


Word,Tag
Germany,O
is,O
one,O
of,O
the,O
main,O
economy,O
in,O
the,O
world.,O


Repeat this for other configurations and compare their performance on held-out validation data (for example, by passing `validation_split` to `fit`, as in Approach 2).
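A manual search over several configurations can be organised as a small loop. The sketch below is an assumption-laden outline: `iter_configs` and `pick_best` are hypothetical helpers, and `dummy_score` is a placeholder that stands in for training a model and reading its validation accuracy. It does not produce real results.

```python
from itertools import product

search_space = {
    "lstm_units": [50, 100],
    "dropout_rate": [0.1, 0.2],
    "learning_rate": [0.001, 0.01],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def pick_best(configs, score_fn):
    """Return (best_config, best_score) for the config with the highest score."""
    scored = [(cfg, score_fn(cfg)) for cfg in configs]
    return max(scored, key=lambda pair: pair[1])


# Placeholder scorer so the sketch runs on its own; NOT a real training result.
def dummy_score(cfg):
    return -abs(cfg["lstm_units"] - 100) - abs(cfg["dropout_rate"] - 0.2)


configs = list(iter_configs(search_space))
best_config, best_score = pick_best(configs, dummy_score)
print(len(configs), best_config)
```

In the notebook, the scorer would instead call `build_model(embedding_matrix, **cfg)`, fit it with a validation split, and return the best `val_accuracy` from the training history.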

### Approach 5 : Ensemble Methods




Combining predictions from multiple models can improve accuracy. We'll average the probabilities from different models.
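Averaging works on the per-token probability arrays that each model outputs, which have shape (batch, max_len, number of tags). A minimal NumPy sketch with made-up probabilities for two models (`average_probabilities` is a hypothetical helper):

```python
import numpy as np


def average_probabilities(prob_list):
    """Average per-token class probabilities from several models, then pick the top class."""
    mean_probs = np.mean(np.stack(prob_list), axis=0)
    return mean_probs, np.argmax(mean_probs, axis=-1)


# Made-up outputs of two models for one 2-token sentence over 3 tags
probs_a = np.array([[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]])
probs_b = np.array([[[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]])

mean_probs, ensemble_ids = average_probabilities([probs_a, probs_b])
print(ensemble_ids)
```

In this toy case the first model alone would choose tags 0 and 1, while the averaged probabilities select tags 1 and 2, which shows how a more confident model can outweigh a less confident one.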

Train Multiple Models

In [69]:
# Example model 1
model1 = build_model(embedding_matrix, lstm_units=100, dropout_rate=0.1, learning_rate=0.001)
history1 = model1.fit(X_train, np.array(y_train), batch_size=32, epochs=5, verbose=1)



Epoch 1/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 67s 46ms/step - accuracy: 0.9885 - loss: 0.1673
Epoch 2/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 62s 46ms/step - accuracy: 0.9943 - loss: 0.0212
Epoch 3/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 62s 46ms/step - accuracy: 0.9943 - loss: 0.0206
Epoch 4/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 82s 46ms/step - accuracy: 0.9943 - loss: 0.0206
Epoch 5/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 62s 46ms/step - accuracy: 0.9943 - loss: 0.0206


In [None]:
# Example model 2
model2 = build_model(embedding_matrix, lstm_units=150, dropout_rate=0.2, learning_rate=0.001)
history2 = model2.fit(X_train, np.array(y_train), batch_size=32, epochs=5, verbose=1)


Epoch 1/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 95s 67ms/step - accuracy: 0.9885 - loss: 0.1423
Epoch 2/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 143s 68ms/step - accuracy: 0.9943 - loss: 0.0209
Epoch 3/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 91s 68ms/step - accuracy: 0.9943 - loss: 0.0207
Epoch 4/5
1349/1349 ━━━━━━━━━━━━━━━━━━━━ 92s 68ms/step - accuracy: 0.9943 - loss: 0.0207
Epoch 5/5
 344/1349 ━━━━━━━━━━━━━━━━━━━━ 1:07 68ms/step - accuracy: 0.9941 - loss: 0.0210


Ensemble Predictions

In [None]:
def ensemble_predict(models, sentence, tags, word2idx, max_len):
    """Average per-token class probabilities across models and return (word, tag) pairs."""
    words = sentence.split()
    # Unknown words fall back to the padding index
    pad_idx = word2idx["ENDPAD"]
    seq = pad_sequences([[word2idx.get(w, pad_idx) for w in words]], maxlen=max_len, padding="post", value=pad_idx)

    # Sum the probability tensors of shape (1, max_len, len(tags)) from all models
    total_preds = np.zeros((1, max_len, len(tags)))
    for model in models:
        total_preds += model.predict(seq)

    # Average, then pick the most probable tag at each position
    avg_preds = total_preds / len(models)
    best_idx = np.argmax(avg_preds, axis=-1)
    predicted_tags = [tags[i] for i in best_idx[0]]
    # Drop the predictions for padded positions
    return list(zip(words, predicted_tags[:len(words)]))

In [None]:
# Ensemble prediction
from IPython.display import display, HTML

sentence = "Mark and John are good friends from London."
models = [model1, model2]
predictions = ensemble_predict(models, sentence, tags, word2idx, max_len)

# Display results
df_predictions = pd.DataFrame(predictions, columns=["Word", "Tag"])
display(HTML(df_predictions.to_html(index=False)))

### Approach 6 : Error Analysis

##### Identify errors

In [None]:
def evaluate_and_analyze(model, X_test, y_test, idx2tag):
    """Collect (sentence index, word index, true tag, predicted tag) for mispredicted tokens."""
    # Convert predicted probabilities and one-hot labels to tag indices
    preds = model.predict(X_test)
    preds = np.argmax(preds, axis=-1)
    y_true = np.argmax(y_test, axis=-1)

    errors = []
    for i in range(len(y_true)):
        for j in range(len(y_true[i])):
            # Record a mismatch only when the true tag index is non-zero
            if y_true[i][j] != preds[i][j] and y_true[i][j] != 0:
                errors.append((i, j, idx2tag[y_true[i][j]], idx2tag[preds[i][j]]))

    return errors

# Invert the tag-to-index mapping so indices can be shown as tag names
idx2tag = {i: t for t, i in tag2idx.items()}
errors = evaluate_and_analyze(model_glove, X_test, y_test, idx2tag)

# Display errors
error_df = pd.DataFrame(errors, columns=["Sentence Index", "Word Index", "True Tag", "Predicted Tag"])
display(HTML(error_df.to_html(index=False)))
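
Once the errors are collected, counting them per (true tag, predicted tag) pair shows which confusions dominate. The sketch below uses a hypothetical `errors` list in the same tuple format as above; the entries are invented for illustration and are not results from this notebook.

```python
from collections import Counter

# Hypothetical errors: (sentence index, word index, true tag, predicted tag)
errors = [
    (0, 3, "B-geo", "O"),
    (0, 7, "B-per", "O"),
    (2, 1, "B-geo", "O"),
    (5, 4, "B-org", "B-geo"),
    (5, 5, "I-org", "O"),
]

# Count how often each (true, predicted) confusion occurs
confusions = Counter((true_tag, pred_tag) for _, _, true_tag, pred_tag in errors)
for (true_tag, pred_tag), count in confusions.most_common():
    print(f"{true_tag} -> {pred_tag}: {count}")

# Entities the model missed entirely (predicted as O)
missed = sum(c for (_, pred_tag), c in confusions.items() if pred_tag == "O")
print("Missed entities:", missed)
```

A large share of `X -> O` pairs would suggest the model is defaulting to the majority class, which token-level accuracy alone does not reveal.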


### Approach 7 : Using Large Language Models

In [1]:

!pip install transformers datasets evaluate "transformers[torch]"


In [None]:
 # Creating HuggingFace Dataset first.

In [3]:
data = pd.read_csv("ner_dataset.csv",encoding='latin1')



In [4]:
import pandas as pd

# Forward-fill 'Sentence #' so every row carries its sentence id
# (only the first word of each sentence has it in the raw file)
data['Sentence #'] = data['Sentence #'].ffill()

# Replace remaining NaN values in other columns (if any) with empty strings
data = data.fillna('')

# Lists that collect one entry per sentence
sentences = []
tags = []

# Group rows by sentence id and build one record per sentence
for sentence_id, group in data.groupby('Sentence #'):
    # Join the words into a single space-separated sentence
    sentence = ' '.join(group['Word'].tolist())

    # Map each word to its tag.
    # Note: a dict keeps one entry per distinct word, so a word that occurs
    # more than once in a sentence keeps only the tag of its last occurrence
    sentence_tags = {}
    for word, tag in zip(group['Word'], group['Tag']):
        sentence_tags[word] = tag

    # Append the sentence and its tags to the lists
    sentences.append(sentence)
    tags.append(sentence_tags)

# Create a new DataFrame with one row per sentence
df_new = pd.DataFrame({
    'sentence': sentences,
    'tag': tags
})
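
Because the tags are stored in a dict keyed by word, repeated words within a sentence collapse into one entry. A possible alternative, sketched here on a tiny made-up table, is to keep tokens and tags as two parallel lists so that every position keeps its own tag. The column names `tokens` and `tags` and the toy data are assumptions for illustration.

```python
import pandas as pd

# Tiny stand-in for the word-level table (the real one comes from ner_dataset.csv)
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1"] * 5,
    "Word": ["the", "economy", "of", "the", "Netherlands"],
    "Tag": ["O", "O", "O", "B-geo", "I-geo"],
})

# One record per sentence, with position-aligned token and tag lists
rows = []
for sentence_id, group in toy.groupby("Sentence #", sort=False):
    rows.append({
        "tokens": group["Word"].tolist(),
        "tags": group["Tag"].tolist(),
    })

# For comparison: the dict form keeps only one tag for the repeated word "the"
as_dict = dict(zip(rows[0]["tokens"], rows[0]["tags"]))
print(len(rows[0]["tokens"]), len(as_dict))
```

The parallel-list form keeps all five positions, while the dict form ends up with four keys and the tag of the second "the".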





In [6]:
df_new.to_csv("dataset_for_llm.csv")

In [7]:
data = pd.read_csv("dataset_for_llm.csv")

In [9]:
data.drop(columns=['Unnamed: 0'], inplace=True)

In [10]:
data

Unnamed: 0,sentence,tag
0,Thousands of demonstrators have marched throug...,"{'Thousands': 'O', 'of': 'O', 'demonstrators':..."
1,Iranian officials say they expect to get acces...,"{'Iranian': 'B-gpe', 'officials': 'O', 'say': ..."
2,Helicopter gunships Saturday pounded militant ...,"{'Helicopter': 'O', 'gunships': 'O', 'Saturday..."
3,They left after a tense hour-long standoff wit...,"{'They': 'O', 'left': 'O', 'after': 'O', 'a': ..."
4,U.N. relief coordinator Jan Egeland said Sunda...,"{'U.N.': 'B-geo', 'relief': 'O', 'coordinator'..."
...,...,...
47954,Opposition leader Mir Hossein Mousavi has said...,"{'Opposition': 'O', 'leader': 'O', 'Mir': 'O',..."
47955,"On Thursday , Iranian state media published a ...","{'On': 'O', 'Thursday': 'B-tim', ',': 'O', 'Ir..."
47956,"Following Iran 's disputed June 12 elections ,...","{'Following': 'O', 'Iran': 'B-geo', ""'s"": 'O',..."
47957,"Since then , authorities have held public tria...","{'Since': 'O', 'then': 'O', ',': 'O', 'authori..."


In [12]:
from sklearn.model_selection import train_test_split

# Split into train, validation, and test sets
train_df, test_df = train_test_split(data, test_size=0.3, random_state=42)
valid_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)

# Reset index
train_df = train_df.reset_index(drop=True)
valid_df = valid_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
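
As a sanity check on the resulting 70/15/15 split, the sizes can be worked out by hand. The sketch assumes 47,959 sentences in total (the sum of the three split sizes reported by `dataset_dict` further down) and mirrors how scikit-learn sizes a split when only a fractional `test_size` is given: the test share is rounded up and the remainder goes to the training side. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# First split: 70% train, 30% held out
n_train, n_holdout = split_sizes(47959, 0.3)
# Second split: the held-out part is halved into validation and test
n_valid, n_test = split_sizes(n_holdout, 0.5)
print(n_train, n_valid, n_test)
```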


In [13]:
train_df

Unnamed: 0,sentence,tag
0,Mousavi 's website also quotes him as criticiz...,"{'Mousavi': 'B-per', ""'s"": 'O', 'website': 'O'..."
1,The billboards contain photographs of hooded I...,"{'The': 'O', 'billboards': 'O', 'contain': 'O'..."
2,Uganda is the only country so far to agree to ...,"{'Uganda': 'B-org', 'is': 'O', 'the': 'O', 'on..."
3,Fighting between the Popular Movement for the ...,"{'Fighting': 'O', 'between': 'O', 'the': 'O', ..."
4,"Meanwhile , officials in Ukraine have reported...","{'Meanwhile': 'O', ',': 'O', 'officials': 'O',..."
...,...,...
33566,During an address Wednesday marking the Muslim...,"{'During': 'O', 'an': 'O', 'address': 'O', 'We..."
33567,General Abizaid made the remarks during a brie...,"{'General': 'B-org', 'Abizaid': 'I-org', 'made..."
33568,Milosevic had been on trial at the United Nati...,"{'Milosevic': 'B-per', 'had': 'O', 'been': 'O'..."
33569,Lieberman introduced the bill with Republican ...,"{'Lieberman': 'B-per', 'introduced': 'O', 'the..."


In [14]:
valid_df

Unnamed: 0,sentence,tag
0,"In 1784 , the French sold the island to Sweden...","{'In': 'O', '1784': 'B-tim', ',': 'O', 'the': ..."
1,Democrat Tom Daschle contradicts President Bus...,"{'Democrat': 'O', 'Tom': 'B-per', 'Daschle': '..."
2,Polls indicate that if early elections are hel...,"{'Polls': 'O', 'indicate': 'O', 'that': 'O', '..."
3,Local residents say the student crowds are sma...,"{'Local': 'O', 'residents': 'O', 'say': 'O', '..."
4,A group of Somali ministers walked out of a me...,"{'A': 'O', 'group': 'O', 'of': 'O', 'Somali': ..."
...,...,...
7189,Qatar had been the only Gulf Arab state to hav...,"{'Qatar': 'B-geo', 'had': 'O', 'been': 'O', 't..."
7190,"In Afghanistan , the prime minister rejected c...","{'In': 'O', 'Afghanistan': 'B-geo', ',': 'O', ..."
7191,The attack came after the Congolese government...,"{'The': 'O', 'attack': 'O', 'came': 'O', 'afte..."
7192,"In a separate incident today , authorities sai...","{'In': 'O', 'a': 'O', 'separate': 'O', 'incide..."


In [15]:
test_df

Unnamed: 0,sentence,tag
0,"From 2004 to 2007 , the economy grew about 10 ...","{'From': 'B-tim', '2004': 'I-tim', 'to': 'I-ti..."
1,"Earlier this week , the African Union dispatch...","{'Earlier': 'O', 'this': 'O', 'week': 'O', ','..."
2,China 's state news agency says scientists hav...,"{'China': 'B-geo', ""'s"": 'O', 'state': 'O', 'n..."
3,He said Americans are thankful for their sacri...,"{'He': 'O', 'said': 'O', 'Americans': 'B-gpe',..."
4,The letter fueled charges of racism and was co...,"{'The': 'O', 'letter': 'O', 'fueled': 'O', 'ch..."
...,...,...
7189,Officials say money from the fines will go to ...,"{'Officials': 'O', 'say': 'O', 'money': 'O', '..."
7190,A cease-fire was reached in 1991 .,"{'A': 'O', 'cease-fire': 'O', 'was': 'O', 'rea..."
7191,He denies any involvement .,"{'He': 'O', 'denies': 'O', 'any': 'O', 'involv..."
7192,Union head Roger Toussaint calls the dispute a...,"{'Union': 'O', 'head': 'O', 'Roger': 'B-per', ..."


In [2]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [20]:
from datasets import DatasetDict, Dataset
# Create Datasets from DataFrames
train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)
test_dataset = Dataset.from_pandas(test_df)

# Create DatasetDict
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': valid_dataset,
    'test': test_dataset,
})

# Print dataset_dict information
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['sentence', 'tag'],
        num_rows: 33571
    })
    validation: Dataset({
        features: ['sentence', 'tag'],
        num_rows: 7194
    })
    test: Dataset({
        features: ['sentence', 'tag'],
        num_rows: 7194
    })
})

Loading the model and tokenizer

In [6]:
import os

os.environ['HF_TOKEN'] = ""

In [4]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Load model directly
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2")


In [2]:
pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.16.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tf_keras-2.16.0-py3-none-any.whl (1.7 MB)
Installing collected packages: tf-keras
Successfully installed tf-keras-2.16.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
sample = dataset_dict['test'][0]['sentence']
label = dataset_dict['test'][0]['tag']


     

In [25]:
sample

'From 2004 to 2007 , the economy grew about 10 % per year , driven largely by an expansion in the garment sector , construction , agriculture , and tourism .'

In [26]:
label

"{'From': 'B-tim', '2004': 'I-tim', 'to': 'I-tim', '2007': 'I-tim', ',': 'O', 'the': 'O', 'economy': 'O', 'grew': 'O', 'about': 'O', '10': 'O', '%': 'O', 'per': 'O', 'year': 'O', 'driven': 'O', 'largely': 'O', 'by': 'O', 'an': 'O', 'expansion': 'O', 'in': 'O', 'garment': 'O', 'sector': 'O', 'construction': 'O', 'agriculture': 'O', 'and': 'O', 'tourism': 'O', '.': 'O'}"
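
The label is displayed in quotes because the dict column was written to CSV and read back, which turns each dict into its string representation. If the mapping is needed as a dict again, `ast.literal_eval` from the standard library can parse it. The sketch uses a shortened copy of the label above.

```python
import ast

# Shortened copy of the stringified label shown above
label = "{'From': 'B-tim', '2004': 'I-tim', 'to': 'I-tim', '2007': 'I-tim', ',': 'O'}"

# literal_eval parses Python literals only, so it is safer than eval() here
label_dict = ast.literal_eval(label)
print(type(label_dict).__name__, label_dict["2004"])
```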

In [5]:

sample = """

Here is the sentence : 
'From 2004 to 2007 , the economy grew about 10 % per year , driven largely by an expansion in the garment sector , construction , agriculture , and tourism .'

Words tags are :
{'From': 'B-tim', '2004': 'I-tim', 'to': 'I-tim', '2007': 'I-tim', ',': 'O', 'the': 'O', 'economy': 'O', 'grew': 'O', 'about': 'O', '10': 'O', '%': 'O', 'per': 'O', 'year': 'O', 'driven': 'O', 'largely': 'O', 'by': 'O', 'an': 'O', 'expansion': 'O', 'in': 'O', 'garment': 'O', 'sector': 'O', 'construction': 'O', 'agriculture': 'O', 'and': 'O', 'tourism': 'O', '.': 'O'}

Here is the sentence : 
'In Germany, life is better compared to Austria.'

Words tags are :

"""


pipe(sample, max_length=500, num_return_sequences=1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "\n\nHere is the sentence : \n'From 2004 to 2007 , the economy grew about 10 % per year , driven largely by an expansion in the garment sector , construction , agriculture , and tourism .'\n\nWords tags are :\n{'From': 'B-tim', '2004': 'I-tim', 'to': 'I-tim', '2007': 'I-tim', ',': 'O', 'the': 'O', 'economy': 'O', 'grew': 'O', 'about': 'O', '10': 'O', '%': 'O', 'per': 'O', 'year': 'O', 'driven': 'O', 'largely': 'O', 'by': 'O', 'an': 'O', 'expansion': 'O', 'in': 'O', 'garment': 'O', 'sector': 'O', 'construction': 'O', 'agriculture': 'O', 'and': 'O', 'tourism': 'O', '.': 'O'}\n\nHere is the sentence : \n'In Germany, life is better compared to Austria.'\n\nWords tags are :\n\n\n{'From': 'Sueffel', '2004': 'Toilet-googler', 'to': 'Toilet-googler', '2006': 'Ribbons', 'in': 'Toilet-googler', '2007': 'Tobacco', ',': 'Diesel', '2003': 'Robos', 'in': 'toilet-googler', '2007': 'Aurora Beer', ',': 'Theatrical', '.,,': 'Tobacco', '2009': 'Spoon', '.: 'toilet-googler', '2010': 'S
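
The generation above continues the pattern with invented tags instead of labeling the new sentence. To experiment with the number of labeled examples in the prompt, it can be built programmatically. This is a minimal sketch: `build_few_shot_prompt` is a hypothetical helper and the example sentence with its tags is made up. It only makes the prompt easier to vary and is not expected to turn a base GPT-2 into a reliable tagger.

```python
def build_few_shot_prompt(examples, query):
    """Format (sentence, tags) examples followed by an unlabeled query sentence."""
    parts = []
    for sentence, tags in examples:
        parts.append(f"Here is the sentence :\n'{sentence}'\n\nWords tags are :\n{tags}\n")
    # The query ends with the same cue so the model is prompted to continue with tags
    parts.append(f"Here is the sentence :\n'{query}'\n\nWords tags are :\n")
    return "\n".join(parts)


# Hypothetical labeled example in the same word-to-tag format used above
examples = [
    ("Paris is in France .", {"Paris": "B-geo", "is": "O", "in": "O", "France": "B-geo", ".": "O"}),
]
prompt = build_few_shot_prompt(examples, "In Germany, life is better compared to Austria.")
print(prompt)
```

The resulting string can be passed to `pipe(...)` in place of the hand-written `sample`.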

In [30]:

output = generate_summary(sample, llm=model)
print("Sample")
print(sample)
print("-------------------")
print("Model Generated Tags:")
print(output)
print("-------------------")
print("Correct Tags:")
print(label)

Sample
From 2004 to 2007 , the economy grew about 10 % per year , driven largely by an expansion in the garment sector , construction , agriculture , and tourism .
-------------------
Model Generated Tags:
From 2004 to 2007, the economy grew about 10 % per year, driven largely by an expansion in the garment sector, construction, agriculture, and tourism.
-------------------
Correct Tags:
{'From': 'B-tim', '2004': 'I-tim', 'to': 'I-tim', '2007': 'I-tim', ',': 'O', 'the': 'O', 'economy': 'O', 'grew': 'O', 'about': 'O', '10': 'O', '%': 'O', 'per': 'O', 'year': 'O', 'driven': 'O', 'largely': 'O', 'by': 'O', 'an': 'O', 'expansion': 'O', 'in': 'O', 'garment': 'O', 'sector': 'O', 'construction': 'O', 'agriculture': 'O', 'and': 'O', 'tourism': 'O', '.': 'O'}
