# Project Sections

1.   Data Collection
2.   Business understanding
3.   Data Annotation
4.   EDA Lifecycle
5.   Model experiments/improvement (Python code)


**Data Collection:**

Data is collected as sentences in a CSV file. Source of the data: sapient.

**Metadata understanding:**

Each sentence is split into words, and each word has a corresponding POS tag and NER tag.

**Business understanding:**

  The scope of the problem is limited to finding the NER tag of each word in a given sentence.

**Data Annotation:**

Each word in every sentence is annotated with an NER tag.

## NER Entity information


    geo = Geographical Entity
    org = Organization
    per = Person
    gpe = Geopolitical Entity
    tim = Time indicator
    art = Artifact
    eve = Event
    nat = Natural Phenomenon

## Components of IOB2 Format

- **B-**: Indicates the Beginning of an entity.
- **I-**: Indicates the Inside of an entity.
- **O**: Indicates that the token is Outside of any entity.

## How It Works

1. **Entity Start**: Tokens that start an entity are tagged with `B-` followed by the entity type (e.g., `B-geo` for a geographical entity, `B-per` for a person).
2. **Entity Continuation**: Tokens that are part of the entity but not at the start are tagged with `I-` followed by the same entity type.
3. **Non-Entity Tokens**: Tokens that do not belong to any entity are tagged with `O`.
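
The three rules above can be sketched as a small tagging function. This is an illustration only: `to_iob2` and its `(start, end, entity_type)` token-index spans are hypothetical and are not part of the dataset or of spaCy.

```python
def to_iob2(tokens, spans):
    """Tag tokens with IOB2 labels.

    `spans` holds (start, end, entity_type) token indices, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)                # rule 3: tokens are outside by default
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"      # rule 1: first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # rule 2: remaining tokens of the entity
    return tags


tokens = ["a", "rally", "in", "Hyde", "Park", "."]
tags = to_iob2(tokens, [(3, 5, "geo")])
print(list(zip(tokens, tags)))
```

Here `Hyde` receives `B-geo`, `Park` receives `I-geo`, and every other token receives `O`.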

Mount the Google Drive folder where the dataset is located

In [28]:
from google.colab import drive
drive.mount('/content/drive')
dataset_path = '/content/drive/MyDrive/sapient'
dataset_name = 'ner_dataset.csv'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Import all the necessary libraries

In [29]:
from __future__ import unicode_literals, print_function

import csv
import json
import os
import random
import string
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans
from tqdm import tqdm
from sklearn.model_selection import train_test_split

In [30]:
ner_dataset_path = os.path.join(dataset_path, dataset_name)

In [31]:
# Reading CSV file using csv.reader
file_path = ner_dataset_path
with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
    reader = csv.reader(file)
    data = list(reader)

In [None]:
data[0:20]

[['Sentence #', 'Word', 'POS', 'Tag'],
 ['Sentence: 1', 'Thousands', 'NNS', 'O'],
 ['', 'of', 'IN', 'O'],
 ['', 'demonstrators', 'NNS', 'O'],
 ['', 'have', 'VBP', 'O'],
 ['', 'marched', 'VBN', 'O'],
 ['', 'through', 'IN', 'O'],
 ['', 'London', 'NNP', 'B-geo'],
 ['', 'to', 'TO', 'O'],
 ['', 'protest', 'VB', 'O'],
 ['', 'the', 'DT', 'O'],
 ['', 'war', 'NN', 'O'],
 ['', 'in', 'IN', 'O'],
 ['', 'Iraq', 'NNP', 'B-geo'],
 ['', 'and', 'CC', 'O'],
 ['', 'demand', 'VB', 'O'],
 ['', 'the', 'DT', 'O'],
 ['', 'withdrawal', 'NN', 'O'],
 ['', 'of', 'IN', 'O'],
 ['', 'British', 'JJ', 'B-gpe']]

In [32]:
def parse_csv(file_path):
    sentences = []
    current_sentence = []

    with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
        reader = list(csv.reader(file))[1:]
        for row in reader:
            if row[0].startswith('Sentence:'):
                if current_sentence:
                    sentences.append(current_sentence)
                    current_sentence = []
                # Append the first token of the new sentence
                if row[1]:
                    current_sentence.append((row[1], row[3]))
            else:
                if row[1]:  # Ignore rows with empty token
                    current_sentence.append((row[1], row[3]))

        if current_sentence:
            sentences.append(current_sentence)

    return sentences

In [33]:
sentences = parse_csv(file_path)

In [None]:
len(sentences)

47959

In [None]:
sentences[2]

[('They', 'O'),
 ('marched', 'O'),
 ('from', 'O'),
 ('the', 'O'),
 ('Houses', 'O'),
 ('of', 'O'),
 ('Parliament', 'O'),
 ('to', 'O'),
 ('a', 'O'),
 ('rally', 'O'),
 ('in', 'O'),
 ('Hyde', 'B-geo'),
 ('Park', 'I-geo'),
 ('.', 'O')]

In [34]:
def sentences_to_dataframe(sentences):
    data = []
    for i, sentence in enumerate(sentences):
        for token, ner in sentence:
            data.append([i, token, ner])
    df = pd.DataFrame(data, columns=['Sentence #', 'Token', 'NER'])
    return df

In [35]:
df = sentences_to_dataframe(sentences)

In [36]:
print(df.head())

   Sentence #          Token NER
0           0      Thousands   O
1           0             of   O
2           0  demonstrators   O
3           0           have   O
4           0        marched   O


In [None]:
# Display basic information about the DataFrame
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048554 entries, 0 to 1048553
Data columns (total 3 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  1048554 non-null  int64 
 1   Token       1048554 non-null  object
 2   NER         1048554 non-null  object
dtypes: int64(1), object(2)
memory usage: 24.0+ MB
None


In [37]:
# Display the distribution of NER tags
ner_counts = df['NER'].value_counts().reset_index()
ner_counts.columns = ['NER', 'Count']

In [38]:
fig = px.bar(ner_counts, x='NER', y='Count', title='Distribution of NER Tags')
fig.show()

## Number of tokens per sentence

In [None]:
# Display the number of tokens per sentence
tokens_per_sentence = df.groupby('Sentence #')['Token'].count().reset_index()
tokens_per_sentence.columns = ['Sentence #', 'Token Count']

fig = px.histogram(tokens_per_sentence, x='Token Count', nbins=20, title='Distribution of Tokens per Sentence')
fig.show()

## NER tag distribution in tabular format

In [None]:
# Plotly table for NER tag distribution
fig = go.Figure(data=[go.Table(
    header=dict(values=list(ner_counts.columns),
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[ner_counts['NER'], ner_counts['Count']],
               fill_color='lavender',
               align='left'))
])

fig.update_layout(title='NER Tag Distribution Table')
fig.show()

## Summary statistics of tokens per sentence

In [None]:
tokens_stats = tokens_per_sentence.describe().reset_index()
tokens_stats

   index    Sentence #  Token Count
0  count       47959.0      47959.0
1   mean       23979.0     21.86355
2    std  13844.715117     7.963066
3    min           0.0          1.0
4    25%       11989.5         16.0
5    50%       23979.0         21.0
6    75%       35968.5         27.0
7    max       47958.0        104.0


## Top N Tokens with Most Frequent NER Tags


In [None]:
top_n = 10
top_tokens = df[df['NER'] != 'O']['Token'].value_counts().nlargest(top_n).reset_index()
top_tokens.columns = ['Token', 'Count']

fig = px.bar(top_tokens, x='Token', y='Count', title=f'Top {top_n} Tokens with Most Frequent NER Tags')
fig.show()

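## Average Sentence Length by NER Tag

In [None]:
# Total token count of each sentence
sentence_lengths = df.groupby('Sentence #')['Token'].count()
# One row per (tag, sentence) pair so that each sentence counts once per tag
tag_sentences = df[['NER', 'Sentence #']].drop_duplicates().copy()
tag_sentences['Sentence Length'] = tag_sentences['Sentence #'].map(sentence_lengths)
# Average length of the sentences in which each tag occurs
ner_sentence_lengths = tag_sentences.groupby('NER')['Sentence Length'].mean().reset_index()
ner_sentence_lengths.columns = ['NER', 'Average Sentence Length']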

fig = px.bar(ner_sentence_lengths, x='NER', y='Average Sentence Length', title='Average Sentence Length by NER Tag')
fig.show()

## Unique Tokens and Their Frequency

In [None]:

unique_tokens = df['Token'].value_counts().reset_index()
unique_tokens.columns = ['Token', 'Count']

fig = px.bar(unique_tokens.head(20), x='Token', y='Count', title='Top 20 Unique Tokens by Frequency')
fig.show()

In [None]:
def create_cooccurrence_matrix(df):
    """Count how often two entity tags appear together in the same sentence."""
    cooccurrence = Counter()
    for _, group in df.groupby('Sentence #'):
        # Drop 'O' up front so that only entity tags are paired
        tags = [tag for tag in group['NER'].values if tag != 'O']
        for pair in combinations(tags, 2):
            cooccurrence[pair] += 1
    return cooccurrence

## Co-occurrence of NER Tags

In [None]:
cooccurrence_matrix = create_cooccurrence_matrix(df)

# Convert to DataFrame for Plotly
cooccurrence_df = pd.DataFrame(list(cooccurrence_matrix.items()), columns=['Pair', 'Count'])
cooccurrence_df['NER1'] = cooccurrence_df['Pair'].apply(lambda x: x[0])
cooccurrence_df['NER2'] = cooccurrence_df['Pair'].apply(lambda x: x[1])

fig = px.imshow(pd.pivot_table(cooccurrence_df, values='Count', index='NER1', columns='NER2', fill_value=0),
                title='NER Tag Co-occurrence Matrix')
fig.show()

In [None]:
pivot_table = pd.pivot_table(cooccurrence_df, values='Count', index='NER1', columns='NER2', fill_value=0)
print("Co-occurrence Matrix:")
print(pivot_table)

Co-occurrence Matrix:
NER2   B-art  B-eve  B-geo  B-gpe  B-nat  B-org  B-per  B-tim  I-art  I-eve  \
NER1                                                                          
B-art    100      4    136     38      1     67     64    120    341      2   
B-eve      4     27    165     32      1     44     27     86      3    271   
B-geo     80    151  19364   4911     69   5696   3818   8811     44     93   
B-gpe     61     77   8383   3778     23   3491   4002   4471     41     45   
B-nat      0      0     68      8     22     21      3     34      0      0   
B-org    114     90  10176   3131     46   5398   3908   6109     82     76   
B-per    127    106   9316   2879     27   4175   4179   5723    130     98   
B-tim     87     94   8451   2765     31   3649   2746   3566     85     77   
I-art     46      2     93     27      0     39     38     81    219      0   
I-eve      5     27    139     33      1     36     24     77      5    121   
I-geo     17     29   3672    

In [None]:
# Get unique tags from the DataFrame
unique_tags = df['NER'].unique()
tag_to_class = {tag: i for i, tag in enumerate(sorted(unique_tags))}

print("Unique tags and their class numbers:", tag_to_class)

Unique tags and their class numbers: {'B-art': 0, 'B-eve': 1, 'B-geo': 2, 'B-gpe': 3, 'B-nat': 4, 'B-org': 5, 'B-per': 6, 'B-tim': 7, 'I-art': 8, 'I-eve': 9, 'I-geo': 10, 'I-gpe': 11, 'I-nat': 12, 'I-org': 13, 'I-per': 14, 'I-tim': 15, 'O': 16}


## Function to convert IOB2 to spaCy format

In [39]:

def iob_to_spacy_format(sentences):
    spacy_format = []
    for sentence in sentences:
        text = " ".join([token for token, _ in sentence])
        entities = []
        offset = 0
        for token, label in sentence:
            if label != 'O':
                start = text.find(token, offset)
                end = start + len(token)
                entities.append((start, end, label))
            offset += len(token) + 1  # +1 for the space
        spacy_format.append((text, {'entities': entities}))
    return spacy_format

In [40]:
spacy_format = iob_to_spacy_format(sentences)

In [41]:
spacy_format[0]

('Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .',
 {'entities': [(48, 54, 'B-geo'), (77, 81, 'B-geo'), (111, 118, 'B-gpe')]})

In [43]:
spacy_format[0][0][48:54]

'London'
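
The conversion above emits one span per tagged token and keeps the `B-`/`I-` prefix in the label. As a point of comparison, here is a minimal sketch of an alternative that merges a `B-` token and the `I-` tokens that follow it into a single character span labelled with the bare entity type. `iob_to_merged_spans` is a hypothetical helper, not the function used in this notebook, and it assumes tokens are joined by single spaces as in `iob_to_spacy_format`.

```python
def iob_to_merged_spans(sentence):
    """Convert [(token, iob2_tag), ...] into (text, [(start, end, entity_type), ...])."""
    text = " ".join(token for token, _ in sentence)
    spans = []
    current = None   # [start, end, entity_type] of the entity being built
    offset = 0
    for token, tag in sentence:
        start, end = offset, offset + len(token)
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and current and current[2] == entity_type:
            current[1] = end                  # extend the open entity
        else:
            if current:
                spans.append(tuple(current))  # close the previous entity
            # B- opens a new entity; a stray I- is treated as an opening tag too
            current = [start, end, entity_type] if prefix in ("B", "I") else None
        offset = end + 1                      # +1 for the joining space
    if current:
        spans.append(tuple(current))
    return text, spans


sample = [("rally", "O"), ("in", "O"), ("Hyde", "B-geo"), ("Park", "I-geo"), (".", "O")]
text, spans = iob_to_merged_spans(sample)
print(text, spans)
```

With this representation `Hyde Park` becomes one `geo` entity, so evaluation would score whole entities instead of individual `B-`/`I-` tokens.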

## Split data to train/val/test

In [48]:
def ensure_labels_in_train(spacy_format):
    unique_labels = set()
    for _, annotations in spacy_format:
        for _, _, label in annotations['entities']:
            unique_labels.add(label)

    # Collect examples ensuring all labels are present in training
    label_to_examples = {label: [] for label in unique_labels}
    for example in spacy_format:
        text, annotations = example
        labels = set(label for _, _, label in annotations['entities'])
        for label in labels:
            label_to_examples[label].append(example)

    # Collect at least one example for each label
    train_set = []
    for examples in label_to_examples.values():
        train_set.append(random.choice(examples))

    # Collect the rest of the examples
    rest_set = [example for example in spacy_format if example not in train_set]

    return train_set, rest_set

def split_data(rest_set):
    train_rest, test_set = train_test_split(rest_set, test_size=0.2, random_state=42)
    train_set, val_set = train_test_split(train_rest, test_size=0.1/(1-0.1), random_state=42)
    return train_set, val_set, test_set

In [49]:
def load_json_split(folder_path):
    """Load the train/val/test splits saved as `<name>_split.json` files."""
    split_names = ['train', 'val', 'test']
    sets = []
    for split_name in split_names:
        # File names must match the ones written when the splits are saved
        split_file_path = os.path.join(folder_path, f'{split_name}_split.json')
        with open(split_file_path, 'r') as f:
            data = json.load(f)
        data = [(item['text'], {'entities': item['entities']}) for item in data]
        sets.append(data)
    return sets

# if not os.path.exists('/content/drive/MyDrive/sapient/train_split.json'):
train_set, rest_set = ensure_labels_in_train(spacy_format)
train_set, val_set, test_set = split_data(rest_set)
final_train_set = train_set + val_set
# else:
#   train_set,val_set,test_set = load_json_split('/content/drive/MyDrive/sapient')
#   final_train_set = train_set + val_set
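
The cell above assigns the result of `split_data` to `train_set`, which replaces the label-covering examples returned by `ensure_labels_in_train`, so those examples end up in none of the splits. A minimal sketch of one way to keep them follows. It is an illustration only: `split_keeping_seed` and its parameters are hypothetical names, and it uses a plain standard-library shuffle instead of `train_test_split`.

```python
import random


def split_keeping_seed(seed_set, rest_set, test_size=0.2, val_size=0.1, rng_seed=42):
    """Shuffle `rest_set`, cut it into train/val/test, then add `seed_set` back to train."""
    rest = list(rest_set)
    random.Random(rng_seed).shuffle(rest)
    n_test = round(len(rest) * test_size)
    n_val = round(len(rest) * val_size)
    test_set = rest[:n_test]
    val_set = rest[n_test:n_test + n_val]
    # The label-covering examples always go to the training split
    train_set = list(seed_set) + rest[n_test + n_val:]
    return train_set, val_set, test_set


# Toy data standing in for the (text, annotations) examples
seed_examples = ["s0", "s1"]
other_examples = [f"r{i}" for i in range(100)]
toy_train, toy_val, toy_test = split_keeping_seed(seed_examples, other_examples)
print(len(toy_train), len(toy_val), len(toy_test))
```

Keeping the seed examples in training guarantees that every label is seen at least once during training, which is what `ensure_labels_in_train` is meant to provide.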

## Store the data split in json for future usage

In [50]:
splits = [('train', train_set), ('val', val_set), ('test', test_set)]
path = "/content/drive/MyDrive/sapient"
for split_name, split in splits:
    # Convert to a serializable format
    data_serializable = [{'text': text, 'entities': entities['entities']} for text, entities in split]
    save_path = os.path.join(path, f'{split_name}_split.json')
    # Save to a JSON file
    with open(save_path, 'w') as f:
        json.dump(data_serializable, f, indent=4)

In [51]:
print(len(train_set),len(val_set),len(test_set))

34092 4262 9589
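
These counts follow from the two nested `train_test_split` calls: the second `test_size` is applied to what remains after the test split, not to the whole dataset. The arithmetic can be sketched as below. `effective_fractions` is a hypothetical helper written for this illustration.

```python
def effective_fractions(test_size, val_size_of_remainder):
    """Return the (train, val, test) fractions of the full dataset for a two-step split."""
    test = test_size
    val = (1 - test_size) * val_size_of_remainder
    train = 1 - test - val
    return train, val, test


# Values used in split_data: test_size=0.2, then 0.1 / (1 - 0.1) of the remainder
train_frac, val_frac, test_frac = effective_fractions(0.2, 0.1 / (1 - 0.1))
print(f"{train_frac:.3f} {val_frac:.3f} {test_frac:.3f}")

# A 70/10/20 split would instead need 0.1 / (1 - 0.2) = 0.125 of the remainder
alt_train, alt_val, alt_test = effective_fractions(0.2, 0.1 / (1 - 0.2))
print(f"{alt_train:.3f} {alt_val:.3f} {alt_test:.3f}")
```

With the values in `split_data`, the split is roughly 71.1% train, 8.9% validation, and 20% test, which matches the counts printed above.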


In [52]:
def extract_entity_tags(data):
    entity_tags = []
    for text, annotations in data:
        entities = annotations['entities']
        for start, end, label in entities:
            entity_tags.append(label)
    return entity_tags

def calculate_entity_distribution(data, dataset_name):
    entity_tags = extract_entity_tags(data)
    entity_counter = Counter(entity_tags)
    total_entities = sum(entity_counter.values())
    entity_percentages = {label: (count / total_entities) * 100 for label, count in entity_counter.items()}
    return pd.DataFrame(list(entity_percentages.items()), columns=['Entity', 'Percentage']).assign(Dataset=dataset_name)


## Percentage of samples by tag in train, val, and test

In [None]:
# Calculate distributions
train_distribution = calculate_entity_distribution(train_set, 'Train')
val_distribution = calculate_entity_distribution(val_set, 'Validation')
test_distribution = calculate_entity_distribution(test_set, 'Test')

# Combine distributions into a single DataFrame
combined_distribution = pd.concat([train_distribution, val_distribution, test_distribution])

# Create Plotly bar graph to visualize the distribution
fig = px.bar(combined_distribution, x='Entity', y='Percentage', color='Dataset', barmode='group',
             title='Entity Distribution in Train, Validation, and Test Data (Percentage)',
             labels={'Entity': 'Entity Type', 'Percentage': 'Percentage (%)'})
fig.show()


In [53]:
output_dir=Path("/content/drive/MyDrive/sapient")

## Function to create data in spaCy format

In [54]:
def create_data(data):
    """Convert (text, {'entities': [...]}) examples into a spaCy DocBin."""
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for example in tqdm(data):
        text = example[0]
        labels = example[1]['entities']
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in labels:
            # char_span returns None when the offsets cannot be aligned to token boundaries
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        # Remove overlapping spans before assigning the entities
        filtered_ents = filter_spans(ents)
        doc.ents = filtered_ents
        doc_bin.add(doc)
    return doc_bin

## Store the spaCy format data to disk

In [55]:
#doc_bin.to_disk("train.spacy")
train_bin = create_data(train_set)
val_bin = create_data(val_set)
test_bin = create_data(test_set)
train_bin.to_disk(output_dir / "train_split.spacy")
val_bin.to_disk(output_dir / "val_split.spacy")
test_bin.to_disk(output_dir / "test_split.spacy")

100%|██████████| 34092/34092 [00:09<00:00, 3706.78it/s]
100%|██████████| 4262/4262 [00:01<00:00, 3149.84it/s]
100%|██████████| 9589/9589 [00:02<00:00, 3570.39it/s]


## Initialize a config from the base config

In [58]:
!pip install 'spacy[transformers]'

Collecting spacy-transformers<1.4.0,>=1.1.2 (from spacy[transformers])
  Downloading spacy_transformers-1.3.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (197 kB)
Collecting transformers<4.37.0,>=3.4.0 (from spacy-transformers<1.4.0,>=1.1.2->spacy[transformers])
  Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers<1.4.0,>=1.1.2->spacy[transformers])
  Downloading spacy_alignments-0.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (313 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.8.0->spacy-transformers<1.4.0,>

In [59]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [60]:
!python -m spacy init fill-config /content/drive/MyDrive/sapient/base_config_split.cfg /content/drive/MyDrive/sapient/config_split.cfg

✔ Auto-filled config with all values
✔ Saved config
/content/drive/MyDrive/sapient/config_split.cfg
You can now add your data and train your pipeline:
python -m spacy train config_split.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [63]:
!python -m spacy train /content/drive/MyDrive/sapient/config_split.cfg --output /content/drive/MyDrive/sapient/split/ --paths.train /content/drive/MyDrive/sapient/train_split.spacy --paths.dev /content/drive/MyDrive/sapient/val_split.spacy --gpu-id 0

ℹ Saving to output directory: /content/drive/MyDrive/sapient/split
ℹ Using GPU: 0
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Initialized pipeline
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0         327.12    672.75    0.08    0.49    0.04    0.00
  0     200       32350.35  65457.18   80.35   81.60   79.13    0.80
  0     400        9140.80  15328.38   82.67   83.41   81.94    0.83
  1   

In [None]:
import spacy
from spacy.training import Example
from spacy.scorer import Scorer
from spacy.tokens import DocBin

# Load the trained model
model_path = "/content/drive/MyDrive/sapient/model-best"  # Change to your model path
nlp = spacy.load(model_path)

# Load the test data from .spacy file
test_data_path = "/content/drive/MyDrive/sapient/test.spacy"  # Change to your test data path
doc_bin = DocBin().from_disk(test_data_path)
test_docs = list(doc_bin.get_docs(nlp.vocab))

# Convert test data to SpaCy Example format
def convert_to_examples(docs):
    examples = []
    for doc in docs:
        example = Example.from_dict(doc, {"entities": [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]})
        examples.append(example)
    return examples

# Prepare test examples
test_examples = convert_to_examples(test_docs)

# Evaluate the model
def evaluate_ner_model(nlp, examples):
    scorer = Scorer()
    for example in examples:
        doc = nlp(example.text)
        example.predicted = doc
    scores = scorer.score(examples)
    return scores

# Get evaluation metrics
metrics = evaluate_ner_model(nlp, test_examples)

# Print the metrics
# metrics = scorer.scores
print(f"Precision: {metrics['ents_p']:.4f}")
print(f"Recall: {metrics['ents_r']:.4f}")
print(f"F1 Score: {metrics['ents_f']:.4f}")

# Detailed metrics (optional)
print("\nDetailed Metrics:")
for entity_type, results in metrics['ents_per_type'].items():
    print(f"{entity_type}:")
    print(f"  Precision: {results['p']:.4f}")
    print(f"  Recall: {results['r']:.4f}")
    print(f"  F1 Score: {results['f']:.4f}")

Precision: 0.8508
Recall: 0.8450
F1 Score: 0.8479

Detailed Metrics:
B-tim:
  Precision: 0.9183
  Recall: 0.9065
  F1 Score: 0.9124
B-geo:
  Precision: 0.8721
  Recall: 0.8942
  F1 Score: 0.8830
I-geo:
  Precision: 0.8457
  Recall: 0.7688
  F1 Score: 0.8054
B-gpe:
  Precision: 0.9431
  Recall: 0.9401
  F1 Score: 0.9416
B-per:
  Precision: 0.8315
  Recall: 0.8130
  F1 Score: 0.8222
I-per:
  Precision: 0.8205
  Recall: 0.9203
  F1 Score: 0.8675
B-org:
  Precision: 0.7855
  Recall: 0.7422
  F1 Score: 0.7632
I-tim:
  Precision: 0.8007
  Recall: 0.8034
  F1 Score: 0.8020
I-org:
  Precision: 0.7970
  Recall: 0.7823
  F1 Score: 0.7896
I-gpe:
  Precision: 1.0000
  Recall: 0.6429
  F1 Score: 0.7826
B-nat:
  Precision: 0.5556
  Recall: 0.3846
  F1 Score: 0.4545
B-eve:
  Precision: 0.7143
  Recall: 0.3030
  F1 Score: 0.4255
B-art:
  Precision: 0.0000
  Recall: 0.0000
  F1 Score: 0.0000
I-art:
  Precision: 0.0000
  Recall: 0.0000
  F1 Score: 0.0000
I-eve:
  Precision: 0.5000
  Recall: 0.2500
  F1 

In [64]:
!python -m spacy evaluate /content/drive/MyDrive/sapient/split/model-best/ /content/drive/MyDrive/sapient/test_split.spacy --gpu-id 0

ℹ Using GPU: 0

TOK     100.00
NER P   86.32
NER R   84.59
NER F   85.45
SPEED   7926

             P       R       F
B-geo    88.65   88.87   88.76
I-geo    81.37   80.56   80.96
B-per    82.82   87.02   84.87
I-per    84.06   93.27   88.43
B-org    80.30   71.81   75.82
B-tim    92.36   88.34   90.31
B-gpe    94.98   93.94   94.46
I-org    84.02   76.24   79.94
I-tim    86.62   74.62   80.17
B-art    17.78   10.96   13.56
B-eve    48.61   49.30   48.95
I-eve    36.73   33.96   35.29
I-gpe    73.47   67.92   70.59
I-art    20.00    8.93   12.35
B-nat    63.64   35.00   45.16
I-nat   100.00   25.00   40.00



In [65]:
nlp_ner = spacy.load("/content/drive/MyDrive/sapient/model-best")
doc = nlp_ner(test_set[2][0])
spacy.displacy.render(doc, style="ent", jupyter=True)



[W111] Jupyter notebook detected: if using `prefer_gpu()` or `require_gpu()`, include it in the same cell right before `spacy.load()` to ensure that the model is loaded on the correct device. More information: http://spacy.io/usage/v3#jupyter-notebook-gpu



## Observations and Findings

1. **Overall Performance**:
   - The model performs well on the common entity types `B-geo`, `I-geo`, `B-gpe`, `B-org`, `I-org`, `B-per`, `I-per`, `B-tim`, and `I-tim`, all with F1-scores above 75 on the test split.
   - The highest F1-scores are for `B-gpe` (94.46) and `B-tim` (90.31), followed by `B-geo` (88.76).

2. **Low Performance**:
   - The entity types `I-art`, `B-art`, `B-eve`, `I-eve`, `I-nat`, and `B-nat` have F1-scores below 50, ranging from 12.35 (`I-art`) to 48.95 (`B-eve`).
   - `I-gpe` sits in between with an F1-score of 70.59.
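
One way to quantify how much the weak entity types matter is to compare the overall `NER F` reported by `spacy evaluate` (85.45, which pools all predictions together) with an unweighted mean of the per-type F1-scores. The sketch below copies the per-type values from the evaluation table; the macro average is computed here for illustration and is not a figure reported by spaCy.

```python
# Per-type F1-scores copied from the `spacy evaluate` output on the test split
per_type_f1 = {
    "B-geo": 88.76, "I-geo": 80.96, "B-per": 84.87, "I-per": 88.43,
    "B-org": 75.82, "B-tim": 90.31, "B-gpe": 94.46, "I-org": 79.94,
    "I-tim": 80.17, "B-art": 13.56, "B-eve": 48.95, "I-eve": 35.29,
    "I-gpe": 70.59, "I-art": 12.35, "B-nat": 45.16, "I-nat": 40.00,
}

# Macro average: every entity type counts equally, regardless of how often it occurs
macro_f1 = sum(per_type_f1.values()) / len(per_type_f1)
print(f"Macro F1: {macro_f1:.2f}")

# Entity types below an (arbitrary) threshold of 50
weak_types = sorted(label for label, f1 in per_type_f1.items() if f1 < 50)
print(weak_types)
```

The macro average comes out near 64, well below the overall 85.45, which shows that the overall score is dominated by the frequent entity types.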


## Future Improvement Strategy

1. **Data Augmentation**:
   - Increase the quantity of training data, particularly for underperforming entity types.

2. **Class Balancing**:
   - Collect more instances of underrepresented entity types such as `I-art`, `B-art`, `B-eve`, `I-eve`, `I-nat`, and `B-nat`.

3. **Model Architecture and Hyperparameters**:
   - Experiment with different model architectures.
   - Tune hyperparameters such as the learning rate, batch size, and number of epochs to optimize the training process.
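
When collecting new data is not an option, a simple stand-in for class balancing is to oversample training sentences that contain the underrepresented labels. The sketch below is one possible approach under stated assumptions, not something evaluated in this notebook: `oversample_rare`, `rare_labels`, and `factor` are hypothetical names, and the examples are assumed to be in the `(text, {'entities': [...]})` format used above.

```python
import random


def oversample_rare(examples, rare_labels, factor=3, seed=42):
    """Repeat each example that contains a rare label so it appears `factor` times."""
    out = []
    for text, annotations in examples:
        labels = {label for _, _, label in annotations["entities"]}
        copies = factor if labels & rare_labels else 1
        out.extend([(text, annotations)] * copies)
    # Shuffle so that the duplicates are not adjacent during training
    random.Random(seed).shuffle(out)
    return out


# Toy examples in the same format as the training data
toy_examples = [
    ("London is big", {"entities": [(0, 6, "B-geo")]}),
    ("Mona Lisa", {"entities": [(0, 4, "B-art"), (5, 9, "I-art")]}),
]
balanced = oversample_rare(toy_examples, rare_labels={"B-art", "I-art"})
print(len(balanced))
```

Oversampling should be applied to the training split only, and it also repeats any frequent labels that occur in the same sentences, so the effect on the common entity types is worth checking on the validation split.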
