# Building models for named entity recognition

The project consists of building two named entity recognition (NER) systems. Both systems use the IOB tagging scheme to detect entities of type PER, ORG, LOC and MISC. With one tag per token, the tagging scheme comprises the following tags:

- B-PER and I-PER: token corresponds to the start, resp. the inside, of a person entity
- B-LOC and I-LOC: token corresponds to the start, resp. the inside, of a location entity
- B-ORG and I-ORG: token corresponds to the start, resp. the inside, of an organization entity
- B-MISC and I-MISC: token corresponds to the start, resp. the inside, of any other named entity
- O: token corresponds to no entity
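
To make the scheme concrete, here is a minimal sketch of how a tag sequence can be decoded back into entity spans. The helper name `iob_to_spans` is hypothetical (it is not part of the project code), and treating a stray I-X tag as the start of an entity is a lenient choice assumed for this sketch.

```python
def iob_to_spans(tags):
    """Return (entity_type, start, end) triples, with end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition('-')
        # An open span ends at an O tag, at any B- tag, or when the type changes.
        if current is not None and (prefix != 'I' or etype != current):
            spans.append((current, start, i))
            start, current = None, None
        # Open a new span on B-X, or on a stray I-X with no open span (lenient).
        if prefix == 'B' or (prefix == 'I' and current is None):
            start, current = i, etype
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tags = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
print(iob_to_spans(tags))
```

On the example sentence used in this notebook ("EU rejects German call ..."), this yields one ORG span and two MISC spans, each one token long.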

## Dataset

The dataset has been lightly cleaned and reformatted for ease of use. You can load the three folds (train, valid and test) directly from the JSON file provided:

```python
import json

with open('conll03-iob-pos.json', 'r') as f:
    data = json.load(f)
```
For each fold, the dataset consists of a list of dictionaries, one per sample, with the two fields 'tokens' and 'tags', e.g.

{'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'tags': ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']}





In [None]:
import json
import gzip

from transformers import AutoModel, AutoTokenizer

from sklearn.metrics import accuracy_score

import torch
from torch.utils.data import Dataset, DataLoader


In [None]:
#
# tag to id mapping and vice versa
#
# for tokens that do not have a tag, we will use -100 as the corresponding tag ID
#

tag2id = {
    'O': 0,
    'B-LOC': 1, 'I-LOC': 2,
    'B-ORG': 3, 'I-ORG': 4,
    'B-PER': 5, 'I-PER': 6,
    'B-MISC': 7, 'I-MISC': 8
}

id2tag = list(tag2id.keys())

print(id2tag)

['O', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-MISC']


In [None]:
#
# load data from json file
#

with gzip.open('conll03-iob-pos.json.gz', 'r') as f:
    data = json.load(f)

for fold in ('train', 'valid', 'test'):
    print(fold, len(data[fold]))

print(data['train'][0])

train 14041
valid 3250
test 3453
{'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'tags': ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']}


In [None]:
#
# load the tokenizer of the DistilBERT checkpoint
#

checkpoint = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(tokenizer)

DistilBertTokenizerFast(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)


In [None]:
#
# Here's an example showing how to tokenize texts and create the corresponding aligned and encoded labels
#
# Note that the tokenizer lets us retrieve, for each (sub-word) token, the index of the word it comes from
# through the inputs.word_ids(batch_index=i) function (which returns the word indices for the tokens in
# inputs['input_ids'][i]). Special tokens ([CLS], [SEP], [PAD]) are mapped to None. We will make use of this
# mapping to create token-level labels adapted to sub-word tokenization. See next cell.
#

train_texts = [x['tokens'] for x in data['train']]
train_labels = [x['tags'] for x in data['train']]

inputs = tokenizer(train_texts, is_split_into_words=True, padding=True, truncation=True, return_tensors="pt")

print(train_texts[0])
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
print(inputs.word_ids(batch_index=0))

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', ... (remaining '[PAD]' tokens omitted; output truncated)

In [None]:
def align_and_encode_labels(_token_ids, _word_ids, _labels):
    '''
    Align word-level labels to sub-word tokens for an entry
    '''

    global tag2id

    ignore_id = -100

    buf = [ignore_id] # ignore tag for token [CLS]

    prev_token_word = -1
    which_type = 0

    # print(len(_token_ids), tokenizer.convert_ids_to_tokens(_token_ids))
    # print(_word_ids)
    # print(_labels)

    for i in range(1, len(_token_ids)):
        word_id = _word_ids[i]

        if word_id is None:
            # token does not belong to any input word ([CLS], [SEP] or [PAD]) -- ignore
            buf.append(ignore_id)

        else:
            tag_id = tag2id[_labels[word_id]]

            if word_id == prev_token_word:
                # sub-word token continuing the previous word:
                #   word has an O tag: keep the O tag
                #   word has an I-X tag: keep the I-X tag
                #   word has a B-X tag: replace it with the corresponding I-X tag
                #   (in tag2id, B-X tags have odd IDs and I-X is always B-X + 1)

                buf.append(tag_id + 1 if tag_id in (1, 3, 5, 7) else tag_id)

            else:
                # token starting a new word --> keep tag unchanged
                prev_token_word = word_id
                buf.append(tag_id)

    return buf

#
# The following illustrates how we can get aligned and encoded labels for sample i in the training set.
#

i = 10

print(train_texts[i], train_labels[i])

new_labels = align_and_encode_labels(inputs['input_ids'][i], inputs.word_ids(batch_index=i), train_labels[i])

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][i])

for j in range(len(tokens)):
    if tokens[j] != '[PAD]':
        print(tokens[j], ' -- ', id2tag[new_labels[j]] if new_labels[j] >= 0 else 'NONE')

['Spanish', 'Farm', 'Minister', 'Loyola', 'de', 'Palacio', 'had', 'earlier', 'accused', 'Fischler', 'at', 'an', 'EU', 'farm', 'ministers', "'", 'meeting', 'of', 'causing', 'unjustified', 'alarm', 'through', '"', 'dangerous', 'generalisation', '.', '"'] ['B-MISC', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
[CLS]  --  NONE
Spanish  --  B-MISC
Farm  --  O
Minister  --  O
Loyola  --  B-PER
de  --  I-PER
Pa  --  I-PER
##la  --  I-PER
##cio  --  I-PER
had  --  O
earlier  --  O
accused  --  O
Fi  --  B-PER
##sch  --  I-PER
##ler  --  I-PER
at  --  O
an  --  O
EU  --  B-ORG
farm  --  O
ministers  --  O
'  --  O
meeting  --  O
of  --  O
causing  --  O
un  --  O
##ju  --  O
##st  --  O
##ified  --  O
alarm  --  O
through  --  O
"  --  O
dangerous  --  O
general  --  O
##isation  --  O
.  --  O
"  --  O
[SEP]  --  NONE


In [None]:
# Now it's up to you to continue this notebook: define your datasets and models, then evaluate them.

## **Step 1:** Dataset Definition & Preparation
We begin by defining a custom NERDataset class. Because BERT uses sub-word tokenization, the word-level tags must be realigned to the tokens the model actually sees.

Tokenization: We tokenize all texts of a fold at once using tokenizer(). This handles padding and truncation automatically, so that all sequences in the fold have the same length.

Alignment: We iterate through each sample and use the align_and_encode_labels function (defined in the preamble) to map the original word-level tags to the sub-word tokens. If a word is split, its label is propagated to every piece (a B-X tag becomes I-X on continuation pieces), while special and padding tokens receive the ignore label -100.

#### Output:
The __getitem__ method returns a dictionary containing input_ids, attention_mask, and the aligned labels, which is the format expected by Hugging Face models and our custom training loop.
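
The alignment rule can be illustrated without loading a tokenizer. The sketch below is a simplified, string-based variant of align_and_encode_labels: the helper name `align_tags` is hypothetical, and the `word_ids` list is written by hand to mimic what `inputs.word_ids(batch_index=i)` would return when "Fischler" is split into three pieces.

```python
IGNORE = -100  # same ignore value as in the notebook


def align_tags(word_ids, word_tags):
    """Propagate word-level IOB tags to sub-word tokens."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE)            # special token: no label
        elif wid == prev:
            tag = word_tags[wid]              # continuation piece of the same word
            aligned.append('I-' + tag[2:] if tag.startswith('B-') else tag)
        else:
            aligned.append(word_tags[wid])    # first piece keeps the word's tag
        prev = wid
    return aligned


# Hand-written example (assumed tokenization, two words only)
tokens = ['[CLS]', 'accused', 'Fi', '##sch', '##ler', '[SEP]']
word_ids = [None, 0, 1, 1, 1, None]
word_tags = ['O', 'B-PER']

for tok, lab in zip(tokens, align_tags(word_ids, word_tags)):
    print(tok, lab)
```

Only the first piece of "Fischler" keeps B-PER; the following pieces become I-PER, so the entity still reads as a single span after tokenization.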

In [None]:
class NERDataset(Dataset):
    def __init__(self, _data, _tokenizer):
        # Extract raw texts and tags
        texts = [x['tokens'] for x in _data]
        tags = [x['tags'] for x in _data]

        # Tokenize inputs
        self.encodings = _tokenizer(texts, is_split_into_words=True, padding=True, truncation=True, return_tensors="pt")

        self.labels = []

        # Align labels for each sample
        for i in range(len(tags)):
            word_ids = self.encodings.word_ids(batch_index=i)


            aligned_labels = align_and_encode_labels(self.encodings['input_ids'][i], word_ids, tags[i])
            self.labels.append(torch.tensor(aligned_labels))

        # Convert list of label tensors to a single stacked tensor
        self.labels = torch.stack(self.labels)

    def __getitem__(self, idx):
        # Return a dictionary whose keys match the arguments of the model's forward() method
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)



ds_train = NERDataset(data['train'], tokenizer)
ds_valid = NERDataset(data['valid'], tokenizer)
ds_test = NERDataset(data['test'], tokenizer)

batch_size = 32
train_loader = DataLoader(ds_train, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(ds_valid, batch_size=batch_size)
test_loader = DataLoader(ds_test, batch_size=batch_size)

## **Step 2:** The RNN-Based Model

We define a standard Bi-LSTM model. We set bidirectional=True because the tag of a token often depends on the words that follow it as well as on those that precede it.

* Embedding Layer:
Maps the BERT tokenizer's vocabulary indices to dense vectors, learned from scratch.

* LSTM Layer:
Processes the sequence in both directions. The output dimension is hidden_dim * 2 because we concatenate the forward and backward hidden states.

* Linear Layer:
Projects the concatenated hidden states to the number of possible tags.
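
As a rough sanity check on the size of this architecture, the parameter count of each layer can be worked out by hand. This is a sketch under the assumption that the LSTM follows PyTorch's layout (four gates per direction, with separate input and recurrent bias vectors); the function name `lstm_ner_param_count` is hypothetical, and the sizes are the ones used in this notebook.

```python
def lstm_ner_param_count(vocab_size, embed_dim, hidden_dim, n_tags):
    """Count trainable parameters of an embedding + Bi-LSTM + linear tagger."""
    embedding = vocab_size * embed_dim
    # One direction: 4 gates, each with input weights, recurrent weights,
    # and two bias vectors (input-side and recurrent-side).
    per_direction = 4 * hidden_dim * (embed_dim + hidden_dim) + 2 * 4 * hidden_dim
    lstm = 2 * per_direction                      # forward + backward
    fc = (2 * hidden_dim) * n_tags + n_tags       # weights + bias
    return {'embedding': embedding, 'lstm': lstm, 'fc': fc,
            'total': embedding + lstm + fc}


counts = lstm_ner_param_count(vocab_size=28996, embed_dim=128, hidden_dim=256, n_tags=9)
print(counts)
```

With these sizes the model has about 4.5 million parameters, of which roughly four fifths sit in the embedding table, which has to be learned from the NER training set alone.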

In [None]:
import torch.nn as nn


class LSTM_NER(nn.Module):
    """
    Bidirectional LSTM for Named Entity Recognition.

    Architecture:
    1. Embedding Layer: Learns dense representations for token IDs from scratch.
    2. Bi-LSTM Layer: Captures context from both left and right directions.
    3. Linear Head: Projects the LSTM output to the tag space.
    """
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_tags):
        super().__init__()
        # Embedding: vocab_size -> embed_dim
        # We use the tokenizer's vocab size to ensure we can handle any token ID produced by it.
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # LSTM: embed_dim -> hidden_dim
        # batch_first=True ensures input shape is (batch, seq_len, features)
        # bidirectional=True allows the model to see future context
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

        # Linear: hidden_dim * 2 -> n_tags
        # We multiply by 2 because the LSTM is bidirectional (forward & backward)
        self.fc = nn.Linear(hidden_dim * 2, n_tags)

    def forward(self, input_ids, **kwargs):

        # Embed tokens
        # Shape: (batch_size, seq_len) -> (batch_size, seq_len, embed_dim)
        x = self.embedding(input_ids)

        # Run LSTM
        # output shape: (batch_size, seq_len, hidden_dim * 2)
        # We ignore the hidden states (h_n, c_n) returned as the second tuple element
        x, _ = self.lstm(x)

        # Project to tag space
        # Shape: (batch_size, seq_len, n_tags)
        logits = self.fc(x)

        return logits

# Instantiate the RNN model
# We choose hyperparameters: 128 for embedding and 256 for hidden state
print("Initializing LSTM Model...")
rnn_model = LSTM_NER(vocab_size=tokenizer.vocab_size, embed_dim=128, hidden_dim=256, n_tags=len(tag2id))
print(rnn_model)


Initializing LSTM Model...
LSTM_NER(
  (embedding): Embedding(28996, 128)
  (lstm): LSTM(128, 256, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=512, out_features=9, bias=True)
)


## **Step 3:** The Fine-Tuned BERT Model

* Rationale:
Here we fine-tune a pre-trained Transformer (DistilBERT). Unlike document classification, where only the representation of the [CLS] token is used, NER needs a prediction for every token.

* Base Model:
It is important to use the "cased" version because capitalization helps the model tell the difference between a name and a regular word (for example, "Apple" the company versus "apple" the fruit).

* Sequence Output:
We access outputs.last_hidden_state, which gives us one vector of size 768 for every token in the sequence.

* Dropout:
Added to prevent overfitting during fine-tuning.
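
To make the contrast with classification concrete, here is a toy sketch in plain Python (no transformer involved). The hidden vectors, weights and sizes are made-up values chosen for illustration; only the indexing pattern mirrors what a token classification head does with last_hidden_state.

```python
def linear(vec, weight, bias):
    """Apply one linear layer to a single vector: weight @ vec + bias."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


# Toy encoder output for one sentence: 4 tokens, hidden size 2 (768 in DistilBERT)
hidden = [[0.1, 0.2], [0.5, -0.3], [0.0, 1.0], [0.4, 0.4]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 output classes (9 tags in the notebook)
bias = [0.0, 0.0, 0.0]

# Sentence classification: only the first ([CLS]) vector is projected
sentence_logits = linear(hidden[0], weight, bias)

# Token classification (NER): every position is projected with the same layer
token_logits = [linear(h, weight, bias) for h in hidden]

print(len(sentence_logits))                      # one score per class
print(len(token_logits), len(token_logits[0]))   # one row of scores per token
```

The same linear layer is shared across positions, so the head adds very few parameters on top of the encoder.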

In [None]:
# --- STEP 3: BERT MODEL DEFINITION ---

class BERT_NER(nn.Module):
    """
    Fine-tuning a pre-trained BERT model for Token Classification.

    Architecture:
    1. DistilBERT Encoder: Pre-trained transformer layers.
    2. Dropout: Regularization.
    3. Linear Head: Projects the 768-dim BERT embeddings to the tag space.
    """
    def __init__(self, n_tags, dropout=0.1):
        super().__init__()
        # Load the base DistilBERT model
        # We use 'distilbert-base-cased' because case sensitivity is important for NER
        self.bert = AutoModel.from_pretrained('distilbert-base-cased')

        self.dropout = nn.Dropout(dropout)

        # Project 768 (DistilBERT's hidden size) to number of tags
        # The configuration is accessed via self.bert.config
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_tags)

    def forward(self, input_ids, attention_mask, **kwargs):
        # 1. Run BERT
        # DistilBERT outputs a tuple or object. We need the 'last_hidden_state'.
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)

        # 2. Get full sequence output
        # Shape: (batch_size, seq_len, 768)
        # Unlike text classification (where we take [:, 0, :]), we keep the whole sequence
        sequence_output = outputs.last_hidden_state

        # 3. Dropout and Classify
        sequence_output = self.dropout(sequence_output)

        # Shape: (batch_size, seq_len, n_tags)
        logits = self.classifier(sequence_output)

        return logits

# Instantiate the BERT model
print("Initializing BERT Model...")
bert_model = BERT_NER(n_tags=len(tag2id))
print(bert_model)

Initializing BERT Model...



BERT_NER(
  (bert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_fea

## **Step 4:** Training & Final Evaluation

### Rationale:
In this final phase, we define the training loop. A crucial detail is the handling of the -100 label, which marks padding and special tokens: these positions must be excluded from the loss so that they do not contribute to the gradients.

### Training:
We train the LSTM for 5 epochs and the BERT model for 3 epochs.

### Evaluation:
The project requires two metrics.

* Token Accuracy: Computed on the validation set after each epoch to monitor convergence.

* Entity Recognition Rate: Computed here as the exact sentence match rate, i.e. the fraction of sentences whose predicted tag sequence matches the reference on every labeled token.
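
The effect of ignoring the -100 label can be checked by hand. The sketch below computes a mean cross-entropy in plain Python and skips ignored positions, which is the behaviour expected from nn.CrossEntropyLoss(ignore_index=-100) with its default mean reduction. The function name `masked_cross_entropy` and the toy logits are assumptions made for illustration.

```python
import math

IGNORE = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue                          # padding / special token: no loss
        m = max(row)                          # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[label]           # -log softmax(row)[label]
        count += 1
    # Returning 0.0 when everything is ignored is a choice made for this sketch.
    return total / count if count else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [5.0, 0.0, 0.0]]
labels = [0, 1, IGNORE]                       # the third position is padding

print(masked_cross_entropy(logits, labels))
```

Whatever the model outputs at the padded position, the loss stays the same, so padding never influences the gradients.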

In [None]:
# --- STEP 4: TRAINING & EVALUATION ---

import torch.optim as optim

# 1. HELPER: Single Epoch Training
def train_epoch(model, loader, optimizer, device):
    model.train()
    total_loss = 0
    # CRITICAL: ignore_index=-100 ensures we don't calculate loss for padding/special tokens
    criterion = nn.CrossEntropyLoss(ignore_index=-100)

    for batch in loader:
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        attention_mask = batch.get('attention_mask')
        if attention_mask is not None:
            attention_mask = attention_mask.to(device)

        optimizer.zero_grad()

        # Forward pass
        if isinstance(model, BERT_NER):
            logits = model(input_ids, attention_mask)
        else:
            logits = model(input_ids)

        # Flatten outputs for CrossEntropyLoss: (batch * seq_len, n_tags)
        loss = criterion(logits.view(-1, len(tag2id)), labels.view(-1))

        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    return total_loss / len(loader)

# 2. HELPER: Token Accuracy (for monitoring)
def evaluate_token_acc(model, loader, device):
    model.eval()
    all_preds, all_labels = [], []

    with torch.no_grad():
        for batch in loader:
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            attention_mask = batch.get('attention_mask')
            if attention_mask is not None:
                attention_mask = attention_mask.to(device)

            if isinstance(model, BERT_NER):
                logits = model(input_ids, attention_mask)
            else:
                logits = model(input_ids)

            predictions = torch.argmax(logits, dim=-1)

            # Masking: Only select valid labels (not -100)
            mask = (labels != -100)
            all_preds.extend(predictions[mask].cpu().numpy())
            all_labels.extend(labels[mask].cpu().numpy())

    return accuracy_score(all_labels, all_preds)

# 3. HELPER: Full Training Loop
def run_training(model, model_name, train_loader, valid_loader, n_epochs, lr):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"--- Training {model_name} on {device} ---")
    model.to(device)

    if model_name == 'BERT':
        optimizer = optim.AdamW(model.parameters(), lr=lr) # Lower LR for transformers
    else:
        optimizer = optim.Adam(model.parameters(), lr=lr)

    for epoch in range(n_epochs):
        loss = train_epoch(model, train_loader, optimizer, device)
        val_acc = evaluate_token_acc(model, valid_loader, device)
        print(f"Epoch {epoch+1}/{n_epochs} | Loss: {loss:.4f} | Val Token Acc: {val_acc:.4f}")

    return model

# --- EXECUTE TRAINING ---

# Train LSTM (Higher LR, more epochs)
rnn_model = run_training(rnn_model, 'LSTM', train_loader, valid_loader, n_epochs=5, lr=1e-3)

# Train BERT (Lower LR, fewer epochs)
bert_model = run_training(bert_model, 'BERT', train_loader, valid_loader, n_epochs=3, lr=5e-5)


# --- FINAL EVALUATION: ENTITY RECOGNITION ---

def get_decoded_predictions(model, loader, device):
    """
    Decodes predictions back to tag strings (e.g., 'B-PER') to evaluate full entities.
    """
    model.eval()
    true_seqs, pred_seqs = [], []

    with torch.no_grad():
        for batch in loader:
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            attention_mask = batch.get('attention_mask')
            if attention_mask is not None:
                attention_mask = attention_mask.to(device)

            if isinstance(model, BERT_NER):
                logits = model(input_ids, attention_mask)
            else:
                logits = model(input_ids)

            preds = torch.argmax(logits, dim=-1).cpu().numpy()
            lbls = labels.cpu().numpy()

            for i in range(len(lbls)):
                # Filter special tokens (-100) to reconstruct the sentence
                valid = lbls[i] != -100
                true_seqs.append([id2tag[x] for x in lbls[i][valid]])
                pred_seqs.append([id2tag[x] for x in preds[i][valid]])

    return true_seqs, pred_seqs

def calculate_entity_success(true_seqs, pred_seqs):
    # Strict accuracy: The whole sentence must be perfect
    perfect = sum([1 for t, p in zip(true_seqs, pred_seqs) if t == p])
    return perfect / len(true_seqs)

# Compare Models on Test Set
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("\n--- Final Test Set Evaluation ---")
lstm_true, lstm_pred = get_decoded_predictions(rnn_model, test_loader, device)
lstm_score = calculate_entity_success(lstm_true, lstm_pred)
print(f"LSTM Strict Sentence Accuracy: {lstm_score:.4f}")

bert_true, bert_pred = get_decoded_predictions(bert_model, test_loader, device)
bert_score = calculate_entity_success(bert_true, bert_pred)
print(f"BERT Strict Sentence Accuracy: {bert_score:.4f}")

# Visual Sanity Check
print("\nSample Output (BERT):")
print(f"True: {bert_true[0]}")
print(f"Pred: {bert_pred[0]}")

--- Training LSTM on cuda ---
Epoch 1/5 | Loss: 0.5423 | Val Token Acc: 0.8938
Epoch 2/5 | Loss: 0.2267 | Val Token Acc: 0.9273
Epoch 3/5 | Loss: 0.1205 | Val Token Acc: 0.9377
Epoch 4/5 | Loss: 0.0601 | Val Token Acc: 0.9402
Epoch 5/5 | Loss: 0.0280 | Val Token Acc: 0.9431
--- Training BERT on cuda ---
Epoch 1/3 | Loss: 0.1832 | Val Token Acc: 0.9771
Epoch 2/3 | Loss: 0.0550 | Val Token Acc: 0.9816
Epoch 3/3 | Loss: 0.0301 | Val Token Acc: 0.9787

--- Final Test Set Evaluation ---
LSTM Strict Sentence Accuracy: 0.5601
BERT Strict Sentence Accuracy: 0.8187

Sample Output (BERT):
True: ['O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred: ['O', 'O', 'O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


## **Part 5:** Analysis & Conclusion


The two models differ in three key ways:

* Speed: BERT converged in fewer epochs. Its validation token accuracy reached 0.9771 after the first epoch and peaked at 0.9816 after the second, while the LSTM was still improving after its fifth epoch (0.9431). Note that the epoch budgets (3 vs 5) were fixed in advance and that wall-clock time was not measured.

* Token Accuracy (~98% vs ~94%): BERT was more accurate at the token level. The LSTM's 94% looks high but is misleading, because most tokens are simply tagged "O" (not an entity), so a model can score well just by predicting "O" everywhere. BERT's higher score suggests that it finds more of the entities, although token accuracy alone cannot confirm this.

* Sentence Success Rate (82% vs 56%): BERT tagged every token of a sentence correctly 82% of the time, against 56% for the LSTM. The metric is strict: a single mistake, such as marking only half of "New York", makes the whole sentence count as a failure.
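
The gap between the two metrics can be reproduced on a toy example. The sequences below are made up for illustration (they are not taken from the test set), and the function names are hypothetical; the point is only to show how an "always predict O" baseline scores on each metric.

```python
def token_accuracy(true_seqs, pred_seqs):
    """Fraction of individual tags that match."""
    pairs = [(t, p) for ts, ps in zip(true_seqs, pred_seqs) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)


def sentence_accuracy(true_seqs, pred_seqs):
    """Fraction of sentences whose whole tag sequence matches."""
    return sum(t == p for t, p in zip(true_seqs, pred_seqs)) / len(true_seqs)


# Toy reference tags: 20 tokens in total, 3 of them inside an entity
true_seqs = [
    ['O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O'],
    ['O', 'O', 'O', 'O', 'O'],
    ['B-PER', 'O', 'O', 'O', 'O'],
]
all_o = [['O'] * len(seq) for seq in true_seqs]   # baseline: predict O everywhere

print(token_accuracy(true_seqs, all_o))      # 17 of 20 tokens are correct
print(sentence_accuracy(true_seqs, all_o))   # only the entity-free sentence matches
```

The baseline finds no entity at all, yet it still gets 85% of the tokens right, while its sentence-level score drops to one in three.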

### Why BERT?
The main difference lies in where each model starts:

* LSTM (Started from zero): This model had to learn everything from scratch. It had to figure out English grammar and how to spot names at the same time. This is very hard to do with a small dataset.

* BERT (Started with knowledge): DistilBERT was pre-trained on a large corpus of unlabeled English text (books and Wikipedia articles). It already encoded much of English structure, so we only had to teach it the specific tags (like PER or LOC).

So BERT learned faster and made fewer mistakes.

### Error Analysis (Sample Output)
The failure case printed above:

* **True:** `[... B-LOC, I-LOC, I-LOC ... B-PER, I-PER, I-PER ...]`
* **Pred:** `[... B-MISC, I-MISC, I-MISC ... B-PER, I-PER, O ...]`

It shows two common types of mistake:

* Wrong Label (Classification Error): The model correctly found the boundaries of the first entity (it knew it was 3 tokens long), but it guessed the wrong type. This usually happens when a word is ambiguous (e.g., "Washington" can be a person or a place).

* Cut Short (Segmentation Error): For the second entity, the model started correctly (B-PER, I-PER) but stopped too early and missed the last piece of the name. This often happens with rare last names or when the tokenizer splits a word into pieces the model has rarely seen.
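
The two error types can also be separated automatically by comparing entity spans instead of individual tags. The sketch below uses a condensed version of the sample (runs of O collapsed to a single O); the helper names `spans` and `diagnose` and the verdict labels are hypothetical.

```python
def spans(tags):
    """Decode IOB tags into a set of (type, start, end) spans, end exclusive."""
    found, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):      # sentinel O closes the last span
        prefix, _, etype = tag.partition('-')
        if current is not None and (prefix != 'I' or etype != current):
            found.add((current, start, i))
            start, current = None, None
        if prefix == 'B' or (prefix == 'I' and current is None):
            start, current = i, etype
    return found


def diagnose(true_tags, pred_tags):
    """Label each reference entity as correct, type error, boundary error or missed."""
    pred = spans(pred_tags)
    report = []
    for etype, start, end in sorted(spans(true_tags), key=lambda s: s[1]):
        if (etype, start, end) in pred:
            verdict = 'correct'
        elif any(s == start and e == end for _, s, e in pred):
            verdict = 'type error'                    # right boundaries, wrong type
        elif any(t == etype and s < end and start < e for t, s, e in pred):
            verdict = 'boundary error'                # right type, wrong extent
        else:
            verdict = 'missed'
        report.append((etype, start, end, verdict))
    return report


# Condensed version of the sample output (assumption: runs of O collapsed)
true_tags = ['O', 'B-LOC', 'I-LOC', 'I-LOC', 'O', 'B-PER', 'I-PER', 'I-PER', 'O']
pred_tags = ['O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O', 'O']

for row in diagnose(true_tags, pred_tags):
    print(row)
```

Counting these verdicts over the whole test set would show which kind of mistake dominates for each model, something the sentence-level score cannot reveal.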

### Conclusion

On a task with limited labeled data such as this one, fine-tuning a pre-trained BERT model performs significantly better than training a recurrent neural network (RNN) from scratch.