# Named Entity Recognition using CRF Model

We use the **CoNLL-2003** dataset to train a Conditional Random Field (CRF) model for Named Entity Recognition.

**9 Distinct NER Tags:**

| Tag | Meaning |
|-----|---------|
| O | Outside any entity |
| B-PER | Beginning of a Person |
| I-PER | Inside a Person |
| B-ORG | Beginning of an Organization |
| I-ORG | Inside an Organization |
| B-LOC | Beginning of a Location |
| I-LOC | Inside a Location |
| B-MISC | Beginning of a Miscellaneous entity |
| I-MISC | Inside a Miscellaneous entity |
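
To make the B-/I- convention concrete, here is a minimal sketch of how a tagged sentence can be grouped back into entity spans. The helper name `bio_to_spans` is hypothetical (it is not part of the notebook or of any library), and it assumes a lenient reading in which a stray `I-` tag with no open span starts a new entity.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans.

    Hypothetical helper for illustration only.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        # An I- tag of the same type extends the span that is currently open.
        if prefix == "I" and etype == current_type:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # Assumption: a stray I- tag is treated like a B- tag.
            current_type, current_tokens = etype, [token]
        else:
            current_type, current_tokens = None, []
    # Close a span that runs to the end of the sentence.
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Made-up example sentence with two multi-token entities.
tokens = ["Tony", "Blair", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
```

The `B-` tag marks where an entity starts, so two adjacent entities of the same type stay separate instead of merging into one span.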

#### Importing Libraries

In [1]:
from datasets import load_dataset
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_f1_score, flat_classification_report


#### Loading the CoNLL-2003 Dataset

In [2]:
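# Recent versions of `datasets` no longer support `trust_remote_code` and warn about it;
# drop the argument if your installed version reports that warning.
dataset = load_dataset("conll2003", trust_remote_code=True)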
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


In [3]:
# NER tag mapping in CoNLL-2003
tag_names = dataset["train"].features["ner_tags"].feature.names
print("NER Tags:", tag_names)
print("Number of tags:", len(tag_names))

NER Tags: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
Number of tags: 9


In [4]:
# Explore a sample sentence
sample = dataset["train"][0]
print("Tokens:", sample["tokens"])
print("POS tags:", sample["pos_tags"])
print("NER tags:", [tag_names[t] for t in sample["ner_tags"]])

Tokens: ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
POS tags: [22, 42, 16, 21, 35, 37, 16, 21, 7]
NER tags: ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


In [5]:
# Dataset statistics
for split in ["train", "validation", "test"]:
    print(f"{split}: {len(dataset[split])} sentences")

train: 14041 sentences
validation: 3250 sentences
test: 3453 sentences


#### Preparing Sentences as (word, POS, NER) Tuples

Convert each dataset example into a list of `(word, pos_tag, ner_tag)` tuples for CRF processing.

In [6]:
pos_tag_names = dataset["train"].features["pos_tags"].feature.names

def convert_to_tuples(example):
    """Convert a dataset example to list of (word, pos, ner_tag) tuples."""
    return list(zip(
        example["tokens"],
        [pos_tag_names[p] for p in example["pos_tags"]],
        [tag_names[t] for t in example["ner_tags"]]
    ))

train_sents = [convert_to_tuples(ex) for ex in dataset["train"]]
test_sents = [convert_to_tuples(ex) for ex in dataset["test"]]

print("Sample sentence:")
print(train_sents[0])

Sample sentence:
[('EU', 'NNP', 'B-ORG'), ('rejects', 'VBZ', 'O'), ('German', 'JJ', 'B-MISC'), ('call', 'NN', 'O'), ('to', 'TO', 'O'), ('boycott', 'VB', 'O'), ('British', 'JJ', 'B-MISC'), ('lamb', 'NN', 'O'), ('.', '.', 'O')]


#### Feature Extraction

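For each word, we extract its lowercase form, its last two and three characters (suffixes), capitalization and digit flags, and its POS tag together with the tag's first two characters. The lowercase form, capitalization flags, and POS features of the previous and next word are added as context, and `BOS`/`EOS` flags mark the first and last word of a sentence.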

In [19]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent) - 1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

In [20]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

print(f"Training sentences: {len(X_train)}")
print(f"Test sentences: {len(X_test)}")

Training sentences: 14041
Test sentences: 3453


#### Training the CRF Model

In [21]:
crf = CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

| Parameter | Value |
|-----------|-------|
| algorithm | 'lbfgs' |
| min_freq | |
| all_possible_states | |
| all_possible_transitions | True |
| c1 | 0.1 |
| c2 | 0.1 |
| max_iterations | 100 |
| num_memories | |
| epsilon | |
| period | |


In [22]:
# Predicting on the test set
y_pred = crf.predict(X_test)

#### Evaluating Model Performance

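Because the `O` tag dominates the test set (38,323 of 46,435 tokens), accuracy alone says little about how well entities are recognized. We therefore report token-level precision, recall, and F1-score for each tag.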

In [23]:
# Weighted F1 score
f1 = flat_f1_score(y_test, y_pred, average='weighted')
print(f"Weighted F1 Score: {f1:.4f}")

Weighted F1 Score: 0.9562


In [24]:
# Detailed classification report for all 9 tags
labels = ['B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC', 'O']
report = flat_classification_report(y_test, y_pred, labels=labels, digits=3)
print(report)

              precision    recall  f1-score   support

       B-PER      0.822     0.860     0.841      1617
       I-PER      0.861     0.951     0.904      1156
       B-ORG      0.775     0.727     0.750      1661
       I-ORG      0.679     0.734     0.705       835
       B-LOC      0.856     0.814     0.834      1668
       I-LOC      0.745     0.626     0.681       257
      B-MISC      0.819     0.754     0.785       702
      I-MISC      0.688     0.653     0.670       216
           O      0.989     0.989     0.989     38323

    accuracy                          0.956     46435
   macro avg      0.804     0.790     0.795     46435
weighted avg      0.956     0.956     0.956     46435
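
The gap between the macro and weighted averages in the report comes from how much weight the `O` tag carries. As a rough check, the sketch below recomputes both averages from the per-tag F1 scores and supports printed in the report, then repeats the calculation with `O` left out. The function names `macro_f1` and `weighted_f1` are hypothetical helpers, and because the inputs are the rounded report values, results may differ from the library's figures in the last digit.

```python
# Per-tag (F1, support) pairs copied from the classification report.
report_rows = {
    "B-PER": (0.841, 1617),
    "I-PER": (0.904, 1156),
    "B-ORG": (0.750, 1661),
    "I-ORG": (0.705, 835),
    "B-LOC": (0.834, 1668),
    "I-LOC": (0.681, 257),
    "B-MISC": (0.785, 702),
    "I-MISC": (0.670, 216),
    "O": (0.989, 38323),
}


def macro_f1(rows):
    """Unweighted mean of the per-tag F1 scores."""
    return sum(f1 for f1, _ in rows) / len(rows)


def weighted_f1(rows):
    """Mean of the per-tag F1 scores, weighted by support."""
    total = sum(support for _, support in rows)
    return sum(f1 * support for f1, support in rows) / total


all_rows = list(report_rows.values())
entity_rows = [row for tag, row in report_rows.items() if tag != "O"]

print(f"All tags     macro: {macro_f1(all_rows):.3f}  weighted: {weighted_f1(all_rows):.3f}")
print(f"Entity tags  macro: {macro_f1(entity_rows):.3f}  weighted: {weighted_f1(entity_rows):.3f}")
```

With `O` excluded, the weighted F1 falls to roughly 0.80, which is a more realistic picture of entity-tag quality than the headline 0.956.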



#### Top Learned Transition Weights

Inspect the transition weights the CRF learned between consecutive tags: the ten highest-weighted (most favored) and the ten lowest-weighted (most penalized) tag pairs.

In [25]:
def top_bottom(d, k=10):
    items = sorted(d.items(), key=lambda x: x[1])
    return items[-k:], items[:k]

topT, botT = top_bottom(crf.transition_features_, 10)
print("Top transitions:")
for (a,b), w in topT:
    print(a,"->",b, w)

print("\nWorst transitions:")
for (a,b), w in botT:
    print(a,"->",b, w)

Top transitions:
O -> B-PER 2.010622
I-PER -> I-PER 2.929024
O -> O 3.343132
B-ORG -> I-ORG 3.825759
I-ORG -> I-ORG 4.178668
I-LOC -> I-LOC 4.411735
B-MISC -> I-MISC 4.60809
I-MISC -> I-MISC 4.819772
B-LOC -> I-LOC 4.899898
B-PER -> I-PER 5.005456

Worst transitions:
O -> I-ORG -5.837198
O -> I-MISC -4.973732
O -> I-LOC -4.724829
B-PER -> B-PER -4.527656
B-LOC -> I-ORG -4.252898
O -> I-PER -4.229781
B-MISC -> I-ORG -4.187154
B-ORG -> B-LOC -3.978385
B-ORG -> I-PER -3.918017
I-PER -> B-PER -3.739063
