# Named Entity Recognition

In [1]:
# install required libraries
!pip install sklearn-crfsuite seqeval spacy-transformers


In [2]:
# download spacy model
!python -m spacy download en_core_web_trf


In [3]:
from google.colab import drive
import os

drive.mount('/content/drive')

project_path = '/content/drive/MyDrive/NLP_Projects/Week_2/ner'
os.chdir(project_path)

Mounted at /content/drive


## Import Libraries

In [4]:
# standard library
import re
import time
import warnings
from collections import Counter

# third-party
import nltk
import spacy
from nltk.corpus.reader import ConllCorpusReader
from scipy.stats import loguniform
from seqeval.metrics import classification_report, f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, cross_val_score, PredefinedSplit
from sklearn_crfsuite import CRF, scorers, metrics
from sklearn_crfsuite.metrics import flat_f1_score
from spacy.tokens import DocBin
from tqdm import tqdm

warnings.filterwarnings('ignore')

In [5]:
# check if GPU is available for SpaCy model
print("spaCy GPU Available:", spacy.prefer_gpu())

spaCy GPU Available: False


We load spaCy's transformer-based English pipeline (`en_core_web_trf`), which feeds contextual embeddings from a pretrained RoBERTa-base transformer into its downstream components, including the tagger and the NER component.

In [6]:
nlp = spacy.load('en_core_web_trf')

## Explore the CoNLL-2003 Training and Test Data

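We load the CoNLL-2003 dataset using NLTK's `ConllCorpusReader` and extract BIO-tagged sentences for named entity recognition. The files have four columns (word, POS tag, chunk tag, NER tag); the reader is told to ignore the third column and to treat the fourth as its `chunk` column, so `iob_sents()` yields `(word, POS, NER tag)` triples. This section provides basic statistics about the dataset:

- **Training and Test Size**: Number of sentences and total token-level observations in both splits. The test set used here concatenates `eng.testa` (the standard development split) and `eng.testb` (the standard test split).
- **Vocabulary and Label Space**: Counts of unique words and unique BIO tags in the training data (nine tags: `O` plus the `B-` and `I-` variants of the four entity types).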
- **Distribution Summary**: The 10 most frequent words and entities to help us understand class imbalance or dominant terms in the corpus.
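
To make the file layout concrete, here is a minimal sketch that parses CoNLL-2003-style text by hand. It assumes four whitespace-separated columns per token, blank lines between sentences, and `-DOCSTART-` marker lines between documents. The helper `parse_conll` and the sample text are illustrative only; they are not part of NLTK and are not used elsewhere in this notebook.

```python
def parse_conll(text):
    """Parse CoNLL-2003-style text into sentences of (word, pos, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # A blank line or a document marker closes the current sentence.
        if not line or line.startswith('-DOCSTART-'):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, _chunk, ner = line.split()  # the chunk column is discarded
        current.append((word, pos, ner))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the B-/I- tagging convention used by this notebook's data.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

demo_sentences = parse_conll(sample)
print(len(demo_sentences))   # number of sentences
print(demo_sentences[0][0])  # first token of the first sentence
```

Each parsed sentence has the same `(word, POS, tag)` shape that `iob_sents()` returns, which is the shape the rest of the notebook relies on.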

In [7]:
train = ConllCorpusReader('./Data/', 'eng.train', ['words', 'pos', 'ignore', 'chunk'])
test_a = ConllCorpusReader('./Data', 'eng.testa', ['words', 'pos', 'ignore', 'chunk'])
test_b = ConllCorpusReader('./Data', 'eng.testb', ['words', 'pos', 'ignore', 'chunk'])

train_sentences = train.iob_sents()
test_sentences = test_a.iob_sents() + test_b.iob_sents()

train_sentences = [sent for sent in train_sentences if len(sent) > 0]
test_sentences = [sent for sent in test_sentences if len(sent) > 0]

In [8]:
print('Number of sentences in train:', len(train_sentences))
print('Number of sentences in test:', len(test_sentences))
print()
print('Total number of observations in train:', sum(map(len, train_sentences)))
print('Total number of observations in test:', sum(map(len, test_sentences)))

Number of sentences in train: 14041
Number of sentences in test: 6703

Total number of observations in train: 203621
Total number of observations in test: 97797


In [9]:
words = [word for sentence in train_sentences for word, _, _ in sentence]
entities = [ent for sentence in train_sentences for _, _, ent in sentence]

print('Number of unique words:', len(set(words)))
print('Number of unique entities:', len(set(entities)))

Number of unique words: 23623
Number of unique entities: 9


In [10]:
word_counts = Counter(words)
entity_counts = Counter(entities)

print('Ten most common words:', word_counts.most_common(10))
print()
print('Ten most common entities:', entity_counts.most_common(10))

Ten most common words: [('.', 7374), (',', 7290), ('the', 7243), ('of', 3751), ('in', 3398), ('to', 3382), ('a', 2994), ('(', 2861), (')', 2861), ('and', 2838)]

Ten most common entities: [('O', 169578), ('B-LOC', 7140), ('B-PER', 6600), ('B-ORG', 6321), ('I-PER', 4528), ('I-ORG', 3704), ('B-MISC', 3438), ('I-LOC', 1157), ('I-MISC', 1155)]


## SpaCy Pre-trained Model for NER

SpaCy provides powerful pretrained models for a wide range of NLP tasks, including Named Entity Recognition (NER). These models are trained on large annotated corpora and can recognize common entity types such as `PERSON`, `ORG`, `GPE`, `DATE`, and more.

In this project, we use `en_core_web_trf`, spaCy's transformer-based English pipeline. Its NER component sits on top of a RoBERTa-base transformer and was trained on OntoNotes 5, so it predicts that corpus's 18 entity types rather than the four CoNLL-2003 types.

Advantages of using a SpaCy pretrained NER model:
- **Out-of-the-box performance**: No training required; just load and run.
- **Optimized inference**: The pipeline is engineered for throughput, although the transformer variant is markedly slower on CPU (as in this notebook) than on GPU.
- **Transfer learning**: Uses contextual embeddings from large language models.

However, these models may produce entities not aligned with custom tag sets (e.g., the CoNLL-2003 `PER`, `LOC`, `ORG`, `MISC`). To address this, we map SpaCy's entity labels to the CoNLL schema before evaluation.

### Quick Experiment with the spaCy Transformer Model

Before building our own named entity recognizer, we run a quick experiment using spaCy’s pretrained transformer model (`en_core_web_trf`). This allows us to compare its predictions against our custom CRF model later and evaluate how well out-of-the-box solutions perform on our dataset.

In [11]:
text = 'Apple Inc. was founded by Steve Jobs and is headquartered in Cupertino, California.'

start_time = time.time()
doc = nlp(text)
end_time = time.time()

print('Entities:')
for ent in doc.ents:
  print(f'{ent.text} -> {ent.label_}')

print()
print('Execution Time:', end_time - start_time)

Entities:
Apple Inc. -> ORG
Steve Jobs -> PERSON
Cupertino -> GPE
California -> GPE

Execution Time: 0.2668280601501465


### spaCy Model Evaluation Function

This function evaluates the pretrained spaCy NER model on our dataset by aligning its predicted entity spans with the gold BIO-tagged labels. Because spaCy outputs entity spans without BIO tags, we manually convert them using character alignment and label mapping. This allows for a fair comparison between the spaCy model and the BIO-formatted ground truth.


In [12]:
def evaluate_spacy_ner(tagged_sents, label_map):
  """
  Evaluates spaCy's pretrained NER model against a BIO-tagged dataset.

  This function converts spaCy's entity span predictions into BIO format
  and aligns them with the true labels from the dataset.

  Args:
      tagged_sents (List[List[Tuple[str, str, str]]]):
          A list of sentences, where each sentence is a list of (word, POS, entity) tuples.
      label_map (Dict[str, str]):
          A mapping from spaCy's entity labels (e.g., 'PERSON', 'ORG') to the dataset's tag scheme (e.g., 'PER', 'ORG').

  Returns:
      Tuple[List[List[str]], List[List[str]]]:
          - `predictions`: BIO-formatted label predictions from spaCy.
          - `true_labels`: Ground truth BIO labels from the input data.

  Notes:
      - Assumes each sentence is tokenized and the gold labels follow IOB format.
      - Uses character alignment (`char_span`) to map spaCy spans to token indices.
      - Entities not in the label map are defaulted to 'O'.
  """
  predictions, true_labels = [], []

  tqdm_bar = tqdm(total = len(tagged_sents), mininterval = 0)

  for s, sentence in enumerate(tagged_sents):
    words, _, entities = zip(*sentence)
    text = ' '.join(words)

    doc = nlp(text)
    preds = ['O'] * len(words)

    for ent in doc.ents:
      span = doc.char_span(ent.start_char, ent.end_char, alignment_mode = 'expand')

      if span:
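        # NOTE: `text.find(word)` returns the first occurrence of `word` (substring
        # matches included), so an entity whose first word also appears earlier in
        # the sentence is left unlabeled. `len(span)` counts spaCy tokens, which can
        # differ from the number of whitespace-separated words the entity covers.
        for i, word in enumerate(words):
          if text.find(word) == ent.start_char: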
            prediction = label_map.get(ent.label_, 'O')
            preds[i] = f'B-{prediction}' if prediction != 'O' else 'O'
            for j in range(1, len(span)):
              if j + i < len(words):
                preds[j + i] = f'I-{prediction}' if prediction != 'O' else 'O'
            break

    predictions.append(preds)
    true_labels.append(list(entities))

    if s % 250 == 0:
      tqdm_bar.update(250)

  tqdm_bar.close()

  return predictions, true_labels
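
One limitation of the alignment above is that `text.find(word)` always returns the first occurrence of a word, so an entity whose first word is repeated earlier in the sentence is never matched. A possible alternative, sketched here under the assumption that the text is built with `' '.join(words)`, is to precompute each word's character offsets and tag every word that overlaps an entity span. The helper `spans_to_bio` is hypothetical and was not used to produce the results in this notebook.

```python
def spans_to_bio(words, ent_spans, label_map):
    """Convert character-level entity spans into BIO tags over `words`.

    ent_spans: iterable of (start_char, end_char, label) measured on ' '.join(words).
    """
    # Character offsets of each word inside ' '.join(words).
    offsets, pos = [], 0
    for word in words:
        offsets.append((pos, pos + len(word)))
        pos += len(word) + 1  # +1 for the joining space

    tags = ['O'] * len(words)
    for start, end, label in ent_spans:
        mapped = label_map.get(label, 'O')
        if mapped == 'O':
            continue
        # Indices of the words whose character range overlaps the entity span.
        covered = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
        for k, i in enumerate(covered):
            tags[i] = f'B-{mapped}' if k == 0 else f'I-{mapped}'
    return tags


# Toy example: 'Paris' occurs twice, and only the second occurrence starts the entity.
demo_words = ['Paris', 'fans', 'met', 'Paris', 'Hilton', '.']
demo_text = ' '.join(demo_words)
ent_start = demo_text.index('Paris Hilton')
demo_spans = [(ent_start, ent_start + len('Paris Hilton'), 'PERSON')]

print(spans_to_bio(demo_words, demo_spans, {'PERSON': 'PER'}))
# → ['O', 'O', 'O', 'B-PER', 'I-PER', 'O']
```

With a spaCy `doc`, the spans could be built as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.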

### Understanding MISC Labels and Mapping spaCy Entity Types

In this section, we explore which words are commonly labeled as `MISC` in the CoNLL-2003 dataset. The `MISC` category is used for entities that don't fall under standard categories like `PER` (person), `LOC` (location), or `ORG` (organization). Examples often include nationalities (e.g., "French", "Israeli"), political affiliations, or event names like "Cup" or "League".

To analyze this, we extract all words in the training set that are tagged as `B-MISC` or `I-MISC`, count their frequency, and inspect the most common examples. This helps us better understand what the dataset considers `MISC`, which is especially useful for aligning external model predictions (like spaCy's) to this tag schema.

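We then define a `label_map` to translate spaCy's NER labels (e.g., `PERSON`, `GPE`, `NORP`) into CoNLL-style tags (`PER`, `LOC`, `ORG`, `MISC`, or `O`). This is necessary because spaCy predicts the OntoNotes entity types, and we need a consistent label space to evaluate against the CoNLL annotations. Any label missing from the map falls back to `O`. Note that this includes spaCy's own `LOC` label (non-GPE locations such as mountain ranges and bodies of water), which the map below does not list.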

In [13]:
words_with_misc = []
for i, sent in enumerate(train_sentences):
  word, _, entity = zip(*sent)
  for j, e in enumerate(entity):
    if e == 'B-MISC' or e == 'I-MISC':
      words_with_misc.append(word[j])

misc_counts = Counter(words_with_misc)
print(misc_counts.most_common(100))

[('Russian', 95), ('Cup', 95), ('German', 84), ('British', 78), ('Open', 71), ('French', 67), ('European', 55), ('American', 55), ('Australian', 52), ('Iraqi', 51), ('Dutch', 50), ('Israeli', 49), ('GMT', 49), ('World', 49), ('League', 49), ('DIVISION', 48), ('African', 45), ('English', 42), ('Olympic', 41), ('LEAGUE', 41), ('Palestinian', 39), ('U.S.', 38), ('Italian', 35), ('South', 34), ('Kurdish', 33), ('Bosnian', 32), ('Moslem', 32), ('Democratic', 31), ('Belgian', 28), ('Japanese', 28), ('Serb', 28), ('Republican', 28), ('OPEN', 26), ('Chinese', 25), ('Grand', 25), ('Turkish', 24), ('Sudanese', 23), ('Polish', 22), ('Iranian', 22), ('Swiss', 22), ('C$', 21), ('Brazilian', 21), ('Indian', 21), ('Palestinians', 20), ('of', 20), ('CUP', 20), ('Canadian', 20), ('National', 20), ('Chechen', 20), ('Thai', 20), ('ENGLISH', 19), ('A$', 19), ('Wimbledon', 18), ('Serbs', 18), ('GERMAN', 16), ('Series', 16), ('MAJOR', 16), ('Major', 16), ('EASTERN', 16), ('CENTRAL', 16), ('WESTERN', 16), ('

In [14]:
label_map = {
    'PERSON': 'PER',
    'GPE': 'LOC',
    'ORG': 'ORG',
    'NORP': 'MISC',
    'FAC': 'MISC',
    'EVENT': 'MISC',
    'WORK_OF_ART': 'O',
    'LAW': 'O',
    'LANGUAGE': 'MISC',
    'PRODUCT': 'MISC',
    'DATE': 'O',
    'TIME': 'O',
    'PERCENT': 'O',
    'MONEY': 'O',
    'QUANTITY': 'O',
    'ORDINAL': 'O',
    'CARDINAL': 'O'
}

### Evaluation of spaCy NER Predictions

We apply the `evaluate_spacy_ner` function to generate BIO-formatted entity predictions using the spaCy transformer-based NER model. The predictions are aligned to the CoNLL-2003 label set using our custom `label_map`.

The resulting predictions are then evaluated using `classification_report` from `seqeval`, which computes entity-level precision, recall, and F1-score for each entity class (`PER`, `LOC`, `ORG`, `MISC`): a predicted entity counts as correct only if both its boundaries and its type match the gold annotation. This helps us assess how well the spaCy model aligns with the ground truth annotations in the test set.
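
To illustrate what an entity-level score measures, the following is a simplified sketch of micro-averaged, exact-match precision, recall, and F1 over BIO sequences. The functions `extract_entities` and `entity_f1` are hypothetical helpers written for this illustration; `seqeval`'s own implementation may treat edge cases differently.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO sequence (end exclusive)."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes a trailing entity
        inside = tag.startswith('I-') and tag[2:] == ent_type
        if not inside:
            if ent_type is not None:
                entities.add((ent_type, start, i))
            if tag.startswith(('B-', 'I-')):  # a stray I- tag opens a new entity here
                start, ent_type = i, tag[2:]
            else:
                start, ent_type = None, None
    return entities


def entity_f1(true_seqs, pred_seqs):
    """Micro-averaged exact-match precision, recall, and F1 over all sentences."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
        tp += len(gold & pred)
        n_true += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Three of four tokens are tagged correctly, but only one of two entities matches exactly.
gold_demo = [['B-PER', 'I-PER', 'O', 'B-LOC']]
pred_demo = [['B-PER', 'O', 'O', 'B-LOC']]
print(entity_f1(gold_demo, pred_demo))
# → (0.5, 0.5, 0.5)
```

The example shows why entity-level scores are stricter than token-level accuracy: truncating a two-token name counts as both a missed entity and a spurious one.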

In [15]:
pred, true_labels = evaluate_spacy_ner(test_sentences, label_map)

6750it [15:22,  7.32it/s]


In [17]:
print(classification_report(true_labels, pred))

              precision    recall  f1-score   support

         LOC       0.78      0.79      0.78      3505
        MISC       0.77      0.65      0.70      1624
         ORG       0.76      0.42      0.54      3002
         PER       0.88      0.89      0.89      3459

   micro avg       0.81      0.70      0.75     11590
   macro avg       0.80      0.69      0.73     11590
weighted avg       0.80      0.70      0.74     11590



## CRF-Based NER

Conditional Random Fields (CRFs) are a type of probabilistic sequence modeling algorithm commonly used for tasks like Named Entity Recognition (NER) and Part-of-Speech tagging.

Unlike classifiers that make independent predictions for each word, CRFs model the sequence of labels jointly, capturing dependencies between neighboring tags. This is particularly useful in NER, where tag consistency matters (e.g., in the BIO scheme an `I-PER` tag should only follow a `B-PER` or another `I-PER`).

In the context of NER, CRFs learn the most likely tag sequence for a sentence by using features such as:
- The current word and surrounding words
- Part-of-speech tags
- Word shape (e.g., capitalization, digits, symbols)
- Prefixes/suffixes

CRFs are effective because they:
- Consider the context of the entire sentence
- Learn transition weights that favor valid tag sequences
- Support rich feature engineering

This makes them well-suited for structured prediction tasks involving sequences of text.
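
The effect of transition weights on decoding can be sketched with a small Viterbi decoder. This is a toy illustration in pure Python, not the CRFsuite implementation: the function `viterbi`, the tag set, and all scores below are made up for the example.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list of dicts, one per token, mapping tag -> score.
    transitions: dict mapping (prev_tag, tag) -> score; missing pairs score 0.
    """
    # best[t][tag] = (score of the best path ending in `tag` at position t, previous tag)
    best = [{tag: (emissions[0].get(tag, 0.0), None) for tag in tags}]
    for t in range(1, len(emissions)):
        column = {}
        for tag in tags:
            prev_tag, score = max(
                ((p, best[t - 1][p][0] + transitions.get((p, tag), 0.0)) for p in tags),
                key=lambda item: item[1],
            )
            column[tag] = (score + emissions[t].get(tag, 0.0), prev_tag)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda k: best[-1][k][0])
    path = [tag]
    for t in range(len(emissions) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]


demo_tags = ['O', 'B-PER', 'I-PER']
# Hypothetical scores: in isolation, the second token looks more like I-PER than O.
demo_emissions = [
    {'O': 2.0, 'B-PER': 0.0, 'I-PER': 0.0},
    {'O': 1.0, 'B-PER': 0.0, 'I-PER': 1.5},
]
demo_transitions = {('O', 'I-PER'): -5.0}  # penalize I-PER directly after O

print(viterbi(demo_emissions, {}, demo_tags))                # → ['O', 'I-PER']
print(viterbi(demo_emissions, demo_transitions, demo_tags))  # → ['O', 'O']
```

Without transition scores the decoder picks the invalid sequence `O, I-PER`; with a negative weight on that transition it prefers `O, O`. A trained CRF learns such weights from data rather than having them set by hand.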

### Word Shape Feature Extraction

The `get_word_shape` function generates a simplified representation of a word's character pattern, known as **word shape**. This is a common feature in sequence labeling tasks like NER to help models generalize across words with similar patterns.

Each character is mapped as follows:
- `'X'` for uppercase letters
- `'x'` for lowercase letters
- `'d'` for digits
- `'s'` for symbols or punctuation

Repeated character types are **collapsed** using a regex to reduce redundancy. For example:
- `"Chicken"` becomes `"Xxxxxxx"` → `"Xx"`
- `"9-ball"` becomes `"dsxxxx"` → `"dsx"`
- `"NoMa'am!"` becomes `"XxXxsxxs"` → `"XxXxsxs"`

This shape helps the model recognize structure in unknown or rare words.

In [12]:
def get_word_shape(word):
  shape = ''
  for char in word:
    if char.isupper():
      shape += 'X'
    elif char.islower():
      shape += 'x'
    elif char.isdigit():
      shape += 'd'
    else:
      shape += 's'

  shape = re.sub(r'(.)\1+', r'\1', shape)
  return shape

In [13]:
print('Examples of Word Shapes:')
print('------------------------')
print('Chicken:', get_word_shape('Chicken'))
print('9-ball:', get_word_shape('9-ball'))
print("NoMa'am!:", get_word_shape("NoMa'am!"))

Examples of Word Shapes:
------------------------
Chicken: Xx
9-ball: dsx
NoMa'am!: XxXxsxs


### Feature Engineering for CRF-based NER

To train a CRF model for Named Entity Recognition, we need to extract rich, informative features for each word in a sentence. These features help the model capture local context and structural patterns relevant for entity labeling.

The `create_word_features` function constructs a dictionary of features for a given word in a sentence. These include:

- **Current word features:**
  - `word.lower()`: Lowercased form of the word
  - `word[-3:]`: Last 3 characters (suffix)
  - `word[3:]`: The word with its first 3 characters removed (this is a tail of the word, not a prefix; a 3-character prefix would be `word[:3]`)
  - `word.isupper()`, `word.islower()`, `word.istitle()`: Casing indicators
  - `word.isdigit()`: Whether the token consists only of digits
  - `word.shape`: Custom word shape (e.g., `'Xx'` for 'Chicken', `'dsx'` for '9-ball')
  - `pos_tag`: The part-of-speech tag
  - `pos_tag[:2]`: The first two characters of the POS tag (to capture coarse tag category)

- **Previous and next word features:**
  - Lowercase word, POS tag, and shape of the previous and next tokens
  - Special flags for beginning (`BOS`) and end (`EOS`) of sentence

The `create_sentence_features` function applies this logic across a full sentence, producing a sequence of token-level feature dictionaries. The `get_labels` function extracts the true IOB entity tags from a sentence.

This feature-based approach provides the necessary structure for training the CRF to detect named entities using both lexical and contextual cues.

In [14]:
def create_word_features(sentence, i):
  """
  Extracts a set of contextual and lexical features for a word at position `i` in a sentence.

  Args:
      sentence (List[Tuple[str, str, str]]): A list of (word, POS, label) tuples; only the word and POS tag are read.
      i (int): Index of the target word in the sentence.

  Returns:
      dict: A dictionary of features for the CRF model, including:
          - Lexical features: word lowercased, suffix, word minus its first 3 characters, casing, digit, shape
          - POS tag and first two characters of the tag
          - Features of previous and next word (if available)
          - Special flags for beginning (`BOS`) or end (`EOS`) of sentence
  """
  word, pos_tag = sentence[i][0], sentence[i][1]

  features = {
      'bias': 1.0,
      'word.lower()': word.lower(),
      'word[-3:]': word[-3:],
      'word[3:]': word[3:],
      'word.isupper()': word.isupper(),
      'word.islower()': word.islower(),
      'word.istitle()': word.istitle(),
      'word.isdigit()': word.isdigit(),
      'word.shape': get_word_shape(word),
      'pos_tag': pos_tag,
      'pos_tag[:2]': pos_tag[:2]
  }

  if i > 0:
    prev_word, prev_pos_tag = sentence[i - 1][0], sentence[i - 1][1]
    features.update({
        '-1:word.lower()': prev_word.lower(),
        '-1:pos_tag': prev_pos_tag,
        '-1:shape': get_word_shape(prev_word)
    })
  else:
    features['BOS'] = True

  if i < len(sentence) - 1:
    next_word, next_pos_tag = sentence[i + 1][0], sentence[i + 1][1]
    features.update({
        '+1:word.lower()': next_word.lower(),
        '+1:pos_tag': next_pos_tag,
        '+1:word.shape': get_word_shape(next_word)
    })
  else:
    features['EOS'] = True

  return features

In [15]:
def create_sentence_features(sentence):
  """
  Generates a list of feature dictionaries for each word in a sentence.

  Args:
      sentence (List[Tuple[str, str, str]]): A list of (word, POS, label) tuples.

  Returns:
      List[Dict[str, Any]]: A list of dictionaries, each containing features for a single word.
                            Features are generated using the `create_word_features` function.
  """
  return [create_word_features(sentence, i) for i in range(len(sentence))]

def get_labels(sentence):
  """
  Extracts the label sequence from a tagged sentence.

  Args:
      sentence (List[Tuple[str, str, str]]): A list of (word, POS, label) tuples.

  Returns:
      List[str]: A list of labels corresponding to each token in the sentence.
  """
  return [label for _, _, label in sentence]

### Training the CRF Model

To train our Conditional Random Field (CRF) model for named entity recognition, we first transform our training data into feature dictionaries using the `create_sentence_features` function. These features include word characteristics, surrounding context, part-of-speech tags, and shape-based indicators.

We use the `sklearn-crfsuite` implementation of CRF with the following settings:
- `algorithm='lbfgs'`: Gradient-based training with L-BFGS, a quasi-Newton optimization method.
- `c1=0.1`, `c2=0.1`: Coefficients for L1 and L2 regularization, respectively, to limit overfitting.
- `max_iterations=100`: Upper bound on the number of optimizer iterations.
- `all_possible_transitions=True`: Generates transition features for all tag pairs, including pairs that never occur in the training set, so the model can learn negative weights for them.

This step fits the CRF model on the training data and learns transition and emission parameters for predicting BIO-tagged entities.
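
As a rough guide to what `c1` and `c2` control, the trainer minimizes a regularized negative log-likelihood of approximately the following elastic-net form. The notation is ours ($w$ is the weight vector and $(x^{(n)}, y^{(n)})$ are the $N$ training sentences with their tag sequences), and constant factors may differ from the CRFsuite implementation:

$$
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
$$

A larger `c1` pushes more feature weights to exactly zero, which yields a sparser model, while a larger `c2` shrinks all weights smoothly toward zero.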

In [16]:
X_train = [create_sentence_features(sentence) for sentence in train_sentences]
y_train = [get_labels(sentence) for sentence in train_sentences]

In [17]:
crf_model = CRF(
    algorithm = 'lbfgs',
    c1 = 0.1,
    c2 = 0.1,
    max_iterations = 100,
    all_possible_transitions = True
)

In [18]:
crf_model.fit(X_train, y_train)

### Evaluating the CRF Model

After training, we evaluate the CRF model on the test set. The test sentences are first converted into feature dictionaries using the same `create_sentence_features` function used during training.

We then:
- Generate predictions using the trained `crf_model`.
- Use `classification_report` from `seqeval` to compute standard NER evaluation metrics such as precision, recall, and F1-score for each entity class.

This allows us to assess how well the CRF model is able to generalize to unseen data based on the BIO-tagged ground truth.



In [19]:
X_test = [create_sentence_features(sentence) for sentence in test_sentences]
y_test = [get_labels(sentence) for sentence in test_sentences]

In [20]:
y_pred = crf_model.predict(X_test)

In [21]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         LOC       0.90      0.90      0.90      3505
        MISC       0.86      0.81      0.84      1624
         ORG       0.82      0.77      0.79      3002
         PER       0.88      0.88      0.88      3459

   micro avg       0.87      0.85      0.86     11590
   macro avg       0.86      0.84      0.85     11590
weighted avg       0.86      0.85      0.86     11590



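The CRF model outperforms the pretrained spaCy model on this test set (micro-averaged F1 of 0.86 versus 0.75). The comparison favors the CRF for two reasons: it was trained directly on CoNLL-2003 labels, whereas the spaCy model predicts a different label scheme that we had to map onto the CoNLL tags by hand, and the span-to-token alignment in `evaluate_spacy_ner` is heuristic. The largest gap is in `ORG` recall (0.42 for spaCy versus 0.77 for the CRF).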

### Hyperparameter Tuning with RandomizedSearchCV

To optimize the CRF model's performance, we perform hyperparameter tuning using `RandomizedSearchCV`. Specifically, we tune the `c1` and `c2` regularization parameters, which control L1 and L2 regularization strength respectively. Both are sampled from a log-uniform distribution between 0.01 and 1.

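We define a custom scoring function using `flat_f1_score` with weighted averaging so that each tag contributes in proportion to its frequency. Because this score is computed per token and includes the dominant `O` tag, the cross-validation scores (around 0.97) are not directly comparable to the entity-level F1 reported by `seqeval`. A 3-fold cross-validation is used during the search, and `RandomizedSearchCV` draws its default of 10 candidate settings.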

After identifying the best hyperparameters, we evaluate the best estimator on the test set using `classification_report` to get detailed precision, recall, and F1-score per entity class.

This approach helps balance model complexity and generalization, improving performance on unseen data.
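
For intuition about the log-uniform search space, the sketch below draws candidate `(c1, c2)` pairs using only the standard library. It is an illustration of the sampling idea, not what `scipy.stats.loguniform` or `RandomizedSearchCV` do internally; the helper `sample_loguniform` and the seed are assumptions made for the example.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the draws are repeatable
candidates = [
    {'c1': sample_loguniform(0.01, 1, rng), 'c2': sample_loguniform(0.01, 1, rng)}
    for _ in range(10)  # mirrors the 10 candidates drawn in the search
]

# On a log scale, [0.01, 0.1) and [0.1, 1] are equally likely, so about half
# of the draws are expected to fall below 0.1.
below_tenth = sum(c['c1'] < 0.1 for c in candidates)
print(candidates[0], below_tenth)
```

Sampling on a log scale spends as much of the search budget between 0.01 and 0.1 as between 0.1 and 1, which suits regularization coefficients whose useful values span orders of magnitude.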

In [22]:
crf_model = CRF(
    algorithm = 'lbfgs',
    max_iterations = 100,
    all_possible_transitions = True
)

In [34]:
# sampling distributions for the L1 (c1) and L2 (c2) regularization coefficients
param_space = {
    'c1': loguniform(0.01, 1),
    'c2': loguniform(0.01, 1)
}

def custom_score(estimator, X, y):
  # token-level weighted F1 over all tags, including 'O'
  y_pred = estimator.predict(X)
  return flat_f1_score(y, y_pred, average = 'weighted')

In [None]:
crf_grid = RandomizedSearchCV(
    crf_model,
    param_space,
    scoring = custom_score,
    cv = 3,
    verbose = 3,
    error_score = 'raise'
)

crf_grid.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END c1=0.16061952962331363, c2=0.8808514875228187;, score=0.968 total time=  22.0s
[CV 2/3] END c1=0.16061952962331363, c2=0.8808514875228187;, score=0.970 total time=  23.8s
[CV 3/3] END c1=0.16061952962331363, c2=0.8808514875228187;, score=0.966 total time=  23.1s
[CV 1/3] END c1=0.011038639969250728, c2=0.2291981564176078;, score=0.971 total time=  22.4s
[CV 2/3] END c1=0.011038639969250728, c2=0.2291981564176078;, score=0.974 total time=  24.1s
[CV 3/3] END c1=0.011038639969250728, c2=0.2291981564176078;, score=0.969 total time=  21.9s
[CV 1/3] END c1=0.3560044300530145, c2=0.017454558292561765;, score=0.969 total time=  22.1s
[CV 2/3] END c1=0.3560044300530145, c2=0.017454558292561765;, score=0.971 total time=  23.5s
[CV 3/3] END c1=0.3560044300530145, c2=0.017454558292561765;, score=0.968 total time=  23.7s
[CV 1/3] END c1=0.023377805507015053, c2=0.024281216631669565;, score=0.971 total time=  22.8s
[CV 2/3] E

In [40]:
print('Best params:', crf_grid.best_params_)
print('Best score:', crf_grid.best_score_)

Best params: {'c1': np.float64(0.023377805507015053), 'c2': np.float64(0.024281216631669565)}
Best score: 0.9716806319568848


In [41]:
best_crf = crf_grid.best_estimator_
y_pred = best_crf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         LOC       0.89      0.91      0.90      3505
        MISC       0.86      0.82      0.84      1624
         ORG       0.82      0.76      0.79      3002
         PER       0.88      0.89      0.88      3459

   micro avg       0.87      0.85      0.86     11590
   macro avg       0.86      0.85      0.85     11590
weighted avg       0.86      0.85      0.86     11590



Tuning `c1` and `c2` leaves test performance essentially unchanged (micro-averaged F1 stays at 0.86), which suggests that regularization strength is not the main bottleneck. In future experiments we could explore different features to improve performance.

### Conclusion: CRF vs. SpaCy

The CRF-based Named Entity Recognition model outperforms the pretrained spaCy transformer model under this evaluation setup (micro-averaged F1 of 0.86 versus 0.75). While spaCy relies on general-purpose language representations and a label scheme that must be mapped onto the CoNLL tags, the CRF is trained specifically on the CoNLL-2003 data, allowing it to learn fine-grained sequence patterns and label dependencies from handcrafted features such as word shape, part-of-speech tags, and surrounding context. The gains are concentrated in `LOC` (F1 of 0.90 versus 0.78), `MISC` (0.84 versus 0.70), and `ORG` (0.79 versus 0.54); on `PER` the two models are roughly level (0.88 versus 0.89).

This highlights the strength of CRFs in structured prediction tasks where domain-specific patterns and label consistency are crucial.