

# Sequence Labeling With CRF and HMM - NER

The extraction of relevant information from historical handwritten document collections is one of the key steps in order to make these manuscripts available for access and searches. In this context, instead of a pure transcription, the objective is to move towards document understanding. Concretely, the aim is to detect the named entities and assign each of them a semantic category, such as family names, places, occupations, etc.


A typical application scenario of named entity recognition is demographic documents, since they contain people's names, birthplaces, occupations, etc. In this scenario, the extraction of the key contents and its storage in databases allows the access to their contents and envision innovative services based in genealogical, social or demographic searches.

<p style = 'text-align: center'>
<img src = "http://dag.cvc.uab.es/wp-content/uploads/2016/07/esposalla_detall.jpg">
</p>

For further doubts and questions, refer to oriol.ramos@uab.cat and alicia.fornes@uab.cat.

Usage of Google Colab is not mandatory, but highly recommended as most of the behaviors are expected for a Linux VM with IPython bindings.

## First, we will install the unmet dependencies.

This will download some packages and the required data, it may take a while.

In [1]:
#@title
from IPython.display import clear_output

!git clone https://github.com/EauDeData/nlp-resources
!cp -r nlp-resources/ resources/
!rm -rf nlp-resources/


!pip install nltk
!pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite
clear_output()

from typing import *

from itertools import chain
import nltk
import numpy as np
import copy
import random
from collections import Counter

import pycrfsuite as crfs
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer

from resources.data.dataloaders import EsposallesTextDataset

## Data Curation
Loading the dataset - From here you could re-use your previous work.

In [2]:
random.seed(42)
train_loader = EsposallesTextDataset('resources/data/esposalles/')
test_loader = copy.deepcopy(train_loader)
test_loader.test()

Example of data from each loader:
> Format string: ```word```:```label```


In [3]:
print([f"{x}:{y}" for x,y in zip(*train_loader[0])])
print([f"{x}:{y}" for x,y in zip(*test_loader[0])])

['Dilluns:other', 'a:other', '5:other', 'rebere:other', 'de:other', 'Hyacinto:name', 'Boneu:surname', 'hortola:occupation', 'de:other', 'Bara:location', 'fill:other', 'de:other', 'Juan:name', 'Boneu:surname', 'parayre:occupation', 'defunct:other', 'y:other', 'de:other', 'Maria:name', 'ab:other', 'Anna:name', 'donsella:state', 'filla:other', 'de:other', 't:name', 'Cases:surname', 'pages:occupation', 'de:other', 'Bara:location', 'defunct:other', 'y:other', 'de:other', 'Peyrona:name']
['Divendres:other', 'a:other', '18:other', 'rebere:other', 'de:other', 'Juan:name', 'Torres:surname', 'pages:occupation', 'habitant:other', 'en:other', 'Sabadell:location', 'fill:other', 'de:other', 'Bernat:name', 'Torres:surname', 'pages:occupation', 'de:other', 'Moya:location', 'bisbat:location', 'de:location', 'Vich:location', 'y:other', 'de:other', 'Antiga:name', 'defucts:other', 'ab:other', 'Margarida:name', 'donsella:state', 'filla:other', 'de:other', 'Juan:name', 'Argemir:surname', 'pages:occupation',

If the dataset is correctly downloaded you will see two different samples above, and both tests passed below.

In [4]:
#@title

# Dataset ckeck

for idx in range(len(train_loader)):

  x, y = train_loader[idx]
  if len(x) != len(y):
    print('train_set test not passed')
    break

else: print('train_set test passed')

for idx in range(len(test_loader)):

  x, y = test_loader[idx]
  if len(x) != len(y):
    print('test_set test not passed')
    break

else: print('test_set test passed')

train_set test passed
test_set test passed


Since most of the computation won't be done with strings, the following function will create a Look Up Table (LUT) that transforms string tokens into ```int``` tokens.

In [5]:
def create_tokens_lut(train_dataset) -> Dict:
  '''
  Input:
    train_dataset: Training dataset.

    Don't tokenize test_set as later on,
       we will be considering out-of-vocabulary words as <unk> tokens.

    NOTE: Tokens MUST be lowered (.lower()) before considering them.

  Ouput:
    LUT[Dict]: {
      word1: 0,
      word2: 1,
        ...
      wordn: n - 1
    }

  '''
  ct = 0
  dict = {}
  for i in range(len(train_dataset)):
    sentence,_ = train_dataset[i]
    for word in sentence:
      word = word.lower()
      if word not in dict:
        dict[word] = ct
        ct +=1

  return dict


LUT = create_tokens_lut(train_loader)

In [6]:
def check_oov_words(LUT = LUT, test_set = test_loader):
  oov = set()
  for i in range(len(test_set)):
    sentence,_ = test_set[i]
    for word in sentence:
      word = word.lower()
      if word not in LUT:
        oov.add(word)

  return oov

print(list(check_oov_words()))


['sto', 'arisart', 'buyra', 'conflent', 'llobre', 'darder', 'fabres', 'gatuellas', 'assensio', 'caxaler', 'begas', 'vivints', 'hor', 'gat', 'faneca', 'guardi#', 'villaro', 'tola', 'corties', 'felis', 'busquet', 'muntells', 'auger', 'dimas', 'ge', 'broquets', 'mso', 'nyella', 'rius', 'perris', 'spa', 'sengermes', 'morros', 'valta', 'comptat', 'clavetayre', 'melcior', 'rotxe', 'gusman', 'brasil', 'thome', 'cebriana', 'monblanch', 'payas', 'blanquart', 'theodora', 'plans', 'pallissa', 'poses', 'faliu', 'llorenci', 'moreno', 'ritoreta', 'novell', 'francesa', 'rianna', 'scrivent', 'bonastra', 'honorat', 'bachs', 'castigaleu', 'sitjar', 'arevig', 'campprecios', 'galeras', 'quart', 'victo', 'peramon', 'box', 'campderos', 'gassull', 'dimarts', 'majol', 'ortiz', 'raguera', 'miro', 'tamuyell', 'pachs', 'cabrer', 'monllor', 'macip', 'munmany', 'tibau', 'more', 'angli', 'terre', 'testa', 'ribagossa', 'crich', 'llondra', 'alaverni', 'juan#', 'conteso', 'islla', 'habitants', 'rosa', 'constansa', 'ca

Due to batch computation of some modules, sequences must have constant length. As a common practice, we will create three new tokens ```<bos>``` and ```<eos>``` for the start and the end of a given sequence and ```<unk>``` for unkown tokens in the application (test) layer or 0 padding during the training. Manually add those tokens to the ```LUT```.


Under those constraints, fill the corresponding functions that will post-process each batch. Feel free to code more post-processing functions if you need it.





In [7]:
LUT['<unk>'] = len(LUT) + 1
LUT['<bos>'] = len(LUT) + 1
LUT['<eos>'] = len(LUT) + 1

In [8]:
MAX_SEQUENCE_LENGTH = 50
def complete_seq(X) -> List[List]:

  '''

    Input:
      X: A batch of N sequences [
        [word1, ..., wordn],
        [word1, ..., wordm]
      ]

    Output:
      A batch of N sequences with MAX_SEQUENCE_LENGTH tokens.
        - The starting token will always be <sos>
        - The last 'real' token <eos>
        - Tokens from <eos> until MAX_SEQUENCE_LENGTH will be <unk> as 0 padding.

  '''
  for seq in X:
    seq = ['<bos>'] + seq + ['<eos>']
    diff = MAX_SEQUENCE_LENGTH - len(seq)
    for i in range(diff):
      seq = seq + ['<unk>']

  return X

def post_process(X, functions = [complete_seq,]):
  for f in functions: X = f(X)
  return X

## NER - Baseline Approach

The first approach we will try is based on computing the probabilities for each word in our training corpus. This means computing the most likely category for each word in the dictionary.

Compute the test categories predictions and measure the performance for this simple model.

In [9]:
def compute_emissions_dict(dataloader) -> Dict:
  '''

    Given the train loader ```dataloader```
     this function will compute the max likelihood dictionary for each word.

  Input:
    dataloader: train loader with EsposallesTextDataset

  Outputs:
    Dict: {
      pagès: {name: X occupation: X}, # REMEMBER TO LOWER YOUR TOKENS!
            ...
      LUT - wordn: {category: x%, ...}
    }

  '''
  counts = {}
  dict = {}
  for i in range(len(dataloader)):
    sentence, tag = dataloader[i]
    for word,y in zip(sentence,tag):
      word = word.lower()
      if word not in counts:
        counts[word] = 1
      else:
        counts[word] +=1

      if word not in dict:
        dict[word] = {}
        dict[word][y] = 1
      else:
        if y not in dict[word]:
          dict[word][y] = 1
        else:
          dict[word][y] +=1

  for word in dict:
    for y in dict[word]:
      dict[word][y] /= counts[word]

  return dict

In [10]:
priors = compute_emissions_dict(train_loader)

At this point, as an example, your emissions dictionary should yield the following emission:

$P(location |$ ```Prats``` $) = 18\%$

$P(surname |$ ```Prats``` $) = 72\%$

$P(other |$ ```Prats``` $) = 9\%$

In [11]:
priors['prats']

{'location': 0.18181818181818182,
 'surname': 0.7272727272727273,
 'other': 0.09090909090909091}

This method has its limitations in terms of lack of context and, therefore, low expresivity.

The following function will compute the confusion matrix for the predictions in the ```test_set``` in order to find the most problematic words.

* What do they all have in common?
* What kind of words are the least performers?
* What's your solution for out-of-vocabulary words? Can you provide a prediction for those?

In [12]:
def predict_test_set(emissions, test_set):

  '''

  input string: casament eduard pages

  prediction: [None, name, occupation]

  Important:
    Remember to check if you can provide a label for each word (OOVs?).
  
  '''
  predictions = []
  for i in range(len(test_set)):
    sequence_pred = []
    seq, _ = test_set[i]
    for word in seq:
      word = word.lower()
      if word in emissions:
        max = 0
        t = 'other'
        for tag in emissions[word]:
          if emissions[word][tag] > max:
            max = emissions[word][tag]
            t = tag

        sequence_pred.append(t)

      else:
        # We use the tag 'other' for the elements that apper for the first time in the test_set
        sequence_pred.append('other')

    predictions.append(sequence_pred)

  return predictions


predictions = predict_test_set(priors, test_loader)

def find_common_errors(x_test: List[List], y_pred: List[List], y_true: List[List]) -> Dict:

  '''
    Input:
      x_test: A list with each sample in the corpus with the words for which we
ran each prediction
          [
           ['lorem', 'ipsum', 'dolor', 'sit', 'amet'],
           ['Hello', 'world', '!!!'],
          ]

      y_pred: A list with the predicted labels for each word in x_test corpus.
          [
           ['1', '0', '0', '1', '2'],
           ['2', '1', '0'],
          ]

      y_true: GT for the x_test sample
          [
           ['0', '0', '0', '1', '2'],
           ['0', '1', '0'],
          ]
    {
      pages: [{'pred': prediction, gt: label}, {'pred': prediction, 'gt': label}, ...]
    }
  '''
  errors = {}
  for i in range(len(x_test)):
    for word, pred, true in zip(x_test[i], y_pred[i], y_true[i]):
      if word not in errors:
        errors[word] = []
        errors[word].append({'pred':pred, 'gt':true})
      else:
        errors[word].append({'pred':pred, 'gt':true})

  return errors



def compute_token_precision(x_test: List[List], y_pred: List[List], y_true: List[List]):
  errors = find_common_errors(x_test, y_pred, y_true)
  #compute the confusion matrix
  #c_matrix[pred][gt] = 30 <=> gt was predicted as pred 30 times
  c_matrix = {}
  for word in errors:
    for pair in errors[word]:
      pred, gt = pair['pred'], pair['gt']
      if pred not in c_matrix:
        c_matrix[pred] = {}
        c_matrix[pred][gt] = 1
      else:
        if gt not in c_matrix[pred]:
          c_matrix[pred][gt] = 1
        else:
          c_matrix[pred][gt]+=1

  precision = 0
  sum1 = 0
  sum2 = 0
  for pred in c_matrix:
    sum1 += c_matrix[pred][pred]
    for gt in c_matrix[pred]:
      sum2 += c_matrix[pred][gt]

  precision = sum1 / sum2
  return precision

  #{'location':{'location':49, 'surname':23},'surname':{}}


In [13]:
find_common_errors([test_loader[idx][0] for idx in range(len(test_loader))], predictions, [test_loader[idx][1] for idx in range(len(test_loader))])['Esteva']


[{'pred': 'name', 'gt': 'location'},
 {'pred': 'name', 'gt': 'surname'},
 {'pred': 'name', 'gt': 'location'}]

In [14]:
compute_token_precision([test_loader[idx][0] for idx in range(len(test_loader))], predictions, [test_loader[idx][1] for idx in range(len(test_loader))])

0.885741873189572

In [15]:
print(predictions[2])
print(test_loader[2][1])

['other', 'other', 'other', 'other', 'other', 'name', 'name', 'occupation', 'other', 'location', 'other', 'other', 'name', 'name', 'occupation', 'other', 'location', 'other', 'other', 'other', 'other', 'other', 'other', 'name', 'other', 'name', 'other', 'state', 'other', 'other', 'name', 'other', 'occupation', 'other', 'location', 'other', 'other', 'name', 'other']
['other', 'other', 'other', 'other', 'other', 'name', 'surname', 'occupation', 'other', 'location', 'other', 'other', 'name', 'surname', 'occupation', 'other', 'location', 'other', 'other', 'other', 'other', 'other', 'other', 'name', 'other', 'name', 'name', 'state', 'other', 'other', 'name', 'surname', 'occupation', 'other', 'location', 'other', 'other', 'name', 'other']


The task involved Named Entity Recognition (NER) in historical handwritten documents, aiming to categorize entities like family names, places, and occupations. The current method employs a basic approach, assigning the most likely category to each word based on training data, but it lacks context and struggles with out-of-vocabulary (OOV) words. OOV words are currently labeled with the most frequent category, which may not be ideal. Analysis of the confusion matrix and common errors highlights areas for improvement, suggesting the incorporation of context-aware methods and better OOV handling techniques. These enhancements could enhance performance and accuracy in NER for historical handwritten documents.

## HMM Approach

As demonstrated in the previous experiment, using just the priors have not enough expresivity for managing both out of vocabulary words and polysemic words. Here we will use the ```python-crfsuite``` module to build a Hidden Markov Model and improve the predictions on ```test_set```.

Check <a href = 'https://python-crfsuite.readthedocs.io/en/latest/'>here</a> the  ```python-crfsuite``` documentation.

First, we will set up the parameters for our CRF model.


In [16]:
def get_word_to_hmm_features(sent, i):


    '''
     Reminder:
        The Markov assumption states that the transition for the i-th token
          depends on the (i-1)-th token.


    '''
    word, _ = sent[i]
    #emission probilities
    features = [
        'bias',
        'word.lower=' + word.lower(),
    ]
    if i == 0:
        features.append('bos')

    if (i == len(sent) - 1):
        features.append('eos')

    return features


def sent2HMMfeatures(sent):
    return [get_word_to_hmm_features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]

In [17]:
# transform the dataset
# to the (token, gt) tuple format
train_sents = [  [(x,y) for x,y in zip(*train_loader[idx])] for idx in range(len(train_loader))]
test_sents =  [  [(x,y) for x,y in zip(*test_loader[idx])] for idx in range(len(test_loader))]

In [18]:
print(train_sents[0])

[('Dilluns', 'other'), ('a', 'other'), ('5', 'other'), ('rebere', 'other'), ('de', 'other'), ('Hyacinto', 'name'), ('Boneu', 'surname'), ('hortola', 'occupation'), ('de', 'other'), ('Bara', 'location'), ('fill', 'other'), ('de', 'other'), ('Juan', 'name'), ('Boneu', 'surname'), ('parayre', 'occupation'), ('defunct', 'other'), ('y', 'other'), ('de', 'other'), ('Maria', 'name'), ('ab', 'other'), ('Anna', 'name'), ('donsella', 'state'), ('filla', 'other'), ('de', 'other'), ('t', 'name'), ('Cases', 'surname'), ('pages', 'occupation'), ('de', 'other'), ('Bara', 'location'), ('defunct', 'other'), ('y', 'other'), ('de', 'other'), ('Peyrona', 'name')]


In [19]:
X_train = [sent2HMMfeatures(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2HMMfeatures(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

In [20]:
trainer = crfs.Trainer(verbose=False) # Instance a CRF trainer

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq) # Stack the data

In [21]:
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # Max Number of iterations for the iterative algorithm

    # include transitions that are possible, but not observed (smoothing)
    'feature.possible_transitions': True
})

In [22]:
%%time
trainer.train('npl_ner_crf.crfsuite') # Train the model and save it locally.

CPU times: user 490 ms, sys: 2.03 ms, total: 492 ms
Wall time: 494 ms


In [23]:
tagger = crfs.Tagger()
tagger.open('npl_ner_crf.crfsuite') # Load the inference API

<contextlib.closing at 0x7bd2fd8fdf90>

In [24]:
print(test_sents[0])

[('Divendres', 'other'), ('a', 'other'), ('18', 'other'), ('rebere', 'other'), ('de', 'other'), ('Juan', 'name'), ('Torres', 'surname'), ('pages', 'occupation'), ('habitant', 'other'), ('en', 'other'), ('Sabadell', 'location'), ('fill', 'other'), ('de', 'other'), ('Bernat', 'name'), ('Torres', 'surname'), ('pages', 'occupation'), ('de', 'other'), ('Moya', 'location'), ('bisbat', 'location'), ('de', 'location'), ('Vich', 'location'), ('y', 'other'), ('de', 'other'), ('Antiga', 'name'), ('defucts', 'other'), ('ab', 'other'), ('Margarida', 'name'), ('donsella', 'state'), ('filla', 'other'), ('de', 'other'), ('Juan', 'name'), ('Argemir', 'surname'), ('pages', 'occupation'), ('de', 'other'), ('Sabadell', 'location'), ('y', 'other'), ('de', 'other'), ('Aldonsa', 'name'), ('defuncts', 'other')]


In [25]:
example_sent = test_sents[0]
print(' '.join(sent2tokens(example_sent)), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2HMMfeatures(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent))) # Inference

Divendres a 18 rebere de Juan Torres pages habitant en Sabadell fill de Bernat Torres pages de Moya bisbat de Vich y de Antiga defucts ab Margarida donsella filla de Juan Argemir pages de Sabadell y de Aldonsa defuncts

Predicted: other other other other other name surname occupation other other location other other name surname occupation other location location location location other other name other other name state other other name surname occupation other location other other name other
Correct:   other other other other other name surname occupation other other location other other name surname occupation other location location location location other other name other other name state other other name surname occupation other location other other name other


In [26]:
from sklearn.preprocessing import LabelBinarizer
def bio_classification_report(y_true, y_pred):
    """

    Classification report.
    You can use this as evaluation for both in the baseline model and new model.

    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))

    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}

    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )

In [27]:
# Compute the predictions
y_pred = [tagger.tag(xseq) for xseq in X_test]

In [28]:
print(y_pred[1])
print(y_test[1])

['other', 'other', 'other', 'other', 'other', 'name', 'surname', 'occupation', 'other', 'location', 'other', 'other', 'name', 'surname', 'occupation', 'occupation', 'other', 'other', 'name', 'other', 'name', 'state', 'other', 'other', 'name', 'surname', 'occupation', 'other', 'location', 'other', 'other', 'name']
['other', 'other', 'other', 'other', 'other', 'name', 'surname', 'occupation', 'other', 'location', 'other', 'other', 'name', 'surname', 'surname', 'occupation', 'other', 'other', 'name', 'other', 'name', 'state', 'other', 'other', 'name', 'surname', 'occupation', 'other', 'location', 'other', 'other', 'name']


* Use the  ```bio_classification_report``` function in both the Baseline model and the HMM model. Do you observe any improvement? In which cases does it still fail?



In [29]:
## Comparison of the baseline
### Perform both quantitative and qualitative analysis
baseline_pred = predict_test_set(priors, test_loader)
baseline_true = [test_loader[i][1] for i in range(len(test_loader))]

hmm_pred = y_pred

print("Baseline classification report:")
print(bio_classification_report(baseline_true, baseline_pred))

print("HMM classification report:")
print(bio_classification_report(y_test, hmm_pred))

Baseline classification report:
              precision    recall  f1-score   support

    location       0.94      0.71      0.81       462
        name       0.94      0.95      0.94       494
  occupation       0.94      0.85      0.90       294
       other       0.85      0.99      0.91      1493
       state       0.98      0.95      0.96       113
     surname       0.84      0.47      0.60       251

   micro avg       0.89      0.89      0.89      3107
   macro avg       0.91      0.82      0.85      3107
weighted avg       0.89      0.89      0.88      3107
 samples avg       0.89      0.89      0.89      3107

HMM classification report:
              precision    recall  f1-score   support

    location       0.89      0.92      0.91       462
        name       0.98      0.96      0.97       494
  occupation       0.94      0.92      0.93       294
       other       0.97      0.98      0.97      1493
       state       0.98      0.95      0.96       113
     surname       

The HMM model shows significant improvements over the baseline, which relies heavily on emission probabilities. However, the baseline has limitations:

Contextual Understanding: It overlooks context, leading to errors when a word's meaning depends on neighboring words.

Out-of-Vocabulary Words: Performance suffers with out-of-vocabulary words, often resorting to assigning the most frequent label.

Neglects Label Transitions: It doesn't consider transitions between labels, missing valuable information.

Ignores Higher-Order Dependencies: Lacks consideration for relationships between non-adjacent words, essential for accurate labeling

Quantitative and Qualitative Analysis of HMM and Baseline Models

The HMM model is generally expected to outperform the Baseline model due to its consideration of tag dependencies. However, there are instances where the HMM model may falter, such as:

Data Ambiguity: Both models may struggle if the data contains ambiguity or inconsistencies in annotation.

Rare or Unseen Words/Entities: The HMM model may fail when encountering rare or unseen words/entities during testing, lacking information for accurate predictions.

Complex Tag Dependencies: In cases of intricate tag dependencies, the HMM model may still struggle to make accurate predictions.

To pinpoint the HMM model's failures, analyzing the confusion matrix can identify common misclassifications, offering insights for model improvement or the need for additional training data.

* Can this model provide a solution for out-of-vocabulary words?
* Can you provide examples of words which changed its category compared to the max-likelihood prior when introducing context? See the following example.

$P($ ```Noun |people``` $) = 80\%$

$P($ ``` people, Noun | "a  planet" ``` $) = 5\%$

$P($ ``` people, Verb | "a  planet" ``` $) = 90\%$

* How does it perform with respect to the out-of-vocabulary words? e.g. what's the precision for those?

*  The following function shows the less likely and most likely transitions. Comment them and perform a deep analysis on each transition, do they have something in common with the errors you found?

Handling Out-of-Vocabulary Words:
The HMM model struggles with out-of-vocabulary (OOV) words as it lacks specific emission probabilities for unseen words. It may resort to assigning the most frequent label or making arbitrary predictions based on transition probabilities. Advanced techniques like smoothing can help mitigate this issue.

Examples of Contextual Change in Word Category:
Consider the word "run":

P(Verb | run) = 70%
P(run, Verb | "a race") = 90%
P(run, Noun | "a race") = 10%

In this case, when "run" is observed in the context of "a race," it is more likely to be categorized as a verb (90%) rather than a noun (10%), contrary to its standalone probability of being a verb (70%).

Performance on Out-of-Vocabulary Words:
The model's precision for out-of-vocabulary words depends on its handling of such instances. If it assigns the most frequent label to all OOV words, precision may be higher. However, if the model struggles with OOV words and makes arbitrary predictions, precision may be lower. Separate evaluation of OOV words is crucial to assess model robustness.







In [30]:
from collections import Counter
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(15))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-15:])

Top likely transitions:
surname -> occupation 5.019786
location -> location 4.249178
occupation -> occupation 4.194908
name   -> surname 3.918272
surname -> surname 3.294527
other  -> name    2.642988
name   -> name    2.280106
other  -> location 1.814986
occupation -> other   1.799188
state  -> state   1.515564
name   -> state   1.182575
other  -> other   1.181984
state  -> other   0.735261
occupation -> state   0.467426
location -> other   0.403282

Top unlikely transitions:
state  -> occupation 0.214062
surname -> state   0.070542
location -> occupation -0.072833
name   -> other   -0.506374
occupation -> name    -0.617364
other  -> surname -0.626089
occupation -> location -0.735001
surname -> name    -0.998662
other  -> occupation -1.030786
other  -> state   -1.296538
name   -> occupation -1.650451
occupation -> surname -2.111763
name   -> location -2.166658
location -> surname -2.771394
location -> name    -3.351583


In [31]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))

print("Top positive:")
print_state_features(Counter(info.state_features).most_common(20))

print("\nTop negative:")
print_state_features(Counter(info.state_features).most_common()[-20:])

Top positive:
10.471530 state  word.lower=viudo
9.201264 state  word.lower=donsella
9.171407 other  word.lower=ab
9.073835 other  word.lower=fill
9.043569 other  word.lower=defuncts
8.975021 other  word.lower=#
8.538504 state  word.lower=viuda
8.204700 other  word.lower=defunct
7.974212 other  word.lower=y
7.971931 other  word.lower=defuncta
7.816721 location word.lower=frances
7.655355 state  word.lower=dosella
7.522998 other  word.lower=rebere
7.259508 other  word.lower=habitant
7.223392 location word.lower=bara
7.056602 other  word.lower=a
6.963770 other  word.lower=de
6.655697 other  word.lower=filla
6.586878 other  word.lower=habitat
6.535491 occupation word.lower=llana

Top negative:
0.009387 occupation word.lower=pastisser
0.006814 surname word.lower=pere
-0.000147 name   word.lower=sr
-0.000667 location word.lower=dels
-0.015598 location word.lower=menat
-0.061897 surname word.lower=vila
-0.086558 surname word.lower=del
-0.118397 surname word.lower=toni
-0.285311 occupation bia


In this analysis, the model shows strong associations:

Surname -> Occupation, Occupation -> Occupation: Recognizes an occupation often follows a surname, and consecutive words can belong to the occupation category.
Location -> Location: Identifies consecutive location-related words.
Name -> Surname, Surname -> Surname: Understands the relationship between names and surnames; consecutive surnames may appear.
Top unlikely transitions reveal less probable transitions:

Location -> Name, Surname, Occupation: Indicates locations rarely transition directly to names, surnames, or occupations.
Name -> Occupation, Occupation -> Surname, Name: Shows uncommon transitions, possibly related to errors. For example, if transitioning from a location to a surname is unlikely, the model may struggle to identify correct sequences where this transition occurs.

## Hyperparameter Exploration

In the definition of the model we used some default hyperparameters related to the training algorithm.



```
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # Max Number of iterations for the iterative algorithm

    # include transitions that are possible, but not observed (smoothing)
    'feature.possible_transitions': True
})
```
Can you improve the precision by better parametrization? Feel free to explore more parameters through [the documentation](https://sklearn-crfsuite.readthedocs.io/en/latest/api.html#sklearn_crfsuite.CRF).


In [32]:
# New parameters and results here
## Use the functions above, no need to re-work.
## There is no need of a deep analysis nor qualitative evaluation
param_grid = {
    'c1': [0.01, 0.1, 1, 10], # Coefficient for L1 regularization penalty
    'c2': [0.01, 0.1, 1, 10], # Coefficient for L2 regularization penalty

    # Different algorithms (Gradient descent using the L-BFGS method

    'algorithm': ['lbfgs', # 'l2sgd' - Stochastic Gradient Descent with L2 regularization term
                  'ap', # 'ap' - Averaged Perceptron
                  'pa', # 'pa' - Passive Aggressive (PA)
                  'arow'], # 'arow' - Adaptive Regularization Of Weight Vector (AROW) )

    'max_iterations': 100, # Maximum number of iterations
    'epsilon':00.1         # The epsilon parameter that determines the condition of convergence.
}

## CRF Approach

Additionaly, we can address the problem by adding complexity to the transition probabilities. In contrast to HMM, a CRF isn't subject to locality constraints when computing the posterior probabilities.

Implement a CRF word featurer that takes into account tokens beyond adjacent ones and your expected needs given the qualitative evaluation.



In [33]:

def get_word_to_crf_features(sentence, word_idx):
    word, _ = sentence[word_idx]

    features = [
        'bias',
        'word.lower=' + word.lower(),
    ]

    if word_idx > 0:
        prev_word, _ = sentence[word_idx - 1]
        features.extend([
            '-1:word.lower=' + prev_word.lower(),
        ])
    else:
        features.append('bos')

    if word_idx < len(sentence) - 1:
        next_word, _ = sentence[word_idx + 1]
        features.extend([
            '+1:word.lower=' + next_word.lower(),
        ])
    else:
        features.append('eos')

    return features


def get_sent_to_crf_features(sentence):
    return [get_word_to_crf_features(sentence, i) for i in range(len(sentence))]


In [34]:
X_train = [get_sent_to_crf_features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [get_sent_to_crf_features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

trainer_crf = crfs.Trainer(verbose=False) # Instance a CRF trainer

for xseq, yseq in zip(X_train, y_train):
    trainer_crf.append(xseq, yseq) # Stack the data

In [35]:
trainer_crf.set_params({

    'c1': 0.01,  # coefficient for L1 penalty
    'c2': 0.01,  # coefficient for L2 penalty

})

In [36]:
trainer_crf.train('npl_ner_crf-improved.crfsuite') # Train the model and save it locally.
tagger_crf = crfs.Tagger()
tagger_crf.open('npl_ner_crf.crfsuite') # Load the inference API

<contextlib.closing at 0x7bd32f276b90>

* Did the results improve? Comment your decision on the features used in the CRF approach. Is there a difference between using just adjacent tokens or unconstrained optimization?

In [37]:
## Quantitative and qualitative evaluation

y_pred_crf = [tagger_crf.tag(x) for x in X_test]

report_crf = bio_classification_report(y_test, y_pred_crf)
print("CRF Model:")
print(report_crf)

CRF Model:
              precision    recall  f1-score   support

    location       0.89      0.92      0.91       462
        name       0.98      0.96      0.97       494
  occupation       0.94      0.92      0.93       294
       other       0.97      0.98      0.97      1493
       state       0.98      0.95      0.96       113
     surname       0.94      0.93      0.93       251

   micro avg       0.96      0.96      0.96      3107
   macro avg       0.95      0.94      0.95      3107
weighted avg       0.96      0.96      0.96      3107
 samples avg       0.96      0.96      0.96      3107




Using only adjacent tokens in the CRF model may restrict its context understanding. Employing unconstrained optimization allows the model to consider a broader token range, enhancing its grasp of word-label relationships and potentially improving results. However, assessing the model's performance on the test set is crucial to confirm whether this approach indeed yields better predictions.
The results for both models were the same for all labels.

## Conclusions



This notebook explores three Named Entity Recognition methods: a baseline model, Hidden Markov Model, and Conditional Random Field, all using the BIO tagging scheme. We analyze their performance and discuss strengths and weaknesses.

Choosing the NER model depends on task requirements and resource availability. The baseline model suits simple tasks, while HMM and CRF models offer superior performance with increased complexity and computational requirements. CRF is especially effective for complex NER tasks requiring diverse features and non-local context.