# Conditional Random Fields

Conditional Random Fields (CRFs) are a class of statistical models well suited to structured prediction, most commonly sequence labeling: predicting a sequence of labels for a sequence of input items. They are widely used in natural language processing (NLP) for tasks such as named entity recognition and part-of-speech tagging, where the label of each element depends on the context provided by the other elements in the sequence.

## Background and Basics

CRFs belong to the family of probabilistic graphical models and are a form of undirected graphical model (Markov random field) conditioned on the input. Unlike generative models such as the Hidden Markov Model (HMM), which model the joint distribution and make strong independence assumptions about the observations, CRFs directly model the conditional probability of a sequence of labels given a sequence of input tokens. This allows CRFs to take into account the interdependencies among the output labels and to use a broad context of observable features from the input sequence.

## Mathematical Formulation

Formally, a CRF defines a conditional probability distribution $P(Y \mid X)$ over label sequences $Y$ given a particular input sequence $X$, rather than modeling the joint distribution $P(X, Y)$. For a sequence $X = (x_1, x_2, \ldots, x_n)$ and corresponding labels $Y = (y_1, y_2, \ldots, y_n)$, the probability of a label sequence given the input sequence is defined as:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) \right)$$

Where:

- $f_k$ are feature functions that depend on the input sequence $X$, the position $i$ in the sequence, and the labels at positions $i$ and $i-1$.
- $\lambda_k$ are weights learned from training data.
- $Z(X)$ is a normalization factor that ensures the probabilities sum to 1, computed as the sum of the exponentiated scores of all possible label sequences given $X$.

## Advantages of CRFs

Contextual Dependency: CRFs can model complex dependencies between labels, which is particularly useful in tasks where context significantly influences the output, such as syntactic parsing, or where labels are not set independently but are influenced by adjacent labels.

Feature Flexibility: Unlike HMMs, which are parameterized only by state-transition and emission probabilities, CRFs can incorporate an arbitrary number of overlapping, interdependent features of the input. This allows diverse information such as word tokens, part-of-speech tags, and other lexical features to be included.

No Independence Assumptions: CRFs do not assume independence between the input features given the output labels, allowing them to capture more information about the input.
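To make feature functions of the form $f_k(y_{i-1}, y_i, X, i)$ concrete, the following minimal sketch defines three overlapping hand-written features with hand-picked weights and computes the unnormalized score $\sum_i \sum_k \lambda_k f_k(y_{i-1}, y_i, X, i)$ for two candidate label sequences. The feature definitions, the weights, the `START` marker, and the example sentence are all illustrative assumptions, not learned values.

```python
# Toy feature functions f_k(y_prev, y_cur, X, i). All names and weights here
# are illustrative assumptions, not values learned from data.
START = "<START>"  # hypothetical label standing in for y_0


def f_title_is_per(y_prev, y_cur, X, i):
    # Fires when a title-cased token is labeled as a person.
    return 1.0 if X[i].istitle() and y_cur == "PER" else 0.0


def f_after_mr_is_per(y_prev, y_cur, X, i):
    # Overlaps with the feature above: it also inspects the previous token.
    return 1.0 if i > 0 and X[i - 1] == "Mr." and y_cur == "PER" else 0.0


def f_per_to_per(y_prev, y_cur, X, i):
    # Transition feature: depends only on the adjacent label pair.
    return 1.0 if y_prev == "PER" and y_cur == "PER" else 0.0


FEATURES = [f_title_is_per, f_after_mr_is_per, f_per_to_per]
WEIGHTS = [0.5, 2.0, 1.0]  # lambda_k, hand-picked for illustration


def score(X, Y):
    """Unnormalized score: sum over positions i and features k of lambda_k * f_k."""
    total = 0.0
    y_prev = START
    for i, y_cur in enumerate(Y):
        for weight, feature in zip(WEIGHTS, FEATURES):
            total += weight * feature(y_prev, y_cur, X, i)
        y_prev = y_cur
    return total


X = ["Mr.", "Smith", "spoke"]
score_a = score(X, ["O", "PER", "O"])
score_b = score(X, ["O", "O", "O"])
print(score_a, score_b)
```

Two features fire on the same token ("Smith") in the first labeling, which is the kind of overlap an HMM's emission model does not express directly. Exponentiating these scores and dividing by $Z(X)$ would turn them into probabilities.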

## Limitations of CRFs

Computational Complexity: Training CRFs, particularly over large datasets or with many features, can be computationally intensive due to the need to compute normalization factors across potentially large label spaces.

Feature Engineering: Although CRFs are powerful, they often require careful feature engineering to achieve optimal performance. This process can be labor-intensive and requires domain expertise.

Scalability Issues: CRFs can struggle to scale to very large datasets or very long sequences due to the computational costs associated with the inference and learning processes.
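How expensive the normalization factor is depends on how it is computed. Summing over every possible label sequence takes time exponential in the sequence length, whereas for a linear-chain CRF the standard forward algorithm yields the same value in time proportional to the sequence length times the square of the number of labels. The sketch below compares the two on a toy problem. The `LABELS` set, the `potential` function, and its numeric values are illustrative assumptions standing in for $\sum_k \lambda_k f_k(y_{i-1}, y_i, X, i)$.

```python
import itertools
import math

LABELS = ["O", "PER", "LOC"]  # assumed toy label set
START = "<START>"             # hypothetical label standing in for y_0


def potential(y_prev, y_cur, X, i):
    # Stand-in for sum_k lambda_k * f_k(y_prev, y_cur, X, i); values are made up.
    s = 0.0
    if X[i].istitle() and y_cur != "O":
        s += 1.0
    if y_prev == y_cur:
        s += 0.5
    return s


def z_brute_force(X):
    """Sum exp(score) over all |LABELS|**n label sequences."""
    total = 0.0
    for Y in itertools.product(LABELS, repeat=len(X)):
        seq_score = 0.0
        y_prev = START
        for i, y_cur in enumerate(Y):
            seq_score += potential(y_prev, y_cur, X, i)
            y_prev = y_cur
        total += math.exp(seq_score)
    return total


def z_forward(X):
    """Forward algorithm: O(n * |LABELS|**2) instead of O(|LABELS|**n)."""
    # alpha[y] = sum of exp(score) over all label prefixes that end in y
    alpha = {y: math.exp(potential(START, y, X, 0)) for y in LABELS}
    for i in range(1, len(X)):
        alpha = {
            y: sum(alpha[yp] * math.exp(potential(yp, y, X, i)) for yp in LABELS)
            for y in LABELS
        }
    return sum(alpha.values())


X = ["Alice", "visited", "Paris", "yesterday"]
z_slow = z_brute_force(X)
z_fast = z_forward(X)
print(z_slow, z_fast)
```

Both functions should agree up to floating-point error. With 3 labels and 4 tokens the brute-force sum visits 81 sequences; with 9 labels and a 30-token sentence it would be infeasible, while the forward pass stays cheap. Practical implementations typically work in log space to avoid overflow, which this sketch omits for brevity.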

## Common Applications

- Named Entity Recognition (NER): Identifying named entities in text and classifying them into predefined categories such as persons, organizations, and locations.
- Part-of-Speech Tagging: Assigning a part of speech, such as noun, verb, or adjective, to each word in a sentence.
- Segmentation and Labeling: Any task that requires segmenting a sequence and labeling its segments, such as tokenization or sentence breaking.

CRFs have been foundational in advancing the field of NLP and remain relevant even as deep learning-based approaches have come to dominate the field. They are still used either standalone or as part of hybrid systems that combine traditional machine learning techniques with neural network architectures.
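For NER in particular, a sequence labeler emits one tag per token, and entities are recovered by grouping those tags. The sketch below assumes the B-/I-/O tag scheme that appears in the label dictionary of the code in this document (`B-PER`, `I-PER`, and so on). The helper name `bio_to_spans` and the example sentence are hypothetical, introduced only for illustration.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, start, end_exclusive, text) tuples."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # A token continues the open span only if it is I-<same type>.
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                # Close the span that was open before this token.
                spans.append((etype, start, i, " ".join(tokens[start:i])))
                start, etype = None, None
            if tag != "O":
                # B-X opens a span; a stray I-X is treated as a span start too.
                start, etype = i, tag[2:]
    if etype is not None:
        spans.append((etype, start, len(tags), " ".join(tokens[start:])))
    return spans


tokens = ["John", "Smith", "works", "at", "Acme", "Corp", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
spans = bio_to_spans(tokens, tags)
for span in spans:
    print(span)
```

Treating a stray `I-X` as the start of a span is one lenient design choice among several; stricter evaluators may discard such tags instead.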

```python
import datasets
from sklearn.metrics import classification_report
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flatten

sample_size = 1000

# Load the CoNLL-2003 dataset
dataset = datasets.load_dataset('conll2003', trust_remote_code=True)

train_tokens = dataset["train"]["tokens"][:sample_size]
train_tags = dataset["train"]["ner_tags"][:sample_size]

test_tokens = dataset["test"]["tokens"][:sample_size]
test_tags = dataset["test"]["ner_tags"][:sample_size]

# Mapping between tag strings and integer ids, as given in the data description
TAG_TO_ID = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
             'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
ID_TO_TAG = {v: k for k, v in TAG_TO_ID.items()}


def int_ner(data):
    """Convert integer ner_tags to their corresponding tag strings."""
    return [[ID_TO_TAG[tag] for tag in sentence] for sentence in data]


train_tags = int_ner(train_tags)
test_tags = int_ner(test_tags)


# Feature extraction functions
def word2features(sent, i):
    """Build the feature dict for token i from the token and its neighbors."""
    word = sent[i]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }

    if i > 0:
        # Features of the previous token
        word1 = sent[i-1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True  # Beginning of Sentence

    if i < len(sent) - 1:
        # Features of the next token
        word1 = sent[i+1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True  # End of Sentence

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


# Prepare data: one list of feature dicts and one list of tag strings per sentence
X_train = [sent2features(s) for s in train_tokens]
y_train = train_tags
X_test = [sent2features(s) for s in test_tokens]
y_test = test_tags

# Train CRF with L-BFGS; c1 and c2 are the L1 and L2 regularization coefficients
crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)

# Predictions
y_pred = crf.predict(X_test)

# Evaluate at the token level by flattening the per-sentence label lists
print(classification_report(flatten(y_test), flatten(y_pred)))
```


```
              precision    recall  f1-score   support

       B-LOC       0.73      0.71      0.72       446
      B-MISC       0.78      0.48      0.59       197
       B-ORG       0.68      0.35      0.46       523
       B-PER       0.73      0.74      0.74       672
       I-LOC       0.89      0.46      0.60        68
      I-MISC       0.64      0.65      0.65        83
       I-ORG       0.60      0.40      0.48       223
       I-PER       0.74      0.94      0.83       493
           O       0.95      0.98      0.97      9440

    accuracy                           0.91     12145
   macro avg       0.75      0.63      0.67     12145
weighted avg       0.90      0.91      0.90     12145
```
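When reading this report, note that the `O` label accounts for 9440 of the 12145 tokens, so accuracy and the weighted average are dominated by it. The sketch below recomputes the macro and weighted F1 averages from the rounded per-label values in the report, then computes a macro average over the entity labels only. Because the inputs are rounded to two decimals, the results are approximate. The entity-only figure is a derived illustration and is not part of the original output.

```python
# (f1, support) per label, copied from the classification report above.
report = {
    "B-LOC": (0.72, 446),
    "B-MISC": (0.59, 197),
    "B-ORG": (0.46, 523),
    "B-PER": (0.74, 672),
    "I-LOC": (0.60, 68),
    "I-MISC": (0.65, 83),
    "I-ORG": (0.48, 223),
    "I-PER": (0.83, 493),
    "O": (0.97, 9440),
}

total_support = sum(support for _, support in report.values())

# Macro average: every label counts equally, regardless of frequency.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each label's F1 is weighted by its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

# Macro average over entity labels only, leaving out the dominant "O" label.
entity = {label: v for label, v in report.items() if label != "O"}
entity_macro_f1 = sum(f1 for f1, _ in entity.values()) / len(entity)

print(f"total support:        {total_support}")
print(f"macro F1 (all):       {macro_f1:.2f}")
print(f"weighted F1 (all):    {weighted_f1:.2f}")
print(f"macro F1 (entities):  {entity_macro_f1:.2f}")
```

The gap between the weighted average and the entity-only macro average suggests the model is considerably weaker on entity tokens than the headline accuracy implies, with `B-ORG` and `I-ORG` the weakest labels in this run.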

