# CRF model for POS Tagging

In this tutorial we use the [python-crfsuite package](https://github.com/scrapinghub/python-crfsuite) to train a CRF model for POS tagging. The method introduced here can also be applied to other sequence tagging problems such as word segmentation, NER, and NP chunking.

We will use the same dataset that we used to implement the HMM POS tagger.

## seqeval

In [None]:
%%capture
!pip install -q seqeval[cpu]

## python-crfsuite

[python-crfsuite](https://github.com/scrapinghub/python-crfsuite) is a Python binding to CRFsuite.



In [None]:
!pip install -q python-crfsuite


We import the packages we need.

In [None]:
from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pycrfsuite
import scipy.stats
from sklearn.metrics import make_scorer

print(sklearn.__version__)

1.2.2


## Loading data


In [None]:
import nltk
from nltk.corpus import treebank

nltk.download('universal_tagset')
nltk.download('treebank')

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

### Create train/test split

In [None]:
from sklearn.model_selection import train_test_split

tagged_sentences = treebank.tagged_sents(tagset='universal')
train_tagged_sentences, test_tagged_sentences = train_test_split(tagged_sentences, test_size=0.2, random_state=42)

train_sentences = []
train_tag_sequences = []

test_sentences = []
test_tag_sequences = []

for sen in test_tagged_sentences:
    words, tags = zip(*sen)
    test_sentences.append(words)
    test_tag_sequences.append(tags)

for sen in train_tagged_sentences:
    words, tags = zip(*sen)
    train_sentences.append(words)
    train_tag_sequences.append(tags)
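
The loops above rely on `zip(*sen)` to separate each tagged sentence into a tuple of words and a tuple of tags. A minimal sketch of that idiom on a toy sentence (the first four tokens of the example shown in this tutorial, typed in by hand so the snippet runs without NLTK):

```python
# Toy tagged sentence in the same (word, tag) format as treebank.tagged_sents().
tagged_sentence = [("Pierre", "NOUN"), ("Vinken", "NOUN"), (",", "."), ("61", "NUM")]

# zip(*pairs) transposes a list of (word, tag) pairs into
# one tuple of words and one tuple of tags.
words, tags = zip(*tagged_sentence)

print(words)  # ('Pierre', 'Vinken', ',', '61')
print(tags)   # ('NOUN', 'NOUN', '.', 'NUM')
```

Because `zip` returns tuples, every entry of `train_sentences` and `train_tag_sequences` is a tuple rather than a list.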

In [None]:
train_sentences[0]

('Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.')

In [None]:
train_tag_sequences[0]

('NOUN',
 'NOUN',
 '.',
 'NUM',
 'NOUN',
 'ADJ',
 '.',
 'VERB',
 'VERB',
 'DET',
 'NOUN',
 'ADP',
 'DET',
 'ADJ',
 'NOUN',
 'NOUN',
 'NUM',
 '.')

## Features

Next, we define some features. In this example we use the list of features introduced in the tutorial at [https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31](https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31).

We will write a function that returns a dictionary of the following features for each word in the sentence.


In [None]:
def is_all_caps(word):
    return word.upper() == word and not word.isdigit()

def word2features(sentence, i):
    """
    Arguments:
        sentence (list): list of words [w1, w2,...,w_n]
        i (int): index of the word
    Return:
        features (dict): dictionary of features
    """
    word = sentence[i]
    features = {
        'is_first': i == 0,
        'is_last': i == len(sentence) - 1,
        'is_first_capital': word[0].isupper(),
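        'is_all_caps': is_all_caps(word),    # no lowercase letters and not a pure digit string (also True for punctuation)
        'is_all_lower': word.lower() == word,  # no uppercase letters (also True for digits and punctuation)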
        'word': word,
        'word.lower()': word.lower(),
        'prefix_1': word[0],
        'prefix_2': word[:2],
        'prefix_3': word[:3],
        'prefix_4': word[:4],
        'suffix_1': word[-1],
        'suffix_2': word[-2:],
        'suffix_3': word[-3:],
        'suffix_4': word[-4:],
        'prev_word': '' if i==0 else sentence[i-1].lower(),
        'next_word': '' if i==len(sentence)-1 else sentence[i+1].lower(),
        'has_hyphen': '-' in word,
        'is_numeric': word.isdigit(),
        'capitals_inside': word[1:].lower() != word[1:]    # an uppercase letter after the first character
    }

    return features


def sent2features(sentence):
    """
    sentence is a list of words [w1, w2,...,w_n]
    """
    return [word2features(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    """
    sentence is a list of tuples (word, postag)
    """
    return [postag for token, postag in sentence]

def untag(sentence):
    """
    sentence is a list of tuples (word, postag)
    """
    return [token for token, _ in sentence]
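
Some of these features behave in ways that are easy to overlook. The sketch below repeats the `is_all_caps` helper and the `capitals_inside` expression from the cell above (wrapped in a function only for this illustration) and applies them to a few hand-picked tokens.

```python
def is_all_caps(word):
    # Same definition as in the feature cell.
    return word.upper() == word and not word.isdigit()


def capitals_inside(word):
    # Same expression as the 'capitals_inside' feature.
    return word[1:].lower() != word[1:]


# Hand-picked tokens that exercise the edge cases.
for w in ["IBM", "61", ",", "1,000", "Nov.", "McDonald", "a"]:
    # Slices never raise on short words: 'a'[:4] and 'a'[-4:] are both 'a'.
    print(repr(w), is_all_caps(w), capitals_inside(w), repr(w[:4]), repr(w[-4:]))
```

Note that punctuation and tokens such as `'1,000'` count as "all caps" because they contain no lowercase letters and are not pure digit strings. For words shorter than four characters, the longer prefix and suffix features simply repeat the whole word.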

Let's see how the feature function works.

In [None]:
sent2features( train_sentences[0] )[10]

{'is_first': False,
 'is_last': False,
 'is_first_capital': False,
 'is_all_caps': False,
 'is_all_lower': True,
 'word': 'board',
 'word.lower()': 'board',
 'prefix_1': 'b',
 'prefix_2': 'bo',
 'prefix_3': 'boa',
 'prefix_4': 'boar',
 'suffix_1': 'd',
 'suffix_2': 'rd',
 'suffix_3': 'ard',
 'suffix_4': 'oard',
 'prev_word': 'the',
 'next_word': 'as',
 'has_hyphen': False,
 'is_numeric': False,
 'capitals_inside': False}
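
CRFsuite itself works with named attributes that carry numeric weights, so python-crfsuite has to flatten a feature dictionary like the one above. The sketch below emulates that conversion in pure Python as we understand it from the python-crfsuite documentation: booleans become weights 1.0 or 0.0, and a string value is folded into the attribute name. The helper `to_crfsuite_attributes` is hypothetical and not part of the library, and the `:` separator is an assumption.

```python
def to_crfsuite_attributes(features):
    """Hypothetical helper: flatten a feature dict into {attribute: weight}."""
    attributes = {}
    for key, value in features.items():
        if isinstance(value, bool):
            # Check bool before numbers, since bool is a subclass of int.
            attributes[key] = 1.0 if value else 0.0
        elif isinstance(value, str):
            # Assumed convention: {"word": "board"} acts like {"word:board": 1.0}.
            attributes[f"{key}:{value}"] = 1.0
        else:
            # Plain numbers are used directly as weights.
            attributes[key] = float(value)
    return attributes


example = {"is_first": False, "has_hyphen": True, "word": "board", "suffix_2": "rd"}
print(to_crfsuite_attributes(example))
```

Under this reading, every distinct word, prefix, and suffix becomes its own binary attribute, which is why the model can learn a separate weight for each (attribute, tag) pair.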

Now we can extract features from the data.

In [None]:
X_train = [sent2features(s) for s in train_sentences]
y_train = train_tag_sequences

X_test = [sent2features(s) for s in test_sentences]
y_test = test_tag_sequences

## Training

To see all possible CRF parameters, check the trainer's docstring or call `trainer.params()`. Here we use the L-BFGS training algorithm with both L1 and L2 regularization (Elastic Net).

In [None]:
trainer = pycrfsuite.Trainer(algorithm='lbfgs', verbose=True)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

In [None]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

In [None]:
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
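
With both `c1` and `c2` set, training minimizes a regularized negative log-likelihood. As a sketch (the exact scaling constants CRFsuite uses internally are an assumption here), the objective over the $N$ training sentences $(x^{(n)}, y^{(n)})$ with weight vector $w$ has the form:

$$
\min_{w} \; -\sum_{n=1}^{N} \log p\big(y^{(n)} \mid x^{(n)}; w\big) \;+\; c_1 \lVert w \rVert_1 \;+\; c_2 \lVert w \rVert_2^2
$$

With `c1 = 1.0` and `c2 = 1e-3`, the L1 term dominates. It pushes many feature weights to exactly zero and yields a sparser model, while the small L2 term only mildly shrinks the remaining weights.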

In [None]:
%%time
trainer.train('postagger.crfsuite')

## Evaluation

Now we will evaluate our trained CRF model on the test data. We will use accuracy as our evaluation metric.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

tagger = pycrfsuite.Tagger()
tagger.open('postagger.crfsuite')

y_pred = list( chain(*[tagger.tag(xseq) for xseq in X_test]) )
y_true = list( chain(*y_test) )

print(accuracy_score(y_true, y_pred))

0.9726994014307265
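
The number above is token-level accuracy: all sentences are flattened into one long tag list before comparison. The sketch below computes the same quantity by hand on hypothetical toy sequences, together with sentence-level accuracy (the share of sentences tagged without any error), which is a stricter metric.

```python
from itertools import chain

# Hypothetical gold and predicted tag sequences for two short sentences.
y_test_toy = [("DET", "NOUN", "VERB"), ("NOUN", "VERB", "ADV", ".")]
y_pred_toy = [["DET", "NOUN", "VERB"], ["NOUN", "NOUN", "ADV", "."]]

# Token-level accuracy: flatten, then count matching positions.
gold = list(chain.from_iterable(y_test_toy))
pred = list(chain.from_iterable(y_pred_toy))
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Sentence-level accuracy: a sentence counts only if every tag matches.
sentence_accuracy = sum(
    list(g) == list(p) for g, p in zip(y_test_toy, y_pred_toy)
) / len(y_test_toy)

print(token_accuracy)     # 6 of 7 tokens are correct
print(sentence_accuracy)  # 1 of 2 sentences is fully correct
```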


This is a much better result than the one we obtained with the first-order HMM model.

Let's see the details of classification results.

In [None]:
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           .       1.00      1.00      1.00      2354
         ADJ       0.91      0.87      0.89      1316
         ADP       0.98      0.99      0.98      2028
         ADV       0.91      0.92      0.92       634
        CONJ       1.00      0.99      0.99       471
         DET       0.99      0.99      0.99      1795
        NOUN       0.96      0.98      0.97      5943
         NUM       1.00      0.99      0.99       727
        PRON       0.99      1.00      1.00       523
         PRT       0.98      0.98      0.98       658
        VERB       0.96      0.96      0.96      2740
           X       1.00      1.00      1.00      1360

    accuracy                           0.97     20549
   macro avg       0.97      0.97      0.97     20549
weighted avg       0.97      0.97      0.97     20549



## Let’s check what the classifier learned

In [None]:
from collections import Counter
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(15))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-15:])

Top likely transitions:
ADJ    -> NOUN    2.829943
VERB   -> PRT     2.367747
X      -> VERB    1.617303
NOUN   -> PRT     1.517131
ADP    -> NOUN    1.479857
ADP    -> PRON    1.470597
DET    -> NOUN    1.467798
ADV    -> ADJ     1.454885
ADV    -> ADV     1.416952
NUM    -> NOUN    1.414520
ADV    -> VERB    1.356902
DET    -> X       1.315338
NOUN   -> VERB    1.305948
NOUN   -> NOUN    1.288232
ADP    -> DET     1.266361

Top unlikely transitions:
PRT    -> .       -0.960814
ADJ    -> PRON    -0.971327
PRON   -> DET     -0.972367
DET    -> .       -1.071193
PRT    -> PRT     -1.112169
ADJ    -> DET     -1.214018
PRT    -> NUM     -1.234834
X      -> NOUN    -1.239027
CONJ   -> .       -1.293765
DET    -> ADP     -1.357986
X      -> PRT     -1.606091
ADP    -> X       -2.647249
CONJ   -> X       -2.761592
.      -> PRT     -3.404648
DET    -> PRT     -3.933182
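
The ranking above works because `Counter.most_common()` sorts a mapping by its values in descending order, even when the values are float weights rather than counts. A minimal sketch on a hypothetical dictionary with the same `{(label_from, label_to): weight}` shape as `info.transitions` (weights rounded from the output above):

```python
from collections import Counter

# Hypothetical transition weights, shaped like info.transitions.
toy_transitions = {
    ("ADJ", "NOUN"): 2.83,
    ("DET", "NOUN"): 1.47,
    (".", "PRT"): -3.40,
    ("DET", "PRT"): -3.93,
}

# most_common() returns (key, value) pairs sorted by value, largest first.
ranked = Counter(toy_transitions).most_common()
top = ranked[:2]      # most likely transitions
bottom = ranked[-2:]  # least likely transitions (most negative weights last)

for (label_from, label_to), weight in ranked:
    print("%-6s -> %-7s %0.2f" % (label_from, label_to, weight))
```

A plain `sorted(toy_transitions.items(), key=lambda kv: kv[1], reverse=True)` gives the same ordering here. `Counter` is just a compact way to write it.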


## Prediction

In [None]:
sen = ['The', 'market', 'is', 'just', 'becoming', 'more', 'efficient', '.', "''"]
tagger.tag(sent2features( sen ))

['DET', 'NOUN', 'VERB', 'ADV', 'VERB', 'ADV', 'ADJ', '.', '.']
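
The tagger returns only the tag sequence, so it is often convenient to pair each tag with its word. A small sketch, with the predicted tags copied by hand from the output above so that it runs without the trained model:

```python
sen = ['The', 'market', 'is', 'just', 'becoming', 'more', 'efficient', '.', "''"]
# Tags copied from the tagger output above (not recomputed here).
tags = ['DET', 'NOUN', 'VERB', 'ADV', 'VERB', 'ADV', 'ADJ', '.', '.']

# Pair each word with its predicted tag, mirroring the (word, tag)
# format of the treebank data.
tagged = list(zip(sen, tags))

print(" ".join(f"{word}/{tag}" for word, tag in tagged))
```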

## References

- [sklearn-crfsuite tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system)
- [Quick Recipe: Build a POS tagger using a Conditional Random Field](https://nlpforhackers.io/crf-pos-tagger/)
- [NLP Guide: Identifying Part of Speech Tags using Conditional Random Fields](https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31)
- [python-crfsuite](https://github.com/scrapinghub/python-crfsuite)