# Overview
The goal of this lab is to explore named entity recognition (NER) and entity extraction. In this lab we will cover the following:
- modern libraries for named entity recognition 
- training an entity extraction model using conditional random fields

## 1. Named Entity Extraction

Recall from class that named entities are proper nouns referring to places, people, organizations, products, and many other kinds of names. The CoNLL 2003 shared task ([Tjong Kim Sang and De Meulder](https://aclanthology.org/W03-0419.pdf)) specifies four types of named entities: persons, locations, organizations, and miscellaneous. However, there are other schemes for labelling named entities, such as the [OntoNotes](https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/AnotGuideEnNE.pdf) schema, which contains 18 categories.


In this section we'll explore two off the shelf tools for extracting named entities.

### 1a. NER with Spacy
spaCy is a Python library that can be used for a wide range of NLP tasks. spaCy ships its own NER model, which consists of subword embeddings, Bloom embeddings, a convolutional neural network (CNN), and dynamic transition probabilities. The spaCy NER model was trained on the OntoNotes 5 NER schema (see above).

We'll walk through how to install spacy and use the library to extract and visualize named entities. 

To install spaCy, run the cell below. We use pip to install the spaCy library and then download the model weights separately. Note that this step may take a couple of minutes.

In [None]:
!pip install -U spacy > /dev/null
!python -m spacy download en_core_web_sm 

2023-01-20 11:10:18.964798: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.5.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.4.1
    Uninstalling en-core-web-sm-3.4.1:
      Successfully uninstalled en-core-web-sm-3.4.1
Successfully installed en-core-web-sm-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Next we'll load spaCy and create a pipeline object that we can pass text to for extraction. When text is passed to spaCy, it runs a pipeline over the text and generates a dependency parse, part-of-speech tags, named entities, and lemmas. In the cell below we show how you can access all of these features.

Be sure to use the attributes with a trailing `_` (e.g. `.pos_`) to get the string label of a feature; the attribute without the underscore (e.g. `.pos`) returns an integer ID instead.

In [None]:
import spacy 
spacy_nlp = spacy.load("en_core_web_sm")

text = "University of Galway is a public university located in Galway, Ireland."

# 1. Pass text to spacy
doc = spacy_nlp(text)

# 2. Loop over token in document to extract various features
for tok in doc:
  print(f"token: {tok}, | POS: {tok.pos_} | dependency: {tok.dep_} | lemma: {tok.lemma_}")



token: University, | POS: PROPN | dependency: nsubj | lemma: University
token: of, | POS: ADP | dependency: prep | lemma: of
token: Galway, | POS: PROPN | dependency: pobj | lemma: Galway
token: is, | POS: AUX | dependency: ROOT | lemma: be
token: a, | POS: DET | dependency: det | lemma: a
token: public, | POS: ADJ | dependency: amod | lemma: public
token: university, | POS: NOUN | dependency: attr | lemma: university
token: located, | POS: VERB | dependency: acl | lemma: locate
token: in, | POS: ADP | dependency: prep | lemma: in
token: Galway, | POS: PROPN | dependency: pobj | lemma: Galway
token: ,, | POS: PUNCT | dependency: punct | lemma: ,
token: Ireland, | POS: PROPN | dependency: appos | lemma: Ireland
token: ., | POS: PUNCT | dependency: punct | lemma: .


In [None]:
# We can extract the entities using the ents property of the spaCy doc object
for ent in doc.ents:
  print(f"entity: {ent}, entity label: {ent.label_}")

entity: University of Galway, entity label: ORG
entity: Galway, entity label: GPE
entity: Ireland, entity label: GPE


In [None]:
# We can also visualize the tagged entities using displacy
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)

### 1b. NER with Flair
[Flair](https://github.com/flairNLP/flair) is another NLP library, maintained by Humboldt University, which is useful for generating word embeddings, NER, and part-of-speech tagging, among other features.

We can install Flair using pip (run the cell below).

In [None]:
!pip install flair

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting flair
  Downloading flair-0.11.3-py3-none-any.whl (401 kB)
Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Collecting konoha<5.0.0,>=4.0.0
  Downloading konoha-4.6.5-py3-none-any.whl (20 kB)
Collecting pptree
  Downloading pptree-3.1.tar.gz (3.0 kB)
  Preparing metadata (setup.py) ... done
Collecting huggingface-hub
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting janome
  Downloading Janome-0.4.2-py2.py3-none-any.whl (19.7 MB)

With Flair, we first convert the text into `Sentence` objects, which we then pass to a `SequenceTagger` loaded with the model we want to work with. The `ner` model was trained to predict the CoNLL NER entities, and the `ner-ontonotes` model was trained to predict the OntoNotes categories. A full list of other models and NLP tasks that Flair supports can be found [here](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md).

Below we'll walk through the same text from above and see what the different Flair models predict. Note that the first time you run the code below, Flair will download the model, which may take a few minutes.

In [None]:
from flair.data import Sentence
from flair.models import SequenceTagger

text = "University of Galway is a public university located in Galway, Ireland."

# 1.make a sentence
flair_sent = Sentence(text)

# load the NER tagger which uses CoNLL NER tags
conll_tagger = SequenceTagger.load('ner')

# run NER over sentence
conll_tagger.predict(flair_sent)

# print the updated sentence object which contains the tags
print(flair_sent)



2023-01-16 17:00:50,879 loading file /root/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
2023-01-16 17:00:53,078 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
Sentence: "University of Galway is a public university located in Galway , Ireland ." → ["University of Galway"/ORG, "Galway"/LOC, "Ireland"/LOC]


In [None]:
from flair.data import Sentence
from flair.models import SequenceTagger

text = "University of Galway is a public university located in Galway, Ireland."

# 1.make a sentence
flair_sent = Sentence(text)

# load the NER tagger which uses OntoNotes NER tags
onto_tagger = SequenceTagger.load('ner-ontonotes-fast')

# run NER over sentence
onto_tagger.predict(flair_sent)

# print the updated sentence object which contains the tags
print(flair_sent)

Downloading:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

2023-01-16 17:03:14,632 loading file /root/.flair/models/ner-english-ontonotes-fast/0d55dd3b912da9cf26e003035a0c269a0e9ab222f0be1e48a3bbba3a58c0fed0.c9907cd5fde3ce84b71a4172e7ca03841cd81ab71d13eb68aa08b259f57c00b6
2023-01-16 17:03:27,695 SequenceTagger predicts: Dictionary with 76 tags: <unk>, O, B-CARDINAL, E-CARDINAL, S-PERSON, S-CARDINAL, S-PRODUCT, B-PRODUCT, I-PRODUCT, E-PRODUCT, B-WORK_OF_ART, I-WORK_OF_ART, E-WORK_OF_ART, B-PERSON, E-PERSON, S-GPE, B-DATE, I-DATE, E-DATE, S-ORDINAL, S-LANGUAGE, I-PERSON, S-EVENT, S-DATE, B-QUANTITY, E-QUANTITY, S-TIME, B-TIME, I-TIME, E-TIME, B-GPE, E-GPE, S-ORG, I-GPE, S-NORP, B-FAC, I-FAC, E-FAC, B-NORP, E-NORP, S-PERCENT, B-ORG, E-ORG, B-LANGUAGE, E-LANGUAGE, I-CARDINAL, I-ORG, S-WORK_OF_ART, I-QUANTITY, B-MONEY
Sentence: "University of Galway is a public university located in Galway , Ireland ." → ["University of Galway"/ORG, "Galway"/GPE, "Ireland"/GPE]
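
Because the two Flair models (and spaCy) use different label schemes, their outputs are not directly comparable: the same span is tagged `LOC` by the CoNLL model and `GPE` by the OntoNotes model. One way to compare them is to project the OntoNotes labels onto the coarser CoNLL set. The sketch below does this with a hand-written dictionary. The mapping `ONTO_TO_CONLL` and the helper `to_conll` are assumptions made for illustration, not an official correspondence between the two schemes, and numeric types such as `DATE` or `CARDINAL` are simply dropped because CoNLL has no equivalent.

```python
# Hypothetical coarse mapping from OntoNotes labels to CoNLL labels.
# This is an illustrative assumption, not an official correspondence.
ONTO_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "WORK_OF_ART": "MISC",
    "LANGUAGE": "MISC",
    "PRODUCT": "MISC",
    "LAW": "MISC",
}


def to_conll(entities):
    """Map (text, OntoNotes label) pairs to CoNLL labels, dropping unmapped types."""
    return [
        (text, ONTO_TO_CONLL[label])
        for text, label in entities
        if label in ONTO_TO_CONLL
    ]


# Entities copied from the two Flair outputs above
onto_entities = [("University of Galway", "ORG"), ("Galway", "GPE"), ("Ireland", "GPE")]
conll_entities = [("University of Galway", "ORG"), ("Galway", "LOC"), ("Ireland", "LOC")]

# After projection, the two models agree on this sentence
print(to_conll(onto_entities) == conll_entities)
```

With a shared label set, disagreements between the tools reflect different predictions rather than different naming conventions.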


### 1c. Exercise: Tool exploration.
For this exercise, play around with both libraries. Can you find examples where one library works better than the other, or where both libraries are limited?

Both models are sensitive to casing. Lowercasing words results in the models missing entities, while uppercasing words can create false positives (in the examples below, spaCy tags `SENTENCE` and `BANANAS` as `ORG`, while Flair predicts no entities for the same text).

In [None]:
text = "University of Galway is a public university located in Galway, Ireland."

# 1.make a sentence
flair_sent = Sentence(text.lower())

# run NER over sentence
onto_tagger.predict(flair_sent)

# print the updated sentence object which contains the tags
print(flair_sent)

Sentence: "university of galway is a public university located in galway , ireland ." → ["galway"/GPE, "galway"/GPE, "ireland"/GPE]


In [None]:
for ent in spacy_nlp(text.lower()).ents:
  print(ent, ent.label_)

galway GPE
ireland GPE


In [None]:
text2 = "THIS SENTENCE HAS RANDOM CAPITALIZED WORDS LIKE APPLES AND BANANAS."
# 1.make a sentence
flair_sent = Sentence(text2)

# run NER over sentence
onto_tagger.predict(flair_sent)

# print the updated sentence object which contains the tags
print(flair_sent)

Sentence: "THIS SENTENCE HAS RANDOM CAPITALIZED WORDS LIKE APPLES AND BANANAS ."


In [None]:
for ent in spacy_nlp(text2).ents:
  print(ent, ent.label_)

SENTENCE ORG
BANANAS ORG


## 2. Training an NER Model with Conditional Random Fields
We can treat entity tagging as a token classification task. Given a sequence of tokens representing a sentence (e.g. `[University, of, Galway, is, a, university, located, in, Ireland, .]`), we want to predict which tokens belong to a named entity and what its type is (e.g. `LOC`, `ORG`, `PER`, `MISC`). Named entities often span multiple tokens (e.g. University of Galway), so the BIO (begin, inside, outside) annotation scheme is used to specify which tokens belong to an entity. Under BIO, the ORG label becomes `B-ORG` for the first token of an entity and `I-ORG` for the tokens that continue it, and anything that is not part of a named entity is labelled `O`. For example, our sentence would have the following labels:
`(University [B-ORG]) (of [I-ORG]) (Galway [I-ORG]) (is [O]) (a [O]) (university [O]) (located [O]) (in [O]) (Ireland [B-LOC]) (. [O])`
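
To make the BIO scheme concrete, here is a minimal sketch that converts entity spans into BIO tags and back again. The helper names `spans_to_bio` and `bio_to_spans` are hypothetical (they do not come from any of the libraries used in this lab), and spans are assumed to be `(start, end_exclusive, label)` token indices.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token index, end index exclusive, label)


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert token-level entity spans into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # tokens that continue the entity
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from a BIO tag sequence."""
    spans: List[Span] = []
    start, label = None, None
    # The extra "O" is a sentinel that closes an entity ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # A B- tag always opens an entity; a stray I- tag is treated as an opening too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


tokens = ["University", "of", "Galway", "is", "a", "university", "located", "in", "Ireland", "."]
spans = [(0, 3, "ORG"), (8, 9, "LOC")]

tags = spans_to_bio(len(tokens), spans)
for tok, tag in zip(tokens, tags):
    print(f"{tok:12s} {tag}")

# Decoding the tags gives back the original spans
print(bio_to_spans(tags))
```

Decoding is the step NER libraries perform when they turn per-token predictions back into entities such as "University of Galway".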

In this section we'll explore training an NER tagger using conditional random fields ([CRF](https://arxiv.org/pdf/1011.4088.pdf)). CRFs are undirected graphical models: given a set of observations X and label variables Y, the labels Y, conditioned on X, obey the Markov property with respect to a graph, so each label depends only on X and on its neighbors in the graph. In the linear-chain CRFs used for sequence tagging, the neighbors of a label are the labels immediately before and after it. If we provide the CRF model with a set of features for each token (e.g. POS), the model learns weights for those features and transition weights between the BIO labels, where each BIO label is treated as a random variable. Intuitively, I-tags are likely to follow B-tags, and the model is likely to associate a noun with the LOC tags in the presence of a word like `located`. CRFs are not limited to syntactic and lexical features; neural taggers often pass the hidden states of an LSTM or word embeddings to a CRF layer prior to label classification.
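
To illustrate why transitions between labels matter, the sketch below decodes a three-token sentence with the Viterbi algorithm, the standard way to find the highest-scoring label sequence in a linear-chain model. All scores are made-up numbers chosen for illustration, and the `viterbi` function is a simplified stand-in, not the implementation used by `sklearn-crfsuite`. Per-token ("emission") scores alone would tag `of` as `O`, while the transition scores pull the sequence towards `B-ORG I-ORG I-ORG`.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence under a linear-chain model.

    emissions[i][y]     : score of label y at position i (missing labels score 0.0)
    transitions[(a, b)] : score of moving from label a to label b (missing pairs score 0.0)
    """
    # best[i][y] is the best score of any path that ends with label y at position i
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back = []  # back[i-1][y] is the previous label on that best path
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[i].get(y, 0.0)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Start from the best final label and follow the back-pointers
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-ORG", "I-ORG"]
tokens = ["University", "of", "Galway"]

# Illustrative scores only: what each token looks like in isolation
emissions = [
    {"O": 0.5, "B-ORG": 2.0},
    {"O": 1.0, "I-ORG": 0.5},
    {"B-ORG": 1.0, "I-ORG": 0.8},
]
# Illustrative scores only: I- tags like to follow B-/I- tags and not O
transitions = {
    ("B-ORG", "I-ORG"): 2.0,
    ("I-ORG", "I-ORG"): 2.0,
    ("O", "I-ORG"): -3.0,
}

greedy = [max(e, key=e.get) for e in emissions]      # ignores transitions
best_path = viterbi(emissions, transitions, labels)  # uses transitions

print("greedy :", greedy)
print("viterbi:", best_path)
```

The greedy tagger splits the entity because it labels each token independently, whereas the sequence-level decoder keeps it together.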

The `sklearn-crfsuite` library provides an implementation of CRFs and a simple API that we can use to train our own tagger. This guide has been adapted from the [documentation](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#check-best-estimator-on-our-test-data).

Let's start by installing the libraries we'll need. We'll use the CoNLL 2003 dataset hosted on the Hugging Face dataset hub for this exercise. Run the cells below to get started.


In [None]:
!pip install datasets > /dev/null
!pip install -U 'scikit-learn<0.24'
!pip install sklearn-crfsuite

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn<0.24
  Downloading scikit_learn-0.23.2-cp38-cp38-manylinux1_x86_64.whl (6.8 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.5 requires scikit-learn>=1.0.0, but you have scikit-learn 0.23.2 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.23.2 which is incompatible.
Successfully installed scikit-learn-0.23.2
Looking in i

Next we'll download our data and convert it into pandas dataframes for simplicity. The dataset uses numerical ids instead of string labels for fields such as the POS and NER tags. We provide a set of mapping dictionaries that convert the ids back to the string labels. Use `id2pos` for part-of-speech tags and `id2tag` for NER tags.

In [None]:
from datasets import load_dataset
import pandas as pd
dataset = load_dataset("conll2003")

# MAP for POS to id and reverse
pos2id = {'"': 0, "''": 1, '#': 2, '$': 3, '(': 4, ')': 5, ',': 6, '.': 7, ':': 8, '``': 9, 'CC': 10, 'CD': 11, 'DT': 12,
 'EX': 13, 'FW': 14, 'IN': 15, 'JJ': 16, 'JJR': 17, 'JJS': 18, 'LS': 19, 'MD': 20, 'NN': 21, 'NNP': 22, 'NNPS': 23,
 'NNS': 24, 'NN|SYM': 25, 'PDT': 26, 'POS': 27, 'PRP': 28, 'PRP$': 29, 'RB': 30, 'RBR': 31, 'RBS': 32, 'RP': 33,
 'SYM': 34, 'TO': 35, 'UH': 36, 'VB': 37, 'VBD': 38, 'VBG': 39, 'VBN': 40, 'VBP': 41, 'VBZ': 42, 'WDT': 43,
 'WP': 44, 'WP$': 45, 'WRB': 46}
id2pos = {v:k for k,v in pos2id.items()}

# Map for NER tag to id and reverse
tag2id = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
id2tag = {v:k for k,v in tag2id.items()}

# Convert Datasets to pandas for ease of use
train = dataset["train"].to_pandas()
test = dataset["test"].to_pandas()

display(train.head(5))

Downloading builder script:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/12.2k [00:00<?, ?B/s]

Downloading and preparing dataset conll2003/conll2003 to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98...


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

Dataset conll2003 downloaded and prepared to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

| | id | tokens | pos_tags | chunk_tags | ner_tags |
|---|---|---|---|---|---|
| 0 | 0 | [EU, rejects, German, call, to, boycott, Briti... | [22, 42, 16, 21, 35, 37, 16, 21, 7] | [11, 21, 11, 12, 21, 22, 11, 12, 0] | [3, 0, 7, 0, 0, 0, 7, 0, 0] |
| 1 | 1 | [Peter, Blackburn] | [22, 22] | [11, 12] | [1, 2] |
| 2 | 2 | [BRUSSELS, 1996-08-22] | [22, 11] | [11, 12] | [5, 0] |
| 3 | 3 | [The, European, Commission, said, on, Thursday... | [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 3... | [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 1... | [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, ... |
| 4 | 4 | [Germany, 's, representative, to, the, Europea... | [22, 27, 21, 35, 12, 22, 22, 27, 16, 21, 22, 2... | [11, 11, 12, 13, 11, 12, 12, 11, 12, 12, 12, 1... | [5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, ... |


### Exercise 2a: Data Extraction
You'll notice that the data has already been tokenized and stored in a list. Further, for each token there is a POS tag and an NER tag (there is also a chunk tag, which we'll ignore) stored in separate lists. For simplicity, let's combine everything into a single list and convert the ids into labels. For this exercise, the goal is to iterate over each row in our dataframe and generate a list of triples for each sentence. Each triple consists of the following:
- token, pos label, ner label

The final output should look something like this:
```
[
  # Sentence 1 tokens: 
  [('EU', 'NNP', 'B-ORG'),
  ('rejects', 'VBZ', 'O'),
  ('German', 'JJ', 'B-MISC'),
  ('call', 'NN', 'O'),
  ('to', 'TO', 'O'),
  ('boycott', 'VB', 'O'),
  ('British', 'JJ', 'B-MISC'),
  ('lamb', 'NN', 'O'),
  ('.', '.', 'O')],

 # Sentence 2 tokens:
 [('Peter', 'NNP', 'B-PER'), 
 ('Blackburn', 'NNP', 'I-PER')]
 ...
]
```

The code below should help you get started. Be sure to append each sentence-level list of triples to the overall `train_sents` and `test_sents` lists. To recover the labels, use the `id2pos` and `id2tag` dicts. For example, the NER id `1` maps to `B-PER`, so `id2tag[1]` returns `B-PER`.

In [None]:
from tqdm.notebook import tqdm
# Extract out data into list of triples 
train_sents = []
for _, row in tqdm(train.iterrows(), total=len(train)):
  tokens = row.tokens
  pos_tags = row.pos_tags
  ner_labels = row.ner_tags
  
  sent = []
  for i, tok in enumerate(tokens):
    sent.append((tok, id2pos[pos_tags[i]], id2tag[ner_labels[i]]))
  #display(sent)
  train_sents.append(sent)

test_sents = []
for _, row in tqdm(test.iterrows(), total=len(test)):
  tokens = row.tokens
  pos_tags = row.pos_tags
  ner_labels = row.ner_tags
  
  sent = []
  for i, tok in enumerate(tokens):
    sent.append((tok, id2pos[pos_tags[i]], id2tag[ner_labels[i]]))
  #display(sent)
  test_sents.append(sent)

  0%|          | 0/14041 [00:00<?, ?it/s]

  0%|          | 0/3453 [00:00<?, ?it/s]

In [None]:
# Take a look at the first two elements in train_sents
display(train_sents[:2])

[[('EU', 'NNP', 'B-ORG'),
  ('rejects', 'VBZ', 'O'),
  ('German', 'JJ', 'B-MISC'),
  ('call', 'NN', 'O'),
  ('to', 'TO', 'O'),
  ('boycott', 'VB', 'O'),
  ('British', 'JJ', 'B-MISC'),
  ('lamb', 'NN', 'O'),
  ('.', '.', 'O')],
 [('Peter', 'NNP', 'B-PER'), ('Blackburn', 'NNP', 'I-PER')]]

### Exercise 2b. CRF Feature Extraction
Next we'll generate a set of simple features for our CRF model. The input for the CRF model will be a list of lists, where each sublist represents a sentence and consists of one dict of features per word. The inputs will look something like this:

```
X_train = [
            # Sentence 1
            [
              # Feature for the first word in the sentence
              {
                'bias': 1.0,
                'word_lowercased': 'eu',
                'word_is_upperc': True,
                'word_is_title': False,
                'word_is_digit': False,
                'pos': 'NNP',
                'BOS': True,
                'next_word_lowercased': 'rejects',
                'next_word_is_title': False,
                'next_word_is_upper': False,
                'next_word_pos': 'VBZ'
              }
              ... 
            ]
            ...
]
```

For this exercise you'll fill out the word2features method and generate the features for the CRF model. 

In [None]:
def word2features(sent, i):

    word = sent[i][0]
    pos = sent[i][1]

    features = {
        'bias': 1.0,
        'word_lowercased': word.lower(),
        'word_is_upperc':  word.isupper(),
        'word_is_title': word.istitle(),
        'word_is_digit': word.isdigit(),
        'pos': pos,
    
    }
    
    if i > 0:
        prev_word = sent[i-1][0]
        prev_pos = sent[i-1][1]
        features.update({
            'prev_word_lowercased': prev_word.lower(),
            'prev_word_is_title': prev_word.istitle(),
            'prev_word_is_upper': prev_word.isupper(),
            'prev_word_pos': prev_pos,
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        next_word = sent[i+1][0]
        next_pos = sent[i+1][1]
        features.update({
            'next_word_lowercased': next_word.lower(),
            'next_word_is_title': next_word.istitle(),
            'next_word_is_upper': next_word.isupper(),
            'next_word_pos': next_pos,
        })
    else:
        features['EOS'] = True

    return features


# Apply word2features to every token in a sentence
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

Now let's generate the train and test features.

In [None]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

In [None]:
# Sanity checking
print(len(X_train) == len(y_train))
print(len(X_test) == len(y_test))

True
True


In [None]:
# View the features for the first sentence
display(X_train[0][0])

{'bias': 1.0,
 'word_lowercased': 'eu',
 'word_is_upperc': True,
 'word_is_title': False,
 'word_is_digit': False,
 'pos': 'NNP',
 'BOS': True,
 'next_word_lowercased': 'rejects',
 'next_word_is_title': False,
 'next_word_is_upper': False,
 'next_word_pos': 'VBZ'}

### 2c. Training and Evaluating the Model
We've finally made it to the training step. Run the code below to train the model.

In [None]:
import sklearn_crfsuite
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=500)

We can evaluate our model by calculating the weighted average F1 across our NER labels which are not O.

In [None]:
from sklearn_crfsuite import metrics

# List of NER tags dropping the O values
tag_list = ['B-LOC', 'B-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=tag_list)

0.802486455949713
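
To make the metric concrete, here is a minimal pure-Python sketch of a support-weighted F1 over a chosen set of labels. It reflects my understanding of what `average='weighted'` combined with `labels=` computes (a per-label F1, averaged with weights proportional to how often each label occurs in the gold data); the function `weighted_f1` is a hypothetical helper written for illustration, not the library's implementation, and the toy data is invented.

```python
from collections import Counter
from typing import List


def weighted_f1(y_true: List[List[str]], y_pred: List[List[str]], labels: List[str]) -> float:
    """Support-weighted average of per-label F1, restricted to `labels`."""
    # Flatten the per-sentence label lists into one long list of token labels
    true_flat = [t for sent in y_true for t in sent]
    pred_flat = [p for sent in y_pred for p in sent]

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p where it does not belong
            fn[t] += 1  # missed an occurrence of gold label t

    total_support = sum(tp[label] + fn[label] for label in labels)
    score = 0.0
    for label in labels:
        support = tp[label] + fn[label]  # gold occurrences of this label
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        score += f1 * support
    return score / total_support if total_support else 0.0


# Toy data: the model misses the I-PER token in the first sentence
toy_true = [["B-PER", "I-PER", "O"], ["B-LOC", "O"]]
toy_pred = [["B-PER", "O", "O"], ["B-LOC", "O"]]

score = weighted_f1(toy_true, toy_pred, labels=["B-PER", "I-PER", "B-LOC"])
print(round(score, 3))
```

Leaving `O` out of `labels` matters because most tokens are `O`; including it would inflate the score even for a model that finds few entities.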

We can also take a look at the classification report to see how the model does per NER tag.

In [None]:
sorted_labels = sorted(
    tag_list,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

       B-LOC      0.861     0.805     0.832      1668
       I-LOC      0.794     0.646     0.712       257
      B-MISC      0.822     0.748     0.783       702
      I-MISC      0.709     0.676     0.692       216
       B-ORG      0.767     0.722     0.744      1661
       I-ORG      0.690     0.744     0.716       835
       B-PER      0.826     0.855     0.840      1617
       I-PER      0.864     0.952     0.906      1156

   micro avg      0.808     0.799     0.804      8112
   macro avg      0.792     0.769     0.778      8112
weighted avg      0.808     0.799     0.802      8112



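Finally, let's take a look at the transition weights the CRF model learned. These are unnormalized scores rather than probabilities: a large positive weight marks a likely transition and a negative weight an unlikely one.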

In [None]:
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

Top likely transitions:
B-ORG  -> I-ORG   7.363917
I-MISC -> I-MISC  7.248915
I-ORG  -> I-ORG   7.151640
B-PER  -> I-PER   7.139008
B-MISC -> I-MISC  6.797215
B-LOC  -> I-LOC   6.643375
I-LOC  -> I-LOC   6.127159
I-PER  -> I-PER   4.965731
O      -> B-PER   2.934855
O      -> O       2.895735
O      -> B-LOC   2.295845
O      -> B-ORG   2.076758
O      -> B-MISC  2.064629
B-LOC  -> O       0.664967
B-MISC -> O       0.549740
B-LOC  -> B-MISC  0.387209
I-MISC -> B-ORG   0.300258
B-MISC -> B-ORG   0.280139
B-ORG  -> O       0.249207
B-MISC -> B-PER   0.189149

Top unlikely transitions:
B-PER  -> B-MISC  -1.616498
I-PER  -> B-ORG   -1.635039
I-ORG  -> I-PER   -1.713222
I-ORG  -> B-PER   -1.718138
B-MISC -> I-PER   -1.754357
B-LOC  -> I-PER   -1.782421
B-MISC -> I-ORG   -1.793629
I-ORG  -> B-ORG   -1.852754
B-LOC  -> B-PER   -1.868231
I-LOC  -> B-PER   -1.937961
B-LOC  -> I-ORG   -1.954908
B-ORG  -> B-LOC   -2.054202
B-ORG  -> I-PER   -2.074430
I-PER  -> B-PER   -2.346264
B-PER  -> B-ORG  