## Train your own NER system using Conditional Random Fields

#### Based on Barbara Plank's exercises

In [2]:
from sklearn_crfsuite import CRF,metrics

### Introduction

Today we are going to build a named entity recognizer using an off-the-shelf toolkit called crfsuite.

If the import above raises an error, you need to install the `sklearn-crfsuite` package (for example with `pip install sklearn-crfsuite`).

[Original CRFsuite tutorial](http://www.chokkan.org/software/crfsuite/tutorial.html)

[sklearn-crfsuite documentation](https://sklearn-crfsuite.readthedocs.io/en/latest/)

### Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying named entities (e.g., people, locations, organizations) in input text. It is usually treated as a sequence labeling task. Tokens that are not part of a named entity (verbs, adjectives, etc.) are tagged as "O" (other, i.e. not a named entity), so that every token is assigned a label. There are several ways to represent NE tags; the most common is the BIO scheme (also known as IOB2). Here is an example of two sentences annotated for named entities in the BIO scheme:

```
John  B-PER
lives   O
in   O
Copenhagen B-LOC

Dirk B-PER
flew O
to O
New B-LOC
York I-LOC
```

  * What do you think B, I, and O stand for, respectively?
  * Alternatively, we can drop the "B-" and "I-" prefixes and just keep the label by itself. Can you identify advantages and disadvantages of this approach? Discuss with your neighbor.
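
Once you have discussed the questions above, the following minimal sketch may help to check your intuition about how the prefixes delimit entities. It groups BIO-tagged tokens into entity spans; the function name `bio_to_spans` is made up for this illustration and is not part of crfsuite or of the exercise code.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a B- tag always opens a new entity, closing any open one
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # an I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (such ill-formed I- tags are simply dropped in this sketch)
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Dirk", "flew", "to", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Dirk'), ('LOC', 'New York')]
```

Try it on two adjacent entities of the same type and compare with what you would get without the prefixes.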

Now it is time to create a NER tagger yourself. For this exercise, we provide the required corpus, split into the following files:

  * `train.ner`
  * `test.ner`
  * `pos/train.noun`
  * `pos/test.noun`
  
Additionally, you could run extra experiments with corpora from [this website](https://github.com/davidsbatista/NER-datasets). You could also experiment with the Spanish data from CoNLL 2002, available in NLTK, by using the following commands:

```
from nltk.corpus import conll2002

train_sents = list(conll2002.iob_sents('esp.train'))
test_sents = list(conll2002.iob_sents('esp.testa'))

```

#### Exercise 1: Create data for every token

Acquaint yourself with the functions below.
  * What input do they expect?
  * What output do they produce?
  * Run the code on the train file and make sure you understand what it produces. Explain it to your neighbor.

In [2]:
from nltk.corpus import conll2002

train_sents = list(conll2002.iob_sents('esp.train'))
test_sents = list(conll2002.iob_sents('esp.testa'))

In [51]:
import re
import string

In [56]:
re.search(r'[^\w\d\s]', 'a.b')

<_sre.SRE_Match object; span=(1, 2), match='.'>
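
The pattern above is the one used later for the `word_punct` feature: it matches any character that is neither a word character nor whitespace. A small sketch of how it behaves on a few tokens (the helper name `has_punct` is hypothetical; note that `\d` is already covered by `\w`, so it is left out here, and that the underscore counts as a word character):

```python
import re

# \w already includes digits, so [^\w\s] matches the same characters as [^\w\d\s]
PUNCT_RE = re.compile(r"[^\w\s]")


def has_punct(word):
    """Return '+' if the word contains a punctuation-like character, else '-'."""
    return "+" if PUNCT_RE.search(word) else "-"


for word in ["a.b", "'s", "Manhattan", "snake_case"]:
    print(word, has_punct(word))
# a.b +
# 's +
# Manhattan -
# snake_case -
```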

In [72]:
# createdata.py
def extract_data(file_ner,file_pos,separator=" "):
    """
    extract token-level data (features) from file that is NER tagged and the corresponding POS file 
    this method should output the format expected by the next function word2features
    that then looks at a tokens context and produces the respective crfsuite format
    
    # Field names ==> this is what this script produces, needs to be aligned with nerfeats.py
    fields = 'w pos y'
    # alternative (add whatever you wanna add and make sure its aligned with nerfeats.py)
    fields = 'w pos cap y'
    """

    # read NER and POS from the two files
    words_tags=read_conll_file(file_ner)
    words_pos=read_conll_file(file_pos)
    
    # result variable
    # It will be a list of sentences
    # Each sentence will be a list of words
    # Each word will be represented by a six-element tuple:
    # WORD, POS, BIO-TAG, CAP, ALL_CAPS, WORD_PUNCT
    result = []
    
    ## sanity check: both files must contain the same number of sentences
    assert(len(words_tags)==len(words_pos))
    
    for (words,tags),(_,pos_tags) in zip(words_tags,words_pos):
        
        result.append([]) # start a new, empty sentence
        
        for word,pos,tag in zip(words,pos_tags,tags):
            # first letter is capitalized
            cap="+" if word[0].isupper() else "-"
            # all cased characters in the word are uppercase
            all_caps = "+" if word.isupper() else "-"
            # word contains a character that is neither a word character nor whitespace
            word_punct = "+" if re.search(r'[^\w\d\s]', word) else "-"
            #################################
            ###### YOUR FEATURES HERE #######  
            #################################
            
            # Append the token's tuple to the sentence we just started (result[-1])
            result[-1].append((word,pos,tag,cap,all_caps,word_punct))

    return result
        
        
def read_conll_file(file_name):
    """
    read in a file with format:
    word1    tag1
    ...      ...
    wordN    tagN
    
    returns a list with one (words, tags) tuple per sentence
    """
    content=[]

    current_words = []
    current_tags = []
    
    # use a context manager so the file handle is closed after reading
    with open(file_name, encoding='utf-8') as conll_file:
        for line in conll_file:
            line = line.strip()

            if line:
                word, tag = line.split('\t')
                current_words.append(word)
                current_tags.append(tag)

            else:
                # a blank line marks the end of a sentence
                content.append((current_words, current_tags))
                current_words = []
                current_tags = []

    # if the file does not end in a blank line (it should...), flush the instance still in the buffer
    if current_tags != []:
        content.append((current_words, current_tags))
    return content

In [68]:
#train_sents = extract_data("data/train.ner","data/pos/train.noun")
#test_sents = extract_data("data/test.ner","data/pos/test.noun")


train_sents = extract_data("data/train.ner","data/upos/train.lap-upos")
test_sents = extract_data("data/test.ner","data/upos/test.lap-upos")

train_sents[0:1]

[[('We', 'PRON', 'O', '+', '-', '-'),
  ('stay', 'VERB', 'O', '-', '-', '-'),
  ('in', 'ADP', 'O', '-', '-', '-'),
  ('Manhattan', 'NOUN', 'B-LOC', '+', '-', '-'),
  ('and', 'CONJ', 'O', '-', '-', '-'),
  ('it', 'PRON', 'O', '-', '-', '-'),
  ("'s", 'PRT', 'O', '-', '-', '+'),
  ('a', 'DET', 'O', '-', '-', '-'),
  ('long', 'ADV', 'O', '-', '-', '-'),
  ('way', 'NOUN', 'O', '-', '-', '-'),
  ('to', 'PRT', 'O', '-', '-', '-'),
  ('come', 'VERB', 'O', '-', '-', '-'),
  ('.', '.', 'O', '-', '-', '+')]]
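
To make the expected input format concrete, here is a small self-contained sketch that writes a tiny tab-separated file and parses it back. `read_tab_file` is a simplified stand-in written for this illustration, not the notebook's `read_conll_file`; unlike the original it skips repeated blank lines instead of emitting empty sentences.

```python
import os
import tempfile


def read_tab_file(path):
    """Simplified stand-in reader: returns [(words, tags), ...], one per sentence."""
    sentences, words, tags = [], [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                word, tag = line.split("\t")
                words.append(word)
                tags.append(tag)
            elif words:
                # blank line: close the current sentence
                sentences.append((words, tags))
                words, tags = [], []
    if words:
        # file did not end with a blank line
        sentences.append((words, tags))
    return sentences


# one token and its tag per line, separated by a tab; a blank line between sentences
sample = "John\tB-PER\nlives\tO\nin\tO\nCopenhagen\tB-LOC\n\nDirk\tB-PER\nflew\tO\n"

with tempfile.NamedTemporaryFile("w", suffix=".ner", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

parsed = read_tab_file(tmp.name)
os.remove(tmp.name)
print(parsed)
# [(['John', 'lives', 'in', 'Copenhagen'], ['B-PER', 'O', 'O', 'B-LOC']), (['Dirk', 'flew'], ['B-PER', 'O'])]
```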

#### Exercise 2: Create features for crfsuite

Given the output of the previous `extract_data` function, the next step is to produce features for a given token. In NER it is common to use the surrounding POS information as well as shape features. The function `word2features` takes a sentence and a word index and generates a feature representation for the word at that index, including information about its left and right neighbors. Below it there are three helper functions: `sent2features`, `sent2labels` and `sent2tokens`.

It helps to check how these functions work: call each of them with its respective input and inspect the result.

After that, extend `word2features` to add more features.

##### Note: whenever you add new information to `extract_data()` (e.g. start by outputting the cap information) you need to adjust the `word2features()` function accordingly to take these new features as input. `extract_data()` just looks at one token at a time, `word2features()` looks at the context and produces the features that crfsuite expects.
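
As one possible starting point for a shape feature, the sketch below maps every character to a coarse class. The names `word_shape` and `short_shape` are hypothetical; if you adopt something like this, you would compute it per token in `extract_data()`, add it to the tuple, and read it in `word2features()` as described in the note above.

```python
import re


def word_shape(word):
    """Map uppercase letters to X, lowercase letters to x, digits to d; keep the rest."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def short_shape(word):
    """Collapse runs of the same shape character, e.g. Xxxxx -> Xx."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))


for word in ["Copenhagen", "BJ", "B-52", "'s"]:
    print(word, word_shape(word), short_shape(word))
# Copenhagen Xxxxxxxxxx Xx
# BJ XX X
# B-52 X-dd X-d
# 's 'x 'x
```

The collapsed variant yields far fewer distinct values, which may generalize better on small corpora; which one helps more is something to test empirically.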


In [71]:
# nerfeats.py

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    caps = sent[i][3]
    all_caps = sent[i][4]
    word_punct = sent[i][5]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'postag': postag,
        'postag[:2]': postag[:2],
        'caps': caps,
        'allcaps': all_caps,
        #'wordpunct': word_punct,
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        caps1 = sent[i-1][3]
        all_caps1 = sent[i-1][4]
        word_punct1 = sent[i-1][5]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
            '-1:caps': caps1,
            '-1:allcaps': all_caps1,
            #'-1:wordpunct': word_punct1,
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        caps1 = sent[i+1][3]
        all_caps1 = sent[i+1][4]
        word_punct1 = sent[i+1][5]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
            '+1:caps': caps1,
            '+1:allcaps': all_caps1,
            #'+1:wordpunct': word_punct1,

        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label, caps, all_caps, word_punct in sent]

def sent2tokens(sent):
    return [token for token, postag, label, caps, all_caps, word_punct in sent]

In [74]:
#YOUR CODE HERE
print(train_sents[5])
word2features(train_sents[5],1)
#sent2features(train_sents[5])
#sent2labels(train_sents[5]),sent2tokens(train_sents[5])

[('He', 'PRON', 'O', '+', '-', '-'), ('would', 'VERB', 'O', '-', '-', '-'), ('give', 'VERB', 'O', '-', '-', '-'), ('no', 'DET', 'O', '-', '-', '-'), ('forecasts', 'NOUN', 'O', '-', '-', '-'), ('on', 'ADP', 'O', '-', '-', '-'), ('advertising', 'NOUN', 'O', '-', '-', '-'), ('revenue', 'NOUN', 'O', '-', '-', '-'), ('.', '.', 'O', '-', '-', '+')]


{'bias': 1.0,
 'word.lower()': 'would',
 'word[-3:]': 'uld',
 'word[-2:]': 'ld',
 'postag': 'VERB',
 'postag[:2]': 'VE',
 'caps': '-',
 'allcaps': '-',
 '-1:word.lower()': 'he',
 '-1:postag': 'PRON',
 '-1:postag[:2]': 'PR',
 '-1:caps': '+',
 '-1:allcaps': '-',
 '+1:word.lower()': 'give',
 '+1:postag': 'VERB',
 '+1:postag[:2]': 'VE',
 '+1:caps': '-',
 '+1:allcaps': '-'}

#### Create training data

You have now completed all the steps needed to generate the data used to train a CRF model.

The following four lines create the training and test data in a format similar to the one used by `scikit-learn`. There is a difference, though. Can you spot how it differs from the regular classification tasks you have done with `scikit-learn`?

##### Hint: Check what the variables `X_train` and `y_train` look like.

In [75]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

Wall time: 1.53 s
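
If you want to check your answer to the question above, the toy data below mimics the nesting of `X_train` and `y_train` (the feature dictionaries are heavily shortened and the values are made up for illustration). Each sentence is one training instance, made of a list of per-token feature dictionaries and a parallel list of labels.

```python
from itertools import chain

# one inner list per sentence, one feature dict per token
X_toy = [
    [{"word.lower()": "john"}, {"word.lower()": "lives"}],
    [{"word.lower()": "dirk"}, {"word.lower()": "flew"}, {"word.lower()": "to"}],
]
# one inner list per sentence, one label per token
y_toy = [["B-PER", "O"], ["B-PER", "O", "O"]]

# every sentence must have exactly one label per token
for X_sent, y_sent in zip(X_toy, y_toy):
    assert len(X_sent) == len(y_sent)

# flattening gives the token-level view a regular classifier would see
flat_y = list(chain.from_iterable(y_toy))
print(len(X_toy), "sentences,", len(flat_y), "tokens")
# 2 sentences, 5 tokens
```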


#### Exercise 3: Train your NER system using crfsuite

Train your system using `sklearn-crfsuite`.

In [76]:
%%time
crf = CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True,
    all_possible_states=True,
    verbose=True
)
crf.fit(X_train, y_train)

loading training data to CRFsuite: 100%|██████████████| 9034/9034 [00:04<00:00, 1859.30it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 1
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 343308
Seconds required: 9.401

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.66  loss=92395.76 active=343308 feature_norm=1.00
Iter 2   time=0.32  loss=85403.92 active=343308 feature_norm=1.18
Iter 3   time=0.33  loss=79392.87 active=343308 feature_norm=1.35
Iter 4   time=0.34  loss=63782.86 active=343308 feature_norm=2.02
Iter 5   time=0.33  loss=43881.53 active=343308 feature_norm=3.83
Iter 6   time=0.33  loss=39055.84 active=343308 feature_norm=4.81
Iter 7   time=0.34  loss=35664.45 active=343308 feature_norm=6.56
Iter 8   time=0.33  loss=30998.36 active=343308 feature_norm=7.74
Iter 9   time=0.32  loss=30064.46 active=343308 feature_norm=8.56
Iter

#### Exercise 4: Let's see how it works!

Our model is stored in a variable called `crf`. Take the first three sentences from the test set and check what it predicts.

In [77]:
crf.predict(X_test[0:3]),sent2tokens(test_sents[0]),sent2tokens(test_sents[1]),sent2tokens(test_sents[2])

([['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
  ['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'O'],
  ['O', 'O', 'O']],
 ['Mladost',
  '(',
  'BJ',
  ')',
  'NUMBER',
  'NUMBER',
  'NUMBER',
  '2',
  '2',
  'NUMBER',
  'NUMBER'],
 ['NUMBER', 'Andrea', 'Muller', '(', 'Germany', ')', 'NUMBER'],
 ['Finals', ',', 'doubles'])
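
The raw output above is hard to read because tokens and predictions are printed separately. A small sketch for lining them up, using the second test sentence and its predictions copied from the output above (`format_tagged` is a hypothetical helper, not part of `sklearn-crfsuite`):

```python
tokens = ["NUMBER", "Andrea", "Muller", "(", "Germany", ")", "NUMBER"]
predicted = ["O", "B-PER", "I-PER", "O", "B-LOC", "O", "O"]


def format_tagged(tokens, tags):
    """Return one 'token  tag' line per token, with tokens padded to equal width."""
    width = max(len(token) for token in tokens)
    return [f"{token:<{width}}  {tag}" for token, tag in zip(tokens, tags)]


lines = format_tagged(tokens, predicted)
print("\n".join(lines))
```

In the notebook you could call it as `format_tagged(sent2tokens(test_sents[1]), crf.predict(X_test[1:2])[0])`.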

#### Exercise 5: Let's see how it ACTUALLY works!

To evaluate our model we cannot use the regular functions from `sklearn.metrics`, as you may have realized. Luckily, `sklearn-crfsuite` ships a `metrics` module suited to sequence labeling (we imported it at the top).

What is this model's macro-averaged F1 score?

In [78]:
y_pred = crf.predict(X_test)

metrics.flat_f1_score(y_test, y_pred,
                      average='macro')

0.8817677592812444

In [79]:
labels = list(crf.classes_)
labels.remove('O')
labels

['B-LOC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'I-LOC']

In [80]:
metrics.flat_f1_score(y_test, y_pred,
                      average='macro', labels=labels)

0.863295931365789

In [81]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

       B-LOC      0.899     0.886     0.893      1918
       I-LOC      0.858     0.746     0.798       268
       B-ORG      0.840     0.805     0.822      1733
       I-ORG      0.837     0.814     0.825      1095
       B-PER      0.897     0.898     0.897      1915
       I-PER      0.919     0.969     0.943      1350

   micro avg      0.881     0.871     0.876      8279
   macro avg      0.875     0.853     0.863      8279
weighted avg      0.880     0.871     0.875      8279
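
The sort key used above orders labels first by entity type and then by prefix, so that the B- and I- rows of the same type end up next to each other in the report. A standalone illustration with the label list printed earlier:

```python
labels = ["B-LOC", "B-PER", "I-PER", "B-ORG", "I-ORG", "I-LOC"]

# name[1:] is the entity type part ("-LOC"), name[0] is the prefix ("B" or "I")
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
print(sorted_labels)
# ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER']
```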



In [82]:
print(metrics.flat_classification_report(
    y_test, y_pred, digits=3))

              precision    recall  f1-score   support

       B-LOC      0.899     0.886     0.893      1918
       B-ORG      0.840     0.805     0.822      1733
       B-PER      0.897     0.898     0.897      1915
       I-LOC      0.858     0.746     0.798       268
       I-ORG      0.837     0.814     0.825      1095
       I-PER      0.919     0.969     0.943      1350
           O      0.992     0.994     0.993     48257

   micro avg      0.976     0.976     0.976     56536
   macro avg      0.892     0.873     0.882     56536
weighted avg      0.975     0.976     0.975     56536
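
The macro average is the unweighted mean of the per-label scores, which explains why the two reports above differ: including the very frequent and very easy "O" label pulls the average up. The sketch below recomputes both macro F1 values from the rounded per-label F1 scores in the report, so small rounding differences from the exact figures are possible.

```python
from statistics import mean

# per-label F1 scores copied from the classification report above
f1_by_label = {
    "B-LOC": 0.893, "B-ORG": 0.822, "B-PER": 0.897,
    "I-LOC": 0.798, "I-ORG": 0.825, "I-PER": 0.943,
    "O": 0.993,
}

macro_all = mean(f1_by_label.values())
macro_entities = mean(f1 for label, f1 in f1_by_label.items() if label != "O")

print(round(macro_all, 3), round(macro_entities, 3))
# 0.882 0.863
```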



### Nouns:
              precision    recall  f1-score   support

       B-LOC      0.873     0.794     0.832      1918
       B-ORG      0.829     0.624     0.712      1733
       B-PER      0.900     0.773     0.832      1915
       I-LOC      0.743     0.668     0.703       268
       I-ORG      0.791     0.565     0.659      1095
       I-PER      0.903     0.889     0.896      1350
           O      0.968     0.992     0.980     48257

   micro avg      0.955     0.955     0.955     56536
   macro avg      0.858     0.758     0.802     56536
weighted avg      0.952     0.955     0.952     56536

#### Exercise 6: Run your own POS tagger on the training and test files

Instead of using the POS tags we provide, which are too simplistic (every word is simply tagged as NOUN), you can use the real POS-tagged files `upos/train.lap-upos` and `upos/test.lap-upos`.

You could also run your own POS tagger and use its predictions on the train and test file. Make sure your POS tagger produces the output the NER system expects.
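
A minimal sketch of producing that format, assuming your tagger returns each sentence as a list of `(word, tag)` pairs. The layout follows the docstring of `read_conll_file`: one token and its tag per line, separated by a tab, with a blank line after each sentence. The function name `to_conll_text` is hypothetical.

```python
def to_conll_text(tagged_sents):
    """Serialize [[(word, tag), ...], ...] as tab-separated lines, blank line per sentence."""
    lines = []
    for sent in tagged_sents:
        lines.extend(f"{word}\t{tag}" for word, tag in sent)
        lines.append("")  # blank line closes the sentence
    return "\n".join(lines) + "\n"


tagged = [[("We", "PRON"), ("stay", "VERB")], [("Manhattan", "NOUN")]]
text = to_conll_text(tagged)
print(repr(text))
# 'We\tPRON\nstay\tVERB\n\nManhattan\tNOUN\n\n'
```

Writing `text` to a file with `encoding='utf-8'` should give something `extract_data` can read, provided the tokens are identical to those in the corresponding `.ner` file.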

What accuracies do you observe? What downstream effect does the accuracy of the POS tagger have on the NER task? Hypothesize and investigate!
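
To measure POS accuracy you need gold tags to compare against. A minimal token-level accuracy sketch, assuming gold and predicted tags are given as parallel lists of tag lists (the toy tags and the helper `tag_accuracy` are made up for illustration):

```python
def tag_accuracy(gold_sents, pred_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        assert len(gold) == len(pred), "sentences must have the same length"
        for gold_tag, pred_tag in zip(gold, pred):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


gold = [["PRON", "VERB", "ADP", "NOUN"], ["NOUN", "VERB"]]
# baseline that mimics the simplistic files: every token tagged as NOUN
always_noun = [["NOUN"] * len(sent) for sent in gold]

print(tag_accuracy(gold, always_noun))  # 2 of the 6 tokens are correct
```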

In [83]:
# YOUR CODE HERE