### Hands-on 3:

* Download and preprocess the Named Entity Recognition (NER) corpus (CoNLL 2002)
* Prepare CRF model for NER
* Run CRF for training and evaluation


## Named Entity Recognition (NER) using CRF

The task of Named Entity Recognition (NER) involves recognizing:<br>
* names of persons
* locations
* organizations
* dates
* ...


#### Example 

For example, the following sentence is tagged with sub-sequences indicating PER (for persons), LOC (for location) and ORG (for organization):

<br>

Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.

<br>


_______________

<b>[PER Wolff ] </b> , currently a journalist in <b> [LOC Argentina ] </b> , played with <b> [PER Del Bosque ] </b> in the final years  of the seventies in <b> [ORG Real Madrid ] </b> .

_______________

<br>

### NER Sub-Tasks

NER involves two sub-tasks: <br>

* identifying the boundaries of such expressions (the open and close brackets), and
* labeling the expressions (with tags such as PER, LOC or ORG).

As with chunking, both sub-tasks are reduced to one sequence labeling problem in which every token receives a single classification tag, using a BIO encoding of the data.

### BIO Tagging:

The BIO / IOB format (short for inside, outside, beginning) is a common format for tagging tokens in chunking tasks in computational linguistics (e.g. named-entity recognition).

* A B- prefix before a tag indicates that the token is the beginning of a chunk.
* An I- prefix before a tag indicates that the token is inside a chunk.
* An O tag indicates that the token belongs to no entity / chunk.

The following listing shows what a BIO-tagged sentence looks like:



```
    Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O
```
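As a minimal sketch of the encoding above, the helper below converts entity spans into BIO tags. The function name `spans_to_bio` and the `(start, end, label)` span format (with `end` exclusive) are assumptions made for this illustration; they are not part of NLTK or sklearn-crfsuite.

```python
def spans_to_bio(tokens, spans):
    """Turn (start, end, label) spans (end exclusive) into one BIO tag per token."""
    tags = ["O"] * len(tokens)           # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for k in range(start + 1, end):
            tags[k] = f"I-{label}"       # remaining tokens of the entity
    return tags


tokens = ["played", "with", "Del", "Bosque", "in", "Real", "Madrid", "."]
bio_tags = spans_to_bio(tokens, [(2, 4, "PER"), (5, 7, "ORG")])
for token, tag in zip(tokens, bio_tags):
    print(f"{token:>10} {tag}")
```

Multi-token entities such as "Del Bosque" receive one B- tag followed by I- tags, which is what lets a per-token classifier recover the entity boundaries.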

### Dataset

Let’s use the CoNLL 2002 data to build an NER system.

The CoNLL 2002 corpus is available in NLTK.

In [1]:
# download corpus

import nltk
nltk.download('conll2002')

# get training/testing datasets
from nltk.corpus import conll2002

[nltk_data] Downloading package conll2002 to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\conll2002.zip.


In [18]:
!pip install pyconll

Collecting pyconll
  Downloading pyconll-2.3.3-py3-none-any.whl (23 kB)
Installing collected packages: pyconll
Successfully installed pyconll-2.3.3


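### Data Preparation

The cells below read a second data set, the WNUT 2017 emerging-entities files (`wnut17train.conll` and `emerging.test.conll`), from a local copy of the `emerging_entities_17` repository.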

In [307]:

# raw string, so the backslashes in the Windows path are not treated as escape sequences
path = r'.\emerging_entities_17-master\emerging_entities_17-master\wnut17train.conll'
f = open(path, "r", encoding='utf-8')

In [308]:
lines = f.read()
len(lines)

493781
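Windows-style paths written with backslashes are easy to get wrong in string literals. One possible alternative, sketched here under the assumption that the data sits in the same folders as in the path used above, is to build the path with `pathlib`, which also works on other operating systems. The variable names `data_dir` and `train_path` are illustrative.

```python
from pathlib import Path

# folder layout taken from the path used above; adjust it to your local copy
data_dir = Path('emerging_entities_17-master') / 'emerging_entities_17-master'
train_path = data_dir / 'wnut17train.conll'

print(train_path.name)   # wnut17train.conll
```

A `Path` object can be passed directly to `open(...)`.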

In [309]:
lines



In [51]:
from pathlib import Path
raw_text = Path(path).read_text(encoding='utf-8').strip()

raw_text




In [312]:
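def read_file(file_path):
    """Read a tab-separated CoNLL file into a list of (tokens, tags) pairs, one per sentence."""
    samples = []
    tokens = []
    tags = []

    # the context manager closes the file once reading is done
    with open(file_path, 'r', encoding='utf-8') as fileobj:
        for content in fileobj:
            content = content.strip('\n')

            if content == '' or content == '\t':
                # a blank line marks the end of a sentence
                if len(tokens) != 0:
                    samples.append((tokens, tags))
                    tokens = []
                    tags = []
            else:
                # the first column is the token, the last column is its tag
                contents = content.split('\t')
                tokens.append(contents[0])
                tags.append(contents[-1])

    # keep the last sentence when the file does not end with a blank line
    if tokens:
        samples.append((tokens, tags))

    return samples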

path = r'.\emerging_entities_17-master\emerging_entities_17-master\wnut17train.conll'
train = read_file(path)

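def get_sents(lines):
    """Group the lines of a CoNLL-style file into sentences.

    Args:
        lines (Iterable[str]): the lines

    Yields:
        List[str]: the stripped lines of one sentence; sentences are delimited by empty lines
    """
    sent = []
    stripped_lines = (line.strip() for line in lines)
    for line in stripped_lines:
        if line == '':
            if sent:  # do not yield empty sentences for repeated blank lines
                yield sent
            sent = []
        else:
            sent.append(line)
    if sent:  # last sentence, when there is no trailing blank line
        yield sent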
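
To see what this kind of parsing produces, here is a small self-contained sketch that applies the same logic to an in-memory sample. The sample text, its tags and the name `parse_conll` are made up for illustration; the real WNUT files may differ in detail.

```python
import io


def parse_conll(fileobj):
    """Parse 'token<TAB>tag' lines; blank lines separate sentences."""
    samples, tokens, tags = [], [], []
    for line in fileobj:
        line = line.rstrip('\n')
        if line.strip() == '':
            if tokens:                       # end of the current sentence
                samples.append((tokens, tags))
                tokens, tags = [], []
        else:
            columns = line.split('\t')
            tokens.append(columns[0])        # first column: token
            tags.append(columns[-1])         # last column: tag
    if tokens:                               # file without a trailing blank line
        samples.append((tokens, tags))
    return samples


# hypothetical two-sentence sample in the assumed format
sample = "Del\tB-person\nBosque\tI-person\nplayed\tO\n\nReal\tB-group\nMadrid\tI-group\n"
parsed = parse_conll(io.StringIO(sample))
print(parsed[0])
```

Each sentence becomes a `(tokens, tags)` pair, which is the structure that `train[i][0]` and `train[i][1]` index into in the cells that follow.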


In [355]:
path = r'.\emerging_entities_17-master\emerging_entities_17-master\wnut17train.conll'
train = read_file(path)
#train[4]
path2 = r'.\emerging_entities_17-master\emerging_entities_17-master\emerging.test.conll'
test = read_file(path2)
#test[0]
train[3000][0]

['@trippy_tay',
 '::',
 '[',
 'Mixtape',
 ']',
 'Criminal',
 'Manne',
 '-',
 'Trap',
 'Talk',
 '::',
 'Jan',
 '.',
 '30th',
 '!',
 'http://t.co/2HixTIfmvL',
 '@LiveMixtapes',
 '@Trapaholics',
 '@Criminal_Manne']

In [319]:
# keep only the part of each test label that precedes the first comma
for tokens, tags in test:
    for j, tag in enumerate(tags):
        tags[j] = tag.split(',')[0]
test[7]

(['(',
  'Source',
  ':',
  'ANI',
  ')',
  'Visuals',
  'of',
  'the',
  'avalanche',
  'site',
  'in',
  'Gurez',
  'sector',
  '.'],
 ['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-location',
  'I-location',
  'O',
  'B-corporation',
  'I-location',
  'O'])

In [213]:
# training and test sets

train_sents = list(conll2002.iob_sents('esp.train'))  # Spanish portion of the corpus
test_sents = list(conll2002.iob_sents('esp.testb'))

print(train_sents[0])
# each tuple contains (token, POS tag, NER label)
print(train_sents[12])


[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]
[('Gómez', 'NC', 'B-PER'), ('de', 'SP', 'I-PER'), ('la', 'DA', 'I-PER'), ('Hera', 'NC', 'I-PER'), ('comentó', 'VMI', 'O'), ('que', 'CS', 'O'), ('los', 'DA', 'O'), ('contenidos', 'AQ', 'O'), ('de', 'SP', 'O'), ('la', 'DA', 'O'), ('publicación', 'NC', 'O'), ('"', 'Fe', 'O'), ('Ciudad', 'NC', 'B-MISC'), ('"', 'Fe', 'O'), ('responden', 'VMI', 'O'), ('al', 'SP', 'O'), ('programa', 'NC', 'O'), ('que', 'PR', 'O'), ('él', 'PP', 'O'), ('encabezó', 'VMI', 'O'), ('como', 'CS', 'O'), ('candidato', 'NC', 'O'), ('en', 'SP', 'O'), ('las', 'DA', 'O'), ('elecciones', 'NC', 'O'), ('municipales', 'AQ', 'O'), ('de', 'SP', 'O'), ('1999', 'Z', 'O'), (',', 'Fc', 'O'), ('en', 'SP', 'O'), ('las', 'DA', 'O'), ('que', 'PR', 'O'), ('IU', 'NC', 'B-ORG'), ('perdió', 'VMI', 'O'), ('su'

In [320]:
train_sents[0][0][2]

'B-LOC'

### Features

- Word-level features
    - word shape (upper case, title case, digit)
    - word suffix (last two and last three characters)
- Context from neighbouring words
    - the lower-cased form and shape of the previous and next word
- POS tag of the word
- Label context (the CRF itself models transitions between neighbouring labels)

This makes a simple baseline. Adding or removing features may improve the results considerably, so it is worth experimenting.

sklearn-crfsuite (and python-crfsuite) supports several feature formats;
here we use feature dicts.
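
As one example of such an experiment, the sketch below computes an explicit word-shape string that could be added to a feature dict. The helper names `word_shape` and `short_shape` and the feature keys `word.shape` and `word.short_shape` are assumptions for illustration; they are not used elsewhere in this notebook.

```python
from itertools import groupby


def word_shape(word):
    """Map upper-case letters to 'X', lower-case letters to 'x' and digits to 'd'."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append('X')
        elif ch.islower():
            shape.append('x')
        elif ch.isdigit():
            shape.append('d')
        else:
            shape.append(ch)      # keep punctuation and other symbols as they are
    return ''.join(shape)


def short_shape(word):
    """Collapse runs of identical shape characters, e.g. 'Xxxxxx' becomes 'Xx'."""
    return ''.join(key for key, _ in groupby(word_shape(word)))


# hypothetical extra entries for a feature dict
features = {'word.lower()': 'madrid'}
features.update({
    'word.shape': word_shape('Madrid'),
    'word.short_shape': short_shape('Madrid'),
})
print(features)
```

Whether such features help depends on the data, so they are best compared against the baseline on held-out data.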


In [321]:
def word2features2(sent, i):
    
    word = sent[0][i]
    #postag = sent[i][1]


    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        #'postag': postag,
        #'postag[:2]': postag[:2],
    }
    
    if i > 0:
        word1 = sent[0][i-1]
        #postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            #'-1:postag': postag1,
            #'-1:postag[:2]': postag1[:2],
        })
    else:
        # Mark the first token of the sentence (BOS = beginning of sentence)
        features['BOS'] = True
        
    
    if i < len(sent[0])-1:
        word1 = sent[0][i+1]
        #postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            #'+1:postag': postag1,
            #'+1:postag[:2]': postag1[:2],
        })
    else:
        # Mark the last token of the sentence (EOS = end of sentence)
        features['EOS'] = True

    return features

In [322]:
word2features2(train[0], 0)

{'bias': 1.0,
 'word.lower()': '@paulwalk',
 'word[-3:]': 'alk',
 'word[-2:]': 'lk',
 'word.isupper()': False,
 'word.istitle()': False,
 'word.isdigit()': False,
 'BOS': True,
 '+1:word.lower()': 'it',
 '+1:word.istitle()': True,
 '+1:word.isupper()': False}

In [3]:
# feature-extraction functions that build the sentence representation for sequence labelling
def word2features(sent, i):
    
    word = sent[i][0]
    postag = sent[i][1]


    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        # Mark the first token of the sentence (BOS = beginning of sentence)
        features['BOS'] = True
        
    
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        # Mark the last token of the sentence (EOS = end of sentence)
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [193]:
def sent2features2(sent):
    j=len(sent[0])
    return [word2features2(sent, i) for i in range(j)]

In [194]:
def sent2labels2(sent):
    return sent[1]

In [323]:
len(train[4][0])

12

In [324]:
sent2labels2(train[4])

['B-person', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

In [325]:
sent2features2(train[4])

[{'bias': 1.0,
  'word.lower()': '4dbling',
  'word[-3:]': 'ing',
  'word[-2:]': 'ng',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'BOS': True,
  '+1:word.lower()': "'s",
  '+1:word.istitle()': False,
  '+1:word.isupper()': False},
 {'bias': 1.0,
  'word.lower()': "'s",
  'word[-3:]': "'s",
  'word[-2:]': "'s",
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  '-1:word.lower()': '4dbling',
  '-1:word.istitle()': True,
  '-1:word.isupper()': False,
  '+1:word.lower()': 'place',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False},
 {'bias': 1.0,
  'word.lower()': 'place',
  'word[-3:]': 'ace',
  'word[-2:]': 'ce',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  '-1:word.lower()': "'s",
  '-1:word.istitle()': False,
  '-1:word.isupper()': False,
  '+1:word.lower()': 'til',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False},
 {'bias': 1.0,
  'word.lower()': 'til',
  'w

This is what word2features extracts:

In [4]:
sample_sentence = " ".join([s for s,c,d in train_sents[2]])
sample_sentence

'El Abogado General del Estado , Daryl Williams , subrayó hoy la necesidad de tomar medidas para proteger al sistema judicial australiano frente a una página de internet que imposibilita el cumplimiento de los principios básicos de la Ley .'

In [5]:
train_sents[2][:10]

[('El', 'DA', 'O'),
 ('Abogado', 'NC', 'B-PER'),
 ('General', 'AQ', 'I-PER'),
 ('del', 'SP', 'I-PER'),
 ('Estado', 'NC', 'I-PER'),
 (',', 'Fc', 'O'),
 ('Daryl', 'VMI', 'B-PER'),
 ('Williams', 'NC', 'I-PER'),
 (',', 'Fc', 'O'),
 ('subrayó', 'VMI', 'O')]

In [6]:
word2features(train_sents[2], 1)

{'bias': 1.0,
 'word.lower()': 'abogado',
 'word[-3:]': 'ado',
 'word[-2:]': 'do',
 'word.isupper()': False,
 'word.istitle()': True,
 'word.isdigit()': False,
 'postag': 'NC',
 'postag[:2]': 'NC',
 '-1:word.lower()': 'el',
 '-1:word.istitle()': True,
 '-1:word.isupper()': False,
 '-1:postag': 'DA',
 '-1:postag[:2]': 'DA',
 '+1:word.lower()': 'general',
 '+1:word.istitle()': True,
 '+1:word.isupper()': False,
 '+1:postag': 'AQ',
 '+1:postag[:2]': 'AQ'}

In [7]:
sent2features(train_sents[2])

[{'bias': 1.0,
  'word.lower()': 'el',
  'word[-3:]': 'El',
  'word[-2:]': 'El',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'DA',
  'postag[:2]': 'DA',
  'BOS': True,
  '+1:word.lower()': 'abogado',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'NC',
  '+1:postag[:2]': 'NC'},
 {'bias': 1.0,
  'word.lower()': 'abogado',
  'word[-3:]': 'ado',
  'word[-2:]': 'do',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'NC',
  'postag[:2]': 'NC',
  '-1:word.lower()': 'el',
  '-1:word.istitle()': True,
  '-1:word.isupper()': False,
  '-1:postag': 'DA',
  '-1:postag[:2]': 'DA',
  '+1:word.lower()': 'general',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'AQ',
  '+1:postag[:2]': 'AQ'},
 {'bias': 1.0,
  'word.lower()': 'general',
  'word[-3:]': 'ral',
  'word[-2:]': 'al',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  

### Feature Extraction:

Extract features from the training and test data.

In [8]:
# sentence representations for sequence labelling
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

In [326]:
# sentence representations for sequence labelling
X_train2 = [sent2features2(s) for s in train]
y_train2 = [sent2labels2(s) for s in train]

X_test2 = [sent2features2(s) for s in test]
y_test2 = [sent2labels2(s) for s in test]
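
Sequence labelling needs exactly one label per token, so a quick alignment check before training can catch parsing mistakes early. The helper below is a minimal sketch; the name `check_alignment` and the toy data are assumptions for illustration.

```python
def check_alignment(X, y):
    """Raise ValueError if feature sequences and label sequences do not line up."""
    if len(X) != len(y):
        raise ValueError(f"{len(X)} feature sequences but {len(y)} label sequences")
    for idx, (features, labels) in enumerate(zip(X, y)):
        if len(features) != len(labels):
            raise ValueError(
                f"sentence {idx}: {len(features)} tokens but {len(labels)} labels"
            )
    return True


# toy data: two sentences with two tokens and one token
toy_X = [[{'bias': 1.0}, {'bias': 1.0}], [{'bias': 1.0}]]
toy_y = [['B-PER', 'I-PER'], ['O']]
print(check_alignment(toy_X, toy_y))  # True
```

The same call could be applied to `X_train`/`y_train` and `X_train2`/`y_train2` before fitting.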

In [90]:

train_sents[0], y_train[0]

([('Melbourne', 'NP', 'B-LOC'),
  ('(', 'Fpa', 'O'),
  ('Australia', 'NP', 'B-LOC'),
  (')', 'Fpt', 'O'),
  (',', 'Fc', 'O'),
  ('25', 'Z', 'O'),
  ('may', 'NC', 'O'),
  ('(', 'Fpa', 'O'),
  ('EFE', 'NC', 'B-ORG'),
  (')', 'Fpt', 'O'),
  ('.', 'Fp', 'O')],
 ['B-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O'])

In [327]:
X_train2[1], y_train2[1]

([{'bias': 1.0,
   'word.lower()': 'from',
   'word[-3:]': 'rom',
   'word[-2:]': 'om',
   'word.isupper()': False,
   'word.istitle()': True,
   'word.isdigit()': False,
   'BOS': True,
   '+1:word.lower()': 'green',
   '+1:word.istitle()': True,
   '+1:word.isupper()': False},
  {'bias': 1.0,
   'word.lower()': 'green',
   'word[-3:]': 'een',
   'word[-2:]': 'en',
   'word.isupper()': False,
   'word.istitle()': True,
   'word.isdigit()': False,
   '-1:word.lower()': 'from',
   '-1:word.istitle()': True,
   '-1:word.isupper()': False,
   '+1:word.lower()': 'newsfeed',
   '+1:word.istitle()': True,
   '+1:word.isupper()': False},
  {'bias': 1.0,
   'word.lower()': 'newsfeed',
   'word[-3:]': 'eed',
   'word[-2:]': 'ed',
   'word.isupper()': False,
   'word.istitle()': True,
   'word.isdigit()': False,
   '-1:word.lower()': 'green',
   '-1:word.istitle()': True,
   '-1:word.isupper()': False,
   '+1:word.lower()': ':',
   '+1:word.istitle()': False,
   '+1:word.isupper()': False},
  {'

### Training
Here we train with the L-BFGS algorithm (the default) and Elastic Net (L1 + L2) regularization.
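
As a rough sketch of what Elastic Net regularization means here, the training objective combines the negative log-likelihood of the training sequences with an L1 and an L2 penalty on the weight vector $w$. The coefficients $c_1$ and $c_2$ correspond to the `c1` and `c2` arguments; the exact scaling conventions inside CRFsuite are an assumption and are not spelled out in this notebook.

```latex
\min_{w} \; -\sum_{n=1}^{N} \log p\bigl(y^{(n)} \mid x^{(n)}; w\bigr)
\;+\; c_1 \lVert w \rVert_1
\;+\; c_2 \lVert w \rVert_2^2
```

The L1 term tends to drive many feature weights to exactly zero, while the L2 term keeps the remaining weights small.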

In [328]:
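# train CRF model
!pip install sklearn_crfsuite
import sklearn_crfsuite
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',              # L-BFGS optimizer
    c1=0.1,                         # coefficient for L1 regularization
    c2=0.1,                         # coefficient for L2 regularization
    max_iterations=1000,
    all_possible_transitions=True   # also generate transition features not seen in the training data
)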





In [12]:
crf



CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)

In [210]:
# fit the model parameters on the CoNLL 2002 training data
crf.fit(X_train, y_train)


CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)

In [245]:
 y_train2[4]

['O', 'O', 'O', 'O', 'O', 'O']

In [329]:
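# refit the same `crf` object on the data read from wnut17train.conll;
# this replaces the parameters learned from the CoNLL 2002 data
crf.fit(X_train2, y_train2)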


CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=1000)

### Evaluation
The data set contains far more O tokens than entity tokens, but we are more interested in the entities. To account for this, we use an F1 score averaged over all labels except O. The `sklearn_crfsuite.metrics` module provides several useful metrics for sequence classification, including this one.
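
To make the metric concrete, the following sketch reimplements a support-weighted F1 over a chosen label subset in plain Python. It illustrates the idea only and is not the code used by `sklearn_crfsuite.metrics`; the name `flat_weighted_f1` and the toy sequences are assumptions.

```python
from collections import Counter


def flat_weighted_f1(y_true, y_pred, labels):
    """Flatten the sentences, then average per-label F1 weighted by label support."""
    gold_flat = [tag for sent in y_true for tag in sent]
    pred_flat = [tag for sent in y_pred for tag in sent]

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_flat, pred_flat):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1     # predicted label was wrong
            fn[gold] += 1     # gold label was missed

    weighted_sum, total_support = 0.0, 0
    for label in labels:
        support = tp[label] + fn[label]               # gold occurrences of the label
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        weighted_sum += f1 * support
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


toy_true = [['B-PER', 'I-PER', 'O'], ['O', 'B-LOC']]
toy_pred = [['B-PER', 'O', 'O'], ['O', 'B-LOC']]

entity_labels = ['B-PER', 'I-PER', 'B-LOC']
print(flat_weighted_f1(toy_true, toy_pred, entity_labels))           # without 'O'
print(flat_weighted_f1(toy_true, toy_pred, entity_labels + ['O']))   # with 'O'
```

On this toy example the score is higher when `'O'` is included, which is why the majority label is left out of the average.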

In [14]:
# get label set
labels = list(crf.classes_)
labels.remove('O')
print(labels)


['B-LOC', 'B-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']


In [330]:
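# label set for the data read from the WNUT files; note that 'O' is still included here,
# so a score computed with labels2 is not restricted to entity labels as described above
labels2 = ['B-location',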
 'B-product',
 'O',
 'I-corporation',
 'I-creative-work',
 'I-person',
 'B-person',
 'B-corporation',
 'I-group',
 'I-product',
 'B-creative-work',
 'B-group',
 'I-location']

In [211]:
# evaluate CRF model
from sklearn_crfsuite import metrics

y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)

0.7964686316443963

In [331]:
# evaluate CRF model
from sklearn_crfsuite import metrics

y_pred2 = crf.predict(X_test2)
metrics.flat_f1_score(y_test2, y_pred2, average='weighted', labels=labels2)

0.9163362748991393

In [223]:
y_pred2[8]

['O', 'O', 'O', 'O', 'O']
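
Predicted tag sequences such as the one above can be turned back into entity spans. The decoder below is a minimal sketch; the name `bio_to_spans`, the `(start, end, label)` output format (with `end` exclusive) and the choice to treat a stray I- tag as the start of a new span are assumptions for illustration.

```python
def bio_to_spans(tags):
    """Return (start, end, label) triples, end exclusive, from a BIO tag sequence."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # the open span continues only on an I- tag with the same label
        continues = tag.startswith('I-') and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith('B-') or (tag.startswith('I-') and start is None):
            # a stray I- without a matching open span starts a new span
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))   # span that runs to the end
    return spans


toy_tags = ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'O']
print(bio_to_spans(toy_tags))
```

A sequence that contains only `'O'` tags yields an empty list of spans.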

### Inspect per-class results in more detail:

In [243]:
# sort the labels so that the B- and I- tags of each entity type are adjacent
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))



              precision    recall  f1-score   support

       B-LOC      0.810     0.784     0.797      1084
       I-LOC      0.690     0.637     0.662       325
      B-MISC      0.731     0.569     0.640       339
      I-MISC      0.699     0.589     0.639       557
       B-ORG      0.807     0.832     0.820      1400
       I-ORG      0.852     0.786     0.818      1104
       B-PER      0.850     0.884     0.867       735
       I-PER      0.893     0.943     0.917       634

   micro avg      0.813     0.787     0.799      6178
   macro avg      0.791     0.753     0.770      6178
weighted avg      0.809     0.787     0.796      6178
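
The row order in the report comes from the sort key used above: labels are ordered first by entity type (everything after the first character) and then by the B/I prefix. A small stand-alone illustration with the CoNLL label set:

```python
labels = ['B-LOC', 'B-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']

# ('-LOC', 'B') sorts before ('-LOC', 'I'), which sorts before ('-MISC', 'B'), and so on
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
print(sorted_labels)
# ['B-LOC', 'I-LOC', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER']
```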

