# Baseline CRF model for NER with CRFsuite

This notebook trains a baseline CRF model for the NER task using the crfsuite tool. The only input features are the token and its POS tag. On the CoNLL2002 Spanish data the model reaches a weighted F1 of about 72%.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [2]:
from itertools import chain

import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

## Loading the training and test sets

The CoNLL2002 corpus ships separate files for the training, development and test sets. This notebook trains on `esp.train` and tests on `esp.testb`.

In [3]:
nltk.corpus.conll2002.fileids()

['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']

In [4]:
%%time
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

CPU times: user 1.08 s, sys: 30.9 ms, total: 1.11 s
Wall time: 1.1 s


In [5]:
train_sents[0]

[('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]

## Feature selection

The baseline model uses only two features per word: the token itself and its POS tag.

sklearn-crfsuite and python-crfsuite support several feature formats.

In [6]:
def word2features(sent, i):
    """Return the baseline features of the i-th word: (token, postag)."""
    word = sent[i][0]
    postag = sent[i][1]
    return word, postag

In [7]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

Features are extracted for every training sentence: `sent2features` builds the features of each word, `sent2labels` returns the label of each word, and `sent2tokens` returns the token of each word.

In [8]:
# Label, token and features of the first word of the first training sentence.
print(sent2labels(train_sents[0])[0])
print(sent2tokens(train_sents[0])[0])
print(sent2features(train_sents[0])[0])

B-LOC
Melbourne
('Melbourne', 'NP')
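The tuple printed above treats the token and the POS tag as two unnamed string features. One of the other formats that sklearn-crfsuite accepts is a dict of named features per word. A minimal sketch of that alternative follows; `word2features_dict` is a hypothetical name that is not part of this notebook, and the rest of the notebook does not use it.

```python
def word2features_dict(sent, i):
    """Hypothetical variant of word2features that names each feature."""
    word, postag = sent[i][0], sent[i][1]
    # Named keys keep a token and a POS tag with the same string distinct.
    return {'word': word, 'postag': postag}


example_sent = [('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O')]
print(word2features_dict(example_sent, 0))
```

Naming the features makes it easier to add more of them later (suffixes, casing, neighbouring words) without changing the training code.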


The training and test sets are then built from these input features.

In [9]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

CPU times: user 64 ms, sys: 3.88 ms, total: 67.9 ms
Wall time: 66.8 ms


## Training the model with CRFsuite

Training uses the L-BFGS algorithm with standard regularization: `c1` is the coefficient of the L1 penalty and `c2` is the coefficient of the L2 penalty.

In [10]:
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

CPU times: user 11.3 s, sys: 0 ns, total: 11.3 s
Wall time: 11.3 s




CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)
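
The `c1` and `c2` values passed to `CRF` above are the L1 and L2 regularization coefficients. As a sketch, and up to CRFsuite's exact scaling conventions (which are assumed here rather than taken from this notebook), the training objective has the form

$$
\min_{w}\; -\sum_{n=1}^{N} \log p\bigl(y^{(n)} \mid x^{(n)}; w\bigr) \;+\; c_1 \lVert w \rVert_1 \;+\; c_2 \lVert w \rVert_2^2
$$

where $w$ is the vector of feature weights and $(x^{(n)}, y^{(n)})$ are the $N$ training sentences with their label sequences. With `c1 = c2 = 0.1` both penalties are active: the L1 term pushes many weights to exactly zero, while the L2 term keeps the remaining weights small.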

## Evaluation

The data set contains many more O labels than entity labels, but the entity labels are the ones of interest. To account for this, we use an averaged F1 score computed over all labels except O. The `sklearn_crfsuite.metrics` module provides some useful metrics for sequence classification tasks, including this one.

In [11]:
labels = list(crf.classes_)
labels.remove('O')
labels

['B-LOC', 'B-ORG', 'B-PER', 'I-PER', 'B-MISC', 'I-ORG', 'I-LOC', 'I-MISC']

In [12]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, 
                      average='weighted', labels=labels)

0.7253660905082374
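
To make the score above concrete, here is a pure-Python sketch of a support-weighted F1 over flattened label sequences, restricted to a chosen label list. It is meant to illustrate the idea behind the `flat_f1_score` call with `average='weighted'`; it is not the library's implementation, and `flat_weighted_f1` is a hypothetical helper name.

```python
from collections import Counter
from itertools import chain


def flat_weighted_f1(y_true, y_pred, labels):
    """Support-weighted F1 over flattened sequences, restricted to `labels`."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(true_flat, pred_flat):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1
            fn[true_label] += 1
    # Support of a label = number of times it occurs in the gold data.
    total_support = sum(tp[label] + fn[label] for label in labels)
    score = 0.0
    for label in labels:
        support = tp[label] + fn[label]
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        score += f1 * support
    return score / total_support if total_support else 0.0


toy_true = [['B-LOC', 'O', 'B-PER'], ['O', 'B-LOC']]
toy_pred = [['B-LOC', 'O', 'O'], ['B-LOC', 'B-LOC']]
print(flat_weighted_f1(toy_true, toy_pred, labels=['B-LOC', 'B-PER']))
```

In the toy data, B-LOC has an F1 of 0.8 with support 2 and B-PER has an F1 of 0 with support 1, so the weighted score is 1.6 / 3. Leaving O out of `labels` means correct O predictions do not inflate the result.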

Inspect per-class results in more detail:

In [13]:
# group B and I results
sorted_labels = sorted(
    labels, 
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))



              precision    recall  f1-score   support

       B-LOC      0.816     0.608     0.697      1084
       I-LOC      0.668     0.545     0.600       325
      B-MISC      0.726     0.422     0.534       339
      I-MISC      0.574     0.460     0.510       557
       B-ORG      0.856     0.759     0.805      1400
       I-ORG      0.789     0.743     0.765      1104
       B-PER      0.924     0.642     0.758       735
       I-PER      0.922     0.785     0.848       634

   micro avg      0.810     0.662     0.728      6178
   macro avg      0.784     0.620     0.690      6178
weighted avg      0.809     0.662     0.725      6178
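
As a quick sanity check, the macro and weighted F1 averages in the last rows can be recomputed from the per-label rows. The sketch below copies the rounded f1-score and support values from the report, so it should agree with the table only up to rounding.

```python
# (f1-score, support) per label, copied from the report above.
report = {
    'B-LOC': (0.697, 1084), 'I-LOC': (0.600, 325),
    'B-MISC': (0.534, 339), 'I-MISC': (0.510, 557),
    'B-ORG': (0.805, 1400), 'I-ORG': (0.765, 1104),
    'B-PER': (0.758, 735), 'I-PER': (0.848, 634),
}

total_support = sum(support for _, support in report.values())
# Macro average: every label counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
# Weighted average: every label counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(total_support, round(macro_f1, 3), round(weighted_f1, 3))
```

The weighted average is higher than the macro average because the frequent ORG labels score well, while the weaker MISC labels have comparatively little support.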



In [15]:
# Unseen example sentence as (token, postag) pairs.
prueba = [('La', 'DA'), ('Coruña', 'NC'), ('sería', 'VSI'), ('el', 'DA'),
          ('nuevo', 'AQ'), ('equipo', 'NC'), ('de', 'SP'), ('James', 'NP'),
          ('Rodriguez', 'NP'), (',', 'Fc'), ('aunque', 'CC'), ('todavía', 'RG'),
          ('es', 'VSI'), ('de', 'SP'), ('el', 'DA'), ('Real', 'NP'),
          ('Madrid', 'NP'), ('de', 'SP'), ('España', 'NP')]
# First training sentence with its gold labels, kept for reference (not used below).
prueba1 = [('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'),
           (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'),
           ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]


def ner_tag(sentence):
    """Pair each (token, postag) item with the NER label predicted by the CRF."""
    sentence_features = [word2features(sentence, index) for index in range(len(sentence))]
    return list(zip(sentence, crf.predict([sentence_features])[0]))

print(ner_tag(prueba))

[(('La', 'DA'), 'B-LOC'), (('Coruña', 'NC'), 'I-LOC'), (('sería', 'VSI'), 'O'), (('el', 'DA'), 'O'), (('nuevo', 'AQ'), 'O'), (('equipo', 'NC'), 'O'), (('de', 'SP'), 'O'), (('James', 'NP'), 'B-PER'), (('Rodriguez', 'NP'), 'I-PER'), ((',', 'Fc'), 'O'), (('aunque', 'CC'), 'O'), (('todavía', 'RG'), 'O'), (('es', 'VSI'), 'O'), (('de', 'SP'), 'O'), (('el', 'DA'), 'O'), (('Real', 'NP'), 'B-ORG'), (('Madrid', 'NP'), 'I-ORG'), (('de', 'SP'), 'I-ORG'), (('España', 'NP'), 'I-ORG')]
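
The per-token BIO labels printed above can be grouped into entity spans for easier reading. The following is a minimal sketch of such a decoding step; `bio_to_entities` is a hypothetical helper that is not part of this notebook or of sklearn-crfsuite, and the tokens and tags are copied from the output above.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            entities.append((' '.join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith('B-') or (
            tag.startswith('I-') and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag, or an I- tag of a different type, opens a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith('I-'):
            current_tokens.append(token)
        else:
            # An 'O' tag closes any open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ['La', 'Coruña', 'sería', 'el', 'nuevo', 'equipo', 'de', 'James',
          'Rodriguez', ',', 'aunque', 'todavía', 'es', 'de', 'el', 'Real',
          'Madrid', 'de', 'España']
tags = ['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O',
        'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']
print(bio_to_entities(tokens, tags))
```

Read this way, the model found one location (La Coruña), one person (James Rodriguez) and one organization span that runs from Real to España.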
