# Building a CRF Model
### By **Néstor Suat** in 2020

**Description:** Building an ML model for the NER task on accident tweets, covering the `loc` and `time` labels and using the BIO standard.

**Input:**
* TSV file with a BIO-tagged dataset

**Output:**
* Model

**Adapted from**: https://www.depends-on-the-definition.com/named-entity-recognition-conditional-random-fields-python/
***

### Importing libraries

In [1]:
import pandas as pd

### Source code

The `SentenceGetter` class is a generic helper found in many NER projects. It takes the dataset and turns it into a Python list so the data can be processed sentence by sentence.

In [2]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            #s = self.grouped["Sentence: {}".format(self.n_sent)]
            s = self.grouped[self.n_sent]
            self.n_sent += 1
            return s
        except (KeyError, IndexError):
            return None
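
The grouping that `SentenceGetter` performs can be illustrated without pandas. This is a minimal sketch on made-up rows (the words, POS tags and sentence ids are hypothetical, not taken from the dataset): rows sharing a sentence id are collected into one list of `(word, pos, tag)` tuples.

```python
from collections import defaultdict

# Hypothetical rows in the same column order as the TSV: (Sentence #, Word, POS, Tag)
rows = [
    (1, "choque", "NOUN", "O"),
    (1, "en", "ADP", "O"),
    (1, "autonorte", "PROPN", "B-loc"),
    (2, "hoy", "ADV", "B-time"),
]

# Collect the (word, pos, tag) triples of each sentence under its id
grouped = defaultdict(list)
for sent_id, word, pos, tag in rows:
    grouped[sent_id].append((word, pos, tag))

# One list per sentence, ordered by sentence id
sentences = [grouped[sent_id] for sent_id in sorted(grouped)]
print(sentences)
```

Each element of `sentences` has the shape that `sent2features`, `sent2labels` and `sent2tokens` expect.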

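### Feature selection

For the CRF algorithm, features are built from the grammatical and morphological properties of each word: its lowercase form, suffixes, casing, whether it is a digit and its POS tag, plus similar information for the previous and next word.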

In [3]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
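
To make the word-shape features concrete, the sketch below evaluates them on a few tokens. `shape_features` is a hypothetical helper introduced only for illustration; it reproduces a subset of the keys built by `word2features`.

```python
def shape_features(word):
    """Hypothetical helper: the word-shape subset of the features built above."""
    return {
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],      # last three characters (whole word if shorter)
        "word[-2:]": word[-2:],      # last two characters
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
    }

for token in ["Av", "Boyacá", "62", "SUR", "hoy"]:
    print(token, shape_features(token))
```

For words of three characters or fewer, `word[-3:]` is the whole word, so the suffix feature duplicates the word itself. This is why `word.lower():hoy` and `word[-3:]:hoy` show the same weight in the model inspection further down.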

### Importing the annotated dataset

The file `ner-crf-training-data.tsv` was built in a previous step by converting the annotations from Standoff format to BIO.

In [4]:
file = 'ner-crf-training-data.tsv'
dir_ = "../../../data/v1/NER/"
data = pd.read_csv(dir_+file, delimiter = "\t", quoting = 3, names=['Sentence #','Word','POS','Tag'])
#dataset[:50]
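
The Standoff-to-BIO conversion is not part of this notebook. As a rough sketch of the idea, assuming standoff annotations given as character offsets and simple whitespace tokenization (both assumptions, as is the helper name `spans_to_bio`), each token inside an annotated span receives `B-` if it opens the span and `I-` otherwise:

```python
def spans_to_bio(text, spans):
    """Hypothetical helper: spans is a list of (start_char, end_char, label)."""
    tagged = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)   # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                # The first token of the span gets B-, the rest get I-
                tag = ("B-" if start == span_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged

# Made-up example sentence and spans
text = "choque en la av boyacá con calle 62 hoy"
loc = "av boyacá con calle 62"
spans = [
    (text.index(loc), text.index(loc) + len(loc), "loc"),
    (text.index("hoy"), len(text), "time"),
]
print(spans_to_bio(text, spans))
```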

#### **Preparing the dataset**

The workflow builds a vocabulary of all the words present in the tweets, adds a special padding token called ENDPAD, and finally computes the vocabulary size. The same is done for the tags, which is easier because there are only 5: `B-loc`, `I-loc`, `B-time`, `I-time` and `O`. The cell below covers only the word vocabulary and its size; ENDPAD is not appended there.

In [5]:
words = list(set(data['Word'].values))
words.sort()
n_words = len(words); n_words

455

The data is loaded and the list of sentences to work with is built.

In [7]:
getter = SentenceGetter(data)
sentences = getter.sentences
#sentences

sent = getter.get_next()
print(sent)

In [8]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

## Train and Test set

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [19]:
print("Train:",len(X_train), len(y_train))
print("Test:",len(X_test), len(y_test))

Train: 40 40
Test: 10 10


## CRF Model

In [36]:
from sklearn_crfsuite import CRF
from sklearn.model_selection import cross_val_predict
from sklearn_crfsuite.metrics import flat_classification_report
from sklearn_crfsuite import metrics
import eli5

In [12]:
from sklearn_crfsuite import CRF

crf = CRF(algorithm='lbfgs',
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=False)

## Evaluate

### **Cross Validation**
Validation using 5-fold cross-validation (K=5).

In [52]:
#pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)
pred = cross_val_predict(estimator=crf, X=X_train, y=y_train, cv=5)  # training data only

#report = flat_classification_report(y_pred=pred, y_true=y)
report = flat_classification_report(y_pred=pred, y_true=y_train)  # training data only
print(report)
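
To show what K=5 means for the 40 training sentences, here is a minimal pure-Python sketch of a contiguous, unshuffled K-fold split. It assumes the splitter behaves like scikit-learn's `KFold` without shuffling; `kfold_indices` is a hypothetical helper, not a library function.

```python
def kfold_indices(n_samples, k):
    """Hypothetical helper: contiguous K-fold split, returns (train, test) index lists."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

folds = kfold_indices(40, 5)
for train_idx, test_idx in folds:
    print(len(train_idx), "train /", len(test_idx), "held out:", test_idx[0], "to", test_idx[-1])
```

Each sentence is held out exactly once, so `cross_val_predict` returns one prediction per training sentence, each made by a model that never saw that sentence.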

In [54]:
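# group B and I results
# `labels` is only defined in a later cell (from crf.classes_), so derive it here
# from the training tags, excluding 'O', to keep the notebook runnable top to bottom
labels = sorted({tag for tags in y_train for tag in tags} - {'O'})
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_train, pred, labels=sorted_labels, digits=2
))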

              precision    recall  f1-score   support

       B-loc       0.74      0.58      0.65        50
       I-loc       0.81      0.71      0.75       146
      B-time       0.00      0.00      0.00         6
      I-time       0.00      0.00      0.00         8

   micro avg       0.80      0.63      0.70       210
   macro avg       0.39      0.32      0.35       210
weighted avg       0.74      0.63      0.68       210
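
The macro and weighted averages in the report can be recomputed from the per-class rows. This sketch copies the rounded values from the table above, so small rounding differences against the library's own figures are possible:

```python
# Per-class rows copied from the report above: (precision, recall, f1, support)
report = {
    "B-loc":  (0.74, 0.58, 0.65, 50),
    "I-loc":  (0.81, 0.71, 0.75, 146),
    "B-time": (0.00, 0.00, 0.00, 6),
    "I-time": (0.00, 0.00, 0.00, 8),
}

total_support = sum(row[3] for row in report.values())

# Macro average: every class counts the same, regardless of frequency
macro_f1 = sum(row[2] for row in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(row[2] * row[3] for row in report.values()) / total_support

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The two `time` classes score 0.00 and pull the macro average down to about half of the `loc` scores, while the weighted average barely notices them because they account for only 14 of the 210 tokens.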



### **Split train & test validation**

In [26]:
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=False,
    averaging=None, c=None, c1=0.1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

#### **Testset**

In [43]:
y_pred = crf.predict(X_test)

In [45]:
labels = list(crf.classes_)
labels.remove('O')
labels

print("F1-score: {:.1%}".format(metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)))

F1-score: 68.3%


In [51]:
# group B and I results
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=2
))

              precision    recall  f1-score   support

       B-loc       0.55      0.50      0.52        12
       I-loc       0.92      0.68      0.78        34
      B-time       0.00      0.00      0.00         2
      I-time       0.00      0.00      0.00         0

   micro avg       0.81      0.60      0.69        48
   macro avg       0.37      0.29      0.33        48
weighted avg       0.79      0.60      0.68        48



### Inspecting the model

The output below shows the transition weights from one tag to another (unnormalized scores, not probabilities). It also shows which features matter most when predicting each tag.

In [27]:
eli5.show_weights(crf, top=30)

From \ To,O,B-loc,I-loc,B-time,I-time
O,3.572,2.636,0.0,1.006,0.0
B-loc,-1.536,0.0,4.174,0.0,0.0
I-loc,0.043,0.0,4.413,0.0,0.0
B-time,-0.4,0.0,0.0,0.289,1.319
I-time,-0.505,0.0,0.0,0.0,2.645
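
To illustrate how these transition weights enter a prediction, the following simplified Viterbi sketch combines them with per-token state scores and returns the highest-scoring tag path. The transition values are copied from the table above; the state scores are hypothetical, and in the real model they come from the feature weights. A weight of 0.0 is treated here as a neutral score, not as a forbidden transition.

```python
LABELS = ["O", "B-loc", "I-loc", "B-time", "I-time"]

# Transition weights from the table above (row: from, column order: LABELS)
TRANS = {
    "O":      [3.572, 2.636, 0.0, 1.006, 0.0],
    "B-loc":  [-1.536, 0.0, 4.174, 0.0, 0.0],
    "I-loc":  [0.043, 0.0, 4.413, 0.0, 0.0],
    "B-time": [-0.4, 0.0, 0.0, 0.289, 1.319],
    "I-time": [-0.505, 0.0, 0.0, 0.0, 2.645],
}

def viterbi(state_scores):
    """state_scores: one dict per token mapping label -> score (missing labels score 0)."""
    # best[label] = (score of the best path ending in label, that path)
    best = {lab: (state_scores[0].get(lab, 0.0), [lab]) for lab in LABELS}
    for scores in state_scores[1:]:
        new_best = {}
        for j, cur in enumerate(LABELS):
            # Pick the previous label that maximizes path score + transition weight
            prev = max(LABELS, key=lambda p: best[p][0] + TRANS[p][j])
            total = best[prev][0] + TRANS[prev][j] + scores.get(cur, 0.0)
            new_best[cur] = (total, best[prev][1] + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Hypothetical state scores for three tokens, e.g. "la", "av", "boyacá"
state_scores = [
    {"O": 2.0},
    {"O": 0.5, "B-loc": 1.5},
    {"O": 0.5, "I-loc": 0.5},
]
print(viterbi(state_scores))
```

The third token has identical state scores for `O` and `I-loc`; the large `B-loc` to `I-loc` transition weight (4.174) is what makes the location path win.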

y=O top features
Weight?,Feature
+2.423,EOS
+1.754,bias
+1.048,BOS
+0.998,-1:word.lower():sur
+0.988,postag:PUNCT
+0.988,postag[:2]:PU
+0.936,+1:word.lower():5:30
+0.863,word[-3:]:Sur
+0.863,-1:word.lower():23d
+0.848,word.lower():accidente

y=B-loc top features
Weight?,Feature
+1.610,+1:word.lower():norte
+1.559,word.istitle()
+1.215,-1:word.lower():la
+1.102,word.lower():corabastos
+1.071,word.lower():autonorte
+1.037,+1:postag:ADJ
+0.924,word.lower():auto
+0.924,word[-3:]:uto
+0.912,-1:word.lower():desde
+0.912,+1:word.lower():caro

y=I-loc top features
Weight?,Feature
+1.313,-1:word.lower():con
+1.173,+1:word.lower():choque
+1.123,-1:word.lower():puente
+1.089,-1:word.lower():8
+1.017,-1:word.lower():128
+0.994,+1:word.lower():hacia
+0.954,-1:word.lower():autonorte
+0.950,-1:word.lower():auto
+0.936,word[-2:]:ón
+0.860,-1:word.lower():calle

y=B-time top features
Weight?,Feature
+1.650,postag:ADV
+0.973,word.lower():mañana
+0.969,word.lower():5:30
+0.969,word[-3:]::30
+0.952,word[-2:]:30
+0.932,word[-3:]:ana
+0.892,word[-2:]:oy
+0.892,word.lower():hoy
+0.892,word[-3:]:hoy
+0.879,-1:word.lower():muertos

y=I-time top features
Weight?,Feature
+0.911,-1:postag[:2]:NU
+0.911,-1:postag:NUM
+0.765,word[-3:]:ayo
+0.765,word.lower():mayo
+0.765,word[-2:]:yo
+0.765,+1:word.lower():nuestras
+0.708,-1:word.lower():5:30
+0.634,+1:word.lower():mayo
+0.634,word[-2:]:am
+0.625,word.lower():am


### Evaluating performance on a single sentence

In [111]:
import numpy as np
i = 1
p = crf.predict_single(X_test[i])

print("{:15} ({:5}): {}".format("Word", "True", "Pred"))
for w,true, pred in zip(X_test[i],y_test[i],p):
    print("{:15} ({:5}): {}".format(w['word.lower()'],true,pred))

Word            (True ): Pred
en              (O    ): O
la              (O    ): O
av              (B-loc): B-loc
boyacá          (I-loc): I-loc
con             (I-loc): I-loc
calle           (I-loc): I-loc
62              (I-loc): I-loc
sur             (O    ): O
sentido         (O    ): O
norte           (O    ): O
sur             (O    ): O
choque          (O    ): O
simple          (O    ): O
de              (O    ): O
camión          (O    ): O
vs              (O    ): O
zonal           (O    ): O
provisional     (O    ): O
paso            (O    ): O
reducido        (O    ): O
,               (O    ): O
se              (O    ): O
solicita        (O    ): O
tránsito        (O    ): O
.               (O    ): O
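
A tag sequence like the one above is usually turned back into entity spans before it is used downstream. This is a minimal sketch of that decoding step; `bio_to_entities` is a hypothetical helper, and the tokens and tags are the first eight of the sentence above.

```python
def bio_to_entities(tokens, tags):
    """Hypothetical helper: collapse BIO tags into (label, text) entities."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if any
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # 'O' or an I- tag without a matching B-: close any open entity
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        entities.append((current_label, " ".join(current_tokens)))
    return entities

tokens = ["en", "la", "av", "boyacá", "con", "calle", "62", "sur"]
tags = ["O", "O", "B-loc", "I-loc", "I-loc", "I-loc", "I-loc", "O"]
print(bio_to_entities(tokens, tags))
```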


# Improving the model with regularization

In [138]:
crf = CRF(algorithm='lbfgs',
          c1=10,  # L1 regularization coefficient; supposedly a larger value makes the model depend less on individual words and more on context
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=False)
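
As a reminder of what `c1` and `c2` control, the L-BFGS trainer minimizes the negative log-likelihood plus an elastic-net penalty. The form below is a sketch; the exact scaling of each term inside CRFsuite is an assumption here.

$$
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) \; + \; c_1 \lVert w \rVert_1 \; + \; c_2 \lVert w \rVert_2^2
$$

Here $w$ is the vector of state and transition weights and $(x^{(n)}, y^{(n)})$ are the $N$ training sentences with their tag sequences. The L1 term $c_1 \lVert w \rVert_1$ pushes weights to exactly zero, so raising `c1` yields a sparser model that keeps only the features with the strongest evidence.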

### Cross Validation

In [140]:
#pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)
pred = cross_val_predict(estimator=crf, X=X_train, y=y_train, cv=5)

report = flat_classification_report(y_pred=pred, y_true=y_train)  # pred covers the training data only
print(report)

In [141]:
# group B and I results
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_train, pred, labels=sorted_labels, digits=2
))

              precision    recall  f1-score   support

       B-loc       0.72      0.52      0.60        50
       I-loc       0.80      0.70      0.75       146
      B-time       0.00      0.00      0.00         6
      I-time       0.00      0.00      0.00         8

   micro avg       0.79      0.61      0.69       210
   macro avg       0.38      0.30      0.34       210
weighted avg       0.73      0.61      0.66       210





### Testset Evaluate

In [142]:
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=False,
    averaging=None, c=None, c1=1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

In [143]:
y_pred = crf.predict(X_test)

In [134]:
labels = list(crf.classes_)
labels.remove('O')
labels

print("F1-score: {:.1%}".format(metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)))

F1-score: 56.9%




In [144]:
# group B and I results
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=2
))

              precision    recall  f1-score   support

       B-loc       0.60      0.50      0.55        12
       I-loc       0.93      0.74      0.82        34
      B-time       0.00      0.00      0.00         2
      I-time       0.00      0.00      0.00         0

   micro avg       0.84      0.65      0.73        48
   macro avg       0.38      0.31      0.34        48
weighted avg       0.81      0.65      0.72        48





In [136]:
eli5.show_weights(crf, top=30)

From \ To,O,B-loc,I-loc,B-time,I-time
O,2.506,1.226,0.0,0.0,0.0
B-loc,-0.167,0.0,2.921,0.0,0.0
I-loc,0.0,0.0,2.642,0.0,0.0
B-time,0.0,0.0,0.0,0.0,0.0
I-time,0.0,0.0,0.0,0.0,0.0

y=O top features
Weight?,Feature
1.262,bias
0.601,EOS
0.412,postag[:2]:PU
0.412,postag:PUNCT
0.167,BOS
0.145,word.lower():en
0.08,+1:postag:NOUN
0.08,+1:postag[:2]:NO
0.032,postag[:2]:DE
0.032,postag:DET

y=B-loc top features
Weight?,Feature
1.695,-1:word.lower():la
0.325,word.istitle()
0.008,-1:postag[:2]:DE
0.008,-1:postag:DET
-0.106,+1:postag:NOUN
-0.106,+1:postag[:2]:NO
-0.444,postag[:2]:AD

y=I-loc top features
Weight?,Feature
1.479,word.isdigit()
0.304,-1:word.lower():con
0.218,word[-3:]:con
0.218,word.lower():con
0.179,-1:postag:NOUN
0.179,-1:postag[:2]:NO
0.162,+1:postag:PUNCT
0.162,+1:postag[:2]:PU
-0.083,-1:postag:NUM
-0.083,-1:postag[:2]:NU


In [145]:
import numpy as np
i = 1
p = crf.predict_single(X_test[i])

print("{:15} ({:5}): {}".format("Word", "True", "Pred"))
for w,true, pred in zip(X_test[i],y_test[i],p):
    print("{:15} ({:5}): {}".format(w['word.lower()'],true,pred))

Word            (True ): Pred
en              (O    ): O
la              (O    ): O
av              (B-loc): B-loc
boyacá          (I-loc): I-loc
con             (I-loc): I-loc
calle           (I-loc): I-loc
62              (I-loc): I-loc
sur             (O    ): O
sentido         (O    ): O
norte           (O    ): O
sur             (O    ): O
choque          (O    ): O
simple          (O    ): O
de              (O    ): O
camión          (O    ): O
vs              (O    ): O
zonal           (O    ): O
provisional     (O    ): O
paso            (O    ): O
reducido        (O    ): O
,               (O    ): O
se              (O    ): O
solicita        (O    ): O
tránsito        (O    ): O
.               (O    ): O
