### Named Entity Recognition using HMM

In this notebook we build a word tagger:
- for Spanish, using the CoNLL-2002 Spanish data; the full dataset is available at https://github.com/teropa/nlp/tree/master/resources/corpora/conll2002
- We will also develop the `tag` method

## Named Entity Prediction

In the procedure below we tag the names in a sequence and annotate them in
our corpus.

![alt text](imagen1.png "Title")

## Unsupervised Hidden Markov Model

For this model we first consider only the words and their
tags, as shown below.
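
To make the idea concrete, here is a minimal sketch of how an HMM chooses a tag sequence from the words alone, using Viterbi decoding. The two-tag model and every probability in it are made-up illustrations, not values learned from CoNLL-2002, and `viterbi` is a hypothetical helper rather than part of this notebook.

```python
import math

# Toy HMM: every number below is an illustrative assumption.
states = ["O", "LOC"]
start_p = {"O": 0.8, "LOC": 0.2}
trans_p = {
    "O": {"O": 0.9, "LOC": 0.1},
    "LOC": {"O": 0.7, "LOC": 0.3},
}
emit_p = {
    "O": {"vive": 0.5, "en": 0.45, "Madrid": 0.05},
    "LOC": {"vive": 0.05, "en": 0.05, "Madrid": 0.9},
}


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # Each column maps a state to (best log-probability, previous state).
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][words[0]]), None)
             for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][word]), prev)
        best.append(column)
    # Backtrack from the best final state.
    tag = max(states, key=lambda s: best[-1][s][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


toy_tags = viterbi(["vive", "en", "Madrid"])
print(toy_tags)
```

Working in log space avoids numeric underflow on long sentences, which is why the sketch adds logarithms instead of multiplying probabilities.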


In [1]:
import nltk, nltk.classify.util, nltk.metrics
from dataprep import conll_sentences,conll_sentence_esp, conll_words,conll_words_esp, neel_sentences, neel_words
from helper import accuracy, entity_count
from nltk import MaxentClassifier
from sklearn.metrics import precision_recall_fscore_support

### Load the Spanish dataset
We start by loading the files of our Spanish dataset.

In [33]:
train_file = './dataset/CoNLL2002/esp.train'
testa_file = './dataset/CoNLL2002/esp.testa'
testb_file = './dataset/CoNLL2002/esp.testb'

train_words, _, train_etiquetas = conll_words_esp(train_file)
testa_words, _, testa_etiquetas = conll_words_esp(testa_file)
testb_words, _, testb_etiquetas = conll_words_esp(testb_file)

### Show the loaded data

We show the data loaded from the Spanish dataset. Only the fields that matter most for this example are used:
- word, entity

In [34]:
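# Print the first row. Note: `_` was last assigned from testb_file in the previous cell,
# so the middle field comes from the testb split, not from the training file.
print("[0] reglon:",train_words[0],_[0],train_etiquetas[0])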

[0] reglon: Melbourne DA LOC
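
The `conll_words_esp` helper comes from the local `dataprep` module, which is not shown here. As a rough sketch of what such a loader might look like, the code below assumes one token per line with three whitespace-separated columns (word, POS tag, entity tag) and assumes the `B-`/`I-` prefix is stripped, as the printed `LOC` suggests. The name `read_conll_words` and the sample lines are hypothetical.

```python
def read_conll_words(lines):
    """Parse CoNLL-style lines into parallel lists of words, POS tags and entities.

    Assumes three whitespace-separated columns per token line; blank lines
    (sentence separators) and malformed lines are skipped.
    """
    words, pos_tags, entities = [], [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # blank sentence separator or malformed line
        word, pos, tag = parts
        # Drop the B-/I- prefix so 'B-LOC' becomes 'LOC'; 'O' stays 'O'.
        entity = tag.split("-", 1)[1] if "-" in tag else tag
        words.append(word)
        pos_tags.append(pos)
        entities.append(entity)
    return words, pos_tags, entities


# Hypothetical sample in the assumed format.
sample = ["Melbourne NP B-LOC", "( Fpa O", "Australia NP B-LOC", ""]
sample_words, sample_pos, sample_entities = read_conll_words(sample)
print(sample_words[0], sample_pos[0], sample_entities[0])
```

Returning three parallel lists mirrors the way the notebook unpacks the helper's result into words, an ignored middle value, and tags.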


### Merge the training and test sets and build the vocabulary sets

In [35]:
# concatenate the words and their entity tags across all splits

combinado_words = train_words + testa_words + testb_words
combinado_etiquetas = train_etiquetas + testa_etiquetas + testb_etiquetas

# loop to build the character vocabulary set
char_set = set()
for word in combinado_words:
    for char in word:
        char_set.add(char)
etiqueta_set = set(combinado_etiquetas)

### Training the unsupervised HMM

In [36]:
# set up the hidden Markov model trainer
trainer = nltk.tag.hmm.HiddenMarkovModelTrainer(states=etiqueta_set, symbols=char_set)

# unsupervised training with NLTK's HMM trainer (run here on the testa split)
modelo = trainer.train_unsupervised(testa_words, max_iterations=100)

iteration 0 logprob -1540694.8438385099
iteration 1 logprob -1105937.4315283587
iteration 2 logprob -1101893.490903211
iteration 3 logprob -1098056.8272231247
iteration 4 logprob -1093746.120157142
iteration 5 logprob -1088592.3821617411
iteration 6 logprob -1082441.2246317966
iteration 7 logprob -1075342.2843631073
iteration 8 logprob -1067370.626023771
iteration 9 logprob -1058403.5962087363
iteration 10 logprob -1048354.4302744969
iteration 11 logprob -1037960.2214199561
iteration 12 logprob -1028446.0466508842
iteration 13 logprob -1020185.1973935018
iteration 14 logprob -1012833.001689907
iteration 15 logprob -1005948.0716215682
iteration 16 logprob -998915.8665790766
iteration 17 logprob -991043.5838289928
iteration 18 logprob -981104.17044474
iteration 19 logprob -969750.2608120533
iteration 20 logprob -961678.2475442089
iteration 21 logprob -956602.4465816314
iteration 22 logprob -953169.8160937146
iteration 23 logprob -950822.1456024995
iteration 24 logprob -949256.8930210482


### Evaluation (test) on the CoNLL testb split

Here we run the test and then show the first 5 tagged results.

In [41]:
# tag the testb words with the trained model

testb_result = modelo.tag(testb_words)

# show the first 5 tagged tokens


testb_result[0:5]


[('La', 'O'), ('Coruña', 'O'), (',', 'O'), ('23', 'O'), ('may', 'O')]

In [38]:

testb_predecido = []
for word, etiqueta in testb_result:
    testb_predecido.append(etiqueta)

### Accuracy

In [39]:

accuracy(testb_etiquetas, testb_predecido)

accuracy = 45355 / 51533 = 0.880116


### Tag totals

In [40]:
# show the total count for each tag
entity_count(testb_etiquetas)

ORG: 2504
PER: 1369
LOC: 1409
MISC: 896
O: 45355
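
These counts help put the accuracy in context. As a quick check that uses only the numbers printed by `entity_count`, the share of `O` tokens can be computed directly. It matches the accuracy reported above, which is what a tagger that labels every token `O` would score, and it suggests the model may be predicting `O` for nearly every token.

```python
# Tag counts copied from the entity_count output above.
tag_counts = {"ORG": 2504, "PER": 1369, "LOC": 1409, "MISC": 896, "O": 45355}

total_tokens = sum(tag_counts.values())
majority_tag = max(tag_counts, key=tag_counts.get)
baseline_accuracy = tag_counts[majority_tag] / total_tokens

print(f"{majority_tag}: {tag_counts[majority_tag]} / {total_tokens} = {baseline_accuracy:.6f}")
```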


## Conditional Markov Model (CMM)

First we load all the required libraries, as in the previous section, but this time we work with the English dataset.


## Using the CoNLL-2003 dataset

In [10]:
def get_conll_features(index, sentence, pos, chunk):
    """
    Extract the feature dictionary for the token at `index`.
    
    'w' is a word feature
    't' is a POS-tag feature
    'c' is a chunk-tag feature
    '-n' marks the n-th previous token
    '+n' marks the n-th following token
    """
    
    features = {}
    last_index = len(sentence) - 1
    word = sentence[index]
    word_lc = word.lower()
    
    # features from current word:
    features['w'] = word
    features['t'] = pos[index]
    features['length'] = len(word)
    features['uppercase'] = any(x.isupper() for x in word)
    features['firstletter'] = word[0].isupper() and (len(word) > 1)
    features['hasdigits'] = any(x.isdigit() for x in word)
    features['c'] = chunk[index]
    features['loc_flag'] = ('field' in word_lc) or ('land' in word_lc) or ('burgh' in word_lc) or ('shire' in word_lc) 
    features['hasdot'] = ('.' in word and len(word) > 1)
    features['endsinns'] = (len(word) > 1 and word_lc[-2:] == 'ns')
    
    
    # features from previous 2 words
    if index == 0: # first word in sentence
        features['t-2 t-1'] = '<B> <B>'
        features['t-1'] = '<B>'
        features['w-2'] = '<B>'
        features['w-1'] = '<B>'
        features['c-2 c-1'] = '<B> <B>'
        features['c-1'] = '<B>'
    elif index == 1: # second word in sentence
        features['t-2 t-1'] = '<B> ' + pos[0]
        features['t-1'] = pos[0]
        features['w-2'] = '<B>'
        features['w-1'] = sentence[0]
        features['c-2 c-1'] = '<B> ' + chunk[0]
        features['c-1'] = chunk[0]
    else:
        features['t-2 t-1'] = pos[index-2] + ' ' + pos[index-1]
        features['t-1'] = pos[index-1]
        features['w-2'] = sentence[index-2]
        features['w-1'] = sentence[index-1]
        features['c-2 c-1'] = chunk[index-2] + ' ' + chunk[index-1]
        features['c-1'] = chunk[index-1]

      
    # features from posterior 2 words
    if index == last_index: # last word in sentence
        features['t+1 t+2'] = '<E> <E>'
        features['t+1'] = '<E>'
        features['w+2'] = '<E>'
        features['w+1'] = '<E>'
    elif index == last_index - 1: # second to last word in sentence
        features['t+1 t+2'] = pos[last_index] + ' <E>'
        features['t+1'] = pos[last_index]
        features['w+2'] = '<E>'
        features['w+1'] = sentence[last_index]
    else:
        features['t+1 t+2'] = pos[index+1] + ' ' + pos[index+2]
        features['t+1'] = pos[index+1]
        features['w+2'] = sentence[index+2]
        features['w+1'] = sentence[index+1]
    
    return features
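
The boundary handling in `get_conll_features` amounts to padding the sentence with `<B>` and `<E>` markers before reading the neighbours. A minimal sketch of that idea follows; `context_window` is a hypothetical helper that is not part of the notebook, and the sentence is only an illustration.

```python
def context_window(sequence, index, size=2, begin="<B>", end="<E>"):
    """Return the `size` items before and after `index`, padded at the edges."""
    padded = [begin] * size + list(sequence) + [end] * size
    center = index + size  # position of sequence[index] inside `padded`
    previous = padded[center - size:center]
    following = padded[center + 1:center + 1 + size]
    return previous, following


sentence = ["EU", "rejects", "German", "call"]
first_prev, first_next = context_window(sentence, 0)
last_prev, last_next = context_window(sentence, 3)
print(first_prev, first_next)
print(last_prev, last_next)
```

Padding first removes the need for separate branches for the first, second, last and second-to-last positions, at the cost of building a temporary list per call.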

### Extract the sentences and their tags from the CoNLL-2003 file

In [11]:
c_train_file = './dataset/CoNLL2003/eng.train'
c_train_sent, c_train_pos, c_train_chunk, c_train_entity = conll_sentences(c_train_file)

### Show the dataset

We now show the contents of the first row of the English data, in this order:
- (word, part-of-speech tag, chunk tag, entity)

In [30]:
# show the first token of the first sentence
print("---------------------------------------------")
print(c_train_sent[0][0],c_train_pos[0][0],c_train_chunk[0][0],c_train_entity[0][0])

---------------------------------------------
EU NNP I-NP ORG


### Build the training feature set from each sentence

In [12]:
c_train_data = []
for sent, pos, chunk, entity in zip(c_train_sent, c_train_pos, c_train_chunk, c_train_entity): 
    if len(sent) != len(pos) or len(pos) != len(chunk) or len(chunk) != len(entity):
        raise ValueError('error: CoNLL train lengths do not match')
    for i, ent in enumerate(entity):
        labelled_data = (get_conll_features(i, sent, pos, chunk), ent)
        c_train_data.append(labelled_data)

### Train the CMM using the NLTK classifier

In [14]:
algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
memm = MaxentClassifier.train(c_train_data, algorithm, max_iter=3)

  ==> Training (3 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.60944        0.055
             2          -0.25516        0.833
         Final          -0.19637        0.862
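
The first log-likelihood in the table is consistent with a model that starts out uniform over five labels. This assumes the English data uses the same five tags (`ORG`, `PER`, `LOC`, `MISC`, `O`) as the Spanish counts shown earlier. A quick arithmetic check:

```python
import math

n_labels = 5  # assumption: ORG, PER, LOC, MISC and O
uniform_log_likelihood = math.log(1 / n_labels)
print(round(uniform_log_likelihood, 5))
```

Later iterations move away from this starting point as the feature weights are fitted.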


In [16]:
# load the test splits and classify every token with the trained model

c_testa_file = './dataset/CoNLL2003/eng.testa'
c_testb_file = './dataset/CoNLL2003/eng.testb'
c_testc_file = './dataset/CoNLL2003/eng.testc'

c_testa_sent, c_testa_pos, c_testa_chunk, c_testa_entity = conll_sentences(c_testa_file)
c_testb_sent, c_testb_pos, c_testb_chunk, c_testb_entity = conll_sentences(c_testb_file)
c_testc_sent, c_testc_pos, c_testc_chunk, c_testc_entity = conll_sentences(c_testc_file)

c_teata_truth = []
c_testa_pred = []
for sent, pos, chunk, entity in zip(c_testa_sent, c_testa_pos, c_testa_chunk, c_testa_entity):
    if len(sent) != len(pos) or len(pos) != len(chunk) or len(chunk) != len(entity):
        raise ValueError('error: CoNLL testa lengths do not match')
    for i, ent in enumerate(entity):
        c_teata_truth.append(ent)
        pred = memm.classify(get_conll_features(i, sent, pos, chunk))
        c_testa_pred.append(pred)

c_teatb_truth = []
c_testb_pred = []
for sent, pos, chunk, entity in zip(c_testb_sent, c_testb_pos, c_testb_chunk, c_testb_entity):
    if len(sent) != len(pos) or len(pos) != len(chunk) or len(chunk) != len(entity):
        raise ValueError('error: CoNLL testb lengths do not match')
    for i, ent in enumerate(entity):
        c_teatb_truth.append(ent)
        pred = memm.classify(get_conll_features(i, sent, pos, chunk))
        c_testb_pred.append(pred)

c_teatc_truth = []
c_testc_pred = []
for sent, pos, chunk, entity in zip(c_testc_sent, c_testc_pos, c_testc_chunk, c_testc_entity):
    if len(sent) != len(pos) or len(pos) != len(chunk) or len(chunk) != len(entity):
        raise ValueError('error: CoNLL testc lengths do not match')
    for i, ent in enumerate(entity):
        c_teatc_truth.append(ent)
        pred = memm.classify(get_conll_features(i, sent, pos, chunk))
        c_testc_pred.append(pred)

## Evaluation (accuracy)


In [18]:
accuracy(c_teata_truth, c_testa_pred)

accuracy = 43917 / 51362 = 0.855048
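
In NER data such as the Spanish counts shown earlier, most tokens are `O`, so accuracy alone says little about how well entities are found. A minimal sketch of per-label precision, recall and F1 in plain Python follows. `per_label_scores` and the toy lists are hypothetical; scikit-learn's `precision_recall_fscore_support`, imported at the top of the notebook, provides the same kind of breakdown.

```python
from collections import Counter


def per_label_scores(truth, predicted):
    """Return {label: (precision, recall, f1)} for two equal-length tag lists."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    true_pos, pred_count, true_count = Counter(), Counter(), Counter()
    for gold, guess in zip(truth, predicted):
        true_count[gold] += 1
        pred_count[guess] += 1
        if gold == guess:
            true_pos[gold] += 1
    scores = {}
    for label in sorted(set(truth) | set(predicted)):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy example: a tagger that answers 'O' everywhere.
toy_truth = ["O", "O", "LOC", "O", "PER"]
toy_pred = ["O", "O", "O", "O", "O"]
toy_scores = per_label_scores(toy_truth, toy_pred)
for label, (p, r, f) in toy_scores.items():
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

In the toy example the tagger is right on three of five tokens, yet recall for both `LOC` and `PER` is zero, which is the failure that accuracy hides.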
