# Morphological analyzer and automatic gloss generator

## Replicating the experiment

This is a replication of the experiment by [@theSarahRu](https://github.com/theSarahRu/FLExMorphSegmenter), which performs morphological analysis and affix glossing for Lezgi [lez], a low-resource language. Lezgi is one of the [Northeast Caucasian languages](https://es.wikipedia.org/wiki/Lenguas_cauc%C3%A1sicas_nororientales) and is spoken in Russia and Azerbaijan. The central idea is that the work can be replicated for any language that has 2000 to 3000 correctly labeled **words**. The program is considered successful if it reaches 80% *accuracy*.

This replication uses Otomí, specifically the Mezquital Valley variant (hñahñu). It relies on the annotated corpus by Victor Germán Mijangos, who holds a master's degree in Computational Linguistics. The corpus consists of `1705` annotated **sentences**.

### Importing the required libraries

In [57]:
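import os
from collections import Counter
from itertools import chain  # used by bio_classification_report
import pycrfsuite
from sklearn.metrics import classification_report  # used by bio_classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer  # used by bio_classification_report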

### Defining the pre-trained model file

In [58]:
model_filename = 'tsunkua_uniq.crfsuite'

### Splitting the data into train and test sets

In [59]:
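def get_vic_data():
    """
    Reads the corpus file, one annotated sentence per line, and returns
    the sentences as nested Python lists.
    """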
    with open("uniq_corpus.txt", encoding='utf-8', mode='r') as f:
        plain_text = f.read()
    raw_data = plain_text.split('\n')
    # The corpus format allows each line to be parsed with eval().
    # eval() executes arbitrary code, so only use it on a trusted corpus file.
    return [eval(row) for row in raw_data]


def WordsToLetter(wordlists):
    '''
    Takes data from vic data: [[[[[[morpheme, gloss], pos],...],words],sents]].
    Returns [[[[[letter, POS, BIO-label],...],words],sents]]
    '''
    letterlists = []
    for phrase in wordlists:
        sent = []
        for lexeme in phrase:
            word = []
            # The last element of each lexeme is its POS tag; skip it here
            for morpheme in lexeme[:-1]:
                # Use the gloss as the BIO label
                label = morpheme[1]
                # Break the morpheme into letters
                for i, char in enumerate(morpheme[0]):
                    # B- marks the first letter of a morpheme, I- the rest
                    prefix = 'B-' if i == 0 else 'I-'
                    # Each letter carries the word's POS tag and its BIO label
                    word.append([char, lexeme[-1], prefix + label])
            sent.append(word)
        letterlists.append(sent)
    return letterlists
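
To make the conversion concrete, the following sketch applies the same B-/I- scheme to a single hand-written word. The helper `to_bio_letters` and the toy lexeme are illustrative assumptions, not part of the original notebook; the lexeme mirrors the `det.pl` word that appears in this notebook's example sentence.

```python
def to_bio_letters(lexeme):
    """Split one lexeme [[morpheme, gloss], ..., pos] into [letter, pos, BIO-label] triples."""
    *morphemes, pos = lexeme
    letters = []
    for form, gloss in morphemes:
        for i, char in enumerate(form):
            # The first letter of a morpheme gets B-, every following letter gets I-
            prefix = 'B-' if i == 0 else 'I-'
            letters.append([char, pos, prefix + gloss])
    return letters


# Hypothetical lexeme: one morpheme "yι" glossed "det.pl", with POS tag "det"
toy_lexeme = [['yι', 'det.pl'], 'det']
bio_letters = to_bio_letters(toy_lexeme)
print(bio_letters)
```

Each letter ends up with the POS tag of its word and a label that encodes both the morpheme boundary and the gloss, which is the shape the CRF is trained on.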


In [61]:
vic_data = get_vic_data()
train_data, test_data = train_test_split(WordsToLetter(vic_data), test_size=0.2)
print("Train data:", len(train_data))
print("Test data:", len(test_data))
print("Total data:", len(train_data) + len(test_data))
print("Oracion ejemplo: ", train_data[0])

Train data: 1350
Test data: 338
Total data: 1688
Oracion ejemplo:  [[['p', 'v', 'B-it'], ['i', 'v', 'I-it'], ['t', 'v', 'B-1.prf'], ['ó', 'v', 'I-1.prf'], ['t', 'v', 'B-stem'], ['ó', 'v', 'I-stem'], ['n', 'v', 'I-stem'], ['y', 'v', 'B-lim'], ['ι', 'v', 'I-lim'], ["'", 'v', 'I-lim']], [['y', 'det', 'B-det.pl'], ['ι', 'det', 'I-det.pl']], [['t', 'unkwn', 'B-dim'], ['s', 'unkwn', 'I-dim'], ['i', 'unkwn', 'I-dim'], ['b', 'unkwn', 'B-stem'], ['e', 'unkwn', 'I-stem'], ['s', 'unkwn', 'I-stem'], ['e', 'unkwn', 'I-stem'], ['r', 'unkwn', 'I-stem'], ['r', 'unkwn', 'I-stem'], ['a', 'unkwn', 'I-stem'], ['y', 'unkwn', 'B-lim'], ['ι', 'unkwn', 'I-lim'], ["'", 'unkwn', 'I-lim']], [['p', 'cnj', 'B-stem'], ['o', 'cnj', 'I-stem'], ['r', 'cnj', 'I-stem'], ['k', 'cnj', 'I-stem'], ['e', 'cnj', 'I-stem']], [['m', 'obl', 'B-stem'], ['e', 'obl', 'I-stem']], [['g', 'v', 'B-stem'], ['u', 'v', 'I-stem'], ['s', 'v', 'I-stem'], ['t', 'v', 'I-stem'], ['a', 'v', 'I-stem']]]


### Feature analysis and extraction

#### Defining the analysis functions

In [62]:
def extractFeatures(sent):
    '''Takes data as [[[[[letter, POS, BIO-label],...],words],sents]].
    Returns list of words with characters as features list: [[[[[letterfeatures],POS,BIO-label],letters],words]]'''
    
    featurelist = []
    senlen = len(sent)

    # TODO: Tune the hard-coded parameters for Otomí. New parameters will be tried
    #each word in a sentence
    for i in range(senlen):
        word = sent[i]
        wordlen = len(word)
        lettersequence = ''
        #each letter in a word
        for j in range(wordlen):
            letter = word[j][0]
            #gathering previous letters
            lettersequence += letter
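            # Digits contribute no letter features; they keep only 'bias'
            # so that features stay aligned with labels
            features = ['bias']
            if not letter.isdigit():
                features = [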
                    'bias',
                    'letterLowercase=' + letter.lower(),
                    'postag=' + word[j][1],
                ] 
                #position of word in sentence and pos tags sequence
                if i > 0:
                    features.append('prevpostag=' + sent[i-1][0][1])
                    if i != senlen-1:
                        features.append('nxtpostag=' + sent[i+1][0][1])
                    else:
                        features.append('EOS')
                else:
                    features.append('BOS')
                    #Don't get pos tag if sentence is 1 word long
                    if i != senlen-1:
                        features.append('nxtpostag=' + sent[i+1][0][1])
                #position of letter in word
                if j == 0:
                    features.append('BOW')
                elif j == wordlen-1:
                    features.append('EOW')
                else:
                    features.append('letterposition=-%s' % str(wordlen-1-j))
                #letter sequences before letter
                if j >= 4:
                    features.append('prev4letters=' + lettersequence[j-4:j].lower() + '>')
                if j >= 3:
                    features.append('prev3letters=' + lettersequence[j-3:j].lower() + '>')
                if j >= 2:
                    features.append('prev2letters=' + lettersequence[j-2:j].lower() + '>')
                if j >= 1:
                    features.append('prevletter=' + lettersequence[j-1:j].lower() + '>')
                #letter sequences after letter
                if j <= wordlen-2:
                    nxtlets = word[j+1][0]
                    features.append('nxtletter=<' + nxtlets.lower())
                if j <= wordlen-3:
                    nxtlets += word[j+2][0]
                    features.append('nxt2letters=<' + nxtlets.lower())
                if j <= wordlen-4:
                    nxtlets += word[j+3][0]
                    features.append('nxt3letters=<' + nxtlets.lower())
                if j <= wordlen-5:
                    nxtlets += word[j+4][0]
                    features.append('nxt4letters=<' + nxtlets.lower())
                
            featurelist.append([f.encode('utf-8') for f in features])  # Encode features as UTF-8 bytes for pycrfsuite
    
    return featurelist


def extractLabels(sent):
    labels = []
    for word in sent:
        for letter in word:
            labels.append(letter[2].encode('utf-8'))
    return labels


def extractTokens(sent):
    tokens = []
    for word in sent:
        for letter in word:
            tokens.append(letter[0])
    return tokens


def sent2features(data):
    return [extractFeatures(sent) for sent in data]


def sent2labels(data):
    return [extractLabels(sent) for sent in data]


def sent2tokens(data):
    return [extractTokens(sent) for sent in data]
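
The letter-context features built in `extractFeatures` can be hard to picture from the code alone. The sketch below isolates just that part for one position in a plain string. The function `window_features` is a hypothetical helper written for illustration; it follows the same naming and ordering of the `prev…letters` and `nxt…letters` features but leaves out the POS and position features.

```python
def window_features(word, j, size=4):
    """Return the prev/next letter-context features for position j of a word string."""
    features = []
    # Preceding context, longest window first (prev4letters ... prevletter)
    for n in range(size, 0, -1):
        if j >= n:
            name = 'prevletter' if n == 1 else 'prev%dletters' % n
            features.append(name + '=' + word[j - n:j].lower() + '>')
    # Following context, shortest window first (nxtletter ... nxt4letters)
    for n in range(1, size + 1):
        if j + n <= len(word) - 1:
            name = 'nxtletter' if n == 1 else 'nxt%dletters' % n
            features.append(name + '=<' + word[j + 1:j + 1 + n].lower())
    return features


# Context of the letter "s" (index 2) in the word "gusta"
features_s = window_features('gusta', 2)
print(features_s)
```

Windows that would run past either end of the word are simply omitted, so letters near a word boundary get fewer context features than letters in the middle.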


In [63]:
X_train = sent2features(train_data)
y_train = sent2labels(train_data)

X_test = sent2features(test_data)
y_test = sent2labels(test_data)

# The examples are UTF-8 encoded to avoid problems with pycrfsuite
print("X Train ejemplo: ", X_train[0][:5])
print("\ny Train ejemplo: ", y_train[0])
print("\nX Test ejemplo: ", X_test[0][:5])
print("\ny Test ejemplo: ", y_test[0])

X Train ejemplo:  [[b'bias', b'letterLowercase=p', b'postag=v', b'BOS', b'nxtpostag=det', b'BOW', b'nxtletter=<i', b'nxt2letters=<it', b'nxt3letters=<it\xc3\xb3', b'nxt4letters=<it\xc3\xb3t'], [b'bias', b'letterLowercase=i', b'postag=v', b'BOS', b'nxtpostag=det', b'letterposition=-8', b'prevletter=p>', b'nxtletter=<t', b'nxt2letters=<t\xc3\xb3', b'nxt3letters=<t\xc3\xb3t', b'nxt4letters=<t\xc3\xb3t\xc3\xb3'], [b'bias', b'letterLowercase=t', b'postag=v', b'BOS', b'nxtpostag=det', b'letterposition=-7', b'prev2letters=pi>', b'prevletter=i>', b'nxtletter=<\xc3\xb3', b'nxt2letters=<\xc3\xb3t', b'nxt3letters=<\xc3\xb3t\xc3\xb3', b'nxt4letters=<\xc3\xb3t\xc3\xb3n'], [b'bias', b'letterLowercase=\xc3\xb3', b'postag=v', b'BOS', b'nxtpostag=det', b'letterposition=-6', b'prev3letters=pit>', b'prev2letters=it>', b'prevletter=t>', b'nxtletter=<t', b'nxt2letters=<t\xc3\xb3', b'nxt3letters=<t\xc3\xb3n', b'nxt4letters=<t\xc3\xb3ny'], [b'bias', b'letterLowercase=t', b'postag=v', b'BOS', b'nxtpostag=det'

### Training the model

In [64]:
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)


# Set training parameters. L-BFGS is default. Using Elastic Net (L1 + L2) regularization.
trainer.set_params({
        'c1': 1.0,  # coefficient for L1 penalty
        'c2': 1e-3,  # coefficient for L2 penalty
        'max_iterations': 50  # stop after at most 50 iterations
    })
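
As a reading aid, the objective that `c1` and `c2` control can be written out. This is a sketch under the assumption that the L-BFGS trainer adds both penalties to the negative log-likelihood in the usual elastic-net form; the symbols $w$ (feature weights), $x^{(n)}$ (feature sequence) and $y^{(n)}$ (label sequence) are notation introduced here.

$$\min_{w}\; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2$$

With `c1 = 1.0` and `c2 = 1e-3`, the L1 term carries most of the regularization, which tends to push many feature weights to exactly zero and keeps the model sparse.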

#### If a previously trained model exists, it is used instead

In [65]:
%%time
# The program saves the trained model to a file:

if not os.path.isfile(model_filename):
    print("ENTRENANDO...")
    trainer.train(model_filename)
else:
    print("Usando modelo pre-entrenado >>", model_filename)
print("Fin del entrenamiento")

ENTRENANDO...
Fin del entrenamiento
CPU times: user 2min 58s, sys: 200 ms, total: 2min 58s
Wall time: 2min 59s


#### Current models

In [66]:
!ls | grep tsunkua

tsunkua_50.crfsuite
tsunkua.crfsuite
tsunkua_mod.crfsuite
tsunkua_old.crfsuite
tsunkua_uniq.crfsuite


### Making predictions

In [67]:
tagger = pycrfsuite.Tagger()
tagger.open(model_filename)

<contextlib.closing at 0x7fe427f2e358>

The trained model is used to predict the labels of a single sentence from the test data. The predicted labels are printed next to the true labels for comparison.

In [68]:
example_sent = test_data[0]
print('Letras de la sentencia:', '  '.join(extractTokens(example_sent)), end='\n')
print("\nPredecidas  |  Correctas")
for p, c in zip(tagger.tag(extractFeatures(example_sent)), extractLabels(example_sent)):
    print(f"{p}        {c.decode('utf-8')}")

Letras de la sentencia: b  i  '  μ  n  g  í  p  o  r  m  e  d  i  o  d  í  m  á  n  t  s  o  p  h  ó

Predecidas  |  Correctas
B-3.cpl        B-3.cpl
I-3.cpl        I-3.cpl
B-stem        B-stem
I-stem        I-stem
I-stem        I-stem
B-1.obj        B-1.obj
I-1.obj        I-1.obj
B-stem        B-stem
I-stem        I-stem
I-stem        I-stem
B-stem        B-stem
I-stem        I-stem
I-stem        I-stem
I-stem        I-stem
I-stem        I-stem
B-1.icp        B-1.icp
I-1.icp        I-1.icp
B-ctrf        B-ctrf
I-ctrf        I-ctrf
B-stem        B-stem
I-stem        I-stem
I-stem        I-stem
I-stem        I-stem
I-stem        I-stem
I-stem        I-stem
I-stem        I-stem
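
The tagger returns one BIO label per letter, so a further step is needed to read off the segmentation and glosses. The following is a minimal sketch of that step; `bio_to_morphemes` is a hypothetical helper that is not part of the original notebook, and the sample input is the first seven letters and predicted labels of the sentence above.

```python
def bio_to_morphemes(letters, tags):
    """Group parallel lists of letters and BIO tags into (morpheme, gloss) pairs."""
    morphemes = []
    for letter, tag in zip(letters, tags):
        prefix, _, gloss = tag.partition('-')
        if prefix == 'B' or not morphemes:
            # A B- tag (or a stray I- tag at the very start) opens a new morpheme
            morphemes.append([letter, gloss])
        else:
            # An I- tag extends the morpheme that is currently open
            morphemes[-1][0] += letter
    return [tuple(m) for m in morphemes]


letters = ['b', 'i', "'", 'μ', 'n', 'g', 'í']
tags = ['B-3.cpl', 'I-3.cpl', 'B-stem', 'I-stem', 'I-stem', 'B-1.obj', 'I-1.obj']
segmented = bio_to_morphemes(letters, tags)
print(segmented)
```

The flat letter sequence produced by `extractTokens` carries no word boundaries, so the result is a list of glossed morphemes rather than of words.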


#### Predicting BIO labels on the test data

In [69]:
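# labels_decoder is not defined in this notebook; decoding inline, assuming it
# turned the UTF-8 encoded gold labels back into str
y_test = [[label.decode('utf-8') for label in seq] for seq in y_test]
y_pred = [tagger.tag(xseq) for xseq in X_test]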

### Accuracy
To evaluate the performance of the trained model, we compute its *accuracy*, defined as:

$$\text{accuracy} = \frac{\text{correctly predicted labels}}{\text{total number of labels}}$$

In [70]:
def accuracy_score(y_test, y_pred):
    right, total = 0, 0
    for tests, predictions in zip(y_test, y_pred):
        total += len(tests)
        for t, p in zip(tests, predictions):
            if t == p:
                right += 1
    return right / total
print("Accuracy Score> ", accuracy_score(y_test, y_pred))

Accuracy Score>  0.9621939510321651


### Full reports


In [71]:
def countMorphemes(morphlist):
    counts = {}
    for morpheme in morphlist:
        counts[morpheme[0][2:]] = counts.get(morpheme[0][2:], 0) + 1
    return counts


def eval_labeled_positions(y_correct, y_pred):
    # group the labels by morpheme and get list of morphemes
    correctmorphs, _ = concatenateLabels(y_correct)
    predmorphs, predLabels = concatenateLabels(y_pred)
    # Count instances of each morpheme
    test_morphcts = countMorphemes(correctmorphs)
    pred_morphcts = countMorphemes(predmorphs)

    correctMorphemects = {}
    idx = 0
    num_correct = 0
    for morpheme in correctmorphs:
        correct = True
        for label in morpheme:
            if label != predLabels[idx]:
                correct = False
            idx += 1
        if correct == True:
            num_correct += 1
            correctMorphemects[morpheme[0][2:]] = correctMorphemects.get(morpheme[0][2:], 0) + 1
    # calculate P, R F1 for each morpheme
    results = ''
    for firstlabel in correctMorphemects.keys():
        lprec = correctMorphemects[firstlabel] / pred_morphcts[firstlabel]
        lrecall = correctMorphemects[firstlabel] / test_morphcts[firstlabel]
        results += firstlabel + '\t\t{0:.2f}'.format(lprec) + '\t\t' + '{0:.2f}'.format(
            lrecall) + '\t' + '{0:.2f}'.format((2 * lprec * lrecall) / (lprec + lrecall)) + '\t\t' + str(
            test_morphcts[firstlabel]) + '\n'
    # overall results
    precision = num_correct / len(predmorphs)
    recall = num_correct / len(correctmorphs)

    print('\t\tPrecision\tRecall\tf1-score\tInstances\n\n' + results + '\ntotal/avg\t{0:.2f}'.format(
        precision) + '\t\t' + '{0:.2f}'.format(recall) + '\t' + '{0:.2f}'.format(
        (2 * precision * recall) / (precision + recall)))
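
`eval_labeled_positions` calls `concatenateLabels`, whose definition is not shown in this notebook. The sketch below is one possible version, written under the assumption that it returns the labels grouped by morpheme together with the flat list of all labels, which is how the function above uses its two return values.

```python
def concatenateLabels(y_list):
    """Group BIO labels into morphemes; also return the flat list of labels.

    Assumed behavior: the original helper is not shown in this notebook.
    """
    morphs_list = []   # one list of labels per morpheme
    labels_list = []   # every label, in order, across all sentences
    for sent in y_list:
        morph = []
        for label in sent:
            labels_list.append(label)
            if label.startswith('I-') and morph:
                # Continuation of the morpheme currently being built
                morph.append(label)
            else:
                # A new morpheme starts; store the finished one first
                if morph:
                    morphs_list.append(morph)
                morph = [label]
        # Flush the last morpheme of the sentence
        if morph:
            morphs_list.append(morph)
    return morphs_list, labels_list


morphs, flat = concatenateLabels([['B-stem', 'I-stem', 'B-lim'], ['B-det', 'I-det']])
print(morphs)
```

Because a morpheme counts as correct only when every one of its labels matches, a single wrong letter makes the whole morpheme count as an error at this level.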

Results are first obtained for labeled positions. This evaluates how well the classifier performed on each morpheme as a whole, together with its label, rather than character by character. After that, the results are checked again with `bio_classification_report`, which prints a report at the character level.

In [72]:
eval_labeled_positions(y_test, y_pred)

		Precision	Recall	f1-score	Instances

3.cpl		0.97		1.00	0.98		85
stem		0.96		0.96	0.96		1484
1.obj		0.97		0.97	0.97		33
1.icp		1.00		1.00	1.00		58
ctrf		0.91		0.98	0.95		63
psd		0.97		0.99	0.98		73
3.icp		0.97		0.98	0.97		58
3.icp.irr		1.00		1.00	1.00		14
lim		1.00		1.00	1.00		66
prag		1.00		0.99	0.99		76
1.pot		0.98		1.00	0.99		53
1.pss		1.00		1.00	1.00		34
dim		0.87		0.95	0.91		21
3.pot		1.00		1.00	1.00		34
3.prf		0.92		1.00	0.96		11
det		0.99		0.99	0.99		158
dem		0.97		0.97	0.97		36
1.prf		1.00		1.00	1.00		20
3.pls		1.00		1.00	1.00		11
pl.exc		0.90		0.96	0.93		47
loc		1.00		0.75	0.86		16
mod		0.86		0.80	0.83		15
lig		0.88		0.70	0.78		50
1.cpl		1.00		1.00	1.00		36
1.enf		1.00		1.00	1.00		7
3.obj		0.83		0.59	0.69		17
gen		1.00		0.33	0.50		3
prt		0.67		0.14	0.24		14
det.pl		0.98		1.00	0.99		55
1.cnt		1.00		1.00	1.00		6
dual.exc		0.89		1.00	0.94		8
ila		0.94		1.00	0.97		33
muy		0.85		0.85	0.85		13
2.icp		1.00		1.00	1.00		14
pl		0.95		0.71	0.82		28
3.cnt		0.90		1.00	0.95		9
it		1.00		0.

In [73]:
def bio_classification_report(y_correct, y_pred):
    '''Takes list of correct and predicted labels from tagger.tag. 
    Prints a classification report for a list of BIO-encoded sequences.
    It computes letter-level metrics.'''

    labeler = LabelBinarizer()
    y_correct_combined = labeler.fit_transform(list(chain.from_iterable(y_correct)))
    y_pred_combined = labeler.transform(list(chain.from_iterable(y_pred)))
    
    tagset = set(labeler.classes_)
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(labeler.classes_)}
    
    return classification_report(
        y_correct_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset)

In [74]:
print(bio_classification_report(y_test, y_pred))

               precision    recall  f1-score   support

      B-1.cnt       1.00      1.00      1.00         6
      I-1.cnt       1.00      1.00      1.00        12
      B-1.cpl       1.00      1.00      1.00        36
      I-1.cpl       1.00      1.00      1.00        36
  B-1.cpl.irr       0.50      1.00      0.67         1
  I-1.cpl.irr       0.50      1.00      0.67         1
      B-1.enf       1.00      1.00      1.00         7
      I-1.enf       1.00      1.00      1.00         7
      B-1.icp       1.00      1.00      1.00        58
      I-1.icp       1.00      1.00      1.00        58
  B-1.icp.irr       1.00      1.00      1.00         1
  I-1.icp.irr       1.00      1.00      1.00         2
      B-1.irr       0.00      0.00      0.00         1
      I-1.irr       0.00      0.00      0.00         2
      B-1.obj       0.97      0.97      0.97        33
      I-1.obj       0.97      0.97      0.97        33
      B-1.pls       1.00      0.50      0.67         2
      I-1



In [75]:
def print_transitions(trans_features):
    '''Print info from the crfsuite.'''
    
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))


def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))

In [77]:
info = tagger.info()

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(15))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-15:])


print("Top positive:")
print_state_features(Counter(info.state_features).most_common(15))

print("\nTop negative:")
print_state_features(Counter(info.state_features).most_common()[-15:])


Top likely transitions:
B-stem -> I-stem  10.934023
I-stem -> I-stem  8.395278
B-com  -> I-com   6.823475
B-neg  -> I-neg   6.721609
B-1.cnt -> I-1.cnt 6.630173
B-det  -> I-det   6.510363
B-3.pot -> I-3.pot 6.368820
B-lim  -> I-lim   6.278319
B-prt  -> I-prt   6.276504
B-dem  -> I-dem   5.959775
I-neg  -> I-neg   5.840462
B-mod  -> I-mod   5.839273
B-3.imp -> I-3.imp 5.748112
B-loc  -> I-loc   5.740772
B-2.icp -> I-2.icp 5.711878

Top unlikely transitions:
I-stem -> B-3.icp -0.006762
I-stem -> B-muy   -0.006980
I-dem  -> B-psd   -0.008274
I-aum  -> B-stem  -0.009441
I-dem  -> B-stem  -0.030308
I-pascuala -> B-stem  -0.030714
I-stem -> B-3.cpl -0.037465
I-stem -> B-3.prf -0.054174
I-loc  -> B-stem  -0.093922
I-stem -> B-ctrf  -0.151861
I-ila  -> B-stem  -0.169774
I-stem -> B-stem  -0.393821
I-1.obj -> B-stem  -0.402426
I-3.obj -> B-stem  -0.623276
I-prag -> B-stem  -0.972304
Top positive:
5.918738 I-1.cnt prev2letters=dr>
5.606847 I-stem EOW
5.132919 B-stem BOW
4.933950 B-1.pss nxtlette