# Training a Hidden Markov Model (HMM)

## Spanish corpus

* AnCora | GitHub: https://github.com/UniversalDependencies/UD_Spanish-AnCora

* We use the conllu parser to read the corpus: https://pypi.org/project/conllu/

* Universal POS tags (documentation): https://universaldependencies.org/u/pos/

In [1]:
#@title Prerequisites
!pip install conllu
!git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora.git

Collecting conllu
  Downloading conllu-4.4.1-py2.py3-none-any.whl (15 kB)
Installing collected packages: conllu
Successfully installed conllu-4.4.1
Cloning into 'UD_Spanish-AnCora'...
remote: Enumerating objects: 875, done.
remote: Counting objects: 100% (128/128), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 875 (delta 90), reused 77 (delta 40), pack-reused 747
Receiving objects: 100% (875/875), 289.14 MiB | 11.34 MiB/s, done.
Resolving deltas: 100% (615/615), done.


In [2]:
#@title Reading the AnCora corpus
from conllu import parse_incr 
wordList = []
data_file = open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8")
for tokenlist in parse_incr(data_file):
    print(tokenlist.serialize())

Streaming output truncated to the last 5000 lines.
2	informó	informar	VERB	vmis3s0	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	9	advcl	9:advcl	_
3	hoy	hoy	ADV	rg	_	2	advmod	2:advmod	_
4	la	el	DET	da0fs0	Definite=Def|Gender=Fem|Number=Sing|PronType=Art	5	det	5:det	_
5	Cadena	Cadena	PROPN	np00000	_	2	nsubj	2:nsubj	MWE=Cadena_Ser|MWEPOS=PROPN|ClusterId=CESS-CAST-AA-20000511-8193-s3.sn.10|ClusterType=Spec.organization|MentionSpan=4-6
6	Ser	Ser	PROPN	_	_	5	flat	5:flat	SpaceAfter=No
7	,	,	PUNCT	fc	PunctType=Comm	2	punct	2:punct	_
8	Benegas	Benegas	PROPN	np00000	_	9	nsubj	9:nsubj	ClusterId=CESS-CAST-AA-20000511-8193-c1|ClusterType=Spec.person|MentionSpan=8|MentionMisc=CorefType:ident
9	sufrió	sufrir	VERB	vmis3s0	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	0	root	0:root	_
10	un	uno	DET	di0ms0	Definite=Ind|Gender=Masc|Number=Sing|PronType=Art	11	det	11:det	_
11	intento	intento	NOUN	ncms000	Gender=Masc|Number=Sing	9	obj	9:obj	ClusterId=CESS-CAST-

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [3]:
#@title Structure of the tagged tokens in the corpus
tokenlist[1]

{'deprel': 'det',
 'deps': [('det', 5)],
 'feats': {'Definite': 'Def',
  'Gender': 'Masc',
  'Number': 'Sing',
  'PronType': 'Art'},
 'form': 'El',
 'head': 5,
 'id': 2,
 'lemma': 'el',
 'misc': None,
 'upos': 'DET',
 'xpos': 'da0ms0'}

In [4]:
tokenlist[1]['form']+'|'+tokenlist[1]['upos']

'El|DET'

## Model training - computing counts

* tags `tagCountDict`: $C(tag)$, how many times each tag occurs
* emissions `emissionDict`: $C(word|tag)$, how many times a word occurs with a given tag
* transitions `transitionDict`: $C(tag|prevtag)$, how many times a tag follows a given previous tag
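
To make the three counts listed above concrete, here is a minimal sketch on a hand-made corpus of two sentences. The corpus and the snake_case variable names are hypothetical and exist only for illustration; the key formats (`word|tag` and `tag|prevtag`) follow the notebook.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
toy_corpus = [
    [("El", "DET"), ("gato", "NOUN"), ("duerme", "VERB")],
    [("El", "DET"), ("perro", "NOUN"), ("come", "VERB")],
]

tag_counts = Counter()         # C(tag)
emission_counts = Counter()    # C(word|tag), keyed as 'word|tag'
transition_counts = Counter()  # C(tag|prevtag), keyed as 'tag|prevtag'

for sentence in toy_corpus:
    prevtag = None  # no previous tag at the start of a sentence
    for word, tag in sentence:
        tag_counts[tag] += 1
        emission_counts[word.lower() + "|" + tag] += 1
        if prevtag is not None:
            transition_counts[tag + "|" + prevtag] += 1
        prevtag = tag

print(tag_counts["DET"])              # 2
print(emission_counts["el|DET"])      # 2
print(transition_counts["NOUN|DET"])  # 2
```

Note that no transition is counted across the sentence boundary, so `DET|VERB` never appears in `transition_counts`.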

In [None]:
tagCountDict = {}    # C(tag)
emissionDict = {}    # C(word|tag), keyed as 'word|tag'
transitionDict = {}  # C(tag|prevtag), keyed as 'tag|prevtag'

tagtype = 'upos'

# Compute the raw counts (not yet probabilities)
with open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8") as data_file:
  for tokenlist in parse_incr(data_file):
    prevtag = None  # reset at the start of every sentence
    for token in tokenlist:

      # C(tag)
      tag = token[tagtype]
      tagCountDict[tag] = tagCountDict.get(tag, 0) + 1

      # C(word|tag) -> used for the emission probabilities
      wordtag = token['form'].lower() + '|' + tag
      emissionDict[wordtag] = emissionDict.get(wordtag, 0) + 1

      # C(tag|prevtag) -> used for the transition probabilities
      # (the first token of a sentence has no previous tag)
      if prevtag is not None:
        transitiontags = tag + '|' + prevtag
        transitionDict[transitiontags] = transitionDict.get(transitiontags, 0) + 1
      prevtag = tag

#transitionDict
#emissionDict
#tagCountDict

## Model training - computing probabilities
* transition probabilities:
$$P(tag|prevtag) = \frac{C(tag|prevtag)}{C(prevtag)}$$

* emission probabilities:
$$P(word|tag) = \frac{C(word|tag)}{C(tag)}$$
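
As a worked example of the two formulas above, the sketch below normalizes a small set of hypothetical counts. The numbers and the snake_case names are assumptions made for illustration; only the key format is taken from the notebook.

```python
# Hypothetical counts, written in the same key format as the notebook.
tag_counts = {"DET": 2, "NOUN": 2, "VERB": 2}
emission_counts = {"el|DET": 2, "gato|NOUN": 1, "perro|NOUN": 1,
                   "duerme|VERB": 1, "come|VERB": 1}
transition_counts = {"NOUN|DET": 2, "VERB|NOUN": 2}

# P(tag|prevtag) = C(tag|prevtag) / C(prevtag)
transition_probs = {}
for key, count in transition_counts.items():
    tag, prevtag = key.split("|")
    transition_probs[key] = count / tag_counts[prevtag]

# P(word|tag) = C(word|tag) / C(tag)
emission_probs = {}
for key, count in emission_counts.items():
    word, tag = key.rsplit("|", 1)
    emission_probs[key] = count / tag_counts[tag]

# Emission probabilities for one tag add up to 1.
noun_total = sum(p for k, p in emission_probs.items() if k.endswith("|NOUN"))
print(emission_probs["gato|NOUN"])   # 0.5
print(transition_probs["NOUN|DET"])  # 1.0
print(noun_total)                    # 1.0
```

Because $C(prevtag)$ also counts occurrences at the end of a sentence, the outgoing transition probabilities of a tag can sum to less than 1. In this toy data `VERB` only appears sentence-finally, so it has no outgoing transitions at all.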

In [None]:
transitionProbDict = {}  # matrix A: P(tag|prevtag)
emissionProbDict = {}    # matrix B: P(word|tag)

# Transition probabilities
for key, count in transitionDict.items():
  tag, prevtag = key.split('|')
  transitionProbDict[key] = count / tagCountDict[prevtag]

# Emission probabilities
for key, count in emissionDict.items():
  # rsplit keeps the key parseable even if the word itself contains '|'
  word, tag = key.rsplit('|', 1)
  emissionProbDict[key] = count / tagCountDict[tag]

transitionProbDict['ADJ|ADJ']
#emissionProbDict

0.030225988700564973

## Saving the model parameters

In [None]:
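import numpy as np
# np.save wraps each dict in a 0-d object array, which is pickled on disk
np.save('transitionHMM.npy', transitionProbDict)
np.save('emissionHMM.npy', emissionProbDict)
# Reload and unwrap with .item() to check the round trip
transitionProbDict = np.load('transitionHMM.npy', allow_pickle=True).item()
transitionProbDict['ADJ|ADJ']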

0.030225988700564973
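
The notebook stores the parameters with `np.save`, which pickles the dictionaries. Since both tables are plain dictionaries with string keys and float values, one possible alternative is JSON, which needs no `allow_pickle` when loading. The sketch below is not the notebook's method; the file name and the sample values are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical parameter table in the notebook's 'tag|prevtag' key format.
transition_probs = {"NOUN|DET": 0.75, "ADJ|NOUN": 0.25}

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "transitionHMM.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(transition_probs, f, ensure_ascii=False)

with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == transition_probs)  # True
```

`ensure_ascii=False` keeps accented Spanish word forms readable in the file, which matters mostly for the emission table.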