
# Topic modeling
This notebook shows how to use several topic modeling techniques on a small, simple example corpus.

In [None]:
import spacy
import matplotlib.pyplot as plt
import numpy as np

nlp = spacy.load('en_core_web_sm')

### Building the corpus
We create a small example corpus of 8 short sentences, define a simple normalization function, and apply it to the whole corpus.

In [None]:
def normalize_document(doc):
    # tokenize the text
    tokens = nlp(doc)
    # drop punctuation, whitespace and stop words, and keep the lemma
    lemmas = [t.lemma_ for t in tokens if not t.is_punct and not t.is_space and not t.is_stop]
    doc = ' '.join(lemmas)
    return doc

def normalize_corpus(corpus):
    """Normalize a corpus by applying normalize_document()
    to every document in the list passed as argument."""
    return [normalize_document(text) for text in corpus]

toy_corpus = [
"The fox jumps over the dog",
"the fox is very clever and quick",
"The dog is slow and lazy",
"The cat is smarter than the fox and the dog",
"Python is an excellent programming language",
"Java and Ruby are other programming languages",
"Python and Java are very popular programming languages",
"Python programs are smaller than Java programs"]

norm_corpus = normalize_corpus(toy_corpus)
norm_corpus

['fox jump dog',
 'fox clever quick',
 'dog slow lazy',
 'cat smart fox dog',
 'Python excellent programming language',
 'Java Ruby programming language',
 'Python Java popular programming language',
 'Python program small Java program']

## Topic modeling with scikit-learn
The `scikit-learn` library implements *Latent Semantic Analysis* (LSA), through `TruncatedSVD`, and *Latent Dirichlet Allocation* (LDA).  
We start from TF-IDF features for LSA and from bag-of-words (BoW) counts for LDA.

### LSA model

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# use TF-IDF features for LSA; min_df=2 keeps only terms found in at least 2 documents
tfidf_vectorizer = TfidfVectorizer(min_df=2)
tfidf = tfidf_vectorizer.fit_transform(norm_corpus)

In [None]:
tfidf

<8x6 sparse matrix of type '<class 'numpy.float64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [None]:
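# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
list(tfidf_vectorizer.get_feature_names_out())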

['dog', 'fox', 'java', 'language', 'programming', 'python']
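
The effect of `min_df=2` can be checked by hand. The sketch below is plain Python rather than scikit-learn's implementation, and it assumes lowercasing plus whitespace tokenization, which is enough for this already-normalized corpus. It counts in how many documents each term occurs and keeps the terms that reach the threshold.

```python
from collections import Counter

# Normalized corpus, copied from the output of normalize_corpus() above.
norm_corpus = [
    'fox jump dog',
    'fox clever quick',
    'dog slow lazy',
    'cat smart fox dog',
    'Python excellent programming language',
    'Java Ruby programming language',
    'Python Java popular programming language',
    'Python program small Java program',
]

def vocabulary_with_min_df(docs, min_df):
    """Return the sorted terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # a set counts each term once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

vocab = vocabulary_with_min_df(norm_corpus, min_df=2)
print(vocab)
```

Note that `program` occurs twice, but in a single document, so it does not pass the threshold.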

We define a helper function to display the results (the terms associated with each topic).

In [None]:
def print_top_words(model, feature_names, n_top_words):
    """Helper that prints the highest-weighted terms
    of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
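
The slice `argsort()[:-n_top_words - 1:-1]` used in `print_top_words` is compact but easy to misread. A small sketch with made-up weights (the vocabulary and values below are illustrative, not taken from a fitted model) shows what it selects:

```python
import numpy as np

feature_names = ['dog', 'fox', 'java', 'python']   # hypothetical vocabulary
topic = np.array([0.1, 0.7, 0.3, 0.5])             # hypothetical term weights
n_top_words = 2

order = topic.argsort()              # indices sorted by ascending weight
top = order[:-n_top_words - 1:-1]    # walk backwards from the end, n_top_words steps
top_terms = [feature_names[i] for i in top]
print(top_terms)
```

The result lists the terms from highest to lowest weight, which is the order in which `print_top_words` prints them.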

We fit the models to our corpus (`fit` method) and look at the highest-weighted terms of each *topic* (the cells below print 2 per topic). Each model assigns every term in the vocabulary of its input matrix (TF-IDF or BoW) a weight in each topic.

In [None]:
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# Fit the LSA model
lsa = TruncatedSVD(n_components=2).fit(tfidf)

print("\nTopics in LSA model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print_top_words(lsa, tfidf_feature_names, 2)


Topics in LSA model:
Topic #0: python programming
Topic #1: dog fox



The `fit` method learns the `topics` x `terms` matrix for the given corpus:

In [None]:
lsa.components_.shape

(2, 6)

In [None]:
import pandas as pd
pd.DataFrame(lsa.components_, columns=tfidf_feature_names)

| | dog | fox | java | language | programming | python |
|---|---|---|---|---|---|---|
| 0 | -0.0 | 1.962616e-16 | 0.5 | 0.5 | 0.5 | 0.5 |
| 1 | 0.707107 | 0.7071068 | -1.110223e-16 | 0.0 | 0.0 | -1.665335e-16 |


The `transform` method projects each document onto the topics. For LSA these values are coordinates in the latent space (they can be negative and need not sum to 1), not percentages:

In [None]:
lsa.transform(tfidf)

array([[ 1.38777878e-16,  1.00000000e+00],
       [ 1.96261557e-16,  7.07106781e-01],
       [ 0.00000000e+00,  7.07106781e-01],
       [ 1.38777878e-16,  1.00000000e+00],
       [ 8.66025404e-01, -9.61481343e-17],
       [ 8.66025404e-01, -6.40987562e-17],
       [ 1.00000000e+00, -1.38777878e-16],
       [ 7.07106781e-01, -1.96261557e-16]])

Each row corresponds to a document of the corpus, and each column holds that document's weight for the corresponding topic.  
The model has correctly split the corpus into its two main themes:

In [None]:
# index of the highest-weighted topic for each document (np is imported at the top)
np.argmax(lsa.transform(tfidf), axis=1)

array([1, 1, 1, 1, 0, 0, 0, 0])
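
The LSA steps above can be reproduced with a plain SVD. The following is a minimal sketch, assuming a hand-built matrix of raw counts for the six retained terms instead of the TF-IDF weights, so the numbers differ from the cells above. The top `k` right singular vectors play the role of `components_`, and multiplying the document-term matrix by their transpose plays the role of `transform`.

```python
import numpy as np

# Hand-built document-term count matrix for the toy corpus
# (columns: dog, fox, java, language, programming, python).
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
], dtype=float)

k = 2
# SVD: X = U @ diag(s) @ Vt, with singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:k]              # topics x terms
doc_topics = X @ components.T    # documents x topics

# The sign of each singular vector is arbitrary, so compare absolute values.
dominant = np.abs(doc_topics).argmax(axis=1)
print(dominant)
```

Because the two groups of sentences share no terms, each singular vector is supported on one group only, which is why the split is so clean on this corpus.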

### LDA model

In [None]:
# use BoW (raw count) features for LDA
tf_vectorizer = CountVectorizer(min_df=2)
tf = tf_vectorizer.fit_transform(norm_corpus)

In [None]:
tf

<8x6 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [None]:
# Fit the LDA model
lda = LatentDirichletAllocation(n_components=2, max_iter=5,
                                learning_method='batch',
                                learning_offset=50.,
                                random_state=0).fit(tf)

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, 2)


Topics in LDA model:
Topic #0: dog fox
Topic #1: programming language



The `components_` attribute holds the parameters of the per-topic term distribution (unnormalized pseudo-counts).

In [None]:
pd.DataFrame(lda.components_, columns=tf_feature_names)

| | dog | fox | java | language | programming | python |
|---|---|---|---|---|---|---|
| 0 | 3.491959 | 3.491955 | 0.513787 | 0.511393 | 0.511393 | 0.513756 |
| 1 | 0.508041 | 0.508045 | 3.486213 | 3.488607 | 3.488607 | 3.486244 |


Normalizing each row of this matrix gives the distribution of terms within each *topic*:

In [None]:
# divide each row by its sum so that every topic becomes a probability distribution
distribucion = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
pd.DataFrame(distribucion, columns=tf_feature_names)

| | dog | fox | java | language | programming | python |
|---|---|---|---|---|---|---|
| 0 | 0.386525 | 0.386524 | 0.056871 | 0.056606 | 0.056606 | 0.056868 |
| 1 | 0.033947 | 0.033947 | 0.232946 | 0.233106 | 0.233106 | 0.232948 |


The `transform` method returns, for each document, the proportion the model assigns to each *topic*:

In [None]:
lda.transform(tf)

array([[0.8319788 , 0.1680212 ],
       [0.74801655, 0.25198345],
       [0.74801659, 0.25198341],
       [0.8319788 , 0.1680212 ],
       [0.1281347 , 0.8718653 ],
       [0.12813488, 0.87186512],
       [0.1025063 , 0.8974937 ],
       [0.17085046, 0.82914954]])

For each document, the topic proportions sum to 1:

In [None]:
lda.transform(tf).sum(axis=1)

array([1., 1., 1., 1., 1., 1., 1., 1.])

The model has correctly assigned the documents to the two themes of the *corpus*:

In [None]:
np.argmax(lda.transform(tf), axis=1)

array([0, 0, 0, 0, 1, 1, 1, 1])

## Topic modeling with the Gensim library
The `gensim` library implements the following models:  
* [Latent Semantic Indexing, LSI (or sometimes LSA)](https://en.wikipedia.org/wiki/Latent_semantic_indexing) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of a lower dimensionality.  
* [Latent Dirichlet Allocation, LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).  
* [Hierarchical Dirichlet Process, HDP](http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf) is a non-parametric Bayesian method (note the missing number of requested topics).

The input to the `gensim` models must be a list of tokens per document rather than a single string, so the normalization function has to change:

In [None]:
def normalize_tokenize_document(doc):
    # tokenize the text
    tokens = nlp(doc)
    # drop punctuation, whitespace and stop words, and keep the lowercased lemma
    lemmas = [t.lemma_.lower() for t in tokens if not t.is_punct and not t.is_space and not t.is_stop]
    return lemmas

def normalize_tokenize_corpus(corpus):
    """Normalize a corpus by applying normalize_tokenize_document()
    to every document in the list passed as argument."""
    return [normalize_tokenize_document(text) for text in corpus]
        
norm_tokenized_corpus = normalize_tokenize_corpus(toy_corpus)
norm_tokenized_corpus

[['fox', 'jump', 'dog'],
 ['fox', 'clever', 'quick'],
 ['dog', 'slow', 'lazy'],
 ['cat', 'smart', 'fox', 'dog'],
 ['python', 'excellent', 'programming', 'language'],
 ['java', 'ruby', 'programming', 'language'],
 ['python', 'java', 'popular', 'programming', 'language'],
 ['python', 'program', 'small', 'java', 'program']]

As with the `scikit-learn` models, we first build BoW and TF-IDF feature matrices before applying the topic models.  
In `gensim` these matrices are built differently from `scikit-learn`: each document is a sparse list of `(term_id, weight)` tuples.

In [None]:
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel, TfidfModel

# dictionary of the unique terms in the corpus
dictionary = Dictionary(norm_tokenized_corpus)
# build the BoW corpus
corpus_bow = [dictionary.doc2bow(text)
              for text in norm_tokenized_corpus]
# build the TF-IDF corpus from the BoW corpus
# (named tfidf_model so it does not shadow the scikit-learn `tfidf` matrix above)
tfidf_model = TfidfModel(corpus_bow)
corpus_tfidf = tfidf_model[corpus_bow]

In [None]:
corpus_bow[0]

[(0, 1), (1, 1), (2, 1)]

In [None]:
corpus_tfidf[0]

[(0, 0.39239043318859274), (1, 0.39239043318859274), (2, 0.8319011334792957)]
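
The weights above can be reproduced by hand. This is a sketch in plain Python, assuming the default `TfidfModel` weighting (raw term frequency times `log2(N / df)`, followed by L2 normalization); the helper name `tfidf_weights` is made up for this illustration.

```python
import math

# Tokenized corpus, copied from norm_tokenized_corpus above.
docs = [
    ['fox', 'jump', 'dog'],
    ['fox', 'clever', 'quick'],
    ['dog', 'slow', 'lazy'],
    ['cat', 'smart', 'fox', 'dog'],
    ['python', 'excellent', 'programming', 'language'],
    ['java', 'ruby', 'programming', 'language'],
    ['python', 'java', 'popular', 'programming', 'language'],
    ['python', 'program', 'small', 'java', 'program'],
]

n_docs = len(docs)
# document frequency: number of documents containing each term
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def tfidf_weights(doc):
    """tf * log2(N / df) per term, then L2-normalized (assumed gensim defaults)."""
    raw = {t: doc.count(t) * math.log2(n_docs / doc_freq[t]) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_weights(docs[0])
print(weights)
```

`jump` gets the largest weight in the first document because it occurs in only one document, while `dog` and `fox` occur in three each.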

In [None]:
list(dictionary.items())

[(0, 'dog'),
 (1, 'fox'),
 (2, 'jump'),
 (3, 'clever'),
 (4, 'quick'),
 (5, 'lazy'),
 (6, 'slow'),
 (7, 'cat'),
 (8, 'smart'),
 (9, 'excellent'),
 (10, 'language'),
 (11, 'programming'),
 (12, 'python'),
 (13, 'java'),
 (14, 'ruby'),
 (15, 'popular'),
 (16, 'program'),
 (17, 'small')]
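
To make the sparse BoW format concrete, here is a minimal pure-Python sketch of what a `doc2bow`-style conversion does. It is not gensim's code: `doc2bow_sketch` and `token2id` are names chosen for this illustration, and the id assignment (new terms of each document in alphabetical order) is an assumption that happens to reproduce the listing above.

```python
from collections import Counter

# Tokenized corpus, copied from norm_tokenized_corpus above.
docs = [
    ['fox', 'jump', 'dog'],
    ['fox', 'clever', 'quick'],
    ['dog', 'slow', 'lazy'],
    ['cat', 'smart', 'fox', 'dog'],
    ['python', 'excellent', 'programming', 'language'],
    ['java', 'ruby', 'programming', 'language'],
    ['python', 'java', 'popular', 'programming', 'language'],
    ['python', 'program', 'small', 'java', 'program'],
]

token2id = {}

def doc2bow_sketch(tokens):
    """Return sorted (term_id, count) pairs, registering unseen terms on the fly."""
    counts = Counter(tokens)
    # unseen terms of a document get the next free ids in alphabetical order
    for term in sorted(counts):
        if term not in token2id:
            token2id[term] = len(token2id)
    return sorted((token2id[term], n) for term, n in counts.items())

bow = [doc2bow_sketch(doc) for doc in docs]
print(bow[0])
print(bow[7])
```

The last document shows why counts matter: `program` appears twice, so its tuple carries a count of 2.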

### Latent Semantic Indexing
The `gensim` topic models assign each dictionary term a weight in each topic:

In [None]:
lsi = LsiModel(corpus_tfidf, 
                      id2word=dictionary,
                      num_topics=2)

for index, topic in lsi.print_topics(2):
    print('Topic #{}\n{}\n'.format(str(index+1), topic))

Topic #1
0.459*"language" + 0.459*"programming" + 0.344*"python" + 0.344*"java" + 0.336*"popular" + 0.318*"excellent" + 0.318*"ruby" + 0.148*"program" + 0.074*"small" + -0.000*"cat"

Topic #2
0.459*"fox" + 0.459*"dog" + 0.444*"jump" + 0.322*"cat" + 0.322*"smart" + 0.208*"quick" + 0.208*"clever" + 0.208*"slow" + 0.208*"lazy" + -0.000*"popular"



In [None]:
topics_lsi = lsi[corpus_tfidf]
topics_lsi

<gensim.interfaces.TransformedCorpus at 0x7f5f49b93ac0>

The `gensim` LSI model returns an iterable, indexable object holding the LSI transformation of every document in the corpus.\
For each document the model returns a list of `(topic_id, topic_weight)` tuples. The number of tuples varies per document: topics whose weight is numerically zero are left out, so only the relevant ones appear.

In [None]:
topics_lsi[0]

[(1, 0.7296053406305377)]

In [None]:
for t in topics_lsi:
    print(t)

[(1, 0.7296053406305377)]
[(1, 0.4246970606936986)]
[(1, 0.4246970606936975)]
[(1, 0.6892950729735762)]
[(0, 0.7070677687653233)]
[(0, 0.7070677687653228)]
[(0, 0.7950457769187105)]
[(0, 0.29788705652128283)]
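
Since the number of tuples varies per document, it is often convenient to expand them into a fixed-size vector before comparing documents. A minimal sketch, where `to_dense` is a helper written for this illustration:

```python
import numpy as np

def to_dense(doc_topics, num_topics):
    """Expand a sparse [(topic_id, weight), ...] list into a dense vector."""
    vec = np.zeros(num_topics)
    for topic_id, weight in doc_topics:
        vec[topic_id] = weight
    return vec

# First and fifth documents, copied from the output above.
sparse_docs = [[(1, 0.7296053406305377)], [(0, 0.7070677687653233)]]
dense = np.vstack([to_dense(doc, num_topics=2) for doc in sparse_docs])
print(dense)
```

`gensim.matutils.sparse2full(doc, length)` offers a similar conversion if you prefer not to write the helper.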


### Latent Dirichlet Allocation

In [None]:
lda = LdaModel(corpus_bow, 
                      id2word=dictionary,
                      iterations=1000,
                      num_topics=2)
for index, topic in lda.print_topics(2):
    print('Topic #{}\n{}\n'.format(str(index+1), topic))

Topic #1
0.110*"program" + 0.098*"python" + 0.084*"fox" + 0.079*"dog" + 0.076*"java" + 0.068*"small" + 0.056*"programming" + 0.055*"jump" + 0.051*"language" + 0.051*"excellent"

Topic #2
0.105*"language" + 0.101*"programming" + 0.086*"java" + 0.084*"dog" + 0.080*"fox" + 0.070*"python" + 0.051*"ruby" + 0.051*"popular" + 0.048*"lazy" + 0.046*"slow"



In [None]:
topics_lda = lda[corpus_bow]
topics_lda

<gensim.interfaces.TransformedCorpus at 0x7f5f48f478e0>

In [None]:
topics_lda[0]

[(0, 0.77280724), (1, 0.22719279)]

### Hierarchical Dirichlet Process
This model does not take a number of *topics*. It infers topics up to an internal truncation level (the `get_topics()` call further down returns 150 rows, one per candidate topic) and orders them by importance.

In [None]:
# no number of topics needs to be specified
hdp = HdpModel(corpus_bow, 
                      id2word=dictionary)
for index, topic in hdp.print_topics(2):
    print('Topic #{}\n{}\n'.format(str(index+1), topic))

Topic #1
0.181*quick + 0.159*program + 0.116*lazy + 0.105*smart + 0.077*cat + 0.068*jump + 0.051*python + 0.050*clever + 0.031*small + 0.030*excellent

Topic #2
0.281*lazy + 0.095*cat + 0.085*excellent + 0.080*slow + 0.067*programming + 0.064*ruby + 0.047*language + 0.045*small + 0.044*dog + 0.039*popular



When we ask `print_topics` for 2 topics, it lists the words that contribute most to each one; the value next to each word indicates how strongly that word is associated with the topic.

In [None]:
for index, topic in hdp.print_topics(4):
    print('Topic #{}\n{}\n'.format(str(index+1), topic))

Topic #1
0.181*quick + 0.159*program + 0.116*lazy + 0.105*smart + 0.077*cat + 0.068*jump + 0.051*python + 0.050*clever + 0.031*small + 0.030*excellent

Topic #2
0.281*lazy + 0.095*cat + 0.085*excellent + 0.080*slow + 0.067*programming + 0.064*ruby + 0.047*language + 0.045*small + 0.044*dog + 0.039*popular

Topic #3
0.154*fox + 0.104*python + 0.088*ruby + 0.081*programming + 0.076*cat + 0.075*popular + 0.071*dog + 0.064*small + 0.064*program + 0.039*lazy

Topic #4
0.268*cat + 0.140*jump + 0.087*programming + 0.082*java + 0.079*fox + 0.078*smart + 0.050*language + 0.046*clever + 0.043*quick + 0.032*program



In [None]:
topics_hdp = hdp[corpus_bow]
topics_hdp

<gensim.interfaces.TransformedCorpus at 0x7f5f49b93b80>

For each document, the HDP model returns only the *topics* that are most relevant to its composition:

In [None]:
for t in topics_hdp:
    print(t)

[(0, 0.07328190139955858), (1, 0.04901361450713037), (2, 0.7788827828216391), (3, 0.0266230207447064), (4, 0.019599164793479312), (5, 0.01431472544131097), (6, 0.010532653366491408)]
[(0, 0.8153515946759589), (1, 0.04892964010977624), (2, 0.036905019164293784), (3, 0.026615294101975724), (4, 0.0195989416136308), (5, 0.0143147200338518), (6, 0.0105326533748281)]
[(0, 0.06779468503778084), (1, 0.7968696132183295), (2, 0.036540245686808095), (3, 0.026597042226215582), (4, 0.019598907745612522), (5, 0.014314715788214399), (6, 0.01053265337135506)]
[(0, 0.2437520393219116), (1, 0.03971087621308347), (2, 0.6374819252498064), (3, 0.021296439175269695), (4, 0.01567911530026445), (5, 0.011451772505553515)]
[(0, 0.05590578030123453), (1, 0.835713186765757), (2, 0.029337026488526097), (3, 0.021285135779718606), (4, 0.015679266722736486), (5, 0.011451771708803)]
[(0, 0.30814273002078396), (1, 0.5833163467359885), (2, 0.029460601384669927), (3, 0.021320576952027725), (4, 0.015680129498551208), (5, 

### Estimating the main topic
We can assign each document to a dominant topic from its transformed representation by taking the tuple with the largest weight. With LSI on this corpus only one tuple per document is returned, so it is simply the first one.

In [None]:
corpus_lsi = lsi[corpus_tfidf]
for i, doc in enumerate(corpus_lsi):
    print(doc, toy_corpus[i])

[(1, 0.7296053406305377)] The fox jumps over the dog
[(1, 0.4246970606936986)] the fox is very clever and quick
[(1, 0.4246970606936975)] The dog is slow and lazy
[(1, 0.6892950729735762)] The cat is smarter than the fox and the dog
[(0, 0.7070677687653233)] Python is an excellent programming language
[(0, 0.7070677687653228)] Java and Ruby are other programming languages
[(0, 0.7950457769187105)] Python and Java are very popular programming languages
[(0, 0.29788705652128283)] Python programs are smaller than Java programs


Each model stores internally the weight it gives every term in each topic:

In [None]:
lsi.get_topics().shape

(2, 18)

In [None]:
len(dictionary)

18

In [None]:
lsi.get_topics()

array([[ 0.00000000e+00, -8.32667268e-17,  5.55111512e-17,
         5.04457587e-15,  4.94743135e-15,  5.32907052e-15,
         5.24580379e-15, -5.45397061e-15, -5.34294831e-15,
         3.18217601e-01,  4.58720976e-01,  4.58720976e-01,
         3.43618045e-01,  3.43618045e-01,  3.18217601e-01,
         3.36092332e-01,  1.48379168e-01,  7.41895838e-02],
       [ 4.59433532e-01,  4.59433532e-01,  4.43623263e-01,
         2.08216334e-01,  2.08216334e-01,  2.08216334e-01,
         2.08216334e-01,  3.22198614e-01,  3.22198614e-01,
        -1.11022302e-16, -4.44089210e-16, -4.44089210e-16,
        -2.56739074e-16, -4.30211422e-16, -4.16333634e-16,
        -4.51028104e-16, -1.11022302e-16, -5.55111512e-17]])

In [None]:
corpus_lda = lda[corpus_bow]
for i, doc in enumerate(corpus_lda):
    print(doc, toy_corpus[i])

[(0, 0.7727445), (1, 0.22725552)] The fox jumps over the dog
[(0, 0.19212334), (1, 0.80787665)] the fox is very clever and quick
[(0, 0.16202964), (1, 0.8379703)] The dog is slow and lazy
[(0, 0.16993101), (1, 0.83006895)] The cat is smarter than the fox and the dog
[(0, 0.21532944), (1, 0.78467053)] Python is an excellent programming language
[(0, 0.11891045), (1, 0.88108957)] Java and Ruby are other programming languages
[(0, 0.112972006), (1, 0.887028)] Python and Java are very popular programming languages
[(0, 0.9002772), (1, 0.09972281)] Python programs are smaller than Java programs


The numbers next to each sentence in the output above are the probabilities that the sentence belongs to each topic.
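
When a model returns several tuples per document, as LDA does, the first tuple is not necessarily the strongest one. A small sketch of picking the dominant topic by weight (`dominant_topic` is a helper name introduced here for illustration):

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, weight) pair with the largest weight."""
    return max(doc_topics, key=lambda pair: pair[1])

# Two documents copied from the LDA output above.
lda_docs = [
    [(0, 0.7727445), (1, 0.22725552)],   # "The fox jumps over the dog"
    [(0, 0.19212334), (1, 0.80787665)],  # "the fox is very clever and quick"
]
labels = [dominant_topic(doc)[0] for doc in lda_docs]
print(labels)
```

For LSI, whose weights can be negative, comparing absolute values (`key=lambda pair: abs(pair[1])`) is probably the safer choice.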

In [None]:
lda.get_topics().shape

(2, 18)

In [None]:
lda.get_topics()

array([[0.07886329, 0.08364438, 0.05513915, 0.03610821, 0.03890279,
        0.03175268, 0.03422564, 0.0381352 , 0.03980745, 0.05077701,
        0.05129281, 0.05553823, 0.09764033, 0.07588366, 0.02718895,
        0.02767461, 0.1096984 , 0.06772716],
       [0.08372255, 0.08011434, 0.03000719, 0.0443695 , 0.04226045,
        0.04765654, 0.04579025, 0.04283974, 0.04157775, 0.03329921,
        0.1045295 , 0.10132551, 0.06955191, 0.08597127, 0.05110073,
        0.05073421, 0.02464205, 0.02050728]], dtype=float32)

In the LDA model, the weight of each term in a *topic* is its probability within that topic, so the weights of each *topic* sum to 1.

In [None]:
np.sum(lda.get_topics(), axis=1)

array([1.        , 0.99999994], dtype=float32)

With the HDP model the number of topics is not specified; topics are defined automatically (in decreasing order of importance).

In [None]:
# Solution
corpus_hdp = hdp[corpus_bow]
for i, doc in enumerate(corpus_hdp):
    print(doc, toy_corpus[i])

[(0, 0.07328190139955858), (1, 0.04901361450713037), (2, 0.7788827828216391), (3, 0.0266230207447064), (4, 0.019599164793479312), (5, 0.01431472544131097), (6, 0.010532653366491408)] The fox jumps over the dog
[(0, 0.8153515946759589), (1, 0.04892964010977624), (2, 0.036905019164293784), (3, 0.026615294101975724), (4, 0.0195989416136308), (5, 0.0143147200338518), (6, 0.0105326533748281)] the fox is very clever and quick
[(0, 0.06779468503778084), (1, 0.7968696132183295), (2, 0.036540245686808095), (3, 0.026597042226215582), (4, 0.019598907745612522), (5, 0.014314715788214399), (6, 0.01053265337135506)] The dog is slow and lazy
[(0, 0.2437520393219116), (1, 0.03971087621308347), (2, 0.6374819252498064), (3, 0.021296439175269695), (4, 0.01567911530026445), (5, 0.011451772505553515)] The cat is smarter than the fox and the dog
[(0, 0.05590578030123453), (1, 0.835713186765757), (2, 0.029337026488526097), (3, 0.021285135779718606), (4, 0.015679266722736486), (5, 0.011451771708803)] Python i

In [None]:
hdp.get_topics().shape

(150, 18)

We can get the relevant terms of each topic and their weights with the model's `show_topics` method:

In [None]:
lsitopics = [[(word,prob) for word, prob in topic] for topicid, topic in lsi.show_topics(formatted=False)]

hdptopics = [[(word,prob) for word, prob in topic] for topicid, topic in hdp.show_topics(formatted=False)]

ldatopics = [[(word,prob) for word, prob in topic] for topicid, topic in lda.show_topics(formatted=False)]

In [None]:
ldatopics

[[('program', 0.1096984),
  ('python', 0.09764033),
  ('fox', 0.083644375),
  ('dog', 0.07886329),
  ('java', 0.075883664),
  ('small', 0.06772716),
  ('programming', 0.05553823),
  ('jump', 0.055139154),
  ('language', 0.05129281),
  ('excellent', 0.050777014)],
 [('language', 0.1045295),
  ('programming', 0.10132551),
  ('java', 0.085971266),
  ('dog', 0.08372255),
  ('fox', 0.08011434),
  ('python', 0.069551915),
  ('ruby', 0.05110073),
  ('popular', 0.050734207),
  ('lazy', 0.047656536),
  ('slow', 0.045790248)]]

### Topic Coherence
The `gensim` library provides a way to judge which topic model fits the corpus best. The `CoherenceModel` class computes a coherence score for a model, which we can use to compare models. It works from the top words that define each topic.

In [None]:
lsitopics = [[word for word, prob in topic] for topicid, topic in lsi.show_topics(formatted=False)]

hdptopics = [[word for word, prob in topic] for topicid, topic in hdp.show_topics(formatted=False)]

ldatopics = [[word for word, prob in topic] for topicid, topic in lda.show_topics(formatted=False)]


lsi_coherence = CoherenceModel(topics=lsitopics[:10], texts=norm_tokenized_corpus,
                               dictionary=dictionary, window_size=10).get_coherence()

hdp_coherence = CoherenceModel(topics=hdptopics[:10], texts=norm_tokenized_corpus, 
                               dictionary=dictionary, window_size=10).get_coherence()

lda_coherence = CoherenceModel(topics=ldatopics, texts=norm_tokenized_corpus,
                               dictionary=dictionary, window_size=10).get_coherence()
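
To give an intuition of what a coherence score measures, here is a simplified UMass-style coherence in plain Python. It is only a sketch: `CoherenceModel` uses the more elaborate `c_v` measure by default, so the numbers are not comparable, and the function name `umass_coherence` is an illustrative choice. The idea is that a topic is coherent when its top words tend to occur in the same documents.

```python
import math

# Tokenized corpus, copied from norm_tokenized_corpus above.
docs = [
    ['fox', 'jump', 'dog'],
    ['fox', 'clever', 'quick'],
    ['dog', 'slow', 'lazy'],
    ['cat', 'smart', 'fox', 'dog'],
    ['python', 'excellent', 'programming', 'language'],
    ['java', 'ruby', 'programming', 'language'],
    ['python', 'java', 'popular', 'programming', 'language'],
    ['python', 'program', 'small', 'java', 'program'],
]

def umass_coherence(topic_words, docs):
    """Simplified UMass-style coherence of one topic (higher is more coherent)."""
    doc_sets = [set(doc) for doc in docs]

    def df(*words):
        # number of documents that contain all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            # +1 smoothing avoids log(0); assumes every word occurs in the corpus
            score += math.log((df(topic_words[m], topic_words[l]) + 1) / df(topic_words[l]))
    return score

coherent = umass_coherence(['fox', 'dog'], docs)
mixed = umass_coherence(['fox', 'python'], docs)
print(coherent, mixed)
```

A topic made of `fox` and `dog`, which co-occur in two sentences, scores higher than one that mixes `fox` with `python`, which never appear together.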

In [None]:
lsi_coherence

In [None]:
def evaluate_bar_graph(coherences, indices):
    """
    Draw a bar chart of coherence scores.

    coherences: list of coherence values
    indices: labels for the bars
    Both arguments must have the same length.
    """
    assert len(coherences) == len(indices)
    n = len(coherences)
    x = np.arange(n)
    plt.bar(x, coherences, width=0.2, tick_label=indices, align='center')
    plt.xlabel('Models')
    plt.ylabel('Coherence value')

In [None]:
evaluate_bar_graph([lsi_coherence, hdp_coherence, lda_coherence],
                   ['LSI', 'HDP', 'LDA'])