## Tomotopy
tomotopy is the Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling-based topic model library written in C++. It uses the vector instructions of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including:

- Latent Dirichlet Allocation [LDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.LDAModel)
- Labeled LDA [LLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.LLDAModel)
- Partially Labeled LDA [PLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.PLDAModel)
- Supervised LDA [SLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.SLDAModel)
- Dirichlet Multinomial Regression [DMRModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.DMRModel)
- Generalized Dirichlet Multinomial Regression [GDMRModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.GDMRModel)
- Hierarchical Dirichlet Process [HDPModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.HDPModel)
- Hierarchical LDA [HLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.HLDAModel)
- Multi Grain LDA [MGLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.MGLDAModel)
- Pachinko Allocation [PAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.PAModel)
- Hierarchical PA [HPAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.HPAModel)
- Correlated Topic Model [CTModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.CTModel)
- Dynamic Topic Model [DTModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.DTModel)
- Pseudo-document based Topic Model [PTModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.PTModel)


In [None]:
#!pip install tomotopy

In [1]:
import nltk
#nltk.download('wordnet')
import tomotopy as tp

In [2]:
import warnings
import pandas as pd
warnings.filterwarnings("ignore", category=DeprecationWarning)
pd.options.mode.chained_assignment = None  # default='warn'

from tqdm import tqdm
from tqdm.notebook import tqdm_notebook

tqdm.pandas()
import numpy as np
from pprint import pprint

from utils import filter_by_media
from utils import cluster_by_month
from utils import preprocess

In [3]:
df = pd.read_csv("../../data/loslagos-comunas.csv")[:1000]
df = cluster_by_month(filter_by_media(df))
df['tokens'] =  df.content.progress_apply(lambda x: preprocess(str(x)))
df.head(5)

100%|████████████████████████████████████████████████████████████████████████████████| 990/990 [01:00<00:00, 16.39it/s]


| | date | media_outlet | url | title | text | content | comuna | date_clustering | tokens |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-10-01 | elheraldoaustral | https://www.eha.cl/noticia/local/reconocen-a-g... | Reconocen a guardaparques de la Región de Los ... | Distintos protagonistas de los parques naciona... | reconocen guardaparques región lagos actores c... | ['puyehue', 'chaiten'] | 2021-10 | [reconocen, guardaparques, region, lagos, acto... |
| 1 | 2021-10-01 | elheraldoaustral | https://www.eha.cl/noticia/local/con-nuevos-ma... | Con nuevos materiales comienza plan piloto en ... | Centro de negocios Sercotec coordina acuerdos ... | nuevos materiales comienza plan piloto saltos ... | ['puerto varas'] | 2021-10 | [nuevos, materiales, comienza, plan, piloto, s... |
| 2 | 2021-10-01 | elheraldoaustral | https://www.eha.cl/noticia/local/centro-de-sal... | Centro de Salud Familiar CESFAM Puerto Varas i... | Las horas se solicitan en el SOME o bien a tra... | centro salud familiar cesfam puerto varas invi... | ['puerto varas'] | 2021-10 | [centro, salud, familiar, cesfam, puerto, vara... |
| 3 | 2021-10-01 | elheraldoaustral | https://www.eha.cl/noticia/local/alcalde-tomas... | Alcalde Tomás Gárate presidió por primera vez ... | Los y las consejeras destacaron el hecho de vo... | alcalde tomás gárate presidió primera vez octa... | ['castro', 'puerto varas'] | 2021-10 | [alcalde, tomas, garate, presidio, primera, ve... |
| 4 | 2021-10-01 | elheraldoaustral | https://www.eha.cl/noticia/local/galeria-de-ar... | Galería de Arte Machacoya realizará remate de ... | Hoy viernes a las 18:30 horas, en Machacoya At... | galería arte machacoya realizará remate obras ... | | 2021-10 | [galeria, arte, machacoya, realizara, remate, ... |
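
The `preprocess` helper comes from the local `utils` module, which is not shown here. Judging only from the `tokens` column above (lowercase, accents removed, so `región` becomes `region`), a minimal stand-in might look like the sketch below. The name `simple_preprocess` and the tiny stopword set are hypothetical, and the real helper may do more, for example lemmatization.

```python
import re
import unicodedata

# Hypothetical stopword list for illustration only; not the one used by utils.preprocess.
STOPWORDS = {"a", "de", "la", "los", "el", "en", "y"}

def simple_preprocess(text):
    """Lowercase, strip accents, keep alphabetic tokens, drop stopwords."""
    text = text.lower()
    # NFKD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z]+", no_accents)
    return [t for t in tokens if t not in STOPWORDS]

print(simple_preprocess("Reconocen a guardaparques de la Región de Los Lagos"))
```

With this stand-in, the cell above would read `df.content.progress_apply(lambda x: simple_preprocess(str(x)))`.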


### HLDA
#### Parameters:

    tw : Union[int, TermWeight]
    term weighting scheme in TermWeight. The default value is TermWeight.ONE
    
    min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
    
    min_df : int
    minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded.
    
    rm_top : int
    the number of top words to be removed. To remove overly common words from the model, set this value to 1 or more. The default value is 0, which means no top words are removed.
    
    depth : int
    the maximum depth level of hierarchy between 2 ~ 32767
    
    alpha : Union[float, Iterable[float]]
    hyperparameter of Dirichlet distribution for document-depth level, given as a single float in case of symmetric prior and as a list with length depth of float in case of asymmetric prior.
    
    eta : float
    hyperparameter of Dirichlet distribution for topic-word
    
    gamma : float
    concentration coefficient of Dirichlet Process
    
    seed : int
    random seed. default value is a random number from std::random_device{} in C++
    
    corpus : Corpus
    a list of documents to be added into the model
    
    transform : Callable[dict, dict]
    a callable object to manipulate arbitrary keyword arguments for a specific topic model
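
To make the difference between `min_cf`, `min_df` and `rm_top` concrete, here is a small pure-Python sketch of the filtering they describe. It only illustrates the definitions above and is not tomotopy's implementation; `filter_vocab` is a hypothetical helper, and applying `rm_top` after the frequency thresholds is an assumption.

```python
from collections import Counter

def filter_vocab(docs, min_cf=0, min_df=0, rm_top=0):
    """Return the set of words kept under the three thresholds.

    docs: list of token lists.
    """
    # Collection frequency: total occurrences across the corpus
    cf = Counter(tok for doc in docs for tok in doc)
    # Document frequency: number of documents containing the word
    df = Counter(tok for doc in docs for tok in set(doc))
    kept = [w for w in cf if cf[w] >= min_cf and df[w] >= min_df]
    # Assumption: the rm_top most frequent remaining words are dropped last
    kept.sort(key=lambda w: (-cf[w], w))
    return set(kept[rm_top:])

docs = [["puerto", "montt", "puerto"], ["puerto", "castro"], ["castro", "lagos", "lagos"]]
print(sorted(filter_vocab(docs, min_cf=2)))
print(sorted(filter_vocab(docs, min_df=2)))
print(sorted(filter_vocab(docs, min_cf=2, rm_top=1)))
```

Here `lagos` occurs twice but in a single document, so it survives `min_cf=2` and is removed by `min_df=2`.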


#### Test 

In [33]:
# Build a 4-level hLDA model; words occurring fewer than 100 times in the corpus are excluded
h_mdl = tp.HLDAModel(depth=4, min_cf=100, seed=1)

# Each document is added as a list of tokens
for tokens in df.tokens:
    h_mdl.add_doc(tokens)

print('Training model by iterating over the corpus 100 times, 10 iterations at a time')
iterations = 10
for i in tqdm_notebook(range(0, 100, iterations)):
    h_mdl.train(iterations)
    # i is the iteration count at the start of this chunk; the value printed is the per-word log-likelihood
    print('Iteration: #{}\tLog-likelihood: {}'.format(i, h_mdl.ll_per_word))
    print("Number of topics: ", h_mdl.k)

print("Number of live topics: ", h_mdl.live_k)
print("Number of documents: ", len(h_mdl.docs))
print("Model perplexity: ", h_mdl.perplexity)

Training model by iterating over the corpus 100 times, 10 iterations at a time


Iteration: #0	Log-likelihood: -6.073829278630365
Number of topics:  160
Iteration: #10	Log-likelihood: -5.949068784288547
Number of topics:  176
Iteration: #20	Log-likelihood: -5.90564660221281
Number of topics:  200
Iteration: #30	Log-likelihood: -5.867441728307926
Number of topics:  208
Iteration: #40	Log-likelihood: -5.8465084726312755
Number of topics:  240
Iteration: #50	Log-likelihood: -5.837629796083174
Number of topics:  240
Iteration: #60	Log-likelihood: -5.818394908835068
Number of topics:  240
Iteration: #70	Log-likelihood: -5.810661733279824
Number of topics:  248
Iteration: #80	Log-likelihood: -5.792544809473434
Number of topics:  256
Iteration: #90	Log-likelihood: -5.78400524578482
Number of topics:  264
Number of live topics:  248
Number of documents:  990
Model perplexity:  325.05852589193074
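
The perplexity and the per-word log-likelihood printed above are two views of the same quantity: the final values are consistent with perplexity = exp(-ll_per_word). A quick check using the numbers from this run:

```python
import math

ll_per_word = -5.78400524578482           # last log-likelihood printed by the training loop
reported_perplexity = 325.05852589193074  # value printed as "Model perplexity"

perplexity = math.exp(-ll_per_word)
print(perplexity)
print(math.isclose(perplexity, reported_perplexity, rel_tol=1e-6))
```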



In [65]:
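# OUTPUT (hLDA) - Explore the topics (children, parents, depth, number of topics per level) as csv
# Note: the trailing comma on each assignment in the loop wraps the value in a 1-tuple,
# so every cell of topics_df holds e.g. (0,) rather than 0, and later cells index with [0].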
rows = []
for k in tqdm_notebook(range(h_mdl.k)):
    topic = k,
    keyword = h_mdl.get_topic_words(k, top_n=10),
    num_docs = h_mdl.num_docs_of_topic(k),
    children = h_mdl.children_topics(k),
    parent = h_mdl.parent_topic(k),
    level = h_mdl.level(k),
    rows.append([topic, keyword, num_docs, children, parent, level])

topics_df = pd.DataFrame(rows, columns=["Topic", "Keywords", "Num_Docs", "Children", "Parent", "Level"])

#topics_df.to_csv('outputs/hLDA/hLDA-4-level.csv')
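
As a sketch of an alternative, the same table can be built from plain dicts with no trailing commas, so each value stays a scalar or list rather than a 1-tuple. `topic_rows` is a hypothetical helper, and `FakeModel` is a stand-in with made-up values used only so the snippet runs without a trained model. It assumes the model exposes the same accessors used in the cell above.

```python
def topic_rows(model, top_n=10):
    """Collect one dict per topic using the accessors called in the cell above."""
    rows = []
    for k in range(model.k):
        rows.append({
            "Topic": k,
            "Keywords": [w for w, _ in model.get_topic_words(k, top_n=top_n)],
            "Num_Docs": model.num_docs_of_topic(k),
            "Children": list(model.children_topics(k)),
            "Parent": model.parent_topic(k),
            "Level": model.level(k),
        })
    return rows

class FakeModel:
    """Tiny stand-in with made-up values (not real model output)."""
    k = 3
    _parent = {0: -1, 1: 0, 2: 0}

    def get_topic_words(self, k, top_n=10):
        return [("word%d" % k, 1.0)][:top_n]

    def num_docs_of_topic(self, k):
        return [3, 2, 1][k]

    def children_topics(self, k):
        return [c for c, p in self._parent.items() if p == k]

    def parent_topic(self, k):
        return self._parent[k]

    def level(self, k):
        return 0 if self._parent[k] == -1 else 1

rows = topic_rows(FakeModel())
print(rows[0])
```

With the notebook's model, `pd.DataFrame(topic_rows(h_mdl))` would give a frame with the same six columns, though the cells that index into tuples further down would then need adjusting.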


In [93]:
#h_mdl.get_topic_words(10, top_n=100)
h_mdl.get_count_by_topics()
h_mdl.summary()

<Basic Info>
| HLDAModel (current version: 0.12.3)
| 990 docs, 56160 words
| Total Vocabs: 23063, Used Vocabs: 293
| Entropy of words: 5.54396
| Entropy of term-weighted words: 5.54396
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -5.78401
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 100 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| depth: 4 (the maximum depth level of hierarchy between 2 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-depth level, given as a single `float` in case of symmetric prior and as a list with length `depth` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| gamma: 0.1 (concentration coeficient of Dirichlet Process)
| seed: 1 (random seed)
| trained in version 0.12.3
|
<Parameter

In [73]:
pd.set_option('display.max_rows', None)
topics_df

| | Topic | Keywords | Num_Docs | Children | Parent | Level |
|---|---|---|---|---|---|---|
| 0 | (0,) | ([(persona, 0.025957293808460236), (puerto, 0.... | (990,) | ([124, 126, 62, 15, 123, 60, 120, 11, 58, 63, ... | (-1,) | (0,) |
| 1 | (1,) | ([(ano, 0.003412969410419464), (persona, 0.003... | (0,) | ([],) | (-1,) | (-1,) |
| 2 | (2,) | ([(ano, 0.003412969410419464), (persona, 0.003... | (0,) | ([],) | (-1,) | (-1,) |
| 3 | (3,) | ([(ano, 0.003412969410419464), (persona, 0.003... | (0,) | ([],) | (-1,) | (-1,) |
| 4 | (4,) | ([(ano, 0.003412969410419464), (persona, 0.003... | (0,) | ([],) | (-1,) | (-1,) |
| 5 | (5,) | ([(ano, 0.003412969410419464), (persona, 0.003... | (0,) | ([],) | (-1,) | (-1,) |
| 6 | (6,) | ([(ano, 0.003412969410419464), (persona, 0.003... | (0,) | ([],) | (-1,) | (-1,) |
| 7 | (7,) | ([(ano, 0.003412969410419464), (persona, 0.003... | (0,) | ([],) | (-1,) | (-1,) |
| 8 | (8,) | ([(ano, 0.033074285835027695), (presidente, 0.... | (814,) | ([88, 220, 219, 75, 167, 216, 165, 132, 128, 1... | (0,) | (1,) |
| 9 | (9,) | ([(ano, 0.06058770418167114), (tras, 0.0605877... | (96,) | ([135, 129, 51, 17],) | (0,) | (1,) |
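
In the table, topics 1 to 7 have `Num_Docs` 0 and `Level` -1, which suggests they are empty slots rather than live topics (the training output reports 264 topics but only 248 live ones). Assuming that reading, here is a small sketch that counts live topics per level. The sample levels are the ten shown above, unwrapped from their 1-tuples.

```python
from collections import Counter

# Level values of topics 0..9 from the table above
levels = [0, -1, -1, -1, -1, -1, -1, -1, 1, 1]

# Assumption: a level of -1 marks a non-live topic, so it is skipped
live_per_level = Counter(lv for lv in levels if lv >= 0)
print(dict(sorted(live_per_level.items())))
```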


In [74]:
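# Collect each topic's top keywords as a string, then de-duplicate.
# Each Keywords cell is a 1-tuple wrapping the list of (word, probability) pairs, hence the [0].
lista_topicos = []
for keywords in topics_df.Keywords:
    words = [word for word, _ in keywords[0]]
    lista_topicos.append(str(words))
unique_topics = list(set(lista_topicos))
unique_topics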

["['persona', 'puerto', 'ano', 'ademas', 'dia', 'parte', 'asi', 'montt', 'importante', 'pandemia']",
 "['osorno', 'me', 'mayor', 'gran', 'actividades', 'casa', 'obra', 'chile', 'puerto', 'region']",
 "['ministro', 'comunidad', 'investigacion', 'hechos', 'conocer', 'universidad', 'nuevo', 'nuevos', 'cada', 'serie']",
 "['millones', 'pandemia', 'ano', 'director', 'hacia', 'pasado', 'nueva', 'momento', 'indico', 'destaco']",
 "['retiro', 'pensiones', 'persona', 'sistema', 'mayor', 'aumento', 'ayer', 'recursos', 'embargo', 'fondos']",
 "['cuenta', 'pago', 'servicio', 'nacional', 'octubre', 'trave', 'jornada', 'chile', 'puerto', 'region']",
 "['partido', 'equipo', 'ayer', 'hoy', 'puntos', 'nacional', 'chile', 'fecha', 'cuatro', 'encuentro']",
 "['alcalde', 'castro', 'aun', 'gobierno', 'ahora', 'sido', 'chile', 'ello', 'tal', 'persona']",
 "['municipal', 'atencion', 'informacion', 'comuna', 'me', 'septiembre', 'puerto', 'total', 'persona', 'regional']",
 "['comuna', 'carabineros', 'lugar', '

In [76]:
# VISUALIZATION - Transform topic table to hierarchical data structure for tree
# TO DO: pivot table
# Intended to drop topics whose parent is -1 while preserving the original topic numbers.
# Note: Parent cells are 1-tuples here (e.g. (-1,)), so comparing with the scalar -1 matches no rows.
level_index = topics_df[topics_df['Parent'] == -1].index
topics_df.drop(level_index, inplace=True)
pivoted_h = topics_df.pivot(index='Topic', columns='Level', values='Parent')
pivoted_h.rename(columns={pivoted_h.columns[0]: "-1", 
                          pivoted_h.columns[1]: "0",
                          pivoted_h.columns[2]: "1",
                          pivoted_h.columns[3]: "2",
                          pivoted_h.columns[4]: "3",
                         }, inplace=True)
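
One possible way to get a tree-shaped structure for visualization, sketched in plain Python, is to turn topic-to-parent pairs into nested dicts. `build_tree` is a hypothetical helper, and a parent of -1 is assumed to mark a root. The sample pairs are taken from the `Children` and `Parent` columns shown earlier (topic 0 as root, topics 8 and 9 below it, and the four children listed for topic 9).

```python
def build_tree(parent_of):
    """parent_of: dict mapping topic -> parent (-1 for a root). Returns nested dicts.

    Every parent must itself be a key of parent_of.
    """
    nodes = {topic: {} for topic in parent_of}
    roots = {}
    for topic, parent in parent_of.items():
        if parent == -1:
            roots[topic] = nodes[topic]
        else:
            nodes[parent][topic] = nodes[topic]
    return roots

parent_of = {0: -1, 8: 0, 9: 0, 135: 9, 129: 9, 51: 9, 17: 9}
tree = build_tree(parent_of)
print(tree)
```

A nested dict like this can be serialized with `json.dumps` for a tree-drawing front end; empty topics with parent -1 would need to be filtered out first, since they would otherwise show up as extra roots.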

In [94]:
#pivoted_h[~pivoted_h['-1'].isnull()]

In [95]:
#pivoted_h[~pivoted_h['0'].isnull()]

In [96]:
#pivoted_h[~pivoted_h['1'].isnull()]

In [97]:
#pivoted_h[~pivoted_h['2'].isnull()]

In [98]:
#pivoted_h[~pivoted_h['3'].isnull()]

### DTM