## Tomotopy
Tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++. It uses the vectorized (SIMD) instructions of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including:

- Latent Dirichlet Allocation [LDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.LDAModel)
- Labeled LDA [LLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.LLDAModel)
- Partially Labeled LDA [PLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.PLDAModel)
- Supervised LDA [SLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.SLDAModel)
- Dirichlet Multinomial Regression [DMRModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.DMRModel)
- Generalized Dirichlet Multinomial Regression [GDMRModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.GDMRModel)
- Hierarchical Dirichlet Process [HDPModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.HDPModel)
- Hierarchical LDA [HLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.HLDAModel)
- Multi Grain LDA [MGLDAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.MGLDAModel)
- Pachinko Allocation [PAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.PAModel)
- Hierarchical PA [HPAModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.HPAModel)
- Correlated Topic Model [CTModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.CTModel)
- Dynamic Topic Model [DTModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.DTModel)
- Pseudo-document based Topic Model [PTModel](https://bab2min.github.io/tomotopy/v0.12.3/en/#tomotopy.PTModel)
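
To make "Gibbs-sampling based" concrete, here is a minimal pure-Python sketch of collapsed Gibbs sampling for plain LDA. It is only an illustration under simplifying assumptions (symmetric `alpha` and `eta`, no term weighting, no burn-in); it is not how tomoto is implemented, and the function name `gibbs_lda` and the toy documents are hypothetical.

```python
import random

def gibbs_lda(docs, k, alpha=0.1, eta=0.01, iters=50, seed=1):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * k for _ in docs]      # n(d, t): tokens of doc d in topic t
    topic_word = [dict() for _ in range(k)]  # n(t, w): count of word w in topic t
    topic_total = [0] * k                    # n(t): tokens assigned to topic t
    assignments = []                         # current topic of every token

    # Random initialization of the topic assignments.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalized conditional probability of each topic
                # given the assignments of all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + eta)
                    / (topic_total[t] + vocab_size * eta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] = topic_word[z].get(w, 0) + 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["salud", "hospital", "salud", "medico"],
    ["alcalde", "concejo", "municipal", "alcalde"],
    ["hospital", "medico", "salud"],
]
doc_topic, topic_word = gibbs_lda(toy_docs, k=2)
for d, counts in enumerate(doc_topic):
    print(f"Doc {d}: tokens per topic = {counts}")
```

Each sweep resamples the topic of every token from its conditional distribution, which is the step that a library such as tomoto speeds up in C++.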


In [1]:
#!pip install tomotopy

In [2]:
import nltk
#nltk.download('wordnet')
import tomotopy as tp

In [3]:
import warnings
import pandas as pd
warnings.filterwarnings("ignore", category=DeprecationWarning)
pd.options.mode.chained_assignment = None  # default='warn'

from tqdm import tqdm
from tqdm.notebook import tqdm_notebook
tqdm.pandas()
import numpy as np

from utils import filter_by_media
from utils import cluster_by_month
from utils import preprocess

In [5]:
df = pd.read_csv("../../data/loslagos-comunas.csv")[:10000]
df = cluster_by_month(filter_by_media(df))
df['tokens'] =  df.content.progress_apply(lambda x: preprocess(str(x)))
df.head(5)

100%|██████████████████████████████████████████████████████████████████████████████| 9965/9965 [08:57<00:00, 18.55it/s]


Unnamed: 0,date,media_outlet,url,title,text,content,comuna,date_clustering,tokens
0,2021-10-01,elheraldoaustral,https://www.eha.cl/noticia/local/reconocen-a-g...,Reconocen a guardaparques de la Región de Los ...,Distintos protagonistas de los parques naciona...,reconocen guardaparques región lagos actores c...,"['puyehue', 'chaiten']",2021-10,"[reconocen, guardaparques, region, lagos, acto..."
1,2021-10-01,elheraldoaustral,https://www.eha.cl/noticia/local/con-nuevos-ma...,Con nuevos materiales comienza plan piloto en ...,Centro de negocios Sercotec coordina acuerdos ...,nuevos materiales comienza plan piloto saltos ...,['puerto varas'],2021-10,"[nuevos, materiales, comienza, plan, piloto, s..."
2,2021-10-01,elheraldoaustral,https://www.eha.cl/noticia/local/centro-de-sal...,Centro de Salud Familiar CESFAM Puerto Varas i...,Las horas se solicitan en el SOME o bien a tra...,centro salud familiar cesfam puerto varas invi...,['puerto varas'],2021-10,"[centro, salud, familiar, cesfam, puerto, vara..."
3,2021-10-01,elheraldoaustral,https://www.eha.cl/noticia/local/alcalde-tomas...,Alcalde Tomás Gárate presidió por primera vez ...,Los y las consejeras destacaron el hecho de vo...,alcalde tomás gárate presidió primera vez octa...,"['castro', 'puerto varas']",2021-10,"[alcalde, tomas, garate, presidio, primera, ve..."
4,2021-10-01,elheraldoaustral,https://www.eha.cl/noticia/local/galeria-de-ar...,Galería de Arte Machacoya realizará remate de ...,"Hoy viernes a las 18:30 horas, en Machacoya At...",galería arte machacoya realizará remate obras ...,,2021-10,"[galeria, arte, machacoya, realizara, remate, ..."


### LDA

In [None]:
from pprint import pprint

def lda_model_example(df):
    model = tp.LDAModel(k=20, seed=1)  #k is the number of topics
    #Creating a corpus
    for text in df.tokens:
        model.add_doc(text)
    #Learning
    model.train(iter=100)
    #Extracting the word distribution of a topic
    for k in range(model.k):
        print(f"Topic {k}")
        pprint(model.get_topic_words(k, top_n=5))
    #for i in range(0, 1000, 10):
    #        model.train(10)
    #        print('Iteration: {}\tLog-likelihood: {}'.format(i, model.ll_per_word))

### HLDA
#### Parameters:

    tw : Union[int, TermWeight]
    term weighting scheme in TermWeight. The default value is TermWeight.ONE.

    min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.

    min_df : int
    minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded.

    rm_top : int
    the number of top words to be removed. If you want to remove overly common words from the model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.

    depth : int
    the maximum depth level of the hierarchy, between 2 and 32767.

    alpha : Union[float, Iterable[float]]
    hyperparameter of the Dirichlet distribution for document-depth level, given as a single float in case of a symmetric prior and as a list of floats of length depth in case of an asymmetric prior.

    eta : float
    hyperparameter of the Dirichlet distribution for topic-word.

    gamma : float
    concentration coefficient of the Dirichlet Process.

    seed : int
    random seed. The default value is a random number from std::random_device{} in C++.

    corpus : Corpus
    a list of documents to be added into the model.

    transform : Callable[dict, dict]
    a callable object to manipulate arbitrary keyword arguments for a specific topic model.
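
The difference between `min_cf`, `min_df` and `rm_top` is easiest to see on a tiny corpus. The sketch below is a pure-Python illustration of the semantics described above, not tomotopy's own code; the helper `filter_vocab`, its alphabetical tie-breaking, and the toy documents are assumptions made for the example.

```python
from collections import Counter

def filter_vocab(docs, min_cf=0, min_df=0, rm_top=0):
    """Return the set of words kept after applying min_cf, min_df and rm_top."""
    # Collection frequency: total occurrences across the whole corpus.
    cf = Counter(w for doc in docs for w in doc)
    # Document frequency: number of documents that contain the word.
    df = Counter(w for doc in docs for w in set(doc))

    kept = [w for w in cf if cf[w] >= min_cf and df[w] >= min_df]
    # Drop the rm_top most frequent remaining words
    # (ties broken alphabetically, an assumption of this sketch).
    kept.sort(key=lambda w: (-cf[w], w))
    return set(kept[rm_top:])


# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["ano", "salud", "ano", "pymes"],
    ["ano", "salud", "concejo"],
    ["ano", "concejo", "concejo"],
]

print(filter_vocab(toy_docs, min_cf=2))            # drops 'pymes' (cf = 1)
print(filter_vocab(toy_docs, min_df=2, rm_top=1))  # also drops 'ano', the most frequent word
```

In the second call, "concejo" occurs three times but only in two documents, so it passes `min_df=2`, while "ano" is removed as the single most frequent word.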


In [None]:
def h_example_1(df):
    h_model = tp.HLDAModel(depth=4, seed=1)  # depth is the number of levels in the topic hierarchy

    for text in df.tokens:
        h_model.add_doc(text)

    for i in range(0, 100, 10): #Train the model using Gibbs-sampling
        h_model.train(10)
        print('Iteration: {}\tLog-likelihood: {}'.format(i, h_model.ll_per_word))

    for i, doc in enumerate(h_model.docs):
        print('Topic Distribution of Doc #{}'.format(i))
        for topic_id, weight in zip(doc.path, doc.get_topic_dist()):
            print('Topic #{}: {}'.format(topic_id, weight))

#### test 2

In [7]:
mdl = tp.HLDAModel(depth=3, min_cf=100)

for text in df.tokens:
    mdl.add_doc(text)

print('Training model by iterating over the corpus 100 times, 10 iterations at a time')
iterations = 10
for i in range(0, 100, iterations):
    mdl.train(iterations)
    print('Iteration: #{}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))


Training model by iterating over the corpus 100 times, 10 iterations at a time
Iteration: #0	Log-likelihood: -7.738148284277556
Iteration: #10	Log-likelihood: -7.543677743851485
Iteration: #20	Log-likelihood: -7.4911480982741745
Iteration: #30	Log-likelihood: -7.45938305195127
Iteration: #40	Log-likelihood: -7.433325272984932
Iteration: #50	Log-likelihood: -7.408519514201089
Iteration: #60	Log-likelihood: -7.393356676805286
Iteration: #70	Log-likelihood: -7.379991930791562
Iteration: #80	Log-likelihood: -7.368259749951029
Iteration: #90	Log-likelihood: -7.3620729498596935
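
One way to read the log above is to convert the per-word log-likelihood into perplexity, `exp(-ll_per_word)`, and to look at how much each block of 10 iterations still improves the fit. The sketch below copies the printed values by hand; it assumes the log-likelihood uses the natural logarithm, and the variable names are hypothetical.

```python
import math

# Per-word log-likelihood values copied from the training log above.
ll_per_word = [
    -7.738148284277556, -7.543677743851485, -7.4911480982741745,
    -7.45938305195127, -7.433325272984932, -7.408519514201089,
    -7.393356676805286, -7.379991930791562, -7.368259749951029,
    -7.3620729498596935,
]

# Perplexity is the exponential of the negative per-word log-likelihood.
perplexities = [math.exp(-ll) for ll in ll_per_word]

# Improvement in log-likelihood between consecutive checkpoints.
gains = [after - before for before, after in zip(ll_per_word, ll_per_word[1:])]

for step, ppl in enumerate(perplexities):
    print(f"After {10 * (step + 1):3d} iterations: perplexity = {ppl:.1f}")
print(f"First gain: {gains[0]:.4f}, last gain: {gains[-1]:.4f}")
```

The gains shrink steadily but are still positive at the last checkpoint, which suggests the model had not fully converged after 100 iterations.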


In [22]:
for k in range(mdl.k):
    if not mdl.is_live_topic(k):
        continue
    print('child of topic #{0} - Level: {1}, number of documents {2}'.format(mdl.parent_topic(k), mdl.level(k), mdl.num_docs_of_topic(k)) )
    print('Top 10 words of global topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))
    print("###########################################################")


child of topic #-1 - Level: 0, number of documents 9965
Top 10 words of global topic #0
[('ano', 0.015291642397642136), ('parte', 0.006977279204875231), ('chile', 0.006798227317631245), ('ademas', 0.006301502231508493), ('ser', 0.0060531399212777615), ('dia', 0.006044476293027401), ('asi', 0.005440897773951292), ('trabajo', 0.0053889150731265545), ('do', 0.004727577790617943), ('toda', 0.004586069379001856)]
###########################################################
child of topic #0 - Level: 1, number of documents 7125
Top 10 words of global topic #8
[('persona', 0.00877594854682684), ('ano', 0.007453262805938721), ('salud', 0.006539979949593544), ('dia', 0.005287277977913618), ('region', 0.004779898561537266), ('parte', 0.004734409507364035), ('ademas', 0.004541954956948757), ('regional', 0.004415985196828842), ('mil', 0.004202535841614008), ('si', 0.004143049940466881)]
###########################################################
child of topic #0 - Level: 1, number of documents 32


[('pymes', 0.03600487858057022), ('empresas', 0.03400572016835213), ('procedimientos', 0.03000739775598049), ('emprendedores', 0.024009916931390762), ('actual', 0.020011596381664276), ('escenario', 0.018012436106801033), ('crisis', 0.01601327583193779), ('tamano', 0.01601327583193779), ('pandemia', 0.01601327583193779), ('alternativas', 0.01601327583193779)]
###########################################################
child of topic #8 - Level: 2, number of documents 1
Top 10 words of global topic #1933
[('municipal', 0.07055990397930145), ('concejo', 0.052934613078832626), ('renuncia', 0.0353093259036541), ('senalo', 0.0353093259036541), ('cargo', 0.0353093259036541), ('alcalde', 0.02943423204123974), ('funcionario', 0.02943423204123974), ('situacion', 0.02943423204123974), ('funcionaria', 0.02355913445353508), ('derechos', 0.02355913445353508)]
###########################################################
child of topic #8 - Level: 2, number of documents 2
Top 10 words of global topic #

In [None]:
# Keep only live topics that have more than 100 documents assigned.
parent_topics = [k for k in range(mdl.k) if mdl.is_live_topic(k) and mdl.num_docs_of_topic(k) > 100]

for parent_topic in parent_topics:
    # Children of this topic that also have more than 100 documents.
    child_topics = [child_topic for child_topic in mdl.children_topics(parent_topic) if mdl.num_docs_of_topic(child_topic) > 100]
    if child_topics:
        print('\n\n')
    print('Top 10 words of level %s parent topic #%s of %s documents: %r' % (mdl.level(parent_topic), parent_topic, mdl.num_docs_of_topic(parent_topic), mdl.get_topic_words(parent_topic, top_n=10)))

    for child_topic in child_topics:
        print('    Top 10 words of child topic #%s: %r' % (child_topic, mdl.get_topic_words(child_topic, top_n=10)))
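
The parent ids printed by the `In [22]` cell are enough to rebuild the topic tree without calling the model again. The following sketch is an illustration only: the helpers `build_tree` and `render` are hypothetical, and the `parents` dictionary contains just the three topics whose ids are visible in the output above.

```python
from collections import defaultdict

def build_tree(parents):
    """Map each parent id to its sorted children, given {topic_id: parent_id}."""
    children = defaultdict(list)
    for topic, parent in parents.items():
        children[parent].append(topic)
    return {parent: sorted(kids) for parent, kids in children.items()}

def render(children, node, depth=0, lines=None):
    """Return indented lines describing the subtree rooted at node."""
    lines = [] if lines is None else lines
    lines.append("  " * depth + f"topic #{node}")
    for child in children.get(node, []):
        render(children, child, depth + 1, lines)
    return lines


# Topic ids and parents as they appear in the printed output (-1 marks the root's parent).
parents = {0: -1, 8: 0, 1933: 8}
children = build_tree(parents)
tree_lines = render(children, 0)
print("\n".join(tree_lines))
```

With a trained model, the same dictionary could be filled from the calls already used in this notebook, for example `{k: mdl.parent_topic(k) for k in range(mdl.k) if mdl.is_live_topic(k)}`.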

### DTM