# Topic Modeling — With Tomotopy

<a href="https://colab.research.google.com/github/chu-ise/411A-2022/blob/main/notebooks/09/09-03_tomotopy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [59]:
import warnings
warnings.filterwarnings('ignore')

## Install Packages

In [None]:
%pip install tomotopy py7zr gdown pyLDAvis

## Import Packages

In [41]:
import tomotopy as tp

## Prepare corpus

In [49]:
import gdown
import py7zr

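# Google Drive file id of the archive (named file_id to avoid shadowing the built-in id)
file_id = "18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J"

data_file = "enwiki-stemmed-1000.7z"
gdown.download(id=file_id, output=data_file, quiet=False, fuzzy=True)

# Extract into the working directory; the context manager closes the archive
with py7zr.SevenZipFile(data_file, mode="r") as archive:
    archive.extractall()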

In [50]:
input_file = "enwiki-stemmed-1000.txt"
corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=["."])
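# corpus.process accepts an iterable that yields either a raw string or a
# (raw string, user data) tuple; here each line of the file becomes one document.
with open(input_file, encoding="utf-8") as f:
    num_docs = corpus.process(f)
num_docs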


1000

## Train Topic Model

In [51]:
num_topics = 20
# Create the LDA model (training happens in the loop below)
mdl = tp.LDAModel(k=num_topics, min_cf=10, min_df=5, corpus=corpus)
# Give 'church' a prior weight of 1.0 in Topic 0 and 0.1 in every other topic.
# This biases a 'church'-related topic toward Topic 0; it is a prior, not a hard constraint.
mdl.set_word_prior("church", [1.0 if k == 0 else 0.1 for k in range(num_topics)])
# Bias a topic related to 'softwar' toward Topic 1
mdl.set_word_prior("softwar", [1.0 if k == 1 else 0.1 for k in range(num_topics)])
# Bias a topic related to 'citi' toward Topic 2
mdl.set_word_prior("citi", [1.0 if k == 2 else 0.1 for k in range(num_topics)])
# train(0) runs no sampling iterations; it only initializes the model
# so that the corpus statistics printed below are available
mdl.train(0)
print(
    "Num docs:",
    len(mdl.docs),
    ", Vocab size:",
    len(mdl.used_vocabs),
    ", Num words:",
    mdl.num_words,
)
print("Removed top words:", mdl.removed_top_words)
for i in range(0, 1000, 10):
    mdl.train(10)
    print("Iteration: {}\tLog-likelihood: {}".format(i, mdl.ll_per_word))


Num docs: 1000 , Vocab size: 11024 , Num words: 1316911
Removed top words: []
Iteration: 990	Log-likelihood: -8.240544343075326
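
The per-word log-likelihood printed above is easier to compare across runs when converted to perplexity. The sketch below assumes the value is a natural-log likelihood, so perplexity is `exp(-ll_per_word)`; the helper name `perplexity_from_ll` is made up for illustration. tomotopy also exposes `mdl.perplexity` directly, so this only shows the arithmetic.

```python
import math


def perplexity_from_ll(ll_per_word: float) -> float:
    """Convert a natural-log per-word log-likelihood into perplexity."""
    return math.exp(-ll_per_word)


# Value copied from the last training line printed above
final_ll = -8.240544343075326
print(f"Perplexity: {perplexity_from_ll(final_ll):.1f}")
```

Lower perplexity means the model assigns higher probability to the training words. It says nothing by itself about how interpretable the topics are.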


In [54]:
mdl.summary()

<Basic Info>
| LDAModel (current version: 0.12.2)
| 1000 docs, 1316911 words
| Total Vocabs: 92725, Used Vocabs: 11024
| Entropy of words: 8.21178
| Entropy of term-weighted words: 8.21178
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 1000, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -8.24054
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 10 (minimum collection frequency of words)
| min_df: 5 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 20 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 2932769256 (random seed)
| trained in version 0.12.2
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [0.20
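
The summary reports 92725 total vocabulary entries but only 11024 used ones; the gap comes from the `min_cf` and `min_df` thresholds. The sketch below shows the idea on a toy corpus in plain Python. It is not tomotopy's implementation: `filter_vocab` and `toy_docs` are hypothetical, and the use of `>=` for both thresholds is an assumption about the exact cut-off.

```python
from collections import Counter


def filter_vocab(docs, min_cf=0, min_df=0):
    """Keep words whose collection and document frequencies meet both thresholds."""
    cf = Counter()  # collection frequency: total occurrences across all documents
    df = Counter()  # document frequency: number of documents containing the word
    for tokens in docs:
        cf.update(tokens)
        df.update(set(tokens))
    return {w for w in cf if cf[w] >= min_cf and df[w] >= min_df}


toy_docs = [
    ["church", "king", "church"],
    ["church", "citi"],
    ["citi", "softwar", "softwar", "softwar"],
]
kept = filter_vocab(toy_docs, min_cf=2, min_df=2)
print(sorted(kept))  # ['church', 'citi']
```

Here `softwar` occurs three times but in a single document, so it passes `min_cf` and fails `min_df`. That is the case `min_df` is meant to catch.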

In [53]:
for k in range(mdl.k):
    print("== Topic #{} ==".format(k))
    for word, prob in mdl.get_topic_words(k, top_n=10):
        print(word, prob, sep='\t')
    print()

== Topic #0 ==
church	0.022353215143084526
king	0.014985035173594952
alexand	0.012221967801451683
roman	0.011286325752735138
son	0.008654833771288395
emperor	0.007631475105881691
christian	0.0071344152092933655
cathol	0.006739691365510225
year	0.006637355778366327
anglican	0.006476542446762323

== Topic #1 ==
film	0.012897787615656853
first	0.009338054805994034
play	0.008005264215171337
year	0.007836556993424892
time	0.006874922662973404
work	0.005879547446966171
book	0.005862676538527012
stori	0.0053396825678646564
award	0.005322812125086784
also	0.005221587605774403

== Topic #2 ==
citi	0.022447897121310234
univers	0.014521674253046513
club	0.01054862979799509
footbal	0.010429439134895802
team	0.00971429143100977
australia	0.009575234726071358
school	0.009376582689583302
leagu	0.008840220980346203
game	0.008641568943858147
play	0.008204534649848938

== Topic #3 ==
anim	0.023545561358332634
speci	0.015288359485566616
use	0.012181689962744713
plant	0.01087361853569746
tank	0.0093747861

In [58]:
num_topic_words = 10

print("\nTopic Model Results:\n\n")
# Print out top 10 words for each topic
topics = []
topic_individual_words = []
for topic_number in range(0, num_topics):
    topic_words = " ".join(
        f"{word}({prob*100:.1f})"
        for word, prob in mdl.get_topic_words(
            topic_id=topic_number, top_n=num_topic_words
        )
    )
    topics.append(topic_words)
    topic_individual_words.append(topic_words.split())
    print(f"✨Topic {topic_number}✨\n\n{topic_words}\n")


Topic Model Results:


✨Topic 0✨

church(2.2) king(1.5) alexand(1.2) roman(1.1) son(0.9) emperor(0.8) christian(0.7) cathol(0.7) year(0.7) anglican(0.6)

✨Topic 1✨

film(1.3) first(0.9) play(0.8) year(0.8) time(0.7) work(0.6) book(0.6) stori(0.5) award(0.5) also(0.5)

✨Topic 2✨

citi(2.2) univers(1.5) club(1.1) footbal(1.0) team(1.0) australia(1.0) school(0.9) leagu(0.9) game(0.9) play(0.8)

✨Topic 3✨

anim(2.4) speci(1.5) use(1.2) plant(1.1) tank(0.9) famili(0.8) includ(0.8) gun(0.8) order(0.8) armour(0.7)

✨Topic 4✨

use(1.9) appl(1.9) comput(1.9) system(1.6) program(0.9) softwar(0.8) design(0.8) develop(0.8) languag(0.7) code(0.7)

✨Topic 5✨

war(1.7) armi(1.3) forc(1.2) militari(0.9) attack(0.9) battl(0.8) jew(0.7) empir(0.7) kill(0.6) defeat(0.5)

✨Topic 6✨

art(3.5) work(1.7) de(1.5) music(1.4) build(1.1) artist(1.1) french(1.1) style(1.0) design(0.9) pari(0.9)

✨Topic 7✨

state(3.0) american(1.8) court(1.1) parti(1.0) british(0.9) elect(0.9) new(0.9) presid(0.8) vote(0.8) linco
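
Since the word priors only bias the sampler, it is worth checking where each seeded word actually ended up. The sketch below does this on the top-10 lists printed above for Topics 0, 1, 2 and 4. `locate_seeds` is a hypothetical helper, and looking only at the top 10 words is a simplifying assumption.

```python
# Top-10 words copied from the output above (Topics 0, 1, 2 and 4 only)
top_words = {
    0: ["church", "king", "alexand", "roman", "son",
        "emperor", "christian", "cathol", "year", "anglican"],
    1: ["film", "first", "play", "year", "time",
        "work", "book", "stori", "award", "also"],
    2: ["citi", "univers", "club", "footbal", "team",
        "australia", "school", "leagu", "game", "play"],
    4: ["use", "appl", "comput", "system", "program",
        "softwar", "design", "develop", "languag", "code"],
}

# Seed word -> topic id it was biased toward with set_word_prior
seeds = {"church": 0, "softwar": 1, "citi": 2}


def locate_seeds(top_words, seeds):
    """For each seed word, list the topics whose top words contain it."""
    report = {}
    for word, intended in seeds.items():
        found = [k for k, words in top_words.items() if word in words]
        report[word] = {"intended": intended, "found_in": found}
    return report


for word, info in locate_seeds(top_words, seeds).items():
    status = "ok" if info["intended"] in info["found_in"] else "moved"
    print(f"{word}: intended {info['intended']}, found in {info['found_in']} ({status})")
```

In this run `church` and `citi` sit in their intended topics, while `softwar` shows up in Topic 4 rather than Topic 1. To run the same check on a trained model, `top_words` could be built from `mdl.get_topic_words(k, top_n=10)` for each `k`.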

In [60]:
import numpy as np
import pyLDAvis

topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)
doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq

prepared_data = pyLDAvis.prepare(
    topic_term_dists,
    doc_topic_dists,
    doc_lengths,
    vocab,
    term_frequency,
    start_index=0,  # tomotopy numbers topics from 0, pyLDAvis from 1 by default
    sort_topics=False,  # IMPORTANT: keep tomotopy's topic order so topic ids match in both tools
)


pyLDAvis.enable_notebook()
pyLDAvis.display(prepared_data)
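
The arrays passed to `pyLDAvis.prepare` must agree in shape, and both distribution matrices should have rows that sum to one. The sketch below is a small pre-flight check in NumPy, run on toy data. `check_prepare_inputs` is a hypothetical helper rather than part of pyLDAvis, and the tolerance is an arbitrary choice.

```python
import numpy as np


def check_prepare_inputs(topic_term, doc_topic, doc_lengths, vocab, term_freq, atol=1e-6):
    """Raise ValueError if the inputs are inconsistent; return True otherwise."""
    topic_term = np.asarray(topic_term, dtype=float)
    doc_topic = np.asarray(doc_topic, dtype=float)
    n_topics, n_terms = topic_term.shape
    n_docs = doc_topic.shape[0]

    if doc_topic.shape[1] != n_topics:
        raise ValueError("doc_topic must have one column per topic")
    if len(doc_lengths) != n_docs:
        raise ValueError("doc_lengths must have one entry per document")
    if len(vocab) != n_terms or len(term_freq) != n_terms:
        raise ValueError("vocab and term_freq must have one entry per term")
    if not np.allclose(topic_term.sum(axis=1), 1.0, atol=atol):
        raise ValueError("each topic-term row must sum to 1")
    if not np.allclose(doc_topic.sum(axis=1), 1.0, atol=atol):
        raise ValueError("each document-topic row must sum to 1")
    return True


# Toy inputs: 2 topics, 3 terms, 4 documents
toy_topic_term = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
toy_doc_topic = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5], [0.2, 0.8]]
toy_doc_lengths = [10, 20, 5, 8]
toy_vocab = ["a", "b", "c"]
toy_term_freq = [12, 9, 22]

print(check_prepare_inputs(toy_topic_term, toy_doc_topic, toy_doc_lengths,
                           toy_vocab, toy_term_freq))  # True
```

Calling this with the real `topic_term_dists`, `doc_topic_dists`, `doc_lengths`, `vocab` and `term_frequency` before `pyLDAvis.prepare` gives a more direct error message when something is misaligned.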


