# Topic Modeling — With Tomotopy

<a href="https://colab.research.google.com/github/chu-ise/411A-2022/blob/main/notebooks/09/09-03_tomotopy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import warnings
warnings.filterwarnings('ignore')

## Install Packages

In [10]:
%%capture
%pip install tomotopy py7zr pyLDAvis pyvis

## Import Packages

In [3]:
import tomotopy as tp

## Prepare corpus

In [4]:
import gdown
import py7zr

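# Google Drive file id of the archive (named `file_id` to avoid shadowing the built-in `id`)
file_id = "18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J"

data_file = "enwiki-stemmed-1000.7z"
gdown.download(id=file_id, output=data_file, quiet=False, fuzzy=True)

# Extract the archive; the context manager closes it automatically
with py7zr.SevenZipFile(data_file, mode="r") as archive:
    archive.extractall()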

Downloading...
From: https://drive.google.com/uc?id=18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J
To: /content/enwiki-stemmed-1000.7z
100%|██████████| 3.81M/3.81M [00:00<00:00, 162MB/s]


In [5]:
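input_file = "enwiki-stemmed-1000.txt"
corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=["."])
# process() accepts an iterable that yields either a raw string or a tuple of
# (raw string, user data); here each line of the file is one document
with open(input_file, encoding="utf-8") as f:
    num_docs = corpus.process(f)
num_docs  # number of documents added to the corpus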


1000

## Train Topic Model

In [6]:
num_topics = 20
# Build the LDA model
mdl = tp.LDAModel(k=num_topics, min_cf=10, min_df=5, corpus=corpus)
# Give 'church' a prior weight of 1.0 for Topic 0 and 0.1 for every other topic,
# which biases a church-related topic toward Topic 0.
mdl.set_word_prior("church", [1.0 if k == 0 else 0.1 for k in range(num_topics)])
# Bias a topic related to 'softwar' toward Topic 1
mdl.set_word_prior("softwar", [1.0 if k == 1 else 0.1 for k in range(num_topics)])
# Bias a topic related to 'citi' toward Topic 2
mdl.set_word_prior("citi", [1.0 if k == 2 else 0.1 for k in range(num_topics)])
# train(0) initializes the model without running any sampling iterations
mdl.train(0)
print(
    "Num docs:",
    len(mdl.docs),
    ", Vocab size:",
    len(mdl.used_vocabs),
    ", Num words:",
    mdl.num_words,
)
print("Removed top words:", mdl.removed_top_words)
# Train for 1,000 iterations in total, logging every 10 iterations
for i in range(0, 1000, 10):
    mdl.train(10)
    print("Iteration: {}\tLog-likelihood: {}".format(i, mdl.ll_per_word))


Num docs: 1000 , Vocab size: 11024 , Num words: 1316911
Removed top words: []
Iteration: 0	Log-likelihood: -9.001531664336264
Iteration: 10	Log-likelihood: -8.653416101979964
Iteration: 20	Log-likelihood: -8.548952155364795
Iteration: 30	Log-likelihood: -8.488660258156584
Iteration: 40	Log-likelihood: -8.449558563431012
Iteration: 50	Log-likelihood: -8.420854252792374
Iteration: 60	Log-likelihood: -8.401128360933047
Iteration: 70	Log-likelihood: -8.388665685094589
Iteration: 80	Log-likelihood: -8.374544046773378
Iteration: 90	Log-likelihood: -8.363495254541748
Iteration: 100	Log-likelihood: -8.355824003091165
Iteration: 110	Log-likelihood: -8.349268641099055
Iteration: 120	Log-likelihood: -8.347479702794454
Iteration: 130	Log-likelihood: -8.34110153160477
Iteration: 140	Log-likelihood: -8.337443794912282
Iteration: 150	Log-likelihood: -8.331191583389625
Iteration: 160	Log-likelihood: -8.328939017415056
Iteration: 170	Log-likelihood: -8.325948569095434
Iteration: 180	Log-likelihood: -8.
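
The `set_word_prior` calls in the cell above bias a word toward one topic rather than pinning it there. A minimal sketch of that weight vector in plain Python, assuming the weights are read relative to one another (the helper name `seeded_prior` is hypothetical and not part of tomotopy):

```python
def seeded_prior(num_topics, target_topic, strong=1.0, weak=0.1):
    """Build a per-topic prior that favors `target_topic` (hypothetical helper)."""
    return [strong if k == target_topic else weak for k in range(num_topics)]


prior = seeded_prior(num_topics=20, target_topic=0)

# Share of the total prior weight that falls on the target topic.
# Assumption: the weights act as relative preferences across topics.
target_share = prior[0] / sum(prior)
print(f"Prior share on topic 0: {target_share:.3f}")
```

Because the other 19 topics keep a nonzero weight, the data can still pull a seeded word elsewhere. In the results further down, `citi` leads Topic 5 rather than Topic 2, which fits this reading.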

In [7]:
mdl.summary()

<Basic Info>
| LDAModel (current version: 0.12.2)
| 1000 docs, 1316911 words
| Total Vocabs: 92725, Used Vocabs: 11024
| Entropy of words: 8.21178
| Entropy of term-weighted words: 8.21178
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 1000, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -8.22298
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 10 (minimum collection frequency of words)
| min_df: 5 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 20 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 1500386825 (random seed)
| trained in version 0.12.2
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [0.12
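
The summary lists `min_cf` and `min_df` as the minimum collection and document frequencies of words. A rough sketch of that kind of vocabulary pruning on a toy corpus, in plain Python; the function `filter_vocab` and the toy documents are illustrative assumptions, and tomotopy's own implementation may differ in detail:

```python
from collections import Counter


def filter_vocab(docs, min_cf=0, min_df=0):
    """Keep words whose collection and document frequencies meet the minimums."""
    cf = Counter(w for doc in docs for w in doc)       # total occurrences
    df = Counter(w for doc in docs for w in set(doc))  # documents containing w
    return sorted(w for w in cf if cf[w] >= min_cf and df[w] >= min_df)


toy_docs = [
    ["church", "bishop", "church"],
    ["church", "citi"],
    ["citi", "river", "citi"],
    ["rare", "rare", "rare"],
]

# "rare" occurs 3 times but only in one document, so min_df=2 drops it.
kept = filter_vocab(toy_docs, min_cf=2, min_df=2)
print(kept)
```

The same kind of filtering would explain the gap between `Total Vocabs` and `Used Vocabs` reported in the summary.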

In [8]:
for k in range(mdl.k):
    print("== Topic #{} ==".format(k))
    for word, prob in mdl.get_topic_words(k, top_n=10):
        print(word, prob, sep='\t')
    print()

== Topic #0 ==
church	0.03874339535832405
god	0.0188911110162735
christian	0.015643028542399406
anglican	0.011511466465890408
cathol	0.010809880681335926
bishop	0.010030340403318405
centuri	0.00995238684117794
tradit	0.006522410549223423
saint	0.0064184716902673244
augustin	0.006106655579060316

== Topic #1 ==
use	0.01917710155248642
appl	0.016603130847215652
comput	0.015563314780592918
system	0.01498374529182911
compani	0.00975057389587164
design	0.007977773435413837
develop	0.007858450524508953
product	0.007619804237037897
program	0.007227742578834295
softwar	0.007142511662095785

== Topic #2 ==
game	0.01663792133331299
team	0.01656859740614891
club	0.012755793519318104
first	0.012432282790541649
leagu	0.012131880037486553
footbal	0.011184455826878548
season	0.011022700928151608
play	0.010190816596150398
australia	0.010052168741822243
new	0.008919881656765938

== Topic #3 ==
state	0.03189214691519737
american	0.017832506448030472
court	0.011096411384642124
parti	0.010352307930588722


In [9]:
num_topic_words = 10

print("\nTopic Model Results:\n\n")
# Print out top 10 words for each topic
topics = []
topic_individual_words = []
for topic_number in range(0, num_topics):
    topic_words = " ".join(
        f"{word}({prob*100:.1f})"
        for word, prob in mdl.get_topic_words(
            topic_id=topic_number, top_n=num_topic_words
        )
    )
    topics.append(topic_words)
    topic_individual_words.append(topic_words.split())
    print(f"✨Topic {topic_number}✨\n\n{topic_words}\n")


Topic Model Results:


✨Topic 0✨

church(3.9) god(1.9) christian(1.6) anglican(1.2) cathol(1.1) bishop(1.0) centuri(1.0) tradit(0.7) saint(0.6) augustin(0.6)

✨Topic 1✨

use(1.9) appl(1.7) comput(1.6) system(1.5) compani(1.0) design(0.8) develop(0.8) product(0.8) program(0.7) softwar(0.7)

✨Topic 2✨

game(1.7) team(1.7) club(1.3) first(1.2) leagu(1.2) footbal(1.1) season(1.1) play(1.0) australia(1.0) new(0.9)

✨Topic 3✨

state(3.2) american(1.8) court(1.1) parti(1.0) nation(1.0) govern(1.0) elect(0.9) unit(0.9) presid(0.8) new(0.8)

✨Topic 4✨

apollo(3.0) aircraft(2.1) flight(1.7) mission(1.6) lunar(1.4) airlin(1.3) crew(1.2) space(1.1) land(1.1) moon(1.1)

✨Topic 5✨

citi(1.7) area(1.1) popul(0.8) south(0.7) north(0.7) region(0.7) state(0.7) centuri(0.6) river(0.6) includ(0.6)

✨Topic 6✨

day(1.8) countri(1.3) nation(1.2) jew(1.0) state(0.9) azerbaijan(0.9) group(0.8) soviet(0.8) islam(0.8) govern(0.8)

✨Topic 7✨

number(1.6) use(1.4) set(1.2) algorithm(1.1) algebra(1.1) comput(1.1) 

In [None]:
# calculate coherence using preset
for preset in ('u_mass', 'c_uci', 'c_npmi', 'c_v'):
    coh = tp.coherence.Coherence(mdl, coherence=preset)
    average_coherence = coh.get_score()
    coherence_per_topic = [coh.get_score(topic_id=k) for k in range(mdl.k)]
    print('==== Coherence : {} ===='.format(preset))
    print('Average:', average_coherence, '\nPer Topic:', coherence_per_topic)
    print()
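
To give a feel for what a coherence score measures, here is a simplified, document-level UMass-style calculation in plain Python. It is an illustrative sketch only: the function `umass_coherence`, the smoothing constant `eps`, and the toy documents are assumptions, and it does not reproduce tomotopy's `u_mass` preset exactly.

```python
import math


def umass_coherence(top_words, docs, eps=1.0):
    """Average log((D(w_i, w_j) + eps) / D(w_j)) over ordered word pairs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                continue  # skip words never seen in the reference corpus
            joint = doc_freq(top_words[i], top_words[j])
            scores.append(math.log((joint + eps) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


toy_docs = [
    ["church", "bishop", "saint"],
    ["church", "bishop"],
    ["game", "team", "leagu"],
    ["game", "team"],
    ["church", "saint"],
    ["team", "leagu"],
]

coherent = umass_coherence(["church", "bishop", "saint"], toy_docs)
mixed = umass_coherence(["church", "team", "saint"], toy_docs)
print("coherent topic:", coherent)
print("mixed topic:", mixed)
```

In this sketch, a topic whose top words tend to appear in the same documents scores closer to zero than one that mixes unrelated words.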

In [11]:
import numpy as np
import pyLDAvis

topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)
doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq

prepared_data = pyLDAvis.prepare(
    topic_term_dists,
    doc_topic_dists,
    doc_lengths,
    vocab,
    term_frequency,
    start_index=0,  # tomotopy starts topic ids with 0, pyLDAvis with 1
    sort_topics=False,  # IMPORTANT: otherwise the topic_ids between pyLDAvis and tomotopy are not matching!
)


pyLDAvis.enable_notebook()
pyLDAvis.display(prepared_data)



## Correlated Topic Model: Training with tomotopy and Visualizing Topic Correlations

In [1]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.corpus import stopwords
import re
import tomotopy as tp

porter_stemmer = nltk.PorterStemmer().stem
english_stops = set(porter_stemmer(w) for w in stopwords.words('english'))
pat = re.compile('^[a-z]{2,}$')
corpus = tp.utils.Corpus(
    tokenizer=tp.utils.SimpleTokenizer(porter_stemmer), 
    stopwords=lambda x: x in english_stops or not pat.match(x)
)
newsgroups_train = fetch_20newsgroups()
corpus.process(d.lower() for d in newsgroups_train.data)

11314

In [3]:
mdl = tp.CTModel(tw=tp.TermWeight.IDF, min_df=5, rm_top=40, k=30, corpus=corpus)
mdl.train(0)

In [None]:
# With more than ten thousand documents, setting `num_beta_sample`
# to a smaller value should not noticeably reduce accuracy.
mdl.num_beta_sample = 5
print('Num docs:{}, Num Vocabs:{}, Total Words:{}'.format(
    len(mdl.docs), len(mdl.used_vocabs), mdl.num_words
))
print('Removed Top words: ', *mdl.removed_top_words)

# Let's train the model
for i in range(0, 1000, 20):
    print('Iteration: {:04}, LL per word: {:.4}'.format(i, mdl.ll_per_word))
    mdl.train(20)
print('Iteration: {:04}, LL per word: {:.4}'.format(1000, mdl.ll_per_word))

mdl.summary()


Num docs:11314, Num Vocabs:16117, Total Words:1459952
Removed Top words:  ax edu line subject com organ use one would write articl like get peopl univers know think time say max go make also system work new good year want ca could right need even well see may problem thing us
Iteration: 0000, LL per word: -9.174
Iteration: 0020, LL per word: -7.185
Iteration: 0040, LL per word: -6.888
Iteration: 0060, LL per word: -6.786
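
The per-word log-likelihood printed by the loop can be easier to read as perplexity. A small sketch, assuming the values are natural-log likelihoods so that perplexity is `exp(-ll_per_word)`; the two inputs are copied from the first and last output lines shown above, and the helper `perplexity` is defined here for illustration:

```python
import math


def perplexity(ll_per_word):
    """Convert a per-word log-likelihood (natural log assumed) to perplexity."""
    return math.exp(-ll_per_word)


for iteration, ll in [(0, -9.174), (60, -6.786)]:
    print(f"Iteration {iteration:04}: perplexity ~ {perplexity(ll):.0f}")
```

Lower perplexity means the model is less surprised by the words it sees. Because this model uses IDF term weighting, these numbers may not be directly comparable with those of the LDA model trained earlier.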

In [None]:
from pyvis.network import Network

# Let's visualize the result
g = Network(width=800, height=800, font_color="#333")
# Flatten and sort all pairwise correlations (the k diagonal entries are the largest)
correl = mdl.get_correlations().reshape([-1])
correl.sort()
# Cutoff that keeps roughly the top tenth of the off-diagonal correlations
num_top = mdl.k * (mdl.k - 1) // 10
threshold = correl[-mdl.k - num_top]

for k in range(mdl.k):
    label = "#{}".format(k)
    title = ' '.join(word for word, _ in mdl.get_topic_words(k, top_n=6))
    print('Topic', label, title)
    g.add_node(k, label=label, title=title, shape='ellipse')
    # Link topic k to each earlier topic l < k whose correlation clears the cutoff
    for l, correlation in zip(range(k), mdl.get_correlations(k)):
        if correlation < threshold:
            continue
        g.add_edge(k, l, value=float(correlation), title='{:.02}'.format(correlation))

g.barnes_hut(gravity=-1000, spring_length=20)
g.show_buttons()
g.show("topic_network.html")
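
The cutoff used to pick edges in the cell above is compact, so here is the same idea on a toy 5-topic correlation matrix in plain Python. The matrix values are made up for illustration; the sketch assumes, as the cell above does, that the diagonal entries are the `k` largest values and that each off-diagonal pair appears twice in the flattened matrix.

```python
# Toy symmetric correlation matrix for 5 topics (illustrative values only).
corr = [
    [1.0, 0.8, 0.1, -0.2, 0.0],
    [0.8, 1.0, 0.3, 0.05, -0.1],
    [0.1, 0.3, 1.0, 0.6, 0.2],
    [-0.2, 0.05, 0.6, 1.0, -0.3],
    [0.0, -0.1, 0.2, -0.3, 1.0],
]
k = len(corr)

# Flatten and sort ascending, so the largest values sit at the end.
flat = sorted(value for row in corr for value in row)

num_top = k * (k - 1) // 10     # a tenth of the off-diagonal entries
threshold = flat[-k - num_top]  # step past the k diagonal ones, then take the cutoff

# Keep each unordered pair once (l < t) when it clears the cutoff.
edges = [(t, l) for t in range(k) for l in range(t) if corr[t][l] >= threshold]
print("threshold:", threshold)
print("edges:", edges)
```

Only the most strongly correlated topic pairs survive the cutoff, which keeps the network readable when `k` is large.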