# Tutorial: Combined Topic Modeling

(last updated 10-07-2022)

In this tutorial, we are going to use our **Combined Topic Model** to get the topics out of a collection of articles.

## Topic Models

Topic models allow you to discover latent topics in your documents in a completely unsupervised way. Just use your documents and get topics out.

## Contextualized Topic Models

![](https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo.png)

What are Contextualized Topic Models? **CTMs** are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised capabilities of topic models to get topics out of documents.

## Python Package

You can find our package [here](https://github.com/MilaNLProc/contextualized-topic-models).

![https://github.com/MilaNLProc/contextualized-topic-models/actions](https://github.com/MilaNLProc/contextualized-topic-models/workflows/Python%20package/badge.svg) ![https://pypi.python.org/pypi/contextualized_topic_models](https://img.shields.io/pypi/v/contextualized_topic_models.svg) ![https://pepy.tech/badge/contextualized-topic-models](https://pepy.tech/badge/contextualized-topic-models)

# **Before you start...**

If you have additional questions about these topics, follow the links:

- you need to work with languages other than English: [click here!](https://contextualized-topic-models.readthedocs.io/en/latest/language.html#language-specific)
- you can't get good results with topic models: [click here!](https://contextualized-topic-models.readthedocs.io/en/latest/faq.html#i-am-getting-very-poor-results-what-can-i-do)
- you want to load your own embeddings: [click here!](https://contextualized-topic-models.readthedocs.io/en/latest/faq.html#can-i-load-my-own-embeddings)


# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit → Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# Installing Contextualized Topic Models

First, we install the contextualized topic models library.

In [1]:
#%%capture
#!pip install contextualized-topic-models==2.3.0

In [3]:
#%%capture
#!pip install pyldavis

## Restart the Notebook

For the changes to take effect, we now need to restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data

We are going to need some data. You should upload a file with one document per line. We assume you haven't run any preprocessing script.

However, if you want to test the model first without uploading your own data, you can use the test file downloaded below.

In [4]:
#%%capture
#!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt

In [5]:
#!head -n 2 dbpedia_sample_abstract_20k_unprep.txt

In [6]:
text_file = "dbpedia_sample_abstract_20k_unprep.txt" # EDIT THIS WITH THE FILE YOU UPLOAD
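
If your documents live in a Python list rather than in a file, a minimal sketch like the following writes them one per line. The file name `my_documents.txt` and the sample texts are placeholders, not part of the tutorial data. Newlines inside a document are collapsed so that each document stays on a single line.

```python
from pathlib import Path

# Hypothetical documents; replace with your own texts.
docs = [
    "First document.\nIt spans two lines in the source.",
    "Second document, already on a single line.",
]

# Collapse internal whitespace (including newlines) so that
# each document occupies exactly one line of the output file.
one_per_line = [" ".join(doc.split()) for doc in docs]

out_path = Path("my_documents.txt")  # placeholder file name
out_path.write_text("\n".join(one_per_line) + "\n", encoding="utf-8")

# Read the file back line by line to check the round trip.
with out_path.open(encoding="utf-8") as handle:
    loaded = [line.strip() for line in handle]
```

You could then point `text_file` at the file you just wrote.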

# Importing what we need

In [7]:
import os
import random
import argparse
import itertools
from os.path import join as pjoin

import torch
import numpy as np
import gensim
import gensim.downloader as api
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from sklearn.metrics.pairwise import cosine_distances
from tqdm import tqdm


from contextualized_topic_models.models.ctm import CombinedTM, ZeroShotTM
from contextualized_topic_models.models.pytorchavitm.avitm.avitm_model import AVITM_model
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
from contextualized_topic_models.evaluation.measures import TopicDiversity, CoherenceNPMI, CoherenceCV, CoherenceWordEmbeddings, InvertedRBO
from contextualized_topic_models.utils.visualize import save_word_dist_plot, save_histogram
from composite_activations import composite_activations


from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk

In [8]:
# Disable tokenizer warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"


## Preprocessing

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words (BoW). We also might want to keep only the most frequent words in the BoW, since too many words might not help.

In [9]:
from nltk.corpus import stopwords as stop_words

nltk.download('stopwords')

# Use only the first 400 documents of the file
with open(text_file, encoding="utf-8") as f:
    documents = [line.strip() for line in f.readlines()[0:400]]

print('Number of documents', len(documents))

bow_size = 2000  # maximum vocabulary size of the bag of words

sp = WhiteSpacePreprocessing(documents, vocabulary_size=bow_size, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()

Number of documents 400


[nltk_data] Downloading package stopwords to /home/felipe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
preprocessed_documents[:2]

['mid peninsula highway freeway across niagara peninsula canadian province ontario although highway connecting south niagara niagara international study published ministry',
 'died march 15 2007 american photographer wedding 1947 1970s operated studio silver spring maryland later lived florida magazine wedding photographer year 1990']
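
To make the preprocessing step concrete, here is a rough sketch of the kind of cleaning it performs: lowercasing, stripping punctuation, dropping stop words, and keeping only the most frequent words. It is an illustration under simplifying assumptions, not the implementation of `WhiteSpacePreprocessing`. The function name `simple_preprocess`, the toy documents, and the tiny stop word list are made up for the example.

```python
import string
from collections import Counter


def simple_preprocess(docs, vocabulary_size, stopwords):
    """Lowercase, strip punctuation, drop stop words, keep the most frequent words."""
    # Map every punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokenized = [
        [tok for tok in doc.lower().translate(table).split() if tok not in stopwords]
        for doc in docs
    ]
    # Keep only the `vocabulary_size` most frequent words across the corpus.
    counts = Counter(tok for doc in tokenized for tok in doc)
    vocab = {word for word, _ in counts.most_common(vocabulary_size)}
    cleaned = [" ".join(tok for tok in doc if tok in vocab) for doc in tokenized]
    return cleaned, sorted(vocab)


toy_docs = [
    "The highway crosses the peninsula.",
    "A highway, a freeway, and the peninsula!",
    "The photographer lived near the highway.",
]
toy_stopwords = {"the", "a", "and", "near"}  # tiny illustrative list
cleaned, vocab = simple_preprocess(toy_docs, vocabulary_size=2, stopwords=toy_stopwords)
```

The library call above also returns `retained_indices`, which this sketch does not reproduce.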

We don't discard the non-preprocessed texts, because we are going to use them as input for obtaining the contextualized document representations.

Let's pass our preprocessed and unpreprocessed data to our `TopicModelDataPreparation` object. This object takes care of creating the bag of words for you and of obtaining the contextualized BERT representations of the documents. This operation allows us to create our training dataset.

Note: Here we use the contextualized model "all-mpnet-base-v2".


In [12]:
tp = TopicModelDataPreparation("all-mpnet-base-v2")
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Let's check the first ten words of the vocabulary

In [13]:
tp.vocab[:10]

array(['000', '10', '11', '12', '13', '14', '15', '16', '1619', '165'],
      dtype=object)
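
The vocabulary printed above appears in sorted order, which is why number-like tokens come first. Each vocabulary entry corresponds to one column of the bag of words. The following sketch builds such a count matrix by hand for a toy corpus. It only illustrates the idea and is not how `TopicModelDataPreparation` is implemented; all names in it are made up for the example.

```python
toy_docs = ["highway peninsula highway", "peninsula freeway"]

# Sorted vocabulary: the position in this list is the column index in the bag of words.
vocab = sorted({tok for doc in toy_docs for tok in doc.split()})
token2idx = {tok: idx for idx, tok in enumerate(vocab)}

# One row per document, one column per vocabulary word; entries are word counts.
bow = [[0] * len(vocab) for _ in toy_docs]
for row, doc in enumerate(toy_docs):
    for tok in doc.split():
        bow[row][token2idx[tok]] += 1
```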

## Training our Combined TM

Finally, we can fit our new topic model. We will ask the model to find 20 topics in our collection.

In [14]:
# Preparing extra data 

In [15]:
# Boolean flags that switch the additional loss terms on or off
use_npmi_loss = True
use_glove_loss = True
use_diversity_loss = True

In [16]:
vocab_mask, npmi_matrix, word_vectors = None, None, None

In [17]:
def compute_npmi_matrix(training_dataset, text_for_bow):
    bow_corpus = [doc.split() for doc in text_for_bow] #this is new

    vocab_size = len(training_dataset.vocab)
    npmi_matrix = np.zeros((vocab_size, vocab_size))
    dictionary = Dictionary()
    dictionary.id2token = training_dataset.idx2token
    dictionary.token2id = {v:k for k, v in training_dataset.idx2token.items()}
    topics = [list(training_dataset.idx2token.values())]
    npmi = CoherenceModel(topics=topics, texts=bow_corpus, dictionary=dictionary, coherence='c_npmi', topn=len(topics[0]))
    segmented_topics = npmi.measure.seg(npmi.topics)
    accumulator = npmi.estimate_probabilities(segmented_topics)
    num_docs = accumulator.num_docs
    eps = 1e-12
    for w1, w2 in tqdm(segmented_topics[0]):
        w1_count = accumulator[w1]
        w2_count = accumulator[w2]
        co_occur_count = accumulator[w1, w2]
        
        p_w1_w2 = co_occur_count / num_docs
        p_w1 = (w1_count / num_docs)
        p_w2 = (w2_count / num_docs)
        npmi_matrix[w1, w2] = np.log((p_w1_w2 + eps) / (p_w1 * p_w2)) / -np.log(p_w1_w2  + eps)
    return npmi_matrix
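
The last line of the loop computes normalized pointwise mutual information: the log ratio between the joint probability of two words and the product of their individual probabilities, divided by the negative log of the joint probability. The sketch below evaluates the same expression on a toy corpus. It assumes document-level co-occurrence for simplicity, whereas gensim may estimate the probabilities differently (for example over sliding windows). The helper name `npmi` and the toy documents are made up for the example.

```python
import math


def npmi(docs, w1, w2, eps=1e-12):
    """NPMI of two words from document-level co-occurrence counts.

    Both words must occur in at least one document, otherwise the
    product p_w1 * p_w2 is zero and the division fails.
    """
    num_docs = len(docs)
    token_sets = [set(doc.split()) for doc in docs]
    p_w1 = sum(w1 in s for s in token_sets) / num_docs
    p_w2 = sum(w2 in s for s in token_sets) / num_docs
    p_w1_w2 = sum(w1 in s and w2 in s for s in token_sets) / num_docs
    # Same expression as in compute_npmi_matrix above.
    return math.log((p_w1_w2 + eps) / (p_w1 * p_w2)) / -math.log(p_w1_w2 + eps)


toy_docs = [
    "highway peninsula",
    "highway peninsula freeway",
    "photographer studio",
    "photographer wedding",
]
together = npmi(toy_docs, "highway", "peninsula")   # always co-occur
apart = npmi(toy_docs, "highway", "photographer")   # never co-occur
```

Words that always appear together score close to 1, while words that never co-occur get a strongly negative score whose exact value depends on `eps`.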


In [18]:
len(preprocessed_documents)

400

In [19]:
npmi_matrix = compute_npmi_matrix(training_dataset, text_for_bow=preprocessed_documents)

100%|██████████| 3982020/3982020 [01:56<00:00, 34095.38it/s]


In [20]:

# Run this once to convert the GloVe file to word2vec format:
# from gensim.scripts.glove2word2vec import glove2word2vec
# glove2word2vec(glove_input_file="contextualized_topic_models/data/glove.6B.50d.txt", word2vec_output_file="contextualized_topic_models/data/glove.6B.50d.w2vformat.txt")


def apply_glove_loss(training_dataset, glove_path='contextualized_topic_models/data/glove.6B.50d.w2vformat.txt'):
    wv = gensim.models.KeyedVectors.load_word2vec_format(glove_path)
    word_vectors = np.zeros((len(training_dataset.idx2token), wv.vector_size))
    missing_indices = []
    for idx, token in training_dataset.idx2token.items():
        if wv.has_index_for(token):
            word_vectors[idx] = wv.get_vector(token)
        else:
            missing_indices.append(idx)    

    return word_vectors, missing_indices

In [21]:
word_vectors, missing_indices = apply_glove_loss(training_dataset)
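
The lookup in `apply_glove_loss` can be illustrated without downloading GloVe. In the sketch below a plain dictionary stands in for the `KeyedVectors` object; the two-dimensional vectors and the `idx2token` mapping are invented for the example.

```python
# Stand-in for pretrained vectors: token -> vector (made-up 2-d values).
toy_vectors = {"highway": [0.1, 0.3], "peninsula": [0.7, 0.2]}
vector_size = 2

# Stand-in for training_dataset.idx2token.
idx2token = {0: "bacoside", 1: "highway", 2: "peninsula"}

# Rows default to zeros, so tokens without a pretrained vector keep an all-zero row.
word_vectors = [[0.0] * vector_size for _ in idx2token]
missing_indices = []
for idx, token in idx2token.items():
    if token in toy_vectors:
        word_vectors[idx] = list(toy_vectors[token])
    else:
        missing_indices.append(idx)
```

Keeping `missing_indices` around lets you check how much of the vocabulary has no pretrained vector.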

### Normal parameters

python run_topic_models.py --text_file resources/20news_unprep.txt --bow_file resources/20news_prep.txt \
--model_type combined --device 2 --use_npmi_loss --weight_lambda 100 --use_diversity_loss --weight_alpha 0.7 \
| tee 20news_combined_npmi_diversity_scores.out

In [24]:
model_type = 'combinedtm'
num_epochs = 10
num_topics = 20
device =  torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


In [26]:
weight_alpha = 0.5
weight_lambda = 100

In [27]:
device

device(type='cuda')

In [28]:
npmi_matrix

array([[ 0.        , -0.62684929, -0.62295401, ..., -0.58400722,
        -0.52575972, -0.52575972],
       [-0.62684929,  0.        ,  0.19909051, ..., -0.62025086,
        -0.56200335, -0.56200335],
       [-0.62295401,  0.19909051,  0.        , ..., -0.61635557,
        -0.55810807, -0.55810807],
       ...,
       [-0.58400722, -0.62025086, -0.61635557, ...,  0.        ,
        -0.51916129, -0.51916129],
       [-0.52575972, -0.56200335, -0.55810807, ..., -0.51916129,
         0.        , -0.46091379],
       [-0.52575972, -0.56200335, -0.55810807, ..., -0.51916129,
        -0.46091379,  0.        ]])

In [31]:
ctm = CombinedTM(
    bow_size=len(training_dataset.idx2token),
    contextual_size=768,
    n_components=num_topics,
    num_epochs=num_epochs,
    device=device,
    use_npmi_loss=use_npmi_loss,
    npmi_matrix=npmi_matrix,
    vocab_mask=vocab_mask,
    use_diversity_loss=use_diversity_loss,
    use_glove_loss=use_glove_loss,
    word_vectors=word_vectors,
    loss_weights={"lambda": weight_lambda, "beta": 1, "alpha": weight_alpha},
)

In [32]:
ctm.fit(training_dataset) # run the model

Epoch: [10/10]	 Seen Samples: [3840/4000]	 Train Loss (KL/RL/DL): 14.83/170.29/4.35	 Time: 0:00:00.690369: : 10it [00:07,  1.36it/s]


# Topics

After training, it is time to look at our topics. We can use the `get_topic_lists` function to get them; it accepts a parameter that selects how many words you want to see for each topic.

If you look at the topics, you will see that they reflect a collection of documents that comes from Wikipedia (general knowledge). With only 400 documents and 10 training epochs, expect some of them to be noisy. Notice that the topics are in English, because we trained the model on English documents.

In [34]:
ctm.get_topic_lists(5)

[['welsh', 'name', 'certain', 'public', 'names'],
 ['hampton', 'august', 'university', 'moved', 'navy'],
 ['island', 'two', 'state', 'voice', 'name'],
 ['born', 'new', 'member', 'show', 'greek'],
 ['league', 'died', 'lived', 'government', 'de'],
 ['km', 'cross', 'unincorporated', 'holy', 'system'],
 ['recorded', 'found', '16', 'republic', 'radio'],
 ['city', 'mozambique', 'central', 'official', 'crater'],
 ['cumulative', 'july', 'south', 'verse', 'new'],
 ['years', 'rock', 'studio', 'veneta', 'newspaper'],
 ['gate', 'drama', 'fc', 'brown', 'sectors'],
 ['december', 'son', 'september', 'valley', 'kilometers'],
 ['album', '1986', '30', 'west', 'happy'],
 ['series', 'modern', 'general', 'deseret', '1952'],
 ['also', 'malaysia', 'premier', 'film', 'michael'],
 ['like', 'ghost', 'star', 'outside', 'near'],
 ['designed', 'comics', 'college', 'trinidad', 'tennessee'],
 ['canada', '24', 'foster', 'wasp', 'office'],
 ['bacoside', 'chief', 'tennis', 'newton', 'welsh'],
 ['late', 'led', 'rugby', 
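
One quick sanity check on such lists is topic diversity, the share of unique words among the top words of all topics (values near 1 mean little overlap between topics). The sketch below computes it by hand on the first three topics printed above. The helper name `topic_diversity` is made up here, and the package's own `TopicDiversity` measure may differ in its details.

```python
def topic_diversity(topics, topk):
    """Share of unique words among the top-k words of all topics."""
    top_words = [word for topic in topics for word in topic[:topk]]
    return len(set(top_words)) / len(top_words)


# First three topics from the output above.
topics = [
    ['welsh', 'name', 'certain', 'public', 'names'],
    ['hampton', 'august', 'university', 'moved', 'navy'],
    ['island', 'two', 'state', 'voice', 'name'],
]
diversity = topic_diversity(topics, topk=5)
```

Here `'name'` appears in two topics, so 14 of the 15 top words are unique.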

# Let's Draw!

We can use PyLDAvis to plot our topics in a nice and friendly manner :)

In [35]:
lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

Sampling: [10/10]: : 10it [00:06,  1.54it/s]


In [36]:
#!pip install pyLDAvis

In [37]:
import pyLDAvis as vis

lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

ctm_pd = vis.prepare(**lda_vis_data)
vis.display(ctm_pd)

Sampling: [10/10]: : 10it [00:06,  1.50it/s]