# Monolingual Text Analysis with Contextualized Topic Models Using the Jeopardy Dataset

In [1]:
%%capture
!pip install contextualized_topic_models

This is the CTM package that we are going to use for this particular project.

In [3]:
import pandas as pd

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv('JEOPARDY_CSV.csv')

# Display the first five rows of the DataFrame
df.head()


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
# Name of the column whose unique values we want to inspect
column_name = 'Show Number'

# Get unique items from the specified column
unique_items = df[column_name].unique()

# Display the unique items in array format
print(unique_items)
unique_category_count = df[column_name].nunique()
print(unique_category_count)


[4680 5957 3751 ... 5070 5195 4999]
3640


In [5]:
df.drop_duplicates(subset= 'Show Number', inplace=True)
df

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
56,5957,2010-07-06,Jeopardy!,"GEOGRAPHY ""E""",$200,It's the largest kingdom in the United Kingdom,England
117,3751,2000-12-18,Jeopardy!,ROYAL FEMALE NICKNAMES,$100,"Prime Minister Tony Blair dubbed her ""The Peop...",Princess Diana
175,3673,2000-07-19,Jeopardy!,GENERAL SCIENCE,$100,"This white, glossy coating on your teeth is th...",Enamel
236,4931,2006-02-06,Jeopardy!,THE CIVIL WAR,$200,His first act after being sworn in as presiden...,Jefferson Davis
...,...,...,...,...,...,...,...
216629,3940,2001-10-19,Jeopardy!,U.K. DEPENDENCIES,$100,This archipelago east of North Carolina is now...,Bermuda
216687,6044,2010-12-16,Jeopardy!,THE AMERICAN RED CROSS,$200,"(<a href=""http://www.j-archive.com/media/2010-...",Switzerland
216747,5070,2006-09-29,Jeopardy!,THE NATIONAL PARK SYSTEM,$200,"A natl. historic site, the Hubbell Trading Pos...",the Navajo tribe
216808,5195,2007-03-23,Jeopardy!,HAIL TO THE CHIEF,$200,The rustic James A. Garfield was born in one o...,a log cabin


Import the functions and libraries that we will need later on.


In [6]:
from contextualized_topic_models.models.ctm import ZeroShotTM, CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords
import nltk
import torch
import random
import numpy as np

This function fixes the random seeds so that we can replicate the results.

In [7]:
def fix_seeds():
  torch.manual_seed(10)
  torch.cuda.manual_seed(10)
  np.random.seed(10)
  random.seed(10)
  torch.backends.cudnn.enabled = False
  torch.backends.cudnn.deterministic = True

For simple text preprocessing, we draw a random sample of 500 documents from the 3,640 deduplicated rows.

In [37]:
sample_size = 500

# Randomly select sample_size rows
random_sample = df.sample(n=sample_size, random_state=42)

# Display the randomly selected sample
random_sample

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
24754,1245,1990-01-19,Jeopardy!,FAMOUS TOMs,$100,"This NBC newsman used to co-host the ""Today"" s...",Tom Brokaw
174330,5937,2010-06-08,Jeopardy!,I LOVE A SHOW TUNE,$200,"Professor 'Enry 'Iggins sings, ""I've grown acc...",My Fair Lady
109618,3653,2000-06-21,Jeopardy!,JOHN QUINCY ADAMS,$100,"When John Quincy passed the bar in 1790, his f...",Vice president
17791,5777,2009-10-27,Jeopardy!,LIVRES EN FRANÇAIS,$200,"Laura Ingalls Wilder: ""La petite maison dans l...",Little House on the Prairie
63741,3816,2001-03-19,Jeopardy!,HISTORIC OCCASIONS,$100,"James Marshall found this Jan. 24, 1848, days ...",Gold
...,...,...,...,...,...,...,...
195057,3590,2000-03-24,Jeopardy!,THE 1980s,$100,The war over these islands was fought in May &...,Falkland (or Malvinas) Islands
837,4541,2004-05-10,Jeopardy!,HISTORIC NICKNAMES,$200,"Because of his Hanoverian heritage, American c...",George III
99981,4389,2003-10-09,Jeopardy!,EUROPE,$200,The world's earliest steam locomotives pulled ...,coal
10346,5623,2009-02-04,Jeopardy!,ROTTEN POETRY ABOUT GOOD POETS,$200,"Okay, we've been savin' / His poem, ""The Raven...",(Edgar Allan) Poe


In [39]:
random_sample = random_sample.rename(columns={' Question': 'Question'})

In [20]:
!pip install beautifulsoup4




In [38]:
from bs4 import BeautifulSoup

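# The raw CSV header is ' Question' (with a leading space);
# fall back to 'Question' if the column has already been renamed
column_name = 'Question' if 'Question' in random_sample.columns else ' Question'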

# Define a function to remove HTML tags
def remove_html_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text()

# Apply the function to the specified column
random_sample[column_name] = random_sample[column_name].apply(remove_html_tags)

# Display the modified DataFrame
random_sample.head()




Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
24754,1245,1990-01-19,Jeopardy!,FAMOUS TOMs,$100,"This NBC newsman used to co-host the ""Today"" s...",Tom Brokaw
174330,5937,2010-06-08,Jeopardy!,I LOVE A SHOW TUNE,$200,"Professor 'Enry 'Iggins sings, ""I've grown acc...",My Fair Lady
109618,3653,2000-06-21,Jeopardy!,JOHN QUINCY ADAMS,$100,"When John Quincy passed the bar in 1790, his f...",Vice president
17791,5777,2009-10-27,Jeopardy!,LIVRES EN FRANÇAIS,$200,"Laura Ingalls Wilder: ""La petite maison dans l...",Little House on the Prairie
63741,3816,2001-03-19,Jeopardy!,HISTORIC OCCASIONS,$100,"James Marshall found this Jan. 24, 1848, days ...",Gold
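
To make the tag-removal step concrete, here is a minimal sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. It is an alternative illustration, not the code the notebook ran. The names `_TextExtractor` and `strip_tags` and the sample clue are hypothetical. Strings without a `<` character are returned unchanged, so plain-text clues are never handed to the parser.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def strip_tags(text):
    """Return text with HTML tags removed; plain strings pass through untouched."""
    if "<" not in text:
        return text
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)


# Made-up clue that mimics the link markup seen in some rows
sample = '(<a href="http://example.com/clue.jpg" target="_blank">Alex reports</a>) This country hosts the Red Cross'
cleaned = strip_tags(sample)
print(cleaned)
```

The same function could be passed to `apply` on the question column in place of `remove_html_tags`, assuming every cell is a string.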


### Text Preprocessing

In [40]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

nltk.download('stopwords')
nltk.download('punkt')

# Assuming 'random_sample' is your DataFrame and 'Question' is the column with textual data
documents = random_sample['Question'].tolist()

# Build the stop word set once instead of on every call
stop_words = set(stopwords.words('english'))

# Function for text preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Join the processed words back into a single string
    return ' '.join(tokens)

# Apply the preprocessing function to each document
processed_documents = [preprocess_text(doc) for doc in documents]

# Add checks and print statements
print("Number of Documents:", len(documents))
print("Number of Processed Documents:", len(processed_documents))
print("Processed Documents:", processed_documents)

# Get the vocabulary
vocab = set(word for doc in processed_documents for word in doc.split())

# Display the results
print("Vocabulary:", vocab)

# Check lengths of processed documents and vocabulary
print("Length of Processed Documents:", len(processed_documents))
print("Length of Vocabulary:", len(vocab))



Number of Documents: 500
Number of Processed Documents: 500
Processed Documents: ['nbc newsman used cohost today show barbara walters successor jane pauley', 'professor enry iggins sings ive grown accustomed face song show', 'john quincy passed bar 1790 father job', 'laura ingalls wilder la petite maison dans la prairie', 'james marshall found jan 24 1848 days california handed us', 'call paintings like one seen put lips together blow', 'play opens battlements castle elsinore barnardo asks whos', 'prince edward marion 2 small islands country lie 1200 miles southeast cape town', 'capote born truman streckfus persons september 30 1924 southern us city', 'correlating system measuring time mayan one starts 3114 bc', 'us senate rejected versailles treaty', '1939 pact steel mussolini joined forces world leader', 'words peoples heroes immortal engraved monument tiananmen square', '1755 earthquake destroyed twothirds portuguese city', 'bucephalus favorite horse ancient warrior built city honor

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We then pass both the preprocessed and the unpreprocessed documents to our TopicModelDataPreparation object. This object takes care of creating the bag of words from the preprocessed text and of obtaining the contextualized representations from the unpreprocessed text. This operation gives us our training dataset.

In [41]:
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v2")

training_dataset = tp.fit(text_for_contextual=documents, text_for_bow=processed_documents)

Batches:   0%|          | 0/3 [00:00<?, ?it/s]
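
The two lists passed to `fit` are paired by position, so they have to stay the same length and in the same order. A clue made up only of stop words would turn into an empty string after preprocessing. The sketch below shows one way to drop such documents from both lists at once. The helper name `drop_empty_pairs` and the toy clues are assumptions for illustration, and whether the library tolerates empty bag-of-words documents on its own was not checked here.

```python
def drop_empty_pairs(raw_docs, processed_docs):
    """Keep only the positions whose preprocessed text is non-empty."""
    kept = [
        (raw, proc)
        for raw, proc in zip(raw_docs, processed_docs)
        if proc.strip()
    ]
    if not kept:
        return [], []
    raw_kept, proc_kept = zip(*kept)
    return list(raw_kept), list(proc_kept)


# Toy example: the first and last clue vanish after stop word removal
raw = ["Who is this?", "It's the largest kingdom in the United Kingdom", "It is what it is"]
processed = ["", "largest kingdom united kingdom", ""]

raw_kept, proc_kept = drop_empty_pairs(raw, processed)
print(raw_kept)
print(proc_kept)
```

Applied to this notebook, the call would be `documents, processed_documents = drop_empty_pairs(documents, processed_documents)` before building the training dataset.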

### How many topics?
There are different techniques to select the best number of topics. In this case, I ran the topic model with different numbers of topics (5, 10, 15, 20) and selected the one that produced the topics with the highest coherence.

Also remember that a topic model is a probabilistic model, so it produces different results on each run even with the same hyperparameter values (e.g. the same number of topics). For this reason, I ran the topic model 5 times for each number of topics.

For our data, we select 20 topics, since this setting gave the highest coherence score (-0.2008).


In [49]:
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
corpus = [d.split() for d in processed_documents]

num_topics = [5, 10, 15, 20]
num_runs = 5

best_topic_coherence = -999
best_num_topics = 0
for n_components in num_topics:
  for i in range(num_runs):
    print("num topics:", n_components, "/ num run:", i)
    ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768,
                     n_components=n_components, num_epochs=50)
    ctm.fit(training_dataset) # run the model
    coh = CoherenceNPMI(ctm.get_topic_lists(10), corpus)
    coh_score = coh.score()
    print("coherence score:", coh_score)
    if best_topic_coherence < coh_score:
      best_topic_coherence = coh_score
      best_num_topics = n_components
    print("current best coherence", best_topic_coherence, "/ best num topics", best_num_topics)

num topics: 5 / num run: 0


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 67.02268872942243	Time: 0:00:00.255783: : 50it [00:15,  3.16it/s]
100%|██████████| 8/8 [00:00<00:00, 31.00it/s]


coherence score: -0.5028064618763513
current best coherence -0.5028064618763513 / best num topics 5
num topics: 5 / num run: 1


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 67.04812622070312	Time: 0:00:00.254406: : 50it [00:15,  3.16it/s]
100%|██████████| 8/8 [00:00<00:00, 31.17it/s]


coherence score: -0.45237613816239836
current best coherence -0.45237613816239836 / best num topics 5
num topics: 5 / num run: 2


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 66.12082181658063	Time: 0:00:00.290764: : 50it [00:18,  2.66it/s]
100%|██████████| 8/8 [00:00<00:00, 29.26it/s]


coherence score: -0.4749809291754154
current best coherence -0.45237613816239836 / best num topics 5
num topics: 5 / num run: 3


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 67.35906982421875	Time: 0:00:00.258442: : 50it [00:15,  3.17it/s]
100%|██████████| 8/8 [00:00<00:00, 30.61it/s]


coherence score: -0.43030034332724937
current best coherence -0.43030034332724937 / best num topics 5
num topics: 5 / num run: 4


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 67.4800785609654	Time: 0:00:00.263540: : 50it [00:15,  3.14it/s]
100%|██████████| 8/8 [00:00<00:00, 30.19it/s]


coherence score: -0.4193922945859915
current best coherence -0.4193922945859915 / best num topics 5
num topics: 10 / num run: 0


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 66.00057329450335	Time: 0:00:00.285014: : 50it [00:16,  3.04it/s]
100%|██████████| 8/8 [00:00<00:00, 29.56it/s]


coherence score: -0.3643205696338737
current best coherence -0.3643205696338737 / best num topics 10
num topics: 10 / num run: 1


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 65.9751957484654	Time: 0:00:00.271845: : 50it [00:16,  3.07it/s]
100%|██████████| 8/8 [00:00<00:00, 31.42it/s]


coherence score: -0.4332100728100719
current best coherence -0.3643205696338737 / best num topics 10
num topics: 10 / num run: 2


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 66.26791490827289	Time: 0:00:00.265235: : 50it [00:16,  3.12it/s]
100%|██████████| 8/8 [00:00<00:00, 30.25it/s]


coherence score: -0.3765111345620762
current best coherence -0.3643205696338737 / best num topics 10
num topics: 10 / num run: 3


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 66.53519766671317	Time: 0:00:00.264265: : 50it [00:17,  2.89it/s]
100%|██████████| 8/8 [00:00<00:00, 27.69it/s]


coherence score: -0.3742277610604736
current best coherence -0.3643205696338737 / best num topics 10
num topics: 10 / num run: 4


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 66.35259192330497	Time: 0:00:00.266847: : 50it [00:16,  3.10it/s]
100%|██████████| 8/8 [00:00<00:00, 27.95it/s]


coherence score: -0.3853761536846882
current best coherence -0.3643205696338737 / best num topics 10
num topics: 15 / num run: 0


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 67.05092566353935	Time: 0:00:00.278318: : 50it [00:16,  3.09it/s]
100%|██████████| 8/8 [00:00<00:00, 28.37it/s]


coherence score: -0.34152123091413866
current best coherence -0.34152123091413866 / best num topics 15
num topics: 15 / num run: 1


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 66.64705113002232	Time: 0:00:00.274336: : 50it [00:16,  3.07it/s]
100%|██████████| 8/8 [00:00<00:00, 31.06it/s]


coherence score: -0.31665585814319575
current best coherence -0.31665585814319575 / best num topics 15
num topics: 15 / num run: 2


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 67.07063293457031	Time: 0:00:00.274005: : 50it [00:16,  3.06it/s]
100%|██████████| 8/8 [00:00<00:00, 29.88it/s]


coherence score: -0.35595271950898494
current best coherence -0.31665585814319575 / best num topics 15
num topics: 15 / num run: 3


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 67.25832257952008	Time: 0:00:00.274859: : 50it [00:16,  3.04it/s]
100%|██████████| 8/8 [00:00<00:00, 27.08it/s]


coherence score: -0.3851885734849858
current best coherence -0.31665585814319575 / best num topics 15
num topics: 15 / num run: 4


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 66.80733544485909	Time: 0:00:00.442427: : 50it [00:16,  2.98it/s]
100%|██████████| 8/8 [00:00<00:00, 20.85it/s]


coherence score: -0.29534040302278264
current best coherence -0.29534040302278264 / best num topics 15
num topics: 20 / num run: 0


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 68.54690987723214	Time: 0:00:00.260374: : 50it [00:16,  3.02it/s]
100%|██████████| 8/8 [00:00<00:00, 28.65it/s]


coherence score: -0.20080021995860556
current best coherence -0.20080021995860556 / best num topics 20
num topics: 20 / num run: 1


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 68.6215569632394	Time: 0:00:00.274132: : 50it [00:16,  3.02it/s]
100%|██████████| 8/8 [00:00<00:00, 30.03it/s]


coherence score: -0.2682568460635003
current best coherence -0.20080021995860556 / best num topics 20
num topics: 20 / num run: 2


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 69.27561514718192	Time: 0:00:00.297770: : 50it [00:16,  2.96it/s]
100%|██████████| 8/8 [00:00<00:00, 27.44it/s]


coherence score: -0.23638426796317397
current best coherence -0.20080021995860556 / best num topics 20
num topics: 20 / num run: 3


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 69.10673141479492	Time: 0:00:00.389374: : 50it [00:17,  2.93it/s]
100%|██████████| 8/8 [00:00<00:00, 18.69it/s]


coherence score: -0.30285383503808405
current best coherence -0.20080021995860556 / best num topics 20
num topics: 20 / num run: 4


Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 68.23990140642438	Time: 0:00:00.447832: : 50it [00:17,  2.91it/s]
100%|██████████| 8/8 [00:00<00:00, 16.97it/s]


coherence score: -0.20652451010378864
current best coherence -0.20080021995860556 / best num topics 20
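
The loop above keeps the single best run, which can reward a lucky seed. A complementary check is to compare the mean coherence per setting. The sketch below does this with the scores printed in the log above, rounded to four decimals. The variable names are illustrative only.

```python
from statistics import mean, stdev

# NPMI coherence per run, copied from the training log above (rounded)
scores = {
    5: [-0.5028, -0.4524, -0.4750, -0.4303, -0.4194],
    10: [-0.3643, -0.4332, -0.3765, -0.3742, -0.3854],
    15: [-0.3415, -0.3167, -0.3560, -0.3852, -0.2953],
    20: [-0.2008, -0.2683, -0.2364, -0.3029, -0.2065],
}

# Mean and sample standard deviation for each number of topics
summary = {k: (mean(v), stdev(v)) for k, v in scores.items()}
for k, (m, s) in summary.items():
    print(f"{k:>2} topics: mean NPMI = {m:.4f}, std = {s:.4f}")

# Pick the setting with the highest mean coherence
best_by_mean = max(summary, key=lambda k: summary[k][0])
print("best number of topics by mean coherence:", best_by_mean)
```

On these numbers, 20 topics also has the highest mean, so the choice does not depend on one favourable run.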


### Training the Combined Contextualized Topic Model
Let us run the topic model with 20 topics (parameter n_components).

Recall that CTM is a neural model, so we need to define for how many epochs the model will run. We can also use an early stopping criterion to let the model stop automatically. In this case, we should provide a validation dataset to the fit function (parameter validation_dataset).

We also need to set the dimension of the BoW and the dimension of the contextualized representation.

In [51]:
fix_seeds()

num_topics = 20
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=num_topics, num_epochs=50)
ctm.fit(training_dataset) # run the model

Epoch: [50/50]	 Seen Samples: [22400/25000]	Train Loss: 67.92583683558873	Time: 0:00:00.479368: : 50it [00:17,  2.87it/s]

100%|██████████| 8/8 [00:00<00:00, 15.94it/s]


Next, we use the get_topic_lists function to look at our topics. It accepts a parameter that lets you select how many words you want to see for each topic.

If you look at the topics, you can see if they all make sense.

In [52]:
ctm.get_topic_lists(5)

[['bahrain', '1521', 'pronounced', 'iberian', 'nation'],
 ['right', 'thy', 'fish', 'longfellow', 'little'],
 ['ballet', 'theatre', 'de', 'song', 'name'],
 ['day', '1755', 'normandy', 'portuguese', 'thousands'],
 ['largest', 'thomas', 'seals', 'nickname', 'seal'],
 ['buried', 'abbey', 'duke', 'famous', 'city'],
 ['crew', 'us', 'clue', 'rights', 'national'],
 ['topped', 'charts', 'wish', 'condemned', 'show'],
 ['battle', 'islands', 'france', 'clash', 'capital'],
 ['ho', 'north', 'sea', 'lake', 'center'],
 ['son', 'father', 'title', 'john', 'original'],
 ['miles', 'greek', '23', 'athens', 'ran'],
 ['eat', 'plant', 'seat', 'go', 'less'],
 ['state', 'steel', 'capital', 'michigan', '1811'],
 ['strength', 'moon', 'said', 'earth', 'together'],
 ['president', 'served', 'texas', 'home', 'prime'],
 ['visit', 'allows', 'game', 'yahoo', 'worst'],
 ['every', 'skraelings', 'judges', 'still', 'peak'],
 ['battle', 'war', 'symphony', 'painted', 'wrote'],
 ['first', 'space', 'may', 'plane', 'sgt']]

We also want to quantify how much better the contextualized model is than a classical baseline. For that, we train a Latent Dirichlet Allocation (LDA) model on the same preprocessed documents and compare it with our CTM.

In [53]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

split_processed_documents = [d.split() for d in processed_documents]
dictionary = Dictionary(split_processed_documents)
corpus = [dictionary.doc2bow(text) for text in split_processed_documents]

lda = LdaModel(corpus, num_topics=num_topics, iterations=500, random_state=42)



In [54]:
def get_topics_lda(topk=10):
    # Collect the top-k words of every LDA topic as a list of lists
    topic_terms = []
    for i in range(num_topics):
        topic_words_list = []
        for word_id, _ in lda.get_topic_terms(i, topk):
            topic_words_list.append(dictionary[word_id])
        topic_terms.append(topic_words_list)
    return topic_terms

get_topics_lda(5)

[['theyre', 'whose', 'lake', 'wrote', 'joined'],
 ['first', 'big', 'poem', '1', 'song'],
 ['sgt', 'seals', 'says', 'navy', 'globe'],
 ['clue', 'adopted', '16th', '2', '1975'],
 ['country', '2', 'steel', 'new', 'president'],
 ['state', 'first', 'capital', 'site', 'famous'],
 ['visit', 'variety', 'colors', 'us', 'john'],
 ['called', '6', 'city', 'duke', 'high'],
 ['president', 'land', 'chiang', 'signed', 'miles'],
 ['first', 'country', 'deals', 'june', 'called'],
 ['state', 'little', 'continent', 'great', 'texas'],
 ['mountain', 'named', 'ho', 'generation', 'friend'],
 ['river', 'one', 'land', 'point', 'may'],
 ['first', 'one', 'man', 'born', 'got'],
 ['became', 'state', 'new', 'group', 'bolshoi'],
 ['virginia', 'delivers', 'state', 'gland', 'shoot'],
 ['american', 'first', 'race', 'german', 'states'],
 ['country', 'together', 'one', 'potato', 'clue'],
 ['name', 'rains', 'theatre', 'cell', 'city'],
 ['island', 'march', 'died', 'people', 'good']]

### Topic Coherence
We usually use topic coherence as the main indicator of topic quality. NPMI (normalized pointwise mutual information) coherence is the most widely used variant; it is computed from the co-occurrences of the topic words in the original corpus or in an external one. The intuition is that if two words often co-occur, they are more likely to be related to each other.
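
To make the co-occurrence intuition concrete, the sketch below computes NPMI for word pairs on a toy corpus. It is a simplification: whole documents are the co-occurrence unit, whereas gensim's `c_npmi` estimates probabilities over a sliding window, so the values are not comparable with the scores in this notebook. The function names and the toy documents are assumptions for illustration.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, docs):
    """NPMI of two words, using document-level co-occurrence counts."""
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur: lowest value by convention
    if p_ab == 1:
        return 1.0  # both words appear in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_npmi(topic_words, docs):
    """Average NPMI over all word pairs of one topic."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


# Toy corpus: each document is represented as a set of words
toy_docs = [
    {"battle", "war", "general"},
    {"battle", "war", "treaty"},
    {"song", "ballet", "theatre"},
    {"song", "theatre", "war"},
]

related = npmi("battle", "war", toy_docs)
unrelated = npmi("battle", "song", toy_docs)
print(related, unrelated)
print(topic_npmi(["battle", "war", "song"], toy_docs))
```

Scores range from -1 (the words never appear together) to 1 (they always appear together), which is why the negative values reported in this notebook indicate weak co-occurrence on such a small sample.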

In [55]:
cm = CoherenceModel(model=lda, dictionary=dictionary,
                    texts=split_processed_documents, coherence='c_npmi')
lda_coherence = cm.get_coherence()  # get coherence value
print("coherence score LDA:", lda_coherence)

coherence score LDA: -0.4868609283763642


### Coherence on CTM
The CTM library already integrates gensim's coherence computation. We just pass the list of topics and the corpus to the CoherenceNPMI class and compute the score with the .score() function.

In [56]:
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
corpus = [d.split() for d in processed_documents]
coh = CoherenceNPMI(ctm.get_topic_lists(10), corpus)
print("coherence score CTM:", coh.score())

coherence score CTM: -0.24285005367483642


### Diversity of the topics
We can also compute how diverse the topics are from each other. Ideally, topics represent separate concepts or ideas. In this case, we use the IRBO (inverted rank-biased overlap) measure, where higher values mean more diverse topics. Topics with common words at different rankings are penalized less than topics sharing the same words at the highest ranks.
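
The rank sensitivity described above can be illustrated with a truncated form of rank-biased overlap. This is a sketch under assumptions: it sums the weighted agreement only up to the list depth, while the library's own InvertedRBO may differ (for example in how it extrapolates beyond that depth), so the numbers will not match the scores in this notebook. The function names and word lists are hypothetical.

```python
from itertools import combinations


def truncated_rbo(list_a, list_b, p=0.9):
    """Rank-biased overlap summed up to the shorter list's depth."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # overlap of the two top-d prefixes
        score += (p ** (d - 1)) * agreement   # earlier ranks get more weight
    return (1 - p) * score


def inverted_rbo(topics, p=0.9):
    """One minus the average pairwise overlap: higher means more diverse."""
    pairs = list(combinations(topics, 2))
    return 1 - sum(truncated_rbo(a, b, p) for a, b in pairs) / len(pairs)


base = ["battle", "war", "army", "general", "treaty"]
shared_top = ["battle", "war", "song", "ballet", "opera"]      # shares the top 2 words
shared_bottom = ["song", "ballet", "opera", "battle", "war"]   # same 2 words, low ranks

print(inverted_rbo([base, shared_top]))
print(inverted_rbo([base, shared_bottom]))
```

Both pairs share exactly two words, but the pair that shares them at the top ranks gets the lower diversity score.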

In [57]:
irbo_lda = InvertedRBO(get_topics_lda(10))
print("diversity score LDA:", irbo_lda.score())

irbo_ctm = InvertedRBO(ctm.get_topic_lists(10))
print("diversity score CTM:", irbo_ctm.score())

diversity score LDA: 0.9570286854889097
diversity score CTM: 0.9958009323454511


### Topic Predictions

Now we can take a document and see which topics have been assigned to it.

We first consider the topic distribution of the training documents, which CTM already computed.

In [61]:
topics_predictions = ctm.training_doc_topic_distributions

We then get the index of the most likely topic for the document of our choice.

In [62]:
import numpy as np
train_doc_id = 0
topic_id = np.argmax(topics_predictions[train_doc_id]) # get the topic id of the  document

And finally get the top words of the most likely topic for the considered document.

In [63]:
ctm.get_topic_lists(10)[topic_id]

['first',
 'space',
 'may',
 'plane',
 'sgt',
 'flying',
 '1939',
 '18',
 'birds',
 'eagle']

### Get the top K documents for a topic
A different way to explore the results is to retrieve the K documents that are most likely to be assigned to a specific topic.

Let us first pick a topic index.

In [58]:
topic_id = 5
print(ctm.get_topics()[topic_id])

['buried', 'abbey', 'duke', 'famous', 'city', 'named', 'marshall', 'hes', 'gardens', 'delegate']


And then we use the get_top_documents_per_topic_id function to get the list of most likely documents with their corresponding probability. The probability we see here corresponds to the conditional probability of the document to be assigned to the considered topic. The parameter k controls how many documents we want to retrieve.

In [59]:
ctm.get_top_documents_per_topic_id(documents, ctm.training_doc_topic_distributions, topic_id, k=7)

[('gene chandler duke earl prince philip duke city', 0.73655057),
 ('hes buried westminster abbey even though hes famous canterbury tales',
  0.550646),
 ('wanted 12year crime spree eating king hrothgars warriors officer beowulf assigned case',
  0.5411082),
 ('2 chiles juan fernandez islands named alexander selkirk fictional castaway inspired',
  0.4771245),
 ('marshall entered virginia politics 1787 delegate convention ratify',
  0.4753851),
 ('governors mansion housing george pataki city', 0.47404152),
 ('august 1851 schooner hundred guinea cup cup later renamed', 0.46695423)]
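
The idea behind this retrieval can be sketched in a few lines: take the column of the document-topic matrix that belongs to the chosen topic, sort the documents by that probability, and keep the first k. The code below illustrates the idea on made-up numbers and is not the library's implementation. The function name `top_documents_for_topic` and the toy matrix are hypothetical.

```python
def top_documents_for_topic(docs, doc_topic, topic_id, k=3):
    """Return the k (document, probability) pairs with the highest topic weight."""
    ranked = sorted(
        zip(docs, (row[topic_id] for row in doc_topic)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Toy data: 4 documents, 3 topics, each row sums to 1
docs = ["clue a", "clue b", "clue c", "clue d"]
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]

top = top_documents_for_topic(docs, doc_topic, topic_id=0, k=2)
print(top)
```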

In [60]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
