<a href="https://colab.research.google.com/github/sejaldua/digesting-the-digest/blob/main/BERTopic_DTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dynamic Topic Models
Dynamic topic models analyze how the topics in a collection of documents evolve over time.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Installing BERTopic

We start by installing BERTopic from PyPI:

In [5]:
%%capture
!pip install bertopic

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# **Data**
For this tutorial, we will use the article data stored in `article_data_via_gmail_api.csv`. Each row describes one article through its `Title`, `Subtitle`, `Author`, `Publication`, and `Minutes` columns. We will use the article titles as the documents for the topic model.

Moreover, since we are looking at the articles over time, we will also keep the timestamp of each article from the `Date` column.

In [2]:
import re
import pandas as pd
from datetime import datetime

# Load data
df = pd.read_csv('https://raw.githubusercontent.com/sejaldua/digesting-the-digest/main/article_data_via_gmail_api.csv')

# Documents to model: one article title per row
titles = df.Title.to_list()
# Timestamp of each article, parsed from the Date column
dates = df['Date'].apply(lambda x: pd.Timestamp(x))

In [3]:
df

| | Date | Title | Subtitle | Author | Publication | Minutes |
|---|---|---|---|---|---|---|
| 0 | 2021-08-11 11:40:00+00:00 | Stop One-Hot Encoding your Categorical Feature... | Techniques to Encode Categorical Features with... | Satyam Kumar | The Startup | 5 |
| 1 | 2021-08-11 11:40:00+00:00 | Encoding Categorical Features | Introduction | Yang Liu | Towards Data Science | 6 |
| 2 | 2021-08-11 11:40:00+00:00 | How to Write a Headline | Insights from Medium's editorial team | Medium Creators | Creators Hub | 5 |
| 3 | 2021-08-11 11:40:00+00:00 | Topic Areas to Avoid | The following topics have been covered at leng... | Zack Shapiro | Better Programming | 2 |
| 4 | 2021-08-11 11:40:00+00:00 | How to Calculate Molecular Similarity | Day 15 of the 66 Days of Data | Chanin Nantasenamat | Data Professor | 5 |
| ... | ... | ... | ... | ... | ... | ... |
| 9160 | 2020-01-01 16:30:00+00:00 | How to Get the Unquantifiable Benefits of Cold... | The surprising side effects that science can't... | May Pang | Better Humans | 10 |
| 9161 | 2020-01-01 16:30:00+00:00 | Screw Productivity Hacks: My Morning Routine I... | Anyone who brags about a 3:30 a.m. gym routine... | Jessica Valenti | GEN | 3 |
| 9162 | 2020-01-01 16:30:00+00:00 | The Latest Science on Chronic Pain Is Fascinat... | Experts can even predict who's likely to suffe... | Robert Roy Britt | Elemental | 14 |
| 9163 | 2020-01-01 16:30:00+00:00 | 5 scientific myths you probably believe about ... | How a little knowledge can bring about some hu... | Ethan Siegel | Starts With A Bang! | 8 |
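Because the time dimension matters later on, it can be worth checking that the timestamps parse cleanly and seeing how the documents are spread over time. The snippet below is a minimal sketch using only the standard library. The first two strings are copied from the table above, while the short `raw_dates` list itself is a hypothetical stand-in for the full `Date` column.

```python
from collections import Counter
from datetime import datetime

# Timestamp strings in the same format as the Date column shown above.
# This short list is a hypothetical stand-in for the full column.
raw_dates = [
    "2021-08-11 11:40:00+00:00",
    "2021-08-11 11:40:00+00:00",
    "2020-01-01 16:30:00+00:00",
]

# Parse the ISO-style strings into timezone-aware datetime objects.
parsed = [datetime.fromisoformat(s) for s in raw_dates]

# Count documents per (year, month) to see how evenly they are spread.
docs_per_month = Counter((d.year, d.month) for d in parsed)

print(min(parsed).date(), "to", max(parsed).date())
for (year, month), count in sorted(docs_per_month.items()):
    print(f"{year}-{month:02d}: {count}")
```

A very uneven spread would suggest that some time bins will contain few documents, which is worth keeping in mind when choosing the number of bins later.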


# **Dynamic Topic Modeling**


## Basic Topic Model
To perform Dynamic Topic Modeling with BERTopic, we first need to create a basic topic model using all article titles. The temporal aspect is ignored for now, as we are only interested in the topics that reside in those titles.

In [6]:
from bertopic import BERTopic
topic_model = BERTopic(min_topic_size=30, verbose=True)
topics, _ = topic_model.fit_transform(titles)

2021-08-11 23:10:05,450 - BERTopic - Transformed documents to Embeddings





2021-08-11 23:10:38,472 - BERTopic - Reduced dimensionality with UMAP
2021-08-11 23:10:38,954 - BERTopic - Clustered UMAP embeddings with HDBSCAN


We can then extract the most frequent topics:

In [7]:
freq = topic_model.get_topic_info(); freq.head(10)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 2657 | -1_this_why_when_about |
| 1 | 0 | 432 | 0_scientist_scientists_interview_career |
| 2 | 1 | 429 | 1_libraries_useful_functions_science |
| 3 | 2 | 338 | 2_learning_models_algorithms_ml |
| 4 | 3 | 325 | 3_developer_code_programming_software |
| 5 | 4 | 207 | 4_react_app_create_hooks |
| 6 | 5 | 160 | 5_life_habit_successful_habits |
| 7 | 6 | 158 | 6_design_ux_designer_portfolio |
| 8 | 7 | 153 | 7_regression_bayesian_probability_distribution |
| 9 | 8 | 139 | 8_dating_relationship_love_marriage |
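The largest row in the table is topic `-1` with 2657 documents, so it can be useful to check what share of the corpus ends up there. The snippet below is a minimal sketch. The `topics` list is a small hypothetical example; in the notebook it would be the list returned by `fit_transform`.

```python
from collections import Counter

# Hypothetical topic assignments, one per document. In the notebook this
# would be the `topics` list returned by `fit_transform`.
topics = [-1, 0, 2, -1, 1, 0, -1, 2, 2, 0]

counts = Counter(topics)

# Share of documents that were assigned to topic -1.
outlier_share = counts[-1] / len(topics)
print(f"{counts[-1]} of {len(topics)} documents are in topic -1 ({outlier_share:.0%})")

# Remaining topics ordered by size, with topic -1 left out.
top_topics = [(t, n) for t, n in counts.most_common() if t != -1]
print(top_topics)
```

If that share is very high, one option is to revisit the model settings before moving on to the temporal analysis.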


`-1` refers to all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that was generated:

In [8]:
topic_nr = freq.iloc[3]["Topic"]  # We select a frequent topic
topic_model.get_topic(topic_nr)   # You can select a topic number as shown above

[('learning', 0.07957643550746428),
 ('models', 0.02887036768514761),
 ('algorithms', 0.0227672839565982),
 ('ml', 0.01948218486023126),
 ('model', 0.018236085881024336),
 ('algorithm', 0.017246406874633566),
 ('selection', 0.013393544039876033),
 ('bias', 0.01171935103489153),
 ('regression', 0.009375006606619066),
 ('loss', 0.008370965024922522)]
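Each topic is returned as a list of `(word, score)` pairs, ordered from the highest score to the lowest. The sketch below shows how the top four words line up with the `Name` column of the earlier table. The helper `make_label` is hypothetical and written for illustration only; it is not part of BERTopic.

```python
# (word, score) pairs copied from the output above, truncated to five entries.
topic_words = [
    ("learning", 0.07957643550746428),
    ("models", 0.02887036768514761),
    ("algorithms", 0.0227672839565982),
    ("ml", 0.01948218486023126),
    ("model", 0.018236085881024336),
]


def make_label(topic_id, words, n_words=4):
    """Join the topic number and its n highest-scoring words (hypothetical helper)."""
    ranked = sorted(words, key=lambda pair: pair[1], reverse=True)
    top = [word for word, _ in ranked[:n_words]]
    return "_".join([str(topic_id)] + top)


label = make_label(2, topic_words)
print(label)  # 2_learning_models_algorithms_ml
```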

We can visualize the basic topics that were created with the Intertopic Distance Map. This allows us to judge visually whether the basic topics are sufficient before proceeding to create the topics over time.

In [9]:
fig = topic_model.visualize_topics(); fig

## Topics over Time
Before we start with the Dynamic Topic Modeling step, it is important that you are satisfied with the topics that were created previously. We are going to be using those specific topics as a base for Dynamic Topic Modeling. 

Thus, this step will essentially show you how the topics that were defined previously have evolved over time. 

There are a few important parameters that you should take note of, namely:

* `docs`
  * These are the article titles that we are using
* `topics`
  * The topics that we have created before
* `timestamps`
  * The timestamp of each article/document
* `global_tuning`
  * Whether to average the topic representation of a topic at time *t* with its global topic representation
* `evolution_tuning`
  * Whether to average the topic representation of a topic at time *t* with the topic representation of that topic at time *t-1*
* `nr_bins`
  * The number of bins to put our timestamps into. It is computationally inefficient to extract the topics at thousands of different timestamps. Therefore, it is advised to keep this value at 20 or below.
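To make the two tuning options more concrete, here is a minimal sketch of the averaging idea described above. It is not BERTopic's internal code: the three-word vocabulary, the weights, and the order in which the two averages are applied are all assumptions chosen for illustration.

```python
def l1_normalize(vec):
    """Scale a vector so that its absolute values sum to 1."""
    total = sum(abs(v) for v in vec)
    return [v / total for v in vec] if total else list(vec)


def average(a, b):
    """Element-wise mean of two equally long vectors."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


# Hypothetical term weights for one topic over a three-word vocabulary.
vocab = ["learning", "models", "bias"]
global_repr = l1_normalize([6.0, 3.0, 1.0])  # topic over the whole corpus
prev_repr = l1_normalize([4.0, 4.0, 2.0])    # same topic at time t-1
local_repr = l1_normalize([1.0, 1.0, 8.0])   # same topic at time t

evolution_tuning = True
global_tuning = True

tuned = local_repr
if evolution_tuning:
    # Pull the representation at time t toward the one at time t-1.
    tuned = average(tuned, prev_repr)
if global_tuning:
    # Pull the representation at time t toward the global one.
    tuned = average(tuned, global_repr)

top_local = vocab[local_repr.index(max(local_repr))]
top_tuned = vocab[tuned.index(max(tuned))]
print("top word without tuning:", top_local)
print("top word with tuning:   ", top_tuned)
```

In this toy example the untuned representation at time *t* is dominated by "bias", while the tuned one leans back toward "learning". This shows the trade-off: tuning smooths the representations over time at the cost of hiding some of the local detail.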


In [10]:
topics_over_time = topic_model.topics_over_time(docs=titles, 
                                                topics=topics, 
                                                timestamps=dates, 
                                                global_tuning=True, 
                                                evolution_tuning=True, 
                                                nr_bins=20)

20it [00:24,  1.23s/it]
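The progress output above shows 20 iterations, which matches `nr_bins=20`. As a rough picture of what binning means, the sketch below assigns each timestamp to one of `nr_bins` equal-width intervals between the earliest and latest timestamp. The function `assign_bins` is hypothetical, and equal-width binning is an assumption made for illustration rather than a description of BERTopic's internals.

```python
from datetime import datetime, timezone


def assign_bins(timestamps, nr_bins):
    """Map each timestamp to an equal-width bin index in [0, nr_bins - 1]."""
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / nr_bins  # length of one bin as a timedelta
    bins = []
    for ts in timestamps:
        # Dividing two timedeltas gives a float; truncate it to a bin index.
        idx = int((ts - start) / width) if width else 0
        # The latest timestamp would fall just past the last bin, so clamp it.
        bins.append(min(idx, nr_bins - 1))
    return bins


# Hypothetical timestamps spanning the same date range as the data above.
timestamps = [
    datetime(2020, 1, 1, tzinfo=timezone.utc),
    datetime(2020, 6, 1, tzinfo=timezone.utc),
    datetime(2021, 1, 1, tzinfo=timezone.utc),
    datetime(2021, 8, 11, tzinfo=timezone.utc),
]

bins = assign_bins(timestamps, nr_bins=4)
print(bins)  # [0, 1, 2, 3]
```

With more bins each interval covers a shorter period, so there are more topic representations to compute and fewer documents behind each one.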


## Visualize Topics over Time
After creating `topics_over_time`, we visualize the topics, since inspecting them directly becomes more difficult with the added temporal dimension.

To do so, we are going to visualize the distribution of topics over time based on their frequency. Doing so allows us to see how the topics have evolved over time. Make sure to hover over any point to see how the topic representation at time *t* differs from the global topic representation. 


In [11]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
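The chart plots, for each topic, how many documents fall into each time bin. The sketch below reproduces that kind of count with the standard library. The `bins` and `doc_topics` lists are hypothetical, and leaving out topic `-1` is a choice made here for illustration.

```python
from collections import Counter

# Hypothetical inputs: a time-bin index and a topic id for each document.
bins = [0, 0, 0, 1, 1, 2, 2, 2, 2]
doc_topics = [2, 2, 5, 2, 5, 5, 5, 2, -1]

# Frequency of each topic inside each bin, leaving out the outlier topic -1.
bin_topic_counts = Counter(
    (b, t) for b, t in zip(bins, doc_topics) if t != -1
)

# One series per topic: the values a frequency-over-time line would follow.
n_bins = max(bins) + 1
series = {
    topic: [bin_topic_counts[(b, topic)] for b in range(n_bins)]
    for topic in sorted({t for t in doc_topics if t != -1})
}
print(series)  # {2: [2, 1, 1], 5: [1, 1, 2]}
```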

In [12]:
topic_model.visualize_heatmap()

In [13]:
topic_model.visualize_barchart(top_n_topics=9, n_words=5, height=800)