# Topic Modeling using BERT
In this activity we will use BERTopic, which is a topic modeling technique that leverages BERT embeddings and a class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

For more details on BERTopic, see:
https://maartengr.github.io/BERTopic/index.html

https://github.com/MaartenGr/BERTopic

You can compare the resulting topics from this activity with topics we derived in our earlier activity using LDA for topic modeling.

In [1]:
#run once!
!pip install bertopic
!pip install bertopic[visualization]


## Data & Scenario

We will use the same dataset on restaurant reviews we used for the earlier activity using LDA for topic modeling so that you can compare the results.  

As explained earlier, we can explore whether there are certain topics that people write about in their reviews. These topics can be used to come up with different strategies to engage users on online platforms or other channels.  

This is a small dataset for learning purposes and to avoid long processing times.
You can use any other textual data as input. Depending on the data format, you may have to use different functions to import your text data. Once you have your data imported as a dataframe, where one column contains the *documents*, the rest will be the same.

Download the file "**Restaurant_Reviews.tsv**" from elearn and upload it to your session before processing.

In [4]:
# importing restaurant reviews dataset
import pandas as pd
df=pd.read_csv('Restaurant_Reviews.tsv',delimiter="\t")
df.head(2)

| | Review | Sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |


The sentiment for each review has been manually labeled in this dataset, but we will not be using it for this activity; we only use the Review content.

There are 996 unique reviews (documents, in NLP terminology).

In [5]:
len(df)
docs=df['Review']
docs=docs.drop_duplicates() #drop duplicate reviews
docs=docs.values
len(docs)

996

In [6]:
#let's take a look at a document
docs[2]

'Not tasty and the texture was just nasty.'

## Pre-processing
Here we won't need the pre-processing steps from our earlier activity, since the library we are using works directly on the raw text and handles the needed processing (e.g., tokenization) internally.

## Importing BERTopic

In [7]:
# import BERTopic
from bertopic import BERTopic

This package is very easy to use, but several steps happen in the background (all of which are customizable), and it relies on advanced pre-trained models (more on this later).

## Embeddings
Let's recall our previous activities when we used RNNs for sentiment analysis and document classification.
Remember those RNN models had an embedding layer (the first layer) that would convert tokens (words represented as integers) to vectors (word vectors); in sum, it would create word embeddings (vectors) from tokens so that similar words would be close in the resulting vector-space.

Here, we have a similar process at the sentence level (instead of the token level). The first step, which happens in the background, is to create sentence embeddings with pre-trained models known as "sentence-transformers", which convert sentences into vector representations. These models are usually trained on very large collections of text (some for multiple languages), where the training goal is for the model to predict some missing part of the text (e.g., at the word or sentence level). This process can also be done at the document level, i.e., document embeddings.
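
To make "close in the vector space" concrete, here is a minimal sketch using cosine similarity on made-up three-dimensional vectors. Real sentence embeddings have hundreds of dimensions and come from the pre-trained model; the vectors, sentences, and variable names below are purely illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical, hand-made "embeddings" for three reviews (illustration only)
emb_food_good = [0.9, 0.1, 0.2]     # "The food was delicious."
emb_food_tasty = [0.8, 0.2, 0.1]    # "Really tasty meal."
emb_slow_service = [0.1, 0.9, 0.3]  # "Service was very slow."

sim_related = cosine_similarity(emb_food_good, emb_food_tasty)
sim_unrelated = cosine_similarity(emb_food_good, emb_slow_service)

# Reviews about the same subject should score higher than unrelated ones
print(sim_related > sim_unrelated)
```

Sentences with similar meaning end up with a higher cosine similarity, which is what later allows them to be grouped into the same cluster.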

Here is a repository of pre-trained models: https://huggingface.co/models. Depending on what sort of textual data you are working with (e.g., scientific articles, social media, etc.), you might want to use a model that was trained on text similar in nature to what you have.

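We will be using BERTopic's default sentence-transformers model for the English language. In recent BERTopic releases this is "all-MiniLM-L6-v2"; earlier releases used DistilBERT-based models such as "distilbert-base-nli-mean-tokens" (link to the DistilBERT paper: https://arxiv.org/abs/1910.01108).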

In [8]:
# create an instance of BERTopic for "english"; calculate_probabilities=True saves the topic probabilities (we will use them later for visualization)
model = BERTopic(language="english",calculate_probabilities=True,verbose=True)


Notice that we are not specifying the number of topics here, contrary to when we do topic modeling using LDA.

You can think of this as similar to DBSCAN clustering, where we don't specify the number of clusters (vs. k-means, where we had to specify it). In fact, BERTopic by default clusters the embeddings with HDBSCAN, a density-based relative of DBSCAN.

## Running BERTopic on our corpus

In [9]:
# apply the model on our documents and save both topics and probabilities (probability of each document belonging to any of the topics)
topics, probabilities = model.fit_transform(docs)

2023-11-29 22:35:49,093 - BERTopic - Embedding - Transforming documents to embeddings.



2023-11-29 22:36:04,015 - BERTopic - Embedding - Completed ✓
2023-11-29 22:36:04,018 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-11-29 22:36:16,805 - BERTopic - Dimensionality - Completed ✓
2023-11-29 22:36:16,807 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-11-29 22:36:16,906 - BERTopic - Cluster - Completed ✓
2023-11-29 22:36:16,918 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-11-29 22:36:17,014 - BERTopic - Representation - Completed ✓


## Derived Topics and their frequency

Topic **-1** refers to all documents that did not have any topics assigned (outlier topic).

In [10]:
# number of topics and freq (number of documents assigned to each topic)
model.get_topic_freq().shape

(23, 2)

In [11]:
# Topic -1 refers to all documents that did not have any topics assigned (outlier topic).
model.get_topic_freq()

| | Topic | Count |
|---|---|---|
| 1 | -1 | 362 |
| 6 | 0 | 63 |
| 7 | 1 | 60 |
| 0 | 2 | 48 |
| 5 | 3 | 44 |
| 12 | 4 | 42 |
| 3 | 5 | 41 |
| 2 | 6 | 39 |
| 4 | 7 | 37 |
| 8 | 8 | 35 |
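
The same counts can also be derived by hand from the list of topic assignments. The sketch below uses a short, made-up list in place of the `topics` list returned by `fit_transform`; the name `example_topics` and its values are illustrative assumptions.

```python
from collections import Counter

# Hypothetical topic assignments for ten documents (illustration only);
# in the notebook, `topics` is the list returned by fit_transform
example_topics = [0, -1, 1, 0, 2, -1, 0, 1, -1, 0]

counts = Counter(example_topics)

# Topic -1 collects the outliers, so it is not counted as a real topic
n_outliers = counts.get(-1, 0)
n_topics = len([t for t in counts if t != -1])

print(n_topics)
print(n_outliers)
print(counts.most_common())
```

Replacing `example_topics` with `topics` should reproduce the frequency table, with the outlier topic excluded from the topic count.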


### Question 1
How many topics were derived (aside from the outlier cluster)?

...

**Answer:**
22 topics are derived: the frequency table has 23 rows, one of which is the outlier topic (-1).

### Question 2
How many documents are assigned to the first and second topic?

...

**Answer:**
The first topic (Topic 0) has 63 documents and
the second topic (Topic 1) has 60 documents.

## Top words for a topic
Let's take a look at the top words that represent topic 1, the second-largest topic after topic 0 (BERTopic numbers topics from 0 in decreasing order of size). You can simply change the topic number to look at the top words for other topics.

In [12]:
# get top words for topic 1
model.get_topic(1)

[('back', 0.20412368750633197),
 ('will', 0.11352941431281269),
 ('go', 0.11339850248902238),
 ('be', 0.11063323503660967),
 ('wont', 0.08776635935648835),
 ('again', 0.08033916048914416),
 ('here', 0.06460700964176065),
 ('never', 0.05693880208343854),
 ('coming', 0.05628077139002808),
 ('dont', 0.05556542266961629)]
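
The scores next to each word are class-based TF-IDF (c-TF-IDF) weights. As a rough sketch of the idea, the snippet below treats each cluster as one long document and weights a word by `tf * log(1 + A / f)`, where `tf` is the word's count in the cluster, `f` its count across all clusters, and `A` the average number of words per cluster, following the formula described in the BERTopic documentation. The toy clusters are made up, and BERTopic's own implementation includes normalization and other details, so the numbers will not match the output above.

```python
import math
from collections import Counter

# Hypothetical clusters of already-grouped reviews (illustration only)
clusters = {
    0: ["will not go back", "never going back again"],
    1: ["fresh fish and shrimp", "the shrimp was fresh"],
}

# Treat each cluster as one long document and count its words
tf = {c: Counter(" ".join(texts).split()) for c, texts in clusters.items()}

# Frequency of each word across all clusters, and average words per cluster
total = Counter()
for counts in tf.values():
    total.update(counts)
avg_words = sum(total.values()) / len(tf)

def c_tf_idf(word, cluster):
    """Weight of `word` in `cluster`: tf * log(1 + A / f)."""
    return tf[cluster][word] * math.log(1 + avg_words / total[word])

# Three highest-weighted words per cluster
top_words = {
    c: sorted(tf[c], key=lambda w: c_tf_idf(w, c), reverse=True)[:3]
    for c in tf
}
print(top_words)
```

Words that are frequent inside one cluster but rare elsewhere receive the highest weights, which is why they serve as the topic description.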

## Visualizing topics
We can also visualize the derived topics (note that we installed the visualization extras at the beginning of this notebook).

In [13]:
model.visualize_topics()

## Document-Topic probabilities

Each document in our corpus has a probability of belonging to each of the derived topics; we can extract these probabilities and use them for another task (for example, as features in a predictive model).


In [14]:
probabilities.shape

(996, 22)

In [None]:
# for example, we can export all these document-topic probabilities as a csv file.
# pd.DataFrame(probabilities).to_csv("probs.csv")
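
One simple use of this matrix is to look up the most likely topic for each document by taking the position of the largest value in each row. The sketch below uses a small, made-up matrix in plain Python lists; `example_probs` and its values are illustrative assumptions.

```python
# Hypothetical document-topic probabilities: 3 documents x 4 topics
example_probs = [
    [0.05, 0.70, 0.10, 0.15],
    [0.40, 0.10, 0.45, 0.05],
    [0.25, 0.25, 0.30, 0.20],
]

# Index of the largest probability in each row = most likely topic
most_likely = [max(range(len(row)), key=row.__getitem__) for row in example_probs]
print(most_likely)  # [1, 2, 2]
```

With the NumPy array from this notebook, `probabilities.argmax(axis=1)` does the same thing. Note that the result can differ from the `topics` list, since documents treated as outliers are labeled -1 there.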

## Visualize Topic probability distribution
We can also visualize the Topic probability distribution for a specific document.
Note that topics with a probability below the specified threshold are not shown.

In [15]:
docs[4] # review number 5

'The selection on the menu was great and so were the prices.'

In [16]:
# probability of docs[4] (review 5) belonging to each of the topics
model.visualize_distribution(probabilities[4],min_probability=0.005)
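
To see what the threshold does, the sketch below filters a made-up probability vector by hand. The values are illustrative, and whether BERTopic uses a strict or non-strict comparison at the threshold is an assumption here.

```python
# Hypothetical topic probabilities for one document (illustration only)
doc_probs = [0.001, 0.120, 0.004, 0.030, 0.600]
min_probability = 0.005  # same threshold as in the cell above

# Keep only (topic, probability) pairs at or above the threshold
# (assumption: the exact comparison used by BERTopic may differ)
shown = [(topic, p) for topic, p in enumerate(doc_probs) if p >= min_probability]

# Sort so that the most probable topic comes first
shown.sort(key=lambda pair: pair[1], reverse=True)
print(shown)  # [(4, 0.6), (1, 0.12), (3, 0.03)]
```

Lowering `min_probability` shows more topics in the chart, while raising it keeps only the strongest ones.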

### Question 3
Which topic does the 10th review belong to?

**Answer:**
The 10th review belongs to Topic 7.

## What topic would a new review be most similar to?
Let's see which topic(s) a new review would be more similar to.

We have the option to compute the similarity of a new piece of text to the derived topics (in terms of cosine similarity between embeddings).

In [17]:
new_review="The food was too salty but I liked the atmosphere."

In [18]:
model.find_topics(new_review)

([8, 6, 3, -1, 15], [0.69908607, 0.62514675, 0.62153083, 0.5863854, 0.5617428])
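
The output is a pair of lists: the topic numbers and their similarity scores, in matching order. A minimal sketch of reading it, with the values copied from the output above (the variable names are illustrative):

```python
# Output of model.find_topics(new_review), copied from the cell above
similar_topics, similarity = (
    [8, 6, 3, -1, 15],
    [0.69908607, 0.62514675, 0.62153083, 0.5863854, 0.5617428],
)

# Pair each topic with its score and drop the outlier topic (-1)
ranked = [(t, round(s, 3)) for t, s in zip(similar_topics, similarity) if t != -1]

# The list is already sorted by similarity, so the first entry is the best match
best_topic, best_score = ranked[0]
print(ranked)
print(best_topic)
```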

### Question 4
Which topic(s) is the new review most similar to?

...

**Answer:**
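The new review is most similar to topic 8 (cosine similarity of about 0.70), followed by topics 6 and 3 (about 0.62 each), the outlier topic -1, and topic 15.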

In [19]:
model.get_topic(15)

[('fish', 0.10111931843438797),
 ('shrimp', 0.09707208697246344),
 ('fresh', 0.07842680630214127),
 ('seafood', 0.07583948882579099),
 ('legs', 0.0607799035286924),
 ('crab', 0.0607799035286924),
 ('salt', 0.056249224087043546),
 ('salmon', 0.05304189355755465),
 ('the', 0.04460991929946435),
 ('was', 0.0414255349936863)]