# Transformer-based NLP topic modeling using the Python package BERTopic

# Intro

BERTopic is a Python topic modeling library that combines transformer embeddings with clustering algorithms to identify topics in NLP (Natural Language Processing) data. In this notebook, we will talk about:
* How are transformers, c-TF-IDF, and clustering models used behind BERTopic?
* How to extract and interpret topics from the topic modeling results?
* How to make predictions using topic modeling?
* How to save and load a BERTopic topic model?

# BERTopic Model Algorithms

In this section, we will talk about the algorithms behind the BERTopic model.
* **Document Embedding**: Firstly, we need to get the embeddings for all the documents. Embeddings are the vector representations of the documents.
 * By default, BERTopic uses an English `sentence_transformers` model to get document embeddings.
 * If the documents contain multiple languages, we can use `BERTopic(language="multilingual")`, which supports topic modeling in over 50 languages.
 * BERTopic also supports pre-trained models from other Python packages such as Hugging Face and Flair.
* **Dimension Reduction and Document Clustering**: After the text documents have been transformed into embeddings, the next step is to run a clustering model on the embedded documents. Because the embedding vectors usually have very high dimensions, dimension reduction techniques are used to reduce the dimensionality first.
 * The default algorithm for dimension reduction is UMAP (Uniform Manifold Approximation and Projection). Compared with other dimension reduction techniques such as PCA (Principal Component Analysis), UMAP keeps more of the data's local and global structure when reducing the dimensionality, which is important for representing the semantics of the text data. BERTopic provides the option of using other dimensionality reduction techniques by changing the `umap_model` argument of the `BERTopic` constructor.
 * The default algorithm for clustering is HDBSCAN. HDBSCAN is a density-based clustering model. It identifies the number of clusters automatically and, unlike most clustering models, does not require specifying the number of clusters beforehand.
* **Topic Representation**: After assigning each document in the corpus to a cluster, the next step is to get the topic representation using a class-based TF-IDF called c-TF-IDF. The words with the highest c-TF-IDF scores are selected to represent each topic.
 * c-TF-IDF is similar to TF-IDF in that it measures term importance by term frequencies while taking into account the whole corpus (all the text data for the analysis).
 * c-TF-IDF differs from TF-IDF in the level at which term frequency is measured. In regular TF-IDF, TF measures the term frequency in each document. In c-TF-IDF, TF measures the term frequency in each cluster, and each cluster includes many documents.
* **Maximal Marginal Relevance (MMR)** (optional): After extracting the most important terms describing each cluster, there is an optional step to optimize the terms using Maximal Marginal Relevance (MMR). MMR has two benefits:
 * The first benefit is to increase the coherence among the terms for the same topic and remove irrelevant terms.
 * The second benefit is to improve the topic representation by removing synonyms and variations of the same words.
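
The optional MMR step can be illustrated with a small, self-contained sketch. It is not BERTopic's own code: the 2-D vectors and the `mmr` helper below are hypothetical, and the scoring rule (relevance to the topic minus a `diversity`-weighted redundancy with already selected terms) is one common way to write MMR.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr(topic_vec, candidates, top_n=2, diversity=0.5):
    """Greedily pick `top_n` words that are relevant to the topic
    but not too similar to the words already selected."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < top_n:
        best_word, best_score = None, -math.inf
        for word, vec in remaining.items():
            relevance = cosine(vec, topic_vec)
            redundancy = max(
                (cosine(vec, candidates[s]) for s in selected), default=0.0
            )
            score = (1 - diversity) * relevance - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
        del remaining[best_word]
    return selected


# Hypothetical 2-D embeddings, for illustration only
topic_vec = [1.0, 0.0]
candidates = {
    "technicien": [1.0, 0.1],
    "techniciens": [1.0, 0.12],   # near-duplicate of "technicien"
    "rendez-vous": [0.7, 0.7],
}

print(mmr(topic_vec, candidates, top_n=2, diversity=0.0))
# ['technicien', 'techniciens']
print(mmr(topic_vec, candidates, top_n=2, diversity=0.7))
# ['technicien', 'rendez-vous']
```

With `diversity=0.0` the two near-identical word forms are both kept; a higher `diversity` replaces the second one with a less redundant term.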


# Install And Import Python Libraries

In this section, we will install and import the Python libraries.

Firstly, let's install `bertopic`.

In [None]:
# Install bertopic
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.5.tar.gz (90 kB)
  Preparing metadata (setup.py) ... done
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-2.4.0-py3-none-any.whl (149 kB)

Secondly, let's import the necessary packages.

In [None]:
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

import pandas as pd
import numpy as np

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Understand data

The second step is to download and read the dataset.

* `drive.mount` mounts Google Drive so that the Colab notebook can access the data stored there.
* `os.chdir` changes the working directory. I set it to the Google Drive folder where the review dataset is saved.
* `!pwd` is used to print the current working directory.

Please check out [Google Colab Tutorial for Beginners](https://medium.com/towards-artificial-intelligence/google-colab-tutorial-for-beginners-834595494d44) for details about using Google Colab for data science projects.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/Colab Notebooks")

# Print out the current directory
!pwd

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab Notebooks


Now let's read the data `reviews_cleaned.parquet` into a `pandas` dataframe and see what the dataset looks like.

In [None]:
# Read in data
data = pd.read_parquet("/content/reviews_cleaned.parquet")


In [None]:
data.head()

Unnamed: 0.1,Unnamed: 0,page,titre,verbatim,date,note,reponse,date_experience,fournisseur,source,clean_verb,tokens,tokens_lem,tokens_processed,tokens_lem_processed
0,0,1,Aucun soucis particulier,Je paie ma facture tous les deux mois en fonct...,Il y a 17 heures,4,,Date de l'expérience: 01 décembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,je paie ma facture tous les deux mois en fonct...,"[[je, ], [paie, ], [ma, ], [facture, ], [tous,...","[[je, ], [pai, ], [ma, ], [factur, ], [tous, ]...","[paie, facture, mois, fonction, consommation, ...","[pai, factur, mois, fonction, consomm, exact, ..."
1,1,1,Engie facture a ses clients des sommes…,Engie facture a ses clients des sommes exorbit...,Il y a un jour,1,"Bonjour Julien Blanco,\n\nPour des raisons de ...",Date de l'expérience: 26 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,engie facture a ses clients des sommes exorbit...,"[[engie, ], [facture, ], [a, ], [ses, ], [clie...","[[engi, ], [factur, ], [a, ], [se, ], [client,...","[engie, facture, clients, exorbitants, engie, ...","[engi, factur, client, somm, exorbit, engi, fa..."
2,2,1,Facturation sur consommation d'un autre logement,Ils me facturent sur le pdl du logement au des...,ll y a 3 jours,1,"Bonjour BlooDz,\n\nPour des raisons de confide...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,ils me facturent sur le pdl du logement au des...,"[[ils, ], [me, ], [facturent, ], [sur, ], [le,...","[[il, ], [me, ], [facturent, ], [sur, ], [le, ...","[facturent, pdl, logement, disant, faute, jama...","[facturent, pdl, log, dis, faut, jam, pris, fa..."
3,3,1,un service client ou il est dur de…,un service client ou il est dur de comprendre ...,ll y a 3 jours,1,"Bonjour Ricanto77,\nPour des raisons de confid...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,un service client ou il est dur de comprendre ...,"[[un, ], [service, ], [client, ], [ou, ], [il,...","[[un, ], [servic, ], [client, ], [ou, ], [il, ...","[service, client, dur, comprendre, langue, uti...","[servic, client, dur, comprendr, langu, utilis..."
4,4,1,Client d'ENGIE depuis longtemps toujours satis...,Excellente expérience avec ENGIE et une interl...,Il y a 24 minutes,5,,Date de l'expérience: 01 décembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,excellente expérience avec engie et une interl...,"[[excellente, ], [expérience, ], [avec, ], [en...","[[excellent, ], [expérient, ], [avec, ], [engi...","[excellente, expérience, engie, interlocutrice...","[excellent, expérient, engi, interlocutric, so..."


# Embeddings



In [None]:
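# Initialize the embedding model with SentenceTransformer, using "paraphrase-multilingual-MiniLM-L12-v2"

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
# Optional: precompute the embeddings (encode expects a list of strings, not the dataframe)
# embeddings = model.encode(data["verbatim"].tolist())
# print(embeddings)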


# Dimensionality Reduction

By default, a BERTopic model produces different results on each run because of the stochasticity inherited from UMAP.

To get reproducible topics, we need to pass a value to the `random_state` parameter of the `UMAP` constructor.
* `n_neighbors=15` means that the local neighborhood size for UMAP is 15. This is the parameter that controls the local versus global structure in the data.
 * A low value forces UMAP to focus more on local structure, and may lose insights into the big picture.
 * A high value pushes UMAP to look at a broader neighborhood, and may lose details of the local structure.
 * The default `n_neighbors` value for UMAP is 15.
* `n_components=5` indicates that the target dimension from UMAP is 5. This is the dimension of the data that will be passed into the clustering model.
* `min_dist` controls how tightly UMAP is allowed to pack points together. It's the minimum distance between points in the low-dimensional space.
 * Small values of `min_dist` result in clumpier embeddings, which is good for clustering. Since our goal of dimension reduction is to build clustering models, we set `min_dist` to 0.
 * Large values of `min_dist` prevent UMAP from packing points together and preserve the broad structure of the data.
* `metric='cosine'` indicates that we will use cosine distance to measure the distance between documents.
* `random_state` sets a random seed to make the UMAP results reproducible.


In [None]:
# Instantiate UMAP
umap_reducer = UMAP(n_neighbors=15, min_dist=0, random_state=42, n_components=5, metric='cosine')


# Clustering



In [None]:
# Instantiate HDBSCAN
clusterer = HDBSCAN()

# Vectorizers

In [None]:
# Instantiate a CountVectorizer
vectorizer = CountVectorizer()

# c-TF-IDF

In BERTopic, in order to get an accurate representation of the topics from our bag-of-words matrix, TF-IDF was adjusted to work on a cluster/categorical/topic level instead of a document level. This adjusted TF-IDF representation is called c-TF-IDF and takes into account what makes the documents in one cluster different from the documents in another cluster.

In [None]:
# Instantiate a ClassTfidfTransformer
ctf_idf = ClassTfidfTransformer()
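
To make the idea concrete, here is a minimal pure-Python sketch of a class-based TF-IDF. It assumes the commonly documented form of the weight, term frequency in the cluster times `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` the frequency of the term across all clusters. The `c_tf_idf` helper and the toy clusters are hypothetical, and the sketch leaves out the normalization and options that `ClassTfidfTransformer` provides.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """clusters: dict mapping cluster id -> list of tokenized documents.
    Returns dict mapping cluster id -> {term: score}."""
    # Term frequency per cluster: all documents of a cluster are merged
    tf = {
        c: Counter(tok for doc in docs for tok in doc)
        for c, docs in clusters.items()
    }
    # Frequency of each term across all clusters
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per cluster
    avg_words = sum(total.values()) / len(tf)
    return {
        c: {t: n * math.log(1 + avg_words / total[t]) for t, n in counts.items()}
        for c, counts in tf.items()
    }


# Toy clusters of tokenized reviews, for illustration only
clusters = {
    0: [["facture", "trop", "chère"], ["facture", "erreur"]],
    1: [["technicien", "compétent"], ["technicien", "ponctuel", "facture"]],
}

scores = c_tf_idf(clusters)
for cluster_id, term_scores in scores.items():
    best = max(term_scores, key=term_scores.get)
    print(cluster_id, best, round(term_scores[best], 3))
```

In cluster 1, "facture" gets a low score even though it appears there, because it is frequent across the whole corpus, while "technicien" is concentrated in that cluster.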

# Topic Representation

After having generated our topics with c-TF-IDF, we might want to do some fine-tuning based on the semantic relationship between keywords/keyphrases and the set of documents in each topic. Although we can use a centroid-based technique for this, it can be costly and does not take the structure of a cluster into account. Instead, we leverage c-TF-IDF to create a set of representative documents per topic and use those as our updated topic embedding. Then, we calculate the similarity between candidate keywords and the topic embedding using the same embedding model that embedded the documents.

In [None]:
# Instantiate KeyBERTInspired
# KeyBERT is a minimal and easy-to-use method for keyword extraction with BERT embeddings

kw_model = KeyBERTInspired()
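
The ranking step described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the topic embedding is taken as the plain average of the representative documents' embeddings, and the `rank_keywords` helper and all vectors are hypothetical rather than taken from BERTopic.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_vector(vectors):
    """Element-wise average of a list of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_keywords(rep_doc_embeddings, keyword_embeddings, top_n=3):
    """Rank candidate keywords by similarity to the topic embedding,
    built here from the topic's representative documents."""
    topic_embedding = mean_vector(rep_doc_embeddings)
    scored = [
        (word, cosine(vec, topic_embedding))
        for word, vec in keyword_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical embeddings, for illustration only
rep_doc_embeddings = [[0.9, 0.1], [0.8, 0.3]]
keyword_embeddings = {
    "facture": [0.1, 1.0],
    "technicien": [1.0, 0.2],
    "intervention": [0.7, 0.4],
}

for word, score in rank_keywords(rep_doc_embeddings, keyword_embeddings):
    print(f"{word}: {score:.3f}")
```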


# Put It All Together

Finally, we pass the processed review documents to the topic model and save the results for topics and topic probabilities.

The values in `topics` represent the topic each document is assigned to.
The values in `probabilities` represent the probability that a document belongs to its assigned topic (or to each of the topics when the model is created with `calculate_probabilities=True`).

In [None]:
# Instantiate a BERTopic model with all the components above
bertopic = BERTopic(
    embedding_model=model,
    umap_model=umap_reducer,
    hdbscan_model=clusterer,
    vectorizer_model=vectorizer,
    ctfidf_model=ctf_idf,
    representation_model=kw_model,
)


# Analyse Topics

In [None]:
# Fit the model on the reviews and get one topic (and probability) per document
topics, probabilities = bertopic.fit_transform(data["verbatim"].tolist())

In [None]:
# Get all topics as a dict: topic number -> list of (word, weight) pairs
# Note: this rebinds `topics`; the per-document assignments stay available in bertopic.topics_
topics = bertopic.get_topics()
print(topics)

# Print the words of each topic on one line
for topic_num, topic in topics.items():
    words = ", ".join(word for word, _ in topic)
    print(f"Topic {topic_num}: {words}")

{-1: [('payer', 0.45674336), ('factures', 0.4250915), ('facture', 0.40951094), ('contrat', 0.40152055), ('fournisseur', 0.33512533), ('électricité', 0.30681598), ('compte', 0.30537814), ('demande', 0.30454472), ('depuis', 0.29089668), ('service', 0.28256637)], 0: [('technicienne', 0.79316366), ('techniciens', 0.75618535), ('technicien', 0.7526088), ('professionnels', 0.69989526), ('professionnel', 0.66280687), ('compétent', 0.5814539), ('compétents', 0.57771164), ('métier', 0.55272746), ('professionnalisme', 0.5466609), ('competent', 0.5377156)], 1: [('téléphonique', 0.8161819), ('telephone', 0.8077973), ('téléphoniques', 0.80605483), ('téléphone', 0.804651), ('contact', 0.68294376), ('telephonique', 0.6736375), ('communication', 0.6560931), ('accueil', 0.62653685), ('accueillant', 0.6212888), ('appel', 0.61798394)], 2: [('bien', 0.9551678), ('parfait', 0.91042167), ('fait', 0.76099765), ('très', 0.7358353), ('', 0.6872295), ('', 0.6872295), ('', 0.6872295), ('', 0.6872295), ('', 0.687

To look at a single topic, we can use `get_topic` and pass in the topic number. For example, `get_topic(0)` gives us the top 10 terms for topic 0 and their relative importance.

In [None]:
# Get the top 10 terms of the first topic
bertopic.get_topic(0)[:10]

[('technicienne', 0.79316366),
 ('techniciens', 0.75618535),
 ('technicien', 0.7526088),
 ('professionnels', 0.69989526),
 ('professionnel', 0.66280687),
 ('compétent', 0.5814539),
 ('compétents', 0.57771164),
 ('métier', 0.55272746),
 ('professionnalisme', 0.5466609),
 ('competent', 0.5377156)]

We can visualize the top keywords using a bar chart. `top_n_topics=12` means that we will create bar charts for the top 12 topics. The length of the bar represents the score of the keyword. A longer bar means higher importance for the topic.

In [None]:
# Visualize top topic keywords
bertopic.visualize_barchart(top_n_topics=12)

Another view for keyword importance is the "Term score decline per topic" chart. It's a line chart with the term rank being the x-axis and the c-TF-IDF score on the y-axis.

There are a total of 31 lines, one line for each topic. Hovering over the line shows the term score information.
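
BERTopic offers this chart through its `visualize_term_rank` method. The data behind each line can also be derived by hand from the `get_topics()` dictionary, as in the sketch below. The `term_score_decline` helper is hypothetical, and the sample dictionary only reuses the first three terms of topics 0 and 1 from the output above.

```python
def term_score_decline(topic_terms):
    """topic_terms: dict topic id -> list of (term, score) pairs,
    in the same shape as the dictionary returned by get_topics().
    Returns dict topic id -> list of (rank, score) points, one line per topic."""
    curves = {}
    for topic_id, terms in topic_terms.items():
        scores = sorted((score for _, score in terms), reverse=True)
        curves[topic_id] = list(enumerate(scores, start=1))
    return curves


# First three terms of topics 0 and 1, copied from the printed topics
topic_terms = {
    0: [("technicienne", 0.79316366), ("techniciens", 0.75618535), ("technicien", 0.7526088)],
    1: [("téléphonique", 0.8161819), ("telephone", 0.8077973), ("téléphoniques", 0.80605483)],
}

for topic_id, points in term_score_decline(topic_terms).items():
    print(topic_id, points)
```

Each list of `(rank, score)` points corresponds to one line of the chart: the x value is the term rank and the y value is the term score.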

# Topic Similarities

In this section, we will analyze the relationship between the topics generated by the topic model.

The intertopic distance map shows the distance between topics. Similar topics are closer to each other, and very different topics are far from each other. From the visualization, we can see that there are five topic groups for all the topics. Topics with similar semantic meanings are in the same topic group.

The size of the circle represents the number of documents in the topics, and larger circles mean that more reviews belong to the topic.

In [None]:
# Visualize intertopic distance
bertopic.visualize_topics(top_n_topics=30)

Another way to see how the topics are connected is through a hierarchical clustering graph. We can control the number of topics in the graph by the `top_n_topics` parameter.

In [None]:
# Visualize connections between topics using hierarchical clustering
bertopic.visualize_hierarchy(top_n_topics=13)


A heatmap can also be used to analyze the similarities between topics. The similarity score ranges from 0 to 1. A value close to 1 represents a higher similarity between the two topics, which is shown in a darker blue color.

In [None]:
bertopic.visualize_heatmap(top_n_topics=13)
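
As a rough illustration of what such a heatmap displays, the sketch below builds a pairwise cosine similarity matrix between topic embeddings. The three vectors are hypothetical and the `similarity_matrix` helper is not part of BERTopic; the library computes its own topic embeddings internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def similarity_matrix(topic_embeddings):
    """Square matrix whose entry [i][j] is the similarity of topics i and j."""
    return [
        [cosine_similarity(u, v) for v in topic_embeddings]
        for u in topic_embeddings
    ]


# Hypothetical topic embeddings, for illustration only
topic_embeddings = [
    [1.0, 0.0, 0.2],
    [0.9, 0.1, 0.3],
    [0.0, 1.0, 0.1],
]

matrix = similarity_matrix(topic_embeddings)
for row in matrix:
    print([round(value, 2) for value in row])
```

The diagonal is 1 because each topic is identical to itself, and the matrix is symmetric, which is why the heatmap mirrors along its diagonal.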

# Topic Model In-sample Predictions

In this section, we will talk about how to make in-sample predictions using the topic model.

The BERTopic model can output the predicted topic for each review in the dataset.

Using `.topics_`, we save the predicted topics in a list and then save it as a column in the review dataset.

In [None]:
# Get the topic predictions (one topic number per document, in input order)
prediction = bertopic.topics_

# Save the predictions as a new column next to the reviews.
# (Passing `prediction` as the second argument of pd.DataFrame would use it as
# the index and misalign the rows.)
df2 = data[["verbatim"]].copy()
df2["topic"] = prediction

# Take a look at the data
df2.head()
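
Once the assignments are available, a quick sanity check is to count how many documents fall into each topic. The sketch below assumes that topic `-1` collects the documents HDBSCAN treats as outliers; the `topic_sizes` helper and the short `assignments` list are hypothetical stand-ins for `bertopic.topics_`.

```python
from collections import Counter


def topic_sizes(assignments):
    """Count documents per topic, reporting outliers (topic -1) separately."""
    counts = Counter(assignments)
    outliers = counts.pop(-1, 0)
    return counts.most_common(), outliers


# Hypothetical topic assignments, standing in for bertopic.topics_
assignments = [-1, 146, -1, 61, 1, 1, 0, 1]

sizes, outliers = topic_sizes(assignments)
print("Largest topic (id, size):", sizes[0])
print("Outlier documents:", outliers)
```

A large share of outliers is a hint that the clustering parameters may need tuning before interpreting the topics.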


# Topic Model Predictions on New Data

In this step, we will talk about how to use the BERTopic model to make predictions on new reviews.

Let's say there is a new review "I like the new headphone. Its sound quality is great.", and we would like to automatically predict the topic for this review.
* Firstly, let's decide the number of topics to include in the prediction.
 * If we would like to assign only one topic to the document, then the number of topics should be 1.  
 * If we would like to assign multiple topics to the document, then the number of topics should be greater than 1. Here we are getting the top 3 topics that are most relevant to the new review.
* After that, we pass the new review and the number of topics to the `find_topics` method. This gives us the topic number and the similarity value.
* Finally, the results are printed. The top 3 similar topics for the new review are topic 1, topic 0, and topic 2. The similarities are 0.43, 0.34, and 0.30.


In [None]:
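# New data for the review
new_review = "I like the new headphone. Its sound quality is great."

# Find the 3 topics most similar to the new review
num_of_topics = 3
similar_topics, similarity = bertopic.find_topics(new_review, top_n=num_of_topics)

# Print results
print(f"The top {num_of_topics} similar topics are {similar_topics}, "
      f"and the similarities are {similarity}")


To verify if the assigned topics are a good fit for the new review, let's get the top keywords for the top 3 topics using the `get_topic` method.

In [None]:
# Print the top keywords for the top similar topics
for topic_id in similar_topics:
    print(f"Topic {topic_id}: {bertopic.get_topic(topic_id)}")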

We can see that topic 1 is about headsets and topic 0 is about sound quality. Both topics are a good fit for the new review. Topic 2 is about the earpiece, which is similar to the headset. From this example, we can see that the BERTopic model made good predictions on the new document.

# Save and Load Topic Models

In [None]:
# Save the topic model (the fitted model is stored in the `bertopic` variable)
bertopic.save("amz_review_topic_model")

# Load the topic model
my_model = BERTopic.load("amz_review_topic_model")

# References

* [BERTopic GitHub](https://github.com/MaartenGr/BERTopic)
* [Documentation on BERTopic algorithms](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview)
* [UMAP documentation](https://umap-learn.readthedocs.io/en/latest/parameters.html)