<a href="https://colab.research.google.com/github/zwxgrace/Deep-Learning-for-NLP-CV/blob/main/%E2%80%9CC5_X_HEC_BERTopic_ipynb%E2%80%9D%E7%9A%84%E5%89%AF%E6%9C%AC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer-based NLP topic modeling using the Python package BERTopic

# Intro



BERTopic is a Python topic modeling library that combines transformer embeddings with clustering algorithms to identify topics in NLP (Natural Language Processing) tasks. In this notebook, we will cover:
* How transformers, c-TF-IDF, and clustering models are used inside BERTopic
* How to extract and interpret topics from the topic modeling results
* How to make predictions using the topic model
* How to save and load a BERTopic topic model

# BERTopic Model Algorithms

In this section, we will talk about the algorithms behind the BERTopic model.
* **Document Embedding**: First, we need to get the embeddings for all the documents. Embeddings are vector representations of the documents.
 * By default, BERTopic uses an English `sentence_transformers` model to get document embeddings.
 * If the documents are written in several languages, or in a language other than English, we can use `BERTopic(language="multilingual")`, which supports topic modeling in over 50 languages.
 * BERTopic also supports pre-trained models from other Python packages such as Hugging Face and Flair.
* **Dimension Reduction and Document Clustering**: After the text documents have been transformed into embeddings, the next step is to run a clustering model on the embedded documents. Because the embedding vectors usually have a very high dimension, a dimension reduction technique is applied first.
 * The default algorithm for dimension reduction is UMAP (Uniform Manifold Approximation & Projection). Compared with other dimension reduction techniques such as PCA (Principal Component Analysis), UMAP preserves much of the data's local and global structure when reducing the dimensionality, which is important for representing the semantics of the text data. BERTopic provides the option of using other dimensionality reduction techniques by changing the `umap_model` argument of the `BERTopic` constructor.
 * The default algorithm for clustering is HDBSCAN. HDBSCAN is a density-based clustering model. It determines the number of clusters automatically, so the number of clusters does not have to be specified beforehand as with many other clustering models.
* **Topic Representation**: After assigning each document in the corpus to a cluster, the next step is to get the topic representation using a class-based TF-IDF called c-TF-IDF. The words with the highest c-TF-IDF scores are selected to represent each topic.
 * c-TF-IDF is similar to TF-IDF in that it measures term importance by term frequency while taking into account the whole corpus (all the text data for the analysis).
 * c-TF-IDF differs from TF-IDF in the level at which term frequency is measured. In regular TF-IDF, TF measures the term frequency in each document, whereas in c-TF-IDF, TF measures the term frequency in each cluster, and each cluster contains many documents.
* **Maximal Marginal Relevance (MMR)** (optional): After extracting the most important terms describing each cluster, there is an optional step to optimize the terms using Maximal Marginal Relevance (MMR). MMR has two benefits:
 * The first benefit is to increase the coherence among the terms of the same topic and remove irrelevant terms.
 * The second benefit is to diversify the topic representation by removing synonyms and variations of the same word.
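
To make the optional MMR step more concrete, here is a minimal sketch of a generic greedy MMR selection in plain Python. It is not BERTopic's own implementation: the function names, the `diversity` weighting, and the 2-D toy vectors are assumptions made up for illustration, whereas a real run would use vectors from the embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr(topic_vec, word_vecs, top_n=2, diversity=0.5):
    """Greedy MMR: trade relevance to the topic against redundancy with picked words."""
    candidates = list(word_vecs)
    selected = []
    while candidates and len(selected) < top_n:
        def score(word):
            relevance = cosine(word_vecs[word], topic_vec)
            # Highest similarity to any word that was already selected
            redundancy = max(
                (cosine(word_vecs[word], word_vecs[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy 2-D "embeddings" (hypothetical values, for illustration only)
topic_vec = [1.0, 0.2]
word_vecs = {
    "facture": [1.0, 0.1],
    "factures": [1.0, 0.12],        # near-duplicate of "facture"
    "remboursement": [0.6, 0.8],
}

print(mmr(topic_vec, word_vecs, top_n=2, diversity=0.0))  # relevance only
print(mmr(topic_vec, word_vecs, top_n=2, diversity=0.7))  # penalizes near-duplicates
```

With `diversity=0.0` the two near-duplicate words are both kept. With a higher `diversity`, the second pick moves to a less redundant term, which is the "removing variations of the same word" effect described above.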


# Install And Import Python Libraries

In this section, we will install and import the Python libraries.

First, let's install `bertopic`.

In [3]:
# Install bertopic
!pip install tensorflow==2.18 --quiet
!pip install bertopic --quiet


Next, let's import the necessary packages.

In [4]:
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

import pandas as pd
import numpy as np

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Understand data

The next step is to read the dataset.

* `drive.mount` mounts Google Drive so that the Colab notebook can access the data stored there.
* `os.chdir` changes the current working directory. Here it is set to the Google Drive folder where the review dataset is saved.
* `!pwd` prints the current working directory.

Please check out [Google Colab Tutorial for Beginners](https://medium.com/towards-artificial-intelligence/google-colab-tutorial-for-beginners-834595494d44) for details about using Google Colab for data science projects.

In [16]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Change directory to the folder that contains the dataset
# (os.chdir expects a directory, not the parquet file itself)
import os
os.chdir("/content/drive/MyDrive/HEC")

# Print out the current directory
!pwd

Mounted at /content/drive

Now let's read the data `reviews_cleaned.parquet` into a `pandas` dataframe and see what the dataset looks like.

In [15]:
# Read in data
reviews_cleaned = pd.read_parquet("/content/drive/MyDrive/HEC/reviews_cleaned.parquet")

In [17]:
# Review pages of the supplier we want to analyze
list_total = [
    'https://fr.trustpilot.com/review/totalenergies.fr',
    'https://www.avis-verifies.com/avis-clients/totalenergies.fr',
    'https://www.avis-verifies.com/avis-clients/totalenergies.fr?filtre=&p=87',
    'https://www.avis-verifies.com/avis-clients/totalenergies.fr?filtre=&p=478',
]

In [18]:
reviews_cleaned

Unnamed: 0.1,Unnamed: 0,page,titre,verbatim,date,note,reponse,date_experience,fournisseur,source,clean_verb,tokens,tokens_lem,tokens_processed,tokens_lem_processed
0,0,1,Aucun soucis particulier,Je paie ma facture tous les deux mois en fonct...,Il y a 17 heures,4,,Date de l'expérience: 01 décembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,je paie ma facture tous les deux mois en fonct...,"[[je, ], [paie, ], [ma, ], [facture, ], [tous,...","[[je, ], [pai, ], [ma, ], [factur, ], [tous, ]...","[paie, facture, mois, fonction, consommation, ...","[pai, factur, mois, fonction, consomm, exact, ..."
1,1,1,Engie facture a ses clients des sommes…,Engie facture a ses clients des sommes exorbit...,Il y a un jour,1,"Bonjour Julien Blanco,\n\nPour des raisons de ...",Date de l'expérience: 26 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,engie facture a ses clients des sommes exorbit...,"[[engie, ], [facture, ], [a, ], [ses, ], [clie...","[[engi, ], [factur, ], [a, ], [se, ], [client,...","[engie, facture, clients, exorbitants, engie, ...","[engi, factur, client, somm, exorbit, engi, fa..."
2,2,1,Facturation sur consommation d'un autre logement,Ils me facturent sur le pdl du logement au des...,ll y a 3 jours,1,"Bonjour BlooDz,\n\nPour des raisons de confide...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,ils me facturent sur le pdl du logement au des...,"[[ils, ], [me, ], [facturent, ], [sur, ], [le,...","[[il, ], [me, ], [facturent, ], [sur, ], [le, ...","[facturent, pdl, logement, disant, faute, jama...","[facturent, pdl, log, dis, faut, jam, pris, fa..."
3,3,1,un service client ou il est dur de…,un service client ou il est dur de comprendre ...,ll y a 3 jours,1,"Bonjour Ricanto77,\nPour des raisons de confid...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,un service client ou il est dur de comprendre ...,"[[un, ], [service, ], [client, ], [ou, ], [il,...","[[un, ], [servic, ], [client, ], [ou, ], [il, ...","[service, client, dur, comprendre, langue, uti...","[servic, client, dur, comprendr, langu, utilis..."
4,4,1,Client d'ENGIE depuis longtemps toujours satis...,Excellente expérience avec ENGIE et une interl...,Il y a 24 minutes,5,,Date de l'expérience: 01 décembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,excellente expérience avec engie et une interl...,"[[excellente, ], [expérience, ], [avec, ], [en...","[[excellent, ], [expérient, ], [avec, ], [engi...","[excellente, expérience, engie, interlocutrice...","[excellent, expérient, engi, interlocutric, so..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37289,37289,508,Avis client,Le commercial est très bien Le SAV est à revoir,le 22/02/2022 par Claude P.,5,,suite à une expérience du 31/01/2022,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies,le commercial est très bien le sav est à revoir,"[[le, ], [commercial, ], [est, ], [très, ], [b...","[[le, ], [commercial, ], [est, ], [tres, ], [b...","[commercial, sav, revoir]","[commercial, sav, revoir]"
37290,37290,508,Avis client,tres professionnel et maintrisant bien le sujet,le 22/02/2022 par Jacky P.,4,,suite à une expérience du 27/01/2022,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies,tres professionnel et maintrisant bien le sujet,"[[tres, ], [professionnel, ], [et, ], [maintri...","[[tre, ], [professionnel, ], [et, ], [maintris...","[professionnel, maintrisant]","[tre, professionnel, maintris]"
37291,37291,508,Avis client,Je le décrirai d'une façon totalement profesio...,le 22/02/2022 par ANNIC B.,4,,suite à une expérience du 25/01/2022,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies,je le décrirai d'une façon totalement profesio...,"[[je, ], [le, ], [décrirai, ], [d', ], [une, ]...","[[je, ], [le, ], [decr, ], [d', ], [une, ], [f...","[décrirai, totalement, profesionnelle, rapport...","[decr, total, profesionnel, rapport, demand]"
37292,37292,508,Avis client,"Un rendez-vous qui s'est très bien déroulé, un...",le 22/02/2022 par Aurélien N.*,5,,suite à une expérience du 27/01/2022\n*Informa...,https://www.avis-verifies.com/avis-clients/edf...,avis_verifies,un rendez vous qui s'est très bien déroulé un ...,"[[un, ], [rendez, ], [vous, ], [qui, ], [s', ]...","[[un, ], [rend, ], [vous, ], [qui, ], [s', ], ...","[rendez, déroulé, nécessaire, savoir, riche, i...","[déroul, nécessair, savoir, rich, inform]"


In [19]:
# Keep only the lowest-rated reviews (note <= 1) that come from the pages in list_total
reviews_cleaned = reviews_cleaned[(reviews_cleaned.note <= 1) & (reviews_cleaned.fournisseur.isin(list_total))]
reviews_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3065 entries, 5002 to 31251
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Unnamed: 0            3065 non-null   int64 
 1   page                  3065 non-null   int64 
 2   titre                 3065 non-null   object
 3   verbatim              3065 non-null   object
 4   date                  3065 non-null   object
 5   note                  3065 non-null   int64 
 6   reponse               2326 non-null   object
 7   date_experience       3065 non-null   object
 8   fournisseur           3065 non-null   object
 9   source                3065 non-null   object
 10  clean_verb            3065 non-null   object
 11  tokens                3065 non-null   object
 12  tokens_lem            3065 non-null   object
 13  tokens_processed      3065 non-null   object
 14  tokens_lem_processed  3065 non-null   object
dtypes: int64(3), object(12)
memory usage: 3

In [20]:
# Collect all review texts in a list
docs = reviews_cleaned['verbatim'].to_list()

In [21]:
docs

["Cliente chez total énergie depuis près de deux ans pour ses tarifs qui restent concurrentiels, je vais toutefois changer de fournisseur à cause de leur service client et du manque de transparence dans le suivi des contrats. En effet, je me suis aperçue que Total énergie continuait de me prélever l'abonnement et les frais pour un contrat sans aucune consommation dont j'avais demandé la clôture il y a un an. La réponse du service client a été qu'ils n'avaient pas trace de notre demande de clôture et qu'il n'y aurait aucun remboursement. Ils ont pourtant bien noté l'ouverture d'un autre compte pour la même adresse qui avait été demandé en même temps et qui génère une autre facturation. Dans tous les cas, ils ont un devoir d'information auprès d'un client qui paie un abonnement pour un compte qui affiche zéro consommation depuis plusieurs mois. Bien entendu aucun remboursement ou geste commercial n'a été proposé. Je vais donc saisir le médiateur de l'énergie et changer d'opérateur.",
 'N

# Embeddings



In [22]:
# Instantiate the embedding model: a multilingual SentenceTransformer, since the reviews are in French
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")


# Dimensionality Reduction

By default, a BERTopic model produces different results on each run because of the stochasticity inherited from UMAP.

To get reproducible topics, we need to pass a value to the `random_state` parameter of `UMAP`.
* `n_neighbors` sets the local neighborhood size for UMAP (30 in the cell below). This parameter controls the balance between local and global structure in the data.
 * A low value forces UMAP to focus more on local structure, and may lose insights into the big picture.
 * A high value pushes UMAP to look at a broader neighborhood, and may lose details of the local structure.
 * The default `n_neighbors` value for UMAP is 15.
* `n_components=5` indicates that the target dimension from UMAP is 5. This is the dimension of the data that will be passed into the clustering model.
* `min_dist` controls how tightly UMAP is allowed to pack points together. It is the minimum distance between points in the low-dimensional space.
 * Small values of `min_dist` result in clumpier embeddings, which is good for clustering. Since the goal of dimension reduction here is to feed a clustering model, we set `min_dist` to 0.
 * Large values of `min_dist` prevent UMAP from packing points together and preserve the broad structure of the data.
* `metric='cosine'` indicates that cosine distance is used to measure the distance between documents.
* `random_state` sets a random seed to make the UMAP results reproducible.
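
To build some intuition for `n_neighbors` and `metric='cosine'`, the sketch below computes the nearest neighbors of a point under cosine distance in plain Python. It is only an illustration of the neighborhood idea, not of UMAP itself, and the helper names and toy points are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; depends on direction only, not on vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def nearest_neighbors(points, index, k):
    """Indices of the k points closest to points[index] under cosine distance."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: cosine_distance(points[index], points[i]))
    return others[:k]

# Toy points (hypothetical): two groups pointing in different directions
points = [
    [1.0, 0.0],    # 0
    [0.9, 0.1],    # 1: similar direction to point 0
    [10.0, 0.5],   # 2: much longer vector, but similar direction to point 0
    [0.0, 1.0],    # 3: orthogonal to point 0
    [0.1, 0.9],    # 4: similar direction to point 3
]

print(nearest_neighbors(points, 0, k=2))  # small neighborhood: stays within the local group
print(nearest_neighbors(points, 0, k=4))  # larger neighborhood: also reaches the other group
```

A small `k` only sees points from the same local group, while a larger `k` starts to include points from the other group. This is the local versus global trade-off that `n_neighbors` controls.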


In [23]:
import matplotlib.pyplot as plt

In [24]:
# Instantiate UMAP
reducer = UMAP(random_state=42, n_neighbors=30, n_components=5, min_dist=0, metric='cosine')

# Clustering



In [25]:
# Instantiate HDBSCAN with its default settings
hscan = HDBSCAN()

# Vectorizers

In [26]:
count_vect = CountVectorizer(ngram_range=(1, 3))
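
`ngram_range=(1, 3)` means that single words, pairs of words, and triples of consecutive words are all counted as candidate terms. This is why topic keywords such as "le service client" can appear later on. The plain-Python sketch below illustrates the idea. The helper `word_ngrams` is hypothetical and splits on whitespace only, whereas `CountVectorizer` applies its own tokenization rules.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of the text for n in the given inclusive range."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens over the text
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("le service client"))
# ['le', 'service', 'client', 'le service', 'service client', 'le service client']
```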

# c-TF-IDF

In BERTopic, in order to get an accurate representation of the topics from our bag-of-words matrix, TF-IDF is adjusted to work on a cluster/categorical/topic level instead of a document level. This adjusted TF-IDF representation is called c-TF-IDF and takes into account what makes the documents in one cluster different from the documents in another cluster.

In [27]:
# Instantiate a ClassTfidfTransformer
classTFIDF = ClassTfidfTransformer()
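
As a rough illustration of the class-level weighting, the sketch below scores each term of a cluster as its normalized frequency in that cluster multiplied by `log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is the frequency of the term across all clusters. This follows the c-TF-IDF formula as we understand it from the BERTopic documentation. The function `c_tf_idf`, the whitespace tokenization, and the toy clusters are simplifying assumptions, not the library's implementation.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_class):
    """Score each term per class: normalized class tf * log(1 + A / total term frequency)."""
    # Treat all documents of a class as one long document and count its words
    class_counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in docs_per_class.items()
    }
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class
    avg_words = sum(total_counts.values()) / len(class_counts)
    scores = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

# Toy clusters (hypothetical documents, for illustration only)
clusters = {
    "billing": ["facture trop chère", "erreur de facture"],
    "support": ["service client injoignable", "service client incompétent"],
}
scores = c_tf_idf(clusters)
for label, term_scores in scores.items():
    best = max(term_scores, key=term_scores.get)
    print(label, best, round(term_scores[best], 4))
```

Terms that are frequent inside one cluster get a high score for that cluster, while the logarithmic factor gives less weight to terms that are frequent across all clusters.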

# Topic Representation

After having generated our topics with c-TF-IDF, we might want to do some fine-tuning based on the semantic relationship between keywords/keyphrases and the set of documents in each topic. Although we can use a centroid-based technique for this, it can be costly and does not take the structure of a cluster into account. Instead, we leverage c-TF-IDF to create a set of representative documents per topic and use those as our updated topic embedding. Then, we calculate the similarity between candidate keywords and the topic embedding using the same embedding model that embedded the documents.

In [28]:
# Instantiate a KeyBERTInspired representation model
keyBERTinsp = KeyBERTInspired()
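
The ranking step described above can be sketched with toy vectors: the topic embedding is taken as the mean of the representative document embeddings, and candidate keywords are sorted by cosine similarity to it. The helper names and the 3-D vectors are hypothetical. In the real model the vectors come from the embedding model, and the library's internals may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rank_keywords(candidate_vecs, representative_doc_vecs, top_n=3):
    """Rank candidate keywords by similarity to the mean of the representative documents."""
    topic_vec = mean_vector(representative_doc_vecs)
    ranked = sorted(
        candidate_vecs,
        key=lambda word: cosine(candidate_vecs[word], topic_vec),
        reverse=True,
    )
    return ranked[:top_n]

# Toy embeddings (hypothetical values, for illustration only)
representative_doc_vecs = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]]
candidate_vecs = {
    "service client": [0.9, 0.2, 0.0],
    "nous": [0.3, 0.3, 0.9],
    "incompétent": [0.7, 0.4, 0.1],
}

print(rank_keywords(candidate_vecs, representative_doc_vecs, top_n=2))
```

A generic word such as "nous" points in a different direction from the topic embedding and drops out of the top keywords, even if it is frequent in the cluster.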

# Put All together

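Finally, we pass the review documents to the topic model and save the resulting topics and topic probabilities.

The values in `topics` are the topics each document is assigned to, where `-1` marks outlier documents that were not assigned to any topic.
The values in `probabilities` are the probabilities of each document belonging to its assigned topic. A full document-by-topic probability matrix is only returned when the model is created with `calculate_probabilities=True`.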

In [29]:
# Instantiate a BERTopic model with all the components above and fit it to the documents
topic_model = BERTopic(embedding_model=embedding_model, umap_model=reducer, hdbscan_model=hscan,
                       vectorizer_model=count_vect, ctfidf_model=classTFIDF,
                       representation_model=keyBERTinsp, language='French')
topics, probabilities = topic_model.fit_transform(docs)
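
Since `topics` is a plain list with one topic number per document, a quick summary can be computed with the standard library. The sketch below uses a small made-up list instead of the real `topics` output, and the helper `summarize_topics` is hypothetical.

```python
from collections import Counter

def summarize_topics(topics):
    """Count documents per topic, treating -1 as the outlier label."""
    counts = Counter(topics)
    outliers = counts.pop(-1, 0)
    return {
        "n_topics": len(counts),
        "outlier_share": outliers / len(topics) if topics else 0.0,
        "largest": counts.most_common(2),
    }

# Made-up topic assignments for 8 documents (not the notebook's real output)
toy_topics = [-1, 0, 0, 1, -1, 0, 2, 1]
summary = summarize_topics(toy_topics)
print(summary)
# {'n_topics': 3, 'outlier_share': 0.25, 'largest': [(0, 3), (1, 2)]}
```

Applied to the real `topics` list, the same function shows at a glance how many documents ended up in the outlier topic `-1`.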

# Analyse Topics

In [30]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,1365,-1_le service_service client_service_fournisseur,"[le service, service client, service, fourniss...","[Bonjour, je tiens à partager mon expérience a..."
1,0,105,0_toujours pas électricité_avoir électricité_p...,"[toujours pas électricité, avoir électricité, ...",[J'ai souscrit à un contrat électricité chez d...
2,1,84,1_mail_au téléphone_compte_service,"[mail, au téléphone, compte, service, le servi...","[J'ai souscrit un abonnement le 4 Octobre, via..."
3,2,83,2_le service clients_du service client_le serv...,"[le service clients, du service client, le ser...",[Des erreurs on été commises par les employés ...
4,3,81,3_harcèlement téléphonique_du harcèlement_des ...,"[harcèlement téléphonique, du harcèlement, des...",[Démarchage agressif avec extorsion d'informat...
...,...,...,...,...,...
94,93,6,93_mes factures mensuelles_factures mensuelles...,"[mes factures mensuelles, factures mensuelles ...",[Mes factures mensuelles sont toujours mises e...
95,94,6,94_coupure électricité ai_électricité ai perdu...,"[coupure électricité ai, électricité ai perdu,...",[Je déconseille fortement ce fournisseur !!!\n...
96,95,6,95_conseillère total energies_energies pour ga...,"[conseillère total energies, energies pour gaz...","[Bonjour à tous, je suis actionnaire Total et ..."
97,96,5,96_injustifié un montant_en service inexplicab...,"[injustifié un montant, en service inexplicabl...",[J'ai été victime d'une escroquerie ayant abou...


In [31]:
topic_model.get_document_info(docs)

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,Cliente chez total énergie depuis près de deux...,-1,-1_le service_service client_service_fournisseur,"[le service, service client, service, fourniss...","[Bonjour, je tiens à partager mon expérience a...",le service - service client - service - fourni...,0.000000,False
1,Nous sommes en litige avec vous des sommes dem...,90,90_pour payer autant_payer autant par_ici pour...,"[pour payer autant, payer autant par, ici pour...",[Horrible. Les informations diffèrent d’un con...,pour payer autant - payer autant par - ici pou...,1.000000,False
2,J’ai été démarché par Yamina YAICHE alors que ...,-1,-1_le service_service client_service_fournisseur,"[le service, service client, service, fourniss...","[Bonjour, je tiens à partager mon expérience a...",le service - service client - service - fourni...,0.000000,False
3,Services client inexistant . Les personnes qui...,-1,-1_le service_service client_service_fournisseur,"[le service, service client, service, fourniss...","[Bonjour, je tiens à partager mon expérience a...",le service - service client - service - fourni...,0.000000,False
4,Des voleurs des voleurs tout simplement mes pa...,35,35_gros voleurs_de gros voleurs_des gros voleu...,"[gros voleurs, de gros voleurs, des gros voleu...",[Pire fournisseur du monde eni et total\n\nEn ...,gros voleurs - de gros voleurs - des gros vole...,0.793690,False
...,...,...,...,...,...,...,...,...
3060,Service client médiocre !,2,2_le service clients_du service client_le serv...,"[le service clients, du service client, le ser...",[Des erreurs on été commises par les employés ...,le service clients - du service client - le se...,1.000000,False
3061,Je me suis finalement rétractée suite à un par...,1,1_mail_au téléphone_compte_service,"[mail, au téléphone, compte, service, le servi...","[J'ai souscrit un abonnement le 4 Octobre, via...",mail - au téléphone - compte - service - le se...,0.984802,False
3062,Une matinée pour les avoir au téléphone. Quand...,-1,-1_le service_service client_service_fournisseur,"[le service, service client, service, fourniss...","[Bonjour, je tiens à partager mon expérience a...",le service - service client - service - fourni...,0.000000,False
3063,Très longue attente pour la souscription plus ...,42,42_téléphone trop long_long to connect_au télé...,"[téléphone trop long, long to connect, au télé...",[So long to connect. On the phone for an hour....,téléphone trop long - long to connect - au tél...,1.000000,False


In [32]:
# Get the list of topics
topic_model.get_topic(2)

[('le service clients', 0.5256647),
 ('du service client', 0.5070193),
 ('le service client', 0.49910468),
 ('service clients', 0.49698925),
 ('incompétentes', 0.4823568),
 ('service client est', 0.480261),
 ('incompétents', 0.47259435),
 ('service client', 0.45342886),
 ('clients et', 0.4284047),
 ('incompétent', 0.42529547)]

The topic names returned by `get_topic_info` show only the first 4 terms of each topic. If more terms are needed, we can use `get_topic` and pass in the topic number. For example, `get_topic(2)` in the cell above gives us the top 10 terms for topic 2 and their scores.

We can visualize the top keywords using a bar chart. `top_n_topics=12` means that we will create bar charts for the top 12 topics. The length of the bar represents the score of the keyword. A longer bar means higher importance for the topic.

In [33]:
# Visualize top topic keywords
topic_model.visualize_barchart(top_n_topics=12)

Another view of keyword importance is the "Term score decline per topic" chart, available through `topic_model.visualize_term_rank()`. It is a line chart with the term rank on the x-axis and the c-TF-IDF score on the y-axis.

There is one line for each topic. Hovering over a line shows the term score information.

# Topic Similarities

In this section, we will analyze the relationship between the topics generated by the topic model.

The intertopic distance map measures the distance between topics. Similar topics are closer to each other, and very different topics are far from each other. From the visualization, we can see that there are five topic groups for all the topics. Topics with similar semantic meanings are in the same topic group.

The size of the circle represents the number of documents in the topics, and larger circles mean that more reviews belong to the topic.

In [34]:
# Visualize intertopic distance
topic_model.visualize_topics()

In [35]:
# Visualize connections between topics using hierarchical clustering
topic_model.visualize_hierarchy()

Another way to see how the topics are connected is through a hierarchical clustering graph. We can control the number of topics in the graph by the `top_n_topics` parameter.

A heatmap (`topic_model.visualize_heatmap()`) can also be used to analyze the similarities between topics. The similarity score ranges from 0 to 1. A value close to 1 indicates a higher similarity between the two topics, which is shown in a darker blue color.

# Prepare query for GPT3.5

# Topic Model In-sample Predictions

In [36]:
topic_model.get_topic(-1)

[('le service', 0.44559225),
 ('service client', 0.42104477),
 ('service', 0.4149865),
 ('fournisseur', 0.4042101),
 ('demande', 0.38837212),
 ('contrat', 0.37867734),
 ('client', 0.37515578),
 ('électricité', 0.3713158),
 ('compte', 0.36731124),
 ('nous', 0.35170263)]

In [37]:
import math
import pickle

In [38]:
# Further reduce topics
topic_model.reduce_topics(docs, nr_topics=30)

<bertopic._bertopic.BERTopic at 0x7f78d8ecbfd0>

In [39]:
def topic_weights(terms):
    """Format (word, weight) pairs as '0.5257*"word" + ...', rounding weights up to 4 decimals."""
    res = ''
    for word, weight in terms:
        if res != '':
            res += ' + '
        add_str = f'{math.ceil(weight*10000)/10000}*"{word}"'
        res += add_str
    return res
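
To see what this helper produces, here is a small self-contained check. It uses an equivalent, more compact formulation with `str.join` (our own rewrite, not the notebook's cell) and two `(word, weight)` pairs taken from the `get_topic(2)` output shown earlier.

```python
import math

def topic_weights(terms):
    """Join (word, weight) pairs as 'weight*"word"', with weights rounded up to 4 decimals."""
    return " + ".join(
        f'{math.ceil(weight * 10000) / 10000}*"{word}"' for word, weight in terms
    )

example_terms = [("le service clients", 0.5256647), ("du service client", 0.5070193)]
print(topic_weights(example_terms))
# 0.5257*"le service clients" + 0.5071*"du service client"
```

Note that `math.ceil` rounds the weights up rather than to the nearest value, so 0.5070193 becomes 0.5071.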

In [40]:
topic_info = topic_model.get_topic_info()

In [41]:
# Number of topics, excluding the outlier topic (-1)
num_topics = len(topic_info) - 1
num_topics

29

In [42]:
topic_str = []
for i in range(num_topics):
  string = topic_weights(topic_model.get_topic(i))
  topic_str.append((i, string))

In [43]:
topic_str

[(0,
  '0.5023*"électricité" + 0.4712*"total direct energie" + 0.4684*"direct energie" + 0.4658*"direct énergie" + 0.3931*"energie" + 0.3695*"énergie" + 0.3591*"contrat" + 0.3419*"facture" + 0.3363*"total direct" + 0.3327*"payer"'),
 (1,
  '0.5174*"remboursement" + 0.5157*"factures" + 0.5016*"payer" + 0.4835*"une facture" + 0.4642*"facture de" + 0.4543*"facture" + 0.3621*"contrat" + 0.3474*"compte" + 0.3318*"chèque" + 0.2987*"compteur"'),
 (2,
  '0.5695*"mon contrat" + 0.4995*"un contrat" + 0.4899*"le contrat" + 0.4705*"contrat" + 0.371*"reçu" + 0.3571*"un conseiller" + 0.3546*"mail" + 0.3507*"service client" + 0.3387*"appel" + 0.3385*"compte"'),
 (3,
  '0.5784*"service client incompétent" + 0.5216*"le service client" + 0.4988*"service clients" + 0.4892*"client incompétent" + 0.4844*"payer" + 0.4825*"le client" + 0.4812*"service client" + 0.445*"clients" + 0.4378*"client est" + 0.3937*"un service"'),
 (4,
  '0.6213*"harcèlement téléphonique" + 0.5885*"démarchage téléphonique" + 0.5746*

In [48]:
with open('/content/drive/MyDrive/HEC/topic_str.pkl', 'wb') as f:
    pickle.dump(topic_str, f)

In [46]:
print(type(topic_str))

<class 'list'>


In [50]:
save_df = reviews_cleaned[['verbatim', 'note']]
save_df.to_parquet('/content/drive/MyDrive/HEC/negative_reviews.parquet')

In [51]:
save_df

Unnamed: 0,verbatim,note
5002,Cliente chez total énergie depuis près de deux...,1
5003,Nous sommes en litige avec vous des sommes dem...,1
5004,J’ai été démarché par Yamina YAICHE alors que ...,1
5005,Services client inexistant . Les personnes qui...,1
5006,Des voleurs des voleurs tout simplement mes pa...,1
...,...,...
31212,Service client médiocre !,1
31217,Je me suis finalement rétractée suite à un par...,1
31219,Une matinée pour les avoir au téléphone. Quand...,1
31234,Très longue attente pour la souscription plus ...,1


In this section, we will talk about how to make in-sample predictions using the topic model.

The BERTopic model can output the predicted topic for each review in the dataset.

Using `.topics_`, we get the predicted topics as a list and then save it as a column in the review dataset.

In [None]:
# Get the topic predictions
topic_preds = topic_model.topics_

# Save the predictions in the dataframe
reviews_cleaned['topics'] = topic_preds

# Take a look at the data
reviews_cleaned['topics']

5002     145
5003     134
5004      19
5005      24
5006      -1
        ... 
31250     -1
31251     22
31261     -1
31286     -1
31296     -1
Name: topics, Length: 3844, dtype: int64

# Topic Model Predictions on New Data

In this step, we will talk about how to use the BERTopic model to make predictions on new reviews.

Let's say there is a new review, for example a short French complaint about customer service and billing, and we would like to automatically predict the topic for this review.
* First, let's decide the number of topics to include in the prediction.
 * If we would like to assign only one topic to the document, then the number of topics should be 1.
 * If we would like to assign multiple topics to the document, then the number of topics should be greater than 1. Here we get the top 3 topics that are most relevant to the new review.
* After that, we pass the new review and the number of topics to the `find_topics` method. This gives us the topic numbers and the similarity values.
* Finally, the results are printed: the 3 most similar topics for the new review and their similarity scores.


In [None]:
# New data for the review (a made-up example written for this notebook)
new_review = "Service client injoignable et facture incompréhensible, je change de fournisseur."

# Find topics of the new review
num_of_topics = 3
similar_topics, similarity = topic_model.find_topics(new_review, top_n=num_of_topics)

# Print results
print(f'The top {num_of_topics} similar topics are {similar_topics}, and the similarities are {np.round(similarity, 2)}')


To verify if the assigned topics are a good fit for the new review, let's get the top keywords for the top 3 topics using the `get_topic` method.

In [None]:
# Print the top keywords for the top similar topics
for topic_id in similar_topics:
    print(f'Topic {topic_id}: {topic_model.get_topic(topic_id)}')


If the top keywords of the returned topics match the content of the new review (here, customer service and billing), then the BERTopic model made good predictions on the new document.

# Save and Load Topic Models

In [None]:
# Save the topic model
topic_model.save("amz_review_topic_model")

# Load the topic model
my_model = BERTopic.load("amz_review_topic_model")

# References

* [BERTopic GitHub](https://github.com/MaartenGr/BERTopic)
* [Documentation on BERTopic algorithms](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview)
* [UMAP documentation](https://umap-learn.readthedocs.io/en/latest/parameters.html)