<a href="https://colab.research.google.com/github/yufei-ilariahuang/Ergonomics-Chair-project/blob/main/bertopicamazon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Load Data
The dataset is a collection of Amazon customer reviews of ergonomic chairs, loaded from processed_data_wordcut.csv. It has 9,336 rows and 8 columns: country, date, price, productAsin, ratingScore, reviewCategoryUrl, Review (the review text) and processed_text (the preprocessed review text). processed_text is the column that will be used by the topic model.

In [None]:
import pandas as pd
# Visualize the length distribution
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('processed_data_wordcut.csv')
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9336 entries, 0 to 9335
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   country            9336 non-null   object 
 1   date               9336 non-null   object 
 2   price              9336 non-null   float64
 3   productAsin        9336 non-null   object 
 4   ratingScore        9336 non-null   int64  
 5   reviewCategoryUrl  9336 non-null   object 
 6    Review            9336 non-null   object 
 7   processed_text     9335 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 583.6+ KB


In [None]:
text_ = df['processed_text']

In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L3-v2')
embeddings = model.encode(sentences)
print(embeddings)

[[ 3.20785940e-01  1.37235269e-01  2.59860367e-01 -9.81592610e-02
  -3.63948077e-01  7.71966800e-02  3.69380206e-01 -3.12745720e-01
   1.33277372e-01 -2.93541998e-01  1.15458727e-01 -3.55825484e-01
   7.46323317e-02 -1.58448726e-01  1.35559961e-01  2.90623102e-02
  -1.19101524e-01  2.45669007e-01 -3.30902010e-01  2.87550837e-01
   3.58330399e-01  4.96013075e-01  2.47233197e-01 -8.83322731e-02
   4.18651029e-02 -2.08249204e-02  2.40278870e-01 -4.19054568e-01
   4.51311111e-01 -2.65613347e-01 -1.54888630e-02  1.05151854e-01
   5.29058874e-01 -1.93584710e-01 -2.27550775e-01  3.09600174e-01
  -7.52531514e-02 -2.09876001e-02  1.02610067e-01  1.09510556e-01
  -6.37895539e-02 -5.12716234e-01  3.22104357e-02 -1.68447867e-01
  -2.78715789e-01 -2.81573594e-01 -7.27954833e-03  4.53029156e-01
  -1.75040945e-01 -1.90942660e-01  5.97975478e-02 -1.14427827e-01
   5.13521656e-02  2.61589475e-02  1.39900789e-01 -3.35936069e-01
  -4.79268172e-04 -2.16055408e-01 -2.80452192e-01 -9.01871845e-02
  -1.14230

In [None]:
# Check for non-string values in text_:
# text_ is a column of a pandas DataFrame, so it may contain non-string (float NaN) values.
# Casting with astype(str) ensures that every element is a string.
text_ = df['processed_text'].astype(str)

In [None]:
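# Drop missing (NaN) values. This reassigns text_ from the original column,
# so the model receives the 9,335 non-null entries of processed_text.
text_ = df['processed_text'].dropna()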
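
The two cleaning steps above can also be combined. The following is a minimal sketch on a toy Series (the `reviews` and `docs` names and the example strings are illustrative, not taken from the dataset): missing values are dropped first, the remainder is cast to strings, and the result is converted to a plain list of documents.

```python
import pandas as pd

# Toy stand-in for df['processed_text']; the middle entry is missing.
reviews = pd.Series(["comfortable chair", float("nan"), "easy assemble"])

# Drop missing rows first, then make sure every remaining element is a str.
docs = reviews.dropna().astype(str).tolist()

print(docs)
```

Dropping before casting avoids having a placeholder string for missing reviews end up among the documents.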



2. Topic Modeling
In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.

2.1 Training
We start by instantiating BERTopic. We keep the default language setting (english) since our reviews are predominantly in English. If you would like to use a multi-lingual model, please use language="multilingual" instead.

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly with large amounts of data (>100_000 documents). It is advised to turn this off if you want to speed up the model.



In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
from bertopic import BERTopic
# n_gram_range=(1, 2): the topic representation considers both unigrams (single words) and bigrams (pairs of words).
# verbose=True: prints progress information while the model is being fitted.
# min_topic_size=7: sets the minimum size of a topic to 7, meaning each topic needs to contain at least 7 documents.
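
To make the effect of n_gram_range=(1, 2) concrete, here is a minimal pure-Python sketch of unigram and bigram extraction. The `ngrams` helper is hypothetical and written only for illustration; BERTopic delegates this step to its vectorizer, which also handles tokenization and filtering.

```python
def ngrams(tokens, n_gram_range=(1, 2)):
    """Return all n-grams of the token list for every n in the inclusive range."""
    lo, hi = n_gram_range
    out = []
    for n in range(lo, hi + 1):
        # Slide a window of width n over the tokens.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

terms = ngrams("lower back pain".split())
print(terms)
```

With bigrams enabled, phrases such as "back pain" and "lower back" can appear as terms of their own next to the single words.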



1. Using a pretrained sentence-transformers model

In [None]:
model = BERTopic(verbose=True, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size=7, n_gram_range=(1, 2))
topics, probabilities  = model.fit_transform(text_)

Batches:   0%|          | 0/292 [00:00<?, ?it/s]

2023-11-22 23:36:03,854 - BERTopic - Transformed documents to Embeddings
2023-11-22 23:36:29,826 - BERTopic - Reduced dimensionality
2023-11-22 23:36:30,683 - BERTopic - Clustered reduced embeddings


In [None]:
freq = model.get_topic_info()
print("Number of topics: {}".format( len(freq)))
freq.head()

Number of topics: 131


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,5072,-1_back_seat_support_comfortable,"[back, seat, support, comfortable, like, one, ...",[great better suited smaller people year half ...
1,0,317,0_pain_back pain_back_day,"[pain, back pain, back, day, lower back, lower...",[amazing chair--if pain especially back pain r...
2,1,207,1_assemble_easy assemble_easy_comfortable easy,"[assemble, easy assemble, easy, comfortable ea...","[easy assemble comfortable, comfortable easy a..."
3,2,166,2_box_damaged_broken_arrived,"[box, damaged, broken, arrived, came, part, op...","[arrived damaged arrived damaged damaged part,..."
4,3,163,3_miller_herman miller_herman_aeron,"[miller, herman miller, herman, aeron, miller ...",[herman miller authorized dealer herman miller...


The table above shows the first five of the 131 rows returned by get_topic_info() (130 topics plus the outlier group), listed in descending order of topic size (Count). It has 3 main columns:

'Topic' is the topic number, which serves as an identifier. Outliers are labeled -1; these are documents that were not assigned to any topic, and they are usually ignored.
'Count' is the number of documents assigned to the topic.
'Name' is the name given to the topic, built from the topic number and its top terms.
For each topic, we can retrieve the top words and their corresponding c-TF-IDF score. The higher the score, the more relevant the word is in representing the topic.
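
As a rough illustration of how such scores arise, the sketch below computes a simplified class-based TF-IDF on toy data. It assumes the weighting form tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the normalized frequency of term t in topic c, f(t) is its frequency across all topics and A is the average number of words per topic. The `c_tf_idf` helper and the toy tokens are hypothetical, and BERTopic's own implementation differs in its details.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs maps a topic id to the tokens of all its documents joined together."""
    counts = {c: Counter(tokens) for c, tokens in class_docs.items()}
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for c, cnt in counts.items():
        n_words = sum(cnt.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in cnt.items()
        }
    return scores

toy = {
    0: "back pain lower back pain relief".split(),
    1: "easy assemble easy back".split(),
}
scores = c_tf_idf(toy)
# Highest-scoring term per topic.
top_terms = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_terms)
```

In the toy data, "back" is frequent in topic 0 but also occurs in topic 1, so it is down-weighted relative to "pain", which occurs only in topic 0.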

2. Using BERTopic's default embedding model (with warnings)

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(min_topic_size=50, nr_topics=20, verbose=True, n_gram_range=(1, 2))
topics2, probabilities2 = topic_model.fit_transform(text_)

Batches:   0%|          | 0/292 [00:00<?, ?it/s]

2023-11-23 00:09:24,002 - BERTopic - Transformed documents to Embeddings
2023-11-23 00:09:46,724 - BERTopic - Reduced dimensionality
2023-11-23 00:09:47,415 - BERTopic - Clustered reduced embeddings
2023-11-23 00:09:51,871 - BERTopic - Reduced number of topics from 3 to 3


In [None]:
freq2 = topic_model.get_topic_info()
print("Number of topics: {}".format( len(freq2)))
freq2.head(20)

Number of topics: 3


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,21,-1_la_de_el_que,"[la, de, el, que, en, silla, para, muy, la sil...",[la mejor inversión en una silla que cumple to...
1,0,150,0_squeak_noise_month_loud,"[squeak, noise, month, loud, back, squeaking, ...",[creak loudly every time sit march nine month ...
2,1,9164,1_back_support_comfortable_seat,"[back, support, comfortable, seat, like, great...",[great bigger people anyone looking ultimate s...


We continue with the model from the first method. Looking at topic 0, we observe that all the top words are coherent with a single underlying theme, which appears to be back pain.

In [None]:
a_topic = freq.iloc[1]["Topic"] # Select the 1st topic
model.get_topic(a_topic) # Show the words and their c-TF-IDF scores

[('pain', 0.030336298143828794),
 ('back pain', 0.026830521138067925),
 ('back', 0.014167762056664654),
 ('day', 0.01083448516856924),
 ('lower back', 0.010511632512642886),
 ('lower', 0.009447730183898329),
 ('work', 0.008867490809317936),
 ('home', 0.007533085046887863),
 ('hour', 0.007351629995388859),
 ('sitting', 0.007170788439960161)]

3. Topic Visualization
Topic visualization helps in gaining more insight into each topic. BERTopic provides several visualization options, such as term bar charts, an intertopic distance map and hierarchical topic clustering, and our focus will be on these three.

3.1 Topic Terms
The most relevant words of each topic can be visualized as a bar chart of their c-TF-IDF scores, which makes it easy to compare topics visually. Below is the corresponding visualization for the first 6 topics.

In [None]:
model.visualize_barchart(top_n_topics=6)
# The longer the horizontal bar, the more relevant the word is to the topic.

3.2 Intertopic Distance Map
Readers who are familiar with LDAvis, the visualization library for Latent Dirichlet Allocation, will recognize this view: LDAvis provides an interactive dashboard showing, for each topic, the corresponding words and their scores. BERTopic does the same with a single function, visualize_topics(), and goes one step further by showing the distance between topics (the smaller the distance, the more similar the topics).

In [None]:
model.visualize_topics()

3.3 Visualize Topic Hierarchy
As you can see in the intertopic distance map, some topics are very close to each other. A natural question is how to reduce the number of topics. The good news is that the topics can be clustered hierarchically in order to select an appropriate number of topics. The visualization helps to understand how they relate to one another.


By looking at the first level (level 0) of the dendrogram, we can see that topics with the same color have been grouped together.
This information helps the user understand why topics have been considered similar to one another.
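
The idea behind a topic hierarchy can be sketched as greedy agglomerative merging: repeatedly join the two closest groups and record the distance at which they merge. The example below is a simplified centroid-based illustration on made-up 2-D topic vectors; the `merge_closest` helper and the topic names are hypothetical, and this is not the procedure BERTopic runs internally.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def merge_closest(topics):
    """Greedily merge the two closest clusters until one remains.

    Returns the merges in order as (name_a, name_b, distance) tuples.
    """
    clusters = {name: [vec] for name, vec in topics.items()}
    merges = []
    while len(clusters) > 1:
        names = sorted(clusters)
        best = None
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                d = math.dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # The merged cluster keeps the vectors of both parents.
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, round(d, 3)))
    return merges

toy_topics = {
    "back_pain": [0.0, 0.0],
    "lumbar": [0.1, 0.0],
    "assembly": [1.0, 1.0],
    "shipping": [1.0, 1.3],
}
merges = merge_closest(toy_topics)
for step in merges:
    print(step)
```

Each recorded merge corresponds to one junction in a dendrogram: topics joined at a small distance are candidates for being combined when reducing the number of topics.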

In [None]:
model.visualize_hierarchy(top_n_topics=30)

4. Search Topics.
Once the topic model is trained, we can search for topics that are semantically similar to an input query word/term using the find_topics function. In our case, we search for the 50 topics most related to the word 'comfort'.



In [None]:
# Select the 50 most similar topics
similar_topics, similarity = model.find_topics("comfort", top_n = 50)

similar_topics contains the topics index from most similar to least similar.
similarity contains the similarity scores in descending order.
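
Conceptually, this kind of search embeds the query and ranks topics by how similar their vectors are to it. The sketch below shows that ranking step with cosine similarity on made-up 3-dimensional vectors; `rank_topics`, `topic_vecs` and `query_vec` are hypothetical names used for illustration only and do not reproduce BERTopic's internals.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_topics(query_vec, topic_vecs, top_n=3):
    """Return (topic ids, similarities), both sorted from most to least similar."""
    scored = [(tid, cosine(query_vec, vec)) for tid, vec in topic_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_n]
    return [tid for tid, _ in top], [sim for _, sim in top]

# Made-up topic embeddings and query embedding.
topic_vecs = {0: [0.9, 0.1, 0.0], 1: [0.1, 0.9, 0.1], 2: [0.7, 0.3, 0.2]}
query_vec = [1.0, 0.2, 0.1]

ids, sims = rank_topics(query_vec, topic_vecs, top_n=2)
print(ids, sims)
```

The two returned lists are aligned, so the first id and the first score describe the same, most similar topic.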

In [None]:
similar_topics

[28,
 119,
 27,
 11,
 107,
 17,
 42,
 111,
 71,
 39,
 63,
 103,
 77,
 85,
 35,
 92,
 67,
 62,
 80,
 59,
 22,
 72,
 14,
 -1,
 18,
 25,
 94,
 50,
 19,
 65,
 41,
 83,
 93,
 116,
 79,
 123,
 118,
 75,
 46,
 0,
 16,
 76,
 127,
 121,
 106,
 51,
 23,
 3,
 96,
 115]

In [None]:
most_similar = similar_topics[0]
print("Most Similar Topic Info: \n{}".format(model.get_topic(most_similar)))
print("Similarity Score: {}".format(similarity[0]))

Most Similar Topic Info: 
[('comfort', 0.11425247507915565), ('comfort comfort', 0.05298730390548712), ('comfort comfortable', 0.035232876587890496), ('comfortable comfort', 0.03403486302645432), ('great comfort', 0.02529532257925599), ('love', 0.025097973171106153), ('mucho tiempo', 0.023876793944386125), ('durability comfort', 0.023876793944386125), ('tiempo', 0.023876793944386125), ('comfort love', 0.023876793944386125)]
Similarity Score: 0.8628519773483276


In [None]:
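from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a sentiment classifier for Amazon reviews. Separate variable names are used
# so that the BERTopic instance stored in `model` is not overwritten.
sentiment_tokenizer = AutoTokenizer.from_pretrained("LiYuan/amazon-review-sentiment-analysis")
sentiment_model = AutoModelForSequenceClassification.from_pretrained("LiYuan/amazon-review-sentiment-analysis")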

tokenizer_config.json:   0%|          | 0.00/556 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.56M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.23k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/670M [00:00<?, ?B/s]

In [None]:
freq = model.get_topic_info()
print("Number of topics: {}".format( len(freq)))
freq.head()