## Setup
1). Create a conda env called `BERTopic` with Python 3.12.8.

2). Connect the Jupyter notebook to that env (kernel).

3). Install the dependencies.

4). Create your dataset (if you don't want to use the existing one).

In [None]:
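%%capture
# Remove the %%capture line above, or run these commands in a terminal, to see debug output.
!conda create --name BERTopic python=3.12.8 -y
# Each `!` command runs in its own subshell, so this activation does not persist;
# select the BERTopic kernel in Jupyter to actually use the env.
!conda activate BERTopic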

In [None]:
%%capture
# Remove the %%capture line above, or run these commands in a terminal, to see debug output.
!conda install -n BERTopic -c conda-forge pandas bertopic keybert -y
!conda install -n BERTopic -c conda-forge nbformat -y
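
Once the packages are installed and the notebook is attached to the new kernel, a quick check from inside the notebook can confirm which interpreter is running and whether the packages are importable. This is a minimal sketch; the helper name `kernel_env_summary` is illustrative and not part of the project.

```python
import importlib.util
import sys


def kernel_env_summary(packages=("pandas", "bertopic", "keybert", "nbformat")):
    """Report the running interpreter and which packages it can import."""
    return {
        "executable": sys.executable,
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "installed": {
            name: importlib.util.find_spec(name) is not None for name in packages
        },
    }


summary = kernel_env_summary()
print(summary["executable"])  # should point inside the BERTopic env
print(summary["python"])      # expected to read 3.12.8 in that env
print(summary["installed"])   # False marks a package the kernel cannot import
```

If the executable path points somewhere other than the `BERTopic` env, the notebook is probably still attached to a different kernel.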


### Create a new dataset (optional)
**Note**: this step is optional; we recommend using an existing dataset.

1). Run the `sm-insights-next` project locally (`cd sm-insights-next && npm run dev:https`).

2). Run `create_youtube_dataset` with your params.

3). If the request succeeds, the dataset is saved under `/Notebooks/datasets/youtube-comments/filename.csv`.

In [None]:
import requests

def create_youtube_dataset(video_id="gXjj2EoElFg", dataset_name="JackVsCalley", limit=500):
    """Request a YouTube comments dataset from the locally running sm-insights-next server."""
    base_url = "https://localhost:3000"
    endpoint = f"{base_url}/api/youtube/comments/create-dataset"
    
    params = {
        "video_id": video_id,
        "data_set_name": dataset_name,
        "limit": limit
    }
    
    try:
        # Disable SSL verification for localhost
        response = requests.get(endpoint, params=params, verify=False)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error making request: {e}")
        return None



In [60]:
datasetName = "honey_scam_500"
video_id = "vc4yL3YTwWk"  # the value of the `v` URL parameter in a desktop YouTube video link
limit = 500

# response  = create_youtube_dataset(video_id=video_id,dataset_name=datasetName,limit=limit)
# response





'dataset was saved to ../Notebooks/datasets/youtube-comments/honey_scam_500.csv'
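
The `video_id` is the `v` query parameter of a desktop YouTube watch URL. A minimal sketch of extracting it with the standard library follows, assuming a URL of the form `https://www.youtube.com/watch?v=...`. `extract_video_id` is a hypothetical helper, not part of `sm-insights-next`, and short `youtu.be` links are not handled.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


print(extract_video_id("https://www.youtube.com/watch?v=vc4yL3YTwWk&t=1174"))
# vc4yL3YTwWk
```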

## BERTopic

In [61]:
datasetName = "honey_scam_500" 
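
Before fitting, it may be worth confirming that the CSV exists and contains the `text` column that the loading code reads. This is a minimal sketch using only the standard library; `check_dataset` is a hypothetical helper, and the path in the usage comment mirrors the one passed to `pd.read_csv`.

```python
import csv
from pathlib import Path


def check_dataset(path, required_column="text"):
    """Return the number of data rows, raising if the file or column is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if required_column not in (reader.fieldnames or []):
            raise ValueError(f"Column {required_column!r} missing from {path.name}")
        return sum(1 for _ in reader)


# Usage, assuming datasetName is defined as in the cell above:
# check_dataset(f"../datasets/youtube-comments/{datasetName}.csv")
```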

Create and run BERTopic

In [62]:
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer


# Load the comments dataset
df = pd.read_csv(f"../datasets/youtube-comments/{datasetName}.csv") 

# Assuming your CSV has a column named 'text' containing the comments
comments = df['text'].tolist()

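# Drop English stop words from the topic representations
vectorizer_model = CountVectorizer(stop_words="english")
# Sentence-transformers model used to embed the comments
sentence_model = SentenceTransformer('all-mpnet-base-v2')

# Create and fit the BERTopic model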
model = BERTopic( 
    vectorizer_model=vectorizer_model,
    embedding_model=sentence_model,
    language='english',
    calculate_probabilities=True,
    verbose=True)

topics, probabilities = model.fit_transform(comments)


2025-01-21 15:25:57,130 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 16/16 [00:05<00:00,  2.97it/s]
2025-01-21 15:26:02,531 - BERTopic - Embedding - Completed ✓
2025-01-21 15:26:02,532 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-21 15:26:02,795 - BERTopic - Dimensionality - Completed ✓
2025-01-21 15:26:02,798 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-21 15:26:02,820 - BERTopic - Cluster - Completed ✓
2025-01-21 15:26:02,822 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-21 15:26:02,838 - BERTopic - Representation - Completed ✓


An example of cluster labeling using KeyBERT

In [63]:
from keybert import KeyBERT

kw_model = KeyBERT()
topic_labels = {}
# Label every real topic; -1 is BERTopic's outlier topic and is skipped
for topic in sorted(set(topics) - {-1}):
    words = model.get_topic(topic)  # list of (word, score) pairs
    keywords = kw_model.extract_keywords(' '.join([word[0] for word in words]), keyphrase_ngram_range=(1, 2), top_n=1)
    topic_labels[topic] = keywords[0][0]



model.set_topic_labels(topic_labels=topic_labels)
topic_labels

{0: 'br 39',
 1: 'scam influencers',
 2: 'release waiting',
 3: 'paypal people',
 4: 'journalism gained',
 5: 'class lawsuit',
 6: 'linus ltt',
 7: 'mark called',
 8: 'thanks omgosh',
 9: '아이고 wishes',
 10: 'video vanished'}
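
Labels such as `br 39` suggest that HTML residue in the raw comments (`<br>` tags and entities like `&#39;`) is leaking into the topic vocabulary. One possible remedy, sketched here with the standard library only, is to strip tags and decode entities before fitting. `clean_comment` is a hypothetical helper, and the regex is a rough tag stripper rather than a full HTML parser.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <br> or <a href="...">
TAG_RE = re.compile(r"<[^>]+>")


def clean_comment(raw):
    """Strip HTML tags, decode entities, and collapse repeated whitespace."""
    without_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


print(clean_comment("it&#39;s a protection racket<br><br>really"))
# it's a protection racket really
```

One way to use it would be `comments = [clean_comment(c) for c in comments]` before calling `fit_transform`. The topics and labels shown in this notebook were produced without this step.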

Comments and their labels

In [64]:
for i in range(10):
    print(f'{topic_labels[i]}: {comments[i]}')

br 39: If you guys enjoyed this video, please consider supporting me on Patreon: <a href="https://patreon.com/MegaLag">https://patreon.com/MegaLag</a><br><br>If you have any inside information about PayPal Honey or believe you can contribute to this story, please feel free to contact me confidentially at megalagtips@<a href="http://proton.me/">proton.me</a>
scam influencers: Cannot trust those guys on the thumbnail?
release waiting: I like that the influencers got f*<b>**ed to when they advertise that sh***</b> 😂
paypal people: This is straight up evil. I knew Paypal is  no good. The moment I registered there my data was sold. I receoved phishing mails and calls DAILY. This company is EVIL
journalism gained: Its not their fault tho they didnt know
class lawsuit: <a href="https://www.youtube.com/watch?v=vc4yL3YTwWk&amp;t=1174">19:34</a> it&#39;s a protection racket 😮
linus ltt: Still waiting for part 2...
mark called: So? Are you gonna drop a second part or just disappear?
thanks omgosh
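
Note that the loop above uses the comment's position `i` as the key into `topic_labels`, so it prints the label of topic `i` next to comment `i` rather than the label of the topic that comment was assigned to. A sketch of looking up each comment's own topic instead is given here, using small stand-in lists in place of the real `topics`, `comments`, and `topic_labels`. The `"outlier"` fallback for topic -1 is an assumption.

```python
def label_comments(comments, topics, topic_labels, fallback="outlier"):
    """Pair each comment with the label of the topic it was assigned to."""
    return [
        (topic_labels.get(topic, fallback), comment)
        for comment, topic in zip(comments, topics)
    ]


# Stand-in data for illustration only (not the notebook's real assignments).
demo_comments = ["Still waiting for part 2...", "Time for a class action", "???"]
demo_topics = [2, 5, -1]
demo_labels = {2: "release waiting", 5: "class lawsuit"}

for label, comment in label_comments(demo_comments, demo_topics, demo_labels):
    print(f"{label}: {comment}")
```

In the notebook this could be called as `label_comments(comments[:10], topics[:10], topic_labels)`.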

Top 10 clusters

In [65]:
freq = model.get_topic_info()
freq.head(10)

| | Topic | Count | Name | CustomName | Representation | Representative_Docs |
|---|---|---|---|---|---|---|
| 0 | -1 | 73 | -1_video_megalag_watch_youtube | -1_video_megalag_watch_youtube | `[video, megalag, watch, youtube, free, vc4yl3y...` | `[<a href="https://www.youtube.com/watch?v=vc4y...` |
| 1 | 0 | 142 | 0_honey_br_39_quot | br 39 | `[honey, br, 39, quot, just, money, don, used, ...` | `[I think making this so focused on youtubers a...` |
| 2 | 1 | 74 | 1_scam_influencers_39_youtube | scam influencers | `[scam, influencers, 39, youtube, br, people, q...` | `[It&#39;s a classic &quot;too good to be true&...` |
| 3 | 2 | 55 | 2_weeks_release_waiting_wait | release waiting | `[weeks, release, waiting, wait, drop, second, ...` | `[Problem with this series is you should of rec...` |
| 4 | 3 | 28 | 3_paypal_people_owned_extension | paypal people | `[paypal, people, owned, extension, patreon, ev...` | `[I have the paypal app downloaded and it comes...` |
| 5 | 4 | 25 | 4_work_thank_journalism_gained | journalism gained | `[work, thank, journalism, gained, excellent, g...` | `[Excellent... I liked your video 9 times ...👍👍...` |
| 6 | 5 | 22 | 5_class_lawsuit_action_illegal | class lawsuit | `[class, lawsuit, action, illegal, br, pay, hop...` | `[That’s insane! Time to do a class action laws...` |
| 7 | 6 | 20 | 6_lmg_linus_ltt_think | linus ltt | `[lmg, linus, ltt, think, 39, company, instead,...` | `[I think you are off base about ltt man, why s...` |
| 8 | 7 | 19 | 7_markiplier_right_predicted_calling | mark called | `[markiplier, right, predicted, calling, listen...` | `[You didn&#39;t mention Markiplier! Markiplier...` |
| 9 | 8 | 19 | 8_damn_wow_thanks_omgosh | thanks omgosh | `[damn, wow, thanks, omgosh, brilliant, dam, wo...` | `[Damn, they got us, DAMN, damn.]` |
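
In this table, `Count` is the number of comments assigned to each topic, and topic `-1` collects the comments BERTopic treated as outliers. A minimal sketch of deriving the same kind of counts, plus the outlier share, directly from a `topics` list follows. The list below is stand-in data and `topic_shares` is a hypothetical helper.

```python
from collections import Counter


def topic_shares(topics):
    """Return (counts per topic, fraction of documents in outlier topic -1)."""
    counts = Counter(topics)
    total = sum(counts.values())
    outlier_share = counts.get(-1, 0) / total if total else 0.0
    return counts, outlier_share


# Stand-in topic assignments for illustration only.
counts, outlier_share = topic_shares([-1, 0, 0, 1, 0, -1, 2, 0])
print(counts.most_common(3))
print(f"outliers: {outlier_share:.0%}")
```

A large outlier share can be a hint that the clustering settings are too strict for a dataset of this size.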


### Visualizations

In [66]:
model.visualize_topics(custom_labels=True)

In [67]:
model.visualize_hierarchy(custom_labels=True)

In [68]:
model.visualize_barchart(custom_labels=True)

In [69]:
model.visualize_heatmap(custom_labels=True)