## SET UP
1). Create a conda env called `BERTopic` with Python 3.12.8.

2). Connect the Jupyter notebook to that env (kernel).

3). Install the dependencies.

4). Create your dataset (only if you don't want to use the existing one).

In [None]:
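%%capture
# Remove the %%capture line above, or run these commands in your terminal, to see debug output.
!conda create --name BERTopic python=3.12.8 -y
# PowerShell gave problems here and cmd worked. Each `!` command runs in its own subshell,
# so this activation does not carry over to later cells; select the env as the notebook kernel.
!conda activate BERTopic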

In [None]:
%%capture
# Remove the %%capture line above, or run this command in your terminal, to see debug output.
!conda install -n BERTopic -c conda-forge --file Notebooks/requirements.txt -y

# Option 2 Mamba
# conda activate BERTopic
# conda install -c conda-forge mamba -y
# mamba install --file Notebooks/requirements.txt -y
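
Once the notebook is connected to the new kernel, it can help to confirm which interpreter the kernel is actually running. The following is a minimal sketch, not part of the original notebook: the helper name `kernel_environment` is hypothetical, and it assumes that the last path component of `sys.prefix` is the conda env name, which is the usual layout but not guaranteed.

```python
import os
import sys

def kernel_environment(expected_env="BERTopic", expected_python=(3, 12)):
    """Report which interpreter and environment the current kernel uses."""
    # sys.prefix is the root folder of the active environment; for a conda
    # env its final path component is normally the env name (assumption).
    env_name = os.path.basename(os.path.normpath(sys.prefix))
    return {
        "executable": sys.executable,
        "env_name": env_name,
        "python": tuple(sys.version_info[:3]),
        "env_matches": env_name == expected_env,
        "python_matches": tuple(sys.version_info[:2]) == tuple(expected_python),
    }

info = kernel_environment()
print(info)
```

If `env_matches` is `False`, the notebook is probably still attached to a different kernel.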

### Create new dataset (optional)
**Note**: this step is optional. We recommend using an existing dataset.

1). Run the `sm-insights-next` project locally (`cd sm-insights-next && npm run dev:https`).

2). Run `create_youtube_dataset` with your parameters.

3). If all goes well, the dataset is saved under `/Notebooks/datasets/youtube-comments/filename.csv`.

In [None]:
import requests

def create_youtube_dataset(video_id="gXjj2EoElFg", dataset_name="JackVsCalley", limit=500):
    """Ask the local sm-insights-next server to build a YouTube comments dataset."""
    base_url = "https://localhost:3000"
    endpoint = f"{base_url}/api/youtube/comments/create-dataset"
    
    params = {
        "video_id": video_id,
        "data_set_name": dataset_name,
        "limit": limit
    }
    
    try:
        # Disable SSL verification for localhost
        response = requests.get(endpoint, params=params, verify=False)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error making request: {e}")
        return None



In [None]:
datasetName = "honey_scam_500"
video_id = "vc4yL3YTwWk"  # the value of the `v` URL parameter in a desktop YouTube video link
limit = 500

# response = create_youtube_dataset(video_id=video_id, dataset_name=datasetName, limit=limit)
# response
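
The `video_id` is the value of the `v` query parameter in a desktop YouTube link. As a small sketch of pulling it out programmatically rather than by hand, the hypothetical helper `extract_video_id` below uses only the standard library. It assumes a `watch?v=...` style URL and does not handle shortened `youtu.be` links.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the `v` query parameter of a YouTube watch URL, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None

print(extract_video_id("https://www.youtube.com/watch?v=vc4yL3YTwWk&t=10s"))
# prints vc4yL3YTwWk
```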

## BERTopic

In [None]:
datasetName = "jack_vs_calley_1000" 

Create and run BERTopic

In [None]:
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer


# Load the comments dataset
df = pd.read_csv(f"../datasets/youtube-comments/{datasetName}.csv") 

# Assuming your CSV has a column named 'text' containing the comments
comments = df['text'].tolist()

# Drop English stop words from the topic representations
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer('all-mpnet-base-v2')

# Create and fit the BERTopic model
model = BERTopic(
    vectorizer_model=vectorizer_model,
    embedding_model=sentence_model,
    language='english',
    calculate_probabilities=True,
    verbose=True)

topics, probabilities = model.fit_transform(comments)
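
Comment columns loaded from a CSV can contain missing or blank cells, which pandas returns as `NaN` floats or empty strings. A minimal sketch of filtering those out before fitting is shown here. The helper `clean_comments` is hypothetical and not part of the original notebook; it assumes the cleaned list is what gets passed to `fit_transform`, so that `topics` stays aligned with the list being printed later.

```python
def clean_comments(raw_comments):
    """Keep only non-empty string comments, stripped of surrounding whitespace."""
    cleaned = []
    for comment in raw_comments:
        # Missing CSV cells arrive as float NaN (or None), which are not str
        if not isinstance(comment, str):
            continue
        comment = comment.strip()
        if comment:
            cleaned.append(comment)
    return cleaned

sample = ["Great video!", float("nan"), "   ", None, " scam alert "]
print(clean_comments(sample))
# prints ['Great video!', 'scam alert']
```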


An example of cluster labeling using KeyBERT

In [None]:
from keybert import KeyBERT

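kw_model = KeyBERT()
topic_labels = {}
# Label every topic except the outlier topic (-1), which may or may not be present
for topic in sorted(set(topics) - {-1}):
    words = model.get_topic(topic)  # list of (word, score) pairs
    doc = ' '.join(word for word, _ in words)
    keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), top_n=1)
    # Fall back to the topic's top word if KeyBERT returns nothing
    topic_labels[topic] = keywords[0][0] if keywords else words[0][0]

model.set_topic_labels(topic_labels=topic_labels)
topic_labels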

Comments and their labels

In [None]:
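for i in range(10):
    # Look up the label by the comment's assigned topic; outliers (-1) have no custom label
    label = topic_labels.get(topics[i], "outlier")
    print(f'{label}: {comments[i]}')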
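
`topics` holds one topic id per comment, while `topic_labels` is keyed by topic id, so pairing a comment with its label goes through the comment's topic id. The sketch below illustrates this on toy data. The helper `label_comments`, the toy values, and the `"outlier"` fallback for topic `-1` are illustrative assumptions, not output from the model.

```python
def label_comments(comments, topics, topic_labels, outlier_label="outlier"):
    """Pair each comment with the label of the topic it was assigned to."""
    labeled = []
    for comment, topic in zip(comments, topics):
        # Topic ids without a custom label (such as -1) get the fallback label
        labeled.append((topic_labels.get(topic, outlier_label), comment))
    return labeled

# Toy data standing in for the outputs of the cells above
toy_comments = ["honey is a scam", "love this channel", "first"]
toy_topics = [0, 1, -1]
toy_labels = {0: "honey scam", 1: "channel love"}

for label, comment in label_comments(toy_comments, toy_topics, toy_labels):
    print(f"{label}: {comment}")
```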

Top 10 clusters

In [None]:
freq = model.get_topic_info()
freq.head(10)
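
The table above is built by BERTopic itself. To show the idea behind ranking clusters by size, here is a rough standard-library sketch that counts how many comments fall into each topic id. The helper `top_topics` and the toy topic list are hypothetical, and the sketch leaves out the outlier topic `-1` by default, which the `get_topic_info()` table does include when outliers exist.

```python
from collections import Counter

def top_topics(topics, n=10, include_outliers=False):
    """Return the n largest topics as (topic_id, comment_count) pairs."""
    counts = Counter(topics)
    if not include_outliers:
        counts.pop(-1, None)  # -1 is the outlier topic
    return counts.most_common(n)

toy_topics = [0, 0, 1, -1, 0, 2, 1, -1, -1, -1]
print(top_topics(toy_topics, n=2))
# prints [(0, 3), (1, 2)]
```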

### Visualizations

In [None]:
model.visualize_topics(custom_labels=True)

In [None]:
model.visualize_hierarchy(custom_labels=True)

In [None]:
model.visualize_barchart(custom_labels=True)

In [None]:
model.visualize_heatmap(custom_labels=True)