# Part 1: Introduction

## Introduction to BERTopic and Dimensionality Reduction

### Overview  
BERTopic is a topic modeling technique that leverages transformer-based embeddings and clustering algorithms to extract meaningful topics from text data.  
One challenge in topic modeling is the high dimensionality of embeddings. To make visualization and interpretation easier, dimensionality reduction techniques are applied.

### Why Dimensionality Reduction?  
High-dimensional data is difficult to interpret and visualize. By reducing the dimensions while preserving structure, we can:
- Improve computational efficiency.  
- Enhance clustering performance.  
- Enable better visualization of topics.

### Techniques Covered  
In this notebook, we will focus on three popular dimensionality reduction techniques:
- **PCA (Principal Component Analysis):** A linear method that projects data onto principal components.  
- **t-SNE (t-Distributed Stochastic Neighbor Embedding):** A nonlinear method that preserves local relationships in data for visualization.  
- **UMAP (Uniform Manifold Approximation and Projection):** A nonlinear technique that balances local and global structure preservation, well-suited for embedding visualization.
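
To make the "projects data onto principal components" idea concrete, here is a minimal plain-NumPy sketch of PCA. It is an illustration only, not the scikit-learn implementation used later in the notebook; the function name `pca_project` and the random 384-dimensional toy vectors are assumptions made for this example.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (plain-NumPy PCA)."""
    # PCA works on zero-mean features
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Share of the total variance captured by each direction
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# Hypothetical stand-in for high-dimensional text embeddings
rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(100, 384))

projected, explained_ratio = pca_project(toy_embeddings, n_components=2)
print(projected.shape)   # (100, 2)
print(explained_ratio)   # variance share of the two kept components
```

Because the projection is a single matrix product, PCA is fast and deterministic, but it can only capture linear structure. t-SNE and UMAP trade that simplicity for the ability to follow curved manifolds.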


# Part 2: Load and Explore Data
Here, we load the dataset (`honey_scam_500.csv`) into a pandas DataFrame, perform basic exploration of the data, and clean the dataset by removing rows with missing values.

## Installations

In [None]:
%pip install pandas numpy bertopic umap-learn
%pip install --force-reinstall --no-cache-dir gensim

In [None]:
%pip install nltk

In [None]:
%pip install tf-keras

## Imports

In [None]:
import pandas as pd
from bertopic import BERTopic
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

## Load Data

In [None]:
# Load the dataset from a local CSV file
file_path = '../data/youtube_comments/honey_scam_500.csv'
df = pd.read_csv(file_path)

# Display basic information about the dataset
display(df.head())
df.info()  # info() prints its summary directly and returns None

## Data Overview
# The dataset contains media comments with the following columns:
# - **text:** The content of the comment.
# - **author:** The user who posted the comment.
# - **likes:** Number of likes the comment received.
# - **replyCount:** Number of replies to the comment.

## Basic Data Cleaning
# Let's remove missing values if any.
df = df.dropna()

# Show updated dataset information
df.info()  # info() prints its summary directly and returns None

In [None]:
texts = df['text'].astype(str).tolist()

# Part 3: BERTopic Overview
We introduce BERTopic in this section, explaining how it uses transformer-based embeddings and clustering techniques to identify topics from text data. We also prepare the dataset by extracting the text column for modeling.

In [None]:
bertopic_model = BERTopic()

# Part 4: Topic Modeling with BERTopic
In this section, we apply BERTopic to extract topics from the dataset and explore the various topic visualization tools it offers, such as visualizations of topic clusters, term importance, and topic distributions.

In [None]:
### Part 4: Topic Modeling with BERTopic

# Train BERTopic on the dataset
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Define a custom CountVectorizer to remove generic stopwords and improve topic quality
vectorizer_model = CountVectorizer(stop_words='english', ngram_range=(1, 2))

# Initialize BERTopic with HDBSCAN parameters to refine topic formation
bertopic_model = BERTopic(vectorizer_model=vectorizer_model, min_topic_size=10, verbose=True)

# Fit the model on the dataset
topics, probs = bertopic_model.fit_transform(texts)

# Display the top topics
display(bertopic_model.get_topic_info())

## Visualizing the Topics
# BERTopic provides several built-in visualization tools:

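# 1. Visualize topic clusters
# (.show() is needed because only a cell's last expression is rendered automatically)
bertopic_model.visualize_topics().show()

# 2. Visualize term importance per topic
bertopic_model.visualize_barchart(top_n_topics=10).show()  # Limit to top 10 topics for clarity

# 3. Visualize the topic distribution of a single document (here, the first one).
# visualize_distribution expects one document's topic probabilities, so we
# approximate the per-document distributions first.
topic_distr, _ = bertopic_model.approximate_distribution(texts)
bertopic_model.visualize_distribution(topic_distr[0]).show()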

## Handling Topic Outliers (-1)
# If too many documents fall into topic -1, we may need to adjust parameters
outlier_percentage = (len([t for t in topics if t == -1]) / len(topics)) * 100
print(f"Outlier Topic (-1) Percentage: {outlier_percentage:.2f}%")

# If outlier percentage is too high, adjust HDBSCAN parameters accordingly
if outlier_percentage > 20:
    print("Too many outliers detected! Consider tuning 'min_topic_size' or 'cluster_selection_epsilon'.")



> The topics given above represent the "raw" form of the clusters, generated by BERTopic with its default settings, using UMAP as the reducer.

> The topic labeled -1 is the outlier topic.



# Part 5: Dimensionality Reduction for Visualization
In this section, we focus on reducing the dimensionality of the topic embeddings using PCA and t-SNE. We experiment with different parameter combinations to observe how they affect the visualization of the topic space.

## Preparations

In [None]:
# Ensure embeddings exist
embeddings = bertopic_model.topic_embeddings_
if embeddings is None:
    raise ValueError("BERTopic model does not have topic embeddings. Ensure you are using a model that supports topic representations.")

## PCA

In [None]:
## PCA for Dimensionality Reduction
# Experimenting with different numbers of components and variance retention
pca_variants = [2, 3]
for n_comp in pca_variants:
    pca = PCA(n_components=n_comp)
    pca_embeddings = pca.fit_transform(embeddings)
    explained_var = sum(pca.explained_variance_ratio_)

    # Plot PCA results
    plt.figure(figsize=(10, 6))
    plt.scatter(pca_embeddings[:, 0], pca_embeddings[:, 1], alpha=0.7)
    plt.xlabel("PCA Component 1")
    plt.ylabel("PCA Component 2")
    plt.title(f"PCA Projection of Topic Embeddings (n_components={n_comp}, Variance={explained_var:.2f})")
    plt.show()

## t-SNE

In [None]:
## t-SNE for Nonlinear Dimensionality Reduction
perplexity_values = [5, 30, 50]
learning_rates = [10, 200]
n_samples = embeddings.shape[0]

for perp in perplexity_values:
    if perp >= n_samples:
        print(f"Skipping perplexity={perp} as it is >= number of samples ({n_samples})")
        continue
    for lr in learning_rates:
        tsne = TSNE(n_components=2, perplexity=perp, learning_rate=lr, random_state=42)
        tsne_embeddings = tsne.fit_transform(embeddings)

        # Plot t-SNE results
        plt.figure(figsize=(10, 6))
        plt.scatter(tsne_embeddings[:, 0], tsne_embeddings[:, 1], alpha=0.7)
        plt.xlabel("t-SNE Component 1")
        plt.ylabel("t-SNE Component 2")
        plt.title(f"t-SNE Projection (Perplexity={perp}, Learning Rate={lr})")
        plt.show()

# Part 6: Comparative Analysis of PCA vs. t-SNE
In this section, we compare PCA and t-SNE based on how well they separate and visualize topic clusters. We evaluate clustering performance using silhouette scores and visualize the resulting clusters for both methods.

## Explained variance plot for PCA

In [None]:
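# Compute explained variance for PCA
# PCA cannot extract more components than there are samples or features
n_components = min(10, *embeddings.shape)
pca = PCA(n_components=n_components)
pca.fit(embeddings)
explained_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot explained variance
plt.figure(figsize=(10, 6))
plt.plot(range(1, n_components + 1), explained_variance, marker='o', linestyle='--')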
plt.xlabel("Number of PCA Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Explained Variance by PCA Components")
plt.grid()
plt.show()

## Clustering separability using K-Means and silhouette scores

In [None]:
## Comparing Clustering Separability
# We will use k-means clustering to evaluate how well PCA and t-SNE separate topic clusters.

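# The silhouette score needs between 2 and n_samples - 1 clusters,
# so cap k when the model produced only a few topics
n_clusters = min(10, len(pca_embeddings) - 1)

# KMeans clustering on PCA embeddings (from the last PCA run above)
pca_kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(pca_embeddings)
pca_silhouette = silhouette_score(pca_embeddings, pca_kmeans.labels_)

# KMeans clustering on t-SNE embeddings (from the last t-SNE run above)
tsne_kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(tsne_embeddings)
tsne_silhouette = silhouette_score(tsne_embeddings, tsne_kmeans.labels_)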

# Compare silhouette scores
print(f"Silhouette Score for PCA: {pca_silhouette:.3f}")
print(f"Silhouette Score for t-SNE: {tsne_silhouette:.3f}")

## Visualization of Clusters

In [None]:
# Plot PCA clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x=pca_embeddings[:, 0], y=pca_embeddings[:, 1], hue=pca_kmeans.labels_, palette="viridis", alpha=0.7)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("PCA Clustering of Topics")
plt.legend(title="Cluster")
plt.show()

# Plot t-SNE clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x=tsne_embeddings[:, 0], y=tsne_embeddings[:, 1], hue=tsne_kmeans.labels_, palette="coolwarm", alpha=0.7)
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.title("t-SNE Clustering of Topics")
plt.legend(title="Cluster")
plt.show()

## Conclusion
# - PCA preserves global structure and explains variance well but may not separate clusters as effectively.
# - t-SNE captures local structures better and tends to form more distinct clusters.
# - Silhouette scores indicate which method provides better-defined topic clusters.

# Part 7: Using the Reducer in BERTopic
To find the best dimensionality reducer, we plug each reducer into a fixed BERTopic "assumption configuration". This configuration uses the default `HDBSCAN` as the clustering model and `CountVectorizer` as the vectorizer model.
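
Plugging in a reducer relies on BERTopic calling `fit` and then `transform` on whatever object is passed as `umap_model`. That is an assumption drawn from BERTopic's documented behavior, not something checked here. Estimators such as t-SNE only offer `fit_transform`, so they need an adapter. The sketch below shows one possible wrapper; `FitTransformAdapter` and `FirstTwoColumns` are hypothetical names invented for this illustration.

```python
import numpy as np

class FitTransformAdapter:
    """Hypothetical wrapper giving a fit_transform-only estimator fit/transform methods."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Nothing to learn ahead of time: the wrapped estimator embeds data in one step
        return self

    def transform(self, X):
        # Re-run the full embedding on whatever data is passed in
        return self.estimator.fit_transform(np.asarray(X))


class FirstTwoColumns:
    """Toy stand-in for an estimator (like t-SNE) that only offers fit_transform."""

    def fit_transform(self, X):
        return X[:, :2]


reducer = FitTransformAdapter(FirstTwoColumns())
toy = np.arange(12, dtype=float).reshape(4, 3)
reduced = reducer.fit(toy).transform(toy)
print(reduced.shape)  # (4, 2)
```

Under the same assumption, wrapping a real estimator would look like `FitTransformAdapter(TSNE(n_components=2))`. Note that such a reducer cannot project unseen documents consistently, because every call recomputes the embedding from scratch.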

### Model Assumptions

In [None]:
import re
import string

import nltk
import numpy as np
from bertopic import BERTopic
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from umap import UMAP

# Download stop words if not already available
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def custom_preprocessor(text):
    # Lowercase
    text = text.lower()

    # Remove emojis
    text = re.sub(r'[^\x00-\x7F]+', '', text)

    # Remove non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize and remove stopwords
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(tokens)

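# t-SNE has no separate transform step, so expose fit_transform under that name;
# BERTopic calls fit() and then transform() on its dimensionality reduction model
TSNE.transform = TSNE.fit_transform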
vectorizer_model = CountVectorizer(stop_words='english', ngram_range=(1, 2), preprocessor=custom_preprocessor)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")


## Testing Coherence and diversity

### **Topic Coherence**

**Definition**: Measures the **semantic similarity** between high-probability words in a topic.

**Why it matters**: A coherent topic has top words that make sense together—e.g., a topic like `["apple", "banana", "fruit", "mango", "grape"]` is more coherent than `["apple", "engine", "mango", "car", "grape"]`.

**Popular metric variants**:
- **NPMI (Normalized Pointwise Mutual Information)** – used in the BERTopic paper; good for evaluating neural models.
- **C_V** – combines cosine similarity with sliding windows; works well for human-aligned scores.
- **U_Mass** – based on document co-occurrence; favors traditional LDA-style models.

The original BERTopic paper evaluated topics with the `NPMI` variant, but NPMI tends to work poorly with short texts, low vocabulary overlap, and few documents. This is why we use the `c_v` variant here.

**Range**: NPMI ranges from -1 to 1, while `c_v` typically falls between 0 and 1. Higher is better.
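
As a rough illustration of how a co-occurrence-based coherence score behaves, the sketch below computes an average NPMI over the word pairs of one topic, using simple document-level co-occurrence. It is a simplified teaching version: the function name `npmi_coherence`, the toy documents, and the handling of the edge cases are assumptions, and library implementations such as gensim's may differ in details like windowing and smoothing.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of one topic, using document co-occurrence."""
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the words never co-occur: lowest possible NPMI
        elif p12 == 1:
            scores.append(1.0)   # NPMI is undefined here; this sketch treats it as 1.0
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Toy corpus invented for this example
toy_docs = ["apple banana", "apple banana", "car engine", "car engine"]
coherent = npmi_coherence(["apple", "banana"], toy_docs)
incoherent = npmi_coherence(["apple", "car"], toy_docs)
print(coherent, incoherent)
```

Words that always appear together score close to 1, and words that never share a document score -1. This also shows why the metric struggles on short texts: with few words per comment, most pairs never co-occur.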

---

In [None]:
def bertopic_coherence(model: BERTopic, documents: list[str], coherence: str = "c_v", top_n_words: int = 10) -> float:
    from gensim.corpora.dictionary import Dictionary
    from gensim.models.coherencemodel import CoherenceModel
    from sklearn.feature_extraction.text import CountVectorizer


    # Step 1: Extract valid topic word lists
    topic_words = []
    for topic_id in range(len(model.get_topics())):
        words = model.get_topic(topic_id)
        if words:  # Ensure non-empty topic
            topic_words.append([word for word, _ in words[:top_n_words]])

    if not topic_words:
        raise ValueError("No valid topics found in the BERTopic model.")

    # Step 2: Build the same tokenizer CountVectorizer uses
    # (lowercases, strips punctuation, and drops English stopwords)
    analyzer = CountVectorizer(stop_words='english').build_analyzer()

    # Step 3: Tokenize documents
    tokenized_docs = [analyzer(doc) for doc in documents]

    if not any(tokenized_docs):
        raise ValueError("Tokenized documents are empty after preprocessing.")

    # Step 4: Create dictionary and compute coherence
    dictionary = Dictionary(tokenized_docs)
    coherence_model = CoherenceModel(
        topics=topic_words,
        texts=tokenized_docs,
        dictionary=dictionary,
        coherence=coherence
    )

    return coherence_model.get_coherence()



### **Topic Diversity**

**Definition**: Measures how **distinct** the topics are from each other by looking at the **overlap of their top words**.

**Formula** (used in BERTopic paper):
\[
\text{Diversity} = \frac{\text{Number of unique words across topics}}{\text{Total words (Top-N words × Number of topics)}}
\]

**Why it matters**: You want each topic to represent something unique. If every topic includes the same top words, your model lacks diversity.

**Range**: 0 to 1. Higher is better.
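
The formula can be checked by hand on a tiny example. The sketch below applies it to two toy topics that share one word; the function name `topic_diversity` and the word lists are made up for illustration.

```python
def topic_diversity(topics):
    """Share of unique words among the top words of all topics."""
    all_words = [word for topic in topics for word in topic]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)

# Two toy topics with three top words each; "fruit" appears in both
toy_topics = [
    ["apple", "banana", "fruit"],
    ["car", "engine", "fruit"],
]
score = topic_diversity(toy_topics)
print(round(score, 3))  # 5 unique words / 6 total words = 0.833
```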

---

Together:
- **High coherence** = easy to interpret topics.
- **High diversity** = minimal redundancy across topics.


In [None]:
def bertopic_diversity(model, top_n_words=10):
    topics = model.get_topics()
    topic_words = set()
    total_words = 0

    for topic_id in range(len(topics)):
        words = model.get_topic(topic_id)
        if words:
            top_words = [word for word, _ in words[:top_n_words]]
            topic_words.update(top_words)
            total_words += len(top_words)

    if total_words == 0:
        return 0.0

    diversity_score = len(topic_words) / total_words
    return diversity_score

In [None]:
bertopic = BERTopic()
topics, probs = bertopic.fit_transform(texts)

### Silhouette Score
The **silhouette score** is a metric used to evaluate the quality of **unsupervised clustering**, including the kind of clustering BERTopic does.

#### Definition
The silhouette score measures how similar a data point is to its **own cluster** compared to **other clusters**.

#### Formula (for a single sample)
\[
s = \frac{b - a}{\max(a, b)}
\]
- `a`: Mean intra-cluster distance (average distance to other points in the same cluster).
- `b`: Mean nearest-cluster distance (average distance to points in the nearest cluster).
- `s` ranges from **-1 to 1**:
  - **+1**: Well matched to its own cluster, and poorly matched to others.
  - **0**: On or very close to the decision boundary between clusters.
  - **-1**: Possibly assigned to the wrong cluster.

#### Use Case
Great for evaluating how "tight" and "well-separated" your clusters are—especially in **embedding-based** models like BERTopic where traditional likelihood-based metrics don’t apply.
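
To connect the formula to numbers, the following pure-Python sketch computes the mean silhouette score directly from the per-sample definition. It assumes at least two clusters and no duplicate points; the function name `silhouette` and the four toy points are invented for this example. In the notebook itself, scikit-learn's `silhouette_score` does this work.

```python
import math

def silhouette(points, labels):
    """Mean silhouette score computed directly from the per-sample formula."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from point i to the members of each cluster (excluding itself)
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        if label not in dists:
            scores.append(0.0)  # singleton cluster: the score is defined as 0
            continue
        a = sum(dists[label]) / len(dists[label])  # mean intra-cluster distance
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != label)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the two visible groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

The well-matched labeling scores close to 1, while the labeling that splits each visible group across both clusters gives a negative score.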


In [None]:
def bertopic_silhouette(model, texts):
    """
    Calculates the silhouette score of a fitted BERTopic model's document clusters.

    Args:
        model: A fitted BERTopic model.
        texts: The list of documents to score (typically the training documents).

    Returns:
        The silhouette score, or None if it cannot be computed.
    """
    # Embed the documents and project them with the model's fitted reducer,
    # so the score is measured in the same space the clustering ran in.
    # (Topic embeddings have one row per topic, not per document, so they
    # cannot be paired with per-document topic labels.)
    doc_embeddings = model.embedding_model.embed_documents(texts, verbose=False)
    reduced_embeddings = np.asarray(model.umap_model.transform(doc_embeddings))

    # Get the topic assignment of every document
    topics, _ = model.transform(texts, embeddings=doc_embeddings)
    topics = np.asarray(topics)

    # Remove outlier documents (topic -1)
    valid_mask = topics != -1
    if not valid_mask.any():
        print("No valid topics found for silhouette score calculation.")
        return None

    valid_embeddings = reduced_embeddings[valid_mask]
    valid_topics = topics[valid_mask]

    # The silhouette score needs at least two clusters
    if len(set(valid_topics)) < 2:
        print("Not enough clusters for silhouette score calculation.")
        return None

    # Calculate silhouette score
    try:
        return silhouette_score(valid_embeddings, valid_topics)
    except ValueError:
        print("Silhouette score calculation failed. Check data or parameters.")
        return None

The assumption configuration also includes a KMeans clustering model, which ensures that the data is always split into a constant number of clusters. This lets us calculate silhouette scores even when the dimensionality reduction performs poorly. Without it, for example, the `PCA_10` configuration could split the data into only 2 clusters.

In [None]:
shared_configuration = {
    "vectorizer_model": CountVectorizer(stop_words='english', ngram_range=(1, 2), preprocessor=custom_preprocessor),
    "embedding_model": SentenceTransformer("all-MiniLM-L6-v2"),
    "verbose": False,
    "min_topic_size": 10,  # has no effect when a custom clustering model is supplied
    # BERTopic's clustering slot is named hdbscan_model but accepts other clustering models
    "hdbscan_model": KMeans(n_clusters=10, random_state=42)
}

In [None]:
bertopic_silhouette(bertopic, texts)

### Finding BERTopic baseline scores
To optimize BERTopic's dimensionality reduction step, we first need baseline scores to compare against. By default, BERTopic uses `UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", low_memory=self.low_memory)`.

In [None]:
diversity_score = bertopic_diversity(bertopic)
coherence_score = bertopic_coherence(bertopic, texts, 'c_v')
diversity_score, coherence_score

In [None]:
from umap import UMAP
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from bertopic.dimensionality import BaseDimensionalityReduction
from bertopic import BERTopic

TSNE.transform = TSNE.fit_transform

topic_models = {}


topic_models['EMPTY'] = BERTopic(**shared_configuration, umap_model= BaseDimensionalityReduction())

# PCA variations
topic_models["PCA_5"] = BERTopic(**shared_configuration, umap_model=PCA(n_components=5))
topic_models["PCA_10"] = BERTopic(**shared_configuration, umap_model=PCA(n_components=10))
topic_models["PCA_25"] = BERTopic(**shared_configuration, umap_model=PCA(n_components=25))

# TSNE variations
topic_models["TSNE_10"] = BERTopic(**shared_configuration, umap_model=TSNE(n_components=2, perplexity=10, random_state=42))
topic_models["TSNE_30"] = BERTopic(**shared_configuration, umap_model=TSNE(n_components=2, perplexity=30, random_state=42))
topic_models["TSNE_50"] = BERTopic(**shared_configuration, umap_model=TSNE(n_components=2, perplexity=50, random_state=42))

# UMAP variations
topic_models["UMAP_5"] = BERTopic(**shared_configuration, umap_model=UMAP(n_components=5, random_state=42))
topic_models["UMAP_10"] = BERTopic(**shared_configuration, umap_model=UMAP(n_components=10, random_state=42))
topic_models["UMAP_25"] = BERTopic(**shared_configuration, umap_model=UMAP(n_components=25, random_state=42))

### Fit models

In [None]:
import pandas as pd

# Assuming you have already run the code and have the topic_models dictionary and texts list

results = []
for model_name, model in topic_models.items():
    try:
        model.fit_transform(texts)
        diversity = bertopic_diversity(model)
        coherence = bertopic_coherence(model, texts, 'c_v')
        silhouette = bertopic_silhouette(model, texts)
        results.append([model_name, diversity, coherence, silhouette])
    except Exception as e:
        print(f"Error processing {model_name}: {e}")
        results.append([model_name, None, None, None])  # Store None for errors (one per metric column)


df_comparison = pd.DataFrame(results, columns=["Model", "Diversity", "Coherence", "Silhouette"])
df_comparison


In [None]:
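# Visualization: collect document-topic assignments
import plotly.express as px
from umap import UMAP

# Project the document embeddings (one row per comment) to 2D once, so that
# every model is drawn on the same layout
doc_embeddings = embedding_model.encode(texts, show_progress_bar=False)
projection_model = UMAP(n_neighbors=15, min_dist=0.0, n_components=2, metric="cosine", random_state=42)
projection = projection_model.fit_transform(doc_embeddings)
plot_data = []
for model_name, model in topic_models.items():
    try:
        # Reuse the topics assigned during fitting. The models share one KMeans
        # instance, so transform() on an earlier model would use a later fit.
        topics = model.topics_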

        if len(set(topics)) <= 1:
            print(f"Skipping {model_name}: only one topic detected.")
            continue

        for i, (x, y) in enumerate(projection):
            plot_data.append({
                "x": x,
                "y": y,
                "Text": texts[i],
                "Topic": f"Topic {topics[i]}",
                "Model": model_name
            })

    except Exception as e:
        print(f"Error in {model_name} visualization: {e}")

In [None]:
df_plot = pd.DataFrame(plot_data)

if not df_plot.empty:
    fig = px.scatter(
        df_plot,
        x="x", y="y", color="Topic",
        facet_col="Model", facet_col_wrap=3,
        hover_data=["Text"],
        title="BERTopic Document Clustering Across Dimensionality Reducers",
        opacity=0.85,
        height=800
    )
    fig.show()
else:
    print("No clustering data to visualize.")

## Optimizing the best dimensionality reducer
The `UMAP_5` configuration seems to work best. UMAP with 5 dimensions is the recommended approach when working with BERTopic, and for good reason.
In the following section, we try to improve the UMAP results further.

In [None]:
from sklearn.model_selection import ParameterGrid

# Define the parameter grid for UMAP
param_grid = {
    "n_neighbors": [10, 15, 20],
    "min_dist": [0.0, 0.1, 0.2, 0.5],
    "n_components": [5],  # fix n_components at 5
    "metric": ["cosine", "euclidean"]
}

# Initialize an empty list to store the results
results = []

# Iterate over the parameter grid using ParameterGrid
for params in ParameterGrid(param_grid):
    try:
        # Create a UMAP model with current parameters
        umap_model = UMAP(**params, random_state=42)

        # Create and fit the BERTopic model with current UMAP config
        model = BERTopic(**shared_configuration, umap_model=umap_model)
        model.fit_transform(texts)

        # Calculate and store the evaluation metrics
        diversity = bertopic_diversity(model)
        coherence = bertopic_coherence(model, texts, 'c_v')
        silhouette = bertopic_silhouette(model, texts)
        results.append([params, diversity, coherence, silhouette])

    except Exception as e:
        print(f"Error with parameters {params}: {e}")
        results.append([params, None, None, None])

# Convert the results to a Pandas DataFrame for easier analysis
df_umap_results = pd.DataFrame(results, columns=["Parameters", "Diversity", "Coherence", "Silhouette"])
df_umap_results
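
The grid above expands to the Cartesian product of its value lists, and each combination costs one full BERTopic fit. The standard-library sketch below enumerates the same combinations to make that cost visible. It only mirrors the idea behind `ParameterGrid` and is not a replacement for it; the variable names are chosen for this example.

```python
from itertools import product

param_grid = {
    "n_neighbors": [10, 15, 20],
    "min_dist": [0.0, 0.1, 0.2, 0.5],
    "n_components": [5],
    "metric": ["cosine", "euclidean"],
}

# Build one dict per combination of parameter values
keys = sorted(param_grid)
configs = [
    dict(zip(keys, values))
    for values in product(*(param_grid[key] for key in keys))
]

print(len(configs))  # 3 * 4 * 1 * 2 = 24 model fits
print(configs[0])
```

Adding one more value to any list multiplies the number of fits, so it pays to keep the grid small and widen it only around promising regions.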


In [None]:
# UMAP 2D projection of the document embeddings (for consistent plotting)
doc_embeddings = embedding_model.encode(texts, show_progress_bar=False)
umap_vis = UMAP(n_components=2, random_state=42)
projection = umap_vis.fit_transform(doc_embeddings)

# Collect clustering data
plot_data = []

for params in ParameterGrid(param_grid):
    try:
        umap_model = UMAP(**params, random_state=42)
        model = BERTopic(**shared_configuration, umap_model=umap_model)
        # Pass the precomputed document embeddings so they are not recomputed per config
        model.fit(texts, doc_embeddings)
        topics, _ = model.transform(texts, embeddings=doc_embeddings)

        if len(set(topics)) <= 1:
            print(f"Skipping config (only 1 topic): {params}")
            continue

        config_label = f"n={params['n_neighbors']}<br>dist={params['min_dist']}<br>{params['metric']}"

        for i, (x, y) in enumerate(projection):
            plot_data.append({
                "x": x,
                "y": y,
                "Text": texts[i],
                "Topic": f"Topic {topics[i]}",
                "UMAP_Config": config_label
            })

    except Exception as e:
        print(f"Visualization error for {params}: {e}")

In [None]:
df_plot = pd.DataFrame(plot_data)

if not df_plot.empty:
    fig = px.scatter(
        df_plot,
        x="x", y="y", color="Topic",
        facet_col="UMAP_Config", facet_col_wrap=3,
        hover_data=["Text"],
        title="📌 BERTopic Clustering of Comments Across UMAP Configurations",
        height=1000,
        opacity=0.85
    )
    fig.show()
else:
    print("⚠️ No data available for visualization.")

Next, we experiment with HDBSCAN to try to achieve higher silhouette scores.

In [None]:
import hdbscan
hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=10,              # Adjust based on expected cluster sizes
    min_samples=5,                    # Lower or equal to min_cluster_size is a good start
    metric='euclidean',               # Distance metric applied to the reduced embeddings
    cluster_selection_method='leaf',  # 'leaf' can sometimes yield more granular clusters
    prediction_data=True              # Needed to assign topics to documents after fitting
)

shared_configuration['hdbscan_model'] = hdbscan_model

In [None]:
from sklearn.model_selection import ParameterGrid

# Define the parameter grid for UMAP
param_grid = {
    "n_neighbors": [10, 15, 20],
    "min_dist": [0.0, 0.1, 0.2, 0.5],
    "n_components": [5],  # fix n_components at 5
    "metric": ["cosine", "euclidean"]
}

# Initialize an empty list to store the results
results = []

# Iterate over the parameter grid using ParameterGrid
for params in ParameterGrid(param_grid):
    try:
        # Create a UMAP model with current parameters
        umap_model = UMAP(**params, random_state=42)

        # Create and fit the BERTopic model with current UMAP config
        model = BERTopic(**shared_configuration, umap_model=umap_model)
        model.fit_transform(texts)

        # Calculate and store the evaluation metrics
        diversity = bertopic_diversity(model)
        coherence = bertopic_coherence(model, texts, 'c_v')
        silhouette = bertopic_silhouette(model, texts)
        results.append([params, diversity, coherence, silhouette])

    except Exception as e:
        print(f"Error with parameters {params}: {e}")
        results.append([params, None, None, None])

# Convert the results to a Pandas DataFrame for easier analysis
df_umap_results = pd.DataFrame(results, columns=["Parameters", "Diversity", "Coherence", "Silhouette"])
df_umap_results


In [None]:
df_umap_results[df_umap_results["Silhouette"] > 0]

We found that, under our assumptions, UMAP with 5 dimensions, a `min_dist` between 0.1 and 0.2, and 10 neighbors works best.