# Topic-based clustering

The goal of our project is to group YouTube comments together based on their topics.

This notebook will explore different clustering algorithms, benchmark them, and recommend one that is most suitable for our problem.

## Introduction  

### Assumptions


This notebook assumes the following: 
- Comments have already been cleaned and encoded. 
- Comment encodings have been reduced to a 2-dimensional space.

We have not perfected these steps yet, so here is a short and imperfect implementation of them that lets us start working: [link to notebook](./assumptions.ipynb)

> We assume that the data is encoded properly and that by measuring the distance between comments, we can cluster them based on topics.
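
To make this assumption concrete, here is a minimal sketch of measuring the distance between 2D comment encodings. The coordinates are made up for illustration, and Euclidean distance is itself an assumption; the notebook does not commit to a specific metric at this point.

```python
import numpy as np

# Hypothetical 2D encodings for three comments (illustrative values only).
points = np.array([
    [0.0, 0.0],  # comment A
    [0.0, 1.0],  # comment B, close to A
    [3.0, 4.0],  # comment C, far from A
])

# Euclidean distance between every pair of comments.
differences = points[:, None, :] - points[None, :, :]
distances = np.linalg.norm(differences, axis=-1)

# Under the assumption, a smaller distance means more similar topics.
# Index of the comment closest to A, skipping A itself.
closest_to_a = int(np.argmin(distances[0, 1:])) + 1
```

Under the assumption, comment B would be treated as topically closer to comment A than comment C is.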

### Algorithms
- Agglomerative
- K-Means
- HDBSCAN



### Benchmarks
- Davies-Bouldin
- Silhouette
- Calinski-Harabasz
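
These three scores do not point in the same direction: for Silhouette (range -1 to 1) and Calinski-Harabasz a higher value is better, while for Davies-Bouldin a lower value is better (the minimum is 0). A small sketch on synthetic blobs, which are an assumption for illustration and not our comment data, comparing a sensible labeling with a random one:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic, well-separated 2D blobs (illustrative data only).
X, true_labels = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# A deliberately bad labeling: random assignment to three groups.
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 3, size=len(X))

def score_all(X, labels):
    """Compute the three benchmark scores for one labeling."""
    return {
        "silhouette": silhouette_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
    }

good = score_all(X, true_labels)
bad = score_all(X, random_labels)
```

Comparing `good` and `bad` side by side shows which direction counts as "better" for each metric.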


## Installations

In [None]:
%%capture
%pip install scikit-learn pandas numpy tqdm python-dotenv google-genai plotly matplotlib

## Benchmark utils

In [None]:
import pandas as pd
import os

def create_subdatasets(dataset_path, subsets_amount, subset_size):
    """
    Reads a dataset from a path and draws random subsets from it.
    Each subset is sampled independently, so subsets may share rows.
    
    Parameters:
    -----------
    dataset_path : str
        Path to the dataset file
    subsets_amount : int
        Number of subdatasets to create
    subset_size : int
        Size of each subdataset
        
    Returns:
    --------
    dict
        Dictionary with dataset names as keys and pandas DataFrames as values
    """

    try:
        df = pd.read_csv(dataset_path)
    except Exception as e:
        raise ValueError(f"Could not read dataset from {dataset_path}. Error: {str(e)}")
    
    total_required_rows = subsets_amount * subset_size
    if len(df) < total_required_rows:
        raise ValueError(f"Dataset has {len(df)} rows, but {total_required_rows} rows are required for {subsets_amount} subsets of size {subset_size}")
    
    dataset_name = os.path.splitext(os.path.basename(dataset_path))[0]
    
    subdatasets = {}
    for i in range(subsets_amount):
        subset = df.sample(n=subset_size, random_state=i)
        subset_name = f"{dataset_name}_subset_{subset_size}_{i}"
        subdatasets[subset_name] = subset
    
    return subdatasets
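
Because `df.sample` is called once per subset, the subsets are drawn independently and may share rows. If non-overlapping subsets are preferred, one option is to shuffle once and slice. A minimal sketch; `create_disjoint_subdatasets` is a hypothetical helper and not part of this notebook:

```python
import pandas as pd

def create_disjoint_subdatasets(df, subsets_amount, subset_size, seed=0):
    """Shuffle once, then cut consecutive, non-overlapping slices."""
    required = subsets_amount * subset_size
    if len(df) < required:
        raise ValueError(f"Need {required} rows, got {len(df)}")

    shuffled = df.sample(frac=1, random_state=seed)
    return {
        f"subset_{subset_size}_{i}": shuffled.iloc[i * subset_size:(i + 1) * subset_size]
        for i in range(subsets_amount)
    }

# Usage on a toy frame of 10 rows: 3 subsets of 3 rows each.
toy = pd.DataFrame({"text": [f"comment {i}" for i in range(10)]})
subsets = create_disjoint_subdatasets(toy, subsets_amount=3, subset_size=3)
```

With this variant, the row-count check guarantees that every subset can be filled without reusing a row.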

In [None]:
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

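def clustering_evaluation(X, labels):
    """Return (silhouette, davies_bouldin, calinski_harabasz), or NaNs when the scores are undefined."""
    # The scores need at least 2 distinct labels and fewer labels than samples.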
    if len(set(labels)) > 1 and len(set(labels)) < len(X):
        silhouette = silhouette_score(X, labels)
        davies_bouldin = davies_bouldin_score(X, labels)
        calinski_harabasz = calinski_harabasz_score(X, labels)
    else:
        silhouette = float('nan')
        davies_bouldin = float('nan')
        calinski_harabasz = float('nan')
    
    return silhouette, davies_bouldin, calinski_harabasz
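
The guard on the number of labels is needed because scikit-learn refuses to score a labeling with a single cluster. A short sketch with made-up points that shows both cases:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four illustrative points forming two tight pairs.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# Every point in one cluster: the score is undefined and scikit-learn raises.
single_cluster = np.zeros(len(X), dtype=int)
try:
    silhouette_score(X, single_cluster)
    raised = False
except ValueError:
    raised = True

# Two clusters: the score is defined.
two_clusters = np.array([0, 0, 1, 1])
score = silhouette_score(X, two_clusters)
```

Returning NaN instead of raising lets the benchmark loop continue and keeps the degenerate runs visible in the results table.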

In [None]:
import pandas as pd
from tqdm import tqdm
import re

def clustering_benchmark(models_dict, dataframes_dict):    
    results = []
    total_iterations = len(dataframes_dict) * len(models_dict)

    with tqdm(total=total_iterations, desc='Clustering Progress') as pbar:
        for dataset_name, df in dataframes_dict.items():
            embed_cols = [col for col in df.columns if re.match(r'embed_dim_\d+', col)]
            
            for model_name, model in models_dict.items():
                model.fit(df[embed_cols])
                labels = model.labels_
                df['Cluster_Assignment'] = labels
                
                silhouette, davies_bouldin, calinski_harabasz = clustering_evaluation(df[embed_cols], labels)
                
                results.append({
                    'Dataset': dataset_name,
                    'Dataset Size': len(df),
                    'Model': model_name,
                    'Silhouette Score': silhouette,
                    'Davies-Bouldin Index': davies_bouldin,
                    'Calinski-Harabasz Index': calinski_harabasz,
                    'Number of Clusters': len(set(labels) - {-1})  # HDBSCAN labels noise as -1, which is not a cluster
                })
                pbar.update(1)
                
    return pd.DataFrame(results)
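
One caveat when reading cluster counts: HDBSCAN marks points it considers noise with the label `-1`, which is not a cluster. A small sketch of summarizing a label array with that in mind; `summarize_labels` is a hypothetical helper and the labels are made up:

```python
import numpy as np

def summarize_labels(labels):
    """Return (number of real clusters, fraction of points labeled as noise)."""
    labels = np.asarray(labels)
    # Negative labels are treated as noise, not as clusters.
    cluster_ids = np.unique(labels[labels >= 0])
    noise_fraction = float(np.mean(labels < 0))
    return len(cluster_ids), noise_fraction

example_labels = [0, 0, 1, -1, 1, 2, -1, 2]
n_clusters, noise_fraction = summarize_labels(example_labels)
```

Reporting the noise fraction next to the scores may help when comparing HDBSCAN with K-Means and Agglomerative, which assign every point to a cluster.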

## Notebook Start

### First step: let's plot our points on the plane

In [None]:
data_path="./datasets/with-assumptions/jack_vs_calley_1000.csv"

In [None]:
import pandas as pd
import ast

df = pd.read_csv(data_path)
if df['encoded'].dtype == 'object' and isinstance(df['encoded'].iloc[0], str):
    df['encoded'] = df['encoded'].apply(ast.literal_eval)
    
df['x'] = df['encoded'].apply(lambda x: x[0])
df['y'] = df['encoded'].apply(lambda x: x[1])

In [None]:
import plotly.express as px

def plot_scatter(df):        
    fig = px.scatter(
        df, 
        x='x', 
        y='y',
        hover_data=['text'],
        title='Visualization of Encoded Points',
        labels={'x': 'Dimension 1', 'y': 'Dimension 2'},
        opacity=0.7
    )

    # Improve layout
    fig.update_layout(
        plot_bgcolor='white',
        width=900,
        height=700
    )

    # Add grid lines
    fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='LightGray')
    fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGray')


    fig.show()


In [None]:
plot_scatter(df)

We cannot see any pattern at this moment.
To make the picture easier to read, let's pick just 50 random comments.
We can also assume the points are packed this tightly because of the aggressive dimensionality reduction.

In [None]:
sample_50_df = df.sample(n=50, random_state=42)
plot_scatter(sample_50_df)
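
One way to check the dimensionality-reduction remark is to measure how much variance survives a projection to 2D. The notebook does not state which reduction method was used, so PCA is an assumption here, and the random vectors are only a stand-in for real comment embeddings, which will likely behave differently:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real comment embeddings: 500 random 384-dimensional vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))

# Project down to two dimensions (PCA is an assumed method, see above).
pca = PCA(n_components=2, random_state=0)
points_2d = pca.fit_transform(embeddings)

# Fraction of the original variance that survives in the 2D projection.
retained = float(pca.explained_variance_ratio_.sum())
```

A low `retained` value on the real embeddings would support the idea that the 2D picture hides most of the structure.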

### Running the benchmark

In [None]:
from sklearn.cluster import AgglomerativeClustering, KMeans, HDBSCAN

data_path = "./datasets/with-assumptions/jack_vs_calley_1000.csv"

benchmark_data = create_subdatasets(data_path, 20, 50)

models_dict = {
    "Agglomerative": AgglomerativeClustering(n_clusters=5),
    "HDBSCAN": HDBSCAN(min_cluster_size=3),
    "KMeans": KMeans(n_clusters=5, random_state=42)
}


results_df = clustering_benchmark(models_dict, benchmark_data)

results_df.groupby('Model').mean(numeric_only=True).style\
    .format('{:.2f}')\
    .set_caption('Clustering Benchmark Results')

    

In [None]:
output_file = "./benchmarks results/jack_vs_cally_small.csv"
os.makedirs(os.path.dirname(output_file), exist_ok=True)  # to_csv does not create missing folders
results_df.to_csv(output_file, index=False)
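
To turn the averaged scores into a recommendation, one possible approach is to rank the models per metric and average the ranks. This is a sketch of that idea, not the selection method of this notebook; the model names and scores below are placeholders, not benchmark results:

```python
import pandas as pd

# Placeholder scores for hypothetical models; not results from this notebook.
summary = pd.DataFrame(
    {
        "Silhouette Score": [0.40, 0.55, 0.30],
        "Davies-Bouldin Index": [0.90, 0.70, 1.20],
        "Calinski-Harabasz Index": [120.0, 150.0, 80.0],
    },
    index=["model_a", "model_b", "model_c"],
)

# True means a higher value is better for that metric.
higher_is_better = {
    "Silhouette Score": True,
    "Davies-Bouldin Index": False,
    "Calinski-Harabasz Index": True,
}

# Rank 1 is best within each metric; the mean rank is a rough overall ordering.
ranks = pd.DataFrame(
    {
        metric: summary[metric].rank(ascending=not better)
        for metric, better in higher_is_better.items()
    }
)
mean_rank = ranks.mean(axis=1).sort_values()
best_model = mean_rank.index[0]
```

Rank averaging sidesteps the different scales of the three metrics, at the cost of ignoring how large the gaps between models are.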


### Run some clustering

## Data preparation

### Split the data into subsets

## Utils

## Traditional Clustering Approach

### Clustering Algorithms

#### K-Means


#### HDBSCAN

#### Agglomerative Clustering

### Evaluation and Benchmarking

#### Silhouette Score


#### Calinski-Harabasz Index


#### Algorithm Comparison and Selection

### Hyperparameter Optimization

### Results Visualization and Interpretation

## Clustering with LLM Usage

## Summary and Recommendations