# Topic-based clustering

The goal of our project is to group YouTube comments together based on their topics.

This notebook will explore different clustering algorithms, benchmark them, and recommend one that is most suitable for our problem.

## Introduction  

### Assumptions


This notebook assumes the following:
- Comments have already been cleaned and encoded.
- Comment encodings have been reduced to a 2-dimensional space.

We have not perfected these steps yet, so here is a short, imperfect implementation of them that lets us start working: [link to notebook](./assumptions.ipynb)

> We assume that the data is encoded properly and that, by measuring the distance between comments, we can cluster them by topic.
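
To make the distance assumption concrete, here is a minimal sketch that computes pairwise Euclidean distances between 2D points. The three points are made-up values for illustration, not real comment encodings, and Euclidean distance is only one possible choice of metric.

```python
import numpy as np

# Hypothetical 2D encodings for three comments (made-up values).
points = np.array([
    [0.0, 0.0],  # comment A
    [0.3, 0.4],  # comment B, close to A
    [3.0, 4.0],  # comment C, far from A
])

# Pairwise Euclidean distances via broadcasting:
# distances[i, j] is the distance between points[i] and points[j].
diffs = points[:, None, :] - points[None, :, :]
distances = np.linalg.norm(diffs, axis=-1)

print(distances.round(2))
```

Under the assumption above, A and B (distance 0.5) would be more likely to share a topic than A and C (distance 5.0).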

### Algorithms
- Agglomerative
- K-Means
- HDBSCAN



### Benchmarks
- Davies-Bouldin
- Silhouette
- Calinski-Harabasz
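
The three scores point in different directions: for Silhouette and Calinski-Harabasz higher is better, while for Davies-Bouldin lower is better. The sketch below illustrates this on synthetic blobs by comparing a K-Means labelling with purely random labels. The data and cluster centers are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Three well-separated synthetic blobs (not the project data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# A sensible labelling versus labels drawn at random.
good_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
random_labels = np.random.default_rng(0).integers(0, 3, size=len(X))

scores = {}
for name, labels in [("k-means", good_labels), ("random", random_labels)]:
    scores[name] = (
        silhouette_score(X, labels),         # higher is better, range [-1, 1]
        davies_bouldin_score(X, labels),     # lower is better, minimum 0
        calinski_harabasz_score(X, labels),  # higher is better, unbounded
    )
    print(name, scores[name])
```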


## Installations

In [None]:
%pip install scikit-learn pandas numpy tqdm python-dotenv google-genai plotly

## Benchmark utils

In [None]:
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

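def clustering_evaluation(X, labels):
    """Return (silhouette, davies_bouldin, calinski_harabasz) for a labelling of X."""
    # Silhouette and Calinski-Harabasz: higher is better; Davies-Bouldin: lower is better.
    silhouette = silhouette_score(X, labels)
    davies_bouldin = davies_bouldin_score(X, labels)
    calinski_harabasz = calinski_harabasz_score(X, labels)
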
    return silhouette, davies_bouldin, calinski_harabasz
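
One caveat: these scikit-learn scores raise an error when there are fewer than two clusters, and they treat HDBSCAN's noise label `-1` as if it were an ordinary cluster. The sketch below shows one possible way to guard against both. `safe_clustering_evaluation` is a hypothetical helper, not part of this notebook, and dropping noise points is an assumption that can make scores look better than they are.

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def safe_clustering_evaluation(X, labels, noise_label=-1):
    """Hypothetical wrapper: drop noise points, return None if scores are undefined."""
    X = np.asarray(X)
    labels = np.asarray(labels)

    # Keep only the points that were assigned to a real cluster.
    mask = labels != noise_label
    X_kept, labels_kept = X[mask], labels[mask]

    # The scores need at least 2 clusters and fewer clusters than samples.
    n_clusters = len(np.unique(labels_kept))
    if n_clusters < 2 or n_clusters >= len(labels_kept):
        return None

    return (
        silhouette_score(X_kept, labels_kept),
        davies_bouldin_score(X_kept, labels_kept),
        calinski_harabasz_score(X_kept, labels_kept),
    )


# Made-up points: two tight groups plus one outlier marked as noise.
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, -20.0],
])
labels_demo = np.array([0, 0, 0, 1, 1, 1, -1])

with_noise_dropped = safe_clustering_evaluation(X_demo, labels_demo)
single_cluster = safe_clustering_evaluation(X_demo, np.zeros(len(X_demo), dtype=int))
print(with_noise_dropped)
print(single_cluster)
```

When reporting results this way, it is worth also reporting how many points were labelled as noise, since a run that discards most points can still score well.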



## Notebook Start

### First step: let's plot our points on the plane

In [None]:
data_path="./datasets/with-assumptions/jack_vs_calley_1000.csv"

In [None]:
import pandas as pd
import ast

df = pd.read_csv(data_path)
# CSV stores each encoding as a string such as "[0.1, 0.2]"; parse it back into a list.
if df['encoded'].dtype == 'object' and isinstance(df['encoded'].iloc[0], str):
    df['encoded'] = df['encoded'].apply(ast.literal_eval)
    
df['x'] = df['encoded'].apply(lambda x: x[0])
df['y'] = df['encoded'].apply(lambda x: x[1])
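
To illustrate what the parsing step does, here is a self-contained sketch on an in-memory CSV. The two rows are made up and only mimic the assumed layout (a `text` column and a stringified `encoded` list). It also builds a NumPy matrix of shape `(n_samples, 2)`, which is the input format the scikit-learn clustering estimators expect.

```python
import ast
import io

import numpy as np
import pandas as pd

# Made-up rows mimicking the assumed CSV layout.
csv_text = 'text,encoded\n"great video","[0.1, 0.2]"\n"first!","[1.5, -0.3]"\n'
demo_df = pd.read_csv(io.StringIO(csv_text))

# Each cell is read as a string such as "[0.1, 0.2]"; parse it into a Python list.
demo_df["encoded"] = demo_df["encoded"].apply(ast.literal_eval)

# Stack the lists into an (n_samples, 2) array, then split it into columns.
coords = np.array(demo_df["encoded"].tolist())
demo_df["x"] = coords[:, 0]
demo_df["y"] = coords[:, 1]

print(demo_df[["text", "x", "y"]])
```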

In [None]:
import plotly.express as px

def plot_scatter(df):
    """Scatter-plot the 2D encodings; expects 'x', 'y' and 'text' columns in df."""
    fig = px.scatter(
        df, 
        x='x', 
        y='y',
        hover_data=['text'],
        title='Visualization of Encoded Points',
        labels={'x': 'Dimension 1', 'y': 'Dimension 2'},
        opacity=0.7
    )

    # Improve layout
    fig.update_layout(
        plot_bgcolor='white',
        width=900,
        height=700
    )

    # Add grid lines
    fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='LightGray')
    fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='LightGray')


    fig.show()


In [None]:
plot_scatter(df)

We cannot see any pattern at this point.
To make the plot easier to read, let's pick 50 random comments.
We can also assume that the points are this tightly clustered because of the heavy dimensionality reduction.

In [None]:
sample_50_df = df.sample(n=50, random_state=42)
plot_scatter(sample_50_df)

### Run some clustering

## Data preparation

### Split the data into subsets

## Utils

## Traditional Clustering Approach

### Clustering Algorithms

#### K-Means


#### HDBSCAN

#### Agglomerative Clustering

### Evaluation and Benchmarking

#### Silhouette Score


#### Calinski-Harabasz Index


#### Algorithm Comparison and Selection

### Hyperparameter Optimization

### Results Visualization and Interpretation

## Clustering with LLM Usage

## Summary and Recommendations