# YouTube Comments Processing Notebook

This notebook processes YouTube comments by:
- Encoding the comments with the [text-embedding-004](https://ai.google.dev/gemini-api/docs/embeddings) model
- Reducing the comment encodings to a 2-dimensional space

The processed data will be used for clustering analysis.

## Package Installation
Run this cell to install required packages.

In [None]:
%%capture
%pip install pandas numpy tqdm python-dotenv
%pip install -q -U google-genai

## Imports and Setup


In [None]:
import pandas as pd
import os

In [None]:
%load_ext autoreload
%autoreload 2

import importlib
import sys
import os

module_path = os.path.abspath(os.path.join("../"))
if module_path not in sys.path:
    sys.path.append(module_path)

import utils.gemini_client as gemini_client

def reload_utils():
    importlib.reload(gemini_client)


reload_utils()

## Utility Functions

In [None]:
def validate_dataframe(df):
    """Raise ValueError if df lacks any of the required columns."""
    required_columns = ['text', 'author', 'likes', 'replyCount']
    missing_columns = [col for col in required_columns if col not in df.columns]

    if missing_columns:
        raise ValueError(f"Missing required columns: {', '.join(missing_columns)}")

In [None]:
def load_and_validate_dataset(input_file):
    """Load a CSV of comments and check that it has the required columns."""
    print(f"Loading data from {input_file}...")
    df = pd.read_csv(input_file)
    validate_dataframe(df)
    return df
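
Because `load_and_validate_dataset` reads the whole CSV before validating it, a file with missing columns is only rejected after loading. A minimal sketch of checking just the header row first, using only the standard library. `missing_csv_columns` is a hypothetical helper, not part of this notebook or of `utils`:

```python
import csv
import io

REQUIRED_COLUMNS = ['text', 'author', 'likes', 'replyCount']


def missing_csv_columns(file_obj, required=REQUIRED_COLUMNS):
    """Return the required columns absent from the CSV header row.

    Hypothetical helper: reads only the first row of an open text file.
    """
    header = next(csv.reader(file_obj), [])
    return [col for col in required if col not in header]


# Example with an in-memory CSV that lacks 'likes' and 'replyCount'.
sample = io.StringIO("text,author\nNice video,alice\n")
print(missing_csv_columns(sample))
```

With a real file, the same helper could be called on a handle opened with `open(input_file, newline="", encoding="utf-8")`; the encoding is an assumption about the dataset.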

## Main Process

In [None]:
dataset_name = "honey_scam_500"
input_file = f"../datasets/youtube-comments/{dataset_name}.csv"
df = load_and_validate_dataset(input_file)

In [None]:
df = gemini_client.add_embeddings_to_dataframe(
    df, task_type="CLUSTERING", output_dimensionality=2, batch_size=100)
display(df.head())
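
The internals of `gemini_client.add_embeddings_to_dataframe` are not shown in this notebook. As a rough sketch of what `batch_size=100` could mean, the helper below splits the texts into fixed-size batches and calls an embedding function once per batch. `embed_in_batches` and `embed_fn` are hypothetical names, and the stand-in `fake_embed` only mimics the shape of 2-dimensional output; a real `embed_fn` would wrap a call to the embedding model.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 100,
) -> List[Vector]:
    """Embed texts batch by batch, preserving input order.

    Hypothetical helper; embed_fn must return one vector per input text.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_fn returned a wrong number of vectors")
        vectors.extend(batch_vectors)
    return vectors


def fake_embed(batch: List[str]) -> List[Vector]:
    """Stand-in for a model call: (character count, word count) per text."""
    return [[float(len(t)), float(len(t.split()))] for t in batch]


comments = ["great video", "scam", "thanks for exposing this"]
print(embed_in_batches(comments, fake_embed, batch_size=2))
```

Keeping the embedding call behind a plain function argument makes the batching logic testable without network access or an API key.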

In [None]:
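# Create the output directory if needed, then save the processed comments.
output_file = f"./datasets/with-assumptions/{dataset_name}.csv"
os.makedirs(os.path.dirname(output_file), exist_ok=True)
df.to_csv(output_file, index=False)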
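
How the embeddings are laid out in the saved CSV depends on `add_embeddings_to_dataframe`, which this notebook does not show. If they are stored as Python lists in a single column, `to_csv` writes each one as a string such as `"[0.12, -0.34]"`, and that string has to be parsed when the file is read back. A minimal sketch under that assumption, where `parse_embedding` is a hypothetical helper:

```python
import ast
from typing import List


def parse_embedding(cell: str) -> List[float]:
    """Parse a stringified list such as "[0.12, -0.34]" into floats.

    Hypothetical helper; assumes the cell holds a Python list literal.
    """
    values = ast.literal_eval(cell)
    if not isinstance(values, (list, tuple)):
        raise ValueError(f"Expected a list literal, got: {cell!r}")
    return [float(v) for v in values]


print(parse_embedding("[0.12, -0.34]"))
```

If the embeddings are instead stored as separate numeric columns, no parsing step is needed.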
