# YouTube Comments Processing Notebook

This notebook processes YouTube comments with the [text-embedding-004](https://ai.google.dev/gemini-api/docs/embeddings) model by:
- Encoding each comment as an embedding vector
- Reducing the embeddings to a 2-dimensional space

The processed data will be used for clustering analysis.
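
The helper `gemini_client.add_embeddings_to_dataframe` used later lives in a local module whose implementation is not shown in this notebook. As a rough sketch of what one embedding request could look like with the `google-genai` SDK, the hypothetical functions below split the comments into batches and pass `task_type` and `output_dimensionality` through the request config. The names `batched` and `embed_texts` are assumptions made for illustration. The client is passed in as an argument, so nothing here contacts the API until it is called with a real `genai.Client`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_texts(client, texts, model="text-embedding-004",
                task_type="CLUSTERING", output_dimensionality=2):
    """Hypothetical helper: embed a list of strings in a single request.

    `client` is expected to behave like a google-genai `genai.Client`.
    """
    response = client.models.embed_content(
        model=model,
        contents=texts,
        config={
            "task_type": task_type,
            "output_dimensionality": output_dimensionality,
        },
    )
    # Each returned embedding exposes its floats as `.values`.
    return [list(embedding.values) for embedding in response.embeddings]
```

With 500 comments and a batch size of 100, `batched` produces five batches, which matches the `5/5` shown by the progress bar in the main process below.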

## Package Installation
Run this cell to install required packages.

In [None]:
import sys

!{sys.executable} -m pip install pandas numpy tqdm python-dotenv
!{sys.executable} -m pip install -q -U google-genai



## Imports and Setup


In [None]:
import pandas as pd
import os

In [None]:
%load_ext autoreload
%autoreload 2

import importlib
import sys
import os

# Add the parent directory to sys.path so the local `utils` package can be imported.
module_path = os.path.abspath(os.path.join("../"))
if module_path not in sys.path:
    sys.path.append(module_path)

import utils.gemini_utils.gemini_client as gemini_client
import utils.gemini_utils.gemini_key as gemini_key


def reload_utils():
    """Re-import the local Gemini helper modules after they have been edited."""
    importlib.reload(gemini_client)
    importlib.reload(gemini_key)

reload_utils()

In [None]:
from google import genai

# Create the Gemini API client; the key is read from the local gemini_key module.
google_client = genai.Client(api_key=gemini_key.GEMINI_API_KEY)

## Utility Functions

In [None]:
def validate_dataframe(df):
    """Raise ValueError if df lacks any column this notebook relies on."""
    required_columns = ['text', 'author', 'likes', 'replyCount']
    missing_columns = [col for col in required_columns if col not in df.columns]

    if missing_columns:
        raise ValueError(f"Missing required columns: {', '.join(missing_columns)}")

In [None]:
def load_and_validate_dataset(input_file):
    """Read the CSV at input_file and check that the required columns exist."""
    print(f"Loading data from {input_file}...")
    df = pd.read_csv(input_file)
    validate_dataframe(df)
    return df
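
To see the column check in action without touching the real dataset, the sketch below reads a made-up in-memory CSV that lacks `replyCount`. The helper `missing_columns` and the sample data are illustrative and not part of the notebook. The helper repeats the list comprehension from `validate_dataframe` so that the snippet runs on its own.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["text", "author", "likes", "replyCount"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from df."""
    return [col for col in required if col not in df.columns]


# A tiny in-memory CSV without the `replyCount` column (made-up data).
sample_csv = io.StringIO("text,author,likes\nNice video,@someone,3\n")
sample_df = pd.read_csv(sample_csv)

print(missing_columns(sample_df))  # ['replyCount']
```

Passing `sample_df` to `validate_dataframe` would raise `ValueError: Missing required columns: replyCount`.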

## Main Process

In [None]:
dataset_name = "honey_scam_500"
input_file = f"../datasets/youtube-comments/{dataset_name}.csv"
df = load_and_validate_dataset(input_file)

Loading data from ../datasets/youtube-comments/honey_scam_500.csv...


In [None]:
df = gemini_client.add_embeddings_to_dataframe(
    df, google_client=google_client, task_type="CLUSTERING", output_dimensionality=2, batch_size=100)
display(df.head())

Generating embeddings...


Processing embeddings batches:  60%|██████    | 3/5 [00:02<00:01,  1.69batch/s]

Error during embedding batch 1: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource has been exhausted (e.g. check quota).', 'status': 'RESOURCE_EXHAUSTED'}}
Error during embedding batch 2: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource has been exhausted (e.g. check quota).', 'status': 'RESOURCE_EXHAUSTED'}}


Processing embeddings batches: 100%|██████████| 5/5 [00:02<00:00,  2.71batch/s]

Error during embedding batch 3: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource has been exhausted (e.g. check quota).', 'status': 'RESOURCE_EXHAUSTED'}}
Error during embedding batch 4: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource has been exhausted (e.g. check quota).', 'status': 'RESOURCE_EXHAUSTED'}}


Processing embeddings batches: 100%|██████████| 5/5 [00:02<00:00,  1.77batch/s]


| Unnamed: 0 | text | author | likes | replyCount | embed_dim_0 | embed_dim_1 |
|---|---|---|---|---|---|---|
| 0 | If you guys enjoyed this video, please conside... | @MegaLag | 23905 | 539 | -0.014661 | -0.001 |
| 1 | Cannot trust those guys on the thumbnail? | @logic8673 | 0 | 0 | -0.017816 | 0.010065 |
| 2 | I like that the influencers got f***ed to when... | @PeterJung-cx1ib | 0 | 0 | -0.00177 | -0.007576 |
| 3 | This is straight up evil. I knew Paypal is no... | @filsc | 0 | 0 | -0.01582 | 0.016761 |
| 4 | Its not their fault tho they didnt know | @TeumuTemara | 0 | 0 | -0.014794 | 0.009811 |
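
The `429 RESOURCE_EXHAUSTED` messages in the output above indicate that batches 1 to 4 hit a quota or rate limit. The five rows displayed all belong to the first batch, so they do not show whether the remaining rows received embeddings. One common mitigation is to retry a failed call with exponential backoff. The sketch below is a generic pattern, not the behavior of `gemini_client`. The name `call_with_backoff`, its parameters, and the string-based detection of rate-limit errors are assumptions made for illustration.

```python
import random
import time


def call_with_backoff(func, *args, max_retries=5, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff on rate-limit errors.

    Assumption: a rate-limit failure can be recognised by "429" or
    "RESOURCE_EXHAUSTED" in the exception text, as in the log above.
    """
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            message = str(exc)
            rate_limited = "429" in message or "RESOURCE_EXHAUSTED" in message
            # Re-raise unrelated errors, or give up after the last retry.
            if not rate_limited or attempt == max_retries:
                raise
            # Wait roughly 1s, 2s, 4s, ... plus jitter so retries do not align.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

A batch request could then be wrapped as `call_with_backoff(some_embedding_function, batch)`, where `some_embedding_function` stands for whatever function sends one batch to the API. Lowering `batch_size` or pausing between batches are other options, depending on which quota is being exceeded.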


In [None]:
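# Note: the output goes under ./datasets in the working directory, whereas the input was read from ../datasets.
output_file = f"./datasets/with-assumptions/{dataset_name}.csv"
os.makedirs(os.path.dirname(output_file), exist_ok=True)  # to_csv does not create missing directories
df.to_csv(output_file, index=False)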
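
Because several batches failed with 429 errors in the run shown above, it may be worth checking how many rows actually carry embeddings before using the saved file for clustering. The sketch assumes that rows from failed batches end up with missing (NaN) values in `embed_dim_0` and `embed_dim_1`. How `add_embeddings_to_dataframe` really represents failures is not shown in this notebook, and `count_missing_embeddings` is a hypothetical helper.

```python
import pandas as pd


def count_missing_embeddings(df, columns=("embed_dim_0", "embed_dim_1")):
    """Count rows where any embedding column is NaN (hypothetical helper)."""
    return int(df[list(columns)].isna().any(axis=1).sum())


# Made-up frame: the second row stands in for a comment from a failed batch.
demo = pd.DataFrame({
    "text": ["first comment", "second comment"],
    "embed_dim_0": [-0.0147, None],
    "embed_dim_1": [0.0098, None],
})

print(count_missing_embeddings(demo))  # 1
```

Running `count_missing_embeddings(df)` on the real frame would report how many comments still need to be embedded, under the NaN assumption stated above.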
