# Sentiment Analysis

Let's download the clean dataset.

In [1]:
!gdown 1YvDb54rFAs8IjLVwOstu1lNlQJ9xMY5f

Downloading...
From: https://drive.google.com/uc?id=1YvDb54rFAs8IjLVwOstu1lNlQJ9xMY5f
To: /kaggle/working/YoutubeCommentsDataSet.csv
100%|██████████████████████████████████████| 3.49M/3.49M [00:00<00:00, 25.6MB/s]


In [2]:
# Install necessary libraries
!pip install datasets transformers huggingface_hub evaluate optuna
!apt-get install git-lfs

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m193.6/193.6 kB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: fsspec, evaluate
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.5.1
    Uninstalling fsspec-2025.5.1:
      Successfully uninstalled fsspec-2025.5.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.8.0 r

In [4]:
import os
from huggingface_hub import login as hf_login
import subprocess
import wandb
import pandas as pd
import numpy as np
import torch
from datasets import Dataset
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer, AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

2025-08-19 09:59:54.711108: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1755597594.915234      36 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1755597594.971129      36 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [3]:
# Auto-login for Hugging Face + Weights & Biases
hf_token = os.environ.get("HF_TOKEN")
wandb_api_key = os.environ.get("WANDB_API_KEY")


hf_login(token=hf_token)
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"

wandb.login(key=wandb_api_key)

[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mso-907[0m ([33mso-907-none[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

We need to make sure that we have a GPU.

In [5]:
torch.cuda.is_available()

True

Let's start by preparing the dataset.

In [7]:
df = pd.read_csv("YoutubeCommentsDataSet.csv")
df.dropna(inplace=True)
df["Sentiment"] = df["Sentiment"].replace({"negative": 0, "neutral": 1, "positive": 2})

train_dataset, eval_dataset = train_test_split(df, random_state=42, stratify=df["Sentiment"])

# Turn into dataset object
train_dataset = Dataset.from_pandas(train_dataset)
eval_dataset = Dataset.from_pandas(eval_dataset)

# Instantiate tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", num_labels=3)

def tokenize_function(examples):
  return tokenizer(examples["Comment"],
                   padding="max_length",
                   truncation=True,
                   max_length=512
                   )

train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

  df["Sentiment"] = df["Sentiment"].replace({"negative": 0, "neutral": 1, "positive": 2})


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/13773 [00:00<?, ? examples/s]

Map:   0%|          | 0/4591 [00:00<?, ? examples/s]

In [8]:
# Rename columns and define data collator
train_dataset = train_dataset.rename_column("Sentiment", "label")
eval_dataset = eval_dataset.rename_column("Sentiment", "label")

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

We are going to use one of Hugging Faces's pretrained models, RoBERTa base.

In [9]:
def compute_metrics(eval_pred):
  load_accuracy = evaluate.load("accuracy")
  load_f1 = evaluate.load("f1")

  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
  f1 = load_f1.compute(predictions=predictions, references=labels, average="macro")["f1"]

  return {"accuracy": accuracy, "f1": f1}

Let's setup the trainer and start training our model. We are setting some parameters to make the training faster.

In [12]:
import os

model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base",
        num_labels=3
    )

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "roberta-base",
        num_labels=3
    )

repo_name = "sentiment-analysis-on-youtube-comments"

training_args = TrainingArguments(
   output_dir=repo_name,
   eval_strategy="epoch",
   save_strategy="epoch",
   load_best_model_at_end=True,
   metric_for_best_model="f1",
   fp16=True,
   greater_is_better=True,
   report_to=["wandb"],
   run_name="roberta-base-experiments"
)

trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=train_dataset,
   eval_dataset=eval_dataset,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics,
   data_collator=data_collator,
   model_init=model_init
)


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Now we will use optuna to perform hyperparameter tuning.

In [None]:
best_trial = trainer.hyperparameter_search(
    direction="maximize",
    n_trials=5,
    backend="optuna"
)

print(best_trial)

[I 2025-08-19 10:01:58,004] A new study created in memory with name: no-name-a1a27555-de52-4dd7-824d-e82dc5c45a1e
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.378674,0.84034,0.784874
2,0.440800,0.376626,0.853627,0.8068


Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

[I 2025-08-19 10:16:01,563] Trial 0 finished with value: 1.6604268240330917 and parameters: {'learning_rate': 3.194088626974822e-05, 'num_train_epochs': 2, 'seed': 31, 'per_device_train_batch_size': 16}. Best is trial 0 with value: 1.6604268240330917.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁█
eval/f1,▁█
eval/loss,█▁
eval/runtime,█▁
eval/samples_per_second,▁█
eval/steps_per_second,▁█
train/epoch,▁▂██
train/global_step,▁▂██
train/grad_norm,▁
train/learning_rate,▁

0,1
eval/accuracy,0.85363
eval/f1,0.8068
eval/loss,0.37663
eval/runtime,44.6208
eval/samples_per_second,102.889
eval/steps_per_second,6.432
total_flos,3649012037167104.0
train/epoch,2.0
train/global_step,862.0
train/grad_norm,388887.75




Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.674631,0.693531,0.435185
2,No log,0.542897,0.778262,0.686108
3,0.695900,0.499829,0.802875,0.735623
4,0.695900,0.490327,0.805489,0.740678


[I 2025-08-19 10:43:34,832] Trial 1 finished with value: 1.5461669245375957 and parameters: {'learning_rate': 2.3081481392099564e-06, 'num_train_epochs': 4, 'seed': 15, 'per_device_train_batch_size': 32}. Best is trial 0 with value: 1.6604268240330917.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▆██
eval/f1,▁▇██
eval/loss,█▃▁▁
eval/runtime,▇█▁▃
eval/samples_per_second,▂▁█▆
eval/steps_per_second,▂▁█▆
train/epoch,▁▃▄▆██
train/global_step,▁▃▄▆██
train/grad_norm,▁
train/learning_rate,▁

0,1
eval/accuracy,0.80549
eval/f1,0.74068
eval/loss,0.49033
eval/runtime,44.4236
eval/samples_per_second,103.346
eval/steps_per_second,6.461
total_flos,7298024074334208.0
train/epoch,4.0
train/global_step,864.0
train/grad_norm,266065.25




Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.396889,0.833805,0.776156
2,No log,0.367109,0.849488,0.807781
3,0.368700,0.40529,0.848617,0.801887
4,0.368700,0.435825,0.853844,0.80753


[I 2025-08-19 11:11:08,847] Trial 2 finished with value: 1.6613740069011005 and parameters: {'learning_rate': 3.146708957495452e-05, 'num_train_epochs': 4, 'seed': 38, 'per_device_train_batch_size': 32}. Best is trial 2 with value: 1.6613740069011005.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▆▆█
eval/f1,▁█▇█
eval/loss,▄▁▅█
eval/runtime,▄▅▁█
eval/samples_per_second,▅▄█▁
eval/steps_per_second,▅▄█▁
train/epoch,▁▃▄▆██
train/global_step,▁▃▄▆██
train/grad_norm,▁
train/learning_rate,▁

0,1
eval/accuracy,0.85384
eval/f1,0.80753
eval/loss,0.43583
eval/runtime,44.7429
eval/samples_per_second,102.608
eval/steps_per_second,6.414
total_flos,7298024074334208.0
train/epoch,4.0
train/global_step,864.0
train/grad_norm,306688.09375




Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.5755,0.38858,0.837726,0.787686
2,0.3301,0.363995,0.84731,0.800796
3,0.2593,0.392503,0.850795,0.805532
4,0.2216,0.417219,0.849488,0.799573


[I 2025-08-19 11:41:03,267] Trial 3 finished with value: 1.649061109129843 and parameters: {'learning_rate': 8.59325140295354e-06, 'num_train_epochs': 4, 'seed': 8, 'per_device_train_batch_size': 8}. Best is trial 2 with value: 1.6613740069011005.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0,1
eval/accuracy,▁▆█▇
eval/f1,▁▆█▆
eval/loss,▄▁▅█
eval/runtime,▄▁█▃
eval/samples_per_second,▅█▁▆
eval/steps_per_second,▅█▁▆
train/epoch,▁▂▂▃▄▅▆▆▇██
train/global_step,▁▂▂▃▄▅▆▆▇██
train/grad_norm,▁▁▅█▆▅
train/learning_rate,█▇▅▄▂▁

0,1
eval/accuracy,0.84949
eval/f1,0.79957
eval/loss,0.41722
eval/runtime,44.7765
eval/samples_per_second,102.532
eval/steps_per_second,6.41
total_flos,7298024074334208.0
train/epoch,4.0
train/global_step,3444.0
train/grad_norm,762670.875




Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.5179,0.394717,0.842736,0.788943


