# Sentiment Analysis Pipeline for Comments

This notebook applies a multilingual BERT-based sentiment model to a large dataset of YouTube comments. The workflow loads the model, scores comments in batches, saves the results to CSV, and inspects the output. During scoring the input file is read in chunks, so it never has to be held in memory all at once. The model predicts a 1 to 5 star rating for each comment, and those five class probabilities are collapsed into negative, neutral, and positive scores.

### Import Required Libraries

This cell imports all necessary libraries for sentiment analysis, including PyTorch, HuggingFace transformers, pandas, and tqdm for progress tracking.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import pandas as pd
from tqdm import tqdm

### Load Sentiment Model and Define Scoring Functions

This cell loads a multilingual BERT-based sentiment model, sets up the tokenizer and device, and defines functions for batch sentiment scoring and chunked processing of large datasets.

In [None]:
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device)
model.eval()

# Function to compute sentiment scores in a batch
def get_sentiment_scores_batch(texts, batch_size=1000):
    all_scores = []

    for start in range(0, len(texts), batch_size):
        batch = texts[start:start+batch_size]

        # Tokenize on the CPU, pad to the longest text in the batch, then move the tensors to the selected device
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)

        # Forward pass
        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

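        # Collapse the five star-rating classes into three sentiment scores
        for p in probs.cpu().numpy():
            # p holds probabilities for [1 star, 2 stars, 3 stars, 4 stars, 5 stars]
            negative = p[0] + p[1]
            neutral = p[2]
            positive = p[3] + p[4]

            # Softmax output already sums to 1; renormalizing only guards against rounding drift
            total = negative + neutral + positive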
            all_scores.append({
                "negative": negative / total,
                "neutral": neutral / total,
                "positive": positive / total
            })

    return all_scores


# Batch processing for large files
def process_in_chunks(input_file, output_file, chunk_size=10000, batch_size=300, max_rows=None):
    reader = pd.read_csv(input_file, chunksize=chunk_size)
    first = True
    processed_rows = 0

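    for chunk in tqdm(reader, desc="Processing chunks"):
        if max_rows is not None:
            remaining = max_rows - processed_rows
            if remaining <= 0:
                break
            # Trim the final chunk so that max_rows is honored exactly
            chunk = chunk.head(remaining)

        # The tokenizer only accepts strings, so missing comments are scored as empty text
        texts = chunk["cleanedText"].fillna("").astype(str).tolist()
        scores = get_sentiment_scores_batch(texts, batch_size=batch_size)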

        scores_df = pd.DataFrame(scores)
        chunk = chunk.reset_index(drop=True).join(scores_df)

        # Append to output
        if first:
            chunk.to_csv(output_file, index=False, mode="w")
            first = False
        else:
            chunk.to_csv(output_file, index=False, mode="a", header=False)

        processed_rows += len(chunk)
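
The star-to-sentiment mapping inside `get_sentiment_scores_batch` can be checked in isolation, without loading the model. The sketch below is a minimal illustration only: `collapse_star_probs` is a hypothetical helper name, and the input probabilities are made up rather than taken from the model.

```python
def collapse_star_probs(star_probs):
    """Collapse [1-star, ..., 5-star] probabilities into three sentiment scores."""
    if len(star_probs) != 5:
        raise ValueError("expected five class probabilities")

    negative = star_probs[0] + star_probs[1]   # 1 and 2 stars
    neutral = star_probs[2]                    # 3 stars
    positive = star_probs[3] + star_probs[4]   # 4 and 5 stars

    # Same renormalization as in the notebook; a no-op when the input sums to 1
    total = negative + neutral + positive
    return {
        "negative": negative / total,
        "neutral": neutral / total,
        "positive": positive / total,
    }


# Hypothetical softmax output for one comment (not real model output)
example = [0.05, 0.10, 0.15, 0.30, 0.40]
scores = collapse_star_probs(example)
print({name: round(value, 3) for name, value in scores.items()})
# {'negative': 0.15, 'neutral': 0.15, 'positive': 0.7}
```

Because the three scores partition the five classes, they always sum to 1 whenever the input does.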

### Run Sentiment Analysis and Save Results

This cell processes the input CSV file in chunks, computes sentiment scores for each comment using the model, and saves the results to a new CSV file.

In [None]:
process_in_chunks("dataset/final_after_spam_eng_relevance.csv", "dataset/final_after_spam_eng_relevance_sentiment.csv")

Processing chunks: 100it [1:38:51, 59.31s/it]
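
The progress bar gives enough information for a rough throughput estimate. The arithmetic below is a sketch that assumes every one of the 100 chunks held the default `chunk_size` of 10,000 rows; the timing figures are copied from the output above.

```python
# Figures read off the progress bar
num_chunks = 100
seconds_per_chunk = 59.31
chunk_size = 10_000  # assumption: every chunk was full

total_seconds = num_chunks * seconds_per_chunk
total_rows = num_chunks * chunk_size
rows_per_second = total_rows / total_seconds

hours, remainder = divmod(round(total_seconds), 3600)
minutes, seconds = divmod(remainder, 60)

print(f"Total time: {hours}:{minutes:02d}:{seconds:02d}")   # Total time: 1:38:51
print(f"Throughput: {rows_per_second:.1f} comments/s")      # Throughput: 168.6 comments/s
```

The reconstructed total matches the elapsed time shown by tqdm, which suggests the per-chunk average is a fair basis for estimating longer runs on the same hardware.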


### Load Sentiment-Scored Comments

This cell loads the CSV file containing comments with computed sentiment scores and displays the first few rows for inspection.

In [None]:
file_path = 'dataset/final_after_spam_eng_relevance_sentiment.csv'
comment = pd.read_csv(file_path)
comment.head()

| Unnamed: 0 | commentId | channelId | videoId | authorId | textOriginal | parentCommentId | likeCount | publishedAt | updatedAt | duplicatedFlag | cleanedText | cleanedTextSentiment | regex_spam | predicted_spam | isSpam | is_english | relevance_score | negative | neutral | positive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3166243 | 41024 | 6217 | 26499 | Good Information... Will definitely try it...... | | 0 | 2020-01-01 16:00:58+00:00 | 2020-01-01 16:00:58+00:00 | 0 | good information definitely try thanks | good information definitely try thanks : smili... | 0 | 0.0 | 0 | 1 | 0.120977 | 0.019712 | 0.112812 | 0.867477 |
| 1 | 0 | 10004 | 86296 | 164837 | Yes but I am charged \$8 to cover your free shi... | 1888757.0 | 0 | 2020-01-04 07:53:24+00:00 | 2020-01-04 07:53:24+00:00 | 0 | yes charged \$ 8 cover free shipping not rep wo... | yes charged \$ 8 cover free shipping not rep wo... | 0 | 0.0 | 0 | 1 | 0.073981 | 0.855613 | 0.094974 | 0.049413 |
| 2 | 1279533 | 5459 | 64449 | 882554 | Very useful video | | 2 | 2020-01-04 10:32:19+00:00 | 2020-01-04 10:32:19+00:00 | 0 | useful video | useful video | 0 | 0.0 | 0 | 1 | 0.116708 | 0.032399 | 0.222604 | 0.744997 |
| 3 | 2543589 | 32215 | 89804 | 1777705 | Osm three hair colour | | 2 | 2020-01-04 13:07:46+00:00 | 2020-01-04 13:07:46+00:00 | 0 | osm three hair colour | osm three hair colour | 0 | 0.0 | 0 | 1 | 0.576698 | 0.190975 | 0.341314 | 0.467712 |
| 4 | 3857384 | 48408 | 32889 | 2850201 | Freshness Level 9999999999😍 | | 0 | 2020-01-04 14:22:11+00:00 | 2020-01-04 14:22:11+00:00 | 0 | freshness level 99 | freshness level 99 : smiling_face_with_heart-e... | 0 | 0.0 | 0 | 1 | 0.185953 | 0.095852 | 0.111687 | 0.792461 |
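
When a single label per comment is more convenient than three scores, one option is to take the highest-scoring column. The sketch below is an illustration, not part of the original pipeline: it rebuilds the three score columns from the five rows shown above, and the `label` and `confidence` column names are hypothetical.

```python
import pandas as pd

# Score columns copied from the five rows displayed above
scores = pd.DataFrame({
    "negative": [0.019712, 0.855613, 0.032399, 0.190975, 0.095852],
    "neutral":  [0.112812, 0.094974, 0.222604, 0.341314, 0.111687],
    "positive": [0.867477, 0.049413, 0.744997, 0.467712, 0.792461],
})

score_columns = ["negative", "neutral", "positive"]

# idxmax(axis=1) returns the name of the column holding each row's largest value
scores["label"] = scores[score_columns].idxmax(axis=1)
# The winning score itself gives a rough sense of how decisive the label is
scores["confidence"] = scores[score_columns].max(axis=1)

print(scores[["label", "confidence"]])
```

Row 3 ("Osm three hair colour") is labeled positive with a top score below 0.5, so a threshold on `confidence` may be worth considering before treating such labels as reliable.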


### Display Shape of Sentiment-Scored Data

This cell displays the shape (number of rows and columns) of the DataFrame containing sentiment-scored comments, confirming the size of the processed dataset.

In [None]:
comment.shape

(1000000, 20)
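
Beyond the shape, a few cheap checks can confirm that the score columns are well formed. The following is a sketch under stated assumptions: `check_sentiment_scores` is a hypothetical helper, the tolerance is an arbitrary choice, and the sample frame is illustrative (its second row is deliberately invalid).

```python
import pandas as pd

SCORE_COLUMNS = ["negative", "neutral", "positive"]


def check_sentiment_scores(df, tol=1e-4):
    """Count rows whose sentiment scores are missing, out of range, or do not sum to 1."""
    missing = [col for col in SCORE_COLUMNS if col not in df.columns]
    if missing:
        raise KeyError(f"missing score columns: {missing}")

    scores = df[SCORE_COLUMNS]
    return {
        "rows": len(df),
        "rows_with_missing_scores": int(scores.isna().any(axis=1).sum()),
        "rows_out_of_range": int(((scores < 0) | (scores > 1)).any(axis=1).sum()),
        "rows_not_summing_to_one": int(((scores.sum(axis=1) - 1).abs() > tol).sum()),
    }


# Illustrative data: the first row is valid, the second sums to 1.5 on purpose
sample = pd.DataFrame({
    "negative": [0.019712, 0.5],
    "neutral":  [0.112812, 0.5],
    "positive": [0.867477, 0.5],
})
print(check_sentiment_scores(sample))
```

Calling `check_sentiment_scores(comment)` on the loaded DataFrame would report the same counts for the full dataset.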