In [1]:
## 1. import relevant libraries

import pandas as pd
import os
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from transformers import pipeline

In [2]:
## 2. Create a dataframe from table

df = pd.read_csv('YoutubeCommentsDataSet.csv')

# Confirm data types
print(df.dtypes)
print(df.head())

Comment      object
Sentiment    object
dtype: object
                                             Comment Sentiment
0  lets not forget that apple pay in 2014 require...   neutral
1  here in nz 50 of retailers don’t even have con...  negative
2  i will forever acknowledge this channel with t...  positive
3  whenever i go to a place that doesn’t take app...  negative
4  apple pay is so convenient secure and easy to ...  positive


This script was originally developed to provide insights into the free-text fields in user surveys. I wrote it intending to be a "plug and play" script that could be used for any text-based data set with minimal editing. This dataset was downloaded from Kaggle and already has sentiment tags. I'm curious to see whether my sentiment analysis script matches the original. Using pre-trained cloud models saves on storage and training time, but another option would be to train and store our own model. For the original work, we've decided Bert is good enough and worth the tradeoff. If there were additional text-based fields, steps 3-5 would be repeated for each. In this case, there is only one.

In [4]:
## 3.remove Nan fields to exclude from topic creation
##This step is unnecessary in the current dataset, which has been pre-cleaned but it won't always be clean data so good to keep it in
df_clean = df[['Comment', 'Sentiment']].dropna()

#cast text column as string to ensure that there aren't a few random floats or emojis thrown into the mix
df_clean['Comment'] = df_clean['Comment'].astype(str)

##check filtered dataframe
df_clean

Unnamed: 0,Comment,Sentiment
0,lets not forget that apple pay in 2014 require...,neutral
1,here in nz 50 of retailers don’t even have con...,negative
2,i will forever acknowledge this channel with t...,positive
3,whenever i go to a place that doesn’t take app...,negative
4,apple pay is so convenient secure and easy to ...,positive
...,...,...
18403,i really like the point about engineering tool...,positive
18404,i’ve just started exploring this field and thi...,positive
18405,excelente video con una pregunta filosófica pr...,neutral
18406,hey daniel just discovered your channel a coup...,positive


In [5]:
# 4. Convert to list for vectorization
docs = df_clean['Comment'].to_list()

In [6]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Initialize BERTopic Model
vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,
    min_topic_size=50,
    verbose=True,
    n_gram_range=(1, 2),
    top_n_words=3
)

# Initialize Sentiment Analysis Model
sentiment_classifier = pipeline(
    'sentiment-analysis',
    model='cardiffnlp/twitter-roberta-base-sentiment-latest',
    truncation=True,
    max_length=512
)

# Perform topic classification and sentiment analysis in one pass
topics, probabilities = topic_model.fit_transform(docs)
sentiment_results = sentiment_classifier(docs)

# Collect data for the final DataFrame
topic_numbers = topics  # Topic classification results
sentiment_labels = [res['label'] for res in sentiment_results]
sentiment_scores = [res['score'] for res in sentiment_results]

# Create DataFrame for topic information
df_topics = pd.DataFrame({'Topic': list(topic_model.get_topics().keys()),
                          'topic_words': list(topic_model.get_topics().values())})
df_topic_info = pd.DataFrame(topic_model.get_topic_info()).iloc[1:].reset_index(drop=True)
df_topic_info['topic_words'] = df_topics['topic_words']

# Add topic and sentiment results to the original DataFrame
df_clean['Topic'] = topic_numbers
df_clean['sentiment_new'] = sentiment_labels
df_clean['sentiment_score_new'] = sentiment_scores

# Merge with topic information
df_topic = df_clean.merge(df_topic_info, how='left', on='Topic')

# Drop coordinates if they exist
columns_to_drop = ['x_coordinate', 'y_coordinate']
df_topic = df_topic.drop(columns=[col for col in columns_to_drop if col in df_topic.columns], axis=1)

# Filter out outlier topics (-1)
df_topic = df_topic[df_topic['Topic'] != -1]

# Optional: Filter by topic size
min_topic_size = 50
##this number should be adjusted as the size of the dataset increases
large_topics = df_topic_info[df_topic_info['Count'] >= min_topic_size]['Topic']
df = df_topic[df_topic['Topic'].isin(large_topics)]

# Rename columns for clarity
##in the case of multiple text columns, a column name signifier would also be added
df.rename(columns={
    'Comment': 'full_comment_text',
    'Sentiment': 'sentiment_original',
    'Topic': 'topic_number',
    'Name': 'topic_name',
    'Representation': 'representation',
    'Representative_Docs': 'docs',
    'topic_words': 'words',
    'Count': 'topic_count'
}, inplace=True)



# Final DataFrame
df

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0
2025-04-09 11:12:40,894 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/574 [00:00<?, ?it/s]

2025-04-09 11:13:47,946 - BERTopic - Embedding - Completed ✓
2025-04-09 11:13:47,951 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-09 11:14:02,056 - BERTopic - Dimensionality - Completed ✓
2025-04-09 11:14:02,060 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-09 11:14:05,333 - BERTopic - Cluster - Completed ✓
2025-04-09 11:14:05,344 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-04-09 11:14:05,622 - BERTopic - Representation - Completed ✓


Unnamed: 0,full_comment_text,sentiment_original,topic_number,sentiment_new,sentiment_score_new,topic_count,topic_name,representation,docs,words
0,lets not forget that apple pay in 2014 require...,neutral,9,neutral,0.556387,381.0,9_apple_iphone_pro,"[apple, iphone, pro]",[order of event prediction 1 intro what apple ...,"[(shes, 0.058790930161863104), (makeup, 0.0385..."
1,here in nz 50 of retailers don’t even have con...,negative,9,negative,0.770144,381.0,9_apple_iphone_pro,"[apple, iphone, pro]",[order of event prediction 1 intro what apple ...,"[(shes, 0.058790930161863104), (makeup, 0.0385..."
3,whenever i go to a place that doesn’t take app...,negative,9,negative,0.722036,381.0,9_apple_iphone_pro,"[apple, iphone, pro]",[order of event prediction 1 intro what apple ...,"[(shes, 0.058790930161863104), (makeup, 0.0385..."
4,apple pay is so convenient secure and easy to ...,positive,9,positive,0.946898,381.0,9_apple_iphone_pro,"[apple, iphone, pro]",[order of event prediction 1 intro what apple ...,"[(shes, 0.058790930161863104), (makeup, 0.0385..."
5,we’ve been hounding my bank to adopt apple pay...,neutral,9,neutral,0.477529,381.0,9_apple_iphone_pro,"[apple, iphone, pro]",[order of event prediction 1 intro what apple ...,"[(shes, 0.058790930161863104), (makeup, 0.0385..."
...,...,...,...,...,...,...,...,...,...,...
18350,awesome succinct very effective for a newbie t...,positive,14,positive,0.960398,253.0,14_tutorial_thank_explained,"[tutorial, thank, explained]","[i still want to see that tutorial, i could cr...","[(data, 0.14780327371328625), (science, 0.0753..."
18351,well explanation love it,positive,14,positive,0.890718,253.0,14_tutorial_thank_explained,"[tutorial, thank, explained]","[i still want to see that tutorial, i could cr...","[(data, 0.14780327371328625), (science, 0.0753..."
18353,thanks for sharing what is machine learning us...,positive,26,positive,0.966756,145.0,26_learning_machine_ml,"[learning, machine, ml]",[please make another more videos on machine le...,"[(life, 0.04247010384916758), (reading, 0.0358..."
18354,i am currently enrolled in a msc machine learn...,positive,26,positive,0.689571,145.0,26_learning_machine_ml,"[learning, machine, ml]",[please make another more videos on machine le...,"[(life, 0.04247010384916758), (reading, 0.0358..."


In the original version of this script, developed for analysis of survey responses, the topic modeling process was done on each column separately and then pd.merge was used to merge them all back to the original for Tableau reporting purposes. That step is unnecessary here but code would be:

df_final= pd.merge(
	pd.merge(
		pd.merge(
			df_clean,
			df1, on=['join_key'],
			how='left'
	    ),
		df2,on=['join_key'],
		how='left'
	),
    df3,on=['join_key'],
	how='left'
)

In [8]:
df.to_csv('youtube_comments_for_reporting.csv', index=False)