# Relevance Scoring of Comments to Video Content

This notebook computes a relevance score for each YouTube comment by measuring the semantic similarity between the comment and its corresponding video title. The workflow includes loading and merging datasets, generating sentence embeddings using a transformer model, calculating cosine similarity as a relevance score, and exporting the results for further analysis. The process enables quantitative assessment of how closely comments relate to video content.

### Import Required Libraries

This cell imports all necessary libraries for data manipulation, progress tracking, sentence embedding, and parallel processing, including pandas, tqdm, transformers, concurrent.futures, and numpy.

In [None]:
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from transformers import pipeline

# Register tqdm's progress-bar helpers (e.g. progress_apply) on pandas objects
tqdm.pandas()

### Load Sentence Transformer Model

This cell loads the pre-trained SentenceTransformer model ('all-MiniLM-L6-v2') for generating embeddings used in relevance scoring.

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')



### Load Comments and Video Metadata

This cell loads the English comments and video metadata from CSV files, preparing them for merging and relevance scoring.

In [None]:
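# Comments that remain after spam filtering, with an is_english flag
file_path = 'dataset/final_after_spam_eng.csv'
comment = pd.read_csv(file_path)

# Video metadata with cleaned and translated titles
file_path = 'dataset/videos_cleaned_translated.csv'
video = pd.read_csv(file_path)
video.head()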

Unnamed: 0,videoId,publishedAt,channelId,title,description,tags,defaultLanguage,defaultAudioLanguage,contentDuration,viewCount,likeCount,commentCount,topicCategories,cleanedText,is_english,translated
0,85806,2024-01-15 00:59:29+00:00,33807,Unlocking the Benefits of Face Masks for Skin ...,,,en-US,en-US,PT9S,72.0,0.0,0.0,"['https://en.wikipedia.org/wiki/Health', 'http...",unlocking benefit face mask skin health,1,unlocking benefit face mask skin health
1,30556,2023-10-27 19:32:16+00:00,46650,Get ready for the Magic💚💜🤍💝✨ #hydration #glowi...,,,,,PT45S,257.0,7.0,0.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...,get ready magic hydration glowingskin nomakeup...,1,get ready magic hydration glowingskin nomakeup...
2,51771,2024-09-28 01:23:22+00:00,14346,#trending #makeup #beautymakeup #yslbeauty #lu...,,,,en-US,PT19S,164.0,4.0,2.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...,trending makeup beautymakeup yslbeauty luxury ...,1,trending makeup beautymakeup yslbeauty luxury ...
3,45298,2023-07-13 15:19:28+00:00,50139,#shortvedio #balayage,,,,,PT14S,1207.0,20.0,0.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...,shortvedio balayage,0,shortvedio balayage
4,43611,2023-04-29 18:47:37+00:00,8143,Full Face of Merit Beauty 🤎 featuring new Flus...,,,,en,PT56S,8647.0,268.0,7.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...,full face merit beauty featuring new flush bal...,1,full face merit beauty featuring new flush bal...
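
The relevance scoring below only uses the video ID and the translated title from the video table, so one option is to load just those columns. The following is a minimal sketch of that idea, not the notebook's actual loading code: the in-memory CSV and its rows are made up for illustration, and only the column names `videoId` and `translated` come from the dataset shown above.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for dataset/videos_cleaned_translated.csv (made-up rows).
csv_text = """videoId,title,viewCount,translated
1,Some Title,10.0,some title
2,Other Title,20.0,other title
"""

# Load only the columns needed for scoring to reduce memory use.
video_small = pd.read_csv(io.StringIO(csv_text), usecols=["videoId", "translated"])
print(video_small.columns.tolist())
```

Loading fewer columns would also mean fewer overlapping column names at merge time, so fewer `_x`/`_y` suffixes to clean up afterwards.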


### Filter English Comments

This cell filters the loaded comments to retain only those labeled as English (`is_english == 1`).

In [None]:
eng = comment[comment['is_english'] == 1]

### Merge Comments with Video Metadata

This cell merges the filtered English comments with the corresponding video metadata based on the video ID.

In [None]:
merged_df = eng.merge(video, on='videoId', how='left')
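
Because this is a left merge, comments whose `videoId` has no row in the video table are kept with missing video fields. A hedged sketch of how that could be checked is shown below, using the `validate` and `indicator` parameters of `DataFrame.merge`. The two small DataFrames are made-up stand-ins, not the real data.

```python
import pandas as pd

# Made-up stand-ins for the filtered comments and the video metadata.
comments_demo = pd.DataFrame({"commentId": [10, 11, 12], "videoId": [1, 1, 3]})
videos_demo = pd.DataFrame({"videoId": [1, 2], "translated": ["face mask", "hair colour"]})

# validate="many_to_one" raises if videoId is duplicated on the video side;
# indicator=True adds a _merge column recording where each row came from.
checked = comments_demo.merge(
    videos_demo, on="videoId", how="left", validate="many_to_one", indicator=True
)

# Comments without a matching video row are marked "left_only".
n_unmatched = int((checked["_merge"] == "left_only").sum())
print(n_unmatched)
```

In the notebook, any such unmatched rows would reach the embedding step without a real title, so it may be worth deciding explicitly whether to drop or fill them before scoring.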

### Display DataFrame Info

This cell displays information about the merged DataFrame, including column types and non-null counts, to verify the merge and inspect the data structure.

In [None]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2008655 entries, 0 to 2008654
Data columns (total 31 columns):
 #   Column                Dtype  
---  ------                -----  
 0   commentId             int64  
 1   channelId_x           int64  
 2   videoId               int64  
 3   authorId              int64  
 4   textOriginal          object 
 5   parentCommentId       float64
 6   likeCount_x           int64  
 7   publishedAt_x         object 
 8   updatedAt             object 
 9   duplicatedFlag        int64  
 10  cleanedText_x         object 
 11  cleanedTextSentiment  object 
 12  regex_spam            int64  
 13  predicted_spam        float64
 14  isSpam                int64  
 15  is_english_x          int64  
 16  publishedAt_y         object 
 17  channelId_y           float64
 18  title                 object 
 19  description           object 
 20  tags                  object 
 21  defaultLanguage       object 
 22  defaultAudioLanguage  object 
 23  content

### Display DataFrame Shape

This cell displays the shape (number of rows and columns) of the merged DataFrame to confirm the size of the dataset after merging.

In [None]:
merged_df.shape

(2008655, 31)

### Compute Sentence Embeddings for Comments and Titles

This cell computes sentence embeddings for all comments and their corresponding video titles using the loaded SentenceTransformer model. Embeddings are computed in batches for efficiency.

In [None]:
# Precompute embeddings in batches
batch_size = 100 # adjust based on GPU memory
comments = merged_df['cleanedText_x'].astype(str).tolist()
titles   = merged_df['translated'].astype(str).tolist()

comment_embeddings = model.encode(
    comments,
    batch_size=batch_size,
    convert_to_tensor=True,
    show_progress_bar=True
)

title_embeddings = model.encode(
    titles,
    batch_size=batch_size,
    convert_to_tensor=True,
    show_progress_bar=True
)

Batches:   0%|          | 0/20087 [00:00<?, ?it/s]

Batches:   0%|          | 0/20087 [00:00<?, ?it/s]
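
Since many comments belong to the same video, the list of titles contains many repeats. One possible optimisation, sketched below under stated assumptions, is to encode each distinct title once and then expand the result back to one row per comment. `fake_encode` is a hypothetical stand-in for `model.encode` so that the example runs without downloading a model; the titles are made up.

```python
import numpy as np


def fake_encode(texts):
    """Hypothetical stand-in for model.encode: one deterministic 4-d vector per text."""
    return np.array(
        [[len(t), t.count("a"), t.count("e"), 1.0] for t in texts], dtype=np.float32
    )


titles_demo = ["face mask", "hair colour", "face mask", "face mask", "hair colour"]

# Distinct titles plus, for every original row, the index of its distinct title.
unique_titles, inverse = np.unique(titles_demo, return_inverse=True)

# Encode each distinct title once, then expand back to one row per comment.
unique_embeddings = fake_encode(unique_titles.tolist())
title_embeddings_demo = unique_embeddings[inverse]

print(len(unique_titles), title_embeddings_demo.shape)
```

With a real model the same indexing step would apply to the array or tensor returned by the encoder; whether the saving is worthwhile depends on how many comments each video has.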

### Compute Relevance Scores Using Cosine Similarity

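This cell computes the cosine similarity between each comment embedding and the embedding of its corresponding video title, and stores the result as the comment's relevance score in a new `relevance_score` column. Cosine similarity ranges from -1 to 1, with higher values indicating that the comment is semantically closer to the title.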

In [None]:
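# Compute diagonal cosine similarities (comment vs its title) in batches.
# util.cos_sim builds the full batch x batch matrix; only its diagonal
# (each comment against its own title) is kept.
batch_size = 1000  # rows per similarity batch; independent of the encoding batch size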
similarities = []

for i in tqdm(range(0, len(comments), batch_size), desc="Computing similarities"):
    batch_comments_embeddings = comment_embeddings[i:i+batch_size]
    batch_title_embeddings = title_embeddings[i:i+batch_size]
    batch_similarities = util.cos_sim(batch_comments_embeddings, batch_title_embeddings).diagonal()
    similarities.append(batch_similarities.cpu().numpy())

# Concatenate the batch results
similarities = np.concatenate(similarities)

# Save back to DataFrame
merged_df['relevance_score'] = similarities

Computing similarities: 100%|██████████| 2009/2009 [00:01<00:00, 1670.47it/s]
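
The diagonal of a cosine-similarity matrix is the row-by-row similarity between the two inputs, so it can also be computed directly without building the full matrix. The sketch below illustrates this with NumPy on random vectors. `rowwise_cosine` is a hypothetical helper written for this example, not part of `sentence_transformers`.

```python
import numpy as np


def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between a[i] and b[i] for every row i."""
    a_norm = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_norm = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return np.sum(a_norm * b_norm, axis=1)


# Random stand-ins for comment and title embeddings (5 pairs, 8 dimensions).
rng = np.random.default_rng(0)
comment_vecs = rng.normal(size=(5, 8)).astype(np.float32)
title_vecs = rng.normal(size=(5, 8)).astype(np.float32)

scores = rowwise_cosine(comment_vecs, title_vecs)

# Reference: diagonal of the full 5 x 5 similarity matrix.
full = (comment_vecs / np.linalg.norm(comment_vecs, axis=1, keepdims=True)) @ (
    title_vecs / np.linalg.norm(title_vecs, axis=1, keepdims=True)
).T
print(np.allclose(scores, np.diag(full), atol=1e-5))
```

The row-wise form needs memory proportional to the number of pairs rather than to the square of the batch size, which matters mainly when larger batches are used.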


### Preview DataFrame with Relevance Scores

This cell displays the first few rows of the DataFrame after adding the relevance scores, allowing inspection of the new column.

In [None]:
merged_df.head()

Unnamed: 0,commentId,channelId_x,videoId,authorId,textOriginal,parentCommentId,likeCount_x,publishedAt_x,updatedAt,duplicatedFlag,...,defaultAudioLanguage,contentDuration,viewCount,likeCount_y,commentCount,topicCategories,cleanedText_y,is_english_y,translated,relevance_score
0,3166243,41024,6217,26499,Good Information... Will definitely try it......,,0,2020-01-01 16:00:58+00:00,2020-01-01 16:00:58+00:00,0,...,hi,PT10M45S,161.0,4.0,1.0,"['https://en.wikipedia.org/wiki/Health', 'http...",five best anti ageing facial exercise look you...,1.0,five best anti ageing facial exercise look you...,0.120977
1,0,10004,86296,164837,Yes but I am charged $8 to cover your free shi...,1888757.0,0,2020-01-04 07:53:24+00:00,2020-01-04 07:53:24+00:00,0,...,,PT3M44S,197.0,6.0,3.0,"['https://en.wikipedia.org/wiki/Fashion', 'htt...",crystal tara avon fashion show,0.0,crystal tara avon fashion show,0.073981
2,1279533,5459,64449,882554,Very useful video,,2,2020-01-04 10:32:19+00:00,2020-01-04 10:32:19+00:00,0,...,,PT7M19S,25959.0,341.0,166.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...,tone n glow face wash|| crystal clear skin||,1.0,tone n glow face wash|| crystal clear skin||,0.116708
3,2543589,32215,89804,1777705,Osm three hair colour,,2,2020-01-04 13:07:46+00:00,2020-01-04 13:07:46+00:00,0,...,,PT8M56S,5556.0,112.0,7.0,"['https://en.wikipedia.org/wiki/Hobby', 'https...",hair colour home hair colour transformation ha...,1.0,hair colour home hair colour transformation ha...,0.576698
4,3857384,48408,32889,2850201,Freshness Level 9999999999😍,,0,2020-01-04 14:22:11+00:00,2020-01-04 14:22:11+00:00,0,...,,PT17M7S,85924.0,1006.0,126.0,['https://en.wikipedia.org/wiki/Lifestyle_(soc...,kalma lang po kayo | soft cut crease makeup ft...,1.0,kalma lang po kayo | soft cut crease makeup ft...,0.185953


### Drop Video Columns

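This cell drops the columns that came from the video metadata. They are no longer needed once the relevance scores have been computed, and removing them keeps the exported CSV file small.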

In [None]:
columns_to_drop = [
    "channelId_y",
    "title",
    "description",
    "tags",
    "defaultLanguage",
    "defaultAudioLanguage",
    "contentDuration",
    "viewCount",
    "likeCount_y",
    "commentCount",
    "topicCategories",
    "cleanedText_y",
    "is_english_y",
    "translated",
    "publishedAt_y"
]

merged_df = merged_df.drop(columns=columns_to_drop)

### Rename Columns for Consistency

This cell renames the remaining columns to remove the `_x` suffix that the merge added to column names shared by both tables, restoring the original comment column names.

In [None]:
merged_df.rename(
    # Strip only a trailing "_x"; leave any other "_x" inside a name untouched
    columns=lambda col: col[:-2] if col.endswith("_x") else col,
    inplace=True
)
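
When removing a merge suffix, it is safest to touch only the end of the name, because `str.replace` removes every occurrence of the pattern. The short sketch below shows the difference. `strip_suffix` is a hypothetical helper, and `max_xp_x` is a made-up column name chosen to expose the edge case; it does not occur in this dataset.

```python
def strip_suffix(name, suffix="_x"):
    """Remove suffix only when it ends the name; leave other occurrences alone."""
    return name[: -len(suffix)] if name.endswith(suffix) else name


# "max_xp_x" is a made-up name containing "_x" twice.
columns = ["likeCount_x", "videoId", "max_xp_x"]

suffix_only = [strip_suffix(c) for c in columns]
replace_all = [c.replace("_x", "") if c.endswith("_x") else c for c in columns]

print(suffix_only)
print(replace_all)
```

For the column names in this notebook both approaches give the same result, since none of them contains `_x` anywhere except at the end.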

### Display Final DataFrame Info

This cell displays information about the final DataFrame after dropping and renaming columns, verifying the structure before saving.

In [None]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2008655 entries, 0 to 2008654
Data columns (total 17 columns):
 #   Column                Dtype  
---  ------                -----  
 0   commentId             int64  
 1   channelId             int64  
 2   videoId               int64  
 3   authorId              int64  
 4   textOriginal          object 
 5   parentCommentId       float64
 6   likeCount             int64  
 7   publishedAt           object 
 8   updatedAt             object 
 9   duplicatedFlag        int64  
 10  cleanedText           object 
 11  cleanedTextSentiment  object 
 12  regex_spam            int64  
 13  predicted_spam        float64
 14  isSpam                int64  
 15  is_english            int64  
 16  relevance_score       float32
dtypes: float32(1), float64(2), int64(9), object(5)
memory usage: 252.9+ MB


### Save Final DataFrame to CSV

This cell saves the final DataFrame, which includes relevance scores and cleaned columns, to a CSV file for downstream analysis or reporting.

In [None]:
merged_df.to_csv('dataset/final_after_spam_eng_relevance.csv', index=False)
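
A quick way to gain confidence in an export like this is to read the file back and compare it with the in-memory DataFrame. The following is a small sketch of such a round-trip check on made-up data, written to a temporary directory rather than to the notebook's real output path.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-in for the final DataFrame, with a float32 score column.
demo = pd.DataFrame(
    {"commentId": [1, 2], "relevance_score": np.array([0.12, 0.58], dtype=np.float32)}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demo_relevance.csv")
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The shape should survive the round trip, and scores should match closely.
same_shape = reloaded.shape == demo.shape
max_diff = float(
    np.abs(reloaded["relevance_score"].to_numpy() - demo["relevance_score"].to_numpy()).max()
)
print(same_shape, max_diff < 1e-6)
```

Note that `read_csv` infers column types from the text, so a `float32` column will generally come back as `float64` unless a `dtype` is passed.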

### Display Final DataFrame Shape

This cell displays the shape of the final DataFrame, confirming the number of rows and columns after all processing steps.

In [None]:
merged_df.shape

(2008655, 17)