# Semi-Supervised Spam Detection for YouTube Comments

This notebook implements a semi-supervised learning pipeline for spam detection in YouTube comments. It includes data cleaning, spammer identification, model training with transformers, prediction, uncertainty sampling, and iterative labeling.

### Install Required Package: emoji

This cell installs the `emoji` package, which is used for handling and removing emojis from text data during preprocessing.

In [None]:
# Install the emoji package for text preprocessing
!pip install emoji

### Import Libraries for Data Processing, Modeling, and Utilities

This cell imports the libraries used throughout the notebook: pandas for dataframes, HuggingFace `transformers` and `datasets` for NLP modeling, scikit-learn for metrics and data splitting, tqdm for progress bars, torch for deep learning, numpy for numerical operations, `os` for environment variables, and `emoji` and `re` for text cleaning.

In [None]:
# Import libraries for data processing, model training, evaluation, and text cleaning
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
import torch
import numpy as np
import os
import emoji
import re


### Load and Inspect Main Comments Dataset

This cell loads the main comments dataset (with duplicate tags) from a CSV file and displays the first few rows to verify successful loading and inspect the data structure.

In [None]:
# Load the main comments dataset and display the first few rows for inspection
file_path = 'dataset/comments_all_tagged_text_duplicates.csv'
data = pd.read_csv(file_path)
data.head()

### Calculate Duplication Statistics per Author and Video

This cell groups the data by author and video, calculating the total number of duplicate comments and the total number of unique comments for each (author, video) pair. It then computes the duplication ratio for each pair and finds the maximum duplication ratio for each author, which is used to identify potential spammers.

In [None]:
# Group by author and video to calculate the number of duplicate comments and total comments, 
# then compute duplication ratio and find the maximum duplication ratio for each author to identify potential spammers

# Step 1: Group by author and video to calculate duplication statistics
author_spam = (
    data.groupby(['authorId', 'videoId'])
    .agg(
        duplicated_sum=('duplicatedFlag', 'sum'), # total duplicated comments (flag=1) per author per video
        total_comments=('commentId', 'nunique'), # number of unique comments per author per video
    )
    .reset_index()
)

# Step 2: Calculate duplication ratio for each (author, video) pair
# dup_ratio = proportion of comments that are duplicates
author_spam['dup_ratio'] = author_spam['duplicated_sum'] / author_spam['total_comments']

# Step 3: For each author, take the maximum duplication ratio across all videos
# This shows how "spammy" they are in their worst-case video
author = (
    author_spam.groupby('authorId')
    .agg(max_dup_ratio=('dup_ratio', 'max')) # max duplication ratio per author
    .reset_index()
)


### Identify Spammers Based on Duplication Ratio

This cell selects authors whose maximum duplication ratio across all videos is greater than or equal to 0.7, flagging them as potential spammers.

In [None]:
# Select authors with a maximum duplication ratio >= 0.7 as potential spammers
spammer_list = author.loc[author['max_dup_ratio'] >= 0.7, 'authorId']
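
The duplication-ratio logic of the two cells above can be traced on a handful of rows. The following is a dependency-free sketch, not the notebook's pandas code: the rows, the ids (`a1`, `v1`, ...) and the variable names are made up for illustration, and only the 0.7 threshold is taken from the notebook.

```python
from collections import defaultdict

# Toy rows: (authorId, videoId, commentId, duplicatedFlag); all values are made up.
rows = [
    ("a1", "v1", 1, 1), ("a1", "v1", 2, 1), ("a1", "v1", 3, 0),
    ("a1", "v2", 4, 0),
    ("a2", "v1", 5, 0), ("a2", "v1", 6, 0),
    ("a3", "v3", 7, 1), ("a3", "v3", 8, 1), ("a3", "v3", 9, 1), ("a3", "v3", 10, 0),
]

dup_sum = defaultdict(int)      # duplicated comments per (author, video)
comment_ids = defaultdict(set)  # unique comment ids per (author, video)
for author_id, video_id, comment_id, dup_flag in rows:
    dup_sum[(author_id, video_id)] += dup_flag
    comment_ids[(author_id, video_id)].add(comment_id)

# Worst-case duplication ratio per author across all of their videos.
max_dup_ratio = defaultdict(float)
for key, ids in comment_ids.items():
    ratio = dup_sum[key] / len(ids)
    max_dup_ratio[key[0]] = max(max_dup_ratio[key[0]], ratio)

THRESHOLD = 0.7  # same cut-off as the notebook
spammers = sorted(a for a, r in max_dup_ratio.items() if r >= THRESHOLD)
print(dict(max_dup_ratio))
print(spammers)
```

Because the maximum is taken over videos, a single heavily duplicated thread is enough to flag an author, even if the rest of their comments are unique.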

### Save Spammer List to CSV

This cell exports the list of identified spammers (authors with high duplication ratio) to a CSV file for future reference or exclusion from further analysis.

In [None]:
# Save the list of identified spammers to a CSV file for future use
spammer_list.to_csv("author_spammer_list.csv", index=False)

### Filter Out Spammers from Main Dataset

This cell removes all comments made by identified spammers from the main dataset, resulting in a cleaner dataset for model training and evaluation.

In [None]:
# Remove all comments from authors identified as spammers to clean the dataset
data_filtered = data[~data['authorId'].isin(spammer_list)].copy()  # copy so new columns can be added without SettingWithCopyWarning

### Filter Out Spam Comments Using Custom Function

This cell defines a function `is_spam` to identify and filter out useless comments based on several criteria: missing values, presence of links, very short comments, emoji-only comments, and comments with more emojis than words. The function is applied to the dataset, and only comments deemed useful are retained for further analysis. The cell also prints the number of original and filtered comments for reference.

In [None]:
# Function to filter useless comments
def is_spam(comment):
    # Remove NaN
    if pd.isna(comment):
        return 1

    # Remove links
    if re.search(r"http\S+|www\S+", comment):
        return 1

    # Remove very short comments (<3 words)
    words = comment.split()
    if len(words) < 3:
        return 1

    # Remove emoji-only comments
    if comment.strip() and emoji.replace_emoji(comment, replace='').strip() == '':
        return 1

    # Count emojis vs words
    emojis = [ch for ch in comment if ch in emoji.EMOJI_DATA]
    if len(emojis) > len(words):
        return 1


    return 0

# Apply filter
data_filtered.loc[:, "regex_spam"] = data_filtered["textOriginal"].apply(is_spam)

# Keep only useful comments
filtered_data = data_filtered[data_filtered['regex_spam'] == 0]

print(f"Original comments: {len(data_filtered)}, Useful comments: {len(filtered_data)}")


Original comments: 4724755, Useful comments: 3304925
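
To see how the two text-only rules interact, the sketch below re-implements just the link and word-count checks with the standard library. `text_rules` is a hypothetical helper, not part of the notebook, the emoji rules are left out on purpose, and `www.example.com` is a placeholder address; the other example strings appear elsewhere in this notebook's outputs.

```python
import re

LINK_RE = re.compile(r"http\S+|www\S+")  # same pattern as in is_spam

def text_rules(comment: str) -> str:
    """Return which of the two text-only rules fires, or 'kept'."""
    if LINK_RE.search(comment):
        return "link"
    if len(comment.split()) < 3:
        return "short"
    return "kept"

examples = [
    "Very useful video",
    "Youravon.com/cspurlin",
    "check www.example.com for more info",
    "First",
]
results = {text: text_rules(text) for text in examples}
for text, rule in results.items():
    print(f"{rule:>5}  {text}")
```

A bare domain such as `Youravon.com/cspurlin` does not match the link pattern (no `http` or `www` prefix); it is only removed here because it is shorter than three words. The same address inside a longer sentence would pass both rules.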

### Save Cleaned Comments Dataset to CSV

This cell saves `data_filtered` to a CSV file: the dataset with comments from flagged spammer authors removed and the `regex_spam` flag column added. Rows flagged by `is_spam` are still present in this file (the reduced frame is `filtered_data`), so downstream steps can select useful comments with `regex_spam == 0`.

In [None]:
data_filtered.to_csv('dataset/comments_all_tagged_text_duplicates_regex.csv', index=False)

### Load Manually Labeled Data for Supervised Training (First Round)

This cell loads a manually labeled dataset from a CSV file. This labeled data is used for supervised training and evaluation of the spam detection model.

In [None]:
# Load the manually labeled comments dataset for supervised training and display the first few rows
file_path = '/content/drive/MyDrive/Colab Notebooks/labeled.csv'
manual = pd.read_csv(file_path)
manual.head()

Unnamed: 0,commentId,textOriginal,ManualLabelSpam,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,3166243,Good Information... Will definitely try it......,0.0,,,,,,,,,,,,,
1,1888757,"Crystal, is it true that beginning Campaign 3,...",0.0,,,,,,,,,,,,,
2,0,Yes but I am charged $8 to cover your free shi...,0.0,,,,,,,,,,,,,
3,1,Youravon.com/cspurlin,1.0,,,,,,,,,,,,,
4,1279533,Very useful video,0.0,,,,,,,,,,,,,


### Check Shape of Labeled Data

This cell displays the shape (number of rows and columns) of the manually labeled dataset to confirm successful loading and understand the dataset size.

In [None]:
# Display the shape of the manually labeled dataset to check its size
manual.shape

(6000, 16)

### Count Number of Spam Labels in Labeled Data

This cell counts and displays the number of comments labeled as spam (ManualLabelSpam == 1) in the manually labeled dataset.

In [None]:
# Count the number of comments labeled as spam in the manually labeled dataset
len(manual[manual['ManualLabelSpam']==1])

776

### Set Model Name and Device for Training

This cell specifies the HuggingFace transformer model to use (DistilBERT) and sets the device (GPU if available, otherwise CPU) for model training and inference.

In [None]:
# Set the transformer model (DistilBERT) and select device (GPU if available, else CPU) for training and inference
MODEL_NAME = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Split Labeled Data into Training and Validation Sets

This cell drops any rows with missing labels, then splits the labeled data into training and validation sets for supervised model training and evaluation.

In [None]:
# Drop rows with missing labels and split the labeled data into training and validation sets for model training and evaluation
manual = manual.dropna(subset=['ManualLabelSpam'])
train_texts, val_texts, train_labels, val_labels = train_test_split(
    manual['textOriginal'].astype(str).tolist(),
    manual['ManualLabelSpam'].astype(int).tolist(),
    test_size=0.2,
    random_state=42
)

### Define Evaluation Metrics Function

This cell defines a function to compute evaluation metrics (accuracy, precision, recall, F1) for the spam classifier using scikit-learn, which will be used during model training and validation.

In [None]:
# Define a function to compute accuracy, precision, recall, and F1 score for model evaluation using scikit-learn
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(-1)

    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", pos_label=1  # spam = 1
    )

    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
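
The metric definitions can be checked by hand on a tiny batch. This is a minimal sketch with made-up logits and labels; it mirrors the `argmax` step and the binary metrics (spam = 1 as the positive class) without scikit-learn.

```python
# Toy logits for six comments: [score_not_spam, score_spam]; values are made up.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [0.0, 2.0], [0.9, 0.2]]
labels = [0, 1, 1, 1, 0, 0]

# argmax over the last axis: predict class 1 only when the spam score is strictly larger.
preds = [int(spam > ham) for ham, spam in logits]

tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # spam caught
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))  # false alarms
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # spam missed
tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))  # clean comments kept

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(preds, accuracy, precision, recall, f1)
```

With an imbalanced label distribution, accuracy alone can look good while recall on the spam class stays low, which is why the function reports all four values.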

### Tokenize Text Data and Prepare Datasets

This cell disables Weights & Biases logging, loads the HuggingFace tokenizer, tokenizes the training and validation texts, and prepares the datasets for model training and evaluation.

In [None]:
os.environ["WANDB_DISABLED"] = "true"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=64)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=64)

# Convert encodings and labels to datasets.Dataset
train_dataset = Dataset.from_dict(train_encodings)
train_dataset = train_dataset.add_column("labels", train_labels)

val_dataset = Dataset.from_dict(val_encodings)
val_dataset = val_dataset.add_column("labels", val_labels)
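
The effect of `truncation=True`, `padding=True` and `max_length` can be illustrated with a toy encoder. The sketch below is an assumption-laden stand-in: it splits on whitespace instead of using DistilBERT's WordPiece vocabulary, adds no special tokens, and uses a `max_length` of 4 instead of 64 so the result is easy to read. `encode_batch` and the key `input_tokens` are illustrative names.

```python
MAX_LENGTH = 4   # the notebook uses 64; a small value keeps the output readable
PAD = "[PAD]"

def encode_batch(texts, max_length=MAX_LENGTH):
    """Whitespace-tokenize, truncate to max_length, pad to the longest row."""
    tokenized = [text.split()[:max_length] for text in texts]  # truncation=True
    width = max(len(tokens) for tokens in tokenized)           # padding=True pads to the longest row
    input_tokens, attention_mask = [], []
    for tokens in tokenized:
        n_pad = width - len(tokens)
        input_tokens.append(tokens + [PAD] * n_pad)
        attention_mask.append([1] * len(tokens) + [0] * n_pad)  # 0 marks padding
    return {"input_tokens": input_tokens, "attention_mask": attention_mask}

batch = encode_batch(["Very useful video", "this make up looks really good on you"])
for tokens, mask in zip(batch["input_tokens"], batch["attention_mask"]):
    print(tokens, mask)
```

Anything beyond `max_length` tokens is discarded, so with a limit of 64 the classifier only sees the beginning of long comments.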

### Define Custom PyTorch Dataset for Tokenized Data

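This cell defines a custom `CommentDataset` class that wraps tokenized encodings and labels as a PyTorch `torch.utils.data.Dataset`. The `Trainer` below is given the `datasets.Dataset` objects built in the previous cell, so this class is an optional alternative and is not used in the rest of the notebook. It subclasses `torch.utils.data.Dataset` explicitly because the bare name `Dataset` was imported from HuggingFace `datasets`.

In [None]:
class CommentDataset(torch.utils.data.Dataset):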
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx],dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)


### Initialize Model, Set Training Arguments, and Train

This cell initializes the transformer model for sequence classification, sets up the HuggingFace `TrainingArguments`, creates the `Trainer` object, and starts the training process on the prepared datasets.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Colab Notebooks/SpamResult",
    num_train_epochs=2,
    per_device_train_batch_size=200,
    per_device_eval_batch_size=200,
    save_strategy="epoch",                    # save only at end of each epoch
    logging_strategy="epoch",                 # log only at end of each epoch
    logging_dir="./logs",
    fp16=torch.cuda.is_available(),           # mixed precision (much faster on GPU); disabled on the CPU fallback
    dataloader_num_workers=2,                 # faster data loading
    report_to="none"                          # disable W&B/TensorBoard auto-logging
)

# ✅ Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# ✅ Train
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
13,0.4579
26,0.3134


TrainOutput(global_step=26, training_loss=0.38562440872192383, metrics={'train_runtime': 27.4828, 'train_samples_per_second': 184.77, 'train_steps_per_second': 0.946, 'total_flos': 84083681296896.0, 'train_loss': 0.38562440872192383, 'epoch': 2.0})
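
The logged step count can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is configured in the `TrainingArguments` above) and estimates the training-set size from the logged throughput and runtime. `total_steps` and `n_train_estimate` are illustrative names, and the estimate is an inference, not a value printed by the notebook.

```python
import math

def total_steps(n_train, batch_size, epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs

# samples_per_second * runtime = samples processed over all epochs,
# so dividing by the number of epochs estimates the training-set size.
n_train_estimate = round(184.77 * 27.4828 / 2)

steps = total_steps(n_train_estimate, batch_size=200, epochs=2)
print(n_train_estimate, steps)
```

The result agrees with `global_step=26` in the output above. An estimate of roughly 2,500 training rows is well below 80% of 6,000, which would be consistent with many rows having no label and being removed by `dropna` before the split.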

### Evaluate Trained Model on Validation Set

This cell evaluates the trained spam detection model on the validation set and reports the evaluation metrics (accuracy, precision, recall, F1).

In [None]:
# Evaluate the trained model on the validation set and report evaluation metrics
trainer.evaluate()

{'eval_loss': 0.28368812799453735,
 'eval_accuracy': 0.8881889763779528,
 'eval_precision': 0.9081632653061225,
 'eval_recall': 0.5894039735099338,
 'eval_f1': 0.714859437751004,
 'eval_runtime': 0.7344,
 'eval_samples_per_second': 864.695,
 'eval_steps_per_second': 5.447,
 'epoch': 2.0}

### Save Trained Model and Tokenizer

This cell saves the trained spam detection model and its tokenizer to disk for future inference or further training.

In [None]:
# Save the trained model and tokenizer to disk for future use
trainer.save_model("model/semi_spam")
tokenizer.save_pretrained("model/semi_spam")

### Select Non-Duplicate Comments for Prediction

This cell filters the main dataset to select only non-duplicate comments (duplicatedFlag == 0) for prediction and displays the first few rows.

In [None]:
# Select only non-duplicate comments for prediction and display the first few rows
comment = data[data['duplicatedFlag'] == 0]
comment.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,duplicatedFlag,cleanedText,cleanedTextSentiment,isSpam
0,0,0,3166243,41024,6217,26499,Good Information... Will definitely try it......,,0,2020-01-01 16:00:58+00:00,2020-01-01 16:00:58+00:00,0,good information definitely try thanks,good information definitely try thanks : smili...,0
1,1,1,1888757,10004,86296,2608986,"Crystal, is it true that beginning Campaign 3,...",,1,2020-01-04 07:49:54+00:00,2020-01-04 07:49:54+00:00,0,crystal true beginning campaign 3 order get fr...,crystal true beginning campaign 3 order get fr...,0
2,2,2,0,10004,86296,164837,Yes but I am charged $8 to cover your free shi...,1888757.0,0,2020-01-04 07:53:24+00:00,2020-01-04 07:53:24+00:00,0,yes charged $ 8 cover free shipping not rep wo...,yes charged $ 8 cover free shipping not rep wo...,0
3,3,3,1,10004,86296,164837,Youravon.com/cspurlin,1888757.0,0,2020-01-04 07:53:37+00:00,2020-01-04 07:53:37+00:00,0,youravoncomcspurlin,youravoncomcspurlin,1
4,4,4,1279533,5459,64449,882554,Very useful video,,2,2020-01-04 10:32:19+00:00,2020-01-04 10:32:19+00:00,0,useful video,useful video,0


### Select Subset of Comments for Prediction

This cell selects the non-duplicate comments from row position 6000 onward for spam prediction, presumably because the first 6,000 rows make up the manually labeled file, and displays the shape of the resulting DataFrame.

In [None]:
# Select a subset of non-duplicate comments (from index 6000 onward) for spam prediction and display the shape
predict = comment[6000:]
predict.shape

(4608143, 15)

### Select First 10,000 Comments for Batch Prediction

This cell selects the first 10,000 comments from the prediction subset for batch spam prediction.

In [None]:
# Select the first 10,000 comments from the prediction subset for batch spam prediction
first_prediction = predict[:10000].copy()  # copy so probability columns can be added without SettingWithCopyWarning

### Load Trained Model and Tokenizer for Inference

This cell loads the previously saved spam detection model and tokenizer from disk, preparing them for inference on new data.

In [None]:
# Load the trained spam detection model and tokenizer from disk for inference
model_path = "model/semi_spam"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)  # run inference on the GPU when available

### Define Batch Scoring Function for Unlabeled Data

This cell defines a function to score an unlabeled dataset using the trained classifier, returning the predicted probabilities for each class (spam/not spam) in batches for efficiency.

In [None]:
# Define a function to score an unlabeled dataset in batches using the trained classifier, returning predicted probabilities for each class
def score_unlabeled(df, text_col, tokenizer, model, batch_size=64):
    """
    Score an unlabeled dataset with a classifier and show progress.
    """
    all_probs = []
    model.eval()
    device = model.device

    texts = df[text_col].astype(str).tolist()

    for i in tqdm(range(0, len(texts), batch_size), desc="Scoring unlabeled pool"):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=64,
            return_tensors="pt"
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

        all_probs.append(probs.cpu().numpy())

    return np.vstack(all_probs)
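
Two details of the scoring function are easy to verify in isolation: the softmax that turns a pair of logits into probabilities, and the number of batches produced by `range(0, len(texts), batch_size)`. The following is a plain-Python sketch with made-up logits; `softmax` and `batch_starts` are illustrative helpers, not the torch functions used above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def batch_starts(n_rows, batch_size=64):
    """Start offsets visited by range(0, n_rows, batch_size)."""
    return list(range(0, n_rows, batch_size))

p_not_spam, p_spam = softmax([1.0, 3.0])  # made-up logits: [not_spam, spam]
n_batches = len(batch_starts(10_000))     # 10,000 comments, batch size 64
print(p_not_spam, p_spam, n_batches)
```

For two classes the probabilities always sum to 1, so `p_not_spam` carries no information beyond `p_spam`. The last batch is smaller than 64 whenever the row count is not a multiple of the batch size.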

### Score First Prediction Batch and Add Probability Columns

This cell uses the batch scoring function to predict spam probabilities for the first batch of comments, adds the probabilities to the DataFrame, and displays the top 10 most likely spam comments.

In [None]:
# Score the first batch of comments, add spam/not spam probabilities, and display the top 10 most likely spam comments
probs = score_unlabeled(first_prediction, "textOriginal", tokenizer, model)

first_prediction["p_not_spam"] = probs[:, 0]
first_prediction["p_spam"] = probs[:, 1]

print(first_prediction.sort_values("p_spam", ascending=False).head(10))

Scoring unlabeled pool: 100%|██████████| 157/157 [08:28<00:00,  3.24s/it]

       Unnamed: 0.1  Unnamed: 0  commentId  channelId  videoId  authorId  \
8247           8247        8306    3028413      12513    41135   3643674   
7594           7594        7648    2971470      12513    41135   2958832   
10139         10139       10212    4294087      12513    41135   3636640   
12681         12681       12762    3275947      12513    41135   1884283   
7687           7687        7742    2164111      12513    41135   3625895   
7318           7318        7371    1035653      12513    41135   3636734   
6633           6633        6682    2205167      12513    41135   3030834   
10337         10337       10411    3527959      12513    41135     59874   
15623         15623       15726       2049      12513    41135   2173735   
6341           6341        6389     597335      12513    41135   2701833   

                                  textOriginal  parentCommentId  likeCount  \
8247           معقولة 8 ملايين مابيهم عراقية 😿              NaN          1   
7594   



### Save First Prediction Results to CSV

This cell saves the first batch of prediction results, including spam probabilities, to a CSV file for further analysis or manual review.

In [None]:
# Save the first batch of prediction results (with spam probabilities) to a CSV file for further analysis
first_prediction.to_csv('dataset/first_prediction.csv', index=False)

### Check Shape of First Prediction DataFrame

This cell displays the shape of the DataFrame containing the first batch of prediction results to confirm the number of processed comments.

In [None]:
# Display the shape of the first prediction DataFrame to confirm the number of processed comments
first_prediction.shape

(10000, 18)

### Calculate Uncertainty Margin and Select Uncertain Samples

This cell calculates the margin (absolute difference between spam and not spam probabilities) for each comment, sorts by uncertainty (smallest margin), and selects the most uncertain samples for further manual labeling.

In [None]:
# Calculate the uncertainty margin for each comment, sort by uncertainty, and select the most uncertain samples for manual labeling
# Add margin column (uncertainty score)
first_prediction["margin"] = (first_prediction["p_spam"] - first_prediction["p_not_spam"]).abs()

# Sort ascending → smaller margin = more uncertain
uncertain_samples = first_prediction.sort_values("margin", ascending=True).head(1000)

print(uncertain_samples[["textOriginal", "p_spam", "p_not_spam", "margin"]].head(10))

                       textOriginal    p_spam  p_not_spam    margin
4793  @@unispiral3459 undoubtedly 😂  0.499954    0.500046  0.000092
5862                          First  0.500065    0.499935  0.000129
3090                          First  0.500065    0.499935  0.000129
3066                           Oiii  0.500151    0.499849  0.000302
2769                              .  0.500176    0.499824  0.000352
5563                              .  0.500176    0.499824  0.000352
3922                              .  0.500176    0.499824  0.000352
1810                  Çok güzel  😊💖  0.500412    0.499588  0.000825
8304          Miriiiip bangeeeet😍😍😍  0.499536    0.500464  0.000927
5866                         Second  0.500675    0.499325  0.001349
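
For a binary classifier the margin reduces to a function of `p_spam` alone, since `p_not_spam = 1 - p_spam` gives `|p_spam - p_not_spam| = |2 * p_spam - 1|`. The sketch below applies that identity and picks the `k` least certain rows. The first two probabilities are copied from the output above, the last two rows are made up, and `margin` and `most_uncertain` are illustrative names.

```python
def margin(p_spam):
    """|p_spam - p_not_spam| for a binary classifier, where p_not_spam = 1 - p_spam."""
    return abs(2.0 * p_spam - 1.0)

# (text, p_spam) pairs; the first two come from the output above, the rest are made up.
scored = [
    ("clearly fine comment", 0.03),
    ("First", 0.500065),
    ("clearly spam comment", 0.97),
    ("Second", 0.500675),
]

k = 2  # the notebook keeps the 1,000 most uncertain rows
most_uncertain = sorted(scored, key=lambda row: margin(row[1]))[:k]
print(most_uncertain)
```

A margin near 0 means the model is close to a coin flip, while confident predictions in either direction get a margin near 1. Sorting ascending therefore surfaces the rows where a manual label is most informative.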


### Check Shape of Uncertain Samples DataFrame

This cell displays the shape of the DataFrame containing the most uncertain samples selected for manual labeling.

In [None]:
# Display the shape of the DataFrame containing the most uncertain samples for manual labeling
uncertain_samples.shape

(1000, 18)

### Display Uncertain Samples for Manual Review

This cell displays the first few rows of the DataFrame containing the most uncertain samples, which are candidates for manual labeling in the next round.

In [None]:
# Show the first few rows of the most uncertain samples for manual inspection and labeling
uncertain_samples.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,duplicatedFlag,cleanedText,cleanedTextSentiment,isSpam,p_not_spam,p_spam,margin
4793,10793,10869,1627,12513,41135,2713489,@@unispiral3459 undoubtedly 😂,2994720.0,4,2020-02-01 04:31:35+00:00,2020-02-01 04:31:35+00:00,0,undoubtedly,undoubtedly : face_with_tears_of_joy :,0,0.500046,0.499954,9.2e-05
5862,11862,11941,2185088,41733,25563,2487825,First,,1,2020-02-02 13:47:40+00:00,2020-02-02 13:47:40+00:00,0,first,first,1,0.499935,0.500065,0.000129
3090,9090,9157,1134065,32075,46086,2092869,First,,0,2020-01-30 03:52:34+00:00,2020-01-30 03:52:34+00:00,0,first,first,1,0.499935,0.500065,0.000129
3066,9066,9133,1405,12513,41135,1743999,Oiii,4699447.0,2,2020-01-30 02:47:53+00:00,2020-01-30 02:47:53+00:00,0,oii,oii,1,0.499849,0.500151,0.000302
2769,8769,8835,1627023,12513,41135,3154820,.,,0,2020-01-29 20:32:37+00:00,2020-01-29 20:32:37+00:00,0,,,1,0.499824,0.500176,0.000352


### Prepare for the Next Iteration of Semi-Supervised Learning

This cell keeps only the `commentId` and `textOriginal` columns of the uncertain samples, the two fields an annotator needs for the next labeling round.

In [None]:
# Prepare the uncertain samples for manual labeling or further processing in the next iteration of semi-supervised learning
label2 = uncertain_samples[["commentId", "textOriginal"]]

### Preview Uncertain Samples to be Exported

This cell displays the first few rows of the DataFrame containing uncertain samples that will be exported for manual labeling.

In [None]:
label2.head()

Unnamed: 0,commentId,textOriginal
4793,1627,@@unispiral3459 undoubtedly 😂
5862,2185088,First
3090,1134065,First
3066,1405,Oiii
2769,1627023,.


### Export Uncertain Samples for Manual Labeling

This cell saves the selected uncertain samples to a CSV file, so they can be manually labeled and used in the next iteration of model training.

In [None]:
# Save the uncertain samples to a CSV file for manual labeling in the next iteration
label2.to_csv('label2.csv', index=False)

### Load Trained Model and Tokenizer for Inference or Further Training

This cell loads the previously trained spam detection model and its tokenizer from disk. The loaded model can be used for inference on new data or for continued training in subsequent rounds.

In [None]:
model_path = "/content/drive/MyDrive/Colab Notebooks/semi_spam"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

### Load Manually Labeled Data for Supervised Training (Second Round)

This cell loads the second-round labeled dataset (`label7000.csv`) into `label2`, replacing the export frame of the same name. Its 7,000 rows are consistent with the 6,000 first-round rows plus the 1,000 uncertain samples exported above.

In [None]:
file_path = 'dataset/label7000.csv'
label2 = pd.read_csv(file_path)
label2.head()

Unnamed: 0,commentId,textOriginal,ManualLabelSpam,ManualLabelText
0,3166243.0,Good Information... Will definitely try it......,0.0,
1,1888757.0,"Crystal, is it true that beginning Campaign 3,...",0.0,
2,0.0,Yes but I am charged $8 to cover your free shi...,0.0,
3,1.0,Youravon.com/cspurlin,1.0,
4,1279533.0,Very useful video,0.0,


### Check Shape of Labeled Data

This cell displays the shape (number of rows and columns) of the `label2` DataFrame to confirm successful loading and understand the dataset size before further processing.

In [None]:
label2.shape

(7000, 4)

### Preview Labeled Data

This cell displays the first few rows of the `label2` DataFrame to allow inspection of the labeled data and verify its contents before further processing.

In [None]:
label2.head()

Unnamed: 0,commentId,textOriginal,ManualLabelSpam
0,3166243.0,Good Information... Will definitely try it......,0.0
1,1888757.0,"Crystal, is it true that beginning Campaign 3,...",0.0
2,0.0,Yes but I am charged $8 to cover your free shi...,0.0
3,1.0,Youravon.com/cspurlin,1.0
4,1279533.0,Very useful video,0.0


### Clean Labels and Split Data for Training and Validation

This cell fills missing values in the `ManualLabelSpam` column with 0, which treats every unlabeled row as not spam (the first round dropped such rows instead), then splits the labeled data into training and validation sets for model training and evaluation.

In [None]:
# Fill any remaining NaN values in 'ManualLabelSpam' with 0
label2['ManualLabelSpam'] = label2['ManualLabelSpam'].fillna(0)

train_texts, val_texts, train_labels, val_labels = train_test_split(
    label2['textOriginal'].astype(str).tolist(),
    label2['ManualLabelSpam'].astype(int).tolist(),
    test_size=0.2,
    random_state=42
)
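
The choice between dropping and zero-filling missing labels changes the label distribution the model is trained on. The toy sketch below uses made-up labels, with `None` standing in for NaN, to contrast the first-round `dropna` with the `fillna(0)` used here; the variable names are illustrative.

```python
# Toy label column; None stands in for NaN (an unlabeled row). Values are made up.
labels = [1, 0, None, None, 0, 1, None, 0]

dropped = [y for y in labels if y is not None]     # first round: dropna
filled = [0 if y is None else y for y in labels]   # second round: fillna(0)

spam_rate_dropped = sum(dropped) / len(dropped)
spam_rate_filled = sum(filled) / len(filled)
print(len(dropped), spam_rate_dropped)
print(len(filled), spam_rate_filled)
```

Zero-filling keeps more rows but lowers the apparent spam rate, and any unlabeled comment that is in fact spam becomes a mislabeled negative example. Whether that matters depends on why the labels are missing, which the notebook does not state.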

### Tokenize, Prepare Datasets, and Train Model (Second Round)

This cell tokenizes the second-round texts, builds the training and validation datasets, re-declares the `CommentDataset` helper (not used by the `Trainer`, which receives the `datasets.Dataset` objects), sets up HuggingFace `TrainingArguments`, and creates the `Trainer`. Training continues from the first-round checkpoint loaded above, now on the enlarged labeled set.

In [None]:
os.environ["WANDB_DISABLED"] = "true"

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=64)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=64)

# Convert encodings and labels to datasets.Dataset
train_dataset = Dataset.from_dict(train_encodings)
train_dataset = train_dataset.add_column("labels", train_labels)

val_dataset = Dataset.from_dict(val_encodings)
val_dataset = val_dataset.add_column("labels", val_labels)


class CommentDataset(torch.utils.data.Dataset):  # the bare name Dataset refers to datasets.Dataset here
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx],dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Colab Notebooks/SpamResult",
    num_train_epochs=2,                       # keep small for Colab
    per_device_train_batch_size=200,          # reduce (e.g. to 16) if the GPU runs out of memory
    per_device_eval_batch_size=200,
    save_strategy="epoch",                    # save only at end of each epoch
    logging_strategy="epoch",                 # log only at end of each epoch
    logging_dir="./logs",
    fp16=torch.cuda.is_available(),           # mixed precision (much faster on GPU); disabled on the CPU fallback
    dataloader_num_workers=2,                 # faster data loading
    report_to="none"                          # disable W&B/TensorBoard auto-logging
)

# ✅ Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# ✅ Train
trainer.train()

Step,Training Loss
28,0.2243
56,0.1666


TrainOutput(global_step=56, training_loss=0.19544026681355067, metrics={'train_runtime': 29.4058, 'train_samples_per_second': 380.877, 'train_steps_per_second': 1.904, 'total_flos': 185454358118400.0, 'train_loss': 0.19544026681355067, 'epoch': 2.0})

### Evaluate Trained Model on Validation Set (Second Round)

This cell evaluates the trained spam detection model on the validation set after the second round of training, reporting key evaluation metrics such as accuracy, precision, recall, and F1 score.

In [None]:
# Evaluate the trained model using the HuggingFace Trainer's evaluate() method
trainer.evaluate()

{'eval_loss': 0.19468963146209717,
 'eval_accuracy': 0.9228571428571428,
 'eval_precision': 0.7636363636363637,
 'eval_recall': 0.75,
 'eval_f1': 0.7567567567567568,
 'eval_runtime': 0.6901,
 'eval_samples_per_second': 2028.772,
 'eval_steps_per_second': 10.144,
 'epoch': 2.0}
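
The two evaluation reports are easier to compare as confusion counts. The counts below are not printed by the notebook; they are inferred as the integer counts that reproduce the reported accuracy, precision, recall and F1 of each round, so treat them as a reconstruction. The names `rounds` and `summary` are illustrative.

```python
# Confusion counts inferred from the reported metrics of each round
# (assumption: reconstructed, not printed by the notebook).
rounds = {
    "round 1": {"tp": 89, "fp": 9, "fn": 62, "tn": 475},
    "round 2": {"tp": 168, "fp": 52, "fn": 56, "tn": 1124},
}

summary = {}
for name, c in rounds.items():
    n = sum(c.values())
    summary[name] = {
        "n": n,
        "spam_share": (c["tp"] + c["fn"]) / n,          # share of true spam in the validation set
        "accuracy": (c["tp"] + c["tn"]) / n,
        "precision": c["tp"] / (c["tp"] + c["fp"]),
        "recall": c["tp"] / (c["tp"] + c["fn"]),
        "f1": 2 * c["tp"] / (2 * c["tp"] + c["fp"] + c["fn"]),
    }
    print(name, summary[name])
```

Under this reconstruction, recall rises and precision falls between rounds, and the two validation sets differ in both size and spam share. The rounds are therefore not evaluated on the same data, and the metric changes should be read with that in mind.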

### Save Trained Model and Tokenizer (Second Round)

This cell saves the trained spam detection model and its tokenizer to disk after the second round of training. Saving these artifacts allows for future inference, further training, or deployment without retraining from scratch.

In [None]:
trainer.save_model("model/semi_spam")
tokenizer.save_pretrained("model/semi_spam")

### Load and Preview First Round Prediction Results

This cell loads the prediction results from the first round of model inference, reading the CSV file containing predicted probabilities and other outputs. It then displays the first few rows to verify successful loading and inspect the prediction data structure.

In [None]:
file_path = 'model/first_prediction.csv'
prediction = pd.read_csv(file_path)
prediction.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,duplicatedFlag,cleanedText,cleanedTextSentiment,isSpam,p_not_spam,p_spam
0,6000,6048,1973314,21411,1265,1371722,this make up looks really good on you even the...,,0,2020-01-27 00:17:37+00:00,2020-01-27 00:17:37+00:00,0,make look really good even pink lip look reall...,make look really good even pink lip look reall...,0,0.964349,0.035651
1,6001,6049,813,21411,1265,1744706,my skin color and hair roots are different lik...,4048980.0,3,2020-01-27 00:17:49+00:00,2020-01-27 00:17:49+00:00,0,skin color hair root different like not know t...,skin color hair root different like not know t...,0,0.967937,0.032063
2,6002,6050,724,21411,1265,1371722,lmao 😂😭,3008603.0,1,2020-01-27 00:20:02+00:00,2020-01-27 00:20:02+00:00,0,lmao,lmao : face_with_tears_of_joy : :loudly_crying...,1,0.445418,0.554582
3,6003,6051,1350387,12513,41135,1579873,I want to be your model...love you girl! (IT ...,,1,2020-01-27 00:24:16+00:00,2020-01-27 00:24:16+00:00,0,want modellove girl happen one day beautiful t...,want modellove girl happen one day beautiful t...,0,0.965182,0.034818
4,6004,6052,2399926,21411,1265,2458808,Plz try British chav makeup,,0,2020-01-27 00:27:16+00:00,2020-01-27 00:27:16+00:00,0,plz try british chav makeup,plz try british chav makeup,0,0.94874,0.05126


### Define Batch Scoring Function for Unlabeled Data

This cell defines a function to score an unlabeled dataset using the trained classifier. It processes the texts in batches (truncated to at most 64 tokens), applies the model to obtain class probabilities (not spam, spam), and returns them as a NumPy array with one row per comment and one column per class. Batching keeps inference and uncertainty sampling practical on large datasets.

In [None]:
def score_unlabeled(df, text_col, tokenizer, model, batch_size=64):
    """
    Score an unlabeled dataset with a classifier and show progress.
    """
    all_probs = []
    model.eval()
    device = model.device

    texts = df[text_col].astype(str).tolist()

    for i in tqdm(range(0, len(texts), batch_size), desc="Scoring unlabeled pool"):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=64,
            return_tensors="pt"
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

        all_probs.append(probs.cpu().numpy())

    return np.vstack(all_probs)
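
The batching pattern in `score_unlabeled` can be illustrated without loading a model. The sketch below swaps the classifier for a hypothetical `fake_logits` function (an assumption for illustration only) and uses NumPy in place of torch. It shows that each batch yields an array of per-class probabilities and that `np.vstack` reassembles the batches in the original row order.

```python
import numpy as np

def fake_logits(batch):
    # Hypothetical stand-in for model(**inputs).logits:
    # one (not_spam, spam) logit pair per text
    return np.array([[1.0, float(len(text) % 3)] for text in batch])

def softmax(logits):
    # Numerically stable softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up texts for illustration only
texts = ["free gift", "nice video", "sub to me", "lol", "great tutorial"]
batch_size = 2

all_probs = []
for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]
    all_probs.append(softmax(fake_logits(batch)))

# Stack the per-batch arrays into one (n_texts, 2) array
probs = np.vstack(all_probs)
print(probs.shape)
```

Each row of `probs` sums to 1, which is what lets the later cells treat column 0 as `p_not_spam` and column 1 as `p_spam`.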

### Score Second Prediction Batch and Add Probability Columns

This cell uses the batch scoring function to predict spam probabilities for the second batch of comments, scoring the raw `textOriginal` column. It adds the predicted probabilities for both classes (`p_not_spam_2` and `p_spam_2`) to the DataFrame, then prints the 10 comments with the highest spam probability. Note that `second_prediction` is created by the "Select Top Confident Samples" cell further down, so that cell has to run before this one.

In [None]:
probs = score_unlabeled(second_prediction, "textOriginal", tokenizer, model)

second_prediction["p_not_spam_2"] = probs[:, 0]
second_prediction["p_spam_2"] = probs[:, 1]

print(second_prediction.sort_values("p_spam_2", ascending=False).head(10))

Scoring unlabeled pool: 100%|██████████| 141/141 [00:04<00:00, 32.78it/s]

      Unnamed: 0.1  Unnamed: 0  commentId  channelId  videoId  authorId  \
5135         11135       11213       1458      12513    41135   3629344   
5653         11653       11731    1730357      12513    41135   3644030   
2808          8808        8874    3307552      12513    41135   3637482   
1827          7827        7884    2905539      12513    41135   1501463   
3623          9623        9694    2119525      12513    41135   3630651   
4437         10437       10513    2051812      12513    41135   1614084   
3242          9242        9312    3367525      12513    41135   3583851   
966           6966        7019       1106      12513    41135   2144355   
3606          9606        9677    2946857      12513    41135   3639555   
4036         10036       10108    2109693      12513    41135   3646334   

                                           textOriginal  parentCommentId  \
5135               ها حبايبي لتحسون بلغربة  بتعليقات 🤣🤣        3017449.0   
5653                  




### Calculate Uncertainty Margin and Select Top Confident Samples

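This cell calculates a confidence margin for each first-round prediction by taking the absolute difference between the first-round spam and not-spam probabilities (`p_spam` and `p_not_spam`, the only probability columns present in `first_prediction.csv`). It then sorts the predictions by this margin in descending order (most confident first) and keeps the 9,000 most confident samples as `second_prediction`, the pool that the second-round model rescores. The shape of the resulting DataFrame is displayed to confirm the number of selected samples.

In [None]:
# Add margin column (large margin = confident prediction, small margin = uncertain)
# Uses the first-round columns; p_spam_2 / p_not_spam_2 do not exist on `prediction` yet
prediction["margin"] = (prediction["p_spam"] - prediction["p_not_spam"]).abs()
# .copy() so that columns can be added to second_prediction later without pandas warnings
second_prediction = prediction.sort_values("margin", ascending=False).head(9000).copy()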
second_prediction.shape

(9000, 18)

### Calculate Uncertainty Margin and Select Most Uncertain Samples (Second Round)

This cell calculates the uncertainty margin for each comment in the second prediction set by taking the absolute difference between the predicted spam and not spam probabilities. It then sorts the comments by this margin in ascending order (most uncertain first) and selects the top 1,000 most uncertain samples. These samples are ideal candidates for manual labeling in the next iteration of semi-supervised learning.

In [None]:
# Add margin column (uncertainty score)
second_prediction["margin"] = (second_prediction["p_spam_2"] - second_prediction["p_not_spam_2"]).abs()
uncertain_samples = second_prediction.sort_values("margin", ascending=True).head(1000)
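
For a two-class model the two probabilities sum to 1, so the margin equals `|2 * p_spam - 1|`: sorting by margin in ascending order surfaces the comments whose spam probability is closest to 0.5. The following is a minimal sketch on made-up probabilities; the `toy` DataFrame is hypothetical and not notebook data.

```python
import pandas as pd

# Made-up probabilities for illustration only
toy = pd.DataFrame({
    "commentId": [1, 2, 3, 4],
    "p_spam_2": [0.97, 0.51, 0.10, 0.45],
})
toy["p_not_spam_2"] = 1.0 - toy["p_spam_2"]

# Same margin definition as in the cell above
toy["margin"] = (toy["p_spam_2"] - toy["p_not_spam_2"]).abs()

# Smallest margin first: the most uncertain comments
most_uncertain = toy.sort_values("margin", ascending=True).head(2)
print(most_uncertain["commentId"].tolist())
```

Here the comments with spam probabilities 0.51 and 0.45 are selected, while the confidently scored ones (0.97 and 0.10) are left out.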

### Preview Most Uncertain Samples Selected for Manual Labeling

This cell displays the first few rows of the DataFrame containing the most uncertain samples. These samples, identified by their small probability margin (a second-round spam probability close to 0.5), are candidates for manual labeling in the next round of semi-supervised learning. Reviewing them helps confirm that the selection is working as intended.

In [None]:
uncertain_samples.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,duplicatedFlag,cleanedText,cleanedTextSentiment,isSpam,p_not_spam,p_spam,margin,p_not_spam_2,p_spam_2
7596,13596,13685,1903495,49753,38989,1461696,I neeeedddd,,5,2020-02-05 01:09:39+00:00,2020-02-05 01:09:39+00:00,0,needd,needd,1,0.601872,0.398128,0.0019,0.49905,0.50095
2999,8999,9066,3484110,12513,41135,2654114,U 👏🏽did👏🏽 that!!,,0,2020-01-30 01:30:19+00:00,2020-01-30 01:30:19+00:00,0,,: clapping_hands_medium_skin_tone : : clapping...,1,0.603559,0.396441,0.016722,0.491639,0.508361
2478,8478,8541,1189,12513,41135,1938795,Muy linda,2006775.0,1,2020-01-29 15:47:03+00:00,2020-01-29 15:47:03+00:00,0,muy linda,muy linda,1,0.577694,0.422306,0.01768,0.49116,0.50884
6436,12436,12516,3940715,12513,41135,1629487,Brasil 2020,,0,2020-02-03 00:55:05+00:00,2020-02-03 00:55:05+00:00,0,brasil 2020,brasil 2020,1,0.651202,0.348798,0.018713,0.509356,0.490644
9095,15095,15195,1681,21411,1265,1227269,Alex W yes yes yes,2827370.0,0,2020-02-05 21:16:27+00:00,2020-02-05 21:16:27+00:00,0,alex w yes yes yes,alex w yes yes yes,0,0.889348,0.110652,0.021085,0.510542,0.489458


### Prepare DataFrame for Export of Uncertain Samples

This cell creates a new DataFrame containing only the comment IDs and original texts of the most uncertain samples. This streamlined DataFrame is prepared for export and manual labeling in the next iteration of the semi-supervised learning process.

In [None]:
# Create a DataFrame with comment IDs and texts of the most uncertain samples for manual labeling
label3 = uncertain_samples[["commentId", "textOriginal"]]

### Preview the Next Batch of Uncertain Samples for Manual Labeling

This cell displays the first few rows of the DataFrame containing the next batch of uncertain samples (`label3`). Reviewing these samples allows for a quick inspection before exporting them for manual labeling in the next iteration of the semi-supervised learning process.

In [None]:
# Display the first few rows of the next batch of uncertain samples to be exported for manual labeling
label3.head()

Unnamed: 0,commentId,textOriginal
7596,1903495,I neeeedddd
2999,3484110,U 👏🏽did👏🏽 that!!
2478,1189,Muy linda
6436,3940715,Brasil 2020
9095,1681,Alex W yes yes yes


### Export Next Batch of Uncertain Samples for Manual Labeling

This cell saves the next batch of uncertain samples to a CSV file, so they can be manually labeled and used in the following iteration of model training.

In [None]:
label3.to_csv('dataset/label3.csv', index=False)

### Start of Round 3: Next Iteration of Semi-Supervised Learning

This section begins the third round of the semi-supervised learning process, where newly labeled data from the previous iteration is incorporated to further improve the model.

### Load Trained Model and Tokenizer for Round 3

This cell loads the trained spam detection model and its tokenizer from disk for the third round of semi-supervised learning. The loaded model will be used for further training or inference with the newly labeled data from previous rounds.

In [None]:
# Start of Round 3: Incorporate newly labeled data and repeat the semi-supervised learning process
model_path = "model/semi_spam"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

### Load and Preview Newly Labeled Data for Round 3

This cell loads the newly labeled data for the third round of semi-supervised learning from a CSV file and displays the first few rows. Previewing the data ensures it has been loaded correctly and is ready for further processing or model training.

In [None]:
file_path = 'model/label8000.csv'
label3 = pd.read_csv(file_path)
label3.head()

Unnamed: 0,commentId,textOriginal,ManualLabelSpam
0,3166243.0,Good Information... Will definitely try it......,0.0
1,1888757.0,"Crystal, is it true that beginning Campaign 3,...",0.0
2,0.0,Yes but I am charged $8 to cover your free shi...,0.0
3,1.0,Youravon.com/cspurlin,1.0
4,1279533.0,Very useful video,0.0


### Clean Labels and Split Data for Training and Validation (Round 3)

This cell fills any remaining NaN values in the `ManualLabelSpam` column with 0, so every sample has a label; in effect, unlabeled rows are treated as not spam. It then splits the labeled data into training (80%) and validation (20%) sets with a fixed seed (`random_state=42`) for the third round of model training and evaluation.

In [None]:
# Fill any remaining NaN values in 'ManualLabelSpam' with 0
label3['ManualLabelSpam'] = label3['ManualLabelSpam'].fillna(0)

train_texts, val_texts, train_labels, val_labels = train_test_split(
    label3['textOriginal'].astype(str).tolist(),
    label3['ManualLabelSpam'].astype(int).tolist(),
    test_size=0.2,
    random_state=42
)
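
Because `fillna(0)` silently treats every unlabeled row as not spam, it can be worth counting the missing labels before filling them. The following is a minimal sketch on a hypothetical stand-in for `label3`; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for label3; the real data comes from the CSV loaded above
toy_labels = pd.DataFrame({
    "textOriginal": ["great tips", "visit my channel", "first!", "nice"],
    "ManualLabelSpam": [0.0, 1.0, None, 0.0],
})

# Count rows that fillna(0) would default to 0 (not spam)
n_missing = int(toy_labels["ManualLabelSpam"].isna().sum())
print(f"{n_missing} of {len(toy_labels)} rows have no manual label")

# Class balance after filling, to see how skewed the labels are
filled = toy_labels["ManualLabelSpam"].fillna(0).astype(int)
print(filled.value_counts().to_dict())
```

If the missing count is large on the real data, those rows may deserve a second labeling pass rather than a default label.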

### Train Model in Round 3

This cell tokenizes the training and validation texts (truncated and padded to at most 64 tokens), wraps them in `datasets.Dataset` objects, and fine-tunes the round-2 model for two more epochs on the newly labeled data.

In [None]:
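import os
# Disable Weights & Biases logging before the Trainer is created
os.environ["WANDB_DISABLED"] = "true"

# Tokenize the training and validation texts (truncated/padded to at most 64 tokens)
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=64)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=64)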

# Convert encodings and labels to datasets.Dataset
train_dataset = Dataset.from_dict(train_encodings)
train_dataset = train_dataset.add_column("labels", train_labels)

val_dataset = Dataset.from_dict(val_encodings)
val_dataset = val_dataset.add_column("labels", val_labels)


# NOTE: CommentDataset is not used below; the Trainer is given the
# datasets.Dataset objects built above.
class CommentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Colab Notebooks/SpamResult",
    num_train_epochs=2,                       # keep small for Colab
    per_device_train_batch_size=200,          # large batch; lower it (e.g. to 16) if the GPU runs out of memory
    per_device_eval_batch_size=200,
    save_strategy="epoch",                    # save only at end of each epoch
    logging_strategy="epoch",                 # log only at end of each epoch
    logging_dir="./logs",
    fp16=True,                                # mixed precision (much faster on GPU)
    dataloader_num_workers=2,                 # faster data loading
    report_to="none"                          # disable W&B/TensorBoard auto-logging
)

# ✅ Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# ✅ Train
trainer.train()

Step,Training Loss
32,0.224
64,0.1663


TrainOutput(global_step=64, training_loss=0.1951385959982872, metrics={'train_runtime': 33.4998, 'train_samples_per_second': 382.092, 'train_steps_per_second': 1.91, 'total_flos': 211947837849600.0, 'train_loss': 0.1951385959982872, 'epoch': 2.0})
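
The `global_step=64` in the output above is consistent with the batch size and the number of epochs. The sketch below infers the training-set size from the logged throughput (it is not stated in the notebook) and assumes a single device with no gradient accumulation.

```python
import math

# Inferred from the log above: train_samples_per_second * train_runtime
samples_seen = 382.092 * 33.4998            # roughly 12,800 samples over both epochs
num_epochs = 2
n_train = round(samples_seen / num_epochs)  # roughly 6,400 rows per epoch

batch_size = 200  # per_device_train_batch_size, assuming a single device
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs
print(n_train, steps_per_epoch, total_steps)
```

The per-epoch step count also matches the logging table, which reports losses at steps 32 and 64.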

### Evaluate Trained Model on Validation Set (Round 3)

This cell evaluates the trained spam detection model on the validation set after the third round of training. It reports key evaluation metrics such as accuracy, precision, recall, and F1 score to assess model performance.

In [None]:
trainer.evaluate()

{'eval_loss': 0.1982705444097519,
 'eval_accuracy': 0.928125,
 'eval_precision': 0.8245033112582781,
 'eval_recall': 0.8006430868167203,
 'eval_f1': 0.8123980424143556,
 'eval_runtime': 0.9866,
 'eval_samples_per_second': 1621.723,
 'eval_steps_per_second': 8.109,
 'epoch': 2.0}
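
To see how the metrics moved between rounds, the two `trainer.evaluate()` outputs can be compared directly. The sketch below copies the reported values into hypothetical dictionaries (`round2`, `round3`). The two validation sets differ in size and content, so the deltas are indicative rather than a controlled comparison.

```python
# Values copied from the two trainer.evaluate() outputs in this notebook
round2 = {"accuracy": 0.9228571428571428, "precision": 0.7636363636363637,
          "recall": 0.75, "f1": 0.7567567567567568}
round3 = {"accuracy": 0.928125, "precision": 0.8245033112582781,
          "recall": 0.8006430868167203, "f1": 0.8123980424143556}

# Per-metric change from round 2 to round 3
deltas = {name: round3[name] - round2[name] for name in round2}
for name, delta in deltas.items():
    print(f"{name:<9} {round2[name]:.4f} -> {round3[name]:.4f} ({delta:+.4f})")
```

All four metrics are higher in round 3, with the largest gains in precision and F1.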

### Save Model from Round 3

This cell saves the round-3 model and its tokenizer to `model/semi_spam`, overwriting the round-2 artifacts stored at the same path, so they can be reloaded for inference or a further training round.

In [None]:
trainer.save_model("model/semi_spam")
tokenizer.save_pretrained("model/semi_spam")