# Semi-Supervised Spam Detection Application Pipeline

This notebook applies a semi-supervised learning pipeline to detect spam in YouTube comments. It loads a pretrained transformer model, filters a cleaned comments dataset, scores the remaining comments in batches, and merges the rule-based and model-based spam labels. The output is a single labeled dataset suitable for further analysis or downstream tasks.

### Import Required Libraries

This cell imports the libraries needed for data manipulation, model loading, inference, and progress tracking: pandas, HuggingFace transformers, tqdm, torch, and numpy.

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
import torch
import numpy as np

### Load Pretrained Model and Tokenizer

This cell loads a pretrained spam detection model and its tokenizer from disk using HuggingFace's Transformers library. The model will be used for inference on the cleaned comments dataset.

In [None]:
model_path = "/content/drive/MyDrive/Colab Notebooks/semi_spam"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

### Load Cleaned Comments Dataset

This cell loads the cleaned comments dataset from a CSV file and displays the first few rows to verify successful loading and inspect the data structure.

In [None]:
file_path = '/content/drive/MyDrive/Colab Notebooks/comments_all_tagged_text_duplicates_regex.csv'
cleaned = pd.read_csv(file_path)
cleaned.head()

Unnamed: 0,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,duplicatedFlag,cleanedText,cleanedTextSentiment,regex_spam
0,3166243,41024,6217,26499,Good Information... Will definitely try it......,,0,2020-01-01 16:00:58+00:00,2020-01-01 16:00:58+00:00,0,good information definitely try thanks,good information definitely try thanks : smili...,0
1,1888757,10004,86296,2608986,"Crystal, is it true that beginning Campaign 3,...",,1,2020-01-04 07:49:54+00:00,2020-01-04 07:49:54+00:00,0,crystal true beginning campaign 3 order get fr...,crystal true beginning campaign 3 order get fr...,0
2,0,10004,86296,164837,Yes but I am charged $8 to cover your free shi...,1888757.0,0,2020-01-04 07:53:24+00:00,2020-01-04 07:53:24+00:00,0,yes charged $ 8 cover free shipping not rep wo...,yes charged $ 8 cover free shipping not rep wo...,0
3,1,10004,86296,164837,Youravon.com/cspurlin,1888757.0,0,2020-01-04 07:53:37+00:00,2020-01-04 07:53:37+00:00,0,youravoncomcspurlin,youravoncomcspurlin,1
4,1279533,5459,64449,882554,Very useful video,,2,2020-01-04 10:32:19+00:00,2020-01-04 10:32:19+00:00,0,useful video,useful video,0


### Flag Duplicates and Spam for Filtering

This cell creates a new column 'filter' in the dataset, flagging comments as 1 if they are duplicates or identified as regex-based spam, and 0 otherwise. This helps in selecting only the comments suitable for prediction.

In [None]:
cleaned['filter'] = np.where(
    (cleaned['duplicatedFlag'] == 1) | (cleaned['regex_spam'] == 1),  # condition
    1,  # value if True
    0   # value if False
)

### Select Comments for Prediction

This cell filters the dataset to select only those comments that are not flagged as duplicates or regex-based spam (i.e., 'filter' == 0). These comments are the candidates for spam prediction by the model.

In [None]:
to_predict = cleaned[cleaned['filter'] == 0].copy()  # copy so new columns can be added without SettingWithCopyWarning
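
As a minimal sketch of the flag-and-select step on a toy frame (the three rows below are made up for illustration and are not taken from the dataset), the snippet builds the same `filter` column and takes an explicit `.copy()` of the selection, so that columns added later are written to an independent DataFrame rather than to a slice of the original. The names `toy` and `toy_to_predict` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows, same flag columns as the notebook)
toy = pd.DataFrame({
    "textOriginal": ["nice video", "nice video", "visit my-site.example"],
    "duplicatedFlag": [0, 1, 0],
    "regex_spam": [0, 0, 1],
})

# 1 if the row is a duplicate or regex-flagged spam, else 0
toy["filter"] = np.where(
    (toy["duplicatedFlag"] == 1) | (toy["regex_spam"] == 1), 1, 0
)

# Explicit copy: later column assignments do not touch `toy`
toy_to_predict = toy[toy["filter"] == 0].copy()
toy_to_predict["p_spam"] = 0.0  # placeholder value, for illustration only

print(toy["filter"].tolist())
print(toy_to_predict.shape)
```

Only the first row survives the filter here, and the new `p_spam` column exists on the copy but not on `toy`.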

### Display Shape of Prediction Set

This cell displays the shape (number of rows and columns) of the DataFrame containing comments selected for prediction, confirming the number of comments to be processed by the model.

In [None]:
to_predict.shape

(3256490, 14)

### Define Batch Scoring Function for Unlabeled Data

This cell defines a function that scores an unlabeled dataset with the loaded classifier. It tokenizes the text in batches (truncated to 64 tokens), runs the model without gradient tracking, applies a softmax to the logits, and returns the class probabilities as a NumPy array with one row per comment and one column per class. Batching keeps memory use bounded when scoring millions of comments.

In [None]:
def score_unlabeled(df, text_col, tokenizer, model, batch_size=1000):
    """
    Score an unlabeled dataset with a classifier and show progress.
    """
    all_probs = []
    model.eval()

    # Force device detection
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)  # move model to GPU if available

    texts = df[text_col].astype(str).tolist()

    for i in tqdm(range(0, len(texts), batch_size), desc="Scoring unlabeled pool"):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=64,
            return_tensors="pt"
        ).to(device)  # move inputs to same device

        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

        all_probs.append(probs.cpu().numpy())  # back to CPU for numpy

    return np.vstack(all_probs)
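
The per-batch arithmetic in `score_unlabeled` can be illustrated without loading the model. The sketch below is a NumPy stand-in, not the notebook's code: `fake_logits` is a hypothetical function that replaces the transformer forward pass, and `softmax` mirrors what `torch.nn.functional.softmax(..., dim=-1)` computes. It shows how slicing in steps of `batch_size` and stacking the per-batch results yields one row of probabilities per input text.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fake_logits(batch):
    """Hypothetical stand-in for the model: two logits per text."""
    lengths = np.array([len(t) for t in batch], dtype=float)
    return np.stack([lengths, -lengths], axis=1)

texts = ["a", "bb", "ccc", "dddd", "eeeee"]
batch_size = 2  # small on purpose; the notebook uses 1000

all_probs = []
for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]  # the last batch may be shorter
    all_probs.append(softmax(fake_logits(batch)))

probs = np.vstack(all_probs)  # one row per text, one column per class
print(probs.shape)
```

Each row of `probs` sums to 1, and the final, shorter batch is handled by the same slicing logic.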

### Score Comments and Add Prediction Columns

This cell runs the batch scoring function on the `textOriginal` column of the selected comments, stores the two class probabilities as `p_not_spam` and `p_spam`, and adds a `predicted_spam` column holding the more probable class (0 = not spam, 1 = spam).

In [None]:
# Run scoring
probs = score_unlabeled(to_predict, "textOriginal", tokenizer, model)

# Add probabilities
to_predict["p_not_spam"] = probs[:, 0]
to_predict["p_spam"] = probs[:, 1]

# Add numeric prediction (0 = not spam, 1 = spam)
to_predict["predicted_spam"] = np.argmax(probs, axis=1)


Scoring unlabeled pool: 100%|██████████| 3257/3257 [1:25:09<00:00,  1.57s/it]

### Count Predicted Spam Comments

This cell calculates and displays the total number of comments predicted as spam by the model in the current prediction set.

In [None]:
to_predict['predicted_spam'].sum()

np.int64(165976)

### Count Total Comments in Prediction Set

This cell displays the total number of comments in the prediction set, providing context for the number of spam predictions.

In [None]:
len(to_predict)

3256490
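
To put the two counts above in proportion, the share of scored comments that the model predicts as spam can be computed directly from the displayed outputs. This is a small arithmetic sketch; the variable names are illustrative and not part of the notebook.

```python
n_predicted_spam = 165_976   # output of to_predict['predicted_spam'].sum()
n_scored = 3_256_490         # output of len(to_predict)

spam_share = n_predicted_spam / n_scored
print(f"{spam_share:.2%}")
```

This comes to roughly 5.1% of the comments that reached the model.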

### Save Prediction Results to CSV

This cell saves the DataFrame containing the prediction results, including spam probabilities and predicted labels, to a CSV file for further analysis or downstream processing.

In [None]:
to_predict.to_csv('dataset/comments_all_tagged_text_duplicates_model.csv', index=False)

### Load Cleaned Dataset for Merging

This cell loads the cleaned comments dataset (with regex-based spam filtering) from a CSV file and displays the first few rows to verify successful loading before merging with model predictions.

In [None]:
file_path = 'dataset/comments_all_tagged_text_duplicates_regex.csv'
first = pd.read_csv(file_path)
first.head()

Unnamed: 0,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,duplicatedFlag,cleanedText,cleanedTextSentiment,regex_spam
0,3166243,41024,6217,26499,Good Information... Will definitely try it......,,0,2020-01-01 16:00:58+00:00,2020-01-01 16:00:58+00:00,0,good information definitely try thanks,good information definitely try thanks : smili...,0
1,1888757,10004,86296,2608986,"Crystal, is it true that beginning Campaign 3,...",,1,2020-01-04 07:49:54+00:00,2020-01-04 07:49:54+00:00,0,crystal true beginning campaign 3 order get fr...,crystal true beginning campaign 3 order get fr...,0
2,0,10004,86296,164837,Yes but I am charged $8 to cover your free shi...,1888757.0,0,2020-01-04 07:53:24+00:00,2020-01-04 07:53:24+00:00,0,yes charged $ 8 cover free shipping not rep wo...,yes charged $ 8 cover free shipping not rep wo...,0
3,1,10004,86296,164837,Youravon.com/cspurlin,1888757.0,0,2020-01-04 07:53:37+00:00,2020-01-04 07:53:37+00:00,0,youravoncomcspurlin,youravoncomcspurlin,1
4,1279533,5459,64449,882554,Very useful video,,2,2020-01-04 10:32:19+00:00,2020-01-04 10:32:19+00:00,0,useful video,useful video,0


### Load Model Prediction Results for Merging

This cell loads the model's spam prediction results from a CSV file and displays the first few rows to verify successful loading before merging with the cleaned dataset.

In [None]:
file_path = 'dataset/comments_all_tagged_text_duplicates_model.csv'
second = pd.read_csv(file_path)
second.head()

Unnamed: 0,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,duplicatedFlag,cleanedText,cleanedTextSentiment,regex_spam,filter,p_not_spam,p_spam,predicted_spam
0,3166243,41024,6217,26499,Good Information... Will definitely try it......,,0,2020-01-01 16:00:58+00:00,2020-01-01 16:00:58+00:00,0,good information definitely try thanks,good information definitely try thanks : smili...,0,0,0.96721,0.03279,0
1,1888757,10004,86296,2608986,"Crystal, is it true that beginning Campaign 3,...",,1,2020-01-04 07:49:54+00:00,2020-01-04 07:49:54+00:00,0,crystal true beginning campaign 3 order get fr...,crystal true beginning campaign 3 order get fr...,0,0,0.988724,0.011276,0
2,0,10004,86296,164837,Yes but I am charged $8 to cover your free shi...,1888757.0,0,2020-01-04 07:53:24+00:00,2020-01-04 07:53:24+00:00,0,yes charged $ 8 cover free shipping not rep wo...,yes charged $ 8 cover free shipping not rep wo...,0,0,0.974706,0.025294,0
3,1279533,5459,64449,882554,Very useful video,,2,2020-01-04 10:32:19+00:00,2020-01-04 10:32:19+00:00,0,useful video,useful video,0,0,0.980675,0.019325,0
4,2543589,32215,89804,1777705,Osm three hair colour,,2,2020-01-04 13:07:46+00:00,2020-01-04 13:07:46+00:00,0,osm three hair colour,osm three hair colour,0,0,0.986989,0.013011,0


### Merge Cleaned Data with Model Predictions

This cell merges the cleaned comments dataset with the model's spam predictions based on the comment ID, allowing for a unified view of both regex-based and model-based spam detection.

In [None]:
merged = first.merge(second[['commentId', 'predicted_spam']], on='commentId', how='left')
merged.head()

Unnamed: 0,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,duplicatedFlag,cleanedText,cleanedTextSentiment,regex_spam,predicted_spam
0,3166243,41024,6217,26499,Good Information... Will definitely try it......,,0,2020-01-01 16:00:58+00:00,2020-01-01 16:00:58+00:00,0,good information definitely try thanks,good information definitely try thanks : smili...,0,0.0
1,1888757,10004,86296,2608986,"Crystal, is it true that beginning Campaign 3,...",,1,2020-01-04 07:49:54+00:00,2020-01-04 07:49:54+00:00,0,crystal true beginning campaign 3 order get fr...,crystal true beginning campaign 3 order get fr...,0,0.0
2,0,10004,86296,164837,Yes but I am charged $8 to cover your free shi...,1888757.0,0,2020-01-04 07:53:24+00:00,2020-01-04 07:53:24+00:00,0,yes charged $ 8 cover free shipping not rep wo...,yes charged $ 8 cover free shipping not rep wo...,0,0.0
3,1,10004,86296,164837,Youravon.com/cspurlin,1888757.0,0,2020-01-04 07:53:37+00:00,2020-01-04 07:53:37+00:00,0,youravoncomcspurlin,youravoncomcspurlin,1,
4,1279533,5459,64449,882554,Very useful video,,2,2020-01-04 10:32:19+00:00,2020-01-04 10:32:19+00:00,0,useful video,useful video,0,0.0
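
A left join keeps every row of the left frame, and rows with no match on the key receive `NaN` in the joined column, as in the regex-flagged row in the output above. The sketch below uses made-up toy frames and two optional `DataFrame.merge` parameters, `validate` and `indicator`, to guard against duplicated keys and to show which rows found a match. Whether `commentId` is unique in the real data is an assumption that such a check would test.

```python
import pandas as pd

left = pd.DataFrame({"commentId": [10, 11, 12], "regex_spam": [0, 1, 0]})
right = pd.DataFrame({"commentId": [10, 12], "predicted_spam": [0, 1]})

toy_merged = left.merge(
    right[["commentId", "predicted_spam"]],
    on="commentId",
    how="left",
    validate="one_to_one",  # raises MergeError if commentId repeats on either side
    indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
)

print(toy_merged)
```

With unique keys the merged frame has exactly as many rows as `left`; a duplicated key would otherwise multiply rows silently.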


### Display Shape of Merged Data

This cell displays the shape of the merged DataFrame, confirming the number of comments and columns after combining the cleaned data with model predictions.

In [None]:
merged.shape

(4724755, 14)

### Create Final Spam Label

This cell creates a new column 'isSpam' in the merged DataFrame, labeling a comment as spam if it is a duplicate, regex-based spam, or predicted as spam by the model. This provides a comprehensive spam label for each comment.

In [None]:
merged['isSpam'] = np.where(
    (merged['duplicatedFlag'] == 1)
    | (merged['regex_spam'] == 1)
    | (merged['predicted_spam'] == 1),  # condition
    1,  # value if True
    0   # value if False
)
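
Comments that were filtered out before scoring have `NaN` in `predicted_spam` after the left join. In pandas, comparing `NaN` with `== 1` evaluates to `False`, so those rows fall back on the two rule-based flags. A small sketch on made-up rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "duplicatedFlag": [0, 1, 0, 0],
    "regex_spam":     [0, 0, 1, 0],
    "predicted_spam": [0.0, np.nan, np.nan, 1.0],  # NaN = never scored
})

toy["isSpam"] = np.where(
    (toy["duplicatedFlag"] == 1)
    | (toy["regex_spam"] == 1)
    | (toy["predicted_spam"] == 1),  # NaN == 1 is False
    1,
    0,
)

print(toy["isSpam"].tolist())
```

The second and third rows are labeled spam by the rules alone, and the fourth by the model alone.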

### Count Total Spam Comments

This cell calculates and displays the total number of comments labeled as spam in the final merged dataset.

In [None]:
merged['isSpam'].sum()

np.int64(1561060)

### Count Total Non-Spam Comments

This cell calculates and displays the total number of comments labeled as not spam in the final merged dataset.

In [None]:
len(merged) - merged['isSpam'].sum()

np.int64(3059016)

### Save Final Labeled Dataset to CSV

This cell saves the final merged DataFrame, which includes both regex-based and model-based spam labels, to a CSV file for downstream use or further analysis.

In [None]:
merged.to_csv('dataset/final_after_spam.csv', index=False)