# YouTube Spam Comments Detection

This project aims to detect spam comments on YouTube using the Naive Bayes algorithm, specifically the BernoulliNB model. Spam detection is important for maintaining the quality and integrity of online platforms, and YouTube is no exception. In this project, we use a dataset containing YouTube comments and a binary label (Spam or Not Spam) to build a machine learning model for classification.

`Author:` [Syed Muhammad Ebad](https://www.kaggle.com/syedmuhammadebad)\
`Date:` 14-Sept-2024\
[Send me an email](mailto:mohammadebad1@hotmail.com)\
[Visit my GitHub profile](https://github.com/smebad)

[Dataset used in this notebook](https://www.kaggle.com/datasets/lakshmi25npathi/images?resource=download)

## Dataset Description
The dataset contains the following key columns:
- `CONTENT`: The actual text of the YouTube comment.
- `CLASS`: A binary label indicating whether the comment is spam (1) or not spam (0).

The dataset consists of YouTube comments from popular videos. Our goal is to classify the comments into two categories: "Spam" or "Not Spam".

## Steps to Build the Model

### 1. Import Libraries
We start by importing the necessary libraries, including Pandas and NumPy for data manipulation, and libraries from Scikit-learn for vectorizing text data, splitting the dataset, and implementing the Naive Bayes model.

In [46]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

### 2. Import and Review the Dataset
Load the dataset into a DataFrame and take a look at the first few rows to understand the structure of the data. We will also review the data types.

In [47]:
# Load the dataset and take a look at the first 10 rows
df = pd.read_csv("Youtube01-Psy.csv")
print(df.head(10))

# Check for data types and missing values
df.info()


                                    COMMENT_ID            AUTHOR  \
0  LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU         Julius NM   
1  LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A       adam riyati   
2  LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8  Evgeny Murashkin   
3          z13jhp0bxqncu512g22wvzkasxmvvzjaz04   ElNino Melendez   
4          z13fwbwp1oujthgqj04chlngpvzmtt3r3dw            GsMega   
5  LZQPQhLyRh9-wNRtlZDM90f1k0BrdVdJyN_YsaSwfxc      Jason Haddad   
6          z13lfzdo5vmdi1cm123te5uz2mqig1brz04    ferleck ferles   
7        z122wfnzgt30fhubn04cdn3xfx2mxzngsl40k      Bob Kanowski   
8          z13ttt1jcraqexk2o234ghbgzxymz1zzi04              Cony   
9          z12avveb4xqiirsix04chxviiljryduwxg0       BeBe Burkey   

                  DATE                                            CONTENT  \
0  2013-11-07T06:20:48  Huh, anyway check out this you[tube] channel: ...   
1  2013-11-07T12:37:15  Hey guys check out my new channel and our firs...   
2  2013-11-08T17:34:

### 3. Data Preprocessing
We only need the CONTENT and CLASS columns for this task. We will remove any unnecessary columns and also map the CLASS values from 0 and 1 to "Not Spam" and "Spam" for better readability.

In [49]:
# Keep only relevant columns: 'CONTENT' and 'CLASS'
# (.copy() avoids a SettingWithCopyWarning when 'CLASS' is reassigned below)
df = df[["CONTENT", "CLASS"]].copy()
print(df.head(10))

                                             CONTENT  CLASS
0  Huh, anyway check out this you[tube] channel: ...      1
1  Hey guys check out my new channel and our firs...      1
2             just for test I have to say murdev.com      1
3   me shaking my sexy ass on my channel enjoy ^_^ ﻿      1
4            watch?v=vtaRGgvGtWQ   Check this out .﻿      1
5  Hey, check out my new website!! This site is a...      1
6                          Subscribe to my channel ﻿      1
7  i turned it on mute as soon is i came on i jus...      0
8    You should check my channel for Funny VIDEOS!!﻿      1
9  and u should.d check my channel and tell me wh...      1


In [50]:
# Map 'CLASS' from 0 and 1 to 'Not Spam' and 'Spam' for better understanding
df["CLASS"] = df["CLASS"].map({0: "Not Spam", 1: "Spam"})
print(df.head(10))

                                             CONTENT     CLASS
0  Huh, anyway check out this you[tube] channel: ...      Spam
1  Hey guys check out my new channel and our firs...      Spam
2             just for test I have to say murdev.com      Spam
3   me shaking my sexy ass on my channel enjoy ^_^ ﻿      Spam
4            watch?v=vtaRGgvGtWQ   Check this out .﻿      Spam
5  Hey, check out my new website!! This site is a...      Spam
6                          Subscribe to my channel ﻿      Spam
7  i turned it on mute as soon is i came on i jus...  Not Spam
8    You should check my channel for Funny VIDEOS!!﻿      Spam
9  and u should.d check my channel and tell me wh...      Spam
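
The column selection and label mapping above can be sanity-checked before moving on. The following is a minimal sketch on a made-up four-row DataFrame (the `toy` data and the names `subset` and `class_counts` are illustrative assumptions, not part of the original notebook). It confirms that every label was mapped and shows how balanced the two classes are.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV; same column names as the dataset
toy = pd.DataFrame({
    "CONTENT": ["Subscribe to my channel", "great song", "check out my site", "love it"],
    "CLASS": [1, 0, 1, 0],
})

# Take an independent copy of the two columns, then map the numeric labels
subset = toy[["CONTENT", "CLASS"]].copy()
subset["CLASS"] = subset["CLASS"].map({0: "Not Spam", 1: "Spam"})

# Any value outside {0, 1} would have become NaN in the mapping above
assert subset["CLASS"].notna().all()

# Count how many comments fall into each class
class_counts = subset["CLASS"].value_counts()
print(class_counts)
```

Running the same two checks on the real `df` would show whether the dataset is roughly balanced, which matters when interpreting an accuracy score later.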


### 4. Feature Extraction
Since machine learning models can't work directly with text data, we need to convert the comments (text data) into a numerical form. For this, we use CountVectorizer from Scikit-learn, which converts the text into a matrix of token counts.

In [51]:
# Convert text data to numerical form using CountVectorizer
x = np.array(df["CONTENT"])  # Features (comments)
y = np.array(df["CLASS"])    # Labels (Spam or Not Spam)

# Vectorize the text data
cv = CountVectorizer()
x = cv.fit_transform(x)

### 5. Train-Test Split
We split the dataset into training and test sets. The training set will be used to train the model, and the test set will be used to evaluate its performance.

In [53]:
# Split the dataset into training and testing sets (67% training, 33% testing)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.33, random_state=42)
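
As an optional variation that the original notebook does not use, `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both splits. The sketch below uses 100 made-up samples (the arrays `toy_x` and `toy_y` are assumptions for illustration) to show the resulting split sizes and class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, half labeled "Spam"
toy_x = np.arange(100).reshape(-1, 1)
toy_y = np.array(["Spam"] * 50 + ["Not Spam"] * 50)

# stratify=toy_y asks for the same Spam / Not Spam ratio in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.33, random_state=42, stratify=toy_y
)

print(x_tr.shape[0], x_te.shape[0])           # number of train and test samples
print(np.unique(y_te, return_counts=True))    # class counts in the test split
```

Whether stratification changes the result on the real dataset depends on how balanced its classes are; for a roughly balanced dataset the effect is usually small.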

### 6. Model Training: Bernoulli Naive Bayes
We use the Bernoulli Naive Bayes model for classification. This model is designed for binary/boolean features, where each word is treated as either present (1) or absent (0) in a comment. `CountVectorizer` produces counts rather than 0/1 flags, but `BernoulliNB` binarizes its input by default (`binarize=0.0`), so any count above zero is treated as "present".

In [54]:
# Initialize the Bernoulli Naive Bayes model and train it
model = BernoulliNB()
model.fit(xtrain, ytrain)

# Evaluate the model using the test set and print the accuracy score
print(model.score(xtest, ytest))

0.9913793103448276
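
The present/absent view of the features can be checked directly. The sketch below, on four made-up comments (all names and data here are illustrative assumptions), fits one `BernoulliNB` on raw counts and another on a 0/1 matrix from `CountVectorizer(binary=True)`, then compares their predicted probabilities. Because `BernoulliNB` thresholds its input at `binarize=0.0` by default, the two should agree.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical comments with repeated words, so counts differ from 0/1 flags
toy_comments = [
    "check check check out my channel",
    "subscribe subscribe to my channel",
    "great song",
    "love this song song",
]
toy_labels = np.array(["Spam", "Spam", "Not Spam", "Not Spam"])

count_matrix = CountVectorizer().fit_transform(toy_comments)              # raw counts
binary_matrix = CountVectorizer(binary=True).fit_transform(toy_comments)  # 0/1 only

model_counts = BernoulliNB().fit(count_matrix, toy_labels)
model_binary = BernoulliNB().fit(binary_matrix, toy_labels)

# Compare class probabilities from the two models on their own inputs
same = np.allclose(
    model_counts.predict_proba(count_matrix),
    model_binary.predict_proba(binary_matrix),
)
print(same)
```

In other words, how often a word repeats inside a comment does not influence this model; only whether it appears at all.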


### 7. Model Testing with New Data
We can now use the trained model to predict whether a new comment is spam or not. Below are two examples: one with a spam comment and one with a non-spam comment.

Example 1: Testing with a Spam Comment

In [55]:
# Test the model with a sample spam comment
sample = "Hey! Check out my new website https://abc123.com"
sample_vec = cv.transform([sample]).toarray()  # reuse the fitted vectorizer; do not refit
print(model.predict(sample_vec))

['Spam']


Example 2: Testing with a Non-Spam Comment

In [56]:
# Test the model with a sample non-spam comment
sample = "This video is very insightful!"
sample_vec = cv.transform([sample]).toarray()  # reuse the fitted vectorizer; do not refit
print(model.predict(sample_vec))

['Not Spam']
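
The vectorize-then-predict pattern used in both examples can be wrapped into a single helper. The following is a minimal sketch, not the notebook's own method: it assumes Scikit-learn's `make_pipeline` and trains on six made-up comments, and the names `spam_pipeline` and `predict_comment` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data standing in for the real comments
toy_comments = [
    "check out my channel",
    "subscribe to my channel",
    "visit my website",
    "great song",
    "love this video",
    "this song is great",
]
toy_labels = ["Spam", "Spam", "Spam", "Not Spam", "Not Spam", "Not Spam"]

# The pipeline vectorizes raw text and then applies the classifier
spam_pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
spam_pipeline.fit(toy_comments, toy_labels)


def predict_comment(text: str) -> str:
    """Return the predicted label for a single raw comment string."""
    return str(spam_pipeline.predict([text])[0])


print(predict_comment("check out my channel please"))
print(predict_comment("love this great song"))
```

A side effect of this design is that the vocabulary is learned only from the data passed to `fit`, so fitting the pipeline on the training split alone keeps test comments out of the vectorizer.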


## Summary
In this project, we built a model to detect spam comments on YouTube using the Bernoulli Naive Bayes algorithm. We started by importing and exploring the dataset, followed by data preprocessing, feature extraction using CountVectorizer, and model training. The BernoulliNB model was chosen because it works with binary features, which makes it a natural fit for word-presence representations of text such as those used here for spam detection. The model reached an accuracy of about 99% (0.9914) on the held-out test set and classified both sample comments, one spam and one non-spam, as expected.