Spam comment detection means classifying comments as spam or not spam. YouTube is one of the platforms that uses Machine Learning to filter spam comments automatically, protecting its creators from them.

### Spam Comments Detection
Detecting spam comments is a text classification task in Machine Learning. Spam comments on social media platforms are comments posted to redirect the user to another social media account, website, or other piece of content.

To detect spam comments with Machine Learning, we need a labelled dataset containing both spam and non-spam comments. Luckily, I found a dataset of YouTube spam comments on Kaggle that suits this task.

### Spam Comments Detection using Python
Let’s start this task by importing the necessary Python libraries and the dataset:

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

In [2]:
data = pd.read_csv("Youtube01-Psy.csv")
print(data.sample(5))

                                COMMENT_ID         AUTHOR  \
129  z13xhp2xmkeowtbyo04cffggylf3yjjpsns0k       Brotha B   
193    z130xbcwfnj5vlskv23airsxqqfqvhl5504      Adam Mudd   
10   z13auhww3oufjn1qo04ci3grqqjmfjexxuo0k      Huckyduck   
153    z13xsxxqlxnzz3d3g22wfjx5lsrjgpaic04  MasterRobotTV   
286    z12oxlzh4qjicd2zu04cgfabqtipf3gq4is      Susan Jay   

                    DATE                                            CONTENT  \
129  2014-11-05T16:18:58  Like getting Gift cards..but hate spending the...   
193  2014-11-07T12:16:08  How are there 2 billion views and theres only ...   
10   2013-11-28T17:06:17                               Hey subscribe to me﻿   
153  2014-11-06T05:40:43                http://www.twitch.tv/zxlightsoutxz﻿   
286  2014-11-08T10:04:22  Enough with the whole "how does this have two ...   

     CLASS  
129      1  
193      0  
10       1  
153      1  
286      0  


We only need the CONTENT and CLASS columns from the dataset for the rest of the task. So let’s select both columns and move further:

In [3]:
data = data[["CONTENT", "CLASS"]].copy()  # copy so the later column assignment does not act on a slice
print(data.sample(5))

                                               CONTENT  CLASS
47   http://www.avaaz.org/po/petition/Youtube_Corpo...      1
120    This has had over 2 billion views.  Holy shit.﻿      0
32                   sub my channel for no reason -_-﻿      1
136  Dance dance,,,,,Psy  http://www.reverbnation.c...      1
172           For Christmas Song visit my channel! ;)﻿      1


The CLASS column contains the values 0 and 1, where 0 indicates not spam and 1 indicates spam. To make the output easier to read, I will replace 0 and 1 with the labels "Not Spam" and "Spam Comment":

In [4]:
data["CLASS"] = data["CLASS"].map({0: "Not Spam", 
                                  1: "Spam Comment"})

print(data.sample(5))

                                               CONTENT         CLASS
30   everyone please come check our newest song in ...  Spam Comment
70                                  2 Billions in 2014      Not Spam
57   Subscribe and like to me for more how to video...  Spam Comment
157              Follow me on Twitter @mscalifornia95﻿  Spam Comment
332  The girl in the train who was dancing, her out...      Not Spam


### Training a Classification Model
Now let’s move further by training a Machine Learning model to classify comments as spam or not spam. As this is a binary classification problem, I will use the Bernoulli Naive Bayes algorithm, which treats each word as either present in or absent from a comment:

In [5]:
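# Raw comment text and its label
x = np.array(data["CONTENT"])
y = np.array(data["CLASS"])

# Turn each comment into a vector of word counts
cv = CountVectorizer()
x = cv.fit_transform(x)
# Hold out 20% of the comments for testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, random_state = 42)

# BernoulliNB reduces each count to word present/absent (binarize=0.0 by default)
model = BernoulliNB()
model.fit(xtrain, ytrain)
# Accuracy on the held-out comments
print(model.score(xtest, ytest))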

0.9857142857142858
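
To see roughly what the Bernoulli Naive Bayes model computes, here is a minimal from-scratch sketch in plain Python. It is not scikit-learn's implementation: the function names (`tokenize`, `train_bernoulli_nb`, `predict`), the whitespace tokenizer, and the four toy comments are all assumptions made for illustration. Each class gets a smoothed probability that a given word appears in a comment, and a new comment is scored by adding up log-probabilities for the words that are present and for the words that are absent.

```python
import math

def tokenize(text):
    # Lowercase and split on whitespace; a crude stand-in for a real tokenizer.
    return set(text.lower().split())

def train_bernoulli_nb(comments, labels, alpha=1.0):
    docs = [tokenize(c) for c in comments]
    vocab = sorted(set().union(*docs))
    classes = sorted(set(labels))
    model = {"vocab": vocab, "classes": classes, "log_prior": {}, "p_word": {}}
    for cls in classes:
        cls_docs = [d for d, label in zip(docs, labels) if label == cls]
        model["log_prior"][cls] = math.log(len(cls_docs) / len(docs))
        # Smoothed probability that a word appears in a comment of this class.
        model["p_word"][cls] = {
            w: (sum(w in d for d in cls_docs) + alpha) / (len(cls_docs) + 2 * alpha)
            for w in vocab
        }
    return model

def predict(model, text):
    words = tokenize(text)
    scores = {}
    for cls in model["classes"]:
        score = model["log_prior"][cls]
        for w in model["vocab"]:
            p = model["p_word"][cls][w]
            # Present words add log(p); absent words add log(1 - p).
            score += math.log(p) if w in words else math.log(1.0 - p)
        scores[cls] = score
    return max(scores, key=scores.get)

# Hypothetical toy data, not taken from the Kaggle dataset.
toy_comments = [
    "subscribe to my channel",
    "check out my channel",
    "great song",
    "love this song",
]
toy_labels = ["Spam Comment", "Spam Comment", "Not Spam", "Not Spam"]

toy_model = train_bernoulli_nb(toy_comments, toy_labels)
print(predict(toy_model, "subscribe to my channel please"))
print(predict(toy_model, "love this great song"))
```

Words that never appeared in training (such as "please" above) are simply ignored, which mirrors how a fitted vocabulary drops unseen words.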


Now let’s test the model by giving spam and not spam comments as input:

In [6]:
sample = "Check this out: https://thecleverprogrammer.com/"
features = cv.transform([sample]).toarray()  # separate name so the DataFrame in `data` is not overwritten
print(model.predict(features))

['Spam Comment']


In [7]:
sample = "Lack of information!"
features = cv.transform([sample]).toarray()  # separate name so the DataFrame in `data` is not overwritten
print(model.predict(features))

['Not Spam']
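
The cells above call `cv.transform` by hand before every prediction. One possible alternative is to chain the vectorizer and the classifier with scikit-learn's `make_pipeline`, so that raw strings can be passed straight to `predict`. The sketch below uses a handful of made-up comments in place of the Kaggle data, and the name `spam_filter` is my own; with the real dataset you would fit the pipeline on the training split of the raw CONTENT column only, which also keeps the test comments out of the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy comments standing in for the CONTENT and CLASS columns.
comments = [
    "subscribe to my channel",
    "check out my channel",
    "great song",
    "love this song",
]
labels = ["Spam Comment", "Spam Comment", "Not Spam", "Not Spam"]

# The pipeline vectorizes the text and then feeds it to the classifier.
spam_filter = make_pipeline(CountVectorizer(), BernoulliNB())
spam_filter.fit(comments, labels)

# Raw strings go in; no manual transform step is needed.
predictions = spam_filter.predict(["Check out my channel", "Love this song"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, the same object can be saved and reused later without keeping a separate `cv` around.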


So this is how you can train a Machine Learning model for the task of spam detection using Python.

### Summary
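Spam comment detection means classifying comments as spam or not spam. Spam comments on social media platforms are comments posted to redirect the user to another social media account, website, or other piece of content. In this article, we trained a Bernoulli Naive Bayes model on a labelled YouTube comments dataset and used it to classify new comments.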