# YouTube Spam Detection (5 Videos)
## Abstract
This notebook tries to detect spam comments on YouTube videos. The dataset contains comments from 5 videos, and each comment is labeled as one of 2 classes: spam or ham (not spam).<br>
Data Source: https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection<br>

Data Set Information:<br>
The table below lists the datasets, the YouTube video ID, the number of samples in each class, and the total number of samples per dataset.<br>

| Dataset | YouTube ID | # Spam | # Ham | Total |
| --- | --- | --- | --- | --- |
| Psy | 9bZkp7q19f0 | 175 | 175 | 350 |
| KatyPerry | CevxZvSJLk8 | 175 | 175 | 350 |
| LMFAO | KQ6zr6kCPj8 | 236 | 202 | 438 |
| Eminem | uelHwf8o7_U | 245 | 203 | 448 |
| Shakira | pRpeEdMmmQ0 | 174 | 196 | 370 |

First, we train the model on the first video (Psy) only. Then we train it on the comments from all five videos. In both cases the features are Bag of Words (BoW) counts and the classifier is a Random Forest (RF).


In [1]:
import pandas as pd

In [2]:
# Read in The Psy file, and display the tail
df_psy = pd.read_csv('../Intro_AI_Project_3/Youtube01-Psy.csv')
df_psy.tail()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
345,z13th1q4yzihf1bll23qxzpjeujterydj,Carmen Racasanu,2014-11-14T13:27:52,How can this have 2 billion views when there's...,0
346,z13fcn1wfpb5e51xe04chdxakpzgchyaxzo0k,diego mogrovejo,2014-11-14T13:28:08,I don't now why I'm watching this in 2014﻿,0
347,z130zd5b3titudkoe04ccbeohojxuzppvbg,BlueYetiPlayz -Call Of Duty and More,2015-05-23T13:04:32,subscribe to me for call of duty vids and give...,1
348,z12he50arvrkivl5u04cctawgxzkjfsjcc4,Photo Editor,2015-06-05T14:14:48,hi guys please my android photo editor downloa...,1
349,z13vhvu54u3ewpp5h04ccb4zuoardrmjlyk0k,Ray Benich,2015-06-05T18:05:16,The first billion viewed this because they tho...,0


In [3]:
# find number of rows and columns
print(df_psy.shape)

(350, 5)


There are 350 rows and 5 columns

In [4]:
# count number of 0, 1 in CLASS
df_psy.CLASS.value_counts()

1    175
0    175
Name: CLASS, dtype: int64

Comparing the comment content with the class shows that class 0 means not spam and class 1 means spam. The Psy file contains 175 spam rows and 175 non-spam rows.

In [5]:
# Import the Bag of Words function and create an instance
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [6]:
# Fit and transform the comments in one step
dvec = vectorizer.fit_transform(df_psy['CONTENT'])

In [7]:
dvec

<350x1418 sparse matrix of type '<class 'numpy.int64'>'
	with 4354 stored elements in Compressed Sparse Row format>

The size of the sparse matrix is 350x1418: one row per comment and one column per distinct token.
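
To make the rows-by-vocabulary shape concrete, here is a minimal sketch on a toy corpus. The three comments and the names `toy_comments`, `toy_vectorizer`, and `toy_matrix` are made up for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy comments, not taken from the YouTube dataset
toy_comments = [
    "check out my channel",
    "love this song",
    "check this song out",
]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_comments)

# Rows = comments, columns = distinct tokens in the fitted vocabulary
print(toy_matrix.shape)                    # (3, 7)
print(sorted(toy_vectorizer.vocabulary_))  # tokens in column order
print(toy_matrix.toarray())                # word counts per comment
```

Each cell holds how many times a token occurs in a comment, so most cells are zero, which is why scikit-learn returns a sparse matrix.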

In [8]:
# print out the comment at index 349 (the last row)
print(df_psy['CONTENT'][349])

The first billion viewed this because they thought it was really cool, the  other billion and a half came to see how stupid the first billion were...﻿


In [9]:
# Use vectorizer.build_analyzer() to display
# the breakdown of the comment at index 349 into a "bag of words"
analyze = vectorizer.build_analyzer()
analyze(df_psy['CONTENT'][349])

['the',
 'first',
 'billion',
 'viewed',
 'this',
 'because',
 'they',
 'thought',
 'it',
 'was',
 'really',
 'cool',
 'the',
 'other',
 'billion',
 'and',
 'half',
 'came',
 'to',
 'see',
 'how',
 'stupid',
 'the',
 'first',
 'billion',
 'were']

In [10]:
# Give the output of vectorizer.get_feature_names()
# (removed in scikit-learn 1.2; newer versions provide get_feature_names_out() instead)
vectorizer.get_feature_names()



['00',
 '000',
 '02',
 '034',
 '05',
 '08',
 '10',
 '100',
 '100000415527985',
 '10200253113705769',
 '1030',
 '1073741828',
 '11',
 '1111',
 '112720997191206369631',
 '12',
 '123',
 '124',
 '124923004',
 '126',
 '127',
 '13017194',
 '131338190916',
 '1340488',
 '1340489',
 '1340490',
 '1340491',
 '1340492',
 '1340493',
 '1340494',
 '1340499',
 '1340500',
 '1340502',
 '1340503',
 '1340504',
 '1340517',
 '1340518',
 '1340519',
 '1340520',
 '1340521',
 '1340522',
 '1340523',
 '1340524',
 '134470083389909',
 '1415297812',
 '1495323920744243',
 '1496241863981208',
 '1496273723978022',
 '1498561870415874',
 '161620527267482',
 '171183229277',
 '19',
 '19924',
 '1firo',
 '1m',
 '20',
 '2009',
 '2012',
 '2012bitches',
 '2013',
 '2014',
 '201470069872822',
 '2015',
 '2017',
 '210',
 '23',
 '24',
 '24398',
 '243a',
 '279',
 '29',
 '2b',
 '2billion',
 '2x10',
 '300',
 '3000',
 '313327',
 '315',
 '322',
 '33',
 '33gxrf',
 '39',
 '390875584405933',
 '391725794320912',
 '40beuutvu2zkxk4utgpz8k',
 ...

It returns the names of the features (the vocabulary) learned from the dataset. The list is sorted, so tokens that start with digits come first, followed by words.

In [11]:
# shuffle dataset
df_psy_shuffle = df_psy.sample(frac=1)

In [12]:
df_psy_shuffle

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
106,z13qgx0yzwf1uj1xm04ccbkhjnrsgz0i41g,firo mota,2014-11-04T14:38:42,please subscribe i am a new youtuber and need ...,1
343,z13sh3cz1kbqgrai504cf53qsq25ypmi5zs0k,Leonel Hernandez,2014-11-14T12:35:38,"Something to dance to, even if your sad JUST ...",0
112,z12ftpab5svihfffz23kf3iiymiwjzesi,trespasser4000,2014-11-04T22:07:22,This song never gets old love it.﻿,0
18,z12qth5j0ob1fx3q404chvy4fz32tbkpllk0k,Tony K Frazier,2013-11-28T23:57:13,http://ubuntuone.com/40beUutVu2ZKxK4uTgPZ8K﻿,1
303,z12cehoxozfgg3nok04cjj05xznbgrlpfjo,Elieo Cardiopulmonary,2014-11-08T15:29:52,im sorry for the spam but My name is Jenny. I ...,1
...,...,...,...,...,...
110,z123jlf4lzjbgpbcr23yhxyqbpe3gxpvm,TIGERIO_,2014-11-04T19:46:38,EHI GUYS CAN YOU SUBSCRIBE IN MY CHANNEL? I AM...,1
80,z122f1fy5muwdxdxd04cexyxes3hh5hrifc,Squir3,2014-11-02T18:34:03,http://woobox.com/33gxrf/brt0u5 FREE CS GO!!!!﻿,1
224,z12sjrqiurm3sd4rh04chz4oplrmhzmgzmg0k,Stronzo Chicheritr,2014-11-07T20:01:15,prehistoric song..has been﻿,0
144,z13osluqrpefv1hd323idhejzxanc3ai004,Tyrek Sings,2014-11-05T22:50:02,CHECK MY CHANNEL OUT PLEASE. I DO SINGING COVERS﻿,1


In [13]:
# create training and testing set
# first 300 rows for training and the remaining 50 for testing
d_train = df_psy_shuffle.iloc[:300]
d_test = df_psy_shuffle.iloc[300:]

In [14]:
# d_train

In [15]:
# d_test

In [16]:
# Create the training and testing BoW attributes:
# fit_transform on the training set, transform only on the test set
d_train_att = vectorizer.fit_transform(d_train['CONTENT'])
d_test_att = vectorizer.transform(d_test['CONTENT'])

In [17]:
d_train_att

<300x1274 sparse matrix of type '<class 'numpy.int64'>'
	with 3759 stored elements in Compressed Sparse Row format>

In [18]:
d_test_att.shape

(50, 1274)

In [19]:
# training and testing labels
d_train_label = d_train['CLASS']
d_test_label = d_test['CLASS']

In [20]:
d_train_att

<300x1274 sparse matrix of type '<class 'numpy.int64'>'
	with 3759 stored elements in Compressed Sparse Row format>

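d_train_att is a 300x1274 matrix. The number of columns is the vocabulary size of the 300 training comments, so it changes with the random shuffle.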

In [21]:
d_test_att

<50x1274 sparse matrix of type '<class 'numpy.int64'>'
	with 450 stored elements in Compressed Sparse Row format>

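d_test_att is a 50x1274 matrix. It has the same number of columns as d_train_att because `transform` reuses the vocabulary fitted on the training set.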
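
Both matrices shown above have 1274 columns. A minimal sketch of why, using made-up comments (`train_comments`, `test_comments`, and `bow` are hypothetical names): `fit_transform` learns the vocabulary, while `transform` only counts words that are already in it and ignores the rest.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not taken from the dataset
train_comments = ["subscribe to my channel", "great song"]
test_comments = ["great channel, please subscribe now"]

bow = CountVectorizer()
train_att = bow.fit_transform(train_comments)  # learns the vocabulary
test_att = bow.transform(test_comments)        # reuses it, learns nothing new

print(train_att.shape)     # (2, 6)
print(test_att.shape)      # (1, 6)
print(test_att.toarray())  # "please" and "now" are not counted
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the classifier trained on the training columns could not be applied to it.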

In [22]:
# create random forest classifier with 80 trees
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=80)
clf.fit(d_train_att, d_train_label)

In [23]:
clf.score(d_test_att, d_test_label)

0.92

The random forest classifier scores 92% accuracy on the test set

In [24]:
# Create a confusion matrix for the test labels and prediction labels, and output array
predicted_labels = clf.predict(d_test_att)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(d_test_label, predicted_labels)

In [25]:
cm

array([[24,  0],
       [ 4, 22]], dtype=int64)
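
In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, in sorted label order (0 = not spam, 1 = spam). The sketch below copies the four numbers from the output above and derives accuracy, precision, and recall for the spam class by hand.

```python
# Values copied from the confusion matrix output above
cm = [[24, 0],
      [4, 22]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of real spam that was caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.92 precision=1.00 recall=0.85
```

On this split no ham comment was flagged as spam, while 4 spam comments were missed.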

In [26]:
# Cross validate and output your accuracy
from sklearn.model_selection import cross_val_score
score = cross_val_score(clf,d_train_att, d_train_label, cv=5)

print('Random forest accuracy: ', round(score.mean(), 4), '(+/- ', round(score.std()*2, 4),')')

Random forest accuracy:  0.96 (+/-  0.0686 )


Cross-validation splits the dataset into several portions (folds). In each iteration one fold is used for testing and the rest for training. In the cross-validation above the data is divided into 5 folds: in iteration 1 the first fold is the test set and the other four are the training set, in iteration 2 the second fold is the test set, and so on until every fold has been used for testing once. The reported accuracy is the mean of the 5 scores.
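
The fold rotation can be sketched in plain Python. The helper `k_fold_indices` is a hypothetical, simplified illustration. It is not what `cross_val_score` runs internally: for a classifier with an integer `cv`, scikit-learn uses stratified folds that also keep the class balance in each fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# 10 samples, 5 folds: each sample is used for testing exactly once
for i, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"iteration {i}: test={test_idx} train={train_idx}")
```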

In [27]:
# Load and concatenate all 5 comment files
concat = [pd.read_csv('Youtube01-Psy.csv'), 
          pd.read_csv('Youtube02-KatyPerry.csv'), 
          pd.read_csv('Youtube03-LMFAO.csv'), 
          pd.read_csv('Youtube04-Eminem.csv'), 
          pd.read_csv('Youtube05-Shakira.csv')]
combined_csv = pd.concat(concat, ignore_index=True)

In [28]:
combined_csv

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1
...,...,...,...,...,...
1951,_2viQ_Qnc6-bMSjqyL1NKj57ROicCSJV5SwTrw-RFFA,Katie Mettam,2013-07-13T13:27:39.441000,I love this song because we sing it at Camp al...,0
1952,_2viQ_Qnc6-pY-1yR6K2FhmC5i48-WuNx5CumlHLDAI,Sabina Pearson-Smith,2013-07-13T13:14:30.021000,I love this song for two reasons: 1.it is abou...,0
1953,_2viQ_Qnc6_k_n_Bse9zVhJP8tJReZpo8uM2uZfnzDs,jeffrey jules,2013-07-13T12:09:31.188000,wow,0
1954,_2viQ_Qnc6_yBt8UGMWyg3vh0PulTqcqyQtdE7d4Fl0,Aishlin Maciel,2013-07-13T11:17:52.308000,Shakira u are so wiredo,0


In [29]:
# Display the length of the file as well as the breakdown into spam and non-spam
print('Length of file: ', len(combined_csv))
print('Length of spam: ', len(combined_csv[combined_csv.CLASS == 1]))
print('Length of non spam: ', len(combined_csv[combined_csv.CLASS == 0]))

Length of file:  1956
Length of spam:  1005
Length of non spam:  951


In [30]:
# Shuffle the new data, and create content and label sets, d_content, and d_label
combined_df = combined_csv.sample(frac=1)
d_content = combined_df['CONTENT']
d_label = combined_df['CLASS']

The advantage of a pipeline is less code and less room for mistakes. Instead of calling fit and transform on each step separately, the programmer only calls fit on the pipeline, which also makes the code easier to read.

Reference: <br>
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html<br>
https://medium.com/mlearning-ai/how-to-use-sklearns-pipelines-to-optimize-your-analysis-b6cd91999be
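
A minimal side-by-side sketch of the two styles, on made-up data (`texts`, `labels`, and `toy_pipeline` are hypothetical names, and the fixed `random_state` is only there so both versions build the same forest):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam, 0 = not spam
texts = ["subscribe to my channel", "check out my channel",
         "love this song", "this song is great"]
labels = [1, 1, 0, 0]

# Manual version: the vectorizer and the classifier are handled separately
manual_vec = CountVectorizer()
manual_clf = RandomForestClassifier(n_estimators=10, random_state=0)
manual_clf.fit(manual_vec.fit_transform(texts), labels)
manual_pred = manual_clf.predict(manual_vec.transform(["my channel"]))

# Pipeline version: one object that takes raw text in fit and predict
toy_pipeline = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
toy_pipeline.fit(texts, labels)
pipe_pred = toy_pipeline.predict(["my channel"])

print(manual_pred, pipe_pred)  # both versions give the same prediction
```

A further benefit shows up in cross-validation: when the whole pipeline is passed to `cross_val_score`, the vectorizer is refitted on the training folds only, so no vocabulary leaks in from the test fold.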

In [31]:
# Import both Pipeline and make_pipeline from sklearn
from sklearn.pipeline import Pipeline, make_pipeline

In [32]:
# create pipeline
pipeline = Pipeline([('bag-of-words', CountVectorizer()),
                     ('random forest', RandomForestClassifier())])

In [33]:
# output the pipeline
pipeline

In [34]:
# Fit your pipeline with the first 1500 entries of the content and the labels. Output.
pipeline.fit(d_content[:1500], d_label[:1500])

In [35]:
# Use .score to score your pipeline
pipeline.score(d_content[1500:], d_label[1500:])

0.9473684210526315

In [36]:
# Use pipeline to predict spam or not
pipeline.predict(["what a neat video!"])

array([0], dtype=int64)

In [37]:
# Use pipeline to predict spam or not
pipeline.predict(["plz subscribe to my channel"])

array([1], dtype=int64)

For "what a neat video!", the model predicts 0, which means not spam.<br> 
For "plz subscribe to my channel", the model predicts 1, which means the comment is spam.

In [38]:
# Cross validate pipeline using d_content and d_label. Set cv=5
scores = cross_val_score(pipeline, d_content, d_label, cv=5)

In [39]:
# Print out the accuracy (scores.mean) +/- 2 sd
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std()*2))

Accuracy: 0.96 (+/- 0.02)


In [40]:
# Create second pipeline named pipeline2 which incorporates the TfidfTransformer
from sklearn.feature_extraction.text import TfidfTransformer
pipeline2 = make_pipeline(CountVectorizer(), TfidfTransformer(norm=None), RandomForestClassifier())

In [41]:
# Cross validate pipeline2 using d_content and d_label.
# Output the accuracy (scores.mean) +/- 2 sd. Set cv=5
scores = cross_val_score(pipeline2, d_content, d_label, cv=5)
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std()*2))

Accuracy: 0.96 (+/- 0.02)
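
pipeline2 places a `TfidfTransformer(norm=None)` between the word counts and the forest. A minimal sketch of what that step computes, assuming the default `smooth_idf=True`: each count is multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The documents in `docs` are made up for illustration.

```python
import math

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents, not taken from the dataset
docs = ["check my channel", "check this song", "song song song"]

counts = CountVectorizer().fit(docs)
tf = counts.transform(docs)
tfidf = TfidfTransformer(norm=None).fit_transform(tf).toarray()

n_docs = len(docs)
col = counts.vocabulary_["song"]
df_song = 2  # "song" appears in two of the three documents
idf_song = math.log((1 + n_docs) / (1 + df_song)) + 1

# The third document contains "song" three times
print(tfidf[2, col], 3 * idf_song)
```

Terms that appear in many comments get a smaller weight than rare ones. With `norm=None` the rows are not rescaled to unit length afterwards.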


In [42]:
# Output the steps of pipeline2
pipeline2.steps

[('countvectorizer', CountVectorizer()),
 ('tfidftransformer', TfidfTransformer(norm=None)),
 ('randomforestclassifier', RandomForestClassifier())]

In [43]:
# parameter search
parameters = {
    'countvectorizer__max_features': (None, 1000, 2000),
    'countvectorizer__ngram_range': ((1,1), (1,2)), # unigrams only, or unigrams plus bigrams
    'countvectorizer__stop_words': ('english', None),
    'tfidftransformer__use_idf': (True, False), # effectively turn on/off tfidf
    'randomforestclassifier__n_estimators': (20, 59, 100)
}

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipeline2, parameters, n_jobs=-1, verbose=1)

The code above sets up a grid search to optimize the parameters. The vocabulary can be unlimited (None) or capped at the 1000 or 2000 most frequent terms. The n-gram range selects single words only, or single words plus pairs of words. English stop words are either removed or kept, IDF weighting is turned on or off, and the random forest uses 20, 59, or 100 trees.
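
The size of the search follows directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. The sketch below copies the `parameters` dictionary from the cell above and assumes the 5-fold default of `GridSearchCV`, since no `cv` argument is passed.

```python
from math import prod

# Copied from the parameters dictionary in the cell above
parameters = {
    'countvectorizer__max_features': (None, 1000, 2000),
    'countvectorizer__ngram_range': ((1, 1), (1, 2)),
    'countvectorizer__stop_words': ('english', None),
    'tfidftransformer__use_idf': (True, False),
    'randomforestclassifier__n_estimators': (20, 59, 100),
}

n_folds = 5  # assumption: GridSearchCV default when cv is not given
n_candidates = prod(len(values) for values in parameters.values())

print(n_candidates)            # 72
print(n_candidates * n_folds)  # 360
```

Adding one more value to any parameter multiplies the number of fits, so grids grow quickly.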

In [44]:
# Using .fit and d_content, and d_labels, perform the grid search
grid_search.fit(d_content, d_label)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


In [45]:
# grid_search.best_index_

In [46]:
# grid_search.best_estimator_

In [47]:
print('Best score: %0.3f ' % grid_search.best_score_)
print('Best parameters set: ')
for name, value in grid_search.best_params_.items():
    print('\t', name, ':', value)

Best score: 0.959 
Best parameters set: 
	 countvectorizer__max_features : None
	 countvectorizer__ngram_range : (1, 2)
	 countvectorizer__stop_words : None
	 randomforestclassifier__n_estimators : 59
	 tfidftransformer__use_idf : True


According to this output, the best score was 0.959, obtained with no limit on the vocabulary size, single words plus pairs of words, no stop-word removal, 59 trees, and IDF weighting turned on.