#YouTube Spam Collection Data Set (Part 1)

Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection)

Original Source: [YouTube Spam Collection v. 1](http://dcomp.sor.ufscar.br/talmeida/youtubespamcollection/)

> Alberto, T.C., Lochter J.V., Almeida, T.A. __Filtragem Automática de Spam nos Comentários do YouTube.__ Anais do XII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC'15), Natal, RN, Brazil, 2015. ([preprint](http://dcomp.sor.ufscar.br/talmeida/papers/TCA_ENIAC15.pdf))

> Alberto, T.C., Lochter J.V., Almeida, T.A. __TubeSpam: Comment Spam Filtering on YouTube.__ Proceedings of the 14th IEEE International Conference on Machine Learning and Applications (ICMLA'15), 1-6, Miami, FL, USA, December, 2015. ([preprint](http://dcomp.sor.ufscar.br/talmeida/papers/TCA_ICMLA15.pdf))

##Contents
* [1 Data Set Description](#section1)
* [2 Approach](#section2)
* [3 Solution](#section3)
  * [3a Import modules](#section3a)
  * [3b Read the data set](#section3b)
  * [3c Data cleanup](#section3c)
  * [3d Split the data](#section3d)
  * [3e Transform the data](#section3e)
  * [3f Build the model](#section3f)
  * [3g Run predictions](#section3g)
  * [3h Score the prediction](#section3h)
  * [3i Analyze the results](#section3i)
  * [3j Other learnings](#section3j)
* [4 Summary](#section4)

<a id='section1'></a>
##1. Data Set Description

From the description accompanying the data set, "the samples were extracted from the comments section of five videos that were among the 10 most viewed on YouTube during the collection period."

The data is available as five separate data sets, one per video, and each comment is labeled 1 for "spam" or 0 for "ham".

<a id='section2'></a>
##2. Approach
Since the collection is split across five data sets, we will take two passes at the data.

In this notebook, we will only consider the Psy data set. We will use this as a way to wrap our heads around the problem. We will not do any model tuning in this round.

Our second pass will involve merging all five data sets and then running the classification on the combined data set. In this round, we will also tune the model and the vectorizer to eke out some improvements. The notebook for this can be accessed [here](https://github.com/vkpedia/databuff/blob/master/random-walks/YouTube-Spam/YouTube_Spam_Collection%20%28Part%202%29.ipynb).

<a id='section3'></a>
##3. Solution

<a id='section3a'></a>
###Import initial set of modules

In [1]:
# Import modules

import numpy as np
import pandas as pd

<a id='section3b'></a>
###Read in the data from the first CSV alone

In [2]:
# Read the data set; print the first few rows
# A forward slash works on Windows as well as on Linux and macOS
psy = pd.read_csv('data/Youtube01-Psy.csv')
psy.head()

| | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS |
|---|---|---|---|---|---|
| 0 | LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU | Julius NM | 2013-11-07T06:20:48 | Huh, anyway check out this you[tube] channel: ... | 1 |
| 1 | LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A | adam riyati | 2013-11-07T12:37:15 | Hey guys check out my new channel and our firs... | 1 |
| 2 | LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 | Evgeny Murashkin | 2013-11-08T17:34:21 | just for test I have to say murdev.com | 1 |
| 3 | z13jhp0bxqncu512g22wvzkasxmvvzjaz04 | ElNino Melendez | 2013-11-09T08:28:43 | me shaking my sexy ass on my channel enjoy ^_^ | 1 |
| 4 | z13fwbwp1oujthgqj04chlngpvzmtt3r3dw | GsMega | 2013-11-10T16:05:38 | watch?v=vtaRGgvGtWQ Check this out . | 1 |


<a id='section3c'></a>
###Data cleanup

In [3]:
# Check for missing values
psy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 5 columns):
COMMENT_ID    350 non-null object
AUTHOR        350 non-null object
DATE          350 non-null object
CONTENT       350 non-null object
CLASS         350 non-null int64
dtypes: int64(1), object(4)
memory usage: 13.8+ KB


In [4]:
# Looks like there are no missing values. Let's proceed.
# Of the five columns, the only relevant columns for spam/ham classification are the CONTENT and CLASS columns.
# We will use just these two columns. But first, let's check the distribution of spam and ham 

psy.CLASS.value_counts()

1    175
0    175
Name: CLASS, dtype: int64

In [5]:
# The classes are perfectly balanced (175 each), so always guessing one class would score 50%.
# For a data set this small, that balance is helpful: the algorithm gets
# an equal number of examples of each class to learn from
# Now, let us set up our X and y
X = psy.CONTENT
y = psy.CLASS

<a id='section3d'></a>
###Split the data

In [6]:
# Let us now split the data set into train and test sets
# We will use an 80/20 split
test_size = 0.2

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=test_size)

<a id='section3e'></a>
###Transform the data

In [7]:
# Set up a vectorizer, and create a Document-Term matrix
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)

In [8]:
# Check the layout of the Document-Term matrix
X_train_dtm

<280x1221 sparse matrix of type '<class 'numpy.int64'>'
	with 3560 stored elements in Compressed Sparse Row format>
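
To make the shape above concrete (280 training comments by 1,221 distinct tokens), here is a minimal standard-library sketch of how a document-term matrix is assembled. The three comments are hypothetical and not drawn from the Psy data set. The regular expression is the one scikit-learn documents as the default `token_pattern` of `CountVectorizer`. The real vectorizer also stores the result as a sparse matrix, which this sketch does not.

```python
import re
from collections import Counter

# Hypothetical toy comments (assumption: not taken from the Psy data set)
docs = [
    "Check out my channel",
    "I love this song",
    "check out this song on my channel",
]

# Lowercase the text, then keep runs of two or more word characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


# Vocabulary: sorted unique tokens, one column per token
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one count per vocabulary entry
dtm = []
for doc in docs:
    counts = Counter(tokenize(doc))
    dtm.append([counts[tok] for tok in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Note that the single-letter word "I" never reaches the vocabulary, because the pattern requires at least two word characters.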

<a id='section3f'></a>
###Build the model

In [9]:
# We will build a Naive Bayes model for the Psy data set
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# Fit the training data (the DTM, to be precise)
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
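
The `alpha=1.0` in the output above is the additive (Laplace) smoothing term. As a rough illustration of what the classifier computes, the sketch below scores a comment with a log prior plus smoothed per-token log likelihoods. It is a simplified standard-library reimplementation for intuition only, not scikit-learn's code, and the toy training data and helper names (`log_posterior`, `predict`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy training data: (tokens, label), 1 = spam, 0 = ham
train = [
    (["subscribe", "my", "channel"], 1),
    (["check", "my", "channel"], 1),
    (["love", "this", "song"], 0),
    (["billion", "views", "love"], 0),
]
alpha = 1.0  # additive smoothing, matching the MultinomialNB default

vocab = sorted({t for tokens, _ in train for t in tokens})
classes = sorted({label for _, label in train})

# Per-class token counts and per-class document counts
token_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for tokens, label in train:
    token_counts[label].update(tokens)
    doc_counts[label] += 1


def log_posterior(tokens, c):
    """Unnormalized log P(c | tokens) under multinomial Naive Bayes."""
    total = sum(token_counts[c].values())
    score = math.log(doc_counts[c] / len(train))  # log prior
    for t in tokens:
        if t in vocab:  # tokens unseen in training are ignored
            smoothed = (token_counts[c][t] + alpha) / (total + alpha * len(vocab))
            score += math.log(smoothed)
    return score


def predict(tokens):
    return max(classes, key=lambda c: log_posterior(tokens, c))


print(predict(["subscribe", "channel"]))  # → 1 (spam)
print(predict(["love", "song"]))          # → 0 (ham)
```

Smoothing is what keeps a token that never appeared in one class from forcing that class's probability to zero.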

<a id='section3g'></a>
###Run the prediction

In [10]:
# Transform the test data to a DTM and predict
X_test_dtm = vect.transform(X_test)
y_pred_nb = nb.predict(X_test_dtm)
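
The key detail in this step is that the test data is only transformed, never fitted, so the vocabulary comes from the training data alone. One optional way to make that hard to get wrong is to chain the vectorizer and the classifier with scikit-learn's `make_pipeline`. The sketch below is an alternative arrangement, not what this notebook does, and the toy comments and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy comments; 1 = spam, 0 = ham
texts = [
    "subscribe to my channel",
    "check out my channel",
    "free gift cards subscribe",
    "great song love it",
    "this video has billion views",
    "love this song",
]
labels = [1, 1, 1, 0, 0, 0]

# fit() learns the vocabulary and the model from the training texts only;
# predict() reuses that vocabulary to transform new raw text
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["subscribe to my channel please", "love this song"])
print(pred)
```

Words that were not seen during fitting, such as "please" here, are simply dropped at prediction time.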

<a id='section3h'></a>
###Score the prediction

In [11]:
# Let us check the accuracy score
# It needs to be better than 50%, which was the baseline
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
accuracy_score(y_test, y_pred_nb)

0.97142857142857142

In [12]:
# The accuracy score is about 97%, far above the 50% baseline.
# Let us check the confusion matrix to get a sense of the prediction distribution
confusion_matrix(y_test, y_pred_nb)

array([[26,  1],
       [ 1, 42]])
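
In scikit-learn's layout, rows of the confusion matrix are the true classes and columns are the predicted classes, so the matrix above reads as `[[TN, FP], [FN, TP]]` with spam (1) as the positive class. The short sketch below derives accuracy, precision, recall, and F1 from those four numbers by plain arithmetic. The variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows are true classes (ham, spam), columns are predicted classes
cm = [[26, 1],
      [1, 42]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of comments flagged as spam, how many were spam
recall = tp / (tp + fn)     # of actual spam comments, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Because there is exactly one false positive and one false negative, precision and recall coincide at 42/43.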

In [13]:
# The model predicted 68 out of 70 instances correctly
# We had one false positive and one false negative

# What was the false positive comment? (That is, ham marked as spam)
X_test[y_pred_nb > y_test]

208    P E A C E  &amp;  L O V E  ! !﻿
Name: CONTENT, dtype: object

In [14]:
# And what was the false negative comment? (That is, a spam comment that went undetected)
X_test[y_pred_nb < y_test]

288    if i reach 100 subscribers i will go round in ...
Name: CONTENT, dtype: object

Both cases are interesting. A plausible reason for the false positive is that the vectorizer could not recover the words "PEACE" and "LOVE", because every letter is separated by a space and so never forms a word-length token.
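
To check what the vectorizer actually receives from the false positive comment, the sketch below applies the regular expression that scikit-learn documents as the default `token_pattern` of `CountVectorizer`, after lowercasing. This mimics the default tokenization with the standard library rather than calling the vectorizer fitted in this notebook.

```python
import re

# Default token pattern of CountVectorizer: two or more word characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")

comment = "P E A C E  &amp;  L O V E  ! !"
tokens = TOKEN.findall(comment.lower())
print(tokens)  # → ['amp']
```

Only `amp`, a leftover from the HTML entity `&amp;`, survives. That token also appears in this notebook's list of the most spam-associated words, which may be why the comment was flagged.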

The false negative could be down to how the comment was worded. We will look at the top spam and ham words shortly, but first let us check the area under the ROC curve, which I expect to be high.

In [15]:
roc_auc_score(y_test, nb.predict_proba(X_test_dtm)[:, 1])

0.9982773471145564

And it is! Now, let us check out the top spam and ham keywords.
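
One way to read the ROC AUC: it equals the fraction of (spam, ham) pairs in which the spam comment gets the higher score, with ties counted as half. The test set holds 43 spam and 27 ham comments, i.e. 1,161 pairs, and 1159/1161 reproduces the value reported above. The function below is a minimal standard-library sketch of that pairwise definition, not scikit-learn's implementation, and the toy labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: one of the four pairs is ordered wrongly
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # → 0.75

# The value reported in this notebook matches 1159 of 1161 pairs
print(1159 / (43 * 27))
```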

<a id='section3i'></a>
###Analysis

In [16]:
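# Get the tokens from the vectorization process earlier
# (get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use it instead)
X_train_tokens = vect.get_feature_names_out()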

In [17]:
# Here is the feature count from the Naive Bayes model
# The first row represents class 0, i.e. ham, and the second row represents class 1, i.e. spam
nb.feature_count_

array([[ 14.,   1.,   0., ...,   1.,   1.,   1.],
       [  0.,   0.,   1., ...,   0.,   0.,   0.]])

In [18]:
ham_token_count = nb.feature_count_[0, :]
ham_token_count

array([ 14.,   1.,   0., ...,   1.,   1.,   1.])

In [19]:
spam_token_count = nb.feature_count_[1, :]
spam_token_count

array([ 0.,  0.,  1., ...,  0.,  0.,  0.])

In [20]:
# We will create a data frame of all tokens and their corresponding ham and spam counts
tokens = pd.DataFrame(
    {'token': X_train_tokens, 'ham': ham_token_count, 'spam': spam_token_count}
).set_index('token')

# Here are 10 random values drawn from the data frame
# Note that these are absolute occurrences
tokens.sample(10, random_state=50)

| token | ham | spam |
|---|---|---|
| appreciate | 0.0 | 3.0 |
| getting | 2.0 | 2.0 |
| drugs | 0.0 | 1.0 |
| giveaways | 0.0 | 1.0 |
| chinese | 2.0 | 0.0 |
| got | 4.0 | 3.0 |
| who | 3.0 | 2.0 |
| stay | 0.0 | 1.0 |
| number | 1.0 | 0.0 |
| _thqbeum69aqup1ih | 0.0 | 1.0 |


In [21]:
# Take the word "getting", for example. Is it a ham word or a spam word?
# It appears twice in each class, but calling it equally likely in both would be incorrect,
# because the training set does not contain the same number of ham and spam comments
# In this step, we normalize the counts by the number of comments in each class to get comparable rates
class_count = nb.class_count_

class_count

array([ 148.,  132.])

In [22]:
tokens['ham'] = tokens['ham'] / class_count[0]
tokens['spam'] = tokens['spam'] / class_count[1]

tokens.sample(10, random_state=50)

| token | ham | spam |
|---|---|---|
| appreciate | 0.0 | 0.022727 |
| getting | 0.013514 | 0.015152 |
| drugs | 0.0 | 0.007576 |
| giveaways | 0.0 | 0.007576 |
| chinese | 0.013514 | 0.0 |
| got | 0.027027 | 0.022727 |
| who | 0.02027 | 0.015152 |
| stay | 0.0 | 0.007576 |
| number | 0.006757 | 0.0 |
| _thqbeum69aqup1ih | 0.0 | 0.007576 |


In [23]:
# This gives us a sense of the relative spamminess and hamminess of a word
# In this case, 'getting' is slightly more spammy

# Now that we know this, we can compute which words tend to be the most spammy
# and which tend to be the least spammy
# Since some words have a ham rate of 0, we add the same small value
# to both the numerator and the denominator to avoid dividing by zero
words_with_spam_score = ((tokens.spam + 1e-6)/(tokens.ham + 1e-6)).sort_values(ascending=False)
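
As a worked example of the ratio above, the sketch below recomputes it by hand for three tokens from the sampled table, using the raw counts and the class counts (148 ham, 132 spam) shown earlier. The helper name `spam_score` is illustrative and not part of the notebook.

```python
# Class counts and raw token counts copied from the outputs above
class_count = {"ham": 148, "spam": 132}
raw = {  # token: (ham count, spam count)
    "appreciate": (0, 3),
    "getting": (2, 2),
    "chinese": (2, 0),
}
eps = 1e-6  # same small constant as in the notebook, avoids division by zero


def spam_score(ham_n, spam_n):
    ham_rate = ham_n / class_count["ham"]
    spam_rate = spam_n / class_count["spam"]
    return (spam_rate + eps) / (ham_rate + eps)


scores = {tok: spam_score(h, s) for tok, (h, s) in raw.items()}
for tok, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok:12s}{score:.4g}")
```

A token seen only in spam gets a very large score, a token seen only in ham gets a score near zero, and "getting" lands just above 1. The size of the extreme scores depends mostly on the chosen constant, so they are best read as a ranking rather than as meaningful magnitudes.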

<a id='section3j'></a>
###Learnings

In [24]:
# The 50 most spammy words, most spammy first
words_with_spam_score[:50].keys()

Index(['www', 'subscribe', 'channel', 'com', 'image2you', 'ru', '48051',
       'https', 'videos', 'guys', 'do', 'thanks', 'facebook', 'our', 'amp',
       'follow', 'co', 'sub', 'pl', 'dolacz', 'v3veygin', 'ermail', 'free',
       'tsu', 'll', 'twitter', 'plz', 'gt', 'org', 'subscribers', 'friend',
       'share', 'give', 'twitch', 'hi', 'minecraft', 'la', 'chance', 'play',
       'clothes', 'remix', 'tv', 'news', 'gaming', 'network', 'cards', 'gift',
       'subs', 'need', 'earn'],
      dtype='object', name='token')

In [25]:
# The 50 most hammy words (still sorted by descending spam score, so the most ham-like come last)
words_with_spam_score[-50:].keys()

Index(['actually', '5million', 'dick', 'section', 'dislike', 'salt',
       '9bzkp7q19f0', 'pray', 'population', 'dislikes', 'planet', 'piece',
       'person', 'ching', 'other', 'holy', 'wanted', 'gets', 'guy', '강남스타일',
       'fucking', 'fuck', 'every', 'korea', 'mean', 'saying', 'justin', 'been',
       'understand', 'baby', 'likes', '2billion', 'ago', 'checking', 'hits',
       'popular', 'wow', 'million', 'gangnam', 'viewed', 'years', 'over',
       'shit', 'came', 'still', 'most', 'he', '000', 'billion', 'views'],
      dtype='object', name='token')

<a id='section4'></a>
##4. Summary

In this notebook, we built a machine learning model based on comments for a YouTube video. The data set had a total of 350 observations, half of which were spam. We used a Naive Bayes classifier as our algorithm, and trained it on 80% of the data set. The model achieved an accuracy score of 97% on the test data. We used the model to derive the top 50 most spammy and least spammy keywords.

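In the following notebook, we will extend this to the entire data set, consisting of 1,956 observations spread across five videos. This notebook can be accessed [here](https://github.com/vkpedia/databuff/blob/master/random-walks/YouTube-Spam/YouTube_Spam_Collection%20%28Part%202%29.ipynb).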