#YouTube Spam Collection Data Set (Part 2)

Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection)

Original Source: [YouTube Spam Collection v. 1](http://dcomp.sor.ufscar.br/talmeida/youtubespamcollection/)

> Alberto, T.C., Lochter J.V., Almeida, T.A. __Filtragem Automática de Spam nos Comentários do YouTube.__ Anais do XII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC'15), Natal, RN, Brazil, 2015. ([preprint](http://dcomp.sor.ufscar.br/talmeida/papers/TCA_ENIAC15.pdf))

> Alberto, T.C., Lochter J.V., Almeida, T.A. __TubeSpam: Comment Spam Filtering on YouTube.__ Proceedings of the 14th IEEE International Conference on Machine Learning and Applications (ICMLA'15), 1-6, Miami, FL, USA, December, 2015. ([preprint](http://dcomp.sor.ufscar.br/talmeida/papers/TCA_ICMLA15.pdf))

##Contents
* [1 Data Set Description](#section1)
* [2 Approach](#section2)
* [3 Solution](#section3)
  * [3a Import modules](#section3a)
  * [3b Read the data set](#section3b)
  * [3c Data cleanup](#section3c)
  * [3d Split the data](#section3d)
  * [3e Transform the data](#section3e)
  * [3f Build the model](#section3f)
  * [3g Run predictions](#section3g)
  * [3h Score the prediction](#section3h)
* [4 Summary](#section4)

<a id='section1'></a>
##1. Data Set Description

From the description accompanying the data set, "the samples were extracted from the comments section of five videos that were among the 10 most viewed on YouTube during the collection period."

The data is available in five distinct data sets, and the data is classified as 1 for "spam" and 0 for "ham"

<a id='section2'></a>
##2. Approach
Since the data set is split across five data sets, we will take two passes at the data. This is the second pass.

In the (optional) first pass, we considered only the Psy data set, as a way to wrap our hands around the problem. The notebook for this can be accessed [here](https://github.com/vkpedia/databuff/blob/master/random-walks/YouTube-Spam/YouTube_Spam_Collection%20%28Part%201%29.ipynb).

Our second pass will involve merging all five data sets and then running the classification on the combined data set. In this round, we will also tune the model and the vectorizer to eke out some improvements.

<a id='section3'></a>
##3. Solution

<a id='section3a'></a>
###Import initial set of modules

In [1]:
# Import modules

import numpy as np
import pandas as pd

<a id='section3b'></a>
###Read in the data from the first CSV alone

In [2]:
# Read the data set; print the first few rows
files = ['data\\Youtube01-Psy.csv', 'data\\Youtube02-KatyPerry.csv', 'data\\Youtube03-LMFAO.csv', 
         'data\\Youtube04-Eminem.csv', 'data\\Youtube05-Shakira.csv']

df = pd.DataFrame()
for file in files:
    df = df.append(pd.read_csv(file))
    
df.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


<a id='section3c'></a>
###Data cleanup

In [3]:
# Check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1956 entries, 0 to 369
Data columns (total 5 columns):
COMMENT_ID    1956 non-null object
AUTHOR        1956 non-null object
DATE          1711 non-null object
CONTENT       1956 non-null object
CLASS         1956 non-null int64
dtypes: int64(1), object(4)
memory usage: 91.7+ KB


In [4]:
# Looks like there are missing values in the DATE column, but it is not a column of interest. Let's proceed.
# Of the five columns, the only relevant columns for spam/ham classification are the CONTENT and CLASS columns.
# We will use just these two columns. But first, let's check the distribution of spam and ham 

df.CLASS.value_counts()

1    1005
0     951
Name: CLASS, dtype: int64

In [5]:
# There is an almost equal distribution. Given that this is a small data set, this is probably good, 
# because the algorithm has enough items it can learn from
# Now, let us set up our X and y
X = df.CONTENT
y = df.CLASS

<a id='section3d'></a>
###Split the data

In [6]:
# Let us now split the data set into train and test sets
# We will use an 80/20 split
test_size = 0.2
seed = 42
scoring = 'accuracy'
num_folds = 10

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed, test_size=test_size)

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = []
names = []
results = []

lr = ('LR', LogisticRegression())
knn = ('KNN', KNeighborsClassifier())
svc = ('SVC', SVC())
nb = ('NB', MultinomialNB())
cart = ('CART', DecisionTreeClassifier())

models.extend([lr,  knn, svc, nb, cart])

<a id='section3e'></a>
###Transform the data

In [8]:
# Set up a vectorizer, and create a Document-Term matrix
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)

In [9]:
# Check the layout of the Document-Term matrix
X_train_dtm

<1564x3810 sparse matrix of type '<class 'numpy.int64'>'
	with 20546 stored elements in Compressed Sparse Row format>

<a id='section3f'></a>
###Build the model

In this step, we will build 6 models, and pick the one with the best accuracy score

In [10]:
from sklearn.model_selection import KFold, cross_val_score

for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed)
    score = cross_val_score(model, X_train_dtm, y_train, scoring=scoring, cv=kfold)
    names.append(name)
    results.append(score)

In [11]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, \
    RandomForestClassifier, ExtraTreesClassifier

ensembles = []
ensemble_names = []
ensemble_results = []
ensembles.append(('AB', AdaBoostClassifier()))
ensembles.append(('RF', RandomForestClassifier()))
ensembles.append(('ET', ExtraTreesClassifier()))

In [12]:
for name, model in ensembles:
    kfold = KFold(n_splits=num_folds, random_state=seed)
    score = cross_val_score(model, X_train_dtm, y_train, cv=kfold, scoring=scoring)
    ensemble_names.append(name)
    ensemble_results.append(score)

In [13]:
models_list = []
for i, name in enumerate(names):
    d = {'model': name, 'mean': results[i].mean(), 'std': results[i].std()}
    models_list.append(d)
for i, name in enumerate(ensemble_names):
    d = {'model': name, 'mean': results[i].mean(), 'std': results[i].std()}
    models_list.append(d)
    
models_df = pd.DataFrame(models_list).set_index('model')
models_df.sort_values('mean', ascending=False)

Unnamed: 0_level_0,mean,std
model,Unnamed: 1_level_1,Unnamed: 2_level_1
CART,0.952041,0.011211
LR,0.947575,0.008471
AB,0.947575,0.008471
NB,0.922028,0.02251
KNN,0.868333,0.032262
RF,0.868333,0.032262
SVC,0.591458,0.032478
ET,0.591458,0.032478


###Model selection

Based on accuracy scores, the best algorithm is the Decision Tree Classifier. Logistic Regression and AdaBoost Classifier also performed very well. We will choose Decision Tree as our model, and look to tune it.

In [14]:
cart

('CART',
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None,
             min_impurity_split=1e-07, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=None, splitter='best'))

In [15]:
from sklearn.model_selection import GridSearchCV

final_model = DecisionTreeClassifier()

criterion_values = ['gini', 'entropy']
splitter_values = ['best', 'random']
min_samples_split_values = np.arange(2, 11, 1)

param_grid = dict(criterion=criterion_values, splitter=splitter_values, 
                  min_samples_split=min_samples_split_values)
kfold = KFold(n_splits=num_folds, random_state=seed)
grid = GridSearchCV(estimator=final_model, cv=kfold, scoring=scoring, param_grid=param_grid)

grid_result = grid.fit(X_train_dtm, y_train)
print(grid_result.best_params_, grid_result.best_score_)

{'criterion': 'gini', 'min_samples_split': 7, 'splitter': 'best'} 0.953964194373


It looks like we were able to eke out some improvement in the performance. The Decision Tree Classifier seems to perform best with the min_samples_split set to 7. We will use this for our final model. Note that the default values for 'criterion' and 'splitter' seem to be part of the best performing set of parameters.

<a id='section3g'></a>
###Run the prediction

In [16]:
final_model = DecisionTreeClassifier(min_samples_split=7, random_state=seed)
final_model.fit(X_train_dtm, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=7, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best')

In [17]:
# Transform the test data to a DTM and predict
X_test_dtm = vect.transform(X_test)
y_pred = final_model.predict(X_test_dtm)

<a id='section3h'></a>
###Score the prediction

In [18]:
# Let us check the accuracy score
# It needs to better than 50%, which was the baseline
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
accuracy_score(y_test, y_pred)

0.93367346938775508

In [19]:
# The accuracy score was 93.37%, which is lower than we may have anticipated 
# Let us check the confusion matrix to get a sense of the prediction distribution
confusion_matrix(y_test, y_pred)

array([[164,  12],
       [ 14, 202]])

In [20]:
# The model predicted 366 out of 392 instances correctly
# We had 14 false positives and 12 false negatives

# What were the false positive comments? (That is, ham marked as spam)
X_test[y_pred > y_test]

251    2,000,000,000 out of 7,000,000,000 people in t...
221                                      I want new song
328    I hate videos like these with those poor anima...
254    How did THIS Video in all of YouTube get this ...
109    8 million likes xD even the subscribers not 8 ...
270    The little PSY is suffering Brain Tumor and on...
188    OMG I LOVE YOU KATY PARRY YOUR SONGS ROCK!!!!!...
49     thumbs up if u checked this video to see hw vi...
289               YOUTUBE MONEY !!!!!!!!!!!!!!!!!!!!!!!﻿
306            NEW GOAL!   3,000,000!  Let's go for it!﻿
369                         Lemme Top Comments Please!!﻿
198    Since when has Katy Perry had her own YouTube ...
Name: CONTENT, dtype: object

In [21]:
# And what were the false negative comments? (That is, spam comments that went undetected)
X_test[y_pred < y_test]

8                         Aslamu Lykum... From Pakistan﻿
99     http://thepiratebay.se/torrent/6381501/Timothy...
288               los invito a subscribirse a mi canal ﻿
262             ｈｔｔｐ://ｗｗｗ.ｅｂａｙ.ｃｏｍ/ｕｓｒ/ｓｈｏｅｃｏｌｌｅｃｔｏｒ314
379    Ummm... I just hit 1k subscribers. I make Mine...
205    I love katy fashions tiger, care to visit my b...
88     Finally someone shares the same opinion as me....
109                                I&#39;m A SUBSCRIBER﻿
324    Hey yall its the real Kevin Hart, shout out to...
128    She loves Vena. trojmiasto.pl/Vena-Bus-Taxi-o5...
136    http://thepiratebay.se/torrent/10626048/The.Ex...
135    http://thepiratebay.se/torrent/10626835/The.Ex...
0              +447935454150 lovely girl talk to me xxx﻿
195    if you need youtube subscriber mail hermann bu...
Name: CONTENT, dtype: object

Some of the false negatives seem like they should have been marked as spam, so it is interesting that the model missed these. We may need to tune our vectorizer and/or attempt some other classifiers.

Let us check the area under the ROC curve.

In [22]:
roc_auc_score(y_test, final_model.predict_proba(X_test_dtm)[:, 1])

0.94489162457912457

The area is around 94.49%. Not bad for a first pass, but again, this could have been higher.

<a id='section4'></a>
##4. Summary

In this notebook, we built a machine learning model based on comments for five YouTube videos. The training data set had a total of 1,956 observations, about half of which was spam. We used 80% of the data set for training, and used 10-fold validation to find among eight algorithms the best-performing algorithm based on accuracy score. The Decision Tree algorithm had the best accuracy score, so we selected it as our model. We further tuned the model and were able to improve the accuracy score.

The final model resulted in an **accuracy score of 93.37%** on the test data, with an AUC of 94.49%. Some false negatives seem concerning, and we will need to delve further into the model to understand it better and to iron out these inaccuracies.

Some potential next steps are tuning the vectorizer to use bi-grams, visualizing and tuning the Decision Tree further, and comparing the performance to other tuned classifiers.