## Classifying Spam YouTube Comments
### [Source](https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection)

***
#### Notes
- Comments on the Psy video are used as training data

In [1]:
# version check
import sys
sys.version

'3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]'

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

#### A quick note on `nltk.corpus.stopwords`
*For first-time nltk users*

Before `stopwords` can be used, the `stopwords` corpus has to be installed with the `nltk` downloader.  
To open the downloader GUI, enter the following in a Python console:
```python
>>> import nltk
>>> nltk.download()
```

When the GUI pops up, go to the Corpora tab and select `stopwords`. Alternatively, `nltk.download('stopwords')` fetches the corpus directly without opening the GUI.

In [3]:
psy_data = 'data/Youtube01-Psy.csv'
df_psy = pd.read_csv(psy_data)
df_psy.tail(10)
# CLASS: boolean spam tag (1 = spam, 0 = not spam)
# 350 comments total (index 0 to 349)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
340,z12exzcrvpeew1yxg04cd5tbwnmfubnh4kk0k,Anthony1SV,2014-11-14T00:01:37,Please do buy these new Christmas shirts! You ...,1
341,z13qyxk5tzq1e5asx22xjt3wdq3ns32f5,Ameenk Chanel,2014-11-14T11:50:02,Free my apps get 1m crdits ! Just click on the...,1
342,z12uyzhazxqkzbrzz04cedqpkwzdvjyy3u40k,Brandon Wilson,2014-11-14T12:21:52,Why does a song like this have more views than...,0
343,z13sh3cz1kbqgrai504cf53qsq25ypmi5zs0k,Leonel Hernandez,2014-11-14T12:35:38,"Something to dance to, even if your sad JUST ...",0
344,z12etrpq5xu0vnm2g230szo5ote3zviny,InterGaming,2014-11-14T13:16:24,everyones back lool this is almost 3 years old...,0
345,z13th1q4yzihf1bll23qxzpjeujterydj,Carmen Racasanu,2014-11-14T13:27:52,How can this have 2 billion views when there's...,0
346,z13fcn1wfpb5e51xe04chdxakpzgchyaxzo0k,diego mogrovejo,2014-11-14T13:28:08,I don't now why I'm watching this in 2014﻿,0
347,z130zd5b3titudkoe04ccbeohojxuzppvbg,BlueYetiPlayz -Call Of Duty and More,2015-05-23T13:04:32,subscribe to me for call of duty vids and give...,1
348,z12he50arvrkivl5u04cctawgxzkjfsjcc4,Photo Editor,2015-06-05T14:14:48,hi guys please my android photo editor downloa...,1
349,z13vhvu54u3ewpp5h04ccb4zuoardrmjlyk0k,Ray Benich,2015-06-05T18:05:16,The first billion viewed this because they tho...,0


In [4]:
def wordlist(slist):
    """Return one flat list of non-stop-words across all comments in slist."""
    # replace every non-letter character (punctuation, digits) with a space
    cleaned = [re.sub("[^a-zA-Z]", " ", s) for s in slist]
    lower = [s.lower() for s in cleaned]  # still a list of long strings
    en_stopwords = set(stopwords.words("english"))
    # bag of words for all strings in the list
    words = [w for s in lower for w in s.split() if w not in en_stopwords]
    return words
    
def cleanstrings(slist):
    """Return a list of word lists, one per comment, with stop words removed."""
    en_stopwords = set(stopwords.words("english"))
    clean = []
    for s in slist:  # access each comment
        # replace every non-letter character with a space, then lowercase
        s = re.sub("[^a-zA-Z]", " ", s).lower()
        # filter whole tokens rather than substrings, so a stop word that
        # occurs inside a longer word (the 'a' in 'channel') is left alone
        clean.append([w for w in s.split() if w not in en_stopwords])
    return clean

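def clean_raw(s):
    """Clean a single comment and return it as one space-separated string."""
    s = re.sub("[^a-zA-Z]", " ", s)  # replace non-letters with spaces
    s = s.lower().split()
    en_stopwords = set(stopwords.words("english"))
    x = [w for w in s if w not in en_stopwords]
    return " ".join(x)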
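
One thing to watch for when removing stop words: `str.replace` works on substrings, while a list comprehension over `s.split()` compares whole tokens. The sketch below contrasts the two on a made-up comment; the small `stop_words` set is a hypothetical stand-in for nltk's English list.

```python
# Hypothetical stop word set; the notebook uses nltk's English list instead.
stop_words = {"a", "my", "i"}
comment = "check my channel i made a video"

# Substring removal: str.replace deletes the first match it finds,
# even when that match sits inside another word.
damaged = comment
for w in comment.split():
    if w in stop_words:
        damaged = damaged.replace(w, "", 1)

# Token filtering: only whole words are compared against the stop word set.
filtered = [w for w in comment.split() if w not in stop_words]

print(damaged.split())  # ['check', 'chnnel', 'made', 'a', 'video']
print(filtered)         # ['check', 'channel', 'made', 'video']
```

The substring version strips the "a" out of "channel" and leaves the real stop word "a" in place, which is why token filtering is the safer choice.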
        
    

For those unfamiliar with list comprehensions, I find this helpful:  
[Understanding nested list comprehension syntax in Python](https://spapas.github.io/2016/04/27/python-nested-list-comprehensions/)

The code below is equivalent to the nested list comprehension in `wordlist` above:
```python
words = []
for s in lower:
    for w in s.split():
        if w not in en_stopwords:
            words.append(w)
```

In [5]:
# basic example of how lists and sets differ
# sets drop duplicates and are more efficient than lists for membership tests
l1 = [5,5,5,6,4,7,8,8,8,8,3]
s1 = set(l1)

print(l1, '\n', s1)

[5, 5, 5, 6, 4, 7, 8, 8, 8, 8, 3] 
 {3, 4, 5, 6, 7, 8}
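
This difference is why the cleaning functions wrap the stop word list in `set(...)` before testing `w not in en_stopwords`. A rough sketch of the gap, using hypothetical integer data and the standard `timeit` module (exact timings will vary by machine, so none are shown):

```python
import timeit

# Hypothetical data: 100,000 integers stored both ways.
items_list = list(range(100_000))
items_set = set(items_list)
target = 99_999  # worst case for the list: the last element

# Both containers give the same answer to a membership test...
same_answer = (target in items_list) == (target in items_set)

# ...but the list is scanned element by element, while the set uses hashing.
list_time = timeit.timeit(lambda: target in items_list, number=200)
set_time = timeit.timeit(lambda: target in items_set, number=200)
print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```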


In [6]:
com_list = list(df_psy['CONTENT'])
clean_coms = cleanstrings(com_list) # list of word lists, one per comment
com_words = wordlist(com_list) # flat list of all words across all comments

print("Qty words in comments (cum):  ",len(com_words))
unique_words = list(set(com_words))
print("Unique words: ",len(unique_words))      
# print(len(clean_coms), 'original qty comments:  ',len(com_list))

clean_com_str = [clean_raw(s) for s in com_list] # this will be fed to vectorizer

Qty words in comments (cum):   3141
Unique words:  1237


In [7]:
# to see current items, run below
print(clean_coms[:3])
print(com_words[:10])
print(unique_words[:10]) # contains unique words only
print(clean_com_str[:2])

[['huh', 'anyway', 'check', 'tube', 'channel', 'kobyoshi'], ['hey', 'guys', 'check', 'new', 'channel', 'first', 'vid', 'us', 'monkeys', 'monkey', 'white', 'shirt', 'please', 'leave', 'like', 'comment', 'please', 'subscribe'], ['test', 'say', 'murdev', 'com']]
['huh', 'anyway', 'check', 'tube', 'channel', 'kobyoshi', 'hey', 'guys', 'check', 'new']
['whipe', 'behold', 'dancing', 'binbox', 'windshield', 'headbutt', 'lexis', 'end', 'access', 'photo']
['huh anyway check tube channel kobyoshi', 'hey guys check new channel first vid us monkeys monkey white shirt please leave like comment please subscribe']


In [16]:
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,     # default
                             preprocessor=None,  # default
                             stop_words=None,    # default
                             max_features=None)
type(vectorizer)

sklearn.feature_extraction.text.CountVectorizer

In [18]:
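train_features = vectorizer.fit_transform(clean_com_str) # sparse document-term matrix
print(type(train_features))
# note: scikit-learn 1.0 deprecated get_feature_names() in favor of
# get_feature_names_out(), and the old name was removed in 1.2
feature_words = vectorizer.get_feature_names()
train_features = train_features.toarray() # dense array: one row per comment, one column per word
print(type(train_features))
print(feature_words[:5])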


<class 'scipy.sparse.csr.csr_matrix'>
<class 'numpy.ndarray'>
['aa', 'aaaaaaa', 'able', 'absolutely', 'access']
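
To make the vectorizer's output less opaque, here is a simplified pure-Python stand-in for what a bag-of-words count matrix contains. It is not scikit-learn's implementation; the `bag_of_words` helper and the two-line corpus are hypothetical, and the only detail borrowed from `CountVectorizer` is its default token pattern, which keeps tokens of two or more word characters.

```python
import re

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of strings."""
    # Tokens are runs of two or more word characters, mirroring
    # CountVectorizer's default token_pattern r"(?u)\b\w\w+\b".
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    # Alphabetically sorted vocabulary: one matrix column per word.
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for w in doc:
            row[index[w]] += 1  # count occurrences of each word
        matrix.append(row)
    return vocab, matrix

# Hypothetical mini-corpus, not taken from the dataset.
docs = ["check my channel", "please check check this"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['channel', 'check', 'my', 'please', 'this']
print(matrix)  # [[1, 1, 1, 0, 0], [0, 2, 0, 1, 1]]
```

Each row corresponds to one comment and each column to one vocabulary word, which is the same layout as `train_features` after `.toarray()`.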


## A to-do list

- Create an array of vectorized word counts with the classification (spam/ham) as a column
- Give a numerical id to YouTube usernames
  - even if the specific word behind each column is unknown, the column is just an indicator variable, so the word doesn't need to be known
- df.keys = [user_as_id, [vectorized_words], class]
- *alternatively* simply give training features as--

### The presence of a unique word matters more than which word the vector column represents
- it would suffice for an identifier to be something such as "w01"
- create a dataframe and append values; [see link](https://stackoverflow.com/questions/20763012/creating-a-pandas-dataframe-from-a-numpy-array-how-do-i-specify-the-index-colum)
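
The two ideas above (numeric ids for usernames, opaque identifiers such as "w01" for word columns) could be sketched in plain Python as follows. The sample `authors` and `feature_words` values are hypothetical stand-ins for the real columns, and the zero-padding width is an arbitrary choice.

```python
# Hypothetical inputs: a few author names and feature words like those in the data.
authors = ["Julius NM", "adam riyati", "Julius NM", "Cony"]
feature_words = ["aa", "able", "access"]

# Map each distinct username to an integer id, in order of first appearance.
author_ids = {}
for name in authors:
    author_ids.setdefault(name, len(author_ids))
user_as_id = [author_ids[name] for name in authors]

# Replace each word with an opaque, zero-padded identifier such as "w01".
pad = max(len(str(len(feature_words))), 2)  # at least two digits
word_ids = {w: f"w{i:0{pad}d}" for i, w in enumerate(feature_words, start=1)}

print(user_as_id)  # [0, 1, 0, 2]
print(word_ids)    # {'aa': 'w01', 'able': 'w02', 'access': 'w03'}
```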


In [38]:
words_df = pd.DataFrame(data=train_features, columns=feature_words, index=df_psy.index)
words_df.tail(10)
# words_df['aa'].unique()


Unnamed: 0,aa,aaaaaaa,able,absolutely,access,accessories,account,accounts,acn,acting,...,yr,yt,zanol,zero,zkxk,zllsvdqv,zombie,zuf,zvrrp,zxlightsoutxz
340,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
341,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
342,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
343,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
344,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
345,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
346,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
349,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
dfs = [df_psy, words_df]
# axis=1 aligns the two frames on their shared index and appends the word columns
df_psy_words = pd.concat(objs=dfs, axis=1, join='inner')

# the below is equivalent in output
# df_psy_words = df_psy.join(words_df)
df_psy_words.head(10)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS,aa,aaaaaaa,able,absolutely,access,...,yr,yt,zanol,zero,zkxk,zllsvdqv,zombie,zuf,zvrrp,zxlightsoutxz
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,LZQPQhLyRh9-wNRtlZDM90f1k0BrdVdJyN_YsaSwfxc,Jason Haddad,2013-11-26T02:55:11,"Hey, check out my new website!! This site is a...",1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,z13lfzdo5vmdi1cm123te5uz2mqig1brz04,ferleck ferles,2013-11-27T21:39:24,Subscribe to my channel ﻿,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,z122wfnzgt30fhubn04cdn3xfx2mxzngsl40k,Bob Kanowski,2013-11-28T12:33:27,i turned it on mute as soon is i came on i jus...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,z13ttt1jcraqexk2o234ghbgzxymz1zzi04,Cony,2013-11-28T16:01:47,You should check my channel for Funny VIDEOS!!﻿,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,z12avveb4xqiirsix04chxviiljryduwxg0,BeBe Burkey,2013-11-28T16:30:13,and u should.d check my channel and tell me wh...,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
df_psy_words['aa'].unique() # sanity check: an earlier run showed NaN values here; none are expected

array([0, 1], dtype=int64)

In [11]:
# random forest with 50 trees
forest = RandomForestClassifier(n_estimators=50)

# train on the word-count matrix, using the spam tag as the target
forest = forest.fit(train_features, df_psy['CLASS'])
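
Once the forest is fitted, new comments can be classified by passing them through the same vectorizer. The following is a minimal sketch on hypothetical toy data, not the Psy dataset; the texts, labels, and `random_state` value are assumptions made only so that the example is self-contained and reproducible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data standing in for clean_com_str and df_psy['CLASS'].
train_texts = [
    "check channel please subscribe",
    "subscribe channel free apps click",
    "love song best video ever",
    "song still great years later",
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train_texts).toarray()

# random_state is fixed only to make the sketch reproducible.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(train_features, train_labels)

# New comments go through transform (not fit_transform) so that they are
# mapped onto the vocabulary learned from the training comments.
new_texts = ["please subscribe channel", "great song"]
new_features = vectorizer.transform(new_texts).toarray()
predictions = forest.predict(new_features)
print(predictions)
```

In the notebook itself, the new comments would first be cleaned with `clean_raw` so that they match the preprocessing applied to the training data.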