# Question - Ads classification

Hello Internet Saviour,

You know how much we all love the internet. We connect, learn and have fun through this window to the world. It's nearly free and is becoming more and more of a community thing.

However, just as we take care of our communities, we need to keep the internet clean and fresh. One thing you have probably come across is random comments and remarks on articles, blogs and videos. While everyone is free to do what they want, random and unwanted ads take the experience away.

In this dataset we have 5 topics on which people commented. A lot of the comments were advertisements. When an article becomes popular and draws people, it also draws these unwanted adverts in the comments.

Your task is to predict whether a comment is an advertisement or not.

Import the dataset; most of the columns are self-explanatory. Here is the feature set:

| Variable | Description |
|---|---|
| COMMENT_ID | ID column |
| AUTHOR | Name of the person commenting |
| DATE | Date of the comment |
| CONTENT | Textual content of the comment |
| CLASS | Whether the comment was an ad (1) or not (0) |

Evaluation Metric

Submissions are evaluated using the roc_auc_score metric.

Sample solution csv is as shown below:

ID,CLASS
4,1
6,0


In [1]:
import pandas as pd
import numpy as np
import pandasql as pdsql

import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import preprocessor as p

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score


In [2]:
pd.options.display.max_rows = 999

# Exploring the data

In [3]:
train = pd.read_csv('/media/d/Python/adPredict/train.csv')
train.head(10)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
2,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1
3,z13lfzdo5vmdi1cm123te5uz2mqig1brz04,ferleck ferles,2013-11-27T21:39:24,Subscribe to my channel ﻿,1
4,z12avveb4xqiirsix04chxviiljryduwxg0,BeBe Burkey,2013-11-28T16:30:13,and u should.d check my channel and tell me wh...,1
5,z13xit5agm2zyh4f523rst2gowmbx5bml,Lone Twistt,2013-11-28T17:34:55,Once you have started reading do not stop. If...,1
6,z13pejoiuozwxtdu323dspopnri4xts0f,Archie Lewis,2013-11-28T17:54:39,https://twitter.com/GBphotographyGB﻿,1
7,z12oglnpoq3gjh4om04cfdlbgp2uepyytpw0k,Francisco Nora,2013-11-28T19:52:35,please like :D https://premium.easypromosapp.c...,1
8,z13phrmwrkfisn5er22eyrbpbvaiwfvwf04,Gaming and Stuff PRO,2013-11-28T21:14:13,"Hello! Do you like gaming, art videos, scienti...",1
9,z13bgdvyluihfv11i22rgxwhuvabzz1os04,Zielimeek21,2013-11-28T21:49:00,I'm only checking the views﻿,0


In [4]:
train.CONTENT[8]

"Hello! Do you like gaming, art videos, scientific experiments, tutorials,  lyrics videos, and much, much more of that? If you do please check out our  channel and subscribe to it, we've just started, but soon we hope we will  be able to cover all of our expectations... You can also check out what  we've got so far!\ufeff"

### Checking whether the problem is balanced or not

In [5]:
train.CLASS.value_counts()

1    586
0    571
Name: CLASS, dtype: int64

The positive and negative classes are balanced according to the counts above (586 vs. 571), so no rebalancing of the dataset is required.

### Incorporating Authors as a feature

Most of the authors have commented only once in the train dataset. Also, only 86 authors appear in both the training set and the test set, which has 799 observations.

An interesting observation from exploring authors who have commented more than once is that each such author has posted either only advertisements or only good comments. So this can potentially be a good feature to explore: we could build a dictionary of potential spammers and use membership in it as a feature. Due to lack of time, this is ignored for now.
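The spammer dictionary idea was not implemented in this notebook. A minimal sketch of what it could look like is shown here, assuming a lookup built from repeat authors whose training comments were all ads. The toy rows, the `known_spammers` set, the `author_is_known_spammer` helper and the `AUTHOR_SPAM` column name are all hypothetical.

```python
import pandas as pd

# Toy stand-in for the training frame; the rows are invented for illustration.
toy_train = pd.DataFrame({
    "AUTHOR": ["ann", "ann", "bob", "cat", "cat", "dan"],
    "CLASS":  [1, 1, 0, 0, 0, 1],
})

# Per-author comment count and share of comments labelled as ads.
author_stats = toy_train.groupby("AUTHOR")["CLASS"].agg(
    n_comments="count", ad_rate="mean"
)

# Hypothetical rule: repeat authors whose comments were all ads.
is_repeat_spammer = (author_stats.n_comments > 1) & (author_stats.ad_rate == 1.0)
known_spammers = set(author_stats[is_repeat_spammer].index)


def author_is_known_spammer(author):
    """Return 1 if the author is in the lookup built from training data, else 0."""
    return int(author in known_spammers)


toy_train["AUTHOR_SPAM"] = toy_train["AUTHOR"].apply(author_is_known_spammer)
print(toy_train)
```

Because the lookup is derived from the labels, computing the feature on the same rows used for training would leak the target. In practice it would probably need to be built out-of-fold, and with only 86 shared authors its effect on the test set may be limited.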

In [6]:
def text_clean(x):
    p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.HASHTAG, p.OPT.RESERVED, p.OPT.EMOJI, p.OPT.SMILEY, p.OPT.NUMBER)
    tokens = p.tokenize(x)  # Replace URLs, mentions, hashtags, emoji, smileys and numbers with placeholder tokens
    tokens = re.sub('[0-9]+', " ", tokens)  # The step above turns 24.56 into number.56; replace the leftover digits with spaces

    tokens = word_tokenize(tokens)  # Tokenize the string
    tokens = [word.lower() for word in tokens]  # Convert characters to lower case

    s = set(stopwords.words('english'))
    s.update(['please', 'plz', 'pls'])
    tokens = [word for word in tokens if word not in s]  # Remove common English stop words
    tokens = [word for word in tokens if len(word) > 2]  # Remove tokens with 2 or fewer characters
    lmtzr = WordNetLemmatizer()  # Lemmatize each token as a verb to get its base form
    tokens = [lmtzr.lemmatize(word, 'v') for word in tokens]
    return tokens

In text mining, the major accuracy improvements come from the preprocessing. The text_clean() function above does the preprocessing.

In [7]:
train['CONTENTclean'] = train['CONTENT'].apply(text_clean) # Apply preprocessing to the train content
def reassemble(x):
    return ' '.join(x)
train['commentsReassemble'] = train['CONTENTclean'].apply(reassemble) # Reassemble the preprocessed tokens into a string

In [8]:
train.head(10)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS,CONTENTclean,commentsReassemble
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1,"[huh, anyway, check, tube, channel, kobyoshi]",huh anyway check tube channel kobyoshi
1,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,"[shake, sexy, ass, channel, enjoy, ^_^]",shake sexy ass channel enjoy ^_^
2,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1,"[watch, v=vtarggvgtwq, check]",watch v=vtarggvgtwq check
3,z13lfzdo5vmdi1cm123te5uz2mqig1brz04,ferleck ferles,2013-11-27T21:39:24,Subscribe to my channel ﻿,1,"[subscribe, channel]",subscribe channel
4,z12avveb4xqiirsix04chxviiljryduwxg0,BeBe Burkey,2013-11-28T16:30:13,and u should.d check my channel and tell me wh...,1,"[should.d, check, channel, tell, next]",should.d check channel tell next
5,z13xit5agm2zyh4f523rst2gowmbx5bml,Lone Twistt,2013-11-28T17:34:55,Once you have started reading do not stop. If...,1,"[start, read, stop, subscribe, within, one, da...",start read stop subscribe within one day 're e...
6,z13pejoiuozwxtdu323dspopnri4xts0f,Archie Lewis,2013-11-28T17:54:39,https://twitter.com/GBphotographyGB﻿,1,[url],url
7,z12oglnpoq3gjh4om04cfdlbgp2uepyytpw0k,Francisco Nora,2013-11-28T19:52:35,please like :D https://premium.easypromosapp.c...,1,"[like, smiley, url]",like smiley url
8,z13phrmwrkfisn5er22eyrbpbvaiwfvwf04,Gaming and Stuff PRO,2013-11-28T21:14:13,"Hello! Do you like gaming, art videos, scienti...",1,"[hello, like, game, art, videos, scientific, s...",hello like game art videos scientific smiley e...
9,z13bgdvyluihfv11i22rgxwhuvabzz1os04,Zielimeek21,2013-11-28T21:49:00,I'm only checking the views﻿,0,"[check, views﻿]",check views﻿
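In the output above, some tokens (for example `views` in the last row) still carry the trailing `\ufeff` character seen in the raw comment text. It may matter little for the final features, since CountVectorizer re-tokenizes the reassembled string, but it can be removed before tokenizing. A small sketch, where `strip_invisible` is a hypothetical helper that is not part of the notebook:

```python
def strip_invisible(text):
    """Drop the U+FEFF byte-order mark and surrounding whitespace from a comment."""
    return text.replace("\ufeff", "").strip()


raw = "I'm only checking the views\ufeff"
cleaned = strip_invisible(raw)
print(repr(cleaned))
```

Such a helper could be applied to `CONTENT` before calling `text_clean`.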


In [9]:
count_vect = CountVectorizer(min_df=4) # Make features from the text
X_train_full = count_vect.fit_transform(train.commentsReassemble)
X_train_full.shape

(1157, 387)

With CountVectorizer(min_df=2) we get 1157 rows and 807 columns, which is a very sparse matrix. The feature space needs to be denser: with nearly as many features as observations, the model has more parameters than the data can reliably estimate, so we are in the territory of serious overfitting.

So we use CountVectorizer(min_df=4), which gives 387 features for 1157 rows. This requires a word to occur in at least 4 documents to be treated as a valid feature.
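To illustrate how `min_df` prunes the vocabulary, here is a small sketch on an invented five-comment corpus (the `toy_docs` list is an assumption, not data from the competition). Raising `min_df` keeps only the words that appear in at least that many documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented comments, only to show the effect of min_df.
toy_docs = [
    "check out my channel",
    "subscribe to my channel",
    "check my channel please",
    "great song",
    "love this song",
]

vocab_sizes = {}
for min_df in (1, 2, 3):
    vec = CountVectorizer(min_df=min_df)
    X = vec.fit_transform(toy_docs)
    vocab_sizes[min_df] = X.shape[1]  # number of retained features
    print(min_df, X.shape[1], sorted(vec.vocabulary_))
```

The same loop run on `train.commentsReassemble` would show how quickly the feature count falls as the threshold rises.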

Checking the vocabulary built below.

In [10]:
count_vect.get_feature_names()

['absolutely',
 'account',
 'actually',
 'adam',
 'add',
 'advertise',
 'africa',
 'ago',
 'almost',
 'already',
 'also',
 'amaze',
 'amp',
 'animals',
 'annoy',
 'anyone',
 'appreciate',
 'around',
 'artist',
 'attention',
 'away',
 'awesome',
 'baby',
 'back',
 'bad',
 'beat',
 'beautiful',
 'become',
 'begin',
 'believe',
 'bennett',
 'best',
 'better',
 'big',
 'billion',
 'bite',
 'black',
 'bless',
 'boy',
 'bring',
 'button',
 'buy',
 'call',
 'canal',
 'care',
 'case',
 'chance',
 'change',
 'channel',
 'charlie',
 'check',
 'class',
 'click',
 'com',
 'come',
 'comment',
 'constructive',
 'cool',
 'could',
 'cover',
 'crazy',
 'criticism',
 'cup',
 'cute',
 'daily',
 'damn',
 'dance',
 'dante',
 'day',
 'decent',
 'deserve',
 'dick',
 'dislike',
 'do',
 'dollars',
 'donate',
 'dont',
 'download',
 'dream',
 'drop',
 'earn',
 'earth',
 'easily',
 'easy',
 'eminem',
 'emoji',
 'end',
 'enjoy',
 'enter',
 'epic',
 'erience',
 'even',
 'ever',
 'every',
 'everyday',
 'everyone',
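Note that `get_feature_names()` was removed in newer scikit-learn releases in favour of `get_feature_names_out()`. A version-tolerant sketch on a toy vectorizer (the two example strings are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["subscribe to my channel", "check my channel"])

# Prefer the newer method when it exists; fall back for old installations.
if hasattr(vec, "get_feature_names_out"):
    names = list(vec.get_feature_names_out())
else:
    names = vec.get_feature_names()

print(names)
```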
 

# Fitting a model to the full train data

I am using logistic regression here since this is a binary classification problem and the positive and negative classes are balanced. Logistic regression is a discriminative classification model, and we are interested in learning the class boundary well here. I also tried Naive Bayes, which is a generative classification model, and its accuracy was lower than that of logistic regression.

We could use other discriminative models like SVM, or tree-based models like Random Forest or XGBoost. The accuracy with logistic regression is satisfactory, so I am not trying these models.
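The Naive Bayes comparison itself is not shown in the notebook. One way such a comparison could be run is sketched here on invented toy comments; `MultinomialNB` is an assumption about which Naive Bayes variant would be used, and the printed numbers say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Invented toy comments (1 = ad, 0 = not ad), only to make the sketch runnable.
toy_texts = [
    "check out my channel", "subscribe to my channel", "visit my website now",
    "free gift click here", "please subscribe and like", "check my new video channel",
    "love this song", "best song ever", "this video is great",
    "she is so talented", "listening again today", "great voice and dance",
]
toy_labels = [1] * 6 + [0] * 6

X_toy = CountVectorizer().fit_transform(toy_texts)

candidates = {
    "logistic_regression": LogisticRegression(),
    "multinomial_nb": MultinomialNB(),
}

# Mean 3-fold cross-validated accuracy for each candidate model.
mean_accuracy = {
    name: cross_val_score(model, X_toy, toy_labels, cv=3).mean()
    for name, model in candidates.items()
}
print(mean_accuracy)
```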

In [11]:
logistic = LogisticRegression() # Fitting a logistic regression to the data
logistic.fit(X_train_full, train.CLASS)
train['Predicted'] = logistic.predict(X_train_full)
train.head(10)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS,CONTENTclean,commentsReassemble,Predicted
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1,"[huh, anyway, check, tube, channel, kobyoshi]",huh anyway check tube channel kobyoshi,1
1,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,"[shake, sexy, ass, channel, enjoy, ^_^]",shake sexy ass channel enjoy ^_^,1
2,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1,"[watch, v=vtarggvgtwq, check]",watch v=vtarggvgtwq check,1
3,z13lfzdo5vmdi1cm123te5uz2mqig1brz04,ferleck ferles,2013-11-27T21:39:24,Subscribe to my channel ﻿,1,"[subscribe, channel]",subscribe channel,1
4,z12avveb4xqiirsix04chxviiljryduwxg0,BeBe Burkey,2013-11-28T16:30:13,and u should.d check my channel and tell me wh...,1,"[should.d, check, channel, tell, next]",should.d check channel tell next,1
5,z13xit5agm2zyh4f523rst2gowmbx5bml,Lone Twistt,2013-11-28T17:34:55,Once you have started reading do not stop. If...,1,"[start, read, stop, subscribe, within, one, da...",start read stop subscribe within one day 're e...,1
6,z13pejoiuozwxtdu323dspopnri4xts0f,Archie Lewis,2013-11-28T17:54:39,https://twitter.com/GBphotographyGB﻿,1,[url],url,1
7,z12oglnpoq3gjh4om04cfdlbgp2uepyytpw0k,Francisco Nora,2013-11-28T19:52:35,please like :D https://premium.easypromosapp.c...,1,"[like, smiley, url]",like smiley url,1
8,z13phrmwrkfisn5er22eyrbpbvaiwfvwf04,Gaming and Stuff PRO,2013-11-28T21:14:13,"Hello! Do you like gaming, art videos, scienti...",1,"[hello, like, game, art, videos, scientific, s...",hello like game art videos scientific smiley e...,1
9,z13bgdvyluihfv11i22rgxwhuvabzz1os04,Zielimeek21,2013-11-28T21:49:00,I'm only checking the views﻿,0,"[check, views﻿]",check views﻿,1


# Checking the accuracy on training set

In [12]:
np.mean(train.Predicted == train.CLASS)

0.9662921348314607

# Predict on Blind test dataset

In [13]:
test = pd.read_csv('/media/d/Python/adPredict/test.csv')
test.head(10)

Unnamed: 0,ID,COMMENT_ID,AUTHOR,DATE,CONTENT
0,0,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...
1,1,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com
2,2,LZQPQhLyRh9-wNRtlZDM90f1k0BrdVdJyN_YsaSwfxc,Jason Haddad,2013-11-26T02:55:11,"Hey, check out my new website!! This site is a..."
3,3,z122wfnzgt30fhubn04cdn3xfx2mxzngsl40k,Bob Kanowski,2013-11-28T12:33:27,i turned it on mute as soon is i came on i jus...
4,4,z13ttt1jcraqexk2o234ghbgzxymz1zzi04,Cony,2013-11-28T16:01:47,You should check my channel for Funny VIDEOS!!﻿
5,5,z13auhww3oufjn1qo04ci3grqqjmfjexxuo0k,Huckyduck,2013-11-28T17:06:17,Hey subscribe to me﻿
6,6,z121zxaxsq25z5k5o04ch1o5jqqfij3gtm40k,TheUploadaddict,2013-11-28T18:12:12,subscribe like comment﻿
7,7,z13vxpnoxsyeuv2jr04cctprprb1slnxdf4,OutrightIgnite,2013-11-28T21:55:02,http://www.ebay.com/itm/171183229277?ssPageNam...
8,8,z12qth5j0ob1fx3q404chvy4fz32tbkpllk0k,Tony K Frazier,2013-11-28T23:57:13,http://ubuntuone.com/40beUutVu2ZKxK4uTgPZ8K﻿
9,9,z13etj0bclzfztuwc04cgfvrgmf3fvjor1g,Jose Renteria,2013-11-29T00:22:01,We are an EDM apparel company dedicated to bri...


In [14]:
test['CONTENTclean'] = test['CONTENT'].apply(text_clean)
test['commentsReassemble'] = test['CONTENTclean'].apply(reassemble)
test.head(10)

Unnamed: 0,ID,COMMENT_ID,AUTHOR,DATE,CONTENT,CONTENTclean,commentsReassemble
0,0,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,"[hey, guy, check, new, channel, first, vid, mo...",hey guy check new channel first vid monkey mon...
1,1,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,"[test, say, murdev.com]",test say murdev.com
2,2,LZQPQhLyRh9-wNRtlZDM90f1k0BrdVdJyN_YsaSwfxc,Jason Haddad,2013-11-26T02:55:11,"Hey, check out my new website!! This site is a...","[hey, check, new, website, site, kid, stuff, k...",hey check new website site kid stuff kidsmedia...
3,3,z122wfnzgt30fhubn04cdn3xfx2mxzngsl40k,Bob Kanowski,2013-11-28T12:33:27,i turned it on mute as soon is i came on i jus...,"[turn, mute, soon, come, want, check, view, ...]",turn mute soon come want check view ...
4,4,z13ttt1jcraqexk2o234ghbgzxymz1zzi04,Cony,2013-11-28T16:01:47,You should check my channel for Funny VIDEOS!!﻿,"[check, channel, funny, videos]",check channel funny videos
5,5,z13auhww3oufjn1qo04ci3grqqjmfjexxuo0k,Huckyduck,2013-11-28T17:06:17,Hey subscribe to me﻿,"[hey, subscribe, me﻿]",hey subscribe me﻿
6,6,z121zxaxsq25z5k5o04ch1o5jqqfij3gtm40k,TheUploadaddict,2013-11-28T18:12:12,subscribe like comment﻿,"[subscribe, like, comment﻿]",subscribe like comment﻿
7,7,z13vxpnoxsyeuv2jr04cctprprb1slnxdf4,OutrightIgnite,2013-11-28T21:55:02,http://www.ebay.com/itm/171183229277?ssPageNam...,[url],url
8,8,z12qth5j0ob1fx3q404chvy4fz32tbkpllk0k,Tony K Frazier,2013-11-28T23:57:13,http://ubuntuone.com/40beUutVu2ZKxK4uTgPZ8K﻿,[url],url
9,9,z13etj0bclzfztuwc04cgfvrgmf3fvjor1g,Jose Renteria,2013-11-29T00:22:01,We are an EDM apparel company dedicated to bri...,"[edm, apparel, company, dedicate, bring, music...",edm apparel company dedicate bring music inspi...


In [15]:
X_test_Blind = count_vect.transform(test.commentsReassemble)
test['CLASS'] = logistic.predict(X_test_Blind)
test.head(10)

Unnamed: 0,ID,COMMENT_ID,AUTHOR,DATE,CONTENT,CONTENTclean,commentsReassemble,CLASS
0,0,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,"[hey, guy, check, new, channel, first, vid, mo...",hey guy check new channel first vid monkey mon...,1
1,1,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,"[test, say, murdev.com]",test say murdev.com,0
2,2,LZQPQhLyRh9-wNRtlZDM90f1k0BrdVdJyN_YsaSwfxc,Jason Haddad,2013-11-26T02:55:11,"Hey, check out my new website!! This site is a...","[hey, check, new, website, site, kid, stuff, k...",hey check new website site kid stuff kidsmedia...,1
3,3,z122wfnzgt30fhubn04cdn3xfx2mxzngsl40k,Bob Kanowski,2013-11-28T12:33:27,i turned it on mute as soon is i came on i jus...,"[turn, mute, soon, come, want, check, view, ...]",turn mute soon come want check view ...,0
4,4,z13ttt1jcraqexk2o234ghbgzxymz1zzi04,Cony,2013-11-28T16:01:47,You should check my channel for Funny VIDEOS!!﻿,"[check, channel, funny, videos]",check channel funny videos,1
5,5,z13auhww3oufjn1qo04ci3grqqjmfjexxuo0k,Huckyduck,2013-11-28T17:06:17,Hey subscribe to me﻿,"[hey, subscribe, me﻿]",hey subscribe me﻿,1
6,6,z121zxaxsq25z5k5o04ch1o5jqqfij3gtm40k,TheUploadaddict,2013-11-28T18:12:12,subscribe like comment﻿,"[subscribe, like, comment﻿]",subscribe like comment﻿,1
7,7,z13vxpnoxsyeuv2jr04cctprprb1slnxdf4,OutrightIgnite,2013-11-28T21:55:02,http://www.ebay.com/itm/171183229277?ssPageNam...,[url],url,1
8,8,z12qth5j0ob1fx3q404chvy4fz32tbkpllk0k,Tony K Frazier,2013-11-28T23:57:13,http://ubuntuone.com/40beUutVu2ZKxK4uTgPZ8K﻿,[url],url,1
9,9,z13etj0bclzfztuwc04cgfvrgmf3fvjor1g,Jose Renteria,2013-11-29T00:22:01,We are an EDM apparel company dedicated to bri...,"[edm, apparel, company, dedicate, bring, music...",edm apparel company dedicate bring music inspi...,1


# Checking whether the Prediction output is balanced

In [16]:
test.CLASS.value_counts()

0    401
1    398
Name: CLASS, dtype: int64

The two predicted classes occur in almost equal numbers in the test output, much like the labels in the training set. So we do not suspect that the distributions of train and test are different.

In [17]:
out = test[['COMMENT_ID','CLASS']] #Creating the submission file

In [18]:
out.head(5)

Unnamed: 0,COMMENT_ID,CLASS
0,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,1
1,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,0
2,LZQPQhLyRh9-wNRtlZDM90f1k0BrdVdJyN_YsaSwfxc,1
3,z122wfnzgt30fhubn04cdn3xfx2mxzngsl40k,0
4,z13ttt1jcraqexk2o234ghbgzxymz1zzi04,1


In [19]:
out.to_csv('/media/d/Python/adPredict/testOutput.csv', index = False) #Writing the submission output

# Testing model performance with train test split

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_train_full, train.CLASS, test_size=0.3)

In [21]:
log2 = LogisticRegression().fit(X_train, y_train)
y_test_pred = log2.predict(X_test)

In [22]:
np.mean(y_test_pred == y_test)

0.9310344827586207

The held-out test set accuracy is 93%, while accuracy on the full training set itself is 96%. The two are close to each other, so by this rough check we are not badly overfitting to the training set.

Once we are satisfied with a model, we would train it on the whole training data before predicting the unknown CLASS labels for the test set. This ensures that the model gets a chance to learn from the maximum data available to us.
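One caveat with the split above is that the vocabulary was fitted on the full training data before splitting, so the held-out rows already influenced the features. A sketch of a stricter check, assuming a scikit-learn pipeline that fits the vectorizer on the training part only; the toy texts and labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy comments (1 = ad, 0 = not ad).
toy_texts = [
    "check out my channel", "subscribe to my channel", "visit my website now",
    "free gift click here", "please subscribe and like", "check my new video channel",
    "love this song", "best song ever", "this video is great",
    "she is so talented", "listening again today", "great voice and dance",
]
toy_labels = [1] * 6 + [0] * 6

texts_train, texts_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# The vectorizer is fitted inside the pipeline, so it only sees training text.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts_train, y_train)

held_out_accuracy = model.score(texts_test, y_test)
print(held_out_accuracy)
```

On the real data, the pipeline would take `train.commentsReassemble` as input and `CountVectorizer(min_df=4)` as its first step.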

# Testing model performance with Cross Validation

In [23]:
scores = cross_val_score(logistic, X_train_full, train.CLASS, cv=5)
scores

array([0.86695279, 0.92207792, 0.91341991, 0.96103896, 0.9047619 ])

In [24]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.91 (+/- 0.06)


Here too, the 91% cross-validated accuracy is close to the held-out test set accuracy of 93% and the training set accuracy of 96%.
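With an integer `cv`, `cross_val_score` uses stratified folds without shuffling for classifiers. If the rows happen to be grouped by topic, that could be one reason for the spread between folds. A sketch of shuffled folds on synthetic data follows; the `make_classification` data and the `random_state` values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature matrix and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# Shuffle rows before forming the 5 stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv
)
print(shuffled_scores.mean(), shuffled_scores.std())
```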

# Finding the CrossValidated ROC_AUC score of the model

In [25]:
scores_AUC = cross_val_score(logistic, X_train_full, train.CLASS, cv=5, scoring = 'roc_auc')
scores_AUC

array([0.90486367, 0.96018893, 0.96101365, 0.9728595 , 0.95119208])

In [26]:
print("ROC_AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))

ROC_AUC: 0.95 (+/- 0.05)


### The expected ROC_AUC score of the submission file is 0.95 (+/- 0.05)
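A caveat on this estimate: `scoring='roc_auc'` ranks comments by a continuous score from the model, while the submission file written earlier contains 0/1 labels from `predict`. ROC AUC computed on hard labels can be lower than on scores, as this toy sketch shows (the four labels and probabilities are invented):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
proba = [0.1, 0.3, 0.4, 0.9]           # invented scores with a perfect ranking
hard = [int(p >= 0.5) for p in proba]  # thresholded 0/1 labels: [0, 0, 0, 1]

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)
print(auc_from_proba, auc_from_labels)

# In the notebook, continuous scores would come from something like
# logistic.predict_proba(X_test_Blind)[:, 1] instead of logistic.predict(...).
```

Whether the competition accepts probabilities in the CLASS column is not stated in the problem description, so this is only something to check before relying on the 0.95 figure.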