# Ad Predictor

### Problem definition:

#### Implement a proof-of-concept classifier that uses data about banner ads to predict the advertiser represented in an ad, or return "not an ad" if the image isn't an ad, or return "no prediction" if the classifier isn't sufficiently confident. You can find a labeled dataset with representative class frequencies at http://moatsearch-data.s3.amazonaws.com/homework/ad_classification_hw_dataset.json.

#### Use the following cost matrix to inform your implementation and analysis:

```
| predicted   |  correct brand     |  wrong brand  |  non-ad  |  no prediction |
| actual -----|--------------------|---------------|----------|----------------|
| any brand   |         0          |      -20      |    -100  |       -5       |
| non-ad      |         X          |      -40      |     0    |       -5       |
```

#### Questions:
- Discuss the performance of your classifier. For context, include specs for the machine you trained your classifier on.
- Describe the reasoning behind all major design decisions you had to make.
- If you were to keep developing this proof-of-concept, what are some changes you think would be promising to explore next, and why?

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
givenData = pd.read_json('http://moatsearch-data.s3.amazonaws.com/homework/ad_classification_hw_dataset.json')
givenData.describe()

Unnamed: 0,label,md5,ocr_logos,ocr_text,screenshot_url
count,9729,9729,9729,9729.0,9729
unique,1375,9729,2407,8056.0,9729
top,not_an_ad,6dbd03573e0610e810f13ac1a8170a71,[],,https://search-creatives.s3.amazonaws.com/64/9...
freq,3400,1,4710,732.0,1


In [4]:
givenData

Unnamed: 0,label,md5,ocr_logos,ocr_text,screenshot_url
0,ashley furniture,2bb7ffaef7e7012ade02da88cdf7edf7,[],et happy\nASHLEY\nholidays\nthis is home\n25 l...,https://search-creatives.s3.amazonaws.com/8b/c...
1,emirates airline,2eb9ba692c4c687cb3ef8491d042450b,[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...,https://search-creatives.s3.amazonaws.com/9d/1...
2,verizon wireless,a40a6af8fa229c630c7d4d7951a8f517,[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...,https://search-creatives.s3.amazonaws.com/88/b...
3,walgreens,2a2a280ebc1369ee11be245a630a140a,[Walgreens],Great gifts are right walgreens.\naround the c...,https://search-creatives.s3.amazonaws.com/f8/d...
4,hewlett packard,dd66013a011608d5410eee78517b2082,[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...,https://search-creatives.s3.amazonaws.com/f8/9...
5,not_an_ad,6d48faaef070e63092398bd179579cf4,[],II We were guided step by step through\nthe pu...,https://moatsearch-data.s3.amazonaws.com/creat...
6,bluehost,d659dc402651c871ad8d631906d93c94,[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON...",https://search-creatives.s3.amazonaws.com/3c/a...
7,not_an_ad,ce89d5fb1c26cc673b9acbc501d53978,[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...,https://search-creatives.s3.amazonaws.com/1d/e...
8,not_an_ad,f5a7ffec70312212ab0c8eb1208f0351,[],,https://search-creatives.s3.amazonaws.com/46/f...
9,fashion mia,cba9c5ca50f1f5be522a0c6b8f4547ab,[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...,https://search-creatives.s3.amazonaws.com/f3/5...


In [10]:
#clean ocr_text columns and add to new column 'ocr_text_all'
import re
reg = re.compile(r'\W+')  #runs of non-word characters (punctuation and whitespace)

df = givenData
ocr_text = df['ocr_text'].values

def clean(text):
    return reg.sub(' ', text).strip()

cleanText = [clean(text).lower() for text in ocr_text]
#df = df.drop('ocr_text_all', axis=1)
df.insert(loc=0, column='ocr_text_all', value= cleanText) 
df


Unnamed: 0,ocr_text_all,label,md5,ocr_logos,ocr_text,screenshot_url
0,et happy ashley holidays this is home 25 l 36 ...,ashley furniture,2bb7ffaef7e7012ade02da88cdf7edf7,[],et happy\nASHLEY\nholidays\nthis is home\n25 l...,https://search-creatives.s3.amazonaws.com/8b/c...
1,emirates buy 2 tickets for the price of 1 jfk ...,emirates airline,2eb9ba692c4c687cb3ef8491d042450b,[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...,https://search-creatives.s3.amazonaws.com/9d/1...
2,gratis verizon lg g pad tm 7 0 lte aprendemas ...,verizon wireless,a40a6af8fa229c630c7d4d7951a8f517,[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...,https://search-creatives.s3.amazonaws.com/88/b...
3,great gifts are right walgreens around the cor...,walgreens,2a2a280ebc1369ee11be245a630a140a,[Walgreens],Great gifts are right walgreens.\naround the c...,https://search-creatives.s3.amazonaws.com/f8/d...
4,la force est puissante dans notre famille du 2...,hewlett packard,dd66013a011608d5410eee78517b2082,[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...,https://search-creatives.s3.amazonaws.com/f8/9...
5,ii we were guided step by step through the pur...,not_an_ad,6d48faaef070e63092398bd179579cf4,[],II We were guided step by step through\nthe pu...,https://moatsearch-data.s3.amazonaws.com/creat...
6,easy hassle free web hosting for 3 95 month un...,bluehost,d659dc402651c871ad8d631906d93c94,[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON...",https://search-creatives.s3.amazonaws.com/3c/a...
7,004 kronen bourg 1664 pick up a pack today dri...,not_an_ad,ce89d5fb1c26cc673b9acbc501d53978,[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...,https://search-creatives.s3.amazonaws.com/1d/e...
8,,not_an_ad,f5a7ffec70312212ab0c8eb1208f0351,[],,https://search-creatives.s3.amazonaws.com/46/f...
9,fashion mia cu u 18 95 9 95 14 95 14 95 2016 f...,fashion mia,cba9c5ca50f1f5be522a0c6b8f4547ab,[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...,https://search-creatives.s3.amazonaws.com/f3/5...


In [11]:
#add ocr_logos as well to the 'ocr_text_all'
ocr_logos = df['ocr_logos'].values
ocr_logos

ocr_logos_flattened = [' '.join(s).lower() for s in ocr_logos]

ocr_text_all_new = []
for idx,val in enumerate(ocr_logos_flattened):
    res = val + ' ' + df['ocr_text_all'][idx]
    ocr_text_all_new.append(res)

df['ocr_text_all'] = ocr_text_all_new
df

Unnamed: 0,ocr_text_all,label,md5,ocr_logos,ocr_text,screenshot_url
0,et happy ashley holidays this is home 25 l 36...,ashley furniture,2bb7ffaef7e7012ade02da88cdf7edf7,[],et happy\nASHLEY\nholidays\nthis is home\n25 l...,https://search-creatives.s3.amazonaws.com/8b/c...
1,emirates emirates buy 2 tickets for the price ...,emirates airline,2eb9ba692c4c687cb3ef8491d042450b,[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...,https://search-creatives.s3.amazonaws.com/9d/1...
2,verizon wireless gratis verizon lg g pad tm 7 ...,verizon wireless,a40a6af8fa229c630c7d4d7951a8f517,[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...,https://search-creatives.s3.amazonaws.com/88/b...
3,walgreens great gifts are right walgreens arou...,walgreens,2a2a280ebc1369ee11be245a630a140a,[Walgreens],Great gifts are right walgreens.\naround the c...,https://search-creatives.s3.amazonaws.com/f8/d...
4,hp partner la force est puissante dans notre f...,hewlett packard,dd66013a011608d5410eee78517b2082,[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...,https://search-creatives.s3.amazonaws.com/f8/9...
5,ii we were guided step by step through the pu...,not_an_ad,6d48faaef070e63092398bd179579cf4,[],II We were guided step by step through\nthe pu...,https://moatsearch-data.s3.amazonaws.com/creat...
6,easy hassle free web hosting for 3 95 month u...,bluehost,d659dc402651c871ad8d631906d93c94,[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON...",https://search-creatives.s3.amazonaws.com/3c/a...
7,1664 004 kronen bourg 1664 pick up a pack toda...,not_an_ad,ce89d5fb1c26cc673b9acbc501d53978,[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...,https://search-creatives.s3.amazonaws.com/1d/e...
8,,not_an_ad,f5a7ffec70312212ab0c8eb1208f0351,[],,https://search-creatives.s3.amazonaws.com/46/f...
9,fashion mia fashion mia cu u 18 95 9 95 14 95 ...,fashion mia,cba9c5ca50f1f5be522a0c6b8f4547ab,[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...,https://search-creatives.s3.amazonaws.com/f3/5...
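
The cleaning and concatenation done in the two cells above can be checked on a single record. The sketch below is a standalone approximation, not the notebook's exact code: `build_text` is a hypothetical helper name, the sample string is only the visible prefix of row 1's `ocr_text`, and unlike the cells above it strips the leading space that appears when `ocr_logos` is empty.

```python
import re

NON_WORD = re.compile(r'\W+')  # runs of anything other than letters, digits, underscore

def build_text(ocr_text, ocr_logos):
    """Lowercase the OCR text, collapse non-word runs to one space, prepend logo names."""
    cleaned = NON_WORD.sub(' ', ocr_text).strip().lower()
    logos = ' '.join(ocr_logos).lower()
    # strip() drops the leading space left behind when ocr_logos is an empty list
    return (logos + ' ' + cleaned).strip()

sample = build_text("Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!", ["Emirates"])
print(sample)  # emirates emirates buy 2 tickets for the price of 1
```

Because the logo name is prepended, a brand that appears in both fields is counted twice by the vectorizer, which gives it more weight in the feature vector.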


In [45]:
#Randomize dataset
df_randomized = df
df_randomized = df_randomized.reindex(np.random.permutation(df_randomized.index))

In [46]:
#Choose a smaller subset of ad labels
labels = df_randomized['label'].values
labelsList = list(labels)
adLabels = set()

#TODO: Capping this at 400 labels so that the code runs in a reasonable amount of time
for l in labels:
    if (len(adLabels) < 400):
        adLabels.add(l)
adLabels       

{u' visit greenville sc',
 u'24optioncom',
 u'888 sport',
 u'aaa insurance',
 u'aarp',
 u'academy sports outdoors',
 u'acer',
 u'adobe',
 u'adriano goldschmied',
 u'agape senior',
 u'agario',
 u'aim',
 u'alaska airlines',
 u'alibabacom',
 u'allegany optical',
 u'allianz',
 u'amazon',
 u'ambulatory care center',
 u'american express',
 u'american university of antigua',
 u'app store',
 u'argos',
 u'arm hammer',
 u'arnot health',
 u'art van furniture',
 u'ashley furniture',
 u'at&t',
 u'atahotels',
 u'athleta',
 u'babybel',
 u'baers furniture design',
 u'baller arcade',
 u'banggoodcom',
 u'barberitos',
 u'baylor',
 u'beatport',
 u'befrugalcom',
 u'bellevue arts museum',
 u'belterra park gaming entertainment center',
 u'benq',
 u'benzinga',
 u'best buy',
 u'bet black entertainment television',
 u'beyonddietcom',
 u'billiard factory',
 u'bistro md',
 u'blendtec',
 u'bloomberg',
 u'blue cross blue shield',
 u'blue cross blue shield of kansas',
 u'blue vine',
 u'bluehost',
 u'bobs watches',
 

In [47]:
adLabelsList = list(adLabels)

In [48]:
#Select only subset of df containing adLabels
df_filtered = df_randomized.loc[df_randomized['label'].isin(adLabelsList)]

In [49]:
#Hold out 20% as the test set; the remaining 80% is used for 5-fold cross-validation
# test set -> set aside and used only to calculate the final performance of the classifier
# cross-validation set -> used for selecting the model (I try MultinomialNB, logistic regression and linear SVM)
# I select the model with the best mean score across the cross-validation folds

splitAt = int(0.2 * len(df_filtered))

test = df_filtered[0:splitAt]
cv_trainTest = df_filtered[splitAt:]
# cv_test = cv_trainTest[0:splitAt]
# cv_train = cv_trainTest[splitAt:]

In [50]:
#model selection
from sklearn.cross_validation import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
#from sklearn import svm

#NOTE: I would have ideally wanted to do parameter tuning using GridSearch to select the best parameters. Skipping it in the interest of time
#NOTE 2: I would have liked to add a misclassification cost for the label 'not_an_ad' in the objective. As I have not implemented this before, I will be skipping it in the interest of time.
clf1 = MultinomialNB()
clf2 = SGDClassifier(loss='log', penalty='l2', alpha=1e-3, n_iter=5, random_state=10)   #logistic regression
clf3 = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=10) #linear svm

clfs = [clf1, clf2, clf3]
#Adding this to normalize feature vector 
tfidf_transformer = TfidfTransformer()
vectorizer = CountVectorizer()

Y = cv_trainTest['label'].values
kf = KFold(len(Y), n_folds=5)

scores = []

for idx in range(0,len(clfs)):
    #implement cross-validation
    cv_scores = []
    for train_index, test_index in kf:
        X_train_counts = vectorizer.fit_transform(cv_trainTest['ocr_text_all'].values[train_index])
        X_train = tfidf_transformer.fit_transform(X_train_counts)
        X_test_counts = vectorizer.transform(cv_trainTest['ocr_text_all'].values[test_index])
        X_test = tfidf_transformer.transform(X_test_counts)
        Y_train, Y_test = Y[train_index], Y[test_index]
        clf = clfs[idx].fit(X_train, Y_train)
        score = clf.score(X_test, Y_test)
        cv_scores.append(score)
    scores.append(np.mean(cv_scores))
    
print(scores)
highScoreIndex = np.argmax(scores)
print(highScoreIndex)
clf = clfs[highScoreIndex]

[0.63135629359409573, 0.65460693536673931, 0.83943692894901289]
2


In [51]:
#create dictionary for ocr_text_all using vectorizer

#Use training+validation set to train the model
X_train_counts = vectorizer.fit_transform(cv_trainTest['ocr_text_all'].values)
X_train = tfidf_transformer.fit_transform(X_train_counts)
Y_train = cv_trainTest['label']

#use test set to get performance now
X_test_counts = vectorizer.transform(test['ocr_text_all'].values)
X_test = tfidf_transformer.transform(X_test_counts)
Y_test = test['label'].values

In [52]:
#get performance on test set

model = clf.fit(X_train, Y_train)
predictedRaw = model.predict(X_test)
model.score(X_test, Y_test)

0.73197674418604652

In [53]:
from sklearn import metrics

#performance metrics of classifier - without 'no prediction'
print(metrics.classification_report(Y_test, predictedRaw))

             precision    recall  f1-score   support

24optioncom       0.00      0.00      0.00         1
  888 sport       0.00      0.00      0.00         1
aaa insurance       0.00      0.00      0.00         2
       aarp       0.00      0.00      0.00         1
academy sports outdoors       1.00      0.50      0.67         6
       acer       0.00      0.00      0.00         1
adriano goldschmied       0.00      0.00      0.00         1
agape senior       0.00      0.00      0.00         1
     agario       0.00      0.00      0.00         1
        aim       0.00      0.00      0.00         1
alaska airlines       0.00      0.00      0.00         1
 alibabacom       0.00      0.00      0.00         1
    allianz       0.00      0.00      0.00         1
     amazon       1.00      0.96      0.98        27
ambulatory care center       0.00      0.00      0.00         1
american express       0.00      0.00      0.00         3
american university of antigua       0.00      0.00    

In [77]:
#return 'no prediction' for weak predictions
def predictedModified(model, x):
    predictions = []
    for elem in x:
        #largest decision-function score across all classes for this sample
        maxBoundary = np.max(model.decision_function(elem))
        prediction = model.predict(elem)
        if (maxBoundary < -0.02):
            prediction = np.array([u'no prediction'])
        #based on the cost matrix, require a much larger margin before predicting not_an_ad,
        #since predicting not_an_ad for a brand ad is the most expensive error
        elif (prediction == u'not_an_ad' and maxBoundary < 1.3):
            prediction = np.array([u'no prediction'])
        predictions.append(prediction[0])
    return predictions
         
predictions = predictedModified(model, X_test)
predictions

[u'no prediction',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'fashion mia',
 u'ashley furniture',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'verizon communications',
 u'no prediction',
 u'bluehost',
 u'no prediction',
 u'no prediction',
 u'jos a bank',
 u'fashion mia',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'jos a bank',
 u'bluehost',
 u'no prediction',
 u'no prediction',
 u'emirates airline',
 u'no prediction',
 u'no prediction',
 u'lifelock business solutions',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'the home depot',
 u'fashion mia',
 u'fashion mia',
 u'no prediction',
 u'no prediction',
 u'emirates airline',
 u'sammy dress',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'lowermybillscom',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'no prediction',
 u'

In [80]:
#Get generic performance
from sklearn import metrics

#performance report
print(metrics.classification_report(Y_test, predictions))

#Confusion matrix
print(metrics.confusion_matrix(Y_test, predictions))


             precision    recall  f1-score   support

24optioncom       0.00      0.00      0.00         1
  888 sport       0.00      0.00      0.00         1
aaa insurance       0.00      0.00      0.00         2
       aarp       0.00      0.00      0.00         1
academy sports outdoors       0.00      0.00      0.00         6
       acer       0.00      0.00      0.00         1
adriano goldschmied       0.00      0.00      0.00         1
agape senior       0.00      0.00      0.00         1
     agario       0.00      0.00      0.00         1
        aim       0.00      0.00      0.00         1
alaska airlines       0.00      0.00      0.00         1
 alibabacom       0.00      0.00      0.00         1
    allianz       0.00      0.00      0.00         1
     amazon       1.00      0.67      0.80        27
ambulatory care center       0.00      0.00      0.00         1
american express       0.00      0.00      0.00         3
american university of antigua       0.00      0.00    

In [79]:
#Get performance figures on the test Set which highlight how well we did as per the cost-matrix:

#A. Total number of records in test set

#B. any_brand metrics:
    #a. %age
    #b. %correctly predicted
    #c. %incorrectly predicted as not_an_ad // Need to ensure this is lowest
    #d. %incorrectly predicted as wrong brand //moderate cost
    #e. %surrendered //low cost
    
    
#C. not_an_ad metrics:
    #a. %age
    #b. %correctly predicted
    #c. %incorrectly predicted as wrong brand //moderate-high cost
    #d. %surrendered //low cost 
    

#anyBrandCount + noAdCount = totalCount
#a. %age composition
totalCount = len(Y_test)
anyBrandCount = len([k for k in Y_test if k!='not_an_ad' and k!= 'no prediction'])
anyBrandCountPerc = float(anyBrandCount)/totalCount
noAdCount = len([k for k in Y_test if k=='not_an_ad'])
noAdCountPerc = float(noAdCount)/totalCount

#number of surrenders by the classifier
noPredictionCount = len([k for k in predictions if k=='no prediction'])
noPredictionCountPerc = float(noPredictionCount)/totalCount

# %correctly predicted, %incorrectly predicted, %surrendered
anyBrandCorrect = 0
not_an_adCorrect = 0
not_an_adSurrender = 0
anyBrandSurrender = 0
not_an_adIncorrect = 0
anyBrand_notAnAdIncorrect = 0
anyBrand_wrongBrandIncorrect = 0

for idx,k in enumerate(Y_test):
    if(k == predictions[idx]):
        if (k=='not_an_ad'):
            not_an_adCorrect += 1    
        else:
            anyBrandCorrect += 1
    elif(predictions[idx] == 'no prediction'):
        if (k=='not_an_ad'):
            not_an_adSurrender += 1
        else:
            anyBrandSurrender += 1
    else:
        if (k=='not_an_ad'):
            not_an_adIncorrect += 1
        else:
            if (predictions[idx] == 'not_an_ad'):
                anyBrand_notAnAdIncorrect += 1
            else:
                anyBrand_wrongBrandIncorrect += 1
                

In [76]:
#from tabulate import tabulate
from prettytable import PrettyTable

print('~Performance Stats~:')
print('Total records: ', totalCount)

t = PrettyTable(['Category', 'total(%)', 'Correct', 'Incorrect_wrongAd', 'Incorrect_not_an_ad', 'No Prediction'])
t.add_row(['AnyBrand', anyBrandCountPerc, anyBrandCorrect, anyBrand_wrongBrandIncorrect, anyBrand_notAnAdIncorrect, anyBrandSurrender])
t.add_row(['No_ad', noAdCountPerc, not_an_adCorrect, not_an_adIncorrect, 'X', not_an_adSurrender])
print t

~Performance Stats~:
('Total records: ', 1720)
+----------+----------------+---------+-------------------+---------------------+---------------+
| Category |    total(%)    | Correct | Incorrect_wrongAd | Incorrect_not_an_ad | No Prediction |
+----------+----------------+---------+-------------------+---------------------+---------------+
| AnyBrand | 0.656976744186 |   563   |         7         |          4          |      556      |
|  No_ad   | 0.343023255814 |    8    |         3         |          X          |      579      |
+----------+----------------+---------+-------------------+---------------------+---------------+
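
The counts in the table above can be turned into a single cost figure using the cost matrix from the problem definition. The following is a minimal sketch rather than part of the original notebook: the dictionary keys and the `total_cost` helper are names made up for illustration, and the "always abstain" baseline is added only as a point of comparison.

```python
# Costs taken from the cost matrix in the problem definition (correct predictions cost 0).
COSTS = {
    ('brand', 'correct'): 0,
    ('brand', 'wrong_brand'): -20,
    ('brand', 'non_ad'): -100,
    ('brand', 'no_prediction'): -5,
    ('non_ad', 'correct'): 0,
    ('non_ad', 'wrong_brand'): -40,
    ('non_ad', 'no_prediction'): -5,
}

# Counts copied from the performance table above.
COUNTS = {
    ('brand', 'correct'): 563,
    ('brand', 'wrong_brand'): 7,
    ('brand', 'non_ad'): 4,
    ('brand', 'no_prediction'): 556,
    ('non_ad', 'correct'): 8,
    ('non_ad', 'wrong_brand'): 3,
    ('non_ad', 'no_prediction'): 579,
}

def total_cost(counts, costs):
    """Sum of count * cost over every (actual, outcome) cell."""
    return sum(counts[cell] * costs[cell] for cell in counts)

n = sum(COUNTS.values())          # 1720 test records
total = total_cost(COUNTS, COSTS)
baseline = -5 * n                 # cost of answering 'no prediction' for every record

print('total cost:', total)                            # -6335
print('cost per record: %.2f' % (total / float(n)))    # -3.68
print('always-abstain baseline:', baseline)            # -8600
```

Most of the remaining cost comes from the 1135 'no prediction' answers rather than from wrong labels, so under this cost matrix the thresholds are the main lever left.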


# Implementation Details:

On analyzing the dataset, 'ocr_logos' and 'ocr_text' appeared to be the most informative fields for predicting the label.

These are the steps I took to create the classifier for predicting the ad label:

1. Cleaned the text in the ocr_text field by converting it to lowercase and replacing every run of non-word 
   characters (punctuation and whitespace) with a single space. Also prepended the lowercased ocr_logos to this text.

2. Randomized the dataset and kept the first 400 distinct labels encountered (capped to keep the run time manageable).

3. Filtered the randomized dataset to contain only the selected labels; held out 20% as the test set and kept the remaining 80% for cross-validation.

4. Ran 5-fold cross-validation on the remaining 80% to select a model (linear SVM in this case, with a mean accuracy of about 0.84).

5. Retrained on the full cross-validation set and measured performance on the test set (accuracy of about 0.73 before adding 'no prediction').

6. Computed performance stats that map directly onto the cells of the cost matrix. I tweaked the thresholds in the method predictedModified to keep the number of incorrect predictions involving not_an_ad low.


Things that I would have liked to work on:

1. Case information: Currently I convert all text to lowercase, which discards capitalization that could help 
   pinpoint an occurrence of the advertiser name in ocr_text.
2. Lemmatization & punctuation handling: I would have liked to apply lemmatization and more careful punctuation 
   handling to ocr_text. This would likely have improved performance.
3. Grid search: to find better hyperparameters for the model.
4. Learning curves: I would have liked to plot learning curves comparing training and cross-validation scores, to 
   tell whether the model has high variance or high bias. If it has high variance, I could proceed to use more 
   data points in training.
5. Embedding label cost in the hypothesis: Currently, the predictions are modified based on the distance from the 
   decision boundary (the assumption here is that the SVM is the selected model). It would be much better to add 
   weights to the training objective that heavily penalize incorrectly predicting 'not_an_ad' for a brand ad.
6. Threshold selection: I would have liked to experiment more with choosing a good threshold for 'no prediction'. 
   The current thresholds hit the recall on No_ad severely (only 8 of 590 non-ads were labelled correctly, and 579 
   received 'no prediction'); I would have wanted to find a better balance. Currently the thresholds are set to 
   give a very low number of incorrect predictions involving not_an_ad.
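
One way to approach the threshold selection in item 6 is to sweep candidate thresholds and keep the one with the lowest total cost under the cost matrix. The sketch below is only an illustration under stated assumptions: `cost_of` and `best_threshold` are hypothetical helpers, the records are made-up toy values rather than notebook output, and it uses a single threshold on the decision margin instead of the two thresholds in predictedModified. The sweep would be run on validation data so that the test set stays untouched.

```python
def cost_of(actual, predicted):
    """Cost of one prediction under the cost matrix from the problem definition."""
    if predicted == 'no prediction':
        return -5
    if predicted == actual:
        return 0
    if actual == 'not_an_ad':
        return -40    # non-ad labelled as some brand
    if predicted == 'not_an_ad':
        return -100   # brand ad labelled as a non-ad
    return -20        # brand ad labelled as the wrong brand

def best_threshold(records, candidates):
    """records: (actual, predicted, margin) triples from a validation set.

    A prediction is replaced by 'no prediction' when its margin is below the
    threshold. Returns the (threshold, total_cost) pair with the highest
    (least negative) total cost.
    """
    best = None
    for t in candidates:
        total = sum(cost_of(a, p if m >= t else 'no prediction')
                    for a, p, m in records)
        if best is None or total > best[1]:
            best = (t, total)
    return best

# Toy validation records (hypothetical values, for illustration only).
records = [
    ('amazon', 'amazon', 1.2),
    ('amazon', 'not_an_ad', 0.1),
    ('not_an_ad', 'not_an_ad', 0.9),
    ('bluehost', 'amazon', -0.3),
    ('not_an_ad', 'bluehost', 0.4),
]
print(best_threshold(records, [-1.0, 0.0, 0.5, 1.0]))  # (0.5, -15)
```

On the toy data, raising the threshold from 0.5 to 1.0 starts converting a correct not_an_ad prediction into an abstention, which is the same recall trade-off described in item 6.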
   
   