# Ad Predictor

### Problem definition:

#### Implement a proof-of-concept classifier that uses data about banner ads to predict the advertiser represented in an ad, or return "not an ad" if the image isn't an ad, or return "no prediction" if the classifier isn't sufficiently confident. You can find a labeled dataset with representative class frequencies at http://moatsearch-data.s3.amazonaws.com/homework/ad_classification_hw_dataset.json.

#### Use the following cost matrix to inform your implementation and analysis:

```
| actual \ predicted |  correct brand  |  wrong brand  |  non-ad  |  no prediction |
|--------------------|-----------------|---------------|----------|----------------|
| any brand          |        0        |      -20      |   -100   |       -5       |
| non-ad             |        X        |      -40      |     0    |       -5       |
```
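
To make the matrix easier to use during evaluation, it can be encoded as a lookup table. The following is a minimal sketch, not part of the original assignment: the names `COSTS`, `cost`, `total_cost` and the `"no prediction"` sentinel string are hypothetical, and it assumes the `X` cell means "not applicable" (a non-ad has no correct brand).

```python
# Cost matrix from the problem statement, keyed by (actual kind, outcome).
# Assumption: the "X" cell (non-ad predicted as its "correct brand") cannot
# occur, so it is left out of the table.
COSTS = {
    ("brand", "correct_brand"): 0,
    ("brand", "wrong_brand"): -20,
    ("brand", "non_ad"): -100,
    ("brand", "no_prediction"): -5,
    ("non_ad", "wrong_brand"): -40,
    ("non_ad", "non_ad"): 0,
    ("non_ad", "no_prediction"): -5,
}

NOT_AN_AD = "not_an_ad"          # label used in the dataset
NO_PREDICTION = "no prediction"  # hypothetical sentinel for abstaining


def cost(actual, predicted):
    """Return the cost of one prediction according to the matrix."""
    actual_kind = "non_ad" if actual == NOT_AN_AD else "brand"
    if predicted == NO_PREDICTION:
        outcome = "no_prediction"
    elif predicted == NOT_AN_AD:
        outcome = "non_ad"
    elif predicted == actual:
        outcome = "correct_brand"
    else:
        outcome = "wrong_brand"
    return COSTS[(actual_kind, outcome)]


def total_cost(actuals, predictions):
    """Sum the per-row costs over a set of predictions."""
    return sum(cost(a, p) for a, p in zip(actuals, predictions))
```

One consequence worth noting: abstaining always costs 5, while a wrong brand on a real ad costs 20. If the image is known to be an ad, guessing a brand with probability `p` of being right has expected cost `20 * (1 - p)`, so guessing only beats abstaining when `p > 0.75`. The threshold is higher once the chance of a non-ad is factored in.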

#### Questions:
- Discuss the performance of your classifier. For context, include specs for the machine you trained your classifier on.
- Describe the reasoning behind all major design decisions you had to make.
- If you were to keep developing this proof-of-concept, what are some changes you think would be promising to explore next, and why?

## Solution:

### Classes for prediction: not_an_ad, [advertiser name]

#### Steps:
```
1. Start by loading data into memory and removing unnecessary fields
2. Add column 'labelOutput' which gives a numerical category to each of the labels
3. Add columns corresponding to features for ocr_logo:
    x1: ocr_logo_present  = 0 if ocr_logos is empty ([]), otherwise 1
    x2: ocr_logo_intersection = [(labelOutput of intersected label, intersection %)]
    x3: ocr_logo_intersectionMax_LabelOutput = labelOutput from _x2_ with highest intersection
    x4: ocr_logo_intersectionMax_percentage = intersection % from _x2_ with highest intersection
    x5: ocr_logo_numClassesMatched = size of _x2_
```
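
As a rough illustration of how features x1 through x5 relate to each other, here is a sketch for a single row. It is not the notebook's implementation: the function name `logo_features` and the `min_ratio` cutoff are hypothetical, and it assumes "intersection %" is measured with `difflib.SequenceMatcher`, the similarity measure used later in this notebook.

```python
from difflib import SequenceMatcher


def logo_features(ocr_logos, labels, min_ratio=0.5):
    """Return (x1, x2, x3, x4, x5) for one row.

    ocr_logos: list of logo strings detected in the image
    labels:    list of label strings, indexed by labelOutput
    min_ratio: hypothetical cutoff below which a label does not count as matched
    """
    # x1: 1 if any logo was detected, else 0
    x1 = int(len(ocr_logos) > 0)

    # x2: (labelOutput, similarity) for every label that matches a detected logo
    x2 = []
    for logo in ocr_logos:
        for idx, label in enumerate(labels):
            ratio = SequenceMatcher(None, label, logo.lower()).ratio()
            if ratio >= min_ratio:
                x2.append((idx, ratio))

    # x3, x4: labelOutput and similarity of the best match (-1 and 0.0 if none)
    if x2:
        x3, x4 = max(x2, key=lambda pair: pair[1])
    else:
        x3, x4 = -1, 0.0

    # x5: number of matches found
    x5 = len(x2)
    return x1, x2, x3, x4, x5
```

For example, `logo_features([], labels)` yields `(0, [], -1, 0.0, 0)`, while a row whose logo exactly equals one label gets that label's index as x3 and 1.0 as x4.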

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:
givenData = pd.read_json('http://moatsearch-data.s3.amazonaws.com/homework/ad_classification_hw_dataset.json')
givenData.describe()


Unnamed: 0,label,md5,ocr_logos,ocr_text,screenshot_url
count,9729,9729,9729,9729.0,9729
unique,1375,9729,2407,8056.0,9729
top,not_an_ad,6dbd03573e0610e810f13ac1a8170a71,[],,https://search-creatives.s3.amazonaws.com/64/9...
freq,3400,1,4710,732.0,1


In [3]:
givenData

Unnamed: 0,label,md5,ocr_logos,ocr_text,screenshot_url
0,ashley furniture,2bb7ffaef7e7012ade02da88cdf7edf7,[],et happy\nASHLEY\nholidays\nthis is home\n25 l...,https://search-creatives.s3.amazonaws.com/8b/c...
1,emirates airline,2eb9ba692c4c687cb3ef8491d042450b,[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...,https://search-creatives.s3.amazonaws.com/9d/1...
2,verizon wireless,a40a6af8fa229c630c7d4d7951a8f517,[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...,https://search-creatives.s3.amazonaws.com/88/b...
3,walgreens,2a2a280ebc1369ee11be245a630a140a,[Walgreens],Great gifts are right walgreens.\naround the c...,https://search-creatives.s3.amazonaws.com/f8/d...
4,hewlett packard,dd66013a011608d5410eee78517b2082,[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...,https://search-creatives.s3.amazonaws.com/f8/9...
5,not_an_ad,6d48faaef070e63092398bd179579cf4,[],II We were guided step by step through\nthe pu...,https://moatsearch-data.s3.amazonaws.com/creat...
6,bluehost,d659dc402651c871ad8d631906d93c94,[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON...",https://search-creatives.s3.amazonaws.com/3c/a...
7,not_an_ad,ce89d5fb1c26cc673b9acbc501d53978,[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...,https://search-creatives.s3.amazonaws.com/1d/e...
8,not_an_ad,f5a7ffec70312212ab0c8eb1208f0351,[],,https://search-creatives.s3.amazonaws.com/46/f...
9,fashion mia,cba9c5ca50f1f5be522a0c6b8f4547ab,[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...,https://search-creatives.s3.amazonaws.com/f3/5...


In [5]:
filtered = givenData.drop(['md5', 'screenshot_url'], axis = 1)

In [8]:
filtered 

Unnamed: 0,label,ocr_logos,ocr_text
0,ashley furniture,[],et happy\nASHLEY\nholidays\nthis is home\n25 l...
1,emirates airline,[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...
2,verizon wireless,[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...
3,walgreens,[Walgreens],Great gifts are right walgreens.\naround the c...
4,hewlett packard,[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...
5,not_an_ad,[],II We were guided step by step through\nthe pu...
6,bluehost,[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON..."
7,not_an_ad,[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...
8,not_an_ad,[],
9,fashion mia,[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...


In [10]:
# Add column 'labelOutput', which gives a numerical category to each of the labels
from sklearn import preprocessing

label_op = preprocessing.LabelEncoder()

# Drop any earlier copy of the column so the cell can be re-run safely
filtered = filtered.drop('labelOutput', axis=1, errors='ignore')
filtered.insert(loc=0, column='labelOutput', value=label_op.fit_transform(filtered['label']))
filtered  # not_an_ad labelOutput = 820, unique labels = 1375 (including not_an_ad)

Unnamed: 0,labelOutput,label,ocr_logos,ocr_text
0,85,ashley furniture,[],et happy\nASHLEY\nholidays\nthis is home\n25 l...
1,371,emirates airline,[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...
2,1280,verizon wireless,[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...
3,1307,walgreens,[Walgreens],Great gifts are right walgreens.\naround the c...
4,509,hewlett packard,[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...
5,820,not_an_ad,[],II We were guided step by step through\nthe pu...
6,167,bluehost,[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON..."
7,820,not_an_ad,[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...
8,820,not_an_ad,[],
9,402,fashion mia,[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...
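
To show where the numeric categories come from, here is a small sketch on a toy list of labels (the four-item list is made up for illustration). `LabelEncoder` sorts the distinct labels and assigns each its position in that sorted order, which is why `not_an_ad` lands at 820 in the full dataset.

```python
from sklearn import preprocessing

# Toy labels for illustration only; the real column has 1375 distinct values
labels = ["walgreens", "not_an_ad", "bluehost", "not_an_ad"]

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the distinct labels in sorted order; a label's code is its index here
print(list(encoder.classes_))

# inverse_transform maps an array of codes back to the label strings
print(list(encoder.inverse_transform(codes)))
```

Because `classes_` is already ordered by code, it can serve directly as a code-to-label lookup list.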


In [32]:
# Add ocr_logos_present column: 1 if any logo was detected, else 0
# [5019/9729 data points with ocr_logos_present] 1374 unique ocr_logos

# Drop any earlier copy of the column so the cell can be re-run safely
filtered = filtered.drop('ocr_logos_present', axis=1, errors='ignore')
filtered.insert(loc=2, column='ocr_logos_present', value=[int(len(k) > 0) for k in filtered['ocr_logos']])
filtered

Unnamed: 0,labelOutput,label,ocr_logos_present,ocr_logo_intersection_full_index,ocr_logos,ocr_text
0,85,ashley furniture,0,[],[],et happy\nASHLEY\nholidays\nthis is home\n25 l...
1,371,emirates airline,1,[Emirates],[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...
2,1280,verizon wireless,1,[Verizon Wireless],[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...
3,1307,walgreens,1,[Walgreens],[Walgreens],Great gifts are right walgreens.\naround the c...
4,509,hewlett packard,1,[HP Partner],[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...
5,820,not_an_ad,0,[],[],II We were guided step by step through\nthe pu...
6,167,bluehost,0,[],[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON..."
7,820,not_an_ad,1,[1664],[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...
8,820,not_an_ad,0,[],[],
9,402,fashion mia,1,[Fashion MIA],[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...


In [45]:
# Add ocr_logos_intersection_full_index column: (best match ratio, matched labelOutput)

import string
from difflib import SequenceMatcher

# Label strings indexed by labelOutput (LabelEncoder keeps classes in encoded order)
l = list(label_op.classes_)

def replacePunct(x):
    # Strip all ASCII punctuation from the string
    for c in string.punctuation:
        x = x.replace(c, "")
    return x

def getMatchedLabelOutput(xList):
    # No logo detected: ratio 0.0 and sentinel labelOutput -1
    if not xList:
        return (0.0, -1)
    # Only the first detected logo is compared against the labels
    lowerXCleaned = replacePunct(xList[0].lower())
    ratios = [SequenceMatcher(None, x, lowerXCleaned).ratio() for x in l]
    # max() on (ratio, index) pairs picks the highest ratio; ties go to the larger index
    return max(zip(ratios, range(len(ratios))))

# Work on a copy of the first 100 rows while prototyping
filtered1 = filtered[0:100].copy()
filtered1 = filtered1.drop('ocr_logos_intersection_full_index', axis=1, errors='ignore')
filtered1.insert(loc=3, column='ocr_logos_intersection_full_index', value=filtered1['ocr_logos'].apply(getMatchedLabelOutput))
filtered1

Unnamed: 0,labelOutput,label,ocr_logos_present,ocr_logos_intersection_full_index,ocr_logos,ocr_text
0,85,ashley furniture,0,"(0.0, -1)",[],et happy\nASHLEY\nholidays\nthis is home\n25 l...
1,371,emirates airline,1,"(0.714285714286, 743)",[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...
2,1280,verizon wireless,1,"(1.0, 1280)",[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...
3,1307,walgreens,1,"(1.0, 1307)",[Walgreens],Great gifts are right walgreens.\naround the c...
4,509,hewlett packard,1,"(0.625, 910)",[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...
5,820,not_an_ad,0,"(0.0, -1)",[],II We were guided step by step through\nthe pu...
6,167,bluehost,0,"(0.0, -1)",[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON..."
7,820,not_an_ad,1,"(0.285714285714, 364)",[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...
8,820,not_an_ad,0,"(0.0, -1)",[],
9,402,fashion mia,1,"(1.0, 402)",[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...
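
Row 1 above is matched to label 743 rather than its true label 371, which is worth a closer look. The sketch below shows how `SequenceMatcher.ratio()` scores a pair: it returns `2 * M / T`, where `M` is the number of matching characters and `T` is the combined length of both strings. The helper name `similarity` is hypothetical, and the example assumes the true label string is exactly `emirates airline`.

```python
from difflib import SequenceMatcher


def similarity(label, logo):
    """Similarity in [0, 1] between a label and a cleaned, lowercased logo string."""
    return SequenceMatcher(None, label, logo).ratio()


# Identical strings score 1.0, as in the walgreens row
exact = similarity("walgreens", "walgreens")

# "emirates" matches the first 8 of 16 characters of the label:
# M = 8, T = 16 + 8 = 24, so the ratio is 2 * 8 / 24
partial = similarity("emirates airline", "emirates")

print(exact, partial)
```

Under that assumption the true label scores about 0.667, below the 0.714 of the label that won, because the ratio penalizes the extra characters in the longer string. A containment-style score or token-level matching may be less sensitive to this, though neither has been tried here.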
