# Ad Predictor

### Problem definition:

#### Implement a proof-of-concept classifier that uses data about banner ads to predict the advertiser represented in an ad, or return "not an ad" if the image isn't an ad, or return "no prediction" if the classifier isn't sufficiently confident. You can find a labeled dataset with representative class frequencies at http://moatsearch-data.s3.amazonaws.com/homework/ad_classification_hw_dataset.json.

#### Use the following cost matrix to inform your implementation and analysis:

```
| predicted   |  correct brand     |  wrong brand  |  non-ad  |  no prediction |
| actual -----|--------------------|---------------|----------|----------------|
| any brand   |         0          |      -20      |    -100  |       -5       |
| non-ad      |         X          |      -40      |     0    |       -5       |
```

#### Questions:
- Discuss the performance of your classifier. For context, include specs for the machine you trained your classifier on.
- Describe the reasoning behind all major design decisions you had to make.
- If you were to keep developing this proof-of-concept, what are some changes you think would be promising to explore next, and why?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [26]:
givenData = pd.read_json('http://moatsearch-data.s3.amazonaws.com/homework/ad_classification_hw_dataset.json')
givenData.describe()

Unnamed: 0,label,md5,ocr_logos,ocr_text,screenshot_url
count,9729,9729,9729,9729.0,9729
unique,1375,9729,2407,8056.0,9729
top,not_an_ad,6dbd03573e0610e810f13ac1a8170a71,[],,https://search-creatives.s3.amazonaws.com/64/9...
freq,3400,1,4710,732.0,1


In [27]:
givenData

Unnamed: 0,label,md5,ocr_logos,ocr_text,screenshot_url
0,ashley furniture,2bb7ffaef7e7012ade02da88cdf7edf7,[],et happy\nASHLEY\nholidays\nthis is home\n25 l...,https://search-creatives.s3.amazonaws.com/8b/c...
1,emirates airline,2eb9ba692c4c687cb3ef8491d042450b,[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...,https://search-creatives.s3.amazonaws.com/9d/1...
2,verizon wireless,a40a6af8fa229c630c7d4d7951a8f517,[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...,https://search-creatives.s3.amazonaws.com/88/b...
3,walgreens,2a2a280ebc1369ee11be245a630a140a,[Walgreens],Great gifts are right walgreens.\naround the c...,https://search-creatives.s3.amazonaws.com/f8/d...
4,hewlett packard,dd66013a011608d5410eee78517b2082,[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...,https://search-creatives.s3.amazonaws.com/f8/9...
5,not_an_ad,6d48faaef070e63092398bd179579cf4,[],II We were guided step by step through\nthe pu...,https://moatsearch-data.s3.amazonaws.com/creat...
6,bluehost,d659dc402651c871ad8d631906d93c94,[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON...",https://search-creatives.s3.amazonaws.com/3c/a...
7,not_an_ad,ce89d5fb1c26cc673b9acbc501d53978,[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...,https://search-creatives.s3.amazonaws.com/1d/e...
8,not_an_ad,f5a7ffec70312212ab0c8eb1208f0351,[],,https://search-creatives.s3.amazonaws.com/46/f...
9,fashion mia,cba9c5ca50f1f5be522a0c6b8f4547ab,[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...,https://search-creatives.s3.amazonaws.com/f3/5...


In [28]:
#clean ocr_text columns and add to new column 'ocr_text_all'
import re
reg = re.compile('\W+')

df = givenData
ocr_text = df['ocr_text'].values

def clean(text):
    return reg.sub(' ', text).strip()

cleanText = [clean(text).lower() for text in ocr_text]
df.insert(loc=0, column='ocr_text_all', value= cleanText) 
df


Unnamed: 0,ocr_text_all,label,md5,ocr_logos,ocr_text,screenshot_url
0,et happy ashley holidays this is home 25 l 36 ...,ashley furniture,2bb7ffaef7e7012ade02da88cdf7edf7,[],et happy\nASHLEY\nholidays\nthis is home\n25 l...,https://search-creatives.s3.amazonaws.com/8b/c...
1,emirates buy 2 tickets for the price of 1 jfk ...,emirates airline,2eb9ba692c4c687cb3ef8491d042450b,[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...,https://search-creatives.s3.amazonaws.com/9d/1...
2,gratis verizon lg g pad tm 7 0 lte aprendemas ...,verizon wireless,a40a6af8fa229c630c7d4d7951a8f517,[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...,https://search-creatives.s3.amazonaws.com/88/b...
3,great gifts are right walgreens around the cor...,walgreens,2a2a280ebc1369ee11be245a630a140a,[Walgreens],Great gifts are right walgreens.\naround the c...,https://search-creatives.s3.amazonaws.com/f8/d...
4,la force est puissante dans notre famille du 2...,hewlett packard,dd66013a011608d5410eee78517b2082,[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...,https://search-creatives.s3.amazonaws.com/f8/9...
5,ii we were guided step by step through the pur...,not_an_ad,6d48faaef070e63092398bd179579cf4,[],II We were guided step by step through\nthe pu...,https://moatsearch-data.s3.amazonaws.com/creat...
6,easy hassle free web hosting for 3 95 month un...,bluehost,d659dc402651c871ad8d631906d93c94,[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON...",https://search-creatives.s3.amazonaws.com/3c/a...
7,004 kronen bourg 1664 pick up a pack today dri...,not_an_ad,ce89d5fb1c26cc673b9acbc501d53978,[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...,https://search-creatives.s3.amazonaws.com/1d/e...
8,,not_an_ad,f5a7ffec70312212ab0c8eb1208f0351,[],,https://search-creatives.s3.amazonaws.com/46/f...
9,fashion mia cu u 18 95 9 95 14 95 14 95 2016 f...,fashion mia,cba9c5ca50f1f5be522a0c6b8f4547ab,[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...,https://search-creatives.s3.amazonaws.com/f3/5...


In [32]:
#add ocr_logos as well to the 'ocr_text_all'
ocr_logos = df['ocr_logos'].values
ocr_logos

ocr_logos_flattened = [' '.join(s).lower() for s in ocr_logos]

ocr_text_all_new = []
for idx,val in enumerate(ocr_logos_flattened):
    res = val + ' ' + df['ocr_text_all'][idx]
    ocr_text_all_new.append(res)

df['ocr_text_all'] = ocr_text_all_new
df

Unnamed: 0,ocr_text_all,label,md5,ocr_logos,ocr_text,screenshot_url
0,et happy ashley holidays this is home 25 l 36...,ashley furniture,2bb7ffaef7e7012ade02da88cdf7edf7,[],et happy\nASHLEY\nholidays\nthis is home\n25 l...,https://search-creatives.s3.amazonaws.com/8b/c...
1,emirates emirates buy 2 tickets for the price ...,emirates airline,2eb9ba692c4c687cb3ef8491d042450b,[Emirates],Emirates\nBUY 2 TICKETS\nFOR THE\nPRICE OF 1!\...,https://search-creatives.s3.amazonaws.com/9d/1...
2,verizon wireless gratis verizon lg g pad tm 7 ...,verizon wireless,a40a6af8fa229c630c7d4d7951a8f517,[Verizon Wireless],Gratis\nVerizon\nLG G Pad'\nTM\n7.0 LTE\nApren...,https://search-creatives.s3.amazonaws.com/88/b...
3,walgreens great gifts are right walgreens arou...,walgreens,2a2a280ebc1369ee11be245a630a140a,[Walgreens],Great gifts are right walgreens.\naround the c...,https://search-creatives.s3.amazonaws.com/f8/d...
4,hp partner la force est puissante dans notre f...,hewlett packard,dd66013a011608d5410eee78517b2082,[HP Partner],LA FORCE EST\nPUISSANTE DANS\nNOTRE FAMILLE\nD...,https://search-creatives.s3.amazonaws.com/f8/9...
5,ii we were guided step by step through the pu...,not_an_ad,6d48faaef070e63092398bd179579cf4,[],II We were guided step by step through\nthe pu...,https://moatsearch-data.s3.amazonaws.com/creat...
6,easy hassle free web hosting for 3 95 month u...,bluehost,d659dc402651c871ad8d631906d93c94,[],"EASY, HASSLE-FREE\nWEB HOSTING\nFOR $3.95 /MON...",https://search-creatives.s3.amazonaws.com/3c/a...
7,1664 004 kronen bourg 1664 pick up a pack toda...,not_an_ad,ce89d5fb1c26cc673b9acbc501d53978,[1664],/004\nKRONEN BOURG\n1664\nPICK UP A\nPACK TODA...,https://search-creatives.s3.amazonaws.com/1d/e...
8,,not_an_ad,f5a7ffec70312212ab0c8eb1208f0351,[],,https://search-creatives.s3.amazonaws.com/46/f...
9,fashion mia fashion mia cu u 18 95 9 95 14 95 ...,fashion mia,cba9c5ca50f1f5be522a0c6b8f4547ab,[Fashion MIA],Fashion\nMia\nCU U\n$18.95\n$9.95\n$14.95\n$14...,https://search-creatives.s3.amazonaws.com/f3/5...


In [34]:
#Randomize dataset
df_randomized = df
df_randomized = df_randomized.reindex(np.random.permutation(df_randomized.index))

In [42]:
#Choose a smaller subset of ad labels
labels = df_randomized['label'].values
adLabels = set()
for l in labels:
    if (len(adLabels)<200):
        adLabels.add(l)
adLabels       

{u'37com',
 u'aaa travel',
 u'academy sports outdoors',
 u'afl',
 u'aktion deutschland hilft',
 u'aliexpress',
 u'align',
 u'amazon',
 u'american express',
 u'amsoil',
 u'ansi',
 u'app store',
 u'ashley furniture',
 u'at&t',
 u'avon',
 u'baller arcade',
 u'banggoodcom',
 u'beatport',
 u'best buy',
 u'bet black entertainment television',
 u'bloomberg',
 u'blue cross blue shield',
 u'blue vine',
 u'bluehost',
 u'bon prix',
 u'brandsmart usa',
 u'british red cross',
 u'brooklyn movie',
 u'bsn',
 u'buick gmc dealerships',
 u'cafe heavenly',
 u'cartwheel',
 u'chamberlain college of nursing',
 u'charleston resort islands golf',
 u'club sport',
 u'coke',
 u'coldwell banker',
 u'coles fine flooring',
 u'community transit',
 u'compassion',
 u'conceptis logic puzzles',
 u'conedison solutions',
 u'crye leike realtors',
 u'custominkcom',
 u'decathlon',
 u'dell',
 u'demandware',
 u'digi key',
 u'duproprio',
 u'edina realty',
 u'elite daily',
 u'emirates airline',
 u'engie',
 u'epiphone',
 u'fashion

In [119]:
adLabelsList = list(adLabels)

[u'hebs digital',
 u'mitsubishi electric',
 u'walgreens',
 u'lowermybillscom',
 u'state farm',
 u'girlgameme',
 u'ibotta',
 u'florida virtual school',
 u'mirage',
 u'modcloth',
 u'zenni optical',
 u'mcdonalds',
 u'shopathomecom',
 u'zulily',
 u'fifty shades of grey',
 u'michael todd',
 u'rue la la',
 u'ymca young mens christian association',
 u'la z boy',
 u'regent',
 u'bon prix',
 u'texture',
 u'custominkcom',
 u'ocz',
 u'blue vine',
 u'room board',
 u'baller arcade',
 u'taylormade',
 u'tracfone',
 u'ski sundown ',
 u'compassion',
 u'time warner cable',
 u'dell',
 u'amazon',
 u'sparta war of empires game',
 u'massage envy spa',
 u'target',
 u'sofacom',
 u'intel',
 u'umass amherst',
 u'marriott hotels',
 u'h',
 u'jos a bank',
 u'toyota',
 u'crye leike realtors',
 u'william hill',
 u'qc event school',
 u'subway',
 u'passport auto',
 u'xero',
 u'qrops choices',
 u'aaa travel',
 u'lego nexo knights',
 u'neiman marcus',
 u'kiehls',
 u'kaiser realty',
 u'romwe',
 u'aliexpress',
 u'm&t bank'

In [120]:
#Select only subset of df containing adLabels
df_filtered = df_randomized.loc[df_randomized['label'].isin(adLabelsList)]

In [121]:
#Keep 60% training, 20% cross-validation test, 20% validation test. Remove validationTest part
splitAt = int(0.2 * len(df_filtered))

test = df_filtered[0:splitAt]

cv_trainTest = df_filtered[splitAt:]
cv_test = cv_trainTest[0:splitAt]
cv_train = cv_trainTest[splitAt:]

Unnamed: 0,ocr_text_all,label,md5,ocr_logos,ocr_text,screenshot_url
247,kroger rger great food low prices 99 with card...,kroger,dc0db6966e27db2490b9a6e099a9628e,[Kroger],Rger\nGreat food\nLow prices.\n99\nWith Card\n...,https://search-creatives.s3.amazonaws.com/80/b...
4151,walmart cheetos lable online only free offer l...,walmart,9930196aa5bc2c4044a2a4888cb6be06,"[Walmart, Cheetos]",lable online only\nFree\noffer last.\nwhile su...,https://search-creatives.s3.amazonaws.com/b6/3...
2775,mount up 1943,not_an_ad,17e1e43397dd328c6aa75326b74d5965,[],MOUNT\nUP\n1943\n,https://search-creatives.s3.amazonaws.com/ef/e...
7711,marriott international book a hotel near tampa...,marriott hotels,fbc7da8c4a669850d7c974836e5fc361,[Marriott International],Book a hotel near\nTAMPA\nMarriott\nFROM\n76\n...,https://search-creatives.s3.amazonaws.com/5e/b...
9243,ashley furniture be no interest financing for ...,ashley furniture,11694949ff2b251c3bc9e9be6c7183e8,[Ashley Furniture],be\nno interest financing for\nevent\nblack fr...,https://search-creatives.s3.amazonaws.com/d4/3...
3422,walmart walmart les fetes petits prix magasiner,walmart,d72e45b39ce685afcf92aeb0060c4388,[Walmart],Walmart\nLES FETES petits prix\nMagasiner\n,https://search-creatives.s3.amazonaws.com/1a/5...
8148,emirates spencers solicitors emirates book 2 p...,emirates airline,05972ccec26b5f3f2cdba63094b3e063,"[Emirates, Spencers Solicitors]",Emirates\nBook 2 passengers\ntraveling togethe...,https://search-creatives.s3.amazonaws.com/b1/f...
3100,subway star wars star the force awaken marsh e...,subway,8e892c33fd7d624bb7b4b3c8cd8ff4ca,"[Subway, Star Wars]",STAR\nTHE FORCE\nAWAKEN\nMARSH\nEXCLUSIVE\nSTA...,https://search-creatives.s3.amazonaws.com/1e/f...
9198,cesaultste marie mchgan,not_an_ad,c70d3cc745d8602b131c09feb7419cf5,[],CESAULTSTE MARIE MCHGAN\n,https://search-creatives.s3.amazonaws.com/5d/2...
7687,stylish home decor the well appointed house,not_an_ad,b43e25fde7d24c5a03f6f06c3a54dc7f,[],STYLISH HOME DECOR\nTHE WELL APPOINTED HOUSE\n,https://search-creatives.s3.amazonaws.com/c8/f...


In [270]:
#create dictionary for ocr_text_all using vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vocab = vectorizer.fit_transform(cv_trainTest['ocr_text_all'].values)
#vocab.shape -> 16245

predictors = ['ocr_text_all']
X_train = vectorizer.transform(cv_train['ocr_text_all'].values)
X_train.shape
Y_train = cv_train['label']
Y_test = cv_test['label'].values


In [271]:
#clf1- Naive Baye's classifier
from sklearn.naive_bayes import MultinomialNB
clfNB = MultinomialNB()

In [272]:
#get performance of clfNB
X_cv_test_data = cv_test['ocr_text_all']
X_cv_test = vectorizer.transform(X_cv_test_data)
clfNB.fit(X_train, Y_train)
predicted = clfNB.predict(X_cv_test)
clfNB.score(X_cv_test, Y_test)

0.79793061472915394

In [165]:
#clf2- Logistic Regression
from sklearn.linear_model import SGDClassifier
clfLR = SGDClassifier(loss='log', penalty='l2', alpha=1e-3, n_iter=5, random_state=10)

In [273]:
#get performance of clfLR -- 88.98%
clfLR.fit(X_train, Y_train)
predicted = clfLR.predict(X_cv_test)
clfLR.score(X_cv_test, Y_test)

0.88983566646378576

In [274]:
#clf3- Linear SVM
clfLSVM = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=10)

In [275]:
#get performance of clf3 -- 91.72%
clfLSVM = clfLSVM.fit(X_train, Y_train)
predicted = clfLSVM.predict(X_cv_test)
clfLSVM.score(X_cv_test, Y_test)

0.91722458916615945

In [188]:
from sklearn import metrics

#performance metrics of classifier
print(metrics.classification_report(Y_test, predicted, target_names=list(adLabels)))

                                       precision    recall  f1-score   support

                         hebs digital       1.00      1.00      1.00         2
                  mitsubishi electric       0.96      0.96      0.96        24
                            walgreens       0.00      0.00      0.00         5
                      lowermybillscom       1.00      1.00      1.00         4
                           state farm       1.00      0.93      0.96        44
                           girlgameme       0.91      0.59      0.71        17
                               ibotta       0.00      0.00      0.00         2
               florida virtual school       1.00      1.00      1.00        28
                               mirage       1.00      0.50      0.67         2
                             modcloth       0.80      0.80      0.80         5
                        zenni optical       0.00      0.00      0.00         0
                            mcdonalds       1.00   

In [277]:
#implement cross-validation and select best model
from sklearn.cross_validation import KFold
X = vectorizer.transform(cv_trainTest['ocr_text_all'].values)
Y = cv_trainTest['label'].values
kf = KFold(len(Y_train), n_folds=4)

clfLSVM2 = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=10)

max = 0
models =[]
scores = []
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    clfLSVM2 = clfLSVM2.fit(X_train, Y_train)
    score = clfLSVM2.score(X_test, Y_test)
    models.append(clfLSVM2)
    scores.append(score)
  
print(scores)
highScoreIndex = np.argmax(scores)
model = models[highScoreIndex]
    

[0.91565287915652882, 0.92781832927818331, 0.92295214922952151, 0.9285714285714286]


In [280]:
#get performance on test set
X_test = vectorizer.transform(test['ocr_text_all'].values)
X_test.shape
Y_test = test['label'].values
predicted = model.predict(X_test)
model.score(X_test, Y_test)

0.85636031649421784

In [393]:
#Sampleprediction: #model.predict(vectorizer.transform(['This is a Verizon ad'])) => Verizon wireless

#return no prediction for weak predictions
def predictedModified(model, x):
    predictions = []
    for elem in x:
        maxBoundary = np.max(model.decision_function(elem))
        #based on cost matrix- place higher penalty for predicting not_an_ad 
        prediction = model.predict(elem)
        if (maxBoundary< -0.05):
            prediction = np.array([u'no prediction'])
            #increase maxBoundary constraint to remove wrong no_ad predictions on any_brand
        elif(prediction == u'not_an_ad' and maxBoundary < 1.4):
            prediction = np.array([u'no prediction'])    
        predictions.append(prediction[0])
    return predictions
     
# predictedModified(model, vectorizer.transform(['This is a Verizon ad', 'This is walgreens']))    
predictions = predictedModified(model, X_test)

In [394]:
#Get generic performance
from sklearn import metrics

#performance report
print(metrics.classification_report(Y_test, predictions))

#Confusion matrix
print(metrics.confusion_matrix(Y_test, predictions))


             precision    recall  f1-score   support

      37com       0.00      0.00      0.00         1
 aaa travel       0.00      0.00      0.00         1
academy sports outdoors       0.00      0.00      0.00         5
        afl       0.00      0.00      0.00         2
aktion deutschland hilft       0.00      0.00      0.00         1
 aliexpress       0.00      0.00      0.00         1
      align       0.00      0.00      0.00         1
     amazon       1.00      0.97      0.98        30
american express       1.00      0.33      0.50         3
     amsoil       0.00      0.00      0.00         1
       ansi       0.00      0.00      0.00         1
  app store       0.75      0.50      0.60         6
ashley furniture       1.00      0.92      0.96        26
       at&t       0.93      0.65      0.76        20
       avon       0.00      0.00      0.00         1
baller arcade       0.00      0.00      0.00         5
banggoodcom       1.00      0.94      0.97        18
   beatp

In [395]:
#Get performance figures on the test Set which highlight how well we did as per the cost-matrix:

#A. Total number of records in test set

#B. any_brand metrics:
    #a. %age
    #b. %correctly predicted
    #c. %incorrectly predicted as not_an_ad // Need to ensure this is lowest
    #d. %incorrectly predicted as wrong band //moderate cost
    #e. %surrendered //low cost
    
    
#C. not_an_ad metrics:
    #a. %age
    #b. %correctly predicted
    #c. %incorrectly predicted as wrong band //moderate-high cost
    #d. %surrendered //low cost 
    

#anyBrandCount + noAdCount = totalCount
#a. %age composition
totalCount = len(Y_test)
anyBrandCount = len([k for k in Y_test if k!='not_an_ad' and k!= 'no prediction'])
anyBrandCountPerc = float(anyBrandCount)/totalCount
noAdCount = len([k for k in Y_test if k=='not_an_ad'])
noAdCountPerc = float(noAdCount)/totalCount

#number of surrenders by the classifier
noPredictionCount = len([k for k in predictions if k=='no prediction'])
noPredictionCountPerc = float(noPredictionCount)/totalCount

# %correctly Predicted, %incorrectlyredicted, %surrendered
anyBrandCorrect = 0
not_an_adCorrect = 0
not_an_adSurrender = 0
anyBrandSurrender = 0
not_an_adIncorrect = 0
anyBrand_notAnAdIncorrect = 0
anyBrand_wrongBrandIncorrect = 0

for idx,k in enumerate(Y_test):
    if(k == predictions[idx]):
        if (k=='not_an_ad'):
            not_an_adCorrect += 1    
        else:
            anyBrandCorrect += 1
    elif(predictions[idx] == 'no prediction'):
        if (k=='not_an_ad'):
            not_an_adSurrender += 1
        else:
            anyBrandSurrender += 1
    else:
        if (k=='not_an_ad'):
            not_an_adIncorrect += 1
        else:
            if (predictions[idx] == 'not_an_ad'):
                anyBrand_notAnAdIncorrect += 1
            else:
                anyBrand_wrongBrandIncorrect += 1
                

In [396]:
#from tabulate import tabulate
from prettytable import PrettyTable

print('~Performance Stats~:')
print('Total records: ', totalCount)

t = PrettyTable(['Category', 'total(%)', 'Correct', 'Incorrect_wrongAd', 'Incorrect_not_an_ad', 'No Prediction'])
t.add_row(['AnyBrand', anyBrandCountPerc, anyBrandCorrect, anyBrand_wrongBrandIncorrect, anyBrand_notAnAdIncorrect, anyBrandSurrender])
t.add_row(['No_ad', noAdCountPerc, not_an_adCorrect, not_an_adIncorrect, 'X', not_an_adSurrender])
print t

~Performance Stats~:
('Total records: ', 1643)
+----------+----------------+---------+-------------------+---------------------+---------------+
| Category |    total(%)    | Correct | Incorrect_wrongAd | Incorrect_not_an_ad | No Prediction |
+----------+----------------+---------+-------------------+---------------------+---------------+
| AnyBrand | 0.602556299452 |   747   |         14        |          28         |      201      |
|  No_ad   | 0.397443700548 |    80   |         7         |          X          |      566      |
+----------+----------------+---------+-------------------+---------------------+---------------+


In [313]:
#things that I would have liked to work on:
#1. Title casing: Currently I convert all text to lowercase, which loses information that could potentially pinpoint occurence of an Adword label in ocr_tex.
#2. grid-search: Grid search to find the most optimal parameters for the model. 
#3. Learning curves: Would have liked to plot a learning curve on my hypothesis, comparing the cross-validation training and test sets to inform me if my model has high variance or high bias.
    #Based on this, I could proceed to use more data points in training if model has high variance.