# Looking for Sneaky Clickbait

The aim of this experiment is to evaluate the clickbait detector model and find out what kinds of clickbait it fails to detect.

In [1]:
from keras.models import load_model
from keras.preprocessing import sequence
import sys
import string 
import re


UNK = "<UNK>"
PAD = "<PAD>"
MATCH_MULTIPLE_SPACES = re.compile(r" {2,}")  # raw string: matches runs of two or more spaces
SEQUENCE_LENGTH = 20

Using TensorFlow backend.


## Load the model and vocabulary

In [2]:
model = load_model("../models/detector.h5")


vocabulary = open("../data/vocabulary.txt").read().split("\n")
inverse_vocabulary = dict((word, i) for i, word in enumerate(vocabulary))

## Load validation data

In [3]:
clickbait = open("../data/clickbait.valid.txt").read().split("\n")
genuine = open("../data/genuine.valid.txt").read().split("\n")

print "Clickbait: "
for each in clickbait[:5]:
    print each
print "-" * 50

print "Genuine: "
for each in genuine[:5]:
    print each

Clickbait: 
All The Looks At The People's Choice Awards
Does Kylie Jenner Know How To Wear Coats? A Very Serious Investigation
This Is What US Protests Looked Like In The '60s
24 GIFs That Show How Corinne Is The Greatest "Bachelor" Villian Yet
Nene Leakes And Kandi Burruss Finally "See Each Other" In A Good Way
--------------------------------------------------
Genuine: 
Mayawatis risky calculus
L&T Q3 net up 39% at Rs 972 cr, co says note ban a disruptor
Australian Open women's final: Serena beats sister Venus Williams to win 23rd Grand Slam
It's Federer vs Nadal in Australian Open finals
Medical board fails to make any conclusion in report on Sunandas death


In [4]:
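def words_to_indices(words):
    # Map each token to its vocabulary index; unknown tokens fall back to UNK.
    return [inverse_vocabulary.get(word, inverse_vocabulary[UNK]) for word in words]


def clean(text):
    # Surround punctuation and digits with spaces so each becomes its own token.
    for punctuation in string.punctuation:
        text = text.replace(punctuation, " " + punctuation + " ")
    for i in range(10):
        text = text.replace(str(i), " " + str(i) + " ")
    # Collapse the runs of spaces introduced above.
    text = MATCH_MULTIPLE_SPACES.sub(" ", text)
    return text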
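
To make the preprocessing concrete, here is a minimal standalone sketch of what happens to one headline before it reaches the model. The toy vocabulary, the `pad_left` helper, and the placement of `<PAD>` at index 0 are assumptions made for illustration; the real vocabulary comes from `../data/vocabulary.txt`, and the notebook itself relies on Keras' `pad_sequences`, whose default behaviour (pad and truncate at the front, fill value 0) `pad_left` only imitates.

```python
import re
import string

UNK = "<UNK>"
PAD = "<PAD>"
MATCH_MULTIPLE_SPACES = re.compile(r" {2,}")

# Hypothetical toy vocabulary; assumes <PAD> sits at index 0.
toy_vocabulary = [PAD, UNK, "gifs", "that", "show", "2", "4", "!"]
toy_inverse_vocabulary = dict((word, i) for i, word in enumerate(toy_vocabulary))


def clean(text):
    # Same logic as the notebook's clean(): isolate punctuation and digits.
    for punctuation in string.punctuation:
        text = text.replace(punctuation, " " + punctuation + " ")
    for i in range(10):
        text = text.replace(str(i), " " + str(i) + " ")
    return MATCH_MULTIPLE_SPACES.sub(" ", text)


def words_to_indices(words, inverse_vocabulary):
    # Unknown words map to the index of <UNK>.
    return [inverse_vocabulary.get(word, inverse_vocabulary[UNK]) for word in words]


def pad_left(indices, maxlen, value=0):
    # Illustrative stand-in for pad_sequences: keep the last maxlen items,
    # then fill the front with `value`.
    kept = indices[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept


tokens = clean("24 GIFs That Show Corinne!".lower()).split()
indices = words_to_indices(tokens, toy_inverse_vocabulary)
padded = pad_left(indices, 10)

print(tokens)   # ['2', '4', 'gifs', 'that', 'show', 'corinne', '!']
print(indices)  # [5, 6, 2, 3, 4, 1, 7]
print(padded)   # [0, 0, 0, 5, 6, 2, 3, 4, 1, 7]
```

Note that every digit becomes its own token ("24" turns into "2" and "4"), and that "corinne" maps to `<UNK>` because it is missing from the toy vocabulary.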


## Genuine news marked as clickbait

In [5]:
wrong_genuine_count = 0
for each in genuine:
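    # Drop non-ASCII characters, lower-case, and split into tokens.
    cleaned = clean(each.encode("ascii", "ignore").lower()).split()
    indices = words_to_indices(cleaned)
    # Pad or truncate to SEQUENCE_LENGTH indices (Keras pads at the front by default).
    indices = sequence.pad_sequences([indices], maxlen=SEQUENCE_LENGTH)
    prediction = model.predict(indices)[0, 0]
    # A score above 0.5 means this genuine headline was labelled clickbait.
    if prediction > .5:
        print prediction, each
        wrong_genuine_count += 1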

print "-" * 50
print "{0} out of {1} wrong.".format(wrong_genuine_count, len(genuine))

0.607759 Only love, no jihad: RLD fields couples who broke the communal divide
0.998569 In U.P. polls, a united Left will make an impact: Yechury
0.86769 The bull and their life
0.990986 When White House blocked U.K. scribes from covering Trump-May meet
0.816014 A look at Trumps executive order on refugees, immigration
0.97482 Malala heartbroken over Trumps ban on most defenceless refugees
0.526892 Zahida Pervez, three others convicted in Shehla Masood murder case
0.985698 The White House hints that tax reform could pay for the border wall
0.994258 Understanding the spike in Chinas birth rate
0.922186 The multinational company is in trouble
0.68529 The Doomsday Clock now reads two and a half minutes to midnight
0.536371 Why Russia is about to decriminalise wife-beating
0.949111 Donald Trump Signs Actions Banning Syrians, Suspending Refugee Program
0.574807 On Globalization, China and Trump Are Closer Than They Appear
0.596575 Nikki Haley Arrives at U.N., Vowing to Take Names of Opposin
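
The list above is printed in file order, which makes the worst mistakes hard to spot. One possible way to surface them is to collect the `(score, headline)` pairs and sort them by how confident the model was. The sketch below is not part of the original notebook: `confident_errors` is a hypothetical helper, and the sample pairs are copied from the output above.

```python
def confident_errors(scored_headlines, threshold=0.5, expect_clickbait=False):
    """Return misclassified (score, headline) pairs, most confident mistakes first."""
    if expect_clickbait:
        # Clickbait scored below the threshold was missed; lowest score first.
        errors = [(s, h) for s, h in scored_headlines if s < threshold]
        return sorted(errors)
    # Genuine news scored above the threshold was flagged; highest score first.
    errors = [(s, h) for s, h in scored_headlines if s > threshold]
    return sorted(errors, reverse=True)


# A few pairs taken from the output above.
sample = [
    (0.607759, "Only love, no jihad: RLD fields couples who broke the communal divide"),
    (0.998569, "In U.P. polls, a united Left will make an impact: Yechury"),
    (0.86769, "The bull and their life"),
    (0.990986, "When White House blocked U.K. scribes from covering Trump-May meet"),
    (0.526892, "Zahida Pervez, three others convicted in Shehla Masood murder case"),
]

# Raising the threshold keeps only the errors the model was very sure about.
worst = confident_errors(sample, threshold=0.9)
for score, headline in worst:
    print("{0:.3f} {1}".format(score, headline))
```

Passing `expect_clickbait=True` gives the mirror-image ranking for missed clickbait, where the lowest scores are the most confident misses.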

## Clickbait not detected

In [6]:
wrong_clickbait_count = 0
for each in clickbait:
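    # Drop non-ASCII characters, lower-case, and split into tokens.
    cleaned = clean(each.encode("ascii", "ignore").lower()).split()
    indices = words_to_indices(cleaned)
    # Pad or truncate to SEQUENCE_LENGTH indices (Keras pads at the front by default).
    indices = sequence.pad_sequences([indices], maxlen=SEQUENCE_LENGTH)
    prediction = model.predict(indices)[0, 0]
    # A score below 0.5 means this clickbait headline went undetected.
    if prediction < .5:
        print prediction, each
        wrong_clickbait_count += 1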

print "-" * 50
print "{0} out of {1} wrong.".format(wrong_clickbait_count, len(clickbait))

0.120165 Trump signs executive order to 'keep radical Islamic terrorists out' of U.S.
0.018982  Considering A Medical Career? 
0.141109 Haley to U.N. allies: back us or we'll take names
0.0218158 Do Donald Trump's criticisms of NATO have merit? | Opinion
0.0167255 Trumps Putin bromance is making Americans pro-Russian
0.347661 How much longer can Oman be an oasis of peace in the Middle East?
0.00139505 China is stepping up as Donald Trump withdraws from the world stage | Opinion
0.0820034 Can Congos footballers help ease political tensions?
0.00678784 Aruba; Five Star Island Goes Green
0.0368978 Michael Wolff: Why the media keeps losing to Donald Trump
0.00522536 Vijay Mallya: I begged for help, not loans - Times of India
0.197374 Union Budget 2017: What manufacturing sector expect from Arun Jaitley-  The Times of India
0.27219 Paul Ryan Really Doesn't Care If Mexico Pays For Trump's Wall
0.118483 Dear President Trump: Our Grandparents Were Refugees. This Is Their Story.
0.00116141 Week
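
The two error counters can also be folded into the usual summary metrics, treating clickbait as the positive class. The sketch below shows one way to do that. The `summarize` helper is hypothetical, and the counts passed to it are placeholders chosen for illustration, not results from this notebook.

```python
def summarize(wrong_genuine, total_genuine, wrong_clickbait, total_clickbait):
    """Derive precision, recall and accuracy, with clickbait as the positive class."""
    true_positives = total_clickbait - wrong_clickbait   # clickbait caught
    false_positives = wrong_genuine                      # genuine news flagged
    false_negatives = wrong_clickbait                    # clickbait missed
    true_negatives = total_genuine - wrong_genuine       # genuine news passed
    # float() keeps the division exact on Python 2 as well as Python 3.
    # Assumes at least one headline was predicted as clickbait and at least
    # one clickbait headline exists, otherwise these divisions fail.
    precision = true_positives / float(true_positives + false_positives)
    recall = true_positives / float(true_positives + false_negatives)
    accuracy = (true_positives + true_negatives) / float(total_genuine + total_clickbait)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Placeholder counts for illustration only; substitute wrong_genuine_count,
# len(genuine), wrong_clickbait_count and len(clickbait) from the cells above.
metrics = summarize(wrong_genuine=20, total_genuine=1000,
                    wrong_clickbait=40, total_clickbait=1000)
for name in sorted(metrics):
    print("{0}: {1:.3f}".format(name, metrics[name]))
```

Looking at precision and recall separately shows whether the detector errs more by flagging genuine news or by letting clickbait through, which a single error count hides.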