# Looking for Sneaky Clickbait

This experiment evaluates the clickbait detector model on the validation data to find out which kinds of clickbait it fails to detect and which genuine headlines it wrongly flags as clickbait.

In [1]:
from keras.models import load_model
from keras.preprocessing import sequence
import sys
import string 
import re


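UNK = "<UNK>"  # token used for words missing from the vocabulary
PAD = "<PAD>"  # padding token
MATCH_MULTIPLE_SPACES = re.compile(r" {2,}")  # runs of two or more spaces
SEQUENCE_LENGTH = 20  # headlines are padded or truncated to this many tokens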

Using TensorFlow backend.


## Load the model and vocabulary

In [2]:
model = load_model("../models/detector.h5")


with open("../data/vocabulary.txt") as vocabulary_file:
    vocabulary = vocabulary_file.read().split("\n")
inverse_vocabulary = dict((word, i) for i, word in enumerate(vocabulary))
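
One detail of the loading code above is worth a quick check: `read().split("\n")` keeps a trailing empty string when the file ends with a newline, whereas `splitlines()` does not. The sketch below uses a made-up three-entry vocabulary (not the real `vocabulary.txt`, whose contents and ordering are not shown here) to illustrate the difference and how the word-to-index dictionary comes out. `invert` is a hypothetical helper name.

```python
# Toy stand-in for the contents of vocabulary.txt (hypothetical, not the real file).
raw = "<PAD>\n<UNK>\nthe\n"

with_split = raw.split("\n")        # keeps a trailing "" after the final newline
with_splitlines = raw.splitlines()  # drops it


def invert(words):
    # Same construction as inverse_vocabulary: word -> position in the list.
    return dict((word, i) for i, word in enumerate(words))


inverse_split = invert(with_split)
inverse_splitlines = invert(with_splitlines)
```

Because the extra empty entry lands at the end of the list, it does not shift the indices of the real words. For the headline files, however, a trailing empty line would be scored like any other headline.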

## Load validation data

In [3]:
clickbait = open("../data/clickbait.valid.txt").read().split("\n")
genuine = open("../data/genuine.valid.txt").read().split("\n")

print "Clickbait: "
for each in clickbait[:5]:
    print each
print "-" * 50

print "Genuine: "
for each in genuine[:5]:
    print each

Clickbait: 
All The Looks At The People's Choice Awards
Does Kylie Jenner Know How To Wear Coats? A Very Serious Investigation
This Is What US Protests Looked Like In The '60s
24 GIFs That Show How Corinne Is The Greatest "Bachelor" Villian Yet
Nene Leakes And Kandi Burruss Finally "See Each Other" In A Good Way
--------------------------------------------------
Genuine: 
Mayawatis risky calculus
L&T Q3 net up 39% at Rs 972 cr, co says note ban a disruptor
Australian Open women's final: Serena beats sister Venus Williams to win 23rd Grand Slam
It's Federer vs Nadal in Australian Open finals
Medical board fails to make any conclusion in report on Sunandas death


In [4]:
def words_to_indices(words):
    # Words that are not in the vocabulary fall back to the index of <UNK>.
    return [inverse_vocabulary.get(word, inverse_vocabulary[UNK]) for word in words]


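def clean(text):
    # Put spaces around every punctuation mark and digit so that each one
    # becomes its own token when the text is split on whitespace.
    for punctuation in string.punctuation:
        text = text.replace(punctuation, " " + punctuation + " ")
    for i in range(10):
        text = text.replace(str(i), " " + str(i) + " ")
    # Collapse the runs of spaces introduced above.
    text = MATCH_MULTIPLE_SPACES.sub(" ", text)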
    return text
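
To see what these helpers do to a headline, here is a self-contained sketch of the whole preprocessing path. It repeats `clean` and the index lookup so that it runs on its own, uses a tiny made-up vocabulary, and replaces `sequence.pad_sequences` with a hypothetical pure-Python `pad_left` that mimics the Keras defaults as I understand them (pad and truncate at the front, pad value 0). Whether index 0 corresponds to `<PAD>` in the real vocabulary is an assumption.

```python
import re
import string

MATCH_MULTIPLE_SPACES = re.compile(r" {2,}")
UNK = "<UNK>"

# Hypothetical toy vocabulary; the real one is read from vocabulary.txt.
toy_vocabulary = ["<PAD>", UNK, "gifs", "!", "2", "4"]
toy_inverse = dict((word, i) for i, word in enumerate(toy_vocabulary))


def clean(text):
    # Surround punctuation and digits with spaces so each becomes a token.
    for punctuation in string.punctuation:
        text = text.replace(punctuation, " " + punctuation + " ")
    for i in range(10):
        text = text.replace(str(i), " " + str(i) + " ")
    return MATCH_MULTIPLE_SPACES.sub(" ", text)


def words_to_indices(words, inverse):
    # Words outside the vocabulary map to the <UNK> index.
    return [inverse.get(word, inverse[UNK]) for word in words]


def pad_left(indices, maxlen, value=0):
    # Keep the last maxlen items, then pad on the left up to maxlen.
    indices = indices[-maxlen:]
    return [value] * (maxlen - len(indices)) + indices


tokens = clean("24 GIFs That Amaze!".lower()).split()
indices = words_to_indices(tokens, toy_inverse)
padded = pad_left(indices, maxlen=8)
```

With this toy vocabulary, `tokens` is `['2', '4', 'gifs', 'that', 'amaze', '!']`: the number 24 is split into single digits, and the exclamation mark becomes a token of its own. The two words missing from the vocabulary both map to the `<UNK>` index, so `padded` is `[0, 0, 4, 5, 2, 1, 1, 3]`.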


## Genuine news marked as clickbait

In [5]:
wrong_genuine_count = 0
for each in genuine:
    cleaned = clean(each.encode("ascii", "ignore").lower()).split()
    indices = words_to_indices(cleaned)
    indices = sequence.pad_sequences([indices], maxlen=SEQUENCE_LENGTH)
    prediction = model.predict(indices)[0, 0]
    # A score above 0.5 means the model labels this genuine headline as clickbait.
    if prediction > .5:
        print prediction, each
        wrong_genuine_count += 1

print "-" * 50
print "{0} out of {1} wrong.".format(wrong_genuine_count, len(genuine))

0.996671 In U.P. polls, a united Left will make an impact: Yechury
0.996506 When White House blocked U.K. scribes from covering Trump-May meet
0.955967 A look at Trumps executive order on refugees, immigration
0.963505 Malala heartbroken over Trumps ban on most defenceless refugees
0.898073 Zahida Pervez, three others convicted in Shehla Masood murder case
0.924497 The White House hints that tax reform could pay for the border wall
0.877707 Understanding the spike in Chinas birth rate
0.855187 President Trumps infrastructure plans probably involve more tolls
0.621383 The multinational company is in trouble
0.946161 The Doomsday Clock now reads two and a half minutes to midnight
0.759111 Why Russia is about to decriminalise wife-beating
0.990156 Donald Trump Signs Actions Banning Syrians, Suspending Refugee Program
0.866885 Big Chinese Deals Stall on Capital-Outflows Clampdown
0.771161 Twitter releases national securityletters
0.618022 Zuckerberg defends immigrants threatened byTrump
0.

## Clickbait not detected

In [6]:
wrong_clickbait_count = 0
for each in clickbait:
    cleaned = clean(each.encode("ascii", "ignore").lower()).split()
    indices = words_to_indices(cleaned)
    indices = sequence.pad_sequences([indices], maxlen=SEQUENCE_LENGTH)
    prediction = model.predict(indices)[0, 0]
    # A score below 0.5 means the model labels this clickbait headline as genuine.
    if prediction < .5:
        print prediction, each
        wrong_clickbait_count += 1

print "-" * 50
print "{0} out of {1} wrong.".format(wrong_clickbait_count, len(clickbait))

0.318887 Nene Leakes And Kandi Burruss Finally "See Each Other" In A Good Way
0.220154 Trump signs executive order to 'keep radical Islamic terrorists out' of U.S.
0.0129206  Considering A Medical Career? 
0.0939609 Jewish leaders have warned against post-truth populism on Holocaust Memorial Day
0.25566 Do Donald Trump's criticisms of NATO have merit? | Opinion
0.0920399 Trumps Putin bromance is making Americans pro-Russian
0.348526 Mexico foreign minister says paying for Trump's border wall "totally unacceptable"
0.0919236 How much longer can Oman be an oasis of peace in the Middle East?
0.0143478 China is stepping up as Donald Trump withdraws from the world stage | Opinion
0.278981 Buffett, Gates express optimism for U.S. in Trump era
0.00424221 Can Congos footballers help ease political tensions?
0.0606429 Aruba; Five Star Island Goes Green
0.0049553 Michael Wolff: Why the media keeps losing to Donald Trump
0.00370871 Vijay Mallya: I begged for help, not loans - Times of India
0.459
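
The two evaluation loops differ only in the direction of the comparison, so the counts they print can be turned into error rates with one small helper. The sketch below is not part of the original notebook: `error_rate` is a hypothetical name, and the scores are made-up numbers rather than model output.

```python
def error_rate(scores, is_clickbait, threshold=0.5):
    # Count scores on the wrong side of the threshold for the given label.
    # Mirrors the loops above: clickbait is wrong when score < threshold,
    # genuine is wrong when score > threshold.
    if is_clickbait:
        wrong = sum(1 for score in scores if score < threshold)
    else:
        wrong = sum(1 for score in scores if score > threshold)
    return wrong, float(wrong) / len(scores)


# Made-up scores for illustration only.
genuine_scores = [0.9, 0.1, 0.2, 0.7]
clickbait_scores = [0.8, 0.3, 0.95, 0.99, 0.6]

false_positives = error_rate(genuine_scores, is_clickbait=False)
false_negatives = error_rate(clickbait_scores, is_clickbait=True)
```

As in the loops above, a score of exactly 0.5 is not counted as an error for either class.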