# Clickbait Detector
_You Won't Believe What Happened Next_

In [1]:
import pandas as pd
import numpy as np
import json

In [2]:
pd.set_option('display.max_colwidth', 1000)

In [53]:
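import importlib
from feature_extraction import LinguisticFeatureExtractor

# Reload the module so that edits to feature_extraction are picked up
# without restarting the kernel.
importlib.reload(LinguisticFeatureExtractor)

# Character-level and word-level counters used throughout the notebook.
cc = LinguisticFeatureExtractor.CharacterCounter()
wc = LinguisticFeatureExtractor.WordCounter()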

In [60]:
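# Demo strings.
# str1 is a Dutch sentence with deliberately awkward punctuation, roughly:
# 'This sentence--very annoying those dashes-- is not "clickbait" word-together 1.0 single-quote's.'
# str2 is Latin filler text.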
str1 = 'Deze zin--heel vervelend die streepjes-- is geen "clickbait" woord-aanelkaar 1.0 single-quote\'s.'
str2 = "Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui."

In [61]:
print(cc.numchars(str1))
print(cc.numchars(str2))
print(cc.diff(str1, str2))
print(cc.diff(str2, str1))
print(cc.ratio(str1, str2))
print(cc.ratio(str2, str1))

86
102
16
16
0.8431372549019608
1.186046511627907
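
The source of `CharacterCounter` is not shown in this notebook. The sketch below is a hypothetical stand-in (the class name `SketchCharacterCounter` is made up) that assumes `numchars` counts every non-whitespace character, `diff` is the absolute difference, and `ratio` divides the first count by the second. Under those assumptions it is consistent with the numbers printed above, but the real implementation may differ.

```python
class SketchCharacterCounter:
    """Hypothetical stand-in for LinguisticFeatureExtractor.CharacterCounter."""

    def numchars(self, text):
        # Assumed rule: count every character that is not whitespace.
        return sum(1 for ch in text if not ch.isspace())

    def diff(self, a, b):
        # Absolute difference, so the argument order does not matter.
        return abs(self.numchars(a) - self.numchars(b))

    def ratio(self, a, b):
        # Length of a relative to b; raises ZeroDivisionError if b has no characters.
        return self.numchars(a) / self.numchars(b)


sketch_cc = SketchCharacterCounter()
```

Because `diff` is symmetric and `ratio` is not, swapping the arguments leaves the difference unchanged but inverts the ratio, which matches the pattern in the output above.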


In [62]:
print(wc.words(str1))
print(wc.words(str2))
print(wc.numwords(str1))
print(wc.numwords(str2))
print(wc.diff(str1, str2))
print(wc.diff(str2, str1))
print(wc.ratio(str1, str2))
print(wc.ratio(str2, str1))

['Deze', 'zin', 'heel', 'vervelend', 'die', 'streepjes', 'is', 'geen', 'clickbait', 'woordaanelkaar', '1.0', "singlequote's."]
['Nemo', 'enim', 'ipsam', 'voluptatem', 'quia', 'voluptas', 'sit', 'aspernatur', 'aut', 'odit', 'aut', 'fugit', 'sed', 'quia', 'consequuntur', 'magni', 'dolores', 'eos', 'qui.']
12
19
7
7
0.631578947368421
1.5833333333333333
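
The tokenisation rules of `WordCounter` are likewise not shown. Judging only from the output above, double dashes appear to act as separators, while single hyphens, double quotes and commas are dropped and periods and apostrophes are kept. The sketch below encodes exactly those guesses in a hypothetical class `SketchWordCounter`; it reproduces the two token lists above but is not the author's code.

```python
class SketchWordCounter:
    """Hypothetical stand-in for LinguisticFeatureExtractor.WordCounter."""

    # Assumed set of characters that are removed outright.
    _DROP = str.maketrans("", "", '-",')

    def words(self, text):
        # Assumed rule: a double dash separates words ...
        text = text.replace("--", " ")
        # ... while single hyphens, double quotes and commas are deleted.
        return text.translate(self._DROP).split()

    def numwords(self, text):
        return len(self.words(text))

    def diff(self, a, b):
        return abs(self.numwords(a) - self.numwords(b))

    def ratio(self, a, b):
        return self.numwords(a) / self.numwords(b)


sketch_wc = SketchWordCounter()
```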


## Load data
Parse the two JSON Lines files (one JSON object per line) and construct one pandas DataFrame per dataset.

In [4]:
ds2016 = "../data/clickbait17-train-170331/instances.jsonl"
ds2017 = "../data/clickbait17-validation-170630/instances.jsonl"

In [5]:
df16 = pd.read_json(ds2016, lines=True, encoding='utf8')
df17 = pd.read_json(ds2017, lines=True, encoding='utf8')
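
As a small illustration of what `lines=True` does, the sketch below parses two made-up records from an in-memory string instead of the real files. The record contents are invented for the example; only the field names `id`, `postText` and `postMedia` come from the dataset shown in this notebook.

```python
import io

import pandas as pd

# Two invented records in JSON Lines format: one JSON object per line.
sample_jsonl = (
    '{"id": "1001", "postText": ["First headline"], "postMedia": []}\n'
    '{"id": "1002", "postText": ["Second headline"], "postMedia": ["media/1002.jpg"]}\n'
)

# lines=True tells pandas to parse each line as a separate record.
sample_df = pd.read_json(io.StringIO(sample_jsonl), lines=True)

# JSON arrays stay Python lists, so the post text itself is element 0.
first_texts = sample_df["postText"].str[0]
```

This is why the cells that follow index `postText` with `[0]` before computing features.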

In [61]:
# Store counts (not token lists) so the columns match the numeric output below.
df16['chars'] = df16.apply(lambda x : cc.numchars(x['postText'][0]), axis=1)
df16['words'] = df16.apply(lambda x : wc.numwords(x['postText'][0]), axis=1)

## Article titles

In [62]:
df16.head()

Unnamed: 0,id,postMedia,postText,postTimestamp,targetCaptions,targetDescription,targetKeywords,targetParagraphs,targetTitle,chars,words
0,608310377143799808,[],[Apple's iOS 9 'App thinning' feature will give your phone's storage a boost],Tue Jun 09 16:31:10 +0000 2015,"['App thinning' will be supported on Apple's iOS 9 and later models. It ensures apps use the lowest amount of storage space on a device by only downloading the parts it needs to run on individual handsets. It 'slices' the app into 'app variants' that only need to access the specific files on that specific device, 'App thinning' will be supported on Apple's iOS 9 and later models. It ensures apps use the lowest amount of storage space on a device by only downloading the parts it needs to run on individual handsets. It 'slices' the app into 'app variants' that only need to access the specific files on that specific device, The guidelines also discuss so-called 'on-demand resources.' This allows developers to omit features from an app until they are opened or requested by the user. The App Store hosts these resources on Apple servers and manages the downloads for the developer and user. This will also increase how quickly an app downloads, The guidelines also discuss so-called 'on-dem...",'App thinning' will be supported on Apple's iOS 9 and later models. It ensures apps use the lowest amount of storage space by 'slicing' it to work on individual handsets (illustrated).,"Apple,gives,gigabytes,iOS,9,app,thinning,feature,finally,phone,s,storage,boost","[Paying for a 64GB phone only to discover that this is significantly reduced by system files and bloatware is the bane of many smartphone owner's lives. , And the issue became so serious earlier this year that some Apple users even sued the company over it. , But with the launch of iOS 9, Apple is hoping to address storage concerns by introducing a feature known as 'app thinning.', It has been explained on the watchOS Developer Library site and is aimed at developers looking to optimise their apps to work on iOS and the watchOS. 
, It ensures apps use the lowest amount of storage space on a device by only downloading the parts it needs run on the particular handset it is being installed onto., It 'slices' the app into 'app variants' that only need to access the specific files on that specific handset. , XperiaBlog recently spotted that the 8GB version of Sony's mid-range M4 Aqua has just 1.26GB of space for users. , This means that firmware, pre-installed apps and Android software t...",Apple gives back gigabytes: iOS 9 'app thinning' feature will finally give your phone's storage a boost,63,13
1,609297109095972864,[media/609297109095972864.jpg],"[RT @kenbrown12: Emerging market investors are doing their best Monty Pythons--""Run away, run away""]",Fri Jun 12 09:52:05 +0000 2015,"[Stocks Fall as Investors Watch Central Banks, Do Traders See Data Before Release? Markets Say Yes, Be Careful: Stock Volatility is Hiding, Not Hibernating, Singapore Bans Ex-Goldman Banker Linked to 1MDB From Working in Financial Sector, Demanding and Direct: HSBC’s New Chairman Mark Tucker, Saudi King’s Asia Expedition Enters Crucial Phase, Puerto Rico Board Approves Fiscal Road Map, CFPB, Justice Department Poised to Square Off in Court, Raphael Bostic to Lead Atlanta Fed, Blackstone Sells Part of Stake in NCR, Dartmouth College Appoints Alice Ruth as New Endowment Chief, Hedge-Fund Loses Big on Oil Bets, Buyout Firm Buys $800 Million of Assets From Itself, J.P. Morgan Moves Ahead With Plan to Drop Commissions in IRAs, Fed Signal Could Revive a Problem for China’s Central Bank, Why Apple and Pfizer Are Piling Into Taiwan’s Bond Market, [https://m.wsj.net/video/20170312/031017vinny/031017vinny_167x94.jpg], [https://m.wsj.net/video/20170303/0307lotd_boca/0307lotd_boca_167x94.jpg],...","Global investors have yanked $9.3 billion from stocks in developing countries in the week to Wednesday, the most since the depths of the 2008 global financial crisis.","emerging market,emerging markets,em flows,em inflow,em outflow,equity markets,money,forex markets,commodity,financial market news","[Emerging markets are out of favor., Global investors have yanked $9.3 billion from stocks in developing countries in the week to Wednesday, the most since the depths of the global financial crisis in 2008. Asia has been particularly vulnerable with $7.9 billion pulled out of the region’s equity markets, the most in almost 15 years, according to data provider EPFR Global., Financial markets in emerging markets have been grinding...]",Emerging Markets Suffer Largest Outflow in Seven Years,85,15
2,609504474621612032,[],"[U.S. Soccer should start answering tough questions about Hope Solo, @eric_adelson writes.]",Fri Jun 12 23:36:05 +0000 2015,"[US to vote for Ali in FIFA election and not Blatter, US to vote for Ali in FIFA election and not Blatter, FILE - This Oct. 10, 2014, file photo shows Sunil Gulati, president of the United States Soccer Federation, during a press conference in Bristol, Conn. The United States says it will vote for Jordan's Prince Ali bin Al-Hussein for FIFA president Friday, May 29, 2015 and not for incumbent Sepp Blatter. (AP Photo/Elise Amendola, File)]",A U.S. Senator's scathing letter questioned U.S. Soccer's inadequate handling of Solo's domestic violence charges. It's time for Sunil Gulati to respond.,,"[WINNIPEG, Manitoba – The bubble U.S. Soccer is putting around Hope Solo isn't working to calm anyone's concerns about the star goalkeeper., The latest lament comes from no less than a U.S. Senator, who into Solo's domestic violence incident of last year and offer a detailed explanation of why Solo is on the field. She is expected to be the starting goalkeeper when the USA plays Sweden in its second group game at the Women's World Cup on Friday., [FC Yahoo: ], U.S. Senator Richard Blumenthal of Connecticut penned a lengthy complaint about the near-silence the organization has given on Solo, especially in the wake of ESPN's ""Outside the Lines"" report on Sunday. Blumenthal wrote that if the report is accurate ""U.S. Soccer's approach to domestic violence and family violence is at best superficial and at worst dangerously neglectful and self-serving."", This situation is well beyond Solo now. U.S. Soccer has made this a referendum about its own ability to represent the values of the nat...",U.S. Soccer should start answering tough questions about Hope Solo,78,13
3,609748367049105408,[],[How theme parks like Disney World left the middle class behind],Sat Jun 13 15:45:13 +0000 2015,"[Some 1,000 persons turned out in Albuquerque, New Mexico to greet Mickey Mouse on Nov. 14, 1978, as he celebrated his 50th birthday with a whistle-stop train tour. (AP Photo/John Holmes), Tourists crowd around Cinderella’s Castle to watch a performance at Walt Disney World’s Magic Kingdom in Lake Buena Vista, Fla., Thursday, Sept. 2, 2004. (AP Photo/Phelan M. Ebenhack), Card Walker, right, president and chief operating officer for Walt Disney Inc., kicks off Disney World’s tenth birthday celebration by welcoming the William Windsor family in Orlando, Oct. 1, 1981. The Windsors were the first to see the Magic Kingdom, when it first opened ten years before. (AP Photo), Views of the private Cinderella luxury suite at the top of Cinderella’s Castle at Walt Disney World’s Magic Kingdom in Lake Buena Vista, Fla. are seen on Friday, Jan. 26,2006. A night’s stay at the castle was one of prizes given randomly to unsuspecting park guests recently as part of the launch of Disney’s “Year of a...","America's top family vacation spots, like the ""happiest place on earth,"" are increasingly aimed at upscale buyers.","disney, disney world, disney ticket prices, disney world prices, disneyland, disneyworld, disney world price, trip to disney world,","[When Walt Disney World opened in an Orlando swamp in 1971, with its penny arcade and marching-band parade down Main Street U.S.A., admission for an adult cost $3.50, about as much then as three gallons of milk., Disney has raised the gate price for the Magic Kingdom 41 times since, nearly doubling it over the past decade. This year, a ticket inside the “most magical place on Earth” rocketed past $100 for the first time in history., Ballooning costs have not slowed the mouse-eared masses flooding into the world’s busiest theme park. 
Disney’s main attraction hosted a record 19 million visitors last year, a number nearly as large as the population of New York state., But rising prices have changed the character of Big Mouse’s family-friendly empire in unavoidably glitzy ways. A visitor to Disney’s central Florida fantasy-land can now dine on a $115 steak, enjoy a $53-per-plate dessert party and sleep in a bungalow overlooking the Seven Seas Lagoon starting at $2,100 a night., For Ame...",How theme parks like Disney World left the middle class behind,52,11
4,608688782821453824,[media/608688782821453825.jpg],[Could light bulbs hurt your health? One company is now putting warning labels on its bulbs:],Wed Jun 10 17:34:49 +0000 2015,"[Electric lights have made the world safer and made people smarter, but can they also hurt our health?, Electric lights have made the world safer and made people smarter, but can they also hurt our health?, Quantcast]",One company will put a health notice on all the packages for their lighting. Here's how electric lights can hurt your health,"health, Should there be warning labels on your light bulbs? - CNN.com","[(CNN)The light bulb always makes the world's top inventions lists., It makes us more productive. It deters crime. It's allowed New York to become the ""city that never sleeps."" And yet, more than just Manhattenites are failing to get their zzz's because of electric lighting, and there is growing body of scientific evidence that electric lighting may be hurting our health., Scientists have talked about this for years, but now a lighting company is about to point that out to every single one of its customers. And they are doing it voluntarily. What's the catch?, The company, Florida based Lighting Science Group does make a line of biological lighting that it says can be a better fit for your health than a traditional light bulb. Their idea is that you should get the right light for the right time. So they sell a product that is supposed to be better for bedtime, and another to help you feel more awake. But with this announcement, Fred Maxik, the company's chief science officer, seems...",Warning labels on your light bulbs,76,16


## Post titles per article

In [62]:
# Number of post titles per article (length of the postText list)
df16['postTextLen'] = df16['postText'].map(len)
df17['postTextLen'] = df17['postText'].map(len)

# Number of images per post (length of the postMedia list)
df16['numImg'] = df16['postMedia'].map(len)
df17['numImg'] = df17['postMedia'].map(len)

# Show
display(df16[['postTextLen', 'numImg']].describe())
display(df17[['postTextLen', 'numImg']].describe())

| | postTextLen | numImg |
|---|---|---|
| count | 2459.0 | 2459.0 |
| mean | 1.0 | 0.660024 |
| std | 0.0 | 0.473797 |
| min | 1.0 | 0.0 |
| 25% | 1.0 | 0.0 |
| 50% | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


| | postTextLen | numImg |
|---|---|---|
| count | 19538.0 | 19538.0 |
| mean | 1.0 | 0.566435 |
| std | 0.0 | 0.603358 |
| min | 1.0 | 0.0 |
| 25% | 1.0 | 0.0 |
| 50% | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 4.0 |
