# OSEMN Model (1): Obtain

- Kaggle (32,000)
    - clickbait (from ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’, ‘Scoopwhoop’ and ‘ViralStories’)
    - non-clickbait (from ‘WikiNews’, ‘New York Times’, ‘The Guardian’, and ‘The Hindu’)
    
- Webscraping from clickbait websites (16,707)
    - clickhole.com (4,526)
    - worldtruth.tv (12,181)
    
- APIs from major press companies (16,660)
    - The New York Times (11,460)
    - The Guardian (5,200)
    
- Total: 65,367 headlines (32,707 clickbait and 32,660 non-clickbait)
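
The totals above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: it assumes the Kaggle set splits evenly into 16,000 headlines per class, as its description states, and the variable names are illustrative.

```python
# Headline counts per source, copied from the list above.
kaggle_clickbait = 32_000 // 2      # Kaggle is described as 50% clickbait
kaggle_non_clickbait = 32_000 // 2  # and 50% non-clickbait

scraped = {"clickhole.com": 4_526, "worldtruth.tv": 12_181}
api = {"The New York Times": 11_460, "The Guardian": 5_200}

clickbait_total = kaggle_clickbait + sum(scraped.values())
non_clickbait_total = kaggle_non_clickbait + sum(api.values())

print(clickbait_total, non_clickbait_total, clickbait_total + non_clickbait_total)
# prints 32707 32660 65367
```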

# Kaggle Dataset

In [10]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
kaggle = pd.read_csv('Data/clickbait_data.csv')

Description

A dataset for classifying news headlines as clickbait or non-clickbait.
The headlines were collected from various news sites.
The clickbait headlines come from sites such as ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’, ‘Scoopwhoop’ and ‘ViralStories’.
The relevant, non-clickbait headlines come from trustworthy news sites such as ‘WikiNews’, ‘New York Times’, ‘The Guardian’, and ‘The Hindu’.
Different classification algorithms can be applied to separate the data into clickbait and non-clickbait.

About this file

This dataset contains headlines from various news sites such as ‘WikiNews’, ‘New York Times’, ‘The Guardian’, ‘The Hindu’, ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’, ‘Scoopwhoop’ and ‘ViralStories’. It has two columns: the first contains the headlines and the second contains numerical clickbait labels, where 1 means the headline is clickbait and 0 means it is not. The dataset contains 32,000 rows in total, of which 50% are clickbait and 50% are non-clickbait.
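
The stated 50/50 split is easy to verify before relying on it. The sketch below counts labels with the standard library only. The helper names `label_counts` and `label_counts_from_csv` are hypothetical, and the in-memory sample is a stand-in for the real file, with a made-up placeholder for the non-clickbait row.

```python
import csv
from collections import Counter

def label_counts(rows):
    """Count rows per value of the 'clickbait' column."""
    return Counter(int(row["clickbait"]) for row in rows)

def label_counts_from_csv(path):
    """Read a headline CSV with a 'clickbait' column and count its labels."""
    with open(path, newline="", encoding="utf-8") as f:
        return label_counts(csv.DictReader(f))

# Tiny in-memory stand-in; the second headline is a made-up placeholder.
sample_rows = [
    {"headline": "Should I Get Bings", "clickbait": "1"},
    {"headline": "placeholder non-clickbait headline", "clickbait": "0"},
]
counts = label_counts(sample_rows)
print(counts[1], counts[0])
```

With the DataFrame loaded above, `kaggle['clickbait'].value_counts()` gives the same tally in pandas.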

In [261]:
kaggle.head()

Unnamed: 0,headline,clickbait
0,Should I Get Bings,1
1,Which TV Female Friend Group Do You Belong In,1
2,"The New ""Star Wars: The Force Awakens"" Trailer...",1
3,"This Vine Of New York On ""Celebrity Big Brothe...",1
4,A Couple Did A Stunning Photo Shoot With Their...,1


# Webscraping

## Clickhole

### Lifestyle

In [14]:
from bs4 import BeautifulSoup
import requests
import time

In [60]:
urls = ['https://clickhole.com/category/lifestyle/page/{}/'.format(i) for i in range(1, 401)]

In [56]:
def get_headlines(urls):
    headlines = []
    for url in urls:
        html_page = requests.get(url)
        time.sleep(1)
        soup = BeautifulSoup(html_page.content, 'html.parser')
        raw = soup.find_all('h2', {'class': 'post-title'})
        headlines += [title.find('a').text for title in raw]
    return headlines

In [61]:
clickhole_lifestyle = get_headlines(urls)

In [262]:
print(len(clickhole_lifestyle))

2794


In [63]:
df_lifestyle = pd.DataFrame(clickhole_lifestyle)

In [134]:
df_lifestyle.to_csv('clickhole_lifestyle.csv')

### News

In [139]:
news_urls = ['https://clickhole.com/category/news/page/{}/'.format(i) for i in range(1, 249)]

In [140]:
clickhole_news = get_headlines(news_urls)

In [263]:
print(len(clickhole_news))

1732


In [142]:
df_news = pd.DataFrame(clickhole_news)

In [144]:
df_news.to_csv('clickhole_news.csv')

## Worldtruth

In [158]:
html_page = requests.get('https://worldtruth.tv/sitemap/') # Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') # Pass the page contents to beautiful soup for parsing

In [264]:
archive_months = soup.find_all('li', {'class': ''})[:37]
archive_months[0]

<li><a href="https://worldtruth.tv/2020/12/">December 2020</a> (239)</li>

In [265]:
page_links = []
for li_tag in archive_months:
    page_links.append(li_tag.find('a')['href'])

page_links[0]

'https://worldtruth.tv/2020/12/'

In [178]:
soup.find_all('li', {'class': ''})[0].text[-4:-1]

'239'

In [181]:
# Article count for each month, taken from the "(239)"-style suffix of each entry
articles_per_month = [li_tag.text[-4:-1] for li_tag in archive_months]
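
Slicing with `text[-4:-1]` returns the right number only when the count has exactly three digits. A two-digit or four-digit month would give a wrong or unparsable value. A more tolerant alternative, sketched here with a hypothetical helper named `article_count`, is to pull the digits out of the trailing parentheses with a regular expression.

```python
import re

def article_count(li_text):
    """Return the number inside the trailing parentheses of an archive entry."""
    match = re.search(r"\((\d+)\)\s*$", li_text)
    if match is None:
        raise ValueError(f"no article count found in {li_text!r}")
    return int(match.group(1))

print(article_count("December 2020 (239)"))  # prints 239
```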

In [183]:
html_page = requests.get('https://worldtruth.tv/2020/12/page/2/') # Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') # Pass the page contents to beautiful soup for parsing

In [192]:
soup.find_all('h3', {'class': 'entry-title td-module-title'})[0].text

'“Wearing A Mask Offers Little If Any Protection From Infection” – Harvard Doctors'

In [193]:
len(soup.find_all('h3', {'class': 'entry-title td-module-title'}))

50

In [196]:
import math

In [200]:
pages_per_month = [math.ceil(int(month) / 50) for month in articles_per_month]

In [250]:
link_pairs = [[page_links[i], pages_per_month[i]] for i in range(len(page_links))]

In [251]:
link_pairs[0]

['https://worldtruth.tv/2020/12/', 5]

In [236]:
links = []
for i in range(len(page_links)):
    links.append([page_links[i] + 'page/{}/'.format(pg+1) for pg in range(pages_per_month[i])])
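
The pagination arithmetic used here can be packaged in one small function: 239 articles at 50 per page need `ceil(239 / 50) = 5` pages. The function name `month_page_urls` and the constant `ARTICLES_PER_PAGE` are illustrative, not part of the original notebook.

```python
import math

ARTICLES_PER_PAGE = 50  # observed on the archive page inspected above

def month_page_urls(month_url, n_articles, per_page=ARTICLES_PER_PAGE):
    """Build the URL of every archive page for one month."""
    n_pages = math.ceil(n_articles / per_page)
    return [f"{month_url}page/{page}/" for page in range(1, n_pages + 1)]

december_pages = month_page_urls("https://worldtruth.tv/2020/12/", 239)
print(len(december_pages))   # prints 5
print(december_pages[-1])    # prints https://worldtruth.tv/2020/12/page/5/
```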

In [266]:
wordtruth_headlines = []
for link in links:
    for page in link:
        html_page = requests.get(page)
        time.sleep(1)
        soup = BeautifulSoup(html_page.content, 'html.parser')
        raw = soup.find_all('h3', {'class': 'entry-title td-module-title'})
        wordtruth_headlines += [title.text for title in raw]

In [254]:
len(wordtruth_headlines)

12181

In [255]:
df_wordtruth_headlines = pd.DataFrame(wordtruth_headlines)

In [256]:
df_wordtruth_headlines.to_csv('worldtruth_headlines.csv')

# API

In [9]:
import requests
import json

In [7]:
from config import nyt_key, nyt_secret, guardian_key

## NYT Archives

In [12]:
nyt_2020_01_response = requests.get('https://api.nytimes.com/svc/archive/v1/{}/{}.json?api-key={}'.format(2020, 1, nyt_key))

In [16]:
jan_2020 = nyt_2020_01_response.json()

In [22]:
jan_2020['response']['meta']

{'hits': 4480}

In [23]:
len(jan_2020['response']['docs'])

4480

In [33]:
jan_2020['response']['docs'][0]['headline']['main']

'‘Battling a Demon’: Drifter Sought Help Before Texas Church Shooting'

In [34]:
nyt_2021_01_response = requests.get('https://api.nytimes.com/svc/archive/v1/{}/{}.json?api-key={}'.format(2021, 1, nyt_key))
jan_2021 = nyt_2021_01_response.json()
print(len(jan_2021['response']['docs']))

7001


In [121]:
nyt_jan_2021_list = []
for i in range(len(jan_2021['response']['docs'])):
    nyt_jan_2021_list.append(jan_2021['response']['docs'][i]['headline']['main'])

In [124]:
import pandas as pd

In [126]:
df_nyt_jan_2021 = pd.DataFrame(nyt_jan_2021_list)
df_nyt_jan_2021.to_csv('nyt_jan_2021.csv')

In [36]:
nyt_2020_07_response = requests.get('https://api.nytimes.com/svc/archive/v1/{}/{}.json?api-key={}'.format(2020, 7, nyt_key))
jul_2020 = nyt_2020_07_response.json()
print(len(jul_2020['response']['docs']))

4459


In [127]:
nyt_jul_2020_list = []
for i in range(len(jul_2020['response']['docs'])):
    nyt_jul_2020_list.append(jul_2020['response']['docs'][i]['headline']['main'])
    
df_nyt_jul_2020 = pd.DataFrame(nyt_jul_2020_list)
df_nyt_jul_2020.to_csv('nyt_jul_2020.csv')
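
Both months follow the same two steps: build the archive URL, then read `headline.main` from every document. A possible refactoring into two helpers is sketched below. `archive_url` and `main_headlines` are hypothetical names, and the sample payload only imitates the nesting seen in the responses above.

```python
def archive_url(year, month, api_key):
    """Build an NYT Archive API URL in the same form as the requests above."""
    return f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={api_key}"

def main_headlines(archive_json):
    """Collect the 'main' headline of every document in an archive response."""
    return [doc["headline"]["main"] for doc in archive_json["response"]["docs"]]

# Stand-in response with the same nesting as the real payload.
sample_response = {"response": {"docs": [
    {"headline": {"main": "‘Battling a Demon’: Drifter Sought Help Before Texas Church Shooting"}},
]}}
print(main_headlines(sample_response))
print(archive_url(2020, 7, "YOUR_KEY"))
```

With these in place, each month reduces to `main_headlines(requests.get(archive_url(2020, 7, nyt_key)).json())`.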

In [128]:
df_nyt_jul_2020

Unnamed: 0,0
0,This Should Be Biden’s Bumper Sticker
1,"Corrections: July 1, 2020"
2,Why Does Trump Put Russia First?
3,Senate Approves Extending Small-Business Program
4,Quotation of the Day: The Fight Over Abortion ...
...,...
4454,Supreme Court Lets Trump Keep Building His Bor...
4455,The Coronavirus Infected Hundreds at a Georgia...
4456,Trump’s Coronavirus Testing Chief Concedes a L...
4457,White House and Congress Clash on Relief Plan ...


## The Guardian

In [37]:
guardian_fall_2020_response = requests.get('https://content.guardianapis.com/search?from-date=2020-09-15&to-date=2020-11-15&order-by=newest&show-fields=all&page-size=200&api-key={}'.format(guardian_key))
g_fall_2020 = guardian_fall_2020_response.json()

In [41]:
g_fall_2020['response']['total']

13445

In [42]:
guardian_2020_9_response = requests.get('https://content.guardianapis.com/search?from-date=2020-09-01&to-date=2020-09-30&order-by=newest&show-fields=all&page-size=200&api-key={}'.format(guardian_key))
g_sep_2020 = guardian_2020_9_response.json()

In [50]:
type(g_sep_2020['response']['total'])

int

In [49]:
g_sep_2020['response']['results'][0]['webTitle']

'Italian senate suspended as lawmakers test positive –\xa0as it happened'

In [56]:
range(g_sep_2020['response']['total'])

range(0, 6738)

In [63]:
gresults = g_sep_2020['response']['results']

In [65]:
len(gresults)

200

In [68]:
guard_2020_9_response = requests.get('https://content.guardianapis.com/search?from-date=2020-09-01&to-date=2020-09-30&order-by=newest&show-fields=all&page-size=5000&api-key={}'.format(guardian_key))
guard_sep_2020 = guard_2020_9_response.json()

In [69]:
guard_sep_2020

{'response': {'status': 'error',
  'message': 'page-size must be an integer between 0 and 200'}}
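
The error shows that `page-size` is capped at 200, so a full month has to be fetched page by page. The number of pages follows from the `total` field: the 6,738 September results need `ceil(6738 / 200) = 34` requests. The generator below is a sketch of that calculation. `guardian_page_params` is a hypothetical helper, and the parameter names are copied from the query strings used above.

```python
import math

GUARDIAN_SEARCH = "https://content.guardianapis.com/search"
MAX_PAGE_SIZE = 200  # limit reported by the API error above

def guardian_page_params(total, from_date, to_date, page_size=MAX_PAGE_SIZE):
    """Yield one query-parameter dict per page needed to cover `total` results."""
    n_pages = math.ceil(total / page_size)
    for page in range(1, n_pages + 1):
        yield {
            "page": page,
            "from-date": from_date,
            "to-date": to_date,
            "order-by": "newest",
            "show-fields": "all",
            "page-size": page_size,
        }

september_pages = list(guardian_page_params(6738, "2020-09-01", "2020-09-30"))
print(len(september_pages))  # prints 34
```

Each dict can be passed to `requests.get(GUARDIAN_SEARCH, params={**p, "api-key": guardian_key})`, which avoids assembling the query string by hand.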

In [80]:
# Guardian responses keyed by page number; filled by the paging loop
guardian_oct_2020 = {}

In [129]:
# Fetch pages 1-26 of the October 2020 results (200 results per page, 5,200 in total)
for i in range(26):
    guardian_2020_10_response = requests.get('https://content.guardianapis.com/search?page={}&from-date=2020-10-01&to-date=2020-10-31&order-by=newest&show-fields=all&page-size=200&api-key={}'.format(i+1, guardian_key))
    guardian_oct_2020[i+1] = guardian_2020_10_response.json()

In [137]:
guardian_list = []
for i in range(26):
    guardian_list.append(guardian_oct_2020[i+1]['response']['results'])

In [140]:
guardian_list = []
for i in range(26):
    single_page = guardian_oct_2020[i+1]['response']['results']
    for j in range(len(single_page)):
        guardian_list.append(single_page[j]['webTitle'])
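
The nested loop above assumes every stored page holds a `results` list. If a request had failed, as in the `page-size` error earlier, indexing `['results']` would raise a `KeyError`. A slightly more defensive variant is sketched below. `web_titles` is a hypothetical helper, and the sample dictionary imitates only the two response shapes seen in this notebook.

```python
def web_titles(pages_by_number):
    """Flatten {page_number: response_json} into a list of webTitle strings."""
    titles = []
    for page_number in sorted(pages_by_number):
        payload = pages_by_number[page_number]["response"]
        # An error payload (e.g. bad page-size) has no 'results' key; skip it.
        for item in payload.get("results", []):
            titles.append(item["webTitle"])
    return titles

sample_pages = {
    2: {"response": {"status": "error",
                     "message": "page-size must be an integer between 0 and 200"}},
    1: {"response": {"results": [
        {"webTitle": "Italian senate suspended as lawmakers test positive –\xa0as it happened"},
    ]}},
}
print(len(web_titles(sample_pages)))  # prints 1
```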

In [96]:
guardian_2020_10_response = requests.get('https://content.guardianapis.com/search?page={}&from-date=2020-10-01&to-date=2020-10-31&order-by=newest&show-fields=all&page-size=200&api-key={}'.format(21, guardian_key))
guardian_oct_2020[21] = guardian_2020_10_response.json()

In [99]:
guardian_2020_10_response = requests.get('https://content.guardianapis.com/search?page={}&from-date=2020-10-01&to-date=2020-10-31&order-by=newest&show-fields=all&page-size=200&api-key={}'.format(22, guardian_key))
guardian_oct_2020[22] = guardian_2020_10_response.json()

In [141]:
len(guardian_list)

5200

In [142]:
df_guardian = pd.DataFrame(guardian_list)
df_guardian.to_csv('guardian_oct_2020.csv')