## Scraping Articles
This notebook queries the NYT Article Search API for metadata on op-ed and editorial articles about a given topic. It then scrapes the body of each article into a data frame.

In [1]:
import json
import re
import time
import urllib
import urllib.robotparser
from urllib.request import urlopen

import pandas as pd
import requests
from bs4 import BeautifulSoup as soup

In [2]:
api_key = 'GIzRkc7PriEuubEK19fHWqiknOuoB5r8'

The Article Search API returns a maximum of 10 results at a time. The meta node in the response contains the total number of matches ("hits") and the current offset. Use the page query parameter to paginate through results (page=0 for results 1-10, page=1 for 11-20, ...). You can paginate through up to 100 pages (1,000 results). If you get too many results, try filtering by date range.
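
The pagination arithmetic described above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the helper name `pages_to_fetch` and the two constants are assumptions introduced here, with their values taken from the API note above.

```python
import math

PAGE_SIZE = 10    # results per page, per the API note above
MAX_PAGES = 100   # pagination cap, per the API note above


def pages_to_fetch(hits, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Return the zero-based page numbers needed to cover `hits` results."""
    n_pages = min(math.ceil(hits / page_size), max_pages)
    return range(n_pages)


print(list(pages_to_fetch(25)))   # [0, 1, 2]
print(len(pages_to_fetch(5000)))  # 100, because of the pagination cap
```

Reading "hits" from the first response and passing it to a helper like this avoids requesting pages that cannot contain results.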

In [3]:
# Build the request url for articles whose keywords match the topic keywords

def get_url(topics, page):
    '''Create a url that can be passed to an NYT Article Search API request.
    input:
        topics: search keywords; pass multiple keywords as one string, e.g. topics = 'policy environment'
        page: zero-based page of results (page=0 for results 1-10, page=1 for 11-20, ...).
              The API paginates through at most 100 pages (1,000 results).
    output: a url string
    '''

    head_url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?'
    # the fq filter restricts results to the opinion news desks
    url = head_url + 'q=' + topics + \
        '&page=' + str(page) + \
        '&fq=news_desk:("OpEd" "Opinion" "Editorial")&api-key=' + api_key

    return url
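
The function above builds the query string by concatenation, so spaces and quotes in the topic or filter are left for the HTTP client to escape. An alternative sketch, assuming nothing beyond the standard library, lets `urllib.parse.urlencode` do the escaping. The name `build_search_url` and the placeholder key `'YOUR_KEY'` are hypothetical and not part of the notebook.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'


def build_search_url(topics, page, api_key):
    """Return an Article Search url with every parameter percent-encoded."""
    params = {
        'q': topics,
        'page': page,
        'fq': 'news_desk:("OpEd" "Opinion" "Editorial")',
        'api-key': api_key,
    }
    return BASE_URL + '?' + urlencode(params)


url = build_search_url('gun control', 0, 'YOUR_KEY')  # placeholder key
query = parse_qs(urlsplit(url).query)                 # decode again to inspect
print(query['q'])     # ['gun control']
print(query['page'])  # ['0']
```

Passing the same `params` dictionary to `requests.get(BASE_URL, params=params)` would have a similar effect without building the string by hand.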

In [None]:
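# check to see if we can scrape from NYT
# note: `data` is created in a later cell, so that cell has to run before this one
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.nytimes.com/robots.txt")
rp.read()
rrate = rp.request_rate("*")  # Request-rate rule for all user agents, or None if unset
rp.crawl_delay("*")  # Crawl-delay rule for all user agents, or None if unset
rp.can_fetch("*", data['web_url'][0])  # True if robots.txt allows fetching the first article url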
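
To see what these three calls return without touching the network, the parser can be fed a robots.txt held in memory. The rules and the `example.com` urls below are made up for illustration and say nothing about the real NYT robots.txt.

```python
import urllib.robotparser

# Hypothetical rules, used only to show the parser's behaviour.
sample_rules = """User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

rp_demo = urllib.robotparser.RobotFileParser()
rp_demo.parse(sample_rules)

blocked = rp_demo.can_fetch("*", "https://example.com/private/page")
allowed = rp_demo.can_fetch("*", "https://example.com/opinion/page")
delay = rp_demo.crawl_delay("*")
rate = rp_demo.request_rate("*")

print(blocked, allowed)  # False True
print(delay, rate)       # 5 None
```

`request_rate` returns `None` here because the sample rules contain no `Request-rate` line, so a scraper should handle that case before reading fields from the result.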

#### Polarizing topics
- I want to choose polarizing topics to start with, so that the NYT will have published multiple opinions on each.
- Potential subjects to focus on: environment, healthcare, immigration, economy, abortion, taxes, gun control.
- I'm going to start with immigration to test this model because I think it contains a lot of different sub-topics.

In [116]:
# Get article urls and metadata for immigration and put in table
url = get_url(topics = 'immigration', page=1)

# put this page of results into a data frame (pages are zero-based, so page=1 holds results 11-20)
r = requests.get(url)
data = r.json()['response']['docs']
data = pd.DataFrame(data)

# Clean up dataframe: keep the useful columns and flatten the nested headline and byline fields
# (.copy() avoids a SettingWithCopyWarning on the assignments below)
data = data[['abstract', 'web_url', 'snippet', 'lead_paragraph', 'source',
             'headline', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline']].copy()
data['headline'] = [row['main'] for row in data['headline']]
data['byline'] = [row['original'] for row in data['byline']]
data.head()

Unnamed: 0,abstract,web_url,snippet,lead_paragraph,source,headline,pub_date,document_type,news_desk,section_name,byline
0,Mr. Sanders interviews for The New York Times’...,https://www.nytimes.com/interactive/2020/01/13...,Mr. Sanders interviews for The New York Times’...,Mr. Sanders interviews for The New York Times’...,The New York Times,Bernie Sanders Wants to Change Your Mind,2020-01-13T10:00:01+0000,multimedia,Opinion,Opinion,By The Editorial Board
1,She has big ideas for repairing the American e...,https://www.nytimes.com/2019/03/15/opinion/sun...,She has big ideas for repairing the American e...,Bill Clinton had a consequential presidency wh...,The New York Times,Elizabeth Warren Actually Wants to Fix Capitalism,2019-03-15T19:34:32+0000,article,OpEd,Opinion,By David Leonhardt
2,A field guide for Democrats desperate to paint...,https://www.nytimes.com/interactive/2017/09/30...,A field guide for Democrats desperate to paint...,A field guide for Democrats desperate to paint...,The New York Times,Who Can Beat Trump in 2020?,2017-09-30T15:43:04+0000,multimedia,Opinion,Opinion,By JASON ZENGERLE
3,Thousands of readers told us who they think sh...,https://www.nytimes.com/2020/01/17/opinion/dem...,Thousands of readers told us who they think sh...,"On Sunday, The New York Times’s editorial boar...",The New York Times,Which Candidate Would You Endorse?,2020-01-17T19:54:10+0000,article,OpEd,Opinion,By Rachel L. Harris and Lisa Tarchak
4,Mr. Buttigieg interviews for The New York Time...,https://www.nytimes.com/interactive/2020/01/16...,Mr. Buttigieg interviews for The New York Time...,Mr. Buttigieg interviews for The New York Time...,The New York Times,Pete Buttigieg Says He’s More Than a Résumé,2020-01-16T10:00:01+0000,multimedia,Opinion,Opinion,By The Editorial Board
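
The clean-up step above pulls `main` out of each headline and `original` out of each byline, which assumes both nested fields are present in every record. A more defensive sketch is shown below; the helper `flatten_doc` and the sample record are hypothetical, and only the field names come from the cell above.

```python
def flatten_doc(doc):
    """Flatten one API record, tolerating a missing headline or byline."""
    headline = doc.get('headline') or {}
    byline = doc.get('byline') or {}
    return {
        'web_url': doc.get('web_url'),
        'headline': headline.get('main'),
        'byline': byline.get('original'),
    }


# Made-up record with no byline, to show the fallback.
sample_doc = {
    'web_url': 'https://example.com/article',
    'headline': {'main': 'A Sample Headline'},
    'byline': None,
}
flat = flatten_doc(sample_doc)
print(flat['headline'])  # A Sample Headline
print(flat['byline'])    # None
```

A list of such flattened dictionaries can be passed straight to `pd.DataFrame`.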


NYT allows 4,000 requests per day and 10 requests per minute. You should sleep 6 seconds between calls to avoid hitting the per minute rate limit.
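
The 6-second figure follows from the per-minute limit: 60 seconds divided by 10 requests. A small throttle that enforces such a gap could look like the sketch below. The `Throttle` class and `min_interval` helper are illustrative names, not part of the notebook or of any library.

```python
import time


def min_interval(requests_per_minute):
    """Seconds to leave between calls to stay within a per-minute limit."""
    return 60.0 / requests_per_minute


class Throttle:
    """Block in wait() until `interval` seconds have passed since the last call."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


print(min_interval(10))  # 6.0
throttle = Throttle(min_interval(10))
```

Calling `throttle.wait()` immediately before every `requests.get` keeps the spacing in place whether or not the previous request succeeded.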

In [117]:
# loop through the remaining pages and append to table (each page holds 10 results)
pages = range(2, 200)

for page in pages:
    # get url
    url = get_url(topics = 'immigration climate medicare abortion', page=page)
    # fetch response from url
    r = requests.get(url)
    try:
        # put data into a data frame
        data_temp = r.json()['response']['docs']
        data_temp = pd.DataFrame(data_temp)
        # Clean up dataframe
        data_temp = data_temp[['abstract', 'web_url', 'snippet', 'lead_paragraph', 'source',
             'headline', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline']].copy()
        data_temp['headline'] = [row['main'] for row in data_temp['headline']]
        data_temp['byline'] = [row['original'] for row in data_temp['byline']]
        # append to dataframe (DataFrame.append was removed in pandas 2.0)
        data = pd.concat([data, data_temp])
    except (KeyError, TypeError, ValueError):
        # the reply did not contain the expected payload, e.g. a 429 (rate limited) status
        print(page, ":", r.status_code)
    # pause after every request, including failed ones, so a 429 is not followed by an immediate retry
    time.sleep(30)

2 : 429
3 : 429
4 : 429
5 : 429
6 : 429
7 : 429
8 : 429
9 : 429
10 : 429
11 : 429
12 : 429
13 : 429
14 : 429
15 : 429
16 : 429
17 : 429
18 : 429
19 : 429


KeyboardInterrupt: 

In [111]:
len(data)

2000

In [101]:
url = get_url(topics = 'immigration', page=1)
r = requests.get(url)
data_temp = r.json()['response']['docs']
data_temp = pd.DataFrame(data_temp)
data_temp.head()

Unnamed: 0,abstract,web_url,snippet,lead_paragraph,source,multimedia,headline,keywords,pub_date,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri,print_section,print_page,subsection_name
0,The administration’s security rationale for it...,https://www.nytimes.com/2020/02/04/opinion/tru...,The administration’s security rationale for it...,"On Friday, with Americans focused on President...",The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'Trump Posts Another ‘Keep Out’ Sign ...,"[{'name': 'subject', 'value': 'Immigration and...",2020-02-05T00:32:02+0000,article,Editorial,Opinion,"{'original': 'By The Editorial Board', 'person...",Editorial,nyt://article/45a5c8db-eecd-5705-8d86-4d0fbeec...,666,nyt://article/45a5c8db-eecd-5705-8d86-4d0fbeec...,,,
1,Adding Nigeria to the expanded list of exclude...,https://www.nytimes.com/2020/02/04/opinion/tru...,Adding Nigeria to the expanded list of exclude...,It’s happening a little bit out of public cons...,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'The Racism at the Heart of Trump’s ‘...,"[{'name': 'glocations', 'value': 'Nigeria', 'r...",2020-02-04T10:00:15+0000,article,OpEd,Opinion,"{'original': 'By Jamelle Bouie', 'person': [{'...",Op-Ed,nyt://article/fee8c33b-1600-5001-bd88-1b789479...,882,nyt://article/fee8c33b-1600-5001-bd88-1b789479...,,,
2,Congressional hearings are urgently needed to ...,https://www.nytimes.com/2020/02/07/opinion/dhs...,Congressional hearings are urgently needed to ...,“When the government tracks the location of a ...,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'The Government Uses ‘Near Perfect Su...,"[{'name': 'subject', 'value': 'Surveillance of...",2020-02-08T00:24:47+0000,article,Editorial,Opinion,"{'original': 'By The Editorial Board', 'person...",Op-Ed,nyt://article/9674e906-2901-5a44-b99b-69679bc7...,1012,nyt://article/9674e906-2901-5a44-b99b-69679bc7...,A,26.0,
3,The bar for being labeled a gang member is low...,https://www.nytimes.com/2020/02/03/opinion/los...,The bar for being labeled a gang member is low...,I found out I was in a gang database — a share...,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","{'main': 'Are You in a Gang Database?', 'kicke...","[{'name': 'subject', 'value': 'Gangs', 'rank':...",2020-02-04T00:00:07+0000,article,OpEd,Opinion,"{'original': 'By Stefano Bloch', 'person': [{'...",Op-Ed,nyt://article/449b791a-2f28-5904-8d54-b90d62ab...,1084,nyt://article/449b791a-2f28-5904-8d54-b90d62ab...,A,23.0,
4,Immigration can invigorate the country. But wh...,https://www.nytimes.com/2020/01/16/opinion/imm...,Immigration can invigorate the country. But wh...,"In 2001, when I was the new Washington corresp...",The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'I’m a Liberal Who Thinks Immigration...,"[{'name': 'subject', 'value': 'Illegal Immigra...",2020-01-16T23:48:30+0000,article,OpEd,Opinion,"{'original': 'By Jerry Kammer', 'person': [{'f...",Op-Ed,nyt://article/af2afd2c-d37d-5a32-b068-234b3d05...,1496,nyt://article/af2afd2c-d37d-5a32-b068-234b3d05...,A,23.0,


In [112]:
# Save data frame
data.to_csv('/Users/AuerPower/Metis/git/steel-man/data/immigration_2000_author.csv')

In [113]:
# Define a function to scrape the article body
def get_article_body(url):
    '''Take a url for a NYT article and return the body text.
    input: a url
    output: text string
    '''
    response = requests.get(url, auth=('user', 'pass'), timeout=4)
    doc = soup(response.text, 'html.parser')
    # body paragraphs are matched by this CSS class string, which is tied to the current page markup
    paragraphs = doc.find_all("p",{"class":"css-exrw3m evys1bk0"}) # this is a list of paragraphs
    # concatenate the paragraph texts (no separator is added between paragraphs)
    article_text = ''.join(paragraph.text for paragraph in paragraphs)
    return article_text
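
Because the function concatenates paragraph texts with no separator, the last word of one paragraph runs into the first word of the next. One possible alternative, sketched here with a hypothetical helper named `join_paragraphs`, is to join the stripped paragraph texts with a single space.

```python
def join_paragraphs(paragraphs, sep=' '):
    """Join paragraph strings with `sep`, skipping empty or whitespace-only ones."""
    cleaned = (p.strip() for p in paragraphs)
    return sep.join(p for p in cleaned if p)


# Made-up paragraph texts standing in for the scraped <p> elements.
joined = join_paragraphs(['First paragraph.', '   ', 'Second paragraph. '])
print(joined)  # First paragraph. Second paragraph.
```

Inside `get_article_body`, this would be called as `join_paragraphs(p.text for p in paragraphs)`.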

In [48]:
# create df of first row
row = 0
all_text = ''
url = data.iloc[row].web_url
headline = data.iloc[row].headline
abstract = data.iloc[row].abstract
body = get_article_body(url)
all_text = headline + abstract + body
df_text = pd.DataFrame({'all_text': all_text}, index=[row])
df_text

Unnamed: 0,all_text
0,The Racism at the Heart of Trump’s ‘Travel Ban...


In [53]:
# Loop through the remaining urls (rows 100 onward) and append to df_text
for row in range(100, len(data)):
    try:
        url = data.iloc[row].web_url
        headline = data.iloc[row].headline
        abstract = data.iloc[row].abstract
        body = get_article_body(url)
        all_text = headline + abstract + body
        df = pd.DataFrame({'all_text': all_text}, index=[row])
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        df_text = pd.concat([df_text, df])
        time.sleep(6)
    except Exception:
        # catching Exception rather than using a bare except keeps Ctrl-C working
        print("fail:", row)

fail: 147
fail: 161
fail: 173
fail: 179
fail: 253
fail: 262
fail: 270
fail: 271
fail: 284
fail: 286
fail: 291
fail: 298
fail: 299
fail: 303
fail: 304
fail: 305
fail: 307
fail: 318
fail: 322
fail: 324
fail: 330
fail: 355
fail: 423
fail: 436
fail: 440
fail: 471
fail: 519
fail: 520
fail: 627
fail: 637
fail: 664
fail: 675
fail: 681
fail: 683
fail: 698
fail: 700
fail: 712
fail: 724
fail: 738
fail: 751
fail: 766
fail: 810
fail: 891
fail: 894
fail: 907
fail: 948
fail: 955
fail: 965
fail: 985
fail: 1075
fail: 1083
fail: 1114
fail: 1134
fail: 1161
fail: 1188
fail: 1194
fail: 1200
fail: 1234
fail: 1236
fail: 1252
fail: 1297
fail: 1313
fail: 1316
fail: 1334
fail: 1337
fail: 1373
fail: 1374
fail: 1380
fail: 1381
fail: 1415
fail: 1425
fail: 1446
fail: 1462
fail: 1464
fail: 1471
fail: 1477
fail: 1480
fail: 1502
fail: 1516
fail: 1538
fail: 1567
fail: 1575
fail: 1579
fail: 1612
fail: 1622
fail: 1641
fail: 1660
fail: 1685
fail: 1696
fail: 1701
fail: 1703
fail: 1711
fail: 1721
fail: 1778
fail: 1779
fail

Not all articles were scraped successfully, but enough were to move forward with a proof of concept for the project.
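
If more coverage is needed later, the failed rows could be retried instead of skipped. The sketch below is one way to record failures and retry them; `scrape_all` is a hypothetical helper, and `fetch` stands for any function that takes a url and returns text, such as `get_article_body`.

```python
import time


def scrape_all(urls, fetch, retries=2, delay=0.0):
    """Fetch every url, retrying failures; return (results, failed) keyed by position."""
    results, failed = {}, {}
    for i, url in enumerate(urls):
        for attempt in range(retries + 1):
            try:
                results[i] = fetch(url)
                failed.pop(i, None)       # a later attempt succeeded
                break
            except Exception as err:      # record the reason, then try again
                failed[i] = repr(err)
                if delay:
                    time.sleep(delay)
    return results, failed
```

A call such as `scrape_all(data['web_url'], get_article_body, delay=6)` would return the scraped bodies alongside a dictionary of the rows that still failed and why, which is more informative than the printed row numbers.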

In [54]:
len(df_text)

1795

In [56]:
df_text.head()

Unnamed: 0,all_text
0,The Racism at the Heart of Trump’s ‘Travel Ban...
100,It’s Time for an Immigration EnchiladaPresiden...
101,The Immigration DealThe Senate last week seize...
102,Immigration MalpracticeTo millions of legal im...
103,Immigration MiseryAs the country waits for Con...


In [55]:
df_text.to_csv('/Users/AuerPower/Metis/git/steel-man/data/article_text/df_text_1795.csv')

In [75]:
# join the first run (rows 0-100) and the second run (rows 100-2000) together
df_text1 = pd.read_csv('/Users/AuerPower/Metis/git/steel-man/data/article_text/df_text_100.csv', index_col=0)
df_text1 = df_text1.iloc[1:]  # drop the first row of the first run
df_text2 = pd.read_csv('/Users/AuerPower/Metis/git/steel-man/data/article_text/df_text_1795.csv', index_col=0)
df_text2 = df_text2.iloc[1:]  # drop the first row of the second run
# DataFrame.append was removed in pandas 2.0, so use pd.concat
df_text1 = pd.concat([df_text1, df_text2])
df_text1.head()

Unnamed: 0,all_text
0,The Racism at the Heart of Trump’s ‘Travel Ban...
1,The Government Uses ‘Near Perfect Surveillance...
2,Are You in a Gang Database?The bar for being l...
3,I’m a Liberal Who Thinks Immigration Must Be R...
4,Starving for Justice in ICE DetentionFor immig...


In [93]:
# insert a space where a lower case letter is directly followed by an upper case letter,
# and where an upper case letter is directly followed by a capitalized word
df_text1['all_text'] = [re.sub(r'([a-z](?=[A-Z])|[A-Z](?=[A-Z][a-z]))', r'\1 ', row) for row in df_text1['all_text']]

# insert a space after any '.', ',' or '?' that is directly followed by a non-space character
df_text1['all_text'] = [re.sub(r'(?<=[.,?])(?=[^\s])', r' ', row) for row in df_text1['all_text']]
df_text1.head()

Unnamed: 0,all_text
0,The Racism at the Heart of Trump’s ‘Travel Ban...
1,The Government Uses ‘Near Perfect Surveillance...
2,Are You in a Gang Database? The bar for being ...
3,I’m a Liberal Who Thinks Immigration Must Be R...
4,Starving for Justice in ICE Detention For immi...
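
The effect of the two substitutions is easier to judge on short strings. The sketch below wraps the same two patterns in a hypothetical helper, `add_missing_spaces`, and runs it on small made-up inputs, including two where the rules overcorrect.

```python
import re

# The same two patterns as in the cell above.
CASE_BOUNDARY = re.compile(r'([a-z](?=[A-Z])|[A-Z](?=[A-Z][a-z]))')
MISSING_SPACE = re.compile(r'(?<=[.,?])(?=[^\s])')


def add_missing_spaces(text):
    """Apply the case-boundary rule, then the punctuation rule."""
    text = CASE_BOUNDARY.sub(r'\1 ', text)
    return MISSING_SPACE.sub(' ', text)


print(add_missing_spaces('Are You in a Gang Database?The bar'))
# Are You in a Gang Database? The bar
print(add_missing_spaces('Immigration MiseryAs the country'))
# Immigration Misery As the country
print(add_missing_spaces('McCain paid $1,000'))
# Mc Cain paid $1, 000
```

The last example shows the trade-off: names with internal capitals and numbers with thousands separators are split as well, which may or may not matter for the downstream model.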


In [96]:
# save file
df_text1.to_csv('/Users/AuerPower/Metis/git/steel-man/data/article_text/df_text_1892.csv')