# Data collection

The **goal** of this notebook is to gather data from The New York Times using the [Archive API](https://developer.nytimes.com/docs/archive-product/1/overview) and a unique API Key obtained from the [developer portion of the NYT website](https://developer.nytimes.com). This is the second notebook of The New York Times project series.

#### Importing tools and libraries

In [277]:
import os
import json
import time
import requests
import datetime
import dateutil
import pandas as pd
from ast import literal_eval
import matplotlib.pyplot as plt

pd.options.display.max_colwidth = 100
import nltk
from nltk.corpus import stopwords

#### Setting up data collection range

**Note:** Since we chose a large timeframe - 70 years, the code needs to be run in chunks. A decade seems like an optimal timeframe for each round. As the result, we will have the articles' headline and keywords for the range 1950-2020.

In [3]:
end = datetime.date(2009, 9, 30)
start = datetime.date(2000, 1, 1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Start date: 2000-01-01
End date: 2009-09-30


Each month is going to have its own csv file.

In [6]:
months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]
months_in_range

[['2000', '1'],
 ['2000', '2'],
 ['2000', '3'],
 ['2000', '4'],
 ['2000', '5'],
 ['2000', '6'],
 ['2000', '7'],
 ['2000', '8'],
 ['2000', '9'],
 ['2000', '10'],
 ['2000', '11'],
 ['2000', '12'],
 ['2001', '1'],
 ['2001', '2'],
 ['2001', '3'],
 ['2001', '4'],
 ['2001', '5'],
 ['2001', '6'],
 ['2001', '7'],
 ['2001', '8'],
 ['2001', '9'],
 ['2001', '10'],
 ['2001', '11'],
 ['2001', '12'],
 ['2002', '1'],
 ['2002', '2'],
 ['2002', '3'],
 ['2002', '4'],
 ['2002', '5'],
 ['2002', '6'],
 ['2002', '7'],
 ['2002', '8'],
 ['2002', '9'],
 ['2002', '10'],
 ['2002', '11'],
 ['2002', '12'],
 ['2003', '1'],
 ['2003', '2'],
 ['2003', '3'],
 ['2003', '4'],
 ['2003', '5'],
 ['2003', '6'],
 ['2003', '7'],
 ['2003', '8'],
 ['2003', '9'],
 ['2003', '10'],
 ['2003', '11'],
 ['2003', '12'],
 ['2004', '1'],
 ['2004', '2'],
 ['2004', '3'],
 ['2004', '4'],
 ['2004', '5'],
 ['2004', '6'],
 ['2004', '7'],
 ['2004', '8'],
 ['2004', '9'],
 ['2004', '10'],
 ['2004', '11'],
 ['2004', '12'],
 ['2005', '1'],
 ['2005',

#### Helper functions

The functions below help **automate the data collection**: <br>
     **send_request** prepares the request into the archive for a given date, and returns a *response*. <br>
     **is_valid** checks whether the article falls into the requested timeframe, confirms whether the headline is present, thus insuring the article's validity, and returns a binary result of *is_in_range* and *has_headline*.<br>
     **parse_response** turns the json response into a dataframe. *data* here is a dictionary where we specify what we want our columns to be. If everything is valid, it appends to the dictionary and returns a *dataframe*.<br>
     **get_data** uses *send_request* and *parse_response* as well as the *dates* specified by user, and then saves the headlines and other info to csv files corresponding to each month within the range.

In [27]:
def send_request(date):
    '''Sends a request to the NYT Archive API for given date.'''
    base_url = 'https://api.nytimes.com/svc/archive/v1/'
    url = base_url + '/' + date[0] + '/' + date[1] + '.json?api-key=' + 'F9FPP1mJjiX8pAEFAxBYBg08vZECa39n'
    response = requests.get(url).json()
    time.sleep(10)
    return response


def is_valid(article, date):
    '''An article is only worth checking if it is in range, and has a headline.'''
    is_in_range = date > start and date < end
    has_headline = type(article['headline']) == dict and 'main' in article['headline'].keys()
    return is_in_range and has_headline


def parse_response(response):
    '''Parses and returns response as pandas data frame.'''
    data = {'headline': [],  
        'date': [], 
        'doc_type': [],
        'material_type': [],
        'section': [],
        'keywords': []}
    
    articles = response['response']['docs'] 
    for article in articles: # For each article, make sure it falls within our date range
        date = dateutil.parser.parse(article['pub_date']).date()
        if is_valid(article, date):
            data['date'].append(date)
            data['headline'].append(article['headline']['main']) 
            if 'section' in article:
                data['section'].append(article['section_name'])
            else:
                data['section'].append(None)
            data['doc_type'].append(article['document_type'])
            if 'type_of_material' in article: 
                data['material_type'].append(article['type_of_material'])
            else:
                data['material_type'].append(None)
            keywords = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == 'subject']
            data['keywords'].append(keywords)
    return pd.DataFrame(data) 


def get_data(dates):
    '''Sends and parses request/response to/from NYT Archive API for given dates.'''
    total = 0
    print('Date range: ' + str(dates[0]) + ' to ' + str(dates[-1]))
    if not os.path.exists('headlines'):
        os.mkdir('headlines')
    for date in dates:
        response = send_request(date)
        df = parse_response(response)
        total += len(df)
        df.to_csv('headlines/' + date[0] + '-' + date[1] + '.csv', index=False)
        print('Saving headlines/' + date[0] + '-' + date[1] + '.csv...')
    print('Number of articles collected: ' + str(total))

In [8]:
get_data(months_in_range)

Date range: ['2000', '1'] to ['2009', '9']
Saving headlines/2000-1.csv...
Saving headlines/2000-2.csv...
Saving headlines/2000-3.csv...
Saving headlines/2000-4.csv...
Saving headlines/2000-5.csv...
Saving headlines/2000-6.csv...
Saving headlines/2000-7.csv...
Saving headlines/2000-8.csv...
Saving headlines/2000-9.csv...
Saving headlines/2000-10.csv...
Saving headlines/2000-11.csv...
Saving headlines/2000-12.csv...
Saving headlines/2001-1.csv...
Saving headlines/2001-2.csv...
Saving headlines/2001-3.csv...
Saving headlines/2001-4.csv...
Saving headlines/2001-5.csv...
Saving headlines/2001-6.csv...
Saving headlines/2001-7.csv...
Saving headlines/2001-8.csv...
Saving headlines/2001-9.csv...
Saving headlines/2001-10.csv...
Saving headlines/2001-11.csv...
Saving headlines/2001-12.csv...
Saving headlines/2002-1.csv...
Saving headlines/2002-2.csv...
Saving headlines/2002-3.csv...
Saving headlines/2002-4.csv...
Saving headlines/2002-5.csv...
Saving headlines/2002-6.csv...
Saving headlines/2002

In [5]:
end = datetime.date(2018, 12, 31)
start = datetime.date(2014, 2, 1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Start date: 2014-02-01
End date: 2018-12-31


In [6]:
months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]

In [7]:
get_data(months_in_range)

Date range: ['2014', '2'] to ['2018', '12']
Saving headlines/2014-2.csv...
Saving headlines/2014-3.csv...
Saving headlines/2014-4.csv...
Saving headlines/2014-5.csv...
Saving headlines/2014-6.csv...
Saving headlines/2014-7.csv...
Saving headlines/2014-8.csv...
Saving headlines/2014-9.csv...
Saving headlines/2014-10.csv...
Saving headlines/2014-11.csv...
Saving headlines/2014-12.csv...
Saving headlines/2015-1.csv...
Saving headlines/2015-2.csv...
Saving headlines/2015-3.csv...
Saving headlines/2015-4.csv...
Saving headlines/2015-5.csv...
Saving headlines/2015-6.csv...
Saving headlines/2015-7.csv...
Saving headlines/2015-8.csv...
Saving headlines/2015-9.csv...
Saving headlines/2015-10.csv...
Saving headlines/2015-11.csv...
Saving headlines/2015-12.csv...
Saving headlines/2016-1.csv...
Saving headlines/2016-2.csv...
Saving headlines/2016-3.csv...
Saving headlines/2016-4.csv...
Saving headlines/2016-5.csv...
Saving headlines/2016-6.csv...
Saving headlines/2016-7.csv...
Saving headlines/201

In [29]:
end = datetime.date(1999, 12, 31)
start = datetime.date(1990, 1, 1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Start date: 1990-01-01
End date: 1999-12-31


In [30]:
months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]

In [32]:
get_data(months_in_range)

Saving headlines/1990-1.csv...
Saving headlines/1990-2.csv...
Saving headlines/1990-3.csv...
Saving headlines/1990-4.csv...
Saving headlines/1990-5.csv...
Saving headlines/1990-6.csv...
Saving headlines/1990-7.csv...
Saving headlines/1990-8.csv...
Saving headlines/1990-9.csv...
Saving headlines/1990-10.csv...
Saving headlines/1990-11.csv...
Saving headlines/1990-12.csv...
Saving headlines/1991-1.csv...
Saving headlines/1991-2.csv...
Saving headlines/1991-3.csv...
Saving headlines/1991-4.csv...
Saving headlines/1991-5.csv...
Saving headlines/1991-6.csv...
Saving headlines/1991-7.csv...
Saving headlines/1991-8.csv...
Saving headlines/1991-9.csv...
Saving headlines/1991-10.csv...
Saving headlines/1991-11.csv...
Saving headlines/1991-12.csv...
Saving headlines/1992-1.csv...
Saving headlines/1992-2.csv...
Saving headlines/1992-3.csv...
Saving headlines/1992-4.csv...
Saving headlines/1992-5.csv...
Saving headlines/1992-6.csv...
Saving headlines/1992-7.csv...
Saving headlines/1992-8.csv...
Sa

In [33]:
end = datetime.date(1989, 12, 31)
start = datetime.date(1980, 1, 1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Start date: 1980-01-01
End date: 1989-12-31


In [34]:
months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]

In [35]:
get_data(months_in_range)

Saving headlines/1980-1.csv...
Saving headlines/1980-2.csv...
Saving headlines/1980-3.csv...
Saving headlines/1980-4.csv...
Saving headlines/1980-5.csv...
Saving headlines/1980-6.csv...
Saving headlines/1980-7.csv...
Saving headlines/1980-8.csv...
Saving headlines/1980-9.csv...
Saving headlines/1980-10.csv...
Saving headlines/1980-11.csv...
Saving headlines/1980-12.csv...
Saving headlines/1981-1.csv...
Saving headlines/1981-2.csv...
Saving headlines/1981-3.csv...
Saving headlines/1981-4.csv...
Saving headlines/1981-5.csv...
Saving headlines/1981-6.csv...
Saving headlines/1981-7.csv...
Saving headlines/1981-8.csv...
Saving headlines/1981-9.csv...
Saving headlines/1981-10.csv...
Saving headlines/1981-11.csv...
Saving headlines/1981-12.csv...
Saving headlines/1982-1.csv...
Saving headlines/1982-2.csv...
Saving headlines/1982-3.csv...
Saving headlines/1982-4.csv...
Saving headlines/1982-5.csv...
Saving headlines/1982-6.csv...
Saving headlines/1982-7.csv...
Saving headlines/1982-8.csv...
Sa

In [36]:
end = datetime.date(1979, 12, 31)
start = datetime.date(1970, 1, 1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Start date: 1970-01-01
End date: 1979-12-31


In [37]:
months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]

In [38]:
get_data(months_in_range)

Saving headlines/1970-1.csv...
Saving headlines/1970-2.csv...
Saving headlines/1970-3.csv...
Saving headlines/1970-4.csv...
Saving headlines/1970-5.csv...
Saving headlines/1970-6.csv...
Saving headlines/1970-7.csv...
Saving headlines/1970-8.csv...
Saving headlines/1970-9.csv...
Saving headlines/1970-10.csv...
Saving headlines/1970-11.csv...
Saving headlines/1970-12.csv...
Saving headlines/1971-1.csv...
Saving headlines/1971-2.csv...
Saving headlines/1971-3.csv...
Saving headlines/1971-4.csv...
Saving headlines/1971-5.csv...
Saving headlines/1971-6.csv...
Saving headlines/1971-7.csv...
Saving headlines/1971-8.csv...
Saving headlines/1971-9.csv...
Saving headlines/1971-10.csv...
Saving headlines/1971-11.csv...
Saving headlines/1971-12.csv...
Saving headlines/1972-1.csv...
Saving headlines/1972-2.csv...
Saving headlines/1972-3.csv...
Saving headlines/1972-4.csv...
Saving headlines/1972-5.csv...
Saving headlines/1972-6.csv...
Saving headlines/1972-7.csv...
Saving headlines/1972-8.csv...
Sa

In [39]:
end = datetime.date(1969, 12, 31)
start = datetime.date(1960, 1, 1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Start date: 1960-01-01
End date: 1969-12-31


In [40]:
months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]

In [41]:
get_data(months_in_range)

Saving headlines/1960-1.csv...
Saving headlines/1960-2.csv...
Saving headlines/1960-3.csv...
Saving headlines/1960-4.csv...
Saving headlines/1960-5.csv...
Saving headlines/1960-6.csv...
Saving headlines/1960-7.csv...
Saving headlines/1960-8.csv...
Saving headlines/1960-9.csv...
Saving headlines/1960-10.csv...
Saving headlines/1960-11.csv...
Saving headlines/1960-12.csv...
Saving headlines/1961-1.csv...
Saving headlines/1961-2.csv...
Saving headlines/1961-3.csv...
Saving headlines/1961-4.csv...
Saving headlines/1961-5.csv...
Saving headlines/1961-6.csv...
Saving headlines/1961-7.csv...
Saving headlines/1961-8.csv...
Saving headlines/1961-9.csv...
Saving headlines/1961-10.csv...
Saving headlines/1961-11.csv...
Saving headlines/1961-12.csv...
Saving headlines/1962-1.csv...
Saving headlines/1962-2.csv...
Saving headlines/1962-3.csv...
Saving headlines/1962-4.csv...
Saving headlines/1962-5.csv...
Saving headlines/1962-6.csv...
Saving headlines/1962-7.csv...
Saving headlines/1962-8.csv...
Sa

In [42]:
end = datetime.date(1959, 12, 31)
start = datetime.date(1950, 1, 1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Start date: 1950-01-01
End date: 1959-12-31


In [43]:
months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]

In [44]:
get_data(months_in_range)

Saving headlines/1950-1.csv...
Saving headlines/1950-2.csv...
Saving headlines/1950-3.csv...
Saving headlines/1950-4.csv...
Saving headlines/1950-5.csv...
Saving headlines/1950-6.csv...
Saving headlines/1950-7.csv...
Saving headlines/1950-8.csv...
Saving headlines/1950-9.csv...
Saving headlines/1950-10.csv...
Saving headlines/1950-11.csv...
Saving headlines/1950-12.csv...
Saving headlines/1951-1.csv...
Saving headlines/1951-2.csv...
Saving headlines/1951-3.csv...
Saving headlines/1951-4.csv...
Saving headlines/1951-5.csv...
Saving headlines/1951-6.csv...
Saving headlines/1951-7.csv...
Saving headlines/1951-8.csv...
Saving headlines/1951-9.csv...
Saving headlines/1951-10.csv...
Saving headlines/1951-11.csv...
Saving headlines/1951-12.csv...
Saving headlines/1952-1.csv...
Saving headlines/1952-2.csv...
Saving headlines/1952-3.csv...
Saving headlines/1952-4.csv...
Saving headlines/1952-5.csv...
Saving headlines/1952-6.csv...
Saving headlines/1952-7.csv...
Saving headlines/1952-8.csv...
Sa

#### Concatenate

We concatenate all those csv files into one big dataframe

In [5]:
import glob

path = 'headlines'
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

In [6]:
frame

Unnamed: 0,headline,date,doc_type,material_type,section,keywords
0,Missed Chances a Sign the Transformation Is No...,2006-12-03,article,News,,"['Football', 'College Athletics']"
1,The Collector,2006-12-03,article,News,,"['Collectors and Collections', 'Art', 'Travel ..."
2,Working Out Those Royal Father-Son Issues on a...,2006-12-02,article,Review,,"['Reviews', 'Opera']"
3,Joshua Rikon and Rebecca Benjamin,2006-12-03,article,News,,"['Dating (Social)', 'Weddings and Engagements']"
4,"Catch a Ferry, Unclog a Highway",2006-12-03,article,Editorial,,"['Transit Systems', 'Commuting', 'Ferries']"
...,...,...,...,...,...,...
11472897,RIGHTS CHIEF ASSAILS HIRING GOALS AS FAILURE,1985-11-01,article,News,,"['Speeches and Statements', 'AFFIRMATIVE ACTIO..."
11472898,WESTERN TELE COMMUNICAIONS reports earnings fo...,1985-10-31,article,Statistics,,['Company Reports']
11472899,The Griffin Quartet,1985-11-01,article,News,,['Music']
11472900,150 Guatemalans Seize Cathedral in Protest,1985-11-01,article,News,,"['Missing Persons', 'Demonstrations and Riots'..."


* Saving our dataframe to pickle. Contains ALL headers between 1950's and now.

In [7]:
import pickle
with open('frame_all.pickle', 'wb') as to_write:
    pickle.dump(frame, to_write)

* Checking if exports from pickle properly.

In [2]:
with open('frame_all.pickle', 'rb') as read_file:
    df = pickle.load(read_file)

Now, we move onto the next notebook, data_analysis, for some EDA and visuals