
# Accessing NY Times Database via Web APIs

In [1]:
# Import required libraries
import requests
import json
import math
import csv
import matplotlib.pyplot as plt
import time

## 1. Constructing API GET Request

In [2]:
# set base url
base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

For the API key, we'll use the following demonstration key for now. In the future, get your own; it only takes a few seconds!

In [3]:
# set key
key = "BBoHU8vBvJFxaeiazF1zpbwc4WKf6fwY"


For many APIs, you'll have to specify the response format, such as XML or JSON. For this particular API, the only possible response format is JSON, as the .json suffix in the URL shows, so we don't have to name it explicitly.

Now we need to send some sort of data in the URL’s query string. This data tells the API what information we want. In our case, we want articles about Covid. Requests allows you to provide these arguments as a dictionary, using the params keyword argument. In addition to the search term q, we have to put in the api-key term.

In [4]:
# set search parameters
search_params = {"q": "Covid",
                 "api-key": key}

In [5]:
# make request
r = requests.get(base_url, params=search_params)

In [6]:
print(r.url)

https://api.nytimes.com/svc/search/v2/articlesearch.json?q=Covid&api-key=BBoHU8vBvJFxaeiazF1zpbwc4WKf6fwY
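To see how the params dictionary turns into the query string in the URL above, the encoding step can be reproduced with the standard library. This is only a sketch of the idea, not how requests is implemented internally; `build_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

def build_url(base_url, params):
    """Append URL-encoded params to base_url as a query string."""
    return base_url + "?" + urlencode(params)

base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
# "YOUR_API_KEY" is a placeholder, not a real key
search_params = {"q": "Covid", "api-key": "YOUR_API_KEY"}

url = build_url(base_url, search_params)
print(url)
```

Each key/value pair becomes `key=value`, and the pairs are joined with `&` in dictionary order.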


### Challenge 1: Adding a date range
What if we only want to search within a particular date range? The NYT Article Search API allows us to specify start and end dates.

Alter search_params so that the request only searches for articles in the year 2020. Remember, since search_params is a dictionary, we can simply add the new keys to it.

In [7]:
# set date parameters here
search_params["begin_date"] = "20200101"
search_params["end_date"] = "20201231"
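Typing the YYYYMMDD strings by hand is easy to get wrong, so one option is to build them from date objects. The sketch below assumes nothing beyond the standard library; `nyt_date` and `year_range` are hypothetical helper names, not part of the API.

```python
from datetime import date

def nyt_date(d):
    """Format a date as the YYYYMMDD string used for begin_date/end_date."""
    return d.strftime("%Y%m%d")

def year_range(year):
    """Return (begin_date, end_date) strings covering one calendar year."""
    return nyt_date(date(year, 1, 1)), nyt_date(date(year, 12, 31))

begin, end = year_range(2020)
print(begin, end)
```

The two strings can then be assigned to `search_params["begin_date"]` and `search_params["end_date"]`.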

### Challenge 2: Specifying a results page
The above will return the first 10 results. To get the next 10, you need to add a "page" parameter. Pages are numbered from 0, so page 1 holds the second set of 10 results. Change the search parameters above to get the second 10 results.

In [8]:
# set page parameters here
search_params["page"] = 1

In [9]:
r = requests.get(base_url, params=search_params)
print(r.url)

https://api.nytimes.com/svc/search/v2/articlesearch.json?q=Covid&api-key=BBoHU8vBvJFxaeiazF1zpbwc4WKf6fwY&begin_date=20200101&end_date=20201231&page=1
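The relationship between a result's position and its page can be written down as a small calculation. This sketch assumes what the challenge describes: 10 results per page, with the first 10 on page 0. `page_for_result` is a hypothetical helper.

```python
RESULTS_PER_PAGE = 10  # assumption taken from the challenge text above

def page_for_result(position):
    """Return the zero-based page holding the result at a 1-based position."""
    if position < 1:
        raise ValueError("position must be 1 or greater")
    return (position - 1) // RESULTS_PER_PAGE

# results 1-10 sit on page 0, results 11-20 on page 1, and so on
for position in (1, 10, 11, 20, 21):
    print(position, page_for_result(position))
```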


## 2. Parsing the response text

In [10]:
# Inspect the content of the response, parsing the result as text
response_text = r.text
print(response_text[:1000])

{"status":"OK","copyright":"Copyright (c) 2021 The New York Times Company. All Rights Reserved.","response":{"docs":[{"abstract":"A sales executive who was placed on a ventilator while in a coma says she was denied accommodations to work from home as she recovered.","web_url":"https://www.nytimes.com/2020/12/24/business/tony-robbins-covid-lawsuit.html","snippet":"A sales executive who was placed on a ventilator while in a coma says she was denied accommodations to work from home as she recovered.","lead_paragraph":"Tony Robbins, the life coach and motivational speaker, discriminated against one of his employees by refusing to grant her the accommodations she needed to work from home after she contracted a debilitating case of Covid-19 in the spring, according to a lawsuit filed Wednesday.","print_section":"B","print_page":"5","source":"The New York Times","multimedia":[{"rank":0,"subtype":"xlarge","caption":null,"credit":null,"type":"image","url":"images/2020/12/23/lens/23xp-tonyrobbin

In [11]:
# Convert JSON response to a dictionary
data = json.loads(response_text)
print(data)

{'status': 'OK', 'copyright': 'Copyright (c) 2021 The New York Times Company. All Rights Reserved.', 'response': {'docs': [{'abstract': 'A sales executive who was placed on a ventilator while in a coma says she was denied accommodations to work from home as she recovered.', 'web_url': 'https://www.nytimes.com/2020/12/24/business/tony-robbins-covid-lawsuit.html', 'snippet': 'A sales executive who was placed on a ventilator while in a coma says she was denied accommodations to work from home as she recovered.', 'lead_paragraph': 'Tony Robbins, the life coach and motivational speaker, discriminated against one of his employees by refusing to grant her the accommodations she needed to work from home after she contracted a debilitating case of Covid-19 in the spring, according to a lawsuit filed Wednesday.', 'print_section': 'B', 'print_page': '5', 'source': 'The New York Times', 'multimedia': [{'rank': 0, 'subtype': 'xlarge', 'caption': None, 'credit': None, 'type': 'image', 'url': 'images

In [12]:
print(data.keys())

dict_keys(['status', 'copyright', 'response'])


In [13]:
data['copyright']

'Copyright (c) 2021 The New York Times Company. All Rights Reserved.'

In [14]:
data['response'].keys()

dict_keys(['docs', 'meta'])

In [15]:
print(data['response']['docs'])

[{'abstract': 'A sales executive who was placed on a ventilator while in a coma says she was denied accommodations to work from home as she recovered.', 'web_url': 'https://www.nytimes.com/2020/12/24/business/tony-robbins-covid-lawsuit.html', 'snippet': 'A sales executive who was placed on a ventilator while in a coma says she was denied accommodations to work from home as she recovered.', 'lead_paragraph': 'Tony Robbins, the life coach and motivational speaker, discriminated against one of his employees by refusing to grant her the accommodations she needed to work from home after she contracted a debilitating case of Covid-19 in the spring, according to a lawsuit filed Wednesday.', 'print_section': 'B', 'print_page': '5', 'source': 'The New York Times', 'multimedia': [{'rank': 0, 'subtype': 'xlarge', 'caption': None, 'credit': None, 'type': 'image', 'url': 'images/2020/12/23/lens/23xp-tonyrobbins-photo/23xp-tonyrobbins-photo-articleLarge.jpg', 'height': 428, 'width': 600, 'legacy': {

In [16]:
docs = data['response']['docs']

In [17]:
type(docs)

list
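Because docs is a list of dictionaries, individual fields can be pulled out with a loop or a list comprehension. The sketch below uses a trimmed, hand-written sample shaped like the response above (the headlines and URLs are made up) so that it runs without calling the API.

```python
import json

# a hand-written sample with the same nesting as the real response
sample_text = '''
{"status": "OK",
 "response": {"docs": [{"headline": {"main": "Example headline A"},
                        "web_url": "https://example.com/a"},
                       {"headline": {"main": "Example headline B"},
                        "web_url": "https://example.com/b"}],
              "meta": {"hits": 2}}}
'''

sample = json.loads(sample_text)
sample_docs = sample["response"]["docs"]

# each doc is a dict; the main headline is nested one level down
headlines = [doc["headline"]["main"] for doc in sample_docs]
print(headlines)
```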

In [18]:
docs[0]

{'_id': 'nyt://article/297bd87d-5667-537a-a94d-7fc2fd054849',
 'abstract': 'A sales executive who was placed on a ventilator while in a coma says she was denied accommodations to work from home as she recovered.',
 'byline': {'organization': None,
  'original': 'By Azi Paybarah and Michael Levenson',
  'person': [{'firstname': 'Azi',
    'lastname': 'Paybarah',
    'middlename': None,
    'organization': '',
    'qualifier': None,
    'rank': 1,
    'role': 'reported',
    'title': None},
   {'firstname': 'Michael',
    'lastname': 'Levenson',
    'middlename': None,
    'organization': '',
    'qualifier': None,
    'rank': 2,
    'role': 'reported',
    'title': None}]},
 'document_type': 'article',
 'headline': {'content_kicker': None,
  'kicker': None,
  'main': 'Lawsuit Accuses Tony Robbins of Discriminating Against Employee Who Got Covid',
  'name': None,
  'print_headline': 'Robbins Sued By Employee Who Had Covid',
  'seo': None,
  'sub': None},
 'keywords': [{'major': 'N',
   '

## 3. Putting everything together to get all the articles.

In [19]:
# set key
key = "BBoHU8vBvJFxaeiazF1zpbwc4WKf6fwY"

# set base url
base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

# set search parameters
search_params = {"q": "Covid",
                 "api-key": key,
                 "begin_date": "20200101",  # date must be in YYYYMMDD format
                 "end_date": "20200131"}

# make request
r = requests.get(base_url, params=search_params)

# pause 3 seconds before making any further calls
time.sleep(3)

# convert to a dictionary
data = json.loads(r.text)

# get number of hits
hits = data['response']['meta']['hits']
print("number of hits: ", str(hits))

# get number of pages
pages = int(math.ceil(hits / 10))
print("number of pages: ", str(pages))

number of hits:  105
number of pages:  11
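The page count is the number of hits divided by 10 and rounded up, because a partly filled last page still needs its own request. A short sketch of that arithmetic, with `page_count` as a hypothetical helper and 10 results per page assumed as above:

```python
import math

def page_count(hits, per_page=10):
    """Number of pages needed to hold `hits` results."""
    return math.ceil(hits / per_page)

def page_count_int(hits, per_page=10):
    """Same result using integer arithmetic only (ceiling division)."""
    return -(-hits // per_page)

print(page_count(105))      # 11: ten full pages plus one page of 5
print(page_count_int(105))  # 11
```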


Now we're ready to loop through our pages. We'll start off by creating an empty list all_docs which will be our accumulator variable. Then we'll loop through pages and make a request for each one.

In [21]:
# make an empty list where we'll hold all of our docs for every page
all_docs = []

# now we're ready to loop through the pages
for i in range(pages):
    print("collecting page", str(i))

    # set the page parameter
    search_params['page'] = i

    # make request
    r = requests.get(base_url, params=search_params)

    # get text and convert to a dictionary
    data = json.loads(r.text)

    # get just the docs
    docs = data['response']['docs']

    # add those docs to the big list
    all_docs = all_docs + docs

    time.sleep(3)  # pause between calls

collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
collecting page 8
collecting page 9
collecting page 10


Let's make sure we got all the articles.

In [22]:
assert len(all_docs) == data['response']['meta']['hits']
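If one of the calls fails, the parsed body may not contain a 'response' key, and the loop above would stop with a bare KeyError. The sketch below shows one way to make the loop report which page failed. It takes the fetching step as a function argument so that it can be tried offline; `collect_all_docs` and `fake_fetch` are hypothetical names, and the shape of a failed response is an assumption.

```python
import time

def collect_all_docs(fetch_page, pages, pause=3.0):
    """Call fetch_page(i) for each page and concatenate the 'docs' lists.

    fetch_page is any callable returning the parsed JSON dict for page i.
    """
    all_docs = []
    for i in range(pages):
        data = fetch_page(i)
        # assumption: a failed call yields a dict without a 'response' key
        if "response" not in data:
            raise RuntimeError(f"page {i} returned no 'response' key: {data}")
        all_docs.extend(data["response"]["docs"])
        if i < pages - 1:
            time.sleep(pause)  # pause between calls
    return all_docs

# stand-in for a real request, so the sketch runs without network access
def fake_fetch(i):
    return {"response": {"docs": [{"_id": f"doc-{i}-{k}"} for k in range(2)]}}

collected = collect_all_docs(fake_fetch, pages=3, pause=0)
print(len(collected))
```

In real use, `fetch_page` would set `search_params['page'] = i`, call `requests.get(base_url, params=search_params)`, and return the parsed JSON.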

### Challenge 3: Make a function
Using the code above, create a function called get_api_data() with the parameters term and year that returns all the documents containing that search term in that year.

In [None]:
#DEFINE YOUR FUNCTION HERE

def get_api_data(term, year):
    
    # set base url
    base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

    # set search parameters
    search_params = {"q": term,
                     "api-key": "BBoHU8vBvJFxaeiazF1zpbwc4WKf6fwY",
                     # date must be in YYYYMMDD format; cover the whole year
                     "begin_date": str(year) + "0101",
                     "end_date": str(year) + "1231"}

    # make request
    r = requests.get(base_url, params=search_params)
    time.sleep(3)

    # convert to a dictionary
    data = json.loads(r.text)

    # get number of hits
    hits = data['response']['meta']['hits']
    print("number of hits:", str(hits))

    # get number of pages
    pages = int(math.ceil(hits / 10))

    # make an empty list where we'll hold all of our docs for every page
    all_docs = []

    # now we're ready to loop through the pages
    for i in range(pages):
        print("collecting page", str(i))

        # set the page parameter
        search_params['page'] = i

        # make request
        r = requests.get(base_url, params=search_params)

        # get text and convert to a dictionary
        data = json.loads(r.text)

        # get just the docs
        docs = data['response']['docs']

        # add those docs to the big list
        all_docs = all_docs + docs

        time.sleep(3)  # pause between calls

    return all_docs

In [None]:
all_docs = get_api_data("Covid", 2020)

In [24]:
all_docs[0]

{'_id': 'nyt://interactive/669ba0d2-a0c6-5f87-9017-c92d47cf3b20',
 'abstract': 'Test your knowledge of this week’s health news.',
 'byline': {'organization': None,
  'original': 'By Toby Bilanow',
  'person': [{'firstname': 'Toby',
    'lastname': 'Bilanow',
    'middlename': None,
    'organization': '',
    'qualifier': None,
    'rank': 1,
    'role': 'reported',
    'title': None}]},
 'document_type': 'multimedia',
 'headline': {'content_kicker': None,
  'kicker': None,
  'main': 'Weekly Health Quiz: Optimism, Muscles and Coronavirus',
  'name': None,
  'print_headline': None,
  'seo': None,
  'sub': None},
 'keywords': [{'major': 'N',
   'name': 'subject',
   'rank': 1,
   'value': 'Coronavirus (2019-nCoV)'},
  {'major': 'N', 'name': 'subject', 'rank': 2, 'value': "Alzheimer's Disease"},
  {'major': 'N',
   'name': 'subject',
   'rank': 3,
   'value': 'SARS (Severe Acute Respiratory Syndrome)'},
  {'major': 'N', 'name': 'subject', 'rank': 4, 'value': 'Prostate Gland'},
  {'major':

In [25]:
def format_articles(unformatted_docs):
    '''
    This function takes in a list of documents returned by the NYT API
    and parses the documents into a list of dictionaries
    with 'id', 'headline', and 'date' keys
    '''
    formatted = []
    for i in unformatted_docs:
        dic = {}
        dic['id'] = i['_id']
        dic['headline'] = i['headline']['main']
        dic['date'] = i['pub_date'][0:10]  # cutting time of day.
        formatted.append(dic)
    return formatted

In [26]:
all_formatted = format_articles(all_docs)

In [27]:
all_formatted[:5]

[{'date': '2020-01-30',
  'headline': 'Weekly Health Quiz: Optimism, Muscles and Coronavirus',
  'id': 'nyt://interactive/669ba0d2-a0c6-5f87-9017-c92d47cf3b20'},
 {'date': '2020-01-28',
  'headline': 'Coronavirus World Map: Tracking the Global Outbreak',
  'id': 'nyt://interactive/e60f3741-87f7-531d-9e65-f73b61ffd30c'},
 {'date': '2020-01-30',
  'headline': 'A Virus’s Journey Across China',
  'id': 'nyt://article/217bd4ba-d91e-5c4b-8926-71d190808e6e'},
 {'date': '2020-01-31',
  'headline': 'State Department Warns Against Traveling to China Amid Coronavirus Outbreak',
  'id': 'nyt://article/27fd86f0-c453-54ef-b806-143bb6111f4e'},
 {'date': '2020-01-30',
  'headline': 'Wall St. Shrugs Off Risks Related to Viral Outbreak in China',
  'id': 'nyt://article/9d4aca5a-7332-595b-8be2-587009cf9725'}]

### Challenge 4: Collect more fields
Edit the function above so that we include the lead_paragraph and word_count fields.

HINT: Some articles may not contain a lead_paragraph, in which case trying to access that key raises a KeyError. You need to add a conditional statement that takes this into consideration.

Advanced: Add another key that returns a list of keywords associated with the article.

In [28]:
def format_articles(unformatted_docs):
    '''
    This function takes in a list of documents returned by the NYT API
    and parses the documents into a list of dictionaries
    with 'id', 'headline', 'date', 'lead_paragraph', 'word_count',
    and 'keywords' keys
    '''
    formatted = []
    for i in unformatted_docs:
        dic = {}
        dic['id'] = i['_id']
        dic['headline'] = i['headline']['main']
        dic['date'] = i['pub_date'][0:10]  # cutting time of day.

        # some articles have no lead_paragraph, so only copy it when present
        if 'lead_paragraph' in i:
            dic['lead_paragraph'] = i['lead_paragraph']
        dic['word_count'] = i['word_count']
        # collect the value of every keyword attached to the article
        dic['keywords'] = [keyword['value'] for keyword in i['keywords']]

        formatted.append(dic)

    return formatted
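An alternative to the conditional is `dict.get()`, which returns a default value when a key is missing. The sketch below is one possible variant, not the only answer to the challenge; `format_article` is a hypothetical single-document helper, the defaults are a design choice, and the sample document is made up.

```python
def format_article(doc):
    """Flatten one document, tolerating missing optional fields."""
    return {
        "id": doc["_id"],
        "headline": doc["headline"]["main"],
        "date": doc["pub_date"][0:10],  # cutting time of day
        # .get() falls back to the default instead of raising KeyError
        "lead_paragraph": doc.get("lead_paragraph", ""),
        "word_count": doc.get("word_count", 0),
        "keywords": [kw["value"] for kw in doc.get("keywords", [])],
    }

# made-up document with no lead_paragraph or word_count
sample_doc = {
    "_id": "example-id",
    "headline": {"main": "Example headline"},
    "pub_date": "2020-01-30T10:00:00+0000",
    "keywords": [{"name": "subject", "value": "Epidemics"}],
}

formatted_doc = format_article(sample_doc)
print(formatted_doc)
```

With this approach every formatted dictionary has the same set of keys, which is convenient when the rows are later written to a table.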

In [29]:
all_formatted = format_articles(all_docs)
all_formatted[:5]

[{'date': '2020-01-30',
  'headline': 'Weekly Health Quiz: Optimism, Muscles and Coronavirus',
  'id': 'nyt://interactive/669ba0d2-a0c6-5f87-9017-c92d47cf3b20',
  'keywords': ['Coronavirus (2019-nCoV)',
   "Alzheimer's Disease",
   'SARS (Severe Acute Respiratory Syndrome)',
   'Prostate Gland',
   'Heart',
   'Elderly',
   'Longevity',
   'Hygiene and Cleanliness',
   'Muscles',
   'Deaths (Fatalities)',
   'Medicine and Health',
   'Optimism',
   'Exercise'],
  'lead_paragraph': 'Test your knowledge of this week’s health news.',
  'word_count': 0},
 {'date': '2020-01-28',
  'headline': 'Coronavirus World Map: Tracking the Global Outbreak',
  'id': 'nyt://interactive/e60f3741-87f7-531d-9e65-f73b61ffd30c',
  'keywords': ['Coronavirus (2019-nCoV)',
   'Epidemics',
   'Centers for Disease Control and Prevention',
   'Johns Hopkins University',
   'Wuhan (China)',
   'China',
   'United States',
   'Australia',
   'Singapore',
   'Disease Rates',
   'Deaths (Fatalities)'],
  'lead_paragra

## 5. Exporting

In [30]:
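# use the keys of the first record as the CSV column names
keys = all_formatted[0].keys()
# write the header row, then one row per article
with open('all-formatted.csv', 'w', newline='', encoding='utf-8') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_formatted)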
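One detail to watch when exporting: the keywords field is a list, so DictWriter writes it as the text of a Python list. A possible workaround is to join the keywords into a single string first. The sketch below writes to an in-memory buffer with made-up rows so that it runs without touching the file system; `write_csv` is a hypothetical helper and the "; " separator is an arbitrary choice.

```python
import csv
import io

# made-up rows shaped like the formatted articles
rows = [
    {"id": "a1", "headline": "First", "keywords": ["China", "Epidemics"]},
    {"id": "b2", "headline": "Second", "keywords": []},
]

def write_csv(rows, output_file):
    """Write rows to output_file, joining each keywords list into one string."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(output_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        flat = dict(row)  # copy so the original row is left unchanged
        flat["keywords"] = "; ".join(row["keywords"])
        writer.writerow(flat)

buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass a file object opened with `newline=''` in place of the buffer.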