# Accessing Databases via Web APIs
* * * * *

In [None]:
# Import required libraries
# (__future__ imports must be the first statement in the cell)
from __future__ import division
import requests
import json
import math
import csv
import matplotlib.pyplot as plt
import time

## 1. Constructing API GET Request
*****

First, we know that every call will require us to provide:

1. a base URL for the API, and
2. some authorization code or key.

So let's store those in some variables.

To get the base URL, we can simply consult the [documentation](https://developer.nytimes.com/). The New York Times has a lot of different APIs. If we scroll down, the second one is the [Article Search API](https://developer.nytimes.com/article_search_v2.json), which is the one we want. That page lists the URL, so let's assign it to a variable.

In [None]:
# set base url
base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

For the API key, we'll use the following demonstration keys for now, but in the future, [get your own](https://developer.nytimes.com/signup). It only takes a few seconds!

1. 18046cd15e21e1b9996ddfb6dafbb578:4:44644296
2. 86c06083cec242518ea58415fd9d3861
3. b931c838cdb745bbab0f213cfc16b7a5:12:44644296
4. 18046cd15e21e1b9996ddfb6dafbb578:4:44644296
5. be8992a420bfd16cf65e8757f77a5403:8:44644296

In [None]:
# set key
key = "be8992a420bfd16cf65e8757f77a5403:8:44644296"

For many APIs, you'll have to specify the response format, such as XML or JSON. But for this particular API, the only possible response format is JSON, as we can see in the URL, so we don't have to name it explicitly.

Now we need to send some sort of data in the URL’s query string. This data tells the API what information we want. In our case, we want articles about Duke Ellington. Requests allows you to provide these arguments as a dictionary, using the `params` keyword argument. In addition to the search term `q`, we have to put in the `api-key` term.

In [None]:
# set search parameters
search_params = {"q": "Duke Ellington",
                 "api-key": key}

Now we're ready to make the request. We use the `.get` method from the `requests` library to make an HTTP GET Request.

In [None]:
# make request
r = requests.get(base_url, params=search_params)
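
Before we look at the response, it helps to see what happens to the `params` dictionary: it gets URL-encoded and appended to the base URL as a query string. Here is a rough sketch of that step using only the standard library. It is an approximation of what `requests` does for us, and `"YOUR_KEY"` is a placeholder rather than a real key.

```python
from urllib.parse import urlencode

base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

# "YOUR_KEY" is a placeholder, not a real API key
search_params = {"q": "Duke Ellington", "api-key": "YOUR_KEY"}

# urlencode joins key=value pairs with "&" and escapes special characters
# (the space in the search term becomes "+")
query_string = urlencode(search_params)
full_url = base_url + "?" + query_string

print(query_string)
print(full_url)
```

Notice that the space in the search term is escaped, which is why we never have to build the URL by hand.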

Now we have a [response](http://docs.python-requests.org/en/latest/api/#requests.Response) object called `r`. We can get all the information we need from this object. For instance, we can check that the URL has been correctly encoded by printing it.

In [None]:
print(r.url)

Click on that link to see what it returns!

It's not very pleasant looking, but in the next section we will work on parsing it into something more palatable. For now let's try adding some parameters to our search.

### Challenge 1:  Adding a date range

What if we only want to search within a particular date range? The NYT Article Search API allows us to specify start and end dates.

Alter `search_params` so that the request only searches for articles in the year 2015. Remember, since `search_params` is a dictionary, we can simply add the new keys to it.

Use the [documentation](https://developer.nytimes.com/article_search_v2.json#/Documentation/GET/articlesearch.json) to see how to format the new parameters.

In [None]:
# set date parameters here

In [None]:
# Uncomment to test
# r = requests.get(base_url, params=search_params)
# print(r.url)

### Challenge 2:  Specifying a results page

The above will return the first 10 results. To get the next ten, you need to add a "page" parameter. Change the search parameters above to get the second 10 results. 

In [None]:
# set page parameters here

In [None]:
# Uncomment to test
# r = requests.get(base_url, params=search_params)
# print(r.url)

## 2. Parsing the response text
*****

We can read the content of the server’s response using `.text` from `requests`.

In [None]:
# Inspect the content of the response, parsing the result as text
response_text = r.text
print(response_text[:1000])

What you see here is JSON text, encoded as unicode text. JSON stands for "JavaScript Object Notation." It has a very similar structure to a Python dictionary: both are built on key/value pairs. This makes it easy to convert a JSON response to a Python dictionary. We do this with the `json.loads()` function.
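
To see the conversion on something small first, here is a minimal sketch with a hand-written JSON string. The string and its keys are made up for illustration and are not part of the API response.

```python
import json

# A tiny hand-written JSON string (illustrative only, not from the API)
sample_text = '{"name": "Duke Ellington", "born": 1899, "instruments": ["piano"]}'

# json.loads parses JSON text into ordinary Python objects
sample = json.loads(sample_text)

print(type(sample))             # a dictionary
print(sample["name"])           # JSON strings become Python strings
print(sample["born"] + 1)       # JSON numbers become Python numbers
print(sample["instruments"][0]) # JSON arrays become Python lists
```

Nested JSON objects become nested dictionaries, so we can keep indexing with square brackets to drill down.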

In [None]:
# Convert JSON response to a dictionary
data = json.loads(response_text)
print(data)

That looks intimidating! But it's really just a big dictionary. Let's see what keys we got in there.

In [None]:
print(data.keys())

In [None]:
# this is boring
data['status']

In [None]:
# so is this
data['copyright']

In [None]:
# this looks more promising
data['response']

We'll need to parse this dictionary even further. Let's look at its keys.

In [None]:
data['response'].keys()

In [None]:
data['response']['meta']

Looks like we probably want `docs`.

In [None]:
print(data['response']['docs'])

That looks like what we want! Let's assign that to its own variable.

In [None]:
docs = data['response']['docs']

So that we can further manipulate this, we need to know what type of object it is.

In [None]:
type(docs)

That makes things easy. Let's take a look at the first doc.

In [None]:
docs[0]

## 3. Putting everything together to get all the articles.
*****

That's great. But we only have 10 items. The original response said we had 65 hits! At 10 results per page, that means we have to make 65 / 10 requests, rounded up to 7, to get them all. Sounds like a job for a loop!
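
The rounding-up step is worth checking on its own. A minimal sketch, where `pages_needed` is a hypothetical helper name and 10 results per page is the page size described above:

```python
import math

def pages_needed(hits, per_page=10):
    """Number of requests needed to fetch `hits` results, `per_page` at a time."""
    # math.ceil rounds up, so a partly filled last page still counts
    return math.ceil(hits / per_page)

print(pages_needed(65))  # 7: six full pages plus one page of 5
print(pages_needed(60))  # 6: no partial page
```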

But first, let's review what we've done so far.

In [None]:
# set key
key = "be8992a420bfd16cf65e8757f77a5403:8:44644296"

# set base url
base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

# set search parameters
search_params = {"q": "Duke Ellington",
                 "api-key": key,
                 "begin_date": "20150101",  # date must be in YYYYMMDD format
                 "end_date": "20151231"}

# make request
r = requests.get(base_url, params=search_params)

# pause for 3 seconds so we don't call the API too quickly
time.sleep(3)

# convert to a dictionary
data = json.loads(r.text)

# get number of hits
hits = data['response']['meta']['hits']
print("number of hits: ", str(hits))

# get number of pages
pages = int(math.ceil(hits / 10))
print("number of pages: ", str(pages))

Now we're ready to loop through our pages. We'll start off by creating an empty list `all_docs` which will be our accumulator variable. Then we'll loop through `pages` and make a request for each one.

In [None]:
# make an empty list where we'll hold all of our docs for every page
all_docs = []

# now we're ready to loop through the pages
for i in range(pages):
    print("collecting page", str(i))

    # set the page parameter
    search_params['page'] = i

    # make request
    r = requests.get(base_url, params=search_params)

    # get text and convert to a dictionary
    data = json.loads(r.text)

    # get just the docs
    docs = data['response']['docs']

    # add those docs to the big list
    all_docs.extend(docs)

    time.sleep(3)  # pause between calls

Let's make sure we got all the articles.

In [None]:
assert len(all_docs) == data['response']['meta']['hits']

We did it!
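
The accumulator pattern can also be tried out without spending any API calls. In the sketch below, `fake_fetch_page` is a made-up stand-in for the real request: it just returns placeholder documents, assuming 65 hits and 10 results per page as in our search.

```python
import math

def fake_fetch_page(page, total_hits=65, per_page=10):
    """Hypothetical stand-in for one API call: returns the docs on `page`."""
    start = page * per_page
    stop = min(start + per_page, total_hits)
    return [{"_id": n} for n in range(start, stop)]

total_hits = 65
pages = math.ceil(total_hits / 10)

collected = []
for page in range(pages):  # pages are numbered from 0, as in the loop above
    collected.extend(fake_fetch_page(page))

print(len(collected))
```

The last page holds only 5 documents, which is why the page count has to be rounded up rather than down.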

### Challenge 3: Make a function

Using the code above, create a function called `get_api_data()` with the parameters `term` and `year` that returns all the documents containing that search term in that year.
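
One piece you may find handy: the API expects dates as YYYYMMDD strings, so a year has to be turned into a begin date and an end date. A possible sketch, where `year_to_date_range` is a hypothetical helper name and not part of any library:

```python
import datetime

def year_to_date_range(year):
    """Return (begin_date, end_date) as YYYYMMDD strings covering one year."""
    begin = datetime.date(year, 1, 1)
    end = datetime.date(year, 12, 31)
    return begin.strftime("%Y%m%d"), end.strftime("%Y%m%d")

begin_date, end_date = year_to_date_range(2014)
print(begin_date, end_date)
```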

In [None]:
#DEFINE YOUR FUNCTION HERE

In [None]:
# uncomment to test
# get_api_data("Duke Ellington", 2014)

## 4. Formatting
*****

Let's take another look at one of these documents.

In [None]:
all_docs[0]

This is all great, but it's pretty messy. What we’d really like to have, eventually, is a CSV, with each row representing an article, and each column representing something about that article (headline, date, etc.). As we saw before, the best way to do this is to make a list of dictionaries, with each dictionary representing an article and each key representing a field of metadata from that article (e.g. headline, date, etc.). We can do this with a custom function:

In [None]:
def format_articles(unformatted_docs):
    '''
    This function takes in a list of documents returned by the NYT api 
    and parses the documents into a list of dictionaries, 
    with 'id', 'headline', and 'date' keys
    '''
    formatted = []
    for i in unformatted_docs:
        dic = {}
        dic['id'] = i['_id']
        dic['headline'] = i['headline']['main']
        dic['date'] = i['pub_date'][0:10]  # cutting time of day.
        formatted.append(dic)
    return formatted

In [None]:
all_formatted = format_articles(all_docs)

In [None]:
all_formatted[:5]

### Challenge 4: Collect more fields

Edit the function above so that we include the `lead_paragraph` and `word_count` fields.

**HINT**: Some articles may not contain a `lead_paragraph`, in which case trying to access that key (which doesn't exist) will throw an error. You need to add a conditional statement that takes this into account.

**Advanced**: Add another key that returns a list of `keywords` associated with the article.
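
As a nudge for the hint, here are two common ways to cope with a key that may be missing from a dictionary. The two documents below are hypothetical and much smaller than real API results.

```python
# Two hypothetical documents; the second has no 'lead_paragraph' key
with_field = {"_id": "a1", "lead_paragraph": "Some opening text."}
without_field = {"_id": "b2"}

for doc in (with_field, without_field):
    # Option 1: test for the key before indexing
    if "lead_paragraph" in doc:
        lead = doc["lead_paragraph"]
    else:
        lead = None

    # Option 2: dict.get returns a default instead of raising KeyError
    same_lead = doc.get("lead_paragraph", None)

    print(doc["_id"], lead, same_lead)
```

Either approach works inside the loop of `format_articles`; which one reads better is a matter of taste.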

In [None]:
def format_articles(unformatted_docs):
    '''
    This function takes in a list of documents returned by the NYT api 
    and parses the documents into a list of dictionaries, 
    with 'id', 'headline', 'date', 'lead_paragraph', and 'word_count' keys
    '''
    formatted = []
    for i in unformatted_docs:
        dic = {}
        dic['id'] = i['_id']
        dic['headline'] = i['headline']['main']
        dic['date'] = i['pub_date'][0:10]  # cutting time of day.

        # YOUR CODE HERE

        formatted.append(dic)
        
    return formatted

In [None]:
# run this cell to test your function
all_formatted = format_articles(all_docs)
all_formatted[:5]

## 5. Exporting
*****

We can now export the data to a CSV.

In [None]:
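# use the keys of the first article as the CSV column names
keys = list(all_formatted[0].keys())
# write the header row, then one row per article
# (newline='' keeps the csv module from adding blank lines on Windows)
with open('all-formated.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_formatted)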
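
To see what `csv.DictWriter` actually produces without opening a file, we can write to an in-memory buffer instead. A small sketch with made-up rows in the same shape as the output of `format_articles`:

```python
import csv
import io

# Hypothetical rows in the same shape that format_articles() returns
rows = [
    {"id": "a1", "headline": "First headline", "date": "2015-01-02"},
    {"id": "b2", "headline": "Second, with a comma", "date": "2015-03-04"},
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)

# reading the text back recovers the original dictionaries
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
```

The headline containing a comma is wrapped in quotes automatically, which is one good reason to use the `csv` module rather than joining strings by hand.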

## Capstone Challenge

Using what you learned, tell me if Chris' claim (i.e. that Duke Ellington has gotten more popular lately) holds water.

In [None]:
# YOUR CODE HERE
