# Assignment 7: Web APIs


**Due**: November 3 at 4pm
* * * * *

In [1]:
# Import required libraries
from __future__ import division  # __future__ imports must come first in a script
import requests
from urllib import quote_plus
import json
import math
import csv

## 1: API Keys

Get an API key from the [NYT Developer](http://developer.nytimes.com/apps/mykeys) website. Set your key in the variable given below.

In [2]:
# set key
key = "ef9055ba947dd842effe0ecf5e338af9:15:72340235"

## 2. Requesting Data

### 2.1 Edit the `get_api_data` function

In the cell below, I've given you the code from lecture that defines a function which accepts a search term (a string) and returns all articles mentioning that term in 2014.

Edit this code so that it accepts a second argument, `year` (an integer), and returns all the articles mentioning the search term in that year.

In [3]:
# MAKE A FUNCTION HERE

def get_api_data(term, year):
    # set base url
    base_url="http://api.nytimes.com/svc/search/v2/articlesearch"

    # set response format
    response_format=".json"

    # set search parameters
    search_params = {"q":term,
                 "api-key":key,
                 "begin_date": str(year) + "0101", # date must be in YYYYMMDD format
                 "end_date":str(year) + "1231"}

    # make request
    r = requests.get(base_url+response_format, params=search_params)
    
    # convert to a dictionary
    data=json.loads(r.text)
    
    # get number of hits
    hits = data['response']['meta']['hits']
    print "number of hits: " + str(hits)
    
    # get number of pages
    pages = int(math.ceil(hits/10))
    
    # make an empty list where we'll hold all of our docs for every page
    all_docs = [] 
    
    # now we're ready to loop through the pages
    for i in range(pages):
        print "collecting page " + str(i)
        
        # set the page parameter
        search_params['page'] = i
        
        # make request
        r = requests.get(base_url+response_format, params=search_params)
    
        # get text and convert to a dictionary
        data=json.loads(r.text)
        
        # get just the docs
        docs = data['response']['docs']
        
        # add those docs to the big list
        all_docs = all_docs + docs
        
    return(all_docs)

In [4]:
# uncomment to test
# get_api_data("Duke Ellington", 2014)
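The paging arithmetic and date parameters inside `get_api_data` can be checked without calling the API. The sketch below assumes 10 results per page, as the function above does; `page_count` and `year_bounds` are hypothetical helper names, not part of the lecture code.

```python
import math


def page_count(hits, per_page=10):
    """Number of result pages needed to cover `hits` results."""
    # Dividing by a float keeps this correct on Python 2 even without
    # `from __future__ import division`.
    return int(math.ceil(hits / float(per_page)))


def year_bounds(year):
    """Return (begin_date, end_date) strings in YYYYMMDD format for one year."""
    return str(year) + "0101", str(year) + "1231"


print(page_count(77))      # 8, i.e. pages 0 through 7
print(year_bounds(2014))
```

Because `range(pages)` starts at 0, a result set of 77 hits is collected as pages 0 through 7, which matches the log printed by the loop.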

### 2.2 Create list of years

In class, we collected all the articles mentioning Duke Ellington in 2014. Now we want to search over multiple years. Create a list called `years` that contains all the years from 2005 to 2014, inclusive.

In [5]:
years = range(2005, 2015)
years

[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]

### 2.3 Collect data over multiple years

Using the function you made in 2.1, loop over the list `years`, collecting data on articles that mention "Duke Ellington". Store the data for all years in an object called `all_duke`.

In [6]:
all_duke = []
for i in years:
    all_duke.extend(get_api_data("Duke Ellington", i))

number of hits: 77
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
number of hits: 101
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
collecting page 8
collecting page 9
collecting page 10
number of hits: 111
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
collecting page 8
collecting page 9
collecting page 10
collecting page 11
number of hits: 99
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
collecting page 8
collecting page 9
number of hits: 114
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
collecting page 8

In [7]:
# test your code
len(all_duke) == 1043

True
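The loop above can be exercised offline by passing the fetch function in as an argument. This is only a sketch: `collect_years` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `get_api_data` so that no request is made.

```python
def collect_years(fetch, term, years):
    """Call fetch(term, year) for each year and merge the results into one list."""
    all_docs = []
    for year in years:
        # extend() adds each document individually; append() would nest
        # one list per year inside all_docs instead.
        all_docs.extend(fetch(term, year))
    return all_docs


def fake_fetch(term, year):
    """Stand-in for get_api_data: returns two made-up documents per year."""
    return [{"_id": "%d-%d" % (year, n), "term": term} for n in range(2)]


docs = collect_years(fake_fetch, "Duke Ellington", range(2005, 2015))
print(len(docs))  # 10 years x 2 fake documents = 20
```

With the real function, `collect_years(get_api_data, "Duke Ellington", years)` would build the list the same way as the loop in the cell above.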

In [8]:
all_duke[0]

{u'_id': u'4fd2872b8eb7c8105d858553',
 u'abstract': None,
 u'blog': [],
 u'byline': {u'original': u'By N.R. Kleinfield',
  u'person': [{u'firstname': u'N.',
    u'lastname': u'Kleinfield',
    u'middlename': u'R.',
    u'organization': u'',
    u'rank': 1,
    u'role': u'reported'}]},
 u'document_type': u'article',
 u'headline': {u'kicker': u'New York Bookshelf',
  u'main': u'NEW YORK BOOKSHELF/NONFICTION'},
 u'keywords': [{u'name': u'persons', u'value': u'ELLINGTON, DUKE'},
  {u'name': u'persons', u'value': u'HARRIS, DANIEL'}],
 u'lead_paragraph': u"A WIDOW'S WALK: A Memoir of 9/11 By Marian Fontana Simon & Schuster ($24, hardcover) Theresa and I walk into the Blue Ribbon, an expensive, trendy restaurant on Fifth Avenue in Park Slope. We sit at a banquette in the middle of the room and read the eclectic menu, my eyes instinctively scanning the prices for the least expensive item.",
 u'multimedia': [],
 u'news_desk': u'The City Weekly Desk',
 u'print_page': u'9',
 u'pub_date': u'2005-1

## 3. Formatting and Exporting

### 3.1 Collect more fields

In the cell below, I've pasted the code from lecture defining a function that accepts a list of unformatted documents returned by the API and formats them into a clean list of dictionaries with keys for `id`, `headline`, and `date`.

Edit the function so that we include the `lead_paragraph` and `word_count` fields.

**HINT**: Some articles may not contain a `lead_paragraph`, in which case looking up this (nonexistent) value will throw an error. You need to add a conditional statement that takes this into consideration.

**HINT**: Add `.encode("utf8")` at the end of dictionary key lookups. You'll thank me later when you try to export your CSV.

**Advanced**: Add another key that returns a list of `keywords` associated with the article.
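For the missing-field hint and the advanced task, one possible approach is `dict.get`, which returns a default instead of raising `KeyError`. The sketch assumes each entry in `keywords` is a dictionary with a `value` key, as in the sample output of `all_duke[0]`; `safe_text` and `keyword_values` are hypothetical helper names.

```python
def safe_text(doc, field):
    """Return doc[field], or an empty string if the field is missing or None."""
    return doc.get(field) or ""


def keyword_values(doc):
    """Return the list of keyword strings attached to one article."""
    # `or []` also covers the case where the field is present but None.
    return [kw["value"] for kw in (doc.get("keywords") or [])]


# A trimmed-down document modeled on the sample output above.
sample = {
    "abstract": None,
    "keywords": [
        {"name": "persons", "value": "ELLINGTON, DUKE"},
        {"name": "persons", "value": "HARRIS, DANIEL"},
    ],
}

print(repr(safe_text(sample, "lead_paragraph")))  # missing key: empty string
print(repr(safe_text(sample, "abstract")))        # None value: empty string
print(keyword_values(sample))
```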

In [9]:
def format_articles(unformatted_docs):
    '''
    This function takes in a list of documents returned by the NYT api 
    and parses the documents into a list of formatted dictionaries, 
    with 'id', 'headline', 'date', 'lead_paragraph', and 'word_count' keys
    '''
    formatted = []
    for i in unformatted_docs:
        dic = {}
        dic['id'] = i['_id']
        dic['headline'] = i['headline']['main'].encode("utf8")
        dic['date'] = i['pub_date'][0:10] # cutting time of day.
        # some articles have no lead paragraph (missing key or None),
        # so fall back to an empty string to keep the keys consistent
        lead = i.get('lead_paragraph')
        dic['lead_paragraph'] = lead.encode("utf8") if lead else ''
        dic['word_count'] = i['word_count']
        formatted.append(dic)
    return formatted

### 3.2 Format `all_duke`

Using the function you made above, format the `all_duke` data. Store the result in an object called `all_duke_formatted`.

In [10]:
all_duke_formatted = format_articles(all_duke)

In [11]:
# test your code
all_duke_formatted[0]

{'date': u'2005-10-02',
 'headline': 'NEW YORK BOOKSHELF/NONFICTION',
 'id': u'4fd2872b8eb7c8105d858553',
 'lead_paragraph': "A WIDOW'S WALK: A Memoir of 9/11 By Marian Fontana Simon & Schuster ($24, hardcover) Theresa and I walk into the Blue Ribbon, an expensive, trendy restaurant on Fifth Avenue in Park Slope. We sit at a banquette in the middle of the room and read the eclectic menu, my eyes instinctively scanning the prices for the least expensive item.",
 'word_count': 629}

### 3.3 Export as CSV

Export the object `all_duke_formatted` into a CSV file.

In [12]:
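# use the keys of the first formatted article as the column names
keys = all_duke_formatted[0].keys()
# write the header row, then one row per article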
with open('allduke.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_duke_formatted)
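The cell above targets Python 2, where `csv` expects a file opened in binary mode. As a hedged sketch for Python 3 (an assumption, since the assignment code is Python 2), the file is instead opened in text mode with `newline=''`, and the column names are listed explicitly so that rows missing a field are filled with an empty string. The rows, the `fieldnames` list, and the temporary file path are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows for illustration; the second one has no lead_paragraph.
rows = [
    {"id": "a1", "headline": "First", "date": "2005-10-02",
     "lead_paragraph": "Some text.", "word_count": 629},
    {"id": "b2", "headline": "Second", "date": "2006-01-15",
     "word_count": 300},
]

fieldnames = ["id", "date", "headline", "lead_paragraph", "word_count"]
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# Python 3: text mode with newline='' (Python 2 would use 'wb' instead).
with open(path, "w", newline="") as output_file:
    writer = csv.DictWriter(output_file, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm what was written.
with open(path, newline="") as input_file:
    written = list(csv.DictReader(input_file))

print(written[1]["lead_paragraph"] == "")  # the missing field became an empty string
```

Listing `fieldnames` explicitly also fixes the column order, which a plain dictionary's keys do not guarantee on Python 2.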

## 4. Extra Credit / Bonus / Advanced / Optional

Import the data into R, and produce a graph that visualizes how Duke Ellington has changed in popularity over time.

See Assignment_7_R.R
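Before plotting, it can help to check the yearly article counts. The sketch below is in Python rather than R and is not the contents of `Assignment_7_R.R`; it assumes each formatted article has a `date` string beginning with the four-digit year, as produced by `format_articles`. `articles_per_year` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def articles_per_year(formatted_docs):
    """Count articles by the year at the start of each 'date' string."""
    return Counter(doc["date"][:4] for doc in formatted_docs)


# Made-up rows for illustration.
sample = [
    {"date": "2005-10-02"},
    {"date": "2005-11-20"},
    {"date": "2006-03-07"},
]

counts = articles_per_year(sample)
for year in sorted(counts):
    print("%s: %d" % (year, counts[year]))
```

The number of articles per year is only a rough proxy for popularity, but it is the quantity a simple line or bar chart of this data would show.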