 # Web APIs

A growing number of organizations make data sets available on the web in a style called REST, which stands for REpresentational State Transfer. When REST is used, every data set is identified by a URL and can be accessed through a set of functions called an Application Programming Interface (API). 

**Topics Covered:**
- [Getting data from the Web](#Getting-data-from-the-Web)
- Examples
    * [Climate Data API](#Climate-Data-API)
    * [NYTimes API](#NYTimes-API)
    * [Twitter API](#Twitter-API)

---


**References:**
* [Working With Data on the Web](http://swcarpentry.github.io/web-data-python/01-getdata/)
* [Accessing Databases via Web APIs](https://github.com/Data-on-the-Mind/2017-summer-workshop/blob/master/hench-data-from-web/01-APIs/01-API_workbook.ipynb)
* [Data and Twitter](https://github.com/henchc/EDUC290B/blob/master/02-Data-and-Twitter.ipynb)

----


## Getting data from the Web

### How do GET Requests Work?  A Web browsing example

* Surfing the Web = Making a bunch of GET Requests

* For instance, I open my web browser and type in http://www.wikipedia.org. Once I hit return, I see a webpage

* Several different processes occurred, however, between me hitting "return" and the page finally being rendered



### STEP 1: The GET Request

* the web browser took the entered character string
* wrote a properly formatted HTTP GET request (the same kind of request that the command-line tool cURL produces)
* submitted it to the server that hosts the Wikipedia homepage
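
As a rough illustration of what "a properly formatted HTTP GET request" looks like, the sketch below assembles the request text by hand. The helper name `build_get_request` and the choice of headers are assumptions made for this example; real browsers send many more headers.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return the text of a minimal HTTP/1.1 GET request for `url` (illustrative only)."""
    parts = urlsplit(url)
    # An empty path means the site root
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {parts.netloc}",  # which site on the server we are asking for
        "Accept: */*",            # we will accept any content type
        "",                       # a blank line ends the headers
        "",
    ]
    return "\r\n".join(lines)


request_text = build_get_request("http://www.wikipedia.org")
print(request_text)
```

Nothing is sent over the network here; the function only shows the text that a client would transmit.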

---
### STEP 2: The Response

* Wikipedia's server received this request
* sent back an HTTP response
* from which the HTML code for the page was extracted

```{html}
[1] "<!DOCTYPE html>\n<html lang=\"mul\" dir=\"ltr\">\n<head>\n<!-- Sysops: Please do not edit the main template directly; update /temp and synchronise. -->\n<meta charset=\"utf-8\">\n<title>Wikipedia</title>\n<!--[if lt IE 7]><meta http-equiv=\"imagetoolbar\" content=\"no\"><![endif]-->\n<meta name=\"viewport\" content=\"i"
```
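
To make the extraction step concrete, here is a minimal sketch that separates an HTTP response into its status code, headers, and body. The `raw_response` string is a shortened, hypothetical response written for this example (it is not the actual Wikipedia reply), and `split_response` is a made-up helper name.

```python
# Hypothetical, heavily shortened HTTP response used only for illustration
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<!DOCTYPE html>\n<html lang=\"mul\" dir=\"ltr\">\n<head>\n"
)


def split_response(raw):
    """Split raw response text into (status_code, headers, body)."""
    # Headers and body are separated by the first blank line
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, status_code, _reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(status_code), headers, body


status_code, headers, body = split_response(raw_response)
print(status_code)
print(headers["Content-Type"])
print(body[:15])
```

Libraries such as `requests` do this work for us and expose the same pieces as attributes of the response object.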

---
### STEP 3: The Formatting

* raw HTML code was formatted and executed by the web browser
* rendering the page as seen in the window.

---

### Web Browsing as a Template for RESTful Database Querying

The process of web browsing described above is a close analogue for the process of database querying via RESTful APIs, with only a few adjustments:

1. While a tool such as cURL (or, in Python, the `requests` library) will still be used to send HTTP GET requests to the servers hosting our databases of interest, the character string that we supply must be constructed so that the resulting request can be interpreted and successfully acted upon by the server.  In particular, the character string will likely need to encode **search terms and/or filtering parameters**, as well as one or more **authentication codes**.  While the terms are often similar across APIs, most are API-specific.

2. Unlike with web browsing, the content of the server's response is unlikely to be HTML code.  Rather, it will likely be **a raw text response that can be parsed into one of a few file formats commonly used for data storage**.  The usual suspects include .csv, .xml, and .json files.

3. Whereas the web browser capably parsed and executed the HTML code, **one or more facilities in R, Python, or other programming languages will be necessary for parsing the server response and converting it into a format for local storage** (e.g. matrices, dataframes, databases, lists, etc.).
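
The third adjustment can be sketched in a few lines of Python. The `raw_text` string below is a hypothetical JSON payload invented for this example (the field names `status`, `results`, `id`, and `title` are placeholders, not taken from any particular API); the point is only that raw text becomes ordinary Python dictionaries and lists once parsed.

```python
import json

# Hypothetical raw text, shaped like a typical JSON API response
raw_text = '{"status": "OK", "results": [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]}'

# Parse the text into nested Python objects (dicts and lists)
parsed = json.loads(raw_text)

# Pull out the list of records and one field from each record
records = parsed["results"]
titles = [record["title"] for record in records]

print(parsed["status"])
print(titles)
```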

----


Before we start, run this command:


In [None]:
!pip install requests

----

### Climate Data API
The Climate Data API provides programmatic access to most of the climate data used on the World Bank’s [Climate Change Knowledge Portal](http://sdwebx.worldbank.org/climateportal/). Check out the World Bank’s [Terms of Use](https://data.worldbank.org/summary-terms-of-use). According to the API’s home page, the data sets containing yearly averages for various values are identified by URLs of the form:

http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/var/year/iso3.ext

where:

* var is either pr (for precipitation) or tas (for “temperature at surface”);
* iso3 is the International Standards Organization (ISO) 3-letter code for a country, such as “CAN” for Canada or “BRA” for Brazil; and
* ext (short for “extension”) specifies the format we want the data in. There are several choices for format, but the simplest is comma-separated values (CSV), in which each record is a row, and the values in each row are separated by commas. (CSV is frequently used for spreadsheet data.)
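
The URL pattern above can be filled in programmatically. The sketch below is one way to do that; the function name `climate_url` and the default of CSV are choices made for this example.

```python
# URL template taken from the pattern described above
TEMPLATE = "http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/{var}/year/{iso3}.{ext}"


def climate_url(var, iso3, ext="csv"):
    """Build the URL for yearly averages of `var` in the country `iso3`."""
    if var not in ("pr", "tas"):
        raise ValueError("var must be 'pr' (precipitation) or 'tas' (temperature at surface)")
    return TEMPLATE.format(var=var, iso3=iso3, ext=ext)


# Yearly precipitation for Brazil, as CSV
print(climate_url("pr", "BRA"))
```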

For example, if we want the average annual temperature in Canada as a CSV file, the URL is:

http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv

If we paste that URL into a browser, it displays:
~~~
year,data
1901,-7.67241907119751
1902,-7.862711429595947
1903,-7.910782814025879
...
2007,-6.819293975830078
2008,-7.2008957862854
2009,-6.997011661529541
~~~
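
Text in this shape is straightforward to turn into Python values. The sketch below parses the first three rows shown above with the standard `csv` module; the variable names are arbitrary, and in practice the text would come from the server rather than from a string literal.

```python
import csv
import io

# The header and first three rows of the CSV shown above
sample = (
    "year,data\n"
    "1901,-7.67241907119751\n"
    "1902,-7.862711429595947\n"
    "1903,-7.910782814025879\n"
)

# DictReader uses the header row to name the fields in each record
reader = csv.DictReader(io.StringIO(sample))
rows = [(int(record["year"]), float(record["data"])) for record in reader]

# Find the warmest of the parsed years
warmest = max(rows, key=lambda row: row[1])

print(rows)
print("Warmest year in the sample:", warmest[0])
```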

This particular data set might be stored in a file on the World Bank’s server, or that server might:

1. Receive our URL.
2. Break it into pieces.
3. Extract the three key fields (the variable, the country code, and the desired format).
4. Fetch the desired data from a database.
5. Format the data as CSV.
6. Send that to our browser.

As long as the World Bank doesn’t change its URLs, we don’t need to know which method it’s using and it can switch back and forth between them without breaking our programs.
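
Steps 2 and 3 of the list above (breaking the URL into pieces and extracting the key fields) might look something like the following. This is only a guess at the general idea, not the World Bank's actual server code, and `parse_climate_url` is a name invented for this example.

```python
from urllib.parse import urlsplit


def parse_climate_url(url):
    """Extract the variable, country code, and format from a climate data URL."""
    # Break the path into its slash-separated segments
    segments = urlsplit(url).path.split("/")
    # The pattern ends with .../var/year/iso3.ext
    var = segments[-3]
    iso3, ext = segments[-1].rsplit(".", 1)
    return {"var": var, "iso3": iso3, "ext": ext}


fields = parse_climate_url(
    "http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv"
)
print(fields)
```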

In [None]:
# Import the requests library
import requests
# Define the URL for the data we want
# (we could also pass this string directly as the argument to requests.get)
url = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv'
# Send the GET request and assign the response to an object
response = requests.get(url)
# A status code of 200 means the request succeeded
if response.status_code != 200:
    print('Failed to get data:', response.status_code)
else:
    print('First 100 characters of data are')
    # The data sent back by the web server is stored in the response's text attribute
    print(response.text[:100])

----

### NYTimes API

How Much Less Popular is Duke Ellington?

If you ask a jazz musician who they feel is the greatest bandleader of all time, there's a pretty good chance they'll mention Duke Ellington. Though Ellington was at peak popularity from roughly 1930 to 1945, his music is still heard regularly.

TASK: Characterize the popularity of Duke Ellington over the past 15 years. Specifically, is he "trending"?

### STEP 1: Finding Data Resources

To determine the popularity of something, we need a measurement of how frequently or widely it is referenced or encountered.  Moreover, to determine how this popularity changes over time, we need a measurement that is taken repeatedly.

Newspapers are an excellent source of such information.  The frequency with which an item appears in a newspaper's pages can be a decent metric of that item's popularity, and the paper's continual publication creates a built-in time series.  While there are many newspapers to choose from, we'll be working with the New York Times for several reasons, including its status as a paper of record, its long publishing history, and (most importantly) its convenient article API.

[NYT Article API](http://developer.nytimes.com/)

### STEP 2: Getting API Access

For most APIs, a key or other user credentials are required for any database querying.  Generally, this requires that you register with the organization.  Most APIs are set up for developers, so you'll likely be asked to register an "application".  All this really entails is coming up with a name for your app/bot/project, and providing your real name, organization, and email.  Note that some more popular APIs (e.g. Twitter, Facebook) will require additional information, such as a web address or mobile number.

Once you've successfully registered, you will be assigned one or more keys, tokens, or other credentials that must be supplied to the server as part of any API call you make.  To make sure that users aren't abusing their data access privileges (e.g. by making many rapid queries), each set of keys will be given several **rate limits** governing the total number of calls that can be made over certain intervals of time.  For the NYT Article API, we have relatively generous rate limits --- 10 calls per second and 10,000 calls per day.
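
One simple way to stay under a per-second rate limit is to enforce a minimum gap between calls. The sketch below is a minimal illustration of that idea, not part of any API client; the class name `Throttle` and its parameters are assumptions made for this example, and it does not track the daily limit.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_calls_per_second
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# At most 10 calls per second, matching the per-second limit mentioned above
throttle = Throttle(max_calls_per_second=10)
for call_number in range(3):
    throttle.wait()  # an API call would go right after this line
    print("call", call_number)
```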

[NYT Article API Keys](http://developer.nytimes.com/apps/mykeys)

### STEP 3: Learning how to Construct API GET Requests

Likely the most challenging part of using web APIs is learning how to format your GET request URLs.  While there are common architectures for such URLs, each API has its own unique quirks.  For this reason, carefully reviewing the API documentation is critical.

Fortunately, the NYT Article API is [very well documented](http://developer.nytimes.com/docs/read/article_search_api_v2)!

----
Most GET request URLs for API querying have three or four components:

1. *Base URL*: a link stub that will be at the beginning of all calls to a given API; points the server to the location of an entire database

2. *Search Parameters*: a character string appended to a base URL that tells the server what to extract from the database; basically a series of filters used to point to specific parts of a database

3. *Authentication Key/Token*: a user-specific character string appended to a base URL telling the server who is making the query; allows servers to efficiently manage database access

4. *Response Format*: a character string indicating how the response should be formatted; usually one of .csv, .json, or .xml
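
To see how these components fit together into a single URL, the sketch below assembles one by hand with the standard library. `YOUR_API_KEY` is a placeholder rather than a working key, and the variable names simply mirror the component names above.

```python
from urllib.parse import urlencode

# 1. Base URL
base_url = "http://api.nytimes.com/svc/search/v2/articlesearch"

# 4. Response format
response_format = ".json"

# 2. Search parameters and 3. authentication key (placeholder value)
search_params = {"q": "Duke Ellington", "api-key": "YOUR_API_KEY"}

# urlencode escapes the values and joins them as key=value pairs separated by "&"
full_url = base_url + response_format + "?" + urlencode(search_params)
print(full_url)
```

Passing the dictionary as `params=` to `requests.get` performs the same encoding automatically.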

In [None]:
# Import required libraries
import requests  # to make the GET request 
import json  # to parse the JSON response to a Python dictionary
import time  # to pause after each API call
import csv  # to write our data to a CSV
import pandas  # to see our CSV


# Step 1: Construct the GET request (a base URL for the API, an authorization key, and a format for the response)
# Set key. Use the following demonstration key for now, but in the future, get your own
key="be8992a420bfd16cf65e8757f77a5403:8:44644296"

# set base url
base_url="http://api.nytimes.com/svc/search/v2/articlesearch"

# set response format
response_format=".json"

# set search parameters
search_params = {"q":"Duke Ellington",
                 "api-key":key}   
r = requests.get(base_url+response_format, params=search_params) #response object called r

#Uncomment the following line to see what it will print
#print(r.url)
#Click on the link to see what happens: 
#http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Duke+Ellington&api-key=be8992a420bfd16cf65e8757f77a5403%3A8%3A44644296



# Inspect the content of the response, parsing the result as text
response_text= r.text
#Uncomment the following line to see what it will print
#print(response_text[:1000])

# Convert JSON response to a dictionary
data = json.loads(response_text)
#Uncomment the following line to see what it will print
# data

#Commands to work with json data

#Print the status
print(data['status'])

# Put the list of returned documents in a variable
docs = data['response']['docs']
docs[0]

In [None]:
# Import required libraries

import time
from random import randint
import requests
import json
import math  # "/" is already true division in Python 3, so no __future__ import is needed
import csv
import matplotlib.pyplot as plt


# DEFINE YOUR FUNCTION HERE
# set key
key="be8992a420bfd16cf65e8757f77a5403:8:44644296"

def get_api_data(term, year):
    # set base url
    base_url="http://api.nytimes.com/svc/search/v2/articlesearch"

    # set response format
    response_format=".json"

    # set search parameters
    search_params = {"q":term,
                 "api-key":key,
                 "begin_date": str(year) + "0101", # date must be in YYYYMMDD format
                 "end_date":str(year) + "1231"}

    # make request
    r = requests.get(base_url+response_format, params=search_params)
    
    # convert to a dictionary
    data=json.loads(r.text)
    
    # get number of hits
    hits = data['response']['meta']['hits']
    print("number of hits:", str(hits))
    
    # get number of pages
    pages = int(math.ceil(hits/10))
    
    # make an empty list where we'll hold all of our docs for every page
    all_docs = [] 
    
    # now we're ready to loop through the pages
    for i in range(pages):
        print("collecting page", str(i))
        
        # set the page parameter
        search_params['page'] = i
        
        # make request
        r = requests.get(base_url+response_format, params=search_params)
    
        # get text and convert to a dictionary
        data=json.loads(r.text)
        
        # get just the docs
        docs = data['response']['docs']
        
        # add those docs to the big list
        all_docs.extend(docs)
        
        time.sleep(randint(3,5))  # pause between calls
        
    return all_docs
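
The page count in `get_api_data` comes from dividing the number of hits by the 10 documents the function assumes per page and rounding up. A minimal sketch of that arithmetic in isolation (the helper name `pages_needed` is made up for this example):

```python
import math


def pages_needed(hits, per_page=10):
    """Number of pages required to hold `hits` results at `per_page` results each."""
    return math.ceil(hits / per_page)


for hits in (0, 10, 25):
    print(hits, "hits ->", pages_needed(hits), "pages")
```

Rounding up matters because a partly filled final page still has to be requested: 25 hits need three pages, not two.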

In [None]:
get_api_data("Duke Ellington", 2014)

----

### Twitter API

The [Twitter API](https://dev.twitter.com/overview/api) is slightly more complicated, but because of this, people have created very useful tools for interacting with it. First, follow these directions to get your API credentials.

1. [Create a Twitter account](https://twitter.com).  You can use an existing account if you have one.
2. Under account settings, add your phone number to the account.
3. [Create a Twitter developer account](https://dev.twitter.com/resources/signup).  Attach it to your Twitter account.
4. Once you're logged into your developer account, [create an application for this course](https://apps.twitter.com/app/new).  You can call it whatever you want, and you can write any URL when it asks for a web site.
5. On the page for that application, find your Consumer Key and Consumer Secret.
6. On the same page, create an Access Token.  Record the resulting Access Token and Access Token Secret.
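
Once you have recorded the four values, one common option is to keep them out of the notebook and read them from environment variables. The sketch below assumes hypothetical variable names (`TWITTER_CONSUMER_KEY` and so on); nothing about these names is required by Twitter or Tweepy.

```python
import os

# Hypothetical environment variable names, chosen for this example
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any are missing."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in CREDENTIAL_VARS}


# Demonstration with placeholder values instead of real credentials
demo = load_credentials({name: "placeholder" for name in CREDENTIAL_VARS})
print(sorted(demo))
```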


In [None]:
!pip install tweepy

In [None]:
# Import required libraries
import tweepy  # This helps us access Twitter data.
import matplotlib.pyplot as plt  # This is for plotting

# Twitter API credentials
# Note that these credentials are for demonstration 
consumer_key = "IjI8AdEUOlzif3J0qgt6bw9JI"
consumer_secret = "gZLhygPv5uBCWIQVr6sZjCgYVfcXGzGuTNl7oOapYmWazdLEm6"
access_key = "278661116-qdUru3GXVYT9upGH0cgbwROu4KzypMSwQgknMNW2"
access_secret = "RMZY9H7vvbuHq9jFZO4fdtw5cBPZUlbLDhEwU9zir6LyG"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

# Establish a query using the word 'Berkeley'
# (Tweepy 4.0 renamed api.search to api.search_tweets; on older versions use api.search)
results = tweepy.Cursor(
    api.search_tweets,
    q='Berkeley', # query, any word you want found in a tweet
    result_type = 'popular'
    ).items(20)

# define an empty list called results_tweets
results_tweets = []

# Iterate over the first tweets in `results` and add each of those tweets to results_tweets
for t in results:
    results_tweets.append(t) 


#print the time of the first tweet
print("The time of the first tweet")
print(results_tweets[0].created_at)

# Collect the retweet count of each tweet
retweet_counts = []
for t in results_tweets:
    retweet_counts.append(t.retweet_count)
    
#Plot retweet counts    
plt.hist(retweet_counts)
plt.show()

