# Working with Data APIs

**Sam Maurer // maurer@berkeley.edu // Oct. 3, 2016**

This notebook provides a demonstration of data-access APIs that operate over the web.

In Part 1, we'll load and parse data from an automated USGS feed of earthquakes. In Part 2, we'll add query parameters to the workflow, using the Google Maps Geocoding API as an example. In Part 3, we'll use authenticated APIs to access (public) Twitter data.

# Part 1: Reading from an automated data feed

### USGS real-time earthquake feeds

This is an API for near-real-time data on earthquakes. Results are provided in JSON format over the web. No authentication is needed. Rather than accepting queries, the API has a separate endpoint for each combination of magnitude threshold and time window that users might want.

**API documentation:**  
http://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php

**Sample API endpoint, for magnitude 4.5+ earthquakes in past day:**  
http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_day.geojson  


In [None]:
%matplotlib inline

import pandas as pd
import urllib
import json

In [None]:
# use endpoint for magnitude 2.5+ quakes in past week
endpoint_url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_week.geojson"

# open a connection to the URL
connection = urllib.urlopen(endpoint_url)

# download the results
results = connection.read()

print results[:500]  # first 500 characters
print type(results)

In [None]:
# the results are a string with JSON-formatted data inside

# parse the string into a Python data structure
data = json.loads(results)

print data['features'][0]  # first item from the array called 'features'
print type(data)
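
To make the shape of the parsed data easier to see, here is a minimal sketch that uses a hand-written sample instead of a live download. The sample string and all of its values are invented for illustration; only the key names (`features`, `properties`, `geometry`, `coordinates`) follow the feed used above. It is written so that it should run under Python 3 as well as the Python 2 used in this notebook.

```python
import json

# Hand-written sample mimicking the feed's structure (values are invented)
sample = '''
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"mag": 4.6, "title": "M 4.6 - sample event A"},
     "geometry": {"type": "Point", "coordinates": [-122.3, 37.9, 8.5]}},
    {"type": "Feature",
     "properties": {"mag": 2.9, "title": "M 2.9 - sample event B"},
     "geometry": {"type": "Point", "coordinates": [142.1, 38.3, 35.0]}}
  ]
}
'''

sample_data = json.loads(sample)

# JSON objects become dicts, JSON arrays become lists
first = sample_data['features'][0]

# each point is stored as [longitude, latitude, depth]
lon, lat, depth = first['geometry']['coordinates']

print(first['properties']['title'])
print('longitude %s, latitude %s, depth %s km' % (lon, lat, depth))
```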

In [None]:
# pull out the event descriptions

for quake in data['features']:
    print quake['properties']['title']

In [None]:
# pull out magnitudes and depths into a pandas dataframe

# first, set up a dictionary of empty lists
d = {'magnitude': [], 'depth': []}

# loop through the earthquakes and pull out datapoints
for quake in data['features']:
    d['magnitude'].append(quake['properties']['mag'])
    d['depth'].append(quake['geometry']['coordinates'][2])

# then load it all into a dataframe
df = pd.DataFrame.from_dict(d)

print len(df)
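
The loop above assumes every record has a magnitude. If a record ever arrived with a missing value (JSON `null`, which parses to `None`), it would end up in the dataframe as a gap. A minimal sketch of skipping such records, using hand-written sample records with invented values; the names `sample_features` and `rows` are only for illustration:

```python
# Hand-written records in the same shape as data['features'] (values invented)
sample_features = [
    {'properties': {'mag': 4.6}, 'geometry': {'coordinates': [-122.3, 37.9, 8.5]}},
    {'properties': {'mag': None}, 'geometry': {'coordinates': [142.1, 38.3, 35.0]}},
    {'properties': {'mag': 2.9}, 'geometry': {'coordinates': [-70.2, -33.1, 110.2]}},
]

rows = {'magnitude': [], 'depth': []}

for quake in sample_features:
    mag = quake['properties'].get('mag')
    if mag is None:
        continue  # skip records with no magnitude
    rows['magnitude'].append(mag)
    rows['depth'].append(quake['geometry']['coordinates'][2])

print(rows)
```

Keeping both lists the same length matters, because the dataframe is built column by column from them.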

In [None]:
print df.head()

In [None]:
print df.describe()

In [None]:
# plot the depth vs. magnitude

df.plot(x='magnitude', y='depth', kind='scatter')

In [None]:
# save dataframe to disk

df.to_csv('usgs_earthquake_data.csv')

print 'file saved'

In [None]:
# read it back later

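# (pd.DataFrame.from_csv is deprecated; read_csv with index_col=0
# restores the saved index column)
new_df = pd.read_csv('usgs_earthquake_data.csv', index_col=0)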

print new_df.head()

# Part 2: Querying an API endpoint

### Google Maps Geocoding API

Google Maps has several APIs for getting search results programmatically. This one looks up latitude-longitude coordinates (and other place information) for street addresses, a process called geocoding.

It works similarly to the earthquakes example, with query parameters added to the URL endpoint.

**API documentation:**  
https://developers.google.com/maps/documentation/geocoding/intro

**API endpoint:**  
https://maps.googleapis.com/maps/api/geocode/json

**API endpoint with query parameters:**  
https://maps.googleapis.com/maps/api/geocode/json?address=Wurster+Hall

In [None]:
# we have to encode the search query so that it can be passed as a URL,
# with spaces and other special characters escaped

endpoint = 'https://maps.googleapis.com/maps/api/geocode/json'

params = {
    'address': 'Wurster Hall, Berkeley, CA',
}

url = endpoint + '?' + urllib.urlencode(params)
print url
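
To see what the encoding step does on its own, here is a small sketch that runs without a network connection. The `try`/`except` import is only there so the same cell should work under both Python 2 (as used in this notebook) and Python 3, where the function lives in `urllib.parse`.

```python
try:
    from urllib import urlencode        # Python 2, as used in this notebook
except ImportError:
    from urllib.parse import urlencode  # Python 3

# spaces become '+', and the commas are percent-encoded as '%2C'
query = urlencode({'address': 'Wurster Hall, Berkeley, CA'})

print(query)
```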

In [None]:
# open a connection to the URL
connection = urllib.urlopen(url)

# download and parse the results
results = json.loads(connection.read())

print results

In [None]:
# pull out the formatted addresses

for item in results['results']:
    print item['formatted_address']
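
The response also carries a top-level `status` field, and each result includes coordinates under `geometry` and `location`. The sketch below checks the status before reading the coordinates. It uses a hand-written sample response with approximate, illustrative values, and `first_location` is a hypothetical helper, not part of any library. Note that current versions of the API also expect a `key` parameter containing an API key, so a request without one may come back with a status other than `OK`.

```python
import json

# Hand-written response in the shape described by the API documentation
# (the address and coordinates are approximate, for illustration only)
sample_response = json.loads('''
{
  "status": "OK",
  "results": [
    {"formatted_address": "Wurster Hall, Berkeley, CA 94720, USA",
     "geometry": {"location": {"lat": 37.8707, "lng": -122.2548}}}
  ]
}
''')

def first_location(response):
    """Return (lat, lng) of the first result, or None if there is none."""
    if response.get('status') != 'OK' or not response.get('results'):
        return None
    location = response['results'][0]['geometry']['location']
    return (location['lat'], location['lng'])

print(first_location(sample_response))
print(first_location({'status': 'ZERO_RESULTS', 'results': []}))
```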

# Part 3: Querying an API with authentication

### Twitter REST and Streaming APIs

Twitter's APIs also operate over the web, but they require a back-and-forth authentication process (OAuth) at the beginning of a connection. It's easier to have a Python library handle this than to create the query URLs ourselves.

The REST APIs perform stand-alone operations: we submit a query and receive results, as in the earlier examples. The Streaming API continues sending results in real time until we disconnect.

(REST is a set of principles describing how data transactions should work over the web, while the actual communication protocol is called HTTP. Web pages work through HTTP and REST too, but the browser steps in to interpret and display the content for you.)

**API documentation:**  
https://dev.twitter.com/rest/public  
https://dev.twitter.com/streaming/overview

**Documentation for third-party Python "wrapper"**:  
https://github.com/geduldig/TwitterAPI

In [None]:
from TwitterAPI import TwitterAPI

In [None]:
# import API credentials from keys.py file in the
# same directory as this notebook

from keys import *

In [None]:
# set up an API connection using credentials from the keys file

api = TwitterAPI(consumer_key, consumer_secret, 
                 access_token, access_token_secret)

print "Connection is set up but not tested"
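
The `keys.py` file itself is not shown in this notebook. A minimal sketch of what it might contain, assuming it simply defines the four variable names used above as strings; the values here are placeholders, to be replaced with your own credentials:

```python
# keys.py (hypothetical contents; substitute your own credentials)
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'
```

Keeping credentials in a separate file makes it easier to share the notebook without sharing the keys.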

### Making a simple data request

In [None]:
# Most recent tweet from @GBoeing's timeline

endpoint = 'statuses/user_timeline'
params = {
    'screen_name': 'gboeing', 
    'count': 1
}
r = api.request(endpoint, params)

for tweet in r.get_iterator():
    print tweet['text']

In [None]:
# What other data is there?

print tweet.keys()

In [None]:
# Contents of some additional fields...
# Here are the definitions: https://dev.twitter.com/overview/api/tweets

for tweet in r.get_iterator():
    print "Tweet      // ", tweet['text']
    print "Timestamp  // ", tweet['created_at']
    print "Retweets   // ", tweet['retweet_count']
    print "Favorites  // ", tweet['favorite_count']
    print "Geotag     // ", tweet['coordinates']
    print "Language   // ", tweet['lang']
    print "User       // ", tweet['user']['screen_name']
    print "Followers  // ", tweet['user']['followers_count']
    print "Profile    // ", tweet['user']['description']
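
The `created_at` field is a plain string, so it has to be parsed before it can be sorted or plotted as a time. A minimal sketch, using a made-up timestamp in the format Twitter returns and matching the UTC offset literally as `+0000`:

```python
from datetime import datetime

# Hypothetical timestamp in the format used for 'created_at'
created_at = 'Mon Oct 03 17:45:12 +0000 2016'

# The offset is matched literally as '+0000' (UTC)
timestamp = datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')

print(timestamp.isoformat())
```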

### Other API endpoints allow different types of searches

In [None]:
# Search for public tweets about #muni

endpoint = 'search/tweets'
params = {
    'q': '#muni', 
    'count': 5
}
r = api.request(endpoint, params)

for tweet in r.get_iterator():
    print tweet['text'] + '\n'

In [None]:
# Search for public tweets in Hindi

endpoint = 'search/tweets'
params = {
    'q': '*', 
    'lang': 'hi', 
    'count': 5
} 
r = api.request(endpoint, params)

for tweet in r.get_iterator():
    print tweet['text'] + '\n'

In [None]:
# Search for public tweets geotagged near the UC Berkeley campus

endpoint = 'search/tweets'
params = {
    'q': '*', 
    'geocode': '37.873,-122.260,0.5km', 
    'count': 5
} 
r = api.request(endpoint, params)

for tweet in r.get_iterator():
    print tweet['text'] + '\n'
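
The `geocode` value is a single string made of latitude, longitude, and a radius with its unit. When searching around several locations, it can be convenient to build that string from numbers. A small sketch, where `geocode_param` and `search_params` are hypothetical names introduced only for this example:

```python
def geocode_param(latitude, longitude, radius_km):
    """Build a 'latitude,longitude,radius' string (hypothetical helper)."""
    return '%s,%s,%skm' % (latitude, longitude, radius_km)

search_params = {
    'q': '*',
    'geocode': geocode_param(37.873, -122.26, 0.5),
    'count': 5
}

print(search_params['geocode'])
```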

### Exercise

1. Try some different search queries!
2. Display some more data fields in addition to the tweet text
3. Advanced: can you figure out how to use the API to *post* a tweet?

Here's the search documentation: https://dev.twitter.com/rest/reference/get/search/tweets




### Streaming live tweets in real time 

In [None]:
# Twitter limits simultaneous connections to the streaming API,
# so this part may not work using the demo API keys during class

endpoint = 'statuses/filter'
params = {'locations': '-180,-90,180,90'}
r = api.request(endpoint, params)

# 'enumerate' lets us count tweets as we receive them

for i, tweet in enumerate(r.get_iterator()):
    print tweet['created_at']
    print tweet['place']['full_name'] + ', ' + tweet['place']['country']
    print tweet['text'] + '\n'
    if (i > 20): break

r.close()  # close streaming connection
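
The `locations` value above is a bounding box covering the whole globe: longitude and latitude of the southwest corner, then of the northeast corner. A sketch of building a narrower box, where `bounding_box` and `stream_params` are hypothetical names and the coordinates are only a rough outline of the San Francisco Bay Area:

```python
def bounding_box(west, south, east, north):
    """Format a bounding box as 'west,south,east,north' (hypothetical helper)."""
    return ','.join(str(value) for value in (west, south, east, north))

# Rough box around the San Francisco Bay Area (approximate values)
stream_params = {'locations': bounding_box(-122.75, 36.8, -121.75, 37.8)}

print(stream_params['locations'])
```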

### Loading tweets into a dataframe

In [None]:
# first, save some tweets to a list instead of just printing them

r = api.request(endpoint, params)
tweets = []

for i, tweet in enumerate(r.get_iterator()):
    if (i >= 500): break
    tweets.append(tweet)

r.close()
print len(tweets)
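
An alternative to counting with `enumerate` and `break` is `itertools.islice`, which takes a fixed number of items from any iterator. The sketch below uses a stand-in generator, `fake_stream`, in place of `r.get_iterator()` so that it runs without a connection; with a real stream you would still call `r.close()` afterwards.

```python
from itertools import islice

def fake_stream():
    """Stand-in for r.get_iterator(): yields tweet-like dicts forever."""
    i = 0
    while True:
        yield {'text': 'tweet number %d' % i}
        i += 1

# Take exactly 5 items, then stop pulling from the stream
collected = list(islice(fake_stream(), 5))

print(len(collected))
print(collected[-1]['text'])
```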

In [None]:
# the raw data is very messy though!

print tweets[0:5]

In [None]:
# we'll pull out some pieces into a dataframe

# first, set up a dictionary of empty lists
d = {'place': [], 'latitude': [], 'longitude': []}

for t in tweets:
    try:
        # coordinates are stored as [longitude, latitude]
        lon, lat = t['coordinates']['coordinates']
        name = t['place']['name']

    except (KeyError, TypeError):
        # a field is missing or empty (None), so skip to the next tweet
        continue

    # all fields exist, so save the data
    d['place'].append(name)
    d['latitude'].append(lat)
    d['longitude'].append(lon)

# load it into a dataframe
df = pd.DataFrame.from_dict(d)

print len(df)

In [None]:
print df.head()

In [None]:
print df.sort_values('place').head(12)

In [None]:
df.plot(x='longitude', y='latitude', kind='scatter')

In [None]:
# Note that when working with text strings that include characters from 
# other alphabets, you need to keep track of the text encoding.

# Some interesting related reading:
# - http://www.joelonsoftware.com/articles/Unicode.html

df.to_csv('saved_coords.csv', encoding='utf-8')
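
To illustrate why the encoding matters, here is a small sketch showing that the number of characters in a string and the number of bytes needed to store it can differ. The sample word is written with escape sequences so that the code itself stays plain ASCII.

```python
# 'namaste' in Devanagari script, written with escapes so the
# source itself stays plain ASCII
text = u'\u0928\u092e\u0938\u094d\u0924\u0947'

encoded = text.encode('utf-8')     # text -> bytes
decoded = encoded.decode('utf-8')  # bytes -> text

print(len(text))        # number of characters (code points)
print(len(encoded))     # number of bytes in UTF-8
print(decoded == text)  # decoding with the same encoding is lossless
```

Reading the bytes back with a different encoding than the one used to write them is what produces garbled text.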

### Exercise for the remainder of class

Choose one:

1. Using one of the APIs from this demo, save and graph a different aspect of the data.

2. Or, search the web for another API that provides data you're interested in. Can you figure out how to connect to it using Python code?

Two common terms for describing APIs that operate over the web are "HTTP" and "REST". The data format they most often provide is JSON, but with some code modifications you can parse other formats as well.
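
As one sketch of parsing another format, the example below reads CSV text with Python's built-in `csv` module. The sample text is hand-written with invented values and stands in for whatever an API might return; it is not taken from a real endpoint.

```python
import csv

# Hand-written CSV text standing in for an API response (values invented)
sample_csv = """place,mag,depth
Sample event A,4.6,8.5
Sample event B,2.9,35.0"""

# DictReader uses the header row for keys; every value arrives as a string
records = list(csv.DictReader(sample_csv.splitlines()))

for row in records:
    print('%s: magnitude %.1f' % (row['place'], float(row['mag'])))
```

Unlike JSON, CSV carries no type information, so numeric fields need an explicit conversion such as `float`.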