# INFO 103: Introduction to data science <br> Demo \#3: APIs<br> Author: JRW
## Mission
In this workbook we're going to look at how APIs really work from a programming point of view, to gain insight into how they are used to build online applications.

1. Build our own API client for the Facebook Graph API to:
    - download user posts,
    - extract posted images, and
    - gather a stream of user comments.
2. Use a well-developed Twitter API client to:
    - download historical (famous) tweets by ID,
    - download a specific Twitter user's recent timeline of tweets, and
    - filter a live stream of tweets by key words and locations.
3. Use a well-developed Google API client to:
    - find geographic information from a street address,
    - find a street address from a latitude/longitude pair, and
    - find directions between two places by name.
4. Explore the APIs of our own local SEPTA!

In [19]:
import json
import os, re
from IPython.display import display
from PIL import Image
from io import BytesIO, StringIO
import urllib.request
from urllib.request import urlopen
import requests

## Quotas
Remember, APIs are not usually free. They will just about always come with a license, and the way most sites enforce a paywall is with a rate limit or quota.

#### Graph API limits
Facebook's limit is so hard to hit that you may never notice it (1 query/second), but they hold back the much larger trove of private data they have.

#### Twitter API limits
Twitter will let you see pretty much any of their data, but caps you at a 1% stream limit, or at 180 calls per 15-minute window if you are using the REST API.

#### Geocoding quotas
Users of the standard API:

* 2,500 free requests per day, calculated as the sum of client-side and server-side queries.
* 50 requests per second, calculated as the sum of client-side and server-side queries.

#### Directions quotas
Users of the standard API:

* 2,500 free directions requests per day, calculated as the sum of client-side and server-side queries.
* Up to 23 waypoints allowed in each request, whether client-side or server-side queries.
* 50 requests per second, calculated as the sum of client-side and server-side queries.
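
One simple way to stay under a quota like the ones above is to space calls out evenly. The following is a minimal sketch and not part of the original demo: the `Throttle` class and its parameters are hypothetical, and it only enforces a minimum gap between calls. For the Twitter REST limit of 180 calls per 15-minute window, that gap works out to 5 seconds.

```python
import time


class Throttle:
    """Space calls evenly so at most `max_calls` happen per `window_seconds`."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds that must separate two consecutive calls
        self.min_interval = window_seconds / max_calls
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# 180 calls per 15 minutes, as quoted for the Twitter REST API above
twitter_throttle = Throttle(max_calls=180, window_seconds=15 * 60)
print(twitter_throttle.min_interval)
```

Calling `twitter_throttle.wait()` immediately before each API request would then keep a loop of requests inside the window.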

## Facebook
#### What is the Facebook Graph API
Data from Facebook comes from the 'Graph' API, so named because they view their platform as a network, or 'graph'. The documentation for this API may be found at:

* https://developers.facebook.com/docs/graph-api

The Graph API allows you to access information about specific individuals, friends, posts, etc. Anyone with a Facebook account can access the Graph API, and there are other APIs, such as the Public Feed API:

* https://developers.facebook.com/docs/public_feed/

which provides streaming data, i.e., live emerging data. However, this API is really only available to a restricted set of users, so we will focus only on the Graph API. 

#### Getting a Facebook app ID

As mentioned, you can use the Graph API if you are on Facebook, but to do this you have to register as a developer. As usual, there are some helpful resources out there on Stack Overflow:

* http://stackoverflow.com/questions/3203649/where-can-i-find-my-facebook-application-id-and-secret-key

Here, I would say that the most helpful suggestion directs to the app registration page. Create an app:

* https://developers.facebook.com/apps

After you create an app, you will wind up on the app's development page. At the top of this page is your App ID. You will also need to get your app's secret code. This may be obtained by navigating to "settings" in the navigation on the left side of the app development page. Once there, you will have to click on the "show" button to see the secret code. Record both strings here:

In [1]:
APP_ID = "440014060125609"
APP_SECRET = "9ab6ed302f08aa3e6b6a19225de26ee6"

#### Building API requests as URL strings
API requests on both Facebook and Twitter are really just URLs. This makes sense, because whenever you look at a webpage you are actually just downloading its content. There are a lot of details on the Facebook Graph API, and we're just going to build one kind of request: the last $N$ public posts of a particular user. The following function creates a request URL from several inputs, notably the user's name (username) and the number of past messages to collect (limit). The `APP_ID` and `APP_SECRET` are both passed to this function, as well.

In [2]:
def createPostUrl(username, APP_ID, APP_SECRET, limit):
    post_args = "/feed?access_token=" + APP_ID + "|" + APP_SECRET + \
    "&fields=attachments,created_time,message&limit=" + str(limit)
    post_url = "https://graph.facebook.com/" + username + post_args
    return post_url
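
String concatenation works here because none of the values contain characters that need escaping. As a hedged alternative, here is a minimal sketch that lets the standard library build the query string. The name `createPostUrlEncoded` and the placeholder credentials are hypothetical and not part of the original demo.

```python
from urllib.parse import urlencode


def createPostUrlEncoded(username, app_id, app_secret, limit):
    """Build the same feed URL as createPostUrl(), letting urlencode escape the values."""
    params = {
        "access_token": app_id + "|" + app_secret,
        "fields": "attachments,created_time,message",
        "limit": limit,
    }
    # safe=",|" leaves the comma and pipe unescaped, matching the hand-built URL
    query = urlencode(params, safe=",|")
    return "https://graph.facebook.com/" + username + "/feed?" + query


# Placeholder credentials, for illustration only
example_url = createPostUrlEncoded("drexeluniv", "MY_APP_ID", "MY_APP_SECRET", 10)
print(example_url)
```

The benefit shows up once a value contains a space, an ampersand, or other reserved characters, which `urlencode` escapes automatically.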

#### Requesting the data behind a URL
This is the function that really does all of the work, relying on the globally-assigned `APP_ID` and `APP_SECRET`. This function runs the `createPostUrl()` function and then makes the HTTP request with the `urllib.request.urlopen()` function. The web response is read and appears as a string in JSON format, which is converted to a Python dictionary using the `json.loads()` function.

In [3]:
def getPosts(username, limit):
    post_url = createPostUrl(username, APP_ID, APP_SECRET, limit)
    web_response = urllib.request.urlopen(post_url)
    readable_page = web_response.read()
    return json.loads(readable_page)

#### Running the API function
Let's try this out and grab the last 10 posts made by Drexel University (`'drexeluniv'`).

In [None]:
import urllib.request
data = getPosts("drexeluniv", 10)

#### Inspecting the output
The resulting data object is a dictionary at the top level with two keys, `'paging'` and `'data'`. The value of `'data'` is what we're really looking for, and `'paging'` actually holds another post URL that helps us go even further back in time. In other words, we only asked for the 10 most recent posts, and if we want the ten before those, we just use the URL in `data['paging']`. Check it out:

In [None]:
data['paging']

#### The actual data
That's ugly, but it's really important if we want to go way back in time, and it's great that we don't have to build it. The actual data itself is under the `'data'` key, and is a list of the different posts. Let's look at the first (most recent) post:

In [None]:
data['data'][0]

#### What are the individual pieces of data we requested?
In addition to the post message, the URL requests we built include any attachments and the creation time. The creation time `'created_time'` is fairly straightforward, but the attachments include any images that were in the post. Here's the primary message, itself:

In [None]:
data['data'][0]['message']

#### What if we want to see the photo?
The attachments key has another dictionary as its value. Let's take a look:

In [None]:
data['data'][0]['attachments']

This dictionary has only one key, `'data'`, whose value is a list containing all of the meat. It's a list because the post may have multiple attachments! There's only one here, and it has a `'description'`, a `'title'`, and a `'url'` that points to the linked Drexel website and not the actual image. To get the actual image, we need to look at the `'media'` key, then `'image'`, and then `'src'`. Follow this link with your browser and you'll see the image that Drexel posted, which is of Berlin.

In [None]:
data['data'][0]['attachments']['data'][0]['media']['image']['src']

#### What if we want to download the image?
Well, technically navigating to the above URL does download the image, but if you want it saved on your computer, or in your Python workspace, you can once again use the `urllib.request.urlopen()` function. The image data that is downloaded is just a string of bytes, and can be written out to a file like text or anything else. It's a really big string, so don't try to print it. Instead, we should convert it to a Python image with `Image.open()`, and use the IPython `display()` function:

In [None]:
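from io import BytesIO  # the image arrives as bytes, so it needs BytesIO rather than StringIO
web_response = urllib.request.urlopen(data['data'][0]['attachments']['data'][0]['media']['image']['src'])
image_data = web_response.read()
image_object = Image.open(BytesIO(image_data))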

Running this will open the image in a window.

In [None]:
image_object.show()

And running this will place the display right here in the notebook

In [None]:
display(image_object)
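
Since the downloaded image is just bytes, saving it to disk is a single binary write. The following is a minimal sketch. The helper name `saveImageBytes`, the stand-in bytes, and the output path are all hypothetical; in the notebook, the bytes would come from `web_response.read()`.

```python
import os
import tempfile


def saveImageBytes(image_data, path):
    """Write raw downloaded bytes to disk; mode 'wb' avoids any text decoding."""
    with open(path, "wb") as f:
        f.write(image_data)
    # Report how many bytes ended up on disk
    return os.path.getsize(path)


# Stand-in bytes, for illustration only (not a complete image file)
example_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
out_path = os.path.join(tempfile.gettempdir(), "drexel_post_example.bin")
bytes_written = saveImageBytes(example_bytes, out_path)
print(bytes_written)
```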

## APIs usually handle many types of request
So far we have set up to be able to gather a stream of public posts going back in time. As organizations (like Drexel) post public updates, Facebook users will often comment, generating threaded discussion. Since these discussions are also public, we can access them. Let's look one post back so some comments will have had the chance to accumulate.

In [None]:
from io import BytesIO
web_response = urllib.request.urlopen(data['data'][1]['attachments']['data'][0]['media']['image']['src'])
image_data = web_response.read()
image_object = Image.open(BytesIO(image_data))
print(data['data'][1]['message'])  # same post ([1]) as the image above
display(image_object)

Hey, this is about that neon sign museum!

#### Facebook objects have unique identifiers
To request the comments associated with a post, we have to provide the post's unique identifier. Fortunately, this is provided!

In [None]:
print(data['data'][1]['id'])

#### Creating separate URL and request functions for comments
Sadly, our first API-access function won't do for this type of request. Instead we will have to include a place for post IDs and specifically build a comments query. Note that there is also the 'filter' option for the comments, which ensures that all comments are returned in chronological order. 

In [None]:
def getPostComments(POST_ID, limit):
    # createPostCommentsUrl() builds the comments request URL for one post ID
    comments_url = createPostCommentsUrl(POST_ID, APP_ID, APP_SECRET, limit)
    web_response = urllib.request.urlopen(comments_url)
    readable_page = web_response.read()
    return json.loads(readable_page)
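
The helper `createPostCommentsUrl()` called above is not shown in this notebook. Here is a minimal sketch of what it might look like. It assumes the helper mirrors `createPostUrl()`, targets the post's `/comments` edge, and passes `filter=stream` as the 'filter' option for chronological comments. The placeholder post ID and credentials are made up.

```python
def createPostCommentsUrl(POST_ID, APP_ID, APP_SECRET, limit):
    # Assumed layout: same access token format as createPostUrl(),
    # but the path points at the comments of a single post
    comments_args = "/comments?access_token=" + APP_ID + "|" + APP_SECRET + \
    "&filter=stream&limit=" + str(limit)
    comments_url = "https://graph.facebook.com/" + POST_ID + comments_args
    return comments_url


# Placeholder values, for illustration only
example_comments_url = createPostCommentsUrl("1234_5678", "MY_APP_ID", "MY_APP_SECRET", 10)
print(example_comments_url)
```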

#### Requesting comments
Here, we will request the comments from the second most recent post made by Drexel. Once again, there is paging information and the data. And since we've requested multiple comments, the data comes back as a list. Let's loop through the comments and print them out along with their `'created_time'`.

In [None]:
comments_data = getPostComments(data['data'][1]['id'], 10)
print("This post currently has "+str(len(comments_data['data']))+" comments. Here's what we got:\n")
for comment in comments_data['data']:
    print(comment["created_time"])
    print(comment["message"])
    print("")

#### The data emerge oldest to newest
Note that all of these comments are from a few days ago and are getting newer. This is the reverse order of the posts feed, where we have to go back in time! Also, it appears people were more immediately interested in the move of firestone. My favorite comment is the 'glowing reviews' comment for the museum...
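
Both the posts feed and the comments come back with `'paging'` information. Here is a minimal sketch of following it to collect several pages. It assumes the paging dictionary carries the URL of the following page under a `'next'` key, and it takes the download function as an argument so the demo can run on made-up pages without touching the network. The names `iterPages` and `fetch_json` are hypothetical.

```python
def iterPages(first_page, fetch_json, max_pages=3):
    """Yield up to max_pages pages, following each page's 'next' URL."""
    page = first_page
    for _ in range(max_pages):
        yield page
        # Assumption: the following page's URL lives at page['paging']['next']
        next_url = page.get("paging", {}).get("next")
        if not next_url:
            break
        page = fetch_json(next_url)


# Made-up pages standing in for real API responses
fake_pages = {
    "url-2": {"data": ["post 3", "post 4"], "paging": {"next": "url-3"}},
    "url-3": {"data": ["post 5"], "paging": {}},
}
first = {"data": ["post 1", "post 2"], "paging": {"next": "url-2"}}
all_posts = [post for page in iterPages(first, fake_pages.get) for post in page["data"]]
print(all_posts)
```

With real data, `fetch_json` could be a small function that calls `urllib.request.urlopen()` on the URL and passes the result through `json.loads()`, as `getPosts()` does.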

## Twitter 
What if we want data from another source? Twitter has a similar API and actually makes much more of its data available than Facebook. However, it doesn't always have to be as difficult as constructing your very own API request URLs. In fact, Python has several clients (modules) for downloading data from Twitter that make API access very easy! Here, we'll use `tweepy`. Since this is just a client, be aware that it may have limited functionality. So if you want to see everything that the API can do, check out the full documentation. However, like we were doing with the Facebook API, this may require building your own URLs.

* https://dev.twitter.com/docs

Just like with Facebook, you'll have to get API access keys, which (from [Stack Overflow](http://stackoverflow.com/questions/1808855/getting-new-twitter-api-consumer-and-secret-keys)) involves:

1. Have a Twitter account.
2. Go to https://apps.twitter.com and sign in.
3. Create an app (fill out the form).
4. Go to the API keys section and click generate ACCESS TOKEN.

Note that the resulting keys are referred to as:

* 'oauth_access_token' means **Access token**
* 'oauth_access_token_secret' means **Access token secret**
* 'consumer_key' means **API key**
* 'consumer_secret' means **API secret**


To get `tweepy`, just go to a command line and enter:

```
pip install tweepy
```

tweepy is pretty well documented, too:

* http://docs.tweepy.org/en/v3.5.0/index.html
* https://github.com/tweepy/tweepy


#### Getting started
First things first, we will need to import the necessary modules and enter our access keys.

In [None]:
import tweepy
import json

consumer_key="bGSPlAoZzbCFQfeQhxNmfj1cp"
consumer_secret="LR2DMvd3LffMIjYFPTQnlp036PgKlVEcn1rFqcWUWDEy2rFH2p"
access_token="227267417-D2IEgEgeUerDvbem0Of75nATQwbIiBXJDDJoVvVM"
access_token_secret="jxjiSsZfl2WJUgquSA7voZfEpoAJtbuP4vP28btCsYpbS"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

#### The REST API
The REST API allows you to access historical data and to manage your account. A handy mnemonic is that this data is 'resting', although REST actually stands for Representational State Transfer. This means looking up tweets by ID, and also following/unfollowing other accounts, et cetera. With tweepy, we first have to initialize a REST API instance.

In [None]:
rest = tweepy.API(auth)

#### Downloading some old tweets
To get some old tweets we will need a list of tweet IDs.

Note: gathering the list of tweet IDs required going into the source html. After next week, we could write a web scraper to pull this out for us!

In [None]:
idlist = [
    "1121915133", 
    "64780730286358528", 
    "64877790624886784", 
    "20", 
    "467192528878329856", 
    "474971393852182528",
    "475071400466972672",
    "475121451511844864",
    "440322224407314432",
    "266031293945503744",
    "3109544383",
    "1895942068",
    "839088619",
    "8062317551",
    "232348380431544320",
    "286910551899127808",
    "286948264236945408",
    "27418932143",
    "786571964",
    "467896522714017792",
    "290892494152028160",
    "470571408896962560"
]
data = {id_: "" for id_ in idlist}
tweets = rest.statuses_lookup(id_=idlist, include_entities=True)

#### What does a tweet look like?
The resulting status objects have a lot of extra structure to them, but a Python dictionary of Twitter's raw format may be accessed through the `._json` attribute of the object. Let's look at the keys.

In [None]:
print(tweets[0]._json.keys())

The most important thing here is the `'text'`, but there's lots of other good stuff too. Let's look at all of the returned tweets in order. Unfortunately, since the lookup returns them out of order, we will have to fix it.

In [None]:
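# Index the returned tweets by ID, then print them in the order of idlist
for tweet in tweets:
    data[str(tweet._json['id'])] = tweet._json
for ix, id_ in enumerate(idlist):
    if data[id_]:  # IDs the lookup did not return are still ""
        print(str(ix+1)+": "+data[id_]['text'])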
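
The same reordering idea can be tried out without any API access. Here is a minimal sketch on made-up records. The function name `orderById` and the sample data are hypothetical, but the approach (index by ID, then walk the wanted IDs in order) is the one used in the cell above.

```python
def orderById(idlist, records):
    """Return records in the order of idlist, skipping IDs that were not returned."""
    # Tweet IDs arrive as integers, while our ID list holds strings
    by_id = {str(record["id"]): record for record in records}
    return [by_id[id_] for id_ in idlist if id_ in by_id]


# Made-up records: returned out of order, and one wanted ID ("3") is missing
wanted = ["20", "3", "7"]
returned = [{"id": 7, "text": "seventh"}, {"id": 20, "text": "twentieth"}]
ordered_texts = [record["text"] for record in orderById(wanted, returned)]
print(ordered_texts)
```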

#### Getting a user's timeline
Now, we can also follow a specific user easily with tweepy. Let's get the last 10 tweets from Drexel (`drexeluniv`).

In [None]:
timeline = rest.user_timeline(screen_name = "drexeluniv", count = 10)
for tweet in timeline:
    print(tweet._json["text"])

#### The streaming API
So far we've only accessed the REST API for old tweets. Twitter is neat because it also makes its streaming API available to the public (at 1% bandwidth). Here's some more advanced tweepy code that allows us to download `N` immediately recent tweets from the stream using keyword and geolocation filters.

In [None]:
class StdOutListener(tweepy.streaming.StreamListener):
    """ A listener handles tweets that are received from the stream.
    This listener collects N tweets, storing them in memory, and then stops.
    """
    def __init__(self, N):
        super(StdOutListener, self).__init__()
        self.data = []
        self.N = N
    def on_data(self, data):
        self.data.append(json.loads(data))
        if len(self.data) >= self.N:
            return False
        else:
            return True

    def on_error(self, status):
        print(status)

In [None]:
def getNtweets(N, auth, track = [], locations = []):
    listener = StdOutListener(N)
    stream = tweepy.Stream(auth, listener)
    if len(track) and len(locations):
        stream.filter(track=track, locations = locations)
    elif len(track):
        stream.filter(track = track)
    elif len(locations):
        stream.filter(locations = locations)

    return listener.data

In [None]:
dataScienceTweets = getNtweets(10, auth, track=['datascience'])
for tweet in dataScienceTweets:
    print(tweet['text'])

#### Geolocation data
As mentioned above, we can also use the streaming API to filter data by location. Let's look at 10 recent tweets from Philadelphia! To do this, we will have to get a lat/lon bounding box for Philadelphia. I got these numbers from

* https://github.com/amyxzhang/boundingbox-cities/blob/master/boundbox.txt

but as we will see below, we could gather this data from Google's API. Note: the lat/lon order for a location box is `[lon1,lat1,lon2,lat2]`. Also note that because there are fewer tweets coming from such a small box, this will take a bit longer to run for the 10 tweets!

In [None]:
bbox = [-75.280327, 39.864841, -74.941788, 40.154541]
phillyTweets = getNtweets(10, auth, locations=bbox)
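
Because the longitude-first ordering is easy to get backwards, a quick check can help. Here is a minimal sketch that tests whether a point falls inside a box given in the `[lon1,lat1,lon2,lat2]` order, with the southwest corner first. The function name `inBoundingBox` and the two test points (approximate coordinates) are my own additions.

```python
def inBoundingBox(lon, lat, bbox):
    """bbox is [lon1, lat1, lon2, lat2]: southwest corner first, then northeast."""
    lon1, lat1, lon2, lat2 = bbox
    return lon1 <= lon <= lon2 and lat1 <= lat <= lat2


philly_bbox = [-75.280327, 39.864841, -74.941788, 40.154541]
in_center_city = inBoundingBox(-75.1635, 39.9526, philly_bbox)  # roughly Center City
in_new_york = inBoundingBox(-74.0060, 40.7128, philly_bbox)     # roughly New York City
print(in_center_city, in_new_york)
```

If the first check comes back `False` for a point that should be inside, the latitude and longitude have most likely been swapped.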

In [None]:
phillyTweets[0].keys()

In [None]:
for tweet in phillyTweets:
    place = tweet.get('place') or {}  # 'place' can be missing or None
    print(place.get('full_name', '(no place given)'))
    print(tweet['text'])
    print("")

## Google
Google has APIs for lots of stuff. This includes all of the geographic features of maps, the linguistic features of translate, and even YouTube data, since Google bought them in 2006 for \$1.65 billion. Here, we're just going to go forward and use a client that provides the geographic services. As usual, you will have to have a Google account for this. The steps are then:

1. Get a Google account.
2. Get an API key: https://developers.google.com/places/web-service/get-api-key
3. Go to the developer's console https://developers.google.com/console
4. Enable the specific APIs of interest: https://support.google.com/cloud/answer/6158841?hl=en

#### The python client
Here, we're going to use a nice Python client for the maps services called googlemaps. We can install this easily from the command line with pip, once again:

```
pip install -U googlemaps
```

For more information, be sure to check out their project documentation:

* https://github.com/googlemaps/google-maps-services-python


#### Load the client and set up your API instance

In [15]:
import googlemaps
from datetime import datetime

GOOGLE_API_KEY = "AIzaSyDFverCdXIh3_z7QdxMKIHhjfUxU1oavsc"

gmaps = googlemaps.Client(key=GOOGLE_API_KEY)

#### Get the geocoding for Rush and City halls

In [None]:
rushHall = gmaps.geocode('30 N. 33rd Street, Philadelphia, PA')
print(rushHall)

#### A bounding box for Rush hall!
There's lots of information here about the building, but relating back to our Twitter API experiment, notice how we can actually get a bounding box for the building&mdash;this means we could download all of the tweets appearing from this building!

In [None]:
print(rushHall[0]['geometry']['viewport'])
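
To reuse this viewport with the streaming filter from the Twitter section, it has to be flattened into the `[lon1,lat1,lon2,lat2]` order. Here is a minimal sketch, assuming the viewport is a dictionary with `'southwest'` and `'northeast'` corners that each hold `'lat'` and `'lng'`. The function name `viewportToBbox` and the example numbers are made up.

```python
def viewportToBbox(viewport):
    """Convert a geocoding 'viewport' dict into [lon1, lat1, lon2, lat2]."""
    sw = viewport["southwest"]
    ne = viewport["northeast"]
    # Longitude comes first in the streaming filter's location boxes
    return [sw["lng"], sw["lat"], ne["lng"], ne["lat"]]


# Made-up viewport values, for illustration only
example_viewport = {
    "northeast": {"lat": 39.958, "lng": -75.188},
    "southwest": {"lat": 39.955, "lng": -75.191},
}
example_bbox = viewportToBbox(example_viewport)
print(example_bbox)
```

In the notebook, `viewportToBbox(rushHall[0]['geometry']['viewport'])` could then be passed as `locations` to `getNtweets()`.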

#### Reverse lookup
Note that we can also get the address of a location by lat/lon lookup! Let's see if we can pull the Rush hall address back out of the API.

In [None]:
# Look up an address with reverse geocoding
lat = rushHall[0]['geometry']['location']['lat']
lng = rushHall[0]['geometry']['location']['lng']
reverseLookup = gmaps.reverse_geocode((lat, lng))
for component in reverseLookup[0]['address_components']:
    print(component['long_name'])

#### Directions to city hall
Google is great for driving directions and we can use the API for this, too!

In [None]:
cityHall = gmaps.geocode('1401 John F Kennedy Blvd, Philadelphia, PA')

# Request driving directions
now = datetime.now()
directions_result = gmaps.directions(
    "30 N. 33rd Street, Philadelphia, PA",
    "Philadelphia City Hall",
    mode="driving",
    departure_time=now
)
print(directions_result[0].keys())

#### What's the result?
Once again, there's a lot of information here. Besides a list of lat/lon pairs for the directions (so you can make a map), there is also a text list of HTML directions under the `'legs'` key.

In [None]:
print("It's a "+directions_result[0]['legs'][0]['distance']['text']+" drive, total:\n")
for stepnum, step in enumerate(directions_result[0]['legs'][0]['steps'], start=1):
    # a raw string avoids the invalid "\/" escape; this strips only the <b> tags
    print(str(stepnum)+") "+re.sub(r"</?b>", "", step['html_instructions']))
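
The pattern above removes only bold tags. If the instructions contain other markup, a slightly broader cleanup may be useful. Here is a minimal sketch using only the standard library. The function name `stripTags` and the sample instruction are made up, and a regular expression like this is a rough tool and not a full HTML parser.

```python
import html
import re


def stripTags(html_text):
    """Remove anything that looks like an HTML tag, then decode entities."""
    without_tags = re.sub(r"<[^>]+>", "", html_text)
    return html.unescape(without_tags)


# A made-up instruction in the same style as 'html_instructions'
cleaned = stripTags("Turn <b>left</b> onto <b>Market St</b> &amp; continue")
print(cleaned)
```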

## A more local example of an API
The Southeastern Pennsylvania Transportation Authority (SEPTA) [makes a few APIs available](http://www3.septa.org/hackathon/). Some of these APIs can be used to access realtime data about SEPTA transit (trains, buses, trolleys). For example, we can request data about the next trains to arrive at a given station.
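
The Arrivals endpoint takes the station name and the number of trains as path segments of the URL. Station names contain spaces, and while `requests` will escape these for us, it can be clearer to build the URL explicitly. Here is a minimal sketch; the helper name `arrivalsUrl` is hypothetical and not part of SEPTA's documentation.

```python
from urllib.parse import quote


def arrivalsUrl(station_name, number_of_trains):
    """Build an Arrivals request URL, percent-encoding the station name."""
    base = "http://www3.septa.org/hackathon/Arrivals/"
    return base + quote(station_name) + "/" + str(number_of_trains)


example_arrivals_url = arrivalsUrl("30th Street Station", 5)
print(example_arrivals_url)
```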

In [23]:
# format: "http://www3.septa.org/hackathon/Arrivals/*STATION_NAME*/*NUMBER_OF_TRAINS*"
arrivals_response = requests.get("http://www3.septa.org/hackathon/Arrivals/30th Street Station/5")

arrivals_dict = arrivals_response.json()
arrivals_dict

{'30th Street Station Departures: April 13, 2022, 5:29 pm': [{'Northbound': [{'direction': 'N',
     'path': 'R0N',
     'train_id': '1084',
     'origin': 'Cynwyd',
     'destination': 'Suburban Sta',
     'line': 'Manayunk/Norristown',
     'status': 'On Time',
     'service_type': 'LOCAL',
     'next_station': '30th St',
     'sched_time': '2022-04-13 17:31:01.000',
     'depart_time': '2022-04-13 17:32:00.000',
     'track': '1',
     'track_change': None,
     'platform': '',
     'platform_change': None},
    {'direction': 'N',
     'path': 'R3N',
     'train_id': '6336',
     'origin': '30th Street Station',
     'destination': 'West Trenton',
     'line': 'West Trenton',
     'status': 'On Time',
     'service_type': 'EXP TO JENKINTOWN',
     'next_station': None,
     'sched_time': '2022-04-13 17:36:01.000',
     'depart_time': '2022-04-13 17:37:00.000',
     'track': '2',
     'track_change': None,
     'platform': '',
     'platform_change': None},
    {'direction': 'N',
   

In [None]:
#Make a request to the SEPTA Arrivals API to get data on the next 10 trains to arrive at Suburban Station.

In [24]:
import requests
from pprint import pprint

response = requests.get("http://www3.septa.org/hackathon/Arrivals/Suburban Station/10")

data = response.json()
top_keys = list(data.keys())
# pprint(data[top_keys[0]][0]["Northbound"])

trains = []
for timestamp in data: ## timestamp is the sole key at the top level of response
    for outbound_direction in data[timestamp]: ## each track direction gets its own dictionary
        for direction in outbound_direction:
            for train in outbound_direction[direction]:
                trains.append({
                    'direction': train['direction'],
                    'line': train['line'],
                    'sched_time': train['sched_time'],
                    'status': train['status'],
                    'track': train['track']
                })

pprint(trains)

[{'direction': 'N',
  'line': 'Warminster',
  'sched_time': '2022-04-13 17:30:00.000',
  'status': '2 min',
  'track': '2'},
 {'direction': 'N',
  'line': 'Fox Chase',
  'sched_time': '2022-04-13 17:34:00.000',
  'status': '1 min',
  'track': '1'},
 {'direction': 'N',
  'line': 'Manayunk/Norristown',
  'sched_time': '2022-04-13 17:36:00.000',
  'status': 'On Time',
  'track': '6'},
 {'direction': 'N',
  'line': 'West Trenton',
  'sched_time': '2022-04-13 17:41:00.000',
  'status': 'On Time',
  'track': '2'},
 {'direction': 'N',
  'line': 'Paoli/Thorndale',
  'sched_time': '2022-04-13 17:44:00.000',
  'status': 'On Time',
  'track': '1'},
 {'direction': 'N',
  'line': 'Trenton',
  'sched_time': '2022-04-13 17:51:00.000',
  'status': '2 min',
  'track': '2'},
 {'direction': 'N',
  'line': 'Lansdale/Doylestown',
  'sched_time': '2022-04-13 17:52:00.000',
  'status': 'On Time',
  'track': '1'},
 {'direction': 'N',
  'line': 'Media/Elwyn',
  'sched_time': '2022-04-13 17:56:00.000',
  'statu