## Getting Data From the Interwebs

In many cases, the data you'll work with as a data engineer will be company data that is accessible via internal systems. In other cases, though, you'll have to venture out into the wild to find your own data. We did this in a small way during the web scraping labs. In this lab, we'll talk about how to get data from more structured sources.

Public APIs are a great way of pulling data directly from other websites and data sources. Once you figure out how to handle the output of an API, you can build datasets out of almost anything!

A few examples of the possibilities:
- Connect to the FRED (Federal Reserve Economic Data) API and pull economic indicators for time series analysis
- Connect to the Twitter API and pull tweets on a certain topic (like your company/product/etc) to analyze public sentiment
- Connect to the SkyScanner API to analyze flight prices
- Connect to Data.gov and build a student loan data set

Public data is a great way to build a portfolio of projects that people can relate to. Company-specific analysis will be your bread and butter in any job, but public data projects are a great way to share your skills with the world. Todd Schneider's [taxi data analysis](https://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/) is a perfect example of this. Give that piece a read before getting started.

In [1]:
#import pandas, requests, json, and seaborn

import pandas as pd
import requests
import json
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## For our first API - let's use [FRED](https://research.stlouisfed.org/docs/api/fred/index.html). 
This is a common data source used in financial and economic analysis. Take a spin through their documentation, specifically the [observations](https://research.stlouisfed.org/docs/api/fred/series_observations.html) section before moving on.

The first thing you'll need to do is register for an account and request an API key. You can do this [here](https://research.stlouisfed.org/docs/api/api_key.html) in about 1 minute.

Second, we'll need to pick a data series to pull. For starters, let's pull unemployment data for Delaware. Find this on the FRED website and take note of the unique ID for the series. We'll need to specify it in our API call.


In [2]:
#save your API key as a variable called 'mykey'

mykey = '5ffe1b83bb27dae2eb5ec54bd6add76a'
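A key pasted straight into a notebook travels with every copy of that notebook. One common alternative is to read it from an environment variable. The sketch below assumes a variable named `FRED_API_KEY`; both that name and the `load_api_key` helper are made up for illustration and are not part of FRED or this lab.

```python
import os


def load_api_key(env_var="FRED_API_KEY"):
    """Read an API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a readable message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage, once the variable is set in your shell:
# mykey = load_api_key()
```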

In [3]:
#save the series_id as a variable called 'unemployment_id'

unemployment_id = 'DEUR'

Calling an API is a lot like web scraping: the basics are the same. You're simply making a web call and recording the call's response.

Take a look at FRED's example call below:

https://api.stlouisfed.org/fred/series/observations?series_id=GNPCA&api_key=abcdefghijklmnopqrstuvwxyz123456&file_type=json

In [4]:
#create a base url to call (this should end at '/observations?')

base = 'https://api.stlouisfed.org/fred/series/observations?'

In [5]:
#reconstruct the example URL, using your API key and series_id

unemp_url = base + 'series_id=' + unemployment_id\
+ '&' + 'api_key=' + mykey + '&' + 'file_type=json'

In [6]:
#show the URL you created to confirm it is formatted as you'd expect

unemp_url

'https://api.stlouisfed.org/fred/series/observations?series_id=DEUR&api_key=5ffe1b83bb27dae2eb5ec54bd6add76a&file_type=json'
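String concatenation works, but it gets fiddly as the number of parameters grows. As a rough alternative, the standard library's `urlencode` can assemble the query string from a dict. The `build_url` helper below is a hypothetical name, and the key is the placeholder from FRED's example call, not a real one.

```python
from urllib.parse import urlencode

base = "https://api.stlouisfed.org/fred/series/observations?"


def build_url(series_id, api_key):
    """Build an observations URL from its query parameters."""
    query = urlencode(
        {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    )
    return base + query


# Rebuild FRED's example call with its placeholder key
example_url = build_url("GNPCA", "abcdefghijklmnopqrstuvwxyz123456")
print(example_url)
```

`requests.get()` also accepts a `params` dict and does the same encoding for you, so you can pass the URL without the query string and let requests append it.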

In [7]:
#use requests.get() to call the URL you constructed

response = requests.get(unemp_url)

In [8]:
#check the status code of the response to confirm your call succeeded

response.status_code

200

## Bonus:

What is the status_code for a successful web call?
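If you want to explore status codes beyond the one returned above, the standard library's `http.HTTPStatus` knows the standard phrases. This is a small sketch; `describe_status` is an illustrative helper, and treating the 2xx range as success is the usual HTTP convention rather than anything specific to FRED.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code, its standard phrase, and whether it is in the 2xx range."""
    phrase = HTTPStatus(code).phrase
    outcome = "success" if 200 <= code < 300 else "failure"
    return f"{code} {phrase} ({outcome})"


for code in (200, 404, 429, 500):
    print(describe_status(code))
```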

In [9]:
#read the response and see what it looks like

response.content

b'{"realtime_start":"2020-02-17","realtime_end":"2020-02-17","observation_start":"1600-01-01","observation_end":"9999-12-31","units":"lin","output_type":1,"file_type":"json","order_by":"observation_date","sort_order":"asc","count":528,"offset":0,"limit":100000,"observations":[{"realtime_start":"2020-02-17","realtime_end":"2020-02-17","date":"1976-01-01","value":"7.7"},{"realtime_start":"2020-02-17","realtime_end":"2020-02-17","date":"1976-02-01","value":"7.7"},{"realtime_start":"2020-02-17","realtime_end":"2020-02-17","date":"1976-03-01","value":"7.700"},{"realtime_start":"2020-02-17","realtime_end":"2020-02-17","date":"1976-04-01","value":"8.0"},{"realtime_start":"2020-02-17","realtime_end":"2020-02-17","date":"1976-05-01","value":"8.4"},{"realtime_start":"2020-02-17","realtime_end":"2020-02-17","date":"1976-06-01","value":"8.8"},{"realtime_start":"2020-02-17","realtime_end":"2020-02-17","date":"1976-07-01","value":"9.200"},{"realtime_start":"2020-02-17","realtime_end":"2020-02-17","d

In [10]:
#save the content of the response

content = response.content

We specified the file_type as json because it's nice and easy to read this way. Using the json library, figure out how to load the response as a dictionary.

In [11]:
#load the content of the response as a dict

response_dict = json.loads(content)
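To see what `json.loads` gives you without calling the API, here is a sketch on a hand-trimmed payload. The `sample_content` bytes are a shortened stand-in modeled on the response shown earlier, not a real API response.

```python
import json

# Shortened stand-in for response.content (bytes, like the real thing)
sample_content = (
    b'{"count": 2, "observations": ['
    b'{"date": "1976-01-01", "value": "7.7"}, '
    b'{"date": "1976-02-01", "value": "7.7"}]}'
)

sample_dict = json.loads(sample_content)

# Top-level keys, then the list of observation records
print(list(sample_dict.keys()))
first = sample_dict["observations"][0]
print(first)

# Note that the value arrives as a string, not a number
print(type(first["value"]))
```

The nested `observations` list is the part that maps cleanly onto DataFrame rows.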

In [12]:
#load the observations from the dict into a DataFrame

unemp_df = pd.DataFrame(response_dict['observations'])

In [13]:
#show the df

unemp_df

| | realtime_start | realtime_end | date | value |
|---|---|---|---|---|
| 0 | 2020-02-17 | 2020-02-17 | 1976-01-01 | 7.7 |
| 1 | 2020-02-17 | 2020-02-17 | 1976-02-01 | 7.7 |
| 2 | 2020-02-17 | 2020-02-17 | 1976-03-01 | 7.700 |
| 3 | 2020-02-17 | 2020-02-17 | 1976-04-01 | 8.0 |
| 4 | 2020-02-17 | 2020-02-17 | 1976-05-01 | 8.4 |
| ... | ... | ... | ... | ... |
| 523 | 2020-02-17 | 2020-02-17 | 2019-08-01 | 3.4 |
| 524 | 2020-02-17 | 2020-02-17 | 2019-09-01 | 3.5 |
| 525 | 2020-02-17 | 2020-02-17 | 2019-10-01 | 3.7 |
| 526 | 2020-02-17 | 2020-02-17 | 2019-11-01 | 3.8 |
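Every field in the JSON is a string, so the `date` and `value` columns will most likely need converting before any time series work. A minimal sketch on a tiny hand-made frame follows. The third row's `"."` is an assumed placeholder for a missing observation, included only to show what `errors="coerce"` does with it.

```python
import pandas as pd

# Hand-made stand-in for the observations DataFrame (strings throughout)
sample = pd.DataFrame(
    {
        "date": ["1976-01-01", "1976-02-01", "1976-03-01"],
        "value": ["7.7", "7.7", "."],  # "." is an assumed missing-value marker
    }
)

clean = sample.assign(
    date=pd.to_datetime(sample["date"]),
    # Anything that can't be parsed as a number becomes NaN
    value=pd.to_numeric(sample["value"], errors="coerce"),
)

print(clean.dtypes)
```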


Now let's wrap all the steps above into a reusable function, and use it to pull in another data series.

Your function should have one input and should:
- Call the API
- Check if the response succeeded
    - Print if it fails
- Read the content as JSON
- Load the JSON into a dict
- Return a DataFrame from the dict

In [14]:
#create your function here

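def get_data(url):
    #call the API
    response = requests.get(url)
    if response.status_code == 200:
        #parse the JSON body and load the observations into a DataFrame
        response_dict = json.loads(response.content)
        data = pd.DataFrame(response_dict['observations'])
    else:
        print('Sorry - this call failed. Please check your URL.')
        #return None so a failed call doesn't raise an UnboundLocalError
        data = None

    return data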
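For comparison, here is a slightly more defensive sketch of the same steps. It is one possible variant, not the lab's reference solution: `fetch_observations` is a hypothetical name, it takes the series ID and key rather than a full URL, and the 10-second timeout is an arbitrary choice.

```python
import pandas as pd
import requests

FRED_OBSERVATIONS_URL = "https://api.stlouisfed.org/fred/series/observations"


def fetch_observations(series_id, api_key, timeout=10):
    """Return a series' observations as a DataFrame, or None if the call fails."""
    params = {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    try:
        response = requests.get(
            FRED_OBSERVATIONS_URL, params=params, timeout=timeout
        )
        # Raises requests.HTTPError for 4xx and 5xx responses
        response.raise_for_status()
    except requests.RequestException as err:
        print(f"Sorry - the call for {series_id} failed: {err}")
        return None
    # response.json() parses the body, replacing the explicit json.loads step
    return pd.DataFrame(response.json()["observations"])


# Usage (needs a real key and network access):
# unemp_df = fetch_observations("DEUR", mykey)
```

Catching `requests.RequestException` covers connection errors and timeouts as well as bad status codes, so the caller only has to check for `None`.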

Now let's pull another data set. Go find the series_id for Labor Force Participation rate in DE.

Then:
- Create a fresh URL to hit
- Pass that URL to your new function
- Inspect the results

In [15]:
#save the series id as labor_id

labor_id = 'LBSSA10'

In [16]:
#create a new URL to hit that incorporates the new series_id

participation_url = base + 'series_id=' + labor_id\
+ '&' + 'api_key=' + mykey + '&' + 'file_type=json'

In [17]:
#create a df by passing the URL to the function you created 

particip_data = get_data(participation_url)

In [18]:
#show the df

particip_data

| | realtime_start | realtime_end | date | value |
|---|---|---|---|---|
| 0 | 2020-02-17 | 2020-02-17 | 1976-01-01 | 63.0 |
| 1 | 2020-02-17 | 2020-02-17 | 1976-02-01 | 62.8 |
| 2 | 2020-02-17 | 2020-02-17 | 1976-03-01 | 62.8 |
| 3 | 2020-02-17 | 2020-02-17 | 1976-04-01 | 62.8 |
| 4 | 2020-02-17 | 2020-02-17 | 1976-05-01 | 62.8 |
| ... | ... | ... | ... | ... |
| 523 | 2020-02-17 | 2020-02-17 | 2019-08-01 | 62.5 |
| 524 | 2020-02-17 | 2020-02-17 | 2019-09-01 | 62.4 |
| 525 | 2020-02-17 | 2020-02-17 | 2019-10-01 | 62.4 |
| 526 | 2020-02-17 | 2020-02-17 | 2019-11-01 | 62.5 |


## Now that we've done this the long way . . .

We did the above to illustrate how getting data from APIs often goes. You'll usually need to:
- Get an API key
- Read the API documentation a bit to understand how to call it
- Test it a few times and read the responses manually to get comfortable with it
- Take a peek at the site's robots.txt file if you're going to pull lots of data (these files live at the root of the site: https://fred.stlouisfed.org/robots.txt)
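Reading robots.txt by eye is fine, but the standard library can also check a path against the rules for you. The sketch below parses a made-up robots.txt so it runs offline. The rules and the example.com URLs are hypothetical and say nothing about what FRED actually allows.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only
sample_robots = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_robots)

# Ask whether a generic crawler ("*") may fetch each URL
allowed = parser.can_fetch("*", "https://example.com/data/series")
blocked = parser.can_fetch("*", "https://example.com/private/report")
print(allowed, blocked)
```

For a live site, you would call `parser.set_url(...)` with the robots.txt address and then `parser.read()` instead of `parse()`.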


Now that we've done this manually, it's worth mentioning that many popular APIs have pre-built wrappers that make accessing the API much easier. These wrappers abstract away much of the boilerplate code we wrote above (constructing the URL, checking the response, reading the content, loading it, etc).

For FRED, a very common one is **fredapi**, which can be installed via pip. 

Go install it before continuing.

In [19]:
#from fredapi import Fred

from fredapi import Fred

In [20]:
#set your api_key

fred = Fred(api_key=mykey)

In [21]:
#pull DE unemployment the easy way

easy_response = fred.get_series('DEUR')

In [22]:
#show the response

easy_response

1976-01-01    7.7
1976-02-01    7.7
1976-03-01    7.7
1976-04-01    8.0
1976-05-01    8.4
             ... 
2019-08-01    3.4
2019-09-01    3.5
2019-10-01    3.7
2019-11-01    3.8
2019-12-01    3.9
Length: 528, dtype: float64

See? Magic 😂
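One difference worth noticing: the wrapper hands back a pandas Series indexed by date, while our manual version produced a DataFrame with `date` and `value` columns. If you want the two to line up, a conversion along these lines should work. The `sample_series` here is a three-row stand-in built by hand, not real wrapper output.

```python
import pandas as pd

# Hand-built stand-in for a date-indexed Series like easy_response
sample_series = pd.Series(
    [7.7, 7.7, 7.7],
    index=pd.to_datetime(["1976-01-01", "1976-02-01", "1976-03-01"]),
)

# Name the values and the index, then move the index into a column
sample_df = sample_series.rename("value").rename_axis("date").reset_index()
print(sample_df)
```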