# Fetching LA Metro Data Using AWS
## Based On UCLA ITS Data Camp, Day 1 Exercise *demo added at end*
## Retrieving Data via API Calls

### Exercise 2: Getting data from LA Metro
When you query APIs for data, getting the response into a tidy format is rarely straightforward. Instead, you will usually want to inspect the response content before deciding how to proceed. Let's take a look at data from [LA Metro's Developer Portal](https://developer.metro.net/). Going to the [Metro Bus & Rail Real-time Arrivals](https://developer.metro.net/portfolio-item/real-time-arrivals/) page, we can see a variety of publicly available APIs. Take a look at all the [feeds](https://developer.metro.net/introduction/realtime-api-overview/realtime-api-returning-json/) returning JSON-formatted content, including route information, stop information, and realtime vehicle location information.

You will notice that instead of GeoJSON, it is in a slightly different format that requires just a bit of wrangling to get it in the right format.
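
One way to inspect a response before wrangling it is to summarise the shape of the parsed JSON. The sketch below is only illustrative: `describe_payload` is a hypothetical helper, and `sample` is a hand-written stand-in (not live data) that assumes the payload is a dictionary with a top-level `items` list.

```python
def describe_payload(payload):
    """Return a short, human-readable summary of a parsed JSON payload."""
    lines = []
    if isinstance(payload, dict):
        # Report each top-level key with the type (and size) of its value
        for key, value in payload.items():
            size = f", length {len(value)}" if isinstance(value, (list, dict)) else ""
            lines.append(f"{key}: {type(value).__name__}{size}")
    elif isinstance(payload, list):
        lines.append(f"list of {len(payload)} items")
        if payload:
            lines.append(f"first item: {type(payload[0]).__name__}")
    else:
        lines.append(type(payload).__name__)
    return lines


# Hand-written sample shaped like a vehicles response (assumption, not live data)
sample = {"items": [{"id": "9302", "latitude": 34.0277422, "longitude": -118.3908683}]}
summary = describe_payload(sample)
print("\n".join(summary))
```

With a real response, the same helper could be called on `resp.json()` to see which key holds the list of records.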

##### Create an API Call
Pick any of the Metro routes and, following the structure in the example, make a call to get all the current vehicles on that route. Once we get the response (assuming it is successful), let's take a look at the content.

In [24]:
import requests
import pandas as pd
import os

# TODO: Write the statement to call the Metro API and get all vehicles for a particular route
#       and store the response as resp
# (No need to import the requests package again)
resp = requests.get('http://api.metro.net/agencies/lametro/routes/733/vehicles/')
print(resp.status_code)
# TODO: Store the JSON content as `data`
data = resp.json()

200


In [25]:
data

{'items': [{'run_id': '733_97_0',
   'seconds_since_report': 43,
   'heading': 58.0,
   'route_id': '733',
   'predictable': True,
   'latitude': 34.0277422,
   'longitude': -118.3908683,
   'id': '9302'},
  {'run_id': '733_97_0',
   'seconds_since_report': 240,
   'heading': 139.0,
   'route_id': '733',
   'predictable': True,
   'latitude': 34.015137,
   'longitude': -118.497437,
   'id': '1650'},
  {'run_id': '733_100_1',
   'seconds_since_report': 123,
   'heading': 225.0,
   'route_id': '733',
   'predictable': True,
   'latitude': 33.992805,
   'longitude': -118.455353,
   'id': '6053'},
  {'run_id': '733_100_1',
   'seconds_since_report': 237,
   'heading': 220.0,
   'route_id': '733',
   'predictable': True,
   'latitude': 34.056961,
   'longitude': -118.231354,
   'id': '9371'},
  {'seconds_since_report': 41,
   'heading': 270.0,
   'predictable': False,
   'latitude': 34.055176,
   'longitude': -118.232697,
   'id': '5859'},
  {'run_id': '733_100_1',
   'seconds_since_report'

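The response is a dictionary whose `items` key holds a list of records, one dictionary per vehicle. A list of dictionaries like this converts easily into a pandas dataframe with `df = pd.DataFrame(records)`, where `records` is the list. Records that lack a key (such as the vehicle above with no `run_id`) get a missing value in that column. Let's go ahead and convert the JSON output into a dataframe. _Hint: Make sure you access the list part of the JSON output!_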

In [26]:
# TODO: Convert the JSON output to a dataframe
metro_df = pd.DataFrame(data['items'])

# Examine the head of the dataframe
metro_df.head()

Unnamed: 0,run_id,seconds_since_report,heading,route_id,predictable,latitude,longitude,id
0,733_97_0,43,58.0,733.0,True,34.027742,-118.390868,9302
1,733_97_0,240,139.0,733.0,True,34.015137,-118.497437,1650
2,733_100_1,123,225.0,733.0,True,33.992805,-118.455353,6053
3,733_100_1,237,220.0,733.0,True,34.056961,-118.231354,9371
4,,41,270.0,,False,34.055176,-118.232697,5859
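
If GeoJSON is what you ultimately need, the wrangling is mostly a matter of moving the coordinates into a `geometry` and everything else into `properties`. The following is a minimal sketch under that assumption: `vehicles_to_geojson` is a hypothetical helper, and `sample_items` reuses two records from the output above rather than a live response.

```python
import json


def vehicles_to_geojson(items):
    """Convert a list of vehicle dicts into a GeoJSON FeatureCollection dict."""
    features = []
    for item in items:
        # Skip records that carry no coordinates
        if "latitude" not in item or "longitude" not in item:
            continue
        # Everything except the coordinates becomes a property
        properties = {k: v for k, v in item.items() if k not in ("latitude", "longitude")}
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [item["longitude"], item["latitude"]]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Two records copied from the output above (the second has no run_id or route_id)
sample_items = [
    {"run_id": "733_97_0", "route_id": "733", "latitude": 34.0277422, "longitude": -118.3908683, "id": "9302"},
    {"latitude": 34.055176, "longitude": -118.232697, "id": "5859"},
]
collection = vehicles_to_geojson(sample_items)
print(json.dumps(collection, indent=2))
```

Note the coordinate order: GeoJSON puts longitude first, the reverse of the `latitude`, `longitude` order in the feed.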


##### Add a Column to the DataFrame
Notice that the dataframe above does not record when the query was made. If we plan to write out the data for later analysis, we need to add the time of the query as a column value. The easiest way to get the current time in Python is through the [datetime](https://docs.python.org/2/library/datetime.html) module. Take a little time to look through the documentation, with a particular focus on the `now()` method.

Once we have the current time, we can add it to our dataframe. Create an additional column `call_time`: get the current timestamp of the call and assign it as the value for that column.

In [27]:
# Import the datetime module
import datetime as dt

# TODO: Get the current time
now = dt.datetime.now()

# TODO: Add the current time as a value to the dataframe column `call_time`
metro_df['call_time'] = now

In [28]:
metro_df.head()

Unnamed: 0,run_id,seconds_since_report,heading,route_id,predictable,latitude,longitude,id,call_time
0,733_97_0,43,58.0,733.0,True,34.027742,-118.390868,9302,2019-10-14 21:22:26.188259
1,733_97_0,240,139.0,733.0,True,34.015137,-118.497437,1650,2019-10-14 21:22:26.188259
2,733_100_1,123,225.0,733.0,True,33.992805,-118.455353,6053,2019-10-14 21:22:26.188259
3,733_100_1,237,220.0,733.0,True,34.056961,-118.231354,9371,2019-10-14 21:22:26.188259
4,,41,270.0,,False,34.055176,-118.232697,5859,2019-10-14 21:22:26.188259
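
One caveat: `dt.datetime.now()` returns a naive timestamp in the local time of whatever machine runs the code. If the notebook later runs on a machine in a different time zone, such as a cloud server, those values become ambiguous. A hedged alternative, not part of the original exercise, is to record a timezone-aware UTC timestamp:

```python
import datetime as dt

# Timezone-aware timestamp in UTC (carries an explicit +00:00 offset)
now_utc = dt.datetime.now(dt.timezone.utc)

# ISO 8601 string, unambiguous when written to a CSV column
print(now_utc.isoformat())
```

Assigning `now_utc` to the `call_time` column works the same way as assigning the naive value.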


##### Wrap the API Call in a Function
Let's create a function to take a Route ID and make the API call for all realtime vehicle locations on that route. Add in the code we used in the block above to also create a column with the time we called the API.

_Function Input:_ Route ID  
_Function Output:_ Dataframe with the response content

In [29]:
# TODO: Create the function
def get_vehicles_byroute(routenum):
    # Request all current vehicles on the given route
    resp = requests.get('http://api.metro.net/agencies/lametro/routes/%s/vehicles/' % routenum)
    print(resp.status_code)

    # Store the JSON content as `data`
    data = resp.json()

    # Convert the list under 'items' to a dataframe
    routedata = pd.DataFrame(data['items'])

    # Get the current time (`dt` was imported in an earlier cell)
    now = dt.datetime.now()

    # Add the current time as a value to the dataframe column `call_time`
    routedata['call_time'] = now
    return routedata

Let's take a look to make sure our function is working correctly. Run the cell below to confirm that you are getting the desired result. Go ahead and try changing the input and see how the output changes.

In [30]:
# Call the function for one of the routes
routedata = get_vehicles_byroute(720)

# Examine the head of the dataframe
routedata.head()

200


Unnamed: 0,id,route_id,predictable,run_id,latitude,longitude,heading,seconds_since_report,call_time
0,9532,720,True,720_1144_1,34.034783,-118.235089,274.0,15,2019-10-14 21:22:26.611693
1,9356,720,True,720_1122_0,34.014007,-118.490341,220.0,15,2019-10-14 21:22:26.611693
2,9363,720,True,720_1144_1,34.061722,-118.310936,270.0,237,2019-10-14 21:22:26.611693
3,9412,720,True,720_1144_1,34.020855,-118.159172,270.0,36,2019-10-14 21:22:26.611693
4,9362,720,True,720_1144_1,34.048023,-118.250755,315.0,237,2019-10-14 21:22:26.611693


##### Add Functionality
Great! Now we are able to change the route number and get a dataframe with the current location of all vehicles on the route. One of the next things we might want to do would be to get data from the route throughout the day and store it for later analysis. To do that we are going to need to add the following functionality into our function:

1. Write out the CSV to a file in our `data/processed` folder. Let's set the filename to the format `lametro_[routenum]_[timestamp].csv` (e.g. `lametro_720_2019-09-10-22-26-52.csv`). To write out the file, go ahead and use [pandas' method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) for writing out a CSV file.
2. Add conditional logic to only write out the file if the call was a success. If the call was not successful, print out the error message. Take a look [here](https://2.python-requests.org/en/master/user/quickstart/#response-status-codes) for some guidance.

As the function gets a bit more complex, please add appropriate code comments inside to quickly convey the purpose of each code block.


In [31]:
print(dt.datetime.now())

2019-10-14 21:22:26.642141


In [32]:
# TODO: Re-write the function with the requested features
def get_vehicles_byroute(routenum):
    # Make the request for the given route number
    resp = requests.get('http://api.metro.net/agencies/lametro/routes/%s/vehicles/' % routenum)

    # Only continue if the call was successful
    if resp.status_code != requests.codes.ok:
        print('API call unsuccessful: %s %s' % (resp.status_code, resp.reason))
        # Raises requests.HTTPError for 4xx/5xx responses
        resp.raise_for_status()
        # Reached only for non-200 responses that are not errors
        return

    # Convert the list under 'items' in the JSON content to a dataframe
    data = resp.json()
    routedata = pd.DataFrame(data['items'])

    # Add the time of the call as the column `call_time`
    now = dt.datetime.now()
    routedata['call_time'] = now

    # Make the output directory, if necessary
    outdir = os.path.join(os.getcwd(), 'data', 'processed')
    os.makedirs(outdir, exist_ok=True)

    # Write to CSV, e.g. lametro_720_2019-09-10-22-26-52.csv
    fname = f"lametro_{routenum}_{now.strftime('%Y-%m-%d-%H-%M-%S')}.csv"
    routedata.to_csv(os.path.join(outdir, fname), index=False)
    return routedata

Let's go ahead and call the function again for one of the routes to ensure that it was written out correctly.

In [33]:
get_vehicles_byroute(704)

Unnamed: 0,id,route_id,predictable,run_id,latitude,longitude,heading,seconds_since_report,call_time
0,9374,704,True,704_152_0,34.090733,-118.316574,90.0,64,2019-10-14 21:22:26.892740
1,5834,704,True,704_152_0,34.08223,-118.387844,51.0,15,2019-10-14 21:22:26.892740
2,6041,704,True,704_152_0,34.075235,-118.253951,117.0,15,2019-10-14 21:22:26.892740
3,9272,704,True,704_168_1,34.085673,-118.383163,232.0,44,2019-10-14 21:22:26.892740
4,9308,704,True,704_168_1,34.056961,-118.231056,220.0,64,2019-10-14 21:22:26.892740
5,9209,704,True,704_168_1,34.090964,-118.297871,270.0,15,2019-10-14 21:22:26.892740
6,9301,704,True,704_152_0,34.043767,-118.455936,61.0,15,2019-10-14 21:22:26.892740


Check your data folder - if everything was successful, you should see a CSV file with the data from the call.
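
The check can also be done from Python. This is a small sketch, assuming the `data/processed` folder and the `lametro_[routenum]_[timestamp].csv` naming used above; `list_route_files` is a hypothetical helper name.

```python
from pathlib import Path


def list_route_files(folder="data/processed", routenum=None):
    """Return sorted CSV paths in `folder`, optionally limited to one route."""
    pattern = f"lametro_{routenum}_*.csv" if routenum is not None else "lametro_*.csv"
    return sorted(Path(folder).glob(pattern))


# Print the names of all files written so far for route 704
for path in list_route_files(routenum=704):
    print(path.name)
```

Because the timestamp in the filename is zero-padded and ordered from year to second, sorting the names also sorts the files chronologically.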

##### Introduction to variable-length arguments
We've now built a function that, for a given route, will get current vehicle location data, format it into a dataframe, and write it out to a CSV file with the current datetime. What if we were interested in 2 routes? or 3 routes? Let's build another function that takes as input a _variable number of route numbers_ and then gets the vehicle data for each of them.
  
We will do this through the [_*args_ syntax](https://www.geeksforgeeks.org/args-kwargs-python/). Following that syntax, create a function called `get_vehicles_byroutes` that takes in a variable number of route numbers. For each route number, the function should call our other function `get_vehicles_byroute`. Between each call to our original function, add a 5 second pause to reduce the load on the server.
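
As a quick illustration of the syntax on its own, here is a toy function (`describe_routes` is a made-up name, unrelated to the Metro API) showing how `*routes` collects positional arguments into a tuple, and how `*` unpacks a list at the call site:

```python
def describe_routes(*routes):
    """Collect any number of positional arguments into the tuple `routes`."""
    return [f"route {route}" for route in routes]


# One argument or several: the function body does not change
print(describe_routes(720))
print(describe_routes(20, 720, 33))

# A list can be unpacked into separate positional arguments with `*`
selected = [704, 733]
print(describe_routes(*selected))
```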

In [12]:
import time

# TODO: Finish composing the function
def get_vehicles_byroutes(*routes):
    for route in routes:
        get_vehicles_byroute(route)
        time.sleep(5)

##### Create a Loop to run the Function 
Great! We now have a function that calls Metro's API, records the location of all vehicles for one or more routes, logs the current timestamp, and saves the file in a location of our choosing. Let's (1) pick a few routes we want to get data from and (2) create a loop that runs the `get_vehicles_byroutes` function once per minute, for 5 minutes, with those route numbers as the input.


In [None]:
# TODO: Execute the function 5x, each time separated by a minute
for _ in range(5):
    get_vehicles_byroutes(20, 720, 33)
    time.sleep(60)
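
Note that a fixed `time.sleep(60)` is added on top of the time each round takes (including the 5-second pauses between routes), so the rounds start more than a minute apart. If an even cadence matters, one option is to sleep only for the time remaining until the next scheduled start. The sketch below assumes that approach; `run_every` is a hypothetical helper, demonstrated with a stand-in task and a short interval.

```python
import time


def run_every(interval_seconds, repetitions, task):
    """Call `task` `repetitions` times, starting each call `interval_seconds` apart."""
    start = time.monotonic()
    for i in range(repetitions):
        task()
        # Sleep only for the time remaining until the next scheduled start
        next_start = start + (i + 1) * interval_seconds
        remaining = next_start - time.monotonic()
        if remaining > 0 and i < repetitions - 1:
            time.sleep(remaining)


# Demonstration with a short interval and a stand-in task
calls = []
run_every(0.2, 3, lambda: calls.append(time.monotonic()))
print(len(calls))
```

For the exercise, the call would look like `run_every(60, 5, lambda: get_vehicles_byroutes(20, 720, 33))`.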

#### Fetch a day's worth of LA Metro data

After starting this notebook on your EC2 instance using `screen`, run the cells below to fetch a day's worth of LA Metro data for routes 20 and 720. This function could be expanded, for example by passing the collection duration as an argument instead of hardcoding one day. Make sure you've run the rest of the notebook first, since it includes the necessary imports and function definitions.

Also consider:
1. Some sort of error handling that will ensure it keeps running if an API call is unsuccessful. Right now, the check coded into `get_vehicles_byroute` will end the loop if it encounters an unsuccessful call. You may also consider finding some way for the function to alert you if a call is unsuccessful (email perhaps?).
2. A data structure that meets your needs. Right now, each query generates its own .csv file, just like in the original ITS Data Camp notebook. Since thousands of files may be unwieldy to work with if you're collecting data for many routes over a long period of time, you might want to find a better way.
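
For the first point, one possible shape is a small wrapper that logs a failure and moves on instead of letting the exception stop the loop. This is a sketch, not the notebook's method: `call_safely` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `get_vehicles_byroute` so the example runs without network access.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("metro_fetch")


def call_safely(func, *args):
    """Call func(*args); log any exception and return None instead of raising."""
    try:
        return func(*args)
    except Exception as exc:  # broad on purpose for this sketch
        name = getattr(func, "__name__", repr(func))
        logger.warning("Call %s%r failed: %s", name, args, exc)
        return None


# Stand-in for get_vehicles_byroute that fails for one route
def fake_fetch(routenum):
    if routenum == 33:
        raise RuntimeError("simulated HTTP error")
    return f"data for {routenum}"


results = [call_safely(fake_fetch, r) for r in (20, 33, 720)]
print(results)
```

In the notebook, the `except` clause could be narrowed to `requests.exceptions.RequestException`, and the log call is a natural place to hook in an alert.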

In [82]:
def get_vehicles_oneday(*routes):
    
    #create datetime objects for now and 1 day from now
    now = dt.datetime.now()
    then = now + dt.timedelta(days=1)
    
    #loop will end when current time passes target 
    while now < then:
        get_vehicles_byroutes(*routes)
        
        #request data every 2 minutes using delay
        time.sleep(120)
        
        #update current time
        now = dt.datetime.now()

In [None]:
get_vehicles_oneday(20, 720)
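
On the data-structure question raised above, one simple alternative to thousands of small files is to append every query to a single CSV, writing the header only once. The sketch below uses the standard `csv` module so it runs on its own; `append_records`, the field list, and the file name `lametro_all.csv` are illustrative assumptions, and the two sample rows are copied from the outputs earlier in this notebook.

```python
import csv
import os
import tempfile


def append_records(path, records, fieldnames):
    """Append dict records to one CSV file, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        if write_header:
            writer.writeheader()
        writer.writerows(records)


fields = ["id", "route_id", "latitude", "longitude", "call_time"]
path = os.path.join(tempfile.mkdtemp(), "lametro_all.csv")

# Two separate "queries" appended to the same file
append_records(path, [{"id": "9532", "route_id": "720", "latitude": 34.034783,
                       "longitude": -118.235089, "call_time": "2019-10-14 21:22:26"}], fields)
append_records(path, [{"id": "9374", "route_id": "704", "latitude": 34.090733,
                       "longitude": -118.316574, "call_time": "2019-10-14 21:22:26"}], fields)

with open(path, newline="") as f:
    rows = list(csv.DictReader(f))
print(len(rows))
```

With a dataframe, `to_csv` accepts `mode='a'` and `header=False`, which gives the same append behaviour.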