#### Using an API - Basics

As usual, we start by importing the libraries the project needs. Two libraries do most of the work when using an API:
The requests library, for asking the API for the information we need
The lxml library, for parsing the (XML-formatted) information the API returns

In [None]:
# the requests library lets us send a request to the API with the specific variables and parameters it takes
import requests

# the lxml library lets us parse the XML-formatted content we'll receive after we make our request
import lxml.etree as ET

#### Sending the API request

Let's send a request to the API that includes our key: UznMvHN96Q7DpAPu2YtZQgPQS

We are using the API's URL for getting the system's current date and time: http://www.ctabustracker.com/bustime/api/v1/gettime

The time URL takes one parameter: the key, a unique code assigned to us by the CTA that works like a username

In [None]:
# get the system time from CTA
content = requests.get("http://www.ctabustracker.com/bustime/api/v1/gettime?key=UznMvHN96Q7DpAPu2YtZQgPQS")

# save the text that was returned into a string
content_string = content.text
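
The query string can also be assembled from a dictionary of parameters rather than typed into the URL by hand. The sketch below uses the standard library's `urllib.parse.urlencode` so that it runs without a network connection; `BASE_URL` and `build_url` are names made up for this illustration, and `YOUR_KEY` is a placeholder. With requests itself, passing the same parameters through the `params` argument of `requests.get` has the same effect.

```python
from urllib.parse import urlencode

# Base address shared by the endpoints used in this notebook
BASE_URL = "http://www.ctabustracker.com/bustime/api/v1/"


def build_url(endpoint, **params):
    """Return the full request URL for an endpoint and its parameters (hypothetical helper)."""
    return BASE_URL + endpoint + "?" + urlencode(params)


# "YOUR_KEY" is a placeholder, not a real key
url = build_url("gettime", key="YOUR_KEY")
print(url)
```

Keeping the key in one variable and the parameters in a dictionary makes it harder to mistype a URL when the same key is reused across several endpoints.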

#### Loading the XML to an XML reader

Let's load the XML into an XML reader and see what the returned content looks like.

We should see a bunch of tags (like in HTML). Which tag encloses the information we want (the time)?

In [None]:
# let's load the XML into an XML reader with ET.fromstring
# We'll save the parsed XML as a variable called "doc"
doc = ET.fromstring(content_string)

# Let's see what the returned content looks like. We should see a bunch of tags (like in HTML).
# Which tag encloses the information we want (the time)?
print(ET.tostring(doc, pretty_print=True, encoding="unicode"))


#### Accessing the information in an XML tag

Right now we have the content, but it's in XML format.

We need to parse the XML and access the information in the tag by traversing the XML tree

In [None]:
# let's access the "tm" tag and get the text between its opening and closing tags
current_time = doc.find('tm').text

# print out the current date and time
print(current_time)
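
`find` returns `None` when a tag is missing, so `doc.find('tm').text` fails with an `AttributeError` if the response is not what we expected. A minimal sketch of a more defensive version follows. It uses the standard library's `xml.etree.ElementTree`, which offers the same `fromstring` and `findtext` calls as lxml, and a hand-written sample response; the sample's layout and the `YYYYMMDD HH:MM:SS` timestamp format are assumptions, so check them against what the API actually returns.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hand-written sample; the real response layout may differ (assumption)
sample = "<bustime-response><tm>20150312 14:05:09</tm></bustime-response>"

doc = ET.fromstring(sample)

# findtext returns None instead of raising when the tag is absent
raw_time = doc.findtext("tm")

if raw_time is None:
    print("No <tm> tag found in the response")
else:
    # Assumed timestamp format: YYYYMMDD HH:MM:SS
    current_time = datetime.strptime(raw_time, "%Y%m%d %H:%M:%S")
    print(current_time.isoformat())
```

Converting the text to a `datetime` makes it possible to compare or subtract timestamps later instead of treating them as plain strings.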

#### Acquire Different Kinds of Data

Let's try another API feature provided by the CTA: getting the names of all the bus routes.

The API URL to make that request is: http://www.ctabustracker.com/bustime/api/v1/getroutes

We'll pass our key again.

Once we make the request, we'll save the XML content to a string

In [None]:
# retrieve the routes of all the buses in the CTA
content = requests.get("http://www.ctabustracker.com/bustime/api/v1/getroutes?key=UznMvHN96Q7DpAPu2YtZQgPQS")
content_string = content.text

#### Loading XML and printing the XML tree

Pass the content to an XML parser and save the XML tree as a variable called "doc"

Let's see what the content looks like. Which tag encloses the route name information? Are there tags within tags?

In [None]:
doc = ET.fromstring(content_string)
print(content_string)

#### Retrieving the information between XML tags

Let's get a list of all of the information between the "route" tags that you saw when you printed the content_string.

In [None]:
routes = doc.findall('route')

In [None]:

# if we print the routes, we get a list of XML elements rather than text, because there are tags within the 'route' tags
print(routes)

We can see how many routes there are by printing the length of the list holding all of the routes and their subtags


In [None]:
print(len(routes))

We want to access the route names contained within the route tags, so we'll make a new list called routenames to hold them

In [None]:
routenames = []

For every route in the original list of routes, we'll add that route's name to our list of route names

In [None]:
# For every route tag in the XML tree
for route in routes:
    # look within that route tag and access the 'rtnm' tag. Save its text to a variable called route_result
    route_result = route.find('rtnm').text

    # append the route name to the list of routenames we made
    routenames.append(route_result)
print(routenames)
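
The loop above can also be written as a list comprehension. The sketch below runs on a hand-written sample instead of a live response, using the standard library's `xml.etree.ElementTree` (it has the same `fromstring`, `findall` and `find` calls as lxml); the route values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample with 'route' tags that each hold an 'rt' and an 'rtnm' tag
sample = (
    "<bustime-response>"
    "<route><rt>1</rt><rtnm>Example Route A</rtnm></route>"
    "<route><rt>2</rt><rtnm>Example Route B</rtnm></route>"
    "</bustime-response>"
)

doc = ET.fromstring(sample)

# One pass over the route tags, keeping only the text of each 'rtnm' tag
routenames = [route.find("rtnm").text for route in doc.findall("route")]
print(routenames)
```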

#### Get the route numbers from the XML tree

Go back to the XML tree returned by the request and see which tag holds the route numbers.

Make an empty list called full_route_list and write a "for loop" that goes through all of the routes and appends each route number to full_route_list

In [None]:
full_route_list = []
for route in routes:
    route_result = route.find('rt').text
    full_route_list.append(route_result)

#### Getting the directions for each route

We can now use the full_route_list to take advantage of another API feature.

The CTA API allows you to get the stops for a specific route. However, you need to know the route directions the bus goes in (either Eastbound and Westbound or Northbound and Southbound).

We'll eventually get all of the stops for every bus, but first let's get all of the directions.

#### How to store multiple directions for each route

Let's start by importing a special type of dictionary called a default dictionary (defaultdict)

A default dictionary lets you store values associated with a unique key (here, the bus route number). In this case, we're going to store each route's directions as a list in the dictionary

The default dictionary lets us set the default value for a new key to be an empty list, so we can immediately start appending to it instead of having to create the list for each key first.

In [None]:
# this imports a data structure called a defaultdict: the first time you reference a key, it is given a default value
from collections import defaultdict

# We want a dictionary that has the route number as the key and stores the directions in a list,
# so we'll make a default dictionary whose default value is an empty list
route_directions = defaultdict(list)
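
To see what the default buys us, here is a small comparison with a regular dictionary. The route number and directions are made-up values.

```python
from collections import defaultdict

# With a regular dictionary, appending to a missing key raises KeyError
plain = {}
try:
    plain["77"].append("Eastbound")
except KeyError:
    # so the list has to be created by hand first
    plain["77"] = ["Eastbound"]

# With defaultdict(list), the empty list is created on first access
auto = defaultdict(list)
auto["77"].append("Eastbound")
auto["77"].append("Westbound")

print(plain)
print(dict(auto))
```

One side effect worth knowing: merely reading a missing key from a defaultdict also creates it, with an empty list as its value.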

#### Now, let's go through every route number we have and ask the API to return the directions, and we'll go through the XML result and parse out the information like before.

In [None]:
# for every route number in our list of route numbers,
for route in full_route_list:
    # ask the API for the directions that route travels in
    content = requests.get("http://www.ctabustracker.com/bustime/api/v1/getdirections?key=UznMvHN96Q7DpAPu2YtZQgPQS&rt=%s" % route)
    content_string = content.text

    # we'll parse the result, which is an XML file holding the route directions in between the "dir" tags
    doc = ET.fromstring(content_string)
    directions = doc.findall('dir')

    # for every direction returned, add it to the list stored under this route number
    for direction in directions:
        route_directions[route].append(direction.text)
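
Looping over many routes means some requests may come back without any `dir` tags, for example when the API reports an error. The sketch below separates the parsing step into a function so it can be tried on sample strings. The `error`/`msg` layout of the failure response is an assumption about the API, `parse_directions` is a hypothetical helper, and the standard library's `xml.etree.ElementTree` stands in for lxml.

```python
import xml.etree.ElementTree as ET


def parse_directions(xml_text):
    """Return the directions found in a response, or [] if it holds an error."""
    doc = ET.fromstring(xml_text)
    error = doc.find("error")
    if error is not None:
        # Assumed error layout: <error><msg>...</msg></error>
        print("API error:", error.findtext("msg"))
        return []
    return [d.text for d in doc.findall("dir")]


# Hand-written sample responses (not real API output)
ok = "<bustime-response><dir>Eastbound</dir><dir>Westbound</dir></bustime-response>"
bad = "<bustime-response><error><msg>Sample error text</msg></error></bustime-response>"

print(parse_directions(ok))
print(parse_directions(bad))
```

Keeping the parsing in its own function also makes it easy to test without sending any requests.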

#### Use the direction information and the bus number to get the bus stops from the API

We now have a dictionary of all of the directions for every bus number.

Let's get the stops for each bus line and save those to a different dictionary

In [None]:
busstops = defaultdict(list)

for busnumber, directions in route_directions.items():
    for direction in directions:
        content = requests.get("http://www.ctabustracker.com/bustime/api/v1/getstops?key=UznMvHN96Q7DpAPu2YtZQgPQS&rt=%s&dir=%s" % (busnumber, direction))
        content_string = content.text
        doc = ET.fromstring(content_string)
        stops = doc.findall('stop')
        for stop in stops:
            busstops[busnumber].append(stop.find('stpid').text)
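
Because the later overlap step only asks whether a stop is shared, the stop IDs could equally be collected in sets, which drop duplicates automatically. A small sketch with invented stop IDs; `busstops_as_sets` and `add_stops` are hypothetical names, with `add_stops` standing in for the request-and-parse step above.

```python
from collections import defaultdict

# Each bus number maps to a set of stop IDs instead of a list
busstops_as_sets = defaultdict(set)


def add_stops(busnumber, stop_ids):
    """Record the stop IDs served by a bus, ignoring repeats."""
    busstops_as_sets[busnumber].update(stop_ids)


add_stops("77", ["100", "101", "102"])   # stops from the first direction
add_stops("77", ["102", "103"])          # second direction; "102" repeats

print(sorted(busstops_as_sets["77"]))
```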
            
            

#### Accessing the stop numbers for each bus
Now, for every bus, we know the specific stop numbers it stops at along its route

In [None]:
print(busstops['77'])

#### Make a dataframe to hold data on how much overlap two buses have

We'll import pandas and make a dataframe where the bus numbers are both the index and the columns.

This dataframe will hold the number of stops each bus shares with every other bus

In [None]:
import pandas as pd
# start with a table of zeros that has one row and one column per bus number
busnetwork = pd.DataFrame(0, index=list(busstops), columns=list(busstops))


In [None]:
import itertools

# look at every pair of buses once
pairwise_combinations = itertools.combinations(busstops, 2)
for bus1, bus2 in pairwise_combinations:
    # count the stops the two buses have in common
    overlap = len(set(busstops[bus1]) & set(busstops[bus2]))
    # store the count in both directions so the table is symmetric
    busnetwork.at[bus1, bus2] = overlap
    busnetwork.at[bus2, bus1] = overlap

print(busnetwork.head(10))
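
The overlap count is just the size of a set intersection. A toy example with invented buses and stop IDs, using plain dictionaries so the arithmetic is easy to follow:

```python
import itertools

# Invented data: three buses and the stops they serve
toy_stops = {
    "A": ["1", "2", "3"],
    "B": ["2", "3", "4"],
    "C": ["5"],
}

toy_overlap = {}
for bus1, bus2 in itertools.combinations(toy_stops, 2):
    # stops that appear in both lists
    shared = set(toy_stops[bus1]) & set(toy_stops[bus2])
    toy_overlap[(bus1, bus2)] = len(shared)

print(toy_overlap)
```

Buses A and B share stops 2 and 3, so their overlap is 2; bus C shares nothing with either, so both of its overlaps are 0.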

#### Can you determine which 10 buses have the most overlapping stops with other buses?

Use pandas dataframe commands to help you find your answer

In [None]:
top_buses = busnetwork.sum(axis=0)
print(top_buses.sort_values(ascending=False).head(10))

#### Advanced Analysis Options - Find the most well connected buses

You can treat the data of overlaps between all buses as a network

This network can be loaded into the networkx package, and the centrality of each bus can be calculated

Centrality represents how much opportunity a node has in a network to access other nodes. It's analogous to being "well connected" to the population

Three centrality measures are commonly used:

Degree Centrality: How many direct connections you have with other nodes

Closeness Centrality: How quickly you can reach every other node, on average

PageRank Centrality: How influential the nodes you are connected to are
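
To make the first two measures concrete, here is a hand computation on a tiny invented network in plain Python. It assumes the usual normalized definitions: degree centrality is a node's number of neighbors divided by n - 1, and closeness centrality is n - 1 divided by the sum of shortest-path distances to every other node. The sketch also assumes the network is connected; the buses and links are hypothetical.

```python
from collections import deque

# Toy network: bus "A" shares stops with "B" and "C"; "C" also shares stops with "D"
neighbors = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"C"},
}
n = len(neighbors)


def degree_centrality(node):
    """Number of direct connections, divided by the maximum possible (n - 1)."""
    return len(neighbors[node]) / (n - 1)


def closeness_centrality(node):
    """(n - 1) divided by the total number of hops needed to reach every other node."""
    # Breadth-first search gives the hop count to every reachable node
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return (n - 1) / sum(dist.values())


for bus in sorted(neighbors):
    print(bus, round(degree_centrality(bus), 3), round(closeness_centrality(bus), 3))
```

Buses A and C tie on both measures here: each has two direct links and can reach every other bus in at most two hops, while B and D sit at the ends of the chain.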

In [None]:
import networkx as nx

# build a weighted graph from the overlap table, then relabel the nodes with the bus numbers
graph = nx.from_numpy_array(busnetwork.values)
mapping = dict(zip(range(len(busnetwork.index)), busnetwork.index))
graph = nx.relabel_nodes(graph, mapping)

degree_centrality = nx.degree_centrality(graph)
pagerank_centrality = nx.pagerank(graph)
closeness_centrality = nx.closeness_centrality(graph)

# turn each {bus: score} dictionary into a two-column dataframe
degree_df = pd.DataFrame(list(degree_centrality.items()), columns=["BusNumber", "DegreeCentrality"])
closeness_df = pd.DataFrame(list(closeness_centrality.items()), columns=["BusNumber", "ClosenessCentrality"])
pagerank_df = pd.DataFrame(list(pagerank_centrality.items()), columns=["BusNumber", "PageRankCentrality"])

# combine the three measures into one table
centrality_df = pd.merge(pd.merge(degree_df, closeness_df, on='BusNumber'), pagerank_df, on="BusNumber")

print(centrality_df.sort_values(by="PageRankCentrality", ascending=False).head(10))


#### Advanced Analysis - Multi-Dimensional Scaling

You can apply advanced data processing techniques such as Multi-Dimensional Scaling (MDS) to the data to represent the similarity and dissimilarity between buses (in terms of overlapping stops) in two dimensions (e.g., like latitude and longitude on a map).

We'll cover this technique in more detail in the dimension reduction lesson
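
MDS works from dissimilarities between items. The cell below derives them from the rows of the overlap table; as one illustrative alternative (not what this notebook does), the Jaccard distance between two buses' stop sets is 1 minus the share of their combined stops that they have in common. A sketch with invented stop IDs, where `jaccard_distance` is a hypothetical helper:

```python
def jaccard_distance(stops_a, stops_b):
    """Return 1 - |A & B| / |A | B|: 0 for identical stop sets, 1 for no shared stops."""
    a, b = set(stops_a), set(stops_b)
    if not a and not b:
        # two empty sets are treated as identical
        return 0.0
    return 1 - len(a & b) / len(a | b)


# 2 shared stops out of 4 distinct stops
print(jaccard_distance(["1", "2", "3"], ["2", "3", "4"]))

# no shared stops at all
print(jaccard_distance(["1", "2"], ["3"]))
```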

In [None]:
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

%matplotlib inline
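# metric MDS in two dimensions; dissimilarities are the Euclidean distances between rows of the overlap table
nmds = MDS(n_components=2, metric=True, max_iter=30000, eps=1e-12, dissimilarity="euclidean")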

#from scipy.spatial.distance import pdist, squareform
#distances = squareform(pdist(busnetwork.as_matrix(), 'Mahalanobis'))
npos = nmds.fit_transform(busnetwork)
fig = plt.figure(1)
ax = plt.axes([0., 0., 2., 2.])
plt.scatter(npos[:, 0], npos[:, 1], s=20, c='b')
# label each point with its bus number
for i in range(len(busnetwork)):
    ax.annotate(busnetwork.columns[i], (npos[i, 0], npos[i, 1]))
plt.show()
