# Introduction

In this tutorial, we will visualize and draw insights from Pittsburgh's HealthyRide bicycle network data. This data was fetched from https://healthyridepgh.com/media-kit/ and https://healthyridepgh.com/data/. To analyze it, we will explore a few libraries that help with network data, mapping, and data analysis. By visualizing this data, we hope to understand common ride patterns, biking hotspots, how they relate to other biking infrastructure (bike lanes etc.), and other insights into trips.

### Libraries
To ingest and clean our data, we will use Pandas. We will also use it to analyze the data and perform operations such as aggregations. Pandas offers functionality to process the data, clean it, and make it usable for network analysis and other libraries. More information about Pandas can be found here: https://pandas.pydata.org/

The data that we are looking at primarily involves trip information (origin, destination, time etc.). Therefore, we can best process this data in the form of a graph, and then use that graph to conduct further analysis. Python offers various graph libraries, but the most popular one with extensive documentation is networkx. More info can be found here: https://networkx.org/. It offers extensive functionality to create networks from data (including pandas dataframes) and understand properties of these networks.

Finally, to visualize our data geographically, we will use Google Maps. Google Maps has extensive built-in infrastructure and capabilities to represent geographical data. To use Google Maps in a Jupyter notebook, we can turn to the gmaps library, which is available through conda. This library builds on the Google Maps API to display Google Maps in a Jupyter Notebook environment. More information about this library can be found here: https://jupyter-gmaps.readthedocs.io/

### Installing the libraries
To install these libraries, you must first run some commands in your local miniconda environment.
For pandas, run 
```
conda install pandas
```
For networkx, run
```
conda install networkx
```
For gmaps, first terminate your existing notebook process. Then, in a new conda environment, run the following commands:
```
conda install -c conda-forge gmaps
jupyter nbextension enable --py --sys-prefix widgetsnbextension
jupyter nbextension enable --py --sys-prefix gmaps
```
After you run these commands, restart your notebook and you should be able to view the maps. Run the cell below to import the libraries and start using them:

In [84]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import gmaps
import gmaps.datasets

gmaps.configure(api_key='YOUR_API_KEY') # Fill in with your own API key; never commit a real key
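
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. One alternative is to keep the key in an environment variable. This is a minimal sketch: the variable name `GOOGLE_MAPS_API_KEY` and the helper `load_api_key` are hypothetical choices for illustration, not something gmaps requires.

```python
import os

def load_api_key(var_name="GOOGLE_MAPS_API_KEY"):
    """Return the API key stored in the given environment variable.

    Both the variable name and this helper are illustrative assumptions.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# In the notebook, the key could then be passed on like this:
# gmaps.configure(api_key=load_api_key())
```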

### Fetching the trip and local stations data
To get started, navigate to https://healthyridepgh.com/data/ and download the 2021 Q1 data zip file. Unzip this file, and move the 2 csv files into the project directory. The names of the 2 files should be 'healthy-ride-rentals-2021-q1.csv' and 'healthy-ride-station-locations-2021-q1.csv'.  
  
We will now use the built-in pandas csv reader to load this data, reading every column as a string.

In [85]:
# Reading the trip data
df_trips = pd.read_csv('healthy-ride-rentals-2021-q1.csv', dtype="str", keep_default_na=False)
# Dropping columns that are not as useful
df_trips = df_trips.drop(['Bikeid', 'Usertype'], axis=1)
# Dropping rows with empty origin and/or dest info
df_trips = df_trips.drop(df_trips[df_trips['To station id'] == ''].index)
df_trips = df_trips.drop(df_trips[df_trips['From station id'] == ''].index)
# Convert the tripduration and dates to the relevant datatypes
df_trips = df_trips.astype({'Tripduration': 'int32'})
df_trips['Starttime'] = pd.to_datetime(df_trips['Starttime'])
df_trips['Stoptime'] = pd.to_datetime(df_trips['Stoptime'])

Now, we will repeat the same process for the station locations:

In [86]:
df = pd.read_csv('healthy-ride-station-locations-2021-q1.csv', dtype="str", keep_default_na=False)
# Dropping columns that are not as useful
df = df.drop(['# of Racks'], axis=1)
df = df.astype({'Latitude': 'float64', 'Longitude' : 'float64'})
# Fixing an error in the input data (it used the wrong sign for latitude data)
df.loc[86, 'Latitude'] = 40.467715
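
A sign error like the one fixed above can also be caught programmatically. The sketch below flags stations whose coordinates fall outside a rough bounding box around Pittsburgh. The box limits and the helper name `find_suspect_coordinates` are assumptions chosen for illustration, and the sample rows are made up.

```python
import pandas as pd

# Rough bounding box around Pittsburgh (assumed values, adjust as needed)
LAT_RANGE = (40.3, 40.6)
LON_RANGE = (-80.2, -79.8)

def find_suspect_coordinates(stations):
    """Return the rows whose latitude or longitude lies outside the box."""
    lat_ok = stations['Latitude'].between(*LAT_RANGE)
    lon_ok = stations['Longitude'].between(*LON_RANGE)
    return stations[~(lat_ok & lon_ok)]

# Made-up sample: the second row has the wrong sign on its latitude
sample = pd.DataFrame({
    'Station Name': ['Station A', 'Station B'],
    'Latitude': [40.441, -40.468],
    'Longitude': [-80.003, -79.915],
})
suspects = find_suspect_coordinates(sample)
print(suspects)
```

In the notebook, calling `find_suspect_coordinates(df)` after the type conversion would list any rows worth inspecting by hand.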

TODOS:
- Add labels, colors to the nodes, increase canvas size
- (DONE) Represent cycling data as graph with proper relative positions
- (DONE) Sizes based on number of trips to a given node
- (DONE) Sizes based on number of trips starting at a given node
- (DONE) Find average time for each start, end dest.
- Find avg trip time on weekends vs weekdays
- Find avg trip time on mornings vs evenings
- (DONE) Visualize most popular trips
- Find additional data to analyze via network and gmaps/pandas

# Initial insights into trip patterns

We have now set up all our data from the 2 csv files, covering trips and station information.

### Most popular origin stations
Let us first find the most popular origin stations. We can do this with a simple aggregation in Pandas.

In [87]:
df_most_popular_origin = df_trips.groupby(['From station id','From station name']).size().reset_index(name="count").sort_values("count", ascending=False).head(10)
print(df_most_popular_origin)

   From station id                         From station name  count
12            1012  North Shore Trail & Fort Duquesne Bridge    500
0             1000                  Liberty Ave & Stanwix St    442
40            1041                  Fifth Ave & S Bouquet St    442
69            1093                S Bouquet Ave & Sennott St    419
77           49301                   Centre Ave & N Craig St    389
56            1061                       33rd St & Penn Ave     380
70            1094                     O'Hara St & Desoto St    361
44            1045  S 27th St & Sidney St. (Southside Works)    339
66            1084                Hot Metal St & Tunnel Blvd    295
17            1017                        21st St & Penn Ave    268


We can see that the most popular station is the 'North Shore Trail & Fort Duquesne Bridge.' We will visualize this information on a map later on.
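
To see what the aggregation is doing step by step, here is the same group, count, and sort chain on a tiny made-up frame (the station ids and names are invented for illustration):

```python
import pandas as pd

# Made-up trips: three start at station 'A', one at station 'B'
toy_trips = pd.DataFrame({
    'From station id': ['A', 'B', 'A', 'A'],
    'From station name': ['Alpha St', 'Beta St', 'Alpha St', 'Alpha St'],
})

toy_counts = (
    toy_trips
    .groupby(['From station id', 'From station name'])
    .size()                      # number of rows (trips) in each group
    .reset_index(name='count')   # turn the group keys back into columns
    .sort_values('count', ascending=False)
)
print(toy_counts)
```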

### Most popular destination stations
Next, let us find the most popular destination stations using the same kind of aggregation in Pandas.

In [88]:
df_most_popular_dest = df_trips.groupby(['To station id','To station name']).size().reset_index(name="count").sort_values("count", ascending=False).head(10)
print(df_most_popular_dest)

   To station id                           To station name  count
12          1012  North Shore Trail & Fort Duquesne Bridge    526
0           1000                  Liberty Ave & Stanwix St    495
57          1061                       33rd St & Penn Ave     453
70          1093                S Bouquet Ave & Sennott St    436
40          1041                  Fifth Ave & S Bouquet St    410
44          1045  S 27th St & Sidney St. (Southside Works)    398
78         49301                   Centre Ave & N Craig St    386
67          1084                Hot Metal St & Tunnel Blvd    329
17          1017                        21st St & Penn Ave    321
1           1001                Forbes Ave & Market Square    266


### Most popular trips
By grouping by both origin and destination, we can see that the most popular trips are those that end at their origin (casual rides that start and end at the same place). This holds for many of the busiest stations, as can be seen from the analysis below, in which we group the data by origin and destination, count the number of trips, and rank the pairs in descending order.

In [89]:
df_grouped = df_trips.groupby(['From station id','From station name', 'To station id','To station name']).size().reset_index(name="count").sort_values("count", ascending=False)
print(df_grouped.head(10))

     From station id                         From station name To station id  \
294             1012  North Shore Trail & Fort Duquesne Bridge          1012   
1459            1061                       33rd St & Penn Ave           1061   
0               1000                  Liberty Ave & Stanwix St          1000   
1662            1084                Hot Metal St & Tunnel Blvd          1084   
1189            1045  S 27th St & Sidney St. (Southside Works)          1045   
1637            1074               South Side Trail & S 4th St          1074   
1736            1093                S Bouquet Ave & Sennott St          1093   
1975           49301                   Centre Ave & N Craig St         49301   
1263            1048                     S 18th St & Sidney St          1048   
2198           49671                         9th St & Penn Ave          1017   

                               To station name  count  
294   North Shore Trail & Fort Duquesne Bridge    372  
1459   

We can see that the North Shore Trail & Fort Duquesne Bridge is the most popular station.

As you can see, when we aggregate the data to find the most popular trips, 9 of the top 10 begin and end at the same station.
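
The top 10 list only covers the most frequent origin and destination pairs. To check how common round trips are across all trips, one could compute their share directly. This is a sketch on made-up data, where `round_trip_share` is a hypothetical helper:

```python
import pandas as pd

def round_trip_share(trips):
    """Fraction of trips whose origin and destination station ids match."""
    same_station = trips['From station id'] == trips['To station id']
    return same_station.mean()

# Made-up sample: two of the four trips return to their starting station
toy_trips = pd.DataFrame({
    'From station id': ['1012', '1012', '1000', '1061'],
    'To station id':   ['1012', '1000', '1000', '1017'],
})
share = round_trip_share(toy_trips)
print(share)
```

Applied to the notebook's data, this would be `round_trip_share(df_trips)`.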

We will be visualizing this data later on in the form of a Google map.  
Now, we will also be attempting to look at trips that start and end at distinct stations.

In [90]:
df_trips_different_start_end = df_trips[df_trips['From station id'] != df_trips['To station id']]

Now, let's look at the same aggregation (most popular trips), but for trips that end at a different station. This can help us understand trips that are likely made for a purpose (people ride to a different station to run errands, get to work etc.), as opposed to trips that end at the same station, which are more likely to be for leisure:

In [91]:
df_trips_different_start_end = df_trips_different_start_end.groupby(['From station id','From station name', 'To station id','To station name']).size().reset_index(name="count").sort_values("count", ascending=False)

print(df_trips_different_start_end.head(10))

     From station id                  From station name To station id  \
2108           49671                  9th St & Penn Ave          1017   
1356            1059  Burns White Center at 3 Crossings          1060   
608             1024           S Negley Ave & Baum Blvd          1028   
1381            1060                 Penn Ave & 29th St          1059   
432             1017                 21st St & Penn Ave         49671   
1714            1094              O'Hara St & Desoto St         49301   
625             1024           S Negley Ave & Baum Blvd         49401   
1936           49401         Stanton Ave & N Negley Ave          1024   
1612            1088             Frazier St & Dawson St          1093   
1711            1094              O'Hara St & Desoto St          1097   

                           To station name  count  
2108                    21st St & Penn Ave     96  
1356                    Penn Ave & 29th St     66  
608   Penn Ave & Putnam St (Bakery Squar

We can see that the most popular trip is from 9th St & Penn Ave to 21st St & Penn Ave. We will visualize these trips later using Google Maps to see how they look geographically.

# Setting up a Network Graph
Before we visualize the trip information, it is useful to store the trips as a network graph so that we can extract useful analyses from it. For example, once we add all the stations as nodes, we can store their positions as attributes, and the edges can correspond to trips. This will allow us to take advantage of networkx's functionality to extract the degree and other network properties.

In [92]:
# Setting up the graph using networkx
X = nx.Graph()

We need to extract the positions data, as well as a list of all the station names.  
By setting up a set of nodes, we can ensure that the trips correspond to stations for which we have positional data.

In [93]:
posList = list(zip(df['Latitude'], df['Longitude']))
labels_list = list(df['Station Name'])
setOfNodes = set()

Now, we can add each station as a node to the graph, as well as edges for each distinct trip. To do this, we will go through each row of the stations dataframe and add it as a node to the graph. After this, we will go through the trips dataframe and add an edge for each trip. Because `nx.Graph` stores at most one edge per pair of nodes, repeated trips between the same two stations collapse into a single edge.

In [94]:
# Iterating through the stations
for (i, row) in df.iterrows():
    station_id = row.iloc[0]  # the first column holds the station id
    setOfNodes.add(station_id)
    X.add_node(station_id, pos=posList[i])

# Iterating through the trips, keeping only those between stations with known positions
for (i, row) in df_trips.iterrows():
    origin, dest = row['From station id'], row['To station id']
    if origin in setOfNodes and dest in setOfNodes:
        X.add_edge(origin, dest)


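Now, we will get the degree of each node in the graph.
The degree of a node is the number of edges that end at the node. Since our graph keeps a single edge per pair of stations, this number tells us how many distinct stations a given station is connected to by at least one trip (a trip that starts and ends at the same station forms a self-loop, which networkx counts twice). If we wish to compare stations based on this, we can store the degrees in a list and use it later when we visualize the network.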

In [95]:
degrees = X.degree
weights = []
for (k, v) in degrees:
    weights.append(v)
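
Because `nx.Graph` keeps at most one edge between any two nodes, adding the same trip twice does not change a node's degree. If per-trip counts are wanted, one option is `nx.MultiGraph`, which keeps parallel edges. This is an alternative to the approach used in this notebook, sketched here with made-up station ids:

```python
import networkx as nx

# Made-up trips: A-B is ridden three times, A-C once
toy_trips = [('A', 'B'), ('A', 'B'), ('A', 'B'), ('A', 'C')]

simple = nx.Graph()
simple.add_edges_from(toy_trips)   # repeated A-B trips collapse into one edge

multi = nx.MultiGraph()
multi.add_edges_from(toy_trips)    # every trip becomes its own edge

simple_degrees = dict(simple.degree)
multi_degrees = dict(multi.degree)
print(simple_degrees)
print(multi_degrees)
```

With the multigraph, the degree of a station equals the number of trip endpoints at that station, which is closer to a measure of traffic than the plain graph's count of distinct neighbors.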

# Using Google Maps to visualize the data

We will be using the gmaps library to display our location data in a form that can be visualized and understood. Let us try some basic functions of this library through some sample code below.

The gmaps library provides a figure function to draw maps. You can provide this function arguments such as the center of the map you want to draw and the zoom level of your view. For example, drawing the map of Pittsburgh:

In [96]:
pittsburgh = (40.4406, -79.9959) # (latitude, longitude)
fig = gmaps.figure(center=pittsburgh, zoom_level=12) # centering the map on Pittsburgh
fig

Figure(layout=FigureLayout(height='420px'))

Finally, before we visualize our trip data in the form of a map, we need to get the positional data in a format that can be consumed by the gmaps library (for example, a dictionary). We will use a networkx function to get the position attribute of each node as a dictionary.

In [97]:
posDict = nx.get_node_attributes(X,'pos')


## Plotting the trip data on Google Maps

Now, we can begin plotting our trip data on our maps.  
First, we can plot our list of positions for each station on the map. There are different ways to do this, including:
1) A heatmap to understand the density of how the stations are spaced out in Pittsburgh - We can also add weights for each position where each weight corresponds to the degree of the node (as referenced earlier)  
2) A set of markers depicting each of the locations along with some information corresponding to each.

### Drawing a heatmap

First, let's draw a heatmap showing the concentration of cycle stations. This is done via the gmaps heatmap_layer function. This function takes in a list of positions and, optionally, weights for each of those positions to plot a heatmap. We can also provide a point radius to specify how granular we want the heatmap to be (a larger radius means bigger groups).

In [98]:
heatmap_layer = gmaps.heatmap_layer(posList)
heatmap_layer.point_radius = 20
fig.add_layer(heatmap_layer)
fig

Figure(layout=FigureLayout(height='420px'))

We can now see that many stations are concentrated near areas such as Point Park, Downtown, and West Oakland/CMU. If you're familiar with Pittsburgh, this makes sense because these areas are common tourist destinations, commercial spaces, or student areas. gmaps also lets us overlay additional layers on the map for a given region.  
For example, gmaps allows you to display cycling routes and highlights them as black lines.
We can plot these along with the heatmap to see how the concentration of stations relates to the presence of bike routes in that area.

In [99]:
fig.add_layer(gmaps.bicycling_layer())
fig

Figure(layout=FigureLayout(height='420px'))

In general, areas with a higher concentration of stations also appear to have a higher concentration of bike lanes (although this alone does not establish a relationship between the two). There also appear to be some areas where the concentration of stations is high but bike routes are sparse (for example, East Liberty/Shadyside).

### Drawing markers

Now, we will draw markers to depict each station instead. We will do this using the gmaps marker_layer function. Similar to the heatmap_layer function, it takes in a list of all positions. We can edit each marker to have an info box that is displayed when it is clicked. Through this, we can display the station name.

In [100]:
# reinitializing figure to be visualized
pittsburgh = (40.4406, -79.9959) # (latitude, longitude)
fig = gmaps.figure(center=pittsburgh, zoom_level=12) # centering the map on Pittsburgh
# Setting up markers
markers = gmaps.marker_layer(posList)
for i in range(len(markers.markers)):
    markers.markers[i].info_box_content = labels_list[i] # info box is displayed with the station name when you click on a station
    markers.markers[i].display_info_box = True           # Displays the info box
fig.add_layer(markers)
fig

Figure(layout=FigureLayout(height='420px'))

### Most popular origin stations
We can also visualize the top 10 origins for trips. These show us the stations that are the most popular overall for those starting their trips.

In [101]:
# reinitializing figure to be visualized
pittsburgh = (40.4406, -79.9959) # (latitude, longitude)
fig = gmaps.figure(center=pittsburgh, zoom_level=12) # centering the map on Pittsburgh

# Map each top 10 station id to its name so that labels line up with positions
top10Names = dict(zip(df_most_popular_origin['From station id'], df_most_popular_origin['From station name']))
top10Positions = []
top10Labels = []
for node in posDict:
    if node in top10Names:
        top10Positions.append(posDict[node])
        top10Labels.append(top10Names[node])

# Setting up markers
markers = gmaps.marker_layer(top10Positions)
for i in range(len(markers.markers)):
    markers.markers[i].info_box_content = top10Labels[i] # info box is displayed with the station name when you click on a station
    markers.markers[i].display_info_box = True           # Displays the info box
fig.add_layer(markers)
fig

Figure(layout=FigureLayout(height='420px'))

### Most popular destination stations
We can also visualize the top 10 destinations for trips. These show us the stations where trips most often end.

In [102]:
# reinitializing figure to be visualized
pittsburgh = (40.4406, -79.9959) # (latitude, longitude)
fig = gmaps.figure(center=pittsburgh, zoom_level=12) # centering the map on Pittsburgh

# Map each top 10 station id to its name so that labels line up with positions
top10Names = dict(zip(df_most_popular_dest['To station id'], df_most_popular_dest['To station name']))
top10Positions = []
top10Labels = []
for node in posDict:
    if node in top10Names:
        top10Positions.append(posDict[node])
        top10Labels.append(top10Names[node])

# Setting up markers
markers = gmaps.marker_layer(top10Positions)
for i in range(len(markers.markers)):
    markers.markers[i].info_box_content = top10Labels[i] # info box is displayed with the station name when you click on a station
    markers.markers[i].display_info_box = True           # Displays the info box
fig.add_layer(markers)
fig

Figure(layout=FigureLayout(height='420px'))

## Representing trips as a network
We can represent trips in our dataset using functions of the gmaps library. Specifically, we can create a list of lines that corresponds to each edge in the network.  
First, we will plot the top 10 trips in our trips dataframe.  
Then we will plot the top 10 trips with distinct origins and destinations.

### Nodes
To add some visual hierarchy and reduce clutter, we will use symbols that differ from the default markers. By using different colors for stations in the top 10 trips and for all other nodes, we can create an informative visual.  
We do this using the gmaps symbol_layer function that takes in a list of positions for each node, colors, as well as an argument for the scale (size) of each node.

### Edges
The edges represent trips and can be drawn as lines using gmaps.Line, to which we will provide the start, end, stroke_color, and stroke_weight as arguments.

In [103]:
def visualize_trips(data):
    # reinitializing figure to be visualized
    pittsburgh = (40.4406, -79.9959) # (latitude, longitude)
    fig = gmaps.figure(center=pittsburgh, zoom_level=12) # centering the map on Pittsburgh

    # Create a list of lines to plot on the map as edges
    lines = []
    
    # Use the nodesInTop10 to track all the nodes in the top 10 trips
    nodesInTop10 = set()

    # For each trip in the data passed in
    for (i, row) in data.iterrows():
        origin, dest = row['From station id'], row['To station id']
        if origin in setOfNodes and dest in setOfNodes:
            lines.append(gmaps.Line(
                start=posDict[origin],
                end=posDict[dest],
                stroke_weight=3.0,
                stroke_color="black"
            ))
            nodesInTop10.add(origin)
            nodesInTop10.add(dest)

    # Create the colors input list for the symbol_layer function based on whether the node is in the top 10
    colors = []
    for node in posDict:
        if node in nodesInTop10:
            colors.append('blue')
        else:
            colors.append('red')

    # Create the symbol_layer with all the nodes
    symbol_layer = gmaps.symbol_layer(posList, info_box_content=labels_list, fill_color=colors, scale=3)
    # Create the layer with all the lines
    drawing = gmaps.drawing_layer(features=lines)
    # Add the layers to the figure
    fig.add_layer(symbol_layer)
    fig.add_layer(drawing)
    # Return the figure so the caller can display it
    return fig

fig = visualize_trips(df_trips_different_start_end.head(10))
fig

Figure(layout=FigureLayout(height='420px'))
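
All lines drawn above share the same `stroke_weight`. To make busier routes stand out, the trip counts could be rescaled into a range of line widths before each `gmaps.Line` is built. This is a minimal sketch: the helper `scale_stroke_weights` and the default width range are assumptions for illustration, and the counts are example values.

```python
def scale_stroke_weights(counts, min_weight=1.0, max_weight=6.0):
    """Linearly rescale trip counts to line widths in [min_weight, max_weight]."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # All routes are equally popular: use the midpoint width for each
        return [(min_weight + max_weight) / 2 for _ in counts]
    span = max_weight - min_weight
    return [min_weight + span * (c - lo) / (hi - lo) for c in counts]

# Example trip counts for three routes (illustrative values)
weights = scale_stroke_weights([96, 66, 36])
print(weights)
```

Each resulting value could then be passed as the `stroke_weight` of the corresponding line.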

# Summary and references
This tutorial showed how to visualize and understand the Pittsburgh HealthyRide network's data using libraries like Pandas, networkx, and gmaps.
You can look into these libraries and more:
1) pandas: https://pandas.pydata.org/  
2) networkx: https://networkx.org/  
3) gmaps: https://jupyter-gmaps.readthedocs.io/  
4) Pittsburgh HealthyRide: https://healthyridepgh.com/  