# Introduction

In this tutorial, we will visualize and draw insights from Pittsburgh's HealthyRide bicycle network data, which was fetched from https://healthyridepgh.com/media-kit/ and https://healthyridepgh.com/data/. To analyze this data, we will explore a few libraries that help with network data, mapping, and data analysis. By visualizing the data, we hope to understand common ride patterns, biking hotspots, and how both relate to other biking infrastructure (bike lanes etc.), along with other insights into trips.

### Libraries
To ingest and clean our data, we will be using Pandas. We will also use it to analyze the data and perform operations such as aggregations. Pandas offers functionality to process the data, clean it, and make it usable for network analysis and other libraries. More information about Pandas can be found here: https://pandas.pydata.org/

The data that we are looking at primarily involves trip information (origin, destination, time etc.). Therefore, we can best process this data in the form of a graph, and then use that graph to conduct further analysis. Python offers various graph libraries, but the most popular one with extensive documentation is networkx. More info can be found here: https://networkx.org/. It offers extensive functionality to create networks from data (including pandas dataframes) and understand properties of these networks.

Finally, to visualize our data geographically, we will be using Google Maps. Google Maps has extensive built-in infrastructure and capabilities for representing geographical data. To use Google Maps in a Jupyter notebook, we can use the gmaps library, which is installable through conda. This library builds on the Google Maps API to display Google Maps in a Jupyter Notebook environment. More information about this library can be found here: https://jupyter-gmaps.readthedocs.io/

### Installing the libraries
To install these libraries, you must first run some commands in your local miniconda environment.
For pandas, run 
```
conda install pandas
```
For networkx, run
```
conda install networkx
```
For gmaps, first terminate your existing notebook process. Then, in a new conda environment, run the following commands:
```
conda install -c conda-forge gmaps
jupyter nbextension enable --py --sys-prefix widgetsnbextension
jupyter nbextension enable --py --sys-prefix gmaps
```
After you run these commands, restart your notebook and you should be able to view the maps. Run these commands below to import the libraries and start consuming them:

In [101]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import gmaps
import gmaps.datasets

gmaps.configure(api_key='YOUR_API_KEY') # Fill in with your API key

### Fetching the trip and local stations data
To get started, navigate to https://healthyridepgh.com/data/ and download the 2021 Q1 data zip file. Unzip this file, and move the 2 csv files into the project directory. The names of the 2 files should be 'healthy-ride-rentals-2021-q1.csv' and 'healthy-ride-station-locations-2021-q1.csv'  
  
We will now use the built-in pandas csv reader to load this data, reading every column as a string and then converting individual columns to more suitable types.

In [168]:
# Reading the trip data
df_trips = pd.read_csv('healthy-ride-rentals-2021-q1.csv', dtype="str", keep_default_na=False)
# Dropping columns that are not as useful
df_trips = df_trips.drop(['Bikeid', 'Usertype'], axis=1)
# Dropping rows with empty origin and/or dest info
df_trips = df_trips.drop(df_trips[df_trips['To station id'] == ''].index)
df_trips = df_trips.drop(df_trips[df_trips['From station id'] == ''].index)
# Convert the tripduration and dates to the relevant datatypes
df_trips = df_trips.astype({'Tripduration': 'int32'})
df_trips['Starttime'] = pd.to_datetime(df_trips['Starttime'])
df_trips['Stoptime'] = pd.to_datetime(df_trips['Stoptime'])

     Trip id           Starttime            Stoptime  Tripduration  \
0  111375309 2021-01-12 12:48:00 2021-01-12 13:04:00           963   
1  111390480 2021-01-13 09:32:00 2021-01-13 09:39:00           387   

  From station id            From station name To station id  \
0           49641           11th St & Penn Ave          1061   
1           49391  E Liberty Blvd & Negley Ave          1064   

                   To station name  
0              33rd St & Penn Ave   
1  Frankstown Ave & E Liberty Blvd  


Now, we will repeat the same process for the station locations:

In [162]:
df = pd.read_csv('healthy-ride-station-locations-2021-q1.csv', dtype="str", keep_default_na=False)
# Dropping columns that are not as useful
df = df.drop(['# of Racks'], axis=1)
df = df.astype({'Latitude': 'float64', 'Longitude' : 'float64'})
# Fixing an error in the input data (it used the wrong sign for latitude data)
df.loc[86, 'Latitude'] = 40.467715
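
Hard-coding the row index works for this particular file, but it will silently miss any other bad coordinates. As a rough sketch of a more general check, the snippet below flags stations whose coordinates fall outside a bounding box around Pittsburgh. The toy station table and the box limits are assumptions made for illustration, not values taken from the HealthyRide data.

```python
import pandas as pd

# Toy station table; names and coordinates are made up for illustration.
stations = pd.DataFrame({
    "Station Name": ["A", "B", "C"],
    "Latitude": [40.4417, -40.4677, 40.4285],  # station B has a flipped sign
    "Longitude": [-80.0043, -79.9250, -79.9650],
})

# Rough bounding box around Pittsburgh (assumed values, adjust as needed).
LAT_RANGE = (40.3, 40.6)
LON_RANGE = (-80.2, -79.8)

# A station is plausible only if both coordinates fall inside the box.
in_box = (
    stations["Latitude"].between(*LAT_RANGE)
    & stations["Longitude"].between(*LON_RANGE)
)
suspect = stations[~in_box]
print(suspect)
```

Any rows printed here would be worth inspecting by hand before deciding how to correct them.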

TODOS:
- Add labels, colors to the nodes, increase canvas size
- (DONE) Represent cycling data as graph with proper relative positions
- (DONE) Sizes based on number of trips to a given node
- (DONE) Sizes based on number of trips starting at a given node
- (DONE) Find average time for each start, end dest.
- Find avg trip time on weekends vs weekdays
- Find avg trip time on mornings vs evenings
- (DONE) Visualize most popular trips
- Find additional data to analyze via network and gmaps/pandas

# Initial insights into trip patterns

Now that we have loaded the data from the 2 csv files (trips and station information), we can look at the aggregated data. Immediately, we can see that the most frequent trips are round trips that start and end at the same station. This is true for many stations, as can be seen from the analysis below, in which we group the data by origin and destination, count the number of trips, and rank the pairs in descending order.

In [170]:
df_grouped = df_trips.groupby(['From station id','From station name', 'To station id','To station name']).size().reset_index(name="count").sort_values("count", ascending=False)

     From station id                         From station name To station id  \
294             1012  North Shore Trail & Fort Duquesne Bridge          1012   
1459            1061                       33rd St & Penn Ave           1061   
0               1000                  Liberty Ave & Stanwix St          1000   
1662            1084                Hot Metal St & Tunnel Blvd          1084   
1189            1045  S 27th St & Sidney St. (Southside Works)          1045   
1637            1074               South Side Trail & S 4th St          1074   
1736            1093                S Bouquet Ave & Sennott St          1093   
1975           49301                   Centre Ave & N Craig St         49301   
1263            1048                     S 18th St & Sidney St          1048   
2198           49671                         9th St & Penn Ave          1017   

                               To station name  count  
294   North Shore Trail & Fort Duquesne Bridge    372  
1459   

We can see that the most popular trip is a round trip that starts and ends at North Shore Trail & Fort Duquesne Bridge, with 372 rides.

As you can see, when we aggregate the data to find the most popular trips, 9 of the top 10 begin and end at the same station.
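
The top-10 list only covers the most frequent station pairs, so it does not tell us what share of all trips are round trips. One way to check this, sketched here on a small made-up table that reuses the column names from the trip data, is to compare the origin and destination columns row by row:

```python
import pandas as pd

# Toy trips using the same column names as the rentals file (values are illustrative).
trips = pd.DataFrame({
    "From station id": ["1012", "1012", "1061", "49671", "1000"],
    "To station id":   ["1012", "1012", "1061", "1017",  "1001"],
})

# True where a trip starts and ends at the same station.
is_round_trip = trips["From station id"] == trips["To station id"]

# The mean of a boolean column is the fraction of True values.
round_trip_share = is_round_trip.mean()
print(f"{round_trip_share:.0%} of trips return to their origin station")
```

Running the same comparison on `df_trips` would give the overall share for the quarter.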

We will be visualizing this data later on in the form of a Google map.  
Now, we will also be attempting to look at trips that start and end at distinct stations.

In [174]:
df_trips_different_start_end = df_trips[df_trips['From station id'] != df_trips['To station id']]

Now, let's look at the same aggregation (most popular trips), but for trips that end at a different station. This can help us understand trips that likely have a purpose (people ride to a different station to run errands, get to work etc.), as opposed to trips that end at the same station, which are more likely to be for leisure:

In [175]:
df_trips_different_start_end = df_trips_different_start_end.groupby(['From station id','From station name', 'To station id','To station name']).size().reset_index(name="count").sort_values("count", ascending=False)

print(df_trips_different_start_end.head(10))

     From station id                  From station name To station id  \
2108           49671                  9th St & Penn Ave          1017   
1356            1059  Burns White Center at 3 Crossings          1060   
608             1024           S Negley Ave & Baum Blvd          1028   
1381            1060                 Penn Ave & 29th St          1059   
432             1017                 21st St & Penn Ave         49671   
1714            1094              O'Hara St & Desoto St         49301   
625             1024           S Negley Ave & Baum Blvd         49401   
1936           49401         Stanton Ave & N Negley Ave          1024   
1612            1088             Frazier St & Dawson St          1093   
1711            1094              O'Hara St & Desoto St          1097   

                           To station name  count  
2108                    21st St & Penn Ave     96  
1356                    Penn Ave & 29th St     66  
608   Penn Ave & Putnam St (Bakery Squar

We can see the most popular trip is from 9th St & Penn Ave to 21st St & Penn Ave. We will visualize these trips later using Google maps as well to see how these trips look geographically.
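
Several of the pairs above show up in both directions (for example 9th St & Penn Ave and 21st St & Penn Ave). If direction does not matter for a given question, the two directions can be merged into a single count. The sketch below shows one possible way to do this on made-up data; the column names `station_a` and `station_b` are hypothetical.

```python
import pandas as pd

# Toy trips between distinct stations (values are illustrative).
trips = pd.DataFrame({
    "From station id": ["49671", "1017", "49671", "1059", "1060"],
    "To station id":   ["1017", "49671", "1017", "1060", "1059"],
})

# Sort the two ids within each trip so that A -> B and B -> A share one key.
# The ids are strings, so the order is lexicographic; it only needs to be consistent.
ends = [
    sorted(pair)
    for pair in zip(trips["From station id"], trips["To station id"])
]
undirected = pd.DataFrame(ends, columns=["station_a", "station_b"])

pair_counts = (
    undirected.groupby(["station_a", "station_b"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(pair_counts)
```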

# Setting up a Network Graph
Before we visualize the trip information, it is useful to store the trips as a network graph so that we can extract useful analyses from it. For example, once we add all the stations as nodes, we can store their positions as attributes, and the edges can correspond to trips. This will allow us to take advantage of networkx's functionality to extract the degree and other network properties.

In [176]:
# Setting up the graph using networkx
X = nx.Graph()

We need to extract the positions data, as well as a list of all the station names.  
By setting up a set of nodes, we can ensure that the trips correspond to stations for which we have positional data.

In [177]:
posList = list(zip(df['Latitude'], df['Longitude']))
labels_list = list(df['Station Name'])
setOfNodes = set()

Now, we can add each station as a node to the graph, as well as edges for each distinct trip. To do this, we will go through each row of the stations dataframe and add these as nodes to the graph. After this, we will go through the trips dataframe and add an edge for each trip. 

In [131]:
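# Iterating through the stations (the first column holds the station id)
for (i, row) in df.iterrows():
    station_id = row.iloc[0]
    setOfNodes.add(station_id)
    X.add_node(station_id, pos=posList[i])

# Iterating through the trips, keeping only those between known stations
for (i, row) in df_trips.iterrows():
    origin = row['From station id']
    dest = row['To station id']
    if origin in setOfNodes and dest in setOfNodes:
        X.add_edge(origin, dest)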
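
As an alternative to looping over rows, networkx can build a graph directly from a dataframe with `nx.from_pandas_edgelist`. The sketch below is not the approach used in this tutorial; it assumes we first aggregate trips into one row per station pair and store the trip count in a hypothetical `n_trips` column, and it uses a directed graph so that A to B and B to A stay separate.

```python
import networkx as nx
import pandas as pd

# Toy trips using the same column names as the rentals file (values are illustrative).
trips = pd.DataFrame({
    "From station id": ["1012", "1012", "1012", "1017", "49671"],
    "To station id":   ["1012", "1017", "1017", "49671", "1017"],
})

# One row per (origin, destination) pair with the number of trips.
edge_counts = (
    trips.groupby(["From station id", "To station id"])
    .size()
    .reset_index(name="n_trips")
)

# Build a directed graph whose edges carry the trip count as an attribute.
G = nx.from_pandas_edgelist(
    edge_counts,
    source="From station id",
    target="To station id",
    edge_attr="n_trips",
    create_using=nx.DiGraph,
)
print(G["1012"]["1017"]["n_trips"])
```

Keeping the count on the edge preserves information that a plain `nx.Graph` with one edge per pair would discard.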

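Now, we will get the degree of each node in the graph.
The degree of a node is the number of edges that end at the node. Because nx.Graph keeps at most one edge between any pair of stations, this number tells us how many distinct stations a given station is connected to by at least one trip (a round trip adds a self-loop, which counts as 2), rather than the total number of trips. If we wish to compare stations based on this, we can store the degrees in a list and use it later when we visualize the network.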

In [180]:
degrees = X.degree
weights = []
for (k, v) in degrees:
    weights.append(v)
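
To make the meaning of the degree concrete, the following sketch compares a simple graph with a multigraph on a handful of made-up trips. The station ids are placeholders. In the simple graph repeated trips collapse into one edge, while the multigraph keeps one edge per trip.

```python
import networkx as nx

# Hypothetical trips as (origin, destination) station ids.
trips = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]

# nx.Graph stores a single undirected edge per station pair.
simple = nx.Graph()
simple.add_edges_from(trips)

# nx.MultiGraph stores every trip as its own edge.
multi = nx.MultiGraph()
multi.add_edges_from(trips)

simple_degrees = dict(simple.degree)
multi_degrees = dict(multi.degree)
print(simple_degrees)
print(multi_degrees)
```

If the goal is to size stations by trip volume, the multigraph degree (or a weighted degree on a graph that stores trip counts) is the closer match.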

# Using Google Maps to visualize the data

We will be using the gmaps library to display our location data in a form that can be visualized and understood. Let us try some basic functions of this library through some sample code below.

In [None]:
pittsburgh = (40.4406, -79.9959)
fig = gmaps.figure(center=pittsburgh, zoom_level=12)

Finally, before we visualize our data in the form of a map, we need to get the positional data in a format that can be consumed by the gmaps library (for example, a dictionary). We will use a networkx function to get the position attribute of each node as a dictionary.

In [179]:
posDict = nx.get_node_attributes(X,'pos')

Next, we plot these coordinates on a Google map and, where possible, show the connections between stations.

Drawing the map of Pittsburgh:

Drawing a heatmap showing the concentration of cycle stations:

In [114]:
heatmap_layer = gmaps.heatmap_layer(posList, weights=weights)
heatmap_layer.point_radius = 20
fig.add_layer(heatmap_layer)
fig

Figure(layout=FigureLayout(height='420px'))

In [115]:
fig.add_layer(gmaps.bicycling_layer())
fig

Figure(layout=FigureLayout(height='420px'))

Drawing markers showing each cycle station:

In [116]:
fig = gmaps.figure()
markers = gmaps.marker_layer(posList)
for i in range(len(markers.markers)):
    #     Adding hover text for the markers
    markers.markers[i].info_box_content = labels_list[i]
    markers.markers[i].display_info_box = True
fig.add_layer(markers)
fig

Figure(layout=FigureLayout(height='420px'))

In [136]:
features = []

data = df_grouped.head(20)

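# Draw a line for each of the top 20 trips
def calculate_trips1(trips):
    for (i, row) in trips.iterrows():
        start_id = row['From station id']
        end_id = row['To station id']
        if start_id in setOfNodes and end_id in setOfNodes:
            features.append(gmaps.Line(
                start=posDict[start_id],
                end=posDict[end_id],
                stroke_weight=3.0,
                stroke_color="black"
            ))

# Stations that appear in at least one of the top trips
setOfPoints = set()

for (i, row) in data.iterrows():
    setOfPoints.add(row['From station id'])
    setOfPoints.add(row['To station id'])

# Color stations on a top trip blue and all others red
colors = []
for station_id in posDict:
    if station_id in setOfPoints:
        colors.append('blue')
    else:
        colors.append('red')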

        
symbol_layer = gmaps.symbol_layer(posList, info_box_content=labels_list, fill_color=colors, scale=3)

calculate_trips1(data)
drawing = gmaps.drawing_layer(features=features)
markers = gmaps.marker_layer(posList)
fig.add_layer(symbol_layer)
fig.add_layer(drawing)
fig

Figure(layout=FigureLayout(height='420px'))

In [109]:
df_grouped = df_trips.groupby(['From station id', 'To station id']).agg({"Tripduration" : ["min", "max", "mean"]}).reset_index()
print(df_grouped.head(10))

  From station id To station id Tripduration                     
                                         min    max          mean
0            1000          1001          200  70805   9878.060606
1            1000          1002          206   3296   2201.750000
2            1000          1003        57710  57710  57710.000000
3            1000          1006          351    463    413.750000
4            1000          1007          751   1742   1248.090909
5            1000          1008          574    731    640.200000
6            1000          1010         1476   4130   3241.000000
7            1000          1011         8021  69580  38800.500000
8            1000          1012          570  53944   4786.217391
9            1000          1013          693  56621   9252.000000
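
Several of the maximum durations above are many hours long, which pulls the mean far away from a typical ride. A possible way to get a more representative figure, sketched on made-up durations, is to report the median alongside the mean or to drop very long rentals first. The 3-hour cutoff below is an assumption chosen only for illustration.

```python
import pandas as pd

# Toy trips for a single station pair; one rental is extremely long.
trips = pd.DataFrame({
    "From station id": ["1000"] * 4,
    "To station id": ["1001"] * 4,
    "Tripduration": [200, 400, 600, 70805],  # seconds
})

# The median is much less sensitive to the single long rental than the mean.
stats = (
    trips.groupby(["From station id", "To station id"])["Tripduration"]
    .agg(["mean", "median"])
    .reset_index()
)
print(stats)

# Assumed cutoff: treat rentals longer than 3 hours as outliers.
MAX_SECONDS = 3 * 60 * 60
trimmed_mean = trips.loc[
    trips["Tripduration"] <= MAX_SECONDS, "Tripduration"
].mean()
print(trimmed_mean)
```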


In [None]:
# dayofweek runs from Monday=0 to Sunday=6, so weekends are 5 (Saturday) and 6 (Sunday)
df_weekends = df_trips[df_trips['Starttime'].dt.dayofweek >= 5]
df_grouped = df_weekends.groupby(['From station id', 'To station id']).agg({"Tripduration" : ["min", "max", "mean"]}).reset_index()
print(df_grouped.head(10))
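
To compare weekends with weekdays directly, it can help to label every trip and aggregate in one pass instead of filtering twice. The sketch below does this on a few made-up trips; the `day_type` column is a hypothetical helper.

```python
import pandas as pd

# Toy trips; the dates were chosen to cover weekdays and a weekend.
trips = pd.DataFrame({
    "Starttime": pd.to_datetime([
        "2021-01-11 08:00",  # Monday
        "2021-01-15 18:00",  # Friday
        "2021-01-16 10:00",  # Saturday
        "2021-01-17 15:00",  # Sunday
    ]),
    "Tripduration": [300, 500, 1200, 1800],  # seconds
})

# dayofweek is 0 for Monday through 6 for Sunday.
is_weekend = trips["Starttime"].dt.dayofweek >= 5
trips["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

mean_by_day_type = trips.groupby("day_type")["Tripduration"].mean()
print(mean_by_day_type)
```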

In [None]:
df_mornings = df_trips[df_trips['Starttime'].dt.hour < 12]
df_grouped = df_mornings.groupby(['From station id', 'To station id']).agg({"Tripduration" : ["min", "max", "mean"]})
print(df_grouped.head(10))

In [None]:
df_evenings = df_trips[df_trips['Starttime'].dt.hour > 16]
df_grouped = df_evenings.groupby(['From station id', 'To station id']).agg({"Tripduration" : ["min", "max", "mean"]}).reset_index()
print(df_grouped.head(10))
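
The morning and evening filters above can also be expressed as a single binning step, which makes the comparison between periods easier to read. This sketch uses `pd.cut` on the start hour with the same boundaries as the cells above (before 12:00 for mornings, 17:00 onward for evenings); the `period` column and the "afternoon" label for the hours in between are assumptions.

```python
import pandas as pd

# Toy trips with start times spread across the day.
trips = pd.DataFrame({
    "Starttime": pd.to_datetime([
        "2021-01-12 08:15",
        "2021-01-12 13:30",
        "2021-01-12 18:45",
        "2021-01-13 09:05",
    ]),
    "Tripduration": [300, 600, 900, 500],  # seconds
})

# Left-closed bins: [0, 12) morning, [12, 17) afternoon, [17, 24) evening.
trips["period"] = pd.cut(
    trips["Starttime"].dt.hour,
    bins=[0, 12, 17, 24],
    right=False,
    labels=["morning", "afternoon", "evening"],
)

mean_by_period = trips.groupby("period", observed=True)["Tripduration"].mean()
print(mean_by_period)
```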