<h1>Citibike Network Assignment</h1>
<li>The file, 201809-citibike-tripdata.csv, contains citibike trip data from January 2014 (a reasonable sized file!)
<li>The data:<br>
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
<li>Each record in the data is a trip 
<li>The data is described at https://www.citibikenyc.com/system-data

<h1>STEP 1: Read the data into a dataframe</h1>
<li>Convert station ids to str if necessary

In [60]:
import pandas as pd
import numpy as np
datafile = "201809-citibike-tripdata.csv"
df = pd.read_csv(datafile)
df

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,1635,2018-09-01 00:00:05.2690,2018-09-01 00:27:20.6340,252.0,MacDougal St & Washington Sq,40.732264,-73.998522,366.0,Clinton Ave & Myrtle Ave,40.693261,-73.968896,25577,Subscriber,1980,1
1,132,2018-09-01 00:00:11.2810,2018-09-01 00:02:23.4810,314.0,Cadman Plaza West & Montague St,40.693830,-73.990539,3242.0,Schermerhorn St & Court St,40.691029,-73.991834,34377,Subscriber,1969,0
2,3337,2018-09-01 00:00:20.6490,2018-09-01 00:55:58.5470,3142.0,1 Ave & E 62 St,40.761227,-73.960940,3384.0,Smith St & 3 St,40.678724,-73.995991,30496,Subscriber,1975,1
3,436,2018-09-01 00:00:21.7460,2018-09-01 00:07:38.5830,308.0,St James Pl & Oliver St,40.713079,-73.998512,3690.0,Park Pl & Church St,40.713342,-74.009355,28866,Subscriber,1984,2
4,8457,2018-09-01 00:00:27.3150,2018-09-01 02:21:25.3080,345.0,W 13 St & 6 Ave,40.736494,-73.997044,380.0,W 4 St & 7 Ave S,40.734011,-74.002939,20943,Customer,1994,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1877879,369,2018-09-30 00:27:25.9840,2018-09-30 00:33:35.7070,3601.0,Sterling St & Bedford Ave,40.662706,-73.956912,3631.0,Crown St & Bedford Ave,40.666563,-73.956741,32976,Subscriber,1985,2
1877880,191,2018-09-30 00:30:30.1850,2018-09-30 00:33:42.1090,3601.0,Sterling St & Bedford Ave,40.662706,-73.956912,3631.0,Crown St & Bedford Ave,40.666563,-73.956741,15595,Subscriber,1985,2
1877881,1442,2018-09-30 08:10:03.1790,2018-09-30 08:34:05.3870,3601.0,Sterling St & Bedford Ave,40.662706,-73.956912,471.0,Grand St & Havemeyer St,40.712868,-73.956981,28646,Subscriber,1981,1
1877882,453,2018-09-30 12:20:13.6830,2018-09-30 12:27:46.9140,3601.0,Sterling St & Bedford Ave,40.662706,-73.956912,3584.0,Eastern Pkwy & Franklin Ave,40.670777,-73.957680,34272,Subscriber,1986,1


<h1>STEP 2: Basic cleaning</h1>
<li>Remove data that have any nans in any row (none in this file but others do have nans)
<li>and convert stationids to str 

In [61]:
df = df.dropna()
df['start station id'] = df['start station id'].astype(str)
df['end station id'] = df['end station id'].astype(str)
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


<h1>STEP 3: Write a function that returns a graph given a citibike data frame</h1> 
<li>Your function should return two things:
<ol>
<li>a graph
<li>a dictionary with station ids as the key and station name as the value
</ol>
<li>The graph should contain 
<ol>
<li>nodes (station ids)
<li>edges (station id, station id)
<li>edge data 
<ol>
<li>count: number of trips on the edge
<li>time: average duration - pickup to dropoff - on that edge
</ol>
</ol>
<li><b>Note:</b> the edge (x1,y1) is the same as (y1,x1) even though the start station ids and end station ids are flipped in the dataframe

In [73]:
def get_citibike_graph(df):
    import networkx as nx
    G = nx.Graph()   # a graph
    node_names = dict()  # a dictionary of node names

    #YOUR CODE GOES HERE
    import numpy as np
    import pandas as pd
    s1 = df['start station id'].unique()
    s2 = df['end station id'].unique()
    s1 = np.append(s1, s2)
    nodes = list(set(list(s1)))
    for x in range(0, len(nodes)):
        try:
            if nodes[x] not in node_names:
                node_names[nodes[x]] = df[df['start station id'] == nodes[x]]['start station name'].iloc[0]
        except:
            continue
            
    edges = list()
    temp = list()
    for i in nodes:
        series = df[df['start station id'] == i]['end station id']
        for j in range(0, len(series)):
            if(series.iloc[j] not in temp):
                edges.append((i, series.iloc[j]))
        temp.append(i)
        
    weighted = list()
    for edge in edges:
        A = len(df[((df['start station id'] == edge[0]) & (df['end station id'] == edge[1])) | 
               ((df['start station id'] == edge[1]) & (df['end station id'] == edge[0]))])
        B = np.mean(df[((df['start station id'] == edge[0]) & (df['end station id'] == edge[1])) | 
                   ((df['start station id'] == edge[1]) & (df['end station id'] == edge[0]))]['tripduration'])
        temp_list = [edge[0], edge[1], A, B]
        weighted.append(temp_list)
    
    for e in weighted:
        G.add_edge(e[0], e[1], trips = e[2], time = e[3])
    
    return G,node_names
    

<h1>STEP 4: Create the following graphs using the function above</h1>
<li>G: A graph of all the data in the dataframe
<li>m_G: A graph containing only data from male riders
<li>f_G: A graph containing only data from female riders
<li>Note: for m_G and f_G you will need to extract data from the dataframe

In [57]:
import networkx as nx
%matplotlib inline
import matplotlib.pyplot as plt

In [74]:
G,nodes=get_citibike_graph(df)

KeyboardInterrupt: 

In [None]:
nx.draw(G) 

In [None]:
m_G,nodes=get_citibike_graph(df[df['gender'] == 1])   # male_riders
nx.draw(m_G)

In [None]:
f_G,nodes=get_citibike_graph(df[df['gender'] == 2])   # female_riders
nx.draw(f_G)

<h1>STEP 5: Answer the following questions for each of the graphs</h1>
<ol>
<li>Which station (name) is the best connected (max degree)?
<li>Travel between which pair of stations is the longest in terms of average duration between bike pickups and dropoffs. Report both the two stations as well as the time in minutes
<li>Which edge is associated with the most number of trips?
<li>Which station is the most central?
<li>Which node is a bottleneck node?

Which station (name) has the greatest number of connections (max degree)?

In [None]:
d1 = nx.degree(G)
l1 = list(d1)
all_val = max(l1, key=lambda x: x[1])
#all_val = max(list(nx.degree(G)), key = lambda x: x[1])

d2 = nx.degree(f_G)
l2 = list(d2)
f_val = max(l2, key=lambda x: x[1])

d3 = nx.degree(m_G)
l3 = list(d3)
m_val = max(l3, key=lambda x: x[1])

print("Busiest female station:", nodes[f_val[0]])
print("Busiest male station:", nodes[m_val[0]])
print("Busiest station:", nodes[all_val[0]])

Busiest female station E 17 St & Broadway
Busiest male station Lawrence St & Willoughby St
Busiest station Lawrence St & Willoughby St


Travel between which pair of stations is the longest in terms of average duration between bike pickups and dropoffs

In [None]:
all_long = max(distance_all, key=lambda x: x[3])
print("Longest average distance all:", nodes[all_long[0]],"to", nodes[all_long[1]],".","Minutes:",all_long[3]/60)

m_long = max(distance_m, key=lambda x: x[3])
print("Longest average males:", nodes[m_long[0]],"to", nodes[m_long[1]],".","Minutes:",m_long[3]/60)

f_long = max(distance_f, key=lambda x: x[3])
print("Longest average females:", nodes[f_long[0]],"to", nodes[f_long[1]],".","Minutes:",f_long[3]/60)

In [10]:
#Note: I've printed the max edges but you don't need to print them

Longest average distance males:  W 43 St & 6 Ave  to  Warren St & Church St . Minutes:  6640
Longest average distance females:  S Portland Ave & Hanson Pl  to  Flushing Ave & Carlton Ave . Minutes:  9093
Longest average distance all:  Flushing Ave & Carlton Ave  to  S Portland Ave & Hanson Pl . Minutes:  9093
('524', '152', {'trips': 5, 'time': 398427.4})
('353', '242', {'trips': 1, 'time': 545583.0})
('242', '353', {'trips': 1, 'time': 545583.0})


Which edge is associated with the most number of trips?

In [None]:
all_long = max(distance_all, key=lambda x: x[2])
print("Most trip route all:", nodes[all_long[0]],"to", nodes[all_long[1]],".","Minutes:",all_long[3]/60)

m_long = max(distance_m, key=lambda x: x[2])
print("Most trip route males:", nodes[m_long[0]],"to", nodes[m_long[1]],".","Minutes:",m_long[3]/60)

f_long = max(distance_f, key=lambda x: x[2])
print("Most trip route females:", nodes[f_long[0]],"to", nodes[f_long[1]],".","Minutes:",f_long[3]/60)

most trip route males:  E 43 St & Vanderbilt Ave  to  W 41 St & 8 Ave . Minutes:  7
most trip route females:  Lafayette St & E 8 St  to  E 7 St & Avenue A . Minutes:  5
most trip route all:  E 43 St & Vanderbilt Ave  to  W 41 St & 8 Ave . Minutes:  7


In [None]:
c_all = nx.closeness_centrality(G)
call = sorted(c_all.items(),key = lambda x: x[1],reverse = True)

c_f = nx.closeness_centrality(f_G)
cf = sorted(c_f.items(),key = lambda x: x[1],reverse = True)

c_m = nx.closeness_centrality(m_G)
cm = sorted(c_m.items(),key = lambda x: x[1],reverse = True)

print("The most central node:", nodes[call[0][0]])
print("The most central female node:", nodes[cf[0][0]])
print("The most central male node:", nodes[cm[0][0]])

In [None]:
c_all = nx.betweenness_centrality(G)
call = sorted(c_all.items(),key = lambda x: x[1],reverse = True)

c_f = nx.betweenness_centrality(f_G)
cf = sorted(c_f.items(),key = lambda x: x[1],reverse = True)

c_m = nx.betweenness_centrality(m_G)
cm = sorted(c_m.items(),key = lambda x: x[1],reverse = True)

print("The most bottleneck node:", nodes[call[0][0]])
print("The most bottleneck female node:", nodes[cf[0][0]])
print("The most bottleneck male node:", nodes[cm[0][0]])

<h2>Centrality</h2>
One of the concerns that the citibike system has to deal with is ensuring that no station has empty slots (a bike should always be available) and that no station should have no empty slots (you should be able to return a bike). To do this, it needs to monitor the movement of bikes through the system, ideally using a directed graph. Though our graph is not directed, we can look at some network characteristics that will help us answer these questions. Note that the "trips" feature in edge data captures flows.
<li>Which node is a possible bottleneck node in terms of bike flows?
<li>Which node is the "nearest" to all other nodes (irrespective of flows)
<li>Which node is the "nearest" to all other nodes (in terms of distance = time)


In [None]:

c_all_1 = nx.betweenness_centrality(G, weight='trips')
bottleneck = sorted(c_all_1.items(),key = lambda x: x[1],reverse = True)

c_all_2 = nx.closeness_centrality(G)
close1 = sorted(c_all_2.items(),key = lambda x: x[1],reverse = True)

c_all_3 = nx.closeness_centrality(G, distance='time')
close2 = sorted(c_all_3.items(),key = lambda x: x[1],reverse = True)

print("Most central in connectivity:", nodes[close1[0][0]])
print("Most central in connectivity using time as distance:", nodes[close2[0][0]])
print("Bottleneck node:", nodes[bottleneck[0][0]])

Most central in connectivity Lawrence St & Willoughby St
Most central in connectivity using time as distance Fulton St & Rockwell Pl
Bottleneck node Atlantic Ave & Fort Greene Pl
