# Name-Subham Kedia
# UNI-sk4355
# Assignment-7

<h1>Citibike Network Assignment</h1>
<li>The file, 2014-01 - Citi Bike trip data.csv, contains citibike trip data from January 2014 (a reasonably sized file!)
<li>The data:<br>
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
<li>Each record in the data is a trip 
<li>The data is described at https://www.citibikenyc.com/system-data

In [21]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import networkx as nx
from networkx.algorithms import closeness_centrality
from networkx.algorithms import communicability
from collections import OrderedDict

<h1>STEP 1: Read the data into a dataframe</h1>
<li>Convert station ids to str if necessary

In [22]:
# pandas (pd) and numpy (np) are already imported in the first cell
datafile = "2014-01 - Citi Bike trip data.csv"
df1 = pd.read_csv(datafile)

<h1>STEP 2: Basic cleaning</h1>
<li>Remove rows that contain any NaNs (there are none in this file, but other files do have them)
<li>and convert stationids to str 

In [23]:
df1 = df1.dropna()

In [24]:
df1['start station id'] = df1['start station id'].astype(str)
df1['end station id'] = df1['end station id'].astype(str)
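
The string conversion can also happen while the file is read. One reason to consider this: when an id column contains NaNs, pandas reads it as float, and `astype(str)` then produces ids such as `263.0`. A minimal sketch, assuming the column names listed at the top of this notebook; the two in-memory sample rows and the name `sample_csv` are made up for illustration:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical rows, reduced columns).
sample_csv = io.StringIO(
    '"tripduration","start station id","end station id","gender"\n'
    '471,2009,263,1\n'
    '1494,536,,1\n'
)

# dtype=str keeps the station ids as strings from the start,
# so no astype(str) step is needed afterwards.
id_cols = {"start station id": str, "end station id": str}
sample = pd.read_csv(sample_csv, dtype=id_cols)

# Drop rows with missing values, as in STEP 2.
sample = sample.dropna()

print(sample["start station id"].tolist())
print(sample["end station id"].tolist())
```

The second sample row has no end station id, so `dropna()` removes it and only the first row remains.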

In [25]:
df1.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,471,2014-01-01 00:00:06,2014-01-01 00:07:57,2009,Catherine St & Monroe St,40.711174,-73.996826,263,Elizabeth St & Hester St,40.71729,-73.996375,16379,Subscriber,1986,1
1,1494,2014-01-01 00:00:38,2014-01-01 00:25:32,536,1 Ave & E 30 St,40.741444,-73.975361,259,South St & Whitehall St,40.701221,-74.012342,15611,Subscriber,1963,1
2,464,2014-01-01 00:03:59,2014-01-01 00:11:43,228,E 48 St & 3 Ave,40.754601,-73.971879,2022,E 59 St & Sutton Pl,40.758491,-73.959206,16613,Subscriber,1991,1
3,373,2014-01-01 00:05:15,2014-01-01 00:11:28,519,Pershing Square N,40.751884,-73.977702,526,E 33 St & 5 Ave,40.747659,-73.984907,15938,Subscriber,1989,1
4,660,2014-01-01 00:05:18,2014-01-01 00:16:18,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,436,Hancock St & Bedford Ave,40.682166,-73.95399,19830,Subscriber,1990,1


In [26]:
df = df1[0:2000].copy()

In [27]:
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,471,2014-01-01 00:00:06,2014-01-01 00:07:57,2009,Catherine St & Monroe St,40.711174,-73.996826,263,Elizabeth St & Hester St,40.71729,-73.996375,16379,Subscriber,1986,1
1,1494,2014-01-01 00:00:38,2014-01-01 00:25:32,536,1 Ave & E 30 St,40.741444,-73.975361,259,South St & Whitehall St,40.701221,-74.012342,15611,Subscriber,1963,1
2,464,2014-01-01 00:03:59,2014-01-01 00:11:43,228,E 48 St & 3 Ave,40.754601,-73.971879,2022,E 59 St & Sutton Pl,40.758491,-73.959206,16613,Subscriber,1991,1
3,373,2014-01-01 00:05:15,2014-01-01 00:11:28,519,Pershing Square N,40.751884,-73.977702,526,E 33 St & 5 Ave,40.747659,-73.984907,15938,Subscriber,1989,1
4,660,2014-01-01 00:05:18,2014-01-01 00:16:18,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,436,Hancock St & Bedford Ave,40.682166,-73.95399,19830,Subscriber,1990,1


<h1>STEP 3: Write a function that returns a graph given a citibike data frame</h1> 
<li>Your function should return two things:
<ol>
<li>a graph
<li>a dictionary with station ids as the key and station name as the value
</ol>
<li>The graph should contain 
<ol>
<li>nodes (station ids)
<li>edges (station id, station id)
<li>edge data 
<ol>
<li>count: number of trips on the edge
<li>time: average duration - pickup to dropoff - on that edge
</ol>
</ol>
<li><b>Note:</b> the edge (x1,y1) is the same as (y1,x1) even though the start station ids and end station ids are flipped in the dataframe

In [28]:
def get_citibike_graph(df):
    """Build an undirected graph from a citibike dataframe.

    Returns the graph, a dict mapping station id to station name, and a
    list of [station id, station id, trip count, mean duration in seconds].
    """
    G = nx.Graph()
    node_names = dict()

    # Every station id that appears as a start or an end station
    s1 = df['start station id'].unique()
    s2 = df['end station id'].unique()
    nodes = list(set(np.append(s1, s2)))

    # Look up each station's name; fall back to the end-station columns
    # for stations that never appear as a start station
    for node in nodes:
        names = df[df['start station id'] == node]['start station name']
        if names.empty:
            names = df[df['end station id'] == node]['end station name']
        node_names[node] = names.iloc[0]

    # Collect edges: pair each start station with every end station that
    # has not already been processed as a start station
    edges = list()
    temp = list()
    for i in nodes:
        series = df[df['start station id'] == i]['end station id']
        for j in range(0, len(series)):
            if series.iloc[j] not in temp:
                edges.append((i, series.iloc[j]))
        temp.append(i)

    # Trips in either direction count toward the same undirected edge
    weighted = list()
    for a, b in edges:
        on_edge = (((df['start station id'] == a) & (df['end station id'] == b)) |
                   ((df['start station id'] == b) & (df['end station id'] == a)))
        trips = df[on_edge]
        weighted.append([a, b, len(trips), np.mean(trips['tripduration'])])

    # Edge data: trips = number of trips, time = mean duration in seconds
    for e in weighted:
        G.add_edge(e[0], e[1], trips = e[2], time = e[3])

    return G, node_names, weighted

### All of the analysis below uses only the first 2000 rows of the dataset, because processing the entire dataset took too long.
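
Much of the run time in `get_citibike_graph` comes from filtering the whole dataframe once per edge. A minimal sketch of an alternative, assuming the same column names, aggregates every edge in a single `groupby`. The function name `build_graph_grouped` and the toy rows are hypothetical, and this is not the version used for the results in this notebook. It sorts each id pair so that (x, y) and (y, x) share one key, which also keeps pairs whose trips only run in one direction (the loop-based version can skip those, depending on the order in which nodes are visited):

```python
import networkx as nx
import pandas as pd


def build_graph_grouped(df):
    """Hypothetical groupby-based variant: returns (graph, station names)."""
    # Sort each (start, end) pair so both directions map to the same key.
    pairs = [tuple(sorted(p))
             for p in zip(df["start station id"], df["end station id"])]
    keyed = df.assign(u=[p[0] for p in pairs], v=[p[1] for p in pairs])

    # One pass over the data: trip count and mean duration per edge.
    stats = keyed.groupby(["u", "v"])["tripduration"].agg(["size", "mean"])

    G = nx.Graph()
    for (u, v), trips, mean in zip(stats.index, stats["size"], stats["mean"]):
        G.add_edge(u, v, trips=int(trips), time=float(mean))

    # Station names from both the start and the end columns.
    names = dict(zip(df["start station id"], df["start station name"]))
    names.update(zip(df["end station id"], df["end station name"]))
    return G, names


# Toy data (made up): two trips between stations 1 and 2, one from 2 to 3.
toy = pd.DataFrame({
    "start station id": ["1", "2", "2"],
    "start station name": ["A", "B", "B"],
    "end station id": ["2", "1", "3"],
    "end station name": ["B", "A", "C"],
    "tripduration": [100, 200, 60],
})
toy_G, toy_names = build_graph_grouped(toy)
print(toy_G["1"]["2"])
```

Because this variant can include edges that the loop-based function leaves out, its results may differ from the outputs shown in this notebook.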

<h1>STEP 4: Create the following graphs using the function above</h1>
<li>G: A graph of all the data in the dataframe
<li>m_G: A graph containing only data from male riders
<li>f_G: A graph containing only data from female riders
<li>Note: for m_G and f_G you will need to extract data from the dataframe

In [29]:
G, nodes, distance_all = get_citibike_graph(df)

In [30]:
m_G, nodes1, distance_m = get_citibike_graph(df[df['gender'] == 1])

In [31]:
f_G, nodes2, distance_f = get_citibike_graph(df[df['gender'] == 2])

In [32]:
T_G, nodes3, distance_T = get_citibike_graph(df[df['gender'] == 0])

<h1>STEP 5: Answer the following questions for each of the graphs</h1>
<ol>
<li>Which station (name) is the best connected (max degree)?
<li>Travel between which pair of stations is the longest in terms of average duration between bike pickups and dropoffs. Report both the two stations as well as the time in minutes
<li>Which edge is associated with the most number of trips?
<li>Which station is the most central?
<li>Which node is a bottleneck node?
</ol>

Which station (name) has the greatest number of connections (max degree)?

In [33]:
d1 = nx.degree(G)
l1 = list(d1)
all_val = max(l1, key=lambda x: x[1])
#all_val = max(list(nx.degree(G)), key = lambda x: x[1])

d2 = nx.degree(f_G)
l2 = list(d2)
f_val = max(l2, key=lambda x: x[1])

d3 = nx.degree(m_G)
l3 = list(d3)
m_val = max(l3, key=lambda x: x[1])

print("Busiest female station:", nodes[f_val[0]])
print("Busiest male station:", nodes[m_val[0]])
print("Busiest station:", nodes[all_val[0]])

Busiest female station: E 7 St & Avenue A
Busiest male station: 9 Ave & W 45 St
Busiest station: 1 Ave & E 15 St


Travel between which pair of stations is the longest in terms of average duration between bike pickups and dropoffs

In [34]:
all_long = max(distance_all, key=lambda x: x[3])
print("Longest average distance all:", nodes[all_long[0]],"to", nodes[all_long[1]],".","Minutes:",all_long[3]/60)

m_long = max(distance_m, key=lambda x: x[3])
print("Longest average males:", nodes[m_long[0]],"to", nodes[m_long[1]],".","Minutes:",m_long[3]/60)

f_long = max(distance_f, key=lambda x: x[3])
print("Longest average females:", nodes[f_long[0]],"to", nodes[f_long[1]],".","Minutes:",f_long[3]/60)

Longest average distance all: Howard St & Centre St to Broadway & W 60 St . Minutes: 2202.05
Longest average males: Howard St & Centre St to Broadway & W 60 St . Minutes: 2202.05
Longest average females: FDR Drive & E 35 St to 6 Ave & W 33 St . Minutes: 37.9
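
The 2202-minute average above (about 36.7 hours) looks like an outlier rather than a typical ride. A minimal sketch, assuming a hypothetical three-hour cutoff named `MAX_SECONDS` (not part of the assignment), drops such trips before the graph is built:

```python
import pandas as pd

MAX_SECONDS = 3 * 60 * 60  # hypothetical cutoff, not from the assignment


def drop_long_trips(df, max_seconds=MAX_SECONDS):
    """Return only the trips whose duration is at most max_seconds."""
    return df[df["tripduration"] <= max_seconds].copy()


# Toy durations in seconds; 132123 s corresponds to 2202.05 minutes.
toy = pd.DataFrame({"tripduration": [471, 1494, 132123]})
kept = drop_long_trips(toy)
print(kept["tripduration"].tolist())
```

The filtered frame can then be passed to `get_citibike_graph` in place of `df`.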


Which edge is associated with the most number of trips?

In [35]:
all_long = max(distance_all, key=lambda x: x[2])
print("Most trip route all:", nodes[all_long[0]],"to", nodes[all_long[1]],".","Minutes:",all_long[3]/60)

m_long = max(distance_m, key=lambda x: x[2])
print("Most trip route males:", nodes[m_long[0]],"to", nodes[m_long[1]],".","Minutes:",m_long[3]/60)

f_long = max(distance_f, key=lambda x: x[2])
print("Most trip route females:", nodes[f_long[0]],"to", nodes[f_long[1]],".","Minutes:",f_long[3]/60)

Most trip route all: Henry St & Grand St to Norfolk St & Broome St . Minutes: 4.355555555555555
Most trip route males: Norfolk St & Broome St to Henry St & Grand St . Minutes: 4.513888888888888
Most trip route females: Norfolk St & Broome St to Henry St & Grand St . Minutes: 4.038888888888889
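
The cell above reports the average duration on the busiest route but not the trip count itself. Since the count is stored on each edge under the `trips` key, it can also be read directly from the graph. A minimal sketch on a toy graph (the station ids and counts are made up):

```python
import networkx as nx

toy = nx.Graph()
toy.add_edge("a", "b", trips=5, time=260.0)
toy.add_edge("b", "c", trips=2, time=900.0)

# Each item is (node, node, attribute dict); pick the edge with the most trips.
u, v, data = max(toy.edges(data=True), key=lambda edge: edge[2]["trips"])
print(u, v, "trips:", data["trips"], "minutes:", data["time"] / 60)
```

The same expression should work on `G`, `m_G`, and `f_G`, with `nodes[u]` and `nodes[v]` giving the station names.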


Which station is the most central?

In [46]:
c_all = nx.closeness_centrality(G)
call = sorted(c_all.items(),key = lambda x: x[1],reverse = True)

c_f = nx.closeness_centrality(f_G)
cf = sorted(c_f.items(),key = lambda x: x[1],reverse = True)

c_m = nx.closeness_centrality(m_G)
cm = sorted(c_m.items(),key = lambda x: x[1],reverse = True)

print("The most central node:", nodes[call[0][0]])
print("The most central female node:", nodes[cf[0][0]])
print("The most central male node:", nodes[cm[0][0]])

The most central node: 1 Ave & E 15 St
The most central female node: E 2 St & Avenue C
The most central male node: Forsyth St & Broome St


Which node is a bottleneck node?

In [37]:
c_all = nx.betweenness_centrality(G)
call = sorted(c_all.items(),key = lambda x: x[1],reverse = True)

c_f = nx.betweenness_centrality(f_G)
cf = sorted(c_f.items(),key = lambda x: x[1],reverse = True)

c_m = nx.betweenness_centrality(m_G)
cm = sorted(c_m.items(),key = lambda x: x[1],reverse = True)

print("The most bottleneck node:", nodes[call[0][0]])
print("The most bottleneck female node:", nodes[cf[0][0]])
print("The most bottleneck male node:", nodes[cm[0][0]])

The most bottleneck node: Forsyth St & Broome St
The most bottleneck female node: 1 Ave & E 15 St
The most bottleneck male node: Forsyth St & Broome St


<h2>Centrality</h2>
One of the concerns that the citibike system has to deal with is ensuring that no station is ever empty (a bike should always be available) and that no station is ever full (you should always be able to return a bike). To do this, it needs to monitor the movement of bikes through the system, ideally using a directed graph. Though our graph is not directed, we can look at some network characteristics that will help us answer these questions. Note that the "trips" feature in edge data captures flows.
<li>Which node is a possible bottleneck node in terms of bike flows?
<li>Which node is the "nearest" to all other nodes (irrespective of flows)?
<li>Which node is the "nearest" to all other nodes (in terms of distance = time)?
<li>Which nodes are peripheral (most likely to be underserved)?

In [38]:
c_all_1 = nx.betweenness_centrality(G, weight='trips')
bottleneck = sorted(c_all_1.items(),key = lambda x: x[1],reverse = True)

c_all_2 = nx.closeness_centrality(G)
close1 = sorted(c_all_2.items(),key = lambda x: x[1],reverse = True)

c_all_3 = nx.closeness_centrality(G, distance='time')
close2 = sorted(c_all_3.items(),key = lambda x: x[1],reverse = True)

print("Most central in connectivity:", nodes[close1[0][0]])
print("Most central in connectivity using time as distance:", nodes[close2[0][0]])
print("Bottleneck node:", nodes[bottleneck[0][0]])

Most central in connectivity: 1 Ave & E 15 St
Most central in connectivity using time as distance: W 20 St & 8 Ave
Bottleneck node: Forsyth St & Broome St
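
networkx treats the `weight` argument of `betweenness_centrality` as a distance when it computes shortest paths, so passing `trips` directly makes heavily used routes look long. A minimal sketch, assuming an inverse-count attribute (the name `inv_trips` and the toy graph are made up for illustration), shows one way to make busy routes short instead:

```python
import networkx as nx

toy = nx.Graph()
toy.add_edge("a", "b", trips=10)
toy.add_edge("b", "c", trips=10)
toy.add_edge("a", "c", trips=1)

# Store 1 / trips on every edge so that busy edges count as short.
for u, v, data in toy.edges(data=True):
    data["inv_trips"] = 1.0 / data["trips"]

# trips used as a distance: the rarely used a-c edge is the "shortest" route.
raw = nx.betweenness_centrality(toy, weight="trips")

# inverse trips used as a distance: a-c traffic is routed through b.
flow = nx.betweenness_centrality(toy, weight="inv_trips")

print("trips as distance:", raw)
print("inverse trips as distance:", flow)
```

In the toy graph, `b` sits on the two busy edges and only ranks as a bottleneck under the inverse weighting. Whether the inverse count is the right notion of flow for this assignment is a judgment call.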


### The cell below raises an error because the graph built from only the first 2000 rows is not connected, and `nx.periphery` needs a finite path length between every pair of nodes. Run on the entire dataset, it gives the correct answer without an error.

In [20]:
periphery = nx.periphery(G)
print("Peripheral nodes:",nodes[periphery[0]],"and",nodes[periphery[1]])

NetworkXError: Found infinite path length because the graph is not connected
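
One way to get peripheral nodes from a disconnected graph is to restrict the computation to its largest connected component. A minimal sketch on a toy graph (the node labels are made up), using `nx.connected_components` and `Graph.subgraph`:

```python
import networkx as nx

toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("c", "d")])  # main component
toy.add_edge("x", "y")  # separate component, so toy is not connected

# Keep only the largest connected component, then compute its periphery.
largest = max(nx.connected_components(toy), key=len)
core = toy.subgraph(largest)
peripheral = sorted(nx.periphery(core))
print(peripheral)
```

Applied to `G`, this reports peripheral stations within the largest component only; stations in smaller components are left out and may be even more isolated.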