In [1]:
# importing libraries

import pandas as pd                              # for manipulating the dataframe
import plotly.graph_objects as go                # for plotting
import networkx as nx                            # for network analysis

# Network Analysis

A network is a graphical representation of the relationships between different entities. Network analysis examines the pattern of these relationships to reveal the core features of the network.

![Network](https://blog.perceptyx.com/hs-fs/hubfs/Images/Blog/organizational-network-analysis-perceptyx.jpg?width=1907&name=organizational-network-analysis-perceptyx.jpg)

A network (or graph) is a set of objects, called nodes, together with the relationships between them, called edges.

The Uber Drives dataset records trips between different locations. A network can therefore be used to inspect the rider's travel pattern: which places are travelled to or from most often, and which trips take place most frequently.

The networkx library is used to create the network.

# Reading the Dataset

In [2]:
# loading the dataset
df = pd.read_csv("Uber_Drives_Clean.csv")

Since there is an inherent direction present in this dataset, i.e., a trip starts from a start location and finishes at an end location, a directed graph is made to signify the direction of movement in a trip. 

Here, nodes are the different locations and an edge between two nodes is a trip between the two locations. 

A trip whose start and stop locations are the same would produce a self-loop (an edge from a node to itself) rather than an edge between two locations. Hence, only those datapoints are considered for which the start location and the stop location are different.
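As a small illustration of these points, the sketch below builds a directed graph from a made-up trip table. The locations A, B and C are hypothetical and are not taken from the dataset; only the column names follow it. It suggests that a trip from A to B and a trip from B to A give two distinct edges, that repeated trips collapse into a single edge, and that a same-location trip is dropped by the filter.

```python
import pandas as pd
import networkx as nx

# hypothetical trips; the column names follow the dataset
toy = pd.DataFrame({'START*': ['A', 'A', 'B', 'C'],
                    'STOP*':  ['B', 'B', 'A', 'C']})

# keep only trips whose start and stop locations differ
toy_diff = toy[toy['START*'] != toy['STOP*']]

toy_G = nx.from_pandas_edgelist(toy_diff, source='START*', target='STOP*',
                                create_using=nx.DiGraph())

print(sorted(toy_G.edges()))     # direction is preserved; repeated trips appear once
print('C' in toy_G.nodes())      # C only had a same-location trip, so it is absent
```

Because repeated trips collapse into one edge, the number of trips has to be stored separately as an edge attribute.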

In [3]:
# filtering the dataframe
df_diff_area = df[df['START*']!=df['STOP*']]

df_diff_area.head()

Unnamed: 0,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,Start Date,Start Time,End Date,End Time,Weekday,Duration
4,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,2016-01-06,14:42:00,2016-01-06,15:49:00,2,67.0
6,Business,West Palm Beach,Palm Beach,7.1,Meeting,2016-01-06,17:30:00,2016-01-06,17:35:00,2,5.0
8,Business,Cary,Morrisville,8.3,Meeting,2016-01-10,08:05:00,2016-01-10,08:25:00,6,20.0
9,Business,Jamaica,New York,16.5,Customer Visit,2016-01-10,12:17:00,2016-01-10,12:44:00,6,27.0
10,Business,New York,Queens,10.8,Meeting,2016-01-10,15:08:00,2016-01-10,15:51:00,6,43.0


# Making the Network

A directed graph is made using the filtered dataframe.

In [4]:
# making the graph/network
G = nx.from_pandas_edgelist(df_diff_area, source='START*', target='STOP*', 
                            create_using=nx.DiGraph())

In [5]:
G

<networkx.classes.digraph.DiGraph at 0xc44f550>

In [6]:
# displaying the nodes
G.nodes()

NodeView(('Fort Pierce', 'West Palm Beach', 'Palm Beach', 'Cary', 'Morrisville', 'Jamaica', 'New York', 'Queens', 'Elmhurst', 'Midtown', 'East Harlem', 'NoMad', 'Flatiron District', 'Midtown East', 'Hudson Square', 'Lower Manhattan', "Hell's Kitchen", 'Queens County', 'Downtown', 'Gulfton', 'Eagan Park', 'Jamestown Court', 'Durham', 'Farmington Woods', 'Whitebridge', 'Lake Wellingborough', 'Raleigh', 'Fayetteville Street', 'Umstead', 'Hazelwood', 'Westpark Place', 'Fairmont', 'Meredith Townes', 'Leesville Hollow', 'Apex', 'Chapel Hill', 'Northwoods', 'Williamsburg Manor', 'Macgregor Downs', 'Edgehill Farms', 'Tanglewood', 'Preston', 'Eastgate', 'Walnut Terrace', 'East Elmhurst', 'Jackson Heights', 'Midtown West', 'Long Island City', 'Heritage Pines', 'Waverly Place', 'Wayne Ridge', 'Depot Historic District', 'East Austin', 'West University', 'South Congress', 'Arts District', 'The Drag', 'Congress Ave District', 'Red River District', 'Convention Center District', 'North Austin', 'Georg

In [7]:
# displaying the edges
G.edges()

OutEdgeView([('Fort Pierce', 'West Palm Beach'), ('West Palm Beach', 'Palm Beach'), ('Cary', 'Morrisville'), ('Cary', 'Durham'), ('Cary', 'Raleigh'), ('Cary', 'Apex'), ('Cary', 'Chapel Hill'), ('Cary', 'Latta'), ('Cary', 'Holly Springs'), ('Cary', 'Wake Forest'), ('Cary', 'Winston Salem'), ('Cary', 'Unknown Location'), ('Cary', 'Wake Co.'), ('Cary', 'Fuquay-Varina'), ('Morrisville', 'Cary'), ('Morrisville', 'Raleigh'), ('Morrisville', 'Banner Elk'), ('Jamaica', 'New York'), ('New York', 'Queens'), ('New York', 'Queens County'), ('New York', 'Long Island City'), ('New York', 'Jamaica'), ('Elmhurst', 'New York'), ('Midtown', 'East Harlem'), ('Midtown', 'Midtown East'), ('Midtown', 'Hudson Square'), ('Midtown', 'Midtown West'), ('Midtown', 'Alief'), ('Midtown', 'Sharpstown'), ('Midtown', 'Washington Avenue'), ('Midtown', 'Downtown'), ('Midtown', 'Greater Greenspoint'), ('East Harlem', 'NoMad'), ('Flatiron District', 'Midtown'), ('Midtown East', 'Midtown'), ('Hudson Square', 'Lower Manhatt

### An edge attribute equal to the no. of trips between the two locations needs to be added. 

For that purpose the dataframe is grouped by the start and stop locations and the entries are counted.

In [8]:
# the edges carry no attributes yet
print(G.edges()[('Fort Pierce', 'West Palm Beach')])
print(G.edges()[('Cary', 'Apex')])
print(G.edges()[('Whitebridge', 'Hazelwood')])

{}
{}
{}


In [9]:
# making the edge attributes dataframe
df_edge = df_diff_area.groupby(['START*', 'STOP*'], as_index=False)['MILES*'].count()

df_edge.head()

Unnamed: 0,START*,STOP*,MILES*
0,Agnew,Cory,1
1,Agnew,Renaissance,2
2,Almond,Bryson City,1
3,Apex,Cary,13
4,Apex,Eagle Rock,1


A further column, 'trip_loc', stores each pair of start and stop locations as a tuple (start_location, stop_location), the same form in which the network stores its edges.

In [10]:
# adding the trip_loc column
trip_loc = [tuple([df_edge.loc[i]['START*'], df_edge.loc[i]['STOP*']]) for i in range(len(df_edge))]

df_edge['trip_loc'] = trip_loc

In [11]:
# adding the trip_loc column (same result as the previous cell, without row-by-row indexing)
trip_loc = list(zip(df_edge['START*'], df_edge['STOP*']))

df_edge['trip_loc'] = trip_loc

In [12]:
df_edge.head()

Unnamed: 0,START*,STOP*,MILES*,trip_loc
0,Agnew,Cory,1,"(Agnew, Cory)"
1,Agnew,Renaissance,2,"(Agnew, Renaissance)"
2,Almond,Bryson City,1,"(Almond, Bryson City)"
3,Apex,Cary,13,"(Apex, Cary)"
4,Apex,Eagle Rock,1,"(Apex, Eagle Rock)"


Now the network is updated with the edge attributes.

In [13]:
# loop to iterate over the edges
for i in range(len(df_edge)):
    
    edge = df_edge.iloc[i]['trip_loc']           # extracting the edge
    d = {'Trips':df_edge.iloc[i]['MILES*']}      # dictionary to store the count of trips for that edge
    G.edges()[edge].update(d)                    # adding the attribute
        
        

Every edge now stores the no. of trips made between its start and end locations.

In [14]:
print(G.edges()[('Fort Pierce', 'West Palm Beach')])
print(G.edges()[('Cary', 'Apex')])
print(G.edges()[('Whitebridge', 'Hazelwood')])

{'Trips': 1}
{'Trips': 14}
{'Trips': 4}
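As an alternative to updating the edges in a loop, the counts can be attached while the graph is built. The following is a minimal sketch on a made-up trip table, not the approach used in this notebook: the locations are hypothetical, and 'Trips' is simply the column name chosen here for the count. It relies on the `edge_attr` parameter of `from_pandas_edgelist`.

```python
import pandas as pd
import networkx as nx

# hypothetical trips; the column names follow the dataset
toy = pd.DataFrame({'START*': ['A', 'A', 'B'],
                    'STOP*':  ['B', 'B', 'A']})

# count the trips per (start, stop) pair; 'Trips' is a column name chosen here
toy_edge = toy.groupby(['START*', 'STOP*']).size().reset_index(name='Trips')

# build the graph from the counted pairs and attach 'Trips' in one step
toy_G = nx.from_pandas_edgelist(toy_edge, source='START*', target='STOP*',
                                edge_attr='Trips', create_using=nx.DiGraph())

print(toy_G.edges[('A', 'B')])   # attributes of the edge from A to B
print(toy_G.edges[('B', 'A')])   # attributes of the edge from B to A
```

Building the graph from the grouped dataframe keeps the edge list and the counts in one place, so no separate 'trip_loc' column is needed in this variant.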


The location of the nodes on the plot needs to be defined. Here spring_layout is used, which positions the nodes using the Fruchterman-Reingold force-directed algorithm.

Other available layouts are random_layout, circular_layout, etc. 

In [15]:
# defining the position of the nodes
pos = nx.spring_layout(G)

# storing the position of each node as a node attribute (a list [x, y])
for node in G.nodes:
    G.nodes[node]['pos'] = list(pos[node])

The position is simply the pair of x and y coordinates at which the node will be plotted.

In [16]:
print(G.nodes['Fort Pierce']['pos'])
print(G.nodes['Cary']['pos'])
print(G.nodes['New York']['pos'])

[0.08843740576110388, 0.7901051618828822]
[-0.04908019149873204, 0.16270756298709113]
[0.13497842458488654, -0.41824588307947747]
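Since spring_layout starts from random initial positions, coordinates such as those printed above will generally differ from run to run. A minimal sketch of making the layout repeatable with the `seed` parameter, using a hypothetical three-node graph and an arbitrary seed value:

```python
import networkx as nx

# hypothetical three-node directed graph
toy_G = nx.DiGraph([('A', 'B'), ('B', 'C'), ('C', 'A')])

# a fixed seed (42 is an arbitrary choice) makes the layout repeatable
pos_1 = nx.spring_layout(toy_G, seed=42)
pos_2 = nx.spring_layout(toy_G, seed=42)

# each position is an array holding the x and y coordinates of a node
same = all((pos_1[n] == pos_2[n]).all() for n in toy_G.nodes)
print(same)
```

Fixing the seed is useful when the written observations about the plot should still match the figure after the notebook is re-run.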


Now the network is ready for plotting.

# Plotting the network

Plotly graph objects will be used for creating the plot. 

The first step is to make a trace which will add all the nodes of the network to the plot.

The nodes are also colored according to the no. of unique locations that are travelled to from them. For example, if trips are made from a location A to 5 different locations, node A will have a different color on the plot compared to a location B from where trips are made to 10 different locations. The adjacency() function of the network is used here; for a directed graph it lists, for each node, only the nodes reached by its outgoing edges.

In [17]:
for idx, item in enumerate(G.adjacency()):
    print(idx)       # running index of the node
    print(item)      # (node, {neighbour: edge attributes}) tuple

0
('Fort Pierce', {'West Palm Beach': {'Trips': 1}})
1
('West Palm Beach', {'Palm Beach': {'Trips': 1}})
2
('Palm Beach', {})
3
('Cary', {'Morrisville': {'Trips': 67}, 'Durham': {'Trips': 36}, 'Raleigh': {'Trips': 23}, 'Apex': {'Trips': 14}, 'Chapel Hill': {'Trips': 1}, 'Latta': {'Trips': 1}, 'Holly Springs': {'Trips': 1}, 'Wake Forest': {'Trips': 1}, 'Winston Salem': {'Trips': 1}, 'Unknown Location': {'Trips': 1}, 'Wake Co.': {'Trips': 1}, 'Fuquay-Varina': {'Trips': 1}})
4
('Morrisville', {'Cary': {'Trips': 75}, 'Raleigh': {'Trips': 4}, 'Banner Elk': {'Trips': 1}})
5
('Jamaica', {'New York': {'Trips': 2}})
6
('New York', {'Queens': {'Trips': 1}, 'Queens County': {'Trips': 1}, 'Long Island City': {'Trips': 1}, 'Jamaica': {'Trips': 1}})
7
('Queens', {})
8
('Elmhurst', {'New York': {'Trips': 1}})
9
('Midtown', {'East Harlem': {'Trips': 1}, 'Midtown East': {'Trips': 1}, 'Hudson Square': {'Trips': 1}, 'Midtown West': {'Trips': 1}, 'Alief': {'Trips': 1}, 'Sharpstown': {'Trips': 4}, 'Washing
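The output above lists only the destinations of each node. The sketch below, on a hypothetical four-node graph, shows one way to count unique destinations, unique origins, and both together, in case locations travelled from should also be reflected. The variable names are illustrative only.

```python
import networkx as nx

# hypothetical graph: A travels to B and C, and D travels to A
toy_G = nx.DiGraph([('A', 'B'), ('A', 'C'), ('D', 'A')])

counts = {}
for node in toy_G.nodes:
    destinations = set(toy_G.successors(node))      # reached by outgoing edges
    origins = set(toy_G.predecessors(node))         # sources of incoming edges
    counts[node] = (len(destinations), len(origins), len(destinations | origins))

# what adjacency() reports for A: its outgoing neighbours only
print(len(dict(toy_G.adjacency())['A']))
print(counts['A'])
```

For node A the adjacency count is the number of destinations; the origin D only appears when predecessors are included.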

In [18]:
# Make a node trace

traceRecode = []                    # list to store all the traces

# initialise a node trace
node_trace = go.Scatter(x=[], y=[], hovertext=[], mode='markers', hoverinfo="text",
                        marker=dict(showscale=True, reversescale=True,
                                    color=[], size=5,
                                    # title given as a dict; the separate titleside argument is deprecated
                                    colorbar=dict(thickness=10, xanchor='left',
                                                  title=dict(text='No. of unique locations travelled to',
                                                             side='right')),
                                    colorscale="rdylbu"))

# adding the coordinate position of the nodes
for node in G.nodes():
    x, y = G.nodes()[node]['pos']
    hovertext = node
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])
    node_trace['hovertext'] += tuple([hovertext])          # add the hovertext (name of the location)

# specify the color of the node (no. of outgoing neighbours)
for idx, adjacencies in enumerate(G.adjacency()):      # adjacencies is a (node, neighbours) tuple
    try:
        node_trace['marker']['color'] += tuple([len(adjacencies[1])])
    except (TypeError, IndexError):
        pass
    
traceRecode.append(node_trace)                     # adding the node trace

In [19]:
traceRecode

[Scatter({
     'hoverinfo': 'text',
     'hovertext': [Fort Pierce, West Palm Beach, Palm Beach, Cary, Morrisville,
                   Jamaica, New York, Queens, Elmhurst, Midtown, East Harlem, NoMad,
                   Flatiron District, Midtown East, Hudson Square, Lower Manhattan,
                   Hell's Kitchen, Queens County, Downtown, Gulfton, Eagan Park,
                   Jamestown Court, Durham, Farmington Woods, Whitebridge, Lake
                   Wellingborough, Raleigh, Fayetteville Street, Umstead, Hazelwood,
                   Westpark Place, Fairmont, Meredith Townes, Leesville Hollow,
                   Apex, Chapel Hill, Northwoods, Williamsburg Manor, Macgregor
                   Downs, Edgehill Farms, Tanglewood, Preston, Eastgate, Walnut
                   Terrace, East Elmhurst, Jackson Heights, Midtown West, Long
                   Island City, Heritage Pines, Waverly Place, Wayne Ridge, Depot
                   Historic District, East Austin, West University,

In [20]:
#!pip install ipywidgets



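In the node trace above, a try except block guards the color computation. Since adjacency() returns a (possibly empty) dictionary of neighbours for every node, a node without outgoing trips simply gets a count of 0.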

The edge traces are now added. 

The width of the edge will be equal to the no. of trips. So, if 10 trips are made between location A and B, the edge between them will be thicker compared to the edge between locations C and D which have only 5 trips.

In [21]:
# Make an edge trace

for edge in G.edges:
    x0, y0 = G.nodes()[edge[0]]['pos']
    x1, y1 = G.nodes()[edge[1]]['pos']
    weight = G.edges()[edge]['Trips']                 # specifying the parameter for the width of the edge
    
    trace = go.Scatter(x=tuple([x0, x1, None]), y=tuple([y0, y1, None]),    # defining the edge trace
                       mode='lines',
                       line=dict(width=weight,color='Blue'))
    traceRecode.append(trace)                         # adding the edge trace
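In the outputs shown earlier, the trip counts range from 1 up to 75, so using the raw count as the line width can produce very thick lines. One option, sketched below, is to rescale the counts to a bounded range of widths. The function name and the bounds 0.5 and 8.0 are assumptions made for this illustration, not part of the notebook.

```python
def scaled_width(trips, max_trips, min_width=0.5, max_width=8.0):
    """Map a trip count linearly onto a line width between min_width and max_width."""
    if max_trips <= 1:
        return min_width
    fraction = (trips - 1) / (max_trips - 1)
    return min_width + fraction * (max_width - min_width)

trip_counts = [1, 14, 75]     # example counts taken from the outputs above
widths = [scaled_width(t, max(trip_counts)) for t in trip_counts]
print(widths)
```

The returned value could be passed as `width` in the `line` dictionary of the edge trace in place of the raw count.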

All the necessary traces have been added. The plot is now created.

In [22]:
figure = {
    "data": traceRecode,
    "layout": go.Layout(title='Network of Trips', showlegend=False, hovermode='closest')}

In [23]:
go.FigureWidget(figure)

On hovering over a node, one can see the name of the location it refers to. There are a few thick lines, but the majority of the edges are thin, signifying that very few trips are made between most pairs of locations.

There are two clusters with very thick lines in between. These can be analysed by zooming into the plot. 

Cary and Morrisville have the maximum no. of trips between them, whereas Whitebridge is the location connected to the most other locations.
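Observations like these can also be checked from the graph itself rather than read off the plot. The sketch below uses a hypothetical graph with a 'Trips' attribute on every edge, as in the network built earlier. It finds the edge with the most trips and the node with the most unique destinations.

```python
import networkx as nx

# hypothetical graph with a 'Trips' attribute on every edge
toy_G = nx.DiGraph()
toy_G.add_edge('A', 'B', Trips=5)
toy_G.add_edge('B', 'A', Trips=7)
toy_G.add_edge('C', 'A', Trips=1)
toy_G.add_edge('C', 'B', Trips=1)
toy_G.add_edge('C', 'D', Trips=2)

# edge with the largest trip count: a (start, stop, trips) tuple
busiest_edge = max(toy_G.edges(data='Trips'), key=lambda e: e[2])

# node with the most outgoing neighbours: a (node, out-degree) tuple
most_destinations = max(toy_G.out_degree(), key=lambda item: item[1])

print(busiest_edge)
print(most_destinations)
```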

This was the network for all the locations present in the dataset. Now one can study the network considering the most frequent starting locations. For that purpose, a filtered dataframe is made where the start locations are the top 10 most frequented ones.

In [25]:
# filtered dataframe
most_frequent_starts = df['START*'].value_counts().nlargest(10).index

df_filtered = df[df['START*'].isin(most_frequent_starts)]
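To make the filtering step concrete, here is a small sketch on a made-up trip table that keeps the two most frequent start locations instead of ten. The locations are hypothetical.

```python
import pandas as pd

# hypothetical trips; A starts 3 trips, B starts 2, C starts 1
toy = pd.DataFrame({'START*': ['A', 'A', 'A', 'B', 'B', 'C'],
                    'STOP*':  ['B', 'C', 'B', 'A', 'C', 'A']})

top_starts = toy['START*'].value_counts().nlargest(2).index   # two most frequent starts
toy_filtered = toy[toy['START*'].isin(top_starts)]

print(sorted(top_starts))
print(len(toy_filtered))
```

Only the start location is filtered, so a stop location such as C still appears in the filtered trips even though it is not among the most frequent starts.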

A directed graph is made and the no. of trips is added as an edge attribute. The circular layout is used this time to define the positions of the nodes.

In [35]:
# making the graph/network
G = nx.from_pandas_edgelist(df_filtered, source='START*', target='STOP*', 
                            create_using=nx.DiGraph())

In [36]:
# making the edge attributes dataframe
df_edge = df_filtered.groupby(['START*', 'STOP*'], as_index=False)['MILES*'].count()

In [37]:
# adding the trip_loc column
trip_loc = [tuple([df_edge.loc[i]['START*'], df_edge.loc[i]['STOP*']]) for i in range(len(df_edge))]

df_edge['trip_loc'] = trip_loc

In [38]:
# loop to iterate over the edges
for i in range(len(df_edge)):
    
    edge = df_edge.iloc[i]['trip_loc']                   # extracting the edge
    d = {'Trips':df_edge.iloc[i]['MILES*']}              # dictionary to store the count of trips for that edge
    G.edges()[edge].update(d)                            # adding the attribute

In [30]:
# defining the position of the nodes
pos = nx.circular_layout(G)

# storing the position of each node as a node attribute (a list [x, y])
for node in G.nodes:
    G.nodes[node]['pos'] = list(pos[node])

In [31]:
# Make a node trace

traceRecode = []                    # list to store all the traces

# initialise a node trace
node_trace = go.Scatter(x=[], y=[], hovertext=[], mode='markers', hoverinfo="text",
                        marker=dict(showscale=True, reversescale=True,
                                    color=[], size=5,
                                    # title given as a dict; the separate titleside argument is deprecated
                                    colorbar=dict(thickness=10, xanchor='left',
                                                  title=dict(text='No. of unique locations travelled to',
                                                             side='right')),
                                    colorscale="rdylbu"))

# adding the coordinate position of the nodes
for node in G.nodes():
    x, y = G.nodes()[node]['pos']
    hovertext = node
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])
    node_trace['hovertext'] += tuple([hovertext])          # add the hovertext (name of the location)

# specify the color of the node (no. of outgoing neighbours)
for idx, adjacencies in enumerate(G.adjacency()):      # adjacencies is a (node, neighbours) tuple
    try:
        node_trace['marker']['color'] += tuple([len(adjacencies[1])])
    except (TypeError, IndexError):
        pass
    
traceRecode.append(node_trace)                     # adding the node trace

In this case, the width of the edges is constant.

In [32]:
# Make an edge trace

for edge in G.edges:
    x0, y0 = G.nodes()[edge[0]]['pos']
    x1, y1 = G.nodes()[edge[1]]['pos']
    
    trace = go.Scatter(x=tuple([x0, x1, None]), y=tuple([y0, y1, None]),    # defining the edge trace
                       mode='lines',
                       line=dict(width=0.5,color='Blue'))
    traceRecode.append(trace)                         # adding the edge trace

In [33]:
figure = {
    "data": traceRecode,
    "layout": go.Layout(title='Network of Trips for most frequent start locations', showlegend=False, hovermode='closest')}

In [39]:
go.FigureWidget(figure)

FigureWidget({
    'data': [{'hoverinfo': 'text',
              'hovertext': [Cary, Morrisville, Midtown, East…

This plot shows which places are travelled to from each of the most frequent start locations. Whitebridge, Midtown and Cary can be considered the most frequent locations/junctions for the rider.