# Problem Set 4: Networks

In this problem set we will work with network-like data. We will use a small dataset obtained from Foursquare's API for Riyadh. The dataset was constructed by scraping an API endpoint that, given a venue, lists the top 5 venues where users usually check in next. All the data is aggregated, so the specifics of each trip or check-in sequence are not available. However, it can provide a good overview of some general dynamics around the city.

In the dataset, every node, or venue, is connected to at least one other node. More popular nodes are connected to more nodes, that is, they have more edges attached to them. Additionally, by constructing a network with the data, we can analyze other properties that might give us insight into the urban dynamics of the region. The problem set covers the following steps:

1. Constructing a network with the dataset.
2. Assigning spatial properties to the network.
3. Analyzing some basic network properties.
4. Constructing exploratory visualizations that help us make sense out of our analysis and dataset.

In [1]:
# Import some libraries
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# This allows plots to appear on the IPython notebook.
%matplotlib inline 

The data is contained in two different `CSV` files: `FoursqureLinksNetwork.csv` and `FoursqureCheckinNodes.csv`. First, we will import the CSV containing all the edges among **Foursquare** venues, and construct a `networkx` network. In this case, the node index or name will be the venue ID.

In [3]:
# Use pandas to import the csv
df_edges = pd.read_csv('data/FoursqureLinksNetwork.csv', sep=",")
df_edges.head()

Unnamed: 0,FROM-ID,TO-ID,DISTANCE
0,4f3381cae4b0befff0254890,4d90bb7ffa9437048ca338c6,0.050111
1,4f3381cae4b0befff0254890,50e4021b582f294b85631919,0.083781
2,4f3381cae4b0befff0254890,4e341a00e4cdf7a42cad9421,0.046179
3,4f3381cae4b0befff0254890,50433012e4b05698baa75339,0.057012
4,4f3381cae4b0befff0254890,4f8f03f5e4b09b4d92853b2c,0.029747


Next, we will import the CSV containing all the nodes representing **Foursquare** venues.

In [4]:
df_nodes = pd.read_csv('data/FoursqureCheckinNodes.csv', sep=",")
df_nodes.head()

Unnamed: 0,ID,NAME,CATEGORY,CHECK-IN,LAT,LONG
0,558e5ce5498e164a5cb27afc,King Khalid Air...,Airport Terminal,21,24.760483,46.705338
1,4ccd23f1c0378cfa93b68b48,Princess Nora University,College Academic Building,532,24.774344,46.728845
2,4da5ec8a4b2280544b678da0,Lulu Hypermarket,Department Store,6453,24.663989,46.703757
3,50295233e4b0db2acbb75c69,Papparoti,Coffee Shop,2168,24.693468,46.669636
4,4da5ec8a4b2280544b678da0,Avenue Mall,Department Store,1474,24.663569,46.703937


## Part 1

Now, using your recently acquired **Pandas** knowledge, use the `.join` or `.merge` functions to match each origin ID with its latitude and longitude (we want to add new columns with the lat and lon values for a given node). Add **lat1** and **lon1** columns to `df_edges` containing this information. Do the same with the destination ID, but name the respective columns **lat2** and **lon2**.

**Deliverable**
* A pandas DF with 4 new columns: the columns should correspond to the lat and lon of the **FROM-ID** and **TO-ID** columns.
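As a minimal sketch of the lookup pattern described above, the example below merges a tiny made-up edge table against a made-up node table. The IDs `a`, `b`, `c`, the coordinates, and the names `toy_edges`, `toy_nodes`, and `toy_result` are all invented for illustration. The node preview above shows the same ID on two rows, so the sketch keeps only the first row per ID before merging. Whether that is the right choice for the real data is an assumption to check, not a given.

```python
import pandas as pd

# Toy stand-ins for df_edges and df_nodes (values are invented for illustration).
toy_edges = pd.DataFrame({"FROM-ID": ["a", "a", "b"],
                          "TO-ID": ["b", "c", "c"]})
toy_nodes = pd.DataFrame({"ID": ["a", "b", "c", "c"],
                          "LAT": [24.70, 24.71, 24.72, 24.73],
                          "LONG": [46.60, 46.61, 46.62, 46.63]})

# Assumption: keep the first row per ID so each edge matches exactly one venue.
coords = toy_nodes.drop_duplicates(subset="ID")[["ID", "LAT", "LONG"]]

# Origin coordinates: rename first, then merge on FROM-ID.
origin = coords.rename(columns={"ID": "FROM-ID", "LAT": "lat1", "LONG": "lon1"})
toy_result = toy_edges.merge(origin, on="FROM-ID", how="left",
                             validate="many_to_one")

# Destination coordinates: same pattern on TO-ID.
dest = coords.rename(columns={"ID": "TO-ID", "LAT": "lat2", "LONG": "lon2"})
toy_result = toy_result.merge(dest, on="TO-ID", how="left",
                              validate="many_to_one")

print(toy_result)
```

Passing `validate="many_to_one"` makes pandas raise an error if the lookup table still contains repeated keys, which helps catch silently duplicated edges.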

In [5]:
# Your Code here

# Attach the origin venue's attributes to each edge by matching FROM-ID to ID.
# indicator=True adds a _merge column showing where each row came from.
results1 = pd.merge(df_edges, df_nodes, how='inner',
                    left_on="FROM-ID", right_on="ID", indicator=True)

# Rename the origin columns so they do not clash with the destination merge.
results1 = results1.rename(columns={'LAT': 'LAT1', 'LONG': 'LONG1', 'ID': 'ID1'})
results1.head()


Unnamed: 0,FROM-ID,TO-ID,DISTANCE,ID1,NAME,CATEGORY,CHECK-IN,LAT1,LONG1,_merge
0,50295233e4b0db2acbb75c69,4bb60acd46d4a5932198c5c0,0.142938,50295233e4b0db2acbb75c69,Papparoti,Coffee Shop,2168,24.693468,46.669636,both
1,50295233e4b0db2acbb75c69,4dd6ae1afa76ad96d111ee3f,0.121546,50295233e4b0db2acbb75c69,Papparoti,Coffee Shop,2168,24.693468,46.669636,both
2,50295233e4b0db2acbb75c69,4bacf76cf964a5201c1f3be3,0.102685,50295233e4b0db2acbb75c69,Papparoti,Coffee Shop,2168,24.693468,46.669636,both
3,50295233e4b0db2acbb75c69,4c72bb9a4bc4236ae30ccc7a,0.194216,50295233e4b0db2acbb75c69,Papparoti,Coffee Shop,2168,24.693468,46.669636,both
4,50295233e4b0db2acbb75c69,4f1d62b0e4b03543a3409cd2,0.129974,50295233e4b0db2acbb75c69,Papparoti,Coffee Shop,2168,24.693468,46.669636,both


In [6]:
# Attach the destination venue's attributes by matching TO-ID to ID.
# Columns present on both sides (NAME, CATEGORY, CHECK-IN) get the
# _x (origin) and _y (destination) suffixes.
results2 = pd.merge(results1, df_nodes, how='inner',
                    left_on="TO-ID", right_on="ID")

# Rename the destination columns to mirror the origin ones.
results2 = results2.rename(columns={'LAT': 'LAT2', 'LONG': 'LONG2', 'ID': 'ID2'})
results2.head()





Unnamed: 0,FROM-ID,TO-ID,DISTANCE,ID1,NAME_x,CATEGORY_x,CHECK-IN_x,LAT1,LONG1,_merge,ID2,NAME_y,CATEGORY_y,CHECK-IN_y,LAT2,LONG2
0,50295233e4b0db2acbb75c69,4f1d62b0e4b03543a3409cd2,0.129974,50295233e4b0db2acbb75c69,Papparoti,Coffee Shop,2168,24.693468,46.669636,both,4f1d62b0e4b03543a3409cd2,Wayne's Coffee,Toy / Game Store,3842,24.691873,46.668853
1,4fcf729ae4b07997d3141487,4f1d62b0e4b03543a3409cd2,0.017338,4fcf729ae4b07997d3141487,Panorama Foodcourt,Food Court,769,24.692186,46.670004,both,4f1d62b0e4b03543a3409cd2,Wayne's Coffee,Toy / Game Store,3842,24.691873,46.668853
2,4fcf729ae4b07997d3141487,4f1d62b0e4b03543a3409cd2,0.017338,4fcf729ae4b07997d3141487,Hamleys,Food Court,963,24.692337,46.669961,both,4f1d62b0e4b03543a3409cd2,Wayne's Coffee,Toy / Game Store,3842,24.691873,46.668853
3,5318e95f498eaff004e39f6c,533687b3498ec0c5b90ff33b,0.030966,5318e95f498eaff004e39f6c,Candylawa,Candy Store,2292,24.691785,46.669735,both,533687b3498ec0c5b90ff33b,Cafelawa,Coffee Shop,389,24.691514,46.669664
4,5318e95f498eaff004e39f6c,533687b3498ec0c5b90ff33b,0.030966,5318e95f498eaff004e39f6c,VERSUS Versace Caff...,Candy Store,3005,24.691748,46.669056,both,533687b3498ec0c5b90ff33b,Cafelawa,Coffee Shop,389,24.691514,46.669664


In [7]:
# Keep only the edge endpoints and their coordinates.
results2.loc[:, ['FROM-ID', 'TO-ID', 'LAT1', 'LONG1', 'LAT2', 'LONG2']]

Unnamed: 0,FROM-ID,TO-ID,LAT1,LONG1,LAT2,LONG2
0,50295233e4b0db2acbb75c69,4f1d62b0e4b03543a3409cd2,24.693468,46.669636,24.691873,46.668853
1,4fcf729ae4b07997d3141487,4f1d62b0e4b03543a3409cd2,24.692186,46.670004,24.691873,46.668853
2,4fcf729ae4b07997d3141487,4f1d62b0e4b03543a3409cd2,24.692337,46.669961,24.691873,46.668853
3,5318e95f498eaff004e39f6c,533687b3498ec0c5b90ff33b,24.691785,46.669735,24.691514,46.669664
4,5318e95f498eaff004e39f6c,533687b3498ec0c5b90ff33b,24.691748,46.669056,24.691514,46.669664
5,4f1d62b0e4b03543a3409cd2,4fcf729ae4b07997d3141487,24.691873,46.668853,24.692186,46.670004
6,4f1d62b0e4b03543a3409cd2,4fcf729ae4b07997d3141487,24.691873,46.668853,24.692337,46.669961
7,4d4c01cfe4fd6ea8f7e8be61,4c601b8490b2c9b6d7013c22,24.691474,46.670711,24.692466,46.671305
8,4d4c01cfe4fd6ea8f7e8be61,4c601b8490b2c9b6d7013c22,24.691474,46.670711,24.692087,46.671357
9,4d4c01cfe4fd6ea8f7e8be61,4c601b8490b2c9b6d7013c22,24.691474,46.670711,24.693912,46.670527


## Part 2

Now that we have an appropriate data structure, we will create a `networkx` network.

In [8]:
# Let's define an empty undirected graph.
RG = nx.Graph()

Now let's use the `df_nodes` to add nodes to our newly created graph. The node index or name will be the **ID** column of the df. Make sure to add the rest of the df columns to the node as properties.

`Hint: You can loop through all the rows, and use each one of their values to add a node and define specific properties. The property name should be the same as the column name.`

**Deliverable**
* You should populate the **RG** network with all the nodes in the `df_nodes` df. Each node should also carry the additional columns as node properties. To show that you correctly populated the network, print out the **node names** and the **number of nodes**.
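The row-by-row approach from the hint can be sketched on a toy table. This is one possible pattern, not the required solution. The DataFrame `toy_nodes`, its values, and the graph name `toy_graph` are invented for illustration; only the column names mirror `df_nodes`.

```python
import networkx as nx
import pandas as pd

# Toy stand-in for df_nodes (values are invented for illustration).
toy_nodes = pd.DataFrame({"ID": ["a", "b", "c"],
                          "NAME": ["Cafe", "Mall", "Park"],
                          "CHECK-IN": [10, 250, 40],
                          "LAT": [24.70, 24.71, 24.72],
                          "LONG": [46.60, 46.61, 46.62]})

toy_graph = nx.Graph()

# One node per row; every remaining column becomes a node property
# named after the column.
for _, row in toy_nodes.iterrows():
    props = row.drop("ID").to_dict()
    toy_graph.add_node(row["ID"], **props)

print(list(toy_graph.nodes()))
print(toy_graph.number_of_nodes())
```

Because `add_node` collects its keyword arguments into an attribute dictionary, unpacking with `**props` works even for column names such as `CHECK-IN` that are not valid Python identifiers.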

In [16]:
# Add nodes to the graph here


Now let's use the `df_edges` to add edges to our graph. The edge index will be the **row number** of the df. Make sure to add the rest of the df columns to the edge as properties (such as lat1, lon1, lat2, and lon2).

`Hint: You can loop through all the rows of the df, and use each one of their values to add an edge and define specific properties. The property name should be the same as the column name.`

**Deliverable**
* You should populate the **RG** network with all the edges in the `df_edges` df. Each edge should also carry the additional columns as edge properties. To show that you correctly populated the network, print out the **edges** and the **number of edges**.
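A comparable sketch for edges, again on invented data. In a `networkx` `Graph` an edge is identified by its pair of endpoints, so storing the row number under a property called `index` is an assumption made here to keep track of it. The names `toy_edges` and `toy_graph` are also hypothetical.

```python
import networkx as nx
import pandas as pd

# Toy stand-in for the edge table (values are invented for illustration).
toy_edges = pd.DataFrame({"FROM-ID": ["a", "a", "b"],
                          "TO-ID": ["b", "c", "c"],
                          "DISTANCE": [0.05, 0.08, 0.03]})

toy_graph = nx.Graph()

# One edge per row; the row number and the remaining columns are stored
# as edge properties.
for row_number, row in toy_edges.iterrows():
    props = row.drop(["FROM-ID", "TO-ID"]).to_dict()
    toy_graph.add_edge(row["FROM-ID"], row["TO-ID"], index=row_number, **props)

print(list(toy_graph.edges(data=True)))
print(toy_graph.number_of_edges())
```

Note that in an undirected graph the rows `a -> b` and `b -> a` describe the same edge, so the later row would overwrite the properties of the earlier one, and the edge count can be smaller than the number of rows.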

In [17]:
# Add edges to the graph here


Now that we have a populated network, let's plot it! We will use networkx's `draw()` functions.

**Deliverable**
* You should create a plot of the network. The position of every node (`pos`) should be defined by the lat and lon of the given node. The **color** and **size** of the nodes should be dependent on one of the node properties.
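One way to build the `pos` dictionary and the size and color lists is sketched below on a three-node toy graph. The venues, coordinates, check-in counts, and the `20 + c` size scaling are invented for illustration; any node property could drive the styling instead.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Toy graph with invented venues; the property names mirror df_nodes.
toy_graph = nx.Graph()
toy_graph.add_nodes_from([
    ("a", {"LAT": 24.70, "LONG": 46.60, "CHECK-IN": 10}),
    ("b", {"LAT": 24.71, "LONG": 46.62, "CHECK-IN": 250}),
    ("c", {"LAT": 24.73, "LONG": 46.61, "CHECK-IN": 40}),
])
toy_graph.add_edges_from([("a", "b"), ("b", "c")])

# Place each node at (longitude, latitude): x runs east-west, y north-south.
pos = {n: (d["LONG"], d["LAT"]) for n, d in toy_graph.nodes(data=True)}

# Here both size and color follow the check-in count.
checkins = [d["CHECK-IN"] for _, d in toy_graph.nodes(data=True)]
sizes = [20 + c for c in checkins]

fig, ax = plt.subplots(figsize=(6, 6))
nx.draw(toy_graph, pos=pos, ax=ax, node_size=sizes,
        node_color=checkins, cmap=plt.cm.viridis, edge_color="gray")
```

The lists passed as `node_size` and `node_color` must follow the same node order as the graph, which is why both are built by iterating over `toy_graph.nodes(data=True)`.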

In [6]:
# Draw the network here


## Part 3
Now that we have the network, it is useful to calculate some of its properties to gain insights into the region.

First, we will obtain the node degrees and create a histogram that shows the distribution of degrees across the network's nodes.

**Deliverable**
* You should create a histogram showing the degree distribution across the network. The **x-axis** should have the different degrees, and the **y-axis** should have the number of observations.
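A minimal sketch of a degree histogram, using a small invented graph in which one node has degree 3, one has degree 2, and three have degree 1. The graph and the choice of one bin per integer degree are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Toy graph (invented): "hub" has degree 3, "x" degree 2, the rest degree 1.
toy_graph = nx.Graph([("hub", "x"), ("hub", "y"), ("hub", "z"), ("x", "w")])

# G.degree() yields (node, degree) pairs; keep only the degrees.
degrees = [deg for _, deg in toy_graph.degree()]

# One bin per integer degree, centered on the integer.
bins = [b - 0.5 for b in range(min(degrees), max(degrees) + 2)]

fig, ax = plt.subplots()
counts, _, _ = ax.hist(degrees, bins=bins, edgecolor="white")
ax.set_xlabel("Degree")
ax.set_ylabel("Number of nodes")
ax.set_title("Degree distribution")
```

Centering the bins on integers avoids the default behavior of splitting the range into equal-width bins, which can merge neighboring degree values into one bar.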

In [7]:
# Create your histogram here


Finally, let's look at another useful network property: **centrality**. Centrality indicators help identify the most important vertices within a graph. With centrality algorithms, it is possible to identify key infrastructure nodes in urban networks.

We will be using 2 centrality measures: degree centrality and betweenness centrality. **Betweenness** is a centrality measure of a vertex within a graph. It quantifies the number of times a node acts as a bridge along the shortest path between two other nodes. Conceptually, nodes that are more important to the functioning of the network will have a higher betweenness centrality. For `networkx`, the betweenness centrality of a node v is the sum of the fraction of all-pairs shortest paths that pass through v.

The historically first and conceptually simplest centrality measure is **degree centrality**, which is defined as the number of links incident upon a node (the number of ties that a node has). For `networkx`, the degree centrality for a node v is the fraction of nodes it is connected to.
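To make the two definitions concrete, here is a small worked example on an invented three-node path `a - b - c`, where `b` is the only bridge between `a` and `c`. Note that `nx.betweenness_centrality` normalizes its values by default (`normalized=True`), so the results are fractions rather than raw path counts.

```python
import networkx as nx

# Toy path a - b - c (invented): "b" lies on the only route between "a" and "c".
toy_graph = nx.Graph([("a", "b"), ("b", "c")])

# Fraction of the other nodes each node is connected to.
degree_c = nx.degree_centrality(toy_graph)

# Normalized share of shortest paths that pass through each node.
between_c = nx.betweenness_centrality(toy_graph)

print(degree_c)   # {'a': 0.5, 'b': 1.0, 'c': 0.5}
print(between_c)  # {'a': 0.0, 'b': 1.0, 'c': 0.0}
```

Node `b` is connected to both other nodes, so its degree centrality is 1.0, and it sits on the single shortest path between `a` and `c`, so its betweenness is also 1.0. The end nodes bridge nothing and score 0.0 on betweenness.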

Using networkx's built-in functions, calculate the **degree** and **betweenness** centrality of the network, and create 2 plots of the network, one per centrality measure, where either the color or the size of each node depends on that measure.

**Deliverable**
* You should create 2 plots of the network: one for degree centrality and one for betweenness centrality. The position of every node (`pos`) should be defined by the lat and lon of the given node. In each plot, either the **color** or the **size** of the nodes should depend on that plot's centrality measure. Whichever of the two is not driven by centrality should depend on another node property.
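A possible layout for the two plots is sketched below on an invented four-node graph. Color encodes the centrality measure and size encodes the check-in count. That pairing, the toy coordinates, and the `20 + CHECK-IN` scaling are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Toy graph with invented venues; the property names mirror df_nodes.
toy_graph = nx.Graph()
toy_graph.add_nodes_from([
    ("a", {"LAT": 24.70, "LONG": 46.60, "CHECK-IN": 10}),
    ("b", {"LAT": 24.71, "LONG": 46.62, "CHECK-IN": 250}),
    ("c", {"LAT": 24.73, "LONG": 46.61, "CHECK-IN": 40}),
    ("d", {"LAT": 24.72, "LONG": 46.64, "CHECK-IN": 90}),
])
toy_graph.add_edges_from([("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")])

# Position from (longitude, latitude); size from a non-centrality property.
pos = {n: (d["LONG"], d["LAT"]) for n, d in toy_graph.nodes(data=True)}
sizes = [20 + d["CHECK-IN"] for _, d in toy_graph.nodes(data=True)]

measures = {
    "Degree centrality": nx.degree_centrality(toy_graph),
    "Betweenness centrality": nx.betweenness_centrality(toy_graph),
}

# One panel per measure; node color follows the centrality score.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for ax, (title, scores) in zip(axes, measures.items()):
    colors = [scores[n] for n in toy_graph.nodes()]
    nx.draw(toy_graph, pos=pos, ax=ax, node_size=sizes,
            node_color=colors, cmap=plt.cm.plasma, edge_color="gray")
    ax.set_title(title)
```

Both centrality functions return a dictionary keyed by node, so the color list is built by looking up each node in graph order to keep it aligned with `sizes`.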

In [8]:
# Your code here
