## Introduction

This tutorial introduces GraphLab, a machine learning platform. We will focus mainly on its graph analytics module, since efficient graph computation was one of the original goals of the project (http://dl.acm.org/citation.cfm?id=2212354). The GraphLab project was started by Prof. Carlos Guestrin in 2009, while he was at Carnegie Mellon University. It has shown very good performance on many data mining tasks.

GraphLab's graph analytics toolkit contains algorithms for analyzing graphs, including connected components, k-core decomposition, PageRank, single-source shortest path, and others.

### Tutorial content

In this tutorial, I will show how to do some basic graph analysis in Python, specifically using [GraphLab](https://turi.com).

We will cover the following topics in this tutorial:
- [Installing the library](#Installing-the-library)
- [Loading data and initializing graph](#Loading-data-and-initializing-graph)
- [Graph Properties](#Graph-Properties)
- [Graph Algorithms](#Graph-Algorithms)

## Installing the library
Before getting started, make sure Python 2.7.x and pip (version >= 7) are installed on your local machine.

Then we can install the GraphLab library used in this tutorial. Install GraphLab with `pip`, then import the `graphlab` library in your Python script:

    $ pip install graphlab-create

In [4]:
import graphlab as gl
import pandas
import urllib2
import StringIO
import gzip

## Loading data and initializing graph
Now that we have installed and imported the graphlab library, our next step will be loading data and initializing our graph.

In this tutorial, we will use the small graph shown below as an example. It is a directed, unweighted graph with 13 vertices and 22 edges in total.

[<img src="http://algs4.cs.princeton.edu/42digraph/images/digraph.png">](http://algs4.cs.princeton.edu/42digraph/images/digraph.png)

In [5]:
def create_graph(src, dst):
    g = gl.SGraph()
    edges = gl.SFrame({'src': src, 'dst': dst})
    g = g.add_edges(edges, src_field='src', dst_field='dst')
    return g

In [7]:
src = [0, 0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 11, 12]
dst = [1, 5, 0, 3, 2, 5, 2, 3, 4, 0, 4, 9, 6, 8, 7, 9, 10, 11, 12, 4, 12, 9]

g = create_graph(src, dst)

Once the graph is initialized, we can access it through its built-in methods. The `edges` object is an SFrame, a tabular, column-mutable dataframe designed to scale to big data. The `g` object is built from that SFrame and is an SGraph, GraphLab's scalable graph data structure, which lets you add vertices and edges dynamically. You can get basic statistics about the graph by calling the `summary()` method of the SGraph object; in our example, it reports the number of edges and the number of vertices. You can also list the edges by printing the `edges` attribute.

In [14]:
print g.summary()
print g.edges.tail(5)

{'num_edges': 22, 'num_vertices': 13}
+----------+----------+
| __src_id | __dst_id |
+----------+----------+
|    12    |    9     |
|    4     |    2     |
|    11    |    12    |
|    4     |    3     |
|    11    |    4     |
+----------+----------+
[5 rows x 2 columns]
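
As a sanity check that does not depend on GraphLab, the same edge list can be held in a plain Python adjacency mapping. This is only a minimal sketch: `build_adjacency` is a hypothetical helper introduced here for illustration, not part of the GraphLab API.

```python
# Edge endpoints copied from the cell that builds the sample graph.
src = [0, 0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 11, 12]
dst = [1, 5, 0, 3, 2, 5, 2, 3, 4, 0, 4, 9, 6, 8, 7, 9, 10, 11, 12, 4, 12, 9]


def build_adjacency(src, dst):
    """Map every vertex to the list of vertices its out-edges point to.

    Hypothetical helper for illustration only (not a GraphLab function).
    """
    adj = {}
    for u, v in zip(src, dst):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])  # make sure sink vertices also get an entry
    return adj


adj = build_adjacency(src, dst)
num_vertices = len(adj)
num_edges = sum(len(neighbors) for neighbors in adj.values())
print('num_edges = %d, num_vertices = %d' % (num_edges, num_vertices))
```

The two counts should agree with the `summary()` output shown above.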



GraphLab also provides APIs that help you visualize the graph you just created. Simply call the `show()` method of your SGraph object with some basic settings, and it will visualize the graph for you. You can view the figure in a web browser.

In [15]:
g.show(vlabel='id', arrows=True)

Canvas is accessible via web browser at the URL: http://localhost:57282/index.html
Opening Canvas in default web browser.


## Graph Properties

### Degree
The degree of a vertex is the number of edges connected to it. GraphLab provides APIs to compute the in-degree, out-degree, and total degree of each vertex. The following cell shows how to use the GraphLab API to compute the degrees of every vertex in the graph.

In [7]:
degree_model = gl.degree_counting.create(g, verbose=False)
graph_with_degree = degree_model['graph']
degree = graph_with_degree.vertices[['__id', 'in_degree', 'out_degree', 'total_degree']]
print degree.tail(5)

+------+-----------+------------+--------------+
| __id | in_degree | out_degree | total_degree |
+------+-----------+------------+--------------+
|  3   |     2     |     2      |      4       |
|  12  |     2     |     1      |      3       |
|  1   |     1     |     0      |      1       |
|  4   |     3     |     2      |      5       |
|  11  |     1     |     2      |      3       |
+------+-----------+------------+--------------+
[5 rows x 4 columns]
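
To make the definitions concrete, the three degrees can also be counted by hand from the edge list. The following is a minimal plain-Python sketch; `count_degrees` is a hypothetical name used here for illustration and is not how GraphLab computes the values internally.

```python
from collections import Counter

# Edge endpoints copied from the cell that builds the sample graph.
src = [0, 0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 11, 12]
dst = [1, 5, 0, 3, 2, 5, 2, 3, 4, 0, 4, 9, 6, 8, 7, 9, 10, 11, 12, 4, 12, 9]


def count_degrees(src, dst):
    """Return {vertex: (in_degree, out_degree, total_degree)}.

    Hypothetical helper for illustration only.
    """
    out_degree = Counter(src)  # one out-edge per appearance as a source
    in_degree = Counter(dst)   # one in-edge per appearance as a destination
    vertices = sorted(set(src) | set(dst))
    return {v: (in_degree[v], out_degree[v], in_degree[v] + out_degree[v])
            for v in vertices}


degrees = count_degrees(src, dst)
for vertex in (3, 12, 1, 4, 11):
    print('%d %s' % (vertex, degrees[vertex]))
```

The five vertices printed here are the same ones listed in the GraphLab output above, so the two can be compared row by row.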



### Connected Component
A connected component is a subgraph in which any two vertices are connected to each other by a path. GraphLab provides an API that finds all connected components in a graph. Its only input is an SGraph object like the one we created above. The following cell shows how to use it.

In [17]:
cc = gl.connected_components.create(g, verbose=False)
print cc.summary

<bound method ConnectedComponentsModel.summary of Class                                    : ConnectedComponentsModel

Graph
-----
num_edges                                : 22
num_vertices                             : 13

Results
-------
graph                                    : SGraph. See m['graph']
component size                           : SFrame. See m['component_size']
number of connected components           : 1
vertex component id                      : SFrame. See m['component_id']

Metrics
-------
training time (secs)                     : 1.0159

Queryable Fields
----------------
graph                                    : A new SGraph with the color id as a vertex property
component_id                             : An SFrame with each vertex's component id
component_size                           : An SFrame with the size of each component
training_time                            : Total training time of the model
>


As the result shows, GraphLab reports 1 connected component for our sample graph, which is not what we expected. The following figure shows the correct partitioning of the sample graph into 5 connected components. So why doesn't the API work here?
[<img src="http://algs4.cs.princeton.edu/42digraph/images/strong-components.png">](http://algs4.cs.princeton.edu/42digraph/images/strong-components.png)
The reason is that the API finds only weakly connected components: it ignores edge direction, as if every edge connected its two vertices in both directions, which is the usual notion for undirected graphs. The edges in our sample graph, however, are directed, so we need the strongly connected components, in which every pair of vertices has a path in each direction.
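
To see why ignoring direction yields a single component, here is a minimal union-find sketch in plain Python. It is an illustration of the idea of weak connectivity, not GraphLab's implementation; the function name is hypothetical.

```python
# Edge endpoints copied from the cell that builds the sample graph.
src = [0, 0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 11, 12]
dst = [1, 5, 0, 3, 2, 5, 2, 3, 4, 0, 4, 9, 6, 8, 7, 9, 10, 11, 12, 4, 12, 9]


def weakly_connected_components(src, dst):
    """Group vertices that are linked when edge direction is ignored.

    Hypothetical helper for illustration only.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    # Each edge merges the groups of its two endpoints, whatever its direction.
    for u, v in zip(src, dst):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    components = {}
    for v in list(parent):
        components.setdefault(find(v), []).append(v)
    return list(components.values())


wcc = weakly_connected_components(src, dst)
print('number of weakly connected components: %d' % len(wcc))
```

Every vertex of the sample graph ends up in the same group, which is consistent with the count reported by the GraphLab model.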

Here we will use [Kosaraju's algorithm](https://en.wikipedia.org/wiki/Kosaraju%27s_algorithm) to find all strongly connected components in our sample graph. Kosaraju's algorithm is a linear-time algorithm that performs two passes of depth-first search. The following is my implementation of Kosaraju's algorithm.

In [8]:
def reverse_graph(g):
    """ reverse all edges in a graph
    Inputs:
        g: a SGraph object representing the graph
    Outputs:
        reversed graph: a SGraph object with edges reversed
    """
    rev_g = gl.SGraph()
    src = list(g.get_edges()['__dst_id'].to_numpy())
    dst = list(g.get_edges()['__src_id'].to_numpy())
    rev_edges = gl.SFrame({'src': src, 'dst': dst})
    rev_g = rev_g.add_edges(rev_edges, src_field='src', dst_field='dst')
    return rev_g

def traverse(g, vertex, order, visited):
    """ traverse a graph and get the reversed order of the traversal
    Inputs:
        g: a SGraph object representing the graph
    """
    if vertex in visited:
        return
    visited.add(vertex)
    for edge in g.get_edges(vertex):
        traverse(g, edge['__dst_id'], order, visited)
    order.append(vertex)

def assign(g, vertex, visited, component):
    """ traverse a graph and assign component id to each vertex
    Inputs:
        g: a SGraph object representing the graph
    """
    visited.add(vertex)
    component.append(vertex)
    for edge in g.get_edges(vertex):
        if edge['__dst_id'] in visited:
            continue
        assign(g, edge['__dst_id'], visited, component)

def Kosaraju(g):
    """ find all strong connected components of a graph
    Inputs:
        g: a SGraph object representing the graph
    Outputs:
        connected component: a dict, whose key is the connected component id
                             and the value is a list of vertices which are
                             included in the component
    """
    rev_g = reverse_graph(g)

    order = []
    visited = set([])

    for vertex in rev_g.vertices:
        traverse(rev_g, vertex['__id'], order, visited)
    order.reverse()
    
    cc_cnt = 0
    cc = {}
    visited = set([])
    for vertex in order:
        if vertex in visited:
            continue
        cc[cc_cnt] = []
        assign(g, vertex, visited, cc[cc_cnt])
        cc_cnt += 1
    
    return cc


In [54]:
cc = Kosaraju(g)
print "\n------------- Results --------------"
for id, vertices in cc.iteritems():
    print 'connected component ' + str(id) + ': ',
    print len(vertices),
    print 'vertices',
    print vertices



------------- Results --------------
connected component 0:  1 vertices [1]
connected component 1:  5 vertices [5, 4, 2, 0, 3]
connected component 2:  4 vertices [11, 12, 9, 10]
connected component 3:  1 vertices [6]
connected component 4:  2 vertices [7, 8]
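
The recursive `traverse` and `assign` helpers above recurse once per visited vertex, so a long path can exceed Python's default recursion limit. As an alternative, here is a sketch of the same two-pass idea with explicit stacks on plain adjacency dictionaries. It is an illustrative variant under that assumption, not a drop-in replacement for the SGraph-based code, and the function name is hypothetical.

```python
# Edge endpoints copied from the cell that builds the sample graph.
src = [0, 0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 11, 12]
dst = [1, 5, 0, 3, 2, 5, 2, 3, 4, 0, 4, 9, 6, 8, 7, 9, 10, 11, 12, 4, 12, 9]


def strongly_connected_components(src, dst):
    """Kosaraju's two-pass DFS without recursion (illustrative sketch)."""
    adj, rev = {}, {}
    for u, v in zip(src, dst):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])
        rev.setdefault(v, []).append(u)
        rev.setdefault(u, [])

    # Pass 1: record the post-order of a DFS over the reversed graph.
    order, visited = [], set()
    for start in rev:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(rev[start]))]
        while stack:
            vertex, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(rev[nxt])))
                    break
            else:
                # All neighbors explored: the vertex is finished.
                order.append(vertex)
                stack.pop()

    # Pass 2: DFS over the original graph in reverse post-order;
    # each search started from an unvisited vertex collects one component.
    components, visited = [], set()
    for start in reversed(order):
        if start in visited:
            continue
        visited.add(start)
        component, stack = [], [start]
        while stack:
            vertex = stack.pop()
            component.append(vertex)
            for nxt in adj[vertex]:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


sccs = strongly_connected_components(src, dst)
for index, vertices in enumerate(sccs):
    print('component %d: %s' % (index, sorted(vertices)))
```

The order in which components are listed may differ from the output above, but the vertex groups should be the same.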


## Graph Algorithms

### Shortest Path
The shortest path problem asks for a path between two vertices that minimizes the sum of the weights of its edges. Since the edges of our graph are unweighted, every edge counts as 1. GraphLab provides an API that computes the single-source shortest path distance from a given source vertex to all other vertices. The following function uses it to compute the distance from a given source vertex to every vertex reachable from it.

In [9]:
def shortest_path(g, src, inf=1e30):
    """ Single source shortest path
    Inputs:
        g: a SGraph object representing the graph
        src: id of the source vertex
        inf: sentinel distance that marks an unreachable vertex
    Outputs:
        dist: a dictionary mapping each reachable vertex id to its
              distance from src
    """
    sp_model = gl.shortest_path.create(g, source_vid=src, verbose=False)
    dist = sp_model['distance']
    df = dist.to_dataframe()
    df = df.set_index('__id')
    dist = df.to_dict(orient='dict')['distance']
    cleaned_dist = {}
    for dst, distance in dist.iteritems():
        if distance != inf:
            cleaned_dist[dst] = distance
    return cleaned_dist
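
For an unweighted graph like ours, single-source shortest path distances can also be obtained with a breadth-first search. The sketch below is a plain-Python illustration of that idea, useful for cross-checking the GraphLab results; `bfs_distances` is a hypothetical helper, and it returns integer hop counts rather than floats.

```python
from collections import deque

# Edge endpoints copied from the cell that builds the sample graph.
src = [0, 0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 11, 12]
dst = [1, 5, 0, 3, 2, 5, 2, 3, 4, 0, 4, 9, 6, 8, 7, 9, 10, 11, 12, 4, 12, 9]


def bfs_distances(src, dst, source):
    """Return {vertex: hop count from source} for reachable vertices only.

    Hypothetical helper for illustration; assumes every edge has weight 1.
    """
    adj = {}
    for u, v in zip(src, dst):
        adj.setdefault(u, []).append(v)

    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:  # first visit is along a shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


print(bfs_distances(src, dst, 0))
```

Unreachable vertices are simply absent from the returned dictionary, which mirrors how `shortest_path` filters out entries equal to `inf`.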


In [22]:
dist = {}
for d in g.vertices:
    src = d['__id']
    single_dist = shortest_path(g, src)
    dist[src] = single_dist
print 'Number of vertices = ' + str(len(dist))
for src, d in dist.iteritems():
    print 'source = ' + str(src),
    print 'distance dict: ' + str(d)

Number of vertices = 13
source = 0 distance dict: {0: 0.0, 1: 1.0, 2: 3.0, 3: 3.0, 4: 2.0, 5: 1.0}
source = 1 distance dict: {1: 0.0}
source = 2 distance dict: {0: 1.0, 1: 2.0, 2: 0.0, 3: 1.0, 4: 3.0, 5: 2.0}
source = 3 distance dict: {0: 2.0, 1: 3.0, 2: 1.0, 3: 0.0, 4: 2.0, 5: 1.0}
source = 4 distance dict: {0: 2.0, 1: 3.0, 2: 1.0, 3: 1.0, 4: 0.0, 5: 2.0}
source = 5 distance dict: {0: 3.0, 1: 4.0, 2: 2.0, 3: 2.0, 4: 1.0, 5: 0.0}
source = 6 distance dict: {0: 1.0, 1: 2.0, 2: 2.0, 3: 2.0, 4: 1.0, 5: 2.0, 6: 0.0, 9: 1.0, 10: 2.0, 11: 2.0, 12: 3.0}
source = 7 distance dict: {0: 2.0, 1: 3.0, 2: 3.0, 3: 3.0, 4: 2.0, 5: 3.0, 6: 1.0, 7: 0.0, 8: 1.0, 9: 2.0, 10: 3.0, 11: 3.0, 12: 4.0}
source = 8 distance dict: {0: 3.0, 1: 4.0, 2: 4.0, 3: 4.0, 4: 3.0, 5: 4.0, 6: 2.0, 7: 1.0, 8: 0.0, 9: 1.0, 10: 2.0, 11: 2.0, 12: 3.0}
source = 9 distance dict: {0: 4.0, 1: 5.0, 2: 3.0, 3: 3.0, 4: 2.0, 5: 4.0, 9: 0.0, 10: 1.0, 11: 1.0, 12: 2.0}
source = 10 distance dict: {0: 6.0, 1: 7.0, 2: 5.0, 3: 5.0, 4: 4.0, 5:

### Diameter
The diameter of a graph is the largest shortest-path distance between any pair of vertices. The simplest way to find it is to compute the shortest path between every pair of vertices and take the largest distance. We can reuse the `shortest_path` function we wrote earlier for this. Because that function drops unreachable vertices, only pairs connected by a directed path are considered. The following cell shows how to use the shortest path results to compute the diameter of a graph.

In [10]:
def diameter(g):
    """ compute diameter of a graph
    Inputs:
        g: a SGraph object representing the graph
    Outputs:
        diameter, src, dst: the diameter of the input graph and one pair of
                            source and destination vertices that attains it
    """
    max_distance = 0.0
    source = None
    destination = None
    for d in g.vertices:
        src = d['__id']
        dist = shortest_path(g, src)
        for dst, distance in dist.iteritems():
            if distance > max_distance:
                max_distance = distance
                source = src
                destination = dst
    return (max_distance, source, destination)

In [24]:
d, src, dst = diameter(g)
print 'diameter = ' + str(d)
print 'source = ' + str(src)
print 'destination = ' + str(dst)

diameter = 7.0
source = 10
destination = 1
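
The same all-pairs idea can be sketched without GraphLab by running one breadth-first search per vertex. This is an illustrative plain-Python version that assumes unit edge weights and, like the function above, considers only pairs joined by a directed path; the function name is hypothetical.

```python
from collections import deque

# Edge endpoints copied from the cell that builds the sample graph.
src = [0, 0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 11, 12]
dst = [1, 5, 0, 3, 2, 5, 2, 3, 4, 0, 4, 9, 6, 8, 7, 9, 10, 11, 12, 4, 12, 9]


def diameter_over_reachable_pairs(src, dst):
    """Return (distance, source, destination) for the farthest reachable pair.

    Hypothetical helper for illustration; every edge counts as one hop.
    """
    adj = {}
    for u, v in zip(src, dst):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])

    best = (0, None, None)
    for source in adj:
        # Breadth-first search from this source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        farthest = max(dist, key=dist.get)
        if dist[farthest] > best[0]:
            best = (dist[farthest], source, farthest)
    return best


print(diameter_over_reachable_pairs(src, dst))
```

On the sample graph this should report the same distance and endpoints as the GraphLab-based `diameter` function, with an integer distance instead of a float.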


### PageRank
[PageRank](https://en.wikipedia.org/wiki/PageRank) is an algorithm first introduced by Google to rank websites. The basic idea is to use the links pointing into and out of each webpage to determine its importance.

GraphLab also provides an API that computes the PageRank score of each node in the graph. Configurable parameters include `reset_probability` (the probability that a random surfer jumps to an arbitrary page), `threshold` (the convergence threshold), and `max_iterations` (the maximum number of iterations to run). The following cell shows how to use GraphLab's PageRank API to compute the PageRank scores of all vertices in a graph.

In [54]:
def pagerank(g):
    pr = gl.pagerank.create(g, reset_probability=0.15, threshold=0.01, max_iterations=100, verbose=False)
    df = pr['pagerank'].to_dataframe()[['__id', 'pagerank']]
    d = {int(row['__id']): row['pagerank'] for _, row in df.iterrows()}
    return d


In [16]:
pr = pagerank(g)
for id, score in pr.iteritems():
    print str(id),
    print score

0 0.762451201699
1 0.474161005331
2 1.26603301276
3 1.26603301276
4 1.35859736242
5 1.01269916296
6 0.260869566109
7 0.260869566109
8 0.260869566109
9 1.15857248236
10 0.643541120083
11 0.643541120083
12 0.970514927593
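
To illustrate what the three parameters control, here is a minimal power-iteration sketch in plain Python. It uses the unnormalized update `rank(v) = reset_probability + (1 - reset_probability) * sum(rank(u) / out_degree(u))` over the in-neighbors `u` of `v`. Whether GraphLab uses exactly this update and stopping rule is an assumption; it is at least consistent with the scores printed above for vertices 6, 7 and 8, whose fixed point under this update is 0.15 / (1 - 0.425), roughly 0.2609.

```python
# Edge endpoints copied from the cell that builds the sample graph.
src = [0, 0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 11, 12]
dst = [1, 5, 0, 3, 2, 5, 2, 3, 4, 0, 4, 9, 6, 8, 7, 9, 10, 11, 12, 4, 12, 9]


def pagerank_sketch(src, dst, reset_probability=0.15, threshold=0.01,
                    max_iterations=100):
    """Unnormalized PageRank by power iteration (illustrative sketch only).

    The update rule and the L1 stopping criterion are assumptions,
    not a description of GraphLab's internals.
    """
    out_degree, incoming = {}, {}
    for u, v in zip(src, dst):
        out_degree[u] = out_degree.get(u, 0) + 1
        out_degree.setdefault(v, 0)
        incoming.setdefault(v, []).append(u)
        incoming.setdefault(u, [])

    rank = {v: 1.0 for v in incoming}
    for _ in range(max_iterations):
        new_rank = {}
        for v in incoming:
            # Each in-neighbor shares its rank equally among its out-edges.
            received = sum(rank[u] / out_degree[u] for u in incoming[v])
            new_rank[v] = reset_probability + (1.0 - reset_probability) * received
        change = sum(abs(new_rank[v] - rank[v]) for v in incoming)
        rank = new_rank
        if change < threshold:  # total change small enough: stop early
            break
    return rank


scores = pagerank_sketch(src, dst)
for vertex in sorted(scores):
    print('%d %.4f' % (vertex, scores[vertex]))
```

A smaller `threshold` makes the loop run longer and the scores settle closer to the fixed point, while `max_iterations` caps the work if convergence is slow.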


## Real-world Graph
Next, we apply our implementations of the graph algorithms to a real-world graph from the KONECT network collection (http://konect.uni-koblenz.de/networks/moreno_innovation). We first load the data into memory and create our SGraph object.

In [29]:
# read edges from data: http://konect.uni-koblenz.de/networks/moreno_innovation
delimiter = ' '
with open("real_world_graph.tsv") as f:
    edges = [(int(edge.split(delimiter)[0]), int(edge.split(delimiter)[1])) for edge in f.read().splitlines()]

# remove duplicate edges
edges_set = set(edges)
edges = list(edges_set)
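
The cell above assumes that every line of the file holds at least two integer fields separated by a single space. If your copy of the file also contains header or comment lines, blank lines, or extra columns, a slightly more defensive parser may help. The sketch below is only an illustration: `parse_edge_list`, the `%` comment prefix, and the sample lines are all assumptions made here, so check them against your own file.

```python
def parse_edge_list(lines, comment_prefix='%'):
    """Return a sorted list of unique (src, dst) pairs from text lines.

    Hypothetical helper; the comment prefix is an assumption about the file.
    """
    edges = set()  # a set removes duplicate edges
    for line in lines:
        line = line.strip()
        if not line or line.startswith(comment_prefix):
            continue  # skip blank lines and comment/header lines
        fields = line.split()  # split on any run of whitespace
        edges.add((int(fields[0]), int(fields[1])))  # ignore extra columns
    return sorted(edges)


# Made-up sample lines, for illustration only (not taken from the dataset).
sample_lines = [
    '% hypothetical header line',
    '1 2',
    '2 3',
    '1 2',
    '',
    '3 1 1',
]
edges = parse_edge_list(sample_lines)
print(edges)
```

With a real file, the same function could be called on the open file object, for example `parse_edge_list(open("real_world_graph.tsv"))`.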


The graph has 241 vertices connected by 1098 edges as shown below.

In [30]:
src = [u for u, v in edges]
dst = [v for u, v in edges]

g = create_graph(src, dst)
print g.summary


<bound method SGraph.summary of SGraph({'num_edges': 1098, 'num_vertices': 241})
Vertex Fields:['__id']
Edge Fields:['__src_id', '__dst_id']>


Then we can compute both the weakly and the strongly connected components, using GraphLab's API for the former and the `Kosaraju` function we implemented earlier for the latter.

In [64]:
# weak connected component
wcc = gl.connected_components.create(g, verbose=False)
# strong connected component
scc = Kosaraju(g)

In [65]:
# largest weak connected component
print 'largest   weak connected component -> vertices size: ' + str(wcc['component_size']['Count'].max())
# largest strong connected component
LSCC = []
for id, vertices in scc.iteritems():
    if len(vertices) > len(LSCC):
        LSCC = vertices
print 'largest strong connected component -> vertices size: ' + str(len(LSCC))

largest   weak connected component -> vertices size: 117
largest strong connected component -> vertices size: 95


We can also apply algorithms in the previous section to this graph.

In [52]:
d, src, dst = diameter(g)
print 'diameter -> length: ' + str(d)

diameter -> length: 9.0


In [63]:
pr = pagerank(g)
id = max(pr, key=pr.get)
print 'max pagerank -> id: ' + str(id) + ', score: ' + str(pr[id])

max pagerank -> id: 15, score: 3.78467950157


## Summary and references
This tutorial highlighted just a few of the things (basic graph properties and algorithms) that we can do with GraphLab's graph analytics module in Python. GraphLab offers much more than graph analytics. More details about its other libraries are available from the following links.

1. GraphLab docs: https://turi.com/products/create/docs/
2. graph analytics toolkit: https://turi.com/products/create/docs/graphlab.toolkits.graph_analytics.html?highlight=graph%20analytics