# Introduction
Graphlab is a powerful data analytics framework, which was initially tailored to handle graph data. The library has now grown into a general purpose data analytics platform which provides various features including Data Modelling, Text Analysis, Machine Learning, Data Visualization and a lot more. This tutorial, will walk through some key Graph Analytics algorithms implemeneted by GraphLab. 


# Table of Contents
1. [Installation](#Installation)
2. [Graph Data](#Graph-Data)
3. [Inspecting Graph Data](#Inspecting-Graph-Data)
4. [Modifying the Graph](#Modifying-the-Graph-Data)
5. [Real World Data](#Real-World-Data)
    * [Degree Counting](#Degree-Counting)
    * [PageRank Algorithm](#Page-Rank)
    * [Triangle Counting](#Triangle-Counting)
    * [Single Source Shortest Path](#Single-Source-Shortest-Path)
    * [Connected Components](#Connected-Components)
    * [K Core](#K-Core)
12. [Conclusion](#Conclusion)
13. [References](#References)

## Installation 

To get started, register for an [adacemic license](https://turi.com/download/academic.html). Once you have registered, make sure you have Anaconda installed on your machine with a python version 2.7.x. You can install Graphlab-create using:



`pip install --upgrade --no-cache-dir https://get.graphlab.com/GraphLab-Create/2.1/your registered email address here/your product key here/GraphLab-Create-License.tar.gz`


Now that you've installed Graphlab, you import the library and start using it.

In [6]:
import graphlab as gl

## Graph Data

Graphs provide a natural schema to model data, making it easy to visualize relationships between two entinites in the dataset. Entities in a dataset are modelled as vertices while the relationships between them are modelled as edges in the graph. 

To facilitate scallable graph analytics, GraphLab provides the SGraph object which extends the features of the SFrame object. SGraph objects can be used to flexibly label the vertices and edges with a dictonary of attributes.

In this section, we will start with an empty graph. We will then add vertices with attributes and connect the vertices with edges. 

In [51]:
g = gl.SGraph()

vertices = [gl.Vertex(0, attr={'club': 'FC Barcelona'}),
            gl.Vertex(1, attr={'club': 'Chelsea'}),
            gl.Vertex(2, attr={'club': 'FC Barcelona'})]
g = g.add_vertices(vertices)

edges = [gl.Edge(1,2),
         gl.Edge(0,1)]
g = g.add_edges(edges)

print g.summary
print g.vertices
print g.edges

<bound method SGraph.summary of SGraph({'num_edges': 2, 'num_vertices': 3})
Vertex Fields:['__id', 'club']
Edge Fields:['__src_id', '__dst_id']>
+------+--------------+
| __id |     club     |
+------+--------------+
|  0   | FC Barcelona |
|  2   | FC Barcelona |
|  1   |   Chelsea    |
+------+--------------+
[3 rows x 2 columns]

+----------+----------+
| __src_id | __dst_id |
+----------+----------+
|    0     |    1     |
|    1     |    2     |
+----------+----------+
[2 rows x 2 columns]



We can combine the three steps into a single step and generate a graph with randomly generated of edges. To view these edges in our python notebook, we will set the GraphLab canvas to ipynb.

In [8]:
import random

g = gl.SGraph().add_vertices([gl.Vertex(i) for i in range(10)]).add_edges(
    [gl.Edge(random.randrange(10), random.randrange(10)) for i in range(20)])

gl.canvas.set_target('ipynb')
g.show(vlabel='id', arrows=True)

We will now try to import data from a csv file and load it into an SFrame, GraphLab's data structure to store tabular data. This data from the SFrame can then be converted into a SGraph. We would load this data to later run a bunch of graph analytics algorithms on it.


For the purpose of this tutorail, we will use the James Bond dataset available on the Turi website. We will also print the data in the native tabular format. We will then save the graph generated into a local file. We will save two  copies, one to copy to load it into an SGraph at a later point and another copy in the `json` format to make it readbale. This is because GraphLab lets the users load graphs from native or remote files. 

In [9]:
edge_data = gl.SFrame.read_csv(
    'https://static.turi.com/datasets/bond/bond_edges.csv')
print edge_data


vertex_data = gl.SFrame.read_csv(
    'https://static.turi.com/datasets/bond/bond_vertices.csv')
print vertex_data

g = gl.SGraph(vertices=vertex_data, edges=edge_data, vid_field='name',
           src_field='src', dst_field='dst')

g.save('james_bond')
g.save('james_bond.json', format='json')


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


+----------------+------------+------------+
|      src       |    dst     |  relation  |
+----------------+------------+------------+
|    Wai Lin     | James Bond |   friend   |
|       M        | James Bond |  worksfor  |
| Inga Bergstorm | James Bond |   friend   |
| Elliot Carver  | James Bond | killed_by  |
|   Gotz Otto    | James Bond | killed_by  |
|   James Bond   |     M      | managed_by |
|       Q        |     M      | managed_by |
|   Moneypenny   |     M      | managed_by |
|       Q        | Moneypenny | colleague  |
|       M        | Moneypenny |  worksfor  |
+----------------+------------+------------+
[20 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


+----------------+--------+-----------------+---------+
|      name      | gender | license_to_kill | villian |
+----------------+--------+-----------------+---------+
|   James Bond   |   M    |        1        |    0    |
|       M        |   M    |        1        |    0    |
|   Moneypenny   |   F    |        1        |    0    |
|       Q        |   M    |        1        |    0    |
|    Wai Lin     |   F    |        1        |    0    |
| Inga Bergstorm |   F    |        0        |    0    |
| Elliot Carver  |   M    |        0        |    1    |
|  Paris Carver  |   F    |        0        |    1    |
|   Gotz Otto    |   M    |        0        |    1    |
|  Henry Gupta   |   M    |        0        |    1    |
+----------------+--------+-----------------+---------+
[10 rows x 4 columns]



## Inspecting Graph Data

GraphLab lets us inspect the structure of the graphs at individual vertex and edge level. To begin with, we start with printing the entire graph, highlighting the James Bond vertex alone.

We would then inspect the attributes of the James Bond vertex and all the edges of the `'worksfor'` relation.

Finally, we will inspect the subgraph which contains just the vertices who are within the one hop radius of 'James Bond' or 'Moneypenney'.

In [10]:
bond_graph = gl.load_sgraph('james_bond')

bond_graph.show(vlabel = 'id',
               elabel = 'relation',
               highlight = ['James Bond'], 
               arrows = True)

In [11]:
sub_verts = bond_graph.get_vertices(ids=['James Bond'])
sub_verts.show()

In [12]:
sub_edges = bond_graph.get_edges(fields={'relation': 'worksfor'})
sub_edges.show()

In [13]:
targets = ['James Bond', 'Moneypenny']
subgraph = bond_graph.get_neighborhood(ids=targets, 
                                       radius=1, 
                                       full_subgraph=True)
subgraph.show(vlabel='id', 
              highlight=['James Bond', 'Moneypenny'], 
              arrows=True)

## Modifying the Graph Data

The underlying strucutre of the data is structurally immutable. However, we can modify the SFrame objects associated to the SGraph.edges and SGraph.vertices objects. 

In [14]:
edges = bond_graph.get_edges()
relation = edges['relation'].apply(lambda x: x[0].upper())
bond_graph.edges['relation'] = relation
bond_graph.get_edges().show()

Accessing the SGraph.vertices and SGraph.edges attributes returns special SFrame. Only the `__id`, `__src_id` and the `__dst_id` can not be modified since alterning these attributes would change the structure of the underlying graph. Howeverm you can always add and delete other attributes Here is an example of altering the SFrame by first adding an edge wieght and later deleting it. 

In [54]:
bond_graph.edges['weight'] = [random.uniform(0.0,1.0) for i in range(len(bond_graph.edges))]
bond_graph.edges['weight'].show()
del bond_graph.edges['weight']

We will now explore the tripple apply function which allows the user to asynchronously modify each vertex and its edges asynchronously.

In [55]:
def count_degree(source, edge, destination):
    source['degree'] += 1
    destination['degree'] += 1
    return (source, edge, destination)

bond_graph.vertices['degree'] = 0

bond_graph = bond_graph.triple_apply(count_degree, 
                                     mutated_fields=['degree'])
print bond_graph.vertices.sort('degree', ascending=False)

+----------------+--------+-----------------+---------+--------+
|      __id      | gender | license_to_kill | villian | degree |
+----------------+--------+-----------------+---------+--------+
|   James Bond   |   M    |        1        |    0    |   8    |
| Elliot Carver  |   M    |        0        |    1    |   7    |
|       M        |   M    |        1        |    0    |   6    |
|       Q        |   M    |        1        |    0    |   4    |
|   Moneypenny   |   F    |        1        |    0    |   4    |
|  Paris Carver  |   F    |        0        |    1    |   3    |
| Inga Bergstorm |   F    |        0        |    0    |   2    |
|    Wai Lin     |   F    |        1        |    0    |   2    |
|  Henry Gupta   |   M    |        0        |    1    |   2    |
|   Gotz Otto    |   M    |        0        |    1    |   2    |
+----------------+--------+-----------------+---------+--------+
[10 rows x 5 columns]



## Real World Data

We will first import a real world dataset into an SGraph and then run a bunch of graph analytics algorithms on it. The idea is to run some standardized algorithms to learn about the structre of the graph and the importance of various nodes in the graph. 

As an aside, it is important to note that GraphLab, unlike other traditional machine learning/data sciece libraries does not use mathematical functions to implement the algorithms. Instead, the library uses an underlying C++ engine which applies the selected algorithm at the granularity of a vertex. In esence, it uses three steps to apply the algorithm on a particular vertex:
1. __Gather__ updates from neighboring vertices.
2. __Apply__ the update function to the vertex. 
3. __Scatter__ the updates to all the neighbors of the selected vertex.

Splitting any algorithm into these set of steps lets GraphLab distrbute this computation across multiple machines and hence converge on a value faster. 

As for this tutorial, for each of the basic algorithms, we would first run the model on it, print its summary and then print some key features of the resultant graphs to help us understand the underlying structure of the graph.

Here is the code to load the real world data into our graph. For the sake of this tutorial we will use the Wikipedia dataset where vertices represent Wikipedia articles about US businesses and edges represent hyperlinks between articles.

In [56]:
import os

data_file = 'US_business_links'
if os.path.exists(data_file):
    sg = gl.load_sgraph(data_file)
else:
    url = 'https://static.turi.com/datasets/' + data_file
    sg = gl.load_sgraph(url)
    sg.save(data_file)

print sg.summary()

{'num_edges': 517127, 'num_vertices': 233121}


## Degree Counting
Intutively, the easiest parameter to help understand the underlying structure of a graph is to look at its degree distributtion and then look at the most *connected* nodes, implying that these are the 'popular' vertices in the network.

The algorithm is fairly straight forward, the idea is to simply sum up the number of out edges and in edges for each vertex in the network.

In [57]:
#degree counting 
from graphlab import degree_counting

dg = degree_counting.create(sg, verbose=False)
dg.show()

degree_graph = dg['graph']
degree_graph.vertices.show()


high_total_degree = dg['graph'].vertices.sort('total_degree', ascending = False).head(10)

subgraph = dg['graph'].get_neighborhood(ids=top_pages['__id'], 
                                       radius = 0, 
                                       full_subgraph = True)

print subgraph.vertices



+-------------------------------+-----------+------------+--------------+
|              __id             | in_degree | out_degree | total_degree |
+-------------------------------+-----------+------------+--------------+
|    The Walt Disney Company    |    4364   |    339     |     4703     |
|             Google            |    8429   |    332     |     8761     |
| American Broadcasting Company |   30450   |    463     |    30913     |
|            Twitter            |   11790   |    494     |    12284     |
|       Ford Motor Company      |   24100   |    488     |    24588     |
|       Paramount Pictures      |    9976   |    321     |    10297     |
|       Columbia Pictures       |    7098   |    241     |     7339     |
|           Microsoft           |   13980   |    370     |    14350     |
|            Facebook           |   11888   |    333     |    12221     |
|           DC Comics           |   14037   |    280     |    14317     |
+-------------------------------+-----

## Page Rank 
One of the most fundamental graph analytics algorithm, PageRank assigns a score of the "importance" of each webpage. The 'Pagerank' of a particular page is  dependent on the pageranks of pages linked to it.

Each page assigns a score of 1/N to each page it is linked to, where N is the number of links from the page. The idea is to think of it as the probability of jumping to any of the linked pages. An additional damping factor (fixed weight) is added to all the pages in the graph. This is to model the jumps to pages without any links.

The page rank algorithm is implemented using the following equation:

    R(v) = α + ((1 − α)*R(u)/OutDegree(u))
    


In [50]:
#Page rank

from graphlab import pagerank
pr = pagerank.create(sg, max_iterations = 10, verbose = False)

top_pages = pr['graph'].vertices.sort('pagerank', ascending = False).head(10)

print pr.show()

subgraph = pr['graph'].get_neighborhood(ids=top_pages['__id'], 
                                       radius = 0, 
                                       full_subgraph = True)

print subgraph.vertices


None
+-------------------------------+---------------+---------------+
|              __id             |    pagerank   |     delta     |
+-------------------------------+---------------+---------------+
|    The Walt Disney Company    | 878.301063411 | 74.9881435439 |
|             Google            | 1180.48509934 |  41.271924661 |
|            Twitter            | 1074.30512161 | 16.3932159159 |
| American Broadcasting Company | 3050.12285481 | 171.095519607 |
|       Paramount Pictures      | 1341.29716993 | 74.7486642804 |
|           Microsoft           | 1640.93294801 | 79.4193105615 |
|       Columbia Pictures       | 921.042912728 | 55.2951992749 |
|       Ford Motor Company      | 1156.31695182 | 84.5552251247 |
|            Facebook           | 1218.69295972 | 23.8510621222 |
|           DC Comics           | 1623.76580168 |  231.05676926 |
+-------------------------------+---------------+---------------+
[10 rows x 3 columns]



## Triangle Counting 
The number of triangles is a computationally expensive graph statistic which is frequently used in complex network analysis, in various random graph models and in important real world applications such as spam detection, and providing recommendations to users. GraphLab can perform this computationally expensive statistic with just a few lines. The table below outputs the total number of triangles and outputs the triangle count for each of the companies

In [58]:
#Triangle Counting
from graphlab import triangle_counting

tri_count = triangle_counting.create(sg, verbose = False)
tri_count.show()

tri_out = tri_count['triangle_count']
print tri_out.topk('triangle_count', k=10)

+-------------------------------+----------------+
|              __id             | triangle_count |
+-------------------------------+----------------+
|           Microsoft           |     21446      |
|             Google            |     15491      |
|            Facebook           |     14200      |
|              IBM              |     11715      |
|       Paramount Pictures      |     10547      |
| American Broadcasting Company |     10513      |
|            Twitter            |      9219      |
|        Delta Air Lines        |      8194      |
|             Intel             |      7952      |
|       American Airlines       |      7411      |
+-------------------------------+----------------+
[10 rows x 2 columns]



## Single Source Shortest Path

This function uses Djikstra's algorithm to compute the distance from source to all other nodes in the network and finally returns the shortest path. As you can see from the output, the shortest path from Twitter to Weyerha is shown.

In [23]:
from graphlab import shortest_path

sssp = shortest_path.create(
    sg, source_vid='Twitter', verbose = False)
sssp.show()

sssp.get_path(vid='Weyerhaeuser', 
              show=True,
              highlight=['Weyerhaeuser', 'Twitter'], 
              arrows=True, ewidth=1.2)


None

[('Twitter', 0.0), ('Starbucks', 1.0), ('Seattle', 2.0), ('Weyerhaeuser', 3.0)]

## Connected Components
The intention behind creating connected components is that there is a path between any two vertices in a sub graph.

The way this is implemented is that we start Breadth First Search from some node s. This gives one component of the graph. Next we pick any unexplored node u and start another Breadth First Search. This will give us the next component. We continue this process until all nodes have been explored. 


In [24]:
#connected components
from graphlab import connected_components

cc = connected_components.create(sg, verbose = False)
cc.show()

print cc['component_size']

#Print component 890
vertex_id = []

for vertex in cc['component_id']:
    if vertex['component_id'] == 890:
        vertex_id.append(vertex['__id'])
        
subgraph = cc['graph'].get_neighborhood(ids=vertex_id, 
                                        radius = 1, 
                                        full_subgraph = True)

subgraph.show(vlabel = "__id", arrows = True)


+--------------+--------+
| component_id | Count  |
+--------------+--------+
|     623      | 233057 |
|     890      |   23   |
|     1477     |   2    |
|     1643     |   2    |
|     3306     |   2    |
|     3444     |   2    |
|     3855     |   2    |
|    60185     |   2    |
|    90628     |   2    |
|    117553    |   3    |
+--------------+--------+
[18 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


## K Core
The k-core of a graph G is the maximal induced subgraph H ⊆ G such that δ(G) ≥ k. Thus all vertices of H are adjacent to at least k other vertices in H. The k-core is found by recursively pruning nodes with degrees less than k. The basic concept is simple. Every vertex will keep a count of the adjacent edges. The vertex will update its counter by removing adjacent edges which have been deleted. If counter goes below K, it deletes itself and signals each of its neighbours


In [41]:
#kcore
from graphlab import kcore

kc = kcore.create(sg, verbose = False)
kc.show()

print kc['kmax']
print kc['kmin']

core_id = [0]*(kc['kmax']+1)


for vertex in kc['core_id']:
    core_id[vertex['core_id']] += 1

distribution = gl.SArray(core_id)

print distribution
distribution.show()


10
0
[0, 145603, 43726, 16640, 8512, 4727, 3095, 2155, 1438, 1113, 6112]


## Conclusion
GraphLab applies the __GAS__(*G*ather *A*pply *S*catter) model to a wide variety of machine learning algorithms not only in the domain of graph analytics but also to other domains of data scicence. Many data formats in various data science domains can be modelled as graph data. This alternate approach provides fast implementation of several techniques including text analysis, deep learning, pattern mining, etc. You can read up more about these techniques on their [website](https://turi.com/learn/userguide/index.html). 

## References

1. https://turi.com/

2. https://turi.com/products/create/docs/index.html

3. http://select.cs.cmu.edu/code/graphlab/doxygen/html/index.html

4. https://en.wikipedia.org/wiki/Connected_component_(graph_theory)