## Introduction

This tutorial introduces some basic methods for processing and analyzing network data, focusing on large-scale network analytics with the Stanford Network Analysis Platform (SNAP). Networks are a general language for describing complex systems. They can represent connections between friends and families, societies, road networks, neuron interactions in the brain, and global economic networks. Network data is a key part of data science. In many settings we want to visualize or analyze data with a network component, where information is represented using nodes (mostly entities) and edges (mostly relationships between entities). Visualizing network data and understanding key network statistics can tell us a great deal about the question we are interested in.

An example of this is the degrees-of-separation statistic, which Facebook reported recently:

[<img src="https://scontent-iad3-1.xx.fbcdn.net/t39.2365-6/12683879_1559394484383614_1037808524_n.png">](https://research.facebook.com/blog/three-and-a-half-degrees-of-separation/)

"Each person in the world (at least among the 1.59 billion people active on Facebook) is connected to every other person by an average of three and a half other people. The average distance we observe is 4.57, corresponding to 3.57 intermediaries or 'degrees of separation.'"

The network here has Facebook users as nodes and friendships as edges. Degrees of separation is the average number of intermediate nodes on the shortest path from any one node to another in the graph/network (we will use these two terms interchangeably), which is one less than the average distance in edges.
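The relation between average distance and intermediaries can be checked on a tiny example. The following is a minimal pure-Python sketch, not SNAP code: the toy network `friends` and the helper names are hypothetical, and it computes exact distances with breadth-first search, which would not scale to a graph of Facebook's size.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return a dict of shortest-path distances (in edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_distance(adj):
    """Average shortest-path distance over all ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return float(total) / pairs

# Hypothetical toy friendship network: an undirected path a - b - c - d.
friends = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}

avg_dist = average_distance(friends)
intermediaries = avg_dist - 1  # nodes strictly between the two endpoints
print("Average distance: %.3f" % avg_dist)
print("Average intermediaries: %.3f" % intermediaries)
```

In the same way, Facebook's average distance of 4.57 edges corresponds to 4.57 - 1 = 3.57 intermediaries.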

## Tutorial content

In this tutorial, we will cover some basic network analysis in Python, specifically using [SNAP.py](http://snap.stanford.edu/). SNAP is a general purpose, high-performance system for analysis and manipulation of large networks. It claims to scale to networks with hundreds of millions of nodes and billions of edges.

We will cover the following topics in this tutorial:
- [Installation](#Installation)
- [Loading data and plotting](#Loading-data-and-plotting)
- [Network Analysis](#Network-analysis)
- [Graph Generators](#Graph-Generators)
- [Example application: findingJavaExperts](#Example-application:-findingJavaExperts)
- [Graph Sparsification](#Graph-sparsification)

## Installation

Before getting started, you'll need to download Snap.py for your platform from http://snap.stanford.edu/snappy/index.html#download. Then run the setup.py file using:

    $ python setup.py install
    
After you run the install, make sure the following imports work for you:

In [54]:
import snap

import itertools
import random

<a id='Loading-data-and-plotting'></a>

## Loading Data and Plotting

In this tutorial we will use data from Stack Overflow. The dataset comprises 19,881,020 posts from July 2008 to May 2014 and has been made publicly available by Stack Overflow (https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z).
 
We will be using a processed version of this data that contains only the posts about 'Java'. This data is uploaded here: https://drive.google.com/drive/folders/0BzBkBWocWLCDTVg2ZHpDZDFTTG8. <b>Please download the files and put them in the working directory before running the cells.</b> The file 'qa.txt' contains 6 columns. We will only use the second column (id of the user who posted the question) and the sixth column (id of the user who posted the accepted answer) for this analysis. LoadEdgeList numbers columns from zero, so these are columns 1 and 5 in the call below.

We are loading the data as a PNGraph (directed graph), where each user is a node and edges are directed such that a question owner points to the owner of the accepted answer.
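To make the column selection concrete, here is a minimal pure-Python sketch of what loading such an edge list amounts to. It is an illustration under assumptions, not SNAP's implementation: the sample rows and the helper `load_edges` are hypothetical (only the two user-id columns of qa.txt are described in this tutorial), the columns are assumed to be tab-separated, and repeated edges are collapsed into one, as in a simple directed graph.

```python
from collections import defaultdict

# Hypothetical rows in the spirit of qa.txt: six tab-separated columns,
# of which only column 1 (asker id) and column 5 (answerer id) are used.
SAMPLE_ROWS = """\
q1\t78\t2008\tjava\ta1\t86
q2\t86\t2009\tjava\ta2\t122
q3\t58\t2009\tjava\ta3\t86
q4\t78\t2010\tjava\ta4\t86
"""

def load_edges(text, src_col, dst_col):
    """Return the set of unique directed (src, dst) edges read from text."""
    edges = set()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        edges.add((int(cols[src_col]), int(cols[dst_col])))
    return edges

edges = load_edges(SAMPLE_ROWS, 1, 5)  # zero-based column indices

out_deg = defaultdict(int)
in_deg = defaultdict(int)
for src, dst in edges:
    out_deg[src] += 1
    in_deg[dst] += 1

for node in sorted(set(out_deg) | set(in_deg)):
    print("Node id: %d, Out-degree: %d, In-degree: %d" % (node, out_deg[node], in_deg[node]))
```

Note that user 78 accepted two answers from user 86 in the sample rows, yet contributes a single edge.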

In [2]:
G = snap.LoadEdgeList(snap.PNGraph, "qa.txt", 1, 5)

print "Number of nodes with non-zero degree: ", snap.CntNonZNodes(G)
print "Number of unique directed edges: ", snap.CntUniqDirEdges(G) 

for node in itertools.islice(G.Nodes(),5):
    print "Node id: %d, Out-degree: %d, In-degree: %d" % (node.GetId(), node.GetOutDeg(), node.GetInDeg())

Number of nodes with non-zero degree:  146874
Number of unique directed edges:  313006
Node id: 78, Out-degree: 1, In-degree: 0
Node id: 86, Out-degree: 4, In-degree: 5
Node id: 58, Out-degree: 7, In-degree: 1
Node id: 35, Out-degree: 0, In-degree: 3
Node id: 122, Out-degree: 7, In-degree: 8


We will use Gnuplot for drawing and visualization in Snap.py. After installing Gnuplot, add it to the system PATH variable so that it is available, or put the executables in the working directory.

In [3]:
snap.PlotInDegDistr(G, "Java-Posts", "Java-Posts In Degree")

This command saves three files (inDeg.Java-Posts.png, inDeg.Java-Posts.plt, and inDeg.Java-Posts.tab) in the same directory. Here is what the output looks like:
<img src="inDeg.Java-Posts.png" alt="Drawing" style="width: 700px;">

<a id="Network-analysis" ></a>

## Network Analysis

We will now get some off-the-shelf statistics for the network we loaded. The output of the command (which might take some time to run) is stored in info.txt.

In [4]:
snap.PrintInfo(G, "Java QA Stats", "info.txt", False)

In the following section we will be using some basic and composite types supported by SNAP. Here, we will quickly see how a few of them can be initialized and accessed, as we will need them for most of the analysis. You can read more about all the types here: http://snap.stanford.edu/snappy/doc/reference/basic.html, http://snap.stanford.edu/snappy/doc/reference/composite.html

In [58]:
vec = snap.TIntV(5)             #Initializes a vector that already holds 5 integers
vec.Add(1)                      #Appends to vec, so the length grows beyond 5
vec.Add(2)
print "Length of vector: ", vec.Len()

h = snap.TIntStrH()             #Creates a hash with Int key and String value
h[4] = "four"
print "h[4] = ", h[4]

p = snap.TFltIntPr(2.1,2)       #Creates a pair with a float and an int value
print "Values in the pair: ", p.GetVal1(), p.GetVal2()

h2 = snap.TStrIntPrH()          #Creates a hash with String key and Integer pair as value
h2['k'] = snap.TIntPr(1,3)

G1 = snap.TNGraph.New()         #Initializes a directed graph
G2 = snap.TUNGraph.New()        #Initializes an undirected Graph
N1 = snap.TNEANet.New()         #Initializes a directed Network

Length of vector:  7
h[4] =  four
Values in the pair:  2.1 2


We will now analyze connectedness in our Java Posts network from Stack Overflow. 

In [15]:
MxWcc = snap.GetMxWcc(G)        #Returns graph representing largest weakly connected component
print "Largest Weakly Connected Component - #Nodes %d, #Edges %d" % (MxWcc.GetNodes(), MxWcc.GetEdges())

MxScc = snap.GetMxScc(G)        #Returns graph representing largest strongly connected component
print "Largest Strongly Connected Component - #Nodes %d, #Edges %d" % (MxScc.GetNodes(), MxScc.GetEdges())

WccSizes = snap.TIntPrV()
snap.GetWccSzCnt(G, WccSizes)   #Returns size distribution of weakly connected components

for pair in WccSizes:
    print "%d components with size %d" % (pair.GetVal2(), pair.GetVal1())

ArtNIdV = snap.TIntV()          #Vector that will hold the ids of the articulation points
snap.GetArtPoints(snap.ConvertGraph(snap.PUNGraph, G), ArtNIdV)
print "Number of Articulation Points: %d" % len(ArtNIdV)

Largest Weakly Connected Component - #Nodes 131188, #Edges 322486
Largest Strongly Connected Component - #Nodes 4323, #Edges 22531
5191 components with size 1
4462 components with size 2
406 components with size 3
68 components with size 4
11 components with size 5
2 components with size 6
2 components with size 7
1 components with size 131188
Number of Articulation Points: 31015


A node in an undirected connected graph is an <b>articulation point</b> (or cut vertex) iff removing it (and the edges through it) disconnects the graph. Articulation points represent vulnerabilities in a connected network: single points whose failure would split the network into two or more disconnected components. Identifying them is useful for designing reliable networks. Note that because G is already disconnected, removing an articulation point from our network increases the number of connected components.
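The definition can be turned into a small program. The sketch below is a textbook depth-first search with low-link values written in plain Python; it is not taken from SNAP, the toy graph is hypothetical, and the recursive form is only suitable for small graphs.

```python
def articulation_points(adj):
    """Return the set of cut vertices of an undirected graph given as {node: [neighbors]}."""
    disc = {}      # discovery time of each visited node
    low = {}       # lowest discovery time reachable from the node's DFS subtree
    cut = set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root node is a cut vertex if some child subtree
                # cannot reach above u without passing through u.
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])
        # The DFS root is a cut vertex iff it has more than one DFS child.
        if parent is None and children > 1:
            cut.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return cut

# Hypothetical toy graph: two triangles (1,2,3) and (3,4,5) sharing node 3,
# plus a pendant node 6 attached to node 5.
toy = {
    1: [2, 3],
    2: [1, 3],
    3: [1, 2, 4, 5],
    4: [3, 5],
    5: [3, 4, 6],
    6: [5],
}
print("Articulation points: %s" % sorted(articulation_points(toy)))
```

Removing node 3 separates the two triangles, and removing node 5 cuts off node 6; no other single removal disconnects the toy graph.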

We will now analyze node connectivity and degree distribution. 

In [59]:
MaxInDegNode = snap.GetMxInDegNId(G)    #User with maximum accepted answers   
print "Node with maximum in-degree: ", MaxInDegNode

MaxOutDegNode = snap.GetMxOutDegNId(G)  #User who accepted maximum answers
print "Node with maximum out-degree: ", MaxOutDegNode

DegDist = snap.TIntPrV()     
snap.GetDegCnt(G, DegDist)              #Returns node degree distribution
print "# of distinct degrees: ",DegDist.Len()
for item in itertools.islice(DegDist, 5):
    print "%d Nodes with Degree %d" % (item.GetVal2(), item.GetVal1())

Node with maximum in-degree:  22656
Node with maximum out-degree:  892029
# of distinct degrees:  322
73102 Nodes with Degree 1
26517 Nodes with Degree 2
12790 Nodes with Degree 3
7638 Nodes with Degree 4
5138 Nodes with Degree 5


SNAP supports Breadth First Search ([BFS](https://en.wikipedia.org/wiki/Breadth-first_search)) and Depth First Search ([DFS](https://en.wikipedia.org/wiki/Depth-first_search)). We can also calculate shortest distance to nodes, neighbors at a particular distance, diameter of the graph and various other useful details.

In [62]:
#Returns (approximation of) Effective Diameter and Diameter of Graph by performing BFS
Diameter = snap.GetBfsFullDiam(G, 100)
print "Diameter: ", Diameter
EffDiameter = snap.GetBfsEffDiam(G, 100)
print "Effective Diameter: ", EffDiameter

ShortPathLength = snap.GetShortPath(G, MaxOutDegNode, MaxInDegNode)
print "Shortest Path from node with max out-degree to node with max in-degree is %d edge" % ShortPathLength

Diameter:  11
Effective Diameter:  5.74034326776
Shortest Path from node with max out-degree to node with max in-degree is 1 edge
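The second argument in the diameter calls above is the number of BFS start nodes, which is why the results are approximations. The following pure-Python sketch shows the idea under simplifying assumptions: the function names and the toy graph are hypothetical, the sketch follows edge directions, and it reports only the full diameter (the effective diameter, a percentile of the distance distribution, is not computed).

```python
import random
from collections import deque

def bfs_eccentricity(adj, source):
    """Largest shortest-path distance from source to any node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def sampled_diameter(adj, n_test_nodes, rng):
    """Lower bound on the diameter from BFS runs started at a random sample of nodes."""
    nodes = sorted(adj)
    starts = rng.sample(nodes, min(n_test_nodes, len(nodes)))
    return max(bfs_eccentricity(adj, s) for s in starts)

# Hypothetical directed toy graph: a chain 0 -> 1 -> ... -> 5 with a shortcut 0 -> 3.
chain = {0: [1, 3], 1: [2], 2: [3], 3: [4], 4: [5], 5: []}

rng = random.Random(0)
exact = sampled_diameter(chain, len(chain), rng)   # BFS from every node
approx = sampled_diameter(chain, 2, rng)           # BFS from only 2 nodes
print("Exact diameter: %d" % exact)
print("Sampled estimate (never larger than exact): %d" % approx)
```

Sampling fewer start nodes is cheaper but can only underestimate the longest shortest path, so more test nodes give a tighter estimate.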


We will now obtain spectral properties for a graph. SNAP can compute eigenvalues and eigenvectors using the adjacency matrix representation of the network.

In [63]:
ERGraph = snap.GenRndGnm(snap.PUNGraph, 100, 1000)    #Generates an Erdos-Renyi random graph
EigVals = snap.TFltV()                              
snap.GetEigVals(ERGraph, 2, EigVals)                  #Returns the top 2 Eigenvalues
print "Leading Eigen Value: ", EigVals[0]
    
EigVecs =  snap.TFltV()
snap.GetEigVec(ERGraph, EigVecs)                      #Returns leading eigenvector
print "Corresponding Eigen Vector: ", EigVecs[0]      #Prints only the first component of the eigenvector

EigValIprV = snap.TFltPrV()                         
snap.GetInvParticipRat(ERGraph, 5, 1000, EigValIprV)  #Computes inverse participation ratio  
for item in EigValIprV:
    print '%f, %f' % (item.GetVal1(), item.GetVal2())

Leading Eigen Value:  20.8376798671
Corresponding Eigen Vector:  0.106963147083
-6.800735, 0.033524
-2.987092, 0.033854
1.545339, 0.026695
5.841276, 0.025121
20.837679, 0.011645
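For intuition about what the leading eigenvalue and eigenvector are, here is a minimal power-iteration sketch in plain Python. It is one standard way to approximate the leading eigenpair of an adjacency matrix and is not a description of how SNAP computes it; the function name and the star-shaped toy graph are hypothetical.

```python
import math

def leading_eigenpair(adj, iterations=200):
    """Power iteration on the adjacency matrix A of an undirected graph.

    adj maps each node to a list of its neighbors.
    Returns (leading eigenvalue, unit-length eigenvector as a dict).
    """
    vec = {u: 1.0 for u in adj}
    value = 0.0
    for _ in range(iterations):
        # Multiply by (A + I); the shift avoids oscillation on bipartite graphs.
        nxt = {u: vec[u] + sum(vec[v] for v in adj[u]) for u in adj}
        norm = math.sqrt(sum(x * x for x in nxt.values()))
        vec = {u: x / norm for u, x in nxt.items()}
        value = norm - 1.0  # undo the shift once the vector has converged
    return value, vec

# Hypothetical toy graph: a star with center "c" and three leaves.
star = {"c": ["x", "y", "z"], "x": ["c"], "y": ["c"], "z": ["c"]}

value, vec = leading_eigenpair(star)
print("Leading eigenvalue: %.4f" % value)
print("Center component: %.4f, leaf component: %.4f" % (vec["c"], vec["x"]))
```

For a star with three leaves the leading eigenvalue is the square root of 3, and the center carries the largest eigenvector component, which is the intuition behind eigenvector-based centrality measures.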


<a id ="Graph-Generators"></a>

## Graph Generators

In the above example, we generated an [Erdos-Renyi](https://en.wikipedia.org/wiki/Erd%C5%91s%E2%80%93R%C3%A9nyi_model) random graph. SNAP comes with many basic and advanced graph generators. Random graph models are useful in their own right: they are used in probabilistic methods to prove the existence of graphs satisfying various properties, and to give a rigorous definition of what it means for a property to hold for almost all graphs. We will now try some of them.

In [46]:
Rnd = snap.TRnd(1,0)
SWG = snap.GenSmallWorld(100, 3, 0.7, Rnd) #Returns random small-world graph generated using Watts-Strogatz model
FFG = snap.GenForestFire(1000, 0.5, 0.5)   #Returns a random Forest Fire, directed graph with given probabilities
PLG = snap.GenRndPowerLaw(100, 10)         #Returns a random scale-free graph with power-law degree distribution
print PLG

snap.PlotHops(FFG, "ForestFire", "Forest Fire Hops", False, 1024)

<snap.PUNGraph; proxy of <Swig Object of type 'PUNGraph *' at 0x104ec5240> >


[Here](http://snap.stanford.edu/snappy/doc/reference/generators.html?highlight=generators) is the list of all the graph generators that SNAP provides. PlotHops() plots the cumulative distribution of the shortest path lengths of a graph. The implementation is based on ANF (Approximate Neighborhood Function). Here is what the output looks like for the random Forest Fire graph we generated above:
<img src="hop.ForestFire.png" alt="Drawing" style="width: 700px;"/>
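To see what the simplest of these generators does, here is a pure-Python sketch of the Erdos-Renyi G(n, m) model with the same sizes as the spectral example (100 nodes, 1000 edges). The function `gen_rnd_gnm` is a hypothetical stand-in written for illustration; it enumerates all node pairs, which is fine for 100 nodes but not for large graphs.

```python
import itertools
import random

def gen_rnd_gnm(n_nodes, n_edges, rng):
    """Return n_edges distinct undirected edges chosen uniformly among n_nodes nodes."""
    all_pairs = list(itertools.combinations(range(n_nodes), 2))
    return rng.sample(all_pairs, n_edges)

rng = random.Random(42)
er_edges = gen_rnd_gnm(100, 1000, rng)

degree = [0] * 100
for u, v in er_edges:
    degree[u] += 1
    degree[v] += 1

print("Edges: %d" % len(er_edges))
print("Average degree: %.1f" % (sum(degree) / 100.0))
```

Every such graph has average degree 2 * 1000 / 100 = 20, which is close to the leading eigenvalue of about 20.84 reported for the random graph in the spectral example.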

<a id="Example-application:-findingJavaExperts"></a>

## Example Application - FindingJavaExperts

Now we want to find the top Java experts on Stack Overflow using the posts (questions and answers) we loaded at the beginning into graph G. We cannot use the reputation score provided by Stack Overflow, as it is not specific to a topic. One option is to count the number of answers a user posted on questions related to Java, but this doesn't guarantee that the answers were useful!

Recall that the in-degree of a node counts the distinct question owners who accepted an answer from that user (a PNGraph keeps at most one edge per ordered pair of nodes), and the out-degree counts the distinct users whose answers that question owner accepted. Thus, looking at the nodes with maximum out-degree (top learners) and in-degree (top experts) can give us some insight!

In [56]:
InDegrees = snap.TIntPrV()                    #Creates a vector of Int pairs
snap.GetNodeInDegV(G, InDegrees)              #Computes the in-degree for every node in G
TopInDegrees = sorted(InDegrees, key = lambda key: key.GetVal2(), reverse = True)
print "\nTop 5 Java experts by In-degree\n   Id   | In-degree"
for item in TopInDegrees[:5]:
    print "%7s | %d" % (item.GetVal1(), item.GetVal2())

OutDegrees = snap.TIntPrV()
snap.GetNodeOutDegV(G, OutDegrees)            #Computes the out-degree for every node in G
print "\nTop 5 Java learners by Out-degree\n   Id   | Out-degree"
TopOutDegrees = sorted(OutDegrees, key = lambda key: key.GetVal2(), reverse = True)
for item in TopOutDegrees[:5]:
    print "%7s | %d" % (item.GetVal1(), item.GetVal2())


Top 5 Java experts by In-degree
   Id   | In-degree
  22656 | 3999
 157882 | 2828
 571407 | 2714
  57695 | 2448
 139985 | 2340

Top 5 Java learners by Out-degree
   Id   | Out-degree
 892029 | 434
1194415 | 374
 454049 | 204
 359862 | 193
  44330 | 182


Can we do better?

We can use a measure of node centrality to find Java experts and learners!
First, we will use <b>PageRank</b> as we have already learnt about it in the course.

In [48]:
PageRank = snap.TIntFltH()                 #Creates a hash with Int key and Flt as value
snap.GetPageRank(G, PageRank)              #Populates the hash with PageRanks for all nodes

TopPR = sorted(PageRank, key = lambda key: PageRank[key], reverse = True)
print "\nTop 5 Java experts by PageRank\n   Id   | PageRank"
for node in TopPR[:5]:
    print "%7s | %.5f" % (node, PageRank[node])


Top 5 Java experts by PageRank
   Id   | PageRank
 992484 | 0.01398
 135152 | 0.01001
  22656 | 0.00710
 139985 | 0.00552
 157882 | 0.00460
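As a reminder of what PageRank computes, here is a minimal power-iteration sketch in plain Python. It is a generic textbook version, not SNAP's implementation: the damping factor of 0.85, the even redistribution of rank from nodes without out-links, and the toy graph with made-up user names are all assumptions of this sketch.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Plain power-iteration PageRank on {node: [successors]}; scores sum to 1."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        nxt = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out_links[u]:
                nxt[v] += damping * rank[u] / len(out_links[u])
        rank = nxt
    return rank

# Hypothetical toy Q&A graph: an edge asker -> answerer for each accepted answer.
qa = {
    "alice": ["expert"],
    "bob": ["expert"],
    "carol": ["expert", "dave"],
    "dave": ["expert"],
    "expert": [],
}
scores = pagerank(qa)
for user in sorted(scores, key=scores.get, reverse=True):
    print("%7s | %.5f" % (user, scores[user]))
```

Unlike plain in-degree, the score of a node depends on who points to it: an accepted answer from a highly ranked user is worth more, which is why the PageRank list above differs from the in-degree list.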


We can also look at other measures like the [HITS score](https://en.wikipedia.org/wiki/HITS_algorithm) (a.k.a 'hubs and authorities') to capture top experts and top learners (question-owners/hubs who are linked to nodes that are considered to be Java experts/authorities).

In [49]:
HubScore = snap.TIntFltH()                 #Creates a hash with Int key and Flt as value
AuthScore = snap.TIntFltH()
snap.GetHits(G, HubScore, AuthScore)       #Populates the hashes with hub and authority scores for all nodes

TopAuth = sorted(AuthScore, key = lambda key: AuthScore[key], reverse = True)
print "\nTop 5 Java experts by Hits\n   Id   | Authority Rank"
for node in TopAuth[:5]:
    print "%7s | %.5f" % (node, AuthScore[node])

TopHub = sorted(HubScore, key = lambda key: HubScore[key], reverse = True)
print "\nTop 5 Java learners by Hits\n   Id   | Hub Rank"
for node in TopHub[:5]:
    print "%7s | %.5f" % (node, HubScore[node])


Top 5 Java experts by Hits
   Id   | Authority Rank
  22656 | 0.60472
 157882 | 0.29870
 571407 | 0.28391
  57695 | 0.26931
 139985 | 0.24940

Top 5 Java learners by Hits
   Id   | Hub Rank
 892029 | 0.07336
1194415 | 0.05955
 359862 | 0.05688
 648138 | 0.05559
 470184 | 0.05318
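The mutual reinforcement between hubs and authorities can be sketched in a few lines of plain Python. This is the basic iterative HITS scheme with scores scaled to unit Euclidean length after every step; the normalization, the iteration count, and the toy graph with made-up user names are assumptions of this sketch, not details taken from SNAP.

```python
import math

def hits(out_links, iterations=50):
    """Return (hub, authority) score dicts for a directed graph {node: [successors]}."""
    nodes = sorted(out_links)
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the nodes pointing at v.
        auth = {u: 0.0 for u in nodes}
        for u in nodes:
            for v in out_links[u]:
                auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {u: a / norm for u, a in auth.items()}
        # Hub update: sum of the authority scores of the nodes u points at.
        hub = {u: sum(auth[v] for v in out_links[u]) for u in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values()))
        hub = {u: h / norm for u, h in hub.items()}
    return hub, auth

# Hypothetical toy Q&A graph: an edge asker -> answerer for each accepted answer.
qa = {
    "alice": ["guru", "helper"],
    "bob": ["guru"],
    "carol": ["guru", "helper"],
    "guru": [],
    "helper": ["guru"],
}
hub, auth = hits(qa)
print("Top authority (expert): %s" % max(auth, key=auth.get))
print("Hub scores: alice %.5f, bob %.5f" % (hub["alice"], hub["bob"]))
```

In the toy graph "guru" is the top authority, and "alice" scores higher than "bob" as a hub because she also accepted an answer from "helper".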


Note that some user ids appear in the Top 5 expert lists from all three methods: users 22656, 157882, and 139985 are ranked highly by in-degree, PageRank, and HITS authority score alike!

<a id="Graph-sparsification"></a>

## Graph Sparsification

As the second example application in this tutorial, I wanted to show how we can detect communities in the entire Stack Overflow data (above we used only the posts related to Java) and then compare the communities in terms of the question tags/topics in each. It turns out that the larger dataset (>3.5 million edges and ~0.8 million nodes) is too big: the community detection methods in SNAP could not give an answer even after several hours. This is a common problem in network analysis. When the network gets too big, computation becomes very time-consuming. A common technique for dealing with such scenarios is graph sparsification, where we take a representative sample from the network to estimate its properties. Sparsification approximates a given graph by a graph with fewer edges or vertices.

First, we will load the graph using all the answers and questions in the Stack Overflow data. I have pre-processed the data separately and the structure of the graph is the same as before, i.e., each user is a node and edges are directed such that a question owner points to the owner of the accepted answer.

In [50]:
G2 = snap.LoadEdgeList(snap.PNGraph, "qa_complete.txt", 1, 4)

print "Number of nodes with non-zero degree in the complete graph: ", snap.CntNonZNodes(G2)
print "Number of unique directed edges in the complete graph: ", snap.CntUniqDirEdges(G2)

GraphClustCoeff = snap.GetClustCf(G2, -1)
print "Clustering coefficient of the complete graph: %f" % GraphClustCoeff

Number of nodes with non-zero degree in the complete graph:  793542
Number of unique directed edges in the complete graph:  3507757
Clustering coefficient of the complete graph: 0.009795


State-of-the-art sparsification techniques use advanced data structures and algorithms. Here, we will just try to get an idea about how sparsification works using random sampling. We will sample 1/5th of the edges from the original graph.

In [50]:
sample_size = snap.CntUniqDirEdges(G2) // 5   #Integer division: one fifth of the unique edges
with open("qa_complete.txt") as f1:
    with open("qa_sample.txt", 'w') as f2:
        lines = random.sample(f1.readlines(), sample_size)   #Samples lines (edges) without replacement
        f2.writelines(lines)

GS = snap.LoadEdgeList(snap.PNGraph, "qa_sample.txt", 1, 4)
print "Number of nodes with non-zero degree in the sparsified graph: ", snap.CntNonZNodes(GS)
print "Number of unique directed edges in the sparsified graph: ",snap.CntUniqDirEdges(GS)

Number of nodes with non-zero degree in the sparsified graph:  345368
Number of unique directed edges in the sparsified graph:  613415


In [51]:
GraphClustCoeff = snap.GetClustCf(GS, -1)
print "Clustering coefficient of the sparsified graph: %f" % GraphClustCoeff

Clustering coefficient of the sparsified graph: 0.001770


We notice that the clustering coefficient of our sparsified network (0.001770) is about 18% of that of the original network (0.009795), close to the 1/5th proportion by which we sampled down. This indicates that we can still estimate some properties of the original graph even though we only kept 1/5th of its edges! We can make such estimates more accurate by taking multiple samples and averaging the values. It is a useful exercise to think about properties that cannot be estimated this way when we sparsify by random sampling (for instance, the diameter of the graph).
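The proportionality has a simple explanation: among the sampled neighbors of a node, a pair that was connected in the original graph stays connected only if that one edge is also sampled, which happens with probability of about 1/5. The sketch below illustrates this, together with the averaging over several samples, in plain Python. It is an illustration under assumptions: the test graph is a small complete graph rather than the Stack Overflow data, and `avg_clustering` is a hypothetical helper computing the average local clustering coefficient, which is assumed here to be the quantity being compared.

```python
import itertools
import random

def avg_clustering(edges, n_nodes):
    """Average local clustering coefficient; nodes with degree < 2 contribute 0."""
    nbrs = [set() for _ in range(n_nodes)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    total = 0.0
    for u in range(n_nodes):
        k = len(nbrs[u])
        if k < 2:
            continue
        # Count the pairs of neighbors of u that are themselves connected.
        links = sum(1 for a, b in itertools.combinations(sorted(nbrs[u]), 2) if b in nbrs[a])
        total += 2.0 * links / (k * (k - 1))
    return total / n_nodes

N = 40
# Hypothetical dense test graph: the complete graph on N nodes (coefficient 1.0).
full_edges = list(itertools.combinations(range(N), 2))

rng = random.Random(0)
sample_size = len(full_edges) // 5   # keep one fifth of the edges
trials = 20
estimates = []
for _ in range(trials):
    sample = rng.sample(full_edges, sample_size)
    estimates.append(avg_clustering(sample, N))

original = avg_clustering(full_edges, N)
mean_estimate = sum(estimates) / trials
print("Original coefficient: %.3f" % original)
print("Mean coefficient over %d samples: %.3f" % (trials, mean_estimate))
```

Dividing the averaged estimate by the sampling fraction recovers an approximation of the original coefficient. Nodes whose sampled degree drops below 2 contribute zero, which pulls the estimate slightly below one fifth, consistent with the 18% observed above.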

More sophisticated sampling techniques give more accurate estimates. For instance, sparsification techniques based on random-walks and 'forest fire' can accurately match both static as well as evolutionary graph patterns, with sample sizes down to about 15% of the original graph.



## Summary and references

This tutorial highlighted a few elements of what is possible with network data analyzed using SNAP in Python. More detail about the libraries and about network analysis in general is available from the following links.

1. SNAP.py reference manual: http://snap.stanford.edu/snappy/doc/reference/index-ref.html
2. SNAP Website: http://snap.stanford.edu/snap/index.html
3. Introduction to Social Network Analysis using NetworkX (another Python library for network analysis): https://ep2013.europython.eu/media/conference/slides/social-network-analysis-in-python.pdf
4. Hyperlink-Induced Topic Search (HITS): https://en.wikipedia.org/wiki/HITS_algorithm
5. Sampling A Large Network (Graph Sparsification): http://snap.stanford.edu/class/cs224w-2012/projects/cs224w-036-final.pdf
