<img src="uva_seal.png">  

## Introduction to GraphX and GraphFrames
### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023

---  

**RECOMMENDED KERNEL: DS 5110**

### SOURCES: 

- Apache Spark documentation: 
    - https://spark.apache.org/graphx/
    - https://spark.apache.org/docs/latest/graphx-programming-guide.html     
- https://mapr.com/blog/how-get-started-using-apache-spark-graphx-scala/  
- GraphFrames User Guide
    - https://graphframes.github.io/graphframes/docs/_site/user-guide.html


### OBJECTIVES
- Introduction to `GraphX` and `GraphFrames`
- Review `GraphX` code
- Gain an understanding of some of the supported calculations on graphs
- Build your own graph and compute on it

### CONCEPTS

- `Graph`, `Undirected Graph`, `Directed Graph`, `Multigraph`
- `Vertex Table`
- `Edge Table`
- `Property Graph`
- `Neighborhood Aggregation`
- `Connected Components`
- `Triangle Count`
- `PageRank`

---



### Background

**Graphs and Multigraphs**  

First some quick definitions from Graph Theory:

- A *graph* is a mathematical structure containing a set of objects in which some pairs of objects are related. The objects are called *vertices*, and the connections between vertices are called *edges*.  The edges model a relationship.    


- A *multigraph* is a graph that is permitted to have parallel edges; that is, multiple edges that share the same end vertices. 


- For later use in `GraphX`, we define an *Edge Triplet* to be an edge along with its source and destination vertices.


**Undirected Graph**  

Undirected graphs contain only bidirectional edges like this:

<img src="undirected_graph.png">

**Directed Graph**  

Directed graphs contain edges pointing in a single direction like this:

<img src="directed_graph.png">

**Multigraph**

Multigraphs contain pairs of vertices connected by multiple edges, like this:

<img src="image01_flight-relationship.png">  
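The definitions above can be made concrete with plain Python collections. This is only a sketch: the airport codes and flight numbers are hypothetical illustration data, not taken from the figure.

```python
from collections import Counter

# Hypothetical flight data: each edge is (source, destination, flight_number).
edges = [
    ("SFO", "ORD", "UA100"),
    ("SFO", "ORD", "UA200"),  # parallel edge: same endpoints, different flight
    ("ORD", "DFW", "AA300"),
    ("DFW", "SFO", "AA400"),
]

# Vertex properties keyed by vertex id (hypothetical).
vertices = {"SFO": "San Francisco", "ORD": "Chicago", "DFW": "Dallas"}

# The graph is a multigraph when some (src, dst) pair appears more than once.
pair_counts = Counter((src, dst) for src, dst, _ in edges)
parallel_pairs = {pair: n for pair, n in pair_counts.items() if n > 1}

# An edge triplet bundles an edge with its source and destination vertices.
triplets = [(vertices[src], attr, vertices[dst]) for src, dst, attr in edges]

# An undirected view ignores direction: keep each endpoint pair as a frozenset.
undirected_pairs = {frozenset((src, dst)) for src, dst, _ in edges}

print(parallel_pairs)
print(triplets[0])
```

Storing edges as a list rather than a set is what allows parallel edges to coexist.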

### GraphX Capabilities

`GraphX` is the Spark API for graphs and graph-parallel computation.  
It combines ETL, exploratory data analysis, and iterative graph computation into one system.

`GraphX` includes a library of graph algorithms including:

- PageRank
- Connected components / strongly connected components
- Label propagation
- SVD++
- Triangle count

### GraphX Objects

`GraphX` extends the Spark RDD with a *Resilient Distributed Property Graph*, which is a directed *multigraph*.  Each edge and vertex carries user-defined properties, and parallel edges allow multiple relationships between the same pair of vertices.  

As an example, imagine two airports with multiple flights between the airports.  The airports are the vertices, the flights are the edges, and the construct is a multigraph.  

Like RDDs, property graphs are immutable, distributed, and fault-tolerant.  Each partition of a graph can be recreated on a different machine in the event of failure.

The classes `VertexRDD` and `EdgeRDD` extend RDDs.  Their purpose is to provide functionality for graph computation.

**Example Property Graph**

The property graph below shows collaborators on the GraphX project, and their relationships with one another.  The Vertex Table holds identifiers, usernames, and titles.  The Edge Table holds the relationships.  For example, Prof Franklin (*SrcId 5*) advises R Xin (*DstId 3*).

<img src="property_graph.png">  
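As a rough sketch, the two tables can be written as plain Python structures. The values below follow the example in the GraphX programming guide; apart from the Franklin-advises-Xin edge stated above, treat the individual ids, usernames, and relationship labels as assumptions rather than a transcription of the figure.

```python
# Vertex table: id -> (username, title). Values are assumed from the GraphX guide example.
vertex_table = {
    3: ("rxin", "student"),
    7: ("jgonzal", "postdoc"),
    5: ("franklin", "professor"),
    2: ("istoica", "professor"),
}

# Edge table: (SrcId, DstId, relationship). Labels are assumed.
edge_table = [
    (3, 7, "collaborator"),
    (5, 3, "advisor"),
    (2, 5, "colleague"),
    (5, 7, "pi"),
]

# Read every edge as a triplet: source user, relationship, destination user.
for src, dst, rel in edge_table:
    print(f"{vertex_table[src][0]} --{rel}--> {vertex_table[dst][0]}")

# Whom does vertex 5 (franklin) advise?
advised = [vertex_table[dst][0] for src, dst, rel in edge_table
           if src == 5 and rel == "advisor"]
print(advised)
```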

### Fundamental Operators

`GraphX` exposes fundamental operators from the following categories:

- retrieve information about the graph
- create views of the graph
- cache graphs
- change the partitioning
- transform vertex and edge attributes
- modify the graph structure
- join RDDs with the graph
- compute aggregations
- run algorithms (listed above in the Capabilities section)

For a detailed listing, visit:  
https://spark.apache.org/docs/latest/graphx-programming-guide.html

Some of the most important operators:  

- `subgraph`  

Takes a vertex predicate and an edge predicate and returns the graph containing only the vertices that satisfy the vertex predicate and the edges that both satisfy the edge predicate and connect two retained vertices.  

Examples include returning a graph without broken links (a pruned graph).    

- `joinVertices`  

Joins the vertices with the input RDD and returns a new graph whose vertex properties are obtained by applying the user-defined map function to each joined vertex.  Vertices without a matching entry in the RDD keep their original properties.  

- `aggregateMessages`  

`aggregateMessages` applies a user-defined `sendMsg` function to each edge triplet in the graph; the function may send a message to the source vertex, the destination vertex, or both.  It then uses the `mergeMsg` function to combine the messages that arrive at the same vertex.
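The behavior of `subgraph` and `joinVertices` can be imitated on small in-memory collections. The sketch below is plain Python, not the GraphX API: `subgraph` and `join_vertices` are hypothetical stand-ins, and the user data is made up for illustration.

```python
def subgraph(vertices, edges, vpred, epred):
    """Keep vertices passing vpred, and edges passing epred whose endpoints both survive."""
    kept_v = {vid: attr for vid, attr in vertices.items() if vpred(vid, attr)}
    kept_e = [(s, d, a) for s, d, a in edges
              if s in kept_v and d in kept_v and epred(s, d, a)]
    return kept_v, kept_e


def join_vertices(vertices, table, map_func):
    """Update vertices that have a row in table; all others keep their attribute."""
    return {vid: map_func(vid, attr, table[vid]) if vid in table else attr
            for vid, attr in vertices.items()}


# Hypothetical data: vertex 3 has a missing attribute (a "broken link").
vertices = {1: "alice", 2: "bob", 3: None}
edges = [(1, 2, "follows"), (2, 3, "follows"), (3, 1, "follows")]

# Prune the vertex with the missing attribute; every edge touching it goes too.
pruned_v, pruned_e = subgraph(
    vertices, edges,
    vpred=lambda vid, attr: attr is not None,
    epred=lambda s, d, a: True,
)

# Attach ages to the vertices that appear in the side table.
ages = {1: 30}
joined = join_vertices(pruned_v, ages, lambda vid, name, age: f"{name} ({age})")

print(pruned_v, pruned_e)
print(joined)
```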

### Basic Operations and Calculations with Graphs

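The functionality supports interesting calculations, ranging from very simple (count the number of edges, count the number of vertices) to more complex (count the number of followers of each user, rank users by importance).  The number of followers is the in-degree of a vertex and can be computed with neighborhood aggregation, while importance can be estimated with `PageRank`.  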

**Neighborhood aggregation**  
*Neighborhood aggregation* is the process of combining information from the neighbors of each vertex, for example counting the followers of each user.  
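A minimal sketch of the send-then-merge pattern, written in plain Python rather than GraphX. The helper `aggregate_messages` and the sample data are hypothetical; the example counts followers by sending the value 1 along every edge and summing at the receiving vertex.

```python
def aggregate_messages(vertices, edges, send_msg, merge_msg):
    """Apply send_msg to every edge triplet, then merge the messages per receiving vertex."""
    inbox = {}
    for src, dst, attr in edges:
        triplet = (src, vertices[src], dst, vertices[dst], attr)
        for target, msg in send_msg(triplet):
            inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
    return inbox


# Hypothetical data: an edge (a, b, "follows") means user a follows user b.
vertices = {1: "ann", 2: "bo", 3: "cy"}
edges = [(1, 3, "follows"), (2, 3, "follows"), (3, 1, "follows")]

follower_counts = aggregate_messages(
    vertices, edges,
    send_msg=lambda t: [(t[2], 1)],   # send 1 to the destination vertex
    merge_msg=lambda a, b: a + b,     # sum the messages at each vertex
)
print(follower_counts)
```

Vertices that receive no message (user 2 here) do not appear in the result, so a follower count of zero has to be filled in separately if it is needed.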

**Caching**

Graphs are NOT cached by default.  If you plan to use a graph more than once, avoid recomputing it by calling `cache()`.  

**Graph Builders**

`GraphX` provides several ways of building a graph from vertices and edges in an RDD or on disk.  The graph builders leave the edges in their default partitions, to avoid shuffling data.

**Optimized Representation**

`GraphX` follows a vertex-cut approach to distributed graph partitioning.  Effectively, `GraphX` partitions the graph along vertices to reduce communication and storage costs.  Thus, the logical plan assigns each edge to a single machine, while the vertices can span multiple machines. 
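To illustrate the idea, the sketch below places each edge in exactly one partition and records which partitions each vertex ends up in. The partitioning rule `partition_of` is a hypothetical toy rule chosen for readability; it is not one of the partition strategies that `GraphX` provides.

```python
from collections import defaultdict

NUM_PARTITIONS = 2
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]  # hypothetical edge list


def partition_of(src, dst):
    # Toy rule for illustration only (an assumption, not a GraphX strategy).
    return (src + dst) % NUM_PARTITIONS


edge_partitions = defaultdict(list)   # partition -> edges stored there
vertex_partitions = defaultdict(set)  # vertex -> partitions holding a copy of it

for src, dst in edges:
    p = partition_of(src, dst)
    edge_partitions[p].append((src, dst))
    vertex_partitions[src].add(p)
    vertex_partitions[dst].add(p)

# Every edge lives in exactly one partition; some vertices are replicated.
replicated = sorted(v for v, parts in vertex_partitions.items() if len(parts) > 1)
print(dict(edge_partitions))
print(replicated)
```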

### Algorithms

`PageRank`  
This algorithm measures the importance of each vertex in a graph.  For example, if a user (represented by a vertex) has many followers (represented by many incoming edges), then that user will tend to have a high rank, especially when the followers are themselves highly ranked.
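The iteration behind the idea can be sketched in a few lines of plain Python. This is one common formulation, assuming every vertex starts with rank 1.0 and that every vertex with outgoing edges splits its rank evenly among them; library implementations differ in details, and the graph below is hypothetical.

```python
def pagerank(vertices, edges, reset_probability=0.15, max_iter=10):
    """Simple iterative PageRank on a directed edge list (illustrative sketch only)."""
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            # Each vertex shares its current rank equally across its out-edges.
            incoming[dst] += rank[src] / out_degree[src]
        rank = {v: reset_probability + (1 - reset_probability) * incoming[v]
                for v in vertices}
    return rank


# Hypothetical follower graph: a, b and d all point to c; c points back to a and b.
vertices = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("c", "b")]

ranks = pagerank(vertices, edges)
most_important = max(ranks, key=ranks.get)
print(most_important)
```

Vertex `d` has no incoming edges, so its rank settles at the reset probability, while `c` collects rank from three other vertices.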

`Connected Components`  
A *component* in an undirected graph is a subgraph in which any two vertices are connected to each other by a path of edges, and which is not connected to any other vertices in the supergraph.  Essentially, the components form clusters in the graph.  This diagram may help clarify:

<img src="component.png">

The `Connected Components` algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex.
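The labeling rule can be sketched with a simple propagation loop in plain Python: every vertex starts with its own ID as its label, and the smaller label spreads across each edge until nothing changes. This is an illustrative sketch on hypothetical data, not the library's implementation.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex ID in its component."""
    label = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            low = min(label[src], label[dst])
            if label[src] != low or label[dst] != low:
                label[src] = label[dst] = low
                changed = True
    return label


# Hypothetical graph with three components: {1, 2, 3}, {4, 5} and {6}.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(2, 3), (1, 2), (4, 5)]

labels = connected_components(vertices, edges)
print(labels)
```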

`Triangle Counting`  
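This algorithm counts, for each vertex, the number of triangles that pass through it, which provides a measure of clustering.  A vertex is part of a triangle when it has two adjacent vertices with an edge connecting them.  Dividing a vertex's triangle count by the number of triangles its neighbors could possibly form gives the clustering coefficient of that vertex.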
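A small plain-Python sketch of the per-vertex count, assuming edges are treated as undirected and self-loops are ignored. The graph is hypothetical.

```python
from itertools import combinations


def triangle_count(vertices, edges):
    """Count, for each vertex, the triangles that pass through it."""
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:
        if src != dst:  # ignore self-loops
            neighbors[src].add(dst)
            neighbors[dst].add(src)

    counts = {}
    for v in vertices:
        # A triangle through v is a pair of v's neighbors that are themselves adjacent.
        counts[v] = sum(1 for a, b in combinations(neighbors[v], 2)
                        if b in neighbors[a])
    return counts


# Hypothetical graph: vertices 1, 2, 3 form a triangle; vertex 4 hangs off vertex 3.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]

counts = triangle_count(vertices, edges)
print(counts)
```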

### GraphFrames in PySpark

`GraphX` is available through Spark's Scala API and has no Python API.  
The `GraphFrames` package brings graph functionality to PySpark.  
In `GraphFrames`, vertices and edges are represented as DataFrames instead of RDDs.

This analogy will be helpful:

`GraphX` : `RDDs` :: `GraphFrames` : `DataFrames`  

`GraphFrames` is a separate project from core Apache Spark.  

### GraphFrames Examples  

The code snippets below are based on the `GraphFrames` user guide.  They illustrate basic operations and algorithm use.  
Run the cells and note their output.

**Create a Vertex DataFrame and an Edge DataFrame**

In [1]:
from pyspark.sql import SparkSession
# Requires the separate graphframes package to be available to Spark
from graphframes import GraphFrame

# Start (or reuse) a local Spark session
spark = SparkSession.builder \
        .master("local") \
        .appName("graphframes_intro") \
        .getOrCreate()
sc = spark.sparkContext

# Vertex DataFrame; contains identifier field "id"
v = spark.createDataFrame([
  ("1", "Adam", "koala"),
  ("2", "Callie", "flamingo"),
  ("3", "Elle", "panda"),
  ("4", "Jacqui", "fox")
], ["id", "name", "favorite_animal"])

# Edge DataFrame; contains source field "src" and destination field "dst"
e = spark.createDataFrame([
  ("1", "2", "dad"),
  ("1", "3", "husband"),
  ("1", "4", "son_in_law"),
  ("2", "1", "daughter"),
  ("2", "3", "daughter"),
  ("2", "4", "granddaughter"),
  ("3", "1", "wife"),
  ("3", "2", "mom"),
  ("3", "4", "daughter"),
  ("4", "1", "mother_in_law"),
  ("4", "2", "grandmother"),
  ("4", "3", "mom")
], ["src", "dst", "relationship"])

In [2]:
# Create a GraphFrame
g = GraphFrame(v, e)

In [3]:
# print the graph
print(g)

GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])


In [None]:
# show the vertices
g.vertices.show()

In [None]:
# show the edges
g.edges.show()

In [None]:
# compute and display the number of daughter relationships
numDaughter = g.edges.filter("relationship = 'daughter'").count()
print(numDaughter)

In [None]:
# run PageRank for 10 iterations
results = g.pageRank(resetProbability=0.15, maxIter=10)

# print results
results.vertices.show()
results.edges.show()

In [None]:
# compute the triangle count
results = g.triangleCount()
results.select("id", "count").show()