# Spark GraphFrames with Azure Databricks

## Sample dataset with SQLSat Slovenia 1010 Speakers

In [0]:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import GraphFrame
#from graphframes.examples import Graphs

### Creating a sample dataset

A graph consists of vertices and edges, and a GraphFrame is built the same way: from a vertex DataFrame and an edge DataFrame.

Vertex DataFrame: A vertex DataFrame should contain a special column named "id" which specifies unique IDs for each vertex in the graph.

Edge DataFrame: An edge DataFrame should contain two special columns: "src" (source vertex ID of edge) and "dst" (destination vertex ID of edge).

Both DataFrames can have arbitrary other columns. Those columns can represent vertex and edge attributes.
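
As a quick illustration of these conventions, here is a plain-Python sketch (no Spark required) that checks a vertex table for an `id` column, an edge table for `src` and `dst` columns, and every edge endpoint for a matching vertex ID. The helper name `check_graph_tables` and the two-row sample tables are assumptions made for this example and are not part of GraphFrames. The endpoint check is an extra sanity check, not a rule stated above.

```python
def check_graph_tables(vertex_cols, vertex_rows, edge_cols, edge_rows):
    """Return a list of problems; an empty list means the tables follow the conventions."""
    problems = []
    if "id" not in vertex_cols:
        problems.append("vertex table is missing the 'id' column")
    missing = [c for c in ("src", "dst") if c not in edge_cols]
    if missing:
        problems.append(f"edge table is missing columns: {missing}")
    if problems:
        return problems  # cannot inspect rows without the special columns

    ids = [row[vertex_cols.index("id")] for row in vertex_rows]
    if len(ids) != len(set(ids)):
        problems.append("vertex ids are not unique")
    known = set(ids)
    src_i, dst_i = edge_cols.index("src"), edge_cols.index("dst")
    for row in edge_rows:
        for endpoint in (row[src_i], row[dst_i]):
            if endpoint not in known:
                problems.append(f"edge {row} refers to unknown vertex {endpoint!r}")
    return problems


# Hypothetical sample rows: the second edge points at a vertex "z" that does not exist.
sample_vertices = [("a", "Alice", 34), ("b", "Bob", 36)]
sample_edges = [("a", "b", "friend"), ("b", "z", "follow")]
report = check_graph_tables(["id", "name", "age"], sample_vertices,
                            ["src", "dst", "relationship"], sample_edges)
print(report)
```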

In [0]:
vertices = sqlContext.createDataFrame([
  ("a", "Alice", 34, "F"),
  ("b", "Bob", 36, "M"),
  ("c", "Charlie", 30, "M"),
  ("d", "David", 29, "M"),
  ("e", "Esther", 32, "F"),
  ("f", "Fanny", 36, "F"),
  ("g", "Gabby", 60, "F"),
  ("h", "Mark", 45, "M"),
  ("i", "Eddie", 60, "M"),
  ("j", "Mandy", 21, "F")
], ["id", "name", "age", "gender"])

In [0]:
edges = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend"),
  ("a", "h", "follow"),
  ("a", "i", "follow"),
  ("a", "j", "follow"),
  ("j", "h", "friend"),
  ("i", "c", "follow"),
  ("i", "c", "friend"),
  ("b", "j", "follow"),
  ("d", "h", "friend"),
  ("e", "j", "friend"),
  ("h", "a", "friend")
], ["src", "dst", "relationship"])

Let's create a graph from the vertices and edges:

In [0]:
graph_sample = GraphFrame(vertices, edges)
print(graph_sample)

In [0]:
# GraphFrames also ships a smaller built-in friends graph (vertices a-g only),
# which the dataset above extends.
from graphframes.examples import Graphs
friends_graph = Graphs(sqlContext).friends()
print(friends_graph)

## Querying graph

In [0]:
display(graph_sample.vertices)

id,name,age
a,Alice,34
b,Bob,36
c,Charlie,30
d,David,29
e,Esther,32
f,Fanny,36
g,Gabby,60


In [0]:
display(graph_sample.edges)

src,dst,relationship
a,b,friend
b,c,follow
c,b,follow
f,c,follow
e,f,follow
e,d,friend
d,a,friend
a,e,friend


The incoming degree of the vertices:

In [0]:
display(graph_sample.inDegrees)

id,inDegree
b,2
f,1
a,1
e,1
c,2
d,1


The total degree (incoming plus outgoing) of the vertices. Use `outDegrees` instead for the outgoing degree alone:

In [0]:
display(graph_sample.degrees)

id,degree
b,3
a,3
c,3
f,2
e,3
d,2
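
To make the relation between the three degree measures concrete, the following plain-Python sketch recomputes them from the edge tuples, without Spark. It uses the full 18-edge list defined earlier, so its counts can differ from a table produced on a smaller graph. The names `edge_list`, `in_degree`, `out_degree` and `degree` are chosen for this example only.

```python
from collections import Counter

# Same (src, dst, relationship) tuples as the edges DataFrame defined earlier.
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]

in_degree = Counter(dst for _, dst, _ in edge_list)   # edges arriving at a vertex
out_degree = Counter(src for src, _, _ in edge_list)  # edges leaving a vertex
degree = in_degree + out_degree                       # total = incoming + outgoing

# A vertex with no edges at all (here "g") does not appear in any of the counters.
for vertex in sorted(degree):
    print(vertex, in_degree[vertex], out_degree[vertex], degree[vertex])
```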


You can run queries directly on the vertices DataFrame. For example, we can find the age of the youngest person in the graph:

In [0]:
youngest = graph_sample.vertices.groupBy().min("age")
display(youngest)

min(age)
29


Likewise, you can run queries on the edges DataFrame. For example, let's count the number of 'friend' relationships in the graph:

In [0]:
numFriendEdges = graph_sample.edges.filter("relationship = 'friend'").count()
print("The number of friend edges is", numFriendEdges)
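
As a cross-check that needs no cluster, the same count can be sketched in plain Python over the edge tuples defined earlier. The names `edge_list` and `relationship_counts` are assumptions of this example.

```python
from collections import Counter

# Same (src, dst, relationship) tuples as the edges DataFrame defined earlier.
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]

# Count edges per relationship label in a single pass.
relationship_counts = Counter(rel for _, _, rel in edge_list)
print("friend edges:", relationship_counts["friend"])
print("follow edges:", relationship_counts["follow"])
```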

## Motif

Motifs let you describe structural patterns that involve several edges and vertices at once. The following cell finds the pairs of vertices with edges in both directions between them. The result is a DataFrame whose column names are the keys used in the motif (here `a`, `e`, `b` and `e2`).

In [0]:
# Search for pairs of vertices with edges in both directions between them.
motifs = graph_sample.find("(a)-[e]->(b); (b)-[e2]->(a)")
display(motifs)

a,e,b,e2
"List(c, Charlie, 30, M)","List(c, b, follow)","List(b, Bob, 36, M)","List(b, c, follow)"
"List(b, Bob, 36, M)","List(b, c, follow)","List(c, Charlie, 30, M)","List(c, b, follow)"
"List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)","List(a, h, follow)"
"List(a, Alice, 34, F)","List(a, h, follow)","List(h, Mark, 45, M)","List(h, a, friend)"

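To show what this motif matches, here is a plain-Python sketch of the same pattern: it pairs every edge `e` with every edge `e2` that runs in the opposite direction. It is a brute-force illustration over the edge tuples defined earlier, not how GraphFrames evaluates motifs, and the names `edge_list`, `matches` and `pairs` are assumptions of this example.

```python
# Same (src, dst, relationship) tuples as the edges DataFrame defined earlier.
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]

# (a)-[e]->(b); (b)-[e2]->(a): e goes from a to b, e2 goes from b back to a.
matches = [
    (e[0], e, e[1], e2)
    for e in edge_list
    for e2 in edge_list
    if e[1] == e2[0] and e2[1] == e[0]
]

pairs = sorted((a, b) for a, _, b, _ in matches)
print(pairs)
```

Each mutual pair shows up twice, once per direction, which is why the result table above lists both Bob/Charlie and Charlie/Bob.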

## Filter

In [0]:
filtered = motifs.filter("(b.age > 30 or a.age > 30) and (a.gender = 'M' and b.gender = 'F')")
display(filtered)
# I guess Mark has a crush on Alice, but she just wants to be a follower :)

a,e,b,e2
"List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)","List(a, h, follow)"


## Stateful queries

This example combines GraphFrame motif finding with filters on the result, where the filters use sequence operations to carry state across DataFrame columns.

In [0]:
# Find chains of 4 vertices.
chain4 = graph_sample.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")

# Query on sequence, with state (cnt)
#  (a) Define method for updating state given the next element of the motif.
def cumFriends(cnt, edge):
  relationship = col(edge)["relationship"]
  return when(relationship == "friend", cnt + 1).otherwise(cnt)

#  (b) Use sequence operation to apply method to sequence of elements in motif.
#   In this case, the elements are the 3 edges.
#   Use a separate name so the `edges` DataFrame defined earlier is not overwritten.
edge_names = ["ab", "bc", "cd"]
numFriends = reduce(cumFriends, edge_names, lit(0))

chainWith2Friends2 = chain4.withColumn("num_friends", numFriends).where(numFriends >= 2)
display(chainWith2Friends2)

a,ab,b,bc,c,cd,d,num_friends
"List(e, Esther, 32, F)","List(e, d, friend)","List(d, David, 29, M)","List(d, h, friend)","List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)",3
"List(e, Esther, 32, F)","List(e, j, friend)","List(j, Mandy, 21, F)","List(j, h, friend)","List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)",3
"List(b, Bob, 36, M)","List(b, j, follow)","List(j, Mandy, 21, F)","List(j, h, friend)","List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)",2
"List(a, Alice, 34, F)","List(a, j, follow)","List(j, Mandy, 21, F)","List(j, h, friend)","List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)",2
"List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)","List(a, h, follow)","List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)",2
"List(d, David, 29, M)","List(d, a, friend)","List(a, Alice, 34, F)","List(a, h, follow)","List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)",2
"List(a, Alice, 34, F)","List(a, e, friend)","List(e, Esther, 32, F)","List(e, d, friend)","List(d, David, 29, M)","List(d, a, friend)","List(a, Alice, 34, F)",3
"List(j, Mandy, 21, F)","List(j, h, friend)","List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)","List(a, b, friend)","List(b, Bob, 36, M)",3
"List(d, David, 29, M)","List(d, h, friend)","List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)","List(a, b, friend)","List(b, Bob, 36, M)",3
"List(a, Alice, 34, F)","List(a, h, follow)","List(h, Mark, 45, M)","List(h, a, friend)","List(a, Alice, 34, F)","List(a, b, friend)","List(b, Bob, 36, M)",2

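The `reduce` call is the core of this pattern: it threads an accumulator through the edge names one by one. In the notebook the accumulator is a Spark Column expression; the sketch below replays the same fold with plain integers so the mechanics are easier to follow. The function `count_friend` and the two hard-coded chains (taken from rows of the result table above) are assumptions of this example.

```python
from functools import reduce


def count_friend(cnt, relationship):
    """Plain-Python analogue of cumFriends: add 1 for each 'friend' edge."""
    return cnt + 1 if relationship == "friend" else cnt


# Relationship labels along two chains from the result table above.
chain_e_d_h_a = ["friend", "friend", "friend"]  # e -> d -> h -> a
chain_b_j_h_a = ["follow", "friend", "friend"]  # b -> j -> h -> a

# reduce starts from 0 and applies count_friend once per edge, left to right.
friends_first = reduce(count_friend, chain_e_d_h_a, 0)
friends_second = reduce(count_friend, chain_b_j_h_a, 0)
print(friends_first, friends_second)
```

Both chains pass the `>= 2` filter, matching their `num_friends` values in the table.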

## Standard graph algorithms
GraphFrames comes with a number of standard graph algorithms built in:

- Breadth-first search (BFS)
- Connected components
- Strongly connected components
- Label Propagation Algorithm (LPA)
- PageRank (regular and personalized)
- Shortest paths
- Triangle count

### BFS - Breadth-first search; applying expression through edges

BFS searches along edges from the vertices that match one expression to the nearest vertices that match another. The cell below starts from the person named Esther and stops at anyone aged 30 or younger (`age < 31`).

In [0]:
paths = graph_sample.bfs("name = 'Esther'", "age < 31")
display(paths)

from,e0,to
"List(e, Esther, 32, F)","List(e, d, friend)","List(d, David, 29, M)"
"List(e, Esther, 32, F)","List(e, j, friend)","List(j, Mandy, 21, F)"


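You can also refine the search with an edge filter and a maximum path length. The cell below follows only edges that are not 'friend' relationships and considers paths of at most 3 edges: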

In [0]:
filteredPaths = graph_sample.bfs(
  fromExpr = "name = 'Esther'",
  toExpr = "age < 32",
  edgeFilter = "relationship != 'friend'",
  maxPathLength = 3)
display(filteredPaths)

from,e0,v1,e1,to
"List(e, Esther, 32, F)","List(e, f, follow)","List(f, Fanny, 36, F)","List(f, c, follow)","List(c, Charlie, 30, M)"

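The two searches above can be reproduced with a small level-by-level BFS in plain Python. This is a sketch of the idea only, under the assumption that the search returns every path of the shortest matching length; it is not the GraphFrames implementation. The names `people`, `edge_list` and `bfs_paths` are made up for this example, and paths are returned as lists of vertex IDs rather than rows of structs.

```python
# id -> (name, age), taken from the vertices DataFrame defined earlier.
people = {
    "a": ("Alice", 34), "b": ("Bob", 36), "c": ("Charlie", 30),
    "d": ("David", 29), "e": ("Esther", 32), "f": ("Fanny", 36),
    "g": ("Gabby", 60), "h": ("Mark", 45), "i": ("Eddie", 60),
    "j": ("Mandy", 21),
}
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]


def bfs_paths(is_start, is_target, edge_ok=lambda rel: True, max_path_length=10):
    """Return all shortest paths (lists of vertex IDs) from start to target vertices."""
    frontier = [[v] for v in people if is_start(v)]
    visited = {path[-1] for path in frontier}
    for _ in range(max_path_length + 1):
        hits = [path for path in frontier if is_target(path[-1])]
        if hits or not frontier:
            return hits
        # Extend every path by one allowed edge to a vertex not seen on earlier levels.
        next_frontier = [
            path + [dst]
            for path in frontier
            for src, dst, rel in edge_list
            if src == path[-1] and edge_ok(rel) and dst not in visited
        ]
        visited.update(path[-1] for path in next_frontier)
        frontier = next_frontier
    return []


paths = bfs_paths(lambda v: people[v][0] == "Esther", lambda v: people[v][1] < 31)
filtered_paths = bfs_paths(
    lambda v: people[v][0] == "Esther",
    lambda v: people[v][1] < 32,
    edge_ok=lambda rel: rel != "friend",
    max_path_length=3,
)
print(paths)
print(filtered_paths)
```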

### Shortest paths

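Computes the shortest path from every vertex to each of a given set of "landmark" vertices, where landmarks are specified by vertex ID. In the results below, each distance is a number of edges followed in their direction, and a vertex that cannot reach a landmark (such as Gabby, who has no edges) simply has no entry for it.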

In [0]:
results = graph_sample.shortestPaths(landmarks=["a", "d"])
display(results)

id,name,age,gender,distances
g,Gabby,60,F,Map()
f,Fanny,36,F,"Map(a -> 5, d -> 7)"
e,Esther,32,F,"Map(a -> 2, d -> 1)"
h,Mark,45,M,"Map(a -> 1, d -> 3)"
d,David,29,M,"Map(a -> 1, d -> 0)"
c,Charlie,30,M,"Map(a -> 4, d -> 6)"
i,Eddie,60,M,"Map(a -> 5, d -> 7)"
j,Mandy,21,F,"Map(a -> 2, d -> 4)"
b,Bob,36,M,"Map(a -> 3, d -> 5)"
a,Alice,34,F,"Map(a -> 0, d -> 2)"


In [0]:
results = graph_sample.shortestPaths(landmarks=["a", "d", "h"])
display(results)


id,name,age,gender,distances
g,Gabby,60,F,Map()
f,Fanny,36,F,"Map(h -> 4, a -> 5, d -> 7)"
e,Esther,32,F,"Map(h -> 2, a -> 2, d -> 1)"
h,Mark,45,M,"Map(h -> 0, a -> 1, d -> 3)"
d,David,29,M,"Map(h -> 1, a -> 1, d -> 0)"
c,Charlie,30,M,"Map(h -> 3, a -> 4, d -> 6)"
i,Eddie,60,M,"Map(h -> 4, a -> 5, d -> 7)"
j,Mandy,21,F,"Map(h -> 1, a -> 2, d -> 4)"
b,Bob,36,M,"Map(h -> 2, a -> 3, d -> 5)"
a,Alice,34,F,"Map(h -> 1, a -> 0, d -> 2)"
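
The distances in these tables are consistent with unweighted hop counts along edge direction. Assuming that reading, the sketch below recomputes them in plain Python by running a BFS backwards from each landmark over reversed edges. It is meant as a cross-check of the numbers, not as a description of how GraphFrames computes them; `vertex_ids`, `edge_list` and `hops_to_landmarks` are names invented for this example.

```python
from collections import deque

vertex_ids = list("abcdefghij")
# Same (src, dst, relationship) tuples as the edges DataFrame defined earlier.
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]


def hops_to_landmarks(landmarks):
    """Map each vertex to {landmark: number of edges needed to reach it}."""
    # Predecessor lists: walking edges backwards from a landmark
    # visits exactly the vertices that can reach it.
    preds = {v: [] for v in vertex_ids}
    for src, dst, _ in edge_list:
        preds[dst].append(src)

    distances = {v: {} for v in vertex_ids}
    for landmark in landmarks:
        distances[landmark][landmark] = 0
        queue = deque([landmark])
        while queue:
            current = queue.popleft()
            for p in preds[current]:
                if landmark not in distances[p]:  # first visit gives the shortest distance
                    distances[p][landmark] = distances[current][landmark] + 1
                    queue.append(p)
    return distances


result = hops_to_landmarks(["a", "d", "h"])
for v in sorted(result):
    print(v, result[v])
```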
