# Graph Algorithms - Part 1 - The Basics

## Implemented in the Apache Spark GraphX platform

Topics Covered:

  1. Initialising the environment
  1. Starting a Spark session
  1. Load sample graph data
  1. Basic GraphX operations
  1. Simple Pregel API


---
## 1. Initialising the environment

### 1.1 Source the libraries for Apache Spark

When running in a Jupyter notebook, the required libraries may not be on the classpath.

Load the essential Spark libraries from the public Maven repositories at runtime like this:

In [1]:
import $ivy.`org.apache.spark::spark-core:3.2.0`
import $ivy.`org.apache.spark::spark-mllib-local:3.2.0`
import $ivy.`org.apache.spark::spark-mllib:3.2.0`
import $ivy.`org.apache.spark::spark-graphx:3.2.0`
import $ivy.`org.apache.spark::spark-streaming:3.2.0`
import $ivy.`org.apache.spark::spark-tags:3.2.0`

[32mimport [39m[36m$ivy.$                                   
[39m
[32mimport [39m[36m$ivy.$                                          
[39m
[32mimport [39m[36m$ivy.$                                    
[39m
[32mimport [39m[36m$ivy.$                                     
[39m
[32mimport [39m[36m$ivy.$                                        
[39m
[32mimport [39m[36m$ivy.$                                   [39m

In [2]:
import $ivy.`org.scalanlp::breeze-viz:1.2`
import $ivy.`org.jfree:jfreechart:1.5.4`
import $ivy.`org.creativescala::doodle-core:0.9.21`

[32mimport [39m[36m$ivy.$                             
[39m
[32mimport [39m[36m$ivy.$                           
[39m
[32mimport [39m[36m$ivy.$                                      [39m

---

### 1.2 Import the Spark Libraries

In [3]:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

[32mimport [39m[36morg.apache.spark.SparkContext
[39m
[32mimport [39m[36morg.apache.spark.SparkConf
[39m
[32mimport [39m[36morg.apache.spark.sql.SparkSession[39m

In [4]:
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, udf, _}

[32mimport [39m[36morg.apache.spark.ml.linalg.{Matrix, Vectors}
[39m
[32mimport [39m[36morg.apache.spark.sql.Row
[39m
[32mimport [39m[36morg.apache.spark.sql.Dataset
[39m
[32mimport [39m[36morg.apache.spark.sql.functions.{col, udf, _}[39m

In [5]:
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD

[32mimport [39m[36morg.apache.spark._
[39m
[32mimport [39m[36morg.apache.spark.sql._
[39m
[32mimport [39m[36morg.apache.spark.graphx._
// To make some of the examples work we will also need RDD
[39m
[32mimport [39m[36morg.apache.spark.rdd.RDD[39m

In [6]:
import breeze.linalg._
import breeze.plot._

[32mimport [39m[36mbreeze.linalg._
[39m
[32mimport [39m[36mbreeze.plot._[39m

In [6]:
// this uses the IBM DB2 connector to read from a DB2 table
//import $ivy.`com.ibm.db2.jcc:db2jcc:db2jcc4`;

In [7]:
val appName = "Spark_Graph_Algorithms"

[36mappName[39m: [32mString[39m = [32m"Spark_Graph_Algorithms"[39m

### 1.3 Setup the Logger

To control the volume of log messages, change the log4j configuration programmatically like this:

In [8]:
import org.apache.log4j.{Level, Logger}
//Logger.getLogger("org").setLevel(Level.INFO)

val logger: Logger = Logger.getLogger(appName)
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)
logger.setLevel(Level.INFO)

[32mimport [39m[36morg.apache.log4j.{Level, Logger}
//Logger.getLogger("org").setLevel(Level.INFO)

[39m
[36mlogger[39m: [32mLogger[39m = org.apache.log4j.Logger@3463b322

---
## 2. Create Spark session

### 2.1 Initialise Spark Session

In [8]:
// close the spark session and spark context before starting a new one, if re-executing the notebook.

//spark.stop()
//sc.stop()

In [9]:
val sparkConf = new SparkConf()
             .setAppName(appName)
             .setMaster("local[*]")
             //.setMaster("spark://localhost:7077")
             //.setMaster("spark://sparkmaster320:7077")
             .set("spark.driver.extraClassPath", "c:/bin/lib/db2jcc4.jar,c:/bin/lib/breeze-viz_2.12-1.2.jar")
             .set("spark.executor.extraClassPath", "c:/bin/lib/db2jcc4.jar,c:/bin/lib/breeze-viz_2.12-1.2.jar")
             .set("spark.default.parallelism", "6")

[36msparkConf[39m: [32mSparkConf[39m = org.apache.spark.SparkConf@6fb026

In [10]:
// Apply the config to start a spark session:
val spark = org.apache.spark.sql.SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/05/21 10:24:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


[36mspark[39m: [32mSparkSession[39m = org.apache.spark.sql.SparkSession@23dcaed

In [11]:
val sc = spark.sparkContext

[36msc[39m: [32mSparkContext[39m = org.apache.spark.SparkContext@deaff9a

### 2.2 Get information on Spark Session

Use spark context and config objects to get essential information.

In [12]:
println("Spark Master: %s, User: %s, Version: %s, Deployment mode: %s".format(
        sc.master, sc.sparkUser, sc.version, sc.deployMode
    ))

println("Default Partitions: %d, Scheduling Mode: %s".format(
         sc.defaultMinPartitions, sc.getSchedulingMode
    ))

Spark Master: local[*], User: notebooker, Version: 3.2.0, Deployment mode: client
Default Partitions: 2, Scheduling Mode: FIFO


In [13]:
val config = sc.getConf

for ((k,v) <- config.getAll) println(s"Configuration Parameter: $k=$v")

Configuration Parameter: spark.driver.host=jupyterlab
Configuration Parameter: spark.app.startTime=1684644882273
Configuration Parameter: spark.driver.port=38353
Configuration Parameter: spark.default.parallelism=6
Configuration Parameter: spark.executor.extraClassPath=c:/bin/lib/db2jcc4.jar,c:/bin/lib/breeze-viz_2.12-1.2.jar
Configuration Parameter: spark.app.name=Spark_Graph_Algorithms
Configuration Parameter: spark.master=local[*]
Configuration Parameter: spark.driver.extraClassPath=c:/bin/lib/db2jcc4.jar,c:/bin/lib/breeze-viz_2.12-1.2.jar
Configuration Parameter: spark.executor.id=driver
Configuration Parameter: spark.app.id=local-1684644889897


[36mconfig[39m: [32mSparkConf[39m = org.apache.spark.SparkConf@64832105

In [14]:
config.getOption("spark.executor.extraClassPath")

[36mres13[39m: [32mOption[39m[[32mString[39m] = [33mSome[39m(
  [32m"c:/bin/lib/db2jcc4.jar,c:/bin/lib/breeze-viz_2.12-1.2.jar"[39m
)

In [15]:
config.getOption("spark.jars")

[36mres14[39m: [32mOption[39m[[32mString[39m] = [32mNone[39m

In [16]:
sys.env("PATH")

[36mres15[39m: [32mString[39m = [32m"/opt/conda/bin:/home/notebooker/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"[39m

In [17]:
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

[32mimport [39m[36mspark.implicits._[39m

In [66]:
import org.apache.spark.storage.StorageLevel

[32mimport [39m[36morg.apache.spark.storage.StorageLevel[39m

---

## Load sample graph data

Data can be loaded into a graph by reading an edge list file:

In [210]:
// read from edgelist file
val graph1 = GraphLoader
      .edgeListFile(sc,
                    "../src/test/resources/graph1_edgelist.txt",
                    edgeStorageLevel=StorageLevel.MEMORY_AND_DISK,
                    vertexStorageLevel=StorageLevel.MEMORY_AND_DISK)
      .mapEdges(e => e.attr.toDouble)
      .mapVertices[(Long, Double)]((vid, data) => (vid.toLong, 0.0));

23/05/22 10:50:36 INFO FileInputFormat: Total input files to process : 1


[36mgraph1[39m: [32mGraph[39m[([32mLong[39m, [32mDouble[39m), [32mDouble[39m] = org.apache.spark.graphx.impl.GraphImpl@5ace36c4

Data can also be fed in via RDDs of edges and vertices:

In [65]:
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize( Array(
        (3L, ("rxin", "student"))
      , (7L, ("jgonzal", "postdoc"))
      , (1L, ("somebody", "postdoc"))
      , (5L, ("franklin", "prof"))
      , (2L, ("istoica", "prof"))
      , (10L, ("hoityToity", "student"))
     )
   ).persist(StorageLevel.MEMORY_AND_DISK)

[36musers[39m: [32mRDD[39m[([32mVertexId[39m, ([32mString[39m, [32mString[39m))] = ParallelCollectionRDD[210] at parallelize at cmd64.sc:2

In [64]:
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(
      Array(
      Edge(3L, 7L, "collab")
      , Edge(5L, 3L, "advisor")
      , Edge(2L, 5L, "colleague")
      , Edge(5L, 7L, "pi")
      , Edge(10L, 5L, "friend")
      , Edge(10L, 1L, "friend")
      )
    ).persist(StorageLevel.MEMORY_AND_DISK)

[36mrelationships[39m: [32mRDD[39m[[32mEdge[39m[[32mString[39m]] = ParallelCollectionRDD[209] at parallelize at cmd63.sc:2

In [69]:
// Define a default user in case a relationship refers to a missing user
val defaultUser = ("John Doe", "Missing")

[36mdefaultUser[39m: ([32mString[39m, [32mString[39m) = ([32m"John Doe"[39m, [32m"Missing"[39m)

In [70]:
// Build the initial Graph
val graph2 = Graph(users, relationships, defaultUser)

[36mgraph2[39m: [32mGraph[39m[([32mString[39m, [32mString[39m), [32mString[39m] = org.apache.spark.graphx.impl.GraphImpl@58cdbc18

---

## Basic GraphX operations


In [121]:
// define convenience function to print all edges of a graph,
// showing the first component of each vertex attribute:
def printAllEdges[V, D, E](graph: Graph[(V, D), E]): Unit = {
    val facts: RDD[String] = graph.triplets.map(triplet =>
      "(" + triplet.srcId + "," + triplet.srcAttr._1 + ") --[ " + triplet.attr + " ]--> (" +
        triplet.dstId + "," + triplet.dstAttr._1 + ")")

    facts.collect.foreach(println(_))
}

defined [32mfunction[39m [36mprintAllEdges[39m

In [163]:
// define convenience function to print each edge with the full (id, attribute) tuple of both vertices:
def printOnlyEdges[V, E]( graph: Graph[V, E] ): Unit = {
    
    val facts: RDD[String] = graph.triplets.map(triplet => 
      " " + triplet.toTuple._1 + " --[" + triplet.toTuple._3 + "]--> " + triplet.toTuple._2 );

    facts.collect.foreach(println(_))
}

defined [32mfunction[39m [36mprintOnlyEdges[39m

In [58]:
def printGraphProperties( graph: Graph[_,_] ): Unit = {
    // graph operators:
    println( "Num of edges = " + graph.numEdges )
    println( "Num of vertices = " + graph.numVertices )
    println( "Num of inDegrees = " + graph.inDegrees.count() )
    println( "Num of outDegrees = " + graph.outDegrees.count() )
    println( "Num of degrees = " + graph.degrees.count() )
}

defined [32mfunction[39m [36mprintGraphProperties[39m

In [145]:
def printNeighbors(graph: Graph[_, _], edgeDirection: EdgeDirection): Unit = {
    graph.collectNeighborIds(edgeDirection).collect.foreach(
      x =>
        println("Neighbors of " + x._1 + " ("+ edgeDirection +") are: " + x._2.mkString(",") )
    );
}

defined [32mfunction[39m [36mprintNeighbors[39m

In [177]:
def printVertices(graph: Graph[_, _]): Unit = {
    graph.vertices.map(
      vd => "Vertex ID = " + vd._1 + ": " + vd._2
    ).collect.foreach(println(_))
}

defined [32mfunction[39m [36mprintVertices[39m

In [182]:
printVertices( graph2 )

Vertex ID = 1: (somebody,postdoc)
Vertex ID = 7: (jgonzal,postdoc)
Vertex ID = 2: (istoica,prof)
Vertex ID = 3: (rxin,student)
Vertex ID = 10: (hoityToity,student)
Vertex ID = 5: (franklin,prof)


In [211]:
// print out the graph:
printAllEdges( graph1 )

(10,10) --[ 1.0 ]--> (20,20)
(20,20) --[ 1.0 ]--> (30,30)
(30,30) --[ 1.0 ]--> (10,10)
(70,70) --[ 1.0 ]--> (80,80)
(40,40) --[ 1.0 ]--> (50,50)
(50,50) --[ 1.0 ]--> (60,60)
(60,60) --[ 1.0 ]--> (20,20)
(80,80) --[ 1.0 ]--> (90,90)
(90,90) --[ 1.0 ]--> (70,70)


In [184]:
// print out the graph:
printAllEdges( graph2 )

(3,rxin) --[ collab ]--> (7,jgonzal)
(5,franklin) --[ advisor ]--> (3,rxin)
(2,istoica) --[ colleague ]--> (5,franklin)
(5,franklin) --[ pi ]--> (7,jgonzal)
(10,hoityToity) --[ friend ]--> (5,franklin)
(10,hoityToity) --[ friend ]--> (1,somebody)


In [212]:
// print out basic properties:
printGraphProperties(graph1)

Num of edges = 9
Num of vertices = 9
Num of inDegrees = 8
Num of outDegrees = 9
Num of degrees = 9


In [186]:
// print out basic properties:
printGraphProperties(graph2)

Num of edges = 6
Num of vertices = 6
Num of inDegrees = 4
Num of outDegrees = 4
Num of degrees = 6
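
The degree counts above can look surprising at first: `inDegrees`, `outDegrees` and `degrees` only contain vertices that have at least one matching edge, so counting them gives the number of such vertices, not a sum of degrees. A minimal plain-Scala sketch (no Spark needed) that reproduces the three numbers for `graph2` from its edge list; the value names are illustrative only:

```scala
// Edges of graph2 as (source, destination) pairs, copied from the `relationships` cell.
val graph2Edges: Seq[(Long, Long)] =
  Seq((3L, 7L), (5L, 3L), (2L, 5L), (5L, 7L), (10L, 5L), (10L, 1L))

val withInEdges  = graph2Edges.map(_._2).distinct          // vertices that appear as a destination
val withOutEdges = graph2Edges.map(_._1).distinct          // vertices that appear as a source
val withAnyEdge  = (withInEdges ++ withOutEdges).distinct  // vertices touched by any edge

println(s"inDegrees = ${withInEdges.size}, outDegrees = ${withOutEdges.size}, degrees = ${withAnyEdge.size}")
```

Vertices 2 and 10 have no incoming edges and vertices 1 and 7 have no outgoing edges, which is why the first two counts are 4 while all 6 vertices appear in `degrees`.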


In [213]:
println( "Graph 1: Count all the edges where src > dst")
println( graph1.edges.filter(e => e.srcId > e.dstId).count )

Graph 1: Count all the edges where src > dst
3


In [214]:
println( "Graph 2: Count all the edges where src > dst")
println( graph2.edges.filter(e => e.srcId > e.dstId).count )

Graph 2: Count all the edges where src > dst
3


In [189]:
println( "Graph 1: Reversed edge directions")
printAllEdges( graph1.reverse)

Graph 1: Reversed edge directions
(20,20) --[ 1.0 ]--> (10,10)
(30,30) --[ 1.0 ]--> (20,20)
(50,50) --[ 1.0 ]--> (40,40)
(80,80) --[ 1.0 ]--> (70,70)
(20,20) --[ 1.0 ]--> (60,60)
(60,60) --[ 1.0 ]--> (50,50)
(70,70) --[ 1.0 ]--> (90,90)
(90,90) --[ 1.0 ]--> (80,80)


In [190]:
val graph3 = graph1.subgraph(vpred = (id, attr) => id > 10)

[36mgraph3[39m: [32mGraph[39m[([32mLong[39m, [32mDouble[39m), [32mDouble[39m] = org.apache.spark.graphx.impl.GraphImpl@2744f861

In [215]:
println( "Graph 1: subgraph of vertices with id > 10")
printAllEdges(graph3)

Graph 1: subgraph of vertices with id > 10
(20,20) --[ 1.0 ]--> (30,30)
(40,40) --[ 1.0 ]--> (50,50)
(70,70) --[ 1.0 ]--> (80,80)
(50,50) --[ 1.0 ]--> (60,60)
(60,60) --[ 1.0 ]--> (20,20)
(80,80) --[ 1.0 ]--> (90,90)
(90,90) --[ 1.0 ]--> (70,70)


In [130]:
val graph2_nofriends = graph2.subgraph(epred = edgetriplet => edgetriplet.attr != "friend")

[36mgraph2_nofriends[39m: [32mGraph[39m[([32mString[39m, [32mString[39m), [32mString[39m] = org.apache.spark.graphx.impl.GraphImpl@3cdef54d

In [131]:
val graph2_friends = graph2.subgraph(epred = edgetriplet => edgetriplet.attr == "friend")

[36mgraph2_friends[39m: [32mGraph[39m[([32mString[39m, [32mString[39m), [32mString[39m] = org.apache.spark.graphx.impl.GraphImpl@2e30b00f

In [139]:
printAllEdges(graph2_nofriends)

(3,rxin) --[ collab ]--> (7,jgonzal)
(5,franklin) --[ advisor ]--> (3,rxin)
(2,istoica) --[ colleague ]--> (5,franklin)
(5,franklin) --[ pi ]--> (7,jgonzal)


In [192]:
val graph5 = graph1.mask(graph3)

[36mgraph5[39m: [32mGraph[39m[([32mLong[39m, [32mDouble[39m), [32mDouble[39m] = org.apache.spark.graphx.impl.GraphImpl@2fcd897a

In [193]:
printAllEdges(graph5)

(20,20) --[ 1.0 ]--> (30,30)
(40,40) --[ 1.0 ]--> (50,50)
(70,70) --[ 1.0 ]--> (80,80)
(50,50) --[ 1.0 ]--> (60,60)
(60,60) --[ 1.0 ]--> (20,20)
(80,80) --[ 1.0 ]--> (90,90)
(90,90) --[ 1.0 ]--> (70,70)


In [194]:
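// groupEdges merges parallel edges (same source and destination) with the given function.
// graph2 has no parallel edges, so the result below is unchanged; GraphX also expects
// partitionBy to have been called first for correct results.
val graph6 = graph2.groupEdges( (x, y) => "friend")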

[36mgraph6[39m: [32mGraph[39m[([32mString[39m, [32mString[39m), [32mString[39m] = org.apache.spark.graphx.impl.GraphImpl@6534b2f5

In [195]:
printAllEdges(graph6)

(3,rxin) --[ collab ]--> (7,jgonzal)
(5,franklin) --[ advisor ]--> (3,rxin)
(2,istoica) --[ colleague ]--> (5,franklin)
(5,franklin) --[ pi ]--> (7,jgonzal)
(10,hoityToity) --[ friend ]--> (5,franklin)
(10,hoityToity) --[ friend ]--> (1,somebody)
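
Because `graph2` has no parallel edges, the `groupEdges` result above is identical to the input. The plain-Scala sketch below illustrates what the operator is meant to do on a multigraph: edges that share source and destination are merged with a user-supplied function. The edge list is hypothetical, and the sketch does not use GraphX itself (where the graph would first be repartitioned with `partitionBy` so that parallel edges sit in the same partition).

```scala
// Hypothetical multigraph: two parallel edges 1 -> 2, plus a single edge 2 -> 3.
val multiEdges: Seq[(Long, Long, Double)] =
  Seq((1L, 2L, 1.0), (1L, 2L, 2.0), (2L, 3L, 5.0))

// Group by (source, destination) and merge the attributes, here by summing them.
val mergedEdges: Seq[(Long, Long, Double)] = multiEdges
  .groupBy { case (src, dst, _) => (src, dst) }
  .map { case ((src, dst), group) => (src, dst, group.map(_._3).reduce(_ + _)) }
  .toSeq
  .sortBy(e => (e._1, e._2))

mergedEdges.foreach { case (src, dst, w) => println(s"$src --[ $w ]--> $dst") }
```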


In [216]:
printNeighbors(graph1, EdgeDirection.Out)

Neighbors of 80 (EdgeDirection.Out) are: 90
Neighbors of 30 (EdgeDirection.Out) are: 10
Neighbors of 50 (EdgeDirection.Out) are: 60
Neighbors of 40 (EdgeDirection.Out) are: 50
Neighbors of 90 (EdgeDirection.Out) are: 70
Neighbors of 70 (EdgeDirection.Out) are: 80
Neighbors of 20 (EdgeDirection.Out) are: 30
Neighbors of 60 (EdgeDirection.Out) are: 20
Neighbors of 10 (EdgeDirection.Out) are: 20


In [197]:
printNeighbors(graph2, EdgeDirection.In)

Neighbors of 1 (EdgeDirection.In) are: 10
Neighbors of 7 (EdgeDirection.In) are: 3,5
Neighbors of 2 (EdgeDirection.In) are: 
Neighbors of 3 (EdgeDirection.In) are: 5
Neighbors of 10 (EdgeDirection.In) are: 
Neighbors of 5 (EdgeDirection.In) are: 2,10


In [None]:
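// Join RDDs with the graph (the values in `extra` are illustrative only)
val extra: RDD[(VertexId, Double)] = sc.parallelize(Seq((10L, 1.5), (20L, 2.5)))
// joinVertices keeps the vertex type; vertices absent from `extra` are left unchanged
val graph7 = graph1.joinVertices(extra)((id, attr, w) => (attr._1, w))

// outerJoinVertices may change the vertex type; absent vertices receive None
val graph8 = graph1.outerJoinVertices(graph1.outDegrees)(
      (id, attr, deg) => deg.getOrElse(0))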

In [None]:
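// send a message along every edge: each vertex counts its incoming edges
val vrdd: VertexRDD[Int] = graph1.aggregateMessages[Int](
      sendMsg = ctx => ctx.sendToDst(1),   // one message per edge, sent to the destination
      mergeMsg = _ + _,                    // sum the messages arriving at a vertex
      tripletFields = TripletFields.None)  // no vertex or edge attributes are read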

## Basic graph algorithms

In [217]:
val graph10 = graph1.pageRank(tol=0.01, resetProb = 0.15)

[36mgraph10[39m: [32mGraph[39m[[32mDouble[39m, [32mDouble[39m] = org.apache.spark.graphx.impl.GraphImpl@65d7b55a

In [247]:
val pgrank_df = spark.createDataFrame(graph10.vertices).toDF(Seq("Vertex_id", "pagerank_score"):_*)

[36mpgrank_df[39m: [32mDataFrame[39m = [Vertex_id: bigint, pagerank_score: double]

In [248]:
pgrank_df.show()

+---------+-------------------+
|Vertex_id|     pagerank_score|
+---------+-------------------+
|       80| 0.9822563471669581|
|       30| 1.7297158594552773|
|       50| 0.2880277933777645|
|       40|0.15569069912311598|
|       90| 0.9822563471669581|
|       70| 0.9822563471669581|
|       20|  1.862052953709926|
|       60| 0.4005143234942158|
|       10|  1.617229329338826|
+---------+-------------------+
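
To see where these scores come from, the sketch below iterates the un-normalised PageRank update `rank(v) = resetProb + (1 - resetProb) * sum(rank(u) / outDegree(u))` over graph1's edges using plain Scala collections. It is a simplified fixed-iteration version, not GraphX's implementation: `pageRank(tol, ...)` stops once changes fall below `tol`, so the numbers above differ slightly. The iteration count of 50 is an arbitrary assumption.

```scala
// graph1's edges as (source, destination), taken from the printAllEdges(graph1) output.
val prEdges: Seq[(Long, Long)] = Seq(
  (10L, 20L), (20L, 30L), (30L, 10L), (70L, 80L), (40L, 50L),
  (50L, 60L), (60L, 20L), (80L, 90L), (90L, 70L))

val prVertices = prEdges.flatMap { case (s, d) => Seq(s, d) }.distinct.sorted
val outDegree  = prEdges.groupBy(_._1).map { case (v, es) => v -> es.size }
val resetProb  = 0.15

// One update step: every vertex collects rank / outDegree from its in-neighbours.
def step(ranks: Map[Long, Double]): Map[Long, Double] = {
  val incoming = prEdges.groupBy(_._2).map { case (dst, es) =>
    dst -> es.map { case (src, _) => ranks(src) / outDegree(src) }.sum
  }
  prVertices.map(v => v -> (resetProb + (1 - resetProb) * incoming.getOrElse(v, 0.0))).toMap
}

val initialRanks = prVertices.map(_ -> 1.0).toMap
val ranks = Iterator.iterate(initialRanks)(step).drop(50).next() // 50 steps, arbitrary choice

prVertices.foreach(v => println(f"$v%3d -> ${ranks(v)}%.4f"))
```

In this sketch vertex 40, which has no incoming edges, settles at exactly `resetProb`, and the isolated cycle 70, 80, 90 stays at 1.0; the GraphX run above reports about 0.156 and 0.982 for the same vertices because of its tolerance-based stopping rule.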



In [218]:
printOnlyEdges(graph10)

 (10,1.617229329338826) --[1.0]--> (20,1.862052953709926)
 (20,1.862052953709926) --[1.0]--> (30,1.7297158594552773)
 (30,1.7297158594552773) --[1.0]--> (10,1.617229329338826)
 (70,0.9822563471669581) --[1.0]--> (80,0.9822563471669581)
 (40,0.15569069912311598) --[1.0]--> (50,0.2880277933777645)
 (50,0.2880277933777645) --[1.0]--> (60,0.4005143234942158)
 (60,0.4005143234942158) --[1.0]--> (20,1.862052953709926)
 (80,0.9822563471669581) --[1.0]--> (90,0.9822563471669581)
 (90,0.9822563471669581) --[1.0]--> (70,0.9822563471669581)


In [220]:
printVertices(graph10)

Vertex ID = 80: 0.9822563471669581
Vertex ID = 30: 1.7297158594552773
Vertex ID = 50: 0.2880277933777645
Vertex ID = 40: 0.15569069912311598
Vertex ID = 90: 0.9822563471669581
Vertex ID = 70: 0.9822563471669581
Vertex ID = 20: 1.862052953709926
Vertex ID = 60: 0.4005143234942158
Vertex ID = 10: 1.617229329338826


In [221]:
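// Run Connected Components; each vertex is labelled with the lowest vertex ID in its component
val ccGraph = graph1.connectedComponents()

// Remove missing vertices as well as the edges connected to them.
// Note: this filter comes from the user-graph example. graph1 attributes are (Long, Double),
// so comparing attr._2 with "Missing" is always true and every vertex is kept.
val validGraph = graph1.subgraph(vpred = (id, attr) => attr._2 != "Missing")

// Restrict the answer to the valid subgraph
val validCCGraph = ccGraph.mask(validGraph)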

[36mccGraph[39m: [32mGraph[39m[[32mVertexId[39m, [32mDouble[39m] = org.apache.spark.graphx.impl.GraphImpl@777bf690
[36mvalidGraph[39m: [32mGraph[39m[([32mLong[39m, [32mDouble[39m), [32mDouble[39m] = org.apache.spark.graphx.impl.GraphImpl@5143abf9
[36mvalidCCGraph[39m: [32mGraph[39m[[32mVertexId[39m, [32mDouble[39m] = org.apache.spark.graphx.impl.GraphImpl@4e7b99f1

In [222]:
printOnlyEdges(ccGraph)

 (10,10) --[1.0]--> (20,10)
 (20,10) --[1.0]--> (30,10)
 (30,10) --[1.0]--> (10,10)
 (70,70) --[1.0]--> (80,70)
 (40,10) --[1.0]--> (50,10)
 (50,10) --[1.0]--> (60,10)
 (60,10) --[1.0]--> (20,10)
 (80,70) --[1.0]--> (90,70)
 (90,70) --[1.0]--> (70,70)


In [223]:
printVertices(ccGraph)

Vertex ID = 80: 70
Vertex ID = 30: 10
Vertex ID = 50: 10
Vertex ID = 40: 10
Vertex ID = 90: 70
Vertex ID = 70: 70
Vertex ID = 20: 10
Vertex ID = 60: 10
Vertex ID = 10: 10
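
The labels above are the lowest vertex ID in each component, with edge direction ignored. One way to picture this is as repeated label propagation: every vertex starts with its own ID and keeps taking the minimum over itself and its neighbours until nothing changes. The plain-Scala sketch below follows that idea on graph1's edges; it is an illustration under that assumption, not the GraphX source.

```scala
// graph1's edges as (source, destination), taken from the printAllEdges(graph1) output.
val ccEdges: Seq[(Long, Long)] = Seq(
  (10L, 20L), (20L, 30L), (30L, 10L), (70L, 80L), (40L, 50L),
  (50L, 60L), (60L, 20L), (80L, 90L), (90L, 70L))

// Undirected adjacency: each edge contributes a neighbour in both directions.
val neighbours: Map[Long, Set[Long]] = ccEdges
  .flatMap { case (s, d) => Seq(s -> d, d -> s) }
  .groupBy(_._1)
  .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

// Repeat "take the smallest label among myself and my neighbours" until stable.
def propagate(labels: Map[Long, Long]): Map[Long, Long] = {
  val next = labels.map { case (v, label) => v -> (neighbours(v).map(labels) + label).min }
  if (next == labels) labels else propagate(next)
}

val components = propagate(neighbours.keys.map(v => v -> v).toMap)

components.toSeq.sortBy(_._1).foreach { case (v, c) => println(s"Vertex ID = $v: $c") }
```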


In [224]:
val triGraph = graph1.triangleCount()

[36mtriGraph[39m: [32mGraph[39m[[32mInt[39m, [32mDouble[39m] = org.apache.spark.graphx.impl.GraphImpl@5eccb798

In [225]:
printOnlyEdges(triGraph)

23/05/22 10:52:38 WARN ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow.
23/05/22 10:52:38 WARN ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow.


 (10,1) --[1.0]--> (20,1)
 (20,1) --[1.0]--> (30,1)
 (30,1) --[1.0]--> (10,1)
 (70,1) --[1.0]--> (80,1)
 (40,0) --[1.0]--> (50,0)
 (50,0) --[1.0]--> (60,0)
 (60,0) --[1.0]--> (20,1)
 (80,1) --[1.0]--> (90,1)
 (90,1) --[1.0]--> (70,1)


In [226]:
printVertices(triGraph)

Vertex ID = 80: 1
Vertex ID = 30: 1
Vertex ID = 50: 0
Vertex ID = 40: 0
Vertex ID = 90: 1
Vertex ID = 70: 1
Vertex ID = 20: 1
Vertex ID = 60: 0
Vertex ID = 10: 1
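
The triangle counts can be checked by hand: ignoring edge direction, a vertex takes part in one triangle for every pair of its neighbours that are themselves connected. A small plain-Scala sketch of that check on graph1's edges follows; the value names are illustrative and the approach is a brute-force one, chosen for readability rather than speed.

```scala
// graph1's edges as (source, destination), taken from the printAllEdges(graph1) output.
val triEdges: Seq[(Long, Long)] = Seq(
  (10L, 20L), (20L, 30L), (30L, 10L), (70L, 80L), (40L, 50L),
  (50L, 60L), (60L, 20L), (80L, 90L), (90L, 70L))

// Undirected adjacency sets.
val adjacency: Map[Long, Set[Long]] = triEdges
  .flatMap { case (s, d) => Seq(s -> d, d -> s) }
  .groupBy(_._1)
  .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

// For each vertex, count neighbour pairs (a, b) with a < b that are adjacent to each other.
val triangleCounts: Map[Long, Int] = adjacency.map { case (v, ns) =>
  val closedPairs = for (a <- ns; b <- ns if a < b && adjacency(a).contains(b)) yield (a, b)
  v -> closedPairs.size
}

triangleCounts.toSeq.sortBy(_._1).foreach { case (v, n) => println(s"Vertex ID = $v: $n") }
```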


In [238]:
val cc_df = spark.createDataFrame(ccGraph.vertices).toDF(Seq("Vertex_id", "Connected_id"):_*)

[36mcc_df[39m: [32mDataFrame[39m = [Vertex_id: bigint, Connected_id: bigint]

In [244]:
cc_df.show()

+---------+------------+
|Vertex_id|Connected_id|
+---------+------------+
|       80|          70|
|       30|          10|
|       50|          10|
|       40|          10|
|       90|          70|
|       70|          70|
|       20|          10|
|       60|          10|
|       10|          10|
+---------+------------+



In [241]:
val tri_df = spark.createDataFrame(triGraph.vertices).toDF(Seq("Vertex_id", "triangle_count"):_*)

[36mtri_df[39m: [32mDataFrame[39m = [Vertex_id: bigint, triangle_count: int]

In [245]:
tri_df.show()

+---------+--------------+
|Vertex_id|triangle_count|
+---------+--------------+
|       80|             1|
|       30|             1|
|       50|             0|
|       40|             0|
|       90|             1|
|       70|             1|
|       20|             1|
|       60|             0|
|       10|             1|
+---------+--------------+



In [242]:
val sccGraph = graph1.stronglyConnectedComponents(numIter=10)

[36msccGraph[39m: [32mGraph[39m[[32mVertexId[39m, [32mDouble[39m] = org.apache.spark.graphx.impl.GraphImpl@679f0259

In [243]:
val scc_df = spark.createDataFrame(sccGraph.vertices).toDF(Seq("Vertex_id", "strong_conn_comp"):_*)

[36mscc_df[39m: [32mDataFrame[39m = [Vertex_id: bigint, strong_conn_comp: bigint]

In [246]:
scc_df.show()

+---------+----------------+
|Vertex_id|strong_conn_comp|
+---------+----------------+
|       80|              70|
|       30|              10|
|       50|              50|
|       40|              40|
|       90|              70|
|       70|              70|
|       20|              10|
|       60|              60|
|       10|              10|
+---------+----------------+



In [231]:
printOnlyEdges(sccGraph)

 (10,10) --[1.0]--> (20,10)
 (20,10) --[1.0]--> (30,10)
 (30,10) --[1.0]--> (10,10)
 (70,70) --[1.0]--> (80,70)
 (40,40) --[1.0]--> (50,50)
 (50,50) --[1.0]--> (60,60)
 (60,60) --[1.0]--> (20,10)
 (80,70) --[1.0]--> (90,70)
 (90,70) --[1.0]--> (70,70)


In [232]:
printVertices(sccGraph)

Vertex ID = 80: 70
Vertex ID = 30: 10
Vertex ID = 50: 50
Vertex ID = 40: 40
Vertex ID = 90: 70
Vertex ID = 70: 70
Vertex ID = 20: 10
Vertex ID = 60: 60
Vertex ID = 10: 10


In [265]:
// join dataframes on Vertex_id : pgrank_df, cc_df, tri_df, scc_df
val graph1_vtx_data = cc_df.join(pgrank_df,cc_df("Vertex_id") === pgrank_df("Vertex_id"),"inner" )
.join(tri_df,cc_df("Vertex_id") === tri_df("Vertex_id"),"inner" )
.join(scc_df,cc_df("Vertex_id") === scc_df("Vertex_id"),"inner" )
.select(cc_df("Vertex_id"),cc_df("Connected_id"),pgrank_df("pagerank_score"), tri_df("triangle_count"), scc_df("strong_conn_comp"))

[36mgraph1_vtx_data[39m: [32mDataFrame[39m = [Vertex_id: bigint, Connected_id: bigint ... 3 more fields]

In [266]:
graph1_vtx_data.show()

+---------+------------+-------------------+--------------+----------------+
|Vertex_id|Connected_id|     pagerank_score|triangle_count|strong_conn_comp|
+---------+------------+-------------------+--------------+----------------+
|       10|          10|  1.617229329338826|             1|              10|
|       20|          10|  1.862052953709926|             1|              10|
|       30|          10| 1.7297158594552773|             1|              10|
|       40|          10|0.15569069912311598|             0|              40|
|       50|          10| 0.2880277933777645|             0|              50|
|       60|          10| 0.4005143234942158|             0|              60|
|       70|          70| 0.9822563471669581|             1|              70|
|       80|          70| 0.9822563471669581|             1|              70|
|       90|          70| 0.9822563471669581|             1|              70|
+---------+------------+-------------------+--------------+----------------+

In [267]:
graph1_vtx_data.write.option("header",true).mode(SaveMode.Overwrite).csv("graph1_properties.csv")

23/05/22 15:24:07 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/05/22 15:24:07 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/05/22 15:24:10 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/05/22 15:24:10 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/05/22 15:24:11 INFO FileOutputCommitter: Saved output of task 'attempt_202305221524102744077725859891567_7408_m_000000_2480' to file:/work/src/spark_projs/graphx-algorithms/examples/graph1_properties.csv/_temporary/0/task_202305221524102744077725859891567_7408_m_000000


---
## Simple Example of using the Pregel API

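The Pregel API runs an iterative, vertex-centric computation. In each superstep, every vertex that received messages merges them, updates its own state, and sends new messages along its edges; the loop ends when no messages are sent or `maxIterations` is reached.

Here is an example of computing single-source shortest paths in a graph.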

In [None]:
// val graph11 = graph1.pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)(
//       vprog: (VertexId, VD, A) => VD,
//       sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
//       mergeMsg: (A, A) => A)

In [287]:
def ShortestPath(graph: Graph[(Long, Double), Double], srcID: VertexId): RDD[Row] = {
    
    // Initialize the graph such that all vertices except the root have distance infinity.
    val initialGraph = graph.mapVertices(
        (id, _) =>
        if (id == srcID) 0.0 else Double.PositiveInfinity
    )

    val sssp = initialGraph.pregel(Double.PositiveInfinity, maxIterations=7)(
      (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
      triplet => {  // Send Message
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      (a, b) => math.min(a, b) // Merge Message
    )
    
    // Keep only reachable vertices and emit (source, target, distance) rows
    sssp.vertices.filter(y => y._2 < Double.PositiveInfinity).map(x => Row(srcID, x._1, x._2))
}

defined [32mfunction[39m [36mShortestPath[39m

In [288]:
val distances = ShortestPath(graph1, 10)

[36mdistances[39m: [32mRDD[39m[[32mRow[39m] = MapPartitionsRDD[4234] at map at cmd286.sc:21

In [290]:
val rowArray = distances.collect()

[36mrowArray[39m: [32mArray[39m[[32mRow[39m] = [33mArray[39m([10,30,2.0], [10,20,1.0], [10,10,0.0])

In [282]:
println(distances.collect.mkString("\n"))

[10,30,2.0]
[10,20,1.0]
[10,10,0.0]
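
The three roles passed to `pregel` in `ShortestPath` (vertex program, send message, merge message) can be traced on plain Scala collections. The sketch below replays the same logic superstep by superstep on graph1's edges, starting from vertex 10. It is a single-machine illustration of the idea with hypothetical names, not how GraphX executes the job.

```scala
// graph1's edges as (source, destination, weight), taken from the printAllEdges(graph1) output.
val ssspEdges: Seq[(Long, Long, Double)] = Seq(
  (10L, 20L, 1.0), (20L, 30L, 1.0), (30L, 10L, 1.0), (70L, 80L, 1.0), (40L, 50L, 1.0),
  (50L, 60L, 1.0), (60L, 20L, 1.0), (80L, 90L, 1.0), (90L, 70L, 1.0))
val ssspVertices = ssspEdges.flatMap(e => Seq(e._1, e._2)).distinct
val source = 10L

def superstep(dist: Map[Long, Double]): Map[Long, Double] = {
  // Send message: an edge fires only if it offers its destination a shorter distance.
  val messages = ssspEdges.collect {
    case (src, dst, w) if dist(src) + w < dist(dst) => dst -> (dist(src) + w)
  }
  // Merge message: keep the smallest proposal per destination.
  val merged = messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).min }
  // Vertex program: take the minimum of the current distance and the merged message.
  dist.map { case (v, d) => v -> math.min(d, merged.getOrElse(v, Double.PositiveInfinity)) }
}

// All vertices except the source start at infinity, as in ShortestPath.
var dist = ssspVertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap
var changed = true
while (changed) {
  val next = superstep(dist)
  changed = next != dist // stop once a superstep delivers no improvement
  dist = next
}

dist.filter(_._2 < Double.PositiveInfinity).toSeq.sortBy(_._1)
  .foreach { case (v, d) => println(s"[$source,$v,$d]") }
```

Only vertices 10, 20 and 30 are reachable from vertex 10 along directed edges, which matches the three rows returned above.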


---
## Persisting Data to Storage

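All graph data can be saved to disk.

For DataFrames, the default format for saving to disk is Parquet, which is compressed with Snappy by default. The RDD examples below use `saveAsObjectFile` instead, which writes the vertices and edges as Hadoop SequenceFiles of serialized Java objects.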

In [297]:
// Save vertices RDD to disk
graph1.vertices.saveAsObjectFile("vertices.obj")

23/05/22 23:13:06 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/05/22 23:13:06 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/05/22 23:13:06 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/05/22 23:13:06 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/05/22 23:13:06 INFO FileOutputCommitter: Saved output of task 'attempt_202305222313068791303148318076102_4244_m_000000_0' to file:/work/src/spark_projs/graphx-algorithms/examples/vertices.obj/_temporary/0/task_202305222313068791303148318076102_4244_m_000000
23/05/22 23:13:06 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/05/22 23:13:06 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false

In [298]:
// then, save edges RDD to disk
graph1.edges.saveAsObjectFile("edges.obj")

23/05/22 23:13:13 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/05/22 23:13:13 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/05/22 23:13:13 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/05/22 23:13:13 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/05/22 23:13:13 INFO FileOutputCommitter: Saved output of task 'attempt_202305222313138990698911240321899_4246_m_000000_0' to file:/work/src/spark_projs/graphx-algorithms/examples/edges.obj/_temporary/0/task_202305222313138990698911240321899_4246_m_000000
23/05/22 23:13:13 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/05/22 23:13:13 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
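
Saved object files can be read back with `SparkContext.objectFile`, provided the element type is spelled out, and the two RDDs can then be combined into a graph again. The sketch below does a full round trip on a tiny hypothetical graph in a temporary directory so that it does not touch the files written above; the names `session`, `ctx`, `small` and `restored` are illustrative, and `getOrCreate()` is assumed to reuse this notebook's session when one is running.

```scala
import java.nio.file.Files
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one, otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("object_file_round_trip").getOrCreate()
val ctx = session.sparkContext

// Hypothetical scratch directory for this sketch only.
val dir = Files.createTempDirectory("graph_roundtrip").toString

// A tiny stand-in graph: two edges, every vertex attribute set to 0.0.
val small = Graph.fromEdges(
  ctx.parallelize(Seq(Edge(10L, 20L, 1.0), Edge(20L, 30L, 1.0))), defaultValue = 0.0)
small.vertices.saveAsObjectFile(s"$dir/vertices.obj")
small.edges.saveAsObjectFile(s"$dir/edges.obj")

// objectFile cannot infer the element type from the files, so it is given explicitly.
val loadedVertices = ctx.objectFile[(VertexId, Double)](s"$dir/vertices.obj")
val loadedEdges    = ctx.objectFile[Edge[Double]](s"$dir/edges.obj")
val restored       = Graph(loadedVertices, loadedEdges)

println(s"restored: ${restored.numVertices} vertices, ${restored.numEdges} edges")
```

For `graph1` itself the vertex element type would be `(VertexId, (Long, Double))`, matching the graph's type shown when it was loaded.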

---

### Stop the Spark Session

In [299]:
spark.stop()

In [300]:
sc.stop()