# Graph Algorithms - Part 1 - The Basics

Algorithms implemented in the Apache Spark GraphX platform.

The following topics have been covered in this notebook:

  1. Initialising the environment
  1. Starting a Spark session
  1. Loading sample graph data
  1. Creating graphs from datasets
  1. Generating artificial graphs
  1. Basic GraphX properties
  1. Updating edge data of a graph
  1. Updating vertex data of a graph
  1. Extracting vertex data from a graph
  1. Saving a graph to JSON format
  1. Saving a graph to GEXF (Gephi) format


---
## 1. Initialising the environment

### 1.1 Source the libraries for Apache Spark

When running in a Jupyter notebook, the required libraries are sometimes missing from the classpath.

The essential Spark libraries can be loaded from the public Maven repositories at runtime like this:

In [1]:
import $ivy.`org.apache.spark::spark-core:3.4.0`
import $ivy.`org.apache.spark::spark-mllib-local:3.4.0`
import $ivy.`org.apache.spark::spark-mllib:3.4.0`
import $ivy.`org.apache.spark::spark-graphx:3.4.0`
import $ivy.`org.apache.spark::spark-streaming:3.4.0`
import $ivy.`org.apache.spark::spark-tags:3.4.0`

In [2]:
import $ivy.`org.scalanlp::breeze-viz:2.1.0`
import $ivy.`org.jfree:jfreechart:1.5.4`
import $ivy.`org.creativescala::doodle-core:0.18.0`
import $ivy.`com.fasterxml.jackson.core:jackson-databind:2.15.1`

---

### 1.2 Import the Spark Libraries

In [3]:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, udf, _}
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.graphx.util.GraphGenerators

In [4]:
import com.fasterxml.jackson.core.`type`.TypeReference
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import breeze.linalg._
import breeze.plot._

In [5]:
val appName = "Spark_Graph_Algorithms_1"

appName: String = "Spark_Graph_Algorithms_1"

### 1.3 Setup the Logger

To control the volume of log messages, change the log4j configuration programmatically like this:

In [6]:
import org.apache.log4j.{Level, Logger}
//Logger.getLogger("org").setLevel(Level.INFO)

val logger: Logger = Logger.getLogger(appName)
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)
logger.setLevel(Level.INFO)

logger: Logger = org.apache.log4j.Logger@194e1f5c

---
## 2. Starting a Spark session

### 2.1 Initialise Spark Session

In [6]:
// close the spark session and spark context before starting a new one, if re-executing the notebook.

//spark.stop()
//sc.stop()

In [7]:
val sparkConf = new SparkConf()
             .setAppName(appName)
             .setMaster("local[*]")
             //.setMaster("spark://sparkmaster320:7077")
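             // NOTE: class path entries are normally separated by ':' on Linux rather than ',', and in client mode
             // spark.driver.extraClassPath set here comes too late for the already running driver JVM.
             .set("spark.driver.extraClassPath", "/mnt/shared/lib/db2jcc4.jar,/mnt/shared/lib/breeze-viz_2.12-1.2.jar")
             .set("spark.executor.extraClassPath", "/mnt/shared/lib/db2jcc4.jar,/mnt/shared/lib/breeze-viz_2.12-1.2.jar")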
             .set("spark.default.parallelism", "6")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
23/06/02 17:40:02 WARN Utils: Your hostname, icy resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
23/06/02 17:40:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


sparkConf: SparkConf = org.apache.spark.SparkConf@3c40d2b4

In [8]:
// Apply the config to start a spark session:
val spark = org.apache.spark.sql.SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()

23/06/02 17:40:02 INFO SparkContext: Running Spark version 3.4.0
23/06/02 17:40:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/02 17:40:02 INFO ResourceUtils: No custom resources configured for spark.driver.
23/06/02 17:40:02 INFO SparkContext: Submitted application: Spark_Graph_Algorithms_1
23/06/02 17:40:02 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/06/02 17:40:02 INFO ResourceProfile: Limiting resource is cpu
23/06/02 17:40:02 INFO ResourceProfileManager: Added ResourceProfile id: 0
23/06/02 17:40:02 INFO SecurityManager: Changing view acls to: notebooker
23/06/02 17:40:02 INFO SecurityManager: Changing modify acls to: notebooker
23/06/02 17

spark: SparkSession = org.apache.spark.sql.SparkSession@3f719f23

In [9]:
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

### Set logging preferences

In [10]:
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.storage.BlockManagerMaster").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.storage.BlockManagerMasterEndpoint").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.storage.BlockManagerInfo").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.storage.DiskBlockManager").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.storage.memory.MemoryStore").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.storage.ShuffleBlockFetcherIterator").setLevel(Level.ERROR)

Logger.getLogger("org.apache.spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.scheduler.DAGScheduler").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.scheduler.TaskSchedulerImpl").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.scheduler.TaskSetManager").setLevel(Level.ERROR)

Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.executor.Executor").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.ui.JettyUtils").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.network.netty.NettyBlockTransferService").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.SparkEnv").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.util.Utils").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.rdd.HadoopRDD").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.MapOutputTrackerMasterEndpoint").setLevel(Level.ERROR)
Logger.getLogger("org.apache.hadoop.mapred.FileOutputCommitter").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.mapred.SparkHadoopMapRedUtil").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.internal.io.HadoopMapRedCommitProtocol").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.internal.io.SparkHadoopWriter").setLevel(Level.ERROR)
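The repeated calls above all follow one pattern, so they can be driven from a list. The following is only a sketch: `setLevels` and `noisy` are hypothetical names introduced here, and the logger names are a small subset of those already used in this notebook.

```scala
import org.apache.log4j.{Level, Logger}

// Hypothetical helper: applies one level to every named logger
// and returns how many loggers were touched.
def setLevels(names: Seq[String], level: Level): Int = {
  names.foreach(name => Logger.getLogger(name).setLevel(level))
  names.size
}

// A few of the logger names silenced in the cell above.
val noisy = Seq(
  "org.apache.spark.storage.BlockManager",
  "org.apache.spark.scheduler.DAGScheduler",
  "org.apache.spark.scheduler.TaskSetManager"
)

val changed = setLevels(noisy, Level.ERROR)
println(s"Set $changed loggers to ERROR")
```

Keeping the names in a single sequence makes it easier to see at a glance which loggers have been quietened.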

In [11]:
val sc = spark.sparkContext

sc: SparkContext = org.apache.spark.SparkContext@668e7802

### 2.2 Get information on Spark Session

Use the `SparkContext` and its `SparkConf` to get essential information about the running session.

In [12]:
println("Spark Master: %s, User: %s, Version: %s, Deployment mode: %s".format(
        sc.master, sc.sparkUser, sc.version, sc.deployMode
    ))

println("Default Partitions: %d, Scheduling Mode: %s".format(
         sc.defaultMinPartitions, sc.getSchedulingMode
    ))

val config = sc.getConf

for ((k,v) <- config.getAll) println(s"Configuration Parameter: $k=$v")

Spark Master: local[*], User: notebooker, Version: 3.4.0, Deployment mode: client
Default Partitions: 2, Scheduling Mode: FIFO
Configuration Parameter: spark.driver.host=10.0.2.15
Configuration Parameter: spark.default.parallelism=6
Configuration Parameter: spark.executor.extraClassPath=/mnt/shared/lib/db2jcc4.jar,/mnt/shared/lib/breeze-viz_2.12-1.2.jar
Configuration Parameter: spark.app.name=Spark_Graph_Algorithms_1
Configuration Parameter: spark.app.startTime=1685727602427
Configuration Parameter: spark.master=local[*]
Configuration Parameter: spark.app.id=local-1685727603602
Configuration Parameter: spark.executor.id=driver
Configuration Parameter: spark.driver.extraClassPath=/mnt/shared/lib/db2jcc4.jar,/mnt/shared/lib/breeze-viz_2.12-1.2.jar
Configuration Parameter: spark.driver.extraJavaOptions=-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/ja

config: SparkConf = org.apache.spark.SparkConf@7db70d1d

---

## Load sample graph data

GraphX can build a graph directly from an edge-list file, in which each line holds a source vertex ID and a destination vertex ID.

In [13]:
// define convenience function to print all edges of a graph with vertex data:
def printEdges[V, E]( graph: Graph[V, E] ): Unit = {
    
    val facts: RDD[String] = graph.triplets.map(triplet => 
      " " + triplet.toTuple._1 + " --[" + triplet.toTuple._3 + "]--> " + triplet.toTuple._2 );

    facts.collect.foreach(println(_))
}

defined function printEdges

In [14]:
def printVertices(graph: Graph[_, _]): Unit = {
    graph.vertices.map(
      vd => "Vertex ID = " + vd._1 + ": " + vd._2
    ).collect.foreach(println(_))
}

defined function printVertices

In [15]:
// read from edgelist file
val graph1 = GraphLoader
      .edgeListFile(sc,
                    "../src/test/resources/graph1_edgelist.txt",
                    edgeStorageLevel=StorageLevel.MEMORY_AND_DISK,
                    vertexStorageLevel=StorageLevel.MEMORY_AND_DISK)
      .mapEdges(e => e.attr.toDouble)
      // here, we define vertex with a Row that holds a Long data type and a Double data type
      // these would be sufficient to hold results of most graph algorithms
      .mapVertices[Row]((vid, data) => Row(0L, 0.0));

// print out the graph:
printEdges( graph1 )

23/06/02 17:40:06 INFO FileInputFormat: Total input files to process : 1
23/06/02 17:40:07 INFO GraphLoader: It took 1423 ms to load the edges


 (10,[0,0.0]) --[1.0]--> (20,[0,0.0])
 (20,[0,0.0]) --[1.0]--> (30,[0,0.0])
 (30,[0,0.0]) --[1.0]--> (10,[0,0.0])
 (70,[0,0.0]) --[1.0]--> (80,[0,0.0])
 (40,[0,0.0]) --[1.0]--> (50,[0,0.0])
 (50,[0,0.0]) --[1.0]--> (60,[0,0.0])
 (60,[0,0.0]) --[1.0]--> (20,[0,0.0])
 (80,[0,0.0]) --[1.0]--> (90,[0,0.0])
 (90,[0,0.0]) --[1.0]--> (70,[0,0.0])


graph1: Graph[Row, Double] = org.apache.spark.graphx.impl.GraphImpl@28da41f9
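As a rough illustration of the input format, the sketch below writes a small edge-list file and loads it. The file contents and the names `session`, `tmpFile` and `sample` are made up for this example, since the contents of `graph1_edgelist.txt` are not shown in this notebook. `GraphLoader.edgeListFile` expects one `sourceId destinationId` pair per line, skips lines that start with `#`, and sets every vertex and edge attribute to 1.

```scala
import java.nio.file.Files
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("edgelist-sketch")
  .getOrCreate()

// Hypothetical three-edge file; the comment line is ignored by the loader.
val tmpFile = Files.createTempFile("sample_edgelist", ".txt")
Files.write(tmpFile, "# src dst\n10 20\n20 30\n30 10\n".getBytes("UTF-8"))

// Loads a Graph[Int, Int]; vertices are created from the IDs seen in the edges.
val sample = GraphLoader.edgeListFile(session.sparkContext, tmpFile.toString)
println(s"edges = ${sample.numEdges}, vertices = ${sample.numVertices}")
```

Because the loader yields integer attributes, the cell above converts the edge attribute to `Double` and replaces the vertex attribute with a `Row` before the graph is used.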

## Creating graphs from vertex and edge data
Data can also be fed in via RDDs of edges and vertices:

In [16]:
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.makeRDD( Array(
        (3L, ("rxin", "student"))
      , (7L, ("jgonzal", "postdoc"))
      , (1L, ("somebody", "postdoc"))
      , (5L, ("franklin", "prof"))
      , (2L, ("istoica", "prof"))
      , (10L, ("hoityToity", "student"))
     )
   ).persist(StorageLevel.MEMORY_AND_DISK)

users: RDD[(VertexId, (String, String))] = ParallelCollectionRDD[23] at makeRDD at cmd15.sc:2

In [17]:
// Create an RDD for edges
val relationships: RDD[Edge[Row]] =
  sc.makeRDD(
      Array(
        Edge(3L, 7L,  /*"collab"   */ Row("collab", 0.0, 0L))
      , Edge(5L, 3L,  /*"advisor"  */ Row("advisor", 0.0, 0L))
      , Edge(2L, 5L,  /*"colleague"*/ Row("colleague", 0.0, 0L))
      , Edge(5L, 7L,  /*"advisor"  */ Row("advisor", 0.0, 0L))
      , Edge(10L, 5L, /*"friend"   */ Row("friend", 0.0, 0L))
      , Edge(10L, 1L, /*"friend"   */ Row("friend", 0.0, 0L))
      )
    ).persist(StorageLevel.MEMORY_AND_DISK)

relationships: RDD[Edge[Row]] = ParallelCollectionRDD[24] at makeRDD at cmd16.sc:2

In [19]:
// Define a default user for relationships that reference a missing user
val defaultUser = ("Jane", "Missing")

defaultUser: (String, String) = ("Jane", "Missing")

In [20]:
// Build the initial Graph
val graph2 = Graph(users, relationships, defaultUser)

graph2: Graph[(String, String), Row] = org.apache.spark.graphx.impl.GraphImpl@66bed1c7

## Generating artificial graphs

In [21]:
val graphGrid = GraphGenerators.gridGraph(sc, 4, 4)
val graphStar = GraphGenerators.starGraph(sc, 8)
val graphLogNormGen = GraphGenerators.logNormalGraph(sc, 10)

graphGrid: Graph[(Int, Int), Double] = org.apache.spark.graphx.impl.GraphImpl@71b1c65d
graphStar: Graph[Int, Int] = org.apache.spark.graphx.impl.GraphImpl@12d04e39
graphLogNormGen: Graph[Long, Int] = org.apache.spark.graphx.impl.GraphImpl@2a587bbd
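To get a feel for what the deterministic generators produce, the sketch below rebuilds the grid and star graphs and prints their sizes. The names `session`, `ctx`, `grid` and `star` are local to this example. `logNormalGraph` is left out because its edges are random.

```scala
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("generators-sketch")
  .getOrCreate()
val ctx = session.sparkContext

// 4 x 4 grid: each cell links to its neighbour in the next row and the next column.
val grid = GraphGenerators.gridGraph(ctx, 4, 4)
// Star with 8 vertices: vertices 1 to 7 each point at vertex 0.
val star = GraphGenerators.starGraph(ctx, 8)

println(s"grid: ${grid.numVertices} vertices, ${grid.numEdges} edges")
println(s"star: ${star.numVertices} vertices, ${star.numEdges} edges")
```

The grid has 16 vertices and 24 edges (12 between rows plus 12 between columns), and the star has 8 vertices and 7 edges.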

---

## View Basic GraphX Properties


In [22]:
def printGraphProperties( graph: Graph[_,_] ): Unit = {
    // graph operators:
    println( "Num of edges = " + graph.numEdges )
    println( "Num of vertices = " + graph.numVertices )
    // inDegrees, outDegrees and degrees only contain vertices whose respective degree is non-zero,
    // so the three counts below can be smaller than the number of vertices
    println( "Num of inDegrees = " + graph.inDegrees.count() )
    println( "Num of outDegrees = " + graph.outDegrees.count() )
    println( "Num of degrees = " + graph.degrees.count() )
}

defined function printGraphProperties

In [23]:
def printNeighbors(graph: Graph[_, _], edgeDirection: EdgeDirection): Unit = {
    graph.collectNeighborIds(edgeDirection).collect.foreach(
      x =>
        println("Neighbors of " + x._1 + " ("+ edgeDirection +") are: " + x._2.mkString(",") )
    );
}

defined function printNeighbors

In [24]:
printVertices( graph2 )

Vertex ID = 1: (somebody,postdoc)
Vertex ID = 7: (jgonzal,postdoc)
Vertex ID = 2: (istoica,prof)
Vertex ID = 3: (rxin,student)
Vertex ID = 10: (hoityToity,student)
Vertex ID = 5: (franklin,prof)


In [25]:
// print out the graph:
printEdges( graph2 )

 (3,(rxin,student)) --[[collab,0.0,0]]--> (7,(jgonzal,postdoc))
 (5,(franklin,prof)) --[[advisor,0.0,0]]--> (3,(rxin,student))
 (2,(istoica,prof)) --[[colleague,0.0,0]]--> (5,(franklin,prof))
 (5,(franklin,prof)) --[[advisor,0.0,0]]--> (7,(jgonzal,postdoc))
 (10,(hoityToity,student)) --[[friend,0.0,0]]--> (5,(franklin,prof))
 (10,(hoityToity,student)) --[[friend,0.0,0]]--> (1,(somebody,postdoc))


In [26]:
// print out basic properties:
printGraphProperties(graph1)

Num of edges = 9
Num of vertices = 9
Num of inDegrees = 8
Num of outDegrees = 9
Num of degrees = 9


In [27]:
// print out basic properties:
printGraphProperties(graph2)

Num of edges = 6
Num of vertices = 6
Num of inDegrees = 4
Num of outDegrees = 4
Num of degrees = 6
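The in-degree and out-degree counts above are lower than the vertex counts because `inDegrees` and `outDegrees` omit vertices whose degree is zero. The sketch below shows this on a small graph built for the purpose. The graph and the names `g`, `withIncoming` and `noIncoming` are assumptions of this example, not part of the notebook's data.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("degrees-sketch")
  .getOrCreate()
val ctx = session.sparkContext

// Hypothetical 4-vertex graph: 1 -> 2, 2 -> 3, 4 -> 2.
// Vertices 1 and 4 have no incoming edge.
val g = Graph.fromEdgeTuples(ctx.parallelize(Seq((1L, 2L), (2L, 3L), (4L, 2L))), 0)

// Number of vertices that appear in inDegrees (in-degree of at least 1).
val withIncoming = g.inDegrees.count()
// Vertex IDs that are missing from inDegrees.
val noIncoming = g.vertices.keys.subtract(g.inDegrees.keys).collect().sorted

println(s"vertices = ${g.numVertices}, with incoming edges = $withIncoming")
println("vertices without incoming edges: " + noIncoming.mkString(","))
```

The same reasoning explains the output for `graph1`: vertex 40 has no incoming edge, so 8 of its 9 vertices appear in `inDegrees`.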


## Graph Projections / Subsets

In [28]:
println( "Graph 1: Count all the edges where src > dst")
println( graph1.edges.filter(e => e.srcId > e.dstId).count )

Graph 1: Count all the edges where src > dst
3


In [29]:
println( "Graph LogNormGen: Count all the edges where src > dst")
println( graphLogNormGen.edges.filter(e => e.srcId > e.dstId).count )

Graph LogNormGen: Count all the edges where src > dst
28


In [30]:
println( "Graph 1: Reversed edge directions")
printEdges( graph1.reverse)

Graph 1: Reversed edge directions
 (10,[0,0.0]) --[1.0]--> (30,[0,0.0])
 (20,[0,0.0]) --[1.0]--> (10,[0,0.0])
 (30,[0,0.0]) --[1.0]--> (20,[0,0.0])
 (80,[0,0.0]) --[1.0]--> (70,[0,0.0])
 (20,[0,0.0]) --[1.0]--> (60,[0,0.0])
 (50,[0,0.0]) --[1.0]--> (40,[0,0.0])
 (60,[0,0.0]) --[1.0]--> (50,[0,0.0])
 (70,[0,0.0]) --[1.0]--> (90,[0,0.0])
 (90,[0,0.0]) --[1.0]--> (80,[0,0.0])


In [39]:
val graph2_nofriends = graph2.subgraph(epred = edgetriplet => edgetriplet.attr(0) != "friend")
printEdges(graph2_nofriends)

 (3,(rxin,student)) --[[collab,0.0,0]]--> (7,(jgonzal,postdoc))
 (5,(franklin,prof)) --[[advisor,0.0,0]]--> (3,(rxin,student))
 (2,(istoica,prof)) --[[colleague,0.0,0]]--> (5,(franklin,prof))
 (5,(franklin,prof)) --[[advisor,0.0,0]]--> (7,(jgonzal,postdoc))


graph2_nofriends: Graph[(String, String), Row] = org.apache.spark.graphx.impl.GraphImpl@bc62b1f

In [40]:
val graph2_friends = graph2.subgraph(epred = edgetriplet => edgetriplet.attr(0) == "friend")
printEdges(graph2_friends)

 (10,(hoityToity,student)) --[[friend,0.0,0]]--> (5,(franklin,prof))
 (10,(hoityToity,student)) --[[friend,0.0,0]]--> (1,(somebody,postdoc))


graph2_friends: Graph[(String, String), Row] = org.apache.spark.graphx.impl.GraphImpl@6fe343f1

In [41]:
val graph3 = graph1.subgraph(vpred = (id, attr) => id > 10)
println( "From Graph 1, a subgraph of the vertices with id > 10 (edges touching vertex 10 are dropped)")
printEdges(graph3)

From Graph 1, a subgraph of the vertices with id > 10 (edges touching vertex 10 are dropped)
 (20,[0,0.0]) --[1.0]--> (30,[0,0.0])
 (70,[0,0.0]) --[1.0]--> (80,[0,0.0])
 (40,[0,0.0]) --[1.0]--> (50,[0,0.0])
 (50,[0,0.0]) --[1.0]--> (60,[0,0.0])
 (60,[0,0.0]) --[1.0]--> (20,[0,0.0])
 (80,[0,0.0]) --[1.0]--> (90,[0,0.0])
 (90,[0,0.0]) --[1.0]--> (70,[0,0.0])


graph3: Graph[Row, Double] = org.apache.spark.graphx.impl.GraphImpl@6a44f5b1

In [42]:
val graph5 = graph1.mask(graph3)
printEdges(graph5)

 (20,[0,0.0]) --[1.0]--> (30,[0,0.0])
 (70,[0,0.0]) --[1.0]--> (80,[0,0.0])
 (40,[0,0.0]) --[1.0]--> (50,[0,0.0])
 (50,[0,0.0]) --[1.0]--> (60,[0,0.0])
 (60,[0,0.0]) --[1.0]--> (20,[0,0.0])
 (80,[0,0.0]) --[1.0]--> (90,[0,0.0])
 (90,[0,0.0]) --[1.0]--> (70,[0,0.0])


graph5: Graph[Row, Double] = org.apache.spark.graphx.impl.GraphImpl@4051f81c

In [43]:
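// groupEdges merges parallel edges (same source and destination); graph2 has none,
// so the merge function is never called and the edges come back unchanged
val graph6 = graph2.groupEdges( (x, y) => Row("friend"))
printEdges(graph6)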

 (3,(rxin,student)) --[[collab,0.0,0]]--> (7,(jgonzal,postdoc))
 (5,(franklin,prof)) --[[advisor,0.0,0]]--> (3,(rxin,student))
 (2,(istoica,prof)) --[[colleague,0.0,0]]--> (5,(franklin,prof))
 (5,(franklin,prof)) --[[advisor,0.0,0]]--> (7,(jgonzal,postdoc))
 (10,(hoityToity,student)) --[[friend,0.0,0]]--> (5,(franklin,prof))
 (10,(hoityToity,student)) --[[friend,0.0,0]]--> (1,(somebody,postdoc))


graph6: Graph[(String, String), Row] = org.apache.spark.graphx.impl.GraphImpl@2e055148
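`groupEdges` only has a visible effect when a graph contains parallel edges, and the GraphX documentation states that the graph must first be repartitioned with `partitionBy` for correct results. The sketch below uses a made-up multigraph with numeric weights; the names `multigraph`, `merged` and `weights` belong to this example only.

```scala
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("groupedges-sketch")
  .getOrCreate()
val ctx = session.sparkContext

// Hypothetical multigraph: two parallel edges 1 -> 2 and one edge 2 -> 3.
val edges = ctx.parallelize(Seq(
  Edge(1L, 2L, 1.0),
  Edge(1L, 2L, 2.5),
  Edge(2L, 3L, 1.0)
))
val multigraph = Graph.fromEdges(edges, 0)

// partitionBy co-locates parallel edges so that groupEdges can merge them;
// here the weights of merged edges are summed.
val merged = multigraph
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .groupEdges(_ + _)

val weights = merged.edges.map(e => ((e.srcId, e.dstId), e.attr)).collect().toMap
println(weights)
```

After merging, the two edges from 1 to 2 collapse into one edge with weight 3.5, while the edge from 2 to 3 keeps its weight of 1.0.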

In [44]:
printNeighbors(graph6, EdgeDirection.Out)

Neighbors of 1 (EdgeDirection.Out) are: 
Neighbors of 7 (EdgeDirection.Out) are: 
Neighbors of 2 (EdgeDirection.Out) are: 5
Neighbors of 3 (EdgeDirection.Out) are: 7
Neighbors of 10 (EdgeDirection.Out) are: 5,1
Neighbors of 5 (EdgeDirection.Out) are: 3,7


In [45]:
printNeighbors(graph2, EdgeDirection.In)

Neighbors of 1 (EdgeDirection.In) are: 10
Neighbors of 7 (EdgeDirection.In) are: 3,5
Neighbors of 2 (EdgeDirection.In) are: 
Neighbors of 3 (EdgeDirection.In) are: 5
Neighbors of 10 (EdgeDirection.In) are: 
Neighbors of 5 (EdgeDirection.In) are: 2,10


## Update Vertex data of a graph

External data can be attached to the vertices by joining an RDD keyed by vertex ID with the graph.

In [46]:
printEdges(graph1)

 (10,[0,0.0]) --[1.0]--> (20,[0,0.0])
 (20,[0,0.0]) --[1.0]--> (30,[0,0.0])
 (30,[0,0.0]) --[1.0]--> (10,[0,0.0])
 (70,[0,0.0]) --[1.0]--> (80,[0,0.0])
 (40,[0,0.0]) --[1.0]--> (50,[0,0.0])
 (50,[0,0.0]) --[1.0]--> (60,[0,0.0])
 (60,[0,0.0]) --[1.0]--> (20,[0,0.0])
 (80,[0,0.0]) --[1.0]--> (90,[0,0.0])
 (90,[0,0.0]) --[1.0]--> (70,[0,0.0])


In [47]:
val labelsRDD = sc.makeRDD(
    Seq(
        (10L,100.0),
        (20L,200.0),
        (30L,300.0),
        (70L,700.0),
        (40L,400.0),
        (50L,500.0),
        (60L,600.0),
        (80L,800.0),
        (90L,900.0),
    )
)

labelsRDD: RDD[(Long, Double)] = ParallelCollectionRDD[208] at makeRDD at cmd46.sc:1

In [48]:
// Join the labels RDD with the graph: keep the first field of each vertex Row and store the label in the second
val graph7 = graph1.joinVertices(labelsRDD)( (vid: VertexId, vd: Row, label: Double) => Row(vd(0), label) )

printEdges(graph7)

 (10,[0,100.0]) --[1.0]--> (20,[0,200.0])
 (20,[0,200.0]) --[1.0]--> (30,[0,300.0])
 (30,[0,300.0]) --[1.0]--> (10,[0,100.0])
 (70,[0,700.0]) --[1.0]--> (80,[0,800.0])
 (40,[0,400.0]) --[1.0]--> (50,[0,500.0])
 (50,[0,500.0]) --[1.0]--> (60,[0,600.0])
 (60,[0,600.0]) --[1.0]--> (20,[0,200.0])
 (80,[0,800.0]) --[1.0]--> (90,[0,900.0])
 (90,[0,900.0]) --[1.0]--> (70,[0,700.0])


graph7: Graph[Row, Double] = org.apache.spark.graphx.impl.GraphImpl@7ae9bf2c

In [None]:
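// outerJoinVertices may change the vertex type and passes None for vertices that have no entry in the RDD
val graph8 = graph1.outerJoinVertices(labelsRDD)(
      (vid: VertexId, vd: Row, label: Option[Double]) => Row(vd(0), label.getOrElse(0.0)))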
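The difference between the two join operators shows up when the RDD does not cover every vertex. The following is a minimal sketch on made-up data: the graph `g`, the partial `labels` RDD and the `"unlabelled"` placeholder are assumptions of this example.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("joins-sketch")
  .getOrCreate()
val ctx = session.sparkContext

// Hypothetical graph 1 -> 2 -> 3 whose vertices all start with the value 0.0.
val g = Graph.fromEdgeTuples(ctx.parallelize(Seq((1L, 2L), (2L, 3L))), 0.0)
// Labels for vertices 1 and 2 only; vertex 3 has no entry.
val labels = ctx.parallelize(Seq((1L, 100.0), (2L, 200.0)))

// joinVertices: same vertex type, unmatched vertices keep their old value.
val joined = g.joinVertices(labels)((_, _, label) => label)
// outerJoinVertices: new vertex type (String), unmatched vertices receive None.
val outer = g.outerJoinVertices(labels)(
  (_, _, label) => label.map(_.toString).getOrElse("unlabelled"))

val joinedMap = joined.vertices.collect().toMap
val outerMap = outer.vertices.collect().toMap
println(joinedMap)
println(outerMap)
```

With `joinVertices`, vertex 3 keeps its original 0.0; with `outerJoinVertices`, the mapping function decides what a missing label becomes.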

## Extracting vertex data from graph

In [61]:
val graph7vertices = graph7.vertices.map( x => (x._1, x._2(0), x._2(1)) )

graph7vertices: RDD[(VertexId, Any, Any)] = MapPartitionsRDD[229] at map at cmd60.sc:1

In [62]:
graph7vertices.collect()

res61: Array[(VertexId, Any, Any)] = Array(
  (80L, 0L, 800.0),
  (30L, 0L, 300.0),
  (50L, 0L, 500.0),
  (40L, 0L, 400.0),
  (90L, 0L, 900.0),
  (70L, 0L, 700.0),
  (20L, 0L, 200.0),
  (60L, 0L, 600.0),
  (10L, 0L, 100.0)
)

In [68]:
//val graph7_df = spark.createDataFrame(graph7.vertices.map( x => (x._1, x._2(0), x._2(1)) )).toDF(Seq("Vertex_id", "col2", "col3"):_*)
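The commented-out conversion most likely fails because `x._2(0)` and `x._2(1)` have the static type `Any`, for which Spark cannot derive a column type. One way around this, sketched below under the assumption that each vertex `Row` holds a `Long` followed by a `Double`, is to read the fields with typed getters before calling `toDF`. The graph `g` and the column names are invented for this example.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.{Row, SparkSession}

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("vertices-to-df-sketch")
  .getOrCreate()
// Needed for toDF; drop this line if spark.implicits._ is already in scope.
import session.implicits._

// Hypothetical graph whose vertex attribute is Row(Long, Double), as in graph1.
val g = Graph.fromEdgeTuples(
  session.sparkContext.parallelize(Seq((10L, 20L), (20L, 30L))),
  Row(0L, 0.0))

// Typed getters give Spark concrete column types (Long, Long, Double).
val verticesDf = g.vertices
  .map { case (id, attr) => (id, attr.getLong(0), attr.getDouble(1)) }
  .toDF("vertex_id", "label", "score")

verticesDf.printSchema()
```

The typed getters throw if a field holds a different type, which surfaces schema problems early instead of producing an untyped column.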

---

## Saving graphs as object files
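The vertex and edge RDDs of a graph can be saved to disk.

`saveAsObjectFile` writes each RDD as a Hadoop SequenceFile of Java-serialized objects, as the log output below confirms. (Parquet with Snappy compression is the default for the DataFrame writer, not for this method.)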

In [70]:
// Save vertices RDD to disk
graph1.vertices.saveAsObjectFile("graph1_vertices.obj")

23/06/02 18:20:50 INFO SequenceFileRDDFunctions: Saving as sequence file of type (NullWritable,BytesWritable)
23/06/02 18:20:50 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
23/06/02 18:20:50 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:20:50 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:20:50 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:20:50 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:20:51 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:20:51 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:20:51 INFO Fil

In [71]:
// then, save edges RDD to disk
graph1.edges.saveAsObjectFile("graph1_edges.obj")

23/06/02 18:20:54 INFO SequenceFileRDDFunctions: Saving as sequence file of type (NullWritable,BytesWritable)
23/06/02 18:20:54 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:20:54 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:20:54 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:20:54 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:20:54 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:20:54 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:20:54 INFO FileOutputCommitter: Saved output of task 'attempt_202306021820544756191222913200626_0248_m_000000_0' to file:/mnt/src/spark_proj
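Saved object files can be read back with `SparkContext.objectFile` and reassembled into a graph. The round trip below is a sketch on a made-up graph written to a temporary directory; the paths and the names `original` and `restored` are assumptions of this example.

```scala
import java.nio.file.Files
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("objectfile-sketch")
  .getOrCreate()
val ctx = session.sparkContext

// Hypothetical Graph[Int, Int]: 1 -> 2 -> 3.
val original = Graph.fromEdgeTuples(ctx.parallelize(Seq((1L, 2L), (2L, 3L))), 0)

// Hypothetical output location; the target directories must not exist yet.
val dir = Files.createTempDirectory("graph_objfile_sketch")
val vertexPath = dir.resolve("vertices.obj").toString
val edgePath = dir.resolve("edges.obj").toString
original.vertices.saveAsObjectFile(vertexPath)
original.edges.saveAsObjectFile(edgePath)

// The element types must be spelled out, because the files carry no schema.
val vertices = ctx.objectFile[(VertexId, Int)](vertexPath)
val edges = ctx.objectFile[Edge[Int]](edgePath)
val restored = Graph(vertices, edges)

println(s"restored ${restored.numVertices} vertices and ${restored.numEdges} edges")
```

Since the format relies on Java serialization, the files are best treated as a Spark-internal checkpoint rather than an exchange format.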

## Saving graph to JSON format

In [72]:
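// A new ObjectMapper is built for every vertex here, which is simple but slow;
// the next two cells build one mapper per partition with mapPartitions instead.
graph2.vertices.map(x => {
    val mapper = new com.fasterxml.jackson.databind.ObjectMapper()
    mapper.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule)
    val writer = new java.io.StringWriter()
    mapper.writeValue(writer, x)
    writer.toString
}).coalesce(1, true).saveAsTextFile("graph2_vertices_json")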

23/06/02 18:20:58 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:20:58 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:20:58 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:20:58 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:20:58 INFO FileOutputCommitter: Saved output of task 'attempt_202306021820582503446850220815044_0254_m_000000_0' to file:/mnt/src/spark_projs/graphx-algorithms/examples/graph2_vertices_json/_temporary/0/task_202306021820582503446850220815044_0254_m_000000


In [73]:
graph2.edges.mapPartitions(edges => {
    // one mapper and one reusable buffer per partition
    val mapper = new com.fasterxml.jackson.databind.ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    val writer = new java.io.StringWriter()
    edges.map(e => {
        writer.getBuffer.setLength(0)   // clear the buffer before each record
        mapper.writeValue(writer, e)
        writer.toString
    })
}).coalesce(1, true).saveAsTextFile("graph2_edges_json")

23/06/02 18:21:00 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:21:00 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:21:01 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:21:01 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:21:01 INFO FileOutputCommitter: Saved output of task 'attempt_202306021821004346008190923540698_0260_m_000000_0' to file:/mnt/src/spark_projs/graphx-algorithms/examples/graph2_edges_json/_temporary/0/task_202306021821004346008190923540698_0260_m_000000


In [74]:
graph3.vertices.mapPartitions(vertices => {
    // one mapper and one reusable buffer per partition
    val mapper = new com.fasterxml.jackson.databind.ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    val writer = new java.io.StringWriter()
    vertices.map(v => {
        writer.getBuffer.setLength(0)   // clear the buffer before each record
        mapper.writeValue(writer, v)
        writer.toString
    })
}).coalesce(1, true).saveAsTextFile("graph3_vertices_json")

23/06/02 18:21:08 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:21:08 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:21:08 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/06/02 18:21:08 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/06/02 18:21:08 INFO FileOutputCommitter: Saved output of task 'attempt_202306021821085844259637603774879_0266_m_000000_0' to file:/mnt/src/spark_projs/graphx-algorithms/examples/graph3_vertices_json/_temporary/0/task_202306021821085844259637603774879_0266_m_000000
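The cells above serialize tuples and edges as they are, so the JSON carries no descriptive field names for the vertex data. If named fields are wanted, one option is to map each record to a `Map` before serializing. The sketch below does this for a single vertex outside Spark; the field names `id`, `name` and `role` are hypothetical.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)

// One vertex shaped like the entries of the users RDD.
val vertex: (Long, (String, String)) = (3L, ("rxin", "student"))

// Hypothetical field names; a Scala Map is written as a JSON object.
val record = Map[String, Any](
  "id" -> vertex._1,
  "name" -> vertex._2._1,
  "role" -> vertex._2._2
)

val json = mapper.writeValueAsString(record)
println(json)
```

On an RDD, the same mapping would go inside the `mapPartitions` function so that the mapper is still created once per partition.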


## Saving graph to GEXF (Gephi) format

In [46]:
def collectAsGexf[VD,ED](g:Graph[VD,ED]):String = {
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
    "<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n" +
    " <graph mode=\"static\" defaultedgetype=\"directed\">\n" +
    " <nodes>\n" +
    g.vertices.map(v => " <node id=\"" + v._1 + "\" label=\"" +
    v._2 + "\" />\n").collect.mkString +
    " </nodes>\n" +
    " <edges>\n" +
    g.edges.map(e => " <edge source=\"" + e.srcId +
    "\" target=\"" + e.dstId + "\" label=\"" + e.attr +
    "\" />\n").collect.mkString +
    " </edges>\n" +
    " </graph>\n" +
    "</gexf>"
}

defined function collectAsGexf

In [45]:
val pw = new java.io.PrintWriter("graph2.gexf")
pw.write(collectAsGexf(graph2))
pw.close

pw: java.io.PrintWriter = java.io.PrintWriter@f928bc1
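`collectAsGexf` inserts vertex and edge attributes into XML attribute values without escaping them, so a label containing `&`, `<` or a double quote would yield a malformed file. A small escaping helper, such as the hypothetical `escapeXml` sketched below, could be applied to `v._2` and `e.attr` (after converting them to strings) before they are concatenated.

```scala
// Hypothetical helper: replaces the five characters that are special in XML attribute values.
def escapeXml(text: String): String = {
  val sb = new StringBuilder
  text.foreach {
    case '&'  => sb.append("&amp;")
    case '<'  => sb.append("&lt;")
    case '>'  => sb.append("&gt;")
    case '"'  => sb.append("&quot;")
    case '\'' => sb.append("&apos;")
    case c    => sb.append(c)
  }
  sb.toString
}

// Made-up label containing characters that would break the GEXF output.
val label = escapeXml("""R&D <"lab">""")
println(label)
```

The ampersand needs no special ordering here because each input character is translated exactly once.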

---

### Stop the Spark Session

In [75]:
spark.stop()

23/06/02 18:22:12 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040


In [76]:
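// spark.stop() above already stops the underlying SparkContext, so this call is redundant but harmless
sc.stop()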

### End of file