## Original Graph

In [7]:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD


val v = sc.parallelize(Array((1L,1),(2L,2),(3L,3),(4L,4),(5L,5)))
val e = sc.parallelize(Array(Edge(1,2,3),Edge(2,3,1),Edge(3,4,1)))

val g: Graph[Int,Int] = Graph(v,e)

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
v: org.apache.spark.rdd.RDD[(Long, Int)] = ParallelCollectionRDD[30] at parallelize at <console>:36
e: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Int]] = ParallelCollectionRDD[31] at parallelize at <console>:37
g: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@5d2ea


In [8]:
g.vertices.collect

res4: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((4,4), (1,1), (5,5), (2,2), (3,3))


In [9]:
g.edges.collect

res5: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(1,2,3), Edge(2,3,1), Edge(3,4,1))


## Create a subgraph

#### Criterion 1
<li>choose edges with an attribute value greater than 1</li>
<li>all vertices are selected (no vertex condition)</li>
<li>only Edge(1,2,3) has attr > 1, so it is the only edge kept (the other two edges are dropped)</li>

In [10]:
// choose all vertices
val verts = g.subgraph(e=>e.attr>1,(v_id,v_attr) => true).vertices.collect
// choose edges with attribute value greater than 1 ==> one edge left: (1,2,3)
val edgs = g.subgraph(e=>e.attr>1,(v_id,v_attr) => true).edges.collect

verts: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((4,4), (1,1), (5,5), (2,2), (3,3))
edgs: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(1,2,3))
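
The cell above builds the same subgraph twice and passes an always-true vertex predicate by hand. As a minimal sketch, the GraphX `subgraph` method can also be called with named arguments (`epred`, `vpred`), and an omitted `vpred` defaults to accepting every vertex. The names `ctx`, `graph`, `sub`, `subVerts` and `subEdges` are illustrative, and the local master setting is an assumption so that the snippet also runs outside the notebook.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuse the running context if there is one; otherwise start a local one (assumption).
val ctx = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("subgraph-sketch"))

// Same toy graph as in the notebook.
val vertices = ctx.parallelize(Seq((1L, 1), (2L, 2), (3L, 3), (4L, 4), (5L, 5)))
val edges = ctx.parallelize(Seq(Edge(1L, 2L, 3), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
val graph: Graph[Int, Int] = Graph(vertices, edges)

// vpred is omitted, so its default (accept every vertex) applies.
// The subgraph is built once and cached because it is used twice below.
val sub = graph.subgraph(epred = triplet => triplet.attr > 1).cache()

val subVerts = sub.vertices.collect().sortBy(_._1) // sorted by vertex id for readability
val subEdges = sub.edges.collect()
```

Building the subgraph once keeps the two `collect()` calls consistent with each other and avoids repeating the predicates.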


#### Criterion 2
<li>choose vertices with attr greater than 2</li>
<li>vertices 3,4,5 are included</li>
<li>even though all three edges satisfy the edge criterion, <b>only Edge(3,4,1) survives</b> because each of the other two edges has at least one vertex that is not in the vertex set</li>

In [11]:
// choose vertices with attribute value greater than 2 ==> only 3,4,5 are included
val verts = g.subgraph(e=>true,(v_id,v_attr) => v_attr>2).vertices.collect
// the edge predicate accepts every edge, but only Edge(3,4,1) is included,
// because each of the other two edges has at least one vertex that is not in the vertex set;
// Edge(1,2,3) and Edge(2,3,1) are therefore filtered out here
val edgs = g.subgraph(e=>true,(v_id,v_attr) => v_attr>2).edges.collect

verts: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((4,4), (5,5), (3,3))
edgs: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(3,4,1))


#### Combining the two
<li>Vertices 3,4,5 are included, as in Criterion 2</li>
<li>No edge survives: Edge(1,2,3) and Edge(2,3,1) each have at least one vertex outside the final vertex set, and Edge(3,4,1) does not satisfy the edge condition</li>

In [12]:
// Edge(3,4,1) is also gone, since the edge attribute value is not greater than 1
val verts = g.subgraph(e=>e.attr>1,(v_id,v_attr) => v_attr>2).vertices.collect
val edgs = g.subgraph(e=>e.attr>1,(v_id,v_attr) => v_attr>2).edges.collect

verts: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((4,4), (5,5), (3,3))
edgs: Array[org.apache.spark.graphx.Edge[Int]] = Array()


<h3>Bottom line</h3>
<li>Edges in the subgraph must satisfy the <b>edge condition</b> AND <b>both vertices in the edge must satisfy the vertex condition</b></li>
<li>Vertices in the subgraph must satisfy the <b>vertex condition</b></li>
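
The rule above can be sketched with plain Scala collections, without Spark. This is only a model of the filtering logic, not how GraphX implements `subgraph`; `SimpleEdge`, `modelSubgraph`, `edgeOk` and `vertexOk` are hypothetical names introduced for illustration.

```scala
// Plain-Scala model of the subgraph rule; all names here are illustrative.
final case class SimpleEdge(src: Long, dst: Long, attr: Int)

def modelSubgraph(
    vertices: Map[Long, Int],
    edges: Seq[SimpleEdge],
    edgeOk: SimpleEdge => Boolean,
    vertexOk: (Long, Int) => Boolean
): (Map[Long, Int], Seq[SimpleEdge]) = {
  // Vertices: keep those that satisfy the vertex condition.
  val keptVertices = vertices.filter { case (id, attr) => vertexOk(id, attr) }
  // Edges: keep those that satisfy the edge condition AND have both endpoints kept.
  val keptEdges = edges.filter { e =>
    edgeOk(e) && keptVertices.contains(e.src) && keptVertices.contains(e.dst)
  }
  (keptVertices, keptEdges)
}

// Same toy graph as in the notebook.
val vs = Map(1L -> 1, 2L -> 2, 3L -> 3, 4L -> 4, 5L -> 5)
val es = Seq(SimpleEdge(1, 2, 3), SimpleEdge(2, 3, 1), SimpleEdge(3, 4, 1))

val (_, c1) = modelSubgraph(vs, es, _.attr > 1, (_, _) => true)           // Criterion 1
val (_, c2) = modelSubgraph(vs, es, _ => true, (_, attr) => attr > 2)     // Criterion 2
val (v3, c3) = modelSubgraph(vs, es, _.attr > 1, (_, attr) => attr > 2)   // both combined
```

Under this model, `c1` holds only the edge (1,2,3), `c2` holds only the edge (3,4,1), and `c3` is empty while `v3` still contains vertices 3, 4 and 5, which matches the outputs of the three cells above.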