## GraphX: graph processing

Parallel graph programming with Spark

-   Main abstraction: [*Graph*](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Graph)
    -   Directed multigraph with properties attached to vertices and edges
    -   An extension of RDDs
-   Includes graph builders, basic operators (*reverse*, *subgraph*…), and graph algorithms (*PageRank*, *Triangle Counting*…)
-   Currently not available in PySpark (Scala only)
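
The basic operators named above can be tried on a toy graph. The following is a minimal sketch, not part of the original notebook: the graph, the names `toyGraph`, `reversed`, and `followsOnly`, and the edge labels are hypothetical, and it assumes either an existing notebook `SparkContext` or a local one created on the spot.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Assumption: reuse the running context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-basic-operators"))

// Hypothetical toy graph: vertices carry a label, edges carry a relation name.
val toyVertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val toyEdges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "blocks")))
val toyGraph = Graph(toyVertices, toyEdges)

// reverse: same vertices and edge attributes, with every edge direction flipped.
val reversed = toyGraph.reverse
val reversedPairs = reversed.edges.map(e => (e.srcId, e.dstId)).collect().toSet

// subgraph: keep only the edges whose triplet satisfies the predicate.
val followsOnly = toyGraph.subgraph(epred = t => t.attr == "follows")
val followsCount = followsOnly.edges.count()

println(s"Reversed edges: $reversedPairs")
println(s"Edges labelled 'follows': $followsCount")
```

Both operators return a new `Graph`; the original `toyGraph` is left unchanged, as with any RDD transformation.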

Documentation: [spark.apache.org/docs/latest/graphx-programming-guide.html](http://spark.apache.org/docs/latest/graphx-programming-guide.html)

### Graphs in GraphX
<img src="http://localhost:8085/figs/grapxgraph.png" alt="Graph in GraphX" style="width: 600px;"/>
(Source: M.S. Malak, R. East, "Spark GraphX in Action", Manning, 2016)

Example of a simple graph
<img src="http://localhost:8085/figs/simpsonsgraph.png" alt="The Simpsons graph" style="width: 600px;"/>
(Source: P. Zecević, M. Bonaći, "Spark in Action", Manning, 2017)

In [3]:
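import org.apache.spark.graphx._

// Vertex property: each person has a name and an age
case class Person(name: String, age: Int)

// Vertices are (VertexId, property) pairs
val vertices = sc.parallelize(Array((1L, Person("Homer", 39)),
                                    (2L, Person("Marge", 39)),
                                    (3L, Person("Bart", 12)),
                                    (4L, Person("Milhouse", 12))))

// Edges are Edge(srcId, dstId, property); the property is the relationship
// (amigo = friend, padre = father, madre = mother, casadoCon = married to)
val aristas = sc.parallelize(Array(Edge(4L, 3L, "amigo"),
                                   Edge(3L, 1L, "padre"),
                                   Edge(3L, 2L, "madre"),
                                   Edge(1L, 2L, "casadoCon")))

val graph = Graph(vertices, aristas)

graph.vertices.count()
graph.edges.count()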
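
Counting vertices and edges says little about how the properties are attached. As a hedged illustration, the sketch below rebuilds the same Simpsons graph and reads it through the `triplets` view, which pairs each edge with the properties of its two endpoints. The names `people`, `relations`, `family`, `sentences`, and `adults` are hypothetical, and the context setup is an assumption for running the block outside the notebook.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Assumption: reuse the running context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-triplets"))

case class Person(name: String, age: Int)

// Same data as the cell above, under hypothetical names.
val people = sc.parallelize(Seq(
  (1L, Person("Homer", 39)), (2L, Person("Marge", 39)),
  (3L, Person("Bart", 12)), (4L, Person("Milhouse", 12))))
val relations = sc.parallelize(Seq(
  Edge(4L, 3L, "amigo"), Edge(3L, 1L, "padre"),
  Edge(3L, 2L, "madre"), Edge(1L, 2L, "casadoCon")))
val family = Graph(people, relations)

// Each triplet exposes srcAttr, attr (the edge property), and dstAttr.
val sentences = family.triplets
  .map(t => s"${t.srcAttr.name} -[${t.attr}]-> ${t.dstAttr.name}")
  .collect()
sentences.foreach(println)

// graph.vertices is an RDD of (VertexId, Person), so ordinary RDD operations apply.
val adults = family.vertices
  .filter { case (_, person) => person.age >= 18 }
  .map { case (_, person) => person.name }
  .collect()
  .toSet
println(s"Adults: $adults")
```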

In [4]:
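// Load the citation file, skip the header line, and parse each line
// into a (citing, cited) pair of patent IDs
val rdd = sc.textFile("../datos/cite75_99.txt")
            .filter(l => !l.startsWith("\"CITING\""))
            .map(l => {
                val spl = l.split(",")
                (spl(0).toLong, spl(1).toLong)})
println(rdd.count())

// Keep only the citations made by patents with IDs in [3000000, 4000000)
val filtered = rdd.filter(p => p._1 >= 3000000 && p._1 < 4000000)
println(filtered.count())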

In [5]:
// Build the graph from (source, destination) pairs; every vertex gets the default attribute 0
val citeGraph = Graph.fromEdgeTuples(filtered, 0)

println("Number of vertices = " + citeGraph.numVertices)
println("Number of edges = " + citeGraph.numEdges)
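
Once a graph has been built with `Graph.fromEdgeTuples`, degree information is a cheap first look at its structure. The sketch below is an illustration under stated assumptions: the citation pairs are made up (the real file is not needed), and `citations`, `smallGraph`, and `mostCited` are hypothetical names. In a citation graph, the in-degree of a vertex is the number of times that patent is cited.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Assumption: reuse the running context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-indegrees"))

// Made-up (citing, cited) pairs standing in for the patent data.
val citations = sc.parallelize(Seq(
  (10L, 1L), (11L, 1L), (12L, 1L),
  (11L, 2L), (12L, 2L),
  (12L, 10L)))

// Vertices are created from the IDs seen in the pairs, all with attribute 0.
val smallGraph = Graph.fromEdgeTuples(citations, 0)

// inDegrees only contains vertices with at least one incoming edge.
val mostCited = smallGraph.inDegrees
  .top(3)(Ordering.by[(VertexId, Int), Int](_._2))

mostCited.foreach { case (id, n) => println(s"Patent $id is cited $n times") }
```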

In [6]:
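// Run PageRank until convergence with a tolerance of 0.1
val rank = citeGraph.pageRank(0.1)

// Order (vertexId, rank) pairs by their rank
val orden = new Ordering[Tuple2[VertexId, Double]] {
    def compare(x: Tuple2[VertexId, Double], y: Tuple2[VertexId, Double]): Int =
        x._2.compareTo(y._2)
}

// The 10 vertices with the highest rank, in descending order
val top10 = rank.vertices.top(10)(orden)

println("Patent with the highest rank = " + top10(0)._1 + " (rank = " + top10(0)._2 + ")")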
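
The argument passed to `pageRank` above is a convergence tolerance, not an iteration count. As a hedged sketch on a made-up star graph (the names `links`, `star`, `byTolerance`, and `byIterations` are hypothetical), the block below compares `pageRank(tol)`, which iterates until the ranks change by less than `tol`, with `staticPageRank(numIter)`, which runs a fixed number of iterations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Assumption: reuse the running context if there is one, otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-pagerank"))

// Made-up star graph: vertices 2, 3, and 4 all point to vertex 1.
val links = sc.parallelize(Seq((2L, 1L), (3L, 1L), (4L, 1L)))
val star = Graph.fromEdgeTuples(links, 0)

// Explicit type parameters keep the ordering unambiguous for the compiler.
val byRank = Ordering.by[(VertexId, Double), Double](_._2)

// Iterate until the ranks change by less than the tolerance.
val byTolerance = star.pageRank(0.001).vertices.max()(byRank)

// Run exactly 10 iterations, regardless of convergence.
val byIterations = star.staticPageRank(10).vertices.max()(byRank)

println(s"Top vertex (tolerance): ${byTolerance._1}")
println(s"Top vertex (10 iterations): ${byIterations._1}")
```

A smaller tolerance gives more accurate ranks at the cost of more iterations; on a graph as large as the patent data, a loose value such as 0.1 keeps the run time manageable.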