### grp

# Spark: The Definitive Guide

## PART 6: Advanced Analytics and Machine Learning 

## dataPaths

In [1]:
bikes = '/Users/grp/sparkTheDefinitiveGuide/data/bike-data/201508_station_data.csv'
trips = '/Users/grp/sparkTheDefinitiveGuide/data/bike-data/201508_trip_data.csv'

## _Chapter #30 - Graph Analytics_

-  graphs describe relationships among objects
-  graphs are a logical representation of data
-  edges and vertices in graphs can have data (weights) associated with them

### Graph Structure:
-  nodes [vertices] (objects)
-  edges (define relationships between nodes)

### Graph Types:
-  undirected:
    -  edges do not have a specified "start" and "end" vertex (unbounded linkage)
-  directed:
    -  edges have a directional "start" location and "end" location (bounded linkage)
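
The difference between the two graph types can be sketched in plain Python (this is not GraphFrames code). The vertex names and the `neighbors` helper are hypothetical and exist only for illustration: a directed edge is stored once, while an undirected edge is stored in both directions.

```python
from collections import defaultdict

# Hypothetical edge list: each pair is (start vertex, end vertex).
edges = [("A", "B"), ("B", "C")]

def neighbors(edge_list, directed):
    """Build an adjacency map; undirected edges are stored in both directions."""
    adjacency = defaultdict(set)
    for start, end in edge_list:
        adjacency[start].add(end)
        if not directed:
            adjacency[end].add(start)
    return dict(adjacency)

directed_graph = neighbors(edges, directed=True)
undirected_graph = neighbors(edges, directed=False)

print(directed_graph)
print(undirected_graph)
```

In the directed version `C` has no outgoing neighbors, while in the undirected version `C` can reach `B`.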

### Use Cases:
-  social network
-  airline connections
-  hierarchy data
-  Google's PageRank algorithm

### Spark Models:
-  GraphFrames (structured DF API)
-  GraphX (unstructured RDD API)

### GraphFrames vs Graph Databases:
-  Spark is a distributed computing engine that does not store data or perform transactional processing
-  GraphFrames are better suited for larger analytics workloads like ETL, querying, and modeling
-  Graph Databases are better suited for supporting transactional processing and server-side scaling

### _Chapter #30 Exercises (Graph)_

### _Terminal Package Example_

In [2]:
'''
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
'''

'\npyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11\n'

### _Graph Build Example_

In [3]:
# vertices
bikeStations = spark.read.option("header","true").csv(bikes)
bikeStations.printSchema()
bikeStations.show(3)

# edges
tripData = spark.read.option("header","true").csv(trips)
tripData.printSchema()
tripData.show(3, True)

root
 |-- station_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- long: string (nullable = true)
 |-- dockcount: string (nullable = true)
 |-- landmark: string (nullable = true)
 |-- installation: string (nullable = true)

+----------+--------------------+---------+-----------+---------+--------+------------+
|station_id|                name|      lat|       long|dockcount|landmark|installation|
+----------+--------------------+---------+-----------+---------+--------+------------+
|         2|San Jose Diridon ...|37.329732|-121.901782|       27|San Jose|    8/6/2013|
|         3|San Jose Civic Ce...|37.330698|-121.888979|       15|San Jose|    8/5/2013|
|         4|Santa Clara at Al...|37.333988|-121.894902|       11|San Jose|    8/6/2013|
+----------+--------------------+---------+-----------+---------+--------+------------+
only showing top 3 rows

root
 |-- Trip ID: string (nullable = true)
 |-- Duration: string (nullable = t

In [4]:
# GraphFrame expects an 'id' column on the vertices and 'src'/'dst' columns on the edges
stationVertices = bikeStations.withColumnRenamed("name", "id").distinct()
tripEdges = tripData.withColumnRenamed("Start Station", "src").withColumnRenamed("End Station", "dst")

In [5]:
from graphframes import GraphFrame

In [6]:
# builds graph
stationGraph = GraphFrame(stationVertices, tripEdges)
stationGraph.cache()

GraphFrame(v:[id: string, station_id: string ... 5 more fields], e:[src: string, dst: string ... 9 more fields])

### _Graph Structure Example_

In [7]:
stationGraph.vertices.printSchema()
stationGraph.edges.printSchema()

root
 |-- station_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- long: string (nullable = true)
 |-- dockcount: string (nullable = true)
 |-- landmark: string (nullable = true)
 |-- installation: string (nullable = true)

root
 |-- Trip ID: string (nullable = true)
 |-- Duration: string (nullable = true)
 |-- Start Date: string (nullable = true)
 |-- src: string (nullable = true)
 |-- Start Terminal: string (nullable = true)
 |-- End Date: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- End Terminal: string (nullable = true)
 |-- Bike #: string (nullable = true)
 |-- Subscriber Type: string (nullable = true)
 |-- Zip Code: string (nullable = true)



In [8]:
print("Total Number of Stations: " + str(stationGraph.vertices.count()))
print("Total Number of Trips in Graph: " + str(stationGraph.edges.count()))
print("Total Number of Trips in Original Data: " + str(tripData.count()))

Total Number of Stations: 70
Total Number of Trips in Graph: 354152
Total Number of Trips in Original Data: 354152


### _Graph Query Example_

In [9]:
from pyspark.sql.functions import desc

In [10]:
stationGraph.edges.groupBy("src", "dst").count().orderBy(desc("count")).show(10, False)

stationGraph.edges\
.where("src = 'Townsend at 7th' OR dst = 'Townsend at 7th'")\
.groupBy("src", "dst").count()\
.orderBy(desc("count"))\
.show(10, False)

+---------------------------------------------+----------------------------------------+-----+
|src                                          |dst                                     |count|
+---------------------------------------------+----------------------------------------+-----+
|San Francisco Caltrain 2 (330 Townsend)      |Townsend at 7th                         |3748 |
|Harry Bridges Plaza (Ferry Building)         |Embarcadero at Sansome                  |3145 |
|2nd at Townsend                              |Harry Bridges Plaza (Ferry Building)    |2973 |
|Townsend at 7th                              |San Francisco Caltrain 2 (330 Townsend) |2734 |
|Harry Bridges Plaza (Ferry Building)         |2nd at Townsend                         |2640 |
|Embarcadero at Folsom                        |San Francisco Caltrain (Townsend at 4th)|2439 |
|Steuart at Market                            |2nd at Townsend                         |2356 |
|Embarcadero at Sansome                       |Ste

### _SubGraph Example_

In [11]:
# subgraphs are smaller graphs within the main graph
townAnd7thEdges = stationGraph.edges.where("src = 'Townsend at 7th' OR dst = 'Townsend at 7th'")
subgraph = GraphFrame(stationGraph.vertices, townAnd7thEdges)

### _Motif (Triangle) Structure Example_

In [12]:
# motifs find structural patterns in a graph
# ex: (a)-[ab]->(b) matches every vertex 'a' that connects to a vertex 'b' through an edge 'ab'
motifs = stationGraph.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)")
# (a) => represents starting station
# [ab] => represents an edge from station 'a' to station 'b'
# (b) => represents ending station
# repeat process for stations (b) to (c) and (c) to (a)
# pattern is a triangle from ab to bc to ca
# output returns nested fields for vertices a, b, and c as well as connected edges
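
What the triangle pattern matches can be sketched in plain Python on a tiny hypothetical edge list. This is an illustration of the idea, not how GraphFrames evaluates motifs; `find_triangles` is a made-up helper, and it works on distinct (src, dst) pairs rather than on individual trip rows.

```python
from collections import defaultdict

# Hypothetical directed edges (src, dst).
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]

def find_triangles(edge_list):
    """Return every (a, b, c) with edges a->b, b->c and c->a."""
    edge_set = set(edge_list)
    outgoing = defaultdict(set)
    for src, dst in edge_set:
        outgoing[src].add(dst)
    triangles = []
    for a, b in edge_set:
        for c in outgoing.get(b, ()):
            if (c, a) in edge_set:
                triangles.append((a, b, c))
    return sorted(triangles)

triangles = find_triangles(edges)
print(triangles)
```

The same triangle appears once per starting vertex, because the pattern names `a`, `b`, and `c` positionally. Nothing in the pattern forces the three vertices to be distinct, so any such restriction has to be added as an explicit filter.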

### _Motif (Search) Query Example_

In [13]:
from pyspark.sql.functions import expr

In [14]:
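# same bike on all three legs, a != b and b != c, legs in chronological order;
# keep the triangle with the shortest span between the first and last departure
ms = motifs.selectExpr("*",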
    "to_timestamp(ab.`Start Date`, 'MM/dd/yyyy HH:mm') as abStart",
    "to_timestamp(bc.`Start Date`, 'MM/dd/yyyy HH:mm') as bcStart",
    "to_timestamp(ca.`Start Date`, 'MM/dd/yyyy HH:mm') as caStart")\
.where("ca.`Bike #` = bc.`Bike #`").where("ab.`Bike #` = bc.`Bike #`")\
.where("a.id != b.id").where("b.id != c.id")\
.where("abStart < bcStart").where("bcStart < caStart")\
.orderBy(expr("cast(caStart as long) - cast(abStart as long)"))\
.selectExpr("a.id", "b.id", "c.id", "ab.`Start Date`", "ca.`End Date`")\
.limit(1)
for i in ms.collect(): print(i)

Row(id='San Francisco Caltrain 2 (330 Townsend)', id='Townsend at 7th', id='San Francisco Caltrain (Townsend at 4th)', Start Date='5/19/2015 16:09', End Date='5/19/2015 16:33')


### _PageRank Example_

In [15]:
from pyspark.sql.functions import desc

In [16]:
# PageRank is an algorithm, originally built to rank web pages, that scores each vertex by the links pointing to it
# in this example, higher values indicate more important bike stations (those receiving the most bike traffic/visits)
ranks = stationGraph.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.orderBy(desc("pagerank")).select("id", "pagerank").show(10, False)

+----------------------------------------+------------------+
|id                                      |pagerank          |
+----------------------------------------+------------------+
|San Jose Diridon Caltrain Station       |4.051504835989957 |
|San Francisco Caltrain (Townsend at 4th)|3.351183296428704 |
|Mountain View Caltrain Station          |2.514390771015558 |
|Redwood City Caltrain Station           |2.3263087713711688|
|San Francisco Caltrain 2 (330 Townsend) |2.2311442913698567|
|Harry Bridges Plaza (Ferry Building)    |1.8251120118882902|
|2nd at Townsend                         |1.5821217785039197|
|Santa Clara at Almaden                  |1.573007408490752 |
|Townsend at 7th                         |1.568456580534067 |
|Embarcadero at Sansome                  |1.5414242087748948|
+----------------------------------------+------------------+
only showing top 10 rows
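
The roles of `resetProbability` and `maxIter` can be illustrated with a minimal power-iteration sketch in plain Python. This is a textbook-style formulation on a hypothetical three-vertex graph, not the GraphFrames implementation. The scaling (ranks start at 1.0 per vertex) and the assumption that every vertex has at least one outbound edge are choices made for this sketch and may differ from what the library does.

```python
# Hypothetical directed edges (src, dst); every vertex has at least one outbound edge.
edges = [("A", "C"), ("B", "C"), ("C", "A")]

def pagerank(edge_list, reset_probability=0.15, max_iter=10):
    """Plain power iteration: each vertex splits its rank evenly across its outbound edges."""
    vertices = sorted({v for edge in edge_list for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edge_list:
        out_degree[src] += 1
    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edge_list:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {
            v: reset_probability + (1 - reset_probability) * incoming[v]
            for v in vertices
        }
    return ranks

ranks = pagerank(edges)
for vertex, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(vertex, rank)
```

Vertex `B` has no inbound edges, so it only ever receives the reset share and ends up with the lowest rank.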



### _In-Degree & Out-Degree Metrics Example_

In [17]:
# inbound connections [in-degree] (ex: social network followers; trips ending at a station)
inDeg = stationGraph.inDegrees
inDeg.orderBy(desc("inDegree")).show(5, False)

# outbound connections [out-degree] (ex: social network people they follow; trips starting at a station)
outDeg = stationGraph.outDegrees
outDeg.orderBy(desc("outDegree")).show(5, False)

# degree ratio
degreeRatio = inDeg.join(outDeg, "id")\
.selectExpr("id", "double(inDegree)/double(outDegree) as degreeRatio")

# higher ratio provides rank where bike trips mostly end (but rarely start)
degreeRatio.orderBy(desc("degreeRatio")).show(10, False)

# lower ratio provides rank where bike trips mostly start (but rarely end)
degreeRatio.orderBy("degreeRatio").show(10, False)

+----------------------------------------+--------+
|id                                      |inDegree|
+----------------------------------------+--------+
|San Francisco Caltrain (Townsend at 4th)|34810   |
|San Francisco Caltrain 2 (330 Townsend) |22523   |
|Harry Bridges Plaza (Ferry Building)    |17810   |
|2nd at Townsend                         |15463   |
|Townsend at 7th                         |15422   |
+----------------------------------------+--------+
only showing top 5 rows

+---------------------------------------------+---------+
|id                                           |outDegree|
+---------------------------------------------+---------+
|San Francisco Caltrain (Townsend at 4th)     |26304    |
|San Francisco Caltrain 2 (330 Townsend)      |21758    |
|Harry Bridges Plaza (Ferry Building)         |17255    |
|Temporary Transbay Terminal (Howard at Beale)|14436    |
|Embarcadero at Sansome                       |14158    |
+------------------------------------------
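
The degree metrics reduce to counting how often a station appears as `dst` (in-degree) or as `src` (out-degree). A minimal plain-Python sketch on a hypothetical list of trips, not GraphFrames code:

```python
from collections import Counter

# Hypothetical trips as (src, dst) pairs.
trips = [("A", "B"), ("A", "B"), ("C", "B"), ("B", "A")]

out_degree = Counter(src for src, _ in trips)  # trips starting at each station
in_degree = Counter(dst for _, dst in trips)   # trips ending at each station

# Like the inner join on "id", only stations with both metrics get a ratio.
degree_ratio = {
    station: in_degree[station] / out_degree[station]
    for station in in_degree
    if station in out_degree
}

print(in_degree)
print(out_degree)
print(degree_ratio)
```

Station `C` never appears as a destination, so it has no in-degree and drops out of the ratio, just as it would drop out of an inner join.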

### _Breadth-First Search Example_

In [18]:
# searches the graph for the shortest paths connecting two sets of nodes via graph edges
# parameters:
#     maxPathLength - maximum number of edges to follow
#     edgeFilter - filters out edges that do not meet a requirement
stationGraph.bfs(fromExpr="id = 'Townsend at 7th'",\
                 toExpr="id = 'Spear at Folsom'", maxPathLength=2).show(10)

# type Row
for i in stationGraph.bfs(fromExpr="id = 'Townsend at 7th'",\
                 toExpr="id = 'Spear at Folsom'", maxPathLength=2).take(1): print(i)

+--------------------+--------------------+--------------------+
|                from|                  e0|                  to|
+--------------------+--------------------+--------------------+
|[65, Townsend at ...|[913371, 663, 8/3...|[49, Spear at Fol...|
|[65, Townsend at ...|[913265, 658, 8/3...|[49, Spear at Fol...|
|[65, Townsend at ...|[911919, 722, 8/3...|[49, Spear at Fol...|
|[65, Townsend at ...|[910777, 704, 8/2...|[49, Spear at Fol...|
|[65, Townsend at ...|[908994, 1115, 8/...|[49, Spear at Fol...|
|[65, Townsend at ...|[906912, 892, 8/2...|[49, Spear at Fol...|
|[65, Townsend at ...|[905201, 980, 8/2...|[49, Spear at Fol...|
|[65, Townsend at ...|[904010, 969, 8/2...|[49, Spear at Fol...|
|[65, Townsend at ...|[903375, 850, 8/2...|[49, Spear at Fol...|
|[65, Townsend at ...|[899944, 910, 8/2...|[49, Spear at Fol...|
+--------------------+--------------------+--------------------+
only showing top 10 rows

Row(from=Row(station_id='65', id='Townsend at 7th', lat='37.7710
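
How breadth-first search interacts with a maximum path length can be sketched in plain Python. The `bfs_path` helper and the edge list are hypothetical; unlike the GraphFrames call, which returns a DataFrame of matching paths, this sketch returns a single path or `None`.

```python
from collections import deque

# Hypothetical directed edges (src, dst).
edges = [("A", "B"), ("B", "C"), ("C", "D")]

def bfs_path(edge_list, start, goal, max_path_length):
    """Return one shortest path from start to goal using at most max_path_length edges."""
    adjacency = {}
    for src, dst in edge_list:
        adjacency.setdefault(src, []).append(dst)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 >= max_path_length:
            continue  # this path already uses the maximum number of edges
        for nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

within_limit = bfs_path(edges, "A", "C", max_path_length=2)
beyond_limit = bfs_path(edges, "A", "D", max_path_length=2)
print(within_limit)
print(beyond_limit)
```

`D` is three edges away from `A`, so a limit of two edges finds no path to it.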

### _Connected Components Example_

In [19]:
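# a connected component is an 'undirected' subgraph whose vertices connect to each other but not to the rest of the graph
# the default connectedComponents algorithm requires a checkpoint directory
spark.sparkContext.setCheckpointDir("/Users/grp/sparkTheDefinitiveGuide/tmp/checkpointCC")

# sample roughly 10% of the trips (without replacement) to keep the computation small
minGraph = GraphFrame(stationVertices, tripEdges.sample(False, 0.1))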
cc = minGraph.connectedComponents()

In [20]:
cc.where("component != 0").show(3)

+----------+--------------------+---------+-----------+---------+-------------+------------+----------+
|station_id|                  id|      lat|       long|dockcount|     landmark|installation| component|
+----------+--------------------+---------+-----------+---------+-------------+------------+----------+
|        33|Rengstorff Avenue...|37.400241|-122.099076|       15|Mountain View|   8/16/2013|8589934592|
|        25|Stanford in Redwo...| 37.48537|-122.203288|       15| Redwood City|   8/12/2013|8589934592|
|        84|         Ryland Park|37.342725|-121.895617|       15|     San Jose|    4/9/2014|8589934592|
+----------+--------------------+---------+-----------+---------+-------------+------------+----------+
only showing top 3 rows
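
The idea behind connected components can be sketched with a small union-find in plain Python. This is an illustration on hypothetical data and not the algorithm GraphFrames runs. The component label here is the smallest vertex name in the group, whereas GraphFrames assigns numeric component ids.

```python
# Hypothetical vertices and edges; edge direction is ignored.
vertices = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("C", "B"), ("D", "E")]

def connected_components(vertex_list, edge_list):
    """Label each vertex with the smallest vertex name in its component."""
    parent = {v: v for v in vertex_list}

    def find(v):
        # Follow parent links to the root, shortening the chain along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edge_list:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[max(root_src, root_dst)] = min(root_src, root_dst)
    return {v: find(v) for v in vertex_list}

components = connected_components(vertices, edges)
print(components)
```

Vertices `A`, `B`, and `C` share one label and `D`, `E` share another, because no edge links the two groups.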



### _Strongly Connected Components Example_

In [21]:
# takes edge direction into account
# a strongly connected component is a subgraph in which every vertex can reach every other vertex by following directed edges
scc = minGraph.stronglyConnectedComponents(maxIter=3)
scc.groupBy("component").count().show(3)

+------------+-----+
|   component|count|
+------------+-----+
|128849018880|   16|
|  8589934592|   19|
|           0|   33|
+------------+-----+
only showing top 3 rows
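
The effect of direction can be shown with a small plain-Python sketch based on mutual reachability: two vertices belong to the same strongly connected component when each can reach the other. The helpers and data are hypothetical, and this brute-force approach is only meant for tiny graphs; it is not the algorithm GraphFrames uses.

```python
# Hypothetical vertices and directed edges (src, dst).
vertices = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D")]

def reachable(adjacency, start):
    """Return every vertex reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def strongly_connected_components(vertex_list, edge_list):
    """Group vertices that can reach each other by following directed edges."""
    forward, backward = {}, {}
    for src, dst in edge_list:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
    groups = set()
    for v in vertex_list:
        # Vertices v can reach, intersected with vertices that can reach v.
        groups.add(frozenset(reachable(forward, v) & reachable(backward, v)))
    return sorted(sorted(group) for group in groups)

scc_groups = strongly_connected_components(vertices, edges)
print(scc_groups)
```

Ignoring direction, all four vertices would form a single connected component. With direction, only `A` and `B` can reach each other, so `C` and `D` each end up alone.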


