# Working with Graphs in Spark


In this lab, you will learn some of the functionality of Spark GraphFrames, a DataFrame-based library for working with graphs on Spark.

In [None]:
sc

In [None]:
spark

To work with GraphFrames in Python, you need to import the `graphframes` library. **Note: this library is not installed by default with Spark on EMR. The post-startup script you ran today downloaded it and made it accessible to the Spark engine.**

In [None]:
from graphframes import GraphFrame

You will be using data from the Bay Area Bike Share Portal (a service similar to Capital Bikeshare in DC).

In the following two cells, read in two CSV files located in S3:
* `s3://bigdatateaching/bike-data/station_data.csv`
* `s3://bigdatateaching/bike-data/trip_data.csv`

The station file contains the metadata of the bicycle stations, and the trip file contains all the bike trips.

In [None]:
bike_stations = 

In [None]:
trip_data = 

Explore the datasets:

In [None]:
bike_stations.show(10)

In [None]:
trip_data.show(10)

You will now modify the two DataFrames read in above to create a vertex list and an edge list.

In the next cell, use the station data and rename the "name" column to "id" and get distinct records:

In [None]:
station_vertices = 

In the next cell, use the trip data and rename the "Start Station" column to "src" and the "End Station" column to "dst".

In [None]:
trip_edges = 

In the next cell, you will create a GraphFrame by passing in a vertex list and an edge list. Which of your original datasets is which?

In [None]:
station_graph = GraphFrame()
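
To make the two roles concrete, here is a minimal plain-Python sketch on toy data (no Spark involved). The station names and the `dockcount` and `duration` fields are made up for illustration. The only assumption carried over from GraphFrames is the naming rule you applied above: vertices are identified by `id`, and each edge points from `src` to `dst`.

```python
# Toy stand-ins for the two DataFrames; all values here are hypothetical.
vertices = [
    {"id": "A", "dockcount": 15},
    {"id": "B", "dockcount": 19},
    {"id": "C", "dockcount": 11},
]
edges = [
    {"src": "A", "dst": "B", "duration": 300},
    {"src": "B", "dst": "C", "duration": 420},
    {"src": "C", "dst": "D", "duration": 610},  # "D" is not in the vertex list
]

# Every src/dst value should match some vertex id.
vertex_ids = {v["id"] for v in vertices}
dangling = [
    e for e in edges
    if e["src"] not in vertex_ids or e["dst"] not in vertex_ids
]
print(dangling)
```

A quick check like this on your real data can reveal trips that refer to stations missing from the station file.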

Since you will be using the GraphFrame more than once, it is best to cache it.

In [None]:
station_graph.cache()

### Graph Metadata

Count the number of vertices in the graph:

In [None]:
station_graph

Count the number of edges in the graph:

In [None]:
station_graph

### Querying the Graph

The most basic way of interacting with the graph is querying it. Since the GraphFrame is based on DataFrames, you can perform the same type of operations you would on a DataFrame.

In the next cell, show the top 10 source and destination combinations, ordered in descending order by count:

In [None]:
from pyspark.sql.functions import desc
station_graph.edges.
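
If it is unclear what "source and destination combinations ordered by count" should produce, the following plain-Python sketch shows the same idea on a handful of made-up trips. It is only an illustration of the grouping logic, not the DataFrame code you are asked to write.

```python
from collections import Counter

# Hypothetical (src, dst) pairs, one per trip.
trips = [
    ("A", "B"), ("A", "B"), ("B", "C"),
    ("A", "B"), ("B", "C"), ("C", "A"),
]

# Count how often each (src, dst) combination occurs,
# then keep the most frequent ones, largest count first.
pair_counts = Counter(trips)
top_pairs = pair_counts.most_common(2)
print(top_pairs)
```

Note that direction matters: a trip from A to B and a trip from B to A are different combinations.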

In the next cell, show the top 10 source and destination combinations **where the source or destination station is 'Townsend at 7th'**, ordered in descending order by count:

In [None]:
station_graph.edges.

### Subsetting a Graph

Sometimes you need to work with a subset of a graph. The easiest way to create a subset is to create a new graph with the vertices and edges of your subset.

In the next cell, subset the edges where the source or destination station is 'Townsend at 7th', and create a new graph called sg1 using the original vertices and the new edge list:

In [None]:
townsend_and_7th_edges = 
sg1 = GraphFrame()
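
One consequence of reusing the original vertices is worth seeing on toy data. The plain-Python sketch below (made-up station names, no Spark) filters an edge list down to the edges that touch one station and then checks which vertices are still connected to anything.

```python
# Hypothetical directed edges between four stations.
all_vertices = {"A", "B", "C", "D"}
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "C")]

station = "A"

# Keep an edge if the station is its source or its destination.
sub_edges = [(s, d) for s, d in edges if s == station or d == station]

# Vertices that still appear in at least one remaining edge.
touched = {v for edge in sub_edges for v in edge}
isolated = all_vertices - touched
print(sub_edges, isolated)
```

Here `D` ends up isolated: it is still a vertex of the subgraph but has no edges. Keep that in mind when you compare the vertex and edge counts of `sg1` with those of the full graph.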

In [None]:
sg1.

In [None]:
sg1.

### Motifs

*Motifs* are a way of expressing structural patterns in a graph. The following cell searches for a triangular pattern: three vertices `a`, `b`, and `c` connected by edges `a` → `b`, `b` → `c`, and `c` → `a`.

In [None]:
motifs = station_graph.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)")
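
To get a feel for what a triangle search returns, here is a simplified plain-Python sketch on a tiny made-up graph. It is not how GraphFrames evaluates motifs, and it ignores parallel edges, but it shows two things to expect in the result: each triangle is reported once per starting vertex, and nothing in the pattern itself forces `a`, `b`, and `c` to be different stations.

```python
from collections import defaultdict

# Hypothetical directed edges; A -> C exists, but it does not close a cycle.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]

# Adjacency list: for each vertex, the vertices it points to.
out = defaultdict(list)
for src, dst in edges:
    out[src].append(dst)

# (a)->(b); (b)->(c); (c)->(a)
triangles = [
    (a, b, c)
    for a, b in edges      # edge ab
    for c in out[b]        # edge bc
    if a in out[c]         # edge ca
]
print(triangles)
```

The single cycle A → B → C → A shows up three times, once for each rotation. On the real trip data, where there are many trips between the same pair of stations, the number of matches grows very quickly.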

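The following cell takes the motifs DataFrame, parses the trip start times, keeps only triangles ridden on a single bike in chronological order, and shows the one with the shortest time between the start of the first trip and the start of the last: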

In [None]:
from pyspark.sql.functions import expr
motifs.selectExpr("*",
    "to_timestamp(ab.`Start Date`, 'MM/dd/yyyy HH:mm') as abStart",
    "to_timestamp(bc.`Start Date`, 'MM/dd/yyyy HH:mm') as bcStart",
    "to_timestamp(ca.`Start Date`, 'MM/dd/yyyy HH:mm') as caStart")\
  .where("ca.`Bike #` = bc.`Bike #`").where("ab.`Bike #` = bc.`Bike #`")\
  .where("a.id != b.id").where("b.id != c.id")\
  .where("abStart < bcStart").where("bcStart < caStart")\
  .orderBy(expr("cast(caStart as long) - cast(abStart as long)"))\
  .selectExpr("a.id", "b.id", "c.id", "ab.`Start Date`", "ca.`End Date`")\
  .limit(1).show(1, False)

### Graph Algorithms

### PageRank

In [None]:
ranks = station_graph.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.orderBy(desc("pagerank")).select("id", "pagerank").show(10)
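
The two parameters are easier to interpret with a small worked example. The sketch below is a simplified plain-Python version of the PageRank iteration on a made-up three-station graph. It assumes every vertex starts with rank 1.0 and that every vertex has at least one outgoing edge. The function name and the toy edges are hypothetical, and the library's own implementation may differ in details such as normalization.

```python
def pagerank(edges, reset_probability=0.15, max_iter=10):
    """Simplified PageRank over a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})

    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        # Each vertex splits its current rank evenly over its outgoing edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # With probability reset_probability the walk restarts at random.
        ranks = {
            v: reset_probability + (1 - reset_probability) * incoming[v]
            for v in vertices
        }
    return ranks


toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
toy_ranks = pagerank(toy_edges)
print(sorted(toy_ranks.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy graph, C receives trips from both A and B, so it ends up with the highest rank. In the bike data, a high PageRank similarly points to stations that many trips, especially trips from other busy stations, lead to.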

### In-Degree and Out-Degree Metrics

In [None]:
inDeg = station_graph.inDegrees
inDeg.orderBy(desc("inDegree")).show(5, False)

In [None]:
outDeg = station_graph.outDegrees
outDeg.orderBy(desc("outDegree")).show(5, False)

In [None]:
degreeRatio = inDeg.join(outDeg, "id")\
  .selectExpr("id", "double(inDegree)/double(outDegree) as degreeRatio")
degreeRatio.orderBy(desc("degreeRatio")).show(10, False)
degreeRatio.orderBy("degreeRatio").show(10, False)
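
As a reading aid for the degree ratio, here is the same calculation in plain Python on a few made-up trips. A ratio above 1 means more trips end at a station than start there, and a ratio below 1 means the opposite. The sketch also mirrors the inner join above: a station that appears only as a source or only as a destination has no ratio at all.

```python
from collections import Counter

# Hypothetical (src, dst) pairs, one per trip.
edges = [("A", "B"), ("A", "B"), ("B", "C"), ("C", "A"), ("C", "B")]

in_degree = Counter(dst for _, dst in edges)   # trips ending at each station
out_degree = Counter(src for src, _ in edges)  # trips starting at each station

# Only stations present on both sides get a ratio (inner-join behaviour).
degree_ratio = {
    v: in_degree[v] / out_degree[v]
    for v in in_degree.keys() & out_degree.keys()
}
print(sorted(degree_ratio.items(), key=lambda kv: kv[1], reverse=True))
```

Station B has a ratio of 3.0 because three trips end there and only one starts there. In a bike share system, stations like that tend to accumulate bikes.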