<a href="https://colab.research.google.com/github/sherif17/PySpark-For-Big-Data/blob/main/GraphFrams_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The following section is for Colab Users.
### Just run the following code cells

In [None]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://bitbucket.org/habedi/datasets/raw/b6769c4664e7ff68b001e2f43bc517888cbe3642/spark/spark-3.0.2-bin-hadoop2.7.tgz
!tar xf spark-3.0.2-bin-hadoop2.7.tgz
!rm -rf spark-3.0.2-bin-hadoop2.7.tgz*
!pip -q install findspark pyspark graphframes



In [None]:
!wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.0-s_2.12/graphframes-0.8.2-spark3.0-s_2.12.jar -P /content/spark-3.0.2-bin-hadoop2.7/jars/
!cp /content/spark-3.0.2-bin-hadoop2.7/jars/graphframes-0.8.2-spark3.0-s_2.12.jar /content/spark-3.0.2-bin-hadoop2.7/graphframes-0.8.2-spark3.0-s_2.12.zip

--2023-05-01 20:18:29--  https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.0-s_2.12/graphframes-0.8.2-spark3.0-s_2.12.jar
Resolving repos.spark-packages.org (repos.spark-packages.org)... 52.85.151.5, 52.85.151.57, 52.85.151.46, ...
Connecting to repos.spark-packages.org (repos.spark-packages.org)|52.85.151.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 247882 (242K) [binary/octet-stream]
Saving to: ‘/content/spark-3.0.2-bin-hadoop2.7/jars/graphframes-0.8.2-spark3.0-s_2.12.jar’


2023-05-01 20:18:29 (32.7 MB/s) - ‘/content/spark-3.0.2-bin-hadoop2.7/jars/graphframes-0.8.2-spark3.0-s_2.12.jar’ saved [247882/247882]



In [None]:
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = os.environ["SPARK_HOME"]

os.environ["PYSPARK_DRIVER_PYTHON"] = "jupyter"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"

In [None]:
import findspark
findspark.init()

In [None]:
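# Note: each `!` command runs in its own subshell, so these exports do not persist.
# The same variables are already set through os.environ in an earlier cell.
!export PYSPARK_SUBMIT_ARGS="--master local[*] pyspark-shell"
!export PYSPARK_DRIVER_PYTHON=jupyter
!export PYSPARK_DRIVER_PYTHON_OPTS=notebook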

In [None]:
from pyspark.sql import SparkSession
from graphframes import *

spark = SparkSession.builder.master("local[*]").appName("GraphFrames").getOrCreate()

In [None]:
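# Note: PYSPARK_SUBMIT_ARGS is read when the JVM is launched, so this does not change the session created above.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 pyspark-shell"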

**************************************************************************
**************************************************************************
**************************************************************************

In [None]:
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
spark
sc = spark.sparkContext
sc

In [None]:
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

### Read departuredelays.csv into an edge DataFrame
### Read airport-codes-na.txt into a vertex DataFrame (the separator is a tab, i.e., sep = '\t')

#### The US flight delays data set has five columns:
- The <b>date</b> column contains an integer like 02190925. When converted, this maps to 02-19 09:25 am.
- The <b>delay</b> column gives the delay in minutes between the scheduled and actual departure times. Early departures show negative numbers.
- The <b>distance</b> column gives the distance in miles from the origin airport to the destination airport.
- The <b>origin</b> column contains the origin IATA airport code.
- The <b>destination</b> column contains the destination IATA airport code.

#### The airport-codes data set has four columns:
- The <b>IATA</b> column contains the IATA airport code.
- The <b>City, State, and Country</b> columns contain information about the airport location.
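
When the date value is handled as an integer, its leading zero is dropped (the sample output further down shows 1011245 rather than 01011245). The following is a minimal plain-Python sketch of how such a value can be decoded. The helper names `decode_flight_date` and `format_flight_date` are hypothetical and are not part of the notebook or of PySpark.

```python
def decode_flight_date(value):
    """Split an MMddHHmm-style value into (month, day, hour, minute)."""
    # Restore the leading zero that is lost when the value is stored as an integer.
    digits = str(value).zfill(8)
    month, day, hour, minute = (int(digits[i:i + 2]) for i in range(0, 8, 2))
    return month, day, hour, minute


def format_flight_date(value):
    """Render the value as 'MM-dd HH:mm' (the data set stores no year)."""
    month, day, hour, minute = decode_flight_date(value)
    return f"{month:02d}-{day:02d} {hour:02d}:{minute:02d}"


print(format_flight_date(2190925))  # 02-19 09:25
print(format_flight_date(1011245))  # 01-01 12:45
```

The same idea (pad to eight digits, then split into two-digit fields) can be carried over to a DataFrame column if a real timestamp is needed.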

In [None]:
# Import necessary packages
from pyspark.sql.functions import to_timestamp

# Read 'departuredelays.csv' in an Edge DataFrame
edges_df = spark.read.format('csv').options(header=True, inferSchema=True).load('/content/departuredelays.csv')
# edges_df = edges_df.withColumn('Date', to_timestamp(edges_df['Date'], 'MMddHHmm'))

# Read 'airport-codes-na.txt' in a Vertex DataFrame
vertices_df = spark.read.format('csv').options(header=True, inferSchema=True, sep='\t').load('/content/airport-codes-na.txt')
# vertices = spark.read.csv("airport-codes-na.txt", sep='\t', header=True)

In [None]:
vertices_df.show()

+-----------+-----+-------+----+
|       City|State|Country|IATA|
+-----------+-----+-------+----+
| Abbotsford|   BC| Canada| YXX|
|   Aberdeen|   SD|    USA| ABR|
|    Abilene|   TX|    USA| ABI|
|      Akron|   OH|    USA| CAK|
|    Alamosa|   CO|    USA| ALS|
|     Albany|   GA|    USA| ABY|
|     Albany|   NY|    USA| ALB|
|Albuquerque|   NM|    USA| ABQ|
| Alexandria|   LA|    USA| AEX|
|  Allentown|   PA|    USA| ABE|
|   Alliance|   NE|    USA| AIA|
|     Alpena|   MI|    USA| APN|
|    Altoona|   PA|    USA| AOO|
|   Amarillo|   TX|    USA| AMA|
|Anahim Lake|   BC| Canada| YAA|
|  Anchorage|   AK|    USA| ANC|
|   Appleton|   WI|    USA| ATW|
|     Arviat|  NWT| Canada| YEK|
|  Asheville|   NC|    USA| AVL|
|      Aspen|   CO|    USA| ASE|
+-----------+-----+-------+----+
only showing top 20 rows



In [None]:
edges_df.show()

+-------+-----+--------+------+-----------+
|   date|delay|distance|origin|destination|
+-------+-----+--------+------+-----------+
|1011245|    6|     602|   ABE|        ATL|
|1020600|   -8|     369|   ABE|        DTW|
|1021245|   -2|     602|   ABE|        ATL|
|1020605|   -4|     602|   ABE|        ATL|
|1031245|   -4|     602|   ABE|        ATL|
|1030605|    0|     602|   ABE|        ATL|
|1041243|   10|     602|   ABE|        ATL|
|1040605|   28|     602|   ABE|        ATL|
|1051245|   88|     602|   ABE|        ATL|
|1050605|    9|     602|   ABE|        ATL|
|1061215|   -6|     602|   ABE|        ATL|
|1061725|   69|     602|   ABE|        ATL|
|1061230|    0|     369|   ABE|        DTW|
|1060625|   -3|     602|   ABE|        ATL|
|1070600|    0|     369|   ABE|        DTW|
|1071725|    0|     602|   ABE|        ATL|
|1071230|    0|     369|   ABE|        DTW|
|1070625|    0|     602|   ABE|        ATL|
|1071219|    0|     569|   ABE|        ORD|
|1080600|    0|     369|   ABE| 

### In the vertex DataFrame, drop any duplicated rows with the same IATA code.

In [None]:
# Reassign to vertices_df so the later rename and the GraphFrame use the deduplicated rows
vertices_df = vertices_df.dropDuplicates(['IATA'])

### In the edges DataFrame:
- Rename the <b>date</b> column to <b>tripid</b>.
- Rename the <b>origin</b> column to <b>src</b>.
- Rename the <b>destination</b> column to <b>dst</b>.

In [None]:
from pyspark.sql.functions import col

# Rename the date column to tripid
edges_df = edges_df.withColumnRenamed('date', 'tripid')

# Rename the origin column to src
edges_df = edges_df.withColumnRenamed('origin', 'src')

# Rename the destination column to dst
edges_df = edges_df.withColumnRenamed('destination', 'dst')


### In the Vertex DataFrame:
- Rename the <b>IATA</b> column to <b>id</b>.

In [None]:
# Rename the 'IATA' column to 'id'
vertices_df = vertices_df.withColumnRenamed('IATA', 'id')

### Create a GraphFrame from the Vertex and Edges DataFrames

In [None]:
from graphframes import GraphFrame

spark = SparkSession.builder.appName('graph-example').getOrCreate()


In [None]:
# vertices = vertex_df.selectExpr('id', 'city', 'state', 'country')

# # Create edge DataFrame
# edges = edges_df.selectExpr('tripid', 'src', 'dst', 'delay', 'distance')

# Create GraphFrame
graph = GraphFrame(vertices_df, edges_df)


### Determine the number of airports

In [None]:
num_airports = graph.vertices.count()
print("Number of airports:", num_airports)

Number of airports: 526


### Determine the number of trips 

In [None]:
# Each edge is one trip, so the edge count is the number of trips
num_trips = graph.edges.count()
print(num_trips)

1391578


### What is the longest delay?

In [None]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in max
longest_delay = edges_df.select(F.max("delay")).collect()[0][0]
print("The longest delay is:", longest_delay, "minutes")

The longest delay is: 1642 minutes


### Find out the number of delayed flights vs. early flights (flights that departed before the scheduled time)

In [None]:
from pyspark.sql.functions import when

edges = graph.edges.withColumn("status", when(graph.edges.delay > 0, "delayed").when(graph.edges.delay < 0, "early").otherwise("on time"))
edges.groupBy("status").count().show()

+-------+------+
| status| count|
+-------+------+
|on time|131122|
|delayed|591727|
|  early|668729|
+-------+------+



### What flight destinations departing SFO are most likely to have significant delays? Select the top 10
#### Hint: you should get the average delay for each destination for trips that depart from SFO only

In [None]:
from pyspark.sql.functions import avg

# filter the edges DataFrame by selecting only rows where src is SFO
sfo_departures = edges.filter(edges['src'] == 'SFO')

# group the resulting DataFrame by dst and calculate the mean delay for each group
delay_by_dest = sfo_departures.groupBy('dst').agg(avg('delay').alias('avg_delay'))

# sort the result by mean delay in descending order and select the top 10 destinations
top_destinations = delay_by_dest.orderBy('avg_delay', ascending=False).limit(10)

# show the result
top_destinations.show()

+---+------------------+
|dst|         avg_delay|
+---+------------------+
|JAC| 30.78846153846154|
|OKC|24.822222222222223|
|SUN|22.696629213483146|
|COS| 22.58888888888889|
|SAT|             22.16|
|STL|         20.203125|
|HNL|19.982608695652175|
|ASE|19.846153846153847|
|CEC|19.089820359281436|
|MDW|18.771929824561404|
+---+------------------+



### Find the Incoming connections to the airport sorted in Desc. order.

In [None]:
from pyspark.sql.functions import desc

airport = "SFO"
incoming_connections = graph.inDegrees.filter(f"id == '{airport}'").sort(desc("inDegree"))
incoming_connections.show()

+---+--------+
| id|inDegree|
+---+--------+
|SFO|   38988|
+---+--------+



### Find the Outgoing connections from the airport sorted in Desc. order.

In [None]:
# Sort the DataFrame in descending order based on the outDegree column
out_degrees = graph.outDegrees.sort("outDegree", ascending=False)

# Show the top 10 outgoing connections
out_degrees.show(10)

+---+---------+
| id|outDegree|
+---+---------+
|ATL|    91484|
|DFW|    68482|
|ORD|    64228|
|LAX|    54086|
|DEN|    53148|
|IAH|    43361|
|PHX|    40155|
|SFO|    39483|
|LAS|    33107|
|CLT|    28402|
+---+---------+
only showing top 10 rows
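
`inDegrees` and `outDegrees` count the edges arriving at and leaving each vertex, so in this graph they count flights rather than distinct routes. Below is a small plain-Python sketch of the same counting on a made-up edge list. The sample flights are hypothetical and are not taken from the data set.

```python
from collections import Counter

# Hypothetical (src, dst) pairs; each tuple stands for one flight (one edge).
flights = [
    ("SFO", "LAX"), ("SFO", "LAX"), ("SFO", "JFK"),
    ("LAX", "SFO"), ("JFK", "SFO"), ("ATL", "SFO"),
    ("ATL", "LAX"),
]

# Out-degree: number of edges whose source is the airport.
out_degree = Counter(src for src, _ in flights)
# In-degree: number of edges whose destination is the airport.
in_degree = Counter(dst for _, dst in flights)

# most_common() lists entries by count in descending order.
print(out_degree.most_common())
print(in_degree.most_common())
```

Sorting the full degree table, rather than filtering to one airport first, is what makes a descending sort meaningful.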



### Use motif finding to answer this question: which delays could we blame on SFO?
#### Hint: this practically means that SFO is a transit station

In [None]:
# Find two-leg trips a -> b -> c that connect through SFO, where the inbound
# flight was on time or early but the outbound flight left late.
motifs = (graph.find("(a)-[e1]->(b); (b)-[e2]->(c)")
          .filter("b.id = 'SFO'")
          .filter("e1.delay <= 0")
          .filter("e2.delay > 0"))
motifs.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                   a|                  e1|                   b|                  e2|                   c|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[Albuquerque, NM,...|[1010600, -7, 779...|[San Francisco, C...|[1011250, 55, 224...|[New York, NY, US...|
|[Albuquerque, NM,...|[1010600, -7, 779...|[San Francisco, C...|[1011610, 134, 12...|[Dallas, TX, USA,...|
|[Albuquerque, NM,...|[1010600, -7, 779...|[San Francisco, C...|[1012330, 32, 160...|[Chicago, IL, USA...|
|[Albuquerque, NM,...|[1010600, -7, 779...|[San Francisco, C...|[1011330, 3, 1273...|[Dallas, TX, USA,...|
|[Albuquerque, NM,...|[1010600, -7, 779...|[San Francisco, C...|[1011410, 124, 16...|[Chicago, IL, USA...|
|[Albuquerque, NM,...|[1010600, -7, 779...|[San Francisco, C...|[1011250, 139, 29...|[Los Angeles, CA,...|
|[Albuquerque, NM,...|[1010600, -7, 7
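
The motif `(a)-[e1]->(b);(b)-[e2]->(c)` pairs every edge arriving at `b` with every edge leaving `b`; the filters then keep the pairs where the inbound flight was on time or early and the outbound flight was late. The following plain-Python sketch shows that pairing on made-up flights. The function name `blame_on` and the sample data are hypothetical. The optional check that the outbound `tripid` is greater than the inbound one is an assumption added for illustration; the query above does not apply it.

```python
# Hypothetical flights: (tripid, delay, src, dst).
flights = [
    (1010600, -7, "ABQ", "SFO"),
    (1010900, 12, "SEA", "SFO"),
    (1011250, 55, "SFO", "JFK"),
    (1010500, 30, "SFO", "ORD"),
    (1011610, -3, "SFO", "DFW"),
]


def blame_on(hub, flights, require_later_departure=True):
    """Pair on-time or early arrivals at `hub` with delayed departures from it."""
    inbound = [f for f in flights if f[3] == hub and f[1] <= 0]
    outbound = [f for f in flights if f[2] == hub and f[1] > 0]
    pairs = []
    for e1 in inbound:
        for e2 in outbound:
            # Optional refinement (an assumption, not in the original query):
            # keep only outbound flights whose tripid is later than the inbound one.
            if require_later_departure and e2[0] <= e1[0]:
                continue
            pairs.append((e1, e2))
    return pairs


for e1, e2 in blame_on("SFO", flights):
    print(e1, "->", e2)
```

Without such an ordering condition, every qualifying inbound flight is matched with every delayed outbound flight, which is why the result set grows so quickly.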

### Determine Airport Ranking in Desc. order using PageRank algorithm

In [None]:
# Run PageRank until convergence to tolerance "tol".
results =  graph.pageRank(resetProbability=0.15, tol=0.01)
results.vertices.show()
results.edges.show()


results.vertices.orderBy('pagerank',ascending=False).show()
results.edges.orderBy('weight',ascending=False).show()


results.vertices.select("id", "pagerank").orderBy('pagerank').show()
results.edges.select("src", "dst", "weight").orderBy('weight').show()


+-------------+-----+-------+---+-------------------+
|         City|State|Country| id|           pagerank|
+-------------+-----+-------+---+-------------------+
|Rouyn-Noranda|   PQ| Canada|YUY|0.32912955224621604|
|   Miles City|   MT|    USA|MLS|0.32912955224621604|
|        Butte|   MT|    USA|BTM| 0.3815704636828766|
|State College|   PA|    USA|SCE| 0.3728480692654548|
|   Ogdensburg|   NY|    USA|OGS|0.32912955224621604|
|     Appleton|   WI|    USA|ATW|0.47786740691068647|
|     Waterloo|   IA|    USA|ALO| 0.3725878399974831|
|   Huntington|   WV|    USA|HTS|0.32912955224621604|
|    Pensacola|   FL|    USA|PNS| 0.8034356126664143|
|    Vancouver|   BC| Canada|YVR|0.32912955224621604|
|      Yakutat|   AK|    USA|YAK| 0.7511050233408701|
|   Dillingham|   AK|    USA|DLG|0.32912955224621604|
|Orange County|   CA|    USA|SNA|  2.510100989015544|
|   Bar Harbor|   ME|    USA|BHB|0.32912955224621604|
|Iron Mountain|   MI|    USA|IMT|0.34966814337233343|
|      Medford|   OR|    USA
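
PageRank repeatedly redistributes each vertex's score along its outgoing edges and mixes in a reset term controlled by `resetProbability`. The following is a simplified power-iteration sketch in plain Python on a made-up four-airport graph. It normalizes the scores to sum to 1 and spreads the score of vertices without outgoing edges uniformly; both are choices made for this illustration, so its numbers are not comparable with the GraphFrames output above.

```python
def pagerank(edges, reset_probability=0.15, tol=1e-6, max_iter=100):
    """Simplified PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_edges = {n: [] for n in nodes}
    for src, dst in edges:
        out_edges[src].append(dst)  # parallel edges count multiple times

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Score held by vertices with no outgoing edges is shared by everyone.
        dangling = sum(rank[n] for n in nodes if not out_edges[n])
        base = (reset_probability + (1 - reset_probability) * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_edges[src]:
                share = rank[src] / len(out_edges[src])
                new_rank[dst] += (1 - reset_probability) * share
        change = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical routes: every airport flies to ATL, and ATL flies back to two of them.
edges = [("SFO", "ATL"), ("LAX", "ATL"), ("JFK", "ATL"), ("ATL", "SFO"), ("ATL", "LAX")]
ranks = pagerank(edges)
for airport, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(airport, round(score, 3))
```

In this toy graph the hub that every other airport flies to ends up with the highest score, and the airport with no incoming flights keeps only the reset share.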

### Determine the most popular flights (single city hops)

In [None]:
from graphframes import *
from pyspark.sql.functions import col

single_city_hops = graph.find("(src)-[flight]->(dst)")\
                    .filter("src.City == dst.City")

single_city_hops.show(10)

+---+---+-----------+
|src|dst|Occurrences|
+---+---+-----------+
|IAD|DCA|          1|
+---+---+-----------+
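
If "most popular flights" is read as the routes flown most often, one way to answer the question is to count the edges per `(src, dst)` pair and sort by that count; on the edges DataFrame this would correspond to `groupBy("src", "dst").count()` followed by a descending sort. Below is a plain-Python sketch of the counting on made-up flights. The function name `most_popular_routes` and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical flights as (src, dst) pairs; repeated pairs are repeated trips.
flights = [
    ("SFO", "LAX"), ("SFO", "LAX"), ("SFO", "LAX"),
    ("LAX", "SFO"), ("LAX", "SFO"),
    ("ATL", "DFW"),
]


def most_popular_routes(flights, top_n=10):
    """Return the top_n (route, trip_count) pairs, most frequent first."""
    return Counter(flights).most_common(top_n)


for (src, dst), trips in most_popular_routes(flights, top_n=2):
    print(f"{src} -> {dst}: {trips} trips")
```

Note that direction matters here: SFO to LAX and LAX to SFO are counted as separate routes.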



### Find and save a subgraph obtained from the following pattern:
#### The flight starts from an airport and returns to the same airport through 2 other airports.

In [None]:
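# Name the edges so they appear in the result; anonymous edges ([]) are left out of the output.
cycles = graph.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)")
# Gather the distinct vertices and edges that take part in any matched cycle.
cycle_vertices = cycles.select("a.*").union(cycles.select("b.*")).union(cycles.select("c.*")).distinct()
cycle_edges = cycles.select("ab.*").union(cycles.select("bc.*")).union(cycles.select("ca.*")).distinct()
# GraphFrame expects Spark DataFrames (not pandas): vertices with an id column, edges with src and dst.
g = GraphFrame(cycle_vertices, cycle_edges)
g.vertices.write.parquet("vertices", mode='overwrite')
g.edges.write.parquet("edges", mode='overwrite')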