# PySpark - GraphFrames and Graph Theory

## Introduction

Some datasets and data problems are difficult to explain or understand with simple distribution charts, pie charts or scatter plots alone. Examples include geographical data points, social networks and user interactions. Graphs, which consist of three components (nodes or vertices, edges, and their properties), represent these kinds of problems in a more intuitive way. Being able to assign nodes to anything and define the relationships between those nodes with edges provides a great deal of flexibility in how the data is represented. It also means that two seemingly disparate graphs can be connected into a common graph, as long as a link can be found between their nodes. For example, a social network could be joined with restaurant recommendations, numbers of travellers, airport delays and so on.

For more information:
- https://www.geeksforgeeks.org/mathematics-graph-theory-basics-set-1/
- http://www.analytictech.com/mb021/graphtheory.htm
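
As a minimal sketch of the idea described in this introduction, the plain-Python snippet below (not GraphFrames) stores nodes and edges together with their properties, then joins two small graphs through a node they share. The people, airports, property names and the `merge_graphs` helper are all made up for illustration.

```python
# Illustrative only: the people, airports and properties below are made up.

# Each graph is a pair: nodes (id -> properties) and edges (src, dst, properties).
social_nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "SEA": {"kind": "airport", "city": "Seattle"},
}
social_edges = [
    ("alice", "bob", {"relationship": "friend"}),
    ("alice", "SEA", {"relationship": "home_airport"}),
]

flight_nodes = {
    "SEA": {"kind": "airport", "city": "Seattle"},
    "SFO": {"kind": "airport", "city": "San Francisco"},
}
flight_edges = [
    ("SEA", "SFO", {"relationship": "flight", "delay_minutes": 12}),
]


def merge_graphs(nodes_a, edges_a, nodes_b, edges_b):
    """Combine two graphs; nodes with the same id are treated as the same node."""
    shared = set(nodes_a) & set(nodes_b)
    if not shared:
        raise ValueError("the graphs share no node, so nothing links them")
    merged_nodes = {**nodes_a, **nodes_b}
    merged_edges = edges_a + edges_b
    return merged_nodes, merged_edges, shared


nodes, edges, shared = merge_graphs(
    social_nodes, social_edges, flight_nodes, flight_edges
)
print(sorted(shared))
print(len(nodes), len(edges))
```

The shared airport node is what links the social graph to the flight graph; without a common node id the two graphs would remain disconnected.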

## Breakdown of this Notebook

- An Introduction to Graph Theory and GraphFrames for Apache Spark
- Installing GraphFrames
- Data Preparation
- Building the Graph
- Running queries against the graph
- Understanding the Graph
- Utilise PageRank to determine airport ranks
- Finding the fewest number of connections (flights)
- Visualising the Graph

## Why use GraphFrames with Spark?

A persistent problem when working with graphs is that traversing them and running graph algorithms is often computationally expensive and at times very slow. To overcome this, GraphFrames with Apache Spark takes advantage of the performance of distributed DataFrames.

### Under the hood of GraphFrames:

GraphFrames uses two Spark DataFrames, one for the nodes and another for the edges. It leverages the optimisations and simplicity of the DataFrame API and, in addition, can be used through the Python, Java and Scala APIs.
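
The two-table layout can be sketched without Spark. The snippet below uses plain Python lists of rows as stand-ins for the two DataFrames. The column names `id`, `src` and `dst` follow the convention GraphFrames expects, while the airports and delay values are made up for illustration. With Spark and the package available, the equivalent DataFrames would be passed to `GraphFrame(vertices, edges)`.

```python
from collections import Counter

# Stand-ins for the two DataFrames (illustrative rows, not from the datasets).
vertices = [
    {"id": "SEA", "city": "Seattle"},
    {"id": "SFO", "city": "San Francisco"},
    {"id": "JFK", "city": "New York"},
]
edges = [
    {"src": "SEA", "dst": "SFO", "delay": 10},
    {"src": "SEA", "dst": "JFK", "delay": -5},
    {"src": "SFO", "dst": "JFK", "delay": 30},
]

# Every edge endpoint should refer to a vertex id.
vertex_ids = {v["id"] for v in vertices}
assert all(e["src"] in vertex_ids and e["dst"] in vertex_ids for e in edges)

# Degree counts are simple aggregations over the edge table.
out_degree = Counter(e["src"] for e in edges)
in_degree = Counter(e["dst"] for e in edges)

# A "join" between edges and vertices: destination cities of delayed flights.
city_by_id = {v["id"]: v["city"] for v in vertices}
delayed_destinations = [city_by_id[e["dst"]] for e in edges if e["delay"] > 0]

print(dict(out_degree))
print(dict(in_degree))
print(delayed_destinations)
```

Because both tables are ordinary rows, graph questions such as degree counts reduce to group-by and join operations, which is the kind of work the DataFrame API is built to distribute.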

## Datasets:

The datasets for this project are (1) the Airline On-Time Performance and Causes of Flight Delays data, which consists of information about scheduled and actual departure/arrival times along with the delay causes, and (2) the OpenFlights data, which details airports and airlines. More details can be found in the links below.

The datasets are obtained from:
- https://catalog.data.gov/dataset/airline-on-time-performance-and-causes-of-flight-delays-on-time-data
- https://openflights.org/data.html

Or download the folder from this repository which should contain the following files:
- airport-codes-na.txt
- departuredelays.csv

## 1 Installing GraphFrames:

The GraphFrames Spark package can be found at https://spark-packages.org/package/graphframes/graphframes. Passing the package coordinates to Spark with the `--packages` flag downloads the requested version and makes it available within the context of the Spark job.

At the time of writing this Notebook, the command was:

> $SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.8.0-spark3.0-s_2.12

To install:
1. In a terminal window, after activating your PySpark environment, type in "cd /your spark installed location/bin".
    - In my case it was "cd /opt/spark/bin".
2. Then type in: spark-shell --packages graphframes:graphframes:0.8.0-spark3.0-s_2.12
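
For a locally launched PySpark session (an assumption; a Livy-managed session such as one configured with `%%configure` is set up differently), the same `--packages` option can also be supplied from Python through the `PYSPARK_SUBMIT_ARGS` environment variable, provided it is set before the Spark session is created. The sketch below only builds and sets that variable. `build_submit_args` is a hypothetical helper, and the coordinate is the one quoted in the command above.

```python
import os

# Package coordinate quoted in this notebook; adjust to your Spark/Scala versions.
GRAPHFRAMES_COORDINATE = "graphframes:graphframes:0.8.0-spark3.0-s_2.12"


def build_submit_args(*coordinates):
    """Hypothetical helper: format package coordinates as spark-submit arguments."""
    # The trailing "pyspark-shell" token is required when launching PySpark this way.
    return "--packages {} pyspark-shell".format(",".join(coordinates))


# Must run before any SparkSession/SparkContext is created in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(GRAPHFRAMES_COORDINATE)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```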

## 2 PySpark Machine Configuration:

Here only four processing cores are used, which is set up by the following code.

In [None]:
%%configure
{
    "executorCores" : 4
}

In [None]:
from pyspark.sql.types import *

## 3 Set Up the Correct Directory:

In [None]:
import os

# Change the Path:
path = '++++your working directory here++++/Datasets/'
os.chdir(path)
folder_pathway = os.getcwd()

# print(folder_pathway)

## 4 Example of Nodes and Edges in Graphs:

The diagram below details the nodes and edges in a graph. 

In [None]:
%%local

# Import the required libraries:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

folder_pathway = os.getcwd()
image_path = folder_pathway + "/Description Images/"

# plot the image
fig, ax1 = plt.subplots(figsize=(16,10))
image = mpimg.imread(image_path + 'Example Graph Theory.png')
plt.imshow(image);

print('Image source -> https://www.geeksforgeeks.org/mathematics-graph-theory-basics-set-1/')

## 5

In [None]:
%%local

# plot the image
fig, ax1 = plt.subplots(figsize=(16, 10))
image = mpimg.imread(image_path + 'Microbatch lines of DStream.png')
plt.imshow(image);

print('Image source -> ')