# Graph Analysis using Spark - GraphFrames

#### Graphs are data structures composed of nodes, or vertices, which are arbitrary objects, and edges that define relationships between these nodes. 

#### Graph analytics is the process of analyzing these relationships.

An example graph might be your friend group. 
<br>In the context of graph analytics, each vertex or node would represent a person, and each edge would represent a relationship.

#### Edges and vertices in graphs can also have data associated with them.

In our friend example, the weight of the edge might represent the intimacy between different friends; 
<br>acquaintances would have low-weight edges between them, 
<br>while married individuals would have edges with large weights. 
<br>We could set this value by looking at communication frequency between nodes and weighting the edges accordingly. 
<br>Each vertex (person) might also have data such as a name.
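
The friend-group example above can be sketched in plain Python. This is only an illustration: the names, IDs, and weights below are hypothetical, and the helper `closest_friend` is not part of any library.

```python
# Vertices: arbitrary objects with attached data (here, a name per person).
# All names, IDs and weights are made up for illustration.
vertices = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
}

# Edges: (src, dst, weight); a larger weight means a closer relationship.
edges = [
    (1, 2, 0.9),  # e.g. a married couple
    (1, 3, 0.1),  # e.g. acquaintances
    (2, 3, 0.2),
]


def closest_friend(person_id):
    """Return the name of the neighbour joined by the heaviest edge."""
    best_weight, best_id = max(
        (weight, dst if src == person_id else src)
        for src, dst, weight in edges
        if person_id in (src, dst)
    )
    return vertices[best_id]["name"]


print(closest_friend(1))
```

Here the edge weight plays the role of "intimacy", and the vertex dictionary carries the per-person data.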

#### Graphs are a natural way of describing relationships and many different problem sets.

#### Some business use cases could be
Fraud Detection & Analytics - Spot Fraud Rings in Their Tracks
<br>Motif finding.
<br>Determining importance of papers in bibliographic networks (i.e., which papers are most referenced).
<br>Ranking web pages, as Google famously used the PageRank algorithm to do.
<br>Identity & Access Management - Track Roles, Groups and Assets like Never Before.
<br>Knowledge Graph - Augment Your Knowledge Graph with Highly Contextual Search Results.
<br>Master Data Management - Graphs Provide a 360° View of Your Data.
<br>Network and Database Infrastructure Monitoring for IT Operations - Manage and Monitor Complex Networks with Real-Time Insights.
<br>And many more.

### SPARK GraphX and GraphFrames
Spark provides several ways of working in this analytics paradigm.
<br>Spark has long contained an RDD-based library for performing graph processing: GraphX.
<br>This provided a very low-level interface that was extremely powerful, but just like RDDs, wasn’t easy to use or optimize.
<br>GraphX remains a core part of Spark.
<br>More recently, Spark developers created a next-generation graph analytics library on Spark: GraphFrames.
<br>GraphFrames extends GraphX to provide a DataFrame API and support for Spark’s different language bindings,
<br>so that users of Python can take advantage of the scalability of the tool.

<br>**GraphFrames** is currently available as a Spark package, 
<br>an external package that you need to load when you start up your Spark application, 
<br>but may be merged into the core of Spark in the future.

<br>**HOW DOES GRAPHFRAMES COMPARE TO GRAPH DATABASES?**
<br>Spark is not a database.
<br>Spark is a distributed computation engine, but it does not store data long-term or perform transactions.
<br>You can build a graph computation on top of Spark, but that’s fundamentally different from a database.
<br>GraphFrames can scale to much larger workloads than many graph databases and 
<br>performs well for analytics but does not support transactional processing and serving.

Before starting the Jupyter notebook, pass the external dependencies to the PySpark kernel:
<br>export PACKAGES="graphframes:graphframes:0.5.0-spark2.1-s_2.11"
<br>export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"

## Problem Statement
There are two datasets: airports.dat (data about all the airports), obtained from OpenFlights, and departureDelays.csv (departure delay data for flights), obtained from the US DOT.

**Analyze these datasets by**
<br>creating a graph structure from the airports and departure delays data, where
each airport is a node and each trip between two airports is an edge (relation/connection).

Upon building the graph structure, analyze the graph to find
<br>1. Delayed vs On-Time flights
<br>2. What flights departing from a specific location are most likely to have significant delays?
<br>3. What destinations tend to have delays?
<br>4. What destinations tend to have significant delays departing from a specific location?
<br>5. Degree wise (InDegree, OutDegree) analysis.
<br>6. Motif Findings.
<br>7. Determine the influential airports using PageRank.
<br>8. Find most popular trips.
<br>9. Find transfer cities/hubs.
<br>10. Use Breadth First Search to find the connections between two cities with 1 Hop, 2 Hops...
<br>11. Visualizations on On-Time/Early vs Delayed, Delayed from West Coast originated flights, All Flights etc.

### Dataset Descriptions
#### Airports data - airports.dat - 
#### OpenFlights: Airport, airline and route data : https://openflights.org/data.html
<br>**Airport ID** Unique OpenFlights identifier for this airport.  
<br>**Name** Name of airport. May or may not contain the City name. 
<br>**City** Main city served by airport. May be spelled differently from Name. 
<br>**Country** Country or territory where airport is located. See countries.dat to cross-reference to ISO 3166-1 codes.  
<br>**IATA** 3-letter IATA code. Null if not assigned/unknown. 
<br>**ICAO** 4-letter ICAO code - Null if not assigned. 
<br>**Latitude** Decimal degrees, usually to six significant digits. Negative is South, positive is North. 
<br>**Longitude** Decimal degrees, usually to six significant digits. Negative is West, positive is East. 
<br>**Altitude** In feet. 
<br>**Timezone** Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5. 
<br>**DST** Daylight savings time. 
<br>One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown). 
<br>**Tz database time zone** Timezone in "tz" (Olson) format, eg. "America/Los_Angeles".  
<br>**Type** Type of the airport. 
<br>Value "airport" for air terminals, 
<br>"station" for train stations, 
<br>"port" for ferry terminals and 
<br>"unknown" if not known. 
<br>In airports.csv, only type=airport is included.  
<br>**Source** Source of this data. 
<br>"OurAirports" for data sourced from OurAirports, 
<br>"Legacy" for old data not matched to OurAirports (mostly DAFIF), 
<br>"User" for unverified user contributions. 
<br>In airports.csv, only source=OurAirports is included.  

The data is UTF-8 (Unicode) encoded.
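
To make the field list above concrete, here is a minimal sketch that parses one airport row with Python's `csv` module. The sample values mirror the first row shown later in this notebook, but the quoting style of the raw line and the use of `\N` as the null marker are assumptions about the raw file, and `parse_airport` is a hypothetical helper.

```python
import csv

# Field names follow the schema used later in this notebook.
FIELDS = ["airportID", "airportName", "city", "country", "IATA", "ICAO",
          "latitude", "longitude", "altitude", "timezone", "dst",
          "tzDBTimezone", "type", "source"]
FLOAT_FIELDS = ("latitude", "longitude", "timezone")


def parse_airport(line):
    """Parse one comma-separated airport row into a dict keyed by field name."""
    values = next(csv.reader([line]))
    # Assumption: the two characters \N mark a missing value in the raw file.
    record = {name: (None if value == r"\N" else value)
              for name, value in zip(FIELDS, values)}
    # Timezone offsets can be fractional (e.g. 5.5), so parse them as floats.
    for name in FLOAT_FIELDS:
        if record[name] is not None:
            record[name] = float(record[name])
    return record


# Assumed raw form of the first airport row.
sample = (r'1,"Goroka Airport","Goroka","Papua New Guinea","GKA","AYGA",'
          r'-6.081689834590001,145.391998291,5282,10,"U",'
          r'"Pacific/Port_Moresby","airport","OurAirports"')
airport = parse_airport(sample)
print("{} {}".format(airport["IATA"], airport["timezone"]))
```

Parsing the timezone as a float keeps fractional offsets such as 5.5 intact, which an integer column would not.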


#### Departure Delays Data - departureDelays.csv 
#### Source: United States Department of Transportation: Bureau of Transportation Statistics (TranStats)
#### https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time 
#### April 2018 Data

<br>**flightDate** Flight Date (yyyy-mm-dd) 
<br>**originAirportID** Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport.
<br>**origin** Origin Airport.
<br>**originCity** Origin Airport, City Name.
<br>**originState** Origin Airport, State Code.
<br>**destAirportID** Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport.
<br>**destination** Destination Airport.
<br>**destinationCity** Destination Airport, City Name.
<br>**destinationState** Destination Airport, State Code.
<br>**depDelayInMinutes** Difference in minutes between scheduled and actual departure time. Early departures set to 0.
<br>**distanceInMiles** Distance between Airports (in miles)

#### Configure Spark Environment

In [1]:
## Set Python - Spark environment.
import os
import sys
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

#### Create and Initialize Spark Driver

In [2]:
## Create SparkContext, SparkSession
from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark import SparkContext
sc = SparkContext()

# warehouse_location points to the default location for managed databases and tables
warehouse_location = 'hdfs:///apps/hive/warehouse/'

spark = SparkSession \
    .builder \
    .appName("Spark Machine Learning Example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

#### Verify Spark Driver - Spark Context and Spark Sessions

In [3]:
## Verify Spark Context
sc

In [4]:
## Verify Spark Session
spark

#### Load Dependent Libraries

#### Spark libraries

In [5]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from graphframes import *

In [6]:
import numpy as np
import StringIO
import pandas as pd
import warnings

In [7]:
# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

import plotly
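# Use your own Plotly credentials here; avoid committing real API keys to a notebook.
plotly.tools.set_credentials_file(username='<your-plotly-username>', api_key='<your-plotly-api-key>')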

import plotly.figure_factory as ff
import plotly.plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go

# Initializing some settings
sns.set_style('whitegrid')
sns.set(color_codes=True)
warnings.filterwarnings('ignore')
pyoff.init_notebook_mode(connected=True)
get_ipython().magic('matplotlib inline')

#### Create dataframes from the 2 datasets.

#### Set file paths

In [8]:
#tripdelaysFilePath = "file:///home/rameshm/Datasets/flightDelays/Data/DepartureDelays/departureDelays.csv"
#airportsnaFilePath = "file:///home/rameshm/Datasets/flightDelays/Data/Airports/airports.dat"
tripdelaysFilePath = "/user/rameshm/datasets/flightData/DepartureDelays/departureDelays.csv"
airportsnaFilePath = "/user/rameshm/datasets/flightData/Airports/airports.dat"

#### Define schema for airports data

In [9]:
airportsDataSchema = StructType([
         StructField("airportID", IntegerType(), True),
         StructField("airportName", StringType(), True),
         StructField("city", StringType(), True),
         StructField("country", StringType(), True),
         StructField("IATA", StringType(), True),
         StructField("ICAO", StringType(), True),
         StructField("latitude", DoubleType(), True),
         StructField("longitude", DoubleType(), True),        
         StructField("altitude", IntegerType(), True),
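         # Note: timezone offsets can be fractional (e.g. 5.5), which IntegerType cannot hold; DoubleType would preserve them.
         StructField("timezone", IntegerType(), True),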
         StructField("dst", StringType(), True),
         StructField("tzDBTimezone", StringType(), True),
         StructField("type", StringType(), True),
         StructField("source", StringType(), True)])

#### Create dataframe from airports dataset

In [10]:
# An explicit schema is supplied, so schema inference is not needed.
airportsDF = spark.read.format("csv")\
            .option("header", "false")\
            .load(airportsnaFilePath, schema = airportsDataSchema)

#### Verify counts

In [11]:
print("Rows and columns in Airports dataset are {} and {}".
      format(airportsDF.count(), len(airportsDF.columns)))

Rows and columns in Airports dataset are 7184 and 14


#### Verify records

In [12]:
airportsDF.show(4)

+---------+--------------------+-----------+----------------+----+----+------------------+------------------+--------+--------+---+--------------------+-------+-----------+
|airportID|         airportName|       city|         country|IATA|ICAO|          latitude|         longitude|altitude|timezone|dst|        tzDBTimezone|   type|     source|
+---------+--------------------+-----------+----------------+----+----+------------------+------------------+--------+--------+---+--------------------+-------+-----------+
|        1|      Goroka Airport|     Goroka|Papua New Guinea| GKA|AYGA|-6.081689834590001|     145.391998291|    5282|      10|  U|Pacific/Port_Moresby|airport|OurAirports|
|        2|      Madang Airport|     Madang|Papua New Guinea| MAG|AYMD|    -5.20707988739|     145.789001465|      20|      10|  U|Pacific/Port_Moresby|airport|OurAirports|
|        3|Mount Hagen Kagam...|Mount Hagen|Papua New Guinea| HGU|AYMH|-5.826789855957031|144.29600524902344|    5388|      10|  U|Paci

#### Define schema for departure Delays data

In [13]:
departureDelaysSchema = StructType([
         StructField("flightDate", StringType(), True),
         StructField("originAirportID", IntegerType(), True),
         StructField("origin", StringType(), True),
         StructField("originCity", StringType(), True),
         StructField("originState", StringType(), True),
         StructField("destAirportID", IntegerType(), True),
         StructField("destination", StringType(), True),
         StructField("destinationCity", StringType(), True),
         StructField("destinationState", StringType(), True),
         StructField("depDelayInMinutes", DoubleType(), True),        
         StructField("distanceInMiles", DoubleType(), True)])

#### Create dataframe from departureDelays dataset

In [14]:
# An explicit schema is supplied, so schema inference is not needed.
departureDelaysDF = spark.read.format("csv")\
                    .option("header", "false")\
                    .load(tripdelaysFilePath, schema = departureDelaysSchema)

In [15]:
print("Rows and columns in Departure Delays dataset are {} and {}".
      format(departureDelaysDF.count(), len(departureDelaysDF.columns)))

Rows and columns in Departure Delays dataset are 596046 and 11


#### Cache departure delays dataframe

In [16]:
departureDelaysDF.cache()

DataFrame[flightDate: string, originAirportID: int, origin: string, originCity: string, originState: string, destAirportID: int, destination: string, destinationCity: string, destinationState: string, depDelayInMinutes: double, distanceInMiles: double]

In [17]:
departureDelaysDF.show(4)

+----------+---------------+------+----------------+-----------+-------------+-----------+---------------+----------------+-----------------+---------------+
|flightDate|originAirportID|origin|      originCity|originState|destAirportID|destination|destinationCity|destinationState|depDelayInMinutes|distanceInMiles|
+----------+---------------+------+----------------+-----------+-------------+-----------+---------------+----------------+-----------------+---------------+
|2018-04-01|          12266|   IAH|     Houston, TX|         TX|        10747|        BRO|Brownsville, TX|              TX|              0.0|          308.0|
|2018-04-01|          12915|   LCH|Lake Charles, LA|         LA|        12266|        IAH|    Houston, TX|              TX|              0.0|          127.0|
|2018-04-01|          12177|   HOB|       Hobbs, NM|         NM|        12266|        IAH|    Houston, TX|              TX|              0.0|          501.0|
|2018-04-01|          13930|   ORD|     Chicago, IL|

#### Define Tables/Views for the above dataframes to execute sql queries

In [23]:
airportsDF.createOrReplaceTempView("airportsDF_SQL")
departureDelaysDF.createOrReplaceTempView("departureDelaysDF_SQL")

The airports dataset contains details of all airports, so keep only the airports that appear in the trip dataset:
<br>1. Get the unique airport codes (IATA) by combining origin and destination from the departureDelays dataset
<br>2. Keep only those airports from airportsDF

#### Extract all unique IATA codes from the departure delays dataframe

In the data, the IATA codes are available in the origin and destination fields.
<br>Find the unique origin codes
<br>Find the unique destination codes
<br>Combine the two
<br>Find the unique/distinct set of the combined result

In [35]:
uniqueIATACodes = spark.sql("""SELECT DISTINCT iata, state FROM 
                           (SELECT DISTINCT origin AS iata, originState as state FROM departureDelaysDF_SQL
                            UNION ALL
                            SELECT DISTINCT destination AS iata, destinationState as state FROM departureDelaysDF_SQL)
                            a
                            """)

In [36]:
uniqueIATACodes.show(5)

+----+-----+
|iata|state|
+----+-----+
| SJC|   CA|
| MEI|   MS|
| BTV|   VT|
| ROC|   NY|
| SAF|   NM|
+----+-----+
only showing top 5 rows
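
The same "distinct origins, distinct destinations, combine, then de-duplicate" logic can be sketched with plain Python sets. This is an illustration only, not the Spark implementation; the three trips below mirror the first rows of the departure delays sample shown earlier.

```python
# Three trips taken from the departure delays sample shown earlier.
trips = [
    {"origin": "IAH", "originState": "TX", "destination": "BRO", "destinationState": "TX"},
    {"origin": "LCH", "originState": "LA", "destination": "IAH", "destinationState": "TX"},
    {"origin": "HOB", "originState": "NM", "destination": "IAH", "destinationState": "TX"},
]

# Distinct (code, state) pairs on each side.
origins = {(t["origin"], t["originState"]) for t in trips}
destinations = {(t["destination"], t["destinationState"]) for t in trips}

# Set union both combines the two sides and removes duplicates such as IAH.
unique_iata_codes = sorted(origins | destinations)
print(unique_iata_codes)
```

IAH appears both as an origin and as a destination, yet it ends up in the result only once.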



In [37]:
uniqueIATACodes.createOrReplaceTempView("uniqueIATACodes_SQL")

#### Keep only airports that exist in the above list
Only include airports with at least one trip in the departureDelays dataset

In [38]:
airportsNADF = spark.sql("""SELECT at.IATA, at.city, it.state, at.country 
                            FROM airportsDF_SQL at 
                            JOIN uniqueIATACodes_SQL it 
                            ON at.IATA = it.IATA""")

In [39]:
airportsNADF.cache()
airportsNADF.createOrReplaceTempView("airportsNADF_SQL")

In [40]:
airportsNADF.show(4)

+----+----------+-----+-------------+
|IATA|      city|state|      country|
+----+----------+-----+-------------+
| SJC|  San Jose|   CA|United States|
| MEI|  Meridian|   MS|United States|
| BTV|Burlington|   VT|United States|
| ROC| Rochester|   NY|United States|
+----+----------+-----+-------------+
only showing top 4 rows



#### Build departureDelays_imp DataFrame
Obtain key attributes such as Date of flight, delays, distance, and airport information (Origin, Destination)

In [41]:
departureDelays_imp = spark.sql("""SELECT 
                                    CAST(ddt.flightDate AS STRING) AS flightDate,
                                    CAST(ddt.depDelayInMinutes AS INT) AS departureDelayInMinutes, 
                                    CAST(ddt.distanceInMiles AS INT) AS distanceInMiles, 
                                    ddt.origin AS src, 
                                    ddt.destination AS dst, 
                                    ats.city AS src_city, 
                                    atd.city AS dst_city, 
                                    ddt.originState AS src_state, 
                                    ddt.destinationState AS dst_state 
                                    FROM departureDelaysDF_SQL ddt 
                                    JOIN airportsNADF_SQL ats 
                                    ON ats.iata = ddt.origin 
                                    JOIN airportsNADF_SQL atd 
                                    ON atd.iata = ddt.destination""")

#### Add a unique index column (sequential number) to the above dataframe

In [42]:
# Modified Schema
schemaNew  = StructType([StructField("tripId", LongType(), False)] 
                        + departureDelays_imp.schema.fields[:])

In [27]:
### from pyspark.sql.functions import monotonically_increasing_id
### This will return a new DF with all the columns + id

### * Returns monotonically increasing 64-bit integers.
### *
### * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
### * The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the 
### * record number within each partition. The assumption is that the data frame has less than 1 billion partitions, 
### * and each partition has less than 8 billion records.
###
### departureDelays_imp = departureDelays_imp.withColumn("tripId", monotonically_increasing_id())
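
The bit layout described in the comment above explains why such IDs are unique but not consecutive. The following plain-Python sketch only reproduces the arithmetic of that layout (partition ID in the upper bits, record number in the lower 33 bits); `compose_id` and `decompose_id` are hypothetical helpers, not Spark functions.

```python
RECORD_BITS = 33  # lower 33 bits: record number within the partition


def compose_id(partition_id, record_number):
    """Pack a partition ID and a per-partition record number into one integer."""
    return (partition_id << RECORD_BITS) + record_number


def decompose_id(generated_id):
    """Recover (partition_id, record_number) from a packed ID."""
    return generated_id >> RECORD_BITS, generated_id & ((1 << RECORD_BITS) - 1)


# Two partitions with three records each.
ids = [compose_id(p, r) for p in range(2) for r in range(3)]
print(ids)
# [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The jump from 2 to 8589934592 at the partition boundary is why this notebook uses `zipWithIndex` instead, which yields consecutive indices.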

In [43]:
# zipWithIndex pairs each Row with a consecutive index: (Row, index).
# Prepend the index as tripId and keep the original columns in order.
departureDelays_imp = (departureDelays_imp.rdd.zipWithIndex()
                       .map(lambda pair: (pair[1],) + tuple(pair[0]))
                       .toDF(schemaNew))

In [44]:
departureDelays_imp = departureDelays_imp.fillna(0)

In [45]:
departureDelays_imp.cache()
departureDelays_imp.createOrReplaceTempView("departureDelays_imp_SQL")

In [46]:
departureDelays_imp.show(4)

+------+----------+-----------------------+---------------+---+---+------------+-----------+---------+---------+
|tripId|flightDate|departureDelayInMinutes|distanceInMiles|src|dst|    src_city|   dst_city|src_state|dst_state|
+------+----------+-----------------------+---------------+---+---+------------+-----------+---------+---------+
|     0|2018-04-01|                      0|            308|IAH|BRO|     Houston|Brownsville|       TX|       TX|
|     1|2018-04-01|                      0|            127|LCH|IAH|Lake Charles|    Houston|       LA|       TX|
|     2|2018-04-01|                      0|            501|HOB|IAH|       Hobbs|    Houston|       NM|       TX|
|     3|2018-04-01|                      0|            299|ORD|DSM|     Chicago| Des Moines|       IL|       IA|
+------+----------+-----------------------+---------------+---+---+------------+-----------+---------+---------+
only showing top 4 rows



### Building the Graph
<br>Now that we've imported our data, we need to build our graph.
<br>To do so, we build the structure of the vertices (or nodes) and the structure of the edges.
<br>With GraphFrames this process is simple:
<br>- Rename the IATA airport code to id in the vertices table
<br>- Name the start and end airports src and dst in the edges table (flights)
<br>These are required naming conventions for vertices and edges in GraphFrames as of the time of this writing.

<br>**Note:** ensure you have already installed the GraphFrames Spark package.

In [47]:
from graphframes import *

**Create Vertices (airports) and Edges (flights)**
<br>Users can create GraphFrames from vertex and edge DataFrames.
<br>Vertex DataFrame: A vertex DataFrame should contain a special column named “id” which specifies unique IDs for each vertex in the graph.
<br>Edge DataFrame: An edge DataFrame should contain two special columns: “src” (source vertex ID of edge) and “dst” (destination vertex ID of edge).
<br>Both DataFrames can have arbitrary other columns. Those columns can represent vertex and edge attributes.
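
One sanity check worth running before building a graph is that vertex IDs are unique and that every edge endpoint refers to a known vertex. The sketch below does this on plain Python dictionaries rather than DataFrames; `dangling_edges` is a hypothetical helper and the sample rows are illustrative.

```python
def dangling_edges(vertices, edges):
    """Return the edges whose src or dst is not a known vertex id."""
    ids = {v["id"] for v in vertices}
    if len(ids) != len(vertices):
        raise ValueError("vertex ids must be unique")
    return [e for e in edges if e["src"] not in ids or e["dst"] not in ids]


# Illustrative data: HOB is deliberately missing from the vertices.
vertices = [
    {"id": "IAH", "city": "Houston"},
    {"id": "BRO", "city": "Brownsville"},
    {"id": "LCH", "city": "Lake Charles"},
]
edges = [
    {"src": "IAH", "dst": "BRO", "distanceInMiles": 308},
    {"src": "LCH", "dst": "IAH", "distanceInMiles": 127},
    {"src": "HOB", "dst": "IAH", "distanceInMiles": 501},
]

print(dangling_edges(vertices, edges))
```

In this notebook the inner joins against the filtered airports table serve a similar purpose, since trips without a matching airport are dropped before the edges are built.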

In [48]:
tripVertices = airportsNADF.withColumnRenamed("IATA", "id").distinct()
tripEdges = departureDelays_imp.select("tripId", "departureDelayInMinutes", 
                                       "distanceInMiles", "src", "dst", 
                                       "src_city", "src_state", "dst_city", "dst_state")

In [49]:
tripVertices.cache()
tripEdges.cache()

DataFrame[tripId: bigint, departureDelayInMinutes: int, distanceInMiles: int, src: string, dst: string, src_city: string, src_state: string, dst_city: string, dst_state: string]

#### Vertices - The vertices of our graph are the airports

In [50]:
tripVertices.show(4)

+---+----------+-----+-------------+
| id|      city|state|      country|
+---+----------+-----+-------------+
|OGG|   Kahului|   HI|United States|
|GSO|Greensboro|   NC|United States|
|DEN|    Denver|   CO|United States|
|CVG|Cincinnati|   KY|United States|
+---+----------+-----+-------------+
only showing top 4 rows



#### Edges - The edges of our graph are the flights between airports

In [51]:
tripEdges.show(4)

+------+-----------------------+---------------+---+---+------------+---------+-----------+---------+
|tripId|departureDelayInMinutes|distanceInMiles|src|dst|    src_city|src_state|   dst_city|dst_state|
+------+-----------------------+---------------+---+---+------------+---------+-----------+---------+
|     0|                      0|            308|IAH|BRO|     Houston|       TX|Brownsville|       TX|
|     1|                      0|            127|LCH|IAH|Lake Charles|       LA|    Houston|       TX|
|     2|                      0|            501|HOB|IAH|       Hobbs|       NM|    Houston|       TX|
|     3|                      0|            299|ORD|DSM|     Chicago|       IL| Des Moines|       IA|
+------+-----------------------+---------------+---+---+------------+---------+-----------+---------+
only showing top 4 rows



#### Build **tripGraph** GraphFrame
This GraphFrame is built from the vertices and edges derived from our trips (flights)

In [53]:
tripGraph = GraphFrame(tripVertices, tripEdges)
print(tripGraph)

GraphFrame(v:[id: string, city: string ... 2 more fields], e:[src: string, dst: string ... 7 more fields])


#### Build tripGraphPrime GraphFrame
This graphframe contains a smaller subset of data to make it easier to display motifs and subgraphs (below)

In [54]:
tripEdgesPrime = departureDelays_imp.select("tripId", "departureDelayInMinutes",
                                            "distanceInMiles", "src", "dst")
tripGraphPrime = GraphFrame(tripVertices, tripEdgesPrime)
print(tripGraphPrime)

GraphFrame(v:[id: string, city: string ... 2 more fields], e:[src: string, dst: string ... 3 more fields])


#### Determine the number of airports and trips

In [55]:
print("Airports: {} and Trips: {}".format(tripGraph.vertices.count(), tripGraph.edges.count()))

Airports: 337 and Trips: 595766


#### Determining the longest delay in this dataset

In [56]:
longestDelay = tripGraph.edges.groupBy().max("departureDelayInMinutes")
longestDelay.show()

+----------------------------+
|max(departureDelayInMinutes)|
+----------------------------+
|                        1659|
+----------------------------+



#### Determining the number of delayed vs. on-time / early flights

In [57]:
print("On-time / Early Flights: {}".format(tripGraph.edges.filter("departureDelayInMinutes <= 0").count()))
print("Delayed Flights: {}".format(tripGraph.edges.filter("departureDelayInMinutes > 0").count()))

On-time / Early Flights: 402635
Delayed Flights: 193131
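
As a quick worked check of the two counts above, the share of delayed flights can be computed directly. This sketch only uses the numbers printed by the previous cell.

```python
# Counts printed by the cell above.
on_time_or_early = 402635
delayed = 193131

total = on_time_or_early + delayed
delayed_share = 100.0 * delayed / total

print("Total trips: {}".format(total))
print("Delayed share: {:.1f}%".format(delayed_share))
```

The total matches the edge count reported earlier (595766), and roughly a third of the flights departed late.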


#### What flights departing 'SFO' are most likely to have significant delays?
Note: a delay of 0 means the flight left on time or early (early departures are recorded as 0), so only delays > 0 are averaged

In [58]:
tripGraph.edges\
.filter("src = 'SFO' AND departureDelayInMinutes > 0")\
.groupBy("src", "dst")\
.avg("departureDelayInMinutes")\
.sort(desc("avg(departureDelayInMinutes)")).show()

+---+---+----------------------------+
|src|dst|avg(departureDelayInMinutes)|
+---+---+----------------------------+
|SFO|MFR|           91.79166666666667|
|SFO|TUS|           84.19354838709677|
|SFO|SBA|           83.46511627906976|
|SFO|OMA|           81.33333333333333|
|SFO|EUG|                    81.21875|
|SFO|FAT|                        80.5|
|SFO|OKC|                        80.0|
|SFO|SUN|                        76.0|
|SFO|GEG|           74.61538461538461|
|SFO|BFL|           65.53846153846153|
|SFO|MIA|           61.84444444444444|
|SFO|STS|                      56.125|
|SFO|ABQ|                        56.0|
|SFO|AUS|           51.27272727272727|
|SFO|ACV|                        50.0|
|SFO|RDD|                        50.0|
|SFO|LIH|          49.888888888888886|
|SFO|EWR|          48.538812785388124|
|SFO|PSC|                        48.5|
|SFO|RDM|          48.285714285714285|
+---+---+----------------------------+
only showing top 20 rows
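
The filter, group, average, and sort steps of the query above can be sketched in plain Python. The delays below are made up for illustration and `avg_delay_by_destination` is a hypothetical helper. Note that, like the query, it averages only flights that were actually delayed, so it measures how long delays tend to be rather than how likely they are.

```python
from collections import defaultdict

# (src, dst, departureDelayInMinutes); values are illustrative only.
edges = [
    ("SFO", "MFR", 120),
    ("SFO", "MFR", 60),
    ("SFO", "TUS", 30),
    ("SFO", "TUS", 0),   # on time: excluded from the average
    ("SEA", "SFO", 45),  # different origin: excluded
]


def avg_delay_by_destination(edges, origin):
    """Average positive delay per destination for one origin, largest first."""
    totals = defaultdict(lambda: [0, 0])  # dst -> [sum of delays, count]
    for src, dst, delay in edges:
        if src == origin and delay > 0:
            totals[dst][0] += delay
            totals[dst][1] += 1
    averages = {dst: float(s) / n for dst, (s, n) in totals.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


print(avg_delay_by_destination(edges, "SFO"))
```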



#### What destinations tend to have delays?

In [62]:
tripDelays = tripGraph.edges.filter("departureDelayInMinutes > 0")\
.groupBy("dst_state")\
.avg("departureDelayInMinutes")

tripDelays.show(4)

+---------+----------------------------+
|dst_state|avg(departureDelayInMinutes)|
+---------+----------------------------+
|       AZ|          29.945888216447134|
|       SC|           35.99658469945355|
|       LA|          33.009387572641934|
|       MN|           49.33994184509648|
+---------+----------------------------+
only showing top 4 rows



In [63]:
df = tripDelays.toPandas()
df.head()

   dst_state  avg(departureDelayInMinutes)
0         AZ                     29.945888
1         SC                     35.996585
2         LA                     33.009388
3         MN                     49.339942
4         NJ                     59.410905


In [64]:
# Refer: https://plot.ly/python/choropleth-maps/
for col in df.columns:
    df[col] = df[col].astype(str)

scl = [[0.0, 'rgb(242,240,247)'],
       [0.2, 'rgb(218,218,235)'],
       [0.4, 'rgb(188,189,220)'],
       [0.6, 'rgb(158,154,200)'],
       [0.8, 'rgb(117,107,177)'],
       [1.0, 'rgb(84,39,143)']]

data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = df['dst_state'],
        z = df['avg(departureDelayInMinutes)'].astype(float),
        locationmode = 'USA-states',
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Delay in Minutes")
        ) ]

layout = dict(
        title = 'What destinations tend to have delays by State<br>(Hover for breakdown)',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )
    
fig = dict( data=data, layout=layout )
py.iplot( fig, filename='d3-cloropleth-map' )

#### What destinations tend to have significant delays departing from SEA?
Average delay by destination state for flights from Seattle (SEA), counting only individual delays > 100 minutes

In [65]:
tripDelaysSEA = tripGraph.edges.filter("src = 'SEA' and departureDelayInMinutes > 100")\
.groupBy("dst_state")\
.avg("departureDelayInMinutes")

tripDelaysSEA.show(4)

+---------+----------------------------+
|dst_state|avg(departureDelayInMinutes)|
+---------+----------------------------+
|       AZ|                       151.5|
|       MN|          461.14285714285717|
|       NJ|                     179.625|
|       OR|                       153.5|
+---------+----------------------------+
only showing top 4 rows



In [66]:
df = tripDelaysSEA.toPandas()
df.head()

   dst_state  avg(departureDelayInMinutes)
0         AZ                         151.5
1         MN                    461.142857
2         NJ                       179.625
3         OR                         153.5
4         VA                    149.333333


In [68]:
# Refer: https://plot.ly/python/choropleth-maps/
for col in df.columns:
    df[col] = df[col].astype(str)

scl = [[0.0, 'rgb(242,240,247)'],
       [0.2, 'rgb(218,218,235)'],
       [0.4, 'rgb(188,189,220)'],            
       [0.6, 'rgb(158,154,200)'],
       [0.8, 'rgb(117,107,177)'],
       [1.0, 'rgb(84,39,143)']]

data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = df['dst_state'],
        z = df['avg(departureDelayInMinutes)'].astype(float),
        locationmode = 'USA-states',
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Delay in Minutes")
        ) ]

layout = dict(
        title = 'What destinations tend to have significant delays departing from SEA<br>(Hover for breakdown)',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )
    
fig = dict( data=data, layout=layout )
py.iplot( fig, filename='d3-cloropleth-map' )

#### Vertex Degrees
<br>**inDegrees:** Incoming connections to the airport
<br>**outDegrees:** Outgoing connections from the airport
<br>**degrees:** Total connections to and from the airport
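
As a plain-Python sketch of these three definitions (not the GraphFrames implementation), degrees can be counted from a list of (src, dst) pairs. The four edges below mirror the first rows of the edges table shown earlier.

```python
from collections import Counter

# (src, dst) pairs from the first rows of the edges table.
edges = [("IAH", "BRO"), ("LCH", "IAH"), ("HOB", "IAH"), ("ORD", "DSM")]

out_degrees = Counter(src for src, _ in edges)  # outgoing connections
in_degrees = Counter(dst for _, dst in edges)   # incoming connections
degrees = out_degrees + in_degrees              # total connections

print("IAH in={} out={} total={}".format(
    in_degrees["IAH"], out_degrees["IAH"], degrees["IAH"]))
```

Each flight counts once towards the origin's out-degree and once towards the destination's in-degree, so parallel flights on the same route all add to the totals.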

#### Degree

In [69]:
# Degrees - The number of degrees - the number of incoming and outgoing connections - for various airports within this sample dataset
tripGraphDegrees = tripGraph.degrees.sort(desc("degree")).limit(20)
tripGraphDegrees.show()

+---+------+
| id|degree|
+---+------+
|ATL| 65412|
|ORD| 52871|
|DFW| 44998|
|CLT| 38884|
|DEN| 37891|
|LAX| 36564|
|PHX| 29797|
|SFO| 28768|
|LGA| 28743|
|IAH| 28134|
|LAS| 27215|
|DTW| 26761|
|MSP| 26256|
|BOS| 25286|
|EWR| 24447|
|MCO| 23550|
|SEA| 22268|
|DCA| 21756|
|JFK| 21459|
|PHL| 19727|
+---+------+



In [70]:
df = tripGraphDegrees.toPandas()
df.head()

    id  degree
0  ATL   65412
1  ORD   52871
2  DFW   44998
3  CLT   38884
4  DEN   37891


In [71]:
data = [go.Bar(x=df.id, y=df.degree)]
layout = go.Layout(title='Degree - Top 20 - Descending Order',)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic_bar1')

#### Motif Finding
Motifs are a way of expressing structural patterns in a graph. 
<br>When we specify a motif, we are querying for patterns in the data instead of actual data. 
<br>In GraphFrames, we specify our query in a domain-specific language similar to Neo4j’s Cypher language. 
<br>This language lets us specify combinations of vertices and edges and assign them names. 

<br>For example, if we want to specify that a given vertex a connects to another vertex b through an edge ab, 
<br>we would specify (a)-[ab]->(b). 
<br>The names inside parentheses or brackets do not signify values but instead 
<br>what the columns for matching vertices and edges should be named in the resulting DataFrame. 
<br>We can omit the names (e.g., (a)-[]->()) if we do not intend to query the resulting values.
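
As a rough illustration of what a two-edge motif such as (a)-[ab]->(b); (b)-[bc]->(c) asks for, the following plain-Python sketch enumerates matching pairs of edges over toy data. The data, the `delay` field, and the helper `find_two_hop` are hypothetical; this is not how GraphFrames evaluates motifs internally, only a picture of the result shape, with one named column per motif element.

```python
# Toy data (hypothetical): vertices keyed by id, edges as dicts.
vertices = {"MSP": {"id": "MSP"}, "SFO": {"id": "SFO"}, "GEG": {"id": "GEG"}}
edges = [
    {"src": "MSP", "dst": "SFO", "delay": 0},
    {"src": "SFO", "dst": "GEG", "delay": 700},
    {"src": "GEG", "dst": "SFO", "delay": 5},
]

def find_two_hop(vertices, edges):
    """Enumerate matches of the pattern (a)-[ab]->(b); (b)-[bc]->(c)."""
    matches = []
    for ab in edges:
        for bc in edges:
            if ab["dst"] == bc["src"]:  # both edges share the middle vertex b
                matches.append({
                    "a": vertices[ab["src"]],
                    "ab": ab,
                    "b": vertices[ab["dst"]],
                    "bc": bc,
                    "c": vertices[bc["dst"]],
                })
    return matches

matches = find_two_hop(vertices, edges)
# Filtering on a named element works like filtering on a column of the result.
through_sfo = [m for m in matches if m["b"]["id"] == "SFO"]
print(len(matches), len(through_sfo))
```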

City / Flight Relationships through Motif Finding
<br>To more easily understand the complex relationship of city airports and their flights with each other,
<br>we can use motifs to find patterns of airports (i.e. vertices) connected by flights (i.e. edges).
<br>The result is a DataFrame in which the column names are given by the motif keys.

<br>**What delays might we blame on SFO?**
<br>Using tripGraphPrime to more easily display
<br>- The associated edge (ab, bc) relationships
<br>- With the different cities / airports (a, b, c), where SFO is the connecting city (b)
<br>- Ensuring that flight ab (i.e. the flight to SFO) occurred before flight bc (i.e. the flight leaving SFO)
<br>- Note: TripID was generated based on time in the format of CCYYMMDD converted to int - Sequence Number
<br>  -- Therefore bc.tripid < ab.tripid + 10000 means the second flight (bc) occurred within approximately a day of the first flight (ab)

<br>**Note:** In reality, we would need to be more careful to link trips ab and bc.

In [72]:
motifs = tripGraphPrime.find("(a)-[ab]->(b); (b)-[bc]->(c)")\
  .filter("(b.id = 'SFO') and (ab.departureDelayInMinutes > 500 or bc.departureDelayInMinutes > 500) and bc.tripId > ab.tripId and bc.tripId < ab.tripId + 10000")

In [73]:
motifs.show(4, truncate = False)

+----------------------------------------+------------------------+---------------------------------------+--------------------------+---------------------------------+
|a                                       |ab                      |b                                      |bc                        |c                                |
+----------------------------------------+------------------------+---------------------------------------+--------------------------+---------------------------------+
|[MSP, Minneapolis, MN, United States]   |[734, 0, 1589, MSP, SFO]|[SFO, San Francisco, CA, United States]|[2100, 700, 733, SFO, GEG]|[GEG, Spokane, WA, United States]|
|[SNA, Santa Ana, CA, United States]     |[751, 0, 372, SNA, SFO] |[SFO, San Francisco, CA, United States]|[2100, 700, 733, SFO, GEG]|[GEG, Spokane, WA, United States]|
|[SLC, Salt Lake City, UT, United States]|[935, 0, 599, SLC, SFO] |[SFO, San Francisco, CA, United States]|[2100, 700, 733, SFO, GEG]|[GEG, Spokane, WA, Un

### Graph Algorithms
A graph is just a logical representation of data. 
<br>Graph theory provides numerous algorithms for analyzing data in this format, and 
<br>GraphFrames allows us to leverage many algorithms out of the box.
<br>Development continues as new algorithms are added to GraphFrames, 
<br>so this list will most likely continue to grow.

#### PageRank
One of the best-known graph algorithms is PageRank.
<br>Larry Page, cofounder of Google, created PageRank as a research project on how to rank web pages. 

_PageRank works by counting the number and quality of links to a page to determine a rough estimate 
of how important the website is. The underlying assumption is that more important websites are likely 
to receive more links from other websites._

#### Determining Airport Ranking using PageRank
<br>PageRank generalizes quite well outside of the web domain. 
<br>We can apply this right to our own data and get a sense for important airports (specifically, those that receive a lot of air traffic). 
<br>In this example, important airports will be assigned large PageRank values.

<br>There are a large number of flights and connections through these various airports included in this Departure Delay Dataset.
<br>Using the pageRank algorithm, Spark iteratively traverses the graph and determines a rough estimate of how important the airport is.
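
For intuition about the iterative traversal, here is a minimal plain-Python sketch of PageRank by power iteration. It is an assumption-laden illustration and not GraphFrames' implementation: it uses one common unnormalized formulation in which every vertex starts at 1.0 (so ranks can exceed 1), it ignores vertices with no outgoing edges, and the function name `pagerank` and the toy edge list are hypothetical. The parameter names mirror `resetProbability` and `maxIter`.

```python
def pagerank(edges, reset_probability=0.15, max_iter=5):
    """Unnormalized PageRank by power iteration over a directed edge list."""
    vertices = {v for edge in edges for v in edge}
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}  # assumed starting value
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            # Each vertex splits its current rank evenly across its outgoing edges.
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {
            v: reset_probability + (1 - reset_probability) * incoming[v]
            for v in vertices
        }
    return ranks

# Toy edge list (hypothetical): SFO receives edges from both other airports.
edges = [("SEA", "SFO"), ("LAX", "SFO"), ("SFO", "SEA"), ("SFO", "LAX"), ("SEA", "LAX")]
ranks = pagerank(edges)
for airport, rank in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(airport, round(rank, 4))
```

In this toy graph the vertex that receives the most rank-weighted incoming edges ends up on top, which is the same idea applied to airports in the cell below.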

In [75]:
# Determining Airport ranking of importance using `pageRank`
ranks = tripGraph.pageRank(resetProbability=0.15, maxIter=5)
ranksDisplay = ranks.vertices.orderBy(ranks.vertices.pagerank.desc()).limit(20)
ranksDisplay.show()

+---+-----------------+-----+-------------+------------------+
| id|             city|state|      country|          pagerank|
+---+-----------------+-----+-------------+------------------+
|ATL|          Atlanta|   GA|United States|16.915424137986847|
|ORD|          Chicago|   IL|United States|14.685568368446031|
|DFW|Dallas-Fort Worth|   TX|United States|13.754927110399024|
|DEN|           Denver|   CO|United States|10.509304186339417|
|CLT|        Charlotte|   NC|United States|10.286917907868556|
|MSP|      Minneapolis|   MN|United States| 8.215568646758799|
|LAX|      Los Angeles|   CA|United States| 7.999926238637892|
|DTW|          Detroit|   MI|United States| 7.592141699315214|
|IAH|          Houston|   TX|United States| 7.212897198095626|
|PHX|          Phoenix|   AZ|United States| 7.114367338553403|
|SFO|    San Francisco|   CA|United States| 6.962381254309524|
|LAS|        Las Vegas|   NV|United States| 6.336559210687976|
|LGA|         New York|   NY|United States| 6.233288533

In [76]:
df = ranksDisplay.toPandas()
df["id_city_state"] = df["id"].map(str) + ", " + df["city"] + ", " + df["state"]
df.head()

Unnamed: 0,id,city,state,country,pagerank,id_city_state
0,ATL,Atlanta,GA,United States,16.915424,"ATL, Atlanta, GA"
1,ORD,Chicago,IL,United States,14.685568,"ORD, Chicago, IL"
2,DFW,Dallas-Fort Worth,TX,United States,13.754927,"DFW, Dallas-Fort Worth, TX"
3,DEN,Denver,CO,United States,10.509304,"DEN, Denver, CO"
4,CLT,Charlotte,NC,United States,10.286918,"CLT, Charlotte, NC"


In [77]:
data = [go.Bar(x=df.id_city_state, y=df.pagerank)]
layout = go.Layout(title='PageRank - Top 20 - Descending Order',)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic_bar2')

#### Single Hops
Most popular flights (single city hops)
<br>Using the tripGraph, we can quickly determine which single-city-hop flights are the most popular.
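
The idea is simply a group-and-count over (src, dst) pairs. A minimal plain-Python sketch on a toy edge list (hypothetical data, not the dataset) looks like this; note that it counts every edge, whereas a Spark `count` over a column only counts non-null values.

```python
from collections import Counter

# Toy edge list (hypothetical): one (src, dst) tuple per flight.
edges = [
    ("ORD", "LGA"), ("ORD", "LGA"), ("ORD", "LGA"),
    ("LGA", "ORD"), ("LGA", "ORD"),
    ("SFO", "LAX"),
]

trip_counts = Counter(edges)            # group by (src, dst) and count
top_trips = trip_counts.most_common(2)  # highest counts first
print(top_trips)
```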

In [78]:
import pyspark.sql.functions as func

topTrips = tripGraph \
  .edges \
  .groupBy("src", "dst") \
  .agg(func.count("departureDelayInMinutes").alias("trips"))
  
topTrips20 = topTrips.orderBy(topTrips.trips.desc()).limit(20)

topTrips20.show(4)

+---+---+-----+
|src|dst|trips|
+---+---+-----+
|ORD|LGA| 1301|
|LGA|ORD| 1300|
|SFO|LAX| 1255|
|LAX|SFO| 1250|
+---+---+-----+
only showing top 4 rows



In [79]:
df = topTrips20.toPandas()
df["src_dst"] = df["src"].map(str) + "/" +df["dst"]
df.head()

Unnamed: 0,src,dst,trips,src_dst
0,ORD,LGA,1301,ORD/LGA
1,LGA,ORD,1300,LGA/ORD
2,SFO,LAX,1255,SFO/LAX
3,LAX,SFO,1250,LAX/SFO
4,JFK,LAX,1106,JFK/LAX


In [80]:
data = [go.Bar(x=df.src_dst, y=df.trips)]
layout = go.Layout(title='Top20 - Single HOPS',)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic_bar3')

#### Transfer Cities
Top Transfer Cities
<br>Many airports are used as transfer points instead of the final destination.
<br>An easy way to calculate this is by calculating the ratio of
<br>inDegree (the number of flights to the airport) / outDegree (the number of flights leaving the airport).
<br>Values close to 1 may indicate many transfers, whereas
<br>values < 1 indicate more outgoing than incoming flights and
<br>values > 1 indicate more incoming than outgoing flights.

<br>**Note:** This is a simple calculation that does not take the timing or scheduling of flights into account,
<br>just the overall aggregate number within the dataset.
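
The ratio can be sketched in plain Python as follows. The helper `degree_ratios` and the toy edge list are hypothetical; the sketch assumes inner-join behaviour, so an airport with no incoming or no outgoing flights is left out rather than producing a division by zero.

```python
from collections import Counter

def degree_ratios(edges):
    """Return inDegree / outDegree for vertices that have both kinds of edges."""
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    # Inner-join semantics: keep only vertices present in both counters.
    return {v: in_degree[v] / out_degree[v] for v in in_degree if v in out_degree}

# Toy edge list (hypothetical); BUF only has an outgoing flight.
edges = [
    ("SEA", "SFO"), ("SFO", "SEA"), ("LAX", "SFO"),
    ("SFO", "LAX"), ("SEA", "LAX"), ("BUF", "SEA"),
]
ratios = degree_ratios(edges)
# Same band as the notebook uses for "transfer" airports.
balanced = {v: r for v, r in ratios.items() if 0.9 <= r <= 1.1}
print(ratios, balanced)
```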

In [81]:
# Calculate the inDeg (flights into the airport) and outDeg (flights leaving the airport)
inDeg = tripGraph.inDegrees
outDeg = tripGraph.outDegrees

In [82]:
# Calculate the degreeRatio (inDeg/outDeg)
degreeRatio = inDeg.join(outDeg, inDeg.id == outDeg.id) \
  .drop(outDeg.id) \
  .selectExpr("id", "double(inDegree)/double(outDegree) as degreeRatio") \
  .cache()

In [83]:
# Join back to the `airports` DataFrame (instead of registering temp table as above)
nonTransferAirports = degreeRatio.join(airportsNADF, degreeRatio.id == airportsNADF.IATA) \
  .selectExpr("id", "city", "degreeRatio") \
  .filter("degreeRatio < .9 or degreeRatio > 1.1")

In [84]:
# List out the city airports which have abnormal degree ratios.
nonTransferAirports.show()

+---+----+-----------+
| id|city|degreeRatio|
+---+----+-----------+
+---+----+-----------+



In [87]:
# Join back to the `airports` DataFrame (instead of registering temp table as above)
transferAirports = degreeRatio.join(airportsNADF, degreeRatio.id == airportsNADF.IATA) \
  .selectExpr("id", "city", "degreeRatio") \
  .filter("degreeRatio between 0.9 and 1.1")

In [88]:
# List out the top 20 transfer city airports
transferAirportsDF = transferAirports.orderBy("degreeRatio").limit(20)
transferAirportsDF.show()

+---+-------------+------------------+
| id|         city|       degreeRatio|
+---+-------------+------------------+
|HDN|       Hayden|              0.92|
|TWF|   Twin Falls|0.9508196721311475|
|EGE|         Vail|0.9558823529411765|
|KOA|         Kona|0.9748784440842788|
|MTJ|  Montrose CO|0.9767441860465116|
|GGG|     Longview|0.9824561403508771|
|PUB|       Pueblo| 0.987012987012987|
|IAG|Niagara Falls|            0.9875|
|ERI|         Erie|0.9885057471264368|
|HYS|         Hays|0.9887640449438202|
|JAC|  Jacksn Hole|0.9894179894179894|
|LBE|      Latrobe|0.9896907216494846|
|ASE|        Aspen|0.9900744416873449|
|ROW|      Roswell|0.9901960784313726|
|PSC|        Pasco|0.9910714285714286|
|HLN|       Helena|0.9912280701754386|
|BZN|      Bozeman|0.9915254237288136|
|AGS|   Bush Field|0.9915966386554622|
|YUM|         Yuma|0.9916666666666667|
|EWN|     New Bern|             0.992|
+---+-------------+------------------+



In [89]:
df = transferAirportsDF.toPandas()
df["city_id"] = df["city"].map(str) + ", " +df["id"]
df.head()

Unnamed: 0,id,city,degreeRatio,city_id
0,HDN,Hayden,0.92,"Hayden, HDN"
1,TWF,Twin Falls,0.95082,"Twin Falls, TWF"
2,EGE,Vail,0.955882,"Vail, EGE"
3,KOA,Kona,0.974878,"Kona, KOA"
4,MTJ,Montrose CO,0.976744,"Montrose CO, MTJ"


In [90]:
data = [go.Bar(x=df.city_id, y=df.degreeRatio)]
layout = go.Layout(title='Top20 - Transfer Cities',)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic_bar4')

#### Breadth First Search
Breadth-first search (BFS) is designed to traverse the graph to quickly find the desired vertices 
<br>(i.e. airports) and edges (i.e. flights).
<br>Let's try to find the smallest number of connections between cities based on the dataset.
<br>**Note:** These examples do not take time or distance into account, just hops between cities.
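
To show the mechanics of a hop-limited search, here is a minimal plain-Python BFS sketch. The function `shortest_path`, its `max_path_length` argument, and the toy edge list are hypothetical and only loosely mirror the `bfs` call used in the examples; unlike that call, the sketch returns a single shortest path as a list of vertex ids, or `None` when nothing is reachable within the hop limit.

```python
from collections import deque

def shortest_path(edges, start, goal, max_path_length):
    """Return one fewest-hops path from start to goal, or None if over the limit."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, []).append(dst)

    queue = deque([[start]])  # each queue entry is a path of vertex ids
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 >= max_path_length:
            continue  # hop budget used up on this branch
        for nxt in neighbors.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy edge list (hypothetical): no direct SFO -> BUF edge, but one via DCA.
edges = [("SFO", "DCA"), ("DCA", "BUF"), ("SFO", "SEA"), ("SEA", "SFO")]
print(shortest_path(edges, "SEA", "SFO", 1))
print(shortest_path(edges, "SFO", "BUF", 1))
print(shortest_path(edges, "SFO", "BUF", 2))
```

Because BFS explores all one-hop paths before any two-hop path, the first path that reaches the goal uses the fewest hops.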

#### Example 1: Direct Seattle to San Francisco

In [91]:
filteredPaths1 = tripGraph.bfs(
  fromExpr = "id = 'SEA'",
  toExpr = "id = 'SFO'",
  maxPathLength = 1)

filteredPaths1.show()

+--------------------+--------------------+--------------------+
|                from|                  e0|                  to|
+--------------------+--------------------+--------------------+
|[SEA, Seattle, WA...|[1278, 0, 679, SE...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[3120, 0, 679, SE...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[3127, 0, 679, SE...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[3627, 8, 679, SE...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[3751, 0, 679, SE...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[3861, 1, 679, SE...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[4177, 0, 679, SE...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[4263, 0, 679, SE...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[11679, 38, 679, ...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[11683, 0, 679, S...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[11684, 0, 679, S...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[12147, 0, 679, S...|[SFO, San Francis...|
|[SEA, Seattle, WA...|[12

As you can see, there are a number of direct flights between Seattle and San Francisco.

#### Example 2: Direct San Francisco to Buffalo

In [92]:
filteredPaths2 = tripGraph.bfs(
  fromExpr = "id = 'SFO'",
  toExpr = "id = 'BUF'",
  maxPathLength = 1)

filteredPaths2.show()

+---+----+-----+-------+
| id|city|state|country|
+---+----+-----+-------+
+---+----+-----+-------+



As you can see, there are no direct flights between San Francisco and Buffalo.

#### Example 3: Flying from San Francisco to Buffalo

In [93]:
filteredPaths3 = tripGraph.bfs(
  fromExpr = "id = 'SFO'",
  toExpr = "id = 'BUF'",
  maxPathLength = 2)

filteredPaths3.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                from|                  e0|                  v1|                  e1|                  to|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[SFO, San Francis...|[2930, 44, 2442, ...|[DCA, Washington,...|[12834, 0, 296, D...|[BUF, Buffalo, NY...|
|[SFO, San Francis...|[2930, 44, 2442, ...|[DCA, Washington,...|[14648, 0, 296, D...|[BUF, Buffalo, NY...|
|[SFO, San Francis...|[2930, 44, 2442, ...|[DCA, Washington,...|[19593, 29, 296, ...|[BUF, Buffalo, NY...|
|[SFO, San Francis...|[2930, 44, 2442, ...|[DCA, Washington,...|[28025, 1, 296, D...|[BUF, Buffalo, NY...|
|[SFO, San Francis...|[2930, 44, 2442, ...|[DCA, Washington,...|[28053, 0, 296, D...|[BUF, Buffalo, NY...|
|[SFO, San Francis...|[2930, 44, 2442, ...|[DCA, Washington,...|[58129, 35, 296, ...|[BUF, Buffalo, NY...|
|[SFO, San Francis...|[2930, 44, 2442

As you can see, there are flights from San Francisco to Buffalo with Washington as the transfer point.

-----------------------------------------------------------------------------------------------------------------------------
#### On-time and Early Arrivals

In [94]:
OnTime_EarlyArrivals = spark.sql("""SELECT 
                                         src, dst AS dest, COUNT(1) AS count
                                         FROM departureDelays_imp_SQL 
                                         WHERE departureDelayInMinutes <= 0
                                         GROUP BY src, dst""")

In [95]:
OnTime_EarlyArrivals.show()

+---+----+-----+
|src|dest|count|
+---+----+-----+
|STS| PHX|   19|
|SPI| ORD|   74|
|MCI| IAH|  133|
|DSM| MCO|    2|
|PHL| MCO|  254|
|ORD| PDX|  131|
|ATL| GSP|  207|
|SMF| BUR|  139|
|SNA| PHX|  232|
|PBI| DCA|   95|
|SJC| LIH|   26|
|DSM| EWR|   25|
|FSD| ATL|   28|
|BQN| MCO|   16|
|MCI| MKE|   31|
|TPA| ACY|   25|
|SHD| LWB|   24|
|LAS| LIT|   10|
|PBG| PGD|    5|
|MDW| MEM|   21|
+---+----+-----+
only showing top 20 rows



#### Delayed Trips Departing from the West Coast
Notice that most of the delayed trips are with Western US cities. The query below counts delayed trips departing from CA, OR, and WA.

In [96]:
Delayed_from_CA_OR_WA = spark.sql("""SELECT 
                                     src, dst AS dest, COUNT(1) AS count 
                                     FROM departureDelays_imp_SQL 
                                     WHERE src_state in ('CA', 'OR', 'WA')
                                     AND departureDelayInMinutes > 0
                                     GROUP BY src, dst""")
                                  
Delayed_from_CA_OR_WA.show()

+---+----+-----+
|src|dest|count|
+---+----+-----+
|SMF| BUR|  104|
|SNA| PHX|  110|
|SJC| LIH|    4|
|STS| PHX|   11|
|SJC| ONT|   71|
|PSP| JFK|   15|
|SFO| TUS|   31|
|FAT| LAX|   48|
|LAX| PIT|   26|
|PSC| SLC|   12|
|SFO| BOI|   38|
|SMF| PHX|   83|
|SBA| LAX|   20|
|EUG| OAK|    6|
|SMF| OGG|    1|
|FAT| SAN|    8|
|LAX| SBP|   27|
|SEA| RNO|    5|
|BLI| AZA|    6|
|SBP| SFO|   30|
+---+----+-----+
only showing top 20 rows



#### All Flights
All Trips

In [97]:
allTrips = spark.sql("""SELECT
                        src, dst AS dest, COUNT(1) AS count
                        FROM departureDelays_imp_SQL 
                        GROUP BY src, dst""")

allTrips.show()

+---+----+-----+
|src|dest|count|
+---+----+-----+
|STS| PHX|   30|
|SPI| ORD|   86|
|MCI| IAH|  144|
|DSM| MCO|    3|
|PHL| MCO|  427|
|ORD| PDX|  186|
|TPA| ACY|   30|
|ATL| GSP|  313|
|LAS| LIT|   30|
|MCI| MKE|   52|
|MDW| MEM|   57|
|SMF| BUR|  243|
|SNA| PHX|  342|
|PBI| DCA|  125|
|SJC| LIH|   30|
|DSM| EWR|   30|
|PBG| PGD|    9|
|FSD| ATL|   30|
|BQN| MCO|   30|
|SHD| LWB|   27|
+---+----+-----+
only showing top 20 rows



#### Write the above 3 results to local storage

In [75]:
OnTime_EarlyArrivals.coalesce(1).write.format("csv").option("header", "true").save("file:///home/rameshm/Datasets/flightDelays/Results/OnTime_EarlyArrivals")

Delayed_from_CA_OR_WA.coalesce(1).write.format("csv").option("header", "true").save("file:///home/rameshm/Datasets/flightDelays/Results/Delayed_from_CA_OR_WA")

allTrips.coalesce(1).write.format("csv").option("header", "true").save("file:///home/rameshm/Datasets/flightDelays/Results/allTrips")