# Flight Data

In the first part of this exercise, you will load flight data from the domestic-flights/flights.parquet file and airport codes from the airport-codes/airport-codes.csv file. The cells below load the Parquet file with spark.read.parquet and the CSV file with spark.read.load.

As a first step, load both files into Spark and print the schemas. The flight data uses the International Air Transport Association (IATA) codes of the origin and destination airports. The IATA code is a three-letter code identifying the airport. For instance, Omaha’s Eppley Airfield is OMA, Baltimore-Washington International Airport is BWI, Los Angeles International Airport is LAX, and New York’s John F. Kennedy International Airport is JFK. The airport codes file contains information for each of the airports.

Import required modules

In [1]:
from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.appName('week4').getOrCreate()

In [2]:
# Create file paths including filenames
parquet_file_path = r'/home/ram/share/650/dsc650-master/data/domestic-flights/flights.parquet'

airportdata_filepath = r'/home/ram/share/650/dsc650-master/data/airport-codes/airport-codes.csv'


In [3]:
df_flight = spark.read.parquet(parquet_file_path)

In [4]:
df_flight.head(5)

[Row(origin_airport_code='MHK', destination_airport_code='AMW', origin_city='Manhattan, KS', destination_city='Ames, IA', passengers=21, seats=30, flights=1, distance=254.0, origin_population=122049, destination_population=86219, flight_year=2008, flight_month=10, __index_level_0__=0),
 Row(origin_airport_code='EUG', destination_airport_code='RDM', origin_city='Eugene, OR', destination_city='Bend, OR', passengers=41, seats=396, flights=22, distance=103.0, origin_population=284093, destination_population=76034, flight_year=1990, flight_month=11, __index_level_0__=1),
 Row(origin_airport_code='EUG', destination_airport_code='RDM', origin_city='Eugene, OR', destination_city='Bend, OR', passengers=88, seats=342, flights=19, distance=103.0, origin_population=284093, destination_population=76034, flight_year=1990, flight_month=12, __index_level_0__=2),
 Row(origin_airport_code='EUG', destination_airport_code='RDM', origin_city='Eugene, OR', destination_city='Bend, OR', passengers=11, seats=7

In [5]:
df_airpot_codes = spark.read.load(airportdata_filepath, format="csv", sep=",", inferschema=True, header=True)

df_airpot_codes.head(5)

[Row(ident='00A', type='heliport', name='Total Rf Heliport', elevation_ft=11.0, continent=None, iso_country='US', iso_region='US-PA', municipality='Bensalem', gps_code='00A', iata_code=None, local_code='00A', coordinates='-74.93360137939453, 40.07080078125'),
 Row(ident='00AA', type='small_airport', name='Aero B Ranch Airport', elevation_ft=3435.0, continent=None, iso_country='US', iso_region='US-KS', municipality='Leoti', gps_code='00AA', iata_code=None, local_code='00AA', coordinates='-101.473911, 38.704022'),
 Row(ident='00AK', type='small_airport', name='Lowell Field', elevation_ft=450.0, continent=None, iso_country='US', iso_region='US-AK', municipality='Anchor Point', gps_code='00AK', iata_code=None, local_code='00AK', coordinates='-151.695999146, 59.94919968'),
 Row(ident='00AL', type='small_airport', name='Epps Airpark', elevation_ft=820.0, continent=None, iso_country='US', iso_region='US-AL', municipality='Harvest', gps_code='00AL', iata_code=None, local_code='00AL', coordinat

In [6]:
#printing schemas
df_flight.printSchema()

root
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- passengers: long (nullable = true)
 |-- seats: long (nullable = true)
 |-- flights: long (nullable = true)
 |-- distance: double (nullable = true)
 |-- origin_population: long (nullable = true)
 |-- destination_population: long (nullable = true)
 |-- flight_year: long (nullable = true)
 |-- flight_month: long (nullable = true)
 |-- __index_level_0__: long (nullable = true)



In [7]:
df_airpot_codes.printSchema()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: double (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)



# Join Data

Join the flight data to the airport codes data by matching the IATA code of the origin airport to the IATA code in the airport codes file. Note that the airport codes file may not contain IATA codes for all of the origin and destination airports in the flight data. We still want information on those flights even if we cannot match them to a value in the airport codes file. This means you will want to use a left join instead of the default inner join.
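
To make the left-join behavior concrete, here is a minimal sketch in plain Python rather than Spark. The helper `left_join`, the toy rows, and the code `ZZZ` are hypothetical and exist only for illustration. Every row from the left side is kept, and the right-hand columns are filled with None when no airport matches.

```python
# Toy stand-ins for the two datasets (values are illustrative only).
flights = [
    {"origin_airport_code": "MHK", "passengers": 21},
    {"origin_airport_code": "ZZZ", "passengers": 5},  # hypothetical code with no match
]
airports = [
    {"iata_code": "MHK", "name": "Manhattan Regional Airport"},
    {"iata_code": None, "name": "Total Rf Heliport"},  # no IATA code, never matches
]


def left_join(left_rows, right_rows, left_key, right_key):
    """Keep every left row; fill right-hand columns with None when unmatched."""
    right_columns = sorted({column for row in right_rows for column in row})
    index = {}
    for row in right_rows:
        if row[right_key] is not None:  # as in SQL, a null key matches nothing
            index.setdefault(row[right_key], []).append(row)

    joined = []
    for left in left_rows:
        matches = index.get(left[left_key], [])
        if matches:
            joined.extend({**left, **right} for right in matches)
        else:
            joined.append({**left, **{column: None for column in right_columns}})
    return joined


joined_rows = left_join(flights, airports, "origin_airport_code", "iata_code")
for row in joined_rows:
    print(row)
```

An inner join would drop the `ZZZ` row entirely, which is why the exercise asks for a left join.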

In [8]:
joinexpression =  df_flight['origin_airport_code'] == df_airpot_codes['iata_code']
joinType = "left_outer"

In [136]:
df_flight.join(df_airpot_codes,joinexpression,joinType).show(3)

+-------------------+------------------------+-------------+----------------+----------+-----+-------+--------+-----------------+----------------------+-----------+------------+-----------------+-----+--------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|origin_airport_code|destination_airport_code|  origin_city|destination_city|passengers|seats|flights|distance|origin_population|destination_population|flight_year|flight_month|__index_level_0__|ident|          type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-------------------+------------------------+-------------+----------------+----------+-----+-------+--------+-----------------+----------------------+-----------+------------+-----------------+-----+--------------+--------------------+------------+---------+-----------+----------+------------+--------

In [10]:
df_merged = df_flight.join(df_airpot_codes,joinexpression,joinType)

In [135]:
df_merged.head(2)

[Row(origin_airport_code='MHK', destination_airport_code='AMW', origin_city='Manhattan, KS', destination_city='Ames, IA', passengers=21, seats=30, flights=1, distance=254.0, origin_population=122049, destination_population=86219, flight_year=2008, flight_month=10, __index_level_0__=0, ident='KMHK', type='medium_airport', name='Manhattan Regional Airport', elevation_ft=1057.0, continent=None, iso_country='US', iso_region='US-KS', municipality='Manhattan', gps_code='KMHK', iata_code='MHK', local_code='MHK', coordinates='-96.6707992553711, 39.14099884033203'),
 Row(origin_airport_code='EUG', destination_airport_code='RDM', origin_city='Eugene, OR', destination_city='Bend, OR', passengers=41, seats=396, flights=22, distance=103.0, origin_population=284093, destination_population=76034, flight_year=1990, flight_month=11, __index_level_0__=1, ident='KEUG', type='medium_airport', name='Mahlon Sweet Field', elevation_ft=374.0, continent=None, iso_country='US', iso_region='US-OR', municipality=

In [12]:
df_merged.printSchema()

root
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- passengers: long (nullable = true)
 |-- seats: long (nullable = true)
 |-- flights: long (nullable = true)
 |-- distance: double (nullable = true)
 |-- origin_population: long (nullable = true)
 |-- destination_population: long (nullable = true)
 |-- flight_year: long (nullable = true)
 |-- flight_month: long (nullable = true)
 |-- __index_level_0__: long (nullable = true)
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: double (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_c

# Rename and Remove columns

Next, we want to rename some of the joined columns and remove unneeded columns. Remove the following columns from the joined dataframe.

    __index_level_0__
    ident
    local_code
    continent
    iso_country
    iata_code

Rename the following columns.

    type: origin_airport_type
    name: origin_airport_name
    elevation_ft: origin_airport_elevation_ft
    iso_region: origin_airport_region
    municipality: origin_airport_municipality
    gps_code: origin_airport_gps_code
    coordinates: origin_airport_coordinates
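
Because the same drop and rename lists are reused later with a different prefix, it can help to express them as data. The following is a minimal sketch in plain Python that operates on a single row represented as a dict. The names `DROP_COLUMNS`, `RENAME_SUFFIXES`, `build_rename_map`, and `tidy_row` are hypothetical helpers, not part of the exercise.

```python
# Columns to remove, and the suffix each kept column receives (from the lists above).
DROP_COLUMNS = {
    "__index_level_0__", "ident", "local_code",
    "continent", "iso_country", "iata_code",
}
RENAME_SUFFIXES = {
    "type": "type",
    "name": "name",
    "elevation_ft": "elevation_ft",
    "iso_region": "region",
    "municipality": "municipality",
    "gps_code": "gps_code",
    "coordinates": "coordinates",
}


def build_rename_map(prefix):
    """Map each original column name to its prefixed name."""
    return {old: prefix + suffix for old, suffix in RENAME_SUFFIXES.items()}


def tidy_row(row, prefix):
    """Drop the unneeded columns, then rename the remaining airport columns."""
    rename_map = build_rename_map(prefix)
    return {
        rename_map.get(key, key): value
        for key, value in row.items()
        if key not in DROP_COLUMNS
    }


sample = {
    "origin_airport_code": "MHK",
    "ident": "KMHK",
    "type": "medium_airport",
    "iso_region": "US-KS",
    "iata_code": "MHK",
}
tidied = tidy_row(sample, "origin_airport_")
print(tidied)
```

On a Spark DataFrame, the same mapping could drive a loop of `withColumnRenamed` calls, and `build_rename_map("destination_airport_")` would give the names for the destination step.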


In [21]:
df_merged_modified = df_merged.drop("__index_level_0__","ident","local_code","continent","iso_country","iata_code")

In [22]:
df_merged.printSchema()

root
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- passengers: long (nullable = true)
 |-- seats: long (nullable = true)
 |-- flights: long (nullable = true)
 |-- distance: double (nullable = true)
 |-- origin_population: long (nullable = true)
 |-- destination_population: long (nullable = true)
 |-- flight_year: long (nullable = true)
 |-- flight_month: long (nullable = true)
 |-- __index_level_0__: long (nullable = true)
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: double (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_c

In [24]:
df_merged_modified.printSchema()

root
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- passengers: long (nullable = true)
 |-- seats: long (nullable = true)
 |-- flights: long (nullable = true)
 |-- distance: double (nullable = true)
 |-- origin_population: long (nullable = true)
 |-- destination_population: long (nullable = true)
 |-- flight_year: long (nullable = true)
 |-- flight_month: long (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: double (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- coordinates: string (nullable = true)



In [32]:
df_merged_modified2 = df_merged_modified.withColumnRenamed("type","origin_airport_type")\
                                        .withColumnRenamed("name","origin_airport_name")\
                                        .withColumnRenamed("elevation_ft","origin_airport_elevation_ft")\
                                        .withColumnRenamed("iso_region","origin_airport_region")\
                                        .withColumnRenamed("municipality","origin_airport_municipality")\
                                        .withColumnRenamed("gps_code","origin_airport_gps_code")\
                                        .withColumnRenamed("coordinates","origin_airport_coordinates")

In [33]:
df_merged_modified2.printSchema()

root
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- passengers: long (nullable = true)
 |-- seats: long (nullable = true)
 |-- flights: long (nullable = true)
 |-- distance: double (nullable = true)
 |-- origin_population: long (nullable = true)
 |-- destination_population: long (nullable = true)
 |-- flight_year: long (nullable = true)
 |-- flight_month: long (nullable = true)
 |-- origin_airport_type: string (nullable = true)
 |-- origin_airport_name: string (nullable = true)
 |-- origin_airport_elevation_ft: double (nullable = true)
 |-- origin_airport_region: string (nullable = true)
 |-- origin_airport_municipality: string (nullable = true)
 |-- origin_airport_gps_code: string (nullable = true)
 |-- origin_airport_coordinates: string (nullable = true)



# Join to Destination Airport

Repeat the join and the rename-and-remove steps above, this time joining the airport codes file to the destination airport instead of the origin airport. Drop the same columns and rename the same columns using the prefix destination_airport_ instead of origin_airport_. Print the schema of the resultant dataframe. The final schema and dataframe should contain the added information (name, region, coordinates, …) for both the destination and origin airports.

In [106]:
joinexpression2 =  df_merged_modified2['destination_airport_code'] == df_airpot_codes['iata_code']
joinType2 = "left_outer"


In [108]:
df_merged_modified2.join(df_airpot_codes,joinexpression2,joinType2).show(2)


+-------------------+------------------------+-------------+----------------+----------+-----+-------+--------+-----------------+----------------------+-----------+------------+-------------------+--------------------+---------------------------+---------------------+---------------------------+-----------------------+--------------------------+-----+--------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|origin_airport_code|destination_airport_code|  origin_city|destination_city|passengers|seats|flights|distance|origin_population|destination_population|flight_year|flight_month|origin_airport_type| origin_airport_name|origin_airport_elevation_ft|origin_airport_region|origin_airport_municipality|origin_airport_gps_code|origin_airport_coordinates|ident|          type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|


In [109]:
df_merged_modified_dest= df_merged_modified2.join(df_airpot_codes,joinexpression2,joinType2)

In [110]:
df_merged_modified_dest.printSchema()

root
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- passengers: long (nullable = true)
 |-- seats: long (nullable = true)
 |-- flights: long (nullable = true)
 |-- distance: double (nullable = true)
 |-- origin_population: long (nullable = true)
 |-- destination_population: long (nullable = true)
 |-- flight_year: long (nullable = true)
 |-- flight_month: long (nullable = true)
 |-- origin_airport_type: string (nullable = true)
 |-- origin_airport_name: string (nullable = true)
 |-- origin_airport_elevation_ft: double (nullable = true)
 |-- origin_airport_region: string (nullable = true)
 |-- origin_airport_municipality: string (nullable = true)
 |-- origin_airport_gps_code: string (nullable = true)
 |-- origin_airport_coordinates: string (nullable = true)
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |--

In [111]:

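# iata_code is not dropped here; it is used below as the destination IATA code.
# __index_level_0__ was already removed in the origin step, so listing it again has no effect.
df_merged_modified_dest2 = df_merged_modified_dest.drop("__index_level_0__","ident","local_code","continent","iso_country")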


In [112]:
df_merged_modified_dest2.printSchema()

root
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- passengers: long (nullable = true)
 |-- seats: long (nullable = true)
 |-- flights: long (nullable = true)
 |-- distance: double (nullable = true)
 |-- origin_population: long (nullable = true)
 |-- destination_population: long (nullable = true)
 |-- flight_year: long (nullable = true)
 |-- flight_month: long (nullable = true)
 |-- origin_airport_type: string (nullable = true)
 |-- origin_airport_name: string (nullable = true)
 |-- origin_airport_elevation_ft: double (nullable = true)
 |-- origin_airport_region: string (nullable = true)
 |-- origin_airport_municipality: string (nullable = true)
 |-- origin_airport_gps_code: string (nullable = true)
 |-- origin_airport_coordinates: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- 

In [113]:
df_merged_modified_dest_final = df_merged_modified_dest2.withColumnRenamed("type","destination_airport_type")\
                                        .withColumnRenamed("name","destination_airport_name")\
                                        .withColumnRenamed("elevation_ft","destination_airport_elevation_ft")\
                                        .withColumnRenamed("iso_region","destination_airport_region")\
                                        .withColumnRenamed("municipality","destination_airport_municipality")\
                                        .withColumnRenamed("gps_code","destination_airport_gps_code")\
                                        .withColumnRenamed("coordinates","destination_airport_coordinates")

In [114]:
df_merged_modified_dest_final.printSchema()

root
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- passengers: long (nullable = true)
 |-- seats: long (nullable = true)
 |-- flights: long (nullable = true)
 |-- distance: double (nullable = true)
 |-- origin_population: long (nullable = true)
 |-- destination_population: long (nullable = true)
 |-- flight_year: long (nullable = true)
 |-- flight_month: long (nullable = true)
 |-- origin_airport_type: string (nullable = true)
 |-- origin_airport_name: string (nullable = true)
 |-- origin_airport_elevation_ft: double (nullable = true)
 |-- origin_airport_region: string (nullable = true)
 |-- origin_airport_municipality: string (nullable = true)
 |-- origin_airport_gps_code: string (nullable = true)
 |-- origin_airport_coordinates: string (nullable = true)
 |-- destination_airport_type: string (nullable = true)
 |-- destination_airp

In [115]:
df_merged_modified_dest_final.head(2)

[Row(origin_airport_code='MHK', destination_airport_code='AMW', origin_city='Manhattan, KS', destination_city='Ames, IA', passengers=21, seats=30, flights=1, distance=254.0, origin_population=122049, destination_population=86219, flight_year=2008, flight_month=10, origin_airport_type='medium_airport', origin_airport_name='Manhattan Regional Airport', origin_airport_elevation_ft=1057.0, origin_airport_region='US-KS', origin_airport_municipality='Manhattan', origin_airport_gps_code='KMHK', origin_airport_coordinates='-96.6707992553711, 39.14099884033203', destination_airport_type='small_airport', destination_airport_name='Ames Municipal Airport', destination_airport_elevation_ft=956.0, destination_airport_region='US-IA', destination_airport_municipality='Ames', destination_airport_gps_code='KAMW', iata_code='AMW', destination_airport_coordinates='-93.621803, 41.992001'),
 Row(origin_airport_code='EUG', destination_airport_code='RDM', origin_city='Eugene, OR', destination_city='Bend, OR',

# Top Ten Airports

Create a dataframe using only data from 2008. This dataframe will be a report of the top ten airports by the number of inbound passengers. This dataframe should contain the following fields:

    Rank (1-10)
    Name
    IATA code
    Total Inbound Passengers
    Total Inbound Flights
    Average Daily Passengers
    Average Inbound Flights

Show the results of this dataframe using the show method.

In [116]:
df_merged_modified_dest_final.createOrReplaceTempView("dfTable")

In [117]:
sqlcount = spark.sql("SELECT COUNT(*) FROM dfTable")

sqlcount.show()

+--------+
|count(1)|
+--------+
| 3606803|
+--------+



In [120]:
# dataframe with data from 2008
df_2008 = spark.sql("SELECT * FROM dfTable where flight_year = 2008")
df_2008.head(2)

[Row(origin_airport_code='MHK', destination_airport_code='AMW', origin_city='Manhattan, KS', destination_city='Ames, IA', passengers=21, seats=30, flights=1, distance=254.0, origin_population=122049, destination_population=86219, flight_year=2008, flight_month=10, origin_airport_type='medium_airport', origin_airport_name='Manhattan Regional Airport', origin_airport_elevation_ft=1057.0, origin_airport_region='US-KS', origin_airport_municipality='Manhattan', origin_airport_gps_code='KMHK', origin_airport_coordinates='-96.6707992553711, 39.14099884033203', destination_airport_type='small_airport', destination_airport_name='Ames Municipal Airport', destination_airport_elevation_ft=956.0, destination_airport_region='US-IA', destination_airport_municipality='Ames', destination_airport_gps_code='KAMW', iata_code='AMW', destination_airport_coordinates='-93.621803, 41.992001'),
 Row(origin_airport_code='SEA', destination_airport_code='RDM', origin_city='Seattle, WA', destination_city='Bend, OR'

In [121]:
df_2008.printSchema()

root
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- passengers: long (nullable = true)
 |-- seats: long (nullable = true)
 |-- flights: long (nullable = true)
 |-- distance: double (nullable = true)
 |-- origin_population: long (nullable = true)
 |-- destination_population: long (nullable = true)
 |-- flight_year: long (nullable = true)
 |-- flight_month: long (nullable = true)
 |-- origin_airport_type: string (nullable = true)
 |-- origin_airport_name: string (nullable = true)
 |-- origin_airport_elevation_ft: double (nullable = true)
 |-- origin_airport_region: string (nullable = true)
 |-- origin_airport_municipality: string (nullable = true)
 |-- origin_airport_gps_code: string (nullable = true)
 |-- origin_airport_coordinates: string (nullable = true)
 |-- destination_airport_type: string (nullable = true)
 |-- destination_airp

In [122]:
display(df_2008.head(2))

[Row(origin_airport_code='MHK', destination_airport_code='AMW', origin_city='Manhattan, KS', destination_city='Ames, IA', passengers=21, seats=30, flights=1, distance=254.0, origin_population=122049, destination_population=86219, flight_year=2008, flight_month=10, origin_airport_type='medium_airport', origin_airport_name='Manhattan Regional Airport', origin_airport_elevation_ft=1057.0, origin_airport_region='US-KS', origin_airport_municipality='Manhattan', origin_airport_gps_code='KMHK', origin_airport_coordinates='-96.6707992553711, 39.14099884033203', destination_airport_type='small_airport', destination_airport_name='Ames Municipal Airport', destination_airport_elevation_ft=956.0, destination_airport_region='US-IA', destination_airport_municipality='Ames', destination_airport_gps_code='KAMW', iata_code='AMW', destination_airport_coordinates='-93.621803, 41.992001'),
 Row(origin_airport_code='SEA', destination_airport_code='RDM', origin_city='Seattle, WA', destination_city='Bend, OR'

In [123]:
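# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum, min, and max
from pyspark.sql.functions import col, count, countDistinct, dense_rank, desc, mean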
from pyspark.sql.window import Window

In [124]:
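# Exploratory pass: count() returns the number of rows per airport, not a passenger or flight total.
# The mean columns reuse the aliases of the count columns, so the output has duplicate column names.
df_2008.groupBy("destination_airport_name","iata_code")\
        .agg(count("passengers").alias("Total Inbound Passengers"),
            count("flights").alias("Total Inbound Flights"),
            mean("passengers").alias("Total Inbound Passengers"),
            mean("flights").alias("Total Inbound Flights"),
            ).show()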

+------------------------+---------+------------------------+---------------------+------------------------+---------------------+
|destination_airport_name|iata_code|Total Inbound Passengers|Total Inbound Flights|Total Inbound Passengers|Total Inbound Flights|
+------------------------+---------+------------------------+---------------------+------------------------+---------------------+
|    Newark Liberty In...|      EWR|                    4680|                 4680|      2378.2735042735044|   32.577564102564104|
|    Astoria Regional ...|      AST|                       5|                    5|                     4.0|                  1.2|
|    Cavern City Air T...|      CNM|                      37|                   37|      102.13513513513513|    34.32432432432432|
|    Alexandria Intern...|      AEX|                     196|                  196|       634.8979591836735|   25.678571428571427|
|    Rochester Interna...|      RST|                     273|                  273|

In [125]:
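# count() gives the number of rows per airport and countDistinct() the number of distinct
# values in the flights column; mean() averages over rows (one per route and month), not over days.
df_2008_temp = df_2008.groupBy("destination_airport_name","iata_code")\
        .agg(count("passengers").alias("Total Inbound Passengers"),
            countDistinct("flights").alias("Total Inbound Flights"),
            mean("passengers").alias("Average Daily Passengers"),
            mean("flights").alias("Average Daily Flights"),
            )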

In [126]:
df_2008_rank = df_2008_temp\
        .withColumn('Rank',dense_rank().over(Window.orderBy(desc('Total Inbound Passengers'))))\
        .withColumnRenamed('destination_airport_name', 'Name')\
        .withColumnRenamed('iata_code', 'IATA code')\
        .filter(col('Rank') <=10)


In [127]:
df_2008_rank.show()

+--------------------+---------+------------------------+---------------------+------------------------+---------------------+----+
|                Name|IATA code|Total Inbound Passengers|Total Inbound Flights|Average Daily Passengers|Average Daily Flights|Rank|
+--------------------+---------+------------------------+---------------------+------------------------+---------------------+----+
|Chicago O'Hare In...|      ORD|                    9479|                  265|      2784.9765798079966|    37.61683721911594|   1|
|Hartsfield Jackso...|      ATL|                    8775|                  276|       4052.626210826211|    45.03612535612535|   2|
|Charlotte Douglas...|      CLT|                    6152|                  189|      2444.4878088426526|    33.32899869960988|   3|
|Minneapolis-St Pa...|      MSP|                    5988|                  175|      2354.3926185704745|   29.740313961255843|   4|
|Philadelphia Inte...|      PHL|                    5880|                  1
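
In the report above, `count` returns the number of rows in each group (one row per route and month) rather than a passenger total, and `mean` averages over those rows rather than over days. As a sketch of the arithmetic the report calls for, the plain-Python example below sums passengers and flights per destination and divides by the number of days in the year. It assumes that "average daily" means the yearly total divided by the 366 days of 2008 (a leap year). The rows, the codes `AAA`, `BBB`, and `CCC`, and the helper `top_inbound` are made up for illustration.

```python
from collections import defaultdict

DAYS_IN_2008 = 366  # 2008 was a leap year

# Made-up rows: one per route and month, as in the flight data.
rows = [
    {"iata_code": "AAA", "passengers": 3660, "flights": 36},
    {"iata_code": "AAA", "passengers": 3660, "flights": 37},
    {"iata_code": "BBB", "passengers": 10980, "flights": 100},
    {"iata_code": "CCC", "passengers": 366, "flights": 6},
]


def top_inbound(rows, limit=10):
    """Sum passengers and flights per destination and rank by total passengers."""
    totals = defaultdict(lambda: {"passengers": 0, "flights": 0})
    for row in rows:
        totals[row["iata_code"]]["passengers"] += row["passengers"]
        totals[row["iata_code"]]["flights"] += row["flights"]

    ranked = sorted(
        totals.items(), key=lambda item: item[1]["passengers"], reverse=True
    )
    report = []
    for rank, (code, total) in enumerate(ranked[:limit], start=1):
        report.append({
            "Rank": rank,
            "IATA code": code,
            "Total Inbound Passengers": total["passengers"],
            "Total Inbound Flights": total["flights"],
            "Average Daily Passengers": total["passengers"] / DAYS_IN_2008,
            "Average Daily Flights": total["flights"] / DAYS_IN_2008,
        })
    return report


report = top_inbound(rows)
for line in report:
    print(line)
```

In PySpark, the corresponding change would be to aggregate with `sum` from `pyspark.sql.functions` instead of `count` and `countDistinct`, and to derive the daily averages from those totals.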

# User Defined Functions

The latitude and longitude coordinates for the destination and origin airports are string values and not numeric. You will create a user-defined function in Python that will convert the string coordinates into numeric coordinates. Below is the Python code that will help you create and use this user-defined function.

In [128]:
from pyspark.sql.functions import udf

In [129]:
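@udf('double')
def get_latitude(coordinates):
    # Airports left unmatched by the left joins have null coordinates
    if coordinates is None:
        return None
    split_coords = coordinates.split(',')
    if len(split_coords) != 2:
        return None

    # Takes the first value; in the sample rows this value (e.g. -96.67) lies outside
    # the latitude range, so check the column order for your data
    return float(split_coords[0].strip())


@udf('double')
def get_longitude(coordinates):
    # Airports left unmatched by the left joins have null coordinates
    if coordinates is None:
        return None
    split_coords = coordinates.split(',')
    if len(split_coords) != 2:
        return None

    # Takes the second value in the string
    return float(split_coords[1].strip())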
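
One detail is worth checking before relying on these functions. In the sample rows shown earlier, the first value in the coordinates string (for example -96.67 for Manhattan, KS) is outside the valid latitude range of -90 to 90, which suggests the strings are stored as "longitude, latitude". The following plain-Python sketch parses under that assumption and also tolerates missing values. The function name `parse_coordinates` is hypothetical, and the ordering should be verified against the data before reuse.

```python
def parse_coordinates(coordinates):
    """Return (latitude, longitude) as floats, or (None, None) if unparseable.

    Assumes the string is ordered 'longitude, latitude'.
    """
    if coordinates is None:  # e.g. an airport left unmatched by a left join
        return (None, None)

    parts = coordinates.split(",")
    if len(parts) != 2:
        return (None, None)

    try:
        longitude = float(parts[0].strip())
        latitude = float(parts[1].strip())
    except ValueError:
        return (None, None)

    # Sanity check: reject values that cannot be a latitude/longitude pair.
    if not (-90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0):
        return (None, None)
    return (latitude, longitude)


latitude, longitude = parse_coordinates("-96.6707992553711, 39.14099884033203")
print(latitude, longitude)
```

The same checks could be moved into the two UDFs, one returning the latitude and the other the longitude.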



In [133]:
df__final = df_merged_modified_dest_final\
.withColumn('origin_airport_latitude',get_latitude(df_merged_modified_dest_final['origin_airport_coordinates']))\
.withColumn('origin_airport_longitude',get_longitude(df_merged_modified_dest_final['origin_airport_coordinates']))\
.withColumn('destination_airport_latitude',get_latitude(df_merged_modified_dest_final['destination_airport_coordinates']))\
.withColumn('destination_airport_longitude',get_longitude(df_merged_modified_dest_final['destination_airport_coordinates']))


In [134]:
df__final.head(2)

[Row(origin_airport_code='MHK', destination_airport_code='AMW', origin_city='Manhattan, KS', destination_city='Ames, IA', passengers=21, seats=30, flights=1, distance=254.0, origin_population=122049, destination_population=86219, flight_year=2008, flight_month=10, origin_airport_type='medium_airport', origin_airport_name='Manhattan Regional Airport', origin_airport_elevation_ft=1057.0, origin_airport_region='US-KS', origin_airport_municipality='Manhattan', origin_airport_gps_code='KMHK', origin_airport_coordinates='-96.6707992553711, 39.14099884033203', destination_airport_type='small_airport', destination_airport_name='Ames Municipal Airport', destination_airport_elevation_ft=956.0, destination_airport_region='US-IA', destination_airport_municipality='Ames', destination_airport_gps_code='KAMW', iata_code='AMW', destination_airport_coordinates='-93.621803, 41.992001', origin_airport_latitude=-96.6707992553711, origin_airport_longitude=39.14099884033203, destination_airport_latitude=-93