### **Flights Data Exploration**

### 1. Connection

To connect to the Spark cluster, create a `SparkSession` with the following parameters:

+ **appName:** `FlightsDataExploration` - The name of the Spark application, shown in the Spark UI (e.g., `http://localhost:4040`) to help identify it.

+ **spark.driver.memory:** `1g` - Memory (1 gigabyte) allocated to the Spark driver process, which coordinates application execution.

+ **spark.executor.memory:** `1g` - Memory (1 gigabyte) allocated to each Spark executor process, where data processing and computation occur.

+ **spark.sql.shuffle.partitions:** `200` - Number of partitions created during shuffle operations (e.g., joins, aggregations).

+ **spark.sql.adaptive.enabled:** `true` - Enables Adaptive Query Execution (AQE), which re-optimizes the query plan at runtime.

+ **spark.sql.autoBroadcastJoinThreshold:** `100mb` - Maximum size (100 megabytes) of a table that Spark will consider broadcasting in a join.

In [50]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, count, lit

In [51]:
# Initialize Spark Session with configurations
spark = (
    SparkSession.builder.appName("FlightsDataExploration")
    .config("spark.driver.memory", "1g")  # Driver memory
    .config("spark.executor.memory", "1g")  # Executor memory
    .config("spark.sql.shuffle.partitions", "200") # Control shuffle partitions
    .config("spark.sql.adaptive.enabled", "true")  # Enable adaptive query execution
    .config("spark.sql.autoBroadcastJoinThreshold", "100mb") # Adjust broadcast join threshold
    .getOrCreate()
)

### 2. Read

This section reads the datasets required for the flight analysis. We use `spark.read.csv()` to load each CSV file into a Spark DataFrame. The `header=True` option tells Spark that the first row of each file contains column names, and `inferSchema=True` asks Spark to determine each column's data type (e.g., string, integer, double) from the data, at the cost of an extra pass over the file.

Specifically, we load the following datasets:

* **Airlines Dataset:**  `airlines_df` contains information about airlines, including their IATA codes and full names.  It is read from the `airlines.csv` file.

* **Airports Dataset:** `airports_df` contains details about airports, such as their IATA codes, names, city, state, and location.  It is read from the `airports.csv` file.

* **Cancellation Codes Dataset:** `cancellation_codes_df` provides descriptions for different cancellation codes. It is read from the `cancellation_codes.csv` file.

* **Flights Dataset:** `flights_df` contains detailed information about individual flights, including dates, times, airlines, origin and destination airports, delays, and cancellation status. It is read from the `flights.csv` file.

These DataFrames will be used in subsequent sections for data exploration, transformations, and analysis.
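To make the effect of `header=True` and `inferSchema=True` more concrete, here is a rough plain-Python approximation of the idea. It is not Spark's implementation: the sample rows, the column names `FLIGHTS` and `AVG_DELAY`, and the helpers `infer_type` and `infer_schema` are all hypothetical, and only three types are considered.

```python
import csv
import io

# Hypothetical sample data; the real CSV files may have different columns.
SAMPLE_CSV = """IATA_CODE,FLIGHTS,AVG_DELAY
AA,120,4.5
DL,98,3.0
WN,143,n/a
"""


def infer_type(values):
    """Return the narrowest of int, double, string that fits every value."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for value in values:
                caster(value)
        except ValueError:
            continue  # at least one value does not fit; try the next type
        return name
    return "string"


def infer_schema(text):
    """Map each column name to an inferred type name."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)       # header=True: the first row holds column names
    rows = list(reader)         # inference needs a full pass over the data
    columns = list(zip(*rows))  # transpose rows into columns
    return {name: infer_type(values) for name, values in zip(header, columns)}


schema = infer_schema(SAMPLE_CSV)
print(schema)
```

In this sketch a single non-numeric value (`n/a`) is enough to make `AVG_DELAY` fall back to a string column, which is one reason to inspect an inferred schema before relying on it.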

In [52]:
# --- Read Data ---
# Small lookup tables: let Spark infer the column types from the data
airlines_df = spark.read.csv(
    "data/flights/airlines.csv",
    header=True,
    inferSchema=True,
)
airports_df = spark.read.csv(
    "data/flights/airports.csv",
    header=True,
    inferSchema=True,
)
cancellation_codes_df = spark.read.csv(
    "data/flights/cancellation_codes.csv",
    header=True,
    inferSchema=True,
)

**Reading Flight Data with Schema Inference**

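The flights file is much larger than the lookup tables, so its schema is inferred once and then passed explicitly to a second read through the `schema` argument; a read with an explicit schema does not need an inference pass. Note that `inferSchema=True` scans the file during the first read, and the `.sample()` call only limits the rows of `sampled_flights_df`; it does not change the inferred schema.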


In [53]:
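# First read: infer the column types from the file. The sample only limits the
# rows of this DataFrame; its schema is the same as for the full file.
sampled_flights_df = spark.read.csv(
    "data/flights/flights.csv",
    header=True,
    inferSchema=True,
).sample(withReplacement=False, fraction=0.000001, seed=42)

# Keep the inferred schema for reuse
flights_schema = sampled_flights_df.schema

# Second read: the full dataset with an explicit schema (no inference pass)
flights_df = spark.read.csv(
    "data/flights/flights.csv",
    header=True,
    schema=flights_schema,
)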

### 3. Data Exploration and Transformations

This section uses PySpark aggregations, filters, and joins to answer three questions about the flight data. All queries run against the full `flights_df`, so the counts shown below cover the entire dataset rather than a sample.

### 1. Which airline had the most cancellations?

* **Finding the Airline with Most Cancellations:** We filter the `flights_df` to isolate cancelled flights (`CANCELLED` == 1). Then, we group the cancelled flights by `AIRLINE` and count the number of cancellations for each airline. The results are ordered in descending order of cancellation count to identify the airline with the most cancellations.

* **Joining with Airline Names:** To provide more context, we join the cancellation counts with the `airlines_df` using the `AIRLINE` IATA code. This adds the full airline name to the cancellation counts, making the results easier to interpret.
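The filter, group, count, and join steps described above can be checked on a handful of rows in plain Python before running them on the cluster. This is only a sketch: the airline codes and names are made up, and `Counter` stands in for Spark's `groupBy(...).agg(count("*"))`.

```python
from collections import Counter

# Hypothetical (AIRLINE, CANCELLED) rows standing in for flights_df.
flights = [
    ("XX", 1), ("YY", 0), ("XX", 1), ("ZZ", 1),
    ("YY", 1), ("XX", 1), ("ZZ", 1), ("XX", 0),
]

# Hypothetical lookup standing in for airlines_df (IATA_CODE -> AIRLINE name).
airline_names = {"XX": "Airline X", "YY": "Airline Y", "ZZ": "Airline Z"}

# Filter: keep only cancelled flights (CANCELLED == 1).
cancelled = [code for code, flag in flights if flag == 1]

# Group by airline and count, then sort by count in descending order.
cancellation_counts = Counter(cancelled).most_common()

# Inner join on the airline code: codes without a name are dropped.
cancellation_counts_with_names = [
    (airline_names[code], n)
    for code, n in cancellation_counts
    if code in airline_names
]

print(cancellation_counts_with_names)
```

Comparing a small hand-checked result like this against the Spark query on the same rows is a cheap way to catch a wrong filter condition or join key.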


In [None]:
cancelled_flights = flights_df.filter(
  col("CANCELLED") == 1
)

cancellation_counts = (
    cancelled_flights.groupBy("AIRLINE")
    .agg(
      count("*").alias("cancellation_count")
    )
    .orderBy(
      col("cancellation_count").desc()
    )
)

cancellation_counts.show()

In [None]:
# Join with airline names for better readability
cancellation_counts_with_names = cancellation_counts.join(
    airlines_df,
    cancellation_counts["AIRLINE"] == airlines_df["IATA_CODE"],
    "inner"
).select(airlines_df["AIRLINE"], "cancellation_count")

cancellation_counts_with_names.show()


+--------------------+------------------+
|             AIRLINE|cancellation_count|
+--------------------+------------------+
|United Air Lines ...|              6573|
|    Spirit Air Lines|              2004|
|American Airlines...|             10919|
|Atlantic Southeas...|             15231|
|     JetBlue Airways|              4276|
|Delta Air Lines Inc.|              3824|
|Skywest Airlines ...|              9960|
|Frontier Airlines...|               588|
|     US Airways Inc.|              4067|
|American Eagle Ai...|             15025|
|Hawaiian Airlines...|               171|
|Alaska Airlines Inc.|               669|
|      Virgin America|               534|
|Southwest Airline...|             16043|
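+--------------------+------------------+

The join does not preserve the earlier descending sort, so the rows above appear in arbitrary order. Reading the counts directly, the Southwest row has the highest value (16,043 cancellations). Apply `orderBy` again after the join if sorted output is needed.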

### 2. Which 10 airlines had the most flights?

* **Counting Flights per Airline:** We group `flights_df` by `AIRLINE` and count the number of flights for each airline.

* **Ordering Results:** The counts are sorted by flight count in descending order.

* **Limiting to Top 10:** We then call `limit(10)` to keep only the 10 airlines with the most flights. The sort must come before the limit; otherwise `limit()` would keep 10 arbitrary airlines.
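As a small illustration of why the sort has to happen before the limit, the following plain-Python sketch ranks a few made-up airline codes. The data and variable names are hypothetical; `Counter.most_common(n)` plays the role of sorting in descending order and then keeping the first `n` rows.

```python
from collections import Counter

# Hypothetical AIRLINE value of each flight row.
airline_per_flight = ["WW", "XX", "YY", "XX", "ZZ", "XX", "YY"]

flight_counts = Counter(airline_per_flight)

# Correct: sort by count (descending), then keep the first two.
top_2 = flight_counts.most_common(2)

# Incorrect: keep the first two groups encountered, then sort them.
first_2_then_sorted = sorted(
    list(flight_counts.items())[:2],
    key=lambda item: item[1],
    reverse=True,
)

print(top_2)
print(first_2_then_sorted)
```

Here the second variant keeps `WW`, which has only one flight, and misses `YY`, which has two.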


In [24]:
top_airlines = (
    flights_df.groupBy("AIRLINE")
    .agg(count("*").alias("flight_count"))
    .orderBy(col("flight_count").desc())
    .limit(10)  # Limit to top 10
)
top_airlines.show()



+-------+------------+
|AIRLINE|flight_count|
+-------+------------+
|     WN|     1261855|
|     DL|      875881|
|     AA|      725984|
|     OO|      588353|
|     EV|      571977|
|     UA|      515723|
|     MQ|      294632|
|     B6|      267048|
|     US|      198715|
|     AS|      172521|
+-------+------------+



                                                                                

### 3. What were the top 10 busiest airports (most flights)?

* **Calculating Airport Traffic:** We determine the busiest airports by calculating the total number of flights (arrivals and departures) for each airport. We group the `flights_df` by `ORIGIN_AIRPORT` and `DESTINATION_AIRPORT` separately to count departures and arrivals.

* **Combining Arrival and Departure Counts:** We perform a full outer join on the origin and destination counts so that airports with only departures or only arrivals are still included. `coalesce` returns its first non-null argument: it picks the airport code from whichever side of the join is present and replaces a missing count with 0. The total traffic is the sum of the two counts.

* **Ordering and Displaying Results:** The results are ordered by total traffic in descending order to show the busiest airports.
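The full outer join plus `coalesce` logic can be sketched in plain Python on a few made-up routes. The airport codes below are hypothetical, and `dict.get(key, 0)` stands in for `coalesce(count, lit(0))`. The default of 0 matters because in Spark SQL adding a null to a number yields null, which would leave `total_traffic` empty for airports that appear on only one side.

```python
from collections import Counter

# Hypothetical (ORIGIN_AIRPORT, DESTINATION_AIRPORT) pairs standing in for flights_df.
routes = [("AAA", "BBB"), ("AAA", "CCC"), ("BBB", "AAA"), ("DDD", "AAA")]

origin_counts = Counter(origin for origin, _ in routes)    # departures
destination_counts = Counter(dest for _, dest in routes)   # arrivals

# Full outer join: keep every airport that appears on either side.
airports = set(origin_counts) | set(destination_counts)

airport_traffic = []
for airport in airports:
    departures = origin_counts.get(airport, 0)     # missing count becomes 0
    arrivals = destination_counts.get(airport, 0)  # missing count becomes 0
    airport_traffic.append((airport, departures, arrivals, departures + arrivals))

# Order by total traffic, descending; break ties by airport code.
airport_traffic.sort(key=lambda row: (-row[3], row[0]))

for row in airport_traffic:
    print(row)
```

In this sample, `CCC` has only arrivals and `DDD` has only departures; an inner join would drop both.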


In [54]:
origin_counts = flights_df.groupBy("ORIGIN_AIRPORT") \
  .agg(
    count("*")
    .alias("origin_count")
  )
destination_counts = flights_df.groupBy("DESTINATION_AIRPORT") \
  .agg(
    count("*")
    .alias("destination_count")
  )

origin_counts.show()
destination_counts.show()

                                                                                

+--------------+------------+
|ORIGIN_AIRPORT|origin_count|
+--------------+------------+
|           BGM|         262|
|           PSE|         749|
|           INL|         574|
|           MSY|       38804|
|           PPG|         107|
|           GEG|        9505|
|           SNA|       37187|
|           BUR|       18889|
|           GRB|        4881|
|           GTF|        1966|
|           IDA|        2247|
|           GRR|       10845|
|           JLN|         666|
|           EUG|        3632|
|           PSG|         664|
|           GSO|        6737|
|           PVD|       11058|
|           MYR|        4831|
|           OAK|       42316|
|           MSN|        9135|
+--------------+------------+
only showing top 20 rows




+-------------------+-----------------+
|DESTINATION_AIRPORT|destination_count|
+-------------------+-----------------+
|                BGM|              264|
|                INL|              574|
|                PSE|              751|
|                MSY|            38802|
|                PPG|              107|
|                GEG|             9504|
|                SNA|            37195|
|                BUR|            18890|
|                GRB|             4883|
|                GTF|             1966|
|                IDA|             2247|
|                GRR|            10840|
|                JLN|              666|
|                EUG|             3631|
|                PSG|              664|
|                MYR|             4830|
|                GSO|             6734|
|                PVD|            11059|
|                OAK|            42313|
|                MSN|             9129|
+-------------------+-----------------+
only showing top 20 rows



                                                                                

In [55]:
airport_traffic = origin_counts.join(
  destination_counts,
  origin_counts["ORIGIN_AIRPORT"] == destination_counts["DESTINATION_AIRPORT"],
  "fullouter"
  ) \
  .select(
    coalesce(
      col("ORIGIN_AIRPORT"),
      col("DESTINATION_AIRPORT")
    ).alias("airport"),
    coalesce(
      origin_counts["origin_count"],
      lit(0)
    ).alias("origin_count"),
    coalesce(
      destination_counts["destination_count"],
      lit(0)
    ).alias("destination_count")) \
  .withColumn(
    "total_traffic",
    col("origin_count") + col("destination_count")
  ) \
  .orderBy(
    col("total_traffic").desc()
  )

airport_traffic.show(10)



+-------+------------+-----------------+-------------+
|airport|origin_count|destination_count|total_traffic|
+-------+------------+-----------------+-------------+
|    ATL|      346836|           346904|       693740|
|    ORD|      285884|           285906|       571790|
|    DFW|      239551|           239582|       479133|
|    DEN|      196055|           196010|       392065|
|    LAX|      194673|           194696|       389369|
|    SFO|      148008|           147966|       295974|
|    PHX|      146815|           146812|       293627|
|    IAH|      146622|           146683|       293305|
|    LAS|      133181|           133198|       266379|
|    MSP|      112117|           112128|       224245|
+-------+------------+-----------------+-------------+
only showing top 10 rows



                                                                                

### 4. Stop the application

In [56]:
spark.stop()