# Airlines Data Cleaning

The goal of this notebook is to clean the `airlines` data and leave it in a joinable state. Along the way we perform an initial EDA to understand the ranges of the data we are working with, and how it will map to the Weather and Stations data.

### Imports

In [0]:
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType, NullType, ShortType, DateType, BooleanType, BinaryType
from pyspark.sql import SQLContext
from pyspark.sql.functions import isnan, when, count, col, to_timestamp, udf


import pandas as pd
import us
import plotly.express as px
from datetime import datetime, timedelta
from pytz import timezone
import pytz
import math

sqlContext = SQLContext(sc)


### Load Data

In [0]:
display(dbutils.fs.ls("dbfs:/mnt/mids-w261/datasets_final_project/parquet_airlines_data/"))

path,name,size
dbfs:/mnt/mids-w261/datasets_final_project/parquet_airlines_data/2015.parquet/,2015.parquet/,0
dbfs:/mnt/mids-w261/datasets_final_project/parquet_airlines_data/2016.parquet/,2016.parquet/,0
dbfs:/mnt/mids-w261/datasets_final_project/parquet_airlines_data/2017.parquet/,2017.parquet/,0
dbfs:/mnt/mids-w261/datasets_final_project/parquet_airlines_data/2018.parquet/,2018.parquet/,0
dbfs:/mnt/mids-w261/datasets_final_project/parquet_airlines_data/2019.parquet/,2019.parquet/,0
dbfs:/mnt/mids-w261/datasets_final_project/parquet_airlines_data/airlines_size_test.parquet/,airlines_size_test.parquet/,0


### Take a 0.001% sample to look at

In [0]:
airlines = spark.read.parquet("dbfs:/mnt/mids-w261/datasets_final_project/parquet_airlines_data/201*.parquet")
#Sample without replacement; a fraction of 0.00001 keeps roughly 0.001% of the rows
airlines_sample = airlines.sample(False, 0.00001)
display(airlines_sample)
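
For a sense of scale, `sample(False, 0.00001)` keeps each row independently with probability 0.00001, so the sample size is random rather than exact. A minimal arithmetic sketch, assuming the roughly 31.7 million rows quoted in the Cleaned Data section of this notebook:

```python
# Expected size of a sample where each row is kept with probability `fraction`.
total_rows = 31_746_841   # row count quoted later in this notebook
fraction = 0.00001        # value passed to airlines.sample(False, fraction)

fraction_as_percent = fraction * 100
expected_rows = total_rows * fraction

print(f"{fraction_as_percent:.3f}% of rows, about {expected_rows:.0f} rows expected")
```

A sample of a few hundred rows is enough to eyeball the columns with `display`, but too small for any per-airport or per-carrier statistics.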

In [0]:
airlines.printSchema()

In [0]:
#Count all rows in airlines; every row represents a single leg of one flight
print("Number of flights (2015 - 2019):  ", airlines.count())
print("Number of data columns:  ", len(airlines.columns))


### Drop DIV Columns
Prior analysis showed that these columns are mostly `NaN` or `Null`, so we drop them.

- These columns hold Diverted Airport Information (data starts 10/2008).
- They are mostly empty, which burdens performance, so we drop them.
- The same analysis should be performed on the larger data as well.

In [0]:
#Get All the columns
air_columns = airlines.columns
print(len(air_columns))

#In this case I want to focus on any column that has the substring "DIV" in it
div_cols = [y for y in air_columns if "DIV" in y]
print(len(div_cols))

#These first two columns we don't want to drop, so just selecting the rest
div_cols = div_cols[2:]
print(len(div_cols))
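
Slicing with `[2:]` depends on the order of the columns. A sketch of an order-independent alternative follows. It assumes the two columns being kept are `DIVERTED` and `DIV_AIRPORT_LANDINGS`, which is a guess that this notebook does not confirm, and it uses a short hypothetical column list in place of `airlines.columns`.

```python
# Hypothetical stand-in for airlines.columns (the real list is much longer).
air_columns = ["YEAR", "DIVERTED", "DIV_AIRPORT_LANDINGS",
               "DIV_REACHED_DEST", "DIV1_AIRPORT", "DISTANCE"]

# Assumption of this sketch: these are the two "DIV" columns worth keeping.
keep = {"DIVERTED", "DIV_AIRPORT_LANDINGS"}

# Select every column containing "DIV" except the ones named in `keep`.
div_cols = [c for c in air_columns if "DIV" in c and c not in keep]
print(div_cols)
```

Naming the kept columns explicitly means the selection still works if the schema is reordered.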

#### Display the count of nulls in these columns

In [0]:
#Helper to split the columns into chunks so the printed output stays readable
def get_chunks(lst, n):
    """Return a list of successive n-sized chunks from lst."""
    return [lst[i:i + n] for i in range(0, len(lst), n)]
chunks = get_chunks(div_cols, 9)
print(len(chunks))

print("Total records in airlines {0}".format(airlines.count()))

#Below is printing number of NaN or Nulls in each of these columns
for divs in chunks:
  airlines.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in divs]).show()
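
As a quick check of the chunking helper, here is a self-contained sketch with made-up column names. The last chunk is shorter when the list length is not a multiple of `n`.

```python
def get_chunks(lst, n):
    """Return a list of successive n-sized chunks from lst."""
    return [lst[i:i + n] for i in range(0, len(lst), n)]

# Made-up column names, only to show the shape of the result.
cols = [f"COL_{i}" for i in range(7)]
chunks = get_chunks(cols, 3)

for chunk in chunks:
    print(chunk)
```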

In [0]:
columns_to_drop = div_cols
airlines = airlines.drop(*columns_to_drop)

In [0]:
airlines.printSchema()

### Column Wise - EDA

For each column, we start with a high-level `describe` and an `isnan` count, then analyze facets of each potential feature.

In [0]:
#Describe Data
display(airlines.describe())

In [0]:
# Count Number of NaNs in each column
air_columns = airlines.columns
print(len(air_columns))

chunks = get_chunks(air_columns, 10)
print(len(chunks))

print("Total records in airlines {0}".format(airlines.count()))
for divs in chunks:
  airlines.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in divs]).show()


### Dropping data once again

Now that we can see the count of `NaN` or `Null` values across the remaining columns, we can decide which columns are useful to us, which to drop column-wise, and which rows to prune.
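
One way to make the column-versus-row decision explicit is to turn the null counts into fractions and compare them with a cutoff. The sketch below uses invented counts and an arbitrary 50% cutoff, neither of which comes from this notebook:

```python
# Invented null counts per column, for illustration only.
total_rows = 1_000
null_counts = {"CARRIER_DELAY": 820, "TAXI_OUT": 15, "ORIGIN": 0}

cutoff = 0.5  # arbitrary threshold, an assumption of this sketch

# Fraction of rows that are NaN or null in each column.
null_fraction = {c: n / total_rows for c, n in null_counts.items()}

# Mostly-empty columns are candidates to drop entirely;
# columns with only a few gaps are candidates for row-wise pruning.
drop_columns = [c for c, frac in null_fraction.items() if frac > cutoff]
prune_rows_on = [c for c, frac in null_fraction.items() if 0 < frac <= cutoff]

print("drop columns:", drop_columns)
print("prune rows on:", prune_rows_on)
```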

In [0]:
# 82% of the rows in these columns are NaN. In addition, using them as features would introduce multicollinearity,
# because delay would then be both a feature and our ultimate prediction.
delay_cols = ["CARRIER_DELAY", "WEATHER_DELAY", "NAS_DELAY","SECURITY_DELAY","LATE_AIRCRAFT_DELAY"]
columns_to_drop = ['CANCELLATION_CODE','FIRST_DEP_TIME', 'TOTAL_ADD_GTIME', 'LONGEST_ADD_GTIME']

columns_to_drop.extend(delay_cols)
airlines = airlines.drop(*columns_to_drop)

#A row is dropped if ANY of these columns is null (how='any')
drop_if_any_null = ["AIR_TIME", "ACTUAL_ELAPSED_TIME", "CRS_ELAPSED_TIME","ARR_DELAY_GROUP","ARR_DEL15", "ARR_DELAY_NEW", "ARR_DELAY","ARR_TIME","TAXI_IN", "DEP_TIME", "DEP_DELAY", "DEP_DELAY_NEW", "DEP_DEL15", "DEP_DELAY_GROUP", "DEP_TIME_BLK", "TAXI_OUT", "WHEELS_OFF", "WHEELS_ON"]

#Prune Rows
airlines = airlines.dropna(how='any', subset=drop_if_any_null)
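
Note that `how='any'` drops a row when at least one of the listed columns is null, whereas `how='all'` drops it only when every listed column is null. A plain-Python sketch of the two rules on invented toy rows (this is not Spark code):

```python
# Toy rows; None stands in for a null value.
rows = [
    {"DEP_TIME": 900, "ARR_TIME": 1100},
    {"DEP_TIME": 915, "ARR_TIME": None},
    {"DEP_TIME": None, "ARR_TIME": None},
]
subset = ["DEP_TIME", "ARR_TIME"]

# how='any': keep a row only if none of the subset columns is null.
kept_any = [r for r in rows if not any(r[c] is None for c in subset)]

# how='all': keep a row unless every subset column is null.
kept_all = [r for r in rows if not all(r[c] is None for c in subset)]

print(len(kept_any), len(kept_all))
```

The `any` rule is the stricter of the two, so it removes at least as many rows as `all` would.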

In [0]:
# Count Number of NaNs in each column
air_columns = airlines.columns
print(len(air_columns))

chunks = get_chunks(air_columns, 10)
print(len(chunks))

print("Total records in airlines {0}".format(airlines.count()))
for divs in chunks:
  airlines.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in divs]).show()


### Cleaned Data

1.8119% of airlines data rows were discarded ==> 575,229 rows

In [0]:
og_length = 31746841
after_pruned = 31171612
c_o_s = (og_length - after_pruned) / og_length
str(round(c_o_s*100,4))+'% of data discarded'

In [0]:
#Describe Data
display(airlines.describe())

In [0]:
#Write cleaned airlines data to our store
# airlines.write.parquet("dbfs:/mnt/mids-w261/team20SSDK/cleaned_data/airlines/airlines_latest")