# Airport On-Time Data
- The data comes from the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/).
- For now I'm only using data for 2019, but historical data goes back to 1987.

In [108]:
# import libraries
import pandas as pd
import glob

data_path = 'C:/Users/Zaca/Documents/Datasets/flights/2019/'

## Column Description

In [109]:
cols = pd.read_csv(data_path + 'columns.txt', sep=',', names=['colname', 'coldesc'], index_col=False)
cols

   colname                          coldesc
0  FlightDate                       Flight Date (yyyymmdd)
1  Reporting_Airline                Unique Carrier Code
2  Flight_Number_Reporting_Airline  Flight Number
3  OriginAirportID                  Airport ID
4  Origin                           Origin Airport
5  OriginCityName                   City Name
6  OriginState                      Origin Airport State Code
7  DestAirportID                    Destination AirportID
8  Dest                             Destination Airport
9  DestCityName                     Destination Airport City Name


In [110]:
# load data
# reuse data_path instead of repeating the hard-coded folder
jan_flights = pd.read_csv(data_path + '01.csv', usecols=cols['colname'])
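Since `glob` is already imported, the other months could be loaded the same way. The sketch below is one possible approach, not part of the analysis above: the helper name `load_months` is hypothetical, and it assumes the remaining files follow the same two-digit naming as `01.csv`.

```python
import glob
import os

import pandas as pd


def load_months(folder, columns):
    """Read every monthly CSV in `folder` and stack them into one frame."""
    # Assumption: monthly files are named 01.csv ... 12.csv, like the January file.
    paths = sorted(glob.glob(os.path.join(folder, '[0-9][0-9].csv')))
    # Keep only the requested columns from each file.
    frames = [pd.read_csv(path, usecols=columns) for path in paths]
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

It would be called as `load_months(data_path, cols['colname'])`. Note that `pd.concat` raises a `ValueError` if no file matches the pattern.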

In [111]:
# check dtypes
jan_flights.dtypes

FlightDate                          object
Reporting_Airline                   object
Flight_Number_Reporting_Airline      int64
OriginAirportID                      int64
Origin                              object
OriginCityName                      object
OriginState                         object
DestAirportID                        int64
Dest                                object
DestCityName                        object
DestState                           object
CRSDepTime                           int64
DepTime                            float64
DepDelay                           float64
TaxiOut                            float64
WheelsOff                          float64
WheelsOn                           float64
TaxiIn                             float64
CRSArrTime                           int64
ArrTime                            float64
ArrDelay                           float64
Cancelled                          float64
Diverted                           float64
CRSElapsedT

*All columns appear to be in a usable data format, although `FlightDate` is stored as `object` (a string) rather than as a datetime.*
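`FlightDate` shows up as `object` in the dtypes output, so any date arithmetic would need a conversion first. Below is a minimal sketch on a toy frame. The ISO-style date strings and the `format` argument are assumptions, since the raw values of the column are not shown here; the format string would need to match the actual file.

```python
import pandas as pd

# Toy frame standing in for jan_flights; the date strings are an assumption.
toy = pd.DataFrame({'FlightDate': ['2019-01-01', '2019-01-02', '2019-01-31']})

# Parse the strings into datetime64 values.
toy['FlightDate'] = pd.to_datetime(toy['FlightDate'], format='%Y-%m-%d')

# Datetime accessors now work, e.g. the day of the week (Monday=0).
toy['DayOfWeek'] = toy['FlightDate'].dt.dayofweek
```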

## Checking for NaNs

In [112]:
# get proportion of nan values in each col.
jan_flights.isna().sum()/jan_flights.shape[0]

FlightDate                         0.000000
Reporting_Airline                  0.000000
Flight_Number_Reporting_Airline    0.000000
OriginAirportID                    0.000000
Origin                             0.000000
OriginCityName                     0.000000
OriginState                        0.000000
DestAirportID                      0.000000
Dest                               0.000000
DestCityName                       0.000000
DestState                          0.000000
CRSDepTime                         0.000000
DepTime                            0.028001
DepDelay                           0.028006
TaxiOut                            0.028453
WheelsOff                          0.028453
WheelsOn                           0.029215
TaxiIn                             0.029215
CRSArrTime                         0.000000
ArrTime                            0.029215
ArrDelay                           0.030860
Cancelled                          0.000000
Diverted                        

### About NaNs:
* Most NaNs occur in the columns that describe the reason for the delay. I am still very interested in the data in these columns, so I would rather not drop them.
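To see at a glance which columns carry the most missing values, the proportions can be sorted. The sketch below uses a small made-up frame rather than the real data; `isna().mean()` gives the same result as dividing `isna().sum()` by the row count.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, standing in for jan_flights.
toy = pd.DataFrame({
    'ArrDelay': [5.0, 30.0, np.nan, -3.0],
    'CarrierDelay': [np.nan, 30.0, np.nan, np.nan],
    'Origin': ['JFK', 'LAX', 'SFO', 'ORD'],
})

# Share of missing values per column, largest first.
nan_share = toy.isna().mean().sort_values(ascending=False)
```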

In [113]:
# count the NaNs in flights with actual delays
# because some flights make up for departure delays during airtime, I'm going to focus on arrival delays
print(jan_flights[jan_flights.ArrDelay > 15].isna().sum())

# interesting: there are no NaNs when we filter on arrival delay, which means all of these delays are explained.
# what about departure delay?
print(jan_flights[jan_flights.DepDelay > 15].isna().sum())

# when we filter on departure delay instead, NaNs show up in the delay cause cols. I'm not going to do the math,
# but this suggests that a departure delay does not always lead to an arrival delay, and in that case
# no delay cause is recorded.

FlightDate                         0
Reporting_Airline                  0
Flight_Number_Reporting_Airline    0
OriginAirportID                    0
Origin                             0
OriginCityName                     0
OriginState                        0
DestAirportID                      0
Dest                               0
DestCityName                       0
DestState                          0
CRSDepTime                         0
DepTime                            0
DepDelay                           0
TaxiOut                            0
WheelsOff                          0
WheelsOn                           0
TaxiIn                             0
CRSArrTime                         0
ArrTime                            0
ArrDelay                           0
Cancelled                          0
Diverted                           0
CRSElapsedTime                     0
ActualElapsedTime                  0
AirTime                            0
Distance                           0
C

* It seems that these cols only have values when the flight was actually delayed more than 15 min on arrival. If that is the case, it makes sense to replace these NaNs with zero.
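One way to test this assumption directly is to look at the rows where the delay cause is missing and confirm that none of them arrived late. The sketch below does this on invented toy data with a shortened list of cause columns; on the real frame, the full list of delay cause columns and `jan_flights` would be used instead.

```python
import numpy as np
import pandas as pd

# Shortened list of cause columns, just for the toy example.
cause_cols = ['CarrierDelay', 'WeatherDelay']

# Invented values: two late flights with causes, two on-time flights without.
toy = pd.DataFrame({
    'ArrDelay':     [40.0, 10.0, -5.0, 20.0],
    'CarrierDelay': [40.0, np.nan, np.nan, 0.0],
    'WeatherDelay': [0.0, np.nan, np.nan, 20.0],
})

# Rows where every delay cause column is missing.
no_cause = toy[cause_cols].isna().all(axis=1)

# Largest arrival delay among rows without a recorded cause.
max_delay_without_cause = toy.loc[no_cause, 'ArrDelay'].max()

# Number of flights that were late on arrival but have no cause recorded.
late = toy['ArrDelay'] > 15
late_without_cause = int((late & no_cause).sum())
```

If `late_without_cause` is zero on the real data, filling the cause columns with zero does not hide any unexplained arrival delay.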

In [114]:
# making a list of the delay reason cols and filling with 0
delay_cols = ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
jan_flights[delay_cols] = jan_flights[delay_cols].fillna(value=0)

In [115]:
# the remaining NaNs account for only about 3% of the values in each column, so I will just drop those rows.
# we have enough data...
print('# Flights before drop: ', jan_flights.shape[0])
jan_flights.dropna(inplace=True)
print('# Flights after drop: ', jan_flights.shape[0])

# Flights before drop:  583985
# Flights after drop:  565963
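As a quick check on the "about 3%" estimate, the share of dropped rows can be worked out from the two counts printed above:

```python
rows_before = 583985
rows_after = 565963

# Rows removed by dropna and their share of the original frame.
rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(f'Dropped {rows_dropped} rows ({share_dropped:.2%} of the total)')
# prints: Dropped 18022 rows (3.09% of the total)
```

This is very close to the NaN proportion reported for `ArrDelay` (0.030860), which suggests that the rows with NaNs in the other columns are largely the same rows that are missing an arrival delay.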


FlightDate                         0
Reporting_Airline                  0
Flight_Number_Reporting_Airline    0
OriginAirportID                    0
Origin                             0
OriginCityName                     0
OriginState                        0
DestAirportID                      0
Dest                               0
DestCityName                       0
DestState                          0
CRSDepTime                         0
DepTime                            0
DepDelay                           0
TaxiOut                            0
WheelsOff                          0
WheelsOn                           0
TaxiIn                             0
CRSArrTime                         0
ArrTime                            0
ArrDelay                           0
Cancelled                          0
Diverted                           0
CRSElapsedTime                     0
ActualElapsedTime                  0
AirTime                            0
Distance                           0
C