# Cleaning and Preprocessing - 2020 Data

**Author**: Stephanie Golob

**Contact**: estefaniagolob@gmail.com

**Date**: July 4, 2022

**Notebook**: 1 of 7

**Next Notebook**: 2 of 7 (Cleaning and Preprocessing - 2021 Data)

---

## Introduction

Flight delays in the United States cost airlines and passengers billions of dollars a year. According to the most recent statistics, published in 2020 by the Federal Aviation Administration (FAA), 54% of estimated flight delay costs were incurred by passengers. Flight delays also harm the environment through increased fuel consumption and greenhouse gas emissions. Accurate delay predictions could help reduce emissions and the costs incurred by all parties, and ultimately improve productivity by reducing lost time and wages for passengers, which could save the United States economy billions of dollars a year.

### Data

The data was downloaded from [The Bureau of Transportation Statistics](https://transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr). It consists of domestic flight data from the USA for all months of 2020. The data was downloaded by month and then concatenated to create a full 2020 data set.
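
The month-by-month concatenation described above can be sketched as follows. This is a minimal illustration, not the exact code used in this notebook: the in-memory CSV snippets stand in for the monthly downloads, and the helper name `concat_monthly` and the file name in the comment are assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two monthly BTS downloads. The real files would
# be read from disk, e.g. pd.read_csv("2020_01.csv") (file name is assumed).
jan_csv = io.StringIO("Year,Month,Origin,DepDelay\n2020,1,ATL,5.0\n2020,1,JFK,-3.0\n")
feb_csv = io.StringIO("Year,Month,Origin,DepDelay\n2020,2,ATL,12.0\n")


def concat_monthly(sources):
    """Read each monthly CSV and stack the rows into one DataFrame."""
    frames = [pd.read_csv(src) for src in sources]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating
    # each monthly file's own index.
    return pd.concat(frames, ignore_index=True)


flights_2020 = concat_monthly([jan_csv, feb_csv])
print(flights_2020.shape)
```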

After initial exploratory data analysis, I decided to focus on a subset of the data to reduce the noise that comes from including too many airports and flights. I chose Hartsfield-Jackson Atlanta International Airport (ATL) because it is the most frequent origin airport in this data set (and in the 2021 data set as well).

In this notebook I perform some limited exploratory data analysis to remove columns with high percentages of missing values, check for duplicated values, and create a subset of data with ATL selected as the origin airport. I have also added weather data for Hartsfield-Jackson Atlanta International Airport downloaded from [NOAA](https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00013874/detail).
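
The steps listed above can be sketched on a toy example. This is an illustrative sketch under stated assumptions, not the notebook's actual code: the 50% missing-value threshold, the decision to drop duplicated rows, and the weather column names `DATE` and `PRCP` are all assumptions made for the example, and the sample rows are invented.

```python
import numpy as np
import pandas as pd

# Toy flight data (invented rows) using column names from the data dictionary.
flights = pd.DataFrame({
    "FlightDate": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "Origin": ["ATL", "JFK", "ATL", "ATL"],
    "DepDelay": [5.0, -3.0, 20.0, 20.0],
    "Div1Airport": [np.nan, np.nan, np.nan, np.nan],
})

# 1. Drop columns whose share of missing values exceeds a threshold (assumed 50%).
MISSING_THRESHOLD = 0.5
missing_share = flights.isna().mean()
flights = flights.loc[:, missing_share <= MISSING_THRESHOLD]

# 2. Count fully duplicated rows, then drop them (dropping is an assumption).
n_duplicates = int(flights.duplicated().sum())
flights = flights.drop_duplicates()

# 3. Keep only flights departing from ATL.
atl = flights[flights["Origin"] == "ATL"].copy()

# 4. Attach daily weather by date (weather column names are assumed).
weather = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-02"], "PRCP": [0.0, 0.4]})
atl = atl.merge(weather, how="left", left_on="FlightDate", right_on="DATE")
atl = atl.drop(columns="DATE")
print(atl)
```

A left merge keeps every ATL flight even if a weather observation is missing for that date, which makes any gaps visible as missing values rather than silently dropping flights.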

### Data Dictionary

| Feature                         | Data type | Description                                                                                                                                                                                                                           |
|---------------------------------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Year                            | numeric   | The year of the flight                                                                                                                                                                                                                |
| Quarter                         | numeric   | The quarter of the flight                                                                                                                                                                                                             |
| Month                           | numeric   | The month of the flight                                                                                                                                                                                                               |
| DayofMonth                      | numeric   | The day of the month of the flight                                                                                                                                                                                                    |
| DayOfWeek                       | numeric   | The day of the week of the flight                                                                                                                                                                                                     |
| FlightDate                      | object    | The date of the flight (yyyymmdd)                                                                                                                                                                                                     |
| Reporting_Airline               | object    | Unique carrier code for each airline                                                                                                                                                                                                  |
| DOT_ID_Reporting_Airline        | numeric   | An identification number assigned by US Department of Transportation (DOT) to identify a unique airline (carrier)                                                                                                                     |
| IATA_CODE_Reporting_Airline     | object    | Code assigned by the International Air Transport Association (IATA) and commonly used to identify a carrier                                                                                                                           |
| Tail_Number                     | object    | Flight tail number used to identify a unique airplane                                                                                                                                                                                 |
| Flight_Number_Reporting_Airline | numeric   | Number used to identify the flight                                                                                                                                                                                                    |
| OriginAirportID                 | numeric   | An identification number assigned by US DOT to identify a unique airport                                                                                                                                                              |
| OriginAirportSeqID              | numeric   | An identification number assigned by US DOT to identify a unique airport at a given point of time                                                                                                                                     |
| OriginCityMarketID              | numeric   | City Market ID is an identification number assigned by US DOT to identify a city market                                                                                                                                               |
| Origin                          | object    | 3-letter string used to identify each origin airport                                                                                                                                                                                  |
| OriginCityName                  | object    | String with the city and state name of the origin airport                                                                                                                                                                             |
| OriginState                     | object    | 2-letter string used to identify each origin state                                                                                                                                                                                    |
| OriginStateFips                 | numeric   | 2-digit unique code to identify each origin state                                                                                                                                                                                     |
| OriginStateName                 | object    | String with the state name of the origin airport                                                                                                                                                                                      |
| OriginWac                       | numeric   | Origin airport unique world area code                                                                                                                                                                                                 |
| DestAirportID                   | numeric   | An identification number assigned by US DOT to identify a unique airport                                                                                                                                                              |
| DestAirportSeqID                | numeric   | An identification number assigned by US DOT to identify a unique airport at a given point of time                                                                                                                                     |
| DestCityMarketID                | numeric   | City Market ID is an identification number assigned by US DOT to identify a city market                                                                                                                                               |
| Dest                            | object    | 3-letter string used to identify each destination airport                                                                                                                                                                             |
| DestCityName                    | object    | String with the city and state name of the destination airport                                                                                                                                                                        |
| DestState                       | object    | 2-letter string used to identify each destination state                                                                                                                                                                               |
| DestStateFips                   | numeric   | 2-digit unique code to identify each destination state                                                                                                                                                                                |
| DestStateName                   | object    | String with the state name of the destination airport                                                                                                                                                                                 |
| DestWac                         | numeric   | Destination airport unique world area code                                                                                                                                                                                            |
| CRSDepTime                      | numeric   | Scheduled departure time (local time hhmm)                                                                                                                                                                                            |
| DepTime                         | numeric   | Actual departure time                                                                                                                                                                                                                 |
| DepDelay                        | numeric   | Difference in minutes between scheduled and actual departure time. Early departures show negative numbers                                                                                                                             |
| DepDelayMinutes                 | numeric   | Difference in minutes between scheduled and actual departure time. Early departures set to 0.                                                                                                                                         |
| DepDel15                        | numeric   | Departure Delay Indicator, 15 Minutes or More (1=Yes)                                                                                                                                                                                 |
| DepartureDelayGroups            | numeric   | Departure delay intervals, every 15 minutes (from <-15 to >180)                                                                                                                                                                       |
| DepTimeBlk                      | object    | Scheduled departure time block, given in hourly intervals                                                                                                                                                                             |
| TaxiOut                         | numeric   | Taxi out time, in minutes                                                                                                                                                                                                             |
| WheelsOff                       | numeric   | The time when the plane's wheels leave the runway after taxi out (local time: hhmm)                                                                                                                                                   |
| WheelsOn                        | numeric   | The time when the plane has landed at the destination airport (local time: hhmm)                                                                                                                                                      |
| TaxiIn                          | numeric   | Taxi in time, in minutes                                                                                                                                                                                                              |
| CRSArrTime                      | numeric   | Scheduled arrival time (local time hhmm)                                                                                                                                                                                              |
| ArrTime                         | numeric   | Actual arrival time                                                                                                                                                                                                                   |
| ArrDelay                        | numeric   | Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers                                                                                                                                 |
| ArrDelayMinutes                 | numeric   | Difference in minutes between scheduled and actual arrival time. Early arrivals set to 0                                                                                                                                              |
| ArrDel15                        | numeric   | Arrival delay indicator, 15 minutes or more (1=Yes)                                                                                                                                                                                   |
| ArrivalDelayGroups              | numeric   | Arrival delay intervals, every 15 minutes (from <-15 to >180)                                                                                                                                                                         |
| ArrTimeBlk                      | object    | Arrival time block, hourly intervals                                                                                                                                                                                                  |
| Cancelled                       | numeric   | Cancelled Flight Indicator (1=Yes)                                                                                                                                                                                                    |
| CancellationCode                | object    | String specifying the reason for cancellation                                                                                                                                                                                         |
| Diverted                        | numeric   | Diverted Flight Indicator (1=Yes)                                                                                                                                                                                                     |
| CRSElapsedTime                  | numeric   | The scheduled elapsed time of the flight, in minutes                                                                                                                                                                                  |
| ActualElapsedTime               | numeric   | The actual elapsed time of the flight, in minutes                                                                                                                                                                                     |
| AirTime                         | numeric   | Flight time in minutes                                                                                                                                                                                                                |
| Flights                         | numeric   | The number of flights per row, all values equal 1                                                                                                                                                                                     |
| Distance                        | numeric   | Distance between airports (miles)                                                                                                                                                                                                     |
| DistanceGroup                   | numeric   | Distance Intervals, every 250 Miles, for Flight Segment                                                                                                                                                                               |
| CarrierDelay                    | numeric   | The cause of the delay was due to circumstances within the airline's control (maintenance and crew, fueling, baggage loading, cleaning the cabin, etc.), in minutes                                                                   |
| WeatherDelay                    | numeric   | Significant meteorological conditions (actual or forecasted) that, in the judgment of the carrier, delays or prevents the operation of a flight such as tornado, blizzard or hurricane, in minutes                                    |
| NASDelay                        | numeric   | Delays and cancellations attributable to the national aviation system that refer to a broad set of conditions, such as non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control, in minutes  |
| SecurityDelay                   | numeric   | Delays or cancellations caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas, in minutes |
| LateAircraftDelay               | numeric   | A previous flight with same aircraft arrived late, causing the present flight to depart late, in minutes                                                                                                                              |
| FirstDepTime                    | numeric   | First Gate Departure Time at Origin Airport                                                                                                                                                                                           |
| TotalAddGTime                   | numeric   | Total Ground Time Away from Gate for Gate Return or Cancelled Flight                                                                                                                                                                  |
| LongestAddGTime                 | numeric   | Longest Time Away from Gate for Gate Return or Cancelled Flight                                                                                                                                                                       |
| DivAirportLandings              | numeric   | Number of Diverted Airport Landings                                                                                                                                                                                                   |
| DivReachedDest                  | numeric   | Diverted Flight Reaching Scheduled Destination Indicator (1=Yes)                                                                                                                                                                      |
| DivActualElapsedTime            | numeric   | Elapsed Time of Diverted Flight Reaching Scheduled Destination, in Minutes. The ActualElapsedTime column remains NULL for all diverted flights                                                                                        |
| DivArrDelay                     | numeric   | Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights                                                       |
| DivDistance                     | numeric   | Distance between scheduled destination and final diverted airport (miles). Value will be 0 for diverted flight reaching scheduled destination                                                                                         |
| Div1Airport                     | numeric   | Diverted Airport Code1                                                                                                                                                                                                                |
| Div1AirportID                   | numeric   | Airport ID of Diverted Airport 1                                                                                                                                                                                                      |
| Div1AirportSeqID                | numeric   | Airport Sequence ID of Diverted Airport 1                                                                                                                                                                                             |
| Div1WheelsOn                    | numeric   | Wheels On Time (local time: hhmm) at Diverted Airport Code1                                                                                                                                                                           |
| Div1TotalGTime                  | numeric   | Total Ground Time Away from Gate at Diverted Airport Code1                                                                                                                                                                            |
| Div1LongestGTime                | numeric   | Longest Ground Time Away from Gate at Diverted Airport Code1                                                                                                                                                                          |
| Div1WheelsOff                   | numeric   | Wheels Off Time (local time: hhmm) at Diverted Airport Code1                                                                                                                                                                          |
| Div1TailNum                     | numeric   | Aircraft Tail Number for Diverted Airport Code1                                                                                                                                                                                       |
| Div2Airport                     | numeric   | Diverted Airport Code2                                                                                                                                                                                                                |
| Div2AirportID                   | numeric   | Airport ID of Diverted Airport 2                                                                                                                                                                                                      |
| Div2AirportSeqID                | numeric   | Airport Sequence ID of Diverted Airport 2. Unique Key for Time Specific Information for an Airport                                                                                                                                    |
| Div2WheelsOn                    | numeric   | Wheels On Time (local time: hhmm) at Diverted Airport Code2                                                                                                                                                                           |
| Div2TotalGTime                  | numeric   | Total Ground Time Away from Gate at Diverted Airport Code2                                                                                                                                                                            |
| Div2LongestGTime                | numeric   | Longest Ground Time Away from Gate at Diverted Airport Code2                                                                                                                                                                          |
| Div2WheelsOff                   | numeric   | Wheels Off Time (local time: hhmm) at Diverted Airport Code2                                                                                                                                                                          |
| Div2TailNum                     | numeric   | Aircraft Tail Number for Diverted Airport Code2                                                                                                                                                                                       |
| Div3Airport                     | numeric   | Diverted Airport Code3                                                                                                                                                                                                                |
| Div3AirportID                   | numeric   | Airport ID of Diverted Airport 3                                                                                                                                                                                                      |
| Div3AirportSeqID                | numeric   | Airport Sequence ID of Diverted Airport 3                                                                                                                                                                                             |
| Div3WheelsOn                    | numeric   | Wheels On Time (local time: hhmm) at Diverted Airport Code3                                                                                                                                                                           |
| Div3TotalGTime                  | numeric   | Total Ground Time Away from Gate at Diverted Airport Code3                                                                                                                                                                            |
| Div3LongestGTime                | numeric   | Longest Ground Time Away from Gate at Diverted Airport Code3                                                                                                                                                                          |
| Div3WheelsOff                   | numeric   | Wheels Off Time (local time: hhmm) at Diverted Airport Code3                                                                                                                                                                          |
| Div3TailNum                     | numeric   | Aircraft Tail Number for Diverted Airport Code3                                                                                                                                                                                       |
| Div4Airport                     | numeric   | Diverted Airport Code4                                                                                                                                                                                                                |
| Div4AirportID                   | numeric   | Airport ID of Diverted Airport 4                                                                                                                                                                                                      |
| Div4AirportSeqID                | numeric   | Airport Sequence ID of Diverted Airport 4. Unique Key for Time Specific Information for an Airport                                                                                                                                    |
| Div4WheelsOn                    | numeric   | Wheels On Time (local time: hhmm) at Diverted Airport Code4                                                                                                                                                                           |
| Div4TotalGTime                  | numeric   | Total Ground Time Away from Gate at Diverted Airport Code4                                                                                                                                                                            |
| Div4LongestGTime                | numeric   | Longest Ground Time Away from Gate at Diverted Airport Code4                                                                                                                                                                          |
| Div4WheelsOff                   | numeric   | Wheels Off Time (local time: hhmm) at Diverted Airport Code4                                                                                                                                                                          |
| Div4TailNum                     | numeric   | Aircraft Tail Number for Diverted Airport Code4                                                                                                                                                                                       |
| Div5Airport                     | numeric   | Diverted Airport Code5                                                                                                                                                                                                                |
| Div5AirportID                   | numeric   | Airport ID of Diverted Airport 5                                                                                                                                                                                                      |
| Div5AirportSeqID                | numeric   | Airport Sequence ID of Diverted Airport 5. Unique Key for Time Specific Information for an Airport                                                                                                                                    |
| Div5WheelsOn                    | numeric   | Wheels On Time (local time: hhmm) at Diverted Airport Code5                                                                                                                                                                           |
| Div5TotalGTime                  | numeric   | Total Ground Time Away from Gate at Diverted Airport Code5                                                                                                                                                                            |
| Div5LongestGTime                | numeric   | Longest Ground Time Away from Gate at Diverted Airport Code5                                                                                                                                                                          |
| Div5WheelsOff                   | numeric   | Wheels Off Time (local time: hhmm) at Diverted Airport Code5                                                                                                                                                                          |
| Div5TailNum                     | numeric   | Aircraft Tail Number for Diverted Airport Code5 |

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Cleaning-and-Preprocessing---2020-Data" data-toc-modified-id="Cleaning-and-Preprocessing---2020-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Cleaning and Preprocessing - 2020 Data</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Data" data-toc-modified-id="Data-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#Data-Dictionary" data-toc-modified-id="Data-Dictionary-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Data Dictionary</a></span></li></ul></li><li><span><a href="#Cleaning-and-Preprocessing-2020-Data" data-toc-modified-id="Cleaning-and-Preprocessing-2020-Data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Cleaning and Preprocessing 2020 Data</a></span></li><li><span><a href="#Data-Cleaning-on-ATL-2020-Data-Set" data-toc-modified-id="Data-Cleaning-on-ATL-2020-Data-Set-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data Cleaning on ATL 2020 Data Set</a></span><ul class="toc-item"><li><span><a href="#Missing-Values" data-toc-modified-id="Missing-Values-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Missing Values</a></span></li><li><span><a href="#Data-Types" data-toc-modified-id="Data-Types-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Data Types</a></span></li><li><span><a href="#Weather-Data" data-toc-modified-id="Weather-Data-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Weather Data</a></span></li><li><span><a href="#Weather-Data-Dictionary" data-toc-modified-id="Weather-Data-Dictionary-1.3.4"><span class="toc-item-num">1.3.4&nbsp;&nbsp;</span>Weather Data Dictionary</a></span></li><li><span><a href="#Next-Steps" data-toc-modified-id="Next-Steps-1.3.5"><span class="toc-item-num">1.3.5&nbsp;&nbsp;</span>Next 
Steps</a></span></li></ul></li></ul></li></ul></div>

## Cleaning and Preprocessing 2020 Data

Import the required libraries.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# use argument display.max_columns to display all columns, None sets it to an unlimited view
pd.set_option('display.max_columns', None)

Import all of the datasets using `read_csv`.

In [3]:
df_jan = pd.read_csv("data/2020/2020_jan.csv")
df_feb = pd.read_csv("data/2020/2020_feb.csv")
df_mar = pd.read_csv("data/2020/2020_mar.csv")
df_apr = pd.read_csv("data/2020/2020_apr.csv")
df_may = pd.read_csv("data/2020/2020_may.csv")
df_jun = pd.read_csv("data/2020/2020_jun.csv")
df_jul = pd.read_csv("data/2020/2020_jul.csv")
df_aug = pd.read_csv("data/2020/2020_aug.csv")
df_sep = pd.read_csv("data/2020/2020_sep.csv")
df_oct = pd.read_csv("data/2020/2020_oct.csv")
df_nov = pd.read_csv("data/2020/2020_nov.csv")
df_dec = pd.read_csv("data/2020/2020_dec.csv")


Concatenate the dataframes together to create a complete 2020 data set using `pd.concat`.

In [4]:
df_2020 = pd.concat([df_jan, df_feb, df_mar, df_apr, df_may, df_jun, df_jul, df_aug, df_sep, df_oct, df_nov, df_dec], axis=0)
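The twelve `read_csv` calls and the `pd.concat` call can also be written as a loop. This is a minimal sketch, not the code used in this notebook: the helper name `load_year` and its parameters are hypothetical, and it assumes the files follow the `<year>_<month>.csv` naming used above. `low_memory=False` is the option pandas suggests when a column contains mixed types.

```python
from pathlib import Path

import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def load_year(data_dir, year, months=MONTHS):
    """Read one csv per month and stack them into a single dataframe."""
    frames = []
    for month in months:
        path = Path(data_dir) / f"{year}_{month}.csv"
        # low_memory=False: infer each column's dtype from the whole file
        frames.append(pd.read_csv(path, low_memory=False))
    # axis=0 stacks the monthly frames row-wise, as in the cell above
    return pd.concat(frames, axis=0)


# Hypothetical usage with the folder layout from this notebook:
# df_2020 = load_year("data/2020", 2020)
```

Like the cell above, this keeps each file's original row index. Passing `ignore_index=True` to `pd.concat` would give a single running index instead.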


Check the shape of the data set. Index 0 of `shape` gives the number of rows and index 1 gives the number of columns.

In [5]:
print(f"There are {df_2020.shape[0]} rows and {df_2020.shape[1]} columns.")

There are 4688354 rows and 110 columns.


Check if there are any missing values using `isnull` and `sum` to get the total number of missing values from each column. Divide this value by the length of the dataframe and multiply by 100 to turn this number into the percent of missing values in each column.

In [6]:
# create a dictionary to get key-value pairs for each column
dict(df_2020.isnull().sum()/len(df_2020)*100)

{'Year': 0.0,
 'Quarter': 0.0,
 'Month': 0.0,
 'DayofMonth': 0.0,
 'DayOfWeek': 0.0,
 'FlightDate': 0.0,
 'Reporting_Airline': 0.0,
 'DOT_ID_Reporting_Airline': 0.0,
 'IATA_CODE_Reporting_Airline': 0.0,
 'Tail_Number': 3.283625767166899,
 'Flight_Number_Reporting_Airline': 0.0,
 'OriginAirportID': 0.0,
 'OriginAirportSeqID': 0.0,
 'OriginCityMarketID': 0.0,
 'Origin': 0.0,
 'OriginCityName': 0.0,
 'OriginState': 0.0,
 'OriginStateFips': 0.0,
 'OriginStateName': 0.0,
 'OriginWac': 0.0,
 'DestAirportID': 0.0,
 'DestAirportSeqID': 0.0,
 'DestCityMarketID': 0.0,
 'Dest': 0.0,
 'DestCityName': 0.0,
 'DestState': 0.0,
 'DestStateFips': 0.0,
 'DestStateName': 0.0,
 'DestWac': 0.0,
 'CRSDepTime': 0.0,
 'DepTime': 5.9687472405027435,
 'DepDelay': 5.970090995688466,
 'DepDelayMinutes': 5.970090995688466,
 'DepDel15': 5.970090995688466,
 'DepartureDelayGroups': 5.970090995688466,
 'DepTimeBlk': 0.0,
 'TaxiOut': 5.98357120644047,
 'WheelsOff': 5.98357120644047,
 'WheelsOn': 6.011491453077135,
 'Ta

A lot of columns have a high percentage of missing values. I will remove all the columns that have >90% missing values since such a large proportion of the data cannot be imputed. 

The same columns will be dropped from the 2021 data set to ensure consistency. Columns that have a lower percentage of missing values (<90%) will be explored when the subset of ATL data is created.
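The 90% rule can also be applied programmatically instead of typing the column names by hand. The sketch below runs on a toy dataframe; the helper name `columns_above_missing_threshold` is hypothetical, and the list it produces on the real data may differ from the hand-written list in the next cell.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(df, threshold_pct=90.0):
    """Return the names of columns whose share of missing values exceeds threshold_pct."""
    # isnull().mean() is the fraction of missing values per column,
    # equivalent to isnull().sum() / len(df)
    missing_pct = df.isnull().mean() * 100
    return missing_pct[missing_pct > threshold_pct].index.tolist()


# Toy example: column "b" is entirely missing, "c" is half missing
toy = pd.DataFrame({"a": [1, 2], "b": [np.nan, np.nan], "c": [1.0, np.nan]})
to_drop = columns_above_missing_threshold(toy)  # only "b" exceeds 90%
toy_reduced = toy.drop(columns=to_drop)
```

Saving the resulting list would also make it easy to drop exactly the same columns from the 2021 data set.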

Use `drop` and specify which columns to remove from the data set.

In [7]:
df_2020 = df_2020.drop(columns = ['Div3Airport', 'Div3AirportID', 'Div3AirportSeqID', 
                                  'Div3WheelsOn', 'Div3TotalGTime', 'Div3LongestGTime', 
                                  'Div3WheelsOff', 'Div3TailNum', 'Div4Airport', 
                                  'Div4AirportID', 'Div4AirportSeqID', 'Div4WheelsOn', 
                                  'Div4TotalGTime', 'Div4LongestGTime', 'Div4WheelsOff', 
                                  'Div4TailNum', 'Div5Airport', 'Div5AirportID', 
                                  'Div5AirportSeqID', 'Div5WheelsOn', 'Div5TotalGTime', 
                                  'Div5LongestGTime', 'Div5WheelsOff', 'Div5TailNum', 
                                  'Unnamed: 109', 'Div2WheelsOn', 'Div2TotalGTime', 
                                  'Div2LongestGTime', 'Div2WheelsOff', 'Div2TailNum' ,
                                  'Div1WheelsOff', 'Div1TailNum', 'Div2Airport', 
                                  "Div2AirportID", "Div2AirportSeqID", "Div1AirportID", 
                                  "Div1AirportSeqID", "Div1WheelsOn", "Div1TotalGTime", 
                                  "Div1LongestGTime", "DivReachedDest", "DivActualElapsedTime", 
                                  "DivArrDelay", "DivDistance", "Div1Airport", 
                                  "FirstDepTime", "TotalAddGTime", "LongestAddGTime", 
                                  "WeatherDelay", "NASDelay", "SecurityDelay", 
                                  "LateAircraftDelay", "CancellationCode", "CarrierDelay"])

Check the remaining columns.

In [8]:
df_2020.columns

Index(['Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate',
       'Reporting_Airline', 'DOT_ID_Reporting_Airline',
       'IATA_CODE_Reporting_Airline', 'Tail_Number',
       'Flight_Number_Reporting_Airline', 'OriginAirportID',
       'OriginAirportSeqID', 'OriginCityMarketID', 'Origin', 'OriginCityName',
       'OriginState', 'OriginStateFips', 'OriginStateName', 'OriginWac',
       'DestAirportID', 'DestAirportSeqID', 'DestCityMarketID', 'Dest',
       'DestCityName', 'DestState', 'DestStateFips', 'DestStateName',
       'DestWac', 'CRSDepTime', 'DepTime', 'DepDelay', 'DepDelayMinutes',
       'DepDel15', 'DepartureDelayGroups', 'DepTimeBlk', 'TaxiOut',
       'WheelsOff', 'WheelsOn', 'TaxiIn', 'CRSArrTime', 'ArrTime', 'ArrDelay',
       'ArrDelayMinutes', 'ArrDel15', 'ArrivalDelayGroups', 'ArrTimeBlk',
       'Cancelled', 'Diverted', 'CRSElapsedTime', 'ActualElapsedTime',
       'AirTime', 'Flights', 'Distance', 'DistanceGroup',
       'DivAirportLandings'],
      

Check the shape of the new data set.

In [9]:
# down to 56 columns from 110
df_2020.shape

(4688354, 56)

Check for duplicated rows:

In [10]:
# check for duplicated rows and sum these values to get a total for the entire dataframe
df_2020.duplicated().sum()

0

There are no duplicated rows.

---

After some initial cleaning, I decided to take a subset of the data for the remaining analysis. There are 367 unique airports and more than 10 million rows across the 2020 and 2021 data sets, so I'm focusing on a single airport to narrow the scope of the study and make the data more manageable.

To choose an airport, I select the origin airport with the largest number of flights, which is Hartsfield-Jackson Atlanta International Airport (ATL). 255,324 flights departed from this airport, which is about 5.4% of all domestic flights in the 2020 data set.

In [11]:
# select the column 'Origin' and use value_counts() to count the number of rows by each origin airport
df_2020["Origin"].value_counts()

ATL    255324
DFW    225821
DEN    198540
ORD    189639
CLT    170892
        ...  
JST        50
BFM        47
PPG        26
HYA        22
UIN        10
Name: Origin, Length: 367, dtype: int64

In [12]:
# select the column 'Dest' and use value_counts() to count the number of rows by each destination airport
df_2020["Dest"].value_counts()

ATL    255282
DFW    225816
DEN    198494
ORD    189598
CLT    170823
        ...  
JST        51
BFM        46
PPG        26
HYA        22
UIN         9
Name: Dest, Length: 367, dtype: int64

In [13]:
# get the value_counts for flights for each origin airport as a percentage of the total number of flights by dividing
# by the length of the dataframe and multiplying by 100
df_2020["Origin"].value_counts()/len(df_2020)*100

ATL    5.445920
DFW    4.816637
DEN    4.234748
ORD    4.044895
CLT    3.645032
         ...   
JST    0.001066
BFM    0.001002
PPG    0.000555
HYA    0.000469
UIN    0.000213
Name: Origin, Length: 367, dtype: float64
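The same percentages can be obtained directly with the `normalize` argument of `value_counts`. A small sketch on a toy series (not the real `Origin` column) shows that the two approaches agree when the column has no missing values:

```python
import pandas as pd

# Toy stand-in for df_2020["Origin"]
origin = pd.Series(["ATL", "ATL", "DFW", "DEN"], name="Origin")

# normalize=True returns fractions of the total instead of counts
share_pct = origin.value_counts(normalize=True) * 100

# Manual version, as in the cell above
manual_pct = origin.value_counts() / len(origin) * 100
```

`normalize=True` divides by the number of non-missing values, so the results would differ from dividing by `len` only if the column contained missing values.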

Create a subset of data that includes only ATL as the origin airport using `loc`.

In [14]:
atl_2020 = df_2020.loc[df_2020["Origin"] == "ATL"]

Check that the origin is ATL only.

In [15]:
atl_2020["Origin"].value_counts()

ATL    255324
Name: Origin, dtype: int64

Save the ATL subset (`atl_2020`) as a csv using `to_csv`.

In [16]:
atl_2020.to_csv("data/2020/ATL_2020.csv")
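By default `to_csv` also writes the dataframe index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again with a plain `read_csv`. The sketch below uses a toy dataframe and an in-memory buffer rather than the real file to show the difference. Whether to keep the index is a choice for the later notebooks, not something this notebook specifies.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Origin": ["ATL", "ATL"], "DepDelay": [2.0, 6.0]}, index=[648, 649])

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```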

---

## Data Cleaning on ATL 2020 Data Set

Check the new ATL dataframe using `head`.

In [17]:
atl_2020.head()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,Origin,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,Dest,DestCityName,DestState,DestStateFips,DestStateName,DestWac,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,Cancelled,Diverted,CRSElapsedTime,ActualElapsedTime,AirTime,Flights,Distance,DistanceGroup,DivAirportLandings
648,2020,1,1,1,3,2020-01-01,B6,20409,B6,N583JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,503.0,2.0,2.0,0.0,0.0,0001-0559,12.0,515.0,712.0,5.0,733,717.0,-16.0,0.0,0.0,-2.0,0700-0759,0.0,0.0,152.0,134.0,117.0,1.0,946.0,4,0.0
649,2020,1,1,2,4,2020-01-02,B6,20409,B6,N606JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,507.0,6.0,6.0,0.0,0.0,0001-0559,13.0,520.0,720.0,5.0,733,725.0,-8.0,0.0,0.0,-1.0,0700-0759,0.0,0.0,152.0,138.0,120.0,1.0,946.0,4,0.0
650,2020,1,1,3,5,2020-01-03,B6,20409,B6,N775JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,500.0,-1.0,0.0,0.0,-1.0,0001-0559,14.0,514.0,709.0,19.0,733,728.0,-5.0,0.0,0.0,-1.0,0700-0759,0.0,0.0,152.0,148.0,115.0,1.0,946.0,4,0.0
651,2020,1,1,4,6,2020-01-04,B6,20409,B6,N768JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,506.0,5.0,5.0,0.0,0.0,0001-0559,11.0,517.0,658.0,4.0,733,702.0,-31.0,0.0,0.0,-2.0,0700-0759,0.0,0.0,152.0,116.0,101.0,1.0,946.0,4,0.0
652,2020,1,1,5,7,2020-01-05,B6,20409,B6,N796JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,622.0,81.0,81.0,1.0,5.0,0001-0559,31.0,653.0,900.0,4.0,733,904.0,91.0,91.0,1.0,6.0,0700-0759,0.0,0.0,152.0,162.0,127.0,1.0,946.0,4,0.0


Check the shape of the new dataframe.

In [18]:
# select index 0 from shape to get the number of rows and index 1 to get the number of columns

print(f"There are {atl_2020.shape[0]} rows and {atl_2020.shape[1]} columns.")

There are 255324 rows and 56 columns.


Use `rename` to change the name of the column DayOfWeek so that it matches the format of DayofMonth.

In [19]:
atl_2020 = atl_2020.rename(columns = {"DayOfWeek":"DayofWeek"})

Check that it has been changed.

In [20]:
atl_2020.head()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayofWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,Origin,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,Dest,DestCityName,DestState,DestStateFips,DestStateName,DestWac,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,Cancelled,Diverted,CRSElapsedTime,ActualElapsedTime,AirTime,Flights,Distance,DistanceGroup,DivAirportLandings
648,2020,1,1,1,3,2020-01-01,B6,20409,B6,N583JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,503.0,2.0,2.0,0.0,0.0,0001-0559,12.0,515.0,712.0,5.0,733,717.0,-16.0,0.0,0.0,-2.0,0700-0759,0.0,0.0,152.0,134.0,117.0,1.0,946.0,4,0.0
649,2020,1,1,2,4,2020-01-02,B6,20409,B6,N606JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,507.0,6.0,6.0,0.0,0.0,0001-0559,13.0,520.0,720.0,5.0,733,725.0,-8.0,0.0,0.0,-1.0,0700-0759,0.0,0.0,152.0,138.0,120.0,1.0,946.0,4,0.0
650,2020,1,1,3,5,2020-01-03,B6,20409,B6,N775JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,500.0,-1.0,0.0,0.0,-1.0,0001-0559,14.0,514.0,709.0,19.0,733,728.0,-5.0,0.0,0.0,-1.0,0700-0759,0.0,0.0,152.0,148.0,115.0,1.0,946.0,4,0.0
651,2020,1,1,4,6,2020-01-04,B6,20409,B6,N768JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,506.0,5.0,5.0,0.0,0.0,0001-0559,11.0,517.0,658.0,4.0,733,702.0,-31.0,0.0,0.0,-2.0,0700-0759,0.0,0.0,152.0,116.0,101.0,1.0,946.0,4,0.0
652,2020,1,1,5,7,2020-01-05,B6,20409,B6,N796JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,622.0,81.0,81.0,1.0,5.0,0001-0559,31.0,653.0,900.0,4.0,733,904.0,91.0,91.0,1.0,6.0,0700-0759,0.0,0.0,152.0,162.0,127.0,1.0,946.0,4,0.0


---

### Missing Values

Some columns still have missing values. Check which ones, expressed as a percentage of the rows in the dataframe.

In [21]:
# divide the number of missing values for each column by the length of the dataframe 
# and multiply by 100 to get a percentage 

atl_2020.isnull().sum()/len(atl_2020)*100

Year                               0.000000
Quarter                            0.000000
Month                              0.000000
DayofMonth                         0.000000
DayofWeek                          0.000000
FlightDate                         0.000000
Reporting_Airline                  0.000000
DOT_ID_Reporting_Airline           0.000000
IATA_CODE_Reporting_Airline        0.000000
Tail_Number                        1.873306
Flight_Number_Reporting_Airline    0.000000
OriginAirportID                    0.000000
OriginAirportSeqID                 0.000000
OriginCityMarketID                 0.000000
Origin                             0.000000
OriginCityName                     0.000000
OriginState                        0.000000
OriginStateFips                    0.000000
OriginStateName                    0.000000
OriginWac                          0.000000
DestAirportID                      0.000000
DestAirportSeqID                   0.000000
DestCityMarketID                

Columns with missing values:
- Tail_Number
- DepTime
- DepDelay
- DepDelayMinutes
- DepDel15
- DepartureDelayGroups
- TaxiOut
- WheelsOff
- WheelsOn
- TaxiIn
- ArrTime
- ArrDelay
- ArrDelayMinutes
- ArrDel15
- ArrivalDelayGroups
- ActualElapsedTime
- AirTime
- DivAirportLandings

Each of these columns has fewer than 5% missing values, but they still need to be dealt with before moving on.

Use `loc` to select all rows where there are missing values to see if there is any pattern with the missing values.

In [22]:
# isnull() flags missing values; any(axis = 1) selects every row with a missing value in at least one column
atl_2020.loc[atl_2020.isnull().any(axis = 1), :]

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayofWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,Origin,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,Dest,DestCityName,DestState,DestStateFips,DestStateName,DestWac,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,Cancelled,Diverted,CRSElapsedTime,ActualElapsedTime,AirTime,Flights,Distance,DistanceGroup,DivAirportLandings
46573,2020,1,1,12,7,2020-01-12,AA,19805,AA,N944UW,1434,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,14100,1410005,34100,PHL,"Philadelphia, PA",PA,42,Pennsylvania,23,1214,1208.0,-6.0,0.0,0.0,-1.0,1200-1259,15.0,1223.0,,,1419,,,,,,1400-1459,0.0,1.0,125.0,,,1.0,666.0,3,1.0
50295,2020,1,1,18,6,2020-01-18,AA,19805,AA,N93003,1574,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,730,,,,,,0700-0759,,,,,846,,,,,,0800-0859,1.0,0.0,136.0,,,1.0,606.0,3,0.0
52090,2020,1,1,17,5,2020-01-17,AA,19805,AA,N772XF,2427,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,1948,,,,,,1900-1959,,,,,2109,,,,,,2100-2159,1.0,0.0,141.0,,,1.0,606.0,3,0.0
54501,2020,1,1,10,5,2020-01-10,AA,19805,AA,N301NW,1630,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,11298,1129806,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74,1629,,,,,,1600-1659,,,,,1802,,,,,,1800-1859,1.0,0.0,153.0,,,1.0,731.0,3,0.0
72789,2020,1,1,11,6,2020-01-11,AA,19805,AA,N818AW,1119,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,1813,,,,,,1800-1859,,,,,1938,,,,,,1900-1959,1.0,0.0,145.0,,,1.0,606.0,3,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336390,2020,4,12,23,3,2020-12-23,WN,19393,WN,N8581Z,4787,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13871,1387102,33316,OMA,"Omaha, NE",NE,31,Nebraska,65,1350,,,,,,1300-1359,,,,,1520,,,,,,1500-1559,1.0,0.0,150.0,,,1.0,821.0,4,0.0
358281,2020,4,12,29,2,2020-12-29,YX,20452,YX,N134HQ,4961,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,1755,,,,,,1700-1759,,,,,1905,,,,,,1900-1959,1.0,0.0,130.0,,,1.0,606.0,3,0.0
365393,2020,4,12,13,7,2020-12-13,YX,20452,YX,N206JQ,5591,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,12124,1212402,32124,HHH,"Hilton Head, SC",SC,45,South Carolina,37,1515,,,,,,1500-1559,,,,,1623,,,,,,1600-1659,1.0,0.0,68.0,,,1.0,238.0,1,0.0
368147,2020,4,12,16,3,2020-12-16,YX,20452,YX,N650RW,3500,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,12264,1226402,30852,IAD,"Washington, DC",VA,51,Virginia,38,1500,,,,,,1500-1559,,,,,1647,,,,,,1600-1659,1.0,0.0,107.0,,,1.0,534.0,3,0.0


There are 12,393 rows with missing values. Judging by the 'Cancelled' and 'Diverted' columns, most of these rows belong to cancelled or diverted flights. I can check by selecting the rows where 'Cancelled' = 1 and the rows where 'Diverted' = 1.

In [23]:
# select all rows where Cancelled = 1
# there are 11982 cancelled flights in this dataframe

atl_2020.loc[atl_2020["Cancelled"] == 1]

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayofWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,Origin,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,Dest,DestCityName,DestState,DestStateFips,DestStateName,DestWac,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,Cancelled,Diverted,CRSElapsedTime,ActualElapsedTime,AirTime,Flights,Distance,DistanceGroup,DivAirportLandings
50295,2020,1,1,18,6,2020-01-18,AA,19805,AA,N93003,1574,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,730,,,,,,0700-0759,,,,,846,,,,,,0800-0859,1.0,0.0,136.0,,,1.0,606.0,3,0.0
52090,2020,1,1,17,5,2020-01-17,AA,19805,AA,N772XF,2427,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,1948,,,,,,1900-1959,,,,,2109,,,,,,2100-2159,1.0,0.0,141.0,,,1.0,606.0,3,0.0
54501,2020,1,1,10,5,2020-01-10,AA,19805,AA,N301NW,1630,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,11298,1129806,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74,1629,,,,,,1600-1659,,,,,1802,,,,,,1800-1859,1.0,0.0,153.0,,,1.0,731.0,3,0.0
72789,2020,1,1,11,6,2020-01-11,AA,19805,AA,N818AW,1119,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,1813,,,,,,1800-1859,,,,,1938,,,,,,1900-1959,1.0,0.0,145.0,,,1.0,606.0,3,0.0
72795,2020,1,1,17,5,2020-01-17,AA,19805,AA,N920NN,1119,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,1811,,,,,,1800-1859,,,,,1936,,,,,,1900-1959,1.0,0.0,145.0,,,1.0,606.0,3,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336390,2020,4,12,23,3,2020-12-23,WN,19393,WN,N8581Z,4787,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13871,1387102,33316,OMA,"Omaha, NE",NE,31,Nebraska,65,1350,,,,,,1300-1359,,,,,1520,,,,,,1500-1559,1.0,0.0,150.0,,,1.0,821.0,4,0.0
358281,2020,4,12,29,2,2020-12-29,YX,20452,YX,N134HQ,4961,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13930,1393007,30977,ORD,"Chicago, IL",IL,17,Illinois,41,1755,,,,,,1700-1759,,,,,1905,,,,,,1900-1959,1.0,0.0,130.0,,,1.0,606.0,3,0.0
365393,2020,4,12,13,7,2020-12-13,YX,20452,YX,N206JQ,5591,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,12124,1212402,32124,HHH,"Hilton Head, SC",SC,45,South Carolina,37,1515,,,,,,1500-1559,,,,,1623,,,,,,1600-1659,1.0,0.0,68.0,,,1.0,238.0,1,0.0
368147,2020,4,12,16,3,2020-12-16,YX,20452,YX,N650RW,3500,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,12264,1226402,30852,IAD,"Washington, DC",VA,51,Virginia,38,1500,,,,,,1500-1559,,,,,1647,,,,,,1600-1659,1.0,0.0,107.0,,,1.0,534.0,3,0.0


In [24]:
# count the missing values in each column for rows where Cancelled = 1

atl_2020.loc[atl_2020["Cancelled"] == 1].isnull().sum()

Year                                   0
Quarter                                0
Month                                  0
DayofMonth                             0
DayofWeek                              0
FlightDate                             0
Reporting_Airline                      0
DOT_ID_Reporting_Airline               0
IATA_CODE_Reporting_Airline            0
Tail_Number                         4783
Flight_Number_Reporting_Airline        0
OriginAirportID                        0
OriginAirportSeqID                     0
OriginCityMarketID                     0
Origin                                 0
OriginCityName                         0
OriginState                            0
OriginStateFips                        0
OriginStateName                        0
OriginWac                              0
DestAirportID                          0
DestAirportSeqID                       0
DestCityMarketID                       0
Dest                                   0
DestCityName    

Check the rows where diverted = 1.

In [25]:
atl_2020.loc[atl_2020["Diverted"] == 1]

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayofWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,Origin,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,Dest,DestCityName,DestState,DestStateFips,DestStateName,DestWac,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,Cancelled,Diverted,CRSElapsedTime,ActualElapsedTime,AirTime,Flights,Distance,DistanceGroup,DivAirportLandings
46573,2020,1,1,12,7,2020-01-12,AA,19805,AA,N944UW,1434,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,14100,1410005,34100,PHL,"Philadelphia, PA",PA,42,Pennsylvania,23,1214,1208.0,-6.0,0.0,0.0,-1.0,1200-1259,15.0,1223.0,,,1419,,,,,,1400-1459,0.0,1.0,125.0,,,1.0,666.0,3,1.0
96547,2020,1,1,1,3,2020-01-01,DL,19790,DL,N6700,1997,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,12441,1244102,32441,JAC,"Jackson, WY",WY,56,Wyoming,88,950,952.0,2.0,2.0,0.0,0.0,0900-0959,17.0,1009.0,1740.0,19.0,1209,1759.0,,,,,1200-1259,0.0,1.0,259.0,,,1.0,1572.0,7,1.0
99603,2020,1,1,4,6,2020-01-04,WN,19393,WN,N266WN,5041,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13204,1320402,31454,MCO,"Orlando, FL",FL,12,Florida,33,920,957.0,37.0,37.0,1.0,2.0,0900-0959,29.0,1026.0,1400.0,9.0,1050,1409.0,,,,,1000-1059,0.0,1.0,90.0,,,1.0,404.0,2,1.0
102213,2020,1,1,10,5,2020-01-10,WN,19393,WN,N7745A,1422,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,11259,1125904,30194,DAL,"Dallas, TX",TX,48,Texas,74,2115,2112.0,-3.0,0.0,0.0,-1.0,2100-2159,10.0,2122.0,3.0,8.0,2235,11.0,,,,,2200-2259,0.0,1.0,140.0,,,1.0,721.0,3,1.0
105826,2020,1,1,11,6,2020-01-11,WN,19393,WN,N421LV,3128,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10693,1069302,30693,BNA,"Nashville, TN",TN,47,Tennessee,54,940,943.0,3.0,3.0,0.0,0.0,0900-0959,12.0,955.0,1227.0,6.0,950,1233.0,,,,,0900-0959,0.0,1.0,70.0,,,1.0,214.0,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131311,2020,4,12,24,4,2020-12-24,DL,19790,DL,N556NW,1598,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,15304,1530402,33195,TPA,"Tampa, FL",FL,12,Florida,33,1809,1811.0,2.0,2.0,0.0,0.0,1800-1859,17.0,1828.0,2156.0,3.0,1930,2159.0,,,,,1900-1959,0.0,1.0,81.0,,,1.0,406.0,2,1.0
209634,2020,4,12,20,7,2020-12-20,OO,20304,OO,N127SY,5260,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,12266,1226603,31453,IAH,"Houston, TX",TX,48,Texas,74,700,658.0,-2.0,0.0,0.0,-1.0,0700-0759,23.0,721.0,1547.0,11.0,834,1558.0,,,,,0800-0859,0.0,1.0,154.0,,,1.0,689.0,3,1.0
287418,2020,4,12,17,4,2020-12-17,WN,19393,WN,N439WN,1087,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10821,1082106,30852,BWI,"Baltimore, MD",MD,24,Maryland,35,945,950.0,5.0,5.0,0.0,0.0,0900-0959,12.0,1002.0,,,1130,,,,,,1100-1159,0.0,1.0,105.0,,,1.0,577.0,3,1.0
300276,2020,4,12,7,1,2020-12-07,WN,19393,WN,N8615E,2026,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,11292,1129202,30325,DEN,"Denver, CO",CO,8,Colorado,82,710,708.0,-2.0,0.0,0.0,-1.0,0700-0759,11.0,719.0,959.0,7.0,835,1006.0,,,,,0800-0859,0.0,1.0,205.0,,,1.0,1199.0,5,1.0


Count the missing values in each column for rows where Diverted = 1.

In [26]:
atl_2020.loc[atl_2020["Diverted"] == 1].isnull().sum()

Year                                 0
Quarter                              0
Month                                0
DayofMonth                           0
DayofWeek                            0
FlightDate                           0
Reporting_Airline                    0
DOT_ID_Reporting_Airline             0
IATA_CODE_Reporting_Airline          0
Tail_Number                          0
Flight_Number_Reporting_Airline      0
OriginAirportID                      0
OriginAirportSeqID                   0
OriginCityMarketID                   0
Origin                               0
OriginCityName                       0
OriginState                          0
OriginStateFips                      0
OriginStateName                      0
OriginWac                            0
DestAirportID                        0
DestAirportSeqID                     0
DestCityMarketID                     0
Dest                                 0
DestCityName                         0
DestState                

Calculate the percent of missing values that diverted and cancelled flights account for.

In [27]:
# 11982 cancelled rows + 410 diverted rows, out of 12393 rows with missing values
percent_null = ((11982 + 410)/12393)*100
percent_null

99.99193092875011

In [28]:
print(f"Cancelled and Diverted flights account for {round(percent_null, 2)}% of missing values")

Cancelled and Diverted flights account for 99.99% of missing values
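The counts in the calculation above are typed in by hand. As a hedged alternative, the same breakdown could be computed from boolean masks so that it stays correct if the data changes. The helper `missing_row_breakdown` is hypothetical and is shown on a toy dataframe that only borrows the column names `Cancelled`, `Diverted` and `ArrDelay` from this notebook.

```python
import numpy as np
import pandas as pd


def missing_row_breakdown(df):
    """Count rows with missing values and how many of them are cancelled or diverted."""
    has_missing = df.isnull().any(axis=1)
    cancelled = df["Cancelled"] == 1
    diverted = df["Diverted"] == 1
    # rows with missing values that are explained by either flag
    explained = has_missing & (cancelled | diverted)
    return {
        "rows_with_missing": int(has_missing.sum()),
        "cancelled_with_missing": int((has_missing & cancelled).sum()),
        "diverted_with_missing": int((has_missing & diverted).sum()),
        "percent_explained": float(explained.sum() / has_missing.sum() * 100),
    }


# Toy example: three rows have a missing value, two of them are cancelled or diverted
toy = pd.DataFrame({
    "Cancelled": [1.0, 0.0, 0.0, 0.0],
    "Diverted": [0.0, 1.0, 0.0, 0.0],
    "ArrDelay": [np.nan, np.nan, np.nan, 5.0],
})
breakdown = missing_row_breakdown(toy)
```

On the real data this would be called as `missing_row_breakdown(atl_2020)`.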


I'm going to remove these rows with missing values since I'm focusing on the delays. There's still one extra row that was not diverted or cancelled but since it's only 1 row I'm going to remove it as well.

Use `dropna` to remove the rows with missing values.

In [29]:
# axis = 0 drops rows that contain missing values
# axis = 1 would drop columns that contain missing values instead

atl_2020_clean = atl_2020.dropna(axis = 0)

Check the shape of the cleaned dataframe.

In [30]:
# check the shape of the cleaned dataframe
atl_2020_clean.shape

(242931, 56)

There should be no more missing values in the data set. Use `isnull` to check.

In [31]:
# check for missing values using 'isnull' and 'sum'. Summing once gets the total for each column, summing twice gives
# the total for the data set
atl_2020_clean.isnull().sum().sum()

0

The Cancelled and Diverted columns can now be dropped since all the values are 0 in each column. 

In [32]:
atl_2020_clean = atl_2020_clean.drop(columns = ["Cancelled", "Diverted"])

---

### Data Types

Check the data types because they need to be changed to reduce the memory usage of the dataset. Use `info` to check the data type of each column.

In [33]:
atl_2020_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 242931 entries, 648 to 371329
Data columns (total 54 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   Year                             242931 non-null  int64  
 1   Quarter                          242931 non-null  int64  
 2   Month                            242931 non-null  int64  
 3   DayofMonth                       242931 non-null  int64  
 4   DayofWeek                        242931 non-null  int64  
 5   FlightDate                       242931 non-null  object 
 6   Reporting_Airline                242931 non-null  object 
 7   DOT_ID_Reporting_Airline         242931 non-null  int64  
 8   IATA_CODE_Reporting_Airline      242931 non-null  object 
 9   Tail_Number                      242931 non-null  object 
 10  Flight_Number_Reporting_Airline  242931 non-null  int64  
 11  OriginAirportID                  242931 non-null  int64  
 12  

**Notes**:
- Every numeric column has been stored as float64 or int64, which is larger than almost all of the columns need
- FlightDate is stored as an object rather than datetime and needs to be converted

First, change FlightDate to 'datetime' using `to_datetime`.

In [34]:
atl_2020_clean["FlightDate"] = pd.to_datetime(atl_2020_clean["FlightDate"])

Check that FlightDate has been changed.

In [35]:
atl_2020_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 242931 entries, 648 to 371329
Data columns (total 54 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   Year                             242931 non-null  int64         
 1   Quarter                          242931 non-null  int64         
 2   Month                            242931 non-null  int64         
 3   DayofMonth                       242931 non-null  int64         
 4   DayofWeek                        242931 non-null  int64         
 5   FlightDate                       242931 non-null  datetime64[ns]
 6   Reporting_Airline                242931 non-null  object        
 7   DOT_ID_Reporting_Airline         242931 non-null  int64         
 8   IATA_CODE_Reporting_Airline      242931 non-null  object        
 9   Tail_Number                      242931 non-null  object        
 10  Flight_Number_Reporting_Airline  242931 no

**Notes**:

New data type for each column:
- Quarter = int8
- Month = int8
- DayofMonth = int8
- DayofWeek = int8
- Year = int32
- DOT_ID_Reporting_Airline = int32
- Flight_Number_Reporting_Airline = int32
- OriginAirportID = int32                       
- OriginAirportSeqID = int32                       
- OriginCityMarketID = int32                      
- OriginStateFips = int8
- OriginWac = int8
- DestAirportID = int32
- DestAirportSeqID = int32
- DestCityMarketID = int32
- DestStateFips = int8
- DestWac = int8
- CRSDepTime = int32
- DepTime = int32
- DepDelay = float32
- DepDelayMinutes = float32
- DepDel15 = float32 - only has 0 and 1 values
- DepartureDelayGroups = float32
- TaxiOut = float32
- WheelsOff = float32
- WheelsOn = float32
- TaxiIn = float32
- CRSArrTime = int32
- ArrTime = int32
- ArrDelay = float32
- ArrDelayMinutes = float32
- ArrDel15 = float32
- ArrivalDelayGroups = float32
- CRSElapsedTime = float32
- ActualElapsedTime = float32
- AirTime = float32
- Flights = int8 - only value of 1
- Distance = int32
- DistanceGroup = int8
- DivAirportLandings = float32

Start changing the data types. Change all the columns that will be converted to int8 together, then those for int32, and finally those for float32.

Use `astype` to change the data type to the required type.

Change to int8: Quarter, Month, DayofMonth, DayofWeek, OriginStateFips, OriginWac, DestStateFips, DestWac, Flights, DistanceGroup

In [36]:
int8_cols = ["Quarter", "Month", "DayofMonth", "DayofWeek", "OriginStateFips", "OriginWac", "DestStateFips", "DestWac", "Flights", "DistanceGroup"]
atl_2020_clean[int8_cols] = atl_2020_clean[int8_cols].astype("int8")


Change to int32: Year, DOT_ID_Reporting_Airline, Flight_Number_Reporting_Airline, OriginAirportID, OriginAirportSeqID, OriginCityMarketID, CRSDepTime, DepTime, DestAirportID, DestAirportSeqID, DestCityMarketID, CRSArrTime, ArrTime, Distance

In [37]:
int32_cols = ["Year", "DOT_ID_Reporting_Airline", "Flight_Number_Reporting_Airline", "OriginAirportID", "OriginAirportSeqID", "OriginCityMarketID", "CRSDepTime", "DepTime", "DestAirportID", "DestAirportSeqID", "DestCityMarketID", "CRSArrTime", "ArrTime", "Distance"]
atl_2020_clean[int32_cols] = atl_2020_clean[int32_cols].astype("int32")


Change to float32: WheelsOff, WheelsOn, DepDel15, DepartureDelayGroups, TaxiOut, TaxiIn, ArrDel15, ArrivalDelayGroups, DepDelay, DepDelayMinutes, ArrDelay, ArrDelayMinutes, ActualElapsedTime, AirTime, DivAirportLandings, CRSElapsedTime

In [38]:
float32_cols = ["WheelsOff", "WheelsOn", "DepDel15", "DepartureDelayGroups", "TaxiOut", "TaxiIn", "ArrDel15", "ArrivalDelayGroups", "DepDelay", "DepDelayMinutes", "ArrDelay", "ArrDelayMinutes", "ActualElapsedTime", "AirTime", "DivAirportLandings", "CRSElapsedTime"]
atl_2020_clean[float32_cols] = atl_2020_clean[float32_cols].astype("float32")


Check that all the data types have been changed.

In [39]:
atl_2020_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 242931 entries, 648 to 371329
Data columns (total 54 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   Year                             242931 non-null  int32         
 1   Quarter                          242931 non-null  int8          
 2   Month                            242931 non-null  int8          
 3   DayofMonth                       242931 non-null  int8          
 4   DayofWeek                        242931 non-null  int8          
 5   FlightDate                       242931 non-null  datetime64[ns]
 6   Reporting_Airline                242931 non-null  object        
 7   DOT_ID_Reporting_Airline         242931 non-null  int32         
 8   IATA_CODE_Reporting_Airline      242931 non-null  object        
 9   Tail_Number                      242931 non-null  object        
 10  Flight_Number_Reporting_Airline  242931 no

All data types have been changed and memory usage has been significantly reduced.
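The size of the reduction can be measured with `memory_usage`. The sketch below uses a made-up 1,000-row dataframe rather than the flight data, so the byte counts only illustrate the effect of the dtype change.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 rows stored with the default 64-bit types
demo = pd.DataFrame({
    "Month": np.ones(1000, dtype="int64"),
    "DepDelay": np.zeros(1000, dtype="float64"),
})

# Total bytes used by the columns (index excluded)
before = demo.memory_usage(index=False, deep=True).sum()

# Downcast both columns in one call by passing a column-to-dtype mapping
demo = demo.astype({"Month": "int8", "DepDelay": "float32"})
after = demo.memory_usage(index=False, deep=True).sum()

print(before, after)
```

An int8 value takes 1 byte instead of 8, and a float32 value takes 4 bytes instead of 8, which is where the savings come from.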

---

### Weather Data

Weather data was downloaded from [NOAA](https://www.ncdc.noaa.gov/cdo-web/datatools/selectlocation). The location was selected for the Atlanta airport only. The data includes daily weather summaries for each day of 2020.

### Weather Data Dictionary

| **Column** | **Data type** | **Description**                                                                                              |
|------------|---------------|--------------------------------------------------------------------------------------------------------------|
| STATION    | object        | Station identification code                                                                                  |
| NAME       | object        | Name of the airport                                                                                          |
| LATITUDE   | numeric       | The latitude of the airport, reported in decimal degrees                                                     |
| LONGITUDE  | numeric       | The longitude of the airport, reported in decimal degrees                                                    |
| ELEVATION  | numeric       | The elevation of the airport above sea level in metres                                                       |
| DATE       | object        | The date of the weather record                                                                               |
| AWND       | numeric       | Average daily wind speed (miles per hour)                                                                    |
| PGTM       | numeric       | Peak gust time (hours and minutes, hhmm)                                                                     |
| PRCP       | numeric       | Precipitation (inches)                                                                                       |
| SNOW       | numeric       | Snowfall (inches)                                                                                            |
| SNWD       | numeric       | Snow depth (inches)                                                                                          |
| TAVG       | numeric       | Average daily temperature (Fahrenheit)                                                                       |
| TMAX       | numeric       | Maximum daily temperature (Fahrenheit)                                                                       |
| TMIN       | numeric       | Minimum daily temperature (Fahrenheit)                                                                       |
| WDF2       | numeric       | Direction of fastest 2-minute wind (degrees)                                                                 |
| WDF5       | numeric       | Direction of fastest 5-second wind (degrees)                                                                 |
| WSF2       | numeric       | Fastest 2-minute wind speed                                                                                  |
| WSF5       | numeric       | Fastest 5-second wind speed                                                                                  |
| WT01       | numeric       | Weather type: Fog, ice fog, or freezing fog (may include heavy fog). Binary column, 1 = true                 |
| WT02       | numeric       | Weather type: Heavy fog or heavy freezing fog (not always distinguished from fog). Binary column, 1 = true   |
| WT03       | numeric       | Weather type: Thunder. Binary column, 1 = true                                                               |
| WT04       | numeric       | Weather type: Ice pellets, sleet, snow pellets, or small hail. Binary column, 1 = true                       |
| WT05       | numeric       | Weather type: Hail (may include small hail). Binary column, 1 = true                                         |
| WT06       | numeric       | Weather type: Glaze or rime. Binary column, 1 = true                                                         |
| WT08       | numeric       | Weather type: Smoke or haze. Binary column, 1 = true                                                         |

Read in the weather data.

In [40]:
atl_2020_weather = pd.read_csv("data/2020/atl_2020_weather.csv")

Check the dataframe.

In [41]:
atl_2020_weather.head()

Unnamed: 0,STATION,NAME,LATITUDE,LONGITUDE,ELEVATION,DATE,AWND,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,WDF2,WDF5,WSF2,WSF5,WT01,WT02,WT03,WT04,WT05,WT06,WT08
0,USW00013874,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,33.62972,-84.44224,308.2,2020-01-01,7.16,,0.0,0.0,0.0,45,57,36,280,330,13.0,17.0,,,,,,,
1,USW00013874,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,33.62972,-84.44224,308.2,2020-01-02,6.71,,0.92,0.0,0.0,47,50,46,40,30,16.1,21.0,1.0,,,,,,
2,USW00013874,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,33.62972,-84.44224,308.2,2020-01-03,5.82,,0.97,0.0,0.0,56,63,50,300,280,23.0,28.0,1.0,1.0,,,,,
3,USW00013874,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,33.62972,-84.44224,308.2,2020-01-04,14.54,,0.14,0.0,0.0,56,59,37,300,310,33.1,42.1,1.0,,,,,,1.0
4,USW00013874,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,33.62972,-84.44224,308.2,2020-01-05,8.28,,0.0,0.0,0.0,41,55,32,320,320,21.9,28.0,,,,,,,


Check the shape of the dataframe. The dataframe has 366 rows because 2020 was a leap year.

In [42]:
atl_2020_weather.shape

(366, 25)

Check the dataframe information. The weather type columns (WT01, WT02, etc.) contain missing values, and the column PGTM (peak gust time) is 100% empty, so I will drop it.

After reviewing the NOAA documentation (see NOAA_documentation.pdf) I noticed that there are 22 possible Weather Types, but only 7 are available for the Atlanta airport. I cannot infer that a missing value implies any other weather type because the full list of possibilities is not available in this data set. Therefore, I will drop these columns from the dataframe as well.

In [43]:
atl_2020_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   STATION    366 non-null    object 
 1   NAME       366 non-null    object 
 2   LATITUDE   366 non-null    float64
 3   LONGITUDE  366 non-null    float64
 4   ELEVATION  366 non-null    float64
 5   DATE       366 non-null    object 
 6   AWND       366 non-null    float64
 7   PGTM       0 non-null      float64
 8   PRCP       366 non-null    float64
 9   SNOW       366 non-null    float64
 10  SNWD       366 non-null    float64
 11  TAVG       366 non-null    int64  
 12  TMAX       366 non-null    int64  
 13  TMIN       366 non-null    int64  
 14  WDF2       366 non-null    int64  
 15  WDF5       366 non-null    int64  
 16  WSF2       366 non-null    float64
 17  WSF5       366 non-null    float64
 18  WT01       142 non-null    float64
 19  WT02       21 non-null     float64
 20  WT03      

I also noticed that the columns 'SNOW' and 'SNWD' appear to have only zero values, which is expected for Atlanta since it does not typically snow there. I can confirm this by checking the value counts for each column.

In [44]:
atl_2020_weather["SNOW"].value_counts()

0.0    366
Name: SNOW, dtype: int64

In [45]:
atl_2020_weather["SNWD"].value_counts()

0.0    366
Name: SNWD, dtype: int64
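Checking `value_counts` column by column works for two columns. As a minimal sketch, `nunique` can flag every constant column in one step. The dataframe `weather_demo` below is made up for illustration.

```python
import pandas as pd

# Hypothetical three-day sample of the weather data
weather_demo = pd.DataFrame({
    "SNOW": [0.0, 0.0, 0.0],
    "SNWD": [0.0, 0.0, 0.0],
    "PRCP": [0.0, 0.92, 0.97],
})

# Number of distinct values in each column
n_unique = weather_demo.nunique()

# Columns with a single distinct value carry no information for modeling
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```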

I can now drop all of these columns from the dataframe, along with 'STATION', 'LATITUDE', 'LONGITUDE' and 'ELEVATION' since these aren't required for modeling or analysis. I'm also choosing to drop the columns WDF5 and WSF5 because they had missing values in the 2021 data set, and I want to ensure consistency across both data sets (see Notebook 2, Cleaning and Preprocessing - 2021 Data).

In [46]:
atl_2020_weather = atl_2020_weather.drop(columns = ["STATION", "LATITUDE", "LONGITUDE", "ELEVATION", "PGTM", "SNOW", "SNWD", "WT01", "WT02", "WT03", "WT04", "WT05", "WT06", "WT08", "WDF5", "WSF5"])


Check that the columns have been dropped.

In [47]:
atl_2020_weather.head()

Unnamed: 0,NAME,DATE,AWND,PRCP,TAVG,TMAX,TMIN,WDF2,WSF2
0,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-01,7.16,0.0,45,57,36,280,13.0
1,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-02,6.71,0.92,47,50,46,40,16.1
2,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-03,5.82,0.97,56,63,50,300,23.0
3,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-04,14.54,0.14,56,59,37,300,33.1
4,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-05,8.28,0.0,41,55,32,320,21.9


My next step is to merge this data set with atl_2020_clean using the column 'DATE' from atl_2020_weather and 'FlightDate' from atl_2020_clean. Before I can do that, I need to convert the data type of the column 'DATE' in atl_2020_weather, since the key columns have to be of the same data type to be merged.

In [48]:
atl_2020_weather["DATE"] = pd.to_datetime(atl_2020_weather["DATE"])
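A quick way to confirm that the two key columns are compatible is to compare their dtypes before and after the conversion. The two small dataframes below are hypothetical stand-ins for the flight and weather data.

```python
import pandas as pd

# Hypothetical stand-ins: the flight dates are already datetime,
# while the weather dates are still strings, as they are after read_csv
flights_demo = pd.DataFrame({"FlightDate": pd.to_datetime(["2020-01-01", "2020-01-02"])})
weather_demo = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-02"], "TAVG": [45, 47]})

# The key dtypes differ before the conversion
differs_before = flights_demo["FlightDate"].dtype != weather_demo["DATE"].dtype

# Convert the string dates, then compare again
weather_demo["DATE"] = pd.to_datetime(weather_demo["DATE"])
same_after = flights_demo["FlightDate"].dtype == weather_demo["DATE"].dtype

print(differs_before, same_after)
```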

In order to reduce memory usage in this dataframe and the new merged dataframe, I will also change the data types of AWND, PRCP, WSF2, TAVG, TMAX, TMIN and WDF2.

In [49]:
# change these columns to float32
atl_2020_weather[["AWND", "PRCP", "WSF2"]] = atl_2020_weather[["AWND", "PRCP", "WSF2"]].astype('float32')


In [50]:
# change these columns to int32
atl_2020_weather[["TAVG", "TMAX", "TMIN", "WDF2"]] = atl_2020_weather[["TAVG", "TMAX", "TMIN", "WDF2"]].astype('int32')


In [51]:
atl_2020_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   NAME    366 non-null    object        
 1   DATE    366 non-null    datetime64[ns]
 2   AWND    366 non-null    float32       
 3   PRCP    366 non-null    float32       
 4   TAVG    366 non-null    int32         
 5   TMAX    366 non-null    int32         
 6   TMIN    366 non-null    int32         
 7   WDF2    366 non-null    int32         
 8   WSF2    366 non-null    float32       
dtypes: datetime64[ns](1), float32(3), int32(4), object(1)
memory usage: 15.9+ KB


Use `merge` to join the dataframes, joining on 'FlightDate' from the left dataframe and 'DATE' from the right dataframe.

In [52]:
atl_2020_clean = atl_2020_clean.merge(atl_2020_weather, left_on = "FlightDate", right_on = "DATE")

Check the shape of the new dataframe.

In [53]:
atl_2020_clean.shape

(242931, 63)
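The row count is unchanged, which suggests every flight date found a matching weather record. As a hedged sketch, `merge` can also check this explicitly through its `validate` and `indicator` parameters. The dataframes below are made up, and `how="left"` is a choice for this sketch that differs from the default inner join used in the cell above.

```python
import pandas as pd

# Hypothetical stand-ins for the flight and weather dataframes
flights_demo = pd.DataFrame({
    "FlightDate": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    "DepDelay": [2.0, -7.0, 9.0],
})
weather_demo = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "TAVG": [45, 47],
})

merged = flights_demo.merge(
    weather_demo,
    left_on="FlightDate",
    right_on="DATE",
    how="left",              # keep every flight, even without a weather match
    validate="many_to_one",  # raise an error if a date appears twice in the weather data
    indicator=True,          # add a _merge column recording where each row came from
)

# Flights whose date had no weather record
unmatched = (merged["_merge"] != "both").sum()
print(len(merged) == len(flights_demo), unmatched)
```

With a left join, an unmatched flight shows up as a row with missing weather values instead of disappearing, so the problem is easier to spot.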

Check that the data has been merged correctly.

In [54]:
atl_2020_clean.head()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayofWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,Flight_Number_Reporting_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,Origin,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,Dest,DestCityName,DestState,DestStateFips,DestStateName,DestWac,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,CRSElapsedTime,ActualElapsedTime,AirTime,Flights,Distance,DistanceGroup,DivAirportLandings,NAME,DATE,AWND,PRCP,TAVG,TMAX,TMIN,WDF2,WSF2
0,2020,1,1,1,3,2020-01-01,B6,20409,B6,N583JB,996,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,10721,1072102,30721,BOS,"Boston, MA",MA,25,Massachusetts,13,501,503,2.0,2.0,0.0,0.0,0001-0559,12.0,515.0,712.0,5.0,733,717,-16.0,0.0,0.0,-2.0,0700-0759,152.0,134.0,117.0,1,946,4,0.0,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-01,7.16,0.0,45,57,36,280,13.0
1,2020,1,1,1,3,2020-01-01,B6,20409,B6,N591JB,1153,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,13204,1320402,31454,MCO,"Orlando, FL",FL,12,Florida,33,1203,1156,-7.0,0.0,0.0,-1.0,1200-1259,12.0,1208.0,1305.0,7.0,1337,1312,-25.0,0.0,0.0,-2.0,1300-1359,94.0,76.0,57.0,1,404,2,0.0,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-01,7.16,0.0,45,57,36,280,13.0
2,2020,1,1,1,3,2020-01-01,B6,20409,B6,N571JB,2932,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,11697,1169706,32467,FLL,"Fort Lauderdale, FL",FL,12,Florida,33,925,917,-8.0,0.0,0.0,-1.0,0900-0959,11.0,928.0,1100.0,12.0,1123,1112,-11.0,0.0,0.0,-1.0,1100-1159,118.0,115.0,92.0,1,581,3,0.0,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-01,7.16,0.0,45,57,36,280,13.0
3,2020,1,1,1,3,2020-01-01,OH,20397,OH,N526EA,5216,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,11057,1105703,31057,CLT,"Charlotte, NC",NC,37,North Carolina,36,1943,1952,9.0,9.0,0.0,0.0,1900-1959,20.0,2012.0,2103.0,14.0,2113,2117,4.0,4.0,0.0,0.0,2100-2159,90.0,85.0,51.0,1,226,1,0.0,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-01,7.16,0.0,45,57,36,280,13.0
4,2020,1,1,1,3,2020-01-01,OH,20397,OH,N597NN,5354,10397,1039707,30397,ATL,"Atlanta, GA",GA,13,Georgia,34,11057,1105703,31057,CLT,"Charlotte, NC",NC,37,North Carolina,36,1810,1917,67.0,67.0,1.0,4.0,1800-1859,15.0,1932.0,2014.0,33.0,1931,2047,76.0,76.0,1.0,5.0,1900-1959,81.0,90.0,42.0,1,226,1,0.0,ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...,2020-01-01,7.16,0.0,45,57,36,280,13.0


Drop the columns 'NAME' and 'DATE' since they are redundant now.

In [55]:
atl_2020_clean = atl_2020_clean.drop(columns = ['NAME', 'DATE'])

Final shape of the dataframe with weather variables added.

In [56]:
atl_2020_clean.shape

(242931, 61)

Save a csv version for future use.

In [57]:
atl_2020_clean.to_csv("data/2020/ATL_2020_clean.csv")

Save a pickle version of this cleaned data set with the new data types preserved.

In [58]:
# use 'to_pickle' to save a version of the data
atl_2020_clean.to_pickle("data/atl_2020_clean.pkl")
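The reason for keeping both formats can be seen in a small round trip: a CSV file stores only text, so the reduced data types have to be re-applied after loading, while a pickle restores them as saved. The sketch below uses a made-up dataframe and a temporary directory instead of the project's `data/` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row dataframe with a datetime column and a downcast integer column
demo = pd.DataFrame({
    "FlightDate": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "Month": pd.Series([1, 1], dtype="int8"),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    pkl_path = os.path.join(tmp, "demo.pkl")

    # index=False is a choice for this sketch; it keeps the index out of the file
    demo.to_csv(csv_path, index=False)
    demo.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# The CSV round trip falls back to default types; the pickle keeps int8
print(from_csv.dtypes["Month"], from_pkl.dtypes["Month"])
```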

---

### Next Steps

This cleaned ATL 2020 data set will be merged with the cleaned ATL 2021 data set and further EDA will be completed in Notebook 3 (Exploratory Data Analysis).

The next notebook is Notebook 2 (Cleaning and Preprocessing - 2021 Data).

---