# **Data Collection Notebook**

## Objectives

* Fetch monthly airport departure and arrival data ("Airline On-Time Statistics and Delay Causes") from the United States Department of Transportation (https://www.transtats.bts.gov/)
* Concatenate monthly .csv files into one file
* Preliminary data exploration 

## Inputs

* Monthly csv files for "Airline On-Time Statistics and Delay Causes" - public data

## Outputs

* Generate Dataset: outputs/datasets/collection/transtatsAirlineDelay.csv


---

# Change working directory

* Check the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/airline-delay-predictor/jupyter_notebooks'

Set the root directory as the current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/airline-delay-predictor'
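
Moving up one level works only when the notebook is started from the jupyter_notebooks folder, and re-running the cell moves up again. A minimal sketch of a more defensive approach is shown here; the helper name `find_project_root` and the use of requirements.txt as the marker file are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "requirements.txt") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No folder containing {marker!r} at or above {start}")


# Hypothetical usage inside the notebook:
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the search stops at the first folder that holds the marker file, running the cell twice leaves the working directory unchanged.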

# Fetch data from Transtats (USA)

All packages needed for fetching the data are installed from the requirements.txt file:

In [4]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


Download the zip file from https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?20=E using the following filters:
* Select a carrier: All
* Select an airport: All
* Period from: June, 2003
* Period to: June, 2025

The file is saved to: inputs/datasets/raw/transtats_raw_2003june_2025june.zip

Unzip the file to get 'Airline_Delay_Cause.csv' and 'Download_Column_Definitions.xlsx'

In [None]:
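destination_folder = "inputs/datasets/raw"
filename = "transtats_raw_2003june_2025june.zip"

# Extract the named archive into the raw data folder.
# unzip asks before replacing existing files; add -o to overwrite without prompting.
! unzip {destination_folder}/{filename} -d {destination_folder}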

Archive:  inputs/datasets/raw/transtats_raw_2003june_2025june.zip
replace inputs/datasets/raw/Airline_Delay_Cause.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
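
The interactive prompt above blocks the cell when the CSV has already been extracted. As an alternative, the archive can be unpacked with Python's standard-library `zipfile` module, which overwrites existing files without asking. This is only a sketch; the helper name `extract_zip` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, destination):
    """Extract all members of the archive into `destination`, overwriting existing files."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        # Return the member names so the caller can see what was extracted
        return archive.namelist()
```

In this notebook it would be called as `extract_zip("inputs/datasets/raw/transtats_raw_2003june_2025june.zip", "inputs/datasets/raw")`.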


---

# Load and Inspect transtats data

Load the unzipped CSV file into a pandas DataFrame and preview the first rows.

In [13]:

import pandas as pd
import numpy as np

df = pd.read_csv("inputs/datasets/raw/Airline_Delay_Cause.csv")
df.head()

| | year | month | carrier | carrier_name | airport | airport_name | arr_flights | arr_del15 | carrier_ct | weather_ct | ... | security_ct | late_aircraft_ct | arr_cancelled | arr_diverted | arr_delay | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025 | 6 | 9E | Endeavor Air Inc. | ABE | Allentown/Bethlehem/Easton, PA: Lehigh Valley ... | 90.0 | 26.0 | 8.63 | 3.24 | ... | 0.0 | 9.72 | 4.0 | 0.0 | 1884.0 | 561.0 | 223.0 | 282.0 | 0.0 | 818.0 |
| 1 | 2025 | 6 | 9E | Endeavor Air Inc. | ABY | Albany, GA: Southwest Georgia Regional | 5.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 124.0 | 0.0 | 0.0 | 0.0 | 0.0 | 124.0 |
| 2 | 2025 | 6 | 9E | Endeavor Air Inc. | AEX | Alexandria, LA: Alexandria International | 69.0 | 23.0 | 7.21 | 1.82 | ... | 0.0 | 5.32 | 2.0 | 0.0 | 1698.0 | 981.0 | 54.0 | 294.0 | 0.0 | 369.0 |
| 3 | 2025 | 6 | 9E | Endeavor Air Inc. | AGS | Augusta, GA: Augusta Regional at Bush Field | 155.0 | 43.0 | 12.78 | 2.69 | ... | 0.0 | 17.19 | 9.0 | 0.0 | 2877.0 | 827.0 | 198.0 | 517.0 | 0.0 | 1335.0 |
| 4 | 2025 | 6 | 9E | Endeavor Air Inc. | ALB | Albany, NY: Albany International | 86.0 | 29.0 | 9.32 | 0.0 | ... | 0.0 | 15.61 | 4.0 | 0.0 | 1934.0 | 638.0 | 0.0 | 194.0 | 0.0 | 1102.0 |


DataFrame Summary

In [10]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407721 entries, 0 to 407720
Data columns (total 21 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   year                 407721 non-null  int64  
 1   month                407721 non-null  int64  
 2   carrier              407721 non-null  object 
 3   carrier_name         407721 non-null  object 
 4   airport              407721 non-null  object 
 5   airport_name         407721 non-null  object 
 6   arr_flights          407061 non-null  float64
 7   arr_del15            406766 non-null  float64
 8   carrier_ct           407061 non-null  float64
 9   weather_ct           407061 non-null  float64
 10  nas_ct               407061 non-null  float64
 11  security_ct          407061 non-null  float64
 12  late_aircraft_ct     407061 non-null  float64
 13  arr_cancelled        407061 non-null  float64
 14  arr_diverted         407061 non-null  float64
 15  arr_delay        
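
The summary above shows that the numeric columns are not complete: arr_flights has 407061 non-null values out of 407721 rows (660 missing), and arr_del15 has 406766 (955 missing). A short sketch of how the gaps could be tabulated per column follows; the helper name `missing_summary` and the toy DataFrame are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with gaps, largest gap first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data standing in for the Transtats DataFrame
toy = pd.DataFrame({
    "year": [2025, 2025, 2025, 2025],
    "arr_flights": [90.0, 5.0, np.nan, 69.0],
    "arr_del15": [26.0, np.nan, np.nan, 23.0],
})
print(missing_summary(toy))
```

Applied to the notebook's data, `missing_summary(df)` would list only the columns that contain nulls.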

Replace the fractional count variables ending in _ct with relative frequencies: each cause's count divided by the number of delayed arrivals (arr_del15).

In [14]:
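# Columns holding the per-cause delay counts
ct_cols = [col for col in df.columns if col.endswith('_ct')]

for col in ct_cols:
    freq_col = col.replace('_ct', '_freq')
    # Share of delayed arrivals attributed to this cause.
    # Rows where arr_del15 is 0 or missing are set to 0.
    df[freq_col] = np.where(df['arr_del15'] > 0, df[col] / df['arr_del15'], 0)

# Drop the original count columns now that the shares exist
df.drop(columns=ct_cols, inplace=True)
df.head(10)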

| | year | month | carrier | carrier_name | airport | airport_name | arr_flights | arr_del15 | arr_cancelled | arr_diverted | ... | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay | carrier_freq | weather_freq | nas_freq | security_freq | late_aircraft_freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025 | 6 | 9E | Endeavor Air Inc. | ABE | Allentown/Bethlehem/Easton, PA: Lehigh Valley ... | 90.0 | 26.0 | 4.0 | 0.0 | ... | 561.0 | 223.0 | 282.0 | 0.0 | 818.0 | 0.331923 | 0.124615 | 0.169615 | 0.0 | 0.373846 |
| 1 | 2025 | 6 | 9E | Endeavor Air Inc. | ABY | Albany, GA: Southwest Georgia Regional | 5.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 124.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2025 | 6 | 9E | Endeavor Air Inc. | AEX | Alexandria, LA: Alexandria International | 69.0 | 23.0 | 2.0 | 0.0 | ... | 981.0 | 54.0 | 294.0 | 0.0 | 369.0 | 0.313478 | 0.07913 | 0.376087 | 0.0 | 0.231304 |
| 3 | 2025 | 6 | 9E | Endeavor Air Inc. | AGS | Augusta, GA: Augusta Regional at Bush Field | 155.0 | 43.0 | 9.0 | 0.0 | ... | 827.0 | 198.0 | 517.0 | 0.0 | 1335.0 | 0.297209 | 0.062558 | 0.240465 | 0.0 | 0.399767 |
| 4 | 2025 | 6 | 9E | Endeavor Air Inc. | ALB | Albany, NY: Albany International | 86.0 | 29.0 | 4.0 | 0.0 | ... | 638.0 | 0.0 | 194.0 | 0.0 | 1102.0 | 0.321379 | 0.0 | 0.140345 | 0.0 | 0.538276 |
| 5 | 2025 | 6 | 9E | Endeavor Air Inc. | ATL | Atlanta, GA: Hartsfield-Jackson Atlanta Intern... | 2695.0 | 712.0 | 96.0 | 11.0 | ... | 19049.0 | 5308.0 | 10868.0 | 33.0 | 33913.0 | 0.182879 | 0.060899 | 0.205618 | 0.001124 | 0.549466 |
| 6 | 2025 | 6 | 9E | Endeavor Air Inc. | AUS | Austin, TX: Austin - Bergstrom International | 90.0 | 24.0 | 4.0 | 1.0 | ... | 537.0 | 110.0 | 455.0 | 0.0 | 508.0 | 0.411667 | 0.057917 | 0.339167 | 0.0 | 0.19125 |
| 7 | 2025 | 6 | 9E | Endeavor Air Inc. | AVL | Asheville, NC: Asheville Regional | 76.0 | 29.0 | 4.0 | 0.0 | ... | 607.0 | 284.0 | 731.0 | 0.0 | 510.0 | 0.293448 | 0.032414 | 0.371379 | 0.0 | 0.302414 |
| 8 | 2025 | 6 | 9E | Endeavor Air Inc. | BGR | Bangor, ME: Bangor International | 99.0 | 24.0 | 7.0 | 0.0 | ... | 330.0 | 111.0 | 161.0 | 0.0 | 1165.0 | 0.180417 | 0.03375 | 0.2075 | 0.0 | 0.577917 |
| 9 | 2025 | 6 | 9E | Endeavor Air Inc. | BHM | Birmingham, AL: Birmingham-Shuttlesworth Inter... | 137.0 | 53.0 | 11.0 | 0.0 | ... | 1095.0 | 9.0 | 1888.0 | 0.0 | 1322.0 | 0.33283 | 0.006792 | 0.31283 | 0.0 | 0.347547 |
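
The `np.where` call in the cell above writes 0 both when arr_del15 is 0 and when it is missing, so the two cases cannot be told apart afterwards. It also evaluates the division for every row before selecting, which can trigger divide-by-zero warnings. The sketch below is one possible variant, not the notebook's method: it keeps missing totals as NaN and writes 0 only when no flights were delayed. The function name `cause_shares` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd


def cause_shares(frame, total_col="arr_del15", suffix="_ct"):
    """Replace *_ct count columns with *_freq shares of `total_col`."""
    out = frame.copy()
    total = out[total_col]
    ct_cols = [c for c in out.columns if c.endswith(suffix)]
    for col in ct_cols:
        # Divide only by positive totals; zero or missing totals give NaN here
        share = out[col].div(total.where(total > 0))
        # A total of exactly 0 means no delayed flights, so the share is 0;
        # a missing total stays NaN
        out[col[: -len(suffix)] + "_freq"] = share.mask(total == 0, 0.0)
    return out.drop(columns=ct_cols)


# Toy rows: a normal row, a row with no delays, and a row with missing values
toy = pd.DataFrame({
    "arr_del15": [26.0, 0.0, np.nan],
    "carrier_ct": [8.63, 0.0, np.nan],
})
result = cause_shares(toy)
print(result)
```

Whether missing rows should become 0 or stay NaN is a modelling decision; the variant above only makes that choice explicit.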


Change the data type of the 'month' and 'year' columns from int64 to object

In [15]:
df = df.astype({'year': 'object', 'month': 'object'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407721 entries, 0 to 407720
Data columns (total 21 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   year                 407721 non-null  object 
 1   month                407721 non-null  object 
 2   carrier              407721 non-null  object 
 3   carrier_name         407721 non-null  object 
 4   airport              407721 non-null  object 
 5   airport_name         407721 non-null  object 
 6   arr_flights          407061 non-null  float64
 7   arr_del15            406766 non-null  float64
 8   arr_cancelled        407061 non-null  float64
 9   arr_diverted         407061 non-null  float64
 10  arr_delay            407061 non-null  float64
 11  carrier_delay        407061 non-null  float64
 12  weather_delay        407061 non-null  float64
 13  nas_delay            407061 non-null  float64
 14  security_delay       407061 non-null  float64
 15  late_aircraft_del

# Save file 

Create the destination folder outputs/datasets/collection and save the DataFrame there as transtatsAirlineDelay.csv

In [16]:
import os

output_path = "outputs/datasets/collection"
# exist_ok=True avoids an error when the folder already exists
os.makedirs(output_path, exist_ok=True)

# Write the DataFrame without the index column
df.to_csv(os.path.join(output_path, "transtatsAirlineDelay.csv"), index=False)
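
CSV files do not store column types, so the object dtype set for 'year' and 'month' is lost when the file is read back with default settings. The sketch below illustrates this with a small in-memory example; the toy DataFrame and the `io.StringIO` buffer stand in for the real data and file. Passing `dtype` to `pd.read_csv` is one way to restore the intended type, with the caveat that the values then come back as strings.

```python
import io

import pandas as pd

# Toy frame mirroring the notebook's dtype change
toy = pd.DataFrame({"year": [2025, 2025], "month": [6, 6], "arr_flights": [90.0, 5.0]})
toy = toy.astype({"year": "object", "month": "object"})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: pandas infers the types again from the text
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit read: ask for 'year' and 'month' as object, which keeps them as text
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"year": "object", "month": "object"})

print(default_read.dtypes)
print(typed_read["year"].tolist())
```

Any later notebook that loads transtatsAirlineDelay.csv would need the same `dtype` argument if it relies on these columns not being integers.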

---

# Conclusions and Next Steps

* Raw data have been downloaded
* Preliminary checks and variable data type changes have been applied
* Output file saved to the 'collection' subfolder in 'outputs/datasets/'
* Next notebook will tackle data exploration