# Talk about Framing and outlining the problem (Business problem and specifics)
# Talk about data

# Will it be delayed?

Everyone who has flown has experienced a delayed or cancelled flight. Both airlines and airports would like to improve their on-time performance and predict, several days in advance, when a flight will be delayed or cancelled. You are being hired to build a model that can predict whether a flight will be delayed. To learn more, you must schedule a meeting with your client (me). To schedule an appointment, send an event request through Google Calendar for a 15-minute meeting. Both you and your project partner must attend the meeting. Come prepared with questions to ask your client. Remember that your client is not a data scientist, so you will need to explain things in a way that is easy to understand. Make sure that your communications are efficient, well thought out, and not redundant, as your client might get frustrated and "fire" you. (This applies only to getting information from your client. It does not necessarily apply to asking for help with the project itself, which you should do continuously.)

For this project you must go through almost all steps in the checklist. You must write responses for all items, as done in the homeworks; sometimes the response will simply be "does not apply". Keep your progress and thoughts organized in this document and use formatting as appropriate (using markdown to add headers and sub-headers for each major part). Some changes to the checklist:

* Do not do the final part (launching the product).
* Your presentation will be written in a dedicated section of this document (no slides or anything like that). It should include a high-level summary of your results (including what you learned about the data, the "accuracy" of your model, what features were important, etc.). It should be written for your client, not your professor or teammates. It should include the best summary plots/graphics/data points.
* The models and hyperparameters you should consider during short-listing and fine-tuning will be released at a later time (dependent on how far we get over the next two weeks).
* Data retrieval must be automatic as part of the code (so it can easily be re-run and grab the latest data). Do not commit any data to the repository.
* Your submission must include a pickled final model along with this notebook.
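
The pickled-model requirement in the last bullet can be met with the standard-library `pickle` module. What follows is a minimal sketch, not the project's actual model: the toy data, the `LogisticRegression` estimator, and the file name `final_model.pkl` are placeholders chosen for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real flight features (assumption).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# Hypothetical file name; written to a temp directory here so the sketch
# does not leave files in the repository.
model_path = Path(tempfile.mkdtemp()) / "final_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool(
    np.array_equal(model.predict(X_toy), restored.predict(X_toy))
)
print(same_predictions)
```

A pickle is only reliably loadable with the same scikit-learn version that wrote it, so it is worth recording the installed version next to the file.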


# Notes

* Client is a gov agency tasked with improving the efficiency of commercial air traffic from the consumer standpoint
* Predict delays and errors 7 days in advance
* Data goes back to the 80s (submitted takeoff time, scheduled departure time, etc.) and is on BTS (get a link); only look at 2023 and 2024
* Fewer predictions, but more accurate predictions, by 1/4
* Delay causes: National Air System (group that runs operations at a single airport and makes all decisions, so something they do can cause a run down), weather issues, security (TSA, bad person on deck), late arrival, cancellations; broken down into minor, medium, major categories
* Things to ignore in the data: diverted flights and any international origin or destination (only USA to USA)
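
The notes restrict the data to 2023 and 2024 on BTS, and the assignment requires retrieval to happen in code. A minimal sketch of that loop is below. The URL template is an assumption modeled on the naming of the BTS prezipped monthly files and must be checked against the BTS site before use; `monthly_urls` and `download_month` are hypothetical helper names.

```python
import io
import urllib.request
import zipfile

import pandas as pd

# ASSUMPTION: pattern for the BTS monthly zip files; verify on the BTS site.
BTS_URL_TEMPLATE = (
    "https://transtats.bts.gov/PREZIP/"
    "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{year}_{month}.zip"
)


def monthly_urls(years, template=BTS_URL_TEMPLATE):
    """Return one (year, month, url) tuple per calendar month in `years`."""
    return [
        (year, month, template.format(year=year, month=month))
        for year in years
        for month in range(1, 13)
    ]


def download_month(url):
    """Download one zip archive and load the first CSV inside it."""
    with urllib.request.urlopen(url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
    with archive.open(csv_names[0]) as f:
        return pd.read_csv(f, low_memory=False)


# Build the list of files to fetch; nothing is downloaded at this point.
urls = monthly_urls([2023, 2024])
print(len(urls))
```

Calling `download_month` on each URL and combining the results with `pd.concat` would produce the full two-year table without committing any data to the repository.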

In [2]:
!pip install numpy scipy pandas matplotlib scikit-learn pyarrow fastparquet seaborn

Collecting scikit-learn
  Using cached scikit_learn-1.6.1-cp313-cp313-macosx_12_0_arm64.whl.metadata (31 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Using cached scikit_learn-1.6.1-cp313-cp313-macosx_12_0_arm64.whl (11.1 MB)
Using cached joblib-1.4.2-py3-none-any.whl (301 kB)
Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.4.2 scikit-learn-1.6.1 threadpoolctl-3.6.0

[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: pip install --upgrade pip


In [3]:
# All of your imports here (you may need to add some)
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow as pa
import fastparquet as fp
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Frame problem, get data, explore

In [6]:
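def read_csv(file_path):
    """Read a CSV file into a pandas DataFrame; return None if it cannot be read."""
    try:
        # low_memory=False parses the file in one pass, which avoids the
        # mixed-type inference that chunked parsing can produce.
        df = pd.read_csv(file_path, low_memory=False)
        return df
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

    return None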


read_csv('T_ONTIME_REPORTING.csv')



Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N131EV,...,,,,,,,,,,
1,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N132EV,...,,,,,,,,,,
2,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N132EV,...,,,,,,,,,,
3,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N133EV,...,,,,,,,,,,
4,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N133EV,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
547266,2024,1,1,31,3,1/31/2024 12:00:00 AM,YX,20452,YX,N879RW,...,,,,,,,,,,
547267,2024,1,1,31,3,1/31/2024 12:00:00 AM,YX,20452,YX,N882RW,...,,,,,,,,,,
547268,2024,1,1,31,3,1/31/2024 12:00:00 AM,YX,20452,YX,N882RW,...,,,,,,,,,,
547269,2024,1,1,31,3,1/31/2024 12:00:00 AM,YX,20452,YX,N882RW,...,,,,,,,,,,


In [7]:
train_set = read_csv('T_ONTIME_REPORTING.csv')
train_set.head()



Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
0,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N131EV,...,,,,,,,,,,
1,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N132EV,...,,,,,,,,,,
2,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N132EV,...,,,,,,,,,,
3,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N133EV,...,,,,,,,,,,
4,2024,1,1,1,1,1/1/2024 12:00:00 AM,9E,20363,9E,N133EV,...,,,,,,,,,,


* From the client notes: ignore diverted flights and any flight with an international origin or destination (only USA to USA)

In [8]:
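# Work on a reproducible 1,000-row sample while exploring.
train_sample = train_set.sample(n=1000, random_state=42)

# Separate the candidate target column from the features.
X = train_sample.copy()
y = X.pop('DIVERTED')

# A bare expression is only displayed when it is the last line of a cell.
print(X.shape)

X.head()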


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
110938,2024,1,1,7,7,1/7/2024 12:00:00 AM,AA,19805,AA,N989NN,...,,,,,,,,,,
17170,2024,1,1,1,1,1/1/2024 12:00:00 AM,YX,20452,YX,N745YX,...,,,,,,,,,,
496423,2024,1,1,29,1,1/29/2024 12:00:00 AM,9E,20363,9E,N937XJ,...,,,,,,,,,,
166047,2024,1,1,10,3,1/10/2024 12:00:00 AM,B6,20409,B6,N974JT,...,,,,,,,,,,
176788,2024,1,1,10,3,1/10/2024 12:00:00 AM,WN,19393,WN,N8701Q,...,,,,,,,,,,
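
The cell above pops `DIVERTED` as `y`, but the client asked for diverted flights to be ignored, so that column is better suited as a row filter than as a prediction target. A hedged sketch of one alternative follows. It assumes the extract contains the BTS columns `ARR_DEL15` (1 when the arrival is 15 or more minutes late) and `CANCELLED`. Whether a cancelled flight counts as delayed is a question for the client, and the toy frame stands in for `train_sample`.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for train_sample (values are made up for illustration).
toy = pd.DataFrame({
    "DIVERTED":  [0, 0, 1, 0, 0],
    "CANCELLED": [0, 0, 0, 1, 0],
    "ARR_DEL15": [0.0, 1.0, np.nan, np.nan, 0.0],
    "DISTANCE":  [300, 1200, 800, 450, 2100],
})

# Drop diverted flights, as the client requested.
kept = toy[toy["DIVERTED"] == 0].copy()

# ASSUMPTION: treat a cancelled flight as delayed; confirm with the client.
kept["DELAYED"] = (
    (kept["ARR_DEL15"] == 1) | (kept["CANCELLED"] == 1)
).astype(int)

# Remove columns that reveal the outcome, then split features and target.
X_toy = kept.drop(columns=["DIVERTED", "CANCELLED", "ARR_DEL15"])
y_toy = X_toy.pop("DELAYED")

print(y_toy.tolist())
```

Dropping the outcome columns matters because a forecast made 7 days ahead cannot know a flight's actual arrival or cancellation status.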


In [None]:
# Drop flights whose origin or destination is outside the 50 states and DC.
# State FIPS codes run from 1 to 56; 3, 7, 14, 43, and 52 are not assigned to any state.

us_fips_codes = {1,2,4,5,6,8,9,10,11,12,13,15,16,17,18,19,20,
                21,22,23,24,25,26,27,28,29,30,31,32,33,34,
                35,36,37,38,39,40,41,42,44,45,46,47,48,49,
                50,51,53,54,55,56}

train_sample_filtered = train_sample[
    train_sample['ORIGIN_STATE_FIPS'].isin(us_fips_codes) &
    train_sample['DEST_STATE_FIPS'].isin(us_fips_codes)
]
# Keep only the years the client asked for (2023 and 2024).
train_sample_filtered = train_sample_filtered[train_sample_filtered['YEAR'].isin([2023, 2024])]

print(train_sample_filtered.shape)
# train_sample_filtered.head()
train_sample_filtered


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM
110938,2024,1,1,7,7,1/7/2024 12:00:00 AM,AA,19805,AA,N989NN,...,,,,,,,,,,
17170,2024,1,1,1,1,1/1/2024 12:00:00 AM,YX,20452,YX,N745YX,...,,,,,,,,,,
496423,2024,1,1,29,1,1/29/2024 12:00:00 AM,9E,20363,9E,N937XJ,...,,,,,,,,,,
166047,2024,1,1,10,3,1/10/2024 12:00:00 AM,B6,20409,B6,N974JT,...,,,,,,,,,,
226521,2024,1,1,13,6,1/13/2024 12:00:00 AM,UA,19977,UA,N34131,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
351044,2024,1,1,20,6,1/20/2024 12:00:00 AM,WN,19393,WN,N7819A,...,,,,,,,,,,
302369,2024,1,1,18,4,1/18/2024 12:00:00 AM,AA,19805,AA,N452AN,...,,,,,,,,,,
392940,2024,1,1,23,2,1/23/2024 12:00:00 AM,AA,19805,AA,N825NN,...,,,,,,,,,,
139856,2024,1,1,8,1,1/8/2024 12:00:00 AM,UA,19977,UA,N485UA,...,,,,,,,,,,


In [10]:
train_sample_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 981 entries, 110938 to 126418
Columns: 109 entries, YEAR to DIV5_TAIL_NUM
dtypes: float64(67), int64(21), object(21)
memory usage: 843.0+ KB
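
The `info()` summary reports 109 columns, and the previews above show the `DIV*` diversion columns as blank. One way to quantify that is to compute the share of missing values per column and list the columns that are entirely empty. The sketch below uses a toy frame in place of `train_sample_filtered`; `missing_share` and `empty_columns` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_sample_filtered (values are made up).
toy = pd.DataFrame({
    "YEAR": [2024, 2024, 2024, 2024],
    "DEP_DELAY": [5.0, np.nan, -3.0, 40.0],
    "DIV5_AIRPORT": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns with no values at all carry no information and can be dropped.
empty_columns = missing_share[missing_share == 1.0].index.tolist()
trimmed = toy.drop(columns=empty_columns)

print(missing_share)
print(empty_columns)
```

Running the same three lines on the filtered sample would show how many of the 109 columns are worth keeping before any modeling starts.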
