# Flight delay time exploratory data analysis

In [None]:
import numpy as np
import pandas as pd
import glob
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

First we read in the input files. We can use the `glob` package with `*` as a wildcard to make a list of all the CSV files, and then read and concatenate every file in the list to get a single DataFrame.

In [None]:
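# Read every CSV in the dataset folder and stack them into one DataFrame.
# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of
# repeating each file's own row numbers.
csv_files = sorted(glob.glob("/kaggle/input/historical-flight-and-weather-data/*.csv"))
df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)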

Next, let's explore some basic characteristics of our data.

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.hist(figsize=(20,20)); # Tip: put a semicolon at the end of the line to avoid printing a bunch of text output.

In [None]:
df.shape

From the initial analysis above, we can see that we've got a dataset of roughly 5.5 million flights, with each record including information about the airline (`carrier_code`), origin and destination airport, date and time, and weather. This dataset is not well documented, but we'll assume that `*_x` corresponds to weather at the origin airport and `*_y` corresponds to weather at the destination airport. There is also information about flight delays and cancellations.
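One quick sanity check on the `*_x` / `*_y` assumption is to list the column names that appear with both suffixes. The sketch below uses a hypothetical helper, `suffix_pairs`, and made-up column names; in the notebook we would pass `df.columns` instead. It only confirms that the columns come in pairs, not which airport each suffix refers to.

```python
def suffix_pairs(columns, left="_x", right="_y"):
    """Return base names that occur with both suffixes, e.g. 'HourlyWindSpeed'."""
    names = set(columns)
    return sorted(
        name[: -len(left)]
        for name in names
        if name.endswith(left) and name[: -len(left)] + right in names
    )

# Made-up column names for illustration; in the notebook use suffix_pairs(df.columns).
example_columns = ["carrier_code", "HourlyWindSpeed_x", "HourlyWindSpeed_y", "arrival_delay"]
print(suffix_pairs(example_columns))
```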

Our goal is always to do something useful. With this dataset, we could gain insight into which conditions are related to delayed and cancelled flights, and potentially predict or avoid those delays in the future, so we will explore the dataset with that goal in mind.

First, we'll look into the frequency of delays and cancellations:

In [None]:
(df.arrival_delay > 0).sum() / df.shape[0]

In [None]:
(df.arrival_delay > 30).sum() / df.shape[0]

In [None]:
(df.arrival_delay > 60).sum() / df.shape[0]

In [None]:
(df.departure_delay > 0).sum() / df.shape[0]

In [None]:
((df.arrival_delay > 0) & (df.departure_delay > 0)).sum() / df.shape[0]

In [None]:
df.cancelled_code.value_counts()

In [None]:
(df.cancelled_code != "N").sum() / df.shape[0]

From the above, we can see that 34% of flight arrivals are delayed, 12% are delayed by more than 30 minutes, and 7% are delayed by more than one hour. (We're assuming the times are in minutes. Hopefully the benefit of having a well-documented dataset is apparent here.)

If we assume that a cancelled code of "N" means not cancelled, and everything else is cancelled, then about 1.5% of flights are cancelled.
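The threshold checks above all follow the same pattern, so they can be collected into one small helper. This is a minimal sketch: `delay_fractions` is a hypothetical name, the thresholds are assumed to be in minutes, and the example values are synthetic. In the notebook we would pass `df.arrival_delay` or `df.departure_delay`.

```python
import pandas as pd

def delay_fractions(delays: pd.Series, thresholds=(0, 30, 60)) -> pd.Series:
    """Fraction of all rows whose delay exceeds each threshold.

    Missing delays compare as False, so they count as "not delayed",
    which matches dividing by df.shape[0] in the cells above.
    """
    total = len(delays)
    return pd.Series(
        {t: (delays > t).sum() / total for t in thresholds},
        name="fraction_delayed",
    )

# Synthetic delays for illustration only.
example_delays = pd.Series([-5, 0, 10, 45, 90])
print(delay_fractions(example_delays))
```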

We can start out by looking at how conditions were different for flights that were canceled compared to other flights. One way to do this is to create two sets of histograms:

In [None]:
df_cancel = df[df.cancelled_code != "N"]
df_cancel.hist(figsize=(20,20)); 

In [None]:
df_nocancel = df[df.cancelled_code == "N"]
df_nocancel.hist(figsize=(20,20)); 

One insight this gives us is that the max wind speed for non-cancelled flights appears much higher than the max wind speed for flights that were cancelled. We can investigate this further:

In [None]:
print(df_cancel.HourlyWindSpeed_x.mean(), df_cancel.HourlyWindSpeed_x.median(), df_cancel.HourlyWindSpeed_x.max())
print(df_nocancel.HourlyWindSpeed_x.mean(), df_nocancel.HourlyWindSpeed_x.median(), df_nocancel.HourlyWindSpeed_x.max())

When we look into it more closely, we can see that the mean and median wind speeds for cancelled flights are actually higher than for non-cancelled flights, but the max wind speed is much higher for non-cancelled flights: approximately Mach 5-10, depending on what the units are for the wind speed. So we've found something that could be a predictor of flight cancellations, and we have also found some suspicious data points, which can be just as important.
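The same comparison can be written as a single grouped summary, together with a filter that pulls out the suspicious rows for inspection. This is a sketch under assumptions: `wind_summary` and `suspicious_wind` are hypothetical helpers, the `max_plausible` cutoff of 200 is an arbitrary placeholder because the wind speed units are undocumented, and the example frame is synthetic.

```python
import pandas as pd

def wind_summary(df, wind_col="HourlyWindSpeed_x", code_col="cancelled_code"):
    """Mean, median and max wind speed for cancelled vs. not cancelled flights."""
    # Assumption carried over from the notebook: code "N" means not cancelled.
    status = df[code_col].ne("N").map({True: "cancelled", False: "not cancelled"})
    return df.assign(status=status).groupby("status")[wind_col].agg(["mean", "median", "max"])

def suspicious_wind(df, wind_col="HourlyWindSpeed_x", max_plausible=200.0):
    """Rows whose wind speed exceeds an assumed plausibility cutoff."""
    return df[df[wind_col] > max_plausible]

# Synthetic rows for illustration only.
example = pd.DataFrame({
    "cancelled_code": ["N", "N", "A", "B"],
    "HourlyWindSpeed_x": [5.0, 3000.0, 10.0, 20.0],
})
print(wind_summary(example))
print(suspicious_wind(example))
```

In the notebook, `wind_summary(df)` and `suspicious_wind(df)` would run on the full DataFrame; the cutoff should be revisited once the units are known.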

What should we do next? In groups, brainstorm and implement additional ways to explore this dataset. Some ideas:

* Are flight delays or cancellations related to weather in ways other than the one we just looked at?
* Are they related to certain days of the week, holidays, etc.?
* Are delayed departures and delayed arrivals related to each other?
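As a starting point for the last idea, one possible approach is to compute the correlation between the two delay columns and the share of late departures that also arrive late. This is only a sketch: `delay_relationship` is a hypothetical helper, "late" is assumed to mean a delay greater than zero, and the example data is synthetic.

```python
import pandas as pd

def delay_relationship(df, dep_col="departure_delay", arr_col="arrival_delay"):
    """Return (Pearson correlation, P(arrival late | departure late))."""
    corr = df[dep_col].corr(df[arr_col])  # rows with missing values are skipped
    dep_late = df[dep_col] > 0
    arr_late = df[arr_col] > 0
    # Share of late departures that also arrived late.
    p_arr_given_dep = (dep_late & arr_late).sum() / dep_late.sum()
    return corr, p_arr_given_dep

# Synthetic delays for illustration only.
example = pd.DataFrame({
    "departure_delay": [0, 10, 20, -5, 15],
    "arrival_delay": [-2, 12, 25, -10, -1],
})
corr, p_late = delay_relationship(example)
print(corr, p_late)
```

In the notebook, `delay_relationship(df)` would give the same two numbers for the full dataset.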