<a href="https://colab.research.google.com/github/illiyas-sha/Colab-Notebook/blob/main/EDA_Flight_delay_and_causes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Flight Delay and Causes**

This dataset contains flight trips and the causes of their delays.
Using this data, you can find out what caused a flight's delay: a security delay, a NAS delay, a carrier delay, and so on.

Dataset:
https://www.kaggle.com/undersc0re/flight-delay-and-causes


### I hope you find this kernel useful and your **UPVOTES** would be very much appreciated

# **1. Adding dataset to the notebook**
 

# **2. Reading Dataset**

## 2.1 Importing Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## 2.2 Reading data set



1.   Load the dataset using pandas
2.   View the first rows




In [None]:
#load the dataset (pandas is already imported as pd in section 2.1)
df = pd.read_csv("../input/flight-delay-and-causes/Flight_delay.csv")
df.head()

# **3. Data Preprocessing and cleaning**

## Finding columns having NaN values to handle missing values

In [None]:
#finding the shape of the dataframe (rows, columns)
df.shape


> *we have 29 columns and 484551 rows*




In [None]:
# Finding Missing Values by matrix view
# %matplotlib inline
# import missingno as msno
# msno.matrix(df)

In [None]:
df.isnull().sum()

Out of 29 columns, 2 columns have NaN values.



Let us see the percentage of missing values per column.






In [None]:
#share of missing values per column, in percent
missing_percentage = df.isnull().sum().sort_values(ascending=False) / len(df) * 100
missing_percentage


In [None]:
missing_percentage.plot(kind='barh')

In [None]:
missing_percentage[missing_percentage != 0].plot(kind='bar',figsize=(5,7))

Here, the share of missing values is very low: 0.24% and 0.30%.

If a column had a large share of missing values, we could drop it. Here only a few values are missing, so it is better to fill the NaN values than to lose the columns.

Because so few values are missing, we fill them with the most frequent value (the mode) of each column.
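
The drop-or-fill decision described above can be sketched as a small helper. This is only an illustration on a made-up toy frame: the function name `split_by_missing_share`, the 50% threshold, and the `Mostly_Empty` column are assumptions, not part of the dataset or of this notebook's pipeline.

```python
import pandas as pd

def split_by_missing_share(frame, max_share=0.5):
    """Return (columns to drop, columns to impute) based on the share of NaN values."""
    share = frame.isnull().mean()  # fraction of missing values per column
    to_drop = share[share > max_share].index.tolist()
    to_impute = share[(share > 0) & (share <= max_share)].index.tolist()
    return to_drop, to_impute

# Toy frame with made-up values (not the flight data)
toy = pd.DataFrame({
    "Org_Airport": ["ORD", None, "ORD", "ATL"],
    "Mostly_Empty": [None, None, None, 1.0],
    "ArrDelay": [20, 35, 15, 60],
})

to_drop, to_impute = split_by_missing_share(toy)

# Fill the sparse gaps with the most frequent value (mode) of each column
for col in to_impute:
    toy[col] = toy[col].fillna(toy[col].mode()[0])
```

With a threshold like this, a column that is mostly empty would be dropped, while a column with only a few gaps would be filled with its mode.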

In [None]:
df.Org_Airport.mode()

In [None]:
df.Dest_Airport.mode()

Here **Chicago O'Hare International Airport** is the most frequent value, so we replace NaN with this value.

In [None]:
#Replacing Missing values of Org_Airport and Dest_Airport with most frequent values
df['Org_Airport'] = df['Org_Airport'].fillna(df['Org_Airport'].mode()[0])
df['Dest_Airport'] = df['Dest_Airport'].fillna(df['Dest_Airport'].mode()[0])




> Now all Missing values are handled



Finding columns having numerical data and categorical data

In [None]:
#Name of the columns having numeric values
numeric=df.select_dtypes(include=np.number).columns.tolist()
#Number of columns having numeric values 
len(numeric)




> Out of 29 columns, 20 columns have numerical data. And the remaining 9 columns have categorical data.




# **4. Exploratory Data Analysis and Visualization**

## 4.1 Delay categories

Separating Delay into another dataframe

In [None]:
#creating new dataframe by combining 5 types of delays
data=[df['CarrierDelay'],df['WeatherDelay'],df['NASDelay'],df['SecurityDelay'],df['LateAircraftDelay']]
headers = ['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']
df1 = pd.concat(data, axis=1, keys=headers)
df1.head()

In [None]:
df1.isin([0]).sum()

In [None]:
#check whether each row has at least one of the delays
#stores 0 or 1 in 'outcome' ----- 0 - the row has at least one non-zero delay
#                                 1 - the row has no delay (all five delay columns are 0)
df1['outcome'] = (df1[headers] == 0).all(axis=1).astype(int)

In [None]:
#filtering rows which have 1 in 'outcome' column
df1[df1['outcome']==1]
#if no row has 1 in 'outcome', every flight has at least one of the 5 types of delay

All rows have '0' as outcome.

So every flight in this dataset has at least one type of delay.
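
The same check can also be expressed directly as a boolean test per row. The sketch below uses a made-up three-row frame (not the flight data) and additionally counts how many delay causes each row has; the variable names are illustrative.

```python
import pandas as pd

delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay",
              "SecurityDelay", "LateAircraftDelay"]

# Toy rows with made-up minutes; the second row has no delay at all
toy = pd.DataFrame(
    [[10, 0, 0, 0, 5],
     [0, 0, 0, 0, 0],
     [0, 12, 3, 0, 0]],
    columns=delay_cols,
)

# True where at least one of the five delay columns is non-zero
has_delay = (toy[delay_cols] != 0).any(axis=1)

# Number of distinct delay causes recorded for each flight
n_causes = (toy[delay_cols] != 0).sum(axis=1)
```

If `has_delay.all()` is true on the real data, every flight has at least one delay cause, which matches the result above.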

## 4.2 Airlines which have more travel records

In [None]:
#Finding unique airlines in 'Airline' column
Airlines=df.Airline.unique()
len(Airlines)

The data contains 12 unique Airlines

In [None]:
#returns counts of each unique values
value=df.Airline.value_counts()
value

In [None]:
#Horizontal bar plot of this value counts
value.plot(kind='barh')

Southwest Airlines has the largest number of delay records.

## 4.3 Day of week (On which day of the week do delays happen the most?)

In [None]:
# pie plot to show the days of week 
f,ax=plt.subplots(1,2,figsize=(18,6))
df['DayOfWeek'].value_counts().plot.pie(explode=[0.1,0.005,0.005,0.005,0.005,0.005,0.005],autopct='%1.1f%%',ax=ax[0],shadow=True) 
ax[0].set_title('DayOfWeek')
ax[0].set_ylabel('')
sns.countplot(x='DayOfWeek', data=df, ax=ax[1])  # recent seaborn versions require x= as a keyword
ax[1].set_title('DayOfWeek')
plt.show()

print('DayOfWeek encodes the day of the delayed flight: Monday (1), Tuesday (2), Wednesday (3), Thursday (4), Friday (5), Saturday (6), Sunday (7)')

The maximum number of delays happened on **FRIDAY**.
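
To read the day with the most records off as a name rather than a number, the codes can be mapped to weekday names before counting. This is a minimal sketch on a made-up series; it assumes the 1 = Monday to 7 = Sunday coding printed above.

```python
import pandas as pd

# Assumed coding, as printed by the notebook: 1 = Monday ... 7 = Sunday
day_names = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
             5: "Friday", 6: "Saturday", 7: "Sunday"}

# Toy DayOfWeek values (made up, not the flight data)
day_of_week = pd.Series([5, 1, 5, 7, 3, 5, 1])

# Count records per weekday name and pick the most frequent one
counts = day_of_week.map(day_names).value_counts()
busiest_day = counts.idxmax()
```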

## 4.4 Departure Time

In [None]:
#PAD DepTime TO FOUR DIGITS ------ DepTime - 958 to 0958
df['DepTime'] = df.DepTime.map("{:04}".format)
df.head()

In [None]:
#ADDING A COLON AFTER THE FIRST TWO CHARACTERS ----- DepTime - 09:58
df['DepTime'] =df['DepTime'].astype(str).replace(r"(\d{2})(\d+)", r"\1:\2", regex=True)
df.head()

In [None]:
#changing 24:00 to 00:00
#(converting to a standard timestamp raises an error if the column contains a 24:00 value)
df['DepTime'] = df.DepTime.replace(to_replace ='24:', value = '00:', regex = True)
df.head()

In [None]:
#checking the specific row that contains 24:00
df.DepTime[268503]

In [None]:
#Time delta function
#df["DepTime"] = pd.to_datetime(df.DepTime).apply(lambda x: x.strftime(r'%H:%M:%S'))
#df['DepTime'] = pd.to_timedelta(np.where(df['DepTime'].str.count(':') == 1, df['DepTime'] + ':00', df['DepTime']))
#df.head()
# df['DepTimeStamp']=df.apply(lambda r : pd.datetime.combine(r['Date'],r['DepTime']),1)
# df.head()

In [None]:
#Creating a new column for Departure Time Stamp
df['DepTimeStamp']= np.nan

In [None]:
#Combining 'Date' column and 'DepTime' column
df['DepTimeStamp'] = df.Date.map(str) + " " + df.DepTime
df.head()

In [None]:
#Applying time stamp to dataframe DepTimeStamp
df.DepTimeStamp = pd.to_datetime(df.DepTimeStamp)
df.head()

In [None]:
#checking the specific row that contained 24:00 time
df.DepTimeStamp[268503]
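
The padding, colon, and 24:00 steps above can also be folded into one helper. The sketch below is an illustration only: the function name `to_departure_timestamp` is hypothetical, and the day-first date format `%d-%m-%Y` is an assumption about the `Date` column that should be checked against the real data. Like the notebook, it maps 24:00 to 00:00 on the same date.

```python
from datetime import datetime

def to_departure_timestamp(date_str, dep_time, date_format="%d-%m-%Y"):
    """Combine a date string and an HHMM departure time into a datetime."""
    hhmm = f"{int(dep_time):04d}"          # 958 -> "0958"
    hours, minutes = int(hhmm[:2]), int(hhmm[2:])
    if hours == 24:
        hours = 0                          # same convention as above: 24:00 -> 00:00
    day = datetime.strptime(date_str, date_format)
    return day.replace(hour=hours, minute=minutes)

# Made-up example values
morning = to_departure_timestamp("03-01-2019", 958)
midnight = to_departure_timestamp("03-01-2019", 2400)
```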

In [None]:
df.head()

**At what time of day do delays happen most?**

In [None]:
#distribution of departure hours over the 24 hours of the day, for all the data
#(histplot replaces the deprecated distplot; stat='density' corresponds to norm_hist=True)
sns.histplot(df.DepTimeStamp.dt.hour, bins=24, stat='density', shrink=0.75, edgecolor='black', alpha=1.0)
plt.ylabel("Share Of Delay Occurrences")

-- A high percentage of delays occurs between 15:00:00 and 20:00:00, i.e. 3 PM to 8 PM.

-- Flights departing between 3 PM and 8 PM are delayed most often.

-- Flights departing between 12 AM and 5 AM are delayed least often.
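
The share of delayed flights in a time window can be computed directly instead of being read off the plot. This is a minimal sketch on five made-up timestamps, not the flight data.

```python
import pandas as pd

# Toy departure timestamps (made up)
stamps = pd.to_datetime(pd.Series([
    "2019-01-03 09:58", "2019-01-03 15:10", "2019-01-04 17:45",
    "2019-01-04 19:30", "2019-01-05 02:15",
]))

# Fraction of records in each departure hour
hour_share = stamps.dt.hour.value_counts(normalize=True).sort_index()

# Hours 15..19 cover departures from 15:00 up to 19:59
in_window = (hour_share.index >= 15) & (hour_share.index < 20)
afternoon_share = hour_share[in_window].sum()
```

On the real data, replacing `stamps` with `df.DepTimeStamp` would give the exact share behind the 3 PM to 8 PM observation.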


## 4.5 Month

In [None]:
#distribution plot for all the months in the year
sns.displot(df.DepTimeStamp.dt.month,kind="kde", bw_adjust=0.25 )
plt.xlabel("Month")
plt.title("Month vs Delay Occurrence")

The sudden drop in the graph from the 7th to the 12th month surprised me a little, so I rechecked the data.

I found that **the data is available only for the first 6 months of 2019**. There is no data for the other 6 months, which explains the sudden drop.
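
A quick way to confirm which months are covered is to count the records per month and fill the absent months with zero. The sketch below uses four made-up dates rather than the flight data.

```python
import pandas as pd

# Toy timestamps (made up)
stamps = pd.to_datetime(pd.Series([
    "2019-01-15", "2019-03-02", "2019-03-20", "2019-06-30",
]))

# Records per calendar month; months without any record get a count of 0
per_month = stamps.dt.month.value_counts().reindex(range(1, 13), fill_value=0)

# Months for which the data has no records at all
empty_months = per_month[per_month == 0].index.tolist()
```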

## 4.6 Feature selection by **Pearson Correlation**

In [None]:
df.select_dtypes(include=np.number).corr()  # Pearson correlation of the numeric columns only

This contains the raw correlation values. To visualize them, let us plot a heatmap.

In [None]:
#correlation matrix
corrmat = df.select_dtypes(include=np.number).corr()
f, ax = plt.subplots(figsize=(18, 16))
sns.heatmap(corrmat, vmax=.8, square=True,  cmap="YlGnBu",annot=True);
plt.show()

In [None]:
# with the following function we can select highly correlated features
# for each pair of correlated columns it flags the column that appears later in the dataframe

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.select_dtypes(include=np.number).corr()  # numeric columns only
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

In [None]:
#Length of correlated columns
corr_features = correlation(df, 0.9)
len(set(corr_features))

In [None]:
#column names of correlated features
corr_features

These columns can be dropped because they are highly correlated with other columns.
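
To see which column of a correlated pair gets flagged, here is the same idea on a made-up three-column frame. The function name `correlated_columns` and the toy values are illustrative; column `b` is exactly twice column `a`, so it is the one reported.

```python
import pandas as pd

def correlated_columns(frame, threshold):
    """Flag the later column of every pair whose absolute correlation exceeds threshold."""
    corr_matrix = frame.corr()
    flagged = set()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):  # only look at columns that come before column i
            if abs(corr_matrix.iloc[i, j]) > threshold:
                flagged.add(corr_matrix.columns[i])
    return flagged

# Toy frame: b = 2 * a (perfectly correlated), c is only weakly related to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

flagged = correlated_columns(toy, 0.9)
```

Only `b` is flagged, so `a` survives and still carries the shared information.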

In [None]:
df.drop(corr_features,inplace=True,axis=1)
df.head()

The dataset also has 2 empty columns: Cancelled and Diverted.

So no flight was cancelled or diverted in these months.

We can drop those columns, together with the related CancellationCode column.

In [None]:
can_div= {'Cancelled', 'Diverted', 'CancellationCode' }
df.drop(can_div,inplace=True,axis=1)

In [None]:
df.head()

## 4.7 Arrival Delay

In [None]:
from matplotlib import rcParams

# figure size in inches (set before plotting so that it applies to this figure)
rcParams['figure.figsize'] = 11,8

#Arrival delay is the difference in minutes between scheduled and actual arrival time
sns.histplot(df['ArrDelay'])
plt.show()


In [None]:
#skewness and kurtosis
print("Skewness: %f" % df['ArrDelay'].skew())
print("Kurtosis: %f" % df['ArrDelay'].kurt())

--- We know that in a **positively skewed distribution** the values are clustered on the left side of the distribution and the right tail is longer.

--- The data is very tightly clustered: the peak is tall and narrow.

--- So the majority of the delays are short, and only a minority of the delays are long.
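
The link between a long right tail and positive skewness can be illustrated on a small made-up sample in which most delays are short and one is very long. The values below are invented for illustration only.

```python
import pandas as pd

# Toy arrival delays in minutes (made up): many short delays, one very long one
delays = pd.Series([15, 16, 18, 20, 25, 30, 45, 60, 120, 600])

skewness = delays.skew()   # positive when the right tail is longer
kurtosis = delays.kurt()   # excess kurtosis, large for a sharp peak with heavy tails

# In a right-skewed sample the long tail pulls the mean above the median
mean_delay = delays.mean()
median_delay = delays.median()
```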



In [None]:
#min value of Arrival delay(in minutes)
min_value = df.ArrDelay.min()
min_value

In [None]:
#max value of Arrival delay(in minutes)
max_value = df.ArrDelay.max()
max_value

--- The minimum Arrival delay is **15 Minutes**

--- The maximum Arrival delay is **1707 Minutes**

(where Arrival delay is the difference between scheduled arrival time and actual arrival time)

## 4.8 **Airline vs Types of delays**

CarrierDelay →     Flight delay due to the carrier (e.g. maintenance or crew problems, aircraft cleaning, fueling, etc.); 0 = no delay, otherwise the delay in minutes

WeatherDelay →     Flight delay due to weather; 0 = no delay, otherwise the delay in minutes

NASDelay →         Flight delay due to the NAS (National Aviation System); 0 = no delay, otherwise the delay in minutes

SecurityDelay → Flight delay due to security; 0 = no delay, otherwise the delay in minutes

LateAircraftDelay → Flight delay due to the late arrival of the aircraft from its previous flight; 0 = no delay, otherwise the delay in minutes
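
Because each of these columns holds minutes (0 when the cause does not apply), the dominant cause of a flight's delay and each cause's share of the total minutes can be derived directly. This is a minimal sketch on three made-up rows.

```python
import pandas as pd

delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay",
              "SecurityDelay", "LateAircraftDelay"]

# Toy rows with made-up minutes
toy = pd.DataFrame(
    [[20, 0, 5, 0, 0],
     [0, 0, 0, 0, 45],
     [0, 30, 10, 0, 0]],
    columns=delay_cols,
)

# Cause with the most minutes for each flight
dominant_cause = toy[delay_cols].idxmax(axis=1)

# Share of all delay minutes attributed to each cause
total_minutes = toy[delay_cols].sum()
cause_share = total_minutes / total_minutes.sum()
```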

####     4.8.1 Airline vs CarrierDelay(in minutes)


In [None]:
figure, ax1 = plt.subplots(figsize=(35,10))
# select the columns by name; positional indices shift after columns are dropped
ax1.plot(df['Airline'], df['CarrierDelay'], linewidth=0.5, zorder=1)


**American Eagle Airlines Inc.** recorded the longest carrier delay (in minutes) of all airlines.

#### 4.8.2 Airline vs Weather Delay

In [None]:
figure, ax1 = plt.subplots(figsize=(35,10))
# select the columns by name; positional indices shift after columns are dropped
ax1.plot(df['Airline'], df['WeatherDelay'], linewidth=0.5, zorder=1)

**American Airlines Inc.** recorded the longest weather delay (in minutes) of all airlines.

#### 4.8.3 Airline vs NAS delay

In [None]:
figure, ax1 = plt.subplots(figsize=(35,10))
# select the columns by name; positional indices shift after columns are dropped
ax1.plot(df['Airline'], df['NASDelay'], linewidth=0.5, zorder=1)

**American Airlines Inc.** recorded the longest NAS delay (in minutes) of all airlines.

#### 4.8.4  Airline vs Security Delay

In [None]:
figure, ax1 = plt.subplots(figsize=(35,10))
# select the columns by name; positional indices shift after columns are dropped
ax1.plot(df['Airline'], df['SecurityDelay'], linewidth=0.5, zorder=1)

**Atlantic Southeast Airlines** recorded the longest security delay (in minutes) of all airlines.

####  4.8.5 Airline vs Late Aircraft Delay

In [None]:
figure, ax1 = plt.subplots(figsize=(35,10))
# select the columns by name; positional indices shift after columns are dropped
ax1.plot(df['Airline'], df['LateAircraftDelay'], linewidth=0.5, zorder=1)

**United Airline Inc.** recorded the longest late aircraft delay (in minutes) of all airlines.
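
The readings from the five plots in this section can be cross-checked numerically by taking the maximum of each delay type per airline. The sketch below uses a made-up frame with placeholder airline names and only two delay columns, to keep it short.

```python
import pandas as pd

# Toy data with made-up airlines and minutes
toy = pd.DataFrame({
    "Airline": ["A Air", "A Air", "B Air", "B Air", "C Air"],
    "CarrierDelay": [10, 250, 40, 30, 5],
    "WeatherDelay": [0, 0, 300, 0, 20],
})

# Longest single delay of each type, per airline
longest = toy.groupby("Airline")[["CarrierDelay", "WeatherDelay"]].max()

# Airline holding the longest delay for each delay type
worst_airline = longest.idxmax()
```

On the real data, passing all five delay columns would give one airline per delay type in a single table.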

## 4.9 **Causes For Delay**

In [None]:
df['Month']= df['DepTimeStamp'].dt.month
df.head()

In [None]:
from matplotlib import rcParams

# figure size in inches (set before plotting so that it applies to this figure)
rcParams['figure.figsize'] = 7,4

delay_cols = ['LateAircraftDelay','CarrierDelay','WeatherDelay','NASDelay','SecurityDelay']
df2 = df.filter(['Month'] + delay_cols, axis=1)
# total delay minutes per month for each cause (columns are selected with a list)
ax = df2.groupby('Month')[delay_cols].sum().plot()
ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.25), ncol=3, fancybox=True, shadow=True)
plt.show()

#### This clearly shows that **LateAircraft delay, Carrier Delay, and NAS delay** account for most of the delay minutes in the period covered.

## 4.10 **Late Aircraft Delay**

In [None]:
#pair plot for the 5 types of delay and the arrival delay
sns.set()
cols = ['ArrDelay','SecurityDelay','WeatherDelay','NASDelay','CarrierDelay','LateAircraftDelay']
sns.pairplot(df[cols], height=1.5)  # 'size' was renamed to 'height' in seaborn
plt.show()

From this pairplot, we can see that Late Aircraft Delay is the most important feature.

There is no significant correlation among the delay types themselves. More information can be extracted from the correlation between the arrival delay and each type of delay.

The exact root cause of each delay could be found with the help of each aircraft's routes and some other details,
but that is not within the scope of this analysis.

## 4.11  **Carrier Delay**

#### Value counts of each Unique Carrier 

In [None]:
print(df['UniqueCarrier'].value_counts())

#### Average Delay by carrier

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,8))
carrier_order = ['WN', 'AA', 'MQ', 'UA','OO','US','DL','EV', 'B6', 'AS','F9','HA']

# recent seaborn versions require x= and y= as keywords
sns.barplot(x='UniqueCarrier', y='CarrierDelay', data=df, ax=ax[0], order=carrier_order)
ax[0].set_title('Average Delay by Carrier')

sns.boxplot(x='UniqueCarrier', y='CarrierDelay', data=df, ax=ax[1], order=carrier_order)
ax[1].set_title('Delay Distribution by Carrier')
plt.show()

print(['WN: Southwest Airlines', 'AA: American Airlines', 'MQ: American Eagle Airlines', 'UA: United Airlines',
       'OO: Skywest Airlines','US: US Airways','DL: Delta Airlines','EV: Atlantic Southeast Airlines',
       'B6: JetBlue Airways','AS: Alaska Airlines','F9: Frontier Airlines','HA: Hawaiian Airlines',])


In [None]:
Unique = df[["UniqueCarrier", "CarrierDelay"]]
Unique.shape

In [None]:
#mean value of HA- Hawaiian Airlines Carrier
HA = Unique[Unique["UniqueCarrier"] == 'HA']
HA['CarrierDelay'].mean()  # average only the numeric delay column

In [None]:
#mean value of EV -Atlantic Southeast Airlines Carrier
EV = Unique[Unique["UniqueCarrier"] == 'EV']
EV['CarrierDelay'].mean()  # average only the numeric delay column

The carriers with the highest average carrier delay are Hawaiian Airlines (HA), with 36.41 minutes per flight, and Atlantic Southeast Airlines (EV), with 33.60 minutes per flight.
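
Instead of filtering one carrier at a time, the average carrier delay of every carrier can be computed and ranked in one step. This is a sketch on made-up values; the minutes below are not the real averages.

```python
import pandas as pd

# Toy data (made-up minutes, not the real averages)
toy = pd.DataFrame({
    "UniqueCarrier": ["HA", "HA", "EV", "EV", "WN", "WN"],
    "CarrierDelay": [40, 30, 35, 25, 10, 20],
})

# Mean carrier delay per carrier, highest first
mean_delay = (
    toy.groupby("UniqueCarrier")["CarrierDelay"]
    .mean()
    .sort_values(ascending=False)
)
```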

## 4.12 **NAS Delay**

After a little research, I found that NAS delays include some extreme weather conditions, heavy traffic volume, air traffic control, etc. Delays that occur after Actual Gate Out are usually attributed to the NAS.

So these conditions may occur at both the origin airport and the destination airport.


In [None]:
df4=df[['Origin','NASDelay']].groupby('Origin').agg(['mean','count']).sort_values(by=('NASDelay','mean'), ascending=False)[:10]
df4

In [None]:
df4.plot(kind='bar')

We sorted the origin airports by their mean NAS delay. The airports with the highest mean delay have very few flights.
The airports with a low mean delay have a very high number of flights.
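
Since a mean over very few flights is unreliable, one option is to rank only the airports that have a minimum number of flights. The sketch below uses made-up data; the airport code `XXX` and the cut-off `min_flights = 3` are hypothetical choices for illustration.

```python
import pandas as pd

# Toy data (made up): XXX has a single flight with a very long NAS delay
toy = pd.DataFrame({
    "Origin": ["XXX", "ORD", "ORD", "ORD", "ATL", "ATL", "ATL"],
    "NASDelay": [90, 10, 20, 30, 5, 15, 10],
})

# Mean NAS delay and number of flights per origin airport
stats = toy.groupby("Origin")["NASDelay"].agg(["mean", "count"])

min_flights = 3  # hypothetical cut-off, to be tuned on the real data
ranked = stats[stats["count"] >= min_flights].sort_values("mean", ascending=False)
```

With the cut-off applied, the single-flight airport no longer tops the ranking.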

# **5. Conclusion**

## Ask and Answer Questions



Which airlines take the most time for each of these 5 delays?

On which day of the week do delays happen the most?

Which flights are delayed most frequently?

At what time of day do delays mostly happen?

Which month has the most delays?

What are the major causes of delay?


## Insights

*   Every flight in this dataset has at least one type of delay.

*   **Southwest Airlines** has the largest number of delay records.

*   The maximum number of delays happened on **FRIDAY**.

*   A **high** percentage of delays occurs between 15:00:00 and 20:00:00, i.e. **3 PM to 8 PM**.

*   Flights departing between **3 PM and 8 PM** are delayed most often.

*   Flights departing between **12 AM and 5 AM** have the **fewest delays**.

*   Data is available only for the **first half of the year 2019**.
Within those 6 months, March has the most delay records.

*   The **majority of the delays are short**; only a minority of the delays are long.

*   The **minimum arrival delay** is **15 minutes** and the **maximum arrival delay** is **1707 minutes**.


*   Chicago O'Hare International Airport is the airport where flights most frequently depart and arrive.

*   **American Eagle Airlines** recorded the longest **carrier delay** (in minutes).

*   **American Airlines Inc.** recorded the longest **weather delay** (in minutes).

*   **American Airlines Inc.** recorded the longest **NAS delay** (in minutes).


*   **Atlantic Southeast Airlines** recorded the longest **security delay** (in minutes).


*   **United Airline Inc.** recorded the longest **late aircraft delay** (in minutes).

*   **LateAircraft delay, Carrier Delay, and NAS delay** account for most of the delay minutes in the period covered.

*   The carriers with the highest average carrier delay are **Hawaiian Airlines** (HA), with **36.41 minutes** per flight, and **Atlantic Southeast Airlines** (EV), with **33.60 minutes** per flight.

*   The airports with the highest mean NAS delay have very few flights. The airports with a low mean delay have a very high number of flights.


