# 2015 Flights' Data Analysis
---
#### By Omar Bougacha


****
## Introduction
****

***
## I. Data Wrangling
***

### I.1. Data Gathering

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
base_color = sns.color_palette()[0]

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
flights_df = pd.read_csv('/kaggle/input/flight-delays/flights.csv')
flights_df.head(2)

In [None]:
flights_df.shape

The flights dataframe contains 5,819,079 flight records, each described by 31 features.

In [None]:
airports_df = pd.read_csv('/kaggle/input/flight-delays/airports.csv')
airports_df.head(2)

In [None]:
airports_df.shape

The airports table lists 322 airports, each described by 7 features.

In [None]:
airlines_df = pd.read_csv('/kaggle/input/flight-delays/airlines.csv')
airlines_df.head(2)

In [None]:
airlines_df.shape

We have 14 airlines with specific IATA codes and names.

### I.2. Data Quality & Tidiness Assessment

#### a- Data Completeness (Missing Values)

In [None]:
missing_values_df = pd.DataFrame()
missing_values_df['Feature'] = flights_df.columns
missing_values_df['N_missing'] = flights_df.isnull().sum().values
missing_values_df['M_percent'] = flights_df.isnull().sum().values*100/flights_df.shape[0]
missing_values_df

As we can see, several features have missing values. There are two possible reasons for this: 
* information lost at random during the acquisition process 
* values that are missing because of the nature of the record (for example, canceled flights do not have a take-off time, and flights that took off do not have a cancellation reason).

Choosing a method to handle the missing values requires a closer look at the data. 

Let's focus on the number of **canceled flights** and see whether it is related to the missing values of some of the features:

In [None]:
flights_df[flights_df['CANCELLED']==1].shape[0]

We have a total of 89884 cancelled flights. 

In [None]:
flights_df[flights_df['CANCELLED']==1].isnull().sum()

We can see that:
* for some features, all the missing values are caused by the cancellation of the flight. These features are: 'TAIL_NUMBER', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF'. 
* for some features, a large part of the missing values is caused by the cancellation of the flight: 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'WHEELS_ON', 'TAXI_IN', 'ARRIVAL_TIME', 'ARRIVAL_DELAY'.
* for some features, the cancelled flights account for only a small % of the total missing values. These features are: 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', and 'WEATHER_DELAY'. 
* all the records of the cancelled flights have a 'CANCELLATION_REASON'.
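
The comparison above is done by reading two tables side by side. As a minimal sketch, the same question can be answered in one table by computing, for each feature, the share of its missing values that sit on cancelled flights. The `toy` dataframe and its values are made up for illustration; only the column names come from the flights data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for flights_df (values are made up for illustration).
toy = pd.DataFrame({
    'CANCELLED':      [1, 1, 0, 0, 0, 0],
    'DEPARTURE_TIME': [np.nan, np.nan, 905.0, 1210.0, 1530.0, 2005.0],
    'ARRIVAL_TIME':   [np.nan, np.nan, 1100.0, np.nan, 1745.0, 2210.0],
    'WEATHER_DELAY':  [np.nan, np.nan, np.nan, np.nan, 12.0, np.nan],
})

is_missing = toy.drop(columns='CANCELLED').isnull()
cancelled = toy['CANCELLED'] == 1

summary = pd.DataFrame({
    'n_missing': is_missing.sum(),
    'n_missing_cancelled': is_missing[cancelled].sum(),
})
# Share of each feature's missing values that belong to cancelled flights.
summary['pct_from_cancelled'] = 100 * summary['n_missing_cancelled'] / summary['n_missing']
print(summary)
```

A feature with a share of 100% is missing only because of cancellations, while a low share points to another cause.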

Let's now investigate the **diverted flights**.

In [None]:
flights_df[flights_df['DIVERTED']==1].shape[0]

We have a total of 15187 diverted flights. 

In [None]:
flights_df[flights_df['DIVERTED']==1].isnull().sum()

From the obtained tables, we can see that:
* for the features whose missing values are mostly caused by cancelled flights, the rest of the missing values are caused by diverted flights. 
* for the features 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', and 'WEATHER_DELAY', the missing values of the diverted flights are only a small % of the overall missing values. 

Let's look at the flights that have a positive delay.

In [None]:
flights_df[flights_df['DEPARTURE_DELAY']>0].shape[0]

In [None]:
flights_df[flights_df['ARRIVAL_DELAY']>0].shape[0]

In [None]:
flights_df[flights_df['DEPARTURE_DELAY']>0].isnull().sum()

In [None]:
flights_df[flights_df['ARRIVAL_DELAY']>0].isnull().sum()

Based on these two tables, flights that are delayed either at take-off or on arrival still present several missing values. However, let's look at an example before making a decision. 

In [None]:
flights_df[flights_df['ARRIVAL_DELAY']>0].head(2)

As the example shows, a large part of the values is still missing in the features 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', and 'WEATHER_DELAY'. Therefore, I propose to drop these features. As for the others, I think we should divide the data set into 3 parts: normal flights, cancelled flights, and diverted flights, because each category has its own features. 

#### Data Accuracy

In [None]:
flights_df.dtypes

It seems that the data types are appropriate for the features. 

#### Data Uniqueness

Let's check for duplicated records. 

In [None]:
flights_df.duplicated().sum()

No duplicated entries. 

#### Data Tidiness

The flights data contain only flight information, with each feature in a column and each observation in a row. Therefore, the data are considered tidy. 

### I.3. Data Cleaning

The process of cleaning data should be well documented to allow the reproducibility of the operations and the obtained results. Therefore, I propose to treat each issue in three steps: Define, in which I describe the action; Code, in which the method is implemented; and Test, in which I check that the operation succeeded.

1- Missing Values: 

To treat the problem of missing values, I propose to divide the table into 3 sets: the normal set, the diverted flights set, and the canceled flights set. This allows us to keep a consistent data structure without missing records. Normally, when a small share of the records carries a lot of missing values and the initial dataset is big, we simply drop these records. In this case, we could drop the diverted and the canceled flights because they represent less than 2% of all data. However, I prefer to keep these entries in separate tables so we can analyze them. 

##### Define: 
* Divide the original data into three dataframes: one for the cleaned flights, one for the canceled flights, and one for the diverted flights. 

##### Code:

In [None]:
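# .copy() makes the subsets independent dataframes, so the later in-place
# operations do not act on views of flights_df.
canceled_flights = flights_df[flights_df['CANCELLED']==1].copy()
diverted_flights = flights_df[flights_df['DIVERTED']==1].copy()
canceled_flights.shape[0], diverted_flights.shape[0], flights_df.shape[0]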

In [None]:
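# Percentage of canceled and diverted flights, computed from the data instead of hard-coded counts
canceled_flights.shape[0]*100/flights_df.shape[0], diverted_flights.shape[0]*100/flights_df.shape[0]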

In [None]:
cleaned_flights = flights_df.drop(canceled_flights.index)

In [None]:
cleaned_flights = cleaned_flights.drop(diverted_flights.index)

In [None]:
canceled_flights.reset_index(drop=True, inplace=True)
diverted_flights.reset_index(drop=True, inplace=True)
cleaned_flights.reset_index(drop=True, inplace=True)

##### Test:

In [None]:
canceled_flights.shape[0] + diverted_flights.shape[0]+ cleaned_flights.shape[0] == flights_df.shape[0]

We continue with the missing records, now for the **canceled_flights** table. 

##### Define: 
- Drop the features that are missing in more than 30% of the canceled flights. 

##### Code: 

In [None]:
canceled_flights.isnull().sum()*100/canceled_flights.shape[0]

Among the features with missing values, only the tail number and the scheduled time stay below the threshold; all the others are to be dropped. However, since we have no way to fill in the missing tail numbers, this feature is dropped as well. 

In [None]:
cols = canceled_flights.isnull().sum()[canceled_flights.isnull().sum()>0].index.tolist()
cols.remove('SCHEDULED_TIME')

In [None]:
canceled_flights.drop(cols, axis=1,inplace=True)

In [None]:
canceled_flights.shape

##### Define:
* Fill the remaining missing values of the scheduled time in the canceled flights table with the mode of the scheduled time. 

##### Code: 

In [None]:
canceled_flights['SCHEDULED_TIME'].mode()[0]

In [None]:
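# Use the computed mode (85 in the cell above) instead of a hard-coded value
canceled_flights['SCHEDULED_TIME'] = canceled_flights['SCHEDULED_TIME'].fillna(canceled_flights['SCHEDULED_TIME'].mode()[0])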

##### Test:

In [None]:
canceled_flights.isnull().sum()

Now, we have fixed the canceled flights table. Let's move to the **diverted flights** table.

##### Define: 
- Drop the features that are missing in more than 30% of the diverted flights. 

##### Code: 

In [None]:
diverted_flights.isnull().sum()*100/diverted_flights.shape[0]

In [None]:
cols=['ELAPSED_TIME', 'AIR_TIME', 'ARRIVAL_DELAY', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 
      'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY']

In [None]:
diverted_flights.drop(cols, axis=1, inplace=True)

##### Test

In [None]:
diverted_flights.shape

Let's try to understand the relationship between the scheduled arrival and the missing values of the arrival time, wheels on, and taxi in, which are the time-based features describing the landing.

In [None]:
diverted_flights[['SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'WHEELS_ON', 'TAXI_IN']]

* As we can see, the taxi in is the time between the wheels-on instant and the arrival. This feature could be imputed using the mode value. We could even refine this imputation by computing the mean value per arrival airport, since this duration is basically a characteristic of the airport mixed with some weather effects. 
* For the arrival time, the imputation is trickier. I propose to compute the difference between the arrival time and the scheduled arrival, then add the median of this difference to the scheduled arrival to impute the arrival time. 
* As for the wheels on, it is simply the arrival time minus the taxi in.
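
The per-airport refinement mentioned in the first point is not implemented in this notebook. A minimal sketch of it, on made-up data, could look as follows. It swaps in the per-airport median rather than the mean (an assumption, chosen because it is less sensitive to extreme values) and falls back to the overall median for airports with no observed taxi-in time; `TAXI_IN_FILLED` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for diverted_flights (values are made up for illustration).
toy = pd.DataFrame({
    'DESTINATION_AIRPORT': ['LAX', 'LAX', 'LAX', 'ORD', 'ORD', 'XYZ'],
    'TAXI_IN':             [8.0, 10.0, np.nan, 14.0, np.nan, np.nan],
})

overall_median = toy['TAXI_IN'].median()
# Median taxi-in of each row's destination airport, aligned with the rows.
airport_median = toy.groupby('DESTINATION_AIRPORT')['TAXI_IN'].transform('median')

# First try the airport-level value, then the overall value as a fallback.
toy['TAXI_IN_FILLED'] = toy['TAXI_IN'].fillna(airport_median).fillna(overall_median)
print(toy)
```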

##### Define: 
* impute the taxi in with the mode over all arrival airports. 
* compute the median difference between the arrival time and the scheduled arrival
* impute the arrival time = scheduled arrival + median difference
* impute wheels on = arrival time - taxi in.

##### Code: 

In [None]:
diverted_flights['TAXI_IN'] = diverted_flights['TAXI_IN'].fillna(diverted_flights['TAXI_IN'].mode()[0])
diverted_flights['TAXI_IN'].isnull().sum()

In [None]:
arrival_delay = diverted_flights['ARRIVAL_TIME'] - diverted_flights['SCHEDULED_ARRIVAL']
arrival_delay.median()

In [None]:
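# Use the computed median (237 in the cell above) instead of a hard-coded value.
# Both columns are HHMM-encoded, so the sum can yield invalid times such as 1275; fix_time repairs them later.
diverted_flights['ARRIVAL_TIME'] = diverted_flights['ARRIVAL_TIME'].fillna(diverted_flights['SCHEDULED_ARRIVAL'] + arrival_delay.median())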

In [None]:
# Vectorized equivalent of the row-wise apply: fill only the missing wheels-on values
diverted_flights['WHEELS_ON'] = diverted_flights['WHEELS_ON'].fillna(diverted_flights['ARRIVAL_TIME'] - diverted_flights['TAXI_IN'])

##### Test

In [None]:
diverted_flights.isnull().sum()

##### Define: 
* Impute the Scheduled time with the mode. 

##### Code: 


In [None]:
diverted_flights['SCHEDULED_TIME'].mode()[0]

In [None]:
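# Use the computed mode (140 in the cell above) instead of a hard-coded value
diverted_flights['SCHEDULED_TIME'] = diverted_flights['SCHEDULED_TIME'].fillna(diverted_flights['SCHEDULED_TIME'].mode()[0])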

##### Test:

In [None]:
diverted_flights.isnull().sum()

Now, we have to fix the arrival time and the wheels on time. These features are numerical and represent the time in a hhmm format. 

##### Define:
* repair the arrival time and wheels-on values that are no longer valid HHMM times after the imputation

##### Code: 

In [None]:
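def fix_time(x): 
    # Repair an HHMM value whose minutes or hours overflowed after an addition
    # (e.g. 1275 -> 1315). Borrows caused by a subtraction are not handled.
    if x%100>=60:   # minutes past 59: carry one hour
        x=x+40
    if x//100>=24:  # hours past 23: wrap around midnight
        x=x-2400
    return x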

In [None]:
diverted_flights['ARRIVAL_TIME'] = diverted_flights['ARRIVAL_TIME'].apply(fix_time)

In [None]:
diverted_flights['WHEELS_ON'] = diverted_flights['WHEELS_ON'].apply(fix_time)

##### Test:

In [None]:
diverted_flights['WHEELS_ON'].describe()
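
One caveat with doing arithmetic directly on HHMM values: a subtraction such as 1005 - 10 gives 995, which `fix_time` turns into 1035 instead of 955. As a hedged alternative sketch (not what this notebook does), the times can be converted to minutes since midnight, shifted, and converted back. The three helper names below are hypothetical.

```python
def hhmm_to_minutes(hhmm):
    """Convert an HHMM-encoded time (e.g. 1005) to minutes since midnight."""
    hhmm = int(hhmm)
    return (hhmm // 100) * 60 + hhmm % 100


def minutes_to_hhmm(minutes):
    """Convert minutes since midnight back to HHMM, wrapping around midnight."""
    minutes = int(minutes) % (24 * 60)
    return (minutes // 60) * 100 + minutes % 60


def shift_hhmm(hhmm, delta_minutes):
    """Add (or subtract) a number of minutes to an HHMM-encoded time."""
    return minutes_to_hhmm(hhmm_to_minutes(hhmm) + delta_minutes)


print(shift_hhmm(1005, -10))  # 955
print(shift_hhmm(2350, 30))   # 20, i.e. 00:20 the next day
print(shift_hhmm(10, -20))    # 2350, i.e. 23:50 the previous day
```

With such helpers, wheels on could be computed as `shift_hhmm(arrival_time, -taxi_in)`, with no need for a repair step afterwards.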

#### Cleaned Flights Table

In [None]:
cleaned_flights.isnull().sum()

##### Define: 
* Drop the features with the missing values

##### Code: 

In [None]:
cols= ['CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY',
       'WEATHER_DELAY']
cleaned_flights = cleaned_flights.drop(cols, axis=1)

##### Test:

In [None]:
cleaned_flights.isnull().sum()

In each of the three tables, the features DIVERTED and CANCELLED now hold a single constant value. Since each category has its own table, I propose to drop these features.

##### Define: 
* drop the 'DIVERTED' and 'CANCELLED' features from all tables. 

##### Code:

In [None]:
cleaned_flights = cleaned_flights.drop(['DIVERTED', 'CANCELLED'], axis=1)
diverted_flights = diverted_flights.drop(['DIVERTED', 'CANCELLED'], axis=1)
canceled_flights = canceled_flights.drop(['DIVERTED', 'CANCELLED'], axis=1)

##### Test:

In [None]:
('DIVERTED' in  cleaned_flights.columns, 'CANCELLED' in cleaned_flights.columns, 
 'DIVERTED' in  diverted_flights.columns, 'CANCELLED' in diverted_flights.columns, 
 'DIVERTED' in  canceled_flights.columns, 'CANCELLED' in canceled_flights.columns)

Before moving to the analysis part, let's save these dataframes.

In [None]:
cleaned_flights.to_csv('cleaned_flights.csv', index=False)
diverted_flights.to_csv('diverted_flights.csv', index=False)
canceled_flights.to_csv('canceled_flights.csv', index=False)

***
## II. Exploratory Data Analysis
***

In the EDA process, we continue working using the cleaned_flights table to analyze the different relationships between the variables. The EDA process has 3 main components:

* Univariate Analysis
* Bivariate Analysis
* Multivariate Analysis

### II.1. Univariate data analysis

In [None]:
cleaned_flights['DEPARTURE_TIME'].hist(bins=1000)
plt.xlabel('Departure Time (HHMM)')
plt.ylabel('Count')
plt.show()

From this figure, we can observe that most flights depart between 5 in the morning (500) and 11 at night (2300), while very few depart between midnight and 5 in the morning. One thing that stands out in this graph is the regular gap between consecutive hours, for example just before 10 (1000). These gaps are most likely an artifact of the HHMM encoding rather than of the schedule: the minutes part only runs from 00 to 59, so no value can fall between 960 and 999, and the same holds for every other hour.
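
One way to avoid the empty ranges of the HHMM encoding is to split each value into its hour and minute parts and count flights per hour. The sketch below uses a small made-up series in place of `cleaned_flights['DEPARTURE_TIME']`; a value such as 2400, if present in the real data, would land in an hour 24.

```python
import pandas as pd

# Toy HHMM departure times (made up for illustration).
dep = pd.Series([5, 559, 600, 615, 659, 700, 1230, 1259, 2359], name='DEPARTURE_TIME')

hour = dep // 100    # 659 -> 6
minute = dep % 100   # 659 -> 59

# The minutes part never reaches 60, which is why a histogram of raw HHMM
# values has nothing between xx60 and xx99.
print(minute.max())

flights_per_hour = hour.value_counts().sort_index()
print(flights_per_hour)
```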

In [None]:
cleaned_flights['DEPARTURE_DELAY'].hist(bins=1000)
plt.xlabel('Departure Delay (Minutes)')
plt.ylabel('Count')
plt.show()

We can see here that the departure delay has a right-skewed distribution, with some very large delay values. Let's check this out. 

In [None]:
cleaned_flights['DEPARTURE_DELAY'].describe()

Let's zoom in on the delays below 240 minutes (4 hours).

In [None]:
cleaned_flights['DEPARTURE_DELAY'].hist(bins=1000)
plt.xlabel('Departure Delay (Minutes)')
plt.xlim((-100,240))
plt.ylabel('Count')
plt.show()

We can see that most flights actually leave slightly ahead of schedule, by up to 20 minutes. This is quite surprising. 
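
Reading the share of early departures off a histogram is imprecise. As a small sketch on made-up delays (standing in for `cleaned_flights['DEPARTURE_DELAY']`), the shares of early, on-time, and late departures can be computed directly.

```python
import pandas as pd

# Toy departure delays in minutes (made up for illustration).
delay = pd.Series([-12, -5, -3, 0, 0, 4, 18, 75, 240, -1], name='DEPARTURE_DELAY')

# The mean of a boolean mask is the fraction of True values.
share_early = (delay < 0).mean()
share_on_time = (delay == 0).mean()
share_late = (delay > 0).mean()
print(share_early, share_on_time, share_late)

# A few quantiles summarize the bulk and the right tail of the distribution.
print(delay.quantile([0.25, 0.5, 0.75, 0.95]))
```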

In [None]:
cleaned_flights['SCHEDULED_DEPARTURE'].hist(bins=1000)
plt.xlabel('Scheduled Departure (HHMM)')
plt.ylabel('Count')
plt.show()

In [None]:
cleaned_flights['TAXI_OUT'].hist(bins=100)
plt.xlabel('The Duration Between Closing Gate and Wheels Out (Minutes)')
plt.ylabel('Count')
plt.show()

The distribution of the Taxi out duration is quite skewed to the right. This is quite logical. 

In [None]:
cleaned_flights['ELAPSED_TIME'].hist(bins=1000)
plt.xlabel('Duration between Gate Closing and Passenger Out (Minutes)')
plt.ylabel('Count')
plt.show()

We can observe that the distribution is skewed to the right. This means that most flights are of short length. This can be verified by plotting the Air time and the traveled distance. 

In [None]:
cleaned_flights['AIR_TIME'].hist(bins=1000)
plt.xlabel('Flight Duration (Minutes)')
plt.ylabel('Count')
plt.show()

We can see that the distribution is quite similar to the elapsed time distribution. We can also note the existence of several gaps in the distribution. Are these due to clusters of trips (i.e. ranges of distance)? 

In [None]:
cleaned_flights['DISTANCE'].hist(bins=100)
plt.xlabel('Trip Distance (mi)')
plt.ylabel('Count')
plt.show()

The traveled distance is also right-skewed. However, we cannot see the gaps that appear in the flight duration! I propose to investigate this point further when we get to the bivariate analysis.

In [None]:
cleaned_flights['ARRIVAL_TIME'].hist(bins=1000)
plt.xlabel('Arrival Time (HHMM)')
plt.ylabel('Count')
plt.show()

In [None]:
cleaned_flights['SCHEDULED_ARRIVAL'].hist(bins=1000)
plt.xlabel('Scheduled Arrival (HHMM)')
plt.ylabel('Count')
plt.show()

In the scheduled arrival distribution, we can see the same patterns as in the scheduled departure. Most flights arrive during the day and very few arrive late at night after midnight. Also, the same gaps appear at the end of each hour.

In [None]:
cleaned_flights['ARRIVAL_DELAY'].hist(bins=1000)
plt.xlabel('Arrival Delay (Minutes)')
plt.ylabel('Count')
plt.show()

The arrival delay has much the same shape as the departure delay. The distribution is right-skewed and most values are negative, implying that a large number of flights arrive early.

In [None]:
cleaned_flights['TAXI_IN'].hist(bins=1000)
plt.xlabel('Landing Duration (Minutes)')
plt.ylabel('Count')
plt.show()

The landing process (taxi in) takes almost as long as the take-off process (taxi out). 

Let's analyze the number of flights per month and per day of the week.

In [None]:
sns.catplot(x='MONTH', kind='count', data=cleaned_flights, color=base_color)
plt.show()

From this figure, we can see that the number of flights is quite uniformly distributed over the months. The number of flights in February seems a bit lower than in the other months, but this could simply be due to its smaller number of days.
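
The February remark can be checked by dividing each monthly count by the number of days in that month. This is a minimal sketch: the monthly counts below are made up, and in the notebook they would come from something like `cleaned_flights['MONTH'].value_counts()`.

```python
import calendar
import pandas as pd

# Made-up monthly flight counts, indexed by month number.
monthly_counts = pd.Series({1: 465000, 2: 420000, 3: 496000}, name='n_flights')

# calendar.monthrange returns (weekday of the first day, number of days).
days_in_month = pd.Series({m: calendar.monthrange(2015, m)[1] for m in monthly_counts.index})

flights_per_day = monthly_counts / days_in_month
print(flights_per_day)
```

If February's flights per day match its neighbors, the lower monthly total is explained by the calendar alone.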

In [None]:
sns.catplot(x='DAY_OF_WEEK', kind='count', data=cleaned_flights, color=base_color)
plt.show()

Note that the flights are almost uniformly distributed over the days of the week, with slightly fewer flights at the beginning of the weekend (Saturday).

In [None]:
cleaned_flights['ORIGIN_AIRPORT'].nunique(), cleaned_flights['DESTINATION_AIRPORT'].nunique()

We have a large number of airports, so a plot of the number of flights for every origin or destination would not be easy to interpret. Therefore, I propose to present the 10 most common and the 10 least common origin airports, and likewise for the destinations.

##### Origin Airports:

In [None]:
origin_air_flights = cleaned_flights.groupby('ORIGIN_AIRPORT', as_index=False)['FLIGHT_NUMBER'].count()
origin_air_flights.sort_values(by='FLIGHT_NUMBER',inplace=True, ignore_index=True)

In [None]:
origin_air_flights.head(10)

If we plot these counts as they are, the airport codes alone will not tell us much. We need the names of the airports for the plots to be more intuitive. Unfortunately, we do not have a full list of all airports, so the study is limited to the airports in the USA. 

In [None]:
origin_air_flights = origin_air_flights.merge(airports_df[['IATA_CODE', 'AIRPORT', 'STATE', 'COUNTRY']],
                                              right_on='IATA_CODE', left_on='ORIGIN_AIRPORT')
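
Because this merge is an inner join, any origin code that is absent from the airports table silently disappears from the counts. A hedged sketch of how to see what is lost, using a left join with `indicator=True` on made-up data (the code `'10397'` and the airport names are placeholders for illustration):

```python
import pandas as pd

# Toy stand-ins for origin_air_flights and airports_df (values are made up).
counts = pd.DataFrame({
    'ORIGIN_AIRPORT': ['ATL', 'ORD', '10397'],
    'FLIGHT_NUMBER':  [300, 250, 20],
})
airports = pd.DataFrame({
    'IATA_CODE': ['ATL', 'ORD'],
    'AIRPORT':   ['Atlanta (toy name)', 'Chicago (toy name)'],
})

# how='left' keeps every origin; the _merge column says whether a match was found.
checked = counts.merge(airports, left_on='ORIGIN_AIRPORT', right_on='IATA_CODE',
                       how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
lost_flights = unmatched['FLIGHT_NUMBER'].sum()
print(unmatched[['ORIGIN_AIRPORT', 'FLIGHT_NUMBER']])
print(lost_flights)
```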

In [None]:
worst = origin_air_flights.iloc[:10,:]

In [None]:
sns.catplot(y='AIRPORT', x='FLIGHT_NUMBER', kind='bar', data=worst, 
            color=base_color, aspect=2)
plt.xlabel('Number of Flights')
plt.ylabel('Origin Airport Name')
plt.show()

In [None]:
best = origin_air_flights.iloc[-10:,:]

In [None]:
sns.catplot(y='AIRPORT', x='FLIGHT_NUMBER', kind='bar', data=best, 
            color=base_color, aspect=2)
plt.xlabel('Number of Flights')
plt.ylabel('Origin Airport Name')
plt.show()

##### Destination Airports:

In [None]:
dest_air_flights = cleaned_flights.groupby('DESTINATION_AIRPORT', as_index=False)['FLIGHT_NUMBER'].count()
dest_air_flights.sort_values(by='FLIGHT_NUMBER',inplace=True, ignore_index=True)

In [None]:
dest_air_flights = dest_air_flights.merge(airports_df[['IATA_CODE', 'AIRPORT', 'STATE', 'COUNTRY']],
                                              right_on='IATA_CODE', left_on='DESTINATION_AIRPORT')

In [None]:
worst = dest_air_flights.iloc[:10,:]

In [None]:
sns.catplot(y='AIRPORT', x='FLIGHT_NUMBER', kind='bar', data=worst, 
            color=base_color, aspect=2)
plt.xlabel('Number of Flights')
plt.ylabel('Destination Airport Name')
plt.show()

In [None]:
best = dest_air_flights.iloc[-10:,:]

In [None]:
sns.catplot(y='AIRPORT', x='FLIGHT_NUMBER', kind='bar', data=best, 
            color=base_color, aspect=2)
plt.xlabel('Number of Flights')
plt.ylabel('Destination Airport Name')
plt.show()

We can see that the most common destination airports are also the most common departure airports. Let's figure out which trips are the most common during the whole year. 

In [None]:
air_trips = cleaned_flights.groupby(['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'], as_index=False)['FLIGHT_NUMBER'].count()
air_trips.sort_values(by='FLIGHT_NUMBER',inplace=True, ignore_index=True)

In [None]:
air_trips['Trips'] = air_trips.apply(lambda x: str(x['ORIGIN_AIRPORT'])+'-'+str(x['DESTINATION_AIRPORT']),axis=1)

In [None]:
sns.catplot(y='Trips', x='FLIGHT_NUMBER', kind='bar', 
            data=air_trips.iloc[-10:,:], 
            color=base_color, aspect=2)
plt.xlabel('Number of Flights')
plt.ylabel('Trips')
plt.show()

From this graph, we can see that the two most common trips are from San Francisco to Los Angeles and the reverse path. The two directions of a trip have almost the same number of flights, and this observation holds for all 10 most common trips. From my small experience as a traveler, most flights are scheduled in a round-trip style. Last December when I visited Porto, the flights were scheduled this way: Paris-Porto-Paris-Porto. In the same day, the airplane does one round trip between Porto and Paris, then flies back to Porto and stays there. 
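
The round-trip observation can be checked for every pair of airports by joining each trip with its reverse direction. The following is only a sketch on made-up counts shaped like `air_trips`; `RETURN_FLIGHTS` and `IMBALANCE` are hypothetical column names.

```python
import pandas as pd

# Toy stand-in for air_trips (counts are made up for illustration).
trips = pd.DataFrame({
    'ORIGIN_AIRPORT':      ['SFO', 'LAX', 'JFK', 'LAX', 'SEA'],
    'DESTINATION_AIRPORT': ['LAX', 'SFO', 'LAX', 'JFK', 'SFO'],
    'FLIGHT_NUMBER':       [100, 98, 60, 61, 5],
})

# Swap origin and destination so each row describes the reverse direction.
reverse = trips.rename(columns={
    'ORIGIN_AIRPORT': 'DESTINATION_AIRPORT',
    'DESTINATION_AIRPORT': 'ORIGIN_AIRPORT',
    'FLIGHT_NUMBER': 'RETURN_FLIGHTS',
})

paired = trips.merge(reverse, on=['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'], how='left')
paired['RETURN_FLIGHTS'] = paired['RETURN_FLIGHTS'].fillna(0)  # no reverse trip recorded
paired['IMBALANCE'] = paired['FLIGHT_NUMBER'] - paired['RETURN_FLIGHTS']
print(paired)
```

An imbalance close to zero for most pairs would support the round-trip reading.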

### II.2. Bivariate data analysis

In [None]:
sns.scatterplot(x='SCHEDULED_DEPARTURE', y='DEPARTURE_DELAY', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Scheduled Departure (HHMM)')
plt.ylabel('Departure Delay in Minutes')
plt.show()

We can see that there is a relationship between the departure delay and the scheduled departure time. It seems like the two variables are negatively correlated: the later the scheduled departure, the less delayed the flight tends to be.

In [None]:
sns.scatterplot(x='DEPARTURE_TIME', y='DEPARTURE_DELAY', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Departure Time (HHMM)')
plt.ylabel('Departure Delay in Minutes')
plt.show()

This is quite interesting. The actual departure time and the departure delay seem to have a more positive correlation. This is quite logical, because the actual departure time includes the departure delay (or the time gained when the flight leaves early). Therefore, the relationship should look more linear.

In [None]:
sns.scatterplot(x='SCHEDULED_DEPARTURE', y='DEPARTURE_TIME', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Scheduled Departure (HHMM)')
plt.ylabel('Departure Time (HHMM)')
plt.show()

This scatter plot of the scheduled departure against the actual departure time supports the previous point. Both variables are linearly and positively dependent because, when you think about it: 
$$Departure\_Time = Scheduled\_Departure + Departure\_Delay$$ 

Moreover, the points in the top left corner and those in the bottom right corner are due to the HHMM representation of the scheduled and actual departure times. Once the departure time exceeds 2359, it wraps back to 0. Conversely, when a flight leaves ahead of schedule (i.e. the delay is negative), a scheduled departure at 0020 can become an actual departure at 2350, for example.

In [None]:
sns.scatterplot(x='SCHEDULED_DEPARTURE', y='TAXI_OUT', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Scheduled Departure (HHMM)')
plt.ylabel('Taxi Out in Minutes')
plt.show()

We cannot conclude on the relationship between the two variables. They seem quite independent. 

In [None]:
sns.scatterplot(x='TAXI_OUT', y='DEPARTURE_DELAY', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.ylabel('Departure Delay in Minutes')
plt.xlabel('Taxi Out in Minutes')
plt.show()

It seems like the departure delay and the taxi-out duration are correlated, but negatively. We can observe what looks like an exponentially decaying relationship, in which the higher the taxi-out duration, the lower the departure delay. 

In [None]:
sns.scatterplot(x='SCHEDULED_DEPARTURE', y='SCHEDULED_TIME', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Scheduled Departure (HHMM)')
plt.ylabel('Estimated Flight Duration in Minutes')
plt.show()

We can see that the estimated flight duration is not correlated with the scheduled departure. The flight durations are spread quite uniformly over the scheduled departure times, so the scheduling of a flight does not seem to take its duration into account. 

In [None]:
sns.scatterplot(x='SCHEDULED_DEPARTURE', y='ELAPSED_TIME', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Scheduled Departure (HHMM)')
plt.ylabel('Elapsed Flight Duration in Minutes')
plt.show()

The same goes for the elapsed time (real flight time). 

In [None]:
sns.scatterplot(x='SCHEDULED_DEPARTURE', y='AIR_TIME', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Scheduled Departure (HHMM)')
plt.ylabel('Flight Duration in Air in Minutes')
plt.show()

Since the elapsed time includes the time spent in the air, it was expected to see almost the same graph. 

Now, let's see if the actual departure time has an influence on the time spent in the air. I know from my small experience that when the departure is late, pilots tend to speed up in the air to reduce the delay. Of course, the time spent in the air also depends on other factors such as the load of the airplane, the weather, and most importantly the direction of the wind. 

In [None]:
sns.scatterplot(x='DEPARTURE_TIME', y='AIR_TIME', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Departure Time (HHMM)')
plt.ylabel('Flight Duration in Air in Minutes')
plt.show()

Well, the graph is quite different from the previous ones, but we still cannot conclude that a relationship exists. Let's try with the departure delay.

In [None]:
sns.scatterplot(x='DEPARTURE_DELAY', y='AIR_TIME', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Departure Delay in Minutes')
plt.ylabel('Flight Duration in Air in Minutes')
plt.show()

Well, I guess the myth may be true after all. The more delayed the flight is, the shorter its time in the air tends to be, so the pilots seem to speed up to reduce the arrival delay. However, the relationship is not linear, and the plot alone does not prove it, because the air time also depends on other factors such as the distance, the load of the plane, and the direction of the wind. 
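
Since the raw scatter plot mixes short and long routes, one hedged way to probe this claim is to compare each flight's air time with the typical air time of its own route, then look at how that difference moves with the departure delay. The sketch below uses made-up values; `AIR_TIME_VS_ROUTE` is a hypothetical column, and the rank correlation is computed as a Pearson correlation of ranks, which is the definition of Spearman's coefficient.

```python
import pandas as pd

# Toy flights on two routes (values are made up for illustration).
toy = pd.DataFrame({
    'ORIGIN_AIRPORT':      ['SFO'] * 4 + ['JFK'] * 4,
    'DESTINATION_AIRPORT': ['LAX'] * 8,
    'DEPARTURE_DELAY':     [-5, 0, 30, 90, -3, 10, 45, 120],
    'AIR_TIME':            [56, 55, 53, 52, 335, 330, 326, 322],
})

route = ['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']
# Minutes above (+) or below (-) the median air time of the same route.
toy['AIR_TIME_VS_ROUTE'] = toy['AIR_TIME'] - toy.groupby(route)['AIR_TIME'].transform('median')

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rho = toy['DEPARTURE_DELAY'].rank().corr(toy['AIR_TIME_VS_ROUTE'].rank())
print(rho)
```

A clearly negative value on the real data would be consistent with the hypothesis, although it would still not separate the pilots' behavior from wind or load effects.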

Previously, we saw that the distribution of the time spent in the air presents several gaps, and I wondered whether those gaps are caused by the distance to travel. 

In [None]:
sns.scatterplot(x='DISTANCE', y='AIR_TIME', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.xlabel('Distance in Miles')
plt.ylabel('Flight Duration in Air in Minutes')
plt.show()

It is quite logical that the distance is linearly and positively correlated with the time in the air. This graph also suggests that the gaps we saw in the distribution of the air time are caused by clusters of trip lengths (i.e. the distance). 
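
The linear relationship can be quantified with a straight-line fit. In this sketch on made-up points, the slope is in minutes per mile, so its inverse gives an implied cruising speed, and the intercept approximates the fixed time spent climbing and descending. With the real data, the arrays would be `cleaned_flights['DISTANCE']` and `cleaned_flights['AIR_TIME']`.

```python
import numpy as np

# Made-up (distance, air time) points lying on a straight line.
distance = np.array([200.0, 400.0, 800.0, 1600.0, 2400.0])  # miles
air_time = np.array([45.0, 70.0, 120.0, 220.0, 320.0])      # minutes

# Degree-1 least-squares fit: air_time ~ slope * distance + intercept
slope, intercept = np.polyfit(distance, air_time, deg=1)

implied_speed_mph = 60.0 / slope  # minutes per mile -> miles per hour
print(slope, intercept, implied_speed_mph)
```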

In [None]:
sns.scatterplot(y='TAXI_IN', x='SCHEDULED_ARRIVAL', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.ylabel('Duration of Landing in Minutes')
plt.xlabel('Scheduled Arrival (HHMM)')
plt.show()

There is no clear relationship between the landing duration and the scheduled arrival time. Let's see if the pilots rush the landing. 

In [None]:
sns.scatterplot(y='TAXI_IN', x='ARRIVAL_TIME', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.ylabel('Duration of Landing in Minutes')
plt.xlabel('Arrival Time (HHMM)')
plt.show()

In [None]:
sns.scatterplot(y='TAXI_IN', x='ARRIVAL_DELAY', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.ylabel('Duration of Landing in Minutes')
plt.xlabel('Arrival Delay in Minutes')
plt.show()

It seems like the pilots rush the landing when the flight is behind schedule: the higher the delay, the less time is spent landing. However, this relationship is not linear; it looks more exponential. We should keep in mind that several other factors influence this relationship, such as the airport (which may impose a tight time window for landing and has more or less traffic at that hour) and whether the landing is against or with the wind. 

In [None]:
sns.scatterplot(y='DEPARTURE_DELAY', x='ARRIVAL_DELAY', data=cleaned_flights, alpha=0.2, linewidth=0)
plt.ylabel('Departure Delay in Minutes')
plt.xlabel('Arrival Delay in Minutes')
plt.show()

This relationship is quite obvious: once a flight is delayed at take-off, it is delayed at landing. However, this only holds above a certain threshold, which depends on the distance to travel, the wind direction, and how fast the pilot can safely go. 

In [None]:
sns.catplot(x='DAY_OF_WEEK', y='DEPARTURE_DELAY', data=cleaned_flights, kind='violin', color=base_color)
plt.xlabel('Day of Week')
plt.ylabel('Departure Delay in Minutes')
plt.show()

It seems like long departure delays become more frequent as we get closer to the weekend: there are more very long delays. Let's zoom in and look at the delays more closely.

In [None]:
sns.catplot(x='DAY_OF_WEEK', y='DEPARTURE_DELAY', data=cleaned_flights, kind='violin', color=base_color)
plt.xlabel('Day of Week')
plt.ylim((-50,200))
plt.ylabel('Departure Delay in Minutes')
plt.show()

The shape of the departure delay distribution seems to vary from one day to the other. The Tuesday distribution is quite concentrated around zero, while the other distributions are more spread out.
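
Differences between violins are hard to judge by eye, so a per-day summary table can complement the plot. This is a minimal sketch on made-up delays; `p90` and `share_over_15` are hypothetical helper names, and the 15-minute cut-off is an assumption chosen for illustration.

```python
import pandas as pd

# Toy delays for three days of the week (values are made up).
toy = pd.DataFrame({
    'DAY_OF_WEEK':     [1, 1, 1, 2, 2, 2, 5, 5, 5],
    'DEPARTURE_DELAY': [-5, 0, 10, -3, -1, 2, -2, 15, 120],
})


def p90(s):
    """90th percentile of the delays."""
    return s.quantile(0.90)


def share_over_15(s):
    """Fraction of flights delayed by more than 15 minutes."""
    return (s > 15).mean()


summary = toy.groupby('DAY_OF_WEEK')['DEPARTURE_DELAY'].agg(['median', p90, share_over_15])
print(summary)
```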

In [None]:
# Keep numeric columns only: recent pandas versions raise an error when corr() meets text columns
cor = cleaned_flights.drop(columns=['YEAR', 'DAY', 'FLIGHT_NUMBER']).select_dtypes(include='number').corr()
cor

In [None]:
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(1,1,1)
sns.heatmap(cor, ax=ax)
plt.show()

The heatmap summarizes the findings of this section by showing which features are strongly correlated. 
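
To read the strongest relationships without scanning the colors, the correlation matrix can also be flattened into a ranked list of feature pairs. The sketch below runs on a small made-up dataframe; in the notebook, `cor` would be the matrix computed above.

```python
import numpy as np
import pandas as pd

# Toy numeric features (values are made up for illustration).
toy = pd.DataFrame({
    'DISTANCE':        [200, 400, 800, 1600, 2400, 1000],
    'AIR_TIME':        [45, 70, 120, 220, 320, 150],
    'DEPARTURE_DELAY': [5, -3, 40, 0, 12, -6],
    'ARRIVAL_DELAY':   [2, -8, 35, -5, 10, -12],
})
cor = toy.corr()

# Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(cor.shape, dtype=bool), k=1)
pairs = cor.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```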

### II.3. Multivariate data analysis

In this section, I focus on some of the previous findings to investigate them further. The idea is to find the relationships between several variables. However, the more variables we add to the plots, the more complicated the analysis becomes. Therefore, I only show a few examples of the interactions between three variables. 

In [None]:
# One panel per airport, from the most common origin to the fifth most common
fig, axes = plt.subplots(1, 5, figsize=(25,6))
for rank, ax in enumerate(axes, start=1):
    airport_code = origin_air_flights.iloc[-rank,0]
    airport_name = origin_air_flights.iloc[-rank,3]
    sns.scatterplot(x='TAXI_OUT', y='DEPARTURE_DELAY', 
                    data=cleaned_flights[cleaned_flights['ORIGIN_AIRPORT']==airport_code],
                    alpha=0.2, linewidth=0, ax=ax)
    ax.set_ylabel('Departure Delay in Minutes')
    ax.set_xlabel('Taxi Out in Minutes')
    ax.set_title(airport_name)
plt.tight_layout()
plt.show()

By comparing the taxi-out duration and the departure delay across the most common departure airports, we can see that the airport has an influence on these variables. The departure delay and the taxi-out duration are more dispersed for the Chicago O'Hare airport than for the other airports. 

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(25, 6))
# One panel per destination airport, taken from the last five rows of dest_air_flights
for rank, ax in enumerate(axes, start=1):
    airport_code = dest_air_flights.iloc[-rank, 0]
    sns.scatterplot(x='TAXI_IN', y='ARRIVAL_DELAY',
                    data=cleaned_flights[cleaned_flights['DESTINATION_AIRPORT'] == airport_code],
                    alpha=0.2, linewidth=0, ax=ax)
    ax.set_ylabel('Arrival Delay in Minutes')
    ax.set_xlabel('Taxi In in Minutes')
    ax.set_title(dest_air_flights.iloc[-rank, 3])
plt.tight_layout()
plt.show()

Likewise, the distributions of taxi-in duration and arrival delay vary with the destination airport.

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(25, 6))
# One panel per origin airport, taken from the last five rows of origin_air_flights
for rank, ax in enumerate(axes, start=1):
    airport_code = origin_air_flights.iloc[-rank, 0]
    sns.scatterplot(x='DISTANCE', y='AIR_TIME',
                    data=cleaned_flights[cleaned_flights['ORIGIN_AIRPORT'] == airport_code],
                    alpha=0.2, linewidth=0, ax=ax)
    ax.set_xlabel('Distance in Miles')
    ax.set_ylabel('Flight Duration in Air in Minutes')
    ax.set_title(origin_air_flights.iloc[-rank, 3])
plt.tight_layout()
plt.show()

In the plot for the Los Angeles airport, the points do not form a continuous cloud: distance and air time fall into distinct clusters separated by gaps, which suggests that each cluster corresponds to a specific trip.

To investigate this point further, I create a trip feature (origin-destination pair) in the cleaned_flights table and then plot the relationship between the distance of the trip and the flight duration in the air.

In [None]:
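# Build the trip label "ORIGIN-DESTINATION" with vectorized string concatenation
# (much faster than a row-wise apply on a table of this size)
cleaned_flights['Trips'] = (cleaned_flights['ORIGIN_AIRPORT'].astype(str) + '-'
                            + cleaned_flights['DESTINATION_AIRPORT'].astype(str))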

In [None]:
t = air_trips[air_trips['ORIGIN_AIRPORT'].isin(origin_air_flights.iloc[-5:,0].values)].iloc[-20:,3].values.tolist()

In [None]:
aaa = cleaned_flights[cleaned_flights['Trips'].isin(t)]

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(25, 6))
# Same panels as before, restricted to the selected trips and colored by trip
for rank, ax in enumerate(axes, start=1):
    airport_code = origin_air_flights.iloc[-rank, 0]
    sns.scatterplot(x='DISTANCE', y='AIR_TIME', hue='Trips',
                    data=aaa[aaa['ORIGIN_AIRPORT'] == airport_code],
                    alpha=0.2, linewidth=0, ax=ax, legend=False)
    ax.set_xlabel('Distance in Miles')
    ax.set_ylabel('Flight Duration in Air in Minutes')
    ax.set_title(origin_air_flights.iloc[-rank, 3])
plt.tight_layout()
plt.show()

Now it is even clearer that the gaps observed earlier come from the clusters of trips: the flights of a given trip are grouped together, and the gaps separate one trip from the next.
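
One way to confirm this reading without a plot is to summarize each trip directly. The sketch below uses a small made-up table (hypothetical values, not taken from the dataset) and assumes the same column names as cleaned_flights. For each trip it counts the distinct distances and reports the range of air times.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values, not from flights.csv)
toy = pd.DataFrame({
    'ORIGIN_AIRPORT': ['LAX', 'LAX', 'LAX', 'LAX', 'LAX', 'LAX'],
    'DESTINATION_AIRPORT': ['SFO', 'SFO', 'JFK', 'JFK', 'ORD', 'ORD'],
    'DISTANCE': [337, 337, 2475, 2475, 1744, 1744],
    'AIR_TIME': [55, 61, 290, 305, 210, 224],
})
toy['Trips'] = toy['ORIGIN_AIRPORT'] + '-' + toy['DESTINATION_AIRPORT']

# One row per trip: number of distinct distances, the distance itself,
# and the spread of air time observed on that trip
per_trip = toy.groupby('Trips').agg(
    n_distances=('DISTANCE', 'nunique'),
    distance=('DISTANCE', 'first'),
    air_time_min=('AIR_TIME', 'min'),
    air_time_max=('AIR_TIME', 'max'),
).sort_values('distance')
print(per_trip)
```

If `n_distances` equals 1 for every trip on the real data, each trip sits at a single distance value and only the air time varies, which would explain the clusters and the gaps between them.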