# Flights Data Exploration Challenge
### Completed by Aman Poddar
**Note**: Completed as part of the Microsoft Data Science 30-day challenge.


In this challenge, I'll explore a real-world dataset containing flights data from the US Department of Transportation.

Let's start by loading and viewing the data.

In [None]:
import pandas as pd

df_flights = pd.read_csv('../input/data-microsoft/challenges/data/flights.csv')
df_flights.head()

The dataset contains observations of US domestic flights in 2013, and consists of the following fields:

- **Year**: The year of the flight (all records are from 2013)
- **Month**: The month of the flight
- **DayofMonth**: The day of the month on which the flight departed
- **DayOfWeek**: The day of the week on which the flight departed - from 1 (Monday) to 7 (Sunday)
- **Carrier**: The two-letter abbreviation for the airline
- **OriginAirportID**: A unique numeric identifier for the departure airport
- **OriginAirportName**: The full name of the departure airport
- **OriginCity**: The departure airport city
- **OriginState**: The departure airport state
- **DestAirportID**: A unique numeric identifier for the destination airport
- **DestAirportName**: The full name of the destination airport
- **DestCity**: The destination airport city
- **DestState**: The destination airport state
- **CRSDepTime**: The scheduled departure time
- **DepDelay**: The number of minutes departure was delayed (flights that left ahead of schedule have a negative value)
- **DepDel15**: A binary indicator that departure was delayed by more than 15 minutes (and therefore considered "late")
- **CRSArrTime**: The scheduled arrival time
- **ArrDelay**: The number of minutes arrival was delayed (flights that arrived ahead of schedule have a negative value)
- **ArrDel15**: A binary indicator that arrival was delayed by more than 15 minutes (and therefore considered "late")
- **Cancelled**: A binary indicator that the flight was cancelled

My challenge was to explore the flight data to analyze possible factors that affect delays in departure or arrival of a flight.

1. Started by cleaning the data.
    - Identified any null or missing data, and imputed appropriate replacement values.
    - Identified and eliminated any outliers in the **DepDelay** and **ArrDelay** columns.
2. Explored the cleaned data.
    - Viewed summary statistics for the numeric fields in the dataset.
    - Determined the distribution of the **DepDelay** and **ArrDelay** columns.
    - Used statistics, aggregate functions, and visualizations to answer the following questions:
        - *What are the average (mean) departure and arrival delays?*
        - *How do the carriers compare in terms of arrival delay performance?*
        - *Is there a noticeable difference in arrival delays for different days of the week?*
        - *Which departure airport has the highest average departure delay?*
        - *Do **late** departures tend to result in longer arrival delays than on-time departures?*
        - *Which route (from origin airport to destination airport) has the most **late** arrivals?*
        - *Which route has the highest average arrival delay?*
        

## CLEANING DATA

Checking the number of null values in each column, if any.

In [None]:
df_flights.isnull().sum()

Quite a few rows have a null late-departure indicator (DepDel15). Let us look at the departure delay column (DepDelay) for those rows.

In [None]:
df_flights[df_flights.isnull().any(axis=1)][['DepDelay','DepDel15']]

So some late-departure indicators are null, but it looks like all of those rows may have a departure delay (DepDelay) of 0 minutes. We can verify this with the summary statistics of DepDelay for those rows.

In [None]:
df_flights[df_flights.isnull().any(axis=1)].DepDelay.describe()

Since "min", "max", and "mean" are all zero, none of the flights with a null DepDel15 was actually a "late" departure (delayed by more than 15 minutes).
So we will replace the missing ("null") DepDel15 values with 0 and check for missing values again.

In [None]:
df_flights.DepDel15 = df_flights.DepDel15.fillna(0)
df_flights.isnull().sum()

Hence our data is cleaned of any missing values!
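
The zero-fill above relies on every missing indicator belonging to a flight with a delay of 0 minutes. As a minimal sketch on hypothetical toy data (the `toy` frame and its values are assumptions, not rows from the flights file), the same check can be made explicit, and the indicator can be derived from DepDelay so the fill stays correct even if a missing row had a real delay:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_flights
toy = pd.DataFrame({
    "DepDelay": [0, 25, 0, -3],
    "DepDel15": [np.nan, 1.0, np.nan, 0.0],
})

# Rows whose late-departure indicator is missing
missing = toy["DepDel15"].isnull()

# Check the assumption: every such row has a departure delay of 0 minutes
all_zero = (toy.loc[missing, "DepDelay"] == 0).all()
print("All missing rows have zero delay:", all_zero)

# Derive the indicator from the delay and use it only where the value is missing
derived = (toy["DepDelay"] > 15).astype(float)
toy["DepDel15"] = toy["DepDel15"].fillna(derived)
print(toy)
```

When the check holds, this gives the same result as filling with a constant 0.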

## CLEANING ANY OUTLIERS
We will view the distribution and summary statistics for DepDelay and ArrDelay.

In [None]:
# Function to show summary stats and distribution for a column
def distribution(var_data):
    from matplotlib import pyplot as plt

    # Storing statistics values
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    median_val = var_data.median()
    mode_val = var_data.mode()[0]

    print(var_data.name, '\nMinimum={:.2f}\nMean={:.2f}\nMedian={:.2f}\nMode={:.2f}\nMaximum={:.2f}\n'.format(min_val,mean_val,median_val,mode_val,max_val))

    # Figure with 2 subplots (2 rows, 1 column)
    fig, ax = plt.subplots(2,1,figsize = (15,7))

    # Histogram plotting (subplot 1)
    ax[0].hist(var_data)
    ax[0].set_ylabel('Frequency')

    # Adding markers for the mean (cyan), median (red), mode (yellow), minimum and maximum (gray)
    ax[0].axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=median_val, color = 'red', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mode_val, color = 'yellow', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)

    # Boxplot plotting (subplot 2)
    ax[1].boxplot(var_data, vert = False)
    ax[1].set_xlabel('Values')

    # Figure title
    fig.suptitle(var_data.name)

    # Show figure (a bare `fig` expression inside a function displays nothing)
    plt.show()

#Calling the function for delay columns ("DepDelay", "ArrDelay")
delayColumns = ['DepDelay','ArrDelay']
for columns in delayColumns:
    distribution(df_flights[columns])

From the plots it's clear that both variables have outliers at the lower and upper ends, with more of them at the upper end.

Therefore, we will trim the data to keep only the rows where the values of these fields lie between the 1st and 90th percentiles, using the quantile function.

In [None]:
# Trimming outliers for ArrDelay based on the 1st and 90th percentiles
ArrDelay_01percentile = df_flights.ArrDelay.quantile(0.01)  # 1st percentile
ArrDelay_90percentile = df_flights.ArrDelay.quantile(0.90)  # 90th percentile
df_flights = df_flights[df_flights.ArrDelay < ArrDelay_90percentile]
df_flights = df_flights[df_flights.ArrDelay > ArrDelay_01percentile]

# Trimming outliers for DepDelay based on the 1st and 90th percentiles
DepDelay_01percentile = df_flights.DepDelay.quantile(0.01)  # 1st percentile
DepDelay_90percentile = df_flights.DepDelay.quantile(0.90)  # 90th percentile
df_flights = df_flights[df_flights.DepDelay < DepDelay_90percentile]
df_flights = df_flights[df_flights.DepDelay > DepDelay_01percentile]


# trimmed and revised distributions
for columns in delayColumns:
    distribution(df_flights[columns])

This looks much better, and a lot of outliers have been removed.
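
The trimming repeats the same steps for each delay column, so it could be wrapped in a small helper. The sketch below is one possible way to do that; the name `trim_by_quantile` and the toy data are hypothetical. It keeps rows strictly between the two quantiles, as the cell above does:

```python
import pandas as pd

def trim_by_quantile(df, column, lower=0.01, upper=0.90):
    """Keep rows where `column` lies strictly between its lower and upper quantiles."""
    low = df[column].quantile(lower)
    high = df[column].quantile(upper)
    return df[(df[column] > low) & (df[column] < high)]

# Hypothetical toy data: 101 delay values from -10 to 90 minutes
toy = pd.DataFrame({"ArrDelay": list(range(-10, 91))})

trimmed = trim_by_quantile(toy, "ArrDelay")
print(len(toy), "rows before,", len(trimmed), "rows after")
print(trimmed["ArrDelay"].min(), trimmed["ArrDelay"].max())
```

Calling the helper first for ArrDelay and then for DepDelay on the result would mirror the order used above, where the DepDelay percentiles are computed after the ArrDelay trim.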

## EXPLORING THE DATA
Let us first view the overall statistics summary for the numeric columns

In [None]:
df_flights.describe()

Now let us answer the following questions:
### **What are the average (mean) departure and arrival delays?**

In [None]:
df_flights[delayColumns].mean()   #delayColumns include "DepDelay" and "ArrDelay"

### **How do the carriers compare in terms of arrival delay performance?**

In [None]:
# To compare Arrival Delays on the basis of Carrier
df_flights.boxplot(column = 'ArrDelay', by='Carrier', figsize=(10,10))
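
A boxplot gives a visual comparison; a numeric table per carrier can complement it. A minimal sketch on hypothetical toy data (the carrier codes and delays below are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for df_flights
toy = pd.DataFrame({
    "Carrier": ["AA", "AA", "DL", "DL", "DL", "UA"],
    "ArrDelay": [10, 20, -5, 5, 0, 30],
})

# Count, mean and median arrival delay per carrier, best (lowest mean) first
carrier_stats = (
    toy.groupby("Carrier")["ArrDelay"]
       .agg(["count", "mean", "median"])
       .sort_values("mean")
)
print(carrier_stats)
```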

### **Is there a noticeable difference in arrival delays for different days of the week?**

In [None]:
# To compare Arrival Delays for different days of week.
df_flights.boxplot(column='ArrDelay', by='DayOfWeek', figsize=(10,10))
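
To read the result as numbers, the mean arrival delay per day can be computed and labelled with day names, using the 1 (Monday) to 7 (Sunday) coding described in the field list. A minimal sketch on hypothetical toy data; the `day_names` mapping and the values are assumptions:

```python
import pandas as pd

# Day coding from the dataset description: 1 (Monday) to 7 (Sunday)
day_names = {1: "Mon", 2: "Tue", 3: "Wed", 4: "Thu", 5: "Fri", 6: "Sat", 7: "Sun"}

# Hypothetical toy data standing in for df_flights
toy = pd.DataFrame({
    "DayOfWeek": [1, 1, 4, 4, 7],
    "ArrDelay": [2, 4, 10, 20, -3],
})

# Mean arrival delay per day, with numeric codes replaced by day names
mean_by_day = toy.groupby("DayOfWeek")["ArrDelay"].mean().rename(index=day_names)
print(mean_by_day)
```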

### **Which departure airport has the highest average departure delay?**

In [None]:
departure_airports = df_flights.groupby(df_flights.OriginAirportName)

mean_departureDelays = pd.DataFrame(departure_airports['DepDelay'].mean()).sort_values('DepDelay', ascending=False) #largest average delay time first

mean_departureDelays.plot(kind= "bar", figsize=(20,15))
mean_departureDelays
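
If only the single worst airport is needed, `idxmax` returns its label directly without sorting the whole table. A minimal sketch on hypothetical toy data (the airport names and delays are placeholders):

```python
import pandas as pd

# Hypothetical toy data standing in for df_flights
toy = pd.DataFrame({
    "OriginAirportName": ["Airport A", "Airport A", "Airport B", "Airport B"],
    "DepDelay": [5, 15, 1, 3],
})

# Mean departure delay per origin airport
mean_delay = toy.groupby("OriginAirportName")["DepDelay"].mean()

# Label of the airport with the highest mean departure delay
worst_airport = mean_delay.idxmax()
print(worst_airport, mean_delay.max())
```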

### **Do late departures tend to result in longer arrival delays than on-time departures?**

In [None]:
df_flights.boxplot(column='ArrDelay', by='DepDel15', figsize= (15,15)) #DepDel15 column represents late departures (departure delay more than 15 min.)
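
The same comparison can be expressed numerically by grouping on the late-departure indicator. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: DepDel15 is 1 for late departures, 0 otherwise
toy = pd.DataFrame({
    "DepDel15": [0, 0, 0, 1, 1],
    "ArrDelay": [-5, 0, 5, 20, 30],
})

# Mean and median arrival delay for on-time (0) and late (1) departures
summary = toy.groupby("DepDel15")["ArrDelay"].agg(["mean", "median"])
print(summary)
```

A clearly higher mean and median in the row for `DepDel15 == 1` would support the reading of the boxplot.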

### **Which route (from origin airport to destination airport) has the most late arrivals?**

In [None]:
# Making a column for "Route"
# Direct assignment keeps the cell safe to re-run (concat would add a duplicate column)
df_flights['Routes'] = df_flights['OriginAirportName'] + '--->' + df_flights['DestAirportName']

# Grouping the routes
RouteGroups = df_flights.groupby(df_flights.Routes)
pd.DataFrame(RouteGroups['ArrDel15'].sum()).sort_values('ArrDel15', ascending=False) #it'll show number of times arrival was late (arrival delay more than 15 minutes)

In [None]:
print('Hence, the route with the most LATE arrivals is \n' + pd.DataFrame(RouteGroups['ArrDel15'].sum()).sort_values('ArrDel15', ascending=False).index[0])
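
An alternative to building a combined route string is to group on the origin and destination columns directly. A minimal sketch on hypothetical toy data (the single-letter airport names are placeholders):

```python
import pandas as pd

# Hypothetical toy data standing in for df_flights
toy = pd.DataFrame({
    "OriginAirportName": ["A", "A", "A", "B", "B"],
    "DestAirportName":   ["B", "B", "C", "A", "A"],
    "ArrDel15":          [1, 1, 0, 1, 0],
})

# Number of late arrivals per (origin, destination) pair, most late arrivals first
late_by_route = (
    toy.groupby(["OriginAirportName", "DestAirportName"])["ArrDel15"]
       .sum()
       .sort_values(ascending=False)
)

# The index is a MultiIndex, so the top entry unpacks into origin and destination
top_origin, top_dest = late_by_route.index[0]
print(top_origin, "--->", top_dest, ":", late_by_route.iloc[0], "late arrivals")
```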

### **Which route has the highest average arrival delay?**

In [None]:
pd.DataFrame(RouteGroups['ArrDelay'].mean()).sort_values('ArrDelay', ascending=False) #shows highest average delays in minutes.

In [None]:
print('Hence, the route with the highest average arrival delay is \n' + pd.DataFrame(RouteGroups['ArrDelay'].mean()).sort_values('ArrDelay', ascending=False).index[0])
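
A route flown only a few times can top a ranking of averages by chance. As a hedged sketch on hypothetical toy data, the ranking can be limited to routes with at least `min_flights` flights; both the threshold and the values below are assumptions, not part of the analysis above:

```python
import pandas as pd

# Hypothetical toy data standing in for df_flights
toy = pd.DataFrame({
    "Routes": ["A--->B", "A--->B", "A--->B", "C--->D", "B--->A", "B--->A"],
    "ArrDelay": [10, 20, 30, 50, 0, 10],
})

# Number of flights and mean arrival delay per route
route_stats = toy.groupby("Routes")["ArrDelay"].agg(["count", "mean"])

# Hypothetical threshold: ignore routes with fewer flights than this
min_flights = 2
frequent = route_stats[route_stats["count"] >= min_flights]

print("Highest mean overall:", route_stats["mean"].idxmax())
print("Highest mean among frequent routes:", frequent["mean"].idxmax())
```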
