# Test the hypothesis that planes fly faster when there is a departure delay

This task calls for a chi-squared test of independence (a contingency test) on two categorical variables. The first, `Delay`, is derived from `dep_delay`: it is True when `dep_delay` is positive and False otherwise. The second, `Faster_Flight`, is True when the flight's `actual_elapsed_time` is shorter than its `crs_elapsed_time`, which is the planned time for the flight.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
## 1. read the big flights.csv file in chunks of 1 million rows at a time
chunks = pd.read_csv('flights.csv', chunksize=1000000, low_memory=False)
df = pd.concat(chunks)  # stitch the chunks back into one dataframe


## OR 2. read the flights_sample csv
# df = pd.read_csv('flights_sample.csv')
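
Since only three columns are needed later, one option is to drop the rest while reading instead of afterwards. The following is a minimal sketch, not the method used in this notebook: it reads a tiny in-memory stand-in for `flights.csv` (the rows and the `fl_date` column are made up for illustration) and relies on the `usecols` and `chunksize` parameters of `pd.read_csv`.

```python
import io
import pandas as pd

# Tiny stand-in for flights.csv (hypothetical rows, for illustration only)
csv_text = """fl_date,dep_delay,crs_elapsed_time,actual_elapsed_time
2019-01-01,-3,120,115
2019-01-01,12,95,
2019-01-02,0,60,62
2019-01-02,45,200,190
"""

columns = ['dep_delay', 'crs_elapsed_time', 'actual_elapsed_time']

# usecols keeps only the listed columns while parsing;
# chunksize returns an iterator of dataframes instead of one large frame
chunks = pd.read_csv(io.StringIO(csv_text), usecols=columns, chunksize=2)

# drop incomplete rows chunk by chunk, then stitch the chunks together
small_df = pd.concat([chunk.dropna() for chunk in chunks])
print(small_df.shape)
```

With the real file, `io.StringIO(csv_text)` would be replaced by `'flights.csv'` and the chunk size raised; the rest stays the same.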

In [8]:
# take only the relevant columns because this is a large dataframe
columns = ['dep_delay',
           'crs_elapsed_time',
           'actual_elapsed_time']
df = df[columns].dropna()  # keep the 3 columns and drop rows with missing values

In [9]:
# if dep_delay is positive, then delay is True
delay = df['dep_delay'] > 0

# get the difference between what was planned and the actual elapsed time
time_saved = df['crs_elapsed_time'] - df['actual_elapsed_time']
# if the actual time is smaller, then the plane flew faster
flew_faster = time_saved > 0

In [29]:
# join the 2 above into a dataframe
df = pd.DataFrame({'Delay': delay,
                   'Faster_Flight': flew_faster})

In [42]:
df.head(20) # inspect

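|   | Delay | Faster_Flight |
|---|-------|---------------|
| 0 | False | True |
| 1 | True | True |
| 2 | False | True |
| 3 | False | True |
| 4 | False | False |
| 5 | False | True |
| 6 | False | True |
| 7 | False | False |
| 8 | False | True |
| 9 | False | True |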


In [31]:
# calculate the contingency between the 2 categories
contingency = pd.crosstab(df['Delay'], df['Faster_Flight'])
contingency

| Delay (rows) / Faster_Flight (columns) | False | True |
|---|---|---|
| False | 2942454 | 7354788 |
| True | 1581559 | 3734801 |
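
The raw counts are hard to compare because the two `Delay` groups differ in size. A small sketch of one way to read the table: rebuild it from the counts printed above and divide each row by its total, which gives the share of faster flights within each group. The name `row_share` is introduced here for illustration.

```python
import pandas as pd

# counts copied from the crosstab output above
contingency = pd.DataFrame(
    [[2942454, 7354788],
     [1581559, 3734801]],
    index=pd.Index([False, True], name='Delay'),
    columns=pd.Index([False, True], name='Faster_Flight'),
)

# divide each row by its total: share of (not) faster flights per Delay group
row_share = contingency.div(contingency.sum(axis=1), axis=0)
print(row_share.round(4))
```

On the original dataframe, `pd.crosstab(df['Delay'], df['Faster_Flight'], normalize='index')` should give the same shares directly. For these counts, the share of faster flights is roughly 0.714 without a delay and roughly 0.703 with one.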


In [26]:
# test with chi2_contingency
from scipy.stats import chi2_contingency


stat, p, dof, expected = chi2_contingency(contingency)

print(f'p is {p}')
print('The two are probably',
      ('independent.' if p > 0.05 else 'dependent.')
     )

p is 0.0
The two are probably dependent.
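
With more than 15 million rows, even a very small association pushes the p-value down until it prints as 0.0, so the p-value alone says little about how strong the dependence is. One common companion measure is Cramér's V, which is not part of the original analysis; the sketch below computes it by hand from the counts in the contingency table, under the assumption that those counts are the full data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# counts copied from the contingency table
# (rows: Delay False/True, columns: Faster_Flight False/True)
observed = np.array([[2942454, 7354788],
                     [1581559, 3734801]])

stat, p, dof, expected = chi2_contingency(observed)
n = observed.sum()

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranges from 0 to 1
cramers_v = np.sqrt(stat / (n * (min(observed.shape) - 1)))
print(f"chi2 = {stat:.1f}, n = {n}, Cramér's V = {cramers_v:.4f}")
```

A value close to 0 means the two variables are only weakly associated, even when the test rejects independence.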


In [41]:
## Testing a sample of the data

df_sample = df.sample(n=10000)
contingency = pd.crosstab(df_sample['Delay'], df_sample['Faster_Flight'])

stat, p, dof, expected = chi2_contingency(contingency)

print(f'p is {p}')
print('The two are probably',
      ('independent.' if p > 0.05 else 'dependent.')
     )


p is 0.0006022316096862148
The two are probably dependent.
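
`df.sample` is called without a `random_state`, so each run draws a different sample and the p-value will change from run to run. To get a feel for how much, the sketch below simulates many 10,000-row samples. It does not read the flights file: as an assumption, it treats the full-data cell counts from the contingency table as the population and draws the four cell counts with `rng.multinomial`, which samples with replacement (a close approximation at this sample size).

```python
import numpy as np
from scipy.stats import chi2_contingency

# full-data cell counts from the contingency table, used here as the population
population = np.array([[2942454, 7354788],
                       [1581559, 3734801]])
cell_probs = (population / population.sum()).ravel()

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_rows, n_repeats = 10_000, 200

p_values = []
for _ in range(n_repeats):
    # one simulated sample of 10,000 flights, as counts in the four cells
    counts = rng.multinomial(n_rows, cell_probs).reshape(2, 2)
    _, p, _, _ = chi2_contingency(counts)
    p_values.append(p)

p_values = np.array(p_values)
print(f'share of samples with p <= 0.05: {(p_values <= 0.05).mean():.2f}')
```

If that share is well below 1, a single 10,000-row sample is not a reliable check on the full-data result.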


## Result

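Testing the hypothesis that planes fly faster when there is a departure delay.

**Result**: `Delay` and `Faster_Flight` are probably dependent (p < 0.05 on both the full data and the 10,000-row sample). The test does not give the direction of the relationship, though, and the contingency table points the other way: about 70.3% of delayed flights (3734801 of 5316360) beat their planned elapsed time, compared with about 71.4% of non-delayed flights (7354788 of 10297242). So the data do not support the claim that planes fly faster when there is a departure delay.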