# Assignment: Analyzing Airline Flight Delays
For a full treatment of the unit 14 case study, please review module 14.3. Some points from the video are given below.

Work with the airline data set (use R or Python to manage the data out-of-core).
Answer the following questions by using the split-apply-combine technique:
* Which airports are most likely to be delayed flying out of or into?
* Which flights with the same origin and destination are most likely to be delayed?
* Can you regress how delayed a flight will be before it is delayed?
* What are the most important features for this regression?
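
As a minimal sketch of split-apply-combine, the snippet below ranks airports by the share of delayed departures using pandas on a tiny made-up table. The column names `Origin` and `DepDelay` follow the airline data set; the rows themselves and the 15-minute threshold for "delayed" are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names follow the airline data set.
flights = pd.DataFrame({
    "Origin":   ["ORD", "ORD", "ORD", "ATL", "ATL", "SFO"],
    "DepDelay": [30.0, -2.0, 45.0, 5.0, 20.0, 0.0],
})

# Assumption: a flight counts as "delayed" if it departs more than 15 minutes late.
DELAY_THRESHOLD_MIN = 15

delay_rate = (
    flights.assign(Delayed=flights["DepDelay"] > DELAY_THRESHOLD_MIN)
    .groupby("Origin")["Delayed"]   # split: one group per origin airport
    .mean()                         # apply: share of delayed flights in each group
    .sort_values(ascending=False)   # combine: a single ranked Series
)
print(delay_rate)
```

On a Dask dataframe the same chain should carry over, with `.compute()` called before `sort_values`, as in the busiest-origins cell later in this notebook.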

Remember to properly cross-validate models.

Use meaningful evaluation criteria.

Create at least one new feature variable for the regression.

In [1]:
import dask.dataframe as dd #http://dask.pydata.org/en/latest/
import pandas as pd
from datetime import datetime
from bokeh.io import output_notebook

### Other Settings
# Show more rows
pd.options.display.max_rows = 999

# Prevent scientific notation of decimals
pd.set_option('display.precision', 3)  # full option name; the bare 'precision' is ambiguous in newer pandas
pd.options.display.float_format = '{:,.3f}'.format

# Data Location
# Ryan's
parq_folder = "C:/Users/ryan.shuhart/Downloads/AirlineDelays.tar/AirlineDelays/parquet/"

In [2]:
# Allow inline display of bokeh graphics
output_notebook()

## [Here is some info about Dask]...

...General facts about Dask... blah blah

#### Comparison of Dask Files
* Ryan's Hardware: 
    - CPU: Intel i5-4300M @ 2.60GHz
    - Disk: Samsung SSD 850 Pro
    - RAM: 8 GB
    

* Dask using original CSV:
    - no conversion
    - size on disk:
        - 11.2 GB
    - benchmark of describing 'Distance':
        - approx. 4 minutes
* Dask using uncompressed Parquet:
    - conversion to Parquet:
        - approx. 10 minutes
    - size on disk:
        - 13.8 GB
    - benchmark of describing 'Distance':
        - 1 loop, best of 3: 6.2 s per loop
* Dask using gzip-compressed Parquet:
    - conversion to Parquet:
        - approx. 42 minutes
    - size on disk:
        - 1.36 GB (a big difference)
    - benchmark of describing 'Distance':
        - 1 loop, best of 3: 8.83 s per loop

#### Summary
Dask allows out-of-core management of data sets. CSV files are universal but slow to process. Converting to the Parquet file format speeds up the 'Distance' describe benchmark by a factor of roughly 38 (about 4 minutes versus 6.2 s). Gzip compression reduces the size on disk from 13.8 GB to 1.36 GB, about 10% of the uncompressed size. This is useful for distributed processing in a cluster, since less network bandwidth is needed. The trade-off of compression is a 42.4% increase in processing time for this benchmark (8.83 s versus 6.2 s). The additional 2.6 seconds is hardly noticeable here, but it might be more of an issue for other tasks.
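
The ratios quoted in this summary can be checked with a few lines of arithmetic. The sketch below only restates the benchmark figures listed above; treating "approx. 4 minutes" as exactly 240 seconds is an assumption, so the speedup is approximate.

```python
# Benchmark figures reported above ("approx. 4 minutes" is taken as 240 s).
csv_seconds = 4 * 60
parquet_seconds = 6.2
gzip_parquet_seconds = 8.83
parquet_gb = 13.8
gzip_parquet_gb = 1.36

speedup = csv_seconds / parquet_seconds                 # CSV vs. uncompressed Parquet
size_ratio = gzip_parquet_gb / parquet_gb               # compressed size / uncompressed size
slowdown = gzip_parquet_seconds / parquet_seconds - 1   # relative extra time due to gzip
extra_seconds = gzip_parquet_seconds - parquet_seconds  # absolute extra time due to gzip

print(f"Speedup from Parquet: {speedup:.1f}x")
print(f"Gzip size relative to uncompressed: {size_ratio:.1%}")
print(f"Gzip slowdown: {slowdown:.1%} ({extra_seconds:.2f} s)")
```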

## Exploratory

In [20]:
# http://stat-computing.org/dataexpo/2009/the-data.html
var_desc = pd.read_csv("../ref/var_descriptions.csv", index_col='var_id')
var_desc

| var_id | Name | Data Type | Description |
|---|---|---|---|
| 1 | Year | int64 | 1987-2008 |
| 2 | Month | int64 | 1 - 12 |
| 3 | DayofMonth | int64 | 1 - 31 |
| 4 | DayOfWeek | int64 | 1 (Monday) - 7 (Sunday) |
| 5 | DepTime | float64 | actual departure time (local, hhmm) |
| 6 | CRSDepTime | int64 | scheduled departure time (local, hhmm) |
| 7 | ArrTime | float64 | actual arrival time (local, hhmm) |
| 8 | CRSArrTime | int64 | scheduled arrival time (local, hhmm) |
| 9 | UniqueCarrier | O | unique carrier code |
| 10 | FlightNum | int64 | flight number |


In [None]:
# Load compressed Parquet format of all years ~2 sec
start = datetime.now()
all_years = dd.read_parquet(parq_folder)
print("Load parquet time: ", datetime.now() - start)
print()

# Length of dask dataframe ~3 min
start = datetime.now()
print("There are {:,d} rows".format(len(all_years))) #123,534,969 Matches Eric Larson
print("Time to determine row count: ", datetime.now() - start)

### Glance at Beginning and End

In [24]:
print("First 5 rows:")
all_years.head()

First 5 rows:


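|  | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1987 | 10 | 14 | 3 | 741.0 | 730 | 912.0 | 849 | PS | 1451 | ... |  |  | 0 |  | 0 |  |  |  |  |  |
| 1 | 1987 | 10 | 15 | 4 | 729.0 | 730 | 903.0 | 849 | PS | 1451 | ... |  |  | 0 |  | 0 |  |  |  |  |  |
| 2 | 1987 | 10 | 17 | 6 | 741.0 | 730 | 918.0 | 849 | PS | 1451 | ... |  |  | 0 |  | 0 |  |  |  |  |  |
| 3 | 1987 | 10 | 18 | 7 | 729.0 | 730 | 847.0 | 849 | PS | 1451 | ... |  |  | 0 |  | 0 |  |  |  |  |  |
| 4 | 1987 | 10 | 19 | 1 | 749.0 | 730 | 922.0 | 849 | PS | 1451 | ... |  |  | 0 |  | 0 |  |  |  |  |  |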


In [25]:
print("Last 5 rows:")
all_years.tail()

Last 5 rows:


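|  | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 499212 | 2008 | 12 | 13 | 6 | 1002.0 | 959 | 1204.0 | 1150 | DL | 1636 | ... | 6.0 | 45.0 | 0 |  | 0 |  |  |  |  |  |
| 499213 | 2008 | 12 | 13 | 6 | 834.0 | 835 | 1021.0 | 1023 | DL | 1637 | ... | 5.0 | 23.0 | 0 |  | 0 |  |  |  |  |  |
| 499214 | 2008 | 12 | 13 | 6 | 655.0 | 700 | 856.0 | 856 | DL | 1638 | ... | 24.0 | 12.0 | 0 |  | 0 |  |  |  |  |  |
| 499215 | 2008 | 12 | 13 | 6 | 1251.0 | 1240 | 1446.0 | 1437 | DL | 1639 | ... | 13.0 | 13.0 | 0 |  | 0 |  |  |  |  |  |
| 499216 | 2008 | 12 | 13 | 6 | 1110.0 | 1103 | 1413.0 | 1418 | DL | 1641 | ... | 8.0 | 11.0 | 0 |  | 0 |  |  |  |  |  |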


In [31]:
# Busiest origins: count flights per origin airport
start = datetime.now()
origin_counts = (all_years[['Origin','Year']].groupby('Origin').count().compute()
                 .sort_values(by='Year', ascending=False)
                 .rename(columns={"Year":"Count"})
                )

print("Time to count flights per origin: ", datetime.now() - start, "\n")
fmt_count = lambda x: "{0:,.0f}".format(x)  # renamed to avoid shadowing the built-in format()
print(origin_counts[:10].applymap(fmt_count))

Time to count flights per origin:  0:00:40.050300 

            Count
Origin           
ORD     6,597,442
ATL     6,100,953
DFW     5,710,980
LAX     4,089,012
PHX     3,491,077
DEN     3,319,905
DTW     2,979,158
IAH     2,884,518
MSP     2,754,997
SFO     2,733,910


## Which airports are most likely to be delayed flying out of or into?

## Which flights with the same origin and destination are most likely to be delayed?

## Can you regress how delayed a flight will be before it is delayed?

## What are the most important features for this regression?

# Regression of Delay

Dask is a solution for processing "big data"; however, unlike other big data solutions, it does not currently include methods for analysis such as generalized linear models. The following uses a series of simple random samples, each with its own k-fold cross-validation, to find the coefficient estimates of a linear model.
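
The sampling-plus-cross-validation idea can be sketched on synthetic data. This is an illustration under stated assumptions, not the notebook's actual pipeline: the "population" is simulated rather than read from the airline data, NumPy least squares stands in for scikit-learn's `LinearRegression`, and the helper name `kfold_coefs` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population" standing in for the full data set: y = 2 + 3*x + noise.
N = 100_000
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=N)

def kfold_coefs(x_s, y_s, n_splits=5):
    """Fit ordinary least squares on each training fold; return coefficients and test MSEs."""
    folds = np.array_split(np.arange(len(x_s)), n_splits)
    coefs, mses = [], []
    for test_idx in folds:
        train = np.ones(len(x_s), dtype=bool)
        train[test_idx] = False
        X_train = np.column_stack([np.ones(train.sum()), x_s[train]])
        beta, *_ = np.linalg.lstsq(X_train, y_s[train], rcond=None)
        X_test = np.column_stack([np.ones(len(test_idx)), x_s[test_idx]])
        mses.append(np.mean((y_s[test_idx] - X_test @ beta) ** 2))
        coefs.append(beta)
    return np.array(coefs), np.array(mses)

all_coefs, all_mses = [], []
for seed in [123, 456, 789, 101, 112]:
    # One simple random sample (without replacement) per seed.
    idx = np.random.default_rng(seed).choice(N, size=1_000, replace=False)
    fold_coefs, fold_mses = kfold_coefs(x[idx], y[idx])
    all_coefs.append(fold_coefs)
    all_mses.append(fold_mses)

# Average over 5 samples x 5 folds to get the final [intercept, slope] estimate.
mean_coefs = np.concatenate(all_coefs).mean(axis=0)
mean_mse = np.concatenate(all_mses).mean()
print("Mean coefficients [intercept, slope]:", mean_coefs)
print("Mean held-out MSE:", mean_mse)
```

The spread of the per-fold coefficients across samples gives a rough sense of how stable the estimates are when only a small fraction of the data fits in memory at once.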

**The following features will be used to explore whether a flight will have a departure delay.**

The predicted variable will be: 
* DepDelay

The explanatory variables:
* Month
* DayofMonth
* DayOfWeek
* DepTime
* UniqueCarrier
* Dest

* Possible features to create
    * Plane's flight number of the day
    * Plane's arrival delay of previous flight
    * Plane's age
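
Two of the candidate features above could be derived with a grouped cumulative count and a grouped shift. The sketch below uses made-up rows and assumes that `TailNum` identifies the aircraft and that ordering by `CRSDepTime` within a calendar day gives the plane's flight sequence. The new column names `FlightOfDay` and `PrevArrDelay` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; column names follow the airline data set.
flights = pd.DataFrame({
    "Year": [2008] * 5, "Month": [12] * 5, "DayofMonth": [13] * 5,
    "TailNum": ["N101", "N202", "N101", "N101", "N202"],
    "CRSDepTime": [1240, 700, 600, 1800, 1100],
    "ArrDelay": [9.0, 0.0, 25.0, -3.0, 12.0],
})

# Order each plane's flights within a day by scheduled departure time.
day_plane = ["Year", "Month", "DayofMonth", "TailNum"]
flights = flights.sort_values(day_plane + ["CRSDepTime"]).reset_index(drop=True)

grouped = flights.groupby(day_plane)
# 1-based position of the flight within the plane's schedule for that day.
flights["FlightOfDay"] = grouped.cumcount() + 1
# Arrival delay of the plane's previous flight that day (NaN for its first flight).
flights["PrevArrDelay"] = grouped["ArrDelay"].shift(1)

print(flights[["TailNum", "CRSDepTime", "FlightOfDay", "PrevArrDelay"]])
```

On the real data this would need care with missing tail numbers, flights that cross midnight, and local time zones, and the previous flight's arrival delay is only a fair predictor if it is known before the next departure.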

### Feature Preparation 


In [42]:
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

seeds = [123,456,789,101,112]

coefs = []


# Draw several random samples of the full data set, each as large as memory allows. Each sample gets its own cross-validation split.

for i in range(len(seeds)):
    start = datetime.now()
    # Take a sample from all the data
    all_years_reg = all_years[['ArrDelay','Distance', 'DepTime']].dropna().sample(frac=.0001, random_state=seeds[i]).compute()
    #print(all_years_reg.info())
    print(all_years_reg.shape)
        
    ######
    # Insert a cross validation split step here
    cv = KFold(n_splits=5)
    
    ######
    
    reg = linear_model.LinearRegression(n_jobs=-1)
    ArrDelay_X = all_years_reg[['Distance', 'DepTime']]
    ArrDelay_y = all_years_reg[['ArrDelay']]
    # reg.fit(ArrDelay_X, ArrDelay_y)
    print(cross_val_score(reg, ArrDelay_X, ArrDelay_y, scoring='neg_mean_squared_error', cv=cv, n_jobs=-1))
    #print('Coefficients: \n', reg.coef_)
    #coefs.append(reg.coef_)
    print("Time to sample and regress: ", datetime.now() - start)

print(coefs) # Chart this eventually

(12075, 3)
[ -524.06996918  -561.1476998  -1157.50267071  -976.45717156 -1344.74938536]
Time to sample and regress:  0:00:28.899890
(12075, 3)
[ -508.30384147  -640.45282691  -811.52108368  -779.14818309 -1691.81887899]
Time to sample and regress:  0:00:27.936793
(12075, 3)
[ -474.23533688  -817.47183487  -861.547646    -742.72202079 -1039.94753683]
Time to sample and regress:  0:00:23.774377
(12075, 3)
[ -565.22754141  -725.80441104  -804.64082215 -1050.43113609 -1932.72969948]
Time to sample and regress:  0:00:26.041604
(12075, 3)
[ -512.96497114  -554.70283698  -971.21771649 -1119.09744853 -1116.43919295]
Time to sample and regress:  0:00:29.011901
[]
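
The `neg_mean_squared_error` scores printed above are in squared minutes and carry a flipped sign, which makes them hard to read. As a small worked example, the sketch below converts the first sample's five fold scores to root mean squared error (RMSE) in minutes, which comes out at roughly 23 to 37 minutes per fold. RMSE is one possible choice of evaluation criterion here, not necessarily the one the assignment intends.

```python
import math

# Negative MSE scores copied from the first sample's output above (units: minutes squared).
neg_mse_scores = [-524.06996918, -561.1476998, -1157.50267071, -976.45717156, -1344.74938536]

# Flip the sign and take the square root to get RMSE in minutes.
rmse_minutes = [math.sqrt(-score) for score in neg_mse_scores]
mean_rmse = sum(rmse_minutes) / len(rmse_minutes)

print("RMSE per fold (minutes):", [round(r, 1) for r in rmse_minutes])
print("Mean RMSE (minutes):", round(mean_rmse, 1))
```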


In [70]:
coefs[1][0][0]

0.00045107047014612947

### Future Work

* Optimize with an index key based on date, departure time, and TailNum
* Use of alternative compression, such as snappy or LZ4
    * http://java-performance.info/performance-general-compression/
* Use a different big data approach to find a more efficient way of estimating the linear model coefficients:
    * Spark MLLib
    * Dask GLM
    * Turi/Graphlab Create

## Bibliography

* Dask Documentation, http://dask.pydata.org/en/latest/
* Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Boyd et al., http://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
* https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
* Variable Descriptions: http://stat-computing.org/dataexpo/2009/the-data.html

## Appendices

### Appendix A - CSV to Parquet Conversion

### Appendix B - Benchmark Tests