# Assignment: Analyzing Airline Flight Delays
For a full treatment of the unit 14 case study, please review module 14.3. Some points from the video are given below.

Work with the airline data set (use R or Python to manage the data out of core).
Answer the following questions by using the split-apply-combine technique:
* Which airports are most likely to be delayed flying out of or into?
* Which flights with the same origin and destination are most likely to be delayed?
* Can you regress how delayed a flight will be before it is delayed?
* What are the most important features for this regression?

Remember to properly cross-validate models.

Use meaningful evaluation criteria.

Create at least one new feature variable for the regression.

In [16]:
import dask.dataframe as dd #http://dask.pydata.org/en/latest/
import pandas as pd
from datetime import datetime
from bokeh.io import output_notebook

### Other Settings
# Show more rows
pd.options.display.max_rows = 999

# Prevent scientific notation of decimals
pd.set_option('display.precision', 3)  # full option name; the bare 'precision' is ambiguous in newer pandas
pd.options.display.float_format = '{:,.3f}'.format

# Data Location
# Ryan's
parq_folder = "C:/Users/ryan.shuhart/Downloads/AirlineDelays.tar/AirlineDelays/parquet/"

In [3]:
# Allow inline display of bokeh graphics
output_notebook()

## [Here is some info about Dask]...

...General facts about Dask... blah blah

#### Comparison of Dask Files
* Ryan's Hardware: 
    - CPU: Intel i5-4300M @ 2.60GHz
    - Disk: Samsung SSD 850 Pro
    - RAM: 8 GB
    

* Dask using original CSV:
    - no conversion
    - size on disk:
        - 11.2 GB
    - benchmark of describing 'Distance':
        - approx. 4 minutes
* Dask using uncompressed parquet: 
    - conversion to parquet:
        - approx. 10 minutes
    - size on disk:
        - 13.8 GB
    - benchmark of describing 'Distance':
        - 1 loop, best of 3: 6.2 s per loop
* Dask using gzip-compressed parquet:
    - conversion to parquet:
        - approx. 42 minutes
    - size on disk:
        - 1.36 GB <- big difference
    - benchmark of describing 'Distance':
        - 1 loop, best of 3: 8.83 s per loop

#### Summary
Dask allows for out-of-core management of data sets. CSV files are universal but slow to process. Converting to the parquet file format speeds up the 'Distance' benchmark by a factor of roughly 39 (about 4 minutes versus 6.2 s). Gzip compression reduces the size on disk from 13.8 GB to 1.36 GB, about 10% of the uncompressed size, which would help with distributed processing in a cluster because less network bandwidth is needed. The trade-off of compression is a 42.4% increase in processing time (8.83 s versus 6.2 s). Fewer than 3 additional seconds is hardly noticeable here, but the overhead might matter more for other tasks.
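The ratios in this summary can be rechecked from the benchmark figures listed above. The snippet below is only an arithmetic check; it assumes the "approx. 4 minutes" CSV timing is exactly 240 seconds, so the speedup is itself approximate.

```python
# Benchmark figures copied from the comparison list above.
csv_seconds = 4 * 60      # "approx. 4 minutes", taken as 240 s (an assumption)
parquet_seconds = 6.2     # uncompressed parquet
gzip_seconds = 8.83       # gzip-compressed parquet
parquet_gb = 13.8
gzip_gb = 1.36

# Speedup of uncompressed parquet over CSV
speedup = csv_seconds / parquet_seconds
# Size of the gzip files relative to the uncompressed parquet files
size_ratio = gzip_gb / parquet_gb
# Relative increase in processing time caused by gzip
slowdown = (gzip_seconds - parquet_seconds) / parquet_seconds

print(f"Parquet speedup over CSV: {speedup:.1f}x")
print(f"gzip size relative to uncompressed parquet: {size_ratio:.1%}")
print(f"gzip processing-time increase: {slowdown:.1%}")
```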

## Exploratory

In [8]:
# Data types of the fields used in the analysis
dts = {'ActualElapsedTime': 'float64', # Confirmed
 'AirTime': 'float64', # Confirmed
 'ArrDelay': 'float64', # Confirmed
 'ArrTime': 'float64', # Confirmed
 'CRSArrTime': 'int64', # Confirmed
 'CRSDepTime': 'int64', # Confirmed
 'CRSElapsedTime': 'float64', # Confirmed
 'CancellationCode': 'O', # Confirmed by lesson video
 'Cancelled': 'int64', # Confirmed
 'CarrierDelay': 'float64', # Confirmed
 'DayOfWeek': 'int64', # Confirmed
 'DayofMonth': 'int64', # Confirmed
 'DepDelay': 'float64', # Confirmed
 'DepTime': 'float64', # Confirmed
 'Dest': 'O', # Confirmed
 'Distance': 'float64', # Confirmed
 'Diverted': 'int64', # Confirmed
 'FlightNum': 'int64', # Exploring if int or string
 'LateAircraftDelay': 'float64', # Confirmed
 'Month': 'int64', # Confirmed
 'NASDelay': 'float64', # Confirmed
 'Origin': 'O', # Confirmed
 'SecurityDelay': 'float64', # Confirmed
 'TailNum': 'O', # Confirmed
 'TaxiIn': 'float64', # Confirmed
 'TaxiOut': 'float64', # Confirmed
 'UniqueCarrier': 'O', # Confirmed
 'WeatherDelay': 'float64', # Confirmed
 'Year': 'int64'} # Confirmed
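The notebook does not show where `dts` is consumed. A mapping like this is typically passed as the `dtype` argument of a CSV reader, so the sketch below assumes that use and runs a subset of it through `pandas.read_csv` on two made-up rows. It also illustrates one likely reason the delay columns are `float64` rather than `int64`: they contain missing values, which `int64` cannot represent.

```python
import io
import pandas as pd

# Two made-up rows; the second mimics a cancelled flight with no arrival delay.
csv_text = """Year,Origin,Dest,ArrDelay,Cancelled
2008,ORD,ATL,12,0
2008,ATL,DFW,,1
"""

# A subset of the dtype mapping defined above.
subset_dts = {'Year': 'int64', 'Origin': 'O', 'Dest': 'O',
              'ArrDelay': 'float64', 'Cancelled': 'int64'}

toy = pd.read_csv(io.StringIO(csv_text), dtype=subset_dts)
print(toy.dtypes)

# The empty ArrDelay field is read as NaN, which requires a float column.
print("Missing ArrDelay values:", toy['ArrDelay'].isna().sum())
```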

In [13]:
# Load compressed Parquet format of all years
start = datetime.now()
all_years = dd.read_parquet(parq_folder)
print("Load parquet time: ", datetime.now() - start)
print()

# Length of dask dataframe
start = datetime.now()
print("There are {:,d} rows".format(len(all_years))) #123,534,969 Matches Eric Larson
print("Time to determine row count: ", datetime.now() - start)

# Tiny sample of the data
all_years.sample(frac=.00001).head()

Load parquet time:  0:00:01.217000

There are 123,534,969 rows
Time to determine row count:  0:02:52.357000


In [31]:
# Busiest Origins
start = datetime.now()
origin_counts = (all_years[['Origin','Year']].groupby('Origin').count().compute()
                 .sort_values(by='Year', ascending=False)
                 .rename(columns={"Year":"Count"})
                )

print("Time to count flights by origin: ", datetime.now() - start, "\n")
fmt = lambda x: "{0:,.0f}".format(x)  # named fmt to avoid shadowing the built-in format()
print(origin_counts[:10].applymap(fmt))

Time to count flights by origin:  0:00:40.050300 

            Count
Origin           
ORD     6,597,442
ATL     6,100,953
DFW     5,710,980
LAX     4,089,012
PHX     3,491,077
DEN     3,319,905
DTW     2,979,158
IAH     2,884,518
MSP     2,754,997
SFO     2,733,910


## Which airports are most likely to be delayed flying out of or into?
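One way to approach this question with split-apply-combine is sketched below on a handful of made-up flights. The 15-minute threshold for calling a flight "delayed" and the output column names are assumptions for illustration, not choices taken from the assignment. The same groupby pattern should carry over to the Dask dataframe, followed by `.compute()`.

```python
import pandas as pd

# Made-up flights; the real analysis would run on the full airline data.
flights = pd.DataFrame({
    'Origin':   ['ORD', 'ORD', 'ORD', 'ATL', 'ATL', 'SFO'],
    'Dest':     ['ATL', 'SFO', 'ATL', 'ORD', 'SFO', 'ORD'],
    'DepDelay': [30.0, 5.0, 45.0, 0.0, 20.0, -3.0],
    'ArrDelay': [25.0, -2.0, 50.0, 3.0, 16.0, 40.0],
})

DELAY_MINUTES = 15  # assumed threshold for counting a flight as delayed

# Split by airport, apply a mean to the delay flag, combine into one table.
out_rate = (flights.assign(Delayed=flights['DepDelay'] >= DELAY_MINUTES)
                   .groupby('Origin')['Delayed'].mean()
                   .rename('DepDelayRate'))
in_rate = (flights.assign(Delayed=flights['ArrDelay'] >= DELAY_MINUTES)
                  .groupby('Dest')['Delayed'].mean()
                  .rename('ArrDelayRate'))

# One row per airport: share of delayed departures and of delayed arrivals.
airport_rates = (pd.concat([out_rate, in_rate], axis=1)
                   .sort_values('DepDelayRate', ascending=False))
print(airport_rates)
```

Using a rate rather than a raw count keeps the busiest airports from dominating the ranking simply because they handle more flights.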

## Which flights with same origin and destination are most likely to be delayed?
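A route-level version of the same idea groups on the origin and destination pair. The sketch below uses made-up rows. The 15-minute threshold and the minimum flight count are assumed values, the latter added so that a route with a single very late flight does not top the ranking.

```python
import pandas as pd

# Made-up flights for illustration only.
flights = pd.DataFrame({
    'Origin':   ['ORD', 'ORD', 'ORD', 'ATL', 'ATL', 'ATL', 'SFO'],
    'Dest':     ['ATL', 'ATL', 'ATL', 'ORD', 'ORD', 'ORD', 'ORD'],
    'ArrDelay': [25.0, 50.0, 5.0, 3.0, 16.0, -4.0, 90.0],
})

DELAY_MINUTES = 15  # assumed threshold for counting a flight as delayed
MIN_FLIGHTS = 3     # assumed minimum sample size per route

# Split by (Origin, Dest), apply several aggregations, combine into one table.
route_stats = (flights.assign(Delayed=flights['ArrDelay'] >= DELAY_MINUTES)
                      .groupby(['Origin', 'Dest'])
                      .agg(Flights=('ArrDelay', 'size'),
                           DelayRate=('Delayed', 'mean'),
                           MeanArrDelay=('ArrDelay', 'mean')))

# Drop thinly sampled routes, then rank by the share of delayed flights.
ranked = (route_stats[route_stats['Flights'] >= MIN_FLIGHTS]
          .sort_values('DelayRate', ascending=False))
print(ranked)
```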

## Can you regress how delayed a flight will be before it is delayed?
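Predicting a delay before it happens suggests using only fields known ahead of time, such as the scheduled times, rather than the actual `DepTime`. The sketch below derives a few candidate features from scheduled fields, which would also cover the assignment's request for at least one new feature variable. It assumes `CRSDepTime` is encoded as an hhmm integer and that `DayOfWeek` runs from 1 (Monday) to 7 (Sunday); the new column names are hypothetical.

```python
import pandas as pd

# Made-up scheduled departures; 1745 is read as 17:45 (assumed hhmm encoding).
sched = pd.DataFrame({'CRSDepTime': [5, 630, 1745, 2359],
                      'DayOfWeek': [1, 5, 6, 7]})

# Scheduled departure hour (0-23)
sched['CRSDepHour'] = sched['CRSDepTime'] // 100
# Minutes since midnight, which is linear in time unlike the raw hhmm value
sched['CRSDepMinuteOfDay'] = sched['CRSDepHour'] * 60 + sched['CRSDepTime'] % 100
# Weekend indicator, assuming 6 = Saturday and 7 = Sunday
sched['IsWeekend'] = sched['DayOfWeek'].isin([6, 7]).astype('int64')

print(sched)
```

The raw hhmm value jumps by 41 between 1759 and 1800, so a linear model may fit minutes since midnight or the hour more sensibly than the encoded time.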

## What are the most important features for this regression?
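Raw regression coefficients are not directly comparable when the features have different units (miles versus a time of day). One common option, sketched here on synthetic data rather than the airline data, is to standardize the features and the response first and compare the absolute values of the resulting coefficients. The variable names and the data-generating numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-ins for two features and a delay response (not real data).
distance = rng.uniform(100, 2500, n)              # miles
dep_hour = rng.integers(5, 24, n).astype(float)   # hour of day
arr_delay = 0.001 * distance + 1.5 * dep_hour + rng.normal(0, 10, n)

X = np.column_stack([distance, dep_hour])

# Standardize features and response to zero mean and unit variance.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (arr_delay - arr_delay.mean()) / arr_delay.std()

# Ordinary least squares on the standardized variables; no intercept is
# needed because every column is centered.
beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)

importance = dict(zip(['Distance', 'DepHour'], np.abs(beta)))
print(importance)
```

Standardized coefficients are only a rough guide when features are correlated, so they are best read alongside cross-validated error with and without each feature.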

# Regression of Delay

The Dask module is a solution for processing "big data"; however, it does not currently include analysis methods such as generalized linear models, as other big data solutions do. The following uses repeated simple random sampling and k-fold cross-validation to estimate the coefficients of a linear model.
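The k-fold cross-validation step is not written out in the notebook. A minimal sketch of what it might look like with scikit-learn follows, using synthetic data in place of one sampled frame. The choice of five folds and of mean absolute error as the evaluation criterion are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(123)
n = 2000

# Synthetic stand-ins for Distance and DepTime, plus a noisy delay response.
X = np.column_stack([rng.uniform(100, 2500, n), rng.uniform(0, 2400, n)])
y = 0.001 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 30, n)

# Five shuffled folds; each fold is held out once while the rest train the model.
cv = KFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring='neg_mean_absolute_error')

# scikit-learn reports negated errors so that higher is better; flip the sign.
mae_per_fold = -scores
print("MAE per fold:", mae_per_fold)
print("Mean MAE: {:.2f} (std {:.2f})".format(mae_per_fold.mean(), mae_per_fold.std()))
```

In the loop below, the same call could replace the placeholder comment, with `ArrDelay_X` and `ArrDelay_y` defined before it.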

In [32]:
from sklearn import linear_model
seeds = [123,456,789,101,112]

coefs = []


# Sample the entire data set, as large as possible, a few times. Each sample gets its own cross-validation split.
start = datetime.now()
for seed in seeds:
    # Take a sample from all the data
    all_years_reg = all_years[['ArrDelay','Distance', 'DepTime']].dropna().sample(frac=.0001, random_state=seed).compute()
    #print(all_years_reg.info())
    print(all_years_reg.shape)
        
    ######
    # Insert a cross validation split step here
    ######
    
    reg = linear_model.LinearRegression()
    ArrDelay_X = all_years_reg[['Distance', 'DepTime']]
    ArrDelay_y = all_years_reg[['ArrDelay']]
    reg.fit(ArrDelay_X, ArrDelay_y)
    print('Coefficients: \n', reg.coef_)
    coefs.append(reg.coef_)
    print("Time to sample and regress: ", datetime.now() - start)

print(coefs) # Chart this eventually

(12075, 3)
Coefficients: 
 [[ 0.00183383  0.01027518]]
Time to sample and regress:  0:00:19.308600
(12075, 3)
Coefficients: 
 [[  1.98780496e-05   1.05399143e-02]]
Time to sample and regress:  0:00:36.565000
(12075, 3)
Coefficients: 
 [[ 0.00089394  0.00993085]]
Time to sample and regress:  0:00:53.828000
(12075, 3)
Coefficients: 
 [[ 0.00061025  0.01016242]]
Time to sample and regress:  0:01:13.605000
(12075, 3)
Coefficients: 
 [[ 0.0001067   0.01025777]]
Time to sample and regress:  0:01:36.391100
[array([[ 0.00183383,  0.01027518]]), array([[  1.98780496e-05,   1.05399143e-02]]), array([[ 0.00089394,  0.00993085]]), array([[ 0.00061025,  0.01016242]]), array([[ 0.0001067 ,  0.01025777]])]
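Before charting, the five coefficient pairs printed above can be summarized numerically. The snippet below copies those printed values into an array (the name `coef_runs` is introduced here for illustration) and reports the mean and sample standard deviation for each feature.

```python
import numpy as np

# Coefficients [Distance, DepTime] copied from the five runs printed above.
coef_runs = np.array([
    [0.00183383,     0.01027518],
    [1.98780496e-05, 1.05399143e-02],
    [0.00089394,     0.00993085],
    [0.00061025,     0.01016242],
    [0.0001067,      0.01025777],
])

mean = coef_runs.mean(axis=0)
std = coef_runs.std(axis=0, ddof=1)  # sample standard deviation across runs

for name, m, s in zip(['Distance', 'DepTime'], mean, std):
    print("{:<8s} mean={:.6f} std={:.6f} std/mean={:.2f}".format(name, m, s, s / m))
```

In these runs the DepTime coefficient barely moves between samples, while the Distance coefficient ranges over roughly two orders of magnitude, which suggests its estimate is not stable at this sample size.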


In [70]:
coefs[1][0][0]

0.00045107047014612947

### Future Work

* Optimize with an index key based on date, departure time, and TailNum
* Use of alternative compression, such as snappy or LZ4
    * http://java-performance.info/performance-general-compression/
* Use a different big data approach to find a more efficient way of estimating the linear model coefficients:
    * Spark MLLib
    * Dask GLM
    * Turi/Graphlab Create

## Bibliography

* Dask Documentation, http://dask.pydata.org/en/latest/
* Boyd et al., Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, http://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf

## Appendices

### Appendix A - CSV to Parquet Conversion

### Appendix B - Benchmark Tests