# Assignment: Analyzing Airline Flight Delays 
#### By Brett Hallum, Chris Ficklin, and Ryan Shuhart<br>April 2017

For a full treatment of the unit 14 case study, please review module 14.3. Some points from the video are given below.

Work with the airline data set (use R or Python to manage out-of-core).
Answer the following questions by using the split-apply-combine technique:
* Which airports are most likely to be delayed flying out of or into?
* Which flights with same origin and destination are most likely to be delayed?
* Can you regress how delayed a flight will be before it is delayed?
* What are the most important features for this regression?

Remember to properly cross-validate models.

Use meaningful evaluation criteria.

Create at least one new feature variable for the regression.

In [1]:
import dask.dataframe as dd #http://dask.pydata.org/en/latest/
import pandas as pd
import numpy as np
from datetime import datetime
from bokeh.io import output_notebook

from dask.distributed import Client
client = Client(set_as_default=True)
print(client)

### Other Settings
# Show more rows
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

# Prevent scientific notation of decimals
pd.set_option('precision',3)
pd.options.display.float_format = '{:,.3f}'.format

<Client: scheduler='tcp://127.0.0.1:54355' processes=4 cores=4>


In [16]:
# Allow inline display of bokeh graphics
output_notebook()

## [Here is some info about Dask]...

...General facts about Dask... blah blah

## Data

In [2]:
# http://stat-computing.org/dataexpo/2009/the-data.html
var_desc = pd.read_csv("../ref/var_descriptions.csv", index_col='var_id')
var_desc

Unnamed: 0_level_0,Name,Data Type,Description
var_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Year,int64,1987-2008
2,Month,int64,1 - 12
3,DayofMonth,int64,1 - 31
4,DayOfWeek,int64,1 (Monday) - 7 (Sunday)
5,DepTime,float64,"actual departure time (local, hhmm)"
6,CRSDepTime,int64,"scheduled departure time (local, hhmm)"
7,ArrTime,float64,"actual arrival time (local, hhmm)"
8,CRSArrTime,int64,"scheduled arrival time (local, hhmm)"
9,UniqueCarrier,O,unique carrier code
10,FlightNum,int64,flight number


In [30]:
# Data Location
# Ryan's
parq_folder = "C:/Users/ryan.shuhart/Downloads/AirlineDelays.tar/AirlineDelays/parquet-tiny/"
#parq_folder = "C:/Users/ryan.shuhart/Downloads/AirlineDelays.tar/AirlineDelays/parquet/"

# Load compressed Parquet format of all years ~2 sec
start = datetime.now()
df = dd.read_parquet(parq_folder)
print("Load parquet time: ", datetime.now() - start)
print()

# Length of dask dataframe ~3 min
start = datetime.now()
print("There are {:,d} rows".format(len(df))) #123,534,969 Matches Eric Larson
print("Time to determine row count: ", datetime.now() - start)

Load parquet time:  0:00:00.981056

There are 12,338 rows
Time to determine row count:  0:00:01.993114


### Glance at Beginning and End

In [31]:
print("First 5 rows:")
df.head()

First 5 rows:


Unnamed: 0,Year,Month,DayOfWeek,DepTime,CRSDepTime,UniqueCarrier,TailNum,ArrDelay,DepDelay,Origin,Dest,Distance
0,1987,10,6,1334.0,1330,AA,,20.0,4.0,LAX,SJC,308.0
1,1987,10,3,1310.0,1310,AA,,7.0,0.0,RNO,SFO,192.0
2,1987,10,1,1730.0,1730,NW,,20.0,0.0,DTW,MCI,629.0
3,1987,10,7,818.0,818,UA,,1.0,0.0,MDT,ORD,594.0
4,1987,10,4,1802.0,1750,TW,,5.0,12.0,STL,DCA,719.0


In [32]:
print("Last 5 rows:")
df.tail()

Last 5 rows:


Unnamed: 0,Year,Month,DayOfWeek,DepTime,CRSDepTime,UniqueCarrier,TailNum,ArrDelay,DepDelay,Origin,Dest,Distance
45,2008,12,4,1341.0,1325,WN,N659SW,18.0,16.0,RNO,BOI,335.0
46,2008,12,6,2220.0,2144,EV,N901EV,40.0,36.0,ATL,AVL,164.0
47,2008,12,3,2045.0,2005,B6,N603JB,48.0,40.0,FLL,HPN,1098.0
48,2008,12,7,622.0,627,OO,N251YV,-13.0,-5.0,LAX,SAN,109.0
49,2008,12,5,1042.0,1045,XE,N12552,-1.0,-3.0,IAH,SAV,851.0


## Feature Preparation and Creation

In [33]:
# Create an hour field
# 2400 minutes from midnight reduced to 2399 then int division drops to 23
df = df.assign(Hour=df.CRSDepTime.clip(upper=2399)//100) 

# Make Categories as categorical
df = df.categorize(['DayOfWeek', 'UniqueCarrier', 'Dest', 'Origin'])

# Months from 0 AD
df['FlightAge'] = 12*df['Year']+df['Month']-1

# The months from first flight is consider the approx age of the plane. 
# Unfortunately, trail numbers not tracked until 1995. 
tail_births = (df.groupby('TailNum')[['FlightAge']].min().reset_index()
                 .rename(columns={'FlightAge':'FirstFlight'}))


In [11]:
# df_tails = df[['TailNum','Year','Month','DepDelay']]#[df['Year']>1994]
# print(len(df_tails))
# df_tails['FlightAge'] = 12*df_tails['Year']+df_tails['Month']-1

# df_min_ages = df_tails.groupby('TailNum')[['FlightAge']].min().reset_index().rename(columns={'FlightAge':'FirstFlight'})
# df_tails = dd.merge(df_tails, df_min_ages, how='left', on='TailNum')

# df_tails['Age']= df_tails['FlightAge'] - df_tails['FirstFlight']

# df_tails_fir = df_tails.drop(['FlightAge','FirstFlight'], axis=1).dropna()
# #df_tails.compute().tail()
# df_tails.sample(.1)

## Flight Delays

When a schedule airflight is behind more than 15 minutes then it is officially delayed. Same logic will be followed for arrival times. Only arrivals 15 minutes past scheduled time will be considered late

http://aspmhelp.faa.gov/index.php/Types_of_Delay

### Aggregations

View visualization of dask distrubuted at work

http://127.0.0.1:8787/

In [12]:
import dask
start = datetime.now()
# Define some aggregations to plot
aggregations = (
    #1 Average departure delay by year
    df.groupby('Year').DepDelay.mean(),
    
    #2 Average departure delay by Month
    df.groupby('Month').DepDelay.mean(), 
    
    #3 Average departure delay by hour of day
    df.groupby('Hour').DepDelay.mean(), 
    
    #4 Average departure delay by Carrier, top 15
    df.groupby('UniqueCarrier').DepDelay.mean().nlargest(15), 
    
    #5 Average arrival delay by destination, top 15
    (df.groupby('Dest').ArrDelay.mean().nlargest(15) 
     .reset_index().rename(columns={'ArrDelay':'AvgArrDelay'})),
    
    #6 Count of arrivals to destinations, excludes missing
    (df.groupby('Dest').ArrDelay.count() 
     .reset_index().rename(columns={'ArrDelay':'ArrCount'})),
    
    #7 Average departure delay by origin, top 15
    (df.groupby('Origin').DepDelay.mean().nlargest(15).reset_index().rename(columns={'DepDelay':'AvgDepDelay'})),
    
    #8 Count of departures by origin, excludes missing
    (df.groupby('Origin').DepDelay.count().reset_index().rename(columns={'DepDelay':'DepCount'})), 
    
    #9 Average departure by origin and destination
    (df.groupby(['Origin','Dest']).DepDelay.mean().reset_index().rename(columns={'DepDelay':'AvgDepDelay'})),
    
    #10 Count of departures between origin and destination
    (df.groupby(['Origin','Dest']).DepDelay.count().reset_index().rename(columns={'DepDelay':'DepCount'})),
    
    #11 Percentage of officially delayed flights by origin
    ((df[df.DepDelay>15].groupby('Origin').DepDelay.count() / df.groupby('Origin').DepDelay.count())
     .reset_index().rename(columns={'DepDelay':'PercDepDelay'})),
    
    #12 Percentage of officially late flights by destination
    ((df[df.ArrDelay>15].groupby('Dest').ArrDelay.count() / df.groupby('Dest').ArrDelay.count())
     .reset_index().rename(columns={'ArrDelay':'PercArrDelay'})),
                
    #13 Percentage of officially delayed flights by origin and destination
    ((df[df.DepDelay>15].groupby(['Origin','Dest']).DepDelay.count() / df.groupby(['Origin','Dest']).DepDelay.count())
     .reset_index().rename(columns={'DepDelay':'PercDepDelay'})),
                
    #14 Percentage of officially late flights by origin and destination
    ((df[df.ArrDelay>15].groupby(['Origin','Dest']).ArrDelay.count() / df.groupby(['Origin','Dest']).ArrDelay.count())
     .reset_index().rename(columns={'ArrDelay':'PercArrDelay'})),
    
    #15 Average departure delay by hour of day
    df.groupby('DayOfWeek').DepDelay.mean()
)

# Compute them all in a single pass over the data
(delayed_by_year, #1
delayed_by_month, #2
delayed_by_hour, #3
delayed_by_carrier, #4
delayed_by_dest, #5
delayed_by_dest_count, #6
delayed_by_origin, #7
delayed_by_origin_count, #8
delayed_by_origin_dest, #9
delayed_by_origin_dest_count, #10
pct_delayed_by_origin, #11
pct_late_by_dest, #12
pct_delayed_by_origin_dest, #13
pct_late_by_origin_dest, #14
delayed_by_day
) = dask.compute(*aggregations)
print(datetime.now() - start)

0:00:22.842307


### Visualization of Average Delay

In [18]:
from bokeh.plotting import figure, show
from bokeh.charts.attributes import cat
from bokeh.charts import Bar
from bokeh.layouts import gridplot

# Average Delay by Year
p1 = Bar(delayed_by_year.reset_index(), 'Year', values= 'DepDelay', 
         legend=False, ylabel="Average Delay in Minutes", 
         title="Average Delay by Year")

# Average Delay by Month
delayed_by_month = delayed_by_month.sort_index()
p2 = Bar(delayed_by_month.reset_index(), 'Month', values= 'DepDelay', 
         legend=False, ylabel="Average Delay in Minutes", 
         title="Average Delay by Month")

# Average Delay by Hour of Day
p3 = Bar(delayed_by_hour.reset_index(), 'Hour', values= 'DepDelay', 
         legend=False, ylabel="Average Delay in Minutes",
         title="Average Delay by Hour of Day")

# Average Delay by Hour of Day
p4 = Bar(delayed_by_day.reset_index(), 'DayOfWeek', values= 'DepDelay', 
         legend=False, ylabel="Average Delay in Minutes",
         title="Average Delay by Day of Week")

# Average Delay by Carrier
delayed_by_carrier = delayed_by_carrier.reset_index()
delayed_by_carrier['UniqueCarrier'] = delayed_by_carrier['UniqueCarrier'].astype('O')
p5 = Bar(delayed_by_carrier, label=cat('UniqueCarrier', sort=False), values= 'DepDelay', 
         legend=False, ylabel="Average Delay in Minutes", xlabel="Unique Carrier", title="Average Delay by Carrier")


show(gridplot([[p1,p2],[p3,p4], [p5,None]], plot_width=400, plot_height=300))

## Which airports are most likely to be delayed flying out of or into?

In [14]:
airport_delays_pcts = (pd.merge(pct_delayed_by_origin, pct_late_by_dest, left_on='Origin', right_on='Dest')
                 .assign(AvgDelay= lambda x: (x['PercDepDelay'] + x['PercArrDelay'])/2)
                 .sort_values(by='AvgDelay', ascending=False)
                 .drop('Dest', axis=1)
                )

airport_delays_pcts = pd.merge(airport_delays_pcts, delayed_by_origin_count, on='Origin')

airport_delays_pcts[airport_delays_pcts['DepCount'] > 50].nlargest(15, 'AvgDelay')

Unnamed: 0,Origin,PercDepDelay,PercArrDelay,AvgDelay,DepCount
47,PBI,0.183,0.275,0.229,60
49,EWR,0.21,0.233,0.221,248
52,SFO,0.153,0.281,0.217,262
54,BWI,0.207,0.219,0.213,184
55,SEA,0.171,0.253,0.212,181
56,LGA,0.179,0.239,0.209,212
57,ATL,0.201,0.213,0.207,617
58,JFK,0.182,0.232,0.207,154
64,CLT,0.172,0.225,0.198,233
67,ORD,0.196,0.193,0.195,693


## Which flights with same origin and destination are most likely to be delayed?

In [26]:
org_dest_pcts = (pd.merge(pct_delayed_by_origin_dest, pct_late_by_origin_dest, on=['Origin','Dest'])
                 .assign(AvgDelay= lambda x: (x['PercDepDelay'] + x['PercArrDelay'])/2)
                 .sort_values(by='AvgDelay', ascending=False)
                )

org_dest_pcts = pd.merge(org_dest_pcts, delayed_by_origin_dest_count, on=['Origin','Dest'])

org_dest_pcts[org_dest_pcts['DepCount'] > 5].nlargest(15, 'AvgDelay')

Unnamed: 0,Origin,Dest,PercDepDelay,PercArrDelay,AvgDelay,DepCount
206,ORD,CID,0.714,0.571,0.643,7
213,MSP,SEA,0.5,0.667,0.583,6
214,EWR,MCO,0.667,0.5,0.583,6
213,MSP,SEA,0.5,0.667,0.583,6
214,EWR,MCO,0.667,0.5,0.583,6
215,ORD,ABE,0.571,0.571,0.571,7
216,SAN,LAS,0.571,0.571,0.571,7
215,ORD,ABE,0.571,0.571,0.571,7
216,SAN,LAS,0.571,0.571,0.571,7
217,JFK,DCA,0.625,0.5,0.562,8


## Can you regress how delayed a flight will be before it is delayed?

## What are the most important features for this regression?

# Regression of Delay

The Dask module is a solution for processing "big data," however, the it currently does not include methods for analysis, such as generalized linear models, like other big data solutions. The following will use a series of simple random sampling and kfold cross validation to find the coefficient estimates of a linear model.

#### The following features will be explore to predict if the flight will have departure delay

##### The predicted variable will be: 
* DepDelay

##### The explanatory variables:
* Month
* DayofMonth
* DayOfWeek
* DepTime



* Dask currently doesn't have a workable out-of-core glm process. The below categorical variables add many dimensions 
that exponentially inflate the required memory
    * UniqueCarrier
    * Dest

##### Possible features to Create
* Plane's flight number of the day - Possibly highly correlated with hour of the day
* Plane's arrival delay of previous flight
* Plane's age

In [None]:
# https://adventuresindatascience.wordpress.com/2014/12/30/minibatch-learning-for-large-scale-data-using-scikit-learn/


In [59]:
y

array([  0.,   0.,   3., ...,   1.,  12.,  -5.])

In [64]:
Xy = df[['DepDelay'] + Xcols].dropna().sample(.3).compute()
X = Xy[Xcols]
y = Xy['DepDelay'].values

In [60]:
from sklearn import linear_model
reg = linear_model.SGDRegressor()

reg.partial_fit(X,y)

SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=False)

In [76]:
sgd = linear_model.SGDRegressor(loss="squared_loss")
reg = linear_model.LinearRegression(n_jobs=-1)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

Xy = df[['DepDelay'] + Xcols].dropna().compute()
X = Xy[Xcols]
y = Xy['DepDelay'].values

#X = scaler.fit_transform(X) 

sgd.fit(X,y)
print("SGD:",sgd.coef_)

reg.fit(X, y)
print("Reg:",reg.coef_)

SGD: [  4.65615772e+11  -4.53513553e+10]
Reg: [ 0.60107124  0.00205056]


In [41]:
from sklearn import linear_model
reg = linear_model.LinearRegression(n_jobs=-1)
#reg = linear_model.SGDRegressor

# from sklearn.pipeline import Pipeline
# pipe_lr = Pipeline([('scl', StandardScaler()),
#                     ('pca', PCA()),
#                     ('clf', LogisticRegression(random_state=1))])


Xcols = ['Hour', 'Distance']
ycol =  'ArrDelay'

start = datetime.now()
Xy = df[['DepDelay'] + Xcols].dropna()
X = Xy[Xcols]
y = Xy['DepDelay'].values

reg.fit(X, y)

print('Coefficients: \n', reg.coef_)
print("Time to sample and regress: ", datetime.now() - start)

Coefficients: 
 [ 0.60107124  0.00205056]
Time to sample and regress:  0:00:12.370707


In [62]:
from sklearn import linear_model
reg = linear_model.SGDRegressor()

# from sklearn.pipeline import Pipeline
# pipe_lr = Pipeline([('scl', StandardScaler()),
#                     ('pca', PCA()),
#                     ('clf', LogisticRegression(random_state=1))])


Xcols = ['Hour', 'Distance']
ycol =  'DepDelay'

seeds = [123,456,789,101,112]
coefs = []


# Sample the entire data set as large as possible a few times. Each time has it's own cross validation sampling.

for i in range(10):
    start = datetime.now()
    # Take a sample from all the data
    Xy = df[['DepDelay'] + Xcols].dropna().sample(.3).compute()
    X = Xy[Xcols]
    y = Xy['DepDelay'].values

    reg.partial_fit(X, y)
    #print(cross_val_score(reg, ArrDelay_X, ArrDelay_y, scoring='neg_mean_squared_error', cv=cv, n_jobs=-1))
    print('Coefficients: \n', reg.coef_)
    coefs.append(reg.coef_)
    print("Time to sample and regress: ", datetime.now() - start)

print(coefs) # Chart this eventually

    DepDelay  Hour  Distance
12     0.000    13   451.000
10     0.000    15    72.000
15     4.000    13   340.000
16     7.000    21   550.000
38     0.000     6   441.000
Coefficients: 
 [  7.00715811e+11  -1.07369311e+12]
Time to sample and regress:  0:00:08.817504
    DepDelay  Hour  Distance
48     0.000    16   317.000
11     0.000     7   110.000
13     6.000    11   192.000
4     12.000    17   719.000
1      0.000    13   192.000
Coefficients: 
 [ -1.11949940e+11  -7.34485929e+11]
Time to sample and regress:  0:00:09.120522
    DepDelay  Hour  Distance
22     0.000    12 1,258.000
6     70.000    20   113.000
47   102.000     8   948.000
2      0.000    17   629.000
59     0.000     7   487.000
Coefficients: 
 [  3.95308037e+11  -5.02516475e+10]
Time to sample and regress:  0:00:08.696497
    DepDelay  Hour  Distance
49     0.000     8   221.000
28    -1.000    18   414.000
5     16.000     7 1,975.000
62    -1.000    16 1,045.000
33    -2.000     9   191.000
Coefficients: 
 

In [38]:
reg_coefs = pd.DataFrame.from_records(coefs, columns=Xcols)
reg_coefs

Unnamed: 0,Hour,Distance
0,0.509,0.002
1,0.596,0.002
2,0.568,0.002
3,0.549,0.002
4,0.521,0.001


In [63]:
sgd_coefs = pd.DataFrame.from_records(coefs, columns=Xcols)
sgd_coefs

Unnamed: 0,Hour,Distance
0,15039909278.51,-362174193996.291
1,15039909278.51,-362174193996.291
2,15039909278.51,-362174193996.291
3,15039909278.51,-362174193996.291
4,15039909278.51,-362174193996.291
5,15039909278.51,-362174193996.291
6,15039909278.51,-362174193996.291
7,15039909278.51,-362174193996.291
8,15039909278.51,-362174193996.291
9,15039909278.51,-362174193996.291


In [3]:
var_desc = pd.read_csv("../ref/var_descriptions.csv", index_col='var_id')

columns = ['Origin','Year', 'CRSDepTime','DepDelay','ArrDelay', 'DayOfWeek', 'DepTime', 
       'Month', 'DepTime', 'UniqueCarrier', 'Dest', 'TailNum']

use_vars = var_desc[var_desc['Name'].isin(columns)]

### Future Work

* Optimize with index key base on Data, deptarture time, and TailNum
* Use of alternative compression, such as snappy or LZ4
    * http://java-performance.info/performance-general-compression/
* Use a diffent big data approach to find a more efficient way to estimating the linear model coefficients:
    * Spark MLLib
    * Dask GLM
    * Turi/Graphlab Create

## Bibliography

* Dask Documentation, http://dask.pydata.org/en/latest/
* Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Boyd, et al http://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
* https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
* Variable Descriptions: http://stat-computing.org/dataexpo/2009/the-data.html
* Dask example using airline data https://jcrist.github.io/dask-sklearn-part-3.html

## Appendices

### Appendix A - CSV to Parquet Conversion

In [29]:
# Convert csv to parquet
csv_folder = "C:/Users/ryan.shuhart/Downloads/AirlineDelays.tar/AirlineDelays/*.csv"
parq_folder = "C:/Users/ryan.shuhart/Downloads/AirlineDelays.tar/AirlineDelays/parquet-tiny/"

var_desc = pd.read_csv("../ref/var_descriptions.csv", index_col='var_id')

columns = ['Year', 'Month', 'DayOfWeek', 'Origin','Dest', 'DepTime', 'CRSDepTime',
           'DepDelay','ArrDelay', 'UniqueCarrier', 'TailNum', 'Distance']

use_vars = var_desc[var_desc['Name'].isin(columns)]

def csv_to_parquet(csv_folder, parq_folder):
    start = datetime.now()
    df_csv = dd.read_csv(csv_folder,                       
                         usecols = use_vars['Name'],
                         dtype=dict(use_vars[['Name','Data Type']].values), 
                         encoding='iso-8859-1')

    print(df_csv.head())

    # Flip to parquet
    df_csv.sample(.0001).to_parquet(parq_folder,
                      compression='gzip',
                      object_encoding='utf8')

    time_to_complete = datetime.now() - start
    print(time_to_complete)

csv_to_parquet(csv_folder, parq_folder)

   Year  Month  DayOfWeek  DepTime  CRSDepTime UniqueCarrier TailNum  \
0  1987     10          3  741.000         730            PS     NaN   
1  1987     10          4  729.000         730            PS     NaN   
2  1987     10          6  741.000         730            PS     NaN   
3  1987     10          7  729.000         730            PS     NaN   
4  1987     10          1  749.000         730            PS     NaN   

   ArrDelay  DepDelay Origin Dest  Distance  
0    23.000    11.000    SAN  SFO   447.000  
1    14.000    -1.000    SAN  SFO   447.000  
2    29.000    11.000    SAN  SFO   447.000  
3    -2.000    -1.000    SAN  SFO   447.000  
4    33.000    19.000    SAN  SFO   447.000  
0:02:46.349515


### Appendix B - Benchmark Tests

### Appendix C - Comparison of Dask Files
* Ryan's Hardware: 
    - CPU: Intel i5-4300M @ 2.60GHz
    - Disk: Samsung SSD 850 Pro
    - RAM: 8 GB
    

* Dask using original csv:
    - no conversion
    - size on disk
        - 11.2 gb
    - benchmark of describing 'Distance':
        - Approx. 4 minutes
* Dask using uncompressed parquet: 
    - conversion to parquet
        - approx 10 minutes
    - size on disk:
        - 13.8 gb
    - benchmark of describing 'Distance':
        - 1 loop, best of 3: 6.2 s per loop
* Dask using gzip compressed parquet:
    - converstion to parquet
        - approx 42 minutes
    - size on disk:
        - 1.36 gb <- big difference
    - benchmark of describing 'Distance':
        - 1 loop, best of 3: 8.83 s per loop

#### Summary
Dask allows for out of core management of data sets. CSV files are universal, but slow to process. Converting to parquet file format, speeds up the process by a factor of 38. Using the gzip compression, reduces size on disk from 13.8gb to 1.36 or about 10% of the uncompressed size. This comes in handy for a distributed processing in a cluster since not as much network bandwidth would be needed. The trade off of compression is a 42.4% increasing in processing time, however, 3 additional seconds is hardly noticable, but might be more of an issue for other tasks. 