# Daily Stopwatch Data Science: Two Sigma Financial Modeling Challenge

Note: this notebook is part of my work on a Kaggle competition, the Two Sigma Financial Modeling Challenge. For more information, see https://www.kaggle.com/c/two-sigma-financial-modeling

## Stage 1: Ask a question

My objective is to predict the value of the unknown variable $y$, which is a time series.

I measure performance by R-squared. The actual competition also seems to have an online version of it, for reinforcement learning. I will get back to that later.
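As a reference for the metric, here is a minimal sketch of R-squared computed with NumPy. The function name `r_squared` and the toy arrays are my own illustration and are not taken from the competition kit.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy check (hypothetical values)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y_true, y_true)                        # exact predictions
baseline = r_squared(y_true, np.full(4, y_true.mean()))    # always predict the mean
reversed_fit = r_squared(y_true, y_true[::-1])             # worse than the mean
```

Exact predictions give 1, always predicting the mean gives 0, and a model that does worse than the mean gives a negative value, so R-squared is not bounded below by zero.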

## Stage 2: Set the environment up and get data

First, set up a directory for the data and link it to this workspace. Download the data into a directory of your choice.

In [29]:
#import sys
#reload(sys)
#sys.setdefaultencoding("unicode")

#Set up the environment
import numpy as np                         #Numpy
import pandas as pd                        #Pandas
import matplotlib.pyplot as plt            #Plot

In [30]:

# Set up data directory
DataDir = "C:/Users/Admin/Documents/data/"

# Here's an example of loading the HDF5 file using pandas's built-in HDF5 support:
with pd.HDFStore(DataDir+ "train.h5", "r") as train:
    # Note that the "train" dataframe is the only dataframe in the file
    df = train.get("train")

In [31]:
%matplotlib inline
# Make plots appear inline

## Stage 3: Explore the data

Explore, Visualize, Clean, Transform, Feature engineering

In [32]:
# Basic stats
len(df), len(df.columns)   #number of rows and columns

(1710756, 111)

In [33]:
#See all column names and types
#print(df.dtypes.to_string())

In [34]:
#See first ten rows
#df.head(10)

In [35]:
#See last ten rows
#df.tail(10)

Here one can see something peculiar about this data: there are ids and timestamps. Let's check the number of unique values of each.

In [36]:
len(df["timestamp"].unique()), len(df["id"].unique())

(1813, 1424)
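Since 1813 x 1424 = 2,581,712 is larger than the 1,710,756 rows in the data, not every id can be present at every timestamp. The sketch below shows how one might inspect such a panel; the toy frame and the variable names are hypothetical and only mimic the `id`/`timestamp` layout.

```python
import pandas as pd

# Toy panel (hypothetical values): id 3 only enters at timestamp 1,
# and id 2 drops out after timestamp 1.
toy = pd.DataFrame({
    "id":        [1, 2, 1, 2, 3, 1, 3],
    "timestamp": [0, 0, 1, 1, 1, 2, 2],
})

# How many distinct ids are observed at each timestamp
ids_per_timestamp = toy.groupby("timestamp")["id"].nunique()

# How many distinct timestamps each id is observed at
timestamps_per_id = toy.groupby("id")["timestamp"].nunique()

# Fraction of the full id x timestamp grid that is actually observed
coverage = len(toy) / float(toy["id"].nunique() * toy["timestamp"].nunique())
```

On the real data the same three quantities would show how uneven the history is across ids, which matters for any per-id feature.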

Let's create a summary table for the columns, including type, min, mean, max, sigma, fraction of zeros, and fraction of missing values.

In [50]:
df_column_summary = pd.DataFrame(df.dtypes,columns=['type'])
df_column_summary.reset_index(inplace=True)
df_column_summary['min'] = list(df.min())
df_column_summary['mean'] = list(df.mean())
df_column_summary['max'] = list(df.max())
df_column_summary['sigma'] = list(df.std())

# Fraction of exact zeros per column (NaN is counted as non-zero)
df_column_summary['zero'] = [1.0 - np.count_nonzero(df.iloc[:, i]) * 1.0 / len(df)
                             for i in range(len(df.columns))]

# Fraction of missing values per column
df_column_summary['missing'] = list(1.0 - df.count() * 1.0 / len(df))

In [51]:
print(df_column_summary.to_string())

              index     type         min          mean          max       sigma      zero  missing
0                id    int16    0.000000  1.093858e+03  2158.000000  630.856268  0.000962      0.0
1         timestamp    int16    0.000000  9.456257e+02  1812.000000  519.568520  0.000438      0.0
2         derived_0  float64  -80.766290 -5.835554e-06    13.041907    0.978604  0.042647      0.0
3         derived_1  float64   -0.010143  8.672707e-10   140.194964    0.976062  0.047364      0.0
4         derived_2  float64 -151.055738 -1.131928e-06    58.641785    0.875937  0.233026      0.0
5         derived_3  float64 -336.431813 -1.020050e-06    12.150310    0.955405  0.087371      0.0
6         derived_4  float64   -9.256424  1.294702e-05    73.276079    0.873322  0.237590      0.0
7     fundamental_0  float64   -9.317354 -8.064686e-06     5.605945    0.993677  0.013998      0.0
8     fundamental_1  float64 -139.114219 -3.953050e-10     0.007603    0.776586  0.396941      0.0
9     fund

Here are some lessons from the data summary:

1. It makes sense to treat id as an object rather than an int, to avoid implying an ordering.
2. In terms of sigma (together with the other measures), one can spot strangely large numbers in many features. Worth taking a look.
3. In terms of max and min, one can spot discreteness in many features. They may be non-numerical. Worth taking a look.
4. In terms of zero values, one should be aware that zero may stand for a missing value. A high share of zeros is also associated with discrete features.
5. In terms of missing values, there is nothing to worry about yet. No feature is significantly poor.

Let's further investigate each feature using a visualization tool.
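The checks in the list above can be turned into simple filters on the summary table. This is only a sketch: the helper name `flag_columns`, the toy `summary` frame, and the 50% missing threshold are my own assumptions (the 10% zero threshold is the one I use further down).

```python
import pandas as pd

def flag_columns(summary, zero_thresh=0.10, missing_thresh=0.50):
    """Return the names of columns whose zero or missing fraction exceeds a threshold."""
    many_zeros = summary.loc[summary["zero"] > zero_thresh, "index"].tolist()
    many_missing = summary.loc[summary["missing"] > missing_thresh, "index"].tolist()
    return many_zeros, many_missing

# Toy summary table (hypothetical values) laid out like df_column_summary
summary = pd.DataFrame({
    "index":   ["f_a", "f_b", "f_c"],
    "zero":    [0.04, 0.23, 0.40],
    "missing": [0.00, 0.00, 0.60],
})
many_zeros, many_missing = flag_columns(summary)
```

Passing `df_column_summary` instead of the toy frame would give the list of features to plot, rather than reading the numbers off the printed table by eye.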

In [39]:
# Play around with an interactive visualization tool. However, due to the large amount of data, let's use a simple one instead.
#from bokeh.plotting import figure, output_notebook, show
#output_notebook()

#i = 0
#p = figure(title=df.columns[i], width=500, height=500)
#p.circle(df[df.columns[i]], df['y'], size=7, color="firebrick", alpha=0.5)
#show(p)

According to the notes above, we should look into features 3, 8, 23, 67 (large sigma).

In [40]:
#2-d array plots
#l = [3, 8, 23, 67]
#f, axarr = plt.subplots(1, 4, figsize=(4*4,3*1))
#num = 0
#for i in l:
#    axarr[num].scatter(df[df.columns[i]],df['y'])
#    axarr[num].set_title(str(i))
#    num = num + 1
                             

According to the notes above, we should look into features 70, 72, 75, 77, 78, 79, 80, 82, 83, 84, 85, 89, 94, 97, 99, 102, 103, 104, 107, 108 (discrete max/min).

In [41]:
#2-d array plots
#l = [70, 72, 75, 77, 78, 79, 80, 82, 83, 84, 85, 89, 94, 97, 99, 102, 103, 104, 107, 108]
#f, axarr = plt.subplots(5, 4, figsize=(4*4,3*5))
#num = 0
#for i in l:
#    axarr[num//4,num%4].scatter(df[df.columns[i]],df['y'])
#    axarr[num//4,num%4].set_title(str(i))
#    num = num + 1
                             

According to the notes above, we should look into features (70), (77), (78), (80), 81, (82), (83), 85, 87, (89), 94, 95, (97), (99), (102), (103), (104), (107) (# of zeros > 10 percent).

In [42]:
#2-d array plots
#l =  [81, 85, 87, 94, 95]
#f, axarr = plt.subplots(2, 4, figsize=(4*4,3*2))
#num = 0
#for i in l:
#    axarr[num//4,num%4].scatter(df[df.columns[i]],df['y'])
#    axarr[num//4,num%4].set_title(str(i))
#    num = num + 1
                             

These features do not give a strong signal, but they look fine. There are only two discrete features, 89 and 99. Since they are stored as decimal numbers, let's assume that they are ordinal features. So there is nothing to worry about for now.

Let's normalize the data so that each feature has zero mean and unit standard deviation (except id, timestamp, and output).

In [43]:
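# One may use sklearn to do this, but the trouble is that we need to deal with NA values, which is annoying. So just do it from scratch.
# (Note that sklearn's normalize rescales each row to unit norm; StandardScaler is the per-column analogue.)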
#from sklearn.preprocessing import normalize
#X = normalize(np.array(df[df.columns[range(2,110)]].dropna(axis=0)))

In [44]:
for i in range(2,110):
    mean = df_column_summary.loc[i,'mean']
    sigma = df_column_summary.loc[i,'sigma']
    df[df.columns[i]] = (df[df.columns[i]]-mean)/sigma
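The same standardization can be written without an explicit loop. This is a sketch on a toy frame, not the cell I ran; the column names `f0` and `f1` and the list `feature_cols` are made up for illustration. pandas skips NaN when computing `mean()` and `std()`, so missing entries simply stay missing.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with an id, two features, and the output
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "f0": [1.0, 2.0, np.nan, 3.0],
    "f1": [10.0, 10.0, 20.0, 20.0],
    "y":  [0.1, -0.2, 0.0, 0.3],
})
feature_cols = ["f0", "f1"]   # only these columns are standardized

# Column-wise z-score; the subtraction and division align on column labels
means = toy[feature_cols].mean()
sigmas = toy[feature_cols].std()
toy[feature_cols] = (toy[feature_cols] - means) / sigmas
```

The id and output columns are left untouched, and the single NaN in `f0` survives the transformation, which is why the NA question still has to be settled before PCA.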

Next let's look at possible dimensionality reduction for features (except id, timestamp, and output).

In [45]:
#Let's try PCA
from sklearn.decomposition import PCA

In [46]:
# Try grouping them according to feature type: derived
X = np.array(df[df.columns[range(2,7)]].dropna(axis=0))
pca_derived = PCA(n_components=5)
pca_derived.fit(X)
print(pca_derived.explained_variance_ratio_) 

[ 0.32672635  0.27849502  0.27681062  0.1114164   0.00655162]


In [47]:
# Try grouping them according to feature type: fundamental
X = np.array(df[df.columns[range(7,70)]].dropna(axis=0))
pca_fundamental = PCA(n_components=20)
pca_fundamental.fit(X)
print(pca_fundamental.explained_variance_ratio_) 

[ 0.25558325  0.13097034  0.10843365  0.06818235  0.06199728  0.05705869
  0.04216031  0.03775116  0.03570373  0.03272416  0.02860421  0.02629839
  0.02333962  0.02214361  0.01652564  0.01068863  0.00874587  0.00707701
  0.00565645  0.00433277]


In [48]:
# Try grouping them according to feature type: technical
X = np.array(df[df.columns[range(70,110)]].dropna(axis=0))
pca_technical = PCA(n_components=40)
pca_technical.fit(X)
print(pca_technical.explained_variance_ratio_) 

[ 0.144173    0.08449022  0.05374959  0.04201616  0.03940461  0.03401835
  0.02991891  0.02958015  0.02731697  0.02642108  0.02606273  0.02566239
  0.02549551  0.0254263   0.02527166  0.02518373  0.02511065  0.02498515
  0.02477049  0.02456923  0.0228427   0.02177022  0.02136387  0.01724815
  0.01452915  0.01372289  0.01339999  0.01210162  0.0116663   0.0110958
  0.01007953  0.00972421  0.00956461  0.00916052  0.00891192  0.00738879
  0.00706167  0.00642801  0.00502863  0.00328453]


Let's use a rule of thumb: say we want to capture at least 95% of the variance. For 'derived', we need 4 out of 5 components. For 'fundamental', we need 16 out of 63 components. For 'technical', the first 32 out of 40 components capture about 94.3%, just short of the threshold (the 33rd brings it to about 95.3%); I keep 32 here.
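Rather than adding up the printed ratios by hand, the count can be read off the cumulative sum. This is a small sketch; the helper name `n_components_for` is my own, and the two lists are copied from the outputs above.

```python
import numpy as np

def n_components_for(ratios, threshold=0.95):
    """Smallest k such that the first k ratios sum to at least `threshold`."""
    cumulative = np.cumsum(ratios)
    # searchsorted gives the first index whose cumulative sum is >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

# Explained-variance ratios printed above for the 'derived' group
derived = [0.32672635, 0.27849502, 0.27681062, 0.1114164, 0.00655162]

# First 20 ratios printed above for the 'fundamental' group
fundamental = [0.25558325, 0.13097034, 0.10843365, 0.06818235, 0.06199728,
               0.05705869, 0.04216031, 0.03775116, 0.03570373, 0.03272416,
               0.02860421, 0.02629839, 0.02333962, 0.02214361, 0.01652564,
               0.01068863, 0.00874587, 0.00707701, 0.00565645, 0.00433277]

k_derived = n_components_for(derived)
k_fundamental = n_components_for(fundamental)
```

This gives 4 for 'derived' and 16 for 'fundamental'. If the threshold is never reached within the supplied ratios, the helper returns one more than the number of ratios, which signals that more components need to be fitted.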

The problem with PCA is that we need to deal with NA values. If we want to continue the analysis with PCA, we may either

1. fill NA with zero (the mean, since the features are now standardized) and continue with the full number of rows, or
2. delete rows with NA and proceed.

Since there are many NAs in the data, let's try option 1 and see how it goes.
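To make the trade-off between the two options concrete, here is a toy comparison. The frame and its values are hypothetical; the columns are assumed to be standardized already, so zero is the column mean.

```python
import numpy as np
import pandas as pd

# Toy standardized features (hypothetical values) with scattered NaNs
toy = pd.DataFrame({
    "f0": [-1.0, 0.0, np.nan, 1.0],
    "f1": [np.nan, -1.0, 0.0, 1.0],
})

filled = toy.fillna(0)         # option 1: impute with the (zero) mean, keep all rows
dropped = toy.dropna(axis=0)   # option 2: keep only rows with no NaN at all

rows_kept = (len(filled), len(dropped))

# Mean imputation leaves the column mean unchanged but shrinks its spread
std_before = toy["f0"].std()
std_after = filled["f0"].std()
```

Here option 1 keeps all 4 rows while option 2 keeps only 2. The price of option 1 is that the imputed columns have a smaller standard deviation than before, so the PCA is fitted on slightly compressed features.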

In [49]:
#CLEAN
#replace NaN values with zero (mean).
df = df.fillna(value=0)

In [59]:
d5.columns

Index([u'y'], dtype='object')

In [62]:
#TRANSFORM / NEW VARIABLES
'''
#Apply PCA to the different groups of features and reattach everything
d1 = df[['id','timestamp']]
#derived
d2 = pd.DataFrame(pca_derived.transform(df[df.columns[range(2,7)]]))
d2 = d2.loc[:,0:3]
d2.columns = ['derived_pca_1','derived_pca_2','derived_pca_3','derived_pca_4']

d3 = pd.DataFrame(pca_fundamental.transform(df[df.columns[range(7,70)]]))
d3 = d3.loc[:,0:15]
d3.columns = ['fundamental_pca_' + str(i) for i in range(0,16)]

d4 = pd.DataFrame(pca_technical.transform(df[df.columns[range(70,110)]]))
d4 = d4.loc[:,0:31]
d4.columns = ['technical_pca_' + str(i) for i in range(0,32)]

d5 = df[['y']]

#axis=1 joins the blocks side by side; the default (axis=0) would stack them row-wise
df = pd.concat([d1,d2,d3,d4,d5], axis=1)
'''
#It seems that with this new data the model suffers from too much information. We need new ways to address this issue.

# Convert ID to be object, not a number
df['id'] = df['id'].astype(object)

## Stage 4: Model the data

I have prepared the data for validation as follows:

In [24]:
# Now do the train/test split by random selection. Ideally, we should do cross-validation and parameter averaging, but save that for later.
r = np.random.uniform(0,1,len(df)) # Random UNIForm numbers, one per row
train = df[ r < 0.7]
test = df[0.7 <= r]
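Because $y$ is a time series, a split by timestamp is a natural alternative to the random split: the model is then validated only on periods it has not seen. I have not run this on the real data; the helper name `split_by_time` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def split_by_time(frame, time_col="timestamp", train_frac=0.7):
    """Put the earliest share of distinct timestamps in train and the rest in test."""
    times = np.sort(frame[time_col].unique())
    n_train = int(round(len(times) * train_frac))
    cutoff = times[n_train]   # first timestamp that belongs to the test set
    return frame[frame[time_col] < cutoff], frame[frame[time_col] >= cutoff]

# Toy frame (hypothetical values): 10 timestamps with 2 rows each
toy = pd.DataFrame({
    "timestamp": np.repeat(np.arange(10), 2),
    "y": np.linspace(-1.0, 1.0, 20),
})
train_t, test_t = split_by_time(toy)
```

With `train_frac=0.7` the first 7 timestamps (14 rows) go to train and the last 3 (6 rows) to test. The helper assumes `train_frac` is below 1, otherwise there is no cutoff timestamp left to index.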

In [25]:
len(train), len(test)

(5986465, 2567315)

Note that these counts add up to 8,553,780, which is five times the 1,710,756 rows loaded earlier. That is what a row-wise `pd.concat` of five frames would produce, so this split was probably run on a frame that had been stacked rather than joined column-wise.

First, let's try something simple that accommodates non-numeric types: trees.

In [26]:
#Random forest
#from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor

X_train = train.drop('y', axis=1)
y_train = train['y']
X_test = test.drop('y', axis=1)
y_test = test['y']

In [63]:
#clf = RandomForestRegressor()
#clf = clf.fit(X_train,y_train)

#Took too much time to respond

Here four models are used: GBM (Gradient Boosting Machine), DRF (Distributed Random Forest), GLM (Generalized Linear Model), and Deep Learning.

In [12]:
def split_fit_predict(data):
    global gbm0,drf0,glm0,dl0
    # Classic Test/Train split
    r = data['Days'].runif() # Random UNIForm numbers, one per row
    train = data[ r < 0.7]
    test = data[0.7 <= r]
    print("Training data has",train.ncol,"columns and",train.nrow,"rows, test has",test.nrow,"rows")
    bike_names_x = data.names
    if "bikes" in bike_names_x: bike_names_x.remove("bikes")

    # Run GBM
    s = time.time()
    gbm0 = h2o.H2OGradientBoostingEstimator(ntrees=400, max_depth=6, learn_rate=0.1)
    gbm0.train(x=bike_names_x,y="bikes",training_frame =train,validation_frame=test)
    gbm_elapsed = time.time() - s #measure elapse time

    # Run DRF
    s = time.time()
    drf0 = h2o.H2ORandomForestEstimator(ntrees=100, max_depth=30)
    drf0.train(x=bike_names_x,y="bikes",training_frame =train,validation_frame=test)
    drf_elapsed = time.time() - s

    # Run GLM
    if "WC1" in bike_names_x: bike_names_x.remove("WC1")
    s = time.time()
    glm0 = h2o.H2OGeneralizedLinearEstimator(Lambda=[1e-5], family="poisson")
    glm0.train(x=bike_names_x,y="bikes",training_frame =train,validation_frame=test)
    glm_elapsed = time.time() - s

    # Run DL
    s = time.time()
    dl0 = h2o.H2ODeepLearningEstimator(hidden=[50,50,50,50], epochs=6)
    dl0.train(x=bike_names_x,y="bikes",training_frame =train,validation_frame=test)
    dl_elapsed = time.time() - s

    # ----------
    # Score & report
    header = ["Model", "R2 TRAIN", "R2 TEST", "Model Training Time (s)"]
    table = [
     ["GBM", gbm0.r2(train=True), gbm0.r2(valid=True),
    round(gbm_elapsed,3)],
     ["DRF", drf0.r2(train=True), drf0.r2(valid=True),
    round(drf_elapsed,3)],
     ["GLM", glm0.r2(train=True), glm0.r2(valid=True),
    round(glm_elapsed,3)],
     ["DL ", dl0 .r2(train=True), dl0 .r2(valid=True),
    round( dl_elapsed,3)],
    ]
    h2o.display.H2ODisplay(table,header)
    # --------


In [13]:
# Split the data (into test & train), fit some models and look at the results
split_fit_predict(bpd)
# Explore (in Flow) the 4 models - training time, quality of fit, tendency to overfit


('Training data has', 7, 'columns and', 97390, 'rows, test has', 41871, 'rows')

gbm Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

deeplearning Model Build Progress: [##################################################] 100%


| Model | R2 TRAIN | R2 TEST | Model Training Time (s) |
|---|---|---|---|
| GBM | 0.9608841 | 0.9258335 | 31.555 |
| DRF | 0.8794651 | 0.8876446 | 98.654 |
| GLM | 0.5402835 | 0.5430230 | 1.218 |
| DL | 0.8517503 | 0.8565373 | 36.067 |


In [14]:
# Go back to Stages 2 and 3 once we have new relevant data
# Load weather data
wthr1 = h2o.import_file(path=[path +"31081_New_York_City__Hourly_2013.csv",path +"31081_New_York_City__Hourly_2014.csv"])
# Peek at the data
wthr1.summary()
# Lots of columns in there! Let's plan on converting to time-since-epoch to do
# a 'join' with the bike data, plus gather weather info that might affect
# cyclists - rain, snow, temperature. Alas, drop the "snow" column since it's
# all NAs. Also add in dew point and humidity just in case. Slice out just
# the columns of interest and drop the rest.
wthr2 = wthr1[["Year Local","Month Local","Day Local","Hour Local","Dew Point (C)","Humidity Fraction",\
               "Precipitation One Hour (mm)","Temperature (C)","Weather Code 1/ Description"]]
wthr2.set_name("Precipitation One Hour (mm)", "Rain (mm)")
wthr2.set_name("Weather Code 1/ Description", "WC1")
wthr2.summary()
# Much better!

# Filter down to the weather at Noon: approximate the weather for the day
wthr3 = wthr2[ wthr2["Hour Local"]==12 ]
# Let's now get Days since the epoch... we'll convert year/month/day into Epoch
# time, and then back to Epoch days. Need zero-based month and days, but have
# 1-based.
wthr3["msec"] = h2o.H2OFrame.mktime(year=wthr3["Year Local"], month=wthr3["Month Local"]-1, day=wthr3["Day Local"]-1,
hour=wthr3["Hour Local"])
wthr3["Days"] = (wthr3["msec"]/secsPerDay).floor()
wthr3.summary()
# msec looks sane (numbers like 1.3e12 are in the correct range for msec since
# 1970). Epoch Days matches closely with the epoch day numbers from the
# CitiBike dataset
# Let's drop the extra time columns to make an easy-to-handle dataset.
wthr4 = wthr3.drop("Year Local").drop("Month Local").drop("Day Local").drop("Hour Local").drop("msec")
# Also, most rain numbers are missing - let's assume those are zero-rain days
rain = wthr4["Rain (mm)"]
rain[ rain.isna() ] = 0
wthr4["Rain (mm)"] = rain
print("Merge Daily Weather with Bikes-Per-Day")
bpd_with_weather = bpd.merge(wthr4,all_x=True,all_y=False)
bpd_with_weather.summary()
bpd_with_weather.dim



Parse Progress: [##################################################] 100%


Unnamed: 0,Year Local,Month Local,Day Local,Hour Local,Year UTC,Month UTC,Day UTC,Hour UTC,Cavok Reported,Cloud Ceiling (m),Cloud Cover Fraction,Cloud Cover Fraction 1,Cloud Cover Fraction 2,Cloud Cover Fraction 3,Cloud Cover Fraction 4,Cloud Cover Fraction 5,Cloud Cover Fraction 6,Cloud Height (m) 1,Cloud Height (m) 2,Cloud Height (m) 3,Cloud Height (m) 4,Cloud Height (m) 5,Cloud Height (m) 6,Dew Point (C),Humidity Fraction,Precipitation One Hour (mm),Pressure Altimeter (mbar),Pressure Sea Level (mbar),Pressure Station (mbar),Snow Depth (cm),Temperature (C),Visibility (km),Weather Code 1,Weather Code 1/ Description,Weather Code 2,Weather Code 2/ Description,Weather Code 3,Weather Code 3/ Description,Weather Code 4,Weather Code 4/ Description,Weather Code 5,Weather Code 5/ Description,Weather Code 6,Weather Code 6/ Description,Weather Code Most Severe / Icon Code,Weather Code Most Severe,Weather Code Most Severe / Description,Wind Direction (degrees),Wind Gust (m/s),Wind Speed (m/s)
type,int,int,int,int,int,int,int,int,int,real,real,real,real,real,int,int,int,real,real,real,int,int,int,real,real,real,real,int,int,int,real,real,int,enum,int,enum,int,enum,int,enum,int,enum,int,enum,int,int,enum,int,real,real
mins,2013.0,1.0,1.0,0.0,2013.0,1.0,1.0,0.0,0.0,61.0,0.0,0.0,0.25,0.5,,,,60.96,213.36,365.76,,,,-26.7,0.1251,0.0,983.2949,,,,-15.6,0.001,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,1.0,0.0,10.0,7.2,0.0
mean,2013.5,6.52602739726,15.7205479452,11.5,2013.50057078,6.52511415525,15.721347032,11.5001141553,0.0,1306.31195846,0.416742490522,0.361207349081,0.872445384073,0.963045685279,0.0,0.0,0.0,1293.9822682,1643.73900166,2084.89386376,0.0,0.0,0.0,4.31304646766,0.596736389159,1.37993010753,1017.82581441,0.0,0.0,0.0,12.5789090701,14.3914429682,4.84251968504,,3.65867689358,,2.84660766962,,2.01149425287,,4.125,,3.0,0.0,1.37848173516,4.84251968504,,194.69525682,9.42216948073,2.41032887849
maxs,2014.0,12.0,31.0,23.0,2015.0,12.0,31.0,23.0,0.0,3657.6,1.0,1.0,1.0,1.0,,,,3657.5999,3657.5999,3657.5999,,,,24.4,1.0,26.924,1042.2113,,,,36.1,16.0934,60.0,11.0,60.0,10.0,36.0,7.0,27.0,4.0,27.0,2.0,3.0,0.0,16.0,60.0,11.0,360.0,20.58,10.8
sigma,0.500014270017,3.44794972385,8.79649804852,6.92238411188,0.500584411716,3.44782405458,8.79561488868,6.92230165203,0.0,995.339856966,0.462720830993,0.42770569708,0.197155690367,0.0861015598104,-0.0,-0.0,-0.0,962.743095854,916.73861349,887.215847511,-0.0,-0.0,-0.0,10.9731282097,0.185792011866,2.56215129179,7.46451697179,-0.0,-0.0,-0.0,10.0396739531,3.69893623033,5.70486576983,,6.13386253912,,5.80553286364,,3.12340844261,,6.15223536611,,0.0,0.0,4.07386062702,5.70486576983,,106.350000031,1.81511871115,1.61469790524
zeros,0,0,0,730,0,0,0,730,17455,0,8758,8758,0,0,0,0,0,0,0,0,0,0,0,268,0,501,0,0,0,0,269,0,0,17,0,30,0,13,0,20,0,12,0,2,14980,0,17,0,0,2768
missing,0,0,0,0,0,0,0,0,65,10780,375,375,14682,16535,17520,17520,17520,9103,14683,16535,17520,17520,17520,67,67,15660,360,17520,17520,17520,67,412,14980,14980,16477,16477,17181,17181,17433,17433,17504,17504,17518,17518,0,14980,14980,9382,14381,1283
0,2013.0,1.0,1.0,0.0,2013.0,1.0,1.0,5.0,0.0,2895.6,1.0,0.9,1.0,,,,,2895.5999,3352.8,,,,,-5.0,0.5447,,1013.0917,,,,3.3,16.0934,,,,,,,,,,,,,0.0,,,,,2.57
1,2013.0,1.0,1.0,1.0,2013.0,1.0,1.0,6.0,0.0,3048.0,1.0,1.0,,,,,,3048.0,,,,,,-4.4,0.5463,,1012.0759,,,,3.9,16.0934,,,,,,,,,,,,,0.0,,,260.0,9.77,4.63
2,2013.0,1.0,1.0,2.0,2013.0,1.0,1.0,7.0,0.0,1828.8,1.0,1.0,,,,,,1828.7999,,,,,,-3.3,0.619,,1012.4145,,,,3.3,16.0934,,,,,,,,,,,,,0.0,,,,7.72,1.54


| | Year Local | Month Local | Day Local | Hour Local | Dew Point (C) | Humidity Fraction | Rain (mm) | Temperature (C) | WC1 |
|---|---|---|---|---|---|---|---|---|---|
| type | int | int | int | int | real | real | real | real | enum |
| mins | 2013.0 | 1.0 | 1.0 | 0.0 | -26.7 | 0.1251 | 0.0 | -15.6 | 0.0 |
| mean | 2013.5 | 6.52602739726 | 15.7205479452 | 11.5 | 4.31304646766 | 0.596736389159 | 1.37993010753 | 12.5789090701 | |
| maxs | 2014.0 | 12.0 | 31.0 | 23.0 | 24.4 | 1.0 | 26.924 | 36.1 | 11.0 |
| sigma | 0.500014270017 | 3.44794972385 | 8.79649804852 | 6.92238411188 | 10.9731282097 | 0.185792011866 | 2.56215129179 | 10.0396739531 | |
| zeros | 0 | 0 | 0 | 730 | 268 | 0 | 501 | 269 | 17 |
| missing | 0 | 0 | 0 | 0 | 67 | 67 | 15660 | 67 | 14980 |
| 0 | 2013.0 | 1.0 | 1.0 | 0.0 | -5.0 | 0.5447 | | 3.3 | |
| 1 | 2013.0 | 1.0 | 1.0 | 1.0 | -4.4 | 0.5463 | | 3.9 | |
| 2 | 2013.0 | 1.0 | 1.0 | 2.0 | -3.3 | 0.619 | | 3.3 | |


| | Year Local | Month Local | Day Local | Hour Local | Dew Point (C) | Humidity Fraction | Rain (mm) | Temperature (C) | WC1 | msec | Days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| type | int | int | int | int | real | real | real | real | enum | int | int |
| mins | 2013.0 | 1.0 | 1.0 | 12.0 | -26.7 | 0.1723 | 0.0 | -13.9 | 0.0 | 1.3570704e+12 | 15706.0 |
| mean | 2013.5 | 6.52602739726 | 15.7205479452 | 12.0 | 4.23012379642 | 0.539728198074 | 1.53125714286 | 14.0687757909 | | 1.3885608526e+12 | 16070.5 |
| maxs | 2014.0 | 12.0 | 31.0 | 12.0 | 23.3 | 1.0 | 12.446 | 34.4 | 10.0 | 1.420056e+12 | 16435.0 |
| sigma | 0.500342818004 | 3.45021529307 | 8.80227802701 | 0.0 | 11.1062964725 | 0.179945027923 | 2.36064248615 | 10.3989855149 | | 18219740080.4 | 210.877136425 |
| zeros | 0 | 0 | 0 | 0 | 14 | 0 | 15 | 7 | 1 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 | 3 | 3 | 660 | 3 | 620 | 0 | 0 |
| 0 | 2013.0 | 1.0 | 1.0 | 12.0 | -3.3 | 0.5934 | | 3.9 | | 1.3570704e+12 | 15706.0 |
| 1 | 2013.0 | 1.0 | 2.0 | 12.0 | -11.7 | 0.4806 | | -2.2 | | 1.3571568e+12 | 15707.0 |
| 2 | 2013.0 | 1.0 | 3.0 | 12.0 | -10.6 | 0.5248 | | -2.2 | | 1.3572432e+12 | 15708.0 |


Merge Daily Weather with Bikes-Per-Day


| | Days | start station name | mean_tripduration | mean_birth year | bikes | mean_gender | weekday | Humidity Fraction | Rain (mm) | Temperature (C) | WC1 | Dew Point (C) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| type | int | enum | real | real | int | real | int | real | real | real | enum | real |
| mins | 15887.0 | 0.0 | 62.0 | 1929.5 | 1.0 | 1.0 | 0.0 | 0.1723 | 0.0 | -13.9 | 0.0 | -26.7 |
| mean | 16099.9758008 | | 897.298678074 | 1975.99965771 | 74.7341035897 | 1.23490808342 | 3.00197470936 | 0.532494425803 | 0.0860139306769 | 15.6334205959 | | 5.47825137402 |
| maxs | 16314.0 | 339.0 | 166694.5 | 1997.0 | 680.0 | 2.0 | 6.0 | 1.0 | 8.382 | 34.4 | 10.0 | 23.3 |
| sigma | 123.635133897 | | 1358.93483261 | 3.1356602598 | 64.1243887565 | 0.107465576239 | 1.99844557028 | 0.178408938664 | 0.577304430765 | 10.9454511961 | | 11.7308194576 |
| zeros | 0 | 428 | 0 | 0 | 0 | 0 | 19858 | 0 | 131155 | 1598 | 324 | 1954 |
| missing | 0 | 0 | 0 | 64 | 0 | 64 | 0 | 981 | 0 | 981 | 119130 | 981 |
| 0 | 15887.0 | 1 Ave & E 15 St | 706.85106383 | 1976.05263158 | 47.0 | 1.15789473684 | 4.0 | 0.9354 | 4.572 | 22.8 | rain | 21.7 |
| 1 | 15887.0 | 1 Ave & E 18 St | 927.025 | 1973.81081081 | 40.0 | 1.13513513514 | 4.0 | 0.9354 | 4.572 | 22.8 | rain | 21.7 |
| 2 | 15887.0 | 1 Ave & E 30 St | 768.857142857 | 1975.28571429 | 42.0 | 1.14285714286 | 4.0 | 0.9354 | 4.572 | 22.8 | rain | 21.7 |


[139261, 12]

In [15]:
# Split the data (into test & train), fit some models and look at the results
split_fit_predict(bpd_with_weather)
# Explore (in Flow) the 4 models - training time, quality of fit, tendency to overfit

('Training data has', 12, 'columns and', 97459, 'rows, test has', 41802, 'rows')

gbm Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

glm Model Build Progress: [##################################################] 100%

deeplearning Model Build Progress: [##################################################] 100%


| Model | R2 TRAIN | R2 TEST | Model Training Time (s) |
|---|---|---|---|
| GBM | 0.9587998 | 0.9250398 | 33.013 |
| DRF | 0.8806556 | 0.8822460 | 140.557 |
| GLM | 0.7097445 | 0.7099442 | 0.67 |
| DL | 0.8679371 | 0.8658120 | 35.528 |


I have checked the individual models and combined models by looking at them in H2O Flow (go to http://localhost:54321/flow/index.html and choose Model > List All Models). I found that the best-performing model is GBM. The important features are station names, temperature, Days, and weekday.

Validation is done implicitly when we look at H2O Flow models and train-test comparison.

## Stage 5: Communicate the data

I have concluded that GBM with the additional features (weather + weekday) is the best model, with a test R2 of 0.9250398.

Here is the performance visual.

<img src="20160511_nyc-bike-GBM-performance.jpg" width="500">

Here is some output, product, ...

