## The Kaggle Rossmann Competition

This was a Kaggle competition to forecast sales at a pharmacy and drugstore chain in Europe. It ran in 2015.

The aims of this lab are:

(a) to see how to "grid-search" when we think the data is too large to use cross-validation. This contrasts with the usual approach of running a cross-validated grid search over a pipeline. We still want to use sklearn/dask pipelines as much as possible, so that every transformation fitted on the training set can be applied unchanged to the validation and test sets
(b) to understand some aspects of feature engineering that come up with continuous and categorical variables, and to see some of the new features in sklearn 0.20
(c) to capture results from validation
(d) to investigate the use of categorical "embeddings" to improve the performance of a multi-layer perceptron
(e) if time permits, to use dask to do some of this.

### Preprocessing

In [1]:
import numpy as np
import scipy.stats
import scipy.special

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib import cm
import pandas as pd
%matplotlib inline

In [2]:
from pathlib import Path

In [3]:
data = Path('./data')

We engage in some cleaning. A lot of the cleaning of this dataset has already been done for us, and some features have been created. In particular, dates have been expanded into week-of-year, day-of-week, and so on. For example, the 49th and 50th weeks of the year may have higher sales!

In [111]:
train_df = pd.read_csv(data/"train_clean.csv").drop(['index', 'PromoInterval'], axis=1)
test_df = pd.read_csv(data/"test_clean.csv").drop(['index', 'PromoInterval'], axis=1)

In [151]:
train_df['Events'] = train_df['Events'].fillna('None')
test_df['Events'] = test_df['Events'].fillna('None')

We also log-transform the dependent variable, because it is long-tailed.

In [174]:
train_resp = np.log(train_df['Sales'].copy())
train_df = train_df.drop('Sales', axis=1)
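A minimal sketch of the round trip, on made-up numbers: a model trained on log sales predicts log sales, so predictions need `np.exp` before they can be compared with actual sales. The values and the names `sales`, `log_sales`, and `pred_log` are illustrative only and do not come from the dataset.

```python
import numpy as np

# Hypothetical sales figures, strictly positive so that np.log is defined.
sales = np.array([5263.0, 6064.0, 8314.0, 13995.0, 4822.0])

log_sales = np.log(sales)        # what the model is trained on
recovered = np.exp(log_sales)    # back to the original scale

# A small error on the log scale is approximately a relative error in sales.
pred_log = log_sales + 0.05      # pretend the model is off by 0.05 in log units
rel_err = np.exp(pred_log) / sales - 1.0
print(rel_err)
```

Because a constant offset in log units becomes the same relative error for every store, squared error on the log scale penalizes large and small stores more evenly than squared error on raw sales would.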

Let's get some idea about our dataset.

In [175]:
train_df.head()

Unnamed: 0,Store,DayOfWeek,Date,Customers,Open,Promo,StateHoliday,SchoolHoliday,Year,Month,...,AfterStateHoliday,BeforeStateHoliday,AfterPromo,BeforePromo,SchoolHoliday_bw,StateHoliday_bw,Promo_bw,SchoolHoliday_fw,StateHoliday_fw,Promo_fw
0,1,5,2015-07-31,555,1,1,False,1,2015,7,...,57,0,0,0,5.0,0.0,5.0,7.0,0.0,5.0
1,2,5,2015-07-31,625,1,1,False,1,2015,7,...,67,0,0,0,5.0,0.0,5.0,1.0,0.0,1.0
2,3,5,2015-07-31,821,1,1,False,1,2015,7,...,57,0,0,0,5.0,0.0,5.0,5.0,0.0,5.0
3,4,5,2015-07-31,1498,1,1,False,1,2015,7,...,67,0,0,0,5.0,0.0,5.0,1.0,0.0,1.0
4,5,5,2015-07-31,559,1,1,False,1,2015,7,...,57,0,0,0,5.0,0.0,5.0,1.0,0.0,1.0


In [176]:
train_df.shape, test_df.shape

((844338, 90), (41088, 90))

In [177]:
train_df.Date # latest date first

0         2015-07-31
1         2015-07-31
2         2015-07-31
3         2015-07-31
4         2015-07-31
5         2015-07-31
6         2015-07-31
7         2015-07-31
8         2015-07-31
9         2015-07-31
10        2015-07-31
11        2015-07-31
12        2015-07-31
13        2015-07-31
14        2015-07-31
15        2015-07-31
16        2015-07-31
17        2015-07-31
18        2015-07-31
19        2015-07-31
20        2015-07-31
21        2015-07-31
22        2015-07-31
23        2015-07-31
24        2015-07-31
25        2015-07-31
26        2015-07-31
27        2015-07-31
28        2015-07-31
29        2015-07-31
             ...    
844308    2013-01-02
844309    2013-01-02
844310    2013-01-02
844311    2013-01-02
844312    2013-01-02
844313    2013-01-02
844314    2013-01-02
844315    2013-01-02
844316    2013-01-02
844317    2013-01-02
844318    2013-01-02
844319    2013-01-02
844320    2013-01-02
844321    2013-01-01
844322    2013-01-01
844323    2013-01-01
844324    201

In [224]:
train_df.columns

Index(['Store', 'DayOfWeek', 'Date', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day',
       'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start',
       'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start',
       'Elapsed', 'StoreType', 'Assortment', 'CompetitionDistance',
       'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
       'Promo2SinceWeek', 'Promo2SinceYear', 'State', 'file', 'week', 'trend',
       'file_DE', 'week_DE', 'trend_DE', 'Date_DE', 'State_DE', 'Month_DE',
       'Day_DE', 'Dayofweek_DE', 'Dayofyear_DE', 'Is_month_end_DE',
       'Is_month_start_DE', 'Is_quarter_end_DE', 'Is_quarter_start_DE',
       'Is_year_end_DE', 'Is_year_start_DE', 'Elapsed_DE', 'Max_TemperatureC',
       'Mean_TemperatureC', 'Min_TemperatureC', 'Dew_PointC', 'MeanDew_PointC',
       'Min_DewpointC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity',
       'Max_Sea_Level_PressurehPa', 'Mean_Sea_Le

### Types of variables and cardinality

We make a note of which variables are categorical and which are not. This is a choice. If cardinality is not too high, binning or categorizing can be beneficial. Often this will be true for integer valued variables.

In [351]:
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
    'Promo2Weeks', 'StoreType', 'Assortment', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw', 'Promo', 'SchoolHoliday']

cont_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
   'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 
   'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
   'AfterStateHoliday', 'BeforeStateHoliday']

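We look for missing data in the continuous variables and store the names of the columns where it occurs. Note that `trdf` is the training split, which is defined in the "Creating a validation set" section below.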

In [354]:
nacols=[]
for v in cont_vars:
    if np.sum(trdf[v].isnull()) > 0:
        nacols.append(v)
        print(v, np.sum(trdf[v].isnull()))

CompetitionDistance 2075
CloudCover 64187


We also look at some cardinalities. Since none is below 10, we don't engage in binning.

In [355]:
for k in cont_vars:
    print(k, trdf[k].unique().shape[0])
    if trdf[k].unique().shape[0] < 10:
        print(trdf[k].unique())

CompetitionDistance 655
Max_TemperatureC 49
Mean_TemperatureC 44
Min_TemperatureC 40
Max_Humidity 50
Mean_Humidity 70
Min_Humidity 93
Max_Wind_SpeedKm_h 42
Mean_Wind_SpeedKm_h 27
CloudCover 10
trend 67
trend_DE 36
AfterStateHoliday 136
BeforeStateHoliday 147
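Both checks above can also be written without explicit loops. The sketch below uses a small toy frame in place of `trdf`; the frame `toy` and the name `nacols_toy` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for trdf; the values are made up.
toy = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 570.0, 1270.0],
    "CloudCover": [1.0, 4.0, np.nan, np.nan],
    "trend": [85, 85, 80, 77],
})

# Missing values per column, keeping only the columns that have any.
na_counts = toy.isnull().sum()
na_counts = na_counts[na_counts > 0]
nacols_toy = list(na_counts.index)

# Cardinality per column. dropna=False counts NaN as a level,
# which matches what Series.unique() does in the loop above.
cardinality = toy.nunique(dropna=False)
print(na_counts, cardinality, sep="\n")
```

The `dropna=False` argument matters when comparing against the loop: `nunique()` ignores missing values by default, so it would report one level fewer for any column that contains `NaN`.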


We take a similar look at the categorical variables. Some of these have many levels. Is there really that much information in 1115 store labels? Can we get some compression to increase our signal-to-noise ratio?

In [399]:
for k in cat_vars:
    print(k, trdf[k].unique().shape[0])
    if trdf[k].unique().shape[0] < 50:
        print(trdf[k].unique())

Store 1115
DayOfWeek 7
[5 4 3 2 1 7 6]
Year 3
[2015 2014 2013]
Month 12
[ 6  5  4  3  2  1 12 11 10  9  8  7]
Day 31
[19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1 31 30 29 28 27
 26 25 24 23 22 21 20]
StateHoliday 2
[False  True]
CompetitionMonthsOpen 25
[ 0 24  2 18  8 15 16  6 23 14 21 10 12  1 22 11  3  9 13 19  7 17  5 20
  4]
Promo2Weeks 26
[25  0 19 11  2  7 18 10  1  6 17  9  5 16  8  4 15 24  3 14 23 13 22 12
 21 20]
StoreType 4
['d' 'c' 'a' 'b']
Assortment 3
['c' 'a' 'b']
CompetitionOpenSinceYear 23
[1900 2008 2007 2006 2009 2015 2013 2014 2000 2011 2010 2005 1999 2003
 2012 2004 2002 1961 1995 2001 1990 1994 1998]
Promo2SinceYear 8
[2012 1900 2010 2011 2009 2014 2015 2013]
State 12
['HE' 'TH' 'NW' 'BE' 'SN' 'SH' 'HB,NI' 'BY' 'BW' 'RP' 'ST' 'HH']
Week 52
Events 22
['Rain' 'None' 'Fog' 'Fog-Rain' 'Rain-Thunderstorm'
 'Fog-Rain-Thunderstorm' 'Thunderstorm' 'Rain-Hail' 'Fog-Thunderstorm'
 'Rain-Hail-Thunderstorm' 'Rain-Snow' 'Fog-Rain-Hail-Thunderstorm' 'Snow'
 'Rain-

### Creating a validation set

The construction of a validation or "development" set is not always a `train_test_split` deal. Here we create a validation set from the "latest" data, corresponding in date range and size to what we have in the test set. Hopefully this gives us similar distributions of features and outcomes in both.

In [180]:
cut = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()
cut

41395

In [181]:
valid_idx = range(cut)
train_idx = list(np.setdiff1d(range(train_df.shape[0]), valid_idx))

In [182]:
trdf = train_df.iloc[train_idx]
vadf = train_df.iloc[valid_idx]

In [183]:
trdf.shape, vadf.shape

((802943, 90), (41395, 90))
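An alternative way to express a split by date is a boolean mask on a cutoff date, which avoids positional bookkeeping and never divides a single day between the two sets. This is a sketch on toy data, not the code used above; the `cutoff` value and the frame `toy` are hypothetical.

```python
import pandas as pd

# Toy data sorted latest-first, like train_df. Values are made up.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-07-31", "2015-07-31", "2015-07-30",
        "2015-07-29", "2015-07-28", "2015-07-28",
    ]),
    "Customers": [555, 625, 821, 1498, 559, 600],
})

# Hold out everything on or after the cutoff date as the validation set.
cutoff = pd.Timestamp("2015-07-30")
is_valid = toy["Date"] >= cutoff

valid_toy = toy[is_valid]
train_toy = toy[~is_valid]
print(train_toy.shape, valid_toy.shape)
```

With this form, every training row is strictly older than every validation row, which mirrors how the model will be used on the test set.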

### Transformation Pipelines

Ok, now we'll use the new `ColumnTransformer`, with imputation, missing-data indicators, the new `OrdinalEncoder`, and the usual Standard Scaling.

In [356]:
from sklearn.impute import SimpleImputer,MissingIndicator
from sklearn.pipeline import make_pipeline, make_union, Pipeline

In [373]:
impu = SimpleImputer(strategy="median") # create a median imputer

We build the missing-value indicator separately, because it creates new columns. It is possible to do this inside the `sklearn` pipeline flow using a union, but the subsequent scaling step would then also scale the indicators, since the list of categorical columns does not include the new columns.

In [371]:
mi = MissingIndicator() # create, fit, and transform a missingness indicator
mi.fit(trdf[nacols])
Xtrmi = mi.transform(trdf[nacols])
Xvami = mi.transform(vadf[nacols])

In [372]:
Xtrmi[4460,:]

array([False,  True])
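For comparison, here is one possible way to keep the indicators inside the transformer flow while leaving them unscaled: give `MissingIndicator` its own entry in a `ColumnTransformer`, next to the impute-and-scale pipeline. This is a sketch on toy data and not what this notebook does; the names `toy`, `cont_pipe_toy`, and `ct_toy` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with one missing value in each column (values made up).
toy = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 570.0, 14130.0],
    "CloudCover": [1.0, 4.0, np.nan, 6.0],
})
cols = ["CompetitionDistance", "CloudCover"]

cont_pipe_toy = Pipeline([("imp", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())])

ct_toy = ColumnTransformer([
    ("miss", MissingIndicator(), cols),  # boolean flags, left unscaled
    ("cont", cont_pipe_toy, cols),       # imputed and standardized values
])

X_toy = ct_toy.fit_transform(toy)
print(X_toy)
```

The output columns follow the order of the transformer list, so the two indicator columns come first here, followed by the two scaled columns.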

In [374]:
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
ss = StandardScaler()
oe = OrdinalEncoder()

In [375]:
trdf_cat = trdf[cat_vars]
trdf_cont = trdf[cont_vars]

We construct two pipelines, one for categoricals and one for continuous variables

In [376]:
cont_pipe = Pipeline([("imp",impu), ("scale", ss)])


In [377]:
cat_pipe = Pipeline([("categorify", oe)])
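To make the "categorify" step concrete, the sketch below shows what `OrdinalEncoder` does to a toy array: each column's levels are sorted, and every value is replaced by its position in that sorted list. The array `X_cat` is illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical columns; the values are made up.
X_cat = np.array([["d", "c"], ["a", "a"], ["c", "b"], ["a", "c"]], dtype=object)

enc = OrdinalEncoder()
codes = enc.fit_transform(X_cat)

print(enc.categories_)  # sorted levels found in each column
print(codes)            # each level replaced by its position in that list
```

The integer codes carry no meaningful order; they are simply labels that tree models can split on and that an embedding layer can use as row indices. By default, a level that was not seen during `fit` raises an error at `transform` time, which is worth keeping in mind when the validation set is drawn from a later period than the training set.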

And combine them here in a transformer list.

In [378]:
transformers = [('cat', cat_pipe, cat_vars),
                    ('cont', cont_pipe, cont_vars)]

Now we use a `ColumnTransformer` to combine these.

In [379]:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers=transformers)

In [380]:
ct.fit(trdf)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('cat', Pipeline(memory=None,
     steps=[('categorify', OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>))]), ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen', 'Promo2Weeks', 'StoreType', 'Assortment', 'CompetitionOpenSinceYear', 'P...ean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE', 'AfterStateHoliday', 'BeforeStateHoliday'])])

In [381]:
Xtr = ct.transform(trdf)
Xval = ct.transform(vadf)

In [384]:
Xtr.shape, Xtrmi.shape

((802943, 37), (802943, 2))

We concatenate the missing-value indicators back in. The transformer outputs all the categoricals first, since that is the first item in `transformers`, so we prepend the indicators to keep them next to the categoricals.

In [400]:
Xtrain = np.concatenate([Xtrmi, Xtr], axis=1)
Xtrain.shape

(802943, 39)

In [401]:
Xvalid = np.concatenate([Xvami, Xval], axis=1)
Xvalid.shape

(41395, 39)

sklearn pipelines lose our nice pandas column names, so we bring them back.

In [436]:
cols = trdf.columns
actcols = []
actcolcount = 0
nacols_cat = []
for k in nacols:
    actcols.append((k+'_missing', 'cont'))
    nacols_cat.append(k+'_missing')
    actcolcount+=1
for k in cat_vars+cont_vars:
    if k in cat_vars:
        actcols.append((k, "cat"))
        actcolcount+=1
    if k in cont_vars:
        actcols.append((k, "cont"))
        actcolcount+=1
        
list(enumerate(actcols)), actcolcount

([(0, ('CompetitionDistance_missing', 'cont')),
  (1, ('CloudCover_missing', 'cont')),
  (2, ('Store', 'cat')),
  (3, ('DayOfWeek', 'cat')),
  (4, ('Year', 'cat')),
  (5, ('Month', 'cat')),
  (6, ('Day', 'cat')),
  (7, ('StateHoliday', 'cat')),
  (8, ('CompetitionMonthsOpen', 'cat')),
  (9, ('Promo2Weeks', 'cat')),
  (10, ('StoreType', 'cat')),
  (11, ('Assortment', 'cat')),
  (12, ('CompetitionOpenSinceYear', 'cat')),
  (13, ('Promo2SinceYear', 'cat')),
  (14, ('State', 'cat')),
  (15, ('Week', 'cat')),
  (16, ('Events', 'cat')),
  (17, ('Promo_fw', 'cat')),
  (18, ('Promo_bw', 'cat')),
  (19, ('StateHoliday_fw', 'cat')),
  (20, ('StateHoliday_bw', 'cat')),
  (21, ('SchoolHoliday_fw', 'cat')),
  (22, ('SchoolHoliday_bw', 'cat')),
  (23, ('Promo', 'cat')),
  (24, ('SchoolHoliday', 'cat')),
  (25, ('CompetitionDistance', 'cont')),
  (26, ('Max_TemperatureC', 'cont')),
  (27, ('Mean_TemperatureC', 'cont')),
  (28, ('Min_TemperatureC', 'cont')),
  (29, ('Max_Humidity', 'cont')),
  (30, ('

### Time to learn

We first split the y (the log of the y, really)

In [390]:
ytrain = train_resp[train_idx]
yvalid = train_resp[list(valid_idx)]
ytrain.shape, yvalid.shape

((802943,), (41395,))

and import what we need for gradient boosting.

In [430]:
from sklearn.metrics import mean_squared_error

In [431]:
from sklearn.ensemble import GradientBoostingRegressor

Peter Prettenhofer, who wrote sklearn's GBRT implementation, writes in his PyData 2014 talk (worth watching!):

> Hyperparameter tuning: I usually follow this recipe to tune the hyperparameters:
>
> - Pick `n_estimators` as large as (computationally) possible (e.g. 3000)
> - Tune `max_depth`, `learning_rate`, `min_samples_leaf`, and `max_features` via grid search
> - A lower `learning_rate` requires a higher number of `n_estimators`. Thus increase `n_estimators` even more and tune `learning_rate` again, holding the other parameters fixed
>
> This last point is a trade-off between number of iterations or runtime against accuracy. And keep in mind that it might lead to overfitting.

Let me add, however, that weak learners do rather well, so you might not want to cross-validate `max_depth` at all. `min_samples_leaf` is not independent of `max_depth` either, so if you do cross-validate, you might search over just one of the two.
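One way to reduce the cost of choosing `n_estimators` is to fit once with a large value and then look at the validation error after each added tree, using `staged_predict`. This is a suggestion added here, not part of the quoted recipe, and the sketch runs on synthetic data; all names and values below are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression problem with a fixed train/validation split.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(600, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=600)
X_tr, y_tr, X_va, y_va = X[:400], y[:400], X[400:], y[400:]

gb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                               max_depth=2, random_state=0)
gb.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
val_mse = np.array([mean_squared_error(y_va, p)
                    for p in gb.staged_predict(X_va)])
best_n = int(val_mse.argmin()) + 1
print(best_n, val_mse[best_n - 1])
```

A single fit then gives the whole validation curve over the number of trees, so `n_estimators` does not need its own axis in the grid.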

We use `ParameterGrid` here to construct the entire grid for us! We put the output in a list of dictionaries and then save it in a dataframe. We might want to persist such dataframes to disk.

In [432]:
param_grid = {'learning_rate': [0.1, 0.01],
              'max_depth': [1,2, 3],
              'max_features': [0.2, 0.6]
              }

In [433]:
from sklearn.model_selection import ParameterGrid

In [434]:
ds=[]
for p in ParameterGrid(param_grid):
    print(p)
    gb = GradientBoostingRegressor(n_estimators=200)
    gb.set_params(**p)
    gb.fit(Xtrain, ytrain)
    ypred = gb.predict(Xvalid)
    ypredtrain = gb.predict(Xtrain)
    d = p.copy()
    d['n_estimators']=200
    d['mse'] = mean_squared_error(ypred, yvalid)
    d['msetr'] = mean_squared_error(ypredtrain, ytrain)
    print("MSE", d['mse'], d['msetr'])
    ds.append(d)
ds

{'learning_rate': 0.1, 'max_depth': 1, 'max_features': 0.2}
MSE 0.11952736672561466 0.12841955343258007
{'learning_rate': 0.1, 'max_depth': 1, 'max_features': 0.6}
MSE 0.11949081835876278 0.12800020775942245
{'learning_rate': 0.1, 'max_depth': 2, 'max_features': 0.2}
MSE 0.11005260932751748 0.11449586709006311
{'learning_rate': 0.1, 'max_depth': 2, 'max_features': 0.6}
MSE 0.10579365894338559 0.11001812163105405
{'learning_rate': 0.1, 'max_depth': 3, 'max_features': 0.2}
MSE 0.09878988052062682 0.10056851339771275
{'learning_rate': 0.1, 'max_depth': 3, 'max_features': 0.6}
MSE 0.08781969095003936 0.08837837714343463
{'learning_rate': 0.01, 'max_depth': 1, 'max_features': 0.2}
MSE 0.14663437176401278 0.1567381426802156
{'learning_rate': 0.01, 'max_depth': 1, 'max_features': 0.6}
MSE 0.13861640989094157 0.1504924776928524
{'learning_rate': 0.01, 'max_depth': 2, 'max_features': 0.2}
MSE 0.13690674397698613 0.14609089983139759
{'learning_rate': 0.01, 'max_depth': 2, 'max_features': 0.6}
MS

[{'learning_rate': 0.1,
  'max_depth': 1,
  'max_features': 0.2,
  'mse': 0.11952736672561466,
  'msetr': 0.12841955343258007,
  'n_estimators': 200},
 {'learning_rate': 0.1,
  'max_depth': 1,
  'max_features': 0.6,
  'mse': 0.11949081835876278,
  'msetr': 0.12800020775942245,
  'n_estimators': 200},
 {'learning_rate': 0.1,
  'max_depth': 2,
  'max_features': 0.2,
  'mse': 0.11005260932751748,
  'msetr': 0.11449586709006311,
  'n_estimators': 200},
 {'learning_rate': 0.1,
  'max_depth': 2,
  'max_features': 0.6,
  'mse': 0.10579365894338559,
  'msetr': 0.11001812163105405,
  'n_estimators': 200},
 {'learning_rate': 0.1,
  'max_depth': 3,
  'max_features': 0.2,
  'mse': 0.09878988052062682,
  'msetr': 0.10056851339771275,
  'n_estimators': 200},
 {'learning_rate': 0.1,
  'max_depth': 3,
  'max_features': 0.6,
  'mse': 0.08781969095003936,
  'msetr': 0.08837837714343463,
  'n_estimators': 200},
 {'learning_rate': 0.01,
  'max_depth': 1,
  'max_features': 0.2,
  'mse': 0.14663437176401278

In [435]:
dsdf = pd.DataFrame.from_records(ds)
dsdf.sort_values('mse')

Unnamed: 0,learning_rate,max_depth,max_features,mse,msetr,n_estimators
5,0.1,3,0.6,0.08782,0.088378,200
4,0.1,3,0.2,0.09879,0.100569,200
3,0.1,2,0.6,0.105794,0.110018,200
2,0.1,2,0.2,0.110053,0.114496,200
1,0.1,1,0.6,0.119491,0.128,200
0,0.1,1,0.2,0.119527,0.12842,200
11,0.01,3,0.6,0.123093,0.132434,200
10,0.01,3,0.2,0.128312,0.137077,200
9,0.01,2,0.6,0.130502,0.140609,200
8,0.01,2,0.2,0.136907,0.146091,200
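Since each grid run is expensive, it can be worth writing the results table to disk as soon as it exists. The sketch below shows a CSV round trip; the file name, the temporary directory, and the numbers in `results` are made up for illustration and are not the values obtained above.

```python
import os
import tempfile
import pandas as pd

# Hypothetical results in the same shape as `ds` above (values made up).
results = [
    {"learning_rate": 0.1, "max_depth": 3, "mse": 0.09, "msetr": 0.08},
    {"learning_rate": 0.01, "max_depth": 3, "mse": 0.12, "msetr": 0.13},
]
results_df = pd.DataFrame.from_records(results)

# Write to disk and read back; the file name is an assumption.
path = os.path.join(tempfile.mkdtemp(), "gbm_grid_results.csv")
results_df.to_csv(path, index=False)
reloaded = pd.read_csv(path)

# Pick the row with the lowest validation error.
best = reloaded.sort_values("mse").iloc[0]
print(best["learning_rate"], best["max_depth"])
```

Passing `index=False` avoids the extra unnamed index column that otherwise appears when the file is read back.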


### A Multi-Layer Perceptron Model

This is based on the third-place entry, whose authors wrote a [paper](https://arxiv.org/pdf/1604.06737.pdf) afterwards.

What we are first going to do is reduce the dimensionality of our categoricals by using **embeddings**. This technique is often used in recommender systems (via matrix factorization) and in NLP models such as `word2vec`. The idea is to map each level of a categorical variable to a vector in a smaller latent space.
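As a minimal sketch of the idea, an embedding is just a table with one row per category level, and "embedding" a value means looking up its row. This is equivalent to one-hot encoding followed by a matrix multiplication. The weights below are random stand-ins; in a real model they are learned along with the rest of the network.

```python
import numpy as np

rng = np.random.RandomState(0)
n_levels, embed_dim = 7, 4              # e.g. seven levels mapped to 4 dimensions

# One row per category level. Random here, learned in a real model.
embedding_table = rng.normal(size=(n_levels, embed_dim))

codes = np.array([4, 3, 0, 6])          # ordinal-encoded category values

looked_up = embedding_table[codes]      # row lookup, shape (4, 4)

# Equivalent view: one-hot encode, then multiply by the table.
one_hot = np.eye(n_levels)[codes]
via_matmul = one_hot @ embedding_table
print(np.allclose(looked_up, via_matmul))
```

The lookup form is what makes embeddings cheap for high-cardinality variables: the one-hot matrix is never materialized.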

Here we divide the cardinality by 2 and add 1 to get the embedding size (this is a heuristic). If the cardinality is high, we clamp the size of the latent space at 50.

In [404]:
cards={}
for k in nacols_cat:
    cards[k] = (2,2)
for k in cat_vars :
    embed_sz_base = trdf[k].unique().size//2 + 1
    embed_sz = (embed_sz_base <=50)*embed_sz_base + 50*((embed_sz_base > 50))
    cards[k] = (trdf[k].unique().size, embed_sz)
cards

{'Assortment': (3, 2),
 'CloudCover_missing': (2, 2),
 'CompetitionDistance_missing': (2, 2),
 'CompetitionMonthsOpen': (25, 13),
 'CompetitionOpenSinceYear': (23, 12),
 'Day': (31, 16),
 'DayOfWeek': (7, 4),
 'Events': (22, 12),
 'Month': (12, 7),
 'Promo': (2, 2),
 'Promo2SinceYear': (8, 5),
 'Promo2Weeks': (26, 14),
 'Promo_bw': (6, 4),
 'Promo_fw': (6, 4),
 'SchoolHoliday': (2, 2),
 'SchoolHoliday_bw': (8, 5),
 'SchoolHoliday_fw': (8, 5),
 'State': (12, 7),
 'StateHoliday': (2, 2),
 'StateHoliday_bw': (8, 5),
 'StateHoliday_fw': (8, 5),
 'Store': (1115, 50),
 'StoreType': (4, 3),
 'Week': (52, 27),
 'Year': (3, 2)}
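The same heuristic can be written as a small function, which makes the clamp explicit. The function name `embedding_size` and the `cap` parameter are introduced here for illustration.

```python
def embedding_size(cardinality, cap=50):
    """Half the cardinality plus one, clamped at `cap` (a heuristic)."""
    return min(cardinality // 2 + 1, cap)

# Cardinalities taken from the output above.
for name, card in [("Store", 1115), ("Week", 52), ("DayOfWeek", 7), ("Year", 3)]:
    print(name, (card, embedding_size(card)))
```

Using `min` gives the same result as the arithmetic with boolean masks in the cell above, including the boundary case where the base size is exactly 50.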

We have to be careful (very book-keepy) in constructing a model in Keras. We use the Keras Functional API as opposed to the Sequential API.

In [426]:
from keras.models import Model as KerasModel
from keras.layers import Input, Dense, Activation, Reshape
from keras.layers import Concatenate, Embedding

def build_keras_model():
    input_cat = []
    output_embeddings = []
    for k in nacols_cat+cat_vars:
        print('{}_embedding'.format(k))
        input_1d = Input(shape=(1,))
        output_1d = Embedding(cards[k][0], cards[k][1], name='{}_embedding'.format(k))(input_1d)
        output = Reshape(target_shape=(cards[k][1],))(output_1d)
        input_cat.append(input_1d)
        output_embeddings.append(output)

    main_input = Input(shape=(len(cont_vars),), name='main_input')
    output_model = Concatenate()([main_input, *output_embeddings])
    output_model = Dense(1000, kernel_initializer="uniform")(output_model)
    output_model = Activation('relu')(output_model)
    output_model = Dense(500, kernel_initializer="uniform")(output_model)
    output_model = Activation('relu')(output_model)
    output_model = Dense(1)(output_model)
    # No final activation: the network regresses log sales directly.
    kmodel = KerasModel(inputs=[*input_cat, main_input], outputs=output_model)
    kmodel.compile(loss='mean_squared_error', optimizer='adam')
    return kmodel

def fitmodel(kmodel, Xtr, ytr, Xval, yval, epochs, bs):
    h = kmodel.fit(Xtr, ytr, validation_data=(Xval, yval),
                       epochs=epochs, batch_size=bs)
    return h

The data input needs to match our construction:

In [427]:
list_cat_trains=[]
list_cat_valids=[]
catlen=len(nacols_cat+cat_vars)
for i in range(catlen):
    list_cat_trains.append(Xtrain[:,i])
    list_cat_valids.append(Xvalid[:,i])
cont_train=Xtrain[:,catlen:]
cont_valid=Xvalid[:,catlen:]

In [428]:
cont_train.shape

(802943, 14)

Now we run (only a little bit for now!)

In [429]:
emodel = build_keras_model()
history = fitmodel(emodel, [*list_cat_trains, cont_train], ytrain, [*list_cat_valids, cont_valid], yvalid, 2, 256)

CompetitionDistance_missing_embedding
CloudCover_missing_embedding
Store_embedding
DayOfWeek_embedding
Year_embedding
Month_embedding
Day_embedding
StateHoliday_embedding
CompetitionMonthsOpen_embedding
Promo2Weeks_embedding
StoreType_embedding
Assortment_embedding
CompetitionOpenSinceYear_embedding
Promo2SinceYear_embedding
State_embedding
Week_embedding
Events_embedding
Promo_fw_embedding
Promo_bw_embedding
StateHoliday_fw_embedding
StateHoliday_bw_embedding
SchoolHoliday_fw_embedding
SchoolHoliday_bw_embedding
Promo_embedding
SchoolHoliday_embedding
Train on 802943 samples, validate on 41395 samples
Epoch 1/2
Epoch 2/2


### Homework

Let's do the GBM on dask. And later, at your leisure, you'll want to `ParameterGrid` Keras.