## Allstate week 3

This week we will learn how to:

* tune LightGBM
* create Neural Networks with Keras (Theano or Tensorflow backend)
* tune Neural Networks
* create a simple ensemble of XGBoost, LightGBM and Neural Networks


In [None]:
import xgboost as xgb
import pandas as pd
from sklearn import preprocessing, pipeline, metrics, grid_search, cross_validation
import time
import random
import numpy as np
import matplotlib.pyplot as plt
import math
from sklearn.metrics import mean_absolute_error

from scipy import sparse
%matplotlib inline

In [None]:
def logregobj(labels, preds):
    con = 2
    x =preds-labels
    grad =con*x / (np.abs(x)+con)
    hess =con**2 / (np.abs(x)+con)**2
    return grad, hess 

def log_mae(labels,preds,lift=200):
    return mean_absolute_error(np.exp(labels)-lift, np.exp(preds)-lift)

log_mae_scorer = metrics.make_scorer(log_mae, greater_is_better = False)

def search_model(train_x, train_y, est, param_grid, n_jobs, cv, refit=False):
##Grid Search for the best model
    model = grid_search.GridSearchCV(estimator  = est,
                                     param_grid = param_grid,
                                     scoring    = log_mae_scorer,
                                     verbose    = 10,
                                     n_jobs  = n_jobs,
                                     iid        = True,
                                     refit    = refit,
                                     cv      = cv)
    # Fit Grid Search Model
    model.fit(train_x, train_y)
    print("Best score: %0.3f" % model.best_score_)
    print("Best parameters set:", model.best_params_)
    print("Scores:", model.grid_scores_)
    return model



def xg_eval_mae(yhat, dtrain, lift=200):
    y = dtrain.get_label()
    return 'mae', mean_absolute_error(np.exp(y)-lift, np.exp(yhat)-lift)

def xgb_logregobj(preds, dtrain):
    con = 2
    labels = dtrain.get_label()
    x =preds-labels
    grad =con*x / (np.abs(x)+con)
    hess =con**2 / (np.abs(x)+con)**2
    return grad, hess


def search_model_mae (train_x, train_y, est, param_grid, n_jobs, cv, refit=False):
##Grid Search for the best model
    model = grid_search.GridSearchCV(estimator  = est,
                                     param_grid = param_grid,
                                     scoring    = 'neg_mean_absolute_error',
                                     verbose    = 10,
                                     n_jobs  = n_jobs,
                                     iid        = True,
                                     refit    = refit,
                                     cv      = cv)
    # Fit Grid Search Model
    model.fit(train_x, train_y)
    print("Best score: %0.3f" % model.best_score_)
    print("Best parameters set:", model.best_params_)
    print("Scores:", model.grid_scores_)
    return model

## Load Data

In [None]:
# Load data
start = time.time() 
train_data = pd.read_csv('../input/train.csv')
train_size=train_data.shape[0]
print ("Loading train data finished in %0.3fs" % (time.time() - start))        

test_data = pd.read_csv('../input/test.csv')
print ("Loading test data finished in %0.3fs" % (time.time() - start))        

In [None]:
train_data.head(5)

## Merge train and test

This saves us from duplicating logic for train and test, and also ensures that the same transformations are applied to both.

In [None]:
full_data=pd.concat([train_data
                       ,test_data])
del( train_data, test_data)
print ("Full Data set created.")
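
As a minimal sketch of why merging is convenient, the toy example below (made-up values, not the competition data) concatenates a small train and test frame, applies one transformation to both at once, and then splits them back by position. The `ignore_index=True` argument and the `cat1_code` column are assumptions added for illustration only.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (made-up values)
train = pd.DataFrame({"id": [1, 2, 3], "cat1": ["A", "B", "A"], "loss": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"id": [4, 5], "cat1": ["B", "C"]})

train_size = train.shape[0]
full = pd.concat([train, test], ignore_index=True)  # test rows get NaN in 'loss'

# Any transformation applied to `full` is applied identically to both parts.
full["cat1_code"] = full["cat1"].astype("category").cat.codes

# Split back by position: the first train_size rows are the training set.
train_part = full.iloc[:train_size]
test_part = full.iloc[train_size:]
```

Note that level `C` only appears in the test rows; encoding on the merged frame still gives it a consistent code.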

## Group features

In this step we will group the features into different groups so we can preprocess them separately afterward.

In [None]:
data_types = full_data.dtypes  
cat_cols = list(data_types[data_types=='object'].index)
num_cols = list(data_types[data_types=='int64'].index) + list(data_types[data_types=='float64'].index)

id_col = 'id'
target_col = 'loss'
num_cols.remove('id')
num_cols.remove('loss')

print ("Categorical features:", cat_cols)
print ( "Numeric features:", num_cols)
print ( "ID: %s, target: %s" %( id_col, target_col))

## Categorical features 
### 1. Label Encoding (Factorizing)

In [None]:
LBL = preprocessing.LabelEncoder()
start=time.time()
for cat_col in cat_cols:
#     print ("Factorize feature %s" % (cat))
    full_data[cat_col] = LBL.fit_transform(full_data[cat_col])
print ('Label encoding finished in %f seconds' % (time.time()-start))


### 2. One Hot Encoding (get dummies)

OHE can be done by either Pandas' get_dummies() or SK Learn's OneHotEncoder. 

* get_dummies is easier to use (it can be applied directly to raw categorical features, i.e. strings), but it takes longer and is not memory efficient.

* OneHotEncoder requires the features to be converted to numeric values first, which has already been done by LabelEncoder in the previous step, and is much more efficient (7x faster).

* We will convert the OHE results to a sparse matrix, which uses far less memory than a dense matrix. However, not all algorithms and packages support sparse matrices, e.g. Keras. In that case, we'll need other tricks to make it work.
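
To give a rough feel for the memory difference, here is a small sketch on synthetic data. It builds the one-hot matrix directly with `scipy.sparse` rather than with `OneHotEncoder`, purely to keep the example independent of the scikit-learn version; the sizes (`n_rows`, `n_levels`) are arbitrary assumptions.

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
n_rows, n_levels = 1000, 50
codes = rng.randint(0, n_levels, size=n_rows)  # a label-encoded categorical column

# One-hot matrix in CSR format: exactly one non-zero entry per row.
ohe_sparse = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)), shape=(n_rows, n_levels)
)
ohe_dense = ohe_sparse.toarray()

# CSR stores only the non-zero values plus two index arrays.
sparse_bytes = ohe_sparse.data.nbytes + ohe_sparse.indices.nbytes + ohe_sparse.indptr.nbytes
dense_bytes = ohe_dense.nbytes
print("sparse: %d bytes, dense: %d bytes" % (sparse_bytes, dense_bytes))
```

The gap grows with the number of levels, since the dense matrix stores every zero explicitly.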

In [None]:
OHE = preprocessing.OneHotEncoder(sparse=True)
start=time.time()
full_data_sparse=OHE.fit_transform(full_data[cat_cols])
print ('One-hot-encoding finished in %f seconds' % (time.time()-start))

print (full_data_sparse.shape)

## it should be (313864, 1176)

### 3. Leave-one-out Encoding

This is a very useful trick that has been used in many winning Kaggle solutions. It's particularly effective for high-cardinality categorical features, postal codes for instance. However, it doesn't seem to help much in this competition and the following code is just FYI. Feel free to skip it as it may take a long time to run.

In [None]:
# start=time.time()
# loo_cols =[]
# for col in cat_cols:
#     print ("Leave-One-Out Encoding  %s" % (col))
#     print ("Leave-one-out encoding column %s for %s......" % (col, target_col))
#     aggr=full_data.groupby(col)[target_col].agg([np.mean]).join(full_data[:train_size].groupby(col)[target_col].agg([np.sum,np.size]),how='left')        
#     meanTagetAggr = np.mean(aggr['mean'].values)
#     aggr=full_data.join(aggr,how='left', on=col)[list(aggr.columns)+[target_col]]
#     loo_col = 'MEAN_BY_'+col+'_'+target_col
#     full_data[loo_col] = \
#     aggr.apply(lambda row: row['mean'] if math.isnan(row[target_col]) 
#                                                        else (row['sum']-row[target_col])/(row['size']-1)*random.uniform(0.95, 1.05) , axis=1)
#     loo_cols.append(loo_col)
#     print ("New feature %s created." % (loo_col))
# print ('Leave-one-out encoding finished in %f seconds' % (time.time()-start))
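
The core of the idea can be sketched in a few vectorized lines on a toy frame. This is a simplified illustration, not the commented-out code above: it leaves out the random noise factor and the handling of test rows, and the column names `cat` and `loo_mean` are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "loss": [1.0, 2.0, 3.0, 10.0, 20.0, 5.0],
})

grp = df.groupby("cat")["loss"]
sums = grp.transform("sum")      # per-row: sum of the target within the row's category
counts = grp.transform("count")  # per-row: number of rows in the row's category

# Mean of the target over all OTHER rows of the same category.
# A category with a single row has no other rows, so the result is NaN there.
df["loo_mean"] = (sums - df["loss"]) / (counts - 1).replace(0, np.nan)
print(df)
```

Because each row's own target is excluded, the encoded value does not leak that row's label into its feature.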

## Numeric features

We will apply two preprocessings on numeric features:

1. Apply box-cox transformations for skewed numeric features.

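2. Scale numeric features to a common range (the code below uses `StandardScaler`, which centers each feature to zero mean and unit variance rather than mapping it to [0, 1]).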

Please be advised that this preprocessing is not necessary for tree-based models, e.g. XGBoost. However, linear or linear-based models, which will be discussed in the following weeks, may benefit from it.

**Calculate the skewness of each numeric feature:**

In [None]:
from scipy.stats import skew, boxcox
skewed_cols = full_data[num_cols].apply(lambda x: skew(x.dropna()))
print (skewed_cols.sort_values())

**Apply box-cox transformations:**

In [None]:
skewed_cols = skewed_cols[skewed_cols > 0.25].index.values
for skewed_col in skewed_cols:
    full_data[skewed_col], lam = boxcox(full_data[skewed_col] + 1)
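
As a quick sanity check of what the transformation does, the sketch below applies the same `boxcox(x + 1)` call to a synthetic right-skewed feature and compares the skewness before and after. The exponential toy data is an assumption chosen only because it is clearly skewed.

```python
import numpy as np
from scipy.stats import skew, boxcox

rng = np.random.RandomState(0)
x = rng.exponential(scale=1.0, size=5000)  # right-skewed, non-negative toy feature

# boxcox needs strictly positive input, hence the +1 shift (as in the cell above).
x_bc, lam = boxcox(x + 1)

before, after = skew(x), skew(x_bc)
print("skew before: %.3f, after: %.3f, lambda: %.3f" % (before, after, lam))
```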

**Apply Standard Scaling:**

In [None]:
# SSL = preprocessing.StandardScaler()
# for num_col in num_cols:
#     full_data[num_col] = SSL.fit_transform(full_data[num_col])

#### Note: LBL and OHE are likely exclusive, so we will use one of them at a time, combined with the numeric features. In the following steps we will use OHE + Numeric features to tune the models, and you can apply the same process with LBL + Numeric features. Averaging results from two different models will likely generate better results.

In [None]:
lift = 200

full_data_sparse = sparse.hstack((full_data_sparse
                                  ,full_data[num_cols])
                                 , format='csr'
                                 )
print (full_data_sparse.shape)
train_x = full_data_sparse[:train_size]
test_x = full_data_sparse[train_size:]
train_y = np.log(full_data[:train_size].loss.values + lift)
ID = full_data.id[:train_size].values

xgtrain = xgb.DMatrix(train_x, label=train_y,missing=np.nan) #used for Bayesian Optimization

from sklearn.cross_validation import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, train_size=.80, random_state=1234)
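
The target is trained on the log scale (`np.log(loss + lift)`), so every prediction has to be mapped back with `np.exp(pred) - lift` before it is scored or submitted. The sketch below checks that round trip on made-up loss values; `log_mae_demo` is a NumPy-only stand-in for the `log_mae` helper defined earlier.

```python
import numpy as np

lift = 200
loss = np.array([0.67, 150.0, 2500.0, 12000.0])  # made-up loss values

y = np.log(loss + lift)       # what the models are trained on
recovered = np.exp(y) - lift  # what goes into the submission

def log_mae_demo(labels, preds, lift=200):
    """MAE on the original loss scale, given log-scale labels and predictions."""
    return np.mean(np.abs((np.exp(labels) - lift) - (np.exp(preds) - lift)))

preds = y + 0.1  # pretend every prediction is 0.1 too high on the log scale
err = log_mae_demo(y, preds)
print("MAE on the original scale:", err)
```

A constant error on the log scale turns into a larger absolute error for larger losses, which is why the metric is always computed after transforming back.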

## LightGBM Tuning

* LightGBM

https://github.com/Microsoft/LightGBM

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

    * Faster training speed and higher efficiency
    * Lower memory usage
    * Better accuracy
    * Parallel learning supported
    * Capable of handling large-scale data

* pyLightGBM

pyLightGBM is a python binding for Microsoft LightGBM

https://github.com/ArdalanM/pyLightGBM

#### 1. Tune num_leaves
* default=127, type=int, alias=num_leaf
* number of leaves in one tree
* controls overfitting
    * smaller: underfit
    * larger: overfit
* Start from the default (127), double or halve the number and check if the score improves. Repeat the process until there's no improvement.
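
The "double or halve until it stops improving" procedure can be written as a small helper. This is only a sketch: `tune_by_doubling` and its `evaluate` callback are hypothetical names, and in this notebook `evaluate` would wrap one `GBMRegressor` fit plus the `log_mae` call on the validation set. A toy function stands in for it here.

```python
def tune_by_doubling(evaluate, start, min_value=2, max_steps=10):
    """Move away from `start` by repeated doubling, then halving, while the score improves.

    `evaluate(value)` must return a validation error (lower is better).
    """
    best_value, best_score = start, evaluate(start)
    for factor in (2.0, 0.5):  # try growing first, then shrinking
        value = best_value
        for _ in range(max_steps):
            candidate = max(min_value, int(round(value * factor)))
            if candidate == value:
                break
            score = evaluate(candidate)
            if score >= best_score:  # no improvement: stop in this direction
                break
            best_value, best_score = candidate, score
            value = candidate
    return best_value, best_score


# Toy stand-in for "train a model and return its validation MAE".
def toy_error(n):
    return abs(n - 500)

best_num_leaves, best_error = tune_by_doubling(toy_error, start=127)
print(best_num_leaves, best_error)
```

The same loop shape works for the additive steps used for the other parameters if the multiplication is replaced by a fixed increment.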

In [None]:
# tune num_leaves
from pylightgbm.models import GBMRegressor

rgr = GBMRegressor(exec_path="/users/cchen1/library/LightGBM/lightgbm",
                   learning_rate=0.1,
                   metric = 'l1',
                   num_threads = 4, # the actual number of CPU cores
                   num_iterations=10000,
                   early_stopping_round=50,
                   num_leaves=127,
                   verbose = True)

rgr.fit(X_train,
        y_train,
        test_data=[(X_val,y_val)])

y_pred = rgr.predict(X_val)
print(rgr.best_round)
print("MAE: ", log_mae(y_val,y_pred, 200))






In [None]:
num_leaves = <best num_leaves>

#### 2. Tune min_data_in_leaf
* default=100, type=int, alias=min_data_per_leaf , min_data
* Minimum number of data points in one leaf. Can use this to deal with over-fit.
* controls overfitting
    * smaller: overfit
    * larger: underfit
* Start from the default (100), increase or decrease by 20 and check if the score improves. Repeat the process until there's no improvement.

In [None]:
## Tune min_data_in_leaf

rgr = GBMRegressor(exec_path="/users/cchen1/library/LightGBM/lightgbm",
                   learning_rate=0.1,
                   metric = 'l1',
                   num_threads = 4, # the actual number of CPU cores
                   num_iterations=10000,
                   early_stopping_round=50,
                   num_leaves=num_leaves,
                   min_data_in_leaf=100,
                   verbose = True)

rgr.fit(X_train,
        y_train,
        test_data=[(X_val,y_val)])

y_pred = rgr.predict(X_val)
print(rgr.best_round)
print("MAE: ", log_mae(y_val,y_pred, 200))



In [None]:
min_data_in_leaf = <best min_data_in_leaf>

#### 3. Tune feature_fraction
* feature_fraction, default=1.0, type=double, 0.0 < feature_fraction <= 1.0, alias=sub_feature
* LightGBM will randomly select a subset of features on each iteration if feature_fraction is smaller than 1.0. For example, if set to 0.8, it will select 80% of the features before training each tree.
* Can use this to speed up training
* Can use this to deal with over-fit
    * smaller: underfit
    * larger: overfit
* Start from the default (1), decrease by 0.1 and check if the score improves. Repeat the process until there's no improvement.

In [None]:
## Tune feature_fraction

rgr = GBMRegressor(exec_path="/users/cchen1/library/LightGBM/lightgbm",
                   learning_rate=0.1,
                   metric = 'l1',
                   num_threads = 4, # the actual number of CPU cores
                   num_iterations=10000,
                   early_stopping_round=50,
                   num_leaves=num_leaves,
                   min_data_in_leaf=min_data_in_leaf,
                   feature_fraction = 1,
                   verbose = True)

rgr.fit(X_train,
        y_train,
        test_data=[(X_val,y_val)])

y_pred = rgr.predict(X_val)
print(rgr.best_round)
print("MAE: ", log_mae(y_val,y_pred, 200))


In [None]:
feature_fraction = <best feature_fraction>

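#### 4. Tune bagging_fraction
* bagging_fraction, default=1.0, type=double
* Like feature_fraction, but randomly selects part of the data (rows) instead of features.
* bagging_freq (default=0, type=int) is the frequency for bagging: 0 disables bagging, k performs bagging at every k-th iteration.
* Note: to enable bagging, bagging_freq must be set to an integer greater than 0 as well (1 is recommended).
* Start from the default (1), decrease by 0.1 and check if the score improves. Repeat the process until there's no improvement.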

In [None]:
## Tune bagging_fraction
rgr = GBMRegressor(exec_path="/users/cchen1/library/LightGBM/lightgbm",
                   learning_rate=0.1,
                   metric = 'l1',
                   num_threads = 4, # the actual number of CPU cores
                   num_iterations=10000,
                   early_stopping_round=50,
                   num_leaves=num_leaves,
                   min_data_in_leaf=min_data_in_leaf,
                   feature_fraction = feature_fraction,
                   bagging_freq = 1, # this has to be set to an integer greater than 0 to enable bagging
                   bagging_fraction = 1,
                   verbose = True)

rgr.fit(X_train,
        y_train,
        test_data=[(X_val,y_val)])

y_pred = rgr.predict(X_val)
print(rgr.best_round)
print("MAE: ", log_mae(y_val,y_pred, 200))



In [None]:
bagging_fraction = <best bagging_fraction>

#### 5. Tune max_bin
* default=255, type=int
* Maximum number of bins that feature values will be bucketed into. A small number of bins may reduce training accuracy but may improve generalization (deal with over-fit).
* Start from the default (255), double or halve the number and check if the score improves. Repeat the process until there's no improvement.

In [None]:
## Tune max_bin
rgr = GBMRegressor(exec_path="/users/cchen1/library/LightGBM/lightgbm",
                   learning_rate=0.1,
                   metric = 'l1',
                   num_threads = 4, # the actual number of CPU cores
                   num_iterations=10000,
                   early_stopping_round=50,
                   num_leaves=num_leaves,
                   min_data_in_leaf=min_data_in_leaf,
                   feature_fraction = feature_fraction,
                   bagging_freq = 1,
                   bagging_fraction = bagging_fraction,
                   max_bin = 255,
                   verbose = True)

rgr.fit(X_train,
        y_train,
        test_data=[(X_val,y_val)])

y_pred = rgr.predict(X_val)
print(rgr.best_round)
print("MAE: ", log_mae(y_val,y_pred, 200))

In [None]:
max_bin = <best max_bin>

### Automated tuning - Bayesian Optimization

Github: https://github.com/fmfn/BayesianOptimization

The idea is to set a range for each parameter, using the values found by manual tuning as a guide, and then let Bayesian optimization search for the best parameters.

It's more efficient than grid search but is still time consuming. Therefore knowing an approximate range of values for each parameter will greatly improve the performance.
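
One possible way to turn manually tuned values into search ranges is sketched below. The helper `bounds_around`, the relative width of 50%, and the example values are all assumptions for illustration; they are not the ranges used in this notebook, which are left for you to fill in. The resulting dict has the `{name: (low, high)}` shape that `BayesianOptimization` takes as its bounds argument.

```python
def bounds_around(manual_best, rel_width=0.5, limits=None):
    """Build a {name: (low, high)} dict centred on manually tuned values.

    `limits` optionally gives hard (min, max) limits per parameter, e.g. fractions <= 1.
    """
    limits = limits or {}
    bounds = {}
    for name, value in manual_best.items():
        low, high = value * (1 - rel_width), value * (1 + rel_width)
        hard_low, hard_high = limits.get(name, (None, None))
        if hard_low is not None:
            low = max(low, hard_low)
        if hard_high is not None:
            high = min(high, hard_high)
        bounds[name] = (low, high)
    return bounds


# Hypothetical manual-tuning results, for illustration only.
manual_best = {"num_leaves": 127, "feature_fraction": 0.8, "bagging_fraction": 1.0}
limits = {"feature_fraction": (0.1, 1.0), "bagging_fraction": (0.1, 1.0)}

bounds = bounds_around(manual_best, rel_width=0.5, limits=limits)
print(bounds)
```

Integer parameters such as `num_leaves` come back as floats here, which matches the `int(...)` casts inside `lgbm_cv`.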

In [None]:
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization
from pylightgbm.models import GBMRegressor
from sklearn.cross_validation import StratifiedKFold, KFold


def lgbm_cv(max_bin, num_leaves, min_data_in_leaf, feature_fraction,bagging_fraction, learning_rate=0.1):
    skf = list(KFold(len(train_y), 4))
    scores=[]
    for i, (train, val) in enumerate(skf):
        est=GBMRegressor(learning_rate = learning_rate,
                        max_bin=int(max_bin),
                        num_leaves=int(num_leaves),
                        min_data_in_leaf=int(min_data_in_leaf),
                        feature_fraction=feature_fraction,
                        bagging_fraction=bagging_fraction,
                        bagging_freq = 1,
                        num_threads=4,
                        exec_path ="/users/cchen1/library/LightGBM/lightgbm")
        train_x_fold = train_x[train]
        train_y_fold = train_y[train]
        val_x_fold = train_x[val]
        val_y_fold = train_y[val]
        est.set_params( num_iterations=100000)
        est.set_params( early_stopping_round=50)
        est.set_params( metric='l1')
        est.set_params(verbose = False)
        print (est)
        est.fit(train_x_fold,
                train_y_fold,
                test_data=[(val_x_fold, val_y_fold)]
               )
        val_y_predict_fold = est.predict(val_x_fold)
        score = log_mae(val_y_fold, val_y_predict_fold,200)
        print (score, est.best_round)
        scores.append(score)
    return -np.mean(scores)
            


lgbm_BO = BayesianOptimization(lgbm_cv, {
                                     'max_bin': (, ),
                                     'num_leaves': (,),
                                     'min_data_in_leaf' :(,),
                                     'feature_fraction': (,),
                                     'bagging_fraction' : (,)})
lgbm_BO.maximize(init_points=5, n_iter=30)


In [None]:
gbm_bo_scores = pd.DataFrame([[s[0]['num_leaves'],
                               s[0]['min_data_in_leaf'],
                               s[0]['max_bin'],
                               s[0]['feature_fraction'],
                               s[0]['bagging_fraction'],
                               s[1]] for s in zip(lgbm_BO.res['all']['params'],lgbm_BO.res['all']['values'])],
                            columns = ['num_leaves',
                                       'min_data_in_leaf',
                                       'max_bin',
                                       'feature_fraction',
                                       'bagging_fraction',
                                       'score'])
gbm_bo_scores=gbm_bo_scores.sort_values('score',ascending=False)
gbm_bo_scores

### Cross validation using parameters from Bayesian Optimization

* This step shows how the tuned model performs with a smaller learning rate (0.01 or smaller). You should expect LightGBM to need more iterations to converge, so you may want to use a larger number (200 for instance) for early stopping.

* It will also provide optimized iterations (n_rounds/n_estimators).

In [None]:
def lgbm_cv(max_bin, num_leaves, min_data_in_leaf, feature_fraction,bagging_fraction
            , learning_rate=0.1,early_stopping_round=50):
    skf = list(KFold(len(train_y), 4))
    scores=[]
    best_rounds=[]
    for i, (train, val) in enumerate(skf):
        est=GBMRegressor(learning_rate = learning_rate,
                        max_bin=int(max_bin),
                        num_leaves=int(num_leaves),
                        min_data_in_leaf=int(min_data_in_leaf),
                        feature_fraction=feature_fraction,
                        bagging_fraction=bagging_fraction,
                        bagging_freq = 1,
                        num_threads=4,
                        exec_path ="/users/cchen1/library/LightGBM/lightgbm")
        train_x_fold = train_x[train]
        train_y_fold = train_y[train]
        val_x_fold = train_x[val]
        val_y_fold = train_y[val]
        est.set_params( num_iterations=100000)
        est.set_params( early_stopping_round=early_stopping_round)
        est.set_params( metric='l1')
        est.set_params(verbose = False)
        print (est)
        est.fit(train_x_fold,
                train_y_fold,
                test_data=[(val_x_fold, val_y_fold)]
               )
        val_y_predict_fold = est.predict(val_x_fold)
        score = log_mae(val_y_fold, val_y_predict_fold,200)
        print (score, est.best_round)
        best_rounds.append(est.best_round)
        scores.append(score)
    return -np.mean(scores), np.mean(best_rounds)

gbm_score, gbm_best_round = lgbm_cv(max_bin=582, 
                         num_leaves=140, 
                         min_data_in_leaf=75, 
                         feature_fraction=0.2,
                         bagging_fraction=1, 
                         learning_rate=0.01,
                         early_stopping_round=200)

## Submission - LightGBM OHE

In [None]:
rgr = GBMRegressor(learning_rate = ,
                        max_bin=,
                        num_leaves=,
                        min_data_in_leaf=,
                        feature_fraction=,
                        bagging_fraction=,
                        bagging_freq = 1,
                        num_threads=4,
                        exec_path ="/users/cchen1/library/LightGBM/lightgbm")
rgr.fit(train_x, train_y)

pred_y = np.exp(rgr.predict(test_x)) - lift

results = pd.DataFrame()
results['id'] = full_data[train_size:].id
results['loss'] = pred_y
results.to_csv("../output/sub_gbm_ohe_tuned.csv", index=False)
print ("Submission created.")

## Keras

https://keras.io

Keras is a high-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.


In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import backend as K
from keras.optimizers import SGD,Nadam
from keras.regularizers import WeightRegularizer, ActivityRegularizer,l2, activity_l2

## Comment out following lines if you are using Theano as backend
import tensorflow as tf
tf.python.control_flow_ops = tf

In [None]:
# custom metric function for Keras

def mae_log(y_true, y_pred): 
    return K.mean(K.abs((K.exp(y_pred)-200) - (K.exp(y_true)-200)))


# Keras doesn't support sparse matrices.
# The following functions split a large sparse matrix into smaller dense batches so they can be loaded into memory.

def batch_generator(X, y, batch_size, shuffle):
    number_of_batches = np.ceil(X.shape[0] / float(batch_size))  # float division, also under Python 2
    counter = 0
    sample_index = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(sample_index)
    while True:
        batch_index = sample_index[batch_size*counter:batch_size*(counter+1)]
        X_batch = X[batch_index,:].toarray()
        y_batch = y[batch_index]
        counter += 1
        yield X_batch, y_batch
        if (counter == number_of_batches):
            if shuffle:
                np.random.shuffle(sample_index)
            counter = 0

def batch_generatorp(X, batch_size, shuffle):
    number_of_batches = np.ceil(X.shape[0] / float(batch_size))  # batches per pass, as in batch_generator
    counter = 0
    sample_index = np.arange(X.shape[0])
    while True:
        batch_index = sample_index[batch_size * counter:batch_size * (counter + 1)]
        X_batch = X[batch_index, :].toarray()
        counter += 1
        yield X_batch
        if (counter == number_of_batches):
            counter = 0
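
To see the batching idea in isolation, here is a reduced sketch that does not need Keras. Unlike the two generators above, `dense_batches` (a hypothetical name) makes a single pass, never shuffles, and then stops, which makes it easy to check that the batches really add up to the original matrix.

```python
import numpy as np
from scipy import sparse

def dense_batches(X, batch_size):
    """Yield consecutive dense slices of a sparse matrix: one pass, in row order."""
    n_rows = X.shape[0]
    for start in range(0, n_rows, batch_size):
        # Only this slice is densified, so memory use is bounded by batch_size rows.
        yield X[start:start + batch_size].toarray()

X = sparse.random(10, 4, density=0.3, format="csr", random_state=0)
batches = list(dense_batches(X, batch_size=4))
shapes = [b.shape for b in batches]
print(shapes)
```

The last batch is smaller than `batch_size` whenever the number of rows is not a multiple of it.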

### Keras starter

Below is a quick starter example for creating a neural network model using Keras. It covers the following aspects:
1. multiple layers: 1 input, 1 hidden and 1 output
2. normalization
3. dropout regularization
4. early stopping
5. activation function
6. optimizer
7. batch training

Advanced optimizers, activations and dropout regularization are the key characteristics that differentiate modern Neural Networks from conventional ones.

In [None]:
early_stop = EarlyStopping(monitor='val_mae_log', # custom metric
                           patience=5, #early stopping for epoch
                           verbose=0, mode='auto')

def create_model(input_dim):
    model = Sequential()
    
    model.add(Dense(400, # number of input units: needs to be tuned
                    input_dim = input_dim # fixed length: number of columns of X
                   ))
    
    model.add(PReLU()) # activation function
    model.add(BatchNormalization()) # normalization
    model.add(Dropout(0.4)) #dropout rate. needs to be tuned
        
    model.add(Dense(200)) # number of hidden units. needs to be tuned.
    model.add(PReLU())
    model.add(BatchNormalization())    
    model.add(Dropout(0.2)) #dropout rate. needs to be tuned
    
    
    model.add(Dense(1)) # 1 for regression 
    model.compile(loss = 'mae',
                  metrics=[mae_log],
                  optimizer = 'adadelta' # optimizer. you may want to try different ones
                 )
    return(model)

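# Save the weights of the best epoch (lowest validation MAE) to disk
checkpointer = ModelCheckpoint(filepath="weights.hdf5", monitor='val_mae_log', verbose=1, save_best_only=True, mode='min')

model = create_model(X_train.shape[1])
fit= model.fit_generator(generator=batch_generator(X_train, y_train, 128, True),
                         nb_epoch=1000,
                         samples_per_epoch=X_train.shape[0], # rows in the training split, not in the full train set
                         validation_data=(X_val.todense(), y_val),
                         callbacks=[early_stop,checkpointer]
                         )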

### Cross Validation and ...

The following sample shows how to do cross validation for Keras with early stopping and much more. Training a NN is time consuming, and cross validation even more so, so we want to make good use of every minute spent on training: the function below also collects the out-of-fold and test predictions of every model so they can be reused for blending.

We'll first create the framework:

In [None]:
from sklearn.cross_validation import StratifiedKFold, KFold

early_stop = EarlyStopping(monitor='val_mae_log', patience=5, verbose=0, mode='auto')
checkpointer = ModelCheckpoint(filepath="weights.hdf5", monitor='val_mae_log', verbose=1, save_best_only=True, mode='min')

def nn_model(params):
    model = Sequential()
    model.add(Dense(params['input_size'], input_dim = params['input_dim']))

    model.add(PReLU())
    model.add(BatchNormalization())
    model.add(Dropout(params['input_drop_out']))
        
    model.add(Dense(params['hidden_size']))
    model.add(PReLU())
    model.add(BatchNormalization())    
    model.add(Dropout(params['hidden_drop_out']))
    
    
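#     nadam = Nadam(lr=1e-4)
    # Note: this Nadam optimizer is created but not passed to compile() below,
    # so params['learning_rate'] currently has no effect and the model trains with 'adadelta'.
    nadam = Nadam(lr=params['learning_rate'])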
    
    model.add(Dense(1))
    model.compile(loss = 'mae', metrics=[mae_log], optimizer = 'adadelta')
    return(model)


def nn_blend_data(parameters, train_x, train_y, test_x, fold, early_stopping_rounds=0, batch_size=128):
    print ("Blend %d estimators for %d folds" % (len(parameters), fold))
    skf = list(KFold(len(train_y), fold))
    
    train_blend_x = np.zeros((train_x.shape[0], len(parameters)))
    test_blend_x = np.zeros((test_x.shape[0], len(parameters)))
    scores = np.zeros ((len(skf),len(parameters)))
    best_rounds = np.zeros ((len(skf),len(parameters)))
 
    for j, nn_params in enumerate(parameters):
        print ("Model %d: %s" %(j+1, nn_params))
        test_blend_x_j = np.zeros((test_x.shape[0], len(skf)))
        for i, (train, val) in enumerate(skf):
            print ("Model %d fold %d" %(j+1,i+1))
            fold_start = time.time() 
            train_x_fold = train_x[train]
            train_y_fold = train_y[train]
            val_x_fold = train_x[val]
            val_y_fold = train_y[val]

            # early stopping
            model = nn_model(nn_params)
            print (model)
            fit= model.fit_generator(generator=batch_generator(train_x_fold, train_y_fold, batch_size, True),
                                     nb_epoch=60,
                                     samples_per_epoch=train_x_fold.shape[0],
                                     validation_data=(val_x_fold.todense(), val_y_fold),
                                     callbacks=[
#                                                 EarlyStopping(monitor='val_mae_log'
#                                                               , patience=early_stopping_rounds, verbose=0, mode='auto'),
                                                ModelCheckpoint(filepath="weights.hdf5"
                                                                , monitor='val_mae_log', 
                                                                verbose=1, save_best_only=True, mode='min')
                                                ]
                                     )

            best_round=len(fit.epoch)-early_stopping_rounds-1
            best_rounds[i,j]=best_round
            print ("best round %d" % (best_round))
            
            model.load_weights("weights.hdf5")
            # Compile model (required to make predictions)
            model.compile(loss = 'mae', metrics=[mae_log], optimizer = 'adadelta')

         
            # print (mean_absolute_error(np.exp(y_val)-200, pred_y))
            val_y_predict_fold = model.predict_generator(generator=batch_generatorp(val_x_fold, batch_size, False), # predictions must stay in row order
                                        val_samples=val_x_fold.shape[0]
                                     )
            
            score = log_mae(val_y_fold, val_y_predict_fold,200)
            print ("Score: ", score, mean_absolute_error(val_y_fold, val_y_predict_fold))
            scores[i,j]=score
            train_blend_x[val, j] = val_y_predict_fold.reshape(val_y_predict_fold.shape[0])
            
            model.load_weights("weights.hdf5")
            # Compile model (required to make predictions)
            model.compile(loss = 'mae', metrics=[mae_log], optimizer = 'adadelta')            
            test_blend_x_j[:,i] = model.predict_generator(generator=batch_generatorp(test_x, batch_size, False), # predictions must stay in row order
                                        val_samples=test_x.shape[0]
                                     ).reshape(test_x.shape[0])
            print ("Model %d fold %d fitting finished in %0.3fs" % (j+1,i+1, time.time() - fold_start))            
   
        test_blend_x[:,j] = test_blend_x_j.mean(1)
        print ("Score for model %d is %f" % (j+1,np.mean(scores[:,j])))
    print ("Score for blended models is %f" % (np.mean(scores)))
    return (train_blend_x, test_blend_x, scores,best_rounds )

Then let's create a list of parameter sets that we think might work for the NN, and cross validate each of them.

In [None]:
nn_parameters = [
    { 'input_size' :400 ,
     'input_dim' : train_x.shape[1],
     'input_drop_out' : 0.4 ,
     'hidden_size' : 200 ,
     'hidden_drop_out' :0.2,
     'learning_rate': 0.1},
    { 'input_size' :450 ,
     'input_dim' : train_x.shape[1],
     'input_drop_out' : 0.4 ,
     'hidden_size' : 200 ,
     'hidden_drop_out' :0.2,
     'learning_rate': 0.1},
    { 'input_size' :400 ,
     'input_dim' : train_x.shape[1],
     'input_drop_out' : 0.4 ,
     'hidden_size' : 250 ,
     'hidden_drop_out' :0.2,
     'learning_rate': 0.1},
    { 'input_size' :400 ,
     'input_dim' : train_x.shape[1],
     'input_drop_out' : 0.5 ,
     'hidden_size' : 200 ,
     'hidden_drop_out' :0.2,
     'learning_rate': 0.1}

]

(train_blend_x, test_blend_x, blend_scores,best_round) = nn_blend_data(nn_parameters, train_x, train_y, test_x,
                                                         4,
                                                         5)
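
The bookkeeping inside `nn_blend_data` is easier to follow on a stripped-down example. The sketch below keeps only the out-of-fold logic and swaps the neural network for ordinary least squares on synthetic data, so it is an illustration of the pattern and not a replacement for the function above. Names such as `oof_pred` and `test_pred_per_fold` are made up.

```python
import numpy as np

rng = np.random.RandomState(0)
n_train, n_test, n_folds = 20, 6, 4
true_w = np.array([1.0, -2.0, 0.5])
train_x = rng.rand(n_train, 3)
train_y = train_x @ true_w  # noise-free toy target
test_x = rng.rand(n_test, 3)

folds = np.array_split(np.arange(n_train), n_folds)  # contiguous folds, no shuffling
oof_pred = np.zeros(n_train)
test_pred_per_fold = np.zeros((n_test, n_folds))

for i, val_idx in enumerate(folds):
    train_idx = np.setdiff1d(np.arange(n_train), val_idx)
    # Stand-in model: least squares fitted on the training part of this fold only.
    w, *_ = np.linalg.lstsq(train_x[train_idx], train_y[train_idx], rcond=None)
    # Each training row is predicted exactly once, by a model that never saw it.
    oof_pred[val_idx] = train_x[val_idx] @ w
    # Every fold's model also predicts the full test set.
    test_pred_per_fold[:, i] = test_x @ w

# Average the per-fold test predictions into one column.
test_pred = test_pred_per_fold.mean(axis=1)
```

`oof_pred` plays the role of one column of `train_blend_x`, and `test_pred` the role of one column of `test_blend_x`.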


We can now create two submissions: 

* one is from the best CV score, the fourth in my case
* another is the average of all four

You can submit both and see if averaging helps.

In [None]:
pred_y = np.exp(test_blend_x[:,3]) - 200 # the fourth column of test_blend_x (1-D, so it fits a single DataFrame column)
results = pd.DataFrame()
results['id'] = full_data[train_size:].id
results['loss'] = pred_y
results.to_csv("../output/sub_keras_starter.csv", index=False)
print ("Submission created.")

pred_y = np.exp(np.mean(test_blend_x,axis=1)) - 200

results = pd.DataFrame()
results['id'] = full_data[train_size:].id
results['loss'] = pred_y
results.to_csv("../output/sub_keras_mean.csv", index=False)
print ("Submission created.")

## Follow up questions
* So far we've already created five models/submissions:
    * XGBoost with LE
    * XGBoost with OHE
    * LightGBM with LE
    * LightGBM with OHE
    * Keras
    
  Now let's create another submission, or more, by averaging them, either equally or with whatever weights work for you. It should yield better results.
  
    
* Is there a way to ensemble the models even more effectively? 
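
The averaging suggested in the first bullet could look like the sketch below. The `blend` helper, the prediction values, and the weights are all hypothetical; in practice each column would be the `loss` column of one of the submission files, and the weights would be chosen from your own cross-validation results.

```python
import numpy as np

def blend(pred_columns, weights=None):
    """Weighted average of model predictions; rows are samples, columns are models."""
    preds = np.asarray(pred_columns, dtype=float)
    if weights is None:
        weights = np.ones(preds.shape[1])  # plain average
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # normalise so the weights sum to 1
    return preds @ weights

# Made-up predictions from three models for two test rows.
preds = np.array([[1000.0, 1100.0, 900.0],
                  [2000.0, 2200.0, 2100.0]])

simple = blend(preds)
weighted = blend(preds, weights=[2, 1, 1])  # trust the first model twice as much
print(simple, weighted)
```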