__Please note: this notebook cannot be run as-is. Before training, you first have to add support for Keras callbacks to `KerasRegressor` in `scikit_learn.py`, as described further below.__

__Splitting the dataset into inputs and outputs.__


In [1]:
import pandas as pd

df = pd.read_csv('cleaned_data.csv')
X = df.loc[:,:'capacity']
Y = df.loc[:,'tickets_9_eur':'tickets_19_eur']
X.head()

Unnamed: 0,month,day_of_year,hour,minute,day_of_week,holiday,route_A->B,route_B->A,capacity
0,1,1,8,15,3,1,0,1,82.0
1,1,1,9,15,3,1,1,0,82.0
2,1,1,10,15,3,1,0,1,82.0
3,1,1,11,45,3,1,1,0,82.0
4,1,1,12,45,3,1,0,1,82.0


__We split only the inputs into train and test sets here. The outputs do not need to be split at this point, because inputs and outputs are split together later.__

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X,test_size=0.2,shuffle=False)
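
__A minimal sketch of what `shuffle=False` does, using a hypothetical 10-row toy frame rather than `cleaned_data.csv`: the rows keep their original order, so the first 80% become the training set and the last 20% the test set.__

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real inputs.
toy = pd.DataFrame({"hour": range(10)})

toy_train, toy_test = train_test_split(toy, test_size=0.2, shuffle=False)

# With shuffle=False the split is a simple cut in row order.
print(toy_train.index.tolist())
print(toy_test.index.tolist())
```

__For time-ordered data such as departures, this keeps the test rows strictly later than the training rows.__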

__A standard scaler is applied only to certain features (`month` through `day_of_week`). The scaler is first fit on the training set and then applied to the test set, and it is pickled for later use. The train and test sets are then concatenated again; the result is split later, together with the outputs.__

In [3]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import StandardScaler
import numpy as np
import joblib
scaler = StandardScaler()
X_train.loc[:,'month':'day_of_week'] = scaler.fit_transform(X_train.loc[:,'month':'day_of_week'])
X_test.loc[:,'month':'day_of_week'] = scaler.transform(X_test.loc[:,'month':'day_of_week'])
X_scaled = np.concatenate((X_train,X_test))
joblib.dump(scaler,"scaler.pkl")

['scaler.pkl']
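
__A hedged sketch of the same pattern on hypothetical random data: the scaler learns its mean and standard deviation from the training rows only, reuses them on the test rows, and gives the same result after being reloaded with `joblib`. The file name `toy_scaler.pkl` and the temporary folder are illustrative assumptions, not part of this project.__

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the scaled feature columns.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=10.0, scale=3.0, size=(80, 2))
toy_test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # statistics come from train only
test_scaled = toy_scaler.transform(toy_test)        # test reuses the train statistics

# Round-trip the fitted scaler through a pickle file, as done for later use.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_scaler.pkl")
    joblib.dump(toy_scaler, path)
    reloaded = joblib.load(path)

reloaded_scaled = reloaded.transform(toy_test)
print(np.allclose(test_scaled, reloaded_scaled))
```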

__A simple Keras neural network, with dropout after most layers to reduce overfitting.__

In [4]:
from keras.models import Sequential
from keras.layers import Dense,Dropout
def mlp(dropout_rate=0.0,activation1='relu', activation2='relu', activation3='relu', activation4='relu',
                       activation5='relu',activation6='relu'):
    model = Sequential()
    model.add(Dense(1024, input_dim=len(X.columns), kernel_initializer='normal', activation=activation1))
    model.add(Dropout(dropout_rate))
    model.add(Dense(512, kernel_initializer='normal', activation=activation2))
    model.add(Dropout(dropout_rate))
    model.add(Dense(256, kernel_initializer='normal', activation=activation3))
    model.add(Dropout(dropout_rate))
    model.add(Dense(128, kernel_initializer='normal', activation=activation4))
    model.add(Dropout(dropout_rate))
    model.add(Dense(64, kernel_initializer='normal', activation=activation5))
    model.add(Dropout(dropout_rate))
    model.add(Dense(32, kernel_initializer='normal', activation=activation6))
    model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    # Compile model
    model.compile(loss='mae', optimizer='adam', metrics=['mae'])
    return model

Using TensorFlow backend.


In [5]:
import numpy as np
seed = 7
np.random.seed(seed)

__Implementing callbacks for early stopping and checkpointing. Early stopping halts training once the validation loss stops improving, which helps keep the model from overfitting.__
__Please change the output path if you have to rerun.__

In [6]:
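import os
from keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard
outputFolder = 'models'
os.makedirs(outputFolder, exist_ok=True)  # make sure the checkpoint folder exists before training
filepath = outputFolder + "/model-{val_loss:.4f}.hdf5"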

checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, \
                             save_best_only=True, save_weights_only=False, \
                             mode='auto', period=1)
earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto')
tensorboard = TensorBoard(log_dir='./logs')
callbacks_list = [earlystop, checkpoint, tensorboard]
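
__To illustrate the idea behind `patience=5`, here is a simplified plain-Python sketch of patience-based stopping. It is not Keras's implementation; the function name `stopping_epoch` and the loss values are hypothetical.__

```python
def stopping_epoch(val_losses, patience=5):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached at epoch 2.
toy_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.6]
print(stopping_epoch(toy_losses, patience=5))
```

__In this toy run the later improvement to 0.6 is never reached, because five epochs without improvement have already passed.__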

__Setting up the parameter grid with activations, batch size, and dropout rate.__

In [7]:
activation1 = ['elu','tanh','sigmoid']
activation2 = ['elu', 'sigmoid','tanh']
activation3 = ['elu', 'sigmoid','tanh']
activation4 = ['elu', 'sigmoid','tanh']
activation5 = ['elu', 'sigmoid','tanh']
activation6 = ['elu', 'tanh', 'sigmoid']
activation7 = ['relu']  # not part of param_grid; the output layer in mlp() is fixed to 'relu'

dropout_rate = [0.2,0.3]

(row_size,_)=np.shape(X_train)
min_bsize=int(row_size/100)
batch_size=[min_bsize,2*min_bsize]

param_grid = dict(activation1=activation1, activation2=activation2, activation3=activation3, activation4=activation4,
                  activation5=activation5, activation6=activation6,  batch_size=batch_size, dropout_rate=dropout_rate)
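
__A quick sketch for sizing the grid above: the number of candidate models is the product of the number of options per parameter. The batch sizes here are hypothetical placeholders, since the real values depend on the size of the training set.__

```python
import math

acts = ["elu", "tanh", "sigmoid"]
toy_grid = {
    "activation1": acts,
    "activation2": acts,
    "activation3": acts,
    "activation4": acts,
    "activation5": acts,
    "activation6": acts,
    "batch_size": [64, 128],      # hypothetical placeholder values
    "dropout_rate": [0.2, 0.3],
}

# One candidate per combination of options.
n_combinations = math.prod(len(options) for options in toy_grid.values())
print(n_combinations)
```

__Six activation choices with three options each, times two batch sizes and two dropout rates, give 3^6 x 2 x 2 = 2916 combinations.__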

__Here `KerasRegressor` wraps the model so that it can be used as input for the next step in the pipeline. Because a `validation_split` of 0.2 is used here for early stopping, we do not keep separate train and test sets for the inputs and outputs.__

__Unfortunately, `KerasRegressor` does not support callbacks yet. Please implement this based on the suggestion [here](https://github.com/keras-team/keras/issues/4278#issuecomment-258922449).__

In [8]:
from keras.wrappers.scikit_learn import KerasRegressor
estimator = KerasRegressor(build_fn=mlp, epochs=100,  verbose=2, validation_split=0.2,
                           callbacks=callbacks_list)


__Using scikit-learn's `GridSearchCV` to iterate over the parameter grid with the estimator defined above. `ShuffleSplit` with a single split is used here to skip cross-validation over several folds, so that only the grid search feature is used.__

In [9]:
from sklearn.model_selection import GridSearchCV,ShuffleSplit
grid = GridSearchCV(estimator, param_grid=param_grid, cv=ShuffleSplit(test_size=0.01, n_splits=1), scoring='neg_mean_absolute_error',verbose=2)
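
__A small sketch of what this `cv` setting yields, on a hypothetical 1000-row array: a single split that holds out 1% of the rows, so each parameter combination is fit once. The `random_state` is an added assumption to make the toy run repeatable.__

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

toy_X = np.arange(1000).reshape(-1, 1)  # hypothetical stand-in for X_scaled

splitter = ShuffleSplit(test_size=0.01, n_splits=1, random_state=0)
splits = list(splitter.split(toy_X))
train_idx, test_idx = splits[0]

# One split only: most rows go to fitting, a small slice to scoring.
print(len(splits), len(train_idx), len(test_idx))
```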

__Be careful before starting training: because of checkpointing, this creates more than 2000 saved models.__

In [None]:
grid_result = grid.fit(X_scaled, Y.to_numpy())
#type(Y.to_numpy()),type(X_scaled)