This notebook demonstrates the effect of dropout at the visible and hidden layers on the dataset. Let's load the libraries.

In [1]:
import numpy
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm
from keras.optimizers import SGD
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils

Using TensorFlow backend.


In [2]:
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
dataframe = read_csv("Book1.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:15].astype(float)
Y = dataset[:,15]
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)

In [3]:
def create_baseline():
    model = Sequential()
    model.add(Dense(50, input_dim=15, kernel_initializer='normal', activation='relu'))
    model.add(Dense(3, kernel_initializer='normal', activation='sigmoid'))
    sgd = SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=False)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

numpy.random.seed(seed)
estimators = []
estimators.append(('minmaxscale', MinMaxScaler()))
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=100,
batch_size=5, verbose=1)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Epoch 1/100


InternalError: Blas GEMM launch failed : a.shape=(5, 15), b.shape=(15, 50), m=5, n=50, k=15
	 [[Node: dense_1/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/SGD/gradients/dense_1/MatMul_grad/MatMul_1"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_dense_1_input_0_0/_29, dense_1/kernel/read)]]

Running the example for the baseline model without dropout gives an estimated classification accuracy of 98.93%.

Using Dropout on the Visible Layer:
Dropout can be applied to the input neurons, called the visible layer. Let's add a new Dropout layer between the input (or visible layer) and the hidden layer. The dropout rate is set to 20%, meaning that on average one in five inputs will be randomly excluded from each update cycle. Additionally, as recommended in the original paper on dropout by Srivastava et al. (2014), a constraint is imposed on the weights of the hidden layer, ensuring that the maximum norm of the weights does not exceed a value of 3. This is done by setting the kernel_constraint argument on the Dense class when constructing the layer. The learning rate is kept at 0.1 and the momentum at 0.9, the same values as in the baseline; a high learning rate and high momentum were also recommended in the original dropout paper. Continuing on from the baseline example above, the code below exercises the same network with input dropout.

In [None]:
# Example of Dropout on the dataset at the Visible Layer
def create_baseline():
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(15,)))
    model.add(Dense(50, input_dim=15, kernel_initializer='normal', activation='relu',
                    kernel_constraint=maxnorm(3)))
    model.add(Dense(3, kernel_initializer='normal', activation='sigmoid'))
    sgd = SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=False)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model
    
numpy.random.seed(seed)
estimators = []
estimators.append(('minmaxscale', MinMaxScaler()))
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=100,
batch_size=5, verbose=1)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Visible: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Running the example shows a drop in classification accuracy, at least on a single test run. Visible: 96.13% (0.27%) compared to 98.93% (0.13%) without dropout.
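
For intuition about the 20% rate used above, here is a minimal, dependency-free sketch of what a dropout layer does to one input row during training. It assumes the "inverted dropout" scheme, in which the inputs that are kept are scaled by 1 / (1 - rate) so that the expected input is unchanged; this is how Keras' Dropout layer behaves at training time. The function name dropout and the all-ones input row are illustrative only and are not part of the notebook.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout (illustrative helper, not the Keras implementation).

    Each value is zeroed with probability `rate`; the survivors are
    scaled by 1 / (1 - rate) so the expected value stays the same.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(7)   # fixed seed, as in the notebook
inputs = [1.0] * 15      # stand-in for one row of 15 features

# One training-time pass: some inputs become 0.0, the rest become 1.25
dropped = dropout(inputs, rate=0.2, rng=rng)
print(dropped)

# Averaged over many passes, the mean input stays close to 1.0
trials = 2000
total = sum(sum(dropout(inputs, 0.2, rng)) for _ in range(trials))
mean = total / (trials * len(inputs))
print(round(mean, 2))
```

At prediction time no inputs are dropped, which is why the rescaling during training matters: the hidden layer sees inputs of the same expected magnitude in both modes.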

Using Dropout on Hidden Layers:
Dropout can also be applied to hidden neurons in the body of the network. Here, dropout is applied between the hidden layer and the output layer. Again, a dropout rate of 20% is used, together with a max-norm weight constraint on the hidden layer.

In [None]:
def create_baseline():
    model = Sequential()
    model.add(Dense(50, input_dim=15, kernel_initializer='normal', activation='relu',
                    kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(3, kernel_initializer='normal', activation='sigmoid'))
    sgd = SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=False)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

numpy.random.seed(seed)
estimators = []
estimators.append(('minmaxscale', MinMaxScaler()))
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=100,
batch_size=5, verbose=1)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Hidden: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

We can see that, for this problem and the chosen network configuration, using dropout on the hidden layer performs far worse than the baseline. Hidden: 35.26% (2.02%).
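
The max-norm weight constraint used alongside dropout in the cells above can be pictured as a rescaling step applied to the weights after each update. The sketch below is a simplified, pure-Python illustration under two assumptions: the constraint is applied per hidden unit, over that unit's incoming weights (the columns of the weight matrix), and a column is left untouched when its norm is already within the limit. The helper names max_norm and column_norms and the tiny 2x2 weight matrix are hypothetical and chosen for readability.

```python
import math


def column_norms(weights):
    """L2 norm of each column of a row-major weight matrix."""
    n_cols = len(weights[0])
    return [math.sqrt(sum(row[j] ** 2 for row in weights)) for j in range(n_cols)]


def max_norm(weights, max_value=3.0):
    """Rescale every column whose L2 norm exceeds max_value (illustrative helper)."""
    norms = column_norms(weights)
    # Scale factor is 1.0 for columns already inside the limit
    scales = [1.0 if n <= max_value else max_value / n for n in norms]
    return [[w * s for w, s in zip(row, scales)] for row in weights]


# Two inputs feeding two hidden units: unit 0 has norm 5.0, unit 1 has norm 0.5
weights = [[3.0, 0.3],
           [4.0, 0.4]]

constrained = max_norm(weights, max_value=3.0)
print(column_norms(weights))
print(column_norms(constrained))
```

Only the first unit is shrunk, because its incoming weights exceed the limit of 3; the second unit passes through unchanged. The constraint therefore caps how large the weights can grow under the high learning rate and momentum without affecting units that are already small.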

Using Dropout on both the visible layer and the Hidden Layer:
Dropout can be applied to both the visible and the hidden neurons of the network. Here, dropout is applied between the input layer and the hidden layer, and between the hidden layer and the output layer. Again, a dropout rate of 20% is used, together with a max-norm weight constraint on the hidden layer.

In [None]:
def create_baseline():
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(15,)))
    model.add(Dense(50, input_dim=15, kernel_initializer='normal', activation='relu',
                    kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(3, kernel_initializer='normal', activation='sigmoid'))
    sgd = SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=False)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

numpy.random.seed(seed)
estimators = []
estimators.append(('minmaxscale', MinMaxScaler()))
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=100,
batch_size=5, verbose=1)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Visible and Hidden: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

We can see that, for this problem and the chosen network configuration, using dropout on both the visible layer and the hidden layer is also worse than the baseline. Visible and Hidden: 49.41% (7.07%).

We conclude that dropout is better suited to a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations. This is a small network, and in these runs dropout brought no benefit.