### Distributed Neural Networks

How to combine cross-validation, neural networks, and clusters:

In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import StratifiedKFold
import warnings
from dask.distributed import Client
import os
import tensorflow as tf
import numpy as np
import time
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0' 
warnings.filterwarnings('ignore')




Using TensorFlow backend.


We create a simple binary classification dataset:

In [3]:
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
print("Example of dataset row: "+str(X[:1]))

Example of dataset row: [[ 0.96479937 -0.06644898  0.98676805 -0.35807945  0.99726557  1.18189004
  -1.61567885 -1.2101605  -0.62807677  1.22727382]]


Let's define the ```Deep Neural Network```:

In [4]:
def build_deep_neural_network():
    # Two hidden ReLU layers (60 and 30 units) on the 10 input features,
    # followed by a single sigmoid unit for binary classification
    model = Sequential()
    model.add(Dense(60, input_shape=(10,), kernel_initializer='normal', activation='relu'))
    model.add(Dense(30, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

    # Binary cross-entropy loss, Adam optimizer, accuracy as the reported metric
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
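
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each Dense layer holds one weight per (input, unit) pair plus one bias per unit. The helper `dense_param_count` below is a hypothetical name introduced only for this sketch; it is plain Python and does not call Keras.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases of a stack of fully connected layers.

    Hypothetical helper for illustration only (not part of Keras).
    """
    total = 0
    for units in layer_units:
        # n_inputs * units weights, plus one bias per unit
        total += n_inputs * units + units
        # the outputs of this layer are the inputs of the next one
        n_inputs = units
    return total


# 10 input features, then Dense(60), Dense(30), Dense(1) as defined above
n_params = dense_param_count(10, [60, 30, 1])
print(n_params)  # 660 + 1830 + 31 = 2521
```

This figure should agree with the total that `model.summary()` reports for the model built above.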


Prepare the cross-validation data splits:

In [9]:
# Each element of data_split is a (train_indices, test_indices) pair
kfold = StratifiedKFold(5, shuffle=True, random_state=42)
data_split = list(kfold.split(X, y))
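
As a rough check of what the stratified split produces, the sketch below rebuilds the same kind of dataset and summarizes each fold: the size of the train and test parts, and the fraction of positive labels in the test part. The names `X_demo`, `y_demo`, `kfold_demo` and `fold_summary` are introduced here for illustration only, so that the notebook's own variables are left untouched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Same dataset parameters as in the notebook, under separate demo names
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
kfold_demo = StratifiedKFold(5, shuffle=True, random_state=42)

overall_positive = float(y_demo.mean())
fold_summary = []
for train_idx, test_idx in kfold_demo.split(X_demo, y_demo):
    # (train size, test size, fraction of positive labels in the test part)
    fold_summary.append((len(train_idx), len(test_idx), float(y_demo[test_idx].mean())))

print("Overall positive fraction: " + str(overall_positive))
for row in fold_summary:
    print(row)
```

Each test part should hold about one fifth of the samples, with a class balance close to that of the full dataset, which is the point of stratifying.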

Create our ```classifier``` instance:

In [5]:
clf = KerasClassifier(build_fn=build_deep_neural_network, epochs=100, batch_size=5, verbose=0)

Train the ```Deep Neural Network``` classifier on each training split, evaluate it on the matching test split, and finally compute the mean of the results.

In [None]:
start_time = time.time()
results = []
for train_idx, test_idx in data_split:
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    results.append(accuracy_score(y[test_idx], y_pred))
end_time = time.time()
print("Mean of the results: "+str(np.array(results).mean())+" in: "+str(end_time-start_time)+"s")

The process took several seconds. What changes if we use a cluster instead? Let's see:

First, let's connect to the cluster and share the dataset with it:

In [6]:

client = Client('127.0.0.1:8786')
client.scatter(X)
client.scatter(y)

Then, let's define the process that will be distributed:

In [7]:
def distribute_cross_validation(args):
    train_idx, test_idx = args
    with tf.device('/cpu:0'):
        clf = KerasClassifier(build_fn=build_deep_neural_network, epochs=100, batch_size=5, verbose=0)
        clf.fit(X[train_idx], y[train_idx])

        y_pred = clf.predict(X[test_idx])
        # Compute the fold accuracy once, log it on the worker, and return it
        score = accuracy_score(y[test_idx], y_pred)
        print(str(score))
    return score

Start the process and retrieve the results:

In [10]:
start_time = time.time()

futures = client.map(distribute_cross_validation, data_split)
results = client.gather(futures)

end_time = time.time()   
print("Mean of the results: "+str(np.array(results).mean())+" in: "+str(end_time-start_time)+"s")

Mean of the results: 0.880029100727518 in: 19.9167821407s
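
The submit-then-gather pattern used above follows the same futures interface as Python's standard `concurrent.futures` module, so it can be tried on a single machine without a scheduler. The sketch below is only an illustration of the pattern: `score_fold` is a hypothetical stand-in that does no training, `demo_splits` is made-up index data, and a thread pool is not a substitute for a cluster when the work is CPU-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def score_fold(args):
    # Hypothetical stand-in for distribute_cross_validation: instead of
    # training a network it returns the test share of the fold.
    train_idx, test_idx = args
    return len(test_idx) / float(len(train_idx) + len(test_idx))


# Made-up (train_indices, test_indices) pairs, shaped like data_split
demo_splits = [
    (list(range(0, 8)), [8, 9]),
    (list(range(2, 10)), [0, 1]),
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # One future per fold, comparable to client.map(...)
    demo_futures = [pool.submit(score_fold, split) for split in demo_splits]
    # Block until every fold is done, comparable to client.gather(...)
    demo_results = [future.result() for future in demo_futures]

print(demo_results)
```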


The distributed run took roughly 1/5 of the sequential time, as expected when the five folds are trained in parallel. In general, validating a machine learning model reliably means repeating this process many times with random splits of the data, so a cluster quickly becomes essential. The growth of deep models and of the available data (big data) has only increased this need.
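
The arithmetic behind that ratio can be sketched as follows. This is an idealized model under stated assumptions: every fold takes the same time, scheduling and data transfer cost nothing, and the 20 seconds per fold is an assumed round figure, not a measurement. `ideal_wall_time` is a hypothetical helper written for this sketch.

```python
def ideal_wall_time(n_tasks, n_workers, seconds_per_task):
    """Idealized wall time when equal-length tasks run in rounds of n_workers."""
    # Integer ceiling of n_tasks / n_workers: number of rounds needed
    rounds = -(-n_tasks // n_workers)
    return rounds * seconds_per_task


seconds_per_fold = 20.0  # assumed round figure for illustration

sequential = ideal_wall_time(5, 1, seconds_per_fold)  # 5 rounds of one fold
parallel = ideal_wall_time(5, 5, seconds_per_fold)    # 1 round of five folds
ratio = parallel / sequential

# Repeating 5-fold cross-validation 100 times on the same 5 workers
repeated_parallel = ideal_wall_time(100 * 5, 5, seconds_per_fold)

print(sequential, parallel, ratio, repeated_parallel)
```

Under these assumptions five workers cut the time of one 5-fold run to one fifth, while repeated runs still grow linearly with the number of repetitions unless more workers are added.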

Note: This is a simple approach, but clusters are also extremely useful when the ML algorithms themselves can be parallelized. Apache Spark, which we will see in the last lecture, exploits these ideas.