Merge pull request #181 from scaleoutsystems/develop

v0.2.2
scaleoutsystems · Apr 19, 2021 · 3567837 · 3567837
2 parents 512094d + 809bb87
commit 3567837

Showing 21 changed files with 270 additions and 338 deletions.
diff --git a/README.md b/README.md
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -0,0 +1,16 @@
+# Frequently asked questions
+
+### Q: How do I remove/replace the compute package?
+
+We do not provide an out-of-the-box way to clear the compute package for a model that has been initialized. 
+This is a security constraint that prevents arbitrary replacement of the code package in an already configured federation. 
+Once the federated model has been initialized and seeded, it should be seen as immutable. However, during development of a new model
+it will be necessary to reinitialize, in which case you can follow this procedure: 
+
+  1. Clear the database. Navigate to http://localhost:8081 and delete the entire "fedn-test-network" collection. 
+  2. Restart the reducer and the combiner, then reattach the clients. 
+
+There are also additional ways to enable iterative development by bypassing the need to use/upload a compute package.  
+
+
+
diff --git a/docs/README.md b/docs/README.md
diff --git a/docs/_navbar.md b/docs/_navbar.md
@@ -1,4 +1,7 @@
 <!-- docs/_navbar.md -->
 
 * [Home](/)
+* [Architecture](/architecture)
+* [Deployment](/deployment)
 * [Release Notes](/releasenotes)
+* [FAQ](/FAQ)
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -0,0 +1,28 @@
+## Architecture overview
+
+Constructing a federated model with FEDn amounts to a) specifying the details of the client-side training code and data integrations, and b) deploying the reducer-combiner network. A FEDn network, as illustrated in the picture below, is made up of three main components: the *Reducer*, one or more *Combiners*, and a number of *Clients*. The combiner network forms the backbone of the FedML orchestration mechanism, while the Reducer provides discovery services and the controls needed to coordinate training over the combiner network. By horizontally scaling the combiner network, one can meet the needs of a growing number of clients.  
+
+![alt-text](img/overview.png?raw=true "FEDn network")
+
+### Main components
+
+#### Client
+A Client is a data node that holds private data and connects to a Combiner to receive model update requests and model validation requests during training rounds. Importantly, clients do not require any open ingress ports. A client receives the code to be executed from the Reducer upon connecting to the network, so before connecting it only needs to be configured to read the local datasets used during training and validation. A Python 3 client implementation is provided out of the box, and it is possible to write clients in a variety of languages to target different software and hardware requirements.  
+
+#### Combiner
+A combiner is an actor whose main role is to orchestrate and aggregate model updates from a number of clients during a training round. When and how to trigger such orchestration rounds are specified in the overall *compute plan* laid out by the Reducer. Each combiner in the network runs an independent gRPC server, providing RPCs for interacting with the alliance subsystem it controls. Hence, the total number of clients that can be accommodated in a FEDn network is proportional to the number of active combiners in the FEDn network. Combiners can be deployed anywhere, e.g. in a cloud or on a fog node to provide aggregation services near the cloud edge. 
+
+#### Reducer
+The reducer fills three main roles in the FEDn network: 1.) it lays out the overall, global training strategy and communicates that to the combiner network. It also dictates the strategy to aggregate model updates from individual combiners into a single global model, 2.) it handles global state and maintains the *model trail* - an immutable trail of global model updates uniquely defining the FedML training timeline, and  3.) it provides discovery services, mediating connections between clients and combiners. For this purpose, the Reducer exposes a standard REST API. 
+
+### Services and communication
+The figure below provides a logical architecture view of the services provided by each agent and how they interact. 
+
+![Alt text](img/FEDn-arch-overview.png?raw=true "FEDn architecture overview")
+
+### Control flows and algorithms
+FEDn is designed to allow customization of the FedML algorithm, following a specified pattern, or programming model. Model aggregation happens on two levels in the system. First, each Combiner can be configured with a custom orchestration and aggregation implementation that reduces model updates from Clients into a single, *combiner level* model. Then, a configurable aggregation protocol on the Reducer level is responsible for combining the combiner-level models into a global model. By varying the aggregation schemes on the two levels in the system, many different possible outcomes can be achieved. Good starting configurations are provided out-of-the-box to help the user get started. 
+
+#### Hierarchical Federated Averaging
+The currently implemented default scheme uses a local SGD strategy for the Combiner-level aggregation and a simple average of models on the Reducer level. This results in a highly horizontally scalable FedAvg scheme. The strategy works well with most artificial neural network (ANN) models, and can in general be applied to models where it is possible and makes sense to form mean values of model parameters (for example SVMs). Additional FedML training protocols, including support for various types of federated ensemble models, are in active development.  
+![Alt text](img/HFedAvg.png?raw=true "FEDn architecture overview")
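
The two-level averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration only, not FEDn's actual aggregation code: the function name `average_models` and the toy weights are hypothetical, and each model is assumed to be a list of per-layer arrays (as returned by Keras `get_weights()`).

```python
import numpy as np

def average_models(models):
    """Element-wise mean of several models, each given as a list of per-layer arrays."""
    n = len(models)
    # zip(*models) groups the same layer across all models
    return [sum(layers) / n for layers in zip(*models)]

# Toy setup (hypothetical): two combiners, each serving two single-layer client models.
clients_a = [[np.array([0.0, 2.0])], [np.array([2.0, 4.0])]]
clients_b = [[np.array([4.0, 6.0])], [np.array([6.0, 8.0])]]

# Level 1: each combiner reduces its client updates to one combiner-level model.
combiner_a = average_models(clients_a)
combiner_b = average_models(clients_b)

# Level 2: the reducer forms a simple average of the combiner-level models.
global_model = average_models([combiner_a, combiner_b])
```

Note that a simple mean on the reducer level matches the mean over all clients only when every combiner serves the same number of clients; otherwise a weighted average would be needed to recover it.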
diff --git a/docs/deployment.md b/docs/deployment.md
@@ -0,0 +1,44 @@
+## Distributed deployment
+The actual deployment, sizing of nodes, and tuning of a FEDn network in production depend heavily on the use case (cross-silo, cross-device, etc.), the size of model updates, the available infrastructure, and the strategy to provide end-to-end security. You can use the provided docker-compose templates to deploy a FEDn network across different hosts in a live environment, but note that it might be necessary to modify them slightly depending on your target environment and host configurations.   
+
+This example serves as a reference deployment for setting up a fully distributed FEDn network consisting of one host serving the supporting services (Minio, MongoDB), one host serving the reducer, one host running two combiners, and one host running a variable number of clients. 
+
+> Warning, there are additional security considerations when deploying a live FEDn network, outside of core FEDn functionality. Make sure to include these aspects in your deployment plans.
+
+### Prerequisites for the reference deployment
+
+#### Hosts
+This example assumes root access to 4 Ubuntu 20.04 servers for running the FEDn network. We recommend flavors with at least 4 CPUs and 8GB RAM for the base services and the reducer, and 4 CPUs and 16GB RAM for the combiner host. Client host sizing depends on the number of clients you plan to run. You need to be able to configure security groups/ingress settings for the service node, combiner, and reducer host.
+
+#### Certificates
+Certificates are needed for the reducer and combiner services. By default, FEDn will generate unsigned certificates for the reducer and combiner nodes using OpenSSL. 
+
+> Certificates based on IP addresses are not supported due to incompatibilities with gRPC. 
+
+### 1. Deploy supporting services  
+First, deploy Minio and Mongo services on one host (make sure to change the default passwords). Confirm that you can access MongoDB via the MongoExpress dashboard before proceeding with the reducer.  
+
+### 2. Deploy the reducer
+Follow the steps for pseudo-distributed deployment, but now edit the settings-reducer.yaml file to provide the appropriate connection settings for MongoDB and Minio from Step 1. Also, copy 'config/extra-hosts-reducer.yaml.template' to 'config/extra-hosts-reducer.yaml' and edit it, adding a host:IP mapping for each combiner you plan to deploy. Then you can start the reducer: 
+
+```bash
+sudo docker-compose -f config/reducer.yaml -f config/extra-hosts-reducer.yaml up 
+```
+
+### 3. Deploy combiners
+Edit 'config/settings-combiner.yaml' to provide a name for the combiner (used as a unique identifier for the combiner in the network), a hostname (which is used by reducer and clients to connect to combiner RPC), and the port (default is 12080, make sure to allow access to this port in your security group/firewall settings). Also, provide the IP and port for the reducer under the 'controller' tag. Then deploy the combiner: 
+
+```bash
+sudo docker-compose -f config/combiner.yaml up 
+```
+
+Optional: Repeat the same steps for the second combiner node. Make sure to provide unique names for the two combiners. 
+
+> Note that it is not currently possible to use the host's IP address as 'host'. This is due to gRPC not being able to handle certificates based on IP. 
+
+### 4. Attach clients to the FEDn network
+Once the FEDn network is deployed, you can attach clients to it in the same way as for the pseudo-distributed deployment. You need to provide clients with DNS information for all combiner nodes in the network. For example, to start 5 unique MNIST clients on a single host, copy  'config/extra-hosts-clients.template.yaml' to 'test/mnist-keras/extra-hosts.yaml' and edit it to provide host:IP mappings for the combiners in the network. Then, from 'test/mnist-keras':
+
+```bash
+sudo docker-compose -f docker-compose.yaml -f config/extra-hosts-client.yaml up --scale client=5 
+```
diff --git a/docs/index.html b/docs/index.html
@@ -2,7 +2,7 @@
 <html lang="en">
 <head>
   <meta charset="UTF-8">
-  <title>Document</title>
+  <title>FEDn Docs</title>
   <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
   <meta name="description" content="Description">
   <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">

diff --git a/docs/releasenotes.md b/docs/releasenotes.md
@@ -1,5 +1,13 @@
 # Release Notes
 
+## v0.2.2
+
+### What's new?
+- The MNIST examples (Keras and PyTorch) have been updated so that they now bundle the example data in .npz format.
+
+### Other
+- Docs updates 
+
 ## v0.2.1
 
 ### What's new?

diff --git a/test/mnist-keras/README.md b/test/mnist-keras/README.md
@@ -6,7 +6,8 @@ This classic example of hand-written text recognition is well suited both as a l
 ## Setting up a client
 
 ### Provide local training and test data
-This example assumes that trainig and test data is available at 'test/mnist-keras/data/train.csv' and 'test/mnist-keras/data/test.csv'. Data can be downloaded from e.g. https://www.kaggle.com/oddrationale/mnist-in-csv, but there are several hosted versions available. To make testing flexible, each client subsamples from this dataset upon first invokation of a training request, then cache this subsampled data for use for the remaining lifetime of the client. 
+This example is provided with the MNIST dataset from https://s3.amazonaws.com/img-datasets/mnist.npz in 'data/mnist.npz'.
+To make testing flexible, each client subsamples from this dataset upon first invocation of a training request, then caches this subsampled data for use for the remaining lifetime of the client. It is thus normal that the first training round takes a bit longer than subsequent ones.  
 
 ### Create and upload a compute package
 To train a model in FEDn you provide the client code (in 'client') as a tarball. For convenience, we ship a pre-made package. Whenever you make updates to the client code (such as altering any of the settings in the above mentioned file), you need to re-package the code (as a .tar.gz archive) and copy the updated package to 'packages'. From 'test/mnist-keras':

diff --git a/test/mnist-keras/client/data/read_data.py b/test/mnist-keras/client/data/read_data.py
@@ -1,16 +1,22 @@
-import numpy
-import pandas as pd
+import numpy as np
+import os
+import tensorflow as tf
 from sklearn.model_selection import train_test_split
-
 import keras
-import numpy
 
-def read_data(filename, nr_examples=1000,bias=0.7):
+def read_data(path, nr_examples=1000,trainset=True):
     """ Helper function to read and preprocess data for training with Keras. """
 
-    data = numpy.array(pd.read_csv(filename))
-    X = data[:, 1::]
-    y = data[:, 0]
+    pack = np.load(path)
+
+    if trainset:
+        X = pack['x_train']
+        y = pack['y_train']
+    else:
+        X = pack['x_test']
+        y = pack['y_test']
+
+    print("X shape: ", X.shape, ", y shape: ", y.shape)
 
     sample_fraction = float(nr_examples)/len(y)  
 
@@ -22,9 +28,10 @@ def read_data(filename, nr_examples=1000,bias=0.7):
     # Input image dimensions
     img_rows, img_cols = 28, 28
 
-    # The data, split between train and test sets
-    X = X.reshape(X.shape[0], img_rows, img_cols, 1)
+    ## The data, split between train and test sets
+    #X = X.reshape(X.shape[0], img_rows, img_cols, 1)
     X = X.astype('float32')
+    X = np.expand_dims(X,-1)
     X /= 255
     y = keras.utils.to_categorical(y, len(classes))
     return  (X, y, classes)
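
For reference, the preprocessing performed by the updated `read_data` can be sketched with NumPy alone. This is an illustration under stated assumptions: the helper name `preprocess` is hypothetical, random arrays stand in for the contents of `mnist.npz`, and `np.eye` indexing is swapped in for `keras.utils.to_categorical` so the snippet has no Keras dependency.

```python
import numpy as np

def preprocess(X, y, num_classes=10):
    """Scale images to [0, 1], add a trailing channel axis, and one-hot encode labels."""
    X = X.astype('float32')
    X = np.expand_dims(X, -1)   # (n, 28, 28) -> (n, 28, 28, 1)
    X /= 255
    # Stand-in for keras.utils.to_categorical: row i of the identity is the one-hot vector for class i
    y_onehot = np.eye(num_classes, dtype='float32')[y]
    return X, y_onehot

# Synthetic stand-in for the 'x_train' / 'y_train' arrays stored in mnist.npz
rng = np.random.default_rng(0)
X_raw = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
y_raw = np.array([3, 0, 9, 1, 3])

X, y = preprocess(X_raw, y_raw)
```

Because the arrays in the `.npz` file are already shaped `(n, 28, 28)`, adding the channel axis with `np.expand_dims` is enough and the earlier explicit `reshape` is no longer required.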

diff --git a/test/mnist-keras/client/train.py b/test/mnist-keras/client/train.py
@@ -3,7 +3,7 @@
 import tensorflow as tf
 import tensorflow.keras as keras 
 import tensorflow.keras.models as krm
-
+import numpy as np
 import pickle
 import yaml
 tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
@@ -18,23 +18,18 @@ def train(model,data,settings):
     # We are caching the partition in the container home dir so that
     # the same training subset is used for each iteration for a client. 
     try:
-        with open('/app/mnist_train/x.pyb','rb') as fh:
-            x_train=pickle.loads(fh.read())
-        with open('/app/mnist_train/y.pyb','rb') as fh:
-            y_train=pickle.loads(fh.read())
-        with open('/app/mnist_train/classes.pyb','rb') as fh:
-            classes=pickle.loads(fh.read())
+        x_train = np.load('/app/local_dataset/x_train.npy')
+        y_train = np.load('/app/local_dataset/y_train.npy')
     except:
-        (x_train, y_train, classes) = read_data(data,nr_examples=settings['training_samples'])
+        (x_train, y_train, classes) = read_data(data,
+                                                nr_examples=settings['training_samples'],
+                                                trainset=True)
 
         try:
-            os.mkdir('/app/mnist_train')
-            with open('/app/mnist_train/x.pyb','wb') as fh:
-                fh.write(pickle.dumps(x_train))
-            with open('/app/mnist_train/y.pyb','wb') as fh:
-                fh.write(pickle.dumps(y_train))
-            with open('/app/mnist_train/classes.pyb','wb') as fh:
-                fh.write(pickle.dumps(classes))
+            os.makedirs('/app/local_dataset', exist_ok=True)
+            np.save('/app/local_dataset/x_train.npy',x_train)
+            np.save('/app/local_dataset/y_train.npy',y_train)
+
         except:
             pass
 
@@ -59,6 +54,6 @@ def train(model,data,settings):
     model = create_seed_model()
     model.set_weights(weights)
 
-    model = train(model,'../data/train.csv',settings)
+    model = train(model,'/app/data/mnist.npz',settings)
     helper.save_model(model.get_weights(),sys.argv[2])
 
diff --git a/test/mnist-keras/client/validate.py b/test/mnist-keras/client/validate.py
@@ -1,15 +1,13 @@
 import sys
 import tensorflow as tf 
 tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
-import tensorflow.keras as keras
-import tensorflow.keras.models as krm
 from data.read_data import read_data
 import pickle
 import json
 from sklearn import metrics
-import numpy
 import os
 import yaml
+import numpy as np
 
 def validate(model,data,settings):
     print("-- RUNNING VALIDATION --", flush=True)
@@ -21,48 +19,37 @@ def validate(model,data,settings):
 
     # Training error (Client validates global model on same data as it trains on.)
     try:
-        with open('/app/mnist_train/x.pyb','rb') as fh:
-            x_train=pickle.loads(fh.read())
-        with open('/app/mnist_train/y.pyb','rb') as fh:
-            y_train=pickle.loads(fh.read())
-        with open('/app/mnist_train/classes.pyb','rb') as fh:
-            classes=pickle.loads(fh.read())
+        x_train = np.load('/app/local_dataset/x_train.npy')
+        y_train = np.load('/app/local_dataset/y_train.npy')
     except:
-        (x_train, y_train, classes) = read_data(data,nr_examples=settings['training_samples'])
-
+        (x_train, y_train, classes) = read_data(data,
+                                                nr_examples=settings['training_samples'],
+                                                trainset=True)
         try:
-            os.mkdir('/app/mnist_train')
-            with open('/app/mnist_train/x.pyb','wb') as fh:
-                fh.write(pickle.dumps(x_train))
-            with open('/app/mnist_train/y.pyb','wb') as fh:
-                fh.write(pickle.dumps(y_train))
-            with open('/app/mnist_train/classes.pyb','wb') as fh:
-                fh.write(pickle.dumps(classes))
+            os.makedirs('/app/local_dataset', exist_ok=True)
+            np.save('/app/local_dataset/x_train.npy', x_train)
+            np.save('/app/local_dataset/y_train.npy', y_train)
+
         except:
             pass
 
     # Test error (Client has a small dataset set aside for validation)
     try:
-        with open('/app/mnist_test/x_test.pyb','rb') as fh:
-            x_test=pickle.loads(fh.read())
-        with open('/app/mnist_test/y_test.pyb','rb') as fh:
-            y_test=pickle.loads(fh.read())
-        with open('/app/mnist_test/classes_test.pyb','rb') as fh:
-            classes_test=pickle.loads(fh.read())
+        x_test = np.load('/app/local_dataset/x_test.npy')
+        y_test = np.load('/app/local_dataset/y_test.npy')
     except:
-        (x_test, y_test, classes_test) = read_data("../data/test.csv",nr_examples=settings['test_samples'])
-
+        (x_test, y_test, classes) = read_data(data,
+                                                nr_examples=settings['test_samples'],
+                                                trainset=False)
         try:
-            os.mkdir('/app/mnist_test')
-            with open('/app/mnist_test/x_test.pyb','wb') as fh:
-                fh.write(pickle.dumps(x_test))
-            with open('/app/mnist_test/y_test.pyb','wb') as fh:
-                fh.write(pickle.dumps(y_test))
-            with open('/app/mnist_test/classes_test.pyb','wb') as fh:
-                fh.write(pickle.dumps(classes_test))
+            os.makedirs('/app/local_dataset', exist_ok=True)
+            np.save('/app/local_dataset/x_test.npy', x_test)
+            np.save('/app/local_dataset/y_test.npy', y_test)
+
         except:
             pass
 
+
     try:
         model_score = model.evaluate(x_train, y_train, verbose=0)
         print('Training loss:', model_score[0])
@@ -105,7 +92,7 @@ def validate(model,data,settings):
     model = create_seed_model()
     model.set_weights(weights)
 
-    report = validate(model,'../data/train.csv',settings)
+    report = validate(model,'/app/data/mnist.npz',settings)
 
     with open(sys.argv[2],"w") as fh:
         fh.write(json.dumps(report))
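
The load-or-create caching pattern used in `train.py` and `validate.py` can be sketched as a small helper. This is a hedged illustration, not code from the commit: `load_or_create` and `make` are hypothetical names, and a temporary directory stands in for `/app/local_dataset`. Two NumPy details matter here: arrays are read back with `np.load` (not `np.save`), and `np.save` appends `.npy` to any filename that does not already end in it, so using the `.npy` extension explicitly keeps the save and load paths identical.

```python
import os
import tempfile
import numpy as np

def load_or_create(cache_dir, name, create):
    """Return the cached array if present; otherwise build it, cache it, and return it."""
    path = os.path.join(cache_dir, name + '.npy')
    try:
        return np.load(path)
    except FileNotFoundError:
        data = create()
        # exist_ok=True avoids failing when another array was already cached in this directory
        os.makedirs(cache_dir, exist_ok=True)
        np.save(path, data)
        return data

calls = []

def make():
    """Stand-in for the expensive subsampling step; records each invocation."""
    calls.append(1)
    return np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    cache_dir = os.path.join(tmp, 'local_dataset')
    first = load_or_create(cache_dir, 'x_train', make)    # cache miss: calls make()
    second = load_or_create(cache_dir, 'x_train', make)   # cache hit: read from disk
```

Catching only `FileNotFoundError` rather than using a bare `except:` keeps unrelated errors, such as a corrupt cache file, visible instead of silently triggering a re-read of the dataset.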

diff --git a/test/mnist-keras/data/mnist.npz b/test/mnist-keras/data/mnist.npz
diff --git a/test/mnist-keras/package/mnist.tar.gz b/test/mnist-keras/package/mnist.tar.gz