# Distributed DNN Training Architecture Simulator: DDTAS 
# README
Distributed DNN Training Architecture Simulator (DDTAS) is a tool used to compare the performance of various distributed training architectures for a user-specified Deep Neural Network (DNN). The tool allows the user to modify **any** of the following variables in the simulated network:

| Variable | Name in Code | Unit |
| :----------: | :----------: | :----------: |
| Processor amount | max_processors | None |
| Processing power | C | GFLOP/s |
| Residual memory | R | GB |
| Residual fraction | E | None |
| Bandwidth | B | GB/s |
| Layer amount | max_layers | None |
| Network size | G_base | GFLOP |
| Memory required | M_base | GB |
| Server network size | G_server | GFLOP |
| Client network size | G_client | GFLOP |
| Aggregation network size | G_agg | GFLOP |
| Server memory required | M_server | GB |
| Client memory required | M_client | GB |
| Aggregation-server memory required | M_agg | GB |
| Intermediate results | D_client_out | GB |
| Weights matrix | D_weights | GB |
| Individual sample size | batch_individual_file_size | GB |
| Batch size | batch_size | None |
| Batches per client | total_batches_XX\* | None |
| Epochs | epochs | None |
| Maximum calculated paths | max_paths | None |
| Minimum split | min_split | None |
| Minimum clients | client_amount | None |
| Amount of events | event_amount | None |

*\* Replace XX with the architecture abbreviation of your choice (PL, FL, SL, PSL, FSL).*

All of these variables can be modified in **config.ipynb**. Based on the user-specified variables (network conditions), the tool will then output the **fastest training time** and the accompanying **network graph** for the specified DNN. 
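
As a rough illustration only, a configuration cell might assign a few of the variables from the table as plain Python values. The variable names come from the table above; every value below is a placeholder, the sanity check is not part of DDTAS, and the actual layout of *config.ipynb* may differ.

```python
# Hypothetical excerpt of a configuration cell; all values are placeholders.
max_processors = 8   # Processor amount (unitless)
C = 50.0             # Processing power, GFLOP/s
B = 1.0              # Bandwidth, GB/s
max_layers = 10      # Layer amount (unitless)
G_base = 2.0         # Network size, GFLOP
M_base = 0.5         # Memory required, GB
batch_size = 32      # Batch size (unitless)
epochs = 5           # Epochs (unitless)

# Simple sanity check (an assumption, not a DDTAS feature):
# every quantity listed here is expected to be strictly positive.
config = {
    "max_processors": max_processors,
    "C": C,
    "B": B,
    "max_layers": max_layers,
    "G_base": G_base,
    "M_base": M_base,
    "batch_size": batch_size,
    "epochs": epochs,
}
invalid = [name for name, value in config.items() if value <= 0]
print("Invalid entries:", invalid)
```

Keeping the units from the table next to each value as a comment helps avoid mixing, for example, GB with MB when copying measurements into the configuration.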



## Supported Distributed Training Architectures
DDTAS currently supports the following five distributed training architectures:
1. Pipeline Learning (PL)
2. Federated Learning (FL)
3. Split Learning (SL)
4. Parallel Split Learning (PSL)
5. Federated Split Learning (FSL)



## Necessary Libraries
- Most of the libraries needed to run DDTAS, including [PyTorch](https://pytorch.org/) and [PySyft](https://blog.openmined.org/tag/pysyft/) 0.2.9, are installed by following the instructions [here](https://github.com/OpenMined/PySyft/tree/PySyft/syft_0.2.x).
- [Pandas](https://pandas.pydata.org/) is also required to read the outputs of the simulator.
- In case it is not installed already, install [NumPy](https://numpy.org/).

## Optional Libraries
- To monitor progress in a multi-variate analysis, install [tqdm](https://pypi.org/project/tqdm/). 

## How to Run
DDTAS can be run in multiple modes:

### Single Simulation Mode
1. Enter all parameters listed in the table above into *config.ipynb*. Only save the file; there is no need to run it. To obtain the values of these parameters from a real NN implemented in PyTorch, use the measurement tools described under *How to Obtain the Simulator Parameters for your DNN*.
2. In *RUN_SIMULATOR.ipynb*, set the MODE variable to "SINGLE".
3. Define the output file name by modifying the *file_name* variable (without an extension).
4. In the *logger_individual()* method of *Logger.ipynb*, define the output path for the results. All results are saved in CSV format under the file name specified in the previous step. The user is free to modify *logger_individual()* to output the results in another format.
5. Run all cells in *RUN_SIMULATOR.ipynb* and wait for the result. 

### Multiple Simulation Mode
It is possible to run a sequence of multiple events in which some variables remain constant while others change randomly from simulation to simulation. This mode is well suited to sensitivity analysis. To use it:
1. Enter the necessary parameters into *config.ipynb*. Only save the file; there is no need to run it.
2. In *config_mode_multiple.ipynb*, list the variables whose influence is to be analyzed by modifying the list *X_axis_variables*.
3. Modify the *X_dict* dictionary to set the fixed *X*-axis values of each analyzed variable. The minimum and maximum of these values also define the range a variable can take when it is used as the randomized variable. Save the file.
4. In the *logger_multiple()* method of *Logger.ipynb*, define the output path for the simulation results.
5. In *RUN_SIMULATOR.ipynb*, set the MODE variable to "MULTIPLE".
6. Run all cells and wait for the result.
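
The role of *X_dict* described in step 3 can be sketched as follows. This is a minimal illustration, not the simulator's actual code: the example values, the `build_events` helper, and the choice of uniform sampling are all assumptions. Only the names *X_axis_variables* and *X_dict* come from DDTAS.

```python
import random

# Hypothetical example values; the real config_mode_multiple.ipynb may differ.
X_axis_variables = ["B", "C"]
X_dict = {
    "B": [0.1, 0.5, 1.0],      # Bandwidth, GB/s
    "C": [10.0, 50.0, 100.0],  # Processing power, GFLOP/s
}


def build_events(analyzed, x_dict, rng):
    """Return one event per fixed X-axis value of `analyzed`.

    Every other variable is drawn uniformly (an assumption) from the
    [min, max] range spanned by its own X_dict values.
    """
    events = []
    for x_value in x_dict[analyzed]:
        event = {analyzed: x_value}
        for name, values in x_dict.items():
            if name != analyzed:
                event[name] = rng.uniform(min(values), max(values))
        events.append(event)
    return events


rng = random.Random(0)  # seeded so repeated runs give the same events
events_by_variable = {
    name: build_events(name, X_dict, rng) for name in X_axis_variables
}

for name, events in events_by_variable.items():
    print(f"Analyzing {name}:")
    for event in events:
        print("  ", event)
```

Seeding the random generator makes a sensitivity analysis reproducible, which is useful when comparing architectures across runs.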

### Running Simulations with Parameters from a Real DNN
The procedure is almost identical to the two modes above, except that the simulator parameters must first be obtained from a real DNN. Extracting them by hand from PyTorch code is complicated, so we recommend using the measurement tools described in the next section. Once these tools have produced the NN and processing power parameters, follow the steps of either mode above.


## How to Obtain the Simulator Parameters for your DNN

### MEASUREMENT_TOOL_NN
This tool obtains the NN parameters required by the simulator from an NN already implemented in PyTorch. To use it:
1. Replace the contents of the class Net_Complete() with the contents of the user's NN of choice. This class should contain the entire NN, without splits.
2. Replace the contents of the classes Net_server() and Net_client() with the splits of the user's NN of choice. As their names suggest, place the server-side NN in Net_server() and the client-side NN in Net_client().
3. Run the entire notebook ("Run all cells"). The printed output gives the values of the necessary variables.

### MEASUREMENT_TOOL_POWER
The simulator depends heavily on the "available processor power" parameter, named "C" in the code. The steps to obtain this parameter are explained below:
1. Follow the same procedure as with the previous tool by replacing the NN already in the code with the desired NN. In this case, the only class that requires replacement is Net_Complete(). 
2. Modify the *training_data* variable to point towards a sample of the dataset with which the NN will be trained.
3. Modify the variables IMG_SIZE, BATCH_SIZE, and LR (learning rate). This is important, as the tool will execute a simple training scenario on the specified NN and measure the time it took the machine to finish all calculations.
4. Replace the value of the variable NN_size with the G_base output from the previous tool.
5. *Optional*: Modify the training sequence outlined in the last cell of the notebook. This sequence will most likely not require any significant modification, as it follows a standard NN training command structure (forward pass, loss, backward pass, weight adjustment). 
6. Run all cells in the code. This will prompt the tool to train Net_Complete() with a batch sample of the input data. 
7. Once the tool is done quick-training the NN, the resulting output is the processing power allocated by the user's machine to the training of this NN, expressed in GFLOP/s. 
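
The idea behind this measurement can be sketched as dividing the amount of work by the time it took. The snippet below is a simplified illustration under stated assumptions, not the tool's actual code: `estimate_processing_power` and `dummy_train_step` are hypothetical names, the workload is a stand-in for a real training step, and NN_size is assumed to be the work of one training step in GFLOP.

```python
import time


def estimate_processing_power(nn_size_gflop, train_step, steps=1):
    """Estimate processing power in GFLOP/s as (work done) / (elapsed time).

    Assumes each call to `train_step` performs `nn_size_gflop` GFLOP of work.
    """
    start = time.perf_counter()
    for _ in range(steps):
        train_step()
    elapsed = time.perf_counter() - start
    return nn_size_gflop * steps / elapsed


def dummy_train_step():
    """Stand-in workload; a real step would run forward pass, loss,
    backward pass, and weight adjustment on the NN."""
    total = 0.0
    for i in range(200_000):
        total += i * 0.5
    return total


NN_size = 0.002  # placeholder; in practice, the G_base value from MEASUREMENT_TOOL_NN
C = estimate_processing_power(NN_size, dummy_train_step, steps=3)
print(f"Estimated C: {C:.4f} GFLOP/s")
```

Timing several steps and averaging, as done here with `steps=3`, tends to reduce the influence of one-off delays such as the first step's warm-up.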


## Allocation Algorithms
Each of the individual allocation algorithms can be modified to the user's liking in the files:
- *ALGORITHM_PL.ipynb*
- *ALGORITHM_FL.ipynb*
- *ALGORITHM_SL.ipynb*
- *ALGORITHM_PSL.ipynb*
- *ALGORITHM_FSL.ipynb*