Commit 01f2dff

distributed.md doc refine (#291)

* update doc
* Modify the directory

  Signed-off-by: Yue, Wenjiao <wenjiao.yue@intel.com>
* Modify body format

  Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>

---------

Signed-off-by: Yue, Wenjiao <wenjiao.yue@intel.com>
Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>
Co-authored-by: Yue, Wenjiao <wenjiao.yue@intel.com>

1 parent 1f6163f commit 01f2dff

File tree

1 file changed (+39 −17 lines)

docs/source/distributed.md

Lines changed: 39 additions & 17 deletions
@@ -1,27 +1,49 @@
 Distributed Training and Inference (Evaluation)
 ============
 
+1. [Introduction](#introduction)
+2. [Supported Feature Matrix](#supported-feature-matrix)
+3. [Get Started with Distributed Training and Inference API](#get-started-with-distributed-training-and-inference-api)
+
+    3.1. [Option 1: Pure Yaml Configuration](#option-1-pure-yaml-configuration)
+
+    3.2. [Option 2: User Defined Training Function](#option-2-user-defined-training-function)
+
+    3.3. [Horovodrun Execution](#horovodrun-execution)
+
+    3.4. [Security](#security)
+4. [Examples](#examples)
+
+    4.1. [Pytorch Examples](#pytorch-examples)
+
+    4.2. [Tensorflow Examples](#tensorflow-examples)
 ## Introduction
 
 Neural Compressor uses [horovod](https://github.com/horovod/horovod) for distributed training.
 
-## horovod installation
-
 Please check horovod installation documentation and use following commands to install horovod:
 ```
 pip install horovod
 ```
 
-## Distributed training and inference (evaluation)
+## Supported Feature Matrix
+Distributed training and inference are supported in PyTorch and TensorFlow currently.
+| Framework  | Type    | Distributed Support |
+|------------|---------|:-------------------:|
+| PyTorch    | QAT     |       &#10004;      |
+| PyTorch    | PTQ     |       &#10004;      |
+| TensorFlow | PTQ     |       &#10004;      |
+| Keras      | Pruning |       &#10004;      |
 
-Distributed training and inference are supported in PyTorch and TensorFlow currently. (i.e. PyTorch currently supports distributed QAT and PTQ. TensorFlow currently supports distributed PTQ and Keras-backend Pruning). To enable distributed training or inference, the steps are:
+## Get Started with Distributed Training and Inference API
+To enable distributed training or inference, the steps are:
 
 1. Setting up distributed training or inference scripts. We have 2 options here:
   - Option 1: Enable distributed training or inference with pure yaml configuration. In this case, Neural Compressor builtin training function is used.
  - Option 2: Pass the user defined training function to Neural Compressor. In this case, please follow the horovod documentation and below example to know how to write such training function with horovod on different frameworks.
 2. use horovodrun to execute your program.
 
-### Option 1: pure yaml configuration
+### Option 1: Pure Yaml Configuration
 
 To enable distributed training in Neural Compressor, user only need to add a field: `Distributed: True` in dataloader configuration:
 
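The yaml itself is elided between the hunks above. Purely as a hedged illustration (not the file's actual configuration), a minimal dataloader block carrying the flag might look like the sketch below; every key except `distributed` is hypothetical and the exact schema depends on the Neural Compressor version:

```yaml
# Hypothetical sketch only; the real yaml lives in the repo's examples.
# The relevant addition is the `distributed: True` flag on the dataloader.
evaluation:
  accuracy:
    dataloader:
      batch_size: 32          # per-process batch size (illustrative value)
      distributed: True       # enable the distributed dataloader
      dataset:
        ImageRecord:
          root: /path/to/evaluation/dataset
```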

@@ -58,7 +80,7 @@ result = quantizer.strategy.evaluation_result # result -> (accuracy, evaluation
 ```
 
 
-### Option2: user defined training function
+### Option 2: User Defined Training Function
 
 Neural Compressor supports User defined PyTorch training function for distributed training which requires user to modify training script following horovod requirements. We provide a MNIST example to show how to do that and following are the steps for PyTorch.
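Before following the horovod-specific steps, it may help to see what data-parallel training does conceptually. The sketch below is a plain-Python stand-in (no horovod installed or required) for the gradient allreduce-average that a horovod `DistributedOptimizer` performs each step; all names and values are illustrative, not part of any real API:

```python
# Plain-Python illustration of horovod-style gradient averaging ("allreduce").
# In real horovod code this is done for you by hvd.DistributedOptimizer.

def allreduce_average(per_worker_grads):
    """Average each parameter's gradient across all workers.

    per_worker_grads: one list of per-parameter gradients per worker.
    Returns the averaged gradient list every worker would apply.
    """
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [
        sum(worker[p] for worker in per_worker_grads) / n_workers
        for p in range(n_params)
    ]

# Two workers computed gradients on their own data shards; after the
# allreduce, all workers apply the same averaged update, keeping the
# model replicas in sync.
grads_worker0 = [0.2, -0.4]
grads_worker1 = [0.6, 0.0]
print(allreduce_average([grads_worker0, grads_worker1]))
```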

@@ -119,7 +141,7 @@ component.eval_func = test_func
 model = component()
 ```
 
-### horovodrun
+### Horovodrun Execution
 
 User needs to use horovodrun to execute distributed training. For more usage, please refer to [horovod documentation](https://horovod.readthedocs.io/en/stable/running_include.html).

@@ -133,22 +155,22 @@ For example, the following command means that two processes will be assigned to
 horovodrun -np 2 -H node-001:1,node-002:1 python example.py
 ```
 
-## security
+### Security
 
-horovodrun requires user set up SSH on all hosts without any prompts. To do distributed training with Neural Compressor, user needs to ensure the SSH setting on all hosts.
+Horovodrun requires user set up SSH on all hosts without any prompts. To do distributed training with Neural Compressor, user needs to ensure the SSH setting on all hosts.
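A hypothetical sketch of that prompt-free SSH setup follows. The host names are placeholders, only the key generation is meant to run as-is, and the per-host steps (shown as comments) depend on your cluster:

```shell
# Hypothetical sketch: passwordless SSH of the kind horovodrun needs.
# node-001 / node-002 are placeholder host names, so the per-host steps
# are left as comments rather than live commands.

KEY_DIR="${KEY_DIR:-$(mktemp -d)}"
KEY="$KEY_DIR/id_ed25519"
ssh-keygen -t ed25519 -N "" -f "$KEY" -q   # empty passphrase -> no prompts

# Install the public key on every worker host (run once per host):
#   ssh-copy-id -i "$KEY.pub" node-001
#   ssh-copy-id -i "$KEY.pub" node-002
# Verify: with BatchMode, ssh fails instead of prompting for a password:
#   ssh -o BatchMode=yes node-001 true && echo "node-001 ok"

echo "key pair generated under $KEY_DIR"
```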

-## Following examples are supported
-PyTorch:
+## Examples
+### PyTorch Examples:
 - PyTorch example-1: MNIST
-  - Please follow this README.md exactly:[MNIST](../examples/pytorch/image_recognition/mnist)
+  - Please follow this README.md exactly:[MNIST](../../examples/pytorch/image_recognition/mnist)
 
 - PyTorch example-2: QAT (Quantization Aware Training)
-  - Please follow this README.md exactly:[QAT](../examples/pytorch/image_recognition/torchvision_models/quantization/qat/eager/distributed)
+  - Please follow this README.md exactly:[QAT](../../examples/pytorch/image_recognition/torchvision_models/quantization/qat/eager/distributed)
 
-TensorFlow:
+### TensorFlow Examples:
 - TensorFlow example-1: 'ResNet50 V1.0' PTQ (Post Training Quantization) with distributed inference
-  - Step-1: Please cd (change directory) to the [TensorFlow Image Recognition Example](../examples/tensorflow/image_recognition) and follow the readme to run PTQ, ensure that PTQ of 'ResNet50 V1.0' can be successfully executed.
-  - Step-2: We only need to modify the [resnet50_v1.yaml](../examples/tensorflow/image_recognition/tensorflow_models/quantization/ptq/resnet50_v1.yaml), add a line 'distributed: True' in the 'evaluation' field.
+  - Step-1: Please cd (change directory) to the [TensorFlow Image Recognition Example](../../examples/tensorflow/image_recognition) and follow the readme to run PTQ, ensure that PTQ of 'ResNet50 V1.0' can be successfully executed.
+  - Step-2: We only need to modify the [resnet50_v1.yaml](../../examples/tensorflow/image_recognition/tensorflow_models/quantization/ptq/resnet50_v1.yaml), add a line 'distributed: True' in the 'evaluation' field.
 ```
 # only need to modify the resnet50_v1.yaml, add a line 'distributed: True'
 ......
@@ -175,4 +197,4 @@ TensorFlow:
 horovodrun -np 2 -H your_node1_name:1,your_node2_name:1 python main.py --tune --config=resnet50_v1.yaml --input-graph=/PATH/TO/resnet50_fp32_pretrained_model.pb --output-graph=./nc_resnet50_v1.pb
 ```
 - TensorFlow example-2: 'resnet_v2' pruning on Keras backend with distributed training and inference
-  - Please follow this README.md exactly:[Pruning](../examples/tensorflow/image_recognition/resnet_v2)
+  - Please follow this README.md exactly:[Pruning](../../examples/tensorflow/image_recognition/resnet_v2)
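For Step-2 of the ResNet50 PTQ example, the diff shows only the elided '......' placeholder around the added line. Purely as a hedged illustration, the change sits under the 'evaluation' field roughly as sketched below; every key except `distributed` stands in for the example's own existing yaml and may differ from the real file:

```yaml
# Hypothetical sketch: only `distributed: True` is the actual one-line change
# described in Step-2; the surrounding keys are illustrative.
evaluation:
  accuracy:
    dataloader:
      distributed: True
```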
