Commit 01f2dff

distributed.md doc refine (#291)

* update doc
* Modify the directory

  Signed-off-by: Yue, Wenjiao <wenjiao.yue@intel.com>
* Modify body format

  Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>

---------

Signed-off-by: Yue, Wenjiao <wenjiao.yue@intel.com>
Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>
Co-authored-by: Yue, Wenjiao <wenjiao.yue@intel.com>

1 parent 1f6163f commit 01f2dff

File tree

1 file changed (+39 −17 lines)

docs/source/distributed.md

Lines changed: 39 additions & 17 deletions
@@ -1,27 +1,49 @@
 Distributed Training and Inference (Evaluation)
 ============
 
+1. [Introduction](#introduction)
+2. [Supported Feature Matrix](#supported-feature-matrix)
+3. [Get Started with Distributed Training and Inference API](#get-started-with-distributed-training-and-inference-api)
+
+    3.1. [Option 1: Pure Yaml Configuration](#option-1-pure-yaml-configuration)
+
+    3.2. [Option 2: User Defined Training Function](#option-2-user-defined-training-function)
+
+    3.3. [Horovodrun Execution](#horovodrun-execution)
+
+    3.4. [Security](#security)
+4. [Examples](#examples)
+
+    4.1. [Pytorch Examples](#pytorch-examples)
+
+    4.2. [Tensorflow Examples](#tensorflow-examples)
 ## Introduction
 
 Neural Compressor uses [horovod](https://github.com/horovod/horovod) for distributed training.
 
-## horovod installation
-
 Please check horovod installation documentation and use following commands to install horovod:
 ```
 pip install horovod
 ```
 
-## Distributed training and inference (evaluation)
+## Supported Feature Matrix
+Distributed training and inference are supported in PyTorch and TensorFlow currently.
+| Framework  | Type    | Distributed Support |
+|------------|---------|:-------------------:|
+| PyTorch    | QAT     |       &#10004;      |
+| PyTorch    | PTQ     |       &#10004;      |
+| TensorFlow | PTQ     |       &#10004;      |
+| Keras      | Pruning |       &#10004;      |
 
-Distributed training and inference are supported in PyTorch and TensorFlow currently. (i.e. PyTorch currently supports distributed QAT and PTQ. TensorFlow currently supports distributed PTQ and Keras-backend Pruning). To enable distributed training or inference, the steps are:
+## Get Started with Distributed Training and Inference API
+To enable distributed training or inference, the steps are:
 
 1. Setting up distributed training or inference scripts. We have 2 options here:
   - Option 1: Enable distributed training or inference with pure yaml configuration. In this case, Neural Compressor builtin training function is used.
  - Option 2: Pass the user defined training function to Neural Compressor. In this case, please follow the horovod documentation and below example to know how to write such training function with horovod on different frameworks.
 2. use horovodrun to execute your program.
 
-### Option 1: pure yaml configuration
+### Option 1: Pure Yaml Configuration
 
 To enable distributed training in Neural Compressor, user only need to add a field: `Distributed: True` in dataloader configuration:
 
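The yaml itself is elided between the hunks above. Purely as a hedged illustration (not the file's actual configuration), a minimal dataloader block carrying the flag might look like the sketch below; every key except `distributed` is hypothetical and the exact schema depends on the Neural Compressor version:

```yaml
# Hypothetical sketch only; the real yaml lives in the repo's examples.
# The relevant addition is the `distributed: True` flag on the dataloader.
evaluation:
  accuracy:
    dataloader:
      batch_size: 32          # per-process batch size (illustrative value)
      distributed: True       # enable the distributed dataloader
      dataset:
        ImageRecord:
          root: /path/to/evaluation/dataset
```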

@@ -58,7 +80,7 @@ result = quantizer.strategy.evaluation_result # result -> (accuracy, evaluation
 ```
 
 
-### Option2: user defined training function
+### Option 2: User Defined Training Function
 
 Neural Compressor supports User defined PyTorch training function for distributed training which requires user to modify training script following horovod requirements. We provide a MNIST example to show how to do that and following are the steps for PyTorch.
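Before following the horovod-specific steps, it may help to see what data-parallel training does conceptually. The sketch below is a plain-Python stand-in (no horovod installed or required) for the gradient allreduce-average that a horovod `DistributedOptimizer` performs each step; all names and values are illustrative, not part of any real API:

```python
# Plain-Python illustration of horovod-style gradient averaging ("allreduce").
# In real horovod code this is done for you by hvd.DistributedOptimizer.

def allreduce_average(per_worker_grads):
    """Average each parameter's gradient across all workers.

    per_worker_grads: one list of per-parameter gradients per worker.
    Returns the averaged gradient list every worker would apply.
    """
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [
        sum(worker[p] for worker in per_worker_grads) / n_workers
        for p in range(n_params)
    ]

# Two workers computed gradients on their own data shards; after the
# allreduce, all workers apply the same averaged update, keeping the
# model replicas in sync.
grads_worker0 = [0.2, -0.4]
grads_worker1 = [0.6, 0.0]
print(allreduce_average([grads_worker0, grads_worker1]))
```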

@@ -119,7 +141,7 @@ component.eval_func = test_func
 model = component()
 ```
 
-### horovodrun
+### Horovodrun Execution
 
 User needs to use horovodrun to execute distributed training. For more usage, please refer to [horovod documentation](https://horovod.readthedocs.io/en/stable/running_include.html).

@@ -133,22 +155,22 @@ For example, the following command means that two processes will be assigned to
 horovodrun -np 2 -H node-001:1,node-002:1 python example.py
 ```
 
-## security
+### Security
 
-horovodrun requires user set up SSH on all hosts without any prompts. To do distributed training with Neural Compressor, user needs to ensure the SSH setting on all hosts.
+Horovodrun requires user set up SSH on all hosts without any prompts. To do distributed training with Neural Compressor, user needs to ensure the SSH setting on all hosts.
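A hypothetical sketch of that prompt-free SSH setup follows. The host names are placeholders, only the key generation is meant to run as-is, and the per-host steps (shown as comments) depend on your cluster:

```shell
# Hypothetical sketch: passwordless SSH of the kind horovodrun needs.
# node-001 / node-002 are placeholder host names, so the per-host steps
# are left as comments rather than live commands.

KEY_DIR="${KEY_DIR:-$(mktemp -d)}"
KEY="$KEY_DIR/id_ed25519"
ssh-keygen -t ed25519 -N "" -f "$KEY" -q   # empty passphrase -> no prompts

# Install the public key on every worker host (run once per host):
#   ssh-copy-id -i "$KEY.pub" node-001
#   ssh-copy-id -i "$KEY.pub" node-002
# Verify: with BatchMode, ssh fails instead of prompting for a password:
#   ssh -o BatchMode=yes node-001 true && echo "node-001 ok"

echo "key pair generated under $KEY_DIR"
```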

-## Following examples are supported
-PyTorch:
+## Examples
+### PyTorch Examples:
 - PyTorch example-1: MNIST
-  - Please follow this README.md exactly:[MNIST](../examples/pytorch/image_recognition/mnist)
+  - Please follow this README.md exactly:[MNIST](../../examples/pytorch/image_recognition/mnist)
 
 - PyTorch example-2: QAT (Quantization Aware Training)
-  - Please follow this README.md exactly:[QAT](../examples/pytorch/image_recognition/torchvision_models/quantization/qat/eager/distributed)
+  - Please follow this README.md exactly:[QAT](../../examples/pytorch/image_recognition/torchvision_models/quantization/qat/eager/distributed)
 
-TensorFlow:
+### TensorFlow Examples:
 - TensorFlow example-1: 'ResNet50 V1.0' PTQ (Post Training Quantization) with distributed inference
-  - Step-1: Please cd (change directory) to the [TensorFlow Image Recognition Example](../examples/tensorflow/image_recognition) and follow the readme to run PTQ, ensure that PTQ of 'ResNet50 V1.0' can be successfully executed.
-  - Step-2: We only need to modify the [resnet50_v1.yaml](../examples/tensorflow/image_recognition/tensorflow_models/quantization/ptq/resnet50_v1.yaml), add a line 'distributed: True' in the 'evaluation' field.
+  - Step-1: Please cd (change directory) to the [TensorFlow Image Recognition Example](../../examples/tensorflow/image_recognition) and follow the readme to run PTQ, ensure that PTQ of 'ResNet50 V1.0' can be successfully executed.
+  - Step-2: We only need to modify the [resnet50_v1.yaml](../../examples/tensorflow/image_recognition/tensorflow_models/quantization/ptq/resnet50_v1.yaml), add a line 'distributed: True' in the 'evaluation' field.
 ```
 # only need to modify the resnet50_v1.yaml, add a line 'distributed: True'
 ......
@@ -175,4 +197,4 @@ TensorFlow:
 horovodrun -np 2 -H your_node1_name:1,your_node2_name:1 python main.py --tune --config=resnet50_v1.yaml --input-graph=/PATH/TO/resnet50_fp32_pretrained_model.pb --output-graph=./nc_resnet50_v1.pb
 ```
 - TensorFlow example-2: 'resnet_v2' pruning on Keras backend with distributed training and inference
-  - Please follow this README.md exactly:[Pruning](../examples/tensorflow/image_recognition/resnet_v2)
+  - Please follow this README.md exactly:[Pruning](../../examples/tensorflow/image_recognition/resnet_v2)
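For Step-2 of the ResNet50 PTQ example, the diff shows only the elided '......' placeholder around the added line. Purely as a hedged illustration, the change sits under the 'evaluation' field roughly as sketched below; every key except `distributed` stands in for the example's own existing yaml and may differ from the real file:

```yaml
# Hypothetical sketch: only `distributed: True` is the actual one-line change
# described in Step-2; the surrounding keys are illustrative.
evaluation:
  accuracy:
    dataloader:
      distributed: True
```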
