Neural Compressor uses [horovod](https://github.com/horovod/horovod) for distributed training.
## Horovod Installation
Please check the horovod installation documentation and use the following command to install horovod:
```
pip install horovod
```
## Supported Feature Matrix

Distributed training and inference are supported in PyTorch and TensorFlow currently.

| Framework  | Type    | Distributed Support |
|------------|---------|:-------------------:|
| PyTorch    | QAT     |          ✔          |
| PyTorch    | PTQ     |          ✔          |
| TensorFlow | PTQ     |          ✔          |
| Keras      | Pruning |          ✔          |
## Get Started with Distributed Training and Inference API
To enable distributed training or inference, the steps are:
1. Set up the distributed training or inference script. There are two options:
   - Option 1: Enable distributed training or inference with pure yaml configuration. In this case, the Neural Compressor built-in training function is used.
   - Option 2: Pass a user-defined training function to Neural Compressor. In this case, please follow the horovod documentation and the example below to learn how to write such a training function with horovod for different frameworks.
2. Use horovodrun to execute your program.
### Option 1: Pure Yaml Configuration
To enable distributed training in Neural Compressor, the user only needs to add a field `distributed: True` in the dataloader configuration:
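For illustration, a minimal sketch of such a dataloader section is shown below; every key except `distributed` is a placeholder standing in for the user's existing configuration, not taken from a specific shipped example:

```yaml
evaluation:
  accuracy:
    dataloader:
      batch_size: 32        # placeholder; keep your existing values
      distributed: True     # the only field needed to enable distributed mode
```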
### Option 2: User Defined Training Function
Neural Compressor supports a user-defined PyTorch training function for distributed training, which requires the user to modify the training script following horovod requirements. We provide an MNIST example to show how to do that; the steps below apply to PyTorch.
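The usual horovod modifications can be sketched as follows (a minimal sketch, assuming torch and horovod are installed; `model`, `optimizer`, and `dataset` stand for the user's existing objects, and the helper name `make_distributed` is illustrative, not a Neural Compressor API):

```python
def make_distributed(model, optimizer, dataset, batch_size=64):
    """Apply the standard horovod modifications to a PyTorch training setup."""
    import horovod.torch as hvd
    import torch

    hvd.init()                                    # 1. initialize the horovod context
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())   # 2. pin each process to one GPU

    # 3. shard the dataset so every worker sees a distinct slice
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, sampler=sampler)

    # 4. wrap the optimizer so gradients are allreduced across workers,
    #    then broadcast the initial state from rank 0 to all other ranks
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    return loader, optimizer
```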
The user then uses horovodrun to execute distributed training. For more usage, please refer to the [horovod documentation](https://horovod.readthedocs.io/en/stable/running_include.html).
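To illustrate what the `-np` (total process count) and `-H` (comma-separated `host:slots` list) flags of horovodrun mean, here is a small helper; the function itself is ours for illustration, not part of horovod or Neural Compressor:

```python
def horovodrun_cmd(num_procs, host_slots, script="train.py"):
    """Build a horovodrun command line from a process count and a
    host -> slots mapping, e.g. {"node-1": 1, "node-2": 1}."""
    hosts = ",".join(f"{host}:{slots}" for host, slots in host_slots.items())
    return ["horovodrun", "-np", str(num_procs), "-H", hosts, "python", script]

# Two processes in total, one per node:
cmd = horovodrun_cmd(2, {"node-1": 1, "node-2": 1})
# cmd == ["horovodrun", "-np", "2", "-H", "node-1:1,node-2:1", "python", "train.py"]
```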
Horovodrun requires the user to set up SSH on all hosts without any prompts. To do distributed training with Neural Compressor, the user needs to ensure this SSH setting on all hosts.
## Examples
### PyTorch Examples:
- PyTorch example-1: MNIST
- Please follow this README.md exactly: [MNIST](../../examples/pytorch/image_recognition/mnist)
- Please follow this README.md exactly: [QAT](../../examples/pytorch/image_recognition/torchvision_models/quantization/qat/eager/distributed)
### TensorFlow Examples:
- TensorFlow example-1: 'ResNet50 V1.0' PTQ (Post Training Quantization) with distributed inference
- Step-1: Please cd (change directory) to the [TensorFlow Image Recognition Example](../../examples/tensorflow/image_recognition) and follow the README to run PTQ; ensure that PTQ of 'ResNet50 V1.0' can be successfully executed.
- Step-2: We only need to modify the [resnet50_v1.yaml](../../examples/tensorflow/image_recognition/tensorflow_models/quantization/ptq/resnet50_v1.yaml) by adding a line 'distributed: True' in the 'evaluation' field.
```
# only need to modify the resnet50_v1.yaml, add a line 'distributed: True'
```
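The modified section would look roughly like this (a sketch: only `distributed: True` is the documented addition, and the surrounding keys are placeholders standing in for whatever the original resnet50_v1.yaml already contains):

```yaml
evaluation:
  accuracy:
    dataloader:
      batch_size: 32      # placeholder; keep the values from the original yaml
      distributed: True   # the added line that enables distributed inference
```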