
Add Model Tuner front-end tool. (#3816)

**Summary**
Add a front-end tool for tuning (calibrating) the quantization parameters of a model. The parameters are tuned using the model accuracy as the optimization metric.

**Motivation**
The motivation for this tool is that, when using very aggressive quantization schemes (for example "SymmetricWithPower2Scale"), the accuracy difference between the floating-point model and the quantized model can be quite high (up to tens of percent). One such example is an internally designed model which has:
- 99 % accuracy for the floating-point model
- 81 % accuracy for the (initial) quantized model using the SymmetricWithPower2Scale schema
- 98 % accuracy for the (final) quantized model after tuning

Attached here is the console output of this Tool after running it for the model:
- [ModelTuner_Log.txt](https://github.com/pytorch/glow/files/3886201/ModelTuner_Log.txt)

**Algorithm**
The quantization parameters are initially chosen such that no saturation occurs (the quantized range includes the min/max of the profile). For some tensors whose histogram exhibits outlier values, it might be better to use quantization parameters which saturate the outliers, with the benefit of a smaller quantization step for the bulk of the histogram.
The tuning algorithm "plays" with the **scale** quantization parameter: for each node it sequentially tries new values around the original one and picks the one which provides the best accuracy (it tries original_scale, original_scale/2 and original_scale/4). One might object that this approach has a philosophical problem: the algorithm over-fits the quantization parameters to a given dataset. From a practical point of view, however, it is better to accept an over-fit of a couple of percent than an under-fit of tens of percent in the quantization mechanism.

**Refactorizations**

- Refactored the Loader class a little in order to provide more information through its API for the tools which use the Loader.
- Refactored Base.cpp to provide a function for validating the quantization parameters.
- Refactored ProtobufLoader.cpp to add function to retrieve the unique input placeholder for a model with a single input placeholder.

**Documentation**
doc/ModelTuner.md

**Test Plan**
None
Pull Request resolved: #3816

Differential Revision: D19166714

Pulled By: jfix71

fbshipit-source-id: cf1caf51abd4ac8b5ba90a96e937487131389ddf
mciprian13 authored and facebook-github-bot committed Dec 19, 2019
1 parent de133ee commit 41d3e30caad62289af7e1fa8713b951b07647351
@@ -0,0 +1,115 @@
## ModelTuner

This front end tool is used for tuning (calibrating) the quantization parameters of a model.
During the quantization flow, the model is first profiled by gathering the dynamic range (min/max)
for each tensor in the graph. Next, the quantization parameters are chosen in such a way that, for
the given profile, no saturation occurs. Although this makes sense at first glance, there
is actually a tradeoff when choosing the quantization parameters for a given tensor: it might be
beneficial overall if the quantization parameters are chosen to provide a smaller
quantization step (e.g. a smaller **scale** parameter), which means a better representation of most
of the tensor values (the bulk of the histogram) at the expense of actually saturating the extreme
values (outliers).

This tool tunes the quantization parameters using the following simple algorithm:
- For each node in the graph, try different quantization parameters in the vicinity of the values
initially chosen (right after profiling). For example, this is done by successively dividing the
**scale** parameter by 2, for a maximum of 3 iterations.
- Among the tested quantization parameters, keep the ones which provide the best accuracy with
respect to a given dataset.
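
The greedy per-node loop described above can be sketched as follows. This is an illustrative sketch, not the tool's actual implementation: `tuneNodeScale` and the `evalAccuracy` callback (which stands in for running the quantized model on the tuning dataset with a candidate scale applied to one node) are hypothetical names.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch of the greedy tuning loop: evaluate the original
// scale as the baseline, then try successively halved scales and keep
// the one with the best accuracy. With maxIter = 3 the candidates are
// original_scale, original_scale/2 and original_scale/4.
float tuneNodeScale(float initialScale, unsigned maxIter,
                    const std::function<float(float)> &evalAccuracy) {
  float bestScale = initialScale;
  float bestAcc = evalAccuracy(initialScale); // baseline: original scale
  float scale = initialScale;
  for (unsigned i = 1; i < maxIter; ++i) {
    scale /= 2.0f; // next candidate: half of the previous scale
    float acc = evalAccuracy(scale);
    if (acc > bestAcc) {
      bestAcc = acc;
      bestScale = scale;
    }
  }
  return bestScale;
}
```

Note that the loop only ever shrinks the scale: a smaller scale trades saturation of outliers for a finer quantization step over the bulk of the histogram.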

### Command line options

The specific command line options for running this tool are presented below. In addition, the tool
accepts the generic options shared with the other front end tools (see the image-classifier
documentation): options for specifying the model, the quantization options (schema, precision),
the backend, and the image preprocessing options (layout, channel order, normalization).

```
model-tuner -model=<model-path> <image-options> <quantization-options> -dataset-path=<dataset-folder>
-dataset-file=<dataset-file> -load-profile=<input-profile> -dump-tuned-profile=<tuned-profile>
```

where:
- *dataset-path* - the folder where the dataset files are located. The assumption is that all the dataset files
are located in the same directory.
- *dataset-file* - the path to the dataset description file which contains on each line a data path
  and an integer label separated by a space (" ") or a comma (","). The integer labels start with
  0 (0, 1, ...). An example might look like this:

      image0.png 0
      image1.png 13
      .............

  Another example might look like this:

      image0.png,0,
      image1.png,13,
      ..............

- *load-profile* - the path of the input profile which is loaded and tuned.
- *dump-tuned-profile* - the path where the tuned profile is written.
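
A parser for the dataset description format above could be sketched as follows. This is a hypothetical helper for illustration (`parseDatasetFile` is not part of the tool's API); it accepts both the space-separated and the comma-separated variants, including trailing commas.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical parser for the dataset description file: each line holds
// a data path and an integer label, separated by a space or a comma.
std::vector<std::pair<std::string, unsigned>>
parseDatasetFile(std::istream &in) {
  std::vector<std::pair<std::string, unsigned>> entries;
  std::string line;
  while (std::getline(in, line)) {
    // Normalize commas to spaces so both separators are handled alike.
    for (char &c : line)
      if (c == ',')
        c = ' ';
    std::istringstream ss(line);
    std::string path;
    unsigned label;
    if (ss >> path >> label) // skip empty/malformed lines
      entries.emplace_back(path, label);
  }
  return entries;
}
```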

More information can be acquired by typing the following command:
```
model-tuner -help
```

### Extra command line options

There are a couple of extra command line parameters which can be used to tweak the algorithm behavior:
- *target-accuracy* - The tuning procedure is stopped when the accuracy has reached or surpassed the given
value. A float value between 0.0 and 1.0 is expected. If not specified, the tuning will
run until completion.
- *max-iter-per-node* - The maximum number of tuning iterations per node (default is 3).
- *acc-drop-skip* - The accuracy drop for which the tuning of any node is skipped. The default value is 0.05 (5%).
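
The two accuracy-related options above can be thought of as gates on the tuning loop, as in the sketch below. The struct and function names are hypothetical, chosen only to illustrate the semantics: *target-accuracy* stops the whole procedure early, while *acc-drop-skip* abandons the remaining iterations for a node whose candidates hurt accuracy too much.

```cpp
#include <cassert>

// Hypothetical option set mirroring the extra command line options.
struct TunerOpts {
  float targetAccuracy = 1.0f; // -target-accuracy (1.0 = run to completion)
  unsigned maxIterPerNode = 3; // -max-iter-per-node
  float accDropSkip = 0.05f;   // -acc-drop-skip (default 5%)
};

// Stop the whole tuning procedure once the target accuracy is reached.
bool reachedTarget(float bestAcc, const TunerOpts &opts) {
  return bestAcc >= opts.targetAccuracy;
}

// Skip the rest of a node's iterations when a candidate's accuracy
// drops too far below the best accuracy seen so far.
bool shouldSkipNode(float bestAcc, float candidateAcc,
                    const TunerOpts &opts) {
  return (bestAcc - candidateAcc) > opts.accDropSkip;
}
```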

### Command line output

When running this tool the console output might look like this:

```
Computing initial accuracy ...
Initial accuracy: 81.0180 %
Number of nodes: 277
Target accuracy: 100.0000 %
[1/277] Tuning node "broadcast_B_tile0_save__1:0"
[1/3] Testing scale = 0.00195
Accuracy = 81.0180 %
Tuning stopped for this node (no effect)
Best accuracy : 81.0180 %
Iteration time: 34 seconds
Remaining time: 2 hours 36 minutes
[2/277] Tuning node "W52__1:0"
[1/3] Testing scale = 0.06250
Accuracy = 81.4422 %
[2/3] Testing scale = 0.03125
Accuracy = 79.0032 %
[3/3] Testing scale = 0.01562
Accuracy = 67.1262 %
Best accuracy : 81.4422 %
Iteration time: 68 seconds
Remaining time: 5 hours 11 minutes
..................................
..................................
[277/277] Tuning node "W42__1:0"
[1/3] Testing scale = 0.01562
Accuracy = 90.2439 %
Tuning stopped for this node
Best accuracy : 97.9852 %
Iteration time: 66 seconds
Remaining time: 0 hours 0 minutes
Final accuracy: 97.9852 %
Total time: 5 hours 6 minutes
```

Notes:
- The quantization tuning procedure is lengthy: the order of magnitude of the time required to run
it is similar to training. For example, the model used for tuning in the above example is a
medium-sized model (e.g. similar to a MobileNet with a scale factor of 0.5). For this reason the
tool also prints an estimated remaining time for the tuning (the estimate gets better after more
nodes are calibrated).
- When the estimated time for the tuning is too long, one might use a smaller tuning dataset.
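
The remaining-time estimate mentioned in the notes can be sketched as a simple extrapolation from the average per-node time observed so far. This is an assumption about how such an estimate could work, not the tool's actual implementation; `estimateRemainingSeconds` is a hypothetical name.

```cpp
#include <cassert>

// Hypothetical remaining-time estimate: average the time spent on the
// nodes tuned so far and extrapolate over the nodes still left. The
// estimate naturally improves as more nodes are calibrated, because
// the per-node average is based on more samples.
double estimateRemainingSeconds(double elapsedSeconds, unsigned nodesDone,
                                unsigned nodesTotal) {
  if (nodesDone == 0)
    return 0.0; // no data yet: nothing to extrapolate from
  double secondsPerNode = elapsedSeconds / nodesDone;
  return secondsPerNode * (nodesTotal - nodesDone);
}
```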
@@ -165,8 +165,8 @@ class ProtobufLoader {
bool hasNodeByName(llvm::StringRef name) const;

/// Constructs new ProtobufLoader object. It will populate the network into
/// \p F. The list \p types and \p names are used to initialized the inputs
/// and outputs with specific names and types. If \p errPtr is not null then
/// \p F. The list \p types and \p names are used to initialize the inputs
/// of the model with specific names and types. If \p errPtr is not null then
/// if an error occurs it will get assigned there otherwise if an error
/// occurs it will abort.
ProtobufLoader(llvm::ArrayRef<const char *> tensorNames,
@@ -191,16 +191,21 @@ class ProtobufLoader {
/// that there is only one output, returns Error otherwise. For image
/// classification, this single final output is usually the result of the
/// last softmax or regression layer.
Expected<Placeholder *> getSingleOutput() {
RETURN_ERR_IF_NOT(outputVarsByName_.size() == 1,
"There must be only one output.");
return outputVarsByName_.begin()->second;
}
Expected<Placeholder *> getSingleOutput() const;

/// \returns the single input of the network. The function assumes that there
/// is only one input, returns Error otherwise. For most of the models the
/// single input is usually an image tensor.
Expected<Placeholder *> getSingleInput() const;

/// \returns the Placeholder for the external output with \p name.
/// \pre outputVarsByName_.find(name) != outputVarsByName_.end()
Expected<Placeholder *> getOutputByName(llvm::StringRef name) const;

/// \returns the Placeholder for the external input with \p name.
/// \pre inputVarsByName_.find(name) != inputVarsByName_.end()
Expected<Placeholder *> getInputByName(llvm::StringRef name) const;

/// \returns True if the operator with name \p typeName having input node
/// list as \p inputs is constant foldable.
bool isConstantFoldable(llvm::ArrayRef<NodeValue> inputs,
@@ -255,6 +255,15 @@ Tensor tensor4BitsFusedRowwiseDequantization(const Tensor &input);
QuantizationTransform32To8 quantizeScaleOffset32To8(float scale,
int32_t offset);

/// Function to get the quantized range for a given precision type \p qTy.
/// \returns the range as a (min, max) pair.
std::pair<int64_t, int64_t> getQuantizationRange(ElemKind qTy);

/// Function to validate that the given quantization parameters \p qParams
/// comply with the given quantization \p schema and precision \p qTy.
void validateQuantizationParams(TensorQuantizationParams qParams, Schema schema,
ElemKind qTy);

/// Calculate TensorQuantizationParams based on the clipped \p min and \p max
/// floating point range and using the base quantization type \p qTy and the
/// quantization method described by \p schema.
@@ -84,6 +84,18 @@ bool ProtobufLoader::hasConstantByName(llvm::StringRef name) const {
return getConstantByNameOrNull(name) != nullptr;
}

Expected<Placeholder *> ProtobufLoader::getSingleOutput() const {
RETURN_ERR_IF_NOT(outputVarsByName_.size() == 1,
"There must be only one output.");
return outputVarsByName_.begin()->second;
}

Expected<Placeholder *> ProtobufLoader::getSingleInput() const {
RETURN_ERR_IF_NOT(inputVarsByName_.size() == 1,
"There must be only one input.");
return inputVarsByName_.begin()->second;
}

Expected<Placeholder *>
ProtobufLoader::getOutputByName(llvm::StringRef name) const {
auto it = outputVarsByName_.find(name);
@@ -94,6 +106,16 @@ ProtobufLoader::getOutputByName(llvm::StringRef name) const {
return it->second;
}

Expected<Placeholder *>
ProtobufLoader::getInputByName(llvm::StringRef name) const {
auto it = inputVarsByName_.find(name);
RETURN_ERR_IF_NOT(
it != inputVarsByName_.end(),
llvm::Twine("No external input Variable was registered with name ", name)
.str());
return it->second;
}

NodeValue
ProtobufLoader::getNodeValueByNameOrNullNodeValue(llvm::StringRef name) const {
auto it = nodeValueByName_.find(name);
@@ -187,11 +209,10 @@ ProtobufLoader::ProtobufLoader(llvm::ArrayRef<const char *> tensorNames,
for (size_t i = 0, e = tensorNames.size(); i < e; i++) {
RETURN_ERR_IF_NOT(!hasNodeByName(tensorNames[i]),
"Input names have duplicate");
auto placeholderOrErr =
createAndRegisterPlaceholder(tensorNames[i], types[i]);
if (!placeholderOrErr) {
return placeholderOrErr.takeError();
}
Placeholder *placeholder;
ASSIGN_VALUE_OR_RETURN_ERR(
placeholder, createAndRegisterPlaceholder(tensorNames[i], types[i]));
inputVarsByName_.try_emplace(tensorNames[i], placeholder);
}
return Error::success();
};
@@ -265,11 +265,7 @@ QuantizationTransform32To8 quantizeScaleOffset32To8(float scale,
offset);
}

TensorQuantizationParams chooseQuantizationParams(float min, float max,
Schema schema, ElemKind qTy) {
assert(min <= max && "min must not be bigger than max");

// Compute the quantized int range.
std::pair<int64_t, int64_t> getQuantizationRange(ElemKind qTy) {
// Pick int64_t in order to cover the uint32_t range.
int64_t qmin;
int64_t qmax;
@@ -310,6 +306,45 @@ TensorQuantizationParams chooseQuantizationParams(float min, float max,
default:
llvm_unreachable("Quantized type not supported");
}
return std::pair<int64_t, int64_t>(qmin, qmax);
}

void validateQuantizationParams(TensorQuantizationParams qParams, Schema schema,
ElemKind qTy) {

// Get the quantized range.
auto minMaxPair = getQuantizationRange(qTy);
int64_t qmin = minMaxPair.first;
int64_t qmax = minMaxPair.second;

// Validate params.
(void)(qmin);
(void)(qmax);
assert((qmin <= qParams.offset) && (qParams.offset <= qmax) &&
"The offset must be within the quantized range");
if (schema == quantization::Schema::Symmetric) {
assert((qParams.offset == 0) &&
"Symmetric quantization should have offset 0");
} else if (schema == quantization::Schema::SymmetricWithUnsigned) {
assert((qParams.offset == qmin || qParams.offset == 0) &&
"SymmetricWithUnsigned quantization should have offset 0 or qmin");
} else if (schema == quantization::Schema::SymmetricWithPower2Scale) {
assert((qParams.offset == 0) &&
"SymmetricWithPower2Scale quantization should have offset 0");
assert(isFloatPowerOf2(qParams.scale) &&
"SymmetricWithPower2Scale quantization parameter should be a power "
"of 2");
}
}

TensorQuantizationParams chooseQuantizationParams(float min, float max,
Schema schema, ElemKind qTy) {
assert(min <= max && "min must not be bigger than max");

// Get the quantized range.
auto minMaxPair = getQuantizationRange(qTy);
int64_t qmin = minMaxPair.first;
int64_t qmax = minMaxPair.second;

// We extend the [min, max] interval to ensure that it contains 0.
// Otherwise, we would not meet the requirement that 0 be an exactly
@@ -403,27 +438,7 @@ TensorQuantizationParams chooseQuantizationParams(float min, float max,
}

TensorQuantizationParams result{static_cast<float>(scale), nudgedZeroPoint};
// The only valid offset for symmetric quantization is 0.
assert((result.offset == 0 || schema != quantization::Schema::Symmetric) &&
"Symmetric quantization should be centered on 0");

// The only valid offsets for symmetric quantization with unsigned support are
// 0 and qmin.
assert((result.offset == qmin || result.offset == 0 ||
schema != quantization::Schema::SymmetricWithUnsigned) &&
"Symmetric quantization with unsigned should be centered on 0 or on "
"-qmin");

// For SymmetricWithPower2Scale schema the offset should be 0.
assert((result.offset == 0 ||
schema != quantization::Schema::SymmetricWithPower2Scale) &&
"Symmetric quantization should be centered on 0");

// For SymmetricWithPower2Scale schema the scale should be a power of 2.
assert((isFloatPowerOf2(result.scale) ||
schema != quantization::Schema::SymmetricWithPower2Scale) &&
"Scale quantization parameter should be a power of 2");

validateQuantizationParams(result, schema, qTy);
return result;
}

@@ -82,3 +82,21 @@ target_link_libraries(model-compiler
GraphOptimizer
Quantization
LLVMSupport)

add_executable(model-tuner
Loader.cpp
LoaderUtils.cpp
ModelTuner.cpp)

target_link_libraries(model-tuner
PRIVATE
Backends
Base
Converter
Graph
HostManager
Importer
ExecutionEngine
GraphOptimizer
Quantization
LLVMSupport)
@@ -252,7 +252,8 @@ buildAndCompileAndGetInAndOutPair(Loader &loader, PlaceholderBindings &bindings,

// Compile the model, and perform quantization/emit a bundle/dump debug info
// if requested from command line.
CompilationContext cctx{&bindings};
CompilationContext cctx = loader.getCompilationContext();
cctx.bindings = &bindings;
cctx.backendOpts.autoInstrument = autoInstrument;
loader.compile(cctx);
