Conv2d native FP16 compute #51132
Conversation
I'd still like to see isolated benchmarks, and accuracy numbers for the models tested. It turns out the FP16 Conv2D op without dilations and with …
I did an accuracy check using a CNN model trained with mixed precision for float16. Following are the changes to the base code:

- Added the float16 mixed-precision policy, per https://www.tensorflow.org/guide/mixed_precision: `from tensorflow.keras import mixed_precision`
- Replaced the TensorBoard callback with a model checkpoint to save the most accurate model: `checkpoint_filepath = '/tmp/checkpoint'`, `model.fit(train_ds, ...)`
- Saved the model to disk: `model.save('alexnet_cifar10.h5', save_format='h5')`
- Did the accuracy check using the saved model: `model = tf.keras.models.load_model('alexnet_cifar10.h5')`

Following are the results for 3 different training cycles + accuracy checks:

- Test cycle 1: accuracy check on GPU
- Test cycle 2: accuracy check on GPU
- Test cycle 3: accuracy check on GPU

Model summary:

| Layer (type) | Output Shape | Param # |
| --- | --- | --- |
| conv2d (Conv2D) | (None, 55, 55, 96) | 34944 |
| batch_normalization (BatchNormalization) | (None, 55, 55, 96) | 384 |
| max_pooling2d (MaxPooling2D) | (None, 27, 27, 96) | 0 |
| conv2d_1 (Conv2D) | (None, 27, 27, 256) | 614656 |
| batch_normalization_1 (BatchNormalization) | (None, 27, 27, 256) | 1024 |
| max_pooling2d_1 (MaxPooling2D) | (None, 13, 13, 256) | 0 |
| conv2d_2 (Conv2D) | (None, 13, 13, 384) | 885120 |
| batch_normalization_2 (BatchNormalization) | (None, 13, 13, 384) | 1536 |
| conv2d_3 (Conv2D) | (None, 13, 13, 384) | 1327488 |
| batch_normalization_3 (BatchNormalization) | (None, 13, 13, 384) | 1536 |
| conv2d_4 (Conv2D) | (None, 13, 13, 256) | 884992 |
| batch_normalization_4 (BatchNormalization) | (None, 13, 13, 256) | 1024 |
| max_pooling2d_2 (MaxPooling2D) | (None, 6, 6, 256) | 0 |
| flatten (Flatten) | (None, 9216) | 0 |
| dense (Dense) | (None, 4096) | 37752832 |
| dropout (Dropout) | (None, 4096) | 0 |
| dense_1 (Dense) | (None, 4096) | 16781312 |
| dropout_1 (Dropout) | (None, 4096) | 0 |
| dense_2 (Dense) | (None, 10) | 40970 |

Total params: 58,327,818

Test sequence:

1. Training on GPU: `export CUDA_VISIBLE_DEVICES=0`
2. Accuracy check on GPU: `export CUDA_VISIBLE_DEVICES=0`
3. Accuracy check on CPU with FP32: `export CUDA_VISIBLE_DEVICES=-1`
4. Accuracy check on CPU with FP16 ACCUMULATE: `export CUDA_VISIBLE_DEVICES=-1`
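For reference, a condensed sketch of this setup. The stand-in model and data pipeline here are hypothetical; only the policy, checkpoint, save, and reload steps are taken from the description above (the actual test used the AlexNet-style model summarized above):

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable the float16 mixed-precision policy, per the linked guide.
mixed_precision.set_global_policy('mixed_float16')

# Hypothetical stand-in data/model; the actual test used an AlexNet on CIFAR-10.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(96, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    # Keep the final layer in float32 for numeric stability, as the guide recommends.
    tf.keras.layers.Dense(10, activation='softmax', dtype='float32'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Checkpoint callback in place of TensorBoard: keep only the most accurate model.
checkpoint_filepath = '/tmp/checkpoint'
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath, monitor='val_accuracy', save_best_only=True)
model.fit(x_train, y_train, epochs=1, validation_split=0.1,
          callbacks=[checkpoint_cb])

# Save to disk, then reload for the accuracy check.
model.save('alexnet_cifar10.h5', save_format='h5')
model = tf.keras.models.load_model('alexnet_cifar10.h5')
model.evaluate(x_test, y_test)
```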
@cantonios Can you please review this PR? Thanks!
Microbenchmarks for the affected convolution operation(s), and overall model inference numbers.
Executed a microbenchmark to exercise all possible paths of the Conv2D op.

The benchmark is a time measurement over the Conv2D operation using timeit. timeit is used over the following operation with all the filter sizes:

`_ = tf.nn.conv2d(image, filter, strides=[1, 1], padding='VALID')`

Following are the results in % gain/loss using 4 N1 cores:

- FP16 vs FP16 ACCUMULATE
- FP32 vs FP16 and FP16 ACCUMULATE

Note: Without the FP16 accumulate flag, SpatialConvolution (the else part of the condition) shows negative gain.
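For concreteness, a minimal sketch of such a timeit microbenchmark. The input shape, filter sizes, and iteration count are illustrative assumptions, not the ones behind the reported numbers:

```python
import timeit
import tensorflow as tf

# Illustrative input; dtype tf.float16 exercises the FP16 path.
image = tf.random.normal([1, 224, 224, 64], dtype=tf.float16)

for k in (1, 3, 5, 7):  # filter sizes, to hit the different Conv2D code paths
    filt = tf.random.normal([k, k, 64, 64], dtype=tf.float16)
    t = timeit.timeit(
        lambda: tf.nn.conv2d(image, filt, strides=[1, 1], padding='VALID'),
        number=100)
    print(f'{k}x{k} filter: {t:.4f} s for 100 runs')
```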
This change is only affecting …. Is that what your first benchmark results show? The first 3.77% is actually just noise, so it's essentially (+ is a performance gain): …
I don't know if I understand the second set of benchmarks though. Is the summary that without this flag, float16 convolutions with SpatialConvolution are slower than float32 convolutions, but with it float16 convolutions are faster?
Yes, your understanding is correct. When the flag is not used, the packet flow is F16 -> F32 -> SpatialConvolution -> F16.
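A rough Python analogue of the two data flows (not the actual kernel code; the tensors here are hypothetical). At the Python level both variants call the same op; whether the real kernel computes internally in F16 is what the flag controls:

```python
import tensorflow as tf

image = tf.random.normal([1, 28, 28, 3], dtype=tf.float16)
filters = tf.random.normal([3, 3, 3, 8], dtype=tf.float16)

# Without the flag: F16 -> F32 -> SpatialConvolution -> F16, i.e. an
# explicit upcast, a float32 convolution, and a downcast of the result.
out_upcast = tf.cast(
    tf.nn.conv2d(tf.cast(image, tf.float32), tf.cast(filters, tf.float32),
                 strides=[1, 1], padding='VALID'),
    tf.float16)

# With TF_CONV2D_USE_FP16_ACCUMULATE=1: the kernel stays in F16 end to end.
out_native = tf.nn.conv2d(image, filters, strides=[1, 1], padding='VALID')
```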
Alright, then the second set of benchmarks is a bit misleading. Yes, float16 is slower than float32 without the macro, because we are essentially doing a float32 convolution plus the extra casting float16 -> float32 -> float16. The main gain is to skip the casting. Are you able to run the same benchmarks on an Intel CPU?
Along with the casting, ARMv8-FP16 is also losing the processing power of the NEON engine: with F32, 4 values are processed per instruction, while with F16, 8 values can be processed per instruction. Yes, the benchmark is functional on x86_64, but the FP16 results are low, as Intel does not have native FP16 packet support. Intel can take advantage of the upcast to FP32, since emulated Eigen::half would be worse.
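The lane arithmetic behind that claim, assuming the standard 128-bit NEON register width:

```python
# 128-bit NEON registers: lanes per vector operation by element width.
register_bits = 128
print(register_bits // 32)  # 4 float32 values per instruction
print(register_bits // 16)  # 8 float16 values per instruction
```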
Right, but this is not reflected in …, since both cases are doing the computation in float32. That's all I mean: the speed gain for fp16 is only due to the results in the first set of benchmarks (which already include removing the cast + the increased ISA throughput).

Do you have numbers (relative or absolute)?
The negative gain in the 3x3 and 7x7 results is showing this.
@asharma-ampere Still awaiting the Intel benchmark results, if you have them.
@asharma-ampere Any update on this PR? Please. Thanks!
I tried the same microbenchmark on a Skylake 8160 (on 4 CPU cores). TF build flags: …

Results:

- FP32 vs FP16 and FP32 vs FP16 ACC
- FP16 vs FP16 ACC
I assume the 1x1 and mxn results in the second table are just noise. That is quite a significant loss if we try to do the convolutions in fp16. This leads me to believe we should be doing the matmul versions (1x1, mxn) in f32 on Intel (and on Arm without native fp16 support) as well. Thanks, this has been very helpful.
The conv_ops_test seems to be broken by this. Can you double-check?
Fixed conv_ops_test for standard build
At line 163, the assignment to `output.device(d)` was missing.
I tested the patch with and without `TF_CONV2D_USE_FP16_ACCUMULATE`.
It passes on the ARM64 N1 system as well as the Intel Skylake system.
Please fix formatting with this new change.
Formatting done (`clang-format --style=google`).
This patch enables the use of native FP16 hardware acceleration for the Conv2D op, if available on the system.
It was tested on an ARM Neoverse N1 CPU. With native FP16 vector instructions, up to 45% improvement is
observed on CNN models (with no or very minimal accuracy loss). It shows very good results in
mixed-precision training as well. This will further improve with an improved GEMM kernel for FP16.
By default, native FP16 Conv2D ops are disabled: without the environment variable set (or in its absence), there is
no change in the execution of the Conv2D operation. To enable native FP16 accumulate, the environment variable
`TF_CONV2D_USE_FP16_ACCUMULATE` must be set to 1.
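For example, one way to opt in from Python (this assumes the variable is read when the op executes; exporting it in the shell before launching the process is the safer equivalent):

```python
import os

# Opt in to native FP16 accumulation for Conv2D before TensorFlow runs the op.
os.environ['TF_CONV2D_USE_FP16_ACCUMULATE'] = '1'

import tensorflow as tf

# Hypothetical float16 tensors to exercise the native FP16 path.
x = tf.random.normal([1, 56, 56, 96], dtype=tf.float16)
w = tf.random.normal([3, 3, 96, 96], dtype=tf.float16)
y = tf.nn.conv2d(x, w, strides=[1, 1], padding='SAME')
```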
System software configuration used to test/build:

- Ubuntu 20.04.2 LTS
- GCC: gcc-11 (Ubuntu 11.1.0-1ubuntu1~20.04) 11.1.0