Add AVX intrinsics to vectorize & speed up FP16-CPU computations #574

Merged 2 commits into master from fp16_cpu_intrin on Oct 23, 2018



alsrgv commented Oct 20, 2018

Use CPU intrinsics to do fast FP16 conversion & vectorization. Observe 10-12% speedup over two nodes.

@alsrgv alsrgv self-assigned this Oct 20, 2018

@alsrgv alsrgv requested a review from tgaddair Oct 20, 2018
@@ -67,7 +67,7 @@ def check_tf_version():
 def get_cpp_flags(build_ext):
     last_err = None
-    default_flags = ['-std=c++11', '-fPIC', '-O2']
+    default_flags = ['-std=c++11', '-fPIC', '-O2', '-mf16c']

alsrgv Oct 20, 2018


TODO: add a ./configure-style test to check whether -mf16c is accepted on the machine where Horovod is being installed.
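One way such a compiler probe could look, as a hypothetical sketch (the `flag_supported` helper and the `'c++'` compiler name are assumptions, not the check Horovod eventually shipped): try compiling an empty translation unit with the flag and fall back gracefully if the compiler rejects it.

```python
import os
import subprocess
import tempfile

def flag_supported(compiler, flag):
    # Hypothetical ./configure-style probe: compile an empty translation
    # unit with `flag` and report whether the compiler accepted it.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'check.cc')
        with open(src, 'w') as f:
            f.write('int main() { return 0; }\n')
        try:
            result = subprocess.run(
                [compiler, flag, '-c', src,
                 '-o', os.path.join(tmp, 'check.o')],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        except OSError:
            return False  # compiler not found at all
        return result.returncode == 0

# Only append '-mf16c' when this machine's compiler accepts it.
flags = ['-std=c++11', '-fPIC', '-O2']
if flag_supported('c++', '-mf16c'):
    flags.append('-mf16c')
```

The probe is conservative: any failure (unknown flag, missing compiler) leaves `-mf16c` out, so the build falls back to the scalar FP16 path instead of failing.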

auto* in = (unsigned short*)invec;
auto* inout = (unsigned short*)inoutvec;
int i = 0;

tgaddair Oct 23, 2018


Why not initialize i within the for loop?

alsrgv Oct 23, 2018


Because if we have AVX, we process all full blocks of 8 elements with the intrinsics, and the remainder is handled as "leftovers" by the slow scalar algorithm starting from wherever i ended up. If we don't have AVX, i stays at 0 and the scalar loop processes everything.

@alsrgv alsrgv merged commit 156c61b into master Oct 23, 2018

3 checks passed

License Compliance All checks passed.
continuous-integration/travis-ci/pr The Travis CI build passed
license/cla Contributor License Agreement is signed.

@alsrgv alsrgv deleted the fp16_cpu_intrin branch Oct 23, 2018
