
Allow changing the tensor dimension type (to 32b) #3542

Conversation

@pjaaskel (Contributor) commented Sep 21, 2019

Summary:

The previously used size_t is host-dependent, and a 64b dimension width is clearly overkill for most networks. A 32b data type adds more data-level parallelism for computations involving tensor dimension data and simplifies address computations.

The patch adds a new typedef glow::dim_t, with a signed counterpart glow::sdim_t, which are defined as a fixed-width 32b type.

As a side effect, this simplifies the OpenCL kernel interfacing: there is no need to pass the host's size_t width to the kernels because the dimension type is now fixed-width.

The 32b dimension type can be enabled via cmake -DTENSOR_DIMS_32_BITS=ON.
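
For illustration, a minimal sketch of how such a centralized, build-time-switchable typedef could look. This is not the actual patch; only dim_t, sdim_t and TENSOR_DIMS_32_BITS come from the description above, the rest is an assumption.

```cpp
// Minimal sketch, not the actual patch: a centralized dimension typedef that
// is 32b only when the build option is enabled.
#include <cstdint>

namespace glow {
#ifdef TENSOR_DIMS_32_BITS
using dim_t = uint32_t;  // unsigned tensor dimension / index type
using sdim_t = int32_t;  // signed counterpart
#else
using dim_t = uint64_t;  // default keeps the full 64b range
using sdim_t = int64_t;
#endif
} // namespace glow
```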

Test Plan:

Tested with LLVM 7 (CPU backend) and OpenCL (pocl-phsa x86_64), both with 32b and 64b index types. Passes the basic ninja check in both configurations.

@opti-mix (Contributor) commented Sep 21, 2019

@pjaaskel Thanks for this PR!

  • Could you elaborate a bit on whether this PR addresses a concrete performance issue you've discovered, or is it a hypothetical performance improvement?

  • It does not change much for the OpenCL backend, does it? This backend was using a device-specific size_t in all the kernels anyway.

@pjaaskel (Contributor Author) commented Sep 23, 2019

@pjaaskel Thanks for this PR!
Could you elaborate a bit on whether this PR addresses a concrete performance issue you've discovered, or is it a hypothetical performance improvement?

Yes, this had a major (> 10%) performance benefit on a (private) target where 32b integer (SIMD) computation is faster than 64b. I didn't test other targets, but since I believe this is generally true (32b integer computations are either as fast as or faster than 64b), it seemed a good idea to allow reducing address computation overheads this way.

I think this should improve the CPU (LLVM) backend performance too, especially when there is vectorizable address computation based on indices that can now utilize 32b SIMD subwords. But I haven't measured this.
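
For illustration, a sketch of the kind of index arithmetic meant here, assuming a 32b dim_t (this is not code from the patch):

```cpp
// Illustrative only: linearizing tensor indices with a 32b dim_t keeps the
// multiply/add address arithmetic in 32b lanes, which packs twice as many
// elements per SIMD register as 64b and avoids scalarization on targets
// without 64b vector integer support.
#include <cstdint>

using dim_t = uint32_t; // assumed 32b build configuration

void gatherRows(const float *src, const dim_t *rowIdx, float *dst,
                dim_t numRows, dim_t rowSize) {
  for (dim_t i = 0; i < numRows; ++i) {
    const dim_t srcBase = rowIdx[i] * rowSize; // 32b index arithmetic
    const dim_t dstBase = i * rowSize;
    for (dim_t j = 0; j < rowSize; ++j) {
      dst[dstBase + j] = src[srcBase + j];
    }
  }
}
```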

It does not change much for the OpenCL backend, does it? This backend was using a device-specific size_t in all the kernels anyway.

Unfortunately OpenCL has a pitfall here: size_t is tied to the address-space width (unlike in C/C++). So if one wants 64b pointers / a 64b global address space, the device must declare size_t as 64b, even if it would rather strength-reduce needless address computations to lower bit widths (and declare the maximum object size to be 32b) because 32b would be much faster / higher throughput.

Explicitly passing the dimension/index data type improves the situation: it communicates that we need only 32b for the indices, and it keeps the index width in sync with the data coming from the host. The previous code didn't use size_t for indices anyway, but uint64_t, and the node validator also assumed a 64b index type.
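
For illustration, a hypothetical host-side sketch of keeping the kernel-side index width in sync with the host's dim_t; this is not the actual Glow OpenCL backend code (clBuildProgram is the standard OpenCL API, everything else is assumed):

```cpp
// Hypothetical sketch -- not the actual Glow OpenCL backend code. The
// kernel-side index type is chosen explicitly via a build option so it
// matches the host's dim_t instead of inheriting the device's size_t width.
#include <CL/cl.h>

cl_int buildKernels(cl_program program, cl_device_id device) {
#ifdef TENSOR_DIMS_32_BITS
  const char *opts = "-Ddim_t=uint -Dsdim_t=int";   // 32b indices
#else
  const char *opts = "-Ddim_t=ulong -Dsdim_t=long"; // 64b indices
#endif
  // Kernels declare their dimension/index parameters as dim_t, so the index
  // width stays in sync with the data serialized by the host.
  return clBuildProgram(program, 1, &device, opts, nullptr, nullptr);
}
```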

@opti-mix (Contributor) commented Sep 23, 2019

@pjaaskel Thanks for the detailed explanations!

And thanks a lot for the PR! It is a significant amount of work to go through the whole source code and identify places where size_t/ssize_t could be semantically replaced by a custom type.

Overall, I like the idea! We've discussed it internally a couple of times in the past.

Let's collect some opinions and reactions from others. In particular, might it eventually break any private backends?

@bertmaher (Contributor) commented Sep 24, 2019

I like the direction, for sure. In particular, I'm not a huge fan of size_t, since it depends on the host that's running Glow, which may not have any relation to the device running Glow's generated code.

Something I'm curious about: in the cases where you've observed a speedup from going 64->32 bits, was it actually because the tensor's dimensions were narrower, or because the data stored in the tensors became narrower (e.g., you can make the indices tensors of embedding lookups 32 bits)? I've definitely seen the latter -- using uint32_t for the indices of SparseLengthsSum is a substantial win since it halves the memory bandwidth spent on transferring the indices themselves. But you can do that without forcing any tensor dimensions to physically be 32 bits; you just have to create embedding lookups with the right dtype on their inputs. If it's the former case (significant computations involving the actual tensor dimensions), I'd be curious to hear what it is. (PM me if it's non-public; maybe we can work out NDAs to discuss in more detail, since I'd love to help move this forward.)

I'm a bit concerned if we lose the ability to represent >32 bit dimensions. There are at least semi-plausible domains where you could have a dimension > 2^32 (an embedding vector for every person, or something like that). If I'm not mistaken, TensorFlow's XLA uses 64 bits for shapes as well, which isn't totally dispositive, but might suggest that it's a useful property.
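
For clarity, a plain-C++ sketch of the distinction drawn above (not Glow code): the index payload of an embedding/SparseLengthsSum lookup can be 32b, halving the bandwidth spent on the indices, while the shape and offset arithmetic itself stays size_t/64b.

```cpp
// Sketch only, not Glow code: the *payload* of the indices tensor is 32b,
// while the dimension/offset arithmetic remains size_t.
#include <cstddef>
#include <cstdint>

void sparseLengthsSum(const float *data, std::size_t rowSize,
                      const uint32_t *indices, // 32b index payload
                      const int32_t *lengths, std::size_t numSegments,
                      float *out) {
  std::size_t cur = 0;
  for (std::size_t s = 0; s < numSegments; ++s) {
    for (std::size_t j = 0; j < rowSize; ++j)
      out[s * rowSize + j] = 0.0f;
    // Sum lengths[s] rows of `data`, selected by the 32b indices.
    for (int32_t k = 0; k < lengths[s]; ++k, ++cur) {
      const float *row = data + std::size_t(indices[cur]) * rowSize;
      for (std::size_t j = 0; j < rowSize; ++j)
        out[s * rowSize + j] += row[j];
    }
  }
}
```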

@pjaaskel (Contributor Author) commented Sep 25, 2019

Something I'm curious about: in the cases where you've observed a speedup from going 64->32 bits, was it actually because the tensor's dimensions were narrower, or because the data stored in the tensors became narrower (e.g., you can make the indices tensors of embedding lookups 32 bits)? I've definitely seen the latter -- using uint32_t for the indices of SparseLengthsSum is a substantial win since it halves the memory bandwidth spent on transferring the indices themselves. But you can do that without forcing any tensor dimensions to physically be 32 bits; you just have to create embedding lookups with the right dtype on their inputs. If it's the former case (significant computations involving the actual tensor dimensions)

I'm not sure what you mean by "embedding lookups 32bits" here.

I'd be curious to hear what it is. (PM me if it's non-public, maybe we can work out NDAs to discuss in more detail, since I'd love to help move this forward).

On that private target, all vectorized 64b integer computations and memory accesses are expensive since they fall back to scalarized execution instead of wide SIMD. So tensor index computations based on 64b shapes became a major bottleneck early on in our optimization assignment.

I'm a bit concerned if we lose the ability to represent >32 bit dimensions. There are at least semi-plausible domains where you could have a dimension > 2^32 (an embedding vector for every person, or something like that). If I'm not mistaken, TensorFlow's XLA uses 64 bits for shapes as well, which isn't totally dispositive, but might suggest that it's a useful property.

Yes, indeed this is my main concern with enabling it by default: are there feasible networks that need such huge dimensions? There can be, as you state, which at least hints that the index type should be easily changeable (at build time?) back to 64b. I've tried to keep it that way by means of the centralized dim_t/dims_t/IndexElementType typedefs, but I haven't re-tested setting dim_t to uint64_t after my modifications.

@bertmaher (Contributor) commented Sep 26, 2019

I'm not sure what you mean by "embedding lookups 32bits" here.

Usually the "indices" tensor of SparseLengthsSum. But it sounds like it's the actual shapes that are slowing you down, not the data.

I'd be curious to hear what it is. (PM me if it's non-public, maybe we can work out NDAs to discuss in more detail, since I'd love to help move this forward).

On that private target, all vectorized 64b integer computations and memory accesses are expensive since they fall back to scalarized execution instead of wide SIMD. So tensor index computations based on 64b shapes became a major bottleneck early on in our optimization assignment.

Very interesting. I'll be honest I did not expect vectorizing indices to be a significant performance bottleneck.

Yes, indeed this is my main concern with enabling it by default: are there feasible networks that need such huge dimensions? There can be, as you state, which at least hints that the index type should be easily changeable (at build time?) back to 64b. I've tried to keep it that way by means of the centralized dim_t/dims_t/IndexElementType typedefs, but I haven't re-tested setting dim_t to uint64_t after my modifications.

Cool, yeah, I think I like the dim_t refactoring regardless. But I think I want to default to 64b and switch to 32b based on a build flag.

stale bot commented Oct 11, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@pjaaskel pjaaskel force-pushed the parmance:allow-changing-the-tensor-dimension-type branch from 1cebff3 to 7a0ea03 Oct 12, 2019
@stale stale bot removed the stale_will_be_closed label Oct 12, 2019
@pjaaskel (Contributor Author) commented Oct 12, 2019

I made it a build-time option, ran clang-format, and fixed a bunch of warnings due to data-type bit-width downcasts, etc. A huge diff, but pretty trivial changes, I think. What do you think now?

@pjaaskel pjaaskel force-pushed the parmance:allow-changing-the-tensor-dimension-type branch from 7a0ea03 to 6c292d7 Oct 18, 2019
@pjaaskel (Contributor Author) commented Oct 18, 2019

I rebased to the latest master and fixed a couple of new issues that came from new code that used size_t.

stale bot commented Nov 2, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale_will_be_closed label Nov 2, 2019
@pjaaskel (Contributor Author) commented Nov 2, 2019

Bling?

@stale stale bot removed the stale_will_be_closed label Nov 2, 2019
@jfix71 (Contributor) commented Nov 2, 2019

@pjaaskel sorry for the delay, @bertmaher has been out for the past few weeks and so this fell through the cracks. We will try to get this landed early next week.

@bertmaher (Contributor) commented Nov 5, 2019

Hey @pjaaskel, sorry for the long latency, my wife and I had our 2nd child a few weeks ago so I've been changing diapers instead of reviewing diffs :-D. Anyways, this change lgtm; can you rebase to fix the merge conflicts, and address the test failures, and then we can get this in? Thanks!

@pjaaskel pjaaskel force-pushed the parmance:allow-changing-the-tensor-dimension-type branch 2 times, most recently from 5013a54 to fd38034 Nov 9, 2019
@pjaaskel (Contributor Author) commented Nov 10, 2019

Congratulations to you and your family (coincidentally, it's Father's Day in Finland today). I've rebased the branch and fixed 'ninja check' to pass for 32b & 64b LLVM 7 and OpenCL.

@bertmaher (Contributor) commented Nov 12, 2019

Looks like the build is broken due to a change to the foxi submodule. I think you need to revert the foxi version in this PR to the version in master. (I can never remember how to do this offhand, so hopefully you know :) ).

@pjaaskel pjaaskel force-pushed the parmance:allow-changing-the-tensor-dimension-type branch 2 times, most recently from 00bd1d5 to 278013c Nov 13, 2019
@pjaaskel (Contributor Author) commented Nov 13, 2019

I rebased again (without conflicts).

I'm not comfortable with using submodules either, but I don't see any submodule revision changes in this PR and it builds for me. When I check the submodule status in the two branches, they seem to match too:

```
Your branch is up to date with 'glow-origin/master'.
visit0r@x270:~/src/glow-upstreaming$ git submodule
 eab62a5dd5129f6a4ebfbe4bbe41d35611f7c48d tests/OutputCheck (0.4.2-1-geab62a5)
 090faecb454fbd6e6e17a75ef8146acb037118d4 tests/googlebenchmark (v1.5.0)
 703bd9caab50b139428cea1aaff9974ebee5742e tests/googletest (release-1.8.0-2064-g703bd9ca)
 97fe555430a857581b9b826ecd955e4f0a3653f0 thirdparty/foxi (remotes/origin/HEAD)
 34d4bf01bbf7376f2baa71b8fa148b18524d45cf thirdparty/fp16 (34d4bf0)
 50dc186b50ea512d6888aa1f47414150fd782fa0 thirdparty/onnx (v1.3.0-275-g50dc186b)
 a1b71df137e015d44f7e31f7b6d4807253fb7871 thirdparty/pybind11 (v2.2.0-209-ga1b71df)
visit0r@x270:~/src/glow-upstreaming$ git checkout allow-changing-the-tensor-dimension-type
Switched to branch 'allow-changing-the-tensor-dimension-type'
Your branch is up to date with 'origin/allow-changing-the-tensor-dimension-type'.
visit0r@x270:~/src/glow-upstreaming$ git submodule
 eab62a5dd5129f6a4ebfbe4bbe41d35611f7c48d tests/OutputCheck (0.4.2-1-geab62a5)
 090faecb454fbd6e6e17a75ef8146acb037118d4 tests/googlebenchmark (v1.5.0)
 703bd9caab50b139428cea1aaff9974ebee5742e tests/googletest (release-1.8.0-2064-g703bd9ca)
 97fe555430a857581b9b826ecd955e4f0a3653f0 thirdparty/foxi (remotes/origin/HEAD)
 34d4bf01bbf7376f2baa71b8fa148b18524d45cf thirdparty/fp16 (34d4bf0)
 50dc186b50ea512d6888aa1f47414150fd782fa0 thirdparty/onnx (v1.3.0-275-g50dc186b)
 a1b71df137e015d44f7e31f7b6d4807253fb7871 thirdparty/pybind11 (v2.2.0-209-ga1b71df)
```

@bertmaher (Contributor) commented Nov 13, 2019

Weird, I thought I saw a change to thirdparty/foxi before you pushed the latest update. Oh well. Now CircleCI seems to be complaining about merging onto master; I don't understand why it's trying to do that though, it never did before. @rdzhabarov do you know?

facebook-github-bot left a comment

@bertmaher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

stale bot commented Nov 29, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@pjaaskel (Contributor Author) commented Nov 29, 2019

What's the status of this?

@stale stale bot removed the stale_will_be_closed label Nov 29, 2019
@pjaaskel pjaaskel force-pushed the parmance:allow-changing-the-tensor-dimension-type branch 2 times, most recently from be7cba0 to 063801d Nov 30, 2019
@pjaaskel (Contributor Author) commented Nov 30, 2019

I rebased & fixed the tests again. Any chance of getting this in soon?

@bertmaher (Contributor) commented Nov 30, 2019

Yeah, I’m not sure what was going wrong with CI, but I’ll try to get it through first thing on Monday

@pjaaskel pjaaskel force-pushed the parmance:allow-changing-the-tensor-dimension-type branch 2 times, most recently from 09e13f8 to 8e83945 Dec 2, 2019
As a side effect, this patch also makes the dimension width
type an explicit fixed size. The previously used size_t is
device-dependent, and a typical 64b width is clearly overkill for
most use cases. An option to use a 32b data type adds more
data-level parallelism for computations with tensor dimension
data as well as simplifying address computations.

The patch adds a new typedef glow::dim_t with a signed counterpart
glow::sdim_t, which are defined as a fixed-width 32b type when
building with cmake -DTENSOR_DIMS_32_BITS=ON.
@pjaaskel pjaaskel force-pushed the parmance:allow-changing-the-tensor-dimension-type branch from 8e83945 to b7bb540 Dec 3, 2019
@bertmaher (Contributor) commented Dec 3, 2019

I fixed the ASan issues locally, and I need to make some changes to libjit for build-system compatibility. I'll try to finish the merge in a few hours once I'm back in the office.

@bertmaher (Contributor) commented Dec 3, 2019

Done! Thanks for your patience, and for this great contribution!

@pjaaskel (Contributor Author) commented Dec 3, 2019

Thanks! Would it be possible to get a builder bot that at least builds with the 32b dim_t enabled, so it would catch cases where a 64b size_t is used for indices early on?

@bertmaher (Contributor) commented Dec 3, 2019

It should be possible; in fact, I think you could actually add it yourself if you so desire. Check out the SHARED build configuration defined in our .circleci setup. We'll have to monitor whether it's slowing down our CI too much, but if it's just building it shouldn't be too onerous.

facebook-github-bot added a commit that referenced this pull request Dec 4, 2019
Summary:
Use dim_t from #3542

Documentation:
Pull Request resolved: #3843

Test Plan: ninja check

Differential Revision: D18804441

Pulled By: jackm321

fbshipit-source-id: f8c19a173c1b20a7395a2e66e3b2b3352aeba6f2