
Changes to add user scratch pad for matmul primitive to fix OOM issue in Transformer LT #54381

Merged (4 commits, Mar 14, 2022)

Conversation

jojivk73 (Contributor) commented Feb 14, 2022

Adding changes for the matmul primitive to use a user scratchpad. This reduces the memory footprint of the primitive and fixes an out-of-memory issue when running Transformer LT with multiple instances and a large total thread count. Managing the primitive's scratchpad from the framework fixes the out-of-memory issue, reduces the memory footprint, and does not affect performance. The changes (a minimal sketch of the flow follows the list):
- Create a new struct that holds the Tensor for the scratchpad arg.
- Allocate memory based on the scratchpad size queried from the primitive descriptor.
- Set the user scratchpad in the post ops for the primitive.
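
A minimal, standalone sketch of the user-scratchpad flow these changes rely on, written directly against the oneDNN v3.x C++ API rather than the actual TF helper in core/util/mkl_util.h (function and variable names here are illustrative, not the PR's code):

```c++
#include <unordered_map>
#include <vector>
#include "dnnl.hpp"

// Runs a matmul with a caller-managed scratchpad. The buffer is sized by the
// primitive descriptor and freed as soon as this function returns.
void RunMatmulWithUserScratchpad(const dnnl::engine& eng, dnnl::stream& strm,
                                 const dnnl::memory& src,
                                 const dnnl::memory& weights,
                                 const dnnl::memory& dst) {
  // 1) Ask oneDNN to let the caller manage the scratchpad.
  dnnl::primitive_attr attr;
  attr.set_scratchpad_mode(dnnl::scratchpad_mode::user);

  dnnl::matmul::primitive_desc pd(eng, src.get_desc(), weights.get_desc(),
                                  dst.get_desc(), attr);

  // 2) Allocate exactly the number of bytes the primitive reports it needs.
  std::vector<unsigned char> buf(pd.scratchpad_desc().get_size());
  dnnl::memory scratchpad(pd.scratchpad_desc(), eng, buf.data());

  // 3) Hand the buffer to the primitive at execution time.
  dnnl::matmul(pd).execute(strm, {{DNNL_ARG_SRC, src},
                                  {DNNL_ARG_WEIGHTS, weights},
                                  {DNNL_ARG_DST, dst},
                                  {DNNL_ARG_SCRATCHPAD, scratchpad}});
  strm.wait();
  // buf goes out of scope here: the framework, not the primitive, decides the
  // scratchpad lifetime.
}
```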

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label Feb 14, 2022
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Feb 14, 2022
@tilakrayal tilakrayal added the comp:mkl MKL related issues label Feb 15, 2022
@tilakrayal tilakrayal added this to Assigned Reviewer in PR Queue via automation Feb 15, 2022
@tilakrayal tilakrayal self-assigned this Feb 15, 2022
penpornk (Member) left a comment
Thank you for the PR!

> This reduces memory footprint of the primitive.

Could you please help clarify how providing a user-allocated scratchpad saves memory in this case? The user-allocated buffer doesn't seem to be reused (it is only used once per Compute call). Does the primitive over-allocate memory when it is not given a user-allocated scratchpad? The primitive is not cached, i.e., a new one is created each time the op is called, so shouldn't it be able to allocate just the amount it needs?

@penpornk penpornk removed the awaiting review Pull request awaiting review label Mar 9, 2022
gzmkl (Contributor) commented Mar 9, 2022

@penpornk we are working to enable the oneDNN scratchpad "user mode" for conv, inner-product, and matmul ops.
On the TF side, we will continue to reuse primitives with LRU caching, but this will control oneDNN memory consumption (oneDNN will execute an op with the provided scratchpad buffer and won't allocate anything more).

To answer the last question:
(1) All conv / inner-product (matmul fused) / matmul (sgemm) ops will be cached on the TF side.

(2) The scratchpad buffer is allocated within Compute(), just before invoking oneDNN execution.
The buffer size is decided by oneDNN. Please check the "struct UserScratchPad" implementation in core/util/mkl_util.h (a rough sketch of the idea is below).
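
A rough sketch of that Compute()-time allocation, using a hypothetical helper name (the real implementation is the UserScratchPad struct in core/util/mkl_util.h, which this does not reproduce):

```c++
#include <cstdint>
#include "dnnl.hpp"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/tensor.h"

namespace tensorflow {

// Hypothetical helper: allocate a TF-owned scratchpad buffer inside Compute(),
// sized by what the oneDNN primitive descriptor reports.
void AllocateUserScratchpad(OpKernelContext* ctx,
                            const dnnl::matmul::primitive_desc& matmul_pd,
                            Tensor* scratch_tensor, void** scratch_ptr) {
  // oneDNN decides how many bytes the primitive needs; TF owns the buffer.
  const int64_t bytes =
      static_cast<int64_t>(matmul_pd.scratchpad_desc().get_size());
  OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_UINT8, TensorShape({bytes}),
                                         scratch_tensor));
  // The pointer is later passed to execute() as DNNL_ARG_SCRATCHPAD; the temp
  // Tensor (and the buffer) is released after the kernel finishes.
  *scratch_ptr = static_cast<void*>(scratch_tensor->flat<uint8>().data());
}

}  // namespace tensorflow
```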

Thanks

penpornk (Member) commented Mar 10, 2022

> (2) The scratchpad buffer is allocated within Compute(), just before invoking oneDNN execution.
> The buffer size is decided by oneDNN. Please check the "struct UserScratchPad" implementation in core/util/mkl_util.h

@gzmkl Thank you for the quick reply! I understand that the user scratchpad is allocated in Compute(). But I don't understand how this differs from letting oneDNN allocate the buffer itself, since the matmul primitive here is not cached in the LRU and is also created every time Compute() is called (and oneDNN allocates memory at primitive creation time). Both approaches seem to allocate scratchpad memory with a similar lifetime, and the sizes shouldn't be too different since the user-allocated approach also gets the size from oneDNN. So I'm wondering what makes the difference here (in this PR).

jojivk73 (Contributor, Author) commented
@penpornk
When the framework does not use a user scratchpad, the oneDNN primitive allocates its own scratchpad and holds it for as long as the primitive is alive. TF also keeps all primitives alive until the LRU cache is full. When there are many threads, each thread can create primitives for the same dimensions, since the cache is thread-local (this is essential). This results in many (similar) primitives being created across multiple threads. For example, in the case of Transformer (for which the issue was reported), the matmul dimensions vary based on the input. This results in hundreds of primitives getting created in each thread (when the thread count is greater than 1), each holding its scratchpad memory alive. Transformer with more than 4 threads was going out of memory on machines with less memory (<182 GB).

With UserScratchPad, the TF framework controls the scratchpad by creating and releasing the scratchpad memory as needed, i.e., (as of now) it creates the buffer before execution and releases it after the primitive executes. This keeps memory consumption low for the entire model.
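
To illustrate the contrast, here is a sketch against the oneDNN v3.x C++ API (not the PR's actual caching code): with the default scratchpad_mode::library, a cached primitive allocates and keeps its scratchpad internally, while with scratchpad_mode::user the cached primitive owns no scratchpad at all and the framework supplies a short-lived buffer at each execute().

```c++
#include "dnnl.hpp"

// Builds a matmul primitive as it might be stored in a thread-local LRU cache.
// Names here are illustrative only.
dnnl::matmul MakeCacheableMatmul(const dnnl::engine& eng,
                                 const dnnl::memory::desc& src_md,
                                 const dnnl::memory::desc& wei_md,
                                 const dnnl::memory::desc& dst_md,
                                 bool user_scratchpad) {
  dnnl::primitive_attr attr;
  if (user_scratchpad) {
    // The primitive only reports its scratchpad size; the caller must pass a
    // buffer via DNNL_ARG_SCRATCHPAD on every execute().
    attr.set_scratchpad_mode(dnnl::scratchpad_mode::user);
  }
  // With the default scratchpad_mode::library, the primitive allocates the
  // scratchpad internally and holds it for as long as the primitive object
  // lives, i.e., for as long as it stays in the cache.
  dnnl::matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
  return dnnl::matmul(pd);
}
```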

penpornk (Member) left a comment

@jojivk73 Thank you for the detailed reply!

I misread the do_not_cache parameter. I thought that false meant the primitive is not cached (and assumed that the primitive is destroyed along with its auto-allocated scratchpad at the end of the kernel call, hence having a similar lifetime as the user-allocated scratchpad). I understand now that the primitive here is cached and therefore will hold on to the scratchpad memory for as long as it stays in the LRU.

// Create or retrieve matmul primitive from cache.
MklMatMulPrimitive<Tlhs, Trhs, Toutput>* matmul_prim =
    MklMatMulPrimitiveFactory<float, Tlhs, Trhs, Toutput>::Get(
        *params, false /* value for do_not_cache */);

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Mar 14, 2022
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 14, 2022
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 14, 2022
@copybara-service copybara-service bot merged commit b4bd9c3 into tensorflow:master Mar 14, 2022
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Mar 14, 2022
copybara-service bot pushed a commit that referenced this pull request Mar 15, 2022
Rolling back PR #54381 because it broke Windows continuous build.

PiperOrigin-RevId: 434790571