Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ROCm] AMDGPU XLA compiler bugfixes, HLO slice sorting #39106

Closed

Conversation

ekuznetsov139
Copy link
Contributor

This PR:
Implements HSACO cache, to avoid rerunning (expensive) AMDGPU compilation for identical modules
Makes sure that AMDGPU backend deletes temporary files after compilation
Implements deterministic sorting of HLO slices (without it, identical HLOs may result in different IRs)
Supplies workarounds for two bugs in the ROCm backend

Deleting temporary files after compilation
Deterministic sorting of HLO slices

Workaround for a memory overrun in HIP

Workaround for HIP's expectation that all modules are different
@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label May 2, 2020
@google-ml-butler google-ml-butler bot requested a review from joker-eph May 2, 2020 22:42
@gbaned gbaned self-assigned this May 3, 2020
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation May 3, 2020
@gbaned gbaned added the awaiting review Pull request awaiting review label May 8, 2020
@gbaned gbaned requested a review from cheshire May 19, 2020 13:57
@cheshire
Copy link
Member

Hi,

For commits we generally try to have a self-descriptive message; is it possible to update the title of this PR? (or if there are multiple bugfixes/features, to split it up?)

@gbaned gbaned removed the awaiting review Pull request awaiting review label May 20, 2020
@gbaned
Copy link
Contributor

gbaned commented May 27, 2020

@ekuznetsov139 Can you please check @cheshire's comments and keep us posted. Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label May 27, 2020
@ekuznetsov139 ekuznetsov139 changed the title [ROCm] XLA enhancements & bugfixes [ROCm] AMDGPU XLA compiler bugfixes, HLO slice sorting May 27, 2020
@gbaned
Copy link
Contributor

gbaned commented Jun 8, 2020

@ekuznetsov139 Can you please check @cheshire's comments and resolve conflicts?. Thanks!

@cheshire
Copy link
Member

cheshire commented Jun 8, 2020

@ekuznetsov139 additionally tests would be great, if possible.

@ekuznetsov139
Copy link
Contributor Author

Tests for which part?

@cheshire
Copy link
Member

cheshire commented Jun 8, 2020

All of it?

@gbaned
Copy link
Contributor

gbaned commented Jun 11, 2020

@ekuznetsov139 Can you please check @cheshire's comments and resolve conflicts?. Thanks!

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jun 24, 2020
@gbaned gbaned added stat:awaiting response Status - Awaiting response from author comp:gpu GPU related issues labels Jun 26, 2020
@deven-amd
Copy link
Contributor

@ekuznetsov139 gentle ping

@gbaned
Copy link
Contributor

gbaned commented Jul 17, 2020

@ekuznetsov139 Any update on this PR? Please. Thanks!

@ekuznetsov139
Copy link
Contributor Author

@gbaned I occasionally come here, look at my changes, try to imagine how any of them (let alone all of them) could be covered with unit tests, and generally fail. Do not close the PR, I'm sure I'll come up with something eventually.

@cheshire
Copy link
Member

Implements HSACO cache, to avoid rerunning (expensive) AMDGPU compilation for identical modules
Makes sure that AMDGPU backend deletes temporary files after compilation
Implements deterministic sorting of HLO slices (without it, identical HLOs may result in different IRs)
Supplies workarounds for two bugs in the ROCm backend

@ekuznetsov139 From your initial message it seems this PR is doing at least 3 different things. I think it would be simpler to split it first.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jul 19, 2020
@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Jul 20, 2020
@gbaned
Copy link
Contributor

gbaned commented Jul 24, 2020

@ekuznetsov139 Can you please check @cheshire's comments and resolve conflicts?. Thanks!

@ekuznetsov139
Copy link
Contributor Author

I've resubmitted part of this PR as #41641.
The conflict occurs because the part of patch that introduces HLO slice sorting is no longer applicable (because the function has been changed from returning a map to returning a vector). I need to test the change and see if slice sorting is still relevant.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Jul 26, 2020
PR Queue automation moved this from Assigned Reviewer to Closed/Rejected Aug 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes comp:gpu GPU related issues size:M CL Change Size: Medium
Projects
PR Queue
  
Closed/Rejected
Development

Successfully merging this pull request may close these issues.

None yet

6 participants