@huydhn huydhn commented Jan 21, 2023

Recently, there have been some flaky sccache start-up failures on PyTorch when building XLA, for example:

The full list can be found here. The error, strangely, comes only from XLA.

It turns out that the XLA image pytorch/xla_base:v0.6 uses upstream sccache from https://github.com/mozilla/sccache.git, while the rest of PyTorch CI uses a custom fork from https://github.com/pytorch/sccache.git, as defined in https://github.com/pytorch/pytorch/blob/master/.circleci/docker/common/install_cache.sh#L12

IMO, it's easier to stick with https://github.com/pytorch/sccache.git, like the rest of the CI, until we have the capacity to switch back to upstream sccache.

AFAIK, https://github.com/pytorch/sccache.git has some fixes to work with nvcc.
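
To make the difference concrete, here is a minimal sketch of how an image build script could choose between the two repositories named above. It is not the content of `install_cache.sh`: the function names `sccache_repo_url` and `install_sccache`, the `fork`/`upstream` flavor labels, and the plain `cargo install` build step are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: pick the sccache source repository for a CI image build.
# Function names and flavor labels are illustrative, not taken from install_cache.sh.
set -euo pipefail

# Map a flavor name to one of the two repository URLs mentioned in this PR.
sccache_repo_url() {
  case "${1:-fork}" in
    fork)     echo "https://github.com/pytorch/sccache.git" ;;
    upstream) echo "https://github.com/mozilla/sccache.git" ;;
    *)        echo "unknown sccache flavor: $1" >&2; return 1 ;;
  esac
}

# Clone and build the selected flavor (needs git, cargo, and network access).
install_sccache() {
  local url workdir
  url="$(sccache_repo_url "${1:-fork}")"
  workdir="$(mktemp -d)"
  git clone --depth 1 "${url}" "${workdir}/sccache"
  # Assumption: a plain cargo install is enough; the real script may build differently.
  cargo install --path "${workdir}/sccache"
}

# Print the URL that would be used by default; the install itself is not run here.
sccache_repo_url fork
```

Calling `install_sccache fork` in the XLA base image would then build from the same source as the rest of the CI, and switching back later would be a one-word change.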

@JackCaoG JackCaoG requested a review from yeounoh January 23, 2023 18:46
@JackCaoG

@yeounoh Can you take this one?


huydhn commented Jan 23, 2023

> @yeounoh Can you take this one?

Thanks, @JackCaoG. The change looks simple enough and the build seems to pass. Let me know if this works as expected.


yeounoh commented Jan 25, 2023

@huydhn will verify, thank you!

# if image layers are not present in the repo.
docker tag ${GCR_DOCKER_IMAGE} ${ECR_DOCKER_IMAGE_BASE}:v0.6 >/dev/null
docker push ${ECR_DOCKER_IMAGE_BASE}:v0.6 >/dev/null
docker tag ${GCR_DOCKER_IMAGE} ${ECR_DOCKER_IMAGE_BASE}:v0.7 >/dev/null

Could we use v0.8? v0.7 was already taken, so I pushed under v0.8 while testing/verifying your image in our CI.


Actually, I can merge and bump up the version separately.


yeounoh commented Jan 26, 2023

Oh, looking at the test failures, I think we also need to rebase before merging 🙏


huydhn commented Jan 26, 2023

@yeounoh Thank you for doing the rebase for me. I'll update the XLA Docker image on PyTorch CI to v0.8 accordingly.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jan 26, 2023
Given the context in pytorch/xla#4489, we now have a new XLA Docker image `v0.8`. This should fix the flaky sccache initialization failures with XLA.
Pull Request resolved: #93041
Approved by: https://github.com/malfet
ManfeiBai pushed a commit that referenced this pull request Jan 30, 2023
* Use pytorch/sccache

* Up the image version
@huydhn huydhn deleted the switch-to-pytorch-sccache branch March 13, 2023 20:25