
Conversation

@sdasgup3
Collaborator

@sdasgup3 sdasgup3 commented May 7, 2025

This PR tracks an issue where an internal TPU CI is failing on v5p hardware. A specific test fails with assertion errors at test_train_spmd_linear_model.py#L49 and test_train_spmd_linear_model.py#L51, with maximum absolute differences of 0.0042718649 and 0.0000191778 respectively.

The fix here is to relax the corresponding atol values.
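
For context, a minimal sketch of what assertions like these might look like. The names and exact tolerance values below are illustrative, not copied from the test:

```python
# Illustrative sketch only; the real test lives in
# test/spmd/test_train_spmd_linear_model.py and its names/values may differ.
import torch

def compare_runs(baseline_losses, checkpointing_losses,
                 baseline_output, checkpointing_output):
    # Observed max abs difference on v5p was ~4.3e-3 for the losses,
    # so that comparison needs the looser atol.
    torch.testing.assert_close(
        baseline_losses, checkpointing_losses, rtol=0.0, atol=5e-3)
    # The outputs only differed by ~1.9e-5, so a tighter atol is enough.
    torch.testing.assert_close(
        baseline_output, checkpointing_output, rtol=0.0, atol=1e-4)
```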

@sdasgup3 sdasgup3 requested a review from bhavya01 May 7, 2025 23:53
@sdasgup3 sdasgup3 added the CI CI related change label May 7, 2025
@bhavya01
Collaborator

bhavya01 commented May 8, 2025

I see that both results are produced on TPU, one with gradient checkpointing and one without. The results should be exactly the same.

Generally, there is some tolerance when we run on different hardware, but in this case I expected the results to be exactly the same. I think we should take a closer look at this problem; we might find something is wrong with the way we checkpoint.
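
To make the comparison concrete, here is a rough sketch (not the actual test code) of how the two runs differ only in whether the forward pass is wrapped in activation checkpointing, which is why one might expect identical results on identical hardware:

```python
# Rough sketch, not the actual test: both runs share the same model, data,
# and seed; only the forward pass differs.
import torch
from torch.utils.checkpoint import checkpoint

def forward(model, x, use_checkpointing: bool):
    if use_checkpointing:
        # Activations are recomputed during backward instead of being stored.
        return checkpoint(model, x, use_reentrant=False)
    return model(x)
```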

@yaoshiang yaoshiang enabled auto-merge (squash) June 17, 2025 22:27
@yaoshiang
Collaborator

This has been a long-lived bug, and Bhavya initially expressed concern about the deviation between grad checkpointing and not. More recently, he is okay with merging this fix and opening a new issue to further investigate the numerical equivalence of checkpointing vs. not. I have set this PR to auto-merge once the tests pass. Thanks guys.

@yaoshiang yaoshiang disabled auto-merge June 23, 2025 18:56
@yaoshiang
Collaborator

yaoshiang commented Jun 25, 2025

As I look at this bug a bit further, I think it's actually okay to allow some tolerance and not expect exact equivalence between grad checkpointing and not.

With grad checkpointing vs. not, you could imagine XLA reordering some operations, leading to small rounding differences in the last bits of the weights. The best place to measure tolerance would be on the weights and activations themselves; by the time you get to the loss, you can imagine a decent amount of movement. A toy illustration follows.
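
For intuition only (not part of this PR), a toy example of how a different operation order changes the low-order bits in float32:

```python
# Toy illustration: float32 addition is not associative, so a reordered
# reduction (as an XLA rematerialization/fusion choice might produce)
# can change the least significant bits of a result.
import numpy as np

a = np.float32(2**24)  # 16777216; float32 spacing at this magnitude is 2
b = np.float32(1.0)
c = np.float32(1.0)

print((a + b) + c)  # 16777216.0 -- each 1.0 is rounded away individually
print(a + (b + c))  # 16777218.0 -- adding the small terms first preserves them
```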

@yaoshiang
Collaborator

This PR is stuck in CI/CD and the author is no longer working on it. Closing; replaced by #9404.

@yaoshiang yaoshiang closed this Jun 25, 2025