[v2.2.1] Release Tracker #119295

Closed
atalman opened this issue Feb 6, 2024 · 38 comments
Labels: oncall: releng (In support of CI and Release Engineering), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

atalman (Contributor) commented Feb 6, 2024

🐛 Describe the bug

This issue is for tracking cherry-picks to the release branch. The release branch for the 2.2.1 release is release/2.2.

Our plan from this point is roughly:

  • Phase 1 (until 2/14): Cherry-pick post deadline (End of day 5PM PST)
  • Phase 2 (after 2/14): Perform extended integration/stability/performance testing based on Release Candidate builds.

Only issues that have ‘cherry-picks’ in this tracker will be considered for the release.

Cherry-Pick Criteria

Phase 1 (until 2/14):

The Releng team relies on the cherry-pick process to manage risk to release quality: by porting only a small set of "must-have" commits from trunk into the release branch, we limit changes to the minimum needed to address pressing issues. Thus, not everything a developer lands in trunk will make it into the release. Please consider the criteria below and follow the cherry-picking process. Only low-risk changes may be cherry-picked from master:

  1. Fixes to regressions against the most recent release (e.g. 2.2.0 for 2.2.1 release; see module: regression issue list)
  2. Low risk critical fixes for: silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks
  3. Fixes to new features being introduced in 2.2.0 release
  4. Documentation improvements
  5. Release branch specific changes (e.g. blocking ci fixes, change version identifiers)

Any other change requires special dispensation from the release managers (currently @atalman, @huydhn, @osalpekar, @malfet). If this applies to your change please write "Special Dispensation" in the "Criteria Category:" template below and explain.

Phase 2 (after 2/14):

Note that changes here require us to rebuild a Release Candidate and restart extended testing (likely delaying the release). Therefore, the only accepted changes are release-blocking critical fixes for: silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks.

Changes will likely require a discussion with the larger release team over VC or Slack.

Cherry-Pick Process

  1. Ensure your PR has landed in master. This does not apply to release-branch specific changes (see Phase 1 criteria).

  2. Create (but do not land) a PR against the release branch.

    # Find the hash of the commit you want to cherry pick
    # (for example, abcdef12345)
    git log
    
    git fetch origin release/2.2
    git checkout release/2.2
    git cherry-pick abcdef12345
    
    # Submit a PR against 'release/2.2', either
    # via the GitHub UI (after pushing to your fork):
    git push my-fork
    
    # or via the GitHub CLI:
    gh pr create --base release/2.2

  3. Make a request below with the following format:

Link to landed trunk PR (if applicable):
* 

Link to release branch PR:
* 

Criteria Category:
* 
  4. Someone from the release team will reply with approved / denied or ask for more information.
  5. If approved, someone from the release team will merge your PR once the tests pass. Do not land the release branch PR yourself.
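
The request format from step 3 can be filled in with a small shell helper. This is a minimal sketch and not part of the release tooling: the function name `format_cherry_pick_request` is hypothetical, and the PR numbers in the example call are taken from a request later in this thread.

```shell
# Hypothetical helper (not part of PyTorch tooling): prints a cherry-pick
# request in the format this tracker asks for.
# Usage: format_cherry_pick_request TRUNK_PR RELEASE_PR CATEGORY
format_cherry_pick_request() {
  if [ "$#" -ne 3 ]; then
    echo "usage: format_cherry_pick_request TRUNK_PR RELEASE_PR CATEGORY" >&2
    return 1
  fi
  printf 'Link to landed trunk PR (if applicable):\n* %s\n\n' "$1"
  printf 'Link to release branch PR:\n* %s\n\n' "$2"
  printf 'Criteria Category:\n* %s\n' "$3"
}

# Example call; pass "NA" as the first argument for release-branch-only changes.
format_cherry_pick_request "#117453" "#119916" \
  "Fixes to new features being introduced in 2.2.0 release"
```

The printed text can be pasted into a comment on this tracker as-is.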

NOTE: Our normal tools (ghstack / ghimport, etc.) do not work on the release branch.

See HUD 2.2

Versions

2.2.1

atalman changed the title from [2.2.1] Release Tracker to [v2.2.1] Release Tracker Feb 6, 2024
atalman pinned this issue Feb 6, 2024

huydhn (Contributor) commented Feb 6, 2024

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:


@atalman merged

antoniojkim (Collaborator) commented Feb 6, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Low risk critical fix for unrecoverable crash via abort

@malfet: merged

snadampal (Collaborator) commented Feb 6, 2024

Link to master PR (if applicable):

Link to 2.2 release branch PR:

Criteria Category:
Fixes this performance regression issue: #119374
This PR fixes the regression introduced when the aarch64 linux wheel builder was moved from local openblas builds to the conda package during PyTorch 2.1.


@malfet: needs some discussion, as technically it's not a 2.1->2.2 regression, but rather a 2.0->2.2 one


@malfet merged

mvpatel2000 (Contributor) commented Feb 6, 2024

DTensor / device mesh is currently broken on 2.2 and has serious silent correctness errors. These were fixed by the following PRs:

Related to distributed checkpointing:

Related to the DTensor sync_module_states bug:

These are all low-risk PRs and fix various issues, e.g. silent correctness.

CC: @chauhang


@mvpatel2000 could you please post cherry-picks to the release/2.2 branch for these PRs?

Edit: skylion included PR below

malfet added the oncall: releng and triaged labels Feb 6, 2024

mdevino (Contributor) commented Feb 7, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:
Documentation Improvements - this PR fixes a typo in the torch.frombuffer() function documentation.


@atalman merged

eellison unpinned this issue Feb 7, 2024
atalman pinned this issue Feb 8, 2024

mdevino (Contributor) commented Feb 9, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:
Documentation Improvements - this PR fixes a typo in the contribution guide


@atalman merged

huydhn (Contributor) commented Feb 9, 2024

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:


@atalman merged

atalman (Contributor, Author) commented Feb 9, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • CI fix: use OIDC when uploading binaries

@atalman merged

atalman (Contributor, Author) commented Feb 9, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • CI fix: use OIDC when uploading triton wheels

@atalman merged

atalman (Contributor, Author) commented Feb 9, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • CI fix: use OIDC for ROCm

@atalman merged

atalman (Contributor, Author) commented Feb 9, 2024

mvpatel2000 (Contributor) commented Feb 12, 2024

Fixes sharded checkpoint bug which causes OOM on large GPU counts:

#117799

CC: @chauhang @fegin

Edit: skylion included PR below

Edit: See #119295 (comment). #117799 was replaced by #118197 in main


@atalman wrote: @mvpatel2000 I don't see this PR being included as a cherry-pick

@atalman wrote: As discussed here: #119295 (comment), this PR is not required

Skylion007 (Collaborator) commented Feb 12, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Silent Correctness

@Skylion007 The cherry-pick introduces a number of breakages. Could you please resolve these? Resolved


@atalman merged

Skylion007 (Collaborator) commented Feb 12, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Silent Correctness

@atalman merged

Skylion007 (Collaborator) commented Feb 12, 2024

Skylion007 (Collaborator) commented Feb 12, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Checkpoint loading bug in DCP: Fix to new feature / regression fix.

@atalman merged

Skylion007 (Collaborator) commented Feb 12, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Crash (low risk fix)

@atalman wrote: @Skylion007 this cherry-pick is missing unit tests; we require unit tests for all cherry-picks. Also, this one mentions #117799. Do we need that cherry-pick here as well?


@fegin wrote: @atalman
#117799 was never landed. The actual fixes are #118197 and #119716. These 2 PRs do not have unit tests because they actually fix the existing unit tests. Before these 2 PRs landed, the tests (e.g., "testdistributed/checkpoint/state_dict.py") were marked as flaky. We tested locally before the PRs landed, and the errors could be reproduced in roughly 3 out of 10 runs. The PRs fix the issue that the state_dict may have garbage values.
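
A flake rate like the "3 out of 10 runs" figure above can be estimated by rerunning a test command and counting failures. This is a minimal sketch under stated assumptions: the helper name `count_failures` is hypothetical, and the invocation in the trailing comment is a placeholder, not the command the PR authors used.

```shell
# Hypothetical helper: run a command N times and print how many runs failed.
# Usage: count_failures RUNS COMMAND [ARGS...]
count_failures() {
  runs="$1"
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's output; only its exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
}

# Example (placeholder path; substitute the real test command):
#   count_failures 10 python path/to/test_file.py
```

With only ten runs the estimate is coarse, so a larger run count gives a steadier number when comparing before and after a fix.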


@atalman merged

Skylion007 added this to the 2.2.1 milestone Feb 12, 2024
eellison unpinned this issue Feb 12, 2024
atalman pinned this issue Feb 12, 2024

atalman (Contributor, Author) commented Feb 12, 2024

eellison unpinned this issue Feb 12, 2024
atalman pinned this issue Feb 12, 2024

atalman (Contributor, Author) commented Feb 12, 2024

Link to landed trunk PR (if applicable):

  • NA

Link to release branch PR:

Criteria Category:

  • CI release only, bump versions

@atalman merged

atalman (Contributor, Author) commented Feb 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical fix for Windows

@atalman merged

atalman (Contributor, Author) commented Feb 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • PyPI metadata fix

@atalman merged

atalman (Contributor, Author) commented Feb 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical Regression

@atalman merged

fegin (Contributor) commented Feb 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Documentation improvements

@atalman merged

fegin (Contributor) commented Feb 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Low risk critical fixes

@atalman merged

atalman (Contributor, Author) commented Feb 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical CI fix - Triton

@huydhn merged

mvpatel2000 (Contributor) commented Feb 13, 2024

wanchaol (Contributor) commented Feb 13, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Documentation Improvements

@huydhn merged

atalman (Contributor, Author) commented Feb 14, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical fix on V100, regression against 2.1.0 release

@huydhn merged

atalman (Contributor, Author) commented Feb 14, 2024

Link to landed master PR (if applicable):
Properly preserve SymInt input invariant when splitting graphs #117406

Link to release branch PR:
Properly preserve SymInt input invariant when splitting graphs (#117406)

Criteria Category:
Crashes - Functional failure in case of DDPOptimizer and dynamic shape.


@atalman merged

mvpatel2000 (Contributor) commented Feb 14, 2024

Link to landed master PR:
#117453

Link to release branch PR:
#119916

Criteria Category:
Fixes to new features being introduced in 2.2.0 release -- minor fix for argument not propagated in new API introduced in 2.2

--
@fegin wrote: @mvpatel2000 Can you confirm this is required to avoid OOM? distributed_state_dict will perform CPU offloading at a later stage after state_dict() even without this PR. This PR just makes the offloading happen in the state_dict() call. This is specifically for the model state_dict and, iirc, your experiments hit OOM when doing the optimizer state_dict, not the model state_dict. This PR does not have any unit tests, and our existing CPU offload unit tests only test the final results; they do not test when CPU offloading happens. So it seems that this PR may not be included in 2.2.1 due to the lack of unit tests.

cc @atalman


@atalman wrote: @mvpatel2000 reverting this PR: #119995. Will not be merging this cherry-pick

wanchaol (Contributor) commented Feb 15, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • regressions against the most recent release, silent correctness

@atalman merged

Vins33 commented Feb 15, 2024

Hi guys, could you tell me if the next release will support CUDA 12.4?

@atalman wrote: @Vins33 We will have concrete dates once CUDA 12.4 is released. Please stay tuned

atalman (Contributor, Author) commented Feb 15, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical regression

@atalman merged

atalman (Contributor, Author) commented Feb 15, 2024

Please note: we don't accept cherry-picks anymore. We are currently in Phase 2 (after 2/14).

huydhn (Contributor) commented Feb 20, 2024

Link to landed trunk PR (if applicable):

Link to release branch PR:

Criteria Category:

  • CI fix to ignore 3.12 data binary which doesn't exist atm

@huydhn merged

atalman (Contributor, Author) commented Feb 22, 2024

Closing this. Completed
