Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v2.1.1] Release Tracker #110961

Closed
atalman opened this issue Oct 10, 2023 · 46 comments
Closed

[v2.1.1] Release Tracker #110961

atalman opened this issue Oct 10, 2023 · 46 comments
Labels
triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Milestone

Comments

@atalman
Copy link
Contributor

atalman commented Oct 10, 2023

🐛 Describe the bug

This issue is for tracking cherry-picks to the release branch. Following is release branch for the 2.1.1 release.

Our plan from this point from this point is roughly:

  • Phase 1 (until 11/3): Cherry-pick post deadline (End of day 5PM PST)
  • Phase 2 (after 11/3): Perform extended integration/stability/performance testing based on Release Candidate builds.

Only issues that have ‘cherry-picks’ in this tracker will be considered for the release.

Cherry-Pick Criteria

Phase 1 (until 11/3):

The Releng team relies on the cherry pick process to manage risk to release quality, i.e. by porting a small set of commit from trunk that are "must-have" into the release branch, we limit the change to the minimal to address pressing issues. Thus, not everything a developer land into the trunk will make it into the release. So, please consider the criteria below and follow the cherry picking process. Only low-risk changes may be cherry-picked from master:

  1. Fixes to regressions against the most recent release (e.g. 2.1.0 for 2.1.1 release; see module: regression issue list)
  2. Low risk critical fixes for: silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks
  3. Fixes to new features being introduced in 2.1.0 release
  4. Documentation improvements
  5. Release branch specific changes (e.g. blocking ci fixes, change version identifiers)

Any other change requires special dispensation from the release managers (currently @atalman, @huydhn, @osalpekar, @malfet). If this applies to your change please write "Special Dispensation" in the "Criteria Category:" template below and explain.

Phase 2 (after 11/3):

Note that changes here require us to rebuild a Release Candidate and restart extended testing (likely delaying the release). Therefore, the only accepted changes are Release-blocking critical fixes for: silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks

Changes will likely require a discussion with the larger release team over VC or Slack.

Cherry-Pick Process

  1. Ensure your PR has landed in master. This does not apply for release-branch specific changes (see Phase 1 criteria).

  2. Create (but do not land) a PR against the release branch.

    # Find the hash of the commit you want to cherry pick
    # (for example, abcdef12345)
    git log
    
    git fetch origin release/2.1
    git checkout release/2.1
    git cherry-pick abcdef12345
    
    # Submit a PR based against 'release/2.1' either:
    # via the GitHub UI
    git push my-fork
    
    # via the GitHub CLI
    gh pr create --base release/2.1
  3. Make a request below with the following format:

Link to landed master PR (if applicable):
* 

Link to release branch PR:
* 

Criteria Category:
* 
  1. Someone from the release team will reply with approved / denied or ask for more information.
  2. If approved, someone from the release team will merge your PR once the tests pass. Do not land the release branch PR yourself.

NOTE: Our normal tools (ghstack / ghimport, etc.) do not work on the release branch.

See HUD 2.1

Versions

2.1.1

@atalman atalman pinned this issue Oct 10, 2023
@tringwald
Copy link
Collaborator

tringwald commented Oct 10, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Documentation fixes

@atalman merged

@cpuhrsch cpuhrsch added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 11, 2023
@Skylion007
Copy link
Collaborator

Skylion007 commented Oct 11, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Documentation fixes

@atalman merged. However, will be reverted since does not fall into documentation fixes

Revert:


@atalman merged
@malfet: Criteria category is misleading, as #109326 is not a documentation fix

@soumith
Copy link
Member

soumith commented Oct 11, 2023

got request from Mosaic to include these two fixes:

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical fixes

@atalman merged

@atalman
Copy link
Contributor Author

atalman commented Oct 11, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Required CI fix for linux Arm 64

@atalman merged

@Skylion007
Copy link
Collaborator

Skylion007 commented Oct 13, 2023

A minor NCCL version dropped that has a bug fix. Should we open a PR onto 2.1 tree to update it?
Link to the trunk PR:

Link to release branch PR:

Criteria Category:


@malfet: merged

@Blackhex Blackhex unpinned this issue Oct 17, 2023
@atalman atalman pinned this issue Oct 19, 2023
@atalman
Copy link
Contributor Author

atalman commented Oct 19, 2023

@atalman
Copy link
Contributor Author

atalman commented Oct 19, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical CI fix

@malfet merged

@jataylo
Copy link
Collaborator

jataylo commented Oct 20, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Unit test fixes

@huydhn: Do we need to cherry pick this change? From what I see, this fixes the issue introduced by #111381, but that change isn't picked to 2.1.1 (cc @shunting314) Confirmed with @shunting314 that we don't need to cherry pick #111381 or this one

@atalman
Copy link
Contributor Author

atalman commented Oct 20, 2023

Link to landed master PR (if applicable):

  • NA

Link to release branch PR:

Criteria Category:

  • CI Fixes

@atalman merged

@wz337
Copy link
Contributor

wz337 commented Oct 20, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:


@atalman merged

@wz337
Copy link
Contributor

wz337 commented Oct 20, 2023

Comments from user:

without this, resuming training at scale goes from minutes to hours

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Special dispensation: this is a low risk performance fix.
    The context is that without the fix, distributed checkpointing load scale very very poorly with number of GPUs. While it is fast on 1-2 nodes, it slows down to nearly an hour at scale (480 GPUs), because of the collective communication happening behind the scenes. This is flagged by MosaicML folks.

@malfet: Looks fine, need to do some manual testing before merging
@wz337: I have added more context in the cherry-pick PR and included a unit test update that could potentially capture the performance improvement.


@atalman merged

@thiagocrepaldi
Copy link
Collaborator

thiagocrepaldi commented Oct 20, 2023

@atalman
Copy link
Contributor Author

atalman commented Oct 23, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical CI fix

@atalman merged

@Skylion007
Copy link
Collaborator

Skylion007 commented Oct 23, 2023

#111703 is a good backport candidate and fixes a repeatable and problematic mem leak according to the issue.

@Skylion007 please follow cherry-pick process in the header to post a cherry-pick for review: #110961 (comment)

Closed my PR for @tringwald's issue below

@tringwald
Copy link
Collaborator

tringwald commented Oct 23, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Memory Leak

@malfet: merged, as change is very localized, but we need to have better guidance to distinguish between large and small memory leaks going forward

@jcaip
Copy link
Contributor

jcaip commented Oct 23, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical Fixes

@atalman moving it to 2.1.2 release

@eqy
Copy link
Collaborator

eqy commented Oct 24, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical fixes

@malfet: this does not look like a regression between 2.0 and 2.1 is it?
@eqy: not sure if it's 2.0, this fixes hopefully the remaining outstanding issues that began after #98793


@huydhn merged

@eqy
Copy link
Collaborator

eqy commented Oct 24, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical fixes

@atalman merged

@BowenBao
Copy link
Collaborator

BowenBao commented Oct 24, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical fixes: regression, crash

@atalman merged

@malfet
Copy link
Contributor

malfet commented Oct 24, 2023

Link to the landed main commit:

Lint to release branch PR:

Criteria Category:

  • Regression from the previous release, crash

@atalman merged

@huydhn
Copy link
Contributor

huydhn commented Oct 24, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical CI fix

@huydhn merged

@atalman
Copy link
Contributor Author

atalman commented Oct 25, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical CI fix

@malfet merged

@atalman
Copy link
Contributor Author

atalman commented Oct 25, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical CI docker release fix

@atalman merged

@atalman
Copy link
Contributor Author

atalman commented Oct 26, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical Fix

@malfet merged

@atalman
Copy link
Contributor Author

atalman commented Oct 26, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:


@bdhirsh could you please take a look at the cherry pick, there are multiple failures

@atalman
Copy link
Contributor Author

atalman commented Oct 26, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:


@malfet: merged

@kit1980
Copy link
Member

kit1980 commented Oct 26, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Low risk critical fixes for backwards compatibility,

Alban: This is actually a regression where it used to be possible to have indices on the mps device and it is not possible anymore.


@huydhn merged

@huydhn
Copy link
Contributor

huydhn commented Oct 27, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Critical CI fix, GitHub MacOS runner

@huydhn merged

@malfet
Copy link
Contributor

malfet commented Oct 30, 2023

Link to the landed PR:

Link to the release branch PR:

Criteria Category:

  • Low risk critical fixes for silent correctness on MPS->CPU copy

@malfet merged

@huydhn
Copy link
Contributor

huydhn commented Oct 31, 2023

Link to the landed PR:

Link to the release branch PR:

Criteria Category:

  • Critical CI fixes

@huydhn merged. Learn from @malfet: We don't have separation of the Docker image from main and release branch on builder atm. So I shouldn't have merged the change into the release branch because it would override the correct image from main. @malfet fixes this by rebuilding the image in pytorch/builder@617327e

@wz337
Copy link
Contributor

wz337 commented Oct 31, 2023

Got request from MosaicML to include this 1 fix.

Comments from users:

without this, we can't do training resumption because the model gets loaded without the optimizer

Link to the landed PR:

Link to the release branch PR:

Criteria Category:

  • Special dispensation: this is a low-risk fix for an existing feature. Requested by MosaicML to include.

@malfet: I think category is wrong, this feels like a feature work (adding new argument to the existing method), isn't it?
@huydhn: To summarize the decision going forward, for patch release, we will require regression tests to be part of the cherry pick. Without some tests, it's hard to say if the cherry pick is needed. In this specific case, this looks like the change doesn't fit for patch release and would better be as part of 2.2 release instead.
@malfet: merged

@snadampal snadampal unpinned this issue Oct 31, 2023
@huydhn
Copy link
Contributor

huydhn commented Nov 1, 2023

Link to the landed PR:

  • N/A

Link to the release branch PR:

Criteria Category:

  • Critical CI fixes. This is to update the pinned Docker image to -2.1 to match the release branch

@huydhn merged

@huydhn huydhn pinned this issue Nov 1, 2023
@huydhn
Copy link
Contributor

huydhn commented Nov 2, 2023

Link to the landed PR:

Link to the release branch PR:

Criteria Category:

  • Critical CI fixes. This allows the build Docker images to be updated when cutting RC

@malfet: We never had release tags in builder repo, had we? And may be it's fine to start doing it going forward, but not sure why it needs to be cherry-picked into release branch for 2.1.1 release)
@huydhn: The cherry pick is just to make sure that we have 2.1 Docker images pushed to Docker hub. After chatting with Nikita, we decide to close this one as it doesn't fit the cherry pick criteria. See comments in pytorch/builder#1576 (comment)
@atalman not merged to 2.1.1

@malfet
Copy link
Contributor

malfet commented Nov 2, 2023

Link to the landed trunk PR:

Link to the release branch PR:

Criteria Category:

  • Fixes to regressions against the most recent release (didn't crash in 2.0.1, crashes in 2.1.0)

@malfet: merged

@malfet
Copy link
Contributor

malfet commented Nov 2, 2023

Link to the landed trunk PR:

Link to the release branch PR:

Criteria Category:

  • Fixes to regressions against the most recent release (didn't crash in 2.0.1, crashes in 2.1.0)

@malfet merged

@huydhn
Copy link
Contributor

huydhn commented Nov 2, 2023

@drisspg
Copy link
Contributor

drisspg commented Nov 2, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Fixes to regression when trying to compile scaled dot product attention

@huydhn merged, confirm that the issue on #110832 (reproducible on 2.1.0) is fixed

@drisspg
Copy link
Contributor

drisspg commented Nov 2, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:

  • Silent correctness regression for scaled dot product attention

@huydhn merged

@drisspg
Copy link
Contributor

drisspg commented Nov 2, 2023

Link to landed master PR (if applicable):

Link to release branch PR:

Criteria Category:


@huydhn merged confirm that the issue on #112577 (reproducible on 2.1.0) is fixed

@penguinwu penguinwu unpinned this issue Nov 3, 2023
@malfet malfet pinned this issue Nov 6, 2023
@malfet
Copy link
Contributor

malfet commented Nov 6, 2023

Link to landed trunk PR:

Link to release branch PR:

Criteria Category:

  • Fixes to new features being introduced in 2.1.0 release

@malfet merged

@malfet
Copy link
Contributor

malfet commented Nov 14, 2023

Closing and unpinning, as final RC is out

@malfet malfet closed this as completed Nov 14, 2023
@malfet malfet unpinned this issue Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

No branches or pull requests