-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Description
High level summary
Migration to fetch_refspecs with pygit2 backend broke all commands that require cloning with git remotes (i.e. get,ls, import, pull) and/or fetch experiment refs, when using SSH URLs or HTTP(s) with a credential manager.
Timeline
Bug
Feb 6, 2023, 3:07 PM GMT+1@karajan1001 merges Use pygit2 backend for fetch_refspecs #8907 introducing the bug.Feb 9, 2023, 6:17 PM GMT+1DVC 2.44.0 released with the bug.
Bug Fix
Feb 13, 2023, 3:18 PM GMT+1User reports encountering the issue via GH (pull: pulling from private github dataregistry repo with git credential helper not working anymore #9016)Feb 13, 2023, 3:23 PM GMT+1@skshetry identifies the problem in pull: pulling from private github dataregistry repo with git credential helper not working anymore #9016 (comment)Feb 14, 2023, 11:52 AM GMT+1@skshetry self-assigns issue.Feb 14, 2023, 2:33 PM GMT+1@skshetry merges scm: use dulwich backend when fetching exps during clone/pull #9023
This fixes the bug by hardcoding the backend todulwich, with the exception ofexp pull.Feb 14, 2023, 2:51 PM GMT+1DVC 2.45.0 released fixing the bug, with the exception ofexp pull.
Regression for SSH URLs
Feb 16, 2023, 1:21 PM GMT+1@skshetry merges Revert "scm: use dulwich backend when fetching exps during clone/pull" #9041
This reintroduces the bug for SSH URLs in any command.Feb 16, 2023, 1:46 PM GMT+1DVC 2.45.1 released with the regression.
Regression fix
Feb 17, 2023, 8:16 PM GMT+1User suggests there might be a regression from2.45.0in pull: pulling from private github dataregistry repo with git credential helper not working anymore #9016 (comment).Feb 20, 2023, 4:54 PM GMT+1User confirms regression with SSH URL inhttps://github.com/pull: pulling from private github dataregistry repo with git credential helper not working anymore #9016#issuecomment-1437231527Feb 22, 2023, 3:24 PM GMT+1Client reports an issue withdvc getand SSH URL via slack. @daavoo investigates the report and confirms regression.Feb 23, 2023, 10:34 AM GMT+1@daavoo asks for confirmation on someone looking into the regression via slackFeb 27, 2023, 4:40 PM GMT+1@daavoo merges fetch_refspecs: Usedulwichfor ssh urls. scmrepo#196
fixing regression for SSH URLs, with the exception ofexp pull.Feb 27, 2023, 4:40 PM GMT+1scmrepo 0.1.13 released fixing the regression, with the exception ofexp pull.
exp pull fix
Mar 2, 2023, 3:23 PM GMT+1client reports thatexp pullis still not working via slackMar 3, 2023, 2:11 PM GMT+1@daavoo merges pygit2: Raise NotImplementedError on ssh remotes. scmrepo#201 fixingexp pull.Mar 3, 2023, 2:11 PM GMT+1scmrepo 0.1.14 is released with the fix.
Perf indicators
- Time to identify: 4 days (time between affected DVC release and user report)
- Time to bug fix: 4 days (time between user report and
DVC 2.45.0) - Time to regression: 2 days (time between
DVC 2.45.0andDVC 2.45.1) - Time to regression fix: 11 days (time between
DVC 2.45.1andscmrepo 0.1.13) - Time to exp pull fix: 4 days (time between
scmrepo 0.1.13andscmrepo 0.1.14)
Impact
Very common workflows were broken in DVC 2.44.0, 2.44.1, and 2.45.1 for git remotes with SSH URLs and/or HTTP(s) with a credential manager.
The time for releasing a version fixed 2.45.0 was short, but the regression was quickly reintroduced.
exp pull was broken for all versions but being a less common command the reports were not as abundant and it took quite some time to realize it was broken.
A high number of reports from multiple users and sources.
Only pip installations are currently fixed, any other distributions need a new DVC release which is blocked by #9098
Root cause analysis
-
A change mainly focused on speeding up the local collection of experiments ended up affecting many dvc commands that don't require experiment refs.
fetch_refspecsis being called unnecessarily (in many cases). -
Testing suite provided a false sense of security when in reality very common workflows are not being really tested (interactions with SSH URLs and HTTP(s) with credentials managers).
-
After the first fix, no regression tests were introduced.
-
When the regression was introduced for SSH URLs, we hesitate from the initial user reports, and took 11 days to address it.
A similar problem was encountered when we first migrated to dulwich backend for git clone operations. In that case, it was only broken for cases using a credential manager. No tests were introduced in that case either.
Prevention
-
Separation of concerns. Don't call
fetch_refspecsunnecessarily. -
Add tests for cases with SSH URLs and credentials manager in all commands that require interacting with a git remote.
-
Add regression tests when fixing something that was not being tested.