Postmortem: DVC commands broken for git remotes with SSH URLs or HTTP(s) with a credential manager #9117

@daavoo

High level summary

The migration to fetch_refspecs with the pygit2 backend broke all commands that require cloning from git remotes (i.e. get, ls, import, pull) and/or fetching experiment refs when the remote uses an SSH URL or HTTP(s) with a credential manager.

Timeline

Bug

Bug Fix

Regression for SSH URLs

Regression fix

exp pull fix

Perf indicators

  • Time to identify: 4 days (time between affected DVC release and user report)
  • Time to bug fix: 4 days (time between user report and DVC 2.45.0)
  • Time to regression: 2 days (time between DVC 2.45.0 and DVC 2.45.1)
  • Time to regression fix: 11 days (time between DVC 2.45.1 and scmrepo 0.1.13)
  • Time to exp pull fix: 4 days (time between scmrepo 0.1.13 and scmrepo 0.1.14)

Impact

Very common workflows were broken in DVC 2.44.0, 2.44.1, and 2.45.1 for git remotes with SSH URLs and/or HTTP(s) with a credential manager.
The fixed version (2.45.0) was released quickly, but a regression followed shortly afterwards in 2.45.1.

exp pull was broken for all versions, but because it is a less common command, reports were less abundant and it took quite some time to realize it was broken.

There was a high number of reports from multiple users and sources.

Only pip installations are currently fixed. All other distributions need a new DVC release, which is blocked by #9098.

Root cause analysis

  • A change mainly focused on speeding up the local collection of experiments ended up affecting many dvc commands that don't require experiment refs. fetch_refspecs is being called unnecessarily (in many cases).

  • The test suite provided a false sense of security: in reality, very common workflows (interactions with SSH URLs and HTTP(s) with credential managers) are not actually being tested.

  • After the first fix, no regression tests were introduced.

  • When the regression was introduced for SSH URLs, we hesitated over the initial user reports and took 11 days to address it.

A similar problem was encountered when we first migrated to the dulwich backend for git clone operations. In that case, only setups using a credential manager were broken. No tests were introduced in that case either.
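The first root cause (fetch_refspecs being called by commands that don't need experiment refs) can be illustrated with a minimal sketch of deferred fetching. Everything here is hypothetical: `RemoteRepo`, `list_files`, `exp_refs`, the injected `fetch` callable, and the `refs/exps/*` refspec are illustrative names only and are not DVC's or scmrepo's actual API.

```python
from typing import Callable, List, Optional


class RemoteRepo:
    """Hypothetical wrapper around a git remote (illustration only)."""

    def __init__(self, url: str, fetch: Callable[[str, str], List[str]]):
        self.url = url
        # The fetch function is injected so it can be replaced in tests.
        self._fetch = fetch
        self._exp_refs: Optional[List[str]] = None

    def list_files(self) -> List[str]:
        # Stand-in for commands such as get/ls/import: they do not need
        # experiment refs, so nothing is fetched here.
        return ["data.csv"]

    @property
    def exp_refs(self) -> List[str]:
        # Experiment refs are fetched only on first access, then cached.
        if self._exp_refs is None:
            self._exp_refs = self._fetch(self.url, "refs/exps/*")
        return self._exp_refs


# Record every fetch so we can check when it actually happens.
calls = []


def fake_fetch(url: str, refspec: str) -> List[str]:
    calls.append((url, refspec))
    return ["refs/exps/example"]


repo = RemoteRepo("git@example.com:org/repo.git", fake_fetch)
repo.list_files()
assert calls == []          # no fetch for commands that don't need refs
repo.exp_refs
repo.exp_refs
assert len(calls) == 1      # fetched once, only when refs were requested
```

With this shape, a failure in the ref-fetching path can only affect code that actually asks for experiment refs.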

Prevention

  • Separation of concerns. Don't call fetch_refspecs unnecessarily.

  • Add tests for SSH URLs and credential managers in all commands that interact with a git remote.

  • Add regression tests when fixing something that was not being tested.
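As a rough sketch of what the test coverage above could look like, the snippet below builds a matrix of commands and remote URL styles. The helper `classify_remote_url`, the `example.com` URLs, and the matrix layout are assumptions made for illustration; they are not part of DVC's test suite. Credential manager setup is an environment concern and is not modeled here.

```python
import itertools
import re
from urllib.parse import urlparse

# scp-like syntax: [user@]host:path, with no "//" right after the colon.
_SCP_LIKE = re.compile(r"^(?:[^@/\s]+@)?[^:/\s]+:(?!//)\S+$")


def classify_remote_url(url: str) -> str:
    """Hypothetical helper: label a git remote URL as 'ssh', 'http' or 'other'."""
    scheme = urlparse(url).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme in ("ssh", "git+ssh"):
        return "ssh"
    if _SCP_LIKE.match(url):
        return "ssh"
    return "other"


# Commands named in this postmortem as affected.
COMMANDS = ["get", "ls", "import", "pull", "exp pull"]

# Placeholder remotes covering both SSH spellings and HTTPS.
REMOTE_URLS = [
    "git@example.com:org/repo.git",
    "ssh://git@example.com/org/repo.git",
    "https://example.com/org/repo.git",
]

# Every command paired with every URL style.
TEST_MATRIX = list(itertools.product(COMMANDS, REMOTE_URLS))

# Sanity check: both URL kinds from the incident are represented.
kinds = {classify_remote_url(url) for url in REMOTE_URLS}
assert kinds == {"ssh", "http"}
```

A list like `TEST_MATRIX` could be passed to `pytest.mark.parametrize("command,url", TEST_MATRIX)` so that each command is exercised against each URL style, with HTTP(s) cases additionally run under a configured credential manager.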
