TestNodeFailingGracefulExitWithLowOnlineScore flaky #6401

Closed
thepaul opened this issue Oct 11, 2023 · 4 comments

@thepaul thepaul added Bug Something isn't working Flaky labels Oct 11, 2023
@thepaul thepaul self-assigned this Oct 11, 2023
@egonelbre
Member

The second example points at a data race:

WARNING: DATA RACE
Read at 0x00c00307bd40 by goroutine 113725:
  runtime.mapaccess2_faststr()
      /usr/local/go/src/runtime/map_faststr.go:108 +0x0
  storj.io/storj/satellite/repair/repairer.(*statsCollector).getStatsByRS()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/stats.go:26 +0x54
  storj.io/storj/satellite/repair/repairer.(*SegmentRepairer).getStatsByRS()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/segments.go:702 +0x153
  storj.io/storj/satellite/repair/repairer.(*SegmentRepairer).Repair()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/segments.go:197 +0xf84
  storj.io/storj/satellite/repair/repairer.(*Service).worker()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:215 +0x39d
  storj.io/storj/satellite/repair/repairer.(*Service).process.func1()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:200 +0x11d

Previous write at 0x00c00307bd40 by goroutine 113722:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/map_faststr.go:203 +0x0
  storj.io/storj/satellite/repair/repairer.(*statsCollector).getStatsByRS()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/stats.go:30 +0xd9
  storj.io/storj/satellite/repair/repairer.(*SegmentRepairer).getStatsByRS()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/segments.go:702 +0x153
  storj.io/storj/satellite/repair/repairer.(*SegmentRepairer).Repair()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/segments.go:197 +0xf84
  storj.io/storj/satellite/repair/repairer.(*Service).worker()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:215 +0x39d
  storj.io/storj/satellite/repair/repairer.(*Service).process.func1()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:200 +0x11d

Goroutine 113725 (running) created at:
  storj.io/storj/satellite/repair/repairer.(*Service).process()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:197 +0x736
  storj.io/storj/satellite/repair/repairer.(*Service).processWhileQueueHasItems()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:155 +0x5b
  storj.io/storj/satellite/repair/repairer.(*Service).processWhileQueueHasItems-fm()
      <autogenerated>:1 +0x47
  storj.io/common/sync2.(*Cycle).Run()
      /go/pkg/mod/storj.io/common@v0.0.0-20231005100446-96ee88859b9d/sync2/cycle.go:143 +0x6ab
  storj.io/storj/satellite/repair/repairer.(*Service).Run()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:148 +0x252
  storj.io/storj/satellite/repair/repairer.(*Service).Run-fm()
      <autogenerated>:1 +0x47
  storj.io/storj/private/lifecycle.(*Group).Run.func2.1()
      /var/lib/jenkins/workspace/storj-gerrit-verify/private/lifecycle/group.go:87 +0x4b
  runtime/pprof.Do()
      /usr/local/go/src/runtime/pprof/runtime.go:51 +0x111
  storj.io/storj/private/lifecycle.(*Group).Run.func2()
      /var/lib/jenkins/workspace/storj-gerrit-verify/private/lifecycle/group.go:86 +0x3a8
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x76

Goroutine 113722 (running) created at:
  storj.io/storj/satellite/repair/repairer.(*Service).process()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:197 +0x736
  storj.io/storj/satellite/repair/repairer.(*Service).processWhileQueueHasItems()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:155 +0x5b
  storj.io/storj/satellite/repair/repairer.(*Service).processWhileQueueHasItems-fm()
      <autogenerated>:1 +0x47
  storj.io/common/sync2.(*Cycle).Run()
      /go/pkg/mod/storj.io/common@v0.0.0-20231005100446-96ee88859b9d/sync2/cycle.go:143 +0x6ab
  storj.io/storj/satellite/repair/repairer.(*Service).Run()
      /var/lib/jenkins/workspace/storj-gerrit-verify/satellite/repair/repairer/repairer.go:148 +0x252
  storj.io/storj/satellite/repair/repairer.(*Service).Run-fm()
      <autogenerated>:1 +0x47
  storj.io/storj/private/lifecycle.(*Group).Run.func2.1()
      /var/lib/jenkins/workspace/storj-gerrit-verify/private/lifecycle/group.go:87 +0x4b
  runtime/pprof.Do()
      /usr/local/go/src/runtime/pprof/runtime.go:51 +0x111
  storj.io/storj/private/lifecycle.(*Group).Run.func2()
      /var/lib/jenkins/workspace/storj-gerrit-verify/private/lifecycle/group.go:86 +0x3a8
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x76
==================

@thepaul
Contributor Author

thepaul commented Oct 13, 2023

That race appears to come from TestManyNodesGracefullyExiting instead (see #6402).

@storj-gerrit

storj-gerrit bot commented Nov 6, 2023

storjBuildBot pushed a commit that referenced this issue Nov 8, 2023
…lineScore

I can't say with certainty yet what caused the two failures I know
about, but I have one theory: the node continuing to check in during the
test skewed the online score towards 1, and using the test default for
GracefulExitDurationInDays meant there were fewer update periods than
expected.

At any rate, it is more correct to pause the graceful exit processing
chore and the contact chore during the test, even if it doesn't end up
solving the problem.

Refs: #6401
Change-Id: I06d43d531e0b3344af13878c8d55213349fdcfa3
@iglesiasbrandon
Contributor

It seems like this fix has been deployed, so I am going to close this issue. @thepaul @shaupt131

Projects
Status: Done/Deployed