Visor is not able to restart data-node when network history is enabled due to level db lock #7750

Closed
11 tasks
daniel1302 opened this issue Mar 3, 2023 · 1 comment · Fixed by #7800


@daniel1302

Problem encountered

When network history is enabled, LevelDB seems to stay locked longer than usual, and we randomly get the errors shown in the log output below.

I think it would be nice to add a mechanism to the data-node that waits for that file to be released. When network history is disabled, I do not see this kind of error.

However, I see it mostly on mainnet (probably when the server is slower), where we run 13 validators & data-node.

failed to create and publish segment: failed to create snapshot: failed to get data dump metadata: failed to get database version: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
failed to initialise network history:failed to create networkHistory service:failed to create network history store:failed to create index:failed to open level db file:resource temporarily unavailable
Error: maximum number of possible restarts has been reached: failed to execute binary /jenkins/workspace/common/system-tests-lnl-mainnet/networkdata/testnet/visor/visor13/current/vega [datanode node --home /jenkins/workspace/common/system-tests-lnl-mainnet/networkdata/testnet/data-node/node13]: exit status 255
Usage:
  vegavisor run [flags]

Flags:
  -h, --help          help for run
      --home string   Path to visor home folder

Steps to reproduce

Manual

We simply propose a protocol upgrade in system-tests here: https://github.com/vegaprotocol/system-tests/blob/devops-infra/1522-3/tests/LNL/extended_test.py#L311-L316

After that, visor is not able to restart the node.

Automation

Link to automation and explanation on how to run it to reproduce the problem/bug

Evidence

Logs

If applicable, add logs and/or screenshots to help explain your problem.

Additional context

Add any other context about the problem here including; system version numbers, components affected.

Definition of Done

ℹ️ Not every issue will need every item checked, however, every item on this list should be properly considered and actioned to meet the DoD.

Before Merging

  • Code refactored to meet SOLID and other code design principles
  • Code is compilation error, warning, and hint free
  • Carry out a basic happy path end-to-end check of the new code
  • All APIs are documented so auto-generated documentation is created
  • All bug recreation steps can be followed without presenting the original error/bug
  • All Unit, Integration and BVT tests are passing
  • Implementation is peer reviewed (coding standards, meeting acceptance criteria, code/design quality)
  • Create front end or console tickets with feature labels (should be done when starting the work if dependencies known i.e. API changes)

After Merging

  • Move development ticket to Done if there is NO requirement for new system-tests
  • Resolve any issues with broken system-tests
  • Create documentation tickets with feature labels if functionality has changed, or is a new feature
@wwestgarth

wwestgarth commented Mar 7, 2023

A similar thing happened here: https://vegaprotocol.slack.com/archives/C03KD0MR8M8/p1678181471762409

In this case devnet was restarted and the chainID was different between the data-node going down and coming back up. It failed with:

failed to verify chain id:mismatched chain ids, config chain id: vega-devnet1-202303070921, current chain id: vega-devnet1-202303070827

It looks like whatever code path produces that error does not stop the datanode gracefully, and I do not see the log line for closing the IPFS goleveldb.
