Visor is not able to restart data-node when network history is enabled due to level db lock #7750

daniel1302 · 2023-03-03T09:45:57Z

Problem encountered

When network history is enabled, the level db file seems to stay locked longer than usual, and we randomly get the errors shown in the log output below.

It would be nice to add a mechanism to the data-node that waits for that file to become available. When network history is disabled, I do not see this kind of error.

I see it mostly on mainnet (probably when the server is slower), where we run 13 validators and a data-node.

failed to create and publish segment: failed to create snapshot: failed to get data dump metadata: failed to get database version: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
failed to initialise network history:failed to create networkHistory service:failed to create network history store:failed to create index:failed to open level db file:resource temporarily unavailable
Error: maximum number of possible restarts has been reached: failed to execute binary /jenkins/workspace/common/system-tests-lnl-mainnet/networkdata/testnet/visor/visor13/current/vega [datanode node --home /jenkins/workspace/common/system-tests-lnl-mainnet/networkdata/testnet/data-node/node13]: exit status 255
Usage:
  vegavisor run [flags]

Flags:
  -h, --help          help for run
      --home string   Path to visor home folder

Steps to reproduce

Manual

We simply propose a protocol upgrade in system-tests here: https://github.com/vegaprotocol/system-tests/blob/devops-infra/1522-3/tests/LNL/extended_test.py#L311-L316

After that, visor is not able to restart the node.

Automation

Link to automation and explanation on how to run it to reproduce the problem/bug

Evidence

Logs

If applicable, add logs and/or screenshots to help explain your problem.

Additional context

Add any other context about the problem here including; system version numbers, components affected.

Definition of Done

ℹ️ Not every issue will need every item checked, however, every item on this list should be properly considered and actioned to meet the DoD.

Before Merging

Code refactored to meet SOLID and other code design principles
Code is compilation error, warning, and hint free
Carry out a basic happy path end-to-end check of the new code
All APIs are documented so auto-generated documentation is created
All bug recreation steps can be followed without presenting the original error/bug
All Unit, Integration and BVT tests are passing
Implementation is peer reviewed (coding standards, meeting acceptance criteria, code/design quality)
Create front end or console tickets with feature labels (should be done when starting the work if dependencies known i.e. API changes)

After Merging

Move development ticket to Done if there is NO requirement for new system-tests
Resolve any issues with broken system-tests
Create documentation tickets with feature labels if functionality has changed, or is a new feature


wwestgarth · 2023-03-07T09:41:17Z

A similar thing happened here: https://vegaprotocol.slack.com/archives/C03KD0MR8M8/p1678181471762409

In this case, devnet was restarted and the chain ID changed between the data-node going down and coming back up. It failed with:

failed to verify chain id:mismatched chain ids, config chain id: vega-devnet1-202303070921, current chain id: vega-devnet1-202303070827

It looks like whatever code path causes that error does not gracefully stop the datanode, and I do not see the log line for closing the IPFS goleveldb.
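
To make the suspected failure mode concrete, here is a minimal sketch of how a store can be closed on every return path, including an early exit on a chain ID mismatch. All names here (`indexStore`, `verifyChainID`, `run`) are hypothetical and only model the situation; this is not the datanode's real code.

```go
package main

import "fmt"

// indexStore is a hypothetical stand-in for the network history index
// store; `closed` records whether Close was called.
type indexStore struct{ closed bool }

// Close releases the store (in the real store, the level db lock).
func (s *indexStore) Close() error {
	s.closed = true
	return nil
}

// verifyChainID mirrors the check that failed in the log above.
func verifyChainID(configID, currentID string) error {
	if configID != currentID {
		return fmt.Errorf("mismatched chain ids, config chain id: %s, current chain id: %s",
			configID, currentID)
	}
	return nil
}

// run defers Close right after taking ownership of the store, so the
// store is released on every return path, including early errors.
func run(store *indexStore, configID, currentID string) error {
	defer store.Close()

	if err := verifyChainID(configID, currentID); err != nil {
		return fmt.Errorf("failed to verify chain id: %w", err)
	}
	// Normal operation would continue here.
	return nil
}
```

Note that deferred calls in Go run only when the function returns or panics. A path that ends in `os.Exit` or `log.Fatal` skips them, so such paths would still leave the lock held.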

daniel1302 added the bug label Mar 3, 2023

gordsport added the datanode label Mar 3, 2023

gordsport added this to the 🤠 🤸 OT Stretch milestone Mar 3, 2023

gordsport assigned guoguojin and daniel1302 Mar 6, 2023

guoguojin mentioned this issue Mar 8, 2023

fix: not all paths cleanly close the network history index store #7800

Merged

guoguojin closed this as completed in #7800 Mar 9, 2023
