Skip to content
This repository has been archived by the owner on Jan 21, 2022. It is now read-only.

Bounty: Fix race conditions causing problems identified in #417 #453

Closed
benjamincburns opened this issue Dec 25, 2017 · 87 comments
Closed

Comments

@benjamincburns
Copy link
Contributor

benjamincburns commented Dec 25, 2017

As a result of #450, in trufflesuite/ganache#41, @0xNPE gave us some much needed tests which reproduce the underlying problems behind #417. This bounty is for a fix for the underlying problem which those tests identify.

Bounty acceptance criteria:

  1. Bounty claimant must submit one or more pull requests to trufflesuite/ganache-core and/or its underlying dependencies (likely ethereumjs/ethereumjs-vm and/or ethereumjs/merkle-patricia-tree), and meet the necessary requirements defined by the projects' contributing guidelines for these PRs to be accepted by the project maintainers.
  2. Submissions must make the tests submitted in automated tests for bounty 450 ganache#41 pass.
  3. The tests submitted in automated tests for bounty 450 ganache#41 may be modified if necessary to demonstrate the fix, so long as the modified tests continue to demonstrate the underlying problem with the fixes reverted.
  4. Submitters should take care to avoid causing significant performance degradation to ganache-core or any other impacted projects.
  5. Submissions which merely make the tests pass but which do not fix the underlying cause of testrpc crashes after a while with "Cannot read property 'pop' of undefined" in "CheckpointTrie.Trie._updateNode" #417 will not be accepted for the bounty.

Important note: This bounty will not be paid out until all required PRs are accepted and merged.

Tips:

  • When making changes to an underlying dependency of ganache-core you can use npm link to symlink your local copy of the dependency into the node_modules directory of the ganache-core project.
  • When editing ethereumjs projects, be sure to follow their style guidelines, using the standard tool to verify that you have done so.
@gitcoinbot
Copy link

This issue now has a funding of 0.5 ETH (374.81 USD) attached to it.

  • If you would like to work on this issue you can claim it here.
  • If you've completed this issue and want to claim the bounty you can do so here
  • Questions? Get help on the Gitcoin Slack
  • $12301.16 more Funded OSS Work Available at: https://gitcoin.co/explorer

@owocki
Copy link

owocki commented Jan 2, 2018

happy new year! im going to spend some cycles this week sourcing someone to take this on..

@0xNPE
Copy link

0xNPE commented Jan 2, 2018

FYI, ethereumjs/merkle-patricia-tree#28 looks like a promising fix for TypeError: Cannot read property 'get' of undefined

I dug a little deeper into TypeError: Cannot read property 'pop' of undefined. It happens during ethereumjs-vm runTx when it flushes the cache to its trie after a couple of checkpoints, commit, and revert.

@federicobond
Copy link

Hi, I am the author of ethereumjs/merkle-patricia-tree#28.

I also looked at the pop error and I think it's related to reading a non-existent trie node (maybe due to some confusion about the currently active DB).

IIRC, the problematic call stack goes like this:

// baseTrie.js
put
  findPath
    _walkTrie
      _lookupNode
        _walkTrie.processNode
          findPath.processNode
            _updateNode

Where _updateNode is then called with an undefined value for stack.

If you look at the _walkTrie function, you will find that the child processNode function calls cb without arguments if it does not receive a node (!node).

cb then does onDone.apply(null, returnValues) with an empty returnValues array (because walkController.return was never called and thus returnValues never reassigned in this case).

Thus, stack in _updateNode ends up undefined.

Also interesting to look at are the conditions under which _lookupNode (called from _walkTrie) returns a null value.

@0xNPE
Copy link

0xNPE commented Jan 2, 2018

@federicobond your analysis of the call stack is similar to what I observed.

but I don't think the root cause can be fixed in merkle-patricia-tree.

I think the root cause pertains to ethereumjs-vm working on its cache of statemanager while in checkpoint mode, then when it tries to persist its cache after certain revert actions, there is a reference to a node that never existed in the db, hence your call stack occurs.

@federicobond
Copy link

That makes sense. I'll see if I find anything else.

@gitcoinbot
Copy link

The funding of 0.5 ETH (427.59 USD) attached has been claimed by @kammerdiener.

@kammerdiener, please leave a comment to let the funder (@owocki) and the other parties involved your implementation plan. If you don't leave a comment, the funder may expire your claim at their discretion.

@kammerdiener
Copy link

So a quick update on this. Based on what I have found I think that both issues are currently originating from ethereumjs/merkle-patricia-tree. I have fixed the get being undefined. Currently working on figuring out what is going on with pop coming back undefined.

@kammerdiener
Copy link

Believe I have fixed it all within ethereumjs/merkle-patricia-tree. The Pull Request is here: PR #30. I tried to take into account all of the information talked about from above and was sure to check @federicobond PR to help fix the _getDBs portion. All of the tests pass now with no errors.

@kammerdiener
Copy link

I went into the node_modules folder and replaced the items and am attaching a screenshot of it passing the race conditions test. Once the PR for Merkle Patricia Trees is merged then. Should I go ahead and submit a PR for this with it including the race conditions test (removing the skip command)?

screen shot 2018-01-06 at 8 22 00 pm

@benjamincburns
Copy link
Contributor Author

@kammerdiener there are actually a couple of different competing submissions at this point. I'll be taking some time early this week to evaluate them. I realize now that I said that the goal was to make the tests pass, but the real goal is to fix #417. The best thing that you can do at this point is to manually test ganache-cli with your fix in place (you can use npm link to do this) and determine whether you observe either of the crashes identified in #417. This is how I'll be evaluating the proposed fixes as well, so it'll take a bit of time, I'm afraid.

@benjamincburns
Copy link
Contributor Author

benjamincburns commented Jan 7, 2018

@roderik by any chance would you like to help with evaluating these fixes, as I believe you've been impacted by this bug quite regularly?

@benjamincburns
Copy link
Contributor Author

benjamincburns commented Jan 8, 2018

@kammerdiener, as I feared, when testing your fix I observe a new issue. I believe this is due to what @0xNPE @federicobond were discussing above. That said, your changes do make ganache much more stable, so until a proper fix is found we will likely include this fix in future releases (provided that it is accepted by the ethereumjs/merkle-patricia-tree maintainers).

The newly observed issue:

Error: Error: LevelUpArrayAdapter named 'blockLogs' index out of range: index 190; length: 187
    at ganache-cli/build/lib.node.js:87740:23
    at ganache-cli/build/lib.node.js:87714:5
    at ganache-cli/build/lib.node.js:88160:12
    at ganache-cli/build/lib.node.js:87925:19
    at ganache-cli/build/lib.node.js:92548:7
    at _combinedTickCallback (internal/process/next_tick.js:67:7)
    at process._tickCallback (internal/process/next_tick.js:98:9)
    at Object.InvalidResponse (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:41484:16)
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:329530:36
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176186:11
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:325200:9
    at XMLHttpRequest.request.onreadystatechange (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:328229:7)
    at XMLHttpRequestEventTarget.dispatchEvent (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176415:18)
    at XMLHttpRequest._setReadyState (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176705:12)
    at XMLHttpRequest._onHttpResponseEnd (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176860:12)
    at IncomingMessage.<anonymous> (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176820:24)
    at IncomingMessage.emit (events.js:164:20)

@kammerdiener
Copy link

Alright thanks for the information. I will start working trying to figure this out as well. Do you have any more information on how you caused this error?

@benjamincburns
Copy link
Contributor Author

benjamincburns commented Jan 8, 2018

@kammerdiener it's fairly easy to do this with trufflesuite/ganache, as it'll crash somewhat readily when running truffle test on nearly any truffle project while the accounts page is visible. You can grab the metacoin truffle box if you don't have a project handy.

I recommend using the develop branch Ganache as well, as that will let you see your logs on crash a bit more easily by hitting the "report issue to github" button (just do me a favor and don't submit the issue that it opens up).

Also I'd expected ganache-cli to have good sourcemap support by now. I'm looking into this and will commit a fix to the relevant places once that's done. Will comment back here as well, hopefully with a mapped stack trace rather than the useless one I just posted.

@kammerdiener
Copy link

@benjamincburns I will be watching for it. Thanks for the tip. I will see what I can do to fix it and make it more stable!

@benjamincburns
Copy link
Contributor Author

Also be advised that the error I reported will be visible as an error response in your tests. Ganache doesn't crash in this case, but it's also useless for future tests (rerunning tests w/o restarting Ganache just spits out the same error)

@benjamincburns
Copy link
Contributor Author

@kammerdiener I managed to fix sourcemaps. I'll commit that to the develop branch of ganache-cli within the next few minutes.

In the mean time, here's the mapped stacktrace:

    at ganache-cli/build/webpack:/ganache-core/lib/database/leveluparrayadapter.js:51:1
    at ganache-cli/build/webpack:/ganache-core/lib/database/leveluparrayadapter.js:25:1
    at ganache-cli/build/webpack:/ganache-core/node_modules/level-sublevel/shell.js:102:1
    at ganache-cli/build/webpack:/ganache-core/node_modules/level-sublevel/nut.js:122:1
    at ganache-cli/build/webpack:/ganache-core/node_modules/cachedown/index.js:53:1
    at _combinedTickCallback (internal/process/next_tick.js:67:7)
    at process._tickCallback (internal/process/next_tick.js:98:9)
    at Object.InvalidResponse (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:41484:16)
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:329530:36
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176186:11
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:325200:9
    at XMLHttpRequest.request.onreadystatechange (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:328229:7)
    at XMLHttpRequestEventTarget.dispatchEvent (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176415:18)
    at XMLHttpRequest._setReadyState (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176705:12)
    at XMLHttpRequest._onHttpResponseEnd (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176860:12)
    at IncomingMessage.<anonymous> (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176820:24)
    at IncomingMessage.emit (events.js:164:20)```

@benjamincburns
Copy link
Contributor Author

Actually now that I read that a bit more carefully, it seems that it's actually broken the truffle side of things. I'll dig in a bit further and see if I can't figure out what's gone amiss.

@benjamincburns
Copy link
Contributor Author

benjamincburns commented Jan 8, 2018

@kammerdiener I'm afraid I also just observed the original issue with your fix in place:

TypeError: Cannot read property 'pop' of undefined
    at CheckpointTrie.Trie._updateNode (ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:359:1)
    at ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:106:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:460:1
    at processNode (ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:465:1)
    at ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:503:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:179:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/util.js:75:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/node_modules/async/lib/async.js:52:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/node_modules/async/lib/async.js:269:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/node_modules/async/lib/async.js:44:1

@roderik
Copy link

roderik commented Jan 8, 2018

@benjamincburns gladly!

@owocki
Copy link

owocki commented Jan 16, 2018

@kammerdiener did you still want to claim the work here, or should i release it to someone else?

@kammerdiener
Copy link

@owocki I'll work on it some more. I think I have an idea for the fix.

@kammerdiener
Copy link

@owocki I have exhausted all of my potential fixes. If you would go ahead and release it in the interest of getting this fixed sooner. I will look for other issues to try and tackle.

@benjamincburns
Copy link
Contributor Author

benjamincburns commented Jan 21, 2018

@owocki are claims mutually exclusive, or can multiple people register a claim to the bounty at once? If only one claim can be registered at a time, I think it's best if we reject @kammerdiener's claim until he's resubmitted. I really appreciate your work, @kammerdiener, but I worry that people may see a claim for the bounty here and be put off from trying to solve the issue, when the truth is that the field is still wide open.

@benjamincburns
Copy link
Contributor Author

@clehene are you still intending to complete this fix? I'd love to award this bounty if possible.

Either way, @seesemichaelj is going to take over responsibility from me for tracking this. I'll still be around, however if you guys have any questions for me.

@clehene
Copy link

clehene commented May 31, 2018

@benjamincburns @seesemichaelj this got more or less blocked with the final bits as I saw two different ways which depend on input from EVM / merkle-tree. I.e. we'd need to understand the author's intentions there as the logic is a bit convoluted and undocumented or otherwise discuss what's the best thing based on context.

I wouldn't mind taking another stab. TBH the bounty was just the cherry on top, I mostly wanted to get my feet wet and have it done, but I need to discuss with someone.

I like to finish things when I start them, but getting the whole context back and rebasing it will take some effort. I need to go over notes, etc. as I don't fully remember all bits - so I may need to spend some time and would only do that if I'm sure I can get proper support and not sure whom to ask for that :)

@mikeseese
Copy link
Contributor

Hey @clehene; sorry I'm just getting around to this now. Please mind me as I'm getting up to speed what the issue is and possible theories/solutions.

As afar as getting input from EVM/merkle-tree, I think one of the best people to start a discussion will be @holgerd77. However, just looking at the history of baseTrie.js it may be difficult to get "original author" intentions.

Looking back on the stuff you said March 30, it may be best to go ahead and submit PRs and probe questions. It'd help if you have somewhat direct questions as well as a detailed description of your proposed changes.

Are the commits clehene/merkle-patricia-tree@4927c76 and clehene/merkle-patricia-tree@25541d4 potential solutions to the problem @clehene?

@benjamincburns
Copy link
Contributor Author

benjamincburns commented Aug 15, 2018

@holgerd77 please see @clehene's last comment. Are you the right person to help us navigate getting this change accepted? If you're not familiar w/ this issue, I'd be happy to set up a call with you to discuss. Would love it if @clehene joined as well.

Either way, we haven't heard from @clehene in a while and the price of ETH is drastically different from when we first opened this bounty. As a result we're going to adjust the funding and clear out @clenehe's currently active claim. That said, if @clehene is still interested I'd be thrilled if he picked it back up - he's definitely carried it most of the way.

@clehene
Copy link

clehene commented Aug 16, 2018

Hey @seesemichaelj didn't intend to go AWOL.
@benjamincburns I was under the impression that this is no longer under bounty / my claim - i.e. that I'm not locking anyone from fixing it.
See trufflesuite/ganache#453 (comment)

hi all. i just killed the bounty for this issue

As much as I'd love to finish this, I'm not sure it'd be realistic just based on how much time I estimate it would take. I was not able to reproduce all faulty conditions last time I rebased things (that was after June 13th). Before that I remember going rather deep in the VM last time I had a clue of what could be the root cause and expose the effect we were hunting for. For that I was debugging and stepping through VM and instrumented the VM along with libraries to get detailed tracing information. That makes rebasing costly as it's essentially carrying tracing parameters across the entire call stack. So long story short it's a full time job for 1-2 weeks to get up to speed :).
I would gladly get on a call. If anyone's fixing this in SF I could spare a few hours in an office to help out too.

Anyways, if there's indeed an active claim, I'm sorry for that and hope it didn't deter anyone from taking a stab at it :).

A final thought. I think it may be worth considering rewriting the merkle tree implementation from scratch by someone that has a clear idea of how it needs to behave, etc. (I remember going through the yellow paper that there were some peculiarities there too). I honestly think that would be the best and cheapest long term maintenance solution.

@gitcoinbot
Copy link

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


This issue now has a funding of 400.0 DAI (400.0 USD @ $1.0/DAI) attached to it.

@holgerd77
Copy link

@benjamincburns Sorry, I have to prioritize other time-critical tasks for now. Can help to get this settled mid-term hopefully (8-12 weeks first estimation).

@kn
Copy link

kn commented Aug 24, 2018

Update:
I've investigated more and here is my hypothesis on the root cause:
The way we use blockchain.vm.stateManager.[checkpoint|commit|revert]() in some of the blockchain functions, such as processCall() assume there is no other function performing transaction on the trie.

For example, as demonstrated in the test case, the following events can be problematic:

  1. .checkpoint() is called by processCall()
  2. .checkpoint() is called by processNextBlock()
  3. .revert() is called by processCall() <- this reverts the checkpoint created by processNextBlock().
  4. .commit() is called by processNextBlock() <- this commits the checkpoint created by processCall()

Proposal:
We should introduce semaphore lock that makes sure multiple functions try to perform transactions on the trie to avoid these race condition.

@gitcoinbot
Copy link

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


Work for 400.0 DAI (400.0 USD @ $1.0/DAI) has been submitted by:

  1. @kn

@ceresstation please take a look at the submitted work:


@benjamincburns
Copy link
Contributor Author

I commented over on the PR, but I figure I should comment here. @kn I really think @clehene identified the root cause already, and the changes he listed on 1 March likely addressed the true underlying issues. The most important problem he identified was that the MPT's checkpoint was written to assume it was operating synchronously, when this wasn't generally the case. This requires a breaking change to allow for a continuation callback to be passed into the MPT's checkpoint function.

@spm32
Copy link

spm32 commented Sep 19, 2018

Hey @PaulRBerg, @raboid are either of you interested in giving this a shot? Please also see the reference issue at the bottom. @kn really appreciate your work, going to pay out a tip separate from the bounty for your effort!

@kn
Copy link

kn commented Sep 20, 2018

Thanks! Appreciate it.

Here is what I left off with and changes I made are:
ethereumjs/merkle-patricia-tree@master...kn:i417
ethereumjs/ethereumjs-monorepo@master...kn:i417
trufflesuite/ganache@develop...kn:i417_2

The next steps were

  • to reproduce actual race condition and make sure these changes fix it.
  • get reviews on the data structure design changes.

@frankchen07
Copy link

hey @ceresstation how is this bounty proceeding? Looks like @benjamincburns closed a PR that kn started due to some extra complexity, but is it safe to say the bounty is still open?

@mikeseese
Copy link
Contributor

@frankchen07 Yes, this bounty is still open

@benjamincburns
Copy link
Contributor Author

Closing this one, as it was fixed by trufflesuite/ganache#209

We didn't fund this bounty ourselves, so I'm not sure how to go about compensating those who helped out. I think @kn and @clehene should split the reward, as their solutions were both viable and they ultimately lead me to the solution in the above PR.

I'll check in w/ the GitCoin folks next week and see what we can do.

@frankchen07
Copy link

@benjamincburns - gitcoin.co/tip is a way to compensate those who helped close out an issue (but not necessarily the ones who submitted the work).

@clehene
Copy link

clehene commented Nov 8, 2018

@benjamincburns nice! was it fixed entirely through trufflesuite/ganache#209 or did it require changes in other parts?

@spm32
Copy link

spm32 commented Nov 8, 2018

Fully agree @benjamincburns, I'll split the reward between @kn and @clehene :)

@gitcoinbot
Copy link

⚡️ A tip worth 200.00000 DAI (200.0 USD @ $1.0/DAI) has been granted to @kn for this issue from @ceresstation. ⚡️

Nice work @kn! Your tip has automatically been deposited in the ETH address we have on file.

@kn
Copy link

kn commented Nov 13, 2018

Great work - glad to see it fixed! Thanks for the generous tip.

@benjamincburns
Copy link
Contributor Author

benjamincburns commented Feb 7, 2019 via email

@owocki
Copy link

owocki commented Feb 7, 2019

@ceresstation mind closing the bounty? I just removed the 'open' override status, and that made it appear as 'submitted'. If you cancel the bounty that should close it out.

@gitcoinbot
Copy link

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


This Bounty has been completed.

Additional Tips for this Bounty:

  • ceresstation tipped 200.0000 DAI worth 200.0 USD to kn.

@gitcoinbot
Copy link

⚡️ A tip worth 200.00000 DAI (200.0 USD @ $1.0/DAI) has been granted to @clehene for this issue from @ceresstation. ⚡️

Nice work @clehene! Your tip has automatically been deposited in the ETH address we have on file.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests