JDC panics after receiving RequestTransactionDataError message #709

Closed
GitGab19 opened this issue Jan 3, 2024 · 42 comments · Fixed by #754

@GitGab19
Collaborator

GitGab19 commented Jan 3, 2024

JDC panics after receiving a RequestTransactionDataError message.
The line of interest is roles/jd-client/src/template_receiver/message_handler.rs:46:9.

For better context, here's a screenshot of the error; we got it right after the last testnet block was mined:
[Screenshot 2024-01-03 at 15:20:55]
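
For illustration only (this is not the actual code at roles/jd-client/src/template_receiver/message_handler.rs, and the type is reduced to the two fields the Sv2 message carries), the crash corresponds to a handler arm that has no recovery logic yet:

```rust
// Illustrative only, not the actual SRI code: an unhandled
// RequestTransactionData.Error arm that falls through to a panic/todo
// aborts the whole JDC process, which is the kind of crash shown above.
struct RequestTransactionDataError {
    template_id: u64,
    error_code: String,
}

fn handle_request_tx_data_error(msg: RequestTransactionDataError) {
    // No recovery logic yet, so the role crashes instead of reacting to the error.
    todo!(
        "handle RequestTransactionData.Error for template {} ({})",
        msg.template_id,
        msg.error_code
    )
}
```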

@GitGab19 added the bug and job declarator labels on Jan 3, 2024
@Fi3
Collaborator

Fi3 commented Jan 3, 2024

JDC does not handle the RequestTransactionData error; we should decide what to do when it happens. This is not straightforward, because a TP that is unable to provide transaction data for transactions that it itself included in the block should likely be considered unreliable, and the connection should be closed. But maybe we prefer to try again? Some discussion is needed here.

@Fi3 removed the bug label on Jan 3, 2024
@Fi3
Collaborator

Fi3 commented Jan 3, 2024

Removed the bug label, as this is more a new feature than a bug.

@GitGab19
Collaborator Author

GitGab19 commented Jan 4, 2024

In the specific case mentioned here, JDC received the RequestTransactionDataError for template 1946. That was the first template created after the new block was found. Right after that, a new future template was created and sent to JDC.
Could it be that TP is not able to get information about previous templates?

@Sjors
Collaborator

Sjors commented Jan 12, 2024

Templates are forgotten after a new block is found so that's probably what happened. It's on my list to address that.

@pavlenex modified the milestones: MVP3, Milestone4 on Jan 16, 2024
@GitGab19
Collaborator Author

Templates are forgotten after a new block is found so that's probably what happened. It's on my list to address that.

Hi @Sjors, I can confirm that every time we get this error it is right after a new block is found and a SetNewPrevHash is received.

@Sjors
Collaborator

Sjors commented Feb 6, 2024

@GitGab19 the latest https://github.com/Sjors/bitcoin/tree/2024/02/sv2-poll-ellswift will keep the old template around for 10 seconds after a new block arrives. I do still have to implement code that holds on to mempool transactions.
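
Conceptually that is a grace-period cache of recent templates. A rough Rust sketch of the idea, purely for illustration (the actual Template Provider lives in the C++ branch above, and none of these names exist there):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical grace-period template cache; names are invented.
const GRACE_PERIOD: Duration = Duration::from_secs(10);

struct CachedTemplate {
    prev_hash: [u8; 32],          // block this template builds on
    stale_since: Option<Instant>, // set when a new block makes it stale
}

struct TemplateCache {
    templates: HashMap<u64, CachedTemplate>,
}

impl TemplateCache {
    // On a new best block: mark templates that no longer build on the tip as
    // stale, and evict anything that has been stale for longer than the grace
    // period, so its transactions can finally be released from memory.
    fn on_new_block(&mut self, new_tip: [u8; 32]) {
        let now = Instant::now();
        for tmpl in self.templates.values_mut() {
            if tmpl.prev_hash != new_tip && tmpl.stale_since.is_none() {
                tmpl.stale_since = Some(now);
            }
        }
        self.templates
            .retain(|_, t| t.stale_since.map_or(true, |s| now - s < GRACE_PERIOD));
    }
}
```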

@pavlenex
Collaborator

pavlenex commented Feb 6, 2024

From dev call:

We have to modify specs, add error message, TP needs to send this message, and SRI needs to react to this....

@pavlenex modified the milestones: Milestone 4, Milestone 5 on Feb 6, 2024
@GitGab19
Collaborator Author

GitGab19 commented Feb 7, 2024

@GitGab19 the latest https://github.com/Sjors/bitcoin/tree/2024/02/sv2-poll-ellswift will keep the old template around for 10 seconds after a new block arrives. I do still have to implement code that holds on to mempool transactions.

Have you already tested this branch?
Because I'm trying to connect the pool to it but it doesn't work

@Sjors
Collaborator

Sjors commented Feb 7, 2024

@GitGab19 did you test it with #737?

@GitGab19
Collaborator Author

GitGab19 commented Feb 7, 2024

@GitGab19 did you test it with #737?

Ok, just switched to #737 and now it's working.
I will let you know if we hit the same issue again

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

From dev call:

We have to modify specs, add error message, TP needs to send this message, and SRI needs to react to this....

Do we want to go down this path? What do you think about it?
What is important imo is to define JDC actions in response to these error codes.
For example, what should a JDC do after a RequestTransactionData.Error message with error code template-id-not-found is received? Close connection with TP? Discard the message and go on?
For reference, here are the RequestTransactionData.Error specs.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

For example, what should a JDC do after a RequestTransactionData.Error message with error code template-id-not-found is received?

drop the template and use the new one. I don't see any other possible action

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

For example, what should a JDC do after a RequestTransactionData.Error message with error code template-id-not-found is received?

drop the template and use the new one. I don't see any other possible action

If we want to manage it in this way, I think there's no need to add any new error codes to specs at all.
If we agree on this, I can open a PR to close this issue soon

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Yes, we need the new error code to know that we are in this case. Other cases still need to drop the connection with the upstream.

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Yes, we need the new error code to know that we are in this case. Other cases still need to drop the connection with the upstream.

For example?
Which other error codes are you thinking about?

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

We should have at least one for id not found and one for expired id

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Otherwise how could we know whether we want to drop the connection or just drop the template and wait for a new one?

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Otherwise how could we know whether we want to drop the connection or just drop the template and wait for a new one?

I may have misunderstood your opinion because of your suggestion of dropping templates in previous comments.

So, let me recap what we could have:

  • template-id-not-found --> JDC drops connection to TP
  • template-id-expired / stale-template-id --> JDC drops template and works on the new one

How does it sound to you?
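
Something along these lines on the JDC side (a rough Rust sketch of the proposed behaviour, not the actual SRI handler; type and message names are invented):

```rust
// Rough sketch of the proposed JDC reaction to RequestTransactionData.Error;
// names are invented for illustration only.
enum TpAction {
    DropConnection,    // unrecoverable: close the connection to the TP
    DropTemplate(u64), // recoverable: forget this template, keep the newest one
}

fn on_request_tx_data_error(template_id: u64, error_code: &str) -> TpAction {
    match error_code {
        // Template completely unknown to the TP: something is badly wrong.
        "template-id-not-found" => TpAction::DropConnection,
        // Template known but stale: just drop it and work on the new one.
        "template-id-expired" | "stale-template-id" => TpAction::DropTemplate(template_id),
        // Any unrecognised code: be conservative and drop the connection.
        _ => TpAction::DropConnection,
    }
}
```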

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

seems a good solution

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Otherwise how could we know whether we want to drop the connection or just drop the template and wait for a new one?

I may have misunderstood your opinion because of your suggestion of dropping templates in previous comments.

So, let me recap what we could have:

* `template-id-not-found` --> JDC drops connection to TP

* `template-id-expired` / `stale-template-id` --> JDC drops template and works on the new one

How does it sound to you?

@Sjors what do you think about it?
Would it be possible to implement this new error code on TP?

@Sjors
Collaborator

Sjors commented Feb 9, 2024

I could add stale-template-id as a distinct message from template-id-not-found, by taking advantage of the fact that we number templates sequentially. But this doesn't seem very elegant, and it makes it impossible to randomise the template id in the future.

I would prefer only supporting template-id-not-found, so JDC should always drop the template and work on a new one.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

@Sjors I understood that you cache templates; if so, it would make more sense to have the 2 different errors. And drop templates, but if something very bad is happening (either on the TP or the JDC side) we close the connection.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Ofc if JDC is trying to retrieve templates from 1 hour ago, this counts as something very bad.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

If we have one error, we implicitly assume that the only case of an old or invalid template id is the one that we are seeing now.

@Sjors
Collaborator

Sjors commented Feb 9, 2024

Oh wait, I guess I can return stale-template-id if I still have the template cached, but it refers to a stale block. That makes sense.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

yep now i remember during the call this was the idea

@Sjors
Collaborator

Sjors commented Feb 9, 2024

Alright, I'll work on that next week.

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Oh wait, I guess I can return stale-template-id if I still have the template cached, but it refers to a stale block. That makes sense.

I like the approach, but in this way the JDC behaviour (in response to the error code received) still depends on how long the timer for keeping old templates is set.
For example (considering timer = 10s), what would happen if JDC sends a RequestTransactionData message 11 seconds after a new block is received by the TP?
The TP should send a stale-template-id error code, since it's a template related to the previous block.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Trying to get txs for a very old template means that there is a big issue; I think it is fine to drop the connection.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

We can have the max time as a config parameter. Btw I would say anything older than 1 minute.

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Trying to get txs for a very old template means that there is a big issue; I think it is fine to drop the connection.

I agree with you on this.
The important thing is that previous templates (the ones based on the same current block) are kept in memory until a new block is received, because they are still valid until a new block is found, and so it would make no sense to drop the connection for a delayed RequestTransactionData message (imo).

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Yep, for still-valid templates we should just answer the request with the tx data for that template. The miner will start mining on the template as soon as it is received, without waiting for the tx data; that means there will be valid shares that can be sent to the pool for that template, and if we return an error the job declaration will never complete and the miner will lose some shares.

@Sjors
Collaborator

Sjors commented Feb 9, 2024

We could specify a minimum grace period in the spec. I think 10 seconds should be plenty. Anything longer implies a serious bug?

I'd rather not make this configurable, because holding on to old templates is potentially a memory DoS vector. We can't release the full transactions from memory until all templates referencing them are let go. So imagine a 1 MB inscription with 1000 RBF bumps in a few minutes... Every bump might have enough extra fees to justify a new template (especially once we have cluster mempool and can make templates faster than today).

I'd rather have the "freedom" to drop old templates quickly when memory becomes a problem, or keep them longer if memory is fine. And I'd rather not explain all that in the docs for such a setting.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

yep 10 sec is more than enough IMO. I said 1 min just to take a gigantic value.

@Sjors
Collaborator

Sjors commented Feb 9, 2024

My idea would be this:

  1. Hold on to templates for at least 10 seconds (more if memory allows)
  2. When a RequestTransactionData message asks for an old template, deny it with stale-template-id.
  3. When a SubmitSolution arrives, accept it if we still have the template

The Template Provider does not handle shares.

The miner will start mining on the template as soon as it is received, without waiting for the tx data; that means there will be valid shares that can be sent to the pool for that template, and if we return an error the job declaration will never complete and the miner will lose some shares.

Not sure how to handle this. Does the pool actually have to check the transactions for the share? I assumed it would check the templates immediately, not when it receives a share. As long as the pool does that within 10 seconds there should be no problem. And if it takes longer than that, there's something wrong with the pool?
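
Point 3 of the plan might look roughly like the sketch below (illustrative only; the real Template Provider is C++ and the names here are invented):

```rust
use std::collections::HashMap;

// Illustrative only: accept a SubmitSolution whenever the referenced template
// is still cached, even if it has already been flagged stale; fail only if it
// was evicted and the block can no longer be reconstructed.
struct Template; // stand-in for the cached template data

fn on_submit_solution(
    cache: &HashMap<u64, Template>,
    template_id: u64,
) -> Result<(), &'static str> {
    match cache.get(&template_id) {
        // Rebuild the full block from the cached template plus the solution
        // fields and submit it to the network (omitted here).
        Some(_template) => Ok(()),
        None => Err("template already dropped, cannot reconstruct the block"),
    }
}
```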

@Sjors
Collaborator

Sjors commented Feb 9, 2024

Mmm, but (2) is not great for the scenario where a pool wants to check a template and shares are already being submitted.

I see two options there:

  1. Recommend that pools simply approve shares for a stale template, even if they couldn't check it (is that safe?)
  2. Don't reject RequestTransactionData for stale templates.

(2) seems like a waste of bandwidth though, when you really want to prioritize the next block

@Fi3
Collaborator

Fi3 commented Feb 10, 2024

As soon as the JDC receives the template, (1) it sends a job downstream, (2) it starts the procedure for job declaration.

If the TP sends template A and then template B, and between A and B the miner has done some work but not yet declared it, we want to complete the RequestTransactionData while A is still valid (block_height(A) = block_height(B)).

I do not think that this is an issue for bandwidth use, as it is very rare.

If A is a stale block, jobs for A IMO are not valid anymore, so it's fine to just not complete the job declaration. If instead jobs for A are still valid because A is not stale, e.g. when B has the same height as A but more fees, IMO we want to:

  1. send the JDC the tx data, and send B if not yet sent
  2. the JDC will complete the job declaration and send the shares upstream; in the meantime the JDC has already sent new jobs based on B downstream

@GitGab19
Collaborator Author

As soon as the JDC receives the template, (1) it sends a job downstream, (2) it starts the procedure for job declaration.

If the TP sends template A and then template B, and between A and B the miner has done some work but not yet declared it, we want to complete the RequestTransactionData while A is still valid (block_height(A) = block_height(B)).

I do not think that this is an issue for bandwidth use, as it is very rare.

If A is a stale block, jobs for A IMO are not valid anymore, so it's fine to just not complete the job declaration. If instead jobs for A are still valid because A is not stale, e.g. when B has the same height as A but more fees, IMO we want to:

1. send the JDC the tx data, and send B if not yet sent

2. the JDC will complete the job declaration and send the shares upstream; in the meantime the JDC has already sent new jobs based on B downstream

Yeah, I also see it the same way as @Fi3.
As already said, I would manage it through the use of 2 error codes:

  • template-id-not-found (this is the case in which the template is too old and not in memory anymore) --> drop connection since probably something wrong is going on
  • stale-template-id (in this case the template is still in TP memory, but it's stale so it makes sense to work on the new one) --> drop template and work on declaring the new one

All other cases are good:

  • JDC sends a RequestTransactionData message for a new and regular template --> TP answers it with data
  • JDC sends a RequestTransactionData message for an old but still valid template (not stale) --> TP answers it with data

What do you think about this?
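
On the TP side, the four cases above might reduce to a single lookup plus a prev-hash check. A rough Rust sketch for illustration only (the real Template Provider is C++ in Bitcoin Core, and all names here are invented):

```rust
use std::collections::HashMap;

// Illustrative sketch of the four cases above; names and types are invented.
struct CachedTemplate {
    prev_hash: [u8; 32],   // block this template builds on
    tx_list: Vec<Vec<u8>>, // serialized transactions (excluding coinbase)
}

enum TxDataResponse {
    Success { tx_list: Vec<Vec<u8>> },
    Error { error_code: &'static str },
}

fn handle_request_transaction_data(
    cache: &HashMap<u64, CachedTemplate>,
    chain_tip: [u8; 32], // hash of the current best block
    template_id: u64,
) -> TxDataResponse {
    match cache.get(&template_id) {
        // Template unknown (too old, already evicted): something is wrong,
        // the JDC is expected to drop the connection.
        None => TxDataResponse::Error { error_code: "template-id-not-found" },
        // Template known but built on a block that is no longer the tip:
        // the JDC drops it and declares the newest template instead.
        Some(t) if t.prev_hash != chain_tip => {
            TxDataResponse::Error { error_code: "stale-template-id" }
        }
        // New, or old but still valid, template on the current tip: serve data.
        Some(t) => TxDataResponse::Success { tx_list: t.tx_list.clone() },
    }
}
```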

@Sjors
Collaborator

Sjors commented Feb 12, 2024

It sounds like the Job Declarator Server should also have a way to tell the client that the block was rejected* because it was stale? Otherwise the client might automatically switch pools?

* = assuming they reject it. If they just accept it, then there's no error message needed.

@GitGab19
Collaborator Author

It sounds like the Job Declarator Server should also have a way to tell the client that the block was rejected* because it was stale? Otherwise the client might automatically switch pools?

* = assuming they reject it. If they just accept it, then there's no error message needed.

It seems to me there isn't (looking at the specs), but @Fi3 actually implemented the pool-switching mechanism, so he surely knows better how it works in SRI.

@Sjors
Collaborator

Sjors commented Feb 12, 2024

I added a commit "Reject RequestTransactionData for stale templates" to https://github.com/Sjors/bitcoin/commits/2024/02/sv2-poll-ellswift/

It returns stale-template-id if the template hashPrevBlock does not refer to the tip.

I haven't tested it though.

@Fi3
Collaborator

Fi3 commented Feb 14, 2024

It sounds like the Job Declarator Server should also have a way to tell the client that the block was rejected* because it was stale? Otherwise the client might automatically switch pools?

This is a very good point. The JDS does not know which prev hash the JDC wants to use [1], so it should be the pool that says whether a block is stale or not. One possible way to handle this case could be for the JDC to compare the last prev hash communicated by the pool with the one of the current job, and when they change, stop sending shares to the pool; the pool should have a threshold within which stale shares are accepted, to account for the lag of the system. We also need to define what the JDS should do if it receives txs that are in the last block mined, and what the JDC should do. There are various possibilities:

  1. JDS accepts the block; then the JDC sees that there is a new prev hash and stops sending shares to the pool within the threshold; if not, the pool will reject the shares as invalid and the JDC will change pool.
  2. JDS sends DeclareMiningJob.Error and communicates to the JDC that the job is stale
  3. JDS sends DeclareMiningJob.Error and communicates to the JDC that the txs are not valid, or another general error

This issue for sure deserves further discussion. For that I propose to treat it as a different issue, and for now proceed with the proposed solution so that we can fix the panic, which is the most urgent thing.

Also, I feel like whatever we decide does not belong in the spec but rather in the implementation guidelines.

Footnotes

  1. even if it could look at the proposed txs and the last block mined
