JDC panics after receiving RequestTransactionDataError message #709

Closed
GitGab19 opened this issue Jan 3, 2024 · 42 comments · Fixed by #754

@GitGab19
Collaborator

GitGab19 commented Jan 3, 2024

JDC panics after receiving a RequestTransactionDataError message.
The line of interest is roles/jd-client/src/template_receiver/message_handler.rs:46:9.

For better context, here's a screenshot of the error; we got it right after the last testnet block was mined:
[Screenshot 2024-01-03 at 15:20:55]
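
For illustration only (this is not the actual code at roles/jd-client/src/template_receiver/message_handler.rs, and the type is reduced to the two fields the Sv2 message carries), the crash corresponds to a handler arm that has no recovery logic yet:

```rust
// Illustrative only, not the actual SRI code: an unhandled
// RequestTransactionData.Error arm that falls through to a panic/todo
// aborts the whole JDC process, which is the kind of crash shown above.
struct RequestTransactionDataError {
    template_id: u64,
    error_code: String,
}

fn handle_request_tx_data_error(msg: RequestTransactionDataError) {
    // No recovery logic yet, so the role crashes instead of reacting to the error.
    todo!(
        "handle RequestTransactionData.Error for template {} ({})",
        msg.template_id,
        msg.error_code
    )
}
```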

@GitGab19 added the bug and job declarator labels on Jan 3, 2024
@Fi3
Collaborator

Fi3 commented Jan 3, 2024

JDC does not handle the RequestTransactionData error; we should decide what to do when it happens. This is not straightforward, because a TP that is unable to provide transaction data for transactions that it itself included in the block should likely be considered unreliable, and the connection should be closed. But maybe we prefer to try again? Some discussion is needed here.

@Fi3 removed the bug label on Jan 3, 2024
@Fi3
Collaborator

Fi3 commented Jan 3, 2024

Removed the bug label, as this is more a new feature than a bug.

@GitGab19
Collaborator Author

GitGab19 commented Jan 4, 2024

In the specific case mentioned here, JDC received the RequestTransactionDataError for template 1946. That was the first template created after the new block was found. Right after that, a new future template was created and sent to JDC.
Could it be that TP is not able to get information about previous templates?

@Sjors
Collaborator

Sjors commented Jan 12, 2024

Templates are forgotten after a new block is found so that's probably what happened. It's on my list to address that.

@pavlenex modified the milestones: MVP3, Milestone4 on Jan 16, 2024
@GitGab19
Collaborator Author

Templates are forgotten after a new block is found so that's probably what happened. It's on my list to address that.

Hi @Sjors, I can confirm that every time we get this error it is right after a new block is found and a SetNewPrevHash is received.

@Sjors
Collaborator

Sjors commented Feb 6, 2024

@GitGab19 the latest https://github.com/Sjors/bitcoin/tree/2024/02/sv2-poll-ellswift will keep the old template around for 10 seconds after a new block arrives. I do still have to implement code that holds on to mempool transactions.
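
Conceptually that is a grace-period cache of recent templates. A rough Rust sketch of the idea, purely for illustration (the actual Template Provider lives in the C++ branch above, and none of these names exist there):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical grace-period template cache; names are invented.
const GRACE_PERIOD: Duration = Duration::from_secs(10);

struct CachedTemplate {
    prev_hash: [u8; 32],          // block this template builds on
    stale_since: Option<Instant>, // set when a new block makes it stale
}

struct TemplateCache {
    templates: HashMap<u64, CachedTemplate>,
}

impl TemplateCache {
    // On a new best block: mark templates that no longer build on the tip as
    // stale, and evict anything that has been stale for longer than the grace
    // period, so its transactions can finally be released from memory.
    fn on_new_block(&mut self, new_tip: [u8; 32]) {
        let now = Instant::now();
        for tmpl in self.templates.values_mut() {
            if tmpl.prev_hash != new_tip && tmpl.stale_since.is_none() {
                tmpl.stale_since = Some(now);
            }
        }
        self.templates
            .retain(|_, t| t.stale_since.map_or(true, |s| now - s < GRACE_PERIOD));
    }
}
```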

@pavlenex
Collaborator

pavlenex commented Feb 6, 2024

From dev call:

We have to modify specs, add error message, TP needs to send this message, and SRI needs to react to this....

@pavlenex modified the milestones: Milestone 4, Milestone 5 on Feb 6, 2024
@GitGab19
Collaborator Author

GitGab19 commented Feb 7, 2024

@GitGab19 the latest https://github.com/Sjors/bitcoin/tree/2024/02/sv2-poll-ellswift will keep the old template around for 10 seconds after a new block arrives. I do still have to implement code that holds on to mempool transactions.

Have you already tested this branch?
Because I'm trying to connect the pool to it but it doesn't work

@Sjors
Collaborator

Sjors commented Feb 7, 2024

@GitGab19 did you test it with #737?

@GitGab19
Collaborator Author

GitGab19 commented Feb 7, 2024

@GitGab19 did you test it with #737?

Ok, just switched to #737 and now it's working.
I will let you know if we hit the same issue again

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

From dev call:

We have to modify specs, add error message, TP needs to send this message, and SRI needs to react to this....

Do we want to go down this path? What do you think about it?
What is important imo is to define JDC actions in response to these error codes.
For example, what should a JDC do after a RequestTransactionData.Error message with error code template-id-not-found is received? Close connection with TP? Discard the message and go on?
For reference, here are the RequestTransactionData.Error specs.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

For example, what should a JDC do after a RequestTransactionData.Error message with error code template-id-not-found is received?

drop the template and use the new one. I don't see any other possible action

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

For example, what should a JDC do after a RequestTransactionData.Error message with error code template-id-not-found is received?

drop the template and use the new one. I don't see any other possible action

If we want to manage it in this way, I think there's no need to add any new error codes to specs at all.
If we agree on this, I can open a PR to close this issue soon

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Yes, we need the new error code to know that we are in this case. Other cases still need to drop the connection with the upstream.

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Yes, we need the new error code to know that we are in this case. Other cases still need to drop the connection with the upstream.

For example?
Which other error codes are you thinking about?

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

We should have at least one for id not found and one for expired id

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Otherwise how could we know whether we want to drop the connection or just drop the template and wait for a new one?

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Otherwise how could we know whether we want to drop the connection or just drop the template and wait for a new one?

I may have misunderstood your opinion because of your suggestion of dropping templates in previous comments.

So, let me recap what we could have:

  • template-id-not-found --> JDC drops connection to TP
  • template-id-expired / stale-template-id --> JDC drops template and works on the new one

How does it sound to you?
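
Something along these lines on the JDC side (a rough Rust sketch of the proposed behaviour, not the actual SRI handler; type and message names are invented):

```rust
// Rough sketch of the proposed JDC reaction to RequestTransactionData.Error;
// names are invented for illustration only.
enum TpAction {
    DropConnection,    // unrecoverable: close the connection to the TP
    DropTemplate(u64), // recoverable: forget this template, keep the newest one
}

fn on_request_tx_data_error(template_id: u64, error_code: &str) -> TpAction {
    match error_code {
        // Template completely unknown to the TP: something is badly wrong.
        "template-id-not-found" => TpAction::DropConnection,
        // Template known but stale: just drop it and work on the new one.
        "template-id-expired" | "stale-template-id" => TpAction::DropTemplate(template_id),
        // Any unrecognised code: be conservative and drop the connection.
        _ => TpAction::DropConnection,
    }
}
```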

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

seems a good solution

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Otherwise how could we know whether we want to drop the connection or just drop the template and wait for a new one?

I may have misunderstood your opinion because of your suggestion of dropping templates in previous comments.

So, let me recap what we could have:

* `template-id-not-found` --> JDC drops connection to TP

* `template-id-expired` / `stale-template-id` --> JDC drops template and works on the new one

How does it sound to you?

@Sjors what do you think about it?
Would it be possible to implement this new error code on TP?

@Sjors
Collaborator

Sjors commented Feb 9, 2024

I could add stale-template-id as a distinct message from template-id-not-found, by taking advantage of the fact that we number templates sequentially. But this doesn't seem very elegant, and it makes it impossible to randomise the template id in the future.

I would prefer only supporting template-id-not-found, so JDC should always drop the template and work on a new one.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

@Sjors I understood that you cache templates; if so, it would make more sense to have the 2 different errors. And drop templates, but if something very bad is happening (either on the TP or the JDC side) we close the connection.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Ofc if JDC is trying to retrieve templates from 1 hour ago, this counts as something very bad.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

If we have one error, we implicitly assume that the only case of an old or invalid template id is the one that we are seeing now.

@Sjors
Collaborator

Sjors commented Feb 9, 2024

Oh wait, I guess I can return stale-template-id if I still have the template cached, but it refers to a stale block. That makes sense.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

yep now i remember during the call this was the idea

@Sjors
Collaborator

Sjors commented Feb 9, 2024

Alright, I'll work on that next week.

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Oh wait, I guess I can return stale-template-id if I still have the template cached, but it refers to a stale block. That makes sense.

I like the approach, but in this way the JDC behaviour (in response to the error code received) still depends on how long the timer for keeping old templates is set.
For example (considering timer = 10s), what would happen if JDC sends a RequestTransactionData message 11 seconds after a new block is received by the TP?
The TP should send a stale-template-id error code, since it's a template related to the previous block.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Trying to get txs for a very old template means that there is a big issue; I think it is fine to drop the connection.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

We can have the max time as a config parameter. Btw I would say anything older than 1 minute.

@GitGab19
Collaborator Author

GitGab19 commented Feb 9, 2024

Trying to get txs for a very old template means that there is a big issue; I think it is fine to drop the connection.

I agree with you on this.
The important thing is that previous templates (the ones based on the same current block) are kept in memory until a new block is received, because they are still valid until a new block is found, and so it would make no sense to drop the connection for a delayed RequestTransactionData message (imo).

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

Yep, for still-valid templates we should just answer the request with the tx data for that template. The miner will start mining on the template as soon as it is received, without waiting for the tx data; that means there will be valid shares that can be sent to the pool for that template, and if we return an error the job declaration will never complete and the miner will lose some shares.

@Sjors
Collaborator

Sjors commented Feb 9, 2024

We could specify a minimum grace period in the spec. I think 10 seconds should be plenty. Anything longer implies a serious bug?

I'd rather not make this configurable, because holding on to old templates is potentially a memory DoS vector. We can't release the full transactions from memory until all templates referencing them are let go. So imagine a 1 MB inscription with 1000 RBF bumps in a few minutes... Every bump might have enough extra fees to justify a new template (especially once we have cluster mempool and can make templates faster than today).

I'd rather have the "freedom" to drop old templates quickly when memory becomes a problem, or keep them longer if memory is fine. And I'd rather not explain all that in the docs for such a setting.

@Fi3
Collaborator

Fi3 commented Feb 9, 2024

yep 10 sec is more than enough IMO. I said 1 min just to take a gigantic value.

@Sjors
Collaborator

Sjors commented Feb 9, 2024

My idea would be this:

  1. Hold on to templates for at least 10 seconds (more if memory allows)
  2. When a RequestTransactionData message asks for an old template, deny it with stale-template-id.
  3. When a SubmitSolution arrives, accept it if we still have the template

The Template Provider does not handle shares.

The miner will start mining on the template as soon as it is received, without waiting for the tx data; that means there will be valid shares that can be sent to the pool for that template, and if we return an error the job declaration will never complete and the miner will lose some shares.

Not sure how to handle this. Does the pool actually have to check the transactions for the share? I assumed it would check the templates immediately, not when it receives a share. As long as the pool does that within 10 seconds there should be no problem. And if it takes longer than that, there's something wrong with the pool?
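
Point 3 of the plan might look roughly like the sketch below (illustrative only; the real Template Provider is C++ and the names here are invented):

```rust
use std::collections::HashMap;

// Illustrative only: accept a SubmitSolution whenever the referenced template
// is still cached, even if it has already been flagged stale; fail only if it
// was evicted and the block can no longer be reconstructed.
struct Template; // stand-in for the cached template data

fn on_submit_solution(
    cache: &HashMap<u64, Template>,
    template_id: u64,
) -> Result<(), &'static str> {
    match cache.get(&template_id) {
        // Rebuild the full block from the cached template plus the solution
        // fields and submit it to the network (omitted here).
        Some(_template) => Ok(()),
        None => Err("template already dropped, cannot reconstruct the block"),
    }
}
```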

@Sjors
Collaborator

Sjors commented Feb 9, 2024

Mmm, but (2) is not great for the scenario where a pool wants to check a template and shares are already being submitted.

I see two options there:

  1. Recommend that pools simply approve shares for a stale template, even if they couldn't check it (is that safe?)
  2. Don't reject RequestTransactionData for stale templates.

(2) seems like a waste of bandwidth though, when you really want to prioritize the next block

@Fi3
Collaborator

Fi3 commented Feb 10, 2024

As soon as the JDC receives the template, (1) it sends a job downstream, (2) it starts the procedure for job declaration.

If the TP sends template A and then template B, and between A and B the miner has done some work but not yet declared it, we want to complete the RequestTransactionData while A is still valid (block_height(A) = block_height(B)).

I do not think that this is an issue for bandwidth use, as it is very rare.

If A is a stale block, jobs for A IMO are not valid anymore, so it's fine to just not complete the job declaration. If instead jobs for A are still valid because A is not stale, e.g. when B has the same height as A but more fees, IMO we want to:

  1. send the JDC the tx data, and send B if not yet sent
  2. the JDC will complete the job declaration and send the shares upstream; in the meantime the JDC has already sent new jobs based on B downstream

@GitGab19
Collaborator Author

As soon as the JDC receives the template, (1) it sends a job downstream, (2) it starts the procedure for job declaration.

If the TP sends template A and then template B, and between A and B the miner has done some work but not yet declared it, we want to complete the RequestTransactionData while A is still valid (block_height(A) = block_height(B)).

I do not think that this is an issue for bandwidth use, as it is very rare.

If A is a stale block, jobs for A IMO are not valid anymore, so it's fine to just not complete the job declaration. If instead jobs for A are still valid because A is not stale, e.g. when B has the same height as A but more fees, IMO we want to:

1. send the JDC the tx data, and send B if not yet sent

2. the JDC will complete the job declaration and send the shares upstream; in the meantime the JDC has already sent new jobs based on B downstream

Yeah, I also see it the same way as @Fi3.
As already said, I would manage it through the use of 2 error codes:

  • template-id-not-found (this is the case in which the template is too old and not in memory anymore) --> drop connection since probably something wrong is going on
  • stale-template-id (in this case the template is still in TP memory, but it's stale so it makes sense to work on the new one) --> drop template and work on declaring the new one

All other cases are good:

  • JDC sends a RequestTransactionData message for a new and regular template --> TP answers it with data
  • JDC sends a RequestTransactionData message for an old but still valid template (not stale) --> TP answers it with data

What do you think about this?
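
On the TP side, the four cases above might reduce to a single lookup plus a prev-hash check. A rough Rust sketch for illustration only (the real Template Provider is C++ in Bitcoin Core, and all names here are invented):

```rust
use std::collections::HashMap;

// Illustrative sketch of the four cases above; names and types are invented.
struct CachedTemplate {
    prev_hash: [u8; 32],   // block this template builds on
    tx_list: Vec<Vec<u8>>, // serialized transactions (excluding coinbase)
}

enum TxDataResponse {
    Success { tx_list: Vec<Vec<u8>> },
    Error { error_code: &'static str },
}

fn handle_request_transaction_data(
    cache: &HashMap<u64, CachedTemplate>,
    chain_tip: [u8; 32], // hash of the current best block
    template_id: u64,
) -> TxDataResponse {
    match cache.get(&template_id) {
        // Template unknown (too old, already evicted): something is wrong,
        // the JDC is expected to drop the connection.
        None => TxDataResponse::Error { error_code: "template-id-not-found" },
        // Template known but built on a block that is no longer the tip:
        // the JDC drops it and declares the newest template instead.
        Some(t) if t.prev_hash != chain_tip => {
            TxDataResponse::Error { error_code: "stale-template-id" }
        }
        // New, or old but still valid, template on the current tip: serve data.
        Some(t) => TxDataResponse::Success { tx_list: t.tx_list.clone() },
    }
}
```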

@Sjors
Collaborator

Sjors commented Feb 12, 2024

It sounds like the Job Declarator Server should also have a way to tell the client that the block was rejected* because it was stale? Otherwise the client might automatically switch pools?

* = assuming they reject it. If they just accept it, then there's no error message needed.

@GitGab19
Collaborator Author

It sounds like the Job Declarator Server should also have a way to tell the client that the block was rejected* because it was stale? Otherwise the client might automatically switch pools?

* = assuming they reject it. If they just accept it, then there's no error message needed.

It seems to me there isn't (looking at the specs), but @Fi3 actually implemented the pool-switching mechanism, so he surely knows better how it works in SRI.

@Sjors
Collaborator

Sjors commented Feb 12, 2024

I added a commit "Reject RequestTransactionData for stale templates" to https://github.com/Sjors/bitcoin/commits/2024/02/sv2-poll-ellswift/

It returns stale-template-id if the template hashPrevBlock does not refer to the tip.

I haven't tested it though.

@Fi3
Collaborator

Fi3 commented Feb 14, 2024

It sounds like the Job Declarator Server should also have a way to tell the client that the block was rejected* because it was stale? Otherwise the client might automatically switch pools?

This is a very good point. The JDS does not know which prev hash the JDC wants to use [1], so it should be the pool that says whether a block is stale or not. One possible way to handle this case could be for the JDC to compare the last prev hash communicated by the pool with the one of the current job, and when they change, stop sending shares to the pool; the pool should have a threshold within which stale shares are accepted, to account for the lag of the system. We also need to define what the JDS should do if it receives txs that are in the last block mined, and what the JDC should do. There are various possibilities:

  1. JDS accepts the block; then the JDC sees that there is a new prev hash and stops sending shares to the pool within the threshold; if not, the pool will reject the shares as invalid and the JDC will change pool.
  2. JDS sends DeclareMiningJob.Error and communicates to the JDC that the job is stale
  3. JDS sends DeclareMiningJob.Error and communicates to the JDC that the txs are not valid, or another general error

This issue for sure deserves further discussion. For that I propose to treat it as a different issue, and for now proceed with the proposed solution so that we can fix the panic, which is the most urgent thing.

Also, I feel like whatever we decide does not belong in the spec but rather in the implementation guidelines.

Footnotes

  1. even if it could look at the proposed txs and the last block mined
