
[Proposal] DoS mitigations #258

Closed · tevador opened this issue Nov 6, 2022 · 21 comments

@tevador (Owner) commented Nov 6, 2022

Network nodes typically verify RandomX hashes in the light mode. One hash calculation takes around 15 ms on high-end machines and up to 1 second on low-end 32-bit machines (Raspberry Pi Zero). This might be a denial-of-service (DoS) vector if malicious nodes started flooding the network with invalid blocks. I don't know if any actual attacks exploiting this have happened, but it might be a good idea to mitigate this.

As per the specification, the RandomX result is calculated as a Blake2b hash of the final register file (with the constant registers replaced by the scratchpad hash). We could include some intermediate data in the block blob so that the final result can be quickly calculated with one call to the Blake2b compression function. This would require a change of the block format, but since Seraphis will bring large changes to the transaction format anyway, adding one field to the block blob would be a comparatively small change.

If we wanted to make hashes backwards compatible, at least 192 bytes would have to be included in the block: 128 bytes for the last Blake2b input block and 64 bytes for the internal hash state (see blake2b.c).
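
For illustration, here is a rough sketch of what that 192-byte intermediate state would contain, assuming the layout of the reference blake2b.c; the struct and field names are illustrative, not the actual ones:

```cpp
#include <cstdint>

// Hypothetical layout of the 192 bytes of Blake2b midstate that would
// need to be stored in the block for a "backwards compatible" fast path.
struct Blake2bMidstate {
    uint64_t h[8];      // 64 bytes: chaining state before the final compression
    uint8_t  buf[128];  // 128 bytes: the last (padded) input block
};
static_assert(sizeof(Blake2bMidstate) == 192, "matches the 192 bytes above");
```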

However, even such a "backwards compatible" change would require all mining software to be updated to also return the intermediate state. So I don't think we need to restrict ourselves to backwards compatible solutions.

A much simpler solution would be to change step 14 of the algorithm from

  14. Result is calculated as R = Hash256(RegisterFile).

to

  14. Result is calculated as R = Hash256(Hash256(RegisterFile)).

This would produce different hashes, but the intermediate state would be reduced to just 32 bytes. This change is also very easily proven to be secure (double hash is as secure as a single hash) and it would be a non-contentious fork because the performance implications are negligible (additional ~0.3 μs on top of the full ~1600 μs calculation in full mode) and no changes are needed to the RandomX virtual machine.
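
As a sketch, using the one-shot blake2b function from the reference BLAKE2 API (blake2.h); everything else here is illustrative, not the actual RandomX source:

```cpp
#include <cstdint>
#include <cstddef>
#include "blake2.h"  // reference BLAKE2 implementation

// Sketch of the proposed step 14: R = Hash256(Hash256(RegisterFile)).
// The 32-byte prehash is the only intermediate state a node would need
// in order to check the final result with a single extra Blake2b call.
void result_double_hash(uint8_t result[32],
                        const void* register_file, size_t rf_size) {
    uint8_t prehash[32];
    blake2b(prehash, sizeof(prehash), register_file, rf_size, nullptr, 0);
    blake2b(result, 32, prehash, sizeof(prehash), nullptr, 0);
}
```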

If network nodes first validated the difficulty of the final Blake2b hash, it would become infeasible to use the PoW algorithm to DoS the network. At the current network difficulty of ~300G, it takes about 1 minute on a high-end GPU to produce a Blake2b hash that is below the target.
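
For context, a minimal sketch of such a difficulty pre-check; this is a simplified stand-in, not Monero's actual check_hash(), which compares against the full 256-bit product:

```cpp
#include <cstdint>
#include <cstring>

// Simplified stand-in for a difficulty check: treat the hash as a
// little-endian 256-bit integer and approximate "hash * difficulty < 2^256"
// using only its most significant 64 bits. Assumes difficulty > 0.
bool meets_difficulty(const uint8_t hash[32], uint64_t difficulty) {
    uint64_t top64;
    std::memcpy(&top64, hash + 24, sizeof(top64));  // most significant bytes
    return top64 < (UINT64_MAX / difficulty);
}
```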

@SChernykh (Collaborator) commented Nov 6, 2022

What stops an attacker from generating high-difficulty Blake2b hash once and using it many times?
Edit: R = Hash256(Hash256(RegisterFile) || input_blob) would be better.

@tevador (Owner, Author) commented Nov 6, 2022

What stops an attacker from generating high-difficulty Blake2b hash once and using it many times?

I guess there would need to be some sort of cache that would include blocks passing Blake2b but failing RandomX.

R = Hash256(Hash256(RegisterFile) || input_blob) would be better.

Yes, that's also an option.

R = Hash256(Hash256(RegisterFile) || H)
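
A sketch of this variant with the reference BLAKE2 streaming API, taking H to be the block hashing blob per the suggestion above; the function name is illustrative:

```cpp
#include <cstdint>
#include <cstddef>
#include "blake2.h"  // reference BLAKE2 implementation

// Sketch of R = Hash256(Hash256(RegisterFile) || H). Mixing in the hashing
// blob binds the result to a specific block, so a high-difficulty prehash
// cannot be precomputed once and replayed for future blocks.
void result_with_blob(uint8_t result[32],
                      const void* register_file, size_t rf_size,
                      const void* hashing_blob, size_t blob_size) {
    uint8_t prehash[32];
    blake2b(prehash, sizeof(prehash), register_file, rf_size, nullptr, 0);

    blake2b_state S;
    blake2b_init(&S, 32);
    blake2b_update(&S, prehash, sizeof(prehash));
    blake2b_update(&S, hashing_blob, blob_size);
    blake2b_final(&S, result, 32);
}
```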

@SChernykh (Collaborator) commented:

A cache of failing hashes would be safer from a cryptographic point of view. I'm not sure that exposing the final hash directly like this: Hash256(Hash256(RegisterFile) || input_blob) is 100% safe.

@tevador (Owner, Author) commented Nov 6, 2022

I'm not sure that exposing the final hash directly like this: Hash256(Hash256(RegisterFile) || input_blob) is 100% safe.

Why would it be unsafe? It's actually a good idea to thwart precomputed DoS attacks because the attacker cannot know the hashing blobs for future blocks.

@SChernykh (Collaborator) commented:

I'm not a cryptography expert, and I don't see any shortcuts in this approach right now. My concern is that adding the original input blob to the final hash could create shortcuts in the PoW if, for example, Blake2 gets broken.

@tevador (Owner, Author) commented Nov 6, 2022

if, for example, Blake2 gets broken.

I don't think we can make any statements about the security of RandomX if Blake2 is broken, regardless of this proposal.

my concern is that adding the original input blob to the final hash could create shortcuts

A missing input to a hash function is what can cause vulnerabilities. I don't think it's possible to create a vulnerability by adding more context to a hash function.

Moreover, this inner/outer PoW is not a new idea. It was used by Ethereum and is currently also used by Grin.

@SChernykh (Collaborator) commented:

I know, and I don't see any problems with it myself. I'm just saying it will need another audit from cryptography experts to be 100% sure.

@tevador (Owner, Author) commented Nov 6, 2022

I don't think we need an audit for a 1-line change in specs and perhaps 8 lines of C++ code. MRL has enough capacity for this.

@UkoeHB commented Nov 9, 2022

So the proposal is:

  1. store PoW_prehash = Hash256(RegisterFile) in block headers
  2. compute PoW_hash = Hash256(PoW_prehash || H) then check if PoW_hash meets difficulty requirements
  3. compute RegisterFile and check if PoW_prehash is valid

Afaict this is a good solution to the stated problem. Note that it only solves DoS in cases where computing RegisterFile is the last step. If you have any step that can fail after RegisterFile (that isn't PoW-locked; the PoW_prehash recomputation can fail, but you need to satisfy the PoW condition to cause failures), then a DoSer can just make that step fail, and then you have to compute RegisterFile to resolve bad blocks. Basically, any pre-validated computations should always go at the end of your validation chain.
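
A sketch of the three steps above as a validation routine; every identifier here is illustrative, not taken from the Monero codebase:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Illustrative helpers, not real APIs:
void hash256_concat(uint8_t out[32], const uint8_t* a, size_t alen,
                    const uint8_t* b, size_t blen);      // Blake2b-256(a || b)
void randomx_prehash(uint8_t out[32], const uint8_t* blob,
                     size_t blen);                        // Hash256(RegisterFile), runs the VM
bool meets_difficulty(const uint8_t hash[32], uint64_t difficulty);

bool validate_block_pow(const uint8_t pow_prehash[32],    // from the block header (step 1)
                        const uint8_t* hashing_blob, size_t blob_size,
                        uint64_t difficulty) {
    // Step 2: cheap outer check, one Blake2b call over 32 bytes plus the blob.
    uint8_t pow_hash[32];
    hash256_concat(pow_hash, pow_prehash, 32, hashing_blob, blob_size);
    if (!meets_difficulty(pow_hash, difficulty))
        return false;         // rejected without ever running the RandomX VM

    // Step 3: expensive inner check, reached only after real PoW was done.
    uint8_t expected[32];
    randomx_prehash(expected, hashing_blob, blob_size);   // ~15 ms in light mode
    return std::memcmp(expected, pow_prehash, 32) == 0;
}
```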

I believe the only way to mitigate DoS (at the validation site) for validation that involves multiple steps with non-trivial cost and no pre-validation (e.g. validating a transaction or set of transactions) is to randomize the order of validation. Then you can get a 50% reduction in average validation cost for maximally efficient invalid DoS statements.

@tevador (Owner, Author) commented Nov 9, 2022

@UkoeHB yes, this is the gist of the proposal. The third step involves invoking the RandomX virtual machine and is therefore quite expensive, but the attacker needs to perform a substantial amount of work to pass step 2.

If you have any step that can fail after RegisterFile (that isn't PoW-locked

No, the register file used in the hash is the final output of the RandomX VM.

I believe the only way to mitigate DoS (at the validation site) for validation that involves multiple steps with non-trivial cost and no pre-validation (e.g. validating a transaction or set of transactions) is to randomize the order of validation. Then you can get a 50% reduction in average validation cost for maximally efficient invalid DoS statements.

This proposal only focuses on DoS using the PoW algorithm.

However, I don't think randomization is the best way. An attacker can trivially submit blocks that violate all rules. In that case, the optimal strategy is to validate from the cheapest to the most expensive rule (where the cost is measured e.g. by CPU cycles needed to validate the rule).
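
As a sketch of that strategy, assuming each consensus rule can be expressed as an independent check with a known average cost (all names here are illustrative):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Check {
    double avg_cost;             // e.g. measured CPU cycles per evaluation
    std::function<bool()> rule;  // returns false if the block violates the rule
};

// Run the cheapest rules first, so a block that violates everything
// fails after paying only the cheapest check.
bool validate_cheapest_first(std::vector<Check> checks) {
    std::sort(checks.begin(), checks.end(),
              [](const Check& a, const Check& b) { return a.avg_cost < b.avg_cost; });
    for (const Check& c : checks)
        if (!c.rule())
            return false;
    return true;
}
```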

@UkoeHB commented Nov 9, 2022

No, the register file used in the hash is the final output of the RandomX VM.
This proposal only focuses on DoS using the PoW algorithm.

Right, I just mean if validating PoW is always done as part of block validation, and then you have some other validation step after the PoW check, the DoSer can just target that final step.

However, I don't think randomization is the best way. An attacker can trivially submit blocks that violate all rules. In that case, the optimal strategy is to validate from the cheapest to the most expensive rule (where the cost is measured e.g. by CPU cycles needed to validate the rule).

Randomization is the best way if the attacker knows your validation strategy, because he can always target the last step in a fixed validation sequence. You can increase the cost to the attacker to execute a max-cost attack by ordering validation steps by cost, but that means increasing your own local cost to validate the attacker's statements (relative to dynamically randomizing validation steps).

When statement-creation and statement-validation costs are massively disproportionate, as in PoW, then it makes sense to move the disproportionate steps to the end and order them by cost (assuming you can guarantee that the attacker pays statement-creation costs like this proposal does).

Randomization is basically academic, though; it's probably a serious pain to implement correctly.
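
For completeness, a sketch of what randomized validation order would look like, reusing the illustrative Check struct from the sketch above:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Shuffle the checks so the attacker cannot predict which rule runs last;
// an invalid statement then fails, on average, halfway through the sequence.
bool validate_randomized(std::vector<Check> checks) {
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::shuffle(checks.begin(), checks.end(), rng);
    for (const Check& c : checks)
        if (!c.rule())
            return false;
    return true;
}
```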

@MoneroChan commented:

I'm trying to design a better system to defend against invalid blocks, called "Proof of Attack" (PoA).

I'm trying to see if we can exploit Dandelion++ to reverse-attack (triage) malicious nodes that are submitting invalid blocks.

Here's a Draft Example of the "Proof of Attack" (PoA) defence system:

  1. Normally, Dandelion++ makes it almost impossible to block malicious IPs, like Tor, and the majority of the nodes are doing their part to hide everyone else... But,
  2. that's ONLY IF all nodes "choose" to hide everyone else.
  3. If nodes have absolute proof they are under attack (Proof of Attack, PoA), they have no reason to hide IPs and can just as easily choose to work together to unmask the attacker, allowing network-level IP triage previously impossible due to Dandelion++.
  4. From what I know, each node knows exactly what data it sent, to which IP, and from which IP that data came (they just "choose" to keep it secret).
  5. So if proven to be under attack (Proof of Attack, PoA), nodes can choose to willingly work together and unmask the approximate source of the transaction.
  6. This "Proof of Attack" would constitute, for example, the next node in the chain replying with proof that it received a malicious invalid block. It does this by sending a copy of that malicious block along with the part forwarded from the previous node, so that the previous node and all those prior can provably double-check and confirm that they indeed forwarded that malicious block, and are thus "provably under attack" and "part of the chain of attack".
  7. If nodes have proof they are under attack, they have no reason to keep the sender of that attack block private anymore.
  8. They then forward this "Proof of Attack" down the Dandelion IP address chain, so every node that participated in forwarding the invalid block knows about it, all the way back to the source IP address at the malicious node itself.
  9. So in essence, Dandelion++ works in reverse and goes on the offensive.
  10. Obviously, the malicious node will LIE and say it came from someone else, but that won't work, because of entropy and Proof of Attack.
  11. To exploit entropy to prevent DoS, each node puts the IP address of the previous node/forwarder on a "low priority" list. This doesn't affect the network or the propagation of blocks; it just means blocks from IPs on that list are processed at the lowest priority, using spare CPU cycles, and treated as possible DoS, so the average probability builds up against the attacker.
  12. Due to entropy, the malicious node's IP quickly ends up on the "last to do" list of every node on the network, so every block or piece of data from that IP address is tagged as low priority on every node's list.
  13. DoS is therefore 100% impossible, as the malicious nodes quickly run out of IP addresses, since all nodes work together to triage IP addresses using Proof of Attack reports.

End result:
All blocks and forwards from the malicious node's attacking IP are tagged as low priority, so a malicious node will never cause DoS; its data is always processed last, using only spare CPU cycles.

The benefit of this method is that triaging doesn't actually prevent invalid blocks from being submitted; it just sorts them last, so weak nodes can simply ignore them, leaving the powerful nodes to crunch the numbers. The network quickly heals itself.

This is just a draft idea I came up with; not sure if it'll work or not.
Is this doable?

@moneromooo-monero commented:

I do not agree with step 10. It basically reads like "it will work because I say so" ("because of entropy" is not a valid argument unless you explain what it means).

AFAICT, it is very vulnerable to Sybils, in the form of simple nodes that do not have to perform any kind of work. Spin up enough nodes and you can "prove" that any arbitrary other node "attacked", by sending an invalid block yourself and claiming it came from your target.

PoW is how bitcoin/monero/etc get out of this Sybil issue, since spinning up more nodes stops being so cheap. If you can explain point 10 to people's satisfaction, it can be looked at again.

Also, Dandelion works on unmined txes, not blocks; that doesn't matter here, though.

@UkoeHB commented Nov 10, 2022

It does this by sending a copy of that malicious block along with the part forwarded from the previous node, so that the previous node and all those prior can provably double-check and confirm that they indeed forwarded that malicious block, and are thus "provably under attack" and "part of the chain of attack".

They send this information to the node they received the block/tx from? Wouldn't that node be the malicious one? Otherwise it wouldn't have transmitted an invalid statement. No need for triage; just block IPs that send you invalid stuff.

What you could do is form graphs of trusted nodes that report malicious node IPs to each other. A very complicated system to build though.

@MoneroChan commented Nov 11, 2022

Re: Step 10,

  • All nodes know what data passed through them; e.g. they can store a copy of it in RAM. Even 1 GB of RAM can store a lot of txs for a long period at little cost, and they only need to keep the data for a few minutes to check whether malicious data went through them.
  • So, to prevent nodes from lying about being under attack, "Proof of Attack" would require, for example, the next node in the chain to reply with proof that it received malicious invalid data. It does this by sending a copy of that malicious data along with the data forwarded from the previous node, so the current node can see that it is telling the truth (by checking its RAM to confirm that it indeed forwarded part of the malicious data). It then forwards the details to the previous node, and all those prior can provably double-check and confirm that they were indeed involved in the attack. Because they keep a copy of the malicious data in RAM, they know with certainty they are not being lied to, and are thus "provably under attack" and "part of the chain of attack". The nodes then report to each other to confirm the story, so all honest nodes are 'entangled' via the "Proof of Attack" and the data stored in RAM.

PoA does not target the malicious node directly, because the best the proof can do is narrow the attack down to just 2 nodes. The malicious node could be the last one that claims responsibility (who may be lying) or the one that denies responsibility (who may also be lying), so it's a 50% probability.

To explain the entropy: very quickly, after several cycles (e.g. with DoS), entropy builds up, and as more nodes share the PoA, the list of recurring IP addresses rapidly narrows down the source, e.g. 50%, then 75%, then 87.5%, then ~94%, soon reaching 99.9% after just a few hops, and the malicious nodes (no matter how many there are) soon run out of IP addresses, because in a random environment the source node will always accumulate a higher count of malicious data than the honest nodes. It's like how contaminated water disperses from approximately a single point: when the vectors cross, you can triangulate the probable source very quickly.

So, for example, if a Sybil attacker spins up multiple nodes and tells a huge bunch of lies by sending false blocks, only the participating nodes (the attackers) are affected, because only they can produce a Proof of Attack.
If they falsely blame another node, or the proof of attack doesn't check out (as in a Sybil),
no one will listen to them: honest nodes that are not part of the attack have no reason to listen to the malicious nodes, as the malicious data doesn't exist in their RAM database. They are instead identified as malicious for making false claims.

So after just one piece of malicious data, the set of suspects is reduced from 10+ nodes to just 2,
because the nodes can use the Proof of Attack to verify, work together, and trace the malicious node.

Also, I should emphasize that the punitive action is merely "triaging", i.e. optimization tagging.
Even if the attacker fully manipulates the PoA system, it doesn't hurt the network, because the network is currently basically random chaos, and any optimization, even 1%, is still better than no optimization at all.

The key thing to remember is that this is "immunization" by optimization and triaging, not blocking/banning an IP per se (that's optional). So although many honest nodes (>90%) may end up on the 'naughty list', they are given the benefit of the doubt, while the malicious nodes have the most doubt against them (they are on the most naughty lists, due to the entropy demonstrated above), are almost always at the bottom, and are affected the most, but still function normally as a failsafe.

I'll have to create a diagram to demonstrate this, as it's very messy in words. I apologise; this is just a draft.

@MoneroChan commented Nov 11, 2022

They send this information to the node they received the block/tx from? Wouldn't that node be the malicious one? Otherwise it wouldn't have transmitted an invalid statement. No need for triage; just block IPs that send you invalid stuff.

  • Dandelion++ hides the IPs, like Tor, by choice, so that won't work: you might ban an innocent node.
  • The only way to do this requires proof; PoA exploits the nodes' ability to choose whether to hide or identify the true malicious node.

The nodes store the malicious data, and when provided proof of an attack (PoA), two-way communication can occur among the honest nodes to run a trace route back to the source, who will ultimately lie about it.

  • The malicious node will lie about being malicious. There are two possible lies it can tell: blame another innocent, unrelated node, or blame the node it forwarded the data to.
    If it feigns innocence and blames the node it forwarded the data to, that instantly gives it away as the culprit and gets it banned by that following node, because the following node can verify with 100% certainty that it is lying. Other nodes may not know, but that's irrelevant: the attacker has one less node to spam due to the certain IP ban. So it will most likely lie and say a previous node sent the malicious data. Then, when the nodes query that previous node, the innocent node will deny any knowledge; the trail stops there, the set of suspects drops from 10+ to just 2 nodes (50% probability), and that 50% compounds quickly over time until the malicious node is eventually triangulated and distinguished from the innocent nodes.

  • So you probably should not block nodes based on IP, as there is always a probability that they are innocent.
    Instead, their data is "triaged", i.e. given lower priority: malicious nodes are treated as lowest priority, so they don't cause a DoS; they are simply ignored by weak nodes until spare CPU cycles are available, leaving the powerful nodes to crunch the potentially dirty malicious data. DoS is thus avoided via better sorting and optimization of malicious data.

@tevador (Owner, Author) commented Nov 11, 2022

Dandelion++ makes it almost impossible to block malicious IPs

Dandelion is used to propagate mempool transactions, not blocks. Blocks must be propagated as fast as possible.

This "Proof of Attack" would constitute for example, if the next node in the chain replys and proves Proof saying they received a malicious invalid block.

No such proof is possible unless you do some sort of PKI and have all nodes sign their messages. Impractical.

every node that participated in the forwarding of that invalid block

Invalid blocks are not propagated. If a node receives an invalid block, the sending node is almost certainly malicious and can be immediately blocked.

they can store a copy of it in RAM; even 1 GB of RAM can store a lot of txs

Explain that to someone running a node on a Raspberry Pi with 1 GB of RAM.

In any case, this discussion has gone off topic. I suggest @MoneroChan open a separate discussion in the Monero repository.

@MoneroChan commented Nov 11, 2022

I apologise; I was getting off track and appear to have misread the issue.

Back on topic:

If a node receives an invalid block, the sending node is almost certainly malicious and can be immediately blocked.

  • From what I understand, the core DoS issue here is the time wasted processing malicious blocks (because otherwise we could simply auto-block nodes to solve the issue, as @tevador said).

  • So in this case, can we just add a background timeout at the slow nodes to auto-dump blocks that take too long to process, and also block the nodes they come from? A timer threshold would prevent nodes from locking up if a zero-day exploit somehow causes a node to spend too much time on a certain block.

  • Blocks that exceed the timeout (and the corresponding nodes that sent them) are put on a low-priority 'to do later' list, processed when free CPU cycles are available (or discarded and blocked entirely), etc.

  • A similar real-life example is setting "cpu-priority 0" in XMRig's launch parameters, so that RandomX doesn't DoS your web browsing or office work.

Hope this helps. I'm sorry if I misunderstood the issue.

@tevador (Owner, Author) commented Nov 11, 2022

From what I understand, the core DoS issue here is the time wasted processing malicious blocks (because otherwise we could simply auto-block nodes to solve the issue

Even if you immediately ban the malicious peer, you have already wasted the CPU time needed to calculate the hash, which might be significant. The IP ban can be circumvented by attackers with access to many addresses (botnets etc.).

Monero developers were clearly aware of the DoS potential of the PoW algorithm, hence the harsh penalty for nodes that submit invalid PoW.

For Bitcoin, the PoW is much cheaper to verify and bitcoin nodes have a more lenient policy for misbehavior. They mark misbehaving peers as "discouraged" and only accept them as incoming connections if their inbound peer slots are not fully occupied (source).

So in this case, can we just add a background timeout at the slow nodes to auto-dump blocks that take too long to process, and also block the nodes they come from?

This is a recipe for a chain split. You cannot reject blocks that might be valid.

@MoneroChan commented Nov 12, 2022

You cannot reject blocks that might be valid.

Yes, but my main proposal was: "Blocks that exceed the timeout (and the corresponding nodes that sent them) are put on a low-priority 'to do later' list, processed when free CPU cycles are available."

So there is a fixed amount of CPU time that can be wasted (adjustable, set by the node operator).
Basically a "CPU time validation" step.

Example: if a quantum computer exploits a zero-day vulnerability to maliciously craft a block that takes 100 years to verify,
the Raspberry Pi will spend no more than, say, 2-3x the average block verification time (or whatever the node operator deems suitable)
before moving it to the "to do later when I have free CPU cycles" job list.

This way, the weak Raspberry Pi node only works on it when it has spare CPU cycles and doesn't cause a DoS traffic-jam backlog.

(The "discarded or blocked part" you are worried about, can be made optional; but sooner or later you will eventually have to discard it if the maliciously crafted block takes infinity to solve)

The IP ban can be circumvented by attackers with access to many addresses (botnets etc.).

True. However, if we think about this logically: if every node connected to the Raspberry Pi is part of a malicious botnet, then the poor Raspberry Pi has zero chance of getting a valid block no matter what checks you add to the algorithm, because all links to the actual chain are cut off to begin with.

So, for the sake of a practical simulation of this DoS issue, we will assume there is at least one good node the Raspberry Pi is connected to that on average sends blocks verifiable within an acceptable timeframe, like a bootstrapper.

With this in mind, as I proposed, the Raspberry Pi simply prioritizes blocks from that known good node before all others.
So even if there are 1000 malicious nodes connected to the Pi, sending 1000 malicious blocks crafted with a quantum computer and a zero-day vulnerability that take forever to verify, the little Raspberry Pi isn't affected at all: it works on blocks from the good node first and only processes the 1000 'timed out' malicious blocks using spare CPU cycles.

The key point I'm suggesting is that this timeout method would still protect you in the event of a future, unknown zero-day DoS vulnerability that can't be coded for in advance.

E.g. looking at Bitcoin's 0.20.1 release notes, they may still be vulnerable, because their method requires confirming that a block is invalid before acting on the node.
A zero-day vulnerability could exploit the fact that nodes require this confirmation and leave them "stuck in limbo" trying to determine whether the block is malicious.
My proposal is different and more effective against a zero-day DoS, as it doesn't require confirming that the block is malicious before protection kicks in: it targets the time factor, triaging blocks more efficiently for processing without crashing or DoSing the Raspberry Pi, even if 99% of nodes are malicious.

Does the network already do this? If I'm mistaken about something, please let me know.

@tevador (Owner, Author) commented Oct 15, 2023

Implemented in #265

tevador closed this as completed Oct 15, 2023