Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hashes: Epic re-write #2770

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

tcharding
Copy link
Member

@tcharding tcharding commented May 15, 2024

Re-write the hashes API for fun and profit.

Please note, this removes the ability to display tagged hashes backward.

Resolves the following issues:

Close: #1481
Close: #1552
Close: #1689
Close: #2197
Close: #2377
Close: #2665
Close: #2811

TODO

  • Resolve the QUESTIONS in hashes/examples/*.rs

@github-actions github-actions bot added C-bitcoin PRs modifying the bitcoin crate C-hashes PRs modifying the hashes crate test C-base58 labels May 15, 2024
@apoelstra
Copy link
Member

Overall 6c7e143 looks good. The duplication of the merkle root code I don't like (but I don't think it's necessary ... in fact I suspect we could pull it into the hashes crate and keep it generic since it's not particularly Bitcoin-specific ... but anyway it doesn't need to happen as part of the initial hashes API changes).

We should have some public type aliases for HMAC-SHA256 and HMAC-SHA512 since those are the overwhelmingly most common forms of HMAC.

As for "hash newtypes should not be generic hash types" I am leaning strongly into implementing this by adding visibility specifiers to the hash_newtype! macro which would control how it implements the engine; a pub engine is a "generic hash type" while a pub(crate) or private engine is a hash type that can't be constructed by hashing using the public API.). Or maybe we want a different macro for this. (The hash_newtype macro currently seems to just implement a bunch of byte accessors, and I wonder if it should be renamed to reflect this, and/or whether it could replace impl_array_newtype.)

@tcharding
Copy link
Member Author

Leave it with me, I'm going to go for the least tricky solution that does exactly what we want.

. (The hash_newtype macro currently seems to just implement a bunch of byte accessors, and I wonder if it should be renamed to reflect this, and/or whether it could replace impl_array_newtype.)

Will look into this.

@tcharding
Copy link
Member Author

tcharding commented May 16, 2024

As for "hash newtypes should not be generic hash types" I am leaning strongly into implementing this by adding visibility specifiers to the hash_newtype! macro which would control how it implements the engine; a pub engine is a "generic hash type" while a pub(crate) or private engine is a hash type that can't be constructed by hashing using the public API.).

lol, double lol - look at the diff required to get this functionality, 5 changed lines.

diff --git a/bitcoin/src/blockdata/block.rs b/bitcoin/src/blockdata/block.rs
index 1a660fb8..98074e10 100644
--- a/bitcoin/src/blockdata/block.rs
+++ b/bitcoin/src/blockdata/block.rs
@@ -26,7 +26,7 @@ hashes::hash_newtype! {
     /// A bitcoin block hash.
     pub struct BlockHash(sha256d);
     /// A hash of the Merkle tree branch or root for transactions.
-    pub struct TxMerkleNode(sha256d);
+    pub struct TxMerkleNode(pub(crate) sha256d);
     /// A hash corresponding to the Merkle tree root for witness data.
     pub struct WitnessMerkleNode(sha256d);
     /// A hash corresponding to the witness structure commitment in the coinbase transaction.
diff --git a/bitcoin/src/blockdata/transaction.rs b/bitcoin/src/blockdata/transaction.rs
index 4916e43a..2834be04 100644
--- a/bitcoin/src/blockdata/transaction.rs
+++ b/bitcoin/src/blockdata/transaction.rs
@@ -44,10 +44,10 @@ hashes::hash_newtype! {
     /// versions of the Bitcoin Core software itself, this and other [`sha256d::Hash`] types, are
     /// serialized in reverse byte order when converted to a hex string via [`std::fmt::Display`]
     /// trait operations. See [`hashes::Hash::DISPLAY_BACKWARD`] for more details.
-    pub struct Txid(sha256d);
+    pub struct Txid(pub(crate) sha256d);
 
     /// A bitcoin witness transaction ID.
-    pub struct Wtxid(sha256d);
+    pub struct Wtxid(pub(crate) sha256d);
 }
 impl_hashencode!(Txid);
 impl_hashencode!(Wtxid);
diff --git a/hashes/src/macros.rs b/hashes/src/macros.rs
index 45b53ae1..6a32c935 100644
--- a/hashes/src/macros.rs
+++ b/hashes/src/macros.rs
@@ -128,11 +128,11 @@ macro_rules! hash_newtype {
 
             /// Constructs a new engine.
             #[inline]
-            pub fn engine() -> $hash::Engine { $hash::Hash::engine() }
+            $field_vis fn engine() -> $hash::Engine { $hash::Hash::engine() }
 
             /// Produces a hash froam the current state of a given engine.
             #[inline]
-            pub fn from_engine(e: $hash::Engine) -> Self { Self($hash::Hash::from_engine(e)) }
+            $field_vis fn from_engine(e: $hash::Engine) -> Self { Self($hash::Hash::from_engine(e)) }
 
             /// Copies a byte slice into a hash object.
             #[inline]

@tcharding
Copy link
Member Author

tcharding commented May 16, 2024

We should have some public type aliases for HMAC-SHA256 and HMAC-SHA512 since those are the overwhelmingly most common forms of HMAC.

I've used public functions in the hmac module.

/// Returns an engine for computing `HMAC-SHA256`.
///
/// Equivalent to `hmac::Engine::<sha256::Engine>::new(key)`.
pub fn new_engine_sha256(key: &[u8]) -> Engine<sha256::Engine> {
    Engine::<sha256::Engine>::new(key)
}

/// Returns an engine for computing `HMAC-SHA512`.
///
/// Equivalent to `hmac::Engine::<sha512::Engine>::new(key)`.
pub fn new_engine_sha512(key: &[u8]) -> Engine<sha512::Engine> {
    Engine::<sha512::Engine>::new(key)
}

@tcharding
Copy link
Member Author

tcharding commented May 16, 2024

The duplication of the merkle root code I don't like (but I don't think it's necessary ... in fact I suspect we could pull it into the hashes crate and keep it generic since it's not particularly Bitcoin-specific ... but anyway it doesn't need to happen as part of the initial hashes API changes).

I agree its ugly, before we fix it we should ask how much change we are willing to accept to de-duplicate something that is two exact copies and likely never changes? AFIACT there is no way of doing this generically without an invasive change, hence the question.

Said another way, there is no way of going from a generic type (we only have HashEngine now) to the concrete hash type so this cannot be done generically IIUC.

The quick obvious thing is to use a macro but I don't know if its possible or the implications of calling a macro recursively.

@tcharding tcharding force-pushed the hashes branch 3 times, most recently from f8c33e0 to 5f57240 Compare May 16, 2024 05:05
@tcharding
Copy link
Member Author

tcharding commented May 16, 2024

This isn't going to pass CI because it has a patched manifest to use rust-bitcoin/hex-conservative#90

In case it gets past you, this PR is hot as f**k. Check out the hashes::internal_macros and macros modules - gawd damn.

Massive props @apoelstra for sticking to your guns and refusing to accept the split out of rust-chf - team work makes the dream work.

@apoelstra
Copy link
Member

apoelstra commented May 16, 2024

Said another way, there is no way of going from a generic type (we only have HashEngine now) to the concrete hash type so this cannot be done generically IIUC.

I'll play with it. Fine for now if we have to duplicate the code. But I think/hope we can somehow be generic over the engine.

The quick obvious thing is to use a macro but I don't know if its possible or the implications of calling a macro recursively.

I'd rather add a dummy trait before I added a macro. (Or a new Digest trait which just had a method to construct an engine and nothing else, which I think would be okay since users would almost never need to be aware of it.) But we'll see.

@tcharding
Copy link
Member Author

tcharding commented May 17, 2024

Well that was pretty easy with a fresh set of eyes, I just threw this into the merkle_tree module.

/// Trait used for types that can be hashed into a merkle tree.
pub trait Hash {
    /// Constructs a new [`sha256d`] hash engine.
    fn engine() -> sha256d::Engine { sha256d::Engine::new() }

    /// Produces a hash froam the current state of a given engine.
    fn from_engine(e: sha256d::Engine) -> Self;
}

impl Hash for Txid {
    fn from_engine(e: sha256d::Engine) -> Self { Self::from_engine(e) }
}

impl Hash for Wtxid {
    fn from_engine(e: sha256d::Engine) -> Self { Self::from_engine(e) }
}

And changed the trait bounds on all the functions to T: Hash + Encodable + Copy,

Wallah, the original calculate_ functions work as before.

@tcharding tcharding force-pushed the hashes branch 4 times, most recently from 0a10c43 to 4622bf8 Compare May 17, 2024 00:55
@tcharding
Copy link
Member Author

tcharding commented May 17, 2024

I'm happy with this now, I'll just leave it sitting here until the hex release is done.

I had an idea - although the PR title is hashes: Epic re-write the epic is more just a reference to how much work it took me to get here, the actual changes to the hashes surface API are minor, mainly just removal of the Hash trait in favour of inherent functions. The other changes are only deeper in the crate ie., the HashEngine trait. If we put a nice thorough regression test patch at the front of this PR (for each hash type: hash a hard coded array of bytes and round trip the hard coded hex string for those bytes) then we can basically just YOLO this in and be confident it doesn't introduce bugs. What do you rekon?

@apoelstra
Copy link
Member

The thing to remember is that const associated types are not really part of the type, they are defined outside the type.

I don't understand what you mean by this. Associated types are associated with the type.

@tcharding
Copy link
Member Author

Lets just leave the const thing aside, I can't explain it anyways but I can guarantee that Rust does not implement associated consts to behave how one would think they should behave (at least in my opinion).

Stepping back, I believe the problem is that HMAC (and HKDF) have different requirements of the various hash types than a either a normal user, or an advanced user [0]. We removed the Hash trait to help normal users, leaving the HashEngine trait to support HMAC but this has led to the const N mess.

I believe what we really want is no traits whatsoever for normal users and a single trait for HMAC that defines "a hash that can be used in HMAC-X (and HKDF)" - even though this is all the hash types.

Does that make sense or am I mad?

[0] "advanced" meaning create engine, hash data, hash more data, finalize engine.

@tcharding
Copy link
Member Author

Problem, if a hash is to be used in a merkle tree (eg Txid) we can't both hide the engine stuff and implement a trait since the trait will make the engine stuff public.

@apoelstra
Copy link
Member

I believe what we really want is no traits whatsoever for normal users and a single trait for HMAC that defines "a hash that can be used in HMAC-X (and HKDF)" - even though this is all the hash types.

Does the HMAC trait differ from the "can be used in a Merkle structure" trait or the "can be used to tweak an EC key" trait?

Problem, if a hash is to be used in a merkle tree (eg Txid) we can't both hide the engine stuff and implement a trait since the trait will make the engine stuff public.

This is ok I think -- Txid has a private engine, it can be used in a Merkle tree but only by rust-bitcoin (who defines Txid). And if rust-bitcoin wants to expose some general-purpose "TXID merkle tree" API then it can do that, though I don't think it wants to.

@tcharding
Copy link
Member Author

I believe what we really want is no traits whatsoever for normal users and a single trait for HMAC that defines "a hash that can be used in HMAC-X (and HKDF)" - even though this is all the hash types.

Does the HMAC trait differ from the "can be used in a Merkle structure" trait or the "can be used to tweak an EC key" trait?

Same trait, we need a name for it, on master it is Hash.

Problem, if a hash is to be used in a merkle tree (eg Txid) we can't both hide the engine stuff and implement a trait since the trait will make the engine stuff public.

This is ok I think -- Txid has a private engine, it can be used in a Merkle tree but only by rust-bitcoin (who defines Txid). And if rust-bitcoin wants to expose some general-purpose "TXID merkle tree" API then it can do that, though I don't think it wants to.

Txid cannot have a private engine and also implement Hash because trait methods cannot be private - unless I'm missing something.

@tcharding
Copy link
Member Author

So Hash would stay exactly as it is an master but all implementations would just call through to the inherent functions on each hash type.

@apoelstra
Copy link
Member

Same trait, we need a name for it, on master it is Hash.

Let's go with GeneralHash for now. I suspect we could eliminate it or fold it into HashEngine eventually but let's stop trying to do that for now because it's causing too much grief with the compiler.

Txid cannot have a private engine and also implement Hash because trait methods cannot be private - unless I'm missing something.

It shouldn't implement Hash, only ByteArrayWrapper (or whatever we want to call that). Hash is about general hash functions and Txid is not one.

@tcharding
Copy link
Member Author

Ok, I am going to risk my neck here, I believe you have in your mind associated consts working how a sane person would think they work. They don't, I'll try dig up the stuff I read on it to convince you (and better understand it myself). The ByteArrayWrapper thing 100% doesn't work. There is no way to tie the HashEngine and Hash traits together other than how we currently do it with engine/from_engine (in Hash on master). I'm struggling to describe it but over a month of working on the hashes crate has made me totally convinced - I very well may have to eat my hat though of course. Happy to be proven wrong but not willing to write any more code to check my understanding.

@apoelstra
Copy link
Member

The ByteArrayWrapper thing 100% doesn't work.

Why not?

There is no way to tie the HashEngine and Hash traits together other than how we currently do it with engine/from_engine (in Hash on master).

What is wrong with that?

@apoelstra
Copy link
Member

I pasted my proposed ByteArrayWrapper trait into the playground and it was fine (except for two minor syntax errors). Are you claiming that it is impossible to implement? It is a very simple trait. It does not make any assumptions about how associated constants work.

@apoelstra
Copy link
Member

Ok, it looks like trigger traits don't actually work the way I want them to:

error[E0118]: no nominal type found for inherent implementation
   --> hashes/src/lib.rs:155:1
    |   
155 | impl <T: ByteArrayWrapper> T { 
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ impl requires a nominal type
    |   
    = note: either implement a trait on it or create a newtype to wrap it instead

So I guess we do need macros to define inherent methods. That is super annoying.

@tcharding
Copy link
Member Author

Were does that leave us?

  1. Are we ok to keep the hash_newtype and sha256_hash_newypte macros?

  2. Do you agree that these are mutually exclusive:

  • We provide general/limited wrapper hash types and HMAC/merkle tree/etc users are expected to be generic over HashEngine and have to do the const N dance.
  • hash wrapper types are all general and HMAC and friends are generic over a Hash trait and don't need to worry about the const N thing.

@apoelstra
Copy link
Member

  1. Are we ok to keep the hash_newtype and sha256_hash_newypte macros?

Yeah :/.

  1. Do you agree that these are mutually exclusive:

Yes, but we shouldn't do either.

Here is what I think we should do:

  1. Split the existing Hash trait into two -- one with the engine stuff and one without. I'm split between naming them GeneralHash/Hash or Hash/ByteArrayWrapper. I think we want to keep the name Hash for the one we expect users to use most often, to minimize code breakage. Initially just implement both in hash_type and hash_newtype and other macros. This can be one commit which will have a huge mostly-mechanical diff. You will find that HMAC/HKDF/Merkle all need to be generic over T:GenericHash but the rest can keep using T:Hash.
  2. Change bitcoin/src/merkle_tree/mod.rs calculate_root to no longer be generic over GenericHash but to cover only Txid and WTxid. (Fine to use an auxiliary trait which is narrowly tailored to this purpose.)
  3. Then drop the GenericHash impl in hash_newtype and provide a separate macro for impl'ing GenericHash. My guess is that we won't actually want to call this macro anywhere, since we'll only want to impl GenericHash for the hashes defined in the hashes crate itself.

Then take a breather and we'll revisit HashEngine and see if that trait needs to be modified.

I got about halfway through step 1 of this and it was quite cathartic to see how different parts of the codebase became explicit about whether they wanted to hash arbitrary data or whether they just wanted to deal with prepackaged hashes.

@apoelstra
Copy link
Member

The resulting codebase will let people keep using HKDF and HMAC and merkle trees with the "base hashes" but they won't be able to use these constructions with Txids or Sighashes or anything else which isn't supposed to be a general hash. (This is, I believe, Kix's original vision when he opened the "split the hash trait up" issue forever ago.)

@tcharding
Copy link
Member Author

Aight, sounds like a plan. I need to take some time away from hashes to regain my composure. Hopefully won't take too long. I'm a sensitive flower at the moment, not exactly sure why.

@Kixunil
Copy link
Collaborator

Kixunil commented Jun 6, 2024

directly working with hashengines is sufficiently advanced that people can just import the trait.

Chunked feeding of data to an engine is pretty common though. Perhaps at least those methods could be exposed.

I think it's all of them.

IIRC we have non-cryptographic hashes in the crate and using HMAC with them is at least suspicious. If you think it's not worth a marker trait I can understand that.

I'd just drop the *_raw_hash methods.

IIRC I did see a case when there was a good reason to have it but I think it was old taproot stuff that currently uses sane strong types. IMO the casts are reasonably explicit.

I think that every hash newtype should get its own engine newtype as well.

That seems excessive to me, but not strictly wrong. You can still accidentally feed some data into wrong engine but at least you can't assign an engine to an incompatible one. I don't think people are assigning engines around too much. They just use them to compute the hashes.

Though I could see a benefit of having .finalize() method.

So maybe we don't need macros at all.

I believe inherent const methods are super useful and we need macros for that.

trying to convince the compiler that <E as HashEngine::Digest as str::FromStr:>:Err does in fact implement Display

Just add a bound?

It is a tiiiny bit tempting to move ByteArrayWrapper into internals and use it in rust-secp for the crypto types.

I'd like to see that but internals is not the right crate. At this point I'm considering giving up my aversion towards From<[u8; N]> for HashNewType and just use that + AsRef.

if a hash is to be used in a merkle tree (eg Txid) we can't both hide the engine stuff and implement a trait since the trait will make the engine stuff public.

Not really. Hashing two txids into a merkle node doesn't produce a Txid but a merkle node. Merkle node could have engine exposed but we can simply internally use raw sha256. So perhaps this is another good example of needing to publish raw hash type.

This is, I believe, Kix's original vision when he opened the "split the hash trait up" issue forever ago.

Yes, and they can still do exotic things explicitly by going through byte arrays. That's what I wanted.

@apoelstra
Copy link
Member

Chunked feeding of data to an engine is pretty common though. Perhaps at least those methods could be exposed.

Yes, pretty common but it still seems uncommon enough that it's okay for users to need to import a trait to do it. But agreed, just like everything we should have inherents for this.

IIRC we have non-cryptographic hashes in the crate and using HMAC with them is at least suspicious. If you think it's not worth a marker trait I can understand that.

Yeah :/ agreed it may be strictly better to have a marker trait, but I continue to think it's not worth it. I think siphash is the only non-cryptographic hash in this crate (other than sha1 which needs to be used in cryptographic contexts for legacy/compatibility reasons, even though I would definitely not recommend its use as a crypto hash).

That seems excessive to me, but not strictly wrong. You can still accidentally feed some data into wrong engine but at least you can't assign an engine to an incompatible one. I don't think people are assigning engines around too much. They just use them to compute the hashes.

Though I could see a benefit of having .finalize() method.

Yeah, exactly. I'd like to have a finalize method and this means that each hash engine needs to have a unique hash that it finalizes to.

I believe inherent const methods are super useful and we need macros for that.

Yeah :( earlier in this thread I thought there was a way with "trigger traits" that we could do this without macros. It turns out that you can use these to implement inherent methods on Vec<T> for example... but not T itself.

Not really. Hashing two txids into a merkle node doesn't produce a Txid but a merkle node. Merkle node could have engine exposed but we can simply internally use raw sha256. So perhaps this is another good example of needing to publish raw hash type.

Yeah, great points. Need to think about what our Merkle tree computation function should look like then.. unfortunately we need to resolve this as step 1 of this project (well, maybe step 2, after just splitting Hash into two traits but always implementing both). Because all our breaking API decisions here affect what a generic Merkle tree computation will look like. Maybe the answer is that we don't want a generic function...we want something that can take either Txid or Wtxid and computes a merkle root in an ad-hoc way which is used in Bitcoin blockheader computations.

@apoelstra
Copy link
Member

I think we should open a new PR since github has started hiding comments on this one.

@apoelstra
Copy link
Member

I also think that "new PR" should be one that just splits up Hash into two traits (and always implements both) since I think that's a clear step forward we can all agree on, including the compiler. And possibly moves/copies some methods out of both traits and into inherents.

@Kixunil
Copy link
Collaborator

Kixunil commented Jun 7, 2024

I think we can provide generic merkle tree over non-wrapped hashes (maybe in hashes) and then just have a facade that uses Txid Wtxid and Taproot.

@apoelstra
Copy link
Member

I think you're right. That's the approach I'd try first.

@tcharding
Copy link
Member Author

Thanks for the input fellas, just wanted to say that I read everything (multiple times) but the PRs that fell out today may not seem like I did. There are six hundred things going on, not exactly sure how to pull everything together. (I did read and grok you list in #2770 (comment) @apoelstra, but I went a different way - no disrespect intended).

@tcharding
Copy link
Member Author

tcharding commented Jun 14, 2024

There has been so much discussion on the hashes crate and so much back and forth that I do not know exactly where we are trying to get to or what we are trying to solve. Also, I don't know what is the objection to the current hmac module presented in this PR. I my opinion this PR, in its current state, fixes all the issues linked to as claimed. I agree the const N stuff is ugly but apart from just ugly I don't know what is the objection, could you please re-state it @apoelstra?

I believe the correct observation from @Kixunil that a past state did not correctly tie the Hmac to the hash engine was correct (we used to only have the const N generic) but now hmac::Hash is generic over the engine also.

(Note that this PR should be re-done on top of what ever is decided for the merkle tree stuff.)

There are a few niggling problems with the current `hashes` API that we
would like to improve:

- Its not that discoverable
- It is annoying to have to import the `Hash` trait all the time.
- New hash types (created with `hash_newtype`) are general purpose but
  we'd like to hide this from users of `rust-bitcoin`.
- The API of the `hmac` module is different to the other hash types.
- The `hash_newtype` and `sha256t_hash_newtype` macros are a bit hard
  to read and work on

This re-write of the `hashes` API is an attempt to address all these
issues. Note that the public API surface of `hashes` does not change all
that much.
@apoelstra
Copy link
Member

I agree the const N stuff is ugly but apart from just ugly I don't know what is the objection, could you please re-state it @apoelstra?

It needs to be generic over the hash type. If it is only generic over N (which will basically always be 32 and therefore isn't much of a generic at all) then it loses type safety vs the current API.

If it's now generic over both N and the hash engine I'll take another look, once we have the other hashes PRs merged. But it's not obvious to me that it needs to be generic over both. The HashEngine trait should have enough information, and if it doesn't, we should extend it..

@tcharding
Copy link
Member Author

But it's not obvious to me that it needs to be generic over both.

If you can get rid of it then I'll be happy to see how that was achieved, as far as I can tell you are going to hit all the associated const problems I've been hitting - its a language limitation. But by all means please do try to solve it if you think I've missed something.

If it's now generic over both N and the hash engine I'll take another look

So we've been having API design discussions, with you giving me directives of things to explore, and ways you want it done, and you never looked at the latest state. That is pretty frustrating man. Even when I repeatedly said that we are hitting language limitations and that associated const don't "just work" as one would expect.

There are so many problems with hashes, as shown by the list of issues this PR claims to fix, I don't think the strategy of piecemeal improvements is a good one. The crate is an absolute punish to work on, that is why the only way I was able to almost maintain my sanity was just to cut it up wildly. Since we have regression tests for all hashes it is ok I believe to merge a single massive PR. Then we go over it again with a fine tooth combe to make sure the public API is as we want it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-base58 C-bitcoin PRs modifying the bitcoin crate C-hashes PRs modifying the hashes crate doc test
Projects
None yet
3 participants