Skip to content

Conversation

@lauzadis
Copy link
Contributor

@lauzadis lauzadis commented Dec 2, 2022

Adds HashingSource and HashingSink, which hash their data as they are being read and written, respectively.

Issue #

Description of changes

This change is required to generalize the CrcSource to support any hashing function. It is planned to be used in the flexible checksums implementation.

Note on digestValue()

Crc32 and Crc32C have an additional function called digestValue() which returns an unsigned int. To keep this function but also provide generic support for any hashing function, the HashingSource and HashingSink provide access to the underlying hash function as a member variable called hash.

With this, the user can cast the hash function to Crc32 and call the digestValue() function.

Example

var hs = HashingSource(Crc32(), ByteArray(1024) { 0xf }.source())  // CRC32 hashing source with 1024 bytes
val sink = SdkBuffer()

hs.read(sink, 1024)

val digestValue = (hs.hash as Crc32).digestValue()

An alternative design would be to create a class like CrcHashingSource which implements HashingSource and also provides digestValue(), but I thought the first option was better.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@lauzadis lauzadis marked this pull request as ready for review December 2, 2022 16:10
@lauzadis lauzadis requested a review from a team as a code owner December 2, 2022 16:10
commonMain {
dependencies {
implementation(project(":runtime:utils"))
implementation(project(":runtime:io"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: This adds a dependency on I/O, which seems like a potentially unnecessary coupling -- would this HashingSink and HashingSource be better located in the io module?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe they'd be better off in io.

* @param hash The [HashFunction] to hash data with
* @param sink the [SdkSink] to hash
*/
public class HashingSink(public val hash: HashFunction, sink: SdkSink) : HashFunction by hash, SdkSinkObserver(sink) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness: Why implement HashFunction? Is there a use case for calling update/reset on a sink? Will we ever pass a HashingSink as an argument to ByteArray.hash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no particular reason, and you're right, there are no strong use cases for directly calling update / reset on this sink, so I'll remove this implementation

commonMain {
dependencies {
implementation(project(":runtime:utils"))
implementation(project(":runtime:io"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe they'd be better off in io.

Comment on lines 16 to 20
public class HashingSink(public val hash: HashFunction, sink: SdkSink) : HashFunction by hash, SdkSinkObserver(sink) {
override fun observe(data: ByteArray, offset: Int, length: Int) {
update(data, offset, length)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness: This largely duplicates the private class in Canonicalizer.kt. We should remove that and update the surrounding code to use this.

Comment on lines 15 to 45
@Test
fun testHashingSinkDigest() = run {
val byteArray = ByteArray(1024) { 0xf }
val buffer = SdkBuffer()
buffer.write(byteArray)

val hashingSink = HashingSink(Crc32c(), SdkSink.blackhole())

val expectedHash = Crc32c()

assertEquals(expectedHash.digest().decodeToString(), hashingSink.digest().decodeToString())
hashingSink.write(buffer, buffer.size)
expectedHash.update(byteArray)
assertEquals(expectedHash.digest().decodeToString(), hashingSink.digest().decodeToString())
}

@Test
fun testCrcSinkDigest() = run {
val byteArray = ByteArray(1024) { 0xf }
val buffer = SdkBuffer()
buffer.write(byteArray)

val hashingSink = HashingSink(Crc32(), SdkSink.blackhole())

val expectedHash = Crc32()

assertEquals(expectedHash.digest().decodeToString(), hashingSink.digest().decodeToString())
hashingSink.write(buffer, buffer.size)
expectedHash.update(byteArray)
assertEquals(expectedHash.digest().decodeToString(), hashingSink.digest().decodeToString())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Are these two tests identical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They weren't initially, but I refactored and they ended up being basically the same, just using a different hash function. I will remove the duplication

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my eyes missed the Crc32 vs Crc32c. That's great to test both, maybe just refactor the test itself into a helper method that accepts a hash implementation.

* @param hash The [HashFunction] to hash data with
* @param sink the [SdkSink] to hash
*/
public class HashingSink(public val hash: HashFunction, sink: SdkSink) : HashFunction by hash, SdkSinkObserver(sink) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment: Making the underlying hash publicly visible is inadvisable since that means its state can be mutated by object holders. We should probably make it private instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or use a HashSupplier instead.

Copy link
Contributor Author

@lauzadis lauzadis Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I made this public was for the digestValue() use case for CRC hash functions. I'm not sure how to nicely provide this function if the underlying hash is private, but I understand the problem with mutation by object holders.

From what I see, HashSupplier wouldn't be great here because it returns a new HashFunction each invocation, while we want to use the same HashFunction object to get the hash value of the entire underlying Source/Sink.

For digestValue(), I could do this:

public class HashingSource(private val hash: HashFunction, source: SdkSource) : SdkSourceObserver(source) {
    override fun observe(data: ByteArray, offset: Int, length: Int) {
        hash.update(data, offset, length)
    }

    public fun digestValue() : UInt? = (hash as Crc32?)?.digestValue()
}

But it seems too special-cased to CRC32 to be included in a generic HashingSource

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a single unsigned-int digest value is very specific to CRC32. We shouldn't include it in anything that utilizes generic hash functionality. How will this be used eventually outside of things like tests?

Copy link
Contributor Author

@lauzadis lauzadis Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's being used for event steams today. I'm adding digestValue() only to support event streams and change the usage of the original CrcSource implementation to this new one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the event streams code need to treat the CRC as a UInt? Could it just as easily use byte arrays?

Copy link
Contributor Author

@lauzadis lauzadis Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible. Event streams sends the expected checksum value as UInt, so we need to either

  1. compare using digestValue()'s UInt (how it's done today)
  2. convert the given expected checksum from UInt to ByteArray, and compare using digest() instead

I didn't want to make too many changes in event streams besides replacing the CrcSource/Sink that was used there, which is why I went with option 1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, we decided to use just digest()/ByteArray everywhere for hash functions including CRC32. We'll remove specialized digestValue() members for the source/sink.

Comment on lines 15 to 47
@Test
fun testHashingSourceDigest() = run {
val byteArray = ByteArray(1024) { 0xf }
val source = byteArray.source()
val hashingSource = HashingSource(Crc32c(), source)

val sink = SdkBuffer()

val expectedHash = Crc32c()
assertEquals(expectedHash.digest().decodeToString(), hashingSource.digest().decodeToString())

hashingSource.read(sink, 1024)
expectedHash.update(byteArray)

assertEquals(expectedHash.digest().decodeToString(), hashingSource.digest().decodeToString())
}

@Test
fun testCrcSourceDigest() = run {
val byteArray = ByteArray(1024) { 0xf }
val source = byteArray.source()
val hashingSource = HashingSource(Crc32(), source)

val sink = SdkBuffer()

val expectedHash = Crc32()
assertEquals(expectedHash.digest().decodeToString(), hashingSource.digest().decodeToString())

hashingSource.read(sink, 1024)
expectedHash.update(byteArray)

assertEquals(expectedHash.digest().decodeToString(), hashingSource.digest().decodeToString())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Identical?

Copy link
Contributor

@aajtodd aajtodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix and ship

* @param hash The [HashFunction] to hash data with
* @param sink the [SdkSink] to hash
*/
public class HashingSink(public val hash: HashFunction, sink: SdkSink) : HashFunction by hash, SdkSinkObserver(sink) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably make these InternalApi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything I would create a CrcSource/Sink instead of trying to force them through HashingSource/Sink. It's going to look nearly identical but it won't be awkwardly special casing the generic hashing source/sink

* @param hash The [HashFunction] to hash data with
* @param sink the [SdkSink] to hash
*/
public class HashingSink(private val hash: HashFunction, sink: SdkSink) : SdkSinkObserver(sink) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@InternalApi

* Provides the digest value as an unsigned integer for the CRC32 function family.
* @return unsigned integer representing the value of the digest, if the [HashFunction] is [Crc32] or [Crc32c], and null otherwise.
*/
public fun digestValue() : UInt? = (hash as Crc32?)?.digestValue()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either special case OR move this to an internal package as an extension function (would require making hash internal rather than private

Copy link
Contributor

@ianbotsf ianbotsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Copy link
Contributor

@aajtodd aajtodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix and ship

@ParameterizedTest
@ValueSource(strings = ["crc32", "crc32c", "md5", "sha1", "sha256"])
fun testHashingSinkDigest(hashFunctionName: String) = run {
val byteArray = ByteArray(1024) { 0xf }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: let's use something that forces okio to cross segment boundaries, something odd like 19k would be 2 segments + a partial. Same for all source/sink tests

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@lauzadis lauzadis merged commit cee4330 into main Dec 12, 2022
aajtodd pushed a commit that referenced this pull request Feb 8, 2023
aajtodd pushed a commit that referenced this pull request Feb 13, 2023
@lauzadis lauzadis deleted the feat-hashing-io branch February 22, 2023 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants