Investigate different hashes #13

Open · laurentsimon opened this issue Aug 25, 2023 · 12 comments

@laurentsimon (Collaborator) commented Aug 25, 2023

E.g. https://www.blake2.net (reduced rounds), https://github.com/BLAKE3-team/BLAKE3

@laurentsimon (Collaborator, Author) commented May 15, 2024

BLAKE3 may be similar to what we've implemented - see comments in sigstore/sigstore-python#1018 (comment).
We would need to package the Rust or C implementation via py bindings, I think. Is there an existing (cross-platform) package doing this?

@laurentsimon (Collaborator, Author)

There seems to be a package available: https://pypi.org/project/blake3/
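For reference, a minimal sketch of hashing a file with that package (a sketch only, assuming the blake3 package from PyPI is installed; its hasher mirrors the hashlib interface):

    # Sketch, assuming "pip install blake3"; the hasher mimics hashlib.
    import blake3

    def blake3_file_digest(path: str, chunk_size: int = 1 << 20) -> str:
        """Hash a file incrementally and return its hex digest."""
        hasher = blake3.blake3()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                hasher.update(chunk)
        return hasher.hexdigest()

The package also advertises multithreaded hashing, which would be worth benchmarking against our current parallelization; the exact knobs would need to be checked against its documentation.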

@laurentsimon (Collaborator, Author)

One thing to be aware of is that BLAKE2 is not a NIST hash, and BLAKE3 seems to be based on a similar idea (a reduced number of rounds in the compression function). It also fixes the shard size to 1 KB, which we would likely want to be parameterizable in practice.
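To illustrate the parameterizable-shard point, here is a rough sketch (not this repo's actual code) of hashing a file in shards of configurable size with a standard hashlib digest, so each shard could be handed to a separate worker:

    # Rough sketch (not the repo's actual implementation): hash a file in
    # independent shards of configurable size; each shard's digest could be
    # computed by a separate worker in a parallel version.
    import hashlib

    def shard_digests(path: str, shard_size: int = 1_000_000_000) -> list[bytes]:
        """Return one SHA-256 digest per shard_size-byte shard of the file."""
        digests = []
        with open(path, "rb") as f:
            while shard := f.read(shard_size):
                digests.append(hashlib.sha256(shard).digest())
        return digests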

@smeiklej (Collaborator)

When we benchmarked model signing for different model sizes, disk performance became the bottleneck quite quickly, so my sense is that changing the hash function would have little to no impact.
(Graph is here if you're curious: signing_runtime.pdf.)

That being said, supporting different hash functions is good for the sake of modularity. But that seems like a different goal from what you're talking about here.

@laurentsimon (Collaborator, Author)

How do you know disk performance was the bottleneck? It could be that the number of vCPUs was the bottleneck, since we parallelize by dispatching computation on each vCPU. Is there another graph showing that disk IO was the bottleneck?

@smeiklej (Collaborator)

That's fair, we don't know for sure (for reference, though, the machine had 48 vCPUs). But even if the bottleneck is the number of vCPUs, I still don't see why a different hash function would affect performance.

@laurentsimon (Collaborator, Author)

If the hash could compute each shard faster on each vCPU, we'd get a lower plateau (first part of the graph) and a slower linear increase (second part of the graph). Of course, memory also comes into play and could become the bottleneck too :)
What do you think?

@smeiklej (Collaborator)

One thing I forgot to mention: you can see on the graph that the results are very similar across M2 and M3 (and identical at the end), despite the fact that M3 has twice the number of cores. So this again suggests that the bottleneck is disk IO.

@mihaimaruseac has a graph of memory usage showing that it really never got that high (4-8 GB of RAM, if I remember correctly), so I think that, unless we seriously increase the shard size, this is unlikely to become a bottleneck.

@mihaimaruseac (Collaborator)

> How do you know disk performance was the bottleneck? It could be that the number of vCPUs was the bottleneck, since we parallelize by dispatching computation on each vCPU. Is there another graph showing that disk IO was the bottleneck?

Here are the numbers for the serial implementation (sizes in bytes, times in seconds):

           size   time cold   mean time      stddev
           8192     0.00121     0.00023     0.00016
          16384     0.00114     0.00046     0.00023
          32768     0.00121     0.00019     0.00000
          65536     0.00149     0.00037     0.00018
         131072     0.00159     0.00044     0.00018
         262144     0.00224     0.00054     0.00010
         524288     0.00417     0.00095     0.00009
        1048576     0.00659     0.00157     0.00011
        2097152     0.00954     0.00288     0.00035
        4194304     0.01621     0.00456     0.00008
        8388608     0.03048     0.00810     0.00070
       16777216     0.05604     0.01775     0.00066
       33554432     0.10647     0.03329     0.00088
       67108864     0.23366     0.06219     0.00093
      134217728     0.40839     0.12064     0.00082
      268435456     0.83653     0.23720     0.00126
      536870912     1.68094     0.46663     0.00408
     1073741824     3.43046     0.83006     0.00222
     2147483648     6.42181     1.53563     0.00295
     4294967296    12.99146     3.06277     0.00592
     8589934592    26.38218     6.23468     0.01291
    17179869184    52.69675    12.24823     0.01216
    34359738368   104.40937    24.60314     0.02284
    68719476736   209.93898    49.40612     0.08544
   137438953472   413.53302    98.65482     0.10862
   274877906944   835.76904   831.67746     2.00515
   549755813888  1701.84040  1693.82841     6.66369
  1099511627776  3359.77479  3334.71729     4.79792

Doubling the model size approximately doubles the time here. Now let's look at the optimized (parallel) implementation:

           size   time cold   mean time      stddev
           8192     0.05316     0.05220     0.00049
          16384     0.05487     0.05160     0.00094
          32768     0.05497     0.05160     0.00060
          65536     0.05344     0.05194     0.00048
         131072     0.05247     0.05176     0.00051
         262144     0.05273     0.05190     0.00065
         524288     0.05627     0.05356     0.00069
        1048576     0.05863     0.05388     0.00041
        2097152     0.05962     0.05443     0.00133
        4194304     0.06749     0.05538     0.00044
        8388608     0.08282     0.05992     0.00061
       16777216     0.10344     0.06661     0.00092
       33554432     0.15928     0.08324     0.00070
       67108864     0.25431     0.10588     0.00083
      134217728     0.45653     0.15577     0.00079
      268435456     0.88208     0.25272     0.00113
      536870912     1.70203     0.45228     0.00157
     1073741824     3.44713     0.83517     0.00257
     2147483648     3.42376     0.84252     0.00330
     4294967296     3.96692     0.87051     0.00325
     8589934592     7.30940     0.91070     0.01196
    17179869184    14.11854     1.10208     0.02410
    34359738368    27.75890     1.37019     0.02251
    68719476736    55.51175     2.83477     0.04802
   137438953472   110.55379     5.02267     0.07931
   274877906944   226.84960   219.26083     0.01333
   549755813888   442.62415   441.70884     0.01044
  1099511627776   878.59498   868.93802     6.38893

See the huge jump at the end; it is caused by disk reads. This was confirmed by watching the process with htop during hashing: it was in the D (uninterruptible sleep) state most of the time.

RAM usage was never more than 4 GB (including the Python interpreter, etc.).
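For what it's worth, the same observation could be scripted rather than eyeballed in htop; a sketch assuming the psutil package is available (psutil reports the D state as STATUS_DISK_SLEEP):

    # Sketch, assuming the psutil package: periodically sample a hashing
    # process and report the fraction of samples spent in D state
    # (uninterruptible disk sleep).
    import time
    import psutil

    def disk_sleep_fraction(pid: int, samples: int = 100, interval: float = 0.1) -> float:
        """Fraction of samples in which the process was in D state."""
        proc = psutil.Process(pid)
        blocked = 0
        for _ in range(samples):
            if proc.status() == psutil.STATUS_DISK_SLEEP:
                blocked += 1
            time.sleep(interval)
        return blocked / samples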

@laurentsimon (Collaborator, Author) commented May 16, 2024

Thanks. So after 150 GB or so, it's better to store the model files on different SSDs, correct?

Also, for files smaller than about 100 MB, naive hashing is faster (I suspect because there's no thread/process scheduling contention), correct?

@mihaimaruseac (Collaborator)

I think so.

I'm actually planning to add microbenchmarks as part of the API redesign. I'll hopefully send a PR today (I need to do the same internally, so it depends on how the CL reviews go).

@laurentsimon (Collaborator, Author) commented May 20, 2024

A few other things to report in the graph / benchmark:

  1. How many processes are created and running in parallel, and, if possible, how often they are waiting on IO. One motivation for reporting this is to understand the sudden jump for model sizes > 274877906944 B. One hypothesis is that the library sees CPU-intensive tasks being run and decides to start more processes; they then end up competing for disk reads. A solution could be to limit the maximum number of processes/threads created.
  2. What is the memory used on the machine, maybe via /proc/slabinfo or /proc/meminfo (I think top only reports per-process memory)? The performance for model sizes > 274877906944 B is roughly that of the cold benchmarks. One hypothesis is that the page cache is full, which causes a high rate of cache misses; one workaround would be to increase the RAM on the machine. (A sketch of sampling /proc/meminfo follows this list.)
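A sketch of the /proc/meminfo sampling mentioned in point 2, assuming a Linux machine (field names are as they appear in /proc/meminfo):

    # Sketch: report page-cache and available-memory figures from /proc/meminfo
    # (Linux only) next to each benchmark run.
    def read_meminfo(fields=("MemAvailable", "Cached", "Dirty")) -> dict[str, int]:
        """Return the selected /proc/meminfo entries, in kB."""
        values = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                if key in fields:
                    values[key] = int(rest.split()[0])  # value is reported in kB
        return values

    print(read_meminfo())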
