Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues with Bytes literals #3954

Open
ceedubs opened this issue May 2, 2023 · 1 comment
Open

Performance issues with Bytes literals #3954

ceedubs opened this issue May 2, 2023 · 1 comment

Comments

@ceedubs
Copy link
Contributor

ceedubs commented May 2, 2023

In general large Bytes values seem to be fine, but literal values of Bytes have bad performance and result in excessive resource usage that tends to get the Unison process killed by the OS.

Here is a fairly minimal way to reproduce:

  • Start up Unison with a fresh codebase: ❯ unison -C $(mktemp -d)

  • Open a scratch.u file and save the following (you don't need to add it):

    main : '{Exception, IO} Bytes
    main = do
      splitmix 42 do
        Random.bytes 100000
    
  • .> run main

  • .> add.run largeBytes

This all happens fairly quickly. Now I open my scratch file, add the watch expression > Bytes.size largeBytes and save. My ucm process hangs for a while and then gets killed by my OS.

@dolio and I talked about this a bit, and he said that Bytes literals don't have a dedicated representation in the AST and that they end up getting inefficiently encoded as [Nat] where each Nat represents a single Byte. It sounds like there are even further issues with excessive stack frames for large Bytes literals.

@dolio
Copy link
Contributor

dolio commented May 2, 2023

Yeah, the problem is that e.g. 0xs0123 is actually something like fromList [0x01, 0x23] which then turns into (pseudocode):

let x0u = 0x01u -- unboxed number
    x0 = Nat x0u
    x1u = 0x23u
    x1 = Nat x1u
    l = mkList x0 x1
 in Bytes.fromList l

So, there's a massive blowup from that representation. Not only is every byte represented as 8 bytes, but code has to be emitted to build the representation, and put it in a sequence. So there's a huge per byte overhead on the code representation even.

The temporary workaround I suggested is to use a string literal of the hex digits, because the code emitted for strings is much more compact. It'd be nice if that were the actual representation of bytes literals, or if they were directly represented. I guess code could be written to detect the same fromBytes [...] with all literal numbers in the compiler, but that's additional work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants