Experiment with a hybrid bitfield + range encoding for Span / DefId. #53560

Open
eddyb opened this issue Aug 21, 2018 · 4 comments
Labels
A-incr-comp Area: Incremental compilation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@eddyb
Member

eddyb commented Aug 21, 2018

Roughly, if you have a "container" (file/crate/etc.), and sequential indices in it:

  • you can use (container_index, intra_container_index) (but that takes 2x space)
  • you can split an integer's bitwidth into two bitfields, one for each half of the pair above
    • the point where you choose to split is a tradeoff and you can run out of either half
  • you can split an integer's range, with each container having its sequential range
    • Span does this currently, where the files are effectively "concatenated"
    • requires binary search to translate into the pair representation
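For comparison, the range-splitting option can be sketched as below. This is only an illustration of the idea, not rustc's actual code: `lookup_file` and the `starts` table are hypothetical names, and the table is assumed to be sorted with one entry per file.

```rust
/// Hypothetical lookup for the "concatenated files" scheme.
/// `starts[i]` is the first global index that belongs to file `i`;
/// the slice is assumed to be sorted in ascending order.
fn lookup_file(starts: &[u32], pos: u32) -> Option<(usize, u32)> {
    // Binary search: count the files whose start is <= pos.
    let n = starts.partition_point(|&s| s <= pos);
    if n == 0 {
        // `pos` lies before the first file.
        return None;
    }
    let file = n - 1;
    // Pair representation: (container_index, intra_container_index).
    Some((file, pos - starts[file]))
}
```

Every translation costs O(log n) in the number of files, which is the cost the chunked scheme tries to avoid.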

An improvement on all of those is to choose an arbitrary chunk size (e.g. 2^17 = 128kB for files), and then split each container into a number of chunks (ideally just 1 in the common case).
You can then use bitfields for (chunk, intra_chunk_index) (e.g. 15 and 17 bits of u32).
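A minimal sketch of the 15 + 17 bit split mentioned above, assuming the chunk occupies the high bits. The `ChunkedIndex` type and its method names are made up for illustration.

```rust
const INTRA_BITS: u32 = 17; // 2^17 indices per chunk
const CHUNK_BITS: u32 = 32 - INTRA_BITS; // 15 bits left for the chunk number
const INTRA_MASK: u32 = (1 << INTRA_BITS) - 1;

/// Hypothetical packed (chunk, intra_chunk_index) pair in a single u32.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ChunkedIndex(u32);

impl ChunkedIndex {
    /// Returns None if either half does not fit in its bitfield.
    fn new(chunk: u32, intra: u32) -> Option<Self> {
        if chunk >= (1 << CHUNK_BITS) || intra > INTRA_MASK {
            return None;
        }
        Some(ChunkedIndex((chunk << INTRA_BITS) | intra))
    }

    /// High 15 bits: which chunk the index falls in.
    fn chunk(self) -> u32 {
        self.0 >> INTRA_BITS
    }

    /// Low 17 bits: the position within that chunk.
    fn intra(self) -> u32 {
        self.0 & INTRA_MASK
    }
}
```

Because the chunk sits in the high bits, packed values still sort in the same order as the (chunk, intra_chunk_index) pairs.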

The difference is that translating chunk to container doesn't need a binary search: the number of chunks is several orders of magnitude smaller than the index space as a whole, so a plain array indexed by chunk is enough.

That is, chunk -> container can be an array, but also, if there is per-container data that would be accessed through chunk, we can optimize that by building a chunk -> Rc<ContainerData> array.

Translating intra_chunk_index to intra_container_index is similarly easy: once you can look up the per-container data, you subtract the container's overall start from the index (assuming each container is a contiguous range of chunks).
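The two lookups could look roughly like this. It is a sketch under assumptions: `ChunkTable`, `push`, `resolve` and the fields of `ContainerData` are hypothetical, and each container is given a contiguous run of chunks.

```rust
use std::rc::Rc;

const INTRA_BITS: u32 = 17;

/// Hypothetical per-container data (e.g. one source file).
struct ContainerData {
    name: String,
    /// First chunk of the container's contiguous chunk range.
    first_chunk: u32,
}

/// chunk -> Rc<ContainerData>, one array slot per chunk.
struct ChunkTable {
    by_chunk: Vec<Rc<ContainerData>>,
}

impl ChunkTable {
    fn new() -> Self {
        ChunkTable { by_chunk: Vec::new() }
    }

    /// Append a container holding `len` indices; it gets at least one chunk.
    fn push(&mut self, name: &str, len: u32) -> Rc<ContainerData> {
        let full = len >> INTRA_BITS;
        let partial = (len & ((1 << INTRA_BITS) - 1) != 0) as u32;
        let chunks = (full + partial).max(1);
        let data = Rc::new(ContainerData {
            name: name.to_string(),
            first_chunk: self.by_chunk.len() as u32,
        });
        for _ in 0..chunks {
            self.by_chunk.push(Rc::clone(&data));
        }
        data
    }

    /// Packed index -> (container, intra_container_index), with no binary search.
    /// Note: this does not check that the result is within the container's length.
    fn resolve(&self, packed: u32) -> Option<(Rc<ContainerData>, u32)> {
        let chunk = packed >> INTRA_BITS;
        let data = self.by_chunk.get(chunk as usize)?;
        // Subtract the container's overall start in the index space.
        let start = data.first_chunk << INTRA_BITS;
        Some((Rc::clone(data), packed - start))
    }
}
```

With 15 chunk bits the array has at most 32768 entries, so storing an `Rc` per chunk stays cheap.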


Another reason this might be useful is translating (a unified) DefId or Span between crates or between incremental (re)compilation sessions - we can have a bitset of changed chunks: if a chunk is unchanged, the index is identical, otherwise we can have an intra-chunk/container binary search for changed ranges (or just a map of changes).
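The changed-chunk bitset could be sketched like this. `ChangedChunks` and `translate` are hypothetical names, and `remap_slow` is a stand-in for whatever per-chunk binary search or change map ends up being used.

```rust
const INTRA_BITS: u32 = 17;

/// Hypothetical bitset with one bit per chunk; a set bit means "changed".
struct ChangedChunks {
    words: Vec<u64>,
}

impl ChangedChunks {
    fn new(num_chunks: usize) -> Self {
        ChangedChunks { words: vec![0; (num_chunks + 63) / 64] }
    }

    /// Mark a chunk as changed. Panics if `chunk` is out of range.
    fn mark(&mut self, chunk: u32) {
        self.words[(chunk / 64) as usize] |= 1u64 << (chunk % 64);
    }

    /// Chunks outside the bitset are treated as unchanged.
    fn is_changed(&self, chunk: u32) -> bool {
        self.words
            .get((chunk / 64) as usize)
            .map_or(false, |&w| ((w >> (chunk % 64)) & 1) == 1)
    }
}

/// Translate an index from the previous session into the current one.
/// Fast path: an unchanged chunk keeps the index as-is.
/// Slow path: `remap_slow` handles indices in changed chunks.
fn translate(
    changed: &ChangedChunks,
    old: u32,
    remap_slow: impl Fn(u32) -> Option<u32>,
) -> Option<u32> {
    if changed.is_changed(old >> INTRA_BITS) {
        remap_slow(old)
    } else {
        Some(old)
    }
}
```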

We can grow the number of indices within the last chunk of a container, and if we run out of space, we can relocate the container's chunks without a significant cost. Alternatively, another tradeoff we can make is to fragment a container's chunks.


The first step in experimenting with this would have to be to take Span, and round up the start/end of each file's range to a multiple of a power of 2 (e.g. 2^17 - but an optimal value would require gathering some real-world file-size statistics).
This way we can see if there's a negative performance impact from having unused gaps in the index space; everything else should be an improvement.
We can also try to replace the binary searches to find the SourceFile a Span is from.
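A rough sketch of that first step, assuming a chunk size of 2^17. `round_up` and `layout` are hypothetical helpers; reserving one extra position per file (so the end-of-file position stays inside the file's own chunks) is an assumption of this sketch, not something stated above.

```rust
const CHUNK: u32 = 1 << 17;

/// Round `pos` up to the next multiple of CHUNK; None on u32 overflow.
fn round_up(pos: u32) -> Option<u32> {
    pos.checked_add(CHUNK - 1).map(|p| p & !(CHUNK - 1))
}

/// Assign every file a chunk-aligned start position.
/// Returns the start positions plus the number of unused "gap" indices,
/// or None if the files no longer fit into a u32 index space.
fn layout(file_lens: &[u32]) -> Option<(Vec<u32>, u32)> {
    let mut starts = Vec::with_capacity(file_lens.len());
    let mut next = 0u32;
    let mut wasted = 0u32;
    for &len in file_lens {
        starts.push(next);
        // Assumption: one extra position is reserved per file.
        let end = next.checked_add(len)?.checked_add(1)?;
        let aligned = round_up(end)?;
        wasted += aligned - end;
        next = aligned;
    }
    Some((starts, wasted))
}
```

The `wasted` total gives a quick way to see how much of the index space a given chunk size leaves unused for a set of file sizes.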

cc @nikomatsakis @michaelwoerister

@michaelwoerister
Member

Sounds like a good idea. For incremental compilation we'd eventually like to have offsets relative to item-likes (or something similarly stable). Maybe that could be baked right into such a scheme?

@estebank
Contributor

@michaelwoerister as I mentioned in the incr.comp thread, I would very much be in favor of making Spans offsets off their containers, but implementing it is not going to be trivial, I believe.

@estebank estebank added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-incr-comp Area: Incremental compilation I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. labels Aug 21, 2018
@eddyb
Member Author

eddyb commented Aug 22, 2018

@estebank I'd start with splitting the "file" component out first (because it is the implicit "container" of each span's lo/hi "indices"), and then we can try to do something of finer granularity.

@eddyb
Member Author

eddyb commented Aug 22, 2018

Hmm there's something weird we can do with spans: all top-level declarations today end in ; or }, so while lexing we can split a file into one chunk/container per top-level declaration.

Then we'd never have "true files" for spans within Rust modules.
Alternatively, we can just make fake files when converting AST into HIR, by translating all the spans into item-relative ones.

Or we can make spans more opaque. This idea wasn't originally about per-item spans, as spans have contiguity requirements, but rather a more unified & dynamic DefId for multi-crate rustc.

@Enselic Enselic added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Nov 18, 2023

4 participants