sort: Reduce memory overhead #2134

Closed
1 of 4 tasks
miDeb opened this issue Apr 26, 2021 · 21 comments

@miDeb
Contributor

miDeb commented Apr 26, 2021

When I wrote #1996, I implemented this by pre-computing some values beforehand to speed up comparisons later. There are other cases where we rely on pre-computed values. While this can result in a speed improvement, it has the big disadvantage of increasing our overall memory usage. The way it is currently implemented also means that we pay a constant tax even for lines/files that would not need pre-computed values. Therefore, I think we should not rely on pre-computed values for our comparisons. We can later experiment with adding them back for cases where they bring significant improvements without impacting other use-cases.

Places where pre-computed values are used:

  • numeric string comparisons (-h and -n): The current algorithm expects pre-computed values to be cached. To achieve similar performance without caching, it needs to be rewritten. Expected performance impact: a small performance regression, with improvements for other use-cases and already-sorted files.
  • general numeric comparisons: Parsed f64 values are cached. Reverting this should be relatively straightforward and should only be a minor performance regression. See sort: Reduce memory overhead #2134 (comment); this may not be as useful as I initially thought.
  • transforms: (-f, -d, -i, -b) could be implemented with a custom string comparison algorithm. Allocating new strings for every comparison could have a noticeable performance impact: sort: add some custom string comparisons #2144
  • fields: (-k) This could potentially benefit from caching, but maybe in a limited form? I have to investigate what the performance penalty for tokenizing each time would be.

Generally the goal is to

  • reduce memory usage: Currently every line gets its own Line struct that is 96 bytes in size. This is due to the pre-computed values for numeric string comparison and the indices for field support we have to reserve space for. The goal is to reduce this to just a &str, i.e. 16 bytes.
  • speed up cases that do not need pre-computed values anyway.
  • observe the performance impact and add mitigations for any regressions at a later point.
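To make the first goal concrete, here is a minimal sketch (the struct and its fields are hypothetical, not the actual uutils definitions) of why reserving space for per-line caches costs so much compared to a bare &str:

```rust
use std::mem::size_of;

// Hypothetical line struct that reserves space for pre-computed values,
// illustrating the constant per-line tax described above.
#[allow(dead_code)]
struct CachedLine<'a> {
    line: &'a str,                     // 16 bytes: pointer + length
    num_cache: Option<f64>,            // cached parse for numeric sorts
    field_ranges: Vec<(usize, usize)>, // cached -k field indices
}

fn main() {
    // A bare &str is just a fat pointer: 16 bytes on 64-bit targets.
    assert_eq!(size_of::<&str>(), 16);
    // The cached variant pays for the extra fields on every line,
    // even when the caches go unused.
    assert!(size_of::<CachedLine>() > 3 * size_of::<&str>());
}
```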

cc @electricboogie does this roughly sound OK to you? If so, I'll start implementing it in the next few days.

@kimono-koans
Contributor

I'm not sure we have to abandon all precomputed values just yet. I was just testing this and the SmallVec in the Line struct is a big hog. We can reduce the Line struct to 48 bytes just by changing to a regular Vec. Why don't we both test that first and see where we are after?

@miDeb
Contributor Author

miDeb commented Apr 26, 2021

Yeah, I also tried this briefly. I will look at it again, but I think this regressed speed. It will not reduce overall memory usage because it just moves the memory to a different place.

@kimono-koans
Contributor

So the memory overhead is similar? Huh. Yeah, then I guess I'm with you on precomputed values seeming like the thing we can lose if we have to.

@tertsdiepraam
Member

Maybe you could store a Vec in FieldSelector instead of a (Small)Vec in Line? Then if you have no FieldSelectors, you wouldn't allocate any memory for it, right? I'm not really familiar with the code, so this might make no sense at all.

@miDeb
Contributor Author

miDeb commented Apr 26, 2021

This could be an option, but would probably introduce more complexity. FieldSelectors are a global thing, but we'd need indices for each line. I'm also unsure how well this would work with extsort (sorting large files using auxiliary disk memory), but we can surely explore multiple approaches.

This is more or less what I meant with "later we can try to put some of this back in a way that doesn't affect other use-cases". Maybe we can build indices for the first n lines for huge files or something like that to reduce the peak memory usage.

@kimono-koans
Contributor

@miDeb since you've tinkered with the Line struct, do we have to take an owned value for each line string? I'm looking at this naively, but are there instances where we could take a &str, make this a Cow, and save a bunch of string allocations (or maybe avoid string allocations altogether)?

@miDeb
Contributor Author

miDeb commented Apr 26, 2021

No, I think we need to allocate those strings. We get them from the BufReader and have to take ownership from it because it will not keep its buffer content around. We could avoid allocations in some cases in the transform function as far as I can tell. For -b that's surely the case: We could just trim before comparing strings.

@kimono-koans
Contributor

Sorry, I meant allocating a string buffer. So, allocate a String::with_capacity(4096) [the size of the BufReader buffer, or the --buffer-size limit, or whatever] into a Buffer_Chunks struct, then have the Line struct take references?

I think this is what GNU sort does looking at its memory behavior in Valgrind.
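A sketch of that chunk-buffer idea, under the simplifying assumption of a single in-memory chunk (sort_chunk is an illustrative name, not actual uutils code): read the chunk into one String, then sort 16-byte &str slices that borrow from it instead of allocating an owned String per line.

```rust
use std::io::{self, Read, Write};

// Read a whole chunk into one String buffer, then sort &str slices that
// borrow from it. Each line costs only a fat pointer, not an allocation.
fn sort_chunk<R: Read, W: Write>(reader: R, mut out: W) -> io::Result<()> {
    let mut buffer = String::new();
    io::BufReader::new(reader).read_to_string(&mut buffer)?;

    // Each line is a 16-byte &str into `buffer`; no per-line allocation.
    let mut lines: Vec<&str> = buffer.lines().collect();
    lines.sort_unstable();

    for line in lines {
        writeln!(out, "{line}")?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let input = "banana\napple\ncherry\n";
    let mut out = Vec::new();
    sort_chunk(input.as_bytes(), &mut out)?;
    assert_eq!(out, b"apple\nbanana\ncherry\n".to_vec());
    Ok(())
}
```

The real implementation would of course need to handle multiple chunks and lines spanning chunk boundaries; this only shows the ownership shape being discussed.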

@miDeb
Contributor Author

miDeb commented Apr 26, 2021

Thanks for clarifying, that makes sense! That way we can avoid sorting Strings (24 bytes afaik) and can use &str (16 bytes).

@kimono-koans
Contributor

I reread your initial comment and realized I was suggesting something like what you were already talking about! Sorry for missing the point.

Re: chunked sorting, just so someone remembers -- we should benchmark stable and unstable again -- because I remember reading in the library docs that the default unstable sort perhaps wouldn't be suited to doing a final pass over the chunks.

@miDeb
Contributor Author

miDeb commented Apr 29, 2021

Alright, I did some experiments with removing precomputation for numeric comparisons. I don't think my initial hypothesis holds up: this results in a severe performance regression, and I don't know how valuable the improvement in memory usage is in practice.

I'm seeing a 30% speed regression for -n and a 70% regression for -g (haven't implemented -h yet, but it would be somewhere in between). Other workloads aren't really affected in their performance, although memory usage goes down by about 15% for all workloads. The size of the Line struct is reduced from 96 to 72 bytes (a reduction of 25%).

Maybe I'll find a change that has a clearer benefit.

@kimono-koans
Contributor

Could you store the cache in a separate struct/hashmap for each function call, instead of per line?
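One possible shape for that suggestion: a cache owned by the comparison pass rather than by each Line. All names here are illustrative, and the sentinel for non-numbers is an assumption, not the actual sort semantics.

```rust
use std::collections::HashMap;

// Hypothetical comparison-scoped cache keyed by line text, instead of a
// num_cache field stored on every Line.
struct ParseCache<'a> {
    cached: HashMap<&'a str, f64>,
}

impl<'a> ParseCache<'a> {
    fn new() -> Self {
        Self { cached: HashMap::new() }
    }

    // Parse once per distinct line, then reuse the cached f64.
    // Treating unparsable input as negative infinity is an assumption.
    fn parse(&mut self, line: &'a str) -> f64 {
        *self
            .cached
            .entry(line)
            .or_insert_with(|| line.trim().parse().unwrap_or(f64::NEG_INFINITY))
    }
}

fn main() {
    let mut cache = ParseCache::new();
    assert_eq!(cache.parse("3.5"), 3.5);
    assert_eq!(cache.parse("3.5"), 3.5); // second call hits the cache
    assert_eq!(cache.parse("not a number"), f64::NEG_INFINITY);
}
```

A bounded variant (evicting entries past the 10-50 items mentioned later in this thread) would trade some hit rate for a fixed memory ceiling.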

@kimono-koans
Contributor

kimono-koans commented Apr 29, 2021

Re: Field/Key, what about storing the field/key info in a separate struct/s which merely correspond to an instance of a line struct? So maybe Key=Line,Value=OtherStuff? Or perhaps struct GlobalLine, which is only used if needed and we rework the functions to take type T?

@kimono-koans
Contributor

FYI, on BSD/macOS, --debug has some interesting output which will show you the items/values that are actually being compared. Right now, as I understand the Line struct, we are storing every value. When I look at BSD --debug, I wonder: wouldn't a smaller cache make more sense? It seems like a 10-50 item cache might make performance dip by 5-10% but save bunches of memory.

@miDeb
Contributor Author

miDeb commented May 2, 2021

With the changes I'm implementing in #2144 and some more changes in #2165, the size of the Line struct is reduced from 12 words (96 bytes) to 7 words (56 bytes). That's as far as I could get; I'm not sure how much room for improvement is left.

Here's how Line and Selection would look:

pub struct Line {
    line: Box<str>,
    first_selection: Selection,
    other_selections: Box<[Selection]>,
}

struct Selection {
    range: SelectionRange, // type alias for Range<usize>
    num_cache: Option<Box<NumCache>>,
}

The memory layout of Line and Selection would look like this:

type: `Line`: 56 bytes, alignment: 8 bytes
    field `.line`: 16 bytes
    field `.first_selection`: 24 bytes
    field `.other_selections`: 16 bytes

type: `Selection`: 24 bytes, alignment: 8 bytes
    field `.range`: 16 bytes
    field `.num_cache`: 8 bytes

I'll also have to evaluate how much of an improvement this is in practice, #2165 alone looks like a 10% memory usage reduction.
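Those claimed sizes can be sanity-checked with stand-in types mirroring the definitions above (NumCache is treated as opaque here; only the Box pointer size matters for the layout):

```rust
use std::mem::size_of;
use std::ops::Range;

// Stand-ins matching the struct definitions quoted above.
#[allow(dead_code)]
struct NumCache([u8; 32]); // contents irrelevant; it lives behind a Box

#[allow(dead_code)]
struct Selection {
    range: Range<usize>,            // 16 bytes: start + end
    num_cache: Option<Box<NumCache>>, // 8 bytes (see below)
}

#[allow(dead_code)]
struct Line {
    line: Box<str>,                 // 16 bytes
    first_selection: Selection,     // 24 bytes, stored inline
    other_selections: Box<[Selection]>, // 16 bytes
}

fn main() {
    // Option<Box<T>> is pointer-sized thanks to the niche optimization:
    // the None case is represented as a null pointer.
    assert_eq!(size_of::<Option<Box<NumCache>>>(), 8);
    assert_eq!(size_of::<Selection>(), 24);
    assert_eq!(size_of::<Line>(), 56);
}
```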

@miDeb miDeb changed the title sort: Remove precomputation sort: Reduce memory overhead May 2, 2021
@kimono-koans
Contributor

I tried dividing up the Line struct myself into Line: String (24 bytes) and Selections: Vec (24 bytes) structs, and my benchmarking puts our memory usage about even with GNU, and we're still about 1.5x as fast on the slower Linux machine, and 5-6x as fast as GNU on the faster macOS laptop (if still slowish).

Appreciate your work on this. Neither of us likes a performance regression, but I'm still wondering if it isn't better to keep memory usage low at the expense of out-and-out performance, because larger inputs are where this may matter (?).

I'll push to my repo so you can take a look, if you care to. I still think there is an upside on the performance front to just cache the last 10-20 parse ops.

May I ask -- why are you still holding onto the first and second selection within the Line struct proper? Is there a performance benefit?

@miDeb
Contributor Author

miDeb commented May 3, 2021

I'm holding onto the first selection because I'm assuming that we'll usually only need to compare the first selection to find a difference between lines (and we always need to compare at least one!). Having the first selection directly on the struct is, I think, much faster than having it behind a pointer, but I'll check the exact numbers here.

The other thing is a Box<[Selection]>, which is a "boxed slice", similar to a &[Selection], with the only difference that we own it. The trick is not to use a Vec<Selection>, which uses 24 bytes (it consists of ptr, len, capacity), but a slice (16 bytes: ptr and len). This just means that we can't grow a Box<[Selection]> anymore (can't add Selections after constructing the line), but we don't need to in the first place. I'm converting the initial Vec<Selection> in Line::new using into_boxed_slice.

Similarly, Box<str> is like String, but we can't grow it and it uses 16 bytes instead of 24.
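The size difference described above is easy to verify; the conversion via into_boxed_slice is the one mentioned for Line::new:

```rust
use std::mem::size_of;

fn main() {
    // Vec<T> stores (ptr, len, capacity); a boxed slice drops `capacity`.
    assert_eq!(size_of::<Vec<u32>>(), 24);
    assert_eq!(size_of::<Box<[u32]>>(), 16);

    // Same trade-off for String vs Box<str>.
    assert_eq!(size_of::<String>(), 24);
    assert_eq!(size_of::<Box<str>>(), 16);

    // Once the Vec is final, shed the capacity field. This may
    // reallocate to shrink excess capacity, so it's a one-time cost.
    let selections: Vec<u32> = vec![1, 2, 3];
    let boxed: Box<[u32]> = selections.into_boxed_slice();
    assert_eq!(&boxed[..], &[1, 2, 3][..]);
}
```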

I really appreciate your suggestions. I'll have a look at your repo (though maybe not until next weekend).

@chadbrewbaker
Contributor

I'd like to see a /usr/bin/time -v run (BSD/macOS has different arguments) for several tools, to keep track of memory regressions. Many Rust buffered readers use unbounded memory.

@miDeb
Contributor Author

miDeb commented May 14, 2021

My latest, highly experimental attempt at improving sort performance can be found here: https://github.com/miDeb/coreutils/tree/ouroboros. Performance should be comparable to GNU sort for everything I tested, due to a few new (experimental) approaches. I hope I'll be able to split my changes into a number of incremental and mergeable PRs, but for now it's just an exploration into what could possibly be achieved.

The biggest experimental change is likely the use of self-referential structs, with a dependency on https://crates.io/crates/ouroboros.

@miDeb
Contributor Author

miDeb commented Jun 16, 2021

I think I'm pretty satisfied with the current memory usage, although there could be some improvements for numeric sorting or sort keys (maybe using a more data-oriented approach).

@miDeb miDeb closed this as completed Jun 16, 2021