sort: Finish implementing #1919
Comments
See: #1979 |
See: #2008 |
See: @miDeb's #1996 and #2057. @miDeb, can you briefly explain through what voodoo you made it so fast? I can see you changed the transforms, which I always thought were a pinch point, but everything is faster. Is it SmallVecs? Our wall-clock time for numeric is comparable to gsort with LC_ALL=C, which is insane. The only items I see now are nice-to-haves:
|
I made it so that transforms are run only once per line, at the beginning, and only when transforms are actually needed. Previously they were run at every comparison, even when there was nothing to transform, needlessly allocating a new string each time. At least I think that's the perf improvement you are seeing. I also made writes to stdout buffered, so if you are measuring perf without an outfile, that probably helps as well. I'm also working on a Rust version of numstrcmp, which will hopefully improve perf a bit more. |
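A minimal sketch of the two ideas described above (compute the transformed key once per line instead of at every comparison, and buffer writes to stdout). This is not the code from the PRs: `transform` and `sort_lines` are hypothetical names, and the transform itself (trim leading blanks, lowercase) is only an assumed stand-in for whatever options are active.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical transform standing in for the real option handling:
/// ignore leading blanks and fold case.
fn transform(line: &str) -> String {
    line.trim_start().to_lowercase()
}

/// Sort lines, computing the transformed key at most once per line.
/// When no transform is needed, compare the raw lines and allocate nothing.
fn sort_lines<'a>(lines: &[&'a str], needs_transform: bool) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = lines.to_vec();
    if needs_transform {
        // `sort_by_cached_key` calls the key function at most once per
        // element, rather than on every comparison.
        out.sort_by_cached_key(|l| transform(l));
    } else {
        out.sort_unstable();
    }
    out
}

fn main() -> io::Result<()> {
    let input = ["  banana", "Apple", "cherry"];
    let sorted = sort_lines(&input, true);

    // Buffer stdout so each line does not trigger its own write call.
    let stdout = io::stdout();
    let mut writer = BufWriter::new(stdout.lock());
    for line in &sorted {
        writeln!(writer, "{}", line)?;
    }
    writer.flush()
}
```

The trade-off is memory: caching keys holds one extra `String` per line for the duration of the sort, which is why skipping the transform entirely when it is not needed matters.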
I think 2 (sort buffer size) would be needed to be able to sort very large files (i.e. files that are larger than what we'd be able to keep in memory), right? I have no idea how I could implement that, though... Is there prior art somewhere? |
Another nice to have would be |
My understanding of (2) is that it is the size of the buffer for the entire sort. By default, gsort allows itself to use 90% of memory. When capped, the program looks at the amount of memory allocated and blocks until other operations have completed whenever continuing would exceed the total memory allowed. The exception is a single string that is itself larger than the cap, in which case the cap is exceeded. I looked at a crate called cap, but it seems that crate just causes an OOM situation when the cap is exceeded? ExtSort, (1), is for sorting files that we wouldn't be able to keep in memory. |
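As a rough illustration of how a buffer size and an external sort can fit together, here is a sketch using only the standard library: read lines until a byte budget is reached, sort that chunk, spill it to a temporary file, and finally merge all sorted runs. This is not how GNU sort or the extsort crate is implemented; `external_sort`, `spill`, the `run-N.tmp` file naming, and the simplistic memory accounting (line bytes only) are all assumptions made for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::{self, File};
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Sort `chunk` in memory and spill it to a new run file in `dir`.
fn spill(chunk: &mut Vec<String>, dir: &Path, n: usize) -> io::Result<PathBuf> {
    chunk.sort_unstable();
    let path = dir.join(format!("run-{}.tmp", n));
    let mut writer = BufWriter::new(File::create(&path)?);
    for line in chunk.drain(..) {
        writeln!(writer, "{}", line)?;
    }
    writer.flush()?;
    Ok(path)
}

/// Hypothetical external sort: `buffer_size` is a byte budget for the
/// lines held in memory at once (string overhead is not counted).
fn external_sort<R: BufRead, W: Write>(
    input: R,
    mut out: W,
    buffer_size: usize,
    tmp_dir: &Path,
) -> io::Result<()> {
    let mut runs: Vec<PathBuf> = Vec::new();
    let mut chunk: Vec<String> = Vec::new();
    let mut used = 0usize;

    for line in input.lines() {
        let line = line?;
        // Spill before the budget would be exceeded. A single line larger
        // than the budget is still accepted into an empty chunk.
        if !chunk.is_empty() && used + line.len() > buffer_size {
            let n = runs.len();
            runs.push(spill(&mut chunk, tmp_dir, n)?);
            used = 0;
        }
        used += line.len();
        chunk.push(line);
    }
    if !chunk.is_empty() {
        let n = runs.len();
        runs.push(spill(&mut chunk, tmp_dir, n)?);
    }

    // K-way merge: the heap holds the current smallest line of each run.
    let mut readers = Vec::new();
    for path in &runs {
        readers.push(BufReader::new(File::open(path)?).lines());
    }
    let mut heap = BinaryHeap::new();
    for (i, reader) in readers.iter_mut().enumerate() {
        if let Some(line) = reader.next() {
            heap.push(Reverse((line?, i)));
        }
    }
    while let Some(Reverse((line, i))) = heap.pop() {
        writeln!(out, "{}", line)?;
        if let Some(next) = readers[i].next() {
            heap.push(Reverse((next?, i)));
        }
    }

    // Close the run files before deleting them.
    drop(readers);
    for path in runs {
        fs::remove_file(path)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("extsort-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let input = io::Cursor::new("pear\napple\nfig\nbanana\n");
    let mut out = Vec::new();
    // An 8-byte budget forces several runs even for this tiny input.
    external_sort(input, &mut out, 8, &dir)?;
    print!("{}", String::from_utf8_lossy(&out));
    fs::remove_dir_all(&dir)
}
```

In this sketch the directory argument plays the role that a temporary-directory option would, and the byte budget plays the role of the buffer size; a real implementation would also need to bound the number of files merged at once.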
extsort seems to allow to specify a |
It seems GNU uses rlimits for its buffer size option, which are unfortunately macOS/Linux only. FYI, I'm going to take a crack at --buffer-size (probably with rlimits), and then maybe take a look at extsort (--temporary-directory and --batch-size). |
@miDeb You were actually exactly right about the relationship between --buffer-size and extsort. I'm finishing up the implementation now. BTW, have you experienced any weird errors when using the test_helper function? I get input and output that don't look like what I should be feeding it. |
The only thing that confused me was that it printed unexpected/expected output as bytes, such as
|
@electricboogie fyi working on |
Stuff left to do (there's probably more than this):
|
By locales, do you mean locale-specific ordering? |
Wow, great crate! Yes, that's indeed what I meant. Localizing error messages, help texts, etc. is also something we need to do, but that would probably need to be a shared effort across all utilities. FYI: I have found rust_icu_ucol, which exposes a collator (https://docs.rs/rust_icu_ucol/1.0.0/rust_icu_ucol/struct.UCollator.html), but that struct is not Sync, so I couldn't use it: we parallelize sorting, and for that we have to be able to share a collator across threads. (Edit: I previously linked to the wrong crate, sorry for the confusion. rust_icu_sys is only used internally by rust_icu_ucol.) |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I've started work on this. See: #1922
I'd like to take a crack at the rest of the missing functionality.