Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sort: Finish implementing #1919

Open
kimono-koans opened this issue Mar 26, 2021 · 16 comments
Open

sort: Finish implementing #1919

kimono-koans opened this issue Mar 26, 2021 · 16 comments

Comments

@kimono-koans
Copy link
Contributor

kimono-koans commented Mar 26, 2021

I've started work on this. See: #1922

I'd like to take a crack at the rest of the missing functionality.

@kimono-koans kimono-koans changed the title Finish implementing sort sort: Finish implementing Mar 29, 2021
@kimono-koans
Copy link
Contributor Author

See: #1979

@kimono-koans
Copy link
Contributor Author

See: #2008

@kimono-koans
Copy link
Contributor Author

See: @miDeb 's #1996 and #2057

@miDeb Can you briefly explain through what voodoo you made it so fast? I can see you changed the transforms, which I always thought was a pinch point, but everything is faster. Is it SmallVecs? Our wall clock time for numeric is comparable to gsort with LC_ALL=C, which is insane.

Only items I see now are nice to haves:

  1. ExtSort (there is a crate which utilizes rayon for the in memory compares),
  2. sort buffer size (if this means we couldn't use, or would have to reimplement, par_sort_by, I don't see the point),
  3. strnumcmp,
  4. locales (rust-icu-sys requires an unstable feature, and if we don't have that crate, I imagine it would be much harder).

@miDeb
Copy link
Contributor

miDeb commented Apr 10, 2021

I made it so that transforms are only run once at the beginning for every line and only when we need transforms (otherwise transforms were run at every comparison and even when there was nothing to transform, needlessly allocating a new string each time). At least I think that's the perf improvement you are seeing.

I also made writes to stdout buffered, so if you are measuring perf without an outifle -o you will certainly see huge perf improvements

I'm also working on a rust version of numstrcmp, which will hopefully improve perf a bit more.

@miDeb
Copy link
Contributor

miDeb commented Apr 10, 2021

I think 2 (sort buffer size) would be needed to be able to sort very large files (i.e. files that are larger that what we'd be able to keep in memory), right? I have no idea how I could implement that, though... Is there prior art somewhere?

@miDeb
Copy link
Contributor

miDeb commented Apr 10, 2021

Another nice to have would be --debug IMHO.

@kimono-koans
Copy link
Contributor Author

My understanding of (2) is that it is the size of the buffer for the entire sort. By default, gsort allows itself to use 90% of memory. When capped, the program looks at the amount of memory allocated and says we should block until these other ops are completed, because we would exceed our total memory allocated, but we should exceed the memory cap if some one string exceeds the memory cap. I looked at a crate called cap, but the crate just causes a OOM situation when the cap is exceeded?

ExtSort, (1), is for the sorting of files that we wouldn't be able to keep in memory.

@miDeb
Copy link
Contributor

miDeb commented Apr 10, 2021

extsort seems to allow to specify a segment_size, but that only limits the number of items to be kept in memory. Maybe it would be possible to add a similar mechanism that limits the amount of memory used.

@kimono-koans
Copy link
Contributor Author

It seems GNU uses rlimits for its buffer size option, which are unfortunately Mac OS/Linux only. FYI, I'm going to take a crack at --buffer-size (probably with rlimits), and then maybe take a look at extsort (--temporary-directory and --batch-size).

@kimono-koans
Copy link
Contributor Author

@miDeb You were actually exactly right about the relationship between --buffer-size and extsort. I'm finishing up the implementation now. BTW, have you experienced any weird errors when using the test_helper function? I get input and out that doesn't look like what I should be feeding it.

@miDeb
Copy link
Contributor

miDeb commented Apr 12, 2021

The only thing that confused me was that it printed unexpected/expected output as bytes, such as

thread 'test_sort::test_numeric_floats_with_nan2' panicked at 'assertion failed: `(left == right)`
  left: `[45, 56, 46, 57, 48, 56, 56, 48, 10, 45, 46, 48, 53, 10, 75, 97, 114, 109, 97, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 49, 46, 48, 47, 48, 46, 48, 10, 49, 10, 49, 46, 48, 52, 48, 48, 48, 48, 48, 48, 48, 10, 49, 46, 50, 10, 49, 46, 52, 52, 52, 10, 49, 46, 53, 56, 53, 57, 48, 10]`,
 right: `[45, 56, 46, 57, 48, 56, 56, 48, 10, 45, 46, 48, 53, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 75, 97, 114, 109, 97, 10, 49, 10, 49, 46, 48, 47, 48, 46, 48, 10, 49, 46, 48, 52, 48, 48, 48, 48, 48, 48, 48, 10, 49, 46, 50, 10, 49, 46, 52, 52, 52, 10, 49, 46, 53, 56, 53, 57, 48, 10]`', tests/common/util.rs:197:9

@miDeb
Copy link
Contributor

miDeb commented Apr 17, 2021

@electricboogie fyi working on --debug

@miDeb
Copy link
Contributor

miDeb commented May 22, 2021

Stuff left to do (there's probably more than this):

  • locales
  • exit codes (2 for failure, 1 for disorder when invoked with -c)
  • error messages do not match gnu
  • compression of auxiliary files with ext sort

@tertsdiepraam
Copy link
Member

By locales, do you mean locale-specific ordering? ls also does not implement that correctly yet, so I've been looking around for a crate that implements the Unicode Collation Algorithm and couldn't really find any, so I've started making my own. It's a big project that involves reading a couple different file formats to generate character orderings. I've written most of the parsing code, and am now in the process of figuring out how to generate the order. The code is here: https://github.com/tertsdiepraam/collate. I think it makes sense to keep it as a separate crate (although maybe under the uutils organization?). Anyway, feel free to send PR's or copy some code to put in uutils if you decide to work on this. It's not much yet, but it's better than starting from nothing.

@miDeb
Copy link
Contributor

miDeb commented May 22, 2021

Wow, great crate! Yes, that's indeed what I meant. Localizing error messages, help texts etc. is also something we need to do, but that'd probably need to be a shared effort for all utilities.

Fyi: i have found rust_icu_ucol, which exposes a collator (https://docs.rs/rust_icu_ucol/1.0.0/rust_icu_ucol/struct.UCollator.html), but that struct is not Sync (edit: I previosly linked to the wrong crate, sorry for the confusion: rust_icu_sys is only used internally by rust_icu_ucol), so I couldn't use it because we parallelize sorting and for that purpose we have to be able to share a collator across threads.
Collators are also a "stretch goal" of icu4x.

@stale
Copy link

stale bot commented Jan 20, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants