Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge sort #2850

Merged
merged 22 commits into from Feb 11, 2021
Merged

Merge sort #2850

merged 22 commits into from Feb 11, 2021

Conversation

kevinventullo
Copy link
Contributor

Issue

Introducing a new function merge_sort, which takes an auxiliary file and outputs the sorted sequence there. We then swap the two files.

This function seems to provide better performance on large files in low-memory environments, where the mem-mapped sequence appears to often get paged out. In particular, in a machine with about 2GB of memory operating on the whole world, the two sorts went from taking 3h48m and 5h48m, respectively to 55m and 20m, respectively.

Tasklist

  • Add tests
  • Add #fixes with the issue number that this PR addresses
  • Update the docs with any new request parameters or changes to behavior described
  • Update the changelog

Requirements / Relations

Link any requirements here. Other pull requests this PR is based on?

@kevinventullo
Copy link
Contributor Author

cc @kevinkreiser

@kevinkreiser kevinkreiser self-requested a review February 10, 2021 01:16
std::sort(static_cast<T*>(memmap) + i,
static_cast<T*>(memmap) + std::min(memmap.size(), i+buffer_size),
predicate);
pq.emplace(*at(i), i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! so the priority queue only stores at max however many sub-ranges it works out to based on the size of the sequence and the buffer_size 👍

@kevinkreiser
Copy link
Member

@kevinventullo you'll want to run scripts/format.sh to lint the code so it gets passed CI

// These should all fit in memory. Then, merge the sub-ranges into the
// output sequence via priority queue.
void merge_sort(const std::function<bool(const T&, const T&)>& predicate,
sequence<T>& output_seq,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the only unfortunate bit, we have to keep around two sequences to make the implementation trivial... which means we double the disk need for a small period of time. i think its a worthwhile trade off for sure. at first i started thinking about a scheme where, before we sort the subsequences we shift the items on the beginning to the end of the sequence to make space to swap things in from the priority queue as we do the merge but... that wont work you can think of pathological cases where the first "subsorted bucket" is all of the elements that sort to the end of the sequence, which means as we do the merge the space we made at the beginning will eventually eat into them. you cant win them all i guess 😄

Comment on lines 2232 to 2239
sequence<OSMWayNode> way_nodes_tmp(way_nodes_tmp_file, true);
sequence<OSMWayNode> way_nodes(way_nodes_file, false);
way_nodes.sort(
[](const OSMWayNode& a, const OSMWayNode& b) { return a.node.osmid_ < b.node.osmid_; });
way_nodes.merge_sort(
[](const OSMWayNode& a, const OSMWayNode& b) { return a.node.osmid_ < b.node.osmid_; }, way_nodes_tmp);
}
LOG_INFO("Merge sort done, now swapping files.");
filesystem::remove(way_nodes_file);
filesystem::rename(way_nodes_tmp_file, way_nodes_file);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinventullo what do you think about encapsulating the whole, make a new sequence/file, do the sort, remove and rename the file back, into the merge_sort function, so that instead of passing it a sequence and hanlding it outside you pass in the tmp file location and let all that stuff encapsulated inside the function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i made the chagnes locally to move the creation of a temporary file into the merge_sort function. ill verify its working and then push it up

@kevinkreiser
Copy link
Member

i was able to manually lint it via github we'll see if the build it sgreen here shortly

@kevinkreiser
Copy link
Member

@kevinventullo ive fixed a bunch of small things here including lint but because we changed the signature to ParseNodes we need to make a bunch of mechanical updates to the unit tests, would you mind taking a pass on that? i think thats the only thing left to get CI passing

@kevinkreiser
Copy link
Member

Ok I was just able to get this over the line locally. I made the following changes:

  1. replace sort with merge_sort
  2. encapsulate the creation and swapping of the tmp file inside the sort function
  3. if no merge will be needed (because buffer_size is larger than number of elements) we fall back to standard sorting
  4. add a unit test that confirms it all works

I've never pushed to an origin from a contribution before without the use of the github UI so we'll see if i can figure out how to do that 😉

@kevinkreiser
Copy link
Member

ok this is ready for merge as soon as CI is happy with it. thank you very much @kevinventullo !

@kevinkreiser
Copy link
Member

master:

2021/02/11 13:54:03.720693 [INFO] Sorting osm way node references by node id...
2021/02/11 13:54:17.291529 [INFO] Parsing nodes...
2021/02/11 13:56:48.344377 [INFO] Sorting osm way node references by way index and node shape index...
2021/02/11 13:57:02.954264 [INFO] Finished: max_osm_id 8413106505
2021/02/11 13:57:05.896848 [INFO] Sorting graph...
2021/02/11 13:57:12.975710 [INFO] Finished with 17861214 graph nodes

this branch:

2021/02/11 13:39:39.530047 [INFO] Sorting osm way node references by node id...
2021/02/11 13:39:53.974470 [INFO] Parsing nodes...
2021/02/11 13:42:18.358375 [INFO] Sorting osm way node references by way index and node shape index...
2021/02/11 13:42:34.354646 [INFO] Finished: max_osm_id 8413106505
2021/02/11 13:42:37.095536 [INFO] Sorting graph...
2021/02/11 13:42:45.867451 [INFO] Finished with 17861214 graph nodes

At least on my hardware the time is about the same for germany which should trigger the merge sort behavior since its larger than the default buffer_size. At any rate I see no reason not to merge this if it helps in some hardware configurations.

@kevinkreiser kevinkreiser merged commit 67b8e82 into valhalla:master Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants