Merge sort #2850
valhalla/midgard/sequence.h
std::sort(static_cast<T*>(memmap) + i,
          static_cast<T*>(memmap) + std::min(memmap.size(), i + buffer_size),
          predicate);
pq.emplace(*at(i), i);
nice! so the priority queue only ever holds at most one entry per sub-range, and the number of sub-ranges works out from the size of the sequence and the buffer_size 👍
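For readers following along, the idea can be sketched roughly as below. This is not the code from the PR: `chunked_merge_sort` is a hypothetical name, and a `std::vector` stands in for the mem-mapped sequence. It only illustrates sorting each `buffer_size` sub-range and then merging through a priority queue that holds one entry per sub-range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for the PR's merge_sort: `input` plays the role of the
// mem-mapped sequence, `buffer_size` is the number of elements per sub-range.
template <typename T>
std::vector<T> chunked_merge_sort(std::vector<T> input, std::size_t buffer_size,
                                  const std::function<bool(const T&, const T&)>& predicate) {
  assert(buffer_size > 0);

  // Step 1: sort each sub-range of at most buffer_size elements on its own.
  for (std::size_t i = 0; i < input.size(); i += buffer_size) {
    std::sort(input.begin() + i,
              input.begin() + std::min(input.size(), i + buffer_size), predicate);
  }

  // Step 2: seed a min-heap with the first element of every sub-range.
  // Each entry is (value, index of that value in `input`).
  using entry = std::pair<T, std::size_t>;
  auto later = [&predicate](const entry& a, const entry& b) {
    return predicate(b.first, a.first); // inverted so the smallest is on top
  };
  std::priority_queue<entry, std::vector<entry>, decltype(later)> pq(later);
  for (std::size_t i = 0; i < input.size(); i += buffer_size) {
    pq.emplace(input[i], i);
  }

  // Step 3: pop the smallest element, then refill from the same sub-range
  // unless that sub-range (or the whole input) is exhausted.
  std::vector<T> output;
  output.reserve(input.size());
  while (!pq.empty()) {
    entry top = pq.top();
    pq.pop();
    output.push_back(top.first);
    const std::size_t next = top.second + 1;
    if (next % buffer_size != 0 && next < input.size()) {
      pq.emplace(input[next], next);
    }
  }
  return output;
}
```

The queue never grows beyond the number of sub-ranges, since every pop is followed by at most one push from the same sub-range.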
@kevinventullo you'll want to run |
valhalla/midgard/sequence.h
// These should all fit in memory. Then, merge the sub-ranges into the
// output sequence via priority queue.
void merge_sort(const std::function<bool(const T&, const T&)>& predicate,
                sequence<T>& output_seq,
this is the only unfortunate bit: we have to keep two sequences around to keep the implementation trivial, which means we double the disk requirement for a short period of time. i think it's a worthwhile trade-off for sure. at first i started thinking about a scheme where, before we sort the subsequences, we shift the items at the beginning to the end of the sequence to make space to swap things in from the priority queue as we do the merge, but that won't work. you can think of pathological cases where the first "subsorted bucket" holds all of the elements that sort to the end of the sequence, which means that as we do the merge, the space we made at the beginning will eventually eat into them. you can't win them all i guess 😄
src/mjolnir/pbfgraphparser.cc
sequence<OSMWayNode> way_nodes_tmp(way_nodes_tmp_file, true);
sequence<OSMWayNode> way_nodes(way_nodes_file, false);
way_nodes.sort(
[](const OSMWayNode& a, const OSMWayNode& b) { return a.node.osmid_ < b.node.osmid_; });
way_nodes.merge_sort(
[](const OSMWayNode& a, const OSMWayNode& b) { return a.node.osmid_ < b.node.osmid_; }, way_nodes_tmp);
}
LOG_INFO("Merge sort done, now swapping files.");
filesystem::remove(way_nodes_file);
filesystem::rename(way_nodes_tmp_file, way_nodes_file);
@kevinventullo what do you think about encapsulating the whole thing (make a new sequence/file, do the sort, remove the original and rename the new file back) inside the merge_sort function, so that instead of passing it a sequence and handling the swap outside, you pass in the tmp file location and all of that stays inside the function?
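A rough sketch of what that encapsulation could look like, under loud assumptions: `sort_file_in_place` is a hypothetical name, it uses C++17 `std::filesystem` rather than the project's own `filesystem` helpers, and it loads the whole file into memory with a plain `std::sort` instead of the chunked merge. The point is only the shape of the interface, where the caller passes a tmp file path and the write, remove and rename all happen inside.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <functional>
#include <ios>
#include <type_traits>
#include <vector>

// Hypothetical helper: sorts the fixed-size records stored in `file`, using
// `tmp_file` as scratch output, then swaps the sorted file into place.
template <typename T>
void sort_file_in_place(const std::filesystem::path& file,
                        const std::filesystem::path& tmp_file,
                        const std::function<bool(const T&, const T&)>& predicate) {
  static_assert(std::is_trivially_copyable<T>::value,
                "records are read and written as raw bytes");
  assert(file != tmp_file);

  // Read all records (stand-in for the mem-mapped sequence in the PR).
  std::vector<T> records(std::filesystem::file_size(file) / sizeof(T));
  const auto bytes = static_cast<std::streamsize>(records.size() * sizeof(T));
  {
    std::ifstream in(file, std::ios::binary);
    in.read(reinterpret_cast<char*>(records.data()), bytes);
  }

  // Stand-in for the chunked sort and priority-queue merge.
  std::sort(records.begin(), records.end(), predicate);

  // Write the sorted records to the tmp file.
  {
    std::ofstream out(tmp_file, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(records.data()), bytes);
  }

  // Swap the files: the caller never sees the tmp file afterwards.
  std::filesystem::remove(file);
  std::filesystem::rename(tmp_file, file);
}
```

With something like this, the call site shrinks to a single call that takes the predicate and the tmp file location.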
i made the changes locally to move the creation of the temporary file into the merge_sort function. i'll verify it's working and then push it up
i was able to manually lint it via github, we'll see if the build is green here shortly
@kevinventullo ive fixed a bunch of small things here including lint but because we changed the signature to |
Fixing unit tests
Ok I was just able to get this over the line locally. I made the following changes:
I've never pushed to an origin from a contribution before without the use of the github UI so we'll see if i can figure out how to do that 😉 |
ok this is ready for merge as soon as CI is happy with it. thank you very much @kevinventullo ! |
master:
this branch:
At least on my hardware the time is about the same for germany, which should trigger the merge sort behavior since it's larger than the default buffer_size. At any rate I see no reason not to merge this if it helps in some hardware configurations.
Issue
Introducing a new function merge_sort, which takes an auxiliary file and outputs the sorted sequence there. We then swap the two files.
This function seems to provide better performance on large files in low-memory environments, where the mem-mapped sequence appears to get paged out often. In particular, on a machine with about 2GB of memory operating on the whole world, the two sorts went from taking 3h48m and 5h48m to 55m and 20m, respectively.