Merge sort #2850

kevinventullo · 2021-02-10T00:59:28Z

Issue

Introducing a new function merge_sort, which takes an auxiliary file and outputs the sorted sequence there. We then swap the two files.

This function seems to provide better performance on large files in low-memory environments, where the memory-mapped sequence appears to get paged out often. In particular, on a machine with about 2GB of memory operating on the whole world, the two sorts went from taking 3h48m and 5h48m to 55m and 20m, respectively.

Tasklist

Add tests
Add #fixes with the issue number that this PR addresses
Update the docs with any new request parameters or changes to behavior described
Update the changelog

Requirements / Relations

Link any requirements here. Other pull requests this PR is based on?

kevinventullo · 2021-02-10T00:59:50Z

cc @kevinkreiser

kevinkreiser · 2021-02-10T01:22:19Z

valhalla/midgard/sequence.h

+      std::sort(static_cast<T*>(memmap) + i, 
+                static_cast<T*>(memmap) + std::min(memmap.size(), i+buffer_size), 
+                predicate);
+      pq.emplace(*at(i), i);


nice! so the priority queue only ever holds at most one entry per sub-range, however many that works out to based on the size of the sequence and the buffer_size 👍
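To make the sub-range idea concrete, here is a minimal sketch of the two passes, assuming an in-memory std::vector as a stand-in for the memory-mapped sequence. The name chunked_merge_sort and the (current index, end index) heap entries are hypothetical and are not taken from the PR; the point is only that the heap never holds more than one cursor per sorted sub-range.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for the PR's merge_sort: `input` plays the role of the
// memory-mapped sequence, the returned vector plays the role of the output sequence.
template <typename T>
std::vector<T> chunked_merge_sort(std::vector<T> input,
                                  std::size_t buffer_size,
                                  const std::function<bool(const T&, const T&)>& predicate) {
  if (buffer_size == 0)
    buffer_size = 1; // assumption: treat a zero buffer as one element per sub-range

  // Each heap entry is (current index, one-past-the-end index) of one sub-range.
  using cursor = std::pair<std::size_t, std::size_t>;
  // std::priority_queue is a max-heap, so the comparison is inverted to keep
  // the cursor pointing at the smallest element on top.
  auto later = [&](const cursor& a, const cursor& b) {
    return predicate(input[b.first], input[a.first]);
  };
  std::priority_queue<cursor, std::vector<cursor>, decltype(later)> pq(later);

  // Pass 1: sort each sub-range of at most buffer_size elements on its own.
  for (std::size_t i = 0; i < input.size(); i += buffer_size) {
    const std::size_t end = std::min(input.size(), i + buffer_size);
    std::sort(input.data() + i, input.data() + end, predicate);
    pq.emplace(i, end);
  }

  // Pass 2: repeatedly take the smallest head element and advance that cursor.
  std::vector<T> output;
  output.reserve(input.size());
  while (!pq.empty()) {
    cursor c = pq.top();
    pq.pop();
    output.push_back(input[c.first]);
    if (++c.first < c.second)
      pq.push(c); // this sub-range still has elements left
  }
  return output;
}
```

Because the predicate parameter is a std::function, the element type has to be given explicitly when passing a lambda, for example chunked_merge_sort<int>(values, 3, [](const int& a, const int& b) { return a < b; }).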

kevinkreiser · 2021-02-10T01:22:52Z

@kevinventullo you'll want to run scripts/format.sh to lint the code so it gets past CI

kevinkreiser · 2021-02-10T01:28:23Z

valhalla/midgard/sequence.h

+  // These should all fit in memory. Then, merge the sub-ranges into the 
+  // output sequence via priority queue.
+  void merge_sort(const std::function<bool(const T&, const T&)>& predicate,
+                  sequence<T>& output_seq,


this is the only unfortunate bit: we have to keep around two sequences to make the implementation trivial, which means we double the disk need for a short period of time. i think it's a worthwhile trade-off for sure. at first i started thinking about a scheme where, before we sort the subsequences, we shift the items at the beginning to the end of the sequence to make space to swap things in from the priority queue as we do the merge. but that won't work: you can think of pathological cases where the first "subsorted bucket" holds all of the elements that sort to the end of the sequence, which means that as we do the merge the space we made at the beginning will eventually eat into them. you can't win them all i guess 😄

kevinkreiser · 2021-02-10T01:31:00Z

src/mjolnir/pbfgraphparser.cc

+    sequence<OSMWayNode> way_nodes_tmp(way_nodes_tmp_file, true);
    sequence<OSMWayNode> way_nodes(way_nodes_file, false);
-    way_nodes.sort(
-        [](const OSMWayNode& a, const OSMWayNode& b) { return a.node.osmid_ < b.node.osmid_; });
+    way_nodes.merge_sort(
+        [](const OSMWayNode& a, const OSMWayNode& b) { return a.node.osmid_ < b.node.osmid_; }, way_nodes_tmp);
  }
+  LOG_INFO("Merge sort done, now swapping files.");
+  filesystem::remove(way_nodes_file);
+  filesystem::rename(way_nodes_tmp_file, way_nodes_file);


@kevinventullo what do you think about encapsulating the whole thing (make a new sequence/file, do the sort, remove and rename the file back) inside the merge_sort function, so that instead of passing it a sequence and handling it outside, you pass in the tmp file location and all that stuff stays encapsulated inside the function?

i made the changes locally to move the creation of the temporary file into the merge_sort function. i'll verify it's working and then push it up
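A rough sketch of that encapsulation, under some loud assumptions: it uses C++17 std::filesystem rather than Valhalla's own filesystem header, it sorts a plain text file of integers instead of a sequence<T>, and it reads the whole file into memory, which the real function exists to avoid. The helper name sort_file_in_place is hypothetical. What it illustrates is only the shape of the call: the caller hands over a temp file location, and the write, remove and rename all happen inside.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: sorts the whitespace-separated integers in `file_name`,
// writing the result to `tmp_file_name` and then swapping it into place.
inline void sort_file_in_place(const std::string& file_name,
                               const std::string& tmp_file_name) {
  {
    std::ifstream in(file_name);
    std::vector<long long> values((std::istream_iterator<long long>(in)),
                                  std::istream_iterator<long long>());
    std::sort(values.begin(), values.end());

    std::ofstream out(tmp_file_name, std::ios::trunc);
    for (long long v : values)
      out << v << '\n';
  } // both streams are closed here, before the files are swapped

  // Same remove-then-rename order as the diff above.
  std::filesystem::remove(file_name);
  std::filesystem::rename(tmp_file_name, file_name);
}
```

With this shape the call site shrinks to a single call such as sort_file_in_place(path, path + ".tmp"), and the caller never sees the temporary file.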

kevinkreiser · 2021-02-10T13:34:37Z

i was able to manually lint it via github; we'll see if the build is green here shortly

kevinkreiser · 2021-02-10T14:31:24Z

@kevinventullo i've fixed a bunch of small things here including lint, but because we changed the signature of ParseNodes we need to make a bunch of mechanical updates to the unit tests. would you mind taking a pass on that? i think that's the only thing left to get CI passing

kevinkreiser · 2021-02-11T03:36:11Z

Ok I was just able to get this over the line locally. I made the following changes:

replace sort with merge_sort
encapsulate the creation and swapping of the tmp file inside the sort function
if no merge will be needed (because buffer_size is larger than number of elements) we fall back to standard sorting
add a unit test that confirms it all works
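The fallback condition in the list above can be sketched as a pair of small helpers. Both names are hypothetical, and treating a buffer_size of zero as one element per sub-range is an assumption made for this sketch, not something the PR states.

```cpp
#include <cstddef>

// Hypothetical helper: number of sorted sub-ranges that `count` elements
// split into when each sub-range holds at most `buffer_size` elements.
inline std::size_t sub_range_count(std::size_t count, std::size_t buffer_size) {
  if (buffer_size == 0)
    buffer_size = 1; // assumption: a zero buffer means one element per sub-range
  return (count + buffer_size - 1) / buffer_size; // ceiling division
}

// Hypothetical helper: a merge pass is only worth doing when there is more
// than one sub-range; otherwise a single ordinary sort already covers it.
inline bool needs_merge(std::size_t count, std::size_t buffer_size) {
  return sub_range_count(count, buffer_size) > 1;
}
```

The same count is also the upper bound on how many entries the priority queue holds during the merge.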

I've never pushed to an origin from a contribution before without the use of the github UI so we'll see if i can figure out how to do that 😉

kevinkreiser · 2021-02-11T03:40:25Z

ok this is ready for merge as soon as CI is happy with it. thank you very much @kevinventullo !

kevinkreiser · 2021-02-11T14:46:23Z

master:

2021/02/11 13:54:03.720693 [INFO] Sorting osm way node references by node id...
2021/02/11 13:54:17.291529 [INFO] Parsing nodes...
2021/02/11 13:56:48.344377 [INFO] Sorting osm way node references by way index and node shape index...
2021/02/11 13:57:02.954264 [INFO] Finished: max_osm_id 8413106505
2021/02/11 13:57:05.896848 [INFO] Sorting graph...
2021/02/11 13:57:12.975710 [INFO] Finished with 17861214 graph nodes

this branch:

2021/02/11 13:39:39.530047 [INFO] Sorting osm way node references by node id...
2021/02/11 13:39:53.974470 [INFO] Parsing nodes...
2021/02/11 13:42:18.358375 [INFO] Sorting osm way node references by way index and node shape index...
2021/02/11 13:42:34.354646 [INFO] Finished: max_osm_id 8413106505
2021/02/11 13:42:37.095536 [INFO] Sorting graph...
2021/02/11 13:42:45.867451 [INFO] Finished with 17861214 graph nodes

At least on my hardware the time is about the same for Germany, which should trigger the merge sort behavior since it's larger than the default buffer_size. At any rate I see no reason not to merge this if it helps in some hardware configurations.

kevinventullo added 6 commits February 9, 2021 19:43

Update pbfgraphparser.cc

bddfa21

Update pbfgraphparser.cc

d93c86b

Update util.cc

bdbffec

Update sequence.h

fcbffab

Update pbfgraphparser.h

83e1021

Update filesystem.h

fc5a26c

kevinkreiser self-requested a review February 10, 2021 01:16

kevinkreiser reviewed Feb 10, 2021

kevinkreiser added 4 commits February 10, 2021 08:25

Merge branch 'master' into MergeSort

1ea5236

lint

f73b2c2

lint

72ff1af

lint

a6e4276

kevinkreiser added 3 commits February 10, 2021 08:35

Merge branch 'master' into MergeSort

6685ac9

missing include

c799a00

fix typo

b249337

purew reviewed Feb 10, 2021

valhalla/midgard/sequence.h

kevinventullo added 4 commits February 10, 2021 14:32

Update graphparser.cc

aa13545

Fixing unit tests

Fixing countryaccess.cc tests

94f5062

Fix more tests

646ec28

Fix test

91b87b0

kevinkreiser added 3 commits February 10, 2021 22:36

sequester the changes inside of sequence.h and add unit test

e679284

update changelog, revert unneeded change

f4aa9a8

Merge branch 'master' into MergeSort

d768116

kevinkreiser added 2 commits February 10, 2021 22:52

fix mac type coersion

864483a

Merge remote-tracking branch 'kevinventullo/MergeSort' into kevinvent…

70f314a

…ullo-MergeSort

kevinkreiser approved these changes Feb 11, 2021

kevinkreiser merged commit 67b8e82 into valhalla:master Feb 11, 2021
