
Batch writes to edges file #2851

Merged: 18 commits into valhalla:master on Feb 15, 2021

Conversation

kevinventullo (Contributor)

Issue

Instead of writing to the edge file inline, keep track of the writes in an in-memory
vector, sort them by edge id, and apply them all at the end. Sorting them
means the writes to the edges file are now sequential, which vastly speeds them up.

In our internal system on the whole world, this reduced the processing time of this stage from 10 hours to about 7 minutes.
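
For illustration, a minimal sketch of the batching idea (hypothetical names and record layout, not the actual graphbuilder.cc code): buffer the updates, sort them by edge id, then apply them in one forward pass over the file.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical update record: which edge record to patch and the value to write.
struct EdgeUpdate {
  uint32_t edge_id;
  uint32_t node_index;
};

// Instead of seeking to a random offset for every update, collect them all,
// sort by edge id, and write them back in ascending file order.
void apply_updates(std::FILE* edge_file, std::vector<EdgeUpdate>& updates,
                   size_t record_size, size_t field_offset) {
  std::sort(updates.begin(), updates.end(),
            [](const EdgeUpdate& a, const EdgeUpdate& b) { return a.edge_id < b.edge_id; });
  for (const auto& u : updates) {
    std::fseek(edge_file, static_cast<long>(u.edge_id * record_size + field_offset), SEEK_SET);
    std::fwrite(&u.node_index, sizeof(u.node_index), 1, edge_file);
  }
}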

Tasklist

  • Add tests
  • Add #fixes with the issue number that this PR addresses
  • Update the docs with any new request parameters or changes to behavior described
  • Update the changelog

Requirements / Relations

Link any requirements here. Other pull requests this PR is based on?

@kevinventullo changed the title from "Update graphbuilder.cc" to "Batch writes to edges file" on Feb 10, 2021
@kevinventullo (Contributor, Author)

cc @kevinkreiser

@kevinkreiser self-requested a review on February 10, 2021 at 01:31
uint32_t run_index = 0;
uint32_t node_index = 0;
size_t node_count = 0;
Node last_node{};
std::map<GraphId, size_t> tiles;
nodes.transform([&edges, &run_index, &node_index, &node_count, &last_node, &tiles](Node& node) {
// If these get too large, consider using a sequence instead of a vector here.
std::vector<std::pair<uint32_t, uint32_t>> starts, ends;
kevinkreiser (Member)

At 8 bytes per record and about 400,000,000 graph edges (double that because of start/end) we get something like 3.2GB (and maybe double that again if we are unlucky with vector allocation). This is definitely close to needing a sequence. Would you be opposed to something as simple as a struct (right here, above the function) to encapsulate the two indices, and then throwing them into a sequence?

kevinventullo (Contributor, Author)

Yep, I can do that!

kevinventullo (Contributor, Author)

I think it might be easier to have two separate sequences; otherwise you need an extra bit of data for each entry indicating whether it's a start or an end. If instead the sequence was a mapping [edge -> (start,end)], you run into the same random insert issue as you iterate through the nodes.

kevinkreiser (Member)

Ah yep, sorry, I was being imprecise with my wording above. I meant 2 sequences in total, 1 for each mapping.
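
To make the thread concrete, here is roughly what that could look like (a sketch only; the struct name is made up, and how midgard::sequence is constructed and sorted is an assumption rather than a confirmed API):

#include <cstdint>

// One fixed-size POD record per edge endpoint, so it can live in a disk-backed
// sequence instead of an in-memory std::vector.
struct EdgeEndpoint {
  uint32_t edge_index; // which record in the edges file to update
  uint32_t run_index;  // node index to write into that record
};

// Two sequences in total, one per mapping:
//   starts: edge_index -> run_index of the edge's first node
//   ends:   edge_index -> run_index of the edge's last node
// e.g. (assumed interface) midgard::sequence<EdgeEndpoint> starts(start_file, true);
// During the node pass each endpoint is appended; afterwards both sequences are
// sorted by edge_index so the edges file can be patched in a single forward pass.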

}

// next node
last_node = node;
++node_index;
});

LOG_INFO("Nodes processed. Now adding data to edges.");
kevinkreiser (Member)

The rest of this looks perfectly fine. It's unfortunate that looping over both sets at the same time ends up being so verbose, but there isn't a clearer way to do it, so it looks good to me!

@mandeepsandhu (Contributor)

Curious as to what's the reason for the big speedup in writes? Typically the writes would go to the page cache first (assuming write-back), which is quite fast, and then later get flushed to disk. This is of course on a typical Linux server/desktop setup, which would be configured as such. Was the system where you observed the speedup configured differently?

auto edge = *element;
edge.sourcenode_ = run_index;
element = edge;
starts.emplace_back(node.start_of, run_index);
}
// if this node marks the end of an edge, go tell the edge where the first node in the
// series is
mandeepsandhu (Contributor)

Nit: perhaps update this comment to match the one above.

@kevinventullo (Contributor, Author)

> Curious as to what's the reason for the big speedup in writes? Typically the writes would go to the page cache first (assuming write-back), which is quite fast, and then later get flushed to disk. This is of course on a typical Linux server/desktop setup, which would be configured as such. Was the system where you observed the speedup configured differently?

I don't think it was differently configured. My guess is that while the system does use a write-back cache, all the jumping around means that roughly every write is going to a new page, so you end up needing to flush much more often as the in-memory write cache fills up.
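
A back-of-envelope illustration of that effect (the numbers are placeholders; the edge record size in particular is made up):

#include <cstdio>

int main() {
  constexpr long long kEdges = 400'000'000LL; // rough planet edge count mentioned above
  constexpr long long kEdgeRecordBytes = 32;  // hypothetical on-disk edge record size
  constexpr long long kPageBytes = 4096;      // typical page size

  // Sorted, sequential updates: consecutive records share pages, so each page
  // of the edges file is dirtied and written back roughly once.
  long long sequential_page_writes = kEdges * kEdgeRecordBytes / kPageBytes; // ~3 million

  // Random-order updates: by the time we come back to a region its page has
  // likely already been flushed, so in the worst case there is about one page
  // write-back per update.
  long long random_page_writes = kEdges; // up to 400 million

  std::printf("sequential: ~%lld page write-backs\n", sequential_page_writes);
  std::printf("random:     up to %lld page write-backs\n", random_page_writes);
  return 0;
}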

@kevinkreiser (Member)

@kevinventullo I'll get this review through this evening, thanks again!

@kevinventullo (Contributor, Author)

> @kevinventullo I'll get this review through this evening, thanks again!

Thank you for your patience! I'm curious to see how this performs on your hardware.

@kevinkreiser (Member)

I'm going to reduce the blast radius of this a bit. I really wish that from the beginning we had passed a base directory for all the temp stuff we create (like many other applications do); then we could just choose what all the random .bin files would be named (I mean we do now, but it's at the very beginning of the process). I'm sure it was just something that changed over time and never got refactored.

@kevinkreiser (Member)

I ran again with Germany to see what the difference would be:

master:

2021/02/13 04:01:32.526292 [INFO] Creating graph edges from ways...
2021/02/13 04:01:35.279684 [INFO] Finished with 22840796 graph edges
2021/02/13 04:01:35.385875 [INFO] Sorting graph...
2021/02/13 04:01:41.573242 [INFO] Finished with 17861214 graph nodes

this branch:

2021/02/13 03:40:09.233344 [INFO] Creating graph edges from ways...
2021/02/13 03:40:11.829682 [INFO] Finished with 22840796 graph edges
2021/02/13 03:40:11.928681 [INFO] Sorting graph...
2021/02/13 03:40:19.518808 [INFO] Nodes processed. Now adding data to edges.
2021/02/13 03:40:24.296359 [INFO] Sorting starts and ends done. Populating edges...
2021/02/13 03:40:24.751309 [INFO] Finished with 17861214 graph nodes

This branch is quite a bit slower. That difference may evaporate when we are talking about the planet, which I haven't tried. I'll have to give that another go once I have one downloaded.

@kevinventullo (Contributor, Author)

> I'm going to reduce the blast radius of this a bit. I really wish that from the beginning we had passed a base directory for all the temp stuff we create (like many other applications do); then we could just choose what all the random .bin files would be named (I mean we do now, but it's at the very beginning of the process). I'm sure it was just something that changed over time and never got refactored.

Oh, perfect! Yeah, it felt a little weird piping the tmp file through, but I didn't feel emboldened enough to do a whole refactor 😃

@kevinventullo (Contributor, Author)

> I ran again with Germany to see what the difference would be [...] This branch is quite a bit slower. That difference may evaporate when we are talking about the planet, which I haven't tried. I'll have to give that another go once I have one downloaded.

Ah I see. Yeah, the reduction I saw went from 10 hours on the whole planet to 7 minutes. It definitely sounds like our problems would be fixed by adding more memory.

@kevinkreiser (Member)

@kevinventullo For reference, I'm on a machine with 32 GB of RAM, but to be honest, if we can get similar performance on a machine with a couple gigs of RAM, then it's worth it. It would be great to one day finish making everything distributed and allow a swarm of low-RAM VMs to work on the whole tileset in parallel. The main hurdle there is that that swarm will often need access to adjacent tiles, but something like NFS (while usually too slow for these needs) might be just fine here, where you only need to look up a couple of adjacent tiles rather than access the whole dataset... anyway, just thinking ahead 😄

Anyway, I ran the planet last night and should have the results of the change here shortly. Even if it's a couple minutes longer, I think it's still an OK change to make to enable lower-RAM situations. It is a bit concerning that code coverage doesn't hit all the branches inside the loop; I need to look closely at that to see why that would be the case despite all the different datasets we load in CI.

@kevinkreiser (Member)

Good news. Even for me it's slightly faster on the planet:
master:

2021/02/13 18:38:16.460137 [INFO] Sorting graph...
2021/02/13 18:46:39.194044 [INFO] Finished with 284443165 graph nodes

this branch:

2021/02/14 01:03:51.243432 [INFO] Sorting graph...
2021/02/14 01:07:53.464111 [INFO] Nodes processed. Now adding data to edges.
2021/02/14 01:09:31.314837 [INFO] Sorting starts and ends done. Populating edges...

2021/02/14 01:09:43.687472 [INFO] Finished with 284443165 graph nodes

@kevinkreiser (Member)

@kevinventullo I couldn't ignore that coverage check, and after thinking about it a bit I couldn't see how it was possible that we'd have any mismatches in the begin/end node lists. We should have exactly one begin and one end node for each edge and the same set of edges in each list, so I've added a couple of asserts and simplified the looping a bit. Let me know what you think: 4d2f110
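
For reference, the lockstep pass could look roughly like this (a sketch with hypothetical names and an in-memory stand-in for the sequences, not the actual 4d2f110 change), assuming both lists are sorted by edge index and contain exactly one entry per edge:

#include <cassert>
#include <cstdint>
#include <vector>

struct EdgeEndpoint {
  uint32_t edge_index; // which edge record to patch
  uint32_t run_index;  // node index to store in that record
};

struct Edge {
  uint32_t sourcenode_ = 0;
  uint32_t targetnode_ = 0;
};

// Walk the two edge-sorted lists in lockstep and patch each edge exactly once.
void patch_edges(std::vector<Edge>& edges,
                 const std::vector<EdgeEndpoint>& starts,
                 const std::vector<EdgeEndpoint>& ends) {
  // Every edge has exactly one begin and one end node, so both lists must
  // cover the same set of edges in the same sorted order.
  assert(starts.size() == ends.size());
  for (size_t i = 0; i < starts.size(); ++i) {
    assert(starts[i].edge_index == ends[i].edge_index);
    Edge& edge = edges[starts[i].edge_index];
    edge.sourcenode_ = starts[i].run_index;
    edge.targetnode_ = ends[i].run_index;
  }
}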

@kevinkreiser (Member)

@nilsnolde I've again disabled the mingw workflow as it's failing to get one of the dependencies again ☹️

@kevinventullo (Contributor, Author)

> @kevinventullo I couldn't ignore that coverage check, and after thinking about it a bit I couldn't see how it was possible that we'd have any mismatches in the begin/end node lists. We should have exactly one begin and one end node for each edge and the same set of edges in each list, so I've added a couple of asserts and simplified the looping a bit. Let me know what you think: 4d2f110

Oh, I was wondering about this! Yes, that looks so much cleaner. Thank you for all these awesome improvements to the PR!

@kevinkreiser merged commit b34a9d9 into valhalla:master on Feb 15, 2021