Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix possible null pointer dereferencing #3065

Merged
merged 10 commits into from
Jun 1, 2021

Conversation

DenisPryt
Copy link
Contributor

Issue

#3064

Tasklist

  • Add tests
  • Add #fixes with the issue number that this PR addresses
  • Update the docs with any new request parameters or changes to behavior described
  • Update the changelog

Requirements / Relations

Link any requirements here. Other pull requests this PR is based on?

@kevinkreiser
Copy link
Member

Rather than blanketing the code with this could you maybe figure out which one you are seeing in practice and fixt that one? A lot of these will be unnecessary.

For example. In the Origin function of bidirectional a* the ede id's come from loki using the same graph reader. It is possible to write code to clear the tile cache between loki and Thor but it's harder to understand why someone would do that.

So yeah please can you focus the pr to a specific case you were running into? Or maybe you are using the library in another project and the cache can be cleared at any time by another thread?

A bit more explanation of what is going on here would be great!

@DenisPryt
Copy link
Contributor Author

DenisPryt commented May 12, 2021

@kevinkreiser Sorry for the incomplete description. This fixes for our project. And yes, in our case the cache can be cleared from another thread.

@kevinkreiser
Copy link
Member

thats a lot of branches and print statements. we should clean up print statements and we should check what impacts all of these branches have on the code in terms of performance

@DenisPryt
Copy link
Contributor Author

@kevinkreiser printf statements were there before my changes and used inside #ifdef LOGGING_LEVEL_TRACE
I removed LOG statement that I added.
If ifs will be a problem we can use hints for branch prediction cpu unit. Without this ifs we can just crash

@DenisPryt DenisPryt marked this pull request as ready for review May 14, 2021 08:46
@kevinkreiser
Copy link
Member

kevinkreiser commented May 17, 2021

Algorithms Should Not Be Robust to Data Disappearing On-the-fly

We'll start by clearing up one thing. None of the algorithms are robust to randomly removing tiles out from underneath them. The trim and clear operations on GraphReader were meant to reclaim memory between top level methods on the actor_t interface. Like this:

actor.route(graphreader)
graphreader.clear()
actor.mapmatch()
graphreader.clear()
// and so on ...

Yes the code is robust to a tile not existing or not being on the device that is fine but its not robust to it being there for part of the computation and going away the next. And actually it SHOULNT be anyway. Think about what would happen in a couple scenarios and you'll realize why its not worth it to try to make the code "gracefully" handle this.

  1. loki finds candidate edges but the tiles those edges are in are purged by another thread, thor starts up with those candidate edges and the route is immediately failed
  2. thor does part of an expansion in one tile moves on expanding in another tile, the first tile gets removed from cache, now we start popping edges from the queue that are in that tile, we have to skip them, we wasted CPU expanding in that tile. whats worse is we can some non-deterministic routing depending on the state of the cache at any one time, do you think the user wants a route that randomly differs from minute to minute?
  3. thor finds a path but before it goes to serialize that into a trip path object you yank the tiles from it, you wasted that time findign the route only to have to abort and fail the route because you cant do anything with it since you no longer have the data

As you can see, we can just keep playing little scenarios like this out over and over and every time it doesnt make sense to continue our work.

We need clear and trim to only happen when routing isn't already in progress in one of the threads. Basically we just need extra synchronization around cache cleanup and we dont need to try to "do something" inside the algorithms themselves.

Your Particular Problem: Routing on Regional Extracts

I'm pretty sure the scenario you are worrying about is the following:

You configure graphreader to cache tiles in memory, but the in memory cache is backed up by disk and the disk only has a subset of tiles from the planet on it so its backed up by a network call. Your cache hierarchy is like this:

memory -> disk -> network

To screw us even worse we have multiple threads sharing synchronized access to this tile cache and they all have the power to clear it whenever they want to keep your application within some reasonable memory/disk limits.

Now I get that you still want to route with the data you have and you want to be able to clear some of it to stay within the boundaries of what the configuration demands. The router already supports that so long as you understand you can only shrink back down to your limit if a route isnt in progress. For the reasons described above it doesnt make sense to have the algorithms constantly checking if the data they just saw still exists (for the same routing calculation).

Potential Solution

I believe you are using this library in another application where you have the individual workers (loki, thor, odin) and they share a thread synchronized tile cache via the graphreader. I also believe when you do a route or a map match etc you probably call trim/clear afterwards. I suggest you remove that call to trim the cache and instead spawn another thread whose sole purpose is to call trim/clear on the graphreader when its over its limits. I suggest that you add a shared lock with the other threads (one per thread so they dont block eachother) that do routing and map matching and that when they are doing routing or map matching they acquire the lock so that the trimming thread cannot clear any cache while a route/map match is in progress. In this way we dont have to make the algorithms robust for pretty much no gain and you can still make sure your application stays within the limits.

Copy link
Member

@kevinkreiser kevinkreiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still dont think this is the right way to go. It is pointless to make the router robust to "disappearing" tiles. We should be robust to running on regional tile sets (ie nullptrs are allowed) but tiles cant be there in one iteration of the loop and not be there in the next.

Please see my comments here: #3065 (comment)

@kevinkreiser
Copy link
Member

kevinkreiser commented May 21, 2021

if we assume that we need the router to deal with tiles blinking into and out of existence then we need to look at each one of these changes and determine can the algorithm even continue past this point if it cant get the tile. If it can we continue if it cant we throw and abort the computation.

For continuing some common cases would be:

  1. we pop an edgeid out of the queue but the tile that holds it is now gone. we can continue the algorithm but the results might be pretty weird
  2. ??

For aborting the computation common cases would be:

  1. loki finds edge candidates calls thor, thor cant get the tiles with the candidates, so we throw
  2. thor finds a path but while recovering it by following the edge labels it finds an edge in a tile it cant get, so we throw
  3. thor finds a path sends it off to triplegbuilder but when its iterating over it it cant get a tile, so we throw

@mskurydin @DenisPryt @SiarheiFedartsou it seems to me the only "recoverable" one above is when a tile goes away and we basically cut all search paths (expansions) that are in it out and continue. any other failure should throw and abort the computation immediately. i assume that if this happens you'll want to know that it did and retry, this time without disappearing tiles presumably?

if that is the case i kind of wonder if we should just always throw when we detect a tile disappeared out from under us and then retry. like maybe we make a method on graphreader that is called GetGraphTileOrThrow and we call it anywhere where we couldn't tolerate a tile being missing. basically all the places in the algorithm that we should never see an edge_id but not be able to get the tile that goes with it.

anyway i'll go through all of these and mark each type continue or throw if they are mishandled

src/thor/dijkstras.cc Outdated Show resolved Hide resolved
src/thor/dijkstras.cc Outdated Show resolved Hide resolved
src/thor/dijkstras.cc Outdated Show resolved Hide resolved
Copy link
Member

@kevinkreiser kevinkreiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i just went over the whole PR. please make the requested changes. after that we need to do a performance check in sever mode just to make sure we dont have any regressions there (hopefully branch prediction will help us there)

@DenisPryt
Copy link
Contributor Author

@kevinkreiser should I add new exception like tile_no_longer_available_error or just use std::runtime_error ?

@kevinkreiser
Copy link
Member

kevinkreiser commented May 24, 2021

@DenisPryt it would be a nice-to-have but i wouldnt say it is required. its your choice! we definitely need a changelog entry explaining that you made an enhancement to continue the routing computation in the case that a graphtile disappears from the tilecache

@DenisPryt
Copy link
Contributor Author

@kevinkreiser so now we need to make performance tests, right? How can I do it? Just make run-benchmarks?

@kevinkreiser
Copy link
Member

@DenisPryt different people do it in different ways. Perhaps @genadz could show you the way he and i have been doing it lately. or if you can wait until the evening i can run the check on my end. @danpat did have a way of running the algorithm benchmarks with larger datasets (like a server would) but i cant remember if its documented or not.

@DenisPryt
Copy link
Contributor Author

@kevinkreiser I've been downloading tiles for benchmarks since this morning. If it's not difficult, can you try to run a benchmark?

@kevinkreiser
Copy link
Member

Yep I can. I'll do it tonight

kevinkreiser
kevinkreiser previously approved these changes May 28, 2021
Copy link
Member

@kevinkreiser kevinkreiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other than the small nitpick, please at a changelog entry to record that you made this enhancement

i checked the performance and there was no impact as we had hoped. i am certain there are many many more of these types of fixes througout the code base that we will want to do (in loki and in other parts of the algorithms in thor) but we can get them in another pull request.

i do think that in the future we should consider adding to GraphReader a method which throws immediately if the tile is not found RequireGraphTile that returns a tile just like GetGraphTile does but throws an exception if its not there

@DenisPryt
Copy link
Contributor Author

@kevinkreiser done

@kevinkreiser
Copy link
Member

image

please try not to do this ^^ it makes the review process harder. we only allow squash merges on the repository anyway so rewriting history serves no purpose.

@kevinkreiser kevinkreiser merged commit db74bb5 into valhalla:master Jun 1, 2021
@DenisPryt
Copy link
Contributor Author

image

please try not to do this ^^ it makes the review process harder. we only allow squash merges on the repository anyway so rewriting history serves no purpose.

Of course. Sorry for that.

@DenisPryt DenisPryt deleted the fix_null_pointers branch June 2, 2021 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants