Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeoTIFF segfaulting on Mac ARM python bindings #4623

Open
baileyheading opened this issue Mar 8, 2024 · 40 comments · Fixed by #4628
Open

GeoTIFF segfaulting on Mac ARM python bindings #4623

baileyheading opened this issue Mar 8, 2024 · 40 comments · Fixed by #4628
Labels

Comments

@baileyheading
Copy link

baileyheading commented Mar 8, 2024

Thought I'd try out this new GeoTIFF feature on Mac ARM.

I'm running it through python bindings so I can't get a lot of debugging info.

Has anyone successfully ran this on an ARM build of any kind?

I'm doing regular isochrone request (which works), with "format":"geotiff" added

@baileyheading
Copy link
Author

baileyheading commented Mar 8, 2024

@nilsnolde would the bindings produced in valhalla be able to receive the GeoTIFF format? is it just a string in the response for isochrone instead of the contours?

EDIT: Upon looking at it, yeh it should be fine as it's literally just a json string

@baileyheading
Copy link
Author

baileyheading commented Mar 8, 2024

I confirmed that the error can be avoided by commenting out this. Wish I had a better way to debug this than rebuilding it all to a static library for python though.

// std::string ret = tyr::serializeIsochrones(request, intervals, grid);
std::string ret = R"({"test":"output"})";

EDIT 1: I see this should activate serializeGeoTIFF so I'll try and work out what's going on in there
EDIT 2: So it's called serialize, but it's actually making the GeoTIFF here? (including all the projections and everything)

@baileyheading
Copy link
Author

baileyheading commented Mar 8, 2024

I believe the segfault occurs in this line:
auto geotiff_dataset = geotiff_driver->Create(name.c_str(), ext_x, ext_y, nbands, GDT_UInt16, geotiff_options);

I added this in for my debugging. It hasn't found a gdal driver it seems.

throw valhalla_exception_t{599, "found the needle in the haystack"};

I'm trying to manually make sure it links in CMakeList

I outputted some things. I'm unsure if that link libraries ones is meant to find something or not.

-- GDAL::GDAL include directories: /opt/homebrew/Cellar/gdal/3.8.4_3/include
-- GDAL::GDAL link libraries: gdal_link_libraries-NOTFOUND

@kevinkreiser
Copy link
Member

kevinkreiser commented Mar 8, 2024

given what you've shown it seems like we continue with the build and supporting gdal even if we dont actually have full support on your system. seems like we just need to add a bit more protection/checking before continuing to enable it. i am confused about how it even links and runs!

also, excellent sleuthing!

@baileyheading
Copy link
Author

baileyheading commented Mar 8, 2024

I added this in: (the exception number is random, just copied from another valhalla_exception)

if(geotiff_driver == nullptr) {
// Handle error: Driver not found
throw valhalla_exception_t{599, "gdal driver not found"};
`}``

When it can't find the driver, it returns a null pointer as per: https://gdal.org/doxygen/classGDALDriverManager.html#ac26308c182440c8abd584040fa89bd4f

@baileyheading
Copy link
Author

given what you've shown it seems like we continue with the build and supporting gdal even if we dont actually have full support on your system. seems like we just need to add a bit more protection/checking before continuing to enable it. i am confused about how it even links and runs!

also, excellent sleuthing!

Makes sense. Then if someone is able to get it working, it can just start working in the future I guess

@baileyheading
Copy link
Author

baileyheading commented Mar 8, 2024

This is what I used in CMakeList. I don't know whether the GDAL::GDAL INTERFACE_LINK_LIBRARIES is actually meant to return something other than "gdal_link_libraries-NOTFOUND" as I really don't understand this process at all.

get_target_property(gdal_include_dirs GDAL::GDAL INTERFACE_INCLUDE_DIRECTORIES)
message(STATUS "GDAL::GDAL include directories: ${gdal_include_dirs}")

get_target_property(gdal_link_libraries GDAL::GDAL INTERFACE_LINK_LIBRARIES)
message(STATUS "GDAL::GDAL link libraries: ${gdal_link_libraries}")

@kevinkreiser
Copy link
Member

maybe gdal loads the drivers (separate libs) at runtime (which i was initially worried about in our implementation because its probably not too performant to do it on every request). your suggestion for checking at runtime is good but as you started doing further down, id rather check at configuration time and disable it then. if the gdal bin package is installed we can simply ask gdalinfo --formats what drivers are supported.. but since we dont ask for the bin package we could also just use a little test program OR perhaps there is another way to programatically get the information

@baileyheading
Copy link
Author

baileyheading commented Mar 9, 2024

maybe gdal loads the drivers (separate libs) at runtime (which i was initially worried about in our implementation because its probably not too performant to do it on every request). your suggestion for checking at runtime is good but as you started doing further down, id rather check at configuration time and disable it then. if the gdal bin package is installed we can simply ask gdalinfo --formats what drivers are supported.. but since we dont ask for the bin package we could also just use a little test program OR perhaps there is another way to programatically get the information

I've studied the implementation further and tested some more things. So, in the initial build stages, we're basically just finding the location of GDAL? and then it's not until we try and use it that it links it? Could we attempt to link it in the CMakeList.txt stage (as we seem to do with other binaries?). I think this is what you were suggesting more or less perhaps, but this way we could try and address problems linking it to various OS at that build stage (an otherwise disable it as you suggested)

EDIT: Or does it actually get linked elsewhere in the code? The docs mention a global singleton I think somewhere

@baileyheading
Copy link
Author

I've found this info on it.

So it should be linked in thor/worker? and it's just a pointer to that in serializeGeoTIFF?

#4594 (comment)

@kevinkreiser
Copy link
Member

yeah that was my suggestion to @chrstnbwnkl for keeping the driver loaded and around but again it would still have to happen at run time. imho during cmake we should find gdal and make sure that it can load the geotiff driver. it might be as easy as checking some cmake variables. let me take a look here...nope struck out on that. looking here:

https://formulae.brew.sh/api/formula/gdal.json

and also checking on linux i see gdal depends on libgeotiff. so now im wondering how did you get into this situation! apart from us detecting that you have gdal but its not been built with geotiff support, how did you get it without geotiff support!? what does gdalinfo --formats | grep -i tiff say?

@baileyheading
Copy link
Author

baileyheading commented Mar 10, 2024

yeah that was my suggestion to @chrstnbwnkl for keeping the driver loaded and around but again it would still have to happen at run time. imho during cmake we should find gdal and make sure that it can load the geotiff driver. it might be as easy as checking some cmake variables. let me take a look here...nope struck out on that. looking here:

https://formulae.brew.sh/api/formula/gdal.json

and also checking on linux i see gdal depends on libgeotiff. so now im wondering how did you get into this situation! apart from us detecting that you have gdal but its not been built with geotiff support, how did you get it without geotiff support!? what does gdalinfo --formats | grep -i tiff say?

GTiff -raster- (rw+vs): GeoTIFF
COG -raster- (wv): Cloud optimized GeoTIFF generator

It looks like it isn't finding it, or it is failing to link it (in Thor I guess?), as my GDAL does appear to have 'GTiff'

@nilsnolde
Copy link
Member

nilsnolde commented Mar 10, 2024

I just tried on linux with the bindings and I get the same segfault.. I know where the issue is now: we don't register the GTiff Driver properly, we only do it in valhalla_service and the tests only pass because each test is registering it itself. so it's all fine with your gdal installation I think.

that means currently anything but valhalla_service is broken. I didn't build the most recent stuff yet on the public instance, and seems I won't until we fix that ;) it also means that it should work for you as well with valhalla_service @baileyheading .

not sure where to exactly register the gtiff driver.. directly in isochrone_serializer?

@nilsnolde nilsnolde added bug and removed cleanup labels Mar 10, 2024
@nilsnolde nilsnolde changed the title GeoTIFF segfaulting on Mac ARM build GeoTIFF segfaulting on Mac ARM python bindings Mar 10, 2024
@baileyheading
Copy link
Author

Are we able to have it as some kind of global parameter?

But for the short term, is there a line or two of code we could insert into isochrone_serializer? @nilsnolde

@nilsnolde
Copy link
Member

you can try adding

#ifdef ENABLE_GDAL
#include <gdal_priv.h>
  GDALRegister_GTiff();
#endif

in isochrone_serializer.cc at the top and replace the current #include <gdal_priv.h> line. I think that should be fine, then we can get rid of gdal in valhalla_service as well and all executables would have the driver loaded.

fwiw, I don't think it's a problem to call GetGDALDriverManager() for every request, it's a global singleton. it's more important that we're only letting gdal load the driver only once, which we did already, just in the wrong place.

@baileyheading
Copy link
Author

you can try adding

#ifdef ENABLE_GDAL
#include <gdal_priv.h>
  GDALRegister_GTiff();
#endif

in isochrone_serializer.cc at the top and replace the current #include <gdal_priv.h> line. I think that should be fine, then we can get rid of gdal in valhalla_service as well and all executables would have the driver loaded.

fwiw, I don't think it's a problem to call GetGDALDriverManager() for every request, it's a global singleton. it's more important that we're only letting gdal load the driver only once, which we did already, just in the wrong place.

I'll give it a try and report back here. Ok, that doesn't sound too bad then. For the python bindings, will it also just load once for when actor is first initialised?

@baileyheading
Copy link
Author

valhalla/src/tyr/isochrone_serializer.cc:18:3: error: a type specifier is required for all declarations
GDALRegister_GTiff();

@nilsnolde

@baileyheading
Copy link
Author

baileyheading commented Mar 10, 2024

anyway, for now I put it at the top of serializeGeoTIFF() just to get it working

I get this error with the python bindings.

site-packages/valhalla/actor.py", line 34, in wrapped
response = func(args[0], encoded_req)

site-packages/valhalla/actor.py", line 88, in isochrone
return super().isochrone(req)
^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 18: invalid continuation byte

@baileyheading
Copy link
Author

baileyheading commented Mar 10, 2024

I get this error for "format": "geotiff" or "pbf" for the python bindings

EDIT 1: Oh, pbf is not implemented. I suspect the issue is that geotiff is not working properly. it returns "Unknown serialization error: II*" when I return the "data" variable from serializeGeoTIFF()

EDIT 2: oh again, it is? but the case for pbf itself is empty. it seems to be under a conditional for the json one..

@baileyheading
Copy link
Author

baileyheading commented Mar 10, 2024

So to clarify, it returns "II*" as a std::string in valhalla, instead of the json

EDIT: a little research suggested that it was just some escape characters causing the string to cutoff when I was outputting it in valhalla, as II is the start of a geoTIFF apparently

@baileyheading
Copy link
Author

@nilsnolde ok, I got it working by converting the output into base64 and making a json

@nilsnolde
Copy link
Member

Yeah bindings were neglected. We should def fix that before releasing. We'll probably have to do some stuff in the actor first..

@kevinkreiser
Copy link
Member

i slapped together (hopefully) a fix for this: #4628

regarding the second issue imho there are 2 problems there:

  1. the binding wrapper for actor returns either a string or a dict (loaded via json loads) depending on the input types. but the c++ can return either a string representing json or a protobuf representing a serialized pbf so doing json.loads is not correct though i get its a nice convenience
  2. now that we have other binary returns such as geotiff it means the c++ can return bytes and not just strings. python requires strings be utf8 so geotiffs will break this.

imho the bindings should just return bytes and we can let the caller decide if what they expected was json pbf or geotiff. however we have a problem with error handlign where if you give it input that it cant act on it will either return json or pbf. if you asked for geotiff and something went wrogn you'll get back json. im not sure we want to spend the time adding error messages to a geotiffs metadata hahahha so have to think of a way around this. only othe roption would be to allow the actor interface to also return a mime type

@nilsnolde
Copy link
Member

nilsnolde commented Mar 13, 2024

so doing json.loads is not correct though i get its a nice convenience

yeah I didn't think about the PBF format at all when I added those.. I'm a bit used to python's convenience stuff, it's just feels more right in python to not return bytes but the expected type, bytes or string or dict. the caller will have to decide either way as well and has the same problem with errors being JSON instead of bytes/string, as expected. since it's already prescribed what has to happen, i.e. each request will have to be handled in exactly one single way, it's not up to the caller to decide what to convert the bytes into to make sense of the output, I'd rather have that decision handled in the bindings so it comes for free to any caller. I didn't look into yet, but I'm sure it's pretty manageable.

is that ok for you @kevinkreiser ? I'd add that soon-ish if so.

@kevinkreiser
Copy link
Member

kevinkreiser commented Mar 13, 2024

yeah im just thinking its more work to manage at runtime. in either case i think you need to change the pybind11 code to return py::bytes. then yeah you can do the work in the actor.py wrapper to first look at the format thats expected and then if it was geotiff first check if the bytes look like a json error isntead of a tiff header. as you said manageable but its just another thing to have to remember to do rather than it being automatic is all.

i forgot to say feel free to do it though!

@nilsnolde
Copy link
Member

nilsnolde commented Mar 13, 2024

ah we don't expose the PBF input stuff in the python bindings. though technically you can still use via the hidden python_valhalla or from valhalla import _Actor and get a filled PBF back, but at least there's nothing breaking at the moment, it's just not really exposed. and PBF output, see below..

and FWIW, that json.loads thing is basically an overload: if you provide a string as input to the bindings functions, then you get the raw response from valhalla, so the issue can even be prevented today if you just call with e.g. route = actor.route(json.dumps(req_json)) where route should be a string. now reading @kevinkreiser 's answer, I'm not sure if that'll work though, as he says, pybind11 likely does some c++ -> py conversion as well where it converts to a python string which assumes UTF-8, but again not sure, that'd "break the bytes" in the python string object somehow or could still be saved to disk as is as GeoTIFF. maybe I'll try later if no one else knows for sure..

@nilsnolde
Copy link
Member

in either case i think you need to change the pybind11 code to return py::bytes.

yeah you're probably right

@nilsnolde
Copy link
Member

very much sounds like it: https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html#returning-c-strings-to-python

When a C++ function returns a std::string or char* to a Python caller, pybind11 will assume that the string is valid UTF-8 and will decode it to a native Python str, using the same API as Python uses to perform bytes.decode('utf-8'). If this implicit conversion fails, pybind11 will raise a UnicodeDecodeError.

@kevinkreiser
Copy link
Member

yeah sorry i should have said it with more conviction, i knew this stuff as a fact but wanted my suggestion to still come off as a suggestion rather than a demand 😄 communication is hard in text

@nilsnolde
Copy link
Member

haha wasn't sure. but your "I think" is more often than not actually "I know" I think 😄

just returning from the c++ interface a py::bytes would be a fine solution to me, and then do the bit of python wrapper to return it to the caller "properly" and wrap it into the expected (or unexpected) type. I'll see how to best communicate in case of errors.

@baileyheading
Copy link
Author

haha wasn't sure. but your "I think" is more often than not actually "I know" I think 😄

just returning from the c++ interface a py::bytes would be a fine solution to me, and then do the bit of python wrapper to return it to the caller "properly" and wrap it into the expected (or unexpected) type. I'll see how to best communicate in case of errors.

This issue was closed, but have the python bindings actually been changed yet to handle what is returned by geoTIFF? It doesn't run currently with latest 'master' in python.

@nilsnolde
Copy link
Member

No you’re right, it was automagically closed by the PR, rather unintended

@nilsnolde nilsnolde reopened this Mar 14, 2024
@baileyheading
Copy link
Author

I hacked in a fix by applying std::string data_base64 = base64_encode(data); so I could put it into a json without the exit characters breaking the json (at least I think that is what it solved)

Is there a more efficient way of doing this do you think? @nilsnolde

@nilsnolde
Copy link
Member

Yeah no need to make it base64, we just want to return raw bytes, see above

@baileyheading
Copy link
Author

baileyheading commented Mar 14, 2024

Yeah no need to make it base64, we just want to return raw bytes, see above

I did read this but would like to confirm, if we send raw bytes, we won't be able to send any info other than the geoTIFF? (ie. no meta data such as the request id etc)

@kevinkreiser
Copy link
Member

technically we can put meta data in the tiff but it's not yet implemented.

@baileyheading
Copy link
Author

baileyheading commented Mar 15, 2024

geoTIFF seems cool, but it doesn't seem that well supported to me. I've noticed GDAL can convert geoTIFF into PNG (so you can still get the map projections and stuff done before conversion), which is probably (from what I've seen at least) the most common format for displaying raster images on an interactive map (ie. leaflett), and then the bbox would need to be sent separately too. This could be another format type perhaps?

GDAL PNG Documentation

Another thing is it'd be good to support a number of different map projections, such as Web Mercator / Pseudo Mercator (EPSG:3857). This is a totally different issue though, so I should probably make this as feature request issue. Currently valhalla is using EPSG:4326 by the looks of it, which is not even the one used by most online maps right?

@kevinkreiser
Copy link
Member

the reason geotiff makes sense is that in contains the spatial referencing needed to display the image over a map. yes for raster imagery png and jpg are more common in tile-based systems however how could we make the isochrone endpoint tile-based? just thinking about it in a tiled system brings up a bunch of difficult questions like what:

what zoom level
what resolution
what about the tile pyramid
how do you simplify it for lower zooms
how do you overzoom it
when it covers more than one tile at a particular zoom how do you send multiple pngs to multiple requests for different tiles?

there is no practical way to support a slippy map with valhalla for this use-case because we'd have to keep state about what your isochrone was to return the specific tile you are requesting out of it and that is just out of the scope of the project at the moment (the service is stateless/sessionless from the perspective of the caller).

yes valhallas data is unprojected (epsg 4326) this is because valhalla is not really focused on the display use-case it is focused on routing applications. why would we spend the time and CPU on tons of trig functions to convert back to lat lon to do distance calculations (crucial for routing but not for display) when we could just keep the raw data in its raw form? it is true that since google embraced it, spherical mercator has taken over webmaps (mapquest used to use equirectangular projection before google was a thing though). but this has many drawbacks including cutting off the poles (how are santa claus or the penguins going to route?) but also distorts distance as it approaches the pole. also with modern tiled maps, the coordinates for each tiles contents are tile relative mercator with varying resolution by zoom level. we could return geometry in global mercator coordinates but you'd still have to translate it to map it to a pixel in a tile.

long story short, in my opinion it is ok and actually preferred for us to offer up rasters in unprojected form inside geotiff containers. it is definitely the most natural way for us to deliver the data. after that point it would be totally cool if the end user would use gdal_warp to reproject to mercator and then colorize it by threshold, followed up with gdal_retile.py to turn it into a tile pyramid and serve it to whatever thing they were doing.

@baileyheading
Copy link
Author

baileyheading commented Mar 15, 2024

These are some interesting points.

EDIT: gdal_warp and gdal_retile.py are especially cool ideas

@nilsnolde
Copy link
Member

And GeoTIFF is very widely supported, probably the widest supported geographical raster format. Leaflet is not a good reference anymore IMO, I’d recommend looking into openlayers, it’s (very) actively maintained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants