our cmake find modules are not robust to pkgconfig installations #4253

Merged
merged 22 commits into from
Jan 24, 2024

nilsnolde
Member

I tried for way too long to get a clean build of Valhalla on a fresh (more or less) CentOS 8, with dependencies installed only by vcpkg. After an initial struggle with the build environment, it started to build but couldn't link properly: it was all linking errors for libgeos and libspatialite. I went down the rabbit hole of brushing up my CMake knowledge and eventually came to the conclusion that our CMake Find modules for those 2 libraries for some reason don't link the dependencies of spatialite (quite a lot) and geos (mostly its own C++ -> C). I don't know why, since it found the libraries when CMake was configuring. I tried a few more things, but in the end it still couldn't properly link libz.a to spatialite and I just gave up. Though I'm quite sure I almost had it.. I really do wonder how the hell it's working so smoothly on Windows.

The below solution is to use pkg-config instead of relying on the CMake Find modules. That should work for all platforms/environments with pkg-config installed. Also Windows should be fine if vcpkg is used as a package manager. I don't have a strong opinion about merging this, I can keep it as a patch downstream.

I'd PR a bit of README soon, now that I have a vcpkg-only Valhalla build without any system package manager or manual builds (no ENABLE_SERVICES yet).

…n a fresh almalinux with only vcpkg dependencies. it's easiest to keep this as a patch for valhalla packaging and not contribute it upstream. this is just for demo
@nilsnolde
Member Author

OSX is segfaulting?! That's weird..

@nilsnolde nilsnolde mentioned this pull request Oct 21, 2023
9 tasks
@nilsnolde
Member Author

nilsnolde commented Jan 4, 2024

I tried hard to reproduce the CI failure for OSX, no chance.. Also, how can it even cause a segfault if nothing is being run?! This is what's happening while building with make -C build -j8:

[ 93%] Building Utrecht Tiles...
[ 93%] Linking CXX executable valhalla_assign_speeds
ld: warning: Linking with PIE, -image_base will be ignored
[ 93%] Built target valhalla_assign_speeds
[ 96%] Building C object test/CMakeFiles/valhalla_test.dir/__/third_party/lz4/lib/xxhash.c.o
make[2]: *** [test/data/utrecht_tiles/traffic.tar] Segmentation fault: 11
make[1]: *** [test/CMakeFiles/utrecht_tiles.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....

At that point it's just compiling right? The tiles are not actually being built here! EDIT: forget that, they're being built.. new round of debugging.. It seems to be pretty consistent in the segfault so far, still I have no idea what causes this. Also hard if it's not reproducible locally.. I have clang 14.x vs 13.x of CI, still..

Anyone got any idea? Or could even try this out on a local x64 Mac?

git clone https://github.com/valhalla/valhalla --branch nn-cmake-find-modules --single-branch
cd valhalla
cmake -B build_test
make -C build_test -j$(sysctl -n hw.logicalcpu)

@nilsnolde
Member Author

I have a feeling all those compiler warnings might've caught up with us..

@nilsnolde
Member Author

Or might just try the M1 one..

@nilsnolde
Member Author

nilsnolde commented Jan 4, 2024

oh jeez.. tried some more to rule out some clang bug:

  • SSH'd into the circleci machine and tried to build with gcc which resulted in an endless stream of linking errors with protobuf
  • tried clang 16 on my local linux machine and it succeeds (CI has 13.1.6)

So nothing ruled out, but honestly out of strategies.. I'll focus on M1.

@nilsnolde
Member Author

nilsnolde commented Jan 4, 2024

Finally I got a traceback from lldb:

2024/01/04 22:05:46.068765 [INFO] Finished ReclassifyFerryEdges: ferry_endpoint_count = 0, 0 edges reclassified. Failed both directions for 0 connections.
2024/01/04 22:05:46.215942 [INFO] Building 1 tiles with 1 threads...
Process 18970 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000000000000
error: memory read failed for 0x0
Target 0: (valhalla_build_tiles) stopped.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x0000000004801e1b libsqlite3.0.dylib`sqlite3Malloc + 64
    frame #2: 0x00000000048443ef libsqlite3.0.dylib`sqlite3HashInsert + 85
    frame #3: 0x0000000004811d57 libsqlite3.0.dylib`sqlite3FindFunction + 439
    frame #4: 0x0000000004811a48 libsqlite3.0.dylib`sqlite3CreateFunc + 531
    frame #5: 0x0000000004811629 libsqlite3.0.dylib`createFunctionApi + 213
    frame #6: 0x00000000048116a7 libsqlite3.0.dylib`sqlite3_create_function_v2 + 38
    frame #7: 0x0000000004df3fd0 libspatialite.7.dylib`register_spatialite_sql_functions + 69
    frame #8: 0x0000000004e3b8df libspatialite.7.dylib`spatialite_init_ex + 52
    frame #9: 0x000000000097fff5 valhalla_build_tiles`valhalla::mjolnir::make_spatialite_cache(handle=0x0000000005c171e0) at util.cc:180:3

seems like with pkg-config spatialite doesn't play ball with sqlite.. I upgraded Xcode now to 14.2.0, which is what I have locally on my x64 OSX. if that's it, I'll forever hate mac even more!

for reference, here are the commands for circleci ssh:

cd project
rm -r build
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_SINGLE_FILES_WERROR=OFF

make -C build -j$(sysctl -n hw.logicalcpu) run-gurka_admin_uk_override
# or tiles build
make -C build -j$(sysctl -n hw.logicalcpu) valhalla_build_tiles

cd build
# construct the JSON, so lldb works fine
./valhalla_build_config --mjolnir-tile-dir /Users/distiller/project/build/test/data/utrecht_tiles --mjolnir-tile-extract /Users/distiller/project/build/test/data/tiles.tar --mjolnir-admin /Users/distiller/project/test/data/netherlands_admin.sqlite --mjolnir-id-table-size 1000 --mjolnir-timezone /Users/distiller/project/build/test/data/tz.sqlite --mjolnir-include-construction true --mjolnir-concurrency 1 > valhalla.json

# this segfaults
./valhalla_build_tiles -c valhalla.json  -e build ../test/data/utrecht_netherlands.osm.pbf

lldb -- ./valhalla_build_tiles -c valhalla.json  -e build ../test/data/utrecht_netherlands.osm.pbf
# in lldb
run
bt

@nilsnolde
Member Author

nilsnolde commented Jan 4, 2024

last try with xcode 15.1.0. might also be the spatialite version, which was 5.0.1 on previous xcode versions, but is now 5.1.0.

ah right, robin hood has deprecations. so much for "good forever";)

@nilsnolde
Member Author

boost failed some shit too where clang must've removed some deprecated c++03 shite.. trying 1.83.0

@nilsnolde
Member Author

ok that finally seems to have worked! I have ZERO idea why this tiny change of using pkg-config to find libspatialite would segfault all of a sudden when nothing else changed.. but well, happy to see it's working with the latest version.

do note, this was only a problem with intel mac. ubuntu 22.04 works just fine with the older 5.0.1 version.

@@ -3,15 +3,15 @@ version: 2.1
 executors:
   macos:
     macos:
-      xcode: 13.4.1
+      xcode: 15.1.0
Member Author

this enables libspatialite 5.1.0 which seems to be needed on intel mac, with 5.0.1 there was a segfault, see #4253 (comment)

     environment:
       HOMEBREW_NO_AUTO_UPDATE: 1
       CXXFLAGS: -DGEOS_INLINE

 commands:
   mac_deps:
     steps:
-      - run: brew install protobuf cmake ccache libtool libspatialite pkg-config luajit curl wget czmq lz4 spatialite-tools unzip
+      - run: brew install autoconf automake protobuf cmake ccache libtool libspatialite pkg-config luajit curl wget czmq lz4 spatialite-tools unzip
Member Author

wasn't installed with the newer mac image apparently

mkdir -p build
cd build
cmake ..
- run: cmake -B build -DENABLE_SINGLE_FILES_WERROR=OFF
Member Author

with the new clang there are tons more warnings treated as errors

Member Author

also no way to enable this again anytime soon, it's also robin hood which causes the errors/warnings

CMakeLists.txt Outdated
@@ -134,7 +134,7 @@ find_package(Threads REQUIRED)
 find_package(ZLIB REQUIRED)

 # try to find an installed boost or install locally with conan
-set(boost_VERSION "1.71")
+set(boost_VERSION "1.80")
Member Author

should bump this to 1.83 too to be consistent with the conanfile.txt. might as well take the chance to upgrade to latest version IMO

 if(NOT MSVC)
-  set_property(TARGET CURL::CURL APPEND PROPERTY INTERFACE_LINK_LIBRARIES "${CURL_LIBRARIES}")
+  target_link_libraries(CURL::CURL INTERFACE ${CURL_LIBRARIES})
Member Author

bit easier to look at

@@ -192,7 +195,7 @@ if(NOT Protobuf_FOUND)
   set(CMAKE_FIND_PACKAGE_PREFER_CONFIG OFF)
 endif()

-message(STATUS "Using pbf headers from ${PROTOBUF_INCLUDE_DIR}")
+message(STATUS "Using pbf headers from ${PROTOBUF_INCLUDE_DIRS}")
Member Author

this is weird.. I compiled on intel mac and all of a sudden it wouldn't see the protobuf include dir anymore.. I dug a bit and found that protobuf-options.cmake was using the plural version. it seems to work on linux too, probably both are defined there

boost:without_context=True
boost:without_contract=True
boost:without_coroutine=True
boost:without_date_time=True
boost:without_exception=True
boost:without_fiber=True
boost:without_filesystem=True
Member Author

this is also pretty annoying.. it wouldn't let me "build" the header-only boost without including those. IIRC filesystem is actually a compiled library. but even so, I couldn't see that conan compiles anything, so it's probably just an optional dependency of some other library

Member Author

I'd like to move to vcpkg as soon as possible after merging this PR

@@ -416,7 +416,8 @@ TEST(UtilMidgard, TestTrimPolylineWithFloatGeoPoint) {
// Worst case is they may quantized at 1.69m intervals (for an epsilon change).
// https://stackoverflow.com/a/28420164
// The length comparisons below do better than that, but not a lot.
constexpr double MAX_FLOAT_PRECISION = 0.05; // Should be good for 5cm at this lon/lat
Member Author

linux-debug was fine, but osx failed with

[ RUN      ] UtilMidgard.TestTrimPolylineWithFloatGeoPoint
/Users/distiller/project/test/util_midgard.cc:451: Failure
The difference between length(clip.begin(), clip.end()) and length(line.begin(), line.end()) * 0.0001f is 0.056293416414652825, which exceeds MAX_FLOAT_PRECISION, where
length(clip.begin(), clip.end()) evaluates to 0.095041990280151367,
length(line.begin(), line.end()) * 0.0001f evaluates to 0.038748573865498542, and
MAX_FLOAT_PRECISION evaluates to 0.050000000000000003.
0.1% portion should be clipped
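
For context on the "1.69m intervals" figure in the test comment above, here is a minimal sketch of how coarse single-precision coordinates can get. It assumes the worst case is a longitude of magnitude 180 degrees and uses a rough conversion of about 111 km per degree; the helper names and the constant are illustrative and not part of Valhalla.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Assumed conversion factor (not taken from Valhalla): rough metres per
// degree of arc along a great circle.
constexpr double kMetersPerDegree = 111319.49;

// Gap between `degrees` and the next larger representable float, in degrees.
inline double float_ulp_degrees(float degrees) {
  assert(std::isfinite(degrees));
  const float next = std::nextafter(degrees, std::numeric_limits<float>::infinity());
  return static_cast<double>(next) - static_cast<double>(degrees);
}

// The same gap expressed as an approximate ground distance in metres.
inline double float_ulp_meters(float degrees) {
  return float_ulp_degrees(degrees) * kMetersPerDegree;
}
```

Calling `float_ulp_meters(180.0f)` gives roughly 1.7 m, since floats between 128 and 256 are spaced 2^-16 apart. Coordinates of smaller magnitude are spaced more finely, which is presumably why the test uses a much tighter tolerance at its own lon/lat.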

@nilsnolde
Member Author

nilsnolde commented Jan 5, 2024

OH WTFFF.. 2 admin tests fail too:

/bin/bash: line 1: 28479 Segmentation fault: 11  /Users/distiller/project/build/test/gurka/gurka_admin_uk_override >&/Users/distiller/project/build/test/gurka/gurka_admin_uk_override.log
[FAIL] gurka_admin_uk_override

/bin/bash: line 1: 28471 Segmentation fault: 11  /Users/distiller/project/build/test/gurka/gurka_admin_sidewalk_crossing_override >&/Users/distiller/project/build/test/gurka/gurka_admin_sidewalk_crossing_override.log
[FAIL] gurka_admin_sidewalk_crossing_override

Of course locally it doesn't fail.. Don't know what to do anymore for real.. FWIW, these are the cmake configs:

local cmake config
[cmake] -- The CXX compiler identification is AppleClang 14.0.0.14000029
[cmake] -- The C compiler identification is AppleClang 14.0.0.14000029
[cmake] -- Detecting CXX compiler ABI info
[cmake] -- Detecting CXX compiler ABI info - done
[cmake] -- Check for working CXX compiler: /usr/local/opt/ccache/libexec/clang++ - skipped
[cmake] -- Detecting CXX compile features
[cmake] -- Detecting CXX compile features - done
[cmake] -- Detecting C compiler ABI info
[cmake] -- Detecting C compiler ABI info - done
[cmake] -- Check for working C compiler: /usr/local/opt/ccache/libexec/clang - skipped
[cmake] -- Detecting C compile features
[cmake] -- Detecting C compile features - done
[cmake] -- Found PkgConfig: /usr/local/bin/pkg-config (found version "0.29.2") 
[cmake] -- Configuring in release mode
[cmake] -- Using ccache to speed up incremental builds
[cmake] -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
[cmake] -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
[cmake] -- Found Threads: TRUE  
[cmake] -- Found ZLIB: /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/libz.tbd (found version "1.2.11")  
[cmake] -- No compatible boost version detected, using conan...
[cmake] -- Conan: checking conan executable
[cmake] -- Conan: Found program /usr/local/bin/conan
[cmake] -- Conan: Version found Conan version 1.62.0
[cmake] -- Conan executing: /usr/local/bin/conan install /Users/nilsnolde/dev/cpp/valhalla/conanfile.txt --remote conancenter --settings build_type=Release --settings compiler=apple-clang --settings compiler.version=14.0 --settings compiler.libcxx=libc++ --settings compiler.cppstd=17
[cmake] -- Conan: Using autogenerated FindBoost.cmake
[cmake] -- Found Boost: 1.71.0 (found suitable version "1.71.0", minimum required is "1.71") 
[cmake] -- Found CURL: /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/libcurl.tbd (found version "7.85.0")  
[cmake] -- Found cURL: /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/usr/lib/libcurl.tbd
[cmake] -- Found Protobuf: /usr/local/bin/protoc-25.1.0 (found version "25.1.0") 
[cmake] -- Using pbf headers from /usr/local/include
[cmake] -- Using pbf libs from /usr/local/lib/libprotobuf.25.1.0.dylib
[cmake] -- Using pbf release libs from 
[cmake] -- Using pbf debug libs from 
[cmake] -- Using pbf-lite
[cmake] -- Checking for module 'libprime_server>=0.6.3'
[cmake] --   Found libprime_server, version 0.7.0
[cmake] -- Found SQLite3: /usr/local/opt/sqlite3/lib/libsqlite3.dylib  
[cmake] -- Found SQLite3: /usr/local/opt/sqlite3/lib/libsqlite3.dylib
[cmake] -- Looking for sqlite3_enable_load_extension in /usr/local/opt/sqlite3/lib/libsqlite3.dylib
[cmake] -- Looking for sqlite3_enable_load_extension in /usr/local/opt/sqlite3/lib/libsqlite3.dylib - found
[cmake] -- Checking for module 'spatialite'
[cmake] --   Found spatialite, version 5.1.0
[cmake] -- Found LuaJIT: /usr/local/lib/libluajit-5.1.dylib (found version "2.1.0-beta3") 
[cmake] -- Performing Test LIBCXX_SUPPORTS_MFPMATH_EQ_SSE_FLAG
[cmake] -- Performing Test LIBCXX_SUPPORTS_MFPMATH_EQ_SSE_FLAG - Success
[cmake] -- Performing Test LIBCXX_SUPPORTS_MSSE2_FLAG
[cmake] -- Performing Test LIBCXX_SUPPORTS_MSSE2_FLAG - Success
[cmake] -- Found Python: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/bin/python3.9 (found version "3.9.6") found components: Development Interpreter Development.Module Development.Embed 
[cmake] -- pybind11 v2.11.1 
[cmake] -- Performing Test HAS_FLTO
[cmake] -- Performing Test HAS_FLTO - Success
[cmake] -- Performing Test HAS_FLTO_THIN
[cmake] -- Performing Test HAS_FLTO_THIN - Success
[cmake] -- Installing python modules to /Library/Python/3.9/site-packages
[cmake] -- Checking for module 'geos'
[cmake] --   Found geos, version 3.12.1
[cmake] -- Found Python: /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/bin/python3.9 (found version "3.9.6") found components: Interpreter
CI cmake config
-- The CXX compiler identification is AppleClang 15.0.0.15000100
-- The C compiler identification is AppleClang 15.0.0.15000100
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode-15.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Applications/Xcode-15.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found PkgConfig: /usr/local/bin/pkg-config (found version "0.29.2") 
-- No build type specified, defaulting to Release
-- Configuring in release mode
-- Using ccache to speed up incremental builds
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found ZLIB: /Applications/Xcode-15.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/lib/libz.tbd (found version "1.2.12")  
-- No compatible boost version detected, using conan...
-- Conan: checking conan executable
-- Conan: Found program /Users/distiller/.pyenv/shims/conan
-- Conan: Version found Conan version 1.62.0
-- Conan executing: /Users/distiller/.pyenv/shims/conan install /Users/distiller/project/conanfile.txt --remote conancenter --settings build_type=Release --settings compiler=apple-clang --settings compiler.version=15.0 --settings compiler.libcxx=libc++ --settings compiler.cppstd=17
ERROR: Not able to automatically detect '/Applications/Xcode-15.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc' version
WARN: Remotes registry file missing, creating default one in /Users/distiller/.conan/remotes.json
-- Conan: Using autogenerated FindBoost.cmake
-- Found Boost: 1.83.0 (found suitable version "1.83.0", minimum required is "1.80") 
-- Found CURL: /Applications/Xcode-15.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/lib/libcurl.tbd (found version "8.4.0")  
-- Found cURL: /Applications/Xcode-15.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/lib/libcurl.tbd
-- Found Protobuf: /usr/local/bin/protoc-25.1.0 (found version "25.1.0") 
-- Using pbf headers from /usr/local/include
-- Using pbf libs from /usr/local/lib/libprotobuf.25.1.0.dylib
-- Using pbf release libs from 
-- Using pbf debug libs from 
-- Using pbf-lite
-- Checking for module 'libprime_server>=0.6.3'
--   Found libprime_server, version 0.7.1
-- Found SQLite3: /usr/local/opt/sqlite3/lib/libsqlite3.dylib  
-- Found SQLite3: /usr/local/opt/sqlite3/lib/libsqlite3.dylib
-- Looking for sqlite3_enable_load_extension in /usr/local/opt/sqlite3/lib/libsqlite3.dylib
-- Looking for sqlite3_enable_load_extension in /usr/local/opt/sqlite3/lib/libsqlite3.dylib - found
-- Checking for module 'spatialite'
--   Found spatialite, version 5.1.0
-- Found LuaJIT: /usr/local/lib/libluajit-5.1.dylib (found version "2.1.1703358377") 
-- Performing Test LIBCXX_SUPPORTS_MFPMATH_EQ_SSE_FLAG
-- Performing Test LIBCXX_SUPPORTS_MFPMATH_EQ_SSE_FLAG - Success
-- Performing Test LIBCXX_SUPPORTS_MSSE2_FLAG
-- Performing Test LIBCXX_SUPPORTS_MSSE2_FLAG - Success
-- Found Python: /Users/distiller/.pyenv/shims/python3.11 (found version "3.11.7") found components: Development Interpreter Development.Module Development.Embed 
-- pybind11 v2.11.1 
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Performing Test HAS_FLTO_THIN
-- Performing Test HAS_FLTO_THIN - Success
-- Installing python modules to /Users/distiller/.pyenv/versions/3.11.7/lib/python3.11/site-packages
-- Checking for module 'geos'
--   Found geos, version 3.12.1
-- Found Python: /Users/distiller/.pyenv/shims/python3.11 (found version "3.11.7") found components: Interpreter 
-- Failed to find LLVM FileCheck
-- Found Git: /usr/local/bin/git (found version "2.43.0") 
-- git version: v1.8.3-8-gf30c99a7 normalized to 1.8.3.8
-- Google Benchmark version: 1.8.3.8
-- Looking for shm_open in rt
-- Looking for shm_open in rt - not found
-- Performing Test HAVE_CXX_FLAG_WALL
-- Performing Test HAVE_CXX_FLAG_WALL - Success
-- Performing Test HAVE_CXX_FLAG_WEXTRA
-- Performing Test HAVE_CXX_FLAG_WEXTRA - Success
-- Performing Test HAVE_CXX_FLAG_WSHADOW
-- Performing Test HAVE_CXX_FLAG_WSHADOW - Success
-- Performing Test HAVE_CXX_FLAG_WFLOAT_EQUAL
-- Performing Test HAVE_CXX_FLAG_WFLOAT_EQUAL - Success
-- Performing Test HAVE_CXX_FLAG_WOLD_STYLE_CAST
-- Performing Test HAVE_CXX_FLAG_WOLD_STYLE_CAST - Success
-- Performing Test HAVE_CXX_FLAG_WERROR
-- Performing Test HAVE_CXX_FLAG_WERROR - Success
-- Performing Test HAVE_CXX_FLAG_WSUGGEST_OVERRIDE
-- Performing Test HAVE_CXX_FLAG_WSUGGEST_OVERRIDE - Success
-- Performing Test HAVE_CXX_FLAG_PEDANTIC
-- Performing Test HAVE_CXX_FLAG_PEDANTIC - Success
-- Performing Test HAVE_CXX_FLAG_PEDANTIC_ERRORS
-- Performing Test HAVE_CXX_FLAG_PEDANTIC_ERRORS - Success
-- Performing Test HAVE_CXX_FLAG_WSHORTEN_64_TO_32
-- Performing Test HAVE_CXX_FLAG_WSHORTEN_64_TO_32 - Success
-- Performing Test HAVE_CXX_FLAG_FSTRICT_ALIASING
-- Performing Test HAVE_CXX_FLAG_FSTRICT_ALIASING - Success
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED_DECLARATIONS
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED_DECLARATIONS - Success
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED - Success
-- Performing Test HAVE_CXX_FLAG_WSTRICT_ALIASING
-- Performing Test HAVE_CXX_FLAG_WSTRICT_ALIASING - Success
-- Performing Test HAVE_CXX_FLAG_WD654
-- Performing Test HAVE_CXX_FLAG_WD654 - Failed
-- Performing Test HAVE_CXX_FLAG_WTHREAD_SAFETY
-- Performing Test HAVE_CXX_FLAG_WTHREAD_SAFETY - Success
-- Enabling additional flags: -DINCLUDE_DIRECTORIES=/Users/distiller/project/third_party/benchmark/include
-- Compiling and running to test HAVE_THREAD_SAFETY_ATTRIBUTES
-- Performing Test HAVE_THREAD_SAFETY_ATTRIBUTES -- success
-- Performing Test HAVE_CXX_FLAG_COVERAGE
-- Performing Test HAVE_CXX_FLAG_COVERAGE - Success
-- Compiling and running to test HAVE_STD_REGEX
-- Performing Test HAVE_STD_REGEX -- success
-- Compiling and running to test HAVE_GNU_POSIX_REGEX
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
-- Compiling and running to test HAVE_POSIX_REGEX
-- Performing Test HAVE_POSIX_REGEX -- success
-- Compiling and running to test HAVE_STEADY_CLOCK
-- Performing Test HAVE_STEADY_CLOCK -- success
-- Compiling and running to test HAVE_PTHREAD_AFFINITY
-- Performing Test HAVE_PTHREAD_AFFINITY -- failed to compile
-- Configuring done (25.1s)
-- Generating done (3.2s)
-- Build files have been written to: /Users/distiller/project/build

@nilsnolde
Member Author

If anyone has any inclination to help out/advice/fix, I'd appreciate it!

@nilsnolde
Member Author

nilsnolde commented Jan 5, 2024

Some further investigation:

Curiously it only fails on a Release build, Debug passes the test easily.. So I built gurka_admin_sidewalk_crossing_override with RelWithDebInfo and it's the weirdest thing. When we stitch together the admin areas with GEOS, this test somehow causes the following code to think that there are inner polygons when there really aren't:

auto* outer_ring = geos_helper_t::from_striped_container(polygon.outer());
std::vector<GEOSGeometry*> inner_rings;
inner_rings.reserve(polygon.inners().size());
for (const auto& inner : polygon.inners())
inner_rings.push_back(geos_helper_t::from_striped_container(inner));

The actual segfault happens inside from_striped_container() (called from line 146, see the traceback below) when doing GEOSCoordSequence* geos_coords = GEOSCoordSeq_create(inner.size(), 2);.

See the map here, there's only one outer polygon with a few ways, no inners:

https://github.com/valhalla/valhalla/blob/7eb478460a98bb76d91d7d3bd1a9b2a2e3fe4d63/test/gurka/test_admin_sidewalk_crossing_override.cc#L22C1-L31

I stepped through the code with lldb and it goes right into line 146 above! I checked locally on Linux and of course there are no inners, so locally it skips the loop as it should. Unfortunately I can't check variable values properly with the optimized debug build on circleci, but with the little I can check it doesn't seem like it somehow added an inner for some reason. It all points to that inners() container having 0 size.

What's going on here?

GEOS segfault traceback circleci
2024/01/05 13:26:04.173257 [INFO] Created admin table.
Process 34368 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000000047a77e gurka_admin_sidewalk_crossing_override`GEOSGeom_t* (anonymous namespace)::geos_helper_t::from_striped_container<boost::geometry::model::ring<valhalla::midgard::GeoPoint<double>, true, true, std::__1::vector, std::__1::allocator>>(boost::geometry::model::ring<valhalla::midgard::GeoPoint<double>, true, true, std::__1::vector, std::__1::allocator> const&) [inlined] std::__1::vector<valhalla::midgard::GeoPoint<double>, std::__1::allocator<valhalla::midgard::GeoPoint<double>>>::size[abi:v160006](this=0x0000000000000000) const at vector:546:46 [opt]
   543 	
   544 	    _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI
   545 	    size_type size() const _NOEXCEPT
-> 546 	        {return static_cast<size_type>(this->__end_ - this->__begin_);}
   547 	    _LIBCPP_CONSTEXPR_SINCE_CXX20 _LIBCPP_HIDE_FROM_ABI
   548 	    size_type capacity() const _NOEXCEPT
   549 	        {return static_cast<size_type>(__end_cap() - this->__begin_);}
Target 0: (gurka_admin_sidewalk_crossing_override) stopped.
warning: gurka_admin_sidewalk_crossing_override was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x000000000047a77e gurka_admin_sidewalk_crossing_override`GEOSGeom_t* (anonymous namespace)::geos_helper_t::from_striped_container<boost::geometry::model::ring<valhalla::midgard::GeoPoint<double>, true, true, std::__1::vector, std::__1::allocator>>(boost::geometry::model::ring<valhalla::midgard::GeoPoint<double>, true, true, std::__1::vector, std::__1::allocator> const&) [inlined] std::__1::vector<valhalla::midgard::GeoPoint<double>, std::__1::allocator<valhalla::midgard::GeoPoint<double>>>::size[abi:v160006](this=0x0000000000000000) const at vector:546:46 [opt]
    frame #1: 0x000000000047a77e gurka_admin_sidewalk_crossing_override`GEOSGeom_t* (anonymous namespace)::geos_helper_t::from_striped_container<boost::geometry::model::ring<valhalla::midgard::GeoPoint<double>, true, true, std::__1::vector, std::__1::allocator>>(coords=0x0000000000000000) at adminbuilder.cc:47:65 [opt]
    frame #2: 0x0000000000473e16 gurka_admin_sidewalk_crossing_override`valhalla::mjolnir::BuildAdminFromPBF(boost::property_tree::basic_ptree<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&) at adminbuilder.cc:146:27 [opt]
    frame #3: 0x0000000000473d4f gurka_admin_sidewalk_crossing_override`valhalla::mjolnir::BuildAdminFromPBF(boost::property_tree::basic_ptree<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&) at adminbuilder.cc:349:5 [opt]
    frame #4: 0x00000000004734ff gurka_admin_sidewalk_crossing_override`valhalla::mjolnir::BuildAdminFromPBF(pt=<unavailable>, input_files=size=1) at adminbuilder.cc:546:25 [opt]

@kevinkreiser
Member

all this confirms to me is that we should in fact completely remove boost geometry from the equation. we shouldn't have started using it anywhere in the project but sadly we didn't know what was going on with its development. one less dependency to wonder about would be good too. why do i say that?

to me what you are saying sounds a lot like boosts container has random garbage, when default initialized, in the inners. default initialization will be zero'd out in a debug build but not in a release build. maybe you can find a way to zero initialize our use of the boost objects but honestly i wish i had continued my quest here: #3863

@nilsnolde
Member Author

sounds a lot like boosts container has random garbage, when default initialized, in the inners

those containers are all std::vector, there shouldn't be any garbage right? I don't really know what to try and explicitly initialize, but I tried one container, let's see..

@nilsnolde
Member Author

nilsnolde commented Jan 5, 2024

if that doesn't work, I'll try another boost version.. if it's not that, it must be some shit with clang?!

the only suspicious block might be this one:

  for (auto& poly : polys) {
    multipolygon_t buffered;
    // here the postponed_inners vector wasn't explicitly initialized, but its class `poly` was default initialized
    // and inners() also returns an explicitly default initialized vector
    // could swap() be problematic? no right?
    poly.polygon.inners().swap(poly.postponed_inners);
    buffer_polygon(poly.polygon, multipolygon);
  }

@kevinkreiser
Member

no you are right, vector should be immune to default initialization problems
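
To make that point concrete, here is a small sketch showing that `std::vector` members of a default-initialized struct start out empty, that swapping two empty vectors is well defined, and that a range-for over the result runs zero times. The types below are hypothetical stand-ins for the ones in adminbuilder.cc, not the real Valhalla or Boost.Geometry types.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the polygon types; names are illustrative only.
using ring_t = std::vector<std::pair<double, double>>;

struct polygon_data_t {
  std::vector<ring_t> inners;           // plays the role of polygon.inners()
  std::vector<ring_t> postponed_inners; // never assigned explicitly
};

// Returns how many times a range-for body runs over the inner rings
// after swapping in the (empty) postponed inners.
inline std::size_t count_inner_iterations() {
  polygon_data_t poly; // default-initialized: both vectors are constructed empty
  poly.inners.swap(poly.postponed_inners);
  std::size_t iterations = 0;
  for (const auto& inner : poly.inners) {
    (void)inner;
    ++iterations;
  }
  assert(poly.inners.empty() && poly.postponed_inners.empty());
  return iterations;
}
```

By the language rules `count_inner_iterations()` returns 0 at any optimization level, which is consistent with the swap() in the block above not being the culprit.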

inner_rings.push_back(geos_helper_t::from_striped_container(inner));
// there is some weird AppleClang bug where it would iterate over non-existing elements
// in the inners() vector and eventually segfault in from_striped_container()
std::cout << "Size of polygon.inners(): " << polygon.inners().size() << std::endl;
Member Author

I'm not kidding: this line printing to stdout is actually making the tests pass! Compiler bug wasting many hours of my life..

@nilsnolde
Member Author

After pulling my hair out on this one, I migrated to M1 after all: #4500. I expected it to be more work, but it seems to just work :)

@nilsnolde nilsnolde mentioned this pull request Jan 8, 2024
@nilsnolde nilsnolde mentioned this pull request Jan 21, 2024
@nilsnolde
Member Author

this is also ready finally

@kevinkreiser kevinkreiser merged commit 8074e8e into master Jan 24, 2024
1 check passed
@kevinkreiser kevinkreiser deleted the nn-cmake-find-modules branch January 24, 2024 14:01