Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Arm Builds and Add CI Workflow for Arm #4213

Merged
merged 70 commits into from
Oct 19, 2023
Merged

Fix Arm Builds and Add CI Workflow for Arm #4213

merged 70 commits into from
Oct 19, 2023

Conversation

kevinkreiser
Copy link
Member

circle ci has support for arm. the docs say its 20.04 which is not what we target on x86 but at the very least i plan to get the code actually building on arm. locally ive built it on arm just fine but it segfaults during valhalla_build_tiles. we'll see what the matter is and see if we can patch it up

@Asher-JH
Copy link

Asher-JH commented Aug 9, 2023

How's it going any luck so far 🤔

@nilsnolde
Copy link
Member

This will not solve your problem, it’s „just“ about testing on arm64. Once GHA adds a M1/2 runner for public use, we can think about distributing Docker images for M1. The docker build already takes 30 mins which a bit annoying on master merges. Cross compiling would easily double that.

GH is still has it on their roadmap for end of 2023.

@kevinkreiser
Copy link
Member Author

this does seem to reproduce many of the issues we have on m1s though. i got it building on arm linux but a bunch of tests fail. most of them are just that the expected shape of a route doesnt quite match (small differences in floating point numbers) but there are a few that are more substantial. isochrones test fails in a way that makes the outputs seem quite different and the gtfs tests do some boost polygon crap that doesnt seem to work at all (why do we even use that, we know already its not aware of spherical geom imho we should remove it).

@kevinkreiser
Copy link
Member Author

kevinkreiser commented Oct 11, 2023

so i decided to focus most of my time on the isochrones issue since it seems to be the one with the most egregious output. what i did was build the run-isochrone test target on both aarch64 and amd64 platforms. i then took the data created on the aarach64 platform, copied it over to the amd64 platform and ran test/isochrone manually. here's the output:

kkreiser@L50X1P73:~/sandbox/valhalla/build$ test/isochrone 
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from Isochrones
[ RUN      ] Isochrones.Basic
/home/kkreiser/sandbox/valhalla/test/isochrone.cc:96: Failure
Expected equality of these values:
  actual_geom.Size()
    Which is: 271
  expected_geom.Size()
    Which is: 272
/home/kkreiser/sandbox/valhalla/test/isochrone.cc:96: Failure
Expected equality of these values:
  actual_geom.Size()
    Which is: 271
  expected_geom.Size()
    Which is: 272
[  FAILED  ] Isochrones.Basic (89 ms)
[ RUN      ] Isochrones.OriginEdge
[          ] generating map PBF at test/data/isochrones/origin_edge/map.pbf
[          ] building tiles in test/data/isochrones/origin_edge
[          ] isochrone with mjolnir.tile_dir = test/data/isochrones/origin_edge with locations b with costing pedestrian
[          ] Valhalla request is: {"locations":[{"lon":0.03593248178978409,"lat":0.0,"type":"break"}],"costing":"pedestrian","costing_options":{"pedestrian":{"speed_types":["freeflow","constrained","predicted"]}},"verbose":true,"contours":[{"time":"10"}],"shape_match":"map_snap"}
[       OK ] Isochrones.OriginEdge (394 ms)
[ RUN      ] Isochrones.LongEdge
[          ] generating map PBF at test/data/isochrones/long_edge/map.pbf
[          ] building tiles in test/data/isochrones/long_edge
[          ] isochrone with mjolnir.tile_dir = test/data/isochrones/long_edge with locations a with costing pedestrian
[          ] Valhalla request is: {"locations":[{"lon":0.0,"lat":-0.0017966240891925996,"type":"break"}],"costing":"pedestrian","costing_options":{"pedestrian":{"speed_types":["freeflow","constrained","predicted"]}},"verbose":true,"contours":[{"time":"15"}],"shape_match":"map_snap"}
[       OK ] Isochrones.LongEdge (393 ms)
[ RUN      ] Isochrones.test_clear_reserved_memory
[       OK ] Isochrones.test_clear_reserved_memory (0 ms)
[ RUN      ] Isochrones.test_max_reserved_labels_count
[       OK ] Isochrones.test_max_reserved_labels_count (0 ms)
[----------] 5 tests from Isochrones (877 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (877 ms total)
[  PASSED  ] 4 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Isochrones.Basic

I can check the actual output just to see what it looks like but based on the above log, it looks like the data that the arm machine created, when run through the algorithm on the amd machine works as we expect it to. This means that we can trust the whole data creation pipeline and can focus on where the algorithm might be screwing up on the arm platform. my bet is all the iterator unordered map shenanigans i did to turn the output of the conrec algorithm into actual polygons/lines.

in fact, i recently refactored the whole admins stuff to use geos' c api and in doing so i rewrote the code that pieces segments together into rings. i was very happy at the simplification of this algorithm (though geos has its own version of linemerging) and i thought why didnt it do it this way in the isochrone generation code. i might take a swag at that first as a solution.

just for completeness here is the output of the isochrone test when run completely within the arm machine:

[FAIL] isochrone
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from Isochrones
[ RUN      ] Isochrones.Basic
/home/ubuntu/valhalla/test/isochrone.cc:96: Failure
Expected equality of these values:
  actual_geom.Size()
    Which is: 4
  expected_geom.Size()
    Which is: 356
/home/ubuntu/valhalla/test/isochrone.cc:96: Failure
Expected equality of these values:
  actual_geom.Size()
    Which is: 5
  expected_geom.Size()
    Which is: 272
/home/ubuntu/valhalla/test/isochrone.cc:96: Failure
Expected equality of these values:
  actual_geom.Size()
    Which is: 4
  expected_geom.Size()
    Which is: 272
[  FAILED  ] Isochrones.Basic (269 ms)
[ RUN      ] Isochrones.OriginEdge
[          ] generating map PBF at test/data/isochrones/origin_edge/map.pbf
[          ] building tiles in test/data/isochrones/origin_edge
[          ] isochrone with mjolnir.tile_dir = test/data/isochrones/origin_edge with locations b with costing pedestrian
[          ] Valhalla request is: {"locations":[{"lon":0.03593248178978409,"lat":0.0,"type":"break"}],"costing":"pedestrian","costing_options":{"pedestrian":{"speed_types":["freeflow","constrained","predicted"]}},"verbose":true,"contours":[{"time":"10"}],"shape_match":"map_snap"}
/home/ubuntu/valhalla/test/isochrone.cc:188: Failure
Expected equality of these values:
  within(WaypointToBoostPoint("b"), polygon)
    Which is: false
  true
[  FAILED  ] Isochrones.OriginEdge (1330 ms)
[ RUN      ] Isochrones.LongEdge
[          ] generating map PBF at test/data/isochrones/long_edge/map.pbf
[          ] building tiles in test/data/isochrones/long_edge
[          ] isochrone with mjolnir.tile_dir = test/data/isochrones/long_edge with locations a with costing pedestrian
[          ] Valhalla request is: {"locations":[{"lon":0.0,"lat":-0.0017966240891925996,"type":"break"}],"costing":"pedestrian","costing_options":{"pedestrian":{"speed_types":["freeflow","constrained","predicted"]}},"verbose":true,"contours":[{"time":"15"}],"shape_match":"map_snap"}
[       OK ] Isochrones.LongEdge (1294 ms)
[ RUN      ] Isochrones.test_clear_reserved_memory
[       OK ] Isochrones.test_clear_reserved_memory (0 ms)
[ RUN      ] Isochrones.test_max_reserved_labels_count
[       OK ] Isochrones.test_max_reserved_labels_count (0 ms)
[----------] 5 tests from Isochrones (2894 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (2894 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] Isochrones.Basic
[  FAILED  ] Isochrones.OriginEdge

 2 FAILED TESTS
make[3]: *** [test/CMakeFiles/run-isochrone.dir/build.make:73: test/isochrone.log] Error 1
make[3]: *** Deleting file 'test/isochrone.log'
make[2]: *** [CMakeFiles/Makefile2:8718: test/CMakeFiles/run-isochrone.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:8725: test/CMakeFiles/run-isochrone.dir/rule] Error 2
make: *** [Makefile:3699: run-isochrone] Error 2

you'll see that the rings it creates are very small in terms of number of vertices compared to the expected answers

@kevinkreiser
Copy link
Member Author

alright after reviewing the isochrones geometry generation code i added some informational printing to it. the good news about the algorithm is that its deterministic in its iteration and doesnt rely on unordered datastructures for that. its scans by contour line value and by grid cell so the output of the algorithms can be directly compared.

so what i first did was made it output whenever it generated a line segment from a particular gird cell and diffed the line segments generated on both platforms. as we've seen in other tests, they are only off by tiny amounts of floating point wiggle (0.000001):

image

yet the output of the above arm run suggests that the algorithm is unable to connect the segments to each other. let me see if i can prove that... what i will do is annotate each segment with what happens to it. there are a number of options for that:

  • the segment completes a ring
  • the segment bridges two other segments
  • the segment is appended to another segment
  • the segment is prepended to another segment
  • the segment is an orphan (at least temporarily)

my hypothesis is that on arm many more segments stay orphans because somehow looking up connectable segments in the map doesnt work (it uses direct pointll equality). looking at the logs we get pretty convincing numbers regarding teh hypothesis:

image

we can look at the logs in detail and reason about why a segment becomes an orphan when it shouldnt... and there you see it right off the bat:

image

the second segment on the arm platform should be prepended to the first one as it is on the amd platform but it gets orphaned instead. i can only imagine this is because of floating point noise on the end of point which is very unfortunate...

@kevinkreiser
Copy link
Member Author

ok yep if i round the points on the segments to the nearest 7 decimal places the test still fails but its only off by one coordinate meaning its acceptable.

so a quick run down of whats left to do:

  1. unbork CI so that all the builds run again including the arm one
  2. arm builds need to disable warnings because the compiler does a lot more warning
  3. add testing work arounds for all the tests where shape is ever so slightly different (or if its too annoying disable those tests for arm)

nilsnolde
nilsnolde previously approved these changes Oct 19, 2023
nilsnolde
nilsnolde previously approved these changes Oct 19, 2023
@kevinkreiser
Copy link
Member Author

looks like the docker stuff should work its building fine in the github workflow but seems like it will be a while before it completes: https://github.com/valhalla/valhalla/actions/runs/6574607552/job/17859996388

im going to merge this in the mean time

@kevinkreiser kevinkreiser merged commit 9aa1c1c into master Oct 19, 2023
2 of 7 checks passed
@kevinkreiser kevinkreiser deleted the kk_arm branch October 19, 2023 14:41
@kevinkreiser kevinkreiser changed the title playing around with arm builds Fix Arm Builds and Add CI Workflow for Arm Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants