Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed up transit ingestion #4167

Merged
merged 16 commits into from
Jul 4, 2023
Merged

speed up transit ingestion #4167

merged 16 commits into from
Jul 4, 2023

Conversation

kevinkreiser
Copy link
Member

@kevinkreiser kevinkreiser commented Jun 16, 2023

At the moment valhalla_ingest_transit is slow. After looking at the code for a few minutes i noticed that just_gtfs does a lot of scanning vectors to find stuff by gtfs id. This seemed slow to me, so i tried the profiler and it agreed.

I forked just_gtfs into the valhalla org and made a new branch that sorts the vectors (optionally for now but i think this should be the only mode really..).

In my test (pittsburgh) it brought the time to ingest the transit data from close to 6 minutes down to close to 2 minutes.

I'll probably just remove the ability to leave the data unsorted but wanted to throw this up here to show the progress.

@mesozoic-drones would you accept a pr? i see @nilsnolde has a pr that is unmerged for a while so i dont want to waste your time if you are not interested. thanks again!

EDIT:

Ok this is ready for second review, ive changed some more with just gtfs primarily:

  1. no more copies of structs, everything returns const &
  2. validity/not-found semantics now use gtfs::valid since std::optional cannot take const &
  3. sorting is not optional anymore its too advantageous to not do it

the time is now down from 6min for the pittsburgh feed to 1m20s. thats just shy of an 80% reduction. the bulk of the time is now spent just doing the initial enumeration of all the stuff in the feed (the tiling part). this will quickly become the bottleneck once we start doing tons of feeds at once. we can parallelize it and do other optimizations when that time comes.

CHANGELOG.md Outdated
@@ -5,6 +5,7 @@
* FIXED: when reclassifying ferry edges, remove destonly from ways only if the connecting way was destonly [#4118](https://github.com/valhalla/valhalla/pull/4118)
* **Enhancement**
* UPDATED: French translations, thanks to @xlqian [#4159](https://github.com/valhalla/valhalla/pull/4159)
* ENHANCED: sped up the transit gtfs ingestion process by sorting the feeds before querying them, forked just_gtfs to accomplish it [#4167](https://github.com/valhalla/valhalla/pull/4167)
Copy link
Member

@nilsnolde nilsnolde Jun 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard nit: usually we'd say CHANGED here

@@ -91,7 +92,7 @@
* ADDED: Added has_toll, has_higway, has_ferry tags to summary field of a leg and route and a highway tag to a maneuver if it includes a highway. [#3815](https://github.com/valhalla/valhalla/issues/3815)
* ADDED: Add time info to sources_to_targets [#3795](https://github.com/valhalla/valhalla/pull/3795)
* ADDED: "available_actions" to the /status response [#3836](https://github.com/valhalla/valhalla/pull/3836)
* ADDED: "waiting" field on input/output intermediate break(_through) locations to respect services times [#3849](https://github.com/valhalla/valhalla/pull/3849)
* ADDED: "waiting" field on input/output intermediate break(\_through) locations to respect services times [#3849](https://github.com/valhalla/valhalla/pull/3849)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops haha did that turn the whole document after to italics?

nilsnolde
nilsnolde previously approved these changes Jun 17, 2023
@nilsnolde
Copy link
Member

This is great, thanks for the effort!

Though you forked from the wrong repo, earlier this year @mesozoic-drones shifted the dev version to her own repo: https://github.com/mesozoic-drones/just_gtfs. This is also the submodule we have here ever since. There my PR was also merged:)

@kevinkreiser
Copy link
Member Author

ah ok I can make sure I have the changes from that repo. I just quickly googled and forgot about the backstory.

regarding 'changed' I looked through the changelog to see what we used because semantically changed is incredibly vague and to me seems worse than useless. actually i really wish we wouldn't use the all caps thing under the headings at all as it's just wasted bytes with 0 extra meaning. at times in the past we didn't use them but sadly we do now. so yeah I'll change it

@nilsnolde
Copy link
Member

"hard nit" was maybe lost in translation, I meant "I don't really care":) it's just a tiny consistency thing and I fully agree, it's really meaningless bytes

@kevinkreiser
Copy link
Member Author

sadly still not done here. our transit tests fail and some of it is definitely due to the sorting. the first test where we write transit and then read it back to make sure its right doesnt realize that the data is sorted so it expects to be able to use certain indices to map to certain stops. i can change the test a bit to be robust to this (use the find methods rather than indices).

further down it is also saying that the days mask on the tile says all 59 days have service. this again might be us relying on indices for a certain schedule but im not sure yet, didnt look too close. i'll check it later

@kevinkreiser
Copy link
Member Author

this is ready for final review

for (const auto& cal_date_item : trip_calDates) {
auto d = to_local_pivot_sec(cal_date_item.date.get_raw_date());
if (cal_date_item.exception_type == gtfs::CalendarDateException::Added) {
for (auto cal_itr = trip_calDates.first; cal_itr != trip_calDates.second; ++cal_itr) {
Copy link
Member Author

@kevinkreiser kevinkreiser Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i had a bug here in the first draft of this pr where i didnt make a copy of the first iterator in the range and instead modified the original. this meant that for other stop pairs that use teh same trip they wouldnt get these exceptions which causes the tests to fail, so thanks to the tests this was caught!

@kevinkreiser kevinkreiser merged commit 30e11f1 into master Jul 4, 2023
8 checks passed
@kevinkreiser kevinkreiser deleted the kk_speed_up_ingest_transit branch July 4, 2023 20:47
@kevinkreiser
Copy link
Member Author

just for preservation purposes i have pr'd our fork of just_gtfs with the speedups to the currently maintained version: mesozoic-drones/just_gtfs#3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants