Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add comprehensive date checking in setters of regional_task.py #266

Open
wklumpen opened this issue Feb 25, 2023 · 24 comments
Open

Add comprehensive date checking in setters of regional_task.py #266

wklumpen opened this issue Feb 25, 2023 · 24 comments
Assignees
Labels
bug Something isn't working documentation Improvements or additions to documentation enhancement New feature or request refactoring test related to tests validation Specific input validation feature requests

Comments

@wklumpen
Copy link
Contributor

At the moment, it looks like the "date coverage" check only checks if TansitMode.TRANSIT is in the departure modes, but not if there is any subset of these modes.

Not sure if we have a test set up for this (but I suspect it would fail if the user passed a a set of all modes individually).

Further the RuntimeWarning that gets thrown if a date is off should ideally specify which feed in the GTFS list is causing the issue (currently trying to debug whether this is throwing a warning unnecessarily.

@wklumpen wklumpen added bug Something isn't working validation Specific input validation feature requests labels Feb 25, 2023
@wklumpen wklumpen self-assigned this Feb 25, 2023
@christophfink
Copy link
Member

christophfink commented Feb 28, 2023

At the moment, it looks like the "date coverage" check only checks if TansitMode.TRANSIT is in the departure modes, but not if there is any subset of these modes.

Haven’t yet tested this myself, but this can definitely be the case. I’ll check this out now, should be an easy fix

Not sure if we have a test set up for this (but I suspect it would fail if the user passed a a set of all modes individually).

Hm, this is a bit tricky: supplying multiple GTFS files can have the purpose of covering more than one transport system, but it can be used also to cover a longer period of time, with some times covered by one file and some by others. So there should be different warnings for (1) one of the files do not cover the departure_time, and (2) none of the files cover departure_time. For case (1) , we could maybe also only warn when verbose == True - or at least combine the individual per-data-set warnings to something like ‘some data sets do not cover departure date’

@christophfink
Copy link
Member

Since we open this up another time: let's also add (1) a check whether the origin and destination geometry is within he geographic extent of the GTFS dataset, and (2) a warning if a transit mode is selected, but no GTFS dataset loaded (we might have this check already, but I'm typing this on the phone)

Finally, (3), let's coordinate this with a possible super-class that combines TransitMode, StreetMode, and LegMode (see issue #265 )

@christophfink christophfink added documentation Improvements or additions to documentation enhancement New feature or request refactoring test related to tests labels Mar 1, 2023
@wklumpen
Copy link
Contributor Author

wklumpen commented Mar 1, 2023

FYI I was referring to this line of code.

@christophfink
Copy link
Member

christophfink commented Mar 2, 2023

Yes, I figured, and just this fix is literally 2 minutes of work. Going to move the other ideas into a separate issue

Edit: Thinking about it, also the other things are not such a big effort, let’s keep them together

@wklumpen
Copy link
Contributor Author

wklumpen commented Mar 9, 2023

add (1) a check whether the origin and destination geometry is within he geographic extent of the GTFS dataset

I assume we would use stops.txt as our boundary?

@wklumpen wklumpen changed the title Date coverage check and RuntimeWarning not comprehensive in the depature setter in regional_task.py Add comprehensive date checking in setters of regional_task.py Mar 9, 2023
@wklumpen
Copy link
Contributor Author

wklumpen commented Mar 9, 2023

Also given that #265 is a bit different (and has a PR open to fix it) I think we should keep this issue focused on date validation only.

@christophfink
Copy link
Member

Also, since you’re the GTFS expert: does stops.txt have geometries from which we can create our own convex hull, or should we simply use the feed_info.txt extent?

I’ll link these comments into #265 - definitely a better fit there

@wklumpen
Copy link
Contributor Author

From my experience feed_info.txt is fickle and often doesn't contain anything useful (e.g. Chicago and NYC feeds all are missing complete feed info data even though the file is there and technically compliant). Stops contain stop_lon and stop_lat which we could use to create convex hulls, however

Suppose a GTFS covers a single line service or just does subways for example. We wouldn't get much of a convex hull to work with.

I'm doing a similar thing for another project and I'm wondering if it's better to just leave that type of check to the user input. In part because GTFS doesn't need to span the analysis area (people can walk/bike/drive long distances to stops).

What is probably more useful is checking the extent of the PBF file, but perhaps R5 does that already (haven't tried it).

@christophfink
Copy link
Member

What is probably more useful is checking the extent of the PBF file, but perhaps R5 does that already (haven't tried it).

That’s something worth checking

@christophfink
Copy link
Member

Suppose a GTFS covers a single line service or just does subways for example. We wouldn't get much of a convex hull to work with.

True, I did not consider that cities with just one line even exist. If the different lines are split into several GTFS feeds, then a convex hull of all stops of all feeds would make sense, though

@wklumpen
Copy link
Contributor Author

Even so, I can see use cases where the area created by the convex hull does not span the entire analysis area. If it's a reasonably quick check I think it's fine to do it (are we getting into "verbosity" setting territory?) but it should definitely only be a warning with a clear explanation that it might not be an unintended thing.

@christophfink
Copy link
Member

Even so, I can see use cases where the area created by the convex hull does not span the entire analysis area. If it's a reasonably quick check I think it's fine to do it (are we getting into "verbosity" setting territory?) but it should definitely only be a warning with a clear explanation that it might not be an unintended thing.

Absolutely, a warning should suffice, and maybe even only when running in verbose mode.

@christophfink
Copy link
Member

christophfink commented Mar 14, 2023

Opened a new issue #271 concerning the geometry/extent checking

@christophfink
Copy link
Member

From our call:

  • if the user routes with a transit mode:
    • if all supplied GTFS Feeds cover the departure datetime: go ahead
    • if at least one GTFS Feed does not cover it: throw Exception

@wklumpen wklumpen added this to the Release v0.0.6 (paper) milestone Mar 14, 2023
@wklumpen
Copy link
Contributor Author

How do we feel about this being in release 0.0.6?

@wklumpen
Copy link
Contributor Author

Now that I'm working with batches of regional GTFS feeds, I have noticed that the warning is thrown even when my own validation using gtfs-lite doesn't seem to find any issues with the time I've supplied (to the extent that I don't know which of the 7 GTFS files are the culprit), however gtfs-lite only checks the date, not the specific time. I can try to see if I can find the GTFS feed that's throwing a warning in r5py but not in gtfs-lite.

But that got me thinking further: It is possible that an analysis using a set of GTFS feeds might run outside of one of them (for example, a commuter rail GTFS feed that runs only at peak period, while the analysis time is set for 1400). This might be totally fine (especially for automated tasks). It is also possible that an analysis window is half-covered by a GTFS feed.

In any case, I'm not sure this input date coverage check should be baked-in to the r5py workflow. I guess it goes to a philosophical discussion of how much data checking we want to provide to the user (without them asking), and how much do we leave in guides/documentation/eventual textbook that we inevitably publish 😄

Suggested behaviour here is, is that if the mode is set to verbose or a validate_data flag is supplied, AND at least one transit mode is specified in transport_modes:

  • Check each GTFS feed to see if there's at least one stop-time within the start+duration window
  • If no feeds have a valid date, throw an exception
  • If one or more feeds are not in the valid range, throw a warning specifying the file

@christophfink
Copy link
Member

I would not introduce an additional validate_data flag, but with everything else I’m with you.

Should the exception be thrown independently of whether or not we’re in verbose mode?

I remember you mentioning that the information in calendar.txt often is imprecise. However, the premise ‘at least one departure within the departure_window might not be bullet-proof either: in some of our research, for instance, we look at diurnal differences in accessibility, and it’s perfectly valid that certain places do not have any departure within the time window (say) 2-3am. How bad is the calendar.txt situation, really? Can we somehow find a middle way (derive start_date and end_date from actual routes and not simply read it from the metadata?)?

@christophfink
Copy link
Member

christophfink commented Apr 23, 2023

In any case, I'm not sure this input date coverage check should be baked-in to the r5py workflow. I guess it goes to a philosophical discussion of how much data checking we want to provide to the user (without them asking), and how much do we leave in guides/documentation/eventual textbook that we inevitably publish

Philosophical, indeed! I think warning if verbose feedback requested, is alright, and failing if no file claims (!) to cover the departure_window, is fine, as well. Not sure whether we should go much further, TBH

@wklumpen
Copy link
Contributor Author

Reminding myself here that a RuntimeWarning is being thrown still for GTFS sets that should be valid, but more importantly one or more of them are valid in a larger set.

Perhaps the wording of the warning here should be "One or more of the GTFS data sets is outside the time range covered by currently loaded data sets"

@wklumpen
Copy link
Contributor Author

wklumpen commented Nov 7, 2023

@christophfink do we still want to provide a check for the dates? If so, we need to consider the following (as I do for GTFS-lite):

  • Is there a calendar.txt service that covers the date?
  • Is there a calendar_dates.txt service that adds to the date (enum 1), or alternatively removes service on the date (enum 2)

Note: this does not check if there are any trip on that date. For example, a calendar.txt definition might only cover weekdays, and will appear valid for an analysis run on a Saturday despite not running service on a Saturday.

Also note: I don't know how extensive the checking of this data is on the R5 end.

This is a pretty serious can of worms that I've tackled over at GTFS-lite with valid_date() and date_trips()

As I've expressed before, my opinion is to leave the data wrangling and validation up to the user. We can provide guides and how-tos if we are interested, but ultimately there are so many permutations of what might be going wrong we can't cover them all in R5py without basically recreating things like the MobilityData Validator or GTFS-Lite package.

@christophfink
Copy link
Member

I see too viable options really:

  • kick out all the date checking; instead adding some more emphasis to this in the documentation
  • import gtfs-lite and run its checks before importing a GTFS file: a bit of overhead, but probably thorough

@wklumpen
Copy link
Contributor Author

wklumpen commented Nov 7, 2023

Loading in a file can be quite slow (5-6 seconds) if the feed is large - but perhaps the tradeoff is worth it. Could also be something we allow the user to specify?

My vote is to put those few lines of code outside of the regional task and have people check their dates ahead of time, much like other data preparation steps.

Happy to make a PR to include a validation example and remove the warning if this is the way we go.

@christophfink
Copy link
Member

Happy to make a PR to include a validation example and remove the warning if this is the way we go.
That would be very much appreciated, thanks!

@wklumpen
Copy link
Contributor Author

TODO: Remove date checking warning and add date checking code to examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation enhancement New feature or request refactoring test related to tests validation Specific input validation feature requests
Projects
None yet
Development

No branches or pull requests

2 participants