stop distance analysis and stop_name clustering #181

Merged: 6 commits, Jan 27, 2022


polettif
Contributor

Summary

This PR provides tools to analyse stop locations and edit stop data.

New functions:

  • stop_distances Calculate distances between a given set of stops
  • stop_group_distances Summary of stop distances among stops in a group (e.g. with the same name)
  • cluster_stops Cluster nearby stops within a group

New parameter

  • stop_dist_check in travel_times

Background

usage of stop_names

Some time last year, changes were made to the Swiss GTFS feed concerning stop_names and parent_stations. There are now parent stops all over Switzerland with generic names (like "Bahnhof", "gare" or "Post"). These stops share a name but are not located close to each other; in fact, they have nothing in common besides the name. AFAIK, these short generic names are not actually used even in user-facing applications, so I'm not sure why they were added.

library(tidytransit)
library(dplyr)

# Download from https://opentransportdata.swiss/de/dataset/timetable-2022-gtfs2020
g_ch = read_gtfs("gtfs_fp2022_2022-01-12_09-15.zip")

# Parent stations carry "Parent" in their stop_id
parent_stops = g_ch$stops %>% filter(grepl("Parent", stop_id))
parent_stops %>% group_by(stop_name) %>% count() %>% arrange(desc(n))
#> # A tibble: 1,380 × 2
#> # Groups:   stop_name [1,380]
#>    stop_name        n
#>    <chr>        <int>
#>  1 Bahnhof         99
#>  2 gare            14
#>  3 Post             9
#>  4 Dorfplatz        5
#>  5 Stazione         5
#>  6 Bahnhof Süd      4
#>  7 village          4
#>  8 centre           3
#>  9 Dorf             3
#> 10 Hauptbahnhof     3
#> # … with 1,370 more rows

library(ggplot2)
parent_stops %>%
  filter(stop_name == "Bahnhof") %>% 
  stops_as_sf() %>% 
  mapview::mapview()
[Figure: map of stops named "Bahnhof"]

Analyse distances among stops

It made me realize that using "stop_names" as an identifier in travel_times might be convenient but can also lead to wrong results, and there's a similar issue with the package-provided NYC feed. The new function stop_group_distances allows you to calculate distances among stops with the same name.
There is no issue with the travel time calculations themselves, just with aggregating times to a more "readable" format.

g_nyc = read_gtfs(system.file("extdata", "google_transit_nyc_subway.zip", package = "tidytransit"))
stop_group_distances(g_nyc$stops)
#> # A tibble: 380 × 6
#>    stop_name   dists           n_stop_ids dist_mean dist_median dist_max
#>    <chr>       <list>               <dbl>     <dbl>       <dbl>    <dbl>
#>  1 86 St       <dbl [18 × 18]>         18     5395.       5395.   21811.
#>  2 79 St       <dbl [6 × 6]>            6    19053.      19053.   19053.
#>  3 Prospect Av <dbl [6 × 6]>            6    18804.      18804.   18804.
#>  4 77 St       <dbl [6 × 6]>            6    16947.      16947.   16947.
#>  5 59 St       <dbl [6 × 6]>            6    14130.      14130.   14130.
#>  6 50 St       <dbl [9 × 9]>            9     7097.       7097.   14068.
#>  7 36 St       <dbl [6 × 6]>            6    12496.      12496.   12496.
#>  8 8 Av        <dbl [6 × 6]>            6    11682.      11682.   11682.
#>  9 7 Av        <dbl [9 × 9]>            9     5479.       5479.   10753.
#> 10 111 St      <dbl [9 × 9]>            9     3877.       3877.    7753.
#> # … with 370 more rows

g_nyc$stops %>% 
  filter(stop_name == "86 St") %>% 
  stops_as_sf() %>% 
  mapview::mapview(zcol = "parent_station")
[Figure: map of stops named "86 St" in New York, colored by parent_station]
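
To illustrate what such a per-name distance summary measures, here is a minimal base R sketch. It is not tidytransit's implementation: the helpers `haversine_m` and `dist_matrix` are hypothetical names, the coordinates are made-up toy data, and the haversine formula is simply one common way to get distances in meters from latitude and longitude.

```r
# Great-circle (haversine) distance in meters; vectorized over its arguments.
# Illustrative stand-in only, the package's own distance code may differ.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Symmetric matrix of distances between all stops of a data frame
dist_matrix <- function(stops) {
  idx <- seq_len(nrow(stops))
  m <- outer(idx, idx, function(i, j) {
    haversine_m(stops$stop_lat[i], stops$stop_lon[i],
                stops$stop_lat[j], stops$stop_lon[j])
  })
  dimnames(m) <- list(stops$stop_id, stops$stop_id)
  m
}

# Hypothetical toy data: two "86 St" stops far apart, two "Main St" stops close by
stops <- data.frame(
  stop_id   = c("A", "B", "C", "D"),
  stop_name = c("86 St", "86 St", "Main St", "Main St"),
  stop_lat  = c(40.7795, 40.5928, 40.7000, 40.7001),
  stop_lon  = c(-73.9556, -73.9781, -74.0000, -74.0000)
)

# Largest pairwise distance among stops sharing a name
max_dist_by_name <- sapply(split(stops, stops$stop_name), function(s) {
  max(dist_matrix(s))
})
max_dist_by_name
```

A name whose maximum distance is in the kilometers (here the toy "86 St") is a poor identifier, while a name whose stops are a few meters apart is fine.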

Cluster stops

The issue is that travel_times calculates times to each stop separately but then aggregates all travel times for a stop name and keeps the minimum. So all stops with the name "86 St" get the same time. travel_times now checks whether stop_names are suitable as an identifier and raises an error if stops with the same name are farther apart than a defined threshold (set stop_dist_check=FALSE to skip the check).
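
The effect of that aggregation can be shown with a few lines of base R. The travel times below are invented for illustration, and `aggregate()` only mimics the "minimum per name" step; it is not the code travel_times uses internally.

```r
# Hypothetical per-stop travel times (seconds) from some origin.
# All three stops are named "86 St" but stand for different stations.
times <- data.frame(
  to_stop_id   = c("121", "626", "far_away"),
  to_stop_name = c("86 St", "86 St", "86 St"),
  travel_time  = c(900, 780, 3300)
)

# Aggregating by name keeps only the fastest arrival,
# so the station that is 3300 s away is reported as 780 s as well
by_name <- aggregate(travel_time ~ to_stop_name, data = times, FUN = min)
by_name
```

The result has a single row for "86 St" with a travel time of 780, which is only correct for one of the three stations.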

This PR provides a fix for this issue: cluster_stops allows grouping stops based on distance using stats::kmeans.

clusterstops = cluster_stops(g_nyc$stops, max_dist = 300, 
                             group_col = "stop_name", cluster_colname = "stop_name_cluster")

# There are 6 stations named "86 St" (3 stop_ids each) that are far apart
stops_86_St = clusterstops %>%
  filter(stop_name == "86 St")

table(stops_86_St$stop_name_cluster)
#> 
#> 86 St [1] 86 St [2] 86 St [3] 86 St [4] 86 St [5] 86 St [6] 
#>         3         3         3         3         3         3

stops_86_St %>% select(stop_id, stop_name, parent_station, stop_name_cluster) %>% head()
#> # A tibble: 6 × 4
#>   stop_id stop_name parent_station stop_name_cluster
#>   <chr>   <chr>     <chr>          <chr>            
#> 1 121     86 St     ""             86 St [2]        
#> 2 121N    86 St     "121"          86 St [2]        
#> 3 121S    86 St     "121"          86 St [2]        
#> 4 626     86 St     ""             86 St [1]        
#> 5 626N    86 St     "626"          86 St [1]        
#> 6 626S    86 St     "626"          86 St [1]
#> # A tibble: 6 × 4
#>   stop_id stop_name parent_station stop_name_cluster
#>   <chr>   <chr>     <chr>          <chr>
#> 1 121     86 St     ""             86 St [3]
#> 2 121N    86 St     "121"          86 St [3]
#> 3 121S    86 St     "121"          86 St [3]
#> 4 626     86 St     ""             86 St [4]
#> 5 626N    86 St     "626"          86 St [4]
#> 6 626S    86 St     "626"          86 St [4]

stops_86_St %>% stops_as_sf() %>% mapview::mapview(zcol = "stop_name_cluster")
[Figure: map of stops named "86 St" after clustering, colored by stop_name_cluster]
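
For readers curious how distance-based grouping with stats::kmeans can work in principle, here is a rough base R sketch. It is an assumption-laden illustration, not the code in this PR: `cluster_by_distance` is a hypothetical helper, the coordinates are toy data, the projection to meters is a simple equirectangular approximation, and the strategy of increasing k until every cluster fits within `max_dist` is just one plausible approach.

```r
# Increase the number of k-means clusters until no two stops in the same
# cluster are more than max_dist meters apart (hypothetical helper).
cluster_by_distance <- function(lat, lon, max_dist = 300) {
  # Approximate local projection to meters; adequate for a single city
  xy <- cbind(x = lon * 111320 * cos(mean(lat) * pi / 180),
              y = lat * 110540)
  key <- paste(xy[, 1], xy[, 2])
  n_distinct <- length(unique(key))

  # TRUE if every cluster's largest pairwise distance is within max_dist
  within_limit <- function(cl) {
    all(tapply(seq_along(cl), cl, function(i) {
      length(i) < 2 || max(dist(xy[i, , drop = FALSE])) <= max_dist
    }))
  }

  cl <- rep(1L, nrow(xy))
  k <- 1
  while (!within_limit(cl) && k < n_distinct) {
    k <- k + 1
    cl <- if (k < n_distinct) {
      kmeans(xy, centers = k, nstart = 25)$cluster
    } else {
      match(key, unique(key))  # fallback: one cluster per distinct location
    }
  }
  cl
}

# Toy data: three stations with two platforms each, all named "86 St"
lat <- c(40.7795, 40.7796, 40.5928, 40.5929, 40.6850, 40.6851)
lon <- c(-73.9556, -73.9556, -73.9781, -73.9781, -73.9200, -73.9200)

set.seed(42)  # k-means starts are random, so cluster numbers vary between runs
cl <- cluster_by_distance(lat, lon, max_dist = 300)
stop_name_cluster <- paste0("86 St [", cl, "]")
table(stop_name_cluster)
```

Because k-means initialisation is random, the number assigned to a given station can differ from run to run unless a seed is set, which is consistent with the differing `[n]` labels in the two outputs above.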

Usage in travel_times()

These cluster names can be used as new stop_names (by replacing the original column) so that travel_times can be used.

g_nyc = read_gtfs(system.file("extdata", "google_transit_nyc_subway.zip", package = "tidytransit"))

# Original data: travel_times() throws an error
g_nyc %>% 
  filter_stop_times("2018-06-26", 7*3600, 9*3600) %>% 
  travel_times("34 St - Herald Sq")
#> Error in travel_times(., "34 St - Herald Sq"): Some stops with the same name are more than 300 meters apart, see stop_group_distances().
#> Using travel_times() might lead to unexpected results. Set stop_dist_check=FALSE to ignore this error.

# Cluster stop names
g_nyc %>% 
  cluster_stops(cluster_colname = "stop_name") %>% 
  filter_stop_times("2018-06-26", 7*3600, 9*3600) %>% 
  travel_times("34 St - Herald Sq")
#> # A tibble: 223 × 8
#>    from_stop_name    to_stop_name  travel_time journey_departu… journey_arrival…
#>    <chr>             <chr>               <dbl> <time>           <time>          
#>  1 34 St - Herald Sq 34 St - Hera…           0 07:00:00         07:00:00        
#>  2 34 St - Herald Sq 28 St [1]              60 07:04:00         07:05:00        
#>  3 34 St - Herald Sq 42 St - Brya…          90 07:00:00         07:01:30        
#>  4 34 St - Herald Sq 23 St [1]              90 07:05:00         07:06:30        
#>  5 34 St - Herald Sq Times Sq - 4…          90 07:00:00         07:01:30        
#>  6 34 St - Herald Sq 23 St [4]             150 07:04:00         07:06:30        
#>  7 34 St - Herald Sq 47-50 Sts - …         180 07:00:00         07:03:00        
#>  8 34 St - Herald Sq 14 St [1]             180 07:05:00         07:08:00        
#>  9 34 St - Herald Sq W 4 St                180 07:02:00         07:05:00        
#> 10 34 St - Herald Sq 14 St - Unio…         180 07:01:30         07:04:30        
#> # … with 213 more rows, and 3 more variables: transfers <dbl>,
#> #   from_stop_id <chr>, to_stop_id <chr>

In general, logical grouping of stops is not trivial (I don't want to get into pathways), and this is a reminder that some assumptions about a feed might not hold over time, let alone across different feeds.

@codecov-commenter

codecov-commenter commented Jan 19, 2022

Codecov Report

Merging #181 (0b700f1) into master (bd587fc) will increase coverage by 0.15%.
The diff coverage is 91.46%.


@@            Coverage Diff             @@
##           master     #181      +/-   ##
==========================================
+ Coverage   88.65%   88.80%   +0.15%     
==========================================
  Files          13       14       +1     
  Lines        1075     1152      +77     
==========================================
+ Hits          953     1023      +70     
- Misses        122      129       +7     

| Impacted Files | Coverage Δ |
|---|---|
| R/geo.R | 90.41% <90.41%> (ø) |
| R/raptor.R | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tbuckl
Member

tbuckl commented Jan 22, 2022

this PR could be a nice blog post and/or vignette.

@mpadge
Contributor

mpadge commented Jan 23, 2022

We could also maybe arrange an rOpenSci blog post to serve as the reference point for clarifying confusion between gtfsr and tidytransit, and which would then also serve to direct people from rOpenSci to this package. It would maybe also need some meta-discussion of how packages may sometimes evolve "away" from rOpenSci, and serve as a blueprint for how to retain and cultivate some kind of coupling? Just some thoughts ....

@polettif
Contributor Author

> this PR could be a nice blog post and/or vignette.

Good idea. Using a vignette is a bit of a problem since I don't want to package the 700MB feed or download it every time it's rendered. As for the blog post, I guess there's no easy way to add it to tidytransit.r-transit.org?

> We could also maybe arrange an rOpenSci blog post to serve as the reference point for clarifying confusion between gtfsr and tidytransit, and which would then also serve to direct people from rOpenSci to this package. It would maybe also need some meta-discussion of how packages may sometimes evolve "away" from rOpenSci, and serve as a blueprint for how to retain and cultivate some kind of coupling? Just some thoughts ....

A blog post like this would be great, especially since there's basically no development on gtfsr anymore (see gtfsr#60) and we might prevent some confusion that way.

@polettif polettif merged commit 2593862 into master Jan 27, 2022
@polettif polettif deleted the dev/stop-name-distances branch January 27, 2022 09:34