stop distance analysis and stop_name clustering #181

polettif · 2022-01-19T13:39:11Z

Summary

This PR provides tools to analyse stop locations and edit stop data.

New functions:

stop_distances: calculate distances between a given set of stops
stop_group_distances: summarise stop distances among stops in a group (e.g. with the same name)
cluster_stops: cluster nearby stops within a group

New parameter

stop_dist_check in travel_times

Background

usage of stop_names

Some time last year, changes were made to the Swiss GTFS feed concerning stop_names and parent_stations. There are now parent stops all over Switzerland with generic names (like "Bahnhof", "gare" or "Post"). These stops share a name but are not located close to each other. In fact, they have nothing in common besides their name. AFAIK, these short generic names are not actually used even in user-facing applications, so I'm not sure why they were added.

library(tidytransit)
library(dplyr)

# Download from https://opentransportdata.swiss/de/dataset/timetable-2022-gtfs2020
g_ch = read_gtfs("gtfs_fp2022_2022-01-12_09-15.zip")

# Parent stops are identified by "Parent" in their stop_id
parent_stops = g_ch$stops %>% filter(grepl("Parent", stop_id))
parent_stops %>% group_by(stop_name) %>% count() %>% arrange(desc(n))
#> # A tibble: 1,380 × 2
#> # Groups:   stop_name [1,380]
#>    stop_name        n
#>    <chr>        <int>
#>  1 Bahnhof         99
#>  2 gare            14
#>  3 Post             9
#>  4 Dorfplatz        5
#>  5 Stazione         5
#>  6 Bahnhof Süd      4
#>  7 village          4
#>  8 centre           3
#>  9 Dorf             3
#> 10 Hauptbahnhof     3
#> # … with 1,370 more rows

library(ggplot2)
parent_stops %>%
  filter(stop_name == "Bahnhof") %>% 
  stops_as_sf() %>% 
  mapview::mapview()

stops named "Bahnhof"

Analyse distances among stops

This made me realize that using stop_name as an identifier in travel_times is convenient but can lead to wrong results, and the package-provided NYC feed has a similar issue. The new function stop_group_distances calculates distances among stops with the same name.
The travel time calculations themselves are not affected; the problem only arises when times are aggregated to a more "readable" format.

g_nyc = read_gtfs(system.file("extdata", "google_transit_nyc_subway.zip", package = "tidytransit"))
stop_group_distances(g_nyc$stops)
#> # A tibble: 380 × 6
#>    stop_name   dists           n_stop_ids dist_mean dist_median dist_max
#>    <chr>       <list>               <dbl>     <dbl>       <dbl>    <dbl>
#>  1 86 St       <dbl [18 × 18]>         18     5395.       5395.   21811.
#>  2 79 St       <dbl [6 × 6]>            6    19053.      19053.   19053.
#>  3 Prospect Av <dbl [6 × 6]>            6    18804.      18804.   18804.
#>  4 77 St       <dbl [6 × 6]>            6    16947.      16947.   16947.
#>  5 59 St       <dbl [6 × 6]>            6    14130.      14130.   14130.
#>  6 50 St       <dbl [9 × 9]>            9     7097.       7097.   14068.
#>  7 36 St       <dbl [6 × 6]>            6    12496.      12496.   12496.
#>  8 8 Av        <dbl [6 × 6]>            6    11682.      11682.   11682.
#>  9 7 Av        <dbl [9 × 9]>            9     5479.       5479.   10753.
#> 10 111 St      <dbl [9 × 9]>            9     3877.       3877.    7753.
#> # … with 370 more rows

g_nyc$stops %>% 
  filter(stop_name == "86 St") %>% 
  stops_as_sf() %>% 
  mapview::mapview(zcol = "parent_station")

stops named "86 St" in New York
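
To illustrate what such a per-name distance check involves, here is a minimal base R sketch. It is not the package's implementation: the haversine formula, the 6371 km earth radius, the helper names (haversine_m, pairwise_dist) and the toy stops are all assumptions made for this example.

```r
# Haversine great-circle distance in metres between lon/lat points.
# The earth radius of 6371 km is an approximation (assumption of this sketch).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Pairwise distance matrix for one group of stops.
pairwise_dist <- function(lon, lat) {
  outer(seq_along(lon), seq_along(lon),
        function(i, j) haversine_m(lon[i], lat[i], lon[j], lat[j]))
}

# Hypothetical toy stops: two "A St" stops close together, one far away.
toy_stops <- data.frame(
  stop_id   = c("a1", "a2", "a3", "b1"),
  stop_name = c("A St", "A St", "A St", "B St"),
  stop_lon  = c(-73.9500, -73.9501, -73.9900, -73.9700),
  stop_lat  = c(40.7800, 40.7800, 40.7000, 40.7500),
  stringsAsFactors = FALSE
)

# Maximum distance within each name group.
dist_max <- sapply(split(toy_stops, toy_stops$stop_name), function(g) {
  max(pairwise_dist(g$stop_lon, g$stop_lat))
})

# Names whose stops are further apart than a 300 m threshold.
suspicious_names <- names(dist_max)[dist_max > 300]
suspicious_names
```

Only "A St" is flagged here, because one of its stops lies several kilometres from the other two, while "B St" has a single stop and therefore a maximum distance of 0.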

Cluster stops

The issue is that travel_times calculates times to each stop separately, but then aggregates all travel times for a stop name and keeps the minimum. As a result, all stops named "86 St" get the same travel time. travel_times now checks whether stop_names are suitable as identifiers and raises an error if stops with the same name are further apart than a defined threshold (see stop_dist_check).
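
The effect of aggregating by name can be shown with a small base R sketch. The travel times below are made up for illustration, and the column names are assumptions; only the stop_ids are taken from the NYC feed, where 121 and 626 are different stations that are both named "86 St".

```r
# Hypothetical travel times (seconds) to individual stop_ids;
# two different stations share the name "86 St".
tt <- data.frame(
  to_stop_id   = c("121N", "121S", "626N", "626S"),
  to_stop_name = c("86 St", "86 St", "86 St", "86 St"),
  travel_time  = c(900, 960, 1500, 1440)
)

# Aggregating by name keeps a single minimum for all four stop_ids,
# so the station "626" is reported with the travel time of station "121".
by_name <- aggregate(travel_time ~ to_stop_name, data = tt, FUN = min)
by_name
```

The result has a single row for "86 St" with a travel time of 900 seconds, although the platforms of station 626 are only reached after 1440 seconds in this toy data.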

This PR provides a fix for this issue: cluster_stops groups stops based on distance using stats::kmeans.

clusterstops = cluster_stops(g_nyc$stops, max_dist = 300, 
                             group_col = "stop_name", cluster_colname = "stop_name_cluster")

# There are 18 stop_ids with the name "86 St", belonging to 6 stations that are far apart
stops_86_St = clusterstops %>%
  filter(stop_name == "86 St")

table(stops_86_St$stop_name_cluster)
#> 
#> 86 St [1] 86 St [2] 86 St [3] 86 St [4] 86 St [5] 86 St [6] 
#>         3         3         3         3         3         3

stops_86_St %>% select(stop_id, stop_name, parent_station, stop_name_cluster) %>% head()
#> # A tibble: 6 × 4
#>   stop_id stop_name parent_station stop_name_cluster
#>   <chr>   <chr>     <chr>          <chr>            
#> 1 121     86 St     ""             86 St [2]        
#> 2 121N    86 St     "121"          86 St [2]        
#> 3 121S    86 St     "121"          86 St [2]        
#> 4 626     86 St     ""             86 St [1]        
#> 5 626N    86 St     "626"          86 St [1]        
#> 6 626S    86 St     "626"          86 St [1]

stops_86_St %>% stops_as_sf() %>% mapview::mapview(zcol = "stop_name_cluster")

stops named "86 St" after clustering
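
One way to picture distance-based clustering with stats::kmeans is sketched below. This is not the code of cluster_stops: the strategy of increasing k until every cluster is within max_dist, the helper names (max_spread, cluster_by_distance) and the use of planar coordinates in metres instead of longitude/latitude are all assumptions made to keep the example short.

```r
# Largest pairwise distance within a set of points (0 for a single point).
max_spread <- function(pts) {
  if (nrow(pts) < 2) return(0)
  max(stats::dist(pts))
}

# Increase the number of clusters until no cluster is wider than max_dist.
# pts is a two-column matrix of planar coordinates in metres (assumption).
cluster_by_distance <- function(pts, max_dist = 300, nstart = 25) {
  n_distinct <- nrow(unique(pts))
  cl <- rep(1L, nrow(pts))  # k = 1: all points in one cluster
  k <- 1L
  repeat {
    spread <- sapply(split(seq_len(nrow(pts)), cl),
                     function(i) max_spread(pts[i, , drop = FALSE]))
    if (all(spread <= max_dist)) return(cl)
    k <- k + 1L
    if (k >= n_distinct) break
    cl <- stats::kmeans(pts, centers = k, nstart = nstart)$cluster
  }
  # Fallback: one cluster per distinct location
  as.integer(factor(paste(pts[, 1], pts[, 2])))
}

# Hypothetical stations: three groups of three stops, kilometres apart.
set.seed(1)
pts <- rbind(
  cbind(c(0, 20, 40), c(0, 10, 0)),
  cbind(c(5000, 5020, 5040), c(0, 10, 0)),
  cbind(c(0, 20, 40), c(8000, 8010, 8000))
)
station <- rep(1:3, each = 3)

cl <- cluster_by_distance(pts, max_dist = 300)
table(station, cl)
```

Because k-means starts from random centres, the cluster numbers can differ between runs, so downstream code should not rely on a specific label for a specific station.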

Usage in travel_times()

These cluster names can be used as new stop_names (by replacing the original column) so that travel_times can be used.

g_nyc = read_gtfs(system.file("extdata", "google_transit_nyc_subway.zip", package = "tidytransit"))

# Original data, travel_times() errors
g_nyc %>% 
  filter_stop_times("2018-06-26", 7*3600, 9*3600) %>% 
  travel_times("34 St - Herald Sq")
#> Error in travel_times(., "34 St - Herald Sq"): Some stops with the same name are more than 300 meters apart, see stop_group_distances().
#> Using travel_times() might lead to unexpected results. Set stop_dist_check=FALSE to ignore this error.

# Cluster stop names
g_nyc %>% 
  cluster_stops(cluster_colname = "stop_name") %>% 
  filter_stop_times("2018-06-26", 7*3600, 9*3600) %>% 
  travel_times("34 St - Herald Sq")
#> # A tibble: 223 × 8
#>    from_stop_name    to_stop_name  travel_time journey_departu… journey_arrival…
#>    <chr>             <chr>               <dbl> <time>           <time>          
#>  1 34 St - Herald Sq 34 St - Hera…           0 07:00:00         07:00:00        
#>  2 34 St - Herald Sq 28 St [1]              60 07:04:00         07:05:00        
#>  3 34 St - Herald Sq 42 St - Brya…          90 07:00:00         07:01:30        
#>  4 34 St - Herald Sq 23 St [1]              90 07:05:00         07:06:30        
#>  5 34 St - Herald Sq Times Sq - 4…          90 07:00:00         07:01:30        
#>  6 34 St - Herald Sq 23 St [4]             150 07:04:00         07:06:30        
#>  7 34 St - Herald Sq 47-50 Sts - …         180 07:00:00         07:03:00        
#>  8 34 St - Herald Sq 14 St [1]             180 07:05:00         07:08:00        
#>  9 34 St - Herald Sq W 4 St                180 07:02:00         07:05:00        
#> 10 34 St - Herald Sq 14 St - Unio…         180 07:01:30         07:04:30        
#> # … with 213 more rows, and 3 more variables: transfers <dbl>,
#> #   from_stop_id <chr>, to_stop_id <chr>

In general, logical grouping of stops is not trivial (I don't want to get into pathways), and this is a reminder that some assumptions about a feed might not hold over time, let alone across different feeds.

codecov-commenter · 2022-01-19T13:42:12Z

Codecov Report

Merging #181 (0b700f1) into master (bd587fc) will increase coverage by 0.15%.
The diff coverage is 91.46%.

@@            Coverage Diff             @@
##           master     #181      +/-   ##
==========================================
+ Coverage   88.65%   88.80%   +0.15%     
==========================================
  Files          13       14       +1     
  Lines        1075     1152      +77     
==========================================
+ Hits          953     1023      +70     
- Misses        122      129       +7

Impacted Files	Coverage Δ
R/geo.R	`90.41% <90.41%> (ø)`
R/raptor.R	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tbuckl · 2022-01-22T05:29:00Z

this PR could be a nice blog post and/or vignette.

mpadge · 2022-01-23T12:18:27Z

We could also maybe arrange an rOpenSci blog post to serve as the reference point for clarifying confusion between gtfsr and tidytransit, and which would then also serve to direct people from rOpenSci to this package. It would maybe also need some meta-discussion of how packages may sometimes evolve "away" from rOpenSci, and serve as a blueprint for how to retain and cultivate some kind of coupling? Just some thoughts ....

polettif · 2022-01-25T08:08:50Z

> this PR could be a nice blog post and/or vignette.

Good idea. Using a vignette is a bit of a problem since I don't want to package the 700MB feed or download it every time it's rendered. As for the blog post, I guess there's not an easy way to add it to tidytransit.r-transit.org?

> We could also maybe arrange an rOpenSci blog post to serve as the reference point for clarifying confusion between gtfsr and tidytransit, and which would then also serve to direct people from rOpenSci to this package. It would maybe also need some meta-discussion of how packages may sometimes evolve "away" from rOpenSci, and serve as a blueprint for how to retain and cultivate some kind of coupling? Just some thoughts ....

A blog post like this would be great, especially since there's basically no development on gtfsr anymore (see gtfsr#60), and we might prevent some confusion that way.

polettif added 5 commits January 18, 2022 15:05

add stop distance functions, check for dist in travel_times

15b2ead

export stop_distances

f2e77eb

error on stop_dist_check

9655484

export stop_group_distances

b9a2586

add cluster_stops function

e55cf0f

rename dist columns to distances, update docs

0b700f1

polettif merged commit 2593862 into master Jan 27, 2022

polettif deleted the dev/stop-name-distances branch January 27, 2022 09:34
