onewayid performance #321

mem48 · 2019-07-03T23:06:02Z

onewayid() performance degrades sharply on large datasets, especially if character IDs are used

I've implemented a replacement for od_id_order() that uses Szudzik pairing in https://github.com/ropensci/stplanr/tree/onewayid

ids <- as.character(runif(4000, 1e6, 1e7 - 1))
x <- data.frame(id1 = rep(ids, times = 4000), 
                id2 = rep(ids, each = 4000),
                val = 1,
                stringsAsFactors = FALSE)

> system.time(od_id_order(x))
   user  system elapsed 
  38.89    0.23   39.69 

> system.time(szudzik_pairing(x))
   user  system elapsed 
   6.52    0.43    7.04 

> system.time(szudzik_pairing(x, simple = TRUE))
   user  system elapsed 
   4.80    0.33    5.20

There is probably some more performance to be had out of refactoring onewayid to use szudzik_pairing(x, simple = TRUE)
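For readers unfamiliar with the technique, here is a minimal sketch of an order-independent Szudzik-style pairing in base R. It is not the code in the linked branch: the function name `szudzik_unordered_sketch` is hypothetical, and the sketch assumes IDs are first mapped to positive integers with `match()`, so that each origin-destination pair gets one numeric key regardless of direction.

```r
# Hypothetical helper, for illustration only (not the stplanr implementation)
szudzik_unordered_sketch <- function(id1, id2) {
  # Map every distinct ID to a positive integer index
  lvls <- unique(c(id1, id2))
  a <- match(id1, lvls)
  b <- match(id2, lvls)
  # Sort each pair so that direction does not matter
  lo <- pmin(a, b)
  hi <- pmax(a, b)
  # For 1 <= lo <= hi, hi^2 + lo is unique per unordered pair
  hi^2 + lo
}

# A-B and B-A share a key; A-C gets a different one
keys <- szudzik_unordered_sketch(c("a", "b", "a"), c("b", "a", "c"))
```

Comparing numeric keys is typically cheaper than comparing pasted character keys, which is one plausible reason for the timings above. The keys are doubles, so they stay exact only while `hi^2 + lo` is below 2^53.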

mem48 · 2019-07-03T23:26:48Z

Robinlovelace · 2019-07-04T08:17:04Z

Fantastic work Malcolm. Will review asap.

mem48 · 2019-07-05T09:01:21Z

@Robinlovelace FYI just pushed some key bugfixes for tibbles and factors

Robinlovelace · 2019-07-05T09:43:26Z

Thanks. Perfect timing for the od vignette, which I hope to have ready in time for the UseR conference next week.

mem48 · 2019-07-05T11:07:57Z

This is going to need a little more work: the main onewayid function doesn't like stplanr.key being a number rather than a character. I've made a first push of a new function.

mem48 · 2019-07-05T11:41:30Z

Refactoring both makes even more difference (notice the lower number of IDs used)

Robinlovelace · 2019-07-05T15:09:28Z

See #323

Robinlovelace · 2019-07-29T18:16:50Z

Heads-up @mem48: after merging #328, I'm confident this issue has been addressed. Many thanks for pointing out the issues. onewayid() is indeed slow on large datasets for various reasons, not least because it created multiple copies of the data.

This is now addressed in many ways, including your excellent od_id_szudzik() function, which is for sure fast. The new od_oneway() function has a simple default setting which preserves the useful original ids, but you can go faster by passing it ids.
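To illustrate the idea of aggregating on a precomputed key, here is a rough base R sketch. It is an assumption-laden toy, not how od_oneway() is implemented: the data frame `od_toy` and its column names are made up, and the key is built with `pmin()`/`pmax()` on character IDs so the original IDs stay readable.

```r
# Hypothetical toy data: flows in both directions between zones
od_toy <- data.frame(
  o = c("A", "B", "A", "C"),
  d = c("B", "A", "C", "A"),
  n = c(10, 5, 3, 2),
  stringsAsFactors = FALSE
)

# Order-independent key: the smaller ID always comes first
key <- paste(pmin(od_toy$o, od_toy$d), pmax(od_toy$o, od_toy$d))

# Sum the attribute over both directions of each pair
oneway_toy <- rowsum(od_toy$n, group = key)
```

Computing the key once and grouping on it avoids repeated copies of the data, which is the cost noted above for onewayid().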

All documented in the reproducible benchmark below. Please test on your computer and let me know if you get similarly impressive (10x!) speedups.

library(stplanr)
#> Registered S3 method overwritten by 'R.oo':
#>   method        from       
#>   throw.default R.methodsS3
od = pct::get_od()
#> No region provided. Returning national OD data.
#> Parsed with column specification:
#> cols(
#>   `Area of residence` = col_character(),
#>   `Area of workplace` = col_character(),
#>   `All categories: Method of travel to work` = col_double(),
#>   `Work mainly at or from home` = col_double(),
#>   `Underground, metro, light rail, tram` = col_double(),
#>   Train = col_double(),
#>   `Bus, minibus or coach` = col_double(),
#>   Taxi = col_double(),
#>   `Motorcycle, scooter or moped` = col_double(),
#>   `Driving a car or van` = col_double(),
#>   `Passenger in a car or van` = col_double(),
#>   Bicycle = col_double(),
#>   `On foot` = col_double(),
#>   `Other method of travel to work` = col_double()
#> )
#> Parsed with column specification:
#> cols(
#>   X = col_double(),
#>   Y = col_double(),
#>   objectid = col_double(),
#>   msoa11cd = col_character(),
#>   msoa11nm = col_character()
#> )
create_df <- function(rows, cols) {
  od[1:rows, 1:cols]
}
res = bench::press(
  rows = c(5000, 10000, 15000),
  cols = c(3, 4, 5),
  {
    d = create_df(rows, cols)
    bench::mark(check = FALSE,
      onewayid = onewayid(d, attrib = cols[-c(1:2)]),
      od_oneway = od_oneway(d, attrib = cols[-c(1:2)]),
      od_oneway_char = od_oneway(d, attrib = cols[-c(1:2)], stplanr.key = od_id_character(d[[1]], d[[2]])),
      od_oneway_szud = od_oneway(d, attrib = cols[-c(1:2)], stplanr.key = od_id_szudzik(d[[1]], d[[2]])),
      od_oneway_max = od_oneway(d, attrib = cols[-c(1:2)], stplanr.key = od_id_max_min(d[[1]], d[[2]]))
    )
  }
  )
#> Running with:
#>    rows  cols
#> 1  5000     3
#> 2 10000     3
#> 3 15000     3
#> 4  5000     4
#> 5 10000     4
#> 6 15000     4
#> 7  5000     5
#> 8 10000     5
#> 9 15000     5
ggplot2::autoplot(res)
#> Loading required namespace: tidyr

Created on 2019-07-29 by the reprex package (v0.3.0)

Robinlovelace closed this as completed Jul 29, 2019
