Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

onewayid performance #321

Closed
mem48 opened this issue Jul 3, 2019 · 8 comments
Closed

onewayid performance #321

mem48 opened this issue Jul 3, 2019 · 8 comments

Comments

@mem48
Copy link
Collaborator

mem48 commented Jul 3, 2019

onewayid performance is exponentially slower on large datasets, especially if character IDs are used

I've implemented a replacement for od_id_order() that uses Szudzik pairing in https://github.com/ropensci/stplanr/tree/onewayid

ids <- as.character(runif(4000, 1e6, 1e7 - 1))
x <- data.frame(id1 = rep(ids, times = 4000), 
                id2 = rep(ids, each = 4000),
                val = 1,
                stringsAsFactors = FALSE)

> system.time(od_id_order(x))
   user  system elapsed 
  38.89    0.23   39.69 

> system.time(szudzik_pairing(x))
   user  system elapsed 
   6.52    0.43    7.04 

> system.time(szudzik_pairing(x, simple = TRUE))
   user  system elapsed 
   4.80    0.33    5.20 

There is probably some more performance to be had out of refactoring onewayid to use szudzik_pairing(x, simple = TRUE)

@mem48
Copy link
Collaborator Author

mem48 commented Jul 3, 2019

pairing

@Robinlovelace
Copy link
Member

Fantastic work Malcolm. Will review asap.

@mem48
Copy link
Collaborator Author

mem48 commented Jul 5, 2019

@Robinlovelace FYI just pushed some key bugfixes for tibbles and factors

@Robinlovelace
Copy link
Member

Thanks. Perfect timing for the od vignette, which I hope to have ready in time for the UseR conference next week.

@mem48
Copy link
Collaborator Author

mem48 commented Jul 5, 2019

This is going to need a little more work, the main onewayid function doesn't like stplan.key being a number rather than a character. I've made first push of a new function

@mem48
Copy link
Collaborator Author

mem48 commented Jul 5, 2019

image

Refactoring both makes even mor difference (notice the lower number of IDs used)

@Robinlovelace
Copy link
Member

See #323

@Robinlovelace
Copy link
Member

Heads-up @mem48 after merging #328 I'm confident this issue has been addressed. Many thanks for pointing out the issues. onewayid() is indeed slow on large datasets for various reasons, not least because it created multiple copies of the data.

This is now addressed in many ways, including your excellent od_id_szudzic() function, which is for sure fast. The new od_oneway() function has a simple default setting which preserves the useful original ids, but you can go faster by passing it ids.

All documented in reproducible benchmark below. Please test on your computer and let me know if you get similarly impressive (10x!) speedups.

library(stplanr)
#> Registered S3 method overwritten by 'R.oo':
#>   method        from       
#>   throw.default R.methodsS3
od = pct::get_od()
#> No region provided. Returning national OD data.
#> Parsed with column specification:
#> cols(
#>   `Area of residence` = col_character(),
#>   `Area of workplace` = col_character(),
#>   `All categories: Method of travel to work` = col_double(),
#>   `Work mainly at or from home` = col_double(),
#>   `Underground, metro, light rail, tram` = col_double(),
#>   Train = col_double(),
#>   `Bus, minibus or coach` = col_double(),
#>   Taxi = col_double(),
#>   `Motorcycle, scooter or moped` = col_double(),
#>   `Driving a car or van` = col_double(),
#>   `Passenger in a car or van` = col_double(),
#>   Bicycle = col_double(),
#>   `On foot` = col_double(),
#>   `Other method of travel to work` = col_double()
#> )
#> Parsed with column specification:
#> cols(
#>   X = col_double(),
#>   Y = col_double(),
#>   objectid = col_double(),
#>   msoa11cd = col_character(),
#>   msoa11nm = col_character()
#> )
create_df <- function(rows, cols) {
  od[1:rows, 1:cols]
}
res = bench::press(
  rows = c(5000, 10000, 15000),
  cols = c(3, 4, 5),
  {
    d = create_df(rows, cols)
    bench::mark(check = FALSE,
      onewayid = onewayid(d, attrib = cols[-c(1:2)]),
      od_oneway = od_oneway(d, attrib = cols[-c(1:2)]),
      od_oneway_char = od_oneway(d, attrib = cols[-c(1:2)], stplanr.key = od_id_character(d[[1]], d[[2]])),
      od_oneway_szud = od_oneway(d, attrib = cols[-c(1:2)], stplanr.key = od_id_szudzik(d[[1]], d[[2]])),
      od_oneway_max = od_oneway(d, attrib = cols[-c(1:2)], stplanr.key = od_id_max_min(d[[1]], d[[2]]))
    )
  }
  )
#> Running with:
#>    rows  cols
#> 1  5000     3
#> 2 10000     3
#> 3 15000     3
#> 4  5000     4
#> 5 10000     4
#> 6 15000     4
#> 7  5000     5
#> 8 10000     5
#> 9 15000     5
ggplot2::autoplot(res)
#> Loading required namespace: tidyr

Created on 2019-07-29 by the reprex package (v0.3.0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants