Drop support for R 3.2 #4247

Closed · yutannihilation opened this issue Oct 25, 2020 · 17 comments · Fixed by #4266

Comments

@yutannihilation (Member)

As I commented on #4242 (comment), the digest package now requires R >= 3.3. digest::digest() is used for hashing guides so that guides_merge() can merge overlapping guides.

ggplot2/R/guides-.r

Lines 231 to 244 in 7e51849

# merge overlapped guides
guides_merge <- function(gdefs) {
  # split gdefs based on hash, and apply Reduce (guide_merge) to each gdef group.
  gdefs <- lapply(gdefs, function(g) {
    if (g$order == 0) {
      order <- "99"
    } else {
      order <- sprintf("%02d", g$order)
    }
    g$hash <- paste(order, g$hash, sep = "_")
    g
  })
  tapply(gdefs, sapply(gdefs, function(g) g$hash), function(gs) Reduce(guide_merge, gs))
}

Since the hashing we need is very simple, it's probably possible to implement another hashing function ourselves (maybe paste() with some nice separator is enough?), but I'm not sure it's worth it. I feel it's time to consider dropping support for R 3.2. If this sounds good, I'll prepare a pull request.
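For illustration, here is a minimal sketch of such a paste()-based replacement. hash_guide() and the separators are hypothetical (nothing ggplot2 defines), and it assumes the guide components are simple atomic vectors:

# Hypothetical digest-free hash: flatten each guide component to
# character and join with separators unlikely to occur in the data
# ("\x1f"/"\x1e" are the ASCII unit/record separators).
hash_guide <- function(components) {
  flat <- vapply(
    components,
    function(x) paste(format(x), collapse = "\x1f"),
    FUN.VALUE = character(1)
  )
  paste(flat, collapse = "\x1e")
}

hash_guide(list("colourbar", 0L, letters[1:3]))  # one delimited string

The obvious drawback, as the discussion below shows, is that the result grows with the input rather than having a fixed length.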

@thomasp85 (Member)

We have a commitment to support the last 4 versions of R. I think we should look into dropping digest instead.

@clauswilke (Member)

Or, alternatively, help the digest package support 3.2? It seems to have been a minor issue that caused them to bump the required version:
eddelbuettel/digest#159

@yutannihilation (Member Author)

Thanks. Implementing a replacement for startsWith() doesn't seem very difficult, but I don't feel it's worth asking the maintainer to release a new version just for us because, as @thomasp85 says, R 3.2 falls outside the tidyverse's "last 4 versions" rule.
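(For scale, a backport could be as small as this untested sketch, which ignores the encoding handling of the real base function:)

# Rough sketch of the kind of fallback digest would need on R < 3.3,
# comparing the leading substring of x against prefix
if (getRversion() < "3.3.0") {
  startsWith <- function(x, prefix) {
    substring(x, 1L, nchar(prefix)) == prefix
  }
}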

@clauswilke (Member)

Thomas said "dropping digest" instead of "dropping support for R 3.2". Maybe we're all a bit confused. I just looked up the past R versions and I agree we should be fine with dropping 3.2:

  • R 4.0 current
  • R 3.6 past 1
  • R 3.5 past 2
  • R 3.4 past 3
  • R 3.3 past 4
  • R 3.2 past 5, no commitment to support

@yutannihilation (Member Author)

Thomas said "dropping digest"

Oh, true... I'm definitely a bit confused, sorry.

@thomasp85 (Member)

Sorry, the confusion was my doing. We are not bound by our support commitment in this case. Still, I think it would be beneficial to investigate whether dropping digest is viable from a performance point of view.

@clauswilke (Member)

So digest() basically calls serialize() on the object and then creates a hash value. Can we just use the serialized string?

x <- list(1, TRUE, "hello")
paste0(serialize(x, NULL, TRUE), collapse = "")
#> [1] "410a330a3236323134360a3139373838380a350a5554462d380a31390a330a31340a310a310a31300a310a310a31360a310a3236323135330a350a68656c6c6f0a"

Created on 2020-10-25 by the reprex package (v0.3.0)

One problem may be that the string can be of arbitrary length. One way around this would be to write a digestlight package that takes the string and runs it through a C++ hash function. That would be maybe 100 lines of code total, including everything needed to have a complete package.
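As a rough illustration of the idea (in pure R with a toy polynomial hash, so a sketch of the concept rather than a proposal for the actual package):

# Toy version of the digestlight idea: reduce the serialized bytes to a
# fixed-length hex string. A real package would do this loop in C++ with
# a proper hash function, and (like digest) skip the serialization
# header, which embeds the R version.
hash_bytes <- function(x) {
  bytes <- as.integer(serialize(x, NULL))
  h <- 0
  for (b in bytes) {
    h <- (h * 31 + b) %% 2147483647  # simple polynomial rolling hash
  }
  sprintf("%08x", as.integer(h))
}

hash_bytes(list(1, TRUE, "hello"))  # always 8 hex characters, any input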

@clauswilke (Member) commented Oct 26, 2020

Some benchmarking to see how fast the various parts are: serialize() doesn't seem to be the bottleneck, but even just running paste0() on the serialized object is quite expensive.

x <- list(1, TRUE, "hello", 1:10)

microbenchmark::microbenchmark(
  digest::digest(x),
  paste0(serialize(x, NULL, TRUE), collapse = ""),
  serialize(x, NULL, TRUE)
)
#> Unit: microseconds
#>                                             expr    min     lq     mean  median
#>                                digest::digest(x) 36.277 41.181 60.45034 43.9495
#>  paste0(serialize(x, NULL, TRUE), collapse = "") 37.960 41.153 46.93491 43.1825
#>                         serialize(x, NULL, TRUE)  7.588 10.590 13.70226 11.9500
#>      uq     max neval
#>  72.332 524.315   100
#>  50.000  97.516   100
#>  15.307  37.839   100

x <- list(1, TRUE, "hello", 1:10000)

microbenchmark::microbenchmark(
  digest::digest(x),
  paste0(serialize(x, NULL, TRUE), collapse = ""),
  serialize(x, NULL, TRUE)
)
#> Unit: microseconds
#>                                             expr     min       lq      mean
#>                                digest::digest(x) 238.258 263.8265 280.24610
#>  paste0(serialize(x, NULL, TRUE), collapse = "")  37.359  40.7460  45.37433
#>                         serialize(x, NULL, TRUE)   7.605   8.5340  10.56713
#>    median       uq     max neval
#>  268.4350 283.1045 408.335   100
#>   43.0250  45.0730  90.078   100
#>   10.1015  11.7575  25.245   100

Created on 2020-10-25 by the reprex package (v0.3.0)

@yutannihilation (Member Author)

I didn't realize caching is this costly... Interesting. What about just using rawToChar()? Calculating a base64 string from the non-ASCII binary seems to be the fastest, but considering the cost of ::, there's no significant difference.

library(base64enc)
library(digest)

x <- list(1, TRUE, "hello", 1:10000)

microbenchmark::microbenchmark(
  digest_w_colon = digest::digest(x),
  digest_wo_colon = digest(x),
  paste = paste0(serialize(x, NULL, TRUE), collapse = ""),
  rawToChar = rawToChar(serialize(x, NULL, TRUE)),
  base64_w_colon = base64enc::base64encode(serialize(x, NULL, FALSE)),
  base64_wo_colon = base64encode(serialize(x, NULL, FALSE)),
  serialize_ascii = serialize(x, NULL, TRUE),
  serialize_non_ascii = serialize(x, NULL, FALSE)
)
#> Unit: microseconds
#>                 expr     min       lq      mean   median       uq     max neval
#>       digest_w_colon 192.649 206.1580 212.11704 210.9590 219.8110 247.720   100
#>      digest_wo_colon 185.873 199.6800 206.76998 203.5015 211.4220 500.502   100
#>                paste  49.019  52.3120  54.04512  53.3200  54.8050  69.588   100
#>            rawToChar  11.499  12.5710  13.71171  13.4775  14.4275  19.482   100
#>       base64_w_colon  11.293  13.5240  15.12000  14.7140  15.8705  28.796   100
#>      base64_wo_colon   5.179   6.2515   9.19496   7.3140   8.3750 183.277   100
#>      serialize_ascii  10.583  11.4225  12.76870  12.4195  13.7135  26.942   100
#>  serialize_non_ascii   3.029   3.5310   4.43354   4.1255   4.6590   8.929   100
#>    cld
#>      e
#>      e
#>     d 
#>   bc  
#>    c  
#>  ab   
#>   bc  
#>  a

Created on 2020-10-27 by the reprex package (v0.3.0)

@yutannihilation (Member Author)

If I replace digest::digest() with rawToChar(serialize(x, NULL, TRUE)), I get this error:

Error in do.call("unit.c", lapply(ggrobs, function(g) sum(g$widths))) : 
  variable names are limited to 10000 bytes

What's worse, this isn't even very fast. Here are the results on the kind of realistic data that triggered the error above; ascii = TRUE seems very expensive.

library(base64enc)
library(digest)

x <- list(
  "z",
  as.character(1:6),
  data.frame(
    colour = scales::viridis_pal()(300),
    value  = seq(1, 6, length.out = 300)
  ),
  "colourbar"
)

microbenchmark::microbenchmark(
  digest_w_colon  = digest::digest(x),
  digest_wo_colon = digest(x),
  digest_xxhash32 = digest::digest(x, "xxhash32"),
  digest_xxhash64 = digest::digest(x, "xxhash64"),
  serialize_ascii = serialize(x, NULL, TRUE),
  serialize_bin   = serialize(x, NULL, FALSE),
  rawToChar       = rawToChar(serialize(x, NULL, TRUE))
)
#> Unit: microseconds
#>             expr     min       lq      mean   median       uq      max neval
#>   digest_w_colon  92.271  99.5690 104.68697 101.8930 107.1105  135.390   100
#>  digest_wo_colon  82.708  88.3015  99.75531  89.7285  92.5880  615.140   100
#>  digest_xxhash32  69.879  78.3815  84.60217  80.7615  85.5825  146.478   100
#>  digest_xxhash64  68.775  77.6865  84.72259  80.3860  85.7910  145.938   100
#>  serialize_ascii 650.601 682.8550 742.17401 692.8080 729.6420 1255.246   100
#>    serialize_bin  36.854  40.3000  43.25900  42.7725  44.7235   72.391   100
#>        rawToChar 686.606 718.9860 763.60650 728.8495 755.1025 1278.913   100
#>  cld
#>   b 
#>   b 
#>   b 
#>   b 
#>    c
#>  a  
#>    c

Created on 2020-11-02 by the reprex package (v0.3.0)

@clauswilke (Member)

This is weird. When I looked at the digest() source code, I thought it starts by calling serialize(). Does it call the binary version? Does this work: rawToChar(serialize(x, NULL, FALSE))?

@clauswilke (Member)

To answer my own question: No, it doesn't work, because C strings cannot contain binary zeros.
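(A quick way to see this in a session: the binary stream is full of nul bytes, and rawToChar() refuses them.)

r <- serialize(list(1, TRUE, "hello"), NULL, FALSE)
sum(r == as.raw(0))  # counts the embedded nul bytes; there are many
rawToChar(r)         # fails with an "embedded nul in string" error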

@yutannihilation (Member Author) commented Nov 3, 2020

Yeah. There might be a nicer way to convert the binary into ASCII (e.g. base64 encoding), but I think the "variable names are limited to 10000 bytes" error means we need some hashing anyway, to bound the length of the result.

Implementing a hashing algorithm ourselves doesn't seem trivial, and the digest package is already very fast; it takes only about twice as long as bare serialization (serialize(x, NULL, FALSE)). It would probably get a bit faster with XXH3, the latest variant of xxHash (digest uses xxHash, but only exposes XXH32 and XXH64), but I don't think that's worth implementing another package for, as there's little room for speedup. Serialization might be improved by not relying on serialize(), but I have no idea how at the moment.

@yutannihilation (Member Author)

Can we agree to drop support for R 3.2? We might eventually implement another hashing package, but I don't think that will happen in the near future, so I think it's better to make it clear that ggplot2 cannot be installed on R 3.2, at least for now.

@thomasp85 (Member)

I think dropping support for 3.2 is fine for now. flange might eventually get a hashing function, but I don't know the timeline for it.

@yutannihilation (Member Author)

Thanks. Is "flange" an R package that is not yet made public? Sounds interesting.

@yutannihilation (Member Author)

Serialization might be improved by not relying on serialize(), but I have no idea how at the moment.

In case someone is interested in this: xxhashlite serializes without using serialize(), and the benchmarks in its README look much faster than digest.

https://github.com/coolbutuseless/xxhashlite
