write_csv: scientific notation cannot be disabled #671

Closed
dpprdan opened this issue May 10, 2017 · 4 comments

@dpprdan
Contributor

dpprdan commented May 10, 2017

write_csv() turns (some) longer numbers into scientific notation and there does not seem to be a way to disable it. This has been mentioned before in #229 and apparently was fixed then, so this might be a regression?

library("readr")
df <- data.frame(a = -0.0004029971, b = 0.0412975501857025)
print(df, digits = 17)
#>                         a                    b
#> 1 -0.00040299710000000002 0.041297550185702497
cat(format_csv(df))
#> a,b
#> -4.029971e-4,0.0412975501857025

The problem is I cannot use scientific notation (well) with the tools that import the csv.

Also compare this to this (which should be equivalent IMHO):

format_csv(data.frame(GEOID = seq(from = 60150001022000, to = 60150001022005, 
  1)))
#> [1] "GEOID\n60150001022e3\n60150001022001\n60150001022002\n60150001022003\n60150001022004\n60150001022005\n"

I.e. GEOID\n60150001022e3 instead of GEOID\n60150001022000

@jimhester
Collaborator

I don't think we will be changing this behavior in the near future (if ever). A workaround you can use is to format the columns before writing. See ?base::format for details on possible formatting arguments.

format_numeric <- function(x, ...) {
  numeric_cols <- vapply(x, is.numeric, logical(1))
  x[numeric_cols] <- lapply(x[numeric_cols], format, ...)
  x
}

library("readr")
df <- data.frame(a = -0.0004029971, b = 0.0412975501857025)
format_csv(format_numeric(df))
#> [1] "a,b\n-0.0004029971,0.04129755\n"

@dpprdan
Contributor Author

dpprdan commented May 11, 2017

Thanks! One general question though: Why default to a notation/formatting that, at least to me, seems to be less compatible with other tools? (Even more so when the file format, csv, is arguably one of the most interchangeable/compatible formats.) I guess this is all a matter of perspective, I just would like to understand your design choice.

One addition and one question with respect to the format_numeric function: I guess one ought to add the scientific = FALSE option to reliably disable scientific notation, irrespective of options(scipen).

format_numeric_jh <- function(x, ...) {
  numeric_cols <- vapply(x, is.numeric, logical(1))
  x[numeric_cols] <- lapply(x[numeric_cols], format, ...)
  x
}
format_numeric_dpd <- function(x, scientific = FALSE, ...) {
  numeric_cols <- vapply(x, is.numeric, logical(1))
  x[numeric_cols] <- lapply(x[numeric_cols], format, scientific = scientific, ...)
  x
}
df <- data.frame(a = -0.00004029971, b = 0.0412975501857025)
geoid_df <- data.frame(GEOID = seq(from = 60150001022000, to = 60150001022005, 1))
print(df, digits = 18)
#>                         a                    b
#> 1 -4.0299709999999997e-05 0.041297550185702497
library("readr")
format_csv(format_numeric_jh(df))
#> [1] "a,b\n-4.029971e-05,0.04129755\n"
format_csv(format_numeric_jh(geoid_df))
#> [1] "GEOID\n6.015e+13\n6.015e+13\n6.015e+13\n6.015e+13\n6.015e+13\n6.015e+13\n"
# ehm, no
format_csv(format_numeric_dpd(df))
#> [1] "a,b\n-0.00004029971,0.04129755\n"
format_csv(format_numeric_dpd(geoid_df))
#> [1] "GEOID\n60150001022000\n60150001022001\n60150001022002\n60150001022003\n60150001022004\n60150001022005\n"

But how can I reliably preserve precision without hard-coding it with digits or nsmall?
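
One possible partial answer, offered as a sketch rather than an official readr recommendation: asking format() for up to 15 significant digits in fixed notation keeps every digit of the two values above without showing floating point noise. The helper name format_numeric_15 is hypothetical. The choice of 15 digits is an assumption: it is the most a double reliably holds for decimal input, but it is fewer than the 17 needed to round-trip every double exactly.

```r
# Hypothetical helper (not part of readr): format numeric columns in fixed
# notation with up to `digits` significant digits.
format_numeric_15 <- function(x, digits = 15, ...) {
  numeric_cols <- vapply(x, is.numeric, logical(1))
  x[numeric_cols] <- lapply(
    x[numeric_cols], format,
    digits = digits, scientific = FALSE, trim = TRUE, ...
  )
  x
}

df <- data.frame(a = -0.00004029971, b = 0.0412975501857025)
out <- format_numeric_15(df)
out$a
#> [1] "-0.00004029971"
out$b
#> [1] "0.0412975501857025"
```

Note that format() gives all values in a column the same number of decimal places, so a column with mixed magnitudes may show trailing zeros.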

@zeehio
Contributor

zeehio commented May 17, 2017

While I am not the one to talk about design decisions, maybe I can help explain the integer limitations (this message may help to understand the problem or work around it with the bit64 package; it does not provide a specific solution):

In R you store integers as a 32-bit integer number (a <- 3L). This limits you to a maximum number of: a <- 2147483647L (2^31-1). If you try a larger number a <- 2147483648L you will see a warning non-integer value 2147483648L qualified with L; using numeric value and the number will be stored as a double.
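
The 32-bit limit described above can be checked in base R. This is only an illustrative sketch; the variable names are made up for the example.

```r
# Largest value R's 32-bit integer type can hold (2^31 - 1)
int_max <- .Machine$integer.max
int_max
#> [1] 2147483647

# One past the limit, written without the L suffix, is stored as a double
too_big <- 2147483648
typeof(too_big)
#> [1] "double"

# Integer arithmetic that overflows yields NA (normally with a warning)
overflowed <- suppressWarnings(int_max + 1L)
overflowed
#> [1] NA
```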

Using doubles, every integer up to 2^53 ([9007199254740992](http://stackoverflow.com/a/1848953/446149)) is stored without loss of precision. If you try a higher number you may lose precision depending on its floating point representation. For instance, 2^53+1 can't be represented as a double without loss of precision, but 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368 (close to 1.8E308) can.

Another alternative, if you need to work with integer numbers below 2^63-1, is to use the bit64 package, which works as expected. This would be a workaround to your problem.
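
A short base R check of the 2^53 boundary, as an illustrative sketch (the variable name limit is made up for the example):

```r
limit <- 2^53

# 2^53 + 1 has no exact double representation and rounds back to 2^53
limit + 1 == limit
#> [1] TRUE

# Just below the boundary, neighbouring integers are still distinct
limit - 1 == limit
#> [1] FALSE

# The GEOID values from the original report are far below 2^53,
# so they are stored exactly; only their printing is affected
60150001022005 < limit
#> [1] TRUE
```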

x <- data.frame(a = bit64::as.integer64(60150001022000))
readr::write_csv(x, "/tmp/test.csv")
# cat /tmp/test.csv 
# a
# 60150001022000

Usually doubles (called numeric in R) are used to store very large or very small numbers up to some degree of precision. Using them to store integers is fine, as long as we are aware of the 2^53 limit, but then printing those numbers is much more complicated, because printing libraries need to be designed with the use case "The user wants to store an integer in a double" in mind, or we need to work around that.

@dpprdan
Contributor Author

dpprdan commented May 19, 2017

Thanks @zeehio for the explanation. I was not aware of that (in that detail at least).
However, that does not explain why, when I try to store doubles like (a = -0.00004029971, b = 0.0412975501857025) as csv, I either get scientific notation (a = -4.029971e-05), lose precision (b = 0.04129755), or both, does it?
In any case, this remains true.
