write_csv: scientific notation cannot be disabled #671

Closed
dpprdan opened this Issue May 10, 2017 · 4 comments

dpprdan commented May 10, 2017

write_csv() turns (some) longer numbers into scientific notation and there does not seem to be a way to disable it. This has been mentioned before in #229 and apparently was fixed then, so this might be a regression?

library("readr")
df <- data.frame(a = -0.0004029971, b = 0.0412975501857025)
print(df, digits = 17)
#>                         a                    b
#> 1 -0.00040299710000000002 0.041297550185702497
cat(format_csv(df))
#> a,b
#> -4.029971e-4,0.0412975501857025

The problem is that the tools I use to import the CSV do not handle scientific notation well.

Also compare this with the following case (which should be equivalent IMHO):

format_csv(data.frame(GEOID = seq(from = 60150001022000, to = 60150001022005, 
  1)))
#> [1] "GEOID\n60150001022e3\n60150001022001\n60150001022002\n60150001022003\n60150001022004\n60150001022005\n"

I.e. GEOID\n60150001022e3 instead of GEOID\n60150001022000

Member

jimhester commented May 10, 2017

I don't think we will be changing this behavior in the near future (if ever). A workaround you can use is to format the columns before writing. See ?base::format for details on possible formatting arguments.

format_numeric <- function(x, ...) {
  numeric_cols <- vapply(x, is.numeric, logical(1))
  x[numeric_cols] <- lapply(x[numeric_cols], format, ...)
  x
}

library("readr")
df <- data.frame(a = -0.0004029971, b = 0.0412975501857025)
format_csv(format_numeric(df))
#> [1] "a,b\n-0.0004029971,0.04129755\n"

Author

dpprdan commented May 11, 2017

Thanks! One general question though: Why default to a notation/formatting that, at least to me, seems to be less compatible with other tools? (Even more so when the file format, CSV, is arguably one of the most interchangeable/compatible formats.) I guess this is all a matter of perspective; I would just like to understand your design choice.

One addition and one question with respect to the format_numeric function: I guess one ought to add the scientific = FALSE option to reliably disable scientific notation, irrespective of options(scipen).

format_numeric_jh <- function(x, ...) {
  numeric_cols <- vapply(x, is.numeric, logical(1))
  x[numeric_cols] <- lapply(x[numeric_cols], format, ...)
  x
}
format_numeric_dpd <- function(x, scientific = FALSE, ...) {
  numeric_cols <- vapply(x, is.numeric, logical(1))
  x[numeric_cols] <- lapply(x[numeric_cols], format, scientific = scientific, ...)
  x
}
df <- data.frame(a = -0.00004029971, b = 0.0412975501857025)
geoid_df <- data.frame(GEOID = seq(from = 60150001022000, to = 60150001022005, 1))
print(df, digits = 18)
#>                         a                    b
#> 1 -4.0299709999999997e-05 0.041297550185702497
library("readr")
format_csv(format_numeric_jh(df))
#> [1] "a,b\n-4.029971e-05,0.04129755\n"
format_csv(format_numeric_jh(geoid_df))
#> [1] "GEOID\n6.015e+13\n6.015e+13\n6.015e+13\n6.015e+13\n6.015e+13\n6.015e+13\n"
# ehm, no
format_csv(format_numeric_dpd(df))
#> [1] "a,b\n-0.00004029971,0.04129755\n"
format_csv(format_numeric_dpd(geoid_df))
#> [1] "GEOID\n60150001022000\n60150001022001\n60150001022002\n60150001022003\n60150001022004\n60150001022005\n"

But how can I reliably preserve precision without hard-coding it with digits or nsmall?
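One possible approach, offered only as a sketch and not confirmed anywhere in this thread: pass digits = 15 together with scientific = FALSE. A double carries 15 to 17 significant decimal digits, so 15 works as a general upper bound rather than a per-column setting, and format() drops trailing zeros it does not need. The helper name format_numeric_full is hypothetical, and trim = TRUE is an added assumption to avoid padding.

```r
# Hypothetical helper: format every numeric column in fixed notation,
# keeping up to 15 significant digits (an assumed "safe" upper bound).
format_numeric_full <- function(x, digits = 15, ...) {
  numeric_cols <- vapply(x, is.numeric, logical(1))
  x[numeric_cols] <- lapply(
    x[numeric_cols],
    function(col) format(col, digits = digits, scientific = FALSE, trim = TRUE, ...)
  )
  x
}

df <- data.frame(a = -0.00004029971, b = 0.0412975501857025)
out <- format_numeric_full(df)

# The columns are now character vectors that parse back to the original values
print(out)
print(as.numeric(out$a) == df$a)
print(as.numeric(out$b) == df$b)
```

The result can then be passed to format_csv() or write_csv() in the same way as the format_numeric variants above. Values that need more than 15 significant digits to round-trip exactly would still be rounded.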

Contributor

zeehio commented May 17, 2017

While I am not the one to talk about design decisions, maybe I can help explain the integer limitations. (This message may help you understand the problem or work around it with the bit64 package; it does not provide a specific solution.)

In R, integers are stored as 32-bit signed numbers (a <- 3L). This limits you to a maximum of 2147483647L (2^31-1). If you try a larger literal such as a <- 2147483648L, you will see the warning "non-integer value 2147483648L qualified with L; using numeric value" and the number will be stored as a double.
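The 32-bit limit can be checked directly in base R. This is a small illustrative sketch; the variable names are made up for the example.

```r
# Largest value an R integer (32-bit signed) can hold
max_int <- .Machine$integer.max
print(max_int)                 # 2147483647
print(max_int == 2^31 - 1)     # TRUE

# Integer arithmetic past the limit yields NA (R also emits a warning)
overflow <- suppressWarnings(max_int + 1L)
print(is.na(overflow))         # TRUE

# Without the L suffix the literal is parsed as a double
big <- 2147483648
print(typeof(big))             # "double"
```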

Using doubles, every integer up to 2^53 (9007199254740992, see http://stackoverflow.com/a/1848953/446149) is stored without loss of precision. Above that you may lose precision, depending on the number's floating-point representation. For instance, 2^53+1 can't be represented as a double without loss of precision, but 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368 (close to 1.8E308) can.
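A short base R sketch of the 2^53 boundary, assuming standard IEEE 754 doubles. The GEOID value is the one from the original report; the other names are illustrative.

```r
two53 <- 2^53
print(sprintf("%.0f", two53))   # "9007199254740992"

# 2^53 + 1 has no exact double representation and rounds back to 2^53
print(two53 + 1 == two53)       # TRUE
# Below the boundary, neighbouring integers remain distinct
print(two53 - 1 == two53)       # FALSE

# The GEOID values from this issue are far below 2^53, so they are stored exactly
geoid <- 60150001022000
print(geoid < two53)            # TRUE
print((geoid + 1) - geoid)      # 1
```

So the GEOID column itself loses nothing when held as a double; only the text representation chosen when writing is at issue.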

Another alternative, if you need to work with integers up to 2^63-1, is the bit64 package, which works as expected. This would be a workaround for your problem.

x <- data.frame(a = bit64::as.integer64(60150001022000))
readr::write_csv(x, "/tmp/test.csv")
# cat /tmp/test.csv 
# a
# 60150001022000

Usually doubles (called numeric in R) are used to store very large or very small numbers up to some degree of precision. Using them to store integers is fine as long as we are aware of the 2^53 limit, but printing those numbers is much more complicated: printing libraries need to be designed with the use case "the user wants to store an integer in a double" in mind, or we need to work around that.
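To illustrate the printing side in base R (this shows base::format, not readr's own writer): with default settings, format() picks whichever of fixed or scientific notation is shorter, and either scientific = FALSE or a large options(scipen) value tips the choice toward fixed notation. The variable names are illustrative.

```r
x <- 60150001022000

# Default: scientific notation wins because it is shorter at 7 significant digits
old <- options(scipen = 0)
default_repr <- format(x)
print(default_repr)       # "6.015e+13", as in the output shown earlier in the thread

# Explicitly request fixed notation for this call only
fixed_repr <- format(x, scientific = FALSE)
print(fixed_repr)

# Or penalise scientific notation globally through scipen
options(scipen = 100)
penalized_repr <- format(x)
print(penalized_repr)

# Restore the previous option value
options(old)
```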

Author

dpprdan commented May 19, 2017

Thanks @zeehio for the explanation. I was not aware of that (in that detail at least).
However, that does not explain why, when I try to store doubles like (a = -0.00004029971, b = 0.0412975501857025) as CSV, I get scientific notation (a = -4.029971e-05), lose precision (b = 0.04129755), or both, does it?
In any case, this remains true.

jimhester closed this May 19, 2017

zeehio added a commit to zeehio/readr that referenced this issue May 19, 2017

Allow int_use_scientific=FALSE in write_csv
This fixes tidyverse#671, allowing to save large integers (up to 1e15)
without scientific notation.

zeehio added a commit to zeehio/readr that referenced this issue May 19, 2017

Add scipen argument to write_csv.
A scipen value higher than zero will prefer fixed point notation to scientific notation.

Fixes tidyverse#671

zeehio added a commit to zeehio/readr that referenced this issue May 20, 2017

Allow control of scientific notation in CSV
This fixes tidyverse#671, allowing to save large integers (up to 1e15)
without scientific notation.

Add a scipen argument to write_csv.

A scipen value higher than zero will prefer fixed point notation to scientific notation.

zeehio added a commit to zeehio/readr that referenced this issue Oct 24, 2017

Allow control of scientific notation in CSV
This fixes tidyverse#671, allowing to save large integers (up to 1e15)
without scientific notation.

Add a scipen argument to write_csv.

A scipen value higher than zero will prefer fixed point notation to scientific notation.

lock bot locked and limited conversation to collaborators Sep 24, 2018