
write_csv appends .0 to numbers #526

Closed
wdenton opened this issue Sep 21, 2016 · 13 comments

@wdenton commented Sep 21, 2016

library(readr)
library(magrittr)

data.frame(x = c(1, 2)) %>% write_csv("test.csv")

Result:

x
1.0
2.0

Forcing x to integer prevents this, but the behaviour is surprising.
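For comparison (a minimal sketch), a column that is genuinely integer is written without the suffix:

library(readr)

# an integer column round-trips without the ".0" suffix
write_csv(data.frame(x = c(1L, 2L)), "test.csv")
# test.csv:
# x
# 1
# 2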

@josiekre commented Oct 28, 2016

This has bitten me a couple of times too. It took me a while to debug in automated scripts because it is unexpected. Running the same code pre-v1.0 does not do this.

@hadley commented Oct 28, 2016

Why is that a problem?

@josiekre commented Oct 28, 2016

A lot of my R scripts write out data to be read into Python for simulation. When the csv has a double rather than an int, it eventually causes an error in Python. I am now forced to cast to integer in R before writing the csv.

I guess the problem might be with my code earlier. Is there a reason I should not expect an int after computing a count variable with n() or row_number()?
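(For what it's worth, n() and row_number() do return integers; the integer type is only lost if later arithmetic mixes in a double. A small sketch, assuming dplyr:)

library(dplyr)

df <- data.frame(g = c("a", "a", "b"))
counts <- df %>% group_by(g) %>% summarise(n = n())
typeof(counts$n)      # "integer"
typeof(counts$n + 0)  # "double" -- adding a double promotes the values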

@rdinter commented Nov 21, 2016

I also ran into this issue recently; it affected a few of my "year" variables in a dataset and created problems merging that dataset with others.

data.frame(x = c(1999, 2000, 2001)) %>% write_csv("test.csv")
This gives me:

x
1999.0
2e3
2001.0

This isn't a problem if the variable is cast with as.integer() or as.character() first, which is probably good coding practice anyway. But it feels like there was an update to the readr package: previously the ".0" precision was not part of write_csv() output. For comparison, the base write.csv() function does not add the ".0" precision either.
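A quick sketch of both workarounds, using hypothetical file names:

library(readr)

years <- data.frame(x = c(1999, 2000, 2001))

# cast explicitly before writing
years$x <- as.integer(years$x)
write_csv(years, "years.csv")  # writes 1999, 2000, 2001

# or fall back to base R, which does not append ".0" to whole-number doubles
write.csv(data.frame(x = c(1999, 2000, 2001)), "years2.csv", row.names = FALSE)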

@chpmoreno commented Nov 23, 2016

I had the same issue with years and also with some other factor variables. In particular, I used R to read an SPSS file (.sav) and then write a .csv used for creating MySQL databases. The problem was that the MySQL schema my colleagues made defined those columns as integers, so it didn't work properly with doubles. I therefore wrote a script that detects whether each variable in the data frame (read from the SPSS file with the foreign package; I tried haven, but it did not read the NA values assigned in the .sav properly) is character, numeric, or integer. I also had to preserve some zero-padded indexes (like 00455). In the end the script worked, but it is not the most efficient. It looks something like this:

library(readr)
library(foreign)
library(stringi)

data <- read.spss("data.sav",
                  to.data.frame = TRUE, use.value.labels = FALSE)

data_alter <- as.data.frame(seq(1, nrow(data)))
for (i in 1:ncol(data)) {
  # strip embedded spaces, then try a numeric interpretation of the column
  stripped <- stri_replace_all_fixed(as.character(data[, i]), " ", "")
  nums <- suppressWarnings(as.numeric(stripped))
  if (all(is.na(nums))) {
    # nothing parses as a number: keep the column as character
    data_alter <- cbind(data_alter, as.data.frame(as.character(data[, i])))
  } else if (!isTRUE(all.equal(as.character(nums), stripped))) {
    # the numeric round-trip changes the text (e.g. zero-padded "00455"): keep as character
    data_alter <- cbind(data_alter, as.data.frame(as.character(data[, i])))
  } else if (isTRUE(all.equal(nums, as.integer(nums)))) {
    # every value is a whole number: store as integer
    data_alter <- cbind(data_alter, as.data.frame(as.integer(data[, i])))
  } else {
    data_alter <- cbind(data_alter, as.data.frame(as.numeric(data[, i])))
  }
}
data_alter <- data_alter[, -1]
colnames(data_alter) <- colnames(data)
write_csv(data_alter, "data.csv", na = "")

Maybe it would be useful if we could somehow define some variables as integers.
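As an aside, a lighter-weight alternative to the full script might be readr::type_convert(), which re-guesses the types of character columns in an existing data frame, though unlike the script above it would not preserve zero-padded indexes like 00455:

library(readr)

chr_data <- data.frame(a = c("1", "2"), b = c("1.5", "2.5"),
                       stringsAsFactors = FALSE)
str(type_convert(chr_data))  # a is guessed as a whole number, b as double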

@jennybc commented Dec 22, 2016

I just re-ran some 2015 scripts, and this caused massive diffs in multiple csvs. I guess it doesn't really matter? But it would be nice if integer data were still written without the trailing .0, for many of the reasons outlined above.

@hadley commented Dec 22, 2016

I think this behaviour will cause a little pain in the short term, but it reduces pain in the long term by forcing you to be more explicit about column types.

It's straightforward to convert all integerish columns yourself:

library(purrr)

# TRUE where a value is a whole number (assumes numeric columns without NAs)
integerish <- function(x) x == trunc(x)

df %>%
  map_if(~ all(integerish(.)), as.integer)
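For example (a hypothetical df; note that map_if() returns a bare list, so the result needs to be turned back into a data frame before writing):

library(readr)
library(purrr)

df <- data.frame(a = c(1, 2), b = c(1.5, 2.5))

df %>%
  map_if(~ is.numeric(.) && all(. == trunc(.)), as.integer) %>%
  as.data.frame() %>%
  write_csv("test.csv")  # a is written as 1, 2; b keeps 1.5, 2.5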

@hadley closed this Dec 22, 2016

@hadley commented Dec 22, 2016

@jennybc integers are written out without trailing dots:

data.frame(x = 1L) %>% format_csv()
#> [1] "x\n1\n"

@hadley reopened this Dec 22, 2016
@hadley closed this Dec 22, 2016

@jennybc commented Dec 22, 2016

I think that a year ago write_csv() would leave off the .0 for an individual numeric value if it was integer-ish. And if that applied to an entire variable, the variable looked like an integer in the csv, even though the user never explicitly made it one.

Who knows ... my year-old scripts do a lot of other tidyverse processing in between read_csv() and write_csv()! Something changed (the data did not), but I have no burning need to know what.

[screenshot: csv diff, 2016-12-22]

@rgknight commented Jan 18, 2017

Another voice for returning to the old behavior. I've been caught by this many times.

If the tidyverse is not going to silently coerce from integer to double (r-lib/vctrs#7), it seems reasonable to write integer-ish values in their integer representation when writing to file. Why be type-strict on reading/writing integer vs. double when we are not on manipulating integers?

For example, it's confusing to read an int var in, add zero to it, write_csv() it, and then not be able to read the same file back in:

library(readr)
library(dplyr)

file <- "my_int_var\n0"

read_csv(file, col_types = "i") %>%
    mutate(my_int_var = my_int_var + 0) %>%  # `+ 0` silently coerces to double
    write_csv("file.csv")                    # the value is now written as "0.0"

read_csv("file.csv", col_types = "i")        # parsing as integer fails

# you would need to add `as.integer(0)` (i.e. 0L) instead to preserve the type

My solution is either to stop reading integers, since they will likely be silently coerced to double at some point in my scripts anyway, or to just go back to write.csv().

Maybe read_csv() should never guess integer, since "0" is not an integer in R? That wouldn't solve the problem mentioned earlier of sharing csvs with other programs, though.

@hadley commented Jan 18, 2017

Re-opening just to make sure we re-think this for the next release.

@jimhester commented Jan 31, 2017

@rgknight You append L to a number to specify an integer; 0L is the integer 0.

I personally feel these types need to be explicit. Otherwise you can have a double column whose first 1000 values look like integers but which contains doubles after that, and it will fail to parse.
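A minimal sketch of that failure mode, assuming a readr version whose guesser can return integer (as readr 1.0's could) and the default guess_max of 1000:

library(readr)

# the first 1000 values look like integers; value 1001 is a double
csv <- paste0("x\n", paste(rep("1", 1000), collapse = "\n"), "\n1.5\n")
df <- read_csv(csv)
problems(df)  # row 1001 ("1.5") fails to parse as an integer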

@hadley commented Feb 1, 2017

Let's go back to the old behaviour. We're forcing people to care about the difference between integers and doubles, when that's not usually important in R.
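For the record, a sketch of the restored behaviour, assuming it matches the pre-1.0 formatting: whole-number doubles are written without the trailing .0, so the original example round-trips cleanly.

library(readr)

format_csv(data.frame(x = c(1, 2)))
#> [1] "x\n1\n2\n"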
