write_tsv adds whitespace characters to NA #930

Closed
dfjenkins3 opened this issue Dec 1, 2018 · 11 comments

Comments

@dfjenkins3 dfjenkins3 commented Dec 1, 2018

Hi,

Using readr 1.2.1 I'm getting extra whitespace characters around NA values when I use write_tsv. Here's an example:

example.txt

library(readr)
a <- read_tsv("example.txt")
write_tsv(a, path = "example2.txt")

In this example the 'NA' values are written as '\s\s\s\s\s\sNA'

Let me know if you need any other info.

Thanks,

David

Member

@batpigandme batpigandme commented Dec 2, 2018

I'm not able to reproduce your issue on my Mac (I put the input and output files in a gist here). Can you try updating to the development version and see if that works?

library(readr)
a <- read_tsv("https://gist.githubusercontent.com/batpigandme/7af83d22bb5812120d1e545cf0233dc7/raw/9d627d5a560756041690edef9fea36ec485a89b2/example.txt")

c <- read_lines("https://gist.githubusercontent.com/batpigandme/7af83d22bb5812120d1e545cf0233dc7/raw/9d627d5a560756041690edef9fea36ec485a89b2/example2.txt")
c
#>  [1] "monday\ttuesday\twednesday\tthursday\tfriday"    
#>  [2] "08:00:00\t08:00:00\t08:00:00\t08:00:00\t08:00:00"
#>  [3] "      NA\t      NA\t      NA\t      NA\t      NA"
#>  [4] "12:00:00\t12:00:00\t12:00:00\t13:00:00\t12:00:00"
#>  [5] "09:00:00\t09:00:00\t09:00:00\t09:00:00\t09:00:00"
#>  [6] "13:00:00\t13:00:00\t13:00:00\t14:00:00\t13:00:00"
#>  [7] "08:00:00\t08:00:00\t08:00:00\t08:00:00\t08:00:00"
#>  [8] "09:00:00\t09:00:00\t09:00:00\t09:00:00\t09:00:00"
#>  [9] "10:00:00\t10:00:00\t      NA\t10:00:00\t10:00:00"
#> [10] "09:00:00\t10:00:00\t12:00:00\t13:00:00\t09:00:00"
#> [11] "10:00:00\t10:00:00\t10:00:00\t10:00:00\t10:00:00"

Created on 2018-12-02 by the reprex package (v0.2.1.9000)

Author

@dfjenkins3 dfjenkins3 commented Dec 2, 2018

Hi Mara,

Thanks for looking into this. In your gist in example2.txt I think you are reproducing my issue. There are whitespace characters after the tabs:

[Screenshot: screen shot 2018-12-02 at 1 25 28 pm]
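
For anyone who cannot see the screenshot, the padding can also be checked programmatically. This is only a minimal base R sketch: the two lines are copied from the read_lines() output above, and `has_padded_na` is a hypothetical helper written for illustration, not part of readr.

```r
# Two lines copied from the read_lines() output above.
lines <- c(
  "08:00:00\t08:00:00\t08:00:00\t08:00:00\t08:00:00",
  "      NA\t      NA\t      NA\t      NA\t      NA"
)

# Hypothetical helper: TRUE for each line that contains a field made up of
# one or more leading spaces followed by "NA".
has_padded_na <- function(x) {
  fields <- strsplit(x, "\t", fixed = TRUE)
  vapply(fields, function(f) any(grepl("^[ ]+NA$", f)), logical(1))
}

has_padded_na(lines)
#> [1] FALSE  TRUE
```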

Member

@batpigandme batpigandme commented Dec 2, 2018

Oh, I see what you're saying. Sorry, I don't have a solution for you, then. But, confirmed. ✔️

Contributor

@cderv cderv commented Dec 2, 2018

This is also happening with csv format.

cat(readr::format_csv(readr::read_tsv("https://gist.githubusercontent.com/batpigandme/7af83d22bb5812120d1e545cf0233dc7/raw/9d627d5a560756041690edef9fea36ec485a89b2/example.txt")))
#> Parsed with column specification:
#> cols(
#>   monday = col_time(format = ""),
#>   tuesday = col_time(format = ""),
#>   wednesday = col_time(format = ""),
#>   thursday = col_time(format = ""),
#>   friday = col_time(format = "")
#> )
#> monday,tuesday,wednesday,thursday,friday
#> 08:00:00,08:00:00,08:00:00,08:00:00,08:00:00
#>       NA,      NA,      NA,      NA,      NA
#> 12:00:00,12:00:00,12:00:00,13:00:00,12:00:00
#> 09:00:00,09:00:00,09:00:00,09:00:00,09:00:00
#> 13:00:00,13:00:00,13:00:00,14:00:00,13:00:00
#> 08:00:00,08:00:00,08:00:00,08:00:00,08:00:00
#> 09:00:00,09:00:00,09:00:00,09:00:00,09:00:00
#> 10:00:00,10:00:00,      NA,10:00:00,10:00:00
#> 09:00:00,10:00:00,12:00:00,13:00:00,09:00:00
#> 10:00:00,10:00:00,10:00:00,10:00:00,10:00:00

Created on 2018-12-02 by the reprex package (v0.2.1)

I don't know if it is intended or not. It does not seem to happen with non-time classes:

cat(readr::format_csv(tibble::tibble(X1=c(111, NA), X2=c("aaaa", NA))))
#> X1,X2
#> 111,aaaa
#> NA,NA

It only happens with an hms column:

library(magrittr)
tibble::tibble(
  X1 = c(111, NA),
  X2 = c("aaaa", NA),
  X3 = c(as.Date("2003/04/09"), as.Date(NA)),
  X4 = c(hms::as.hms("02:00:13"), hms::as.hms(NA_character_))
) %>%
  readr::format_csv() %>%
  cat()
#> X1,X2,X3,X4
#> 111,aaaa,2003-04-09,02:00:13
#> NA,NA,NA,      NA

Created on 2018-12-02 by the reprex package (v0.2.1)

Contributor

@cderv cderv commented Dec 2, 2018

Here is my investigation to get to the bottom of this.

It comes from how hms columns are formatted:

hms_col <- c(hms::as.hms("02:00:13"), hms::as.hms(NA_character_))
readr::output_column(hms_col)
#> [1] "02:00:13" "      NA"

This is because

readr/R/write.R

Lines 206 to 208 in d52a177

output_column.hms <- function(x) {
  format(x, "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
}

calls the hms:::format.hms method, which is:

hms:::format.hms
function (x, ...) 
{
    if (length(x) == 0L) {
        "hms()"
    }
    else {
        format(as.character(x), justify = "right")
    }
}

So it is a choice of the hms 📦 to use justify = "right", and maybe a mistake not to pass ... through. With justify = "none" you get what you want.

hms_col <- c(hms::as.hms("02:00:13"), hms::as.hms(NA_character_))
format(hms_col)
#> [1] "02:00:13" "      NA"
format(as.character(hms_col), justify = "none")
#> [1] "02:00:13" "NA"

I don't know why hms made this choice some time ago, and I also don't really know what the correct formatting should be, as I don't think it is an issue to read a file with padding.

@batpigandme I'll let you decide whether this is an issue and how it could be transferred to the hms 📦. It should be possible to add an argument to hms:::format.hms to deal with that. Several choices are possible, in fact.

Member

@batpigandme batpigandme commented Dec 2, 2018

Thx, @cderv.
@krlmlr, do you think I should move this to hms?

Member

@jimhester jimhester commented Dec 2, 2018

readr should probably be calling format(justify = "none") for all non-primitive objects.

Member

@krlmlr krlmlr commented Dec 2, 2018

I think we could add a justify = "right" argument to format.hms(), would that help here? readr still needs to set justify = "none".
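
A rough sketch of what such an argument could look like, written against a toy S3 class so it runs in base R without hms installed. `toy_time` and `format.toy_time` are hypothetical names for illustration only; this is not the actual hms implementation.

```r
# Toy S3 class standing in for hms (an assumption for illustration only).
toy_time <- function(x) structure(x, class = "toy_time")

# format method with a `justify` argument that defaults to "right", so the
# existing printing behaviour is unchanged unless the caller overrides it.
format.toy_time <- function(x, ..., justify = "right") {
  format(unclass(x), justify = justify, ...)
}

x <- toy_time(c("02:00:13", NA))

format(x)
#> [1] "02:00:13" "      NA"

# What a writer such as readr would request to avoid padded NA values.
format(x, justify = "none")
#> [1] "02:00:13" "NA"
```

Keeping "right" as the default leaves console printing aligned, while a caller that writes files can opt out of the padding explicitly.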

Contributor

@cderv cderv commented Dec 2, 2018

> I think we could add a justify = "right" argument to format.hms(), would that help here? readr still needs to set justify = "none".

This is what I had in mind. 👍

@jimhester jimhester closed this in 0074fa9 Dec 3, 2018
Member

@jimhester jimhester commented Dec 3, 2018

readr can actually just use as.character() on the hms object now. I think I originally had to use format() because, at the time, the CRAN version of hms did not show partial seconds in the as.character() output, but it does now.


@lock lock bot commented Jun 2, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jun 2, 2019