Parsing issues when combining as_date and across with multiple date formats #1042

Closed
ad1729 opened this issue May 17, 2022 · 6 comments

ad1729 commented May 17, 2022

Hi,

I have a data frame with multiple date columns (encoded as characters) where the date format can differ between columns (e.g. col1 has the format YYYY-MM-DD and col2 is coded as DD/MM/YYYY). What I'd like to do is supply multiple formats while converting the character columns to date columns.

Using lubridate::as_date(..., format = c(...)) with lubridate version 1.8.0:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

tibble(
    date1 = c("2020-10-02", NA_character_), 
    date2 = c("03/10/2020", "03/11/2020")
    ) %>% 
    mutate(
        across(
            .cols = contains("date"), 
            .fns = ~ lubridate::as_date(.x, format = c("%Y-%m-%d", "%d/%m/%Y"))
            )
        )
#> # A tibble: 2 × 2
#>   date1      date2     
#>   <date>     <date>    
#> 1 2020-10-02 NA        
#> 2 NA         2020-11-03

packageVersion("lubridate")
#> [1] '1.8.0'

packageVersion("dplyr")
#> [1] ‘1.0.8’

Not sure why this fails to convert the first element in date2 from 03/10/2020 to 2020-10-03.

On the other hand, using base R's as.Date(..., tryFormats = c(...)) works as expected / desired.

tibble(
    date1 = c("2020-10-02", NA_character_), 
    date2 = c("03/10/2020", "03/11/2020")
    ) %>% 
    mutate(
        across(
            .cols = contains("date"), 
            .fns = ~ as.Date(.x, tryFormats = c("%Y-%m-%d", "%d/%m/%Y"))
            )
        )
#> # A tibble: 2 × 2
#>   date1      date2     
#>   <date>     <date>    
#> 1 2020-10-02 2020-10-03
#> 2 NA         2020-11-03
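
A note on why tryFormats succeeds here: the base R documentation for as.Date says the tryFormats are tried one by one on the first non-NA element, and the format that works is then used for the whole vector. Since each column in this example is internally consistent, one format per column is enough. A minimal base R sketch (the vector name `mixed` is made up for illustration) of what that implies for a column that mixes formats:

```r
# Hypothetical single column that mixes both formats.
mixed <- c("2020-10-02", "03/10/2020")

# tryFormats picks the first format that parses the first non-NA element
# ("%Y-%m-%d" here) and applies that one format to every element.
picked <- as.Date(mixed, tryFormats = c("%Y-%m-%d", "%d/%m/%Y"))
print(picked)
#> [1] "2020-10-02" NA
```

So tryFormats selects a format per vector rather than per element, which suits the across() case but not a mixed column.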

Additionally, using as_date(parse_date_time(..., orders = c(...))) gives the right result too on this example.

tibble(
    date1 = c("2020-10-02", NA_character_), 
    date2 = c("03/10/2020", "03/11/2020")
    ) %>% 
    mutate(
        across(
            .cols = contains("date"), 
            .fns = ~ lubridate::as_date(lubridate::parse_date_time(.x, orders = c("%Y-%m-%d", "%d/%m/%Y")))
        )
    )
#> # A tibble: 2 × 2
#>   date1      date2     
#>   <date>     <date>    
#> 1 2020-10-02 2020-10-03
#> 2 NA         2020-11-03

What is unexpected / surprising is that as_date(...) accepts multiple formats but generates NAs where there should be data. As far as I can tell, this only seems to happen when combining as_date and across. If there are multiple formats in a single column, then as_date works as expected:

tibble(
    date1 = c("2020-10-02", "03/10/2020", NA_character_)
    ) %>% 
    mutate(
        date1_1 = lubridate::as_date(date1, format = "%Y-%m-%d"), 
        date1_2 = lubridate::as_date(date1, format = c("%Y-%m-%d", "%d/%m/%Y")))
#> # A tibble: 3 × 3
#>   date1      date1_1    date1_2   
#>   <chr>      <date>     <date>    
#> 1 2020-10-02 2020-10-02 2020-10-02
#> 2 03/10/2020 NA         2020-10-03
#> 3 <NA>       NA         NA

vspinu (Member) commented Oct 26, 2022

I am also puzzled by this behavior. Internally, as_date was using strptime, which does seem to support multiple formats (though this is not documented), but it then breaks inside dplyr's across. So I have moved away from strptime to parse_date_time(exact = TRUE) to get more predictable behavior and to explicitly allow multiple formats in the format argument.
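
One reading that would be consistent with the output reported above, offered as an assumption rather than a confirmed diagnosis: base R's strptime recycles `format` along the input, so each element is paired with the format in the same position instead of being tried against every format. A small base R sketch using the date2 values from the report:

```r
# Values and formats taken from the report; base R only.
date2 <- c("03/10/2020", "03/11/2020")
formats <- c("%Y-%m-%d", "%d/%m/%Y")

# Element 1 is parsed with formats[1], element 2 with formats[2].
recycled <- as.Date(date2, format = formats)
print(recycled)
#> [1] NA           "2020-11-03"

# With the single format that matches this column, both elements parse.
single <- as.Date(date2, format = "%d/%m/%Y")
print(single)
#> [1] "2020-10-03" "2020-11-03"
```

Under that reading, the single-column example in the report works because the position of each value happens to line up with its matching format, not because every format is tried on every element.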

ad1729 (Author) commented Nov 8, 2022

Thanks for looking into this and fixing it!

canadice commented Jan 25, 2023

Did you return to strptime in version 1.9.1?

Before 1.9.0 I used as_date("Jan 16 2023", format = "%b %d %Y"). In 1.9.0 I had to change this to as_date("Jan 16 2023", format = "bdY") to get the same output, but now in 1.9.1 format = "bdY" produces NA and format = "%b %d %Y" produces the correct output once more.

vspinu (Member) commented Jan 25, 2023

@canadice No, the code in 1.9.0 was a regression. We didn't return to strptime, but we now parse with exact = TRUE to comply with as.Date. Also, as_datetime and as_date now behave similarly; for instance, multiple formats can be passed.

Change introduced in 3aa948d

canadice commented Jan 25, 2023

> @canadice No, the code in 1.9.0 was a regression. We didn't return to strptime, but we now parse with exact = TRUE to comply with as.Date. Also, as_datetime and as_date now behave similarly; for instance, multiple formats can be passed.
>
> Change introduced in 3aa948d

Right, and the changes I made when updating to 1.9.0 seemed to work but now with 1.9.1 they have reversed.

To clarify:
<1.9.0 lubridate
as_date("Jan 16 2023", format = "%b %d %Y") produces no errors

1.9.0 lubridate
as_date("Jan 16 2023", format = "%b %d %Y") produces NA
as_date("Jan 16 2023", format = "bdY") produces no errors so changed to this way of writing format.

1.9.1 lubridate
as_date("Jan 16 2023", format = "%b %d %Y") produces no errors
as_date("Jan 16 2023", format = "bdY") produces NA

When troubleshooting the NAs in 1.9.0 I traced it to the parse_date_time(exact = TRUE) and made appropriate changes, but now those changes do not work and the "old" format can be used again. Is this an intended feature of the new version (using "%b %d %Y"), or something that will be reverted again, i.e. will any changes I make now have to be redone to match the 1.9.0 solutions I had?

vspinu (Member) commented Jan 25, 2023

Yes, exact = TRUE means that you need to specify the exact format (as in strptime) and not lubridate's lax format (aka orders). So this particular behavior is now the same as in <1.9.0.

> When troubleshooting the NAs in 1.9.0 I traced it to the parse_date_time(exact = TRUE) and made appropriate changes, but now those changes do not work

If you used parse_date_time(exact = TRUE) you should not be affected by any of this. This change affects only as_date.
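
To make the distinction between lax orders and exact formats concrete, here is a short sketch using parse_date_time directly. It assumes lubridate is installed and uses a numeric date from the original report so the result does not depend on locale month names; the variable names are made up for illustration.

```r
library(lubridate)

x <- "03/10/2020"

# Lax lubridate orders: only the sequence of components is given,
# and separators are guessed.
lax <- parse_date_time(x, orders = "dmY")

# exact = TRUE: orders are treated as literal strptime-style formats,
# so separators and % directives must be spelled out.
strict <- parse_date_time(x, orders = "%d/%m/%Y", exact = TRUE)

# Both should describe the same calendar day.
as_date(lax)
as_date(strict)
```

Per the comments above, as_date(x, format = ...) in 1.9.1 expects the second style.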

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jun 5, 2023
Version 1.9.2
=============

### BUG FIXES

* [#1104](tidyverse/lubridate#1104) Fix
  incorrect parsing of months when %a format is present.

### OTHER

* Adapt to internal name changes in R-devel

Version 1.9.1
=============

### NEW FEATURES

* `as_datetime()` accepts multiple formats in format argument, just like `as_date()` does.

### BUG FIXES

* [#1091](tidyverse/lubridate#1091) Fix
  formatting of numeric inputs to parse_date_time.

* [#1092](tidyverse/lubridate#1092) Fix
  regression in `ymd_hm` on locales where `p` format is not defined.

* [#1097](tidyverse/lubridate#1097) Fix
  `as_date("character")` to work correctly with formats that include
  extra characters.

* [#1098](tidyverse/lubridate#1098) Roll
  over the month boundary in `make_datetime()` when units exceed their
  maximal values.

* [#1090](tidyverse/lubridate#1090)
  timechange has been moved from Depends to Imports.

Version 1.9.0
=============

### NEW FEATURES

* `roll` argument to updating and time-zone manipulation functions is
  deprecated in favor of a new `roll_dst` parameter.

* [#1042](tidyverse/lubridate#1042)
  `as_date` with character inputs accepts multiple formats in `format`
  argument. When `format` is supplied, the input string is parsed with
  `parse_date_time` instead of the old `strptime`.

* [#1055](tidyverse/lubridate#1055)
  Implement `as.integer` method for Duration, Period and Interval
  classes.

* [#1061](tidyverse/lubridate#1061) Make
  `year<-`, `month<-` etc. accessors truly generic. In order to make
  them work with arbitrary class XYZ, it's enough to define a
  `reclass_date.XYZ` method.

* [#1061](tidyverse/lubridate#1061) Add
  support for `year<-`, `month<-` etc. accessors for `data.table`'s
  IDate and ITime objects.

* [#1017](tidyverse/lubridate#1017)
  `week_start` argument in all lubridate functions now accepts full
  and abbreviated names of the days of the week.

* The assignment value `wday<-` can be a string either in English or
  as provided by the current locale.

* Date rounding functions accept a date-time `unit` argument for
  rounding to a vector of date-times.

* [#1005](tidyverse/lubridate#1005)
  `as.duration` now allows for full roundtrip `duration ->
  as.character -> as.duration`

* [#911](tidyverse/lubridate#911) C parsers
  treat multiple spaces as one (just like strptime does)

* `stamp` gained new argument `exact=FALSE` to indicate whether
  `orders` argument is an exact strptime format string or not.

* [#1001](tidyverse/lubridate#1001) Add
  `%within%` method with signature (Interval, list), which was
  documented but not implemented.

* [#941](tidyverse/lubridate#941)
  `format_ISO8601()` gained a new option `usetz="Z"` to format time
  zones with a "Z" and convert the time to the UTC time zone.

* [#931](tidyverse/lubridate#931) Usage of
  `Period` objects in rounding functions is explicitly documented.

### BUG FIXES

* [#1036](tidyverse/lubridate#1036)
  `%within%` now correctly works with flipped intervals

* [#1085](tidyverse/lubridate#1085)
  `as_datetime()` now preserves the time zone of the POSIXt input.

* [#1072](tidyverse/lubridate#1072) Names
  are now handled correctly when combining multiple Period or Interval
  objects.

* [#1003](tidyverse/lubridate#1003)
  Correctly handle r and R formats in locales which have no p format

* [#1074](tidyverse/lubridate#1074) Fix
  concatenation of named Period, Interval and Duration vectors.

* [#1044](tidyverse/lubridate#1044) POSIXlt
  results returned by `fast_strptime()` and `parse_date_time2()` now
  have a recycled `isdst` field.

* [#1069](tidyverse/lubridate#1069) Internal
  code handling the addition of period months and years no longer
  generates partially recycled POSIXlt objects.

* Fix rounding of POSIXlt objects

* [#1007](tidyverse/lubridate#1007) Internal
  lubridate formats are no longer propagated to stamp formatter.

* `train` argument in `parse_date_time` now takes effect. It was
  previously ignored.

* [#1004](tidyverse/lubridate#1004) Fix
  `c.POSIXct` and `c.Date` on empty single POSIXct and Date vectors.

* [#1013](tidyverse/lubridate#1013) Fix
  c(`POSIXct`,`POSIXlt`) heterogeneous concatenation.

* [#1002](tidyverse/lubridate#1002) Parsing
  only with format `j` now works on numeric inputs.

* `stamp()` now correctly errors when no formats could be guessed.

* Updating a date with timezone (e.g. `tzs = "UTC"`) now returns a POSIXct.

### INTERNALS

* `lubridate` is now relying on `timechange` package for update and
  time-zone computation. Google's CCTZ code is no longer part of the
  package.

* `lubridate`'s updating logic is now built on top of `timechange`
  package.

* Change implementation of `c.Period`, `c.Duration` and `c.Interval`
  from S4 to S3.

Version 1.8.0
=============

### NEW FEATURES

* [#960](tidyverse/lubridate#960)
  `c.POSIXct` and `c.Date` can deal with heterogeneous object types
  (e.g `c(date, datetime)` works as expected)

### BUG FIXES

* [#994](tidyverse/lubridate#994)
  Subtracting two duration or two period objects no longer results in
  an ambiguous dispatch note.

* `c.Date` and `c.POSIXct` correctly deal with empty vectors.

* `as_datetime(date, tz=XYZ)` returns the date-time object with HMS
  set to 00:00:00 in the corresponding `tz`

### CHANGES

* [#966](tidyverse/lubridate#966) Lubridate is
  now built with cpp11 (contribution of @DavisVaughan)