Data quality issues associated with decision to force UTC on date/time objects (plus related bug) #702

jmobrien · 2022-12-14T17:19:57Z

Wanted to bring up some concerns associated with the changes to force implied UTC on all date/time variables prior to exporting (associated with #555).

I get that the best choice isn't obvious. As discussed there and later in #692, since the other suites all lack a notion of time zone, while the date/time variables all "officially" represent UTC (or similar, for SPSS), even looking at their own docs it's encouraged to just use them to display as expected, so the package is following that expectation.

However, even if it's reasonable to consider users display expectations for outside suites, I think users likely have a deeper, implicit expectation that display should take a back seat to data quality and accuracy. In the case of dates and times, an important accuracy concern is a consistent metric between time points in data. Unfortunately, thanks to the complexities of things daylight savings time, the current approach might be falling short.

Below is a reprex that resembles an actual scenario from my team's data collection: multi-day intensive data collection for an individual (including biological sampling, for which things like circadian rhythms--and thus accurate timestamps--matter). We also happen to have complex data use/publication demands that means that while we mostly work in R, we also have to support Stata datasets.

In this case, a 10-day collection window happens to fall over a DST transition day. So, despite consistent collection "by the clock" each day, there's actually some variation in sampling times that is important to take into account.

Pre-#555, all timestamps would be shown in UTC once sent to Stata, regardless of collection locale. That was important for us to document, but users at least knew that data reliably measured the passage of time from one collection to the next. But now that all timestamps are treated as if they were always UTC to start with, our Stata users and R users are actually getting different timestamp data:

require(haven, quietly = TRUE)
require(tidyverse, quietly = TRUE)

# Function because diff.POSIXt can't specify units:
alt_diff <- 
  function(x){
    difftime(x[-1], x[-length(x)], units = "hours") |> 
      append(as.difftime(NA_character_, units="hours"), after = 0)
  }

# Data from someone sampled 10 sequential days at 8:00 AM,
# where collection happened to span an end-of-daylight-savings transition:
df <- 
  tibble(
    session = 
      1:10,
    time = 
      glue::glue("2022-11-{1:10} 08:00:00") |> 
      as.POSIXct(tz = "America/Chicago"),
    difftime_orig = 
      alt_diff(time)
  )

# Time data + relationship between time points is accurate:
# - one day-over-day time difference 25 hours due to DST ending
# - subsequent sessions stay at that later time (24 hours past previous)

df
#> # A tibble: 10 × 3
#>    session time                difftime_orig
#>      <int> <dttm>              <drtn>       
#>  1       1 2022-11-01 08:00:00 NA hours     
#>  2       2 2022-11-02 08:00:00 24 hours     
#>  3       3 2022-11-03 08:00:00 24 hours     
#>  4       4 2022-11-04 08:00:00 24 hours     
#>  5       5 2022-11-05 08:00:00 24 hours     
#>  6       6 2022-11-06 08:00:00 25 hours     
#>  7       7 2022-11-07 08:00:00 24 hours     
#>  8       8 2022-11-08 08:00:00 24 hours     
#>  9       9 2022-11-09 08:00:00 24 hours     
#> 10      10 2022-11-10 08:00:00 24 hours

# Round-tripping from Stata:
tmp <- tempfile(fileext = ".dta")
written_df <- haven::write_dta(df, tmp)
read_df <- haven::read_dta(tmp) |> 
  # regenerate the time_passed variable:
  mutate(
    # remake original difftime
    difftime_orig = as.difftime(difftime_orig, units = "hours"),
    # Difftime w/current data:
    difftime_forceutc = alt_diff(time),
    # Compare:
    matching = difftime_orig == difftime_forceutc
  )

# Times and relationships b/w them are no longer consistently accurate:

read_df  
#> # A tibble: 10 × 5
#>    session time                difftime_orig difftime_forceutc matching
#>      <dbl> <dttm>              <drtn>        <drtn>            <lgl>   
#>  1       1 2022-11-01 08:00:00 NA hours      NA hours          NA      
#>  2       2 2022-11-02 08:00:00 24 hours      24 hours          TRUE    
#>  3       3 2022-11-03 08:00:00 24 hours      24 hours          TRUE    
#>  4       4 2022-11-04 08:00:00 24 hours      24 hours          TRUE    
#>  5       5 2022-11-05 08:00:00 24 hours      24 hours          TRUE    
#>  6       6 2022-11-06 08:00:00 25 hours      24 hours          FALSE   
#>  7       7 2022-11-07 08:00:00 24 hours      24 hours          TRUE    
#>  8       8 2022-11-08 08:00:00 24 hours      24 hours          TRUE    
#>  9       9 2022-11-09 08:00:00 24 hours      24 hours          TRUE    
#> 10      10 2022-11-10 08:00:00 24 hours      24 hours          TRUE

# Separate issue: write_*() functions invisibly return dataframe with 
# forced UTC, instead of original input data as documented:

written_df$time
#>  [1] "2022-11-01 08:00:00 UTC" "2022-11-02 08:00:00 UTC"
#>  [3] "2022-11-03 08:00:00 UTC" "2022-11-04 08:00:00 UTC"
#>  [5] "2022-11-05 08:00:00 UTC" "2022-11-06 08:00:00 UTC"
#>  [7] "2022-11-07 08:00:00 UTC" "2022-11-08 08:00:00 UTC"
#>  [9] "2022-11-09 08:00:00 UTC" "2022-11-10 08:00:00 UTC"

^{Created on 2022-12-16 by the reprex package (v2.0.1)}

Again, I understand how it's not a cut-and-dry scenario because you're working around inadequacies in the other suites. Still, IMO the data quality risks from this new date/time approach isn't a worthwhile trade for a more familiar Stata display--especially where users have no choice to preserve older behavior.

Perhaps I'm in the minority on that view, and the new approach is preferable overall. Still, perhaps at least a new argument for whether force_utc() gets applied would be wise? With documentation so users understand the implications either way?

One secondary issue as an aside--in the reprex you'll also note that the write_* functions also aren't returning unaltered input data when date/time variables are present. That was related to the changes due to #555, but it was a simple fix so I've got a forthcoming PR for that.

EDIT: reprex had errors due last-minute changes to simplify some variable names, but missing a reference to the older name, as well as a mistyped argument tx instead of tz (h/t @huftis). Corrected, apologies.

The text was updated successfully, but these errors were encountered:

gorcha · 2023-01-03T03:21:59Z

Hi @jmobrien, thanks for the feedback.

I'm inclined to stay with the current approach (it still feels like a sensible default) but make the timezone adjustment configurable via an argument/option and improve the documentation.

Thanks for the PR fixing the reutn value also, that is eminently sensible! I'll review and let you know of any feedback soon 🙂

jmobrien · 2023-01-03T16:57:00Z

Thanks. An option should be sufficient; I think our situation represents an important use case but I'm not sure how typical it is.

huftis · 2023-01-17T09:16:19Z

I agree that an option/argument to specify how the timezone should be handled when saving would be very useful. There should also be a similar option for specifying the timezone whean reading data (in haven::read_*()).

BTW, there’s a bug in your test case, @jmobrien. In as.POSIXct(tx = "America/Chicago"), the specified timezone is actually ignored. The argument should be tz, not tx.

gorcha · 2023-01-19T01:26:25Z

Hi @huftis, I don't think this makes sense to add as an argument for the file reading functions - date time variables are read in consistently as UTC, and the timezone can be changed by the user as needed after reading.

jmobrien mentioned this issue Dec 14, 2022

Original data return from write_* functions no longer has UTC-adjusted date/times #703

Merged

gorcha added the feature a feature request or enhancement label Jan 3, 2023

gorcha added a commit that referenced this issue Feb 21, 2023

Add convert_utc argument to writer functions (close #702)

392972f

gorcha mentioned this issue Feb 21, 2023

Add convert_utc argument to writer functions #714

Merged

gorcha closed this as completed in #714 Feb 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data quality issues associated with decision to force UTC on date/time objects (plus related bug) #702

Data quality issues associated with decision to force UTC on date/time objects (plus related bug) #702

jmobrien commented Dec 14, 2022 •

edited

gorcha commented Jan 3, 2023

jmobrien commented Jan 3, 2023

huftis commented Jan 17, 2023

gorcha commented Jan 19, 2023 •

edited

Data quality issues associated with decision to force UTC on date/time objects (plus related bug) #702

Data quality issues associated with decision to force UTC on date/time objects (plus related bug) #702

Comments

jmobrien commented Dec 14, 2022 • edited

gorcha commented Jan 3, 2023

jmobrien commented Jan 3, 2023

huftis commented Jan 17, 2023

gorcha commented Jan 19, 2023 • edited

jmobrien commented Dec 14, 2022 •

edited

gorcha commented Jan 19, 2023 •

edited