Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data quality issues associated with decision to force UTC on date/time objects (plus related bug) #702

Closed
jmobrien opened this issue Dec 14, 2022 · 4 comments · Fixed by #714
Labels
feature a feature request or enhancement

Comments

@jmobrien
Copy link
Contributor

jmobrien commented Dec 14, 2022

Wanted to bring up some concerns associated with the changes to force implied UTC on all date/time variables prior to exporting (associated with #555).

I get that the best choice isn't obvious. As discussed there and later in #692, since the other suites all lack a notion of time zone, while the date/time variables all "officially" represent UTC (or similar, for SPSS), even looking at their own docs it's encouraged to just use them to display as expected, so the package is following that expectation.

However, even if it's reasonable to consider users display expectations for outside suites, I think users likely have a deeper, implicit expectation that display should take a back seat to data quality and accuracy. In the case of dates and times, an important accuracy concern is a consistent metric between time points in data. Unfortunately, thanks to the complexities of things daylight savings time, the current approach might be falling short.

Below is a reprex that resembles an actual scenario from my team's data collection: multi-day intensive data collection for an individual (including biological sampling, for which things like circadian rhythms--and thus accurate timestamps--matter). We also happen to have complex data use/publication demands that means that while we mostly work in R, we also have to support Stata datasets.

In this case, a 10-day collection window happens to fall over a DST transition day. So, despite consistent collection "by the clock" each day, there's actually some variation in sampling times that is important to take into account.

Pre-#555, all timestamps would be shown in UTC once sent to Stata, regardless of collection locale. That was important for us to document, but users at least knew that data reliably measured the passage of time from one collection to the next. But now that all timestamps are treated as if they were always UTC to start with, our Stata users and R users are actually getting different timestamp data:

require(haven, quietly = TRUE)
require(tidyverse, quietly = TRUE)

# Function because diff.POSIXt can't specify units:
alt_diff <- 
  function(x){
    difftime(x[-1], x[-length(x)], units = "hours") |> 
      append(as.difftime(NA_character_, units="hours"), after = 0)
  }

# Data from someone sampled 10 sequential days at 8:00 AM,
# where collection happened to span an end-of-daylight-savings transition:
df <- 
  tibble(
    session = 
      1:10,
    time = 
      glue::glue("2022-11-{1:10} 08:00:00") |> 
      as.POSIXct(tz = "America/Chicago"),
    difftime_orig = 
      alt_diff(time)
  )

# Time data + relationship between time points is accurate:
# - one day-over-day time difference 25 hours due to DST ending
# - subsequent sessions stay at that later time (24 hours past previous)

df
#> # A tibble: 10 × 3
#>    session time                difftime_orig
#>      <int> <dttm>              <drtn>       
#>  1       1 2022-11-01 08:00:00 NA hours     
#>  2       2 2022-11-02 08:00:00 24 hours     
#>  3       3 2022-11-03 08:00:00 24 hours     
#>  4       4 2022-11-04 08:00:00 24 hours     
#>  5       5 2022-11-05 08:00:00 24 hours     
#>  6       6 2022-11-06 08:00:00 25 hours     
#>  7       7 2022-11-07 08:00:00 24 hours     
#>  8       8 2022-11-08 08:00:00 24 hours     
#>  9       9 2022-11-09 08:00:00 24 hours     
#> 10      10 2022-11-10 08:00:00 24 hours

# Round-tripping from Stata:
tmp <- tempfile(fileext = ".dta")
written_df <- haven::write_dta(df, tmp)
read_df <- haven::read_dta(tmp) |> 
  # regenerate the time_passed variable:
  mutate(
    # remake original difftime
    difftime_orig = as.difftime(difftime_orig, units = "hours"),
    # Difftime w/current data:
    difftime_forceutc = alt_diff(time),
    # Compare:
    matching = difftime_orig == difftime_forceutc
  )

# Times and relationships b/w them are no longer consistently accurate:

read_df  
#> # A tibble: 10 × 5
#>    session time                difftime_orig difftime_forceutc matching
#>      <dbl> <dttm>              <drtn>        <drtn>            <lgl>   
#>  1       1 2022-11-01 08:00:00 NA hours      NA hours          NA      
#>  2       2 2022-11-02 08:00:00 24 hours      24 hours          TRUE    
#>  3       3 2022-11-03 08:00:00 24 hours      24 hours          TRUE    
#>  4       4 2022-11-04 08:00:00 24 hours      24 hours          TRUE    
#>  5       5 2022-11-05 08:00:00 24 hours      24 hours          TRUE    
#>  6       6 2022-11-06 08:00:00 25 hours      24 hours          FALSE   
#>  7       7 2022-11-07 08:00:00 24 hours      24 hours          TRUE    
#>  8       8 2022-11-08 08:00:00 24 hours      24 hours          TRUE    
#>  9       9 2022-11-09 08:00:00 24 hours      24 hours          TRUE    
#> 10      10 2022-11-10 08:00:00 24 hours      24 hours          TRUE

# Separate issue: write_*() functions invisibly return dataframe with 
# forced UTC, instead of original input data as documented:

written_df$time
#>  [1] "2022-11-01 08:00:00 UTC" "2022-11-02 08:00:00 UTC"
#>  [3] "2022-11-03 08:00:00 UTC" "2022-11-04 08:00:00 UTC"
#>  [5] "2022-11-05 08:00:00 UTC" "2022-11-06 08:00:00 UTC"
#>  [7] "2022-11-07 08:00:00 UTC" "2022-11-08 08:00:00 UTC"
#>  [9] "2022-11-09 08:00:00 UTC" "2022-11-10 08:00:00 UTC"

Created on 2022-12-16 by the reprex package (v2.0.1)

Again, I understand how it's not a cut-and-dry scenario because you're working around inadequacies in the other suites. Still, IMO the data quality risks from this new date/time approach isn't a worthwhile trade for a more familiar Stata display--especially where users have no choice to preserve older behavior.

Perhaps I'm in the minority on that view, and the new approach is preferable overall. Still, perhaps at least a new argument for whether force_utc() gets applied would be wise? With documentation so users understand the implications either way?

One secondary issue as an aside--in the reprex you'll also note that the write_* functions also aren't returning unaltered input data when date/time variables are present. That was related to the changes due to #555, but it was a simple fix so I've got a forthcoming PR for that.

EDIT: reprex had errors due last-minute changes to simplify some variable names, but missing a reference to the older name, as well as a mistyped argument tx instead of tz (h/t @huftis). Corrected, apologies.

@gorcha
Copy link
Member

gorcha commented Jan 3, 2023

Hi @jmobrien, thanks for the feedback.

I'm inclined to stay with the current approach (it still feels like a sensible default) but make the timezone adjustment configurable via an argument/option and improve the documentation.

Thanks for the PR fixing the reutn value also, that is eminently sensible! I'll review and let you know of any feedback soon 🙂

@jmobrien
Copy link
Contributor Author

jmobrien commented Jan 3, 2023

Thanks. An option should be sufficient; I think our situation represents an important use case but I'm not sure how typical it is.

@huftis
Copy link
Contributor

huftis commented Jan 17, 2023

I agree that an option/argument to specify how the timezone should be handled when saving would be very useful. There should also be a similar option for specifying the timezone whean reading data (in haven::read_*()).

BTW, there’s a bug in your test case, @jmobrien. In as.POSIXct(tx = "America/Chicago"), the specified timezone is actually ignored. The argument should be tz, not tx.

@gorcha
Copy link
Member

gorcha commented Jan 19, 2023

Hi @huftis, I don't think this makes sense to add as an argument for the file reading functions - date time variables are read in consistently as UTC, and the timezone can be changed by the user as needed after reading.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants