Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

need timezone specification on import or read timestamps as string not number #347

Closed
chrisholbrook opened this issue Apr 18, 2017 · 8 comments

Comments

@chrisholbrook
Copy link

chrisholbrook commented Apr 18, 2017

Unless I've missed something, it seems all timestamps are assumed to be in UTC in read_excel.

Absent of option to specify time zone for each dttm column, I figured I could import as text and then coerce using proper time zone specification in as.POSIXct. Unfortunately my timestamps come in as a number formatted as text when I specify col_types = "text" in read_excel. An example xlsx file is attached and some code below showing the issue. (using readxl v1.0.0)

Book2.xlsx

it seems read_excel assumes all dttm input in UTC... is this correct? Any chance column-specific timezone argument could be added?

#read data from xlsx file with dates from two difference timezones
    foo <- readxl::read_excel("C:/Users/cholbrook/Desktop/Book2.xlsx", sheet = 1)
    foo$timestamp_us_eastern #need to be in US/Eastern timezone
> foo$timestamp_us_eastern #need to be in US/Eastern timezone
[1] "2015-05-05 12:00:02 UTC" "2015-05-08 19:22:50 UTC"

try reading as text and then coerce with as.POSIXct, but data come in as number formatted as text, rather than string... can cobble together conversion, but it's not intuitive.

foo2 <- readxl::read_excel("C:/Users/cholbrook/Desktop/Book2.xlsx", sheet = 1,
                           col_types = "text") 
> foo2
# A tibble: 2 × 2
  timestamp_us_eastern      timestamp_utc
                 <chr>              <chr>
1   42129.500023148146 42129.500023148146
2   42132.807523148149 42132.807523148149

I thought perhaps I could convert from number by supplying an origin, but that hasn't panned out...

foo2$timestamp_us_eastern <- as.POSIXct(
    as.numeric(foo2$timestamp_us_eastern)*86400, 
    tz = "US/Eastern", 
    origin = "1899-12-30 04:00")

foo2$timestamp_utc <- as.POSIXct(
  as.numeric(foo2$timestamp_utc)*86400, 
  tz = "UTC", 
  origin = "1899-12-30 04:00")
```
```
> foo2$timestamp_us_eastern
[1] "2015-05-05 12:00:02 EDT" "2015-05-08 19:22:50 EDT"

> foo2$timestamp_utc
[1] "2015-05-05 16:00:02 UTC" "2015-05-08 23:22:50 UTC"
```
@jennybc
Copy link
Member

jennybc commented Apr 18, 2017

I'll take this in a few pieces.

Yes readxl brings everything in as being UTC, because Excel has no notion of timezone.

It is harder than meets the eye to bring date times in as text (which is #118). The text you see is ephemeral -- created just for your eyeballs. On disk, the cell just is a number and a reference to a date format. I'd like to offer this but haven't gotten to it yet.

I think you should import these as proper dates and then change the timezone. This StackOverflow thread seems to offer both base and lubridate solutions for exactly this problem:

http://stackoverflow.com/questions/14141537/change-timezone-in-a-posixct-object

I'm not eager to add timezone to the read_excel() interface. It would be clunky. But there's another issue to move readxl to a more readr-like interface (#198), which would afford an opportunity for much better date handling, such as this.

Let me know how it works out. I'd love some example code re: post hoc timezone switching. I'd add it to the docs somewhere.

@jennybc jennybc closed this as completed Apr 18, 2017
@chrisholbrook
Copy link
Author

@jennybc Thanks for quick reply. Your rationale makes sense.

Safest solution seems to be to coerce dttm column to string using format (as long as tz is UTC) and then coerce back to POSIXct using the correct tz attribute.

Here's an example using the attached file Book3.xlsx

Book3.xlsx contains a single row in a single column with
a timestamp "2015-05-05 12:00:02" in US/Eastern time zone

foo <- readxl::read_excel("C:/Users/cholbrook/Desktop/Book3.xlsx", sheet = 1)

note that tzone = UTC (default read_excel behavior)

> foo$timestamp_us_eastern
[1] "2015-05-05 12:00:02 UTC"

which is not correct, should look like this

should_be <- as.POSIXct("2015-05-05 12:00:02", tz="US/Eastern")
> should_be 
[1] "2015-05-05 12:00:02 EDT"

safest method seems to be coerce to text (from POSIXct in UTC) and then back to POSIXct (with correct time zone)

new <- as.POSIXct(format(foo$timestamp_us_eastern), tz="US/Eastern")
> new
[1] "2015-05-05 12:00:02 EDT"
> identical(new, should_be)
[1] TRUE

@chrisholbrook
Copy link
Author

@jennybc Note also that changing the tzone attribute (as shown in http://stackoverflow.com/questions/14141537/change-timezone-in-a-posixct-object) is not a solution to this problem because the tzone attribute does not change the internal value of the POSIXct object; only how it is displayed.

For example:

> #make two identical POSIXct objects
> foo1 <- foo2 <- as.POSIXct("2015-05-05 12:00:02", tz="UTC")
> foo1
[1] "2015-05-05 12:00:02 UTC"
> #change tzone attr of one
> attr(foo2, "tzone") <- "US/Eastern"
> foo2
[1] "2015-05-05 08:00:02 EDT"
> #note that internally same value; tzone only affects how displayed
> as.numeric(foo1)
[1] 1430827202
> as.numeric(foo2)
[1] 1430827202

@jennybc
Copy link
Member

jennybc commented Apr 19, 2017

Thanks for closing the loop! Yeah, I wondered about the attribute adjustment, as what I saw in that SO thread didn't seem to be exactly what you needed. I'll find a logical place to record your workaround for other people who have this same problem.

@jennybc
Copy link
Member

jennybc commented Apr 23, 2017

Hey @chrisholbrook,

Here's an even better solution to your problem: lubridate::force_tz().

## make three identical POSIXct objects
dttm_raw <- dttm_attr <- dttm_force <-
  as.POSIXct("2015-05-05 12:00:02", tz="UTC")
dttm_raw
#> [1] "2015-05-05 12:00:02 UTC"

## editing the attribute preserves moment of time, but not clock time
attr(dttm_attr, "tzone") <- "US/Eastern"
dttm_attr
#> [1] "2015-05-05 08:00:02 EDT"
## this doesn't actually change the moment of time

## use lubridate::force_tz() to preserve clock time and modify
## moment of time
dttm_force <- lubridate::force_tz(dttm_force, "US/Eastern")
dttm_force
#> [1] "2015-05-05 12:00:02 EDT"

## take a look under the hood at moment of time
as.numeric(dttm_raw)
#> [1] 1430827202
as.numeric(dttm_attr)  # same as dttm_raw
#> [1] 1430827202
as.numeric(dttm_force) # offset 
#> [1] 1430841602

@chrisholbrook
Copy link
Author

@jennybc, excellent. yes, somehow I'd missed lubridate::force_tz. thank you!

@sdanielzafar
Copy link

+1 for adding the timezone functionality. Would it be easier to have it not be an arg and just set it to system timezone?

@jennybc
Copy link
Member

jennybc commented Dec 5, 2017

I think for now this is best left as a well-documented, after-import transformation. If anyone on the thread is game to make a PR re: how to document, that would be great. When/if readxl gets column-typing more consistent with readr, then it would make sense to have a way to set the timezone.

@lock lock bot locked and limited conversation to collaborators Oct 10, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants