Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Think about return types for date, time and timestamp #42

Closed
krlmlr opened this issue Oct 11, 2015 · 19 comments · Fixed by #97
Closed

Think about return types for date, time and timestamp #42

krlmlr opened this issue Oct 11, 2015 · 19 comments · Fixed by #97

Comments

@krlmlr
Copy link
Member

krlmlr commented Oct 11, 2015

Current situation:

  • Dates are returned as "Date" class
  • Times are returned character, or integer (number of seconds, MySQL only)
  • Timestamps are returned character, or POSIXct (MySQL only)

It is unclear if, and how much, the DBI driver should abstract the DB server's understanding of these types. The current MySQL convention looks good for dates and timestamps, and should be adopted by the other backends. The "time" type is tricky: It doesn't seem to have a time zone, and probably represents a duration in most cases, but this cannot be relied on. Still, a lubridate::duration() might be a good bet here.

For now, the test tests for the "character" data type.

@hadley
Copy link
Member

hadley commented Oct 11, 2015

How are times stored in the database? Seconds from midnight?

Might be useful to create (or find) lightweight time package. Both haven and readr currently define their own time classes.

@krlmlr
Copy link
Member Author

krlmlr commented Oct 30, 2015

I'm not sure about the internal storage, but probably it's something like seconds from midnight, or a fraction of a day.

I like this idea of a data type package. ???hms didn't return anything useful, so I'd vote for extracting the code from haven. This class should define useful converters to/from the most common date/time types (POSIX, lubridate, ...).

Same for 64-bit and larger, and perhaps also currency: A new package that uses a lossless storage type (probably character) but defines useful conversions to/from types the user actually can work with (numeric, bit64, int64, raw, ...).

BLOBs should probably be returned as lists of raw.

@imanuelcostigan
Copy link

I think times / date storage is DB dependent.

@hannes
Copy link
Contributor

hannes commented Nov 9, 2015

Seems reasonable to me, but please keep DBI free from dependencies on other packages. I do think numbers should stay numbers though if at all possible. And MonetDB stores times as timestamps since 1970-01-01 with millisecond precision in long values internally.

@krlmlr
Copy link
Member Author

krlmlr commented Nov 9, 2015

Why is this important? I think lossless data types can be of more general use than just for DBI, and should therefore live in a separate package.

@krlmlr
Copy link
Member Author

krlmlr commented Nov 10, 2015

@hannesmuehleisen: Unfortunately, numbers can't always stay numbers in R without sacrificing precision. For timestamps stored as milliseconds since the epoch, using numeric might just work (at lest for the next few millennia) -- but not necessarily for generic 64-bit or larger values.

If there's a new "data types" package, I agree that it shouldn't include mandatory dependencies itself. But I still don't see why DBI shouldn't import this "data types" package. Conversion to and from R's internal data types (with mutable warnings if precision is lost) always will be possible.

@hannes
Copy link
Contributor

hannes commented Nov 10, 2015

I would argue as long as base R does not have proper int64 support, DBI does not have to either.

@krlmlr
Copy link
Member Author

krlmlr commented Nov 10, 2015

There are R packages that do offer some support for 64-bit values. They just might not be suitable for every use case.

I think DBI's job is to be able to get the data in and out of the database in a safe and unambiguous way without additional loss of precision, using R's base data types. To me, this means either character or list(raw(...)) for 64-bit values, and I think character is the lesser of the two evils.

> bigint <- structure("1234567890123456789", class = c("DBI_bigint", "character"))
> bigint
[1] "1234567890123456789"
attr(,"class")
[1] "DBI_bigint" "character" 
> bit64::as.integer64(bigint)
integer64
[1] 1234567890123456789
> type.convert(bigint, numerals = "warn.loss")
[1] 1.234568e+18
Warning message:
In type.convert(bigint, numerals = "warn.loss") :
  accuracy loss in conversion from "1234567890123456789" to numeric
> as.numeric(bigint)
[1] 1.234568e+18

@hannes
Copy link
Contributor

hannes commented Nov 10, 2015

Throwing a warning if accuracy is lost is a good idea. But still think its a bad idea to convert some numerical SQL types to R's numeric or integer and some to characters. Nobody expects that. The clean way of doing that would be to follow the JDBC approach, where the caller MUST specify the type of data to be read. But I think nobody wants that either.

@krlmlr
Copy link
Member Author

krlmlr commented Nov 10, 2015

Agreed to both. Another way would be to include the original data as an attribute in cases where conversion losses can occur:

> bigint <- structure(1234567890123456789, original = "1234567890123456789")
> bigint
[1] 1.234568e+18
attr(,"original")
[1] "1234567890123456789"

The attribute won't survive data frame manipulations, but at least the data is there when DBI returns it, and the unsuspecting user isn't surprised.

What about time-only data types? I think those should be returned as seconds from midnight, to allow painless interoperability with POSIXct.

@hadley
Copy link
Member

hadley commented Nov 11, 2015

Let's use lightweight time class - number of seconds since midnight (as double) + class attribute.

For data types which have a loss conversion to closest R type, I like the idea to store the "original" value (as a character vector) as an attribute.

@hannes
Copy link
Contributor

hannes commented Nov 12, 2015

Like the attr solution as well.

@imanuelcostigan
Copy link

RJDBC currently only supports character and numeric return and write values. So dates for example are returned as characters. I've worked around this in my RSQLServer fetch method and expanded RJDBC's dbDataType method to improve writing out R types to SQLServer. Similarly, I have also expanded RJDBC's .fillStatementParameters which is used by dbSendUpdate.

My approach is obviously mostly hackery, and I would love to see better way of mapping DB types to R types (and vice-versa)

@krlmlr
Copy link
Member Author

krlmlr commented Nov 12, 2015

@imanuelcostigan: Data types not supported by the SQL engine are (not) covered in #61. This issue is about SQL data types for which R has no (lossless) equivalent.

@imanuelcostigan
Copy link

@krlmlr I'm not sure I follow. For example, dates and date-times are supported by SQL Server.

@krlmlr
Copy link
Member Author

krlmlr commented Nov 13, 2015

I'm just saying that the conversion DB -> R and R -> DB are two separate issues. The former is handled here, but I'm not sure how your case fits in. You seem to be able somehow to determine the true underlying data type, which is fine: I think it's the DBI driver's job to return proper data types for the DB -> R path. The data types "bigint" (64-bit integer) and "time" (number of seconds, without date) are a bit tricky, and this is what this issuer here is mostly about.

The conversion R -> DB has been discussed in #61. DBI will support specifying field types when creating tables via dbWriteTable (#18), but probably not much more.

Is there any further functionality you'd like to see in DBI?

@yasminlucero
Copy link

+1 to the idea of storing the 'original' value as a character vector. I just want to reiterate the importance of returning lossless information.

In my current application, I need to be able to return correct BIGINT counts often. I have to do the cast to char query side, which is tricky because the query generation is already very complex. A feature like this would side-step a lot of complexity for me.

@krlmlr
Copy link
Member Author

krlmlr commented Jan 13, 2016

  • "hms" package for time values, code pulled out of readr/haven
    • Check that data type is "double"
  • Lossy values: "lossy" S3 class

@krlmlr
Copy link
Member Author

krlmlr commented Mar 31, 2016

@hadley: Could you please take a look at the new hms package (early draft)? I'd like to put it under the rstats-db namespace, and eventually release to CRAN.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants