Specialised data types in R #22

nbenn · 2017-11-07T22:04:38Z

DBI should offer some type of plugin system that other packages can build upon by offering implementations for representing some of the more exotic data types in R. Throughout the DBI packages, there are many open issues surrounding this problem. A selection:

r-dbi/DBI#199 enum types
r-dbi/DBI#179 uuid types
r-dbi/RPostgres#132 geometry types, sf
r-dbi/RPostgres#114 geometry types, sf
r-dbi/RPostgres#86 geometry types
r-dbi/bigrquery#145 array/struct types
r-dbi/RMariaDB#42 json

other types that fall into this area include arbitrary precision floats (Rmpfr, gmp), xml types (xml2), monetary types, etc.

Now if geometry types are implemented for Postgres, this is great. But they are also available in MySQL/MariaDB. It therefore might be useful to consider these issues in a more general fashion. Furthermore, approaching this in a type by type fashion might not be sufficient. How could a user map a Postgres composite type, if there is not some inherent extensibility?

Unfortunately, I have no idea how to tackle such an issue. Maybe a pragmatic approach, where things such as composite types are simply not considered, is the best we can do. I just was hoping to get a discussion started on this topic.

krlmlr · 2017-11-08T23:49:29Z

I agree that this is the way to go, but a plugin system shouldn't impact performance. Currently, the values obtained from the database driver are coerced to their target type (integer, double, int64, string, logical, raw vector) as they arrive. Do you think the decision about the "right" target type can be made from metadata only, without fetching any rows (or after fetching the first row only)?

We could offer an interface that allows registration of column handlers for a particular DBI result class. Backends would then be expected to call these handlers with column metadata (as R objects), and the handler decides if he can handle columns of this type or not. If yes, the handler returns an empty container (think character(0) or list()) and a converter function that would be called for each value. The value would be passed as binary as a raw or as an external pointer, the converter function is supposed to return an R object that can be plugged into this container.

The data format for the column metadata and the raw values fully depends on the backend and should match that of the underlying C library. We would obey order of registration, younger handlers are called first. Handlers can also be provided at the connection or the result level.

Backends then also could expose "default" built-in handlers for R's data types, and also use this mechanism to decide how to handle integers, blobs, times etc..

If we don't care that much about performance and permit an extra copy operation and memory allocation, we could also package everything as lists of raw() first and handle everything at the R level. This is nice because then we could also support daisy-chaining handlers.

krlmlr · 2017-11-15T15:06:04Z

@edzer: Do you think geometry columns can be handled by a plugin provided by sf?

edzer · 2017-11-15T15:27:54Z

Yes, no problem. sf already has st_as_sfc methods for WKB objects (list column with raw vectors holding binary geometry) or for blob, but for the latter we of course don't know whether the blob actually contains binary geometries. When at DBI level it is clear that a blob is a geometry, we know what to do. @etiennebr

aornugent · 2019-11-06T22:11:41Z

All roads lead to this thread! What is the state of array types in e.g. RPostgres?

> demo <- tibble(reporting_period = list(1L, 2L, 3L),
                 rp_start = lapply(c("2019-11-01", "2020-11-01", "2021-11-01"), as.Date),  
                 rp_end = lapply(c("2020-10-31", "2021-10-31", "2022-10-31"), as.Date))
> demo
# A tibble: 3 x 3
#  reporting_period rp_start   rp_end    
#  <list>           <list>     <list>    
# 1 <int [1]>        <date [1]> <date [1]>
# 2 <int [1]>        <date [1]> <date [1]>
# 3 <int [1]>        <date [1]> <date [1]>

> result <- dbSendQuery(con,   
     "CREATE TABLE reporting_schedule (
       reporting_period INTEGER[] NOT NULL,
       rp_start DATE[] NOT NULL,
       rp_end DATE[] NOT NULL
     )")
> dbClearResult(result)

> dbAppendTable(con, "reporting_schedule", demo)
# Error: Lists must contain raw vectors or NULL

> dbDataType(con, demo)
# reporting_period         rp_start           rp_end 
#         "BYTEA"          "BYTEA"          "BYTEA" 

> result <- dbSendQuery(con, "SELECT * FROM reporting_schedule")
> dbColumnInfo(result)
#               name      type .oid .known .typname
# 1 reporting_period character 1007  FALSE    _int4
# 2         rp_start character 1182  FALSE    _date
# 3           rp_end character 1182  FALSE    _date

> dbClearResult(result)

Any suggestions how I can convince DBI to take a table of lists?

krlmlr · 2020-12-27T12:10:47Z

No updates here. This is a complex problem, leaving the issue open as a reminder.

krlmlr · 2021-10-17T21:16:46Z

With the availability of ALTREP, this is easier to design and implement efficiently:

Data is kept in whatever format it is returned from the database driver
All data provided to R is an ALTREP vector that supports pluggable conversions, with a "default" conversion to e.g. string or blob
The ALTREP vector also knows the native type on the database

Closing for now, will keep a bookmark to this issue.

krlmlr · 2021-11-01T05:50:49Z

Efficiency gains with ALTREP should be quantified: what's the cost of allocating (and garbage-collecting) several list-of-lists columns in row-major order, in relation to the other costs?

krlmlr closed this as completed Oct 17, 2021

krlmlr transferred this issue from r-dbi/DBI Nov 1, 2021

krlmlr reopened this Nov 1, 2021

krlmlr added the interface 🔌 label Nov 1, 2021

basille mentioned this issue May 23, 2023

About the relevance and future of rpostgis r-spatial/discuss#58

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specialised data types in R #22

Specialised data types in R #22

nbenn commented Nov 7, 2017

krlmlr commented Nov 8, 2017 •

edited

krlmlr commented Nov 15, 2017

edzer commented Nov 15, 2017

aornugent commented Nov 6, 2019 •

edited

krlmlr commented Dec 27, 2020

krlmlr commented Oct 17, 2021

krlmlr commented Nov 1, 2021

Specialised data types in R #22

Specialised data types in R #22

Comments

nbenn commented Nov 7, 2017

krlmlr commented Nov 8, 2017 • edited

krlmlr commented Nov 15, 2017

edzer commented Nov 15, 2017

aornugent commented Nov 6, 2019 • edited

krlmlr commented Dec 27, 2020

krlmlr commented Oct 17, 2021

krlmlr commented Nov 1, 2021

krlmlr commented Nov 8, 2017 •

edited

aornugent commented Nov 6, 2019 •

edited