Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to convert "enum" types to factor or character? #455

Open
DavisVaughan opened this issue Jan 18, 2024 · 11 comments
Open

Option to convert "enum" types to factor or character? #455

DavisVaughan opened this issue Jan 18, 2024 · 11 comments

Comments

@DavisVaughan
Copy link

My wife has a scenario where she's pulling in a number of columns that end up with a class of "pq_stuff_enum". She doesn't really care that its an enum, and would really like it to come through as a character or possibly a factor.

It seems like typcategory has a category code E for enums:
https://www.postgresql.org/docs/current/catalog-pg-type.html

I wonder if we can detect that when pulling the column in? Maybe around here?

DATA_TYPE PqResultImpl::_cache::get_column_type_from_oid(const Oid type) {

I could see a dbConnect() argument with something like enum = c("enum", "character", "factor") or something similar as a way to control how these are handled

@krlmlr
Copy link
Member

krlmlr commented Jan 22, 2024

Thanks. Reprex:

library(RPostgres)

con <- postgresDefault()
dbExecute(con, "CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")
#> [1] 0
dbExecute(con, "CREATE TABLE x (a mood)")
#> [1] 0
dbExecute(con, "INSERT INTO x VALUES ('sad')")
#> [1] 1
dbExecute(con, "INSERT INTO x VALUES ('ok')")
#> [1] 1

out <- dbGetQuery(con, "SELECT * FROM x")
dput(out)
#> structure(list(a = structure(c("sad", "ok"), class = "pq_NA")), row.names = c(NA, 
#> -2L), class = "data.frame")
out <- dbGetQuery(con, "SELECT CAST(a AS text) FROM x")
dput(out)
#> structure(list(a = c("sad", "ok")), class = "data.frame", row.names = c(NA, 
#> -2L))

dbExecute(con, "DROP TABLE IF EXISTS x")
#> [1] 0
dbExecute(con, "DROP TYPE IF EXISTS mood")
#> [1] 0

Created on 2024-01-22 with reprex v2.0.2

Are you requesting the removal of the "pq_mood" class? (For some reason, the class is "pq_NA" in this example, I wonder why.) This is easy to do either by the client (use unclass() for the relevant columns) or by reformulating your query as I did.

Would you like to share the use case behind this request?

@DavisVaughan
Copy link
Author

DavisVaughan commented Jan 22, 2024

She uses dbplyr, so she has little control over the underlying query and is eventually just calling collect() to bring the data frame into R.

It's mostly a problem when you start using it like a character vector with other parts of the tidyverse, like:

x <- structure(c("sad", "ok"), class = "pq_NA")

vctrs::vec_c(x, "stuff")
#> Error in `vctrs::vec_c()`:
#> ! Can't combine `..1` <pq_NA> and `..2` <character>.

It seems reasonable to me that we should try and map as many Postgres types to native R types as we can, and mapping an enum to a character or factor seemed pretty natural

This is easy to do either by the client (use unclass() for the relevant columns)

My argument is that this extra class serves no purpose for her (and most users, I'd imagine), and it would be better off as a bare character vector or a factor

@krlmlr
Copy link
Member

krlmlr commented Jan 22, 2024

That's not wrong. For now, a possible workaround might be something along the lines of

collect <- function(x, ...) {
  out <- dplyr::collect(x, ...)
  out[] <- map(out, ~ if (is.character(.x)) unclass(.x) else .x)
  out
}

Would it help if all enum classes had a common superclass so that we could define casting/coercion operators for vctrs?

Adding a special dbConnect() option looks like a hasty kludge. Let's think about it, in particular in the context of other database drivers that might need to deal with enums, like duckdb.

@DavisVaughan
Copy link
Author

Is there a reason enums don't come through as factors? That feels like a 1:1 match to me

@krlmlr
Copy link
Member

krlmlr commented Jan 22, 2024

No particular reason, but I'm not sure if the levels are available as part of the result.

@DavisVaughan
Copy link
Author

DavisVaughan commented Jan 22, 2024

It seems like you look the labels up in an overarching pg_enum catalog based on the owning type's id or something like that
https://www.postgresql.org/docs/current/catalog-pg-enum.html

@krlmlr
Copy link
Member

krlmlr commented Jan 23, 2024

It's interesting that "pq_NA" is set only for enums that are created as part of the current connection (last reprex). After reconnecting, the class is "pq_mood" :

library(RPostgres)

con <- postgresDefault()
dbExecute(con, "CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")
#> [1] 0
dbDisconnect(con)

con <- postgresDefault()
dbExecute(con, "CREATE TABLE x (a mood)")
#> [1] 0
dbExecute(con, "INSERT INTO x VALUES ('sad')")
#> [1] 1
dbExecute(con, "INSERT INTO x VALUES ('ok')")
#> [1] 1

out <- dbGetQuery(con, "SELECT * FROM x")
dput(out)
#> structure(list(a = structure(c("sad", "ok"), class = "pq_mood")), row.names = c(NA, 
#> -2L), class = "data.frame")
out <- dbGetQuery(con, "SELECT CAST(a AS text) FROM x")
dput(out)
#> structure(list(a = c("sad", "ok")), class = "data.frame", row.names = c(NA, 
#> -2L))

dbExecute(con, "DROP TABLE IF EXISTS x")
#> [1] 0
dbExecute(con, "DROP TYPE IF EXISTS mood")
#> [1] 0
dbDisconnect(con)

Created on 2024-01-23 with reprex v2.0.2

And we already fetch a @typnames component as part of dbConnect() , we could do the same for the enum levels. Although now I'm concerned that this already adds a non-negligible overhead for establishing a connection. Likely would have to be opt-in.

If the enums were factors, we would get cast and coercion behavior for free.

@paleolimbot: How do you (intend to) handle enum types in ADBC for Postgres?

@paleolimbot
Copy link

It looks like they're not handled yet!

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = "postgresql://localhost:5432/postgres?user=postgres&password=password") 

db |> 
  execute_adbc("CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")

con <- db |> adbc_connection_init()

con |> 
  execute_adbc("CREATE TABLE x (a mood)") |> 
  execute_adbc("INSERT INTO x VALUES ('sad')") |> 
  execute_adbc("INSERT INTO x VALUES ('ok')")

con |> 
  read_adbc("SELECT * from x")
#> Error in adbc_statement_execute_query(stmt, stream, stream_join_parent = TRUE): [libpq] Column #1 ("a") has unknown type code 16385

Created on 2024-01-23 with reprex v2.0.2

@paleolimbot
Copy link

Just kidding, I forgot about the part where we cache the types in the database. It's returned as binary but the COPY output is character, so that's how the driver will return it:

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = "postgresql://localhost:5432/postgres?user=postgres&password=password") 

db |> 
  execute_adbc("CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")

db |> 
  adbc_database_release()

# Have to recreate the database after because that's where we cache the types
con <- db <- adbc_database_init(
    adbcpostgresql::adbcpostgresql(),
    uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
  ) |> 
  adbc_connection_init()

con |> 
  execute_adbc("CREATE TABLE x (a mood)") |> 
  execute_adbc("INSERT INTO x VALUES ('sad')") |> 
  execute_adbc("INSERT INTO x VALUES ('ok')")

# Currently returned as binary (i.e., COPY output)
con |> 
  read_adbc("SELECT * from x") |> 
  as.data.frame() |> 
  dplyr::pull(a) |> 
  unclass() |> 
  vapply(rawToChar, character(1))
#> [1] "sad" "ok"

Created on 2024-01-23 with reprex v2.0.2

@paleolimbot
Copy link

Sorry for hijacking, but after apache/arrow-adbc#1485 it'll be:

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = "postgresql://localhost:5432/postgres?user=postgres&password=password") 

db |> 
  execute_adbc("CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")

db |> 
  adbc_database_release()

# Have to recreate the database after because that's where we cache the types
con <- db <- adbc_database_init(
    adbcpostgresql::adbcpostgresql(),
    uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
  ) |> 
  adbc_connection_init()

con |> 
  execute_adbc("CREATE TABLE x (a mood)") |> 
  execute_adbc("INSERT INTO x VALUES ('sad')") |> 
  execute_adbc("INSERT INTO x VALUES ('ok')")

# Returns as string because that's how the driver sees it
con |> 
  read_adbc("SELECT * from x") |> 
  as.data.frame()
#>     a
#> 1 sad
#> 2  ok

# If you know the R type you can specify it
con |> 
  read_adbc("SELECT * from x") |> 
  nanoarrow::convert_array_stream(
    to = data.frame(x = factor(levels = c("mood", "sad", "ok")))
  ) |> 
  str()
#> 'data.frame':    2 obs. of  1 variable:
#>  $ x: Factor w/ 3 levels "mood","sad","ok": 2 3

Created on 2024-01-23 with reprex v2.0.2

@dpprdan
Copy link
Contributor

dpprdan commented Jan 23, 2024

@paleolimbot engaging with your hijacking (sorry, not sorry 😬):

Would a reverse conversion, i.e. character/factor to an existing enum with write_adbc(mode = "append"), be possible, too?

Would an update cache function be useful (instead of having to (de)connect)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants