Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More fully supported nested structs and arrays #520

Closed
elbamos opened this issue Oct 15, 2020 · 7 comments
Closed

More fully supported nested structs and arrays #520

elbamos opened this issue Oct 15, 2020 · 7 comments

Comments

@elbamos
Copy link

elbamos commented Oct 15, 2020

A commit made last year enabled dbplyr to support a single layer of struct nesting: 423820a

This commit converts the R syntax parent_field$sub_field to SQL syntax parent_field.sub_field.

In my dataset, the data (from Snowplow) looks like, in a field called contexts:

{
   data: [
       {
           data: {
               id_1: varchar, 
               id_2: varchar, 
              other_stuff: []
           }, 
           schema: varchar
       }
   ]
   schema: varchar
}

Calling contexts[["data"]] returns, unexpectedly, data[1]. This is unexpected because data should be an array.

Calling contexts[["data"]][["data"]] (or contexts[["data"]][["1"]]returns an error thatExpression "contexts"."data" is not of type ROW. The same thing occurs if these are turned into separate steps in a subsequent mutate` with an intermediate variable name.

Interestingly, the errors describe the sql as being translated to contexts.data.whateverelse, rather than contexts.data[1], which is what's returned.

So, I'm not sure if what's going on is that there's no way to specify indexing a nested array within a struct in dbplyr, or if the nested instructions are being translated in a funky way. But in any event, there doesn't seem to be a syntax for drilling deeper into the nested structure.

@hadley
Copy link
Member

hadley commented Oct 15, 2020

Can you please start by creating a reprex using lazy_frame()?

@elbamos
Copy link
Author

elbamos commented Oct 15, 2020

Here you go:

df <- tibble(
  contexts = c(
    list(data = list(list(data = list(id1 = 1, id2 = 2)))), 
    list(data = list(list(data = list(id1 = 3, id2 = 4))))
  ),
  row_id = c(1, 2)
)

bob <- tbl_lazy(df, con = simulate_postgres())

bob %>% 
  mutate(
    inner_context = contexts[["data"]]
  ) %>% show_query()

bob %>% 
  mutate(
    inner_context = contexts[["data"]][["1"]]
  ) %>% show_query()

Note the result when I attempt to index the nested array with [["1"]]. Note also that this is not able to replicate part of what I see with real data -- that contexts[["data"]] returns contexts.data[1] rather than contexts.data.

@hadley
Copy link
Member

hadley commented Oct 15, 2020

Slightly more minimal reprex:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

db <- lazy_frame(
  x = c(
    list(data = list(list(data = list(id1 = 1, id2 = 2)))), 
    list(data = list(list(data = list(id1 = 3, id2 = 4))))
  ),
  con = simulate_postgres()
)

db %>% mutate(y = x[["data"]])
#> <SQL>
#> SELECT `x`, `x`.`data` AS `y`
#> FROM `df`
db %>% mutate(y = x[["data"]][[1]])
#> Error: Can only index with strings

Created on 2020-10-15 by the reprex package (v0.3.0.9001)

@hadley hadley closed this as completed in 6679ed8 Oct 19, 2020
@hadley
Copy link
Member

hadley commented Oct 19, 2020

If the generated sql is correct, the problem probably lies in the database backend that you're using.

@elbamos
Copy link
Author

elbamos commented Oct 19, 2020

@hadley The generated sql does not appear to me to be correct. The SQL for extracting an element for an array is [] rather than .. See https://docs.aws.amazon.com/athena/latest/ug/accessing-array-elements.html

There does not seem to be any way with current dbplyr syntax to specify that one wants a particular item from an array, indexed by number, rather than a named item from a struct.

contexts[["data"]][["1"]] generates "contexts"."data"."1". It should generate "contexts"."data"[1].

(Actually, that's what I'd expect it to do - the missing option seems to be contexts[["data"]][[1]], which should generate "contexts"."data"[1], but instead throws an error as you observed. Note that contexts[["data[1]"]] also throws an error, because it gets translated to "contexts"."data[1]", which is improper SQL.)

@hadley
Copy link
Member

hadley commented Oct 20, 2020

That is exactly what I fixed in this issue. This is what I see:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

db <- lazy_frame(
  x = list(data = list(list(data = list(id1 = 1, id2 = 2)))), 
  con = simulate_postgres()
)

db %>% mutate(y = x[["data"]][[1]])
#> <SQL>
#> SELECT `x`, `x`.`data`[1] AS `y`
#> FROM `df`

@elbamos
Copy link
Author

elbamos commented Oct 20, 2020

Oops - didn't realize you'd implemented the change. Sorry!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants