More fully supported nested structs and arrays #520

elbamos · 2020-10-15T20:23:24Z

A commit made last year enabled dbplyr to support a single layer of struct nesting: 423820a

This commit converts the R syntax parent_field$sub_field to SQL syntax parent_field.sub_field.

In my dataset, the data (from Snowplow) looks like, in a field called contexts:

{
   data: [
       {
           data: {
               id_1: varchar, 
               id_2: varchar, 
              other_stuff: []
           }, 
           schema: varchar
       }
   ]
   schema: varchar
}

Calling contexts[["data"]] returns, unexpectedly, data[1]. This is unexpected because data should be an array.

Calling contexts[["data"]][["data"]] (or contexts[["data"]][["1"]]returns an error thatExpression "contexts"."data" is not of type ROW. The same thing occurs if these are turned into separate steps in a subsequent mutate` with an intermediate variable name.

Interestingly, the errors describe the sql as being translated to contexts.data.whateverelse, rather than contexts.data[1], which is what's returned.

So, I'm not sure if what's going on is that there's no way to specify indexing a nested array within a struct in dbplyr, or if the nested instructions are being translated in a funky way. But in any event, there doesn't seem to be a syntax for drilling deeper into the nested structure.

The text was updated successfully, but these errors were encountered:

hadley · 2020-10-15T20:27:01Z

Can you please start by creating a reprex using lazy_frame()?

elbamos · 2020-10-15T20:58:56Z

Here you go:

df <- tibble(
  contexts = c(
    list(data = list(list(data = list(id1 = 1, id2 = 2)))), 
    list(data = list(list(data = list(id1 = 3, id2 = 4))))
  ),
  row_id = c(1, 2)
)

bob <- tbl_lazy(df, con = simulate_postgres())

bob %>% 
  mutate(
    inner_context = contexts[["data"]]
  ) %>% show_query()

bob %>% 
  mutate(
    inner_context = contexts[["data"]][["1"]]
  ) %>% show_query()

Note the result when I attempt to index the nested array with [["1"]]. Note also that this is not able to replicate part of what I see with real data -- that contexts[["data"]] returns contexts.data[1] rather than contexts.data.

hadley · 2020-10-15T21:10:32Z

Slightly more minimal reprex:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

db <- lazy_frame(
  x = c(
    list(data = list(list(data = list(id1 = 1, id2 = 2)))), 
    list(data = list(list(data = list(id1 = 3, id2 = 4))))
  ),
  con = simulate_postgres()
)

db %>% mutate(y = x[["data"]])
#> <SQL>
#> SELECT `x`, `x`.`data` AS `y`
#> FROM `df`
db %>% mutate(y = x[["data"]][[1]])
#> Error: Can only index with strings

^{Created on 2020-10-15 by the reprex package (v0.3.0.9001)}

hadley · 2020-10-19T19:11:05Z

If the generated sql is correct, the problem probably lies in the database backend that you're using.

elbamos · 2020-10-19T23:02:25Z

@hadley The generated sql does not appear to me to be correct. The SQL for extracting an element for an array is [] rather than .. See https://docs.aws.amazon.com/athena/latest/ug/accessing-array-elements.html

There does not seem to be any way with current dbplyr syntax to specify that one wants a particular item from an array, indexed by number, rather than a named item from a struct.

contexts[["data"]][["1"]] generates "contexts"."data"."1". It should generate "contexts"."data"[1].

(Actually, that's what I'd expect it to do - the missing option seems to be contexts[["data"]][[1]], which should generate "contexts"."data"[1], but instead throws an error as you observed. Note that contexts[["data[1]"]] also throws an error, because it gets translated to "contexts"."data[1]", which is improper SQL.)

hadley · 2020-10-20T11:39:59Z

That is exactly what I fixed in this issue. This is what I see:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

db <- lazy_frame(
  x = list(data = list(list(data = list(id1 = 1, id2 = 2)))), 
  con = simulate_postgres()
)

db %>% mutate(y = x[["data"]][[1]])
#> <SQL>
#> SELECT `x`, `x`.`data`[1] AS `y`
#> FROM `df`

elbamos · 2020-10-20T18:17:42Z

Oops - didn't realize you'd implemented the change. Sorry!

hadley closed this as completed in 6679ed8 Oct 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More fully supported nested structs and arrays #520

More fully supported nested structs and arrays #520

elbamos commented Oct 15, 2020

hadley commented Oct 15, 2020

elbamos commented Oct 15, 2020

hadley commented Oct 15, 2020

hadley commented Oct 19, 2020

elbamos commented Oct 19, 2020 •

edited

Loading

hadley commented Oct 20, 2020

elbamos commented Oct 20, 2020

More fully supported nested structs and arrays #520

More fully supported nested structs and arrays #520

Comments

elbamos commented Oct 15, 2020

hadley commented Oct 15, 2020

elbamos commented Oct 15, 2020

hadley commented Oct 15, 2020

hadley commented Oct 19, 2020

elbamos commented Oct 19, 2020 • edited Loading

hadley commented Oct 20, 2020

elbamos commented Oct 20, 2020

elbamos commented Oct 19, 2020 •

edited

Loading