error with nested fields in spark_apply #3242

Open
bhogan-mitre opened this issue Mar 25, 2022 · 0 comments

I'm running into issues with spark_apply and nested columns. For example, the snippet below produces the following error:

Error: org.apache.spark.sql.AnalysisException: cannot resolve 'from_json(vals)' due to data type mismatch: Input schema bigint must be a struct, an array or a map.;
'Project [a#152, b#153, from_json(LongType, vals#154, Some(America/New_York)) AS vals#201, d#155]

I'm curious where the Some(America/New_York) piece comes from given that this is an array of integers.

The error appears to be an issue with serialization of nested columns, vals in this case, even though spark_apply is just passing that column through and not trying to operate on it. The NA value in the field b that is used in the calculation seems to trigger the issue.

library(sparklyr)
library(dplyr)

spark_version <- "3.2.1"
sc <- spark_connect(master = "local", version = spark_version)

# b contains an NA in the first row; this appears to be what triggers the failure
tribble(
  ~a, ~b,           ~c,
   1,  NA_integer_,  1,
   1,  1,            2,
   1,  1,            3,
   2,  2,            1,
   2,  2,            2,
) %>% 
  copy_to(sc, df = ., name = "test_sdf1", overwrite = TRUE) %>% 
  group_by(a, b) %>%
  # vals becomes an array<bigint> column on the Spark side
  summarise(vals = collect_list(c), .groups = "drop") %>% 
  arrange(a, b) %>% 
  # the closure only adds d = b * 2; vals is passed through untouched
  spark_apply(
    function(df) {
      library(dplyr)
      
      df %>% 
        mutate(
          d = b * 2
        )
    }
  )
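For reference, the schema of the grouped table going into spark_apply can be checked with sdf_schema(). A sketch, assuming the summarised table above is first stored in a variable (grouped_sdf is a hypothetical name):

```r
# grouped_sdf = everything in the pipeline above up to (but not including)
# the spark_apply() call
sdf_schema(grouped_sdf)
# On the Spark side vals is an array column, yet the AnalysisException shows
# spark_apply reading it back with from_json(LongType, vals, ...), i.e. a plain
# bigint schema, which from_json rejects.
```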

[Screenshot (2022-03-25, 10:46 AM): spark_apply output showing the AnalysisException]

On the other hand, the same calculation without the NA value runs through okay.

[Screenshot (2022-03-25, 4:44 PM): the same pipeline succeeding without the NA value]
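One possible workaround (untested here, and the column types are assumptions): spark_apply() accepts a columns argument that declares the output schema explicitly, which skips schema inference on the result and may sidestep the bad from_json schema for the passthrough array column:

```r
# grouped_sdf is a hypothetical name for the summarised table above
spark_apply(
  grouped_sdf,
  function(df) {
    library(dplyr)
    df %>% mutate(d = b * 2)
  },
  # Declare the output schema up front instead of letting spark_apply
  # infer it; the types listed here are assumptions.
  columns = list(
    a = "double",
    b = "integer",
    vals = "array<bigint>",
    d = "integer"
  )
)
```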
