Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

False Negative Error: org.apache.spark.sql.AnalysisException: cannot resolve 'A' given input columns #3286

Closed
bhuvanesh1707 opened this issue Aug 16, 2022 · 10 comments · Fixed by #3290
Labels

Comments

@bhuvanesh1707
Copy link

I am facing a challenge to execute my R code in sparklyr (version 1.7.7) and dbplyr (version 2.2.0), where it throws an error "cannot resolve a 'column'" which means the column is not present but the column is present. The same R code is working in previous sparklyr version (1.7.5) and dbplyr (2.1.1).
I have created a dummy data with 1 row and 62 columns and did some calculation. I have column 'A' present in my dummy data but during the calculation it errors "cannot resolve column'A'"
I am guessing it fails to identify the columns when it goes beyond 100 columns or so.

#Load libraries
library(DBI) #1.1.2
library(sparklyr) #1.7.7
library(dplyr) #1.0.9
library(dbplyr) #2.2.0

#Dummy data frame with 1 row and 62 columns
dummy_data <- data.frame( A = "XX"
                         , B = 1L           
                         , U = 1L           
                         , D = T
                         , A1 = as.Date("1900-01-01")
                         , A2 = 1L
                         , A3 = 1L
                         , A4 = 1L           
                         , A5 = 1L           
                         , A6 = T
                         , A7 = as.Date("1900-01-01")
                         , A8 = T
                         , A9 = T
                         , A10 = "XX"
                         , A11 = "XX"
                         , A12 = T
                         , A13 = T
                         , A14 = as.Date("1900-01-01")
                         , A15 = T
                         , A16 = T
                         , A17 = 1L
                         , A18 = 1L
                         , A19 = T
                         , A20 = as.Date("1900-01-01")
                         , A21 = as.Date("1900-01-01")
                         , A22 = 1L
                         , A23 = 1L
                         , A24 = 1L
                         , A25 = 1L
                         , A26 = 1L
                         , A27 = 1L
                         , A28 = 1L
                         , A29 = T
                         , A30 = T
                         , A31 = T
                         , A32 = T
                         , A33 = T
                         , A34 = T
                         , A35 = "XX"
                         , A36 = "XX"
                         , A37 = "XX"
                         , A38 = "XX"
                         , A39 = "XX"
                         , A40 = "XX"
                         , A41 = T
                         , A42 = as.Date("1900-01-01")
                         , A43 = T
                         , A44 = 1L
                         , A45 = T
                         , A46 = 1L
                         , A47 = as.Date("1900-01-01")
                         , A48 = "XX"
                         , A49 = as.Date("1900-01-01")
                         , A50 = as.Date("1900-01-01")
                         , A51 = as.Date("1900-01-01")
                         , A52 = as.Date("1900-01-01")
                         , A53 = as.Date("1900-01-01")
                         , A54 = "XX"
                         , A55 = as.Date("1900-01-01")
                         , A56 = 1L
                         , A57 = 1L
                         , A58 = T
)

#Create spark connection
spark_disconnect_all()
cn <- spark_connect(master = "yarn-client", app_name = "newJob")

#Copying dummy_data data frame to spark data frame
spark_dummy_data <- copy_to(cn, dummy_data, overwrite = T) %>% compute()

#Execute code snippet
A <- spark_dummy_data %>% 
  mutate(excl = case_when(
    A %in% c("HHH","GGG") ~ (A17 == 1 & !A13)
    , A == "HHH" ~ T
    , T ~ (!A13)
  )) %>% 
  filter(excl) %>% 
  mutate(
    COL1 = case_when(
      A == "III" ~ B %in%  c(3,72,70)
      , T ~ (B == 3)
    )
    , COL2 = (B == 34)
    , COL3 = (B ==  71)
    , COL4 = (B == 79)
    , COL5 = (B == 83)
    , COL6 = case_when(A == "PPP" ~ B %in% c(5,6), T ~ B == 6)
    , COL7 = case_when(
      A == "III" ~ (B == 5 | B == 6)
      , T ~ COL6
    )
    , COL8 = case_when(
      A == "III" ~ B %in%  c(4,69,74)
      , T ~ (B == 4)
    )
    , COL9 = (B == 4 & A31)
    , COL10 = (B == 25)
    , COL11 = (B ==5)
    , COL12 = case_when(
      A == "GGG" ~ F
      , T ~ T
    )
    , COL13 = A15
    , COL14 = B %in% c(34,35,21,9,37,38,39)
    , COL15 = case_when(
      A == "III" ~ (B != 21)
      , A == "CCC" ~ A30
      , A == "ZZZ" ~ !COL14
      , A == "PPP" ~ (B != 17 & B != 21)
      , T ~ (U < 12)
    )
    , COL16 = case_when(
      A %in% c("III","ZZZ") ~ (B == 20)
      , A == "CCC" ~ (U == 15)
      , T ~ (U == 11)
    )
    , COL17 = (B == 19)
    , COL18 = (B == 29)
    , COL19 = case_when(A == "PPP" ~
                          case_when(A18 > 60 & !is.na(A18) ~ TRUE
                                    , T ~ FALSE)
                        , T ~ (B ==90))
    , COL20 = case_when(A == "PPP" ~
                          case_when((D | COL18 | COL6 | COL1) ~
                                      case_when((A35 == "M" & (A18) >= 37 &  (A18) <= 60) |
                                                  (A35 == "A" & (A18*12) >= 37 &  (A18*12) <= 60) |
                                                  (A35 == "B" & (A18/2) >= 37 &  (A18/2) <= 60) |
                                                  (A35 == "Q" & (A18*3) >= 37 &  (A18*3) <= 60) |
                                                  (A35 == "S" & (A18*6) >= 37 &  (A18*6) <= 60) |
                                                  (A35 == "W" & (A18/4) >= 37 &  (A18/4) <= 60) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==91 ))
    , COL21 = case_when(A == "PPP" ~
                          case_when(D | COL18 | COL6 | COL1 ~
                                      case_when((A35 == "M" & A18 == 1) |
                                                  (A35 == "W" & (A18/4) == 1) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==33 ))
    , COL22 = case_when(A == "PPP" ~
                          case_when(D | COL18 | COL6 | COL1 ~
                                      case_when((A35 == "M" & (A18) >= 2 &  (A18) <= 36) |
                                                  (A35 == "A" & (A18*12) >= 2 &  (A18*12) <= 36) |
                                                  (A35 == "B" & (A18/2) >= 2 &  (A18/2) <= 36) |
                                                  (A35 == "Q" & (A18*3) >= 2 &  (A18*3) <= 36) |
                                                  (A35 == "S" & (A18*6) >= 2 &  (A18*6) <= 36) |
                                                  (A35 == "W" & (A18/4) >= 2 &  (A18/4) <= 36) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==92 ))      
    , COL23 = (COL21 & U == 1) 
    , COL24 = (COL22 & U == 1) 
    , COL25 = (A15 & U == 1)
    , COL26 = (D & U == 2)
    , COL27 = (A15 & U == 2)
    , COL28 = (U == 2 & COL19)
    , COL29 = (U == 2 & COL20)
    , COL30= (U == 2 & COL21)
    , COL31 = (U == 2 & COL22) 
    , COL32 = B ==35
    , COL33 = F
    , COL34 = case_when(
      A == "GGG" ~ (B == 28)
      , A == "ZZZ" ~ F
      , T ~ (B == 59)
    )
    , COL35 = case_when(
      A == "GGG" ~ as.integer(A37) %in% c(1, 2, 6, 8, 9) & as.integer(A38) %in% c(29, 62)
      , A == "ZZZ" ~ F
      ,T ~ (B == 13)
    )
    , COL36 = case_when(
      A == "ZZZ" ~ F
      , T ~ (B == 8)
    )
    , COL37 = (U == 3)
    , COL38 = case_when(
      A == "ZZZ" ~ F
      , T ~ (D & U != 1)
    )
  ) %>% compute()
@bhuvanesh1707
Copy link
Author

Below is the error statement I am getting and column A is present in my data frame

Error: org.apache.spark.sql.AnalysisException: cannot resolve 'A' given input columns: [q35.A34, q35.A4, q35.A45, q35.COL10, q35.A27, q35.COL11, q35.A54, q35.A40, q35.COL7, q35.A9, q35.COL32, q35.COL18, q35.COL25, q35.A30, q35.COL15, q35.COL19, q35.A55, q35.COL17, q35.COL35, q35.A12, q35.COL6, q35.COL4, q35.A51, q35.A56, q35.A1, q35.A13, q35.A52, q35.A26, q35.COL13, q35.A36, q35.COL21, q35.COL23, q35.A48, q35.A32, q35.COL8, q35.COL26, q35.COL20, q35.A37, q35.COL14, q35.A42, q35.A14, q35.COL2, q35.A25, q35.A50, q35.A29, q35.COL24, q35.A2, q35.A31, q35.A17, q35.COL9, q35.A11, q35.B, q35.A15, q35.A8, q35.COL33, q35.COL29, q35.A35, q35.COL16, q35.COL5, q35.A23, q35.A, q35.COL31, q35.A41, q35.A3, q35.A44, q35.COL34, q35.A38, q35.A39, q35.A53, q35.U, q35.A20, q35.COL22, q35.COL12, q35.A18, q35.COL1, q35.A10, q35.COL3, q35.COL28, q35.D, q35.A6, q35.A57, q35.A28, q35.A43, q35.A58, q35.excl, q35.A21, q35.A33, q35.A5, q35.A22, q35.COL27, q35.A16, q35.A24, q35.A49, q35.A47, q35.A7, q35.COL30, q35.A46, q35.A19]; line 1 pos 21;

Below is few list of libraries I used and their version comparison

<style> </style>
S.no library previous version current version Same version
1 arrow 8.0.0 8.0.0 TRUE
2 base 4.0.2 4.0.2 TRUE
3 datacompareR 0.1.4 0.1.4 TRUE
4 datasets 4.0.2 4.0.2 TRUE
5 DBI 1.1.2 1.1.2 TRUE
6 dbplyr 2.1.1 2.2.0 FALSE
7 devtools 2.4.3 2.4.3 TRUE
8 dplyr 1.0.9 1.0.9 TRUE
9 forcats 0.5.1 0.5.1 TRUE
10 ggplot2 3.3.6 3.3.6 TRUE
11 graphics 4.0.2 4.0.2 TRUE
12 grDevices 4.0.2 4.0.2 TRUE
13 lubridate 1.8.0 1.8.0 TRUE
14 methods 4.0.2 4.0.2 TRUE
15 purrr 0.3.4 0.3.4 TRUE
16 readr 2.1.2 2.1.2 TRUE
17 RevoUtilsMath 11.0.0 11.0.1 FALSE
18 sparklyr 1.7.5 1.7.7 FALSE
19 stats 4.0.2 4.0.2 TRUE
20 stringr 1.4.0 1.4.0 TRUE
21 testthat 3.1.4 3.1.4 TRUE
22 tibble 3.1.7 3.1.7 TRUE
23 tidyr 1.2.0 1.2.0 TRUE
24 tidyverse 1.3.1 1.3.1 TRUE
25 usethis 2.1.5 2.1.6 FALSE
26 utils 4.0.2 4.0.2 TRUE
27 zoo 1.8-10 1.8-10 TRUE
28 RSqlLite 2.2.14 2.2.14 TRUE

@bhuvanesh1707
Copy link
Author

Below is the session info I am working with while executing the reprex

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os Red Hat Enterprise Linux
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Chicago
#> date 2022-08-18
#> pandoc 2.17.1.1 @ /opt/revr/rstudio-server/bin/quarto/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> askpass 1.1 2019-01-13 [2] CRAN (R 4.0.2)
#> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.0.2)
#> base64enc 0.1-3 2015-07-28 [2] CRAN (R 4.0.2)
#> blob 1.2.3 2022-04-10 [2] CRAN (R 4.0.2)
#> cli 3.3.0 2022-04-25 [2] CRAN (R 4.0.2)
#> config 0.3.1 2020-12-17 [2] CRAN (R 4.0.2)
#> crayon 1.5.1 2022-03-26 [2] CRAN (R 4.0.2)
#> DBI * 1.1.2 2021-12-20 [2] CRAN (R 4.0.2)
#> dbplyr * 2.2.0 2022-06-05 [2] CRAN (R 4.0.2)
#> digest 0.6.29 2021-12-01 [2] CRAN (R 4.0.2)
#> dplyr * 1.0.9 2022-04-28 [2] CRAN (R 4.0.2)
#> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.0.2)
#> evaluate 0.15 2022-02-18 [2] CRAN (R 4.0.2)
#> fansi 1.0.3 2022-03-24 [2] CRAN (R 4.0.2)
#> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.0.2)
#> forge 0.2.0 2019-02-26 [2] CRAN (R 4.0.2)
#> fs 1.5.2 2021-12-08 [2] CRAN (R 4.0.2)
#> generics 0.1.2 2022-01-31 [2] CRAN (R 4.0.2)
#> glue 1.6.2 2022-02-24 [2] CRAN (R 4.0.2)
#> highr 0.9 2021-04-16 [2] CRAN (R 4.0.2)
#> htmltools 0.5.2 2021-08-25 [2] CRAN (R 4.0.2)
#> htmlwidgets 1.5.4 2021-09-08 [2] CRAN (R 4.0.2)
#> httr 1.4.3 2022-05-04 [2] CRAN (R 4.0.2)
#> jsonlite 1.8.0 2022-02-22 [2] CRAN (R 4.0.2)
#> knitr 1.39 2022-04-26 [2] CRAN (R 4.0.2)
#> lifecycle 1.0.1 2021-09-24 [2] CRAN (R 4.0.2)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.0.2)
#> openssl 2.0.2 2022-05-24 [2] CRAN (R 4.0.2)
#> pillar 1.7.0 2022-02-01 [2] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [2] CRAN (R 4.0.2)
#> R.cache 0.15.0 2021-04-30 [2] CRAN (R 4.0.2)
#> R.methodsS3 1.8.2 2022-06-13 [2] CRAN (R 4.0.2)
#> R.oo 1.25.0 2022-06-12 [2] CRAN (R 4.0.2)
#> R.utils 2.11.0 2021-09-26 [2] CRAN (R 4.0.2)
#> r2d3 0.2.6 2022-02-28 [2] CRAN (R 4.0.2)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.0.2)
#> reprex 2.0.1 2021-08-05 [2] CRAN (R 4.0.2)
#> rlang 1.0.2 2022-03-04 [2] CRAN (R 4.0.2)
#> rmarkdown 2.14 2022-04-25 [2] CRAN (R 4.0.2)
#> rprojroot 2.0.3 2022-04-02 [2] CRAN (R 4.0.2)
#> rstudioapi 0.13 2020-11-12 [2] CRAN (R 4.0.2)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.0.2)
#> sparklyr * 1.7.7 2022-06-07 [2] CRAN (R 4.0.2)
#> stringi 1.7.6 2021-11-29 [2] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [2] CRAN (R 4.0.2)
#> styler 1.7.0 2022-03-13 [2] CRAN (R 4.0.2)
#> tibble 3.1.7 2022-05-03 [2] CRAN (R 4.0.2)
#> tidyr 1.2.0 2022-02-01 [2] CRAN (R 4.0.2)
#> tidyselect 1.1.2 2022-02-21 [2] CRAN (R 4.0.2)
#> utf8 1.2.2 2021-07-24 [2] CRAN (R 4.0.2)
#> vctrs 0.4.1 2022-04-13 [2] CRAN (R 4.0.2)
#> withr 2.5.0 2022-03-03 [2] CRAN (R 4.0.2)
#> xfun 0.31 2022-05-10 [2] CRAN (R 4.0.2)
#> yaml 2.3.5 2022-02-21 [2] CRAN (R 4.0.2)
#>
#> [1] /opt/revr/library/4.0/2022-06-16/bio
#> [2] /opt/revr/library/4.0/2022-06-16/cran
#> [3] /opt/revr/library/4.0/2022-06-16/github
#> [4] /opt/revr/library/4.0/2022-06-16/misc
#> [5] /opt/revr/library/4.0/2022-06-16/tu
#> [6] /opt/revr/library/4.0/2022-06-16/tu_test
#> [7] /opt/revr/ropen/4.0.2/lib64/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────

@tschutte
Copy link

tschutte commented Aug 26, 2022

When I run your example I actually get a different error:
Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Max iterations (100) reached for batch Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value., tree:.

Adding this to my spark config fixed the issue:

conf <- spark_config()
conf$spark.sql.analyzer.maxIterations <- 1000
cn <- spark_connect(master = "local", app_name = "newJob", conf = conf)

Perhaps trying this will fix your error?

@bhuvanesh1707
Copy link
Author

bhuvanesh1707 commented Aug 29, 2022

1.7.7

I have changed the maxIteration as below and checked in the Spark environment, spark iteration was set to 1000

conf <- spark_config()
conf$spark.sql.analyzer.maxIterations <- 1000
cn <- spark_connect(master = "yarn-client", app_name = "newJob", conf = conf)

spark.sql.analyzer.maxIterations 1000

But when I am executing, I am getting this Max Iteration as a warning and not as a error in my spark log. Below are the spark log. And the spark version is 2.4.4

Spark Log:
22/08/29 00:53:16 WARN HiveSessionStateBuilder$$anon$1: Max iterations (100) reached for batch Resolution
22/08/29 00:53:21 WARN HiveSessionStateBuilder$$anon$1: Max iterations (100) reached for batch Resolution
22/08/29 00:53:21 ERROR sparklyr: Backend (96256) failed calling sql on 9: org.apache.spark.sql.AnalysisException: cannot resolve 'A' given input columns: [q35.COL8, q35.B, q35.COL18, q35.A23, q35.A24, q35.A6, q35.A39, q35.A19, q35.A35, q35.D, q35.A43, q35.excl, q35.A55, q35.COL28, q35.A36, q35.A7, q35.A37, q35.COL1, q35.COL2, q35.COL11, q35.A38, q35.COL24, q35.COL30, q35.COL27, q35.A46, q35.A13, q35.A32, q35.COL20, q35.A29, q35.A49, q35.COL25, q35.A8, q35.COL34, q35.A10, q35.A5, q35.A44, q35.A41, q35.COL16, q35.A45, q35.COL15, q35.A21, q35.COL29, q35.A31, q35.COL4, q35.A50, q35.A47, q35.U, q35.A56, q35.A18, q35.COL9, q35.A52, q35.COL21, q35.COL14, q35.A54, q35.A57, q35.A12, q35.A25, q35.A58, q35.COL26, q35.COL7, q35.A51, q35.A3, q35.A40, q35.COL5, q35.A42, q35.A48, q35.A, q35.COL32, q35.A33, q35.A15, q35.COL10, q35.COL17, q35.COL23, q35.A26, q35.COL3, q35.A4, q35.A17, q35.A9, q35.A16, q35.COL12, q35.A30, q35.COL13, q35.A20, q35.COL22, q35.A14, q35.A11, q35.A22, q35.COL31, q35.A53, q35.A27, q35.A28, q35.COL19, q35.COL35, q35.A34, q35.COL6, q35.A1, q35.COL33, q35.A2]; line 1 pos 21;

@edgararuiz
Copy link
Collaborator

Hi, have you considered splitting the snippet into two or more dplyr code blocks? I was able to recreate the error using a "local" Spark connection, and was also able to get the query to compute after moving the las 3 mutate() statements to a new variable:

#Load libraries
library(DBI) #1.1.2
library(sparklyr) #1.7.7
#> 
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(dplyr) #1.0.9
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr) #2.2.0
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

#Dummy data frame with 1 row and 62 columns
dummy_data <- data.frame( A = "XX"
                          , B = 1L           
                          , U = 1L           
                          , D = T
                          , A1 = as.Date("1900-01-01")
                          , A2 = 1L
                          , A3 = 1L
                          , A4 = 1L           
                          , A5 = 1L           
                          , A6 = T
                          , A7 = as.Date("1900-01-01")
                          , A8 = T
                          , A9 = T
                          , A10 = "XX"
                          , A11 = "XX"
                          , A12 = T
                          , A13 = T
                          , A14 = as.Date("1900-01-01")
                          , A15 = T
                          , A16 = T
                          , A17 = 1L
                          , A18 = 1L
                          , A19 = T
                          , A20 = as.Date("1900-01-01")
                          , A21 = as.Date("1900-01-01")
                          , A22 = 1L
                          , A23 = 1L
                          , A24 = 1L
                          , A25 = 1L
                          , A26 = 1L
                          , A27 = 1L
                          , A28 = 1L
                          , A29 = T
                          , A30 = T
                          , A31 = T
                          , A32 = T
                          , A33 = T
                          , A34 = T
                          , A35 = "XX"
                          , A36 = "XX"
                          , A37 = "XX"
                          , A38 = "XX"
                          , A39 = "XX"
                          , A40 = "XX"
                          , A41 = T
                          , A42 = as.Date("1900-01-01")
                          , A43 = T
                          , A44 = 1L
                          , A45 = T
                          , A46 = 1L
                          , A47 = as.Date("1900-01-01")
                          , A48 = "XX"
                          , A49 = as.Date("1900-01-01")
                          , A50 = as.Date("1900-01-01")
                          , A51 = as.Date("1900-01-01")
                          , A52 = as.Date("1900-01-01")
                          , A53 = as.Date("1900-01-01")
                          , A54 = "XX"
                          , A55 = as.Date("1900-01-01")
                          , A56 = 1L
                          , A57 = 1L
                          , A58 = T
)

#Create spark connection
spark_disconnect_all()
#> [1] 0

## Changed it to local for the reprex
cn <- spark_connect("local")

#Copying dummy_data data frame to spark data frame
spark_dummy_data <- copy_to(cn, dummy_data, overwrite = T) %>% compute()

#Execute code snippet
A1 <- spark_dummy_data %>% 
  mutate(excl = case_when(
    A %in% c("HHH","GGG") ~ (A17 == 1 & !A13)
    , A == "HHH" ~ T
    , T ~ (!A13)
  )) %>% 
  filter(excl) %>% 
  mutate(
    COL1 = case_when(
      A == "III" ~ B %in%  c(3,72,70)
      , T ~ (B == 3)
    )
    , COL2 = (B == 34)
    , COL3 = (B ==  71)
    , COL4 = (B == 79)
    , COL5 = (B == 83)
    , COL6 = case_when(A == "PPP" ~ B %in% c(5,6), T ~ B == 6)
    , COL7 = case_when(
      A == "III" ~ (B == 5 | B == 6)
      , T ~ COL6
    )
    , COL8 = case_when(
      A == "III" ~ B %in%  c(4,69,74)
      , T ~ (B == 4)
    )
    , COL9 = (B == 4 & A31)
    , COL10 = (B == 25)
    , COL11 = (B ==5)
    , COL12 = case_when(
      A == "GGG" ~ F
      , T ~ T
    )
    , COL13 = A15
    , COL14 = B %in% c(34,35,21,9,37,38,39)
    , COL15 = case_when(
      A == "III" ~ (B != 21)
      , A == "CCC" ~ A30
      , A == "ZZZ" ~ !COL14
      , A == "PPP" ~ (B != 17 & B != 21)
      , T ~ (U < 12)
    )
    , COL16 = case_when(
      A %in% c("III","ZZZ") ~ (B == 20)
      , A == "CCC" ~ (U == 15)
      , T ~ (U == 11)
    )
    , COL17 = (B == 19)
    , COL18 = (B == 29)
    , COL19 = case_when(A == "PPP" ~
                          case_when(A18 > 60 & !is.na(A18) ~ TRUE
                                    , T ~ FALSE)
                        , T ~ (B ==90))
    , COL20 = case_when(A == "PPP" ~
                          case_when((D | COL18 | COL6 | COL1) ~
                                      case_when((A35 == "M" & (A18) >= 37 &  (A18) <= 60) |
                                                  (A35 == "A" & (A18*12) >= 37 &  (A18*12) <= 60) |
                                                  (A35 == "B" & (A18/2) >= 37 &  (A18/2) <= 60) |
                                                  (A35 == "Q" & (A18*3) >= 37 &  (A18*3) <= 60) |
                                                  (A35 == "S" & (A18*6) >= 37 &  (A18*6) <= 60) |
                                                  (A35 == "W" & (A18/4) >= 37 &  (A18/4) <= 60) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==91 ))
    , COL21 = case_when(A == "PPP" ~
                          case_when(D | COL18 | COL6 | COL1 ~
                                      case_when((A35 == "M" & A18 == 1) |
                                                  (A35 == "W" & (A18/4) == 1) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==33 ))
    , COL22 = case_when(A == "PPP" ~
                          case_when(D | COL18 | COL6 | COL1 ~
                                      case_when((A35 == "M" & (A18) >= 2 &  (A18) <= 36) |
                                                  (A35 == "A" & (A18*12) >= 2 &  (A18*12) <= 36) |
                                                  (A35 == "B" & (A18/2) >= 2 &  (A18/2) <= 36) |
                                                  (A35 == "Q" & (A18*3) >= 2 &  (A18*3) <= 36) |
                                                  (A35 == "S" & (A18*6) >= 2 &  (A18*6) <= 36) |
                                                  (A35 == "W" & (A18/4) >= 2 &  (A18/4) <= 36) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==92 ))      
    , COL23 = (COL21 & U == 1) 
    , COL24 = (COL22 & U == 1) 
    , COL25 = (A15 & U == 1)
    , COL26 = (D & U == 2)
    , COL27 = (A15 & U == 2)
    , COL28 = (U == 2 & COL19)
    , COL29 = (U == 2 & COL20)
    , COL30= (U == 2 & COL21)
    , COL31 = (U == 2 & COL22) 
    , COL32 = B ==35
    , COL33 = F
    , COL34 = case_when(
      A == "GGG" ~ (B == 28)
      , A == "ZZZ" ~ F
      , T ~ (B == 59)
    )
    , COL35 = case_when(
      A == "GGG" ~ as.integer(A37) %in% c(1, 2, 6, 8, 9) & as.integer(A38) %in% c(29, 62)
      , A == "ZZZ" ~ F
      ,T ~ (B == 13)
    )) %>% 
  compute()

A <- A1 %>% 
  mutate(
    COL36 = case_when(
      A == "ZZZ" ~ F
      , T ~ (B == 8)
    )
    , COL37 = (U == 3)
    , COL38 = case_when(
      A == "ZZZ" ~ F
      , T ~ (D & U != 1)
    )
  ) %>% 
  compute()

A
#> # Source: spark<dbplyr_003> [?? x 101]
#>   A         B     U D     A1        A2    A3    A4    A5 A6    A7    
#>   <chr> <int> <int> <lgl> <date> <int> <int> <int> <int> <lgl> <date>
#> 1 <NA>     NA    NA NA    NA        NA    NA    NA    NA NA    NA    
#> # … with 90 more variables: A8 <lgl>, A9 <lgl>, A10 <chr>, A11 <chr>,
#> #   A12 <lgl>, A13 <lgl>, A14 <date>, A15 <lgl>, A16 <lgl>, A17 <int>,
#> #   A18 <int>, A19 <lgl>, A20 <date>, A21 <date>, A22 <int>, A23 <int>,
#> #   A24 <int>, A25 <int>, A26 <int>, A27 <int>, A28 <int>, A29 <lgl>,
#> #   A30 <lgl>, A31 <lgl>, A32 <lgl>, A33 <lgl>, A34 <lgl>, A35 <chr>,
#> #   A36 <chr>, A37 <chr>, A38 <chr>, A39 <chr>, A40 <chr>, A41 <lgl>,
#> #   A42 <date>, A43 <lgl>, A44 <int>, A45 <lgl>, A46 <int>, A47 <date>, …

spark_disconnect()
#> Error in UseMethod("spark_disconnect"): no applicable method for 'spark_disconnect' applied to an object of class "NULL"

Created on 2022-08-29 by the reprex package (v2.0.1)

@bhuvanesh1707
Copy link
Author

bhuvanesh1707 commented Sep 8, 2022

The above solution of breaking into multiple mutate statement and assigning it to a new variable has resolved the issue. But it is not a single error that comes up like "cannot resolve 'A' column" where we can break the dplyr code. There are other errors as well like

"Error: org.apache.spark.sql.AnalysisException: invalid cast from date to timestamp;;" and
"Error: org.apache.spark.sql.AnalysisException: cannot resolve '(CAST(q54.A57 AS DECIMAL(10,0)) = 0.0BD)' due to data type mismatch: differing types in '(CAST(q54.A57 AS DECIMAL(10,0)) = 0.0BD)' (decimal(10,0) and decimal(1,1)).; line 1 pos 31;"

Date to time stamp casting error where I do not have timestamp column and decimal to Big decimal where all my columns are integer. I can overcome this error by the above solution by further breaking the mutate statement.
But this type cast error are deceiving since data type is not the actual issue.
Below is the reprex for " invalid cast from date to timestamp;;"

#Dummy data frame with 1 row and 62 columns
dummy_data <- data.frame( A = "XX"
                          , B = 1L           
                          , U = 1L           
                          , D = T
                          , A1 = as.Date("1900-01-01")
                          , A2 = 1L
                          , A3 = 1L
                          , A4 = 1L           
                          , A5 = 1L           
                          , A6 = T
                          , A7 = as.Date("1900-01-01")
                          , A8 = T
                          , A9 = T
                          , A10 = "XX"
                          , A11 = "XX"
                          , A12 = T
                          , A13 = T
                          , A14 = as.Date("1900-01-01")
                          , A15 = T
                          , A16 = T
                          , A17 = 1L
                          , A18 = 1L
                          , A19 = T
                          , A20 = as.Date("1900-01-01")
                          , A21 = as.Date("1900-01-01")
                          , A22 = 1L
                          , A23 = 1L
                          , A24 = 1L
                          , A25 = 1L
                          , A26 = 1L
                          , A27 = 1L
                          , A28 = 1L
                          , A29 = T
                          , A30 = T
                          , A31 = T
                          , A32 = T
                          , A33 = T
                          , A34 = T
                          , A35 = "XX"
                          , A36 = "XX"
                          , A37 = "XX"
                          , A38 = "XX"
                          , A39 = "XX"
                          , A40 = "XX"
                          , A41 = T
                          , A42 = as.Date("1900-01-01")
                          , A43 = T
                          , A44 = 1L
                          , A45 = T
                          , A46 = 1L
                          , A47 = as.Date("1900-01-01")
                          , A48 = "XX"
                          , A49 = as.Date("1900-01-01")
                          , A50 = as.Date("1900-01-01")
                          , A51 = as.Date("1900-01-01")
                          , A52 = as.Date("1900-01-01")
                          , A53 = as.Date("1900-01-01")
                          , A54 = "XX"
                          , A55 = as.Date("1900-01-01")
                          , A56 = 1L
                          , A57 = 1L
                          , A58 = T
)

#Load libraries
library(DBI)
library(sparklyr)
#> 
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

#Create spark connection

spark_disconnect_all()
#> [1] 0
cn <- spark_connect(master = "yarn-client", app_name = "newJob")

#Copying dummy_data data frame to spark data frame
spark_dummy_data <- copy_to(cn, dummy_data, overwrite = T) %>% compute()




#Execute code snippet
df1 <- spark_dummy_data %>% 
  mutate(excl = case_when(
    A %in% c("HHH","GGG") ~ (A17 == 1 & !A13)
    , A == "HHH" ~ T
    , T ~ (A13)
  )) %>% 
  filter(excl) %>% 
  mutate(
    COL1 = case_when(
      A == "III" ~ B %in%  c(3,72,70)
      , T ~ (B == 3)
    )
    , COL2 = (B == 34)
    , COL3 = (B ==  71)
    , COL4 = (B == 79)
    , COL5 = (B == 83)
    , COL6 = case_when(A == "PPP" ~ B %in% c(5,6), T ~ B == 6)
    , COL7 = case_when(
      A == "III" ~ (B == 5 | B == 6)
      , T ~ COL6
    )
    , COL8 = case_when(
      A == "III" ~ B %in%  c(4,69,74)
      , T ~ (B == 4)
    )
    , COL9 = (B == 4 & A31)
    , COL10 = (B == 25)
    , COL11 = (B ==5)
    , COL12 = case_when(
      A == "GGG" ~ F
      , T ~ T
    )
    , COL13 = A15
    , COL14 = B %in% c(34,35,21,9,37,38,39)
    , COL15 = case_when(
      A == "III" ~ (B != 21)
      , A == "CCC" ~ A30
      , A == "ZZZ" ~ !COL14
      , A == "PPP" ~ (B != 17 & B != 21)
      , T ~ (U < 12)
    )
    , COL16 = case_when(
      A %in% c("III","ZZZ") ~ (B == 20)
      , A == "CCC" ~ (U == 15)
      , T ~ (U == 11)
    )
    , COL17 = (B == 19)
    , COL18 = (B == 29)
    , COL19 = case_when(A == "PPP" ~
                          case_when(A18 > 60 & !is.na(A18) ~ TRUE
                                    , T ~ FALSE)
                        , T ~ (B ==90))
    , COL20 = case_when(A == "PPP" ~
                          case_when((D | COL18 | COL6 | COL1) ~
                                      case_when((A35 == "M" & (A18) >= 37 &  (A18) <= 60) |
                                                  (A35 == "A" & (A18*12) >= 37 &  (A18*12) <= 60) |
                                                  (A35 == "B" & (A18/2) >= 37 &  (A18/2) <= 60) |
                                                  (A35 == "Q" & (A18*3) >= 37 &  (A18*3) <= 60) |
                                                  (A35 == "S" & (A18*6) >= 37 &  (A18*6) <= 60) |
                                                  (A35 == "W" & (A18/4) >= 37 &  (A18/4) <= 60) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==91 ))
    , COL21 = case_when(A == "PPP" ~
                          case_when(D | COL18 | COL6 | COL1 ~
                                      case_when((A35 == "M" & A18 == 1) |
                                                  (A35 == "W" & (A18/4) == 1) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==33 ))
    , COL22 = case_when(A == "PPP" ~
                          case_when(D | COL18 | COL6 | COL1 ~
                                      case_when((A35 == "M" & (A18) >= 2 &  (A18) <= 36) |
                                                  (A35 == "A" & (A18*12) >= 2 &  (A18*12) <= 36) |
                                                  (A35 == "B" & (A18/2) >= 2 &  (A18/2) <= 36) |
                                                  (A35 == "Q" & (A18*3) >= 2 &  (A18*3) <= 36) |
                                                  (A35 == "S" & (A18*6) >= 2 &  (A18*6) <= 36) |
                                                  (A35 == "W" & (A18/4) >= 2 &  (A18/4) <= 36) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==92 ))      
    , COL23 = (COL21 & U == 1) 
    , COL24 = (COL22 & U == 1) 
    , COL25 = (A15 & U == 1)
    , COL26 = (D & U == 2)
    , COL27 = (A15 & U == 2)
    , COL28 = (U == 2 & COL19)
    , COL29 = (U == 2 & COL20)
    , COL30= (U == 2 & COL21)
    , COL31 = (U == 2 & COL22) 
    , COL32 = B ==35
    , COL33 = F
    , COL34 = case_when(
      A == "GGG" ~ (B == 28)
      , A == "ZZZ" ~ F
      , T ~ (B == 59)
    )
    , COL35 = case_when(
      A == "GGG" ~ as.integer(A37) %in% c(1, 2, 6, 8, 9) & as.integer(A38) %in% c(29, 62)
      , A == "ZZZ" ~ F
      ,T ~ (B == 13)
    )
  ) %>% compute()

df2 <- df1 %>% 
  mutate(
    COL36 = case_when(
      A == "ZZZ" ~ F
      , T ~ (B == 8)
    )
    , COL37 = (U == 3)
    , COL38 = case_when(
      A == "ZZZ" ~ F
      , T ~ (D & U != 1)
    )
    
    , COL39 = case_when(
      A == "XX" ~ B %in% c(90,91,92)
      , T ~ (!A6)
    )
    , COL40 = case_when(
      A == "XX" ~ F
      , T ~ (!A6 & U == 1)
    )
    , COL41 = case_when(
      A == "XX" ~ F
      , T ~ (!A6 & U == 2)
    )
    , COL42 = (B == 10)
    , COL43 = (A10 == "COM")
    , COL44 = case_when(
      A == "XX" ~ case_when(A6 == F ~ T, T ~ F)
      , A == "XX" ~ B %in% c(90,91,92)
      , T ~ !A6 & A9
    )
    , COL45 = !A9
    , COL46 = case_when(!is.na(A1) ~ as.integer(months_between(last_day(A7), last_day(A1))), T ~ as.integer(999))
    , COL46 = case_when(
      A == "XX" & COL46 != 999 ~ as.integer(COL46)
      , T ~ as.integer(COL46)
    )
    , COL47 = case_when(
      A == "XX" ~ A10 %in% c('XX') | A11 %in% c('YY')
      , T ~ F
    )
    , COL48 = case_when(
      A == "XX" ~ (
        A10 %in% c('XX')
        | A11 %in% c('YY')
        | A29
        | A30
      )
      , (A == "ZZ" & (A10 == "DDD")) ~ TRUE
      , T ~ F
    )
    , COL49 = case_when(
      A == "XX" ~ COL48
      , (A == "ZZ" & (A10 == "DDD")) ~ TRUE
      , T ~ F
    )
    , COL50 = case_when(
      A == "XX" ~ (
        A10 %in% c('X')
        | A11 %in% c('ZZ')
        | A29
        | A30
      )
      , (A == "ZZ" & (A10 == "DDD" | A8)) ~ TRUE
      , T ~ F
    )
    , COL51 = case_when(
      A == "XX" ~ COL48
      , (A == "ZZ" & (A10 == "DDD" | A8)) ~ TRUE
      , T ~ F
    )
    , COL52 = case_when(
      A == "XX" ~ (A10 == "TR" | A11 == "TR")
      , T ~ F
    )
    , COL53 = 999
    , COL54 = case_when(
      A == "XX" ~ as.integer(pmax(0, 84 - COL46))
      , A == "XX" ~ as.integer(pmax(0, 36 - COL46))
      , A == "XX" ~ as.integer(82)
      , A == "XXY" ~ as.integer(48)
      , A %in% c("ZZ","XX") ~ as.integer(24)
      , A %in% c("YY","ZZ") ~ as.integer(84)
      , T ~ as.integer(nchar(A10))
    )
    , COL55 = case_when(
      A == "XX" ~ -7
      , T ~ A4
    )
    , COL56 = A5
    , A5 = case_when(
      A %in% c("XX","XX") ~ case_when(
        A5 >= 0 ~ as.double(as.integer(A5))
        , A5 < 0 ~ as.double(NA)
      )
      , A == "XX" ~ case_when(substr(A10,1,1) != '2' & substr(A11,1,1) != 'D' & A5 != -1 ~ as.double(A5), T ~as.double(NA)) 
      , T ~ as.double(A5)
    )
    , COL57= case_when(
      A == "XX" ~ case_when(
        A24 >= 0 ~ A24
        , A24 < 0 ~ as.double(NA)
      )
      , T ~ as.double(A24)
    )
    , COL58 = case_when(A1 != '1900-01-01' ~ A1, T ~ A7)
    , COL59 = case_when(!is.na(A1) ~ as.integer(months_between(last_day(A7), last_day(A1))))
    , COL60 = case_when(
      A == "XX" & A5 != 999 ~ as.integer(A5)
      , T ~ as.integer(A5)
    )
    , COL61 = case_when(
      A %in% c("XX","XX") ~ as.numeric(case_when(
        A17 >= 0 ~ as.integer(A17)
        , A18 > 0  ~ as.integer(A18)
        , T ~ as.integer(NA)
      ))
      , A == "XX" ~ as.numeric(A17)
      , A == "ZZ" ~ case_when(A10 == "YY" & A17 >= 0 ~ A17)
      , A == "XX" ~ case_when(
        as.integer(A17 > 0) > 0 ~ as.numeric(A17)
        , as.integer((A17 <= 0 | is.na(A17))) > 0 ~ as.numeric(NA)
      )
      , A == "YY" ~ case_when(
        as.integer(!A32 & A17 > 0 & !is.na(A17)) > 0 ~ as.numeric(A17)
        , as.integer(!A32 & (A17 <= 0 | is.na(A17)) & A34 & A18 >= 0) > 0 ~ as.numeric(A18)
        , T ~ as.numeric(NA)
      )
      , A == "ZZ" ~ case_when(
        A17 > 1 ~ as.numeric(A17)
        , A18 > 0 ~ as.numeric(A18)
        , TRUE ~ as.numeric(NA)
      )
      , T ~ case_when(
        as.integer(!A32 & A17 > 0 & !is.na(A17)) > 0 ~ as.numeric(A17)
        , as.integer(!A32 & (A17 <= 0 | is.na(A17)) & A34 & A18 >= 0) > 0 ~ as.numeric(A18)
        , T ~ as.numeric(A5)
      )
    )
    , COL62 = case_when(
      A == "XX" ~ case_when(
        A17 >=0 & (A9 | A15) & A10 == "M" ~ A17 * A2
        , A34 & A5 >= 0 & !(is.na(A5))  ~ as.numeric(A5 * 0.05)
      )
      , A == "ZZ" ~ case_when(
        (A34 | A33) & A5 >= 0 ~ as.double(A5 * 0.03)
        , (A9 | A33) & A18 >= 0 & A10 == "YY" ~ as.double(A18)
      )
      , A == "XX" ~ A17
      , is.na(A17) ~ as.numeric(NA)
      , A10 == "B" ~ A17 * 2
      , A10 == "D" ~ A17
      , A10 == "E" ~ A17 * 2
      , A10 == "I" ~ A17*30
      , A10 == "L" ~ A17 / 2
      , A10 == "M" ~ A17
      , A10 == "P" ~ A17
      , A10 == "Q" ~ A17 / 3
      , A10 == "S" ~ A17 / 6
      , A10 == "T" ~ A17 / 4
      , A10 == "W" ~ A17 * 4.3
      , A10 == "Y" ~ A17 / 12
      , A10 == "O" ~ A17
      , T ~ A17
    )
    , COL63 = case_when(
      is.na(A4) ~ as.numeric(NA)
      , A10 == "B" ~ A4*2
      , A10 == "D" ~ A4
      , A10 == "E" ~ A4*2
      , A10 == "I" ~ A4*30
      , A10 == "L" ~ A4/2
      , A10 == "M" ~ A4
      , A10 == "P" ~ A4
      , A10 == "Q" ~ A4/3
      , A10 == "S" ~ A4/6
      , A10 == "T" ~ A4/4
      , A10 == "W" ~ A4*4.3
      , A10 == "Y" ~ A4/12
      , A10 == "O" ~ A4
      , T ~ A4
    )
  ) %>% compute()



df3 <- df2 %>% 
  mutate(
    COL64 = "XX"
    , COL65 = 1L
    , COL66 = "YY"
    , COL67 = 2L
    , COL68 = "ZZ"
    , COL69 = 3L
    , COL70 = "XY"
    , COL71 = 4L
    , COL72 = case_when(
      A %in% c("XXX","XXX","XXY") ~ (A2 == 1)
      , T ~ (A57 %in% c(9) & B == 3)
    )
    , COL73 = case_when(
      A == "IIO" ~ F
      , T ~ (B == 20 & A8)
    )
    , COL74 = case_when(
      A == "IIO" ~ F
      , T ~ (B == 20 & A8)
    )
    , COL75 = case_when(
      A %in% c("XYY","GGG", "XXY") ~ B == 4 & !is.na(A17) & A17 > A18
      , A == "XXX" ~ !is.na(A17) & A17 > A18 & A58
      , T ~ B == 4 & A5 == 1 & !is.na(A17) & A17 >= A18
    )
    , COL76 = (A8 & A5 == 1)
    , COL77 = A8
    , COL78 = A8
    , COL79 = (A17 == 2)
    , COL80 = case_when(
      A == "XXX" ~ (A17 %in% c(1,2,6))
      , T ~ (A17 %in% c(1,2,4))
    )
    , COL81 = case_when(
      A == "XXX" ~ (A17 %in% c(1,6))
      , T ~ (A17 == 1)
    )
    , COL82 = (A3 >= 2 & A3 <= 5 & !is.na(A3))
    , COL83 = case_when(
      A == "XXX" ~ (A5 == 1 & A12)
      , T ~ (A5 == 1 & A3 >= 6 & A3 <= 60 & !is.na(A3))
    )
    , COL84 = case_when(
      A == "XXX" ~ (A5 == 1 & A43)
      , T ~ (A5 == 1 & A3 > 60 & !is.na(A3))
    )
    , COL85 = B == 36
    , COL86 = (B == 21)
    , COL87 = (B == 19)
    , COL88 = case_when(
      A == "IIO" ~ (B %in% c(98,9,20,8,61,4,36,65,74,81) & A8)
      , A == "XXY" ~ (!A12 & A8 & B != 19)
      , T ~ (!A12 & A8)
    )
    , COL89 = case_when(
      A == "XXY" ~ case_when((A12 | A12| A13) & A8 == T ~ T, T ~ F)
      , T ~ case_when(A12 == T & A8 == T ~ T, T ~ F)
    )
    , COL90 = (A5 != 1 & A8)
    , COL91 = (B == 9)
    , COL92 = case_when(
      A == "IIO" ~ B %in%  c(9, 20)
      , A == "XYY" ~ F
      , T ~ (A12 | (A8 & A13))
    )
    , COL93 = case_when(A == "XXY" ~ B %in% c(17,21),T ~ B == 17)
    , COL94 = (B == 8)
    , COL95 = (B == 27)
    , COL96 = case_when(A == "XXY" ~ A10 =="A58", T ~ B != 6 & A17 > 0)
    , COL97= (B != 6 & A18 > 0)
    , COL98 = case_when(
      A %in% c("XXX") ~ F
      , T ~ (B == 5)
    )
    , COL99 = case_when(
      A %in% c("IIO","XXX", "GGG","XXY") ~ F
      , T ~ (B == 5)
    )
    , COL100 = (B == 43)
    , COL101 = case_when(
      A == "IIO" ~ A8 & COL46 < 36
      , T ~ A8
    )
    , COL102 = A8
    , COL103 = case_when(
      A == "XXX" ~ (!A8 & !A9)
      , A == "XYY" ~ (A8)
      , A == "DDD" ~ (!A8 & !A9)
      , T ~ A30 & (!A8 & !A9)
    )
    
    , COL104 = case_when(
      A == "XXX" ~ (A16)
      , A == "DDD" ~ T
      , T ~ A8
    ) #changed from TRUE to A30 for XXX
    
    , COL105 = case_when(A == "DDD" ~ (!A9 & !A8)
                         , T ~ (A30 & !A9 & !A8))
    , COL106 = 999
    , COL107 = case_when(
      A == "GGG" ~ as.integer(999)
      , T ~ case_when(!is.na(A1) ~ as.integer(months_between(last_day(A7), last_day(A1))), T ~ as.integer(999))
    )
    , COL108 = as.integer(999) # TODO
    , COL109 = as.integer(999)
    , COL110 = as.integer(999)
    , COL111 = as.integer(999) # TODO
    , COL116 = case_when(
      A == "XXX" & A46 != 999  ~ as.integer(A46)
      , T ~ as.integer(A46)
    )     
    ,COL117  = case_when(!is.na(A1) ~ as.integer(datediff(A7, A1)),T ~ as.integer(999))
    
    , COL119 = case_when(!is.na(A20) ~ as.integer(months_between(last_day(A7), last_day(A20))), T ~ as.integer(999))
    , COL120 = case_when(
      A == "XXX" & A5 != 999 ~ as.integer(A5 )
      , T ~ A5
    )
    , COL121 = case_when(
      A == "XXX" ~ as.integer(A5)
      , T ~ as.integer(COL46)
    )
    , COL122 = case_when(
      A == "XYY" ~ as.integer(A30)
      , T ~ as.integer(99)
    )
    , COL125 = F
    , COL126 = case_when(A19 & A57 == 0 ~ T, T ~ F)
    , COL127 = case_when(
      A == "XXX" ~ case_when(!is.na(A20) & A20 <= A7 ~ as.integer(months_between(last_day(A7), last_day(A20))), T ~ as.integer(999))
      , T ~ A46
    )
    , COL128 = case_when(A == "XXX" ~ F ,A19 & !A8 & A5 > 0 ~ T, T ~ F)
    
  ) %>% 
  compute("df3") 
#> Error: org.apache.spark.sql.AnalysisException: invalid cast from date to timestamp;;
#> 'Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 159 more fields]
#> +- SubqueryAlias `q53`
#>    +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 158 more fields]
#>       +- SubqueryAlias `q52`
#>          +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 157 more fields]
#>             +- SubqueryAlias `q51`
#>                +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 156 more fields]
#>                   +- SubqueryAlias `q50`
#>                      +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 155 more fields]
#>                         +- SubqueryAlias `q49`
#>                            +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 154 more fields]
#>                               +- SubqueryAlias `q48`
#>                                  +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 153 more fields]
#>                                     +- SubqueryAlias `q47`
#>                                        +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 152 more fields]
#>                                           +- SubqueryAlias `q46`
#>                                              +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 151 more fields]
#>                                                 +- SubqueryAlias `q45`
#>                                                    +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 150 more fields]
#>                                                       +- SubqueryAlias `q44`
#>                                                          +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 149 more fields]
#>                                                             +- SubqueryAlias `q43`
#>                                                                +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 148 more fields]
#>                                                                   +- SubqueryAlias `q42`
#>                                                                      +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 147 more fields]
#>                                                                         +- SubqueryAlias `q41`
#>                                                                            +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 146 more fields]
#>                                                                               +- SubqueryAlias `q40`
#>                                                                                  +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 145 more fields]
#>                                                                                     +- SubqueryAlias `q39`
#>                                                                                        +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 144 more fields]
#>                                                                                           +- SubqueryAlias `q38`
#>                                                                                              +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 143 more fields]
#>                                                                                                 +- SubqueryAlias `q37`
#>                                                                                                    +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 142 more fields]
#>                                                                                                       +- SubqueryAlias `q36`
#>                                                                                                          +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 141 more fields]
#>                                                                                                             +- SubqueryAlias `q35`
#>                                                                                                                +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 139 more fields]
#>                                                                                                                   +- SubqueryAlias `q34`
#>                                                                                                                      +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 138 more fields]
#>                                                                                                                         +- SubqueryAlias `q33`
#>                                                                                                                            +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 137 more fields]
#>                                                                                                                               +- SubqueryAlias `q32`
#>                                                                                                                                  +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A

Created on 2022-09-06 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.2 (2020-06-22)
#>  os       Red Hat Enterprise Linux
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Chicago
#>  date     2022-09-06
#>  pandoc   2.17.1.1 @ /opt/revr/rstudio-server/bin/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  askpass       1.1     2019-01-13 [2] CRAN (R 4.0.2)
#>  assertthat    0.2.1   2019-03-21 [2] CRAN (R 4.0.2)
#>  base64enc     0.1-3   2015-07-28 [2] CRAN (R 4.0.2)
#>  blob          1.2.3   2022-04-10 [2] CRAN (R 4.0.2)
#>  cli           3.3.0   2022-04-25 [2] CRAN (R 4.0.2)
#>  config        0.3.1   2020-12-17 [2] CRAN (R 4.0.2)
#>  crayon        1.5.1   2022-03-26 [2] CRAN (R 4.0.2)
#>  DBI         * 1.1.3   2022-06-18 [2] CRAN (R 4.0.2)
#>  dbplyr      * 2.2.1   2022-06-27 [2] CRAN (R 4.0.2)
#>  digest        0.6.29  2021-12-01 [2] CRAN (R 4.0.2)
#>  dplyr       * 1.0.9   2022-04-28 [2] CRAN (R 4.0.2)
#>  ellipsis      0.3.2   2021-04-29 [2] CRAN (R 4.0.2)
#>  evaluate      0.15    2022-02-18 [2] CRAN (R 4.0.2)
#>  fansi         1.0.3   2022-03-24 [2] CRAN (R 4.0.2)
#>  fastmap       1.1.0   2021-01-25 [2] CRAN (R 4.0.2)
#>  forge         0.2.0   2019-02-26 [2] CRAN (R 4.0.2)
#>  fs            1.5.2   2021-12-08 [2] CRAN (R 4.0.2)
#>  generics      0.1.3   2022-07-05 [2] CRAN (R 4.0.2)
#>  glue          1.6.2   2022-02-24 [2] CRAN (R 4.0.2)
#>  highr         0.9     2021-04-16 [2] CRAN (R 4.0.2)
#>  htmltools     0.5.2   2021-08-25 [2] CRAN (R 4.0.2)
#>  htmlwidgets   1.5.4   2021-09-08 [2] CRAN (R 4.0.2)
#>  httr          1.4.3   2022-05-04 [2] CRAN (R 4.0.2)
#>  jsonlite      1.8.0   2022-02-22 [2] CRAN (R 4.0.2)
#>  knitr         1.39    2022-04-26 [2] CRAN (R 4.0.2)
#>  lifecycle     1.0.1   2021-09-24 [2] CRAN (R 4.0.2)
#>  magrittr      2.0.3   2022-03-30 [2] CRAN (R 4.0.2)
#>  openssl       2.0.2   2022-05-24 [2] CRAN (R 4.0.2)
#>  pillar        1.7.0   2022-02-01 [2] CRAN (R 4.0.2)
#>  pkgconfig     2.0.3   2019-09-22 [2] CRAN (R 4.0.2)
#>  purrr         0.3.4   2020-04-17 [2] CRAN (R 4.0.2)
#>  R.cache       0.15.0  2021-04-30 [2] CRAN (R 4.0.2)
#>  R.methodsS3   1.8.2   2022-06-13 [2] CRAN (R 4.0.2)
#>  R.oo          1.25.0  2022-06-12 [2] CRAN (R 4.0.2)
#>  R.utils       2.12.0  2022-06-28 [2] CRAN (R 4.0.2)
#>  r2d3          0.2.6   2022-02-28 [2] CRAN (R 4.0.2)
#>  R6            2.5.1   2021-08-19 [2] CRAN (R 4.0.2)
#>  reprex        2.0.1   2021-08-05 [2] CRAN (R 4.0.2)
#>  rlang         1.0.4   2022-07-12 [2] CRAN (R 4.0.2)
#>  rmarkdown     2.14    2022-04-25 [2] CRAN (R 4.0.2)
#>  rprojroot     2.0.3   2022-04-02 [2] CRAN (R 4.0.2)
#>  rstudioapi    0.13    2020-11-12 [2] CRAN (R 4.0.2)
#>  sessioninfo   1.2.2   2021-12-06 [2] CRAN (R 4.0.2)
#>  sparklyr    * 1.7.7   2022-06-07 [2] CRAN (R 4.0.2)
#>  stringi       1.7.8   2022-07-11 [2] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [2] CRAN (R 4.0.2)
#>  styler        1.7.0   2022-03-13 [2] CRAN (R 4.0.2)
#>  tibble        3.1.7   2022-05-03 [2] CRAN (R 4.0.2)
#>  tidyr         1.2.0   2022-02-01 [2] CRAN (R 4.0.2)
#>  tidyselect    1.1.2   2022-02-21 [2] CRAN (R 4.0.2)
#>  utf8          1.2.2   2021-07-24 [2] CRAN (R 4.0.2)
#>  vctrs         0.4.1   2022-04-13 [2] CRAN (R 4.0.2)
#>  withr         2.5.0   2022-03-03 [2] CRAN (R 4.0.2)
#>  xfun          0.31    2022-05-10 [2] CRAN (R 4.0.2)
#>  yaml          2.3.5   2022-02-21 [2] CRAN (R 4.0.2)
#> 
#> ──────────────────────────────────────────────────────────────────────────────

@bhuvanesh1707
Copy link
Author

A drawback with breaking mutate statement, adding compute and assigning it to a new variable causes intermediate tables to get created on Spark. This is causing performance degradation for us by 50% with sparklyr 1.7.7 in comparison to sparklyr 1.7.5.

@edgararuiz
Copy link
Collaborator

Ok, I just tried an idea. I captured the resulting SQL using simulate_hive(), and passed that to the Spark connection. That worked. This suggests that the issue is in how Spark itself is preparing the query before executing it.

#Load libraries
library(DBI) 
library(sparklyr) 
library(dplyr) 
library(dbplyr) 
#Dummy data frame with 1 row and 62 columns
dummy_data <- data.frame( A = "XX"
                          , B = 1L           
                          , U = 1L           
                          , D = T
                          , A1 = as.Date("1900-01-01")
                          , A2 = 1L
                          , A3 = 1L
                          , A4 = 1L           
                          , A5 = 1L           
                          , A6 = T
                          , A7 = as.Date("1900-01-01")
                          , A8 = T
                          , A9 = T
                          , A10 = "XX"
                          , A11 = "XX"
                          , A12 = T
                          , A13 = T
                          , A14 = as.Date("1900-01-01")
                          , A15 = T
                          , A16 = T
                          , A17 = 1L
                          , A18 = 1L
                          , A19 = T
                          , A20 = as.Date("1900-01-01")
                          , A21 = as.Date("1900-01-01")
                          , A22 = 1L
                          , A23 = 1L
                          , A24 = 1L
                          , A25 = 1L
                          , A26 = 1L
                          , A27 = 1L
                          , A28 = 1L
                          , A29 = T
                          , A30 = T
                          , A31 = T
                          , A32 = T
                          , A33 = T
                          , A34 = T
                          , A35 = "XX"
                          , A36 = "XX"
                          , A37 = "XX"
                          , A38 = "XX"
                          , A39 = "XX"
                          , A40 = "XX"
                          , A41 = T
                          , A42 = as.Date("1900-01-01")
                          , A43 = T
                          , A44 = 1L
                          , A45 = T
                          , A46 = 1L
                          , A47 = as.Date("1900-01-01")
                          , A48 = "XX"
                          , A49 = as.Date("1900-01-01")
                          , A50 = as.Date("1900-01-01")
                          , A51 = as.Date("1900-01-01")
                          , A52 = as.Date("1900-01-01")
                          , A53 = as.Date("1900-01-01")
                          , A54 = "XX"
                          , A55 = as.Date("1900-01-01")
                          , A56 = 1L
                          , A57 = 1L
                          , A58 = T
)


#Execute code snippet
test_snippet <- function(x) { 
  x %>% 
  mutate(excl = case_when(
    A %in% c("HHH","GGG") ~ (A17 == 1 & !A13)
    , A == "HHH" ~ T
    , T ~ (!A13)
  )) %>% 
  filter(excl) %>% 
  mutate(
    COL1 = case_when(
      A == "III" ~ B %in%  c(3,72,70)
      , T ~ (B == 3)
    )
    , COL2 = (B == 34)
    , COL3 = (B ==  71)
    , COL4 = (B == 79)
    , COL5 = (B == 83)
    , COL6 = case_when(A == "PPP" ~ B %in% c(5,6), T ~ B == 6)
    , COL7 = case_when(
      A == "III" ~ (B == 5 | B == 6)
      , T ~ COL6
    )
    , COL8 = case_when(
      A == "III" ~ B %in%  c(4,69,74)
      , T ~ (B == 4)
    )
    , COL9 = (B == 4 & A31)
    , COL10 = (B == 25)
    , COL11 = (B ==5)
    , COL12 = case_when(
      A == "GGG" ~ F
      , T ~ T
    )
    , COL13 = A15
    , COL14 = B %in% c(34,35,21,9,37,38,39)
    , COL15 = case_when(
      A == "III" ~ (B != 21)
      , A == "CCC" ~ A30
      , A == "ZZZ" ~ !COL14
      , A == "PPP" ~ (B != 17 & B != 21)
      , T ~ (U < 12)
    )
    , COL16 = case_when(
      A %in% c("III","ZZZ") ~ (B == 20)
      , A == "CCC" ~ (U == 15)
      , T ~ (U == 11)
    )
    , COL17 = (B == 19)
    , COL18 = (B == 29)
    , COL19 = case_when(A == "PPP" ~
                          case_when(A18 > 60 & !is.na(A18) ~ TRUE
                                    , T ~ FALSE)
                        , T ~ (B ==90))
    , COL20 = case_when(A == "PPP" ~
                          case_when((D | COL18 | COL6 | COL1) ~
                                      case_when((A35 == "M" & (A18) >= 37 &  (A18) <= 60) |
                                                  (A35 == "A" & (A18*12) >= 37 &  (A18*12) <= 60) |
                                                  (A35 == "B" & (A18/2) >= 37 &  (A18/2) <= 60) |
                                                  (A35 == "Q" & (A18*3) >= 37 &  (A18*3) <= 60) |
                                                  (A35 == "S" & (A18*6) >= 37 &  (A18*6) <= 60) |
                                                  (A35 == "W" & (A18/4) >= 37 &  (A18/4) <= 60) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==91 ))
    , COL21 = case_when(A == "PPP" ~
                          case_when(D | COL18 | COL6 | COL1 ~
                                      case_when((A35 == "M" & A18 == 1) |
                                                  (A35 == "W" & (A18/4) == 1) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==33 ))
    , COL22 = case_when(A == "PPP" ~
                          case_when(D | COL18 | COL6 | COL1 ~
                                      case_when((A35 == "M" & (A18) >= 2 &  (A18) <= 36) |
                                                  (A35 == "A" & (A18*12) >= 2 &  (A18*12) <= 36) |
                                                  (A35 == "B" & (A18/2) >= 2 &  (A18/2) <= 36) |
                                                  (A35 == "Q" & (A18*3) >= 2 &  (A18*3) <= 36) |
                                                  (A35 == "S" & (A18*6) >= 2 &  (A18*6) <= 36) |
                                                  (A35 == "W" & (A18/4) >= 2 &  (A18/4) <= 36) ~ TRUE
                                                , T ~ FALSE)
                                    , T ~ FALSE)
                        , T ~ ( B==92 ))      
    , COL23 = (COL21 & U == 1) 
    , COL24 = (COL22 & U == 1) 
    , COL25 = (A15 & U == 1)
    , COL26 = (D & U == 2)
    , COL27 = (A15 & U == 2)
    , COL28 = (U == 2 & COL19)
    , COL29 = (U == 2 & COL20)
    , COL30= (U == 2 & COL21)
    , COL31 = (U == 2 & COL22) 
    , COL32 = B ==35
    , COL33 = F
    , COL34 = case_when(
      A == "GGG" ~ (B == 28)
      , A == "ZZZ" ~ F
      , T ~ (B == 59)
    )
    , COL35 = case_when(
      A == "GGG" ~ as.integer(A37) %in% c(1, 2, 6, 8, 9) & as.integer(A38) %in% c(29, 62)
      , A == "ZZZ" ~ F
      ,T ~ (B == 13)
    )
    , COL36 = case_when(
      A == "ZZZ" ~ F
      , T ~ (B == 8)
    )
    , COL37 = (U == 3)
    , COL38 = case_when(
      A == "ZZZ" ~ F
      , T ~ (D & U != 1)
    )
  ) 
}


# Confirm it works in database
library(RSQLite)
db_con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
db_dummy_data <- copy_to(db_con, dummy_data)
db_dummy_data %>% 
  test_snippet()
#> # Source:   SQL [0 x 101]
#> # Database: sqlite 3.38.5 [:memory:]
#> # … with 101 variables: A <chr>, B <int>, U <int>, D <int>, A1 <dbl>, A2 <int>,
#> #   A3 <int>, A4 <int>, A5 <int>, A6 <int>, A7 <dbl>, A8 <int>, A9 <int>,
#> #   A10 <chr>, A11 <chr>, A12 <int>, A13 <int>, A14 <dbl>, A15 <int>,
#> #   A16 <int>, A17 <int>, A18 <int>, A19 <int>, A20 <dbl>, A21 <dbl>,
#> #   A22 <int>, A23 <int>, A24 <int>, A25 <int>, A26 <int>, A27 <int>,
#> #   A28 <int>, A29 <int>, A30 <int>, A31 <int>, A32 <int>, A33 <int>,
#> #   A34 <int>, A35 <chr>, A36 <chr>, A37 <chr>, A38 <chr>, A39 <chr>, …


# Confirm it works in Spark if passed the resulting query

cn <- spark_connect("local")

df <- copy_to(cn, dummy_data, "df")

mf <- tbl_lazy(dummy_data, simulate_hive())

hive_sql <- mf %>% 
  test_snippet()  %>% 
  remote_query()

dbGetQuery(cn, hive_sql)
#>      A  B  U  D   A1 A2 A3 A4 A5 A6   A7 A8 A9  A10  A11 A12 A13  A14 A15 A16
#> 1 <NA> NA NA NA <NA> NA NA NA NA NA <NA> NA NA <NA> <NA>  NA  NA <NA>  NA  NA
#>   A17 A18 A19  A20  A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34
#> 1  NA  NA  NA <NA> <NA>  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
#>    A35  A36  A37  A38  A39  A40 A41  A42 A43 A44 A45 A46  A47  A48  A49  A50
#> 1 <NA> <NA> <NA> <NA> <NA> <NA>  NA <NA>  NA  NA  NA  NA <NA> <NA> <NA> <NA>
#>    A51  A52  A53  A54  A55 A56 A57 A58 excl COL1 COL2 COL3 COL4 COL5 COL6 COL7
#> 1 <NA> <NA> <NA> <NA> <NA>  NA  NA  NA   NA   NA   NA   NA   NA   NA   NA   NA
#>   COL8 COL9 COL10 COL11 COL12 COL13 COL14 COL15 COL16 COL17 COL18 COL19 COL20
#> 1   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>   COL21 COL22 COL23 COL24 COL25 COL26 COL27 COL28 COL29 COL30 COL31 COL32 COL33
#> 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>   COL34 COL35 COL36 COL37 COL38
#> 1    NA    NA    NA    NA    NA

spark_disconnect_all()
#> [1] 1

Created on 2022-09-12 by the reprex package (v2.0.1)

@edgararuiz
Copy link
Collaborator

Hi @bhuvanesh1707, I think I have a fix for this issue. Can you try installing this dev version? remotes::install_github("sparklyr/sparklyr", ref = "op_vars")

@bhuvanesh1707
Copy link
Author

Hi @edgararuiz, after installing this dev version of sparklyr and executed my code I did not face any errors. I am satisfied with the resolution provided. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants