False Negative Error: org.apache.spark.sql.AnalysisException: cannot resolve 'A' given input columns #3286
Below is the error I am getting; column A is present in my data frame: `Error: org.apache.spark.sql.AnalysisException: cannot resolve 'A'`. Below is a list of the libraries I used and their versions.
Below is the session info I am working with while executing the reprex (`sessioninfo::session_info()`).
When I run your example I actually get a different error. Adding this to my Spark config fixed the issue — perhaps trying it will fix your error as well?
I have changed the max iterations as below (`conf <- spark_config()`) and confirmed in the Spark environment that the iteration limit was set to 1000.
But when I execute, Max Iteration shows up as a warning, not as an error, in my Spark log. Below is the Spark log. The Spark version is 2.4.4. Spark log:
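For reference, a minimal sketch of how raising the analyzer iteration limit through `spark_config()` might look — the property name `spark.sql.analyzer.maxIterations` is an assumption here, since the exact config snippet was not preserved in this thread; check the docs for your Spark version:

```r
library(sparklyr)

# Sketch: set the analyzer iteration limit before connecting.
# The property name below is an assumption, not confirmed by this thread.
conf <- spark_config()
conf$spark.sql.analyzer.maxIterations <- 1000

cn <- spark_connect(master = "local", config = conf)
```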
Hi, have you considered splitting the snippet into two or more steps?
#Load libraries
library(DBI) #1.1.2
library(sparklyr) #1.7.7
#>
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:stats':
#>
#> filter
library(dplyr) #1.0.9
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(dbplyr) #2.2.0
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
#Dummy data frame with 1 row and 62 columns
dummy_data <- data.frame( A = "XX"
, B = 1L
, U = 1L
, D = T
, A1 = as.Date("1900-01-01")
, A2 = 1L
, A3 = 1L
, A4 = 1L
, A5 = 1L
, A6 = T
, A7 = as.Date("1900-01-01")
, A8 = T
, A9 = T
, A10 = "XX"
, A11 = "XX"
, A12 = T
, A13 = T
, A14 = as.Date("1900-01-01")
, A15 = T
, A16 = T
, A17 = 1L
, A18 = 1L
, A19 = T
, A20 = as.Date("1900-01-01")
, A21 = as.Date("1900-01-01")
, A22 = 1L
, A23 = 1L
, A24 = 1L
, A25 = 1L
, A26 = 1L
, A27 = 1L
, A28 = 1L
, A29 = T
, A30 = T
, A31 = T
, A32 = T
, A33 = T
, A34 = T
, A35 = "XX"
, A36 = "XX"
, A37 = "XX"
, A38 = "XX"
, A39 = "XX"
, A40 = "XX"
, A41 = T
, A42 = as.Date("1900-01-01")
, A43 = T
, A44 = 1L
, A45 = T
, A46 = 1L
, A47 = as.Date("1900-01-01")
, A48 = "XX"
, A49 = as.Date("1900-01-01")
, A50 = as.Date("1900-01-01")
, A51 = as.Date("1900-01-01")
, A52 = as.Date("1900-01-01")
, A53 = as.Date("1900-01-01")
, A54 = "XX"
, A55 = as.Date("1900-01-01")
, A56 = 1L
, A57 = 1L
, A58 = T
)
#Create spark connection
spark_disconnect_all()
#> [1] 0
## Changed it to local for the reprex
cn <- spark_connect("local")
#Copying dummy_data data frame to spark data frame
spark_dummy_data <- copy_to(cn, dummy_data, overwrite = T) %>% compute()
#Execute code snippet
A1 <- spark_dummy_data %>%
mutate(excl = case_when(
A %in% c("HHH","GGG") ~ (A17 == 1 & !A13)
, A == "HHH" ~ T
, T ~ (!A13)
)) %>%
filter(excl) %>%
mutate(
COL1 = case_when(
A == "III" ~ B %in% c(3,72,70)
, T ~ (B == 3)
)
, COL2 = (B == 34)
, COL3 = (B == 71)
, COL4 = (B == 79)
, COL5 = (B == 83)
, COL6 = case_when(A == "PPP" ~ B %in% c(5,6), T ~ B == 6)
, COL7 = case_when(
A == "III" ~ (B == 5 | B == 6)
, T ~ COL6
)
, COL8 = case_when(
A == "III" ~ B %in% c(4,69,74)
, T ~ (B == 4)
)
, COL9 = (B == 4 & A31)
, COL10 = (B == 25)
, COL11 = (B ==5)
, COL12 = case_when(
A == "GGG" ~ F
, T ~ T
)
, COL13 = A15
, COL14 = B %in% c(34,35,21,9,37,38,39)
, COL15 = case_when(
A == "III" ~ (B != 21)
, A == "CCC" ~ A30
, A == "ZZZ" ~ !COL14
, A == "PPP" ~ (B != 17 & B != 21)
, T ~ (U < 12)
)
, COL16 = case_when(
A %in% c("III","ZZZ") ~ (B == 20)
, A == "CCC" ~ (U == 15)
, T ~ (U == 11)
)
, COL17 = (B == 19)
, COL18 = (B == 29)
, COL19 = case_when(A == "PPP" ~
case_when(A18 > 60 & !is.na(A18) ~ TRUE
, T ~ FALSE)
, T ~ (B ==90))
, COL20 = case_when(A == "PPP" ~
case_when((D | COL18 | COL6 | COL1) ~
case_when((A35 == "M" & (A18) >= 37 & (A18) <= 60) |
(A35 == "A" & (A18*12) >= 37 & (A18*12) <= 60) |
(A35 == "B" & (A18/2) >= 37 & (A18/2) <= 60) |
(A35 == "Q" & (A18*3) >= 37 & (A18*3) <= 60) |
(A35 == "S" & (A18*6) >= 37 & (A18*6) <= 60) |
(A35 == "W" & (A18/4) >= 37 & (A18/4) <= 60) ~ TRUE
, T ~ FALSE)
, T ~ FALSE)
, T ~ ( B==91 ))
, COL21 = case_when(A == "PPP" ~
case_when(D | COL18 | COL6 | COL1 ~
case_when((A35 == "M" & A18 == 1) |
(A35 == "W" & (A18/4) == 1) ~ TRUE
, T ~ FALSE)
, T ~ FALSE)
, T ~ ( B==33 ))
, COL22 = case_when(A == "PPP" ~
case_when(D | COL18 | COL6 | COL1 ~
case_when((A35 == "M" & (A18) >= 2 & (A18) <= 36) |
(A35 == "A" & (A18*12) >= 2 & (A18*12) <= 36) |
(A35 == "B" & (A18/2) >= 2 & (A18/2) <= 36) |
(A35 == "Q" & (A18*3) >= 2 & (A18*3) <= 36) |
(A35 == "S" & (A18*6) >= 2 & (A18*6) <= 36) |
(A35 == "W" & (A18/4) >= 2 & (A18/4) <= 36) ~ TRUE
, T ~ FALSE)
, T ~ FALSE)
, T ~ ( B==92 ))
, COL23 = (COL21 & U == 1)
, COL24 = (COL22 & U == 1)
, COL25 = (A15 & U == 1)
, COL26 = (D & U == 2)
, COL27 = (A15 & U == 2)
, COL28 = (U == 2 & COL19)
, COL29 = (U == 2 & COL20)
, COL30= (U == 2 & COL21)
, COL31 = (U == 2 & COL22)
, COL32 = B ==35
, COL33 = F
, COL34 = case_when(
A == "GGG" ~ (B == 28)
, A == "ZZZ" ~ F
, T ~ (B == 59)
)
, COL35 = case_when(
A == "GGG" ~ as.integer(A37) %in% c(1, 2, 6, 8, 9) & as.integer(A38) %in% c(29, 62)
, A == "ZZZ" ~ F
,T ~ (B == 13)
)) %>%
compute()
A <- A1 %>%
mutate(
COL36 = case_when(
A == "ZZZ" ~ F
, T ~ (B == 8)
)
, COL37 = (U == 3)
, COL38 = case_when(
A == "ZZZ" ~ F
, T ~ (D & U != 1)
)
) %>%
compute()
A
#> # Source: spark<dbplyr_003> [?? x 101]
#> A B U D A1 A2 A3 A4 A5 A6 A7
#> <chr> <int> <int> <lgl> <date> <int> <int> <int> <int> <lgl> <date>
#> 1 <NA> NA NA NA NA NA NA NA NA NA NA
#> # … with 90 more variables: A8 <lgl>, A9 <lgl>, A10 <chr>, A11 <chr>,
#> # A12 <lgl>, A13 <lgl>, A14 <date>, A15 <lgl>, A16 <lgl>, A17 <int>,
#> # A18 <int>, A19 <lgl>, A20 <date>, A21 <date>, A22 <int>, A23 <int>,
#> # A24 <int>, A25 <int>, A26 <int>, A27 <int>, A28 <int>, A29 <lgl>,
#> # A30 <lgl>, A31 <lgl>, A32 <lgl>, A33 <lgl>, A34 <lgl>, A35 <chr>,
#> # A36 <chr>, A37 <chr>, A38 <chr>, A39 <chr>, A40 <chr>, A41 <lgl>,
#> # A42 <date>, A43 <lgl>, A44 <int>, A45 <lgl>, A46 <int>, A47 <date>, …
spark_disconnect()
#> Error in UseMethod("spark_disconnect"): no applicable method for 'spark_disconnect' applied to an object of class "NULL"
Created on 2022-08-29 by the reprex package (v2.0.1)
The above solution — breaking the code into multiple mutate() statements and assigning each to a new variable — has resolved the issue. But "cannot resolve column 'A'" is not the only error where we have to break up the dplyr code. There are others, such as "Error: org.apache.spark.sql.AnalysisException: invalid cast from date to timestamp;;" even though I have no timestamp column, and a decimal-to-BigDecimal cast error even though all my columns are integers. I can overcome these errors with the same approach, by breaking the mutate() statement up further.
#Dummy data frame with 1 row and 62 columns
dummy_data <- data.frame( A = "XX"
, B = 1L
, U = 1L
, D = T
, A1 = as.Date("1900-01-01")
, A2 = 1L
, A3 = 1L
, A4 = 1L
, A5 = 1L
, A6 = T
, A7 = as.Date("1900-01-01")
, A8 = T
, A9 = T
, A10 = "XX"
, A11 = "XX"
, A12 = T
, A13 = T
, A14 = as.Date("1900-01-01")
, A15 = T
, A16 = T
, A17 = 1L
, A18 = 1L
, A19 = T
, A20 = as.Date("1900-01-01")
, A21 = as.Date("1900-01-01")
, A22 = 1L
, A23 = 1L
, A24 = 1L
, A25 = 1L
, A26 = 1L
, A27 = 1L
, A28 = 1L
, A29 = T
, A30 = T
, A31 = T
, A32 = T
, A33 = T
, A34 = T
, A35 = "XX"
, A36 = "XX"
, A37 = "XX"
, A38 = "XX"
, A39 = "XX"
, A40 = "XX"
, A41 = T
, A42 = as.Date("1900-01-01")
, A43 = T
, A44 = 1L
, A45 = T
, A46 = 1L
, A47 = as.Date("1900-01-01")
, A48 = "XX"
, A49 = as.Date("1900-01-01")
, A50 = as.Date("1900-01-01")
, A51 = as.Date("1900-01-01")
, A52 = as.Date("1900-01-01")
, A53 = as.Date("1900-01-01")
, A54 = "XX"
, A55 = as.Date("1900-01-01")
, A56 = 1L
, A57 = 1L
, A58 = T
)
#Load libraries
library(DBI)
library(sparklyr)
#>
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:stats':
#>
#> filter
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(dbplyr)
#>
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#>
#> ident, sql
#Create spark connection
spark_disconnect_all()
#> [1] 0
cn <- spark_connect(master = "yarn-client", app_name = "newJob")
#Copying dummy_data data frame to spark data frame
spark_dummy_data <- copy_to(cn, dummy_data, overwrite = T) %>% compute()
#Execute code snippet
df1 <- spark_dummy_data %>%
mutate(excl = case_when(
A %in% c("HHH","GGG") ~ (A17 == 1 & !A13)
, A == "HHH" ~ T
, T ~ (A13)
)) %>%
filter(excl) %>%
mutate(
COL1 = case_when(
A == "III" ~ B %in% c(3,72,70)
, T ~ (B == 3)
)
, COL2 = (B == 34)
, COL3 = (B == 71)
, COL4 = (B == 79)
, COL5 = (B == 83)
, COL6 = case_when(A == "PPP" ~ B %in% c(5,6), T ~ B == 6)
, COL7 = case_when(
A == "III" ~ (B == 5 | B == 6)
, T ~ COL6
)
, COL8 = case_when(
A == "III" ~ B %in% c(4,69,74)
, T ~ (B == 4)
)
, COL9 = (B == 4 & A31)
, COL10 = (B == 25)
, COL11 = (B ==5)
, COL12 = case_when(
A == "GGG" ~ F
, T ~ T
)
, COL13 = A15
, COL14 = B %in% c(34,35,21,9,37,38,39)
, COL15 = case_when(
A == "III" ~ (B != 21)
, A == "CCC" ~ A30
, A == "ZZZ" ~ !COL14
, A == "PPP" ~ (B != 17 & B != 21)
, T ~ (U < 12)
)
, COL16 = case_when(
A %in% c("III","ZZZ") ~ (B == 20)
, A == "CCC" ~ (U == 15)
, T ~ (U == 11)
)
, COL17 = (B == 19)
, COL18 = (B == 29)
, COL19 = case_when(A == "PPP" ~
case_when(A18 > 60 & !is.na(A18) ~ TRUE
, T ~ FALSE)
, T ~ (B ==90))
, COL20 = case_when(A == "PPP" ~
case_when((D | COL18 | COL6 | COL1) ~
case_when((A35 == "M" & (A18) >= 37 & (A18) <= 60) |
(A35 == "A" & (A18*12) >= 37 & (A18*12) <= 60) |
(A35 == "B" & (A18/2) >= 37 & (A18/2) <= 60) |
(A35 == "Q" & (A18*3) >= 37 & (A18*3) <= 60) |
(A35 == "S" & (A18*6) >= 37 & (A18*6) <= 60) |
(A35 == "W" & (A18/4) >= 37 & (A18/4) <= 60) ~ TRUE
, T ~ FALSE)
, T ~ FALSE)
, T ~ ( B==91 ))
, COL21 = case_when(A == "PPP" ~
case_when(D | COL18 | COL6 | COL1 ~
case_when((A35 == "M" & A18 == 1) |
(A35 == "W" & (A18/4) == 1) ~ TRUE
, T ~ FALSE)
, T ~ FALSE)
, T ~ ( B==33 ))
, COL22 = case_when(A == "PPP" ~
case_when(D | COL18 | COL6 | COL1 ~
case_when((A35 == "M" & (A18) >= 2 & (A18) <= 36) |
(A35 == "A" & (A18*12) >= 2 & (A18*12) <= 36) |
(A35 == "B" & (A18/2) >= 2 & (A18/2) <= 36) |
(A35 == "Q" & (A18*3) >= 2 & (A18*3) <= 36) |
(A35 == "S" & (A18*6) >= 2 & (A18*6) <= 36) |
(A35 == "W" & (A18/4) >= 2 & (A18/4) <= 36) ~ TRUE
, T ~ FALSE)
, T ~ FALSE)
, T ~ ( B==92 ))
, COL23 = (COL21 & U == 1)
, COL24 = (COL22 & U == 1)
, COL25 = (A15 & U == 1)
, COL26 = (D & U == 2)
, COL27 = (A15 & U == 2)
, COL28 = (U == 2 & COL19)
, COL29 = (U == 2 & COL20)
, COL30= (U == 2 & COL21)
, COL31 = (U == 2 & COL22)
, COL32 = B ==35
, COL33 = F
, COL34 = case_when(
A == "GGG" ~ (B == 28)
, A == "ZZZ" ~ F
, T ~ (B == 59)
)
, COL35 = case_when(
A == "GGG" ~ as.integer(A37) %in% c(1, 2, 6, 8, 9) & as.integer(A38) %in% c(29, 62)
, A == "ZZZ" ~ F
,T ~ (B == 13)
)
) %>% compute()
df2 <- df1 %>%
mutate(
COL36 = case_when(
A == "ZZZ" ~ F
, T ~ (B == 8)
)
, COL37 = (U == 3)
, COL38 = case_when(
A == "ZZZ" ~ F
, T ~ (D & U != 1)
)
, COL39 = case_when(
A == "XX" ~ B %in% c(90,91,92)
, T ~ (!A6)
)
, COL40 = case_when(
A == "XX" ~ F
, T ~ (!A6 & U == 1)
)
, COL41 = case_when(
A == "XX" ~ F
, T ~ (!A6 & U == 2)
)
, COL42 = (B == 10)
, COL43 = (A10 == "COM")
, COL44 = case_when(
A == "XX" ~ case_when(A6 == F ~ T, T ~ F)
, A == "XX" ~ B %in% c(90,91,92)
, T ~ !A6 & A9
)
, COL45 = !A9
, COL46 = case_when(!is.na(A1) ~ as.integer(months_between(last_day(A7), last_day(A1))), T ~ as.integer(999))
, COL46 = case_when(
A == "XX" & COL46 != 999 ~ as.integer(COL46)
, T ~ as.integer(COL46)
)
, COL47 = case_when(
A == "XX" ~ A10 %in% c('XX') | A11 %in% c('YY')
, T ~ F
)
, COL48 = case_when(
A == "XX" ~ (
A10 %in% c('XX')
| A11 %in% c('YY')
| A29
| A30
)
, (A == "ZZ" & (A10 == "DDD")) ~ TRUE
, T ~ F
)
, COL49 = case_when(
A == "XX" ~ COL48
, (A == "ZZ" & (A10 == "DDD")) ~ TRUE
, T ~ F
)
, COL50 = case_when(
A == "XX" ~ (
A10 %in% c('X')
| A11 %in% c('ZZ')
| A29
| A30
)
, (A == "ZZ" & (A10 == "DDD" | A8)) ~ TRUE
, T ~ F
)
, COL51 = case_when(
A == "XX" ~ COL48
, (A == "ZZ" & (A10 == "DDD" | A8)) ~ TRUE
, T ~ F
)
, COL52 = case_when(
A == "XX" ~ (A10 == "TR" | A11 == "TR")
, T ~ F
)
, COL53 = 999
, COL54 = case_when(
A == "XX" ~ as.integer(pmax(0, 84 - COL46))
, A == "XX" ~ as.integer(pmax(0, 36 - COL46))
, A == "XX" ~ as.integer(82)
, A == "XXY" ~ as.integer(48)
, A %in% c("ZZ","XX") ~ as.integer(24)
, A %in% c("YY","ZZ") ~ as.integer(84)
, T ~ as.integer(nchar(A10))
)
, COL55 = case_when(
A == "XX" ~ -7
, T ~ A4
)
, COL56 = A5
, A5 = case_when(
A %in% c("XX","XX") ~ case_when(
A5 >= 0 ~ as.double(as.integer(A5))
, A5 < 0 ~ as.double(NA)
)
, A == "XX" ~ case_when(substr(A10,1,1) != '2' & substr(A11,1,1) != 'D' & A5 != -1 ~ as.double(A5), T ~as.double(NA))
, T ~ as.double(A5)
)
, COL57= case_when(
A == "XX" ~ case_when(
A24 >= 0 ~ A24
, A24 < 0 ~ as.double(NA)
)
, T ~ as.double(A24)
)
, COL58 = case_when(A1 != '1900-01-01' ~ A1, T ~ A7)
, COL59 = case_when(!is.na(A1) ~ as.integer(months_between(last_day(A7), last_day(A1))))
, COL60 = case_when(
A == "XX" & A5 != 999 ~ as.integer(A5)
, T ~ as.integer(A5)
)
, COL61 = case_when(
A %in% c("XX","XX") ~ as.numeric(case_when(
A17 >= 0 ~ as.integer(A17)
, A18 > 0 ~ as.integer(A18)
, T ~ as.integer(NA)
))
, A == "XX" ~ as.numeric(A17)
, A == "ZZ" ~ case_when(A10 == "YY" & A17 >= 0 ~ A17)
, A == "XX" ~ case_when(
as.integer(A17 > 0) > 0 ~ as.numeric(A17)
, as.integer((A17 <= 0 | is.na(A17))) > 0 ~ as.numeric(NA)
)
, A == "YY" ~ case_when(
as.integer(!A32 & A17 > 0 & !is.na(A17)) > 0 ~ as.numeric(A17)
, as.integer(!A32 & (A17 <= 0 | is.na(A17)) & A34 & A18 >= 0) > 0 ~ as.numeric(A18)
, T ~ as.numeric(NA)
)
, A == "ZZ" ~ case_when(
A17 > 1 ~ as.numeric(A17)
, A18 > 0 ~ as.numeric(A18)
, TRUE ~ as.numeric(NA)
)
, T ~ case_when(
as.integer(!A32 & A17 > 0 & !is.na(A17)) > 0 ~ as.numeric(A17)
, as.integer(!A32 & (A17 <= 0 | is.na(A17)) & A34 & A18 >= 0) > 0 ~ as.numeric(A18)
, T ~ as.numeric(A5)
)
)
, COL62 = case_when(
A == "XX" ~ case_when(
A17 >=0 & (A9 | A15) & A10 == "M" ~ A17 * A2
, A34 & A5 >= 0 & !(is.na(A5)) ~ as.numeric(A5 * 0.05)
)
, A == "ZZ" ~ case_when(
(A34 | A33) & A5 >= 0 ~ as.double(A5 * 0.03)
, (A9 | A33) & A18 >= 0 & A10 == "YY" ~ as.double(A18)
)
, A == "XX" ~ A17
, is.na(A17) ~ as.numeric(NA)
, A10 == "B" ~ A17 * 2
, A10 == "D" ~ A17
, A10 == "E" ~ A17 * 2
, A10 == "I" ~ A17*30
, A10 == "L" ~ A17 / 2
, A10 == "M" ~ A17
, A10 == "P" ~ A17
, A10 == "Q" ~ A17 / 3
, A10 == "S" ~ A17 / 6
, A10 == "T" ~ A17 / 4
, A10 == "W" ~ A17 * 4.3
, A10 == "Y" ~ A17 / 12
, A10 == "O" ~ A17
, T ~ A17
)
, COL63 = case_when(
is.na(A4) ~ as.numeric(NA)
, A10 == "B" ~ A4*2
, A10 == "D" ~ A4
, A10 == "E" ~ A4*2
, A10 == "I" ~ A4*30
, A10 == "L" ~ A4/2
, A10 == "M" ~ A4
, A10 == "P" ~ A4
, A10 == "Q" ~ A4/3
, A10 == "S" ~ A4/6
, A10 == "T" ~ A4/4
, A10 == "W" ~ A4*4.3
, A10 == "Y" ~ A4/12
, A10 == "O" ~ A4
, T ~ A4
)
) %>% compute()
df3 <- df2 %>%
mutate(
COL64 = "XX"
, COL65 = 1L
, COL66 = "YY"
, COL67 = 2L
, COL68 = "ZZ"
, COL69 = 3L
, COL70 = "XY"
, COL71 = 4L
, COL72 = case_when(
A %in% c("XXX","XXX","XXY") ~ (A2 == 1)
, T ~ (A57 %in% c(9) & B == 3)
)
, COL73 = case_when(
A == "IIO" ~ F
, T ~ (B == 20 & A8)
)
, COL74 = case_when(
A == "IIO" ~ F
, T ~ (B == 20 & A8)
)
, COL75 = case_when(
A %in% c("XYY","GGG", "XXY") ~ B == 4 & !is.na(A17) & A17 > A18
, A == "XXX" ~ !is.na(A17) & A17 > A18 & A58
, T ~ B == 4 & A5 == 1 & !is.na(A17) & A17 >= A18
)
, COL76 = (A8 & A5 == 1)
, COL77 = A8
, COL78 = A8
, COL79 = (A17 == 2)
, COL80 = case_when(
A == "XXX" ~ (A17 %in% c(1,2,6))
, T ~ (A17 %in% c(1,2,4))
)
, COL81 = case_when(
A == "XXX" ~ (A17 %in% c(1,6))
, T ~ (A17 == 1)
)
, COL82 = (A3 >= 2 & A3 <= 5 & !is.na(A3))
, COL83 = case_when(
A == "XXX" ~ (A5 == 1 & A12)
, T ~ (A5 == 1 & A3 >= 6 & A3 <= 60 & !is.na(A3))
)
, COL84 = case_when(
A == "XXX" ~ (A5 == 1 & A43)
, T ~ (A5 == 1 & A3 > 60 & !is.na(A3))
)
, COL85 = B == 36
, COL86 = (B == 21)
, COL87 = (B == 19)
, COL88 = case_when(
A == "IIO" ~ (B %in% c(98,9,20,8,61,4,36,65,74,81) & A8)
, A == "XXY" ~ (!A12 & A8 & B != 19)
, T ~ (!A12 & A8)
)
, COL89 = case_when(
A == "XXY" ~ case_when((A12 | A12| A13) & A8 == T ~ T, T ~ F)
, T ~ case_when(A12 == T & A8 == T ~ T, T ~ F)
)
, COL90 = (A5 != 1 & A8)
, COL91 = (B == 9)
, COL92 = case_when(
A == "IIO" ~ B %in% c(9, 20)
, A == "XYY" ~ F
, T ~ (A12 | (A8 & A13))
)
, COL93 = case_when(A == "XXY" ~ B %in% c(17,21),T ~ B == 17)
, COL94 = (B == 8)
, COL95 = (B == 27)
, COL96 = case_when(A == "XXY" ~ A10 =="A58", T ~ B != 6 & A17 > 0)
, COL97= (B != 6 & A18 > 0)
, COL98 = case_when(
A %in% c("XXX") ~ F
, T ~ (B == 5)
)
, COL99 = case_when(
A %in% c("IIO","XXX", "GGG","XXY") ~ F
, T ~ (B == 5)
)
, COL100 = (B == 43)
, COL101 = case_when(
A == "IIO" ~ A8 & COL46 < 36
, T ~ A8
)
, COL102 = A8
, COL103 = case_when(
A == "XXX" ~ (!A8 & !A9)
, A == "XYY" ~ (A8)
, A == "DDD" ~ (!A8 & !A9)
, T ~ A30 & (!A8 & !A9)
)
, COL104 = case_when(
A == "XXX" ~ (A16)
, A == "DDD" ~ T
, T ~ A8
) #changed from TRUE to A30 for XXX
, COL105 = case_when(A == "DDD" ~ (!A9 & !A8)
, T ~ (A30 & !A9 & !A8))
, COL106 = 999
, COL107 = case_when(
A == "GGG" ~ as.integer(999)
, T ~ case_when(!is.na(A1) ~ as.integer(months_between(last_day(A7), last_day(A1))), T ~ as.integer(999))
)
, COL108 = as.integer(999) # TODO
, COL109 = as.integer(999)
, COL110 = as.integer(999)
, COL111 = as.integer(999) # TODO
, COL116 = case_when(
A == "XXX" & A46 != 999 ~ as.integer(A46)
, T ~ as.integer(A46)
)
,COL117 = case_when(!is.na(A1) ~ as.integer(datediff(A7, A1)),T ~ as.integer(999))
, COL119 = case_when(!is.na(A20) ~ as.integer(months_between(last_day(A7), last_day(A20))), T ~ as.integer(999))
, COL120 = case_when(
A == "XXX" & A5 != 999 ~ as.integer(A5 )
, T ~ A5
)
, COL121 = case_when(
A == "XXX" ~ as.integer(A5)
, T ~ as.integer(COL46)
)
, COL122 = case_when(
A == "XYY" ~ as.integer(A30)
, T ~ as.integer(99)
)
, COL125 = F
, COL126 = case_when(A19 & A57 == 0 ~ T, T ~ F)
, COL127 = case_when(
A == "XXX" ~ case_when(!is.na(A20) & A20 <= A7 ~ as.integer(months_between(last_day(A7), last_day(A20))), T ~ as.integer(999))
, T ~ A46
)
, COL128 = case_when(A == "XXX" ~ F ,A19 & !A8 & A5 > 0 ~ T, T ~ F)
) %>%
compute("df3")
#> Error: org.apache.spark.sql.AnalysisException: invalid cast from date to timestamp;;
#> 'Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 159 more fields]
#> +- SubqueryAlias `q53`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 158 more fields]
#> +- SubqueryAlias `q52`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 157 more fields]
#> +- SubqueryAlias `q51`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 156 more fields]
#> +- SubqueryAlias `q50`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 155 more fields]
#> +- SubqueryAlias `q49`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 154 more fields]
#> +- SubqueryAlias `q48`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 153 more fields]
#> +- SubqueryAlias `q47`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 152 more fields]
#> +- SubqueryAlias `q46`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 151 more fields]
#> +- SubqueryAlias `q45`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 150 more fields]
#> +- SubqueryAlias `q44`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 149 more fields]
#> +- SubqueryAlias `q43`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 148 more fields]
#> +- SubqueryAlias `q42`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 147 more fields]
#> +- SubqueryAlias `q41`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 146 more fields]
#> +- SubqueryAlias `q40`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 145 more fields]
#> +- SubqueryAlias `q39`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 144 more fields]
#> +- SubqueryAlias `q38`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 143 more fields]
#> +- SubqueryAlias `q37`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 142 more fields]
#> +- SubqueryAlias `q36`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 141 more fields]
#> +- SubqueryAlias `q35`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 139 more fields]
#> +- SubqueryAlias `q34`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 138 more fields]
#> +- SubqueryAlias `q33`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A16#28, A17#29, A18#30, A19#31, A20#32, ... 137 more fields]
#> +- SubqueryAlias `q32`
#> +- Project [A#9, B#10, U#11, D#12, A1#13, A2#14, A3#15, A4#16, A5#12517, A6#18, A7#19, A8#20, A9#21, A10#22, A11#23, A12#24, A13#25, A14#26, A15#27, A Created on 2022-09-06 by the reprex package (v2.0.1) Session infosessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os Red Hat Enterprise Linux
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Chicago
#> date 2022-09-06
#> pandoc 2.17.1.1 @ /opt/revr/rstudio-server/bin/quarto/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> askpass 1.1 2019-01-13 [2] CRAN (R 4.0.2)
#> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.0.2)
#> base64enc 0.1-3 2015-07-28 [2] CRAN (R 4.0.2)
#> blob 1.2.3 2022-04-10 [2] CRAN (R 4.0.2)
#> cli 3.3.0 2022-04-25 [2] CRAN (R 4.0.2)
#> config 0.3.1 2020-12-17 [2] CRAN (R 4.0.2)
#> crayon 1.5.1 2022-03-26 [2] CRAN (R 4.0.2)
#> DBI * 1.1.3 2022-06-18 [2] CRAN (R 4.0.2)
#> dbplyr * 2.2.1 2022-06-27 [2] CRAN (R 4.0.2)
#> digest 0.6.29 2021-12-01 [2] CRAN (R 4.0.2)
#> dplyr * 1.0.9 2022-04-28 [2] CRAN (R 4.0.2)
#> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.0.2)
#> evaluate 0.15 2022-02-18 [2] CRAN (R 4.0.2)
#> fansi 1.0.3 2022-03-24 [2] CRAN (R 4.0.2)
#> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.0.2)
#> forge 0.2.0 2019-02-26 [2] CRAN (R 4.0.2)
#> fs 1.5.2 2021-12-08 [2] CRAN (R 4.0.2)
#> generics 0.1.3 2022-07-05 [2] CRAN (R 4.0.2)
#> glue 1.6.2 2022-02-24 [2] CRAN (R 4.0.2)
#> highr 0.9 2021-04-16 [2] CRAN (R 4.0.2)
#> htmltools 0.5.2 2021-08-25 [2] CRAN (R 4.0.2)
#> htmlwidgets 1.5.4 2021-09-08 [2] CRAN (R 4.0.2)
#> httr 1.4.3 2022-05-04 [2] CRAN (R 4.0.2)
#> jsonlite 1.8.0 2022-02-22 [2] CRAN (R 4.0.2)
#> knitr 1.39 2022-04-26 [2] CRAN (R 4.0.2)
#> lifecycle 1.0.1 2021-09-24 [2] CRAN (R 4.0.2)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.0.2)
#> openssl 2.0.2 2022-05-24 [2] CRAN (R 4.0.2)
#> pillar 1.7.0 2022-02-01 [2] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [2] CRAN (R 4.0.2)
#> R.cache 0.15.0 2021-04-30 [2] CRAN (R 4.0.2)
#> R.methodsS3 1.8.2 2022-06-13 [2] CRAN (R 4.0.2)
#> R.oo 1.25.0 2022-06-12 [2] CRAN (R 4.0.2)
#> R.utils 2.12.0 2022-06-28 [2] CRAN (R 4.0.2)
#> r2d3 0.2.6 2022-02-28 [2] CRAN (R 4.0.2)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.0.2)
#> reprex 2.0.1 2021-08-05 [2] CRAN (R 4.0.2)
#> rlang 1.0.4 2022-07-12 [2] CRAN (R 4.0.2)
#> rmarkdown 2.14 2022-04-25 [2] CRAN (R 4.0.2)
#> rprojroot 2.0.3 2022-04-02 [2] CRAN (R 4.0.2)
#> rstudioapi 0.13 2020-11-12 [2] CRAN (R 4.0.2)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.0.2)
#> sparklyr * 1.7.7 2022-06-07 [2] CRAN (R 4.0.2)
#> stringi 1.7.8 2022-07-11 [2] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [2] CRAN (R 4.0.2)
#> styler 1.7.0 2022-03-13 [2] CRAN (R 4.0.2)
#> tibble 3.1.7 2022-05-03 [2] CRAN (R 4.0.2)
#> tidyr 1.2.0 2022-02-01 [2] CRAN (R 4.0.2)
#> tidyselect 1.1.2 2022-02-21 [2] CRAN (R 4.0.2)
#> utf8 1.2.2 2021-07-24 [2] CRAN (R 4.0.2)
#> vctrs 0.4.1 2022-04-13 [2] CRAN (R 4.0.2)
#> withr 2.5.0 2022-03-03 [2] CRAN (R 4.0.2)
#> xfun 0.31 2022-05-10 [2] CRAN (R 4.0.2)
#> yaml 2.3.5 2022-02-21 [2] CRAN (R 4.0.2)
#>
#> ──────────────────────────────────────────────────────────────────────────────
A drawback of breaking up the mutate() statement, adding compute(), and assigning to a new variable is that intermediate tables get created on Spark. This causes a roughly 50% performance degradation for us with sparklyr 1.7.7 compared to sparklyr 1.7.5.
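One way to limit how many intermediate tables are materialized — assuming the analyzer can still handle each combined query — is to group the derived columns into as few mutate() stages as the planner tolerates and call compute() only between those stages, rather than after every mutate(). A sketch with illustrative column names:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sdf <- copy_to(sc, data.frame(A = "XX", B = 3L, U = 1L), overwrite = TRUE)

# Stage 1: as many derived columns as the analyzer accepts in one query.
stage1 <- sdf %>%
  mutate(COL1 = (B == 3), COL2 = (U == 1)) %>%
  compute()  # one materialization for the whole stage

# Stage 2: only columns that depend on stage 1 results.
result <- stage1 %>%
  mutate(COL3 = COL1 & COL2)
```

The trade-off is between how deeply nested a single generated query becomes (which is what trips the analyzer) and how many intermediate tables Spark has to write.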
Ok, I just tried an idea. I captured the SQL that dbplyr generates using remote_query() and ran it against Spark directly:
#Load libraries
library(DBI)
library(sparklyr)
library(dplyr)
library(dbplyr)
#Dummy data frame with 1 row and 62 columns
dummy_data <- data.frame( A = "XX"
, B = 1L
, U = 1L
, D = T
, A1 = as.Date("1900-01-01")
, A2 = 1L
, A3 = 1L
, A4 = 1L
, A5 = 1L
, A6 = T
, A7 = as.Date("1900-01-01")
, A8 = T
, A9 = T
, A10 = "XX"
, A11 = "XX"
, A12 = T
, A13 = T
, A14 = as.Date("1900-01-01")
, A15 = T
, A16 = T
, A17 = 1L
, A18 = 1L
, A19 = T
, A20 = as.Date("1900-01-01")
, A21 = as.Date("1900-01-01")
, A22 = 1L
, A23 = 1L
, A24 = 1L
, A25 = 1L
, A26 = 1L
, A27 = 1L
, A28 = 1L
, A29 = T
, A30 = T
, A31 = T
, A32 = T
, A33 = T
, A34 = T
, A35 = "XX"
, A36 = "XX"
, A37 = "XX"
, A38 = "XX"
, A39 = "XX"
, A40 = "XX"
, A41 = T
, A42 = as.Date("1900-01-01")
, A43 = T
, A44 = 1L
, A45 = T
, A46 = 1L
, A47 = as.Date("1900-01-01")
, A48 = "XX"
, A49 = as.Date("1900-01-01")
, A50 = as.Date("1900-01-01")
, A51 = as.Date("1900-01-01")
, A52 = as.Date("1900-01-01")
, A53 = as.Date("1900-01-01")
, A54 = "XX"
, A55 = as.Date("1900-01-01")
, A56 = 1L
, A57 = 1L
, A58 = T
)
#Execute code snippet
test_snippet <- function(x) {
x %>%
mutate(excl = case_when(
A %in% c("HHH","GGG") ~ (A17 == 1 & !A13)
, A == "HHH" ~ T
, T ~ (!A13)
)) %>%
filter(excl) %>%
mutate(
COL1 = case_when(
A == "III" ~ B %in% c(3,72,70)
, T ~ (B == 3)
)
, COL2 = (B == 34)
, COL3 = (B == 71)
, COL4 = (B == 79)
, COL5 = (B == 83)
, COL6 = case_when(A == "PPP" ~ B %in% c(5,6), T ~ B == 6)
, COL7 = case_when(
A == "III" ~ (B == 5 | B == 6)
, T ~ COL6
)
, COL8 = case_when(
A == "III" ~ B %in% c(4,69,74)
, T ~ (B == 4)
)
, COL9 = (B == 4 & A31)
, COL10 = (B == 25)
, COL11 = (B ==5)
, COL12 = case_when(
A == "GGG" ~ F
, T ~ T
)
, COL13 = A15
, COL14 = B %in% c(34,35,21,9,37,38,39)
, COL15 = case_when(
A == "III" ~ (B != 21)
, A == "CCC" ~ A30
, A == "ZZZ" ~ !COL14
, A == "PPP" ~ (B != 17 & B != 21)
, T ~ (U < 12)
)
, COL16 = case_when(
A %in% c("III","ZZZ") ~ (B == 20)
, A == "CCC" ~ (U == 15)
, T ~ (U == 11)
)
, COL17 = (B == 19)
, COL18 = (B == 29)
, COL19 = case_when(A == "PPP" ~
case_when(A18 > 60 & !is.na(A18) ~ TRUE
, T ~ FALSE)
, T ~ (B ==90))
, COL20 = case_when(A == "PPP" ~
case_when((D | COL18 | COL6 | COL1) ~
case_when((A35 == "M" & (A18) >= 37 & (A18) <= 60) |
(A35 == "A" & (A18*12) >= 37 & (A18*12) <= 60) |
(A35 == "B" & (A18/2) >= 37 & (A18/2) <= 60) |
(A35 == "Q" & (A18*3) >= 37 & (A18*3) <= 60) |
(A35 == "S" & (A18*6) >= 37 & (A18*6) <= 60) |
(A35 == "W" & (A18/4) >= 37 & (A18/4) <= 60) ~ TRUE
, T ~ FALSE)
, T ~ FALSE)
, T ~ ( B==91 ))
, COL21 = case_when(A == "PPP" ~
case_when(D | COL18 | COL6 | COL1 ~
case_when((A35 == "M" & A18 == 1) |
(A35 == "W" & (A18/4) == 1) ~ TRUE
, T ~ FALSE)
, T ~ FALSE)
, T ~ ( B==33 ))
, COL22 = case_when(A == "PPP" ~
case_when(D | COL18 | COL6 | COL1 ~
case_when((A35 == "M" & (A18) >= 2 & (A18) <= 36) |
(A35 == "A" & (A18*12) >= 2 & (A18*12) <= 36) |
(A35 == "B" & (A18/2) >= 2 & (A18/2) <= 36) |
(A35 == "Q" & (A18*3) >= 2 & (A18*3) <= 36) |
(A35 == "S" & (A18*6) >= 2 & (A18*6) <= 36) |
(A35 == "W" & (A18/4) >= 2 & (A18/4) <= 36) ~ TRUE
, T ~ FALSE)
, T ~ FALSE)
, T ~ ( B==92 ))
, COL23 = (COL21 & U == 1)
, COL24 = (COL22 & U == 1)
, COL25 = (A15 & U == 1)
, COL26 = (D & U == 2)
, COL27 = (A15 & U == 2)
, COL28 = (U == 2 & COL19)
, COL29 = (U == 2 & COL20)
, COL30= (U == 2 & COL21)
, COL31 = (U == 2 & COL22)
, COL32 = B ==35
, COL33 = F
, COL34 = case_when(
A == "GGG" ~ (B == 28)
, A == "ZZZ" ~ F
, T ~ (B == 59)
)
, COL35 = case_when(
A == "GGG" ~ as.integer(A37) %in% c(1, 2, 6, 8, 9) & as.integer(A38) %in% c(29, 62)
, A == "ZZZ" ~ F
,T ~ (B == 13)
)
, COL36 = case_when(
A == "ZZZ" ~ F
, T ~ (B == 8)
)
, COL37 = (U == 3)
, COL38 = case_when(
A == "ZZZ" ~ F
, T ~ (D & U != 1)
)
)
}
# Confirm it works in database
library(RSQLite)
db_con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
db_dummy_data <- copy_to(db_con, dummy_data)
db_dummy_data %>%
test_snippet()
#> # Source: SQL [0 x 101]
#> # Database: sqlite 3.38.5 [:memory:]
#> # … with 101 variables: A <chr>, B <int>, U <int>, D <int>, A1 <dbl>, A2 <int>,
#> # A3 <int>, A4 <int>, A5 <int>, A6 <int>, A7 <dbl>, A8 <int>, A9 <int>,
#> # A10 <chr>, A11 <chr>, A12 <int>, A13 <int>, A14 <dbl>, A15 <int>,
#> # A16 <int>, A17 <int>, A18 <int>, A19 <int>, A20 <dbl>, A21 <dbl>,
#> # A22 <int>, A23 <int>, A24 <int>, A25 <int>, A26 <int>, A27 <int>,
#> # A28 <int>, A29 <int>, A30 <int>, A31 <int>, A32 <int>, A33 <int>,
#> # A34 <int>, A35 <chr>, A36 <chr>, A37 <chr>, A38 <chr>, A39 <chr>, …
# Confirm it works in Spark if passed the resulting query
cn <- spark_connect("local")
df <- copy_to(cn, dummy_data, "df")
mf <- tbl_lazy(dummy_data, simulate_hive())
hive_sql <- mf %>%
test_snippet() %>%
remote_query()
dbGetQuery(cn, hive_sql)
#> A B U D A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
#> 1 <NA> NA NA NA <NA> NA NA NA NA NA <NA> NA NA <NA> <NA> NA NA <NA> NA NA
#> A17 A18 A19 A20 A21 A22 A23 A24 A25 A26 A27 A28 A29 A30 A31 A32 A33 A34
#> 1 NA NA NA <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA NA NA
#> A35 A36 A37 A38 A39 A40 A41 A42 A43 A44 A45 A46 A47 A48 A49 A50
#> 1 <NA> <NA> <NA> <NA> <NA> <NA> NA <NA> NA NA NA NA <NA> <NA> <NA> <NA>
#> A51 A52 A53 A54 A55 A56 A57 A58 excl COL1 COL2 COL3 COL4 COL5 COL6 COL7
#> 1 <NA> <NA> <NA> <NA> <NA> NA NA NA NA NA NA NA NA NA NA NA
#> COL8 COL9 COL10 COL11 COL12 COL13 COL14 COL15 COL16 COL17 COL18 COL19 COL20
#> 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
#> COL21 COL22 COL23 COL24 COL25 COL26 COL27 COL28 COL29 COL30 COL31 COL32 COL33
#> 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
#> COL34 COL35 COL36 COL37 COL38
#> 1 NA NA NA NA NA
spark_disconnect_all()
#> [1] 1
Created on 2022-09-12 by the reprex package (v2.0.1)
Hi @bhuvanesh1707, I think I have a fix for this issue. Can you try installing this dev version? |
Hi @edgararuiz, after installing this dev version of sparklyr and executing my code, I did not face any errors. I am satisfied with the resolution provided. Thanks.
I am facing a challenge executing my R code with sparklyr (version 1.7.7) and dbplyr (version 2.2.0): it throws the error "cannot resolve 'column'", which means the column is not present — but the column is present. The same R code works with the previous sparklyr (1.7.5) and dbplyr (2.1.1) versions.
I have created dummy data with 1 row and 62 columns and did some calculations. Column 'A' is present in my dummy data, but during the calculation it errors with "cannot resolve column 'A'".
I am guessing it fails to identify the columns when the query goes beyond 100 columns or so.