Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbplyr does not translate base R integer division %/% #108

Closed
ghost opened this issue Jun 25, 2018 · 14 comments
Closed

dbplyr does not translate base R integer division %/% #108

ghost opened this issue Jun 25, 2018 · 14 comments

Comments

@ghost
Copy link

@ghost ghost commented Jun 25, 2018

@emilyriederer commented on Aug 27, 2017, 7:16 PM UTC:

Base R integer division (%/%) is translated to normal division (/) in SQL.

Apologies for non-reprex. I couldn't think of a way to get around the fact that a user must establish a connection to get this to run. When run connected to a AWS Redshift DB, data$z contains values 0, 1, and 2 while data_db$z contains floating point values.

The cause is that %/% is translated as / without accounting for integer-format. One correct SQL translation (at least for Redshift) would be CAST("x"/5 AS INTEGER) or FLOOR("x"/5)

library(dplyr)

data <- data.frame(x=as.integer(1:10))
data_db <- copy_to(con, data, "data_db", temporary = FALSE)

data <- mutate(data, z = x %/% 5)
data_db <- mutate(data_db, z = x %/% 5)

data
collect(data_db)

show_query(data_db)

This issue was moved by krlmlr from tidyverse/dplyr/issues/3057.

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@edgararuiz commented on Sep 4, 2017, 6:17 PM UTC:

Hi @emilyriederer , does a regular using forward slash by itself works in you environment:

data <- mutate(data, z = x / 5)

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@emilyriederer commented on Sep 15, 2017, 1:29 AM UTC:

Hi @edgararuiz -- unfortunately not.

It might work just as happenstance due to integer division, but the SQL translation also turns 5 into a floating point:

SELECT "x", "x" / 5.0 AS "z" FROM "data_db"

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@edgararuiz commented on Sep 15, 2017, 1:31 AM UTC:

Ok, can we try appending an "L" to the right of 5?

data <- mutate(data, z = x / 5L)

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@hadley commented on Oct 23, 2017, 4:43 PM UTC:

Minimal reprex

dbplyr::translate_sql(x %/% 5)
#> <SQL> "x" / 5.0

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@hadley commented on Oct 23, 2017, 4:46 PM UTC:

To make equivalent to 1 %/% 0.2 will probably need to implement via modulo arithmetic.

x == (x %/% m) * m + (x %% m)  
x %/% m == (x - (x %% m)) / m

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@hadley commented on Oct 23, 2017, 10:35 PM UTC:

But this needs to be thought through correctly - I have a vague recollection that negative values might cause issues.

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@hadley commented on Oct 24, 2017, 10:11 PM UTC:

sql_int_div <- function() {
  function(x, m) {
    build_sql("((", x, " - (", x, " % ", m, ")) / ", m, ")")
  }
}

That definition gets us pretty close, but it turns out that %% in R and % in SQL have slightly different semantics when the signs are different:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

df <- tibble(
  x = c(10, 10, -10, -10), 
  y = c(3, -3, 3, -3)
)
df %>% mutate(x %% y, x %/% y)
#> # A tibble: 4 x 4
#>       x     y `x%%y` `x%/%y`
#>   <dbl> <dbl>  <dbl>   <dbl>
#> 1    10     3      1       3
#> 2    10    -3     -2      -4
#> 3   -10     3      2      -4
#> 4   -10    -3     -1       3

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mf <- con %>% copy_to(df)
mf %>% mutate(x %% y, x %/% y)
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.19.3 [:memory:]
#>       x     y `x%%y` `x%/%y`
#>   <dbl> <dbl>  <dbl>   <dbl>
#> 1    10     3      1       3
#> 2    10    -3      1      -3
#> 3   -10     3     -1      -3
#> 4   -10    -3     -1       3

I'm not sure how to handle this :(

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@alex-gable commented on Jan 26, 2018, 8:44 AM UTC:

This is a result of C98 and beyond (and by extension SQLite) using truncated division where the modulo operator takes the sign of the dividend, and R using the mathematically preferred floored division with the modulo sign taking the sign of the divisor.

Quite frankly, C(SQLite) and R are doing fundamentally different arithmetic. There's some fascinating reading on the subject here and of course an abridged version on Wikipedia. This will likely also vary across SQL dialects, which makes it more difficult to pin down in a unified way.

Given the above, I'm not sure it is reasonable to expect equivalent output in every language. Python covered some of the complexities in this discussion in PEP-228: Reworking Python's Numeric Model and PEP-238: Changing the Division Operator, highlighting that this is not just a "dplyr issue", but rather a significantly larger architectural decision in R and computer arithmetic itself.

@emilyriederer, In the case of RedShift, a python udf could be constructed in your database leveraging numpy to replicate the output from R.

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@hadley commented on Jan 26, 2018, 1:37 PM UTC:

@alex-gable thanks for that awesome summary of the problem!

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@hadley commented on Jun 7, 2018, 11:37 PM UTC:

I think the best way to handle this is simple to document it.

@ghost
Copy link
Author

@ghost ghost commented Jun 25, 2018

@emilyriederer commented on Jun 9, 2018, 6:51 PM UTC:

Wow - thank you all for the very helpful, detailed responses. All of the context here is fascinating. I'm embarrassed to discover that I completely "went dark" on this thread. Somehow, I'm not getting notifications but luckily spotted this atop the new GitHub feed. Thanks again!

@jsta
Copy link

@jsta jsta commented Sep 1, 2018

I was able to force non-integer operation by adding 0.0 to one of the variables. In contrast, as.numeric seemed to have no effect:

library(dplyr)

data <- data.frame(x = as.integer(1:5), y = as.integer(5:1))
data_db <- copy_to(dbplyr::src_memdb(), data, "data_db")

mutate(data_db, z = as.numeric(x)/y)
#> # Source:   lazy query [?? x 3]
#> # Database: sqlite 3.22.0 [:memory:]
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     5     0
#> 2     2     4     0
#> 3     3     3     1
#> 4     4     2     2
#> 5     5     1     5
mutate(data_db, z = (x + 0)/y)
#> # Source:   lazy query [?? x 3]
#> # Database: sqlite 3.22.0 [:memory:]
#>       x     y     z
#>   <int> <int> <dbl>
#> 1     1     5   0.2
#> 2     2     4   0.5
#> 3     3     3   1  
#> 4     4     2   2  
#> 5     5     1   5

@hadley
Copy link
Member

@hadley hadley commented Oct 12, 2019

Oh but there’s a DIV function we could use instead

@hadley hadley reopened this Oct 12, 2019
@hadley
Copy link
Member

@hadley hadley commented Sep 22, 2020

So I think overall this is just too complicated, and not worth providing a translation. I'll just make %/% return a clean error (since otherwise it's translated to /, which isn't correct)`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants