Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbplyr does not translate base R integer division %/% #3057

Closed
emilyriederer opened this issue Aug 27, 2017 · 13 comments
Closed

dbplyr does not translate base R integer division %/% #3057

emilyriederer opened this issue Aug 27, 2017 · 13 comments
Labels
documentation feature a feature request or enhancement

Comments

@emilyriederer
Copy link

emilyriederer commented Aug 27, 2017

Base R integer division (%/%) is translated to normal division (/) in SQL.

Apologies for non-reprex. I couldn't think of a way to get around the fact that a user must establish a connection to get this to run. When run connected to a AWS Redshift DB, data$z contains values 0, 1, and 2 while data_db$z contains floating point values.

The cause is that %/% is translated as / without accounting for integer-format. One correct SQL translation (at least for Redshift) would be CAST("x"/5 AS INTEGER) or FLOOR("x"/5)

library(dplyr)

data <- data.frame(x=as.integer(1:10))
data_db <- copy_to(con, data, "data_db", temporary = FALSE)

data <- mutate(data, z = x %/% 5)
data_db <- mutate(data_db, z = x %/% 5)

data
collect(data_db)

show_query(data_db)
@edgararuiz-zz
Copy link

Hi @emilyriederer , does a regular using forward slash by itself works in you environment:

data <- mutate(data, z = x / 5)

@emilyriederer
Copy link
Author

Hi @edgararuiz -- unfortunately not.

It might work just as happenstance due to integer division, but the SQL translation also turns 5 into a floating point:

SELECT "x", "x" / 5.0 AS "z" FROM "data_db"

@edgararuiz-zz
Copy link

Ok, can we try appending an "L" to the right of 5?

data <- mutate(data, z = x / 5L)

@hadley
Copy link
Member

hadley commented Oct 23, 2017

Minimal reprex

dbplyr::translate_sql(x %/% 5)
#> <SQL> "x" / 5.0

@hadley
Copy link
Member

hadley commented Oct 23, 2017

To make equivalent to 1 %/% 0.2 will probably need to implement via modulo arithmetic.

x == (x %/% m) * m + (x %% m)  
x %/% m == (x - (x %% m)) / m

@hadley hadley added database feature a feature request or enhancement labels Oct 23, 2017
@hadley
Copy link
Member

hadley commented Oct 23, 2017

But this needs to be thought through correctly - I have a vague recollection that negative values might cause issues.

@hadley
Copy link
Member

hadley commented Oct 24, 2017

sql_int_div <- function() {
  function(x, m) {
    build_sql("((", x, " - (", x, " % ", m, ")) / ", m, ")")
  }
}

That definition gets us pretty close, but it turns out that %% in R and % in SQL have slightly different semantics when the signs are different:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

df <- tibble(
  x = c(10, 10, -10, -10), 
  y = c(3, -3, 3, -3)
)
df %>% mutate(x %% y, x %/% y)
#> # A tibble: 4 x 4
#>       x     y `x%%y` `x%/%y`
#>   <dbl> <dbl>  <dbl>   <dbl>
#> 1    10     3      1       3
#> 2    10    -3     -2      -4
#> 3   -10     3      2      -4
#> 4   -10    -3     -1       3

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mf <- con %>% copy_to(df)
mf %>% mutate(x %% y, x %/% y)
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.19.3 [:memory:]
#>       x     y `x%%y` `x%/%y`
#>   <dbl> <dbl>  <dbl>   <dbl>
#> 1    10     3      1       3
#> 2    10    -3      1      -3
#> 3   -10     3     -1      -3
#> 4   -10    -3     -1       3

I'm not sure how to handle this :(

@alex-gable
Copy link

alex-gable commented Jan 26, 2018

This is a result of C98 and beyond (and by extension SQLite) using truncated division where the modulo operator takes the sign of the dividend, and R using the mathematically preferred floored division with the modulo sign taking the sign of the divisor.

Quite frankly, C(SQLite) and R are doing fundamentally different arithmetic. There's some fascinating reading on the subject here and of course an abridged version on Wikipedia. This will likely also vary across SQL dialects, which makes it more difficult to pin down in a unified way.

Given the above, I'm not sure it is reasonable to expect equivalent output in every language. Python covered some of the complexities in this discussion in PEP-228: Reworking Python's Numeric Model and PEP-238: Changing the Division Operator, highlighting that this is not just a "dplyr issue", but rather a significantly larger architectural decision in R and computer arithmetic itself.

@emilyriederer, In the case of RedShift, a python udf could be constructed in your database leveraging numpy to replicate the output from R.

@hadley
Copy link
Member

hadley commented Jan 26, 2018

@alex-gable thanks for that awesome summary of the problem!

@hadley
Copy link
Member

hadley commented Jun 7, 2018

I think the best way to handle this is simple to document it.

@emilyriederer
Copy link
Author

Wow - thank you all for the very helpful, detailed responses. All of the context here is fascinating. I'm embarrassed to discover that I completely "went dark" on this thread. Somehow, I'm not getting notifications but luckily spotted this atop the new GitHub feed. Thanks again!

@ghost
Copy link

ghost commented Jun 25, 2018

This issue was moved by krlmlr to tidyverse/dbplyr/issues/108.

@ghost ghost closed this as completed Jun 25, 2018
@lock
Copy link

lock bot commented Dec 22, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Dec 22, 2018
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
documentation feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants