Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbplyr does not translate base R integer division %/% #108

Open
move bot opened this issue Jun 25, 2018 · 13 comments

Comments

@move
Copy link

@move move bot commented Jun 25, 2018

@emilyriederer commented on Aug 27, 2017, 7:16 PM UTC:

Base R integer division (%/%) is translated to normal division (/) in SQL.

Apologies for non-reprex. I couldn't think of a way to get around the fact that a user must establish a connection to get this to run. When run connected to a AWS Redshift DB, data$z contains values 0, 1, and 2 while data_db$z contains floating point values.

The cause is that %/% is translated as / without accounting for integer-format. One correct SQL translation (at least for Redshift) would be CAST("x"/5 AS INTEGER) or FLOOR("x"/5)

library(dplyr)

data <- data.frame(x=as.integer(1:10))
data_db <- copy_to(con, data, "data_db", temporary = FALSE)

data <- mutate(data, z = x %/% 5)
data_db <- mutate(data_db, z = x %/% 5)

data
collect(data_db)

show_query(data_db)

This issue was moved by krlmlr from tidyverse/dplyr/issues/3057.

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@edgararuiz commented on Sep 4, 2017, 6:17 PM UTC:

Hi @emilyriederer , does a regular using forward slash by itself works in you environment:

data <- mutate(data, z = x / 5)

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@emilyriederer commented on Sep 15, 2017, 1:29 AM UTC:

Hi @edgararuiz -- unfortunately not.

It might work just as happenstance due to integer division, but the SQL translation also turns 5 into a floating point:

SELECT "x", "x" / 5.0 AS "z" FROM "data_db"

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@edgararuiz commented on Sep 15, 2017, 1:31 AM UTC:

Ok, can we try appending an "L" to the right of 5?

data <- mutate(data, z = x / 5L)

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@hadley commented on Oct 23, 2017, 4:43 PM UTC:

Minimal reprex

dbplyr::translate_sql(x %/% 5)
#> <SQL> "x" / 5.0
@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@hadley commented on Oct 23, 2017, 4:46 PM UTC:

To make equivalent to 1 %/% 0.2 will probably need to implement via modulo arithmetic.

x == (x %/% m) * m + (x %% m)  
x %/% m == (x - (x %% m)) / m

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@hadley commented on Oct 23, 2017, 10:35 PM UTC:

But this needs to be thought through correctly - I have a vague recollection that negative values might cause issues.

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@hadley commented on Oct 24, 2017, 10:11 PM UTC:

sql_int_div <- function() {
  function(x, m) {
    build_sql("((", x, " - (", x, " % ", m, ")) / ", m, ")")
  }
}

That definition gets us pretty close, but it turns out that %% in R and % in SQL have slightly different semantics when the signs are different:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

df <- tibble(
  x = c(10, 10, -10, -10), 
  y = c(3, -3, 3, -3)
)
df %>% mutate(x %% y, x %/% y)
#> # A tibble: 4 x 4
#>       x     y `x%%y` `x%/%y`
#>   <dbl> <dbl>  <dbl>   <dbl>
#> 1    10     3      1       3
#> 2    10    -3     -2      -4
#> 3   -10     3      2      -4
#> 4   -10    -3     -1       3

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mf <- con %>% copy_to(df)
mf %>% mutate(x %% y, x %/% y)
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.19.3 [:memory:]
#>       x     y `x%%y` `x%/%y`
#>   <dbl> <dbl>  <dbl>   <dbl>
#> 1    10     3      1       3
#> 2    10    -3      1      -3
#> 3   -10     3     -1      -3
#> 4   -10    -3     -1       3

I'm not sure how to handle this :(

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@alex-gable commented on Jan 26, 2018, 8:44 AM UTC:

This is a result of C98 and beyond (and by extension SQLite) using truncated division where the modulo operator takes the sign of the dividend, and R using the mathematically preferred floored division with the modulo sign taking the sign of the divisor.

Quite frankly, C(SQLite) and R are doing fundamentally different arithmetic. There's some fascinating reading on the subject here and of course an abridged version on Wikipedia. This will likely also vary across SQL dialects, which makes it more difficult to pin down in a unified way.

Given the above, I'm not sure it is reasonable to expect equivalent output in every language. Python covered some of the complexities in this discussion in PEP-228: Reworking Python's Numeric Model and PEP-238: Changing the Division Operator, highlighting that this is not just a "dplyr issue", but rather a significantly larger architectural decision in R and computer arithmetic itself.

@emilyriederer, In the case of RedShift, a python udf could be constructed in your database leveraging numpy to replicate the output from R.

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@hadley commented on Jan 26, 2018, 1:37 PM UTC:

@alex-gable thanks for that awesome summary of the problem!

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@hadley commented on Jun 7, 2018, 11:37 PM UTC:

I think the best way to handle this is simple to document it.

@move

This comment has been minimized.

Copy link
Author

@move move bot commented Jun 25, 2018

@emilyriederer commented on Jun 9, 2018, 6:51 PM UTC:

Wow - thank you all for the very helpful, detailed responses. All of the context here is fascinating. I'm embarrassed to discover that I completely "went dark" on this thread. Somehow, I'm not getting notifications but luckily spotted this atop the new GitHub feed. Thanks again!

@jsta

This comment has been minimized.

Copy link

@jsta jsta commented Sep 1, 2018

I was able to force non-integer operation by adding 0.0 to one of the variables. In contrast, as.numeric seemed to have no effect:

library(dplyr)

data <- data.frame(x = as.integer(1:5), y = as.integer(5:1))
data_db <- copy_to(dbplyr::src_memdb(), data, "data_db")

mutate(data_db, z = as.numeric(x)/y)
#> # Source:   lazy query [?? x 3]
#> # Database: sqlite 3.22.0 [:memory:]
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     5     0
#> 2     2     4     0
#> 3     3     3     1
#> 4     4     2     2
#> 5     5     1     5
mutate(data_db, z = (x + 0)/y)
#> # Source:   lazy query [?? x 3]
#> # Database: sqlite 3.22.0 [:memory:]
#>       x     y     z
#>   <int> <int> <dbl>
#> 1     1     5   0.2
#> 2     2     4   0.5
#> 3     3     3   1  
#> 4     4     2   2  
#> 5     5     1   5
@hadley hadley added this to the v1.4.0 milestone Jan 9, 2019
@hadley hadley closed this in 6cacb7a Jan 10, 2019
@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Oct 12, 2019

Oh but there’s a DIV function we could use instead

@hadley hadley reopened this Oct 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.