Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tally() returns incorrect result on database table with n variable #3075

Closed
JohnMount opened this issue Sep 4, 2017 · 7 comments
Closed

tally() returns incorrect result on database table with n variable #3075

JohnMount opened this issue Sep 4, 2017 · 7 comments

Comments

@JohnMount
Copy link

@JohnMount JohnMount commented Sep 4, 2017

dbplyr issue: dplyr::tally() results on database different than on data.frame. This is a very low priority for me as I do not use tally().

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
#> [1] '0.7.2.9000'
library("sparklyr")
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'
packageVersion("DBI")
#> [1] '0.7'

db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

dLocal <- data.frame(n = 1:3)
print(dLocal)
#>   n
#> 1 1
#> 2 2
#> 3 3

dRemote <- dplyr::copy_to(db, dLocal)
print(dRemote)
#> # Source:   table<dLocal> [?? x 1]
#> # Database: sqlite 3.19.3 [:memory:]
#>       n
#>   <int>
#> 1     1
#> 2     2
#> 3     3


tally(dLocal)
#> Using `n` as weighting variable
#>   nn
#> 1  6

tally(dRemote)
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.19.3 [:memory:]
#>      nn
#>   <int>
#> 1     3

DBI::dbDisconnect(db)
@hadley hadley changed the title dbplyr issue: dplyr::tally() results on database different than on data.frame tally() returns incorrect result on database table with n variable Oct 23, 2017
@hadley
Copy link
Member

@hadley hadley commented Oct 23, 2017

Minimal reprex

library(dplyr, warn.conflicts = FALSE)

db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

dLocal <- data.frame(n = 1:3)
tally(dLocal)
#> Using `n` as weighting variable
#>   nn
#> 1  6

dRemote <- dplyr::copy_to(db, dLocal)
tally(dRemote)
#> # Source: lazy query [?? x 1]
#> # Database: sqlite 3.19.3 [:memory:]
#>      nn
#>   <int>
#> 1     3

Note the absence of "Using n as weighting variable`

@hadley
Copy link
Member

@hadley hadley commented Oct 23, 2017

(probably due to use of names() instead of tbl_vars() on line 64)

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 13, 2018

There's another inconsistency between tally() and count(), which is been present in v0.5.0 already (tested with #3412):

library(tidyverse)
tibble(a = 1:3, n = 4:6) %>% group_by(a) %>% tally()
#> Using `n` as weighting variable
#> # A tibble: 3 x 2
#>       a    nn
#>   <int> <int>
#> 1     1     4
#> 2     2     5
#> 3     3     6
tibble(a = 1:3, n = 4:6) %>% count(a)
#> # A tibble: 3 x 2
#>       a    nn
#>   <int> <int>
#> 1     1     1
#> 2     2     1
#> 3     3     1

Created on 2018-03-13 by the reprex package (v0.2.0).

I propose to stop guessing the weighting variable in tally() altogether, partly because it doesn't work as expected if there's an nn variable but not an n variable. First we change the message, then we warn if there's an n variable in the data but no wt argument is used, then we stop caring about it.

@hadley
Copy link
Member

@hadley hadley commented Mar 13, 2018

Eliminating guessing is not acceptable to me because it's important that tally(tally(x)) continue to work and return the correct answer (i.e. the total count)

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Mar 13, 2018

We agreed to not support guessing in tally(tally(tally(x))), and tally(count(x)) currently is different from tally(tally(x)) with respect to guessing. How can we make this a bit more consistent?

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented May 2, 2018

Fixed in #3525. Just need to move this test to dbplyr :

test_that("tally works on tbl_sql (#3075)", {
  db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  dLocal <- data.frame(n = 1:3)
  expect_message(tally_local <- tally(dLocal), "Using `n` as weighting variable")

  dRemote <- dplyr::copy_to(db, dLocal)
  expect_message(tally_remote <- tally(dRemote), "Using `n` as weighting variable" )
  expect_equal(pull(tally_remote, nn), 6)
})

@krlmlr krlmlr closed this in 12076e5 May 2, 2018
krlmlr added a commit that referenced this issue May 2, 2018
- `tally()` works correctly on non-data frame table sources such as `tbl_sql` (#3075).
@lock
Copy link

@lock lock bot commented Oct 29, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Oct 29, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants