Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first() doesn't work in summaries #129

Closed
dgrtwo opened this issue Jul 17, 2018 · 5 comments
Milestone

Comments

@dgrtwo
Copy link
Member

@dgrtwo dgrtwo commented Jul 17, 2018

The first function in dbplyr returns whatever expression it's given as a string, instead of returning the first value (which should respect previous group_by() and arrange().

It was discovered in this Stack Overflow question. That table (and the version I reproduced locally) is PostgreSQL, but the problem can be reproduced in SQLlite:

library(dplyr)

batting <- tbl(dbplyr::lahman_sqlite(), "Batting")

tbl(dbplyr::lahman_sqlite(), "Batting") %>%
  arrange(yearID) %>%
  group_by(playerID) %>%
  summarize(first = first(G))

Returns:

# Source:   lazy query [?? x 2]
# Database: sqlite 3.22.0 [/var/folders/8p/xzrrqphx2qb3d2s_fgqrk5xr0000gn/T//Rtmpe0JeOa/lahman.sqlite]
   playerID  first
   <chr>     <chr>
 1 aardsda01 `G`  
 2 aaronha01 `G`  
 3 aaronto01 `G`  
 4 aasedo01  `G`  
 5 abadan01  `G`  
 6 abadfe01  `G`  
 7 abadijo01 `G`  
 8 abbated01 `G`  
 9 abbeybe01 `G`  
10 abbeych01 `G`  
# ... with more rows

This matches with the result of show_query() on the above.

<SQL>
SELECT `playerID`, '`G`' AS `first`
FROM (SELECT *
FROM `Batting`
ORDER BY `yearID`)
GROUP BY `playerID`

It has a similar behavior:

  • If it's given first(blabla), or another expression that's not a column
  • If it's done without an arrange or a group_by() (it can thus be reproduced with just tbl(dbplyr::lahman_sqlite(), "Batting") %>% summarize(first(G)))
@karldw

This comment has been minimized.

Copy link
Contributor

@karldw karldw commented Aug 10, 2018

Closely related: tidyverse/dplyr#2290

@javierluraschi

This comment has been minimized.

Copy link
Contributor

@javierluraschi javierluraschi commented Sep 11, 2018

Also reported in sparklyr under sparklyr/sparklyr#1504. sparklyr reprex:

library(sparklyr)

sc <- spark_connect(mater = "local")
batting_tbl <- copy_to(sc, Lahman::Batting)

batting_tbl %>% head() %>%
  group_by(yearID) %>%
  summarise(meansting = mean(stint), firstname = first(playerID)) %>%
  dbplyr::sql_render()
<SQL> SELECT `yearID`, AVG(`stint`) AS `meansting`, '`playerID`' AS `firstname`
FROM (SELECT *
FROM `sparklyr_88f216c7aaf6`
LIMIT 6) `ypzthwdkmo`
GROUP BY `yearID`
@edgararuiz

This comment has been minimized.

Copy link
Collaborator

@edgararuiz edgararuiz commented Sep 20, 2018

I thinks this is because first is considered a window function, activated by mutate(), as oppossed to an grouped/aggregate function, activated by summarize. I post two examples, one with first and the other with max:

library(sparklyr)
library(dplyr)
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)

t <- batting_tbl %>% 
  head(100) %>%
  group_by(teamID) %>%
  mutate(x = first(playerID)) %>%
  select(teamID, x)
t
#> # Source:   lazy query [?? x 2]
#> # Database: spark_connection
#> # Groups:   teamID
#>    teamID x        
#>    <chr>  <chr>    
#>  1 BS1    barnero01
#>  2 BS1    barnero01
#>  3 BS1    barnero01
#>  4 BS1    barnero01
#>  5 BS1    barnero01
#>  6 BS1    barnero01
#>  7 BS1    barnero01
#>  8 BS1    barnero01
#>  9 BS1    barnero01
#> 10 CH1    brannmi01
#> # ... with more rows
show_query(t)
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `playerID`, `yearID`, `stint`, `teamID`, `lgID`, `G`, `AB`, `R`, `H`, `X2B`, `X3B`, `HR`, `RBI`, `SB`, `CS`, `BB`, `SO`, `IBB`, `HBP`, `SH`, `SF`, `GIDP`, first_value(`playerID`) OVER (PARTITION BY `teamID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_35d8306e463f`
#> LIMIT 100) `jyepziefva`) `bhsbzffqdj`
spark_disconnect(sc)

If you try with max(), it will return one row per group, which is what I usually do in MS SQL

library(sparklyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)
sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
batting_tbl <- copy_to(sc, Lahman::Batting)

t <- batting_tbl %>% 
  head(100) %>%
  group_by(teamID) %>%
  summarise(x = max(playerID)) %>%
  select(teamID, x)
t
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> # Source:   lazy query [?? x 2]
#> # Database: spark_connection
#>   teamID x        
#>   <chr>  <chr>    
#> 1 BS1    spaldal01
#> 2 CH1    simmojo01
#> 3 CL1    questjo01
#> 4 FW1    selmafr01
#> 5 NY2    startjo01
#> 6 PH1    senseco01
#> 7 RC1    sagerpo01
#> 8 TRO    pikeli01 
#> 9 WS3    nortofr01
show_query(t)
#> Warning: Missing values are always removed in SQL.
#> Use `MAX(x, na.rm = TRUE)` to silence this warning
#> <SQL>
#> SELECT `teamID`, `x`
#> FROM (SELECT `teamID`, MAX(`playerID`) AS `x`
#> FROM (SELECT *
#> FROM `sparklyr_1d1870e96305`
#> LIMIT 100) `gpdizhnmrf`
#> GROUP BY `teamID`) `mgevjkueih`
spark_disconnect(sc)

Created on 2018-09-20 by the reprex package (v0.2.0).

@hadley hadley added this to the v1.4.0 milestone Jan 9, 2019
@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Jan 10, 2019

Minimal reprex:

library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)

mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_djljnlxwsc`
mf %>% summarise(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, '`x`' AS `f`
#> FROM `dbplyr_djljnlxwsc`
#> GROUP BY `g`

The source of the bug appears to be that first appears in the window functions, but not in the aggregate functions, so somehow dplyr::first() is getting called locally. I can fix that.

However, this is only going to deliver a better error message because in SQL FIRST_VALUE() is only window function, not also an aggregation function as it is R (because, I think, it depends on a notion of order that does not exist in SQL, except in windows).

@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Jan 10, 2019

library(dplyr, warn.conflicts = FALSE)
mf <- dbplyr::memdb_frame(g = c(1, 1, 2, 2), x = 1:4) %>% group_by(g)

mf %>% mutate(f = first(x)) %>% show_query()
#> <SQL>
#> SELECT `g`, `x`, FIRST_VALUE(`x`) OVER (PARTITION BY `g`) AS `f`
#> FROM `dbplyr_yhvcnautig`
mf %>% summarise(f = first(x)) %>% show_query()
#> Error: `first()` is only available in a windowed (`mutate()`) context
@hadley hadley closed this in d4d8f2a Jan 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.