
Like mutate, summarise needs to create subqueries when necessary #114

Closed
move bot opened this issue Jun 25, 2018 · 9 comments

Comments

@move move bot commented Jun 25, 2018

@AjarKeen commented on Jan 8, 2018, 9:44 PM UTC:

cc @javierluraschi since I've only tested this with sparklyr, not with other dbplyr backends.

I'm starting to use `sparklyr` to do grouped time series aggregations, and very quickly ran into this problem. Here's a modest reprex showing how it works locally, the `sparklyr` version that fails, and then a workaround:
library(dplyr)
library(sparklyr)
library(nycflights13)

# local version
flights %>%
  group_by(carrier) %>%
  summarize(count_num = n(), 
            mean_dep_delay = mean(dep_delay, na.rm = TRUE),
            ratio = mean_dep_delay / count_num) %>% 
  arrange(carrier)

#> # A tibble: 16 x 4
#>    carrier count_num mean_dep_delay    ratio
#>    <chr>       <int>          <dbl>    <dbl>
#>  1 9E          18460          16.7  0.000906
#>  2 AA          32729           8.59 0.000262
#>  3 AS            714           5.80 0.00813 
#>  4 B6          54635          13.0  0.000238
#>  5 DL          48110           9.26 0.000193
#>  6 EV          54173          20.0  0.000368
#>  7 F9            685          20.2  0.0295  
#>  8 FL           3260          18.7  0.00574 
#>  9 HA            342           4.90 0.0143  
#> 10 MQ          26397          10.6  0.000400
#> 11 OO             32          12.6  0.393   
#> 12 UA          58665          12.1  0.000206
#> 13 US          20536           3.78 0.000184
#> 14 VX           5162          12.9  0.00249 
#> 15 WN          12275          17.7  0.00144 
#> 16 YV            601          19.0  0.0316

# Spark version
sc <- spark_connect(master = "local")

flights_sdf <- copy_to(sc, flights, "flights")

flights_sdf %>%
  group_by(carrier) %>%
  summarize(count_num = n(), 
            mean_dep_delay = mean(dep_delay),
            ratio = mean_dep_delay / count_num) %>%
  collect()

#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning
#> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`mean_dep_delay`' given input columns: [dest, dep_delay, distance, dep_time, minute, carrier, origin, sched_arr_time, month, arr_time, day, flight, sched_dep_time, time_hour, arr_delay, air_time, hour, tailnum, year]; line 1 pos 81;
#> 'Aggregate [carrier#38], [carrier#38, count(1) AS count_num#447L, avg(dep_delay#34) AS mean_dep_delay#448, ('mean_dep_delay / 'count_num) AS ratio#449]
#> +- SubqueryAlias flights
#>    +- LogicalRDD [year#29, month#30, day#31, dep_time#32, sched_dep_time#33, dep_delay#34, arr_time#35, sched_arr_time#36, arr_delay#37, carrier#38, flight#39, tailnum#40, origin#41, dest#42, air_time#43, distance#44, hour#45, minute#46, time_hour#47]
#> 
#>  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$transformUp$1.apply(TreeNode.scala:289)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$transformUp$1.apply(TreeNode.scala:289)
#>  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$4.apply(TreeNode.scala:306)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$3.apply(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode$anonfun$4.apply(TreeNode.scala:306)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$recursiveTransform$1(QueryPlan.scala:289)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$recursiveTransform$1$1.apply(QueryPlan.scala:293)
#>  at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:234)
#>  at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:234)
#>  at scala.collection.immutable.List.foreach(List.scala:381)
#>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
#>  at scala.collection.immutable.List.map(List.scala:285)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$recursiveTransform$1(QueryPlan.scala:293)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan$anonfun$6.apply(QueryPlan.scala:298)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
#>  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
#>  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
#>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
#>  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
#>  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
#>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
#>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
#>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
#>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
#>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
#>  at java.lang.reflect.Method.invoke(Method.java:498)
#>  at sparklyr.Invoke$.invoke(invoke.scala:102)
#>  at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
#>  at sparklyr.StreamHandler$.read(stream.scala:62)
#>  at sparklyr.BackendHandler.channelRead0(handler.scala:52)
#>  at sparklyr.BackendHandler.channelRead0(handler.scala:14)
#>  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#>  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
#>  at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#>  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
#>  at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
#>  at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#>  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
#>  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
#>  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
#>  at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
#>  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
#>  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
#>  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
#>  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
#>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
#>  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
#>  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
#>  at java.lang.Thread.run(Thread.java:748)

# Spark workaround
flights_sdf %>%
  group_by(carrier) %>%
  mutate(count_num = n(), 
         mean_dep_delay = mean(dep_delay),
         ratio = mean_dep_delay / count_num) %>%
  summarize(count_num = mean(count_num),
            mean_dep_delay = mean(mean_dep_delay),
            ratio = mean(ratio)) %>%
  arrange(carrier) %>%
  collect()

#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `AVG(x, na.rm = TRUE)` to silence this warning

#> Warning: Missing values are always removed in SQL.
#> Use `avg(x, na.rm = TRUE)` to silence this warning

#> # A tibble: 16 x 4
#>    carrier count_num mean_dep_delay    ratio
#>    <chr>       <dbl>          <dbl>    <dbl>
#>  1 9E        18460            16.7  0.000906
#>  2 AA        32729             8.59 0.000262
#>  3 AS          714             5.80 0.00813 
#>  4 B6        54635            13.0  0.000238
#>  5 DL        48110             9.26 0.000193
#>  6 EV        54173            20.0  0.000368
#>  7 F9          685            20.2  0.0295  
#>  8 FL         3260            18.7  0.00574 
#>  9 HA          342             4.90 0.0143  
#> 10 MQ        26397            10.6  0.000400
#> 11 OO           32.0          12.6  0.393   
#> 12 UA        58665            12.1  0.000206
#> 13 US        20536             3.78 0.000184
#> 14 VX         5162            12.9  0.00249 
#> 15 WN        12275            17.7  0.00144 
#> 16 YV          601            19.0  0.0316
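The workaround above leans on dbplyr translating a grouped mutate() into window functions; the derived columns are then constant within each group, so the trailing summarize() can collapse them with plain aggregates. As a rough sketch of that SQL shape (not dbplyr's exact output, which uses a generated subquery alias), it can be checked with Python's built-in sqlite3 (window functions need SQLite >= 3.25) and a small toy table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE flights (carrier TEXT, dep_delay REAL);
INSERT INTO flights VALUES
  ('UA', 0), ('AA', 5), ('DL', 10), ('UA', 5), ('AA', 10), ('DL', 15);
""")

# Inner query mirrors the grouped mutate(): window functions attach the
# per-carrier aggregates to every row. Note that `ratio` must repeat the
# window expressions -- it cannot reference the sibling aliases either.
# The outer query then averages the per-group constants, as the
# summarize() in the workaround does.
rows = con.execute("""
SELECT carrier,
       AVG(count_num)      AS count_num,
       AVG(mean_dep_delay) AS mean_dep_delay,
       AVG(ratio)          AS ratio
FROM (
    SELECT carrier,
           COUNT(*) OVER (PARTITION BY carrier)       AS count_num,
           AVG(dep_delay) OVER (PARTITION BY carrier) AS mean_dep_delay,
           AVG(dep_delay) OVER (PARTITION BY carrier)
             / COUNT(*) OVER (PARTITION BY carrier)   AS ratio
    FROM flights
)
GROUP BY carrier
ORDER BY carrier
""").fetchall()
print(rows)
```

With the toy data this yields AA (2, 7.5, 3.75), DL (2, 12.5, 6.25), UA (2, 2.5, 1.25), matching a by-hand calculation per carrier.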

My session info after I run the above:

> devtools::session_info()
Session info ---------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.383)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       <NA>                        
 date     2018-01-08                  

Packages -------------------------------------------------------------------------
 package      * version    date       source                           
 assertthat     0.2.0      2017-04-11 CRAN (R 3.4.1)                   
 backports      1.1.2      2017-12-13 cran (@1.1.2)                    
 base         * 3.4.1      2017-09-07 local                            
 base64enc      0.1-3      2015-07-28 CRAN (R 3.4.1)                   
 bindr          0.1        2016-11-13 CRAN (R 3.4.1)                   
 bindrcpp     * 0.2        2017-06-17 CRAN (R 3.4.1)                   
 broom          0.4.3      2017-11-20 cran (@0.4.3)                    
 callr          1.0.0      2016-06-18 CRAN (R 3.4.1)                   
 cli            1.0.0      2017-11-05 CRAN (R 3.4.1)                   
 clipr          0.4.0      2017-11-03 CRAN (R 3.4.1)                   
 compiler       3.4.1      2017-09-07 local                            
 config         0.2        2016-08-02 CRAN (R 3.4.1)                   
 crayon         1.3.4      2017-09-16 CRAN (R 3.4.1)                   
 datasets     * 3.4.1      2017-09-07 local                            
 DBI            0.7        2017-06-18 CRAN (R 3.4.1)                   
 dbplyr         1.2.0      2018-01-03 cran (@1.2.0)                    
 devtools       1.13.4     2017-11-09 CRAN (R 3.4.1)                   
 digest         0.6.13     2017-12-14 cran (@0.6.13)                   
 dplyr        * 0.7.4      2017-09-28 CRAN (R 3.4.1)                   
 evaluate       0.10.1     2017-06-24 CRAN (R 3.4.1)                   
 foreign        0.8-69     2017-06-22 CRAN (R 3.4.1)                   
 glue           1.2.0      2017-10-29 CRAN (R 3.4.1)                   
 graphics     * 3.4.1      2017-09-07 local                            
 grDevices    * 3.4.1      2017-09-07 local                            
 grid           3.4.1      2017-09-07 local                            
 htmltools      0.3.6      2017-04-28 CRAN (R 3.4.1)                   
 httpuv         1.3.5      2017-07-04 CRAN (R 3.4.1)                   
 httr           1.3.1      2017-08-20 CRAN (R 3.4.1)                   
 jsonlite       1.5        2017-06-01 CRAN (R 3.4.1)                   
 knitr          1.17       2017-08-10 CRAN (R 3.4.1)                   
 lattice        0.20-35    2017-03-25 CRAN (R 3.4.1)                   
 lazyeval       0.2.1      2017-10-29 CRAN (R 3.4.1)                   
 magrittr       1.5        2014-11-22 CRAN (R 3.4.1)                   
 memoise        1.1.0      2017-04-21 CRAN (R 3.4.1)                   
 methods      * 3.4.1      2017-09-07 local                            
 mime           0.5        2016-07-07 CRAN (R 3.4.1)                   
 mnormt         1.5-5      2016-10-15 CRAN (R 3.4.1)                   
 nlme           3.1-131    2017-02-06 CRAN (R 3.4.1)                   
 nycflights13 * 0.2.2      2017-01-27 CRAN (R 3.4.1)                   
 openssl        0.9.9      2017-11-10 cran (@0.9.9)                    
 parallel       3.4.1      2017-09-07 local                            
 pillar         1.0.1      2017-11-27 cran (@1.0.1)                    
 pkgconfig      2.0.1      2017-03-21 CRAN (R 3.4.1)                   
 plyr           1.8.4      2016-06-08 CRAN (R 3.4.1)                   
 psych          1.7.8      2017-09-09 CRAN (R 3.4.1)                   
 purrr          0.2.4      2017-10-18 CRAN (R 3.4.1)                   
 R6             2.2.2      2017-06-17 CRAN (R 3.4.1)                   
 Rcpp           0.12.14    2017-11-23 cran (@0.12.14)                  
 reprex       * 0.1.1      2017-01-13 CRAN (R 3.4.1)                   
 reshape2       1.4.3      2017-12-11 cran (@1.4.3)                    
 rlang          0.1.6      2017-12-21 cran (@0.1.6)                    
 rmarkdown      1.7        2017-11-10 CRAN (R 3.4.1)                   
 rprojroot      1.3-2      2018-01-03 cran (@1.3-2)                    
 rstudioapi     0.7        2017-09-07 CRAN (R 3.4.1)                   
 shiny          1.0.5      2017-08-23 CRAN (R 3.4.1)                   
 sparklyr     * 0.7.0-9106 2018-01-08 Github (rstudio/sparklyr@41d145a)
 stats        * 3.4.1      2017-09-07 local                            
 stringi        1.1.5      2017-04-07 CRAN (R 3.4.1)                   
 stringr        1.2.0      2017-02-18 CRAN (R 3.4.1)                   
 tibble         1.4.1      2017-12-25 cran (@1.4.1)                    
 tidyr          0.7.2      2017-10-16 CRAN (R 3.4.1)                   
 tools          3.4.1      2017-09-07 local                            
 utf8           1.1.3      2018-01-03 cran (@1.1.3)                    
 utils        * 3.4.1      2017-09-07 local                            
 whisker        0.3-2      2013-04-28 CRAN (R 3.4.1)                   
 withr          2.1.1      2017-12-19 cran (@2.1.1)                    
 xtable         1.8-2      2016-02-05 CRAN (R 3.4.1)                   
 yaml           2.1.14     2016-11-12 CRAN (R 3.4.1) 

I'm not sure if this is related to a similar-sounding issue with mutate / rename that was resolved in dbplyr 1.2.0. I can work around it for now, but it will get clunky fast as I get into more complex aggregation operations with sparklyr.

This issue was moved by krlmlr from tidyverse/dplyr/issues/3295.

@move move bot commented Jun 25, 2018

@AjarKeen commented on Apr 4, 2018, 7:54 PM UTC:

I don't think the problem is related to the tidyeval issue. The SQL translation of the failing code references a column alias (`mean_dep_delay`) within the same SELECT statement that defines it, which SQL doesn't allow.

The code that fails is this:

flights_sdf %>%
  group_by(carrier) %>%
  summarize(count_num = n(), 
            mean_dep_delay = mean(dep_delay),
            ratio = mean_dep_delay / count_num) %>%
  collect()

Which is translated to:

SELECT `carrier`, 
       count(*) AS `count_num`, 
       AVG(`dep_delay`) AS `mean_dep_delay`, 
       `mean_dep_delay` / `count_num` AS `ratio`
FROM `flights`
GROUP BY `carrier`

We can test that query out on a toy table:

CREATE TABLE flights ( 
carrier VARCHAR(50), 
dep_delay FLOAT
); 

INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('UA', 0);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('AA', 5);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('DL', 10);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('UA', 5);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('AA', 10);
INSERT INTO `flights` (`carrier`, `dep_delay`) VALUES ('DL', 15);

SELECT `carrier`, 
        count(*) AS `count_num`, 
        AVG(`dep_delay`) AS `mean_dep_delay`, 
        `mean_dep_delay` / `count_num` AS `ratio`
FROM `flights`
GROUP BY `carrier`

This fails with the error `Unknown column 'mean_dep_delay'`.

Instead, we need the R code to be translated into something like this:


SELECT `carrier`,
       `mean_dep_delay` / `count_num` AS `ratio`
FROM (
    SELECT `carrier`, 
            count(*) AS `count_num`, 
            AVG(`dep_delay`) AS `mean_dep_delay`
    FROM `flights`
    GROUP BY `carrier` 
    ) AS flights_grouped

Which works on the toy example:

carrier | ratio
AA      | 3.75
DL      | 6.25
UA      | 1.25

I'm not sure how to make such a change in dbplyr, but it would definitely be a useful feature.
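The failing and working queries above can be checked end-to-end with Python's built-in sqlite3, which enforces the same alias-resolution rule (toy data and the `flights_grouped` alias as in the comment above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE flights (carrier TEXT, dep_delay REAL);
INSERT INTO flights VALUES
  ('UA', 0), ('AA', 5), ('DL', 10), ('UA', 5), ('AA', 10), ('DL', 15);
""")

# Direct translation fails: `mean_dep_delay` is an alias defined in the
# same SELECT list, not a column of `flights`.
err = None
try:
    con.execute("""
    SELECT carrier,
           COUNT(*) AS count_num,
           AVG(dep_delay) AS mean_dep_delay,
           mean_dep_delay / count_num AS ratio
    FROM flights
    GROUP BY carrier
    """)
except sqlite3.OperationalError as e:
    err = str(e)
print(err)

# Wrapping the aggregation in a subquery turns the aliases into real
# columns that the outer SELECT can reference.
rows = con.execute("""
SELECT carrier, mean_dep_delay / count_num AS ratio
FROM (
    SELECT carrier, COUNT(*) AS count_num, AVG(dep_delay) AS mean_dep_delay
    FROM flights
    GROUP BY carrier
) AS flights_grouped
ORDER BY carrier
""").fetchall()
print(rows)
```

The subquery version reproduces the toy-example result: AA 3.75, DL 6.25, UA 1.25.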

@move move bot commented Jun 25, 2018

@hadley commented on May 20, 2018, 2:38 PM UTC:

Minimal reprex

library(dplyr, warn.conflicts = FALSE)

lf1 <- dbplyr::lazy_frame(x = 1:5, src = dbplyr::simulate_dbi())

lf1 %>% 
  summarise(
    x1 = mean(x, na.rm = TRUE), 
    x2 = sum(x, na.rm = TRUE),
    x3 = x1 / x2
  ) %>% 
  show_query()
#> <SQL> SELECT AVG("x") AS "x1", SUM("x") AS "x2", "x1" / "x2" AS "x3"
#> FROM "df"

Created on 2018-05-20 by the reprex package (v0.2.0).

@move move bot commented Jun 25, 2018

@hadley commented on May 20, 2018, 2:39 PM UTC:

Compare with mutate() which correctly creates the nested subquery:

library(dplyr, warn.conflicts = FALSE)

lf1 <- dbplyr::lazy_frame(x = 1:5, src = dbplyr::simulate_dbi())

lf1 %>% 
  mutate(
    x1 = mean(x, na.rm = TRUE), 
    x2 = sum(x, na.rm = TRUE),
    x3 = x1 / x2
  ) %>% 
  show_query()
#> <SQL> SELECT "x", "x1", "x2", "x1" / "x2" AS "x3"
#> FROM (SELECT "x", avg("x") OVER () AS "x1", sum("x") OVER () AS "x2"
#> FROM "df") "gdgrsnkldf"
@hadley hadley commented Jan 10, 2019

This isn't quite as simple as applying the algorithm from mutate(), because while mutate() automatically keeps previous variables, summarise() does not. This makes it the flip side of #193.

@hadley hadley commented Jan 10, 2019

We also need to make this work for transmute().

@hadley hadley commented Mar 14, 2019

Updated reprex:

library(dplyr, warn.conflicts = FALSE)
lf <- dbplyr::lazy_frame(x = 1:5)

lf %>% 
  mutate(
    x1 = mean(x, na.rm = TRUE), 
    x2 = sum(x, na.rm = TRUE),
    x3 = x1 / x2
  )
#> <SQL>
#> SELECT `x`, `x1`, `x2`, `x1` / `x2` AS `x3`
#> FROM (SELECT `x`, AVG(`x`) OVER () AS `x1`, SUM(`x`) OVER () AS `x2`
#> FROM `df`) `dbplyr_ksmttkjckb`

lf %>% 
  summarise(
    x1 = mean(x, na.rm = TRUE), 
    x2 = sum(x, na.rm = TRUE),
    x3 = x1 / x2
  )
#> <SQL>
#> SELECT AVG(`x`) AS `x1`, SUM(`x`) AS `x2`, `x1` / `x2` AS `x3`
#> FROM `df`

Created on 2019-03-14 by the reprex package (v0.2.1.9000)

@hadley hadley commented Mar 14, 2019

Maybe it's enough to make this an error? It's not really clear what the above example means, and we really want the user to rewrite it as:

library(dplyr, warn.conflicts = FALSE)
lf <- dbplyr::lazy_frame(x = 1:5)

lf %>% 
  summarise(
    x1 = mean(x, na.rm = TRUE), 
    x2 = sum(x, na.rm = TRUE)
  ) %>% 
  mutate(x3 = x1 / x2)
#> <SQL>
#> SELECT `x1`, `x2`, `x1` / `x2` AS `x3`
#> FROM (SELECT AVG(`x`) AS `x1`, SUM(`x`) AS `x2`
#> FROM `df`) `dbplyr_xejwocofsh`

Created on 2019-03-14 by the reprex package (v0.2.1.9000)
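A sketch with Python's built-in sqlite3 confirms that the SQL generated by the summarise()-then-mutate() rewrite evaluates cleanly (the inner-query alias here is a stand-in for dbplyr's randomly generated one):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (x INTEGER)")
con.executemany("INSERT INTO df VALUES (?)", [(i,) for i in range(1, 6)])

# summarise() runs first inside the subquery; the trailing mutate() can
# then refer to `x1` and `x2` as ordinary columns of the inner result.
row = con.execute("""
SELECT x1, x2, x1 / x2 AS x3
FROM (SELECT AVG(x) AS x1, SUM(x) AS x2 FROM df) AS dbplyr_sub
""").fetchone()
print(row)  # (3.0, 15, 0.2)
```

This matches the local dplyr result in the next comment: x1 = 3, x2 = 15, x3 = 0.2.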

@hadley hadley closed this in 2447f75 Mar 14, 2019
@doorisajar doorisajar commented Mar 14, 2019

The only confusing thing about this fix is that the "wrong" version still works in a non-dbplyr context, so the same code may work or fail depending on the data source. That makes it hard to prototype locally and then execute on a larger dataset through a backend.

Local reprex:

library(dplyr)
lf <- tibble(x = 1:5)

lf %>% 
  summarise(
    x1 = mean(x, na.rm = TRUE), 
    x2 = sum(x, na.rm = TRUE),
    x3 = x1 / x2
  ) 
#> # A tibble: 1 x 3
#>      x1    x2    x3
#>   <dbl> <int> <dbl>
#> 1     3    15   0.2

I can file a dplyr issue if the right thing to do is have the error thrown in the local context as well.

Edit: or does this fix cover both cases? That wasn't clear from looking at the commit.

@hadley hadley commented Mar 15, 2019

I think it might make sense in R due to differing semantics. It is simply not feasible to make dplyr and a database agree for every possible input.
