MSSQL connection. Errors in dplyr select() after arrange #94

ghost · 2018-06-07T23:40:21Z

@pssguy commented on Aug 29, 2017, 10:22 PM UTC:

I am attempting to use an MSSQL connection and hitting this issue

I first replicate the example from the dbplyr intro

library(odbc)
library(DBI)

 library(tidyverse) 
 library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)
iris2 <- tbl(con, "iris")

iris2 %>% 
  arrange(Species) %>% 
  select(Sepal.Length)

This works fine

Now with an MSSQL connection

con2 <- dbConnect(odbc::odbc(), DSN = "premier")
DBI::dbWriteTable(con2, "iris", iris, overwrite=TRUE)
iris9 <- tbl(con2, "iris")

test1 <-iris9 %>% 
  arrange(Species) %>% 
  select(Sepal.Length)

test1

# Error: <SQL> 'SELECT TOP 1000 "Sepal.Length" AS "Sepal.Length" FROM (SELECT * FROM "iris" ORDER BY "Species") "odeiuzmtqh"' nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.

So it appears as though error has a conflict re use of TOP

Reversing the select and arrange commands

test2 <-iris9 %>% 
  select(Sepal.Length) %>% 
  arrange(Species) 

test2 # No error

This works in a simple example but I will sometimes need to arrange data prior to other processes in a pipe

When I look at the problem code it does not exactly replicate error i.e no mention of Top 1000

test1 %>% show_query()
# <SQL>
# SELECT "Sepal.Length" AS "Sepal.Length"
# FROM (SELECT *
# FROM "iris"
# ORDER BY "Species") "omlcgfsrjt"

Trying several alternatives in SQL


SELECT "Sepal.Length" AS "Sepal.Length"
 FROM (SELECT *
 FROM "iris"
 ORDER BY "Species") "omlcgfsrjt"

nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified. 
Failed to execute SQL chunk

something that looks like error code


SELECT TOP 1000 "Sepal.Length" AS "Sepal.Length"
 FROM (SELECT *
 FROM "iris"
 ORDER BY "Species") "omlcgfsrjt"

  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][SQL Server Native Client 11.0][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified. 
Failed to execute SQL chunk

replacing top 1000 in sub-query produces the desired output


SELECT  "Sepal.Length" AS "Sepal.Length"
 FROM (SELECT TOP 1000 *
 FROM "iris"
 ORDER BY "Species") "omlcgfsrjt"

Not sure if this is an error or just something that has not yet been addressed for MSSQL.

p.s. Why no issues option under dbplyr?

This issue was moved by hadley from tidyverse/dplyr/issues/3062.

The text was updated successfully, but these errors were encountered:

ghost · 2018-06-07T23:40:30Z

@hadley commented on Nov 2, 2017, 8:47 PM UTC:

Minimal reprex:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE) 

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mf <- copy_to(con, data.frame(x = 1:5, y = 5:1), name = "test")

mf %>%
  arrange(x) %>%
  select(y) %>%
  show_query()
#> <SQL>
#> SELECT `y`
#> FROM (SELECT *
#> FROM `test`
#> ORDER BY `x`)

DBI::dbGetQuery(con, "SELECT y FROM test ORDER BY x")
#>   y
#> 1 5
#> 2 4
#> 3 3
#> 4 2
#> 5 1

Ideally this would only generate one query because conceptually the select happens after the arrange.

I think that implies we can fix this issue by reordering select_query_clauses(), which currently looks like this:

  present <- c(
    where =    length(x$where) > 0,
    group_by = length(x$group_by) > 0,
    having =   length(x$having) > 0,
    select =   !identical(x$select, sql("*")),
    distinct = x$distinct,
    order_by = length(x$order_by) > 0,
    limit    = !is.null(x$limit)
  )

Currently select comes before arrange when really it should come afterwards.

And indeed if we move select to the end then we get:

SELECT `y`
FROM `test`
ORDER BY `x`

It remains to consider if this is actually correct - i.e. are there situations when this change would yield invalid SQL

ghost · 2018-06-07T23:40:31Z

@hadley commented on Nov 2, 2017, 9:13 PM UTC:

Ah I think the problem with performing this optimisation is this query:

memdb_frame(x = 1:2) %>%
    arrange(x) %>%
    mutate(x = -x)

This should return c(-1, -2), but if we collapse the query as described above we generate:

SELECT -`x` AS `x`
FROM `gatlqlicge`
ORDER BY `x`

which yields c(-2, -1) because the ORDER BY clause uses aliases defined in SELECT (as described in https://sqlbolt.com/lesson/select_queries_order_of_execution). This means that this optimisation is not possible in general.

But this is a mutate() and the motivation issue is a select(). Can we perform the optimisation at a higher level? I think the answer is no, because select()s can rename variables and this SQL would still be incorrect:

  memdb_frame(x = 1:2, y = 3:2) %>%
    arrange(x) %>%
    select(x = y) %>%
    show_query()
#> SELECT `y` AS `x`
#> FROM `bmvfznmfws`
#> ORDER BY `x`

hadley · 2019-02-05T19:58:52Z

Reprex of original problem:

library(DBI)
library(dplyr)

con <- dbConnect(odbc::odbc(), "SQL Server", database = "airontime")

x <- tbl(con, "airlines")

x %>%
  arrange(carrier) %>% 
  select(name) %>% 
  head()
#> Error in new_result(connection@ptr, statement) : 
#>  nanodbc/nanodbc.cpp:1344: HY000: The ORDER BY clause is invalid in views, inline functions, 
#> derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is 
#> also specified., Statement(s) could not be prepared.,

Because this is the generated SQL:

SELECT TOP 6 "name" FROM 
   (SELECT * FROM "airlines" ORDER BY "carrier") "jmausnbmvw"

so another option might be to try and push TOP down into the lower level.

hadley · 2019-02-05T20:02:41Z

Note that the problem also occurs for other queries:

x %>%
  arrange(carrier) %>% 
  select(name) %>% 
  mutate(name = substr(name, 1, 1)) %>% 
  collect()

i.e. it's the ORDER BY in the subquery that's the problem, not the TOP.

hadley · 2019-02-05T20:24:44Z

Having read through the MS SQL docs, I don't think there's anything the dbplyr can do about this — you just need to make sure that arrange() is always the last step in the pipe.

It would be nice if we could give a better error message here, but there's no easy way to do it, and given that no one else has commented on this issue, it seems unlikely to be a common problem.

This comment has been minimized.

Sign in to view

ghost mentioned this issue Jun 7, 2018

MSSQL connection. Errors in dplyr select() after arrange tidyverse/dplyr#3062

Closed

hadley added feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL labels Jan 2, 2019

hadley closed this as completed Feb 5, 2019

hadley mentioned this issue Feb 6, 2019

dplyr & TeraData | windowing | mutate to add rows #228

Closed

This was referenced Apr 9, 2019

Support ORDER BY in subqueries for MSSQL #275

Closed

Rethink optimization of ORDER BY #276

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MSSQL connection. Errors in dplyr select() after arrange #94

MSSQL connection. Errors in dplyr select() after arrange #94

ghost commented Jun 7, 2018 •

edited by hadley

Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

ghost commented Jun 7, 2018

ghost commented Jun 7, 2018

This comment has been minimized.

This comment has been minimized.

hadley commented Feb 5, 2019

hadley commented Feb 5, 2019

hadley commented Feb 5, 2019

MSSQL connection. Errors in dplyr select() after arrange #94

MSSQL connection. Errors in dplyr select() after arrange #94

Comments

ghost commented Jun 7, 2018 • edited by hadley Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

ghost commented Jun 7, 2018

ghost commented Jun 7, 2018

This comment has been minimized.

This comment has been minimized.

hadley commented Feb 5, 2019

hadley commented Feb 5, 2019

hadley commented Feb 5, 2019

ghost commented Jun 7, 2018 •

edited by hadley

Loading