Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor substr() translation #577

Closed
michal-przydacz opened this issue Jan 21, 2021 · 11 comments · Fixed by #580
Closed

Poor substr() translation #577

michal-przydacz opened this issue Jan 21, 2021 · 11 comments · Fixed by #580
Labels
feature a feature request or enhancement func trans 🌍 Translation of individual functions to SQL

Comments

@michal-przydacz
Copy link

michal-przydacz commented Jan 21, 2021

Hi,
I was trying to create an SQL query which would grab the end of a string. In this example it was supposed to be the last five characters. However, the show_query function shows it returns null rather than calculated integers.
Is that a correct behaviour?

library(RSQLite)
library(DBI)
library(dplyr)
library(dbplyr)

# Create example SQL database and load with example table
conn <- DBI::dbConnect(RSQLite::SQLite(), "")
DBI::dbWriteTable(conn, 
                  "iris", 
                  iris %>% distinct(Species), 
                  overwrite = TRUE)

# Expected behaviour
DBI::dbGetQuery(conn,
                "SELECT `Species`
                        ,LENGTH(`Species`) AS `Length`
                        ,SUBSTR(`Species`, LENGTH(`Species`)-4, LENGTH(`Species`)) AS `Trimmed`
                FROM `iris`")
#>      Species Length Trimmed
#> 1     setosa      6   etosa
#> 2 versicolor     10   color
#> 3  virginica      9   inica

# Actual code example; start and stop substituted with NULL
tbl(conn, DBI::Id(Table = "iris")) %>%
    mutate(Length = length(Species),
           Trimmed = substr(Species, start = length(Species)-4, stop = length(Species))) %T>%
    show_query() %>%
    collect()
#> Warning in substr(Species, start = length(Species) - 4, stop = length(Species)):
#> NAs introduced by coercion
#> Warning in pmax(as.integer(stop) - start + 1L, 0L): NAs introduced by coercion
#> <SQL>
#> SELECT `Species`, length(`Species`) AS `Length`, SUBSTR(`Species`, NULL, NULL) AS `Trimmed`
#> FROM `iris`
#> Warning in substr(Species, start = length(Species) - 4, stop = length(Species)):
#> NAs introduced by coercion

#> Warning in substr(Species, start = length(Species) - 4, stop = length(Species)):
#> NAs introduced by coercion
#> # A tibble: 3 x 3
#>   Species    Length Trimmed
#>   <chr>       <int> <lgl>  
#> 1 setosa          6 NA     
#> 2 versicolor     10 NA     
#> 3 virginica       9 NA

Created on 2021-01-21 by the reprex package (v0.3.0)

I've tried using nchar and len instead of R length but with the same effect (My actual database uses len rather than length).

Thanks

Edit: Provided a better example showing that using native SQL code gets desired result

@hadley
Copy link
Member

hadley commented Jan 21, 2021

Minimal reprex:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = c("abc", "defsg", "highs"))

db %>% 
  mutate(y = substr(x, 2, nchar(x) - 1)) %>% 
  show_query()
#> Warning in pmax(as.integer(stop) - start + 1L, 0L): NAs introduced by coercion
#> <SQL>
#> SELECT `x`, SUBSTR(`x`, 2, NULL) AS `y`
#> FROM `dbplyr_001`

Created on 2021-01-21 by the reprex package (v0.3.0.9001)

@hadley hadley added feature a feature request or enhancement func trans 🌍 Translation of individual functions to SQL labels Jan 21, 2021
@hadley
Copy link
Member

hadley commented Jan 21, 2021

Hmmm, this is going to be hard to support, I think. The easiest work around is to use sub_str()

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = c("abc", "defsg", "highs"))

db %>% 
  mutate(y = str_sub(x, 2, -2))
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.34.0 [:memory:]
#>   x     y    
#>   <chr> <chr>
#> 1 abc   b    
#> 2 defsg efs  
#> 3 highs igh

Created on 2021-01-21 by the reprex package (v0.3.0.9001)

@michal-przydacz
Copy link
Author

Thanks for the reply.
Unfortunately, this work around won't work, as it still misses out the last character.

library(dbplyr)
library(stringr)
translate_sql(str_sub(Test, 1, -2))
#> <SQL> SUBSTR(`Test`, 1, LENGTH(`Test`) - 1)

translate_sql(str_sub(Test, 1, length(Test)))
#> Warning in str_sub(Test, 1, length(Test)): NAs introduced by coercion
#> Error in if (end == -1L) {: missing value where TRUE/FALSE needed

Created on 2021-01-22 by the reprex package (v0.3.0)

Interestingly, str_sub() already uses the translation to length() but doesn't seem to be able to do it when specifically asked

@hadley
Copy link
Member

hadley commented Jan 22, 2021

Use -1 if you want to include the last character.

@michal-przydacz
Copy link
Author

Yep, that'd do it, thanks.
I think in that case this issue is unnecessary, as str_sub() with negative start/stop options do exactly that.
I pretty much always use tidyverse function but somehow within dbplyr related calls I tend to use base. I will change that now.
Thanks.

@michal-przydacz
Copy link
Author

The str_sub() works in a reprex example but not with the SQL database I work with.
The translated substring function in this case is SUBSTRING and str_sub(..., start = -10, end = -1) when passing on attributes skips the last one entirely, causing a missing attribute error.

This reprex illustrates skipping 3rd attribute

library(dbplyr)
library(stringr)
translate_sql(str_sub(Test, -10, -1),
                     con = simulate_mssql())
#> <SQL> SUBSTRING(`Test`, -10)

Created on 2021-01-25 by the reprex package (v0.3.0)

@hadley
Copy link
Member

hadley commented Jan 25, 2021

Ah bummer, it seems like SQL server doesn't have an option third argument, unlike pretty much every other database (and, as far as I can tell, the SQL spec).

@hadley
Copy link
Member

hadley commented Jan 25, 2021

That said, there's clearly a bug because no database is going to accept a negative start value.

@hadley
Copy link
Member

hadley commented Jan 25, 2021

No, that's not right — SUBSTR() does seem accept negative values; but SUBSTRING() does not.

@hadley
Copy link
Member

hadley commented Jan 25, 2021

Looks like all the backends that use SUBSTRING currently have an incorrect translation (SQL server, redshift, HANA)

hadley added a commit that referenced this issue Jan 25, 2021
The handling of negative start seems to be inconsistent from database to database, so I think it's safer to just always use LENGTH(). New optional_length parameter allows for MS SQL compatibility.

Fixes #577
@hadley hadley changed the title dbplyr doesn't handle substr with dynamic start and stop Poor substr() translation Feb 3, 2021
hadley added a commit that referenced this issue Feb 3, 2021
Improving handling of negative values, and new optional_length parameter allows for MS SQL compatibility.

Fixes #577
@michal-przydacz
Copy link
Author

I've just tested it on the SQL server and can confirm it works perfectly. Thank you for solving it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement func trans 🌍 Translation of individual functions to SQL
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants