-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use try_cast instead of cast for MSSQL version 11+ (2012+) #496
Use try_cast instead of cast for MSSQL version 11+ (2012+) #496
Conversation
A failure to cast (e.g. as.numeric or as.double, when the string contains a non-numeric character) results in an error: > library(magrittr)
> x <- mtcars %>% dplyr::mutate(wt = as.character(wt))
> x[1,6] <- "<2.62" # adds a non-numeric character '<' to an element
> DBI::dbWriteTable(con, "##mtcars", x) # where 'con' is an already defined connection to a MSSQL database
> con %>% dplyr::tbl("##mtcars") %>% dplyr::mutate(wt = as.numeric(wt))
Error: <SQL> 'SELECT TOP(11) "mpg", "cyl", "disp", "hp", "drat", CAST("wt" AS NUMERIC) AS "wt", "qsec", "vs", "am", "gear", "carb"
FROM "##mtcars"'
nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Error converting data type varchar to numeric. For Microsoft SQL (2012 and later) try_cast allows a more graceful return of NA. This pull request results in a more graceful failure (returns NA): > con %>% dplyr::tbl("##mtcars") %>% dplyr::mutate(wt = as.numeric(wt))
# Source: lazy query [?? x 11]
# Database: Microsoft SQL Server 12.00.2000[bpsrawdata@DAVIDLENOVOPC\BPSINSTANCE/BPSSAMPLES]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 NA 16.5 0 1 4 4
2 21 6 160 110 3.9 3 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3 20.2 1 0 3 1
7 14.3 8 360 245 3.21 4 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3 20 1 0 4 2
9 22.8 4 141. 95 3.92 3 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3 18.3 1 0 4 4
# ... with more rows Additional test code. All the library(magrittr)
library(lubridate)
library(tidyr)
library(dplyr)
df <- data.frame(China = c(5.72, 3.86, 2.6, 2.65, 2.31, 1.66, 1.60, 1.61, 1.63),
Taiwan = c(3.83, 2.99, 2.48, 1.96, 1.74, 1.74, 1.5, 1.16, 1.05),
Vietnam = c(6.46, 5.97, 5.05, 4.20, 3.55, 2.71, 2.01, 1.89, 1.94),
ROK = c(4.34, 3.46, 2.54, 1.84, 1.58, 1.62, 1.35, 1.17, 1.21),
Year = as.POSIXct(c("1970", "1975", "1980", "1985", "1990", "1995", "2000", "2005", "2010"), format = "%Y", origin = "1970-01-01"),
row.names = c("1970", "1975", "1980", "1985", "1990", "1995", "2000", "2005", "2010"))
df <- df %>%
gather(key = "Country", value = "NchildPerWoman", -Year) %>%
rename(n = NchildPerWoman) %>%
mutate(n = as.character(n), Year = as.character(Year)) # convert two columns to character
df[1,3] <- "5.72x" # creates invalid numeric value
df[2,1] <- "xyz" # create invalid date
# assumes 'con' is a connection to a MSSQL database
DBI::dbWriteTable(con, "##nchild", df)
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.character(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.integer64(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.integer(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.numeric(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.double(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.logical(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as.Date(Year))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as_date(Year))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as_datetime(Year))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as.character(Year)) # this causes hang on master, behaviour not changed with this pull request
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as.POSIXct(Year)) # this causes hang on master, behaviour not changed with this pull request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a bullet to the top of NEWS.md
? It should briefly describe the change and end with (@yourname, #issuenumber)
.
R/backend-mssql.R
Outdated
|
||
), | ||
if (!is(con, "DBIObject") || sub("\\..*", "", DBI::dbGetInfo(con)$db.version) >= 11) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please pull this out into separate function? I think that will make the main thrust of the code easier to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Hadley. Just to clarify, would you like me to place everything inside the curly braces { re-assignment of mssql_scalar } as a function? Or even the 'test condition' (!is.con(...) and version >= 11) should also be in the separate function?
I've just done the former, and placed those re-assignments in a (non-exported) function mssql_use_try_cast
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test of whether or not an connection has an adequate version of SQL server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, non-exported function mssql_version()
returns a list of ProductVersion_Major
/ProductVersion_Minor
/ProductVersion_Revision
(returns list of NULLs if argument con
is not a valid DBIObject
). mssql_version()
is used to determine whether the MSSQL server version is >=11, in which case sql_translate_env.Microsoft SQL Server
uses TRY_CAST
instead of CAST
where possible.
Also added to tests/testthat/test-backend-mssql.R
to test the small non-exported function mssql_interpret_version()
which is used to interpret the $db.version
string returned by DBI::dbGetInfo()
.
232f838
to
bbd884f
Compare
try_cast results in a more elegant failure (returns NA) if a cast is not possible ('cast' will throw an error). Address issue tidyverse#380
Address issue tidyverse#380
Address issue tidyverse#380
code to handle invalid DBIObject (as is the case during 'testthat' execution) change testthat expectations (in test-backend-mssql.R) to TRY_CAST as appropriate address issue tidyverse#380
'mssql_use_try_cast()' replaces 'sql_cast()' with 'sql_try_cast()' add NEWS.md entry address issue tidyverse#380
mssql_version calls 'mssql_interpret_version_string'
bbd884f
to
7f2adb7
Compare
mssql_version/mssql_interpret_version return list of three (not four) as currently DBI::dbGetInfo(x)$db.version does not return the fourth number 'Build' mssql_interpret_version returns a list of integers (not character) mssql_interpret_version_string rename to mssql_interpret_version address issue tidyverse#380
7f2adb7
to
37c045e
Compare
okay, done. |
I just pushed a simplification to your logic — for the real database case we can rely on |
Great! Thanks for fixing for what even a non-programmer like me realized was rather ugly code. All whilst I was listening to Taiwan Digital Minister Tang talk about open data and governance for a sustainable development lecture. Thanks for all the tidyverse tools! |
as.integer = sql_try_cast("NUMERIC"), | ||
# in MSSQL, NUMERIC converts to integer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what's the reasoning behind this. When I do this, I get a numeric
via FreeTDS:
con <- DBI::dbConnect(
odbc::odbc(),
"mssql-test",
uid = "kirill", pwd = keyring::key_get("mssql", "kirill")
)
library(tidyverse)
tbl <-
tibble(a = 1L) %>%
copy_to(con, ., name = "tbl_a") %>%
mutate(b = as.integer(a))
#> Created a temporary table named #tbl_a
tbl
#> # Source: lazy query [?? x 2]
#> # Database: Microsoft SQL Server 15.00.4073[@tp/test]
#> a b
#> <int> <dbl>
#> 1 1 1
dbplyr::sql_render(tbl)
#> <SQL> SELECT "a", TRY_CAST("a" AS NUMERIC) AS "b"
#> FROM "#tbl_a"
Created on 2020-11-16 by the reprex package (v0.3.0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In DavidPatShuiFong@4933037 I describe how I found that "in MSSQL casting to NUMERIC casts to integer, and is more flexible/graceful ... [NUMERIC] returns NA if unable to convert individual rows. Whereas casting to INTEGER/INT results in the entire column becoming NA if a single row cannot be converted to integer."
However, I see that in your example, the type of 'b' is not the desired result.
The same '' happens with the NchildPerWoman
example I showed as part of the test code
> x <- con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.numeric(n))
> dbplyr::sql_render(x)
<SQL> SELECT "Year", "Country", TRY_CAST("n" AS FLOAT) AS "n"
FROM "##nchild"
> x
# Source: lazy query [?? x 3]
# Database: Microsoft SQL Server 12.00.2000[bpsrawdata@DESKTOP-BH8LGMO\BPSINSTANCE/bpssamples]
Year Country n
<chr> <chr> <dbl>
1 1970-01-01 10:32:50 China NA
2 xyz China 3.86
3 1970-01-01 10:33:00 China 2.6
4 1970-01-01 10:33:05 China 2.65
5 1970-01-01 10:33:10 China 2.31
6 1970-01-01 10:33:15 China 1.66
7 1970-01-01 10:33:20 China 1.6
8 1970-01-01 10:33:25 China 1.61
9 1970-01-01 10:33:30 China 1.63
10 1970-01-01 10:32:50 Taiwan 3.83
# ... with more rows
> dbplyr::sql_render(x)
<SQL> SELECT "Year", "Country", TRY_CAST("n" AS NUMERIC) AS "n"
FROM "##nchild"
> x
# Source: lazy query [?? x 3]
# Database: Microsoft SQL Server 12.00.2000[bpsrawdata@DESKTOP-BH8LGMO\BPSINSTANCE/bpssamples]
Year Country n
<chr> <chr> <dbl>
1 1970-01-01 10:32:50 China NA
2 xyz China 4
3 1970-01-01 10:33:00 China 3
4 1970-01-01 10:33:05 China 3
5 1970-01-01 10:33:10 China 2
6 1970-01-01 10:33:15 China 2
7 1970-01-01 10:33:20 China 2
8 1970-01-01 10:33:25 China 2
9 1970-01-01 10:33:30 China 2
10 1970-01-01 10:32:50 Taiwan 4
# ... with more rows
n
is rounded to integers, but is still of type <dbl>
.
Perhaps the result of casting to NUMERIC
then needs to be cast to INT
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a double cast might work -- would still all rows be converted to NULL
if the numbers are too large or too small?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, double casting works! and doesn't set the entire column to NA
if a single row has an invalid value.
Happy to know if there is a more elegant solution, but here is what I did (from master...DavidPatShuiFong:feature/doubletrycastMSSQL )
changed try_cast
(translate_sql_helpers.R
) to accept a second optional parameter to do a second casting
#' @rdname sql_variant
#' @export
sql_try_cast <- function(type1, type2 = NULL) {
type1 <- sql(type1)
if (is.null(type2)) {
function(x) {
sql_expr(try_cast(!!x %as% !!type1))
# try_cast available in MSSQL 2012+
}
} else {
type2 <- sql(type2)
# if type2 is defined, then double-cast ('type1' then 'type2')
function(x) {
sql_expr(try_cast(try_cast(!!x %as% !!type1) %as% !!type2))
# try_cast available in MSSQL 2012+
}
}
}
Change the casting of as.integer
in backend-mssql.R
for mssql_version
>= 11.0
as.integer = sql_try_cast("NUMERIC", "INT"),
For the example I've given (I couldn't set up your example to work, perhaps because I am using an old version of dplyr
?)
> x <- con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.integer(n))
> dbplyr::sql_render(x)
<SQL> SELECT "Year", "Country", TRY_CAST(TRY_CAST("n" AS NUMERIC) AS INT) AS "n"
FROM "##nchild"
> x
# Source: lazy query [?? x 3]
# Database: Microsoft SQL Server 12.00.2000[bpsrawdata@DESKTOP-BH8LGMO\BPSINSTANCE/bpssamples]
Year Country n
<chr> <chr> <int>
1 1970-11-17 China NA
2 xyz China 4
3 1980-11-17 China 3
4 1985-11-17 China 3
5 1990-11-17 China 2
6 1995-11-17 China 2
7 2000-11-17 China 2
8 2005-11-17 China 2
9 2010-11-17 China 2
10 1970-11-17 Taiwan 4
# ... with more rows
(there was a mistake in the original example regarding the datetimes! now fixed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks good. I don't know an easier solution off the top of my head.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks good. I don't know an easier solution off the top of my head.
Thanks @krlmlr . After a bit more testing on 'production code' I have which uses dbplyr
, should I open a new issue and submit the fix as a pull request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. A PR referring to this conversation should be enough.
try_cast results in a more elegant failure (returns NA) if a cast is not possible ('cast' will normally throw an error).
Address issue #380
If Microsoft SQL is version 11 or greater (2012+) then call
try_cast
instead ofcast
when converting character tological
Date
POSIXct
numeric
double
integer
integer64
character
date
datetime
In line
155
ofR/backend-mssql.R
,try_cast
is also used ifcon
is not aDBIObject
. This should not happen in normal operation, but is the case if the code is called during use oftestthat
intests/testthat/test-backend-mssql.R
.