Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use try_cast instead of cast for MSSQL version 11+ (2012+) #496

Merged
merged 11 commits into from
Sep 22, 2020

Conversation

DavidPatShuiFong
Copy link
Contributor

@DavidPatShuiFong DavidPatShuiFong commented Sep 20, 2020

try_cast results in a more elegant failure (returns NA) if a cast is not possible ('cast' will normally throw an error).
Address issue #380

If Microsoft SQL is version 11 or greater (2012+) then call try_cast instead of cast when converting character to

  • logical
  • Date
  • POSIXct
  • numeric
  • double
  • integer
  • integer64
  • character
  • date
  • datetime

In line 155 of R/backend-mssql.R, try_cast is also used if con is not a DBIObject. This should not happen in normal operation, but is the case if the code is called during use of testthat in tests/testthat/test-backend-mssql.R.

@DavidPatShuiFong
Copy link
Contributor Author

DavidPatShuiFong commented Sep 20, 2020

A failure to cast (e.g. as.numeric or as.double, when the string contains a non-numeric character) results in an error:

> library(magrittr)
> x <- mtcars %>% dplyr::mutate(wt = as.character(wt))
> x[1,6] <- "<2.62" # adds a non-numeric character '<' to an element
> DBI::dbWriteTable(con, "##mtcars", x) # where 'con' is an already defined connection to a MSSQL database
> con %>% dplyr::tbl("##mtcars") %>% dplyr::mutate(wt = as.numeric(wt))
Error: <SQL> 'SELECT TOP(11) "mpg", "cyl", "disp", "hp", "drat", CAST("wt" AS NUMERIC) AS "wt", "qsec", "vs", "am", "gear", "carb"
FROM "##mtcars"'
  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Error converting data type varchar to numeric. 

For Microsoft SQL (2012 and later) try_cast allows a more graceful return of NA.

This pull request results in a more graceful failure (returns NA):

> con %>% dplyr::tbl("##mtcars") %>% dplyr::mutate(wt = as.numeric(wt))
# Source:   lazy query [?? x 11]
# Database: Microsoft SQL Server 12.00.2000[bpsrawdata@DAVIDLENOVOPC\BPSINSTANCE/BPSSAMPLES]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9     NA  16.5     0     1     4     4
 2  21       6  160    110  3.9      3  17.0     0     1     4     4
 3  22.8     4  108     93  3.85     2  18.6     1     1     4     1
 4  21.4     6  258    110  3.08     3  19.4     1     0     3     1
 5  18.7     8  360    175  3.15     3  17.0     0     0     3     2
 6  18.1     6  225    105  2.76     3  20.2     1     0     3     1
 7  14.3     8  360    245  3.21     4  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69     3  20       1     0     4     2
 9  22.8     4  141.    95  3.92     3  22.9     1     0     4     2
10  19.2     6  168.   123  3.92     3  18.3     1     0     4     4
# ... with more rows

Additional test code.

All the mutate fail with an error in the current v1.4.4 release.
This pull request returns NA for invalid data.

library(magrittr)
library(lubridate)
library(tidyr)
library(dplyr)

df <- data.frame(China = c(5.72, 3.86, 2.6, 2.65, 2.31, 1.66, 1.60, 1.61, 1.63),
                Taiwan = c(3.83, 2.99, 2.48, 1.96, 1.74, 1.74, 1.5, 1.16, 1.05),
                Vietnam = c(6.46, 5.97, 5.05, 4.20, 3.55, 2.71, 2.01, 1.89, 1.94),
                ROK = c(4.34, 3.46, 2.54, 1.84, 1.58, 1.62, 1.35, 1.17, 1.21),
                Year = as.POSIXct(c("1970", "1975", "1980", "1985", "1990", "1995", "2000", "2005", "2010"), format = "%Y", origin = "1970-01-01"),
                row.names = c("1970", "1975", "1980", "1985", "1990", "1995", "2000", "2005", "2010"))

df <- df %>%
 gather(key = "Country", value = "NchildPerWoman", -Year) %>%
 rename(n = NchildPerWoman) %>%
 mutate(n = as.character(n), Year = as.character(Year)) # convert two columns to character

df[1,3] <- "5.72x" # creates invalid numeric value
df[2,1] <- "xyz" # create invalid date

# assumes 'con' is a connection to a MSSQL database

DBI::dbWriteTable(con, "##nchild", df)
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.character(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.integer64(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.integer(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.numeric(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.double(n))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.logical(n))


con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as.Date(Year))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as_date(Year))
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as_datetime(Year))

con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as.character(Year)) # this causes hang on master, behaviour not changed with this pull request
con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(Year = as.POSIXct(Year)) # this causes hang on master, behaviour not changed with this pull request

@DavidPatShuiFong DavidPatShuiFong changed the title Feature/try cast mssql Use try_cast instead of cast for MSSQL version 11+ (2012+) Sep 20, 2020
Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a bullet to the top of NEWS.md? It should briefly describe the change and end with (@yourname, #issuenumber).


),
if (!is(con, "DBIObject") || sub("\\..*", "", DBI::dbGetInfo(con)$db.version) >= 11) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please pull this out into separate function? I think that will make the main thrust of the code easier to understand.

Copy link
Contributor Author

@DavidPatShuiFong DavidPatShuiFong Sep 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Hadley. Just to clarify, would you like me to place everything inside the curly braces { re-assignment of mssql_scalar } as a function? Or even the 'test condition' (!is.con(...) and version >= 11) should also be in the separate function?

I've just done the former, and placed those re-assignments in a (non-exported) function mssql_use_try_cast.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test of whether or not an connection has an adequate version of SQL server

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, non-exported function mssql_version() returns a list of ProductVersion_Major/ProductVersion_Minor/ProductVersion_Revision (returns list of NULLs if argument con is not a valid DBIObject). mssql_version() is used to determine whether the MSSQL server version is >=11, in which case sql_translate_env.Microsoft SQL Server uses TRY_CAST instead of CAST where possible.

Also added to tests/testthat/test-backend-mssql.R to test the small non-exported function mssql_interpret_version() which is used to interpret the $db.version string returned by DBI::dbGetInfo().

@DavidPatShuiFong DavidPatShuiFong force-pushed the feature/try_castMSSQL branch 2 times, most recently from 232f838 to bbd884f Compare September 21, 2020 11:38
try_cast results in a more elegant failure (returns NA) if a cast is not possible ('cast' will throw an error).
Address issue tidyverse#380
code to handle invalid DBIObject (as is the case during 'testthat' execution)
change testthat expectations (in test-backend-mssql.R) to TRY_CAST as appropriate
address issue tidyverse#380
'mssql_use_try_cast()' replaces 'sql_cast()' with 'sql_try_cast()'
add NEWS.md entry
address issue tidyverse#380
mssql_version calls 'mssql_interpret_version_string'
mssql_version/mssql_interpret_version return list of three (not four) as currently DBI::dbGetInfo(x)$db.version does not return the fourth number 'Build'
mssql_interpret_version returns a list of integers (not character)
mssql_interpret_version_string rename to mssql_interpret_version
address issue tidyverse#380
@DavidPatShuiFong
Copy link
Contributor Author

Can you please add a bullet to the top of NEWS.md? It should briefly describe the change and end with (@yourname, #issuenumber).

okay, done.

@hadley
Copy link
Member

hadley commented Sep 22, 2020

I just pushed a simplification to your logic — for the real database case we can rely on numeric_version() to do most of the work, and for the testing case I added an explicit version component to the simulation.

@DavidPatShuiFong
Copy link
Contributor Author

DavidPatShuiFong commented Sep 22, 2020

I just pushed a simplification to your logic — for the real database case we can rely on numeric_version() to do most of the work, and for the testing case I added an explicit version component to the simulation.

Great! Thanks for fixing for what even a non-programmer like me realized was rather ugly code. All whilst I was listening to Taiwan Digital Minister Tang talk about open data and governance for a sustainable development lecture.

Thanks for all the tidyverse tools!

@hadley hadley merged commit 8950018 into tidyverse:master Sep 22, 2020
Comment on lines +163 to +164
as.integer = sql_try_cast("NUMERIC"),
# in MSSQL, NUMERIC converts to integer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what's the reasoning behind this. When I do this, I get a numeric via FreeTDS:

con <- DBI::dbConnect(
  odbc::odbc(),
  "mssql-test",
  uid = "kirill", pwd = keyring::key_get("mssql", "kirill")
)

library(tidyverse)

tbl <-
  tibble(a = 1L) %>%
  copy_to(con, ., name = "tbl_a") %>%
  mutate(b = as.integer(a))
#> Created a temporary table named #tbl_a

tbl
#> # Source:   lazy query [?? x 2]
#> # Database: Microsoft SQL Server 15.00.4073[@tp/test]
#>       a     b
#>   <int> <dbl>
#> 1     1     1
dbplyr::sql_render(tbl)
#> <SQL> SELECT "a", TRY_CAST("a" AS NUMERIC) AS "b"
#> FROM "#tbl_a"

Created on 2020-11-16 by the reprex package (v0.3.0)

Copy link
Contributor Author

@DavidPatShuiFong DavidPatShuiFong Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In DavidPatShuiFong@4933037 I describe how I found that "in MSSQL casting to NUMERIC casts to integer, and is more flexible/graceful ... [NUMERIC] returns NA if unable to convert individual rows. Whereas casting to INTEGER/INT results in the entire column becoming NA if a single row cannot be converted to integer."

However, I see that in your example, the type of 'b' is not the desired result.

The same '' happens with the NchildPerWoman example I showed as part of the test code

> x <- con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.numeric(n))
> dbplyr::sql_render(x)
<SQL> SELECT "Year", "Country", TRY_CAST("n" AS FLOAT) AS "n"
FROM "##nchild"
> x
# Source:   lazy query [?? x 3]
# Database: Microsoft SQL Server 12.00.2000[bpsrawdata@DESKTOP-BH8LGMO\BPSINSTANCE/bpssamples]
   Year                Country     n
   <chr>               <chr>   <dbl>
 1 1970-01-01 10:32:50 China   NA   
 2 xyz                 China    3.86
 3 1970-01-01 10:33:00 China    2.6 
 4 1970-01-01 10:33:05 China    2.65
 5 1970-01-01 10:33:10 China    2.31
 6 1970-01-01 10:33:15 China    1.66
 7 1970-01-01 10:33:20 China    1.6 
 8 1970-01-01 10:33:25 China    1.61
 9 1970-01-01 10:33:30 China    1.63
10 1970-01-01 10:32:50 Taiwan   3.83
# ... with more rows


> dbplyr::sql_render(x)
<SQL> SELECT "Year", "Country", TRY_CAST("n" AS NUMERIC) AS "n"
FROM "##nchild"
> x
# Source:   lazy query [?? x 3]
# Database: Microsoft SQL Server 12.00.2000[bpsrawdata@DESKTOP-BH8LGMO\BPSINSTANCE/bpssamples]
   Year                Country     n
   <chr>               <chr>   <dbl>
 1 1970-01-01 10:32:50 China      NA
 2 xyz                 China       4
 3 1970-01-01 10:33:00 China       3
 4 1970-01-01 10:33:05 China       3
 5 1970-01-01 10:33:10 China       2
 6 1970-01-01 10:33:15 China       2
 7 1970-01-01 10:33:20 China       2
 8 1970-01-01 10:33:25 China       2
 9 1970-01-01 10:33:30 China       2
10 1970-01-01 10:32:50 Taiwan      4
# ... with more rows

n is rounded to integers, but is still of type <dbl>.

Perhaps the result of casting to NUMERIC then needs to be cast to INT!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a double cast might work -- would still all rows be converted to NULL if the numbers are too large or too small?

Copy link
Contributor Author

@DavidPatShuiFong DavidPatShuiFong Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, double casting works! and doesn't set the entire column to NA if a single row has an invalid value.

Happy to know if there is a more elegant solution, but here is what I did (from master...DavidPatShuiFong:feature/doubletrycastMSSQL )

changed try_cast (translate_sql_helpers.R) to accept a second optional parameter to do a second casting

#' @rdname sql_variant
#' @export
sql_try_cast <- function(type1, type2 = NULL) {
  type1 <- sql(type1)
  if (is.null(type2)) {
    function(x) {
      sql_expr(try_cast(!!x %as% !!type1))
      # try_cast available in MSSQL 2012+
    }
  } else {
    type2 <- sql(type2)
    # if type2 is defined, then double-cast ('type1' then 'type2')
    function(x) {
      sql_expr(try_cast(try_cast(!!x %as% !!type1) %as% !!type2))
      # try_cast available in MSSQL 2012+
    }
  }
}

Change the casting of as.integer in backend-mssql.R for mssql_version >= 11.0

 as.integer = sql_try_cast("NUMERIC", "INT"),

For the example I've given (I couldn't set up your example to work, perhaps because I am using an old version of dplyr?)

> x <- con %>% dplyr::tbl("##nchild") %>% dplyr::mutate(n = as.integer(n))
> dbplyr::sql_render(x)
<SQL> SELECT "Year", "Country", TRY_CAST(TRY_CAST("n" AS NUMERIC) AS INT) AS "n"
FROM "##nchild"
> x
# Source:   lazy query [?? x 3]
# Database: Microsoft SQL Server 12.00.2000[bpsrawdata@DESKTOP-BH8LGMO\BPSINSTANCE/bpssamples]
   Year       Country     n
   <chr>      <chr>   <int>
 1 1970-11-17 China      NA
 2 xyz        China       4
 3 1980-11-17 China       3
 4 1985-11-17 China       3
 5 1990-11-17 China       2
 6 1995-11-17 China       2
 7 2000-11-17 China       2
 8 2005-11-17 China       2
 9 2010-11-17 China       2
10 1970-11-17 Taiwan      4
# ... with more rows

(there was a mistake in the original example regarding the datetimes! now fixed)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good. I don't know an easier solution off the top of my head.

Copy link
Contributor Author

@DavidPatShuiFong DavidPatShuiFong Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good. I don't know an easier solution off the top of my head.

Thanks @krlmlr . After a bit more testing on 'production code' I have which uses dbplyr, should I open a new issue and submit the fix as a pull request?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. A PR referring to this conversation should be enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants