New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SQL Server] The encoding of the SQL affects the results returned by dbGetQuery() #179

Closed
shrektan opened this Issue May 8, 2018 · 7 comments

Comments

Projects
None yet
4 participants
@shrektan

shrektan commented May 8, 2018

Issue Description and Expected Result

Due to the the archive of RSQLServer, I switched to odbc recently. Everything seems working fine. Except this annoying behavior:

Background: I'm working on a windows machine with GB2312 as the native encoding. The SQL Server uses GB2312 as the default encoding as well (what I mean is that the result it returns is GB2312 and it stores the data as GB2312)

If the SQL contains non-ASCII characters, the results returned by dbGetQuery() will depend on the encoding. More specifically, I can get the correct results, only when the SQL encoding is the same as the database encoding. Otherwise, it either throws error or returns a different result.

What I'm expecting is odbc always returns the same result no matter I use native encoding or UTF-8 encoding for the SQL.

I will be very grateful if there's any quick fix for this because it's quite difficult for me to ensure the SQL is always encoded in GB2312 (Imagine glue::glue() always returns UTF-8 strings. Moreover, although I haven't tested on a Linux server, I bet the I may need to use iconv(sql, from = 'UTF-8', to = 'GB2312') all the time to get it work... )

BTW, I can confirm that both RSQLServer and RODBC will return the same results no matter what the encoding is.

Thanks.

Database

 select @@VERSION
Microsoft SQL Server 2014 - 12.0.2000.8 (X64) 
	Feb 20 2014 20:04:26 
	Copyright (c) Microsoft Corporation
	Enterprise Edition (64-bit) on Windows NT 6.2 <X64> (Build 9200: )

Reproducible Example

library(odbc)
con <- DBI::dbConnect(
  odbc::odbc(), server = "ip-address", 
  database = "db", uid = "uid", pwd = "pwd", 
  driver = "SQL Server", encoding = "GB2312"
)
tmp_tbl <- tibble::tribble(
  ~CN_COL, ~EN_COL,
  "", 1,
  "", 2,
  "", 3,
  "", 4,
  "", 5
)
DBI::dbWriteTable(con, "tmp_tbl", tmp_tbl)
sql <- "select (case CN_COL when '一' then 'one' when '二' then 'two' else 'others' end) as CN_COL, EN_COL
from tmp_tbl"
DBI::dbGetQuery(con, sql)
#>   CN_COL EN_COL
#> 1    one      1
#> 2    two      2
#> 3 others      3
#> 4 others      4
#> 5 others      5
DBI::dbGetQuery(con, enc2utf8(sql))
# Error: <SQL> 'select (case CN_COL when '涓€' then 'one' when '浜? then 'two' else 'others' end) as CN_COL, EN_COL from tmp_tbl'
#  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Incorrect syntax near 'two'.
DBI::dbRemoveTable(con, "tmp_tbl")
DBI::dbDisconnect(con)

In addition

Change the drivers won't help

I've tried all the three SQL Server ODBC drivers on my computer but it doesn't help.

image

RODBC will convert the sql to the same encoding as the database automatically

> RODBC::odbcQuery
function (channel, query, rows_at_time = attr(channel, "rows_at_time")) 
{
    if (!odbcValidChannel(channel)) 
        stop("first argument is not an open RODBC channel")
    if (nchar(enc <- attr(channel, "encoding"))) 
        query <- iconv(query, to = enc)
    .Call(C_RODBCQuery, attr(channel, "handle_ptr"), as.character(query), 
        as.integer(rows_at_time))
}
<environment: namespace:RODBC>
Session Info
devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                                              
 version  R version 3.4.4 (2018-03-15)                       
 os       Windows 7 x64 SP 1                                 
 system   x86_64, mingw32                                    
 ui       RStudio                                            
 language (EN)                                               
 collate  Chinese (Simplified)_People's Republic of China.936
 tz       Asia/Taipei                                        
 date     2018-05-08                                         

─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version     date       source                             
 assertthat    0.2.0       2017-04-11 CRAN (R 3.4.4)                     
 backports     1.1.2       2017-12-13 CRAN (R 3.4.3)                     
 bit           1.1-12      2014-04-09 CRAN (R 3.4.1)                     
 bit64         0.9-7       2017-05-08 CRAN (R 3.4.0)                     
 blob          1.1.1       2018-03-25 CRAN (R 3.4.4)                     
 cli           1.0.0       2017-11-05 CRAN (R 3.4.4)                     
 clisymbols    1.2.0       2017-05-21 CRAN (R 3.4.4)                     
 crayon        1.3.4       2017-09-16 CRAN (R 3.4.4)                     
 DBI           1.0.0       2018-05-02 CRAN (R 3.4.4)                     
 desc          1.2.0       2018-05-01 CRAN (R 3.4.4)                     
 devtools      1.13.5.9000 2018-04-08 local                              
 digest        0.6.15      2018-01-28 CRAN (R 3.4.3)                     
 hms           0.4.2       2018-03-10 CRAN (R 3.4.4)                     
 magrittr      1.5         2014-11-22 CRAN (R 3.4.4)                     
 memoise       1.1.0       2017-04-21 CRAN (R 3.4.4)                     
 odbc        * 1.1.5       2018-01-23 CRAN (R 3.4.3)                     
 pillar        1.2.2       2018-04-26 CRAN (R 3.4.4)                     
 pkgbuild      0.0.0.9000  2017-12-06 Github (r-lib/pkgbuild@ce7f6d1)    
 pkgconfig     2.0.1       2017-03-21 CRAN (R 3.4.4)                     
 pkgload       0.0.0.9000  2017-12-06 Github (r-lib/pkgload@70eaef8)     
 R6            2.2.2       2017-06-17 CRAN (R 3.4.4)                     
 Rcpp          0.12.16     2018-03-13 CRAN (R 3.4.4)                     
 rlang         0.2.0       2018-02-20 CRAN (R 3.4.3)                     
 rprojroot     1.3-2       2018-01-03 CRAN (R 3.4.3)                     
 rstudioapi    0.7.0-9000  2018-04-13 Github (rstudio/rstudioapi@e87b481)
 sessioninfo   1.0.1.9000  2017-12-06 Github (r-lib/sessioninfo@c871d01) 
 testthat      2.0.0       2017-12-13 CRAN (R 3.4.4)                     
 tibble        1.4.2       2018-01-22 CRAN (R 3.4.3)                     
 usethis     * 1.3.0       2018-02-24 CRAN (R 3.4.4)                     
 withr         2.1.2       2018-03-15 CRAN (R 3.4.4)                     
 yaml          2.1.19      2018-05-01 CRAN (R 3.4.4)     

@jimhester jimhester added the bug label May 8, 2018

@Leo-Lee15

This comment has been minimized.

Leo-Lee15 commented May 10, 2018

I use the odbc package and implyr package to connect to Impala and also encounter the annoying encoding problems. The columns containing Chinese characters are all corrupted.

@jimhester

This comment has been minimized.

Member

jimhester commented Jun 8, 2018

Should be fixed by 7c763d5

@jimhester jimhester closed this Jun 8, 2018

@shrektan

This comment has been minimized.

shrektan commented Jun 8, 2018

I think it won’t work. The statement should be encoded to the native encoding of the db server. As my example shows, UTF8 encoding is the one that fails.

@jimhester

This comment has been minimized.

Member

jimhester commented Jun 9, 2018

Yes I did that with the next commit, 3b01fe1

@shrektan

This comment has been minimized.

shrektan commented Jun 10, 2018

great, thanks.

@hienghtk

This comment has been minimized.

hienghtk commented Aug 4, 2018

I have the same issue when using ODBC to connect to Impala and fetching tables.
The string does not return the native encoding of the db server. If I change encoding to utf8, some characters display wrong.
Ex: "Kích hoạt" change to "Kích ho<U+1EA1>t"

@shrektan

This comment has been minimized.

shrektan commented Aug 10, 2018

@jimhester I can confirm it gets fixed. Thanks!

However, people may encounter errors, if the param encoding was set up in a Linux machine that connects to an MS SQL Server using Free TDS when the returned results contain multibyte chars.

So I'd like to share my solution here, which is adding a new param clientcharset whose value is the same as encoding.

The reason of needing this new param is that Free TDS driver will convert the returned table to UTF-8 encoding by default. However, odbc package will try to convert the table to UTF-8 as well, using the encoding param - without knowing the table has already been re-encoded. Thus, odbc will complain there exist wrong multiple byte chars.

An example of connecting to an MS SQL Server

conn <- DBI::dbConnect(
  drv = odbc::odbc() ,
  server = "your ip",
  port = 1433, # Free TDS requires the port explicitly
  database = "your db",
  uid = "your user name",
  pwd = "your password",
  encoding = "GB2312", # the Encoding of the server
  driver = "FreeTDS", # you have to set it up in `/etc/odbcinst.ini` so that unixODBC can find the driver
  clientcharset = "GB2312" # This line is what I'm talking about!
)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment