-
Notifications
You must be signed in to change notification settings - Fork 182
Description
I am working with a remote MySQL database trough dplyr / dbplyr. I stumbled upon a strange behaviour today that puzzles me.
I am inserting a very long text (JSON) into a column with Type LONGTEXT (which has enough space to store the string in the db)
I experienced truncation of the string to maximum 65535 characters (byte limit for e.g. TEXT columns) when using dplyr::rows_update
. When using an alternative approach using DBI directly this truncation does not occur.
My current naive thinking goes into the direction of the 'copy_to()' mechanisms. I saw in the source of dbplyr
that the method dbplyr::detect_col_types()
for class DBIConnection
returns NULL. So if the new df is created in the remote DB with default col Types (e.g. TEXT vs LONGTEXT) this would explain this behaviour.
Line 775 in 09cb679
get_col_types.DBIConnection <- function(con, name, call) { |
Am I trying to use dbplyr outside of its intended usage? If yes believe this should at least be documented / warned.
This is the code i execute:
library(DBI)
library(dplyr)
# Establish the database connection
con <- dbConnect(odbc::odbc(), Driver = "MySQL ODBC 8.0 ANSI Driver",
Server = "your_server", Database = "your_database",
User = "your_username", Password = "your_password",
timeout = 10)
# Create a sample data frame with long JSON data
json_data <- paste(rep('{"key":"', 100000), "a", rep('"}', 100000), collapse = "")
json_data <- paste0('{"data": "', json_data, '"}')
df <- data.frame(id = 1:3, json = rep(json_data, 3))
print(nchar(json_data))
# Create Table in remote database with name `temp_table` with a column `json` set to LONGTEXT
query <- "
CREATE TABLE temp_table (
id INT PRIMARY KEY,
json LONGTEXT
)
"
dbExecute(con, query)
# Use dplyr to update the remote database table
tbl(con, "temp_table") %>%
rows_update(df,by = c("id"), copy=TRUE, in_place = TRUE, unmatched='ignore')
resDplyr <- tbl(con,"temp_table") %>%
dplyr::collect()
nchar(resDplyr$json[1])
# DBI method -----------------------------------------------------------------
update_statement <- "UPDATE temp_table SET json = ? WHERE id = ? "
# Iterate over the rows in the data frame and update the database
for (i in seq_len(nrow(df))) {
# Get the values for the current row
json_data <- df$json[i]
id <- df$id[i]
# Execute the update statement with the current row values
dbExecute(con, update_statement, list(json_data, id))
}
resDBI <- tbl(con,"temp_table") %>%
dplyr::collect()
nchar(resDBI$json[1])
# Close the database connection
dbDisconnect(con)
This is the console output of the relevant portion:
> # Create a sample data frame with long JSON data
> json_data <- paste(rep('{"key":"', 100000), "a", rep('"}', 100000), collapse = "")
> json_data <- paste0('{"data": "', json_data, '"}')
>
> df <- data.frame(id = 1:3, json = rep(json_data, 3))
>
> print(nchar(json_data))
[1] 1300012
>
> # Table in remote database has table `temp_table` with a column `json` set to LONGTEXT
>
> # Use dplyr to update the remote database table
>
> tbl(con, "temp_table") %>%
+ rows_update(df,by = c("id"), copy=TRUE, in_place = TRUE, unmatched='ignore')
>
> resDplyr <- tbl(con,"temp_table") %>%
+ dplyr::collect()
>
> nchar(resDplyr$json[1])
[1] 65535
>
>
> # DBI method -----------------------------------------------------------------
> update_statement <- "UPDATE temp_table SET json = ? WHERE id = ? "
>
>
> # Iterate over the rows in the data frame and update the database
> for (i in seq_len(nrow(df))) {
+ # Get the values for the current row
+ json_data <- df$json[i]
+ id <- df$id[i]
+
+ # Execute the update statement with the current row values
+ dbExecute(con, update_statement, list(json_data, id))
+ }
>
> resDBI <- tbl(con,"temp_table") %>%
+ dplyr::collect()
>
> nchar(resDBI$json[1])
[1] 1300012
>
> # Close the database connection
> dbDisconnect(con)
> ?copy_to
>
and this is the sessionInfo()
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_Switzerland.utf8 LC_CTYPE=English_Switzerland.utf8 LC_MONETARY=English_Switzerland.utf8
[4] LC_NUMERIC=C LC_TIME=English_Switzerland.utf8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.2 DBI_1.1.3
loaded via a namespace (and not attached):
[1] utf8_1.2.3 R6_2.5.1 bit_4.0.5 tidyselect_1.2.0 magrittr_2.0.3 odbc_1.3.4 glue_1.6.2 blob_1.2.4
[9] tibble_3.2.1 pkgconfig_2.0.3 bit64_4.0.5 generics_0.1.3 lifecycle_1.0.3 cli_3.6.1 fansi_1.0.4 vctrs_0.6.2
[17] withr_2.5.0 compiler_4.3.1 purrr_1.0.1 rstudioapi_0.14 tools_4.3.1 hms_1.1.3 pillar_1.9.0 Rcpp_1.0.10
[25] rlang_1.1.1 dbplyr_2.3.2