Skip to content

rows_update truncates very long text (json) upon insertion #1327

@lhabegger

Description

@lhabegger

I am working with a remote MySQL database trough dplyr / dbplyr. I stumbled upon a strange behaviour today that puzzles me.

I am inserting a very long text (JSON) into a column with Type LONGTEXT (which has enough space to store the string in the db)

I experienced truncation of the string to maximum 65535 characters (byte limit for e.g. TEXT columns) when using dplyr::rows_update . When using an alternative approach using DBI directly this truncation does not occur.

My current naive thinking goes into the direction of the 'copy_to()' mechanisms. I saw in the source of dbplyr that the method dbplyr::detect_col_types() for class DBIConnection returns NULL. So if the new df is created in the remote DB with default col Types (e.g. TEXT vs LONGTEXT) this would explain this behaviour.

dbplyr/R/rows.R

Line 775 in 09cb679

get_col_types.DBIConnection <- function(con, name, call) {

Am I trying to use dbplyr outside of its intended usage? If yes believe this should at least be documented / warned.

This is the code i execute:

library(DBI)
library(dplyr)

# Establish the database connection
con <- dbConnect(odbc::odbc(), Driver = "MySQL ODBC 8.0 ANSI Driver",
                 Server = "your_server", Database = "your_database",
                 User = "your_username", Password = "your_password",
                 timeout = 10)

# Create a sample data frame with long JSON data
json_data <- paste(rep('{"key":"', 100000), "a", rep('"}', 100000), collapse = "")
json_data <- paste0('{"data": "', json_data, '"}')

df <- data.frame(id = 1:3, json = rep(json_data, 3))

print(nchar(json_data))

# Create Table in remote database with name `temp_table` with a column `json` set to LONGTEXT
query <- "
  CREATE TABLE temp_table (
    id INT PRIMARY KEY,
    json LONGTEXT
  )
"
dbExecute(con, query)

# Use dplyr to update the remote database table

tbl(con, "temp_table") %>%
  rows_update(df,by = c("id"), copy=TRUE, in_place = TRUE, unmatched='ignore')

resDplyr <- tbl(con,"temp_table") %>%
  dplyr::collect()

nchar(resDplyr$json[1])


# DBI method -----------------------------------------------------------------
update_statement <- "UPDATE temp_table SET json = ? WHERE id = ? "


# Iterate over the rows in the data frame and update the database
for (i in seq_len(nrow(df))) {
  # Get the values for the current row
  json_data <- df$json[i]
  id <- df$id[i]
  
  # Execute the update statement with the current row values
  dbExecute(con, update_statement, list(json_data, id))
}

resDBI <- tbl(con,"temp_table") %>%
  dplyr::collect()

nchar(resDBI$json[1])

# Close the database connection
dbDisconnect(con)

This is the console output of the relevant portion:

> # Create a sample data frame with long JSON data
> json_data <- paste(rep('{"key":"', 100000), "a", rep('"}', 100000), collapse = "")
> json_data <- paste0('{"data": "', json_data, '"}')
> 
> df <- data.frame(id = 1:3, json = rep(json_data, 3))
> 
> print(nchar(json_data))
[1] 1300012
> 
> # Table in remote database has table `temp_table` with a column `json` set to LONGTEXT
> 
> # Use dplyr to update the remote database table
> 
> tbl(con, "temp_table") %>%
+   rows_update(df,by = c("id"), copy=TRUE, in_place = TRUE, unmatched='ignore')
> 
> resDplyr <- tbl(con,"temp_table") %>%
+   dplyr::collect()
> 
> nchar(resDplyr$json[1])
[1] 65535
> 
> 
> # DBI method -----------------------------------------------------------------
> update_statement <- "UPDATE temp_table SET json = ? WHERE id = ? "
> 
> 
> # Iterate over the rows in the data frame and update the database
> for (i in seq_len(nrow(df))) {
+   # Get the values for the current row
+   json_data <- df$json[i]
+   id <- df$id[i]
+   
+   # Execute the update statement with the current row values
+   dbExecute(con, update_statement, list(json_data, id))
+ }
> 
> resDBI <- tbl(con,"temp_table") %>%
+   dplyr::collect()
> 
> nchar(resDBI$json[1])
[1] 1300012
> 
> # Close the database connection
> dbDisconnect(con)
> ?copy_to
> 

and this is the sessionInfo()

> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_Switzerland.utf8  LC_CTYPE=English_Switzerland.utf8    LC_MONETARY=English_Switzerland.utf8
[4] LC_NUMERIC=C                         LC_TIME=English_Switzerland.utf8    

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.2 DBI_1.1.3  

loaded via a namespace (and not attached):
 [1] utf8_1.2.3       R6_2.5.1         bit_4.0.5        tidyselect_1.2.0 magrittr_2.0.3   odbc_1.3.4       glue_1.6.2       blob_1.2.4      
 [9] tibble_3.2.1     pkgconfig_2.0.3  bit64_4.0.5      generics_0.1.3   lifecycle_1.0.3  cli_3.6.1        fansi_1.0.4      vctrs_0.6.2     
[17] withr_2.5.0      compiler_4.3.1   purrr_1.0.1      rstudioapi_0.14  tools_4.3.1      hms_1.1.3        pillar_1.9.0     Rcpp_1.0.10     
[25] rlang_1.1.1      dbplyr_2.3.2   

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions