
Perf issue with dbWriteTable: RMariaDB much slower than RMySQL (from 2 to 40 times), and it's worse with a remote db #162

Closed
ericemc3 opened this issue May 17, 2020 · 23 comments

@ericemc3 commented May 17, 2020

I experimented with writing a table (a tibble: 34,951 x 52) into a MySQL database (5.6):

MySQL database on localhost (same computer):

  • RMySQL = 3.4s
  • RMariaDB = 8s

remote MySQL database:

  • RMySQL = 11s
  • RMariaDB: R was still running after 5 minutes; I had to kill it

remote MySQL database with a reduced table (586 x 52):

  • RMySQL = 0.4s
  • RMariaDB = 16s
library(tidyverse) 
library(readxl) 
library(RMySQL) 
library(RMariaDB) 
library(curl) 

urlhebtour <- "https://insee.fr/fr/statistiques/fichier/2021703/base-cc-tourisme-2020.zip"
xlsfile    <- unzip( curl_download( urlhebtour, "temp.zip" ) )  

dt <- read_excel( xlsfile, sheet = "COM", skip = 5 ) %>% select(-LIBGEO) 

# RMySQL test on localhost
db_cred = list( drv = MySQL(), host="localhost", dbname="donstat", user="xxx", password="xxx" )
mycon <- exec( "dbConnect", !!!db_cred ) 

system.time( dbWriteTable( mycon, value = dt, name = "tourism2020", 
                                    overwrite = TRUE, row.names = FALSE ) )
# RMySQL 3.4s

tbl(mycon, "tourism2020") %>% collect()
# A tibble: 34,951 x 52

dbDisconnect(mycon)  

# RMariaDB test on localhost
db_cred <- list( drv = MariaDB(), host="localhost", dbname="donstat", user="xxx", password="xxx" )
mycon <- exec( "dbConnect", !!!db_cred ) 

system.time( dbWriteTable( mycon, value = dt, name = "tourism2020", 
                                       overwrite = TRUE, row.names = FALSE ) )
# RMariaDB 8s

tbl(mycon, "tourism2020") %>% collect()
# A tibble: 34,951 x 52

With a remote database and RMariaDB, it never finishes.

So let's try with a reduced tibble (=> 586 rows)

system.time( dbWriteTable( mycon, value = dt %>% filter(DEP == "31"), name = "tourism2020", overwrite = TRUE, row.names = FALSE ) )

# RMySQL   = 0.4s
# RMariaDB = 16s
@krlmlr (Member) commented May 17, 2020

Is the local CPU idling or spinning with RMariaDB?

dbWriteTable() calls DBI::dbAppendTable(), which creates an SQL query of the form INSERT INTO ... VALUES (?, ?, ?, ?); perhaps this requires confirmation from the server for each row.

The docs suggest to

Use the multiple-row INSERT syntax to reduce communication overhead between the client and the server if you need to insert many rows:

INSERT INTO yourtable VALUES (1,2), (5,5), ...;

Perhaps we can find out empirically what the "sweet spot" is? I suspect it's somewhere between 1k and 16k cells per query; maybe there's also an upper limit on the number of parameters in a prepared statement.
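As a sketch of that idea (the helper names and the default chunk size of 1000 rows are assumptions to be tuned empirically; DBI::sqlAppendTable generates the multi-row INSERT, and an open connection is assumed):

```r
# Split n row indices into chunks of at most `chunk_rows` each (pure base R).
chunk_indices <- function(n, chunk_rows) {
  split(seq_len(n), (seq_len(n) - 1) %/% chunk_rows)
}

# Sketch: append a data frame with one multi-row INSERT per chunk,
# instead of one statement per row or one giant statement for everything.
append_chunked <- function(con, name, value, chunk_rows = 1000) {
  for (idx in chunk_indices(nrow(value), chunk_rows)) {
    chunk <- value[idx, , drop = FALSE]
    DBI::dbExecute(con, DBI::sqlAppendTable(con, name, chunk, row.names = FALSE))
  }
  invisible(nrow(value))
}
```

Timing `append_chunked()` for a few chunk sizes against a remote server would be one way to locate the sweet spot.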

Unfortunately I don't have a remote MySQL/MariaDB server available, I'd appreciate help here.

@ericemc3 (Author) commented May 17, 2020

Of course I am ready to help. My first suggestion would be: just compare how it is implemented in RMySQL and in RMariaDB, since it works efficiently with RMySQL. I'd like to dig into the code if you point me to where it is implemented.

What I know:

  1. With a local MySQL DB you can generally use LOAD DATA LOCAL INFILE, which is very fast; this is indeed what you do.
  2. With a remote MySQL DB, you can't. But there are at least two ways of inserting, the second generally faster:

classic insert:

insert into `shippers`(`ShipperID`,`CompanyName`,`Phone`) 
values(1,'Speedy Express','(503) 555-9831');
insert into `shippers`(`ShipperID`,`CompanyName`,`Phone`) 
values(2,'United Package','(503) 555-3199');
insert into `shippers`(`ShipperID`,`CompanyName`,`Phone`) 
values(3,'Federal Shipping','(503) 555-9931');

bulk insert:

insert into `shippers`(`ShipperID`,`CompanyName`,`Phone`) values 
(1,'Speedy Express','(503) 555-9831'),
(2,'United Package','(503) 555-3199'),
(3,'Federal Shipping','(503) 555-9931');
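For illustration, a naive base-R sketch that builds such a multi-row INSERT from a data frame (the function name is made up, and the quoting here is simplistic; real code should quote values with DBI::dbQuoteString or DBI::dbQuoteLiteral):

```r
# Sketch: turn a data frame into a single multi-row INSERT statement.
bulk_insert_sql <- function(table, df) {
  # quote each column: numbers as-is, everything else single-quoted
  # with embedded quotes doubled (naive escaping, for illustration only)
  quoted <- lapply(df, function(col) {
    if (is.numeric(col)) as.character(col)
    else paste0("'", gsub("'", "''", as.character(col)), "'")
  })
  # paste the columns of each row together: "1,'Speedy Express'", ...
  rows <- do.call(paste, c(unname(quoted), sep = ","))
  paste0(
    "insert into `", table, "`(`", paste(names(df), collapse = "`,`"), "`) values\n",
    paste0("(", rows, ")", collapse = ",\n"), ";"
  )
}
```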

And disabling autocommit before the INSERT process is also a good option, especially for transactional engines such as InnoDB. This prevents a commit every x rows.

SET autocommit=0;
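A minimal sketch of that approach with DBI (the function name is made up; DBI::dbWithTransaction wraps the work in BEGIN/COMMIT so the server commits once at the end):

```r
# Sketch: do the whole append inside one transaction so the server commits
# once at the end rather than after every INSERT statement.
append_in_one_transaction <- function(con, name, value) {
  DBI::dbExecute(con, "SET autocommit=0")
  on.exit(DBI::dbExecute(con, "SET autocommit=1"), add = TRUE)
  DBI::dbWithTransaction(con, DBI::dbAppendTable(con, name, value))
}
```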

Anyway, at the risk of repeating myself, if the RMariaDB implementation could be the same as the RMySQL one, that would be a good first step.

@krlmlr (Member) commented May 17, 2020

RMySQL uses LOAD DATA LOCAL INFILE. From https://dev.mysql.com/doc/refman/8.0/en/load-data.html:

LOCAL works only if your server and your client both have been configured to permit it. For example, if mysqld was started with the local_infile system variable disabled, LOCAL does not work.
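From R, the client can check that server setting before attempting a LOCAL load (a minimal sketch, assuming an open DBI connection; the helper name is illustrative):

```r
# Sketch: returns TRUE when the server reports local_infile = ON.
local_infile_enabled <- function(con) {
  res <- DBI::dbGetQuery(con, "SHOW VARIABLES LIKE 'local_infile'")
  nrow(res) == 1 && toupper(res$Value) == "ON"
}
```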

I'm happy to introduce this, opt-in. By default I'd use a bulk insert as you suggested.

The implementation lives in DBI::dbAppendTable(), we can override in RMariaDB. Would you like to take a stab?

@ericemc3 (Author) commented May 17, 2020

Yes, thank you, I would be happy to try.

You are right that LOAD DATA LOCAL INFILE is not always authorized (you need the proper rights, the --secure-file-priv option disabled, and so on). But it is such a huge boost when you can use it.

By the way, PostgreSQL and MSSQL have similar bulk (fast loading) features (COPY for PG, BULK INSERT for MSSQL). I have not checked yet whether you are leveraging them.

@rossholmberg (Contributor) commented May 18, 2020

I'm interested in this too, since I notice the same performance issue.

Looking in the DBI repo, I'm seeing this section:

SQL(paste0(
  "INSERT INTO ", table, "\n",
  "  (", paste(fields, collapse = ", "), ")\n",
  "VALUES\n",
  paste0("  (", rows, ")", collapse = ",\n")
))

As far as I can tell, this is what ends up getting called by DBI::dbAppendTable(). It looks to me like it already creates a single bulk INSERT statement, so I'm not sure bypassing it with your own function will help if the aim is one or more bulk inserts.

@krlmlr (Member) commented May 18, 2020

Right, missed that. This leaves room for improvement: one single SQL statement that big is going to create problems of all sorts.

16 seconds for 500 rows is very poor, but at least the original table would load in ~25 minutes (and not take forever). I still wonder what causes this horrible performance.

How many cells does the data have? What size?

We could also review the DBI commands that are actually issued under the hood, with dblog: https://github.com/r-dbi/dblog.

@ericemc3 (Author) commented May 18, 2020

Slicing the data into chunks is a good idea, appending 500 rows at a time for instance.

But when it comes to loading big tables (though 35,000 rows is not that big; we are not talking about "big data" here), there is only one efficient solution designed for that purpose: a real fast bulk load such as LOAD DATA INFILE.

This is why it is so important to at least have an option to use it (in the dbWrite, dbAppend and copy_to functions).

If you have the right to write to a database, you can generally get LOAD DATA INFILE activated quite easily (in my experience).

Another important point with LOAD DATA INFILE: the CSV to be loaded must be UTF-8 encoded if the MySQL DB is UTF-8 (which is the standard now).

SELECT @@character_set_database
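Writing the temporary CSV in UTF-8 from R is straightforward (a minimal sketch; the helper name is made up):

```r
# Sketch: write a data frame as a UTF-8 encoded CSV, ready to be fed to
# LOAD DATA LOCAL INFILE against a utf8 database; returns the file path.
write_utf8_csv <- function(df, path = tempfile(fileext = ".csv")) {
  write.csv(df, file = path, row.names = FALSE, fileEncoding = "UTF-8")
  path
}
```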

@ericemc3 (Author) commented May 18, 2020

I ran a few additional tests and it looks like there is overhead due to a client/server roundtrip for each row inserted (or some other network issue, more than a SQL engine issue).

Test 1: inserting the above table into a localhost database / a database on the local network / a remote database
Loading times (RMariaDB): 8s / 60s / 900s
Compared with LOAD DATA INFILE via RMySQL: 4s / 4s / 12s

Test 2: inserting by chunk

data <- data %>%
  arrange(reg, com) %>%
  mutate(tile = ntile(n = 20))

write_by_chunk <- function(chunk, key) {
  print(key$tile)
  if (key$tile == 1) {
    dbExecute(mycon, "DROP TABLE IF EXISTS tourism2020")
    dbCreateTable(mycon, "tourism2020", chunk)
    dbExecute(mycon, "ALTER TABLE tourism2020 ENGINE = MYISAM")
  }

  dbAppendTable(mycon, "tourism2020", chunk)
}

system.time(
  data %>%
    group_by(tile) %>%
    group_walk( write_by_chunk )
)

=> it doesn't make any difference, whatever the number of chunks (5, 10, 25).

I also tested deactivating the InnoDB transactional mode by specifying the MyISAM engine, with no noticeable effect either.

@andoomcz commented May 22, 2020

I am running into the exact same issue. When I run the dbWriteTable() function using the RMySQL package, it takes 3.77 seconds to write 60K+ rows into a remote database. However, when I use RMariaDB, it hangs and I have to terminate the session. As a workaround, I am using RMySQL for dbWriteTable() and RMariaDB elsewhere, but this is certainly an issue, as RMySQL is deprecated.

I am using the most recent versions of RMySQL and RMariaDB to test this, and the host is 10.4.11 MariaDB compiled for Win64.

@jtelleriar commented May 22, 2020

@ericemc3 (Author) commented Jun 4, 2020

Hello,
is there something cooking about introducing LOAD DATA INFILE as an option?
Thanks

@mcol (Contributor) commented Jun 5, 2020

For dbWriteTable() I resorted to writing a wrapper that takes a data frame, writes it to a temporary file, then passes the path of the temporary file to the function: with this approach, the data gets loaded using the LOAD DATA LOCAL INFILE construct. Unfortunately there doesn't seem to be a way of forcing dbAppendTable() to work in a similar way, so having an option directly in these functions would be really helpful.

@andoomcz commented Jun 5, 2020

I resorted to using the odbc package, which RStudio recommends. This requires you to install and configure the proper ODBC driver, but both reads and writes are comparable to RMySQL.
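For reference, the odbc route looks roughly like this (the driver name, function name and connection details are assumptions for illustration; use whatever your installed MariaDB/MySQL ODBC driver is called):

```r
# Sketch: connect through the odbc package instead of RMariaDB; reads and
# writes then go through the ODBC driver.
connect_via_odbc <- function(server, database, uid, pwd, port = 3306) {
  DBI::dbConnect(
    odbc::odbc(),
    Driver   = "MariaDB ODBC 3.1 Driver",  # assumption: adjust to the installed driver
    Server   = server,
    Database = database,
    UID      = uid,
    PWD      = pwd,
    Port     = port
  )
}
```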

@krlmlr (Member) commented Jul 3, 2020

I'm in for making the default dbWriteTable() faster. I'll need to provision a remote database to be able to replicate the network roundtrips you're seeing.

@ericemc3 (Author) commented Jul 4, 2020

Could you please first implement LOAD DATA LOCAL INFILE as an option for dbWriteTable and copy_to, as you proposed earlier? This is the real key for speed. I am in for testing, of course. The initial RMySQL implementation is probably a good start. Inserting line by line with INSERT will always be slow.

@krlmlr (Member) commented Jul 4, 2020

We need a fail-safe dbWriteTable() that works in all cases.

AFAICT, for LOAD DATA INFILE, the file must be accessible from the server process. This means that the client process can't access it (in the general case) and can't help with the column names or types. Basically, we would only run dbExecute(con, "LOAD DATA INFILE ...") ?
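The two variants differ only in the LOCAL keyword; a small base-R builder makes the distinction explicit (the function name is illustrative and quoting is simplified; real code should use DBI::dbQuoteString and DBI::dbQuoteIdentifier):

```r
# Sketch: LOCAL reads the file from the client machine, so R can prepare it;
# without LOCAL, the path must exist on the *server's* filesystem.
load_data_sql <- function(table, path, local = TRUE) {
  paste0(
    "LOAD DATA ", if (local) "LOCAL " else "", "INFILE '", path, "'\n",
    "INTO TABLE `", table, "`\n",
    "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n",
    "IGNORE 1 LINES"
  )
}
```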

@ericemc3 (Author) commented Jul 4, 2020

Allowing LOAD DATA INFILE is just a matter of rights settings server-side; it is quite easy and works very well. When you load data into a database with R, you generally have full control over the target database, so you can grant LOAD DATA INFILE to the proper user.

Of course I understand that the default behaviour should be "all-terrain" (INSERT by default), but an opt-in option allowing LOAD DATA INFILE would be great. It is the same situation with PostgreSQL and other databases: there is the INSERT way and the bulk loading way (COPY with PG, BULK INSERT with MSSQL, ...).

This is just what RMySQL was doing so far, and we would very much like to find it again here, to benefit from 5 s loading processes instead of 1000 s (my last test this morning), which is not usable.

I understand you would like to optimize the INSERT way for a remote database first, and that is OK with me as long as the LOAD DATA INFILE possibility is not dropped.

@daniloandrademendes commented Jul 30, 2020

In my project, as a workaround, I wrote this function:

appendToTable <- function(conn, name, df) {
  f <- tempfile()
  dbFields <- dbListFields(conn, name)
  for (dbField in dbFields) {
    # if a field in the db is not present in the df to import
    if (!(dbField %in% colnames(df))) {
      # add it with NULL!
      df[[dbField]] <- NA
    }
  }
  df <- df %>% select(all_of(dbFields))
  write.csv(df, file = f, row.names = FALSE, na = "NULL", fileEncoding = "UTF-8")
  dbWriteTable(conn, name, f, append = TRUE)
}

@ericemc3 (Author) commented Aug 3, 2020

@daniloandrademendes This is brilliant! I didn't know that dbWriteTable() could take a CSV file as input.

So I will use this variant, which fixes the perf issue from the outside.

dbWriteTableMySQLFast <- function(conn, df, tbl_name) {
  dbRemoveTable(conn, tbl_name, fail_if_missing = FALSE)
  dbCreateTable(conn, tbl_name, fields = df)

  f <- tempfile()
  write.csv(df, file = f, row.names = FALSE, na = "NULL", fileEncoding = "UTF-8")

  dbWriteTable(conn, tbl_name, f, append = TRUE)

  unlink(f)
}

@krlmlr krlmlr added this to the 1.1.0 milestone Aug 26, 2020
@vituri commented Sep 3, 2020

> In my project, as a workaround, I wrote this function: appendToTable() [quoted verbatim from @daniloandrademendes's comment above]

Very nice solution! I just added eol = "\r\n" in dbWriteTable so it works on Windows (line-ending issues, I believe). Thanks a lot!

@ericemc3 (Author) commented Sep 5, 2020

@vituri I came to the same conclusion about eol on Windows.
Here is the version I currently use in production for MySQL, which also allows for primary key definition:

dbWriteTableMySQLFast <- function(conn, df, tbl_name, pk = NULL) {
  dbRemoveTable(conn, tbl_name, fail_if_missing = FALSE)
  dbCreateTable(conn, tbl_name, fields = df)

  f <- tempfile()
  write.csv(df, file = f, row.names = FALSE, na = "NULL", fileEncoding = "UTF-8")
  dbWriteTable(conn, tbl_name, f, append = TRUE, eol = "\r\n")
  unlink(f)

  if (is.null(pk)) {
    invisible(dbExecute(conn, str_glue("ALTER TABLE {`tbl_name`} ENGINE=INNODB")))
  } else {
    pks <- paste(pk, collapse = ",")
    invisible(dbExecute(conn, str_glue("ALTER TABLE {`tbl_name`} ADD PRIMARY KEY ({`pks`}), ENGINE=INNODB")))
  }

  tbl(conn, tbl_name)
}

@kforner commented Sep 8, 2020

I'm using this; it works with different column orders and subsets:

mysql_fast_db_write_table <- function(conn, name, value, field.types = NULL, append = FALSE, ...) {
  die_if(ncol(value) == 0, 'no column in the data frame to upload')

  # create the table if needed
  if (!DBI::dbExistsTable(conn, name) || !append) {
    # fix for the field types
    if (is.null(field.types)) {
      field.types <- vapply(value, DBI::dbDataType, dbObj = conn, FUN.VALUE = character(1))
    }
    DBI::dbWriteTable(conn, name, value[NULL, ], field.types = field.types, ...)
  }

  if (nrow(value) == 0) return(0)

  ### write a temp csv file
  fn <- normalizePath(tempfile("rsdbi"), winslash = "/", mustWork = FALSE)
  safe.write(value, file = fn)
  on.exit(unlink(fn), add = TRUE)

  mysql_load_data_local_infile(conn, name, path = fn, cols = names(value), header = FALSE, sep = '\t')
  # DBI::dbWriteTable(conn, name, fn, append = TRUE, overwrite = FALSE, header = FALSE, sep = '\t')
}

mysql_load_data_local_infile <- function(conn, name, path, cols, header = TRUE, sep = ",", eol = "\n", skip = 0, quote = "\"") {
  .quote_id <- function(x) DBI::dbQuoteIdentifier(conn, x)
  .quote_s <- function(x) DBI::dbQuoteString(conn, x)

  cols <- sapply(cols, .quote_id)
  cols <- paste0(cols, collapse = ", ")

  sql <- paste0(
    "LOAD DATA LOCAL INFILE ", .quote_s(path), "\n",
    "INTO TABLE ", .quote_id(name), "\n",
    "FIELDS TERMINATED BY ", .quote_s(sep), "\n",
    "OPTIONALLY ENCLOSED BY ", .quote_s(quote), "\n",
    "LINES TERMINATED BY ", .quote_s(eol), "\n",
    "IGNORE ", skip + as.integer(header), " LINES\n",
    "(", cols, ");"
  )

  DBI::dbExecute(conn, sql)
}

mysql_load_data_local_infile <- function(conn, name, path, cols, header = TRUE, sep = ",", eol = "\n", skip = 0, quote = "\"") {
  .quote_id <- function(x) DBI::dbQuoteIdentifier(conn ,x)
  .quote_s <- function(x) DBI::dbQuoteString(conn, x)
  
  cols <- sapply(cols,  .quote_id)
  cols <- paste0(cols, collapse = ", ")
      
  sql <- paste0(
      "LOAD DATA LOCAL INFILE ", .quote_s( path), "\n", 
      "INTO TABLE ", .quote_id(name), "\n", 
      "FIELDS TERMINATED BY ", .quote_s(sep), "\n", 
      "OPTIONALLY ENCLOSED BY ", .quote_s(quote), "\n", 
      "LINES TERMINATED BY ", .quote_s(eol), "\n", 
      "IGNORE ", skip + as.integer(header), " LINES\n", 
      "(", cols, ");"
    )
  
   DBI::dbExecute(conn, sql)
}

N.B.: die_if() is a small custom helper that stops with a message when its condition is true; safe.write() is taken from RMySQL:::safe.write.

@krlmlr (Member) commented Sep 2, 2021

Planning to release an update that uses LOAD DATA LOCAL INFILE very soon. Closing this issue in favor of #11.
