Perf issue with dbWriteTable: RMariaDB much slower than RMySQL (from 2 to 40 times), and it's worse with a remote db #162
Comments
|
Is the local CPU idling or spinning with RMariaDB?
The docs suggest to
Perhaps we can find out empirically what the "sweet spot" is? I suspect somewhere between 1k and 16k cells per query, maybe there's an upper limit regarding number of parameters for a prepared statement. Unfortunately I don't have a remote MySQL/MariaDB server available, I'd appreciate help here. |
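The sweet spot could be probed empirically with a rough timing loop. A sketch, assuming an open connection `con`, a data frame `df`, and a hypothetical target table `benchmark_tbl`:

```r
library(DBI)

# Time appends at several batch sizes, expressed in cells per query
for (cells in c(1000, 4000, 16000)) {
  rows_per_batch <- max(1, cells %/% ncol(df))
  elapsed <- system.time(
    for (i in seq(1, nrow(df), by = rows_per_batch)) {
      batch <- df[i:min(i + rows_per_batch - 1, nrow(df)), , drop = FALSE]
      dbAppendTable(con, "benchmark_tbl", batch)
    }
  )[["elapsed"]]
  cat(cells, "cells/query:", elapsed, "s\n")
}
```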
|
Of course I am ready to help. My first suggestion would be: just compare how it is implemented in RMySQL and in RMariaDB, since it works efficiently with RMySQL. I'd like to dig into the code if you point me to where it is implemented. What I know:

Classic insert (one statement per row):

```sql
INSERT INTO `shippers`(`ShipperID`,`CompanyName`,`Phone`)
VALUES(1,'Speedy Express','(503) 555-9831');
INSERT INTO `shippers`(`ShipperID`,`CompanyName`,`Phone`)
VALUES(2,'United Package','(503) 555-3199');
INSERT INTO `shippers`(`ShipperID`,`CompanyName`,`Phone`)
VALUES(3,'Federal Shipping','(503) 555-9931');
```

Bulk insert (one statement, many rows):

```sql
INSERT INTO `shippers`(`ShipperID`,`CompanyName`,`Phone`) VALUES
(1,'Speedy Express','(503) 555-9831'),
(2,'United Package','(503) 555-3199'),
(3,'Federal Shipping','(503) 555-9931');
```

And disabling autocommit before the INSERT process is also a good option, especially for transactional engines such as InnoDB, since it prevents a commit every x rows:

```sql
SET autocommit=0;
```

Anyway, at the risk of repeating myself, if the RMariaDB implementation could be the same as the RMySQL one, that would be a good first step. |
|
RMySQL uses
I'm happy to introduce this, opt-in. By default I'd use a bulk insert as you suggested. The implementation lives in |
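As a quick way to see what a multi-row INSERT looks like from DBI's side, `DBI::sqlAppendTable()` with the `ANSI()` dummy connection renders one statement with a multi-row VALUES clause. This is only an illustration of the SQL shape; RMariaDB's actual code path may differ:

```r
library(DBI)

shippers <- data.frame(
  ShipperID   = 1:3,
  CompanyName = c("Speedy Express", "United Package", "Federal Shipping"),
  Phone       = c("(503) 555-9831", "(503) 555-3199", "(503) 555-9931")
)

# Renders a single INSERT statement with three value tuples
cat(sqlAppendTable(ANSI(), "shippers", shippers, row.names = FALSE))
```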
|
Yes, thank you, I would be happy to try. By the way, for PostgreSQL and MSSQL there are similar bulk (fast loading) features (COPY for PG, and BULK INSERT for MSSQL). I have not checked yet whether you are leveraging them. |
|
I'm interested in this too, since I notice the same performance issue. Looking at the code, the statement is built as:

```r
SQL(paste0(
  "INSERT INTO ", table, "\n",
  "  (", paste(fields, collapse = ", "), ")\n",
  "VALUES\n",
  paste0("  (", rows, ")", collapse = ",\n")
))
```

As far as I can tell, this is what ends up getting called with |
|
Right, missed that. This leaves room for improvement -- one single SQL statement that big is going to create problems of all sorts. 16 seconds for 500 rows is not great (it's very poor indeed), but at least the original table would be loaded in ~25 minutes (and not take forever). I still wonder what causes this horrible performance. How many cells does the data have? What size? We could also review the DBI commands that are actually issued under the hood, with dblog: https://github.com/r-dbi/dblog. |
|
Slicing the data into chunks is a good idea, appending 500 rows at a time for instance. But when it comes to loading big tables (though 35,000 rows is not that big; we are not talking about "big data" here), there is only one efficient solution designed for that purpose: a real fast bulk insert such as LOAD DATA INFILE. This is why it is so important to have at least an option to use it (in the dbWriteTable, dbAppendTable, copy_to functions). If you have the right to write to a database, you generally can get LOAD DATA INFILE activated quite easily (my experience). Another important point with LOAD DATA INFILE: the CSV to be loaded must be UTF-8 encoded if the MySQL database is UTF-8 (which is the standard now). You can check the database character set with:

```sql
SELECT @@character_set_database;
```
|
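For reference, a typical LOAD DATA LOCAL INFILE statement looks like this (file name, table name, and delimiters are illustrative; `local_infile` must be enabled on both client and server):

```sql
LOAD DATA LOCAL INFILE '/tmp/tourism2020.csv'
INTO TABLE tourism2020
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```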
|
I ran a few additional tests and it looks like there is an overhead due to a client/server roundtrip for each row inserted (or some other network issue, more than a SQL engine issue).

Test 1: inserting the above table into a localhost database / a database on the local network / a remote database.

Test 2: inserting by chunks:

```r
data <- data %>% arrange(reg, com) %>%
  mutate(tile = ntile(n = 20))

write_by_chunk <- function(chunk, key) {
  print(key)
  if (key == 1) {
    dbExecute(mycon, "DROP TABLE IF EXISTS tourism2020")
    dbCreateTable(mycon, "tourism2020", chunk)
    dbExecute(mycon, "ALTER TABLE tourism2020 ENGINE = MYISAM")
  }
  # append to the table created above
  dbAppendTable(mycon, "tourism2020", chunk)
}

system.time(
  data %>%
    group_by(tile) %>%
    group_walk(write_by_chunk)
)
```

=> It doesn't make any difference, whatever the number of chunks (5, 10, 25). I also tested deactivating the InnoDB transactional mode by specifying the MyISAM engine, with no noticeable effect either. |
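Another thing worth timing is wrapping each append in an explicit transaction, so the server is not forced to commit per statement. A sketch, reusing `mycon` and the table name from the test above:

```r
library(DBI)

dbBegin(mycon)
dbAppendTable(mycon, "tourism2020", chunk)
dbCommit(mycon)

# or equivalently, with automatic rollback on error:
dbWithTransaction(mycon, {
  dbAppendTable(mycon, "tourism2020", chunk)
})
```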
|
I am running into the exact same issue. When I run the dbWriteTable function using the RMySQL package, it takes 3.77 seconds to write 60K+ rows into a remote database. However, when I use RMariaDB, it hangs and I have to terminate the session. As a workaround, I use RMySQL for dbWriteTable and RMariaDB for everything else, but this is certainly an issue, as RMySQL is deprecated. I am using the most recent versions of RMySQL and RMariaDB to test this, and the host is on 10.4.11 MariaDB compiled for Win64. |
|
What I used to do to work around this issue was to execute the INSERT INTO statements against an Aria engine table (without transactions), and afterwards execute an ALTER TABLE to convert it back to InnoDB. Your user would, however, need permission to change the table's engine.
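That workaround, expressed in SQL (the table name is illustrative):

```sql
-- Switch to the non-transactional Aria engine before the bulk load
ALTER TABLE tourism2020 ENGINE = Aria;

-- ... run the INSERT INTO statements here ...

-- Convert back to InnoDB afterwards
ALTER TABLE tourism2020 ENGINE = InnoDB;
```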
|
|
Hello, |
|
For |
|
I resorted to using the odbc package; RStudio recommends using odbc. This requires you to install and configure the proper ODBC driver, but both reads and writes are comparable to RMySQL. |
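For anyone trying that route, a connection via odbc looks roughly like this (driver name, host, and credentials are placeholders for your own setup):

```r
library(DBI)

con <- dbConnect(
  odbc::odbc(),
  Driver   = "MariaDB ODBC 3.1 Driver",  # whichever driver you installed
  Server   = "db.example.com",
  Port     = 3306,
  Database = "mydb",
  UID      = "user",
  PWD      = "secret"
)

dbWriteTable(con, "tourism2020", data, overwrite = TRUE)
dbDisconnect(con)
```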
|
I'm in for making the default |
|
Could you please first implement LOAD DATA LOCAL INFILE as a possible option for the |
|
We need a fail-safe AFAICT, for |
|
Allowing LOAD DATA INFILE is just a matter of setting rights server side; it is quite easy and works very well. When you are loading data with R into a database, you generally have full control over the database you want to load into, so you can grant LOAD DATA INFILE to the proper user. Of course I understand that the default behaviour should be "all-terrain" (INSERT by default), but an opt-in option allowing the use of LOAD DATA INFILE would be great. It is the same situation with PostgreSQL and other databases: there is the INSERT way and the bulk loading way (COPY with PG, BULK INSERT with MSSQL, ...). This is just what RMySQL was doing so far, and we would very much like to find it again here, to benefit from 5 s loading processes instead of 1000 s (my last test this morning), which is not usable. I understand you would like to first optimize the INSERT way for a remote database, and that is OK with me as long as the LOAD DATA INFILE possibility is not dropped. |
|
In my project, as a workaround, I wrote this function: |
|
@daniloandrademendes This is brilliant! I didn't know that dbWriteTable could take a CSV file as input. So I will use this variant, which fixes the perf issue from the outside.
Very nice solution! I just added |
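The workaround function itself is not quoted in this thread; a minimal sketch of the idea (write a UTF-8 CSV with Unix line endings, then bulk-load it server-side; all names are illustrative, and `local_infile` must be enabled) could look like:

```r
library(DBI)

write_table_fast <- function(con, name, df) {
  path <- tempfile(fileext = ".csv")
  # eol = "\n" avoids CRLF trouble on Windows
  write.table(df, path, sep = ",", quote = TRUE,
              row.names = FALSE, col.names = FALSE,
              fileEncoding = "UTF-8", eol = "\n")
  dbExecute(con, paste0(
    "LOAD DATA LOCAL INFILE '", normalizePath(path, winslash = "/"), "' ",
    "INTO TABLE `", name, "` ",
    "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ",
    "LINES TERMINATED BY '\\n'"
  ))
}
```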
|
@vituri I came to the same conclusion about EOL handling on Windows |
|
I'm using this; it works with a different column order and with subsets of columns. N.B.: safe.write is taken from RMySQL:::safe.write |
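The snippet itself did not survive here; for the column-order aspect, one simple approach (connection and table name assumed, table already existing) is to reorder the data frame to the server's column order before appending:

```r
library(DBI)

# Keep only columns the table knows about, in the server's order
fields <- dbListFields(con, "tourism2020")
df <- df[, intersect(fields, names(df)), drop = FALSE]
dbAppendTable(con, "tourism2020", df)
```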
|
Planning to release an update that uses |
I experimented with writing a table (a tibble: 34,951 x 52) into a MySQL database (5.6):
MySQL database on localhost (same computer):
remote MySQL database:
remote MySQL database with a reduced table (586 x 52):
With a remote database and RMariaDB, it never finishes, so let's try with a reduced tibble (=> 586 rows).