
Use a prepared statement for all records during the CSV import

Don't build a separate SQL statement per row to insert during CSV import
but use a single prepared statement which can be reused for each row.
This should speed up the CSV import noticeably.
MKleusberg committed Sep 6, 2017
1 parent 6e40741 commit 643251780557ffccf84a3a672a4d88e941313f97
Showing with 31 additions and 25 deletions.
+31 −25 src/ImportCsvDialog.cpp
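In short, the change swaps per-row SQL string building for SQLite's prepare-once / bind-step-reset pattern. As a reference for the diff below, here is a minimal standalone sketch of that pattern (illustrative table, columns and data only, not the actual DB4S code):

#include <sqlite3.h>
#include <cstdio>
#include <string>
#include <vector>

// Minimal sketch of the prepare-once / bind-step-reset pattern.
// Table and rows are made up for illustration; compile with -lsqlite3.
int main()
{
    sqlite3* db = nullptr;
    if(sqlite3_open(":memory:", &db) != SQLITE_OK)
        return 1;
    sqlite3_exec(db, "CREATE TABLE t(a TEXT, b TEXT)", nullptr, nullptr, nullptr);

    // Compile the INSERT once...
    sqlite3_stmt* stmt = nullptr;
    if(sqlite3_prepare_v2(db, "INSERT INTO t VALUES(?1, ?2)", -1, &stmt, nullptr) != SQLITE_OK)
        return 1;

    const std::vector<std::vector<std::string>> rows = {{"1", "x"}, {"2", "y"}};

    // ...then bind, step and reset it for every row instead of building a new SQL string per row
    for(const auto& row : rows)
    {
        for(size_t i = 0; i < row.size(); ++i)
            sqlite3_bind_text(stmt, static_cast<int>(i) + 1, row[i].c_str(), -1, SQLITE_TRANSIENT);

        if(sqlite3_step(stmt) != SQLITE_DONE)
        {
            std::fprintf(stderr, "insert failed: %s\n", sqlite3_errmsg(db));
            break;
        }

        sqlite3_reset(stmt);
        sqlite3_clear_bindings(stmt);   // unbound parameters default back to NULL
    }

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}

Since the statement is only compiled once, the per-row cost drops to binding values and stepping the statement, which is where the speed-up comes from.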
@@ -442,11 +442,20 @@ void ImportCsvDialog::importCsv(const QString& fileName, const QString &name)
                 else if(f->isInteger() && !f->notnull())    // If this is an integer column and NULL is allowed, insert NULL
                     nullValues << QString();
                 else    // Otherwise (i.e. if this isn't an integer column), insert an empty string
-                    nullValues << "''";
+                    nullValues << "";
             }
         }
     }
 
+    // Prepare the INSERT statement. The prepared statement can then be reused for each row to insert
+    QString sQuery = QString("INSERT INTO %1 VALUES(").arg(sqlb::escapeIdentifier(tableName));
+    for(size_t i=1;i<=csv.columns();i++)
+        sQuery.append(QString("?%1,").arg(i));
+    sQuery.chop(1); // Remove last comma
+    sQuery.append(")");
+    sqlite3_stmt* stmt;
+    sqlite3_prepare_v2(pdb->_db, sQuery.toUtf8(), sQuery.toUtf8().length(), &stmt, nullptr);
+
     // now lets import all data, one row at a time
     CSVParser::TCSVResult::const_iterator itBegin = csv.csv().begin();
     if(ui->checkboxHeader->isChecked())     // If the first row contains the field names we should skip it here because this is the data import
@@ -455,52 +464,49 @@ void ImportCsvDialog::importCsv(const QString& fileName, const QString &name)
         it != csv.csv().end();
         ++it)
     {
-        QString sql = QString("INSERT INTO %1 VALUES(").arg(sqlb::escapeIdentifier(tableName));
-
-        QStringList insertlist;
-        for(int i=0;i<it->size();i++)
+        // Bind all values
+        unsigned int bound_fields = 0;
+        for(int i=0;i<it->size();i++,bound_fields++)
         {
             // Empty values need special treatment, but only when importing into an existing table where we could find out something about
             // its table definition
             if(importToExistingTable && it->at(i).isEmpty() && nullValues.size() > i)
             {
                 // This is an empty value. We'll need to look up how to handle it depending on the field to be inserted into.
                 QString val = nullValues.at(i);
-                if(val.isNull())
-                    insertlist << "NULL";
-                else
-                    insertlist << val;
+                if(!val.isNull())   // No need to bind NULL values here as that is the default bound value in SQLite
+                    sqlite3_bind_text(stmt, i+1, val.toUtf8(), val.toUtf8().size(), SQLITE_TRANSIENT);
             } else {
-                // This is a non-empty value. Just add it to the list. The sqlite3_mprintf call with the %Q placeholder takes the value,
-                // adds single quotes around it and doubles all single quotes inside it. This means it'll be safe to be inserted into the SQL
-                // statement.
-                char* formSQL = sqlite3_mprintf("%Q", (const char*)it->at(i).toUtf8());
-                insertlist << formSQL;
-                if(formSQL)
-                    sqlite3_free(formSQL);
+                // This is a non-empty value. Just add it to the statement
+                sqlite3_bind_text(stmt, i+1, static_cast<const char*>(it->at(i).toUtf8()), it->at(i).toUtf8().size(), SQLITE_TRANSIENT);
             }
         }
 
-        // add missing fields with empty values
-        for(unsigned int i = insertlist.size(); i < csv.columns(); ++i)
+        // Insert row
+        if(sqlite3_step(stmt) != SQLITE_DONE)
        {
-            qWarning() << "ImportCSV" << tr("Missing field for record %1").arg(std::distance(itBegin, it) + 1);
-            insertlist << "NULL";
+            sqlite3_finalize(stmt);
+            return rollback(this, pdb, progress, restorepointName, std::distance(itBegin, it) + 1, tr("Inserting row failed: %1").arg(pdb->lastError()));
        }
 
-        sql.append(insertlist.join(QChar(',')));
-        sql.append(");");
-
-        if(!pdb->executeSQL(sql, false, false))
-            return rollback(this, pdb, progress, restorepointName, std::distance(itBegin, it) + 1, tr("Inserting row failed: %1").arg(pdb->lastError()));
+        // Reset statement for next use. Also reset all bindings to NULL. This is important, so we don't need to bind missing columns or empty values in NULL
+        // columns manually.
+        sqlite3_reset(stmt);
+        sqlite3_clear_bindings(stmt);
 
         // Update progress bar and check if cancel button was clicked
         unsigned int prog = std::distance(csv.csv().begin(), it);
        if(prog % 100 == 0)
            progress.setValue(prog);
        if(progress.wasCanceled())
+        {
+            sqlite3_finalize(stmt);
            return rollback(this, pdb, progress, restorepointName, std::distance(itBegin, it) + 1, "");
+        }
     }
+
+    // Clean up prepared statement
+    sqlite3_finalize(stmt);
 }
 
 void ImportCsvDialog::setQuoteChar(const QChar& c)

39 comments on commit 6432517

@MKleusberg

Member Author

MKleusberg replied Sep 6, 2017

Anybody interested in benchmarking this very quickly? It would be interesting to know what difference it makes on different systems and for different files 😄 Also I haven't checked memory usage at all yet.

Our CSV import is divided into two steps: decoding and inserting. This only affects the inserting step, so don't expect the decoding to go faster. But I do want to speed up both over the next couple of days. So if you feel like preparing a small number of benchmark test cases, feel free to go ahead 😉

@justinclift

Member

justinclift replied Sep 6, 2017

Oops, didn't see this. Yeah, I can do an old-vs-new test with some largish sized data if that helps.

@justinclift

Member

justinclift replied Sep 6, 2017

Running an import using the commit prior to this.

Data set: https://data.gov.uk/dataset/national-statistics-postcode-lookup-uk ~637MB CSV

Start time: Wed Sep 6 20:22:51 BST 2017
Finish time: Wed Sep 6 20:25:54 BST 2017

So, roughly 3 mins. I'll update DB4S and try it again with this new import code next.

@justinclift

Member

justinclift replied Sep 6, 2017

Yep, it's noticeably faster with this commit.

Start time: Wed Sep 6 20:28:26 BST 2017
Finish time: Wed Sep 6 20:30:36 BST 2017

Seems like a solid win. 😄

@justinclift

Member

justinclift replied Sep 6, 2017

Oh, with the decoding step would it be feasible to track the data type used in each column so we can automatically set useful types? We should be able to do better than defaulting to 'TEXT' for everything. 😄
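(Not something DB4S does today; but just to sketch the idea, sample-based guessing could be as small as the following, where guessColumnType is a hypothetical helper and 'samples' would be the first few values seen in a column:)

#include <QString>
#include <QStringList>

// Illustrative sketch only: guess a SQLite column type from sample values.
// Returns INTEGER if every non-empty sample parses as an integer, REAL if
// they all parse as numbers, otherwise TEXT.
static QString guessColumnType(const QStringList& samples)
{
    bool allInteger = true;
    bool allReal = true;

    for(const QString& value : samples)
    {
        if(value.isEmpty())     // Empty values don't tell us anything
            continue;

        bool ok;
        value.toLongLong(&ok);
        if(!ok)
            allInteger = false;
        value.toDouble(&ok);
        if(!ok)
            allReal = false;
    }

    if(allInteger)
        return "INTEGER";
    if(allReal)
        return "REAL";
    return "TEXT";
}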

@justinclift

Member

justinclift replied Sep 6, 2017

Would it be useful to also change the SQLite journalling and sync modes prior to doing the import? eg

PRAGMA journal_mode=OFF
PRAGMA synchronous=OFF
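(For reference, that amounts to something like the sketch below, using the plain SQLite C API. Both PRAGMAs trade crash safety for speed, and journal_mode can't be changed while a transaction is open, which matters for DB4S keeping changes uncommitted while editing:)

#include <sqlite3.h>

// Sketch: relax durability before a bulk import. Not DB4S code, just the shape of it.
static void setBulkImportPragmas(sqlite3* db)
{
    sqlite3_exec(db, "PRAGMA journal_mode = OFF;", nullptr, nullptr, nullptr);
    sqlite3_exec(db, "PRAGMA synchronous = OFF;", nullptr, nullptr, nullptr);
}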
@justinclift

Member

justinclift replied Sep 6, 2017

Just tweeted about the speed up too. Hopefully useful promo. 😄

@MKleusberg

Member Author

MKleusberg replied Sep 7, 2017

The type detection should be doable 😄 There's already an issue for it somewhere, isn't there? However, it will probably only be based on the first ten or so records (the number that's visible in the preview table in the dialog), not the entire file, because that would require parsing the entire file before inserting data. We currently do parse the entire file first, but that will most likely change in the near-ish future. So I think I'll do the performance optimisation first and see how that changes things, and only implement type detection afterwards.

I've thought about setting some pragmas, too 😄 The problem with that is that it makes it necessary to commit all changes first, which would be really annoying for the CSV import. E.g. manually creating a table, then inserting data into it, then reverting all changes wouldn't work anymore. It also makes the database more vulnerable to data loss if the system crashes. But I'll check how big the difference is anyway. If it saves another 50 seconds I'll reconsider it 😉

@justinclift

Member

justinclift replied Sep 7, 2017

... not the entire file, because that would require parsing the entire file before inserting data.

Ahhh, good point. I guess with type detection it's probably more important on the server side (eg when we implement CSV import there) than with DB4S. With DB4S, if people are unhappy with a type it's fairly easy to change it in the GUI.

With the server... not so much. Would need manual download, adjustment and re-upload by the user. We'll need to be more rigorous with automatic type detection there, so there's less need for manual adjustments. 😄

@MKleusberg

Member Author

MKleusberg replied Sep 7, 2017

Yeah, that's probably true. But it still makes sense to have it in DB4S, too 😄

@justinclift

Member

justinclift replied Sep 7, 2017

😄

Oh, was reminded of an interesting tip for INSERT performance yesterday. Slide 84 here mentions inserting several rows worth of data with each insert statement. Something like:

INSERT INTO foo (field1, field2, fieldn) VALUES ('row1a', 'row1b', 'row1c'), ('row2a', 'row2b', 'row2c'), ('row3a', 'row3b', 'row3c'), [etc]

Haven't tried it myself to see if it does anything measurable, but figured it's probably worth a look since you're already doing this stuff. 😄
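(For anyone who wants to try that with the prepared-statement approach, building the statement would look roughly like this sketch. The helper name is made up, the table name is assumed to be escaped already, and leftover rows that don't fill a whole batch at the end would still need a smaller statement:)

#include <QString>
#include <sqlite3.h>

// Sketch: prepare an INSERT that takes `rowsPerInsert` rows of `columns`
// parameters each, e.g. "INSERT INTO foo VALUES(?1,?2,?3),(?4,?5,?6),..."
static sqlite3_stmt* prepareMultiRowInsert(sqlite3* db, const QString& table,
                                           size_t columns, size_t rowsPerInsert)
{
    QString sql = QString("INSERT INTO %1 VALUES").arg(table);
    int param = 1;
    for(size_t r = 0; r < rowsPerInsert; ++r)
    {
        sql.append(r == 0 ? "(" : ",(");
        for(size_t c = 0; c < columns; ++c)
            sql.append(QString("?%1,").arg(param++));
        sql.chop(1);        // Remove the trailing comma
        sql.append(")");
    }

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, sql.toUtf8(), -1, &stmt, nullptr);
    return stmt;
}

Binding then walks through the rows in groups of rowsPerInsert, filling parameters 1 to columns*rowsPerInsert before each sqlite3_step().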

@MKleusberg

Member Author

MKleusberg replied Sep 7, 2017

Sure 😄 By the way, I've just tried the PRAGMAs and they improved performance by around 1% on my system. So it's probably not worth it, especially since I can't even rule out that this is an artefact of my not-so-good benchmark.

@justinclift

Member

justinclift replied Sep 8, 2017

Cool. 😄

Just tried out the "adding multiple rows per insert statement" thing with the simple Go SQLite data generator I wrote the other day. Tried it with 2, 3, 4, and 10 rows added at a time per insert:

sqlitebrowser/sqlitedatagen@9226d7a

No measurable difference in time elapsed for adding 1/2 million total rows. Measured several times for each of the values. It might be that the number of rows added per insert needs to be a lot larger (hundreds?), but I can't easily test it. I'd either need to manually write out an Exec() statement with several hundred variables in it (ugh), or rewrite the Exec function in the Go SQLite library we're using (ugh x2). Not really feeling like doing either. 😉

It's definitely possible the lack of improvement with my testing above is due to my coding being non-optimal, or perhaps even the Go SQLite library in question may not be perf tuned (no idea).

@chrisjlocke

Contributor

chrisjlocke replied Sep 8, 2017

I would assume it would only improve performance if you didn't wrap the inserts up in a global transaction. Normally, SQLite creates a transaction for each insert, and it's this delay that causes the biggest performance bottleneck. Sticking 40 inserts into one transaction (one insert statement) is a lot quicker than doing 40 smaller transactions (40 inserts, each in its own transaction).

Were you doing a global transaction?
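(For context, that's the difference between letting every INSERT run in its own implicit transaction and wrapping the whole loop once, roughly like this sketch with the per-row work elided. DB4S already sets a restore point around the import, which should have the same effect:)

#include <sqlite3.h>

// Sketch: one explicit transaction around the whole insert loop instead of
// one implicit transaction per INSERT. The per-row bind/step/reset work is
// elided; `db` is an open database handle.
static void insertAllRowsInOneTransaction(sqlite3* db)
{
    sqlite3_exec(db, "BEGIN TRANSACTION;", nullptr, nullptr, nullptr);

    // ... bind, step and reset the prepared INSERT once per row here ...

    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
}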

@justinclift

Member

justinclift replied Sep 8, 2017

Well, going by @jasonwyatt's SpeakerDeck slides it's an optimisation to consider after doing the "ensure all inserts are in a single transaction" step.

In the SQLite test data generation util above, there are 5 tables. Each table has (using default settings) 100k rows inserted into it, inside a single transaction. So, 5 transactions in total (1 per table).

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

I'm not sure either. But let's find out 😉 My local tests haven't really been conclusive, but it might be a tiny bit faster.

@justinclift Can you maybe fetch that code, build it and test again using the same system and the same UK postcode file? Oh, I should also point out that this commit is the quickest and dirtiest of all hacks. So you have to make sure the number of records (not including the field names if there are any) is divisible by three 😉 If it's not, DB4S will crash eventually. In this case just delete the last one or two records from the file, so the number is divisible. Sorry for the extra hassle, but I figured we might try this out before putting too much effort into the implementation's corner cases 😇

@justinclift

Member

justinclift replied Sep 8, 2017

Yep, no worries. Will try shortly. 😄

@justinclift

Member

justinclift replied Sep 8, 2017

Hmm... something seems weird. I just tried it and it crashed after the import. Made sure the # of rows was divisible by 3 though. Ahhh... maybe that # needs to be calculated without removing the header line from the calc.

Trying again... 😄

@justinclift

Member

justinclift replied Sep 8, 2017

Yep, that was it. Shouldn't have removed the row header before the /3 calc.

Ok, initial stats:

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
116673ms. The parser took 82291ms and the insertion took 34284ms.

I'll run it a few more times so we get an idea for consistency, then try it against the previous commit too (with CSV_BENCHMARK enabled) as well for comparison.

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

Cool 😄 But it's already clear that we need to look at the parser too.

@justinclift

Member

justinclift replied Sep 8, 2017

With this new experimental commit:

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121275ms. The parser took 86713ms and the insertion took 34518ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121311ms. The parser took 86693ms and the insertion took 34581ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
120167ms. The parser took 86065ms and the insertion took 34082ms.

And yeah, the parser does seem to take 2/3 of the time atm. The above results are with these settings:

[Screenshot: db4s_cvs_import1 (CSV import dialog settings)]

I'll try a run with "Trim fields" turned off, just to see if that changes anything...

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

That was my impression, too. Without the prepared statement patch the insertion phase took longer than the parsing and now, with it, it's the other way around. Unfortunately the parser is going to be more difficult to optimise, mostly because we're also using it for the preview in the dialog.

@justinclift

Member

justinclift replied Sep 8, 2017

Times with "Trim fields" turned off:

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
119714ms. The parser took 81983ms and the insertion took 37716ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121783ms. The parser took 84156ms and the insertion took 37626ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
122988ms. The parser took 85083ms and the insertion took 37902ms.

Weirdly, even though the parse time seems slightly faster with these runs, the insert time has made up the difference. 😖

@justinclift

Member

justinclift replied Sep 8, 2017

I'll try with the previous commit (prior to the "3 rows per insert") next and get some times for that too.

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

That's very weird. With the Trim fields option turned off we could expect the parse time to be a bit faster. But it makes no sense that the insert time has gone up ☹️

@justinclift

Member

justinclift replied Sep 8, 2017

Hmmm, would it make sense to add a few more counters in this experiment code for summing combined total times of various operations? Happy to run it here as needed. 😄

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

You mean like trying out various combinations of the settings automatically and running each import a couple of times or so? If so, yeah, I thought about that, but it turned out to be a pain to implement. Qt actually has a way to measure run time via the unit tests and do these things relatively easily, but for us it's really difficult to use because we would need not only the CSV parser but also the database code for the insert step, and that depends on pretty much the entire program. We can do that later for the parse step, though 😄
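(The Qt facility mentioned here is Qt Test's QBENCHMARK. A hypothetical benchmark for just the parse step would look something like the sketch below, assuming the CSV parser can be compiled standalone; the class and file names are made up:)

#include <QtTest>
#include <QFile>

// Hypothetical Qt Test benchmark for the parse step only.
class ParserBenchmark : public QObject
{
    Q_OBJECT

private slots:
    void benchmarkParse()
    {
        QFile file("test.csv");
        QVERIFY(file.open(QIODevice::ReadOnly));
        const QByteArray data = file.readAll();

        QBENCHMARK {
            // run the CSV parser over `data` here
        }
    }
};

QTEST_APPLESS_MAIN(ParserBenchmark)
#include "parserbenchmark.moc"   // file name is hypothetical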

@justinclift

Member

justinclift replied Sep 8, 2017

Times from the previous commit, with "Trim fields" enabled:

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
124843ms. The parser took 84334ms and the insertion took 40500ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
125222ms. The parser took 83589ms and the insertion took 41630ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
126896ms. The parser took 85466ms and the insertion took 41412ms.

Hmmm, so the multiple-rows per insert does seem a bit faster.

@justinclift

Member

justinclift replied Sep 8, 2017

Heh Heh Heh

Nah, that's not what I meant. Hadn't thought that far ahead. I was just thinking of maybe adding a few more timers in places for finer granularity, rather than just covering "all of the parser time" and "all of the insert time".

Haven't yet looked at the csv.parse() code, but might take a look over it in a minute just for general awareness. 😄

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

Oh wow, that's more than expected! I might clean up the code and do a proper commit later then 😄

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

Oh I see 😉 Yeah, we can do that too. We can also use the Callgrind module of Valgrind to get an idea of what is happening. I'll see later if it's possible to add some more timers 😄

@justinclift

Member

justinclift replied Sep 8, 2017

Cool. 😄

Looking through our parsing code, which runs before the inserts are done... wouldn't that be er... exactly the optimal place to add column type detection? We're scanning the complete file character by character prior to creating the table. Seems like exactly what we'd want for that? 😁

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

It would be 😉 But it's also the place where we take a 1GB CSV file and turn it into a 10GB parsed structure (#792). So I'm pretty sure one of the next steps has to be changing that, so we parse a couple of rows, then insert them, then parse a few more rows, etc.
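(Just to make that concrete, the loop would be shaped roughly like this sketch; readChunk and insertRow are hypothetical stand-ins for the parser and the bind/step/reset code:)

#include <functional>
#include <string>
#include <vector>

using CsvRow = std::vector<std::string>;

// Sketch of a chunked import: parse a batch of rows, insert them, then drop
// the batch before parsing the next one, so only one chunk is held in memory.
static void chunkedImport(const std::function<bool(std::vector<CsvRow>&, size_t)>& readChunk,
                          const std::function<void(const CsvRow&)>& insertRow)
{
    const size_t chunkSize = 10000;
    std::vector<CsvRow> chunk;

    while(readChunk(chunk, chunkSize))      // fills up to chunkSize rows, returns false at end of file
    {
        for(const CsvRow& row : chunk)
            insertRow(row);
        chunk.clear();
    }
}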

@justinclift

Member

justinclift replied Sep 8, 2017

Ahhh ok. Yeah, not an optimal use of time to work out. 😄

If it's useful, the exact data set I was testing against (including the /3 change) is here:

https://nightlies.sqlitebrowser.org/tmp/data.gov.uk%20postcode%20database/National_Statistics_Postcode_Lookup_UK-modified.csv.xz (50MB compressed file)

Note, not wanting to leave that on the nightlies server for long, so if you do/don't grab it let me know when it can be removed. 😄

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

Thanks 😄 Just finished downloading it.

@justinclift

Member

justinclift replied Sep 8, 2017

Thx. 😄

@justinclift

Member

justinclift replied Sep 8, 2017

That's very weird. With the Trim fields option turned off we could expect the parse time to be a bit faster. But it makes no sense that the insert time has gone up

Oh, I wonder if there's something either in DB4S or the SQLite code itself which is automatically trimming the fields during the insert now instead?

Meh... probably not, but it would kind of fit the behaviour.

@MKleusberg

Member Author

MKleusberg replied Sep 8, 2017

You can try it by creating a CSV file with spaces around the fields:

a, b ,c

For the second column it should make a difference whether the checkbox is activated or not. If it doesn't, something else is trimming the fields. But on my system it's working as expected.

@justinclift

Member

justinclift replied Sep 9, 2017

As a potentially useful data point when looking into the decoding optimisation, this smaller (~80MB) CSV file takes much longer to decode. Inserting is pretty quick though.

Importing the file '/home/jc/foo/selected_crimes_2012_2015.csv' took
658938ms. The parser took 654968ms and the insertion took 3953ms.

Seems to be sourced from here: https://www.odata.org.il/dataset/maazarim1
Google translate says the language for that website is Hebrew, which makes sense as it's "Israel selected crimes". The licensing info on the page indicates it's CC-BY, so we should be ok to muck around with it. 😄
