Use a prepared statement for all records during the CSV import
Don't build a separate SQL statement for each row during the CSV import; instead, use a single prepared statement that can be reused for every row.
This should speed up the CSV import noticeably.
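For reference, the reuse pattern looks roughly like this with the SQLite C API (a minimal sketch, not the actual DB4S code; the table name and the missing error handling are simplifications):

```cpp
// Minimal sketch of the pattern (SQLite C API); table and column names
// are made up for illustration, this is not the actual DB4S import code.
#include <sqlite3.h>
#include <string>
#include <vector>

bool importRows(sqlite3* db, const std::vector<std::vector<std::string>>& rows)
{
    // One transaction around all inserts, and one statement prepared once.
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    if(sqlite3_prepare_v2(db, "INSERT INTO csv_import VALUES(?, ?, ?)", -1, &stmt, nullptr) != SQLITE_OK)
        return false;

    for(const auto& row : rows)
    {
        // Bind the current row's fields, execute, then reset for reuse.
        for(size_t i = 0; i < row.size(); ++i)
            sqlite3_bind_text(stmt, static_cast<int>(i) + 1, row[i].c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
    return true;
}
```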
Anybody interested in benchmarking this very quickly? It would be interesting to know what difference it makes on different systems and for different files 😄 Also I haven't checked memory usage at all yet.
Our CSV import is divided into two steps: decoding and inserting. This only affects the inserting step, so don't expect the decoding to go faster. But I do want to speed up both over the next couple of days. So if you feel like preparing a small number of benchmark test cases, feel free to go ahead 😉

Oops, didn't see this. Yeah, I can do an old-vs-new test with some largish sized data if that helps.

Running an import using the commit prior to this.
Data set: https://data.gov.uk/dataset/national-statistics-postcode-lookup-uk ~637MB CSV
Start time: Wed Sep 6 20:22:51 BST 2017
Finish time: Wed Sep 6 20:25:54 BST 2017
So, roughly 3 mins. I'll update DB4S and try it again with this new import code next.

Yep, it's noticeably faster with this commit.
Start time: Wed Sep 6 20:28:26 BST 2017
Finish time: Wed Sep 6 20:30:36 BST 2017
Seems like a solid win. 😄
Oh, with the decoding step would it be feasible to track the data type used in each column so we can automatically set useful types? We should be able to do better than defaulting to 'TEXT' for everything. 😄

Just tweeted about the speed up too. Hopefully useful promo. 😄

Would it be useful to also change the SQLite journalling and sync modes prior to doing the import too?
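E.g. something along these lines (the exact pragmas and values here are an assumption, not a quote from the discussion):

```sql
-- Assumed example: trade durability for speed during the bulk load ...
PRAGMA journal_mode = MEMORY;
PRAGMA synchronous = OFF;

-- ... and restore the defaults once the import has finished.
PRAGMA journal_mode = DELETE;
PRAGMA synchronous = FULL;
```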
The type detection should be doable 😄 There's already an issue for it somewhere, isn't there? However, it will probably only be based on the first ten or so records (the number that's visible in the preview table in the dialog), not the entire file, because that would require parsing the entire file before inserting data. That's currently the case but will most likely be changed in the near-ish future. So I think I'll do performance optimisation first and see how that changes things, and only implement type detection afterwards.
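A rough sketch of what that per-column detection over the preview records could look like (Qt types for illustration; the data structures here are assumptions, not the actual parser's):

```cpp
// Rough sketch of the idea (not DB4S code): look at the first few parsed
// records and pick INTEGER/REAL/TEXT per column, falling back to TEXT.
#include <QString>
#include <QStringList>
#include <QVector>

QString detectColumnType(const QVector<QStringList>& previewRows, int column)
{
    bool allInteger = true;
    bool allReal = true;

    for(const QStringList& row : previewRows)
    {
        if(column >= row.size())
            continue;
        bool ok;
        row.at(column).toLongLong(&ok);
        if(!ok)
            allInteger = false;
        row.at(column).toDouble(&ok);
        if(!ok)
            allReal = false;
    }

    if(allInteger)
        return "INTEGER";
    if(allReal)
        return "REAL";
    return "TEXT";
}
```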
I've thought about setting some pragmas, too 😄 The problem with that is that it makes it necessary to commit all changes first, which would be really annoying for the CSV import. E.g. manually creating a table, then inserting data into it, then reverting all changes wouldn't work anymore. Also it makes the database more vulnerable to data loss when the system crashes. But I'll try it and see how big the difference is anyway. If it saves another 50 seconds I'll reconsider it 😉
... not the entire file, because that would require parsing the entire file before inserting data.
Ahhh, good point. I guess with type detection it's probably more important on the server side (eg when we implement CSV import there) than with DB4S. With DB4S, if people are unhappy with a type it's fairly easy to change it in the GUI.
With the server... not so much. Would need manual download, adjustment and re-upload by the user. We'll need to be more rigorous with automatic type detection there, so there's less need for manual adjustments. 😄

Yeah, that's probably true. But it still makes sense to have it in DB4S, too 😄
Oh, was reminded of an interesting tip for INSERT performance yesterday. Slide 84 here mentions inserting several rows' worth of data with each insert statement. Haven't tried it myself to see if it does anything measurable, but figured it's probably worth a look since you're already doing this stuff 😄 Something like:
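The technique is SQLite's multi-row VALUES form, i.e. something like (table and values made up):

```sql
-- One INSERT statement carrying several rows, instead of one statement per row:
INSERT INTO csv_import (col1, col2, col3) VALUES
    ('a1', 'b1', 'c1'),
    ('a2', 'b2', 'c2'),
    ('a3', 'b3', 'c3');
```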
Sure 😄 By the way, I've just tried the PRAGMAs and they improved performance by around 1% on my system. So it's probably not worth it, especially since I can't even rule out that this is an artefact of my not-so-good benchmark.
Just tried out the "adding multiple rows per insert statement" thing with the simple Go SQLite data generator I wrote the other day. Tried it with 2, 3, 4, and 10 rows added at a time per insert:
sqlitebrowser/sqlitedatagen@9226d7a
No measurable difference in time elapsed for adding 1/2 million total rows. Measured several times for each of the values. It might be that the number of rows added per insert needs to be a lot larger (hundreds?), but I can't easily test it. I'd either need to manually write out an Exec() statement with several hundred variables in it (ugh), or rewrite the Exec function in the Go SQLite library we're using (ugh x2). Not really feeling like doing either. 😉
It's definitely possible the lack of improvement with my testing above is due to my coding being non-optimal, or perhaps even the Go SQLite library in question may not be perf tuned (no idea).
I would assume it would only improve performance if you didn't wrap up the inserts in a global transaction. Normally, SQLite creates a transaction for each insert, and it's this delay that causes the biggest performance bottleneck. Sticking 40 inserts into one transaction (one insert statement) is a lot quicker than doing 40 smaller transactions (40 inserts, each in its own transaction).

Were you doing a global transaction?
Well, going by @jasonwyatt's SpeakerDeck slides it's an optimisation to consider after doing the "ensure all inserts are in a single transaction" step.
In the SQLite test data generation util above, there are 5 tables. Each table has (using default settings) 100k rows inserted into it, inside a single transaction. So, 5 transactions in total (1 per table).
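For reference, the contrast being described is roughly this (toy example with a made-up table):

```sql
CREATE TABLE IF NOT EXISTS t(x INTEGER);   -- made-up table for illustration

-- Autocommit: every INSERT is implicitly wrapped in its own transaction (slow).
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (2);

-- Explicit transaction: many INSERTs share a single commit (much faster).
BEGIN;
INSERT INTO t VALUES (3);
INSERT INTO t VALUES (4);
-- ... thousands more ...
COMMIT;
```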
I'm not sure either. But let's find out 😉 My local tests haven't really been conclusive but it might be a tiny bit faster.
@justinclift Can you maybe fetch that code, build it and test again using the same system and the same UK postcode file? Oh, I should also point out that this commit is the quickest and dirtiest of all hacks. So you have to make sure the number of records (not including the field names if there are any) is divisible by three 😉 If it's not, DB4S will crash eventually. In this case just delete the last one or two records from the file, so the number is divisible. Sorry for the extra hassle but I figured we might try this out before putting too much effort into the implementation's corner cases 😇

Yep, no worries. Will try shortly. 😄
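For context, the hack presumably boils down to something like this (a sketch only, not the actual commit): a prepared statement carrying three rows' worth of placeholders, which is why the record count has to divide evenly; a cleaned-up version would fall back to a single-row statement for the leftovers.

```cpp
// Sketch only (not the actual commit): insert three CSV records per
// prepared statement, handling the remainder with a single-row statement.
#include <sqlite3.h>
#include <string>
#include <vector>

void insertInBatchesOfThree(sqlite3* db, const std::vector<std::vector<std::string>>& records, size_t columns)
{
    // Build "INSERT INTO csv_import VALUES(?,?,...),(...)" for a given row count.
    auto insertSql = [columns](size_t rows) {
        std::string sql = "INSERT INTO csv_import VALUES";
        for(size_t r = 0; r < rows; ++r)
        {
            sql += r ? ",(" : "(";
            for(size_t c = 0; c < columns; ++c)
                sql += c ? ",?" : "?";
            sql += ")";
        }
        return sql;
    };

    sqlite3_stmt* three = nullptr;
    sqlite3_stmt* one = nullptr;
    sqlite3_prepare_v2(db, insertSql(3).c_str(), -1, &three, nullptr);
    sqlite3_prepare_v2(db, insertSql(1).c_str(), -1, &one, nullptr);

    size_t i = 0;
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
    for(; i + 3 <= records.size(); i += 3)
    {
        for(size_t r = 0; r < 3; ++r)
            for(size_t c = 0; c < columns; ++c)
                sqlite3_bind_text(three, static_cast<int>(r * columns + c) + 1,
                                  records[i + r][c].c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(three);
        sqlite3_reset(three);
    }
    for(; i < records.size(); ++i)   // 0, 1 or 2 leftover records
    {
        for(size_t c = 0; c < columns; ++c)
            sqlite3_bind_text(one, static_cast<int>(c) + 1, records[i][c].c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(one);
        sqlite3_reset(one);
    }
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

    sqlite3_finalize(three);
    sqlite3_finalize(one);
}
```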
Hmm... something seems weird. I just tried it and it crashed after the import. Made sure the # of rows was divisible by 3 though. Ahhh... maybe that # needs to be calculated without removing the header line from the calc.

Trying again... 😄
Yep, that was it. Shouldn't have removed the header row before the /3 calc.
Ok, initial stats:
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
116673ms. The parser took 82291ms and the insertion took 34284ms.
I'll run it a few more times so we get an idea of consistency, then try it against the previous commit (with CSV_BENCHMARK enabled) as well for comparison.

Cool 😄 But it's already clear that we need to look at the parser too.
With this new experimental commit:
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121275ms. The parser took 86713ms and the insertion took 34518ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121311ms. The parser took 86693ms and the insertion took 34581ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
120167ms. The parser took 86065ms and the insertion took 34082ms.
And yeah, the parser does seem to take 2/3 of the time atm. The above results are with these settings:
I'll try a run with "Trim fields" turned off, just to see if that changes anything...
That was my impression, too. Without the prepared statement patch the insertion phase took longer than the parsing and now, with it, it's the other way around. Unfortunately the parser is going to be more difficult to optimise, mostly because we're also using it for the preview in the dialog.
Times with "Trim fields" turned off:
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
119714ms. The parser took 81983ms and the insertion took 37716ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
121783ms. The parser took 84156ms and the insertion took 37626ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
122988ms. The parser took 85083ms and the insertion took 37902ms.
Weirdly, even though the parse time seems slightly faster with these runs, the insert time has made up the difference. 😖

I'll try with the previous commit (prior to the "3 rows per insert") next and get some times for that too.
That's very weird. With the Trim fields option turned off we could expect the parse time to be a bit faster. But it makes no sense that the insert time has gone up ☹️
Hmmm, would it make sense to add a few more counters in this experiment code for summing combined total times of various operations? Happy to run it here as needed. 😄
You mean like trying out various combinations in the settings automatically and running each import a couple of times or so? If so, yeah, I thought about that but it turned out to be a pain to implement. Qt actually has a way to measure run time via the unit tests and do these things relatively easily, but for us it's really difficult to use because we would need not only the CSV parser but also the database code for the insert step, and that depends on pretty much the entire program. We can do that later for the parse step, though 😄
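The Qt facility being referred to is presumably QtTest's QBENCHMARK macro; a standalone parser benchmark could look roughly like this (the CSVParser header and parse() call are assumptions, not the real DB4S API):

```cpp
// Hypothetical QtTest benchmark for the parse step only (sketch; the
// CSVParser API used here is assumed, not the actual DB4S interface).
#include <QtTest>
#include <QFile>
#include "csvparser.h"   // assumed header

class TestCsvParser : public QObject
{
    Q_OBJECT
private slots:
    void benchmarkParse()
    {
        QFile f("benchmark.csv");
        QVERIFY(f.open(QIODevice::ReadOnly));
        const QByteArray data = f.readAll();

        QBENCHMARK {
            CSVParser parser;
            parser.parse(data);   // assumed call
        }
    }
};

QTEST_APPLESS_MAIN(TestCsvParser)
#include "test_csvparser.moc"
```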
Times from the previous commit, with "Trim fields" enabled:
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
124843ms. The parser took 84334ms and the insertion took 40500ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
125222ms. The parser took 83589ms and the insertion took 41630ms.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
126896ms. The parser took 85466ms and the insertion took 41412ms.
Hmmm, so the multiple-rows per insert does seem a bit faster.
Nah, that's not what I meant. Hadn't thought that far ahead. I was just thinking of maybe adding a few more timers in places for finer granularity rather than just covering "all of the parser time" and "all of the insert time" for times.
Haven't yet looked at the csv.parse() code, but might take a look over it in a minute just for general awareness. 😄
Oh wow, that's more than expected! I might clean up the code and do a proper commit later, then 😄

Oh I see 😉 Yeah, we can do that too. We can also use the Callgrind module of Valgrind to get an idea of what is happening. I'll see later if it's possible to add some more timers 😄
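For the extra timers, QElapsedTimer around the individual phases would probably be enough (a sketch; the phases shown are placeholders):

```cpp
// Sketch: wrap individual phases in QElapsedTimer to get per-phase totals.
#include <QElapsedTimer>
#include <QDebug>

void timedImport()
{
    QElapsedTimer timer;
    qint64 decodeMs = 0, trimMs = 0, bindMs = 0;

    timer.start();
    // ... decode a chunk of the CSV file ...
    decodeMs += timer.restart();

    // ... trim fields (if enabled) ...
    trimMs += timer.restart();

    // ... bind and execute the insert statement ...
    bindMs += timer.restart();

    qDebug() << "decode:" << decodeMs << "ms, trim:" << trimMs << "ms, bind/insert:" << bindMs << "ms";
}
```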
Cool. 😄

Looking through our parsing code, which runs before the inserts are done... wouldn't that be, er... exactly the optimal place to add column type detection? We're scanning the complete file character by character prior to creating the table. Seems like exactly what we'd want for that? 😁
It would be 😉 But it's also the place where we take a 1GB CSV file and turn it into a 10GB parsed structure (#792). So I'm pretty sure we have to change that as one of the next steps to parse a couple of rows, then insert them, then parse a few more rows, etc.
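In other words, something along these lines eventually (purely a sketch of the proposed restructuring; all of the helper functions here are hypothetical):

```cpp
// Sketch of the proposed streaming import (all helpers are hypothetical):
// parse a block of rows, insert it, and discard it before parsing the next,
// so the whole file never has to sit in memory as a parsed structure.
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical stand-ins for the real parser and database code:
bool parseNextRows(std::vector<Row>& out, size_t maxRows);  // returns false once no rows were read
void insertRows(const std::vector<Row>& rows);              // reuses the prepared statement
void beginTransaction();
void commitTransaction();

void streamImport()
{
    const size_t chunkSize = 10000;   // rows held in memory at a time (arbitrary)
    std::vector<Row> chunk;

    beginTransaction();
    while(parseNextRows(chunk, chunkSize))
    {
        insertRows(chunk);
        chunk.clear();                // drop the parsed rows before reading more
    }
    commitTransaction();
}
```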
Ahhh ok. Yeah, not an optimal use of time to work out. 😄
If it's useful, the exact data set I was testing against (including the /3 change) is here:
https://nightlies.sqlitebrowser.org/tmp/data.gov.uk%20postcode%20database/National_Statistics_Postcode_Lookup_UK-modified.csv.xz (50MB compressed file)
Note, not wanting to leave that on the nightlies server for long, so if you do/don't grab it let me know when it can be removed. 😄

Thanks 😄 Just finished downloading it.

That's very weird. With the Trim fields option turned off we could expect the parse time to be a bit faster. But it makes no sense that the insert time has gone up
Oh, I wonder if there's something either in DB4S or the SQLite code itself which is automatically trimming the fields during the insert now instead?
Meh... probably not, but it would kind of fit the behaviour.
You can try by creating a CSV file with spaces around the fields:
a, b ,c
For the second column it should make a difference whether the checkbox is activated or not. If it doesn't, something else is trimming the fields. But on my system it's working as expected.
As a potentially useful data point when looking into the decoding optimisation, this smaller (~80MB) CSV file takes much longer to decode. Inserting is pretty quick though.
Importing the file '/home/jc/foo/selected_crimes_2012_2015.csv' took
658938ms. The parser took 654968ms and the insertion took 3953ms.
Seems to be sourced from here: https://www.odata.org.il/dataset/maazarim1
Google translate says the language for that website is Hebrew, which makes sense as it's "Israel selected crimes". The licensing info on the page indicates it's CC-BY, so we should be ok to muck around with it. 😄