Crash with huge txt file #1212

Closed
SilvioGrosso opened this Issue Nov 2, 2017 · 25 comments

SilvioGrosso commented Nov 2, 2017

Hello everyone,

These days I have been testing DB Browser for SQLite :-)
Windows 7 (64-bit) with yesterday's daily build (3.10.99).

Today, I tried to import a huge txt file downloaded from here (332 MB):
http://download.geonames.org/export/dump/allCountries.zip

When you unzip this file you get the AllCountries.txt file, which is around 1.4 GB!
It contains 19 columns (UTF-8 encoding; tabs as separator; no header row).
You get more information about this table, named "geoname", here (bottom of the page):
http://download.geonames.org/export/dump/

To make it short, DB Browser for SQLite always stops responding when it reaches around 11% of this import.
Worst of all, when you click "Cancel" to kill the process, the software always crashes.
To make matters even worse, when I try to reopen my database it is no longer available!
It looks like it has been changed to "read-only" mode.

Here is a Dropbox video I recorded to show this crash (Windows 7, 64-bit):
https://www.dropbox.com/s/2rq2dj7merbvgkm/CRASH_IMPORTING_TXT_FILE.mp4?dl=0

To reproduce this crash on Windows 7 (64-bit) every time, you need to:

  • Create a new database (e.g. Prova.db);
  • Click on File > Import from CSV file;
  • Select the AllCountries.txt file (19 columns; fields are selected);
  • Click on OK to start the process;
  • The import usually stops when it reaches 10-11%;
  • When you click on "Cancel", DB Browser for SQLite 3.10.99 always crashes;
  • When you try to open your database (Prova.db) again, it is read-only.

MKleusberg added a commit that referenced this issue Nov 2, 2017

Fix progress dialog for imports of very large CSV files
QProgressDialog only takes an int as the maximum value and the current
value. So using the number of bytes parsed so far isn't going to work
for very large files when an int will overflow. This commit changes this
by precalculating a progress indicator and handing that over to the
QProgressDialog object.

See issue #1212.
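
For anyone following along, the idea can be sketched roughly like this (an illustration only, not the actual DB4S code): keep the byte counts in 64-bit variables and hand QProgressDialog only a small, pre-scaled number, since its setMaximum()/setValue() API takes plain ints.

// Sketch only (not the DB4S implementation): scale a 64-bit byte count down
// to 0-100 before passing it to QProgressDialog, so files larger than ~2 GB
// can no longer overflow the dialog's int-based value.
#include <QProgressDialog>
#include <cstdint>

void updateImportProgress(QProgressDialog& dialog,
                          std::int64_t bytesParsed, std::int64_t totalBytes)
{
    dialog.setMaximum(100);
    const int percent = totalBytes > 0
        ? static_cast<int>(bytesParsed * 100 / totalBytes)
        : 0;
    dialog.setValue(percent);
}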

MKleusberg commented Nov 2, 2017

Unfortunately I can't reproduce the crash here. Can you maybe attach a debugger and post the results here? We have prepared a Wiki page which explains how to do this relatively easily 😉

I could, however, reproduce the problem of the import getting stuck. The import actually just continued; only the progress dialog was broken for very large files. It should be fixed in tomorrow's nightly builds. Can you double-check whether it's working for you? 😄

justinclift commented Nov 2, 2017

To make matters even worse, when I try to reopen my database it is no longer available!
It looks like it has been changed to "read-only" mode.

Ahhh, that sounds familiar. I think I saw something like that when testing some crash-causing-code a while ago too.

From memory, the "read only" mode is because the crash leaves the database with its journal file still right next to it (in the same dir). For some reason (unknown to me, as I didn't investigate), when you next open the file it seems to be read-only. I think the way I made it not-read-only was by deleting the journal file before opening the database again. You'll probably need to experiment with that though, as I'm not 100% sure. Make a backup first, etc, just in case. 😄
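
If anyone wants to script that check, here is a rough sketch (a hypothetical helper, not part of DB4S) that only detects the leftover rollback journal. SQLite names it "<database>-journal", and removing it discards an unfinished transaction, so only experiment on a backed-up copy:

// Hypothetical helper, not DB4S code: report whether a leftover rollback
// journal is sitting next to the database file. Deleting a hot journal can
// lose or corrupt data, so this sketch only detects it.
#include <QFile>
#include <QString>
#include <QDebug>

bool hasLeftoverJournal(const QString& dbPath)
{
    const QString journalPath = dbPath + "-journal";  // SQLite's default journal name
    if (QFile::exists(journalPath)) {
        qDebug() << "Hot journal found:" << journalPath;
        return true;
    }
    return false;
}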

data-man commented Nov 2, 2017

@SilvioGrosso
You can try the csv extension from SQLite's ext/misc.
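
For reference, loading and using that extension might look roughly like this (a sketch only: it assumes csv.c from ext/misc has been built as a loadable module next to the program, and that the filename=/header= arguments behave as described in that source file; note the extension expects comma-separated data, so a tab-separated file like allCountries.txt would need converting first):

// Sketch: load SQLite's csv extension and expose a CSV file as a virtual table.
#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("Prova.db", &db) != SQLITE_OK)
        return 1;

    sqlite3_enable_load_extension(db, 1);   // extension loading is off by default

    char* err = nullptr;
    if (sqlite3_load_extension(db, "./csv", nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "load_extension: %s\n", err ? err : "unknown error");
        sqlite3_free(err);
        sqlite3_close(db);
        return 1;
    }

    // The virtual table can then be copied into a real table with
    // INSERT INTO geonames SELECT ... FROM temp.geonames_csv;
    const char* sql =
        "CREATE VIRTUAL TABLE temp.geonames_csv "
        "USING csv(filename='allCountries.csv', header=false);";
    if (sqlite3_exec(db, sql, nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "csv vtab: %s\n", err ? err : "unknown error");
        sqlite3_free(err);
    }

    sqlite3_close(db);
    return 0;
}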

justinclift commented Nov 2, 2017

Just compiled our latest code - using Linux though not Windows - and am trying the import now. It's not fast, but it's up to 20% so far.

@SilvioGrosso As a thought, maybe try removing those image plugins before starting DB4S, just to see if it makes any difference. Not really expecting it will, but it's a decent thing to try, as that's a known difference between your install and everyone else's (so far). 😄

SilvioGrosso commented Nov 3, 2017

Hello everyone ! :-)

@justinclift

I think the way I made it not-read-only was by deleting the journal file before opening the database again.

Yep.
To resurrect my databases I have to delete these files, which are in the same directory.
At present, however, on Windows 7 (64-bit, yesterday's build) every database gets "corrupted" every time I try to import this gigantic AllCountries.txt file :-)
It always happens at around 11% of the import.
DB Browser always freezes when it reaches this point.
When I kill its process it crashes, and my database always becomes read-only.
As you suggested, I must delete the journal file in order to make it available again.

BTW, I have tested the huge txt file on Windows 10 (64-bit) as well.
On this system the CSV import also stops at around 11% (similar to Windows 7).
However, on Windows 10 the software does not crash when I click "Cancel".
Consequently, my database is not corrupted (namely, changed to "read-only" mode).

justinclift commented Nov 3, 2017

The import here (with a Linux build instead) finally finished just now. That took nearly 50 minutes.

After the import, there are 5381811 rows in the table. Scanning through it to randomly look at the data (e.g. first few rows, last few rows, random ones in the middle), it looks OK. There doesn't seem to be any corruption, and the import seems to have worked.

Does anyone know if that row count is correct? If it's not then I might have imported with incorrect options or something and caused some weirdness (no idea). 😄

justinclift commented Nov 3, 2017

I'll try it out tomorrow on a real Windows system (Win7 gaming PC, not a VM) and see if it crashes there too.

SilvioGrosso commented Nov 4, 2017

Hello Justin,

Thanks a lot for testing this huge txt file on Windows too! :-)

Since I am new to sqlitebrowser, allow me to ask you a very dumb question...
Apologies for my long message... :-(

On Windows 10 (64-bit) my hardware is:
ASUS X302UJ notebook
CPU: Intel i7-6500U (4 CPUs, 2.6 GHz)
RAM: 8 GB
GPU: Nvidia GeForce

I have installed today's daily build (3.10.99 - Nov 4, 2017 - 64.exe installer).
When I import this txt file as CSV, this computer takes around 100 minutes.
I have compared this time twice, both with the "trim fields" option checked and unchecked.

With PostgreSQL 9.5.5 (compiled by Visual C++ build 1800, 64-bit) and pgAdmin 1.22.1 as the GUI, it only takes around 2 minutes!

I am aware it is not fair to compare SQLite vs PostgreSQL, since they serve very different purposes, but I am a tad surprised to notice such a big difference: 100 minutes to import vs 2 minutes!

In both databases, I have always created the geonames table with this same SQL code:
CREATE TABLE geonames
(geonameid INT,
name TEXT,
asciiname TEXT,
alternatenames TEXT,
latitude FLOAT,
longitude FLOAT,
featureclass TEXT,
featurecode TEXT,
countrycode TEXT,
cc2 TEXT,
admin1code TEXT,
admin2code TEXT,
admin3code TEXT,
admin4code TEXT,
population BIGINT,
elevation TEXT, -- this field is the most problematic to import correctly
dem INT,
timezone TEXT,
modificationdate DATE);

To import the txt file: the separator is tab, the encoding is UTF-8, and there is no header row for the 19 columns to import.

What is really weird to me is the final number of rows imported from the same AllCountries.txt file.

With PostgreSQL, I get:
11618620 rows
With Microsoft Excel 2017 I checked that it is the same number as PostgreSQL (11618620 rows).

With DB Browser, I always get (SELECT COUNT(*) FROM geonames):
5381811 rows (this agrees with your count, BTW)

It is difficult to "debug" this AllCountries.txt file since it is huge...
Needless to say, I am puzzled about this different number:
11618620 rows vs 5381811 rows.
With Notepad++, my text editor, I cannot open it up to check because it is way too big :-)

In conclusion, I am not sure whether this difference is due to a bug or something else.
Just asking out of curiosity :-)

THANKS a lot for reading !

chrisjlocke commented Nov 4, 2017

It is difficult to "debug" this AllCountries.txt file

What do you mean debug?

It would be easy for me to read the CSV and check whether each record has been imported into DB4S. I'm bored, so I could do that...
That would highlight (maybe) why a record didn't get imported.
Maybe DB4S should create an import.log file for this purpose....

SilvioGrosso commented Nov 4, 2017

Hello @chrisjlocke,

What do you mean debug?

At work, I always import my CSV files with PostgreSQL.
When all problems are sorted out, thanks to the PostgreSQL log messages, and I have imported my CSV file into PostgreSQL, I do the same import with SQLite.
Once the CSV file has been "debugged" to fix all problems through PostgreSQL, I am confident "even" SQLite can import this same CSV "correctly".

Today, I have opened this txt file (AllCountries) with the Sublime Text editor, which is more powerful than Notepad++ (my preferred text editor, since it is open source...).
Sublime shows the same total number of rows as PostgreSQL and Microsoft Excel, that is:
11618620 rows

I have selected the last row of my geonames table in DB Browser and it is similar to the last line shown by Sublime.
Consequently, I suppose there may be something wrong with the earlier rows of my table as regards their import into DB Browser.

See this screenshot where the Sublime Text editor shows the number of rows in this text file.

[Screenshot: Sublime Text showing the total row count of AllCountries.txt]

justinclift commented Nov 4, 2017

Ahhh, the higher row count from PostgreSQL is good info. 😄

We've previously seen an import take ages + have missing rows. That seems to mean DB4S is getting confused about an end-of-field or end-of-line marker somewhere, so starts reading things incorrectly and misunderstanding the contents.

If we can figure out the first row that's missing, we can look at that row and the ones around it in the source text file to figure out what the likely problem (or bug) is.

I'm thinking we can probably figure that out pretty simply. Sublime Text hopefully has some kind of "Go to line x" command. Take a look at row (say) 1 million in the text file, then look at row 1 million in the Browse Data tab of the imported table, and check whether they're the same. If not, halve it and check row 500,000 in both (and keep going until the first mismatching row is found). It's a bit of a brute force approach, but I find it works for lots of things. 😄
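
If anyone would rather script that than eyeball the Browse Data tab, here is a rough sketch of the same bisection (assumptions: rows were inserted in file order so rowid order matches line order, the table and file names are the ones used in this thread, and once the parser goes wrong the two stay out of sync from that point on):

// Rough sketch of the "compare line N in the file with row N in the table"
// bisection described above.
#include <sqlite3.h>
#include <fstream>
#include <iostream>
#include <string>

// First tab-separated field (geonameid) of line n (1-based) in the text file.
// Re-reads the file from the start on every probe: slow but simple.
std::string fileGeonameId(const std::string& path, long long n)
{
    std::ifstream in(path);
    std::string line;
    for (long long i = 0; i < n && std::getline(in, line); ++i) {}
    return line.substr(0, line.find('\t'));
}

// geonameid of row n (1-based, in rowid order) of the imported table.
std::string dbGeonameId(sqlite3* db, long long n)
{
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT geonameid FROM geonames ORDER BY rowid LIMIT 1 OFFSET ?;",
        -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, n - 1);
    std::string result;
    if (sqlite3_step(stmt) == SQLITE_ROW) {
        const unsigned char* text = sqlite3_column_text(stmt, 0);
        if (text)
            result = reinterpret_cast<const char*>(text);
    }
    sqlite3_finalize(stmt);
    return result;
}

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("Prova.db", &db) != SQLITE_OK)
        return 1;

    // Binary search for the first line where the file and the table disagree.
    long long lo = 1, hi = 11618620;   // known line count of allCountries.txt
    while (lo < hi) {
        const long long mid = (lo + hi) / 2;
        if (fileGeonameId("allCountries.txt", mid) == dbGeonameId(db, mid))
            lo = mid + 1;
        else
            hi = mid;
    }
    std::cout << "First divergent line is around " << lo << "\n";

    sqlite3_close(db);
    return 0;
}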

chrisjlocke commented Nov 4, 2017

Lol. What part of 'I'm bored' didn't you understand! ;)
Well, DB4S was importing the file into a table, and I've been watching Z Nation while that's been running. I've got a program that'll check which rows are missing... Hang tight... ;)

MKleusberg commented Nov 4, 2017

Can you retry with the Quote character field set to nothing? I'm asking because the allCountries file contains a few double quote characters in some of its fields, and maybe some of them are in just the right places to make the CSV import read all text up to the next unescaped quote character as a single field. Because that next quote can be thousands of lines further down, all those rows would effectively be skipped, while also messing up the table structure.
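
As a made-up illustration (these are not real geonames rows), imagine a bare quote in one field; a quote-aware parser then treats everything up to the next quote character as a single field, even across line breaks:

1001<TAB>Some Hill<TAB>...
1002<TAB>Roc "Blanc<TAB>...      <- the bare quote opens a quoted field here
1003<TAB>Another Peak<TAB>...    <- still being read as part of row 1002's field
...every following line is swallowed into that one field until the parser
finds the next double quote, possibly thousands of lines later...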

justinclift commented Nov 4, 2017

Lol. What part of 'I'm bored' didn't you understand! ;)

Sorry. Was skimming. 😄

justinclift commented Nov 5, 2017

Just attempted an import of the file with the quote char field set to nothing. Gave up at 57% (needed to do something else with the PC), which took just over 1 hour and 10 minutes.

On one hand, the longer time now seems to indicate it's finding a different set of rows. No idea if it'll be more/less/etc. It's weird that the import is taking this long, since the ~700MB postcode file only takes about 100 seconds to import.

chrisjlocke commented Nov 5, 2017

Can you retry with the Quote character field set to nothing?

Did that, and ended up with 11,618,851 rows - that's more than the 11,618,620 it should be! 😖

chrisjlocke commented Nov 5, 2017

A quick visual shows that some rows are 'corrupted' though.

[Screenshot: some imported rows with 'corrupted' fields]

MKleusberg commented Nov 5, 2017

At least that's a lot closer to the right row count 😉

But let's first speed up the CSV import for this file. I've been playing around with this now and it looks like we can speed it up by quite a bit. The speedup would actually be on top of the other performance improvements I did a few weeks ago. So stay tuned 😉

MKleusberg commented Nov 5, 2017

Very weird. I just imported the whole allCountries file (with no headers, separator=tab, quote=none, encoding=UTF-8, no trim) and here it says that there are 11618620 rows and I didn't change anything about the parser...

MKleusberg commented Nov 5, 2017

So I've just committed the performance patch. The 700MB postcode file which @justinclift mentioned took about 105s on my system before this and about 24s after it. I had never tried to import the whole allCountries file before (I cancelled it when the import reached 15% or so 😉), but it now takes about 87s. Turns out that after all the optimisation efforts there was still one more easy way left to speed things up 😄

Maybe @justinclift has the time to start a once-off build, so you can all test with the new version to save you a bunch of time. And if not, it will just be in tomorrow's nightly build.

justinclift commented Nov 5, 2017

@MKleusberg Wow, that is a lot faster now. Just imported the postcode file again... and yep it's about 20 seconds. That's Awesome. Really well done. 😀

With AllCountries.txt, it now takes about a minute (I didn't time it carefully), and the total number of records is 11,618,620. That matches the line count of the file too (using wc -l). It looks like the import is now getting it right (tab delimiter, no quote char, UTF-8 encoding). I don't remember if I had trimming turned on or off... I think it was turned off.

If someone wants a new build now (?), I can definitely start the build script.

SilvioGrosso commented Nov 6, 2017

Hello everyone !

Just downloaded today's daily build (x64.exe installer, as usual).
Tested on Windows 10 (64-bit):
CPU: Intel i7-6500U (4 CPUs, 2.6 GHz)
RAM: 8 GB
GPU: Nvidia GeForce

AllCountries.txt now takes around 10 minutes, which is a big improvement on Windows! :-)
Unfortunately, it still imports the wrong number of rows, that is:
5381811

I have also tried to import the CSV file National_Statistics_Postcode_Lookup_UK, proposed as a benchmark by Justin. This one takes less than a minute to import:
17568189 rows

As far as I am concerned, you can close this bug since:

  • it does work on Linux now;
  • this .TXT file is extremely error-prone with PostgreSQL as well;
  • it does NOT corrupt my database any longer when I "Cancel" the import. This was my main reason for opening this issue.

BTW, these past months you have done an incredible job improving DB Browser for SQLite!
Thanks a lot indeed :-)

chrisjlocke commented Nov 6, 2017

Are you sure you're setting the quote character to nothing?

SilvioGrosso commented Nov 6, 2017

Hello @chrisjlocke !

Are you sure you're setting the quote character to nothing?

This is it :-)

[Screenshot: the DB Browser CSV import settings dialog]

Now it imports all rows correctly, and the timing on Windows 10 is on par with Linux as well.

Thanks a lot for pointing out my flaws.

justinclift commented Nov 7, 2017

Excellent. 😄

@justinclift justinclift closed this Nov 7, 2017
