Feature request: Better column type detection in CSV import #1003

Closed
justinclift opened this Issue May 3, 2017 · 18 comments

justinclift (Member) commented May 3, 2017

At present our CSV import dialog seems to use TEXT as the type for all columns.

It would be useful to do some form of detection on the data (after the last row is loaded?), then change the column types to match whatever is detected:

  • INTEGER and REAL should both be fairly easy to detect and implement
  • CSV has no defined way of handling BLOB/binary values in the official RFC, so probably no need to detect that

We may need to define something custom for binary/BLOB importing anyway, just so it can be done in the (rare?) cases where it's needed, e.g. for importing DBHub.io CSV exports that have binary/BLOB data in them.

justinclift changed the title from "Feature request: Better field type detection in CSV import" to "Feature request: Better column type detection in CSV import" May 3, 2017

chrisjlocke (Contributor) commented May 3, 2017

If a value of 33 is detected, do you class that as an integer or a real? ;)
Does it have any impact though? Wouldn't the user map the CSV fields to the database fields (whose types SQLite doesn't really enforce anyway)?
So a CSV might have 3 fields while the database has 20, and I'd want to map fields blah to blah.

MKleusberg (Member) commented May 3, 2017

It actually does have an impact:

create table test(a int, b real);
insert into test values(1,1);
select a/2,b/2 from test;
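-- With these declared types, a/2 is evaluated with integer division and
-- returns 0, while b/2 returns 0.5 - so the chosen column type does
-- change query results.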

But yeah, it might be best to add a way to define which CSV field should be mapped to which database field. We could also add a routine to guess decent default data types but it's never going to be perfect. That's because we can't really read the entire CSV file just to find out the best data type (and we need to know them before importing any data) - that's just too much overhead for larger files. Now let's say we read only the first 50 lines or so and one field is empty in all of them; what data type is it going to get? Probably TEXT but in line 51 there might be a number. Or maybe the first 50 lines just contain integers and line 51 contains the first string 😉 Not sure how to avoid errors here.

chrisjlocke (Contributor) commented May 3, 2017

> Or maybe the first 50 lines just contain integers and line 51 contains the first string 😉 Not sure how to avoid errors here.

I'd guess the types based on the first line, but give the user the ability to change them. The user should know their schema (especially if they're importing CSV files - that's quite an advanced activity - my Mum could probably use DB4S if she wanted, but there's no way she'd dabble with CSV files!) and guessing INT instead of REAL isn't a huge issue, considering we're using TEXT at the moment! ;)

justinclift (Member, Author) commented May 3, 2017

Heh heh heh, I was actually thinking of a fairly simple approach which - in my head so far - seems like it would be both accurate and fairly reasonable in CPU/IO overhead.

Potential data types:

  1. UNDECIDED
  2. INTEGER
  3. REAL
  4. TEXT

Concept for the process:

  1. After the import options dialog closes, create a simple array of elements, one for each column, initialised to the UNDECIDED value
  • eg for a table with 5 columns, that means 5 elements
  2. As each cell is read in, check the characters in the cell data and update the corresponding element for the column
  • If they're all [0-9], that's INTEGER
  • If they're [0-9] plus "." or "," (eg numeric, full stop, or comma), that's REAL
    • Note, this is a simplification, as the separator character (often ",") needs to be taken into account, as does the meaning of "." vs "," as decimal point/thousands separator due to locale
  • If there's anything else, it's TEXT
  3. The order of the data types in the list above is important (UNDECIDED → INTEGER → REAL → TEXT), as once a data type is reached (eg REAL) subsequent rows can't move the type backwards, eg REAL can't move back to INTEGER

Once all of the rows for a table are imported, the element list should contain an accurate representation of the data types per column.
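For illustration, here's a minimal C++ sketch of that one-way promotion idea. This is not the actual DB4S code; the names ColumnType/classifyCell and the simple, locale-unaware character test are just assumptions for the example.

#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Candidate types, ordered so a column can only ever be promoted "upwards":
// UNDECIDED -> INTEGER -> REAL -> TEXT.
enum class ColumnType { Undecided = 0, Integer = 1, Real = 2, Text = 3 };

// Classify a single cell. Deliberately ignores locale and the field separator,
// as noted above - this is only a sketch of the idea.
ColumnType classifyCell(const std::string& cell)
{
    if (cell.empty())
        return ColumnType::Undecided;          // empty cells don't influence the type

    bool allDigits = true, digitsAndPoint = true;
    for (unsigned char c : cell)
    {
        if (!std::isdigit(c))
            allDigits = false;
        if (!std::isdigit(c) && c != '.' && c != ',')
            digitsAndPoint = false;
    }

    if (allDigits)      return ColumnType::Integer;
    if (digitsAndPoint) return ColumnType::Real;
    return ColumnType::Text;
}

int main()
{
    // Example data for a 3-column table.
    std::vector<std::vector<std::string>> rows = {
        {"1",  "3.5", "abc"},
        {"42", "7",   "def"},
        {"",   "1,2", "9"},
    };

    // One element per column, initialised to UNDECIDED.
    std::vector<ColumnType> columnTypes(3, ColumnType::Undecided);

    // As each cell is read in, promote the column's type if needed;
    // the enum ordering means it can never move backwards.
    for (const auto& row : rows)
        for (std::size_t i = 0; i < row.size(); ++i)
            columnTypes[i] = std::max(columnTypes[i], classifyCell(row[i]));

    // Prints 1 2 3, i.e. INTEGER, REAL, TEXT.
    for (ColumnType t : columnTypes)
        std::cout << static_cast<int>(t) << ' ';
    std::cout << '\n';
}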

Is there a hole in this approach I'm not seeing? 😄

justinclift (Member, Author) commented May 3, 2017

Here's an example of why the data type per column is important:
[screenshot: grid view sorted by the channel_1 column, 2017-05-03 16:42:52]
The channel_1 column there is correctly sorted because the data type is INTEGER. When the data type was the default TEXT though, the ordering was by string order. eg:

channel_1 ▲

0
1
1
109
11
114
12
13
2
2
21
22
3
4
7
9

String order definitely isn't correct for that column.

justinclift (Member, Author) commented Jul 2, 2017

Note, there are some useful follow-up thoughts here (from @lhunsicker) to consider when we're looking at this in more depth: #195 (comment)

MKleusberg added a commit that referenced this issue Sep 18, 2017

Add automatic data type detection to the CSV import
When importing a CSV file into a table that doesn't exist yet (i.e. that
is created during the import), try to guess the data type of each column
based on the first couple of rows. If it is all floats or mixed floats
and integers, set the data type to REAL; if it is all integers, set the
data type to INTEGER; if it is anything else, set the data type to TEXT.

See issue #1003.
MKleusberg (Member) commented Sep 18, 2017

I've just added basic data type detection. It only distinguishes between TEXT, REAL, and INTEGER, but as @justinclift described above there aren't really any more data types to consider for CSV. TEXT is used as the default data type if none of the other data types can be used. If a column contains both REAL and INTEGER data, the data type is set to REAL. Also, it only analyses the first 20 rows, because we can't parse the entire file just to find out the data types - the import would take ages.
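As a rough sketch of that merging rule (again, not the real implementation - guessColumnTypes and the hard-coded 20-row sample are just illustrative, reusing ColumnType and classifyCell from the sketch further up):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Guess the type of each column from at most the first 20 rows, then fall
// back to TEXT for any column that is still undecided (eg every sampled
// cell was empty). Mixed INTEGER/REAL data ends up as REAL because REAL
// outranks INTEGER in the promotion order.
std::vector<ColumnType> guessColumnTypes(
        const std::vector<std::vector<std::string>>& rows, std::size_t columns)
{
    const std::size_t sampleRows = std::min<std::size_t>(rows.size(), 20);
    std::vector<ColumnType> types(columns, ColumnType::Undecided);

    for (std::size_t r = 0; r < sampleRows; ++r)
        for (std::size_t c = 0; c < columns && c < rows[r].size(); ++c)
            types[c] = std::max(types[c], classifyCell(rows[r][c]));

    for (ColumnType& t : types)
        if (t == ColumnType::Undecided)
            t = ColumnType::Text;              // default when nothing usable was seen

    return types;
}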

@justinclift Want to try it? 😄

lhunsicker commented Sep 18, 2017 (comment minimized)

justinclift (Member, Author) commented Sep 18, 2017

@MKleusberg Yep, I'll give it a go now. 😄

justinclift (Member, Author) commented Sep 18, 2017

@MKleusberg Looking good. 😄

The data types detected in the UK post code file seem correct:

[screenshot: db4s_column_detection - detected column types for the UK post code import]

I also enabled the CSV benchmarking code. Timings with this new commit ("Trim fields" enabled):

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
66834ms. Of this 10891ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
66377ms. Of this 10902ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
66089ms. Of this 10886ms were spent in the row function.

Trim fields disabled:

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
65599ms. Of this 10788ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
66901ms. Of this 10955ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
65098ms. Of this 10825ms were spent in the row function.

Those times are significantly quicker than the previous runs on this same computer. It's fairly likely due to something else entirely though: finally getting around to making ACPI work on this PC, which seems to have sped up many things.

I'll run some timings with the prior commits, to confirm.

justinclift (Member, Author) commented Sep 18, 2017

Timings (Trim fields enabled) with prior CSV performance commit 0eb1f65 (from 5 days ago):

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
84059ms. Of this 10022ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
82805ms. Of this 9431ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
84286ms. Of this 9821ms were spent in the row function.

Well, there's clearly a good speed up from that commit 5 days ago until now. 😄

justinclift (Member, Author) commented Sep 18, 2017

Just for completeness, also ran the timing benchmark using the commit just prior to this new CSV type detection one. With trim fields enabled:

Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
64859ms. Of this 9861ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
64199ms. Of this 9755ms were spent in the row function.
Importing the file '/home/jc/Databases/National_Statistics_Postcode_Lookup_UK.csv' took
63733ms. Of this 9459ms were spent in the row function.

Wherever this new speed-up came from, it existed prior to the type detection. The type detection is important too, though. 😉

MKleusberg (Member) commented Sep 19, 2017

Heh heh, this new speed-up should come from commit 659f38e. As far as I can see you haven't benchmarked that one individually yet. It replaces most of the Qt classes that were used during parsing with some self-written code. The advantage is that this way we can do all the memory management manually and save a load of unnecessary allocations. It will probably be the last CSV performance commit too - at least for a while 😉 Considering that we started at 3 minutes on your system and it's now just above 1 minute, it's probably good enough for now 😄

Also good to hear that the type detection is working (hadn't tested it for REAL values yet). Should we close the issue then?

justinclift (Member, Author) commented Sep 19, 2017

Yep, definitely a lot faster, and with type detection as a bonus. That's a pretty good effort in my opinion. 😄

lhunsicker commented Sep 28, 2017 (comment minimized)

justinclift (Member, Author) commented Sep 28, 2017

Ouch. Thanks for reporting it @lhunsicker. 😄

@chrisjlocke has reopened the issue, so we can investigate it again. Thanks @chrisjlocke. 😀

@lhunsicker Have you tried it with the 3.10.0 release (instead of 3.10.1) too?

I'm wondering if it's something that appeared between those two releases, as that would make it pretty easy to figure out where the problem crept back in.


Thinking out loud, we might want to add a test for this failure to our automated test runs, once we figure out the solution to this new occurrence of it. As it's shown up more than once already, it seems reasonable to think it might pop up again otherwise.

lhunsicker commented Sep 28, 2017 (comment minimized)

justinclift (Member, Author) commented Sep 28, 2017

Thanks @lhunsicker. And no worries. 😄

mgrojo referenced this issue Oct 6, 2018: Cannot modify database - No Such Savepoint #836 (Open, 4 of 13 tasks complete)