
Import Table from CSV is Not Preserving Quoted Values as Text #1382

Open
frankpd opened this Issue May 9, 2018 · 25 comments


frankpd commented May 9, 2018

Details for the issue

Problem: Import Table from CSV is not preserving quoted values as text.

Using Import - Table from CSV File, I am trying to import this csv file into an empty database. Values are double-quoted to indicate that they should be preserved as text - for example, the FIPS column contains five-digit numbers with leading zeros, and these values are all double-quoted.

When I import the CSV, I leave the default quote character as double quotes, and the FIPS values I see in the preview look fine. But when I hit OK, the import saves this column as an Integer, ignoring the double quotes. As a result the leading zeros are dropped from the values and the codes are no longer correct.

I tried all of the following, with no effect:

  • leaving the trim option off
  • going to View - Preferences - Database and changing the default field type from Integer to Text
  • adding an integer id column in the CSV before the FIPS column (in case the first column is reserved as an integer key column)
  • using single instead of double quotes to preserve text

I'm using Version 3.10.99 with SQLite Version 3.11.0. I'm running 64-bit Linux Mint 18.2 and am using the linuxgndu sqlitebrowser PPA build for Ubuntu Xenial.

(Screenshots: import_csv, import_integers, zeros_dropped)

Useful extra information

I'm opening this issue because:

  • [ ] DB4S is crashing
  • [x] DB4S has a bug
  • [ ] DB4S needs a feature
  • [ ] DB4S has another problem

I'm using DB4S on:

  • [ ] Windows: ( version: ___ )
  • [x] Linux: ( distro: Ubuntu Xenial )
  • [ ] Mac OS: ( version: ___ )
  • [ ] Other: ___

I'm using DB4S version:

  • [ ] 3.10.1
  • [ ] 3.10.0
  • [ ] 3.9.1
  • [x] Other: 3.10.99


chrisjlocke (Contributor) commented May 9, 2018

GitHub is mangling the CSV file - can you email it to me? db4s [at] chrisjlocke [dot] co [dot] uk.
Ta!

chrisjlocke (Contributor) commented May 9, 2018

Actually, copying and pasting worked. I think.
However, I'm still getting a weird error.

(Screenshot of the error message)

If I delete the line it's moaning about, it'll moan about another line. Delete those, and it'll moan again. Grr. If only there were a log that showed what it was moaning about, and where.

frankpd (Author) commented May 9, 2018

I'll email you the file and see if I can post it somewhere else, but I also did a basic test to make sure it wasn't the file causing the problem. If you take the records below, copy and paste them, save them as a CSV, and try that, you should get the same result (quotes ignored, saved as integers, zeros dropped):

"FIPS","USPS"
"01","AL"
"02","AK"
"04","AZ"
"05","AR"
"06","CA"
"08","CO'
"09","CT
"10","DE"
"11","DC"
"12","FL"

karim (Member) commented May 9, 2018

The FIPS column type is INTEGER in your screenshot. Can you try changing its type to TEXT?

Edit:
More info about leading zeros: SQLite strips the leading zeros from values in INTEGER columns. I believe the only way to keep a leading zero is if the column type is TEXT.
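
For anyone who wants to see that behaviour outside of DB4S, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are just for illustration): the same quoted value keeps its leading zero in a TEXT column but loses it in an INTEGER column.

    import sqlite3

    # In-memory database purely for the demonstration.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE states_int (FIPS INTEGER, USPS TEXT)")
    con.execute("CREATE TABLE states_txt (FIPS TEXT, USPS TEXT)")

    # The same quoted CSV value inserted into both tables.
    con.execute("INSERT INTO states_int VALUES (?, ?)", ("01", "AL"))
    con.execute("INSERT INTO states_txt VALUES (?, ?)", ("01", "AL"))

    print(con.execute("SELECT FIPS FROM states_int").fetchone()[0])  # 1  - leading zero lost
    print(con.execute("SELECT FIPS FROM states_txt").fetchone()[0])  # 01 - leading zero preserved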

frankpd (Author) commented May 9, 2018

@karim the type Integer (that you see in the 2nd image) is assigned after the table is imported (image 1). The import process ignores the fact that the values in the first column of the csv are in quotes, and thus should be imported and saved as text. So yes, they do need to be saved as text. Is there another way to designate types before the data is imported? Changing the column type afterwards doesn't help, as the zeros have already been dropped. I could go back and concatenate a zero to the front of the codes that lost them, but this isn't desirable - there are a lot of different identifiers that use leading zeros (ISO country codes, US ZIP codes, the FIPS codes in this example, etc.).

karim (Member) commented May 9, 2018

You have to re-import again after changing the type. Changing the type alone will not add the leading zeros back, since that information is already lost.

Either create a new database with FIPS as TEXT and then import again, or simply delete all the data from this table, change the type, and import your .csv file again.

I had no problems with your .csv file. I created a new database with no tables, then imported the file and it kept the leading zeros.

(Screenshots: csv1, csv2)
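
Outside the GUI, the same workflow can be sketched with Python's csv and sqlite3 modules - declare FIPS as TEXT before loading the rows. The file and table names below are assumptions, not something from the original report:

    import csv
    import sqlite3

    con = sqlite3.connect("census.db")  # hypothetical database file
    # Declare the FIPS column as TEXT *before* importing, so leading zeros survive.
    con.execute("CREATE TABLE IF NOT EXISTS states (FIPS TEXT, USPS TEXT)")

    with open("states.csv", newline="") as f:  # hypothetical CSV file
        reader = csv.reader(f)
        next(reader)  # skip the "FIPS","USPS" header row
        con.executemany("INSERT INTO states VALUES (?, ?)", reader)

    con.commit()
    print(con.execute("SELECT * FROM states LIMIT 3").fetchall())
    # e.g. [('01', 'AL'), ('02', 'AK'), ('04', 'AZ')] with the sample data above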

justinclift (Member) commented May 9, 2018

Ahhh. This sounds like a case of our new data type autodetection messing up. It's looking at the contents of each field character by character and thinking 01 (etc) is an integer.

@MKleusberg With the new data type autodetection, what do you reckon the right approach here is? Should we treat all fields in quotes as text, and not do auto-detection on those?

@frankpd If you're ok to try the import with our latest released version (3.10.1), it might work better for you in this instance. The new auto-detection code isn't in that release, so it'll import all fields as verbatim text.

justinclift added the bug label May 9, 2018

sky5walk commented May 9, 2018

I am a huge fan of automation, but CSV importing should not assume datatypes.
What if I have hex values or zip codes or complex numbers in a column?
Can the import tool pop up a prompt with nFields based on the delimiter and let the user select?
Then store that as a template for future imports...

frankpd (Author) commented May 9, 2018

I just tried karim's suggestion - if I create an empty table shell that has the same structure as the CSV file and designate the types, and then go to import the CSV and specify that table as the name, I'm prompted with "that table already exists - would you like to import data into it?" I say yes, the import is done, and the data is preserved. This is something I had considered but couldn't figure out how to do here. In pgadmin3 for PostgreSQL that is the procedure for importing CSVs, but to do it you would actually click on the empty table, right-click, and choose an import option.

But I'd still argue that this is a bug - the purpose of quoting values in a CSV is to escape delimiter characters that appear within values and to preserve fields as text. I wouldn't think that's something that should be overruled.

@justinclift I used the Ubuntu PPA to grab the latest stable version, and 3.10.99 was the one it gave me for ubuntu xenial (16.04) - even though 3.10.1 seems to be the latest one listed there: https://launchpad.net/~linuxgndu/+archive/ubuntu/sqlitebrowser

I'll try removing the package and adding it again. I'm writing a book on working with census data and am using sqlitebrowser in my database examples, so I just want to be sure that all works smoothly! :)

karim (Member) commented May 9, 2018

@justinclift I don't know if it is possible to do anything here. My guess is that the import dialog sees the data as text (because of the double quotes) and sends it to the SQLite library, which silently converts it to INTEGER and drops the leading zeros.

This question from the SQLite FAQ somewhat confirms it:

(3) SQLite lets me insert a string into a database column of type integer!

This is a feature, not a bug. SQLite uses dynamic typing. It does not enforce data type constraints. Data of any type can (usually) be inserted into any column. You can put arbitrary length strings into integer columns, floating point numbers in boolean columns, or dates in character columns. The datatype you assign to a column in the CREATE TABLE command does not restrict what data can be put into that column. Every column is able to hold an arbitrary length string. (There is one exception: Columns of type INTEGER PRIMARY KEY may only hold a 64-bit signed integer. An error will result if you try to put anything other than an integer into an INTEGER PRIMARY KEY column.)

But SQLite does use the declared type of a column as a hint that you prefer values in that format. So, for example, if a column is of type INTEGER and you try to insert a string into that column, SQLite will attempt to convert the string into an integer. If it can, it inserts the integer instead. If not, it inserts the string. This feature is called type affinity.

karim (Member) commented May 9, 2018

@frankpd Since all the data in the FIPS column are integers, even though SQLite sees them as strings (because of the double quotes) it is able to convert them from string to integer and insert them as integers, after discarding the leading zeros.

Changing the column type to TEXT will give a hint to SQLite that you want strings and stop it from attempting to convert them to integers.
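
To make the type-affinity behaviour from that FAQ entry concrete, here is a small sqlite3 sketch (illustrative only): a well-formed number inserted into an INTEGER column gets converted, a non-numeric string is kept as text, and only INTEGER PRIMARY KEY columns actually reject non-integers.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (n INTEGER)")

    # A well-formed number is converted by INTEGER affinity...
    con.execute("INSERT INTO t VALUES ('01')")
    # ...while a string that cannot be converted is stored as text.
    con.execute("INSERT INTO t VALUES ('AL')")
    print(con.execute("SELECT typeof(n), n FROM t").fetchall())
    # -> [('integer', 1), ('text', 'AL')]

    # The one exception: INTEGER PRIMARY KEY columns only accept integers.
    con.execute("CREATE TABLE p (id INTEGER PRIMARY KEY)")
    try:
        con.execute("INSERT INTO p VALUES ('AL')")
    except sqlite3.Error as e:
        print("rejected:", e)  # datatype mismatch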

justinclift (Member) commented May 9, 2018

I am a huge fan of automation, but CSV importing should not assume datatypes.

@sky5walk Well... it sounded good at the time? 😄

There are definitely cases where it's super useful, e.g. working with large databases where the data types are obvious. But as you mention, it seems like a case where we'd probably want an option in the CSV import dialog for toggling it on/off.

justinclift (Member) commented May 9, 2018

I'm writing a book on working with census data and am using sqlitebrowser in my database examples, so I just want to be sure that all works smoothly!

@frankpd Awesome! 😄

And yeah, I can definitely understand the desire to get things right. 😁

justinclift (Member) commented May 9, 2018

@karim Well, it used to be that when we imported CSVs, every field was assumed to be text. No exceptions. SQLite itself wasn't converting the values to anything, as we'd automatically created a new table in the background first with all fields as text.

That worked well enough, but was also kind of a pain when importing larger files (millions of rows, etc) where the data was obviously (say) just integers. Changing the data type afterwards worked, but was slow for big files when every column needed changing, one by one.

For example, this one:

Thus adding the "auto detection" thing. Which looks like it still needs a bit of fine tuning before the next release. 😄

karim (Member) commented May 9, 2018

@justinclift Ahhh, I didn't know that, sorry. 😄

@frankpd I have a question if you don't mind me asking. 😃 I noticed that the last column has padding at the end (a lot of spaces) to make it exactly 200 characters. Is this intentional?

frankpd (Author) commented May 10, 2018

@karim It's probably not intentional. This file was published by the USDA in an Excel .xls format - it must have been an oversight when they created it. I'm using LibreOffice Calc, and I saved it as a csv so I could import it into the database.

Most files produced by the US government use FIPS or GEOID codes to uniquely identify places like states and counties, and in many cases these codes have leading zeros. The file in this particular example is small, but potentially you could have census data tables with hundreds of thousands of records (so it would be slow to load these tables and then overwrite them).

karim (Member) commented May 10, 2018

@frankpd I see. I just thought it was GitHub messing with the uploaded .csv file.

Thanks for your reply and good luck with your book. 😃

MKleusberg (Member) commented May 25, 2018

Sorry, I haven't read all comments (yet) but will just drop in here anyway 😉

As @justinclift suggested in #1395, the best approach here might be to add an option to disable the type detection. I think for most cases it's good to have it, and it does seem to work mostly fine, but obviously there are some cases where it causes trouble. The problem here is that we can't check for double quotes around the values: if they are used in a file, they are supposed to be around all values (that's at least the assumption, because that's how most programs do it - but if I have learned one thing about CSV files, it's to never make assumptions). So for files with double quotes, type detection would pretty much always be entirely disabled, and for entirely unrelated reasons, because some files just use quotes and others simply don't.

Hope that makes some sort of sense 😉 I see only two options here: either add an option to disable the automatic type detection, or change the type detection to check for leading zeros and import those columns as text. I'd go for the first option. What do you think?
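
For what it's worth, a rough illustration (in Python, not the actual DB4S C++ code) of what the leading-zero variant of the second option could look like - a per-column guess that falls back to TEXT as soon as a value only makes sense as text:

    def guess_column_type(values):
        """Guess a SQLite type for one CSV column from its raw field strings.

        Purely illustrative - not the detection logic DB4S actually uses.
        """
        guess = "INTEGER"
        for v in values:
            v = v.strip()
            if v == "":
                continue  # ignore empty cells
            # A leading zero (e.g. "01") means the zeros are significant: keep the column as text.
            if len(v) > 1 and v[0] == "0" and v.isdigit():
                return "TEXT"
            if guess == "INTEGER" and not v.lstrip("+-").isdigit():
                guess = "REAL"
            if guess == "REAL":
                try:
                    float(v)
                except ValueError:
                    return "TEXT"  # not numeric at all
        return guess

    print(guess_column_type(["01", "02", "04"]))  # TEXT    (the FIPS column above)
    print(guess_column_type(["1", "2", "42"]))    # INTEGER
    print(guess_column_type(["1.5", "2.0"]))      # REAL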

MKleusberg added a commit that referenced this issue May 25, 2018

Add option to disable automatic type detection in CSV import
Because there are some circumstances under which the automatic type
detection can cause problems with the imported data and because it is
not accurate when the data changes a lot after the first couple of rows,
we need an option to disable it.

See issue #1382.
MKleusberg (Member) commented May 25, 2018

Just added that option to disable type detection. It's one of the new "Advanced" settings which means that the same comments as mentioned here apply.

Can you give that a try, @frankpd? 😄

justinclift (Member) commented May 26, 2018

Ahhh, found it. This is a good example of a CSV that's a pain to change data types for (manually) after the import. It's one of the ones that led to the "let's add data type autodetection" change to our importing.

It's also decent for benchmarking import times (e.g. when optimising code). 😄

sky5walk commented May 26, 2018

Cool, is there a sample size instead of the never-ending download?
I bailed at 150 MB...

justinclift (Member) commented May 26, 2018

Ahhh. Yeah that one is um.. 600+ MB from memory. Just added that info to the above comment. 😄

The Backblaze drive stats ones are even bigger:

    https://www.backblaze.com/b2/hard-drive-test-data.html

frankpd (Author) commented May 29, 2018

I think the option to disable automatic type detection would satisfy many use cases. If you have a file without any fields that have leading zero values, the autodetect is a nice option for importing everything correctly. If you have a small to medium size file (in terms of number of fields) that has a field with leading zeros, you can turn off the autodetect, and after import you can modify the types one by one (to convert text columns to integer or real).

A use case where this would not work: you have a really large file with an id field that has leading zeros, and then a ton of columns that are mostly numeric. Turn autodetect on, and the zeros are dropped from the id column and you need to recreate it. Turn autodetect off, and you have to modify the data types of that ton of columns one by one. The workaround for this case would be to create a blank table with the same fields as the CSV and the correct types first, and then import the CSV into that table (by providing the blank table's name at import). This takes some work, but is not unreasonable or unheard of.

Are there any instances when you would not want a field with leading zero values saved as text? In the case of ZIP / FIPS / Hex values you always want to preserve the zeros. But in other cases zeros may be used as padding to fill a field out to a certain length, and in those cases it's desirable to drop them.

Ultimately, I think providing the option to enable / disable autodetect is useful. It's nice to have the choice 😄

@MKleusberg Thanks! I just tested the nightly build on Ubuntu and the option works as intended 😄 Leave autodetect on and columns are read in as text or integers depending on their values (leading zeros dropped from the ID). Go to Advanced and turn autodetect off, and all columns are read in as text.

justinclift (Member) commented May 29, 2018

Awesome. Sounds like we have a working solution then. 😄

Are we ok to close this issue, or have I forgotten something?

sky5walk commented May 29, 2018

I cannot find this Advanced option on the latest Windows nightly build?

MKleusberg added a commit that referenced this issue Jun 7, 2018

Add option to disable automatic type detection in CSV import
Because there are some circumstances under which the automatic type
detection can cause problems with the imported data and because it is
not accurate when the data changes a lot after the first couple of rows,
we need an option to disable it.

See issue #1382.