Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic type detection for CSV data #282

Closed
simonw opened this issue Jun 19, 2021 · 4 comments
Closed

Automatic type detection for CSV data #282

simonw opened this issue Jun 19, 2021 · 4 comments
Labels
cli-tool enhancement New feature or request

Comments

@simonw
Copy link
Owner

simonw commented Jun 19, 2021

I've touched on this before in #179 - but now that I've added sqlite-utils memory this is much more important - because unlike with sqlite-utils insert the in-memory command doesn't give you the opportunity to fix any types you imported from CSV, so queries like select * from stdin where age > 3 are never going to work correctly against these temporary in-memory tables.

Teaching sqlite-utils insert to detect types for columns in a CSV file would be a backwards-compatibility breaking change. Teaching sqlite-utils memory that trick would not be, since it hasn't been included in a release yet.

It's a little inconsistent, but I'm going to have sqlite-utils memory default to detecting types while sqlite-utils insert does not. In each case this can be controlled by a new command-line option:

cat file.csv | sqlite-utils memory - --no-detect-types

To opt-in for sqlite-utils insert:

cat file.csv | sqlite-utils insert blah.db blah - --detect-types

I'll have short options for these too: -n for --no-detect-types and -d for --detect-types.

@simonw simonw added enhancement New feature or request cli-tool labels Jun 19, 2021
@simonw
Copy link
Owner Author

simonw commented Jun 19, 2021

I built some prototype code here for something which looks at every row in a CSV import and records the likely types: https://gist.github.com/simonw/465f9356f175d1cf86957947dff501d4

This could be used by the command-line tools to figure out what table.transform(types=...) method to use at the end.

This is a different approach to the pure SQL version I tried building in #179 - I think this is a better approach though, it's less prone to weird idiosyncrasies of SQLite types, and it's also easy for us to add on to the existing CSV import code in a way that won't require scanning the data twice.

@simonw
Copy link
Owner Author

simonw commented Jun 19, 2021

I may change the default for sqlite-utils insert to detect types if I release sqlite-utils 4.0, as a backwards-incompatible change.

@simonw
Copy link
Owner Author

simonw commented Jun 19, 2021

I'll have an environment variable for --detect-types so users who really want that as the default option can turn it on.

@simonw
Copy link
Owner Author

simonw commented Jun 19, 2021

Demo:

curl -s 'https://api.github.com/users/simonw/repos?per_page=100' | \
  sqlite-utils memory - 'select sum(size), sum(stargazers_count) from stdin limit 1'
[{"sum(size)": 2042547, "sum(stargazers_count)": 6769}]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cli-tool enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant