
Idea: import CSV to memory, run SQL, export in a single command #272

Closed
simonw opened this issue Jun 15, 2021 · 22 comments

simonw commented Jun 15, 2021

I quite often load a CSV file into a SQLite DB, then do stuff with it (like export results back out again as a new CSV) without any intention of keeping the CSV file around afterwards.

What if sqlite-utils could do this for me? Something like this:

sqlite-utils --csv blah.csv --csv baz.csv "select * from blah join baz ..."
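
Roughly what a single command like that could do under the hood - a minimal sketch using the sqlite_utils Python library, with placeholder file names and a made-up join condition:

import csv
import sys
import sqlite_utils

# Build a throwaway in-memory database
db = sqlite_utils.Database(memory=True)

# Load each CSV into a table named after its file
for path in ("blah.csv", "baz.csv"):
    table = path.rsplit(".", 1)[0]
    with open(path, newline="") as fp:
        db[table].insert_all(csv.DictReader(fp))

# Run the SQL and write the results back out as CSV
cursor = db.execute("select * from blah join baz on blah.id = baz.id")
writer = csv.writer(sys.stdout)
writer.writerow([col[0] for col in cursor.description])
writer.writerows(cursor)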

simonw commented Jun 15, 2021

Maybe also support --csvt as an alternative option which takes two arguments: the CSV path and the name of the table that should be created from it (rather than auto-detecting from the filename).

simonw commented Jun 15, 2021

How about --json and --nl and --tsv too, imitating the format options for sqlite-utils insert?

And what happens if you provide a database filename too? I'm tempted to say that the --csv data still gets loaded into an in-memory database, but that database is given a name and can then be joined against using SQLite's memory.blah syntax.

simonw commented Jun 15, 2021

--csvt seems unnecessary to me: the only downside is that if people want to load two CSV files with the same filename (but from different directories) they will get an error unless they rename one of the files first.

simonw commented Jun 15, 2021

--csv - should work though, for reading from stdin. The table can be called stdin.

simonw commented Jun 15, 2021

Problem: --csv and --json and --nl are already options for sqlite-utils query, so this needs new, non-conflicting names.

simonw commented Jun 15, 2021

--load-csv and --load-json and --load-nl and --load-tsv are unambiguous.

eyeseast (Contributor) commented:

So, I do things like this a lot, too. I like the idea of piping in from stdin. Something like this would be nice to do in a makefile:

cat file.csv | sqlite-utils --csv --table data - 'SELECT * FROM data WHERE col="whatever"' > filtered.csv

If you assume you're always piping out the same format you pipe in, the option names don't have to change. It depends on how often you want to convert between formats.

simonw commented Jun 16, 2021

This is going to need to be a separate command, for relatively non-obvious reasons.

sqlite-utils blah.db "select * from x"

Is equivalent to this, because query is the default sub-command:

sqlite-utils query blah.db "select * from x"

But... this means that making the filename optional doesn't actually work - because then this is ambiguous:

sqlite-utils --load-csv blah.csv "select * from blah"

So instead, I'm going to add a new sub-command. I'm currently thinking memory to reflect that this command operates on an in-memory database:

sqlite-utils memory --load-csv blah.csv "select * from blah"

I still think I need to use --load-csv rather than --csv because one interesting use-case for this is loading in CSV and converting it to JSON, or vice-versa.

Another option: allow multiple arguments which are filenames, and use the extension (or sniff the content) to decide what to do with them:

sqlite-utils memory blah.csv foo.csv "select * from foo join blah on ..."

This would require the last positional argument to always be a SQL query, and would treat all other positional arguments as files that should be imported into memory.
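
A rough sketch of how that argument handling could work (purely illustrative, not the eventual implementation): the final positional argument is the SQL, and every earlier argument is a file whose table name and format come from its filename:

import pathlib

def classify_args(args):
    # Last positional argument is the SQL query; the rest are files to load
    *files, sql = args
    plan = []
    for name in files:
        path = pathlib.Path(name)
        plan.append({
            "path": path,
            "table": path.stem,  # blah.csv -> table "blah"
            "format": path.suffix.lstrip(".").lower() or "csv",
        })
    return plan, sql

# classify_args(["blah.csv", "foo.csv", "select * from foo"])
# -> ([{"path": ..., "table": "blah", "format": "csv"},
#      {"path": ..., "table": "foo", "format": "csv"}], "select * from foo")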

simonw commented Jun 16, 2021

Another option: allow an optional :suffix specifying the type of the file. If this is missing we detect based on the filename.

sqlite-utils memory somefile:csv "select * from somefile"

One catch: how to treat - for standard input?

cat blah.csv | sqlite-utils memory - "select * from stdin"

That's fine for CSV, but what about TSV or JSON or nl-JSON? Maybe this:

cat blah.csv | sqlite-utils memory -:json "select * from stdin"

Bit weird though. The alternative would be to support this:

cat blah.csv | sqlite-utils memory --load-csv -

But that's verbose compared to the version without the long --load-x option.
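
A sketch of how the :suffix could be parsed (the names here are hypothetical, just to illustrate the idea):

KNOWN_FORMATS = {"csv", "tsv", "json", "nl"}

def parse_file_arg(arg):
    # "somefile:csv" -> ("somefile", "csv"); "-:json" -> ("-", "json")
    if ":" in arg:
        path, _, suffix = arg.rpartition(":")
        if suffix.lower() in KNOWN_FORMATS:
            return path, suffix.lower()
    # No recognized suffix: hand back the whole argument and let detection decide
    return arg, None

# parse_file_arg("somefile:csv")  -> ("somefile", "csv")
# parse_file_arg("-:json")        -> ("-", "json")
# parse_file_arg("blah.csv")      -> ("blah.csv", None)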

simonw commented Jun 16, 2021

Solution: sqlite-utils memory - attempts to detect the input format: if it starts with a { or [ it is probably JSON, otherwise the csv.Sniffer() mechanism is used. Or you can use sqlite-utils memory -:csv to explicitly indicate the type of input.
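
A minimal sketch of that detection logic - not the shipped implementation, just the shape of it (it assumes the piped input has already been read into a string):

import csv

def detect_format(raw):
    stripped = raw.lstrip()
    if stripped.startswith("[") or stripped.startswith("{"):
        # Looks like JSON (or newline-delimited JSON)
        return "json"
    # Otherwise let csv.Sniffer guess the dialect - a tab delimiter means TSV
    dialect = csv.Sniffer().sniff(stripped[:2048])
    return "tsv" if dialect.delimiter == "\t" else "csv"

# detect_format('[{"id": 1}]')        -> "json"
# detect_format("id,name\n1,foo\n")   -> "csv"
# detect_format("id\tname\n1\tfoo\n") -> "tsv"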

simonw commented Jun 16, 2021

The documentation already covers this:

$ sqlite-utils :memory: "select sqlite_version()"
[{"sqlite_version()": "3.29.0"}]

https://sqlite-utils.datasette.io/en/latest/cli.html#running-queries-and-returning-json

sqlite-utils memory "select sqlite_version()" is a little bit more intuitive than that.

simonw commented Jun 16, 2021

Mainly for debugging purposes, it would be useful to be able to save the created in-memory database out to a file afterwards. This could be done with:

sqlite-utils memory blah.csv --save saved.db

Can use .iterdump() to implement this: https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.iterdump

Maybe instead (or as well as) offer --dump, which dumps out the SQL for that database.
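
A sketch of how --save and --dump could both sit on top of iterdump() (standard library only; the in-memory connection here stands in for the one the command would build from the imported files):

import sqlite3

# Stand-in for the in-memory database built from the imported CSV data
memory = sqlite3.connect(":memory:")
memory.execute("CREATE TABLE blah (id INTEGER, name TEXT)")
memory.execute("INSERT INTO blah VALUES (1, 'example')")

# --dump: emit the SQL needed to recreate the database
for line in memory.iterdump():
    print(line)

# --save saved.db: replay that same SQL into an on-disk file
saved = sqlite3.connect("saved.db")
saved.executescript("\n".join(memory.iterdump()))
saved.close()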

simonw commented Jun 16, 2021

Got a prototype working!

 % curl -s 'https://fivethirtyeight.datasettes.com/polls/president_approval_polls.csv?_size=max&_stream=1' | sqlite-utils memory - 'select * from t limit 5' --nl 
{"rowid": "1", "question_id": "139304", "poll_id": "74225", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "568", "pollster": "YouGov", "sponsor_ids": "352", "sponsors": "Economist", "display_name": "YouGov", "pollster_rating_id": "391", "pollster_rating_name": "YouGov", "fte_grade": "B", "sample_size": "1500", "population": "a", "population_full": "a", "methodology": "Online", "start_date": "1/16/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/20/21 10:18", "notes": "", "url": "https://docs.cdn.yougov.com/y9zsit5bzd/weeklytrackingreport.pdf", "source": "538", "yes": "42.0", "no": "53.0"}
{"rowid": "2", "question_id": "139305", "poll_id": "74225", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "568", "pollster": "YouGov", "sponsor_ids": "352", "sponsors": "Economist", "display_name": "YouGov", "pollster_rating_id": "391", "pollster_rating_name": "YouGov", "fte_grade": "B", "sample_size": "1155", "population": "rv", "population_full": "rv", "methodology": "Online", "start_date": "1/16/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/20/21 10:18", "notes": "", "url": "https://docs.cdn.yougov.com/y9zsit5bzd/weeklytrackingreport.pdf", "source": "538", "yes": "44.0", "no": "55.0"}
{"rowid": "3", "question_id": "139306", "poll_id": "74226", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "23", "pollster": "American Research Group", "sponsor_ids": "", "sponsors": "", "display_name": "American Research Group", "pollster_rating_id": "9", "pollster_rating_name": "American Research Group", "fte_grade": "B", "sample_size": "1100", "population": "a", "population_full": "a", "methodology": "Live Phone", "start_date": "1/16/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/20/21 10:18", "notes": "", "url": "https://americanresearchgroup.com/economy/", "source": "538", "yes": "30.0", "no": "66.0"}
{"rowid": "4", "question_id": "139307", "poll_id": "74226", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "23", "pollster": "American Research Group", "sponsor_ids": "", "sponsors": "", "display_name": "American Research Group", "pollster_rating_id": "9", "pollster_rating_name": "American Research Group", "fte_grade": "B", "sample_size": "990", "population": "rv", "population_full": "rv", "methodology": "Live Phone", "start_date": "1/16/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/20/21 10:18", "notes": "", "url": "https://americanresearchgroup.com/economy/", "source": "538", "yes": "29.0", "no": "67.0"}
{"rowid": "5", "question_id": "139298", "poll_id": "74224", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "1528", "pollster": "AtlasIntel", "sponsor_ids": "", "sponsors": "", "display_name": "AtlasIntel", "pollster_rating_id": "546", "pollster_rating_name": "AtlasIntel", "fte_grade": "B/C", "sample_size": "5188", "population": "a", "population_full": "a", "methodology": "Online", "start_date": "1/15/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/19/21 21:52", "notes": "", "url": "https://projects.fivethirtyeight.com/polls/20210119_US_Atlas2.pdf", "source": "538", "yes": "44.6", "no": "53.9"}

simonw commented Jun 16, 2021

Moving this to a PR.

simonw commented Jun 16, 2021

Here's a radical idea: what if I combined sqlite-utils memory into sqlite-utils query?

The trick here would be to detect whether the arguments passed on the command line refer to SQLite databases or to CSV/JSON data that should be imported into temporary tables.

Detecting a SQLite database file is actually really easy - they all start with the same binary string:

>>> open("my.db", "rb").read(100)
b'SQLite format 3\x00...

(Need to carefully check that a CSV file with SQLite format 3 as the first column name doesn't accidentally get interpreted as a SQLite DB though.)
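
A check along those lines could read just the first 16 bytes in binary mode (a sketch, not necessarily what would ship):

SQLITE_MAGIC = b"SQLite format 3\x00"

def looks_like_sqlite_db(path):
    # The 16-byte header ends in a NUL byte, which a CSV file whose first
    # column happens to be named "SQLite format 3" would not contain
    with open(path, "rb") as fp:
        return fp.read(16) == SQLITE_MAGIC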

So then what would the semantics of sqlite-utils query (which is also the default command) be?

  • sqlite-utils mydb.db "select * from x"
  • sqlite-utils my.csv "select * from my"
  • sqlite-utils mydb.db my.csv "select * from mydb.x join my on ..." - this is where it gets weird. We can't import the CSV data directly into mydb.db - it's supposed to go into the in-memory database - so now we need to start using database aliases like mydb.x because we passed at least one other file?

The complexity here is definitely in the handling of a combination of SQLite database files and CSV filenames. Also, sqlite-utils query doesn't accept multiple filenames at the moment, so that will change.

I'm not 100% sold on this as being better than having a separate sqlite-utils memory command, as seen in #273.

simonw commented Jun 16, 2021

But... sqlite-utils my.csv "select * from my" is a much more compelling initial experience than sqlite-utils memory my.csv "select * from my".

simonw commented Jun 16, 2021

Plus, could I make this change to sqlite-utils query without breaking backwards compatibility? Adding a new sqlite-utils memory command is completely safe from that perspective.

simonw commented Jun 16, 2021

I wonder if there's a better name for this than sqlite-utils memory?

  • sqlite-utils memory hello.csv "select * from hello"
  • sqlite-utils mem hello.csv "select * from hello"
  • sqlite-utils temp hello.csv "select * from hello"
  • sqlite-utils adhoc hello.csv "select * from hello"
  • sqlite-utils scratch hello.csv "select * from hello"

I think memory is best. I don't like the others, except for scratch which is OK.

simonw commented Jun 16, 2021

Also sqlite-utils memory reflects the existing sqlite-utils :memory: mechanism, which is a point in its favour.

And it helps emphasize that the file you are querying will be loaded into memory, so probably don't try this against a 1GB CSV file.

simonw commented Jun 16, 2021

Columns from data imported from CSV in this way are currently treated as TEXT, which means numeric sorts and suchlike won't work as people might expect. It would be good to do automatic type detection here, see #179.
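
A sketch of the kind of detection #179 is about - look at every string value in a column and pick the narrowest SQLite type that fits (illustrative only):

def detect_column_type(values):
    def castable(cast):
        def ok(value):
            try:
                cast(value)
                return True
            except ValueError:
                return False
        # Treat empty strings as nulls so they don't force TEXT
        return all(value == "" or ok(value) for value in values)

    if castable(int):
        return "INTEGER"
    if castable(float):
        return "REAL"
    return "TEXT"

# detect_column_type(["1", "2", "3"])   -> "INTEGER"
# detect_column_type(["1.5", "2"])      -> "REAL"
# detect_column_type(["1", "two", "3"]) -> "TEXT"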

simonw added a commit that referenced this issue Jun 18, 2021
* Turn SQL errors into click errors
* Initial CSV-only prototype of sqlite-utils memory, refs #272
* Implement --save plus tests for --save and --dump, refs #272
* Re-arranged CLI query documentation, refs #272
* Re-organized CLI query docs, refs #272
* Docs for --save and --dump plus made SQL optional for those, refs #273
* Replaced one last :memory: example
* Documented --attach option for memory command, refs #272
* Improved arrangement of CLI query documentation

simonw commented Jun 18, 2021

I'll split the remaining work out into separate issues.

simonw commented Jun 19, 2021

Wrote this up on my blog here: https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/ - with a video demo here: https://www.youtube.com/watch?v=OUjd0rkc678
