
Idea: import CSV to memory, run SQL, export in a single command #272

Closed
simonw opened this issue Jun 15, 2021 · 22 comments

simonw commented Jun 15, 2021

I quite often load a CSV file into a SQLite DB, then do stuff with it (like export results back out again as a new CSV) without any intention of keeping the CSV file around afterwards.

What if sqlite-utils could do this for me? Something like this:

sqlite-utils --csv blah.csv --csv baz.csv "select * from blah join baz ..."
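
Roughly what a single command like that could do under the hood - a minimal sketch using the sqlite_utils Python library, with placeholder file names and a made-up join condition:

import csv
import sys
import sqlite_utils

# Build a throwaway in-memory database
db = sqlite_utils.Database(memory=True)

# Load each CSV into a table named after its file
for path in ("blah.csv", "baz.csv"):
    table = path.rsplit(".", 1)[0]
    with open(path, newline="") as fp:
        db[table].insert_all(csv.DictReader(fp))

# Run the SQL and write the results back out as CSV
cursor = db.execute("select * from blah join baz on blah.id = baz.id")
writer = csv.writer(sys.stdout)
writer.writerow([col[0] for col in cursor.description])
writer.writerows(cursor)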

simonw commented Jun 15, 2021

Maybe also support --csvt as an alternative option which takes two arguments: the CSV path and the name of the table that should be created from it (rather than auto-detecting from the filename).

simonw commented Jun 15, 2021

How about --json and --nl and --tsv too, imitating the format options for sqlite-utils insert?

And what happens if you provide a database filename too? I'm tempted to say that the --csv data still gets loaded into an in-memory database, but that database is given a name and can then be joined against using SQLite's memory.blah syntax.

simonw commented Jun 15, 2021

--csvt seems unnecessary to me: the only downside is that if people want to load two CSV files with the same filename (but from different directories) they will get an error unless they rename one of the files first.

simonw commented Jun 15, 2021

--csv - should work though, for reading from stdin. The table can be called stdin.

simonw commented Jun 15, 2021

Problem: --csv and --json and --nl are already options for sqlite-utils query, so this needs new, non-conflicting names.

simonw commented Jun 15, 2021

--load-csv and --load-json and --load-nl and --load-tsv are unambiguous.

eyeseast (Contributor) commented:

So, I do things like this a lot, too. I like the idea of piping in from stdin. Something like this would be nice to do in a makefile:

cat file.csv | sqlite-utils --csv --table data - 'SELECT * FROM data WHERE col="whatever"' > filtered.csv

If you assume you're always piping out the same format you pipe in, the option names don't have to change. It depends on how often you want to convert between formats.

simonw commented Jun 16, 2021

This is going to need to be a separate command, for relatively non-obvious reasons.

sqlite-utils blah.db "select * from x"

Is equivalent to this, because query is the default sub-command:

sqlite-utils query blah.db "select * from x"

But... this means that making the filename optional doesn't actually work - because then this is ambiguous:

sqlite-utils --load-csv blah.csv "select * from blah"

So instead, I'm going to add a new sub-command. I'm currently thinking memory to reflect that this command operates on an in-memory database:

sqlite-utils memory --load-csv blah.csv "select * from blah"

I still think I need to use --load-csv rather than --csv because one interesting use-case for this is loading in CSV and converting it to JSON, or vice-versa.

Another option: allow multiple arguments which are filenames, and use the extension (or sniff the content) to decide what to do with them:

sqlite-utils memory blah.csv foo.csv "select * from foo join blah on ..."

This would require the last positional argument to always be a SQL query, and would treat all other positional arguments as files that should be imported into memory.
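
A rough sketch of how that argument handling could work (purely illustrative, not the eventual implementation): the final positional argument is the SQL, and every earlier argument is a file whose table name and format come from its filename:

import pathlib

def classify_args(args):
    # Last positional argument is the SQL query; the rest are files to load
    *files, sql = args
    plan = []
    for name in files:
        path = pathlib.Path(name)
        plan.append({
            "path": path,
            "table": path.stem,  # blah.csv -> table "blah"
            "format": path.suffix.lstrip(".").lower() or "csv",
        })
    return plan, sql

# classify_args(["blah.csv", "foo.csv", "select * from foo"])
# -> ([{"path": ..., "table": "blah", "format": "csv"},
#      {"path": ..., "table": "foo", "format": "csv"}], "select * from foo")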

simonw commented Jun 16, 2021

Another option: allow an optional :suffix specifying the type of the file. If this is missing we detect based on the filename.

sqlite-utils memory somefile:csv "select * from somefile"

One catch: how to treat - for standard input?

cat blah.csv | sqlite-utils memory - "select * from stdin"

That's fine for CSV, but what about TSV or JSON or nl-JSON? Maybe this:

cat blah.csv | sqlite-utils memory -:json "select * from stdin"

Bit weird though. The alternative would be to support this:

cat blah.csv | sqlite-utils memory --load-csv -

But that's verbose compared to the version without the long --load-x option.
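
A sketch of how the :suffix could be parsed (the names here are hypothetical, just to illustrate the idea):

KNOWN_FORMATS = {"csv", "tsv", "json", "nl"}

def parse_file_arg(arg):
    # "somefile:csv" -> ("somefile", "csv"); "-:json" -> ("-", "json")
    if ":" in arg:
        path, _, suffix = arg.rpartition(":")
        if suffix.lower() in KNOWN_FORMATS:
            return path, suffix.lower()
    # No recognized suffix: hand back the whole argument and let detection decide
    return arg, None

# parse_file_arg("somefile:csv")  -> ("somefile", "csv")
# parse_file_arg("-:json")        -> ("-", "json")
# parse_file_arg("blah.csv")      -> ("blah.csv", None)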

simonw commented Jun 16, 2021

Solution: sqlite-utils memory - attempts to detect the input format: if it starts with a { or [ it is probably JSON, otherwise the csv.Sniffer() mechanism is used. Or you can use sqlite-utils memory -:csv to explicitly indicate the type of input.
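
A minimal sketch of that detection logic - not the shipped implementation, just the shape of it (it assumes the piped input has already been read into a string):

import csv

def detect_format(raw):
    stripped = raw.lstrip()
    if stripped.startswith("[") or stripped.startswith("{"):
        # Looks like JSON (or newline-delimited JSON)
        return "json"
    # Otherwise let csv.Sniffer guess the dialect - a tab delimiter means TSV
    dialect = csv.Sniffer().sniff(stripped[:2048])
    return "tsv" if dialect.delimiter == "\t" else "csv"

# detect_format('[{"id": 1}]')        -> "json"
# detect_format("id,name\n1,foo\n")   -> "csv"
# detect_format("id\tname\n1\tfoo\n") -> "tsv"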

simonw commented Jun 16, 2021

The documentation already covers this:

$ sqlite-utils :memory: "select sqlite_version()"
[{"sqlite_version()": "3.29.0"}]

https://sqlite-utils.datasette.io/en/latest/cli.html#running-queries-and-returning-json

sqlite-utils memory "select sqlite_version()" is a little bit more intuitive than that.

simonw commented Jun 16, 2021

Mainly for debugging purposes, it would be useful to be able to save the created in-memory database out to a file afterwards. This could be done with:

sqlite-utils memory blah.csv --save saved.db

Can use .iterdump() to implement this: https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.iterdump

Maybe instead (or as well as) offer --dump, which dumps out the SQL for that database.
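
A sketch of how --save and --dump could both sit on top of iterdump() (standard library only; the in-memory connection here stands in for the one the command would build from the imported files):

import sqlite3

# Stand-in for the in-memory database built from the imported CSV data
memory = sqlite3.connect(":memory:")
memory.execute("CREATE TABLE blah (id INTEGER, name TEXT)")
memory.execute("INSERT INTO blah VALUES (1, 'example')")

# --dump: emit the SQL needed to recreate the database
for line in memory.iterdump():
    print(line)

# --save saved.db: replay that same SQL into an on-disk file
saved = sqlite3.connect("saved.db")
saved.executescript("\n".join(memory.iterdump()))
saved.close()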

simonw commented Jun 16, 2021

Got a prototype working!

 % curl -s 'https://fivethirtyeight.datasettes.com/polls/president_approval_polls.csv?_size=max&_stream=1' | sqlite-utils memory - 'select * from t limit 5' --nl 
{"rowid": "1", "question_id": "139304", "poll_id": "74225", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "568", "pollster": "YouGov", "sponsor_ids": "352", "sponsors": "Economist", "display_name": "YouGov", "pollster_rating_id": "391", "pollster_rating_name": "YouGov", "fte_grade": "B", "sample_size": "1500", "population": "a", "population_full": "a", "methodology": "Online", "start_date": "1/16/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/20/21 10:18", "notes": "", "url": "https://docs.cdn.yougov.com/y9zsit5bzd/weeklytrackingreport.pdf", "source": "538", "yes": "42.0", "no": "53.0"}
{"rowid": "2", "question_id": "139305", "poll_id": "74225", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "568", "pollster": "YouGov", "sponsor_ids": "352", "sponsors": "Economist", "display_name": "YouGov", "pollster_rating_id": "391", "pollster_rating_name": "YouGov", "fte_grade": "B", "sample_size": "1155", "population": "rv", "population_full": "rv", "methodology": "Online", "start_date": "1/16/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/20/21 10:18", "notes": "", "url": "https://docs.cdn.yougov.com/y9zsit5bzd/weeklytrackingreport.pdf", "source": "538", "yes": "44.0", "no": "55.0"}
{"rowid": "3", "question_id": "139306", "poll_id": "74226", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "23", "pollster": "American Research Group", "sponsor_ids": "", "sponsors": "", "display_name": "American Research Group", "pollster_rating_id": "9", "pollster_rating_name": "American Research Group", "fte_grade": "B", "sample_size": "1100", "population": "a", "population_full": "a", "methodology": "Live Phone", "start_date": "1/16/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/20/21 10:18", "notes": "", "url": "https://americanresearchgroup.com/economy/", "source": "538", "yes": "30.0", "no": "66.0"}
{"rowid": "4", "question_id": "139307", "poll_id": "74226", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "23", "pollster": "American Research Group", "sponsor_ids": "", "sponsors": "", "display_name": "American Research Group", "pollster_rating_id": "9", "pollster_rating_name": "American Research Group", "fte_grade": "B", "sample_size": "990", "population": "rv", "population_full": "rv", "methodology": "Live Phone", "start_date": "1/16/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/20/21 10:18", "notes": "", "url": "https://americanresearchgroup.com/economy/", "source": "538", "yes": "29.0", "no": "67.0"}
{"rowid": "5", "question_id": "139298", "poll_id": "74224", "state": "", "politician_id": "11", "politician": "Donald Trump", "pollster_id": "1528", "pollster": "AtlasIntel", "sponsor_ids": "", "sponsors": "", "display_name": "AtlasIntel", "pollster_rating_id": "546", "pollster_rating_name": "AtlasIntel", "fte_grade": "B/C", "sample_size": "5188", "population": "a", "population_full": "a", "methodology": "Online", "start_date": "1/15/21", "end_date": "1/19/21", "sponsor_candidate": "", "tracking": "", "created_at": "1/19/21 21:52", "notes": "", "url": "https://projects.fivethirtyeight.com/polls/20210119_US_Atlas2.pdf", "source": "538", "yes": "44.6", "no": "53.9"}

simonw commented Jun 16, 2021

Moving this to a PR.

simonw commented Jun 16, 2021

Here's a radical idea: what if I combined sqlite-utils memory into sqlite-utils query?

The trick here would be to detect whether the arguments passed on the command line refer to SQLite databases or to CSV/JSON data that should be imported into temporary tables.

Detecting a SQLite database file is actually really easy - they all start with the same binary string:

>>> open("my.db", "rb").read(100)
b'SQLite format 3\x00...

(Need to carefully check that a CSV file with SQLite format 3 as the first column name doesn't accidentally get interpreted as a SQLite DB though.)
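
A check along those lines could read just the first 16 bytes in binary mode (a sketch, not necessarily what would ship):

SQLITE_MAGIC = b"SQLite format 3\x00"

def looks_like_sqlite_db(path):
    # The 16-byte header ends in a NUL byte, which a CSV file whose first
    # column happens to be named "SQLite format 3" would not contain
    with open(path, "rb") as fp:
        return fp.read(16) == SQLITE_MAGIC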

So then what would the semantics of sqlite-utils query (which is also the default command) be?

  • sqlite-utils mydb.db "select * from x"
  • sqlite-utils my.csv "select * from my"
  • sqlite-utils mydb.db my.csv "select * from mydb.x join my on ..." - this is where it gets weird. We can't import the CSV data directly into mydb.db - it's supposed to go into the in-memory database - so now we need to start using database aliases like mydb.x because we passed at least one other file?

The complexity here is definitely in the handling of a combination of SQLite database files and CSV filenames. Also, sqlite-utils query doesn't accept multiple filenames at the moment, so that will change.

I'm not 100% sold on this as being better than having a separate sqlite-utils memory command, as seen in #273.

simonw commented Jun 16, 2021

But... sqlite-utils my.csv "select * from my" is a much more compelling initial experience than sqlite-utils memory my.csv "select * from my".

simonw commented Jun 16, 2021

Plus, could I make this change to sqlite-utils query without breaking backwards compatibility? Adding a new sqlite-utils memory command is completely safe from that perspective.

simonw commented Jun 16, 2021

I wonder if there's a better name for this than sqlite-utils memory?

  • sqlite-utils memory hello.csv "select * from hello"
  • sqlite-utils mem hello.csv "select * from hello"
  • sqlite-utils temp hello.csv "select * from hello"
  • sqlite-utils adhoc hello.csv "select * from hello"
  • sqlite-utils scratch hello.csv "select * from hello"

I think memory is best. I don't like the others, except for scratch which is OK.

simonw commented Jun 16, 2021

Also sqlite-utils memory reflects the existing sqlite-utils :memory: mechanism, which is a point in its favour.

And it helps emphasize that the file you are querying will be loaded into memory, so probably don't try this against a 1GB CSV file.

simonw commented Jun 16, 2021

Columns from data imported from CSV in this way are currently treated as TEXT, which means numeric sorts and suchlike won't work as people might expect. It would be good to do automatic type detection here, see #179.
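
A sketch of the kind of detection #179 is about - look at every string value in a column and pick the narrowest SQLite type that fits (illustrative only):

def detect_column_type(values):
    def castable(cast):
        def ok(value):
            try:
                cast(value)
                return True
            except ValueError:
                return False
        # Treat empty strings as nulls so they don't force TEXT
        return all(value == "" or ok(value) for value in values)

    if castable(int):
        return "INTEGER"
    if castable(float):
        return "REAL"
    return "TEXT"

# detect_column_type(["1", "2", "3"])   -> "INTEGER"
# detect_column_type(["1.5", "2"])      -> "REAL"
# detect_column_type(["1", "two", "3"]) -> "TEXT"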

simonw added a commit that referenced this issue Jun 18, 2021
* Turn SQL errors into click errors
* Initial CSV-only prototype of sqlite-utils memory, refs #272
* Implement --save plus tests for --save and --dump, refs #272
* Re-arranged CLI query documentation, refs #272
* Re-organized CLI query docs, refs #272
* Docs for --save and --dump plus made SQL optional for those, refs #273
* Replaced one last :memory: example
* Documented --attach option for memory command, refs #272
* Improved arrangement of CLI query documentation

simonw commented Jun 18, 2021

I'll split the remaining work out into separate issues.

simonw commented Jun 19, 2021

Wrote this up on my blog here: https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/ - with a video demo here: https://www.youtube.com/watch?v=OUjd0rkc678
