Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better way of representing binary data in .csv output #1034

Closed
simonw opened this issue Oct 20, 2020 · 19 comments
Closed

Better way of representing binary data in .csv output #1034

simonw opened this issue Oct 20, 2020 · 19 comments
Labels
blocked Awaiting something else to happen first csv enhancement help wanted small
Milestone

Comments

@simonw
Copy link
Owner

simonw commented Oct 20, 2020

I just noticed this: https://latest.datasette.io/fixtures/binary_data.csv

rowid,data
1,b'\x15\x1c\x02\xc7\xad\x05\xfe'
2,b'\x15\x1c\x03\xc7\xad\x05\xfe'

There's no good way to represent binary data in a CSV file, but this seems like one of the more-bad options.

@simonw
Copy link
Owner Author

simonw commented Oct 20, 2020

Most obvious option is base64. Any other potential solutions I'm missing?

@simonw
Copy link
Owner Author

simonw commented Oct 20, 2020

The datasette-render-binary plugin does this, which I really like - but without the different coloured fonts I'm not sure how readable it would be as just plain text:

image

Really the goal here is to find the most human-friendly option, so that people looking at the output have a vague idea what's going on. That's why I'm not leaping at the chance to use base64.

@simonw
Copy link
Owner Author

simonw commented Oct 20, 2020

Asked for ideas on Twitter: https://twitter.com/simonw/status/1318409558805467136

@simonw
Copy link
Owner Author

simonw commented Oct 20, 2020

I could go with the same format as datasette-render-binary but using 0x00 as the format for the hex bytes.

0x15 0x1C 0x02 0xC7 JFIF 0x00 0x01

Problem with this is that it's ambiguous: if the ASCII characters 0x15 occur in the text they will be indistinguishable from those hex bytes.

But since representing binary data in CSV fundamentally doesn't make sense I'm not sure if that really matters.

@simonw
Copy link
Owner Author

simonw commented Oct 20, 2020

An even easier option: do what the Datasette UI does and output <Binary data: 7 bytes> for that CSV cell, as seen on https://latest.datasette.io/fixtures/binary_data

@simonw
Copy link
Owner Author

simonw commented Oct 20, 2020

Or... default to <Binary data: 7 bytes> and support a ?_base64=1 option which outputs in base64 instead.

@simonw
Copy link
Owner Author

simonw commented Oct 20, 2020

New idea: since binary in CSV doesn't make sense anyway, emulate Datasette's HTML UI default and output this:

id,title,data
1,Some title,<Binary data: 14 bytes>
2,Other title,<Binary data: 57 bytes>

Then allow users to add ?_base64=1 to the URL to get base64 instead
https://twitter.com/simonw/status/1318679950635888641

@simonw
Copy link
Owner Author

simonw commented Oct 20, 2020

This feels good to me - it's consistent with how other features in Datasette work, and it means users who need the binary data in CSV (for whatever reason) can get it if they want to.

@simonw
Copy link
Owner Author

simonw commented Oct 20, 2020

Enzo has a great solution here: https://twitter.com/enzo_mdd/status/1318685442976436226

Or maybe an option for a url. This keeps the CSV small but allows scripts to download binary data as needed.

In #1036 I'm planning on adding a way for users to access BLOB data. I can include that URL in the CSV output.

@simonw
Copy link
Owner Author

simonw commented Oct 21, 2020

Blocked awaiting #1036 (update: now unblocked)

@simonw simonw added the blocked Awaiting something else to happen first label Oct 21, 2020
@simonw simonw added this to the 0.51 milestone Oct 23, 2020
@simonw
Copy link
Owner Author

simonw commented Oct 25, 2020

This is actually a bit tricky to implement, for a few reasons:

  • Need to generate a full URL, including the https://host/ bit. I've done this for next_url in the JSON output before, thankfully.
  • This only makes sense for CSV output for tables. If it's the CSV output of an arbitrary query there's no /db/table/-/blob/pk/column.blob page for me to link to.
  • Need to generate those /.../-/blob/... URLs for the data that is being output as CSV.

@simonw
Copy link
Owner Author

simonw commented Oct 25, 2020

Here's how those absolute next_url values are generated:

next_url = self.ds.absolute_url(
request, path_with_replaced_args(request, added_args)
)

@simonw
Copy link
Owner Author

simonw commented Oct 25, 2020

I should turn datasette.absolute_url(...) into a documented internal API on https://docs.datasette.io/en/stable/internals.html#datasette-class

@simonw
Copy link
Owner Author

simonw commented Oct 25, 2020

@simonw
Copy link
Owner Author

simonw commented Oct 25, 2020

What should happen for CSV export of arbitrary SQL queries, where there's no obvious BLOB to link to?

@simonw
Copy link
Owner Author

simonw commented Oct 25, 2020

For arbitrary CSV the only solution I can think of is to embed the base64 value.

@simonw
Copy link
Owner Author

simonw commented Oct 25, 2020

SQLite actually has APIs that could help here: https://www.sqlite.org/c3ref/column_database_name.html - for any given SQL query they identify the origin/table/column that is the source of each resulting column.

Those aren't exposed in the Python sqlite3 module though, so using them could be extremely tricky.

@simonw
Copy link
Owner Author

simonw commented Oct 29, 2020

I'm going to link to. the new .blob representation using the new ?_blob_hash=xxx argument to ensure that the content served is the expected binary blob.

@simonw simonw closed this as completed in 178b7e8 Oct 29, 2020
@simonw
Copy link
Owner Author

simonw commented Oct 30, 2020

Demos:

https://latest.datasette.io/fixtures/binary_data.csv?_size=max

rowid,data
1,http://latest.datasette.io/fixtures/binary_data/1.blob?_blob_column=data
2,http://latest.datasette.io/fixtures/binary_data/2.blob?_blob_column=data
3,

https://latest.datasette.io/fixtures.csv?sql=select+rowid%2C+data+from+binary_data+order+by+rowid+limit+1001&_size=max

rowid,data
1,http://latest.datasette.io/fixtures.blob?sql=select+rowid%2C+data+from+binary_data+order+by+rowid+limit+1001&_size=max&_blob_column=data&_blob_hash=f3088978da8f9aea479ffc7f631370b968d2e855eeb172bea7f6c7a04262bb6d
2,http://latest.datasette.io/fixtures.blob?sql=select+rowid%2C+data+from+binary_data+order+by+rowid+limit+1001&_size=max&_blob_column=data&_blob_hash=b835b0483cedb86130b9a2c280880bf5fadc5318ddf8c18d0df5204d40df1724
3,

simonw added a commit that referenced this issue Oct 30, 2020
simonw added a commit that referenced this issue Oct 31, 2020
@simonw simonw added the csv label Jun 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Awaiting something else to happen first csv enhancement help wanted small
Projects
None yet
Development

No branches or pull requests

1 participant