Make it possible to download BLOB data from the Datasette UI #1036

Closed
simonw opened this issue Oct 20, 2020 · 16 comments

Owner

simonw commented Oct 20, 2020

Currently you can only extract binary BLOB data as base64-encoded JSON, which is not user friendly at all. It should always be possible for end-users to get the binary data out.

I'm worried about XSS vulnerabilities here, but hopefully sending Content-Type: application/octet-stream helps there? Need to research that.

Owner Author

simonw commented Oct 20, 2020

Owner Author

simonw commented Oct 20, 2020

From https://hackerone.com/reports/126197:

archive.uber.com mirrors pypi. When downloading .tar.gz files from archive.uber.com, the MIME type is application/octet-stream. Injecting <html><script>alert(0)</script> into the start of the .tar.gz causes an XSS in Internet Explorer due to MIME sniffing.

So you do have to be careful not to open accidental XSS holes with application/octet-stream thanks to (presumably older) versions of IE.

From that thread it looks like the solution is to add an X-Content-Type-Options: nosniff header.

Owner Author

simonw commented Oct 20, 2020

https://security.stackexchange.com/questions/12896/does-x-content-type-options-really-prevent-content-sniffing-attacks says:

In Tangled Web Michal Zalewski says:

Refrain from using Content-Type: application/octet-stream and use application/binary instead, especially for unknown document types. Refrain from returning Content-Type: text/plain.

For example, any code-hosting platform must exercise caution when returning executables or source archives as application/octet-stream, because there is a risk they may be misinterpreted as HTML and displayed inline.

Owner Author

simonw commented Oct 20, 2020

I can also use a Content-Disposition header to force a download. I'm reasonably confident that the combination of Content-Disposition and X-Content-Type-Options: nosniff and application/binary will let me allow users to download the contents of arbitrary BLOB columns without any XSS risk.

Owner Author

simonw commented Oct 20, 2020

I think this plus the binary-CSV stuff in #1034 will justify a dedicated section of the documentation to talk about how Datasette handles binary BLOB columns.

Owner Author

simonw commented Oct 21, 2020

Extra security idea: a blob_download_host setting which can be used to indicate a host that should be used for downloads - for example datasettestatic.com. If this setting is populated then binary downloads are served from paths on that host only, and no other Datasette URLs from that host will be served.

Owner Author

simonw commented Oct 21, 2020

Possible URL for this: /db/table/-/blob/primary-keys - this would use the /db/table/-/ namespace proposed in #296.

simonw changed the title from "Make it possible to extract BLOB data from the Datasette UI" to "Make it possible to download BLOB data from the Datasette UI" on Oct 21, 2020

Owner Author

simonw commented Oct 21, 2020

What should the suggested filename be?

I think something that includes the table name, primary key and the name of the column would work.

How about a file extension? I guess .binary, then let the user rename it? Or .raw.

Owner Author

simonw commented Oct 21, 2020

Actually I like .blob

Owner Author

simonw commented Oct 21, 2020

So for https://latest.datasette.io/fixtures/binary_data the BLOB download URLs would be:

https://latest.datasette.io/fixtures/-/blob/binary_data/1/data.blob - that last bit after the primary key is to indicate the data column

With these headers:

  • Content-Disposition: attachment; filename="binary_data-1-data.blob"
  • X-Content-Type-Options: nosniff
  • Content-Type: application/binary
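
A minimal sketch of how those three headers could be assembled, assuming a hypothetical helper called `blob_download_headers` (not part of Datasette) and the table-pk-column filename scheme suggested above. Stripping quotes and line breaks from the filename is my own precaution, not something settled in this thread:

```python
def blob_download_headers(table: str, pk: str, column: str) -> dict:
    """Build download headers for one BLOB value (hypothetical helper)."""
    filename = f"{table}-{pk}-{column}.blob"
    # Drop characters that could break out of the quoted filename parameter
    # or inject extra header lines (an assumption, not from the issue).
    for bad in ('"', "\\", "\r", "\n"):
        filename = filename.replace(bad, "")
    return {
        # Force a download rather than inline display
        "Content-Disposition": f'attachment; filename="{filename}"',
        # Ask browsers not to MIME-sniff the body
        "X-Content-Type-Options": "nosniff",
        # Content type recommended in the Tangled Web quote above
        "Content-Type": "application/binary",
    }


headers = blob_download_headers("binary_data", "1", "data")
```

For the fixtures example this gives the same Content-Disposition value as the one listed above.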

Owner Author

simonw commented Oct 21, 2020

Should this work just for BLOB columns, or should it work for other columns too?

For the moment I'm going to restrict it to BLOBs, since data from other columns is available through the UI whereas BLOB columns are not.

Owner Author

simonw commented Oct 21, 2020

Owner Author

simonw commented Oct 21, 2020

This code needs these permission checks:

await self.check_permission(request, "view-instance")
await self.check_permission(request, "view-database", database)
await self.check_permission(request, "view-table", (database, table))
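
To illustrate the ordering, here is a self-contained sketch that runs the same three checks from broadest to narrowest. `FakeView` and `Forbidden` are stand-ins invented for this example, not Datasette's real classes, and the `check_permission` body is an assumption about how a failing check might behave:

```python
import asyncio


class Forbidden(Exception):
    """Hypothetical stand-in for a permission-denied error."""


class FakeView:
    """Stand-in view class; not Datasette's actual implementation."""

    def __init__(self, allowed):
        # Set of (action, resource) pairs the current actor may perform
        self.allowed = set(allowed)

    async def check_permission(self, request, action, resource=None):
        if (action, resource) not in self.allowed:
            raise Forbidden(action)

    async def blob(self, request, database, table):
        # Instance first, then database, then table: the first failure stops the request
        await self.check_permission(request, "view-instance")
        await self.check_permission(request, "view-database", database)
        await self.check_permission(request, "view-table", (database, table))
        return "ok"


view = FakeView([
    ("view-instance", None),
    ("view-database", "fixtures"),
    ("view-table", ("fixtures", "binary_data")),
])
result = asyncio.run(view.blob(None, "fixtures", "binary_data"))
```

With all three permissions granted the request goes through. Removing any one of them makes `blob()` raise before any BLOB data is read.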


philshem commented Jan 18, 2021

Hi Simon

Just finding this old issue regarding downloading blobs. Nice work!

image

As a feature request, maybe it would be possible to assign a blob column a certain data type (e.g. .jpg) so that each blob could be downloaded as that type of file. Perhaps constraining the file types to the usual kinds of blobs that people store in SQLite databases would avoid the execution issues mentioned above.

I guess the column blob-type definition could fit into this dropdown selection:

image

Let me know if I should open a new issue with a feature request. (This could slowly go in the direction of displaying image blob-types in the browser.)

Thanks for the great tool!


Edit: I just read the rest of the Twitter thread: https://twitter.com/simonw/status/1318685933256855552

Perhaps this is already possible in some form with the datasette-media plugin: https://github.com/simonw/datasette-media

Owner Author

simonw commented Jan 18, 2021

As you can see, I'm pretty paranoid about serving content with Content-Type HTTP headers because I'm so worried about execution vulnerabilities. I'm much more comfortable exploring that kind of thing in plugins, since that way people can opt-in to riskier features.

You found datasette-media which is my most comprehensive exploration of that idea so far - but there's definitely lots of room for more plugins along those lines.

Maybe even an output plugin? .jpg as an export format which returns the BLOB column for a row as a JPEG image with the correct content-type header (but first verifies that the binary content does indeed look like a real JPEG) could be interesting.

philshem commented

It might be possible with this library: https://docs.python.org/3/library/imghdr.html

quick test of the downloaded blob:

>>> import imghdr
>>> imghdr.what('material_culture-1-image.blob')
'jpeg'
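
Note that imghdr was deprecated in Python 3.11 and removed in 3.13. A rough standard-library-free sketch of the "does this look like a real JPEG" check is to test for the JPEG start-of-image marker bytes. `looks_like_jpeg` is a hypothetical name, and this only inspects the first three bytes, so it is a sanity check and not proof that the file is a valid image:

```python
def looks_like_jpeg(data: bytes) -> bool:
    """Return True if data starts with the JPEG SOI marker (FF D8) plus FF."""
    # Every JPEG stream begins with FF D8, followed by another FF-prefixed marker
    return data[:3] == b"\xff\xd8\xff"


# Bytes that begin like a JFIF file versus an HTML payload
jpeg_like = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
html_like = b"<html><script>alert(0)</script>"
checks = (looks_like_jpeg(jpeg_like), looks_like_jpeg(html_like))
```

An output plugin could run a check like this before choosing an image content type, and fall back to the .blob download otherwise.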

The output plugin would be cool. I'll look into making my first datasette plugin. I'm also imagining displaying the image in the browser -- but that would be a step 2.
