Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor document serve view to be agnostic of file storage backends #1420

Closed
gasman opened this issue Jun 18, 2015 · 17 comments
Closed

Refactor document serve view to be agnostic of file storage backends #1420

gasman opened this issue Jun 18, 2015 · 17 comments

Comments

@gasman
Copy link
Collaborator

gasman commented Jun 18, 2015

For Wagtail 1.0, the document serving view was rewritten, based around the django-sendfile API, to provide various enhancements: support for django-sendfile backends, serving the correct MIME type, and support for If-Modified-Since headers. However, the django-sendfile API assumes the use of a local storage backend (i.e. one that implements the path operation), and so we had to implement a fallback (#1417) to restore support for remote storage backends such as boto-s3. As with previous versions of Wagtail, this fallback code involves reading the file content through the open operation and serving it through Django, which is particularly inefficient for remote storage.

I propose the following new approach for document serving:

  1. Add a WAGTAILDOCS_FILE_STORAGE setting, which by default is the same as DEFAULT_FILE_STORAGE, and use that as the storage backend for the file field of the Document model. This makes it possible to set a storage backend for documents independently of other media (e.g. images).

  2. Introduce a new configuration setting (proposed name WAGTAILDOCS_SERVE_METHOD) to specify the overall top-level behaviour for document serving. This will be one of the following values (names are provisional, and not all will necessarily be implemented in the first instance):

    • 'direct': this bypasses the serve view entirely, and uses the value of document.file.url (if the backend implements it; if not, we fall back on the URL of the 'serve' view as usual) whenever we want to refer to the document URL, e.g. as the result of expanding a <a linktype="document"> link in rich text. This means that we end up bypassing the document_served signal, and any other application logic that might exist in the serve view (download stats tracking, access restrictions) - this is most appropriate for fully static sites (django-medusa / bakery).
    • 'redirect': The serve view will redirect to document.file.url (assuming the backend implements it). This is probably the best option for remote storage backends; it means that the document_served signal will still be fired for people following the link, but it's still possible to bypass it if they know the direct URL (which presents a loophole if access restrictions are in use).
    • 'authenticated_redirect': same as 'redirect', but implements a backend-specific authentication method (such as S3's signed URL feature with an expiry timestamp) to prevent people from using the remote URLs indefinitely. To be specified...
    • 'serve_view': The document is served as a direct response (i.e. a HTTP 200 response) to the request to the serve view. At this point, we don't specify whether this will be done with the aid of sendfile, or served directly by Django. This method is most appropriate for local storage backends; for remote backends, it will perform an expensive open operation (i.e. the current Wagtail 1.0 behaviour). This is also the only option which doesn't require the backend to expose a url property; ideally, the site implementer should ensure that the file isn't available through a direct URL, to prevent bypassing the serve view.

    (Open question: Given that the 'sensible default' for this setting will vary depending on the backend in use, is there a good way of selecting a suitable default value at startup, in the case that WAGTAILDOCS_FILE_STORAGE/DEFAULT_FILE_STORAGE is specified but WAGTAILDOCS_SERVE_METHOD isn't?)

    For all methods except 'serve_view', we're now done. For 'serve_view', we continue as follows (note that wherever possible, we're using the Django storages API rather than using direct file access, and catching NotImplementedError to account for varying capabilities - this way, all storage backends are following the same code path for the most part):

  3. Handling If-Modified-Since: If the backend implements the modified_time property, check this and return a 'not modified' response if applicable. (Aside: if we added an updated_at field to Document, we could do this without relying on the storage backend - provided we don't care about modifications that are done directly on the filestore, bypassing the database)

  4. Set a Content-Type header using mimetypes.guess_type(filename), as our sendfile logic currently does. (We could potentially sniff file content too, either for local storages only, or for all backends if we did it at upload time and stored it in the database - but that's something for another ticket...)

  5. If the backend implements the size property, set the Content-Length header. (TODO: confirm that this is indeed optional, as per discussion on Restore ability to serve docs from non-local storage backends #1417 (diff))

  6. If the backend implements the path property, and a SENDFILE_BACKEND is specified, hand off to django-sendfile. (Since sendfile is now optional and only activated when this setting is present, this possibly negates the need to bundle our own sendfile in wagtail.utils. On the other hand, if django-sendfile duplicates most of the header handling logic above, we might be better off with our own stripped-down implementation...)

  7. Otherwise, serve the file via StreamingHttpResponse.

TODO: decide how to incorporate #733 (setting content-disposition to display PDFs in the browser) into this. Special case for PDF, or make it configurable?

@gasman
Copy link
Collaborator Author

gasman commented Jun 18, 2015

(Open question: Given that the 'sensible default' for this setting will vary depending on the backend in use, is there a good way of selecting a suitable default value at startup, in the case that WAGTAILDOCS_FILE_STORAGE/DEFAULT_FILE_STORAGE is specified but WAGTAILDOCS_SERVE_METHOD isn't?)

This could be done as follows:

  • Add another method called 'auto', and make this the default. 'auto' behaves as follows:
  • If the backend provides url but not path, behave as 'redirect'
  • in all other cases, behave as 'serve_view'

@kaedroho
Copy link
Contributor

kaedroho commented Jul 1, 2015

For 3. Would it be possible to make use of a new file_sha field (as proposed in #977) to generate an Etag instead?

@gasman
Copy link
Collaborator Author

gasman commented Jul 1, 2015

Sure, that makes sense. (I thought about suggesting modified_date as another field on the documents table, but I dare say the Etag approach is more robust...)

@sanajaved7
Copy link

Hi all - has there been any progress on setting content-disposition to display PDFs in the browser instead of downloading them? I saw the document.file.url workaround from @chhantyal but we have attachments in StreamFields so that won't work for us. This is pretty important for us so if you all have other workarounds in mind, I'm all ears!

@tobiasmcnulty
Copy link
Contributor

tobiasmcnulty commented May 9, 2017

I spent some time looking over this yesterday. A few thoughts/questions:

  • I think we may need to collapse redirect and authenticated_redirect (i.e., just rely on the developer to configure the storage the way they want). Not sure about other backends, but with S3 at least one doesn’t get to choose that after the storage is configured: https://github.com/jschneier/django-storages/blob/master/storages/backends/s3boto3.py#L590-L592

  • Am I correct in thinking that AbstractDocument.url is the right place to send the user a direct URL, if configured that way?

  • I'm a little concerned about the maintenance headache of making it seem like one can simply switch the storage backend for documents and it magically works. Is this a documented need from a user/site or more of a "this might be needed at some point" sort of thing?

  • Is there any value to allowing the user to fall back to serve_view if a remote storage with a url property is configured? I.e., could we just default to redirecting (if possible), otherwise use sendfile (if possible), otherwise open serve the file directly? If required, this is something the developer could implement by overriding the url property in their storage class and raising a NotImplementedError.

  • Would it be reasonable to collapse these settings into a single dictionary, to avoid adding too many new settings? E.g. (with no attachment to any particular names):

      WAGTAILDOCS_FILE_STORAGE = {
          'backend': DEFAULT_FILE_STORAGE,
          'serve_method': 'redirect',
      }
    

@tomdyson
Copy link
Contributor

Assigning to @robmoorman to review

@gasman
Copy link
Collaborator Author

gasman commented May 15, 2017

I think we may need to collapse redirect and authenticated_redirect (i.e., just rely on the developer to configure the storage the way they want).

OK - so basically, the distinction between redirect and authenticated_redirect is redundant because the S3 configuration elsewhere will determine whether authentication parameters are in use...? That sounds fair enough.

Am I correct in thinking that AbstractDocument.url is the right place to send the user a direct URL, if configured that way?

Yep, the method on the model seems like the right place to put the logic for returning one URL or the other depending on how it's configured.

I'm a little concerned about the maintenance headache of making it seem like one can simply switch the storage backend for documents and it magically works. Is this a documented need from a user/site or more of a "this might be needed at some point" sort of thing?

You mean being able to switch storage backends after documents have already been uploaded, without breaking the existing ones? I don't think that's a requirement - I'd say there's no expectation for us to provide any more 'magic' than Django's storage API gives us already.

Is there any value to allowing the user to fall back to serve_view if a remote storage with a url property is configured?

Afraid I can't quite follow what you're proposing here - is this the suggested behaviour for serve_method = 'auto'?

Would it be reasonable to collapse these settings into a single dictionary, to avoid adding too many new settings?

I'd be happy with that. Perhaps we can also have a setting like 'inline_content_types': ['application/pdf'] in there to address the "serving PDFs inline in the browser" point? Somehow it feels more palatable to have that as part of a settings dict than a top-level WAGTAILDOCS_INLINE_CONTENT_TYPES setting :-)

@tobiasmcnulty
Copy link
Contributor

@gasman to provide a custom storage= for AbstractDocument.file I think we'd need to make our own storage class that wraps whatever storage class is configured at runtime. Using a LazyObject like Django does (see: https://github.com/django/django/blob/a0c07d77fc313388c72a17cd59411265069f037f/django/core/files/storage.py#L359-L361) generates a migration in the wagtaildocs app for whatever storage I have configured at the time, which obviously will not work. For example, this code:

class DefaultDocumentStorage(LazyObject):
    def _setup(self):
        import_path = getattr(settings, 'WAGTAILDOCS_FILE_STORAGE', {}).get('backend')
        # get_storage_class will fall back to settings.DEFAULT_FILE_STORAGE if import_path is None
        self._wrapped = get_storage_class(import_path)()


@python_2_unicode_compatible
class AbstractDocument(CollectionMember, index.Indexed, models.Model):
    title = models.CharField(max_length=255, verbose_name=_('title'))
    file = models.FileField(upload_to='documents', verbose_name=_('file'), storage=DefaultDocumentStorage())
    # ...

causes this migration to be created:

class Migration(migrations.Migration):

    dependencies = [
        ('wagtaildocs', '0007_merge'),
    ]

    operations = [
        migrations.AlterField(
            model_name='document',
            name='file',
            field=models.FileField(storage=storages.backends.s3boto3.S3Boto3Storage(), upload_to='documents', verbose_name='file'),
        ),
    ]

Even though S3Boto3Storage is only referenced in my local settings file.

I'm not keen on the idea of making our own storage class wrapper and would rather split this (supporting different storage backends for images and files) off into a separate issue if possible. This also feels like something that should be solved in Django rather than in Wagtail. What do you think?

@tobiasmcnulty
Copy link
Contributor

As another thought, I believe providing a custom storage is already possible via the WAGTAILDOCS_DOCUMENT_MODEL setting (?)...

@tobiasmcnulty
Copy link
Contributor

tobiasmcnulty commented Aug 17, 2017

For posterity: I chatted with @gasman on Slack and we agreed it makes sense to defer implementing any ability to customize the file storage backend (beyond what Django gives you) at this time.

@jdavid
Copy link

jdavid commented May 10, 2018

Today I had to override the document model only to change the storage backend.

The problem is documents were uploaded to the media/ directory. In the server I use the nginx backend for sendfile, which is configured to serve files from the sendfile/ folder, not media/. So I could upload documents, but not download them (404).

Changing the configuration so sendfile serves from media/ is not an option, as it would open a security issue: files stored in sendfile/ would become public if moved to media/.

If WAGTAILDOCS_FILE_STORAGE was implemented I would not need to override the document model. May it be possible to split that from the rest of this issue?

@tobiasmcnulty
Copy link
Contributor

@jdavid The problem is that changing WAGTAILDOCS_FILE_STORAGE generates a migration (which is an entirely unintuitive side effect of changing a setting -- see above). Given this, I think overriding the document model to change the storage backend is probably the right way to do it...

@jdavid
Copy link

jdavid commented May 10, 2018

Maybe (draft):

@deconstructible
class DocumentStorage(FileSystemStorage):
    def __init__(self, **kwargs):
        kwargs['location'] = settings.WAGTAILDOCS_FILE_STORAGE_LOCATION
        kwargs['base_url'] = settings.WAGTAILDOCS_FILE_STORAGE_BASE_URL
        super().__init__(**kwargs)

document_storage = DocumentStorage()

There would be a migration in Wagtail, but then projects could change location and base_url without triggering new migrations.

It doesn't solve the problem to use something else than FileSystemStorage. For that I have seen patterns such as:

@deconstructible
class DocumentStorage(WAGTAILDOCS_FILE_STORAGE_BACKEND):
    [...]

rjsparks added a commit to ietf-tools/www that referenced this issue Jun 7, 2019
…/document/doc_id/filename.

This is a workaround for a wart in wagtail that insists on serving documents as attachments.
It is a mild compromise in that the proported inside wagtail features of counting document references and controlling access to files is circumvented. However, by design, we don't want different access controls to files for different viewers of the IETF website, and we are looking at different ways to gather information like references.

See wagtail/wagtail#4359, wagtail/wagtail#1158, and the knot at wagtail/wagtail#1420.
@Pomax
Copy link
Contributor

Pomax commented Aug 24, 2019

I don't see any mention of a "manual override" for Content-Type, where site owners can specify a mapping that effects bypassing the mimetype guessing, like:

WAGTAIL_DOCUMENT_CONTENT_TYPES = [
    ('pdf', 'application/pdf'),
    ('bin', 'application/octet-stream'),
    ('gifv', 'video/mp4'),
    ...etc...
]

@gasman
Copy link
Collaborator Author

gasman commented Aug 27, 2019

@Pomax If we go with the approach described in #1158 (comment) (which is probably a better place to continue this discussion, as the HTTP headers are a somewhat separate detail from serving methods), it'll be possible to override the default mimetypes.guess_type logic by setting up a custom document model and overriding the content_type property.

I wouldn't be against adding a mapping to the settings (similar to the proposed WAGTAILDOCS_INLINE_CONTENT_TYPES), so that it's possible to customise this without a custom document model, but I wonder how necessary this is - is there reason to believe that mimetypes.guess_type won't do an adequate job of recognising common document types?

@Pomax
Copy link
Contributor

Pomax commented Aug 28, 2019

Mostly to ensure people stay in control of what can happen: some folks will want to force specific extensions to "inert" form so that browsers never try to interpret them (or quite the opposite, making it active content) , where others may want to serve a content type that mimetypes.guess_type hasn't even heard of. Basically: guessing is fine, but if the code is going to do guess work, it's super useful to also offer a way for the people whose setup it is to be able to say something along the lines of "actually, don't guess, here's what we know this is, full stop". (Especially for overloaded extensions, or conventionally "unknown" extensions that might be tied into a browser extension that knows how to deal with that content in-browser).

And having that custom document model would be valuable, of course, but I would imagine that most people would make use of a settings.py list if they had that option (not having to create custom files but setting a list means a lower bug and maintenance surface, which is always a big plus).

Happy to "redo" this comment in #1158 if that helps move it over to the right place!

@gasman
Copy link
Collaborator Author

gasman commented Sep 23, 2019

The main part of this (the WAGTAILDOCS_SERVE_METHOD setting) is now completed in #5296, and #1158 has been opened to cover the content-type / content-disposition headers. The remaining points mentioned here are the If-Modified-Since, Content-Length and Etag headers, and I don't think those are important enough to justify keeping this open (PRs still welcome of course...) - so I'm calling this complete.

@gasman gasman closed this as completed Sep 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants