Provide hooks with content_type (raw from server) and mime_type #459

Open · wants to merge 4 commits into base: master

Conversation

tomaszn (Contributor) commented Mar 1, 2020

This allows creating automatic converters like:

class AtomFeedFilter(AutoMatchFilter):           
    MATCH = {'mime_type': 'application/atom+xml'}
    def filter(self, data):
        ...
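As an illustration of what the elided filter body might do, here is a hypothetical sketch that reduces an Atom feed to its entry titles (the helper below is illustrative only, not part of this PR; the actual filter logic is up to the hook author):

```python
import xml.etree.ElementTree as ET

ATOM_NS = '{http://www.w3.org/2005/Atom}'

def atom_entry_titles(data):
    """Hypothetical filter body: return one line per Atom entry title."""
    root = ET.fromstring(data)
    titles = [entry.findtext(ATOM_NS + 'title', default='')
              for entry in root.iter(ATOM_NS + 'entry')]
    return '\n'.join(titles)
```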
tomaszn force-pushed the tomaszn:filters_by_content_type branch from 3bcbc46 to 7ca2a6f on Mar 11, 2020
tomaszn force-pushed the tomaszn:filters_by_content_type branch from 7ca2a6f to 1ebccc0 on Mar 11, 2020
thp (Owner) left a comment

See comments.

@@ -254,7 +254,7 @@ def retrieve(self, job_state):
         file_scheme = 'file://'
         if self.url.startswith(file_scheme):
             logger.info('Using local filesystem (%s URI scheme)', file_scheme)
-            return open(self.url[len(file_scheme):], 'rt').read()
+            return (open(self.url[len(file_scheme):], 'rt').read(), 'text/plain')

thp (Owner) · Mar 22, 2020

The local file doesn't necessarily need to be text/plain; it could be a different type?

tomaszn (Author, Contributor) · Mar 22, 2020

Could be configured manually per job. And if not defined per job, then fallback could be taken from settings. If no fallback in settings, then libmagic can be used.

thp (Owner) · Mar 23, 2020

Yes, make it an __optional__ setting in the Job (it makes sense for file jobs), maybe mimetype or something, and if not set, use text/plain.
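What thp suggests could be sketched roughly like this (a minimal standalone sketch, assuming an optional mimetype job setting; the FileJob class and names here are illustrative, not the actual urlwatch code):

```python
import os
import tempfile

class FileJob:
    """Hypothetical sketch of a file:// job with an optional mimetype setting."""

    def __init__(self, url, mimetype=None):
        self.url = url
        self.mimetype = mimetype

    def retrieve(self):
        file_scheme = 'file://'
        if self.url.startswith(file_scheme):
            with open(self.url[len(file_scheme):], 'rt') as fp:
                # Fall back to text/plain when the job does not configure a type
                return (fp.read(), self.mimetype or 'text/plain')
        raise ValueError('not a file:// URL')
```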

@@ -209,7 +209,7 @@ class UrlJob(Job):
     __required__ = ('url',)
     __optional__ = ('cookies', 'data', 'method', 'ssl_no_verify', 'ignore_cached', 'http_proxy', 'https_proxy',
                     'headers', 'ignore_connection_errors', 'ignore_http_error_codes', 'encoding', 'timeout',
-                    'ignore_timeout_errors', 'ignore_too_many_redirects')
+                    'ignore_timeout_errors', 'ignore_too_many_redirects', 'content_type', 'mime_type')

thp (Owner) · Mar 22, 2020

What's the difference between content_type and mime_type here?

tomaszn (Author, Contributor) · Mar 22, 2020
  • content_type is the raw response header including optional encoding information.
  • mime_type is the first part before the optional ";" and doesn't contain the optional encoding information, which is already taken into account when decoding the response. This is what you'd usually use.
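The described split could be sketched as (an illustrative helper, not the PR's actual code):

```python
def split_content_type(content_type):
    """Split a raw Content-Type header value into (mime_type, parameters).

    Sketch of the behavior described above: mime_type is the part before
    the optional ';', stripped and lowercased; any parameters such as
    charset stay only in the raw value.
    """
    mime_type, _, params = content_type.partition(';')
    return mime_type.strip().lower(), params.strip()
```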

thp (Owner) · Mar 23, 2020

These are settings from the jobs list YAML; you don't need to specify them here? (Users are not supposed to "set" those values in the job configuration, right?)

thp (Owner) · Mar 23, 2020

OK, never mind, I understand how you plan to use this now... Basically, that's a kind of key that (I think) we haven't had yet: we want the job to obtain those properties (so they can be used for filter matching), but it's not something we want the user to specify via the config variable (we'd otherwise overwrite it?).

tomaszn (Author, Contributor) · Mar 23, 2020

Right, I included them here so the automatic matching takes these two dictionary keys in MATCH into account.

After your remarks about defaults in ShellJobs, maybe it should be possible to configure it (one can create, say, screenshots with a script).

@@ -358,4 +364,4 @@ def retrieve(self, job_state):
         from requests_html import HTMLSession
         session = HTMLSession()
         response = session.get(self.navigate)
-        return response.html.html
+        return (response.html.html, response.headers.get('Content-type', ''))

thp (Owner) · Mar 22, 2020

Should there be a default fallback value for response.headers.get()? Maybe text/html?

tomaszn (Author, Contributor) · Mar 22, 2020

A typical web server would send that header. If it is missing, then the server is probably a custom script. Then maybe text/plain? Or use magic.from_buffer() to determine it?

thp (Owner) · Mar 23, 2020

Yes, text/plain is probably also fine. However, how does that deal with headers like:

Content-Type: text/html; charset=utf-8

tomaszn (Author, Contributor) · Mar 23, 2020

It's returned in a tuple and stored as content_type. Additionally, text/html is extracted and stored as mime_type.

tomaszn (Author, Contributor) · Mar 23, 2020

Oh, I forgot that an empty content_type is also acceptable, and in that case job.content_type and job.mime_type are not created. Maybe that's the way to go, rather than guessing?

@@ -41,7 +41,10 @@ from urlwatch import reporters
 #     __required__ = ('username', 'password')
 #
 #     def retrieve(self, job_state):
-#         return 'Would log in to {} with {} and {}\n'.format(self.url, self.username, self.password)
+#         return (
+#             'Would log in to {} with {} and {}\n'.format(self.url, self.username, self.password),

thp (Owner) · Mar 22, 2020

Unrelated change, please remove.

tomaszn (Author, Contributor) · Mar 22, 2020

This pull request changes JobState.process() so that it requires all retrieve implementations to return a tuple:

  data, content_type = self.job.retrieve(self)

So this example needs updating unless backward compatibility is implemented in JobState.process().

thp (Owner) · Mar 23, 2020

Oh yeah, true... That's kind of unfortunate; it would by default break all custom user scripts. I would make it so that self.job.retrieve(self) can handle both the return value being a string (legacy behavior) and the return value being a 2-tuple. But let's think a bit more about this (I'll add another PR review comment).
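The backward-compatibility shim thp suggests might look roughly like this (a sketch with illustrative names, not the actual JobState.process() code):

```python
def normalize_retrieve_result(result, default_content_type=''):
    """Accept both legacy string returns and new (data, content_type) tuples.

    Hypothetical helper: JobState.process() could call this on the value
    returned by job.retrieve() so that old custom user scripts that return
    a bare string keep working.
    """
    if isinstance(result, tuple):
        data, content_type = result
        return data, content_type
    return result, default_content_type
```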

thp (Owner) left a comment

I like the basic premise of this PR, but have been thinking a bit more about it.

Your proposed use case:

    MATCH = {'mime_type': 'application/atom+xml'}

This is a very narrow use case; it might be better to extend it to return not just a content type, but a dictionary of headers for every request. The filter itself might be a bit more complicated then, but it is only written once. You could then also filter on other header values (I can't think of a good example, but maybe format/filter certain CMSes with the X-Generator header or something).

Then again, maybe it's good enough if we provide an Atom feed filter and users just manually opt in to use that filter for jobs?

In any case, right now it doesn't handle the charset parameter (e.g. for Content-Type: text/html; charset=utf-8) so it's not fully robust yet.

To summarize, it's still a nice idea, but I'm not sure if the added complexity and potential backwards compatibility issues (user-written job types where the retrieve() function doesn't return a mime type) are worth the single use case.

tomaszn (Author, Contributor) commented Mar 23, 2020

Thanks for the review.

  1. Including all headers. Might be useful. But some headers are numeric, like Content-Length, and to make use of them the matching algorithm should support comparison operators like "greater than" etc. Are you thinking about returning a dict from retrieve?

  2. The charset parameter is already handled to some extent (look for CHARSET_RE in jobs.py), but its value is ignored. The decoding is done by trying several combinations of parameters, ignoring the charset value. I wrapped that in a function definition, so it can be worked on and tested later:

        def try_decoding(response):
            try:
                try:
                    try:
                        return response.content.decode('utf-8')
                    except UnicodeDecodeError:
                        return response.content.decode('latin1')
                except UnicodeDecodeError:
                    return response.content.decode('utf-8', 'ignore')
            except LookupError:
                # If this is an invalid encoding, decode as ascii (Debian bug 731931)
                return response.content.decode('ascii', 'ignore')

    I think it might even work as intended, because charset can be configured incorrectly. Anyway, mime_type is extracted from the header so it does not have to care about charset.

  3. Please also take a look at #434. It introduces needs_binary, which is a "binary" version of mime_type. I think that implementing a PDF conversion filter with my approach would be simpler.

  4. I think that the matching mechanism is too simple. Let's look at the case of a user who just wants to have hooks for HTML and Atom:

    • AtomHook: MATCH = {'mime_type': 'application/atom+xml'}
    • AtomHook2: MATCH = {'url': re.compile('.*/atom.xml')}, because some of his favorite blogs have a misconfigured Content-Type and use text/html
    • HtmlHook: MATCH = {'mime_type': 'text/html'}

    Currently both matching hooks will be used for these misconfigured atom.xml URLs, which is not what the user wants. I see two solutions for this, which are not mutually exclusive:

    • Introducing logical operators (AND, OR, NOT). A quick search reveals a library that implements necessary expressions: https://pypi.org/project/filtration/
    • Priorities. If a hook matches, then lower-priority hooks (fallbacks) are not processed. It should also allow setting the order in which hooks within one priority are run. So a hook for, say, cnn.com would run first, and then a default HTML processing would take place (or the other way round, but in a predictable manner).
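The priority idea above could be sketched like this (purely illustrative and simplified to exact-match MATCH keys; urlwatch's actual hook machinery works differently):

```python
class Hook:
    """Hypothetical hook base: MATCH keys are compared for equality."""
    priority = 0
    MATCH = {}

    def matches(self, job_props):
        return all(job_props.get(k) == v for k, v in self.MATCH.items())

def select_hooks(hooks, job_props):
    """Run only the matching hooks that share the highest priority;
    lower-priority matches act as unused fallbacks."""
    matched = [h for h in hooks if h.matches(job_props)]
    if not matched:
        return []
    top = max(h.priority for h in matched)
    return [h for h in matched if h.priority == top]
```

With this scheme, a high-priority URL-specific hook would shadow a generic HTML hook for the misconfigured atom.xml case described above.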
tomaszn force-pushed the tomaszn:filters_by_content_type branch from 829873b to 048ff5d on Mar 23, 2020