
Provide hooks with content_type (raw from server) and mime_type #459

Closed
wants to merge 4 commits

Conversation

@tomaszn (Contributor) commented Mar 1, 2020

This allows creating automatic converters like:

class AtomFeedFilter(AutoMatchFilter):
    MATCH = {'mime_type': 'application/atom+xml'}

    def filter(self, data):
        ...

@thp (Owner) left a comment:

See comments.

@@ -254,7 +254,7 @@ def retrieve(self, job_state):
         file_scheme = 'file://'
         if self.url.startswith(file_scheme):
             logger.info('Using local filesystem (%s URI scheme)', file_scheme)
-            return open(self.url[len(file_scheme):], 'rt').read()
+            return (open(self.url[len(file_scheme):], 'rt').read(), 'text/plain')
@thp (Owner):

The local file doesn't necessarily need to be text/plain; it could be a different type?

@tomaszn (Contributor Author):

Could be configured manually per job. If not defined per job, a fallback could be taken from settings; if there is no fallback in settings, libmagic can be used.

@thp (Owner):

Yes, make it an __optional__ setting in the Job (it makes sense for file jobs), maybe mimetype or something, and if not set, use text/plain.

@@ -209,7 +209,7 @@ class UrlJob(Job):
     __required__ = ('url',)
     __optional__ = ('cookies', 'data', 'method', 'ssl_no_verify', 'ignore_cached', 'http_proxy', 'https_proxy',
                     'headers', 'ignore_connection_errors', 'ignore_http_error_codes', 'encoding', 'timeout',
-                    'ignore_timeout_errors', 'ignore_too_many_redirects')
+                    'ignore_timeout_errors', 'ignore_too_many_redirects', 'content_type', 'mime_type')
@thp (Owner):

What's the difference between content_type and mime_type here?

@tomaszn (Contributor Author):

  • content_type is the raw response header, including optional encoding information.
  • mime_type is the part before the optional ";" and doesn't contain the encoding information, which is already taken into account when decoding the response. This is what you'd usually use.
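The split described above could be sketched roughly like this (a hedged illustration; split_content_type is a hypothetical helper, not the PR's actual code):

```python
# Hypothetical helper illustrating the content_type / mime_type split
# described above (not code from the PR).
def split_content_type(raw_header):
    """Return (content_type, mime_type) from a raw Content-Type value."""
    # content_type keeps the raw header, including any charset parameter;
    # mime_type is only the part before the first ';'.
    mime_type = raw_header.split(';', 1)[0].strip()
    return raw_header, mime_type

print(split_content_type('text/html; charset=utf-8'))
# ('text/html; charset=utf-8', 'text/html')
```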

@thp (Owner):

These are settings from the jobs list YAML; you don't need to specify them here? (Users are not supposed to "set" those values in the job configuration, right?)

@thp (Owner):

Ok, never mind, I understand how you plan to use this now... Basically that's a "separate" kind of key that we (I think) haven't had yet, where we want the job to obtain those properties (so they can be used for filter matching), but it's not something we want the user to specify via the config variable (we'd otherwise overwrite it?).

@tomaszn (Contributor Author):

Right, I included them here so that automatic matching takes these two dictionary keys in MATCH into account.

After your remarks about defaults in ShellJobs, maybe it should be possible to configure it (one can create, say, screenshots with a script).

@@ -358,4 +364,4 @@ def retrieve(self, job_state):
         from requests_html import HTMLSession
         session = HTMLSession()
         response = session.get(self.navigate)
-        return response.html.html
+        return (response.html.html, response.headers.get('Content-type', ''))
@thp (Owner):

Should there be a default fallback value for response.headers.get()? Maybe text/html?

@tomaszn (Contributor Author):

A typical web server would have that header. If it is missing, the server is probably a custom script. Then maybe text/plain? Or use magic.from_buffer() to determine it?

@thp (Owner):

Yes, text/plain is probably also fine. However, how does that deal with headers like:

Content-Type: text/html; charset=utf-8

@tomaszn (Contributor Author):

It's returned in a tuple and stored as content_type. Additionally, text/html is extracted and stored as mime_type.

@tomaszn (Contributor Author):

Oh, I forgot that an empty content_type is also acceptable, and in that case job.content_type and job.mime_type are not created. Maybe that's the way to go, rather than guessing?

@@ -41,7 +41,10 @@ from urlwatch import reporters
 # __required__ = ('username', 'password')
 #
 # def retrieve(self, job_state):
-#     return 'Would log in to {} with {} and {}\n'.format(self.url, self.username, self.password)
+#     return (
+#         'Would log in to {} with {} and {}\n'.format(self.url, self.username, self.password),
@thp (Owner):

Unrelated change, please remove.

@tomaszn (Contributor Author):

This pull request changes JobState.process() so that it requires all retrieve() implementations to return a tuple:

  data, content_type = self.job.retrieve(self)

So this example needs updating, unless backward compatibility is implemented in JobState.process().

@thp (Owner):

Oh yeah, true... That's kind of unfortunate; it would by default break all custom user scripts. I would make it so that self.job.retrieve(self) can handle both the return value being a string (legacy behavior) and the return value being a 2-tuple. But let's think a bit more about this (I'll add another PR review comment).
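The compatibility shim described here could look roughly like this (a sketch only; unpack_retrieve_result is a hypothetical name, and the real change would live in JobState.process()):

```python
# Hedged sketch of the suggested backward compatibility: accept both the
# legacy string return value and the new (data, content_type) 2-tuple.
# unpack_retrieve_result is a hypothetical helper, not urlwatch API.
def unpack_retrieve_result(result):
    if isinstance(result, tuple):
        return result  # new-style: (data, content_type)
    return result, None  # legacy custom jobs return just the data string

# Usage inside JobState.process() might then be:
# data, content_type = unpack_retrieve_result(self.job.retrieve(self))
```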

@thp (Owner) left a comment:
I like the basic premise of this PR, but have been thinking a bit more about it.

Your proposed use case:

    MATCH = {'mime_type': 'application/atom+xml'}

This is a very narrow use case; it might be better to extend it to return not just a content type, but a dictionary of headers for every request. The filter itself might be a bit more complicated then, but it is only written once. You could then also filter on other header values (I can't think of a good example, but maybe format/filter certain CMSes with the X-Generator header or something).

Then again, maybe it's good enough if we provide an Atom feed filter and users just manually opt in to use that filter for jobs?

In any case, right now it doesn't handle the charset parameter (e.g. for Content-Type: text/html; charset=utf-8), so it's not fully robust yet.

To summarize, it's still a nice idea, but I'm not sure if the added complexity and potential backwards-compatibility issues (user-written job types where the retrieve() function doesn't return a mime type) are worth the single use case.

@tomaszn (Contributor Author) commented Mar 23, 2020

Thanks for the review.

  1. Including all headers. Might be useful. But some headers are numeric, like Content-Length, and to make use of them the matching algorithm would have to support comparison operators like "greater than" etc. Are you thinking about returning a dict from retrieve()?

  2. The charset parameter is already handled in a way (look for CHARSET_RE in jobs.py), but its value is ignored. The decoding is done by trying several combinations of parameters, ignoring the charset value. I wrapped that in a function definition so it can be worked on and tested later:

        def try_decoding(response):
            try:
                try:
                    try:
                        return response.content.decode('utf-8')
                    except UnicodeDecodeError:
                        return response.content.decode('latin1')
                except UnicodeDecodeError:
                    return response.content.decode('utf-8', 'ignore')
            except LookupError:
                # If this is an invalid encoding, decode as ascii (Debian bug 731931)
                return response.content.decode('ascii', 'ignore')

    I think it might even work as intended, because charset can be configured incorrectly. Anyway, mime_type is extracted from the header so we don't have to care about charset.

  3. Please also take a look at #434 (New filter 'pdf2text' to extract text from PDF). It introduces needs_binary, which is a "binary" version of mime_type. I think that implementing a PDF conversion filter with my approach would be simpler.

  4. I think that the matching mechanism is too simple. Let's look at the case of a user who just wants to have hooks for HTML and Atom:

    • AtomHook: MATCH = {'mime_type': 'application/atom+xml'}
    • AtomHook2: MATCH = {'url': re.compile('.*/atom.xml')}, because some of his favorite blogs have a misconfigured Content-Type and use text/html
    • HtmlHook: MATCH = {'mime_type': 'text/html'}

    Currently both matching hooks will be used for these misconfigured atom.xml URLs, which is not what the user wants. I see two solutions for this, and they are not mutually exclusive:

    • Introduce logical operators (AND, OR, NOT). A quick search reveals a library that implements the necessary expressions: https://pypi.org/project/filtration/
    • Priorities: if a hook matches, then lower-priority hooks (fallbacks) are not processed. It should also be possible to set the order in which hooks of the same priority run. So a hook for, say, cnn.com would run first, and then the default HTML processing would take place (or the other way round, but in a predictable manner).
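The priority idea could be sketched as follows (illustrative only; select_hook and the hook tuples are hypothetical, not urlwatch code):

```python
# Illustrative sketch of priority-based hook selection: hooks are tried in
# ascending priority order and the first match wins, so a URL-based Atom
# hook can shadow the generic HTML hook for misconfigured feeds.
def select_hook(hooks, job_properties):
    # hooks: list of (priority, match_fn, name); lower priority runs first
    for _priority, matches, name in sorted(hooks, key=lambda h: h[0]):
        if matches(job_properties):
            return name
    return None

hooks = [
    (0, lambda p: p['url'].endswith('/atom.xml'), 'AtomHook2'),
    (1, lambda p: p.get('mime_type') == 'text/html', 'HtmlHook'),
]
print(select_hook(hooks, {'url': 'https://blog.example/atom.xml',
                          'mime_type': 'text/html'}))
# AtomHook2 -- HtmlHook is skipped even though its MATCH would also apply
```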

@thp (Owner) commented May 7, 2020

  • Including all headers. Might be useful. But some headers are numeric, like Content-Length, and to make use of them the matching algorithm would have to support comparison operators like "greater than" etc. Are you thinking about returning a dict from retrieve()?

For the "comparison operators", you could just have a lambda:

class AtomFeedFilter(AutoMatchFilter):
    MATCH = {'Content-Length': lambda value: int(value) > 300}
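Generic matching that accepts either plain values or callables could be sketched like this (matches_headers is a hypothetical helper to illustrate the idea, not existing urlwatch API):

```python
# Hedged sketch: MATCH values may be plain strings (compared for equality)
# or callables (applied to the header value), as in the lambda example above.
def matches_headers(match, headers):
    for key, expected in match.items():
        value = headers.get(key)
        if value is None:
            return False
        if callable(expected):
            if not expected(value):
                return False
        elif value != expected:
            return False
    return True

print(matches_headers({'Content-Length': lambda v: int(v) > 300},
                      {'Content-Length': '512'}))
# True
```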

@thp (Owner) commented May 7, 2020

...and there's always the possibility of just writing a custom match() method for filters where it's complicated (e.g. an Atom feed matched on Content-Type, but also on URL, because of a misconfigured server).

If just looking at the headers, this is probably fine, and things like "Content-Type: text/html; charset=utf-8" can be matched using the lambda matcher (lambda value: 'text/html' in value).

@thp (Owner) commented May 7, 2020

Not 100% sure if this is a good idea or not, but let's see what you come up with (and I'd feel more comfortable with the "generic" HTTP header matching than with this use case of trying to match Atom feeds).

@thp (Owner) commented May 21, 2020

@tomaszn Any updates on this?

@thp (Owner) commented Jul 7, 2020

Closing for now. Please rebase/reopen if you want to see this happening still.

@thp closed this Jul 7, 2020