
Provide hooks with content_type (raw from server) and mime_type #459

Closed
wants to merge 4 commits

Conversation

@tomaszn (Contributor) commented Mar 1, 2020

This allows creating automatic converters like:

class AtomFeedFilter(AutoMatchFilter):
    MATCH = {'mime_type': 'application/atom+xml'}

    def filter(self, data):
        ...

@thp (Owner) left a comment:

See comments.

@@ -254,7 +254,7 @@ def retrieve(self, job_state):
         file_scheme = 'file://'
         if self.url.startswith(file_scheme):
             logger.info('Using local filesystem (%s URI scheme)', file_scheme)
-            return open(self.url[len(file_scheme):], 'rt').read()
+            return (open(self.url[len(file_scheme):], 'rt').read(), 'text/plain')
@thp (Owner):

The local file doesn't necessarily need to be text/plain; it could be a different type?

@tomaszn (Contributor Author):

Could be configured manually per job. If not defined per job, a fallback could be taken from settings; if there is no fallback in settings, libmagic can be used.

@thp (Owner):

Yes, make it an __optional__ setting in the Job (it makes sense for file jobs), maybe mimetype or something, and if not set, use text/plain.

@@ -209,7 +209,7 @@ class UrlJob(Job):
     __required__ = ('url',)
     __optional__ = ('cookies', 'data', 'method', 'ssl_no_verify', 'ignore_cached', 'http_proxy', 'https_proxy',
                     'headers', 'ignore_connection_errors', 'ignore_http_error_codes', 'encoding', 'timeout',
-                    'ignore_timeout_errors', 'ignore_too_many_redirects')
+                    'ignore_timeout_errors', 'ignore_too_many_redirects', 'content_type', 'mime_type')
@thp (Owner):

What's the difference between content_type and mime_type here?

@tomaszn (Contributor Author):

  • content_type is the raw response header, including optional encoding information.
  • mime_type is the part before the optional ";" and doesn't contain the encoding information, which is already taken into account when decoding the response. This is what you'd usually use.
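The split described above could be sketched roughly like this (a hedged illustration; split_content_type is a hypothetical helper, not the PR's actual code):

```python
# Hypothetical helper illustrating the content_type / mime_type split
# described above (not code from the PR).
def split_content_type(raw_header):
    """Return (content_type, mime_type) from a raw Content-Type value."""
    # content_type keeps the raw header, including any charset parameter;
    # mime_type is only the part before the first ';'.
    mime_type = raw_header.split(';', 1)[0].strip()
    return raw_header, mime_type

print(split_content_type('text/html; charset=utf-8'))
# ('text/html; charset=utf-8', 'text/html')
```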

@thp (Owner):

These are settings from the jobs list YAML; you don't need to specify them here? (Users are not supposed to "set" those values in the job configuration, right?)

@thp (Owner):

Ok, never mind, I understand how you plan to use this now... Basically that's a "separate" kind of key that we (I think) haven't had yet, where we want the job to obtain those properties (so they can be used for filter matching), but it's not something we want the user to specify via the config variable (we'd otherwise overwrite it?).

@tomaszn (Contributor Author):

Right, I included them here so that automatic matching takes these two dictionary keys in MATCH into account.

After your remarks about defaults in ShellJobs, maybe it should be possible to configure it (one can create, say, screenshots with a script).

@@ -358,4 +364,4 @@ def retrieve(self, job_state):
         from requests_html import HTMLSession
         session = HTMLSession()
         response = session.get(self.navigate)
-        return response.html.html
+        return (response.html.html, response.headers.get('Content-type', ''))
@thp (Owner):

Should there be a default fallback value for response.headers.get()? Maybe text/html?

@tomaszn (Contributor Author):

A typical web server would have that header. If it is missing, the server is probably a custom script. Then maybe text/plain? Or use magic.from_buffer() to determine it?

@thp (Owner):

Yes, text/plain is probably also fine. However, how does that deal with headers like:

Content-Type: text/html; charset=utf-8

@tomaszn (Contributor Author):

It's returned in a tuple and stored as content_type. Additionally, text/html is extracted and stored as mime_type.

@tomaszn (Contributor Author):

Oh, I forgot that an empty content_type is also acceptable, and in that case job.content_type and job.mime_type are not created. Maybe that's the way to go, rather than guessing?

@@ -41,7 +41,10 @@ from urlwatch import reporters
 # __required__ = ('username', 'password')
 #
 # def retrieve(self, job_state):
-#     return 'Would log in to {} with {} and {}\n'.format(self.url, self.username, self.password)
+#     return (
+#         'Would log in to {} with {} and {}\n'.format(self.url, self.username, self.password),
@thp (Owner):

Unrelated change, please remove.

@tomaszn (Contributor Author):

This pull request changes JobState.process() so that it requires all retrieve() implementations to return a tuple:

  data, content_type = self.job.retrieve(self)

So this example needs updating, unless backward compatibility is implemented in JobState.process().

@thp (Owner):

Oh yeah, true... That's kind of unfortunate; it would by default break all custom user scripts. I would make it so that self.job.retrieve(self) can handle both the return value being a string (legacy behavior) and the return value being a 2-tuple. But let's think a bit more about this (I'll add another PR review comment).
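The compatibility shim described here could look roughly like this (a sketch only; unpack_retrieve_result is a hypothetical name, and the real change would live in JobState.process()):

```python
# Hedged sketch of the suggested backward compatibility: accept both the
# legacy string return value and the new (data, content_type) 2-tuple.
# unpack_retrieve_result is a hypothetical helper, not urlwatch API.
def unpack_retrieve_result(result):
    if isinstance(result, tuple):
        return result  # new-style: (data, content_type)
    return result, None  # legacy custom jobs return just the data string

# Usage inside JobState.process() might then be:
# data, content_type = unpack_retrieve_result(self.job.retrieve(self))
```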

@thp (Owner) left a comment:
I like the basic premise of this PR, but have been thinking a bit more about it.

Your proposed use case:

    MATCH = {'mime_type': 'application/atom+xml'}

This is a very narrow use case; it might be better to extend it to return not just a content type, but a dictionary of headers for every request. The filter itself might be a bit more complicated then, but it is only written once. You could then also filter on other header values (I can't think of a good example, but maybe format/filter certain CMSes with the X-Generator header or something).

Then again, maybe it's good enough if we provide an Atom feed filter and users just manually opt in to use that filter for jobs?

In any case, right now it doesn't handle the charset parameter (e.g. for Content-Type: text/html; charset=utf-8), so it's not fully robust yet.

To summarize, it's still a nice idea, but I'm not sure if the added complexity and potential backwards-compatibility issues (user-written job types where the retrieve() function doesn't return a mime type) are worth the single use case.

@tomaszn (Contributor Author) commented Mar 23, 2020

Thanks for the review.

  1. Including all headers. Might be useful. But some headers are numeric, like Content-Length, and to make use of them the matching algorithm would have to support comparison operators like "greater than" etc. Are you thinking about returning a dict from retrieve()?

  2. The charset parameter is already handled in a way (look for CHARSET_RE in jobs.py), but its value is ignored. The decoding is done by trying several combinations of parameters, ignoring the charset value. I wrapped that in a function definition so it can be worked on and tested later:

        def try_decoding(response):
            try:
                try:
                    try:
                        return response.content.decode('utf-8')
                    except UnicodeDecodeError:
                        return response.content.decode('latin1')
                except UnicodeDecodeError:
                    return response.content.decode('utf-8', 'ignore')
            except LookupError:
                # If this is an invalid encoding, decode as ascii (Debian bug 731931)
                return response.content.decode('ascii', 'ignore')

    I think it might even work as intended, because charset can be configured incorrectly. Anyway, mime_type is extracted from the header so we don't have to care about charset.

  3. Please also take a look at #434 (New filter 'pdf2text' to extract text from PDF). It introduces needs_binary, which is a "binary" version of mime_type. I think that implementing a PDF conversion filter with my approach would be simpler.

  4. I think that the matching mechanism is too simple. Let's look at the case of a user who just wants to have hooks for HTML and Atom:

    • AtomHook: MATCH = {'mime_type': 'application/atom+xml'}
    • AtomHook2: MATCH = {'url': re.compile('.*/atom.xml')}, because some of his favorite blogs have a misconfigured Content-Type and use text/html
    • HtmlHook: MATCH = {'mime_type': 'text/html'}

    Currently both matching hooks will be used for these misconfigured atom.xml URLs, which is not what the user wants. I see two solutions for this, and they are not mutually exclusive:

    • Introduce logical operators (AND, OR, NOT). A quick search reveals a library that implements the necessary expressions: https://pypi.org/project/filtration/
    • Priorities: if a hook matches, then lower-priority hooks (fallbacks) are not processed. It should also be possible to set the order in which hooks of the same priority run. So a hook for, say, cnn.com would run first, and then the default HTML processing would take place (or the other way round, but in a predictable manner).
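The priority idea could be sketched as follows (illustrative only; select_hook and the hook tuples are hypothetical, not urlwatch code):

```python
# Illustrative sketch of priority-based hook selection: hooks are tried in
# ascending priority order and the first match wins, so a URL-based Atom
# hook can shadow the generic HTML hook for misconfigured feeds.
def select_hook(hooks, job_properties):
    # hooks: list of (priority, match_fn, name); lower priority runs first
    for _priority, matches, name in sorted(hooks, key=lambda h: h[0]):
        if matches(job_properties):
            return name
    return None

hooks = [
    (0, lambda p: p['url'].endswith('/atom.xml'), 'AtomHook2'),
    (1, lambda p: p.get('mime_type') == 'text/html', 'HtmlHook'),
]
print(select_hook(hooks, {'url': 'https://blog.example/atom.xml',
                          'mime_type': 'text/html'}))
# AtomHook2 -- HtmlHook is skipped even though its MATCH would also apply
```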

@thp (Owner) commented May 7, 2020

  • Including all headers. Might be useful. But some headers are numeric, like Content-Length, and to make use of them the matching algorithm would have to support comparison operators like "greater than" etc. Are you thinking about returning a dict from retrieve()?

For the "comparison operators", you could just have a lambda:

class AtomFeedFilter(AutoMatchFilter):
    MATCH = {'Content-Length': lambda value: int(value) > 300}
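Generic matching that accepts either plain values or callables could be sketched like this (matches_headers is a hypothetical helper to illustrate the idea, not existing urlwatch API):

```python
# Hedged sketch: MATCH values may be plain strings (compared for equality)
# or callables (applied to the header value), as in the lambda example above.
def matches_headers(match, headers):
    for key, expected in match.items():
        value = headers.get(key)
        if value is None:
            return False
        if callable(expected):
            if not expected(value):
                return False
        elif value != expected:
            return False
    return True

print(matches_headers({'Content-Length': lambda v: int(v) > 300},
                      {'Content-Length': '512'}))
# True
```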

@thp (Owner) commented May 7, 2020

...and there's always the possibility of just writing a custom match() method for filters where it's complicated (e.g. an Atom feed matched on Content-Type, but also on URL, because of a misconfigured server).

If just looking at the headers, this is probably fine, and things like "Content-Type: text/html; charset=utf-8" can be matched using the lambda matcher (lambda value: 'text/html' in value).

@thp (Owner) commented May 7, 2020

Not 100% sure if this is a good idea or not, but let's see what you come up with (and I'd feel more comfortable with the "generic" HTTP header matching than with this use case of trying to match Atom feeds).

@thp (Owner) commented May 21, 2020

@tomaszn Any updates on this?

@thp (Owner) commented Jul 7, 2020

Closing for now. Please rebase/reopen if you want to see this happening still.

@thp closed this Jul 7, 2020