-
-
Notifications
You must be signed in to change notification settings - Fork 352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide hooks with content_type (raw from server) and mime_type #459
Conversation
3bcbc46
to
7ca2a6f
Compare
7ca2a6f
to
1ebccc0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments.
@@ -254,7 +254,7 @@ def retrieve(self, job_state): | |||
file_scheme = 'file://' | |||
if self.url.startswith(file_scheme): | |||
logger.info('Using local filesystem (%s URI scheme)', file_scheme) | |||
return open(self.url[len(file_scheme):], 'rt').read() | |||
return (open(self.url[len(file_scheme):], 'rt').read(), 'text/plain') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The local file doesn't necessarily need to be text/plain
, could be a different type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be configured manually per job. And if not defined per job, then fallback could be taken from settings. If no fallback in settings, then libmagic can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, make it an __optional__
setting in the Job (it make sense for file jobs), maybe mimetype
or something and if not set, use text/plain
.
@@ -209,7 +209,7 @@ class UrlJob(Job): | |||
__required__ = ('url',) | |||
__optional__ = ('cookies', 'data', 'method', 'ssl_no_verify', 'ignore_cached', 'http_proxy', 'https_proxy', | |||
'headers', 'ignore_connection_errors', 'ignore_http_error_codes', 'encoding', 'timeout', | |||
'ignore_timeout_errors', 'ignore_too_many_redirects') | |||
'ignore_timeout_errors', 'ignore_too_many_redirects', 'content_type', 'mime_type') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between content_type
and mime_type
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
content_type
is the raw response header including optional encoding information.mime_type
is the first part before optional ";" and doesn't contain optional encoding information, which is already taken into account when decoding the response. this is what you'd usually use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are settings from the jobs list YAML, you don't need to specify them here? (users are not supposed to "set" those values in the job configuration, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, nevermind, I understood how you plan to use this now... Basically that's a kind of "separate" kind of key that we (I think) haven't had yet where we want the job to obtain those properties (so it can be used for the filter matching) but it's not something we want the user to specify via the config variable (we'd otherwise overwrite it?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I included them here so the automatic matching takes these two dictionary keys in MATCH
into account.
After your remarks about defaults in ShellJob
s maybe it should be possible to configure it (one can create say screenshots with a script).
@@ -358,4 +364,4 @@ def retrieve(self, job_state): | |||
from requests_html import HTMLSession | |||
session = HTMLSession() | |||
response = session.get(self.navigate) | |||
return response.html.html | |||
return (response.html.html, response.headers.get('Content-type', '')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a default fallback value for response.headers.get()
? Maybe text/html
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A typical web server would have that header. If it is missing, then the server is probably a custom script. Then maybe text/plain
? or using magic.from_buffer() to determine it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, text/plain
is probably also fine. However, how does that deal with headers like:
Content-Type: text/html; charset=utf-8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's returned in a tuple and stored as content_type
. Additionally, text/html
is extracted and stored as mime_type
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I forgot that empty content_type
is also acceptable, and in that case job.content_type
and job.mime_type
are not created. Maybe that's the way to go, rather than guess?
@@ -41,7 +41,10 @@ from urlwatch import reporters | |||
# __required__ = ('username', 'password') | |||
# | |||
# def retrieve(self, job_state): | |||
# return 'Would log in to {} with {} and {}\n'.format(self.url, self.username, self.password) | |||
# return ( | |||
# 'Would log in to {} with {} and {}\n'.format(self.url, self.username, self.password), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated change, please remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request changes JobState.process()
so that it requires all retrieve
implementations to return a tuple:
data, content_type = self.job.retrieve(self)
So this example needs updating unless backward compatibility is implemented in JobState.process()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah true.. That's kind of unfortunate, it would by default break all custom user scripts. I would make it so that self.job.retrieve(self)
can handle both the return value being a string (legacy behavior) or the return value being a 2-tuple. But let's think a bit more about this (I'll add another PR review comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the basic premise of this PR, but have been thinking a bit more about it.
Your proposed use case:
MATCH = {'mime_type': 'application/atom+xml'}
This is a very narrow use case, it might be better to extend it to not just a content type, but to return a dictionary of headers for every request? The filter itself might be a bit more complicated then, but it is only written once. You could then also filter for other header values (I can't think of a good example, but maybe format/filter certain CMS with the X-Generator
header or something).
Then again, maybe it's good enough if we provide an Atom feed filter and users just manually opt in to use that filter for jobs?
In any case, right now it doesn't handle the charset
parameter (e.g. for Content-Type: text/html; charset=utf-8
) so it's not fully robust yet.
To summarize, it's still a nice idea, but not sure if the added complexity and potential backwards compatibility issues (user-written job types where the retrieve()
function doesn't return a mime type) is worth the single use case.
Thanks for the review.
|
829873b
to
048ff5d
Compare
For the "comparison operators", you could just have a lambda: class AtomFeedFilter(AutoMatchFilter):
MATCH = {'Content-Length': lambda value: int(value) > 300} |
..and there's always the possibility of just writing a custom If just looking at the headers, this is probably fine, and things like "Content-Type: text/html; charset=utf-8" can be matched using the lambda matcher ( |
Not 100% sure if this is a good idea or not, but let's see what you come up with (and the "generic" HTTP header matching I'd feel more comfortable than this use case of trying to match Atom feeds). |
@tomaszn Any updates on this? |
Closing for now. Please rebase/reopen if you want to see this happening still. |
This allows creating automatic converters like: