SH Python client update #38
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master      #38       +/-   ##
===========================================
- Coverage    93.8%   82.78%   -11.02%
===========================================
  Files          14       16        +2
  Lines        1178     1470      +292
===========================================
+ Hits         1105     1217      +112
- Misses         73      253      +180
```
Continue to review full report at Codecov.
Looks good @vshlapakov
@dangra Ok, now it's ready for careful review. I've just finished with the VCR cassettes, but I'll add them in a separate PR to reduce the size of the diff. Btw, regarding our discussion about retries: it looks like it's fine to rely on the sh.hubstorage code; it's possible to pass
scrapinghub/client.py (Outdated)

```python
def iter(self, **params):
    """ Iterate over jobs collection for a given set of params.
    FIXME the function returns a list of dicts, not a list of Job's
```
So why not Job instances? We could have a separate iter_dicts method to iterate over dicts.
On the other hand, if we go down this path, we need something similar for summary, because it returns a list of jobs for each state, accompanied by a jobs count.
Agreed that iter returns a jobs summary, not jobs. Let's return to the question when there's a real need for it.
scrapinghub/client.py (Outdated)

```python
def lastjobsummary(self, **params):
    spiderid = None if not self.spider else self.spider.id
    # FIXME returns a generator, not a list
```
Prepend the method with iter_? Also, IMO summary is a confusing word; this method returns metadata for a job, maybe iter_last_jobs?
scrapinghub/client.py (Outdated)

```python
        _queuename, spiderid=spiderid, **params)


def lastjobsummary(self, **params):
    spiderid = None if not self.spider else self.spider.id
```
Oops, is that expected?

```python
In [16]: summ = proj.jobs.lastjobsummary(spiderid=1021)

    179         spiderid = None if not self.spider else self.spider.id
    180         # FIXME returns a generator, not a list
--> 181         return self._hsproject.spiders.lastjobsummary(spiderid, **params)
    182
    183

TypeError: lastjobsummary() got multiple values for keyword argument 'spiderid'
```
This should either be explicitly prohibited, with an error message that suggests how to get it from the Spider object, or params should be checked and self.spider.id applied only if necessary.
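A minimal sketch of the second option, reusing the lastjobsummary shape from the quoted diff; the guard is illustrative only, not the final implementation:

```python
def lastjobsummary(self, **params):
    # Fall back to the bound spider's id only when the caller hasn't
    # passed spiderid explicitly, avoiding the TypeError above.
    if 'spiderid' not in params and self.spider:
        params['spiderid'] = self.spider.id
    return self._hsproject.spiders.lastjobsummary(**params)
```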
scrapinghub/client.py (Outdated)

```python
@property
def _hsjobq(self):
    return self._client._hsclient.get_project(self.projectid).jobq
```
Any reason to keep it as a property and instantiate a new HS project object with get_project on each call?
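For illustration, one alternative (a sketch only; attribute names are assumed from the quoted diff) is to resolve the HS project once in __init__ and reuse its jobq:

```python
def __init__(self, client, projectid):
    self._client = client
    self.projectid = projectid
    # Resolve the HS project once instead of calling get_project()
    # on every property access.
    self._hsjobq = client._hsclient.get_project(projectid).jobq
```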
scrapinghub/client.py (Outdated)

```python
    self._client = client
    # FIXME it'd be nice to reuse _Proxy here, but Collection init is
    # a bit custom: there's a compound key and required collections
    # field to create an origin instance
```
Maybe change _Proxy to take an already instantiated origin object?

Agreed not to do that.
scrapinghub/client.py (Outdated)

```python
    self.activity = Activity(client._hsclient, projectid)
    self.collections = Collections(_Collections, client, projectid)
    self.frontier = Frontier(client._hsclient, projectid)
    self.reports = Reports(client._hsclient, projectid)
```
Hm, I think reports is something that was removed from SH a long time ago.
scrapinghub/client.py (Outdated)

```python
    self.spiders = Spiders(client, projectid)

    # proxied sub-resources
    self.activity = Activity(client._hsclient, projectid)
```
proj.activity.list() returns a generator. I think we shouldn't leave non-proxied objects in the new client; it may cause confusion.
```python
In [38]: proj.activity.add({'job': 'xxx', 'user': 'xxx', 'event': 'hey-ho'})
Out[38]: <generator object jldecode at 0x106eee730>
```

I don't like ^ either.
Well noticed, I'll fix it 👍
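One possible shape for such a fix (purely a sketch; the wrapper and attribute names are assumed, not the actual change):

```python
def add(self, values, **params):
    # Exhaust the lazy jldecode generator so the request is executed
    # here instead of returning a generator object to the caller.
    result = self._origin.add(values, **params)
    if hasattr(result, '__iter__'):
        list(result)
```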
scrapinghub/client.py (Outdated)

```python
from .hubstorage.activity import Activity
from .hubstorage.frontier import Frontier
from .hubstorage.job import JobMeta
from .hubstorage.project import Reports
```
Don't forget to remove the import.
README_client.rst (Outdated)

```rst
For example, to schedule a spider run (it returns a job object)::

    >>> project.jobs.schedule('spider1', arg1='val1')
    <scrapinghub.client.Job at 0x106ee12e8>
```
@vshlapakov wouldn't it be much nicer for the future to have a dedicated parameter for spider arguments, instead of assuming that all unknown kwargs in the schedule methods are spider args?
@chekunkov Not sure tbh: it's less error-prone, but spider args are the most frequent parameter here, and it could be annoying to always provide them explicitly as spider_args={..}; I think we could keep it as-is for now.
@vshlapakov if we decide to add spider_args later, that would be a backwards-incompatible change for anyone who happens to have a spider argument named spider_args. Another problem is that parameters like units, priority, or spider_settings are not distinguishable from spider arguments. By having a separate kwarg for spider args, we can guarantee that if we add new parameters to the schedule endpoint, they won't clash with spider args that clients use.
That's correct, I really have nothing against spider_args except this little doubt about convenience. Alright, better to change it now than later.
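For illustration, the difference under discussion (a sketch only; the spider_args keyword is the proposal here, not the current API):

```python
# Current behaviour: every unknown kwarg is treated as a spider argument,
# so it cannot be told apart from scheduler parameters such as units.
project.jobs.schedule('spider1', arg1='val1', units=2)

# Proposed behaviour: spider arguments live in a dedicated kwarg, so new
# scheduler parameters can never clash with user-defined spider args.
project.jobs.schedule('spider1', spider_args={'arg1': 'val1'}, units=2)
```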
scrapinghub/client.py (Outdated)

```python
def schedule(self, spidername=None, **params):
    if not spidername and not self.spider:
        raise APIError('Please provide spidername')
```
I'd say it should be a ValueError. APIError currently mixes errors caused by incorrect arguments, HTTP errors from Connection, and errors from the new client, which looks like bad design to me. IMO:

- ValueError is what's needed if a method receives an unexpected argument (or doesn't receive an expected one); I'd replace WrongProjectID and WrongJobKey with ValueError as well.
- The new client deserves a new base Exception class, and new exceptions like DuplicateJobError should inherit from it; let's keep APIError for the legacy Connection.
- Most 4xx responses should be covered by specific exception classes, e.g. 400 should result in ValidationError or InvalidUsage, and 404 in a generic NotFound. They should also store either the original requests.exceptions.HTTPError or the response object, so callers can get the body and check why things failed.
- 5xx errors should be raised as-is, I think; we just need to be sure it's always requests.exceptions.HTTPError.
scrapinghub/client.py (Outdated)

```python
    self._proxy_methods([
        'count', 'get', 'set', 'delete', 'create_writer',
        '_validate_collection',
    ])
```
I don't think we need to proxy these methods; they are only used internally from hubstorage.collectionsrt.Collection.
Agreed to drop all of them except _validate_collection; it's reused in the new_collection method (and in the other new_ methods as a consequence).
scrapinghub/client.py (Outdated)

```python
        '_validate_collection',
    ])


def new_store(self, colname):
```
IMO new_ should be replaced with get_ for all the following methods: it's not a new collection if it was already created, and project.collections.get_store('foo') looks a bit more relevant here.
scrapinghub/client.py (Outdated)

```python
    self._client = client
    self._origin = _Collection(coltype, colname, collections._origin)
    proxy_methods(self._origin, self, [
        'create_writer', 'get', 'set', 'delete', 'count',
```
One note about the get method: in the Hubstorage client, if you don't pass a key to get, it returns an iterator over the entire collection, literally the same thing as iter. I think it makes more sense to make the key parameter required for the get method.
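Something along these lines, purely as a sketch (the wrapper shape is assumed, not the actual implementation):

```python
def get(self, key, **params):
    # Require the key so that get() never silently degrades into a full
    # collection scan; iter() already covers that use case.
    if key is None:
        raise ValueError("key is required; use iter() to scan the collection")
    return self._origin.get(key, **params)
```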
I'm not sure we should proxy create_writer here; it's not clear how it can be used, and it may be some legacy code.
Yes, agree on both points, thanks
Agreed to keep create_writer; it could be useful for writing to a collection in batches.
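For example (a hypothetical usage sketch, assuming the writer keeps the hubstorage-style write()/close() interface and the get_store naming proposed above):

```python
store = project.collections.get_store('prices')
writer = store.create_writer()
for item in items:
    writer.write(item)  # buffered client-side, flushed in batches
writer.close()          # flush any remaining items
```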
scrapinghub/utils.py (Outdated)

```python
def _get_http_error_msg(exc):
    try:
        return exc.response.json()
```
You should check if origin is an HTTPError.
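Something like this, as a sketch built on the helper shape from the quoted diff:

```python
from requests.exceptions import HTTPError


def _get_http_error_msg(exc):
    # Only HTTPError instances carry a response whose body we can decode.
    if not isinstance(exc, HTTPError):
        return None
    try:
        return exc.response.json()
    except ValueError:
        return None
```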
scrapinghub/utils.py (Outdated)

```python
class ScrapinghubAPIError(Exception):


class DuplicateJobError(APIError):
    def __init__(self, origin):
```
It should be possible to raise ScrapinghubAPIError without an origin and with a custom error message.
Already done, please update the page
Minor improvements for new python-client
- Add return types in docstrings
- Add hints for kwargs, fix docstrings
- Use update_kwargs helper to unify logic
- Rename spider_args -> job_args
- Unify spider param for different methods
- Don't return count from job.update_tags
Improve client to use it via IDE
Handle ValueError from _decode_response

Second attempt to merge, simplify, and fix the existing SH python clients, based on a composition approach.
Please ignore the codecov warnings: the PR contains a lot of tests, which should be enough, but they are disabled because they're based on VCR.py cassettes that I'll add in a separate PR shortly, once this PR is approved and merged.