
Conversation

@vshlapakov
Contributor

@vshlapakov vshlapakov commented Dec 14, 2016

Second attempt to merge, simplify, and fix the existing SH Python clients, based on a composition approach.

Please ignore the codecov warnings: the PR contains a lot of tests - they should be enough, but they are disabled because they are based on VCR.py cassettes that I'll add in a separate PR shortly once this PR is approved and merged.

@vshlapakov vshlapakov self-assigned this Dec 14, 2016
@codecov-io

codecov-io commented Dec 15, 2016

Codecov Report

Merging #38 into master will decrease coverage by 11.01%.
The diff coverage is 38.56%.

@@             Coverage Diff             @@
##           master      #38       +/-   ##
===========================================
- Coverage    93.8%   82.78%   -11.02%     
===========================================
  Files          14       16        +2     
  Lines        1178     1470      +292     
===========================================
+ Hits         1105     1217      +112     
- Misses         73      253      +180
Impacted Files            Coverage          Δ
scrapinghub/__init__.py   100% <100%>       (ø) ⬆️
scrapinghub/client.py     37.61% <37.61%>   (ø)
scrapinghub/utils.py      40% <40%>         (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bdf6897...0c280ba.

@dangra
Contributor

dangra commented Dec 19, 2016

Looks good @vshlapakov

@vshlapakov vshlapakov changed the title from "[WIP] SH Python client update (composition)" to "SH Python client update (composition)" on Dec 21, 2016
@vshlapakov
Contributor Author

vshlapakov commented Dec 21, 2016

@dangra Ok, now it's ready for careful review. I've just finished the VCR cassettes but will add them in a separate PR to keep the changes diff smaller.
VCR.py results

Btw, regarding our discussion about retries: it looks like it's fine to rely on the sh.hubstorage code. It's possible to pass the start parameter on start, and on any exception during streaming the code extracts the key from the last chunk and passes the startafter parameter instead of the initial start: HS supports both. Am I missing anything else here?
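For reference, the resume-after-failure pattern described above could be sketched roughly like this (illustrative only; resource.iter and the '_key' field stand in for the actual hubstorage internals):

def iter_with_resume(resource, start=None, max_retries=3, **params):
    """Iterate over a streamed resource, resuming after the last seen key.

    Sketch of the behaviour described above: the first request uses `start`,
    and any retry after a streaming error uses `startafter` with the key
    extracted from the last successfully read entry.
    """
    last_key = None
    retries = 0
    while True:
        try:
            if last_key is None:
                # first attempt: use the initial `start` parameter
                stream = resource.iter(start=start, **params)
            else:
                # on retry: resume right after the last successfully read key
                stream = resource.iter(startafter=last_key, **params)
            for entry in stream:
                last_key = entry.get('_key', last_key)
                yield entry
            return
        except Exception:
            retries += 1
            if retries > max_retries:
                raise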

@vshlapakov vshlapakov requested a review from dangra December 21, 2016 20:14
@vshlapakov vshlapakov changed the title from "SH Python client update (composition)" to "SH Python client update" on Dec 22, 2016
@vshlapakov vshlapakov assigned chekunkov and unassigned vshlapakov Dec 28, 2016

def iter(self, **params):
""" Iterate over jobs collection for a given set of params.
FIXME the function returns a list of dicts, not a list of Job's
Contributor

So why not Job instances? We could have a separate method iter_dicts to iterate over dicts.

Contributor

On the other hand, if we go this path, we need something similar for summary, because it returns a list of jobs for each state, accompanied by a jobs count.

Contributor Author

Agreed that iter returns job summaries, not jobs.
Let's return to this question when there's a real need for it.
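For reference, the reviewer's suggestion could look roughly like this (a sketch; Job, self._client and the placement of the existing listing logic are assumptions, not the actual implementation):

def iter(self, **params):
    """Iterate over jobs as Job instances instead of raw dicts (sketch)."""
    for summary in self.iter_dicts(**params):
        # wrap each raw job summary into a Job bound to the current client
        yield Job(self._client, summary['key'])

def iter_dicts(self, **params):
    """Iterate over raw job summary dicts (what iter() returns today)."""
    ...  # the existing listing logic would move here unchanged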


def lastjobsummary(self, **params):
spiderid = None if not self.spider else self.spider.id
# FIXME returns a generator, not a list
Contributor

Prepend the method name with iter_? Also, IMO summary is a confusing word; this method returns metadata for a job, maybe iter_last_jobs?

_queuename, spiderid=spiderid, **params)

def lastjobsummary(self, **params):
spiderid = None if not self.spider else self.spider.id
Contributor

Oops, is that expected?

In [16]: summ = proj.jobs.lastjobsummary(spiderid=1021)

    179         spiderid = None if not self.spider else self.spider.id
    180         # FIXME returns a generator, not a list
--> 181         return self._hsproject.spiders.lastjobsummary(spiderid, **params)
    182
    183

TypeError: lastjobsummary() got multiple values for keyword argument 'spiderid'

This should either be explicitly prohibited, with an error message suggesting how to get ^ from a Spider object, or the method should check params and apply self.spider.id only if necessary.
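The second option could be implemented roughly like this (a sketch; assumes the hubstorage call accepts spiderid as a keyword):

def lastjobsummary(self, **params):
    # inject the bound spider id only if the caller didn't pass one,
    # so jobs.lastjobsummary(spiderid=1021) no longer raises TypeError
    if 'spiderid' not in params and self.spider is not None:
        params['spiderid'] = self.spider.id
    return self._hsproject.spiders.lastjobsummary(**params)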


@property
def _hsjobq(self):
return self._client._hsclient.get_project(self.projectid).jobq
Contributor

Any reason to keep it as a property and instantiate a new HS project object with get_project on each call?
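If there's no need to re-resolve it on every access, it could be created once in __init__, roughly like this (a sketch, attribute names following the ones used elsewhere in this PR):

def __init__(self, client, projectid):
    self._client = client
    self.projectid = projectid
    # resolve the hubstorage project and its jobq once, instead of calling
    # get_project() on every _hsjobq access
    self._hsproject = client._hsclient.get_project(projectid)
    self._hsjobq = self._hsproject.jobq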

self._client = client
# FIXME it'd be nice to reuse _Proxy here, but Collection init is
# a bit custom: there's a compound key and required collections
# field to create an origin instance
Contributor
@chekunkov chekunkov Dec 28, 2016

Maybe change _Proxy to take an already instantiated origin object? Agreed not to do that.

self.activity = Activity(client._hsclient, projectid)
self.collections = Collections(_Collections, client, projectid)
self.frontier = Frontier(client._hsclient, projectid)
self.reports = Reports(client._hsclient, projectid)
Contributor

Hm, I think reports is something that was removed from SH a long time ago.

self.spiders = Spiders(client, projectid)

# proxied sub-resources
self.activity = Activity(client._hsclient, projectid)
Contributor

proj.activity.list() returns a generator. I think we shouldn't leave non-proxied objects in the new client; it may cause confusion.

Contributor

In [38]: proj.activity.add({'job': 'xxx', 'user': 'xxx', 'event': 'hey-ho'})
Out[38]: <generator object jldecode at 0x106eee730>

I don't like ^ either.

Contributor Author

Well noticed, I'll fix it 👍
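One possible shape of that fix (a sketch only, assuming the Activity wrapper keeps an _origin reference to the hubstorage resource, as the other proxies in this PR do):

class Activity(object):
    def __init__(self, hsclient, projectid):
        # underlying hubstorage activity resource
        self._origin = hsclient.get_project(projectid).activity

    def list(self, **params):
        # drain the jldecode generator so callers get a plain list of events
        return list(self._origin.list(**params))

    def add(self, values, **params):
        # consume the response generator instead of leaking it to the caller
        list(self._origin.add(values, **params))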

from .hubstorage.activity import Activity
from .hubstorage.frontier import Frontier
from .hubstorage.job import JobMeta
from .hubstorage.project import Reports
Contributor

don't forget to remove import

For example, to schedule a spider run (it returns a job object)::

>>> project.jobs.schedule('spider1', arg1='val1')
<scrapinghub.client.Job at 0x106ee12e8>>
Contributor

@vshlapakov wouldn't it be much nicer for the future to have a dedicated parameter for spider arguments instead of assuming that all unknown kwargs in the schedule methods are spider args?

Contributor Author

@chekunkov Not sure tbh: it's less error-prone, but spider args are the most frequent parameter here, and it can be annoying to always provide them explicitly as spider_args={..}; I think we could keep it as-is for now.

Contributor
@chekunkov chekunkov Dec 30, 2016

@vshlapakov if we decide to add spider_args later, that would be a backwards-incompatible change if someone accidentally has a spider argument named spider_args. Another problem is that parameters like units, priority or spider_settings are not distinguishable from spider arguments. By having a separate kwarg for spider args we can guarantee that if we add new parameters to the schedule endpoint, they will not clash with the spider args that clients use.

Contributor Author

That's correct, I really have nothing against spider_args except this little doubt about convenience. Alright, better to change it now than later.
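The agreed change could look roughly like this (a sketch; the kwarg was later renamed to job_args in the final commits, and _schedule stands in for the underlying call):

def schedule(self, spidername=None, spider_args=None, **params):
    """Schedule a spider run.

    Sketch of the agreed interface: scheduling parameters such as units,
    priority or spider_settings go into **params, while spider arguments
    go into spider_args, so the two can never clash.
    """
    if not spidername and not self.spider:
        raise ValueError('Please provide spidername')
    spider_args = dict(spider_args or {})
    # hypothetical delegation to the underlying scheduling helper
    return self._schedule(spidername or self.spider.name,
                          spider_args=spider_args, **params)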


def schedule(self, spidername=None, **params):
if not spidername and not self.spider:
raise APIError('Please provide spidername')
Contributor
@chekunkov chekunkov Dec 30, 2016

I'd say it should be a ValueError. APIError currently mixes errors caused by incorrect arguments, HTTP errors from Connection, and errors from the new client, which looks like bad design to me. IMO (see the sketch after this list):

  • ValueError is what's needed if a method receives an unexpected argument (or doesn't receive an expected one) - I'd replace WrongProjectID and WrongJobKey with ValueError as well.
  • The new client deserves a new base Exception class; new exceptions like DuplicateJobError should inherit from the new base class - let's keep APIError for the legacy Connection.
  • Most 4xx responses should be covered with specific exception classes, e.g. 400 should result in ValidationError or InvalidUsage, and 404 in a generic NotFound. They should also store either the original requests.exceptions.HTTPError or the response object, to be able to get the body and check why things failed.
  • 5xx errors should be raised as-is I think; we just need to be sure it's always requests.exceptions.HTTPError.
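A minimal sketch of the hierarchy proposed above (class names other than ScrapinghubAPIError and DuplicateJobError are suggestions, not the final API):

class ScrapinghubAPIError(Exception):
    """Base class for errors raised by the new client."""

    def __init__(self, message=None, http_error=None):
        # keep the original requests.exceptions.HTTPError (if any) so callers
        # can inspect the response body and find out why the request failed
        self.http_error = http_error
        if message is None and http_error is not None:
            message = str(http_error)
        super(ScrapinghubAPIError, self).__init__(message)


class InvalidUsage(ScrapinghubAPIError):
    """400-style responses: the request itself was invalid."""


class NotFound(ScrapinghubAPIError):
    """404 responses for missing projects, jobs, collections, etc."""


class DuplicateJobError(ScrapinghubAPIError):
    """Trying to schedule a job that is already running or queued."""


# 5xx responses would be re-raised as the original HTTPError, per the last point above.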

self._proxy_methods([
'count', 'get', 'set', 'delete', 'create_writer',
'_validate_collection',
])
Contributor

I don't think we need to proxy these methods; they are only used internally from hubstorage.collectionsrt.Collection.

Contributor Author

Agreed to drop all of them except _validate_collection - it's reused in the new_collection method (and in the other new_ methods as a consequence).

'_validate_collection',
])

def new_store(self, colname):
Contributor

IMO new_ should be replaced with get_ for all the following methods. It's not a new collection if it was already created, and project.collections.get_store('foo') looks a bit more relevant here.

self._client = client
self._origin = _Collection(coltype, colname, collections._origin)
proxy_methods(self._origin, self, [
'create_writer', 'get', 'set', 'delete', 'count',
Contributor

One note about the get method - in the Hubstorage client, if you don't pass a key to get, it returns an iterator over the entire collection, literally the same thing as iter. I think it makes more sense to make the key parameter required for the get method.
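Making the key required could be as simple as (a sketch over the proxied method):

def get(self, key, *args, **kwargs):
    # without a key the underlying hubstorage call iterates the whole
    # collection, which is what iter() is for - so require the key here
    if key is None:
        raise ValueError('key is required')
    return self._origin.get(key, *args, **kwargs)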

Contributor

I'm not sure we should proxy create_writer here; it's not clear how it can be used, and it may be some legacy code.

Contributor Author

Yes, agree on both points, thanks

Contributor Author
@vshlapakov vshlapakov Dec 30, 2016

Agreed to keep create_writer - it could be useful for writing to a collection in batches.
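For context, batch writing through the proxied writer might look like this (a sketch; `project` and the store name are placeholders, and it assumes the proxied writer keeps hubstorage's write()/close() interface):

items = [{'_key': 'item%d' % i, 'value': i} for i in range(1000)]

store = project.collections.new_store('my_collection')
writer = store.create_writer()
for item in items:
    # items are buffered client-side and uploaded in batches
    writer.write(item)
# flush any remaining buffered items and wait for the upload to finish
writer.close()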


def _get_http_error_msg(exc):
try:
return exc.response.json()
Contributor

you should check if origin is HTTPError
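A sketch of that check, falling back to str(exc) for non-HTTP errors:

from requests.exceptions import HTTPError


def _get_http_error_msg(exc):
    # only HTTPError instances carry a response to extract a message from
    if not isinstance(exc, HTTPError) or exc.response is None:
        return str(exc)
    try:
        return exc.response.json()
    except ValueError:
        # the body isn't JSON - fall back to the raw text
        return exc.response.text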

class ScrapinghubAPIError(Exception):

class DuplicateJobError(APIError):
def __init__(self, origin):
Contributor

It should be possible to raise ScrapinghubAPIError without an origin and with some custom error message.

Contributor Author

Already done, please update the page

vshlapakov and others added 28 commits March 6, 2017 17:15
Minor improvements for new python-client
Add return types in docstrings
Add hints for kwargs, fix docstrings
Use update_kwargs helper to unify logic
Rename spider_args -> job_args
Unify spider param for different methods
Don't return count from job.update_tags
Improve client to use it via IDE
@vshlapakov vshlapakov merged commit ec5590b into master Mar 23, 2017
@chekunkov chekunkov deleted the sc1467-1 branch March 23, 2017 14:33