
Conversation

@vshlapakov
Contributor

@vshlapakov vshlapakov commented Dec 14, 2016

Second attempt to merge, simplify, and fix the existing SH Python clients, based on a composition approach.

Please ignore the codecov warnings: the PR contains a lot of tests - they should be enough, but they are disabled because they are based on VCR.py cassettes that I'll add in a separate PR shortly once this PR is approved and merged.

@vshlapakov vshlapakov self-assigned this Dec 14, 2016
@codecov-io

codecov-io commented Dec 15, 2016

Codecov Report

Merging #38 into master will decrease coverage by 11.01%.
The diff coverage is 38.56%.

@@             Coverage Diff             @@
##           master      #38       +/-   ##
===========================================
- Coverage    93.8%   82.78%   -11.02%     
===========================================
  Files          14       16        +2     
  Lines        1178     1470      +292     
===========================================
+ Hits         1105     1217      +112     
- Misses         73      253      +180
Impacted Files            Coverage          Δ
scrapinghub/__init__.py   100% <100%>       (ø) ⬆️
scrapinghub/client.py     37.61% <37.61%>   (ø)
scrapinghub/utils.py      40% <40%>         (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bdf6897...0c280ba.

@dangra
Contributor

dangra commented Dec 19, 2016

Looks good @vshlapakov

@vshlapakov vshlapakov changed the title from "[WIP] SH Python client update (composition)" to "SH Python client update (composition)" on Dec 21, 2016
@vshlapakov
Contributor Author

vshlapakov commented Dec 21, 2016

@dangra Ok, now it's ready for careful review. I've just finished the VCR cassettes but will add them in a separate PR to keep the changes diff smaller.
VCR.py results

Btw, regarding our discussion about retries: it looks like it's fine to rely on the sh.hubstorage code. It's possible to pass the start parameter on start, and on any exception during streaming the code extracts the key from the last chunk and passes the startafter parameter instead of the initial start: HS supports both. Am I missing anything else here?
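For reference, the resume-after-failure pattern described above could be sketched roughly like this (illustrative only; resource.iter and the '_key' field stand in for the actual hubstorage internals):

def iter_with_resume(resource, start=None, max_retries=3, **params):
    """Iterate over a streamed resource, resuming after the last seen key.

    Sketch of the behaviour described above: the first request uses `start`,
    and any retry after a streaming error uses `startafter` with the key
    extracted from the last successfully read entry.
    """
    last_key = None
    retries = 0
    while True:
        try:
            if last_key is None:
                # first attempt: use the initial `start` parameter
                stream = resource.iter(start=start, **params)
            else:
                # on retry: resume right after the last successfully read key
                stream = resource.iter(startafter=last_key, **params)
            for entry in stream:
                last_key = entry.get('_key', last_key)
                yield entry
            return
        except Exception:
            retries += 1
            if retries > max_retries:
                raise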

@vshlapakov vshlapakov requested a review from dangra December 21, 2016 20:14
@vshlapakov vshlapakov changed the title from "SH Python client update (composition)" to "SH Python client update" on Dec 22, 2016
@vshlapakov vshlapakov assigned chekunkov and unassigned vshlapakov Dec 28, 2016

def iter(self, **params):
""" Iterate over jobs collection for a given set of params.
FIXME the function returns a list of dicts, not a list of Job's
Contributor

So why not Job instances? We could have a separate method iter_dicts to iterate over dicts.

Contributor

On the other hand, if we go this path, we need something similar for summary, because it returns a list of jobs for each state, accompanied by a jobs count.

Contributor Author

Agreed that iter returns job summaries, not jobs.
Let's return to this question when there's a real need for it.
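For reference, the reviewer's suggestion could look roughly like this (a sketch; Job, self._client and the placement of the existing listing logic are assumptions, not the actual implementation):

def iter(self, **params):
    """Iterate over jobs as Job instances instead of raw dicts (sketch)."""
    for summary in self.iter_dicts(**params):
        # wrap each raw job summary into a Job bound to the current client
        yield Job(self._client, summary['key'])

def iter_dicts(self, **params):
    """Iterate over raw job summary dicts (what iter() returns today)."""
    ...  # the existing listing logic would move here unchanged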


def lastjobsummary(self, **params):
spiderid = None if not self.spider else self.spider.id
# FIXME returns a generator, not a list
Contributor

Prepend the method name with iter_? Also, IMO summary is a confusing word; this method returns metadata for a job, maybe iter_last_jobs?

_queuename, spiderid=spiderid, **params)

def lastjobsummary(self, **params):
spiderid = None if not self.spider else self.spider.id
Contributor

Oops, is that expected?

In [16]: summ = proj.jobs.lastjobsummary(spiderid=1021)

    179         spiderid = None if not self.spider else self.spider.id
    180         # FIXME returns a generator, not a list
--> 181         return self._hsproject.spiders.lastjobsummary(spiderid, **params)
    182
    183

TypeError: lastjobsummary() got multiple values for keyword argument 'spiderid'

This should either be explicitly prohibited, with an error message suggesting how to get ^ from a Spider object, or the method should check params and apply self.spider.id only if necessary.
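The second option could be implemented roughly like this (a sketch; assumes the hubstorage call accepts spiderid as a keyword):

def lastjobsummary(self, **params):
    # inject the bound spider id only if the caller didn't pass one,
    # so jobs.lastjobsummary(spiderid=1021) no longer raises TypeError
    if 'spiderid' not in params and self.spider is not None:
        params['spiderid'] = self.spider.id
    return self._hsproject.spiders.lastjobsummary(**params)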


@property
def _hsjobq(self):
return self._client._hsclient.get_project(self.projectid).jobq
Contributor

Any reason to keep it as a property and instantiate a new HS project object with get_project on each call?
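If there's no need to re-resolve it on every access, it could be created once in __init__, roughly like this (a sketch, attribute names following the ones used elsewhere in this PR):

def __init__(self, client, projectid):
    self._client = client
    self.projectid = projectid
    # resolve the hubstorage project and its jobq once, instead of calling
    # get_project() on every _hsjobq access
    self._hsproject = client._hsclient.get_project(projectid)
    self._hsjobq = self._hsproject.jobq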

self._client = client
# FIXME it'd be nice to reuse _Proxy here, but Collection init is
# a bit custom: there's a compound key and required collections
# field to create an origin instance
Contributor
@chekunkov chekunkov Dec 28, 2016

Maybe change _Proxy to take an already instantiated origin object? Agreed not to do that.

self.activity = Activity(client._hsclient, projectid)
self.collections = Collections(_Collections, client, projectid)
self.frontier = Frontier(client._hsclient, projectid)
self.reports = Reports(client._hsclient, projectid)
Contributor

Hm, I think reports is something that was removed from SH a long time ago.

self.spiders = Spiders(client, projectid)

# proxied sub-resources
self.activity = Activity(client._hsclient, projectid)
Contributor

proj.activity.list() returns a generator. I think we shouldn't leave non-proxied objects in the new client; it may cause confusion.

Contributor

In [38]: proj.activity.add({'job': 'xxx', 'user': 'xxx', 'event': 'hey-ho'})
Out[38]: <generator object jldecode at 0x106eee730>

I don't like ^ either.

Contributor Author

Well noticed, I'll fix it 👍
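One possible shape of that fix (a sketch only, assuming the Activity wrapper keeps an _origin reference to the hubstorage resource, as the other proxies in this PR do):

class Activity(object):
    def __init__(self, hsclient, projectid):
        # underlying hubstorage activity resource
        self._origin = hsclient.get_project(projectid).activity

    def list(self, **params):
        # drain the jldecode generator so callers get a plain list of events
        return list(self._origin.list(**params))

    def add(self, values, **params):
        # consume the response generator instead of leaking it to the caller
        list(self._origin.add(values, **params))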

from .hubstorage.activity import Activity
from .hubstorage.frontier import Frontier
from .hubstorage.job import JobMeta
from .hubstorage.project import Reports
Contributor

don't forget to remove import

For example, to schedule a spider run (it returns a job object)::

>>> project.jobs.schedule('spider1', arg1='val1')
<scrapinghub.client.Job at 0x106ee12e8>>
Contributor

@vshlapakov wouldn't it be much nicer for the future to have a dedicated parameter for spider arguments instead of assuming that all unknown kwargs in the schedule methods are spider args?

Contributor Author

@chekunkov Not sure tbh: it's less error-prone, but spider args are the most frequent parameter here, and it can be annoying to always provide them explicitly as spider_args={..}; I think we could keep it as-is for now.

Contributor
@chekunkov chekunkov Dec 30, 2016

@vshlapakov if we decide to add spider_args later, that would be a backwards-incompatible change if someone accidentally has a spider argument named spider_args. Another problem is that parameters like units, priority or spider_settings are not distinguishable from spider arguments. By having a separate kwarg for spider args we can guarantee that if we add new parameters to the schedule endpoint, they will not clash with the spider args that clients use.

Contributor Author

That's correct, I really have nothing against spider_args except this little doubt about convenience. Alright, better to change it now than later.
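The agreed change could look roughly like this (a sketch; the kwarg was later renamed to job_args in the final commits, and _schedule stands in for the underlying call):

def schedule(self, spidername=None, spider_args=None, **params):
    """Schedule a spider run.

    Sketch of the agreed interface: scheduling parameters such as units,
    priority or spider_settings go into **params, while spider arguments
    go into spider_args, so the two can never clash.
    """
    if not spidername and not self.spider:
        raise ValueError('Please provide spidername')
    spider_args = dict(spider_args or {})
    # hypothetical delegation to the underlying scheduling helper
    return self._schedule(spidername or self.spider.name,
                          spider_args=spider_args, **params)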


def schedule(self, spidername=None, **params):
if not spidername and not self.spider:
raise APIError('Please provide spidername')
Contributor
@chekunkov chekunkov Dec 30, 2016

I'd say it should be a ValueError. APIError currently mixes errors caused by incorrect arguments, HTTP errors from Connection, and errors from the new client, which looks like bad design to me. IMO (see the sketch after this list):

  • ValueError is what's needed if a method receives an unexpected argument (or doesn't receive an expected one) - I'd replace WrongProjectID and WrongJobKey with ValueError as well.
  • The new client deserves a new base Exception class; new exceptions like DuplicateJobError should inherit from the new base class - let's keep APIError for the legacy Connection.
  • Most 4xx responses should be covered with specific exception classes, e.g. 400 should result in ValidationError or InvalidUsage, and 404 in a generic NotFound. They should also store either the original requests.exceptions.HTTPError or the response object, to be able to get the body and check why things failed.
  • 5xx errors should be raised as-is I think; we just need to be sure it's always requests.exceptions.HTTPError.
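A minimal sketch of the hierarchy proposed above (class names other than ScrapinghubAPIError and DuplicateJobError are suggestions, not the final API):

class ScrapinghubAPIError(Exception):
    """Base class for errors raised by the new client."""

    def __init__(self, message=None, http_error=None):
        # keep the original requests.exceptions.HTTPError (if any) so callers
        # can inspect the response body and find out why the request failed
        self.http_error = http_error
        if message is None and http_error is not None:
            message = str(http_error)
        super(ScrapinghubAPIError, self).__init__(message)


class InvalidUsage(ScrapinghubAPIError):
    """400-style responses: the request itself was invalid."""


class NotFound(ScrapinghubAPIError):
    """404 responses for missing projects, jobs, collections, etc."""


class DuplicateJobError(ScrapinghubAPIError):
    """Trying to schedule a job that is already running or queued."""


# 5xx responses would be re-raised as the original HTTPError, per the last point above.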

self._proxy_methods([
'count', 'get', 'set', 'delete', 'create_writer',
'_validate_collection',
])
Contributor

I don't think we need to proxy these methods; they are only used internally from hubstorage.collectionsrt.Collection.

Contributor Author

Agreed to drop all of them except _validate_collection - it's reused in the new_collection method (and in the other new_ methods as a consequence).

'_validate_collection',
])

def new_store(self, colname):
Contributor

IMO new_ should be replaced with get_ for all the following methods. It's not a new collection if it was already created, and project.collections.get_store('foo') looks a bit more relevant here.

self._client = client
self._origin = _Collection(coltype, colname, collections._origin)
proxy_methods(self._origin, self, [
'create_writer', 'get', 'set', 'delete', 'count',
Contributor

One note about the get method - in the Hubstorage client, if you don't pass a key to get, it returns an iterator over the entire collection, literally the same thing as iter. I think it makes more sense to make the key parameter required for the get method.
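Making the key required could be as simple as (a sketch over the proxied method):

def get(self, key, *args, **kwargs):
    # without a key the underlying hubstorage call iterates the whole
    # collection, which is what iter() is for - so require the key here
    if key is None:
        raise ValueError('key is required')
    return self._origin.get(key, *args, **kwargs)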

Contributor

I'm not sure we should proxy create_writer here; it's not clear how it can be used, and it may be some legacy code.

Contributor Author

Yes, agree on both points, thanks

Contributor Author
@vshlapakov vshlapakov Dec 30, 2016

Agreed to keep create_writer - it could be useful for writing to a collection in batches.
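For context, batch writing through the proxied writer might look like this (a sketch; `project` and the store name are placeholders, and it assumes the proxied writer keeps hubstorage's write()/close() interface):

items = [{'_key': 'item%d' % i, 'value': i} for i in range(1000)]

store = project.collections.new_store('my_collection')
writer = store.create_writer()
for item in items:
    # items are buffered client-side and uploaded in batches
    writer.write(item)
# flush any remaining buffered items and wait for the upload to finish
writer.close()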


def _get_http_error_msg(exc):
try:
return exc.response.json()
Contributor

you should check if origin is HTTPError
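A sketch of that check, falling back to str(exc) for non-HTTP errors:

from requests.exceptions import HTTPError


def _get_http_error_msg(exc):
    # only HTTPError instances carry a response to extract a message from
    if not isinstance(exc, HTTPError) or exc.response is None:
        return str(exc)
    try:
        return exc.response.json()
    except ValueError:
        # the body isn't JSON - fall back to the raw text
        return exc.response.text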

class ScrapinghubAPIError(Exception):

class DuplicateJobError(APIError):
def __init__(self, origin):
Contributor

It should be possible to raise ScrapinghubAPIError without an origin and with some custom error message.

Contributor Author

Already done, please update the page

vshlapakov and others added 28 commits March 6, 2017 17:15
Minor improvements for new python-client
Add return types in docstrings
Add hints for kwargs, fix docstrings
Use update_kwargs helper to unify logic
Rename spider_args -> job_args
Unify spider param for different methods
Don't return count from job.update_tags
Improve client to use it via IDE
@vshlapakov vshlapakov merged commit ec5590b into master Mar 23, 2017
@chekunkov chekunkov deleted the sc1467-1 branch March 23, 2017 14:33