Improvements and fixes for new python client #48
Conversation
Force-pushed from 53bc463 to 2c08116
Force-pushed from 2c08116 to 459c928
To provide better support for the filter param across all classes that have HS-based iter* methods, we have to simplify it and make the logic reusable. The changes also contain minor renaming for cleaner code.
scrapinghub/utils.py
Outdated
filter_data = []
for elem in params.pop('filter'):
    if not isinstance(elem, (list, tuple)):
        raise ValueError("Filter condition must be tuple or list")
I think we still should accept strings as an input format
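For illustration, a minimal sketch of how the normalization could accept plain strings as well as lists/tuples (the function name and its placement are assumptions, not the actual implementation in this PR):

```python
import json


def normalize_filter_param(params):
    """Sketch: normalize params['filter'] in place.

    Each condition may be a pre-serialized string, or a list/tuple that
    gets JSON-encoded; anything else is rejected.
    """
    filter_data = []
    for elem in params.pop('filter', []):
        if isinstance(elem, str):
            # accept strings as-is (already serialized conditions)
            filter_data.append(elem)
        elif isinstance(elem, (list, tuple)):
            filter_data.append(json.dumps(elem))
        else:
            raise ValueError("Filter condition must be string, tuple or list")
    if filter_data:
        params['filter'] = filter_data
    return params
```

e.g. normalize_filter_param({'filter': [('message', 'contains', ['logger'])]}) would produce {'filter': ['["message", "contains", ["logger"]]']}.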
scrapinghub/utils.py
Outdated
for elem in params.pop('filter'):
    if not isinstance(elem, (list, tuple)):
        raise ValueError("Filter condition must be tuple or list")
    filter_data.append(json.dumps(elem))
elem len is validated on the server side, right?
Yes, examples:
InvalidUsage: unable to parse filter parameter: Match spec must contain 3 values
InvalidUsage: unable to parse filter parameter: incorrect arguments for operator isnotempty
scrapinghub/client.py
Outdated
<scrapinghub.client.Spider at 0x106ee3748>
>>> spider.id
>>> spider.key
2
project.key = project id
job.key = '123/1/2'
spider.key == 2?
I've updated all the docstrings with the following model (68f0428):
- projects 123 and 456
- project 123 has 2 spiders: (1, spider1) and (2, spider2)
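In doctest form, the key model from that fixture looks like this (illustrative values, consistent with the example above):

```python
>>> project.key
'123'
>>> spider.key
'123/1'
>>> job.key
'123/1/2'
```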
scrapinghub/client.py
Outdated
def __init__(self, client, projectid):
    self.id = projectid
    self.key = projectid
job.key is a string, shouldn't project key be a string as well?
@vshlapakov don't forget to ensure that the key is a str (self.key = str(projectid)),
and add a line with the >>> project.key output
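A minimal sketch of that suggestion (class body trimmed to the key handling; the docstring usage line is illustrative):

```python
class Project(object):
    """Sketch of the Project key handling.

    Usage::

        >>> project = client.get_project(123)
        >>> project.key
        '123'
    """

    def __init__(self, client, projectid):
        self._client = client
        # keep the key a string, consistent with job.key
        self.key = str(projectid)
```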
scrapinghub/client.py
Outdated
- retrieve logs with a given log level and filter by a word
>>> filters = [("message", "contains", ["logger"])]
>>> list(job.logs.iter(level='WARNING', filter=filters))
I know I asked that before, but maybe we should have an entity.list() method instead of the list(entity.iter()) pattern?
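If a list() shortcut is added, a sketch of the pattern could be as simple as wrapping the existing iter() (class and method shapes here are illustrative):

```python
class _IterableResource(object):
    """Illustrative base showing the list()-wraps-iter() idea."""

    def iter(self, **params):
        # the real client would stream entries from the API here
        raise NotImplementedError

    def list(self, **params):
        # convenience: job.logs.list(...) instead of list(job.logs.iter(...))
        return list(self.iter(**params))
```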
    raise ValueError("key cannot be None")
return self._origin.get(key, *args, **kwargs)


def set(self, *args, **kwargs):
hm, this method is still mentioned in the proxy_methods(self._origin, self, [...]) call https://github.com/scrapinghub/python-scrapinghub/pull/48/files#diff-797cd9f74e9304996bef041260c04262R1044
Fix spider id in docstrings to keep consistency
Don't proxy methods overwritten in Collection class
scrapinghub/client.py
Outdated
>>> spider.id
2
>>> spider.key
'1'
spider.id is 1 (or '1', whatever), but spider.key should be '123/1'
Ah, I misread your comment, sure
scrapinghub/client.py
Outdated
>>> job.update_tags(add=['consumed'])
"""
if not (add or remove):
    raise ValueError('Please provide tags to add or remove')
just in case, shouldn't it be handled by server side validation?
The server responds with 200 for now; we may change that, but what's the point of doing an empty request? Of course, it's not the only place where empty requests are allowed, but I think an empty update_tags call can also be wrongly interpreted, as if we wanted to update a local cache or something else.
but if it's empty - shouldn't api server return 400?
Well, it makes sense, I don't mind adding it 👌 But I still don't see any harm in having client-side validation for it, there's no point in doing an update() call without data - do you think it's redundant?
yeah, redundancy is my main concern. when data is validated server side - you only need to change it in one place later
Ok, convinced, I'll create a ticket for server-side changes and drop it from here
return
path = 'v2/projects/{}/spiders/{}/tags'.format(self.projectid, self.id)
url = urljoin(self._client._connection.url, path)
response = self._client._connection._session.patch(url, json=params)
Agree. Do we want to expose v2 api though?
scrapinghub/client.py
Outdated
>>> project.spiders.get('spider2')
>>> spider = project.spiders.get('spider1')
<scrapinghub.client.Spider at 0x106ee3748>
different from python interpreter output
>>> spider = project.spiders.get('localinfo')
>>> spider
<scrapinghub.client.Spider object at 0x10d6cb1d0>
I think you can simply remove <scrapinghub.client.Spider at 0x106ee3748>
scrapinghub/client.py
Outdated
self.projectid = projectid
self.id = spiderid
self.key = '{}/{}'.format(str(projectid), str(spiderid))
self.id = str(spiderid)
I find the co-existence of id and key in the Spider object a little bit confusing. Do we need them both? If so, why don't the project and job objects have both key and id? If id is needed only for internal use (e.g. to construct queries) - maybe we can make it private and remove it from the docs? If we expect users to use it - let's keep it public, but let's also be consistent.
Spider id is needed only for internal use like you mentioned, to avoid splitting the key all the time to construct queries and do some checks. Let's better make it private and remove from the docs, it makes sense to have it only for spider.
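A sketch of the agreed shape, with the numeric spider id kept private (attribute names follow the quoted diff; the class is otherwise trimmed):

```python
class Spider(object):
    def __init__(self, client, projectid, spiderid):
        self._client = client
        self.projectid = projectid
        self.key = '{}/{}'.format(projectid, spiderid)
        # internal use only: saves re-splitting the key when building queries
        self._id = str(spiderid)
```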
scrapinghub/client.py
Outdated
    return self.get('vcs', colname)


def list(self):
    return list(self._origin.apiget('list'))
let's add iter() for consistency?
the same for other classes that have list() method, but don't have iter() yet, e.g. Projects
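A sketch of adding iter() next to the existing list(), based on the quoted apiget('list') call (the class wrapper is only there to make the snippet self-contained):

```python
class _CollectionsSketch(object):
    def __init__(self, origin):
        self._origin = origin

    def iter(self):
        # added for consistency with the other iter*-style methods
        return iter(self._origin.apiget('list'))

    def list(self):
        return list(self.iter())
```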
scrapinghub/client.py
Outdated
| """ | ||
| self._origin.set(*args, **kwargs) | ||
|
|
||
| def delete(self, _keys): |
why _keys? keys is not reserved or builtin
Renamed (initially this was done for consistency with the original method, but that's not a real argument)
scrapinghub/client.py
Outdated
| """ | ||
| if (not isinstance(_keys, string_types) and | ||
| not(isinstance(_keys, (list, tuple)) and | ||
| all(isinstance(key, string_types) for key in _keys))): |
if not isinstance(keys, (list, tuple)):
    keys = [keys]
if not all(isinstance(key, string_types) for key in keys):
    raise ValueError()
except ValueError:
    raise ValueError("Job key parts should be integers")
return JobKey(*parts)
return JobKey(*map(str, parts))
This change is not backwards compatible
Ah, seems like this entire file is something that has never been merged to master, so probably we are good here... Correct me if I'm wrong
No, you're right, it's a new module, we should be good
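For reference, the behaviour discussed above could look roughly like this as a self-contained sketch (the namedtuple field names are assumptions):

```python
from collections import namedtuple

JobKey = namedtuple('JobKey', ['projectid', 'spiderid', 'jobid'])


def parse_job_key(jobkey):
    """Sketch: validate a '123/1/2'-style key and return string parts."""
    parts = str(jobkey).split('/')
    if len(parts) != 3:
        raise ValueError("Job key must have 3 parts: projectid/spiderid/jobid")
    try:
        [int(part) for part in parts]
    except ValueError:
        raise ValueError("Job key parts should be integers")
    # keep the parts as strings, matching the string job.key convention
    return JobKey(*map(str, parts))
```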
tests/client/conftest.py
Outdated
@pytest.fixture(scope='session')
def client():
    return ScrapinghubClient(auth=TEST_ADMIN_AUTH,
    return ScrapinghubClient(apikey=TEST_ADMIN_AUTH,
@vshlapakov just noticed you replaced auth with apikey. note that in some cases we pass colon separated auth string to the HubstorageClient, do you think this will never happen here?
Sorry for not being clear enough about the change, that's my bad.
Right, I think it will never happen; my arguments are the following:
- Connection works only with an apikey, while HubstorageClient logic is more flexible: it also accepts (user, pass) as a tuple, or as a colon-separated auth string, so apikey looks like a common denominator for both clients, simple and powerful enough;
- apikey looks like an enhancement over the user/pass pair, and as the latter approach was dropped from Connection at some point, I think we're not going to revise it in the nearest future.
Anyway, let's revisit it if you have something in mind.
my point is the following - users can add the new client to their requirements and use it in Scrapy Cloud. When they do so, they would probably use auth from the environment - and that's where they have to pass the decoded job_key:JWT_token pair. They don't have an option to pass an apikey, and if they pass that pair as an apikey, their requests to the dash endpoint will fail with 401, because Connection doesn't expect such input and uses the apikey parameter as the "user" part of the auth tuple
@chekunkov hm, that's true, but Dash doesn't support authorisation via job_key:JWT_token - are we going to add it for consistency? Or do you propose to leave it up to the user and document somewhere which endpoints support jobkey auth? If the latter, should we check for it and print a warning when instantiating Connection?
but Dash doesn't support authorisation via job_key:JWT_token, are we going to add it for consistency?
true, we're already discussing this for one endpoint in Dash; we may expand it to a bigger scope, both in terms of supported endpoints and entities, e.g. we can authorize JWT tokens to operate on a project level.
but my original question was more about semantics. apikey scope is kinda narrow, while in reality we use other ways to authenticate clients, which are passed in the same way, so more general auth is more suitable in my opinion.
as for handling jwt tokens - I'd expect new client to be able to consume hex encoded jwt tokens as they are stored in environment and handle decoding and splitting it into (user, password) auth tuple internally
I see, makes sense to me. Let's do what you're proposing 👌
Added, please review the changes.
Now it works in the following manner (I added a warning when using a user/pass pair in the legacy Connection): with a given SHUB_JOBAUTH env var, a client can be instantiated with:
c = ScrapinghubClient(os.environ['SHUB_JOBAUTH'])
UserWarning: A lot of endpoints support authentication only via apikey.
After that you can use the client as usual, but it only has access to job data for now.
    if not isinstance(auth, string_types):
        auth = auth.decode('ascii')
except binascii.Error:
    pass
@vshlapakov did you try it with a real apikey? It won't work correctly because a real apikey consists of valid hexadecimal characters. You should try to decode the apikey, split it into parts, check that the username part is a valid jobkey and the password part is not empty - and only after that can you assume that this is jwt token auth and can be used for auth.
Another option is to check the apikey length, but this wouldn't allow us to change the apikey length in the future without forcing users to upgrade
You're right, I focused on jwt tokens yesterday and completely missed that apikey is also hex-encoded 🤦♂️ I tried to address the issues in the following commits, check when you have some time pls, corresponding tests are included.
scrapinghub/client.py
Outdated
self._hsclient = HubstorageClient(auth=apikey, **kwargs)
auth = parse_auth(auth)
connection_kwargs = {'apikey': auth, 'url': dash_endpoint}
if len(auth) == 2:
parse_auth should raise ValueError if auth tuple cannot be extracted or its len != 2. this way you'll always be able to unpack it into username/apikey and password part and pass them to target functions directly.
Force-pushed from 8c9a047 to cafc15a
scrapinghub/utils.py
Outdated
if not isinstance(auth[0], string_types):
    raise ValueError("Login must be of a string type")
if not (auth[1] is None or isinstance(auth[1], string_types)):
    raise ValueError("Password must be None or of a string type")
let's keep it simple - both login and password can be strings only, if password isn't set it's empty string
tests/client/test_utils.py
Outdated
def test_parse_auth_apikey():
    test_key = u'\xe3\x98\xb8\xe6\x91\x84\xe9'
    apikey = encode(test_key.encode('utf8'), 'hex_codec').decode('ascii')
wat?
I tried to show that an apikey is just a utf8 string encoded with hex: all the conversions are there because codecs.encode accepts bytes and returns bytes. But yeah, now I see it's not clear; let's just test with a pre-generated test apikey
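A simplified version of that test, assuming parse_auth lives in scrapinghub.utils and maps a bare apikey to an (apikey, '') tuple, per the empty-password convention mentioned above:

```python
from scrapinghub.utils import parse_auth  # assumed import path


def test_parse_auth_apikey():
    # pre-generated, hex-looking test apikey (not a real key)
    apikey = '1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d'
    assert parse_auth(apikey) == (apikey, '')
```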
def test_parse_auth_simple():
    assert parse_auth('user:pass') == ('user', 'pass')
you already have doctests in the parse_auth function, why not simply enable them?
Right, will do
scrapinghub/utils.py
Outdated
login, _, password = auth.partition(':')
if not password:
    raise ValueError("Bad apikey, please check your credentials")
return (login, password)
honestly, it's not super clear to me how you detect apikey vs jwt here? both apikey and jwt in my local tests pass this block without exception. can you split the process into 2 explicit parts - a function that tries to extract the jobkey and jwt from the string, and if it fails - assume this is an apikey and validate it.
both apikey and jwt in my local tests pass this block without exception
That's right, both apikey and jwt are hex-encoded strings, so the value should be decoded with codecs.decode(auth, 'hex_codec') above. If there's a binascii.Error or TypeError we consider it a user:password string, but if there's no password - we assume the apikey wasn't correct.
Ok, let's try to split it to simplify the logic.
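A sketch of that two-step split, under the assumptions discussed here (helper names are made up): first try to read the string as a hex-encoded jobkey:token pair, and only if that fails treat it as an apikey or user:password string.

```python
import binascii
import codecs


def _search_for_jobauth(auth):
    """Try to read auth as a hex-encoded 'jobkey:token' pair.

    Returns (jobkey, token), or None if auth does not look like one.
    """
    try:
        decoded = codecs.decode(auth.encode('ascii'), 'hex_codec').decode('ascii')
    except (binascii.Error, UnicodeError, ValueError):
        return None
    jobkey, _, token = decoded.partition(':')
    parts = jobkey.split('/')
    # the "user" part must look like a jobkey (e.g. 123/1/2), token must be set
    if token and len(parts) == 3 and all(p.isdigit() for p in parts):
        return (jobkey, token)
    return None


def parse_auth(auth):
    """Sketch: normalize auth input into a (login, password) tuple."""
    if isinstance(auth, tuple):
        if len(auth) != 2:
            raise ValueError("Wrong number of authentication parts")
        return auth
    jobauth = _search_for_jobauth(auth)
    if jobauth is not None:
        return jobauth
    # fall back: plain apikey or a 'user:password' string
    login, _, password = auth.partition(':')
    return (login, password)
```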
both apikey and jwt are hex-encoded strings
@vshlapakov probably we shouldn't have such expectations for apikey, that may change in future.
Force-pushed from 103068b to f611720
pytest.ini
Outdated
@@ -0,0 +1,2 @@
[pytest]
addopts = --doctest-modules --doctest-glob='scrapinghub/*.py'
@vshlapakov you either need --doctest-modules OR --doctest-glob='scrapinghub/*.py' (I think the latter)
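For what it's worth, if the goal is to run the doctests embedded in the scrapinghub/*.py modules, my understanding is that --doctest-modules is the option that collects doctests from .py files, while --doctest-glob matches text files (default *.txt), so a minimal ini sketch would be:

```ini
[pytest]
addopts = --doctest-modules
```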
Force-pushed from 8f9a458 to 1f98bb9
    raise ValueError("spider_args should be a dictionary")
cleaned_args = {k: v for k, v in spider_args.items()
                if k not in params}
params.update(cleaned_args)
@vshlapakov please update docs - spider_args should be the only recommended way to pass spider arguments
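To illustrate, a self-contained sketch of the merging logic from the quoted diff (the helper name is made up): explicitly passed params win, and spider arguments are only supposed to go through spider_args.

```python
def merge_spider_args(spider_args, **params):
    """Sketch of the cleanup from the diff above."""
    if not isinstance(spider_args, dict):
        raise ValueError("spider_args should be a dictionary")
    # drop spider args that would shadow explicit scheduling params
    cleaned_args = {k: v for k, v in spider_args.items() if k not in params}
    params.update(cleaned_args)
    return params


# prints {'units': 1, 'start_url': 'http://example.com'}
print(merge_spider_args({'start_url': 'http://example.com', 'units': 5}, units=1))
```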
The PR is created to cover the following minor things found after testing:
- project.id / spider.id / job.key (use key or id everywhere)
- filter param
- InvalidUsage or ScrapinghubAPIError

Other issues were added after the next step of testing on February 8th.

Also, there's a new spider.update_tags method that will be added on the server side soon. Each of the points will be implemented in separate commit(s) to simplify reviewing.