
Conversation

@vshlapakov (Contributor) commented on Feb 7, 2017:

This PR covers the following minor issues found after testing:

  • fix inconsistencies between project.id, spider.id and job.key (use key or id consistently everywhere)
  • collection.set() should return None instead of a generator
  • simplify and unify the use of the HS filter param
  • activity.add(): add client-side validation for the project id
  • collections.iter() should iterate through collections
  • unify errors raised when the apikey is wrong (currently either InvalidUsage or ScrapinghubAPIError)
  • handle HTTP 401 from JobQ separately, providing a clear message instead of raw HTML

Other issues added after the next round of testing, on February 8th:

  • collection.delete() should return None instead of a generator
  • add validation for collection.delete input
  • validate that the update_tags method receives any input

Also there's a new spider.update_tags method that will be added on server-side soon.

Each point will be implemented in a separate commit (or commits) to simplify reviewing.

@vshlapakov requested a review from chekunkov on February 7, 2017 16:39
To better support the filter param for all classes with HS-based iter* methods, we have to simplify it and make the logic reusable. The changes also include minor renaming for cleaner code.
filter_data = []
for elem in params.pop('filter'):
    if not isinstance(elem, (list, tuple)):
        raise ValueError("Filter condition must be tuple or list")
@chekunkov (Contributor) commented on Feb 8, 2017:

I think we still should accept strings as an input format

for elem in params.pop('filter'):
    if not isinstance(elem, (list, tuple)):
        raise ValueError("Filter condition must be tuple or list")
    filter_data.append(json.dumps(elem))
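
A minimal sketch of what accepting strings as well could look like; the helper name _format_filter_param and its exact shape are illustrative assumptions, not the final implementation:

import json

from six import string_types


def _format_filter_param(filter_param):
    """Normalize the HS filter param into a list of JSON-encoded conditions.

    Plain strings are assumed to be already JSON-encoded and are passed
    through as-is; lists and tuples are JSON-encoded; anything else is
    rejected.
    """
    filter_data = []
    for elem in filter_param:
        if isinstance(elem, string_types):
            filter_data.append(elem)
        elif isinstance(elem, (list, tuple)):
            filter_data.append(json.dumps(elem))
        else:
            raise ValueError(
                "Filter condition must be string, tuple or list")
    return filter_data

With this, _format_filter_param([("message", "contains", ["logger"])]) and _format_filter_param(['["message", "contains", ["logger"]]']) produce the same result.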
@chekunkov (Contributor) commented:

elem len is validated on the server side, right?

@vshlapakov (Contributor, Author) commented:

Yes, examples:

InvalidUsage: unable to parse filter parameter: Match spec must contain 3 values
InvalidUsage: unable to parse filter parameter: incorrect arguments for operator isnotempty

<scrapinghub.client.Spider at 0x106ee3748>
>>> spider.id
>>> spider.key
2
@chekunkov (Contributor) commented:

project.key = project id
job.key = '123/1/2'

spider.key == 2?

@vshlapakov (Contributor, Author) commented on Feb 9, 2017:

I've updated all the docstrings with the following model (68f0428):

  • projects 123 and 456
  • project 123 has 2 spiders: (1, spider1) and (2, spider2)


def __init__(self, client, projectid):
    self.id = projectid
    self.key = projectid
@chekunkov (Contributor) commented:

job.key is a string, shouldn't project key be a string as well?

@chekunkov (Contributor) commented:

@vshlapakov don't forget to ensure that key is a str: self.key = str(projectid)

and add a line with the >>> project.key output

- retrieve logs with a given log level and filter by a word
>>> filters = [("message", "contains", ["logger"])]
>>> list(job.logs.iter(level='WARNING', filter=filters))
@chekunkov (Contributor) commented:

I know I asked that before, but maybe we should have an entity.list() method instead of the list(entity.iter()) pattern?
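
A minimal sketch of the pattern, assuming list() simply materializes iter(); the Logs class and its attributes here are illustrative, not the library's actual classes:

class Logs(object):
    """Toy entity exposing both iter() and list()."""

    def __init__(self, entries):
        self._entries = entries

    def iter(self, level=None, **params):
        # lazily yield log entries, optionally filtered by level
        for entry in self._entries:
            if level is None or entry.get('level') == level:
                yield entry

    def list(self, *args, **kwargs):
        # convenience shortcut equivalent to list(entity.iter(...))
        return list(self.iter(*args, **kwargs))

A call like logs.list(level='WARNING') would then replace list(logs.iter(level='WARNING')).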

raise ValueError("key cannot be None")
return self._origin.get(key, *args, **kwargs)

def set(self, *args, **kwargs):
@chekunkov (Contributor) commented:

hm, this method is still mentioned in the proxy_methods(self._origin, self, [...]) call https://github.com/scrapinghub/python-scrapinghub/pull/48/files#diff-797cd9f74e9304996bef041260c04262R1044

>>> spider.id
2
>>> spider.key
'1'
@chekunkov (Contributor) commented:

spider.id is 1 (or '1', whatever), but spider.key should be '123/1'

@vshlapakov (Contributor, Author) commented:

Ah, I misread your comment, sure



    >>> job.update_tags(add=['consumed'])
    """
    if not (add or remove):
        raise ValueError('Please provide tags to add or remove')
@chekunkov (Contributor) commented:

just in case, shouldn't it be handled by server side validation?

@vshlapakov (Contributor, Author) commented on Feb 9, 2017:

The server responds with 200 for now; we may change that, but what's the point of making an empty request? Of course, this isn't the only place where empty requests are allowed, but an empty update_tags call can also be misinterpreted, as if we wanted to update a local cache or something similar.

@chekunkov (Contributor) commented:

but if it's empty, shouldn't the API server return 400?

@vshlapakov (Contributor, Author) commented:

Well, it makes sense, I don't mind adding it 👌 But I still don't see any harm in having client-side validation for it; there's no point in making an update() call without data. Do you think it's redundant?

@chekunkov (Contributor) commented:

yeah, redundancy is my main concern. When data is validated server-side, you only need to change it in one place later.

@vshlapakov (Contributor, Author) commented:

Ok, convinced, I'll create a ticket for server-side changes and drop it from here

@vshlapakov changed the title from "[WIP] Improvements and fixes for new python client" to "Improvements and fixes for new python client" on Feb 16, 2017
    return
path = 'v2/projects/{}/spiders/{}/tags'.format(self.projectid, self.id)
url = urljoin(self._client._connection.url, path)
response = self._client._connection._session.patch(url, json=params)
@chekunkov (Contributor) commented:

Agree. Do we want to expose v2 api though?

>>> project.spiders.get('spider2')
>>> spider = project.spiders.get('spider1')
<scrapinghub.client.Spider at 0x106ee3748>
@chekunkov (Contributor) commented:

this is different from the actual Python interpreter output:

>>> spider = project.spiders.get('localinfo')
>>> spider
<scrapinghub.client.Spider object at 0x10d6cb1d0>

I think you can simply remove <scrapinghub.client.Spider at 0x106ee3748>

self.projectid = projectid
self.id = spiderid
self.key = '{}/{}'.format(str(projectid), str(spiderid))
self.id = str(spiderid)
@chekunkov (Contributor) commented:

I find the co-existence of id and key in the Spider object a little confusing. Do we need them both? If so, why don't the project and job objects have both key and id? If id is needed only for internal use (e.g. to construct queries), maybe we can make it private and remove it from the docs? If we expect users to use it, let's keep it public, but let's also be consistent.

@vshlapakov (Contributor, Author) commented:

Spider id is needed only for internal use, as you mentioned, to avoid splitting the key all the time to construct queries and do some checks. Let's make it private and remove it from the docs; it makes sense to have it only for Spider.

    return self.get('vcs', colname)

def list(self):
    return list(self._origin.apiget('list'))
@chekunkov (Contributor) commented:

let's add iter() for consistency?

@chekunkov (Contributor) commented:

the same for other classes that have a list() method but don't have iter() yet, e.g. Projects

"""
self._origin.set(*args, **kwargs)

def delete(self, _keys):
@chekunkov (Contributor) commented:

why _keys? keys is not reserved or builtin

@vshlapakov (Contributor, Author) commented:

Renamed (initially it was done for consistency with the original method, but that's not a real argument).

"""
if (not isinstance(_keys, string_types) and
not(isinstance(_keys, (list, tuple)) and
all(isinstance(key, string_types) for key in _keys))):
@chekunkov (Contributor) commented:

if not isinstance(keys, (list, tuple)):
    keys = [keys]
if not all(isinstance(key, string_types) for key in keys):
    raise ValueError()

except ValueError:
    raise ValueError("Job key parts should be integers")
return JobKey(*parts)
return JobKey(*map(str, parts))
@chekunkov (Contributor) commented:

This change is not backwards compatible

@chekunkov (Contributor) commented:

Ah, seems like this entire file has never been merged to master, so we're probably good here... Correct me if I'm wrong.

@vshlapakov (Contributor, Author) commented:

No, you're right, it's a new module, we should be good

@pytest.fixture(scope='session')
def client():
    return ScrapinghubClient(auth=TEST_ADMIN_AUTH,
    return ScrapinghubClient(apikey=TEST_ADMIN_AUTH,
@chekunkov (Contributor) commented:

@vshlapakov just noticed you replaced auth with apikey. Note that in some cases we pass a colon-separated auth string to the HubstorageClient; do you think this will never happen here?

https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy/blob/9e5c4f79f4fa2442cedf7e13185c7087961f6b46/sh_scrapy/env.py#L53-L53

@vshlapakov (Contributor, Author) commented:

Sorry for not being clear enough about the change, that's my bad.

Right, I think it will never happen; my arguments are the following:

  • Connection works only with an apikey, while HubstorageClient logic is more flexible: it also accepts (user, pass) as a tuple or as a colon-separated auth string, so apikey looks like a common denominator for both clients, simple and powerful enough;
  • apikey looks like an enhancement over the user/pass pair, and since the latter approach was dropped from Connection at some point, I don't think we're going to revisit it in the near future.

Anyway, let's discuss it if you have something in mind.

@chekunkov (Contributor) commented:

my point is the following: users can add the new client to their requirements and use it in Scrapy Cloud. When they do so, they would probably use auth from the environment, and that's where they have to pass the decoded job_key:JWT_token pair. They don't have an option to pass an apikey, and if they pass that pair as an apikey, their requests to the dash endpoint will fail with 401, because Connection doesn't expect such input and uses the apikey parameter as the "user" part of the auth tuple.

@vshlapakov (Contributor, Author) commented:

@chekunkov hm, that's true, but Dash doesn't support authorisation via job_key:JWT_token. Are we going to add it for consistency? Or do you propose leaving it up to the user, but documenting somewhere which endpoints support jobkey-based auth? If the latter, should we check for it and print a warning when instantiating Connection?

@chekunkov (Contributor) commented:

but Dash doesn't support authorisation via job_key:JWT_token, are we going to add it for consistency?

true, we already discussed this for one endpoint in Dash; we may expand it to a bigger scope, both in terms of supported endpoints and entities, e.g. we could authorize JWT tokens to operate on a project level.

but my original question was more about semantics. The apikey scope is kinda narrow, while in reality we use other ways to authenticate clients, which are passed in the same way, so a more general auth is more suitable in my opinion.

as for handling jwt tokens - I'd expect the new client to be able to consume hex-encoded jwt tokens as they are stored in the environment, and to handle decoding and splitting them into a (user, password) auth tuple internally.

@vshlapakov (Contributor, Author) commented:

I see, makes sense to me. Let's do what you're proposing 👌

@vshlapakov (Contributor, Author) commented:

Added, please review the changes.

Now it works in the following manner (I added a warning when using a user/pass pair in the legacy Connection): with the SHUB_JOBAUTH env var set, a client can be instantiated with:

c = ScrapinghubClient(os.environ['SHUB_JOBAUTH'])
UserWarning: A lot of endpoints support authentication only via apikey.

After that you can use the client as usual, but it has access only to job data for now.

if not isinstance(auth, string_types):
    auth = auth.decode('ascii')
except binascii.Error:
    pass
@chekunkov (Contributor) commented:

@vshlapakov did you try it with a real apikey? It won't work correctly, because a real apikey consists of valid hexadecimal characters. You should try to decode the apikey, split it into parts, and check that the username part is a valid jobkey and the password part is not empty; only after that can you assume it's a JWT token and use it for auth.

another option is to check the apikey length, but this wouldn't allow us to change the apikey length in the future without forcing users to upgrade

@vshlapakov (Contributor, Author) commented:

You're right, I focused on jwt tokens yesterday and completely missed that the apikey is also hex-encoded 🤦‍♂️ I tried to address the issues in the following commits; please check when you have some time, corresponding tests are included.

self._hsclient = HubstorageClient(auth=apikey, **kwargs)
auth = parse_auth(auth)
connection_kwargs = {'apikey': auth, 'url': dash_endpoint}
if len(auth) == 2:
@chekunkov (Contributor) commented:

parse_auth should raise ValueError if the auth tuple cannot be extracted or its len != 2. This way you'll always be able to unpack it into a username/apikey part and a password part and pass them to target functions directly.

if not isinstance(auth[0], string_types):
    raise ValueError("Login must be of a string type")
if not (auth[1] is None or isinstance(auth[1], string_types)):
    raise ValueError("Password must be None or of a string type")
@chekunkov (Contributor) commented on Feb 22, 2017:

let's keep it simple: both login and password can be strings only; if the password isn't set, it's an empty string


def test_parse_auth_apikey():
    test_key = u'\xe3\x98\xb8\xe6\x91\x84\xe9'
    apikey = encode(test_key.encode('utf8'), 'hex_codec').decode('ascii')
@chekunkov (Contributor) commented:

wat?

@vshlapakov (Contributor, Author) commented:

I tried to show that an apikey is just a utf8 string encoded with hex: all the conversions come from the fact that codecs.encode accepts bytes and returns bytes. But yeah, now I see it's not clear; let's just test with a pre-generated test apikey.



def test_parse_auth_simple():
    assert parse_auth('user:pass') == ('user', 'pass')
@chekunkov (Contributor) commented:

you already have doctests in the parse_auth function, why not simply enable them?

@vshlapakov (Contributor, Author) commented:

Right, will do

login, _, password = auth.partition(':')
if not password:
    raise ValueError("Bad apikey, please check your credentials")
return (login, password)
@chekunkov (Contributor) commented on Feb 22, 2017:

honestly, it's not super clear to me how you detect apikey vs jwt here. Both apikey and jwt pass this block without exception in my local tests. Can you split the process into 2 explicit parts: a function that tries to extract the jobkey and jwt from the string, and if that fails, assume it's an apikey and validate it?

@vshlapakov (Contributor, Author) commented on Feb 22, 2017:

both apikey and jwt in my local tests pass this block without exception

That's right, both apikey and jwt are hex-encoded strings, so the value is decoded with codecs.decode(auth, 'hex_codec') above. If that raises binascii.Error or TypeError, we treat the value as a user:password string, and if there's no password part, we assume the apikey wasn't correct.

Ok, let's try to split it to simplify the logic.
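
A minimal sketch of the suggested split, assuming jobkeys look like '123/1/2' and that both jobkey:jwt pairs and apikeys arrive as hex-decodable strings (the helper name _search_for_jwt_credentials is illustrative, not the final implementation):

import binascii
import codecs
import re


def _search_for_jwt_credentials(auth):
    """Try to extract a (jobkey, jwt_token) pair from a hex-encoded string."""
    try:
        decoded = codecs.decode(auth.encode('ascii'), 'hex_codec')
        login, _, password = decoded.decode('ascii').partition(':')
    except (binascii.Error, TypeError, UnicodeError):
        return None
    # the username part must be a valid jobkey and the password non-empty
    if password and re.match(r'^\d+/\d+/\d+$', login):
        return (login, password)
    return None


def parse_auth(auth):
    """Return a (login, password) auth tuple, falling back to apikey."""
    jwt_auth = _search_for_jwt_credentials(auth)
    if jwt_auth:
        return jwt_auth
    login, _, password = auth.partition(':')
    if not login:
        raise ValueError("Bad apikey, please check your credentials")
    return (login, password)

Given a SHUB_JOBAUTH value that is the hex-encoded form of '123/1/2:sometoken', parse_auth returns ('123/1/2', 'sometoken'); a plain apikey fails the jobkey check and falls through to the apikey branch with an empty password.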

@chekunkov (Contributor) commented:

both apikey and jwt are hex-encoded strings

@vshlapakov probably we shouldn't have such expectations for the apikey; that may change in the future.

pytest.ini (outdated):
@@ -0,0 +1,2 @@
[pytest]
addopts = --doctest-modules --doctest-glob='scrapinghub/*.py'
@chekunkov (Contributor) commented:

@vshlapakov you either need --doctest-modules OR --doctest-glob='scrapinghub/*.py' (I think the latter)

raise ValueError("spider_args should be a dictionary")
cleaned_args = {k: v for k, v in spider_args.items()
if k not in params}
params.update(cleaned_args)
@chekunkov (Contributor) commented:

@vshlapakov please update docs - spider_args should be the only recommended way to pass spider arguments
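
As a side note, a standalone sketch of how the spider_args cleaning shown above behaves; the wrapper name _build_schedule_params is hypothetical, only the body mirrors the diff:

def _build_schedule_params(spider_args, **params):
    """Merge user-supplied spider arguments into the request params.

    Params already set by the client take precedence over spider
    arguments with the same name.
    """
    if not isinstance(spider_args, dict):
        raise ValueError("spider_args should be a dictionary")
    cleaned_args = {k: v for k, v in spider_args.items()
                    if k not in params}
    params.update(cleaned_args)
    return params

For example, _build_schedule_params({'arg1': 'val', 'units': 2}, units=1) keeps units=1 and merges only arg1.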

@chekunkov merged commit 58f3b1c into sc1467-1 on Feb 22, 2017
@chekunkov deleted the sc1467-fixes branch on February 22, 2017 16:49