Improvements and fixes for new python client #48
Conversation
Force-pushed from 53bc463 to 2c08116
Force-pushed from 2c08116 to 459c928
To provide better support for the filter param across all classes that have HS-based iter* methods, we have to simplify it and make the logic reusable. The changes also contain minor renaming for cleaner code.
scrapinghub/utils.py
Outdated
filter_data = []
for elem in params.pop('filter'):
    if not isinstance(elem, (list, tuple)):
        raise ValueError("Filter condition must be tuple or list")
I think we still should accept strings as an input format
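For illustration, a minimal sketch of how the normalization could accept plain strings as well as lists/tuples (the function name and its placement are assumptions, not the actual implementation in this PR):

```python
import json


def normalize_filter_param(params):
    """Sketch: normalize params['filter'] in place.

    Each condition may be a pre-serialized string, or a list/tuple that
    gets JSON-encoded; anything else is rejected.
    """
    filter_data = []
    for elem in params.pop('filter', []):
        if isinstance(elem, str):
            # accept strings as-is (already serialized conditions)
            filter_data.append(elem)
        elif isinstance(elem, (list, tuple)):
            filter_data.append(json.dumps(elem))
        else:
            raise ValueError("Filter condition must be string, tuple or list")
    if filter_data:
        params['filter'] = filter_data
    return params
```

e.g. normalize_filter_param({'filter': [('message', 'contains', ['logger'])]}) would produce {'filter': ['["message", "contains", ["logger"]]']}.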
scrapinghub/utils.py
Outdated
for elem in params.pop('filter'):
    if not isinstance(elem, (list, tuple)):
        raise ValueError("Filter condition must be tuple or list")
    filter_data.append(json.dumps(elem))
elem len is validated on the server side, right?
Yes, examples:
InvalidUsage: unable to parse filter parameter: Match spec must contain 3 values
InvalidUsage: unable to parse filter parameter: incorrect arguments for operator isnotempty
scrapinghub/client.py
Outdated
<scrapinghub.client.Spider at 0x106ee3748>
>>> spider.id
>>> spider.key
2
project.key = project id
job.key = '123/1/2'
spider.key == 2?
I've updated all the docstrings with the following model (68f0428):
- projects 123 and 456
- project 123 has 2 spiders: (1, spider1) and (2, spider2)
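In doctest form, the key model from that fixture looks like this (illustrative values, consistent with the example above):

```python
>>> project.key
'123'
>>> spider.key
'123/1'
>>> job.key
'123/1/2'
```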
scrapinghub/client.py
Outdated
def __init__(self, client, projectid):
    self.id = projectid
    self.key = projectid
job.key is a string, shouldn't project key be a string as well?
@vshlapakov don't forget to ensure that the key is a str (self.key = str(projectid)),
and add a line with the >>> project.key output
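A minimal sketch of that suggestion (class body trimmed to the key handling; the docstring usage line is illustrative):

```python
class Project(object):
    """Sketch of the Project key handling.

    Usage::

        >>> project = client.get_project(123)
        >>> project.key
        '123'
    """

    def __init__(self, client, projectid):
        self._client = client
        # keep the key a string, consistent with job.key
        self.key = str(projectid)
```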
scrapinghub/client.py
Outdated
- retrieve logs with a given log level and filter by a word
>>> filters = [("message", "contains", ["logger"])]
>>> list(job.logs.iter(level='WARNING', filter=filters))
I know I asked that before, but maybe we should have an entity.list() method instead of the list(entity.iter()) pattern?
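If a list() shortcut is added, a sketch of the pattern could be as simple as wrapping the existing iter() (class and method shapes here are illustrative):

```python
class _IterableResource(object):
    """Illustrative base showing the list()-wraps-iter() idea."""

    def iter(self, **params):
        # the real client would stream entries from the API here
        raise NotImplementedError

    def list(self, **params):
        # convenience: job.logs.list(...) instead of list(job.logs.iter(...))
        return list(self.iter(**params))
```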
    raise ValueError("key cannot be None")
return self._origin.get(key, *args, **kwargs)


def set(self, *args, **kwargs):
hm, this method is still mentioned in the proxy_methods(self._origin, self, [...]) call https://github.com/scrapinghub/python-scrapinghub/pull/48/files#diff-797cd9f74e9304996bef041260c04262R1044
Fix spider id in docstrings to keep consistency
Don't proxy methods overwritten in Collection class
scrapinghub/client.py
Outdated
>>> spider.id
2
>>> spider.key
'1'
spider.id is 1 (or '1', whatever), but spider.key should be '123/1'
Ah, I misread your comment, sure
scrapinghub/client.py
Outdated
>>> job.update_tags(add=['consumed'])
"""
if not (add or remove):
    raise ValueError('Please provide tags to add or remove')
just in case, shouldn't it be handled by server side validation?
The server responds with 200 for now; we may change that, but what's the point of doing an empty request? Of course, it's not the only place where empty requests are allowed, but I think an empty update_tags call can also be wrongly interpreted, as if we wanted to update a local cache or something else.
but if it's empty - shouldn't api server return 400?
Well, it makes sense, I don't mind adding it 👌 But I still don't see any harm in having client-side validation for it, there's no point in doing an update() call without data - do you think it's redundant?
yeah, redundancy is my main concern. when data is validated server side - you only need to change it in one place later
Ok, convinced, I'll create a ticket for server-side changes and drop it from here
return
path = 'v2/projects/{}/spiders/{}/tags'.format(self.projectid, self.id)
url = urljoin(self._client._connection.url, path)
response = self._client._connection._session.patch(url, json=params)
Agree. Do we want to expose v2 api though?
scrapinghub/client.py
Outdated
>>> project.spiders.get('spider2')
>>> spider = project.spiders.get('spider1')
<scrapinghub.client.Spider at 0x106ee3748>
different from python interpreter output
>>> spider = project.spiders.get('localinfo')
>>> spider
<scrapinghub.client.Spider object at 0x10d6cb1d0>
I think you can simply remove <scrapinghub.client.Spider at 0x106ee3748>
scrapinghub/client.py
Outdated
self.projectid = projectid
self.id = spiderid
self.key = '{}/{}'.format(str(projectid), str(spiderid))
self.id = str(spiderid)
I find the co-existence of id and key in the Spider object a little bit confusing. Do we need them both? If so, why don't the project and job objects have both key and id? If id is needed only for internal use (e.g. to construct queries) - maybe we can make it private and remove it from the docs? If we expect users to use it - let's keep it public, but let's also be consistent.
Spider id is needed only for internal use like you mentioned, to avoid splitting the key all the time to construct queries and do some checks. Let's better make it private and remove from the docs, it makes sense to have it only for spider.
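A sketch of the agreed shape, with the numeric spider id kept private (attribute names follow the quoted diff; the class is otherwise trimmed):

```python
class Spider(object):
    def __init__(self, client, projectid, spiderid):
        self._client = client
        self.projectid = projectid
        self.key = '{}/{}'.format(projectid, spiderid)
        # internal use only: saves re-splitting the key when building queries
        self._id = str(spiderid)
```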
scrapinghub/client.py
Outdated
    return self.get('vcs', colname)


def list(self):
    return list(self._origin.apiget('list'))
let's add iter() for consistency?
the same for other classes that have list() method, but don't have iter() yet, e.g. Projects
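A sketch of adding iter() next to the existing list(), based on the quoted apiget('list') call (the class wrapper is only there to make the snippet self-contained):

```python
class _CollectionsSketch(object):
    def __init__(self, origin):
        self._origin = origin

    def iter(self):
        # added for consistency with the other iter*-style methods
        return iter(self._origin.apiget('list'))

    def list(self):
        return list(self.iter())
```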
scrapinghub/client.py
Outdated
| """ | ||
| self._origin.set(*args, **kwargs) | ||
|
|
||
| def delete(self, _keys): |
why _keys? keys is not reserved or builtin
Renamed (initially this was done for consistency with the original method, but that's not a real argument)
scrapinghub/client.py
Outdated
| """ | ||
| if (not isinstance(_keys, string_types) and | ||
| not(isinstance(_keys, (list, tuple)) and | ||
| all(isinstance(key, string_types) for key in _keys))): |
if not isinstance(keys, (list, tuple)):
    keys = [keys]
if not all(isinstance(key, string_types) for key in keys):
    raise ValueError()
except ValueError:
    raise ValueError("Job key parts should be integers")
return JobKey(*parts)
return JobKey(*map(str, parts))
This change is not backwards compatible
Ah, seems like this entire file is something that has never been merged to master, so probably we are good here... Correct me if I'm wrong
No, you're right, it's a new module, we should be good
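For reference, the behaviour discussed above could look roughly like this as a self-contained sketch (the namedtuple field names are assumptions):

```python
from collections import namedtuple

JobKey = namedtuple('JobKey', ['projectid', 'spiderid', 'jobid'])


def parse_job_key(jobkey):
    """Sketch: validate a '123/1/2'-style key and return string parts."""
    parts = str(jobkey).split('/')
    if len(parts) != 3:
        raise ValueError("Job key must have 3 parts: projectid/spiderid/jobid")
    try:
        [int(part) for part in parts]
    except ValueError:
        raise ValueError("Job key parts should be integers")
    # keep the parts as strings, matching the string job.key convention
    return JobKey(*map(str, parts))
```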
tests/client/conftest.py
Outdated
@pytest.fixture(scope='session')
def client():
    return ScrapinghubClient(auth=TEST_ADMIN_AUTH,
    return ScrapinghubClient(apikey=TEST_ADMIN_AUTH,
@vshlapakov just noticed you replaced auth with apikey. note that in some cases we pass colon separated auth string to the HubstorageClient, do you think this will never happen here?
Sorry for not being clear enough about the change, that's my bad.
Right, I think it will never happen; my arguments are the following:
- Connection works only with an apikey, while HubstorageClient logic is more flexible: it also accepts (user, pass) as a tuple, or as a colon-separated auth string, so apikey looks like a common denominator for both clients, simple and powerful enough;
- apikey looks like an enhancement over the user/pass pair, and as the latter approach was dropped from Connection at some point, I think we're not going to revise it in the nearest future.
Anyway, let's revisit it if you have something in mind.
my point is the following - users can add the new client to their requirements and use it in Scrapy Cloud. When they do so, they would probably use auth from the environment - and that's where they have to pass the decoded job_key:JWT_token pair. They don't have an option to pass an apikey, and if they pass that pair as an apikey, their requests to the dash endpoint will fail with 401, because Connection doesn't expect such input and uses the apikey parameter as the "user" part of the auth tuple
@chekunkov hm, that's true, but Dash doesn't support authorisation via job_key:JWT_token - are we going to add it for consistency? Or do you propose to leave it up to the user and document somewhere which endpoints support jobkey auth? If the latter, should we check for it and print a warning when instantiating Connection?
but Dash doesn't support authorisation via job_key:JWT_token, are we going to add it for consistency?
true, we're already discussing this for one endpoint in Dash; we may expand it to a bigger scope, both in terms of supported endpoints and entities, e.g. we can authorize JWT tokens to operate on a project level.
but my original question was more about semantics. apikey scope is kinda narrow, while in reality we use other ways to authenticate clients, which are passed in the same way, so more general auth is more suitable in my opinion.
as for handling jwt tokens - I'd expect new client to be able to consume hex encoded jwt tokens as they are stored in environment and handle decoding and splitting it into (user, password) auth tuple internally
I see, makes sense to me. Let's do what you're proposing 👌
Added, please review the changes.
Now it works in the following manner (I added a warning when using a user/pass pair in the legacy Connection): with a given SHUB_JOBAUTH env var, a client can be instantiated with:
c = ScrapinghubClient(os.environ['SHUB_JOBAUTH'])
UserWarning: A lot of endpoints support authentication only via apikey.
After that you can use the client as usual, but it only has access to job data for now.
    if not isinstance(auth, string_types):
        auth = auth.decode('ascii')
except binascii.Error:
    pass
@vshlapakov did you try it with a real apikey? It won't work correctly because a real apikey consists of valid hexadecimal characters. You should try to decode the apikey, split it into parts, check that the username part is a valid jobkey and the password part is not empty - and only after that can you assume that this is jwt token auth and can be used for auth.
Another option is to check the apikey length, but this wouldn't allow us to change the apikey length in the future without forcing users to upgrade
You're right, I focused on jwt tokens yesterday and completely missed that apikey is also hex-encoded 🤦♂️ I tried to address the issues in the following commits, check when you have some time pls, corresponding tests are included.
scrapinghub/client.py
Outdated
self._hsclient = HubstorageClient(auth=apikey, **kwargs)
auth = parse_auth(auth)
connection_kwargs = {'apikey': auth, 'url': dash_endpoint}
if len(auth) == 2:
parse_auth should raise ValueError if auth tuple cannot be extracted or its len != 2. this way you'll always be able to unpack it into username/apikey and password part and pass them to target functions directly.
Force-pushed from 8c9a047 to cafc15a
scrapinghub/utils.py
Outdated
if not isinstance(auth[0], string_types):
    raise ValueError("Login must be of a string type")
if not (auth[1] is None or isinstance(auth[1], string_types)):
    raise ValueError("Password must be None or of a string type")
let's keep it simple - both login and password can be strings only, if password isn't set it's empty string
tests/client/test_utils.py
Outdated
def test_parse_auth_apikey():
    test_key = u'\xe3\x98\xb8\xe6\x91\x84\xe9'
    apikey = encode(test_key.encode('utf8'), 'hex_codec').decode('ascii')
wat?
I tried to show that an apikey is just a utf8 string encoded with hex: all the conversions are there because codecs.encode accepts bytes and returns bytes. But yeah, now I see it's not clear; let's just test with a pre-generated test apikey
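A simplified version of that test, assuming parse_auth lives in scrapinghub.utils and maps a bare apikey to an (apikey, '') tuple, per the empty-password convention mentioned above:

```python
from scrapinghub.utils import parse_auth  # assumed import path


def test_parse_auth_apikey():
    # pre-generated, hex-looking test apikey (not a real key)
    apikey = '1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d'
    assert parse_auth(apikey) == (apikey, '')
```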
def test_parse_auth_simple():
    assert parse_auth('user:pass') == ('user', 'pass')
you already have doctests in the parse_auth function, why not simply enable them?
Right, will do
scrapinghub/utils.py
Outdated
login, _, password = auth.partition(':')
if not password:
    raise ValueError("Bad apikey, please check your credentials")
return (login, password)
honestly, it's not super clear to me how you detect apikey vs jwt here? both apikey and jwt in my local tests pass this block without exception. can you split the process into 2 explicit parts - a function that tries to extract the jobkey and jwt from the string, and if it fails - assume this is an apikey and validate it.
both apikey and jwt in my local tests pass this block without exception
That's right, both apikey and jwt are hex-encoded strings, so the value should be decoded with codecs.decode(auth, 'hex_codec') above. If there's a binascii.Error or TypeError we consider it a user:password string, but if there's no password - we assume the apikey wasn't correct.
Ok, let's try to split it to simplify the logic.
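A sketch of that two-step split, under the assumptions discussed here (helper names are made up): first try to read the string as a hex-encoded jobkey:token pair, and only if that fails treat it as an apikey or user:password string.

```python
import binascii
import codecs


def _search_for_jobauth(auth):
    """Try to read auth as a hex-encoded 'jobkey:token' pair.

    Returns (jobkey, token), or None if auth does not look like one.
    """
    try:
        decoded = codecs.decode(auth.encode('ascii'), 'hex_codec').decode('ascii')
    except (binascii.Error, UnicodeError, ValueError):
        return None
    jobkey, _, token = decoded.partition(':')
    parts = jobkey.split('/')
    # the "user" part must look like a jobkey (e.g. 123/1/2), token must be set
    if token and len(parts) == 3 and all(p.isdigit() for p in parts):
        return (jobkey, token)
    return None


def parse_auth(auth):
    """Sketch: normalize auth input into a (login, password) tuple."""
    if isinstance(auth, tuple):
        if len(auth) != 2:
            raise ValueError("Wrong number of authentication parts")
        return auth
    jobauth = _search_for_jobauth(auth)
    if jobauth is not None:
        return jobauth
    # fall back: plain apikey or a 'user:password' string
    login, _, password = auth.partition(':')
    return (login, password)
```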
both apikey and jwt are hex-encoded strings
@vshlapakov probably we shouldn't have such expectations for apikey, that may change in future.
Force-pushed from 103068b to f611720
pytest.ini
Outdated
@@ -0,0 +1,2 @@
[pytest]
addopts = --doctest-modules --doctest-glob='scrapinghub/*.py'
@vshlapakov you either need --doctest-modules OR --doctest-glob='scrapinghub/*.py' (I think the latter)
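For what it's worth, if the goal is to run the doctests embedded in the scrapinghub/*.py modules, my understanding is that --doctest-modules is the option that collects doctests from .py files, while --doctest-glob matches text files (default *.txt), so a minimal ini sketch would be:

```ini
[pytest]
addopts = --doctest-modules
```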
Force-pushed from 8f9a458 to 1f98bb9
    raise ValueError("spider_args should be a dictionary")
cleaned_args = {k: v for k, v in spider_args.items()
                if k not in params}
params.update(cleaned_args)
@vshlapakov please update docs - spider_args should be the only recommended way to pass spider arguments
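To illustrate, a self-contained sketch of the merging logic from the quoted diff (the helper name is made up): explicitly passed params win, and spider arguments are only supposed to go through spider_args.

```python
def merge_spider_args(spider_args, **params):
    """Sketch of the cleanup from the diff above."""
    if not isinstance(spider_args, dict):
        raise ValueError("spider_args should be a dictionary")
    # drop spider args that would shadow explicit scheduling params
    cleaned_args = {k: v for k, v in spider_args.items() if k not in params}
    params.update(cleaned_args)
    return params


# prints {'units': 1, 'start_url': 'http://example.com'}
print(merge_spider_args({'start_url': 'http://example.com', 'units': 5}, units=1))
```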
The PR is created to cover the following minor things found after testing:
- project.id / spider.id / job.key (use key or id everywhere)
- filter param
- InvalidUsage or ScrapinghubAPIError

Other issues were added after the next step of testing on February 8th.

Also, there's a new spider.update_tags method that will be added on the server side soon. Each of the points will be implemented in separate commit(s) to simplify reviewing.