_key and _type #106
Codecov Report
@@ Coverage Diff @@
## 0.3.6dev #106 +/- ##
===========================================
Coverage ? 78.97%
===========================================
Files ? 23
Lines ? 1598
Branches ? 277
===========================================
Hits ? 1262
Misses ? 290
Partials ? 46
LGTM.
One thing: looking at the tests, it seems that we always append the SH_URL to the item index. Is this true? If so, I think it would be a good idea to have this parametrized somehow for people who want to run Arche on data unrelated to Scrapy Cloud items.
No, the index depends on how a user gets the data. If it comes from Scrapy Cloud, the URL will be there; if it comes by other means, it's up to the user to create their own index.
Refactor validator to func
This PR makes schema validation a bit slower, but the overhead is constant and, I think, not huge.
Before, 50k items:
a.glance()
335 ms ± 39.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This PR:
350 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Some of the motivation is given in #104. The main thing is that we keep Spidermon and Arche schemas more consistent by dropping these keys, and it's two fewer fields to describe in a schema.
It's also kind of nicer since we don't have to include `_key` in pandas selectors anymore; the index is always there. Now the df looks like:
![Screenshot 2019-06-11 at 17 40 47](https://user-images.githubusercontent.com/10396557/59308854-14850b00-8c70-11e9-8482-57a3688202c9.png)
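To illustrate the idea of keeping the key in the index rather than in a `_key` column, here is a minimal sketch. It is not Arche's actual code: `to_df`, `BASE_URL`, and the sample items are hypothetical names and data made up for the example.

```python
import pandas as pd

# Hypothetical URL prefix and items shaped like Scrapy Cloud output.
BASE_URL = "https://example.com/items/"
raw = [
    {"_key": "112358/13/21/0", "_type": "Product", "name": "Book", "price": 10.0},
    {"_key": "112358/13/21/1", "_type": "Product", "name": "Pen", "price": 1.5},
]


def to_df(items, base_url=BASE_URL):
    """Build a DataFrame indexed by item URL, without `_key` and `_type` columns."""
    df = pd.DataFrame(items)
    # Move the key into the index so it no longer has to be selected explicitly.
    df.index = base_url + df["_key"]
    df.index.name = None
    return df.drop(columns=["_key", "_type"])


df = to_df(raw)
# The selector mentions only data columns; the index travels with the result.
cheap = df[df["price"] < 5]["name"]
```

For data that does not come from Scrapy Cloud, the same shape works with any index the user chooses, for example the default integer index.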
One thing to note is that `raw` is modified after the first pass.
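The PR does not say how `raw` is modified, so the following is only a sketch of the general hazard, assuming the metadata keys are stripped in place. `strip_meta` is a hypothetical helper, not part of Arche.

```python
import copy


def strip_meta(items):
    """Hypothetical helper: remove `_key` and `_type` from each item in place."""
    for item in items:
        item.pop("_key", None)
        item.pop("_type", None)
    return items


raw = [{"_key": "0", "_type": "Product", "name": "Book"}]
snapshot = copy.deepcopy(raw)  # untouched copy taken before the first pass

strip_meta(raw)  # after this call, `raw` itself no longer has the metadata keys
```

Anyone who needs the original items after the first pass would have to take such a copy beforehand.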