Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_key and _type #106

Merged
merged 11 commits into from
Jun 12, 2019
Merged

_key and _type #106

merged 11 commits into from
Jun 12, 2019

Conversation

manycoding
Copy link
Contributor

@manycoding manycoding commented Jun 6, 2019

This pr makes schema validation a bit slower, but it's constant and I think not huge.
before, 50k items a.glance()
335 ms ± 39.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
this pr
350 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Some incentives are given in #104 . The main thing is that we keep Spidermon and Arche schemas more consistent by dropping these keys, and it's 2 fields less to describe in a schema.
Also it kind of nicer since we don't have to include _key in pandas selectors anymore, index is always there.

Now df looks like:
Screenshot 2019-06-11 at 17 40 47

One thing to note is that raw is modified after the first pass.

@manycoding manycoding changed the title Key _key Jun 6, 2019
@codecov
Copy link

codecov bot commented Jun 6, 2019

Codecov Report

❗ No coverage uploaded for pull request base (0.3.6dev@bc42578). Click here to learn what that means.
The diff coverage is 88.63%.

Impacted file tree graph

@@             Coverage Diff             @@
##             0.3.6dev     #106   +/-   ##
===========================================
  Coverage            ?   78.97%           
===========================================
  Files               ?       23           
  Lines               ?     1598           
  Branches            ?      277           
===========================================
  Hits                ?     1262           
  Misses              ?      290           
  Partials            ?       46
Impacted Files Coverage Δ
src/arche/data_quality_report.py 78.72% <ø> (ø)
src/arche/tools/json_schema_validator.py 96.07% <100%> (ø)
src/arche/rules/others.py 100% <100%> (ø)
src/arche/rules/duplicates.py 100% <100%> (ø)
src/arche/tools/schema.py 78% <100%> (ø)
src/arche/readers/items.py 85.24% <100%> (ø)
src/arche/rules/json_schema.py 100% <100%> (ø)
src/arche/report.py 97.72% <100%> (ø)
src/arche/rules/price.py 58% <50%> (ø)
src/arche/arche.py 85.21% <66.66%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc42578...f52bca1. Read the comment docs.

Copy link

@raphapassini raphapassini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
One thing is that looking at the tests seems that we are always append the SH_URL to the item index, this is True? If so, I think can be a good idea to have this parametrized somehow for people who want to run Arche unrelated to Scrapy Cloud items.

@manycoding
Copy link
Contributor Author

One thing is that looking at the tests seems that we are always append the SH_URL to the item index, this is True?

No, index depends on how a user gets the data. If from Cloud, url will be there, if by other means - it's up to a user to create his own index.

@manycoding manycoding changed the title _key _key and _type Jun 12, 2019
@manycoding manycoding merged commit 32927ac into 0.3.6dev Jun 12, 2019
@manycoding manycoding deleted the _key branch June 12, 2019 22:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants