Save items data in df #75
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master      #75     +/-   ##
==========================================
+ Coverage   65.47%   66.08%    +0.6%
==========================================
  Files          24       24
  Lines        1596     1592       -4
  Branches      278      274       -4
==========================================
+ Hits         1045     1052       +7
+ Misses        527      515      -12
- Partials       24       25       +1
Continue to review full report at Codecov.
@@ -147,12 +148,6 @@ def data_quality_report(self, bucket: Optional[str] = None):
            raise ValueError("Collections are not supported")
        if not self.schema:
            raise ValueError("Schema is empty")
        if not self.report.results:
This should be dealt with in data_quality_report.py
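A minimal sketch of what that could look like, assuming data_quality_report.py exposes a class that receives the validation results (the class and attribute names below are assumptions, not the actual arche API):

```python
# Hypothetical sketch only: class and attribute names are assumptions,
# not the actual arche API.
class DataQualityReport:
    def __init__(self, df, report):
        if not report.results:
            # Guard moved here from the caller, per this review comment.
            raise ValueError("Nothing to report: the data was not validated")
        self.df = df
        self.report = report
```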
👌
This is a pull request to prepare some work for #69
I am getting rid of the dict, so it won't slow down implementing the new df API. The changes are big (I also included a bit that is not related to this PR), but I will comment on some code here to help you understand.
Feel free to skip a review if you find it too complex :)
I use itertuples() to iterate.
json_val_df_dict.ipynb.zip
You can launch them from here https://mybinder.org/v2/gh/scrapinghub/jupyterhub-stacks/master?filepath=arche-notebook
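For context, here is a minimal sketch of row-by-row validation with itertuples() and jsonschema; the validate_df function and its error bookkeeping are illustrative assumptions, not the code in this PR:

```python
import jsonschema
import pandas as pd

def validate_df(df: pd.DataFrame, schema: dict) -> dict:
    """Validate every row of a DataFrame against a JSON schema.

    Returns a mapping of error message -> list of row indexes.
    """
    validator = jsonschema.Draft7Validator(schema)
    errors: dict = {}
    # itertuples() yields plain namedtuples, which is much faster than
    # iterrows(), which builds a Series for every row.
    for row in df.itertuples():
        item = row._asdict()
        index = item.pop("Index")
        for error in validator.iter_errors(item):
            errors.setdefault(error.message, []).append(index)
    return errors
```

Note that rows with NaN placeholders for missing fields would need those keys dropped before checking `required`; the actual implementation may handle that differently.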
P.S. The data could be validated all at once, without any iteration, but
jsonschema
is awfully slow for that, and it would require creating different schemas, which, in turn, would make them incompatible with the current spidermon validation. So, at this point, I don't see this bottleneck as critical to address. It shouldn't be slower than it is now anyway.