Save items data in df #75
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master      #75     +/-   ##
==========================================
+ Coverage   65.47%   66.08%    +0.6%
==========================================
  Files          24       24
  Lines        1596     1592       -4
  Branches      278      274       -4
==========================================
+ Hits         1045     1052       +7
+ Misses        527      515      -12
- Partials       24       25       +1
Continue to review full report at Codecov.
@@ -147,12 +148,6 @@ def data_quality_report(self, bucket: Optional[str] = None):
            raise ValueError("Collections are not supported")
        if not self.schema:
            raise ValueError("Schema is empty")
        if not self.report.results:
This should be dealt with in data_quality_report.py
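A minimal sketch of what that could look like, assuming data_quality_report.py exposes a class that receives the validation results (the class and attribute names below are assumptions, not the actual arche API):

```python
# Hypothetical sketch only: class and attribute names are assumptions,
# not the actual arche API.
class DataQualityReport:
    def __init__(self, df, report):
        if not report.results:
            # Guard moved here from the caller, per this review comment.
            raise ValueError("Nothing to report: the data was not validated")
        self.df = df
        self.report = report
```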
👌
This is a pull request to prepare some work for #69
I am getting rid of the dict, so it won't slow down implementing the new df API. The changes are big (I also included a bit that is not related to this PR), but I will comment on some code here to help you understand.
Feel free to skip a review if you find it too complex :)
I use itertuples() to iterate.
json_val_df_dict.ipynb.zip
You can launch them from here https://mybinder.org/v2/gh/scrapinghub/jupyterhub-stacks/master?filepath=arche-notebook
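For context, here is a minimal sketch of row-by-row validation with itertuples() and jsonschema; the validate_df function and its error bookkeeping are illustrative assumptions, not the code in this PR:

```python
import jsonschema
import pandas as pd

def validate_df(df: pd.DataFrame, schema: dict) -> dict:
    """Validate every row of a DataFrame against a JSON schema.

    Returns a mapping of error message -> list of row indexes.
    """
    validator = jsonschema.Draft7Validator(schema)
    errors: dict = {}
    # itertuples() yields plain namedtuples, which is much faster than
    # iterrows(), which builds a Series for every row.
    for row in df.itertuples():
        item = row._asdict()
        index = item.pop("Index")
        for error in validator.iter_errors(item):
            errors.setdefault(error.message, []).append(index)
    return errors
```

Note that rows with NaN placeholders for missing fields would need those keys dropped before checking `required`; the actual implementation may handle that differently.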
P.S. The data could be validated all at once, without any iteration, but
jsonschema
is awfully slow for that, and it would require creating different schemas, which, in turn, would make them incompatible with the current spidermon validation. So, at this point, I don't see this bottleneck as critical to address. It shouldn't be slower than it is now anyway.