Releases · scrapinghub/arche

12 Jul 18:22

manycoding

v0.3.6

b1cbb6f

Categories

[0.3.6] (2019-07-12)

Added

Categories rule with a plot showing unique values and their counts per field. By default, report_all() only includes fields with 10 or fewer unique values. See https://arche.readthedocs.io/en/latest/nbs/Rules.html#Category-fields, #100
Category documentation
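
The selection described above can be illustrated with plain pandas. This is only a sketch of the idea, not Arche's implementation: the helper name category_counts and the sample data are hypothetical, and the threshold of 10 is the default quoted in the note.

```python
import pandas as pd

MAX_UNIQUE = 10  # default threshold quoted in the release note


def category_counts(df: pd.DataFrame, max_unique: int = MAX_UNIQUE) -> dict:
    """Return value counts for every column with at most `max_unique` unique values."""
    result = {}
    for column in df.columns:
        # Count NaN as a value too, so a sparse field is not silently treated as narrower
        if df[column].nunique(dropna=False) <= max_unique:
            result[column] = df[column].value_counts(dropna=False)
    return result


# Hypothetical sample data: "category" has 3 unique values, "id" has 4
df = pd.DataFrame({
    "category": ["a", "b", "a", "c"],
    "id": [1, 2, 3, 4],
})
counts = category_counts(df, max_unique=3)
```

With max_unique=3 only the category column qualifies, and counts["category"] holds the per-value counts that a bar plot could be drawn from.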

Changed

Arche.report_all() no longer shortens the report by default; added a short parameter.
Data is consistent with Dash and Spidermon: _type, _key fields are dropped from dataframe, raw data, basic schema, #104, #106
df.index now stores _key instead
basic_json_schema() works with deleted jobs
start is supported for Collections, #112
enum is counted as a category tag, #18
Garbage Symbols searches in str representation of nested fields instead of expanded df, #130
Show the real coverage difference (negative/positive) instead of the absolute one, #114
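
To picture the _type and _key change, here is a rough pandas sketch of moving _key into the index and dropping both meta fields from the dataframe and the raw data. The item keys and field values are made up for illustration, and the code is an assumption about the resulting shape, not Arche's own reader code.

```python
import pandas as pd

# Hypothetical items as they might arrive from a job, including meta fields
items = [
    {"_key": "112358/13/21/0", "_type": "Product", "name": "Book"},
    {"_key": "112358/13/21/1", "_type": "Product", "name": "Pen"},
]

# The dataframe keeps only the scraped fields; _key moves to the index
df = pd.DataFrame(items).set_index("_key").drop(columns=["_type"])

# Raw data without the meta fields, matching the dataframe columns
META_FIELDS = ("_type", "_key")
raw = [{k: v for k, v in item.items() if k not in META_FIELDS} for item in items]
```

Looking up an item by its key then becomes df.loc["112358/13/21/1"], while the columns contain only scraped fields.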

Fixed

Arche.glance(), #88
Item links in Schema validation errors, #89
Empty NAN bars on category graphs, #93
data_quality_report(), #95
Wrong number of Collection Items if it contains item 0, #112

Removed

Responses Per Item Ratio rule
Deprecated expand parameter and removed flat_df, since the Garbage Symbols rule deals with nested data itself, #133

Thanks - @ejulio @victor-torres @Gallaecio @alexander-matsievsky @ivankivanov @raphapassini @alexandr1988

14 May 20:29

manycoding

v0.3.5

91fab16

Data from iterables

[0.3.5] (2019-05-14)

Added

Arche() supports any iterables with item dicts, fixing jsonschema consistency, #83
Items.from_array to read raw data from iterables, #83

Changed

When reading from a pandas df directly, raw data is stored in a numpy array. See the pandas gotchas on integer NA support: http://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na
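
The gotcha referenced above is easy to reproduce: a missing value turns an integer column into floats. The sketch below uses made-up items and shows why keeping the original dicts in a numpy object array preserves the values. It illustrates the motivation only, not how Arche stores data internally.

```python
import numpy as np
import pandas as pd

# Hypothetical items: the second one has no "price" field
items = [{"price": 10}, {"title": "no price here"}]

df = pd.DataFrame(items)
# The missing price becomes NaN, so pandas upcasts the column to float64
# and the original integer 10 is now stored as 10.0
price_dtype = df["price"].dtype

# An object array keeps the original dicts, so the values stay untouched
raw = np.array(items, dtype=object)
original_price = raw[0]["price"]  # still a Python int
```

Validating df against a JSON schema that expects an integer price would fail on 10.0, whereas the raw array still carries the integer.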

06 May 18:32

manycoding

v0.3.4

ee589c9

0.3.4

[0.3.4] (2019-05-06)

Fixed

basic_json_schema() fails with long 1.0 types, #80

03 May 23:09

manycoding

v0.3.3

285cd21

Data from anywhere, 1 year release

[0.3.3] (2019-05-03)

Added

Accept dataframes as source or target, #69

Changed

data_quality_report plots the same "Fields Coverage" instead of green "Scraped Fields Coverage"
Plot theme changed from ggplot2 to seaborn, #62
Identical target and source now raise an error (previously a warning)
Passed rules marked with green PASSED.

Fixed

Online documentation now renders graphs https://arche.readthedocs.io/en/latest/, #41
Error colours are back in report_all().

Removed

Arche.basic_json_schema(), which was deprecated earlier; use basic_json_schema() instead
Removed Quickstart.md as redundant - documentation lives in notebooks

18 Apr 20:54

manycoding

v0.3.2

216118c

Raw schemas from repos

[0.3.2] (2019-04-18)

Added

Allow reading private raw schemas directly from bitbucket, #58

Changed

Progress widgets are removed before printing graphs
New plotly v4 API

Fixed

Failing Compare Prices For Same Urls when url is nan, #67
Empty graphs in Jupyter Notebook, #63

Removed

Scraped Items History graphs

12 Apr 23:49

manycoding

v0.3.0

401063e

More Graphs

[0.3.0] (2019-04-12)

Fixed

Big notebook size, replaced cufflinks with plotly and ipython, #39

Changed

Fields Coverage is now printed as a bar plot, #9
Fields Counts renamed to Coverage Difference and results in 2 bar plots, #9, #51:
- Coverage from job stats fields counts which reflects coverage for each field for both jobs
- Coverage difference more than 5% which prints >5% difference between the coverages (was ratio difference before)
Compare Scraped Categories renamed to Category Coverage Difference and results in 2 bar plots for each category, #52:
- Coverage for field which reflects value counts (categories) coverage for the field for both jobs
- Coverage difference more than 10% for field which shows >10% differences between the category coverages
Boolean Fields plots Coverage for boolean fields graph which reflects normalized value counts for boolean fields for both jobs, #53
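
As a rough illustration of the coverage difference described above, the following sketch computes per-field coverage for two jobs and keeps the differences above 5%. The function name, the input shape (field counts plus total item counts) and the signed result are assumptions made for this example, not Arche's actual code.

```python
import pandas as pd


def coverage_difference(source_counts, target_counts, source_total, target_total,
                        threshold=0.05):
    """Return signed per-field coverage differences larger than `threshold`."""
    source_cov = pd.Series(source_counts) / source_total
    target_cov = pd.Series(target_counts) / target_total
    # A field absent from one job counts as zero coverage there
    diff = source_cov.sub(target_cov, fill_value=0)
    return diff[diff.abs() > threshold]


# Hypothetical field counts for two jobs with 100 items each
source = {"name": 100, "price": 90}
target = {"name": 100, "price": 60, "url": 50}
result = coverage_difference(source, target, source_total=100, target_total=100)
```

Here name is dropped because its coverage is identical, price shows a positive difference of about 0.3, and url shows a negative one because it only exists in the target job.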

Removed

cufflinks dependency
category_field tag

26 Mar 19:42

manycoding

v2019.03.25

b947610

2019.03.25

Added

CHANGES.md

new arche.rules.duplicates.find_by() to find duplicates by chosen columns

import arche
from arche.readers.items import JobItems
df = JobItems(0, "235801/1/15").df
arche.rules.duplicates.find_by(df, ["title", "category"]).show()
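
For readers without access to a job, the idea of finding duplicates by chosen columns can be sketched with plain pandas. The helper find_duplicates_by and the sample data are hypothetical. It returns a dataframe rather than a rule result, so it is an approximation of the concept and not the find_by() implementation.

```python
import pandas as pd


def find_duplicates_by(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return all rows that share the same values in `columns` with another row."""
    # keep=False marks every member of a duplicate group, not just the repeats
    mask = df.duplicated(subset=columns, keep=False)
    return df[mask].sort_values(columns)


# Hypothetical items: rows 0 and 2 share both title and category
df = pd.DataFrame({
    "title": ["Book", "Pen", "Book", "Book"],
    "category": ["fiction", "office", "fiction", "office"],
    "price": [10, 2, 12, 10],
})
duplicates = find_duplicates_by(df, ["title", "category"])
```

Row 3 is not reported even though its title repeats, because the combination of title and category is unique.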

basic_json_schema().json() prints a schema in JSON format

Result.show() to print a rule result, e.g.

from arche.rules.garbage_symbols import garbage_symbols
from arche.readers.items import JobItems
items = JobItems(0, "235801/1/15")
garbage_symbols(items).show()

notebooks to documentation

Changed

Tags rule returns unused tags, #2
basic_json_schema() prints a schema as a python dict

Deprecated

Arche().basic_json_schema() deprecated in favor of arche.basic_json_schema()

Fixed

Arche().basic_json_schema() not using items_numbers argument

18 Mar 13:46

manycoding

v2019.03.18

15d1634

2019.03.18 Gone public

Fixes

Duplicates rule was refactored and is up to 100x faster, thanks @ivankivanov, #268
Report was fixed, #270
Small jobs do not use multithreading

New:

Progress bars were added to JSON validation and flatten df, #204 #263 #201 #200
Garbage symbols are limited to 20 characters in the output
