# Rules

This notebook lists the rules used in the library, with examples. Some rules are executed during `Arche.report_all()`, and some are meant to be executed separately.

The following terms are used throughout, some of them interchangeably:

* Rule - a test case for data. Like any test case, it can fail, pass or be skipped. Some rules only output information, such as [Category fields](#Category-fields).

* **df** - a dataframe which holds input data (from a job, collection or other source)

* Scrapy Cloud item - a row in a **df**

* Item fields - columns in a **df**
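
To make the item/row and field/column correspondence concrete, here is a minimal plain-Python sketch. It does not use Arche or pandas; the sample data and the names `sample_items`, `sample_fields` and `sample_rows` are hypothetical and exist only for illustration.

```python
# Hypothetical sample data: each dict stands in for one scraped item.
sample_items = [
    {"name": "Widget", "price": 9.99, "category": "tools"},
    {"name": "Gadget", "price": None, "category": "tools"},
    {"name": "Gizmo", "price": 4.50},  # "category" was not scraped
]

# Fields are the union of keys across all items; in a dataframe they are the columns.
sample_fields = sorted({key for item in sample_items for key in item})

# Each item becomes one row; a field the item lacks is filled with None.
sample_rows = [[item.get(field) for field in sample_fields] for item in sample_items]

print(sample_fields)
for row in sample_rows:
    print(row)
```

A real **df** holds the same information, with pandas handling missing values instead of `None`.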

In [None]:
import arche
import pandas as pd  # pd.read_csv is used below to load the sample data
from arche import *
from arche.readers.items import Items

In [None]:
items = Items.from_df(pd.read_csv("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_products_8.csv"))
target_items = Items.from_df(pd.read_csv("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_products_7.csv"))

In [None]:
df = items.df.drop(columns=["_type"])
target_df = target_items.df

## Accessing Graph Data
The data behind the graphs is stored in `stats`. See the `Result` class for more details.

In [None]:
arche.rules.coverage.check_fields_coverage(df).stats

## Coverage
### Fields coverage on input data

In [None]:
help(arche.rules.coverage.check_fields_coverage)

In [None]:
arche.rules.coverage.check_fields_coverage(df).show()
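
As a rough illustration of what field coverage measures, the sketch below counts non-empty values per field in plain Python. It assumes coverage means "how many items have a value for this field"; it is not Arche's implementation, and `fields_coverage` and `coverage_sample` are hypothetical names. The `help()` output above remains the reference for the actual behavior.

```python
def fields_coverage(items):
    """Count, for every field, how many items carry a non-None value."""
    fields = sorted({key for item in items for key in item})
    return {
        field: sum(item.get(field) is not None for item in items)
        for field in fields
    }


# Hypothetical sample data, not taken from the CSV files used in this notebook.
coverage_sample = [
    {"name": "Widget", "price": 9.99, "category": "tools"},
    {"name": "Gadget", "price": None, "category": "tools"},
    {"name": "Gizmo", "price": 4.50},
]

counts = fields_coverage(coverage_sample)
for field, count in counts.items():
    # Report each count as a share of all items as well.
    print(f"{field}: {count}/{len(coverage_sample)} ({count / len(coverage_sample):.0%})")
```

A field whose count is well below the number of items is usually the first thing worth checking in the scraper.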

### Anomalies

In [None]:
help(arche.rules.coverage.anomalies)

In [None]:
res = arche.rules.coverage.anomalies(target="381798/2/4", sample=["381798/2/8", "381798/2/7", "381798/2/6"])
res.show()

## Categories

### Category fields

In [None]:
help(arche.rules.category.get_categories)

In [None]:
arche.rules.category.get_categories(df, max_uniques=200).show()

### Category coverage
In `report_all()`, these rules use the `category` tag.

In [None]:
help(arche.rules.category.get_coverage_per_category)

In [None]:
arche.rules.category.get_coverage_per_category(df, ["category"]).show()
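
Conceptually, coverage per category comes down to counting how often each value of a category field occurs. The plain-Python sketch below shows that idea under this assumption; `category_counts` and `category_sample` are hypothetical names, and the sketch is not how Arche computes its result.

```python
from collections import Counter


def category_counts(items, field):
    """Count occurrences of each non-None value of `field` across items."""
    return Counter(
        item[field] for item in items if item.get(field) is not None
    )


# Hypothetical sample data for illustration only.
category_sample = [
    {"name": "Widget", "category": "tools"},
    {"name": "Gadget", "category": "tools"},
    {"name": "Teddy", "category": "toys"},
    {"name": "Gizmo"},  # no category scraped, so it is not counted
]

per_category = category_counts(category_sample, "category")
for value, count in per_category.most_common():
    print(value, count)
```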

In [None]:
help(arche.rules.category.get_difference)

In [None]:
arche.rules.category.get_difference(df, target_df, ["category"]).show()

## Duplicates
### Find duplicates by columns (fields)
This rule is not included in `Arche.report_all()`.

In [None]:
help(arche.rules.duplicates.find_by)

In [None]:
arche.rules.duplicates.find_by(df, ["name", "part_number"]).show(short=True)
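
The idea of finding duplicates by a set of fields can be sketched in plain Python: group items by the tuple of values in those fields and keep the groups with more than one member. This is a hedged illustration rather than Arche's implementation. `find_duplicates_by` and `duplicates_sample` are hypothetical names, and skipping items that lack any of the key fields is an assumption made here.

```python
from collections import defaultdict


def find_duplicates_by(items, fields):
    """Return {key values: [item indexes]} for keys shared by 2+ items."""
    groups = defaultdict(list)
    for index, item in enumerate(items):
        key = tuple(item.get(field) for field in fields)
        # Assumption: an item missing any key field cannot be compared, so skip it.
        if any(value is None for value in key):
            continue
        groups[key].append(index)
    return {key: indexes for key, indexes in groups.items() if len(indexes) > 1}


# Hypothetical sample data for illustration only.
duplicates_sample = [
    {"name": "Widget", "part_number": "A-1"},
    {"name": "Widget", "part_number": "A-1"},  # same name and part number as item 0
    {"name": "Widget", "part_number": "B-2"},  # same name, different part number
    {"name": "Gadget"},                        # no part number, skipped
]

duplicates = find_duplicates_by(duplicates_sample, ["name", "part_number"])
print(duplicates)
```

Matching on several fields at once is stricter than matching on one: the third item shares a name with the first two but is not reported, because its part number differs.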