
Data quality #115

Merged
merged 43 commits into main from data-quality on Jul 13, 2022

Conversation

Collaborator

@elijahbenizzy commented Apr 16, 2022

[Short description explaining the high-level reason for the pull request]

Changes

Testing

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python - local testing

  • python 3.6
  • python 3.7

@elijahbenizzy
Collaborator Author

See docstring for check_output for basic design.

Will need to rebase/fix some stuff up. However, there are some questions that this raises:

  1. How to handle the notion of "anonymous nodes", e.g. nodes that are used as intermediaries in a subdag. Our example of this is the node that computes the results before feeding them to a data validator. I propose a central name_node(prefix, suffix, **kwargs) function that does a stable hash of the inputs, allowing for readable prefixes/suffixes (a rough sketch follows this list).

  2. The weird API we have in which you pass either a set of kwargs for the default resolver or a list of custom resolvers. I propose two separate APIs: check_output and check_output.custom(*validators). Ideally they would both resolve to the same implementation, but the problem is that we can't resolve arguments without knowing the dependent node types...

  3. Say we decorate @extract_columns with @check_output. This brings up some questions. (1) Should this create one DQ check for each column? IMO yes. (2) What if you just want to decorate specific columns? I think this is too much complexity to handle for now, but we could add provisions later on. (3) What

  4. How to get a report of DQ results? IMO nodes should be tagged with DQ metadata -- we should be able to query it pretty easily.
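
A minimal sketch of the name_node idea from point 1 (the hashing scheme and exact signature here are assumptions, not what this PR implements):

```python
import hashlib


def name_node(prefix: str, suffix: str, **kwargs) -> str:
    """Derive a stable, readable name for an anonymous/intermediate node.

    The digest depends only on the keyword arguments, so rebuilding the
    DAG with the same inputs yields the same node name, while the prefix
    and suffix keep it human-readable.
    """
    hashed = hashlib.sha256(repr(sorted(kwargs.items())).encode("utf-8")).hexdigest()
    return f"{prefix}_{hashed[:8]}_{suffix}"
```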

@elijahbenizzy changed the base branch from main to tag-nodes on April 16, 2022 23:39
@elijahbenizzy
Collaborator Author

elijahbenizzy commented Apr 16, 2022

Some tasks prior to this being ready:

  • Add a set of new default decorators
  • Add documentation
  • Add unit tests
  • Rebase from main
  • Separate into a few small commits
  • Merge tagging branch, set this to merge against main

@elijahbenizzy force-pushed the data-quality branch 2 times, most recently from 5523169 to 3b9ab68 on April 17, 2022 22:19
Base automatically changed from tag-nodes to main April 30, 2022 04:04
@skrawcz linked an issue Apr 30, 2022 that may be closed by this pull request
@elijahbenizzy force-pushed the data-quality branch 4 times, most recently from a48f8cd to 5eb2e98 on May 9, 2022 00:13
@elijahbenizzy force-pushed the data-quality branch 5 times, most recently from 05db9c7 to da4e064 on June 1, 2022 19:00
@elijahbenizzy force-pushed the data-quality branch 2 times, most recently from afaf503 to 94d478a on June 7, 2022 23:35
@elijahbenizzy
Collaborator Author

elijahbenizzy commented Jun 7, 2022

Alright, not 100% done but functional. Remaining:

  • End to end examples in documentation
  • Solve a bug in which layering decorators will cause two nodes of the same name
  • Fish around for feedback
  • Version bump
  • Fix tests

And then later tasks:

  • Publish
  • Provide some integrations (looking at whylabs)
  • Market

@elijahbenizzy force-pushed the data-quality branch 2 times, most recently from 2b57408 to 8e6d7da on June 9, 2022 16:06
Collaborator

@skrawcz left a comment

Almost there! Would have to play around with it and see how it impacts the GraphAdapters though. We will also want to show that it doesn't interfere with other decorators, or at least how it interacts with them, e.g. parameterize. This will then impact documentation, since I think this feature is large enough that we probably need to showcase a bunch of uses of it for people to cut and paste.

Discussion points to chat about:

  1. The double decorator issue seems like something we could do without? Make an issue to track it though?
  2. 😆 at last commit message.
  3. We should standardize how we import, I think module imports are cleaner in general.
  4. The naming of the validators I think should be as specific as possible, thus if they only operate over pandas series, we should have pandas and series in the name.
  5. What else do you need help on? I didn't check the test coverage, but that would be something to double check that we have added appropriate unit tests.

```python
import numbers
from typing import Any, Type, List, Optional, Tuple

from hamilton.data_quality.base import DataValidator, ValidationResult
```
Collaborator

I think we should use the style of importing the module, like google does. That way when you're reading the class, you know if it's defined locally or not. Default to local. The exceptions here are typing classes.

Collaborator Author

Yep, happy to do that
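
For illustration, the two import styles being discussed (a sketch, not a diff from this PR):

```python
# Class import (current style): at the call site you cannot tell whether
# DataValidator is defined locally or imported from elsewhere.
from hamilton.data_quality.base import DataValidator

# Module import (suggested style): usages read as base.DataValidator,
# which makes the origin explicit. Typing imports stay as class imports.
from hamilton.data_quality import base

validator: base.DataValidator
```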

hamilton/data_quality/default_validators.py (outdated review thread, resolved)
```python
class ValidationResult:
    passes: bool  # Whether or not this passed the validation
    message: str  # Error message or success message
    diagnostics: Dict[str, Any] = dataclasses.field(default_factory=dict)  # Any extra diagnostics information needed, free-form
```
Collaborator

are there any set keys emerging? Maybe a TypedDict would help here?

Collaborator Author

Haven't noticed any common ones. In fact, those will all be part of the dataclass itself... This is purely unstructured, but stable stuff.
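
For reference, a sketch of what the suggested TypedDict might look like if shared keys did emerge (hypothetical -- diagnostics stays a free-form dict in the PR):

```python
# (TypedDict lives in typing from Python 3.8; on 3.6/3.7 use typing_extensions.)
from typing import TypedDict


class RangeDiagnostics(TypedDict, total=False):
    """Hypothetical: shared diagnostics keys, if they ever stabilized."""
    range_min: float
    range_max: float
    observed_min: float
    observed_max: float
```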

hamilton/function_modifiers.py (outdated review thread, resolved)
data_quality.md (outdated review thread, resolved)
data_quality.md (outdated review thread, resolved)
Comment on lines 182 to 187
```python
class NansAllowedValidatorPandas(MaxFractionNansValidatorPandas):
    def __init__(self, allow_nans: bool, importance: str):
        if allow_nans:
            raise ValueError(f"Only allowed to block Nans with this validator."
                             f"Otherwise leave blank or specify the percentage of Nans using {MaxFractionNansValidatorPandas.name()}")
        super(NansAllowedValidatorPandas, self).__init__(max_fraction_nan=0 if not allow_nans else 1.0, importance=importance)
```
Collaborator

Language here is confusing. Should there not be any if statement here at all?

Collaborator

E.g. I could see people doing either -- allow_nans=True and allow_nans=False both being fine uses.

Collaborator

People might do this just to be explicit with expectations.

Collaborator Author

Yeah, my original thought was that allow_nans=False was the only one you'd want. E.g. it's a no-op if you have allow_nans=True. Recall that this will be mixed in with a bunch of other params...

hamilton/function_modifiers.py (outdated review thread, resolved)
data_quality.md (outdated review thread, resolved)
@elijahbenizzy force-pushed the data-quality branch 2 times, most recently from 4444cb2 to 3b1d1b0 on June 15, 2022 00:34
@elijahbenizzy
Collaborator Author

OK, this is pretty close for a first release (rc version).

@elijahbenizzy
Collaborator Author

elijahbenizzy commented Jun 17, 2022

A quick summary for the whylogs folks.

what this does
This PR adds a new decorator for hamilton functions. This decorator does data quality checks. It has two modes: warn and fail -- should be easy to figure out what these do :)

There are two ways to use the decorator:

  1. Passing in default arguments to use a set of included validators (check_output)
  2. Passing in a custom validator

(1) looks something like this:

```python
# Note that this actually produces two validators, as it uses two arguments
@check_output(range=(0, 1), allow_nans=False, importance='fail')
def data_between_0_and_1_with_no_nans(some_input: pd.Series) -> pd.Series:
    ...
```

Where (2) looks something like this:

```python
# Note that this actually produces two validators, as two are passed in
@check_output_custom(
    MyCustomDataInRangeValidator(low=0, high=1),
    MyCustomNoNansAllowedValidator())
def data_between_0_and_1_with_no_nans(some_input: pd.Series) -> pd.Series:
    ...
```

how this fits into hamilton/the hamilton plan
This is another step towards making an all-encompassing dataflow tool. The small set of included validators should cover some base cases (and are extensible). We hope to encourage

your task

We would love feedback! From you...

  1. Check out the branch/mess around with it -- write a basic dataflow to use for testing
  2. Look at the class DataValidator - can you fit your client into it somehow?
  3. Write one! (a rough sketch of what that might look like follows this comment)

Happy to pair as needed. Also happy to make changes in case we need any more abstractions.
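
To make "write one" concrete, here is a rough sketch of a custom validator. The constructor and method signatures are assumptions based on this thread -- check DataValidator in hamilton/data_quality/base.py for the real interface:

```python
import pandas as pd

from hamilton.data_quality import base


class MyCustomNoNansAllowedValidator(base.DataValidator):
    """Fails (or warns) if the output series contains any NaNs."""

    def __init__(self, importance: str = "warn"):
        # importance-based constructor is an assumption; see the base class.
        super().__init__(importance=importance)

    @classmethod
    def applies_to(cls, datatype: type) -> bool:
        # Only runs on pandas Series outputs.
        return issubclass(datatype, pd.Series)

    def validate(self, data: pd.Series) -> base.ValidationResult:
        fraction_nan = float(data.isna().mean())
        return base.ValidationResult(
            passes=fraction_nan == 0.0,
            message=f"{fraction_nan:.2%} of values are NaN",
            diagnostics={"fraction_nan": fraction_nan},
        )
```

It would then be passed to @check_output_custom(...) as in example (2) above.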

We now have a test-integrations section in config.yml. I've
decided to group them together to avoid a proliferation. Should
we arrive at conflicting requirements we can solve that later.
It was causing circular imports otherwise.
@elijahbenizzy
Copy link
Collaborator Author

elijahbenizzy commented Jul 6, 2022

OK so pandera integration is here and it's pretty clean IMO. That said, this is going to be tricky due to decorators...

E.g. when we do an extract_columns on a DataFrame[df_schema], to do this right we'll need it to handle the typing correctly so that each node produces the right Series[series_schema], where series_schema is the subschema of df_schema for that column. Doable, but not sure how worthwhile it is (as of now) from the implementation perspective.
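
To make the subschema idea concrete, a rough sketch (using pandera's public API; not something this PR implements) of deriving a per-column SeriesSchema from a DataFrameSchema:

```python
import pandera as pa

df_schema = pa.DataFrameSchema({
    "signups": pa.Column(int, checks=pa.Check.ge(0)),
    "conversion_rate": pa.Column(float, checks=pa.Check.in_range(0, 1)),
})

# Roughly what extract_columns would need to produce for each extracted column:
col = df_schema.columns["conversion_rate"]
series_schema = pa.SeriesSchema(
    col.dtype, checks=col.checks, nullable=col.nullable, name="conversion_rate"
)
```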

Collaborator

@skrawcz left a comment

Yeah just documentation thoughts.

data_quality.md (outdated review thread, resolved)
Comment on lines +33 to +37
## Default Validators

The available default validators are listed in the variable `AVAILABLE_DEFAULT_VALIDATORS`
in `default_validators.py`. To add more, please implement the class in that file then add to the list.
There is a test that ensures that everything is added to that list.
Collaborator

This seems out of place? Should be in a developer's section?

Collaborator Author

Yeah, it's a little grouped together right now

@@ -0,0 +1,150 @@
# Data Quality
Collaborator

This file needs to be reordered a bit I think.

  1. We want to optimize for someone getting started very easily - thus should list things like importance and accessing results up higher. E.g. how do I get something going, how do I configure that basic thing, how do I get the results.
  2. More complex use cases should be pushed further down.

So for me it's:

  1. introduction with code to cut & paste
  2. information on how to customize/tweak that code (list of kwargs, importance levels)
  3. how to access results
  4. Pandera integration
  5. Writing your own custom validators


The check_output validator takes in arguments that each correspond to one of the default validators.
Collaborator

Suggested change:
- The check_output validator takes in arguments that each correspond to one of the default validators.
+ The check_output decorator takes in arguments that each correspond to one of the default validators. For a list of default available validators see ....

Comment on lines +8 to +9
```python
from hamilton.data_quality import base
from hamilton.data_quality.base import BaseDefaultValidator
```
Collaborator

Can delete the second (class) import, and instead prepend `base.` wherever the class is used.



```python
class PanderaDataFrameValidator(base.BaseDefaultValidator):
    """Pandera schema validator for dataframes"""
```
Collaborator

link to pandera docs for dataframe schema / link to our docs.



```python
class PanderaSeriesSchemaValidator(base.BaseDefaultValidator):
    """Pandera schema validator for series"""
```
Collaborator

link to pandera docs for series schema / link to our docs.

It's two words and it should be separated by an `_` to be consistent
with all the other validators.
They were inconsistent. This changes them to follow this pattern:

* Name of validation + Data type

E.g. MaxFractionNansValidator + PandasSeries
PandasSeries and Primitives are added.

This enables one to do:

```python
@check_output(values_in=[ LIST,  OF, VALUES ], ...)
```
Which will check what the function outputs and validate that it
is within one of the values provided.

Currently the primitives validator operates over numbers and strings.
I punted on lists and dictionaries -- they should probably be
different validator classes.
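
For example, a hypothetical usage (assuming check_output is imported from hamilton.function_modifiers, as elsewhere in this PR):

```python
from hamilton.function_modifiers import check_output


# Warn (rather than fail) if the output is not one of the allowed values.
@check_output(values_in=["low", "medium", "high"], importance="warn")
def risk_bucket(average_score: float) -> str:
    ...
```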
There were multiple ways the same module was imported.
Reduced it to a single way.
Using `inclusive=True` is deprecated. So changing
to use `both` to not get a warning from pandas when
using this.
Sorry, merging two commits here.

(1) The `name()` was pretty much just the `argument` + `_validator`.
So I just encoded that and updated the names of variables and classes
to match this format. Thus (2) is rolled into this, because we should
be making sure that arguments, names, and class names follow some
semblance of structure.
Prior behavior stopped at the first failure. We don't want
that to happen. Instead we want to run through all the
checks and log them appropriately; this change does that.
So I had to change `act` to accommodate this.

Since `act` itself was only used in a single place, I just
moved the `if` statement into the BaseDataValidationDecorator.

That said, the class structure here feels a little odd -- might
be easy to introduce a circular dependency at some point accidentally.
But yeah we need a better mechanism for storing results
for people to access.
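
A rough sketch of the run-everything-then-act behavior described above (DataValidationError and the importance attribute are illustrative stand-ins, not the exact code in this PR):

```python
import logging

logger = logging.getLogger(__name__)


class DataValidationError(Exception):
    """Illustrative stand-in for whatever the decorator raises on failure."""


def run_validators(validators, output, node_name):
    """Run every validator, log each failure, then act once on the aggregate."""
    results = [(v, v.validate(output)) for v in validators]
    failures = [(v, r) for v, r in results if not r.passes]
    for validator, result in failures:
        logger.warning("Validator %s failed for node %s: %s",
                       type(validator).__name__, node_name, result.message)
    # Only raise after all validators have run, and only if any failing
    # validator was registered with importance "fail"; otherwise just warn.
    if any(v.importance == "fail" for v, _ in failures):
        raise DataValidationError(f"{len(failures)} validator(s) failed for node {node_name}")
    return results
```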
This only works over numbers and strings.
If we want to do dicts and lists, we probably
want a specific validator for them -- we don't
need the if/else checks here.
Fixing a rogue function docstring that did not match
the other docstrings we have set up.
Namely, we use `:param NAME:` rather than
`@param NAME:`.
This data quality example is based on the example
we provided with the outerbounds(metaflow) folks.

Its purpose is to show how one might apply data quality.
It also shows how to use the same code and make it
work for Dask*, Ray, and Spark.

Ray - everything seems to work as intended.

Spark - had to change how the data & some features
are computed.

Dask* - had to change how the data & some features
are computed.  CAVEAT: right now the validators don't work properly
for dask data types.  That is because either (1) it's a future object, or (2)
we use pandas series syntax, when instead we should
use the dask specific syntax. In short - DEFAULT DQ DOES NOT WORK WITH
DASK DATA TYPES. BUT it DOES WORK if you're just using Dask
for multi-processing, and not using dask data types.

So we need to think about how we might change/inject the validator implementation
based on, say, a graph adapter or something... otherwise this forces one
to really stick to one data type or another, i.e. pandas or dask dataframe.

Documentation should hopefully be enough to explain what is going on.
The only TODO is to create an analogous example using Pandera -- my hope
is that it will handle dask datatypes...
Before it did not check anything -- and instead assumed
a dictionary of series and scalars.

Now if there is only a single value, and it happens to be a dataframe,
we will return that, instead of trying to build another dataframe.

Adds unit tests for this function.
It did not take in importance or call the super class.

Updates unit tests.
Dask datatypes are lazily evaluated. So we need to check
whether the "validate" result we get back from pandera
is a dask like object. If so, we then want to "materialize" it
so that we can actually compute the validation result.

Without this check, they are never evaluated, because
nothing downstream asks for the result to be computed.
Using the same trick as we employed before, we can simply
compute a result for the scalar primitive validators, so there is
a valid value to compare against.

Without this, things break because we're trying to compare
against a lazy dask object.

Note: we could use a similar strategy for the Pandas Series validators,
however we'd need to do something akin to what pandera does
under the hood with `map_partitions` over the dask-like object.
I vote to push people to use pandera if they're using dask data types.
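
A minimal sketch of the materialization check described in these commits (an assumption of how it could be done, not the literal implementation):

```python
def materialize_if_lazy(validation_outcome):
    """Force lazily-evaluated (dask-like) results so the validation actually runs.

    Dask collections expose a .compute() method; plain pandas or scalar
    results do not, and are returned unchanged.
    """
    if hasattr(validation_outcome, "compute"):
        return validation_outcome.compute()
    return validation_outcome
```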
Adds one logger statement to ensure things are logged nicely, one by one in the
case of a failure -- they were otherwise hard to interpret.

Also fixes install instructions for the pandera validators case.
So that the output does not take over your whole screen.
This example is virtually identical to `examples/data_quality/simple`. It instead
makes the following choices:

1. Separate feature logic file for Pandas on Spark. Just to show another way to
cut things. Well that, and to correct the "data type" validation issue with the simple example.
2. Uses Pandera and shows how to validate Series and Dataframes using Pandera +
`@check_output`.
In case there is a failure, it's probably useful to
print the expected valid values.

Also changes the applies_to check to use `issubclass`;
this is related to PR feedback, and I'm too lazy
to make another commit just for it.
@skrawcz self-requested a review July 13, 2022 16:19
Co-authored-by: Stefan Krawczyk <skrawczyk@stitchfix.com>
@elijahbenizzy merged commit 860c60a into main Jul 13, 2022
@elijahbenizzy deleted the data-quality branch July 13, 2022 17:07

Successfully merging this pull request may close these issues.

Prototype Data Quality Feature