
add support for grouping column of interest by another column. #16

Closed
cosmicBboy opened this issue Dec 11, 2018 · 7 comments · Fixed by #42

Comments

@cosmicBboy
Collaborator

cosmicBboy commented Dec 11, 2018

if the user needs to validate some column x conditioned on the values of column y, they could specify a tuple of column names as the key:

DataFrameSchema({
    ("x", "y"): Columns([Int, String], Check(lambda x, y: x[y == "foo"] > 10))
})
@cosmicBboy cosmicBboy changed the title add support for specifying additional columns to Column object add support for specifying tuple of column name as keys in DataFrameSchema Dec 18, 2018
@mastersplinter
Collaborator

What's your view on handling nullable, strict, and allow_duplicates when supplying a tuple of columns for checking?

Do you see something like this?:

DataFrameSchema({
    "a": Column(Float, nullable=True),
    ("b", "c", "d"): Columns(
        (
            (Int, nullable=True, strict=True),
            (Float, nullable=True, allow_duplicates=True),
            (String, nullable=True, strict=False),
        ),
        Check(lambda b, c, d: b*c == d > 10),
    ),
    "e": Column(Bool, nullable=True),
})

@cosmicBboy
Collaborator Author

this is an interesting problem... I'm wondering if it would make more sense to extend the Check class so that specifying a tuple of column names that the column depends_on modifies the signature of the check function, like so:

DataFrameSchema({
    "a": Column(Float, nullable=True, checks=[
        Check(lambda a: a > 0),
        Check(lambda a, b, c: a[(b == "foo") & (c == "bar")] > 10, depends_on=("b", "c"))
    ]),
    "b": Column(String, nullable=True),
    "c": Column(String, nullable=True),
})

This way the user only needs to specify column properties once, and modify the function signature of a Check.

I'm concerned that my original proposal will make schemas more convoluted to read.

One problem with the above proposal that I haven't thought about too much is whether the order of execution of Checks matters. Before, each check was atomic in the sense that it only depended on one column... but now with depends_on, I wonder if under the hood we'd need to run checks on "b" and "c" before executing the check on a[(b == "foo") & (c == "bar")] > 10.

We can probably check at schema initialization whether there are circular dependencies, and error out immediately if there are any, e.g.:

DataFrameSchema({
    "a": Column(Float, nullable=True, checks=[
        Check(lambda a, b: a[(b == "foo")] > 10, depends_on=("b",))
    ]),
    "b": Column(String, nullable=True, checks=[
        Check(lambda b, a: b[(a > 5)] == "bar", depends_on=("a",))
    ]),
})

# SchemaError: circular dependency in columns "a" and "b"
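Such an initialization-time cycle check could be sketched as follows (the depends_on mapping and the find_cycle helper are hypothetical, for illustration only; nothing like this exists in pandera):

```python
# hypothetical: map each column name to the columns its checks depend on
def find_cycle(depends_on):
    """Return True if the dependency mapping contains a cycle."""
    visiting, done = set(), set()

    def visit(col):
        if col in done:
            return False
        if col in visiting:
            return True  # back-edge: looped back to a column still in progress
        visiting.add(col)
        for dep in depends_on.get(col, ()):
            if visit(dep):
                return True
        visiting.discard(col)
        done.add(col)
        return False

    return any(visit(col) for col in depends_on)

# the schema above would error out at initialization:
print(find_cycle({"a": ["b"], "b": ["a"]}))        # True
print(find_cycle({"weight": ["sex"], "sex": []}))  # False
```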

@mastersplinter
Collaborator

mastersplinter commented May 18, 2019

Specifying depends_on before those columns are defined seems a bit odd on first read (it makes me think of the classic IDE error "variable referenced before assignment").

Maybe the use of b in the check of a could be handled implicitly, i.e. the fact that b exists is enough for building a depends_on dependency graph behind the scenes?

That still suffers from the readability issue.

One option could be a checks list kept entirely separate from the columns, for readability; something like this:

DataFrameSchema(
    columns={
        "a": Column(Float, nullable=True),
        "b": Column(String, nullable=True),
        "c": Column(String, nullable=True)
    },
    checks=[
        Check(lambda a: a > 0),
        Check(lambda a, b, c: a[(b == "foo") & (c == "bar")] > 10)
    ]
)

@cosmicBboy
Collaborator Author

the separate checks idea is challenging because I think checks should be bound to the data provided by the columns keys. There really isn't a straightforward way for a Check to know what its function signature means besides explicitly associating the Check with column names.

Really, the intent behind checks that involve multiple columns is to enable things like hypothesis testing https://github.com/cosmicBboy/pandera/issues/17, which I think is the strongest use case for a solution where additional column dependencies are specified at the Check level, one possible proposal being the depends_on kwarg.

Basically the backend implementation of the one_sided_t_test described in #17 would be something like:

from scipy import stats

def two_sample_one_sided_ttest(group1, group2, relationship="gt", alpha=0.05, **kwargs):
    ttest = stats.ttest_ind(group1, group2, **kwargs)
    is_significant = ttest.pvalue / 2 < alpha
    # `relationship` refers to the relationship of group1 w.r.t. group2
    if relationship == "gt":
        return ttest.statistic > 0 and is_significant
    elif relationship == "lt":
        return ttest.statistic < 0 and is_significant
    else:
        raise ValueError("relationship %s not recognized" % relationship)

DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Check(
                lambda w, s: two_sample_one_sided_ttest(w[s == 'men'], w[s == 'women']),
                depends_on=("sex",)
            )
        ]
    ),
    "sex": Column(String, Check(lambda s: s.isin(["men", "women"]).all()))
})
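For reference, the helper above can be exercised on its own with synthetic data (the sample values below are made up purely for illustration; the function mirrors the one in this comment):

```python
import numpy as np
from scipy import stats

def two_sample_one_sided_ttest(group1, group2, relationship="gt", alpha=0.05, **kwargs):
    # one-sided test obtained by halving the two-sided p-value
    ttest = stats.ttest_ind(group1, group2, **kwargs)
    is_significant = ttest.pvalue / 2 < alpha
    # `relationship` refers to the relationship of group1 w.r.t. group2
    if relationship == "gt":
        return ttest.statistic > 0 and is_significant
    elif relationship == "lt":
        return ttest.statistic < 0 and is_significant
    raise ValueError("relationship %s not recognized" % relationship)

# made-up sample weights, clearly separated so the test is significant
men = np.array([82.0, 85.0, 88.0, 90.0, 91.0])
women = np.array([60.0, 62.0, 64.0, 66.0, 68.0])

print(two_sample_one_sided_ttest(men, women, relationship="gt"))  # True
print(two_sample_one_sided_ttest(men, women, relationship="lt"))  # False
```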

> Specifying depends_on before those columns are defined seems a bit odd on first read (it makes me think of the classic IDE error "variable referenced before assignment").

I think this makes sense for imperative program models, but the goal with pandera is that the schema definition is (somewhat) declarative, such that the properties of a dataframe are asserted up-front, and the details of how validation happens is abstracted away for the user... I say somewhat because the validators are still fairly low-level, e.g. asserting the values of a column are positive via lambda s: s > 0.

In this way we're not really defining columns in an imperative way; we're declaring that a Check associated with the column weight depends on this other column sex. What are the properties of the sex column? Well, the user would be expected to define these properties with {"sex": Column(...)}.

Not sure if this thought process makes sense... we can continue mulling this over, but two of the things I'd like to shoot for are:

  • keeping the API as clear and concise as possible
  • discouraging an API that leads to over-specification of column properties

@mastersplinter
Collaborator

mastersplinter commented May 20, 2019

I like your thinking - column-wise checks follow the logic of "this column depends on these others", and that fits the primary use case. The syntax gets clunky if a user writes too many overlapping checks, but that can easily be handled outside of pandera if required.

I looked at implementing it, and one observation: if the user made a mistake in specifying the lambda function or depends_on, it would be difficult to diagnose/debug, e.g.:

DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Check(
                lambda w, s, h, a: w+s+a+h,
                depends_on=("sex","age","height")
            )
        ]
    )
})

In the above function, the positional order of depends_on sets h=age and a=height, which is probably not what the user intended. Not sure if or how to make this better.

One possible implementation leaves the door open to changing the syntax relatively flexibly in the future: passing a dataframe through SeriesSchemaBase and into Check.

So the call of Check goes from:

    def __call__(self, parent_schema, series, index):

and becomes this:

    def __call__(self, parent_schema, dataframe, index):

To make this work, the whole dataframe could be passed into SeriesSchemaBase, and the existing code adapted to reference only the named column, i.e. using df[self._name], as Column currently does.

Then modifying the way the checks are called in SeriesSchemaBase from:

        for i, check in enumerate(self._checks):
            check_results.append(check(self, series, i))

into something like:

        for i, check in enumerate(self._checks):
            check_results.append(check(self, df[[this_column] + list(depends_on)], i))

Effectively, passing a DataFrame into Check (either the subsetted dataframe or the full dataframe) instead of multiple series leaves open the possibility of extending DataFrameSchema to also have checks, or of changing the syntax later.

@cosmicBboy
Collaborator Author

cosmicBboy commented May 21, 2019

> In the above function, the positional order of depends_on sets h=age and a=height, which is probably not what the user intended. Not sure if or how to make this better.

good point, I do like the direction of making the API for specifying checks clearer, and I'd like to maintain the separation of the primary column (the one that is specified in the columns key) and the depends_on columns.

For example:

DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Check(
                lambda w, d: w + d["sex"] + d["age"] + d["height"],
                depends_on=("sex","age","height")
            )
        ]
    )
})

basically specifying depends_on modifies the expected Check function signature to

(series, dataframe) -> bool|pd.Series[bool]

where dataframe contains the columns specified in depends_on.

I like this because weight is still a first-class argument to the check function and sex, age, height are contained within the second argument.

It's then the user's responsibility to use the contents in the dependencies dataframe to make an assertion about the primary column.

On a side note, more complex check functions (with many dependencies and complex logic) should probably be defined beforehand with the def myfunc(x, df) pattern.
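To make that concrete, a predefined check under the proposed (series, dataframe) signature might look like the following sketch (hypothetical names and made-up data; this depends_on API is only a proposal in this thread, not what was ultimately shipped):

```python
import pandas as pd

# hypothetical predefined check following the proposed
# (series, dataframe) -> bool signature: `weight` is the primary
# column, `deps` is a dataframe holding the depends_on columns
def mean_weight_by_sex(weight, deps):
    men = weight[deps["sex"] == "men"]
    women = weight[deps["sex"] == "women"]
    return men.mean() > women.mean()

# made-up data for illustration
weight = pd.Series([85.0, 90.0, 62.0, 58.0])
deps = pd.DataFrame({"sex": ["men", "men", "women", "women"]})
print(mean_weight_by_sex(weight, deps))  # True
```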

cosmicBboy added a commit that referenced this issue May 23, 2019
fixes #16.

this diff enables more complex `Column` `Check`s that incorporate
data from other columns in a principled way.

This is done by providing a `groupby` argument, which enables
the user to make assertions about subsets of the column based
on different groups. This is the first step towards #17.

This diff also changes the default of `element_wise=False`
@cosmicBboy cosmicBboy changed the title add support for specifying tuple of column name as keys in DataFrameSchema add support for grouping column of interest by another column. May 23, 2019
@cosmicBboy
Collaborator Author

after much thought, the solution to the problem of specifying additional columns in a Column Check was to introduce a groupby argument to the Check class.

The reason for this is that it enables the user to group a Column of interest by some other column in the dataframe with some reasonable guard rails.

If we went with the depends_on solution proposed in https://github.com/cosmicBboy/pandera/issues/16#issuecomment-494469311, the API would be too generic, in that the deps dataframe could be used to make arbitrary assertions, which isn't quite the desired behavior: what we want is to validate properties of the Column of interest, not of the dependency columns. This use case isn't necessarily "bad", but it can be enabled by something like #14.

The groupby argument groups the Column of interest into a dict mapping each discrete group key to a Series. This enables the user to assert properties about subsets of the Column by the desired group.
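To illustrate the semantics, the grouping step is roughly equivalent to the following in plain pandas (a sketch of the behavior with made-up data, not pandera internals):

```python
import pandas as pd

df = pd.DataFrame({
    "weight": [85.0, 90.0, 62.0, 58.0],
    "sex": ["men", "men", "women", "women"],
})

# groupby="sex" hands the check function a dict of
# group key -> Series of the column of interest
groups = {key: series for key, series in df.groupby("sex")["weight"]}

# the check can then assert properties about each subset
check_fn = lambda groups: groups["men"].mean() > groups["women"].mean()
print(check_fn(groups))  # True
```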

cosmicBboy added a commit that referenced this issue May 24, 2019
* add `groupby` and `groups` arg to `Check`

fixes #16.

this diff enables more complex `Column` `Check`s that incorporate
data from other columns in a principled way.

This is done by providing a `groupby` argument, which enables
the user to make assertions about subsets of the column based
on different groups. This is the first step towards #17.

This diff also changes the default of `element_wise=False`

* update README

* fix README typo, add more tests

tests make sure that SeriesSchema and Index cannot use
Checks with `groupby` argument

* add more tests
cosmicBboy pushed a commit that referenced this issue May 12, 2023
Fixed builtin check bug and added test for supported builtin checks f…