add support for grouping column of interest by another column. #16
Column
objectDataFrameSchema
What's your view on handling the Do you see something like this?:
this is an interesting problem... I'm wondering if it would make more sense to extend the
This way the user only needs to specify column properties once, and modify the function signature of a
I'm concerned that my original proposal will make schemas more convoluted to read.
One problem with this above proposal that I haven't thought about too much is whether the order of execution of
Can probably check at schema initialization whether there are circular dependencies, and error out immediately if there are any, e.g.
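A minimal sketch of the circular-dependency check mentioned here, assuming the schema can expose its column dependencies as a plain mapping from column name to the columns it depends on. The names `find_cycle` and `assert_acyclic` and the mapping format are hypothetical, not part of pandera:

```python
from typing import Dict, List


def find_cycle(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return one dependency cycle as a list of column names, or [] if none.

    `dependencies` is a hypothetical mapping: column -> columns it depends on.
    """
    unvisited, in_progress, done = 0, 1, 2
    state = {col: unvisited for col in dependencies}
    path: List[str] = []

    def visit(col: str) -> List[str]:
        state[col] = in_progress
        path.append(col)
        for dep in dependencies.get(col, []):
            dep_state = state.get(dep, unvisited)
            if dep_state == in_progress:
                # dep is already on the current path, so we found a cycle
                return path[path.index(dep):] + [dep]
            if dep_state == unvisited:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[col] = done
        return []

    for col in dependencies:
        if state[col] == unvisited:
            cycle = visit(col)
            if cycle:
                return cycle
    return []


def assert_acyclic(dependencies: Dict[str, List[str]]) -> None:
    """Raise immediately (e.g. at schema initialization) if a cycle exists."""
    cycle = find_cycle(dependencies)
    if cycle:
        raise ValueError("circular column dependencies: " + " -> ".join(cycle))
```

Calling `assert_acyclic({"a": ["b"], "b": ["a"]})` raises a `ValueError` naming the cycle, while an acyclic mapping such as `{"a": ["b"], "b": []}` passes silently.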
Specifying Maybe the use of That still suffers from the readability issue. One option could be having an entirely separate
the separate Really, the intent behind checks that involve multiple columns is to enable things like hypothesis testing https://github.com/cosmicBboy/pandera/issues/17, which I think is the strongest use case for a solution where additional column dependencies are specified at the Basically the backend implementation of the one_sided_t_test described in #17 would be something like:
I think this makes sense for imperative program models, but the goal with pandera is that the schema definition is (somewhat) declarative, such that the properties of a dataframe are asserted up-front, and the details of how validation happens are abstracted away for the user...
I say somewhat because the validators are still fairly low-level, e.g. asserting the values of a column are positive via
In this way we're not really defining columns in an imperative way, we're declaring that a
Not sure if this thought process makes sense... we can continue mulling this over, but two of the things I'd like to shoot for are:
I like your thinking - column wise checks follow the logic of "this column depends on these others", and fits the primary use case. The syntax gets clunky if a user writes too many overlapping checks, and that can easily be handled outside of Pandera if required. I looked at implementing it and one observation: if the user made a mistake in specifying the
In the above function, the positional order of One possible implementation leaves open the possibility of changing syntax in future relatively flexibly: passing a dataframe through So the call of
and becomes this:
To make this work the whole dataframe could be passed into Then modifying the way the checks are called in
into something like:
Effectively by passing a Dataframe into
good point, I do like the direction of making the API for specifying checks clearer, and I'd like to maintain the separation of the primary column (the one that is specified in the For example:
basically specifying
where
I like this because
It's then the user's responsibility to use the contents in the dependencies dataframe to make an assertion about the primary column.
On a side-note, more complex
fixes #16. this diff enables more complex `Column` `Check`s that incorporate data from other columns in a principled way. This is done by providing a `groupby` argument, which enables the user to make assertions about subsets of the column based on different groups. This is the first step towards #17. This diff also changes the default of `element_wise=False`
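To illustrate the idea behind a `groupby`-style check, here is a standard-library sketch, not pandera's implementation. It assumes the check function receives a dict mapping each group key to the values of the primary column in that group. The helper names `group_column` and `run_grouped_check` are hypothetical, and the `groups` parameter is assumed to restrict the dict to the listed keys:

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Optional, Sequence


def group_column(values: Sequence, keys: Sequence[Hashable]) -> Dict[Hashable, List]:
    """Split `values` (the primary column) by the parallel `keys` column."""
    grouped = defaultdict(list)
    for key, value in zip(keys, values):
        grouped[key].append(value)
    return dict(grouped)


def run_grouped_check(
    values: Sequence,
    keys: Sequence[Hashable],
    check_fn: Callable[[Dict[Hashable, List]], bool],
    groups: Optional[Sequence[Hashable]] = None,
) -> bool:
    """Apply `check_fn` to the primary column split into groups.

    If `groups` is given, only those group keys are passed to `check_fn`;
    a key that does not occur in `keys` raises KeyError.
    """
    if len(values) != len(keys):
        raise ValueError("primary column and grouping column differ in length")
    grouped = group_column(values, keys)
    if groups is not None:
        grouped = {g: grouped[g] for g in groups}
    return bool(check_fn(grouped))


# Example: assert that the mean height of group "B" exceeds that of group "A".
heights = [5.6, 6.4, 4.0, 7.1]
group = ["A", "B", "A", "B"]


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


ok = run_grouped_check(heights, group, lambda g: mean(g["B"]) > mean(g["A"]))
```

The check function only ever sees the primary column, split by the grouping column, which keeps the assertion about one column while still letting it depend on another.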
DataFrameSchema
after much thought, the solution to the problem of specifying additional columns in a
The reason for this is that it enables the user to group a
If we went with the
The
* add `groupby` and `groups` arg to `Check`
  fixes #16. this diff enables more complex `Column` `Check`s that incorporate data from other columns in a principled way. This is done by providing a `groupby` argument, which enables the user to make assertions about subsets of the column based on different groups. This is the first step towards #17. This diff also changes the default of `element_wise=False`
* update README
* fix README typo, add more tests
  tests make sure that SeriesSchema and Index cannot use Checks with `groupby` argument
* add more tests
Fixed builtin check bug and added test for supported builtin checks f…
if the user needs to validate some column x conditioned on the values of column y, the user can specify multiple column names argument