# Implement parsing functionality in dataframe schemas #252
- `parser` decorator that passes the failure boolean dataframe into the decorated function
- as part of this issue, rename the …
For inspiration, perhaps have a look at pandas_schema. I also found this article providing a solution to the requirements above. What I like about this method is:

Some room for improvement that I see when tackling this requirement is to provide an extended version of what pandas-schema provides (`0,"{row: 2, column: ""dec3""}: ""ee"" is not decimal"`):

This could help when building pipelines. Certain validation errors are good to know as warnings, while other validation errors impact the continuation of a pipeline. Being able to filter on those and to export the validation results (errors) would greatly improve the usage of validations in pipelines and their traceability. E.g., a failure in decimal validation could result in either:
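One way to approximate this today is to post-process the failure cases from a lazy validation with a user-defined severity mapping. A minimal sketch; the `SEVERITY` table and the report file name are illustrative, not pandera API:

```python
import pandera as pa

# user-defined mapping from check names to severities -- pandera has no
# severity concept, so the keys must match the strings that appear in the
# "check" column of the failure-cases report
SEVERITY = {"decimal_check": "warning", "category_check": "error"}

def validate_with_severities(df, schema):
    try:
        return schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as err:
        cases = err.failure_cases  # tabular report of all failure cases
        cases["severity"] = cases["check"].map(SEVERITY).fillna("error")
        cases.to_csv("validation_report.csv", index=False)  # export for traceability
        if (cases["severity"] == "error").any():
            raise  # blocking errors stop the pipeline
        return df  # warnings only: let the pipeline continue
```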
thanks for the feedback @Tankske, will consider it when working out the specific implementation for this issue. The main problem this issue is trying to tackle is to help users during the development/debugging process to be able to apply fixes to the failure cases surfaced by a schema. For error reporting and human readability, I think perhaps something close to the lazy validation output would be nice (which, in fact, I modeled a little after pandas-schema, but in a tabular format), so perhaps:

```python
@pa.parser(schema)
def clean_data(df, failed):
    """
    :param df: dataframe to clean
    :param failed: passed in by `pa.parser` decorator.
    """
    check_results = failed.check_results  # boolean dataframe where checks failed
    failure_cases = failed.cases  # dataframe of failure cases with human-readable output
    clean_df = (
        # replace negative values with nans
        df.update_where(check_results["positive"], "col1", np.nan)
        # filter out records with unknown categories (pyjanitor methods)
        .filter_on(check_results["category_abc"], complement=True)
    )
    return clean_df
```

Re: filtering and imputation, I'd be open to expanding pandera in that direction. I think the code sketch that you provided is a good start, and if you'd like to pursue this direction, I'd encourage you to open up a new issue articulating the problem and solution design that you have in mind.
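Under this sketch, usage would presumably be plain function application, with the decorator running validation around the call (hypothetical, since `pa.parser` doesn't exist yet):

```python
import pandas as pd

raw_df = pd.read_csv("data.csv")  # hypothetical input file
# the decorator would: validate raw_df against `schema`, pass the failure
# information in as `failed`, and check the returned dataframe again
clean_df = clean_data(raw_df)
```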
Hello, just checking whether this functionality can be implemented using available APIs in pandera, or whether we have to wait for this PR to land. TIA.
hi @JoyMonteiro this functionality won't be available for another few releases... supporting parsing is something I want to design carefully, the code sketches above are likely not going to be what the final implementation looks like. To help with the design process, can you describe what your parsing use case is?
I see, thanks for letting me know. We are trying to build a tool to assist in cleaning/curating data. It would consist of a UI, probably made with panel, where the user uploads an excel file. This file will be parsed and cleaned (to some extent). This would likely be an iterative process until the dataframe reaches a certain data quality (meaning it obeys a given schema). This is where I hoped to use this functionality. Having it in pandera would be nice because the final validation of …
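A rough sketch of that iteration with today's API, where `apply_user_fixes` stands in for the UI-driven cleaning step (both the loop structure and that helper are assumptions, not pandera features):

```python
import pandera as pa

def iteratively_clean(df, schema, max_rounds=10):
    """Loop until the dataframe satisfies the schema or we give up."""
    for _ in range(max_rounds):
        try:
            return schema.validate(df, lazy=True)  # passes: data quality reached
        except pa.errors.SchemaErrors as err:
            # hypothetical hook: let the user fix the reported failure cases in the UI
            df = apply_user_fixes(df, err.failure_cases)
    raise RuntimeError("dataframe still fails schema validation after max_rounds")
```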
## Pandera Parsing

As referenced in #542, imo pydantic's notion of "validation" really bundles two things: (i) parsing/coercing raw inputs into the desired types, and (ii) checking properties of the data. We can map (ii) easily onto pandera's concept of checks, which return booleans mainly to indicate which elements in the dataframe failed the check... ultimately a pandera schema is already sort of a parsing tool because it can coerce column dtypes before running checks.

Here are a few ways to go about implementing the parsing functionality.

### Proposal 1: Parsing as a function

As hinted at in this issue's description: #252 (comment), this proposal would provide a functional interface for users to parse a raw dataframe given the failure cases produced by a schema:

```python
import numpy as np
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "col1": pa.Column(checks=pa.Check.ge(0)),
    "col2": pa.Column(checks=pa.Check.isin(["a", "b", "c"]))
})

@pa.parse(schema)
def clean_data(df, failure_cases):
    """
    :param df: dataframe to clean
    :param failure_cases: passed in by the `pa.parse` decorator. A boolean dataframe
        with the same index as df, where columns are check names. True indicates
        failure cases.
    """
    clean_df = (
        # replace negative values with nans
        df.update_where(failure_cases["col1"]["ge"], "col1", np.nan)
        # filter out records with unknown categories
        .filter_on(failure_cases["col2"]["isin"], complement=True)
    )
    return clean_df

# - SchemaModel syntax -
class Schema(pa.SchemaModel):
    col1 = pa.Field(ge=0)
    col2 = pa.Field(isin=["a", "b", "c"])

@pa.parse
def clean_data(df: pa.typing.DataFrame[Schema], failure_cases):
    ...

def load_data(file):
    return (
        pd.read_csv(file)
        .pipe(clean_data)
        # visualization, modeling, etc.
    )
```

### Proposal 2: A single global parser function supplied to a schema

Very similar to proposal 1, but baked into a schema object (also similar to the now-deprecated …):

```python
schema = pa.DataFrameSchema(
    columns={
        "col1": pa.Column(checks=pa.Check.ge(0)),
        "col2": pa.Column(checks=pa.Check.isin(["a", "b", "c"]))
    },
    parser=lambda df, failure_cases: ...
)

class Schema(pa.SchemaModel):
    col1 = pa.Field(ge=0)
    col2 = pa.Field(isin=["a", "b", "c"])

    # the parser method might be a reserved name in SchemaModel
    @classmethod
    def parser(cls, df, failure_cases):
        ...
```

### Proposal 3: Column- and Dataframe-level parsers

Closer in spirit to pydantic:

```python
schema = pa.DataFrameSchema(
    columns={
        # a single parser that replaces negative values by 0
        "column": pa.Column(int, parsers=pa.Parser(lambda series: series.mask(series <= 0, 0)))
    },
)

class Schema(pa.SchemaModel):
    column: int

    @pa.parser("column")
    def column_gt_zero(cls, series):
        return series.mask(series <= 0, 0)

# dataframe-level parsing
schema = pa.DataFrameSchema(
    parsers=pa.Parser(lambda df: df.mask(df < 0, 0))
)

class Schema(pa.SchemaModel):
    class Config:
        gt = 0

    @pa.dataframe_parser
    def dataframe_gt_zero(cls, df, check_results):
        return df.mask(~check_results["gt"], 0)
```

### Proposal 4: Parsers as a Special Type of Check

Similar to 3, but instead of introducing a new keyword, supply `Parser` objects to the existing `checks` argument:

```python
schema = pa.DataFrameSchema(
    columns={
        # a single parser that replaces negative values by 0
        "column": pa.Column(int, checks=pa.Parser(lambda series: series.mask(series <= 0, 0)))
    },
)
```

The parser function would have the same semantics as pydantic validators, so users can also define parsers that are equivalent to checks:

```python
# parsers can be equivalent to checks
def gt_0(series):
    failure_cases = series <= 0
    if failure_cases.any():
        raise SchemaError(..., failure_cases=failure_cases)
    return series

gt_0_parser = pa.Parser(gt_0, ...)
```

And similar to pydantic, parser functions might also support depending on checks/parsers that come before them:

```python
schema = pa.DataFrameSchema(
    columns={
        # a single parser that replaces negative values by 0
        "column": pa.Column(
            int,
            checks=[
                pa.Check.gt(0),
                pa.Parser(lambda series, check_results: series.mask(~check_results["gt"], 0))
            ]
        )
    },
)
```

## Pros and Cons

### (1) and (2)

Pros: …

Cons: …

### (3) and (4)

Pros: …

Cons: …

Right now I'm leaning towards (4), but I'm curious what your thoughts are @jeffzi @d-chambers @JoyMonteiro
I think your proposals are not incompatible: …
At my work, I created a function that splits out failed cases after validation so that they can be stored and debugged later:

```python
try:
    events = schema.validate(events, lazy=True)
except pa.errors.SchemaErrors as err:
    events = extract_bad_records(err)  # split failed cases and append a column "error"
    is_bad_event = ~events["error"].isnull()
    bad_events = events[is_bad_event]
    events = events[~is_bad_event].drop(columns="error")
    ...
```

I would pick one solution for pre- and post-validation since they do not have the same purpose. My preference:
re: naming. attrs uses `validator` for …. My order of preference would be …
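For context, a function like `extract_bad_records` used above might look roughly like this; the real implementation wasn't shared, so this sketch just joins `SchemaErrors.failure_cases` back onto the data by index:

```python
import pandas as pd
import pandera as pa

def extract_bad_records(err: pa.errors.SchemaErrors) -> pd.DataFrame:
    """Append an "error" column to the data; non-null only for failed rows."""
    data = err.data  # the dataframe that was being validated
    cases = err.failure_cases
    # one row can fail several checks; keep a single concatenated message per index
    errors = (
        cases.dropna(subset=["index"])  # dataframe-wide failures have no row index
        .groupby("index")["check"]
        .apply(", ".join)
        .rename("error")
    )
    return data.join(errors)
```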
cool, thanks for your feedback @jeffzi, I think it makes sense to distinguish between pre-processing and post-processing. re: naming, I think …

And there was even an issue discussing the renaming of ….

To clarify my thinking around this feature, here's a higher-level proposal about the pandera parsing/validation pipeline order of execution (inspired by the way …):

1. …
2. parse the raw data
3. check the constraints established by the schema
4. surface failure cases for the user to handle

Note that the user interaction model that this implies is that pandera intends step 4 only as a way for the user to further refine the parsing functionality in step 2 in order to fulfill the constraints established by step 3.

I like solution (3), which keeps parsing and checking separate: this way we get a nice continuity with the data synthesis strategies, which is still coupled with ….

Totally agreed on better UX for handling schema errors, though I think your proposal needs a little more refinement, since the …
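In rough pseudocode, that order of execution would look like the following; every helper name here is a placeholder to show the intended sequence, not pandera API:

```python
def validate(df, schema):
    df = coerce_dtypes(df, schema)         # 1. (assumed) coerce to the schema's dtypes
    df = run_parsers(df, schema)           # 2. parse/transform the raw data
    results = run_checks(df, schema)       # 3. check the schema's constraints
    if not results.all().all():
        report_failure_cases(df, results)  # 4. surface failures for refinement
    return df
```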
I understand your reasoning about the …

Agreed. The …
**Is your feature request related to a problem? Please describe.**

Based on the discussion in pyjanitor-devs/pyjanitor#703 and #249 with @ericmjl and @UGuntupalli, one of the ideas to come out of it is that there is a use case for the following data cleaning workflow:

…

**Describe the solution you'd like**

`pandera` should be a utility for data validation, leaving any data manipulation to core `pandas` or packages like `pyjanitor`. Therefore, it's within the scope of this package to provide users with access to the results of the `Check`s that are performed by schemas after validation (without raising a SchemaError) for them to use for their own purposes.

The solution I'm leaning towards right now is a decorator whose name is still TBD, but the functionality of the decorator (with the placeholder name of `parser`) is illustrated in the code sketch above. What `pa.parser` does is basically combine check_input and check_output with some extra semantics:

- the failure boolean dataframe is passed into the decorated function (`failed` in this example) as a positional argument
- `clean_data` is responsible for cleaning the data so that those failure cases are amended somehow
- `Check`s that output a Series/DataFrame that matches the index of the raw dataframe would be included in `failed`

The `parser` decorator might even have a `check_output: bool` kwarg that makes checking the output of the function optional (see the sketch below).

**Describe alternatives you've considered**

A method on `DataFrameSchema` like `get_check_results` to get the boolean Series/DataFrame of passes/failures.
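To make that last bullet concrete, the optional output check might be toggled like this (hypothetical API, mirroring the placeholder names above):

```python
@pa.parser(schema, check_output=False)
def clean_data(df, failed):
    # amend the failure cases without re-validating the returned dataframe
    ...
```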