Implement parsing functionality in dataframe schemas #252

Open
cosmicBboy opened this issue Jul 28, 2020 · 10 comments
Labels
enhancement New feature or request

Comments

@cosmicBboy
Collaborator

cosmicBboy commented Jul 28, 2020

Is your feature request related to a problem? Please describe.

Based on the discussion in pyjanitor-devs/pyjanitor#703 and #249 with @ericmjl and @UGuntupalli, one of the ideas to come out of it is that there is a use case for the following data cleaning workflow:

  1. check the properties of the values in a raw DataFrame
  2. get a boolean Series/DataFrame indicating where those checks passed/failed
  3. use the boolean Series/DataFrame for data cleaning/visualization of errors.

Describe the solution you'd like

pandera should be a utility for data validation, leaving any data manipulation to core pandas or packages like pyjanitor. Therefore, it's within the scope of this package to provide users with access to the results of the Checks that are performed by schemas during validation (without raising a SchemaError), for them to use for their own purposes.

The solution I'm leaning towards right now is a decorator whose name is still TBD, but the functionality of the decorator (with the placeholder name of parser) is illustrated in this code sketch:

import numpy as np
import pandas as pd
import pandera as pa


schema = pa.DataFrameSchema({
    "col1": Column(checks=pa.Check.greater_than_or_equal_to(0, name="positive")),
    "col2": Column(checks=pa.Check.isin(["a", "b", "c"], name="category_abc"))
})


@pa.parser(schema)
def clean_data(df, failed):
    """
    :param df: dataframe to clean
    :param failed: passed in by `pa.parser` decorator. A boolean dataframe with
        the same index as df, where columns are check names. True indicates
        failure cases.
    """
    clean_df = (
        # replace negative values with nans
        df.update_where(failed["positive"], "col1", np.nan)
        # filter out records with unknown categories
        .filter_on(failed["category_abc"], complement=True)
    )
    return clean_df


def load_data(file):
    return (
        pd.read_csv(file)
        .pipe(clean_data)
        # visualization, modeling, etc.
        ...
    )

What pa.parser does is basically combine check_input and check_output with some extra semantics:

  1. the schema is validated on the decorated function's input, producing the boolean vectors that @UGuntupalli needs to implement steps 3 and 4
  2. the parser decorator then passes the results into the function as a positional argument (a dataframe called failed in this example)
  3. the function body clean_data is responsible for cleaning the data so that those failure cases are amended somehow
  4. the parser decorator then re-executes data validation to make sure the function implements the correct data cleaning logic.
  5. Only Checks that output a Series/DataFrame that matches the index of the raw dataframe would be included in failed

The parser decorator might even have a check_output: bool kwarg that makes checking the output of the function optional.
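For illustration, a hedged sketch of that kwarg (check_output here is just the placeholder name from this proposal, not an existing pandera argument):

@pa.parser(schema, check_output=False)
def clean_data(df, failed):
    # the input is still validated and `failed` is passed in,
    # but the decorator skips re-validating the returned dataframe
    ...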

Describe alternatives you've considered

  • Add a method to DataFrameSchema like get_check_results to get the boolean Series/DataFrame of passes/failures
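A rough sketch of how that alternative might be used (get_check_results is hypothetical and not an existing DataFrameSchema method):

import pandera as pa

schema = pa.DataFrameSchema({
    "col1": pa.Column(checks=pa.Check.greater_than_or_equal_to(0, name="positive")),
})

# `df` is some raw dataframe to validate.
# hypothetical method: returns a boolean dataframe indexed like `df`,
# one column per named check, True where the check passed
check_results = schema.get_check_results(df)
clean_df = df.loc[check_results["positive"]]  # keep only rows passing "positive"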
@cosmicBboy cosmicBboy added the enhancement New feature or request label Jul 28, 2020
@cosmicBboy cosmicBboy changed the title Implement decorator that passes the failure boolean dataframe into the decorated function Implement parser decorator that passes the failure boolean dataframe into the decorated function Aug 1, 2020
@cosmicBboy
Collaborator Author

as part of this issue, rename the transformer argument to parser in pa.DataFrameSchema; it should have the same function signature and behavior as the pa.parser decorator.
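A hedged sketch of what that renamed argument could look like (the parser kwarg shown here is part of this proposal, not a released pandera feature):

schema = pa.DataFrameSchema(
    columns={"col1": pa.Column(checks=pa.Check.ge(0))},
    # same (df, failed) signature as a function decorated with the proposed pa.parser
    parser=lambda df, failed: df.loc[~failed.any(axis="columns")],
)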

@cosmicBboy cosmicBboy modified the milestones: 0.4.5 release, 0.4.6 release Aug 5, 2020
@Tankske

Tankske commented Aug 12, 2020

For inspiration, perhaps have a look at pandas_schema. I also found this article providing a solution to the requirements above. What I like about this method is:

  • You get a list of all errors
  • You can retrieve indexes from rows that failed the validation
  • You can see a validation error message

Some room for improvement that I see when tackling this requirement is to provide an extended version of what pandas-schema provides (0, '{row: 2, column: "dec3"}: "ee" is not decimal'):

  • Keep the index, e.g. 0
  • Extend the dictionary with more metadata about the error to improve usability, e.g.
    {row: 2, column: "dec3", value: "ee", validation_error: "decimal_validation"}
  • Have a human-readable error message, e.g. '"ee" is not decimal'

This could help when building pipelines. Certain validation errors are good to know as 'warnings', while other validation errors impact the continuation of a pipeline. Being able to filter on those and to export the validation results (errors) would greatly improve the use of validations in pipelines and traceability. E.g. a failed decimal validation could result in either:

  • Removing these records, as further processing would fail otherwise
  • Imputing a default to continue if feasible (sidetrack in a pipeline)
def impute_age():
    average_age = 45      # or a function based on other values
    return average_age

checks = [
    # a check that removes the record upon failure
    pa.Check(lambda s: decimal_validation(s), element_wise=False, error_msg="value is not a decimal", filter=True),

    # a check that updates the value upon failure
    pa.Check(lambda s: age_validation(s), element_wise=False, error_msg="age is not considered within normal ranges", filter=False, update=impute_age()),
]

@cosmicBboy
Collaborator Author

thanks for the feedback @Tankske, will consider it when working out the specific implementation for this issue.

The Check object has an error argument that serves the same purpose as error_msg in your code example, so that attribute can be pulled out of a check object.
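For reference, a minimal example of the existing error argument on Check (the check function here is arbitrary):

import pandera as pa

# the human-readable message lives on the check itself
positive_check = pa.Check(lambda s: s > 0, error="value must be positive")
print(positive_check.error)  # "value must be positive"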

The main problem this issue is trying to tackle is helping users, during the development/debugging process, apply pandas data transformation logic to rows/columns that don't fulfill the assumptions made in the schema. This activity would be facilitated by having a data structure that's isomorphic to the thing that you want to operate on (the invalid dataframe), which is why I'm initially thinking that the failed data structure should, at least in part, be a dataframe.

For error reporting and human readability, I think perhaps something close to the lazy validation output would be nice (which, in fact, I modeled a little after pandas-schema, but in a tabular format), so perhaps failed should be a namedtuple containing the boolean dataframe and the error-reporting dataframe:

@pa.parser(schema)
def clean_data(df, failed):
    """
    :param df: dataframe to clean
    :param failed: passed in by `pa.parser` decorator.
    """
    check_results = failed.check_results  # boolean dataframe where checks failed
    failure_cases = failed.cases  # dataframe of failure cases with human-readable output
    clean_df = (
        # replace negative values with nans
        df.update_where(check_results["positive"], "col1", np.nan)
        # filter out records with unknown categories
        .filter_on(check_results["category_abc"], complement=True)
    )
    return clean_df
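A minimal sketch of what that failed namedtuple could look like (the names are placeholders from this discussion, not an existing pandera type):

from typing import NamedTuple
import pandas as pd

class Failed(NamedTuple):
    check_results: pd.DataFrame  # boolean dataframe, True where a check failed
    cases: pd.DataFrame          # human-readable failure cases, like the lazy validation output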

Re: filtering and imputation, I'd be open to expanding pandera's scope from data validation to data parsing (filtering, imputation, value replacement, etc.), but I'd like to design it carefully so as to not add too much complexity to the user experience. The reason I've scoped this issue the way I did is that it maintains the separation of concerns between validation and data manipulation, delegating the latter to core pandas or tools built on top of it like pyjanitor.

I think the code sketch that you provided is a good start, and if you'd like to pursue this direction, I'd encourage you to open up a new issue articulating the problem and solution design that you have in mind.

@JoyMonteiro

Hello,

just checking whether this functionality can be implemented using available APIs in pandera, or whether we have to wait for this PR to land.

TIA,
Joy

@cosmicBboy
Collaborator Author

hi @JoyMonteiro this functionality won't be available for another few releases... supporting parsing is something I want to design carefully, the code sketches above are likely not going to be what the final implementation looks like.

To help with the design process, can you describe what your parsing use case is?

@JoyMonteiro

I see. thanks for letting me know.

We are trying to build a tool to assist in cleaning/curating data. It would consist of a UI, probably made with panel, where the user uploads an Excel file. This file will be parsed and cleaned (to some extent) in the backend, and all rows/columns which don't pass validation checks will be highlighted and displayed to the user.

This would likely be an iterative process until the dataframe reaches a certain data quality (meaning it obeys a detailed schema).

This is where I hoped to use this functionality. Having it in pandera would be nice because the final validation of the cleaned dataframe would be via pandera. Right now we are doing the validation using a bunch of functions, and writing a more maintainable version of the same would imply duplicating things that pandera already does.

@cosmicBboy
Collaborator Author

cosmicBboy commented Jul 6, 2021

Pandera Parsing

As referenced in #542, imo pydantic's validator decorator is a bit of a misnomer, as what it's doing is (i) parsing raw data values and (ii) emitting an error in the case of invalid ones (the validation bit).

We can map (ii) easily onto pandera's concept of checks, which return booleans mainly to indicate which elements in the dataframe failed the check... ultimately a SchemaError is raised (or SchemaErrors in the case of lazy validation), which contains the data on which row/column indexes contained failing values.
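For example, with lazy validation that information is already reachable today on the raised exception (existing pandera API):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"col1": pa.Column(int, pa.Check.ge(0))})
df = pd.DataFrame({"col1": [1, -1]})

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # tabular report of failed checks, including the failing column, check, and index
    print(err.failure_cases)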

pandera is sort of a parsing tool because schema(df) or schema.validate(df) is guaranteed to return a valid dataframe or raise an error for values that are not in the domain of the schema specification... what it currently lacks is a way of specifying transformations that can convert potentially invalid raw values into valid values; coerce=True is currently the only transformation that pandera schemas perform.
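For instance, coercion is the one transformation a schema will do today:

import pandas as pd
import pandera as pa

# existing behavior: raw string values are cast to float before checks run
schema = pa.DataFrameSchema({"col1": pa.Column(float, coerce=True)})
validated = schema.validate(pd.DataFrame({"col1": ["1.0", "2.5"]}))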

Here are a few ways to go about implementing the parsing functionality.

Proposal 1: Parsing as a function

As hinted at in this issue's description: #252 (comment), this proposal would provide a functional interface for users to parse a raw dataframe given the failure cases produced by a schema:

import numpy as np
import pandas as pd
import pandera as pa


schema = pa.DataFrameSchema({
    "col1": Column(checks=pa.Check.ge(0)),
    "col2": Column(checks=pa.Check.isin(["a", "b", "c"]))
})


@pa.parse(schema)
def clean_data(df, failure_cases):
    """
    :param df: dataframe to clean
    :param failure_cases: passed in by `pa.parse` decorator. A boolean dataframe with
        the same index as df, where columns are check names. True indicates
        failure cases.
    """
    clean_df = (
        # replace negative values with nans
        df.update_where(failure_cases["col1"]["ge"], "col1", np.nan)
        # filter out records with unknown categories
        .filter_on(failure_cases["col2"]["isin"], complement=True)
    )
    return clean_df

# - SchemaModel syntax -
class Schema(pa.SchemaModel):
    col1 = pa.Field(ge=0)
    col2 = pa.Field(isin=["a", "b", "c"])

@pa.parse
def clean_data(df: pa.typing.DataFrame[Schema], failure_cases):
    ...


def load_data(file):
    return (
        pd.read_csv(file)
        .pipe(clean_data)
        # visualization, modeling, etc.
        ...
    )

Proposal 2: A single global parser function supplied to a schema

Very similar to proposal 1, but baked into a schema object (also similar to the now-deprecated transformer kwarg):

schema = pa.DataFrameSchema(
    columns={
        "col1": Column(checks=pa.Check.ge(0)),
        "col2": Column(checks=pa.Check.isin(["a", "b", "c"]))
    },
    parser=lambda df, failure_cases: ...
)

class Schema(pa.SchemaModel):
    col1 = pa.Field(ge=0)
    col2 = pa.Field(isin=["a", "b", "c"])

    # the parser method might be a reserved name in SchemaModel
    @classmethod
    def parser(cls, df, failure_cases):
        ...

Proposal 3: Column- and Dataframe-level parsers

Closer in spirit to pydantic:

schema = pa.DataFrameSchema(
    columns={
        # a single parser that replaces negative values by 0
        "column": pa.Column(int, parsers=pa.Parser(lambda series: series.mask(series <= 0, 0)))
    },
)

class Schema(pa.SchemaModel):
    column: int

    @pa.parser("column")
    def column_gt_zero(cls, series):
        return series.mask(series <= 0, 0)

# dataframe-level parsing
schema = pa.DataFrameSchema(
    parsers=pa.Parser(lambda df: df.mask(df < 0, 0))
)

class Schema(pa.SchemaModel):

    class Config:
        gt = 0

    @pa.dataframe_parser("column")
    def column_gt_zero(cls, df, check_results):
        return df.mask(~check_results["gt"], 0)

Proposal 4: Parsers as a Special Type of Check

Similar to 3, but instead of introducing a new keyword, introduce Parser as a special type of Check. This also harkens back to something @jeffzi proposed here

schema = pa.DataFrameSchema(
    columns={
        # a single parser that replaces negative values by 0
        "column": pa.Column(int, checks=pa.Parser(lambda series: series.mask(series <= 0, 0)))
    },
)

The parser function would have the same semantics as pydantic validators, so users can also define parsers that are equivalent to checks:

# parsers can be equivalent to checks
def gt_0(series):
    failure_cases = series <= 0
    if failure_cases.any():
        raise SchemaError(..., failure_cases=failure_cases)
    return series


gt_0_parser = pa.Parser(gt_0, ...)

And similar to pydantic, parser functions might also support depending on checks/parsers that come before them:

schema = pa.DataFrameSchema(
    columns={
        # a single parser that replaces negative values by 0
        "column": pa.Column(
            int,
            checks=[
                pa.Check.gt(0),
                pa.Parser(lambda series, check_results: series.mask(~check_results["gt"], 0))
            ]
        )
    },
)

Pros and Cons

(1) and (2)

Pros

  • Simple to implement and reason about: a single function parses the dataframe
  • Don't have to worry about check and parsing order

Cons

  • Off-loads the complexity of parsing to the user. The parser is basically a user-defined function (or set of functions) that will look like ordinary pandas data-cleaning code.

(3) and (4)

Pros

  • Each parser has a well-defined scope
  • Less complex for the user to implement
  • More familiar to pydantic users

Cons

  • Many open questions and new ideas potentially need to be introduced into pandera to achieve feature-parity with pydantic, e.g.:
    • should parsers be an alternative way of validating dataframes, or should they interact with checks somehow, like having access to check boolean outputs?
    • if parsers and checks do interact, should checks be re-run after parsing? Or is the user responsible for making sure the parsers output valid values?
    • should there be multiple parsers per column? How about multiple parsers at the dataframe level? Are there unnecessary complications resulting from multiple parsers per column?
    • what is the execution order of parsers? Column-level -> dataframe-level, or vice versa?

Right now I'm leaning towards (4), but I'm curious what your thoughts are @jeffzi @d-chambers @JoyMonteiro

@jeffzi
Collaborator

jeffzi commented Jul 7, 2021

I think your proposals are not incompatible:

  • (1) and (2) are called after validation and hand in the failed cases. The latter is a desirable feature missing from (3) and (4).
    SchemaErrors.failure_cases itself is a bit rough imo.

  • (3) and (4) are called before validation.

At my work, I created a function that splits out failed cases after validation so that they can be stored and debugged later. SchemaError could have a method to separate failed cases with their associated errors. That would be a third alternative for post-validation processing, without introducing a new argument or decorator.

try:
    events = schema.validate(events, lazy=True)
except pa.errors.SchemaErrors as err:
    events = extract_bad_records(err)  # split failed cases and append a column "error"
    is_bad_event = ~events["error"].isnull()
    bad_events = events[is_bad_event]
    events = events[~is_bad_event].drop(columns="error")
    ...

I would pick one solution each for pre- and post-validation since they do not have the same purpose. My preference:

  • pre: The choice between (3) and (4) depends on the direction of the Check API. Specialized Check classes were discussed in Is there a way to generate informative error on custom checks? #429. Is that option back on the table?
  • post: I'd rather have failed cases returned from validate() or improve the UX of the Exception (see proposal above).

re: naming. attrs uses validator for check and converter for what you called parser. Basically attrs implements (3). Attrs converters are called before validators.
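For reference, a minimal attrs example of that ordering (the converter runs before the validator):

import attr

@attr.s
class Record:
    # converter (parsing) runs first, then the validator (checking)
    age = attr.ib(converter=int, validator=attr.validators.instance_of(int))

Record(age="42")  # "42" is converted to 42 before the validator runs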

My order of preference would be transformer >= converter > parser. transformer is a well-known concept, converter says what it does, and parser is often associated with loading data from a particular format (e.g. json -> dataframe).

@cosmicBboy
Collaborator Author

cosmicBboy commented Jul 8, 2021

cool, thanks for your feedback @jeffzi, I think it makes sense to distinguish between pre-processing and post-processing. re: naming, I think parse is a perfectly suitable name for converting data from a raw form into a clean form... in fact, pydantic even makes that distinction:

pydantic is primarily a parsing library, not a validation library. Validation is a means to an end: building a model which conforms to the types and constraints provided.

And there was even an issue discussing the renaming of validate -> parse, though I think for maintainability reasons this wasn't done.

To clarify my thinking around this feature, here's a higher-level proposal about the pandera parsing/validation pipeline order of execution (inspired by the way attrs does things):

  1. check the data type or cast the column into that type if coerce=True (perhaps coerce=True should become the default in a future version)
  2. parse the column/dataframe into some normalized set of values (solutions 3 or 4 above)
  3. check that the constraints are actually met by the parsed values
  4. postprocess the dataframe for debugging/reporting purposes. Users can also handle check failure cases by whatever means they deem necessary, but this is outside of the correctness guarantees that pandera provides.

Note that the user interaction model that this implies is that pandera intends step 4 only as a way for the user to further refine the parsing functionality in step 2 in order to fulfill the constraints established by step 3.

I like solution (3), which keeps parsing and checking separate: this way we get a nice continuity with the data synthesis strategies, which are still coupled with Checks and don't have to change as a result of introducing this feature. This means that parsers basically serve as an extension of the dtype + coerce=True combination, allowing users to further refine the contents of a dataframe beyond the primitive data types.
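A hedged sketch of how that could look under solution (3), combining the existing dtype + coerce=True behavior with the proposed parsers kwarg (not an existing pandera argument):

schema = pa.DataFrameSchema({
    "column": pa.Column(
        float,
        coerce=True,                                      # step 1: cast to the dtype
        parsers=pa.Parser(lambda s: s.mask(s < 0, 0.0)),  # step 2: normalize values (proposed)
        checks=pa.Check.ge(0),                            # step 3: verify constraints hold
    )
})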

Totally agreed on better UX for handling schema errors, though I think your proposal needs a little more refinement, since the error column in the code snippet would need to be denormalized because there can be multiple errors for a single index/column location... I think we can discuss further in a separate issue.

@jeffzi
Collaborator

jeffzi commented Jul 8, 2021

I understand your reasoning about the parser name, thanks for clarifying. I completely agree with the flow you described 👍

Totally agreed on better UX for handling schema errors, though I think your proposal needs a little more refinement,

Agreed. The error column actually contains all errors related to the row, in a JSON lines format. If an error is schema-wide, it is duplicated in every row... That's a bit hacky and very specific to my use case. A pandera-native solution should not make such strong assumptions. I'll think about a generalized approach.

@cosmicBboy cosmicBboy changed the title Implement parser decorator that passes the failure boolean dataframe into the decorated function Implement parsing functionality in dataframe schemas Jul 15, 2021