
Enhancement: drop invalid rows on validate with new param #1189

Merged: 38 commits merged into main from feature/drop-invalid-rows, Jun 23, 2023

Conversation

@kykyi (Contributor) commented May 16, 2023

In response to issue: #44

Overview

Currently a user must implement try/except logic in order to drop invalid rows in a df or series being validated. This PR adds a new drop_invalid kwarg to the schema constructor to allow invalid rows to be dropped during df validation.

An important note is that for the drop_invalid logic to work, lazy must be true! I don't know if this is a good thing or a bad thing, but it's a thing. Otherwise we would need to be more specific and probably handle this logic at the point of raising (or not raising) the SchemaError, rather than at the point of raising (or not raising) the SchemaErrors (plural), as this PR does 🤖

How it looks

Before, some version of:

import pandas as pd
from pandera import Column, DataFrameSchema
from pandera.errors import SchemaError, SchemaErrors

schema = DataFrameSchema({"col": Column(str)})
df = pd.DataFrame({"col": [1, "the number one"]})

try:
    schema.validate(df, lazy=True)
except (SchemaErrors, SchemaError) as exc:
    failure_case_index = exc.failure_cases["index"]
    df.drop(index=failure_case_index, inplace=True)

Now:

schema = DataFrameSchema({"col": Column(str)}, drop_invalid=True)
df = pd.DataFrame({"col": [1, "the number one"]})

schema.validate(df, inplace=True, lazy=True)  # invalid rows are dropped, no exception raised
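
As noted above, drop_invalid only takes effect under lazy validation. A minimal sketch of the non-lazy path, reusing the schema and df from above and assuming eager validation still raises on the first failure:

schema.validate(df)  # without lazy=True: raises SchemaError instead of dropping the row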

To-do before merge:

  • Run pre-commit hooks 😅

kykyi and others added 12 commits April 14, 2023 09:43
@kykyi marked this pull request as draft May 16, 2023 12:53
environment.yml Outdated
@@ -47,7 +47,6 @@ dependencies:

# testing
- isort >= 5.7.0
- codecov
Contributor Author (kykyi):

Unsure what this change is, fairly sure it was removed in #1136

Contributor Author (kykyi):

Ah I need to rebase with the main of the non-forked repo!



def drop_invalid(validate: Callable) -> Callable:
"""Decorator around `validate()` methods to handle the dropping of invalid
Contributor Author (kykyi):

Todo: docstring format to be updated.

)


def drop_invalid(validate: Callable) -> Callable:
Contributor Author (kykyi):

@cosmicBboy not thrilled about the naming here, but couldn't find a convention to follow. Think it should be something a bit more verbose, but I guess it does what it says!
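
For context, a minimal sketch of the shape such a decorator might take — this is an illustration based on the discussion in this thread, not the code under review; the failure-case handling is an assumption:

from functools import wraps
from typing import Callable

from pandera.errors import SchemaError, SchemaErrors


def drop_invalid(validate: Callable) -> Callable:
    """Wrap a validate() method so rows appearing in failure cases are dropped."""

    @wraps(validate)
    def wrapper(self, check_obj, *args, **kwargs):
        try:
            return validate(self, check_obj, *args, **kwargs)
        except (SchemaErrors, SchemaError) as exc:
            # drop rows whose index appears in the collected failure cases;
            # metadata-level failures carry a None index and are skipped
            invalid_index = exc.failure_cases["index"].dropna()
            return check_obj[~check_obj.index.isin(invalid_index)]

    return wrapper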

@kykyi (Contributor) commented May 22, 2023

Rethinking this approach, don't think the decorator is the best idea

@kykyi marked this pull request as ready for review May 23, 2023 08:47
]

# run the checks
error_handler = self.__run_checks_and_handle_errors(
Contributor Author (kykyi):

@cosmicBboy how do you feel about name mangling? Can just as easily swap to a single _
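
(For readers unfamiliar with the term: the double leading underscore triggers Python's name mangling, which is what's being asked about here — a quick standalone illustration, not project code:)

class Backend:
    def __run_checks_and_handle_errors(self):
        # mangled by Python to _Backend__run_checks_and_handle_errors
        return "ran checks"

    def _run_checks(self):
        # single underscore: private by convention only, no mangling
        return "ran checks"


b = Backend()
b._run_checks()                             # fine
b._Backend__run_checks_and_handle_errors()  # fine, via the mangled name
# b.__run_checks_and_handle_errors()        # AttributeError outside the class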

Collaborator (cosmicBboy):

let's do a little refactoring here:

  • if we move the core checks into their own run_core_checks method (same for run_core_parsers), we clean this up a bit. (I'm not familiar with the name mangling, but I don't think it's needed here: exposing a public method for running the core checks makes sense to me.)
  • we don't need the try... except here, since we have an error_handler object that we can use to check if there are schema errors.
  • The drop_invalid logic can live underneath if error_handler.collected_errors: and return the check object.
  • Probably worth creating a drop_invalid_data method to encapsulate the dropping logic (see the sketch after this list). I was also thinking about how to handle metadata-level failures (e.g. wrong data type): should those columns be dropped too?
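
A structural sketch of what these bullets suggest — not standalone-runnable, since ErrorHandler, SchemaErrors, and the backend methods are pandera internals; the run_core_checks and drop_invalid_data names come from the bullets above:

def validate(self, check_obj, schema, *, lazy: bool = False):
    error_handler = ErrorHandler(lazy)

    # run the core checks; errors are collected on the handler, not raised
    check_obj = self.run_core_checks(check_obj, schema, error_handler)

    if error_handler.collected_errors:
        if getattr(schema, "drop_invalid", False):
            # drop the offending rows instead of raising
            return self.drop_invalid_data(check_obj, error_handler)
        raise SchemaErrors(
            schema=schema,
            schema_errors=error_handler.collected_errors,
            data=check_obj,
        )
    return check_obj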

Contributor Author (kykyi):

Thanks for the tips! I have updated the PR accordingly.

On the last point though: currently the PR removes the rows in the check_obj whose index appears in the failure cases, e.g. for being an incorrect data type. If the whole column were incorrect, then the whole check_obj would be wiped 😱

Contributor Author (kykyi):

I think removing invalid columns is slightly different from dropping invalid rows, and it could make more sense to treat them as distinct cases with distinct APIs?

@codecov bot commented Jun 4, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.01 🎉

Comparison is base (5792fb2) 97.23% compared to head (e721458) 97.25%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1189      +/-   ##
==========================================
+ Coverage   97.23%   97.25%   +0.01%     
==========================================
  Files          65       65              
  Lines        5066     5101      +35     
==========================================
+ Hits         4926     4961      +35     
  Misses        140      140              
Impacted Files Coverage Δ
pandera/api/pandas/array.py 100.00% <ø> (ø)
pandera/api/pandas/components.py 99.11% <ø> (ø)
pandera/api/pandas/model.py 92.25% <ø> (ø)
pandera/strategies/pandas_strategies.py 97.82% <ø> (ø)
pandera/api/base/schema.py 100.00% <100.00%> (ø)
pandera/api/pandas/container.py 99.25% <100.00%> (+<0.01%) ⬆️
pandera/api/pandas/model_config.py 100.00% <100.00%> (ø)
pandera/backends/base/__init__.py 100.00% <100.00%> (ø)
pandera/backends/pandas/array.py 98.29% <100.00%> (+0.14%) ⬆️
pandera/backends/pandas/base.py 100.00% <100.00%> (ø)
... and 2 more

... and 1 file with indirect coverage changes


kykyi and others added 8 commits June 4, 2023 11:50
@cosmicBboy (Collaborator) left a comment

great work! see inline comment for refactoring suggestions


kykyi added 2 commits June 5, 2023 08:39
@cosmicBboy (Collaborator):

This is awesome!

Apologies for switching things up a little bit more... but the more I think about it, the more I think it makes sense to make drop_invalid a property of the schema, not a kwarg to the validate method. The reasons being:

  1. This behavior falls under the semantics of data parsing: transforming the data into the desired state. This is akin to coerce=True and strict="filter", because it modifies the returned data into the valid form.
  2. Kwargs to validate modify what is validated, not what data is returned (e.g. head, tail, sample, etc).
  3. The drop_invalid option does not apply well to out-of-core dataframes like pyspark and dask dataframes... adding this option to the validate method of the parent schema class won't make sense, because the kwarg won't apply to the other dataframe libraries that pandera will eventually support, and having kwargs that don't do anything for specific subclasses adds confusion to the API.

Let me know if you need help making these modifications, I'm happy to do them!

We'll also need docs for this new feature.

@kykyi (Contributor) commented Jun 6, 2023

@cosmicBboy don't be sorry, better to get it right 👌

So just to get the api right, are you imagining something like:

import pandas as pd
from pandera import Check, Column, DataFrameSchema

# include the kwarg here, rather than `.validate(df, drop_invalid=True)`
schema = DataFrameSchema({"col": Column(int, checks=[Check(lambda x: x >= 3)])}, drop_invalid=True)
df = pd.DataFrame({"col": [1, 2, 3, 4, 5]})

schema.validate(df)
# => pd.DataFrame({"col": [3, 4, 5]})

Shouldn't be too onerous to change now that I am familiar with the repo 👍

@cosmicBboy (Collaborator):

yep! that's correct. The changes to the class-based API need to be made too, so the BaseConfig class needs to add a drop_invalid argument as well: https://github.com/unionai-oss/pandera/blob/main/pandera/api/pandas/model_config.py
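
For illustration, a sketch of how the class-based API might look once BaseConfig carries the option — using the drop_invalid name as it stands at this point in the PR (it is renamed later in the thread), so treat the attribute name as provisional:

import pandas as pd
import pandera as pa


class MySchema(pa.DataFrameModel):
    col: int = pa.Field(ge=3)

    class Config:
        drop_invalid = True  # provisional name from this PR


df = pd.DataFrame({"col": [1, 2, 3, 4, 5]})
MySchema.validate(df, lazy=True)  # invalid rows dropped instead of raising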

Add drop_invalid attr to BaseConfig
@kykyi (Contributor) commented Jun 6, 2023

@cosmicBboy I have pushed up the requested changes! Am I correct that the docs should go in /docs/source/<new docs>.rst? I'll get onto those now 👍

)

if lazy and error_handler.collected_errors:
    if hasattr(schema, "drop_invalid") and schema.drop_invalid:
Contributor Author (kykyi):

To guard against a repeat of #1188

Comment on lines 112 to 114

if hasattr(schema, "drop_invalid") and schema.drop_invalid:
    check_obj = self.drop_invalid_data(check_obj, error_handler)
    return check_obj
Collaborator (cosmicBboy):

weird, it seems like the test_drop_invalid_for_dataframe_schema test should cover this case no? Is this a codecov bug?

Contributor Author (kykyi):

The hasattr? That was just some defensive code as there was a recent issue with the new default attr not being available on pickled schemas. Can drop if you think it is overkill

Collaborator (cosmicBboy):

no, it's okay to have that in there, it's just that codecov is complaining that this part of the code wasn't executed, meaning that line 113 wasn't executed during CI. But it seems like test_drop_invalid_for_dataframe_schema should have caused this code to run, no?

Contributor Author (kykyi):

Ah no it doesn't appear so locally! Codecov saving us 😆

This was my bad. So this code in this file:

            if is_table(check_obj[column_name]):
                for i in range(check_obj[column_name].shape[1]):
                    validate_column(
                        check_obj[column_name].iloc[:, [i]], column_name
                    )
            else:
                if hasattr(schema, "drop_invalid") and schema.drop_invalid:
                    # replace the check_obj with the validated check_obj
                    check_obj = validate_column(
                        check_obj, column_name, return_check_obj=True
                    )
                else:
                    validate_column(check_obj, column_name)

        if lazy and error_handler.collected_errors:
            if hasattr(schema, "drop_invalid") and schema.drop_invalid: # the line in question!
                check_obj = self.drop_invalid_data(check_obj, error_handler)
                return check_obj
            else:
                raise SchemaErrors(
                    schema=schema,
                    schema_errors=error_handler.collected_errors,
                    data=check_obj,
                )

Raises an error in validate_column but does not rescue it. The rescue is done higher up in ArraySchemaBackend.validate(). This is now obvious, as there is no except between validate_column and if lazy and error_handler.collected_errors!

I've removed these lines and pushed the changes 👌

@cosmicBboy (Collaborator) commented Jun 8, 2023

The PR is looking like it's in great shape!

Two more things:

How to handle invalid columns

Based on the current implementation of drop_invalid_data, it looks like it's only considering data-level failure cases (which are associated with an index). However, the index column will be None for failure-case rows at the metadata level, e.g. if a column isn't of the correct dtype (when coerce=False) or the column couldn't be coerced (when coerce=True). We have three options here:

  • (a) drop the columns
  • (b) raise a SchemaError(s)
  • (c) consider the drop_invalid kwarg to be a row-wise operation at the data-level, such that any row containing any invalid value is dropped.

I'm leaning towards (c) because it's the easiest to reason about, and it also means no further code changes need to be made to this PR :)
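
A rough sketch of option (c)'s row-wise semantics — the failure_cases["index"] layout follows the snippets earlier in this thread, but the helper below is illustrative, not the PR's implementation:

import pandas as pd


def drop_invalid_data(check_obj: pd.DataFrame, failure_cases: pd.DataFrame) -> pd.DataFrame:
    """Drop any row containing an invalid value (option (c): row-wise, data-level only)."""
    # metadata-level failures (e.g. a wrong dtype for a whole column) carry a
    # None index, so dropna() keeps them out of the row-wise drop
    invalid_index = failure_cases["index"].dropna().unique()
    return check_obj[~check_obj.index.isin(invalid_index)]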

Docs

With this feature I think it's worth creating a new docs page (drop_invalid_data.rst) to explain how this feature works across all the schema objects (DataFrameSchema, SeriesSchema, Column, DataFrameModel). If you write up a draft I can help edit/review.

Again, thanks so much for your work on this, it's a huge feature!

@kykyi (Contributor) commented Jun 9, 2023

Hi @cosmicBboy I have added some draft docs 📝 . I also noticed I hadn't properly implemented the drop logic for the Model based schema so have updated that too.

Re the scope of "drop invalid", I think your option (c) sounds reasonable. Dropping columns seems like a different operation, and as a pandera user it could be kind of confusing for drop_invalid to remove columns as well as rows, since this could reduce your df to pd.DataFrame([]), which would be an average UX!

So if we are sticking with the implemented logic, I think the PR is good for a review ✅ 🙇 !

@kykyi (Contributor) commented Jun 9, 2023

Also @cosmicBboy as a follow up to this PR, I'm thinking of adding a schema.invalid_data field. Or should that be on this PR as well?

kykyi added 5 commits June 9, 2023 13:39
@cosmicBboy (Collaborator):

Thanks! It looks like there are some rst formatting issues: https://github.com/unionai-oss/pandera/actions/runs/5221954728/jobs/9426870512?pr=1189#step:24:568

You can test the build out locally with

pip install -r requirements-docs.txt
make docs

@cosmicBboy (Collaborator):

as a follow up to this PR, I'm thinking of adding a schema.invalid_data field. Or should that be on this PR as well?

What would that field do? In any case it should be a separate PR.

@kykyi (Contributor) commented Jun 9, 2023

@cosmicBboy idk if this is an ongoing issue with the repo or just me, but I get this dependency hell when I run pip install -r requirements-docs.txt

INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.
  Using cached twine-3.2.0-py3-none-any.whl (34 kB)
  Using cached twine-3.1.1-py3-none-any.whl (36 kB)
  Using cached twine-3.1.0-py3-none-any.whl (35 kB)
  Using cached twine-3.0.0-py3-none-any.whl (35 kB)
  Using cached twine-2.0.0-py3-none-any.whl (34 kB)
  Using cached twine-1.15.0-py2.py3-none-any.whl (35 kB)
  Using cached twine-1.14.0-py2.py3-none-any.whl (35 kB)
  Using cached twine-1.13.0-py2.py3-none-any.whl (34 kB)
  Using cached twine-1.12.1-py2.py3-none-any.whl (34 kB)
INFO: pip is looking at multiple versions of recommonmark to determine which version is compatible with other requirements. This could take a while.
Collecting recommonmark
  Using cached recommonmark-0.7.1-py2.py3-none-any.whl (10 kB)
  Using cached recommonmark-0.7.0-py2.py3-none-any.whl (10 kB)
  Using cached recommonmark-0.6.0-py2.py3-none-any.whl (10 kB)
  Using cached recommonmark-0.5.0-py2.py3-none-any.whl (9.8 kB)
  Using cached recommonmark-0.4.0-py2.py3-none-any.whl (9.4 kB)
Collecting commonmark<=0.5.4
  Using cached CommonMark-0.5.4.tar.gz (120 kB)
  Preparing metadata (setup.py) ... done
Collecting recommonmark
  Using cached recommonmark-0.3.0-py2.py3-none-any.whl (9.4 kB)
  Using cached recommonmark-0.2.0-py2.py3-none-any.whl (5.3 kB)
INFO: pip is looking at multiple versions of recommonmark to determine which version is compatible with other requirements. This could take a while.
  Using cached recommonmark-0.1.1-py2.py3-none-any.whl (5.0 kB)
  Using cached recommonmark-0.1.0-py2.py3-none-any.whl (5.0 kB)
  Using cached recommonmark-0.0.2-py2.py3-none-any.whl (5.0 kB)
  Using cached recommonmark-0.0.1-py2.py3-none-any.whl (5.0 kB)
INFO: pip is looking at multiple versions of sphinx-copybutton to determine which version is compatible with other requirements. This could take a while.
Collecting sphinx-copybutton
  Using cached sphinx_copybutton-0.5.2-py3-none-any.whl (13 kB)
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.
  Using cached sphinx_copybutton-0.5.1-py3-none-any.whl (13 kB)
  Using cached sphinx_copybutton-0.5.0-py3-none-any.whl (12 kB)
  Using cached sphinx_copybutton-0.4.0-py3-none-any.whl (12 kB)
  ...

If I just manually install I then get:

(pandera-dev) generalaccess@Pave-Macbook pandera % make docs
rm -rf docs/**/generated docs/**/methods docs/_build docs/source/_contents
python -m sphinx -E "docs/source" "docs/_build" && make -C docs doctest
Running Sphinx v7.0.1
[autosummary] generating autosummary for: CONTRIBUTING.md, checks.rst, dask.rst, data_format_conversion.rst, data_synthesis_strategies.rst, dataframe_models.rst, dataframe_schemas.rst, decorators.rst, drop_invalid_data.rst, dtype_validation.rst, ..., reference/generated/pandera.errors.SchemaInitError.rst, reference/generated/pandera.extensions.rst, reference/index.rst, reference/io.rst, reference/schema_inference.rst, reference/strategies.rst, schema_inference.rst, series_schemas.rst, supported_libraries.rst, try_pandera.rst

Extension error (sphinx.ext.autosummary):
Handler <function process_generate_options at 0x11ea86ca0> for event 'builder-inited' threw an exception (exception: no module named pandera.io)
make: *** [docs] Error 2

😅

@cosmicBboy (Collaborator):

hmm, okay lemme see if I can debug... yeah I needa fix the pip environment for docs, I use conda and need to maintain both 😅

@cosmicBboy (Collaborator):

while I'm at it, mind if I update the option to drop_invalid_row instead of drop_invalid? I think the former argument name communicates the intent more clearly

@kykyi (Contributor) commented Jun 9, 2023

Yeah change whatever you like! Maybe rows over row?
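
Assuming the plural naming wins out, usage would look something like this sketch (kwarg name per the rename discussed above, so treat it as provisional):

import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema({"col": Column(str)}, drop_invalid_rows=True)
df = pd.DataFrame({"col": [1, "the number one"]})

schema.validate(df, lazy=True)  # returns df with the invalid row dropped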

@kykyi (Contributor) commented Jun 13, 2023

@cosmicBboy are you still making changes or any action needed from me here?

@cosmicBboy (Collaborator) left a comment

made the changes @kykyi, merging!

@cosmicBboy merged commit 6c6eb57 into unionai-oss:main Jun 23, 2023
@kykyi deleted the feature/drop-invalid-rows branch June 27, 2023 09:14