Add column order validation #352

Merged 7 commits on Dec 12, 2020


jeffzi
Collaborator

@jeffzi jeffzi commented Dec 10, 2020

This PR adds an ordered argument to DataFrameSchema and to model.BaseConfig for the model API. It closes #342.

A side effect is that a permutation is reported once per misplaced column, so swapping two columns raises 2 schema errors:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"a": pa.Column(int), "b": pa.Column(int)}, ordered=True)
df = pd.DataFrame([[1, 2]], columns=["b", "a"])
schema.validate(df, lazy=True)
#> Traceback (most recent call last):
#> <ipython-input-1-f6ae10d7ea6a> in <module>
#> ----> 1 schema.validate(df, lazy=True)
#> ~/Projects/development/pandera/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
#>     582         if lazy and error_handler.collected_errors:
#>     583             raise errors.SchemaErrors(
#> --> 584                 error_handler.collected_errors, check_obj
#>     585             )
#>     586 
#> SchemaErrors: A total of 2 schema errors were found.
#> Error Counts
#> - column_not_ordered: 2
#> Schema Error Summary
#>                                       failure_cases  n_failure_cases
#> schema_context  column check                                        
#> DataFrameSchema <NA>   column_ordered        [a, b]                2
#> Usage Tip
#> Directly inspect all errors by catching the exception:
#> ```
#> try:
#>     schema.validate(dataframe, lazy=True)
#> except SchemaErrors as err:
#>     err.failure_cases  # dataframe of schema errors
#>     err.data  # invalid dataframe
#> ```

Created on 2020-12-11 by the reprexpy package
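
The ordered check can be pictured as a slot-by-slot comparison between the schema's column order and the dataframe's. The sketch below is a plain-Python illustration, not pandera's implementation: find_out_of_order is a hypothetical helper, and it assumes column names are unique. It shows why swapping two columns yields two failure cases, one per misplaced column.

```python
from typing import List


def find_out_of_order(schema_columns: List[str], frame_columns: List[str]) -> List[str]:
    """Return the dataframe columns that are not at the position the schema expects.

    Hypothetical helper for illustration only; assumes unique column names.
    """
    # Ignore schema columns absent from the dataframe (e.g. optional ones)
    # and dataframe columns the schema does not know about.
    expected = [name for name in schema_columns if name in frame_columns]
    actual = [name for name in frame_columns if name in schema_columns]
    # Both lists now hold the same names, so compare them slot by slot.
    return [got for want, got in zip(expected, actual) if want != got]


misplaced = find_out_of_order(["a", "b"], ["b", "a"])
print(len(misplaced))  # 2: both columns are out of place
```

Filtering both lists first means a missing optional column does not shift every later column out of position.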

Note: The PR targets the dev branch.

@cosmicBboy
Collaborator

thanks @jeffzi! I'm currently working on all the documentation for the 0.6.0 release #350, I can add column order validation to that task too

@cosmicBboy cosmicBboy closed this Dec 11, 2020
@cosmicBboy cosmicBboy reopened this Dec 11, 2020
@codecov

codecov bot commented Dec 11, 2020

Codecov Report

Merging #352 (776b82d) into dev (4d2a205) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##              dev     #352      +/-   ##
==========================================
+ Coverage   99.02%   99.03%   +0.01%     
==========================================
  Files          21       21              
  Lines        2351     2383      +32     
==========================================
+ Hits         2328     2360      +32     
  Misses         23       23              
Impacted Files                 Coverage Δ
pandera/model.py               100.00% <100.00%> (ø)
pandera/schema_components.py   98.91% <100.00%> (+0.08%) ⬆️
pandera/schemas.py             98.70% <100.00%> (+0.04%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jeffzi
Collaborator Author

jeffzi commented Dec 11, 2020

I initially forgot to look into MultiIndex order. Now I've tested adding an ordered argument to pandera.MultiIndex and ran into issues:

  1. ordered does not make sense if some index levels are not named. I'm raising an error if ordered=True and at least one index level is not named.
  2. Duplicates: I found a bug where index levels with duplicate names in a DataFrame's MultiIndex are not passed to DataFrameSchema.validate() (base class of pandera.MultiIndex). It's caused by https://github.com/pandera-dev/pandera/blob/5e35569bf49101a22a52a55a057edff3319e64d1/pandera/schema_components.py#L574

MultiIndex.to_frame drops levels with duplicate names. The pandas documentation only implies this:

Column ordering is determined by the DataFrame constructor with data as a dict.

import pandas as pd

df = pd.DataFrame(
    index=pd.MultiIndex.from_arrays([[1], [2], [3]], names=["b", "a", "a"]),
)
print(df.index.to_frame()) 
#>        b  a
#> b a a      
#> 1 2 3  1  3

df = pd.DataFrame(
    index=pd.MultiIndex.from_arrays([[1], [2], [3], [4]], names=["a", "b", "c", "a"]),
)
print(df.index.to_frame())
#>          a  b  c
#> a b c a         
#> 1 2 3 4  4  2  3

Created on 2020-12-11 by the reprexpy package
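
Both outputs above are consistent with building a dict keyed by level name: a repeated key keeps the position of its first occurrence but takes the value of its last. A plain-Python sketch of that mechanism (an analogy for the quoted documentation, not pandas' actual code path):

```python
# Level names and values from the first example above.
names = ["b", "a", "a"]
values = [1, 2, 3]

# A dict cannot hold the same key twice: the second "a" overwrites the first.
as_dict = dict(zip(names, values))
print(as_dict)  # {'b': 1, 'a': 3}

# Second example: "a" keeps its first position but ends up with the last value.
second = dict(zip(["a", "b", "c", "a"], [1, 2, 3, 4]))
print(second)  # {'a': 4, 'b': 2, 'c': 3}
```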

@cosmicBboy That bug should be addressed before merging this PR (it makes one test fail). I can take a look over the weekend.

Pinging @ktroutman since he seemed interested in that discussion :)

@cosmicBboy
Collaborator

I just touched this part of the codebase so I took a crack at the multiindex -> df conversion, preserving duplicate index names... tried to emulate the behavior of pandas as much as possible
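
One way to keep duplicate level names during such a conversion is to build the frame by level position and assign the names afterwards. A minimal sketch, assuming pandas is installed; multiindex_to_frame is a hypothetical name, and this is not necessarily the code that was merged:

```python
import pandas as pd


def multiindex_to_frame(index: pd.MultiIndex) -> pd.DataFrame:
    """Convert a MultiIndex to a DataFrame, one column per level (hypothetical helper)."""
    # Key the columns by level position so repeated names cannot collide.
    frame = pd.DataFrame(
        {pos: index.get_level_values(pos).to_numpy() for pos in range(index.nlevels)}
    )
    # Column labels may repeat, so the original names can be restored afterwards.
    frame.columns = list(index.names)
    frame.index = index
    return frame


index = pd.MultiIndex.from_arrays([[1], [2], [3]], names=["b", "a", "a"])
print(list(multiindex_to_frame(index).columns))  # ['b', 'a', 'a']
```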

@jeffzi
Collaborator Author

jeffzi commented Dec 11, 2020

Awesome @cosmicBboy 🔥 I fixed a bug with optional columns and added documentation.
On my side it's good to go, let me know if you see anything else.

@cosmicBboy
Collaborator

The failing test is unrelated to this PR; the issue should be fixed by #351. Merging this now.

@cosmicBboy cosmicBboy merged commit 206d35c into unionai-oss:dev Dec 12, 2020
@cosmicBboy cosmicBboy mentioned this pull request Dec 14, 2020
cosmicBboy added a commit that referenced this pull request Dec 15, 2020
* implement hypothesis strategies for generating synthetic data (#314)

* schemas can generate valid samples

* implemented basic generation

* implement register check strategy

* implementations for built-in checks and register check strat

* implement column, series, dataframe strategies

* implement more tests

* implement index/multiindex strategies, at built-in str tests

* simplify string strategy tests

* fix chained continuous tests

* implement nullable strategies

* address pylint issues

* update environment, setup.py

* add docstrings to new PandasDtype methods

* null mask is the last strategy in index_strategy

* address mypy and black errors

* fix legacy pandas issue with nullable ints

* skip complex256 tests with windows os

* use SUPPORTED_DTYPES to control tested dtypes for os

* fix multiindex strategy equality test

* bugfix: test index/multiindex strategy type check

* add back linux/osx tests

* fix str strat tests, move BaseStrat error

* improve test coverage

* fix series schema pdtype test

* add more teset coverage

* feature/dataframe-checks (#334)

* add support for dataframe check strategies

* add support for coerce dtype on dataframe-level pandas_dtype

* fix issue with type coercian in multiindex

* add packaging to requirements-dev.txt

* update travis ci spec with pandera-core env file

* increase deadline of in_range strategy test

* fix test_in_range_strategy

* fix bugs in dataframe check and index tests

* update conftest.py: reduce max examples to 100

* fix dataframe strategy

* fix type error on windows

* improve coverage

* bugfix/lazy-error-dtype-coercion (#339)

* bugfix: dtype coercion should be cause by lazy validation, update String documentation

* improve test coverage

* rebase onto dev

* feature/str-dtype: introduce PandasDtype.STRING for pandas-native str type (#340)

* add PandasDtype.STRING, remove PandasDtype.Str

- changing the semantics of PandasDtype.String to map onto the
  pandas-native 'string' type would have caused backwards
  compatibility issues
- revert changes from 2e2c5d7: remove PandasDtype.Str, and
  add a PandasDtype.STRING type for the pandas-native string
  type

* fix docs

* fix tests for legacy pandas

* fix: DataFrameSchema.dtype property

* add github action file for ci tests (#349)

this diff replaces travis-ci with github actions for unit test CI

* bugfix/345: support duplicate columns (#346)

* support validating dataframes w/ duplicate columns

* add test for multiindex duplicate names

* increase coverage

* feature/fallback-strategy (#351)

* check fallback strategy, register custom checks

* fix pylint errors

* add test for usage in Field

* fix pylint error: hypothesis

* 100% coverage extensions module

* increase deadline on test strategies

* suppress health check on slow running tests

* add ci test run on push/PR on dev and master branches (#353)

* Add column order validation (#352)

* add DataFrameSchema.ordered

* fix ordered documentation

* add support for MultiIndex.ordered

* preserve duplicate indexes when converting multiiindex to df

* increase code coverage

* fix ordered with optional columns

* add documentation of ordered

Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>

* update ci-badge in README and docs (#355)

* update ci badge

* add dev and prod build badge

* revert

* docs: strategies, extensions; bugfix: strategy support object dtype (#354)

* add documentation for strategies, check register, extensions

* finish up documentation

* fix pylint import error

* mypy ignore type

* update docs

* update readme, add schema model to strategy docs

Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>