Add column order validation #352

jeffzi · 2020-12-10T23:47:59Z

This PR adds an ordered argument to DataFrameSchema and to model.BaseConfig for the model API. It closes #342.

A side effect is that 2 schema errors are raised when two columns are swapped, one for each out-of-place column:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"a": pa.Column(int), "b": pa.Column(int)}, ordered=True)
df = pd.DataFrame([[1, 2]], columns=["b", "a"])
schema.validate(df, lazy=True)
#> Traceback (most recent call last):
#> <ipython-input-1-f6ae10d7ea6a> in <module>
#> ----> 1 schema.validate(df, lazy=True)
#> ~/Projects/development/pandera/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
#>     582         if lazy and error_handler.collected_errors:
#>     583             raise errors.SchemaErrors(
#> --> 584                 error_handler.collected_errors, check_obj
#>     585             )
#>     586 
#> SchemaErrors: A total of 2 schema errors were found.
#> Error Counts
#> - column_not_ordered: 2
#> Schema Error Summary
#>                                       failure_cases  n_failure_cases
#> schema_context  column check                                        
#> DataFrameSchema <NA>   column_ordered        [a, b]                2
#> Usage Tip
#> Directly inspect all errors by catching the exception:
#> ```
#> try:
#>     schema.validate(dataframe, lazy=True)
#> except SchemaErrors as err:
#>     err.failure_cases  # dataframe of schema errors
#>     err.data  # invalid dataframe
#> ```

Created on 2020-12-11 by the reprexpy package
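To see why a single swap yields two errors, here is a minimal sketch of a positional order check. It is not pandera's implementation: the helper name `out_of_order_columns` is hypothetical, and it assumes both lists contain the same column names.

```python
from typing import List


def out_of_order_columns(expected: List[str], actual: List[str]) -> List[str]:
    """Return every actual column that does not sit at its expected position."""
    # Compare position by position; each mismatch is reported separately,
    # so swapping two columns produces two failures, not one.
    return [col for exp, col in zip(expected, actual) if exp != col]


print(out_of_order_columns(["a", "b"], ["b", "a"]))  # ['b', 'a']
print(out_of_order_columns(["a", "b"], ["a", "b"]))  # []
```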

Note: The PR targets the dev branch.

cosmicBboy · 2020-12-11T00:13:56Z

thanks @jeffzi! I'm currently working on all the documentation for the 0.6.0 release #350, I can add column order validation to that task too

codecov · 2020-12-11T03:35:23Z

Codecov Report

Merging #352 (776b82d) into dev (4d2a205) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #352      +/-   ##
==========================================
+ Coverage   99.02%   99.03%   +0.01%     
==========================================
  Files          21       21              
  Lines        2351     2383      +32     
==========================================
+ Hits         2328     2360      +32     
  Misses         23       23

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandera/model.py | `100.00% <100.00%> (ø)` | |
| pandera/schema_components.py | `98.91% <100.00%> (+0.08%)` | ⬆️ |
| pandera/schemas.py | `98.70% <100.00%> (+0.04%)` | ⬆️ |

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d2a205...776b82d. Read the comment docs.

jeffzi · 2020-12-11T11:29:33Z

I initially forgot to look into MultiIndex order. I have now tested adding an ordered argument to pandera.MultiIndex and ran into two issues:

- ordered does not make sense if some index levels are unnamed. I'm raising an error if ordered=True and at least one level is unnamed.
- Duplicates (:imp:): I found a bug where index levels with duplicate names in a DataFrame's MultiIndex are not passed to DataFrameSchema.validate() (base class of pandera.MultiIndex). It's caused by https://github.com/pandera-dev/pandera/blob/5e35569bf49101a22a52a55a057edff3319e64d1/pandera/schema_components.py#L574

to_frame drops levels with duplicate names. The documentation says so only implicitly:

Column ordering is determined by the DataFrame constructor with data as a dict.

import pandas as pd

df = pd.DataFrame(
    index=pd.MultiIndex.from_arrays([[1], [2], [3]], names=["b", "a", "a"]),
)
print(df.index.to_frame()) 
#>        b  a
#> b a a      
#> 1 2 3  1  3

df = pd.DataFrame(
    index=pd.MultiIndex.from_arrays([[1], [2], [3], [4]], names=["a", "b", "c", "a"]),
)
print(df.index.to_frame())
#>          a  b  c
#> a b c a         
#> 1 2 3 4  4  2  3

Created on 2020-12-11 by the reprexpy package
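The output above is consistent with the frame being built from a name-to-values dict, as the quoted documentation suggests: a repeated key keeps its first position but takes the last value assigned. A plain-Python sketch of that effect (an illustration of dict semantics only, not the pandas source):

```python
names = ["a", "b", "c", "a"]
values = [1, 2, 3, 4]

# A dict keeps one entry per key: the second "a" overwrites the first value
# but the key stays in its original (first) position.
as_dict = dict(zip(names, values))
print(as_dict)  # {'a': 4, 'b': 2, 'c': 3}
```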

@cosmicBboy That bug should be addressed before merging this PR, since it makes one test fail. I can take a look over the weekend.

Pinging @ktroutman since he seemed interested in that discussion :)

cosmicBboy · 2020-12-11T14:21:04Z

I just touched this part of the codebase so I took a crack at the multiindex -> df conversion, preserving duplicate index names... tried to emulate the behavior of pandas as much as possible
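For readers following along, one way such a conversion could keep duplicate names is sketched below. This is an assumption-laden illustration, not the code added in this PR: the helper `multiindex_to_frame` is hypothetical. It keys the columns by level position and assigns the names afterwards, so no name is ever used as a dict key.

```python
import pandas as pd


def multiindex_to_frame(index: pd.MultiIndex) -> pd.DataFrame:
    """Convert a MultiIndex to a DataFrame with one column per level."""
    # Key the columns by level position so duplicate names cannot collide.
    frame = pd.DataFrame(
        {pos: index.get_level_values(pos).to_numpy() for pos in range(index.nlevels)}
    )
    # Assign the (possibly duplicated) level names once the columns exist.
    frame.columns = list(index.names)
    return frame


index = pd.MultiIndex.from_arrays([[1], [2], [3], [4]], names=["a", "b", "c", "a"])
frame = multiindex_to_frame(index)
print(list(frame.columns))  # ['a', 'b', 'c', 'a']
```

Unlike `index.to_frame()`, the result here has four columns, so a schema could check both "a" levels.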

jeffzi · 2020-12-11T22:14:22Z

Awesome @cosmicBboy 🔥 I fixed a bug with optional columns and added documentation.
On my side it's good to go, let me know if you see anything else.

cosmicBboy · 2020-12-12T02:12:13Z

failing test doesn't have to do with the PR, the issues should be fixed by #351, merging this now

* implement hypothesis strategies for generating synthetic data (#314)
* schemas can generate valid samples
* implemented basic generation
* implement register check strategy
* implementations for built-in checks and register check strat
* implement column, series, dataframe strategies
* implement more tests
* implement index/multiindex strategies, at built-in str tests
* simplify string strategy tests
* fix chained continuous tests
* implement nullable strategies
* address pylint issues
* update environment, setup.py
* add docstrings to new PandasDtype methods
* null mask is the last strategy in index_strategy
* address mypy and black errors
* fix legacy pandas issue with nullable ints
* skip complex256 tests with windows os
* use SUPPORTED_DTYPES to control tested dtypes for os
* fix multiindex strategy equality test
* bugfix: test index/multiindex strategy type check
* add back linux/osx tests
* fix str strat tests, move BaseStrat error
* improve test coverage
* fix series schema pdtype test
* add more teset coverage
* feature/dataframe-checks (#334)
* add support for dataframe check strategies
* add support for coerce dtype on dataframe-level pandas_dtype
* fix issue with type coercian in multiindex
* add packaging to requirements-dev.txt
* update travis ci spec with pandera-core env file
* increase deadline of in_range strategy test
* fix test_in_range_strategy
* fix bugs in dataframe check and index tests
* update conftest.py: reduce max examples to 100
* fix dataframe strategy
* fix type error on windows
* improve coverage
* bugfix/lazy-error-dtype-coercion (#339)
* bugfix: dtype coercion should be cause by lazy validation, update String documentation
* improve test coverage
* rebase onto dev
* feature/str-dtype: introduce PandasDtype.STRING for pandas-native str type (#340)
* add PandasDtype.STRING, remove PandasDtype.Str
  - changing the semantics of PandasDtype.String to map onto the pandas-native 'string' type would have caused backwards compatibility issues
  - revert changes from 2e2c5d7: remove PandasDtype.Str, and add a PandasDtype.STRING type for the pandas-native string type
* fix docs
* fix tests for legacy pandas
* fix: DataFrameSchema.dtype property
* add github action file for ci tests (#349)
  this diff replaces travis-ci with github actions for unit test CI
* bugfix/345: support duplicate columns (#346)
* support validating dataframes w/ duplicate columns
* add test for multiindex duplicate names
* increase coverage
* feature/fallback-strategy (#351)
* check fallback strategy, register custom checks
* fix pylint errors
* add test for usage in Field
* fix pylint error: hypothesis
* 100% coverage extensions module
* increase deadline on test strategies
* suppress health check on slow running tests
* add ci test run on push/PR on dev and master branches (#353)
* Add column order validation (#352)
* add DataFrameSchema.ordered
* fix ordered documentation
* add support for MultiIndex.ordered
* preserve duplicate indexes when converting multiiindex to df
* increase code coverage
* fix ordered with optional columns
* add documentation of ordered
  Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>
* update ci-badge in README and docs (#355)
* update ci badge
* add dev and prod build badge
* revert
* docs: strategies, extensions; bugfix: strategy support object dtype (#354)
* add documentation for strategies, check register, extensions
* finish up documentation
* fix pylint import error
* mypy ignore type
* update docs
* update readme, add schema model to strategy docs
  Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>

add DataFrameSchema.ordered

97fa198

jeffzi force-pushed the feature/column-order branch from 65732ee to 97fa198 Compare December 10, 2020 23:56

cosmicBboy closed this Dec 11, 2020

cosmicBboy reopened this Dec 11, 2020

Jean-Francois Zinque added 2 commits December 11, 2020 10:25

fix ordered documentation

b240d4f

add support for MultiIndex.ordered

0aad7c3

preserve duplicate indexes when converting multiiindex to df

32d88a7

Jean-Francois Zinque added 3 commits December 11, 2020 21:33

increase code coverage

6cdd3ca

fix ordered with optional columns

f364feb

add documentation of ordered

776b82d

cosmicBboy merged commit 206d35c into unionai-oss:dev Dec 12, 2020

cosmicBboy mentioned this pull request Dec 14, 2020

Ordered columns #342

Closed

cosmicBboy mentioned this pull request Dec 15, 2020

SchemaModels: Inheritance problem related to coercion #357

Closed

3 tasks
