Add support for optional name validation of single-index #326

Merged 2 commits on Nov 22, 2020

@jeffzi (Collaborator) commented Nov 19, 2020

This PR adds optional name validation for single-index.

  1. Previously, validation would fail on a named index if the index name was not set in the DataFrameSchema:
import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index, Check

schema = DataFrameSchema(index=Index(pa.Object))
df = pd.DataFrame(index=pd.Index(["index_1", "index_2", "index_3"], name="idx"))
schema.validate(df)
#> Traceback (most recent call last):
#> ...
#> SchemaError: Expected <class 'pandera.schema_components.Index'> to have name 'None', found 'idx'

Created on 2020-11-19 by the reprexpy package

The new behavior is to disable name validation when the name is set to None. We discussed adding a new argument check_name to Index in #323 but I think this solution is more elegant. @cosmicBboy Let me know if you see any drawbacks to this approach.
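The rule described above can be sketched in plain Python. This is only an illustration of the behavior, not pandera's actual code: `check_index_name` and `IndexNameError` are hypothetical names, and the error type merely stands in for `SchemaError`.

```python
from typing import Optional


class IndexNameError(ValueError):
    """Hypothetical error type standing in for pandera's SchemaError."""


def check_index_name(schema_name: Optional[str], actual_name: Optional[str]) -> None:
    """Validate an index name, skipping the check when the schema name is None."""
    if schema_name is None:
        # No name declared in the schema: accept any index name.
        return
    if schema_name != actual_name:
        raise IndexNameError(
            f"Expected index to have name {schema_name!r}, found {actual_name!r}"
        )


check_index_name(None, "idx")   # passes: name validation is disabled
check_index_name("idx", "idx")  # passes: names match

try:
    check_index_name("a", "animal")  # fails: schema name differs from index name
except IndexNameError as exc:
    print(exc)
```

With this rule, a schema that never names its index accepts a DataFrame whose index happens to be named, which is the case that used to fail.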

Moreover, the PR adds a check_name parameter to Field, which closes #323.

  1. check_name (bool): Whether to check the name of the column/index during validation.
    None is the default behavior, which translates to True for columns and multi-index, and to False for a single index.
import pandera as pa
import pandas as pd

df = pd.DataFrame(index=pd.Index(["cat", "dog"], name="animal"))


class SchemaNamedIndex(pa.SchemaModel):
    a: pa.typing.Index[str]
    # Same as the following:
    # a: pa.typing.Index[str] = pa.Field(check_name=False)
    # a: pa.typing.Index[str] = pa.Field(check_name=None)


SchemaNamedIndex.validate(df)  # ok
#> Empty DataFrame
#> Columns: []
#> Index: [cat, dog]


class SchemaNamedIndex(pa.SchemaModel):
    a: pa.typing.Index[str] = pa.Field(check_name=True)


SchemaNamedIndex.validate(df)  # fails
#> Traceback (most recent call last):
#> ...
#> SchemaError: Expected <class 'pandera.schema_components.Index'> to have name 'a', found 'animal'

Created on 2020-11-19 by the reprexpy package
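The tri-state default of check_name can be summarized with a small helper. This is a sketch under the assumption that the docstring above is the whole rule; `resolve_check_name` and its `is_single_index` argument are hypothetical and not part of pandera's API.

```python
from typing import Optional


def resolve_check_name(check_name: Optional[bool], is_single_index: bool) -> bool:
    """Resolve the tri-state check_name flag to an effective boolean."""
    if check_name is not None:
        # An explicit True or False always wins.
        return check_name
    # None means "use the default": skip the check for a single index,
    # enforce it for columns and multi-index levels.
    return not is_single_index


print(resolve_check_name(None, is_single_index=True))   # single index, default
print(resolve_check_name(None, is_single_index=False))  # column or multi-index level, default
print(resolve_check_name(True, is_single_index=True))   # single index, explicit opt-in
```

Keeping None as a distinct value lets one Field signature serve both columns and indexes while preserving different defaults for each.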

  1. As discussed here, columns must always be named.

  2. Multi-index suffers from the same "bug" as described in 1., but it is much harder to fix because pa.MultiIndex is implemented as a subclass of pa.DataFrameSchema. I think it should be the topic of a separate issue.

import pandas as pd
import pandera as pa

from pandera import Column, DataFrameSchema, Index, MultiIndex, Check

schema = DataFrameSchema(index=MultiIndex([Index(pa.String), Index(pa.Int)]))
df = pd.DataFrame(
    index=pd.MultiIndex.from_arrays(
        [["foo", "bar", "foo"], [0, 1, 2]], names=["index0", "index1"]
    )
)
schema.validate(df)
#> Traceback (most recent call last):
#> ...
#> SchemaError: column '0' not in dataframe
#>               index0  index1
#> index0 index1               
#> foo    0         foo       0
#> bar    1         bar       1
#> foo    2         foo       2

Created on 2020-11-19 by the reprexpy package
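One plausible reading of the error above, not verified against pandera's source here, is that unnamed schema levels are keyed by their position and then looked up as column names in a frame built from the index. The toy model below illustrates that failure mode only; `schema_keys` and `missing_columns` are hypothetical helpers.

```python
from typing import List, Optional, Union


def schema_keys(level_names: List[Optional[str]]) -> List[Union[str, int]]:
    """Key each schema level by its name, falling back to its position."""
    return [pos if name is None else name for pos, name in enumerate(level_names)]


def missing_columns(
    level_names: List[Optional[str]], frame_columns: List[str]
) -> List[Union[str, int]]:
    """Return the schema keys that are absent from the frame's columns."""
    return [key for key in schema_keys(level_names) if key not in frame_columns]


# Schema levels declared without names, data levels named index0 and index1.
print(missing_columns([None, None], ["index0", "index1"]))  # [0, 1]

# Naming the schema levels to match the data leaves nothing missing.
print(missing_columns(["index0", "index1"], ["index0", "index1"]))  # []
```

Under this model, skipping the name check for a multi-index would require matching levels by position rather than by column name, which is why the fix is less direct than for a single index.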

@jeffzi changed the title from "Add support for named index in SchemaModel" to "Add support for optional name validation of single-index" on Nov 19, 2020
@codecov-io commented Nov 21, 2020

Codecov Report

Merging #326 (8cb53d3) into master (31fe07b) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #326      +/-   ##
==========================================
+ Coverage   98.77%   98.79%   +0.02%     
==========================================
  Files          18       18              
  Lines        1708     1747      +39     
==========================================
+ Hits         1687     1726      +39     
  Misses         21       21              

Impacted Files                 Coverage Δ
pandera/model.py               100.00% <100.00%> (ø)
pandera/model_components.py    100.00% <100.00%> (ø)
pandera/schemas.py             98.12% <100.00%> (+0.14%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@cosmicBboy (Collaborator) left a comment

👍 thanks @jeffzi!

@cosmicBboy (Collaborator) commented, quoting @jeffzi:

> The new behavior is to disable name validation when the name is set to None. We discussed adding a new argument check_name to Index in #323 but I think this solution is more elegant.

👍 💯

Successfully merging this pull request may close these issues.

SchemaError on index name for single-index dataframe when using SchemaModel