Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

287 moving columns to multiindex #319

Merged

Conversation

abyz0123
Copy link
Contributor

@abyz0123 abyz0123 commented Nov 8, 2020

Here is my initial shot at the re/set index functionality. I tried to stay tight on the pandas implementation. Would be interested in thoughts or suggestions.

Copy link
Collaborator

@cosmicBboy cosmicBboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work @ktroutman!

As a procedural thing, I think it would help the dev process if you set up pre-commit, which handles code formatting and linting before pushing to remote, as well as make tests to make sure all tests are passing (I saw the circular import error when I ran this PR locally, see inline comments).

Overall I think the behavior is what we want, my main thought about the API is that I think we can go without the inplace argument. The main reason it's there in pandas is for memory efficiency and avoiding copies of potentially large dataframes. In pandera's case, however, the memory footprint of schemas are negligible, so I think it would be desirable to always copy schemas when they're modified to prevent unexpected schema mutation (and it also simplifies the implementation!).

I added some more detailed feedback in the inline comments, feel free to ignore the nit comments!

pandera/schemas.py Outdated Show resolved Hide resolved
pandera/schemas.py Outdated Show resolved Hide resolved
pandera/schemas.py Outdated Show resolved Hide resolved
pandera/schemas.py Outdated Show resolved Hide resolved
pandera/schemas.py Outdated Show resolved Hide resolved
pandera/schemas.py Outdated Show resolved Hide resolved
@abyz0123
Copy link
Contributor Author

Thanks a lot for the comments, I was struggling a bit on getting the tests to run (for technical reasons) and the hooks to run. I've hacked through the technical issues on my side and I'm pretty sure I've addressed most of your comments in the latest commit.

I found the Multiindex and Index stuff to be cumbersome to work with (if you remove one column from a 2 column multiindex, must then manually change to Index), with this feature as well as in working with the api on my projects. If this idea does work, I would probably use this as my main interface with indices, similar to pandas.

Copy link
Collaborator

@cosmicBboy cosmicBboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's looking good @ktroutman! looks like unit tests are passing, a few more things to ironed out in code-formatting land, looks schemas.py and test_schemas.py need to re-formatted with black (needs to be version 20.8b1 with option --line-length 79 if pre-commit is finicky and you need to do it manually)

And then some pylint errors, some with code formatting issues. See my in-line comments for specifics on the raise-missing-from error.

Also, let me know if you need any help setting up a development environment with all the testing infrastructure on it, I realize the developer docs are a little sparse and would be worth making it a little explicit!

@@ -6,22 +6,22 @@
import warnings
from functools import wraps
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Union
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuple is unused

from .checks import Check
from .dtypes import PandasDtype, PandasExtensionType
from .error_formatters import (
from pandera import constants, dtypes, errors
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be reverted to relative imports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing -- this is a remnant of my environment set up issues. will revert on next commit.

]
assert not_in_cols == []
except AssertionError:
raise Exception(f"Keys {not_in_cols} not found in schema columns!")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these exceptions should be errors.SchemaInitError and use the from keyword so that full error trace is available. assert statements are also nice when quickly scripting things, but in this case it might be more appropriate to use a conditional then raise here.

not_in_cols: List[str] = [
    x for x in keys_temp if x not in new_schema.columns.keys()
]
if not not_in_cols:
     raise errors.SchemaInitError(f"Keys {not_in_cols} not found in schema columns!")

dup_cols: List[str] = [x for x in set(keys_temp) if keys_temp.count(x) > 1]
assert dup_cols == []
except AssertionError:
raise Exception(f"Keys {dup_cols} are duplicated!")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above

try:
assert new_schema.index is not None
except AssertionError:
raise Exception("There is currently no index set for this schema.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment on line 717

)
try:
assert level_not_in_index == []
except AssertionError:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment on line 717

level_temp: Union[List[Any], List[str], None] = list(set(level)) if level is not None else []

# ensure all specified keys are present in the index
level_not_in_index: List[str] = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type of this variable should be compatible with level_temp, since it's a potential type of this value.



# ensure no duplicates and tuple type
level_temp: Union[List[Any], List[str], None] = list(set(level)) if level is not None else []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think None should be an allowable type type, since this'll be an empty list of level is None

@abyz0123
Copy link
Contributor Author

Thanks again for the quick response and really helpful comments! I've fixed main issues youve pointed out and the type checking seems to be ok now.

As far as testing goes, I tested the main functionality that I can think of, but something that could I could probably use advice on :)

@cosmicBboy
Copy link
Collaborator

thanks @ktroutman, the tests look good to me! I pushed a few changes to your branch to get pylint and black to stop complaining

from pandera.schema_components import Index, MultiIndex

new_schema = copy.deepcopy(self)

keys_temp: List = (
list(set(keys)) if not isinstance(keys, List) else keys
list(set(keys)) if not isinstance(keys, list) else keys
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡

@codecov-io
Copy link

codecov-io commented Nov 19, 2020

Codecov Report

Merging #319 (6c8c7ec) into master (0cb9869) will increase coverage by 0.08%.
The diff coverage is 96.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #319      +/-   ##
==========================================
+ Coverage   98.64%   98.73%   +0.08%     
==========================================
  Files          18       18              
  Lines        1702     1738      +36     
==========================================
+ Hits         1679     1716      +37     
+ Misses         23       22       -1     
Impacted Files Coverage Δ
pandera/schemas.py 97.88% <96.66%> (-0.10%) ⬇️
pandera/typing.py 100.00% <0.00%> (ø)
pandera/__init__.py 100.00% <0.00%> (ø)
pandera/schema_components.py 98.42% <0.00%> (+0.02%) ⬆️
pandera/dtypes.py 100.00% <0.00%> (+2.29%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0cb9869...6c8c7ec. Read the comment docs.

@cosmicBboy cosmicBboy merged commit 642be59 into unionai-oss:master Nov 20, 2020
@abyz0123 abyz0123 deleted the 287_moving_columns_to_multiindex branch November 22, 2020 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants