287 moving columns to multiindex #319
nice work @ktroutman!
As a procedural thing, I think it would help the dev process if you set up pre-commit, which handles code formatting and linting before you push to remote, and ran make tests to confirm that all tests pass (I saw a circular import error when I ran this PR locally, see inline comments).
Overall I think the behavior is what we want. My main thought about the API is that we can go without the inplace argument. The main reason it's there in pandas is for memory efficiency and avoiding copies of potentially large dataframes. In pandera's case, however, the memory footprint of schemas is negligible, so I think it would be desirable to always copy schemas when they're modified to prevent unexpected schema mutation (and it also simplifies the implementation!).
I added some more detailed feedback in the inline comments, feel free to ignore the nit comments!
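To make the copy-on-modify suggestion concrete, here is a minimal sketch. ToySchema is a hypothetical stand-in, not pandera's actual DataFrameSchema, and the set_index signature is an assumption for illustration only.

```python
import copy
from typing import Dict, List, Optional


class ToySchema:
    """Hypothetical stand-in for a schema; not pandera's real class."""

    def __init__(
        self, columns: Dict[str, str], index: Optional[List[str]] = None
    ) -> None:
        self.columns = dict(columns)
        self.index = list(index) if index else []

    def set_index(self, keys: List[str]) -> "ToySchema":
        """Return a modified copy; the original schema is never mutated."""
        missing = [k for k in keys if k not in self.columns]
        if missing:
            raise KeyError(f"Keys {missing} not found in schema columns!")
        # Work on a deep copy so callers holding `self` see no changes.
        new_schema = copy.deepcopy(self)
        for key in keys:
            del new_schema.columns[key]
            new_schema.index.append(key)
        return new_schema


schema = ToySchema({"a": "int", "b": "str"})
indexed = schema.set_index(["a"])
```

Because every call returns a fresh object, there is no need for an inplace flag, and the original schema can be reused safely elsewhere.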
Thanks a lot for the comments, I was struggling a bit on getting the tests to run (for technical reasons) and the hooks to run. I've hacked through the technical issues on my side and I'm pretty sure I've addressed most of your comments in the latest commit. I found the MultiIndex and Index stuff to be cumbersome to work with (if you remove one column from a 2 column MultiIndex, you must then manually change it to an Index), both with this feature and when working with the API on my projects. If this idea does work, I would probably use this as my main interface with indices, similar to pandas.
it's looking good @ktroutman! looks like unit tests are passing, a few more things to iron out in code-formatting land: it looks like schemas.py and test_schemas.py need to be re-formatted with black (needs to be version 20.8b1 with option --line-length 79 if pre-commit is finicky and you need to do it manually).
And then there are some pylint errors, some of them code formatting issues. See my in-line comments for specifics on the raise-missing-from error.
Also, let me know if you need any help setting up a development environment with all the testing infrastructure on it, I realize the developer docs are a little sparse and it would be worth making them a little more explicit!
pandera/schemas.py
Outdated
@@ -6,22 +6,22 @@
 import warnings
 from functools import wraps
 from pathlib import Path
-from typing import Any, Callable, Dict, List, Optional, Union
+from typing import Any, Callable, Dict, List, Optional, Tuple, Union
Tuple is unused
pandera/schemas.py
Outdated
from .checks import Check
from .dtypes import PandasDtype, PandasExtensionType
from .error_formatters import (
from pandera import constants, dtypes, errors
these should be reverted to relative imports
sure thing -- this is a remnant of my environment set up issues. will revert on next commit.
pandera/schemas.py
Outdated
    ]
    assert not_in_cols == []
except AssertionError:
    raise Exception(f"Keys {not_in_cols} not found in schema columns!")
these exceptions should be errors.SchemaInitError and use the from keyword so that the full error trace is available. assert statements are also nice when quickly scripting things, but in this case it might be more appropriate to use a conditional and then raise here.
not_in_cols: List[str] = [
    x for x in keys_temp if x not in new_schema.columns.keys()
]
if not_in_cols:
    raise errors.SchemaInitError(f"Keys {not_in_cols} not found in schema columns!")
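The two patterns mentioned in this comment can be sketched side by side. This is an illustration under stated assumptions: SchemaInitError below is a local stand-in for errors.SchemaInitError, and check_keys and get_column are hypothetical helper names, not functions from the PR.

```python
from typing import Dict, List


class SchemaInitError(Exception):
    """Local stand-in for pandera's errors.SchemaInitError."""


def check_keys(keys: List[str], columns: Dict[str, str]) -> None:
    """Conditional-then-raise: no assert or try/except needed."""
    not_in_cols = [k for k in keys if k not in columns]
    if not_in_cols:
        raise SchemaInitError(
            f"Keys {not_in_cols} not found in schema columns!"
        )


def get_column(key: str, columns: Dict[str, str]) -> str:
    """When re-raising inside an except block, `from` chains the cause."""
    try:
        return columns[key]
    except KeyError as exc:
        # `from exc` keeps the original KeyError in the traceback, which is
        # what pylint's raise-missing-from check asks for.
        raise SchemaInitError(
            f"Key {key!r} not found in schema columns!"
        ) from exc
```

With a plain conditional there is no caught exception to chain, so the from keyword only applies where an except block re-raises.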
pandera/schemas.py
Outdated
    dup_cols: List[str] = [x for x in set(keys_temp) if keys_temp.count(x) > 1]
    assert dup_cols == []
except AssertionError:
    raise Exception(f"Keys {dup_cols} are duplicated!")
see comment above
pandera/schemas.py
Outdated
try:
    assert new_schema.index is not None
except AssertionError:
    raise Exception("There is currently no index set for this schema.")
see comment on line 717
pandera/schemas.py
Outdated
)
try:
    assert level_not_in_index == []
except AssertionError:
see comment on line 717
pandera/schemas.py
Outdated
level_temp: Union[List[Any], List[str], None] = list(set(level)) if level is not None else []

# ensure all specified keys are present in the index
level_not_in_index: List[str] = (
the type of this variable should be compatible with level_temp, since it's a potential type of this value.
pandera/schemas.py
Outdated
# ensure no duplicates and tuple type
level_temp: Union[List[Any], List[str], None] = list(set(level)) if level is not None else []
I don't think None should be an allowable type, since this'll be an empty list if level is None
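One way to express that point in the annotations is sketched below. The helper name normalize_levels is hypothetical, and it swaps set() for dict.fromkeys so that the first-seen order of levels is kept, which the snippet under review does not guarantee.

```python
from typing import List, Optional


def normalize_levels(level: Optional[List[str]]) -> List[str]:
    """Deduplicate `level`, mapping None to an empty list.

    The input may be None, but the result never is, so the return
    annotation is a plain List[str] rather than an Optional or Union.
    """
    if level is None:
        return []
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(level))
```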
Thanks again for the quick response and really helpful comments! I've fixed the main issues you've pointed out and the type checking seems to be ok now. As far as testing goes, I tested the main functionality that I can think of, but it's something that I could probably use advice on :)
thanks @ktroutman, the tests look good to me! I pushed a few changes to your branch to get pylint and black to stop complaining
 from pandera.schema_components import Index, MultiIndex

 new_schema = copy.deepcopy(self)

 keys_temp: List = (
-    list(set(keys)) if not isinstance(keys, List) else keys
+    list(set(keys)) if not isinstance(keys, list) else keys
💡
Codecov Report
@@ Coverage Diff @@
## master #319 +/- ##
==========================================
+ Coverage 98.64% 98.73% +0.08%
==========================================
Files 18 18
Lines 1702 1738 +36
==========================================
+ Hits 1679 1716 +37
+ Misses 23 22 -1
Here is my initial shot at the re/set index functionality. I tried to stay tight on the pandas implementation. Would be interested in thoughts or suggestions.