287 moving columns to multiindex #319

abyz0123 · 2020-11-08T16:14:02Z

Here is my initial shot at the re/set index functionality. I tried to stay close to the pandas implementation. I'd be interested in thoughts or suggestions.

cosmicBboy

nice work @ktroutman!

As a procedural thing, I think it would help the dev process if you set up pre-commit, which handles code formatting and linting before pushing to remote, and ran make tests to make sure all tests are passing (I saw the circular import error when I ran this PR locally, see inline comments).

Overall I think the behavior is what we want. My main thought about the API is that we can go without the inplace argument. The main reason it's there in pandas is for memory efficiency and avoiding copies of potentially large dataframes. In pandera's case, however, the memory footprint of schemas is negligible, so I think it would be desirable to always copy schemas when they're modified to prevent unexpected schema mutation (and it also simplifies the implementation!).
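To illustrate the always-copy idea, here is a minimal sketch. ToySchema and its fields are hypothetical stand-ins, not pandera's actual classes; only the copy.deepcopy call mirrors what this PR does.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToySchema:
    """Hypothetical stand-in for a schema; not pandera's DataFrameSchema."""

    columns: Dict[str, str] = field(default_factory=dict)  # name -> dtype
    index: List[str] = field(default_factory=list)  # names of index levels

    def set_index(self, keys: List[str]) -> "ToySchema":
        """Return a modified copy; the receiver is never mutated."""
        new_schema = copy.deepcopy(self)
        for key in keys:
            # Move the column out of `columns` and into the index.
            new_schema.columns.pop(key)
            new_schema.index.append(key)
        return new_schema


original = ToySchema(columns={"a": "int", "b": "str"})
updated = original.set_index(["a"])
```

Because set_index works on a deep copy, original still has both columns and no index after the call, so there is no need for an inplace flag.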

I added some more detailed feedback in the inline comments, feel free to ignore the nit comments!

pandera/schemas.py

abyz0123 · 2020-11-16T21:23:16Z

Thanks a lot for the comments. I was struggling a bit to get the tests and the hooks to run (for technical reasons). I've hacked through the technical issues on my side and I'm pretty sure I've addressed most of your comments in the latest commit.

I found the MultiIndex and Index stuff cumbersome to work with (if you remove one column from a two-column MultiIndex, you must then manually change it to an Index), both with this feature and when working with the API on my projects. If this idea does work, I would probably use this as my main interface with indices, similar to pandas.
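The collapsing step described above could look roughly like this. It is only a sketch under assumptions: the IndexSpec representation and both helper names are hypothetical, and plain strings and tuples stand in for pandera's Index and MultiIndex components.

```python
from typing import List, Tuple, Union

# Hypothetical representation: None (no index), a single level name,
# or a tuple of level names standing in for a multi-level index.
IndexSpec = Union[None, str, Tuple[str, ...]]


def collapse_index(levels: List[str]) -> IndexSpec:
    """Pick the simplest index form that can hold the given levels."""
    if not levels:
        return None
    if len(levels) == 1:
        return levels[0]
    return tuple(levels)


def reset_levels(current: List[str], drop: List[str]) -> IndexSpec:
    """Drop the named levels and collapse whatever remains."""
    remaining = [name for name in current if name not in drop]
    return collapse_index(remaining)
```

With this shape, dropping one level from a two-level index yields a single index automatically, which is the manual step the comment above finds cumbersome.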

cosmicBboy

it's looking good @ktroutman! looks like unit tests are passing, with a few more things to iron out in code-formatting land: it looks like schemas.py and test_schemas.py need to be re-formatted with black (it needs to be version 20.8b1 with option --line-length 79 if pre-commit is finicky and you need to do it manually)

And then some pylint errors, some with code formatting issues. See my in-line comments for specifics on the raise-missing-from error.

Also, let me know if you need any help setting up a development environment with all the testing infrastructure on it. I realize the developer docs are a little sparse and it would be worth making them a little more explicit!

cosmicBboy · 2020-11-17T01:03:24Z

pandera/schemas.py

@@ -6,22 +6,22 @@
 import warnings
 from functools import wraps
 from pathlib import Path
-from typing import Any, Callable, Dict, List, Optional, Union
+from typing import Any, Callable, Dict, List, Optional, Tuple, Union


Tuple is unused

cosmicBboy · 2020-11-17T01:03:38Z

pandera/schemas.py

-from .checks import Check
-from .dtypes import PandasDtype, PandasExtensionType
-from .error_formatters import (
+from pandera import constants, dtypes, errors


these should be reverted to relative imports

sure thing -- this is a remnant of my environment set up issues. will revert on next commit.

cosmicBboy · 2020-11-17T01:08:25Z

pandera/schemas.py

+            ]
+            assert not_in_cols == []
+        except AssertionError:
+            raise Exception(f"Keys {not_in_cols} not found in schema columns!")


these exceptions should be errors.SchemaInitError and use the from keyword so that the full error trace is available. assert statements are nice when quickly scripting things, but in this case it would be more appropriate to use a conditional and then raise here.

not_in_cols: List[str] = [
    x for x in keys_temp if x not in new_schema.columns.keys()
]
if not_in_cols:
    raise errors.SchemaInitError(
        f"Keys {not_in_cols} not found in schema columns!"
    )
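For reference, a small self-contained sketch of both patterns mentioned here: a plain conditional raise, and raise ... from inside an except block (which is what pylint's raise-missing-from asks for). The SchemaInitError class below is a local stand-in defined for the example, and the two helper functions are hypothetical.

```python
class SchemaInitError(Exception):
    """Local stand-in for the error class named in the review."""


def require_columns(keys, columns):
    """Raise if any requested key is missing from the schema columns."""
    missing = [key for key in keys if key not in columns]
    if missing:
        raise SchemaInitError(f"Keys {missing} not found in schema columns!")


def lookup_dtype(columns, key):
    """Re-raise a KeyError as SchemaInitError, keeping the original cause."""
    try:
        return columns[key]
    except KeyError as exc:
        # `from exc` stores the KeyError on __cause__, so the traceback
        # shows both the original failure and the translated error.
        raise SchemaInitError(
            f"Key {key!r} not found in schema columns!"
        ) from exc
```

The conditional form needs no try/except at all, which is why it reads more directly than asserting and then catching AssertionError.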

cosmicBboy · 2020-11-17T01:08:43Z

pandera/schemas.py

+            dup_cols: List[str] = [x for x in set(keys_temp) if keys_temp.count(x) > 1]
+            assert dup_cols == []
+        except AssertionError:
+            raise Exception(f"Keys {dup_cols} are duplicated!")


see comment above
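As a side note on the duplicate check itself, one possible alternative to calling keys_temp.count(x) for every key is collections.Counter, which counts everything in a single pass. This is only a sketch; find_duplicates is a hypothetical helper, not something in the PR.

```python
from collections import Counter
from typing import List


def find_duplicates(keys: List[str]) -> List[str]:
    """Return keys appearing more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]
```

The result can then feed the same conditional raise suggested in the comment above.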

cosmicBboy · 2020-11-17T01:09:49Z

pandera/schemas.py

+        try:
+            assert new_schema.index is not None
+        except AssertionError:
+            raise Exception("There is currently no index set for this schema.")


see comment on line 717

cosmicBboy · 2020-11-17T01:09:56Z

pandera/schemas.py

+        )
+        try:
+            assert level_not_in_index == []
+        except AssertionError:


see comment on line 717

cosmicBboy · 2020-11-17T01:13:28Z

pandera/schemas.py

+        level_temp: Union[List[Any], List[str], None] = list(set(level)) if level is not None else []
+
+        # ensure all specified keys are present in the index
+        level_not_in_index: List[str] = (


the type of this variable should be compatible with the type of level_temp, since its values are drawn from level_temp.

cosmicBboy · 2020-11-17T01:15:13Z

pandera/schemas.py

+
+
+        # ensure no duplicates and tuple type
+        level_temp: Union[List[Any], List[str], None] = list(set(level)) if level is not None else []


I don't think None should be an allowable type, since this will be an empty list if level is None
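A sketch of what the narrower annotation could look like, assuming level is an optional list of level names. normalize_levels is a hypothetical helper; the order-preserving dict.fromkeys deduplication is a swapped-in alternative to the list(set(level)) used in the diff, not what the PR does.

```python
from typing import List, Optional


def normalize_levels(level: Optional[List[str]]) -> List[str]:
    """Always return a list: empty when level is None, deduplicated otherwise."""
    if level is None:
        return []
    # dict.fromkeys drops duplicates while keeping first-seen order;
    # list(set(level)) would not guarantee any particular order.
    return list(dict.fromkeys(level))
```

Since the return value is never None, the annotation can be a plain List[str], and a list built by filtering it can share the same type.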

abyz0123 · 2020-11-18T19:04:47Z

Thanks again for the quick response and really helpful comments! I've fixed the main issues you've pointed out and the type checking seems to be OK now.

As far as testing goes, I tested the main functionality that I can think of, but it's something I could probably use advice on :)

cosmicBboy · 2020-11-18T22:02:43Z

thanks @ktroutman, the tests look good to me! I pushed a few changes to your branch to get pylint and black to stop complaining

abyz0123 · 2020-11-18T23:05:38Z

pandera/schemas.py

        from pandera.schema_components import Index, MultiIndex

        new_schema = copy.deepcopy(self)

        keys_temp: List = (
-            list(set(keys)) if not isinstance(keys, List) else keys
+            list(set(keys)) if not isinstance(keys, list) else keys


codecov-io · 2020-11-19T03:00:51Z

Codecov Report

Merging #319 (6c8c7ec) into master (0cb9869) will increase coverage by 0.08%.
The diff coverage is 96.66%.

@@            Coverage Diff             @@
##           master     #319      +/-   ##
==========================================
+ Coverage   98.64%   98.73%   +0.08%     
==========================================
  Files          18       18              
  Lines        1702     1738      +36     
==========================================
+ Hits         1679     1716      +37     
+ Misses         23       22       -1

Impacted Files	Coverage Δ
pandera/schemas.py	`97.88% <96.66%> (-0.10%)`	⬇️
pandera/typing.py	`100.00% <0.00%> (ø)`
pandera/__init__.py	`100.00% <0.00%> (ø)`
pandera/schema_components.py	`98.42% <0.00%> (+0.02%)`	⬆️
pandera/dtypes.py	`100.00% <0.00%> (+2.29%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0cb9869...6c8c7ec.

ktroutman added 3 commits November 3, 2020 00:17

intermediate change

afb7e2d

changes

28629d4

tests added

8ea0dfe

cosmicBboy reviewed Nov 11, 2020

fixes, ignoring two mypy type flags

07f6d54

cosmicBboy reviewed Nov 17, 2020

ktroutman added 3 commits November 18, 2020 19:50

fixed a few typing issues, formatting, all formatting and tests pass

c031d0d

quick fix of relative imports

4ff58a1

tests were not added for some reason, adding here.

c224bd1

fix pylint errors, black formatting on test_schemas.py

6c8c7ec

abyz0123 commented Nov 18, 2020

add test for reset_index with index is None

8f180b9

cosmicBboy merged commit 642be59 into unionai-oss:master Nov 20, 2020

cosmicBboy mentioned this pull request Nov 21, 2020

Add documentation for DataFrameSchema transformations #327

Closed

abyz0123 deleted the 287_moving_columns_to_multiindex branch November 22, 2020 17:36
