pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined #947

Closed
jbbqqf opened this issue Feb 13, 2024 · 9 comments · Fixed by #954

@jbbqqf

jbbqqf commented Feb 13, 2024

As far as I can tell, I cannot call pd.read_parquet twice in the same test session when the filesystem is mocked in one of the tests.

How To Reproduce

% venv/bin/pytest tests/test_pyfakefs.py
=============================================================================== test session starts ===============================================================================
platform linux -- Python 3.11.7, pytest-8.0.0, pluggy-1.4.0
rootdir: /home/jbb/sandbox/pyfakefs_x_pandas
plugins: pyfakefs-5.3.5
collected 2 items                                                                                                                                                                 

tests/test_pyfakefs.py .F                                                                                                                                                   [100%]

==================================================================================== FAILURES =====================================================================================
_____________________________________________________________________________________ test_2 ______________________________________________________________________________________

    def test_2() -> None:
        dir_ = Path(Path(__file__).parent, "data")
    
>       df = pd.read_parquet(Path(dir_, "test.parquet"))

tests/test_pyfakefs.py:17: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
venv/lib/python3.11/site-packages/pandas/io/parquet.py:651: in read_parquet
    impl = get_engine(engine)
venv/lib/python3.11/site-packages/pandas/io/parquet.py:63: in get_engine
    return engine_class()
venv/lib/python3.11/site-packages/pandas/io/parquet.py:169: in __init__
    import pandas.core.arrays.arrow.extension_types  # pyright: ignore[reportUnusedImport] # noqa: F401
venv/lib/python3.11/site-packages/pandas/core/arrays/arrow/extension_types.py:59: in <module>
    pyarrow.register_extension_type(_period_type)
pyarrow/types.pxi:1842: in pyarrow.lib.register_extension_type
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined

pyarrow/error.pxi:91: ArrowKeyError
============================================================================= short test summary info =============================================================================
FAILED tests/test_pyfakefs.py::test_2 - pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined
=========================================================================== 1 failed, 1 passed in 0.26s ===========================================================================

Your environment

% tree tests
tests
├── data
│   └── test.parquet
└── test_pyfakefs.py
% cat tests/test_pyfakefs.py 
from pyfakefs.fake_filesystem import FakeFilesystem
from pathlib import Path
import pandas as pd


def test_1(fs: FakeFilesystem) -> None:
    dir_ = Path(Path(__file__).parent, "data")
    fs.add_real_directory(dir_)

    df = pd.read_parquet(Path(dir_, "test.parquet"))


# def test_2(fs: FakeFilesystem) -> None:
def test_2() -> None:
    dir_ = Path(Path(__file__).parent, "data")

    df = pd.read_parquet(Path(dir_, "test.parquet"))

% python3.11 -m venv venv
% source venv/bin/activate
% pip install pandas pytest pyfakefs pyarrow
% venv/bin/pytest tests/test_pyfakefs.py
% pip freeze
iniconfig==2.0.0
numpy==1.26.4
packaging==23.2
pandas==2.2.0
pluggy==1.4.0
pyarrow==15.0.0
pyfakefs==5.3.5
pytest==8.0.0
python-dateutil==2.8.2
pytz==2024.1
six==1.16.0
tzdata==2024.1

This is how I generated the parquet file:

python
>>> import pandas as pd
>>> pd.DataFrame({"a": [1], "b": 2}).to_parquet("data/test.parquet")
@mrbean-bremen
Member

Thanks! Looks like the patching is not correctly reverted in this case for some reason.

@jbbqqf
Author

jbbqqf commented Feb 22, 2024

@mrbean-bremen thank you for the quick response. How often do you perform maintenance tasks on this lib?

I'm looking for visibility on this issue to understand if I need to look for another solution.

I am aware you're probably maintaining this repo on your personal time.

@mrbean-bremen
Member

I try to fix issues as soon as possible, but that depends on the kind of issue, and on other things I have to do. I took a shot at this one last weekend, but didn't get anywhere, and haven't worked on it since. I will see what I can do, but probably not before the weekend. I will let you know if I find something, or if this may take longer.

And yes, I'm doing this in my free time. I'm always happy about other contributors, of course...

@mrbean-bremen
Member

The problem has to do with the dynamic patcher, but I haven't found the root cause yet. Switching off the dynamic patcher fixes the example, but depending on your use case, you may need it (it patches modules loaded dynamically during the test).
If you want to try this, you can replace fs with a customized fixture, e.g.:

import pytest
from pyfakefs.fake_filesystem_unittest import Patcher


@pytest.fixture
def fs_no_dyn_patch():
    with Patcher(use_dynamic_patch=False) as p:
        yield p.fs

@mrbean-bremen
Member

For the record (and visibility...):
This is not what I thought originally. I was biased because of some recent problems related to the dynamic patcher cleanup, but I should have paid more attention to the actual error message. It is still a problem in the dynamic patcher cleanup, but specific to the module pandas.core.arrays.arrow.extension_types, which cannot be reloaded. The module registers a couple of extensions on load, and on reload this fails because they are already registered.
I thought about making a PR to fix this in pandas, but found no easy way to do this (pyarrow does not allow querying for registered extensions, and the exception cannot be caught easily because it comes from C code).

So I will probably add a specific fix for this module (and the possibility to make similar fixes for other modules).

mrbean-bremen added a commit to mrbean-bremen/pyfakefs that referenced this issue Feb 25, 2024
- may be needed for modules that cannot cleanly reload
- used for pandas.core.arrays.arrow.extension_types, see pytest-dev#947
mrbean-bremen added a commit that referenced this issue Feb 25, 2024
- may be needed for modules that cannot cleanly reload
- used for pandas.core.arrays.arrow.extension_types, see #947
@mrbean-bremen
Member

@jbbqqf - should be fixed on the main branch. Can you please test if this works for you?

@jbbqqf
Author

jbbqqf commented Mar 30, 2024

Hi @mrbean-bremen .

I deeply apologize for being asked for feedback and not replying until now. My test suite was broken and I ran out of time to fix it.

It is now fixed and I didn't experience the issue described above with the main branch of this repo installed (pip install git+https://github.com/pytest-dev/pyfakefs.git@582abdf44b5ee11f84215d7a638224ce67e122e9) instead of pyfakefs==5.3.5.

Thank you!

@mrbean-bremen
Member

Thank you @jbbqqf - I'm waiting for feedback on another issue, and if nothing comes up, I will make a new release sometime next week (traveling right now).

@mrbean-bremen
Member

FYI: A new release is out.
