BUG: setting column with 2D object array raises #61026

Open
2 tasks done
tonyyuyiding opened this issue Mar 1, 2025 · 8 comments · May be fixed by #61035
Labels: Bug, Indexing (Related to indexing on series/frames, not to indexes themselves)

Comments

@tonyyuyiding

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/79457029/setting-pandas-dataframe-column-with-numpy-object-array-causes-error/

Question about pandas

I found that setting a pandas DataFrame column with a 2D NumPy array whose dtype is object causes a weird error. I wonder why this happens.

The code I ran is as follows:

import numpy as np
import pandas as pd

print(f"numpy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")

data = pd.DataFrame({
    "c1": [1, 2, 3, 4, 5],
})

t1 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]])
data["c1"] = t1 # This works well

t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)
data["c1"] = t2 # This throws an error

Result (some unrelated path removed):

numpy version: 2.2.3
pandas version: 2.2.3
Traceback (most recent call last):
  File "...\test.py", line 15, in <module>
    data["c1"] = t2 # This throws an error
    ~~~~^^^^^^
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\frame.py", line 4311, in __setitem__
    self._set_item(key, value)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\frame.py", line 4524, in _set_item
    value, refs = self._sanitize_column(value)
                  ~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\frame.py", line 5267, in _sanitize_column
    arr = sanitize_array(value, self.index, copy=True, allow_2d=True)
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\construction.py", line 606, in sanitize_array
    subarr = maybe_infer_to_datetimelike(data)
  File "...\Anaconda\envs\test\Lib\site-packages\pandas\core\dtypes\cast.py", line 1181, in maybe_infer_to_datetimelike
    raise ValueError(value.ndim)  # pragma: no cover
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: 2

I'm not sure whether this is the expected behaviour. I find it strange that simply adding dtype=object causes the error.

tonyyuyiding added the Needs Triage and Usage Question labels on Mar 1, 2025
@Abhibhav2003

Abhibhav2003 commented Mar 1, 2025

Hey @tonyyuyiding ,

By default, NumPy detects that all elements are strings and assigns dtype='<U1', meaning Unicode strings of length 1.


But when you explicitly set dtype to object, each element is treated as a general Python object rather than a NumPy-native type. So, instead of storing the values directly in a contiguous block of memory, NumPy stores pointers (references) to Python objects. This makes dtype=object behave differently from other NumPy data types.

When you assign t2 with data["c1"] = t2, pandas expects a 1-D array. However, t2 is technically a nested structure (a 2D array where each element is a separate Python object holding a list-like value). This conflicts with pandas' column format, leading to an error.
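A minimal sketch of the dtype difference described above, using the two arrays from the original report. It only inspects the arrays with NumPy and does not touch pandas. Note that both arrays share the shape (5, 1), and indexing t2[0, 0] returns a plain str rather than a list.

```python
import numpy as np

# Same arrays as in the original report.
t1 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]])
t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)

# t1 gets a fixed-width Unicode dtype inferred by NumPy.
print(t1.dtype, t1.shape)

# t2 keeps the same 2D shape but stores references to Python objects.
print(t2.dtype, t2.shape)

# Each element of t2 is an ordinary Python str.
print(type(t2[0, 0]))
```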

What is the fix?
You can flatten t2 into a 1D array with the ravel() function before assigning it.
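The ravel() workaround could look like the sketch below. It reuses data and t2 from the original report; the intermediate name flat is introduced here only for illustration. This sidesteps the error and does not explain why the 2D object array fails.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"c1": [1, 2, 3, 4, 5]})
t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)

# Flatten the (5, 1) object array to shape (5,) before assignment.
flat = t2.ravel()
data["c1"] = flat

print(data)
```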

@tonyyuyiding
Author

Thanks for the explanation! I have a further question. You mentioned that pandas expects a 1-D array, but I think t1 and t2 are both 2D. Why can we assign t1 to a column but not t2? Is it because "dtype=object behave differently from other NumPy data types"?

@Abhibhav2003

Even though t1 is 2D, it contains a single column.
Pandas automatically reshapes it to 1D when assigning to a DataFrame column.

But in the case of t2,
Pandas sees that each element in t2 is an arbitrary Python object (["A"], ["B"]), not a simple string.
Since dtype=object, Pandas does NOT automatically reshape it.
The shape mismatch causes an assignment error.

Because t2 contains references to Python objects, not just direct values.

@rhshadrach
Member

rhshadrach commented Mar 1, 2025

Thanks for the report!

Pandas sees that each element in t2 is an arbitrary Python object (["A"], ["B"]), not a simple string.
Since dtype=object, Pandas does NOT automatically reshape it.

@Abhibhav2003 - what are you basing this off of?

This looks like a bug to me. In the object case, pandas calls maybe_infer_to_datetimelike which raises on ndim != 1 with the comment # Caller is responsible. Further investigations are welcome!
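The failure mode described above can be sketched without pandas. Both functions below are hypothetical stand-ins written for illustration, not pandas internals: infer_1d_only mimics a helper that assumes 1D input and raises ValueError(value.ndim) otherwise, and sanitize_column_sketch shows one way a caller could take responsibility by squeezing an (n, 1) array first. This is an assumption about what a fix might look like, not the change proposed in the linked pull request.

```python
import numpy as np


def infer_1d_only(value: np.ndarray) -> np.ndarray:
    """Hypothetical helper that only accepts 1D input."""
    if value.ndim != 1:
        # The caller is responsible for passing a 1D array.
        raise ValueError(value.ndim)
    return value


def sanitize_column_sketch(value: np.ndarray) -> np.ndarray:
    """Hypothetical caller that squeezes an (n, 1) array before inference."""
    if value.ndim == 2 and value.shape[1] == 1:
        value = value[:, 0]
    return infer_1d_only(value)


t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)

try:
    infer_1d_only(t2)
except ValueError as exc:
    # The message is just the number of dimensions, as in the report.
    print("ValueError:", exc)

print(sanitize_column_sketch(t2).shape)
```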

@tonyyuyiding
Author

I also think this looks like a bug now. At the very least, the error message could be more informative. I'm reading through the source code to find out what's happening.

rhshadrach changed the title from "QST: setting DataFrame column with 2D numpy array whose dtype is object causes a strange error" to "BUG: setting column with 2D object array raises" on Mar 2, 2025
rhshadrach added the Indexing and Bug labels and removed the Usage Question and Needs Triage labels on Mar 2, 2025
@chilin0525
Contributor

FYI, I tested the following case, and it works when assigning values to multiple columns.

data = pd.DataFrame({
    "c1": [1, 2, 3, 4, 5],
    "c2": [1, 2, 3, 4, 5],
})
t3 = np.array([["A", "F"], ["B", "G"], ["C", "H"], ["D", "I"], ["E", "J"]], dtype=object)
data[["c1", "c2"]] = t3

@tonyyuyiding
Author

Thanks for the information!

I also found another strange behavior:

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "c1": [1, 2, 3, 4, 5],
})

t = np.array([[["A"]], [["B"]], [["C"]], [["D"]], [["E"]]]) # shape: (5, 1, 1). dtype is not set to object
data["c1"] = t # error

Here's what I get:

Traceback (most recent call last):
  File ".../test.py", line 9, in <module>
    data["c1"] = t
    ~~~~^^^^^^
  File ".../site-packages/pandas/core/frame.py", line 4185, in __setitem__
    self._set_item(key, value)
  File ".../site-packages/pandas/core/frame.py", line 4391, in _set_item
    self._set_item_mgr(key, value, refs)
  File ".../site-packages/pandas/core/frame.py", line 4360, in _set_item_mgr
    self._iset_item_mgr(loc, value, refs=refs)
  File ".../site-packages/pandas/core/frame.py", line 4349, in _iset_item_mgr
    self._mgr.iset(loc, value, inplace=inplace, refs=refs)
  File ".../site-packages/pandas/core/internals/managers.py", line 1231, in iset
    raise AssertionError(
AssertionError: Shape of new values must be compatible with manager shape

I wonder whether this is the expected behavior. It seems the error message could be more meaningful when the array has 3 or more dimensions. Also, I don't know whether the original design intended a 2D NumPy array to be accepted when setting a column.
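As an illustration of what a more meaningful error could look like, here is a sketch of an up-front dimension check. The function check_column_ndim and its message wording are hypothetical and are not part of pandas.

```python
import numpy as np


def check_column_ndim(value: np.ndarray) -> np.ndarray:
    """Hypothetical validator that rejects arrays with more than 2 dimensions."""
    if value.ndim > 2:
        raise ValueError(
            f"Cannot set a column with a {value.ndim}-dimensional array; "
            "expected a 1D array or a 2D array with a single column"
        )
    return value


t = np.array([[["A"]], [["B"]], [["C"]], [["D"]], [["E"]]])  # shape: (5, 1, 1)

try:
    check_column_ndim(t)
except ValueError as exc:
    print("ValueError:", exc)

# A (5, 1) array passes the check unchanged.
ok = check_column_ndim(t[:, :, 0])
print(ok.shape)
```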

@tonyyuyiding
Author

take
