BUG: setting column with 2D object array raises #61026
Comments
Hey @tonyyuyiding, if you look at NumPy's default behavior, NumPy detects that all elements are strings and assigns dtype='<U1', meaning Unicode strings of length 1. But when you explicitly set the dtype to object, each element is treated as a general Python object rather than a NumPy-native type. Instead of storing the values directly in a contiguous block of memory, NumPy stores pointers (references) to Python objects. This makes dtype=object behave differently from other NumPy data types. When you assign t2 with data["c1"] = t2, pandas expects a 1-D array. However, t2 is technically a nested structure (a 2D array where each element is a separate Python object holding a list-like value). This conflicts with pandas' column format, leading to an error. What is the fix? Just like this:
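The dtype difference described in this comment can be checked directly. This is a minimal sketch; the array contents are assumed for illustration, since the original snippet is not shown in this copy of the thread. It prints the dtype and shape NumPy reports for each array, so the two can be compared side by side.

```python
import numpy as np

# Assumed sample values: five single-character strings, one per row.
values = [["A"], ["B"], ["C"], ["D"], ["E"]]

t1 = np.array(values)                # NumPy infers a fixed-width Unicode dtype
t2 = np.array(values, dtype=object)  # elements are stored as Python object references

for name, arr in (("t1", t1), ("t2", t2)):
    print(name, "dtype:", arr.dtype, "shape:", arr.shape, "ndim:", arr.ndim)
```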
Thanks for the explanation! I have a further question. You mentioned that pandas expects a 1-D array, but I think |
Even though t1 is 2D, it contains a single column. The t2 case is different because t2 contains references to Python objects, not the values themselves. |
Thanks for the report!
@Abhibhav2003 - what are you basing this off of? This looks like a bug to me. In the object case, pandas calls |
I also think this looks like a bug now. At least the error message can be more informative. I'm reading through the source code and trying to find what's happening. |
FYI, I tested the following case, and it works when assigning values to multiple columns.
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "c1": [1, 2, 3, 4, 5],
    "c2": [1, 2, 3, 4, 5],
})
# A (5, 2) object array assigned to two columns at once
t3 = np.array([["A", "F"], ["B", "G"], ["C", "H"], ["D", "I"], ["E", "J"]], dtype=object)
data[["c1", "c2"]] = t3
Thanks for the information! I also found another strange behavior:
Here's what I get:
I wonder whether this is the expected behavior. The error messages could be more meaningful when the array has 3 or more dimensions. Besides, I don't know whether the original design intended a 2D NumPy array to be accepted when setting a column. |
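While it is unsettled whether a 2D array should be accepted for a single column, one way to sidestep the question is to hand pandas a 1-D array. The sketch below is a workaround under assumed sample values, not a fix from the maintainers: it takes the only column of a (5, 1) object array by slicing before assigning it.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"c1": [1, 2, 3, 4, 5]})

# Assumed sample values: a 2D object array with a single column.
t2 = np.array([["A"], ["B"], ["C"], ["D"], ["E"]], dtype=object)

# Select the only column so the result is 1-D; t2.ravel() would also work here.
flat = t2[:, 0]

data["c1"] = flat
print(data)
```

Slicing with `t2[:, 0]` makes the intent explicit (one column), whereas `ravel()` would silently flatten an array with several columns into a length that no longer matches the index.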
take |
Research
I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/questions/79457029/setting-pandas-dataframe-column-with-numpy-object-array-causes-error/
Question about pandas
I found that setting a pandas DataFrame column with a 2D NumPy array whose dtype is object causes a weird error. I wonder why it happens.
The code I ran is as follows:
Result (some unrelated path removed):
I'm not sure whether it is the expected behaviour. I find it strange because simply adding dtype=object will cause the error.
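The reporter's code and traceback are not preserved in this copy of the issue. The following is a hypothetical reconstruction based only on the names used in the thread (`data`, `c1`, `t1`, `t2`); the sample values and the `try_assign` helper are assumptions. It attempts the single-column assignment with and without `dtype=object` and reports what happens, without assuming a particular outcome, since the behavior may differ between pandas versions.

```python
import numpy as np
import pandas as pd


def try_assign(values):
    """Hypothetical helper: assign `values` to column "c1" and report the outcome."""
    data = pd.DataFrame({"c1": [1, 2, 3, 4, 5]})
    try:
        data["c1"] = values
    except Exception as exc:  # the exact exception type is not assumed here
        return f"{type(exc).__name__}: {exc}"
    return data["c1"].tolist()


rows = [["A"], ["B"], ["C"], ["D"], ["E"]]  # assumed sample values
results = {
    "t1 (inferred dtype)": try_assign(np.array(rows)),
    "t2 (dtype=object)": try_assign(np.array(rows, dtype=object)),
}

for label, outcome in results.items():
    print(label, "->", outcome)
```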