Bug cov nat #60898
fbourgey
commented
Feb 9, 2025
- closes BUG: cov buggy when having NaT in column #53115
@@ -11239,6 +11239,12 @@ def cov(
c -0.150812 0.191417 0.895202
"""
data = self._get_numeric_data() if numeric_only else self
if data.select_dtypes(include=[np.datetime64, np.timedelta64]).shape[1] > 0:
Is this happening in spite of numeric_only, or is that False in this case?
I think it would be False in that case. If numeric_only=True, then no error is raised.
Hmm, something feels off to me about this. I assume other types like object/string are raising naturally without any special casing here. Can you check where those are raising to see if that unlocks any clues? I don't think we should be special-casing the type selection in the algorithm like this.
The following
df = DataFrame({"a": ["a","b","c"]})
df.cov()
raises
ValueError: could not convert string to float: 'a'
so object/string should be fine.
@WillAyd, do you have any other suggestions on the best way to handle this?
So the question is ultimately how those are raising even though we don't branch for them within this function. Do you think you can find that through the debugger?
I can check the type of every element of the DataFrame or Series and do something like
if data.map(lambda x: isinstance(x, (np.datetime64, np.timedelta64, pd.Timedelta, pd.Timestamp, pd._libs.tslibs.nattype.NaTType))).any().any():
    raise ValueError()
The cov() method does data.to_numpy(dtype=float, na_value=np.nan, copy=False) on data. This gives strange values when there are some pd.NaT and raises an error if it contains some string.
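For reference, the string case can be reproduced outside cov() by calling the same conversion directly. This is a minimal sketch that assumes the to_numpy call quoted above is the point where the string error originates; it does not trace the actual call stack inside pandas. The column is forced to object dtype so that it holds plain Python strings regardless of pandas version.

```python
import numpy as np
import pandas as pd

# dtype=object is an assumption made here so the column stores plain Python strings.
df = pd.DataFrame({"a": ["a", "b", "c"]}, dtype=object)

try:
    # Same conversion that cov() is described as performing on its input.
    df.to_numpy(dtype=float, na_value=np.nan)
    error = None
except ValueError as exc:
    error = exc

# The strings cannot be cast to float, so the conversion itself raises.
print(type(error).__name__, error)
```

No dtype check is involved for strings: the float cast fails on its own, whereas datetime64 values can be cast and therefore pass through silently.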
Hmm, my overall point is that we shouldn't be special-casing anything within the implementation here. Does this naturally dispatch to calling a method on the TimedeltaArray or DatetimeArray? I feel like it would be better for those arrays to signal that this method is not supported rather than baking it into the implementation here.
@jbrockmendel knows more about this, so he may have some other ideas.
> I feel like it would be better for those arrays
That is going to break down in heterogeneous-dtype (or just non-consolidated) cases.
There are a couple bugs here.
The first is that .cov should never work with dt64 or td64 dtypes, regardless of whether they contain NaTs. The unit on the result would be timedelta**2, which isn't a thing. This is the same reason why DatetimeArray and TimedeltaArray .var raises while .std does not.
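The std/var asymmetry mentioned above can be checked on a small timedelta Series. This is an illustrative sketch only: the sample values are made up, and it assumes that the rejected var call surfaces as a TypeError.

```python
import pandas as pd

# Hypothetical sample data: three timedeltas of 1, 2 and 3 days.
td = pd.Series(pd.to_timedelta([1, 2, 3], unit="D"))

# std keeps the original unit, so the result is itself a Timedelta.
spread = td.std()

try:
    # var would need a "timedelta squared" unit, so the reduction is rejected.
    td.var()
    var_error = None
except TypeError as exc:
    var_error = exc

print(spread)
print(type(var_error).__name__)
```

A covariance has the same squared unit as a variance, which is why the same reasoning applies to .cov.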
The second bug is not in .cov but in .corr, which should work. The problem is in to_numpy not respecting na_value correctly:
dti = pd.date_range("2016-01-01", periods=3)
df = pd.DataFrame(dti)
df.iloc[0,0] = pd.NaT
df.to_numpy(float, na_value=np.nan)
>>> df.to_numpy(float, na_value=np.nan)
array([[-9.22337204e+18],
[ 1.45169280e+18],
[ 1.45177920e+18]])
The first entry in that result should be np.nan.
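The value -9.22337204e+18 is the float form of the smallest int64, which is the sentinel NumPy uses to store NaT. The sketch below shows where that number comes from and one way to get NaN instead by masking manually. It is an illustration built on the example above, not the fix proposed for to_numpy; the names raw, as_int and as_float are introduced here for the example.

```python
import numpy as np
import pandas as pd

dti = pd.date_range("2016-01-01", periods=3)
df = pd.DataFrame(dti)
df.iloc[0, 0] = pd.NaT

# Plain datetime64 ndarray of shape (3, 1); the first entry is NaT.
raw = df.to_numpy()

# Casting datetime64 to int64 exposes the NaT sentinel (the minimum int64).
as_int = raw.astype("int64")
print(as_int[0, 0] == np.iinfo(np.int64).min)

# Manual workaround: cast to float, then overwrite the NaT positions with NaN.
as_float = as_int.astype(float)
as_float[np.isnat(raw)] = np.nan
print(as_float)
```

With the mask applied, the first entry is NaN and the remaining entries keep their integer timestamps as floats, which is the result the na_value argument is expected to produce.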