Bug cov nat #60898

Open
wants to merge 2 commits into
base: main

fbourgey
Contributor

@fbourgey fbourgey commented Feb 9, 2025

@@ -11239,6 +11239,12 @@ def cov(
c -0.150812 0.191417 0.895202
"""
data = self._get_numeric_data() if numeric_only else self
if data.select_dtypes(include=[np.datetime64, np.timedelta64]).shape[1] > 0:
Member

Is this happening in spite of numeric_only or is that False in this case?

Contributor Author

I think it would be False in that case. If numeric_only=True, then no error is raised.

Member

Hmm something feels off to me about this. I assume other types like object/string are raising naturally without any special casing here. Can you check where those are raising to see if that unlocks any clues? I don't think we should be special-casing the type selection in the algorithm like this

Contributor Author

The following

df = pd.DataFrame({"a": ["a", "b", "c"]})
df.cov()

raises

ValueError: could not convert string to float: 'a'

so object/string should be fine.

@WillAyd, do you have any other suggestions on the best way to handle this?

Member

So the question is ultimately how are those raising even though we don't branch for them within this function. Do you think you can find that through the debugger?

Contributor Author

@fbourgey fbourgey Feb 28, 2025

I can check the type of every element of the DataFrame or Series and do something like

if data.map(lambda x: isinstance(x, (np.datetime64, np.timedelta64, pd.Timedelta, pd.Timestamp, pd._libs.tslibs.nattype.NaTType))).any().any():
    raise ValueError()

The cov() method calls data.to_numpy(dtype=float, na_value=np.nan, copy=False) on data. This gives strange values when the data contains pd.NaT and raises an error when it contains strings.
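To see where the string case raises, the conversion can be run on its own, outside cov(). This is a minimal sketch, assuming an object-dtype column of strings (set explicitly here so the result does not depend on the default string dtype of the installed pandas version):

```python
import numpy as np
import pandas as pd

# Object-dtype strings, mirroring the string example earlier in the thread.
df = pd.DataFrame({"a": ["a", "b", "c"]}, dtype=object)

try:
    # The same conversion that cov() is described as performing on its data.
    df.to_numpy(dtype=float, na_value=np.nan, copy=False)
except ValueError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

The error comes from the float conversion itself, so no dtype check inside cov() is involved for strings.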

Member

Hmm my overall point is that we shouldn't be special-casing anything within the implementation here. Does this naturally dispatch to calling a method on the TimedeltaArray or DatetimeArray? I feel like it would be better for those arrays to signal that this method is not supported rather than baking it into the implementation here

@jbrockmendel knows more about this, so he may have some other ideas

Member

> I feel like it would be better for those arrays

That is going to break down in heterogeneous-dtype (or just non-consolidated) cases.

There are a couple bugs here.

The first is that .cov should never work with dt64 or td64 dtypes, regardless of whether they contain NaTs. The unit on the result would be timedelta**2, which isn't a thing. This is the same reason why DatetimeArray and TimedeltaArray .var raises while .std does not.
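The std/var asymmetry mentioned here can be checked on a small timedelta Series. This is a rough illustration, assuming a recent pandas version; the variable names are made up for the example:

```python
import pandas as pd

# Three durations: 1 s, 2 s, 3 s.
durations = pd.Series(pd.to_timedelta([1, 2, 3], unit="s"))

# The standard deviation has the same unit as the data, so it is a Timedelta.
spread = durations.std()

# The variance would need a "timedelta squared" unit, so the reduction is rejected.
try:
    durations.var()
except TypeError as exc:
    var_error = exc
else:
    var_error = None

print(spread)
print(var_error)
```

A covariance of two timedelta columns has the same unit problem as the variance, which is the argument for .cov raising on these dtypes.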

The second bug is not in .cov but in .corr, which should work. The problem is in to_numpy not respecting na_value correctly:

import numpy as np
import pandas as pd

dti = pd.date_range("2016-01-01", periods=3)
df = pd.DataFrame(dti)
df.iloc[0, 0] = pd.NaT

>>> df.to_numpy(float, na_value=np.nan)
array([[-9.22337204e+18],
       [ 1.45169280e+18],
       [ 1.45177920e+18]])

The first entry in that result should be np.nan.
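For comparison, the expected output can be approximated by masking NaT by hand. This is only a sketch of the intended result, not a proposed fix: datetimes_to_float_with_nan is a hypothetical helper that does not exist in pandas, and it assumes every column is datetime64.

```python
import numpy as np
import pandas as pd


def datetimes_to_float_with_nan(frame):
    """Hypothetical helper: epoch integers as floats, with NaT mapped to NaN."""
    values = frame.to_numpy()            # datetime64 ndarray for an all-datetime frame
    out = values.astype("int64").astype(float)
    out[np.isnat(values)] = np.nan       # replace the NaT sentinel with NaN
    return out


dti = pd.date_range("2016-01-01", periods=3)
df = pd.DataFrame(dti)
df.iloc[0, 0] = pd.NaT

result = datetimes_to_float_with_nan(df)
print(result)
```

Here the first entry is NaN and the remaining entries are the finite epoch values, which is the behaviour the comment says to_numpy(float, na_value=np.nan) should have.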

@WillAyd added the Datetime (Datetime data dtype) and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Feb 14, 2025
Labels
Datetime (Datetime data dtype), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
Development

Successfully merging this pull request may close these issues.

BUG: cov buggy when having NaT in column
3 participants