cudf-polars/pylibcudf string -> date parsing #16306
Conversation
Polars seems to enforce some kind of handling for ambiguous datetimes through this API: we can raise on them or replace them (e.g. with the earliest/latest interpretation, or null), but all options seem to rely on identifying whether they exist and where. This is problematic for us in the short term in the very general case, because we'd likely have to write some careful logic and tests to do this properly. Are there any conditions we can think of that preclude the possibility of this occurring, like maybe requiring that the format specify UTC for now?
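For context, this is roughly what polars' ambiguity handling looks like on the CPU engine (a hedged sketch assuming a recent polars where `str.to_datetime` accepts an `ambiguous` argument; the values are illustrative):

```python
import polars as pl

# "2024-11-03 01:30:00" occurs twice in US/Pacific when clocks fall back,
# so localizing it needs a policy.
df = pl.DataFrame({"a": ["2024-11-03 01:30:00"]})
out = df.select(
    pl.col("a").str.to_datetime(
        format="%Y-%m-%d %H:%M:%S",
        time_zone="US/Pacific",
        ambiguous="earliest",  # other policies: "latest", "null"; "raise" is the default
    )
)
print(out)
```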
Paging @mroeschke, who has done a load of this in cudf-classic for the pandas compatibility. If we think the answer is "it's too complicated", let's just not support it unconditionally and write it up properly as an issue.
Both … As you alluded to, a …
Thanks @mroeschke, please correct me if I am wrong, but I think this means we should at least be able to proceed with the case of …
Internally within cudf we can assume this. From polars I'm not sure if this can be assumed, especially if it's directly from user input, e.g.:

```python
df = pl.DataFrame({"a": ["2024-03-10 02:30:00 US/Pacific"]})
df.select(pl.col("a").str.to_datetime(format="%Y-%m-%d %H:%M:%S %Z"))
```

should raise as …
Just looking around for a possible combination of parameters we could support for this API. Starting with a column of strings that represent dates: …
Thanks Brandon.

Agreed, here we can raise early with a message that one should provide a format.

Agreed that this is what happens. Can you open an issue/PR documenting the "null for unparseable strings" behaviour? I think this is fine because, like polars, we would raise at collection time if we fail with `strict=True`.

Agreed.

Is that true? I think it is rather that the target dtype being parsed to has no timezone in it.

I think this makes sense.
Hopefully this logic seems right: when a user passes a format with any timezone, I'd think any string that doesn't cleanly parse to that timezone should error or yield null. This is what polars seems to do on the CPU, at least:

```python
>>> data = pl.DataFrame({'a': ['2011-01-01T00:00:00Z-01:00']}).lazy()
>>> data.select(pl.col('a').str.strptime(format='%Y-%m-%dT%H:%M:%SZ', dtype=pl.Datetime('ns'), strict=False)).collect()
shape: (1, 1)
┌──────────────┐
│ a            │
│ ---          │
│ datetime[ns] │
╞══════════════╡
│ null         │
└──────────────┘
```

So if a user passes, say, …
```python
not_timestamps = plc.unary.unary_operation(
    plc.strings.convert.convert_datetime.is_timestamp(
        col.obj, format.encode()
    ),
    plc.unary.UnaryOperator.NOT,
)
```
We can avoid the negation, because what you want is that `all(is_timestamp(col))`. So:

```python
is_timestamp = is_timestamp(...)
if strict:
    all_timestamp = plc.reduce.reduce(is_timestamp, plc.aggregation.all(), bool)
    if not plc.interop.to_arrow(all_timestamp).as_py():
        raise ...
else:
    mask = plc.unary.unary_operation(is_timestamp, NOT)
    bitmask, null_count = transform.bools_to_mask(mask)  # Need to expose this
    result = col.obj.with_mask(bitmask, null_count)
to_timestamps(...)
```
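For concreteness, a self-contained sketch of the strict branch (hedged: it assumes pylibcudf's `interop`, `reduce`, and `is_timestamp` APIs as used elsewhere in this thread, with illustrative data and format; it is not the exact PR code):

```python
import pyarrow as pa
import pylibcudf as plc

# Illustrative string column; the second value will not match the format.
col = plc.interop.from_arrow(pa.array(["2024-01-03", "not a date"]))

# Boolean column: True where the string parses under the strptime-style format.
is_ts = plc.strings.convert.convert_datetime.is_timestamp(col, "%Y-%m-%d".encode())

# Reduce with an "all" aggregation rather than negating and searching for failures.
all_ts = plc.reduce.reduce(is_ts, plc.aggregation.all(), plc.DataType(plc.TypeId.BOOL8))
if not plc.interop.to_arrow(all_ts).as_py():
    raise ValueError("strict parsing failed: not every string matches the format")
```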
Is this something we want to do?
I think it's worth updating to use a `reduce` so that we can avoid the negation in the strict case. I am a bit hesitant to implement `bools_to_mask` as part of this PR, merely for scope reasons. I think it would benefit us to take our time with a comprehensive set of tests for `bools_to_mask`, especially considering the way libcudf pads mask buffers in edge cases and so forth.
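And a hedged sketch of the non-strict path once `bools_to_mask` is exposed. Note that libcudf's `cudf::bools_to_mask` sets the validity bit where the boolean is true (returning the mask and a null count), so the mask can plausibly be built from `is_timestamp` directly rather than from its negation; the binding name and the `with_mask` usage follow the reviewer's sketch above and are assumptions here:

```python
# `col` and `is_ts` as in the previous sketch; bools_to_mask is assumed exposed.
bitmask, null_count = plc.transform.bools_to_mask(is_ts)
masked = col.with_mask(bitmask, null_count)

# Rows that failed is_timestamp are now null, so to_timestamps only
# sees parseable strings.
result = plc.strings.convert.convert_datetime.to_timestamps(
    masked, plc.DataType(plc.TypeId.TIMESTAMP_NANOSECONDS), "%Y-%m-%d".encode()
)
```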
"No introspection" makes sense as a baseline policy, but curious how we handle this with the pandas interface? Do we violate that policy for the sake of UX? E.g., import pandas as pd
import cudf
s = pd.Series(["2024-01-03"])
print(pd.to_datetime(s))
s = cudf.Series(["2024-01-03"])
print(cudf.to_datetime(s))
0 2024-01-03
dtype: datetime64[ns]
0 2024-01-03
dtype: datetime64[ns] |
Looks like in cudf Python we get the first string element, try to infer the date format (using a private-ish pandas function), and then use …
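For reference, the kind of inference being described (the exact cudf helper is elided above; pandas made its format-guesser public in recent versions, which this hedged example uses):

```python
# Public since pandas 2.1; older code reached into pandas._libs.tslibs.parsing.
from pandas.tseries.api import guess_datetime_format

print(guess_datetime_format("2024-01-03"))  # -> "%Y-%m-%d"
```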
> cudf-polars/pylibcudf string -> date parsing

This probably needs updating since pylibcudf has been moved.
Since this is going to the feature branch (which doesn't, and won't, have the pylibcudf migration), I think not.
```diff
@@ -856,6 +867,26 @@ def do_evaluate(
                    else prefix.obj,
                )
            )
+        elif self.name == pl_expr.StringFunction.Strptime:
+            # TODO: ignores ambiguous
```
What is "ambiguous", and can we add a test?
polars `strptime` enforces some handling for ambiguous datetimes, i.e. those that could map to two different points in UTC, usually due to an associated timezone, IIUC. Earlier in the general comments of this PR I was attempting to reason that certain passed datetime formats (specifically those that denote UTC) don't need to be checked for these, making handling them unnecessary as long as we only accept those formats. This would combine with only allowing `strict=False` (which nulls out any timestamps not in the right format) to produce the right result for the limited set of cases we allow, I think.
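A small illustration of that combination, using the CPU behaviour shown earlier in the thread (values are illustrative):

```python
import polars as pl

q = pl.LazyFrame({"a": ["2011-01-01T00:00:00Z", "garbage"]}).select(
    # A format that pins the input to UTC cannot produce ambiguous local
    # times, and strict=False nulls out anything that does not match.
    pl.col("a").str.strptime(
        pl.Datetime("ns"), format="%Y-%m-%dT%H:%M:%SZ", strict=False
    )
)
print(q.collect())  # first row parses; "garbage" becomes null
```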
Suggestion for the testing utility
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Can you apply:

```diff
diff --git a/python/cudf_polars/cudf_polars/testing/asserts.py b/python/cudf_polars/cudf_polars/testing/asserts.py
index bc8d1a86d6..a79d45899c 100644
--- a/python/cudf_polars/cudf_polars/testing/asserts.py
+++ b/python/cudf_polars/cudf_polars/testing/asserts.py
@@ -199,7 +199,11 @@ def assert_collect_raises(
     except polars_except:
         pass
     except Exception as e:
-        raise AssertionError(f"Polars CPU did not raise {polars_except}") from e
+        raise AssertionError(
+            f"CPU execution RAISED {type(e)}, EXPECTED {polars_except}"
+        ) from e
+    else:
+        raise AssertionError(f"CPU execution DID NOT RAISE {polars_except}")

     engine = GPUEngine(raise_on_fail=True)
     try:
@@ -207,4 +211,8 @@ def assert_collect_raises(
     except cudf_except:
         pass
     except Exception as e:
-        raise AssertionError(f"GPU did not raise {cudf_except}") from e
+        raise AssertionError(
+            f"GPU execution RAISED {type(e)}, EXPECTED {cudf_except}"
+        ) from e
+    else:
+        raise AssertionError(f"GPU execution DID NOT RAISE {cudf_except}")
diff --git a/python/cudf_polars/tests/testing/test_asserts.py b/python/cudf_polars/tests/testing/test_asserts.py
index 72954569ac..8e7f1a09d9 100644
--- a/python/cudf_polars/tests/testing/test_asserts.py
+++ b/python/cudf_polars/tests/testing/test_asserts.py
@@ -7,7 +7,10 @@ import pytest

 import polars as pl

+from cudf_polars.containers import DataFrame
+from cudf_polars.dsl.ir import Select
 from cudf_polars.testing.asserts import (
+    assert_collect_raises,
     assert_gpu_result_equal,
     assert_ir_translation_raises,
 )
@@ -33,3 +36,55 @@ def test_translation_assert_raises():
     with pytest.raises(AssertionError):
         # This should fail, because we can't translate this query, but it doesn't raise E.
         assert_ir_translation_raises(unsupported, E)
+
+
+def test_collect_assert_raises(monkeypatch):
+    df = pl.LazyFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
+
+    with pytest.raises(AssertionError):
+        # This should raise, because polars CPU can run this query
+        assert_collect_raises(
+            df,
+            polars_except=pl.exceptions.InvalidOperationError,
+            cudf_except=pl.exceptions.InvalidOperationError,
+        )
+
+    # Here's an invalid query that gets caught at IR optimisation time.
+    q = df.select(pl.col("a") * pl.col("b"))
+
+    # This exception is raised in preprocessing, so is the same for
+    # both CPU and GPU engines.
+    assert_collect_raises(
+        q,
+        polars_except=pl.exceptions.InvalidOperationError,
+        cudf_except=pl.exceptions.InvalidOperationError,
+    )
+
+    with pytest.raises(AssertionError):
+        # This should raise because the expected GPU error is wrong
+        assert_collect_raises(
+            q,
+            polars_except=pl.exceptions.InvalidOperationError,
+            cudf_except=NotImplementedError,
+        )
+
+    with pytest.raises(AssertionError):
+        # This should raise because the expected CPU error is wrong
+        assert_collect_raises(
+            q,
+            polars_except=NotImplementedError,
+            cudf_except=pl.exceptions.InvalidOperationError,
+        )
+
+    with monkeypatch.context() as m:
+        m.setattr(Select, "evaluate", lambda self, cache: DataFrame([]))
+        # This query should fail, but we monkeypatch a bad
+        # implementation of Select which "succeeds" to check that our
+        # assertion notices this case.
+        q = df.select(pl.col("a") + pl.Series([1, 2]))
+        with pytest.raises(AssertionError):
+            assert_collect_raises(
+                q,
+                polars_except=pl.exceptions.ComputeError,
+                cudf_except=pl.exceptions.ComputeError,
+            )
```

To add coverage for the new utility (and fix a logic bug).
This PR adds datetime/timestamp parsing from string columns in pylibcudf and cudf-polars.
Closes #16174