`cudf-polars`/`pylibcudf` string -> date parsing #16306

brandon-b-miller · 2024-07-18T10:59:35Z

This PR adds datetime/timestamp parsing from string columns in pylibcudf and cudf-polars.

python/cudf_polars/cudf_polars/dsl/expr.py

brandon-b-miller · 2024-07-23T16:37:27Z

Polars seems to enforce some kind of handling for ambiguous datetimes through this API, we can either error, raise, or replace them but all seem to rely on identifying if they exist and where. This is problematic for us short term in the very general case because we'd likely have to write some careful logic and tests to do this properly.

Are there any conditions we can think of that preclude the possibility of this occurring, like maybe requiring that the format specify UTC for now?

wence- · 2024-07-23T16:51:37Z

Polars seems to enforce some kind of handling for ambiguous datetimes through this API, we can either error, raise, or replace them but all seem to rely on identifying if they exist and where. This is problematic for us short term in the very general case because we'd likely have to write some careful logic and tests to do this properly.

Are there any conditions we can think of that preclude the possibility of this occurring, like maybe requiring that the format specify UTC for now?

Paging @mroeschke who has done a load of this in cudf-classic for the pandas compatibility.

If we think the answer is "it's too complicated", let's just not support unconditionally and write up properly as an issue.

mroeschke · 2024-07-23T17:34:11Z

Both %z and %Z raise in cudf classic currently as NotImplementedError

As you alluded to, a %Z in the format matching UTC exactly is the only safe timezone localization without introspecting the data. Technically any string matching %z could be supported since it should be treated as an "arbitrary UTC offset", but there's no way to represent that in the dtype today.

brandon-b-miller · 2024-07-23T18:13:00Z

Both %z and %Z raise in cudf classic currently as NotImplementedError

As you alluded to, a %Z in the format matching UTC exactly is the only safe timezone localization without introspecting the data. Technically any string matching %z could be supported since it should be treated as an "arbitrary UTC offset", but there's no way to represent that in the dtype today.

Thanks @mroeschke , please correct me if I am wrong, but I think this means we should at least be able to proceed with the case of %Z - my reasoning here being that if we assume the data is all normalized to UTC, no instant in time should be ambiguous, so any kind of "handling" should be a no-op.

mroeschke · 2024-07-23T18:51:14Z

Thanks @mroeschke , please correct me if I am wrong, but I think this means we should at least be able to proceed with the case of %Z - my reasoning here being that if we assume the data is all normalized to UTC, no instant in time should be ambiguous, so any kind of "handling" should be a no-op.

assume the data is all normalized to UTC

Internally within cudf we can assume this. From polars I'm not sure if this can be assumed, especially if it's directly from user input. e.g.:

df = pl.DataFrame({"a": ["2024-03-10 02:30:00 US/Pacific"]})
df.select(pl.col("a").str.to_datetime(format="%Y-%m-%d %H:%M:%S %Z"))

should raise as 2024-03-10 02:30:00 is not a valid date and time in US/Pacific

brandon-b-miller · 2024-07-30T20:34:30Z

Just looking around for a possible combination of parameters we could support for this API. Starting with a column strings that represent dates:

Off the bat we need to deal with the format parameter, which in polars may be None. I argue we can not support the None case because the polars API promises that the format is inferred from the data in this case, which is data introspection we won't do.
Next we need to deal with strict, which has two modes, both of which require identifying malformed timestamps and I think are theoretically possible using cudf::is_timestamp to introspect:
- if True, fail if any string fails to convert
- if False, return null for strings that fail to convert (this is not specified in the docs and is my observation based on experimenting locally)
cache seems like a CPU only backend option that doesn't have a GPU equivalent
ambiguous, as we have discussed a bit above. However for this one I am thinking there is possibly a way forward that comes from the fact that times in libcudf are implicitly all UTC, which has no transitions. This means that I think the following scenario is allowable:
- format is passed and we can state with certainty that the it specifies UTC (the user tells us the data is UTC timesatmps)
- strict may be either True or False. Either way, we'll use is_timestamp with the format to identify where they are and either do a null replace if strict is False or an any of some kind on the inverse of the mask from is_timestamp to throw if strict is True
- At this point we have either errored or we have a column of clean UTC timestamps/nulls. This should map cleanly onto the libcudf API which would normally return undefined if a malformed string is present, but won't have that issue at this point, because those places have been replaced with nulls, for which the libcudf api also returns nulls.

wence- · 2024-07-31T11:59:19Z

Thanks Brandon.

Just looking around for a possible combination of parameters we could support for this API. Starting with a column strings that represent dates:
* Off the bat we need to deal with the `format` parameter, which in polars may be `None`. I argue we can not support the `None` case because the polars API promises that `the format is inferred from the data` in this case, which is data introspection we won't do.

Agreed, here we can raise early with a message that one should provide a format.

* Next we need to deal with `strict`, which has two modes, both of which require identifying malformed timestamps and I think are theoretically possible using `cudf::is_timestamp` to introspect:
  
  * if `True`, fail if any string fails to convert
  * if `False`, return `null` for strings that fail to convert (this is not specified in the docs and is my observation based on experimenting locally)

Agreed that this is what happens. Can you open an issue/PR documenting the "null for unparseable strings" in the strict=False case?

I think this is fine because, like polars, we would raise at collection time if we fail with strict=True.

* `cache` seems like a CPU only backend option that doesn't have a GPU equivalent

Agreed.

* `ambiguous`, as we have discussed a bit above. However for this one I am thinking there is possibly a way forward that comes from the fact that times in libcudf are implicitly all UTC, which has no transitions. This means that I think the following scenario is allowable:
  
  * `format` is passed and we can state with certainty that the it specifies UTC (the user tells us the data is UTC timesatmps)

Is that true?

I think it is rather that the target dtype being parsed to has no timezone in it.

  * `strict` may be either `True` or `False`. Either way, we'll use `is_timestamp` with the `format` to identify where they are and either do a null replace if strict is `False` or an `any` of some kind on the inverse of the mask from `is_timestamp` to throw if strict is `True`
  * At this point we have either errored or we have a column of clean UTC timestamps/nulls. This should map cleanly onto the libcudf API which would normally return _undefined_ if a malformed string is present, but won't have that issue at this point, because those places have been replaced with nulls, for which the libcudf api also returns nulls.

I think this makes sense.

brandon-b-miller · 2024-07-31T15:01:23Z

Is that true?
I think it is rather that the target dtype being parsed to has no timezone in it.

Hopefully this logic seems right, when a user passes a format with any timezone, I'd think any string that doesn't cleanly parse to that timezone should error or yield null. This is what polars seems to do on the cpu at least.

>>> data = pl.DataFrame({'a':['2011-01-01T00:00:00Z-01:00']}).lazy()
>>> data.select(pl.col('a').str.strptime(format='%Y-%m-%dT%H:%M:%SZ', dtype=pl.Datetime('ns'), strict=False)).collect()
shape: (1, 1)
┌──────────────┐
│ a            │
│ ---          │
│ datetime[ns] │
╞══════════════╡
│ null         │
└──────────────┘

So if a user passes, say, '%Y-%m-%dT%H:%M:%SZ', then any data that doesn't fit that format should end up being converted to null having not passed a theoretical is_timestamp(str, '%Y-%m-%dT%H:%M:%SZ') check. Any elements that haven't been nullified at this point then must be properly formed UTC timestamps, and therefore aren't subject to ambiguity, so we shouldn't need handling for the ambiguous kwarg until we somehow accept a wider set of formats later on.

wence- · 2024-08-01T15:57:34Z

python/cudf_polars/cudf_polars/dsl/expr.py

+            not_timestamps = plc.unary.unary_operation(
+                plc.strings.convert.convert_datetime.is_timestamp(
+                    col.obj, format.encode()
+                ),
+                plc.unary.UnaryOperator.NOT,
+            )


We can avoid the negation, because what you want is that all(is_timestamp(col)).

So:

is_timestamp = is_timestamp(...) if strict: all_timestamp = plc.reduce.reduce(is_timestamp, plc.aggregation.all(), bool) if not plc.interop.to_arrow(all_timestamp).as_py(): raise ... else: mask = plc.unary.unary_operation(is_timestamp, NOT) bitmask, null_count = transform.bools_to_mask(mask) # Need to expose this result = col.obj.with_mask(bitmask, null_count) to_timestamps(...)

Is this something we want to do?

I think it's worth updating to use a reduce so that we can avoid the negation in the strict case. I am a bit hesitant to implement bools_to_mask as part of this PR merely for scope reasons. I think it would benefit us to take our time with a comprehensive set of tests for bools_to_mask especially considering the way libcudf pads mask buffers in edge cases and so forth.

python/cudf_polars/tests/expressions/test_stringfunction.py

beckernick · 2024-08-01T18:05:28Z

"No introspection" makes sense as a baseline policy, but curious how we handle this with the pandas interface? Do we violate that policy for the sake of UX?

E.g.,

import pandas as pd
import cudf

s = pd.Series(["2024-01-03"])
print(pd.to_datetime(s))

s = cudf.Series(["2024-01-03"])
print(cudf.to_datetime(s))
0   2024-01-03
dtype: datetime64[ns]
0   2024-01-03
dtype: datetime64[ns]

mroeschke · 2024-08-01T18:25:54Z

curious how we handle this with the pandas interface?

Looks like in cudf Python we get the first string element, try to infer the date format (using an private-ish pandas function), and then use strptime with that inferred date format on the entire column.

bdice · 2024-08-17T20:15:33Z

This probably needs updated since pylibcudf has been moved.

wence- · 2024-08-20T16:11:53Z

This probably needs updated since pylibcudf has been moved.

Since this is going to the feature branch (which doesn't, and won't, have the pylibcudf migration), I think not

wence- · 2024-08-20T16:12:40Z

python/cudf_polars/cudf_polars/dsl/expr.py

@@ -856,6 +867,26 @@ def do_evaluate(
                    else prefix.obj,
                )
            )
+        elif self.name == pl_expr.StringFunction.Strptime:
+            # TODO: ignores ambiguous


What is "ambiguous", and can we add a test?

polars strptime enforces some handling for ambiguous datetimes - those that could map to two different points in UTC, usually due an associated timezone IIUC. Earlier in the general comments if this PR I was attempting to reason that certain passed datetime formats (specifically those that denote utc) don't need to be checked for these - making handling them unnecessary as long as we only accept those formats. This would combine with only allowing strict=False (which nulls out any timestamps not in the right format) to produce the right result for a limited set of cases we allow, I think.

python/cudf_polars/tests/expressions/test_stringfunction.py

wence- · 2024-08-20T16:33:10Z

python/cudf_polars/cudf_polars/dsl/expr.py

+            not_timestamps = plc.unary.unary_operation(
+                plc.strings.convert.convert_datetime.is_timestamp(
+                    col.obj, format.encode()
+                ),
+                plc.unary.UnaryOperator.NOT,
+            )


Is this something we want to do?

python/cudf_polars/cudf_polars/dsl/expr.py

…bcudf-polars-date-from-str

wence-

Suggestion for the testing utility

python/cudf_polars/cudf_polars/testing/asserts.py

python/cudf_polars/tests/expressions/test_stringfunction.py

Co-authored-by: Lawrence Mitchell <wence@gmx.li>

wence- · 2024-08-28T15:26:04Z

Can you apply:

diff --git a/python/cudf_polars/cudf_polars/testing/asserts.py b/python/cudf_polars/cudf_polars/testing/asserts.py
index bc8d1a86d6..a79d45899c 100644
--- a/python/cudf_polars/cudf_polars/testing/asserts.py
+++ b/python/cudf_polars/cudf_polars/testing/asserts.py
@@ -199,7 +199,11 @@ def assert_collect_raises(
     except polars_except:
         pass
     except Exception as e:
-        raise AssertionError(f"Polars CPU did not raise {polars_except}") from e
+        raise AssertionError(
+            f"CPU execution RAISED {type(e)}, EXPECTED {polars_except}"
+        ) from e
+    else:
+        raise AssertionError(f"CPU execution DID NOT RAISE {polars_except}")
 
     engine = GPUEngine(raise_on_fail=True)
     try:
@@ -207,4 +211,8 @@ def assert_collect_raises(
     except cudf_except:
         pass
     except Exception as e:
-        raise AssertionError(f"GPU did not raise {cudf_except}") from e
+        raise AssertionError(
+            f"GPU execution RAISED {type(e)}, EXPECTED {polars_except}"
+        ) from e
+    else:
+        raise AssertionError(f"GPU execution DID NOT RAISE {polars_except}")
diff --git a/python/cudf_polars/tests/testing/test_asserts.py b/python/cudf_polars/tests/testing/test_asserts.py
index 72954569ac..8e7f1a09d9 100644
--- a/python/cudf_polars/tests/testing/test_asserts.py
+++ b/python/cudf_polars/tests/testing/test_asserts.py
@@ -7,7 +7,10 @@ import pytest
 
 import polars as pl
 
+from cudf_polars.containers import DataFrame
+from cudf_polars.dsl.ir import Select
 from cudf_polars.testing.asserts import (
+    assert_collect_raises,
     assert_gpu_result_equal,
     assert_ir_translation_raises,
 )
@@ -33,3 +36,55 @@ def test_translation_assert_raises():
     with pytest.raises(AssertionError):
         # This should fail, because we can't translate this query, but it doesn't raise E.
         assert_ir_translation_raises(unsupported, E)
+
+
+def test_collect_assert_raises(monkeypatch):
+    df = pl.LazyFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
+
+    with pytest.raises(AssertionError):
+        # This should raise, because polars CPU can run this query
+        assert_collect_raises(
+            df,
+            polars_except=pl.exceptions.InvalidOperationError,
+            cudf_except=pl.exceptions.InvalidOperationError,
+        )
+
+    # Here's an invalid query that gets caught at IR optimisation time.
+    q = df.select(pl.col("a") * pl.col("b"))
+
+    # This exception is raised in preprocessing, so is the same for
+    # both CPU and GPU engines.
+    assert_collect_raises(
+        q,
+        polars_except=pl.exceptions.InvalidOperationError,
+        cudf_except=pl.exceptions.InvalidOperationError,
+    )
+
+    with pytest.raises(AssertionError):
+        # This should raise because the expected GPU error is wrong
+        assert_collect_raises(
+            q,
+            polars_except=pl.exceptions.InvalidOperationError,
+            cudf_except=NotImplementedError,
+        )
+
+    with pytest.raises(AssertionError):
+        # This should raise because the expected CPU error is wrong
+        assert_collect_raises(
+            q,
+            polars_except=NotImplementedError,
+            cudf_except=pl.exceptions.InvalidOperationError,
+        )
+
+    with monkeypatch.context() as m:
+        m.setattr(Select, "evaluate", lambda self, cache: DataFrame([]))
+        # This query should fail, but we monkeypatch a bad
+        # implementation of Select which "succeeds" to check that our
+        # assertion notices this case.
+        q = df.select(pl.col("a") + pl.Series([1, 2]))
+        with pytest.raises(AssertionError):
+            assert_collect_raises(
+                q,
+                polars_except=pl.exceptions.ComputeError,
+                cudf_except=pl.exceptions.ComputeError,
+            )

To add coverage for the new utility (and fix a logic bug)

brandon-b-miller added 2 commits July 17, 2024 12:27

plumb datetime/timedelta conversion to pylibcudf

6c71b32

begin polars tests

3baef83

brandon-b-miller added feature request New feature or request non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Jul 18, 2024

github-actions bot added Python Affects Python cuDF API. CMake CMake build issue pylibcudf Issues specific to the pylibcudf package labels Jul 18, 2024

brandon-b-miller commented Jul 18, 2024

View reviewed changes

python/cudf_polars/cudf_polars/dsl/expr.py Show resolved Hide resolved

Merge branch 'branch-24.08' into pylibcudf-polars-date-from-str

8c663ff

wence- reviewed Aug 1, 2024

View reviewed changes

python/cudf_polars/tests/expressions/test_stringfunction.py Outdated Show resolved Hide resolved

brandon-b-miller changed the base branch from branch-24.08 to branch-24.10 August 5, 2024 19:50

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas labels Aug 5, 2024

brandon-b-miller removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas labels Aug 5, 2024

brandon-b-miller added 4 commits August 10, 2024 15:17

slightly better tests

d993c3b

fix up tests

1c1e8ab

continue fixing up

28ae458

merge latest/resolve conflicts

1397354

brandon-b-miller force-pushed the pylibcudf-polars-date-from-str branch from d91d59a to 1397354 Compare August 10, 2024 22:28

github-actions bot added the CMake CMake build issue label Aug 10, 2024

brandon-b-miller added 5 commits August 11, 2024 13:51

do not support strict/do not throw runtime errors

8e0745e

start of tests

28ae041

broken to_duration tests

b432495

passing

cc7eafa

comment

77c3c98

brandon-b-miller changed the title ~~Pylibcudf polars date from str~~ cudf-polars/pylibcudf string -> date parsing Aug 16, 2024

brandon-b-miller marked this pull request as ready for review August 16, 2024 19:10

brandon-b-miller requested a review from a team as a code owner August 16, 2024 19:10

brandon-b-miller requested review from vyasr and bdice and removed request for a team August 16, 2024 19:10

wence- reviewed Aug 20, 2024

View reviewed changes

brandon-b-miller added 2 commits August 21, 2024 05:02

Merge remote-tracking branch 'upstream/feature/cudf-polars' into pyli…

735fd96

…bcudf-polars-date-from-str

address reviews

d9d2c98

wence- reviewed Aug 27, 2024

View reviewed changes

python/cudf_polars/cudf_polars/testing/asserts.py Outdated Show resolved Hide resolved

python/cudf_polars/tests/expressions/test_stringfunction.py Outdated Show resolved Hide resolved

python/cudf_polars/tests/expressions/test_stringfunction.py Outdated Show resolved Hide resolved

brandon-b-miller and others added 3 commits August 28, 2024 06:59

Apply suggestions from code review

98a2972

Co-authored-by: Lawrence Mitchell <wence@gmx.li>

address reviews, update testing utility

0a71f2d

merge/resolve

e3f4b3b

wence- approved these changes Aug 28, 2024

View reviewed changes

apply patch

e638138

wence- merged commit 0bf68d4 into rapidsai:feature/cudf-polars Aug 28, 2024
80 of 81 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`cudf-polars`/`pylibcudf` string -> date parsing #16306

`cudf-polars`/`pylibcudf` string -> date parsing #16306

brandon-b-miller commented Jul 18, 2024 •

edited

Loading

brandon-b-miller commented Jul 23, 2024

wence- commented Jul 23, 2024

mroeschke commented Jul 23, 2024

brandon-b-miller commented Jul 23, 2024

mroeschke commented Jul 23, 2024 •

edited

Loading

brandon-b-miller commented Jul 30, 2024

wence- commented Jul 31, 2024

brandon-b-miller commented Jul 31, 2024

wence- Aug 1, 2024

wence- Aug 20, 2024

brandon-b-miller Aug 21, 2024

beckernick commented Aug 1, 2024

mroeschke commented Aug 1, 2024

bdice commented Aug 17, 2024

wence- commented Aug 20, 2024

wence- Aug 20, 2024

brandon-b-miller Aug 20, 2024

wence- Aug 20, 2024

wence- left a comment

wence- commented Aug 28, 2024

cudf-polars/pylibcudf string -> date parsing #16306

cudf-polars/pylibcudf string -> date parsing #16306

Conversation

brandon-b-miller commented Jul 18, 2024 • edited Loading

brandon-b-miller commented Jul 23, 2024

wence- commented Jul 23, 2024

mroeschke commented Jul 23, 2024

brandon-b-miller commented Jul 23, 2024

mroeschke commented Jul 23, 2024 • edited Loading

brandon-b-miller commented Jul 30, 2024

wence- commented Jul 31, 2024

brandon-b-miller commented Jul 31, 2024

wence- Aug 1, 2024

Choose a reason for hiding this comment

wence- Aug 20, 2024

Choose a reason for hiding this comment

brandon-b-miller Aug 21, 2024

Choose a reason for hiding this comment

beckernick commented Aug 1, 2024

mroeschke commented Aug 1, 2024

bdice commented Aug 17, 2024

wence- commented Aug 20, 2024

wence- Aug 20, 2024

Choose a reason for hiding this comment

brandon-b-miller Aug 20, 2024

Choose a reason for hiding this comment

wence- Aug 20, 2024

Choose a reason for hiding this comment

wence- left a comment

Choose a reason for hiding this comment

wence- commented Aug 28, 2024

`cudf-polars`/`pylibcudf` string -> date parsing #16306

`cudf-polars`/`pylibcudf` string -> date parsing #16306

brandon-b-miller commented Jul 18, 2024 •

edited

Loading

mroeschke commented Jul 23, 2024 •

edited

Loading