Replace cudf _from_columns with a public API #1326

mroeschke · 2024-01-22T21:12:05Z

Description

This private API will be removed in 24.02

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

harrism

A question.

harrism · 2024-01-22T21:22:10Z

python/cuspatial/cuspatial/core/spatial/distance.py

@@ -91,7 +91,7 @@ def directed_hausdorff_distance(multipoints: GeoSeries):
        as_column(multipoints.multipoints.geometry_offset[:-1]),
    )

-    return DataFrame._from_columns(result, range(num_spaces))
+    return DataFrame(dict(enumerate(result)))


Is this as performant as before? It's definitely less obvious what is happening compared to before, so perhaps a 1-liner doc? As a non-Python-native, is this considered Pythonic?

There's a little more overhead due to validation checks in DataFrame as opposed to DataFrame._from_columns, but this is the most equivalent public API to the previous line. Here's the performance of this patch with 500k points

In [1]: from shapely.geometry import MultiPoint ...: In [2]: import cuspatial ...: In [3]: data = [[(0, 0)] * 500_000] In [4]: s = cuspatial.GeoSeries([MultiPoint(coords) for coords in data]) # PR In [5]: %timeit cuspatial.directed_hausdorff_distance(s) 555 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As a non-Python-native, is this considered Pythonic?

I would say so yes. I will add a 1 line comment about what this line is doing.

The dict(enumerate(...)) is a Pythonic way to construct a dictionary of the form {i: values[i]} from a sequence values. This constructor is less performant than the old approach. The more performant choice here would be to use DataFrame._from_data(dict(enumerate(...))). I assume Matt picked this alternative to avoid using an internal method. The reality though is that cuspatial hooks heavily into cudf internals at the Python layer. Until we get around to a more thorough refactoring of cudf to better expose the functionality cuspatial relies on, it's probably OK to continue using cudf internals for better perf. I'm OK with either though and will leave it up to Matt to decide what to use.

Since there's an understanding that cuspatial uses internal cudf APIs already and performant, public APIs don't exist for cuspatial yet, I'll switch this to the more performant _from_data.

Honestly I think we should remove internal cuDF calls from cuSpatial except in cases where there is a significant performance benefit. In other cases the maintenance cost is not worthwhile.

I don't know what the performance benefit is here, I think you only showed the performance of the dict() approach above.

OK actually it appears validation through the public cudf.DataFrame doesn't scale well with wide frames. Not sure what is a representative scale for directed_hausdorff_distance is, but for a 1000 x 1000 point comparison, this routine's microbenchmark is equivalent to

In [1]: import cudf In [2]: columns = [cudf.core.column.as_column(range(1000), dtype="float64")] * 1000 In [3]: %timeit cudf.DataFrame._from_data(dict(enumerate(columns))) 371 µs ± 6.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) In [4]: %timeit cudf.DataFrame(dict(enumerate(columns))) 11.3 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So let's keep _from_data here for now.

Honestly I think we should remove internal cuDF calls from cuSpatial except in cases where there is a significant performance benefit. In other cases the maintenance cost is not worthwhile.

I agree, but IIRC there are also cases where cuSpatial uses cuDF internals because there is no public-facing API for doing what cuSpatial needs. For example, cuDF is not really designed for extensible sets of dtypes, so cuSpatial uses internal methods to set up usage of the GeoColumn because there is no public access to column-level APIs otherwise. We hope to make that properly extensible at some point, but that's a long ways away.

Another main reason is that GeoColumn/GeoDataFrame extends Column/DataFrame so it depends on a lot of their internals. Removing the inheritance and using just the public APIs from cuDF may give us what we need minus the performance cost. (Use a Series everywhere a column op might come into play).

I think it would be possible but would have a significant performance cost. We can evaluate when we get around to the appropriate stage of cudf refactoring.

vyasr · 2024-01-23T20:39:49Z

/merge

Replace cudf _from_columns with a public API

fd5d977

mroeschke requested a review from a team as a code owner January 22, 2024 21:12

mroeschke requested review from trxcllnt and thomcom January 22, 2024 21:12

github-actions bot added the Python Related to Python code label Jan 22, 2024

harrism reviewed Jan 22, 2024

View reviewed changes

harrism added bug Something isn't working non-breaking Non-breaking change labels Jan 22, 2024

mroeschke added 2 commits January 22, 2024 13:47

Add comment about result

1176895

Use _from_data

f58d987

isVoid approved these changes Jan 23, 2024

View reviewed changes

harrism approved these changes Jan 23, 2024

View reviewed changes

This was referenced Jan 23, 2024

handle more RAPIDS version formats in update-version.sh, refactor dependencies.yaml #1329

Merged

Add support for CUDA 12.2 wheels rapidsai/build-planning#7

Closed

rapids-bot bot merged commit 534b267 into rapidsai:branch-24.02 Jan 23, 2024
62 checks passed

mroeschke deleted the fix/_from_columns branch January 23, 2024 20:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace cudf _from_columns with a public API #1326

Replace cudf _from_columns with a public API #1326

mroeschke commented Jan 22, 2024 •

edited by harrism

Loading

harrism left a comment

harrism Jan 22, 2024

mroeschke Jan 22, 2024

vyasr Jan 22, 2024

mroeschke Jan 22, 2024 •

edited

Loading

harrism Jan 22, 2024 •

edited

Loading

mroeschke Jan 23, 2024

vyasr Jan 23, 2024

isVoid Jan 23, 2024

vyasr Jan 23, 2024

vyasr commented Jan 23, 2024

Replace cudf _from_columns with a public API #1326

Replace cudf _from_columns with a public API #1326

Conversation

mroeschke commented Jan 22, 2024 • edited by harrism Loading

Description

Checklist

harrism left a comment

Choose a reason for hiding this comment

harrism Jan 22, 2024

Choose a reason for hiding this comment

mroeschke Jan 22, 2024

Choose a reason for hiding this comment

vyasr Jan 22, 2024

Choose a reason for hiding this comment

mroeschke Jan 22, 2024 • edited Loading

Choose a reason for hiding this comment

harrism Jan 22, 2024 • edited Loading

Choose a reason for hiding this comment

mroeschke Jan 23, 2024

Choose a reason for hiding this comment

vyasr Jan 23, 2024

Choose a reason for hiding this comment

isVoid Jan 23, 2024

Choose a reason for hiding this comment

vyasr Jan 23, 2024

Choose a reason for hiding this comment

vyasr commented Jan 23, 2024

mroeschke commented Jan 22, 2024 •

edited by harrism

Loading

mroeschke Jan 22, 2024 •

edited

Loading

harrism Jan 22, 2024 •

edited

Loading