Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace cudf _from_columns with a public API #1326

Merged
merged 3 commits into from
Jan 23, 2024

Conversation

mroeschke
Copy link
Contributor

@mroeschke mroeschke commented Jan 22, 2024

Description

This private API will be removed in 24.02

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke requested a review from a team as a code owner January 22, 2024 21:12
@github-actions github-actions bot added the Python Related to Python code label Jan 22, 2024
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question.

@@ -91,7 +91,7 @@ def directed_hausdorff_distance(multipoints: GeoSeries):
as_column(multipoints.multipoints.geometry_offset[:-1]),
)

return DataFrame._from_columns(result, range(num_spaces))
return DataFrame(dict(enumerate(result)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this as performant as before? It's definitely less obvious what is happening compared to before, so perhaps a 1-liner doc? As a non-Python-native, is this considered Pythonic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a little more overhead due to validation checks in DataFrame as opposed to DataFrame._from_columns, but this is the most equivalent public API to the previous line. Here's the performance of this patch with 500k points

In [1]: from shapely.geometry import MultiPoint
   ...: 

In [2]: import cuspatial
   ...: 

In [3]: data = [[(0, 0)] * 500_000]

In [4]: s = cuspatial.GeoSeries([MultiPoint(coords) for coords in data])

# PR
In [5]: %timeit cuspatial.directed_hausdorff_distance(s)
555 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As a non-Python-native, is this considered Pythonic?

I would say so yes. I will add a 1 line comment about what this line is doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dict(enumerate(...)) is a Pythonic way to construct a dictionary of the form {i: values[i]} from a sequence values. This constructor is less performant than the old approach. The more performant choice here would be to use DataFrame._from_data(dict(enumerate(...))). I assume Matt picked this alternative to avoid using an internal method. The reality though is that cuspatial hooks heavily into cudf internals at the Python layer. Until we get around to a more thorough refactoring of cudf to better expose the functionality cuspatial relies on, it's probably OK to continue using cudf internals for better perf. I'm OK with either though and will leave it up to Matt to decide what to use.

Copy link
Contributor Author

@mroeschke mroeschke Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's an understanding that cuspatial uses internal cudf APIs already and performant, public APIs don't exist for cuspatial yet, I'll switch this to the more performant _from_data.

Copy link
Member

@harrism harrism Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think we should remove internal cuDF calls from cuSpatial except in cases where there is a significant performance benefit. In other cases the maintenance cost is not worthwhile.

I don't know what the performance benefit is here, I think you only showed the performance of the dict() approach above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK actually it appears validation through the public cudf.DataFrame doesn't scale well with wide frames. Not sure what is a representative scale for directed_hausdorff_distance is, but for a 1000 x 1000 point comparison, this routine's microbenchmark is equivalent to

In [1]: import cudf

In [2]: columns = [cudf.core.column.as_column(range(1000), dtype="float64")] * 1000

In [3]: %timeit cudf.DataFrame._from_data(dict(enumerate(columns)))
371 µs ± 6.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [4]: %timeit cudf.DataFrame(dict(enumerate(columns)))
11.3 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So let's keep _from_data here for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think we should remove internal cuDF calls from cuSpatial except in cases where there is a significant performance benefit. In other cases the maintenance cost is not worthwhile.

I agree, but IIRC there are also cases where cuSpatial uses cuDF internals because there is no public-facing API for doing what cuSpatial needs. For example, cuDF is not really designed for extensible sets of dtypes, so cuSpatial uses internal methods to set up usage of the GeoColumn because there is no public access to column-level APIs otherwise. We hope to make that properly extensible at some point, but that's a long ways away.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another main reason is that GeoColumn/GeoDataFrame extends Column/DataFrame so it depends on a lot of their internals. Removing the inheritance and using just the public APIs from cuDF may give us what we need minus the performance cost. (Use a Series everywhere a column op might come into play).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be possible but would have a significant performance cost. We can evaluate when we get around to the appropriate stage of cudf refactoring.

@harrism harrism added bug Something isn't working non-breaking Non-breaking change labels Jan 22, 2024
@vyasr
Copy link
Contributor

vyasr commented Jan 23, 2024

/merge

@rapids-bot rapids-bot bot merged commit 534b267 into rapidsai:branch-24.02 Jan 23, 2024
62 checks passed
@mroeschke mroeschke deleted the fix/_from_columns branch January 23, 2024 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Related to Python code
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

4 participants