Add partition(similar to numpy.partition) #1498

NewBornRustacean · 2025-03-27T15:49:58Z

1. Overview

This is about the partition function to handle both contiguous and non-contiguous memory more efficiently. The function now uses a simpler approach for handling non-contiguous memory, reducing overhead in the fallback path.

2. Motivation

The partition function is an important operation that partially sorts an array around the k-th element along a specified axis. This is useful for:

Efficiently finding the k-th smallest/largest element without fully sorting
Dividing data into smaller-than-k and larger-than-k groups (e.g., for quick select algorithms)
Implementing median-finding algorithms

3. Note

The implementation relies on select_nth_unstable
- It might be possible to write our own functions to implement quickselect algorithm rather than to use existing one.
- let me know if that's the case😃
The current implementation of partition() creates a temporary array for each lane when memory is non-contiguous, which can be inefficient.

4. We can go further

in numpy.partition, it is possible to pass sequence of ints as a kth parameter. It would be nice to have one in our case, maybe seperated function like partition_multiple()
there is also interesting and useful function numpy.argpartition. I think I could implement that also as a follow-up.

Related issue

add numpy.partition like functionality #1497

nilgoyette

I wonder if this function should be in ndarray or in another crate. I don't have the answer so I did a normal review :)

src/impl_methods.rs

akern40 · 2025-03-30T02:24:18Z

@nilgoyette here's that question again about what belongs in the library - it's pesky! It keeps coming! I think I'm ok with this? I suppose its home could be in a sort of ndarray-sort crate? Certainly its implementation is complex enough that we shouldn't expect users to write it themselves. And we could always include it in some sort of future, large "crate split".

On the other hand, devil's advocate: this function doesn't have an issue from a user, but rather is driven by parity with NumPy. If we decide to keep accepting NumPy functions, our method-discoverability problem is gonna get worse, right?

@NewBornRustacean you're welcome to join the discussion on this. In the meantime: I believe that to_owned will always return a fully-contiguous array (in fact, I believe that Array must always be contiguous?). If that's true, then the non-contiguous part of your code will never be called. You may be able to get rid of it!

NewBornRustacean · 2025-03-30T04:37:24Z

@nilgoyette here's that question again about what belongs in the library - it's pesky! It keeps coming! I think I'm ok with this? I suppose its home could be in a sort of ndarray-sort crate? Certainly its implementation is complex enough that we shouldn't expect users to write it themselves. And we could always include it in some sort of future, large "crate split".

On the other hand, devil's advocate: this function doesn't have an issue from a user, but rather is driven by parity with NumPy. If we decide to keep accepting NumPy functions, our method-discoverability problem is gonna get worse, right?

@NewBornRustacean you're welcome to join the discussion on this. In the meantime: I believe that to_owned will always return a fully-contiguous array (in fact, I believe that Array must always be contiguous?). If that's true, then the non-contiguous part of your code will never be called. You may be able to get rid of it!

@akern40 @nilgoyette what's up! thanks to kind review and suggestions!

about contiguous part : working on it! removing part to handle non-contiguous. thanks!

I largely agree with you akern40 ! If there were a dedicated crate like ndarray-sort, that would be ideal. Expanding functionality gradually and then splitting it into a separate crate at some point sounds like a good approach.

Before submitting this PR, I reviewed the maintainer table discussions and past issues, and it seems like this "what belongs in the library" question is closely tied to the future direction of ndarray development.

Thinking about why users rely on NumPy, I’d say (at least in my case) the two main reasons are:

Rapid prototyping
Data processing for data analysis

For ndarray users, a major motivation seems to be removing Python from the ML ecosystem. But in that case, most people probably wouldn't start developing directly with ndarray from scratch. Instead, they’re likely trying to migrate something that already exists in a Python-based environment (at least, that was my experience—I'm curious if others feel the same!). If that’s the case, wouldn’t these users be particularly interested in reason 1: "rapid prototyping"?

I don’t think ndarray needs to be fully functionally equivalent to NumPy! However, in a perspective of "rapid prototyping", I do believe that supporting at least some of NumPy’s functionality is essential for attracting those users who are trying to remove Python from their ML workflows.

nilgoyette · 2025-03-30T12:08:46Z

Wait, I didn't even realize there was a to_owned just before!

Why do we need it? The comment above doesn't explain. In fact, that comment is not related to this line.
Even if the array is owned, it will be contiguous only for one axis, no? If you iterate on the "wrong" axis, won't you still need to non-contiguous code?

NewBornRustacean · 2025-03-30T12:29:19Z

Wait, I didn't even realize there was a to_owned just before!

Why do we need it? The comment above doesn't explain. In fact, that comment is not related to this line.

Even if the array is owned, it will be contiguous only for one axis, no? If you iterate on the "wrong" axis, won't you still need to non-contiguous code?

Yeah, I'm facing the issue. When I call the function with Axis(1), for example, each of the lanes could be non contigous(I think this is because of the stride, but I'm not sure).

Or we can implement our own select function so don't rely on select_nth_unstable() that require contiguous memory.

akern40 · 2025-03-30T12:58:42Z

@nilgoyette yes you're right!! I am being silly - you can definitely still be non-contiguous depending on the axis. Sorry everyone.

nilgoyette · 2025-03-30T14:00:33Z

When I call the function with Axis(1), for example, each of the lanes could be non contigous(I think this is because of the stride, but I'm not sure).

Any contiguous array (2D or more) will be non-contiguous on at least one axis. And I can't say which axis because it can use 'c' or 'f' order (please read about those terms if you never heard them before). Imagine a matrix (2D array)

 -->
1 2 3
4 5 6

It is contiguous if you read it in its natural order (1 2 3 4 5 6), but it's not if you read it in the other axis (1 4 2 5 3 6).

Knowing this, it's a good idea to provide code that works in both cases (non-contiguous version of select_nth_unstable), or to check and optimize, like you did. Of course, you're free to choose what you want to code (we're doing open source after all). Both versions work. If you're really motivated, you can do both and create a small benchmark to know which one is the best/fastest.

Btw, I honestly don't have the answer :) Allocating is slow, copying is not slow but we try to avoid it if we can. However, working on a big matrix on a non-contiguous array (can't fit in the cache) is slow too. You might discover that creating a temp array is faster only on huge matrices, but not in general.

NewBornRustacean · 2025-04-03T13:53:47Z

@nilgoyette Sorry for the late update! I was thinking about ways to avoid copying in the non-contiguous case. I was considering whether it might be possible by flattening the input array and then obtaining slices corresponding to each lane, but it didn’t work out as expected.

(I referred to NumPy’s internal implementation, and it seems that NumPy also performs a copy in the non-contiguous case.)

For now, I’ve only applied the changes you mentioned in the last review!

stackoverflow: how can I find the actual implementation of np.partition?

src/impl_methods.rs

nilgoyette · 2025-04-06T17:31:13Z

I decided to merge this PR, leaving the question "should we merge this in ndarray?" open. We will probably have to split ndarray into several smaller crates someday. In the meantime, I think we should avoid driving contributor away.

NewBornRustacean added 2 commits March 28, 2025 00:46

fn partition, first draft

b800bec

handle non-contiguous with create and copy back

df59179

NewBornRustacean marked this pull request as ready for review March 28, 2025 02:31

nilgoyette requested changes Mar 29, 2025

View reviewed changes

src/impl_methods.rs Outdated Show resolved Hide resolved

Merge branch 'master' into feature-add-numpy-like-partition

8034819

NewBornRustacean added 2 commits April 3, 2025 22:39

include review: not to allocate n-vector, reuse a single vector

8f72117

format nightly

a0df066

nilgoyette reviewed Apr 3, 2025

View reviewed changes

src/impl_methods.rs Outdated Show resolved Hide resolved

src/impl_methods.rs Outdated Show resolved Hide resolved

applied feedback: safe unwrap(), avoid navie iter()

3344cf8

nilgoyette reviewed Apr 5, 2025

View reviewed changes

src/impl_methods.rs Outdated Show resolved Hide resolved

initialize temp vector in advance

05dee0e

nilgoyette approved these changes Apr 5, 2025

View reviewed changes

nilgoyette merged commit 2324d2a into rust-ndarray:master Apr 6, 2025
15 checks passed

NewBornRustacean deleted the feature-add-numpy-like-partition branch April 7, 2025 11:02

akern40 mentioned this pull request Apr 10, 2025

Partition needs a check for empty arrays #1501

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add partition(similar to numpy.partition) #1498

Add partition(similar to numpy.partition) #1498

NewBornRustacean commented Mar 27, 2025 •

edited

Loading

nilgoyette left a comment

akern40 commented Mar 30, 2025

NewBornRustacean commented Mar 30, 2025

nilgoyette commented Mar 30, 2025

NewBornRustacean commented Mar 30, 2025

akern40 commented Mar 30, 2025

nilgoyette commented Mar 30, 2025

NewBornRustacean commented Apr 3, 2025

nilgoyette commented Apr 6, 2025

Add partition(similar to numpy.partition) #1498

Add partition(similar to numpy.partition) #1498

Conversation

NewBornRustacean commented Mar 27, 2025 • edited Loading

1. Overview

2. Motivation

3. Note

4. We can go further

Related issue

nilgoyette left a comment

Choose a reason for hiding this comment

akern40 commented Mar 30, 2025

NewBornRustacean commented Mar 30, 2025

nilgoyette commented Mar 30, 2025

NewBornRustacean commented Mar 30, 2025

akern40 commented Mar 30, 2025

nilgoyette commented Mar 30, 2025

NewBornRustacean commented Apr 3, 2025

nilgoyette commented Apr 6, 2025

NewBornRustacean commented Mar 27, 2025 •

edited

Loading