Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass a numpy array of booleans to python simplify? #2968

Closed
hyanwong opened this issue Jun 27, 2024 · 7 comments
Closed

Pass a numpy array of booleans to python simplify? #2968

hyanwong opened this issue Jun 27, 2024 · 7 comments

Comments

@hyanwong
Copy link
Member

hyanwong commented Jun 27, 2024

For teaching purposes, it's nice to be able to do

ts.simplify(ts.nodes_time == 0)

rather than

ts.simplify(np.where(ts.nodes_time == 0)[0])

What do people think of the idea of special-casing the samples argument to simplify() to check if it is of length num_nodes and of dtype=bool, and then simply doing a np.where(samples)[0]? At the moment this fails with a Duplicate sample value error, as it treats the 0s and 1s as sample IDs.

@jeromekelleher
Copy link
Member

It's a bunch of work to do, and not that much different? Seems to me that you'd have to explain more by adding the boolean selector for nodes as well as the list of nodes option.

@hyanwong
Copy link
Member Author

hyanwong commented Jun 27, 2024

It's only a 5 line addition at the top of tables.simplify(), isn't it?

try:
    if len(samples) == ts.num_nodes and samples.dtype == bool:
        samples = np.where(samples)[0]
except AttributeError:
   pass

You're right that it's not that different, and it's not really a priority, but I'm finding that every extra barrier to new tskit users puts some of them off (and takes an extra few minutes to explain). There's a reason why numpy allows both boolean and numerical indexing. I'm not sure I would actually explain in a practical that you can use both: it's obvious from the context, right?

@jeromekelleher
Copy link
Member

As usual with these things, implementation is by far the easiest thing and something like 5% of the actual effort. Testing, documenting and making sure there are no regressions against existing code make the majority of the work.

@petrelharp
Copy link
Contributor

I agree. Higher on my list would be adding an individuals argument.

@hyanwong
Copy link
Member Author

hyanwong commented Jun 27, 2024

I guess part of the problem is nothing to do with tskit, but the very unintuitive np.where syntax which requires the [0] at the end (which I always forget). The other option is

ts.simplify(np.flatnonzero(ts.nodes_time == 0))

but that's hardly more intuitive, IMO. Or alternatively

ts.simplify((ts.nodes_time == 0).nonzero()[0])

Which is equally cryptic, but at least doesn't need you to use the np prefix, which again requires explanation when it's in the first few lines of a beginner's tutorial.

Anyway, if no-one thinks it's a good idea, I'll close this. However, it's worth pointing out that I'm finding it hard to introduce tskit to new users (especially from R, and if they don't know numpy).

@petrelharp
Copy link
Contributor

Another option is

today_nodes = np.arange(ts.num_nodes)[ts.nodes_time == 0]
ts.simplify(today_nodes)

@petrelharp
Copy link
Contributor

FWIW, I can't think of a single function in R that overloads arguments like that - i.e., takes either a vector of indices or a vector of booleans. Indexing works like that, but only indexing. Overloading can lead to nasty corner cases.

@hyanwong hyanwong closed this as not planned Won't fix, can't repro, duplicate, stale Jun 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants