New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix zero HVG bug emerging when batch_key is given #1180
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I'm just requesting a minor documentation update.
I'm a little unsure about changing the return type, but also I would like to see if anyone complains. My expectation is that no one wants a recarray
.
if min_disp is None: min_disp = 0.5 | ||
if min_mean is None: min_mean = 0.0125 | ||
if max_mean is None: max_mean = 3 | ||
if max_disp is None: max_disp = np.inf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two thoughts on this:
- Should this function throw an error if the user passes
n_top
and one of these values? - These values should get documented. If we're not using
None
as a sentinel for missing values, could these just be the argument defaults?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Should this function throw an error if the user passes
n_top
and one of these values?
This is handled already with a "logg.info" line:
if n_top_genes is not None and not all(m is None for m in [
min_disp, max_disp, min_mean, max_mean
]):
logg.info('If you pass `n_top_genes`, all cutoffs are ignored.')
- These values should get documented. If we're not using
None
as a sentinel for missing values, could these just be the argument defaults?
I actually like that they are None by default, it seems more consistent with the rest of scanpy. I added the defaults to the documentation.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default values
What I like about default arguments is that they get automatically documented and are probably right, while manually documented values can go out of date if the values change. I think we only use None
as a default value when we want to be able to check if the user provided a value.
n_top_genes / value filters behavior
I think the behavior for the filters (i.e. max, min, and n_top_genes) should either work together or be mutually exclusive. If they're mutually exclusive, an error is better than a warning. Otherwise I think it's too easy to miss a warning message and think the code is doing something it's not.
For example, I think this line could select the top 100 genes that have a mean of at least 2:
sc.pp.highly_variable_genes(adata, min_mean=2, n_top_genes=100)
If this goes in a script in a pipeline, it'd be very easy not to see the logged message. Then the user would have the wrong idea of what their output means. I think failing fast is better in this case.
In addition, I'm not sure that it makes sense for n_top_genes to have a higher priority than the value filters.
Was there an issue that should be closed by this? |
No description provided.