Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qc metrics update #615

Merged
merged 11 commits into from May 7, 2019

Conversation

@ivirshup
Copy link
Collaborator

commented Apr 23, 2019

Two major, and two minor, updates to qc metric calculation:

Tests run much faster now

test_qc_metrics.py used to take ~30 seconds, now takes ~2.

These tests have been kinda slow for a while. This was mostly due to numba compilation. I was using numba.njit(parallel=True), which cannot be cached so compilation occurred every time the tests ran. However, I expect most use cases only calculate QC metrics once in a session, and only for large datasets (at least 300,000 cells) is parallelization + compilation faster than performing the calculation in a single thread. Now a cached single threaded version is used unless the dataset is large.

Can now calculate observation and variable metrics separately

Split the calculation of qc metrics into two functions for obs and var. These separate calls are now available as: describe_obs and describe_var after pd.DataFrame.describe. This is mostly to go along with my split-apply-combine experiments. In particular a use case like:

(adata
    .groupby(obs="leiden")
    .apply(sc.pp.describe_var)
    .combine(...)
)

Where metrics like number of fraction of cells, mean expression, etc. are calculated within each group (useful for things like #562).

Minor updates

  • User can now choose to use expression from layers or raw instead of adata.X
  • Doc updates 馃 (am I polluting sc.pp._docs.py too much?)

@ivirshup ivirshup requested a review from falexwolf Apr 25, 2019

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

Cool! 馃槃

But is it possible to stay within the naming scheme? Or maybe even better, to avoid cluttering the namespace with many similar functions, add a parameter to calculate_qc_metrics that allows to restrict to obs and var if wanted?

Updating docs/release_notes.rst would also be great!

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 27, 2019

I'd have to think about the naming scheme, since its something I've been going back and forth on recently. I'm playing around with these semantic a bit over in my mantis repo, but I don't think I'm ready to make a call on which one I like best. Right now I'm split between following plyexperiment and xarray.

I think I'd like to keep them as separate functions, since the calculate_qc_metrics function already has about as complicated a relationship as I'd want with parameter values and return type.

How about these keep their current name for now, but we don't export these functions beyond _qc.py? This will let me play around with their naming scheme and the usefulness a bit more, while scanpy gets the faster, more thorough tests and improved calculate_qc_metrics?

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 27, 2019

How about these keep their current name for now, but we don't export these functions beyond _qc.py?

That's OK for me! I thought about naming these calculate_qc_metrics_obs and ..._var, but that's not great, either... describe_obs and describe_var are probably better names compared to calculate_qc_metrics... So let's just not add a new bunch of functions for now and think about a lasting solution for Scanpy 2, where we can also break backwards compat.

The mantis repo looks fantastic!!!

ivirshup added some commits Apr 22, 2019

Added describe_obs and describe_var
I'll be using these internally for `calculate_qc_metrics` next, but am leaving that undone for now for testing purposes.
Make cell proportions not parallel for small data
This eliminates the long compilation time for jit warmup in cases where it probably wasn't neccesary.

One advantage of this is the tests will run much faster after the first run.

@ivirshup ivirshup force-pushed the ivirshup:qc_metrics_update branch from c8a42df to e4f07c5 Apr 30, 2019

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented May 1, 2019

@falexwolf, this is good now, yeah?

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented May 1, 2019

@flying-sheep and idea what's up with this build error? Docstrings are failing tests, but look fine to me. Seems related to 3cacdc8.

@falexwolf

This comment has been minimized.

Copy link
Member

commented May 6, 2019

Yes, it's good to go! Sorry about the trivial conflict in the release notes: I just made 1.4.2 based on the changes of the last two weeks on master, which I had the chance to test in the past week...

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented May 7, 2019

No problem, thanks!

I think conflicts in release note are going to be common. Maybe those commits could be separate from the PR?

@ivirshup ivirshup merged commit fe623da into theislab:master May 7, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can鈥檛 perform that action at this time.