Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compositional to scipy.stats for compositional data analysis #12782

Open
jolespin opened this issue Aug 27, 2020 · 8 comments
Open

Add compositional to scipy.stats for compositional data analysis #12782

jolespin opened this issue Aug 27, 2020 · 8 comments
Labels
enhancement A new feature or improvement scipy.stats

Comments

@jolespin
Copy link

Is your feature request related to a problem? Please describe.
Absolutely. Compositional data analysis [CoDA] is large in fields such as bioinformatics, geology, and economics.

In statistics, compositional data are quantitative descriptions of the parts of some whole, conveying relative information. Mathematically, compositional data is represented by points on a simplex. Measurements involving probabilities, proportions, percentages, and ppm can all be thought of as compositional data.
https://en.wikipedia.org/wiki/Compositional_data

Describe the solution you'd like
To have a compositional section in scipy.stats that, at the very least, has common CoDA methods such as closure, center log-ratio, isometric log-ratio, etc. Currently some of the methods are implemented in scikit-bio but I feel that they are much more generalizable to more sciences.

There are also correlation-style pairwise operations that are robust to bias from compositionality. This figure sums up why this is important from Morton et al.

One of the most practical pairwise operations is the rho metric originally published in Lovell et al. 2015, adapted by Erb et al. 2016, and implemented in R by Quinn et al. 2018 in the propr R package. I've reimplemented key metrics such as rho, phi, and variance log-ratio in my compositional Python package that have been optimized to make use of vectorization in numpy. rho is a drop-in replacement for correlation where the values range from -1 to 1 and phi is the unscaled version of rho. variance log-ratio is akin to a distance measure I believe.

I would like for these to be integrated into the scipy ecosystem to be more accessible to not only bioinformaticians but geologist and other sciences that use compositional data. Currently, most of the implementations either use many dependencies, do not fully make use of numpy vectorization for speed, or are available only in R.

Describe alternatives you've considered

I've been using 3rd party packages (scikit-bio and gneiss) and developed my own (https://github.com/jolespin/compositional).

Additional context (e.g. screenshots)

This figure is also helpful in describing the rationale:
image

Fig 1. Why correlations between relative abundances tell us absolutely nothing.
These plots show two hypothetical mRNAs that are part of a larger total. (a) Seven pairs of relative abundances (mRNA1/total, mRNA2/total) are shown in red, representing the two mRNAs in seven different experimental conditions. The dotted reference line shows (mRNA1 + mRNA2)/total = 1.) Rays from origin through the red points show absolute abundances that could have given rise to these relative abundances, e.g., the blue, green or purple sets of points (whose Pearson correlations are −1, +1 and 0.0 respectively). (b) Relative abundances that are proportional must come from equivalent absolute abundances. Here the blue, green or purple sets of point pairs have the same proportionality as the pairs of relative abundances in red, though not necessarily the same order or dispersion.

https://journals.plos.org/ploscompbiol/article/figure/image?size=large&id=10.1371/journal.pcbi.1004075.g001

Key resources:

@rlucas7
Copy link
Member

rlucas7 commented Aug 28, 2020 via email

@jolespin
Copy link
Author

jolespin commented Aug 28, 2020

I'll look into the specifics as I know it must meet a lot of criteria. I've updated this post to include publications from other fields:

The compositional package that I wrote has "optional" dependencies for gneiss and scikit-bio but currently it relies on pandas. These could be easily adapted to not rely on labeled data (i.e. pandas) and rely solely on numpy. The way I've written most of the functions are to handle numpy arrays and pandas so I could just remove the pandas functionality. I'm also alright with not using the compositional package as a dependency. The only reason I wrote it was to have some code I can use for publications but it would be preferred if the functions were native in scipy (provided it met certain criteria for scipy specifications). I could always make a wrapper for pandas compatibility.

Here's a statement on generalizability from experts in the field:

Typical examples in different fields are: geology (geochemical elements), economy (income/expenditure distribution), medicine (body composition: fat, bone, lean), questionnaire surveys (ipsative data), food industry (food composition: fat, sugar, etc), chemistry (chemical composition), ecology (abundance of different species), paleontology (foraminifera taxa), agriculture (nutrient balance ionomics), sociology (time-use surveys), environmental sciences (soil contamination), and genetics (genotype frequency). This type of data appears in most applications, and the interest and importance of consistent statistical methods cannot be underestimated. Although the concern of the problems related to them was kept alive mainly by researchers from the field of Geosciences, in particular by members of the International Association for Mathematical Geosciences, the awareness of coherent methods is growing in the environmental and biological sciences.

http://www.compositionaldata.com/

Publications from other domains (not microbial ecology):

@jolespin
Copy link
Author

jolespin commented Aug 31, 2020

The methods I propose would be the following:

However, we would want to talk to the developers at scikit-bio to see what their thoughts are for this matter. I have implementations for closure, clr, and clr_inv but would not want step on anyones toes regarding this. My main goal is to make it more accessible and use environments with few dependencies.

I also propose the following pairwise metrics:

I suggest that compositional would be an additional module with scipy.stats as either scipy.stats.compositional or scipy.stats.coda as this is a rapidly evolving field with new methods coming from many domains.

@rgommers rgommers added enhancement A new feature or improvement scipy.stats labels Sep 1, 2020
@jolespin
Copy link
Author

jolespin commented Sep 7, 2020

Do you have a link out to documentation requirements? I want to make sure I do this right. Once I get my code in a shareable form that meets SciPy criteria I will loop in the skbio group to see if they would like to contribute.

@rlucas7
Copy link
Member

rlucas7 commented Sep 7, 2020 via email

@MosGeo
Copy link

MosGeo commented May 29, 2022

I'll add my two scents as a user:

  • While it is great, scikit-bio is heavy on the requirements (e.g., requires matplotlib and ipython with default installation).
  • Today, discovered compoda (https://github.com/ofgulban/compoda). The code is clean and documented. I think it is well worth checking out.

@jolespin
Copy link
Author

Coming back to this now that I've graduated (sorry it went stale). What's still needed to get VLR and RHO in scikit-bio?

@jolespin
Copy link
Author

I'll add my two scents as a user:

  • While it is great, scikit-bio is heavy on the requirements (e.g., requires matplotlib and ipython with default installation).
  • Today, discovered compoda (https://github.com/ofgulban/compoda). The code is clean and documented. I think it is well worth checking out.

Looks like a clean package but I don't if some of the implementations are optimized. For example, the clr_transformation uses an unnecessary for-loop. Check out my github.com/jolespin/compositional package when you get a chance. My plan is to get these implemented in scikit-bio (not as a dependency but a reimplementation). This package is just a placeholder until then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

No branches or pull requests

4 participants