
feature request: make scipy.stats.pearsonr accept 2-D arrays #9307

Closed
Phillip-M-Feldman opened this issue Sep 25, 2018 · 19 comments · Fixed by #20137
Labels
enhancement A new feature or improvement scipy.stats
Comments

@Phillip-M-Feldman

I'd love to see scipy.stats.pearsonr modified to allow 2-D arrays to be passed, with one column per variable and one row per observation. The output would be a pair of 2-D arrays--one for the Pearson correlation coefficient and one for the p-value. Currently, obtaining something like this requires a pair of nested loops, with scipy.stats.pearsonr being called once for each pair of variables.
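For reference, the nested-loop workaround described above might look like the following sketch (variable names are illustrative; `r` and `p` collect the pairwise coefficients and p-values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.random((100, 4))  # 100 observations (rows), 4 variables (columns)

n_var = data.shape[1]
r = np.empty((n_var, n_var))
p = np.empty((n_var, n_var))
for i in range(n_var):
    for j in range(n_var):
        # one call per pair of variables -- the overhead this request wants to avoid
        r[i, j], p[i, j] = stats.pearsonr(data[:, i], data[:, j])
# r and p are both (4, 4)
```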

@ilayn ilayn added scipy.stats enhancement A new feature or improvement labels Sep 26, 2018
@wolhandlerdeb

I know this is probably no longer relevant for you, but maybe it could help someone else.
You can convert your data to a pandas DataFrame and apply pandas.DataFrame.corr with parameter method='pearson'. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
The only (regrettable) difference is that this function does not return p-values.
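A minimal sketch of this pandas workaround (column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 3)), columns=["a", "b", "c"])

# full 3x3 Pearson correlation matrix -- but, as noted above, no p-values
corr = df.corr(method="pearson")
```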

@Phillip-M-Feldman
Author

I still do such tests. Thank you!

@mdhaber
Contributor

mdhaber commented Dec 7, 2020

At some point @swallan or @DominicChm might be vectorizing mannwhitneyu to respond to gh-12837; perhaps this would be a good follow-up.

@mdhaber
Contributor

mdhaber commented Dec 31, 2020

@Phillip-M-Feldman would gh-13312 close this issue, from your perspective? It allows pearsonr to process nd-arrays, but it is just for convenience, not speed.

@Phillip-M-Feldman
Author

Phillip-M-Feldman commented Dec 31, 2020 via email

@mdhaber
Contributor

mdhaber commented Apr 22, 2021

I don't think gh-13312 would address this as requested, actually. I think the request is for it to work more like spearmanr, which is an oddball in terms of the way it's vectorized.

@mdhaber
Contributor

mdhaber commented Aug 19, 2021

On the other hand, if the decorator in gh-13312 were applied to this function, I think the user could achieve the desired behavior by passing in arrays of shapes (n_var, 1, n_obs) and (1, n_var, n_obs) and taking the statistic along the last axis.
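The broadcasting idea can be checked with plain NumPy by computing Pearson's r along the last axis by hand (an editorial sketch; the result should match np.corrcoef):

```python
import numpy as np

rng = np.random.default_rng(0)
n_var, n_obs = 4, 50
data = rng.random((n_var, n_obs))

a = data[:, np.newaxis, :]   # shape (n_var, 1, n_obs)
b = data[np.newaxis, :, :]   # shape (1, n_var, n_obs)

# center along the last axis, then reduce; broadcasting yields every pair at once
am = a - a.mean(axis=-1, keepdims=True)
bm = b - b.mean(axis=-1, keepdims=True)
r = (am * bm).sum(axis=-1) / np.sqrt((am**2).sum(axis=-1) * (bm**2).sum(axis=-1))
# r has shape (n_var, n_var), the full pairwise correlation matrix
```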

I won't add pearsonr in gh-13312, but it could be vectorized like this in a follow-up.

@mdhaber
Contributor

mdhaber commented Feb 8, 2022

@tirthasheshpatel @tupui @Kai-Striega thought you might be interested in this one as we try to make the stats interface more consistent. We might think that pearsonr should be vectorized like spearmanr, as they are both correlation functions. But the behavior of spearmanr (and np.corrcoef) is unusual:

import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

x = np.random.rand(3, 10); y = np.random.rand(3, 10)
mannwhitneyu(x, y, axis=1).statistic.shape  # (3,)
spearmanr(x, y, axis=1)[0].shape  # (6, 6)

It does pairwise comparisons between every row of x and y with every other row of x and y.

If we hope to make the scipy.stats interface more consistent, I think we'd want it to be vectorized essentially like every other function; e.g.

pearsonr(x, y, axis=1)[0].shape  # we would want (3,), not (6, 6)

If the user wants to do pairwise comparisons, then they can do:

z = np.concatenate([x, y], axis=0)
pearsonr(z[:, np.newaxis, :], z[np.newaxis, :, :], axis=2)[0].shape

I'd suggest that we deprecate the 2D behavior and then, after the deprecation cycle, replace it.
What do you think?
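The shapes involved in the pairwise trick above can be verified with plain NumPy, without assuming anything about a vectorized pearsonr:

```python
import numpy as np

x = np.random.rand(3, 10)
y = np.random.rand(3, 10)
z = np.concatenate([x, y], axis=0)  # shape (6, 10)

# the two views broadcast against each other over every pair of rows
shape = np.broadcast_shapes(z[:, np.newaxis, :].shape, z[np.newaxis, :, :].shape)
# shape == (6, 6, 10): reducing along the last axis would give a (6, 6) result
```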

@mdhaber
Contributor

mdhaber commented May 4, 2022

@tirthasheshpatel @tupui @Kai-Striega what do you think? Shall we get the deprecation in before 1.9.0 branches so we can vectorize these consistently?

@Kai-Striega
Member

I'm +1 on this but won't have the bandwidth to implement it.

@mdhaber
Contributor

mdhaber commented May 6, 2022

No problem. I'll do it!

@josef-pkt
Member

#1924 #625 I think pearsonr should essentially behave like np.corrcoef + pvalue

#16133 (comment)
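For comparison, a sketch of the two behaviors side by side (np.corrcoef returns the full matrix but no p-values, while pearsonr returns one coefficient and p-value per pair):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.random((3, 50))  # 3 variables (rows), 50 observations

corr = np.corrcoef(data)                 # (3, 3) correlation matrix, no p-values
r, p = stats.pearsonr(data[0], data[1])  # scalar r and p-value for a single pair
```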

@mdhaber
Contributor

mdhaber commented May 7, 2022

Which other reduction functions in stats do you think should behave that way? Would spearmanr and pearsonr be the only exceptions, and all other stats reduction functions would work according to the usual system?

@josef-pkt
Member

kendalltau would be the third correlation measure (pandas treats those 3 the same, AFAIR).
For example, we can use those 3 to define elliptical copulas; kendalltau is also commonly used for bivariate copulas, where we only need a single correlation/dependence coefficient.

However, kendalltau is not "naturally" vectorized.

For statsmodels, I have an old PR with a few more outlier-robust cov/corr functions.
More generally, it would be relevant for all pairwise dependence/distance/association measures.

It also depends a bit on the common use case:

cov, corr, and their variants are very often (or mostly) multivariate measures with the corresponding matrix returns.
For kendalltau and Spearman's rho, scipy is the main package for the descriptive statistics; p-values are most of the time a bonus feature.

Hypothesis test functions are mainly for one-, two-, or k-sample cases, with e.g. a possibly vectorized (stratified) two-sample case.
Multiple testing, like all pairwise comparisons, is usually treated as a separate issue and can be implemented for the specific multiple contrasts, as in tukeyhsd, or multiple comparisons to a control.
I would not treat those test functions as matrix/multivariate functions because that is a special-case extension that does not dominate the usage.

@josef-pkt
Member

another distinction

Most of the 2-sample tests assume independent samples; there is no underlying multivariate structure. It's just comparing different independent samples. (The correlation in tukeyhsd comes from the contrast of comparisons, not from the correlation of the underlying data.)

In the multivariate cases like MANOVA, repeated measures ANOVA or general correlated samples we are back to having to handle the correlation across samples in a joint way.

example:
Testing equality of 2 correlations depends on whether they have one of the variables in common or not.
If we want to test more than two independent correlations, then it's better to go back to the matrix return of cov/corr/...

@josef-pkt
Member

josef-pkt commented May 7, 2022

after another coffee break:

I guess what I'm saying boils down to whether a function is mainly for multivariate analysis or for analysis of k (= 1, 2, ...) independent samples.

IMO, corr, cov, pearsonr, kendalltau, and spearmanr are in the multivariate camp; the bivariate off-diagonal element is just a special case.

@mdhaber
Contributor

mdhaber commented May 7, 2022

corr and cov are not SciPy functions, AFAICT.
Because you did not mention them, shall we assume that pointbiserialr, weightedtau, linregress, and all the other "Correlation functions" should be vectorized as usual, and only pearsonr, kendalltau, and spearmanr would get this special treatment? (I suspect not, but we should be explicit here, so I'm not going to make any assumptions. Please name all reduction functions you think should be treated like spearmanr is now.)

@josef-pkt
Member

josef-pkt commented May 7, 2022

I used corr and cov as generic terms for any kind of correlation or covariance measure.

pointbiserialr is redundant and just there for name recognition.
The difference is that it has two different types in x and y, so I would think that the cross-correlation would be the most appropriate in the multivariate case.

I never looked at weightedtau; vectorizing to a correlation matrix might not gain anything computationally compared to a loop, and I have no idea about use cases.

linregress is just used as a single linear regression, nothing multivariate about it. If vectorization is available, then it can be done in parallel over separate cases.
I guess the same holds for the Theil and Siegel slopes (I never used them).

I don't know how 2-d somersd works; my scipy version is too old.
AFAIR, I have only seen somersd and related measures in references that used it for 2 random variables (2-sample case).
I guess it's mainly for comparison of 2 multinomial ordinal samples (2-d, r x s contingency tables).
I'm not sure what the main use cases for related higher-dimensional contingency tables would be.
statsmodels has only stratified contingency tables as higher-dim, where the underlying structure is vectorization of independent sub-contingency tables.

@josef-pkt
Member

(after another break)

I would put somersd in a similar category as the correlation measures for ordinal association.
One possible multivariate extension would be to return an asymmetric association matrix with somersd(x, y) in the upper (?) triangle and somersd(y, x) in the lower triangle.

But that's purely theoretical; I've never seen a use case like that, and I don't know if there are interesting follow-up analyses that would use it.

@lucascolley lucascolley added this to the 1.14.0 milestone Mar 21, 2024