
feature request: make scipy.stats.pearsonr accept 2-D arrays #9307

Closed
Phillip-M-Feldman opened this issue Sep 25, 2018 · 19 comments · Fixed by #20137
Labels
enhancement A new feature or improvement scipy.stats
Comments

@Phillip-M-Feldman

I'd love to see scipy.stats.pearsonr modified to allow 2-D arrays to be passed, with one column per variable and one row per observation. The output would be a pair of 2-D arrays--one for the Pearson correlation coefficient and one for the p-value. Currently, obtaining something like this requires a pair of nested loops, with scipy.stats.pearsonr being called once for each pair of variables.
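For reference, the nested-loop workaround described above might look like the following sketch (variable names are illustrative; `r` and `p` collect the pairwise coefficients and p-values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.random((100, 4))  # 100 observations (rows), 4 variables (columns)

n_var = data.shape[1]
r = np.empty((n_var, n_var))
p = np.empty((n_var, n_var))
for i in range(n_var):
    for j in range(n_var):
        # one call per pair of variables -- the overhead this request wants to avoid
        r[i, j], p[i, j] = stats.pearsonr(data[:, i], data[:, j])
# r and p are both (4, 4)
```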

@ilayn ilayn added scipy.stats enhancement A new feature or improvement labels Sep 26, 2018
@wolhandlerdeb

I know this is probably no longer relevant for you, but maybe it could help someone else.
You can convert your data to a pandas DataFrame and apply pandas.DataFrame.corr with parameter method='pearson'. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
The only (regrettable) difference is that this function does not return p-values.
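A minimal sketch of this pandas workaround (column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 3)), columns=["a", "b", "c"])

# full 3x3 Pearson correlation matrix -- but, as noted above, no p-values
corr = df.corr(method="pearson")
```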

@Phillip-M-Feldman
Author

I still do such tests. Thank you!

@mdhaber
Contributor

mdhaber commented Dec 7, 2020

At some point @swallan or @DominicChm might be vectorizing mannwhitneyu to respond to gh-12837; perhaps this would be a good follow-up.

@mdhaber
Contributor

mdhaber commented Dec 31, 2020

@Phillip-M-Feldman would gh-13312 close this issue, from your perspective? It allows pearsonr to process nd-arrays, but it is just for convenience, not speed.

@Phillip-M-Feldman
Author

Phillip-M-Feldman commented Dec 31, 2020 via email

@mdhaber
Contributor

mdhaber commented Apr 22, 2021

I don't think gh-13312 would address this as requested, actually. I think the request is for it to work more like spearmanr, which is an oddball in terms of the way it's vectorized.

@mdhaber
Contributor

mdhaber commented Aug 19, 2021

On the other hand, if the decorator in gh-13312 were applied to this function, I think the user could achieve the desired behavior by passing in arrays of shapes (n_var, 1, n_obs) and (1, n_var, n_obs) and taking the statistic along the last axis.
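The broadcasting idea can be checked with plain NumPy by computing Pearson's r along the last axis by hand (an editorial sketch; the result should match np.corrcoef):

```python
import numpy as np

rng = np.random.default_rng(0)
n_var, n_obs = 4, 50
data = rng.random((n_var, n_obs))

a = data[:, np.newaxis, :]   # shape (n_var, 1, n_obs)
b = data[np.newaxis, :, :]   # shape (1, n_var, n_obs)

# center along the last axis, then reduce; broadcasting yields every pair at once
am = a - a.mean(axis=-1, keepdims=True)
bm = b - b.mean(axis=-1, keepdims=True)
r = (am * bm).sum(axis=-1) / np.sqrt((am**2).sum(axis=-1) * (bm**2).sum(axis=-1))
# r has shape (n_var, n_var), the full pairwise correlation matrix
```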

I won't add pearsonr in gh-13312, but it could be vectorized like this in a follow-up.

@mdhaber
Contributor

mdhaber commented Feb 8, 2022

@tirthasheshpatel @tupui @Kai-Striega thought you might be interested in this one as we try to make the stats interface more consistent. We might think that pearsonr should be vectorized like spearmanr, as they are both correlation functions. But the behavior of spearmanr (and np.corrcoef) is unusual:

import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

x = np.random.rand(3, 10); y = np.random.rand(3, 10)
mannwhitneyu(x, y, axis=1).statistic.shape  # (3,)
spearmanr(x, y, axis=1)[0].shape  # (6, 6)

It does pairwise comparisons between every row of x and y with every other row of x and y.

If we hope to make the scipy.stats interface more consistent, I think we'd want it to be vectorized essentially like every other function; e.g.

pearsonr(x, y, axis=1)[0].shape  # we would want (3,), not (6, 6)

If the user wants to do pairwise comparisons, then they can do:

z = np.concatenate([x, y], axis=0)
pearsonr(z[:, np.newaxis, :], z[np.newaxis, :, :], axis=2)[0].shape

I'd suggest that we deprecate the 2D behavior and then, after the deprecation cycle, replace it.
What do you think?
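The shapes involved in the pairwise trick above can be verified with plain NumPy, without assuming anything about a vectorized pearsonr:

```python
import numpy as np

x = np.random.rand(3, 10)
y = np.random.rand(3, 10)
z = np.concatenate([x, y], axis=0)  # shape (6, 10)

# the two views broadcast against each other over every pair of rows
shape = np.broadcast_shapes(z[:, np.newaxis, :].shape, z[np.newaxis, :, :].shape)
# shape == (6, 6, 10): reducing along the last axis would give a (6, 6) result
```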

@mdhaber
Contributor

mdhaber commented May 4, 2022

@tirthasheshpatel @tupui @Kai-Striega what do you think? Shall we get the deprecation in before 1.9.0 branches so we can vectorize these consistently?

@Kai-Striega
Member

I'm +1 on this but won't have the bandwidth to implement it.

@mdhaber
Contributor

mdhaber commented May 6, 2022

No problem. I'll do it!

@josef-pkt
Member

#1924 #625 I think pearsonr should essentially behave like np.corrcoef + pvalue

#16133 (comment)
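For comparison, a sketch of the two behaviors side by side (np.corrcoef returns the full matrix but no p-values, while pearsonr returns one coefficient and p-value per pair):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.random((3, 50))  # 3 variables (rows), 50 observations

corr = np.corrcoef(data)                 # (3, 3) correlation matrix, no p-values
r, p = stats.pearsonr(data[0], data[1])  # scalar r and p-value for a single pair
```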

@mdhaber
Contributor

mdhaber commented May 7, 2022

Which other reduction functions in stats do you think should behave that way? Would spearmanr and pearsonr be the only exceptions, and all other stats reduction functions would work according to the usual system?

@josef-pkt
Member

kendalltau would be the third correlation measure (pandas treats those 3 the same, AFAIR).
For example, we can use those 3 to define elliptical copulas; kendalltau is also commonly used for bivariate copulas, where we only need a single correlation/dependence coefficient.

However, kendalltau is not "naturally" vectorized.

For statsmodels, I have an old PR with a few more outlier-robust cov/corr functions.
More generally, it would be relevant for all pairwise dependence/distance/association measures.

It also depends a bit on the common use case:

cov, corr, and their variants are very often (or mostly) multivariate measures with the corresponding matrix returns.
For kendalltau and Spearman's rho, scipy is the main package for the descriptive statistics; p-values are most of the time a bonus feature.

Hypothesis test functions are mainly for one-, two-, or k-sample cases, with e.g. a possibly vectorized (stratified) two-sample case.
Multiple testing, like all pairwise comparisons, is usually treated as a separate issue and can be implemented for the specific multiple contrasts, as in tukeyhsd, or multiple comparisons to a control.
I would not treat those test functions as matrix/multivariate functions because that is a special-case extension that does not dominate the usage.

@josef-pkt
Member

another distinction

Most of the 2-sample tests assume independent samples; there is no underlying multivariate structure. It's just comparing different independent samples. (The correlation in tukeyhsd comes from the contrast of comparisons, not from the correlation of the underlying data.)

In the multivariate cases like MANOVA, repeated measures ANOVA or general correlated samples we are back to having to handle the correlation across samples in a joint way.

example:
Testing equality of 2 correlations depends on whether they have one of the variables in common or not.
If we want to test more than two independent correlations, then it's better to go back to the matrix return of cov/corr/...

@josef-pkt
Member

josef-pkt commented May 7, 2022

after another coffee break:

I guess what I'm saying boils down to whether a function is mainly for multivariate analysis or for analysis of k (= 1, 2, ...) independent samples.

IMO, corr, cov, pearsonr, kendalltau, and spearmanr are in the multivariate camp; the bivariate off-diagonal element is just a special case.

@mdhaber
Contributor

mdhaber commented May 7, 2022

corr and cov are not SciPy functions, AFAICT.
Because you did not mention them, shall we assume that pointbiserialr, weightedtau, linregress, and all the other "Correlation functions" should be vectorized as usual, and only pearsonr, kendalltau, and spearmanr would get this special treatment? (I suspect not, but we should be explicit here, so I'm not going to make any assumptions. Please name all reduction functions you think should be treated like spearmanr is now.)

@josef-pkt
Member

josef-pkt commented May 7, 2022

I used corr and cov as generic terms for any kind of correlation or covariance measure.

pointbiserialr is redundant and just there for name recognition.
The difference is that it has two different types in x and y, so I would think that the cross-correlation would be the most appropriate in the multivariate case.

I never looked at weightedtau; vectorizing to a correlation matrix might not gain anything computationally compared to a loop, and I have no idea about use cases.

linregress is just used as a single linear regression, nothing multivariate about it. If vectorization is available, then it can be done in parallel over separate cases.
I guess the same holds for the Theil and Siegel slopes (I never used them).

I don't know how 2-d somersd works; my scipy version is too old.
AFAIR, I have only seen somersd and related measures in references that used it for 2 random variables (2-sample case).
I guess it's mainly for comparison of 2 multinomial ordinal samples (2-d, r x s contingency tables).
I'm not sure what the main use cases for related higher-dimensional contingency tables would be.
statsmodels has only stratified contingency tables as higher-dim, where the underlying structure is vectorization of independent sub-contingency tables.

@josef-pkt
Member

(after another break)

I would put somersd in a similar category as the correlation measures for ordinal association.
One possible multivariate extension would be to return an asymmetric association matrix with somersd(x, y) in the upper (?) triangle and somersd(y, x) in the lower triangle.

But that's purely theoretical; I've never seen a use case like that, and I don't know if there are interesting follow-up analyses that would use it.

@lucascolley lucascolley added this to the 1.14.0 milestone Mar 21, 2024