
p-adj values in pairwise_tukeyhsd and MultiComparison bounded by 0.001 and 0.9 #8185

Open
TalWac opened this issue Mar 23, 2022 · 6 comments

@TalWac

TalWac commented Mar 23, 2022

Describe the bug

Dear developer,

When I run pairwise_tukeyhsd or MultiComparison.tukeyhsd, I see that the values in p-adj are bounded from below by 0.001 and from above by 0.9. When I compare with R's TukeyHSD test, the p-adj values can be higher than 0.9 and smaller than 0.001.

This is a problem for me since I have many other columns M# (i.e., M1, M2, ..., M900) and I want to adjust these values with FDR (false discovery rate) afterward. Values that are much smaller than 0.001 are rounded up to 0.001 and will not be significant after FDR.

Code sample (Python):

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# the data
dat = {'Class_0': ['EKVX', 'EKVX', 'EKVX', 'EKVX', 'EKVX', 'EKVX', 'HOP62', 'HOP62',
                   'HOP62', 'HOP62', 'HOP62', 'HOP62', 'HOP92', 'HOP92', 'HOP92',
                   'HOP92', 'HOP92', 'HOP92', 'MPLT4', 'MPLT4', 'MPLT4', 'MPLT4',
                   'MPLT4', 'MPLT4', 'RPMI8226', 'RPMI8226', 'RPMI8226', 'RPMI8226',
                   'RPMI8226', 'RPMI8226'],
        'M2': [29.0660572, 29.4701607, 29.92649292, 29.62876917, 31.09229301,
              30.73131457, 29.54510531, 30.37553099, 29.22060366, 30.00387611,
              30.2893471, 30.42107896, 30.36357138, 30.04577991, 30.41139568,
              30.18903135, 30.23977308, 30.38604485, 28.45853761, 28.34003309,
              28.49059352, 28.65866869, 28.73640364, 28.76752952, 26.45224629,
              26.17478715, 26.77912804, 26.55138747, 26.69202968, 26.58586463]}

dat = pd.DataFrame(data=dat)

tukey_out = pairwise_tukeyhsd(endog=dat['M2'], groups=dat['Class_0'], alpha=0.05)
print(tukey_out)
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
======================================================
group1  group2  meandiff p-adj   lower   upper  reject
------------------------------------------------------
  EKVX    HOP62  -0.0099    0.9 -0.7456  0.7257  False
  EKVX    HOP92   0.2868 0.7575 -0.4489  1.0224  False
  EKVX    MPLT4  -1.4106  0.001 -2.1462 -0.6749   True
  EKVX RPMI8226  -3.4466  0.001 -4.1823 -2.7109   True
 HOP62    HOP92   0.2967 0.7359  -0.439  1.0323  False
 HOP62    MPLT4  -1.4006  0.001 -2.1363  -0.665   True
 HOP62 RPMI8226  -3.4367  0.001 -4.1723  -2.701   True
 HOP92    MPLT4  -1.6973  0.001  -2.433 -0.9616   True
 HOP92 RPMI8226  -3.7334  0.001  -4.469 -2.9977   True
 MPLT4 RPMI8226  -2.0361  0.001 -2.7717 -1.3004   True
------------------------------------------------------

Code sample (R) for the same data:

aov.out <- lm(M2 ~ Class_0, data = dat)
TukeyHSD.out <- TukeyHSD(aov(aov.out))
> TukeyHSD.out
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = aov.out)

$Class_0
                       diff        lwr        upr     p adj
HOP62-EKVX     -0.009924239 -0.7455454  0.7256969 0.9999994
HOP92-EKVX      0.286751449 -0.4488697  1.0223726 0.7814832
MPLT4-EKVX     -1.410553581 -2.1461747 -0.6749324 0.0000670
RPMI8226-EKVX  -3.446607382 -4.1822285 -2.7109862 0.0000000
HOP92-HOP62     0.296675687 -0.4389455  1.0322968 0.7599709
MPLT4-HOP62    -1.400629342 -2.1362505 -0.6650082 0.0000741
RPMI8226-HOP62 -3.436683143 -4.1723043 -2.7010620 0.0000000
MPLT4-HOP92    -1.697305029 -2.4329262 -0.9616839 0.0000039
RPMI8226-HOP92 -3.733358831 -4.4689800 -2.9977377 0.0000000
RPMI8226-MPLT4 -2.036053801 -2.7716750 -1.3004326 0.0000002

I have run these for the other columns 'M#' (i.e., M1, ..., M900).

There are differences in the adjusted p-values, and the R and Python results are somewhat close; however, in statsmodels they are always bounded by 0.001 and 0.9.
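For reference, the unclipped Tukey-adjusted p-value for a single pair can be computed directly from scipy's studentized_range distribution (available in scipy >= 1.7); a sketch for the EKVX vs RPMI8226 pair, assuming the equal-group-size Tukey formula:

```python
import numpy as np
from scipy.stats import studentized_range

# the five cell-line groups from the data above
groups = [
    [29.0660572, 29.4701607, 29.92649292, 29.62876917, 31.09229301, 30.73131457],   # EKVX
    [29.54510531, 30.37553099, 29.22060366, 30.00387611, 30.2893471, 30.42107896],  # HOP62
    [30.36357138, 30.04577991, 30.41139568, 30.18903135, 30.23977308, 30.38604485], # HOP92
    [28.45853761, 28.34003309, 28.49059352, 28.65866869, 28.73640364, 28.76752952], # MPLT4
    [26.45224629, 26.17478715, 26.77912804, 26.55138747, 26.69202968, 26.58586463], # RPMI8226
]
groups = [np.asarray(g) for g in groups]

k = len(groups)                                   # number of groups
n = len(groups[0])                                # equal group size assumed
df = sum(map(len, groups)) - k                    # error degrees of freedom
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / df

# EKVX vs RPMI8226: studentized range statistic and exact adjusted p-value
q = abs(groups[0].mean() - groups[4].mean()) / np.sqrt(mse / n)
p_adj = studentized_range.sf(q, k, df)
print(p_adj)  # far below the 0.001 floor printed by pairwise_tukeyhsd
```

The result matches the essentially-zero p adj that R reports for this pair rather than the clipped 0.001 in the statsmodels table.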

Is it possible to change this?

Kindly help,

Tal

@josef-pkt
Member

josef-pkt commented Mar 23, 2022

AFAIK, that's a limitation of how our p-values are implemented in qsturng.
The latest version of scipy has a better implementation of the Tukey HSD distribution.
This might improve if you have both scipy and statsmodels at the latest version, but I have not tried it yet.

Correction: the change to optionally delegate p-values to scipy is in main but not yet in a release, AFAICS.
Pull request #8035

However:

I want to adjust these values with FDR (false discovery rate) afterward. Values that are much smaller than 0.001 are rounded up to 0.001 and will not be significant after FDR.

tukey-hsd already corrects for multiple testing FWER. Why are you still using FDR p-value correction after that?

@TalWac
Author

TalWac commented Mar 23, 2022

tukey-hsd already corrects for multiple testing FWER. Why are you still using FDR p-value correction after that?

You are right that it corrects for multiple testing, but only within a single column; in my example above that column is M2 (molecule #2).
Since I have ~900 columns (molecules M1, M2, M3, ..., M900), I need to correct for the number of tests I run, not only for the number of pairwise comparisons.

As for #8035, I'm not sure I understood how to combine scipy and statsmodels to get correct p-values.
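The molecule-level FDR step can be sketched in plain numpy; this mirrors the Benjamini-Hochberg adjustment (statsmodels.stats.multitest.multipletests with method='fdr_bh' computes the same quantity), and the p-values below are illustrative: a single small p among 900 tests survives when it is exact (1e-7) but not when it has been clipped to 0.001:

```python
import numpy as np

def benjamini_hochberg(pvals):
    # Benjamini-Hochberg adjusted p-values (minimal sketch; same result as
    # statsmodels.stats.multitest.multipletests(..., method='fdr_bh'))
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    adj = np.minimum.accumulate(scaled[::-1])[::-1]  # step-up monotonicity
    out = np.empty_like(adj)
    out[order] = np.clip(adj, 0, 1)
    return out

# one true effect among 900 molecule-level tests, the rest null-ish
other = np.linspace(0.2, 1.0, 899)
clipped = benjamini_hochberg(np.concatenate([[0.001], other]))  # p clipped by qsturng
exact = benjamini_hochberg(np.concatenate([[1e-7], other]))     # unclipped p
print(clipped[0], exact[0])
```

After adjustment the clipped p-value becomes 0.9 (not significant) while the exact one stays far below 0.05, which is exactly the problem described above.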

@josef-pkt
Member

how to combine scipy and statsmodels to have correct p-values

You need to have the latest scipy release and the development version of statsmodels.

I found nightly versions, but have never tried them out:
https://anaconda.org/scipy-wheels-nightly/statsmodels

Otherwise, statsmodels main needs to be installed from GitHub.

I'm not yet on the latest scipy, but scipy.stats now also has tukey_hsd, which you should be able to use:
https://docs.scipy.org/doc/scipy-1.8.0/html-scipyorg/reference/generated/scipy.stats.tukey_hsd.html
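A minimal sketch of the scipy.stats.tukey_hsd API (scipy >= 1.8); the three groups here are illustrative placeholders, not the issue's data:

```python
from scipy.stats import tukey_hsd

# three illustrative groups: a and b similar, c clearly lower
a = [29.1, 29.5, 29.9, 29.6, 31.1, 30.7]
b = [29.5, 30.4, 29.2, 30.0, 30.3, 30.4]
c = [26.5, 26.2, 26.8, 26.6, 26.7, 26.6]

res = tukey_hsd(a, b, c)
print(res.pvalue)  # full k x k matrix of adjusted p-values, not clipped

# matching matrices of confidence-interval bounds
ci = res.confidence_interval(confidence_level=0.95)
print(ci.low)
print(ci.high)
```

Unlike pairwise_tukeyhsd, the result is a k x k matrix indexed by group position, so each pair appears in both directions.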

@TalWac
Author

TalWac commented Mar 27, 2022

As for :

I found nightly versions, but have never tried them out:
https://anaconda.org/scipy-wheels-nightly/statsmodels
Otherwise, statsmodels main needs to be installed from GitHub.

It was a bit confusing and I did not try it.

But the last option:

I'm not yet on the latest scipy, but scipy.stats now also has tukey_hsd, which you should be able to use:
https://docs.scipy.org/doc/scipy-1.8.0/html-scipyorg/reference/generated/scipy.stats.tukey_hsd.html

I used the link above, and the p-values are much more accurate and much closer to those from the R package.

Thank you very much!!

@TalWac
Author

TalWac commented Mar 28, 2022

However, the values in the lower and upper confidence intervals are opposite in scipy.stats.tukey_hsd
compared to R and to pairwise_tukeyhsd in statsmodels.stats.multicomp:

from scipy.stats import tukey_hsd
# dat is the pandas DataFrame defined in the first comment

dat_EKVX = dat.loc[dat['Class_0']=='EKVX',  'M2']
dat_HOP62 = dat.loc[dat['Class_0']=='HOP62',  'M2']
dat_HOP92 = dat.loc[dat['Class_0']=='HOP92',  'M2']
dat_MPLT4 = dat.loc[dat['Class_0']=='MPLT4',  'M2']
dat_RPMI8226 = dat.loc[dat['Class_0']=='RPMI8226',  'M2']

res = tukey_hsd(dat_EKVX, dat_HOP62, dat_HOP92,dat_MPLT4,dat_RPMI8226)

print(res)
Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison  Statistic  p-value  Lower CI  Upper CI
 (0 - 1)      0.010     1.000    -0.726     0.746
 (0 - 2)     -0.287     0.781    -1.022     0.449
 (0 - 3)      1.411     0.000     0.675     2.146
 (0 - 4)      3.447     0.000     2.711     4.182
 (1 - 0)     -0.010     1.000    -0.746     0.726
 (1 - 2)     -0.297     0.760    -1.032     0.439
 (1 - 3)      1.401     0.000     0.665     2.136
 (1 - 4)      3.437     0.000     2.701     4.172
 (2 - 0)      0.287     0.781    -0.449     1.022
 (2 - 1)      0.297     0.760    -0.439     1.032
 (2 - 3)      1.697     0.000     0.962     2.433
 (2 - 4)      3.733     0.000     2.998     4.469
 (3 - 0)     -1.411     0.000    -2.146    -0.675
 (3 - 1)     -1.401     0.000    -2.136    -0.665
 (3 - 2)     -1.697     0.000    -2.433    -0.962
 (3 - 4)      2.036     0.000     1.300     2.772
 (4 - 0)     -3.447     0.000    -4.182    -2.711
 (4 - 1)     -3.437     0.000    -4.172    -2.701
 (4 - 2)     -3.733     0.000    -4.469    -2.998
 (4 - 3)     -2.036     0.000    -2.772    -1.300

@josef-pkt
Member

What do you mean by opposite values?

Pair differences can go either way, y0 - y1 or y1 - y0.
Your scipy results include both directions instead of just the lower or upper triangle of all pairs.
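Reducing scipy's full matrix to the unique i < j pairs reproduces the single-direction layout of R's TukeyHSD and statsmodels' pairwise_tukeyhsd; a sketch with placeholder data:

```python
from itertools import combinations
from scipy.stats import tukey_hsd

# illustrative placeholder groups (not the issue's data)
groups = {'g0': [1.0, 1.2, 0.9, 1.1],
          'g1': [2.0, 2.1, 1.9, 2.2],
          'g2': [1.0, 1.1, 1.0, 0.9]}
names = list(groups)

res = tukey_hsd(*groups.values())
ci = res.confidence_interval()

# keep only the i < j direction, like the R / statsmodels tables
for i, j in combinations(range(len(names)), 2):
    print(f"{names[i]}-{names[j]}: diff={res.statistic[i, j]:+.3f} "
          f"p-adj={res.pvalue[i, j]:.4f} "
          f"CI=({ci.low[i, j]:.3f}, {ci.high[i, j]:.3f})")
```

res.statistic[i, j] is mean_i - mean_j, so the (j, i) entry is simply its negation; picking one triangle removes the apparent "opposite values".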
