some basics on doing permutations #4

Open
mesamuels opened this issue Jun 18, 2023 · 2 comments

@mesamuels

Hi, I have just started to use your permutation test package. I am not a programmer, so it took a bit of time to figure out how to download and install it on my system, where it is not available by default. I am using the Digital Research Alliance of Canada system, formerly called Compute Canada. I have a couple of questions:

First, I was able to install the wheel version but, for some reason, not the tar.gz version; my system seems to have had path troubles. Anyway, I can import the package in Python and run the test. Question 1: where is the package actually installed? I do not see any obvious new directories or files in my home directory, and I wouldn't think it could be installed system-wide. Does the installation update some path variables in my home account so that Python knows where to look for permutations_stats.permutations?
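For readers with the same question: pip does not edit path variables. It copies the package into a `site-packages` directory of the active Python environment (or of the per-user site), which is often a hidden directory and therefore easy to miss. The sketch below is one way to ask Python where an importable module lives. The helper name `locate` is made up for illustration, and the printed paths depend entirely on the local setup.

```python
import importlib.util
import site
import sys


def locate(module_name):
    """Return the file path of an importable top-level module, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# "permutations_stats" is the import name mentioned in this thread;
# any other top-level module name works the same way.
print(locate("permutations_stats"))

# Places pip commonly installs into: the per-user site and the active environment.
print(site.getusersitepackages())
print(sys.prefix)
```

If the first line prints `None`, the package is not visible to the interpreter that is currently running, which usually means a different environment was active at install time.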

More importantly, is there some general documentation on setting parameters, like the number of permutations, or whether to permute both X and Y (in my case I am doing a Brunner-Munzel test with cases and controls, so just two data variables)? I was able to run the test with default parameters when X and Y both had around a dozen values, but when I tried to use my real data, which has over 100 values for one of the two, the system "killed" the program, so presumably I exceeded my memory allocation. I can request more memory, but I need a rough idea of how much memory is required for a given number of data points. Is it possible to permute only one of the two data variables? Is it meaningful to do that?

Some of this info can be found on the web for other permutation test packages (like for R), but your implementation may be different so it's best to ask at the source.

You don't seem to have a literature citation posted here, is there one now that I can cite when I write this up? Otherwise do I just mention it as software used in my Methods section with a link to here?

FYI, this is for a human genetics project. We sequenced a bunch of cases and controls to look for mutations causing a rare endocrine phenotype, focused on very rare genomic variants. As a result we are doing a kind of gene burden test, and want to compare the burden of variants in the cases vs controls. For particular reasons, a Fisher or chi-square 2x2 test is not appropriate here, rather we are using Mann-Whitney or Brunner-Munzel, essentially to compare the average per sample mutation burden by ranking the samples. Because the number of cases is small (12), I wanted to run a permutation version just in case the regular tests (BM or MW) are too forgiving. BM is probably better generally as we don't know whether the variances are the same in the cases and controls, although they likely are.

Thanks and cheers from Montreal!
Mark Samuels
Associate Professor in Medicine
University of Montreal

@mesamuels
Author

Aha, I figured out from your documentation (sorry, it was a bit tricky to find) how to set the number of permutations in "simulation" mode; I guess exact mode doesn't have limits. There was no problem running 1 million permutations on my set of 12 cases and 300 controls, with something like 64 gigabytes of memory. There is not a huge difference between the basic and permuted Brunner-Munzel p-values, so I guess it is quite a good test. Still, it's great to have the validation. I'm assuming a million permutations is enough. BTW, I don't know how to use "numba". I imported numpy as recommended, but we don't seem to have numba on our system unless it is hiding somewhere. Maybe with that I could run an exact test, but it's not really necessary.

Thanks,
Mark

@trevismd
Owner

trevismd commented Oct 28, 2024

Thank you for reaching out and for posting your reply, @mesamuels.

Please don't hesitate to ask if anything else comes up that I could help you with; I'll try to answer faster this time.

To help the next reader, if not you, since I am so late: while numba or more resources could help run the simulations faster, I don't think they would make it feasible to compute all permutations either. There are just too many ways to select 12 out of 312 records.
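To put a number on "too many", the count of distinct ways to pick 12 cases out of 312 records is the binomial coefficient C(312, 12), which Python's standard library can compute exactly. The throughput figure in the sketch is a made-up round number, used only to give a sense of scale.

```python
import math

n_cases, n_controls = 12, 300

# Number of distinct case/control labellings an exact test would have to visit.
n_splits = math.comb(n_cases + n_controls, n_cases)
print(f"{n_splits:.3e} possible labellings")

# Hypothetical throughput, for scale only (not a measurement of any package).
per_second = 1_000_000
years = n_splits / per_second / (3600 * 24 * 365)
print(f"about {years:.1e} years at {per_second:,} permutations per second")
```

The count is on the order of 10^21, so sampling a random subset of labellings (the "simulation" mode discussed above) is the practical route for data of this size.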

Disclaimer: I am a trained physician and data science enthusiast converted to full-time software engineering, not a statistician. I recommend checking with a professional how the following comment applies to your specific case. To evaluate whether 1 million permutations are enough, you could run the simulation multiple times and assess the variability of the p-values you get. If they stay close together, you can be reasonably confident that the true value is close too. If they don't, you can repeat the exercise with more permutations per simulation.
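A minimal sketch of that repeat-and-compare check, written in pure Python so it runs without numpy or numba. It is not this package's implementation: the statistic is a plain difference in means rather than Brunner-Munzel, the function names are invented for illustration, and the data are made up. The idea carries over unchanged: pool both groups, reshuffle the group labels, and repeat the whole simulation several times to see how much the p-value moves.

```python
import math
import random
import statistics


def mean_diff(x, y):
    return statistics.fmean(x) - statistics.fmean(y)


def mc_permutation_pvalue(x, y, n_perm, rng):
    """Two-sided Monte Carlo permutation p-value for the difference in means."""
    observed = abs(mean_diff(x, y))
    pooled = list(x) + list(y)  # labels are permuted by reshuffling the pooled values
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean_diff(pooled[:n_x], pooled[n_x:])) >= observed:
            hits += 1
    # The +1 terms count the observed labelling, so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)


def repeat_pvalues(x, y, n_perm, n_runs, seed=0):
    """Run the simulation n_runs times with one seeded generator."""
    rng = random.Random(seed)
    return [mc_permutation_pvalue(x, y, n_perm, rng) for _ in range(n_runs)]


# Made-up toy data, not from the study discussed in this thread.
cases = [3, 5, 4, 6, 2, 5, 7, 4, 3, 6, 5, 4]
controls = [2, 3, 1, 4, 2, 3, 2, 5, 1, 3, 2, 4, 3, 2, 1, 3, 4, 2, 3, 2]

n_perm = 2000
pvals = repeat_pvalues(cases, controls, n_perm=n_perm, n_runs=5)
p_hat = statistics.fmean(pvals)

print("p-values per run:", [round(p, 4) for p in pvals])
print("std dev across runs:", statistics.stdev(pvals))
# Binomial approximation to the Monte Carlo standard error of a single run.
print("expected standard error:", math.sqrt(p_hat * (1 - p_hat) / n_perm))
```

The last line uses the usual binomial approximation sqrt(p(1 - p) / B) for B random permutations. For a p-value near 0.05 and B = 1,000,000 this comes to roughly 0.0002, which gives a rough idea of how far repeated runs should be expected to drift apart.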
