Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add statistical tests against existing simulator #102

Closed
jeromekelleher opened this issue Nov 15, 2023 · 15 comments · Fixed by #132
Closed

Add statistical tests against existing simulator #102

jeromekelleher opened this issue Nov 15, 2023 · 15 comments · Fixed by #132
Milestone

Comments

@jeromekelleher
Copy link
Member

jeromekelleher commented Nov 15, 2023

We should be testing our results statistically by comparing them against other simulators. There's no reason we can do this at the small scale.

It shouldn't be too hard to use e.g. PhenotypeSimulator on some simple simulations, and qqplot the results as a comparison with our results.

So, we would do something like:

  1. Simulate a small tree sequence using msprime for say, 10 samples, and then for n replicates each do:
  2. Compute phenotypes for each of the 10 individuals under a given model to get a distribution per-individual using tstrait
  3. Export the data to VCF and do the same thing with an external too/ We can then compare the per-individual distributions as qqplots, which should be meaningful.

This will take some work to do, but is an important validation step.

@jeromekelleher jeromekelleher added this to the 0.0.2 release milestone Nov 15, 2023
@jeromekelleher
Copy link
Member Author

See the msprime verification.py script which does lots of this type of thing, and also the section in the msprime developer docs

@daikitag
Copy link
Collaborator

I am thinking about comparing tstrait with SLiM and AlphaSimR. I am planning to simulate traits and genetic information from those packages, obtain the tree sequence from the simulated genetic information, and simulate traits by using tstrait. I think that would be a good comparison, as SLiM and AlphaSimR are widely used packages.

@jeromekelleher
Copy link
Member Author

If you do that though, you'll have to take into account the different processes that generate the ARG in the first place wont you?

Much simpler if you take a given ARG, and then generate traits either using tstrait, or one of the methods that are based on input sequences.

These totally don't have to be big examples, the R methods that read in the full VCF are fine.

@daikitag
Copy link
Collaborator

@jeromekelleher
I'm thinking about comparing tstrait with AlphaSimR, as they are using the exact same algorithm that we are using (see Third Step: Assign Quantitative Trait Nucleotide Effects, ... in https://acsess.onlinelibrary.wiley.com/doi/10.3835/plantgenome2016.02.0013), and they have written the codes to incorporate tree sequence file into their simulation algorithm (https://github.com/gaynorr/AlphaSimR_Examples/blob/master/misc/msprime.R).
I imagine that AlphaSimR will obtain similar results as us, as I had taken a lot of ideas from them.

I'm thinking about simulating a small tree sequence in msprime, put it into AlphaSimR as a founder population, and simulate the phenotype information just from that founder population (I'm not planning to do any genetic simulation with AlphaSimR), but what do you think about this?
Given the popularity of AlphaSimR, I think we only need to compare tstrait with AlphaSimR.

@jeromekelleher
Copy link
Member Author

To me this sounds more complicated than exporting to vcf, but if you think we can do what we need with AlphaSim, then great.

@daikitag
Copy link
Collaborator

I managed to compare tstrait with AlphaSimR, and we managed to observe a consistency with the results after standardizing the simulated genetic values (AlphaSimR standardizes the simulated genetic values to make sure that it is perfectly the same with the input mean and variance).
Comparison with PhenotypeSimulator sounds pretty complicated, as we need to input kinship to simulate effect sizes (https://hannahvmeyer.github.io/PhenotypeSimulator/reference/geneticBgEffects.html) and their environmental noise simulation is not using heritability values (https://hannahvmeyer.github.io/PhenotypeSimulator/reference/noiseBgEffects.html).

@daikitag
Copy link
Collaborator

I will upload the comparison codes soon.

@gregorgorjanc
Copy link
Member

Comparison with PhenotypeSimulator sounds pretty complicated, as we need to input kinship to simulate effect sizes

Yeah, don’t go into the kinship abyss

@daikitag
Copy link
Collaborator

I added a comparison with AlphaSimR in #108. I used various parameter combinations, and the QQ-plots look amazing
image
image

Many thanks to @jeromekelleher and @gregorgorjanc for valuable suggestions.

@daikitag
Copy link
Collaborator

I think now it is safe to say that tstrait's simulated genetic values are having the correct properties (or otherwise the simulation through AlphaSimR is incorrect, which is very unlikely).

@daikitag
Copy link
Collaborator

I should also note that the QQ-plot is not producing a straight line simply because the effect sizes are simulated from a normal distribution. For example, if scaling is not done in tstrait's genetic values (scaling is conducted in AlphaSimR), we will observe a strange QQ-plot. Thus, we can say that the simulated genetic values from both tstrait and AlphaSimR are having similar distributions.

image

@daikitag daikitag linked a pull request Nov 22, 2023 that will close this issue
@daikitag daikitag removed a link to a pull request Jan 23, 2024
@daikitag
Copy link
Collaborator

daikitag commented Feb 1, 2024

The statistical tests can be now added to tstrait by using verification.py. All tests must be a subclass of the Test class defined in https://github.com/tskit-dev/tstrait/blob/main/verification.py#L38, and the test methods must start with ''test_''. The codes were largely adapted from msprime/verification.py, and please see its documentation for details.

@daikitag daikitag linked a pull request Feb 8, 2024 that will close this issue
@daikitag
Copy link
Collaborator

daikitag commented Feb 9, 2024

We should be testing our results statistically by comparing them against other simulators. There's no reason we can do this at the small scale.

It shouldn't be too hard to use e.g. PhenotypeSimulator on some simple simulations, and qqplot the results as a comparison with our results.

So, we would do something like:

  1. Simulate a small tree sequence using msprime for say, 10 samples, and then for n replicates each do:
  2. Compute phenotypes for each of the 10 individuals under a given model to get a distribution per-individual using tstrait
  3. Export the data to VCF and do the same thing with an external too/ We can then compare the per-individual distributions as qqplots, which should be meaningful.

This will take some work to do, but is an important validation step.

I did this in Pull request #132, but I'm not sure if this validation step would be a good test, considering that all individuals will have a similar normal-looking distributions. So even if I use two different individuals to produce a QQ-plot, the results match.

compare_tstrait_phenotype_ind_1_AlphaSimR_phenotype_ind_2

Instead, I would like to propose a similar validation test that simplePHENOTYPES did https://github.com/samuelbfernandes/simplePHENOTYPES/blob/master/vignettes/Supplementary.pdf, and examine if the simulation output of tstrait is exactly the same as the simulation output of other packages.
I think putting a similar Notebook as this inside the paper as a supplementary material will definitely strengthen our claim that the simulation output of tstrait is correct, as a single exact test ensures that the tstrait algorithm is producing correct results.

I would like to propose that we completely ignore the QQ-plot comparison, and we instead do the following:

  1. Exact test with AlphaSimR for a single trait
  2. Exact test with AlphaSimR for pleiotropic trait
  3. Exact test with simplePHENOTYPES for a single trait
  4. Exact test with simplePHENOTYPES for pleiotropic trait
  5. Exact test with ARG-Needle simulation for a single trait
  6. Exact test with ARG-Needle simulation for a single trait with frequency dependence

For these exact test, we will be simulating traits + effect sizes by using an extrenal program, and we will put those effect sizes into the tstrait package to see if the simulated output will exactly match.

What do you think about this @jeromekelleher ?

@daikitag
Copy link
Collaborator

The exact tests are performed here (#140), but we plan to add further tests to examine the complete pipeline of the simulation framework.

@daikitag
Copy link
Collaborator

I did the simulation by using the out of Africa model in stdpopsim, and it seems that the simulated phenotypes are slightly different when they are from different populations:
image
image

I guess we can repeat the above with the out of Africa model by using a high narrow-sense heritability, such that the simulation results won't be dominated by normal nosie.

@daikitag daikitag linked a pull request Feb 20, 2024 that will close this issue
@mergify mergify bot closed this as completed in #132 Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants