Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

where would I find the equation or lognorm fit parameters to the Ks distributions? #54

Open
tamsen opened this issue Dec 22, 2023 · 10 comments

Comments

@tamsen
Copy link

tamsen commented Dec 22, 2023

Hi Cecilia,

This looks like a great application! I have a few questions:

  1. Does the Ksrates output provide the equation and/or lognorm fit parameters for the mixture-model fits? I'd like to be able to see the actual fit parameters and be able to reproduce the distribution & fit.

  2. This software can also be run on a single genome, correct? It does not require a species trio? (Understood that this may affect the rate adjustments)

Thanks so much,
Tamsen

@Cecilia-Sensalari
Copy link
Collaborator

Cecilia-Sensalari commented Dec 25, 2023

Hello tamsen,

Thanks for reaching out!

There are two output files where details of the mixture model fits are stored, you can refer to the docs for their description. See here for the exponential-lognormal mixture model, here for the lognormal mixture model and here under rate_adjustment/species/paralogs_analyses for where to find them.

Concerning the number of species to set up your analysis, I'm not completely sure I understand your goal and what you mean with Understood that this may affect the rate adjustments. Could you re-elaborate? :)
So far I can say that for the "full" ksrates pipeline (e.g. when running it through Nextflow) you need a minimum of three species, because the final goal is to plot a paralog distribution with on top at least one adjusted ortholog mode; our rate-adjustment works with a focal species, a sister species and an outgroup species.
However, you can individually run command paralogs-ks to only build the paralog Ks distribution of your focal species, and later plot it with plot-paralogs to visualize it, without ortholog modes. The only thing to be aware of in this case is that the configuration file and the commands always expect to deal with the "full setup", e.g. the Newick tree, the FASTA files for all species in the tree... otherwise they complain and exit!

Best,
Cecilia

@tamsen
Copy link
Author

tamsen commented Dec 27, 2023

Thank you so much for the prompt reply over the holidays! That's great. The "species_parameters" files look like what I will need. I'll try your suggestions and let you know if I have any follow up questions. :-)

Thanks again
Tamsen

@tamsen tamsen closed this as completed Dec 27, 2023
@tamsen
Copy link
Author

tamsen commented Jan 4, 2024

Hi Cecilia,

I do have a follow up question! Thanks to your help, I am now able to generate the fit parameters I was looking for, but I would like to make sure that I am using them correctly to reconstruct the lognorm components. As a test, I used the "elaeis" example data, and turned off the co-linearity option to activate the exponential-lognormal mixture model. During the analysis, "model 2" was chosen by the software as the best fit. In the “elmm_elaeis_parameters.txt” file I find the gaussian parameters "Normal_Mean, Normal_SD,Normal_Weight" for each iteration and model. Choosing the last iteration of model 2 ( Initialization=10, model=2), I get -"1.2342795183, 0.2428038069, 0.2551137529" for the "Normal_Mean, Normal_SD,Normal_Weight" of the first component of the mixture model. The next row is the parameters for the next component and so on. [Please do correct me if I understand any of that wrong.]

With these values for the txt file, I can recreate the plot for the Gaussians for Model 2 shown in the “elmm_elaeis_models_data_driven.pdf” .
My question is, how to do I translate these Gaussians into the lognorm space to get the equations for the model components on the left-hand side of the “elmm_elaeis_models_data_driven.pdf” plots? (I did a reverse transform on the X axis, but it looked like I was missing a scaling factor for the magnitudes of the lognorm components..) Can you please explain to me what I need to do to get the lognorm equations needed for the mixture model in Ks space?

Thank you so much for your time!
Tamsen

elmm_elaeis_parameters.txt
elmm_elaeis_models_data_driven.pdf

@tamsen tamsen reopened this Jan 4, 2024
@Cecilia-Sensalari
Copy link
Collaborator

Cecilia-Sensalari commented Jan 15, 2024

Hi tamsen,

Sorry for my late reply!

While elmm_elaeis_models_data_driven.pdf is a density plot, the final ksrates output has actual counts on the y-axis.
The function plotting the best model into the final ksrates output introduces a scaling factor to convert from a density plot to a real count plot:
scaling = bin_width * len(deconvoluted_data[deconvoluted_data <= max_ks_EM])
(Note that when plotting the components in elmm_elaeis_models_data_driven.pdf, the scaling variable is still present but set to 1 to make it irrelevant).

How is the scaling factor computed? From docstrings of the function above:

bin_width: width of the histogram bins
deconvoluted_data: artificial dataset that produces the same histogram shape, used to avoid having weights in EM.
max_ks_EM: upper limit of the Ks range considered for the mixture modeling fitting
  • The bin width is taken from your configuration file, according to this line.
  • The deconvoluted data variable is a list of artificial Ks values that I had to generate to replace the original Ks list, in order to avoid using weights during the Exp-Log mixture model (more here); for what concerns the scaling, it is a dataset with almost exactly the same number of Ks values of the original Ks dataset. (EDIT: this is wrong, please refer to my next post for follow up discussion)
  • max_ks_EM is by default 5, check whether you changed it or not (more here)

After having computed the scaling factor, the next code line plots the components onto the final figure by using the scaling factor.

Do you think you can try to reuse this information and these lines of code to recreate the plot yourself? Apologies for this rather long explanation!

@tamsen
Copy link
Author

tamsen commented Jan 16, 2024

Hi Cecilia,

Thank you for your long explanation! :-) I appreciate it. And I hope you have a nice winter holiday. Yes, my hope would be to use the information from ksrates to recreate the fit plot myself in Ks space, in order to apply some mathematical models to the fitted lognorms. Given your explanation, I see two options to get the output I want

A) Given that
scaling = bin_width * len(deconvoluted_data[deconvoluted_data <= max_ks_EM])
Do you think I could use the ks array from the [species].ks.tsv file, as a proxy for "deconvoluted_data" in the calculation, since its only the num datapoints less than max_ks_EM that is the value needed? (My bin_width and max_ks_EM are still set at the default values).

~ or ~

B) Locally modify your code to output/log the scaling factors I need, as they are calculated. Then I could use those scaling fators, combined with the parameters in the elmm_elaeis_parameters.txt file, to rereate the fits lognorm in Ks space. (For simplcity, I'd prefer not to modify ks rates, if it can be avoided).

If you get a chance please let me know if this sounds right, or if you have a better suggestion.

Thank you again so much,
Tamsen

@Cecilia-Sensalari
Copy link
Collaborator

Cecilia-Sensalari commented Jan 17, 2024

Hi tamsen,

I've been looking at the source code to remind myself about how the deconvolution is implemented, and I have to correct my previous message:

for what concerns the scaling, [the deconvoluted dataset] is a dataset with almost exactly the same number of Ks values of the original Ks dataset. WRONG!

The original Ks dataset is redundant, so it does not have the same length of the deconvoluted data.
("Redundant" here means that there are e.g. four Ks age estimates (0.2, 0.19, 0.21 and 0.23) for the same duplication event; by means of node weighting, each Ks is then assigned a 1/4=0.25 weight in the histogram to make them in total count as 1 estimate).

The length of the deconvoluted dataset is instead equal to the sum of the weights of the original Ks data (only considering Ks up to 5, as per max_ks_EM in your config file).
Note that actually len(deconvoluted_data) is an approximation of sum(ks_weights) due to some rounding of floating numbers; they are however very similar and this shouldn't significantly affect the calculation of the scaling factor when plotting the fitted components.

Example:

bin_width = 0.1
max_ks_EM = 5

Length of the original Ks data (only Ks <= max_ks_EM): 30392
Sum of Ks weights (only Ks <= max_ks_EM): 8761.754450000479
Length of deconvoluted Ks data (only Ks <= max_ks_EM): 8761
Scaling factor: 876.1

Solution 1: if you're fine with the approximation "sum(weights)=len(deconv_data)", you can 1) compute the sum of weights from the ks.tsv file (column WeightOutliersExcluded, only for Ks <= 5) and then 2) multiply this number by bin_width.

Solution 2: I assembled in a Python script the ksrates functions involved in the deconvolution (see attached file), which you can run by providing the path to your ks.tsv file; it will calculate and print the scaling factor like shown in the example above.

compute_scaling_factor.txt

Cheers,
Cecilia

@tamsen
Copy link
Author

tamsen commented Jan 17, 2024

image

:-)

Thank you. I went with Soln 2 since it worked like a charm. Thanks again!

@tamsen tamsen closed this as completed Jan 17, 2024
@cvargas88
Copy link

Hi Tamsen and Cecilia,
Thanks a lot for this thread. I am trying to obtain the exact same plot as in your previous comment but I have been unable to reproduce it. Thanks to your previous comments I was able to find the gaussian parameters, but when I apply the following transformation:

log_ks <- log(ks_list)
transformed_ks <- (log_ks - normal_mean) / normal_sd
transformed_ks <- transformed_ks * normal_weight

I get values which have a mode close to what is expected (though not exactly) but the distribution has a different shape and even negative values. Could you please suggest me what am I missing or share your script to reproduce the aforementioned plot?

Thank you very much in advance!
Carlos

@Cecilia-Sensalari
Copy link
Collaborator

Hi cvargas88,

Thanks for reaching out and sorry about my late replay!

I'm afraid I need some more "story" to understand what your goal is and how you are achieving it. For example, could you describe the reasoning behind your code and what transformed_ks is? Perhaps it would help to upload a file/picture of the unexpected distribution you obtained. Are you computing and using the scaling factor somewhere?

Some general comments that might help.

  • Ks vs. logKs: When you are using normal "Ks" values (i.e. not log-transformed), there are only positive values (or equal to 0). When log-transforming the Ks values (first remove all 0 Ks), there are also negative logKs and the distribution has a different shape. You can compare the two types of distributions in file elmm_elaeis_models_data_driven.pdf uploaded by tamsen: normal "Ks" in the left column in grey and logKs in the right column in red.
  • Density vs. real count: About the scaling factor in ksrates, it is set to 1 when plotting density plots (y-axis: "Density of...", like in elmm_elaeis_models_data_driven.pdf), but it is set to another number when plotting the real count plot (y-axis: "Number of...", like in the final ksrates output files). This scaling factor is computed like I described above.

Cheers,
Cecilia

@cvargas88
Copy link

Dear Cecilia,
Thank you very much for your reply and also for developing such an useful tool. I used ksrates to identify WGD events and their position according to speciation events, therefore the mixed_species_elmm.pdf is just what I needed. My goal is to reproduce that plot, but I have been unable to find the data to do so. After some searching I found this thread and believed it was a way to reproduce it.
transformed_ks corresponded to the list of ks values after substracting the normal_mean and dividing it by the normal_sd that I found in the elmm_elaeis_parameters.txt as previously described in the thread. However plotting them (even after transforming them using exp) does not look anything like the plot in mixed_species_elmm.pdf. Am I missing something?

Thank you very much in advance!
Carlos

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants