General guidelines for Pgen #4
Open
jeremycfd opened this issue Mar 29, 2018 · 3 comments

@jeremycfd

Hi Quentin,

I've been playing around with IGoR a bit after reading the paper, and I'm wondering if you can give some suggestions on how to estimate Pgen for small datasets:

  1. Say, for instance, we have anywhere from 50 to a few hundred human single-chain TCR sequences that are also epitope-specific. Would you recommend simply using the model that comes with IGoR to estimate Pgen, or do you anticipate any improvement from first running -infer to update the model, even though there are relatively few sequences and they are not representative of a random selection from the repertoire?

  2. I recall from the paper that you recommend considering at least 50 scenarios for each somatic recombination event. But when I set --scenarios to any value, I can't see evidence in the logs or the output that the number of scenarios I specified is actually being used. Perhaps I'm looking in the wrong place; can you advise? Or does estimating Pgen simply not benefit from considering more than the 10 most likely scenarios? (A rough sketch of the commands I'm running follows this list.)
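
For reference, this is roughly the pipeline I'm running, adapted from the demo in the README (the working directory, batch name, and input file names are placeholders):

    # Read in the epitope-specific sequences and align them against the human TCR beta templates
    igor -set_wd /path/to/workdir -batch epi_tcrs -read_seqs epitope_specific_tcrs.txt
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta -align --all

    # Evaluate with the supplied model, writing Pgen and the most likely scenarios for each sequence
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta \
         -evaluate -output --Pgen --scenarios 50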

Thanks!

@qmarcou
Owner

qmarcou commented Apr 6, 2018

Hi Jeremy,
Sorry for the late reply. These are two interesting questions; see my answers below:

  1. From what we have observed, only gene usage and/or allele sequences vary between the recombination machineries of different individuals. If you had a few hundred out-of-frame sequences, I would have recommended re-learning only the gene usage distributions; in your case the best option is probably to use the provided model as is (see the sketch after point 2). In any case, gene usage variability is not what drives most of the variation in Pgen.

  2. There are two different things here: the number of scenarios IGoR explores and the number of scenarios it outputs. Even if you specify --scenarios 50, IGoR will explore many more than that; only the 50 most likely ones are written to a file in the output directory. What controls how many scenarios IGoR explores during an Expectation-Maximization step is the --P_ratio_thresh and/or --MLSO commands. In theory, the more scenarios explored the better; in practice there is a balance with runtime, but the probability ratio threshold should not be set too high (see the sketch below).
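
As a rough illustration of both points (the working directory, batch names, and threshold value are placeholders, and the exact option names and placement should be checked against the README for your IGoR version):

    # Point 1: with a few hundred out-of-frame sequences, one could in principle restrict the
    # update to the gene-choice events only (assuming the --infer_only option and the event
    # nicknames from the model_parms file are available in your build):
    igor -set_wd /path/to/workdir -batch oof_seqs -species human -chain beta \
         -infer --infer_only v_choice d_gene j_choice

    # Point 2: --scenarios only controls how many scenarios get written out; exploration is
    # governed by the probability ratio threshold (or --MLSO for the single best scenario):
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta \
         -evaluate --P_ratio_thresh 1e-8 -output --Pgen --scenarios 50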

Hope this answers your questions

@jeremycfd
Author

jeremycfd commented Apr 29, 2018

Thanks for the explanation! I find that setting --P_ratio_thresh to 0.0 causes issues (every Pgen comes back as nan), but I can set it to extremely small values (e.g., 1e-10) without issue. What is the default --P_ratio_thresh? (Perhaps that could be added to man igor.)
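
Concretely, the calls I'm using look like this (working directory and batch name are placeholders; flags as I understand them from the README):

    # Works fine with an extremely small but non-zero threshold:
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta \
         -evaluate --P_ratio_thresh 1e-10 -output --Pgen

    # With the threshold set to exactly 0.0, every Pgen in the output file comes back as nan:
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta \
         -evaluate --P_ratio_thresh 0.0 -output --Pgen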

Cheers.

@qmarcou
Owner

qmarcou commented Apr 30, 2018

Hmm, that is odd. As explained here, setting it to 0.0 should explore all possible scenarios (yielding a very slow execution), so at first thought I don't see why it would return nan. Could you attach a sample of the Pgen and inference_logs files for debugging purposes?
The default value for this parameter is 10^{-5}. I thought it was in the README; it will be added, thanks for pointing this out!
Thanks!

alfaceor added a commit to alfaceor/IGoR that referenced this issue May 12, 2020
Option -datadir added to get IGoR default data dir (IGOR_DATA_DIR)