General guidelines for Pgen #4
Open
jeremycfd opened this issue Mar 29, 2018 · 3 comments

@jeremycfd

Hi Quentin,

I've been playing around with IGoR a bit after reading the paper, and I'm wondering if you can give some suggestions on how to estimate Pgen for small datasets:

  1. Say, for instance, we have anywhere from 50 to a few hundred human single-chain TCR sequences that are also epitope-specific. Would you recommend simply using the model that comes with IGoR to estimate Pgen, or do you anticipate any improvement from first running -infer to update the model, even though there are relatively few sequences and they are not representative of a random selection from the repertoire?

  2. I recall from the paper that you recommend considering at least 50 scenarios for each somatic recombination event. But when I set --scenarios to any value, I can't see evidence in the logs or the output that the number of scenarios I specified is actually being used. Perhaps I'm looking in the wrong place; can you advise? Or does estimating Pgen simply not benefit from considering more than the 10 most likely scenarios? (A rough sketch of the commands I'm running follows this list.)
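
For reference, this is roughly the pipeline I'm running, adapted from the demo in the README (the working directory, batch name, and input file names are placeholders):

    # Read in the epitope-specific sequences and align them against the human TCR beta templates
    igor -set_wd /path/to/workdir -batch epi_tcrs -read_seqs epitope_specific_tcrs.txt
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta -align --all

    # Evaluate with the supplied model, writing Pgen and the most likely scenarios for each sequence
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta \
         -evaluate -output --Pgen --scenarios 50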

Thanks!

@qmarcou
Owner

qmarcou commented Apr 6, 2018

Hi Jeremy,
Sorry for the late reply. These are two interesting questions; see my answers below:

  1. From what we have observed, only gene usage and/or allele sequences vary between the recombination machineries of different individuals. If you had a few hundred out-of-frame sequences, I would have recommended re-learning only the gene usage distributions; in your case the best option is probably to use the provided model as is (see the sketch after point 2). In any case, gene usage variability is not what drives most of the variation in Pgen.

  2. There are two different things here: the number of scenarios IGoR explores and the number of scenarios it outputs. Even if you specify --scenarios 50, IGoR will explore many more than that; only the 50 most likely ones are written to a file in the output directory. What controls how many scenarios IGoR explores during an Expectation-Maximization step is the --P_ratio_thresh and/or --MLSO commands. In theory, the more scenarios explored the better; in practice there is a balance with runtime, but the probability ratio threshold should not be set too high (see the sketch below).
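
As a rough illustration of both points (the working directory, batch names, and threshold value are placeholders, and the exact option names and placement should be checked against the README for your IGoR version):

    # Point 1: with a few hundred out-of-frame sequences, one could in principle restrict the
    # update to the gene-choice events only (assuming the --infer_only option and the event
    # nicknames from the model_parms file are available in your build):
    igor -set_wd /path/to/workdir -batch oof_seqs -species human -chain beta \
         -infer --infer_only v_choice d_gene j_choice

    # Point 2: --scenarios only controls how many scenarios get written out; exploration is
    # governed by the probability ratio threshold (or --MLSO for the single best scenario):
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta \
         -evaluate --P_ratio_thresh 1e-8 -output --Pgen --scenarios 50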

Hope this answers your questions

@jeremycfd
Author

jeremycfd commented Apr 29, 2018

Thanks for the explanation! I find that setting --P_ratio_thresh to 0.0 causes issues (every Pgen comes back as nan), but I can set it to extremely small values (e.g., 1e-10) without issue. What is the default --P_ratio_thresh? (Perhaps that could be added to man igor.)
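
Concretely, the calls I'm using look like this (working directory and batch name are placeholders; flags as I understand them from the README):

    # Works fine with an extremely small but non-zero threshold:
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta \
         -evaluate --P_ratio_thresh 1e-10 -output --Pgen

    # With the threshold set to exactly 0.0, every Pgen in the output file comes back as nan:
    igor -set_wd /path/to/workdir -batch epi_tcrs -species human -chain beta \
         -evaluate --P_ratio_thresh 0.0 -output --Pgen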

Cheers.

@qmarcou
Owner

qmarcou commented Apr 30, 2018

Hmm, that is odd. As explained here, setting it to 0.0 should explore all possible scenarios (yielding a very slow execution), so at first thought I don't see why it would return nan. Could you attach a sample of the Pgen and inference_logs files for debugging purposes?
The default value for this parameter is 10^{-5}. I thought it was in the README; it will be added, thanks for pointing this out!
Thanks!

alfaceor added a commit to alfaceor/IGoR that referenced this issue May 12, 2020
Option -datadir added to get IGoR default data dir (IGOR_DATA_DIR)