Allelic variants in generated repertoire #21
Comments
Hi Anna,

IGoR's handling of allelic variants

Origin of IGoR's genomic templates

The provided genomic templates originally come from the IMGT database, to which were appended some variants found while constructing the generative model on the training dataset. Because the people maintaining IMGT wanted to create an exhaustive database, the resulting list of alleles comprises many allelic variants that have been found here and there in the population. Because IGoR does not yet ship with on-the-fly inference of the allelic variants present in a dataset, it has to rely on these IMGT variants.

Number of allelic variants

The biology

In fact, some studies suggest that the TCR and BCR loci are quite dynamic and that gene duplication might be common. From Kidd et al., « The inference of phased haplotypes for the immunoglobulin H chain V region gene loci by analysis of VDJ gene rearrangements », The Journal of Immunology (2012):
This would naturally lead to observing more than 2 alleles of the same gene.

IGoR

On top of the fact that several allelic variants could be present on the same chromosome, there are other shortcomings due to the sequencing process. Because alleles of the same gene may differ by a single nucleotide, and because the sequencing process is both error-prone (i.e. it can introduce such single-nucleotide variations) and has finite read length (meaning not all nucleotides of the gene/allele can be observed), it is not always possible to distinguish between two different alleles, and one can only assign posterior probabilities to the gene/allele identity. This leads to assigning non-zero probability to most alleles.

Restricting the number of alleles used in IGoR

Tuning gene and allele usage to your dataset

It has been shown that gene/allele usage frequencies are the most variable components of the recombination machinery across individuals and sequencing technologies (see IGoR's paper for a more detailed discussion). Before performing any computation on your dataset, it might be worthwhile to first use the inference mode of IGoR and relearn only the gene usage frequencies for your dataset using the

Manually restricting the number of genes/alleles available for a dataset

In order to restrict the number of genes/alleles to a limited list (e.g. to generate sequences with a particular VJ combination), the user can supply such a list via the
Thus supplying a FASTA file containing only the desired V and J alleles will automatically restrict the usage to these genes, without the need for the user to re-infer a model, provided these genes/alleles were already contained in the initial gene list.

In the near future I'd like to introduce such notions in a more complete wiki/manual of IGoR, so please tell me if anything remains unclear from this answer!

Best,
Quentin
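Editor's sketch of the FASTA restriction described in the answer above: a minimal Python example that keeps only a chosen set of alleles from a genomic template file. It is an illustration, not part of IGoR. The function names, the example allele names, and the assumption that each FASTA header is exactly the allele name (e.g. `TRBV1*01`) are all hypothetical; check the headers of your own template files before relying on it.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def restrict_alleles(records, wanted):
    """Keep only records whose header is in `wanted`, in original file order.

    Raises ValueError if a requested allele is absent, since an allele that
    was not in the initial gene list cannot simply be added this way.
    """
    wanted = set(wanted)
    kept = [(h, s) for h, s in records if h in wanted]
    missing = wanted - {h for h, _ in kept}
    if missing:
        raise ValueError(f"alleles not found in template: {sorted(missing)}")
    return kept


def write_fasta(records):
    """Serialize (header, sequence) pairs back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


if __name__ == "__main__":
    # Toy template with made-up sequences (assumption: header == allele name).
    template = ">TRBV1*01\nACGTAC\n>TRBV2*01\nGGGG\n>TRBV2*02\nGGGA\n"
    kept = restrict_alleles(read_fasta(template), ["TRBV1*01", "TRBV2*02"])
    print(write_fasta(kept), end="")
```

The resulting text would be saved as a new FASTA file and supplied to IGoR in place of the full template. Raising an error on missing alleles mirrors the condition stated above that the requested genes/alleles must already be in the initial gene list.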
Thank you for the detailed answer!
Hi Quentin
I haven't managed to find any information about how IGoR handles the allelic variants present in its models. In the default models, some IGHV, TRAV and TRBV genes have several alleles (up to 7), and all of them have non-zero probability. Obviously, having more than 2 alleles of one gene in one repertoire is not realistic (unless chimeras are considered).
It seems that I have to edit the models manually so that each gene has no more than 2 alleles. What is the proper way to do this? Should I just set the probabilities of all alleles except the two most frequent to zero and recalculate the remaining probabilities?
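Editor's sketch of the arithmetic asked about here: keep the two most frequent alleles of each gene, zero the rest, and renormalize so the usage sums to 1. It operates on a plain dictionary, not on IGoR's model files. The function name, the example values, and the assumption that allele names follow the `GENE*NN` pattern are hypothetical. It only treats a single marginal distribution; if other events in a model are conditioned on the V choice, those tables would need consistent treatment too. The maintainer's reply in this thread describes supplying a restricted FASTA file instead.

```python
from collections import defaultdict


def keep_top_alleles(usage, n_keep=2):
    """Zero all but the `n_keep` most probable alleles per gene, then renormalize.

    `usage` maps allele names such as 'IGHV1-69*01' to probabilities.
    The gene is taken to be the part of the name before '*' (an assumption).
    """
    by_gene = defaultdict(list)
    for allele, p in usage.items():
        gene = allele.split("*")[0]
        by_gene[gene].append((p, allele))

    kept = {}
    for items in by_gene.values():
        # Sort by probability, highest first; ties are broken by allele name.
        items.sort(reverse=True)
        for p, allele in items[:n_keep]:
            kept[allele] = p

    total = sum(kept.values())
    if total == 0:
        raise ValueError("all retained alleles have zero probability")
    # Dropped alleles stay in the output with probability 0.
    return {allele: kept.get(allele, 0.0) / total for allele in usage}


if __name__ == "__main__":
    # Made-up usage values for illustration only.
    usage = {
        "IGHV1-69*01": 0.4,
        "IGHV1-69*02": 0.1,
        "IGHV1-69*04": 0.2,
        "IGHV3-23*01": 0.3,
    }
    for allele, p in keep_top_alleles(usage).items():
        print(f"{allele}\t{p:.4f}")
```

Renormalizing over all retained alleles preserves the relative usage between genes only approximately, since each gene loses a different amount of probability mass; rescaling within each gene would be an alternative design.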
Best, Anna