
Questions in alignment results #32

Open
Yyx2626 opened this issue Nov 7, 2018 · 5 comments

@Yyx2626

Yyx2626 commented Nov 7, 2018

Hi Quentin @qmarcou,

I am a new bioinformatics postdoc in Prof. Frederick Alt's lab. I have read your 2018 Nature Communications paper introducing the fantastic tool IGoR, and I am now trying to apply IGoR to our mouse data. However, I have encountered some questions about IGoR, starting with the align step.

(1) I have run through the IGoR demo, written a script to display the alignments, and compared them with IgBLAST results. I then had two strange observations: 1) the alignment of D and J often has a one-bp mismatch at the 5' end, which is dropped by IgBLAST; 2) although I saw no mismatch in the displayed alignment, IGoR -align reported many mismatch indices.

For example, the third read sequence in the demo (seq_index = 2) is TCCCCAACCAGACAGCTCTTTACTTCTGTGCCACCAGTGACCCGGGTACAACGACGAGCA, and IGoR top alignments include:

#seq_index;gene_name;score;offset;insertions;deletions;mismatches;length;5_p_align_offset;3_p_align_offset
2;M11951|TRBV24-1*01|Homo sapiens|F|V-REGION|189..476|288 nt|1| | | | |288+0=288| | |;195;-244;{};{};{40,41,42};40;0;39
2; TRBD1*01;25;8;{};{};{8,9,15,16,17,18,19};6;9;14
2;M14159|TRBJ2-7*01|Homo sapiens|F|J-REGION|2316..2362|47 nt|2| | | | |47+0=47| | |;35;48;{};{};{49,50,52};8;52;59
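For reference, this is a minimal sketch of how I parse these semicolon-separated alignment lines; the field names are taken from the header line above, and the helper below is my own illustration, not part of IGoR:

```python
# Minimal parser for one IGoR alignment line; field names follow the
# header line above. Illustrative helper, not part of IGoR itself.
FIELDS = ["seq_index", "gene_name", "score", "offset", "insertions",
          "deletions", "mismatches", "length",
          "5_p_align_offset", "3_p_align_offset"]

def parse_igor_alignment(line):
    rec = dict(zip(FIELDS, line.rstrip("\n").split(";")))
    # "{8,9,15}" -> [8, 9, 15]; "{}" -> []
    body = rec["mismatches"].strip("{}")
    rec["mismatches"] = [int(x) for x in body.split(",")] if body else []
    for key in ("seq_index", "score", "offset", "length",
                "5_p_align_offset", "3_p_align_offset"):
        rec[key] = int(rec[key])
    return rec

rec = parse_igor_alignment("2; TRBD1*01;25;8;{};{};{8,9,15,16,17,18,19};6;9;14")
print(rec["gene_name"].strip(), rec["mismatches"])
```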

With my scripts, I displayed the alignment, as shown below

V D J target length = 288 12 47
V TRBV24-1*01, obs 1:40:40 , ref 245:500:256  (1-based start:end:length)
obs TCCCCAACCAGACAGCTCTTTACTTCTGTGCCACCAGTGA
ref ........................................TTTG
D TRBD1*01, obs 10:15:6 , ref 2:7:6
obs AGACAG
ref G.....
J TRBJ2-7*01, obs 53:60:8 , ref 5:12:8
obs GACGAGCA
ref T.......

On the other hand, IgBLAST result is

V D J target length = 288 1000 47
V TRBV24-1*01, obs 1:40:40 , ref 245:284:40
obs TCCCCAACCAGACAGCTCTTTACTTCTGTGCCACCAGTGA
ref ........................................
D -, obs 41:40:0 , ref 1:0:0
obs 
ref 
J TRBJ2-7*01, obs 54:60:7 , ref 6:12:7
obs ACGAGCA
ref .......

I looked at some other examples and saw the one-bp mismatch at the 5' end of D or J in the IGoR align results but not in the IgBLAST results, and I do not know why. Also, I am not sure why IGoR reported more mismatch indices than are actually displayed (maybe they lie outside the alignment region?).

(2) In order to run IGoR -align, I am preparing input files. Along with three VDJ sequence fasta files, I saw V_gene_CDR3_anchors.csv and J_gene_CDR3_anchors.csv containing an anchor_index for most V and J segments. I examined the IMGT annotation (e.g. http://www.imgt.org/ligmdb/view?id=U66059) and guessed that the anchor_index for Vs is the start position of the 2nd-CYS, and the anchor_index for Js is the start position of the J-MOTIF (both 0-based). Is this guess correct?
I also read the IGoR documentation webpage, which says "The index should correspond to the first letter of the cysteine (for V) or tryptophan/phenylalanin (for J) for the nucleotide sequence of the gene.", and "If the considered sequences are nucleotide CDR3 sequences (delimited by its anchors on 3' and 5' sides) using the command --ntCDR3 alignments will be performed using gene anchors information as offset bounds.". So if I do not use --ntCDR3, is it necessary to provide the anchor_index for V and J?
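To state my guess precisely, here is a toy illustration of the 0-based convention I have in mind; the sequence and positions below are invented, not real IMGT data:

```python
# Toy illustration of the 0-based anchor convention I am guessing at.
# The sequence and positions are invented, not real IMGT annotation.
def anchor_index(imgt_first_nt_position_1based):
    """IMGT LIGM-DB positions are 1-based; the anchor_index looks 0-based."""
    return imgt_first_nt_position_1based - 1

v_seq = "ATGTGTGCCAGCAGC"   # invented V-gene tail; "TGT" is a Cys codon
cys_start_1based = 4        # first nucleotide of the 2nd-CYS codon (1-based)
idx = anchor_index(cys_start_1based)
print(v_seq[idx:idx + 3])   # -> "TGT" (the cysteine codon)
```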

Thank you very much for your kind attention. Looking forward to your reply.

Best regards,
Adam Yongxin Ye

@qmarcou
Owner

qmarcou commented Nov 13, 2018

Dear Adam,
I'll try and reply quickly and further edit my answer for more clarity.

About (1):

  1. There might be slight differences due to the scoring schemes. IgBlast has one scheme (a number of contiguous nucleotides to consider a hit, then a BLAST-style extension of the alignment) and IGoR's alignment module has another (Smith-Waterman alignment with an ad hoc scoring matrix and gap penalties). Fine-tuning the parameters is important for IgBlast, since the alignments are its end result, but it matters less for IGoR, because alignment is just a preprocessing step before the actual probabilistic treatment (whose parameters are or were inferred from real data rather than fine-tuned). Basically, IGoR only uses alignments to obtain the offsets of the sequences relative to each other; calling the number of deletions etc. (which seems to be what you are aiming at) requires the probabilistic treatment.
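To make the difference concrete, here is a bare-bones Smith-Waterman scorer; the scoring parameters are purely illustrative, not IGoR's actual matrix or penalties:

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-8):
    """Bare-bones Smith-Waterman local alignment score (no traceback).
    Scoring parameters are illustrative, not IGoR's actual ones."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are floored at 0 so a new alignment
            # can start anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# The J example above: "GACGAGCA" vs a germline starting "TACGAGCA".
# With these toy scores, the best local alignment is the 7 exact matches.
print(smith_waterman_score("GACGAGCA", "TACGAGCA"))  # -> 35
```

Depending on the mismatch and gap parameters, including or dropping a flanking mismatched base changes the reported local alignment, which is one way two aligners can legitimately disagree at the 5' end.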

  2. Thanks for pointing this out, I should state this explicitly in the README. The extra mismatch indices are in fact outside the actual alignment (and they do not contribute to the overall score); they let the inference/probabilistic machinery rapidly know where extra mismatches (due to fewer deletions) would lie. To obtain them I take the best SW alignment(s) and extend them "manually" to collect all mismatches. Note that this is far from bulletproof: in a regime with many insertions and deletions one should use a global alignment algorithm to build the alignment extension in a principled way (a fix I'd like to make when I have time).
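A sketch of that extension step (my own simplified, gapless version, not IGoR's actual code), run on the TRBD1*01 alignment from your example:

```python
def extend_mismatches(read, ref, read_start, ref_start, core_len):
    """Extend a gapless local alignment of ref against read as far as both
    sequences allow, and record every mismatching read position, whether it
    falls inside or outside the aligned core. Simplified, gapless sketch of
    the idea; not IGoR's actual implementation."""
    # Furthest we can extend left/right while staying inside both sequences.
    left = min(read_start, ref_start)
    right = min(len(read) - (read_start + core_len),
                len(ref) - (ref_start + core_len))
    return [read_start + k
            for k in range(-left, core_len + right)
            if read[read_start + k] != ref[ref_start + k]]

read = "TCCCCAACCAGACAGCTCTTTACTTCTGTGCCACCAGTGACCCGGGTACAACGACGAGCA"
trbd1 = "GGGACAGGGGGC"  # TRBD1*01 germline
# Core alignment from the demo output above: read[9:15] vs ref[1:7] (0-based)
print(extend_mismatches(read, trbd1, 9, 1, 6))  # -> [8, 9, 15, 16, 17, 18, 19]
```

This reproduces the {8,9,15,16,17,18,19} list reported for TRBD1*01 in the demo: only position 9 lies inside the 6-nt core; the rest come from the extension.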

About (2):
Yes, your understanding of the anchor definition is correct, and you indeed do not need to supply V and J anchors if you are not using any CDR3-related operation.

Tell me if some points remain unclear and I will try to edit and clarify my answer!

Best,
Quentin

@Yyx2626
Author

Yyx2626 commented Nov 17, 2018

Dear Quentin,

Thank you very much. The response is clear. I have started running IGoR -align on my data and have run some tests on the demo for -infer and -evaluate -output, as well as pygor. Here, I encountered four other questions about IGoR model inference and evaluation.

1 (infer). According to the documentation webpage, the -chain option can be set to alpha, beta, light, heavy_naive, heavy_memory; however, under share/igor/models/human, there are only three folders: tcr_alpha, tcr_beta, bcr_heavy. So my questions are: (1) Where is the model for the light chain, or is it unfinished? (2) Is there any difference between heavy_naive and heavy_memory? Do they both use the bcr_heavy folder, with the differences hard-coded in IGoR? BTW, (3) what are the default values for --L_thresh and --P_ratio_thresh?

2 (evaluate). (1) Why do -evaluate and -output need another round of iteration? (2) Do you have any detailed suggestions on how to evaluate the model fit after -infer is done, e.g. look at the distribution of which measure? Using pygor? (3) How can I tell that the iteration has converged, and how accurate (e.g. standard error) is each model parameter? (4) When running -evaluate after -infer, should I set -set_custom_model to the final_parms.txt and final_marginals.txt under the _inference folder?

3 (pygor). I tested pygor, which can read the best_scenarios output file as well as model files, and store them in container classes. (1) Does pygor also provide functions for probabilistic calculations based on an input IGoR model, for example, computing the generation probability of a given scenario? BTW, (2) in the inference_logs.txt file (with little documentation?), does the seq_likelihood column mean (something similar to) Pgen, and the seq_best_scenario column the generation probability of the best scenario?

4 (coverage). I tested -output --coverage (also with little documentation?). (1) Does VD_genes mean V_gene and D_gene; VDJ_genes mean V_gene, D_gene and J_gene; etc.? (2) When I tried to run --coverage on the demo foo, IGoR ran through for V_gene and J_gene, but reported an error for D_gene (or VD_genes, VDJ_genes, etc.); below are the commands and output.

Commands:

igor -set_wd demo -batch foo -read_seqs share/igor/demo/murugan_naive1_noncoding_demo_seqs.txt
igor -set_wd demo -batch foo -species human -chain beta -align --all
igor -set_wd demo -batch foo -species human -chain beta -infer --N_iter 1

Command:

igor -set_wd demo -batch foo -species human -chain beta -evaluate -output --coverage

Output:

Chain parameter set to: beta
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
Abort trap: 6

Command:

igor -set_wd demo -batch foo -species human -chain beta -evaluate -output --coverage -set_custom_model demo/foo_inference/final_parms.txt demo/foo_inference/final_marginals.txt

Output:

[IGoR] ERROR: Unknown argument "-set_custom_model" to specify coverage target!
 Supported arguments are: V_gene, VD_genes, D_gene, DJ_gene, VJ_gene, J_gene, VDJ_genes

Command:

igor -set_wd demo -batch foo -species human -chain beta -evaluate -output --coverage D_gene

Or

igor -set_wd demo -batch foo -species human -chain beta -evaluate -output --coverage D_gene  -set_custom_model demo/foo_inference/final_parms.txt demo/foo_inference/final_marginals.txt

Output:

...
Performing Evaluate/Inference iteration 1
Initializing probability bounds...
Initialization of probability bounds over.
terminate called after throwing an instance of 'char const*'
Abort trap: 6

Thank you very much for your kind attention. Looking forward to your reply.

Best wishes,
Adam Yongxin Ye

@qmarcou
Owner

qmarcou commented Dec 19, 2018

Dear Adam,
My apologies I just realized I had not answered your second round of questions.

About the infer command

About models

There is indeed no model for the light chain available yet, both for lack of data and lack of time to create one. I would however be happy to add any model a user manages to obtain from data. These questions, and how to create a model from scratch, have already been addressed in the documentation and in a few other GitHub issues.

Difference between heavy naive and memory

The initial plan was to have different models; however, I was not sure which error model (given that different models are obtained on different genes) should be used. See the IGoR paper for more details. Alternatively, I have provided all inferred hypermutation models in the bcr/heavy/supplementary_models folder. They can be loaded using the -set_custom_model command line.

About default parameters

I thought they were given in the documentation, my apologies. I will add them to the docs.
You can also find them in the main.cpp file:

	bool viterbi_inference = false;
	double likelihood_thresh_inference = 1e-60;    // default for --L_thresh
	double proba_threshold_ratio_inference = 1e-5; // default for --P_ratio_thresh
	size_t n_iter_inference = 5;                   // default for --N_iter
	bool infer_only = false;
	bool no_infer = false;
	set<string> infer_restrict_nicknames;
	bool fix_err_rate = false;
	bool subsample_seqs = false;
	size_t n_subsample_seqs;

About the evaluate command

Why does it need another round of iteration?

The evaluate command performs the same computation as one iteration of the Expectation-Maximization algorithm. The rationale for separating the two commands is simply to make sure the desired number of iterations is performed while learning the model (and, in the future, because one could use heuristics to accelerate the model learning phase).
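For intuition, here is one EM iteration on a toy problem (the classic two-coin mixture, nothing to do with IGoR's actual recombination model): evaluating amounts to running such an iteration once and keeping the per-datum statistics rather than the updated parameters.

```python
import math

def em_iteration(p1, p2, trials):
    """One EM iteration for a 50/50 mixture of two biased coins.
    Each trial is (heads, tails) drawn from one of the two coins.
    Toy example, unrelated to IGoR's actual model."""
    s1h = s1t = s2h = s2t = loglik = 0.0
    for h, t in trials:
        l1 = p1 ** h * (1 - p1) ** t
        l2 = p2 ** h * (1 - p2) ** t
        loglik += math.log(0.5 * (l1 + l2))
        r1 = l1 / (l1 + l2)              # E-step: responsibility of coin 1
        s1h += r1 * h; s1t += r1 * t
        s2h += (1 - r1) * h; s2t += (1 - r1) * t
    # M-step: re-estimate each coin's bias from expected counts
    return s1h / (s1h + s1t), s2h / (s2h + s2t), loglik

p1, p2, ll = em_iteration(0.6, 0.5, [(9, 1), (8, 2), (4, 6), (3, 7)])
print(round(p1, 2), round(p2, 2))  # p1 drifts up, p2 drifts down
```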

About summary statistics

I'll point you to issue #25, which I have recently edited; it should answer your question.

Convergence and model parameters accuracy

The convergence point has also been addressed in issue #25.
As for model parameters accuracy I think there are two answers:

  • First and most natural: compute their standard error (or anything of the kind) between models learned on different datasets. This will give you an idea of the biological + experimental + sampling noise altogether.
  • More specifically for sampling noise: similarly to the mutual information mentioned in issue #25, one could imagine computing the Fisher information based on the scenario counter. I had initially sketched some code to perform this directly as a counter; sadly I never had time to finish it, and it is not a priority...

Using -set_custom_model to the final_parms.txt and final_marginals.txt for -evaluate?

Yes this is what you should do.

About pygor

Does pygor provide a utility to compute single scenario probability?

The short answer is no; however, I agree this could be a useful function.

About the inference_logs.txt file

The sequence likelihood is the sum of putative scenario probabilities weighted by the error model. This quantity is equal to the sequence Pgen if and only if there are no errors on the sequences and the error rate is set (or inferred) to 0. Pgen is ill-defined when these two conditions are not met, and you should use the [Pgen estimator](https://qmarcou.github.io/IGoR/#generation-probability) in that case. See the IGoR paper supplementary information for a full discussion of the matter.
The same comment applies to your second question: the seq_best_scenario column gives the likelihood of the best scenario (i.e. its generation probability weighted by the error model).

About the coverage command

About the coverage command syntax

Yes, VD means V and D, and so on.

About the error on the D gene coverage

It is not supported yet; see the piece of code below. It is surprising that this error was not thrown upon using the command.

	if(count_on_d){
		throw("/!\\ D coverage and errors counters not implemented yet! /!\\ ");
	}

@Yyx2626
Author

Yyx2626 commented Dec 19, 2018

Dear Quentin,

Thank you very much for the response.

I successfully ran IGoR on mouse IgH last month, with some tricks to bypass a few issues, and the Pgen results looked quite good and reasonable.

One issue (on computational performance) is the memory usage when running IGoR -align or -infer. I run IGoR on my MacBook with a 32GB physical memory limit and 12 CPU cores, and the total number of input reads is about 3M (3 million). When running IGoR -align, I found that V alignment could use up to 12 CPU cores and more memory than the physical limit, since macOS can apparently move memory exceeding the limit to disk and label it as "compressed memory". However, D alignment could only use 1~3 CPU cores and asked for more and more memory as more reads were processed. When the memory usage exceeds the physical limit, progress slows down dramatically. Since the alignment should be independent for each input read, I split the input into several parts to lower the memory usage and thus speed up the IGoR -align step.

Similarly, when running IGoR -infer on the whole 3M input reads, IGoR would ask for much more memory than the physical limit and get stuck before showing the progress bar. I am not sure why the EM algorithm, which traverses each input sequence iteratively, requests more memory for more input reads; it looks to have space complexity linear in the input sample size. To run IGoR -infer through, I had to split the 3M input reads into 15 parts and manually run IGoR -infer on each part (0.2M reads) sequentially (using the final model of the previous part as the initial model of the next part, and iterating several times over all 15 parts). I finally chose four models after several iterations and found the Pgen predicted by IGoR -evaluate -output stable in magnitude and in line with our expectations. Do you think this splitting trick is applicable and acceptable?
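Concretely, the splitting was along these lines (file handling simplified; the in-memory read list and batch size are just an illustration, real runs read/write files and then chain igor -infer over the parts):

```python
# Simplified sketch of the splitting trick described above. In practice each
# batch is written to its own file, and `igor -infer` is run on each part,
# seeded with the previous part's final model.
def split_into_batches(reads, batch_size):
    """Split a list of reads into consecutive batches of at most batch_size."""
    return [reads[i:i + batch_size] for i in range(0, len(reads), batch_size)]

reads = [f"read_{i}" for i in range(3_000_000)]
batches = split_into_batches(reads, 200_000)
print(len(batches), len(batches[0]))  # -> 15 200000
```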

Another issue is about using only the CDR3 region as input. I tested it with the --ntCDR3 option in the IGoR -align step. However, the alignment results may often miss the true answer, which can be recovered by using --thresh 0 instead of --ntCDR3, although that is slower. On the other hand, for some input sequences, even if the alignment files contained the true V, D, J genes, IGoR -evaluate -output might not find any (best) scenario and leave Pgen=NaN. I have not figured out the reason or a solution for this problem.

By the way, is there any way to output P_read as well as Pgen for each input sequence? And output P_recomb and P_error for each (best) scenario?

Thank you very much for your kind attention.

Best wishes,
Adam Yongxin Ye

@qmarcou
Owner

qmarcou commented Dec 24, 2018

Dear Adam,
I have finished editing my previous response in order to answer the rest of your questions.
Concerning this new batch of questions:

> One issue (on computational performance) is the memory usage when running IGoR -align or -infer. [...] I split the input into several parts, to lower the memory usage and thus speed up the IGoR -align step.

I'm not sure why the V alignment blew up your memory, this is rather weird; I'll have a check and keep you posted. About D alignments: I think this is simply due to the fact that D alignment is very fast, so the different threads must compete for a chance to write alignments to the hard drive for each sequence. If your hard drive is not very fast, this is most likely the limiting step, and the threads spend a lot of time waiting for disk writing. Again, I'm not sure why the memory usage is blowing up; I will have a check.

> Similarly, when running IGoR -infer on the whole 3M input reads, IGoR would ask for much more memory than the physical limit and stuck before showing the progress bar. [...] Do you think this splitting trick is applicable and acceptable?

Yes, this is indeed an issue. IGoR requests more memory for more input reads simply because it loads (and filters) all alignments before starting EM and stores them in memory. To decrease this memory usage you can tighten your alignment filter thresholds (this will also increase your computation speed, though possibly at the expense of some accuracy in the algorithm).
In my experience, around 100k sequences is generally more than enough to learn a generative model; 3M is a bit of an overkill (although you are right that in theory you should use all available sequences). Yes, if you learn the model on 15 different batches independently and get the same results, this is rather reassuring. (As a side note: remember that in order to get a proper recombination model you should first filter out non-productive sequences to avoid the selection blur.)

> Another issue is about using only CDR3 region as input. [...] IGoR -evaluate -output might not find any (best) scenarios and leave Pgen=NaN.

Ah yes, this had already been brought to my attention in issue #7. I thought I had fixed this problem; maybe I overlooked something. I will have a check.
About your second problem, you should have a look at issue #28.

> By the way, is there any way to output P_read as well as Pgen for each input sequence? And output P_recomb and P_error for each (best) scenario?

P_read is the sequence likelihood (in the inference_logs.txt file); again, Pgen is ill-defined for erroneous sequences and you can use the Pgen estimator.
I cannot check right now, but I think P_recomb and P_error are given by the best scenario counter.

Best wishes,
Quentin

Repository owner deleted a comment from decenwang Apr 9, 2019