-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] Modularization of the methods for detecting junctions. #45
[FEATURE] Modularization of the methods for detecting junctions. #45
Conversation
[TEST] Add new tests for different method combinations. Signed-off-by: Lydia Buntrock <lydia.buntrock@fu-berlin.de>
// seqan3::value_list_validator method_validator{"1", "cigar_string", | ||
// "2", "split_read", | ||
// "3", "read_pairs", | ||
// "4", "read_depth"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking, if the user wants to insert names instead of numbers, so we could offer both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! We should figure out if it really makes sense to have all four of the options. Also, duplicate entries in the test cases definitely need to be solved...
|
||
// Options - Methods: | ||
parser.add_option(args.methods, 'm', "method", "Choose the method to be used.", | ||
seqan3::option_spec::ADVANCED, method_validator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this is, a user could only choose one of the four methods, no? E.g. you can't use -m 1, 2, 3, 4
or something. Maybe instead, it's better to have four flags instead of options? Then the user could do something like -c -s -d
for cigar, split reads, and read depth.
Although to be completely honest, I can't really think of a reason why a user would not want to use the CIGAR string. For split reads, if the user has very short reads I guess they might think split reads aren't useful. Same for read depth if the sequencing was very shallow.
With regards to read pairs, I think this is something we should detect automatically (whether the sequencing was done with single end or paired end sequencing), so that a user doesn't accidentally turn on the read pairs option with single end sequencing data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with the current implementation, a user can choose four methods using something like -m 1 -m 2 -m 3 -m 4
because args.methods is a std::vector
.
You have a good point, though, with the CIGAR strings and the read pairs. It might be good to spend some time thinking about efficient processing of the SAM/BAM files:
- cigar and split reads collect evidence from each individual read.
- read depth, in contrast, is an analysis performed on the depth profile along the genomes, not individual reads.
- read pair should (like you say) only be used for paired-end data. But for this type of data it works on a per-read basis like cigar and split reads.
So we should maybe have a separate detection function for read depth because it's completely separate from the other methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my change (the modularisation) is there to compare all methods with all methods.
But later, such a recognition from the SAM/BAM files would certainly be useful. I would create a new issue for this.
And then you could make some methods dependent on the input files.
* | ||
* \details Detects junctions from the CIGAR strings and supplementary alignment tags of read alignment records. | ||
* We sort out unmapped alignments, secondary alignments, duplicates and alignments with low mapping quality. | ||
* Then, the CIGAR string of all remaining alignments is analyzed. | ||
* For primary alignments, also the split read information is analyzed. | ||
*/ | ||
void detect_junctions_in_alignment_file(const std::filesystem::path & alignment_file_path, | ||
const std::filesystem::path & insertion_file_path); | ||
const std::filesystem::path & insertion_file_path, | ||
const std::vector<uint8_t> methods); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this function implements all four detection methods but can be parametrized by the methods
parameter to only use a subset of methods. This makes a lot of sense because the alignment file has to be processed only once (in the best case) and different methods can be applied on the same read in one go.
However, this architecture is not really modularized, right? To be really modularized we would have four detect_junctions
functions that each implement one method. But then, the alignment file would be processed four times in the worst case. So I guess that's why you decided to choose the other option?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct. Maybe the term modularize is not really the right one.
The idea is, that we can combine different methods. And compare them. As we also want to compare run time, the complete decoupling would end in a higher run time if we combine two methods.
So I would leave it and would change the term modularisation with decoupling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. But I think it's not really a decoupling either. How about "Enable selection of methods for detecting junctions" as new title for this pull request?
Signed-off-by: Lydia Buntrock <lydia.buntrock@fu-berlin.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I'm still more of a fan of using flags instead of numbered methods (I think -c -s -d
looks more descriptive/intuitive than -m 1 -m 2 -m 4
), but it's largely a stylistic choice so both are fine!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good now :)
Resolves #40
Adds also some new tests for different method combinations.