Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Modularization of the methods for detecting junctions. #45

Merged
merged 2 commits into from
Jan 27, 2021

Conversation

Irallia
Copy link
Collaborator

@Irallia Irallia commented Jan 20, 2021

Resolves #40
Adds also some new tests for different method combinations.

[TEST] Add new tests for different method combinations.

Signed-off-by: Lydia Buntrock <lydia.buntrock@fu-berlin.de>
@Irallia Irallia self-assigned this Jan 20, 2021
Comment on lines +26 to +29
// seqan3::value_list_validator method_validator{"1", "cigar_string",
// "2", "split_read",
// "3", "read_pairs",
// "4", "read_depth"};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking, if the user wants to insert names instead of numbers, so we could offer both.

Copy link
Collaborator

@joshuak94 joshuak94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! We should figure out if it really makes sense to have all four of the options. Also, duplicate entries in the test cases definitely need to be solved...


// Options - Methods:
parser.add_option(args.methods, 'm', "method", "Choose the method to be used.",
seqan3::option_spec::ADVANCED, method_validator);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is, a user could only choose one of the four methods, no? E.g. you can't use -m 1, 2, 3, 4 or something. Maybe instead, it's better to have four flags instead of options? Then the user could do something like -c -s -d for cigar, split reads, and read depth.

Although to be completely honest, I can't really think of a reason why a user would not want to use the CIGAR string. For split reads, if the user has very short reads I guess they might think split reads aren't useful. Same for read depth if the sequencing was very shallow.

With regards to read pairs, I think this is something we should detect automatically (whether the sequencing was done with single end or paired end sequencing), so that a user doesn't accidentally turn on the read pairs option with single end sequencing data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with the current implementation, a user can choose four methods using something like -m 1 -m 2 -m 3 -m 4 because args.methods is a std::vector.

You have a good point, though, with the CIGAR strings and the read pairs. It might be good to spend some time thinking about efficient processing of the SAM/BAM files:

  • cigar and split reads collect evidence from each individual read.
  • read depth, in contrast, is an analysis performed on the depth profile along the genomes, not individual reads.
  • read pair should (like you say) only be used for paired-end data. But for this type of data it works on a per-read basis like cigar and split reads.

So we should maybe have a separate detection function for read depth because it's completely separate from the other methods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my change (the modularisation) is there to compare all methods with all methods.
But later, such a recognition from the SAM/BAM files would certainly be useful. I would create a new issue for this.
And then you could make some methods dependent on the input files.

*
* \details Detects junctions from the CIGAR strings and supplementary alignment tags of read alignment records.
* We sort out unmapped alignments, secondary alignments, duplicates and alignments with low mapping quality.
* Then, the CIGAR string of all remaining alignments is analyzed.
* For primary alignments, also the split read information is analyzed.
*/
void detect_junctions_in_alignment_file(const std::filesystem::path & alignment_file_path,
const std::filesystem::path & insertion_file_path);
const std::filesystem::path & insertion_file_path,
const std::vector<uint8_t> methods);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this function implements all four detection methods but can be parametrized by the methods parameter to only use a subset of methods. This makes a lot of sense because the alignment file has to be processed only once (in the best case) and different methods can be applied on the same read in one go.

However, this architecture is not really modularized, right? To be really modularized we would have four detect_junctionsfunctions that each implement one method. But then, the alignment file would be processed four times in the worst case. So I guess that's why you decided to choose the other option?!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. Maybe the term modularize is not really the right one.
The idea is, that we can combine different methods. And compare them. As we also want to compare run time, the complete decoupling would end in a higher run time if we combine two methods.
So I would leave it and would change the term modularisation with decoupling?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. But I think it's not really a decoupling either. How about "Enable selection of methods for detecting junctions" as new title for this pull request?

Signed-off-by: Lydia Buntrock <lydia.buntrock@fu-berlin.de>
Copy link
Collaborator

@joshuak94 joshuak94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I'm still more of a fan of using flags instead of numbered methods (I think -c -s -d looks more descriptive/intuitive than -m 1 -m 2 -m 4), but it's largely a stylistic choice so both are fine!

Copy link
Collaborator

@eldariont eldariont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good now :)

@Irallia Irallia merged commit f825e23 into seqan:master Jan 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Modularize methods for detecting junctions
3 participants