Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Modularization of the methods for detecting junctions. #45

Merged
merged 2 commits into from
Jan 27, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion include/detect_breakends/junction_detection.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -12,11 +12,16 @@
*
* \param alignment_file_path input file - path to the sam/bam file
* \param insertion_file_path output file - path for the fasta file
* \param methods - list of methods for detecting junctions (1: cigar_string,
* 2: split_read,
* 3: read_pairs,
* 4: read_depth)
*
* \details Detects junctions from the CIGAR strings and supplementary alignment tags of read alignment records.
* We sort out unmapped alignments, secondary alignments, duplicates and alignments with low mapping quality.
* Then, the CIGAR string of all remaining alignments is analyzed.
* For primary alignments, also the split read information is analyzed.
*/
void detect_junctions_in_alignment_file(const std::filesystem::path & alignment_file_path,
const std::filesystem::path & insertion_file_path);
const std::filesystem::path & insertion_file_path,
const std::vector<uint8_t> methods);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this function implements all four detection methods but can be parametrized by the methods parameter to only use a subset of methods. This makes a lot of sense because the alignment file has to be processed only once (in the best case) and different methods can be applied on the same read in one go.

However, this architecture is not really modularized, right? To be really modularized we would have four detect_junctionsfunctions that each implement one method. But then, the alignment file would be processed four times in the worst case. So I guess that's why you decided to choose the other option?!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. Maybe the term modularize is not really the right one.
The idea is, that we can combine different methods. And compare them. As we also want to compare run time, the complete decoupling would end in a higher run time if we combine two methods.
So I would leave it and would change the term modularisation with decoupling?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. But I think it's not really a decoupling either. How about "Enable selection of methods for detecting junctions" as new title for this pull request?

32 changes: 29 additions & 3 deletions src/detect_breakends.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,34 +6,60 @@ struct cmd_arguments
{
std::filesystem::path alignment_file_path{};
std::filesystem::path insertion_file_path{};
std::vector<uint8_t> methods{1, 2, 3, 4}; // default is using all methods
};

void initialize_argument_parser(seqan3::argument_parser & parser, cmd_arguments & args)
{
parser.info.author = "David Heller";
parser.info.author = "David Heller & Lydia Buntrock";
parser.info.app_name = "iGenVar";
parser.info.man_page_title = "Short and long Read SV Caller";
parser.info.short_description = "Detect junctions in a read alignment file";
parser.info.version = "0.0.1";
parser.info.date = "19-01-2021"; // last update
parser.info.email = "lydia.buntrock@fu-berlin.de";
parser.info.long_copyright = "long_copyright";
parser.info.short_copyright = "short_copyright";
parser.info.url = "https://github.com/seqan/iGenVar/";

// Validatiors:
// seqan3::value_list_validator method_validator{"1", "cigar_string",
// "2", "split_read",
// "3", "read_pairs",
// "4", "read_depth"};
Comment on lines +26 to +29
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking, if the user wants to insert names instead of numbers, so we could offer both.

seqan3::arithmetic_range_validator method_validator{1, 4};

// Options - Input / Output:
parser.add_positional_option(args.alignment_file_path, "Input read alignments in SAM or BAM format.",
seqan3::input_file_validator{{"sam", "bam"}} );
parser.add_positional_option(args.insertion_file_path, "Output file for insertion alleles",
seqan3::output_file_validator{{"fa", "fasta"}} );

// Options - Methods:
parser.add_option(args.methods, 'm', "method", "Choose the method to be used.",
seqan3::option_spec::ADVANCED, method_validator);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is, a user could only choose one of the four methods, no? E.g. you can't use -m 1, 2, 3, 4 or something. Maybe instead, it's better to have four flags instead of options? Then the user could do something like -c -s -d for cigar, split reads, and read depth.

Although to be completely honest, I can't really think of a reason why a user would not want to use the CIGAR string. For split reads, if the user has very short reads I guess they might think split reads aren't useful. Same for read depth if the sequencing was very shallow.

With regards to read pairs, I think this is something we should detect automatically (whether the sequencing was done with single end or paired end sequencing), so that a user doesn't accidentally turn on the read pairs option with single end sequencing data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with the current implementation, a user can choose four methods using something like -m 1 -m 2 -m 3 -m 4 because args.methods is a std::vector.

You have a good point, though, with the CIGAR strings and the read pairs. It might be good to spend some time thinking about efficient processing of the SAM/BAM files:

  • cigar and split reads collect evidence from each individual read.
  • read depth, in contrast, is an analysis performed on the depth profile along the genomes, not individual reads.
  • read pair should (like you say) only be used for paired-end data. But for this type of data it works on a per-read basis like cigar and split reads.

So we should maybe have a separate detection function for read depth because it's completely separate from the other methods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my change (the modularisation) is there to compare all methods with all methods.
But later, such a recognition from the SAM/BAM files would certainly be useful. I would create a new issue for this.
And then you could make some methods dependent on the input files.

}

int main(int argc, char ** argv)
{
seqan3::argument_parser myparser{"detectJunctions", argc, argv}; // initialise myparser
cmd_arguments args{};
initialize_argument_parser(myparser, args);

// Parse the given arguments and catch possible errors.
try
{
myparser.parse(); // trigger command line parsing
}
catch (seqan3::argument_parser_error const & ext) // catch user errors
{
seqan3::debug_stream << "[Error] " << ext.what() << "\n"; // customise your error message
seqan3::debug_stream << "[Error] " << ext.what() << '\n'; // customise your error message
return -1;
}
detect_junctions_in_alignment_file(args.alignment_file_path, args.insertion_file_path);

detect_junctions_in_alignment_file(args.alignment_file_path,
args.insertion_file_path,
args.methods);

return 0;
}
54 changes: 40 additions & 14 deletions src/detect_breakends/junction_detection.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -220,6 +220,10 @@ void analyze_cigar(std::string chromosome,
* \cond
* \param alignment_file_path input file - path to the sam/bam file
* \param insertion_file_path output file - path for the fasta file
* \param methods - list of methods for detecting junctions (1: cigar_string,
* 2: split_read,
* 3: read_pairs,
* 4: read_depth)
* \endcond
*
* \details Detects junctions from the CIGAR strings and supplementary alignment tags of read alignment records.
Expand All @@ -232,7 +236,8 @@ void analyze_cigar(std::string chromosome,
* More details on this in the associated function `retrieve_aligned_segments()`.
*/
void detect_junctions_in_alignment_file(const std::filesystem::path & alignment_file_path,
const std::filesystem::path & insertion_file_path)
const std::filesystem::path & insertion_file_path,
const std::vector<uint8_t> methods)
{
// Open input alignment file
using my_fields = seqan3::fields<seqan3::field::id,
Expand Down Expand Up @@ -274,20 +279,41 @@ void detect_junctions_in_alignment_file(const std::filesystem::path & alignment_
}
else
{
// Detect junctions from CIGAR string
analyze_cigar(query_name, ref_name, pos, cigar, seq, junctions, insertion_alleles, 30, insertion_file);
// Detect junctions from SA tag (primary alignments only)
if (!hasFlagSupplementary(flag))
{
std::string sa_tag = tags.get<"SA"_tag>();
if (!sa_tag.empty())
for (uint8_t method : methods) {
switch (method)
{
std::vector<aligned_segment> aligned_segments{};
auto strand = (hasFlagReverseComplement(flag) ? strand::reverse : strand::forward);
aligned_segments.push_back(aligned_segment{strand, ref_name, pos, mapq, cigar});
retrieve_aligned_segments(sa_tag, aligned_segments);
std::sort(aligned_segments.begin(), aligned_segments.end());
analyze_aligned_segments(aligned_segments, junctions, query_name);
case 1: // Detect junctions from CIGAR string
analyze_cigar(query_name,
ref_name,
pos,
cigar,
seq,
junctions,
insertion_alleles,
30,
insertion_file);
break;
case 2: // Detect junctions from split read evidence (SA tag, primary alignments only)
if (!hasFlagSupplementary(flag))
{
std::string sa_tag = tags.get<"SA"_tag>();
if (!sa_tag.empty())
{
std::vector<aligned_segment> aligned_segments{};
auto strand = (hasFlagReverseComplement(flag) ? strand::reverse : strand::forward);
aligned_segments.push_back(aligned_segment{strand, ref_name, pos, mapq, cigar});
retrieve_aligned_segments(sa_tag, aligned_segments);
std::sort(aligned_segments.begin(), aligned_segments.end());
analyze_aligned_segments(aligned_segments, junctions, query_name);
}
}
break;
case 3: // Detect junctions from read pair evidence
break;
// continue;
case 4: // Detect junctions from read depth evidence
break;
// continue;
}
}

Expand Down
72 changes: 71 additions & 1 deletion test/api/junction_detection_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,77 @@ TEST(junction_detection, fasta_out_not_empty)

testing::internal::CaptureStdout();
detect_junctions_in_alignment_file(DATADIR"simulated.minimap2.hg19.coordsorted_cutoff.sam",
tmp_dir/"detect_breakends_out_short.fasta");
tmp_dir/"detect_breakends_out_short.fasta", {1, 2, 3, 4});

std::string std_cout = testing::internal::GetCapturedStdout();
seqan3::debug_stream << "std_out:\n" << std_cout << '\n';
EXPECT_EQ(expected, std_cout);

// cleanup
std::filesystem::remove(tmp_dir/"detect_breakends_out_short.fasta"); // remove output
}

TEST(junction_detection, method_1_only)
{
std::string expected{
"Reference\tm2257/8161/CCS\t41972616\tForward\tRead \t0\t2294\tForward\tchr21\n"
"Reference\tm2257/8161/CCS\t41972616\tReverse\tRead \t0\t3975\tReverse\tchr21\n"
};

std::filesystem::path tmp_dir = std::filesystem::temp_directory_path(); // get the temp directory
std::filesystem::remove(tmp_dir/"detect_breakends_out_short.fasta"); // remove old output if existent

testing::internal::CaptureStdout();
detect_junctions_in_alignment_file(DATADIR"simulated.minimap2.hg19.coordsorted_cutoff.sam",
tmp_dir/"detect_breakends_out_short.fasta", {1});

std::string std_cout = testing::internal::GetCapturedStdout();
seqan3::debug_stream << "std_out:\n" << std_cout << '\n';
EXPECT_EQ(expected, std_cout);

// cleanup
std::filesystem::remove(tmp_dir/"detect_breakends_out_short.fasta"); // remove output
}

TEST(junction_detection, method_2_only)
{
std::string expected{
"Reference\tchr22\t17458417\tForward\tReference\tchr21\t41972615\tForward\tm41327/11677/CCS\n"
"Reference\tchr22\t17458418\tForward\tReference\tchr21\t41972616\tForward\tm21263/13017/CCS\n"
"Reference\tchr22\t17458418\tForward\tReference\tchr21\t41972616\tForward\tm38637/7161/CCS\n"
};

std::filesystem::path tmp_dir = std::filesystem::temp_directory_path(); // get the temp directory
std::filesystem::remove(tmp_dir/"detect_breakends_out_short.fasta"); // remove old output if existent

testing::internal::CaptureStdout();
detect_junctions_in_alignment_file(DATADIR"simulated.minimap2.hg19.coordsorted_cutoff.sam",
tmp_dir/"detect_breakends_out_short.fasta", {2});

std::string std_cout = testing::internal::GetCapturedStdout();
seqan3::debug_stream << "std_out:\n" << std_cout << '\n';
EXPECT_EQ(expected, std_cout);

// cleanup
std::filesystem::remove(tmp_dir/"detect_breakends_out_short.fasta"); // remove output
}

TEST(junction_detection, method_1_and_2)
{
std::string expected{
"Reference\tchr22\t17458417\tForward\tReference\tchr21\t41972615\tForward\tm41327/11677/CCS\n"
"Reference\tchr22\t17458418\tForward\tReference\tchr21\t41972616\tForward\tm21263/13017/CCS\n"
"Reference\tchr22\t17458418\tForward\tReference\tchr21\t41972616\tForward\tm38637/7161/CCS\n"
"Reference\tm2257/8161/CCS\t41972616\tForward\tRead \t0\t2294\tForward\tchr21\n"
"Reference\tm2257/8161/CCS\t41972616\tReverse\tRead \t0\t3975\tReverse\tchr21\n"
};

std::filesystem::path tmp_dir = std::filesystem::temp_directory_path(); // get the temp directory
std::filesystem::remove(tmp_dir/"detect_breakends_out_short.fasta"); // remove old output if existent

testing::internal::CaptureStdout();
detect_junctions_in_alignment_file(DATADIR"simulated.minimap2.hg19.coordsorted_cutoff.sam",
tmp_dir/"detect_breakends_out_short.fasta", {1, 2});

std::string std_cout = testing::internal::GetCapturedStdout();
seqan3::debug_stream << "std_out:\n" << std_cout << '\n';
Expand Down
4 changes: 2 additions & 2 deletions test/cli/detect_breakends_cli_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,8 @@ TEST_F(cli_test, no_options)
cli_test_result result = execute_app("detect_breakends");
std::string expected
{
"detectJunctions - Detect junctions in a read alignment file\n"
"===========================================================\n"
"iGenVar - Detect junctions in a read alignment file\n"
"===================================================\n"
" Try -h or --help for more information.\n"
};
EXPECT_EQ(result.exit_code, 0);
Expand Down