-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Alignment] Global MSA with free end-gaps for a subset of sequences #2674
Comments
Hi Robin, thank you for your excellent and helpful write-up of your use case. I'm unfortunately not too familiar with MSA and my answer might lack understanding. Furthermore, two of our key-person who understand MSA extensively are not available right now (one left the group and the other one is on parental leave). Maybe the most important information right at the start: SeqAn3 does not have any MSA functionality, yet. So you might not need to upgrade, yet. If you think that SeqAn2 is deprecated, and you feel the urge to upgrade. That isn't the case, it is in a "feature"-freeze mode, and it is perfectly valid to use SeqAn2. We have some user feature requests for MSA (see #1763), a pending and stale pull request (#1990) that uses the seqan2 implementation as the backend and puts that into a seqan3-style frontend, and a general design issue (seqan/product_backlog#104).
As I stated, I'm not an expert and haven't used the MSA part of SeqAn2. From what I caught on in our meetings, I think MSA had a concept of "Libraries". If I remember correctly, those represented the Alignments, and from what I heard, you could individually have different alignment methods. (Just from Memory, I'm not sure if I misremembered something). For example one alignment could be computed by using the edit distance (Levenshtein distance), one could just compute the Hamming distance and another one the longest common substring. And those alignments could be put into constructing the MSA. If my memory is right, I think you could have different alignment computations (each with a different "free-end" configuration) depending on whether your sequence intersect your subgraph or is completely contained.
You mean ACGT and ACGTA should align to
? I think via a scoring scheme that will be hard. You could penalise gaps (gap-opening for example) more than mismatches altogether, but I think you could always think of sequences that still puts a gap at the end. You could just align a sub-sequence that excludes your first and last character and add the score of that first and last character after the computation. I need to think about this more, maybe there is a more elegant solution to this. What beneficial property would such a forced first and last character match have? Or do you mean that the complete first and second sequence must be matched (aka global alignment)? |
Thank you for your thoughtful response, @marehr! I just got back from some important travels, and I was glad to see your thoughts. They are helpful. That's good to hear about SeqAn2 - I hadn't realize that. I will look into libraries for seqan2, and see if I can use them to input already-complete alignments into the MSA, the way you suggest. That would work well for my use-case. With regards to your last question, the problem may actually be less odd than that. As I think you've understood quite well, there are two types of alignments I'd like to have:
This way, any old sequence-backed path through the subgraph is conserved in the new subgraph. Note that the first character and last character of the strings will always be identical between these strings, because in the graph they were already collapsed into a single, shared node at the beginning/end of my subgraph. I've tried to guarantee alignment between these first & last characters by using special characters to represent the first and last characters of the string:
The idea is that the aligner will match up the Y's and X's because there isn't any better place to put them. But I still risk having the aligner put gaps on those ends in some edge cases. So, my first follow-up question:
An invalid alignment solution, where the final T in my chosen string is associated with a gap in the other string:
In the graph, that would make the chosen string's T an "end node," which wouldn't reconnect to the rest of the graph. A dead end in the middle of the subgraph. If the subgraph was in the middle of a chromosome, that would indicate that some individuals had a chromosome that was truncated in the middle, which wasn't the case. A couple valid solutions, where the chosen string's final T is aligned to another T:
Essentially, I want to give one string no end-gap penalty on one side, which allows it to finish alignment long before the rest of the string is finished. But I want to prevent final alignments to a gap. So, my second question: Thanks for all your help. I really appreciate it. |
Sorry for the late response, I'm terrible at keeping conversations alive 🙈 I sketched for (1) a solution in seqan3, the basic idea is to align the subsequences without the first and last char and extend the aligned sequences by 1 char on both sides afterwards. Unfortunately, I need to copy the complete gap information of the alignment, I think we could "optimise" such a use case to reduce the number of copy operations. I think I saw the concept of soft-clips (which your use case reminds me of) in seqan2, maybe we can do the same thing more efficient in seqan2. #include <seqan3/std/span>
#include <seqan3/alignment/configuration/align_config_edit.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
template <seqan3::alphabet alphabet_t, seqan3::aligned_sequence aligned_subsequence_t>
auto softclip_aligned_sequence(std::span<alphabet_t> sequence,
size_t softclip_begin,
size_t softclip_end,
aligned_subsequence_t aligned_subsequence)
{
seqan3::gap_decorator<std::span<alphabet_t>> aligned_sequence{sequence};
auto aligned_sequence_it = aligned_sequence.begin();
// skip first softclip_begin many chars; as it is fixed to be not a gap
aligned_sequence_it = std::ranges::next(aligned_sequence_it, softclip_begin);
auto maybe_insert_gaps = [&aligned_sequence](auto aligned_sequence_it, size_t gaps)
{
if (gaps == 0)
return aligned_sequence_it;
aligned_sequence_it = seqan3::insert_gap(aligned_sequence, aligned_sequence_it, gaps);
// skip last inserted gap
return ++aligned_sequence_it;
};
size_t gaps = 0;
for (auto && chr: aligned_subsequence)
{
if (chr == seqan3::gap{}) // gap char
{
// collect "runs" of gaps
++gaps;
++softclip_end;
continue;
}
// non-gap char
aligned_sequence_it = maybe_insert_gaps(aligned_sequence_it, gaps);
assert(chr == *aligned_sequence_it); // should have same char
++aligned_sequence_it; // advance read non-gap char
gaps = 0; // reset "run" of gaps
}
aligned_sequence_it = maybe_insert_gaps(aligned_sequence_it, gaps);
assert(std::ranges::equal(seqan3::views::slice(aligned_sequence, softclip_begin, softclip_end),
aligned_subsequence));
return aligned_sequence;
}
int main()
{
using namespace seqan3::literals;
seqan3::dna4_vector sequence1{"AGTAGACTACG"_dna4};
seqan3::dna4_vector sequence2{"ACGTACGACACG"_dna4};
seqan3::debug_stream << "sequence1: " << sequence1 << '\n';
seqan3::debug_stream << "sequence2: " << sequence2 << '\n';
size_t softclip1_begin = 1;
size_t softclip2_begin = 1;
size_t softclip1_end = sequence1.size() - 1;
size_t softclip2_end = sequence2.size() - 1;
// create sub-sequences
std::span subsequence1 = std::span{sequence1}.subspan(softclip1_begin, softclip1_end - softclip1_begin);
std::span subsequence2 = std::span{sequence2}.subspan(softclip2_begin, softclip2_end - softclip2_begin);
seqan3::debug_stream << "subsequence1: " << subsequence1 << '\n';
seqan3::debug_stream << "subsequence2: " << subsequence2 << '\n';
auto configuration = seqan3::align_cfg::method_global{} | seqan3::align_cfg::edit_scheme;
for (auto && result: seqan3::align_pairwise(std::tie(subsequence1, subsequence2), configuration))
{
// use indices `result.sequence1/2_id()`` to get original sequence // skipped here for simplicity
std::span unaligned_sequence1{sequence1};
std::span unaligned_sequence2{sequence2};
auto [aligned_subsequence1, aligned_subsequence2] = result.alignment();
seqan3::debug_stream << "aligned_subsequence1: " << aligned_subsequence1 << std::endl;
seqan3::debug_stream << "aligned_subsequence2: " << aligned_subsequence2 << std::endl;
// rebuild aligned_sequences
seqan3::gap_decorator aligned_sequence1
= softclip_aligned_sequence(unaligned_sequence1, softclip1_begin, softclip1_end, aligned_subsequence1);
seqan3::gap_decorator aligned_sequence2
= softclip_aligned_sequence(unaligned_sequence2, softclip2_begin, softclip2_end, aligned_subsequence2);
seqan3::debug_stream << "aligned_sequence1: " << aligned_sequence1 << std::endl;
seqan3::debug_stream << "aligned_sequence2: " << aligned_sequence2 << std::endl;
}
} outputs:
I'll try your second approach of adding a custom scoring scheme to heavily penalise start and end chars. |
Hi Marehr! I'm especially curious about your thoughts with the second part of my question: If you have any insights for that, I'd greatly appreciate it. |
For (2) I think this is not possible with our current standard alignment algorithm (seqan2 or seqan3). We can penalise char-specific mismatches, but we can't penalise char-specific gaps. https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm defines the recursion as:
Note that d is fixed and completely independent of the current char. That means if we want to have a "standard" alignment (in the sense of cost configuration), we can't differentiate between a special start / end char and a normal middle char. The only way to force a match at the start / end would be to configure a rather high gap opening / extension cost. (Which in turn has no 100% guarantee!) The recursion above is how we implement it right now for the linear gap scheme. (We actually implement the affine gap scheme, a bit more generalised version, but that does not change the underlying argument). Funnily, the German wiki has a variant for the recursion https://de.wikipedia.org/wiki/Smith-Waterman-Algorithmus#Matrix-Rekurrenzen (I'm not sure why this is in the Smith-Waterman article, at first glance this seems to be different from the usual recursion):
This is interesting, because this would allow to associate a cost between a character and a gap symbol. In your case we could associate "infinite" cost between the special start character and all other chars (including the in/del-gap) and zero cost with itself. I don't know if this more formal way is easier to understand: A new alphabet Σ' = Σ x {S, M, E} (each char gets a Start, Middle and End position) A new score matrix for a,b in Σ ∪ {'-'}
This effectively would allow associating different cost on segments of the alignment. I.e. forcing the start char to match, computing a normal alignment in the middle and forcing the end to match; all in one alignment computation. EDIT:// It is actually a solution to your (1) problem and not exactly a solution for your (2) problem, but I have a feeling that this might lead to a solution. |
No, this should definitely work! The second example is a semi-global alignment with the twist that an "T" end char in the first sequence has zero cost to any other "T" middle/end char in the second sequence and ∞ otherwise. |
That would work, indeed! That would be great! |
Yeah, this is not implemented, yet. But, I think as this functionality is "just" ™️ a swap of the gap extension definition within the recursion formula, this should be definitely prototype-able. I guess seqan3/include/seqan3/alignment/pairwise/detail/policy_affine_gap_recursion.hpp Lines 119 to 140 in b5b3526
would be a good starting point. (maybe passing the extension score sequence1/sequence2 in, as we do it for the sequence_score?) |
Cool! I haven't quite reached the point where this is the most urgent next step for my project. But I oughta give it a shot soon. Thanks for the all the help. |
Platform
Question
Is there a way to perform a global MSA that gives some sequences free end-gaps? I have a use case that requires that most of the sequences to be aligned globally, but permits a few sequences to cease alignment in the middle of other sequences.
My Use Case:
I'm building a tool for vg that uses Seqan's MSA to simplify subgraphs of a genome variation graph. To do this, I:
Depending on how the graph is constructed, this approach can simplify the way the graph represents regions with complex genetic variation.
The problem is, some of the subgraphs I'm simplifying have sequences that either start or end in the middle of the complex subgraph. I don't want to penalize a gap for those sequences on their "free end," because they're not supposed to stretch across the subgraph from beginning end.
If, for example, I have a sequence within the subgraph that started outside the subgraph, but ends inside the subgraph, I'd want to penalize an end-gap for the left side of the sequence, but not the right. Is there a way to do that, while still performing
a global MSA for the rest of the sequences in the subgraph?
~
A second sub-question:
Is there a way to guarantee that a given sequence will align both its first and last character? For my particular use-case, it wouldn't make sense to allow the string to end on a gap-alignment, since that would mean that the entire graph could plausibly end at the string's last character, which isn't the case.
Hopefully I've described my question, sub-question, and use case in a way that makes sense. Please let me know if there's anything I could further clarify.
Thanks,
The text was updated successfully, but these errors were encountered: