feat: improve lib type relationship inference for SRA samples #160
Comments
Thanks a lot @balajtimate - nice analysis. I agree with lowering the cutoff accordingly. And of course I agree with addressing the main issue as you suggest (as we had already discussed previously).
Wasn't another conclusion that we can never confidently say whether a library is single? A single-ended library will always be indistinguishable from the first or second mate of an SRA library that retained the read IDs but cut off the mate info (and where the mate is missing). Perhaps the whole category of reporting type results individually for each sample doesn't make too much sense. If there's only one sample, then you can always go ahead and analyze that sample as if it were a single-ended library, even if it's just one of two mate files. And if two samples are given, all that really matters is the mate relationship. It made more sense in a world where the mate info was nicely given in the seq ID :)
Anyway, I'm not proposing to change all of this now. Let's stick to what you suggest. It will give us what we need - and we can still think about refactoring/simplifying the type inference some other time (or not).
Yes, that's correct. 😕 As a last resort, we could also check the file name, as PE samples are named |
Is your feature request related to a problem? Please describe.
With #157, the library type relationship is determined by aligning the reads separately and comparing the ratio of concordant pairs to the number of aligned reads. This works great for SRA samples where the `seq_ids` don't conform to the Illumina seq_id formats. The problem is with samples where the `seq_ids` are in the appropriate format but don't contain any information regarding their mate, so HTSinfer treats them as `single`. In this case, it would be beneficial to double-check the relationship by alignment, to minimize the number of samples deemed "single" when two files were given as input.
Also, when the relationship is deemed `split_mates` based on the ratio of concordant reads, the library type of the individual samples still remains either `single` (in the above example) or `null` (for SRA samples where the seq_id is not in the correct format). In this case, I would say it's safe to assume that input 1 is the `first_mate` and input 2 is the `second_mate`. Since this is not confirmed from the seq_ids, additional categories could be added (`first_mate_assumed`, `second_mate_assumed`), as suggested, and the lib type updated after the alignment.
Describe the solution you'd like
Check the relationship by forcing the alignment when both inputs are deemed "single"; update lib type for individual samples based on alignment results.
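To make the proposal concrete, here is a minimal sketch of the decision logic in Python. It is not HTSinfer's actual code: `LibTypes`, `needs_alignment_check`, `concordance_ratio`, `update_from_alignment`, and the `CONCORDANCE_CUTOFF` value are all hypothetical names and numbers chosen for illustration, and treating `null` (here `None`) the same as `single` is an assumption about how the existing #157 behavior would be folded in.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff; the issue does not state the actual value.
CONCORDANCE_CUTOFF = 0.5


@dataclass
class LibTypes:
    """Per-file library types and the relationship between two inputs."""
    file_1: Optional[str]        # e.g. "single", "first_mate", or None (null)
    file_2: Optional[str]
    relationship: Optional[str]  # e.g. "split_mates" or None


def needs_alignment_check(types: LibTypes) -> bool:
    """True when the seq IDs left both inputs unresolved ("single" or null)."""
    unresolved = {None, "single"}
    return types.file_1 in unresolved and types.file_2 in unresolved


def concordance_ratio(concordant_reads: int, aligned_reads: int) -> float:
    """Ratio of concordant reads to aligned reads; 0.0 if nothing aligned."""
    return concordant_reads / aligned_reads if aligned_reads else 0.0


def update_from_alignment(
    types: LibTypes,
    concordant_reads: int,
    aligned_reads: int,
    cutoff: float = CONCORDANCE_CUTOFF,
) -> LibTypes:
    """Assign the *_assumed categories when alignment supports a mate pair."""
    if not needs_alignment_check(types):
        # Seq IDs already resolved the types; leave them untouched.
        return types
    if concordance_ratio(concordant_reads, aligned_reads) >= cutoff:
        # Mate order is assumed from input order, not confirmed by seq IDs.
        return LibTypes("first_mate_assumed", "second_mate_assumed", "split_mates")
    return types
```

For example, `update_from_alignment(LibTypes("single", "single", None), 80, 100)` would report the two inputs as `first_mate_assumed` and `second_mate_assumed` with relationship `split_mates`, while a low concordance ratio leaves the seq-ID-based result unchanged.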