Feature Request: automatic quality score conversion #18
Comments
Thanks for the suggestion and detailed description. I'll implement it when I'm free (2-3 months later, sorry). |
@shenwei356 Glad to hear that—thanks so much! |
Quality score conversion: |
Hi @cviner, I've added a command (alpha version below). By default, it guesses the quality encoding according to the leading 1000 ( Example:
Where |
@shenwei356 This is great—I will test it soon! Thanks for adding this! For the auto-detection, is it able to detect Illumina 1.5 as going up to ASCII 105 ( |
@shenwei356 I am not sure how to test this, using the provided binary.
returns:
Perhaps you meant to enclose |
Oh no, I uploaded the wrong binaries. Fixed. The special cases of Illumina 1.5 and 1.8 will be taken into consideration soon. |
As you mentioned, 1.5 format can be uniquely identified if many updated alpha version: For an Illumina 1.5 test dataset:
|
@shenwei356 That is a good point. It may indeed do so, which is why I think it is best to output a warning when the heuristic is used and have an option to disable it. I think that overall, using it is better than not including it. I do not think that my criteria are particularly robust and would welcome a better heuristic! |
How about this, using the criteria: first, the possible quality encodings of each of the leading N (default: 1000) records are computed. If the proportion of records guessed as Illumina 1.5 is greater than some threshold, we call the data Illumina 1.5. If not, we follow the regular workflow, i.e., computing the intersection of the N guesses. |
Yes—I think that is a very good idea. |
Although that should still somehow account for an enrichment of |
Alright, I'll implement it tomorrow. Good morning for you and good night for me. |
@cviner I set the default fraction of guessed Illumina 1.5 in the leading N as 0.1 ( |
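For readers following along, the heuristic discussed above could be sketched roughly as below. This is not SeqKit's actual implementation: the function names are hypothetical, the ASCII ranges are the ones commonly cited for each encoding, and the per-record rule for "guessed Illumina 1.5" (compatible with Illumina 1.5 but with no Phred+33 encoding) is an assumption made only for illustration.

```python
# Commonly cited ASCII ranges (inclusive) per encoding; assumed, not taken from SeqKit.
ENCODING_RANGES = {
    "sanger": (33, 73),         # Phred+33, scores 0..40
    "solexa": (59, 104),        # Solexa+64, scores -5..40
    "illumina-1.3": (64, 104),  # Phred+64, scores 0..40
    "illumina-1.5": (66, 105),  # Phred+64, scores 2..41
    "illumina-1.8": (33, 74),   # Phred+33, scores 0..41
}
PHRED33 = {"sanger", "illumina-1.8"}


def candidate_encodings(qual):
    """Return every encoding whose ASCII range contains all characters of qual."""
    lo, hi = min(map(ord, qual)), max(map(ord, qual))
    return {name for name, (a, b) in ENCODING_RANGES.items() if a <= lo and hi <= b}


def looks_like_illumina_15(candidates):
    """Hypothetical per-record rule: fits Illumina 1.5 and no Phred+33 encoding."""
    return "illumina-1.5" in candidates and not (candidates & PHRED33)


def guess_encoding(quals, n=1000, threshold=0.1):
    """Guess the encoding(s) from the leading n quality strings."""
    guesses = [candidate_encodings(q) for q in quals[:n] if q]
    if not guesses:
        return set()
    # Fraction of leading records that look like Illumina 1.5.
    frac = sum(looks_like_illumina_15(g) for g in guesses) / len(guesses)
    if frac > threshold:
        return {"illumina-1.5"}
    # Regular workflow: intersection of the per-record guesses.
    return set.intersection(*guesses)
```

With this sketch, data that fits both Phred+33 encodings comes back as a two-element set, so a caller still has to pick a default between them.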
@shenwei356 Thanks! I have tested the latest binary on a few Illumina 1.5 datasets and one Sanger. Everything appears to work well! Thanks for implementing this! It would be nice to accept actual format names like One issue is that the Sanger dataset is converted from Illumina 1.8, when it correctly guesses both options. I would suggest that if those two options are present, Sanger should be the default. Finally, if the source and target formats match, conversion should be aborted with a message. |
Why should Sanger be the default if the data may be either Sanger or Illumina 1.8? Most commands of seqkit follow the UNIX philosophy: they read from stdin or a file and write to stdout or a file. So it still writes output when the input and output formats match. Anyway, it's very fast, and the log is written to stderr. |
Currently it defaults to Illumina 1.8. The default should be Sanger because Illumina is now using that encoding, it is the default in many tools (e.g. Cutadapt), and all NCBI SRA / EBI ENA FASTQs are re-processed to the Sanger format. It can therefore be regarded as the current de facto standard. As for always processing it, that is fair enough, but an option to not process it and return an informative message if the source and target encodings are the same would be nice. Thanks again for all your work on this! |
OK, I'll work on it ASAP. |
Sorry, I won't have access to a computer for 2 months. |
@shenwei356 No problem! My current approach is sufficient for now. Thanks again for all your work on this and I look forward to using it after it is done. |
Updated as you requested. Examples:
|
@shenwei356 Thanks so much! There does appear to be a small bug with the first example, since Illumina-1.8+ -> Sanger should still be converted, since while similar, they are not the same. |
What are the differences? The range? |
@shenwei356 Yes, exactly. Illumina-1.8+ currently typically goes to one value higher ( I think that for this purpose, the formats should be considered different, with conversion from Sanger -> 1.8+ resulting in no change, but with the converse resulting in all scores > 40, being set to 40. |
I guess in this case, however, it might be best to simply leave it as is and consider the two formats the same by default (since otherwise, defaulting to Sanger, could discard higher quality scores that should usually be preserved). Perhaps you could simply keep it this way but add a |
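To make the two behaviours discussed above concrete, here is a minimal sketch of offset-based conversion with an optional cap. The function name and the `max_score` parameter are hypothetical and are not SeqKit options. It assumes plain Phred scores on both sides, so it does not cover Solexa.

```python
def convert_phred(qual, src_offset, dst_offset, max_score=None):
    """Shift Phred quality characters between ASCII offsets, optionally capping scores."""
    out = []
    for ch in qual:
        score = ord(ch) - src_offset
        # Optional cap, e.g. 40 when the target is strict Sanger.
        if max_score is not None and score > max_score:
            score = max_score
        out.append(chr(score + dst_offset))
    return "".join(out)


# Illumina 1.8+ -> Sanger with capping: 'J' (41) becomes 'I' (40).
capped = convert_phred("IJ", 33, 33, max_score=40)
# Sanger -> Illumina 1.8+: nothing changes.
unchanged = convert_phred("I!", 33, 33)
# Illumina 1.5 (Phred+64) -> Phred+33, no cap: 'B' (2) -> '#', 'h' (40) -> 'I', 'i' (41) -> 'J'.
shifted = convert_phred("Bhi", 64, 33)
```

Leaving `max_score` unset corresponds to treating the two Phred+33 formats as the same, which preserves scores above 40.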
@cviner updated as you suggested, thank you for making it better. Note that The test dataset contains score 41 (
By default, nothing changes when converting Illumina 1.8 to Sanger. A warning message shows that the source and target quality encodings match.
When switching flag
Other cases: To Illumina-1.5.
To Illumina-1.5 and back to Sanger.
Checking encoding
Real Illumina 1.5+ data
|
This is great! Thanks so much for implementing all of this, @shenwei356! It would be great if you could release a new version of SeqKit, integrating all of these enhancements. I look forward to using it in my pipeline. |
@shenwei356 Great—thanks! |
A common issue is determining the quality score encoding of older sequencing datasets and then converting it to the modern Phred+33 (Sanger / Illumina 1.8+) encoding. There does not yet appear to be a single unified tool that performs both of these functions as part of a larger pipeline on the Linux command line.
seqtk itself provides the ability to convert from Illumina 1.5 to Illumina 1.8+ formats, for instance, but does not convert from formats prior to Illumina 1.3. A useful Python script exists to guess the correct encoding. Conversion is further complicated by the fact that valid conversion between Sanger and Solexa involves a non-linear transformation and may require some precautions to ensure numerical stability. SeqKit seems like an ideal toolkit to include this functionality, which many would likely find highly useful.
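As a rough illustration of the non-linear transformation mentioned above, the standard relationship between Solexa and Phred scores can be sketched as follows. The function names are hypothetical. Clamping at the Solexa minimum of -5 is one common precaution for low scores, where the logarithm is undefined or very negative, and is an assumption here rather than a statement of how any particular tool handles it.

```python
import math


def solexa_to_phred(q_solexa):
    """Q_phred = 10 * log10(10^(Q_solexa / 10) + 1)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Q_solexa = 10 * log10(10^(Q_phred / 10) - 1), clamped at -5."""
    if q_phred <= 0:
        # log10(0) is undefined, so map Phred 0 to the lowest Solexa score.
        return -5.0
    return max(-5.0, 10 * math.log10(10 ** (q_phred / 10) - 1))
```

The two scales nearly coincide at high quality and diverge at low quality, which is why a simple offset shift is not a valid Solexa conversion.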