Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: automatic quality score conversion #18

Closed
cviner opened this issue Jul 11, 2017 · 29 comments
Closed

Feature Request: automatic quality score conversion #18

cviner opened this issue Jul 11, 2017 · 29 comments

Comments

@cviner
Copy link

cviner commented Jul 11, 2017

A common issue encountered by many involves determining and then converting quality scores from various older sequencing datasets to modern Phread+33 (Sanger / Illumina 1.8+) encoding. There does not yet appear to be a single unified system to perform both of these functions, as part of a larger pipeline on the Linux command line. seqtk itself provides the ability to convert from Illumina1.5 to Illumina 1.8+ formats, for instance, but does not convert from formats prior to Illumina 1.3. A useful Python script exists to guess the correct encoding. Conversion is further complicated by the fact that valid conversion between Sanger and Solexa involves a non-linear transformation and may require some precautions to ensure numerical stability.

SeqKit seems like an ideal toolkit to include this functionality, which many would likely find highly useful.

@shenwei356
Copy link
Owner

shenwei356 commented Jul 11, 2017

Thanks for suggestion and detailed description. I'll implement it when I'm free (2-3 months later, sorry).

@cviner
Copy link
Author

cviner commented Jul 11, 2017

@shenwei356 Glad to hear that—thanks so much!

@shenwei356
Copy link
Owner

shenwei356 commented Jul 19, 2017

@shenwei356
Copy link
Owner

shenwei356 commented Jul 24, 2017

Hi @cviner , I've added a command (alpha version below) seqkit convert for your request.
Please help to test it.

By default, it guesses the quality encoding according to the leading 1000 (-n) records,
you can also specify the input (--from) and output (--to) encoding.

Example:

seqkit convert read_1.fq.gz          --to 5 # auto detect input quality encoding
seqkit convert read_1.fq.gz --from 3 --to 5

Where 5 is the code for quality encoding. available value: 0 for Unknown, 1 for Sanger, 2 for Solexa, 3 for Illumina-1.3+, 4 for Illumina-1.5+, 5 for Illumina-1.8+.

@cviner
Copy link
Author

cviner commented Jul 24, 2017

@shenwei356 This is great—I will test it soon! Thanks for adding this!

For the auto-detection, is it able to detect Illumina 1.5 as going up to ASCII 105 (i, as I mentioned elsewhere)?
Would it be possible to include a heuristic, to uniquely identify 1.5 format if many Bs are present (similar to something I recently added to the script I am currently using to guess the encoding)?
For conversion of Illumina 1.5 to 1.8+, could the special code B be re-assigned to 0 (ASCII 33, !) in the new scale (as opposed to purely converted to 2)?

@cviner
Copy link
Author

cviner commented Jul 24, 2017

@shenwei356 I am not sure how to test this, using the provided binary.

./taxonkit convert

returns:

Error: unknown command "convert" for "taxonkit"
Run 'taxonkit --help' for usage.
unknown command "convert" for "taxonkit"

Perhaps you meant to enclose seqkit?

@shenwei356
Copy link
Owner

Oh no, I uploaded the wrong binaries. fixed.

The special cases of Illumina 1.5 and 1.8 will be taken into consideration soon.

Repository owner deleted a comment from zhaoxia413 Jul 25, 2017
@shenwei356
Copy link
Owner

shenwei356 commented Jul 25, 2017

As you mentioned, 1.5 format can be uniquely identified if many Bs are present. But will this bring false-positive? And is the "within 4th common qualities" criteria robust?


updated alpha version:

For a Illumina 1.5 test data:

$ cat tests/Illimina1.5.fq 
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB

$ seqkit convert tests/Illimina1.5.fq 
[INFO] possible quality encodings: [Solexa Illumina-1.3+ Illumina-1.5+]
[ERRO] fail to guess the source quality encoding, please specify it

@cviner
Copy link
Author

cviner commented Jul 25, 2017

@shenwei356 That is a good point. It may indeed do so, which is why I think it is best to output a warning when the heuristic is used and have an option to disable it.

I think that overall, using it is better than not including it. I do not think that my criteria is particularly robust and would welcome a better heuristic!

@shenwei356
Copy link
Owner

How about this, using the criteria. Possible quality encodings of the leading N (default: 1000) records are computed firstly. And if the proportion of guessed Illumina 1.5 format is greater than some threshold, we can call them Illumina 1.5. If not, follow the regular workflow, i.e., computing the intersection of the N guesses.

@cviner
Copy link
Author

cviner commented Jul 25, 2017

Yes—I think that is a very good idea.

@cviner
Copy link
Author

cviner commented Jul 25, 2017

Although that should still somehow account for an enrichment of Bs. Perhaps integrating that idea with the previous one?

@shenwei356
Copy link
Owner

Alright, I'll implement it tomorrow. Good morning for you and good night for me.

@shenwei356
Copy link
Owner

shenwei356 commented Jul 26, 2017

@cviner I set the default faction of guessed Illumina 1.5 in the leanding N as 0.1 (--thresh-illumina1.5-frac).
please test on your data.

@cviner
Copy link
Author

cviner commented Jul 26, 2017

@shenwei356 Thanks! I have tested the latest binary on a few Illumina 1.5 datasets and one Sanger. Everything appears to work well! Thanks for implementing this!

It would be nice to accept actual format names like solexa or sanger.

One issue is that the Sanger dataset is converted from Illumina 1.8, when it correctly guesses both options. I would suggest that if those two options are present, Sanger should be the default. Finally, if the source and target formats match, conversion should be aborted with a message.

@shenwei356
Copy link
Owner

Why should sanger be default if it may be sanger and illumina 1.8.

Most commands of seqkit follow the UNIX philosophy and read from stdin or file and write to stdout or file. So it still output when output and input format match, anyway it's very fast and log is written to stderr.

@cviner
Copy link
Author

cviner commented Jul 27, 2017

Currently it defaults to Illumina 1.8. The default should be Sanger because Illumina is now using that encoding, it is the default in many tools (e.g. Cutadapt), and all NCBI SRA / EBI ENA FASTQs are re-processed to the Sanger format. It can therefore be regarded as the current de facto standard.

As for always processing it, that is fair enough, but an option to not process it and return an informative message if the source and target encodings are the same would be nice.

Thanks again for all your work on this!

@shenwei356
Copy link
Owner

OK, I'll work on it ASAP.

@shenwei356
Copy link
Owner

sorry I can't have access to computer for 2 months.

@cviner
Copy link
Author

cviner commented Jul 30, 2017

@shenwei356 No problem! My current approach is sufficient for now. Thanks again for all your work on this and I look forward to using it after it is done.

shenwei356 added a commit that referenced this issue Aug 5, 2017
@shenwei356
Copy link
Owner

Updated as you requested.

Examples:

$ seqkit convert t.fq 
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Sanger
[WARN] source and target quality encoding match. aborted.
$ seqkit convert t.fq --to solexa --dry-run
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Solexa

@cviner
Copy link
Author

cviner commented Aug 6, 2017

@shenwei356 Thanks so much!

There does appear to be a small bug with the first example, since Illumina-1.8+ -> Sanger should still be converted, since while similar, they are not the same.

@shenwei356
Copy link
Owner

what's the differences? range?

@cviner
Copy link
Author

cviner commented Aug 6, 2017

@shenwei356 Yes, exactly. Illumina-1.8+ currently typically goes to one value higher (I, ASCII 41) than the initial Sanger format. More importantly, it is expected to permit higher quality scores for newer sequencing chemistries. The latest HiSeq X Ten system, for instance, uses a binning system that introduces additional complications.

I think that for this purpose, the formats should be considered different, with conversion from Sanger -> 1.8+ resulting in no change, but with the converse resulting in all scores > 40, being set to 40.

@cviner
Copy link
Author

cviner commented Aug 6, 2017

I guess in this case, however, it might be best to simply leave it as is and consider the two formats the same by default (since otherwise, defaulting to Sanger, could discard higher quality scores that should usually be preserved).

Perhaps you could simply keep it this way but add a -f option to force conversion, even between formats regarded as identical? Then when forced, for Illumina 1.8 -> Sanger only, all scores above 40 would be truncated to 40, with a warning output to mention this to the user?

@shenwei356
Copy link
Owner

shenwei356 commented Aug 11, 2017

@cviner updated as you suggested, thank you for making it better.

Note that seqkit convert always output sequences.

The test dataset contains score 41 (J):

$ seqkit head -n 1 tests/Illimina1.8.fq.gz 
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA                                                   
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG                                                                
+
#AAAFAAJFFFJJJ<JJJJJFFFJFJJJJJFJJAJJJFJJFJFJJJJFAFJ<JA<FFJ7FJJFJJAAJJJJ<JJJJJJJFJJJAJJJJJFJJ77<JJJJ-F7A-FJFFJJJJJJ<FFJ-<7FJJJFJJ)A7)7AA<7--)<-7F-A7FA<

By default, nothing change when converting Illumina 1.8 to Sanger. A warning message show that source and target quality encoding match.

$ seqkit convert tests/Illimina1.8.fq.gz  | seqkit head -n 1 
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Sanger
[WARN] source and target quality encoding match.
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG
+
#AAAFAAJFFFJJJ<JJJJJFFFJFJJJJJFJJAJJJFJJFJFJJJJFAFJ<JA<FFJ7FJJFJJAAJJJJ<JJJJJJJFJJJAJJJJJFJJ77<JJJJ-F7A-FJFFJJJJJJ<FFJ-<7FJJJFJJ)A7)7AA<7--)<-7F-A7FA<

When switching flag --force on, J was converted to I (40).

$ seqkit convert tests/Illimina1.8.fq.gz -f | seqkit head -n 1 
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Sanger
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG
+
#AAAFAAIFFFIII<IIIIIFFFIFIIIIIFIIAIIIFIIFIFIIIIFAFI<IA<FFI7FIIFIIAAIIII<IIIIIIIFIIIAIIIIIFII77<IIII-F7A-FIFFIIIIII<FFI-<7FIIIFII)A7)7AA<7--)<-7F-A7FA<

Other cases:

To Illumina-1.5.

$ seqkit convert tests/Illimina1.8.fq.gz --to Illumina-1.5+ | seqkit head -n 1 
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Illumina-1.5+
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG
+
B```e``ieeeiii[iiiiieeeieiiiiieii`iiieiieieiiiie`ei[i`[eeiVeiieii``iiii[iiiiiiieiii`iiiiieiiVV[iiiiLeV`Leieeiiiiii[eeiL[VeiiieiiH`VHV``[VLLH[LVeL`Ve`[

To Illumina-1.5 and back to Sanger.

$ seqkit convert tests/Illimina1.8.fq.gz --to Illumina-1.5+ | seqkit convert | seqkit head -n 1
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Illumina-1.5+
[INFO] possible quality encodings: [Illumina-1.5+]
[INFO] guessed quality encoding: Illumina-1.5+
[INFO] converting Illumina-1.5+ -> Sanger
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG
+
!AAAFAAJFFFJJJ<JJJJJFFFJFJJJJJFJJAJJJFJJFJFJJJJFAFJ<JA<FFJ7FJJFJJAAJJJJ<JJJJJJJFJJJAJJJJJFJJ77<JJJJ-F7A-FJFFJJJJJJ<FFJ-<7FJJJFJJ)A7)7AA<7--)<-7F-A7FA<

Checking encoding

$ seqkit convert tests/Illimina1.8.fq.gz --from Solexa 
[INFO] converting Solexa -> Sanger
[ERRO] seq: invalid Solexa quality

Real Illumina 1.5+ data

$ seqkit seq tests/Illimina1.5.fq 
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB

$ seqkit convert tests/Illimina1.5.fq | seqkit head -n 1
[INFO] possible quality encodings: [Illumina-1.5+]
[INFO] guessed quality encoding: Illumina-1.5+
[INFO] converting Illumina-1.5+ -> Sanger
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+
FGDGGGGGDGFFGGGDGGGGGGEEGAGFFE>A>@!B@?@@<:!!!!!!!!!!355=>><>EEEEAEEE?EEEBEE?!!!!!!!!!!!!!!!!!!!!!!!!

shenwei356 added a commit that referenced this issue Aug 11, 2017
@cviner
Copy link
Author

cviner commented Aug 11, 2017

This is great! Thanks so much for implementing all of this, @shenwei356!

It would be great if you could release a new version of SeqKit, integrating all of these enhancements.

I look forward to using it in my pipeline.

@cviner cviner closed this as completed Aug 11, 2017
@shenwei356
Copy link
Owner

@cviner Please try the new version: v0.7.0

@cviner
Copy link
Author

cviner commented Aug 14, 2017

@shenwei356 Great—thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants