Feature Request: automatic quality score conversion #18

cviner · 2017-07-11T18:47:07Z

A common issue encountered by many involves determining and then converting quality scores from various older sequencing datasets to modern Phread+33 (Sanger / Illumina 1.8+) encoding. There does not yet appear to be a single unified system to perform both of these functions, as part of a larger pipeline on the Linux command line. seqtk itself provides the ability to convert from Illumina1.5 to Illumina 1.8+ formats, for instance, but does not convert from formats prior to Illumina 1.3. A useful Python script exists to guess the correct encoding. Conversion is further complicated by the fact that valid conversion between Sanger and Solexa involves a non-linear transformation and may require some precautions to ensure numerical stability.

SeqKit seems like an ideal toolkit to include this functionality, which many would likely find highly useful.

The text was updated successfully, but these errors were encountered:

shenwei356 · 2017-07-11T22:41:45Z

Thanks for suggestion and detailed description. I'll implement it when I'm free (2-3 months later, sorry).

cviner · 2017-07-11T23:15:17Z

@shenwei356 Glad to hear that—thanks so much!

shenwei356 · 2017-07-19T14:56:04Z

Quality score convertion:

shenwei356 · 2017-07-24T16:06:25Z

Hi @cviner , I've added a command (alpha version below) seqkit convert for your request.
Please help to test it.

By default, it guesses the quality encoding according to the leading 1000 (-n) records,
you can also specify the input (--from) and output (--to) encoding.

Example:

seqkit convert read_1.fq.gz          --to 5 # auto detect input quality encoding
seqkit convert read_1.fq.gz --from 3 --to 5

Where 5 is the code for quality encoding. available value: 0 for Unknown, 1 for Sanger, 2 for Solexa, 3 for Illumina-1.3+, 4 for Illumina-1.5+, 5 for Illumina-1.8+.

cviner · 2017-07-24T18:04:35Z

@shenwei356 This is great—I will test it soon! Thanks for adding this!

For the auto-detection, is it able to detect Illumina 1.5 as going up to ASCII 105 (i, as I mentioned elsewhere)?
Would it be possible to include a heuristic, to uniquely identify 1.5 format if many Bs are present (similar to something I recently added to the script I am currently using to guess the encoding)?
For conversion of Illumina 1.5 to 1.8+, could the special code B be re-assigned to 0 (ASCII 33, !) in the new scale (as opposed to purely converted to 2)?

cviner · 2017-07-24T18:25:48Z

@shenwei356 I am not sure how to test this, using the provided binary.

./taxonkit convert

returns:

Error: unknown command "convert" for "taxonkit"
Run 'taxonkit --help' for usage.
unknown command "convert" for "taxonkit"

Perhaps you meant to enclose seqkit?

shenwei356 · 2017-07-25T00:02:21Z

Oh no, I uploaded the wrong binaries. fixed.

The special cases of Illumina 1.5 and 1.8 will be taken into consideration soon.

shenwei356 · 2017-07-25T15:21:10Z

As you mentioned, 1.5 format can be uniquely identified if many Bs are present. But will this bring false-positive? And is the "within 4th common qualities" criteria robust?

updated alpha version:

For a Illumina 1.5 test data:

$ cat tests/Illimina1.5.fq 
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB

$ seqkit convert tests/Illimina1.5.fq 
[INFO] possible quality encodings: [Solexa Illumina-1.3+ Illumina-1.5+]
[ERRO] fail to guess the source quality encoding, please specify it

cviner · 2017-07-25T15:30:23Z

@shenwei356 That is a good point. It may indeed do so, which is why I think it is best to output a warning when the heuristic is used and have an option to disable it.

I think that overall, using it is better than not including it. I do not think that my criteria is particularly robust and would welcome a better heuristic!

shenwei356 · 2017-07-25T15:40:08Z

How about this, using the criteria. Possible quality encodings of the leading N (default: 1000) records are computed firstly. And if the proportion of guessed Illumina 1.5 format is greater than some threshold, we can call them Illumina 1.5. If not, follow the regular workflow, i.e., computing the intersection of the N guesses.

cviner · 2017-07-25T15:53:35Z

Yes—I think that is a very good idea.

cviner · 2017-07-25T15:54:27Z

Although that should still somehow account for an enrichment of Bs. Perhaps integrating that idea with the previous one?

shenwei356 · 2017-07-25T16:00:42Z

Alright, I'll implement it tomorrow. Good morning for you and good night for me.

shenwei356 · 2017-07-26T02:35:57Z

@cviner I set the default faction of guessed Illumina 1.5 in the leanding N as 0.1 (--thresh-illumina1.5-frac).
please test on your data.

cviner · 2017-07-26T18:16:47Z

@shenwei356 Thanks! I have tested the latest binary on a few Illumina 1.5 datasets and one Sanger. Everything appears to work well! Thanks for implementing this!

It would be nice to accept actual format names like solexa or sanger.

One issue is that the Sanger dataset is converted from Illumina 1.8, when it correctly guesses both options. I would suggest that if those two options are present, Sanger should be the default. Finally, if the source and target formats match, conversion should be aborted with a message.

shenwei356 · 2017-07-27T00:55:24Z

Why should sanger be default if it may be sanger and illumina 1.8.

Most commands of seqkit follow the UNIX philosophy and read from stdin or file and write to stdout or file. So it still output when output and input format match, anyway it's very fast and log is written to stderr.

cviner · 2017-07-27T17:30:12Z

Currently it defaults to Illumina 1.8. The default should be Sanger because Illumina is now using that encoding, it is the default in many tools (e.g. Cutadapt), and all NCBI SRA / EBI ENA FASTQs are re-processed to the Sanger format. It can therefore be regarded as the current de facto standard.

As for always processing it, that is fair enough, but an option to not process it and return an informative message if the source and target encodings are the same would be nice.

Thanks again for all your work on this!

shenwei356 · 2017-07-27T23:57:42Z

OK, I'll work on it ASAP.

shenwei356 · 2017-07-29T05:13:08Z

sorry I can't have access to computer for 2 months.

cviner · 2017-07-30T19:36:22Z

@shenwei356 No problem! My current approach is sufficient for now. Thanks again for all your work on this and I look forward to using it after it is done.

shenwei356 · 2017-08-05T11:23:40Z

Updated as you requested.

Examples:

$ seqkit convert t.fq 
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Sanger
[WARN] source and target quality encoding match. aborted.

$ seqkit convert t.fq --to solexa --dry-run
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Solexa

cviner · 2017-08-06T04:55:28Z

@shenwei356 Thanks so much!

There does appear to be a small bug with the first example, since Illumina-1.8+ -> Sanger should still be converted, since while similar, they are not the same.

shenwei356 · 2017-08-06T08:10:10Z

what's the differences? range?

cviner · 2017-08-06T16:06:06Z

@shenwei356 Yes, exactly. Illumina-1.8+ currently typically goes to one value higher (I, ASCII 41) than the initial Sanger format. More importantly, it is expected to permit higher quality scores for newer sequencing chemistries. The latest HiSeq X Ten system, for instance, uses a binning system that introduces additional complications.

I think that for this purpose, the formats should be considered different, with conversion from Sanger -> 1.8+ resulting in no change, but with the converse resulting in all scores > 40, being set to 40.

cviner · 2017-08-06T16:10:55Z

I guess in this case, however, it might be best to simply leave it as is and consider the two formats the same by default (since otherwise, defaulting to Sanger, could discard higher quality scores that should usually be preserved).

Perhaps you could simply keep it this way but add a -f option to force conversion, even between formats regarded as identical? Then when forced, for Illumina 1.8 -> Sanger only, all scores above 40 would be truncated to 40, with a warning output to mention this to the user?

shenwei356 · 2017-08-11T14:19:20Z

@cviner updated as you suggested, thank you for making it better.

Note that seqkit convert always output sequences.

The test dataset contains score 41 (J):

$ seqkit head -n 1 tests/Illimina1.8.fq.gz 
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA                                                   
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG                                                                
+
#AAAFAAJFFFJJJ<JJJJJFFFJFJJJJJFJJAJJJFJJFJFJJJJFAFJ<JA<FFJ7FJJFJJAAJJJJ<JJJJJJJFJJJAJJJJJFJJ77<JJJJ-F7A-FJFFJJJJJJ<FFJ-<7FJJJFJJ)A7)7AA<7--)<-7F-A7FA<

By default, nothing change when converting Illumina 1.8 to Sanger. A warning message show that source and target quality encoding match.

$ seqkit convert tests/Illimina1.8.fq.gz  | seqkit head -n 1 
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Sanger
[WARN] source and target quality encoding match.
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG
+
#AAAFAAJFFFJJJ<JJJJJFFFJFJJJJJFJJAJJJFJJFJFJJJJFAFJ<JA<FFJ7FJJFJJAAJJJJ<JJJJJJJFJJJAJJJJJFJJ77<JJJJ-F7A-FJFFJJJJJJ<FFJ-<7FJJJFJJ)A7)7AA<7--)<-7F-A7FA<

When switching flag --force on, J was converted to I (40).

$ seqkit convert tests/Illimina1.8.fq.gz -f | seqkit head -n 1 
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Sanger
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG
+
#AAAFAAIFFFIII<IIIIIFFFIFIIIIIFIIAIIIFIIFIFIIIIFAFI<IA<FFI7FIIFIIAAIIII<IIIIIIIFIIIAIIIIIFII77<IIII-F7A-FIFFIIIIII<FFI-<7FIIIFII)A7)7AA<7--)<-7F-A7FA<

Other cases:

To Illumina-1.5.

$ seqkit convert tests/Illimina1.8.fq.gz --to Illumina-1.5+ | seqkit head -n 1 
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Illumina-1.5+
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG
+
B```e``ieeeiii[iiiiieeeieiiiiieii`iiieiieieiiiie`ei[i`[eeiVeiieii``iiii[iiiiiiieiii`iiiiieiiVV[iiiiLeV`Leieeiiiiii[eeiL[VeiiieiiH`VHV``[VLLH[LVeL`Ve`[

To Illumina-1.5 and back to Sanger.

$ seqkit convert tests/Illimina1.8.fq.gz --to Illumina-1.5+ | seqkit convert | seqkit head -n 1
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Illumina-1.5+
[INFO] possible quality encodings: [Illumina-1.5+]
[INFO] guessed quality encoding: Illumina-1.5+
[INFO] converting Illumina-1.5+ -> Sanger
@ST-E00493:56:H33MFALXX:4:1101:23439:1379 1:N:0:NACAACCA
NCGTGGAAAGACGCTAAGATTGTGATGTGCTTCCCTGACGATTACAACTGGCGTAAGGACGTTTTGCCTACCTATAAGGCTAACCGTAAGGGTTCTCGCAAGCCTGTAGGTTACAAGAGGTTCGTAGCCGAAGTGATGGCTGACTCACGG
+
!AAAFAAJFFFJJJ<JJJJJFFFJFJJJJJFJJAJJJFJJFJFJJJJFAFJ<JA<FFJ7FJJFJJAAJJJJ<JJJJJJJFJJJAJJJJJFJJ77<JJJJ-F7A-FJFFJJJJJJ<FFJ-<7FJJJFJJ)A7)7AA<7--)<-7F-A7FA<

Checking encoding

$ seqkit convert tests/Illimina1.8.fq.gz --from Solexa 
[INFO] converting Solexa -> Sanger
[ERRO] seq: invalid Solexa quality

Real Illumina 1.5+ data

$ seqkit seq tests/Illimina1.5.fq 
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB

$ seqkit convert tests/Illimina1.5.fq | seqkit head -n 1
[INFO] possible quality encodings: [Illumina-1.5+]
[INFO] guessed quality encoding: Illumina-1.5+
[INFO] converting Illumina-1.5+ -> Sanger
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+
FGDGGGGGDGFFGGGDGGGGGGEEGAGFFE>A>@!B@?@@<:!!!!!!!!!!355=>><>EEEEAEEE?EEEBEE?!!!!!!!!!!!!!!!!!!!!!!!!

cviner · 2017-08-11T17:52:07Z

This is great! Thanks so much for implementing all of this, @shenwei356!

It would be great if you could release a new version of SeqKit, integrating all of these enhancements.

I look forward to using it in my pipeline.

shenwei356 · 2017-08-12T15:59:34Z

@cviner Please try the new version: v0.7.0

cviner · 2017-08-14T16:35:42Z

@shenwei356 Great—thanks!

shenwei356 added the enhancement label Jul 11, 2017

Repository owner deleted a comment from zhaoxia413 Jul 25, 2017

shenwei356 added a commit that referenced this issue Aug 5, 2017

update convert. #18

0d39313

shenwei356 added a commit that referenced this issue Aug 11, 2017

update convert following #18

c00b425

cviner closed this as completed Aug 11, 2017

cviner mentioned this issue Mar 19, 2018

Feature Request: automatic quality score conversion OpenGene/fastp#44

Closed

cviner mentioned this issue Nov 25, 2018

Guessing heuristics brentp/bio-playground#18

Closed

shenwei356 mentioned this issue Nov 1, 2021

[QUESTION] Guessing FASTQ quality system using seqkit convert #254

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: automatic quality score conversion #18

Feature Request: automatic quality score conversion #18

cviner commented Jul 11, 2017

shenwei356 commented Jul 11, 2017 •

edited

Loading

cviner commented Jul 11, 2017

shenwei356 commented Jul 19, 2017 •

edited

Loading

shenwei356 commented Jul 24, 2017 •

edited

Loading

cviner commented Jul 24, 2017 •

edited

Loading

cviner commented Jul 24, 2017 •

edited

Loading

shenwei356 commented Jul 25, 2017

shenwei356 commented Jul 25, 2017 •

edited

Loading

cviner commented Jul 25, 2017

shenwei356 commented Jul 25, 2017

cviner commented Jul 25, 2017

cviner commented Jul 25, 2017

shenwei356 commented Jul 25, 2017

shenwei356 commented Jul 26, 2017 •

edited

Loading

cviner commented Jul 26, 2017

shenwei356 commented Jul 27, 2017

cviner commented Jul 27, 2017

shenwei356 commented Jul 27, 2017

shenwei356 commented Jul 29, 2017

cviner commented Jul 30, 2017

shenwei356 commented Aug 5, 2017

cviner commented Aug 6, 2017

shenwei356 commented Aug 6, 2017

cviner commented Aug 6, 2017

cviner commented Aug 6, 2017

shenwei356 commented Aug 11, 2017 •

edited

Loading

cviner commented Aug 11, 2017

shenwei356 commented Aug 12, 2017

cviner commented Aug 14, 2017

Feature Request: automatic quality score conversion #18

Feature Request: automatic quality score conversion #18

Comments

cviner commented Jul 11, 2017

shenwei356 commented Jul 11, 2017 • edited Loading

cviner commented Jul 11, 2017

shenwei356 commented Jul 19, 2017 • edited Loading

shenwei356 commented Jul 24, 2017 • edited Loading

cviner commented Jul 24, 2017 • edited Loading

cviner commented Jul 24, 2017 • edited Loading

shenwei356 commented Jul 25, 2017

shenwei356 commented Jul 25, 2017 • edited Loading

cviner commented Jul 25, 2017

shenwei356 commented Jul 25, 2017

cviner commented Jul 25, 2017

cviner commented Jul 25, 2017

shenwei356 commented Jul 25, 2017

shenwei356 commented Jul 26, 2017 • edited Loading

cviner commented Jul 26, 2017

shenwei356 commented Jul 27, 2017

cviner commented Jul 27, 2017

shenwei356 commented Jul 27, 2017

shenwei356 commented Jul 29, 2017

cviner commented Jul 30, 2017

shenwei356 commented Aug 5, 2017

cviner commented Aug 6, 2017

shenwei356 commented Aug 6, 2017

cviner commented Aug 6, 2017

cviner commented Aug 6, 2017

shenwei356 commented Aug 11, 2017 • edited Loading

cviner commented Aug 11, 2017

shenwei356 commented Aug 12, 2017

cviner commented Aug 14, 2017

shenwei356 commented Jul 11, 2017 •

edited

Loading

shenwei356 commented Jul 19, 2017 •

edited

Loading

shenwei356 commented Jul 24, 2017 •

edited

Loading

cviner commented Jul 24, 2017 •

edited

Loading

cviner commented Jul 24, 2017 •

edited

Loading

shenwei356 commented Jul 25, 2017 •

edited

Loading

shenwei356 commented Jul 26, 2017 •

edited

Loading

shenwei356 commented Aug 11, 2017 •

edited

Loading