Skip to content

While mean error is properly calculated within reads, is it properly calculated across reads? #40

@schorlton

Description

@schorlton

As per title.

cat test.fastq
@test_1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
0,-/(/--/133451659;6875662/''$.110%&)(./1137540.+((&(##%&$)***,-30..*-,/20/0.,/124466640(-.',+)*,/.-,,+-&$$#%*(+,+/2/%'((&&$*+'*++'(()$%+-..1374,--'-&$$$$'.0/00340.--*++0/003513.0&&'-00,./03240/-,-.,,014'%%%0211+04753./,+1(*/222.&-,(()$$%$$/32300./2&,0%)00/6//00-+%&%&&'&/+/22321+-,'%
@test_2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
AA=96802,2-:9539;9897,+56771<:;<<;;4:;8008.<>;92@BBBA?(&54<=;4233334/$&&+2257./7421/4;B?;A5699@<:@:-)8145.335000

Manual implementation:

#! /usr/bin/env python

from statistics import mean
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import math


def decode(c):
    return ord(c) - 33

def phred_to_probability(phred_score):
    return 10**(-phred_score/10)

def probability_to_phred(error_probability):
    return -10 * math.log10(error_probability)

def raw_q_to_probability(q):
    return phred_to_probability(decode(q))


def calc_stats(fastq):
    mean_read_errors = []
    mean_read_quals = []
    with open(fastq) as in_handle:
        for _title, _seq, qual in FastqGeneralIterator(in_handle):

            error_probabilities = list(map(raw_q_to_probability, qual))
            mean_read_error = mean(error_probabilities)
            mean_read_errors.append(mean_read_error)
            mean_read_quals.append(probability_to_phred(mean_read_error))
    
    print(f"Averaging errors: {probability_to_phred(mean(mean_read_errors))}")
    print(f"Averaging quals: {mean(mean_read_quals)}")

calc_stats("test.fastq")
(base) mambauser@1:/phred_calc$ python3 mean_across_reads.py 
Averaging errors: 11.13979284230169
Averaging quals: 12.12473614554866

NanoStat:

(base) mambauser@1:/phred_calc$ NanoStat --fastq test.fastq -n nanostat.txt
(base) mambauser@1:/phred_calc$ cat nanostat.txt 
General summary:         
Mean read length:            198.0
Mean read quality:            12.1
Median read length:          198.0
Median read quality:          12.1
Number of reads:               2.0
Read length N50:             284.0
STDEV read length:            86.0
Total bases:                 396.0
Number, percentage and megabases of reads above quality cutoffs
>Q5:	2 (100.0%) 0.0Mb
>Q7:	2 (100.0%) 0.0Mb
>Q10:	1 (50.0%) 0.0Mb
>Q12:	1 (50.0%) 0.0Mb
>Q15:	1 (50.0%) 0.0Mb
Top 5 highest mean basecall quality scores and their read lengths
1:	15.2 (112)
2:	9.1 (284)
3:	NA
4:	NA
5:	NA
Top 5 longest reads and their mean basecall quality score
1:	284 (9.1)
2:	112 (15.2)
3:	NA
4:	NA
5:	NA

Thank you!!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions