
Test cases

Thomas Cokelaer edited this page Aug 13, 2020 · 20 revisions

We present some examples to emphasize that FastQC plots should be interpreted with care. Warnings or errors sometimes appear even though the behaviour is perfectly normal, as the cases on this page show.

Authors: Thomas Cokelaer, Laure Lemee

Interpretation of the per base sequence quality and per sequence quality score plots

These two plots convey very similar information. The plot on the left-hand side is the most standard plot used to assess the quality of a run. It gives the average quality at a given position across all sequences. If a subset of sequences has a lower quality than the others, this plot will not reveal it, because the subset is averaged out. The plot on the right-hand side provides that information: it is the histogram of the average quality of each sequence. For example, if 10% of the sequences have a quality of 30 and the others have a quality of 35, we should see a bimodal distribution, indicating a possible drop of quality during the run.

Figures: per base quality (left), per sequence quality score (right)
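
To make the difference between the two views concrete, here is a minimal Python sketch using the numbers from the example above (10% of reads at Q30, 90% at Q35). The read length of 50 and the function names are assumptions made for illustration; this is not how FastQC itself computes its plots.

```python
from collections import Counter

def per_base_mean_quality(quals):
    """Mean Phred score at each position, averaged across all reads."""
    n_pos = max(len(q) for q in quals)
    means = []
    for i in range(n_pos):
        column = [q[i] for q in quals if i < len(q)]
        means.append(sum(column) / len(column))
    return means

def per_sequence_quality_histogram(quals):
    """Number of reads for each (rounded) mean Phred score."""
    return Counter(round(sum(q) / len(q)) for q in quals)

# Synthetic run (assumed): 100 reads at Q30 and 900 reads at Q35, length 50.
reads = [[30] * 50 for _ in range(100)] + [[35] * 50 for _ in range(900)]

per_base = per_base_mean_quality(reads)
histogram = per_sequence_quality_histogram(reads)

print(per_base[0])                # a single averaged value per position
print(sorted(histogram.items()))  # two separate peaks, one per population
```

The per-base view collapses both populations into one averaged value at each position, whereas the per-sequence histogram keeps them apart as two peaks.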

GC plot exhibiting non-normal and non-similar shapes

The GC plot (see below) shows the distribution of the mean GC content of the reads for each sample. In genomics, if a long genome is sequenced and there is no contamination, you should see the type of plot shown on the right-hand side. The mean of the curve gives the mean GC content, which should match the known GC content of the genome being sequenced. On the left-hand side, although an error is reported by FastQC, this is the typical kind of plot you will see in metagenomics or when the library is a mix of small genomes (as is the case here).

Figures: phage (left), streptococcus (right)
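
As an illustration of why a mixed library gives a multi-modal GC curve, the following Python sketch bins reads by their GC percentage. The toy reads, the 5% bin width and the function names are assumptions chosen for the example, not FastQC internals.

```python
from collections import Counter

def gc_percent(seq):
    """GC percentage of one read, ignoring N and other ambiguous calls."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return None  # read without any called base
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_histogram(reads, bin_width=5):
    """Number of reads per GC bin; each bin is labelled by its lower edge."""
    hist = Counter()
    for seq in reads:
        gc = gc_percent(seq)
        if gc is not None:
            lower_edge = min(int(gc // bin_width) * bin_width, 100 - bin_width)
            hist[lower_edge] += 1
    return dict(sorted(hist.items()))

# Toy data (assumed): one genome at 40% GC versus a mix of 30% and 60% GC.
single_genome = ["AATTGCATGC"] * 10
mixed_library = ["AATTAGATGC"] * 5 + ["GCGCGCATAT"] * 5

print(gc_histogram(single_genome))  # one peak
print(gc_histogram(mixed_library))  # two peaks, one per genome
```

With real data the peaks are broad rather than single bins, but the principle is the same: each genome contributes its own mode, so the sum no longer looks like a single normal curve.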

Non-constant per-base sequence content

The per-base sequence content plot shows, on the y-axis, the percentage of each of the four DNA bases at each read position. FastQC reports a warning if the difference between A and T, or between G and C, exceeds 10% at any position, and an error if it exceeds 20%. Below, we have two genomes from the same library, and the left and right plots are quite different. On the left-hand side, FastQC reports a good run: the ACGT content is constant along the read. This is what is expected for a long genome (here 22 Mb). On the right-hand side, FastQC reports an error. Yet, as shown below, the quality is even better (Phred score > 35). In fact, there is nothing wrong here. The sequenced genome is just short and therefore lacks diversity when computing this kind of plot.

Figures: Plasmodium (left), virome (right)

Here is another example where FastQC reports an error although the run is perfectly fine. This concerns a 16S library. Here again, because the reads are short and come from a limited set of similar sequences, the ACGT lines are neither flat nor random.

Figure: 16S ACGT content
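
The following Python sketch shows why a low-diversity library trips this check. It computes the percentage of each base per position and flags positions using the 10% and 20% thresholds quoted above. The function names and toy reads are assumptions for illustration; the logic is a simplification and not FastQC's actual implementation.

```python
def per_base_content(reads):
    """Percentage of A, C, G and T at each position (other symbols ignored)."""
    n_pos = max(len(r) for r in reads)
    table = []
    for i in range(n_pos):
        column = [r[i].upper() for r in reads if i < len(r)]
        called = [b for b in column if b in "ACGT"]
        total = len(called) or 1  # avoid dividing by zero on all-N columns
        table.append({b: 100.0 * called.count(b) / total for b in "ACGT"})
    return table

def flag_positions(table, warn=10.0, fail=20.0):
    """Status per position, from the larger of |A-T| and |G-C|."""
    flags = []
    for pct in table:
        gap = max(abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
        flags.append("fail" if gap > fail else "warn" if gap > warn else "pass")
    return flags

# Toy data (assumed): reads starting anywhere in a diverse genome,
# versus an amplicon-like library where every read is the same sequence.
diverse_library = ["ACGT", "CGTA", "GTAC", "TACG"]
amplicon_library = ["ACGT"] * 4

print(flag_positions(per_base_content(diverse_library)))
print(flag_positions(per_base_content(amplicon_library)))
```

In the amplicon-like case every position is dominated by a single base, so the check fails everywhere even though the base calls themselves may be of excellent quality.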

RNA-seq N's present in large proportions

Once a FastQC (and MultiQC) report is available, we usually look at the quality plot first. These tools provide a green/orange/red light indicating a no warning/warning/error status. In this RNA-seq experiment with 6 samples, the per base sequence quality plot shows a drop of quality from position 0 to 40, which is especially pronounced in one of the samples. At first sight, that sample looks totally wrong, since its quality is below 20 at the beginning of all reads.

A complementary plot is the per base N content, which is shown here below:

Here we see the same samples. The red curve corresponds to the sample that was red in the previous plot. This sample actually has 40% of Ns at the beginning of the reads and is therefore tagged in red (error), suggesting that this sample should be dropped.

In fact, what is going on here is that the quality of the library was such that many adapter dimers were created. 40% of the reads actually contain no data: the sequencer created reads with just Ns and no genomic content. Yet, the other 60% of the reads were totally correct and of high quality. Moreover, the reads made of Ns have a length of 35 bp. Coming back to the first plot, if we ignore the reads with Ns (which have poor quality), the rest of the data has the expected high quality.

Subsequent RNA-seq analysis, which ignores the reads with Ns, showed no difference between this sample and the other 5 samples.

Conclusion: even though the plots indicated a very poor quality for one sample, once the reads made of Ns were ignored, and provided the remaining yield of reads is sufficient for the bioinformatics analysis, the reads were usable and the experiment was validated.
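
The reasoning of this test case can be reproduced with a small Python sketch: compute the per base N content, drop the reads made of Ns, and look at the quality again. The synthetic data follows the proportions described above (40% of reads are 35 Ns), while the Phred values (Q2 for Ns, Q36 for good reads), the 100 bp length of the good reads, the 0.5 threshold and the function names are all assumptions made for illustration.

```python
def per_base_n_percent(reads):
    """Percentage of N calls at each position; reads are (sequence, quals)."""
    n_pos = max(len(seq) for seq, _ in reads)
    result = []
    for i in range(n_pos):
        column = [seq[i].upper() for seq, _ in reads if i < len(seq)]
        result.append(100.0 * column.count("N") / len(column))
    return result

def mean_quality_at(reads, pos):
    """Mean Phred score at one position, over reads long enough to cover it."""
    scores = [quals[pos] for _, quals in reads if pos < len(quals)]
    return sum(scores) / len(scores)

def drop_n_reads(reads, max_n_fraction=0.5):
    """Keep reads whose fraction of N calls is at most max_n_fraction."""
    kept = []
    for seq, quals in reads:
        if seq and seq.upper().count("N") / len(seq) <= max_n_fraction:
            kept.append((seq, quals))
    return kept

# Synthetic sample (assumed values): 40 reads of 35 Ns, 60 good 100 bp reads.
n_reads = [("N" * 35, [2] * 35)] * 40
good_reads = [("ACGT" * 25, [36] * 100)] * 60
reads = n_reads + good_reads

n_percent = per_base_n_percent(reads)
kept = drop_n_reads(reads)

print(n_percent[0], n_percent[35])  # N content inside and after the first 35 bp
print(mean_quality_at(reads, 0))    # pulled down by the N-only reads
print(mean_quality_at(kept, 0))     # quality of the usable reads only
```

The drop in the averaged quality at the start of the reads comes entirely from the N-only reads; the remaining reads keep their high quality once those are filtered out.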