# ChemBioSys and AquaDiva | Genome-resolved metagenomics workshop

Throughout the course, you will be working with two mock datasets that have been generated _in silico_. Starting from an unknown number of genomes, we have simulated two sequencing runs. The original genomes are covered differently in these two runs, meaning that their abundance differs between the two datasets. This will become relevant further down the road, once we start doing assemblies and genome binning.

<img src="img/albertsen.png" alt="Differential coverage" width="500"/>

<font size="2"> [Albertsen and colleagues](https://www.nature.com/articles/nbt.2579) popularized the use of differential coverage for metagenome binning. _© Albertsen et al., 2013_ </font>

Differential coverage means that the coverage of a reference sequence changes across multiple samples, as illustrated by the example from Albertsen et al., 2013. Most binning algorithms make use of this information.
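To make the idea a bit more tangible, here is a minimal sketch of the underlying arithmetic. It assumes the common definition of mean coverage as number of reads times read length divided by genome length. The helper name `mean_coverage` and all numbers are made up for illustration and have nothing to do with the workshop datasets.

```bash
# Hypothetical helper: mean coverage = reads * read_length / genome_length
mean_coverage() {
  awk -v n="$1" -v l="$2" -v g="$3" 'BEGIN { printf "%.1f\n", n * l / g }'
}

# Made-up numbers: the same 4 Mbp genome, 150 bp reads, two samples
cov_a=$(mean_coverage 800000 150 4000000)   # genome is abundant in sample A
cov_b=$(mean_coverage 80000 150 4000000)    # genome is rare in sample B
echo "sample A: ${cov_a}x, sample B: ${cov_b}x"
```

Contigs that originate from the same genome should show the same coverage pattern across samples (high in A, low in B in this toy case), which is the signal binning tools can exploit.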

## Recap | DNA sequencing

Before we dive into sequence data quality assessment and control, we do a little recap of DNA sequencing. When we talk about sequencing, we primarily talk about next-generation sequencing (NGS). The term NGS is a bit misleading, as we have been using NGS for almost 20 years by now. Before NGS, [Sanger sequencing](https://en.wikipedia.org/wiki/Sanger_sequencing) was the "workhorse" of DNA sequencing. The _de facto_ standard in NGS is Illumina sequencing. There are other NGS platforms, but they only play a minor role or are already obsolete (e.g. 454/pyrosequencing).

[This](https://www.nature.com/articles/nrg2626) review by Metzker is by now 10+ years old, but it is still excellent and provides a great introduction to NGS.

Most NGS techniques/strategies rely on clonally amplified template DNA. This is because the imaging systems used cannot resolve single fluorescent events, which are the basis for reading DNA sequences.

### Template amplification

In the case of Illumina sequencing, clonal amplification is achieved via bridge amplification. Template DNA, treated so that it carries flanking primers for subsequent amplification, binds to the surface of a flowcell that is covered with complementary probes. Template DNA is amplified by PCR, bends over, binds a "free" primer, and the process is repeated. The resulting clusters make sure that the fluorescent signals generated during sequencing are sufficiently strong for detection.

![Bridge amplification](img/bridge_amplification.jpg)  
<font size="2"> Bridge amplification: the DNA template is clonally amplified to guarantee sufficiently strong fluorescent signals during sequencing. _© Metzker, 2010_ </font>

### Sequencing

Illumina sequencing relies on cyclic reversible termination (CRT). DNA polymerase bound to the template incorporates one fluorescently labelled nucleotide at a time. Unincorporated nucleotides are washed away and imaging is performed to determine the identity of the incorporated base. The fluorescent dye and the terminating group, which prevents further nucleotide incorporation, are then cleaved off and the cycle is repeated.

![Cyclic reversible termination](img/CRT.jpg)  
<font size="2"> Cyclic reversible termination sequencing. _© Metzker, 2010_ </font>

### Sequencing library preparation

Library preparation comprises the following stages:

1. Shearing
2. End-repair and adapter ligation
3. PCR enrichment

### Shearing
Remember, we want high-quality DNA for sequencing. For downstream bridge amplification, genomic DNA (gDNA) has to be sheared; otherwise, the resulting clusters blend into each other. DNA shearing can be done by ultrasound or enzymatically. The latter tends to introduce a bias since enzymatic fragmentation is not random.

---
❓**QUESTION**

How does the length of the template DNA affect clonal amplification and potentially sequencing?!

---
	
### End-repair and adapter ligation
For clonal amplification and sequencing, sheared template DNA must be able to bind to the flowcell. Binding is mediated by adapters/primers that are complementary to primers present on the flowcell. Adapters are added enzymatically by ligation once the template DNA is end-repaired. End-repaired means that the template DNA carries a 3' adenine overhang, which is used to ligate the adapters (see the flowchart below).

The adapters look like a hairpin, with a uracil in the middle. This uracil functions as the cutting site for the USER enzyme. As a result of this cut, the 5' and 3' ends are now flanked by adapters.

### PCR enrichment
The adapters function as primers for a subsequent PCR that is used to add the sequencing primers (plus barcodes to distinguish samples) to the template DNA. These primers, labelled P5 and P7 in the flowchart, bind the flowcell and are the starting point for CRT-based sequencing.

![Library Preparation](img/libprep.jpg)  
<font size="2"> Flowchart summarizing the stages of library preparation. _© New England Biolabs_</font>

### Quality control of the libraries
We quantify the libraries by fluorometry and assess the size distribution by means of chip-based gel electrophoresis, which saves material and offers increased resolution. Chip-based gel electrophoresis means that we use a microfluidic chip with fine capillaries, which we load with a gel/dye mixture that is pressed into the capillaries. Afterwards, samples can be loaded and are separated based on migration through an electric field, just like in "standard" gel electrophoresis.

The typical layout of such a chip is shown below.

<img src="img/lab_on_a_chip.jpg" alt="Lab on a chip" width="500"/>

<font size="2"> Layout of a microfluidic chip used for gel electrophoresis. The chip is loaded with a gel/dye matrix using the gel wells. Samples are loaded via the sample wells. A size standard is loaded via the ladder well. Samples are size-separated based on migration in an electric field. _© Agilent_ </font>

### What about long-read sequencing - Oxford Nanopore sequencing

Next-generation (or 2nd-generation) sequencing introduced high throughput, while 3rd-generation sequencing (PacBio RS and Oxford Nanopore sequencing) brought long-read sequencing to the table. Long reads facilitate (meta)genome assembly, and Illumina and long-read sequencing are by now often combined to facilitate the recovery of high-quality genomes from metagenome datasets.

The concept behind Nanopore sequencing goes back to the 80s, when David Deamer first came up with it.

<img src="img/nanoP_3.png" alt="Deamer" width="500"/>

<font size="2"> Conceptualization of Nanopore sequencing. _© David Deamer_ </font>

Single-stranded polynucleotides are electrophoretically driven through a nanopore (green) that provides the only path through which ions or polynucleotides can move from a cis to a trans chamber. Translocation of the polynucleotide through the nanopore is controlled by an enzyme (red). The ionic current through a nanopore is measured by a sensitive ammeter. In nanopore strand-sequencing, the stepping rate is usually 30 bases per second.

<img src="img/nanoP_1.png" alt="Nanopore sequencing" width="500"/>

<font size="2"> Nanopore sequencing - how it works. _© Deamer et al., 2016_ </font>

## Session 01 | Quality assessment/control of "raw" sequencing data

OK, with our sequencing knowledge refreshed, we now want to get our hands dirty with some data. To start, we want to have a look at three exemplary sequence datasets and assess their quality. How do we do that? In the `workshop` folder, you will find a folder called `01_QAQC`. In this folder, you should see three pairs of sequence data files.

```bash
### Checking out exemplary sets of sequence data
cd ~/data/workshop/01_QAQC
ls -lrth
head example_1_R1.fastq
```

`head` is a command that prints the first lines of a text file (ten by default).

OK, we do see a couple of `.fastq` files, in fact three pairs of files `*{R1,R2}.fastq`. We see pairs because in all three examples the template DNA has been sequenced in paired-end mode. That means DNA fragments were sequenced from both ends.

What are `.fastq` files? You can best imagine them as `.fasta` files on steroids. 

Quick reminder, `.fasta` files are nothing but `.txt` files that are formatted in a particular way.

```
>header_string optional_information_string
ATGTCACACACACTAGATACTATAGA
```
In comparison, `.fastq` files contain more information.

![.fastq explained](img/fastq_fig.jpg)  
<font size="2"> Explanation of the .fastq file format. © Robert Edgar, _drive5.com_</font>

For every sequence there are four lines. The first one is a header that contains a unique sequence ID, the second is the sequence itself, the third one is a separator line marked by `+`, and the fourth one is the quality information derived from basecalling, the so-called _Q-score_, with one character per base.
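Since every read occupies exactly four lines, the number of reads in a `.fastq` file is simply the number of lines divided by four. The following is a minimal sketch of that idea; the helper `count_reads` and the tiny demo files are made up for illustration, and the sketch assumes that sequences are not wrapped over multiple lines.

```bash
# Hypothetical helper: a FASTQ record spans four lines, so reads = lines / 4
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Tiny made-up paired-end example written to a temporary directory
tmp=$(mktemp -d)
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCA\n+\nIII#\n' > "$tmp/demo_R1.fastq"
printf '@read1/2\nTTGA\n+\nIIII\n@read2/2\nCCAT\n+\nII#I\n' > "$tmp/demo_R2.fastq"

r1=$(count_reads "$tmp/demo_R1.fastq")
r2=$(count_reads "$tmp/demo_R2.fastq")
echo "R1: $r1 reads, R2: $r2 reads"

# In paired-end data, both files should hold the same number of reads
[ "$r1" -eq "$r2" ] && echo "read counts match"
```

Counting lines is more robust than counting lines that start with `@`, because `@` is also a valid quality character and can therefore show up at the start of a quality line.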

The Q-score (Q) encodes the probability (P) that a basecall is incorrect and is stored as ASCII characters.

P = 10<sup>-Q/10</sup>

Q = -10 log<sub>10</sub>(P)

The chart below gives you an idea of what the different characters stand for. `!`, for instance, refers to a Q-score of 0, which translates into a 100% probability that the basecall is incorrect. A Q-score of 10 translates into an error probability of 10%, a Q-score of 20 into 1%.

![Q-scores](img/qscores.gif)  
<font size="2"> What are Q-scores? © Robert Edgar, _drive5.com_</font>
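The two formulas and the character encoding can be tried out directly on the command line. The helpers below (`q_to_p`, `p_to_q`, `char_to_q`) are hypothetical names used only for this sketch. The character decoding assumes an offset of 33, which follows from `!` (ASCII code 33) standing for a Q-score of 0.

```bash
# Q-score -> error probability: P = 10^(-Q/10)
q_to_p() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Error probability -> Q-score: Q = -10 * log10(P)
p_to_q() {
  awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

# ASCII character -> Q-score, assuming an offset of 33
char_to_q() {
  echo $(( $(printf '%d' "'$1") - 33 ))
}

q_to_p 10      # 0.1   (10% error probability)
q_to_p 20      # 0.01  (1%)
q_to_p 30      # 0.001 (0.1%)
p_to_q 0.001   # 30
char_to_q '!'  # 0
char_to_q 'I'  # 40
```

Note that every step of 10 in Q-score lowers the error probability by a factor of 10.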

---
❓**QUESTION**

What do you think, what is commonly considered as "good" quality when it comes to Q-scores?!

---

### Hands-on
OK, we now want to assess the quality of these three exemplary datasets. For that, we use a tool that is called [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). `FastQC` generates reports that summarize various aspects that are relevant for assessing sequence quality. Exemplary reports for good and bad sequence data sets are given [here](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [here](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html).

Have a look at them and, in parallel, at the [`FastQC` documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/), which is also available online and explains the individual panes that you see in the reports.


---
🔧**TASK**

Run FastQC over the three example datasets and inspect the resulting reports. Are we dealing with good or bad data?!

---


```bash
### Let's first call FastQC's help to figure out how to run it
fastqc -h
```

```bash
### OK, that's straightforward, we just have to specify the sequence data files that we want to check.
### Note, we use "*" as wildcard here, as a result FastQC runs over all .fastq files present in the
### current directory.
### You could also run FastQC six times, over each file individually, but who wants that?!
cd ~/workshop/01_QAQC
pwd
fastqc *.fastq
```

```bash
# Let's see how the content of our current working directory changed.
ls -lrth
```

For every `.fastq` file we obtained two files, a `.zip` archive and an `.html` file. The latter is the report generated by `FastQC`, which we will now check out.

You can open the `.html` reports here directly in your browser, just _CLICK_ them in the file browser on the left (folder symbol).
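If you prefer the command line, the `.zip` archives are useful as well: each one contains a tab-separated `summary.txt` that lists a PASS/WARN/FAIL verdict per analysis module. The sketch below tallies these verdicts. The helper `tally_fastqc` is a made-up name, and the demo file is fabricated so that the example runs on its own.

```bash
# Hypothetical helper: tally PASS/WARN/FAIL verdicts in a FastQC summary.txt
# (column 1 = verdict, column 2 = module, column 3 = file name)
tally_fastqc() {
  cut -f1 "$1" | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Fabricated summary.txt, only meant to illustrate the layout
tmp=$(mktemp -d)
printf 'PASS\tBasic Statistics\tdemo.fastq\n'            >  "$tmp/summary.txt"
printf 'WARN\tPer base sequence content\tdemo.fastq\n'   >> "$tmp/summary.txt"
printf 'FAIL\tAdapter Content\tdemo.fastq\n'             >> "$tmp/summary.txt"
printf 'PASS\tPer base N content\tdemo.fastq\n'          >> "$tmp/summary.txt"

tally_fastqc "$tmp/summary.txt"
```

For a real report, the file can be pulled out of the archive first, for instance with `unzip -p example_1_R1_fastqc.zip example_1_R1_fastqc/summary.txt > summary.txt`, assuming the archive follows FastQC's usual `<name>_fastqc.zip` naming.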

### A closer look at Example (1)

We only look at the report for the file containing the forward reads `example_1_R1.fastq`.


`FastQC` reports always begin with basic statistics regarding the corresponding file. The statistics are followed by a bunch of informative plots that summarize the checks carried out by `FastQC`.

#### Per base sequence quality

![Per base sequence quality Example 1](img/per_base_seq_qual.png)

This plot shows us the quality score distribution for all (forward) sequences of the dataset over the whole sequence length. This score should be as high as possible and only start to drop at the end of the sequence. This drop in quality at the end of the sequence is more pronounced when looking at reverse reads, which is simply a consequence of the sequencing reagents being increasingly worn out by the time the reverse reads are sequenced.
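To get a feeling for what is behind this plot, the following sketch computes the mean Q-score per read position from the quality lines of a `.fastq` file. It is a simplification (FastQC shows box plots, not just means), the helper name `mean_quality_per_position` is made up, and an offset of 33 for the quality characters is assumed.

```bash
# Hypothetical helper: mean Q-score per read position (offset 33 assumed)
mean_quality_per_position() {
  awk 'BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
       NR % 4 == 0 {                      # every 4th line holds the qualities
         for (pos = 1; pos <= length($0); pos++) {
           sum[pos] += ord[substr($0, pos, 1)] - 33
           n[pos]++
         }
         if (length($0) > maxlen) maxlen = length($0)
       }
       END {
         for (pos = 1; pos <= maxlen; pos++)
           printf "%d\t%.1f\n", pos, sum[pos] / n[pos]
       }' "$1"
}

# Two made-up reads whose quality drops towards the 3-prime end
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nII5!\n@r2\nACGT\n+\nI5!!\n' > "$tmp/demo.fastq"
mean_quality_per_position "$tmp/demo.fastq"
```

In this toy example, the mean Q-score falls from 40 at position 1 to 0 at position 4, which mimics the drop towards the end of the reads described above.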

#### Per tile sequence quality

![Per tile sequence quality Example 1](img/per_tile_seq_qual.png)

The read headers of "real" Illumina sequencing datasets include information about the location of the respective template DNA molecule on the flowcell. The flowcell is divided into different tiles. This plot shows if there have been sudden drops in sequencing quality in certain areas of the flowcell, which can be a consequence of air bubbles or debris in the fluidic system.

#### Per sequence quality scores

![Per sequence quality scores Example 1](img/per_seq_qual_score.png)

The quality score distribution over all sequences allows you to assess if subsets of your sequences have unexpectedly low quality scores. Ideally, you want to see one sharp peak on the right.

#### Per base sequence content

![Per base sequence content Example 1](img/per_seq_base_content.png)

Here we see the proportion of each base over the length of the sequences. In the case of genomic DNA, one would expect four nearly parallel lines, with the different proportions reflecting the GC content of the original genome. The presence of sequencing primers/adapters, which artificially skew the proportions of the bases, is clearly visible in this plot and helps with finding parameters for trimming. Please note that certain library types (think of amplicon sequencing or RNAseq) will produce highly biased sequence composition patterns.

#### Per sequence GC content

![Per sequence GC content Example 1](img/per_seq_GC_content.png)

This plot summarizes the distribution of the GC content of all sequences analyzed, calculated over the whole length of each sequence. This analysis typically yields a bell-shaped normal distribution. Unexpected sharp peaks indicate the presence of contaminating/overrepresented sequences such as adapters or primers.
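The value behind each data point is easy to calculate yourself. The sketch below prints the GC content of every read in a `.fastq` file; `gc_per_read` is a hypothetical helper and the three reads are made up.

```bash
# Hypothetical helper: GC content (%) of every read in a FASTQ file
gc_per_read() {
  awk 'NR % 4 == 2 && length($0) > 0 {   # every 2nd line of a record is the sequence
         seq = toupper($0)
         gc = gsub(/[GC]/, "", seq)      # gsub returns the number of replacements
         printf "%.1f\n", 100 * gc / length($0)
       }' "$1"
}

# Three made-up reads with 50%, 100% and 0% GC
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nATAT\n+\nIIII\n' > "$tmp/demo.fastq"
gc_per_read "$tmp/demo.fastq"
```

Piping the output into `sort -n | uniq -c` gives a rough text version of the distribution shown in the plot.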

#### Per base N content

![Per base N content Example 1](img/per_base_N.png)

Here we see the percentage of failed base calls over the length of all sequences.

#### Sequence length distribution

![Sequence length distribution Example 1](img/seq_length_dist.png)

Illumina sequencing platforms generate reads of uniform length. As a result, we typically see one sharp peak. After sequence data pre-processing, the sequence length is usually no longer uniform.
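A quick way to check the read lengths without FastQC is to tabulate them yourself. This is a minimal sketch with a made-up helper name (`length_distribution`) and made-up reads, one of which is shorter, as it could be after trimming.

```bash
# Hypothetical helper: table of read length and number of reads with that length
length_distribution() {
  awk 'NR % 4 == 2 { n[length($0)]++ }
       END { for (l in n) print l "\t" n[l] }' "$1" | sort -n
}

# Three made-up reads, one of them shorter than the others
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n@r3\nGGCC\n+\nIIII\n' > "$tmp/demo.fastq"
length_distribution "$tmp/demo.fastq"
```

For untouched Illumina data, you would expect this table to have a single row.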

#### Sequence duplication levels

![Sequence duplication levels Example 1](img/seq_dupl.png)

This analysis yields insights into the duplication level of every sequence in a dataset. In a random library, the degree of duplication is usually low, while levels are significantly higher when looking for instance at amplicon sequencing datasets.
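The basic idea can be sketched by counting how many reads are exact copies of a read that was already seen. Please treat this as a simplified illustration, not as FastQC's own method, so the numbers will not match the report exactly. The helper name `duplicated_fraction` and the reads are made up.

```bash
# Hypothetical helper: percentage of reads that are exact copies of an earlier read
duplicated_fraction() {
  awk 'NR % 4 == 2 { seen[$0]++; total++ }
       END {
         for (s in seen) distinct++
         if (total > 0) printf "%.1f\n", 100 * (total - distinct) / total
       }' "$1"
}

# Four made-up reads, the sequence ACGT occurs three times
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nGGCC\n+\nIIII\n@r4\nACGT\n+\nIIII\n' > "$tmp/demo.fastq"
dup=$(duplicated_fraction "$tmp/demo.fastq")
echo "duplicated reads: ${dup}%"
```

Here two of the four reads are copies of an earlier read, so the sketch reports 50.0%.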

#### Adapter content

![Adapter content Example 1](img/overrep_seqs.png)

Pretty self-explanatory: adapters and other artificial sequences introduced during library preparation are potentially still present.

---
🔧**TASK**

Check out the FastQC reports for the other two examples, how do they compare to example (1)?

---


---
🔓**SUMMARY**

* 💡 we refreshed our NGS knowledge, 
* 📜 recapped the differences between .fasta and .fastq files,
* 🔍 took a look at the concept behind quality scores,
* 📄 and generated and looked at QA/QC reports from three example datasets!
  
--- 

<sub> © Carl-Eric Wegner, 2023-08 </sub>