# **Generation of Read Pileups**

---

First, let's download the data for this notebook:

In [None]:
! wget http://www.tnt.uni-hannover.de/edu/vorlesungen/AMLG/data/read-pileups.tar.gz
! tar -xzvf read-pileups.tar.gz
! mv -v read-pileups/ data/
! rm -v read-pileups.tar.gz

# DNA sequencing simulation

By randomly sampling substrings from a larger reference sequence, we can simulate an error-free DNA sequencing process.


❓&nbsp;**Q1.1**&nbsp;&mdash;&nbsp;
Complete the function `sample_reads(reference_sequence, n_reads, min_read_len, max_read_len)` to sample `n_reads` reads from `reference_sequence` with a minimum/maximum read length of `min_read_len`/`max_read_len`.
The function shall return a list of dictionaries, with one dictionary per read.
A single read dictionary contains the 0-based mapping position of the read on the reference sequence (key `'pos'`) and the read sequence (key `'seq'`).

Example dictionary entry:

```
{'pos': 4, 'seq': 'TTTCATTCTGACTGCAACGGGCAATA'}
```
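
One possible approach, shown here only as a sketch: draw the read length first, then draw a start position such that the read fits entirely inside the reference. The function name `sample_reads_sketch` and the `seed` parameter are illustrative additions that are not part of the exercise, and the sketch assumes `min_read_len` does not exceed the length of the reference.

```python
import random


def sample_reads_sketch(reference_sequence, n_reads, min_read_len, max_read_len, seed=None):
    """Sample error-free reads as random substrings of the reference.

    `seed` is a hypothetical extra parameter that makes runs reproducible.
    """
    rng = random.Random(seed)  # local generator; does not touch the global state
    # A read can never be longer than the reference itself.
    max_read_len = min(max_read_len, len(reference_sequence))
    reads = []
    for _ in range(n_reads):
        # Draw the length first (randint includes both bounds) ...
        read_len = rng.randint(min_read_len, max_read_len)
        # ... then a 0-based start position such that the read fits entirely.
        pos = rng.randint(0, len(reference_sequence) - read_len)
        reads.append({"pos": pos, "seq": reference_sequence[pos:pos + read_len]})
    return reads


toy_reference = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTG"
toy_sampled_reads = sample_reads_sketch(toy_reference, n_reads=5, min_read_len=5, max_read_len=10, seed=0)
```

Drawing the length before the position keeps every read inside the reference; other sampling schemes (for example, clipping reads at the end of the reference) are equally valid but give a different depth profile near the edges.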

In [None]:
import random


def sample_reads(reference_sequence, n_reads, min_read_len, max_read_len):
    reads = []

    for _ in range(n_reads):
        # YOUR CODE

    return reads

❓&nbsp;**Q1.2**&nbsp;&mdash;&nbsp;
Sample 20 reads&nbsp;&mdash;&nbsp;including their mapping positions&nbsp;&mdash;&nbsp;from the reference sequence

```
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
```

(which is the start of the _E. coli_ genome provided in this notebook).

Set the other parameters as follows: `n_reads=20`, `min_read_len=15`, `max_read_len=40`.

In [None]:
# YOUR CODE

❓&nbsp;**Q1.3**&nbsp;&mdash;&nbsp;
Print the reference sequence and all sampled reads such that the reads visually align with the reference sequence.

Example:

```
Reference: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTG...
Read    0:    TTTTCATTCTGACTGCAACGGGCAA
Read    1:             TGACTGCAACGGGCAATATGTC
```
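
The alignment in the example can be produced by padding each read with one space per reference base that precedes it. The sketch below is one way to do this; the helper name `format_alignment` is hypothetical, and the label widths are chosen so that `Reference: ` and `Read    0: ` are equally long.

```python
def format_alignment(reference_sequence, reads):
    """Return the lines of a plain-text reference/read alignment view."""
    lines = [f"Reference: {reference_sequence}"]
    for i, read in enumerate(reads):
        # "Read " + 4-character index + ": " is 11 characters, like "Reference: ".
        label = f"Read {i:4d}: "
        # One space of padding per reference base in front of the read.
        lines.append(label + " " * read["pos"] + read["seq"])
    return lines


example_reference = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTG"
example_reads = [
    {"pos": 3, "seq": "TTTTCATTCTGACTGCAACGGGCAA"},
    {"pos": 12, "seq": "TGACTGCAACGGGCAATATGTC"},
]
alignment_lines = format_alignment(example_reference, example_reads)
for line in alignment_lines:
    print(line)
```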

In [None]:
# YOUR CODE

# Pileups, sequencing depth, and coverage

❓&nbsp;**Q2.1**&nbsp;&mdash;&nbsp;
Complete the function `compute_pileup_sizes(reference_sequence, reads)` and use it to compute the pileup size (i.e., sequencing depth) at each position of the reference sequence.
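
As a sketch of the idea: a read starting at 0-based position `pos` with length `n` covers the reference positions `pos` to `pos + n - 1`, so each read adds 1 to that slice of the depth array. The function name `compute_pileup_sizes_sketch` and the toy data are illustrative, and the integer `dtype` is a choice made here, not something the exercise prescribes.

```python
import numpy as np


def compute_pileup_sizes_sketch(reference_sequence, reads):
    """Count, for every reference position, how many reads overlap it."""
    pileup_sizes = np.zeros(len(reference_sequence), dtype=int)
    for read in reads:
        start = read["pos"]
        end = start + len(read["seq"])  # exclusive end, as in Python slicing
        pileup_sizes[start:end] += 1
    return pileup_sizes


toy_reads = [
    {"pos": 0, "seq": "ACG"},
    {"pos": 2, "seq": "GTA"},
    {"pos": 2, "seq": "GT"},
]
toy_pileup = compute_pileup_sizes_sketch("ACGTAC", toy_reads)
print(toy_pileup)  # [1 1 3 2 1 0]
```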

In [None]:
import numpy as np


def compute_pileup_sizes(reference_sequence, reads):
    pileup_sizes = np.zeros(len(reference_sequence))

    # YOUR CODE

    return pileup_sizes


# YOUR CODE

❓&nbsp;**Q2.2**&nbsp;&mdash;&nbsp;
Complete the function `coverage(pileup_sizes)` and use it to compute the coverage (i.e., the average sequencing depth).
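
Since the coverage is defined here as the average sequencing depth, it is the sum of all pileup sizes divided by the number of reference positions. A minimal sketch, with the hypothetical name `coverage_sketch` and made-up pileup sizes:

```python
def coverage_sketch(pileup_sizes):
    """Average sequencing depth over all reference positions."""
    return sum(pileup_sizes) / len(pileup_sizes)


toy_pileup_sizes = [1, 1, 3, 2, 1, 0]  # made-up depths for a 6-base reference
toy_coverage = coverage_sketch(toy_pileup_sizes)
print(f"Coverage: {toy_coverage:.2f}")  # Coverage: 1.33
```

Equivalently, the coverage is the total number of sequenced bases divided by the reference length, because every base of every read contributes exactly 1 to exactly one pileup.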

In [None]:
def coverage(pileup_sizes):
    # YOUR CODE


print(f"Coverage: {coverage(pileup_sizes=pileup_sizes):.2f}")

❓&nbsp;**Q2.3**&nbsp;&mdash;&nbsp;
Plot the pileups and the coverage using `matplotlib.pyplot.bar()` for the pileups and `matplotlib.pyplot.axhline()` for the coverage.
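
A rough sketch of such a plot, assuming one bar per reference position and a dashed horizontal line at the mean depth. The function name `plot_pileups_sketch`, the colors, and the axis labels are illustrative choices, not requirements of the exercise.

```python
import matplotlib.pyplot as plt


def plot_pileups_sketch(pileup_sizes):
    """Bar plot of the depth per position plus a line at the average depth."""
    fig = plt.figure(figsize=(10, 4))
    positions = list(range(len(pileup_sizes)))
    # width=1.0 makes neighboring bars touch, which reads well for long references.
    plt.bar(positions, pileup_sizes, width=1.0, label="Pileup size")
    mean_depth = sum(pileup_sizes) / len(pileup_sizes)
    plt.axhline(mean_depth, color="red", linestyle="--", label=f"Coverage ({mean_depth:.2f})")
    plt.xlabel("Reference position")
    plt.ylabel("Sequencing depth")
    plt.legend()
    return fig


toy_figure = plot_pileups_sketch([1, 1, 3, 2, 1, 0])
```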

In [None]:
import matplotlib.pyplot as plt


def plot_pileups(reference_sequence, pileup_sizes):
    fig = plt.figure(figsize=(10, 4))
    # YOUR CODE


plot_pileups(reference_sequence=reference_sequence, pileup_sizes=pileup_sizes)

# Simulating the sequencing of an _E. coli_ genome

For a more realistic example, we provide the reference genome of _E. coli_ strain DH10B as the FASTA file `e-coli-dh10b.fasta`.

> _E. coli_ is a bacterium that is commonly found in the lower intestine of warm-blooded organisms.
> It has a circular DNA molecule approximately 4.6 million base pairs in length, containing more than 4000 protein-coding genes (organized into more than 2500 operons), and several ribosomal RNA (rRNA) operons as well as dozens of transfer RNA (tRNA) genes.

> The [FASTA format](https://en.wikipedia.org/wiki/FASTA_format) is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.
>
> The format allows for sequence names and comments to precede the sequences.
> It originated from the [FASTA software package](https://en.wikipedia.org/wiki/FASTA), but has become a de-facto standard.
>
> A sequence begins with a greater-than character (`>`) followed by a description of the sequence (all in a single line).
> The lines immediately following the description line are the sequence representation, with one letter per amino acid or nucleotide.
> An example of a multiple sequence FASTA file follows.
>
> ```
> >SEQUENCE_1
> MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
> LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVEN
> >SEQUENCE_2
> SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
> ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
> ```

❓&nbsp;**Q3.1**&nbsp;&mdash;&nbsp;
Complete the function `read_fasta_file(file_path)` to read a FASTA file into a dictionary.
In the dictionary, the sequence descriptions are the keys, and the actual sequences are the values.
Use the function to read in `e-coli-dh10b.fasta`.
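
The parsing logic can be sketched on an in-memory example before touching the real file: a line starting with `>` opens a new record, and every other line is appended to the current record. The name `parse_fasta_sketch` and the toy input are illustrative; the sketch takes any iterable of lines (such as an open file) and tolerates blank lines, which is an assumption about the input rather than a requirement of the format.

```python
import io


def parse_fasta_sketch(lines):
    """Map each description line (without '>') to its concatenated sequence."""
    chunks_by_description = {}
    description = None
    for line in lines:
        line = line.strip()  # drop the trailing newline and stray whitespace
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            description = line[1:]
            chunks_by_description[description] = []
        elif description is not None:
            chunks_by_description[description].append(line)
    # Join once at the end instead of concatenating strings line by line.
    return {name: "".join(chunks) for name, chunks in chunks_by_description.items()}


toy_fasta = io.StringIO(">SEQUENCE_1\nMTEITAAM\nLVSVKV\n>SEQUENCE_2\nSATVSEIN\n")
toy_records = parse_fasta_sketch(toy_fasta)
print(toy_records)
```

Collecting the sequence lines in a list and joining them once avoids repeated string concatenation, which matters for a genome spread over tens of thousands of lines.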

In [None]:
def read_fasta_file(file_path):
    with open(file=file_path, mode="r") as file:
        sequences = {}

        # Read the file line by line
        for line in file:
            # Process each line
            # YOUR CODE

        return sequences


# YOUR CODE

print(f"Read E. coli genome with length {len(ecoli_genome_sequence):,}.")

❓&nbsp;**Q3.2**&nbsp;&mdash;&nbsp;
Now do the following for the _E. coli_ genome:

1. Truncate the genome to a length of 1000 bases.
2. Sample 1000 reads with a minimum/maximum length of 100/250.
3. Compute the pileup sizes.
4. Plot the pileup sizes and the coverage.
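
As a sanity check for the result, the expected coverage can be estimated in advance. The arithmetic below assumes that read lengths are drawn uniformly from the integers 100 to 250 and that every read lies entirely within the truncated genome; with a different sampling scheme the number will differ.

```python
genome_len = 1000
n_reads = 1000
min_read_len, max_read_len = 100, 250

# Mean of a uniformly distributed integer length in [min_read_len, max_read_len].
mean_read_len = (min_read_len + max_read_len) / 2

# Expected total number of sequenced bases divided by the genome length.
expected_coverage = n_reads * mean_read_len / genome_len
print(f"Expected coverage: {expected_coverage:.1f}")  # Expected coverage: 175.0
```

A measured coverage far from this value would hint at a bug in the sampling or pileup code.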

In [None]:
# YOUR CODE