# **Generation of Read Pileups**

First, let's download the data for this notebook.

In [None]:
import os
import sys

# Append the root of the Git repository to the path.
git_root = os.popen(cmd="git rev-parse --show-toplevel").read().strip()
sys.path.append(git_root)

In [None]:
from utils import download_file

# Create the data directory if it does not exist yet.
os.makedirs(name="data", exist_ok=True)

download_file(
    # url="https://dataverse.harvard.edu/api/access/datafile/10494346",
    url="https://seafile.cloud.uni-hannover.de/d/5d6029c6eaaf410c8b01/files/?p=%2Fread_pileups%2Fe-coli-dh10b.fasta&dl=1",
    save_filename="data/e-coli-dh10b.fasta",
)

In [None]:
from typing import Dict, List

import numpy.typing as npt

# DNA Sequencing Simulation

By randomly sampling substrings from a larger reference sequence, we can simulate an error-free DNA sequencing process.

##### ❓ Sampling reads from a reference sequence

Complete the function `sample_reads()` so that it samples `n_reads` reads from the string `reference_sequence`, each with a length between `min_read_len` and `max_read_len`.
Store each read in a dictionary with two key-value pairs: the 0-based mapping position of the read on the reference sequence (key `'pos'`) and the read sequence (key `'seq'`).
The function should return a list of `n_reads` such dictionaries.

This is what an example dictionary might look like:

```
{'pos': 4, 'seq': 'TTTCATTCTGACTGCAACGGGCAATA'}
```
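
A single read can be drawn in two steps: pick a length, then pick a start position that keeps the whole read inside the reference. The sketch below illustrates this under assumptions the exercise does not fix: lengths and positions are drawn uniformly, both length bounds are inclusive, and the helper name `draw_one_read` and its `rng` parameter are hypothetical.

```python
import random
from typing import Dict, Optional


def draw_one_read(
    reference_sequence: str,
    min_read_len: int,
    max_read_len: int,
    rng: Optional[random.Random] = None,
) -> Dict:
    """Draw one read (hypothetical helper, not part of the notebook)."""
    rng = rng or random.Random()
    # Assumption: a read can never be longer than the reference itself.
    longest = min(max_read_len, len(reference_sequence))
    read_len = rng.randint(min_read_len, longest)  # both bounds inclusive
    # The last valid 0-based start position leaves room for the whole read.
    pos = rng.randint(0, len(reference_sequence) - read_len)
    return {"pos": pos, "seq": reference_sequence[pos : pos + read_len]}
```

Passing a seeded `random.Random` instance makes the draw reproducible, which can help when comparing printouts between runs.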

In [None]:
import random


def sample_reads(
    reference_sequence: str, n_reads: int, min_read_len: int, max_read_len: int
) -> List[Dict]:
    """Sample reads from a reference sequence."""
    reads = []

    # YOUR CODE

    return reads

##### ❓ Sampling reads from a reference sequence

Now use the function `sample_reads()` to sample 20 reads from the following reference sequence:

```
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
```

Set the parameters as follows: `n_reads=20`, `min_read_len=15`, `max_read_len=40`.
Store the result in a variable named `reads`, which later cells expect.

In [None]:
reference_sequence = (
    "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
)

# YOUR CODE

##### ❓ Printing aligned reads

Now print the reference sequence and all sampled reads such that the reads visually align with the reference sequence.

This is what the printout should look like:

```
Reference: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTG...
Read    0:    TTTTCATTCTGACTGCAACGGGCAA
Read    1:             TGACTGCAACGGGCAATATGTC
```
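
One way to get this alignment is to give every line a label of the same width and then indent each read by its mapping position. The following is a minimal sketch of that idea; the helper name `format_read_line` is hypothetical, and the label format is only inferred from the example printout above.

```python
from typing import Dict


def format_read_line(index: int, read: Dict) -> str:
    """Format one read so it lines up under a 'Reference: ' label."""
    # "Read " + a 4-character index + ": " is 11 characters wide,
    # the same width as the label "Reference: ".
    label = f"Read {index:>4}: "
    # Indent by the 0-based position so each base sits under its match.
    return label + " " * read["pos"] + read["seq"]


print("Reference: AGCTTTTCATTCTGACTG")
print(format_read_line(0, {"pos": 3, "seq": "TTTTCATT"}))
```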

In [None]:
# YOUR CODE

# Pileups, Sequencing Depth, Coverage

##### ❓ Computing read pileups

Complete the function `compute_pileup_sizes()` and use it to compute the pileup size (i.e., sequencing depth) at each position of the reference sequence.
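
The counting idea can be illustrated on a toy input: every read adds one to each reference position it covers. The sketch below assumes reads in the dictionary format described earlier; the helper name `count_depth` and the two toy reads are made up for illustration.

```python
import numpy as np


def count_depth(reference_length: int, reads: list) -> np.ndarray:
    """Count how many reads cover each position (hypothetical helper)."""
    depth = np.zeros(reference_length)
    for read in reads:
        start = read["pos"]
        end = start + len(read["seq"])  # exclusive end of the read
        depth[start:end] += 1  # one more read covers these positions
    return depth


toy_reads = [{"pos": 0, "seq": "ACGT"}, {"pos": 2, "seq": "GTAC"}]
print(count_depth(reference_length=8, reads=toy_reads))
```

The two toy reads overlap at positions 2 and 3, so the depth there is 2, while positions 6 and 7 are not covered at all.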

In [None]:
import numpy as np


def compute_pileup_sizes(
    reference_sequence: str, reads: List[Dict]
) -> npt.NDArray[np.float64]:
    """Compute the pileup sizes for a set of reads."""
    pileup_sizes = np.zeros(shape=len(reference_sequence))

    # YOUR CODE

    return pileup_sizes


pileup_sizes = compute_pileup_sizes(reference_sequence=reference_sequence, reads=reads)

print(pileup_sizes)

##### ❓ Computing the coverage

Complete the function `coverage()` and use it to compute the coverage (i.e., the average sequencing depth) across all positions of the reference sequence.
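
Written out, with symbols that are assumptions of this note rather than notation from the notebook ($G$ for the reference length, $d_i$ for the pileup size at position $i$, $N$ for the number of reads, and $\ell_r$ for the length of read $r$):

$$
C = \frac{1}{G} \sum_{i=0}^{G-1} d_i = \frac{1}{G} \sum_{r=1}^{N} \ell_r
$$

The second equality holds because every base of a read contributes exactly one count to exactly one position, provided each read lies entirely within the reference.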

In [None]:
def coverage(pileup_sizes: npt.NDArray[np.float64]) -> float:
    """Compute the coverage of a set of pileup sizes."""
    # YOUR CODE


# Use fixed-point formatting; a bare ".2" switches to scientific notation for values >= 100.
print(f"Coverage: {coverage(pileup_sizes=pileup_sizes):.2f}")

##### ❓ Visualizing pileups

Plot the pileups and the coverage using `matplotlib.pyplot.bar()` for the pileups and `matplotlib.pyplot.axhline()` for the coverage.
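
As a rough illustration of how the two plot elements combine, the sketch below draws a toy depth array with the Axes methods `bar()` and `axhline()`, which are the object-oriented counterparts of the `pyplot` functions named above. The helper name `draw_depth_bars`, the toy values, and the styling choices are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np


def draw_depth_bars(depth):
    """Bar chart of per-position depth plus a mean line (hypothetical helper)."""
    depth = np.asarray(depth, dtype=float)
    fig, ax = plt.subplots(figsize=(10, 4))
    # One bar per reference position; width=1.0 leaves no gaps between bars.
    ax.bar(np.arange(len(depth)), depth, width=1.0)
    # Horizontal line at the mean depth, i.e. the coverage.
    ax.axhline(depth.mean(), color="red", linestyle="--", label="Coverage")
    ax.set_xlabel("Position on reference")
    ax.set_ylabel("Pileup size")
    ax.legend()
    return ax


ax = draw_depth_bars([1, 1, 2, 2, 1, 1, 0, 0])
```

In a notebook, the figure is displayed once the cell finishes; in a script, call `plt.show()` afterwards.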

In [None]:
import matplotlib.pyplot as plt


def plot_pileups(
    reference_sequence: str, pileup_sizes: npt.NDArray[np.float64]
) -> None:
    """Plot the pileup sizes."""
    plt.figure(figsize=(10, 4))
    # YOUR CODE
    plt.show()


plot_pileups(reference_sequence=reference_sequence, pileup_sizes=pileup_sizes)

# Simulating the Sequencing of an _E. coli_ Genome

For a more realistic example, we provide the reference genome of _E. coli_ strain DH10B as the FASTA file `e-coli-dh10b.fasta` (in the `data/` folder).

> _E. coli_ is a bacterium that is commonly found in the lower intestine of warm-blooded organisms.
> It has a circular DNA molecule approximately 4.6 million base pairs in length, containing more than 4000 protein-coding genes (organized into more than 2500 operons), several ribosomal RNA (rRNA) operons, and dozens of transfer RNA (tRNA) genes.

> The [FASTA format](https://en.wikipedia.org/wiki/FASTA_format) is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.
>
> The format allows for sequence names and comments to precede the sequences.
> It originated from the [FASTA software package](https://en.wikipedia.org/wiki/FASTA), but has become a de-facto standard.
>
> A sequence begins with a greater-than character (`>`) immediately followed by a description of the sequence (all in a single line).
> The lines immediately following the description line hold the sequence itself, with one letter per amino acid or nucleotide.
> An example of a multi-sequence FASTA file follows.
>
> ```
> >SEQUENCE_1
> MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
> LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVEN
> >SEQUENCE_2
> SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
> ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
> ```

##### ❓ Reading FASTA files

Complete the function `read_fasta_file()` to read a FASTA file into a dictionary.
In the dictionary, the sequence descriptions (without the leading `>`) are the keys, and the sequences (with all of their lines concatenated) are the values.
Use the function to read in `e-coli-dh10b.fasta`.
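
The parsing rule from the format description can be tried on in-memory lines before touching a file. The following is a minimal sketch, assuming well-formed input; the helper name `parse_fasta_lines` and the toy records are hypothetical, and the handling of blank lines is a choice made here, not something the format description above prescribes.

```python
from typing import Dict, Iterable, List


def parse_fasta_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {description: sequence}."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A description line starts a new record; drop the leading ">".
            current = line[1:]
            chunks[current] = []
        elif current is not None:
            # A sequence line belongs to the most recent description.
            chunks[current].append(line)
    # Join the pieces once at the end instead of growing strings in the loop.
    return {desc: "".join(parts) for desc, parts in chunks.items()}


toy_lines = [">seq1 toy record", "ACGT", "TTGA", ">seq2", "GG"]
print(parse_fasta_lines(toy_lines))
```

Collecting the lines in a list and joining them once avoids repeated string concatenation, which matters for a genome that spans tens of thousands of lines.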

In [None]:
def read_fasta_file(file_path: str) -> Dict[str, str]:
    """Read a FASTA file."""
    sequences = {}

    with open(file=file_path, mode="r") as file:
        current_description = None
        for line in file:
            line = line.strip()
            # YOUR CODE

    return sequences


ecoli_genome = read_fasta_file(file_path="data/e-coli-dh10b.fasta")
ecoli_genome_sequence = ecoli_genome[
    "NZ_CP110018.1 Escherichia coli strain DH10B chromosome, complete genome"
]

print(f"Read E. coli genome with length {len(ecoli_genome_sequence):,}.")

##### ❓ Visualizing _E. coli_ pileups

Now do the following for the _E. coli_ genome:

1. Truncate the genome to a length of 1000 bases.
2. Sample 1000 reads with a minimum/maximum length of 100/250.
3. Compute the pileup sizes.
4. Plot the pileup sizes and the coverage.
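
A quick back-of-the-envelope check for these settings: if read lengths are drawn uniformly between the two bounds (an assumption, since the sampling distribution is left to the implementation), the expected coverage is the number of reads times the mean read length, divided by the truncated genome length.

```python
n_reads = 1000
min_read_len, max_read_len = 100, 250
genome_len = 1000  # length after truncation

# Mean of a uniform distribution over [min_read_len, max_read_len].
mean_read_len = (min_read_len + max_read_len) / 2
expected_coverage = n_reads * mean_read_len / genome_len

print(f"Expected coverage: {expected_coverage:.1f}")  # prints "Expected coverage: 175.0"
```

A measured coverage far from this value would hint at a problem in the sampling or counting code.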

In [None]:
# YOUR CODE