# **[Project] DNA Sequence Prediction for Compression**

# Introduction

We will use data from Illumina's [Platinum Genomes](https://emea.illumina.com/platinumgenomes.html) project.

In this project, Illumina performed whole-genome sequencing (WGS) of the 17-member [CEPH pedigree 1463](https://www.coriell.org/0/Sections/Collections/NIGMS/CEPHFamiliesDetail.aspx?PgId=441&fam=1463&coll=GM) (17 human individuals, 3 generations) on Illumina HiSeq systems to provide a set of high-accuracy human WGS data.

> See the [Platinum Genomes manuscript](http://dx.doi.org/10.1101/gr.210500.116) for a full description of the project.

In particular, Illumina has sequenced the individuals NA12877 and NA12878 to 200x depth on a HiSeq 2000 system.
These data are available via the European Nucleotide Archive (ENA) under accession code [PRJEB3246](https://www.ebi.ac.uk/ena/browser/view/PRJEB3246).

> The PRJEB3246 data are also part of the [MPEG-G Genomic Information Database](https://mpeg.chiariglione.org/standards/mpeg-g/genomic-information-representation/mpeg-g-genomic-information-database-4) (ID 01).

We will only use a subset of the PRJEB3246 data.
We extracted the first 10,000,000 records (i.e., the first 40,000,000 lines) from two FASTQ files (`ERR174310_1.fastq` and `ERR174310_2.fastq`) generated from NA12877.

# Data access

The download from the TNT homepage is straightforward:

In [None]:
! wget http://www.tnt.uni-hannover.de/edu/vorlesungen/AMLG/data/project-dna-sequence-prediction.tar.gz
! tar -xzvf project-dna-sequence-prediction.tar.gz
! mv -v project-dna-sequence-prediction/ data/
! rm -v project-dna-sequence-prediction.tar.gz

In the `data/` folder you will now find two files: `reads_1.fastq.gz` and `reads_2.fastq.gz`.
First, you need to decompress the files:

In [None]:
! gunzip data/reads_1.fastq.gz
! gunzip data/reads_2.fastq.gz

Next, you can verify that each file contains the expected 40,000,000 lines, i.e., 10,000,000 FASTQ records of four lines each:

In [None]:
! wc -l data/reads_1.fastq
! wc -l data/reads_2.fastq
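
The same check can be sketched in Python by counting lines and dividing by four. The helper name `count_fastq_records` is an assumption introduced here for illustration and is not part of the project material. The sketch also assumes that every record spans exactly four lines, i.e., that sequences are not wrapped across multiple lines.

```python
from pathlib import Path


def count_fastq_records(file_path: str) -> int:
    """Count the records in a FASTQ file, assuming four lines per record."""
    with open(file_path, mode="r") as file:
        n_lines = sum(1 for _ in file)

    # A line count that is not a multiple of four hints at a truncated file.
    if n_lines % 4 != 0:
        raise ValueError(f"{file_path}: {n_lines} lines is not a multiple of 4")

    return n_lines // 4


for path in ["data/reads_1.fastq", "data/reads_2.fastq"]:
    # Skip files that have not been downloaded and decompressed yet.
    if Path(path).exists():
        print(f"{path}: {count_fastq_records(path)} records")
```

For the subset described in the introduction, each file should report 10,000,000 records.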

The [FASTQ format](https://en.wikipedia.org/wiki/FASTQ_format) is the de facto standard for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.
Each sequence letter and each quality score is encoded with a single ASCII character.

Each sequence, i.e., read, is represented by a single FASTQ record, which consists of four lines:
- The first line contains the **read identifier**. It starts with `@`. Typically, sequencing machine vendors generate read identifiers in a proprietary systematic way.
- The second line contains the **sequence**, where each symbol is represented with a single ASCII character.
- The third line starts with `+` and contains an optional **description**. Usually this line is left empty; it then only contains `+` as separator between the sequence and the quality scores.
- The fourth line contains the **quality scores**. A quality score is a value indicating the confidence in a base call.
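
To make the quality line more concrete, here is a minimal sketch of how quality characters can be turned into numbers. It assumes the common Phred+33 encoding (ASCII code minus 33); whether this offset applies to the project files should be checked against the data. The function names and the example quality string are hypothetical and only serve as illustration.

```python
from typing import List


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores (assumed offset: 33)."""
    return [ord(char) - offset for char in qual]


def error_probability(score: int) -> float:
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Hypothetical quality string, not taken from the project data.
example_qual = "II#"
scores = decode_quality(example_qual)

for char, score in zip(example_qual, scores):
    print(f"{char!r}: Q={score}, P(error)={error_probability(score):.4f}")
```

Under this assumption, a higher score means a more confident base call: each increase of 10 in the score lowers the estimated error probability by a factor of 10.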

The following function can be used to parse a FASTQ file:

In [1]:
from typing import Dict, List, Optional


def parse_fastq_file(
    file_path: str, n_records: Optional[int] = None
) -> List[Dict[str, str]]:
    """Parse a FASTQ file.

    Parameters
    ----------
    file_path : str
        The path to the FASTQ file.
    n_records : int, optional
        The number of FASTQ records to parse. The default value is 'None'; in
        this case, the entire FASTQ file will be parsed.

    Returns
    -------
    records : list[dict[str, str]]
        A list of dictionaries, where each dictionary contains one FASTQ
        record with the keys "id", "seq", "desc", and "qual".

    """

    records = []
    lines = []
    with open(file=file_path, mode="r") as file:
        for line in file:
            # Stop before reading past the requested number of records.
            if n_records is not None and len(records) >= n_records:
                break
            lines.append(line.rstrip())
            # Four consecutive lines form one complete FASTQ record.
            if len(lines) == 4:
                records.append(dict(zip(["id", "seq", "desc", "qual"], lines)))
                lines = []

    return records

Try it out:

In [None]:
records = parse_fastq_file(file_path="data/reads_1.fastq", n_records=20)

for i, record in enumerate(records):
    print(f"Record {i:2}: {record}")
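
As a quick sanity check on parsed records, the following sketch tests the structure described above (identifier starts with `@`, third line starts with `+`, one quality character per sequence symbol) and counts how often each sequence symbol occurs. The helper names `is_well_formed` and `count_symbols` and the two example records are assumptions made for illustration; they are not part of the project material.

```python
from collections import Counter
from typing import Dict, Iterable


def is_well_formed(record: Dict[str, str]) -> bool:
    """Check the basic four-line FASTQ structure of one parsed record."""
    return (
        record["id"].startswith("@")
        and record["desc"].startswith("+")
        and len(record["seq"]) == len(record["qual"])
    )


def count_symbols(sequences: Iterable[str]) -> Counter:
    """Count the occurrences of each symbol over all given sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    return counts


# Hypothetical records in the format returned by parse_fastq_file().
example_records = [
    {"id": "@example_read_1", "seq": "ACGTN", "desc": "+", "qual": "IIIII"},
    {"id": "@example_read_2", "seq": "AACCG", "desc": "+", "qual": "IIIII"},
]

assert all(is_well_formed(record) for record in example_records)

counts = count_symbols(record["seq"] for record in example_records)
for symbol, count in sorted(counts.items()):
    print(f"{symbol}: {count}")
```

Passing the `records` list from the cell above instead of `example_records` gives a first impression of the symbol alphabet in the data.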