# DNA Compression Notes

### What is DNA sequencing?

It is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. 

The advent of rapid DNA sequencing methods (e.g. Next-Generation Sequencing) has greatly accelerated biological and medical research and discovery. 

DNA sequencing generates huge volumes of data, so compression algorithms are needed to store that data and transfer it over networks efficiently.

### Common terms

- **Nucleobase**
> Nucleobases are naturally occurring compounds, which form the differentiating component of nucleotides; five bases occur in nature, three of which are common to RNA and DNA (uracil replaces thymine in RNA).
- **Nucleotide**
> Nucleotides form the structural unit (i.e., monomer) of nucleic acids. Each nucleotide comprises a sugar, a phosphate, and a base. Within RNA, and likewise within DNA, the sugar and phosphate are the same in every nucleotide, with variation arising from the choice among four different bases. Nucleotides can be divided into two classes based on the chemical structure of their bases: the purines are adenine and guanine, whereas the pyrimidines are cytosine, thymine, and uracil.
- **Genome**
> A genome is the collection of an organism's hereditary information as encoded in its DNA. For most life forms, DNA is assembled into chromosomes; an image depicting all of an individual's chromosomes (for example, those of a human male) is called a karyotype. The individuals of a species share the vast majority of their DNA (humans share about 99.9% of their genome), and so we may also refer to the collective genome of a species.
- **Gene**
> A gene is an interval (or collection of intervals) of DNA whose nucleotides are transcribed into mRNA and eventually expressed in the cell's function by being translated into protein. Because the creation of proteins determines the organism's function, genes are also viewed as units of heredity; alleles are encoded by (often slight) variations in genes. Genes form only about a quarter of the human genome.
- **DNA Sequence**
> The precise ordering of the bases (A, T, G, C) from which DNA is composed. Base pairs form naturally only between A and T and between G and C, so the base sequence of each single strand of DNA can be simply deduced from that of its partner strand.
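
Because A pairs only with T and G only with C, the partner strand can be computed mechanically. The following is a minimal sketch of that deduction; the function names `complement` and `reverse_complement` are illustrative (not taken from any library), and the code assumes an upper-case sequence over A, C, G, T only.

```python
# Watson-Crick pairing: A<->T, G<->C.
_PAIRS = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Return the base-by-base complement of an upper-case DNA string."""
    return seq.translate(_PAIRS)

def reverse_complement(seq: str) -> str:
    """Return the partner strand, read in its own 5'->3' direction."""
    return complement(seq)[::-1]

print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```

Characters outside A, C, G, T (for example `N`) pass through `str.translate` unchanged in this sketch, so ambiguity codes would need their own mapping.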

### Sanger Sequencing vs Next-Generation Sequencing

The concepts behind Sanger and next-generation sequencing (NGS) technologies are similar. In both, DNA polymerase adds fluorescent nucleotides one by one onto a growing DNA strand that is synthesized against a template. Each incorporated nucleotide is identified by its fluorescent tag.

The critical difference between Sanger sequencing and NGS is sequencing volume. While the Sanger method only sequences a single DNA fragment at a time, NGS is massively parallel, sequencing millions of fragments simultaneously per run. This high-throughput process translates into sequencing hundreds to thousands of genes at one time. NGS also offers greater discovery power to detect novel or rare variants with deep sequencing.

Features that distinguish next-gen sequencing from Sanger sequencing:
- **Highly parallel**
- **Micro scale**
- **Fast**
- **Low-cost**
- **Shorter length**

Next-generation sequencing is similar to running a very large number of tiny Sanger sequencing reactions in parallel. Thanks to this parallelization and small scale, large quantities of DNA can be sequenced much more quickly and cheaply with next-generation methods than with Sanger sequencing. 

<!-- | Year  | Cost (\$)  |
|-------|-----------:|
| 2001  | 100,000,000|
| 2015  | 1245       |  -->

### Types of DNA file formats

Genomes are commonly stored in the following file formats:
- **FASTA** 
> Stores a variable number of sequence records, and for each record it stores the sequence itself, and a sequence ID. Each record starts with a header line whose first character is >, followed by the sequence ID. The next lines of a record contain the actual sequence. In the context of nucleotide sequences, FASTA is mostly used to store reference data; that is, data extracted from a curated database.
- **FASTQ**
> Because sequencing technologies differ in how they work, the estimated probability that a given nucleotide was identified correctly varies. This is expressed as the Phred quality score. FASTA has no standardised way of encoding it. By contrast, a FASTQ record stores a quality score for each nucleotide alongside the sequence.

### FASTA

FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.

An example sequence in FASTA format is:
```
>gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
```
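
The layout described above (a `>` header line followed by one or more sequence lines) is simple enough to parse by hand. The sketch below is one possible reader using only the standard library; `read_fasta` is a hypothetical helper name and the two toy records are made up for illustration. For real work, a maintained parser such as Biopython's `SeqIO` is the safer choice.

```python
from typing import Iterable, Iterator, Tuple

def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Toy input, invented for this example.
example = """>seq1 first toy record
ACGTAC
GTTA
>seq2 second toy record
TTGGCC
"""
records = list(read_fasta(example.splitlines()))
print(records)
```

The sequence lines of each record are joined into a single string, so the line width used in the file does not affect the result.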

The nucleic acid codes are:

```
        A --> adenosine           M --> A C (amino)
        C --> cytidine            S --> G C (strong)
        G --> guanine             W --> A T (weak)
        T --> thymidine           B --> G T C
        U --> uridine             D --> G A T
        R --> G A (purine)        H --> A C T
        Y --> T C (pyrimidine)    V --> G C A
        K --> G T (keto)          N --> A G C T (any)
                                  -  gap of indeterminate length
```

The accepted amino acid codes are:
```
    A ALA alanine                         P PRO proline
    B ASX aspartate or asparagine         Q GLN glutamine
    C CYS cysteine                        R ARG arginine
    D ASP aspartate                       S SER serine
    E GLU glutamate                       T THR threonine
    F PHE phenylalanine                   U     selenocysteine
    G GLY glycine                         V VAL valine
    H HIS histidine                       W TRP tryptophan
    I ILE isoleucine                      Y TYR tyrosine
    K LYS lysine                          Z GLX glutamate or glutamine
    L LEU leucine                         X     any
    M MET methionine                      *     translation stop
    N ASN asparagine                      -     gap of indeterminate length
```

Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters. Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence. 

The nucleic acid codes supported are:

| Nucleic Acid Code | Meaning | Mnemonic |
|-------------------|----------------------------------|-------------------------|
| A | A | Adenine |
| C | C | Cytosine |
| G | G | Guanine |
| T | T | Thymine |
| U | U | Uracil |
| R | A or G | puRine |
| Y | C, T or U | pYrimidines |
| K | G, T or U | bases which are Ketones |
| M | A or C | bases with aMino groups |
| S | C or G | Strong interaction |
| W | A, T or U | Weak interaction |
| B | not A (i.e. C, G, T or U) | B comes after A |
| D | not C (i.e. A, G, T or U) | D comes after C |
| H | not G (i.e., A, C, T or U) | H comes after G |
| V | neither T nor U (i.e. A, C or G) | V comes after U |
| N | A C G T U | Nucleic acid |
| - | gap of indeterminate length |  |
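
The ambiguity codes in the table can be treated as sets of allowed bases. The sketch below encodes the DNA subset of the table (U and the gap character are left out as a simplifying assumption) and checks whether a concrete sequence is compatible with a pattern written in these codes. The names `IUPAC_DNA` and `matches` are illustrative only.

```python
# IUPAC nucleotide ambiguity codes, DNA alphabet only (U and '-' omitted).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def matches(pattern: str, seq: str) -> bool:
    """True if each base of `seq` is allowed by the code at the same position in `pattern`."""
    if len(pattern) != len(seq):
        return False
    # Lower-case input is mapped to upper-case, as the format description allows.
    return all(
        base in IUPAC_DNA[code]
        for code, base in zip(pattern.upper(), seq.upper())
    )

print(matches("GAYNR", "GATCA"))  # True
print(matches("GAYNR", "GAACA"))  # False: A is not a pyrimidine (Y)
```

A pattern character that is not in the dictionary raises `KeyError` here; a fuller version would decide explicitly how to treat gaps and U.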


The amino acid codes supported (22 amino acids and 3 special codes) are:

| Amino Acid Code | Meaning |
|-----------------|-------------------------------------|
| A | Alanine |
| B | Aspartic acid (D) or Asparagine (N) |
| C | Cysteine |
| D | Aspartic acid |
| E | Glutamic acid |
| F | Phenylalanine |
| G | Glycine |
| H | Histidine |
| I | Isoleucine |
| J | Leucine (L) or Isoleucine (I) |
| K | Lysine |
| L | Leucine |
| M | Methionine/Start codon |
| N | Asparagine |
| O | Pyrrolysine |
| P | Proline |
| Q | Glutamine |
| R | Arginine |
| S | Serine |
| T | Threonine |
| U | Selenocysteine |
| V | Valine |
| W | Tryptophan |
| Y | Tyrosine |
| Z | Glutamic acid (E) or Glutamine (Q) |
| X | any |
| * | translation stop |
| - | gap of indeterminate length |

Printing first 25 lines of `sample.fa`

```
!head -n 25 ./data/sample.fa
```

```
>sp|P32320|CDD_HUMAN Cytidine deaminase OS=Homo sapiens OX=9606 GN=CDA PE=1 SV=2
MAQKRPACTLKPECVQQLLVCSQEAKKSAYCPYSHFPVGAALLTQEGRIFKGCNIENACY
PLGICAERTAIQKAVSEGYKDFRAIAIASDMQDDFISPCGACRQVMREFGTNWPVYMTKP
DGTYIVMTVQELLPSSFGPEDLQKTQ
>sp|Q9UDY4|DNJB4_HUMAN DnaJ homolog subfamily B member 4 OS=Homo sapiens OX=9606 GN=DNAJB4 PE=1 SV=1
MGKDYYCILGIEKGASDEDIKKAYRKQALKFHPDKNKSPQAEEKFKEVAEAYEVLSDPKK
REIYDQFGEEGLKGGAGGTDGQGGTFRYTFHGDPHATFAAFFGGSNPFEIFFGRRMGGGR
DSEEMEIDGDPFSAFGFSMNGYPRDRNSVGPSRLKQDPPVIHELRVSLEEIYSGCTKRMK
ISRKRLNADGRSYRSEDKILTIEIKKGWKEGTKITFPREGDETPNSIPADIVFIIKDKDH
PKFKRDGSNIIYTAKISLREALCGCSINVPTLDGRNIPMSVNDIVKPGMRRRIIGYGLPF
PKNPDQRGDLLIEFEVSFPDTISSSSKEVLRKHLPAS
>sp|Q5SY16|NOL9_HUMAN Polynucleotide 5'-hydroxyl-kinase NOL9 OS=Homo sapiens OX=9606 GN=NOL9 PE=1 SV=1
MADSGLLLKRGSCRSTWLRVRKARPQLILSRRPRRRLGSLRWCGRRRLRWRLLQAQASGV
DWREGARQVSRAAAARRPNTATPSPIPSPTPASEPESEPELESASSCHRPLLIPPVRPVG
PGRALLLLPVEQGFTFSGICRVTCLYGQVQVFGFTISQGQPAQDIFSVYTHSCLSIHALH
YSQPEKSKKELKREARNLLKSHLNLDDRRWSMQNFSPQCS
```

### FASTQ

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.

A FASTQ file normally uses four lines per sequence.

- Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
- Line 2 is the raw sequence letters.
- Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

> In short: FASTQ = FASTA plus per-base quality scores.

A FASTQ file containing a single sequence might look like this:

```
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```
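
The four-line layout can be read with a few lines of standard-library Python. The sketch below assumes strictly four lines per record (no wrapped sequence or quality lines), which the format description calls the normal case; `read_fastq` is a hypothetical helper name, and the record used is the single-sequence example shown above.

```python
from typing import Iterable, Iterator, Tuple

def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from strictly four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            seq = next(it).rstrip("\n")
            plus = next(it).rstrip("\n")
            qual = next(it).rstrip("\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual  # drop the leading '@'

record_lines = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
for name, seq, qual in read_fastq(record_lines):
    print(name, len(seq), len(qual))
```

The length check enforces the rule that line 4 must contain exactly as many symbols as line 2 has letters.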

The byte representing quality runs from 0x21 (lowest quality; '!' in ASCII) to 0x7e (highest quality; '~' in ASCII). Here are the quality value characters in left-to-right increasing order of quality (ASCII):
```
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
```
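
To turn these characters back into numbers, subtract the ASCII offset of the lowest-quality character. The sketch below assumes the Phred+33 convention implied by the 0x21 starting point (some older Illumina files used an offset of 64 instead) and the standard Phred definition Q = -10 * log10(p), where p is the estimated error probability. The function names are illustrative.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]

def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(p), giving p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

scores = phred_scores("!+5I")
print(scores)                        # [0, 10, 20, 40]
print(error_probability(scores[2]))  # about 0.01, i.e. 1 error in 100
```

A score of 20 therefore corresponds to a 1% chance that the base call is wrong, and a score of 40 to a 0.01% chance.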

Printing first 25 lines of `sample.fq`

```
!head -n 25 ./data/sample.fq
```

```
@cluster_2:UMI_ATTCCG
TTTCCGGGGCACATAATCTTCAGCCGGGCGC
+
9C;=;=<9@4868>9:67AA<9>65<=>591
@cluster_8:UMI_CTTTGA
TATCCTTGCAATACTCTCCGAACGGGAGAGC
+
1/04.72,(003,-2-22+00-12./.-.4-
@cluster_12:UMI_GGTCAA
GCAGTTTAAGATCATTTTATTGAAGAGCAAG
+
?7?AEEC@>=1?A?EEEB9ECB?==:B.A?A
@cluster_21:UMI_AGAACA
GGCATTGCAAAATTTATTACACCCCCAGATC
+
>=2.660/?:36AD;0<14703640334-//
@cluster_29:UMI_GCAGGA
CCCCCTTAAATAGCTGTTTATTTGGCCCCAG
+
8;;;>DC@DAC=B?C@9?B?CDCB@><<??A
@cluster_34:UMI_AGCTCA
TCTTGCAAAAACTCCTAGATCGGAAGAGCAC
+
-/CA:+<599803./2065?6=<>90;?150
@cluster_36:UMI_AACAGA
```


### Using `Biopython` library to import data

```python
from Bio import SeqIO
```

#### Importing FASTA Sequence

```python
# SeqIO.parse accepts a file path directly, so no file handle is left open
fasta_sequences = SeqIO.parse('data/sample.fa', 'fasta')
fasta_seq = [str(record.seq) for record in fasta_sequences]

# number of sequences in the file
print(len(fasta_seq))

# first 5 sequences
print(fasta_seq[:5])
```

6003
['MAQKRPACTLKPECVQQLLVCSQEAKKSAYCPYSHFPVGAALLTQEGRIFKGCNIENACYPLGICAERTAIQKAVSEGYKDFRAIAIASDMQDDFISPCGACRQVMREFGTNWPVYMTKPDGTYIVMTVQELLPSSFGPEDLQKTQ', 'MGKDYYCILGIEKGASDEDIKKAYRKQALKFHPDKNKSPQAEEKFKEVAEAYEVLSDPKKREIYDQFGEEGLKGGAGGTDGQGGTFRYTFHGDPHATFAAFFGGSNPFEIFFGRRMGGGRDSEEMEIDGDPFSAFGFSMNGYPRDRNSVGPSRLKQDPPVIHELRVSLEEIYSGCTKRMKISRKRLNADGRSYRSEDKILTIEIKKGWKEGTKITFPREGDETPNSIPADIVFIIKDKDHPKFKRDGSNIIYTAKISLREALCGCSINVPTLDGRNIPMSVNDIVKPGMRRRIIGYGLPFPKNPDQRGDLLIEFEVSFPDTISSSSKEVLRKHLPAS', 'MADSGLLLKRGSCRSTWLRVRKARPQLILSRRPRRRLGSLRWCGRRRLRWRLLQAQASGVDWREGARQVSRAAAARRPNTATPSPIPSPTPASEPESEPELESASSCHRPLLIPPVRPVGPGRALLLLPVEQGFTFSGICRVTCLYGQVQVFGFTISQGQPAQDIFSVYTHSCLSIHALHYSQPEKSKKELKREARNLLKSHLNLDDRRWSMQNFSPQCSIVLLEHLKTATVNFITSYPGSSYIFVQESPTPQIKPEYLALRSVGIRREKKRKGLQLTESTLSALEELVNVSCEEVDGCPVILVCGSQDVGKSTFNRYLINHLLNSLPCVDYLECDLGQTEFTPPGCISLLNITEPVLGPPFTHLRTPQKMVYYGKPSCKNNYENYIDIVKYVFSAYKRESPLIVNTMGWVSDQGLLLLIDLIRLLSPSHVVQFRSDHSKYMPDLTPQYVDDMDGLYTKSKTKMRNRRFRLAAFADALEFADEEKESPVEFTGHKLIGVYTD

#### Importing FASTQ Sequence

```python
# SeqIO.parse accepts a file path directly, so no file handle is left open
fastq_sequences = SeqIO.parse('data/sample.fq', 'fastq')
fastq_seq = [str(record.seq) for record in fastq_sequences]

# number of sequences in the file
print(len(fastq_seq))

# first 5 sequences
print(fastq_seq[:5])
```

```
250
['TTTCCGGGGCACATAATCTTCAGCCGGGCGC', 'TATCCTTGCAATACTCTCCGAACGGGAGAGC', 'GCAGTTTAAGATCATTTTATTGAAGAGCAAG', 'GGCATTGCAAAATTTATTACACCCCCAGATC', 'CCCCCTTAAATAGCTGTTTATTTGGCCCCAG']
```


### Basic Techniques for DNA Compression

The increasing number of (re)sequenced genomes has led to many proposals for compression algorithms, which fall into four broad classes:

- **Naive bit encoding** algorithms exploit fixed-length encodings of two or more symbols in a single byte.
- **Dictionary-based** or substitutional compression algorithms replace repeated substrings by references to a dictionary (i.e., a set of previously seen or predefined common strings), which is built at runtime or offline.
- **Statistical** or entropy encoding algorithms derive a probabilistic model from the input. Based on partial matches of subsets of the input, this model predicts the next symbols in the sequence. High compression rates are possible if the model always indicates high probabilities for the next symbol, i.e., if the prediction is reliable.
- **Referential** or reference-based approaches recently emerged as a fourth type of sequence compression algorithm. Similar to dictionary-based techniques, these algorithms replace long substrings of the to-be-compressed input with references to another string. However, these references point to external sequences, which are not part of the to-be-compressed input data. Furthermore, the reference is usually static, while dictionaries are being extended during the compression phase.
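
As an illustration of the naive bit encoding idea from the list above, the four DNA bases fit in 2 bits each, so four bases can share one byte. The following is a minimal sketch, not a production codec: `pack` and `unpack` are hypothetical names, the bit assignment (A=0, C=1, G=2, T=3) is an arbitrary choice, and the input is assumed to contain only upper-case A, C, G, T.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"

def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte (2 bits each)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | _CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Inverse of pack(); `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])  # discard padding from the last byte

seq = "GATTACAGATTACA"
packed = pack(seq)
print(len(seq), len(packed))  # 14 4
assert unpack(packed, len(seq)) == seq
```

The original length has to be stored alongside the packed bytes, and symbols such as `N` cannot be represented without widening the code beyond 2 bits.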
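
The referential approach from the list above can be illustrated with a deliberately simplified scheme: if a target sequence has the same length as an external reference and differs only by substitutions, it is enough to store the differing positions. This is a toy sketch under that assumption (real reference-based compressors also handle insertions and deletions); the function names and both sequences are made up for the example.

```python
from typing import List, Tuple

def encode_against(reference: str, target: str) -> List[Tuple[int, str]]:
    """Record only the positions where `target` differs from `reference`."""
    if len(reference) != len(target):
        raise ValueError("this toy scheme handles substitutions only")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]

def decode_against(reference: str, edits: List[Tuple[int, str]]) -> str:
    """Rebuild the target by applying the recorded substitutions to the reference."""
    bases = list(reference)
    for pos, base in edits:
        bases[pos] = base
    return "".join(bases)

reference = "ACGTACGTACGTACGT"
target = "ACGTACGAACGTACTT"
edits = encode_against(reference, target)
print(edits)  # [(7, 'A'), (14, 'T')]
assert decode_against(reference, edits) == target
```

The reference itself is not part of the compressed output, which is why the approach pays off when individuals of a species share most of their sequence, but it also means the decoder must have access to exactly the same reference.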