# Seminar 2 - GateKeeper and Fast Genome Analysis

"A new hardware architecture for accelerating pre-alignment in DNA short read mapping"

<hr>

## 1 - Executive Summary

#### Problem
Genomic similarity measurement is a computational bottleneck. Examining the similarity of highly dissimilar genomic sequences consumes an overwhelming majority of a modern read mapper's execution time.

#### Goal
Develop a fast and effective filter that can detect highly dissimilar genomic sequences and eliminate them before invoking computationally costly alignment algorithms.

#### Key observation
If two strings differ by $\epsilon$ edits, then every pairwise match can be aligned in at most $2\epsilon$ shifts.

#### Result

> Speedup of up to 130x compared to the previous state-of-the-art software solution

## 2 - Background, Problem, & Goal

### Definition

Genomic analysis is the identification, measurement, or comparison of genomic features such as DNA sequence, structural variation, gene expression, or regulatory and functional element annotation at a genomic scale.

### Applications

- understanding genetic variations
- prediction of the presence and abundance of microbes in a sample
- rapid surveillance of disease outbreaks
- developing personalized medicine 

Genome sequencers chop genomes into pieces and lose the information about the order and location of each piece. Solving the puzzle of a genome is a computationally expensive operation.

### Read mapping

Map **reads** to a known reference genome with some minor differences allowed.

> A DNA sample in chemical format is read into a text format and compared against a reference genome in text format

### Sequence alignment (Verification)

**Edit distance** is defined as the minimum number of edits (i.e. insertions, deletions, or substitutions) needed to make the read exactly match the reference segment.

![align](images/alignment.png)

### What makes sequence alignment a bottleneck

A tsunami of sequencing data is generated. A specialized machine is used for sequencing, but a general-purpose computer is used for analysis.

> data movement dominates performance and is a major system energy bottleneck (accounting for 40 to 62% of expenses)

However, **60% of the read mapper's execution time is spent on sequence alignment**.

The alignment is computationally costly as it is a **quadratic-time dynamic-programming algorithm**. Furthermore, **data dependencies** limit the parallelism of the computation (rows or columns must be processed one after another). The **entire matrix is computed** even when the strings are dissimilar (the number of differences is only computed at the backtracking step).

Finally, the search space is large and mostly dissimilar (**98% of candidate locations have high dissimilarity with a given read**, i.e. a lot of time is wasted).
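
The quadratic cost and the row-by-row dependency can be seen in a minimal sketch of the textbook edit-distance recurrence. This is an illustration only, not the aligner evaluated in the paper; the function name `edit_distance` is hypothetical.

```python
def edit_distance(read, ref):
    """Textbook dynamic-programming edit distance (insertions, deletions, substitutions)."""
    rows, cols = len(read) + 1, len(ref) + 1
    dp = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one edit per character.
    for i in range(rows):
        dp[i][0] = i
    for j in range(cols):
        dp[0][j] = j

    # Every cell depends on its upper, left, and upper-left neighbours,
    # so the full rows x cols matrix is filled in a fixed order.
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[-1][-1]


print(edit_distance("ISTNBUL", "ISTANBUL"))
```

Note that the whole matrix is filled in even when the two strings have almost nothing in common, which is the wasted work a pre-alignment filter tries to avoid.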

As such:

> -> We need intelligent algorithms coupled with intelligent architectures to handle data well
>
> The goal is to use architecture to accelerate read mapping and **reduce the dependence on dynamic programming algorithms**

## 3 - Novelty of GateKeeper

### Key Idea

There are two types of genomic strings:
- similar strings (with matches above a threshold)
- dissimilar strings (with matches under a threshold)

Dissimilar strings are discarded when the number of differences exceeds a threshold. Most of the cost comes from the computation spent on dissimilar strings.

The idea is to have a **filtering algorithm** that can filter out most of the incorrect mappings while preserving all the correct mappings.

> GateKeeper is an FPGA implementation of such an algorithm

#### Observation

1. *Quickly* find similar sequences using the **Hamming distance**

2. Compute the **Shifted Hamming Distance** for the rest of the sequence pairs:
     1. AND the $2\epsilon+1$ Hamming vectors of the two strings to identify dissimilar sequences

3. Use only bit-parallel operations that map well to:
    1. SIMD instructions, FPGAs, the logic layer of 3D-stacked memory, and in-memory accelerators
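
The steps above can be sketched in software as follows. This is a simplified illustration, not GateKeeper's hardware design: the names `hamming_mask` and `shd_filter` are hypothetical, positions shifted outside the reference are assumed to count as matches, the amendment step is left out, and the number of set bits in the final vector is used as a simple stand-in for the error count.

```python
def hamming_mask(read, ref, shift):
    """Per-position mismatch mask (1 = mismatch, 0 = match) of `read`
    compared with `ref` offset by `shift` positions."""
    mask = []
    for i in range(len(read)):
        j = i + shift
        if 0 <= j < len(ref):
            mask.append(0 if read[i] == ref[j] else 1)
        else:
            # Assumption: positions shifted outside the reference count as matches.
            mask.append(0)
    return mask


def shd_filter(read, ref, e):
    """AND the 2e+1 masks for shifts -e..+e and accept the pair if at most
    e positions mismatch under every shift."""
    final = [1] * len(read)
    for shift in range(-e, e + 1):
        mask = hamming_mask(read, ref, shift)
        final = [a & b for a, b in zip(final, mask)]
    return sum(final) <= e, final


accepted, final = shd_filter("ISTNBUL", "ISTANBUL", 1)
print(accepted, final)

rejected, _ = shd_filter("AAAAAAAA", "CCCCCCCC", 1)
print(rejected)
```

A position stays 1 in the final vector only if it mismatches under every shift, so a pair within the edit threshold is not rejected by this sketch, while some dissimilar pairs can still slip through.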
    
### Hamming Distance ($\sum\oplus$)

The Hamming distance is used to compare a string against a shifted copy of another string that contains one or more deletions.
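
As a rough illustration of the $\sum\oplus$ notation, the Hamming distance of two equal-length DNA strings can be computed with an XOR followed by a population count. The 2-bit encoding and the names `ENC`, `encode`, and `hamming_distance` are assumptions made for this sketch, not details taken from the paper.

```python
# Assumed 2-bit encoding of the four bases.
ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENC[base]
    return value


def hamming_distance(a, b):
    """Count the positions where two equal-length DNA strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    diff = encode(a) ^ encode(b)
    # Mask with a 1 in the low bit of every 2-bit pair: 0b0101...01.
    low_bits = ((1 << (2 * len(a))) - 1) // 3
    # A base differs if either of its two bits differs.
    per_base = (diff | (diff >> 1)) & low_bits
    return bin(per_base).count("1")


print(hamming_distance("ACGT", "AGGT"))
```

The whole comparison uses only XOR, OR, AND, shift, and a bit count, which is the kind of bit-parallel work that maps well to SIMD units and FPGAs.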

<u>Comparison of "Istanbul" and "Istnbul"</u>

![shd](images/shifted_HD.png)

![wlk](images/walkthrough.png)

### Alignment Matrix vs. Neighborhood Map

![alg](images/alignmentmat.png)

## 4 - Hardware Architecture

GateKeeper offers a significant solution based on FPGA pre-alignment filtering that greatly speeds up read mapping. However, there are some issues with regard to memory management.

FPGA-based pre-alignment can also be integrated with the sequencer, which can yield even better results.

### Strengths

- New and simple solution to a critical problem
- GateKeeper does not sacrifice any of the aligner's capabilities, as it does not modify or replace the alignment step
- The design is scalable; more processing cores could be added in the future
- Some sequencers use FPGAs as well, so GateKeeper could be integrated into them
- Greatly improves filtering speed and accuracy

### Issues
- The benefits of such a mechanism require an FPGA and advanced computing knowledge. This may be problematic for some biologists, genomicists, and geneticists.
- The amendment of the random zeros is a simple "hack" to reduce the number of false positives, but there is no explanation of why GateKeeper only flips the patterns 101 and 1001.
- GateKeeper's accuracy degrades exponentially for $E\gt2\%$ and becomes ineffective for $E\gt8\%$.
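
The amendment mentioned in the issues above can be sketched as follows, assuming it turns the zeros inside the patterns 101 and 1001 of a Hamming mask into ones. The function name `amend` and the list-of-bits representation are hypothetical; the hardware works on bit-vectors.

```python
def amend(mask):
    """Flip short runs of zeros flanked by ones: 101 -> 111 and 1001 -> 1111."""
    mask = list(mask)
    out = list(mask)  # patterns are detected on the original mask only
    n = len(mask)
    for i in range(n - 2):
        if mask[i:i + 3] == [1, 0, 1]:
            out[i + 1] = 1
    for i in range(n - 3):
        if mask[i:i + 4] == [1, 0, 0, 1]:
            out[i + 1] = 1
            out[i + 2] = 1
    return out


print(amend([0, 1, 0, 1, 0]))
print(amend([1, 0, 0, 1]))
print(amend([1, 0, 0, 0, 1]))
```

Runs of three or more zeros are left untouched in this sketch, which mirrors the point raised above: the choice of exactly these two patterns is not justified.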