# Lecture 10: Intelligent Genome Analysis

*Notes*

<hr>

## 1 - Main Idea

Intelligent genome analysis corresponds to:

1. Fast genome analysis (**bandwidth**)
2. Population-scale genome analysis (**scalability**)
3. Using intelligent architectures -- small specialized hardware with less data movement (**energy efficiency and portability**)
4. Considering DNA as a valuable asset -- control-access analysis (**privacy**)
5. Avoiding erroneous analysis (**accuracy**)

### Why does it matter?

Lives could be saved if the turnaround time of rapid whole-genome sequencing (rWGS) were reduced below the current 2-day (very costly) or 5-day (costly) options. A faster analysis:

1. avoids morbidity
2. reduces hospital stay length
3. reduces inpatient cost

<hr>

## 2.1 - How is a Genome Analyzed?

Genome analysis corresponds to untangling DNA yarn balls and sequencing them in the right order. With current technology, this is impossible in one go.

As such:

> Current sequencing machines provide small randomized fragments of the original DNA. Genomes are sampled into small chunks called **reads** that need to be reassembled.

Reading relies on a graphene nanopore (< 20 nm) through which a DNA strand passes, so that it can be read as a time series whose values are A, T, G, or C. Nanopore sequencers measure changes in an ionic current as the DNA strand passes through the nanopore.

**Note**: The longer the reads, the easier the reassembly problem (however, long reads tend to have a higher error rate).

> Changes in sequencing technologies can render some read mapping algorithms irrelevant

## 2.2 - Read Mapping

Map reads to a known reference genome, with some minor differences allowed.

> DNA sample (chemical format) -> Reads (text format) -> Reference genome

### Brute force 

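Brute force is very expensive, with complexity $O(m^2 k n)$, where $m$ is the read length, $k$ is the number of reads, and $n$ is the reference genome length: each of the $k$ reads is aligned, at a cost of $O(m^2)$, against each of the roughly $n$ positions in the reference.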
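To make the $O(m^2 k n)$ cost concrete, here is a minimal sketch of brute-force mapping. The function names, the `max_edits` threshold, and the choice of comparing each read against a fixed-length window of the reference are assumptions made for illustration, not something specified in the lecture.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def brute_force_map(reads, reference, max_edits):
    """Align every read at every reference position (hypothetical helper)."""
    hits = {}
    for read in reads:                                    # k reads
        m = len(read)
        positions = []
        for pos in range(len(reference) - m + 1):         # about n positions
            window = reference[pos:pos + m]               # assumption: fixed-length window
            if edit_distance(read, window) <= max_edits:  # O(m^2) per position
                positions.append(pos)
        hits[read] = positions
    return hits
```

The three nested levels (reads, positions, DP table) are exactly the three factors in the complexity, which is why the later sections focus on discarding most positions before any alignment is attempted.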

### Smith-Waterman

Smith-Waterman (SW) has remained the most popular algorithm since 1988. Hamming distance has been the second most popular technique since 2008.

![proc](images/process.png)
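As a rough illustration of the two techniques named above, the sketch below computes a Smith-Waterman local alignment score and a Hamming distance. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the linear gap penalty are assumptions chosen for the example; real mappers typically use tuned and often affine gap scoring.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))
```

Hamming distance is linear in the read length but only counts substitutions, whereas Smith-Waterman also handles insertions and deletions at quadratic cost.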

## 2.3 - Why is Read Mapping slow?

- need to find many mappings of each read

- need to tolerate variances/sequencing errors in each read

- need to map each read very fast (i.e. performance is important, life-critical in some cases)

- need to map reads to both forward and reverse strands
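The last point in the list above doubles the search work: a read may come from either strand, so the mapper also has to look for its reverse complement. A minimal sketch, using exact matching only to keep the example short (the function names are hypothetical):

```python
# Translation table for the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def exact_hits_both_strands(read: str, reference: str):
    """Return (position, strand) pairs for exact matches on either strand."""
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        start = reference.find(query)
        while start != -1:
            hits.append((start, strand))
            start = reference.find(query, start + 1)  # allow overlapping hits
    return hits
```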

![compsi](images/compsy.png)

> We need intelligent algorithms and intelligent architectures that handle data well

<hr>

## 3 - Algorithmic and Hardware Acceleration 

### Seed Filtering Technique

**Goal**: Reducing the number of seed (k-mer) locations
- heuristic (limits the number of mapping locations for each seed)
- supports exact matches only
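One possible way to picture seed filtering is a k-mer index over the reference combined with a cap on how many locations a seed may contribute. The sketch below skips seeds that occur too often, which is only one of several possible heuristics; the names `build_kmer_index`, `candidate_locations`, and `max_hits_per_seed` are assumptions for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read: str, index, k: int, max_hits_per_seed: int):
    """Candidate read start positions derived from exact seed matches."""
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):   # non-overlapping seeds
        seed = read[offset:offset + k]
        locations = index.get(seed, [])
        if len(locations) > max_hits_per_seed:
            continue                                # heuristic: ignore overly frequent seeds
        for loc in locations:
            start = loc - offset                    # implied start of the whole read
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)
```

Because seeds are looked up by exact match, a sequencing error inside a seed makes that seed miss; the remaining seeds of the read are what keep the true location among the candidates.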

### Pre-Alignment Filtering Technique

**Goal**: Reducing the number of invalid mappings
- Supports both exact and inexact matches
- Provides some falsely-accepted mappings
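To show what "falsely accepted but never falsely rejected" can look like, here is a very simple base-counting filter. It is a stand-in chosen for brevity, not the filter discussed in the lecture. It relies on the fact that a single edit (substitution, insertion, or deletion) changes the per-base counts by at most 2 in total, so a count difference larger than `2 * max_edits` proves the pair cannot align within `max_edits`.

```python
from collections import Counter


def base_count_filter(read: str, segment: str, max_edits: int) -> bool:
    """Return False only when the pair is guaranteed to exceed max_edits.

    True means "maybe similar": the pair still has to go through alignment.
    """
    read_counts = Counter(read)
    seg_counts = Counter(segment)
    bases = set(read_counts) | set(seg_counts)
    # L1 distance between the two base-count vectors.
    diff = sum(abs(read_counts[b] - seg_counts[b]) for b in bases)
    return diff <= 2 * max_edits
```

For example, `base_count_filter("ACGT", "TGCA", 0)` returns `True` even though the two strings differ: this is a falsely-accepted mapping that the subsequent alignment step would reject.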

### Read Alignment Acceleration

**Goal**: Performing read alignment at scale
- Limits the numeric range of each cell in the DP table and hence supports limited scoring functions
- May not support backtracking step due to random memory accesses
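The two limitations above can be pictured in software with a banded edit distance whose cells saturate at `max_edits + 1`, so each cell fits in a few bits, and which returns only a score without a traceback. This is a sketch under those assumptions; it does not describe any specific accelerator, and the function name is hypothetical.

```python
def bounded_edit_distance(a: str, b: str, max_edits: int) -> int:
    """Edit distance capped at max_edits + 1, computed inside a diagonal band."""
    cap = max_edits + 1                      # saturation value for every cell
    if abs(len(a) - len(b)) > max_edits:
        return cap                           # length gap alone exceeds the budget
    prev = [min(j, cap) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        lo = max(1, i - max_edits)           # band: |i - j| <= max_edits
        hi = min(len(b), i + max_edits)
        curr = [cap] * (len(b) + 1)          # cells outside the band stay saturated
        curr[0] = min(i, cap)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(prev[j] + 1,          # deletion
                       curr[j - 1] + 1,      # insertion
                       prev[j - 1] + cost)   # match or substitution
            curr[j] = min(best, cap)         # limited numeric range
        prev = curr
    return prev[-1]
```

A result of `max_edits + 1` only means "too many edits"; the exact distance is not recoverable, which is the price of the narrow cell range.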

<hr>

## 4 - What's next for Read Mapping?

- Read alignment can be substantially accelerated using computationally inexpensive and accurate pre-alignment filtering algorithms designed for specialized hardware

- All three directions cited above are used by mappers today, but filtering has replaced alignment as the bottleneck

- Pre-alignment filtering does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step.

### Adoption of hardware accelerators in genome analysis

Computing is still bottlenecked by data movement. Future designs need to reduce the large amount of data movement and provide flexible hardware architectures that do not conservatively limit the range of supported parameter values at design time.

Data formats should also evolve to be more efficient (e.g. pdf -> djvu).