### Introduction to NCBI BLAST

#### Database Similarity Searching

- You have a sequenced amplicon and you want to know more about it
  - what it is, does
  - where it comes from, etc
- Yes, you could search a database using dynamic programming algorithms (Needleman-Wunsch for global alignment, or Smith-Waterman for local alignment)
  - But they are too slow to give you results fast enough
    - Querying a DB of 300,000 sequences with a 100-residue query took ~3 hours on a regular computer, nearly a decade ago
    - And they would require too much computational power at large scale
- BLAST is a very routine way of doing these searches rapidly (usually seconds)
  - Your sequence (query) is searched against a database of annotated sequences (targets)
    - Side notes:
      - Database annotations are sometimes automatically assigned (automatic annotation)
      - At other times humans intervene and manually correct the data/metadata (manual data curation)
  - BLAST employs algorithmic tricks to speed up the alignment process
    - it's a heuristic (not exhaustive) algorithm
      - i.e. it's not guaranteed to retrieve all possible matches
        - e.g. for some families of protein sequences, BLAST can miss 30% of truly significant hits.
    - but it's very fast
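
To make the cost of exhaustive dynamic programming concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, not values taken from these notes. The nested loop fills one cell per pair of residues, so the work grows with query length times target length, and this is repeated for every sequence in the database.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a simple linear gap penalty and keeps only two rows of the
    dynamic programming matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Every cell has to be computed even when the two sequences share nothing, which is the cost that BLAST's heuristics avoid.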

#### Before we explain BLAST, we need to understand what a reading frame is
- A **reading frame** is the position at which one starts to translate codons in a **protein-coding sequence**
  - there are six possible reading frames, usually numbered 1, 2, 3 (forward strand) and 4, 5, 6 (reverse strand)
  - frameshifts are caused by indel mutations (of a length not divisible by three) that shift/skew the codons in a protein-coding sequence
- An **open reading frame** (ORF) is a DNA segment that begins with a start codon and ends at (but does not include) the stop codon.
  - [More information about this](https://www.genome.gov/genetics-glossary/Open-Reading-Frame)
- Because mature mRNA (i.e. no introns) is read in triplets, and because we don't know whether a strand is forward or reverse
  - We computationally translate all **possible reading frames** to find the longest one
  - Note that the longest ORF may not always have a start or stop codon
    - i.e. you may be dealing with an inner fragment of a CDS
- In practice, ORF finders usually report only continuous stretches of a minimum length (e.g. 100 codons) that have a start and a stop codon.
  - Why is this important?
    - This information can assist in **gene prediction**
  - There is a “standard” translation table, but some organisms may use slight variations of it
    - These are called **genetic codes** or **codon tables**.
    - When translating from genome segments to proteins, the **correct genetic code** is needed. 
      - You can have a look at a list of genetic codes [here](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
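
The six-frame translation described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a real ORF finder. It assumes the standard genetic code (NCBI translation table 1), labels frames 1 to 3 for the forward strand and -1 to -3 for the reverse complement (one common convention), and only reports ORFs that have both a start (`M`) and a stop codon, so it will not find the inner CDS fragments mentioned above. All function names are made up for this example.

```python
from itertools import product

# Standard genetic code (NCBI table 1), in the usual TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return a dict mapping frame label (1..3, -1..-3) to its translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def longest_orf(dna):
    """Return (frame, peptide) for the longest start-to-stop ORF, or (None, '')."""
    best = (None, "")
    for frame, protein in six_frame_translation(dna).items():
        # Drop the last piece: it is not terminated by a stop codon
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame, segment[start:])
    return best


if __name__ == "__main__":
    print(six_frame_translation("ATGAAATGA"))
    print(longest_orf("ATGAAATGA"))
```

To use a different genetic code, only the `AMINO_ACIDS` string would need to change, which is why picking the correct table matters.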

#### What is BLAST?

- BLAST is an acronym for "Basic Local Alignment Search Tool" (published by Stephen Altschul and colleagues in 1990)
- A tool for searching databases of biological sequences
  - DNA or protein (as query or targets)
- These two variations of the algorithm are commonly used
  - **blastp**: protein query vs protein database
  - **blastn**: nucleotide query vs nucleotide database

#### What if there's no nucleotide sequence that matches your query?
- Translated searches (useful for unannotated, protein-coding sequences; more computationally expensive)
  - **tblastn**: protein query, nucleotide database
    - _translates nucleotide database sequences_ in all six reading frames
  - **blastx**: nucleotide query, protein database
    - _translates the query_ in all six reading frames first
  - **tblastx**: nucleotide query, nucleotide database
    - _translates both query and database sequences_
- BLAST is used by several web servers to search their databases
  - at NCBI, and in many other sites
- You can also **install BLAST as a standalone tool** and prepare your own database
  - Typically, if you want your data to remain private
  - Or if you plan on large volumes of searches
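
As a quick recap of the program variants listed in these notes, the choice can be written as a small lookup. This is only an illustrative sketch: the function name and the `translate_both` parameter are made up for this example and are not part of BLAST itself.

```python
# (query type, database type) -> BLAST program, as listed in the notes
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "nucleotide"): "tblastn",
    ("nucleotide", "protein"): "blastx",
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program name from the query and database types.

    translate_both=True requests a translated nucleotide-vs-nucleotide
    search (tblastx), which only makes sense for two nucleotide inputs.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
```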
  
Further reading [here](https://www.nlm.nih.gov/ncbi/workshops/2023-08_BLAST_evol/blast_basics.html)

#### Should I use a protein sequence or a DNA sequence for a BLAST search?
- Protein query sequences are better at detecting homologs.
  - protein sequences are much **more informative** and **sensitive** in detecting homologs
    - they use a larger alphabet than DNA (20 amino acids vs 4 nucleotides) -> higher complexity "language"
    - Many codons are **degenerate** (DNA mutations are often silent), so the protein can stay the same while the DNA changes
  - Searches using protein sequences can yield more significant matches than using DNA sequences.
    - BUT, the query sequence has to be protein-coding
  - If looking for protein homologs in a newly sequenced genome, one may use TBLASTN
    - it translates nucleotide database sequences in all six reading frames

#### The BLAST algorithm
- Create a list of words (**aka kmers**) from the query sequence (overlapping fragments, one starting at each position).
  - typically 3 residues for protein sequences, and 11 residues for DNA sequences.
- Search a sequence database for matches to these kmers (exact matches for DNA; for proteins, similar words are also accepted).
  - Each matched word is scored by a given substitution (scoring) matrix, and is retained if its score is above a threshold.
- The matched kmers (seeds) are extended in both directions while the alignment score is accumulated using the same scoring matrix.
- Extension continues until the alignment score drops below a threshold (a set amount below the best score seen so far) due to mismatches.
  - This results in a contiguous aligned segment pair without gaps, called a **high-scoring segment pair (HSP)**
- Finally, terminal regions are trimmed before producing a report of the final alignment.
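
The seed-and-extend idea above can be illustrated with a toy DNA-only sketch. It is a simplification under stated assumptions and not the actual BLAST implementation: it uses exact word matches only, a plain match/mismatch score instead of a substitution matrix, and no statistics. The word size `k`, the `x_drop` cutoff, the scores and `min_score` are all arbitrary illustrative values, and the function names are made up for this example.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map each overlapping k-mer in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def extend(query, target, q_pos, t_pos, step, x_drop, match, mismatch):
    """Ungapped extension in one direction (step=+1 right, step=-1 left).

    Stops when the running score falls more than x_drop below the best
    score seen, and returns (best_score, length_at_best_score).
    """
    score = best = length = best_len = 0
    q, t = q_pos, t_pos
    while 0 <= q < len(query) and 0 <= t < len(target):
        score += match if query[q] == target[t] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        q += step
        t += step
    return best, best_len


def find_hsps(query, target, k=4, x_drop=3, match=1, mismatch=-2, min_score=6):
    """Return ungapped HSPs as (score, query_start, target_start, length)."""
    hsps = set()  # different seeds on one diagonal can yield the same HSP
    index = index_words(target, k)
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            right, r_len = extend(query, target, q + k, t + k, 1,
                                  x_drop, match, mismatch)
            left, l_len = extend(query, target, q - 1, t - 1, -1,
                                 x_drop, match, mismatch)
            score = k * match + left + right
            if score >= min_score:
                hsps.add((score, q - l_len, t - l_len, l_len + k + r_len))
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    for hsp in find_hsps("GGACGTACGG", "TTACGTACTT"):
        print(hsp)
```

Only positions that share a word are ever extended, so most of the database is skipped. That is where the speed comes from, and also why some true matches can be missed.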

#### Metrics in BLAST

- **Score:**
  - A raw value that represents how well the query sequence matches the database sequence.
  - Cannot be compared across searches, because it depends on the scoring system used
- **Bit-score:**
  - The normalized score, which allows comparisons across different searches.
  - Higher bit-scores mean better alignment quality.

- **Coverage:**
  - Proportion of the query that is aligned with the database (target) sequence.
    - i.e. how much of your query sequence is included in the alignment.

- **Identity:**
  - Percentage of nucleotides (if DNA/RNA) or amino acids (if proteins) that are **exactly the same** within the aligned pair of sequences.
  - Higher identity means more identical residues.
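
As a worked illustration of coverage and identity, the sketch below computes both from a single pair of aligned strings in which gaps are written as `-`. The function name and input format are assumptions for this example. Identity is taken over all alignment columns (gap columns included), and coverage is based on one aligned segment only.

```python
def alignment_metrics(aligned_query, aligned_target, query_length):
    """Return (percent_identity, percent_query_coverage) for one alignment.

    aligned_query / aligned_target: equal-length strings with '-' for gaps.
    query_length: length of the full, unaligned query sequence.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(aligned_query, aligned_target))
    # Columns where both residues are the same (gaps never count)
    identical = sum(1 for q, t in columns if q == t and q != "-")
    # Query residues that take part in the alignment
    query_residues = sum(1 for q, _ in columns if q != "-")
    identity = 100.0 * identical / len(columns)
    coverage = 100.0 * query_residues / query_length
    return identity, coverage


if __name__ == "__main__":
    print(alignment_metrics("ACGT", "ACGA", query_length=8))
```

A short alignment can have very high identity but low coverage, so the two numbers are best read together.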

- **Similarity:**
  - Similarity considers not just exact matches but also residues that are physicochemically similar (i.e. amino acids with similar properties).
  - Typically used for protein alignments.

- **E-value (Expect value):**
  - The number of alignments with an equal or better score that you would expect to find by chance in a database search of this size.
  - Lower E-values (closer to zero) indicate more significant matches (i.e. a higher likelihood that the match did not occur by chance).
  - Formula for E-value: $E = m \times n \times P$, where
    - m = total number of residues in a DB,
    - n = number of residues in the query sequence, 
    - P = probability that an HSP is a result of random chance.
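
A small numeric sketch of the formula above shows how the E-value scales with database size. The first function is a direct transcription of $E = m \times n \times P$. The second uses the standard relation between E-value and bit-score, $E = m \times n \times 2^{-S'}$, which is not stated in these notes and is included here as an assumption. Function names and input numbers are made up for illustration.

```python
def expect_value(m, n, p):
    """E = m * n * P: database residues, query residues, chance probability."""
    return m * n * p


def expect_value_from_bit_score(m, n, bit_score):
    """E = m * n * 2**(-bit_score), the usual bit-score form (an assumption here)."""
    return m * n * 2.0 ** (-bit_score)


if __name__ == "__main__":
    query_len = 100
    for db_residues in (1_000_000, 100_000_000):
        e = expect_value(db_residues, query_len, 1e-10)
        print(f"m={db_residues:>11,}  E={e:.3g}")
```

The same hit becomes less significant as the database grows, which is why E-values from searches against databases of different sizes are not directly comparable.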