In [None]:
# https://www.slideshare.net/ammarkareem3/sequence-alignment-58496054
# Sequence alignment is a way of arranging protein (or DNA) sequences to identify regions of similarity 
# that may be a consequence of evolutionary relationships between the sequences.

| Aspect                | Global Sequence Alignment                                | Local Sequence Alignment                               |
|-----------------------|----------------------------------------------------------|----------------------------------------------------------|
| **Definition**        | Aligns the entire sequence from end to end.              | Finds local regions of high similarity between subsequences of the target and query. |
| **Aim**               | Maximize similarity between entire sequences.            | Identify local regions of high similarity.               |
| **Consideration**     | Entire length of both query and target sequences.        | Substring of the query aligns to a substring of the target. |
| **Characteristics**   | Contains all letters from both the query and target sequences. | Aligns a substring of the query to a substring of the target. |
| **Suitability**       | Sequences with similar length and content.               | More suitable for divergent or distantly related sequences. |
| **Applications**      | Comparing homologous genes or proteins.                  | Finding conserved patterns in DNA, identifying functional domains. |
| **Technique Example** | ***Needleman–Wunsch*** algorithm for global alignment.         | ***Smith-Waterman*** algorithm for local alignment.            |
| **Tool Examples**     | EMBOSS Needle, Needleman-Wunsch Global Align (Specialized BLAST). | BLAST, EMBOSS Water, LALIGN.                             |
| **Use Case Examples** | Comparing genes or proteins with similar functions.      | Identifying conserved domains or motifs within larger sequences. |


Resources: [EMBOSS Needle](https://www.ebi.ac.uk/Tools/psa/emboss_needle/), [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi), [Smith-Waterman Algorithm](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)
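The global technique named in the table can be sketched in a few lines. The function below is a minimal illustration of Needleman–Wunsch with a linear gap penalty. The name `needleman_wunsch` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for readability; production tools such as EMBOSS Needle use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b).

    Illustrative scoring only: linear gap penalty, no substitution matrix.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Global alignment: leading gaps are penalized, so the borders are filled.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to the top-left corner,
    # so every letter of both sequences appears in the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "ACT")
print(score)
print(top)
print(bottom)
```

Because the traceback always runs corner to corner, both output strings have the same length and contain every input letter, which is the "end to end" property described in the table.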


In [None]:
# BLAST (Basic Local Alignment Search Tool)

In [None]:
# Global Sequence Alignment
# Definition:
# Global sequence alignment is a method used to align the entire sequence from end to end.
# It aims to find the best alignment that maximizes the similarity between two sequences.
# In a global alignment, we consider the entire length of both the query sequence and the target sequence.
# Characteristics:
# Contains all letters from both query and target sequences.
# Suitable for sequences with similar length and content.
# Used for aligning two closely related sequences.
# Applications:
# Comparing homologous genes or proteins with similar functions.
# Identifying conserved regions across species.
# Technique Example:
# The Needleman–Wunsch algorithm is a classic dynamic programming approach for global sequence alignment.
# Tools Examples:
# EMBOSS Needle: A widely used tool for global pairwise sequence alignment.
# Needleman-Wunsch Global Align Nucleotide Sequences (Specialized BLAST): Part of the BLAST suite for nucleotide sequence alignment.
# Local Sequence Alignment
# Definition:
# Local sequence alignment focuses on finding local regions with the highest level of similarity between two sequences.
# It aligns a substring of the query sequence to a substring of the target sequence.
# Suitable for more divergent sequences or distantly related sequences.
# Characteristics:
# Aligns a substring of the query sequence to a substring of the target sequence.
# Useful for identifying conserved domains or motifs within larger sequences.
# Applications:
# Finding conserved patterns in DNA sequences.
# Identifying functional domains in proteins.
# Technique Example:
# The Smith-Waterman algorithm is commonly used for local sequence alignment.
# Tools Examples:
# BLAST: A fast heuristic tool for local alignment searches; the BLAST suite also offers a specialized Needleman-Wunsch global alignment.
# EMBOSS Water: Another popular tool for local pairwise sequence alignment.
# LALIGN: A tool that finds multiple non-overlapping local alignments between two protein or DNA sequences.
# For further exploration, see: EMBOSS Needle, BLAST, and the Smith-Waterman algorithm.
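The local alignment described in the cell above can be illustrated with a minimal Smith-Waterman sketch. Compared with a global alignment, the score matrix is floored at zero and the traceback starts at the highest-scoring cell and stops at the first zero, so only the best-matching substrings are reported. The function name `smith_waterman` and the scoring values (match +2, mismatch -1, gap -1) are assumptions for illustration, not the defaults of EMBOSS Water or BLAST.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Locally align a and b; return (score, aligned_a, aligned_b).

    Illustrative scoring only: linear gap penalty, no substitution matrix.
    """
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    # Borders stay at zero: a local alignment may start anywhere for free.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # restart the alignment here
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# The two sequences share only the core "ACGTACG"; the flanks differ.
print(smith_waterman("TTTTACGTACGTTTT", "GGGACGTACGGGG"))
```

The unrelated flanking regions are left out of the result, which is why local alignment suits divergent sequences that share only a domain or motif.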

- Protein alignment is often more informative than DNA alignment

- The "wobble" third nucleotide of a codon often does not change the amino acid.
  This reflects the redundancy of the genetic code: the third nucleotide in a DNA triplet (codon) can often vary without changing the corresponding amino acid in the protein sequence. This redundancy makes the encoded protein more robust to point mutations.

- To improve search sensitivity with DNA sequences, use translated-DNA:protein alignments, such as those produced by BLASTX and FASTX, rather than DNA:DNA alignments.
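The points above can be made concrete with a small sketch. Two made-up DNA sequences that differ only at third codon positions are translated with a partial codon table, and identity is compared at the DNA and protein levels. The sequences, the helper names `translate` and `percent_identity`, and the restriction to four amino acids are assumptions for illustration; the codon assignments themselves follow the standard genetic code.

```python
# Partial standard codon table: only the four-fold degenerate codon
# families used in this example (any third base gives the same amino acid).
CODON_TABLE = {
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",  # leucine
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",  # valine
}


def translate(dna):
    """Translate DNA in frame 1 using the partial table above."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(x, y):
    """Percent of identical positions between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return 100 * sum(p == q for p, q in zip(x, y)) / len(x)


# Hypothetical sequences differing only at every third (wobble) position.
seq1 = "GGTGCTCTTGTT"
seq2 = "GGCGCACTGGTC"

print(translate(seq1), translate(seq2))
print(round(percent_identity(seq1, seq2), 1))                  # DNA level
print(percent_identity(translate(seq1), translate(seq2)))     # protein level
```

A third of the nucleotides differ, yet the encoded peptides are identical, which is the signal that translated searches recover and DNA:DNA searches lose.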

**Evolutionary look-back time:** DNA:DNA alignments have a 5–10-fold shorter look-back time than protein:protein or translated DNA:protein alignments.

When examining evolutionary relationships, DNA:DNA alignments therefore give a shorter historical perspective. Protein sequences tend to accumulate changes more slowly than DNA sequences, which allows a longer look-back period.

**Detection range:** DNA:DNA alignments rarely detect homology after more than 200–400 million years of divergence, whereas protein:protein alignments routinely detect homology in sequences that last shared a common ancestor more than 2.5 billion years ago (e.g. humans to bacteria).

DNA alignments may thus fail to detect similarity beyond roughly 200–400 million years, while protein alignments remain sensitive across much deeper divergences.

**Statistics:** DNA:DNA alignment statistics are less accurate than protein:protein statistics. Protein:protein alignments with expectation values < 0.001 can reliably be used to infer homology, but DNA:DNA expectation values < 10^-6 often occur by chance, and 10^-10 is a more widely accepted threshold for homology based on DNA:DNA searches.

For reliable homology inference, DNA:DNA searches therefore need a much stricter threshold (10^-10) than protein:protein searches (0.001).
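The two expectation-value thresholds quoted above can be written down as a small helper. This is only a sketch: the function name `supports_homology` and the `"protein"` / `"dna"` labels are hypothetical, and the cutoffs are the rule-of-thumb values from the text, not settings of any particular search program.

```python
# Rule-of-thumb E-value cutoffs taken from the text above.
EVALUE_CUTOFFS = {
    "protein": 1e-3,   # protein:protein alignments
    "dna": 1e-10,      # DNA:DNA alignments need a much stricter cutoff
}


def supports_homology(evalue, alignment_type):
    """Return True if the E-value is below the cutoff for this alignment type."""
    if alignment_type not in EVALUE_CUTOFFS:
        raise ValueError(f"unknown alignment type: {alignment_type!r}")
    return evalue < EVALUE_CUTOFFS[alignment_type]


# The same E-value is convincing for a protein hit but not for a DNA hit.
print(supports_homology(1e-7, "protein"))
print(supports_homology(1e-7, "dna"))
```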

Orthologs:

Orthologs are homologous sequences in different species that arose from a common ancestral gene during speciation (the process by which new species arise). They typically retain similar functions and properties despite differences in sequence.

Paralogs:

Paralogs are homologous sequences that arose by gene duplication. For example, human alpha and beta globins share an ancestral origin but have distinct functions and properties.

Regional distribution in the body, developmental timing of gene expression, and abundance:

Homologs can differ in these respects, and such differences contribute to the functional diversity observed among homologous sequences.

**Orthologs:**
- Homologous genes found in different species.
- Originate from a common ancestral gene through speciation.
- Typically retain similar functions and properties despite evolutionary changes.

In short, orthologs are homologous genes in different species that share a common ancestral gene and are separated by a speciation event.

**Paralogs:**
- Homologous genes within the same species.
- Originate from gene duplication events.
- Typically have distinct functions and properties.
- Example: human alpha and beta globins.

In short, paralogs are homologous genes within the same species that originated from gene duplication events and have evolved to perform different functions.