# Biological Sequences and the Central Dogma

Sequences - ordered arrangements of repeated elements - are ubiquitous in biology. They include the DNA that makes up our genome, RNA that transmits the template for proteins to ribosomes for manufacture, and the strings of amino acids that compose proteins.

This section will introduce multiple ways in which we can represent such sequences as text and use them to calculate useful quantities. 

#### Prerequisites for this section
* Have Anaconda Python installed
* Be familiar with how to run Python
* Be familiar with how to use Python like a calculator, including:
   * assigning variables
   * order of operations
   * multiplication and division


----------------------------------------------------------------------------------------------------------------
#### Sequences in Biology

Let's first review some of the biology of DNA, RNA and Protein sequences. Those already familiar with basic molecular biology and the Central Dogma should skim down to the heading on how to represent biological sequences in text. 

**The Central Dogma**

Flow of information between different types of biological sequences is important in the biology of cells. The Central Dogma of molecular biology is a statement of which routes of information transmission are typical, which are rare, and which never happen. 

In cellular organisms, DNA forms the genome. That genome can be replicated by proteins called DNA polymerases to generate new copies of itself during reproduction. This replication is not perfect, and can introduce random changes in the DNA (mutations), which are an important source of variation among organisms. Overall, though, DNA replication is quite accurate under normal circumstances. A key feature of DNA replication is that it is *semi-conservative*. This means that each two-stranded copy of the original two-stranded DNA genome contains one newly synthesized strand and one of the parental strands.

Some DNA forms genes. These contain the information needed to form the sequence of a protein. But proteins are not produced from DNA directly. DNA must first be **transcribed** into RNA before translation can occur. Critically, each gene is transcribed at different rates under different conditions and at different points in time. This is important because cells in our liver and in our eye both inherit the same DNA. The main reason these tissues are so different is that different genes are transcribed into RNA (and can therefore be translated into protein) in each of them.
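
At the level of text, transcription can be sketched in a few lines of Python. This is a minimal sketch that assumes the DNA string is the coding strand of a gene, so that the RNA carries the same letters with U in place of T. The function name `transcribe` and the example sequence are made up for illustration.

```python
def transcribe(coding_strand):
    """Return the RNA version of a DNA coding strand (T becomes U)."""
    # Upper-case first so that mixed-case input is handled consistently.
    return coding_strand.upper().replace("T", "U")


example_dna = "ATGGCCTTTTAA"  # made-up example sequence
example_rna = transcribe(example_dna)
print(example_rna)  # AUGGCCUUUUAA
```

Real transcription involves much more (for example, which strand is read and where transcription starts and stops), so treat this only as a way of converting between the two text representations.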

Finally, RNA can be **translated** into Protein in a structure with both protein and RNA elements called a ribosome. Much as with transcription, not *all* RNAs are translated into protein (this is known as **translational regulation**). In the ribosome, 3 letter codes in the RNA (known as **codons**) guide the addition of new **amino acids** to a new protein. 
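
The idea of reading an RNA three letters at a time can be sketched as follows. The codon table here is deliberately tiny: the standard genetic code has 64 codons, and only the four used in the made-up example are listed. The names `CODON_TABLE` and `translate`, and the choice to report unlisted codons as `X`, are assumptions of this sketch.

```python
# A small subset of the standard genetic code, just enough for the example.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCC": "A",  # alanine
    "UUU": "F",  # phenylalanine
    "UAA": "*",  # stop
}


def translate(rna):
    """Translate an RNA string codon by codon until a stop codon is reached."""
    protein = []
    # Step through the sequence three letters at a time,
    # ignoring any incomplete codon at the end.
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "X")  # X = not in this small table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUUUUAA"))  # MAF
```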

A key tenet of the Central Dogma is that once proteins are formed, information does not flow backwards to RNA or DNA. That is, while proteins can be generated from the information in DNA or RNA in the cell, the opposite is not true.

Let's consider each of the types of sequences involved in the Central Dogma in more detail.

**Nucleotide sequences**
* DNA
* RNA

Nucleotide sequences include DNA and RNA. Although a DNA nucleotide and the corresponding RNA nucleotide can differ by only a single hydroxyl group in chemical structure, DNA and RNA play very different roles in the cell.

**DNA** For cellular organisms (and some but not all viruses) the genome is formed by one or more chromosomes of double-stranded DNA. DNA can be thought of as a form of stable long-term storage of genetic information, very loosely analogous to a computer's hard drive. The DNA is composed of **nucleotides**, which are often abbreviated 'nt'. The double-stranded string of DNA nucleotides has some **coding regions** (i.e. 'genes') that encode the information needed to put together proteins. Other regions are **non-coding regions**. While much non-coding DNA may have no specific benefit to the cell, other non-coding sequences hold important information, such as sites where specific proteins or protein complexes should bind to the DNA to do things like starting DNA replication or transcribing a particular gene in the DNA into RNA. Broadly speaking, mammals like humans tend to have far more non-coding DNA than single-celled organisms, which in turn tend to have more non-coding DNA than viruses. This is thought to be both because carrying extra DNA is a greater burden on fast-reproducing organisms, and also because the larger population sizes of viruses or bacteria may allow for 'streamlining' of the genome to remove unneeded non-coding regions, even though the cost of carrying them is minimal.

**RNA** plays several roles in cells. The most common is to serve as a temporary form of portable storage for the information contained in the DNA. You could think of it as very loosely analogous to a computer's RAM, in the limited sense that programs are loaded into RAM before they can be executed, and there are generally many more programs on a hard drive than there are running programs at any given moment. Similarly, just as computers typically have more hard drive space than RAM, the DNA genome for organisms is much longer than any individual RNA. In addition to its role in information transmission, RNA also plays direct functional roles in the cell. It contributes to the structure of some cellular complexes like the ribosome (which translates RNA into protein). It has also been discovered that RNA can catalyze (speed up) reactions, a finding for which [Sidney Altman and Tom Cech won the 1989 Nobel Prize in Chemistry](https://www.nobelprize.org/prizes/chemistry/1989/summary/).

**Protein**'s primary role in cells is to catalyze reactions. Although both RNA and proteins can do so, proteins catalyze many more reactions. Catalysis causes reactions that would be energetically favorable anyway to happen much faster than they otherwise would. Protein also plays important roles in cellular structures such as actin filaments.


#### Important features of biological sequences


**Complementarity and base pairing in nucleotide sequences**. A key feature of both RNA and DNA nucleotide sequences (but not amino acid sequences) is complementarity. Consider double-stranded DNA. Each base in one strand must pair with a base in the other strand. But not all pairings are equal. Generally, Adenine (A) nucleotides pair with Thymine (T), and Guanine (G) nucleotides pair with Cytosine (C). Biology students memorizing this pattern of complementation often state it in shorthand based on the letter codes of each nucleotide:

> In DNA, A pairs with T, and G pairs with C

Base pairing is also important in single-stranded RNA. Since there is not another strand to pair with, single-stranded RNA can pair with other nucleotides in its own sequence, causing the initially linear RNA to fold up into more complex **secondary structures**. If you hold up the end of a shoelace and fold it back on itself over a finger, you will form something that looks like a **stem-loop**, a common type of RNA secondary structure. When RNA pairs with itself, complementation works much as it does in DNA, except that Uracil (U) replaces Thymine. Thus A pairs with U and C pairs with G.

> In RNA, A pairs with U, and G pairs with C
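
The two pairing rules above can be written as Python dictionaries and used to compute the strand that would pair with a given sequence. This is a minimal sketch; the names `DNA_PAIRS`, `RNA_PAIRS`, `complement` and `reverse_complement` are chosen here for illustration, and the input is assumed to contain only unambiguous uppercase letters.

```python
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def complement(sequence, pairs):
    """Return the base-by-base complement of a sequence."""
    return "".join(pairs[base] for base in sequence)


def reverse_complement(sequence, pairs):
    """Return the complement read in the opposite direction.

    The two strands of a double helix run in opposite directions, so the
    partner strand is usually written as the reverse complement.
    """
    return complement(sequence, pairs)[::-1]


print(complement("ATGC", DNA_PAIRS))            # TACG
print(reverse_complement("ATGC", DNA_PAIRS))    # GCAT
print(reverse_complement("GGGAAA", RNA_PAIRS))  # UUUCCC
```

In the last line, an RNA containing `GGGAAA` near one end and `UUUCCC` near the other could in principle fold back on itself and pair, which is the text-level picture of a stem-loop.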

**Layers of structural complexity**

Although DNA, RNA and protein can all be expressed as linear sequences, they also form more complex 3D structures that can be important for function. The role of complementation and base-pairing in RNA secondary structure was mentioned above. Protein also has [secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure), which includes spiral or sheet-like structures called alpha-helices and beta-sheets. These alpha-helices and beta-sheets can fold further to form the overall 3D structure. Many proteins that are enzymes (i.e. that catalyze a reaction) have an active site where the reaction occurs. Changes to the protein sequence that are in the active site often have a bigger effect on the function of that protein than changes at other locations.

Finally, the 3D structure of proteins in a cell can be dynamic: steroid receptors like the estrogen receptor, for example, change shape when they bind to estrogen. These changes, called conformational changes, may in turn expose sites favorable to binding by other proteins. Similarly, many proteins can become **phosphorylated** by a class of proteins called kinases. This adds phosphate groups which, without changing the amino acid sequence, can alter the protein's 3D shape and therefore sometimes also its function and/or interactions with other proteins.


Here is an overview of differences between the sequence types:

| Sequence Type | Produced from | Main functions | Units | Types of units | Number of unit types |
| :--- | :--- | :--- | :--- | :--- | :---: |
| DNA | DNA by DNA replication or (more rarely) from RNA by reverse-transcription (e.g. in retroviruses) | Information storage | Nucleotides | Adenine, Thymine, Guanine, Cytosine | 4 |
| RNA | Transcription of RNA from DNA | Information transmission, catalysis of some reactions, information storage in some viral genomes (e.g. retroviruses) | Nucleotides | Adenine, Uracil, Guanine, Cytosine | 4 |
| Protein | Translation of RNA to protein in a ribosome | Catalysis of reactions, structural roles | Amino acids | Alanine, Arginine, Asparagine,<br>Aspartic acid, Cysteine, Glutamine,<br>Glutamic acid, Glycine, Histidine,<br>Isoleucine, Leucine, Lysine,<br>Methionine, Phenylalanine, Proline,<br>Serine, Threonine, Tryptophan,<br>Tyrosine, Valine | 20* |


\* There are 20 'standard' amino acids that are common in human biology. However, other amino acids appear in biology. For example, under certain circumstances 'UGA' codons - which usually halt translation - can produce selenocysteine, a 21st amino acid discovered in 1986. Further reading: [The 21st Amino Acid (Atkins and Gesteland 2000)](https://www.nature.com/articles/35035189)



**Standard representations of biological sequences as text**

The International Union of Pure and Applied Chemistry (IUPAC) has established standard codes for representing DNA, RNA, or amino acid (protein) sequences using single-letter codes. It is worth noting that these codes represent basic information about the sequence, but in real cells additional signals may be present (e.g. modifications of the histones that wrap up DNA can make it more or less accessible and thereby alter rates of transcription).

Standard codes for DNA nucleotides (a nice summary is available [here](http://zhanglab.ccmb.med.umich.edu/FASTA/)):
<pre>
A: Adenine
T: Thymine
C: Cytosine
G: Guanine
</pre>
The codes for RNA nucleotides are similar, except that RNA has Uracil instead of Thymine:
<pre>
A: Adenine
U: Uracil
C: Cytosine
G: Guanine
</pre>
These cover most cases. However, in some cases DNA sequencing machines can't tell the nucleotides apart.
Therefore there are also ambiguous characters to represent cases where the identity of a nucleotide cannot
fully be determined:
<pre>
N: any nucleotide
R: any purine nucleotide (G or A)
Y: any pyrimidine nucleotide (T or C)
M: any amino nucleotide (A or C)
K: any keto nucleotide (G or T)
S: any nucleotide that forms strong bonds (G or C)
W: any nucleotide that forms weak bonds (A or T)
B: any of G,T,C
D: any of G,A,T
H: any of A,C,T
V: any of G,C,A
</pre>
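
One way to work with these ambiguity codes in Python is to map each code to the set of nucleotides it allows. The sketch below does this for DNA and checks whether a concrete sequence is compatible with a pattern that may contain ambiguous letters. The names `IUPAC_DNA` and `matches` are chosen for this example, and uppercase input is assumed.

```python
# Each code maps to a string of the unambiguous nucleotides it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT",
    "R": "AG", "Y": "CT",
    "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def matches(pattern, sequence):
    """Return True if every base in sequence is allowed by the pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC_DNA[code] for code, base in zip(pattern, sequence))


print(matches("AYCG", "ATCG"))  # True: Y allows T
print(matches("AYCG", "AGCG"))  # False: Y does not allow G
```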
Additionally, when comparing two sequences, it is useful to have a gap character to represent when a letter has been deleted from one of the two sequences or added to the other (this is called an **indel** for insertion or deletion):

\- gap in one sequence
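
As a small illustration of how the gap character is used, the sketch below compares two sequences that have already been aligned (so they have the same length) and counts matching positions, mismatches, and gap positions. The function name `compare_aligned` and the example sequences are made up for this sketch; producing the alignment itself is a separate problem that is not shown here.

```python
def compare_aligned(seq1, seq2):
    """Count matches, mismatches, and gap positions in two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            counts["gap"] += 1       # an indel at this position
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1  # a substitution at this position
    return counts


print(compare_aligned("ATG-CA", "ATGGCT"))
# {'match': 4, 'mismatch': 1, 'gap': 1}
```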


**Representations of Amino Acids as text**

Each amino acid in a protein sequence has a full name, a three-letter abbreviation, and a one-letter code commonly used in bioinformatic analysis.
<pre>
 A: alanine (ALA)
 C: cysteine (CYS)
 D: aspartate (ASP)
 E: glutamate (GLU)
 F: phenylalanine (PHE)
 G: glycine (GLY)
 H: histidine (HIS)
 I: isoleucine (ILE)
 K: lysine (LYS)
 L: leucine (LEU)
 M: methionine (MET)
 N: asparagine (ASN)
 P: proline (PRO)
 Q: glutamine (GLN)
 R: arginine (ARG)
 S: serine (SER)
 T: threonine (THR)
 U: selenocysteine (SEC)
 V: valine (VAL)
 W: tryptophan (TRP)
 Y: tyrosine (TYR)
</pre>
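
A lookup table like this maps naturally onto a Python dictionary. The sketch below covers the 20 standard amino acids (selenocysteine is left out to keep it short) and converts a one-letter protein sequence into three-letter abbreviations. The names `THREE_LETTER` and `to_three_letter` are chosen for this example.

```python
THREE_LETTER = {
    "A": "ALA", "C": "CYS", "D": "ASP", "E": "GLU", "F": "PHE",
    "G": "GLY", "H": "HIS", "I": "ILE", "K": "LYS", "L": "LEU",
    "M": "MET", "N": "ASN", "P": "PRO", "Q": "GLN", "R": "ARG",
    "S": "SER", "T": "THR", "V": "VAL", "W": "TRP", "Y": "TYR",
}


def to_three_letter(protein):
    """Convert one-letter amino acid codes to dash-separated three-letter codes."""
    return "-".join(THREE_LETTER[aa] for aa in protein.upper())


print(to_three_letter("MEELVV"))  # MET-GLU-GLU-LEU-VAL-VAL
```

A letter that is not in the dictionary (for example an ambiguity code) raises a `KeyError` in this sketch, which at least makes unexpected input obvious.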

**Ambiguous amino acid codes**. Just as with nucleotides, certain amino acids are hard to tell apart, and so one-letter codes have been developed to represent this ambiguity:
<pre>
X: any amino acid
B: aspartate or asparagine (ASX)
Z: glutamate or glutamine (GLX)
</pre>
**Special amino acid codes**. Special amino acid codes are:
<pre>
- a gap or indel. Only used when comparing sequences, and indicates insertion or deletion of an amino acid in one sequence relative to another. 

* translation stop.
</pre>


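In Python, the four unambiguous DNA nucleotide codes can be stored as a list of one-letter strings:
<pre>
DNA_nucleotides = ['A','T','G','C']
</pre>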
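
As an example of calculating a useful quantity from a sequence stored as text, the sketch below uses the `DNA_nucleotides` list to check whether a string contains only unambiguous DNA letters, and computes its GC content (the fraction of letters that are G or C). The function names `is_unambiguous_dna` and `gc_content` are chosen for this example.

```python
DNA_nucleotides = ['A', 'T', 'G', 'C']


def is_unambiguous_dna(sequence):
    """Return True if every letter is one of A, T, G or C."""
    return all(letter in DNA_nucleotides for letter in sequence.upper())


def gc_content(sequence):
    """Return the fraction of letters in the sequence that are G or C."""
    sequence = sequence.upper()
    if len(sequence) == 0:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc = sequence.count('G') + sequence.count('C')
    return gc / len(sequence)


print(is_unambiguous_dna("ATCGATCG"))     # True
print(is_unambiguous_dna("AYCGAYCGAGC"))  # False (contains Y)
print(gc_content("ATGC"))                 # 0.5
```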

**Storing sequences in FASTA files**

DNA, RNA and protein sequences are often stored in text files called FASTA files. These might have several file extensions: .fasta, .fna (usually a nucleotide FASTA file), .faa (usually an amino acid FASTA file).

Here's how the lines of a FASTA file might appear:
<pre>
>gene1
ATCGATCGATCGTACGTCAGTCGTACGTCAGTCAGA
ACTGACTGTACGTACGTACGATGCTACGTACGCATA
ACTACGTACGTACGTACGATCGTACGTACGCATACG
ATGCTACGTACG
>gene2
ACGCTATCGATCGTACGTACGTAGCTACGTGGGGGG
AATATATTTGCGCGCCGTATAATATGCCGATATGCG
GTGCCTCTCTCGGCGCGCGCATTTTTGCGCGAAAAA
AAAAGCGATCGATCGTACGAAAAATGCATAGCTACG
AYCGAYCGAGC
>gene3
AGTCGACTGATCGTAGCTAGCTAGCTACGTAGCTAG
GA
</pre>

Look carefully at the above sequences. Here are the important features:
* each **label line** starts with a greater than ('>') sign. 
* the text after the label line, but before you encounter the next label
  line or the end of the file is the sequence that goes with that label.
* the sequences in FASTA files are often broken up across more than one line
* depending on where you get your FASTA file, the label lines could have 
  additional information embedded in them, or they could just be an uninformative
  id
* depending on where you get your FASTA file, the letters could be uppercase,
  lowercase, or a mixture.
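
The features listed above are enough to write a small FASTA reader. The following is a minimal sketch, not a full parser: it assumes every sequence has a label line, joins sequences that are split across lines, and converts letters to uppercase. The function name `parse_fasta` and the short example text are made up for illustration.

```python
def parse_fasta(lines):
    """Return a dict mapping each label to its full sequence."""
    sequences = {}
    label = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A label line starts a new record; drop the '>' sign.
            label = line[1:]
            sequences[label] = ""
        elif label is not None:
            # A sequence line: append it to the current record.
            sequences[label] += line.upper()
    return sequences


example = """>gene1
ATCGATCG
atcg
>gene2
GGCC
"""
records = parse_fasta(example.splitlines())
print(records)  # {'gene1': 'ATCGATCGATCG', 'gene2': 'GGCC'}
```

Because the function only loops over lines, it should also work on an open file, e.g. `parse_fasta(open("my_sequences.fasta"))`, where the filename is hypothetical. For real analyses, an established library such as Biopython is likely a better choice than a hand-written parser.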

-----------------------------------------------------------------------------------------------------------------
**Exercise: Guess the Sequence**

Given the above information, try to figure out what type of biomolecule is represented by each of the following sequences. For each, write down whether you know the answer for sure, or whether it's just likely given the sequence:


**Sequence 1:** MEELVVEVRGSNGAFYKAFVKDVHEDSITVAFENNWQPDRQIPFHDVRFPPPVGYNKDIN

**Sequence 2:** AUGAUCGUACGUCAGCUCGUACGUCGGCGGUGAUGCUAGCGCUACGUGACUACGUAGCUA

**Sequence 3:** ATGACAACACAATTAAATCCCTATTTTGGTGAATTTGGCGGAATGTATGTGCCGGAAATT


**Sequence 4:** AGC

-----------------------------------------------------------------------------------------------------------------

Once you have jotted down your answers, you can check them in the [answer key](./biological_sequences_exercise_answers.ipynb).