# Practice 2: Expression of genetic material

## Concepts to work

**Translation** is the synthesis of a protein from the mRNA chain. It takes place in ribosomes, which are complexes of RNA and protein. During this process, the mRNA sequence is read in groups of three nucleotides called **codons**, which are interpreted through the **genetic code** to produce a chain of amino acids**<sup> 6 </sup>** (fig. 4), which will fold and form proteins (fig. 3).

<img src="img/Figura4-en-es.png" alt="code" width="1000"/>

*Figure 4. The genetic code used in protein expression, showing how each codon is formed from three nucleotides (uracil, adenine, guanine, or cytosine), with the start codon (green) and the stop codons (red). Figure modified from: [Molecular biology of the gene, (2008), 15, 509-569](https://books.google.com.co/books?id=7tadzgEACAAJ&dq=Molecular+biology+of+the+gene&hl=es-419&sa=X&redir_esc=y)*

The ribosome reads the sequence in order, looking for the **start** codon AUG, which also codes for the amino acid methionine and marks the beginning of translation. As the ribosome advances, it builds the chain of amino acids: the process repeats many times, each time reading a nucleotide triplet and attaching the corresponding amino acid (fig. 3). The resulting chain can be long or short; it keeps growing until the ribosome reaches one of the three **stop** codons (UAA, UGA or UAG) (fig. 4). Once synthesized, the chain is released from the ribosome and is modified or combined with other chains to form a functional protein with a specific structure, involved in some essential process for the cell or organism**<sup> 7 </sup>**.
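
As a small illustration of reading a sequence in groups of three, the sketch below splits a short mRNA string into codons. The sequence `mrna` is a made-up toy example, not part of the practice data.

```python
# Hypothetical toy mRNA, used only to illustrate reading by triplets
mrna = "AUGGCUUUUUAA"

# Step through the sequence three nucleotides at a time,
# keeping only complete codons
codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]

print(codons)  # ['AUG', 'GCU', 'UUU', 'UAA']
```

Here the first codon is the start codon AUG and the last one, UAA, is one of the three stop codons.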


## Problem Statement
We continue with the general objective of obtaining basic information on the cytochrome P450 enzyme, the protein we worked with previously. To do this, we will carry out the second phase of gene expression, translation, in order to obtain the amino acids that make up the protein.

First, we must create a dictionary that holds the genetic code, mapping each codon (nucleotide triplet) to the amino acid it codes for. In the dictionary's `key-value` pairs, the `key` is the codon and the `value` is the amino acid.

In [56]:
#Dictionary of codons for translation
genetic_code = {"GUU": "V", "GUC": "V", "GUA": "V", "GUG": "V", "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
                "GAU": "D", "GAC": "D", "GAA": "E", "GAG": "E", "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
                "AGA": "R", "AGG": "R", "AGU": "S", "AGC": "S", "AAU": "N", "AAC": "N", "AAA": "K", "AAG": "K",
                "ACU": "T", "ACC": "T", "ACA": "T", "ACG": "T", "AUU": "I", "AUC": "I", "AUA": "I", "AUG": "M",
                "CGU": "R", "CGC": "R", "CGA": "R", "CGG": "R", "CCU": "P", "CCC": "P", "CCA": "P", "CCG": "P",
                "CAU": "H", "CAC": "H", "CAA": "Q", "CAG": "Q", "UUU": "F", "UUC": "F", "UUA": "L", "UUG": "L",
                "UCU": "S", "UCC": "S", "UCA": "S", "UCG": "S", "UAU": "Y", "UAC": "Y", "UAA": "STOP", "UAG": "STOP",
                "UGU": "C", "UGC": "C", "UGA": "STOP", "UGG": "W", "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L"}

print(f' Codons are: \n{list(genetic_code.keys())}')
print('-----------------')
print(f' Amino acids are: \n{list(genetic_code.values())}')

 Codons are: 
['GUU', 'GUC', 'GUA', 'GUG', 'GCU', 'GCC', 'GCA', 'GCG', 'GAU', 'GAC', 'GAA', 'GAG', 'GGU', 'GGC', 'GGA', 'GGG', 'AGA', 'AGG', 'AGU', 'AGC', 'AAU', 'AAC', 'AAA', 'AAG', 'ACU', 'ACC', 'ACA', 'ACG', 'AUU', 'AUC', 'AUA', 'AUG', 'CGU', 'CGC', 'CGA', 'CGG', 'CCU', 'CCC', 'CCA', 'CCG', 'CAU', 'CAC', 'CAA', 'CAG', 'UUU', 'UUC', 'UUA', 'UUG', 'UCU', 'UCC', 'UCA', 'UCG', 'UAU', 'UAC', 'UAA', 'UAG', 'UGU', 'UGC', 'UGA', 'UGG', 'CUU', 'CUC', 'CUA', 'CUG']
-----------------
 Amino acids are: 
['V', 'V', 'V', 'V', 'A', 'A', 'A', 'A', 'D', 'D', 'E', 'E', 'G', 'G', 'G', 'G', 'R', 'R', 'S', 'S', 'N', 'N', 'K', 'K', 'T', 'T', 'T', 'T', 'I', 'I', 'I', 'M', 'R', 'R', 'R', 'R', 'P', 'P', 'P', 'P', 'H', 'H', 'Q', 'Q', 'F', 'F', 'L', 'L', 'S', 'S', 'S', 'S', 'Y', 'Y', 'STOP', 'STOP', 'C', 'C', 'STOP', 'W', 'L', 'L', 'L', 'L']
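
The output above shows that several codons map to the same amino acid. The following sketch, which is an addition and not part of the original practice, makes that explicit by looking up single codons and grouping codons by amino acid. To stay self-contained it rebuilds the same standard genetic code compactly; the names `BASES`, `AMINO_ACIDS`, and `codons_by_amino_acid` are assumptions introduced here for illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code written in U, C, A, G order ('*' marks a stop codon).
# This compact form is an assumption for the sketch; it yields the same
# 64 entries as the dictionary defined in the cell above.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

genetic_code = {
    "".join(triplet): ("STOP" if aa == "*" else aa)
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Look up one codon: the key is the codon, the value is the amino acid
print(genetic_code["AUG"])  # M

# Invert the mapping: amino acid -> list of codons that code for it
codons_by_amino_acid = defaultdict(list)
for codon, amino_acid in genetic_code.items():
    codons_by_amino_acid[amino_acid].append(codon)

print(sorted(codons_by_amino_acid["STOP"]))  # ['UAA', 'UAG', 'UGA']
print(len(codons_by_amino_acid["L"]))        # 6
```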


## Control Structures
Next, we will use **control structures** to analyze the cytochrome RNA sequence stored in `RNA_CYP2C9` and synthesize the protein, following these steps:
1. Identify the start of the protein: AUG
2. Split the sequence into codons (groups of three nucleotides)
3. Find the stop codon (there are several; check the dictionary)
4. Print the protein: the amino acids from AUG up to, but not including, STOP

In [57]:
# reload the sequence
with open("data/sec_CYP2C9.fasta", "r") as GEN:
    sec_CYP2C9 = GEN.read()
DNA_CYP2C9 =(''.join(sec_CYP2C9.split('\n')[1:]))
RNA_CYP2C9 = DNA_CYP2C9.replace("T","U")

run = True
# search start codon AUG
i = 0
for i in range(len(RNA_CYP2C9)):
    if RNA_CYP2C9[i:i + 3] == 'AUG':  # Start of protein found
        RNA_CYP2C9 = RNA_CYP2C9[i:]  # trim sequence. new RNA
        break  # end the for loop
    if i >= (len(RNA_CYP2C9) - 3):   # Protein start NOT found
        print('Start codon AUG not found')
        RNA_CYP2C9 = RNA_CYP2C9[i:i + 3]
        run = False   # end up
        break   # end the for loop

# This code is only executed if the start of the protein was found
# Executes with the sequence trimmed

protein = list()
if run:
    i = 0
    # start translation
    while i <= len(RNA_CYP2C9) - 3:  # only read complete codons
        amino_acid = genetic_code[RNA_CYP2C9[i:i + 3]]  # translate the current codon
        protein.append(amino_acid)
        i += 3
        if amino_acid == 'STOP':
            print('>> Protein found')
            RNA_CYP2C9 = RNA_CYP2C9[i:]  # new RNA (trimmed after the stop codon)
            protein = protein[:-1]  # drop the STOP marker
            protein_text = ''.join(protein)
            print(f'Protein: {protein_text}')
            break
        if i > len(RNA_CYP2C9) - 3:  # no complete codon left to read
            print('STOP codon not found')
            break

>> Protein found
Protein: MDSLVVLVLCLSCLLLLSLWRQSSGRGKLPPGPTPLPVIGNILQIGIKDISKSLTNLSKVYGPVFTLYFGLKPIVVLHGYEAVKEALIDLGEEFSGRGIFPLAERANRGFGIVFSNGKKWKEIRRFSLMTLRNFGMGKRSIEDRVQEEARCLVEELRKTKASPCDPTFILGCAPCNVICSIIFHKRFDYKDQQFLNLMEKLNENIKILSSPWIQICNNFSPIIDYFPGTHNKLLKNVAFMKSYILEKVKEHQESMDMNNPQDFIDCFLMKMEKEKHNQPSEFTIESLENTAVDLFGAGTETTSTTLRYALLLLLKHPEVTAKVQEEIERVIGRNRSPCMQDRSHMPYTDAVVHEVQRYIDLLPTSLPHAVTCDIKFRNYLIPKGTTILISLTSVLHDNKEFPNPEMFDPHHFLDEGGNFKKSKYFMPFSAGKRICVGEALAGMELFLFLTSILQNFNLKSLVDPKNLDTTPVVNGFASVPPFYQLCFIPV
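
The four steps above can also be wrapped in a reusable function. The following is a minimal sketch under stated assumptions, not the code used in this practice: the function name `translate`, the reduced dictionary `toy_code`, and the sequence `toy_rna` are hypothetical and chosen only so the example runs on its own.

```python
def translate(rna, code):
    """Translate the first reading frame that starts at AUG.

    Returns the amino acid sequence as a string, or None when there is
    no start codon or no stop codon is reached before the sequence ends.
    """
    start = rna.find("AUG")          # step 1: locate the start codon
    if start == -1:
        return None
    peptide = []
    # step 2: walk the sequence in complete triplets from the start codon
    for i in range(start, len(rna) - 2, 3):
        amino_acid = code[rna[i:i + 3]]
        if amino_acid == "STOP":     # step 3: stop codon reached
            return "".join(peptide)  # step 4: the protein, without STOP
        peptide.append(amino_acid)
    return None                      # sequence ended before any stop codon


# Hypothetical reduced genetic code and toy sequence for illustration
toy_code = {"AUG": "M", "GCU": "A", "UUU": "F", "UAA": "STOP"}
toy_rna = "GGAUGGCUUUUUAAGC"

print(translate(toy_rna, toy_code))  # MAF
```

Returning `None` instead of printing a message lets the caller decide how to report a missing start or stop codon.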


In [58]:
# the protein variable stores a list of each amino acid
print(protein)

['M', 'D', 'S', 'L', 'V', 'V', 'L', 'V', 'L', 'C', 'L', 'S', 'C', 'L', 'L', 'L', 'L', 'S', 'L', 'W', 'R', 'Q', 'S', 'S', 'G', 'R', 'G', 'K', 'L', 'P', 'P', 'G', 'P', 'T', 'P', 'L', 'P', 'V', 'I', 'G', 'N', 'I', 'L', 'Q', 'I', 'G', 'I', 'K', 'D', 'I', 'S', 'K', 'S', 'L', 'T', 'N', 'L', 'S', 'K', 'V', 'Y', 'G', 'P', 'V', 'F', 'T', 'L', 'Y', 'F', 'G', 'L', 'K', 'P', 'I', 'V', 'V', 'L', 'H', 'G', 'Y', 'E', 'A', 'V', 'K', 'E', 'A', 'L', 'I', 'D', 'L', 'G', 'E', 'E', 'F', 'S', 'G', 'R', 'G', 'I', 'F', 'P', 'L', 'A', 'E', 'R', 'A', 'N', 'R', 'G', 'F', 'G', 'I', 'V', 'F', 'S', 'N', 'G', 'K', 'K', 'W', 'K', 'E', 'I', 'R', 'R', 'F', 'S', 'L', 'M', 'T', 'L', 'R', 'N', 'F', 'G', 'M', 'G', 'K', 'R', 'S', 'I', 'E', 'D', 'R', 'V', 'Q', 'E', 'E', 'A', 'R', 'C', 'L', 'V', 'E', 'E', 'L', 'R', 'K', 'T', 'K', 'A', 'S', 'P', 'C', 'D', 'P', 'T', 'F', 'I', 'L', 'G', 'C', 'A', 'P', 'C', 'N', 'V', 'I', 'C', 'S', 'I', 'I', 'F', 'H', 'K', 'R', 'F', 'D', 'Y', 'K', 'D', 'Q', 'Q', 'F', 'L', 'N', 'L', 'M', 'E', 'K',

## Practice activity 2
Based on what you have learned, analyze the sequence of amino acids obtained from the RNA sequence and answer:
1. How many amino acids does the protein have?
2. What is the most repeated amino acid?
3. Identify the nucleotide at which amino acid synthesis begins
4. At which nucleotide does amino acid synthesis end?
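
As a hint for the first two questions, the sketch below counts amino acids in a short list. The list `peptide` is a made-up example, not the cytochrome protein; the same calls should work on the `protein` list built earlier.

```python
from collections import Counter

# Hypothetical short peptide, stored as a list of one-letter amino acids
peptide = list("MKLLVAL")

# Number of amino acids in the chain
print(len(peptide))  # 7

# Count each amino acid and take the most frequent one
counts = Counter(peptide)
amino_acid, count = counts.most_common(1)[0]
print(amino_acid, count)  # L 3
```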

## Conclusions

At this point in the practice, we have used various commands and methods to obtain an amino acid sequence from a DNA `string`. The same process can be applied to nucleotide sequences of different sizes and from different organisms.

Thus, to obtain the amino acids that make up the protein, we used **data structures** (a dictionary and a list) and **control structures**. This gave us basic information on the amino acids of the cytochrome P450 protein, which we will use to classify them and obtain general information on the enzyme from its subunits (practice 3).