# Practice 2: Expression of genetic material

> **Note:** This book is available in two ways:
> 1. Downloading the repository and following the instructions in the file [README.md](https://github.com/ramirezlab/CHEMO/blob/main/README.md)
> 2. Clicking here on [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ramirezlab/CHEMO/blob/main/1_PART_ONE/1.3_Practice-2_and_Practice-3.en.ipynb?hl=es)


## Concepts to work

**Translation** is the synthesis of a protein from an mRNA chain. It takes place in the ribosomes, which are complexes of RNA and protein. During this process, the mRNA sequence is read in groups of three nucleotides called **codons**, which are interpreted according to the **genetic code**, so that each codon specifies one amino acid **<sup> 1 </sup>** (fig. 1). The resulting chain of amino acids then folds to form a protein.

<img src="img/Figura4-en-es.png" alt="code" width="1000"/>

*Figure 1. Genetic code used in the expression of proteins, showing how codons are formed from the nucleotides (uracil, adenine, guanine, or cytosine), with the start codon in green and the stop codons in red. Figure modified from: [Molecular biology of the gene, (2008), 15, 509-569](https://books.google.com.co/books?id=7tadzgEACAAJ&dq=Molecular+biology+of+the+gene&hl=es-419&sa=X&redir_esc=y)*

The ribosome reads the sequence in order, looking for the **start** codon AUG, which also codes for the amino acid methionine and begins translation (fig. 1). As the ribosome advances, it builds the chain of amino acids by repeating the same step many times: it reads a nucleotide triplet and attaches the corresponding amino acid. The resulting chain can be long or short; it keeps growing until the ribosome finds one of the three **stop** codons (UAA, UGA, or UAG) (fig. 1). Once synthesized, the chain is released from the ribosome and is modified or combined with other chains to form a functional protein with a specific structure, involved in some process essential for the cell or organism **<sup> 2 </sup>**.
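
As a small illustration of reading in triplets, the sketch below splits a short, made-up mRNA fragment into codons. The sequence and the names `toy_rna` and `codons` are hypothetical and only serve this example.

```python
# Hypothetical toy sequence (not the CYP2C9 mRNA)
toy_rna = "AUGGCUUGGUAA"

# Slice the string every three nucleotides to obtain the codons
codons = [toy_rna[i:i + 3] for i in range(0, len(toy_rna), 3)]
print(codons)  # ['AUG', 'GCU', 'UGG', 'UAA']
```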

## Problem Statement
Continuing with the general objective of obtaining basic information on the cytochrome P450 enzyme, the protein we worked on previously, we are going to carry out the second phase of gene expression (translation) in order to obtain the amino acids that make up the protein.

First, we must create a dictionary that contains the genetic code, mapping each codon (nucleotide triplet) to the amino acid it specifies. In the `key-value` pairs, the `key` is the codon and the `value` is the amino acid.
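
To make the `key-value` idea concrete, here is a minimal sketch with only three codons. The name `mini_code` is hypothetical; the complete dictionary used in the practice is `genetic_code`.

```python
# Hypothetical mini dictionary with only three codons (for illustration)
mini_code = {"AUG": "M", "UGG": "W", "UAA": "STOP"}

print(mini_code["AUG"])            # value stored under the key "AUG" -> M
print("GGG" in mini_code)          # check whether a key exists -> False
print(mini_code.get("GGG", "?"))   # .get() returns a default instead of raising KeyError -> ?
```

Looking up a missing key with square brackets raises a `KeyError`, which is why the dictionary used for translation needs all 64 codons.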

In [None]:
#Dictionary of codons for translation
genetic_code = {"GUU": "V", "GUC": "V", "GUA": "V", "GUG": "V", "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
                "GAU": "D", "GAC": "D", "GAA": "E", "GAG": "E", "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
                "AGA": "R", "AGG": "R", "AGU": "S", "AGC": "S", "AAU": "N", "AAC": "N", "AAA": "K", "AAG": "K",
                "ACU": "T", "ACC": "T", "ACA": "T", "ACG": "T", "AUU": "I", "AUC": "I", "AUA": "I", "AUG": "M",
                "CGU": "R", "CGC": "R", "CGA": "R", "CGG": "R", "CCU": "P", "CCC": "P", "CCA": "P", "CCG": "P",
                "CAU": "H", "CAC": "H", "CAA": "Q", "CAG": "Q", "UUU": "F", "UUC": "F", "UUA": "L", "UUG": "L",
                "UCU": "S", "UCC": "S", "UCA": "S", "UCG": "S", "UAU": "Y", "UAC": "Y", "UAA": "STOP", "UAG": "STOP",
                "UGU": "C", "UGC": "C", "UGA": "STOP", "UGG": "W", "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L"}

print(f' Codons are: \n{list(genetic_code.keys())}')
print('-----------------')
print(f' Amino acids are: \n{list(genetic_code.values())}')

## Control Structures
Next, we will use **control structures** to analyze the cytochrome RNA sequence `RNA_CYP2C9` and synthesize the protein, following these steps:
1. Identify the start of the protein: AUG
2. Divide the sequence into groups of three nucleotides (codons)
3. Find the stop codon (there are several; look at the dictionary)
4. Print the protein: the amino acids from AUG up to, but not including, the STOP codon
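
The four steps can be sketched on a short, made-up sequence before applying them to the real one. Everything here (`toy_code`, `toy_rna`, `toy_protein`) is hypothetical, and the sketch uses `str.find` to locate the start codon, which is one possible approach rather than the one the notebook cell uses.

```python
# Hypothetical toy data: a reduced genetic code and a short made-up RNA
toy_code = {"AUG": "M", "GCU": "A", "UGG": "W", "UAA": "STOP"}
toy_rna = "GGAUGGCUUGGUAACC"

# Step 1: locate the start codon (str.find returns -1 when it is absent)
start = toy_rna.find("AUG")

toy_protein = []
if start != -1:
    # Step 2: walk the sequence in steps of three from the start codon
    for i in range(start, len(toy_rna) - 2, 3):
        amino_acid = toy_code[toy_rna[i:i + 3]]
        # Step 3: stop at the first stop codon
        if amino_acid == "STOP":
            break
        toy_protein.append(amino_acid)

# Step 4: print the protein
print("".join(toy_protein))  # MAW
```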

In [None]:
# reload the sequence
with open("data/sec_CYP2C9.fasta", "r") as GEN:
    sec_CYP2C9 = GEN.read()
DNA_CYP2C9 =(''.join(sec_CYP2C9.split('\n')[1:]))
RNA_CYP2C9 = DNA_CYP2C9.replace("T","U")

run = False
# search for the start codon AUG
for i in range(len(RNA_CYP2C9) - 2):
    if RNA_CYP2C9[i:i + 3] == 'AUG':  # start of protein found
        RNA_CYP2C9 = RNA_CYP2C9[i:]  # trim the sequence so it begins at AUG
        run = True
        break  # end the for loop
if not run:  # start of protein NOT found
    print('Start codon AUG not found')

# This code is only executed if the start of the protein was found
# It runs on the trimmed sequence

protein = list()
if run:
    i = 0
    stop_found = False
    # start translation, reading one complete codon (three nucleotides) at a time
    while i <= len(RNA_CYP2C9) - 3:
        amino_acid = genetic_code[RNA_CYP2C9[i:i + 3]]
        i += 3
        if amino_acid == 'STOP':
            stop_found = True
            print('>> Protein found')
            RNA_CYP2C9 = RNA_CYP2C9[i:]  # remaining RNA after the stop codon
            protein_text = ''.join(protein)
            print(f'Protein: {protein_text}')
            break
        protein.append(amino_acid)  # STOP itself is never added to the protein
    if not stop_found:
        print('Stop codon not found')

In [None]:
# the protein variable stores a list of each amino acid
print(protein)

## Practice activity 2
Based on what you have learned, analyze the amino acid sequence obtained from the RNA and answer:
1. How many amino acids does the protein have?
2. What is the most repeated amino acid?
3. At which nucleotide does amino acid synthesis begin?
4. At which nucleotide does amino acid synthesis end?
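
As a possible starting point for the first two questions, the sketch below works on a made-up list of amino acids. The names `toy_protein`, `counts`, and `most_repeated` are hypothetical; in the practice, the real data is stored in the `protein` list.

```python
# Hypothetical toy protein (the real one is stored in the `protein` list)
toy_protein = ["M", "A", "L", "A", "K", "A", "L"]

# Question 1: the number of amino acids is the length of the list
print(len(toy_protein))  # 7

# Question 2: count each amino acid with a dictionary and keep the largest count
counts = {}
for amino_acid in toy_protein:
    counts[amino_acid] = counts.get(amino_acid, 0) + 1
most_repeated = max(counts, key=counts.get)
print(most_repeated, counts[most_repeated])  # A 3
```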

## Conclusions

At this point in the practice, we have used various commands and methods to obtain an amino acid sequence from a DNA `string`, a process that can be applied to nucleotide sequences of different sizes and from different organisms.

Thus, to obtain the amino acids that make up the protein, we used **data structures** (dictionaries and lists) and **control structures**. This gave us basic information on the amino acids of the cytochrome P450 protein, which we will use to classify them and obtain general information on the enzyme from its subunits (practice 3).

# Practice 3: Proteins and amino acids
## Concepts to work
The functional diversity of proteins arises from the variety of amino acids and the specific sequence in which they are arranged. Amino acids are low-molecular-weight subunits, each of which plays a specific role in the structure of the protein owing to its physicochemical properties, such as polarity, acidity or basicity, aromaticity, size, capacity to form bonds, or chemical reactivity **<sup> 3 </sup>** (Fig. 2). For this reason, they can be classified in different ways:

1. By polarity, that is, the ability to interact with water molecules:

   * Apolar: hydrophobic.
   * Polar: hydrophilic.
   * Acidic: negative charge at physiological pH.
   * Basic: positive charge at physiological pH.

<img src="img/Figura6-en.jpg" alt="amino acids" width="600"/>

*Figure 2. Structure and classification of amino acids by their polarity. Figure modified from: [A Brief Guide to the Twenty Common Amino Acids (2014)](https://www.compoundchem.com/2014/09/16/aminoacids/)*

2. By the conformation of their side chain, they can be grouped into:
   * Aliphatic
   * Aromatic
   * Hydroxyamino acids
   * Thioamino acids
   * Imino acids
   * Dicarboxylic
   * Amides
   * Dibasic
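
A classification like this can also be stored in a dictionary. The sketch below is only one possible assignment of the 20 amino acids to the groups listed above; it is an assumption for illustration, since sources differ on borderline cases, and the names `side_chain_groups` and `side_chain` are hypothetical.

```python
# Assumed grouping by side chain; sources differ on borderline cases
# (for example, tyrosine is aromatic but also carries a hydroxyl group)
side_chain_groups = {
    "Aliphatic": "GAVLI",
    "Aromatic": "FYW",
    "Hydroxyamino acid": "ST",
    "Thioamino acid": "CM",
    "Imino acid": "P",
    "Dicarboxylic": "DE",
    "Amide": "NQ",
    "Dibasic": "KRH",
}

# Invert the table so that each amino acid (key) maps to its group (value)
side_chain = {aa: group
              for group, letters in side_chain_groups.items()
              for aa in letters}

print(side_chain["P"])  # Imino acid
print(len(side_chain))  # 20
```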

Knowing the physicochemical properties of proteins has facilitated the prediction of their secondary structures, that is, understanding how a protein folds into its three-dimensional form from the chain of amino acids that composes it, through the possible bonds established between its subunits and between proteins.

## Problem Statement
To address the general objective of the practice, we will analyze two physicochemical properties of the cytochrome P450 enzyme, using the information in figure 2 as a guide. In this way, we will obtain basic information on the protein, which would facilitate a prediction of its folding through the use of omics sciences. The properties we are going to evaluate are:
* Polarity
* Acidity or basicity

Before beginning, we must create a dictionary of the physicochemical properties that we want to evaluate, with the classification of each amino acid, where the `key` is the amino acid and the `value` is its property.

In [None]:
# Dictionary of amino acids for their classification
properties = {"A": "Nonpolar", "V": "Nonpolar", "L": "Nonpolar", "G": "Nonpolar", "I": "Nonpolar",
              "F": "Nonpolar", "W": "Nonpolar", "M": "Nonpolar", "P": "Nonpolar",
              "S": "Polar", "T": "Polar", "Y": "Polar", "N": "Polar", "Q": "Polar", "C": "Polar",
              "D": "Acid", "E": "Acid",
              "K": "Basic", "R": "Basic", "H": "Basic"}

print(f'Amino acids are: \n{list(properties.keys())}')
print('-----------------')
print(f'The properties are: \n{set(properties.values())}')
# a set is built from the values so that the properties are not repeated

Next, we are going to create the `total_elements` function to get the number of polar, nonpolar, acidic, and basic amino acids present in the protein.
The `Counter` class from the `collections` module will be used; it takes the elements of a list and counts how many times each element is repeated.
The `Counter` object can then be converted into a `dictionary` where we can see the information.

`Counter` also has useful methods, for example `.most_common(n)`, which returns the `n` most common elements of the `Counter` together with their counts.

More information: https://docs.python.org/3/library/collections.html
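
A minimal sketch of `Counter` on a made-up list of properties (the names `toy_properties` and `counter` are hypothetical) shows what the function defined next relies on:

```python
from collections import Counter

# Hypothetical list of properties, as it could look for a five-residue peptide
toy_properties = ["Nonpolar", "Polar", "Nonpolar", "Basic", "Nonpolar"]

counter = Counter(toy_properties)
print(dict(counter))           # {'Nonpolar': 3, 'Polar': 1, 'Basic': 1}
print(counter.most_common(1))  # [('Nonpolar', 3)]
```

Note that `.most_common(1)` returns a list of `(element, count)` tuples, so `[0]` is needed to extract the single most common pair.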

In [None]:
def total_elements(protein_list):
    # The class to be used is imported
    from collections import Counter
    # An empty list is created to store the property of each amino acid
    list_protein_properties = list()

    # Iterate over each amino acid in the protein
    for element in protein_list:
        # The amino acid property is looked up and stored in the list (.append())
        list_protein_properties.append(properties[element])

    # Counter tallies how often each property appears (done once, after the loop)
    counter_1 = Counter(list_protein_properties)
    print('Summary of protein properties:')
    print(f'Total items: {len(protein_list)}')
    print(f'Frequency of properties: {dict(counter_1)}')
    if counter_1:  # most_common(1) is empty when the protein list is empty
        print(f'Most common property: {counter_1.most_common(1)[0]}')

    return None

In [None]:
# Protein enzyme cytochrome P450 (found in activity 2)
print(protein)
print('-----------')
total_elements(protein)

With this information, we now know the length of the amino acid sequence and the physicochemical properties of its structure, which has both polar and apolar regions, the latter being the most common. This gives us an approximation of the character of the amino acids and of the functional groups with which the protein tends to react.

## Practice activities

Taking into account what was reviewed in this first part, write Python code with which you can:

1. Count the number of thymines that were replaced by uracil. The mRNA sequence is found below:
   
GUCUUAACAAGAAGAGAAGGCUUCAAUGGAUUCUCUUGUGGUCCUUGUGCUCUGUCUCUCAUGUUUGCUUCUCCUUUCACUCUGGAGACAGAGCUCUGGGAGAGGAAAACUCCCUCCUGGCCCCACUCCUCUCCCAGUGAU
UGGAAAUAUCCUACAGAUAGGUAUUAAGGACAUCAGCAAAUCCUUAACCAAUCUCUCAAAGGUCUAUGGCCCUGUGUUCACUCUGUAUUUUGGCCUGAAACCCAUAGUGGUGCUGCAUGGAUAUGAAGCAGUGAAGGAAGC
CCUGAUUGAUCUUGGAGAGGAGUUUUCUGGAAGAGGCAUUUUCCCACUGGCUGAAAGAGCUAACAGAGGAUUUGGAAUUGUUUUCAGCAAUGGAAAGAAAUGGAAGGAGAUCCGGCGUUUCUCCCUCAUGACGCUGCGGAA
UUUUGGGAUGGGGAAGAGGAGCAUUGAGGACCGUGUUCAAGAGGAAGCCCGCUGCCUUGUGGAGGAGUUGAGAAAAACCAAGGCCUCACCCUGUGAUCCCACUUUCAUCCUGGGCUGUGCUCCCUGCAAUGUGAUCUGCUC
CAUUAUUUUCCAUAAACGUUUUGAUUAUAAAGAUCAGCAAUUUCUUAACUUAAUGGAAAAGUUGAAUGAAAACAUCAAGAUUUUGAGCAGCCCCUGGAUCCAGAUCUGCAAUAAUUUUUCUCCUAUCAUUGAUUACUUCCC
GGGAACUCACAACAAAUUACUUAAAAACGUUGCUUUUAUGAAAAGUUAUAUUUUGGAAAAAGUAAAAGAACACCAAGAAUCAAUGGACAUGAACAACCCUCAGGACUUUAUUGAUUGCUUCCUGAUGAAAAUGGAGAAGGA
AAAGCACAACCAACCAUCUGAAUUUACUAUUGAAAGCUUGGAAAACACUGCAGUUGACUUGUUUGGAGCUGGGACAGAGACGACAAGCACAACCCUGAGAUAUGCUCUCCUUCUCCUGCUGAAGCACCCAGAGGUCACAGC
UAAAGUCCAGGAAGAGAUUGAACGUGUGAUUGGCAGAAACCGGAGCCCCUGCAUGCAAGACAGGAGCCACAUGCCCUACACAGAUGCUGUGGUGCACGAGGUCCAGAGAUACAUUGACCUUCUCCCCACCAGCCUGCCCCA
UGCAGUGACCUGUGACAUUAAAUUCAGAAACUAUCUCAUUCCCAAGGGCACAACCAUAUUAAUUUCCCUGACUUCUGUGCUACAUGACAACAAAGAAUUUCCCAACCCAGAGAUGUUUGACCCUCAUCACUUUCUGGAUGA
AGGUGGCAAUUUUAAGAAAAGUAAAUACUUCAUGCCUUUCUCAGCAGGAAAACGGAUUUGUGUGGGAGAAGCCCUGGCCGGCAUGGAGCUGUUUUUAUUCCUGACCUCCAUUUUACAGAACUUUAACCUGAAAUCUCUGGU
UGACCCAAAGAACCUUGACACCACUCCAGUUGUCAAUGGAUUUGCCUCUGUGCCGCCCUUCUACCAGCUGUGCUUCAUUCCUGUCUGAAGAAGAGCAGAUGGCCUGGCUGCUGCUGUGCAGUCCCUGCAGCUCUCUUUCCU
CUGGGGCAUUAUCCAUCUUUCACUAUCUGUAAUGCCUUUUCUCACCUGUCAUCUCACAUUUUCCCUUCCCUGAAGAUCUAGUGAACAUUCGACCUCCAUUACGGAGAGUUUCCUAUGUUUCACUGUGCAAAUAUAUCUGCU
AUUCUCCAUACUCUGUAACAGUUGCAUUGACUGUCACAUAAUGCUCAUACUUAUCUAAUGUUGAGUUAUUAAUAUGUUAUUAUUAAAUAGAGAAAUAUGAUUUGUGUAUUAUAAUUCAAAGGCAUUUCUUUUCUGCAUGUU
CUAAAUAAAAAGCAUUAUUAUUUGCUGAGUCAGUUUAUUAGACCUUCCUUCUUUUAUGCAUAAUGUAGGUCAGAAAUUAAAGAAAAUAGAGUUCCAGGAGGCCAUGCUGGUUCUCAAAAUGAUAAGGACAGAAAGGACAAA
GAGGAAGAGGGUAGGGAAGCUAUUUUGGGUGAGUGUUAGAGUUACUUGAGGAUUGGAUUUGAAAGUGAGAAACUGUGUCCAGGGGCAGCUCUAACCUCUAGGGAAAUAUUCAGAGGAUCAGUCAAAGGGUGGAAUGGACAU
UAAAUGCUAGAAUUCUUAUAUCCACAUUGGUGUUCCUUUUUUUUUGAGACAAAGUCUUGCUCUGUCACCCAGGCUGGAGUGCAGUGGUGUGAUCUCAGCUCUCUAUAACCUCCGCCUCCCAGGUUCAAGUGAUUCUCCUGC
CUCAGCCUCCUGAGUAGCUGGGAUUACAGGUGCAUGCCACCACACCUGGCUAAUUUUUUGUAUUUUUAGUACAGACGGGUUUUCACCGUGUUAGCCAGGAUGGUCUUAAUCUCCUGACCUUGUGAUCUGCCUGCCUCAGCC
UCCCAAAGUGCUGGGAUUACAGGUGUGAGCCACUGCGCCUGGCCUGGUGUUACUUUGAAGUGUCAUUACUUUAUCUCUAAAUAAAGAAUCAGAUUACUUUUAUUACUUCAUGUUUCCAACUUAGAAUGAUGUAAUGAAGUA
UAAAUACAUGCUUUCAUAUCGCU

2. Perform a new classification of the protein based on the conformation of the side chain.
3. Identify the difference between the final amino acid sequence compared to the CYP2C19 sequence (NM_000769.4), a sequence from the same family of cytochrome P450 proteins that presents a polymorphism (mutation) in the amino acid sequence.
   
At the end, you must prepare a document in PDF format in which you attach the proposed code and the output of the execution.
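
As a hint for activities 1 and 3, the sketch below uses made-up data only. The names `toy_mrna`, `protein_a`, `protein_b`, and `differences` are hypothetical, and the real sequences must be substituted by the reader.

```python
# Hypothetical toy sequences (not the real CYP2C9 or CYP2C19 data)
toy_mrna = "AUGGCUUGGUAA"
protein_a = "MAWKL"
protein_b = "MAWRL"

# Activity 1: every U in the mRNA corresponds to a T in the original DNA
print(toy_mrna.count("U"))  # 4

# Activity 3: report the positions (1-based) where two sequences differ;
# zip stops at the shorter sequence, so compare the lengths separately
differences = [(pos, a, b)
               for pos, (a, b) in enumerate(zip(protein_a, protein_b), start=1)
               if a != b]
print(differences)  # [(4, 'K', 'R')]
```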

## Conclusions
In this tutorial, you learned how basic Python tools are used in bioinformatics practices, ranging from data collection and management to the use of commands and methods for data analysis. This was done in two phases, theory and practice, in which we carried out the expression of the genetic material of a protein, from a DNA sequence to the corresponding amino acids and their properties.
In the next tutorials, we will explain more Python tools used for collecting and organizing data obtained from electronic resources, implementing different libraries.

# References

1. Translation. (2023). Genome.Gov. https://www.genome.gov/genetics-glossary/Translation
2. Genetic code. (2023). Genome.Gov. https://www.genome.gov/genetics-glossary/Genetic-Code
3. Cortés, G. & Aguilar-Ruiz, J. (2006). Importancia de las Propiedades Físico-Químicas de los Aminoácidos en la Predicción de Estructuras de Proteínas usando Vecinos más Cercanos