# Introduction to the Python programming language

The "Collaborative Chemoinformatics Open Platform" (CHEMO) course is a self-paced learning platform aimed at students and researchers interested in developing biocomputing tools with Python, one of the most widely used programming languages. With advances in the omic sciences and new technologies, it has become necessary to acquire computing skills applied to science and to the management of biological databases.

Welcome to the introduction to the CHEMO course. This first part covers the fundamentals of the Python programming language, with a focus on the manipulation, extraction and analysis of data from biological (omics) databases. It starts from the basic concepts of programming and their applications, with examples based on the manipulation of DNA sequences.

## Contents

In this first *Python_basic* Jupyter notebook you will learn:

1. Introduction to Python
2. Variables
3. Types of data
4. Types of collections
5. Upload files
6. String manipulation
7. Flow control structures
8. Functions

# Learning outcomes

It is expected that by the end of this notebook the user will be able to:

1. Understand the basics of the Python programming language.
2. Understand variable formats:
    - What are they?
    - The assignments
    - The basic operations
3. Understand the uses and types of data:
    - Numerical
    - Boolean
    - Text
    - Collections
4. Identify the types of collections and their use:
    - Lists
    - Tuples
    - Sets
    - Dictionaries
5. Load and modify files in different formats and sizes, view them, analyze them, read and write over them.
6. Manage basic flow control structures:
    - Conditionals
    - Iterations
7. Understand the Python tools that can be used in biology.

# Theory: basic concepts

## Variables

A variable is a reference to a value in the computer's memory, where different **types of data** can be stored. Values are assigned to variable names. For example, in `text = "hello world"`, `text` is the name of the variable and `"hello world"` is the value it refers to. The `=` operator performs the assignment, so this line is known as an assignment statement.

It is usually the software developer who chooses a variable name that is easy to remember and use in the program. Variable names cannot begin with a number, cannot include spaces, and are case sensitive.
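As a minimal sketch of the naming rules above (the variable names are illustrative and not part of the course material):

```python
import keyword

gene_name = "CYP2C9"   # assignment: the name gene_name now refers to this string
Gene_name = "cyp2c9"   # a different variable, because names are case sensitive
sample_2 = 10          # digits are allowed, but not as the first character

print(gene_name == Gene_name)      # False: two variables with different values

# Names such as 2sample or "gene name" (with a space) raise a SyntaxError.
# Reserved words cannot be used as names either:
print(keyword.iskeyword("for"))    # True
```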

The following table shows some of the data that can be worked on in Python:

### Native Python data types

| Name | Type name | Category | Description | Example |
| :-------------------- | :--------- | :------------- | :-------------------------------------------- | :-------------------------------------------- |
| integer | `int` | Numeric | Positive and negative integers, and zero | 1, 2, ..., n |
| floating point number | `float` | Numeric | Real numbers in decimal form | 1.0, 1.1, 1.22, 1.333, ..., n |
| boolean | `bool` | Logical | True or False | True, False |
| string | `str` | Text | Text | 'Hello world...' |
| list | `list` | Sequence | An ordered and mutable collection of objects | [0, 1, 2, 3, ..., n] |
| tuple | `tuple` | Sequence | An immutable, ordered collection of objects | (1, 2, 3, ..., n) |
| dictionary | `dict` | Mapping | A mapping of keys to values | {"First Name": "Doe", "Last Name": "Laden"} |
| none | `NoneType` | Null | Represents the absence of a value | `None` |

(_Note: variable names cannot be the same as Python reserved words_)

## Type of data

### Data type: numeric

Python has three built-in numeric types (`int`, `float` and `complex`); here we will generally work with two of them: integers (`int`) and floating point numbers (`float`). The `type()` function returns the type of an object in Python.

* **Integers (int)**: this data type includes all integers. Since that set is infinite, the size of an integer in `Python` is limited only by the available memory.
* **Floating point (float)**: this data type is used to represent real numbers, usually as a close approximation. `float` values are stored in floating point representation, which is described in detail in [the IEEE 754 standard](https://en.wikipedia.org/wiki/IEEE_754). If a whole number is written with a decimal point, for example 1.0, it is stored as a float.
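The two points above can be checked with a short sketch. The values are illustrative, and `math.isclose` is shown only as one common way to compare floats, not as the course's prescribed method:

```python
import math

big_integer = 2 ** 100            # integers grow as needed, limited only by memory
print(big_integer)                # 1267650600228229401496703205376

total = 0.1 + 0.2                 # floats are binary approximations of decimal values
print(total == 0.3)               # False
print(math.isclose(total, 0.3))   # True: compare floats with a tolerance

print(type(1.0))                  # <class 'float'>
```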


In [1]:
# Example: integers and floats
integer_1 = 12361
float_1 = 123.215  # the decimal separator is a point; writing 123,215 would create a tuple
float_2 = 1236.0

print("The type of the variable integer_1 is: " + str(type(integer_1)))
print("The type of the variable float_1 is: " + str(type(float_1)))
print("The type of float_2 is: " + str(type(float_2)))

The type of the variable integer_1 is: <class 'int'>
The type of the variable float_1 is: <class 'float'>
The type of float_2 is: <class 'float'>


#### Arithmetic operations

| Symbol | Description |
|:-------:|:---------------:|
| `+` | addition |
| `-` | subtraction |
| `*` | multiplication |
| `/` | division |
| `**` | power |
| `//` | integer division |
| `%` | modulo (remainder) |

*Note*: From now on we will work with `f-strings`, which allow you to embed variables and expressions directly in a string. More information: [f-strings (Platzi)](https://platzi.com/blog/f-strings-en-python/?utm_source=google&utm_medium=cpc&utm_campaign=12915366154&utm_adgroup=&utm_content=&gclid=Cj0KCQjw3IqSBhCoARIsAMBkTb2p5ZOBtPtlGG2B7P0qrtnp8Wwvbgd2OY_F3_P-6OOU1YE_QHHCMaYaAnTaEALw_wcB&gclsrc=aw.ds), [PEP 498](https://peps.python.org/pep-0498/)
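A short sketch of the arithmetic operators from the table above, printed with f-strings. The operands `a` and `b` are arbitrary illustrative values:

```python
a = 7
b = 2

print(f"{a} + {b} = {a + b}")     # 9
print(f"{a} - {b} = {a - b}")     # 5
print(f"{a} * {b} = {a * b}")     # 14
print(f"{a} / {b} = {a / b}")     # 3.5 (division always returns a float)
print(f"{a} ** {b} = {a ** b}")   # 49
print(f"{a} // {b} = {a // b}")   # 3 (integer division discards the remainder)
print(f"{a} % {b} = {a % b}")     # 1 (remainder of the division)
```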

In [2]:
# Example operations with integers
x = 10
y = 5

print(f'The sum of two integers is an integer, for example: {x + y}')

The sum of two integers is an integer, for example: 15


In [3]:
# Example operations with real numbers in decimal form
x = 10.0
y = 5.0

print(f'The sum of two float numbers results in a float number, for example: {x + y}')

The sum of two float numbers results in a float number, for example: 15.0


In [4]:
# Example operations with real numbers and integers
x = 10
y = 5.0

# Note that in mixed arithmetic the Python interpreter converts the int to a float.
print(f'The sum of an integer and a float is a float, for example: {x + y}')

The sum of an integer and a float is a float, for example: 15.0



In Jupyter notebooks (an interactive environment for Python), the value of the last expression in a cell is displayed automatically. This means that it is not always necessary to use the `print()` function.

In [5]:
#Example
x = 15
y = 20

x + y # The operation of this cell is displayed automatically

35

### Data type: Boolean (bool)

This data type (`bool`) has only two values: `True` and `False`. It is used with comparison and logical operators, where it is necessary to evaluate whether a statement is true or not.
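As a minimal sketch, comparisons produce `bool` values that can be combined with `and`, `or` and `not`. The variable names and the threshold are hypothetical:

```python
sequence_length = 12                 # illustrative value

is_long = sequence_length > 10       # comparison operators return a bool
is_empty = sequence_length == 0

print(is_long)                       # True
print(is_empty)                      # False
print(is_long and not is_empty)      # True
print(is_long or is_empty)           # True
print(type(is_long))                 # <class 'bool'>
```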

### Data type: text (String)

Strings are denoted as <code>str</code> and are a sequence of symbols that can include uppercase and lowercase letters, numbers, punctuation marks, and spaces.

There are three ways to write this type of data. All of them are valid and produce the same kind of string:
* **Between single quotes:** 'Donepezil'
* **Between double quotes:** "Donepezil"
* **Between three single quotes or three double quotes:** '''Donepezil''' or """Donepezil""" *(Mainly used to define multi-line strings)*

In [6]:
# Example of a variable of type string
text = 'Hello world'
print(f'text: {text}')

text: Hello world


In [7]:
# Example of definitions of string type variables, with single, double and triple quotes.
text = "Hello world, double quotes"
print(f'Text - Double Quotes: {text}')

Text - Double Quotes: Hello world, double quotes


In [8]:
text = 'Hello World, single quotes'
print(f'Text - Single Quotes: {text}')

Text - Single Quotes: Hello World, single quotes


In [9]:
# Triple quotes preserve separators (spaces, newlines)
text = """Hello world,
triple double quote"""
print(f'Text - Triple Double Quote: {text}')

Text - Triple Double Quote: Hello world,
triple double quote


In [10]:
text = '''Hello world,
triple single quote'''
print(f'Text - Triple Single Quote: {text}')

Text - Triple Single Quote: Hello world,
triple single quote


If the text is a very long string, it is possible to write it in multiline mode, as shown below:

In [11]:
multiline_text = """This text is multiline
because it has different lines
for this it is printed
"""
# \n is a line separator
print(f'Multiline text \n{multiline_text}')

Multiline text 
This text is multiline
because it has different lines
for this it is printed



#### Indexing

One of the easiest ways to manipulate a string is through indexing:

A string is similar to a list in that each character has an _index_; in `Python` the first element has index zero. Indexing and slicing (`text[start:end]`, where `end` is not included) let you extract parts of the text.

In [12]:
text = "Text to be manipulated"

### EXAMPLE 1
print(f'Extract a word from text (text[8:10]): {text[8:10]}') # from index 8 up to, but not including, index 10

Extract a word from text (text[8:10]): be


In [13]:
#### EXAMPLE 2
print(f'Extract the word \'manipulated\' from the text by counting from the front: {text[11:22]}') # positive indices count from the beginning
print(f'Extract the word \'manipulated\' from the text by counting from the back: {text[-11:]}') # negative indices count from the end

Extract the word 'manipulated' from the text by counting from the front: manipulated
Extract the word 'manipulated' from the text by counting from the back: manipulated


It is even possible to do logic checks with strings as seen below:

In [14]:
text = "Text to be manipulated"

### EXAMPLE 1
print('Is \'to\' in text?')
print("to" in text)

#### EXAMPLE 2

print('\nIs \'want\' in text?')
print("want" in text)

Is 'to' in text?
True

Is 'want' in text?
False


#### String Methods

Python provides several string methods that return a new value without changing the original string. Among them are:
* <code>.replace()</code>: Replaces a specific value in the string with another.
* <code>.split()</code>: Splits the string into substrings based on the separator given. Returns a list of items.
* <code>.find()</code>: Searches the string for a specific value. Returns the index (position) of the first match, or -1 if the value is not found.

In [15]:
text = "Hello world. This is sample text"
print(text)

# replace a word
text.replace("Hello", "Hi")

Hello world. This is sample text


'Hi world. This is sample text'

In [16]:
# Separate the phrase by spaces
split_list = text.split(" ")
split_list

['Hello', 'world.', 'This', 'is', 'sample', 'text']

In [17]:
# Find the position of the word 'world' in text
text.find("world")

6

### Data type: Collections

Lists, tuples, sets, and dictionaries are used to store multiple items in the same variable.
* **Lists (list):** the elements are ordered and indexed, the list can be modified, and duplicates are allowed
* **Tuples (tuple):** the elements are ordered and indexed and cannot be changed, added or deleted once the tuple is created; duplicates are allowed
* **Sets (set):** the elements have no order and are not indexed, and there cannot be duplicates; elements can be added and removed, but each element must itself be of an immutable type
* **Dictionaries (dict):** are used to store data values in key:value pairs; elements keep their insertion order (Python 3.7 and later), values can be modified, and duplicate keys are not allowed
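The differences between the four collection types can be sketched with the same data stored in each of them. The variable names are illustrative:

```python
bases_list = ["A", "C", "G", "T", "A"]    # list: ordered, mutable, duplicates kept
bases_tuple = ("A", "C", "G", "T", "A")   # tuple: ordered, immutable, duplicates kept
bases_set = {"A", "C", "G", "T", "A"}     # set: unordered, duplicates dropped
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}  # dict: key:value pairs

bases_list.append("G")      # lists can grow after creation

print(len(bases_list))      # 6
print(len(bases_tuple))     # 5
print(len(bases_set))       # 4
print(complement["A"])      # T
```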

#### Lists

Lists are used to store multiple items in a single variable. The elements or data that are stored can be of any type. The following are the characteristics of this type of data:
* The elements of the lists **are ordered**, that is, they have a defined order that will not change because when adding new elements to the list they will be placed at the end of it.
* Items in lists are **modifiable**, meaning you can change, add, and remove items after the list has been created.
* Lists **allow duplicates**, that is, there can be elements with the same value.

In [18]:
# Store in the num variable an ordered list of elements from 0 to 10
num = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(f'The list of integers from 0 to 10 is:')
num

The list of integers from 0 to 10 is:


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [19]:
# to write the same list you can use the range function
num = list(range(11))
num

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [20]:
num = [1, "2", 3, "4", 5, "6", 7, 8, "9", 10]
# Note how string elements are enclosed in quotes
print(f'The list is:')
num

The list is:


[1, '2', 3, '4', 5, '6', 7, 8, '9', 10]

In the previous example you can see the way in which a list is written:
* It is delimited by square brackets `[ ]`
* Each element is separated by commas `,`

Since the elements of a list are ordered, the index of an element can be found with the <code>.index()</code> method, which returns the index of the first occurrence of the element, searching from index 0, regardless of how many times the element appears in the list.
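A small sketch of this behaviour with a list that contains a repeated element. The list is illustrative; the optional second argument of `.index()` sets the position where the search starts:

```python
bases = ["A", "T", "G", "T", "C"]

first_t = bases.index("T")               # only the first occurrence is reported
print(first_t)                           # 1

next_t = bases.index("T", first_t + 1)   # continue searching after the first match
print(next_t)                            # 3
```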

##### Indexing 
The same indexing technique used for strings can be applied to lists:

In [21]:
# num: an ordered list of elements from 0-10
num = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print('list is:')
print(num)
print('---------------------')

#### EXAMPLE 1
print(f'In position 0 of the list is the element (first element):')
print(num[0])
print('---------------------')

#### EXAMPLE 2
print(f'The first two elements of the list are:')
print(num[0:2])
print('---------------------')

#### EXAMPLE 3
print(f'The last two elements of the list are:')
print(num[-2:])

list is:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
---------------------
In position 0 of the list is the element (first element):
0
---------------------
The first two elements of the list are:
[0, 1]
---------------------
The last two elements of the list are:
[9, 10]


##### List Methods
Sometimes it is necessary to perform some basic operations on lists. Some of the available functions and methods are:

* <code>len()</code>
* <code>.index() </code>
* <code>.pop() </code>
* <code>.append()</code>
* <code>.remove() </code>
* <code>.reverse()</code>

###### len() Function
`len(<list>)` is a built-in function, not a list method; it returns the number of elements in the list:

In [22]:
num = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print('Number of elements in the list:')
len(num)

Number of elements in the list:


11

###### .index() Method
With `.index(<element>)` you can find the position or index of an element in the list:

In [23]:
# It is possible to find out the position of an element, even the value in a place of the list
print(f'What is the position of number 5?')
num.index(5)

What is the position of number 5?


5

###### .pop() method
With `.pop(<index>)` an item is removed from the list and returned.
- If the index is not specified, the last element is removed

This method **modifies the list in place**:

In [24]:
print('Remove the last item in the list')
print(num.pop())
print(num)

Remove the last item in the list
10
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


###### .append() Method
`.append(<element>)` adds the element to the end of the list. This method __modifies the list in place__.

In [25]:
num.append(6)
print(f'The list now with element 6 at the end:')
print(num)

The list now with element 6 at the end:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 6]


###### .remove() Method
With `.remove(<element>)` an element is removed from the list; the value to remove must be given:
- If there are several identical elements, only the first one is removed
- If the element is not in the list, a `ValueError` is raised

This method __modifies the list in place__.

In [26]:
num.remove(2)
print(f'The list now without the 2:')
print(num)

The list now without the 2:
[0, 1, 3, 4, 5, 6, 7, 8, 9, 6]


###### .reverse() Method
With `.reverse()` the order of the list is reversed, from the last element to the first. This method __modifies the list in place__.

In [27]:
# recreate num
num = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
num.reverse() # Change the order
print('List in reverse order')
print(num)

List in reverse order
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


#### Tuples
Tuples are used to store multiple items in a single variable. The elements or data that are stored can be of any type. The following are the characteristics of this type of data:
* The elements of tuples **are ordered**, that is, they have a defined order that will not change.
* Tuples **are immutable**, that is, elements cannot be changed, added, or removed after the tuple has been created.
* Tuples **allow duplicates**, that is, there can be elements with the same value.

In [28]:
# Let tuple_num be an ordered tuple of elements from 1 to 10
tuple_num = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

print(f'The tuple of elements from 1 to 10 is:')
print(tuple_num)

The tuple of elements from 1 to 10 is:
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)


A tuple is immutable, so new values cannot be assigned to its elements. Uncommenting the line below raises a `TypeError`:

In [29]:
# tuple_num[3] = 12
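A minimal sketch of what happens when the assignment is attempted, using `try/except` so that the cell keeps running. Building a new tuple, as in the last lines, is one possible workaround and an assumption of this sketch rather than something the course prescribes:

```python
tuple_num = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

try:
    tuple_num[3] = 12                      # item assignment is not allowed
except TypeError as error:
    print(f"Assignment failed: {error}")

# The original tuple is unchanged; a modified copy has to be built instead
updated = tuple_num[:3] + (12,) + tuple_num[4:]
print(updated)                             # (1, 2, 3, 12, 5, 6, 7, 8, 9, 10)
```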

Tuple elements can be accessed in the same way as a list

In [30]:
tuple_num = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
print('Tuple of elements')
print(tuple_num)
print('---------------------')

#### EXAMPLE 1
print('First element (zero position)')
print(tuple_num[0])
print('---------------------')

#### EXAMPLE 2
print('Last four items')
print(tuple_num[-4:])
print('---------------------')

#### EXAMPLE 3
print('tuple of elements from position two to five')
print(tuple_num[2:5])
print('---------------------')

Tuple of elements
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
---------------------
First element (zero position)
1
---------------------
Last four items
(7, 8, 9, 10)
---------------------
tuple of elements from position two to five
(3, 4, 5)
---------------------


#### Sets
Sets are used to store multiple items in a single variable. The elements that are stored can be of any immutable type, such as numbers, strings or tuples. The following are the characteristics of this type of data:
* Elements in sets are **unordered**, that is, they have no defined order: the elements may appear in a different order each time you use them, and they cannot be referenced by index or key.
* Elements in sets **cannot be changed in place**: an existing element cannot be modified, although elements can be added to or removed from the set.
* Sets **do not allow duplicates**, that is, there cannot be elements with the same value.

In [31]:
# Set of elements with some repetitions
set_num_1 = {1, 2, 1, 2, 3, 5, 8, 1, 3, 9, 10, 2, 3, 4, 5, 6, 7, 8, 1, 0}

print('The set of elements is:')
set_num_1

The set of elements is:


{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
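A short sketch of common set operations, with illustrative names. `sorted()` is used when printing only so that the output order is predictable:

```python
observed = {"A", "C", "G"}
observed.add("T")        # add a new element
observed.add("A")        # already present: the set does not change
observed.discard("C")    # remove an element

print(len(observed))     # 3
print("T" in observed)   # True: membership test

dna_bases = {"A", "C", "G", "T"}
rna_bases = {"A", "C", "G", "U"}
print(sorted(dna_bases & rna_bases))   # ['A', 'C', 'G'] (intersection)
print(sorted(dna_bases - rna_bases))   # ['T'] (difference)
```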

#### Dictionaries
Dictionaries are a special data structure that gives us a bit more flexibility: they store a collection of pairs, each made up of a key and a value. The key is a unique identifier that describes the stored value and is used to look it up. This type of structure is also known as a key-value mapping.

Let's see an example. We will define a dictionary that maps each RNA codon to the amino acid it encodes, according to the genetic code:

In [32]:
genetic_code = {
    "GUU": "V", "GUC": "V", "GUA": "V",
    "GUG": "V", "GCU": "A", "GCC": "A",
    "GCA": "A", "GCG": "A", "GAU": "D",
    "GAC": "D", "GAA": "E", "GAG": "E",
    "GGU": "G", "GGC": "G", "GGA": "G",
    "GGG": "G", "AGA": "R", "AGG": "R",
    "AGU": "S", "AGC": "S", "AAU": "N",
    "AAC": "N", "AAA": "K", "AAG": "K",
    "ACU": "T", "ACC": "T", "ACA": "T",
    "ACG": "T", "AUU": "I", "AUC": "I",
    "AUA": "I", "AUG": "M", "CGU": "R",
    "CGC": "R", "CGA": "R", "CGG": "R",
    "CCU": "P", "CCC": "P", "CCA": "P",
    "CCG": "P", "CAU": "H", "CAC": "H",
    "CAA": "Q", "CAG": "Q", "UUU": "F",
    "UUC": "F", "UUA": "L", "UUG": "L",
    "UCU": "S", "UCC": "S", "UCA": "S",
    "UCG": "S", "UAU": "Y", "UAC": "Y",
    "UAA": "STOP", "UAG": "STOP", "UGU": "C",
    "UGC": "C", "UGA": "STOP", "UGG": "W",
    "CUU": "L", "CUC": "L", "CUA": "L",
    "CUG": "L"}

# The previously created dictionary is printed
print(genetic_code)

{'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V', 'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E', 'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G', 'AGA': 'R', 'AGG': 'R', 'AGU': 'S', 'AGC': 'S', 'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K', 'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M', 'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q', 'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L', 'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S', 'UAU': 'Y', 'UAC': 'Y', 'UAA': 'STOP', 'UAG': 'STOP', 'UGU': 'C', 'UGC': 'C', 'UGA': 'STOP', 'UGG': 'W', 'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L'}


Dictionaries can also be created with the built-in `dict()` function, for example from keyword arguments or from a list of key-value pairs.
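As a hedged sketch of creating dictionaries with `dict()`, the two small dictionaries below are illustrative subsets of the genetic code, not a replacement for `genetic_code`:

```python
# From keyword arguments: each argument name becomes a string key
stop_codons = dict(UAA="STOP", UAG="STOP", UGA="STOP")

# From a list of (key, value) pairs
pairs = [("AUG", "M"), ("UGG", "W")]
single_codon_amino_acids = dict(pairs)

print(stop_codons)                       # {'UAA': 'STOP', 'UAG': 'STOP', 'UGA': 'STOP'}
print(single_codon_amino_acids["AUG"])   # M
```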

We can access dictionary values with square brackets, in the same way as with lists, but using the key instead of a numeric index.

In [33]:
### Example 1: What is the value of the genetic code of GGG?
print(f'The amino acid of GGG is: {genetic_code["GGG"]}')
print('---------------------')

### Example 2: Accessing a key that does not exist in the dictionary raises a KeyError
genetic_code['YYY']


The amino acid of GGG is: G
---------------------


KeyError: 'YYY'

Example 2 in the cell above raised a `KeyError` because `YYY` is not defined in the dictionary. In `Python` you can use exceptions to define how to handle errors. This is done with the `try/except` statement. In general, a `try/except` block has the following structure:

```markdown
try:
    code block
except <type>:
    code block
```

Now, the example can be written as follows:

In [34]:
### Example 2: Using try/except for when a key does not exist in the dictionary
try:
    genetic_code['YYY']
except KeyError:
    print('Property YYY does not exist in the dictionary, try another one!')

Property YYY does not exist in the dictionary, try another one!
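As an alternative to `try/except`, the `.get()` method of dictionaries returns a default value when a key is missing. The sketch below uses a small hypothetical subset of the codon table so that it can run on its own:

```python
codon_table = {"AUG": "M", "UGG": "W", "UAA": "STOP"}   # illustrative subset

print(codon_table.get("AUG"))        # M
print(codon_table.get("YYY"))        # None: no error is raised
print(codon_table.get("YYY", "?"))   # ?: a custom default value
print("YYY" in codon_table)          # False: membership test on the keys
```

`try/except` remains the better choice when a missing key is a real error that should be reported.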


# Practice 1: Expression of genetic material

## Concepts to work

Nucleotides are the basic units of the nucleic acids that make up the genetic material, which is present in prokaryotic and eukaryotic cells and in viruses<sup> 1 </sup>. Each nucleotide is made up of a pentose, a phosphate group and a nitrogenous base. The bases are divided into two groups: the purines, adenine (A) and guanine (G), and the pyrimidines, cytosine (C), thymine (T) and uracil (U)<sup> 2 </sup>. Chains of nucleotides form the macromolecules essential for life:

<img src="img/Figura1-en.png" alt="estructure" width="600"/>

*Figure 1. Structure of DNA and RNA showing their characteristics and the nucleotides that compose them. Prepared by the authors.*

**DNA (deoxyribonucleic acid)**: a macromolecule in charge of storing and expressing the genetic information essential for the functions of any organism. It has an ordered arrangement of four nitrogenous bases, A, G, T and C, which form an antiparallel and complementary double helix in which A always pairs with T, and G with C. If any base is modified or the order of the bases changes, the information changes, which can give rise to mutations<sup> 3 </sup>.
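The base-pairing rule (A with T, G with C) can be sketched with the string methods `str.maketrans()` and `.translate()`. The fragment below is a short illustrative sequence and the variable names are hypothetical:

```python
# Translation table that maps each base to its complementary base
complement_table = str.maketrans("ATGC", "TACG")

strand = "ATGGATTCT"                                  # illustrative fragment
complementary = strand.translate(complement_table)
print(complementary)                                  # TACCTAAGA

# The strands are antiparallel, so the complementary strand is usually
# written reversed (the reverse complement)
reverse_complement = complementary[::-1]
print(reverse_complement)                             # AGAATCCAT
```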

**RNA (ribonucleic acid)**: the macromolecule that results from the transcription of DNA, where T is replaced by U; that is, it is a copy determined by the sequence of one of the DNA strands<sup> 4 </sup>.

One of the functions of double-stranded DNA is the expression of the genetic material, the process in charge of the synthesis of the proteins that the cell needs (fig. 2). It consists of two main phases, **transcription** and **translation**, in which a DNA sequence codes for a particular protein involved in different processes such as metabolism or cell identity.

<img src="img/Figura2-en.jpg" alt="transcription-translation" width="600"/>

*Figure 2. Central dogma of molecular biology, showing the expression of the genetic material: the transcription of DNA to mRNA, the translation to amino acids and the formation of the protein. Figure modified from: [Dogma](https://www.brainvta.tech/plus/list.php?tid=110).*

**Transcription** is the first step in the generation of proteins. From a DNA strand, called the template DNA, an RNA strand is synthesized by the enzyme RNA polymerase. The result is a copy almost identical to the DNA sequence, except that T is replaced by U throughout; like T, U pairs with A (fig. 3). In eukaryotic cells, this first transcript undergoes a second process called "splicing", in which specific fractions of the sequence that do not code for proteins are removed. Transcription is like translating a book from one language to another <sup> 5 </sup>.

<img src="img/Figura3-en.png" alt="transcription" width="600"/>

*Figure 3. Synthesis of mRNA, showing the transcription of DNA to RNA with the construction of the mRNA chain from the DNA template chain; the enzymes involved are not shown. Figure modified from: [Molecular biology of the gene, (2008), 13, 429-464](https://books.google.com.co/books?id=7tadzgEACAAJ&dq=Molecular+biology+of+the+gene&hl=es-419&sa=X&redir_esc=y)*
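As a minimal sketch of transcription with strings, the mRNA can be obtained in two equivalent ways: by replacing T with U in the coding strand, or by pairing each base of the template strand with its RNA complement. The short fragments and variable names are illustrative assumptions:

```python
coding_strand = "ATGGATTCT"      # illustrative fragment, written 5' to 3'
template_strand = "TACCTAAGA"    # its complementary strand, written 3' to 5'

# Option 1: the mRNA has the same sequence as the coding strand, with U instead of T
mrna_from_coding = coding_strand.replace("T", "U")
print(mrna_from_coding)          # AUGGAUUCU

# Option 2: pair each base of the template strand with its RNA complement
dna_to_rna = str.maketrans("ATGC", "UACG")
mrna_from_template = template_strand.translate(dna_to_rna)
print(mrna_from_template)        # AUGGAUUCU
```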

## Problem Statement
Suppose we want to obtain basic information on a cytochrome P450 gene, which encodes an enzyme involved in drug metabolism and in the synthesis of lipids such as cholesterol and steroids. To analyze the sequence we can use bioinformatics tools. We will start by manipulating the sequence as a <code>string</code>, transcribing the DNA sequence to mRNA.

First, we must download the file we are going to work with. We can search for the gene in the GenBank database: we look for the sequence of cytochrome P450 family 2 subfamily C member 9 (CYP2C9) of *Homo sapiens*, ID: [NM_000771.4](https://www.ncbi.nlm.nih.gov/nuccore/NM_000771.4?report=genbank)

To download the sequence, select the **"Fasta"** section, then open **"Send to"**, choose **"Complete Record"**, select **"File"** as the destination and **"Fasta"** as the format, and click on **"Create File"**. The document will be downloaded. To make the file easy to recognize, rename it, in this case to "sec_CYP2C9.fasta".
- *The `sec_CYP2C9.fasta` file can be found in the `/data/` folder*

Next, the file is loaded in order to carry out the transcription process. For this, the `with` statement is used together with `open()`: the `GEN` variable holds a file object, and the file is closed automatically when the `with` block ends. The variable can be given any valid name. The line `sec_CYP2C9 = GEN.read()` saves in the `sec_CYP2C9` variable a string with the content of the file.
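The behaviour of `with` can be sketched on a small temporary file, so that the example runs without the course data. The file name `example.fasta` and its content are made up for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.fasta")

    # Write a tiny FASTA-like file to read back
    with open(path, "w") as handle:
        handle.write(">example_header\nATGC\nGGTA\n")

    # Read it: the file is open only inside the with block
    with open(path, "r") as handle:
        content = handle.read()

    file_closed = handle.closed   # checked after the with block has ended

print(file_closed)            # True: the file was closed automatically
print(content.count("\n"))    # 3 line breaks in the text that was read
```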

### Sequence NM_000771

In [35]:
#Nucleotide sequence of the CYP2C9 gene
with open("data/sec_CYP2C9.fasta", "r") as GEN:
    sec_CYP2C9 = GEN.read()

# the GEN variable is an object that Python can manipulate
print(f'GEN variable type: {type(GEN)}')

# The variable sec_CYP2C9 is a string that we can manipulate
print(f'Variable type sec_CYP2C9: {type(sec_CYP2C9)}')

GEN variable type: <class '_io.TextIOWrapper'>
Variable type sec_CYP2C9: <class 'str'>


In [36]:
print('The downloaded GenBank file of the cytochrome P450 gene is:')
sec_CYP2C9

The downloaded GenBank file of the cytochrome P450 gene is:


'>NM_000771.4 Homo sapiens cytochrome P450 family 2 subfamily C member 9 (CYP2C9), mRNA\nGTCTTAACAAGAAGAGAAGGCTTCAATGGATTCTCTTGTGGTCCTTGTGCTCTGTCTCTCATGTTTGCTT\nCTCCTTTCACTCTGGAGACAGAGCTCTGGGAGAGGAAAACTCCCTCCTGGCCCCACTCCTCTCCCAGTGA\nTTGGAAATATCCTACAGATAGGTATTAAGGACATCAGCAAATCCTTAACCAATCTCTCAAAGGTCTATGG\nCCCTGTGTTCACTCTGTATTTTGGCCTGAAACCCATAGTGGTGCTGCATGGATATGAAGCAGTGAAGGAA\nGCCCTGATTGATCTTGGAGAGGAGTTTTCTGGAAGAGGCATTTTCCCACTGGCTGAAAGAGCTAACAGAG\nGATTTGGAATTGTTTTCAGCAATGGAAAGAAATGGAAGGAGATCCGGCGTTTCTCCCTCATGACGCTGCG\nGAATTTTGGGATGGGGAAGAGGAGCATTGAGGACCGTGTTCAAGAGGAAGCCCGCTGCCTTGTGGAGGAG\nTTGAGAAAAACCAAGGCCTCACCCTGTGATCCCACTTTCATCCTGGGCTGTGCTCCCTGCAATGTGATCT\nGCTCCATTATTTTCCATAAACGTTTTGATTATAAAGATCAGCAATTTCTTAACTTAATGGAAAAGTTGAA\nTGAAAACATCAAGATTTTGAGCAGCCCCTGGATCCAGATCTGCAATAATTTTTCTCCTATCATTGATTAC\nTTCCCGGGAACTCACAACAAATTACTTAAAAACGTTGCTTTTATGAAAAGTTATATTTTGGAAAAAGTAA\nAAGAACACCAAGAATCAATGGACATGAACAACCCTCAGGACTTTATTGATTGCTTCCTGATGAAAATGGA\nGAAGGAAAAGCACAACCAACCATCTGAATTTACTATTGAAAGCTTGG

We see that the variable `sec_CYP2C9` stores a single string containing the special character `\n`, which represents a line break. If we print the sequence it is easier to read:

In [37]:
print(sec_CYP2C9)

>NM_000771.4 Homo sapiens cytochrome P450 family 2 subfamily C member 9 (CYP2C9), mRNA
GTCTTAACAAGAAGAGAAGGCTTCAATGGATTCTCTTGTGGTCCTTGTGCTCTGTCTCTCATGTTTGCTT
CTCCTTTCACTCTGGAGACAGAGCTCTGGGAGAGGAAAACTCCCTCCTGGCCCCACTCCTCTCCCAGTGA
TTGGAAATATCCTACAGATAGGTATTAAGGACATCAGCAAATCCTTAACCAATCTCTCAAAGGTCTATGG
CCCTGTGTTCACTCTGTATTTTGGCCTGAAACCCATAGTGGTGCTGCATGGATATGAAGCAGTGAAGGAA
GCCCTGATTGATCTTGGAGAGGAGTTTTCTGGAAGAGGCATTTTCCCACTGGCTGAAAGAGCTAACAGAG
GATTTGGAATTGTTTTCAGCAATGGAAAGAAATGGAAGGAGATCCGGCGTTTCTCCCTCATGACGCTGCG
GAATTTTGGGATGGGGAAGAGGAGCATTGAGGACCGTGTTCAAGAGGAAGCCCGCTGCCTTGTGGAGGAG
TTGAGAAAAACCAAGGCCTCACCCTGTGATCCCACTTTCATCCTGGGCTGTGCTCCCTGCAATGTGATCT
GCTCCATTATTTTCCATAAACGTTTTGATTATAAAGATCAGCAATTTCTTAACTTAATGGAAAAGTTGAA
TGAAAACATCAAGATTTTGAGCAGCCCCTGGATCCAGATCTGCAATAATTTTTCTCCTATCATTGATTAC
TTCCCGGGAACTCACAACAAATTACTTAAAAACGTTGCTTTTATGAAAAGTTATATTTTGGAAAAAGTAA
AAGAACACCAAGAATCAATGGACATGAACAACCCTCAGGACTTTATTGATTGCTTCCTGATGAAAATGGA
GAAGGAAAAGCACAACCAACCATCTGAATTTACTATTGAAAGCTTGGAAAACACTGCAGTT

## String manipulation

### Loading and manipulation of text files
When a sequence is downloaded from GenBank in FASTA format, the first line contains the sequence description (header) and the lines of text are separated by `\n`. With the following procedure you can clean the text and leave only the sequence:
 - Use the `.split()` method, which splits the string into a list of elements. Here we use it to split at the line breaks, which are represented by **'\n'**.
 - Then remove the first element (index zero), which is the header.
 - Finally, join the remaining elements into a single string with the `.join()` method.

In [38]:
# 1. Separate the string by lines
separated_sec = sec_CYP2C9.split('\n')
print("Separated list:\n", str(separated_sec))

Separated list:
 ['>NM_000771.4 Homo sapiens cytochrome P450 family 2 subfamily C member 9 (CYP2C9), mRNA', 'GTCTTAACAAGAAGAGAAGGCTTCAATGGATTCTCTTGTGGTCCTTGTGCTCTGTCTCTCATGTTTGCTT', 'CTCCTTTCACTCTGGAGACAGAGCTCTGGGAGAGGAAAACTCCCTCCTGGCCCCACTCCTCTCCCAGTGA', 'TTGGAAATATCCTACAGATAGGTATTAAGGACATCAGCAAATCCTTAACCAATCTCTCAAAGGTCTATGG', 'CCCTGTGTTCACTCTGTATTTTGGCCTGAAACCCATAGTGGTGCTGCATGGATATGAAGCAGTGAAGGAA', 'GCCCTGATTGATCTTGGAGAGGAGTTTTCTGGAAGAGGCATTTTCCCACTGGCTGAAAGAGCTAACAGAG', 'GATTTGGAATTGTTTTCAGCAATGGAAAGAAATGGAAGGAGATCCGGCGTTTCTCCCTCATGACGCTGCG', 'GAATTTTGGGATGGGGAAGAGGAGCATTGAGGACCGTGTTCAAGAGGAAGCCCGCTGCCTTGTGGAGGAG', 'TTGAGAAAAACCAAGGCCTCACCCTGTGATCCCACTTTCATCCTGGGCTGTGCTCCCTGCAATGTGATCT', 'GCTCCATTATTTTCCATAAACGTTTTGATTATAAAGATCAGCAATTTCTTAACTTAATGGAAAAGTTGAA', 'TGAAAACATCAAGATTTTGAGCAGCCCCTGGATCCAGATCTGCAATAATTTTTCTCCTATCATTGATTAC', 'TTCCCGGGAACTCACAACAAATTACTTAAAAACGTTGCTTTTATGAAAAGTTATATTTTGGAAAAAGTAA', 'AAGAACACCAAGAATCAATGGACATGAACAACCCTCAGGACTTTATTGATTGCTTCCTGATGAAAATGGA', 'GAA

In [39]:
# 2. Keep the elements from the second one (index 1) to the end, dropping the header line (index 0).
separated_sec = separated_sec[1:]
# View the list without the first item
print("Separated list, without the first line:\n" + str(separated_sec))

Separated list, without the first line:
['GTCTTAACAAGAAGAGAAGGCTTCAATGGATTCTCTTGTGGTCCTTGTGCTCTGTCTCTCATGTTTGCTT', 'CTCCTTTCACTCTGGAGACAGAGCTCTGGGAGAGGAAAACTCCCTCCTGGCCCCACTCCTCTCCCAGTGA', 'TTGGAAATATCCTACAGATAGGTATTAAGGACATCAGCAAATCCTTAACCAATCTCTCAAAGGTCTATGG', 'CCCTGTGTTCACTCTGTATTTTGGCCTGAAACCCATAGTGGTGCTGCATGGATATGAAGCAGTGAAGGAA', 'GCCCTGATTGATCTTGGAGAGGAGTTTTCTGGAAGAGGCATTTTCCCACTGGCTGAAAGAGCTAACAGAG', 'GATTTGGAATTGTTTTCAGCAATGGAAAGAAATGGAAGGAGATCCGGCGTTTCTCCCTCATGACGCTGCG', 'GAATTTTGGGATGGGGAAGAGGAGCATTGAGGACCGTGTTCAAGAGGAAGCCCGCTGCCTTGTGGAGGAG', 'TTGAGAAAAACCAAGGCCTCACCCTGTGATCCCACTTTCATCCTGGGCTGTGCTCCCTGCAATGTGATCT', 'GCTCCATTATTTTCCATAAACGTTTTGATTATAAAGATCAGCAATTTCTTAACTTAATGGAAAAGTTGAA', 'TGAAAACATCAAGATTTTGAGCAGCCCCTGGATCCAGATCTGCAATAATTTTTCTCCTATCATTGATTAC', 'TTCCCGGGAACTCACAACAAATTACTTAAAAACGTTGCTTTTATGAAAAGTTATATTTTGGAAAAAGTAA', 'AAGAACACCAAGAATCAATGGACATGAACAACCCTCAGGACTTTATTGATTGCTTCCTGATGAAAATGGA', 'GAAGGAAAAGCACAACCAACCATCTGAATTTACTATTGAAAGCTTGGAAAACACTGCAGTTGACTTGTTT

In [40]:
# Join the list elements into a single string to rebuild the sequence.
DNA_CYP2C9 =(''.join(separated_sec))
print('Final sequence:')
DNA_CYP2C9

Final sequence:


'GTCTTAACAAGAAGAGAAGGCTTCAATGGATTCTCTTGTGGTCCTTGTGCTCTGTCTCTCATGTTTGCTTCTCCTTTCACTCTGGAGACAGAGCTCTGGGAGAGGAAAACTCCCTCCTGGCCCCACTCCTCTCCCAGTGATTGGAAATATCCTACAGATAGGTATTAAGGACATCAGCAAATCCTTAACCAATCTCTCAAAGGTCTATGGCCCTGTGTTCACTCTGTATTTTGGCCTGAAACCCATAGTGGTGCTGCATGGATATGAAGCAGTGAAGGAAGCCCTGATTGATCTTGGAGAGGAGTTTTCTGGAAGAGGCATTTTCCCACTGGCTGAAAGAGCTAACAGAGGATTTGGAATTGTTTTCAGCAATGGAAAGAAATGGAAGGAGATCCGGCGTTTCTCCCTCATGACGCTGCGGAATTTTGGGATGGGGAAGAGGAGCATTGAGGACCGTGTTCAAGAGGAAGCCCGCTGCCTTGTGGAGGAGTTGAGAAAAACCAAGGCCTCACCCTGTGATCCCACTTTCATCCTGGGCTGTGCTCCCTGCAATGTGATCTGCTCCATTATTTTCCATAAACGTTTTGATTATAAAGATCAGCAATTTCTTAACTTAATGGAAAAGTTGAATGAAAACATCAAGATTTTGAGCAGCCCCTGGATCCAGATCTGCAATAATTTTTCTCCTATCATTGATTACTTCCCGGGAACTCACAACAAATTACTTAAAAACGTTGCTTTTATGAAAAGTTATATTTTGGAAAAAGTAAAAGAACACCAAGAATCAATGGACATGAACAACCCTCAGGACTTTATTGATTGCTTCCTGATGAAAATGGAGAAGGAAAAGCACAACCAACCATCTGAATTTACTATTGAAAGCTTGGAAAACACTGCAGTTGACTTGTTTGGAGCTGGGACAGAGACGACAAGCACAACCCTGAGATATGCTCTCCTTCTCCTGCTGAAGCACCCAGAGGTCACAGCTAAAGTCCAGGA

The entire process, from uploading the file, to cleaning it up, and saving it to a variable can be combined into a single cell:

In [41]:
#Nucleotide sequence of the CYP2C9 gene
with open("data/sec_CYP2C9.fasta", "r") as GEN:
    sec_CYP2C9 = GEN.read()
DNA_CYP2C9 =(''.join(sec_CYP2C9.split('\n')[1:]))
print('Final sequence:')
DNA_CYP2C9

Final sequence:


'GTCTTAACAAGAAGAGAAGGCTTCAATGGATTCTCTTGTGGTCCTTGTGCTCTGTCTCTCATGTTTGCTTCTCCTTTCACTCTGGAGACAGAGCTCTGGGAGAGGAAAACTCCCTCCTGGCCCCACTCCTCTCCCAGTGATTGGAAATATCCTACAGATAGGTATTAAGGACATCAGCAAATCCTTAACCAATCTCTCAAAGGTCTATGGCCCTGTGTTCACTCTGTATTTTGGCCTGAAACCCATAGTGGTGCTGCATGGATATGAAGCAGTGAAGGAAGCCCTGATTGATCTTGGAGAGGAGTTTTCTGGAAGAGGCATTTTCCCACTGGCTGAAAGAGCTAACAGAGGATTTGGAATTGTTTTCAGCAATGGAAAGAAATGGAAGGAGATCCGGCGTTTCTCCCTCATGACGCTGCGGAATTTTGGGATGGGGAAGAGGAGCATTGAGGACCGTGTTCAAGAGGAAGCCCGCTGCCTTGTGGAGGAGTTGAGAAAAACCAAGGCCTCACCCTGTGATCCCACTTTCATCCTGGGCTGTGCTCCCTGCAATGTGATCTGCTCCATTATTTTCCATAAACGTTTTGATTATAAAGATCAGCAATTTCTTAACTTAATGGAAAAGTTGAATGAAAACATCAAGATTTTGAGCAGCCCCTGGATCCAGATCTGCAATAATTTTTCTCCTATCATTGATTACTTCCCGGGAACTCACAACAAATTACTTAAAAACGTTGCTTTTATGAAAAGTTATATTTTGGAAAAAGTAAAAGAACACCAAGAATCAATGGACATGAACAACCCTCAGGACTTTATTGATTGCTTCCTGATGAAAATGGAGAAGGAAAAGCACAACCAACCATCTGAATTTACTATTGAAAGCTTGGAAAACACTGCAGTTGACTTGTTTGGAGCTGGGACAGAGACGACAAGCACAACCCTGAGATATGCTCTCCTTCTCCTGCTGAAGCACCCAGAGGTCACAGCTAAAGTCCAGGA

### Indexing and slicing of strings
To handle very long `string` data, you can access parts of it by position using square brackets: <code>[index]</code> returns a single character, and <code>[start:end]</code> returns the characters from `start` up to, but not including, `end`.
Let's see some examples:

#### Example 1
Print parts of the `DNA_CYP2C9` sequence
(_the count starts from zero_)
*Note*: From now on we will work with `f-strings`, which let you embed variables directly in a short line of text. More information: [f-strings (Platzi)](https://platzi.com/blog/f-strings-en-python/?utm_source=google&utm_medium=cpc&utm_campaign=12915366154&utm_adgroup=&utm_content=&gclid=Cj0KCQjw3IqSBhCoARIsAMBkTb2p5ZOBtPtlGG2B7P0qrtnp8Wwvbgd2OY_F3_P-6OOU1YE_QHHCMaYaAnTaEALw_wcB&gclsrc=aw.ds), [f-strings (PEP 498)](https://peps.python.org/pep-0498/)

In [42]:
print(f'The first nucleotide of the sequence is: {DNA_CYP2C9[0]}')

print(f'The first ten nucleotides of the sequence are: {DNA_CYP2C9[:10]}')

The first nucleotide of the sequence is: G
The first ten nucleotides of the sequence are: GTCTTAACAA


In [43]:
print(f'The sequence spanning 2 to 10 nucleotides is: {DNA_CYP2C9[2:10]}')

The sequence spanning 2 to 10 nucleotides is: CTTAACAA


In [44]:
print(f'The nucleotide sequence from position 1100 is: {DNA_CYP2C9[1100:]}')

The nucleotide sequence from position 1100 is: TTGACCTTCTCCCCACCAGCCTGCCCCATGCAGTGACCTGTGACATTAAATTCAGAAACTATCTCATTCCCAAGGGCACAACCATATTAATTTCCCTGACTTCTGTGCTACATGACAACAAAGAATTTCCCAACCCAGAGATGTTTGACCCTCATCACTTTCTGGATGAAGGTGGCAATTTTAAGAAAAGTAAATACTTCATGCCTTTCTCAGCAGGAAAACGGATTTGTGTGGGAGAAGCCCTGGCCGGCATGGAGCTGTTTTTATTCCTGACCTCCATTTTACAGAACTTTAACCTGAAATCTCTGGTTGACCCAAAGAACCTTGACACCACTCCAGTTGTCAATGGATTTGCCTCTGTGCCGCCCTTCTACCAGCTGTGCTTCATTCCTGTCTGAAGAAGAGCAGATGGCCTGGCTGCTGCTGTGCAGTCCCTGCAGCTCTCTTTCCTCTGGGGCATTATCCATCTTTCACTATCTGTAATGCCTTTTCTCACCTGTCATCTCACATTTTCCCTTCCCTGAAGATCTAGTGAACATTCGACCTCCATTACGGAGAGTTTCCTATGTTTCACTGTGCAAATATATCTGCTATTCTCCATACTCTGTAACAGTTGCATTGACTGTCACATAATGCTCATACTTATCTAATGTTGAGTTATTAATATGTTATTATTAAATAGAGAAATATGATTTGTGTATTATAATTCAAAGGCATTTCTTTTCTGCATGTTCTAAATAAAAAGCATTATTATTTGCTGAGTCAGTTTATTAGACCTTCCTTCTTTTATGCATAATGTAGGTCAGAAATTAAAGAAAATAGAGTTCCAGGAGGCCATGCTGGTTCTCAAAATGATAAGGACAGAAAGGACAAAGAGGAAGAGGGTAGGGAAGCTATTTTGGGTGAGTGTTAGAGTTACTTGAGGATTGGATTTGAAAGTGAGAAACTGTGTC

In [45]:
# Negative indices count from the end of the string, which is useful for long sequences
print(f'The last 25 nucleotides of the sequence are: {DNA_CYP2C9[-25:]}')

The last 25 nucleotides of the sequence are: TATAAATACATGCTTTCATATCGCT


In [46]:
print(f'total nucleotides: {len(DNA_CYP2C9)}')

total nucleotides: 2561
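
The slicing rules used above are easier to check on a short string. The following is a minimal sketch on a made-up toy sequence (`seq` is an assumption for illustration and is not taken from the FASTA file). It also shows the optional third slice value, the step, which the cells above do not use.

```python
# Toy sequence, made up for illustration (13 characters, indices 0 to 12)
seq = "ATGCGTACGTTAG"

first_codon = seq[0:3]     # indices 0, 1 and 2; the end index is excluded
last_three = seq[-3:]      # negative start: the final three characters
every_third = seq[::3]     # step of 3: indices 0, 3, 6, 9 and 12
reversed_seq = seq[::-1]   # step of -1: the whole string read backwards

print(first_codon)         # ATG
print(last_three)          # TAG
print(every_third)         # ACATG
print(reversed_seq)        # GATTGCATGCGTA
print(len(seq[2:10]))      # 8, because index 10 itself is not included
```

Because the end index is excluded, a slice `[start:end]` always contains `end - start` characters when both positions exist in the string.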


### DNA Transcription
The second thing we are going to do is simulate **transcription**, the first phase of gene expression. To do this, we replace every thymine (T) with uracil (U) throughout the sequence using the string method <code>.replace()</code>. The result is the mRNA sequence, which will later be translated into a protein.

In [47]:
# Transcribe the CYP2C9 sequence from DNA to RNA, using the ".replace()" method
RNA_CYP2C9 = DNA_CYP2C9.replace("T","U")
print("mRNA Sequence:\n", str(RNA_CYP2C9))

mRNA Sequence:
 GUCUUAACAAGAAGAGAAGGCUUCAAUGGAUUCUCUUGUGGUCCUUGUGCUCUGUCUCUCAUGUUUGCUUCUCCUUUCACUCUGGAGACAGAGCUCUGGGAGAGGAAAACUCCCUCCUGGCCCCACUCCUCUCCCAGUGAUUGGAAAUAUCCUACAGAUAGGUAUUAAGGACAUCAGCAAAUCCUUAACCAAUCUCUCAAAGGUCUAUGGCCCUGUGUUCACUCUGUAUUUUGGCCUGAAACCCAUAGUGGUGCUGCAUGGAUAUGAAGCAGUGAAGGAAGCCCUGAUUGAUCUUGGAGAGGAGUUUUCUGGAAGAGGCAUUUUCCCACUGGCUGAAAGAGCUAACAGAGGAUUUGGAAUUGUUUUCAGCAAUGGAAAGAAAUGGAAGGAGAUCCGGCGUUUCUCCCUCAUGACGCUGCGGAAUUUUGGGAUGGGGAAGAGGAGCAUUGAGGACCGUGUUCAAGAGGAAGCCCGCUGCCUUGUGGAGGAGUUGAGAAAAACCAAGGCCUCACCCUGUGAUCCCACUUUCAUCCUGGGCUGUGCUCCCUGCAAUGUGAUCUGCUCCAUUAUUUUCCAUAAACGUUUUGAUUAUAAAGAUCAGCAAUUUCUUAACUUAAUGGAAAAGUUGAAUGAAAACAUCAAGAUUUUGAGCAGCCCCUGGAUCCAGAUCUGCAAUAAUUUUUCUCCUAUCAUUGAUUACUUCCCGGGAACUCACAACAAAUUACUUAAAAACGUUGCUUUUAUGAAAAGUUAUAUUUUGGAAAAAGUAAAAGAACACCAAGAAUCAAUGGACAUGAACAACCCUCAGGACUUUAUUGAUUGCUUCCUGAUGAAAAUGGAGAAGGAAAAGCACAACCAACCAUCUGAAUUUACUAUUGAAAGCUUGGAAAACACUGCAGUUGACUUGUUUGGAGCUGGGACAGAGACGACAAGCACAACCCUGAGAUAUGCUCUCCUUCUCCUGCUGAAGCACCCAGAGGUCAC

In [48]:
RNA_CYP2C9.count('U')

755
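
As a small sanity check of the `.replace()` approach, the sketch below uses a made-up toy sequence (`dna` is an assumption for illustration). It shows that `.replace()` returns a new string and leaves the original untouched, and that the number of U characters in the RNA equals the number of T characters in the DNA.

```python
# Toy DNA sequence, made up for illustration
dna = "ATGCTTGA"

# Strings are immutable: replace() builds a new string
rna = dna.replace("T", "U")

print(dna)                                # ATGCTTGA (unchanged)
print(rna)                                # AUGCUUGA
print(dna.count("T"), rna.count("U"))     # 3 3
print("T" in rna)                         # False
```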

## Handling large strings
The tools we saw above can be applied to large files with long sequences, such as the genome sequence of the woodchuck (*Marmota monax*), ID: [JAIQCD010000022.1](https://www.ncbi.nlm.nih.gov/nuccore/JAIQCD010000022.1?report=fasta), a sequence that will be used in the rest of this notebook.

The file is located at `data/dna_marmota.fasta`. Below you can see how it is loaded and saved in the variable `DNA_woodchuck`.

In [49]:
# From now on we will load and clean the sequence in a single step
with open("data/dna_marmota.fasta", "r") as GEN:
    DNA_woodchuck =  ''.join((GEN.read()).split('\n')[1:])
print(f'The marmot genome sequence has {len(DNA_woodchuck)} nucleotides') # The len() function counts the number of characters in a string.
print(f'The first 100 nucleotides of the marmot genome are: {DNA_woodchuck[:100]}')

The marmot genome sequence has 34815635 nucleotides
The first 100 nucleotides of the marmot genome are: CCAATATTCTGTACAATTAATTAGTAGCAGGGCATGGGTTGCATGTCTGCCATGGCAGTGACTCCTGGAGCCTGAGATAGGAGGATCTCAAGTTTGAGGT
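
For files of this size, an alternative to `read()` followed by `split('\n')` is to iterate over the file line by line, which avoids holding both the raw text and the list of lines in memory before joining them. The following is a minimal sketch, assuming a FASTA file with a single record; the function name `read_fasta_sequence` and the temporary demo file are hypothetical and not part of the course material.

```python
import os
import tempfile


def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one string."""
    parts = []
    with open(path, "r") as handle:
        for line in handle:          # the file is read one line at a time
            line = line.strip()      # drop the trailing newline
            if not line or line.startswith(">"):
                continue             # skip blank lines and the header line
            parts.append(line)
    return "".join(parts)


# Demo on a small temporary file, so the sketch runs without the data folder
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w") as out:
        out.write(">toy record\nATGC\nGGTA\n")
    demo_sequence = read_fasta_sequence(demo_path)

print(demo_sequence)  # ATGCGGTA
```

The final string still has to fit in memory; the sketch only avoids the intermediate copies.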


In [50]:
# Transcribe the marmot sequence from DNA to RNA, using the ".replace()" method
RNA_woodchuck = DNA_woodchuck.replace("T","U")
print("The first nucleotides of the RNA sequence are: "+ (RNA_woodchuck[:1000]))

The first nucleotides of the RNA sequence are: CCAAUAUUCUGUACAAUUAAUUAGUAGCAGGGCAUGGGUUGCAUGUCUGCCAUGGCAGUGACUCCUGGAGCCUGAGAUAGGAGGAUCUCAAGUUUGAGGUCACUUUGUAGGACCUUGUCUCAAAAAAAAAAAAAAAAACCCGGCUGGAAUGAAGUACAACUAGUGUGACAGCAGGUGGGUUUAUCCCAUUCCCAUUUGAAAAAUAAUGGAAAAUCUAGCCACUGGAGGGAUUGAAGGCCCAAUGAGGAGGGAGUAGGGUUAAUACUGCCAUUUUUCUUCUUUGGUGUAUUUUACAAAAAUGUGUGUGUGUGUGUGUGUGUGUGUGUGUGUGUGUGUGUGUGUUUCUCCCACCUUUGUUCUAUUGCUUUGACGUGUUUUUUUUUUUUCUUUUAUUUUUGGUAUUGGCUUGUUUUCACUUUCUUGUCAAUUUCUUUUGUAUCAAUUCUGUAGCUCUUUUUUUUUUACAGAAAAGAAAAUUACAGUCAAAAUAUUUUAUUUUCCUCUUAAUACUAACUUAAUUUCAAUCUCAUAAAGAUACUGUACCUUUAGAAUUCUAUCUCCCACUGUAUGUACUCAUUUUUCUUCCUUUAUUUUGGUGUUGUGCUGGGAUAGAACUCAGGAUGCUGCAGCACUGAGCUACAUCUCUAGCUUUUCUUUCUAUUUUUGCUUUGAGUGUGACCCAGUGUAAAGUGUUAGCCUAGCAUGCCUGAGGCCUUGGCUUCAAUUCUAACUAUCAUGAAACAAAUAUACUCUGUGUACCUGAUCUGACACUACAUGCUGUGUCCAUUAUUUUAAGGCUGCUAUCCUAUUGAAGAAGGUUUCAGAAAGGAUCAUGUAAGUCAUGUAAACAUGCUCCCCAAACAUCAAGUAUCCUUUACUUAAUUGUUAUGUUGAAGGCUAAGCUUAUUUUGAAUGUCAUCAUAUGAGCAAAAAAUCAAAGUGGACAUUAUGAU

## Practice Activity 1
Taking into account what you have learned previously, analyze the sequence obtained for the cytochrome P450 gene (CYP2C9) and answer:
1. How many times was a thymine (T) replaced by a uracil (U)?
2. What is the size of the string?
3. How many purines are in the sequence?
4. What is the most repeated pair of nucleotides in the sequence?

## Conclusions
In this first practice we used commands and methods for manipulating `strings` read from files of different sizes. Working with two different DNA sequences, we saw tools that can be applied in biological practice.

At the same time, basic information on the cytochrome P450 gene (CYP2C9) was obtained, going from the DNA sequence to the mRNA of the sequence, which will be translated into amino acids (practice 2).

# Theory: Flow control structures
Python has control flow statements that determine which blocks of code are executed and how many times. Two of the most used are:
1. Conditional control structures
2. Iterative control structures

### 1. Conditionals

Conditionals allow you to execute a statement or make a decision depending on whether a condition, which evaluates to a Boolean value of true or false, is met. The keywords used are:
* <code>if</code>: if the expression is true, the block of statements that follows is executed.
* <code>elif</code>: if the previous conditions are not met, another condition is tried.
* <code>else</code>: if none of the previous conditions is met, this block is executed instead.

```markdown
if <condition_1>:
    code block
elif <condition_2>:
    code block
...
else:
    code block
```

Python evaluates the logical expression `condition_1` and executes the first block of code if it is `True`; if not, it evaluates the following conditions in order until it reaches the first one that is `True` and executes the associated block of code. If no condition is `True`, it executes the block of code after `else`.
*Observations*
- An `if` conditional does not necessarily need `elif` or `else`
- Only one `else` can appear, and it must go at the end
- Code blocks must be indented consistently; 4 spaces is the usual convention
- The conditional block ends when the indentation returns to the previous level

In certain scenarios, decisions can be made, such as increasing the value of x or, failing that, decreasing its value.

In [51]:
x = 1 # initial value (try changing this value)

# Condition
if x == 1:
    x = x + 1
else:
    x = x -1

print(x) # final value

2


More than one condition can be checked in the same block by using `elif`, for example:

In [52]:
x = 3 # initial value (try changing this value)

if x == 1:
    x = x + 1
elif x == 2:
    x = x - 1
else:
    x = 0

print(x) # final value

0


`if` statements can be nested as deeply as necessary, although deep nesting is not recommended as a software development practice because it makes the code harder to read. For example:

In [53]:
x = 2
y = 2
if x == 1:
    if y == 2:
        x = x + 1
        y = y + 1
elif x == 2:
    if y == 2:
        x = x - 1
        y = y - 1
else:
    x = 0
    y = 0

print(f'Expect x,y to be equal to 1. x={x} y={y}')

Expect x,y to be equal to 1. x=1 y=1
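
One common way to avoid the nesting shown above is to combine the conditions with `and`. The following sketch is one possible rewrite of that cell, not the only one; it keeps the same behavior, including the fact that nothing changes when `x` is 1 or 2 but `y` is not 2.

```python
x = 2
y = 2

# Each branch states its full condition, so no inner "if" is needed
if x == 1 and y == 2:
    x, y = x + 1, y + 1
elif x == 2 and y == 2:
    x, y = x - 1, y - 1
elif x not in (1, 2):
    # Corresponds to the "else" of the nested version
    x, y = 0, 0

print(f'x={x} y={y}')  # x=1 y=1
```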


### 2. Iterations
Iterations, or loops, allow you to repeat a portion of the code as many times as necessary. Python includes only two loop statements:
* <code>while</code>: repeats a block of code as long as its condition is true.
* <code>for</code>: iterates in order over each of the elements of a sequence, be it a list, tuple, dictionary, array or string.

For more information check the following [link](https://entrenamiento-python-basico.readthedocs.io/es/latest/leccion4/loop_while.html)

#### Structure of While:

In general, a While code block should follow the following form:

```markdown
while <condition>:
    code block
```

*Remember that a `while` block has no closing keyword: the block ends when the indentation returns to the previous level. The `return` keyword is only used inside functions.*

In [54]:
#### EXAMPLE 1
# Let's increase the value of x up to 10
x = 1
print('The initial value of x is:')
print(x)

while x < 10:
    print(f'The new value of x is {x}')
    x = x + 1

print('At the end of the loop, the value of x is:')
print(x)

The initial value of x is:
1
The new value of x is 1
The new value of x is 2
The new value of x is 3
The new value of x is 4
The new value of x is 5
The new value of x is 6
The new value of x is 7
The new value of x is 8
The new value of x is 9
At the end of the loop, the value of x is:
10
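
A `while` loop is a natural fit when the number of repetitions is not known in advance. As an illustration closer to sequence analysis, the sketch below collects every position where a motif occurs in a made-up toy sequence, using the string method `.find()`, which returns `-1` when there are no more matches. The names `seq`, `motif` and `positions` are assumptions for this example.

```python
# Toy sequence and motif, made up for illustration
seq = "ATGAAATGCCATGA"
motif = "ATG"

positions = []
start = seq.find(motif)              # index of the first match, or -1

while start != -1:                   # keep going while a match was found
    positions.append(start)
    # search again, starting one character after the previous match
    start = seq.find(motif, start + 1)

print(positions)  # [0, 5, 10]
```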


#### Structure for

In the same way, the `for` block follows a standard structure for its definition:

```markdown
for <var> in <sequence>:
    code block
```

*Remember: as with `while`, no closing keyword is needed; the block ends when the indentation returns to the previous level.*

In [55]:
#### EXAMPLE 2
# Let's take as an example a list of [codon, amino acid] pairs taken from the genetic code:
genetic_code= [
    ["GUU", "V"], ["GUC", "V"], ["GUA", "V"],
    ["GUG", "V"], ["GCU", "A"], ["GCC", "A"],
    ["GCA", "A"], ["GCG", "A"], ["GAU", "D"],
    ["GAC", "D"], ["GAA", "E"], ["GAG", "E"],
    ["GGU", "G"], ["GGC", "G"], ["GGA", "G"],
    ["GGG", "G"], ["AGA", "R"], ["AGG", "R"],
    ["AGU", "S"], ["AGC", "S"], ["AAU", "N"],
    ["AAC", "N"], ["AAA", "K"], ["AAG", "K"],
    ["ACU", "T"], ["ACC", "T"], ["ACA", "T"],
    ["ACG", "T"], ["AUU", "I"], ["AUC", "I"],
    ["AUA", "I"], ["AUG", "M"], ["CGU", "R"],
    ["CGC", "R"], ["CGA", "R"], ["CGG", "R"],
    ["CCU", "P"], ["CCC", "P"], ["CCA", "P"],
    ["CCG", "P"], ["CAU", "H"], ["CAC", "H"]
]

# You can iterate through all the elements of the genetic_code list and print each codon and its amino acid:
for codon in genetic_code:
    print(f'The amino acid of {codon[0]} is {codon[1]}')

The amino acid of GUU is V
The amino acid of GUC is V
The amino acid of GUA is V
The amino acid of GUG is V
The amino acid of GCU is A
The amino acid of GCC is A
The amino acid of GCA is A
The amino acid of GCG is A
The amino acid of GAU is D
The amino acid of GAC is D
The amino acid of GAA is E
The amino acid of GAG is E
The amino acid of GGU is G
The amino acid of GGC is G
The amino acid of GGA is G
The amino acid of GGG is G
The amino acid of AGA is R
The amino acid of AGG is R
The amino acid of AGU is S
The amino acid of AGC is S
The amino acid of AAU is N
The amino acid of AAC is N
The amino acid of AAA is K
The amino acid of AAG is K
The amino acid of ACU is T
The amino acid of ACC is T
The amino acid of ACA is T
The amino acid of ACG is T
The amino acid of AUU is I
The amino acid of AUC is I
The amino acid of AUA is I
The amino acid of AUG is M
The amino acid of CGU is R
The amino acid of CGC is R
The amino acid of CGA is R
The amino acid of CGG is R
The amino acid of CCU is P
T

We can iterate over the `key-value` pairs of a dictionary using the `.items()` method.

In [56]:
#### EXAMPLE 3: Using for to loop through the elements of a dictionary
genetic_code = {
    "GUU": "V", "GUC": "V", "GUA": "V",
    "GUG": "V", "GCU": "A", "GCC": "A",
    "GCA": "A", "GCG": "A", "GAU": "D",
    "GAC": "D", "GAA": "E", "GAG": "E",
    "GGU": "G", "GGC": "G", "GGA": "G",
    "GGG": "G", "AGA": "R", "AGG": "R",
    "AGU": "S", "AGC": "S", "AAU": "N",
    "AAC": "N", "AAA": "K", "AAG": "K",
    "ACU": "T", "ACC": "T", "ACA": "T",
    "ACG": "T", "AUU": "I", "AUC": "I",
    "AUA": "I", "AUG": "M", "CGU": "R",
    "CGC": "R", "CGA": "R", "CGG": "R",
    "CCU": "P", "CCC": "P", "CCA": "P",
    "CCG": "P", "CAU": "H", "CAC": "H",
    "CAA": "Q", "CAG": "Q", "UUU": "F",
    "UUC": "F", "UUA": "L", "UUG": "L",
    "UCU": "S", "UCC": "S", "UCA": "S",
    "UCG": "S", "UAU": "Y", "UAC": "Y",
    "UAA": "STOP", "UAG": "STOP", "UGU":"C",
    "UGC": "C", "UGA": "STOP", "UGG": "W",
    "CUU": "L", "CUC": "L", "CUA": "L",
    "CUG": "L"}

for codon, amino_acid in genetic_code.items():
    print(f"The amino acid of {codon} is {amino_acid}")

The amino acid of GUU is V
The amino acid of GUC is V
The amino acid of GUA is V
The amino acid of GUG is V
The amino acid of GCU is A
The amino acid of GCC is A
The amino acid of GCA is A
The amino acid of GCG is A
The amino acid of GAU is D
The amino acid of GAC is D
The amino acid of GAA is E
The amino acid of GAG is E
The amino acid of GGU is G
The amino acid of GGC is G
The amino acid of GGA is G
The amino acid of GGG is G
The amino acid of AGA is R
The amino acid of AGG is R
The amino acid of AGU is S
The amino acid of AGC is S
The amino acid of AAU is N
The amino acid of AAC is N
The amino acid of AAA is K
The amino acid of AAG is K
The amino acid of ACU is T
The amino acid of ACC is T
The amino acid of ACA is T
The amino acid of ACG is T
The amino acid of AUU is I
The amino acid of AUC is I
The amino acid of AUA is I
The amino acid of AUG is M
The amino acid of CGU is R
The amino acid of CGC is R
The amino acid of CGA is R
The amino acid of CGG is R
The amino acid of CCU is P
T
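
Looping over `.items()` is also useful for building new structures from a dictionary. The sketch below, which uses a small made-up subset of the codon table (`codon_table` is an assumption for this example), inverts the mapping so that each amino acid points to the list of codons that encode it.

```python
# Small subset of the genetic code, enough for the illustration
codon_table = {
    "GCU": "A", "GCC": "A",
    "AUG": "M", "UGG": "W",
    "GAU": "D", "GAC": "D",
}

codons_by_amino_acid = {}
for codon, amino_acid in codon_table.items():
    # setdefault creates an empty list the first time an amino acid appears
    codons_by_amino_acid.setdefault(amino_acid, []).append(codon)

for amino_acid, codons in codons_by_amino_acid.items():
    print(f"{amino_acid} is encoded by {codons}")
```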

# Practice 2: Expression of genetic material

## Concepts to work

**Translation** is the synthesis of a protein from the mRNA chain. It takes place in the ribosomes, which are complexes of RNA and protein. During this process the mRNA sequence is read in groups of three nucleotides, called **codons**, which are interpreted through the **genetic code** to give a sequence of amino acids (fig. 4); this chain then folds to form a protein (fig. 3).

<img src="img/Figura4-en-es.png" alt="code" width="1000"/>

*Figure 4. The genetic code used in protein expression, showing how codons are formed from the nucleotides (uracil, adenine, guanine, or cytosine), with the start codon (green) and the stop codons (red). Figure modified from: [Molecular biology of the gene, (2008), 15, 509-569](https://books.google.com.co/books?id=7tadzgEACAAJ&dq=Molecular+biology+of+the+gene&hl=es-419&sa=X&redir_esc=y)*

The ribosome reads the sequence in order, looking for the **start** codon AUG, which also codes for the amino acid methionine, and begins the translation there. As it advances it builds the chain of amino acids: the nucleotide triplets are read one after another and the corresponding amino acid is attached each time (fig. 3). The resulting chain can be long or short; it keeps growing until the ribosome finds one of the three **stop** codons (UAA, UGA or UAG) (fig. 4). Once synthesized, the chain is released from the ribosome and is modified or combined to form a functional protein, with a specific structure, that takes part in some essential process for the cell or organism.


## Problem Statement
Continuing with the general objective of obtaining basic information on the cytochrome P450 enzyme, the protein we worked with previously, we are going to carry out the second phase of gene expression in order to obtain the amino acids that make up the protein.

First, we must create a dictionary that holds the genetic code, specifying which amino acid corresponds to each codon (nucleotide triplet). In terms of `key-value` pairs, the `key` is the codon and the `value` is the amino acid.

In [57]:
#Dictionary of codons for translation
genetic_code = {"GUU": "V", "GUC": "V", "GUA": "V", "GUG": "V", "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
                "GAU": "D", "GAC": "D", "GAA": "E", "GAG": "E", "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
                "AGA": "R", "AGG": "R", "AGU": "S", "AGC": "S", "AAU": "N", "AAC": "N", "AAA": "K", "AAG": "K",
                "ACU": "T", "ACC": "T", "ACA": "T", "ACG": "T", "AUU": "I", "AUC": "I", "AUA": "I", "AUG": "M",
                "CGU": "R", "CGC": "R", "CGA": "R", "CGG": "R", "CCU": "P", "CCC": "P", "CCA": "P", "CCG": "P",
                "CAU": "H", "CAC": "H", "CAA": "Q", "CAG": "Q", "UUU": "F", "UUC": "F", "UUA": "L", "UUG": "L",
                "UCU": "S", "UCC": "S", "UCA": "S", "UCG": "S", "UAU": "Y", "UAC": "Y", "UAA": "STOP", "UAG": "STOP",
                "UGU": "C", "UGC": "C", "UGA": "STOP", "UGG": "W", "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L"}

print(f' Codons are: \n{list(genetic_code.keys())}')
print('-----------------')
print(f' Amino acids are: \n{list(genetic_code.values())}')

 Codons are: 
['GUU', 'GUC', 'GUA', 'GUG', 'GCU', 'GCC', 'GCA', 'GCG', 'GAU', 'GAC', 'GAA', 'GAG', 'GGU', 'GGC', 'GGA', 'GGG', 'AGA', 'AGG', 'AGU', 'AGC', 'AAU', 'AAC', 'AAA', 'AAG', 'ACU', 'ACC', 'ACA', 'ACG', 'AUU', 'AUC', 'AUA', 'AUG', 'CGU', 'CGC', 'CGA', 'CGG', 'CCU', 'CCC', 'CCA', 'CCG', 'CAU', 'CAC', 'CAA', 'CAG', 'UUU', 'UUC', 'UUA', 'UUG', 'UCU', 'UCC', 'UCA', 'UCG', 'UAU', 'UAC', 'UAA', 'UAG', 'UGU', 'UGC', 'UGA', 'UGG', 'CUU', 'CUC', 'CUA', 'CUG']
-----------------
 Amino acids are: 
['V', 'V', 'V', 'V', 'A', 'A', 'A', 'A', 'D', 'D', 'E', 'E', 'G', 'G', 'G', 'G', 'R', 'R', 'S', 'S', 'N', 'N', 'K', 'K', 'T', 'T', 'T', 'T', 'I', 'I', 'I', 'M', 'R', 'R', 'R', 'R', 'P', 'P', 'P', 'P', 'H', 'H', 'Q', 'Q', 'F', 'F', 'L', 'L', 'S', 'S', 'S', 'S', 'Y', 'Y', 'STOP', 'STOP', 'C', 'C', 'STOP', 'W', 'L', 'L', 'L', 'L']


## Control Structures
Next, we will use the **control structures** to be able to analyze the RNA sequence of the cytochrome `RNA_CYP2C9` to synthesize the protein, following these steps:
1. Identify the start of the protein: AUG
2. Divide by threes
3. Find the stop (there can be several, look at the dictionary)
4. Print the protein: AUG(codons - in threes)STOP
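
Before applying them to the real sequence, the first three steps above can be sketched on a short made-up mRNA. This is only an illustration under stated assumptions: the toy string `rna` and the names `codons` and `orf` are hypothetical, and the sketch uses `.find()` and a stepped `range` rather than the index-by-index loops of the cell below.

```python
# Toy mRNA, made up for illustration
rna = "GGAUGGCCUGGUAAGC"
stop_codons = {"UAA", "UAG", "UGA"}

# Step 1: locate the start codon (find returns -1 if it is absent)
start = rna.find("AUG")

codons = []
if start != -1:
    coding = rna[start:]
    # Step 2: split into complete triplets; leftover bases are ignored
    codons = [coding[i:i + 3] for i in range(0, len(coding) - 2, 3)]

# Step 3: keep the codons that come before the first stop codon
orf = []
for codon in codons:
    if codon in stop_codons:
        break
    orf.append(codon)

print(start)    # 2
print(codons)   # ['AUG', 'GCC', 'UGG', 'UAA']
print(orf)      # ['AUG', 'GCC', 'UGG']
```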

In [58]:
# reload the sequence
with open("data/sec_CYP2C9.fasta", "r") as GEN:
    sec_CYP2C9 = GEN.read()
DNA_CYP2C9 =(''.join(sec_CYP2C9.split('\n')[1:]))
RNA_CYP2C9 = DNA_CYP2C9.replace("T","U")

run = True
# search start codon AUG
i = 0
for i in range(len(RNA_CYP2C9)):
    if RNA_CYP2C9[i:i + 3] == 'AUG':  # Start of protein found
        RNA_CYP2C9 = RNA_CYP2C9[i:]  # trim sequence. new RNA
        break  # end the for loop
    if i >= (len(RNA_CYP2C9) - 3):   # Protein start NOT found
        print('Start codon AUG not found')
        RNA_CYP2C9 = RNA_CYP2C9[i:i + 3]
        run = False   # end up
        break   # end the for loop

# This code is only executed if the start of the protein was found
# Executes with the sequence trimmed

protein = list()
if run:
    i = 0
    # start translation
    while i <= len(RNA_CYP2C9) - 2:
        codon = genetic_code[RNA_CYP2C9[i:i + 3]]
        protein.append(codon)
        i += 3
        if codon == 'STOP':
            print(f'>> Protein found')
            RNA_CYP2C9 = RNA_CYP2C9[i:]  # new RNA (trimmed)
            protein = protein[:-1]
            protein_text = ''.join(protein)
            print(f'Protein: {protein_text}')
            break
        if i >= (len(RNA_CYP2C9) - 3):
            print('STOP codon not found')
            RNA_CYP2C9 = RNA_CYP2C9[i:i + 3]
            break

>> Protein found
Protein: MDSLVVLVLCLSCLLLLSLWRQSSGRGKLPPGPTPLPVIGNILQIGIKDISKSLTNLSKVYGPVFTLYFGLKPIVVLHGYEAVKEALIDLGEEFSGRGIFPLAERANRGFGIVFSNGKKWKEIRRFSLMTLRNFGMGKRSIEDRVQEEARCLVEELRKTKASPCDPTFILGCAPCNVICSIIFHKRFDYKDQQFLNLMEKLNENIKILSSPWIQICNNFSPIIDYFPGTHNKLLKNVAFMKSYILEKVKEHQESMDMNNPQDFIDCFLMKMEKEKHNQPSEFTIESLENTAVDLFGAGTETTSTTLRYALLLLLKHPEVTAKVQEEIERVIGRNRSPCMQDRSHMPYTDAVVHEVQRYIDLLPTSLPHAVTCDIKFRNYLIPKGTTILISLTSVLHDNKEFPNPEMFDPHHFLDEGGNFKKSKYFMPFSAGKRICVGEALAGMELFLFLTSILQNFNLKSLVDPKNLDTTPVVNGFASVPPFYQLCFIPV


In [59]:
# the protein variable stores a list of each amino acid
print(protein)

['M', 'D', 'S', 'L', 'V', 'V', 'L', 'V', 'L', 'C', 'L', 'S', 'C', 'L', 'L', 'L', 'L', 'S', 'L', 'W', 'R', 'Q', 'S', 'S', 'G', 'R', 'G', 'K', 'L', 'P', 'P', 'G', 'P', 'T', 'P', 'L', 'P', 'V', 'I', 'G', 'N', 'I', 'L', 'Q', 'I', 'G', 'I', 'K', 'D', 'I', 'S', 'K', 'S', 'L', 'T', 'N', 'L', 'S', 'K', 'V', 'Y', 'G', 'P', 'V', 'F', 'T', 'L', 'Y', 'F', 'G', 'L', 'K', 'P', 'I', 'V', 'V', 'L', 'H', 'G', 'Y', 'E', 'A', 'V', 'K', 'E', 'A', 'L', 'I', 'D', 'L', 'G', 'E', 'E', 'F', 'S', 'G', 'R', 'G', 'I', 'F', 'P', 'L', 'A', 'E', 'R', 'A', 'N', 'R', 'G', 'F', 'G', 'I', 'V', 'F', 'S', 'N', 'G', 'K', 'K', 'W', 'K', 'E', 'I', 'R', 'R', 'F', 'S', 'L', 'M', 'T', 'L', 'R', 'N', 'F', 'G', 'M', 'G', 'K', 'R', 'S', 'I', 'E', 'D', 'R', 'V', 'Q', 'E', 'E', 'A', 'R', 'C', 'L', 'V', 'E', 'E', 'L', 'R', 'K', 'T', 'K', 'A', 'S', 'P', 'C', 'D', 'P', 'T', 'F', 'I', 'L', 'G', 'C', 'A', 'P', 'C', 'N', 'V', 'I', 'C', 'S', 'I', 'I', 'F', 'H', 'K', 'R', 'F', 'D', 'Y', 'K', 'D', 'Q', 'Q', 'F', 'L', 'N', 'L', 'M', 'E', 'K',

## Practice activity 2
Based on what you have learned, analyze the sequence of amino acids obtained from the mRNA and answer:
1. How many amino acids does the protein have?
2. What is the most repeated amino acid?
3. Identify the nucleotide at which amino acid synthesis begins
4. At which nucleotide does amino acid synthesis end?

## Conclusions

At this point in the practice, we have used various commands and methods to obtain an amino acid sequence from a DNA `string`. This process can be applied to nucleotide sequences of different sizes and from different organisms.

To obtain the amino acids that make up the protein, we used **arrangements** and **control structures**. This gave us basic information on the amino acids of the cytochrome P450 protein, which we will use to classify them and obtain general information on the enzyme from its subunits (practice 3).

# Theory: Functions

Apart from the native Python expressions and features seen so far throughout the notebook, there are others that are essential when writing any computer program: **functions**. A function is basically a piece of code that can be reused, that has a name that identifies it, and that receives some input parameters. A function does not necessarily return the same result every time: its behavior depends on the values of its arguments.

When defining a function, it is important to keep in mind that:

1. It should have a name that explains, at first sight, the operation it performs and its result.
2. Most functions receive one or more parameters necessary to perform the operation, although they may not receive any.
3. They usually return a result, although this is not always the case.

```markdown
def <function_name>(<parameters>):
    code block
    code block
    ...
    return <object>
```

Let's see an example:
The following function, when executed, returns a value of type `string` if the codon is found in the dictionary; otherwise it returns `None`.

In [60]:
#### EXAMPLE 1: Checking whether a codon exists in the codon dictionary
# Dictionary of codons and their amino acids:
genetic_code = {"GUU": "V", "GUC": "V", "GUA": "V", "GUG": "V", "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
                "GAU": "D", "GAC": "D", "GAA": "E", "GAG": "E", "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
                "AGA": "R", "AGG": "R", "AGU": "S", "AGC": "S", "AAU": "N", "AAC": "N", "AAA": "K", "AAG": "K",
                "ACU": "T", "ACC": "T", "ACA": "T", "ACG": "T", "AUU": "I", "AUC": "I", "AUA": "I", "AUG": "M",
                "CGU": "R", "CGC": "R", "CGA": "R", "CGG": "R", "CCU": "P", "CCC": "P", "CCA": "P", "CCG": "P",
                "CAU": "H", "CAC": "H", "CAA": "Q", "CAG": "Q", "UUU": "F", "UUC": "F", "UUA": "L", "UUG": "L",
                "UCU": "S", "UCC": "S", "UCA": "S", "UCG": "S", "UAU": "Y", "UAC": "Y", "UAA": "STOP", "UAG": "STOP",
                "UGU": "C", "UGC": "C", "UGA": "STOP", "UGG": "W", "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L"}


# Function that checks whether a codon is defined in the dictionary and returns its amino acid; otherwise it returns None by default.
def codon_gen_code(codon):
    if codon in genetic_code:
        return genetic_code[codon]

# execute the function
print(f'The amino acid of GCG is: {codon_gen_code("GCG")}')

The amino acid of GCG is: A


*Observations*:
* Every function definition must start with the keyword **def**, followed by the name and the parameters in parentheses, and the line must end with a colon
    * The name of the function is: **codon_gen_code**
    * The parameter that the function receives is: **codon**
* The body of the function is indented four spaces.
* The block ends when the indentation returns to the previous level
* Once the function has been created, it can be called by its name followed by parentheses, passing the arguments it needs inside them.

### Parameters of a function

A function can receive more than one parameter. In the previous example the function received only one, but we can add more. For example, let's modify the function so that it also receives the codon dictionary:

In [61]:
#### EXAMPLE 2

# Function that checks whether a codon is defined in the dictionary and returns its amino acid; otherwise it returns None by default.
# notice that this function takes two parameters
def codon_gen_code (codones_type, name):
    if name in codones_type:
        return codones_type[name]

# Way to execute the previously created function
codon_gen_code(genetic_code, "ACC")

'T'

Note that the function no longer reads the `genetic_code` variable directly. Functions have their own execution context, so the dictionary is now passed in when the function is called, as its first parameter (`codones_type`).

About the parameters it is important to highlight that:

* They must have clear and legible names that complement what the function does
* They can be of any type; in the previous example, `codones_type` is a dictionary while `name` is a string
* They can have default values, let's see an example below.

In [62]:
#### EXAMPLE 3

# Default parameter: name='AAU'
def codon_gen_code(codon_types, name='AAU'):
    if name in codon_types:
        return codon_types[name]

# name does not need to be passed, since it has a default value in the function definition.
codon_gen_code(genetic_code)

'N'
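
Default values combine well with keyword arguments, which let the caller name the parameter being set. The sketch below is a variation on the examples above, not the notebook's own function: `lookup_codon`, its `default` parameter and `toy_table` are hypothetical names, and `dict.get` replaces the explicit `if ... in ...` check.

```python
def lookup_codon(codon_table, codon="AAU", default=None):
    """Return the amino acid for a codon, or `default` if it is unknown."""
    # dict.get returns the second argument when the key is missing
    return codon_table.get(codon, default)


# Small subset of the genetic code, enough for the illustration
toy_table = {"AAU": "N", "ACC": "T"}

print(lookup_codon(toy_table))                      # N (default codon)
print(lookup_codon(toy_table, codon="ACC"))         # T (keyword argument)
print(lookup_codon(toy_table, "XXX"))               # None (unknown codon)
print(lookup_codon(toy_table, "XXX", default="?"))  # ?
```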

Now let's review the implementation from practice 2 and turn it into a general function that helps us find the protein encoded in any RNA sequence.

In [63]:
#### EXAMPLE 4: Function that is responsible for finding a protein from RNA and an initial codon.

# reload the sequence

with open("data/sec_CYP2C9.fasta", "r") as GEN:
    sec_CYP2C9 = GEN.read()
DNA_CYP2C9 =(''.join(sec_CYP2C9.split('\n')[1:]))
RNA_CYP2C9 = DNA_CYP2C9.replace("T","U")


def rna_a_protein(rna, start_codon='AUG'):
    run = True
    # search start codon AUG
    i = 0
    for i in range(len(rna)):
        if rna[i:i + 3] == start_codon:  # Start of protein found
            rna = rna[i:]  # trim sequence. new RNA
            break  # end the for loop
        if i >= (len(rna) - 3):  # Protein start NOT found
            print(f'Start codon {start_codon} not found')
            rna = rna[i:i + 3]
            run = False  # end up
            break  # end the for loop

    # This code is only executed if the start of the protein was found
    # Executes with the sequence trimmed
    if run:
        i = 0
        protein = list()
        # start translation
        while i <= len(rna) - 2:
            codon = genetic_code[rna[i:i + 3]]
            protein.append(codon)
            i += 3
            if codon == 'STOP':
                print(f'>> Protein found')
                rna = rna[i:]  # new RNA (trimmed)
                protein = protein[:-1]
                protein_text = ''.join(protein)
                print(f'Protein: {protein_text}')
                break
            if i >= (len(rna) - 3):
                print('No codon found STOP')
                rna = rna[i:i + 3]
                break

# We call the function with the necessary arguments
rna_a_protein(RNA_CYP2C9, "AUG")

>> Protein found
Protein: MDSLVVLVLCLSCLLLLSLWRQSSGRGKLPPGPTPLPVIGNILQIGIKDISKSLTNLSKVYGPVFTLYFGLKPIVVLHGYEAVKEALIDLGEEFSGRGIFPLAERANRGFGIVFSNGKKWKEIRRFSLMTLRNFGMGKRSIEDRVQEEARCLVEELRKTKASPCDPTFILGCAPCNVICSIIFHKRFDYKDQQFLNLMEKLNENIKILSSPWIQICNNFSPIIDYFPGTHNKLLKNVAFMKSYILEKVKEHQESMDMNNPQDFIDCFLMKMEKEKHNQPSEFTIESLENTAVDLFGAGTETTSTTLRYALLLLLKHPEVTAKVQEEIERVIGRNRSPCMQDRSHMPYTDAVVHEVQRYIDLLPTSLPHAVTCDIKFRNYLIPKGTTILISLTSVLHDNKEFPNPEMFDPHHFLDEGGNFKKSKYFMPFSAGKRICVGEALAGMELFLFLTSILQNFNLKSLVDPKNLDTTPVVNGFASVPPFYQLCFIPV
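
The function above prints its result. As a hedged sketch of an alternative design, the version below returns the protein as a string so that it can be stored and reused. It is not the notebook's implementation: `translate` and `mini_code` are hypothetical names, the codon table is a tiny subset chosen for the example, and `str.find` is swapped in for the explicit search loop.

```python
# Hypothetical subset of the genetic code, just large enough for this example
mini_code = {'AUG': 'M', 'GAU': 'D', 'UCC': 'S', 'UAA': 'STOP'}


def translate(rna, table, start_codon='AUG'):
    """Return the protein encoded after the first start codon, or None."""
    start = rna.find(start_codon)
    if start == -1:
        return None  # no start codon in the sequence
    protein = []
    # Read complete codons only: the last valid start index is len(rna) - 3
    for i in range(start, len(rna) - 2, 3):
        amino_acid = table[rna[i:i + 3]]
        if amino_acid == 'STOP':
            return ''.join(protein)
        protein.append(amino_acid)
    return None  # reached the end of the sequence without a STOP codon


found = translate('CCAUGGAUUCCUAAGG', mini_code)
no_start = translate('GAUUCC', mini_code)
no_stop = translate('AUGGAUUCC', mini_code)
print(found, no_start, no_stop)
```

Returning `None` for the two failure cases lets the caller decide how to report them, instead of fixing the message inside the function.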


# Practice 3: Proteins and amino acids
## Concepts to work on
The functional diversity of proteins arises from the molecular variety and the specific sequence of the amino acids that compose them. Amino acids are low-molecular-weight subunits, each of which fulfills a specific function in the structure of the protein because of its physicochemical properties, such as polarity, acidity or basicity, aromaticity, size, capacity to form bonds, and chemical reactivity (Fig. 5). For this reason, they can be classified in different ways:

<img src="img/Figura5-en.jpg" alt="aminoácidos1" width="600"/>

*Figure 5. Diagram of some physicochemical properties of amino acids. Figure modified from: [Asencio, T., Aguilar, J. (2010) Spanish Congress on Technologies and Fuzzy Logic.](https://www.researchgate.net/publication/266892662_Importancia_de_las_Propiedades_Fisico-Quimicas_de_los_Aminoacidos_en_la_Prediccion_de_Estructuras_de_Proteinas_usando_Vecinos_mas)*

1. By polarity, that is, the ability to interact with water molecules, dividing into:

   * Apolar: hydrophobic.
   * Polar: hydrophilic.
   * Acidic: negative charge at physiological pH.
   * Basic: positive charge at physiological pH.

<img src="img/Figura6-en.jpg" alt="aminoácidos2" width="600"/>

*Figure 6. Structure and classification of amino acids by their polarity. Figure modified from: [Trudy McKee, James R. McKee. (2014)](https://accessmedicina.mhmedical.com/book.aspx?bookid=1960)*

2. By the conformation of their side chain, they can be grouped into:
   * Aliphatic
   * Aromatic
   * Hydroxyamino acids
   * Thioamino acids
   * Imino acids
   * Dicarboxylic
   * Amides
   * Dibasic

Knowledge of the physicochemical properties of amino acids has facilitated the prediction of protein structure, that is, understanding how a protein folds into its three-dimensional shape from the chain of amino acids that forms it, through the possible bonds established between subunits and between proteins.

## Problem Statement
To address the general objective of the practice, we will analyze two physicochemical properties of the cytochrome P450 enzyme, using the information in Figures 5 and 6 as a guide. In this way, we will obtain basic information about the protein, which would facilitate a prediction of its folding through the use of omic sciences. The properties we are going to evaluate are:
* Polarity
* Acidity or basicity

Before beginning, we must create a dictionary of the physicochemical properties that we want to evaluate, containing the classification of each amino acid. The `key` is the amino acid and the `value` is its property.

In [64]:
# Dictionary of amino acids for their classification
properties = {
    "A": "Nonpolar", "V": "Nonpolar", "L": "Nonpolar", "G": "Nonpolar", "I": "Nonpolar",
    "F": "Nonpolar", "W": "Nonpolar", "M": "Nonpolar", "P": "Nonpolar",
    "S": "Polar", "T": "Polar", "Y": "Polar", "N": "Polar", "Q": "Polar", "C": "Polar",
    "D": "Acid", "E": "Acid",
    "K": "Basic", "R": "Basic", "H": "Basic",
}

print(f'Amino acids are: \n{list(properties.keys())}')
print('-----------------')
print(f'The properties are: \n{set(properties.values())}')
# A set is built from the values so that the properties are not repeated

Amino acids are: 
['A', 'V', 'L', 'G', 'I', 'F', 'W', 'M', 'P', 'S', 'T', 'Y', 'N', 'Q', 'C', 'D', 'E', 'K', 'R', 'H']
-----------------
The properties are: 
{'Acid', 'Polar', 'Nonpolar', 'Basic'}
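
The same kind of dictionary can also be read in the opposite direction, from property to amino acids. The following is a minimal sketch under stated assumptions: `example_properties` is a shortened, illustrative copy of the classification, and `by_property` is a hypothetical name.

```python
from collections import defaultdict

# Shortened, illustrative copy of the amino acid classification
example_properties = {'A': 'Nonpolar', 'V': 'Nonpolar', 'S': 'Polar',
                      'D': 'Acid', 'E': 'Acid', 'K': 'Basic'}

# Group the amino acids (keys) under each property (value)
by_property = defaultdict(list)
for amino_acid, prop in example_properties.items():
    by_property[prop].append(amino_acid)

print(dict(by_property))
```

Here `defaultdict(list)` creates an empty list the first time a property is seen, so no explicit check for missing keys is needed.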


Next, we are going to create the `total_elements` function to get the number of polar, nonpolar, acidic, and basic amino acids present in the protein.
The `Counter` class from the `collections` module will be used; it takes the elements of a list and counts how many times each element is repeated.
The `Counter` object can then be turned into a dictionary where we can see the information.

`Counter` also has useful methods, for example `.most_common(n)`, which returns the `n` most common elements together with their counts.

More information: https://docs.python.org/3/library/collections.html
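
As a short illustration of `Counter` on its own, the sketch below counts a made-up list of property labels; the list `labels` is hypothetical and does not come from the protein analyzed in this practice.

```python
from collections import Counter

# Short, made-up list of property labels
labels = ['Polar', 'Nonpolar', 'Nonpolar', 'Basic', 'Nonpolar', 'Polar']

counts = Counter(labels)
print(dict(counts))           # {'Polar': 2, 'Nonpolar': 3, 'Basic': 1}
print(counts.most_common(1))  # [('Nonpolar', 3)]
print(counts.most_common(2))  # [('Nonpolar', 3), ('Polar', 2)]
```

Note that `.most_common(n)` returns a list of `(element, count)` tuples, which is why `[0]` is needed to get the single most common pair.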

In [65]:
def total_elements(protein_list):
    # The class to be used is imported
    from collections import Counter
    # An empty list is created to store the property of each amino acid
    list_protein_properties = list()

    # Iterate over each amino acid in the protein
    for element in protein_list:
        # The amino acid property is looked up and stored in the list (.append())
        list_protein_properties.append(properties[element])

    # Counter is called once, after the loop, to count each property
    counter_1 = Counter(list_protein_properties)
    print('Summary of protein properties:')
    print(f'Total items: {len(protein_list)}')
    print(f'Frequency of properties: {dict(counter_1)}')
    if counter_1:  # An empty protein has no most common property
        print(f'Most common property: {counter_1.most_common(1)[0]}')

    return None

In [66]:
# Cytochrome P450 enzyme protein (found in activity 2)
print(protein)
print('-----------')
total_elements(protein)

['M', 'D', 'S', 'L', 'V', 'V', 'L', 'V', 'L', 'C', 'L', 'S', 'C', 'L', 'L', 'L', 'L', 'S', 'L', 'W', 'R', 'Q', 'S', 'S', 'G', 'R', 'G', 'K', 'L', 'P', 'P', 'G', 'P', 'T', 'P', 'L', 'P', 'V', 'I', 'G', 'N', 'I', 'L', 'Q', 'I', 'G', 'I', 'K', 'D', 'I', 'S', 'K', 'S', 'L', 'T', 'N', 'L', 'S', 'K', 'V', 'Y', 'G', 'P', 'V', 'F', 'T', 'L', 'Y', 'F', 'G', 'L', 'K', 'P', 'I', 'V', 'V', 'L', 'H', 'G', 'Y', 'E', 'A', 'V', 'K', 'E', 'A', 'L', 'I', 'D', 'L', 'G', 'E', 'E', 'F', 'S', 'G', 'R', 'G', 'I', 'F', 'P', 'L', 'A', 'E', 'R', 'A', 'N', 'R', 'G', 'F', 'G', 'I', 'V', 'F', 'S', 'N', 'G', 'K', 'K', 'W', 'K', 'E', 'I', 'R', 'R', 'F', 'S', 'L', 'M', 'T', 'L', 'R', 'N', 'F', 'G', 'M', 'G', 'K', 'R', 'S', 'I', 'E', 'D', 'R', 'V', 'Q', 'E', 'E', 'A', 'R', 'C', 'L', 'V', 'E', 'E', 'L', 'R', 'K', 'T', 'K', 'A', 'S', 'P', 'C', 'D', 'P', 'T', 'F', 'I', 'L', 'G', 'C', 'A', 'P', 'C', 'N', 'V', 'I', 'C', 'S', 'I', 'I', 'F', 'H', 'K', 'R', 'F', 'D', 'Y', 'K', 'D', 'Q', 'Q', 'F', 'L', 'N', 'L', 'M', 'E', 'K',

With this information, we now know the length of the amino acid sequence and the physicochemical properties of its structure. The protein has both polar and apolar regions, the latter being the most common, which gives us an approximation of the character of the functional groups with which it tends to react.
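
To compare proteins of different lengths, the counts can be expressed as percentages. The sketch below assumes hypothetical counts, not the real values for CYP2C9, and the names `counts` and `percentages` are illustrative.

```python
from collections import Counter

# Hypothetical property counts, NOT the real values for CYP2C9
counts = Counter({'Nonpolar': 5, 'Polar': 3, 'Basic': 1, 'Acid': 1})

total = sum(counts.values())
# Percentage of the sequence that falls in each property class
percentages = {prop: round(100 * n / total, 1) for prop, n in counts.items()}
print(percentages)
```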

## Practice activities 3
Taking into account what has been reviewed in this introductory notebook, and with the help of the bibliography, answer:
1. Could amino acids be classified in a different way?
2. Perform a new classification of the protein based on the conformation of the side chain.
3. What is the difference between the final amino acid sequence and the sequence of CYP2C19 (NM_000769.4), a member of the same family of cytochrome P450 proteins that presents a polymorphism (mutation) in its amino acid sequence?

## Conclusions
In this tutorial, you learned how basic Python tools are used in bioinformatics practices, ranging from data collection and management to the use of commands and methods for data analysis. This was done in two phases, theory and practice, in which we followed the expression of genetic material from a DNA sequence through to the corresponding amino acids and their properties.
In the next tutorials, we will explain more Python tools used for collecting and organizing data obtained from electronic resources, making use of different libraries.

# Bibliography

1. Cortés, G. & Aguilar-Ruiz, J. (2006). Importancia de las Propiedades Físico-Químicas de los Aminoácidos en la Predicción de Estructuras de Proteínas usando Vecinos más Cercanos.

2. Jiménez, F. & Merchant, H. (2003). Biología celular y molecular. (1ra ed.). Pearson Educación de México, S.A. de C.V.

3. McKee, J. R., & McKee, T. (2014). Bioquímica. Las bases moleculares de la vida. McGraw-Hill Education LLC.

4. Salazar, A., Sandoval, A., & Armendáriz, J. (2016). Biología Molecular. Fundamentos y aplicaciones en las ciencias de la salud. (2da ed.). HILL/INTERAMERICANA EDITORES, S.A. DE C.V.

5. Watson, J. D. (2008). Molecular biology of the gene (6th ed.). Pearson/Benjamin Cummings.