# Computational Skills for Biocuration

## Programming Skills with Python

### Introductory example

- UniProt example: ["gene:tp53 AND reviewed:yes"](https://www.uniprot.org/uniprot/?query=gene%3ATP53+AND+reviewed%3Ayes&sort=score)

## Recap

### A few Python concepts we've seen in the previous webinars:

- Mathematical operations
- Data types and type conversion
- `print( )`
- Variable assignment
    - Use gene name from UniProt example

In [11]:
456*567-678/789
print(456*(567-(678/789)))
type(456*(567-(678/789)))
int(456*(567-(678/789)))
gene_name = "tp53"
print(gene_name)

258160.15209125474
tp53
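
The cell above shows only the printed result. As a minimal sketch of the type conversion step, the snippet below stores the value in a variable (the names `result`, `entry_count` and `label` are illustrative, not from the webinar). It shows that `int()` drops the decimal part rather than rounding, and that strings and numbers can be converted into each other.

```python
# `result` is an illustrative variable name
result = 456 * (567 - (678 / 789))

print(type(result))      # <class 'float'>
print(int(result))       # 258160 (decimal part is dropped, not rounded)
print(round(result, 2))  # 258160.15

# Converting between strings and numbers
entry_count = int("5")     # string -> integer
label = str(entry_count)   # integer -> string
print(entry_count + 1)     # 6
print(label + " entries")  # 5 entries
```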


#### Using methods on variables

- Using methods (`.upper( )`, `.split( )`, etc.)
    - The `.` indicates that the method belongs to the string it follows (for example `mouse_gene`), and the parentheses indicate that we want to execute it.
    - These methods return a new value (a transformed string, or a list of strings for `.split( )`), which you can then store in another variable.


- Use mouse gene from UniProt example.

In [13]:
mouse_gene = "Tp53 P53, Trp53"
mouse_gene.upper()
mouse_gene.split()
mouse_gene.split(',')

mouse_gene_list = mouse_gene.split(',')
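
The cell above does not show the intermediate results, so here is a small sketch of what each call returns. The variable names `upper_gene` and `clean_gene_list` are illustrative. Note that splitting on `','` keeps the space that followed the comma; splitting on `', '` is one possible way to avoid it for this particular string.

```python
mouse_gene = "Tp53 P53, Trp53"

# .upper() returns a new string; the original variable is unchanged
upper_gene = mouse_gene.upper()
print(upper_gene)   # TP53 P53, TRP53
print(mouse_gene)   # Tp53 P53, Trp53

# Without an argument, .split() splits on whitespace
print(mouse_gene.split())   # ['Tp53', 'P53,', 'Trp53']

# Splitting on ',' keeps the space that followed the comma
mouse_gene_list = mouse_gene.split(',')
print(mouse_gene_list)      # ['Tp53 P53', ' Trp53']

# Splitting on ', ' (comma plus space) gives clean items for this string
clean_gene_list = mouse_gene.split(', ')
print(clean_gene_list)      # ['Tp53 P53', 'Trp53']
```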

#### Some notes on variables

- Variables can be reassigned (their values can be overwritten)
- `len(data type)`
    - For any sequence data type, `len()` tells us how many elements it has.
    - Strings and lists have a length; numbers don't.

In [14]:
print(len(mouse_gene))
len(mouse_gene_list)

15


2
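
To illustrate the point that numbers have no length, the sketch below calls `len()` on an integer and catches the resulting error. The value `12345` is an arbitrary illustrative number, and `try`/`except` is used here only so that the snippet runs to the end.

```python
gene_name = "tp53"
print(len(gene_name))   # 4

number = 12345          # arbitrary illustrative value
try:
    len(number)
except TypeError as error:
    # Integers are not sequences, so len() raises a TypeError
    print("len() failed:", error)

# Converting the number to a string first gives the number of digits
print(len(str(number)))  # 5
```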

### List: `list( )`

- creating empty list
- creating list with items
    - example: use UniProt accessions "O09185", "P79734", "P41685", "P04637"

In [84]:
tp53_entry_list = ["O09185", "P79734", "P41685", "P04637"]
tp53_entry_list

['O09185', 'P79734', 'P41685', 'P04637']

- list method to append new item
    - example: use UniProt entry "P67938"
- getting length of the list

In [85]:
tp53_entry_list.append("P67938")
len(tp53_entry_list)

5

- accessing items using index
- accessing items by slicing

In [86]:
tp53_entry_list[0]
tp53_entry_list[4]
tp53_entry_list[-1]
tp53_entry_list[2:4]
tp53_entry_list[:4]
tp53_entry_list[2:]

['P41685', 'P04637', 'P67938']
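
Only the last expression of the cell above is displayed. As a sketch, printing each expression separately shows what indexing and slicing return for this five-item list.

```python
tp53_entry_list = ["O09185", "P79734", "P41685", "P04637", "P67938"]

print(tp53_entry_list[0])    # O09185 (first item, index 0)
print(tp53_entry_list[4])    # P67938 (fifth item)
print(tp53_entry_list[-1])   # P67938 (negative index counts from the end)
print(tp53_entry_list[2:4])  # ['P41685', 'P04637'] (index 4 is excluded)
print(tp53_entry_list[:4])   # ['O09185', 'P79734', 'P41685', 'P04637']
print(tp53_entry_list[2:])   # ['P41685', 'P04637', 'P67938']
```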

- sorting list alphanumerically

In [87]:
sorted(tp53_entry_list)

['O09185', 'P04637', 'P41685', 'P67938', 'P79734']

#### Exercise

`sorted()` returns a new, sorted list and leaves the original list unchanged. If you instead sort in place with the list method `.sort()` and then retry accessing items, you will notice that the order of the original list has changed. Remember that lists are mutable: an in-place operation overwrites their contents.

**Question:** What would you do to keep the original list but also have a list that contains the items in sorted order? Start by recreating the original list that you used one step earlier.

In [89]:
tp53_entry_list = ['O09185', 'P79734', 'P41685', 'P04637', 'P67938']
sorted_tp53_list = sorted(tp53_entry_list)
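
The difference between the two ways of sorting can be sketched as follows. The name `in_place_list` is illustrative; it holds a copy so that the original list is not touched by `.sort()`.

```python
tp53_entry_list = ['O09185', 'P79734', 'P41685', 'P04637', 'P67938']

# sorted() returns a new list; the original order is kept
sorted_tp53_list = sorted(tp53_entry_list)
print(tp53_entry_list[0])    # O09185
print(sorted_tp53_list[1])   # P04637

# .sort() reorders the list it is called on and returns None,
# so we apply it to a copy to protect the original
in_place_list = tp53_entry_list.copy()
in_place_list.sort()
print(in_place_list == sorted_tp53_list)  # True
print(tp53_entry_list[0])                 # O09185 (still unchanged)
```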

### Looping: `for` loop

1. use loop for printing each item of the list
1. iterate by `range()`
1. use range to access list item

In [90]:
for entry in tp53_entry_list:
    print(entry)

O09185
P79734
P41685
P04637
P67938


In [92]:
for entry in range(5):
    print(entry)

0
1
2
3
4


In [93]:
for entry in range(5):
    print(tp53_entry_list[entry])

O09185
P79734
P41685
P04637
P67938


#### Summary: `for` loop

- `for` loops can be used to repeat a block of code for each item in a list.
- `range( )` produces a sequence of numbers, which can be used to execute the loop a given number of times, for example to access the items of a list by index and operate on them.
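
A brief sketch of what `range( )` produces: in Python 3 it is a `range` object rather than a list, and it can be converted with `list( )` to inspect its numbers. Using `len( )` instead of a hard-coded `5` keeps the loop correct if the list grows. The names `numbers` and `visited` are illustrative.

```python
numbers = range(5)
print(numbers)         # range(0, 5)
print(list(numbers))   # [0, 1, 2, 3, 4]

# range(start, stop): the stop value is excluded
print(list(range(2, 5)))  # [2, 3, 4]

tp53_entry_list = ['O09185', 'P79734', 'P41685', 'P04637', 'P67938']

# range(len(...)) yields one index per list item
visited = []
for index in range(len(tp53_entry_list)):
    visited.append(tp53_entry_list[index])
print(visited == tp53_entry_list)  # True
```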

### String Formatting

Printing out multiple items with more information.

In [94]:
print("this is a list", tp53_entry_list)
print("item in index", 0, "is", tp53_entry_list[0])
print(f"item in index {0} is {tp53_entry_list[0]}")
print(f"item in index {1} is {tp53_entry_list[1]}")

this is a list ['O09185', 'P79734', 'P41685', 'P04637', 'P67938']
item in index 0 is O09185
item in index 0 is O09185
item in index 1 is P79734


#### Exercise

You've learned how to use a `for` loop with `range( )`, how to get the length of a list with `len( )`, and how to use string formatting.

Combining these concepts, can you create a `for` loop that uses the expression below to print each index and the corresponding item of your list?

`f"Item in {index_number} index is {list_name[index_number]}"`

In [95]:
for index in range(len(tp53_entry_list)):
    print(f"Item in {index} index is {tp53_entry_list[index]}")

Item in 0 index is O09185
Item in 1 index is P79734
Item in 2 index is P41685
Item in 3 index is P04637
Item in 4 index is P67938
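
As an alternative sketch (not part of the exercise as stated), Python's built-in `enumerate( )` yields the index and the item together, so the list does not need to be indexed inside the loop. The name `lines` is illustrative and only collects the messages for inspection.

```python
tp53_entry_list = ['O09185', 'P79734', 'P41685', 'P04637', 'P67938']

lines = []
for index, entry in enumerate(tp53_entry_list):
    # enumerate() gives (index, item) pairs, starting at index 0
    message = f"Item in {index} index is {entry}"
    lines.append(message)
    print(message)
```

This prints the same five lines as the `range(len(...))` version above.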


### Looking up items in multiple lists

- Comparing two lists
- Accessing items from multiple lists using the same index number in a `for` loop

Example: 

- If you used the list of UniProt entries from the last exercise, make another list that contains the organism name corresponding to each entry, i.e. "Chinese hamster", "Zebrafish", "Cat", "Human", "Zebu".
- How can you print each entry with its organism information? (Hint: "TP53 gene entry of human is P04637")

In [107]:
tp53_entry_list = ['O09185', 'P79734', 'P41685', 'P04637', 'P67938']
organism_list = ["Chinese hamster", "Zebrafish", "Cat", "Human", "Zebu"]

for index_num in range(len(tp53_entry_list)):
    print(f"TP53 gene entry for {organism_list[index_num]} is {tp53_entry_list[index_num]}")

TP53 gene entry for Chinese hamster is O09185
TP53 gene entry for Zebrafish is P79734
TP53 gene entry for Cat is P41685
TP53 gene entry for Human is P04637
TP53 gene entry for Zebu is P67938
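
Another way to walk through two lists in parallel, sketched here as an alternative to a shared index, is the built-in `zip( )`. It pairs items by position and stops at the end of the shorter list, so a length check is a reasonable safeguard. The name `pairs` is illustrative.

```python
tp53_entry_list = ['O09185', 'P79734', 'P41685', 'P04637', 'P67938']
organism_list = ["Chinese hamster", "Zebrafish", "Cat", "Human", "Zebu"]

# zip() silently drops unmatched items, so compare the lengths first
if len(tp53_entry_list) != len(organism_list):
    print("Warning: the lists have different lengths")

pairs = list(zip(organism_list, tp53_entry_list))
for organism, entry in pairs:
    print(f"TP53 gene entry for {organism} is {entry}")
```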


### Dictionary: `dict( )`

#### working with dictionaries
 
- creating empty dictionary 
- creating dictionary with key-value pairs
- using key for looking up its value
- manipulating values

#### Example:

Use gene and organism information from the previous lists.

- Genes = 'O09185', 'P79734', 'P41685', 'P04637'
- Organism = "Chinese hamster", "Zebrafish", "Cat", "Human"

In [108]:
my_dict = {}
organism_gene_dict = {"Chinese hamster": 'O09185', 
                      "Zebrafish": 'P79734',
                      "Cat": 'P41685',
                      "Human": 'P04637',
                     }
organism_gene_dict["Human"]

'P04637'
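
Looking up a key that is not in the dictionary with square brackets raises a `KeyError`. The sketch below shows two common ways to handle that: testing membership with `in`, and using `.get( )` with a fallback value. The key `"Mouse"` and the fallback `"NA"` are illustrative choices.

```python
organism_gene_dict = {"Chinese hamster": 'O09185',
                      "Zebrafish": 'P79734',
                      "Cat": 'P41685',
                      "Human": 'P04637',
                     }

print("Human" in organism_gene_dict)   # True
print("Mouse" in organism_gene_dict)   # False

# .get() returns None (or a chosen default) instead of raising KeyError
print(organism_gene_dict.get("Human"))        # P04637
print(organism_gene_dict.get("Mouse"))        # None
print(organism_gene_dict.get("Mouse", "NA"))  # NA
```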

#### adding or removing items

- adding new key-value pair in dict
    - example: "Zebu" -> 'P67938'
- manipulating values
- deleting certain key-value pair

In [110]:
organism_gene_dict["Zebu"] = 'P67938'
organism_gene_dict["Zebu"] = 'NA'
del organism_gene_dict["Zebu"]
organism_gene_dict

{'Cat': 'P41685',
 'Chinese hamster': 'O09185',
 'Human': 'P04637',
 'Zebrafish': 'P79734'}
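
The three steps in the cell above can be sketched one at a time, checking the dictionary after each. The final `.pop( )` call with a default is an addition for illustration: unlike `del`, it does not raise an error when the key is already gone. The name `removed` is illustrative.

```python
organism_gene_dict = {"Chinese hamster": 'O09185',
                      "Zebrafish": 'P79734',
                      "Cat": 'P41685',
                      "Human": 'P04637',
                     }

organism_gene_dict["Zebu"] = 'P67938'   # add a new key-value pair
print(len(organism_gene_dict))          # 5

organism_gene_dict["Zebu"] = 'NA'       # overwrite the value of an existing key
print(organism_gene_dict["Zebu"])       # NA

del organism_gene_dict["Zebu"]          # delete the key-value pair
print("Zebu" in organism_gene_dict)     # False

# .pop() with a default returns the default if the key is missing
removed = organism_gene_dict.pop("Zebu", None)
print(removed)                          # None
```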

#### operating on dictionary items

- getting all the `.keys( )`
- getting all the `.values( )`
- getting all the `.items( )`

In [109]:
organism_gene_dict.keys()
organism_gene_dict.values()
organism_gene_dict.items()

dict_items([('Chinese hamster', 'O09185'), ('Zebrafish', 'P79734'), ('Cat', 'P41685'), ('Human', 'P04637')])

#### Summary: `dict( )`

- Dictionaries are another data type; they store key-value pairs.
- The `.keys( )`, `.values( )` and `.items( )` methods return views of a dictionary's keys, values and key-value pairs, which can be looped over or converted to lists with `list( )`.
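
As a small sketch of working with these methods: wrapping the result in `list( )` gives a regular list that supports indexing and sorting. The names `organisms` and `accessions` are illustrative.

```python
organism_gene_dict = {"Chinese hamster": 'O09185',
                      "Zebrafish": 'P79734',
                      "Cat": 'P41685',
                      "Human": 'P04637',
                     }

organisms = list(organism_gene_dict.keys())
accessions = list(organism_gene_dict.values())

# Dictionaries keep insertion order (Python 3.7+)
print(organisms[0])        # Chinese hamster
print(sorted(accessions))  # ['O09185', 'P04637', 'P41685', 'P79734']
print(len(organism_gene_dict.items()))  # 4
```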

## Combining Python Concepts

### Looping on `dict( )` items

#### Questions

- How can I get all the key value pairs one by one?
- How can I operate on the values of each key?

In [111]:
for dict_item in organism_gene_dict.items():
    print(dict_item)

('Chinese hamster', 'O09185')
('Zebrafish', 'P79734')
('Cat', 'P41685')
('Human', 'P04637')


In [105]:
for dict_key in organism_gene_dict.keys():
    print(organism_gene_dict[dict_key])

O09185
P79734
P41685
P04637
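
Each item returned by `.items( )` is a (key, value) pair, so the loop can unpack it into two variables directly. This is a sketch that combines the two loops above with the string formatting seen earlier; the names `organism`, `accession` and `messages` are illustrative.

```python
organism_gene_dict = {"Chinese hamster": 'O09185',
                      "Zebrafish": 'P79734',
                      "Cat": 'P41685',
                      "Human": 'P04637',
                     }

messages = []
for organism, accession in organism_gene_dict.items():
    # Each (key, value) pair is unpacked into two variables
    message = f"TP53 gene entry for {organism} is {accession}"
    messages.append(message)
    print(message)
```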


## Exercise:

**Objective**: Understand how to work with multiple dictionaries that have the same set of keys but different information in their associated values.

- Create another dictionary that uses the organism as key (like the last dictionary), but has UniProt entry names as the associated values.
    - To continue using the same UniProt example, use this set of key value pairs:
        - Chinese hamster -> P53_CRIGR
        - Zebrafish -> P53_DANRE
        - Cat -> P53_FELCA
        - Human -> P53_HUMAN
- Use a `for` loop to access keys from one dictionary and print information from both dictionaries.
- Discuss where in your work such concepts will be useful.

In [114]:
organism_entryname_dict = {"Chinese hamster": "P53_CRIGR",
                       "Zebrafish": "P53_DANRE",
                       "Cat": "P53_FELCA",
                       "Human": "P53_HUMAN"
                      }
for up_key in organism_gene_dict:
    print(organism_gene_dict[up_key], organism_entryname_dict[up_key])

O09185 P53_CRIGR
P79734 P53_DANRE
P41685 P53_FELCA
P04637 P53_HUMAN
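
One possible extension, sketched under our own assumptions rather than taken from the webinar, is to merge both dictionaries into a single nested dictionary. The name `combined_dict`, the inner keys `"accession"` and `"entry_name"`, and the `"NA"` fallback are all illustrative. `.get( )` is used so that an organism missing from the second dictionary does not stop the loop.

```python
organism_gene_dict = {"Chinese hamster": 'O09185',
                      "Zebrafish": 'P79734',
                      "Cat": 'P41685',
                      "Human": 'P04637',
                     }
organism_entryname_dict = {"Chinese hamster": "P53_CRIGR",
                           "Zebrafish": "P53_DANRE",
                           "Cat": "P53_FELCA",
                           "Human": "P53_HUMAN",
                          }

combined_dict = {}
for organism in organism_gene_dict:
    combined_dict[organism] = {
        "accession": organism_gene_dict[organism],
        # Fall back to "NA" if the organism has no entry name recorded
        "entry_name": organism_entryname_dict.get(organism, "NA"),
    }

print(combined_dict["Human"])
# {'accession': 'P04637', 'entry_name': 'P53_HUMAN'}
```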
