# Computational Skills for Biocuration

## Programming Skills with Python

### Reading a local file

#### Example file: 

- Search term: ["gene:tp53 AND reviewed:yes"](https://www.uniprot.org/uniprot/?query=gene%3ATP53+AND+reviewed%3Ayes&sort=score)
- Download the results as an uncompressed, tab-separated file to your local directory and save it under the file name: **uniprot-tp53.tab**

## Handling files in Python

- How can I open a file programmatically?

with open('uniprot-tp53.tab', 'r') as in_fh:
    # read() returns the whole file as one string; the file is closed when the block ends
    print("Reading file:", in_fh.read())

- How can I read each line of a file programmatically?
- **Tip**: Use a smaller file to test your code: **uniprot-tp53-small.tab**

In [13]:
with open('uniprot-tp53-small.tab', 'r') as in_fh:
    for line in in_fh:
        # Each line keeps its trailing newline, so print() leaves a blank line after it
        print("Reading line:", line)

Reading line: Entry	Entry name	Status	Protein names	Gene names	Organism	Length

Reading line: O09185	P53_CRIGR	reviewed	Cellular tumor antigen p53 (Tumor suppressor p53)	TP53 P53	Cricetulus griseus (Chinese hamster) (Cricetulus barabensis griseus)	393

Reading line: P79734	P53_DANRE	reviewed	Cellular tumor antigen p53 (Tumor suppressor p53)	tp53 drp53	Danio rerio (Zebrafish) (Brachydanio rerio)	373

Reading line: P41685	P53_FELCA	reviewed	Cellular tumor antigen p53 (Tumor suppressor p53)	TP53 TRP53	Felis catus (Cat) (Felis silvestris catus)	386

Reading line: P04637	P53_HUMAN	reviewed	Cellular tumor antigen p53 (Antigen NY-CO-13) (Phosphoprotein p53) (Tumor suppressor p53)	TP53 P53	Homo sapiens (Human)	393


- How can I read specific information from a file programmatically?

In [16]:
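with open('uniprot-tp53-small.tab', 'r') as in_fh:
    for line in in_fh:
        # Split each tab-separated line once, then pick columns by position
        fields = line.rstrip('\n').split('\t')
        entry = fields[0]       # first column: Entry (UniProt ID)
        organism = fields[5]    # sixth column: Organism
        print(entry, organism)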

Entry Organism
O09185 Cricetulus griseus (Chinese hamster) (Cricetulus barabensis griseus)
P79734 Danio rerio (Zebrafish) (Brachydanio rerio)
P41685 Felis catus (Cat) (Felis silvestris catus)
P04637 Homo sapiens (Human)
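
Picking columns by position (`[0]`, `[5]`) depends on the column order staying the same. As an alternative sketch, the standard library's `csv.DictReader` can use the header row to look fields up by name. The example is self-contained: it first writes two rows copied from the small file above into a temporary file. The file name `sample.tab` and the helper `read_entry_organism` are made up for illustration.

```python
import csv
import os
import tempfile

# Header and two entries copied from the small example file above.
SAMPLE = (
    "Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\n"
    "P79734\tP53_DANRE\treviewed\tCellular tumor antigen p53 (Tumor suppressor p53)\t"
    "tp53 drp53\tDanio rerio (Zebrafish) (Brachydanio rerio)\t373\n"
    "P04637\tP53_HUMAN\treviewed\tCellular tumor antigen p53 (Antigen NY-CO-13) "
    "(Phosphoprotein p53) (Tumor suppressor p53)\tTP53 P53\tHomo sapiens (Human)\t393\n"
)


def read_entry_organism(path):
    """Return (entry, organism) pairs, looking columns up by header name."""
    pairs = []
    with open(path, "r", newline="") as in_fh:
        # DictReader consumes the header row and uses its values as keys.
        reader = csv.DictReader(in_fh, delimiter="\t")
        for row in reader:
            pairs.append((row["Entry"], row["Organism"]))
    return pairs


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.tab")  # hypothetical file name
    with open(sample_path, "w") as out_fh:
        out_fh.write(SAMPLE)
    pairs = read_entry_organism(sample_path)

for entry, organism in pairs:
    print(entry, organism)
```

Because the header row is consumed by the reader, it does not show up among the printed pairs.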


- How can I write information such as curated data to a file?
    - Use "tp53-entry-organism.tab" as your file name.

In [26]:
with open('uniprot-tp53-small.tab', 'r') as in_fh:
    with open('tp53-entry-organism.tab', 'w') as out_fh:
        for line in in_fh:
            fields = line.rstrip('\n').split('\t')
            entry = fields[0]
            organism = fields[5]
            # Tab-separated output, in keeping with the .tab file extension
            out_fh.write(f"{entry}\t{organism}\n")
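
After writing a file, it can help to read it back and check what was actually stored. The following is a minimal sketch using `csv.writer` from the standard library. It assumes tab-separated output (suggested by the `.tab` extension) and writes into a temporary directory so that it does not touch your working files. The two pairs are copied from the output shown earlier.

```python
import csv
import os
import tempfile

# Entry/organism pairs copied from the output shown earlier.
pairs = [
    ("P79734", "Danio rerio (Zebrafish) (Brachydanio rerio)"),
    ("P41685", "Felis catus (Cat) (Felis silvestris catus)"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "tp53-entry-organism.tab")

    # newline="" lets the csv module control the line endings itself.
    with open(out_path, "w", newline="") as out_fh:
        writer = csv.writer(out_fh, delimiter="\t", lineterminator="\n")
        writer.writerow(["Entry", "Organism"])  # header row
        writer.writerows(pairs)

    # Read the file back to check what was written.
    with open(out_path, "r", newline="") as in_fh:
        written = in_fh.read()

print(written)
```

Using a writer object instead of building each line by hand means fields containing the delimiter would be quoted automatically.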

### Exercise:

Fill in the blanks in the code below, to create a filter which:

 * Prints the "Entry" (UniProt ID) and "Organism" if (and only if) the Organism is "Danio rerio (Zebrafish) (Brachydanio rerio)" or "Felis catus (Cat) (Felis silvestris catus)"

In [None]:
### Debugging time!

with open('uniprot-tp53-small.tab', 'r') as in_fh:
    for lines in in_fh:
        entry = lines.split('\t')[0]
        organism = lines.split('\t')[5]
        if --- == "Danio rerio (Zebrafish) (Brachydanio rerio)":
            ---
        --- organism --- "Felis catus (Cat) (Felis silvestris catus)":
            print(entry, organism)

In [None]:
### Solution

with open('uniprot-tp53-small.tab', 'r') as in_fh:
    for lines in in_fh:
        entry = lines.split('\t')[0]
        organism = lines.split('\t')[5]
        if organism == "Danio rerio (Zebrafish) (Brachydanio rerio)":
            print(entry, organism)
        elif organism == "Felis catus (Cat) (Felis silvestris catus)":
            print(entry, organism)
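
The two branches of the solution do the same thing, so the filter can also be written as a single membership test against a set. This is one possible variation, not the only valid answer. The helper name `filter_by_organism` is hypothetical, and the sample lines stand in for the file so that the sketch runs on its own.

```python
# Organisms to keep; the names are taken from the exercise text.
wanted = {
    "Danio rerio (Zebrafish) (Brachydanio rerio)",
    "Felis catus (Cat) (Felis silvestris catus)",
}

# Stand-in for the lines of 'uniprot-tp53-small.tab' (copied from the output above).
sample_lines = [
    "Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\n",
    "O09185\tP53_CRIGR\treviewed\tCellular tumor antigen p53 (Tumor suppressor p53)\t"
    "TP53 P53\tCricetulus griseus (Chinese hamster) (Cricetulus barabensis griseus)\t393\n",
    "P79734\tP53_DANRE\treviewed\tCellular tumor antigen p53 (Tumor suppressor p53)\t"
    "tp53 drp53\tDanio rerio (Zebrafish) (Brachydanio rerio)\t373\n",
    "P41685\tP53_FELCA\treviewed\tCellular tumor antigen p53 (Tumor suppressor p53)\t"
    "TP53 TRP53\tFelis catus (Cat) (Felis silvestris catus)\t386\n",
]


def filter_by_organism(lines, wanted_organisms):
    """Return (entry, organism) pairs whose organism is in wanted_organisms."""
    matches = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        entry, organism = fields[0], fields[5]
        if organism in wanted_organisms:  # one test replaces the if/elif pair
            matches.append((entry, organism))
    return matches


matches = filter_by_organism(sample_lines, wanted)
for entry, organism in matches:
    print(entry, organism)
```

Adding a third organism then only requires a new item in `wanted`, not another `elif` branch.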

### Exercise:

- Try to use the different concepts learned today in this exercise (e.g. lists, dicts, sets, functions, loops and conditionals).


- Create a function "extract_uniprot_organism" that can read any UniProt file with the same layout as the UniProt example file.
- Print the unique set of organism names.


- Can you identify how many organisms have multiple entries?
    - Hint: compare the length of the full list with the length of the unique list.

In [32]:
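def extract_uniprot_organism(in_file):
    """Return the set of organism names found in a UniProt tab-separated file."""
    organism_list = []
    with open(in_file, 'r') as in_fh:
        for line in in_fh:
            # Note: the header row is not skipped, so 'Organism' is included
            organism = line.split('\t')[5]
            organism_list.append(organism)
    # Length of the full list and length of the unique list
    print(len(organism_list), len(set(organism_list)))
    return set(organism_list)

unique_organism = extract_uniprot_organism("uniprot-tp53.tab")
print(unique_organism)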

59 36
{'Barbus barbus (Barbel) (Cyprinus barbus)', 'Sus scrofa (Pig)', 'Macaca mulatta (Rhesus macaque)', 'Bos taurus (Bovine)', 'Cavia porcellus (Guinea pig)', 'Xiphophorus maculatus (Southern platyfish) (Platypoecilus maculatus)', 'Delphinapterus leucas (Beluga whale)', 'Rattus norvegicus (Rat)', 'Spermophilus beecheyi (California ground squirrel) (Otospermophilus beecheyi)', 'Danio rerio (Zebrafish) (Brachydanio rerio)', 'Canis lupus familiaris (Dog) (Canis familiaris)', 'Felis catus (Cat) (Felis silvestris catus)', 'Organism', 'Homo sapiens (Human)', 'Bos indicus (Zebu)', 'Mus musculus (Mouse)', 'Cricetulus griseus (Chinese hamster) (Cricetulus barabensis griseus)', 'Mesocricetus auratus (Golden hamster)', 'Ictalurus punctatus (Channel catfish) (Silurus punctatus)', 'Oryctolagus cuniculus (Rabbit)', 'Gallus gallus (Chicken)', 'Tetraodon miurus (Congo puffer)', 'Tupaia belangeri (Common tree shrew) (Tupaia glis belangeri)', 'Oncorhynchus mykiss (Rainbow trout) (Salmo gairdneri)', 'P
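
Note that the difference between the two lengths counts surplus entries, not organisms: an organism with three entries contributes 2 to the difference. To count the organisms that have more than one entry, one option is `collections.Counter`. The list below is a toy example invented for illustration (it is not the content of the UniProt file); in practice you would pass in the full `organism_list`.

```python
from collections import Counter

# Toy list for illustration only; the real list comes from the file.
organism_list = [
    "Homo sapiens (Human)",
    "Mus musculus (Mouse)",
    "Mus musculus (Mouse)",
    "Mus musculus (Mouse)",
    "Bos taurus (Bovine)",
    "Bos taurus (Bovine)",
]

# Count how often each organism name occurs.
counts = Counter(organism_list)

# Keep only the organisms that occur more than once.
multi = {name: n for name, n in counts.items() if n > 1}

# Difference between full and unique lengths, as in the hint.
surplus = len(organism_list) - len(set(organism_list))

print("Organisms with multiple entries:", len(multi))
print("Surplus entries (full minus unique):", surplus)
```

In this toy list, two organisms have multiple entries, while the length difference is 3, which shows that the two numbers answer different questions.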

**Optional Exercise**: 
- Expand your function so that it also writes each "Entry - organism" pair to a file.
    - Hint: See the code that you wrote earlier to write your first file: "tp53-entry-organism.tab"

In [37]:
def extract_uniprot_organism(in_file, out_file):
    """Write entry/organism pairs to out_file and return the set of organisms."""
    organism_list = []
    with open(in_file, 'r') as in_fh, open(out_file, 'w') as out_fh:
        for line in in_fh:
            fields = line.rstrip('\n').split('\t')
            entry = fields[0]
            organism = fields[5]
            # Tab-separated output, in keeping with the .tab file extension
            out_fh.write(f'{entry}\t{organism}\n')
            organism_list.append(organism)
    # Full list length minus unique list length
    print(len(organism_list) - len(set(organism_list)))
    return set(organism_list)

unique_organism = extract_uniprot_organism("uniprot-tp53.tab", "tp53-entry-organism.tab")
print(unique_organism)

23
{'Barbus barbus (Barbel) (Cyprinus barbus)', 'Sus scrofa (Pig)', 'Macaca mulatta (Rhesus macaque)', 'Bos taurus (Bovine)', 'Cavia porcellus (Guinea pig)', 'Xiphophorus maculatus (Southern platyfish) (Platypoecilus maculatus)', 'Delphinapterus leucas (Beluga whale)', 'Rattus norvegicus (Rat)', 'Spermophilus beecheyi (California ground squirrel) (Otospermophilus beecheyi)', 'Danio rerio (Zebrafish) (Brachydanio rerio)', 'Canis lupus familiaris (Dog) (Canis familiaris)', 'Felis catus (Cat) (Felis silvestris catus)', 'Organism', 'Homo sapiens (Human)', 'Bos indicus (Zebu)', 'Mus musculus (Mouse)', 'Cricetulus griseus (Chinese hamster) (Cricetulus barabensis griseus)', 'Mesocricetus auratus (Golden hamster)', 'Ictalurus punctatus (Channel catfish) (Silurus punctatus)', 'Oryctolagus cuniculus (Rabbit)', 'Gallus gallus (Chicken)', 'Tetraodon miurus (Congo puffer)', 'Tupaia belangeri (Common tree shrew) (Tupaia glis belangeri)', 'Oncorhynchus mykiss (Rainbow trout) (Salmo gairdneri)', 'Pong