# Computational Skills for Biocuration

## Programming Skills with Python

### Practical Exercises

- UniProt example file using the search term: ["gene:tp53 AND reviewed:yes"](https://www.uniprot.org/uniprot/?query=gene%3ATP53+AND+reviewed%3Ayes&sort=score)

In this session we will build a pipeline that carries out a series of tasks, from fetching data from the internet to storing the curated data.

#### Step 1: Locate data: Where is your data? 

*Hint: Look up REST API links*

- Here you will download the example file from UniProt.

- Create the download link "https://www.uniprot.org/uniprot?query=gene:TP53%20AND%20reviewed:yes&format=tab" in such a way that your code can update it for any gene name (here, TP53).

In [20]:
genename = 'TP53'
# The f prefix makes Python substitute {genename} with the variable's value
uniprot_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:yes&format=tab"

print(uniprot_url)

https://www.uniprot.org/uniprot?query=gene:TP53%20AND%20reviewed:yes&format=tab


#### Step 2: How can you find data for a different query?

*Hint: Think about reusing your code by creating variables for your queries*

- Edit your code so that you can also download other file formats such as 'fasta' (here it is 'tab').

In [4]:
genename = 'TP53'
fileformat = 'tab'
# The f prefix makes Python substitute {genename} and {fileformat} with their values
uniprot_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:yes&format={fileformat}"

print(uniprot_url)

https://www.uniprot.org/uniprot?query=gene:TP53%20AND%20reviewed:yes&format=tab


#### Step 3: Can you bring more flexibility to your query?

*Hint: Are there more parameters that you can allow your users to change?*

- Which other parts of your link can be replaced by variables supplied by your users?

In [21]:
genename = 'TP53'
fileformat = 'tab'
reviewed = 'yes'
# The f prefix makes Python substitute each {name} with the variable's value
uniprot_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:{reviewed}&format={fileformat}"

print(uniprot_url)

https://www.uniprot.org/uniprot?query=gene:TP53%20AND%20reviewed:yes&format=tab
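
As a minimal sketch of where these steps lead, the query parameters can be wrapped in a small helper that lets the standard library handle the URL encoding. The function name `build_uniprot_url` and its default arguments are illustrative assumptions, not part of the exercise; the base address is the one used in the links above. Note that `urlencode` writes spaces as `+` and colons as `%3A`, which is the same style as the search link at the top of this section.

```python
from urllib.parse import urlencode

# Base address taken from the download link used in this exercise
BASE_URL = "https://www.uniprot.org/uniprot"


def build_uniprot_url(genename, reviewed="yes", fileformat="tab", base_url=BASE_URL):
    """Return a query URL for one gene name (hypothetical helper)."""
    # Build the human-readable query first, then let urlencode escape it
    query = f"gene:{genename} AND reviewed:{reviewed}"
    params = urlencode({"query": query, "format": fileformat})
    return f"{base_url}?{params}"


print(build_uniprot_url("TP53"))
print(build_uniprot_url("BRCA1", fileformat="fasta"))
```

Keeping the defaults in the function signature means a user only has to supply the gene name, while the format and review status can still be overridden when needed.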


#### Step 4: Get data

- Import `requests` and use it to get the UniProt file corresponding to the link.

In [13]:
import requests

result = requests.get(uniprot_url)

#### Step 5: Make sure that everything is OK with your request

*Hint: remember `result.status_code`?*

- Using an if-else conditional, check whether the query was successful (Tip: `if result.ok:`) and, if not, check which HTTP status code the server returned.
- If the query was successful, save the data to a variable (Tip: `result.text`).

In [16]:
if result.ok:
    # Keep the response body and show it
    query_data = result.text
    print(query_data)
else:
    print('Something went wrong ', result.status_code)

Entry	Entry name	Status	Protein names	Gene names	Organism	Length
O09185	P53_CRIGR	reviewed	Cellular tumor antigen p53 (Tumor suppressor p53)	TP53 P53	Cricetulus griseus (Chinese hamster) (Cricetulus barabensis griseus)	393
P79734	P53_DANRE	reviewed	Cellular tumor antigen p53 (Tumor suppressor p53)	tp53 drp53	Danio rerio (Zebrafish) (Brachydanio rerio)	373
Q29480	P53_EQUAS	reviewed	Cellular tumor antigen p53 (Tumor suppressor p53) (Fragment)	TP53	Equus asinus (Donkey) (Equus africanus asinus)	207
P41685	P53_FELCA	reviewed	Cellular tumor antigen p53 (Tumor suppressor p53)	TP53 TRP53	Felis catus (Cat) (Felis silvestris catus)	386
P04637	P53_HUMAN	reviewed	Cellular tumor antigen p53 (Antigen NY-CO-13) (Phosphoprotein p53) (Tumor suppressor p53)	TP53 P53	Homo sapiens (Human)	393
Q9W678	P53_BARBU	reviewed	Cellular tumor antigen p53 (Tumor suppressor p53)	tp53 p53	Barbus barbus (Barbel) (Cyprinus barbus)	369
P67938	P53_BOSIN	reviewed	Cellular tumor antigen p53 (Tumor suppressor p53)	TP53	Bos 

#### Step 6: Save your data

- Save the downloaded file to a local file (Tip: `with open(filename, 'w') as out_fh:`)

In [17]:
with open('uniprot_tp53.tab', 'w') as out_fh:
    out_fh.write(query_data)

- Congrats! You made a pipeline (create the query link, query the UniProt database, save the returned data to a file).
- Please put the code for these tasks together, so that you can run the same analysis in a single code cell of this Jupyter Notebook.

### Make a working pipeline by putting pieces together

- Pseudocode for this task:
    - import requests
    - create a variable for gene id, uniprot basic link, any other parameters
    - combine the links to one
    - query UniProt database
    - check response, if it's ok then save the output to a file 'uniprot_tp53.tab'

In [None]:
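import requests

# Query parameters
genename = 'TP53'
fileformat = 'tab'
reviewed = 'yes'
uniprot_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:{reviewed}&format={fileformat}"

# Query the UniProt database
result = requests.get(uniprot_url)

# Save the response only if the request succeeded
if result.ok:
    query_data = result.text
    with open('uniprot_tp53.tab', 'w') as out_fh:
        out_fh.write(query_data)
else:
    print('Something went wrong ', result.status_code)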

### Pipelines are awesome. Functions make your pipeline even better!

- Put this piece of code in a function that takes a gene name and an output file name.
- If you want, you can increase the number of arguments your function takes, so that your users can provide more information, such as the file format and whether the entries should be reviewed.

In [18]:
import requests

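def download_data(genename, outfile):
    """Download reviewed UniProt entries for a gene and save them to outfile."""
    uniprot_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:yes&format=tab"
    result = requests.get(uniprot_url)
    if result.ok:
        # Write the response text to the file name supplied by the caller
        with open(outfile, 'w') as out_fh:
            out_fh.write(result.text)
    else:
        print('Something went wrong ', result.status_code)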
    
genename = 'TP53'
outfile = 'uniprot_tp53.tab'
download_data(genename, outfile)

### Expand your pipeline with more functions

#### Write functions that read the downloaded file and curate data for you

- Write a function that reads the file and extracts the entry name (column 2), gene names (column 5) and organism (column 6).
- Expand your function to write this information to a file, using a tab ('\t') as the column separator.
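
One way to approach this exercise is sketched below. It looks the columns up by their header names ("Entry name", "Gene names", "Organism", as they appear in the downloaded file shown in Step 5) instead of by position, so it keeps working if the column order changes. The function name `extract_columns` is an illustrative assumption, and the sample uses an in-memory text buffer with two rows copied from the Step 5 output, so it runs without a network connection or a file on disk.

```python
import io


def extract_columns(in_fh, out_fh, wanted=("Entry name", "Gene names", "Organism")):
    """Copy the wanted columns from a tab-separated file to out_fh (hypothetical helper).

    Returns the number of data rows written.
    """
    # The first line holds the column names; find where the wanted ones are
    header = in_fh.readline().rstrip("\n").split("\t")
    indices = [header.index(name) for name in wanted]

    out_fh.write("\t".join(wanted) + "\n")
    count = 0
    for line in in_fh:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < len(header):
            continue  # skip blank or truncated lines
        out_fh.write("\t".join(fields[i] for i in indices) + "\n")
        count += 1
    return count


# Two rows copied from the example output above
sample = (
    "Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\n"
    "P04637\tP53_HUMAN\treviewed\tCellular tumor antigen p53 (Antigen NY-CO-13) "
    "(Phosphoprotein p53) (Tumor suppressor p53)\tTP53 P53\tHomo sapiens (Human)\t393\n"
    "P79734\tP53_DANRE\treviewed\tCellular tumor antigen p53 (Tumor suppressor p53)\t"
    "tp53 drp53\tDanio rerio (Zebrafish) (Brachydanio rerio)\t373\n"
)

curated = io.StringIO()
n_rows = extract_columns(io.StringIO(sample), curated)
print(n_rows)
print(curated.getvalue())
```

With real files, the same function can be called with two handles opened by `with open(...)`, one for reading the downloaded file and one for writing the curated output.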

#### Create local query database

*Hint: import sqlite3*

#### Query local database
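
The two steps above, creating a local database and then querying it, could look roughly like the following sketch, which uses Python's built-in `sqlite3` module. The table name `proteins`, its column names, the helper `find_by_organism` and the use of an in-memory database are illustrative assumptions; the three rows are copied from the example output in Step 5.

```python
import sqlite3

# Curated rows: (entry name, gene names, organism), copied from the example output
rows = [
    ("P53_HUMAN", "TP53 P53", "Homo sapiens (Human)"),
    ("P53_DANRE", "tp53 drp53", "Danio rerio (Zebrafish) (Brachydanio rerio)"),
    ("P53_FELCA", "TP53 TRP53", "Felis catus (Cat) (Felis silvestris catus)"),
]

# ":memory:" keeps the database in RAM; pass a file name instead to store it on disk
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE proteins ("
    "entry_name TEXT PRIMARY KEY, gene_names TEXT, organism TEXT)"
)
# The ? placeholders let sqlite3 handle quoting of the values
conn.executemany("INSERT INTO proteins VALUES (?, ?, ?)", rows)
conn.commit()


def find_by_organism(conn, text):
    """Return (entry_name, gene_names) for organisms containing text (hypothetical helper)."""
    cursor = conn.execute(
        "SELECT entry_name, gene_names FROM proteins "
        "WHERE organism LIKE ? ORDER BY entry_name",
        (f"%{text}%",),
    )
    return cursor.fetchall()


print(find_by_organism(conn, "Homo sapiens"))
```

Passing the search text as a parameter, rather than pasting it into the SQL string, avoids quoting problems with organism names that contain parentheses or apostrophes.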