# Computational Skills for Biocuration

## Programming Skills with Python

### Practical Exercises

- UniProt example file, found with the search term: ["gene:tp53 AND reviewed:yes"](https://www.uniprot.org/uniprot/?query=gene%3ATP53+AND+reviewed%3Ayes&sort=score)

In this session we will build a pipeline that carries out a series of tasks, from getting data from the internet to storing the curated data.

#### Step 1: Locate data: Where is your data? 

*Hint: Look up REST API links*

- Here you will download the example file from UniProt

- Create the download link "https://www.uniprot.org/uniprot?query=gene:TP53%20AND%20reviewed:yes&format=tab" in such a way that your code can update the link for any gene name (which is TP53 here)

In [None]:
uniprot_url = "https://www.uniprot.org/uniprot?query=gene:TP53%20AND%20reviewed:yes&format=tab"

print(uniprot_url)
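
The `%20` in the link is the URL-encoded form of a space. As a minimal sketch, the standard-library function `urllib.parse.quote` can produce that encoding from the human-readable search term. The variable names `query`, `encoded_query` and `base_url` are illustrative and not part of the exercise.

```python
from urllib.parse import quote, unquote

# Human-readable search term, as typed into the UniProt search box
query = "gene:TP53 AND reviewed:yes"

# Encode spaces as %20; keep ':' unescaped so the link matches the one above
encoded_query = quote(query, safe=":")

base_url = "https://www.uniprot.org/uniprot"  # illustrative variable name
uniprot_url = f"{base_url}?query={encoded_query}&format=tab"
print(uniprot_url)

# unquote() reverses the encoding
print(unquote(encoded_query))
```

Building the link this way means you only ever edit the readable search term, not the encoded one.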

#### Step 2: How can you find data for a different query?

*Hint: Think about reusing your code by creating variables for your queries*

- Edit your code so that the gene name is stored in a variable. Later you can use the same approach to download other file formats such as '.fasta' (which is '.tab' here)

In [None]:
genename = 'TP53'
uniprot_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:yes&format=tab"

print(uniprot_url)

#### Step 3: Can you bring more flexibility to your query?

*Hint: are there more parameters that you can allow your users to change?*

- Which other parameters in your link can be replaced by variables that can be supplied by your users?

In [None]:
genename = 'TP53'
fileformat = 'tab'
reviewed = 'yes'
uniprot_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:{reviewed}&format={fileformat}"
print(uniprot_url)
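
Once every changeable part of the link is a variable, the link-building step can be wrapped in a small function with default values. The following is only a sketch: the function name `build_uniprot_url` and its defaults are assumptions made for illustration, and `BRCA1` is just a second example gene name.

```python
def build_uniprot_url(genename, fileformat="tab", reviewed="yes"):
    """Return the UniProt download link for one gene (hypothetical helper)."""
    base_url = "https://www.uniprot.org/uniprot"
    # Same query layout as in the cell above, with %20 standing for a space
    query = f"gene:{genename}%20AND%20reviewed:{reviewed}"
    return f"{base_url}?query={query}&format={fileformat}"

# Defaults reproduce the link used so far
print(build_uniprot_url("TP53"))

# Only the arguments that differ from the defaults need to be supplied
print(build_uniprot_url("BRCA1", fileformat="fasta"))
```

Default arguments keep the common case short while still letting users override the format or the reviewed flag.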

#### Step 4: Get data

- Import `requests` and use it to get the UniProt file corresponding to the link.

In [None]:
import requests

result = requests.get(uniprot_url)

#### Step 5: Make sure that everything is OK with your request

*Hint: remember `result.status_code`?*

- Using an if-else conditional, check whether the query was successful (Tip: `if result.ok:`) and, if not, check which HTTP status code the server returned.
- If the query was successful, save the data to a variable (Tip: `result.text`)

In [None]:
if result.ok:
    print("Fetching data...")
    query_data = result.text
    print(query_data)
else:
    print('Something went wrong ', result.status_code)
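
A bare status code such as 404 is easier to act on when it comes with its standard description. As a small sketch, Python's built-in `http.HTTPStatus` can translate a code into its reason phrase; the helper name `describe_status` is hypothetical.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return a short description for an HTTP status code (hypothetical helper)."""
    try:
        return f"{status_code} {HTTPStatus(status_code).phrase}"
    except ValueError:
        # The code is not one of the standard codes known to Python
        return f"{status_code} (unknown status code)"

print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

In the cell above, `describe_status(result.status_code)` could replace the bare status code in the error message.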

#### Step 6: Save your data

- Save the downloaded file to a local file (Tip: `with open(filename, 'w') as out_fh:`)

In [None]:
with open('uniprot_tp53.tab', 'w') as out_fh:
    out_fh.write(query_data)
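
The file name above is hard-coded, so it would be wrong for any other gene or format. One possible approach, sketched here, is to derive the name from the same variables that build the link. The helper `make_outfile_name` is hypothetical, and the text written is a placeholder rather than real UniProt output.

```python
import os
import tempfile

def make_outfile_name(genename, fileformat):
    """Build a file name such as 'uniprot_tp53.tab' (hypothetical helper)."""
    return f"uniprot_{genename.lower()}.{fileformat}"

# Placeholder text standing in for result.text; not real UniProt output
query_data = "placeholder line 1\nplaceholder line 2\n"

# Write into a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    outfile = os.path.join(tmp_dir, make_outfile_name("TP53", "tab"))
    with open(outfile, "w") as out_fh:
        out_fh.write(query_data)
    # Read the file back to confirm that it holds what was written
    with open(outfile) as in_fh:
        saved_data = in_fh.read()

print(saved_data == query_data)
```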

- Congrats! You made a pipeline (create query link, query UniProt database, save returned data to a file)
- Please put the code for these tasks together, so that you can do the same analysis in a single code box of this Jupyter Notebook.

### Make a working pipeline by putting pieces together

- I have provided pseudocode for this task below:
    - import requests
    - create a variable for gene id, uniprot basic link, any other parameters
    - combine the links to one
    - query UniProt database
    - check response, if it's ok then save the output to a file 'uniprot_tp53.tab'

In [None]:
import requests

genename = 'TP53'
fileformat = 'tab'
reviewed = 'yes'
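# f-string, so that the variables are substituted into the link
uniprot_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:{reviewed}&format={fileformat}"

result = requests.get(uniprot_url)

# save the data only if the query was successful
if result.ok:
    query_data = result.text
    with open('uniprot_tp53.tab', 'w') as out_fh:
        out_fh.write(query_data)
else:
    print('Something went wrong ', result.status_code)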

### Pipelines are awesome. Functions make your pipeline even better!

- Put this piece of code in a function that takes a gene name and an output file name.
- If you want, you can increase the number of arguments your function takes, so that your users can provide more information such as the file format and whether entries must be reviewed.

In [None]:
import requests

def get_data(data_url, outfile):
    result = requests.get(data_url)
    query_data = result.text
    if result.ok:
        print("Fetching data...")
        with open(outfile, 'w') as out_fh:
            out_fh.write(query_data)
    else:
        print('Something went wrong ', result.status_code)
    
genename = 'TP53'
outfile = 'uniprot_tp53.tab'
# f-string, so that {genename} is replaced by the value of the variable
data_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:yes&format=tab"
get_data(data_url, outfile)

### Expand your pipeline with more functions

#### Write functions that read the downloaded file and curate data for you

- Write a function that reads an input file from UniProt and returns entry name (column 2), gene_name (column 4) and organism (column 6)
    - Use a list or a dictionary to return this information

In [None]:
import requests

def get_data(data_url, outfile):
    result = requests.get(data_url)
    query_data = result.text
    if result.ok:
        print("Fetching data...")
        with open(outfile, 'w') as out_fh:
            out_fh.write(query_data)
    else:
        print('Something went wrong!', result.status_code)
    
def curate_data(data_file):
    
    curated_list = [] # will store information from each line
    
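    with open(data_file, 'r') as in_fh:
        for line in in_fh:
            # split the tab-separated line into its columns
            columns = line.rstrip('\n').split('\t')
            if len(columns) < 6:
                continue # skip empty or incomplete lines
            entry_name = columns[1] #(column 2)
            gene_name = columns[3] #(column 4)
            organism = columns[5] #(column 6)
            # putting all variables into one string
            selected_data = f"{entry_name}\t{gene_name}\t{organism}"
            curated_list.append(selected_data)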
    return curated_list
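
As an alternative sketch, Python's built-in `csv` module can do the column splitting for tab-separated text. The example below assumes that the first line of the file is a header row and skips it. The function name `curate_rows` is hypothetical, and the sample text is made up for illustration; it is not real UniProt output.

```python
import csv
import io

# Made-up sample in tab-separated layout; NOT real UniProt output
sample = (
    "col1\tcol2\tcol3\tcol4\tcol5\tcol6\n"
    "id_A\tentry_A\tx\tgene_A\ty\torganism_A\n"
    "id_B\tentry_B\tx\tgene_B\ty\torganism_B\n"
)

def curate_rows(file_handle):
    """Return [entry_name, gene_name, organism] for each data row."""
    reader = csv.reader(file_handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be the first line)
    curated = []
    for row in reader:
        if len(row) < 6:
            continue  # skip empty or incomplete rows
        # columns 2, 4 and 6 of the file are indexes 1, 3 and 5
        curated.append([row[1], row[3], row[5]])
    return curated

# io.StringIO lets the sample text be read like an open file
curated = curate_rows(io.StringIO(sample))
print(curated)
```

Returning a list of lists keeps the three values separate, so a later step can still join them with a tab when writing to a file.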

- Expand your function to write this (or any) information to a file; choose tab ('\t') as the separator of your columns.

In [None]:
def list2file(data_list, filename):
    with open(filename, 'w') as out_fh:
        for items in data_list:
            out_fh.write(items+'\n')
    print("Created a file with curated data.")

In [None]:
## Define all arguments for my functions
genename = 'TP53'
data_url = f"https://www.uniprot.org/uniprot?query=gene:{genename}%20AND%20reviewed:yes&format=tab"

datafile = 'uniprot_tp53.tab'

In [None]:
## call functions
get_data(data_url, datafile)
curated_info = curate_data(datafile)

curated_file = 'curated_'+datafile
list2file(curated_info, curated_file)

#### Create local query database

*Hint: import sqlite3*
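
One way to start, sketched under stated assumptions: the table name `proteins` and its three columns are choices made for illustration, and the rows are made-up placeholders in the shape produced by the curation step.

```python
import sqlite3

# Made-up rows in the shape returned by the curation step; not real data
curated_rows = [
    ("entry_A", "gene_A", "organism_A"),
    ("entry_B", "gene_B", "organism_B"),
]

# ":memory:" keeps the database in RAM; a file name would store it on disk
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE proteins (entry_name TEXT, gene_name TEXT, organism TEXT)"
)
# "?" placeholders let sqlite3 insert the values safely
connection.executemany("INSERT INTO proteins VALUES (?, ?, ?)", curated_rows)
connection.commit()

# Count the stored rows to confirm that the insert worked
row_count = connection.execute("SELECT COUNT(*) FROM proteins").fetchone()[0]
print(row_count)
connection.close()
```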

#### Query local database
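
A minimal, self-contained sketch of a query, assuming a table named `proteins` with the columns `entry_name`, `gene_name` and `organism`. The table layout, the helper name `find_by_organism` and the rows are all made up for illustration.

```python
import sqlite3

def find_by_organism(connection, organism):
    """Return (entry_name, gene_name) pairs for one organism (hypothetical helper)."""
    cursor = connection.execute(
        "SELECT entry_name, gene_name FROM proteins "
        "WHERE organism = ? ORDER BY entry_name",
        (organism,),  # the value is passed separately from the SQL text
    )
    return cursor.fetchall()

# Small in-memory database with made-up rows, so the sketch runs on its own
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE proteins (entry_name TEXT, gene_name TEXT, organism TEXT)"
)
connection.executemany(
    "INSERT INTO proteins VALUES (?, ?, ?)",
    [("entry_A", "gene_A", "organism_A"), ("entry_B", "gene_B", "organism_B")],
)

matches = find_by_organism(connection, "organism_A")
print(matches)
connection.close()
```

Passing the search value through a `?` placeholder, rather than pasting it into the SQL string, avoids quoting problems when a name contains apostrophes or other special characters.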