# Computational Skills for Biocuration

## Programming Skills with Python

### Accessing Web Services with Python

The vast majority of useful biological data can be found somewhere online. Resources such as **UniProt**, **Ensembl** and the **NCBI** have become invaluable to research.

## API vs Web Service vs REST Interface

There tends to be some confusion between REST interfaces and Web APIs. The following definitions should help clarify the differences:

* **API**: An **Application Programming Interface** is a way to interact with a resource, database, or application programmatically (as opposed to through a graphical interface).
* **Web Service**: An **API** which receives requests and returns results over the web, typically using HTTP.
* **REST API**: A **REpresentational State Transfer API** is a **Web API** which adheres to the **REST** guidelines: a set of guidelines which specify how the **Web API** receives queries and returns results.

Many resources which call themselves RESTful are not 100% compliant, and many non-REST Web APIs implement some RESTful guidelines. Some people refer to these as REST-like interfaces. In this Notebook we will not make a large distinction between these, and will refer to them broadly as **Web Services**.

## The UniProt Web Service

UniProt, the Universal Protein Resource, is one of the most comprehensive protein resources in the world. It is queried thousands of times per day for small and large scale data requests. For a programmer, the **Web Service** is the primary entry point for fetching information about different proteins.

As an example, let's say we are interested in obtaining the sequence for the human protein PLK4 (UniProt ID: O00444). We could find the sequence by opening the protein's UniProt entry:

* https://www.uniprot.org/uniprot/O00444  

and scrolling down to the "sequences" section. You could imagine doing this for 1, 2, maybe even 5 or 10 different proteins, but any more than that would get quite tedious. However, UniProt provides a way to obtain the sequence directly, without having to scroll through the protein's entire entry. The FASTA formatted sequence is available by appending ".fasta" to the end of the URL:

* https://www.uniprot.org/uniprot/O00444.fasta

Other file types and formats are also available: XML, GFF, and Text. These differ both in how the data is laid out and in which types of information they contain.
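
Since each format is just a different ending on the same address, the URLs can be put together with a small helper. The following is a minimal sketch: `entry_url` is a hypothetical helper name, and the `.xml` and `.gff` extensions are assumed to follow the same pattern as the `.fasta` extension shown above.

```python
# Base address of a UniProt entry, as used in the examples above.
BASE_URL = "https://www.uniprot.org/uniprot/"


def entry_url(accession, file_format=None):
    """Build the URL of a UniProt entry, optionally in a given file format."""
    url = BASE_URL + accession
    if file_format is not None:
        # The format is selected by appending it as a file extension.
        url += "." + file_format
    return url


# The ".xml" and ".gff" extensions are assumptions (see the text above).
for file_format in [None, "fasta", "txt", "xml", "gff"]:
    print(entry_url("O00444", file_format))
```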



### Exercise

Have a look at the "text" file information for a single UniProt entry, by appending ".txt" to the URL. For example:

* https://www.uniprot.org/uniprot/O00444.txt

**Questions**:

* Name 3 different types of information you can identify in this file


## Python Requests

The Python **requests** library contains all of the functionality needed to talk to **Web services**. The most useful function is `requests.get()`, which is responsible for fetching the content from a **Web service**. Its only required argument is the URL you wish to fetch.

For example, to fetch the FASTA sequence of a protein from UniProt, we would use the following code:

In [None]:
# Demo time!
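
A minimal sketch of such a request is shown below. The helper name `fetch_fasta` and its `get` parameter are our own inventions: `get` defaults to `requests.get`, and only exists so that the function can also be tried out with a stand-in when no network connection is available.

```python
import requests

# URL pattern for FASTA files, taken from the example above.
FASTA_URL = "https://www.uniprot.org/uniprot/{accession}.fasta"


def fetch_fasta(accession, get=requests.get):
    """Return the FASTA formatted text for a single UniProt entry."""
    response = get(FASTA_URL.format(accession=accession))
    # response.text holds the body of the response as a string.
    return response.text
```

Calling `print(fetch_fasta("O00444"))` should print the same FASTA text that a browser shows for the ".fasta" URL of PLK4.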

Most entries in **UniProt** have 2 unique identifiers: a 6 character accession (ID), and a longer, more human readable entry name. For example, the protein *TP53* has the accession "P04637" and the entry name "P53_HUMAN". If you use the latter to access the UniProt entry for *TP53*, you will automatically be redirected to the entry for *P04637*.
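
By default **requests** follows such redirects for us, and the response object remembers where it ended up. The sketch below assumes the redirect behaves as described above; `final_url` and its `get` parameter are hypothetical names used for illustration.

```python
import requests


def final_url(identifier, get=requests.get):
    """Return the address that finally answered a request for a UniProt entry."""
    response = get(f"https://www.uniprot.org/uniprot/{identifier}")
    # response.history lists the redirects that were followed on the way;
    # response.url is the address of the final response.
    for redirect in response.history:
        print("Redirected from:", redirect.url)
    return response.url
```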

**Note:** Web services, like most of the web, communicate using the HyperText Transfer Protocol (HTTP, or its "secure" sibling HTTPS). The **requests** library is a Python library which is able to "speak" and "read" HTTP. In the same way that there are libraries to read and write different file types (`open()` for plain text, BioPython's **SeqIO** for various biological data formats, etc.), **requests** is a library which deals exclusively with HTTP and HTTPS formatted data.

### Exercise

Write a small script that fetches and prints the FASTA sequences for a collection of 5 proteins:

* P53_MOUSE, ATM_MOUSE, MDM2_MOUSE, CDN1A_MOUSE and CBP_MOUSE.


In [None]:
# Write your solution here

### Limiting the number of requests per minute

Many web servers limit the total number of requests you can make per minute. Since code runs incredibly fast, it would theoretically be possible to send hundreds or thousands of requests to a **Web service** in a very short time. To make sure resources don't run out (and other people can also access the server), many **Web services** ask you to limit the number of requests per minute. It is always important to read the documentation of any **Web service** to see if it has any strict limits; otherwise you may find yourself (or the entire institute!) temporarily banned from the service.

The most straightforward way to do this is to use the `sleep()` function from the **time** module, which causes the script to "sleep" for a certain amount of time. Try running the example below:

In [None]:
# Demo time!
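
As a minimal sketch of `sleep()` in action, the hypothetical `paced` helper below hands out the items of a list one at a time and pauses between them. The function names and the 2 second pause are chosen purely for illustration.

```python
from time import sleep


def paced(items, delay):
    """Yield each item, pausing `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay)  # nothing to wait for before the very first item
        yield item


def demo(delay=2):
    """Print three protein IDs with a pause between each of them."""
    for protein_id in paced(["CDK2_DROME", "CCNA_DROME", "MCM5_DROME"], delay):
        print("Now it would be polite to request", protein_id)
```

Running `demo()` prints the first line immediately, and each of the following lines after a further 2 second pause.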

### Exercise

Which of the following examples (one or more) would limit the rate to at most 1 query per 10 seconds?

**Solution 1**

```python
import requests
from time import sleep

protein_ids = ["CDK2_DROME", "CCNA_DROME", "MCM5_DROME"]

for protein_id in protein_ids:
    sleep(10)
    response = requests.get(f"https://www.uniprot.org/uniprot/{protein_id}.fasta")
    print(response.text)
```

**Solution 2**

```python

import requests
from time import sleep

protein_ids = ["CDK2_DROME", "CCNA_DROME", "MCM5_DROME"]

for protein_id in protein_ids:
    response = requests.get(f"https://www.uniprot.org/uniprot/{protein_id}.fasta")
    print(response.text)
    sleep(10)
```

**Solution 3**

```python

import requests
from time import sleep

protein_ids = ["CDK2_DROME", "CCNA_DROME", "MCM5_DROME"]

for protein_id in protein_ids:
    response = requests.get(f"https://www.uniprot.org/uniprot/{protein_id}.fasta")
    print(response.text)
sleep(10)
```

### HTTP Errors

Sometimes things go wrong when we try to do things online, and **Web services** are no different. Pretty much everyone will have bumped into a "Page Not Found" error while browsing the net. Some may even know this as a "404 Page Not Found" error. This **404** is actually very useful: it's the HTTP status code which a web server returns to us when it cannot find the page that we are asking for.

The following are some common status codes which can be returned by an HTTP request:

* **200** OK, everything went well
* **404** ERROR: Page not found
* **429** ERROR: Too many requests (not always implemented)
* **401** ERROR: Unauthorized
* **403** ERROR: Forbidden
* **503** ERROR: Service (temporarily) unavailable
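
These codes are part of the HTTP standard, so their official names do not need to be memorised. As a small sketch, Python's built-in **http** module can look them up; the helper name `describe` is our own.

```python
from http import HTTPStatus


def describe(status_code):
    """Return a status code together with its standard HTTP name."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"


# Look up the codes from the list above.
for code in [200, 401, 403, 404, 429, 503]:
    print(describe(code))
```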

It is very useful to check the HTTP status code of the response, to make sure we did not encounter any errors. Often checking for a **200** status code is enough, or you can use the shortcut `.ok`, which is true for any status code below 400.

In [None]:
# Demo time!
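
A minimal sketch of such a check follows. `check_url` is a hypothetical helper, and its `get` parameter (defaulting to `requests.get`) is only there so that the function can be exercised without a network connection.

```python
import requests


def check_url(url, get=requests.get):
    """Fetch a URL and report whether the request succeeded."""
    response = get(url)
    if response.ok:  # True for any status code below 400
        return f"Success (HTTP {response.status_code})"
    return f"Something went wrong (HTTP {response.status_code})"
```

For example, `check_url("https://www.uniprot.org/uniprot/O00444.fasta")` would be expected to report success, while the address of an entry that does not exist would typically report a 404.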

### Exercise

Let's wrap up all of the things we have just talked about. Write a script that checks whether the FASTA sequence for each of a series of proteins exists (by checking for HTTP errors). Make sure the script:

* Waits 1 second between each request
* Tells us the sequence exists (if there are no errors)
* Warns the user if there was an error, and tells them which type of error (error code) was encountered

Test your script on the following list of protein IDs:

* CDK2_MOUSE, CDK2_RAT, CCNA_HUMAN and MCM5_DROME


In [None]:
# Try your solution here!

## JSON

**Web Services** can return data in a variety of different formats, such as XML, GFF, CSV and even plain text. One of the more popular ones is **JSON** (JavaScript Object Notation). 

Here is an example of a JSON formatted (mini) dataset:
    
```python
json_data = '[{"species": "Human", "protein": "P53_HUMAN"},{"species": "Zebrafish","protein": "P53_DANRE"}]'

```

It's (almost) exactly like a string representation of Python **lists** and **dictionaries**.

* The Python **json** library deals with converting **JSON** data into Python objects

In [None]:
# Demo time!
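
As a sketch of what such a conversion looks like, `json.loads()` turns a **JSON** string into Python lists and dictionaries, and `json.dumps()` goes the other way. The `reviewed` and `gene` fields in the second snippet are made up to show the small differences between the two notations.

```python
import json

json_data = '[{"species": "Human", "protein": "P53_HUMAN"},{"species": "Zebrafish","protein": "P53_DANRE"}]'

# Convert the JSON text into Python objects: a list of dictionaries.
records = json.loads(json_data)
for record in records:
    print(record["species"], "->", record["protein"])

# JSON's true/false/null become Python's True/False/None.
# (This snippet is a made-up illustration, not real UniProt data.)
flags = json.loads('{"reviewed": true, "gene": null}')
print(flags)

# json.dumps() converts Python objects back into JSON text.
print(json.dumps(records, indent=2))
```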

The **EBI Proteins API** is a programmatic interface to download data from UniProt and many other databases. You can find more information in the [EBI Proteins API documentation](https://www.ebi.ac.uk/proteins/api/doc/index.html).

The "proteins" endpoint is where we can retrieve information for a single UniProt entry, via its UniProt ID. For example:

* https://www.ebi.ac.uk/proteins/api/proteins/Q64702

**Note:** Some browsers will re-format the **JSON** object to look prettier.

In [None]:
# Demo time!
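
The sketch below fetches one entry from the "proteins" endpoint and converts the **JSON** response into a Python dictionary. The helper name `fetch_entry`, its `get` parameter and the explicit `Accept` header are our own choices rather than requirements of the API.

```python
import requests

# Address of the "proteins" endpoint, as shown above.
PROTEINS_URL = "https://www.ebi.ac.uk/proteins/api/proteins/{accession}"


def fetch_entry(accession, get=requests.get):
    """Return one UniProt entry from the EBI Proteins API as a dictionary."""
    response = get(
        PROTEINS_URL.format(accession=accession),
        # Ask explicitly for JSON, in case the service offers other formats.
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()  # raise an exception on HTTP errors
    return response.json()  # parse the JSON body into Python objects
```

After `entry = fetch_entry("Q64702")`, printing `sorted(entry)` lists the top-level keys of the entry, which is a good starting point for exploring what the record contains.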

### Exercise

Use the **EBI Proteins API** to find the "Recommended Names" (i.e., common names) for this list of UniProt IDs:

* P30291, Q9VAC8, O60566

**Optional:** Include a 5 second pause between each request, and also implement HTTP response checking.

In [16]:
# Try your solution here!

## Additional reading

Python's **requests** library does much more than what we have shown here. It handles a variety of different types of requests, can upload and download entire files, and can handle user authentication for **Web services** which require a login. If you would like to learn more, the following resources provide a nice overview of the **requests** library:

* https://www.pythonforbeginners.com/requests/the-awesome-requests-module
* http://zetcode.com/web/pythonrequests/
* http://docs.python-requests.org/en/master/


**Web services**, and particularly **REST APIs**, can do much more than only fetching data: they can often also be used to add new entries, modify existing ones, or delete data. However, in practice, many bio-data **Web services** are designed with fetching results as the main (or only) feature. For a (technical) overview of REST:

* https://www.restapitutorial.com/