<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">
# 07 - Programming for `UniProt`

## Table of Contents

1. [Introduction](#introduction)
2. [Python imports](#imports)
3. [Running a remote `UniProt` query](#uniprot)
  1. [Connecting to `UniProt`](#connect)
  2. [Constructing a query](#query)
  3. [Perform the query](#search)
  4. [EXAMPLE: Putting it together](#example01)
4. [Advanced queries](#advanced)
  1. [`key:value` queries](#keyvalue)
  2. [Exercise 01](#ex01)
  3. [Combining queries](#combine)
5. [Processing query results](#processing)

<a id="introduction"></a>
## Introduction

The `UniProt` browser interface is very powerful, but you will have noticed that even the most complex queries can be converted into a single string that describes the search being made of the `UniProt` databases. This string is generated for you, and placed into the search field at the top of the `UniProt` webpage every time you run a search.

Pointing and clicking through a large number of browser-based searches is tedious and time-consuming. By combining the `UniProt` webservice, the search strings you've already seen, and a Python module called `bioservices`, we can compose and run as many searches as we like using computer code, and retrieve the results of those searches.

This notebook presents examples of using `UniProt` programmatically via its webservice. You will control the searches using Python code in the notebook cells.

<div class="alert-success">
<b>There are a number of advantages to this approach:</b>
</div>

* It is easy to set up repeatable searches for many sequences, or collections of sequences
* It is easy to read in the search results and conduct downstream analyses that add value to your search

Submitting a large number of queries through a web form is impractical, because it is tiresome to point and click over and over again; the same queries can be handled programmatically instead. A programmatic approach also gives you the opportunity to set options that help refine your query, and makes it trivial to repeat a query with exactly the same settings every time.

<a id="imports"></a>
## Python imports

```
# Show plots as part of the notebook
%pylab inline

# Standard library packages
import os

# Import bioservices module, to run remote UniProt queries
from bioservices import UniProt
```

```
Populating the interactive namespace from numpy and matplotlib
```


<a id="uniprot"></a>
## Running a remote `UniProt` query

There are three key steps to running a remote `UniProt` query with `bioservices`:

1. Make a link to the `UniProt` webservice
2. Construct a query string
3. Send the query to `UniProt`, and catch the result in a variable

Once the search result is caught, it can be processed in any way you like, written to a file, or ignored.

<a id="connect"></a>
### Connecting to `UniProt`

To open a connection to `UniProt`, you make an *instance* of the `UniProt()` *class* from `bioservices`. This is persistent, so once it is created, you can interact with it over and over again. To make the instance, call `UniProt()` and assign the result to a variable.

```
service = UniProt() # it is good practice to have a meaningful variable name
```

<a id="query"></a>
### Constructing a query

`UniProt` allows for the construction of complex searches by combining *fields*. A full discussion is beyond the scope of this lesson, but you will have seen in [notebook 06](06-uniprot_browser.ipynb) that the searches you constructed by pointing and clicking on the `UniProt` website were converted into text in the search field at the top. 

To describe the format briefly: there is a set of defined *keys*, keywords that indicate the specific type of data you want to search in (such as `host`, `annotation`, or sequence `length`). Each key is combined with a particular *value* you want to search for (such as `mouse`, or `40674`) in a `key:value` pair, separated by a colon, such as `host:mouse` or `ec:3.2.1.23`.

* `UniProt` query fields: [http://www.uniprot.org/help/query-fields](http://www.uniprot.org/help/query-fields)

If you provide a string, instead of a `key:value` pair, `UniProt` will search in all *fields* for your search term.

Programmatically, we construct the query as a *string*, e.g.

```
query = "Q9AJE3"  # this query means we want to look in all fields for Q9AJE3
```
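Because a query is just a string, `key:value` pairs can also be assembled in code. The sketch below assumes a hypothetical helper, `make_query()`, which is not part of `bioservices`; it only joins a key and a value with a colon, using the example pairs mentioned above.

```python
def make_query(key, value):
    """Return a key:value query string (hypothetical helper, not in bioservices)."""
    return "%s:%s" % (key, value)


# Build the two example queries described in the text
host_query = make_query("host", "mouse")
ec_query = make_query("ec", "3.2.1.23")

print(host_query)
print(ec_query)
```

Either string could then be passed to the service in the same way as the plain-text query.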

<a id="search"></a>
### Perform the query

To send the query to `UniProt`, you will use the `.search()` *method* of your active instance of the `UniProt()` *class*. If you have assigned this instance to the variable `service` (as above), then you can run the query with the line:

```
result = service.search(query)  # Run a query and catch the output in result
```

In the line above, the output of the search (i.e. your results) is stored in a variable called `result`. It is good practice to make variable names short and descriptive, as this makes your code easier to read.

<a id="example01"></a>
### EXAMPLE: Putting it together

The code in the cell below combines the example code above: it creates an instance of the `UniProt()` class, uses it to submit a pre-stored query to the `UniProt` service, and catches the result in a variable called `result`. The `print()` call then shows us what the result looks like, as returned by the service.

```
# Make a link to the UniProt webservice
service = UniProt()

# Build a query string
query = "Q9AJE3"

# Send the query to UniProt, and catch the search result in a variable
result = service.search(query)

# Inspect the result
print(result)
```

```
Entry	Entry name	Status	Protein names	Gene names	Organism	Length
Q9AJE3	CYC2_KITGR	reviewed	Terpentetriene synthase (EC 4.2.3.36)	cyc2	Kitasatospora griseola (Streptomyces griseolosporeus)	311
```
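The result shown above is a single string: a header line followed by one line per entry, with columns separated by tab characters. As a minimal sketch, assuming the result keeps this tab-separated layout and these column names (both may differ between versions of `bioservices` and `UniProt`), the standard library `csv` module can split it into one dictionary per entry. The string below is copied from the output above rather than fetched from the service.

```python
import csv
import io

# Output copied from the search above: a header line and one entry,
# with columns separated by tab characters
result = (
    "Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\n"
    "Q9AJE3\tCYC2_KITGR\treviewed\tTerpentetriene synthase (EC 4.2.3.36)\t"
    "cyc2\tKitasatospora griseola (Streptomyces griseolosporeus)\t311\n"
)

# Read each entry into a dictionary keyed by column name
rows = list(csv.DictReader(io.StringIO(result), delimiter="\t"))

for row in rows:
    print(row["Entry"], row["Entry name"], row["Length"])
```

Note that every value, including `Length`, is read as a string and would need converting with `int()` before any arithmetic.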



The `UniProt()` instance defined in the cell above is *persistent*, so you can reuse it to make another query, as in the cell below:

```
# Make a new query string, and run a remote search at UniProt
new_query = "Q01844"
new_result = service.search(new_query)

# Inspect the result
print(new_result)
```

```
Entry	Entry name	Status	Protein names	Gene names	Organism	Length
Q01844	EWS_HUMAN	reviewed	RNA-binding protein EWS (EWS oncogene) (Ewing sarcoma breakpoint region 1 protein)	EWSR1 EWS	Homo sapiens (Human)	656
Q12933	TRAF2_HUMAN	reviewed	TNF receptor-associated factor 2 (EC 2.3.2.27) (E3 ubiquitin-protein ligase TRAF2) (RING-type E3 ubiquitin transferase TRAF2) (Tumor necrosis factor type 2 receptor-associated protein 3)	TRAF2 TRAP3	Homo sapiens (Human)	501
Q13077	TRAF1_HUMAN	reviewed	TNF receptor-associated factor 1 (Epstein-Barr virus-induced protein 6)	TRAF1 EBI6	Homo sapiens (Human)	416
O15162	PLS1_HUMAN	reviewed	Phospholipid scramblase 1 (PL scramblase 1) (Ca(2+)-dependent phospholipid scramblase 1) (Erythrocyte phospholipid scramblase) (MmTRA1b)	PLSCR1	Homo sapiens (Human)	318
Q99873	ANM1_HUMAN	reviewed	Protein arginine N-methyltransferase 1 (EC 2.1.1.319) (Histone-arginine N-methyltransferase PRMT1) (Interferon receptor 1-bound protein 4)	PRMT1 HMT2 HRMT1L2 IR1B4	Homo sapiens (Hu
```

<a id="advanced"></a>
## Advanced queries

The examples above built queries that were simple strings. They did not exploit the `key:value` search structure, or combine search terms. In this section, you will explore some queries that use the `UniProt` query fields, and combine them into powerful, filtering searches.

<a id="keyvalue"></a>
### `key:value` queries

As noted above (and at [http://www.uniprot.org/help/query-fields](http://www.uniprot.org/help/query-fields)) particular values of specific data can be requested by using `key:value` pairs to restrict searches to named *fields* in the `UniProt` database.

As a first example, you will note that the result returned for the query `"Q01844"` has multiple entries. Only one of these is the sequence with `accession` value equal to `"Q01844"`, but the other entries make reference to this sequence somewhere in their database record. If we want to restrict our result only to the particular entry `"Q01844"`, we can specify the field we want to search as `accession`, and build the following query:

```
query = "accession:Q01844"  # specify a search on the accession field
```

Note that we can use the same variable name `query` as earlier. The code below runs the search and shows the output:

```
# Make a new query string, and run a remote search at UniProt
query = "accession:Q01844"
result = service.search(query)

# Inspect the result
print(result)
```

```
Entry	Entry name	Status	Protein names	Gene names	Organism	Length
Q01844	EWS_HUMAN	reviewed	RNA-binding protein EWS (EWS oncogene) (Ewing sarcoma breakpoint region 1 protein)	EWSR1 EWS	Homo sapiens (Human)	656
```
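One quick way to confirm that a query has been narrowed to a single entry is to count the lines that follow the header. The sketch below uses a hypothetical helper, `count_entries()`, and assumes the tab-separated layout shown above, with exactly one header line; the string is copied from the output rather than fetched from the service.

```python
def count_entries(result):
    """Count entries in a result string (non-empty lines after the header)."""
    lines = [line for line in result.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


# Output copied from the accession search above: one header line, one entry
result = (
    "Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\n"
    "Q01844\tEWS_HUMAN\treviewed\tRNA-binding protein EWS (EWS oncogene) "
    "(Ewing sarcoma breakpoint region 1 protein)\tEWSR1 EWS\t"
    "Homo sapiens (Human)\t656\n"
)

print(count_entries(result))
```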



<div class="alert-success">
<b>By using this and other <code>key:value</code> constructions, we can refine our searches to give us only the entries we're interested in.</b>
</div>

<img src="images/exercise.png" style="width: 100px; float: left;">
<a id="ex01"></a>
### Exercise 01 (10min)

Using `key:value` searches, can you find and download sets of entries for proteins that satisfy the following requirements (**HINT** the links to the `UniProt` query fields may be helpful, here):

<br></br>
<div class="alert-danger">
<ul>
<li> Have publications authored by someone with the surname Broadhurst
<li> Have protein length between 900aa and 1000aa
<li> Derive from the taipan snake
<li> Have been found in the eye
</ul>
</div>

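```
# SOLUTION - EXERCISE 01
# One query per requirement; the length range matches the 900aa to 1000aa asked for above
queries = ["author:broadhurst", "length:[900 TO 1000]", "organism:taipan", "tissue:eye"]

# Run each query in turn and show its result
for query in queries:
    print("\n%s" % query)
    result = service.search(query)
    print(result)
```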
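The exercise asks you to download the entries as well as find them. As a minimal sketch, each result string could be written to its own file. The helper `save_result()`, the `.tab` extension, and the filename scheme are all assumptions made for illustration, and a placeholder string stands in for a real search result so that the example runs without contacting `UniProt`.

```python
import os
import re
import tempfile


def save_result(query, result, outdir):
    """Write a result string to a file named after the query (hypothetical helper)."""
    # Replace runs of characters that are awkward in filenames with underscores
    stem = re.sub(r"[^A-Za-z0-9]+", "_", query).strip("_")
    path = os.path.join(outdir, stem + ".tab")
    with open(path, "w") as handle:
        handle.write(result)
    return path


# Placeholder text standing in for a real search result
example_result = "Entry\tLength\n"

# Write to a temporary directory so the example leaves the notebook folder untouched
outdir = tempfile.mkdtemp()
saved = save_result("length:[900 TO 1000]", example_result, outdir)
print(os.path.basename(saved))
```

In the solution loop, the same helper could be called with each `query` and its `result` in place of the placeholder values.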

<a id="processing"></a>
## Processing query results