# Exercise 01 - Finding Effector Sequences <img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">

## Introduction

## Running cells in this notebook

This is an interactive notebook, which means you are able to run the code that is written in each of the cells.

To run the code in a cell, you should:

1. Place your mouse cursor in the cell, and click (this gives the cell *focus*) to make it active
2. Hold down the `Shift` key, and press the `Return` key.

If this is successful, you should see the input marker to the left of the cell change from

```
In [ ]:
```

to (for example)

```
In [1]:
```

and you may see output appear below the cell.

### Related online documentation

**`avrPto1` UniProt entries**
* [974|PHI:974|avrPto1|AAA25728](http://www.uniprot.org/uniprot/Q08242)
* [975|PHI:975|avrPto1|AAO57459](http://www.uniprot.org/uniprot/Q87Y16)
* [976|PHI:976|avrPto1|AAY39946](http://www.uniprot.org/uniprot/Q4ZLM6)

### Requirements

<div class="alert alert-success">
To complete this exercise, you will need:
<ul>
<li>an active internet connection</li>
<li>the <b>Biopython</b> libraries</li>
<li>a local installation of <a href="http://www.clustal.org/omega/">Clustal Omega (<code>clustalo</code>)</a></li>
<li>a local installation of <a href="http://hmmer.org/">HMMer</a></li>
</ul>
</div>

## PHI-Base

### Downloading PHI-Base

We will extract a subset of effector sequences from the set of sequences in PHI-Base. To do this, we will download all the sequences contained in PHI-Base, in FASTA format. This can be done using your web browser.

* Navigate to [http://www.phi-base.org/](http://www.phi-base.org/).
* Click on the `Download` [link](http://www.phi-base.org/download.php) in the top link bar on the main page.
* Enter your personal details in the form to register, and click on the `Continue` button.
* Accept the terms and conditions, and click on the `Submit` button.
* Click on the `FASTA format` link.

This will download a file called `PHI_accessions` to a location on your machine. Move this file to the `exercises/data` subdirectory, renaming it to make the database version number explicit and to give it the `.fas` file extension. You can do this with your file manager, or at the terminal with the `mv` command:

```
mv ~/Downloads/PHI_accessions ./data/PHI_accessions_3_8.fas
```

From this point, we will assume that the location of this dataset is `./data/PHI_accessions_3_8.fas`.

<div class="alert alert-warning">
<b>Note: version numbers</b>: it is important that you record and report the version numbers of databases and software used in your analyses - this enables reproducibility for yourself and others.
</div>
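
One way to make this record-keeping routine is to collect the version strings in a single cell and save its output alongside your results. The sketch below is only an illustration: the helper names `first_matching_line` and `collect_versions` are hypothetical, and it assumes that `clustalo --version` prints the version number and that the help banner printed by `hmmbuild -h` includes a line containing `HMMER`. Any tool that is not installed is reported as `None`. The PHI-Base version still has to be noted by hand (here it is encoded in the filename).

```python
import platform
import shutil
import subprocess


def first_matching_line(command, keyword=""):
    """Run a command and return its first non-empty output line containing keyword.

    Returns None if the tool is not installed or no line matches.
    """
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    for line in (result.stdout + result.stderr).splitlines():
        if line.strip() and keyword in line:
            return line.strip()
    return None


def collect_versions():
    """Gather version strings for the software used in this exercise."""
    versions = {"python": platform.python_version()}
    try:
        import Bio
        versions["biopython"] = Bio.__version__
    except ImportError:
        versions["biopython"] = None
    # Assumption: clustalo reports its version with --version
    versions["clustalo"] = first_matching_line(["clustalo", "--version"])
    # Assumption: the hmmbuild help banner has a line containing "HMMER"
    versions["hmmbuild"] = first_matching_line(["hmmbuild", "-h"], keyword="HMMER")
    return versions


versions = collect_versions()
for name, version in versions.items():
    print(f"{name}: {version}")
```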

## Extracting effector sequences from the PHI-Base download

We will use `Biopython` to help us extract all sequences from the PHI-Base file that refer to the *Pseudomonas syringae* effector `avrPto1`. This will give us a starting set of sequences representative of early identification of a putative effector in a real project.


<div class="alert alert-warning">
<b>Note:</b> there is nothing particularly special about *P. syringae*'s `avrPto1` - I've chosen it almost randomly as an example to illustrate the principles in this worksheet. Also, while PHI-Base is a very useful resource for many reasons, it has been chosen here to provide an illustrative example of non-exhaustive sequence information - to emulate the incomplete information you might have in a sequencing project for a less well-studied organism.
</div>

To load and extract the `avrPto1` sequences, we use `Biopython`'s `SeqIO` module, and filter sequences on the presence of the string `avrPto1` in the description. To filter the sequences, run the cell below:

In [None]:
from Bio import SeqIO

# We create an empty list to hold avrPto1 sequences that we find
avrpto = []

# We read every sequence in turn from the input FASTA file, and
# if the description contains "avrPto1", we put that sequence in
# the avrpto list
for seqrecord in SeqIO.parse('data/PHI_accessions_3_8.fas', 'fasta'):
    if "avrPto1" in seqrecord.description:
        avrpto.append(seqrecord)
        
# Show the sequences that we've collected in avrpto
for seqrecord in avrpto:
    print(seqrecord)

After running the cell above, you should see that we have found three sequences:

* 974|PHI:974|avrPto1|AAA25728
* 975|PHI:975|avrPto1|AAO57459
* 976|PHI:976|avrPto1|AAY39946
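
The substring test used above would also match any gene whose name merely contains `avrPto1` (for example, a hypothetical `avrPto10`). If that matters for your own data, a stricter filter can split the header on `|` and compare the gene field exactly. The sketch below assumes that every header has the four-field layout seen in the three records above; the field names and the helper names `parse_phi_header` and `matches_gene` are made up for illustration and are not part of PHI-Base or `Biopython`.

```python
def parse_phi_header(description):
    """Split a PHI-Base style FASTA description into named fields.

    Assumes the four-field layout seen above (record|PHI id|gene|accession);
    returns None if the header has fewer than four fields.
    """
    fields = description.split("|")
    if len(fields) < 4:
        return None
    return {
        "record": fields[0],
        "phi_id": fields[1],
        "gene": fields[2],
        "accession": fields[3],
    }


def matches_gene(description, gene_name):
    """Return True only if the gene field is exactly gene_name."""
    parsed = parse_phi_header(description)
    return parsed is not None and parsed["gene"] == gene_name


headers = [
    "974|PHI:974|avrPto1|AAA25728",
    "975|PHI:975|avrPto1|AAO57459",
    "976|PHI:976|avrPto1|AAY39946",
]
for header in headers:
    print(parse_phi_header(header)["accession"], matches_gene(header, "avrPto1"))
```

In the filtering cell, the test `"avrPto1" in seqrecord.description` could then be swapped for `matches_gene(seqrecord.description, "avrPto1")`.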

We will write these to the file `data/avrpto1.fas` in FASTA format, using `Biopython`, by running the cell below:

In [None]:
# Write the contents of the list avrpto to the file data/avrpto1.fas in FASTA format
SeqIO.write(avrpto, "data/avrpto1.fas", "fasta")

On running the cell, you should see the number `3` below the cell (this is the number of sequences that were written), and you can check for the presence of the file `avrpto1.fas` in the `data` subdirectory.

## Finding `avrPto1` sequences in other genomes

We will now use the three `avrPto1` sequences you extracted above as the *seed* or *training* set of sequences for finding new examples of these effectors in the *Pseudomonas* data you downloaded earlier.

It would be possible to search with each sequence individually, and attempt to match the results obtained from each individual search, but instead we will perform a *profile search*, by building a hidden Markov model from the seed sequences, and searching with the `HMMer` search tool (instead of `BLAST`).

<div class="alert alert-warning">
<b>Note:</b> `HMMer` is one of several tools (another is `PSI-BLAST`) that build a *profile* of similar sequences, and search on the basis of an aggregated, statistical representation of the sequence set. This places more weight on features shared by the input sequences, and less weight on features that the sequences do not have in common. That can result in more sensitive sequence searches, with fewer *false positives*, but care must be taken when compiling the sequence set used to build the initial sequence profile.
</div>

### Aligning the `avrPto1` sequences

To build a sequence profile, we must first align the *seed* set of sequences, so that the equivalent positions in each input sequence line up in the profile. We will use the command-line tool [Clustal Omega (`clustalo`)](http://www.clustal.org/omega/) to do this, by running the command below at the terminal:

```
clustalo -i data/avrpto1.fas -o data/avrpto1_aln.fas
```

This will create the new file `data/avrpto1_aln.fas` containing an alignment of the `avrPto1` sequences.

<div class="alert alert-danger" role="alert">
If Clustal Omega (<code>clustalo</code>) is not installed on your machine, you can perform the alignment online at <a href="http://www.ebi.ac.uk/Tools/msa/clustalo/">http://www.ebi.ac.uk/Tools/msa/clustalo/</a> by copying and pasting your sequences into the entry field, and downloading the alignment in FASTA format.
</div>
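
Whichever route you use, it can be worth checking that the result really is an alignment: in an aligned FASTA file, every sequence (including its `-` gap characters) has the same length. The sketch below uses only the Python standard library; the helper names `read_fasta` and `is_aligned` are hypothetical, and the toy sequences are invented purely to exercise the code.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_aligned(records):
    """Return True if there is at least one sequence and all have equal length."""
    lengths = {len(sequence) for _, sequence in records}
    return len(lengths) == 1


# Toy data (not real avrPto1 sequences), wrapped over two lines each
toy_alignment = io.StringIO(">seq1\nMGNI\nCVG\n>seq2\nMG-I\nCVG\n>seq3\nMGNV\nC-G\n")
toy_records = list(read_fasta(toy_alignment))
print(len(toy_records), is_aligned(toy_records))
```

To check your own alignment, open `data/avrpto1_aln.fas` with `open()` and pass the handle to `read_fasta` in place of the toy data.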

### Building the `HMMer` profile

To build the sequence profile from our alignment, we use the `hmmbuild` program from the `HMMer` package. This takes a protein sequence alignment, and converts it to a profile hidden Markov model (a sequence profile) that accounts for the frequency of each amino acid at each position, and also for how likely insertions and deletions are at each position. This is a statistically sophisticated representation of our alignment, and very effective at describing the composition of a large sequence alignment.

To build this statistical model, we run the `hmmbuild` command below, at the terminal:

```
hmmbuild --amino data/avrpto1.hmm data/avrpto1_aln.fas
```

This produces the file `data/avrpto1.hmm`, which describes our input sequence set statistically, and will be used to search against the *Pseudomonas* proteins you downloaded earlier.

<div class="alert alert-warning">
<b>Note:</b> You can inspect the model that `HMMer` builds directly, by looking at the contents of the `data/avrpto1.hmm` file, for example by using the command

```
less data/avrpto1.hmm
```
</div>
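
To get a feel for the "frequency of each amino acid at each position" idea behind a sequence profile, the sketch below tabulates residue frequencies for each column of a small alignment. It is a deliberately simplified illustration and not what `hmmbuild` actually computes: a real profile HMM also involves steps such as sequence weighting, prior information, and explicit insertion and deletion states. The function name `column_frequencies` and the toy alignment are invented for this example.

```python
from collections import Counter


def column_frequencies(aligned_sequences):
    """Return one dict per alignment column mapping residue -> relative frequency.

    Gap characters ('-') are ignored; an all-gap column gives an empty dict.
    """
    profile = []
    for column in zip(*aligned_sequences):
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values())
        if total:
            profile.append({res: n / total for res, n in counts.items()})
        else:
            profile.append({})
    return profile


# Toy alignment (not real avrPto1 sequences): three sequences, five columns
toy_alignment = ["MGNIC", "MGNVC", "MG-IC"]
profile = column_frequencies(toy_alignment)
for position, frequencies in enumerate(profile, start=1):
    print(position, frequencies)
```

Columns where every sequence agrees give a single residue with frequency 1.0, while variable columns (the fourth column here) spread the weight over several residues. This is the sense in which a profile emphasises features shared across the seed sequences.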

### Searching for new `avrPto1` sequences

We can now use `HMMer`'s `hmmsearch` tool to query the `data/avrpto1.hmm` file against the *Pseudomonas* protein sequences that you downloaded earlier. This is similar to running a `BLAST` search at the command-line, and can be run against the `BLAST` database of *Pseudomonas* proteins you have already built.

<div class="alert alert-danger" role="alert">
If you have not already built the *Pseudomonas* protein database, you should do this now by running the terminal command: 
</div>