<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">
# 03a - Building a Reproducible Document

## Table of Contents

1. [Python Imports/Startup](#python_imports)
2. [Biological Motivation](#motivation)
3. [Load Sequence](#load_sequence)
4. [Build `BLAST` Database](#build_blast)
5. [Run `BLAST` Query](#blast_query)
6. [Load `BLAST` Results](#blast_results)
7. [Query `UniProt`](#uniprot)
8. [Query `KEGG`](#kegg)

<a id="python_imports"></a>
## 1. Python Imports/Startup

<p></p><div class="alert-success">
<b>It can be very convenient to have all the `Python` library imports at the top of the notebook.</b>
</div>

This is especially helpful when running the whole notebook with, e.g., `Cell -> Run All` or `Kernel -> Restart & Run All` from the menu bar, because all the libraries are then available throughout the document.

In [None]:
# The line below allows the notebooks to show graphics inline
%pylab inline

import io                            # This lets us handle streaming data
import os                            # This lets us communicate with the operating system

import pandas as pd                  # This lets us use dataframes
import seaborn as sns                # This lets us draw pretty graphics

# Biopython is a widely-used library for bioinformatics
# tasks, and for integrating with other software
from Bio import SeqIO                # This lets us handle sequence data
from Bio.KEGG import REST            # This lets us connect to the KEGG databases

# The bioservices library allows connections to common
# online bioinformatics resources
from bioservices import UniProt      # This lets us connect to the UniProt databases

from IPython.display import Image    # This lets us display images (.png etc) from code

<p></p><div class="alert-success">
<b>It can be useful here to create any output directories that will be used throughout the document.</b>
</div>

The `os.makedirs()` function creates a new directory (including any missing parent directories), and the `exist_ok=True` option stops it from raising an error if the directory already exists.

In [None]:
# Create a new directory for notebook output
OUTDIR = os.path.join("data", "reproducible", "output")
os.makedirs(OUTDIR, exist_ok=True)

<p></p><div class="alert-success">
<b>It can be useful here to create helper functions that will be used throughout the document.</b>
</div>

The `to_df()` function turns tab-separated text into a `pandas` dataframe.

In [None]:
# A small function to return a Pandas dataframe, given tabular text
def to_df(result):
    return pd.read_table(io.StringIO(result), header=None)
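
As a quick check of the helper, the sketch below repeats `to_df()` and applies it to a small tab-separated string. The string `example_text` and its values are invented for illustration and do not come from the analysis.

```python
import io

import pandas as pd


def to_df(result):
    """Parse tab-separated text into a dataframe with integer column labels."""
    return pd.read_table(io.StringIO(result), header=None)


# Hypothetical two-row, three-column table, invented for illustration
example_text = "geneA\t97.5\t210\ngeneB\t88.1\t198\n"
example_df = to_df(example_text)
print(example_df)
```

Because `header=None` is set, the columns are labelled `0`, `1`, `2` rather than taking their names from the first row of the text.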

<a id="motivation"></a>
## 2. Biological Motivation

<p></p><div class="alert-info">
<b>We are working on a project to improve bacterial throughput for biosynthesis, and have been provided with a nucleotide sequence of a gene of interest.
<br></br><br></br>
This gene is overrepresented in populations of bacteria that appear to be associated with enhanced metabolic function relevant to a biosynthetic output (lipid conversion to ethanol).
<br></br><br></br>
We want to find out more about the annotated function and literature associated with this gene, which appears to derive from *Proteus mirabilis*.</b>
</div>

Our plan is to:

1. identify a homologue in a reference isolate of *P. mirabilis*
2. obtain the protein sequence/identifier for the homologue
3. get information about the molecular function of this protein from `UniProt`
4. get information about the metabolic function of this protein from `KEGG`
5. visualise some of the information about this gene/protein

<a id="load_sequence"></a>
## 3. Load Sequence

<p></p><div class="alert-success">
<b>We first load the sequence from a local `FASTA` file, using the `Biopython` `SeqIO` module.</b>
</div>
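
A minimal sketch of this step is shown below. So that it runs on its own, it first writes a tiny made-up `FASTA` file to a temporary directory; the file name `example_gene.fasta` and the sequence are assumptions for illustration, and in the notebook the path would point at the real input file instead.

```python
import os
import tempfile

from Bio import SeqIO

# Write a tiny made-up FASTA file so that the example runs on its own;
# in the notebook this would be the real input file instead
fasta_text = ">example_gene hypothetical sequence\nATGGCTAGCTAGGCTTAA\n"
tmpdir = tempfile.mkdtemp()
seqfile = os.path.join(tmpdir, "example_gene.fasta")
with open(seqfile, "w") as handle:
    handle.write(fasta_text)

# SeqIO.read() expects exactly one record in the file
record = SeqIO.read(seqfile, "fasta")
print(record.id, len(record.seq))
```

`SeqIO.read()` raises an error if the file holds zero or several records, which is a useful safeguard when a single gene is expected; `SeqIO.parse()` is the alternative for multi-record files.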

<a id="build_blast"></a>
## 4. Build `BLAST` Database

<p></p><div class="alert-success">
<b>We now build a local `BLAST` database from the *P. mirabilis* reference proteins.</b>
</div>
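
One way to do this from Python is to call the BLAST+ `makeblastdb` program through `subprocess`, as sketched below. The file names (`reference_proteins.faa`, `reference_proteins`) are hypothetical placeholders, and the helper functions are illustrative rather than part of the original notebook.

```python
import os
import subprocess


def makeblastdb_command(protein_fasta, db_path, title):
    """Return the makeblastdb argument list for a protein database."""
    return [
        "makeblastdb",
        "-in", protein_fasta,
        "-dbtype", "prot",
        "-title", title,
        "-out", db_path,
    ]


def build_database(protein_fasta, db_path, title):
    """Run makeblastdb; requires BLAST+ on the PATH and the input file."""
    subprocess.run(makeblastdb_command(protein_fasta, db_path, title), check=True)
    return db_path


# Hypothetical file locations; substitute the real reference protein file
OUTDIR = os.path.join("data", "reproducible", "output")
cmd = makeblastdb_command(
    os.path.join("data", "reproducible", "reference_proteins.faa"),
    os.path.join(OUTDIR, "reference_proteins"),
    "reference_proteins",
)
print(" ".join(cmd))
```

Printing the command before running it leaves a record in the notebook of exactly how the database was built; calling `build_database()` with the same arguments then executes it.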

<a id="blast_query"></a>
## 5. Run `BLAST` Query

<p></p><div class="alert-success">
<b>We now query the wildtype sequence against our custom `BLAST` database from the *P. mirabilis* reference proteins.</b>
</div>
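
Since the query is a nucleotide sequence and the database holds proteins, `blastx` is the appropriate program. The sketch below builds and (optionally) runs such a command with `subprocess`; the paths are hypothetical placeholders, and the choice of tabular output (`-outfmt 6`) is an assumption made so that the results can be read as a table.

```python
import os
import subprocess


def blastx_command(query_fasta, db_path, out_path):
    """Return the blastx argument list, requesting tabular output."""
    return [
        "blastx",
        "-query", query_fasta,
        "-db", db_path,
        "-outfmt", "6",
        "-out", out_path,
    ]


def run_blastx(query_fasta, db_path, out_path):
    """Run blastx; requires BLAST+ on the PATH and an existing database."""
    subprocess.run(blastx_command(query_fasta, db_path, out_path), check=True)
    return out_path


# Hypothetical file locations, for illustration only
OUTDIR = os.path.join("data", "reproducible", "output")
cmd = blastx_command(
    os.path.join("data", "reproducible", "query_gene.fasta"),
    os.path.join(OUTDIR, "reference_proteins"),
    os.path.join(OUTDIR, "query_vs_reference.tab"),
)
print(" ".join(cmd))
```

With `check=True`, `run_blastx()` raises `subprocess.CalledProcessError` if `blastx` exits with an error, so a failed search stops the notebook rather than passing silently.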

<a id="blast_results"></a>
## 6. Load `BLAST` Results

<p></p><div class="alert-success">
<b>We now load the `BLASTX` results for inspection and visualisation, using `pandas`.</b>
</div>
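
The sketch below shows one way to load such results, assuming they were written in BLAST's default tabular format (`-outfmt 6`), which has twelve standard columns and no header line. The helper names `load_blast_table()` and `best_hit()` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Column names for BLAST's default tabular output (-outfmt 6)
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def load_blast_table(source):
    """Read tabular BLAST output from a path or file-like object."""
    return pd.read_csv(source, sep="\t", header=None, names=BLAST_COLUMNS)


def best_hit(results):
    """Return the row with the highest bit score."""
    return results.sort_values("bitscore", ascending=False).iloc[0]
```

Given a results file, `results = load_blast_table(path)` followed by `best_hit(results)["sseqid"]` gives the identifier of the top-scoring subject sequence, which is the natural starting point for the database queries that follow.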

<a id="uniprot"></a>
## 7. Query `UniProt`

<p></p><div class="alert-success">
<b>We now query the `UniProt` databases for information on our best `BLAST` match.</b>
</div>
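
A possible shape for this query is sketched below. The function `uniprot_table()` is hypothetical: it accepts any object with a `bioservices`-style `search()` method (such as an instance of `bioservices.UniProt`), searches by accession, and parses the tabular reply into a dataframe. The accession passed in would be the identifier of the best match, which is not fixed here.

```python
import io

import pandas as pd


def uniprot_table(service, accession, frmt="tsv"):
    """Search UniProt for one accession and return the hits as a dataframe.

    `service` is any object with a bioservices-style search() method,
    for example an instance of bioservices.UniProt.
    """
    result = service.search("accession:" + accession, frmt=frmt, limit=5)
    # The tabular reply starts with a header line naming the columns
    return pd.read_table(io.StringIO(result))
```

In the notebook this would be called as `uniprot_table(UniProt(), accession)`, which needs network access. Older `bioservices` releases may expect `frmt="tab"` rather than `"tsv"`, so the format is left as a parameter.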

<a id="kegg"></a>
## 8. Query `KEGG`

<p></p><div class="alert-success">
<b>We now query the `KEGG` databases for information on our best `BLAST` match.</b>
</div>
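
As a rough sketch of this step, the code below downloads a `KEGG` entry with `Biopython`'s `REST.kegg_get()` and pulls out the pathways listed in its `PATHWAY` field. The two helper functions are illustrative, and the parser assumes the usual `KEGG` flat-file layout in which the first 12 characters of a line hold the field name.

```python
from Bio.KEGG import REST


def fetch_kegg_entry(entry_id):
    """Download one KEGG entry as flat-file text (needs network access)."""
    return REST.kegg_get(entry_id).read()


def pathways_from_entry(entry_text):
    """Return (pathway_id, name) pairs from the PATHWAY field of an entry."""
    pathways = []
    field = None
    for line in entry_text.splitlines():
        # The first 12 characters hold the field name; continuation
        # lines leave them blank and belong to the previous field
        key = line[:12].strip()
        if key:
            field = key
        value = line[12:].strip()
        if field == "PATHWAY" and value:
            pathway_id, _, name = value.partition(" ")
            pathways.append((pathway_id, name.strip()))
    return pathways
```

A gene entry is requested with an identifier of the form `organism:locus_tag`, so a call would look like `pathways_from_entry(fetch_kegg_entry(gene_id))`, where `gene_id` is the `KEGG` identifier of the best match.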