<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">
# 09 - Programming for `KEGG`

## Table of Contents

1. [Introduction](#introduction)
2. [Python imports](#imports)
3. [Running a remote `KEGG` query](#kegg)
  1. [`kegg_info()`](#kegg_info)
  2. [`kegg_list()`](#kegg_list)
  3. [`kegg_find()`](#kegg_find)
  4. [`kegg_get()`](#kegg_get)
  5. [EXAMPLE: Putting it together](#example01)

<a id="introduction"></a>
## Introduction

The `KEGG` browser interface, while able to integrate searches across comprehensive and quite disparate datasets, does not always present the most convenient interface for extracting that information (such as downloading FASTA sequences for an entry). As with all browser-based interfaces, it can also be tedious and time-consuming to point-and-click your way through a large number of searches.

This notebook presents examples of using `KEGG` programmatically, via the Biopython libraries. You will control the searches using Python code in this notebook.

<div class="alert-success">
<b>As with all programmatic searches, there are a number of advantages to this approach:</b>
</div>

* It is easy to set up repeatable searches for many sequences, or collections of sequences
* It is easy to read in the search results and conduct downstream analyses that add value to your search

Where it is not practical to submit a large number of queries via a web form (because it is tiresome to point-and-click over and over again), the queries can be handled programmatically instead. Compared to the website interface, you also have the opportunity to change options that help refine your query. And if you need to repeat a query, a programmatic approach makes it trivial to use the same settings every time.

The Biopython interface to `KEGG` has other advantages that we will not cover in this lesson: for example, it allows a much greater range of image manipulations for the pathway maps that `KEGG` returns.

<a id="imports"></a>
## Python imports

In [None]:
# Show plots as part of the notebook
%pylab inline

# Show images and HTML (used by the PDF() helper below) inline
from IPython.display import HTML, Image

# Standard library packages
import io
import os

# Import Biopython modules to interact with KEGG
from Bio import SeqIO
from Bio.KEGG import REST
from Bio.KEGG.KGML import KGML_parser
from Bio.Graphics.KGML_vis import KGMLCanvas

# Import Pandas, so we can use dataframes
import pandas as pd

### Python functions

In the cell below, we define a couple of useful functions that convert some returned output into Pandas dataframe form, and display `.pdf` images directly in the notebook. You do not need to understand these to follow the lesson.

In [None]:
# A bit of code that will help us display the PDF output
def PDF(filename):
    return HTML('<iframe src=%s width=700 height=350></iframe>' % filename)

# Some code to return a Pandas dataframe, given tabular text
def to_df(result):
    return pd.read_table(io.StringIO(result), header=None)

<a id="kegg"></a>
## Running a remote `KEGG` query

There is typically only a single step involved in obtaining a result from `KEGG` with Biopython: run one of the functions in `Bio.KEGG.REST`, and catch the result in a variable.

The available functions are:

* `kegg_conv()` - convert identifiers from `KEGG` to those for other databases
* `kegg_find()` - find `KEGG` entries with matching query data
* `kegg_get()` - retrieve data for a specific entry from `KEGG` 
* `kegg_info()` - get information about a `KEGG` database
* `kegg_link()` - find entries in `KEGG` using a database cross-reference
* `kegg_list()` - list entries in a database
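
Each of these functions wraps one operation of the `KEGG` REST service, so it can help to see roughly what request is being made. The sketch below assumes the URL pattern `https://rest.kegg.jp/<operation>/<argument>/...`; the `kegg_url()` helper is hypothetical and is not part of Biopython, which builds and opens these requests for you.

```python
# Hypothetical helper, for illustration only: Bio.KEGG.REST does this for you.
# ASSUMPTION: the KEGG REST service is reached at https://rest.kegg.jp
KEGG_BASE = "https://rest.kegg.jp"


def kegg_url(operation, *arguments):
    """Return the URL for a KEGG REST operation and its arguments."""
    return "/".join([KEGG_BASE, operation, *arguments])


# Requests that correspond to calls used later in this notebook
print(kegg_url("info", "kegg"))
print(kegg_url("list", "pathway", "ksk"))
print(kegg_url("get", "cpd:C00051", "image"))
```

Reading the function name as the operation and the function arguments as path components is usually enough to predict what a given `Bio.KEGG.REST` call will ask for.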

<a id="kegg_info"></a>
### `kegg_info()`

This function returns basic information about a specified `KEGG` database - much like visiting the landing page for that database.

For instance, to get information about the `KEGG` databases as a whole, we can use `kegg_info("kegg")` to get a *handle* from `KEGG` describing the databases, then `.read()` the handle and catch the resulting text in a variable:

```
result = REST.kegg_info("kegg").read()
```

We could try to convert this text to a Pandas dataframe with the `to_df()` function defined above:

```
to_df(result)
```

<div class="alert-danger">
<b>Not all data is suited to <code>pandas</code> dataframe representation</b>
</div>

or print it to output directly with the `print()` function:

```
print(result)
```

In [None]:
# Perform the query
result = REST.kegg_info("kegg").read()

# Print the result
print(result)

# Convert result to dataframe
# NOTE: this request does not produce a suitable data format for dataframe representation
#to_df(result)

This gives us an overview of the available resources similar to that on the [`KEGG` landing page](http://www.genome.jp/kegg/kegg2.html). However, the `kegg_info()` function is a little more powerful, as it can find information about specific databases:

In [None]:
# Print information about the PATHWAY database
result = REST.kegg_info("pathway").read()
print(result)

and even about specific organisms (identified with their three-letter code):

In [None]:
# Print information about Kitasatospora setae
result = REST.kegg_info("ksk").read()
print(result)

<a id="kegg_list"></a>
### `kegg_list()`

The `kegg_list()` function returns a table of entry identifiers and definitions for a specified database. For example, to list all the entries in the PATHWAY database, you could use:

In [None]:
# Get all entries in the PATHWAY database as a dataframe
result = REST.kegg_list("pathway").read()
to_df(result)

and to restrict the results only to those pathways that are present in *K. setae*, you can filter the database results by passing the organism code `ksk` as the second argument:

In [None]:
# Get all entries in the PATHWAY database for K. setae as a dataframe
result = REST.kegg_list("pathway", "ksk").read()
to_df(result)

<div class="alert-warning">
<b>QUESTIONS</b>
<ol>
<li> How many entries are in the complete PATHWAY database?
<li> How many entries in the PATHWAY database are also present in <i>K. setae</i>?
<li> Are these the same answers you got in lesson 08?
</ol>
</div>
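
The text returned by `kegg_list()` is tab-delimited, with one entry per line, so counting or filtering entries does not strictly require a dataframe. The sketch below works on a short stand-in string rather than a live query: the entries in `sample` are illustrative only, and `parse_two_columns()` is a hypothetical helper name.

```python
def parse_two_columns(text):
    """Return (identifier, description) pairs from tab-delimited text."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, such as a trailing newline
        identifier, _, description = line.partition("\t")
        rows.append((identifier, description))
    return rows


# Stand-in for REST.kegg_list(...).read(); NOT real query output
sample = (
    "path:ksk00010\tGlycolysis / Gluconeogenesis\n"
    "path:ksk00020\tCitrate cycle (TCA cycle)\n"
)

rows = parse_two_columns(sample)
print(len(rows))   # number of entries in the stand-in text
print(rows[0][0])  # identifier of the first entry
```

With a live result, `len(to_df(result))` gives the same count from the dataframe.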

If, instead of specifying one of the top-level `KEGG` databases, you specify an organism code, `KEGG` will return a list of gene entries for that organism:

In [None]:
# Get all genes from K. setae as a dataframe
result = REST.kegg_list("ksk").read()
to_df(result)

<a id="kegg_find"></a>
### `kegg_find()`

The `kegg_find()` function will search a named `KEGG` database with a specified query term. For instance, to query the GENES database with the entry accession `KSE_17560` you could use:

In [None]:
# Find a specific entry with a precise search term
result = REST.kegg_find("genes", "KSE_17560").read()
to_df(result)

With the query above, `KEGG` returns information for the exact entry we've requested. But we can also use less precise search terms, and combine them with the `+` symbol. For example, to search for `shiga toxin` we would use the query:

```
"shiga+toxin"
```

In [None]:
# Find all shiga toxin genes
result = REST.kegg_find("genes", "shiga+toxin").read()
to_df(result)

We can restrict this search to specific organisms, such as *Escherichia coli* O111 H-11128 (EHEC), by supplying its three-letter code (here, `eoi`) as the database to be searched:

In [None]:
# Find all shiga toxin genes in eoi
result = REST.kegg_find("eoi", "shiga+toxin").read()
to_df(result)

The `kegg_find()` query string can also search in specific *fields* of the entry. The format for this is:

```
<query_value>/<field>
```

So, to search for all compounds with a molecular weight between 300 and 310 mass units, you can use the code:

In [None]:
# Find all compounds with mass between 300 and 310 units
result = REST.kegg_find("compound", "300-310/mol_weight").read()
to_df(result)
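
Both query styles above are plain strings, so they can be assembled from Python values when you want to run many related searches. The helper names below are hypothetical; they only build the string that you would pass as the second argument to `kegg_find()`, and make no request themselves.

```python
def keyword_query(*keywords):
    """Join search keywords with '+', as in 'shiga+toxin'."""
    return "+".join(keywords)


def range_query(low, high, field):
    """Build a '<low>-<high>/<field>' query, as in '300-310/mol_weight'."""
    return f"{low}-{high}/{field}"


print(keyword_query("shiga", "toxin"))
print(range_query(300, 310, "mol_weight"))
```

For example, `REST.kegg_find("compound", range_query(300, 310, "mol_weight"))` would repeat the molecular weight search above, and changing the two numbers gives a different mass window without retyping the query.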

<a id="kegg_get"></a>
### `kegg_get()`

Most functions you've seen so far will return two columns of data: the first column being the entry accession, and the second column being a description of that entry, or the requested value.

The `kegg_get()` function lets us retrieve specific entries from `KEGG` - such as our search results - in named formats.

For example, the first compound in our search for molecular weights in the range 300-310 above has entry accession `cpd:C00051`. We can recover this entry as follows:

In [None]:
# Get the entry information for cpd:C00051
result = REST.kegg_get("cpd:C00051").read()
print(result)

<div class="alert-warning">
<b>QUESTIONS</b>
<ol>
<li> What information is returned in the default result?
</ol>
</div>
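
The default result is a flat text record, so pulling out a single field takes a little parsing. The sketch below assumes the usual `KEGG` flat-file layout: a field name in the first 12 columns, the value after it, indented continuation lines, and `///` marking the end of the entry. The `parse_kegg_entry()` helper and the placeholder record in `sample` are hypothetical and do not reproduce a real `KEGG` entry.

```python
def parse_kegg_entry(text):
    """Group the lines of a KEGG flat-file entry by field name."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("///"):
            break  # end-of-entry marker
        if not line.strip():
            continue  # ignore blank lines
        # ASSUMPTION: field names occupy the first 12 columns
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            current = key  # a new field starts on this line
        if current is not None:
            fields.setdefault(current, []).append(value)
    return fields


# Placeholder record in the assumed layout; NOT a real KEGG entry
sample = (
    "ENTRY       X00000                      Compound\n"
    "NAME        Example name;\n"
    "            Another name\n"
    "FORMULA     C2H6O\n"
    "///\n"
)

fields = parse_kegg_entry(sample)
print(fields["NAME"])
print(fields["FORMULA"])
```

With a live query, `parse_kegg_entry(result)` would be called on the text returned by `REST.kegg_get(...).read()`.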

`KEGG` provides a number of different entry types, which cannot all be recovered in exactly the same ways. For instance, the COMPOUND entries typically have an associated molecular structure image, which can be recovered with `kegg_get()` by specifying the format to be `"image"`:

In [None]:
# Display molecular structure for cpd:C00051
result = REST.kegg_get("cpd:C00051", "image").read()
Image(result)
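
The image data can also be written to disk, so that it can be reused outside the notebook. This is a minimal sketch using only the standard library: `save_bytes()` is a hypothetical helper, the bytes below are a stand-in rather than a real image, and the file name is arbitrary (with real data, the extension should match whatever image format `KEGG` returns).

```python
import tempfile
from pathlib import Path


def save_bytes(data, filename):
    """Write raw bytes, such as downloaded image data, to a file."""
    path = Path(filename)
    path.write_bytes(data)
    return path


# Stand-in for REST.kegg_get("cpd:C00051", "image").read(); NOT real image data
fake_image = b"stand-in image bytes"

# A temporary directory keeps this demonstration from leaving files behind
with tempfile.TemporaryDirectory() as tmpdir:
    saved = save_bytes(fake_image, Path(tmpdir) / "C00051_structure")
    size_on_disk = saved.stat().st_size
    print(size_on_disk == len(fake_image))
```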

GENES entries describe sequences, so they can be recovered as their database entries (the default), or as FASTA-format nucleotide or protein sequences:

In [None]:
# Get entry information for KSE_17560
result = REST.kegg_get("ksk:KSE_17560").read()
print(result)

In [None]:
# Get coding sequence for KSE_17560
result = REST.kegg_get("ksk:KSE_17560", "ntseq").read()
print(result)

In [None]:
# Get protein sequence for KSE_17560
result = REST.kegg_get("ksk:KSE_17560", "aaseq").read()
print(result)
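
The `ntseq` and `aaseq` results are FASTA-formatted text, which can be split into a header and a sequence for downstream analysis. In this notebook the more usual route would be Biopython, for example `SeqIO.read(io.StringIO(result), "fasta")`; the standard-library sketch below only illustrates the structure of the text. The `parse_fasta()` helper is hypothetical and the record in `sample` is made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record standing in for REST.kegg_get(..., "aaseq").read()
sample = ">example_id example description\nMKT\nAAL\n"

records = parse_fasta(sample)
header, sequence = records[0]
print(header)
print(len(sequence))
```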