# Basic functionalities

The easiest way to get a prediction with CHAMOIS is to run the `chamois predict` command with a query BGC given as a GenBank record. 
For now, let's use [BGC0000703](https://mibig.secondarymetabolites.org/repository/BGC0000703.4/index.html#r1c1), the MIBiG BGC
producing [kanamycin](https://pubchem.ncbi.nlm.nih.gov/compound/6032) in *Streptomyces kanamyceticus*. The record was pre-downloaded
from MIBiG in GenBank format.

<div class="alert alert-info">

Note

This notebook calls the CHAMOIS CLI through the `chamois.cli.run` function. This is equivalent to running the `chamois` command in your shell; it is done here only to integrate with the documentation generator. For instance, calling:
```python
chamois.cli.run(["predict"])
```
is equivalent to running
```bash
$ chamois predict
```
in the console.

</div>
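
When translating shell examples into `chamois.cli.run` calls, the argument list can be derived from the command line with the standard-library `shlex` module. The helper below, `to_argv`, is a hypothetical convenience that is not part of CHAMOIS; it only splits the string and drops the leading `chamois` program name.

```python
import shlex


def to_argv(command_line: str) -> list:
    """Split a shell-style ``chamois ...`` command line into an argument list.

    Hypothetical helper (not part of CHAMOIS): the result is meant to be
    passed to ``chamois.cli.run``.
    """
    tokens = shlex.split(command_line)
    # Drop the program name so only the subcommand and its options remain.
    if tokens and tokens[0] == "chamois":
        tokens = tokens[1:]
    return tokens


argv = to_argv("chamois predict -i data/BGC0000703.4.gbk -o data/BGC0000703.4.hdf5")
print(argv)
```

The resulting list could then be passed as `chamois.cli.run(argv)`.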


In [None]:
import chamois.cli
chamois.__version__

## Running predictions

Use the `chamois predict` command to run ChemOnt class predictions with CHAMOIS:

In [None]:
# $ chamois predict -i data/BGC0000703.4.gbk -o data/BGC0000703.4.hdf5
chamois.cli.run(["predict", "-i", "data/BGC0000703.4.gbk", "-o", "data/BGC0000703.4.hdf5"])

The resulting HDF5 file can be opened with the `anndata` package for further analysis:

In [None]:
import anndata
data = anndata.read_h5ad("data/BGC0000703.4.hdf5")
data

The observations (`data.obs`) store the metadata about the query BGCs:

In [None]:
data.obs

The variables (`data.var`) store the metadata about the chemical classes predicted by the CHAMOIS predictor.

In [None]:
data.var

## Visualizing results

The resulting HDF5 file contains the class probabilities for each of the records in the input GenBank file. The CLI can be used to quickly inspect the predicted classes:

In [None]:
# $ chamois render -i data/BGC0000703.4.hdf5
chamois.cli.run(["render", "-i", "data/BGC0000703.4.hdf5"])
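
The same information can also be explored programmatically. The sketch below assumes that the probabilities are stored in the main matrix of the `AnnData` object, so that `data.to_df()` would return a table with one row per BGC and one column per class. This layout is an assumption, and a small synthetic table with made-up names and values stands in for the real predictions.

```python
import pandas

# Synthetic stand-in for `data.to_df()`: rows are query BGCs, columns are
# chemical classes, values are made-up probabilities (not real CHAMOIS output).
probas = pandas.DataFrame(
    [[0.95, 0.10, 0.70], [0.20, 0.85, 0.40]],
    index=["bgc_1", "bgc_2"],
    columns=["class_a", "class_b", "class_c"],
)


def confident_classes(row, threshold=0.5):
    """Return the classes of one BGC with probability >= threshold, highest first."""
    selected = row[row >= threshold]
    return selected.sort_values(ascending=False)


top = confident_classes(probas.loc["bgc_1"])
print(top)
```

The threshold of 0.5 is an arbitrary choice for the example and can be adjusted to be more or less permissive.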

## Screening predictions

Once predictions have been made, they can be screened against a particular query metabolite to see which BGC is the most likely to produce that metabolite. Let's try with kanamycin as a sanity check. Molecules can be passed to `chamois screen` as SMILES, InChI, or InChIKey.

<div class="alert alert-info">

Info

Passing a SMILES or an InChI requires the additional Python dependency `rdkit`
to handle conversion to InChIKey.

</div>

In [None]:
# $ chamois screen -i data/BGC0000703.4.hdf5 -q SBUJHOSQTJFQJX-NOAMYHISSA-N --render
chamois.cli.run(["screen", "-i", "data/BGC0000703.4.hdf5", "-q", "SBUJHOSQTJFQJX-NOAMYHISSA-N", "--render"])
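
To tell in advance whether a query will need `rdkit`, the three formats can be distinguished by their syntax. The sketch below uses a hypothetical helper, `query_format`, which is not part of CHAMOIS. It is only a heuristic: it recognizes the fixed layout of a standard InChIKey and the `InChI=` prefix, and assumes that anything else is a SMILES string without validating it.

```python
import re

# A standard InChIKey has 27 characters: 14 letters, a hyphen, 10 letters,
# a hyphen, and a final letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def query_format(query: str) -> str:
    """Guess whether a query string is an InChIKey, an InChI, or a SMILES."""
    if INCHIKEY_PATTERN.fullmatch(query):
        return "inchikey"
    if query.startswith("InChI="):
        return "inchi"
    # Anything else is assumed to be a SMILES string (not validated here).
    return "smiles"


print(query_format("SBUJHOSQTJFQJX-NOAMYHISSA-N"))
```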

## Searching a catalog

<div class="alert alert-warning">

Warning

This feature is experimental and has not been properly evaluated. Use with caution.

</div>

The predictions can be used to search a catalog of compounds encoded as a `classes.hdf5` file, similar to what CHAMOIS uses for training. For instance, we can search for the MIBiG 3.1 compound most similar to our prediction; BGC0000703 should hopefully appear among the top hits:

In [None]:
# $ chamois search -i data/BGC0000703.4.hdf5 --catalog ../../data/datasets/mibig3.1/classes.hdf5 --render
chamois.cli.run(["search", "-i", "data/BGC0000703.4.hdf5", "--catalog", "../../data/datasets/mibig3.1/classes.hdf5", "--render"])
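
To illustrate what searching a catalog by similarity means, the sketch below ranks a few catalog entries against a prediction using the Jaccard index between sets of classes. Both the metric and the data are illustrative assumptions: the compound names and class sets are made up, and `chamois search` may well use a different distance.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index between two sets of class identifiers."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Classes retained for the query BGC, e.g. after thresholding probabilities
# (made-up identifiers, not real CHAMOIS output).
predicted = {"class_a", "class_b", "class_c"}

# Hypothetical catalog: compound name -> set of annotated classes.
catalog = {
    "compound_1": {"class_a", "class_b", "class_c", "class_d"},
    "compound_2": {"class_a", "class_e"},
    "compound_3": {"class_f"},
}

# Rank the catalog entries from most to least similar to the prediction.
ranking = sorted(
    catalog,
    key=lambda name: jaccard(predicted, catalog[name]),
    reverse=True,
)
print(ranking)
```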

## Interpreting a prediction

The `chamois explain` command provides additional information about a prediction made by CHAMOIS. It must be given the original sequences of the BGCs; it then re-annotates the genes and inspects the model weights to break down the prediction into individual contributions from each gene, making it easier to understand the functions of the individual genes of the BGC. We call `chamois explain` with the `--cds` argument so that the gene coordinates and identifiers are those already defined in the GenBank record:

In [None]:
# $ chamois explain cluster --cds -i data/BGC0000703.4.gbk -o data/BGC0000703.4.tsv
chamois.cli.run(["explain", "cluster", "--cds", "-i", "data/BGC0000703.4.gbk", "-o", "data/BGC0000703.4.tsv"])
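
The idea of breaking a prediction down into per-gene contributions can be sketched as follows, assuming a linear model with one weight per protein domain for a given class, and assuming that a gene's contribution is the sum of the weights of the domains it carries. This is a simplification for illustration: the domain names, gene names, and weights are made up, and the actual computation done by `chamois explain` may differ.

```python
# Made-up weights of a linear model for a single chemical class:
# one weight per protein domain.
domain_weights = {"domain_1": 1.5, "domain_2": 0.5, "domain_3": -0.25}

# Made-up annotation: the domains found in each gene of the BGC.
gene_domains = {
    "gene_1": ["domain_1", "domain_2"],
    "gene_2": ["domain_3"],
    "gene_3": [],
}


def gene_contributions(gene_domains, domain_weights):
    """Sum, for each gene, the class weights of the domains it contains."""
    return {
        gene: sum(domain_weights.get(domain, 0.0) for domain in domains)
        for gene, domains in gene_domains.items()
    }


contributions = gene_contributions(gene_domains, domain_weights)
print(contributions)
```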

The output is a table that shows the contribution of the genes of the BGC to each of the predicted classes. It can be easily loaded with `pandas`:

In [None]:
import pandas
table = pandas.read_table("data/BGC0000703.4.tsv")
table

For instance, to see which genes contribute significantly to the prediction of CHEMONTID:0000282 (Aminoglycosides) for the BGC compound, we can extract the relevant row from the table and filter for genes with a weight of at least 2.0:

In [None]:
w = table.set_index("class").loc["CHEMONTID:0000282"].drop(["name", "probability"])
w[w >= 2]
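
To apply the same filter to every class at once, the table can be reshaped into a long format with one row per class and gene. The sketch below uses a synthetic table with the same layout as the one loaded above (`class`, `name` and `probability` columns followed by one column per gene); the class identifiers, gene names, and weights are made up.

```python
import pandas

# Synthetic stand-in for the table written by `chamois explain`.
table = pandas.DataFrame({
    "class": ["class_a", "class_b"],
    "name": ["Class A", "Class B"],
    "probability": [0.9, 0.6],
    "gene_1": [2.5, 0.1],
    "gene_2": [0.3, -1.0],
    "gene_3": [2.1, 3.0],
})

# Turn the gene columns into (gene, weight) rows, keeping the class metadata.
long = table.melt(
    id_vars=["class", "name", "probability"],
    var_name="gene",
    value_name="weight",
)

# Keep the class/gene pairs with a weight of at least 2.0, strongest first.
strong = long[long["weight"] >= 2.0].sort_values("weight", ascending=False)
print(strong[["class", "gene", "weight"]])
```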

These two genes encode [DegT/DnrJ/EryC1/StrS-family aminotransferases](https://www.ebi.ac.uk/interpro/entry/InterPro/IPR000653/), which are also found in the biosynthetic pathways of [streptidine](https://pubchem.ncbi.nlm.nih.gov/compound/439323) (one of the aminoglycoside moieties of [streptomycin](https://pubchem.ncbi.nlm.nih.gov/compound/19649)) or [rifamycin B](https://pubchem.ncbi.nlm.nih.gov/compound/5459948).