## Abacat tutorial


### Getting started
Abacat is straightforward to use, both interactively (as in this tutorial) and from the command line.

Abacat's main class is the `Genome` class, which will hold most of your data.

In [1]:
# Let's instantiate a genome to start.
from abacat import Genome

g = Genome()

### Loading data
Our instance doesn't hold anything yet, but we can load a .fasta file containing whole-genome sequencing (WGS) data. Abacat ships with 7 genomes, so you can play around without worrying about downloading data.

Our genomes are located in `abacat/data/genomes`.

For now, let's load a genome into our `Genome` instance. This is a cyanobacterial genome of the species **Synechococcus elongatus**; the file is named after its NCBI accession number. More information about it is available [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_000012525.1/), at the NCBI Assembly database.

In [2]:
g.load_contigs("abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic.fna")

2019-09-27 05:33:09 - Contigs file set as /Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic.fna
2019-09-27 05:33:09 - Directory set as /Users/viniWS/Bio/abacat/abacat/data/genomes
2019-09-27 05:33:09 - Name set as GCF_000012525.1_ASM1252v1_genomic


Notice that Abacat sets `directory` and `name` attributes. Our file is stored in `g.files`.

In [3]:
g.files

{'contigs': '/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic.fna'}

`g.files` is a dictionary because we will generate more files, and their paths will be stored there. We can have a look at our sequence statistics using [seqstats](https://github.com/clwgg/seqstats), a fast command-line tool for which Abacat provides a wrapper:

In [4]:
g.load_seqstats()
g.seqstats  # We also store this information as an attribute.

{'Total n': 2.0,
 'Total seq': 2742269.0,
 'Avg. seq': 1371134.5,
 'Median seq': 1371134.5,
 'N 50': 2695903.0,
 'Min seq': 46366.0,
 'Max seq': 2695903.0}

We can see we have a complete genome assembly made of just two sequences: one chromosome (the larger sequence, about 2.7 Mbp) and a plasmid (about 46 kbp).
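
To make these numbers concrete, the sketch below recomputes them in plain Python from the two sequence lengths reported above. It illustrates how such statistics are commonly defined; it is not the implementation used by seqstats or Abacat, and `assembly_stats` is a hypothetical helper name.

```python
from statistics import mean, median


def assembly_stats(lengths):
    """Summarise a list of sequence lengths (hypothetical helper, not part of Abacat)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # N50: length of the sequence at which the running total, taken from the
    # longest sequence downwards, first reaches half of the assembly size.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "Total n": len(ordered),
        "Total seq": total,
        "Avg. seq": mean(ordered),
        "Median seq": median(ordered),
        "N 50": n50,
        "Min seq": ordered[-1],
        "Max seq": ordered[0],
    }


# Lengths of the chromosome and the plasmid, taken from the output above.
stats = assembly_stats([2695903, 46366])
print(stats["Total seq"], stats["N 50"])
```

Because the chromosome alone accounts for more than half of the assembly, the N50 equals the chromosome length.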

### Gene calling, gene sets and prot sets
The next thing we might want to do with an assembly is predict coding sequences (CDS), so we can have a file with genes (which we will call `geneset`) and a file with proteins (which we will call `protset`). For gene calling, we have a wrapper for [Prodigal](https://github.com/hyattpd/Prodigal), a popular gene prediction tool:

In [13]:
g.run_prodigal()

2019-09-27 05:42:49 - Starting Prodigal. Your input file is /Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic.fna. Quiet setting is True.
2019-09-27 05:42:57 - Loaded gene set from Prodigal data. It has 2725 genes.
2019-09-27 05:42:57 - Loaded protein set from Prodigal data.
2019-09-27 05:42:57 - Took 0:00:07.566366
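
For orientation, this is roughly the kind of command a Prodigal wrapper has to assemble. The flags (`-i`, `-d`, `-a`, `-o`, `-f`, `-q`) are Prodigal's own; the helper name `prodigal_command` and the output naming scheme are assumptions modelled on the file names in this tutorial, and the exact call Abacat makes may differ.

```python
from pathlib import Path


def prodigal_command(contigs, quiet=True):
    """Build a Prodigal command line as a list of arguments (hypothetical helper)."""
    contigs = Path(contigs)
    # Strip the final extension, e.g. "genome.fna" -> "genome".
    prefix = contigs.with_suffix("")
    cmd = [
        "prodigal",
        "-i", str(contigs),                       # input contigs (FASTA)
        "-d", f"{prefix}_prodigal_genes.fna",     # nucleotide sequences of predicted genes
        "-a", f"{prefix}_prodigal_proteins.faa",  # protein translations
        "-o", f"{prefix}_prodigal_cds.gbk",       # gene coordinates
        "-f", "gbk",                              # output format for -o
    ]
    if quiet:
        cmd.append("-q")  # suppress Prodigal's progress output
    return cmd


# Placeholder input path, used only to show the resulting command.
cmd = prodigal_command("genomes/example_genomic.fna")
print(" ".join(cmd))
```

With Prodigal installed, a list like this could be handed to `subprocess.run(cmd, check=True)`.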


These files are stored in our `files` attribute, which we saw previously:

In [14]:
g.files

{'contigs': '/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic.fna',
 'prodigal': {'genes': '/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_prodigal_genes.fna',
  'proteins': '/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_prodigal_proteins.faa',
  'cds': '/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_prodigal_cds.gbk'}}

Now we have two keys in our `g.files` dictionary: the `'contigs'` key, which holds the file we started with, and the `'prodigal'` key, which holds a dictionary with the CDS files.

In [15]:
g.files["prodigal"]["proteins"]  # The path to our file holding a set of proteins.

'/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_prodigal_proteins.faa'
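
Since `files` can nest dictionaries, a small recursive helper is handy for listing every path it holds, for instance to check that the files exist. The sketch below works on a plain dictionary shaped like the output above, with shortened placeholder paths; `iter_paths` is a hypothetical helper, not an Abacat method.

```python
def iter_paths(files, prefix=()):
    """Yield (dotted_key, file_path) pairs from a nested dict of paths."""
    for key, value in files.items():
        if isinstance(value, dict):
            # Descend into sub-dictionaries such as files["prodigal"].
            yield from iter_paths(value, prefix + (key,))
        else:
            yield ".".join(prefix + (key,)), value


# Same shape as g.files above, with shortened placeholder paths.
files = {
    "contigs": "genomes/example_genomic.fna",
    "prodigal": {
        "genes": "genomes/example_genomic_prodigal_genes.fna",
        "proteins": "genomes/example_genomic_prodigal_proteins.faa",
        "cds": "genomes/example_genomic_prodigal_cds.gbk",
    },
}

for name, path in iter_paths(files):
    print(name, path)
```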

Our `Genome` object also loads all of these sequences in memory, through the `geneset` and `protset` attributes:

In [16]:
g.geneset["prodigal"].keys()

dict_keys(['records', 'origin'])

Our `"prodigal"` geneset holds all the genes predicted from our original genome file. The `"records"` key accesses all of the sequence records in the `geneset`, and the `"origin"` key points to the file from which they were loaded:

In [17]:
len(g.geneset["prodigal"]["records"]), g.geneset["prodigal"]["origin"]

(2725,
 '/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_prodigal_genes.fna')

Now we have all 2725 genes loaded as a dictionary, with each sequence ID as the key.

In [18]:
g.geneset["prodigal"]["records"]["NC_007604.1_1"]

SeqRecord(seq=Seq('ATGCTTTGGCAAGATTGCGATCAAAGGCTCGGGCAGCCTCCCCCCATGAAGTTG...TAG', SingleLetterAlphabet()), id='NC_007604.1_1', name='NC_007604.1_1', description='NC_007604.1_1 # 65 # 1237 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.570', dbxrefs=[])
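
The `description` of each record follows Prodigal's FASTA header layout: sequence ID, start, end, strand and a semicolon-separated attribute list, all separated by ` # `. A minimal sketch of parsing it with the standard library follows; `parse_prodigal_description` is a hypothetical helper, not part of Abacat.

```python
def parse_prodigal_description(description):
    """Split a Prodigal FASTA header into coordinates and attributes."""
    seq_id, start, end, strand, attributes = description.split(" # ")
    parsed = {
        "id": seq_id,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),  # 1 = forward strand, -1 = reverse strand
    }
    # Attributes look like "ID=1_1;partial=00;start_type=ATG;..."
    for item in attributes.strip().strip(";").split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


# Description copied from the record shown above.
description = (
    "NC_007604.1_1 # 65 # 1237 # 1 # ID=1_1;partial=00;start_type=ATG;"
    "rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.570"
)
gene = parse_prodigal_description(description)
length = gene["end"] - gene["start"] + 1
print(gene["start"], gene["end"], length, length // 3)
```

For this gene the coordinates span 1173 bp, that is, 391 codons including the stop codon.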

In [19]:
# The same goes for the protset:
g.protset["prodigal"]["records"]["NC_007604.1_1"]

SeqRecord(seq=Seq('MLWQDCDQRLGQPPPMKLVCRQNELNTSLSLVSRAVPSRPNHPVLANVLLAADA...RS*', SingleLetterAlphabet()), id='NC_007604.1_1', name='NC_007604.1_1', description='NC_007604.1_1 # 65 # 1237 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.570', dbxrefs=[])
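
The gene and protein records describe the same CDS: translating the gene with the standard genetic code gives the protein. The sketch below checks this on the first 54 bases visible in the gene record above. It uses a hand-built codon table so that it needs no extra dependencies; the `translate` helper is illustrative only and is not how Abacat or Prodigal produce the protein set.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# First 54 bases of gene NC_007604.1_1, as displayed in the gene record.
gene_start = "ATGCTTTGGCAAGATTGCGATCAAAGGCTCGGGCAGCCTCCCCCCATGAAGTTG"
print(translate(gene_start))
```

The printed peptide should match the start of the protein record for `NC_007604.1_1`.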

### Annotation

Now that we have predicted CDSs in our genome, we can easily annotate these sequences. Abacat comes with two small pre-packaged databases:  
* [Megares](https://megares.meglab.org/) - an antibiotic resistance genes database and
* Phenotyping - which consists of genes involved in metabolic pathways which define phenotypes, and is still experimental.

To annotate our genomes, we will use the `blast_seqs()` method, which can be adapted for either our gene set or our prot set. In this case, because we want a nucleotide to nucleotide alignment, we will use our gene set with blastn:

In [20]:
g.blast_seqs(db="megares")

2019-09-27 05:48:20 - Blasting GCF_000012525.1_ASM1252v1_genomic to /Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_megares_blast.xml.
2019-09-27 05:48:23 - Found 1 hits.

2019-09-27 05:48:23 - Wrote 1 annotated sequences to /Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_megares.fasta.
2019-09-27 05:48:23 - Took 0:00:03.035533


NC_007604.1_912 CARD|pvgb|CP002695|3866610-3867801|ARO:3001312|elfamycin|Elfamycins|EF-Tu_inhibition|TUFAB|RequiresSNPConfirmation


Our annotation with Megares found only 1 gene! The annotation data is also stored in our genome object, and can be accessed through the database key in the `files` attribute, like so: `g.files['megares']`

It produces 3 files:
* xml - the BLASTn result in XML format.
* annotation - the matching hits with corresponding annotation
* hits - only the hit description, without the sequences.

In [26]:
g.files['megares']

{'xml': '/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_megares_blast.xml',
 'annotation': '/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_megares.fasta',
 'hits': '/Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_megares.hits'}
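
The hit description printed by `blast_seqs()` above is a pipe-delimited string. A minimal sketch of taking it apart follows. The meaning assigned to each position (class, mechanism, group, flag) is an assumption read off this single example, not a documented Megares schema.

```python
# Hit description copied from the blast_seqs() output above.
hit = (
    "CARD|pvgb|CP002695|3866610-3867801|ARO:3001312|elfamycin|"
    "Elfamycins|EF-Tu_inhibition|TUFAB|RequiresSNPConfirmation"
)

fields = hit.split("|")

# Assumed layout, inferred from this one header: the last four fields are
# resistance class, mechanism, gene group and a confirmation flag.
resistance_class, mechanism, group, flag = fields[-4:]

# The fourth field looks like a coordinate range on the source accession.
start, end = (int(x) for x in fields[3].split("-"))

print(resistance_class, mechanism, group, flag)
print(end - start + 1)
```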

The same can be done for the phenotyping database. Because it is a **protein** database, we change our search strategy to "blastx", which searches translated nucleotide queries against protein sequences.

In [28]:
g.blast_seqs(db="phenotyping", blast="blastx")

2019-09-27 05:53:00 - Blasting GCF_000012525.1_ASM1252v1_genomic to /Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_phenotyping_blast.xml.


ApplicationError: Non-zero return code 2 from 'blastn -out /Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_phenotyping_blast.xml -outfmt 5 -num_alignments 5 -query /Users/viniWS/Bio/abacat/abacat/data/genomes/GCF_000012525.1_ASM1252v1_genomic_prodigal_genes.fna -db /Users/viniWS/Bio/abacat/abacat/data/db/phenotyping/phenotyping.fasta -evalue 1e-20 -num_threads 2', message 'BLAST Database error: No alias or index file found for nucleotide database [/Users/viniWS/Bio/abacat/abacat/data/db/phenotyping/phenotyping.fasta] in search path [/Users/viniWS/Bio/abacat::]'

In [29]:
!ls /Users/viniWS/Bio/abacat/abacat/data/db/phenotyping/phenotyping.fasta

/Users/viniWS/Bio/abacat/abacat/data/db/phenotyping/phenotyping.fasta
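
The traceback is informative: although `blast="blastx"` was passed, the failing command is `blastn`, and BLAST reports that it cannot find a *nucleotide* index for `phenotyping.fasta`. The FASTA file itself exists, as the `ls` call shows, so what is missing is a database index matching the program being run. BLAST+ keeps nucleotide indexes in files ending in `.nin` (or an alias file ending in `.nal`) and protein indexes in `.pin` (or `.pal`). The helper below, a hypothetical diagnostic that is not part of Abacat, reports which kind of index sits next to a FASTA file:

```python
from pathlib import Path


def blast_index_types(fasta_path):
    """Return the BLAST database types ('nucl', 'prot') indexed for a FASTA file."""
    fasta_path = Path(fasta_path)
    # Index (.nin/.pin) or alias (.nal/.pal) files written next to the FASTA file.
    extensions = {"nucl": (".nin", ".nal"), "prot": (".pin", ".pal")}
    found = []
    for db_type, suffixes in extensions.items():
        if any(Path(f"{fasta_path}{suffix}").exists() for suffix in suffixes):
            found.append(db_type)
    return found


# Placeholder path: point this at abacat/data/db/phenotyping/phenotyping.fasta.
print(blast_index_types("phenotyping.fasta"))
```

A protein index can be built with `makeblastdb -in phenotyping.fasta -dbtype prot`; whether Abacat is expected to do this automatically is not clear from the output above.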
