# Getting Started

Start by creating a volume from the Pink Berries Metagenome snapshot, then launch a CeruleanTools AMI instance with that volume attached.

Next, make a directory for the drive and mount it. My metagenome data volume was at `/dev/xvdc`; to find yours, run `lsblk` and look for the device whose MOUNTPOINT column is blank. The cell below mounts `/dev/xvdf`, so substitute your own device name in the `mount` command.

In [3]:
!lsblk
!mkdir ~/data
!sudo mount /dev/xvdf ~/data

NAME  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvdb  202:16   0  200G  0 disk /mnt
xvda1 202:1    0  100G  0 disk /
mkdir: cannot create directory ‘/home/ubuntu/data’: File exists
mount: mount point data does not exist
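
As a rough aid for the `lsblk` step above, the following sketch picks out rows whose MOUNTPOINT column is empty. It assumes the default seven-column layout shown in the output; the helper name `unmounted_devices` and the `xvdf` row in the sample are hypothetical, added only for illustration. Note that a parent disk with mounted partitions would also show a blank MOUNTPOINT, so treat the result as a list of candidates.

```python
def unmounted_devices(lsblk_text):
    """Return names of devices whose MOUNTPOINT column is blank."""
    lines = [ln for ln in lsblk_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Assumption: MOUNTPOINT is the last column, as in the default lsblk layout.
    if not header[-1].startswith("MOUNTPOINT"):
        raise ValueError("unexpected lsblk header: %r" % lines[0])
    names = []
    for line in lines[1:]:
        fields = line.split()
        # A row that is one field short has nothing in the MOUNTPOINT column.
        if len(fields) == len(header) - 1:
            # Strip the tree-drawing characters lsblk prints before partitions.
            names.append(fields[0].lstrip("├─└│|`- "))
    return names


# The first three lines are from the output above; the xvdf row is hypothetical.
sample = """NAME  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvdb  202:16   0  200G  0 disk /mnt
xvda1 202:1    0  100G  0 disk /
xvdf  202:80   0  500G  0 disk
"""
print(unmounted_devices(sample))
```

In a notebook, the text could come from `subprocess.check_output(["lsblk"], text=True)` instead of the pasted sample.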


# Binning

Use BWA or Bowtie to restrict the analysis to *Bacteroidetes* reads.

# Inputs and pre-processing

Cerulean requires that we use *ABySS* to assemble contigs from our short reads, and then map the PacBio long reads to the contigs using *BLASR*. 

First, we'll create a folder in which to store our working files.

In [11]:
# "!cd" runs in a throwaway subshell, so use the %cd magic to change the notebook's directory
%cd ~
!mkdir hybrid

mkdir: cannot create directory ‘hybrid’: File exists


### Assembling short-read contigs

To start, place the Illumina short-read files in our new directory. 

In [13]:
#! need to find where the illumina files are and move them to ~/hybrid

Now, assemble the contigs. If the paired-end reads are stored in FASTQ format in the files `reads1.fastq` and `reads2.fastq`, the contigs may be assembled with:

In [14]:
#! $ abyss-pe k=64 n=10 in='reads1.fastq reads2.fastq' name=<dataname>


This generates two files that Cerulean uses as input:
```
* <dataname>-contigs.fa    #This contains the contig sequences
* <dataname>-contigs.dot   #This contains the graph structure
```

### Map PacBio reads to ABySS contigs using BLASR

Note: `sawriter` and `blasr` are part of the SMRT Analysis toolkit. You need to set the environment variables and path first:
   
```
$ export SEYMOUR_HOME=/opt/smrtanalysis/
$ source $SEYMOUR_HOME/etc/setup.sh
```
   
Suppose the PacBio reads are stored in `<dataname>_pacbio.fasta`:

```
$ sawriter <dataname>-contigs.fa
$ blasr <dataname>_pacbio.fasta <dataname>-contigs.fa -minMatch 10 \
     -minPctIdentity 70 -bestn 30 -nCandidates 30 -maxScore -500 \
     -nproc <numthreads> -noSplitSubreads \
     -out <dataname>_pacbio_contigs_mapping.fasta.m4
```
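
To keep the file names consistent across runs, the `blasr` call above can be assembled from `<dataname>` and the thread count. This is only a sketch: the function name `blasr_command` and the example dataname `pinkberry` are hypothetical, and the flags are copied from the command above without any additions.

```python
import shlex


def blasr_command(dataname, numthreads):
    """Build the blasr argument list using the naming scheme from this section."""
    return [
        "blasr",
        dataname + "_pacbio.fasta",   # PacBio long reads
        dataname + "-contigs.fa",     # ABySS contigs
        "-minMatch", "10",
        "-minPctIdentity", "70",
        "-bestn", "30",
        "-nCandidates", "30",
        "-maxScore", "-500",
        "-nproc", str(numthreads),
        "-noSplitSubreads",
        "-out", dataname + "_pacbio_contigs_mapping.fasta.m4",
    ]


# Print the command for inspection; "pinkberry" is a placeholder dataname.
print(shlex.join(blasr_command("pinkberry", 8)))
```

The list can be passed to `subprocess.run(..., check=True)` once the SMRT Analysis environment is set up.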
   
Make sure the generated `.fasta.m4` file has the following columns, in this order:

`qname tname qstrand tstrand score pctsimilarity tstart tend tlength qstart qend qlength ncells`

The file format may be verified by adding the option `-header` to `blasr`.
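
A quick way to sanity-check the mapping file is to confirm that every record has the 13 whitespace-separated fields listed above. The sketch below assumes that layout and assumes that a header row, if present, consists of exactly those column names; the helper names and the example record are hypothetical.

```python
M4_COLUMNS = ("qname tname qstrand tstrand score pctsimilarity tstart tend "
              "tlength qstart qend qlength ncells").split()


def parse_m4_line(line):
    """Split one m4 record into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(M4_COLUMNS):
        raise ValueError("expected %d fields, found %d"
                         % (len(M4_COLUMNS), len(fields)))
    return dict(zip(M4_COLUMNS, fields))


def count_m4_records(lines):
    """Count well-formed records; raise ValueError at the first malformed one."""
    count = 0
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or fields == M4_COLUMNS:
            continue  # skip blank lines and a header row of column names
        try:
            parse_m4_line(line)
        except ValueError as err:
            raise ValueError("line %d: %s" % (number, err))
        count += 1
    return count


# Hypothetical record, for illustration only.
example = "read_1 contig_7 0 0 -450 85.5 10 110 5000 0 100 100 2000"
record = parse_m4_line(example)
print(record["qname"], "->", record["tname"])
```

For a real file, `count_m4_records(open(path))` walks it line by line without loading everything into memory.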


# Execute Cerulean

 Cerulean requires that all input files are in the same directory `<basedir>`:
 i)   `<basedir>/<dataname>-contigs.fa`
 ii)  `<basedir>/<dataname>-contigs.dot`
 iii) `<basedir>/<dataname>_pacbio_contigs_mapping.fasta.m4`
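
Before launching Cerulean, it may help to confirm that the three inputs listed above are actually present in `<basedir>`. A minimal sketch, with hypothetical helper names and a throwaway directory standing in for the real one:

```python
import os
import tempfile


def cerulean_inputs(basedir, dataname):
    """Return the three input paths Cerulean expects under basedir."""
    return [
        os.path.join(basedir, dataname + "-contigs.fa"),
        os.path.join(basedir, dataname + "-contigs.dot"),
        os.path.join(basedir, dataname + "_pacbio_contigs_mapping.fasta.m4"),
    ]


def missing_inputs(basedir, dataname):
    """Return the expected input paths that do not exist as files."""
    return [p for p in cerulean_inputs(basedir, dataname)
            if not os.path.isfile(p)]


# Demo: create only the two ABySS outputs, then report what is missing.
# "pinkberry" is a placeholder dataname.
with tempfile.TemporaryDirectory() as basedir:
    for path in cerulean_inputs(basedir, "pinkberry")[:2]:
        open(path, "w").close()
    for path in missing_inputs(basedir, "pinkberry"):
        print("missing:", os.path.basename(path))
```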

 To run:
 ```
 $ python src/Cerulean.py --dataname <dataname> --basedir <basedir> \
 --nproc <numthreads>
 ```
 
 This will generate:
 i)  `<basedir>/<dataname>_cerulean.fasta`
 ii) `<basedir>/<dataname>_cerulean.dot`
 Note: The dot file does not contain the same contigs as the fasta file; it holds an intermediate graph.
 


# Post-processing

 Currently, Cerulean does not include the consensus sequence of PacBio reads in gaps.
 The gaps may be filled using PBJelly:
 ```
 $ python $JELLYPATH/fakeQuals.py <dataname>_cerulean.fasta <dataname>_cerulean.qual
 $ python $JELLYPATH/fakeQuals.py <dataname>_pacbio.fasta <dataname>_pacbio.qual
 $ cp $JELLYPATH/lambdaExample/Protocol.xml .
 $ mkdir PBJelly
 ```
 
 Modify Protocol.xml as follows:
 Set `<reference>` to `$PATH_TO_<basedir>/<dataname>_cerulean.fasta`
 Set `<outputDir>` to `$PATH_TO_<basedir>/PBJelly`
 Set `<baseDir>` to `$PATH_TO_<basedir>`
 Set `<job>` to `<dataname>_pacbio.fasta`
 Set `<blasr>` option `-nproc <numthreads>`
 
 Note: PBJelly requires that the suffix be `.fasta` and not `.fa`.

 Next, run PBJelly:
 
 ```
 ($ source $JELLYPATH/exportPaths.sh)
 $ python $JELLYPATH/Jelly.py <stage> Protocol.xml
 ```
 
 where `<stage>` must be run once for each of the following, in this order:
 ```
 setup
 mapping
 support
 extraction
 assembly
 output
 ```
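
 Since the stages have to run in sequence, a small driver can generate the six `Jelly.py` invocations and stop at the first failure. This is a sketch under assumptions: the function names are hypothetical, `/path/to/PBJelly` is a placeholder for `$JELLYPATH`, and it defaults to a dry run that only prints the commands. If a stage hands work off to background or cluster jobs, check that it has finished before starting the next one.

```python
import os
import subprocess

# Stage order as listed above.
JELLY_STAGES = ["setup", "mapping", "support", "extraction", "assembly", "output"]


def jelly_commands(jellypath, protocol="Protocol.xml"):
    """Return one Jelly.py command (as an argument list) per stage, in order."""
    script = os.path.join(jellypath, "Jelly.py")
    return [["python", script, stage, protocol] for stage in JELLY_STAGES]


def run_jelly(jellypath, protocol="Protocol.xml", dry_run=True):
    """Print each stage command; execute it too unless dry_run is set."""
    for cmd in jelly_commands(jellypath, protocol):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, so later stages never
            # run on top of a failed one.
            subprocess.run(cmd, check=True)


# Dry run with a placeholder path.
run_jelly("/path/to/PBJelly")
```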
 
 The assembled contigs may be viewed in
 ```
 <basedir>/PBJelly/assembly/jellyOutput.fasta
 ```

Still to do: quality-test this assembly against other assemblies.