# I. Getting Started

Start by creating a volume from the Pink Berries Metagenome snapshot, then launch a CeruleanTools AMI instance with that volume attached.

Next, make a directory to serve as the mount point and mount the volume there. My metagenome data volume appeared at `/dev/xvdc`; the device name can differ between instances, so run `lsblk` to see which device the volume was given. If you create the volume in the same availability zone as the instance (e.g. us-east-1d), you can also attach and mount it directly from the instance:

```
mkdir -p ~/data

aws ec2 attach-volume --volume-id vol-0bdfad3677d717075 --instance-id i-0a62227ff1d1977bd --device /dev/xvdh

sudo mount /dev/xvdh ~/data
```
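
To find which device a newly attached volume received, one option is to parse the machine-readable output of `lsblk` and look for whole disks that have no mount point. The following is a minimal sketch, assuming an `lsblk` version that supports `--json`; the helper name `unmounted_disks` and the sample JSON are illustrative, and the key name (`mountpoint` or `mountpoints`) varies between versions.

```python
import json


def unmounted_disks(lsblk_json):
    """Return /dev paths of whole disks with no mount point and no partitions."""
    devices = json.loads(lsblk_json)["blockdevices"]
    found = []
    for dev in devices:
        if dev.get("type") != "disk":
            continue
        # Older lsblk reports "mountpoint"; newer versions report a "mountpoints" list.
        mounts = dev.get("mountpoints") or [dev.get("mountpoint")]
        mounted = any(m for m in mounts)
        if not mounted and not dev.get("children"):
            found.append("/dev/" + dev["name"])
    return found


# Illustrative sample of `lsblk --json` output (not taken from a real instance).
SAMPLE = """
{"blockdevices": [
  {"name": "xvda", "type": "disk", "mountpoint": null,
   "children": [{"name": "xvda1", "type": "part", "mountpoint": "/"}]},
  {"name": "xvdf", "type": "disk", "mountpoint": null}
]}
"""

print(unmounted_disks(SAMPLE))
```

On a live instance, the JSON text could come from `subprocess.run(["lsblk", "--json"], capture_output=True, text=True).stdout` instead of the sample string.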

In [2]:
!lsblk
!mkdir ~/data
!sudo mount /dev/xvdf ~/data

mkdir: cannot create directory ‘/home/ubuntu/data’: File exists


# II. Inputs and pre-processing

Cerulean requires that we use *ABySS* to assemble contigs from our short reads, and then map the PacBio long reads to the contigs using *BLASR*. 

First, create a folder in which to store working files.

In [3]:
!mkdir -p ~/hybrid

## Assembling short-read contigs

We'll do two assemblies:
* **illumina_4moleculo + pacbio**: paired-end reads from the library prepped for Moleculo, mapped onto PacBio long reads. We use the Illumina assembly rather than a PacBio assembly because this library had deeper coverage.
* **illumina_4moleculo + moleculo**: the same paired-end reads, but mapped onto Moleculo long reads instead of PacBio.

To start, copy the Illumina short-read files into the new directory. Binning found both Alphaproteobacteria ("a" prefix) and Bacteroidetes ("b" prefix); we only want to copy over the Bacteroidetes files.

In [None]:
#!cp -r ~/data/metagenomes/sequence_reads/illumina_4moleculo/hiseq.raw.fastq ~/hybrid
!mkdir -p ~/hybrid/reads-by-genome
!cp -R ~/data2/metagenomes/sequence_reads/illumina_4moleculo/quality-trimmed-reads/reads-by-genome/b* ~/hybrid/reads-by-genome/

These are interleaved files, so each of the 5 files needs to be deinterleaved into separate read-1 and read-2 files. First, list the files to process:

In [6]:
!for filename in ~/hybrid/reads-by-genome/*; do echo "$filename"; done

/home/ubuntu/hybrid/reads-by-genome/b1_flavo.interleaved.fasta
/home/ubuntu/hybrid/reads-by-genome/b2_owen.interleaved.fasta
/home/ubuntu/hybrid/reads-by-genome/b3_bact.interleaved.fasta
/home/ubuntu/hybrid/reads-by-genome/b4_cyt1.interleaved.fasta
/home/ubuntu/hybrid/reads-by-genome/b5_cyt2.interleaved.fasta
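
The deinterleaving step itself is not shown in these notes. A minimal sketch of how it might be done in Python follows, assuming that records in each `.interleaved.fasta` file strictly alternate between read 1 and read 2 of a pair. The function names and the `_1.fasta` / `_2.fasta` output naming are illustrative choices, not part of the original pipeline.

```python
import os
import tempfile


def read_fasta(handle):
    """Yield (header, sequence_lines) for each record in a FASTA file handle."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif line:
            lines.append(line)
    if header is not None:
        yield header, lines


def deinterleave(interleaved_path, out1_path, out2_path):
    """Write alternating records to two files; return the number of pairs."""
    count = 0
    with open(interleaved_path) as src, \
            open(out1_path, "w") as out1, open(out2_path, "w") as out2:
        for index, (header, lines) in enumerate(read_fasta(src)):
            # Even-indexed records are read 1, odd-indexed records are read 2.
            target = out1 if index % 2 == 0 else out2
            target.write(header + "\n")
            for seq_line in lines:
                target.write(seq_line + "\n")
            count += 1
    if count % 2 != 0:
        raise ValueError("odd number of records; input may not be interleaved pairs")
    return count // 2


# Small self-contained demonstration on made-up reads.
demo_dir = tempfile.mkdtemp()
demo_in = os.path.join(demo_dir, "demo.interleaved.fasta")
with open(demo_in, "w") as fh:
    fh.write(">pair1/1\nACGT\n>pair1/2\nTTGA\n")
pairs = deinterleave(demo_in,
                     os.path.join(demo_dir, "demo_1.fasta"),
                     os.path.join(demo_dir, "demo_2.fasta"))
print("pairs written:", pairs)
```

To process the real data, `deinterleave` could be called once per file listed above, with output paths derived from each input name.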


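Now, assemble the contigs. The parameter `k=64` sets the k-mer size; 64 is the largest value that a default ABySS 1.x build accepts. It's a good idea to run this inside a `screen` session so the assembly keeps running if your connection drops. To check which processes are running, use `top` (press `q` to quit).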

The `n=10` parameter is the minimum number of read pairs required for building contigs (10 is the ABySS default).

In [7]:
%cd ~/hybrid/hiseq.raw.fastq
!abyss-pe k=64 n=10 name=HiSeqABySS in='Pb2_HiSeqStandardIllumina_1.fastq Pb2_HiSeqStandardIllumina_2.fastq'



This generates two files that serve as inputs to Cerulean, where `<dataname>` is the value passed as `name=` (here, `HiSeqABySS`):
* `<dataname>-contigs.fa`: the contig sequences
* `<dataname>-contigs.dot`: the graph structure

### Map PacBio reads to ABySS contigs using BLASR

Note: `sawriter` and `blasr` are part of the SMRT Analysis toolkit.

Note: You need to set the environment variables and path first:
   
```
$ export SEYMOUR_HOME=/opt/smrtanalysis/
$ source $SEYMOUR_HOME/etc/setup.sh
```
   
Suppose the PacBio reads are stored in `<dataname>_pacbio.fasta`:

```
$ sawriter <dataname>-contigs.fa
$ blasr <dataname>_pacbio.fasta <dataname>-contigs.fa -minMatch 10 \
     -minPctIdentity 70 -bestn 30 -nCandidates 30 -maxScore -500 \
     -nproc <numthreads> -noSplitSubreads \
     -out <dataname>_pacbio_contigs_mapping.fasta.m4
```
   
Make sure the generated `.fasta.m4` file has the following columns, in this order:
`qname tname qstrand tstrand score pctsimilarity tstart tend tlength qstart qend qlength ncells`.
You can verify the file format by adding the `-header` option to `blasr`.
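
As a quick sanity check on the mapping file, one could confirm that every data line has exactly 13 whitespace-separated fields, matching the column list above. This is a minimal sketch: `check_m4_lines` is an illustrative helper, the sample lines are made up, and it assumes that a header row (if present) starts with `qname` in some capitalization.

```python
M4_COLUMNS = ["qname", "tname", "qstrand", "tstrand", "score", "pctsimilarity",
              "tstart", "tend", "tlength", "qstart", "qend", "qlength", "ncells"]


def check_m4_lines(lines):
    """Return (line_number, field_count) for each data line without 13 fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if number == 1 and fields[0].lower() == "qname":
            continue  # assumed header row
        if len(fields) != len(M4_COLUMNS):
            problems.append((number, len(fields)))
    return problems


# Made-up example lines, for illustration only.
good = "read1 contig7 0 0 -1200 85.3 100 900 5000 10 820 900 4000"
bad = "read2 contig9 0 1 -800"
print(check_m4_lines([good, bad]))
```

For a real file, pass an open file handle, for example `check_m4_lines(open(path))`; an empty result means every line has the expected number of fields.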


# III. Execute Cerulean

Cerulean requires that all input files are in the same directory `<basedir>`:
* `<basedir>/<dataname>-contigs.fa`
* `<basedir>/<dataname>-contigs.dot`
* `<basedir>/<dataname>_pacbio_contigs_mapping.fasta.m4`
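
Before launching Cerulean, it can help to confirm that all three inputs are present under the expected names. The following is a minimal sketch; `missing_cerulean_inputs` is an illustrative helper, and the demonstration uses a temporary directory with empty placeholder files rather than real data.

```python
import os
import tempfile


def missing_cerulean_inputs(basedir, dataname):
    """Return the expected input file names that are absent from basedir."""
    expected = [
        dataname + "-contigs.fa",
        dataname + "-contigs.dot",
        dataname + "_pacbio_contigs_mapping.fasta.m4",
    ]
    return [name for name in expected
            if not os.path.isfile(os.path.join(basedir, name))]


# Demonstration: create two of the three files and report the one still missing.
demo_base = tempfile.mkdtemp()
for name in ("demo-contigs.fa", "demo-contigs.dot"):
    open(os.path.join(demo_base, name), "w").close()
print(missing_cerulean_inputs(demo_base, "demo"))
```

An empty list means the directory is ready to be passed as `--basedir`.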

 To run:
 ```
 $ python src/Cerulean.py --dataname <dataname> --basedir <basedir> \
 --nproc <numthreads>
 ```
 
This will generate:
* `<basedir>/<dataname>_cerulean.fasta`
* `<basedir>/<dataname>_cerulean.dot`

Note: the `.dot` file does not contain the same contigs as the `.fasta` file; it describes an intermediate graph.
 


# IV. Post-processing

Currently, Cerulean does not include the consensus sequence of PacBio reads in gaps. The gaps can be filled using PBJelly. First, prepare its inputs:
 ```
 $ python $JELLYPATH/fakeQuals.py <dataname>_cerulean.fasta <dataname>_cerulean.qual
 $ python $JELLYPATH/fakeQuals.py <dataname>_pacbio.fasta <dataname>_pacbio.qual
 $ cp $JELLYPATH/lambdaExample/Protocol.xml .
 $ mkdir PBJelly
 ```
 
Modify `Protocol.xml` as follows:
* Set `<reference>` to `$PATH_TO_<basedir>/<dataname>_cerulean.fasta`
* Set `<outputDir>` to `$PATH_TO_<basedir>/PBJelly`
* Set `<baseDir>` to `$PATH_TO_<basedir>`
* Set `<job>` to `<dataname>_pacbio.fasta`
* Set the `<blasr>` option `-nproc <numthreads>`

Note: PBJelly requires that the file suffix be `.fasta`, not `.fa`.

Next, run PBJelly:
 
 ```
 $ source $JELLYPATH/exportPaths.sh   # only if the PBJelly paths are not already set
 $ python $JELLYPATH/Jelly.py <stage> Protocol.xml
 ```
 
where `<stage>` takes each of the following values, run one at a time in this order:
 ```
 setup
 mapping
 support
 extraction
 assembly
 output
 ```
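
Because the stages must run in a fixed order, a small driver script can reduce the chance of skipping one. This is a minimal sketch, assuming `Jelly.py` is invoked exactly as in the command above; the function names, the `dry_run` flag, and the example path `/path/to/PBJelly` are illustrative. By default it only prints the commands.

```python
import os
import subprocess

JELLY_STAGES = ["setup", "mapping", "support", "extraction", "assembly", "output"]


def jelly_commands(jellypath, protocol="Protocol.xml"):
    """Build one Jelly.py command (as an argument list) per stage, in order."""
    script = os.path.join(jellypath, "Jelly.py")
    return [["python", script, stage, protocol] for stage in JELLY_STAGES]


def run_jelly(jellypath, protocol="Protocol.xml", dry_run=True):
    """Print each stage command; execute it too when dry_run is False."""
    for command in jelly_commands(jellypath, protocol):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop at the first stage that fails.
            subprocess.run(command, check=True)


# Dry run with a placeholder path: prints the six commands without executing them.
run_jelly("/path/to/PBJelly")
```

Calling `run_jelly(os.environ["JELLYPATH"], dry_run=False)` would execute the stages for real, stopping if any stage exits with an error.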
 
The assembled contigs can be found in
 ```
 <basedir>/PBJelly/assembly/jellyOutput.fasta
 ```

TODO: Quality-test this assembly against other assemblies.