# I. Getting Started

Start by creating a volume from the Pink Berries Metagenome snapshot, then start a CeruleanTools AMI instance with that volume attached.

You then have to make a directory for the drive and mount it. My metagenome data volume was at `/dev/xvdc`; the device name can differ between instances, so list the attached block devices with `lsblk` to find yours. If you create the volume in the same availability zone as your instance (e.g. us-east-1d), you can also attach and mount it directly from the instance:

```
mkdir data

aws ec2 attach-volume --volume-id vol-0bdfad3677d717075 --instance-id i-0a62227ff1d1977bd --device /dev/xvdh

sudo mount /dev/xvdh ~/data
```

In [2]:
!lsblk
!mkdir ~/data
!sudo mount /dev/xvdf ~/data

mkdir: cannot create directory ‘/home/ubuntu/data’: File exists


# II. Inputs and pre-processing

Cerulean requires that we use *ABySS* to assemble contigs from our short reads, and then map the PacBio long reads to the contigs using *BLASR*. 

First, create a folder in which to store working files.

In [3]:
!mkdir -p ~/hybrid

## ABySS: assembling short-read contigs

We'll do two assemblies:
* **illumina_4moleculo + pacbio**: paired-end reads prepped for Moleculo, mapped onto PacBio long reads (using the Illumina assembly instead of a PacBio assembly because this library had deeper coverage).
* **illumina_4moleculo + moleculo**: paired-end reads prepped for Moleculo, but mapped onto Moleculo long reads instead of PacBio.

### Setup

To start, we copy the Illumina short-read files into our new directory. Binning found both alphaproteobacteria ("a" prefix) and bacteroidetes ("b" prefix); we only want to copy over the bacteroidetes.

In [None]:
!mkdir -p ~/hybrid/reads-by-genome && cp -R ~/data/metagenomes/sequence_reads/illumina_4moleculo/quality-trimmed-reads/reads-by-genome/b* ~/hybrid/reads-by-genome/

### ABySS pre-processing

**Deinterleaving:** These are interleaved files. To deinterleave each of them, run the following:
```
cd ~/hybrid/reads-by-genome/
mkdir -p ~/hybrid/deinterleaved
# Assumes two-line FASTA records whose headers end in _1 or _2
for FILE in *; do
  OUT=~/hybrid/deinterleaved/"$FILE"-deinterleaved
  mkdir -p "$OUT"
  grep -A1 "_1$" "$FILE" | grep -v "^--$" > "$OUT"/reads-1.fasta
  grep -A1 "_2$" "$FILE" | grep -v "^--$" > "$OUT"/reads-2.fasta
done
```
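
For readers who prefer to see the logic spelled out, here is a minimal Python sketch of the same deinterleaving idea. It is not the method used in this walkthrough; the function name `deinterleave` and the example read names are made up for illustration, and it assumes headers end in `_1` or `_2` as the `grep` patterns above do.

```python
def deinterleave(lines):
    """Split interleaved FASTA lines into (mate1, mate2) lists of lines.

    Headers are assumed to end in _1 or _2 (the same assumption the grep
    patterns make). Records whose header has neither suffix are skipped.
    """
    mate1, mate2 = [], []
    target = None  # list that the current record is being copied into
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if line.endswith("_1"):
                target = mate1
            elif line.endswith("_2"):
                target = mate2
            else:
                target = None
        if target is not None:
            target.append(line)
    return mate1, mate2


# Hypothetical two-record example
example = [">read7_1", "ACGT", ">read7_2", "TTGA"]
first, second = deinterleave(example)
print(first)
print(second)
```

To use it on a real file, pass an open file handle (e.g. `deinterleave(open(path))`) and write each returned list to `reads-1.fasta` and `reads-2.fasta`.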

Depending on what the files are named, it might be good to tweak the above to give more sensible names to your folders. I did this manually. However, you probably want to keep the name of each pair of files within each folder the same: reads-1 and reads-2. This makes the assembly step simpler.

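**Naming reads properly**: ABySS also needs each read's mate number to be separated from the read name by a slash. Run from the directory containing the folders for each binned OTU (i.e. `~/hybrid/deinterleaved`), the following command replaces the separator before the mate number with a slash:

    before: DJB775P1:392:D1R59ACXX:2:1310:17052:38927-2
    after:  DJB775P1:392:D1R59ACXX:2:1310:17052:38927/2

Note that the `sed` patterns below match an underscore (`_1`, `_2`), as the deinterleaving `grep` patterns do; if your read names use a hyphen as in this example, change the patterns accordingly.

The `-i` flag makes `sed` edit each file in place. The `.bak` suffix tells it to keep a backup copy, and supplying a suffix is required by some `sed` implementations (e.g. BSD/macOS).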

In [None]:
!cd ~/hybrid/deinterleaved/ && for FOLDER in * ; do for FILE in "$FOLDER"/*; do sed -i.bak -e 's/_1$/\/1/' -e 's/_2$/\/2/' "$FILE"; done; done
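
The renaming step can also be sketched in Python, which makes it easier to restrict the substitution to header lines. This is only an illustration: `slash_name` and `MATE_SUFFIX` are hypothetical names, and the example header is the read name from above written with an underscore separator, which is an assumption about how the reads are named.

```python
import re

# Matches a trailing _1 or _2 (the mate number) at the end of a header
MATE_SUFFIX = re.compile(r"_([12])$")


def slash_name(line):
    """Turn a trailing _1/_2 on a FASTA header into /1 or /2.

    Sequence lines (anything not starting with '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    return MATE_SUFFIX.sub(r"/\1", line)


renamed = slash_name(">DJB775P1:392:D1R59ACXX:2:1310:17052:38927_2")
print(renamed)
```

If the read names use a hyphen instead, changing the pattern to `r"-([12])$"` should be the only adjustment needed.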

This command creates some extraneous .bak files. Delete them:

In [None]:
!cd ~/hybrid/deinterleaved/ && for FOLDER in * ; do rm "$FOLDER"/*.bak; done

**Reverse-complementing (binned reads only):** Due to the binning process, the binned reads are in a forward-forward format, i.e. both reads in each pair run from 5' to 3'. (For more conceptual information on this, see http://www.cureffi.org/2012/12/19/forward-and-reverse-reads-in-paired-end-sequencing/.)

However, ABySS needs them to be in forward-reverse format, i.e. for each paired read, one needs to be reverse-complemented. We'll use Biopython to reverse-complement the reads.

Install Biopython using pip:

    pip install biopython

Then run the following python script in each binned folder to reverse-complement the `reads-2` files. Any lines that say "print" can be commented out if desired.

In [None]:
#Reverse-complement the reads in a fasta file, reads-2.fasta
#To be run within the folder in which each reads-2 file is located

from Bio import SeqIO

#"w" opens the file for writing, creating it if it doesn't exist
with open("rc-reads-2.fasta", "w") as rc_file:
    for seq_record in SeqIO.parse("reads-2.fasta", "fasta"):
        print("Reverse-complementing: " + seq_record.id)
        #print("ORIGINAL: " + str(seq_record.seq))
        seqRC = seq_record.reverse_complement(id=True) #preserves seq ID
        #print("REV-COMP: " + str(seqRC.seq))

        #write new record to file
        rc_file.write(">" + str(seqRC.id) + "\n")
        rc_file.write(str(seqRC.seq) + "\n")

print("Reverse-complementing complete!")
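
To make clear what reverse-complementing does to each read, here is a small standard-library sketch that does not depend on Biopython. It is for illustration only: the helper name `reverse_complement` and the example sequences are made up, and it only handles the bases A, C, G, T and N.

```python
# Translation table mapping each base to its complement (both cases)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


rc = reverse_complement("AACGT")
print(rc)
```

Applying the function twice returns the original sequence, which is a quick sanity check on any reverse-complemented file.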


### ABySS Assembly

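Now, assemble the contigs. The parameter `k=64` sets the k-mer length to its maximum, 64. It's probably a good idea to run this in a `screen` session. To check on what processes are running, use `top` (and `q` to quit).

In the binned case, I preferred doing this individually for each bacteroidetes bin so I could specify a different name for each file. If you reverse-complemented the second mates as described above, pass `rc-reads-2.fasta` in place of `reads-2.fasta` (or rename it first):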

```
cd ~/hybrid/deinterleaved/b1_flavo_deinterleaved
abyss-pe name=b1-flavo k=64 in='reads-1.fasta reads-2.fasta'
```
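
If there are many bins, building the command for each one programmatically saves retyping. The sketch below only constructs the argument list, using the `name=`, `k=` and `in=` parameters shown above; the helper name `abyss_command` is hypothetical and nothing is executed.

```python
import shlex


def abyss_command(name, k=64, reads=("reads-1.fasta", "reads-2.fasta")):
    """Return the abyss-pe argument list for one bin.

    The two read files are joined into a single in=... argument, which is
    what the quoted in='...' form does on the shell command line.
    """
    return ["abyss-pe", f"name={name}", f"k={k}", "in=" + " ".join(reads)]


cmd = abyss_command("b1-flavo")
print(shlex.join(cmd))  # shell-quoted form, for copy and paste
```

To actually run it from Python, something like `subprocess.run(cmd, cwd=bin_directory, check=True)` should work, where `bin_directory` is the folder holding that bin's read files.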


This will generate two files used as inputs to Cerulean:
* `<dataname>-contigs.fa`: contains the contig sequences
* `<dataname>-contigs.dot`: contains the graph structure

### Map PacBio reads to ABySS contigs using BLASR

Note: `sawriter` and `blasr` are part of the SMRT Analysis toolkit.

Note: You need to set the environment variables and path:
   
```
$ export SEYMOUR_HOME=/opt/smrtanalysis/
$ source $SEYMOUR_HOME/etc/setup.sh
```
   
Suppose the PacBio reads are stored in `<dataname>_pacbio.fasta`:

```
$ sawriter <dataname>-contigs.fa
$ blasr <dataname>_pacbio.fasta <dataname>-contigs.fa -minMatch 10 \
     -minPctIdentity 70 -bestn 30 -nCandidates 30 -maxScore -500 \
     -nproc <numthreads> -noSplitSubreads \
     -out <dataname>_pacbio_contigs_mapping.fasta.m4
```
   
Make sure the generated `.fasta.m4` file has the following columns, in this order:

`qname tname qstrand tstrand score pctsimilarity tstart tend tlength qstart qend qlength ncells`

The file format may be verified by adding the option `-header` to `blasr`.
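
A quick way to check the column layout is to parse a line and confirm it has the 13 fields listed above. The sketch below is an assumption-laden illustration: `parse_m4_line` is a hypothetical helper, fields are assumed to be whitespace-separated, and the example line contains invented values rather than real BLASR output.

```python
M4_COLUMNS = [
    "qname", "tname", "qstrand", "tstrand", "score", "pctsimilarity",
    "tstart", "tend", "tlength", "qstart", "qend", "qlength", "ncells",
]


def parse_m4_line(line):
    """Split one mapping line into a dict keyed by the expected column names.

    Raises ValueError when the number of fields does not match.
    """
    fields = line.split()
    if len(fields) != len(M4_COLUMNS):
        raise ValueError(
            f"expected {len(M4_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(M4_COLUMNS, fields))


# Invented example values, for illustration only
row = parse_m4_line("read1 contig5 0 1 -1200 85.3 10 510 2000 0 495 500 3000")
print(row["qname"], row["tname"], row["score"])
```

Running this over the first few lines of the `.m4` file is usually enough to catch a wrong output format before starting Cerulean.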


# III. Execute Cerulean

Cerulean requires that all input files are in the same directory `<basedir>`:
* `<basedir>/<dataname>-contigs.fa`
* `<basedir>/<dataname>-contigs.dot`
* `<basedir>/<dataname>_pacbio_contigs_mapping.fasta.m4`
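
Before launching Cerulean it can help to confirm that the three input files listed above are in place. The following is a minimal sketch; `missing_inputs` is a hypothetical helper, and the `"."` and `"b1-flavo"` arguments in the example call are placeholders for your own `<basedir>` and `<dataname>`.

```python
from pathlib import Path


def missing_inputs(basedir, dataname):
    """Return the expected Cerulean input files that do not exist."""
    base = Path(basedir)
    expected = [
        base / f"{dataname}-contigs.fa",
        base / f"{dataname}-contigs.dot",
        base / f"{dataname}_pacbio_contigs_mapping.fasta.m4",
    ]
    return [str(path) for path in expected if not path.is_file()]


# Placeholder arguments; an empty list means all three inputs were found
print(missing_inputs(".", "b1-flavo"))
```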

 To run:
 ```
 $ python src/Cerulean.py --dataname <dataname> --basedir <basedir> \
 --nproc <numthreads>
 ```
 
This will generate:
* `<basedir>/<dataname>_cerulean.fasta`
* `<basedir>/<dataname>_cerulean.dot`

Note: The `.dot` file does not contain the same contigs as the `.fasta` file; it holds the intermediate graph.
 


# IV. Post-processing

Currently, Cerulean does not include the consensus sequence of PacBio reads in gaps. The gaps may be filled using PBJelly:
 ```
 $ python $JELLYPATH/fakeQuals.py <dataname>_cerulean.fasta <dataname>_cerulean.qual
 $ python $JELLYPATH/fakeQuals.py <dataname>_pacbio.fasta <dataname>_pacbio.qual
 $ cp $JELLYPATH/lambdaExample/Protocol.xml .
 $ mkdir PBJelly
 ```
 
Modify `Protocol.xml` as follows:
* Set `<reference>` to `$PATH_TO_<basedir>/<dataname>_cerulean.fasta`
* Set `<outputDir>` to `$PATH_TO_<basedir>/PBJelly`
* Set `<baseDir>` to `$PATH_TO_<basedir>`
* Set `<job>` to `<dataname>_pacbio.fasta`
* In `<blasr>`, set the option `-nproc <numthreads>`
 
Note: PBJelly requires that the suffix be `.fasta` and not `.fa`.

Next, run PBJelly:
 
 ```
 ($ source $JELLYPATH/exportPaths.sh)
 $ python $JELLYPATH/Jelly.py <stage> Protocol.xml
 ```
 
where `<stage>` must be run in the following order:
 ```
 setup
 mapping
 support
 extraction
 assembly
 output
 ```
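
Since the stages must run in a fixed order, it can be convenient to generate the six commands in one go. The sketch below only builds and prints them; `jelly_commands` is a hypothetical helper and `/path/to/jelly` stands in for your own `$JELLYPATH`.

```python
# Stage names in the order given above
STAGES = ["setup", "mapping", "support", "extraction", "assembly", "output"]


def jelly_commands(jelly_path, protocol="Protocol.xml"):
    """Return one Jelly.py command (as an argument list) per stage, in order."""
    return [
        ["python", f"{jelly_path}/Jelly.py", stage, protocol]
        for stage in STAGES
    ]


# Placeholder path; substitute the real PBJelly install directory
for cmd in jelly_commands("/path/to/jelly"):
    print(" ".join(cmd))
```

Each stage should finish before the next one starts, so if you run these from Python, calling `subprocess.run(cmd, check=True)` on each command in turn is one way to stop at the first failure.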
 
The assembled contigs may be viewed in
 ```
 <basedir>/PBJelly/assembly/jellyOutput.fasta
 ```

TODO: quality-test this assembly against other assemblies.