Overview

FLEA is a bioinformatics pipeline for analyzing longitudinal sequencing data from the Pacific Biosciences RS-II or Sequel. It currently supports full-length HIV env sequences.

The pipeline takes a set of FASTQ files, one per time point, containing circular consensus sequence (CCS) reads, which can be obtained using the "Reads of Insert" protocol on PacBio’s SMRTportal or SMRTanalysis tools. It produces a JSON file containing the following results:

a multiple sequence alignment of high-quality consensus sequences for each time point
a maximum-likelihood phylogenetic tree, inferred using FastTree
the most recent common ancestor (MRCA) and other inferred ancestor sequences
a two-dimensional embedding that respects TN93 sequence distances
per-site selection pressure, inferred using FUBAR, and other per-site evolutionary metrics
per-segment evolutionary and phenotypic metrics, inferred using HyPhy

The pipeline logic is implemented in Nextflow. A full description of the pipeline has been submitted for publication. A link to the journal article will be added here when it is available.

Setup

Dependencies

Nextflow
Python
USEARCH
MAFFT
HyPhy
FastTree
TN93
GNU parallel
Python dependencies (see below)

Install Python scripts

FLEA comes with a flea Python package containing scripts used throughout the pipeline. To install requirements and the scripts themselves (virtualenv recommended):

pip install -r requirements.txt
pip install -r requirements2.txt
python setup.py install

To test:

python setup.py nosetests

Configuration

The default config file is nextflow.config. We recommend creating a separate config file that overrides only the options you need to customize. For more information on Nextflow-specific configuration, see the Nextflow documentation.

At the very least, params.reference_dir and the parameters that depend on it need to point to the various reference files used by the pipeline:

params.reference_db: FASTA file of reference sequences
params.contaminants_db: FASTA file of contaminant sequences
params.reference_dna: reference DNA sequence
params.reference_protein: reference amino acid sequence
params.reference_coordinates:
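
As an illustration of such an override file, the following Python sketch renders Nextflow config assignments for the reference parameters listed above. The helper name render_config, the directory, and all file names are hypothetical; substitute the files you actually have. params.reference_coordinates is left out because its expected format is not described here.

```python
from pathlib import Path


def render_config(reference_dir, filenames):
    """Return Nextflow config text that overrides the reference parameters.

    `filenames` maps a parameter name (without the `params.` prefix)
    to a file name inside `reference_dir`.
    """
    ref = Path(reference_dir)
    lines = [f"params.reference_dir = '{ref.as_posix()}'"]
    for param, name in filenames.items():
        lines.append(f"params.{param} = '{(ref / name).as_posix()}'")
    return "\n".join(lines) + "\n"


# Hypothetical file names; replace them with your own reference files.
example_files = {
    "reference_db": "references.fasta",
    "contaminants_db": "contaminants.fasta",
    "reference_dna": "reference.dna.fasta",
    "reference_protein": "reference.protein.fasta",
}

if __name__ == "__main__":
    # Print the overrides; redirect the output to a custom config file.
    print(render_config("/data/flea/references", example_files), end="")
```

The printed text can be saved to a file and passed to Nextflow with the -c option, as shown under Usage.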

Usage

Write a control file listing the FASTQ files, visit codes, and dates, separated by spaces, with one line per time point.

<file> <visit code> <date>
<file> <visit code> <date>
....

Dates must be in 'YYYYMMDD' format.
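
Checking the control file before launching a long run can catch formatting mistakes early. The sketch below is not part of FLEA; parse_control_file is a hypothetical helper that assumes exactly three whitespace-separated fields per line (so file paths must not contain spaces) and skips blank lines.

```python
from datetime import datetime


def parse_control_file(text):
    """Parse control-file text into (file, visit_code, date) tuples.

    Raises ValueError if a line does not have three fields or if the
    date is not a valid calendar date in YYYYMMDD format.
    """
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(
                f"line {lineno}: expected 3 fields, got {len(fields)}"
            )
        path, visit_code, date_str = fields
        # strptime alone tolerates some short forms, so check the shape first.
        if len(date_str) != 8 or not date_str.isdigit():
            raise ValueError(f"line {lineno}: date must be YYYYMMDD")
        date = datetime.strptime(date_str, "%Y%m%d").date()
        records.append((path, visit_code, date))
    return records
```

For example, parse_control_file(open("path/to/metadata").read()) returns one tuple per time point, or raises ValueError naming the first offending line.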

Run the pipeline with Nextflow:

nextflow path/to/flea.nf -c path/to/custom/config/file \
  --infile path/to/metadata \
  --results_dir path/to/results

The results directory will contain output from many pipeline steps. The two files that contain the final results are:

session.json: a JSON file to be visualized with flea-web-app.
session.zip: a zip file with FASTA files for the consensus sequences, ancestors, and MRCA, and a Newick file containing the rooted phylogenetic tree.

These files can be served with flea-server and visualized with flea-web-app.
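
For a quick look at the results without the web app, the two files can be opened with the Python standard library. The internal layout of session.json and the member names in session.zip are not documented here, so this sketch (with the hypothetical helper summarize_results) only lists the top-level JSON keys and the archive members.

```python
import json
import zipfile
from pathlib import Path


def summarize_results(results_dir):
    """Return (top-level JSON keys, zip member names) for a results directory."""
    results = Path(results_dir)

    with open(results / "session.json") as handle:
        session = json.load(handle)
    # The top-level structure is assumed to be an object; otherwise report no keys.
    keys = sorted(session) if isinstance(session, dict) else []

    with zipfile.ZipFile(results / "session.zip") as archive:
        members = sorted(archive.namelist())

    return keys, members
```

Calling summarize_results("path/to/results") after a run shows which result sections and which FASTA and Newick files were produced.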
