Bacass - a pipeline for bacterial genome assembly from long-read sequences

About

Bacass is a workflow for filtering reads, de novo genome assembly, and genome annotation for bacterial isolates.

Installation

This workflow utilises Docker for downloading the required databases and running the pipeline, and therefore requires a working Docker installation.

To install, clone this repository into a local environment:

git clone https://github.com/samuelmontgomery/bacass
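Then change into the cloned repository (assuming the default clone directory name) so the wrapper scripts are available:

cd bacass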

Database installation

To install the databases, first pull the Docker image used for installation:

docker pull samueltmontgomery/bacassdb

Then run the install_db.sh wrapper script, specifying the database location, e.g.

install_db.sh -d /scratch/database

Running the pipeline

To run the pipeline, first pull or build the Docker image:

docker pull samueltmontgomery/bacass
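Alternatively, to build the image locally, something like the following should work (this assumes the Dockerfile sits at the repository root and that the wrapper script expects this image tag):

docker build -t samueltmontgomery/bacass .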

Then run the wrapper script:

bac_assembly.sh -i INPUT -o OUTPUT [-f FORMAT] -l LENGTH -d DATABASE

Options:
  -i, --input       Specify the input directory path (required)
  -o, --output      Specify the output directory path (required)
  -f, --format      Specify the input format (default: fastq.gz, options: bam)
  -l, --length      Specify the expected genome length (required)
  -d, --database    Specify the directory of bakta database (required)
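
As an illustrative example only (the paths and genome length below are placeholders, not defaults), a run might look like:

bac_assembly.sh -i /data/run01 -o /data/run01_results -l 5000000 -d /scratch/database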

The pipeline should be run on the output of MinKNOW or Dorado basecalling: either folders of fastq.gz files split by barcode (MinKNOW) or unmapped BAM files (Dorado).

The pipeline requires a file called "barcodes.txt" in the input folder: a tab-separated file mapping each barcode to its bacterial isolate ID (there is an example in the test directory). When run, the pipeline either renames the folders of fastq.gz files to match the isolate names in barcodes.txt, or creates folders named after each isolate and moves the corresponding BAM files into them within the input directory.
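
As a sketch (assuming one line per barcode, with the barcode in the first column and the isolate ID in the second, separated by a tab; the names below are placeholders), barcodes.txt might look like:

barcode01	isolate_A
barcode02	isolate_B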

The results will be written into matching folders in the output directory.
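
Assuming the example barcodes.txt above and fastq.gz input, the directories would end up structured roughly like this (illustrative only):

INPUT/
  barcodes.txt
  isolate_A/   (renamed from barcode01, containing its fastq.gz files)
  isolate_B/   (renamed from barcode02)
OUTPUT/
  isolate_A/   (results for isolate_A)
  isolate_B/   (results for isolate_B)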

This directory structure is required for the parallelisation of the pipeline, which reduces runtime. The pipeline also assumes 16 CPUs and 64 GB of system RAM; the script will need editing if that is not the case.