
ChAR-seq pipelines

This repository contains pipeline templates to process ChAR-seq raw sequencing reads. It also contains notebooks describing how to prepare the necessary files for the pipeline (in progress).

Currently available pipelines (in the pipeline folder)

  • Paired End reads ("standard" pipeline)

Available notebooks (in progress)

Dependencies

We recommend installing all Python-related packages, including tagtools and chartools, in a conda environment. For example, create a charseq environment with

conda create -n charseq python=3.6

Then activate this environment and install the other packages:

conda activate charseq
conda install pandas numpy pysam pyarrow snakemake
# then the pip instructions to install tagtools and chartools

Note that many of the bioinformatics tools above can also be installed through Anaconda. In particular, pairix, bbmap, and pear can be installed from the bioconda channel:

conda install -c bioconda pairix bbmap pear

The bioconda channels may need to be set up first, following the instructions at https://bioconda.github.io/user/install.html (section 2, set up channels).

How to run a pipeline

The pipelines are written in Snakemake, and each pipeline consists of three files, described below. Copy these files to the desired run directory, edit the yaml configuration file and the sample definition file appropriately, then execute snakemake:

snakemake -pr -s pipeline.smk --configfile pipeline_config.yaml

Typically, the relevant Python packages listed above (including snakemake) are installed in a conda environment, so activate the environment before running snakemake.

The pipeline will generate its output files in ./data.

The three files used by the pipeline are:

  • a Snakemake file, pipeline.smk, which implements the pipeline steps. This file is portable across systems and does not need to be modified: it contains no hardcoded system-specific, sample-specific, or tool-configuration parameters.
  • a yaml configuration file, pipeline_config.yaml, which contains parameters for the various steps of the pipeline, such as the configuration of the trimmer and aligners, the read length cutoff, etc. Most parameters can be left unchanged, but the paths to a few required resource files (such as the bowtie2 index, annotation files, etc.) are hardcoded and need to be set up once.
  • a sample definition file, samples_def.yaml, which lists the samples to process and, for each sample: the paths to the fastq files, the sequence of the bridge (which can differ between samples), and the path to the adapters fasta file used for adapter trimming.
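As an illustration, a sample definition entry might look like the sketch below. The key names and layout here are assumptions for illustration only; use the samples_def.yaml template shipped with the pipeline for the actual field names.

```yaml
# Hypothetical sketch of a samples_def.yaml entry; actual key names may differ.
samples:
  my_sample_1:
    fastq:
      - path/to/my_sample_1_R1.fastq.gz
      - path/to/my_sample_1_R2.fastq.gz
    bridge: NNNNNNNNNN            # placeholder for this sample's bridge sequence
    adapters: path/to/adapters.fa # fasta file used for adapter trimming
```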

Important notes:

  • Always execute the pipeline in dry-run mode first, using snakemake's -n option!
  • If the pipeline is run on a cluster, dry runs are fine on a login node. Once ready to launch the pipeline for real, however, do not do so on a login node: either submit an sbatch script or launch snakemake directly on an allocated compute node.
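For cluster execution, a minimal sbatch wrapper can look like the sketch below. The resource requests and the environment activation line are assumptions to adjust for your site; the snakemake invocation is the one shown above, with --cores set from the allocation.

```shell
# Write a hypothetical sbatch script for the pipeline (resource values are placeholders)
cat > run_pipeline.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=charseq
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=24:00:00

# activate the conda environment that holds snakemake and the python packages
source activate charseq

snakemake -pr -s pipeline.smk --configfile pipeline_config.yaml --cores "$SLURM_CPUS_PER_TASK"
EOF
echo "submit with: sbatch run_pipeline.sbatch"
```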

Preparation of annotation files

In order to run, the pipeline first requires annotation files for the genome of interest. These are used by tagtools to annotate the reads with transcript and gene names. The starting point for creating these annotation files is a gff3 file for the genome of interest; refer to the xenopus_laevis example in the notebooks folder.
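The annotation files themselves are produced with tagtools (see the notebooks). Purely as an illustration of the kind of information pulled from a gff3, the snippet below builds a toy two-record file and lists its gene features; the file content and field handling are illustrative, not the tagtools format.

```shell
# Build a toy gff3 (tab-separated columns: seqid, source, type, start, end,
# score, strand, phase, attributes) and list its gene records.
printf 'chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=g1;Name=actb\n'   >  mini.gff3
printf 'chr2\tsrc\tgene\t900\t2000\t.\t-\t.\tID=g2;Name=gapdh\n' >> mini.gff3

# Print the attributes plus coordinates of every "gene" feature
awk -F'\t' '$3 == "gene" {print $9 "\t" $1 ":" $4 "-" $5 "(" $7 ")"}' mini.gff3
```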

Pipeline structures

Basic pipeline

The schematic below shows the steps of the pipeline for paired-end sequencing reads.

(figure: pipeline DAG)