
What is ReMap?

ReMap is a project whose goal is to provide the largest catalogue of high-quality regulatory regions, resulting from a large-scale integrative analysis of hundreds of transcription factors and general components of the transcriptional machinery from DNA-binding experiments. Check out ReMap here: http://remap.univ-amu.fr/

What does this repository contain?

This repository contains all the files necessary to run a ReMap-style analysis, from the annotation files to the final BED catalogue. The ReMap pipeline is made of several workflows that follow each other.

Table of Contents

  1. Installation
  2. Main Workflow
  3. Post processing
  4. Non redundant processing

Installation

Install the necessary environments with Conda or Docker.

Clone the code

git clone git@github.com:remap-cisreg/remap-pipeline.git

Conda install

You don't need to install specific Conda environments: Snakemake will create them from the recipes in 2.scripts/conda_envirronment during its first run. The Conda environments will be placed in the .snakemake/conda folder of the working directory.
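
If you prefer to build the environments up front, for example before submitting jobs to a cluster, recent Snakemake versions provide a flag for this (a minimal sketch; adjust the snakefile and config paths to your setup):

# create all Conda environments without running any rule
snakemake --use-conda --conda-create-envs-only \
    --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_v4.py \
    --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json \
    --cores 1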

Docker images install

The Docker images can be found in the Repository list on Docker Hub.

Pull all the necessary images:

docker pull benoitballester/remap_bowtie2:latest
docker pull benoitballester/remap_aria2:latest
docker pull benoitballester/remap_phantompeakqualtools:latest
docker pull benoitballester/remap_trimgalore:latest
docker pull benoitballester/remap_bedtools:latest
docker pull benoitballester/remap_samtools:latest
docker pull benoitballester/remap_ucsc_apps:latest

Singularity images install

If you use Singularity, you will need to convert the Docker images into Singularity images. Please refer to Sylabs.io for details.
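
For example, singularity build can pull and convert an image directly from Docker Hub (a minimal sketch; the .sif output name is arbitrary):

# convert the Docker image into a local Singularity image file
singularity build remap_bowtie2.sif docker://benoitballester/remap_bowtie2:latest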

Main Workflow

The main workflow (ReMap v4) covers all steps from downloading the FASTQ files to peak calling.

Quick Usage

A more in-depth explanation is here.

  1. Extract the relevant info from the annotation files with 2.scripts/utils/python3/extract_info_download_encode_v4.py for ENCODE data or 2.scripts/utils/python3/extract_info_download_v3.py for GEO data (an illustrative call is sketched after this list).
  2. Create config files for snakemake and cluster
  3. Run snakemake
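
For step 1, the scripts take your annotation file as input; the exact arguments below are an assumption, so check each script's usage before running:

# hypothetical invocation: extract per-experiment metadata from an ENCODE annotation file
python3 2.scripts/utils/python3/extract_info_download_encode_v4.py annotation_encode.tsv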

For slurm

snakemake (--use-singularity|--use-conda) \
    --singularity-args "-B /path/to/working_directory:/path/to/working_directory" \
    --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_v4.py \
    --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_slurm.json \
    --cluster "sbatch --job-name {cluster.job-name} -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task={cluster.thread} -o {cluster.stdout} -e {cluster.stderr} --time {cluster.time} --mem-per-cpu={cluster.memory}" \
    --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json \
    --resources res=150 --cores X

For Torque

snakemake (--use-singularity|--use-conda) \
    --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_v4.py \
    --printshellcmds --cores X \
    --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_torque.json \
    --cluster "qsub -V -q {cluster.queue} -l nodes={cluster.node}:ppn={cluster.thread} -o {cluster.stdout} -e {cluster.stderr}" \
    --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json

Rulegraph

[rulegraph image of the workflow]

Usage

Requirements

Software

If you have Conda, all the environment recipes we used for ReMap 2020 are in 2.scripts/conda_envirronment, in YAML format. The Docker/Singularity images are not included in the repository because of space constraints.

If you don't want to use Conda, or want to create your own images, here are the required tools and versions:

  • Aria2 >= 1.34.0
  • Trim Galore >= 0.4.3
  • Bowtie2 >= 2.3.4.1
  • Samtools >= 1.9
  • MACS2 >= 2.1.1.20160309
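
If the tools are already installed on your system, you can check that they meet these minimum versions (assuming the binaries are available on your PATH under their usual names):

# print the version of each required tool
aria2c --version
trim_galore --version
bowtie2 --version
samtools --version
macs2 --version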

Config files

Config files are in JSON format only. They can live anywhere, but for organisation we recommend putting them in 2.scripts/cluster_configuration or 2.scripts/snakemake_configuration. These files are specific to the Snakemake workflows.

Cluster config

JSON files containing the information needed to run the workflow on a cluster architecture (for more information, see https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html?highlight=cluster#cluster-configuration). If you want to quickly run the workflow, just modify one of the examples already in 2.scripts/cluster_configuration.

The minimum information in the default section is:

  • "queue": name of the queue.
  • "node": number of node used by each rules.
  • "thread": number of cpu per tasks used by each rules.
  • "stdout": Where standard output for each files are written.
  • "stderr": Where standard error for each files are written

/!\ For stdout and stderr, the file names contain wildcards. As the wildcards are not the same for all rules, keep:

  • {wildcards.replicat_name} for default
  • {wildcards.fastq_name} for aria2
  • {wildcards.replicat_name_paired} for delete_trim_paired

Depending on your job manager, other parameters can be given (contact your sysadmin). We used:

  • "ntasks": number of task per job
  • "time": time wall for each rules
  • "memory": memory for each rules (Be careful depending on cluster configuration asking more memory can request more cpu)
  • "mail-type": send a mail at specific steps of a workflow
  • "mail-user": where to send the mail

All of this information can be overridden for each rule.
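
Putting this together, a Slurm cluster config might look like the following sketch, written here as a shell heredoc (the key names follow the description above; the file name, the values and the bowtie2 override are placeholder assumptions, and remember that the aria2 and delete_trim_paired rules need the different wildcards listed above in their stdout/stderr names):

# write a minimal cluster config with a default section and one per-rule override
cat > 2.scripts/cluster_configuration/my_cluster_slurm.json <<'EOF'
{
    "__default__": {
        "job-name": "remap.{rule}",
        "partition": "normal",
        "ntasks": 1,
        "thread": 1,
        "time": "12:00:00",
        "memory": "4G",
        "stdout": "logs/{rule}.{wildcards.replicat_name}.out",
        "stderr": "logs/{rule}.{wildcards.replicat_name}.err"
    },
    "bowtie2": {
        "thread": 8,
        "memory": "8G"
    }
}
EOF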

/!\ The thread value for bowtie2 must match the one in the Snakemake config! /!\

Metadata (tab) files

Metadata files are necessary to run the workflow; there is one metadata file per experiment. Metadata files are created from a big annotation file by the Python scripts 2.scripts/utils/python3/extract_info_download_encode_v4.py for ENCODE data or 2.scripts/utils/python3/extract_info_download_v3.py for GEO data. Both need an annotation file containing all the data you want to run.

The annotation file must contain one experiment per line. It must have at least 6 columns, with the following headers:

  • ID: ID from the database of origin (e.g. GSE, ENC*, etc.)
  • TARGET: official name of the target protein
  • BIOTYPE + modification: name of the biotype and modification, separated by "_" if there is one (BIOTYPE_MODIFICATION)
  • Replicat ID: IDs of the experiment replicates from the database of origin, separated by ";"
  • Control ID: IDs of the control replicates from the database of origin, separated by ";"
  • Corresponding ENA ID: the corresponding ENA accession (SRP)
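
An illustrative annotation line (tab-separated; all accessions and names here are hypothetical):

ID	TARGET	BIOTYPE + modification	Replicat ID	Control ID	Corresponding ENA ID
GSE12345	ESR1	MCF-7	GSM1111111;GSM1111112	GSM1111113	SRP000001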

The metadata file format is one replicate/control per line, with 5 columns:

  • "filename": final name of file. Are ReMap like ..
  • "control": 1 if it is a control, 0 if not
  • "library": library type PAIRED or SINGLE
  • "url": Full url where files are
  • "md5": MD5 for specific file

Snakemake config

A JSON file containing all the information necessary to run the workflow. Examples are in 2.scripts/snakemake_configuration. The JSON structure should not be changed unless you are an expert Snakemake user.

Launch file available here

Post processing

This workflow performs the quality-control steps and creates a ReMap-like file containing all the peaks from the peak-calling results. As quality control for ChIP-seq/DAP-seq differs from quality control for ChIP-exo, two workflows exist.

Quick Usage

This workflow uses the outputs created by Snakefile ReMap v4. It also needs the summary tab files created for Snakefile ReMap v4 (see here how to create them).

To run the workflow:

For slurm

snakemake (--use-singularity|--use-conda) \
    --singularity-args "-B /path/to/working_directory:/path/to/working_directory" \
    --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_post_processing.py \
    --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_slurm.json \
    --cluster "sbatch --job-name {cluster.job-name} -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task={cluster.thread} -o {cluster.stdout} -e {cluster.stderr} --time {cluster.time} --mem-per-cpu={cluster.memory}" \
    --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json \
    --resources res=150 --cores X

For Torque

snakemake (--use-singularity|--use-conda) \
    --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_post_processing.py \
    --printshellcmds --cores X \
    --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_torque.json \
    --cluster "qsub -V -q {cluster.queue} -l nodes={cluster.node}:ppn={cluster.thread} -o {cluster.stdout} -e {cluster.stderr}" \
    --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json

Non redundant processing

This workflow creates non-redundant files per TF and per biotype. From a ReMap-like file of all peaks, it creates one directory per TF and one per biotype. Each of those directories contains a file with all the peaks for that feature and a non-redundant file. It also creates, for each TF, a FASTA file containing the sequences where it binds.

Quick Usage

This workflow uses the outputs created by Snakefile ReMap Post-processing. It needs:

  • A file listing all experiments (one experiment per line).
  • The complete genome in Fasta format.
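
An illustrative experiment list (one experiment name per line; the names here are hypothetical):

GSE12345.ESR1.MCF-7
GSE67890.CTCF.HeLa-S3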

To run the workflow:

For slurm

snakemake (--use-singularity|--use-conda) \
    --singularity-args "-B /path/to/working_directory:/path/to/working_directory" \
    --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_non_redundant.py \
    --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_slurm.json \
    --cluster "sbatch --job-name {cluster.job-name} -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task={cluster.thread} -o {cluster.stdout} -e {cluster.stderr} --time {cluster.time} --mem-per-cpu={cluster.memory}" \
    --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json \
    --resources res=150 --cores X

For Torque

snakemake (--use-singularity|--use-conda) \
    --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_non_redundant.py \
    --printshellcmds --cores X \
    --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_torque.json \
    --cluster "qsub -V -q {cluster.queue} -l nodes={cluster.node}:ppn={cluster.thread} -o {cluster.stdout} -e {cluster.stderr}" \
    --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json

Other workflows available

ReMap v4 with bigwig compute

This workflow is a spin-off of Snakefile ReMap v4 that also produces bigWig files from the FASTQ files.

Citation

ReMap 2020: a database of regulatory regions from an integrative analysis of Human and Arabidopsis DNA-binding sequencing experiments. Jeanne Chèneby, Zacharie Ménétrier, Martin Mestdagh, Thomas Rosnet, Allyssa Douida, Wassim Rhalloussi, Aurélie Bergon, Fabrice Lopez, Benoit Ballester. Nucleic Acids Research, 29 October 2019. https://doi.org/10.1093/nar/gkz945