Home
ReMap is a project whose goal is to provide the largest catalogue of high-quality regulatory regions, resulting from a large-scale integrative analysis of hundreds of transcription factors and general components of the transcriptional machinery from DNA-binding experiments. Check out ReMap here: http://remap.univ-amu.fr/
This repository contains all files necessary to run a ReMap-style analysis, from annotation files to the final BED catalogue. The ReMap pipeline is made of several workflows that follow one another.
Install the necessary environments with Conda or Docker.
git clone git@github.com:remap-cisreg/remap-pipeline.git
You don't need to install specific Conda environments: Snakemake will create them from the recipes in 2.scripts/conda_envirronment during its first run. The Conda environments will be placed in the .snakemake/conda folder of the working directory.
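With recent Snakemake releases you can also pre-build all the environments before the first real run. A minimal sketch, assuming the Snakefile and config file paths used in the commands below (the flag is --conda-create-envs-only in recent releases; older releases called it --create-envs-only):
snakemake --use-conda --conda-create-envs-only --cores 1 --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_v4.py --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json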
The Docker images are available in the benoitballester repository list on Docker Hub. Pull all the necessary images:
docker pull benoitballester/remap_bowtie2:latest
docker pull benoitballester/remap_aria2:latest
docker pull benoitballester/remap_phantompeakqualtools:latest
docker pull benoitballester/remap_trimgalore:latest
docker pull benoitballester/remap_bedtools:latest
docker pull benoitballester/remap_samtools:latest
docker pull benoitballester/remap_ucsc_apps:latest
If you use Singularity, you will need to convert them into Singularity images. Please refer to Sylabs.io for details.
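For example, Singularity 3.x can pull a Docker image from Docker Hub and convert it to a SIF image in one step (repeat for each image listed above):
singularity pull docker://benoitballester/remap_bowtie2:latest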
Main workflow (ReMap v4), covering all steps from downloading FASTQ files to peak calling.
A more in-depth explanation is available here.
- Extract the relevant information from the annotation file with 2.scripts/utils/python3/extract_info_download_encode_v4.py for ENCODE data or 2.scripts/utils/python3/extract_info_download_v3.py for GEO data (see the sketch after this list).
- Create the configuration files for Snakemake and the cluster.
- Run Snakemake (Slurm and Torque commands below).
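For step 1, a minimal sketch (the arguments are hypothetical; check each script for its actual usage):
python3 2.scripts/utils/python3/extract_info_download_encode_v4.py <annotation_file>   # ENCODE data, hypothetical argument
python3 2.scripts/utils/python3/extract_info_download_v3.py <annotation_file>   # GEO data, hypothetical argument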
For Slurm
snakemake (--use-singularity|--use-conda) --singularity-args "-B /path/to/working_directory:/path/to/working_directory" --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_v4.py --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_slurm.json --cluster "sbatch --job-name {cluster.job-name} -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task={cluster.thread} -o {cluster.stdout} -e {cluster.stderr} --time {cluster.time} --mem-per-cpu={cluster.memory} " --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json --resources res=150 --cores X
For Torque
snakemake (--use-singularity|--use-conda) --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_v4.py --printshellcmds --cores X --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_torque.json --cluster "qsub -V -q {cluster.queue} -l nodes={cluster.node}:ppn={cluster.thread} -o {cluster.stdout} -e {cluster.stderr}" --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json
If you have Conda, all the environment recipes we used for ReMap 2020 are in 2.scripts/conda_envirronment in YAML format. Docker/Singularity images are not included in the repository for space reasons.
If you don't want to use Conda, or want to create your own images, here are the required tools and versions:
- Aria2 >= 1.34.0
- Trim Galore >= 0.4.3
- Bowtie2 >= 2.3.4.1
- Samtools >= 1.9
- MACS2 >= 2.1.1.20160309
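If you would rather build a single environment by hand than use the provided YAML recipes, a minimal sketch assuming the bioconda and conda-forge channels carry these packages under these names:
conda create -n remap_tools -c bioconda -c conda-forge "aria2>=1.34.0" "trim-galore>=0.4.3" "bowtie2>=2.3.4.1" "samtools>=1.9" "macs2>=2.1.1"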
Configuration files are in JSON format only. They can be placed anywhere, but we recommend putting them in 2.scripts/cluster_configuration or 2.scripts/snakemake_configuration for organisation. These files are specific to the Snakemake workflow.
JSON files containing the information needed to run the workflow on a cluster architecture (for more information, see https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html?highlight=cluster#cluster-configuration). If you want to quickly run the workflow, just modify one of the examples already in 2.scripts/cluster_configuration.
The minimum information required in the default section is:
- "queue": name of the queue.
- "node": number of node used by each rules.
- "thread": number of cpu per tasks used by each rules.
- "stdout": Where standard output for each files are written.
- "stderr": Where standard error for each files are written
/!\ For stdout and stderr, the file names contain wildcards. As the wildcards are not the same for all rules, keep:
- {wildcards.replicat_name} for the default
- {wildcards.fastq_name} for aria2
- {wildcards.replicat_name_paired} for delete_trim_paired
Depending on your job manager, other parameters can be given (contact your sysadmin). We used:
- "ntasks": number of task per job
- "time": time wall for each rules
- "memory": memory for each rules (Be careful depending on cluster configuration asking more memory can request more cpu)
- "mail-type": send a mail at specific steps of a workflow
- "mail-user": where to send the mail
All of this information can be overridden for each rule, as in the sketch below.
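A minimal sketch of a cluster configuration using the keys above (all values and the per-rule overrides are illustrative only; start from the examples in 2.scripts/cluster_configuration):
{
    "__default__": {
        "queue": "batch",
        "node": 1,
        "thread": 1,
        "stdout": "logs/{rule}.{wildcards.replicat_name}.out",
        "stderr": "logs/{rule}.{wildcards.replicat_name}.err"
    },
    "aria2": {
        "stdout": "logs/aria2.{wildcards.fastq_name}.out",
        "stderr": "logs/aria2.{wildcards.fastq_name}.err"
    },
    "bowtie2": {
        "thread": 8
    }
}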
/!\ The thread value for bowtie2 must match the one in the Snakemake config! /!\
Metadata files are necessary to run the workflow; there is one metadata file per experiment. Metadata files are created from a single large annotation file by the Python scripts 2.scripts/utils/python3/extract_info_download_encode_v4.py for ENCODE data or 2.scripts/utils/python3/extract_info_download_v3.py for GEO data. Both need an annotation file listing all the data you want to process.
The annotation file must contain one experiment per line. It must have at least 6 columns with the following headers:
- ID: ID from the database of origin (e.g. GSE, ENC*, etc.)
- TARGET: official name of the target protein
- BIOTYPE + modification: name of the biotype and the modification, separated by "_" if there is one (BIOTYPE_MODIFICATION)
- Replicat ID: IDs of the experiment replicates from the database of origin, separated by ";"
- Control ID: IDs of the control replicates from the database of origin, separated by ";"
- Corresponding ENA ID: SRP
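A hypothetical annotation line, assuming a tab-separated file (all values are made up for illustration):
ID	TARGET	BIOTYPE + modification	Replicat ID	Control ID	Corresponding ENA ID
GSE00001	TF1	BIOTYPE1_MOD1	GSM000001;GSM000002	GSM000003	SRP000001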
The metadata file format is one replicate/control per line, with 5 columns:
- "filename": final name of file. Are ReMap like ..
- "control": 1 if it is a control, 0 if not
- "library": library type PAIRED or SINGLE
- "url": Full url where files are
- "md5": MD5 for specific file
A JSON file containing all the information necessary to run the workflow. Examples are in 2.scripts/snakemake_configuration. The JSON structure should not be changed unless you are an expert Snakemake user.
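Purely as an illustration of the shape of such a file, a sketch in which every key and value is hypothetical (start from the EXAMPLE file, not from this sketch):
{
    "working_dir": "/path/to/working_directory",
    "genome": "/path/to/genome.fa",
    "metadata_dir": "/path/to/metadata"
}
(All keys above are invented for illustration; the real keys are those in the EXAMPLE file.)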
Launch file available here.
This workflow performs the quality-control steps and creates a ReMap-like file containing all peaks from the peak-calling results. As quality control for ChIP-seq/DAP-seq differs from quality control for ChIP-exo, two workflows exist.
This workflow uses the outputs created by Snakefile ReMap v4. It also needs the summary tab files created for Snakefile ReMap v4 (see here how to create them).
To run the workflow:
For Slurm
snakemake (--use-singularity|--use-conda) --singularity-args "-B /path/to/working_directory:/path/to/working_directory" --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_post_processing.py --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_slurm.json --cluster "sbatch --job-name {cluster.job-name} -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task={cluster.thread} -o {cluster.stdout} -e {cluster.stderr} --time {cluster.time} --mem-per-cpu={cluster.memory} " --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json --resources res=150 --cores X
For Torque
snakemake (--use-singularity|--use-conda) --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_post_processing.py --printshellcmds --cores X --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_torque.json --cluster "qsub -V -q {cluster.queue} -l nodes={cluster.node}:ppn={cluster.thread} -o {cluster.stdout} -e {cluster.stderr}" --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json
This workflow creates non-redundant files per TF and per biotype. From a ReMap-like "all peaks" file, it creates one directory per TF and per biotype. Each directory contains a file with all the peaks and a non-redundant file for that feature. It also creates, for each TF, a FASTA file containing the sequences where it binds.
This workflow uses the outputs created by Snakefile ReMap Post-processing. It needs:
- A file listing all experiments (one experiment per line).
- The complete genome in Fasta format.
To run the workflow:
For Slurm
snakemake (--use-singularity|--use-conda) --singularity-args "-B /path/to/working_directory:/path/to/working_directory" --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_non_redundant.py --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_slurm.json --cluster "sbatch --job-name {cluster.job-name} -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task={cluster.thread} -o {cluster.stdout} -e {cluster.stderr} --time {cluster.time} --mem-per-cpu={cluster.memory} " --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json --resources res=150 --cores X
For Torque
snakemake (--use-singularity|--use-conda) --snakefile 2.scripts/snakefiles_workflow/Snakefile_remap_non_redundant.py --printshellcmds --cores X --cluster-config 2.scripts/cluster_configuration/EXAMPLE_cluster_torque.json --cluster "qsub -V -q {cluster.queue} -l nodes={cluster.node}:ppn={cluster.thread} -o {cluster.stdout} -e {cluster.stderr}" --configfile 2.scripts/snakemake_configuration/EXAMPLE_Snakefile_config_remap.json
This workflow is a spin-off of Snakefile ReMap v4 that produces bigWig files from FASTQ files.
ReMap 2020: a database of regulatory regions from an integrative analysis of Human and Arabidopsis DNA-binding sequencing experiments. Jeanne Chèneby, Zacharie Ménétrier, Martin Mestdagh, Thomas Rosnet, Allyssa Douida, Wassim Rhalloussi, Aurélie Bergon, Fabrice Lopez, Benoit Ballester. Nucleic Acids Research, 29 October 2019, https://doi.org/10.1093/nar/gkz945