DoGE: Differential Gene Expression Analysis Pipeline

BIO792: Next Generation Sequencing Data Analysis

Requirements

R libraries:

calibrate [http://cran.r-project.org/package=calibrate]
DESeq2 [https://bioconductor.org/packages/3.10/bioc/html/DESeq2.html]
dplyr [http://cran.r-project.org/package=dplyr]
GGally [http://cran.r-project.org/package=GGally]
ggplot2 [http://cran.r-project.org/package=ggplot2]
ggrepel [http://cran.r-project.org/package=ggrepel]
gplots [http://cran.r-project.org/package=gplots]
gridExtra [http://cran.r-project.org/package=gridExtra]
kableExtra [http://cran.r-project.org/package=kableExtra]
knitr [http://cran.r-project.org/package=knitr]
latex2exp [http://cran.r-project.org/package=latex2exp]
pander [http://cran.r-project.org/package=pander]
RColorBrewer [http://cran.r-project.org/package=RColorBrewer]
rmarkdown [http://cran.r-project.org/package=rmarkdown]

Setup

git clone https://github.com/villegar/doge
cd doge
conda env create -f environment.yml -n DoGE
conda activate DoGE # or: source activate DoGE (older conda versions)
python download.genome.py genomes/X-genome.json

Genome file

A good place to find reference genomes and gene annotations is http://uswest.ensembl.org/info/data/ftp/index.html. The reference files must be listed in a JSON file (for example, X-genome.json) that follows the template below, mapping each file name to its download URL:

{
    "X.fa.gz":
        "ftp://ftp.ensembl.org/pub/some/path/to/X.fa.gz",
    "X.gtf.gz":
        "ftp://ftp.ensembl.org/pub/some/path/to/X.gtf.gz"
}
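As an illustration of the template above, the following Python sketch parses a genome file and separates its entries into sequence and annotation downloads. It is not the repository's download.genome.py; the function name read_genome_manifest and the file-suffix checks are assumptions made for this example only.

```python
import json


def read_genome_manifest(text):
    """Group the entries of a genome JSON document by file type.

    Returns {"sequence": [...], "annotation": [...]}, where each entry is a
    (file_name, url) pair. Hypothetical helper, not part of DoGE.
    """
    manifest = json.loads(text)
    groups = {"sequence": [], "annotation": []}
    for file_name, url in manifest.items():
        # Assumption: FASTA files end in .fa(.gz) and annotations in .gtf(.gz).
        if file_name.endswith((".fa.gz", ".fa")):
            groups["sequence"].append((file_name, url))
        elif file_name.endswith((".gtf.gz", ".gtf")):
            groups["annotation"].append((file_name, url))
        else:
            raise ValueError(f"Unrecognised file type: {file_name}")
    return groups


# The template from this README, with its placeholder paths left as-is.
template = """
{
    "X.fa.gz": "ftp://ftp.ensembl.org/pub/some/path/to/X.fa.gz",
    "X.gtf.gz": "ftp://ftp.ensembl.org/pub/some/path/to/X.gtf.gz"
}
"""

groups = read_genome_manifest(template)
for kind, entries in groups.items():
    for file_name, url in entries:
        print(f"{kind}: {file_name} <- {url}")
```

A check like this can catch a malformed genome file before any download is attempted.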

Execution

Single node

# -j: maximum number of CPUs available to Snakemake
# --configfile: configuration file
snakemake -j CPUS \
          --configfile config.json

Multi-node

# -j: maximum number of simultaneous jobs to spawn
# --configfile: configuration file
# --latency-wait: file latency in seconds
# --cluster-config: cluster configuration file
snakemake -j JOBS \
          --configfile config.json \
          --latency-wait 1000 \
          --cluster-config cluster.json \
          --cluster "sbatch --job-name={cluster.name} \
                            --nodes={cluster.nodes} \
                            --ntasks-per-node={cluster.ntasks} \
                            --output={cluster.log} \
                            --partition={cluster.partition} \
                            --time={cluster.time}"

Alternatively

bash run_cluster config.json &> log &

Cluster configuration (cluster.json)

{
    "__default__" :
    {
        "time" : "1-00:00:00",
        "nodes" : 1,
        "partition" : "compute",
        "ntasks": "{threads}",
        "name": "DoGE-{rule}",
        "log": "DoGE-{rule}-%J.log"
    }
}
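To make the link between cluster.json and the --cluster option more concrete, the sketch below mimics how the {cluster.*} placeholders could be filled in for a single job. It only illustrates the substitution and is not Snakemake's implementation; the helper names, the rule name alignment, and the thread count of 4 are assumptions chosen for the example.

```python
from types import SimpleNamespace

# Contents of cluster.json as shown in this README.
CLUSTER_CONFIG = {
    "__default__": {
        "time": "1-00:00:00",
        "nodes": 1,
        "partition": "compute",
        "ntasks": "{threads}",
        "name": "DoGE-{rule}",
        "log": "DoGE-{rule}-%J.log",
    }
}

# The sbatch template passed to --cluster, joined into one line.
SBATCH_TEMPLATE = (
    "sbatch --job-name={cluster.name} --nodes={cluster.nodes} "
    "--ntasks-per-node={cluster.ntasks} --output={cluster.log} "
    "--partition={cluster.partition} --time={cluster.time}"
)


def resolve_cluster_settings(cluster_config, rule, threads):
    """Merge __default__ with any rule-specific entry, then fill {rule}/{threads}."""
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule, {}))  # rule entries override defaults
    return {
        key: value.format(rule=rule, threads=threads) if isinstance(value, str) else value
        for key, value in settings.items()
    }


def build_submit_command(template, settings):
    """Substitute the {cluster.<key>} placeholders in the submission template."""
    return template.format(cluster=SimpleNamespace(**settings))


# Hypothetical job: rule "alignment" running with 4 threads.
settings = resolve_cluster_settings(CLUSTER_CONFIG, rule="alignment", threads=4)
command = build_submit_command(SBATCH_TEMPLATE, settings)
print(command)
```

With only a __default__ entry, every rule is submitted with the same resources; adding a key named after a rule would override individual values for that rule.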

Pipeline configuration (config.json)

The genome section MUST contain the path to the X-genome.json file.
The reads section tells the pipeline the location (path), format (extension), type (end_type), and prefix (prefix) of the raw reads. If end_type = pe (paired-end), both the forward (forward_read_id) and reverse (reverse_read_id) read identifiers (e.g. 1, R1, 2, R2, etc.) should also be specified.
The trimmomatic section should contain a sub-key called options with the trimming parameters, excluding the input and output file names, which are set by the pipeline.

{
    "genome": "/path/to/X-genome.json",
    "reads": {
        "extension": "fastq",
        "end_type": "se",
        "forward_read_id": "1",
        "reverse_read_id": "2",
        "path": "/path/to/raw/reads",
        "prefix": "SRR"
    },
    "trimmomatic":{
      "options": "ILLUMINACLIP:{input.adapter}/TruSeq3-SE-2.fa:2:30:10:2:keepBothReads"
    }
}
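The rules described above can be checked before launching the workflow. The following sketch is a hypothetical validator, not code from the pipeline; it assumes that se and pe are the only accepted values of end_type, and that every key shown in the example is required.

```python
# Keys of the reads section described in this README.
REQUIRED_READ_KEYS = ("extension", "end_type", "path", "prefix")


def validate_config(config):
    """Return a list of problems found in a config.json dictionary (empty if none)."""
    problems = []
    if "genome" not in config:
        problems.append("missing 'genome'")

    reads = config.get("reads", {})
    for key in REQUIRED_READ_KEYS:
        if key not in reads:
            problems.append(f"missing 'reads.{key}'")

    end_type = reads.get("end_type")
    # Assumption: only single-end (se) and paired-end (pe) are valid.
    if end_type is not None and end_type not in ("se", "pe"):
        problems.append(f"unknown 'reads.end_type': {end_type}")
    if end_type == "pe":
        # Paired-end data needs both read identifiers.
        for key in ("forward_read_id", "reverse_read_id"):
            if key not in reads:
                problems.append(f"missing 'reads.{key}' for paired-end reads")

    if "options" not in config.get("trimmomatic", {}):
        problems.append("missing 'trimmomatic.options'")
    return problems


single_end = {
    "genome": "/path/to/X-genome.json",
    "reads": {"extension": "fastq", "end_type": "se",
              "path": "/path/to/raw/reads", "prefix": "SRR"},
    "trimmomatic": {"options": "TRAILING:3 MINLEN:24"},
}
paired_end = dict(single_end, reads=dict(single_end["reads"], end_type="pe"))

print(validate_config(single_end))
print(validate_config(paired_end))
```

The first call reports no problems, while the second flags the two missing read identifiers.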

Case Study

Data set

For this case study, the article "LncRNA DEANR1 facilitates human endoderm differentiation by activating FOXA2 expression" was consulted: https://doi.org/10.1016/j.celrep.2015.03.008

Accession numbers

SRR1958165
SRR1958166
SRR1958167
SRR1958168
SRR1958169
SRR1958170

Configuration file

{
    "genome": "genomes/human-genome.json",
    "reads": {
        "extension": "fastq",
        "end_type": "se",
        "path": "/path/to/reads",
        "prefix": "SRR"
    },
    "trimmomatic":{
      "options": "ILLUMINACLIP:{input.adapter}/TruSeq3-SE-2.fa:2:30:10:2:keepBothReads TRAILING:3 MINLEN:24"
    }
}
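One plausible reading of the reads section is that every file in path whose name starts with prefix and ends with extension is treated as a sample. The sketch below illustrates that reading with a temporary directory; the function find_samples and the file naming scheme (accession followed by the extension) are assumptions, and the pipeline may locate its inputs differently.

```python
import tempfile
from pathlib import Path


def find_samples(reads):
    """List sample names for files matching <prefix>*.<extension> in reads['path'].

    Hypothetical helper; assumes single-end files named <sample>.<extension>.
    """
    suffix = "." + reads["extension"]
    pattern = reads["prefix"] + "*" + suffix
    files = Path(reads["path"]).glob(pattern)
    # Strip the extension so only the sample (accession) name remains.
    return sorted(f.name[: -len(suffix)] for f in files)


with tempfile.TemporaryDirectory() as tmp:
    # Create empty stand-ins for two of the accessions plus an unrelated file.
    for name in ("SRR1958165.fastq", "SRR1958166.fastq", "notes.txt"):
        (Path(tmp) / name).touch()

    reads = {"extension": "fastq", "end_type": "se", "path": tmp, "prefix": "SRR"}
    samples = find_samples(reads)

print(samples)
```

Files that do not match both the prefix and the extension, such as notes.txt here, are ignored.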

Execution

It is good practice to perform a dry run of the workflow before submitting it for execution. This can be done by adding the -n option to the snakemake command:

snakemake --configfile config.json -n

The output will display a summary of each job that will be processed and a final summary that should look like:

Job counts:
        count   jobs
        6       alignment
        6       alignment_quality
        1       all
        1       annotation_table
        6       fastqc_raw
        6       fastqc_trimmed
        1       feature_counts
        1       hisat2_index
        1       quantification_table
        1       rmd_report
        6       sam2bam
        6       trim_reads
        42
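The total of 42 follows from the six accessions listed above: rules with a count of 6 appear to run once per sample, and the remaining rules run once. The short calculation below reproduces the summary; the split into per-sample and single-run rules is inferred from the counts, not taken from the Snakefile.

```python
samples = ["SRR1958165", "SRR1958166", "SRR1958167",
           "SRR1958168", "SRR1958169", "SRR1958170"]

# Rules listed with a count of 6 in the dry-run summary (inferred: one job per sample).
per_sample_rules = ["alignment", "alignment_quality", "fastqc_raw",
                    "fastqc_trimmed", "sam2bam", "trim_reads"]

# Rules listed with a count of 1 (inferred: one job per workflow run).
single_rules = ["all", "annotation_table", "feature_counts",
                "hisat2_index", "quantification_table", "rmd_report"]

job_counts = {rule: len(samples) for rule in per_sample_rules}
job_counts.update({rule: 1 for rule in single_rules})

for rule in sorted(job_counts):
    print(f"{job_counts[rule]:<8}{rule}")

total = sum(job_counts.values())
print(total)  # 6 rules x 6 samples + 6 single jobs = 42
```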

For a graphical summary of the jobs above, see the directed acyclic graph: https://raw.githubusercontent.com/villegar/DoGE/master/images/dag.png

Single node execution

# -j: maximum number of CPUs available to Snakemake
# --configfile: configuration file
snakemake -j CPUS \
          --configfile config.json
