- Make sure that the latest versions of `samtools`, `bcftools`, and `tabix` are installed.
- Download the latest Nextflow from https://www.nextflow.io.
- Create a Python 3 virtual environment: `virtualenv -p python3.6 venv`.
- Activate the virtual environment: `source venv/bin/activate`.
- Install the Python packages: `pip install -r requirements.txt`.
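Before starting, it can help to confirm that the required executables are visible on `PATH`. The following is a minimal sketch, not part of the pipeline: the function name `find_missing_tools` is hypothetical, and it assumes the `nextflow` launcher was placed on `PATH` after download. It only checks presence, not whether the installed versions are the latest.

```python
import shutil

# Tools named in the prerequisites; "nextflow" assumes the launcher is on PATH.
REQUIRED_TOOLS = ("samtools", "bcftools", "tabix", "nextflow")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If the tools are installed outside `PATH`, this check is not needed, because the `nextflow.config` files accept explicit paths to each executable.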
Run Step 1: prepare intervals covering CDS regions (i.e. protein coding exons and some padding around them)
- `cd` into the `intervals` directory.
- Download the latest GENCODE GTF file: `wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz`
- Create intervals: `python make_intervals.py -g gencode.v31.annotation.gtf.gz -o cds_intervals`. Intervals will be saved into the `cds_intervals` directory. By default, chromosome names have the `chr` prefix. If your reference genome doesn't have this prefix, specify the `--no-chr-prefix` option when running `make_intervals.py`.
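To illustrate what "CDS regions with some padding" means, here is a minimal sketch of turning one GTF record into a padded interval. It is not the code of `make_intervals.py`: the function name `cds_interval`, the default padding of 10 bp, and the `chr_prefix` parameter are assumptions made for illustration only.

```python
def cds_interval(gtf_line, padding=10, chr_prefix=True):
    """Return (chrom, start, end) for a GTF CDS record, or None otherwise.

    `padding` (hypothetical default) is the number of bases added on each side.
    """
    if gtf_line.startswith("#"):
        return None  # header/comment line
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "CDS":
        return None  # not a CDS feature
    chrom = fields[0]
    if not chr_prefix and chrom.startswith("chr"):
        chrom = chrom[3:]  # e.g. "chr1" -> "1"
    # GTF coordinates are 1-based and inclusive; never go below position 1.
    start = max(1, int(fields[3]) - padding)
    end = int(fields[4]) + padding
    return chrom, start, end
```

A real implementation would also need to merge overlapping intervals and clip the end coordinate to the chromosome length; both are left out of this sketch.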
- `cd` into the `coverage` directory.
- Edit the `nextflow.config` file:
  - `bams_list_path` -- points to the file where each line is a whitespace-delimited tuple: sample name, absolute path to the corresponding BAM/CRAM file.
  - `cram` -- `true` if working with CRAMs.
  - `intervals` -- points to the `*.list` files with intervals generated in Step 1.
  - `reference_path` -- points to the FASTA file (`*.fa`) with the genome reference. The file must be indexed (i.e. the corresponding `*.fai` should be in the same directory).
  - `samtools` -- path to the `samtools` executable.
  - `max_depth` -- maximal depth per sample (see the documentation for `samtools mpileup` for details).
  - `mpileup_subset` -- points to the `mpileup_subset.py` script (preferably an absolute path).
  - `bcftools` -- path to the `bcftools` executable.
  - `tabix` -- path to the `tabix` executable.

  Edit other options related to SLURM or local execution as needed.
- Run `nextflow run Coverage.nf`. Preferably run from a `tmux` session. After a crash (e.g. SLURM node failure), use `nextflow run Coverage.nf -resume`.
- The final BCF files with DP information for each CDS base pair and each sample are located in the `results/merged` folder.
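A malformed `bams_list_path` file is an easy mistake to make, so it may be worth validating it before launching the pipeline. The sketch below is not part of the pipeline; the function name `parse_bams_list` is hypothetical, and rejecting duplicate sample names is an assumption rather than a documented requirement.

```python
import os


def parse_bams_list(lines):
    """Parse (sample name, absolute path) lines into a dict.

    Raises ValueError on a wrong column count, a relative path,
    or a repeated sample name (the last check is an assumption).
    """
    samples = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 columns, got {len(fields)}")
        sample, path = fields
        if not os.path.isabs(path):
            raise ValueError(f"line {number}: path is not absolute: {path}")
        if sample in samples:
            raise ValueError(f"line {number}: duplicate sample name: {sample}")
        samples[sample] = path
    return samples
```

It could be used as `parse_bams_list(open("bams.list"))`, where `bams.list` stands in for whatever file `bams_list_path` points to.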
- Make sure that the latest versions of `bcftools` and `tabix` are installed.
- Download the latest Nextflow from https://www.nextflow.io.
- `cd` into the `subset_normalize` directory.
- Edit the `nextflow.config` file:
  - `bams_list_path` -- same file as in the coverage computation: points to the file where each line is a whitespace-delimited tuple: sample name, absolute path to the corresponding BAM/CRAM file.
  - `vcfs` -- path to the VCF or BCF files with genotype information. The files must include both PASS and QC-failed variants.
  - `reference_path` -- points to the FASTA file (`*.fa`) with the genome reference. The file must be indexed (i.e. the corresponding `*.fai` should be in the same directory).
  - `bcftools` -- path to the `bcftools` executable.
  - `tabix` -- path to the `tabix` executable.

  Edit other options related to SLURM or local execution as needed.
- Run `nextflow run Subset.nf`. Preferably run from a `tmux` session. After a crash (e.g. SLURM node failure), use `nextflow run Subset.nf -resume`.
- The final BCF files with normalized variants are located in the `results/` folder.
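Since the input files must contain both PASS and QC-failed variants, a quick look at the FILTER column can catch inputs that were pre-filtered. The sketch below is only an illustration: the function names are hypothetical, and it assumes uncompressed VCF text lines (for BCF or compressed input, the text could come from `bcftools view`). It treats any FILTER value other than `PASS` and `.` as a QC failure.

```python
from collections import Counter


def count_filter_values(vcf_lines):
    """Count the values of the FILTER column (7th field) over all records."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 7:
            counts[fields[6]] += 1
    return counts


def has_pass_and_failed(vcf_lines):
    """True if at least one PASS and one non-PASS (filtered) record exist."""
    counts = count_filter_values(vcf_lines)
    has_pass = counts.get("PASS", 0) > 0
    # "." means no filters were applied, so it is not counted as a failure.
    has_failed = any(value not in ("PASS", ".") for value in counts)
    return has_pass and has_failed
```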
- Make sure that the latest versions of `bcftools` and `tabix` are installed.
- Make sure that `VEP` (Variant Effect Predictor) v96 or higher is installed.
- `cd` into the `annotate` directory.
- Edit the `nextflow.config` file:
  - `vcfs` -- path to the VCF/BCF files generated in the previous step.
  - `vep` -- path to the `VEP` executable.
  - `vep_flags` -- set any additional VEP flags if needed.
  - `bcftools` -- path to the `bcftools` executable.
  - `tabix` -- path to the `tabix` executable.

  Edit other options related to SLURM or local execution as needed.
- Run `nextflow run Annotate.nf`. Preferably run from a `tmux` session. After a crash (e.g. SLURM node failure), use `nextflow run Annotate.nf -resume`.
- The final VCF files with annotated variants are located in the `results/vep` folder.
- `cd` into the `compare` directory.
- Edit the `nextflow.config` file:
  - `pairs_list_path` -- mapping file with sample IDs (no header). Each line has two whitespace-delimited (or tab-delimited) columns. The first column stores the sample ID in study 1, the second column stores the corresponding sample ID in study 2. See `example_pairs.list`.
  - `study1_files_list_path` -- file with absolute paths to the coverage, genotype, and annotation files generated in the previous steps for study 1. The file has three whitespace-delimited (or tab-delimited) columns (no header). The first column stores the absolute path to the coverage files from step A, the second column stores the absolute path to the genotype files from step B, and the third column stores the absolute path to the annotation files from step C. Important: each row must store the corresponding files for the same chromosome. See `study1_files.list`.
  - `study2_files_list_path` -- file with the same structure as `study1_files_list_path`, but with absolute paths to the coverage, genotype, and annotation files for study 2.
  - `compare` -- absolute path to the `compare.py` script.

  Edit other options related to SLURM or local execution as needed.
- Run `nextflow run Compare.nf`. Preferably run from a `tmux` session. After a crash (e.g. SLURM node failure), use `nextflow run Compare.nf -resume`.
- The final gzip-compressed summary files are located in the `results/` folder.
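The two list formats described for the comparison step (two columns for sample pairs, three columns for per-chromosome files) can be checked with a few lines of Python before running the workflow. This is a minimal sketch with hypothetical function names; it only verifies the column counts and does not confirm that the files in a row belong to the same chromosome, since the file naming scheme is not specified here.

```python
def parse_columns(lines, n_columns):
    """Split non-blank lines on whitespace and require exactly n_columns fields."""
    rows = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # handles both spaces and tabs
        if not fields:
            continue  # skip blank lines
        if len(fields) != n_columns:
            raise ValueError(
                f"line {number}: expected {n_columns} columns, got {len(fields)}"
            )
        rows.append(tuple(fields))
    return rows


def parse_pairs_list(lines):
    """Return (study 1 sample ID, study 2 sample ID) tuples."""
    return parse_columns(lines, 2)


def parse_study_files_list(lines):
    """Return (coverage path, genotype path, annotation path) tuples."""
    return parse_columns(lines, 3)
```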
- `cd` into the `histograms` directory.
- Edit the `nextflow.config` file:
  - `gencode_gtf_path` -- path to the GENCODE GTF file. Change only if a version other than `v31` is needed.
  - `coverage_files_path` -- path to the VCF/BCF files generated in step A (i.e. in the `sequencing_comparison/coverage/results/merged` directory).
  - `coverage_files_index_suffix` -- change to `tbi` if the VCF/BCF files in step A were indexed using the TBI index (i.e. default tabix).
  - `histograms` -- absolute path to the `histograms.py` script.

  Edit other options related to SLURM or local execution as needed.
- Run `nextflow run Histograms.nf`. Preferably run from a `tmux` session. After a crash (e.g. SLURM node failure), use `nextflow run Histograms.nf -resume`.
- The final gzip-compressed histogram files are located in the `results/` folder.
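To decide what to put in `coverage_files_index_suffix`, one can look at which index file sits next to a coverage file. The sketch below assumes that the alternative to `tbi` is a CSI index with the `csi` suffix and that index files are named by appending the suffix to the data file name; both are assumptions, as is the function name `detect_index_suffix`.

```python
import os


def detect_index_suffix(coverage_file, exists=os.path.exists):
    """Return "csi" or "tbi" depending on which index file is present.

    Returns None if neither `<file>.csi` nor `<file>.tbi` exists.
    `exists` can be replaced for testing without touching the file system.
    """
    for suffix in ("csi", "tbi"):
        if exists(f"{coverage_file}.{suffix}"):
            return suffix
    return None
```

For example, `detect_index_suffix("results/merged/chr1.bcf")` would report the suffix for a hypothetical file `chr1.bcf` from step A.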