This package is used to do CPD sequence data analysis.
- Prerequisites
- Installation
- Step by step tutorial
- (Optional) Demultiplex fastq file
- General QC
- Generate index files for reference genome (FASTA) files
- Align reads to genome using bowtie2
- (Optional) Correct GC content bias
- Count UV radiation induced DNA damage
- (Optional) Generate text files to inform estimated lesions along chromosomes
- (Optional) Subtract out short tandem repeat regions or narrow down to the interested gnomic regions
- (Optional) Estimate sample-wise normalization factors
- Quality control
- Generate a genome-wide UV damage distribution map
- Draw dinucleotide pileup figure in a specific genomic region type
- Compare UV radiation damage of sample(s) against the reference genome background
- Compare UV damage of sample(s) against reference genome background within a specific region type
- Compare UV damage between two regions for one or multiple samples
- Compare genome-wide UV damage between two groups of samples
- Compare UV damage between two groups of samples within a specific region type
- Running cpdseqer using singularity
Install Bowtie2 in Ubuntu with the following command:
sudo apt-get install -y bowtie2
Install tabix in Ubuntu with the following command:
sudo apt-get install –y tabix
Install Samtools in Ubuntu with the following command:
wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
tar -jxvf samtools-1.10.tar.bz2
cd samtools-1.10
./configure
make
sudo make install
Install bedtools in Ubuntu with the following command:
sudo apt-get install -y bedtools
Install Python 3.7 in Ubuntu. Most factory versions of Ubuntu18.04 and later come with python pre-installed. To check if Python is installed and the Python version, use the following command:
python -version
If Python is not installed or version is lower than 3.7, use the following commands in sequential order to install Python 3.7:
sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7
Install R packages in R
install.packages(c("knitr","rmarkdown", "data.table", "R.utils", "ggplot2", "reshape2"))
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR")
Install CPD-seq protocol package in Ubuntu with the following command:
sudo pip3 install git+git://github.com/shengqh/cpdseqer.git
If you don't have pip installed, you need to install pip first.
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python get-pip.py
We provide example data and code for test. It requires 34G for downloading and additional 20G for testing. Remember to come back to folder cpdseqer after each step.
wget -r --no-parent --reject "index.html*" --reject-regex "bed_files" https://cqsweb.app.vumc.org/Data/cpdseqer/
cd cqsweb.app.vumc.org/Data/cpdseqer/
If raw sequencing data in FASTQ format is multiplexed, it needs to be first de-multiplexed based on barcode sequence. Otherwise, skip to the next step. To de-multiplex a multiplexed FASTQ file, use the following command:
usage: cpdseqer demultiplex [-h] -i [INPUT] -o [OUTPUT] -b [BARCODE_FILE]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input fastq file (gzipped supported)
-o [OUTPUT], --output [OUTPUT]
Output folder
-b [BARCODE_FILE], --barcode_file [BARCODE_FILE]
Tab-delimited file, first column is barcode, second column is sample name
For example:
cd T01_demultiplex/
cpdseqer demultiplex -i ../data/example.fastq.gz -o . -b barcode.txt
The barcode.txt contains two columns indicate barcode and sample name (separated by tab).
ATCGCGAT | Control |
GAACTGAT | UV |
An alternative method to perform demultiplex is using Je.
cd T01_demultiplex/je
je demultiplex F1=../../data/example.fastq.gz BF=barcode_je.txt O=.
The JE_BARCODEFILE contains three columns indicate sample name, barcode and sample filename (separated by tab).
Control | ATCGCGAT | Control.fastq.gz |
UV | GAACTGAT | UV.fastq.gz |
Without going into detail, we recommend previously established methods such as FASTQC and QC3 for this general QC step.
cd T02_fastqc/
fastqc -o . -t 1 --extract ../data/Control.fastq.gz
cd T03_build_bowtie2_index
bowtie2-build --threads 8 ../data/bowtie2_index_2.3.5.1/GRCh38.primary_assembly.genome.fa GRCh38.primary_assembly.genome
(A) If sequencing data is single-end, use the following command:
bowtie2 -p [THREADS] -x [INDEX] -U [FASTQ] -S [SAM] | samtools sort –o [OUTPUT] –T [TEMP_PREFIX] [-@ [THREADS]] [–m [MAX_MEMORY]]
(B) If sequencing data is pair-end, use the following command:
bowtie2 -p [THREADS] -x [INDEX] -1 [FASTQ1] -2 [FASTQ2] –S [SAM] | samtools sort –o [OUTPUT] –T [TEMP_PREFIX] [-@ [THREADS]] [–m [MAX_MEMORY]]
For example,
cd T04_bowtie2
bowtie2 -p 8 -x ../data/bowtie2_index_2.3.5.1/GRCh38.primary_assembly.genome -U ../data/Control.fastq.gz | samtools sort -@ 8 -m 4G -o Control.bam -
samtools index Control.bam
deepTools can be used to correct GC bias on bam file.
(i) computeGCBias -b [IN_BAM] –g [GENOME] --effectiveGenomeSize [GENOME_SIZE] –GCbiasFrequenciesFile [TEXT_OUT] [-p [THREADS]]
(ii) correctGCBias -b [IN_BAM] –o [OUT_BAM] –g [GENOME] --effectiveGenomeSize [GENOME_SIZE]–GCbiasFrequenciesFile [TEXT_OUT] [-p [THREADS]]
For example,
cd T05_Correct_GC/T05_01_calc
computeGCBias -p 8 -b ../../data/Control.bam --effectiveGenomeSize 2913022398 -g ../../data/hg38.2bit -l 200 --GCbiasFrequenciesFile control_freq.txt --biasPlot control.png
cd ../T05_02_correct/
correctGCBias -p 8 -b ../../data/Control.bam --effectiveGenomeSize 2913022398 -g ../../data/hg38.2bit --GCbiasFrequenciesFile ../T05_01_calc/control_freq.txt -o control.corrected.bam
This step completes raw data processing and generates important output files that will be required in multiple steps in the following workflow.
usage: cpdseqer bam2dinucleotide [-h] -i [INPUT] -g [FASTA] [-q [MAPPING_QUALITY]] [-m [MIN_COVERAGE]] [-u] [-t] -o [OUTPUT]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input BAM file
-g [FASTA], --fasta [FASTA]
Input genome fasta file
-q [MAPPING_QUALITY], --mapping_quality [MAPPING_QUALITY]
Minimum mapping quality of read (default 20)
-m MIN_COVERAGE, --min_coverage MIN_COVERAGE
The minimum coverage of dinucleotide for counting (default 1)
-u, --unique_only Use uniquely mapped read only
-t, --test Test the first 1000000 reads only
-o [OUTPUT], --output [OUTPUT]
Output file prefix
For example:
cd T06_bam2dinucleotide
cpdseqer bam2dinucleotide -i ../data/Control.bam -g ../data/bowtie2_index_2.3.5.1/GRCh38.primary_assembly.genome.fa -o Control
(i) Generate a binwise dinucleotide site summary based on the reference genome
cpdseqer fasta2bincount -i [INPUT_FA] [-b [BLOCK]] -o [OUTPUT]
(ii) Generate a binwise dinucleotide read-count summary based on CPD read count information
cpdseqer dinucleotide2bincount -i [INPUT_DI] [-g [GENOME]] [-b [BLOCK]] -o [OUTPUT]
For example,
cd T07_bincount/T07_01_fasta2bincount/
cpdseqer fasta2bincount -i ../../data/bowtie2_index_2.3.5.1/GRCh38.primary_assembly.genome.fa -b 100000 -o GRCh38.b100000.txt
Once you come back to cpdseqer folder,
cd T07_bincount/T07_02_dinucleotide2bincount/
cpdseqer dinucleotide2bincount -i ../../data/Control.bed.bgz -g hg38 -b 100000 -o Control.b100000.txt
8. (Optional) Subtract out short tandem repeat regions or narrow down to the interested gnomic regions
cpdseqer filter –i [INPUT] –c [COORDINATE] –o [OUTPUT_PREFIX] [-m {subtract,intersect}]
For example,
cd T08_filter
cpdseqer filter -i ../data/Control.bed.bgz -c ../data/hg38_UTR3.bed.gz -m intersect -o Control.UTR3
cpdseqer filter -i ../data/Control.bed.bgz -c ../data/hg38_UTR3.bed.gz -m subtract -o Control.noUTR3
cpdseqer size_factor -i [INPUT] -o [OUTPUT_PREFIX] [--calc_type {site_union,chrom_dinucleotide} }]
For example,
cd T09_sizefactor
cpdseqer size_factor -i dinucleotide.list --calc_type chrom_dinucleotide -o sf
QC based on dinucleotide count results can be performed using the following command:
cpdseqer qc [-h] -i [INPUT] -o [OUTPUT] [-n [NAME]] [--count_type {rCnt,sCnt}}] [-g [GENOME]] [-s [SIZE_FACTOR_FILE]]
usage: cpdseqer qc [-h] -i [INPUT] [-n [NAME]] [--count_type [COUNT_TYPE]] -o [OUTPUT]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input list file, first column is file name, second column is count file path, third column is dinucleotide file path
-n [NAME], --name [NAME]
Input project name
--count_type [COUNT_TYPE]
Input count type, rCnt/sCnt (read count/site count, default rCnt)
-o [OUTPUT], --output [OUTPUT]
Output file prefix
For example:
cd T10_qc
cpdseqer qc -i single_file.list -n single -o output_single
cpdseqer qc -i multi_file.list -n multi -o output_multi
usage: cpdseqer fig_genome [-h] -i [INPUT] [-b [BLOCK]]
[-n [{None,Total,LocalGC}]] -o [OUTPUT]
[-g [GENOME]]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input dinucleotide list file, first column is file
location, second column is file name
-b [BLOCK], --block [BLOCK]
Block size for summerize dinucleotide count (default
100000)
-n [{None,Total,LocalGC}], --norm_type [{None,Total,LocalGC}]
Normalization type
-o [OUTPUT], --output [OUTPUT]
Output file prefix
-g [GENOME], --genome [GENOME]
Input reference genome, hg38/hg19 (default hg38) or
chromosome length file
For example:
cd T11_fig_genome
cpdseqer fig_genome -i dinucleotide.list -g hg38 -o cpd_genome -n LocalGC -b 1000000
The dinucleotide.list contains two columns indicate dinucleotide file and sample name (separated by tab).
../data/Control.bed.bgz | Control |
../data/UV.dinucleotide.bed.bgz | UV |
usage: cpdseqer fig_position [-h] -i [INPUT] -c [COORDINATE_FILE]
[-b [BACKGROUND_FILE]] [--space] [--add_chr] [-t]
-o [OUTPUT]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input dinucleotide list file, first column is file
location, second column is file name
-c [COORDINATE_FILE], --coordinate_file [COORDINATE_FILE]
Input coordinate bed file (can use short name
hg38/hg19 as default nucleosome file)
-b [BACKGROUND_FILE], --background_file [BACKGROUND_FILE]
Background dinucleotide file
--space Use space rather than tab in coordinate files
--add_chr Add chr in chromosome name in coordinate file
-t, --test Test the first 10000 coordinates only
-o [OUTPUT], --output [OUTPUT]
Output file prefix
For example, we will calculate the dinucleotide position in 3'UTR:
cd T12_fig_position
cpdseqer fig_position -c ../data/hg38_UTR3.bed.gz -b ../data/hg38_Naked.bed.bgz -i dinucleotide.list -o position
Here, you can input absolute coordinate file, or hg38/hg19. hg38 and hg19 indicates the nucleosome coordinate files which can also be downloaded by:
wget https://github.com/shengqh/cpdseqer/raw/master/cpdseqer/data/nucleosome_hg19_interval.zip
wget https://github.com/shengqh/cpdseqer/raw/master/cpdseqer/data/nucleosome_hg38_interval.zip
usage: cpdseqer uv_comp_genome [-h] -i [INPUT] [--count_type [{rCnt,sCnt}]] -o
[OUTPUT] [-g [GENOME]] [-s [SIZE_FACTOR_FILE]]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input count list file, first column is file location,
second column is file name
--count_type [{rCnt,sCnt}]
Input count type, rCnt/sCnt (read count/site count,
default rCnt)
-o [OUTPUT], --output [OUTPUT]
Output file prefix
-g [GENOME], --genome [GENOME]
Input reference genome, hg38/hg19/saccer3 (default
hg38)
-s [SIZE_FACTOR_FILE], --size_factor_file [SIZE_FACTOR_FILE]
Input size factor file for normalization
For example,
cd T13_uv_comp_genome
cpdseqer uv_comp_genome -i count.list -o uv_comp_genome -g hg38 --count_type sCnt -s ../data/chrom_dinucleotide.sizefactor.txt
The count.list contains two columns indicate dinucleotide count file and sample name (separated by tab).
../data/Control.count | Control |
../data/UV.count | UV |
14. Compare UV damage of sample(s) against reference genome background within a specific region type
usage: cpdseqer uv_comp_genome_region [-h] -i [INPUT] [-c [COORDINATE_FILE]]
[--space] [--add_chr] [-f [FASTA]]
[--count_type [{rCnt,sCnt}]] -o [OUTPUT]
[-s [SIZE_FACTOR_FILE]]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input dinucleotide list file, first column is file
location, second column is file name
-c [COORDINATE_FILE], --coordinate_file [COORDINATE_FILE]
Input coordinate bed file (can use short name
hg38/hg19 as default nucleosome file)
--space Use space rather than tab in coordinate file
--add_chr Add chr to chromosome name in coordinate file
-f [FASTA], --fasta [FASTA]
Input reference genome fasta file
--count_type [{rCnt,sCnt}]
Input count type, rCnt/sCnt (read count/site count,
default rCnt)
-o [OUTPUT], --output [OUTPUT]
Output file prefix
-s [SIZE_FACTOR_FILE], --size_factor_file [SIZE_FACTOR_FILE]
Input size factor file for normalization
For example,
cd T14_uv_comp_genome_region
cpdseqer uv_comp_genome_region \
-i dinucleotide.list \
-f ../data/bowtie2_index_2.3.5.1/GRCh38.primary_assembly.genome.fa \
-c ../data/nucleosome_hg38_interval.txt \
-o uv_comp_genome_region \
--space \
-s ../data/chrom_dinucleotide.sizefactor.txt
usage: cpdseqer uv_comp_regions [-h] -i [INPUT] [-c1 [COORDINATE_FILE1]]
[-c2 [COORDINATE_FILE2]] [--space] [--add_chr]
[-f [FASTA]] [--count_type [{rCnt,sCnt}]] -o
[OUTPUT]
optional arguments:
-h, --help show this help message and exit
-i [INPUT], --input [INPUT]
Input dinucleotide list file, first column is file
location, second column is file name
-c1 [COORDINATE_FILE1], --coordinate_file1 [COORDINATE_FILE1]
Input coordinate bed file 1
-c2 [COORDINATE_FILE2], --coordinate_file2 [COORDINATE_FILE2]
Input coordinate bed file 2
--space Use space rather than tab in coordinate file
--add_chr Add chr to chromosome name in coordinate file
-f [FASTA], --fasta [FASTA]
Input reference genome fasta file
--count_type [{rCnt,sCnt}]
Input count type, rCnt/sCnt (read count/site count,
default rCnt)
-o [OUTPUT], --output [OUTPUT]
Output file prefix
For example,
cd T15_uv_comp_regions
cpdseqer uv_comp_regions \
-i dinucleotide.list \
-f ../data/bowtie2_index_2.3.5.1/GRCh38.primary_assembly.genome.fa \
-c1 hg38_promoter.bed \
-c2 hg38_tf.bed \
-o uv_comp_regions
usage: cpdseqer uv_comp_groups [-h] -i1 [INPUT1] -i2 [INPUT2]
[--count_type [{rCnt,sCnt}]] -o [OUTPUT]
[-s [SIZE_FACTOR_FILE]]
optional arguments:
-h, --help show this help message and exit
-i1 [INPUT1], --input1 [INPUT1]
Input CPD count list file 1, first column is file
location, second column is file name
-i2 [INPUT2], --input2 [INPUT2]
Input CPD count list file 2, first column is file
location, second column is file name
--count_type [{rCnt,sCnt}]
Input count type, rCnt/sCnt (read count/site count,
default rCnt)
-o [OUTPUT], --output [OUTPUT]
Output file prefix
-s [SIZE_FACTOR_FILE], --size_factor_file [SIZE_FACTOR_FILE]
Input size factor file for normalization
For example,
cd T16_uv_comp_groups
cpdseqer uv_comp_groups \
-i1 control.list \
-i2 case.list \
-o uv_comp_groups \
--count_type sCnt \
-s ../data/chrom_dinucleotide.sizefactor.txt
cpdseqer uv_comp_groups_region [-h] -i1 [INPUT1] -i2 [INPUT2] -o [OUTPUT] -c [COORDINATE_FILE] [--add_chr] [--space] [--count_type [COUNT_TYPE]] [-s [SIZE_FACTOR_FILE]]
For example,
cd T17_uv_comp_groups_region
cpdseqer uv_comp_groups_region \
-i1 control.list \
-i2 case.list \
-c ../data/nucleosome_hg38_interval.txt \
-o uv_comp_groups_region \
--count_type sCnt \
--space \
-s ../data/chrom_dinucleotide.sizefactor.txt
We also build docker container for cpdseqer which can be used by singularity.
singularity exec -e docker://shengqh/cpdseqer bowtie2 -h
singularity exec -e docker://shengqh/cpdseqer cpdseqer -h
singularity build cpdseqer.simg docker://shengqh/cpdseqer
singularity exec -e cpdseqer.simg bowtie2 -h
singularity exec -e cpdseqer.simg cpdseqer -h