ATACseq, ChIPseq, CUT&RUN/TAG RNAseq pipeline wrappers

This repository has the following combined shell/awk/python/R scripts which can be used for High-throughput sequecning data analysis.

ATACseq.sh: bulk ATACseq and CUT&TAG pipeline, from fastq to open chromatin regions/peaks.
ChIPseq.sh: ChIPseq pipeline, from fastq to peak calling step.
RNAseq.sh: bulk RNAseq pipeline, from fastq to differential expressed genes.
adapt_trim.sh: adapter trimming function, seperated from the above pipelines.
cisVar.sh: pipeline wrapper of cisVar.
trans_assemble.sh: de novo transcript assembly, from fastq to GTF.
PLAR.sh: de novo PLAR lncRNA discovery pipeline wrapper.
rRNA_dep.sh: ribosomal RNA depletion from fastq files.
CRISPRlib.sh: mapping CRISPR sgRNA library, from fastq to tables.

Requirements: Python3, cutadapt, macs2(>=2.1.1), R, DESeq2, featureCounts, bowtie1/2, bwa,STAR, fastqc, samtools, bedtools, deeptools

ATACseq.sh

This script will QC fastq files and align reads to the reference genome build with Bowtie2 or chromap, depending on the species selection passed by -g or the index and other required files passed by -i, -b and -c, convert to filtered BAM/BED and bigwig format, then call peaks with MACS2 in BEDPE mode after Tn5 shifting.

This script works for both ATACseq and CUT&TAG.

Input

Paired-end fastq files with _R1/2 suffix, i.e. test_R1.fastq.gz, test_R2.fastq.gz

Single-end sequencing data is also supported with -s, although it is not recommended.

Options

help message can be shown by ATACseq.sh -h

Usage: ATAC.sh <options> -g <hg38|hg19|mm10> <reads1>|..<reads2> 
  Options:
    -g [str] Genome build selection <hg38|hg19|mm10>
    -x [str] Custom bowtie2 index PATH
    -b [str] Custom blacklist PATH
    -m [str] Genome size abbr supported by MACS2
    -p [str] Prefix of output
    -t [int] Threads (1 default)
    -s Single-end mod (DO NOT recommend, Paired-end default)
    -c Using chromap to process FASTQ instead of canonical bowtie2
    -i [str] Custom chromap genome index (only valid with -c option)
    -r [str] Custom chromap genome reference (only valid with -c option)
    -z [str] Custom chromosome size table
    -h Print this help message

Example

wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/ATACseq.sh
chmod 755 ATACseq.sh
./ATACseq.sh -g hg19 -p test -t 24 /path/to/test_R1.fastq.gz /path/to/test_R2.fastq.gz

Alternatively, you may use chromap aligner to speed up the processing:

./ATACseq.sh -c -g hg19 -p test -t 24 /path/to/test_R1.fastq.gz /path/to/test_R2.fastq.gz

chromap is recomended for most cases, as it is ultra fast and outputs a little bit less (<5%) aligned reads.

Output

All results will be store in current (./) directory.

{prefix}_trimmed_R1/2.fastq.gz: adapter trimmed fastq files.
{prefix}_mkdup.bam: all alignments, with duplicates marked.
{prefix}_filtered.bam: useful filtered alignments; duplicates, unpaired, unmapped, low-quality, secondary, chrM reads removed.
{prefix}_se.bed: useful filtered alignments in BED format, Tn5 shifted.
{prefix}_pe.bed: useful filtered alignments in BEDPE format, Tn5 shifted, the 2nd and 3rd columns indicate the fragment start and end coordinates on genome. It will be used for macs2 peak calling.
{prefix}.bw: bigwig file converted from {prefix}_filtered.bam, can be upload to genome browser for visualization.
macs2: output of macs2, see here. Only broad peaks will be called by default. In addition, {prefix}_broad_filtered.bed is the peaks file with blacklist filtered.
fastqc: the report(s) of fastqc
logs: running logs

No bam files and trimmed fastq files will be generated with chromap runs.

ChIPseq.sh

This script will QC fastq files and align reads to reference genome with BWA or chromap (bowtie2 for CUT&RUN), depending on the species passed by -g or the index passed by -i, convert alignments to filtered BAM/BED and bigwig but DOES NOT call peaks.

This script works for both ChIPseq and CUT&RUN.

Input

Paired-end fastq files with _R1/2 suffix, i.e. test_R1.fastq.gz, test_R2.fastq.gz Or single-end fastq file with -s.

Options

help message can be shown by ChIPseq.sh -h

Usage: ChIPseq.sh <options> -g <hg38|hg19|mm10> <reads1>|..<reads2> 
  Options:
    -g [str] Genome build selection <hg38|hg19|mm10>
    -x [str] Custom BWA index PATH (valid only without -g option)
    -z [str] Custom chromosome size table (valid only without -g option)
    -p [str] Prefix of output
    -t [int] Threads (1 default)
    -s Single-end mod (Paired-end default)
    -n Nextera adapters (Truseq default)
    -a Use BWA aln algorithm (BWA mem default)
    -u CUR&RUN mode, will be paired-end mode and use bowtie2 aligner with --dovetail
    -b [str] Custom Bowtie2 index PATH  (valid only with -b option)
    -c Using chromap to process FASTQ instead of canonical bowtie2/bwa
    -i [str] Custom chromap genome index (valid only with -c option)
    -r [str] Custom chromap genome reference (valid only with -c option)

Example

wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/ChIPseq.sh
chmod 755 ChIPseq.sh
./ChIPseq.sh -g hg19 -p test -t 24 /path/to/test_R1.fastq.gz /path/to/test_R2.fastq.gz

Alternatively, you may use chromap aligner to speed up the processing :

./ChIPseq.sh -c -g hg19 -p test -t 24 /path/to/test_R1.fastq.gz /path/to/test_R2.fastq.gz

Output

All results will be store in current (./) directory.

{prefix}_trimmed_R1/2.fastq.gz: adapter trimmed fastq files.
{prefix}_mkdup.bam: all alignments, with duplicates marked.
{prefix}_filtered.bam: useful filtered alignments; duplicates, unpaired, unmapped, low-quality, secondary, chrM reads removed.
{prefix}_se.bed: useful filtered alignments in BED format.
{prefix}_pe.bed: useful filtered alignments in BEDPE format, the 2nd and 3rd columns indicate the fragment start and end coordinates on genome.
{prefix}.bw: bigwig file converted from {prefix}_filtered.bam, can be upload to genome browser for visualization.
fastqc: the report(s) of fastqc
logs: running logs

No bam files and trimmed fastq files will be generated with chromap runs.

Peak calling

NOTE: this pipeline does NOT call peaks, you may run it manually. Input is highly recommended for peak calling, process input fastq files with this pipeline with same parameter(s).

test_pe.bed (and input_pe.bed) can be used for macs2 peak calling in BEDPE mode:

macs2 callpeak -t test_pe.bed -c input_pe.bed -f BEDPE -g hs -n test

--broad is recommended for histone modifications.

test_se.bed and test_filtered.bam can also be used in BED or BAM mode of macs2.

See more about MACS2 (for TFs peak calling) and SICER (for Histone Mods peak calling).

RNAseq.sh

This script will QC fastq files and align reads to the reference genome and transcriptome with STAR, depending on the species selection passed by -s or the index and GTF passed by -i and -g, featureCounts and DESeq2 will be used for reads counting and differential expressed genes discovery,

Input

Paired-end fastq files with _R1/2.fastq.gz suffix all together in a directory, i.e.

Single-end not supported

ls -1 ./
cond1_rep1_R1.fastq.gz 
cond1_rep1_R2.fastq.gz 
cond1_rep2_R1.fastq.gz 
cond1_rep2_R2.fastq.gz
cond2_rep1_R1.fastq.gz 
cond2_rep1_R2.fastq.gz 
cond2_rep2_R1.fastq.gz 
cond2_rep2_R2.fastq.gz
....

And a text file with meta information. i.e.

sample  condition
cond1_rep1  group1
cond1_rep2  group1
cond2_rep1  group2
cond2_rep2  group2
....

You can use the script to scan fastq files and generate the text file:

wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/RNAseq.sh
chmod 755 RNAseq.sh 
./RNAseq.sh -p /path/to/directory/contains/fastq/

Then the meta.txt will be created and opened with VIM. Sample column (1st) should have been filled, edit the text by adding the group information on the 2nd column.

NOTE: Provide the PATH of the DIRECTORY which contains fastq to the scripts, DO NOT give the path of fastq files directly!

Options

help message can be shown by RNAseq.sh -h

Usage: RNAseq.sh <options> -c conditions.txt /PATH/to/directoy/contains/fastq/ 
    Options:
      -m [str] /PATH/to/meta.txt
      -s [str] species <hg|mm> hg=hg38, mm=mm10
      -i [str] Custom STAR index PATH
      -g [str] Custom Reference GTF transcripts PATH
      -t [int] Threads (1 default)
      -p prepare meta.txt
      -n Nextera adapters (Truseq default)
      -h Print this help message

Example

./RNAseq.sh -s hg -m meta.txt -t 24 /path/to/directory/contains/fastq/

Alternatively, you may use your custom genome and transcriptome reference:

./RNAseq.sh -i /path/to/STAR/index -g /path/to/GTF -m meta.txt -t 24 /path/to/directory/contains/fastq/

Output

All results will be store in current (./) directory.

TRIMMED/{prefix}_R1/2_trimmed.gz: adapter trimmed fastq files.
BAM/{prefix}.bam: STAR output, accepted alignments.
BAM/SJ.out.tab: STAR output, splice junctions.
featureCounts.txt: featureCounts output, raw fragments count.
all_genes_exp.txt: size-factor normalized gene expression levels with P values.
logs: running logs and fastqc reports.

NOTE: Sample names in meta.txt have to match the featureCounts output exactly, check your text or use this script to create it. This script cannot automatically run the DEG discovery pair-wisely if you have >2 groups. Either edit deseq.r or analyze it manually in R. A online tool might be useful: iDEP.

adapt_trim.sh

This script is separated from ChIPseq.sh, it trims adapter sequences from fastq files with cutadapt@python3.

Input

Paired-end fastq files or single-end with -s, i.e. test_R1.fastq.gz test_R2.fastq.gz

Options

help message can be shown by adapt_trim.sh -h

Usage: adapt_trim.sh <options> <reads1>|..<reads2
    Options:
    	-p Prefix of output
      -t Threads (1 default)
      -s Single-end mod (Paired-end default)
      -n Nextera adapters (Truseq default)
      -h Print this help message

Example

wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/adapt_trim.sh
chmod 755 adapt_trim.sh 
./adapt_trim.sh -p test -t 24 test_R1.fastq.gz test_R2.fastq.gz

NOTE: multi-threads support only works with python3>=3.4, multiprocessing>=0.70, cutadapt>=1.15 and pigz

Output

All results will be store in current (./) directory.

{prefix}_R1/2_trimmed.gz: adapter trimmed fastq files.

trans_assemble.sh

This script QC fastq files and aligns reads to hg19/GRCh37(depends on the index and GTF provided) using HISAT2. De novo transcripts assembly will be performed by stingtie.

Input

Paired-end fastq files.

Usage

./trans_assemble.sh <reads1> <reads2> <prefix of output> <starnd: fr|rf|un>

NOTE: Edit the script and mofiy $threads, $index, $gtf.

Output

All results will be store in current (./) directory.

{prefix}.bam: sorted accepted alignments.
{prefix}.gtf: de novo transcripts assembled with reference GTF guiding.
logs: running logs.

cisVar.sh

This script is a wrapper of cisVar.

Input

Output of hornet, sorted and indexed bam file.

Usage

./cisVar.sh hornet.bam <read_depth> <indivdual files>

NOTE: Edit the script and mofiy $vcf PATH to your SNP vcf files.

See more about cisVar.

Output

All results will be store in current (./) directory.

{prefix}.{read_depth}.final.txt: mian regression output.
{prefix}.${DEP}.total.txt.prepost.png: desity plot of the regression output.

PLAR.sh

This script is a re-written wrapper of PLAR.

Requirements stringtie, cuffdiff, plar, CPC2 and HMMER

Input

BAM files of aligned RNAseq data, and coresponding GTFs.

Usage

./PLAR.sh hornet.bam <output dir> <prefix_sample1,prefix_sample2...> <strand:rf|fr|un> <sample1_rep1.gtf> <sample1_rep2.gtf> <sample2_rep1.gtf> ... <sample1_rep1.bam,sample1_rep2.bam,sample2_rep1..>

NOTE: Edit the script and modify $plar_path, $cpc2_path. Additional annotation files required in $plar_path, see PLAR for more details.

Output

All results will be store in current (./) directory.

final_lncRNA.bed: filtered and sorted lncRNAs in BED12 format.

rRNA_dep.sh

This script removes ribosomal RNA reads from fastq files by mapping them to rRNA genes and retrieving unmmaped reads.

Input

Paired-end fastq files.

Usage

./rRNA_dep.sh <reads1> <reads2> <prefix of output>

NOTE: Edit the script and mofiy $threads, $genome, $gtf.

Output

All results will be store in current (./) directory.

{prefix}_dep_R1/2_fq.gz: rRNA removed fastq files.
{prefix}_rRNA.log: mapping log.

CRISPRlib.sh

This script uses cutadapt trimming the input fastq files to get the sgRNA sequences (20nt) according to the adaptor sequence on the 5' end next to the sgRNA, and then aligns these sequences to bowtie index build with the reference sgRNA library. The Addgene library 110066, 160129 and 162256 have been prebuilt and can be directly assigned by -l.

Input

The fastq file containing sgRNA sequences.

Usage

help message can be shown by CRISPRlib.sh -h

    Usage: CRISPRlib.sh <options> <reads_clean.fq.gz>

    ### INPUT: fastq files ###
    This script will trim the input fastq to 20nt after the given sequence with cutadapt, and align the trimmed reads to the reference library build with Bowtie1, depending on the library selection passed by -l or the index and adapter sequence passed by -i and -a,
    then statisticize each sequence's frequency, and all results will be store in current (./) directory.
    ### python3/cutadapt/bowtie1/samtools required ###

    Options:
    -l [str] library selection <110066|160129|162256>
    -i [str] Custom bowtie index PATH
    -a [str] Custom adapter sequence
    -p [str] Prefix of output
    -n [int] Threads (1 default)
    -h Print this help message

Example

CRISPRlib.sh -n 12 -l 110066 -p test lib_R2.fastq.gz

Alternatively, you may build your custom sgRNA library index and give the path to the script along with your adaptor sequence:

First make sure your sgRNA library is in FASTA format, such as:

>sgZC3H12A_5
GGGCAGCGACCTGAGACCAG
>sgZC3H12A_6
GGAGTGGAAGCGCTTCATCG
...

Here is an example to format a csv file (let's say sgRNA name is in 1st column and sequence in 2nd column) to FASTA:

awk -F',' '{print ">"$1"\n"$2}' sample.csv > sample.fa

Then build bowite1 index using this FASTA:

bowtie-build --threads 12 sample.fa sgLib

This will generate a few index files with the prefix "sgLib"

Now you can run CRISPRlib.sh with your own library and adaptor sequence.

CRISPRlib.sh -n 12 -i /path/to/index/with/prefix -a YOURADAPTORSEQUENCE -p test lib_R2.fastq.gz

The adaptor sequence is 5'-end next to the sgRNA, depends on the vector used. 8nt or more would be recommended.

Note: edit line53 to 58 or insert new options in the script if you want to automatically assign your own index path and adaptor sequences with -l.

Output

All results will be store in current (./) directory.

{prifix}.tr.fq.gz: trimmed fastq containing the sgRNA only (20bp)
{prefix}.sam: bowtie1 alignment
{prefix}.srt.bam: bowtie1 alignment in sorted and indexed BAM format
{prefix}.srt.bam.bai: bowtie1 alignment index
{pre}.log: trimming and alignment summary
{pre}.counts.tsv: a table delimited text with sgRNA name (1st column) and its count number (2nd column)
{pre}.table.tsv: insert the actual sequence between the name and count compared to counts.tsv

Author @zhaoshuoxp April 24 2024

Name		Name	Last commit message	Last commit date
Latest commit History 228 Commits
.gitignore		.gitignore
ATACseq.sh		ATACseq.sh
CRISPRlib.sh		CRISPRlib.sh
ChIPseq.sh		ChIPseq.sh
README.md		README.md
RNAseq.sh		RNAseq.sh
adapt_trim.sh		adapt_trim.sh
cisVar.sh		cisVar.sh
plar.sh		plar.sh
rRNA_dep.sh		rRNA_dep.sh
trans_assemble.sh		trans_assemble.sh

zhaoshuoxp/Pipelines-Wrappers

Folders and files

Latest commit

History

Repository files navigation

ATACseq, ChIPseq, CUT&RUN/TAG RNAseq pipeline wrappers

ATACseq.sh

Input

Options

Example

Output

ChIPseq.sh

Input

Options

Example

Output

Peak calling

RNAseq.sh

Input

Options

Example

Output

adapt_trim.sh

Input

Options

Example

Output

trans_assemble.sh

Input

Usage

Output

cisVar.sh

Input

Usage

Output

PLAR.sh

Input

Usage

Output

rRNA_dep.sh

Input

Usage

Output

CRISPRlib.sh

Input

Usage

Example

Output

About

Resources

Stars

Watchers

Forks

Languages