Custom analysis software for DISCOVER-Seq+

System requirements

Anaconda Python 3.7 (the Anaconda distribution includes the required numpy and scipy libraries)
pysam
bowtie2
samtools
SRA Toolkit
- For Linux systems, download the compiled binaries from here. This example uses the Ubuntu Linux 64-bit build, sratoolkit.3.0.0-ubuntu64.tar.gz.
- Run tar -xvzf sratoolkit.3.0.0-ubuntu64.tar.gz to extract the folder.
- Copy the extracted folder sratoolkit.3.0.0-ubuntu64 to a permanent location, e.g. /home/roger/bioinformatics.
- Add the toolkit executables to the PATH variable by appending a new last line to ~/.bashrc, e.g. export PATH=$PATH:/home/roger/bioinformatics/sratoolkit.3.0.0-ubuntu64/bin.
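
After editing ~/.bashrc and opening a new shell, it can help to confirm that the required command-line tools are discoverable. The following is a minimal Python sketch and is not part of the repository. The tool list is an assumption (prefetch and fasterq-dump are SRA Toolkit executables), so adjust it to whichever binaries your scripts call.

```python
import shutil

# Assumed executable names; edit to match the tools your scripts actually invoke.
REQUIRED_TOOLS = ["bowtie2", "samtools", "prefetch", "fasterq-dump"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

An empty result from missing_tools() suggests the PATH change took effect; it does not verify tool versions.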

Data requirements

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA801688

Installation guide (est. 30 min)

Download the prebuilt bowtie2 indices for various genome assemblies
- Human hg38
- Human hg19
- Mouse mm10
- Extract from the archive and move to the corresponding folder: hg38_bowtie2/, hg19_bowtie2/, or mm10_bowtie2/
Download genome assemblies in FASTA format
- hg38.fa
- hg19.fa
- mm10.fa
- Extract from the archive and move to the corresponding folder: hg38_bowtie2/, hg19_bowtie2/, or mm10_bowtie2/
Generate FASTA file indices
- samtools faidx hg38_bowtie2/hg38.fa
- samtools faidx hg19_bowtie2/hg19.fa
- samtools faidx mm10_bowtie2/mm10.fa
Download BLENDER (Wienert et al., 2020)
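
The folder layout described above can be checked before running any alignment. The sketch below is a hypothetical helper, not part of the repository. It assumes each FASTA file sits at <genome>_bowtie2/<genome>.fa and relies on samtools faidx writing its index next to the FASTA with a .fai suffix. It does not check the bowtie2 index files, because their base names depend on the archive you downloaded.

```python
import os

GENOMES = ["hg38", "hg19", "mm10"]


def genome_paths(genome, base_dir="."):
    """Return the FASTA path and its samtools index path for one genome."""
    folder = os.path.join(base_dir, genome + "_bowtie2")
    fasta = os.path.join(folder, genome + ".fa")
    return fasta, fasta + ".fai"


def installation_status(base_dir=".", genomes=GENOMES):
    """Map each genome to 'ready', 'needs faidx', or 'missing FASTA'."""
    status = {}
    for genome in genomes:
        fasta, fai = genome_paths(genome, base_dir)
        if not os.path.isfile(fasta):
            status[genome] = "missing FASTA"
        elif not os.path.isfile(fai):
            status[genome] = "needs faidx"
        else:
            status[genome] = "ready"
    return status


if __name__ == "__main__":
    for genome, state in installation_status().items():
        print(genome, state)
```

Only the genomes you plan to use need to report "ready"; the demo uses mm10.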

Demo (please ensure 100 GB of free disk space)

This demo performs DISCOVER-Seq+ analysis on C57BL/6J mouse liver, 24 h after induction with adenovirus expressing Cas9 and a gRNA targeting PCSK9. The test run was performed on a personal Windows 10 desktop with Ubuntu (Intel i7-8700K, 3.7 GHz, 6 cores, 32 GB RAM).
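
Because the demo needs roughly 100 GB of free disk space, a quick check of the target drive before downloading can save time. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and 1 GB is taken as 10^9 bytes.

```python
import shutil


def free_gb(path):
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_free_space(path, required_gb=100):
    """True if at least `required_gb` GB are free at `path`."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    # Replace "." with the download directory you plan to use.
    print("Free space: {:.1f} GB".format(free_gb(".")))
    print("Enough for the demo:", has_free_space("."))
```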

On the Linux/Mac command line, navigate to the home directory of this repository.
Open download_reads_demo.sh, go to the "User Entry Section", and enter the directory in which to download the sequencing data. For consistency, this demo uses /mnt/c/Users/rzou4/Downloads/.
Run download_reads_demo.sh on the command line. It downloads the relevant FASTQ files from the public database, renames them appropriately, and moves them to a newly created demo folder within /mnt/c/Users/rzou4/Downloads/ (est. time <10 min/file).
Open subset_reads_demo.sh, go to the "User Entry Section", and check that the paths are correct.
Run subset_reads_demo.sh, which subsets the relevant FASTQ files so that DISCOVER-Seq+ and DISCOVER-Seq have equal numbers of sequencing reads for a fair comparison (est. time 1 min/file).
Open process_reads_demo.sh, go to the "User Entry Section", and enter the directory containing the mm10 genome installed in the Installation guide.
Run process_reads_demo.sh, which uses bowtie2 to align the reads to the mouse mm10 genome, then samtools for further processing and conversion to indexed BAM files (est. time 12 hours/file). All 3 samples are processed in parallel.
Open blender_c3_demo.sh, go to the "User Entry Section", and enter the paths to (1) the BLENDER software, (2) the mm10 genome, and (3-4) the new output folders.
Run blender_c3_demo.sh, which runs BLENDER to output the list of off-target sites from DISCOVER-Seq+ versus DISCOVER-Seq (est. time 3 days/file). Both samples are processed in parallel.
Expected results are included in this repository:
- DISCOVER-Seq+
- DISCOVER-Seq
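
The order of the demo steps can be summarized as a small driver. This is only a sketch: the script names come from the steps above, while the driver itself is hypothetical and assumes each script's "User Entry Section" has already been edited. By default it only prints the commands, since the full run takes days.

```python
import subprocess

# Demo scripts in the order they must run.
DEMO_SCRIPTS = [
    "download_reads_demo.sh",
    "subset_reads_demo.sh",
    "process_reads_demo.sh",
    "blender_c3_demo.sh",
]


def demo_commands(scripts=DEMO_SCRIPTS):
    """Return one bash command (as an argument list) per demo script."""
    return [["bash", script] for script in scripts]


def run_demo(dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for command in demo_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence at the first failing script.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_demo(dry_run=True)
```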

Usage instructions

These instructions are similar to those in the Demo, but more general and applicable to all datasets. The sequencing datasets are listed here.

`download_reads.sh` downloads the desired FASTQ sequencing reads from Sequence Read Archive (SRA) (est. time 10 min/file), parallelizable

Open download_reads.sh, which lists the FASTQ datasets available for download. Each entry is an accession of the form SRR********, such as SRR18188706, paired with its sample name, such as mm_mP9_KU24h_r3. Set download_path to the directory that will hold the downloaded files, and uncomment only the files you would like to download. Expect the processing of each sample to take up ~33 GB of disk space.
Run download_reads.sh, which downloads the selected data from SRA using the SRA Toolkit, then renames and copies the downloaded files into a subfolder of download_path called SRA_download.
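
To illustrate the download-and-rename step, the sketch below builds (but does not run) an SRA Toolkit command for one accession and plans the rename to the sample name. It is not the repository's script. It assumes fasterq-dump is the download tool, that the data are paired-end so --split-files yields <accession>_1.fastq and <accession>_2.fastq, and that renamed files keep the _1/_2 suffixes. The function names are hypothetical.

```python
import os
import re

# SRA run accessions used here look like "SRR" followed by digits.
ACCESSION_PATTERN = re.compile(r"^SRR\d+$")


def fasterq_dump_command(accession, download_path):
    """Build a fasterq-dump command writing into <download_path>/SRA_download."""
    if not ACCESSION_PATTERN.match(accession):
        raise ValueError("Not an SRR accession: {!r}".format(accession))
    outdir = os.path.join(download_path, "SRA_download")
    return ["fasterq-dump", accession, "--split-files", "--outdir", outdir]


def rename_plan(accession, sample_name, download_path):
    """List (source, destination) pairs for the two mate files (assumed paired-end)."""
    outdir = os.path.join(download_path, "SRA_download")
    return [
        (
            os.path.join(outdir, "{}_{}.fastq".format(accession, mate)),
            os.path.join(outdir, "{}_{}.fastq".format(sample_name, mate)),
        )
        for mate in (1, 2)
    ]


if __name__ == "__main__":
    print(" ".join(fasterq_dump_command("SRR18188706", ".")))
    for src, dst in rename_plan("SRR18188706", "mm_mP9_KU24h_r3", "."):
        print(src, "->", dst)
```

Passing the command list to subprocess.run would perform the actual download; each rename can then be applied with os.rename.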

`subset_reads.sh` subsets the group of sequencing reads to ensure equal # of reads (est. time 1 min/file), parallelizable

Open subset_reads.sh and make sure sra_path points to the SRA_download folder. Under main() {}, uncomment only the file sets you would like to subset; these are likely the same ones previously downloaded with download_reads.sh. NOTE: negative control samples generally do not need to be subsetted because they are only used as a control for off-target detection using BLENDER.
Run subset_reads.sh to subset the relevant FASTQ files so that DISCOVER-Seq+ and DISCOVER-Seq have equal numbers of sequencing reads for a fair comparison.
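
The idea of equalizing read counts can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used by subset_reads.sh: it keeps the first n reads of each file, assumes uncompressed FASTQ with exactly four lines per read, and writes outputs with a hypothetical .sub.fastq suffix. For paired-end data, both mate files of a sample would need the same n.

```python
import itertools


def count_reads(fastq_path):
    """Count reads in an uncompressed FASTQ file (4 lines per read)."""
    with open(fastq_path) as handle:
        return sum(1 for _ in handle) // 4


def subset_fastq(src_path, dst_path, n_reads):
    """Copy the first `n_reads` reads from src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(itertools.islice(src, 4 * n_reads))


def equalize_read_counts(paths, suffix=".sub.fastq"):
    """Truncate every file to the smallest read count found among `paths`.

    Returns the common read count and the list of output paths.
    """
    n_reads = min(count_reads(path) for path in paths)
    outputs = []
    for path in paths:
        out_path = path + suffix
        subset_fastq(path, out_path, n_reads)
        outputs.append(out_path)
    return n_reads, outputs
```

Taking the first n reads is the simplest choice; random sampling is an alternative if read order in the file is a concern.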

`process_reads.sh` aligns FASTQ to the appropriate genome, then converts alignment to BAM files (est. time 12 hours/file), parallelizable

Open process_reads.sh and uncomment only the files you would like to align and process; these are likely the same ones previously downloaded with download_reads.sh, some of which were subsetted with subset_reads.sh. Also enter valid paths to the downloaded genomes.
Run process_reads.sh, which uses bowtie2 to align the reads to the appropriate genome (e.g. mouse mm10), then samtools for further processing and conversion to indexed BAM files.
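
The alignment-to-indexed-BAM sequence can be outlined as a list of commands. The sketch below only builds the commands and is not a copy of process_reads.sh, which may include additional filtering or deduplication steps. It assumes paired-end input named <sample>_1.fastq and <sample>_2.fastq; the index prefix and output file names are hypothetical.

```python
def alignment_commands(sample, index_prefix, threads=4):
    """Build bowtie2 and samtools commands for one paired-end sample.

    `index_prefix` is the bowtie2 index base name (path without the .bt2
    suffixes), e.g. a prefix inside mm10_bowtie2/.
    """
    sam = sample + ".sam"
    unsorted_bam = sample + "_unsorted.bam"
    bam = sample + ".bam"
    return [
        # Align paired-end reads and write SAM output.
        ["bowtie2", "-p", str(threads), "-x", index_prefix,
         "-1", sample + "_1.fastq", "-2", sample + "_2.fastq", "-S", sam],
        # Convert SAM to BAM.
        ["samtools", "view", "-b", "-o", unsorted_bam, sam],
        # Sort by coordinate, which indexing requires.
        ["samtools", "sort", "-o", bam, unsorted_bam],
        # Create the .bai index next to the sorted BAM.
        ["samtools", "index", bam],
    ]


if __name__ == "__main__":
    for command in alignment_commands("mm_mP9_KU24h_r3", "mm10_bowtie2/mm10"):
        print(" ".join(command))
```

Each command list can be passed to subprocess.run(command, check=True) to execute the steps in order.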

`blender_*_xxxx_cc.sh` determines genome-wide off-target sites using BLENDER (est. time 1-3 days/file), parallelizable

These shell files are named blender_*_xxxx_cc.sh, where * indicates the sequencing dataset, xxxx the genome (hg19 or mm10), and cc the BLENDER parameter (c2 or c3).
Modify blender_*_xxxx_cc.sh appropriately for compatibility with your desired datasets, and run.
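
The naming convention can be made concrete with a small parser. This is a hypothetical helper, not part of the repository; it accepts only names that follow the blender_*_xxxx_cc.sh pattern described above, so scripts with other names (such as the demo script) return None.

```python
import re

# dataset: any text; genome: hg19 or mm10; cutoff: c2 or c3.
BLENDER_NAME = re.compile(
    r"^blender_(?P<dataset>.+)_(?P<genome>hg19|mm10)_(?P<cutoff>c2|c3)\.sh$"
)


def parse_blender_script(filename):
    """Split a blender_*_xxxx_cc.sh name into its parts, or return None."""
    match = BLENDER_NAME.match(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_blender_script("blender_210216_hg19_c2.sh"))
    print(parse_blender_script("blender_c3_demo.sh"))
```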

`script_1.py` performs downstream analysis, generating the majority of the data and results presented in the associated manuscript

Modify script_1.py appropriately for compatibility with your desired datasets, and run.

rogerzou/DSeqPlus