- Anaconda Python 3.7 (Anaconda's python distribution comes with the required numpy and scipy libraries)
- pysam
- bowtie2
- samtools
- SRA Toolkit
  - For Linux systems, download the compiled binaries from here. This example uses the Ubuntu Linux 64-bit build, `sratoolkit.3.0.0-ubuntu64.tar.gz`.
  - Run `tar -xvzf sratoolkit.3.0.0-ubuntu64.tar.gz` to extract the folder.
  - Copy the extracted folder `sratoolkit.3.0.0-ubuntu64` to a permanent location, e.g. `/home/roger/bioinformatics`.
  - Add the toolkit binaries to the PATH variable by appending a new last line to `~/.bashrc`, e.g. `export PATH=$PATH:/home/roger/bioinformatics/sratoolkit.3.0.0-ubuntu64/bin`.
- Download the prebuilt bowtie2 indices for various genome assemblies
  - Human hg38
  - Human hg19
  - Mouse mm10
- Extract each archive and move its contents to the corresponding folder, named `hg38_bowtie2/`, `hg19_bowtie2/`, or `mm10_bowtie2/`
- Download genome assemblies in FASTA format
- Generate FASTA file indices
samtools faidx hg38_bowtie2/hg38.fa
samtools faidx hg19_bowtie2/hg19.fa
samtools faidx mm10_bowtie2/mm10.fa
- Download BLENDER (Wienert et al., 2020)
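
Once everything above is installed, it can help to confirm that the command-line tools are reachable from PATH before starting a long run. The following is a minimal sketch and not part of this repository: the helper name `missing_tools` is hypothetical, and `fasterq-dump` is only assumed to be the SRA Toolkit executable of interest (the scripts here may call a different one).

```python
import shutil

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    # Assumed executable names; adjust to match your installation.
    required = ["bowtie2", "samtools", "fasterq-dump"]
    missing = missing_tools(required)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If a tool is reported missing right after editing `~/.bashrc`, open a new terminal (or run `source ~/.bashrc`) so the updated PATH takes effect.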
This demo performs DISCOVER-Seq+ on C57BL/6J mouse liver, 24 h after induction with adenovirus expressing Cas9 and a gRNA targeting PCSK9. The test run was performed on a personal Windows 10 desktop with Ubuntu (Intel i7-8700K 3.7 GHz, 6 cores, 32 GB RAM).
- On the Linux/Mac command line, navigate to the home directory of this repository.
- Open `download_reads_demo.sh`, go to the "User Entry Section", and enter the desired directory for downloading the sequencing data. For consistency, this demo uses `/mnt/c/Users/rzou4/Downloads/`.
- Run `download_reads_demo.sh` on the command line, which downloads the relevant FASTQ files from the public database, renames them appropriately, and moves them to a newly created `demo` folder within `/mnt/c/Users/rzou4/Downloads/` (est. time <10 min/file).
- Open `subset_reads_demo.sh`, go to the "User Entry Section", and ensure the paths are correct.
- Run `subset_reads_demo.sh`, which subsets the relevant FASTQ files so that DISCOVER-Seq+ and DISCOVER-Seq have an equal number of sequencing reads, for fair comparison (est. time 1 min/file).
- Open `process_reads_demo.sh`, go to the "User Entry Section", and enter the directory containing the mm10 genome previously installed via the Installation Guide.
- Run `process_reads_demo.sh`, which uses `bowtie2` to align the files to the mouse mm10 genome, followed by `samtools` for further processing and conversion to indexed BAM files (est. time 12 hours/file). All 3 samples are processed in parallel.
- Open `blender_c3_demo.sh`, go to the "User Entry Section", and enter the paths to (1) the BLENDER software, (2) the mm10 genome, and (3-4) the new output folders.
- Run `blender_c3_demo.sh`, which runs BLENDER to output the list of off-target sites from DISCOVER-Seq+ versus DISCOVER-Seq (est. time 3 days/file). Both samples are processed in parallel.
- Expected results are included in this repository:
These instructions are similar to those in the Demo, but more general and applicable to all datasets. The sequencing datasets are listed here.
`download_reads.sh` downloads the desired FASTQ sequencing reads from the Sequence Read Archive (SRA) (est. time 10 min/file), parallelizable
- Open `download_reads.sh`, which lists the FASTQ datasets to download in the format `SRR********`, such as `SRR18188706`, each with its corresponding name, such as `mm_mP9_KU24h_r3`. Enter the desired directory to hold the downloaded files at `download_path`. Uncomment only the files you would like to download. Expect the processing for each sample to take up ~33 GB of disk space.
- Run `download_reads.sh`, which downloads the desired data from SRA using the SRA Toolkit, then renames and copies the downloaded files into a subfolder inside `download_path` called `SRA_download`.
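
The renaming step can be pictured as a lookup from SRA accession to sample name. The sketch below is illustrative only and is not taken from `download_reads.sh`: the helper `plan_renames` is hypothetical, and it assumes the downloaded files are named `<accession>_1.fastq` and `<accession>_2.fastq`, which may not match what the script actually produces.

```python
import os

def plan_renames(filenames, names_by_accession):
    """Map each downloaded file name to a name based on its sample label.

    A file such as 'SRR18188706_1.fastq' becomes 'mm_mP9_KU24h_r3_1.fastq'.
    Files whose accession is not in the table are left out of the plan.
    """
    plan = {}
    for fname in filenames:
        stem, ext = os.path.splitext(fname)
        # Split off the accession from an optional mate suffix ("1" or "2").
        accession, sep, suffix = stem.partition("_")
        if accession in names_by_accession:
            plan[fname] = names_by_accession[accession] + sep + suffix + ext
    return plan

# Accession and sample name taken from the example above.
samples = {"SRR18188706": "mm_mP9_KU24h_r3"}
files = ["SRR18188706_1.fastq", "SRR18188706_2.fastq", "notes.txt"]
for old, new in plan_renames(files, samples).items():
    print(old, "->", new)
```

Building the plan first and renaming afterwards makes it easy to review the mapping before any file is touched.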
`subset_reads.sh` subsets each group of sequencing reads to ensure an equal number of reads (est. time 1 min/file), parallelizable
- Open `subset_reads.sh` and make sure `sra_path` points to the `SRA_download` folder. Under `main() {}`, uncomment only the file sets you would like to subset; these are likely the same ones previously downloaded with `download_reads.sh`. NOTE: negative control samples generally do not need to be subsetted because they are only used as a control for off-target detection using BLENDER.
- Run `subset_reads.sh` to subset the relevant FASTQ files so that DISCOVER-Seq+ and DISCOVER-Seq have an equal number of sequencing reads, for fair comparison.
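
To make the idea of equalizing read counts concrete, here is a minimal sketch. It assumes uncompressed FASTQ files with exactly 4 lines per read, and it simply keeps the first N reads of each file. This is one simple approach and not necessarily what `subset_reads.sh` does. The function names are hypothetical.

```python
import itertools

def count_reads(fastq_path):
    """Count reads in an uncompressed FASTQ file (4 lines per read)."""
    with open(fastq_path) as handle:
        return sum(1 for _ in handle) // 4

def subset_fastq(src_path, dst_path, n_reads):
    """Copy the first `n_reads` records of `src_path` into `dst_path`."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(itertools.islice(src, 4 * n_reads))

def equalize(paths, suffix=".subset.fastq"):
    """Trim every file to the smallest read count found among `paths`."""
    target = min(count_reads(p) for p in paths)
    outputs = []
    for p in paths:
        out = p + suffix
        subset_fastq(p, out, target)
        outputs.append(out)
    return target, outputs
```

For paired-end data, both mate files of a sample would need to be trimmed to the same read count so that the pairs stay in register.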
`process_reads.sh` aligns FASTQ files to the appropriate genome, then converts the alignments to BAM files (est. time 12 hours/file), parallelizable
- Open `process_reads.sh` and uncomment only the files you would like to align and process; these are likely the same ones previously downloaded with `download_reads.sh`, some of which were subsetted with `subset_reads.sh`. Also enter valid paths to the downloaded genomes.
- Run `process_reads.sh` to use `bowtie2` to align the files to the mouse mm10 genome, followed by `samtools` for further processing and conversion to indexed BAM files.
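
The general shape of an align-then-index workflow can be sketched as a list of commands. This is an illustration under stated assumptions and not a copy of `process_reads.sh`: it assumes paired-end reads and a bowtie2 index prefix of `mm10_bowtie2/mm10`, and it omits any additional filtering the real script may perform. The function name and file names are hypothetical.

```python
import shlex

def alignment_commands(index_prefix, fastq_1, fastq_2, out_prefix, threads=4):
    """Build bowtie2/samtools commands for one paired-end sample."""
    sam = out_prefix + ".sam"
    unsorted_bam = out_prefix + ".unsorted.bam"
    bam = out_prefix + ".bam"
    return [
        # Align paired-end reads against a prebuilt bowtie2 index.
        ["bowtie2", "-p", str(threads), "-x", index_prefix,
         "-1", fastq_1, "-2", fastq_2, "-S", sam],
        # Convert the SAM output to BAM.
        ["samtools", "view", "-b", "-o", unsorted_bam, sam],
        # Sort by coordinate, which is required before indexing.
        ["samtools", "sort", "-o", bam, unsorted_bam],
        # Write the index file next to the sorted BAM.
        ["samtools", "index", bam],
    ]

# Hypothetical inputs, printed as shell commands for inspection.
cmds = alignment_commands("mm10_bowtie2/mm10", "sample_1.fastq",
                          "sample_2.fastq", "sample")
for cmd in cmds:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` to execute it. Printing first gives a chance to check the paths before committing to a run that may take hours.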
`blender_*_xxxx_cc.sh` determines genome-wide off-target sites using BLENDER (est. time 1-3 days/file), parallelizable
- Shell files are named `blender_*_xxxx_cc.sh`, where `*` indicates the sequencing dataset, `xxxx` indicates the genome (`hg19` or `mm10`), and `cc` indicates the BLENDER parameter (`c2` or `c3`).
- Modify `blender_*_xxxx_cc.sh` appropriately for compatibility with your desired datasets, and run.
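
The naming convention above can be checked programmatically, for example when looping over many scripts. The following is a small sketch. The helper `parse_blender_script` and the example file name are hypothetical and are used only to illustrate the pattern.

```python
import re

# dataset: anything; genome: hg19 or mm10; param: c2 or c3.
BLENDER_SCRIPT = re.compile(
    r"^blender_(?P<dataset>.+)_(?P<genome>hg19|mm10)_(?P<param>c2|c3)\.sh$"
)

def parse_blender_script(filename):
    """Split a blender_*_xxxx_cc.sh name into dataset, genome, and parameter."""
    match = BLENDER_SCRIPT.match(filename)
    if match is None:
        raise ValueError("unexpected script name: " + filename)
    return match.groupdict()

# Hypothetical file name, used only to illustrate the pattern.
print(parse_blender_script("blender_mP9_mm10_c3.sh"))
```

Because the dataset part may itself contain underscores, the pattern anchors on the known genome and parameter values at the end of the name.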
`script_1.py` performs downstream analysis, generating the majority of the data and results presented in the associated manuscript
- Modify `script_1.py` appropriately for compatibility with your desired datasets, and run.