JASPAR UCSC tracks
This repository contains the data and code used to generate the JASPAR UCSC Genome Browser track data hub.
For more information visit the JASPAR website.
- The genomes folder contains scripts to download and process different genome assemblies
- The profiles folder contains the output from the script get_profiles.py, which downloads the JASPAR CORE profiles for different taxa
- The file environment.yml contains the conda environment used to generate the genomic tracks for JASPAR 2020 (see installation)
- The script install-pwmscan.sh downloads and installs PWMScan and places its binaries in the
- The script scan_sequence.py takes as its input the profiles folder and a nucleotide sequence in FASTA format (e.g. a genome), and outputs TFBS predictions
- The script scans2bigBed creates a bigBed track file from TFBS predictions
- Python 3.7 with the following libraries: Biopython (<1.74), NumPy, pyfaidx and tqdm
- UCSC binaries for standalone command-line use
Note that for running scan_sequence.py, only the Python dependencies and PWMScan are required.
To install PWMScan, execute the script
The remaining dependencies can be installed through the conda package manager:
conda env create -f ./environment.yml
Genomic tracks and TFBS predictions for human and six other model organisms are available online:
To illustrate how the genomic tracks are generated, we provide an example for the baker's yeast genome:
- Download the genome sequence and chromosome sizes (automated in this script)
- Scan the genome sequence using all fungi profiles from the JASPAR CORE
./scan_sequence.py --fasta-file ./genomes/sacCer3/sacCer3.fa --profiles-dir ./profiles/ \
    --output-dir ./tracks/sacCer3/ --threads 4 --latest --taxon fungi
For this example, the scanning step should take no longer than a minute. For human and other similar genomes, this step is usually finished within a few hours (the final amount of time will depend on the number of
- Create the genomic track
./scans2bigBed -c ./genomes/sacCer3/sacCer3.chrom.sizes -i ./tracks/sacCer3/ -o ./tracks/sacCer3.bb -t 4
TFBS predictions from the previous step are merged into a bigBed track file. In column five, we use as scores the p-values from PWMScan, scaled to the range 0-1000, where 0 corresponds to a p-value of 1 and 1000 to a p-value ≤ 10^-10. This allows for comparison of prediction confidence across TFBSs. Again, for this example, this step should be completed within a few minutes, while for larger genomes it can take a few hours.
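The score scaling can be illustrated with a short sketch. It assumes the mapping is linear in -log10(p-value), which matches the two endpoints stated above but is not confirmed to be the exact formula used by scans2bigBed. The function name `p_value_to_score` and the constant `P_VALUE_FLOOR` are hypothetical and exist only for this illustration.

```python
import math

# Hypothetical constant: p-values at or below this floor get the top score.
P_VALUE_FLOOR = 1e-10


def p_value_to_score(p_value: float) -> int:
    """Map a p-value in [0, 1] to an integer score in [0, 1000].

    Assumption: the score is linear in -log10(p-value), so that
    p-value = 1 gives 0 and p-value <= 1e-10 gives 1000.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie between 0 and 1")
    # Clamp very small p-values so the score saturates at 1000
    # (this also avoids taking the logarithm of zero).
    clamped = max(p_value, P_VALUE_FLOOR)
    return min(1000, round(-math.log10(clamped) * 100))


if __name__ == "__main__":
    for p in (1.0, 1e-5, 1e-10, 1e-12):
        print(p, p_value_to_score(p))
```

Under this assumption, every tenfold decrease in p-value adds 100 to the score, which is what makes scores comparable across TFBSs.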
Important note: both disk space and memory requirements for large genomes (i.e. danRer11, hg19, hg38 and mm10) are substantial. In these cases, we highly recommend allocating at least 1 TB of disk space and 512 GB of RAM.
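Returning to the first step of the yeast example: a chromosome sizes file is a two-column, tab-separated list of sequence names and their lengths. If only the FASTA file is at hand, the sizes can be derived from it. The sketch below is a minimal, standard-library illustration and is not the repository's own download script. The helper names `chrom_sizes` and `format_chrom_sizes` are hypothetical, and the code assumes every FASTA header carries a sequence name.

```python
from typing import Dict, Iterable


def chrom_sizes(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return sequence lengths keyed by the first word of each FASTA header."""
    sizes: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the header is non-empty; keep only the name
            # and drop any free-text description after the first space.
            name = line[1:].split()[0]
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


def format_chrom_sizes(sizes: Dict[str, int]) -> str:
    """Format sizes as tab-separated 'name<TAB>length' lines."""
    return "".join(f"{name}\t{size}\n" for name, size in sizes.items())


if __name__ == "__main__":
    # Tiny in-memory example instead of a real genome file.
    example = [">chrI", "ACGT", "AC", ">chrII", "GG"]
    print(format_chrom_sizes(chrom_sizes(example)), end="")
```

For a real genome, the same function can be given an open file handle, since a file object iterates over its lines.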