In this example, we estimate the copy number of a single TR in a simulated WGS sample. To run the example, make sure that scattr or cargo is installed and available in your PATH, then run:
./run.sh
In the input/ directory, there are example input files to run ScatTR. In general, running ScatTR requires the following input files:
- A tab-delimited catalog containing TR loci of interest: catalog.tsv
  - The file is expected to have columns: id, contig, start, end, and motif
- Aligned reads of the sample and their index: input/sample.bam and input/sample.bam.bai
- A reference genome and its index: input/reference.fa and input/reference.fa.fai
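As a rough illustration of the catalog layout described above, the following Python sketch parses a tab-delimited catalog and checks that the five expected columns are present. It is not part of ScatTR. The helper `read_catalog`, the presence of a header row, and the example locus (`TR1` on `chr1` at 100000 to 100018) are all assumptions made for illustration, and the coordinate convention shown is not taken from the ScatTR documentation.

```python
import csv
import io

# Column names listed in this README for the catalog file.
EXPECTED_COLUMNS = ["id", "contig", "start", "end", "motif"]


def read_catalog(handle):
    """Parse a tab-delimited catalog and return a list of locus dicts.

    Assumption: the first row is a header naming the columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"catalog is missing columns: {missing}")

    loci = []
    for row in reader:
        start, end = int(row["start"]), int(row["end"])
        if start >= end:
            raise ValueError(f"locus {row['id']}: start must be less than end")
        loci.append(
            {
                "id": row["id"],
                "contig": row["contig"],
                "start": start,
                "end": end,
                "motif": row["motif"],
            }
        )
    return loci


# Hypothetical single-locus catalog; the values are illustrative only.
example = (
    "id\tcontig\tstart\tend\tmotif\n"
    "TR1\tchr1\t100000\t100018\tCAGATA\n"
)
loci = read_catalog(io.StringIO(example))
print(loci)
```

To check a real file, the same helper can be given an open file handle, for example `read_catalog(open("catalog.tsv", newline=""))`.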
In the output/ directory, there are output files generated by ScatTR based on the example input files. ScatTR produces the following output files:
- The extracted bag of reads: sample.bag.bam
- The extracted depth and insert distributions: sample.stats.json
  - By default, ScatTR produces plots of these distributions (sample.depth_distr.png and sample.insert_distr.png). The --no-plot option disables this behavior.
- The optimization problem definitions: sample.defs.json
  - This file contains information about all the reads associated with each TR and their positions relative to the decoy references (refer to the manuscript's methods section for more details)
- The estimated copy numbers: sample.genotypes.json
  - This file contains estimates for each TR locus in addition to a 95% confidence interval
Generating the input files requires python, art_illumina, bwa, and samtools to be available in your PATH. The input files are generated by running:
./scripts/make_inputs.sh
The script creates a random reference genome (~200 kbp in length) and uses it to generate reads for a heterozygous TR expansion, with a normal allele copy number of 3 and an expanded allele copy number of 100. The motif of the TR is CAGATA.
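To give a sense of scale for the two simulated alleles, the short Python sketch below computes the length of each repeat tract from the motif and copy numbers stated above. It is only an arithmetic illustration and not what scripts/make_inputs.sh runs. The helper `tr_sequence` is hypothetical, and the sketch assumes pure, uninterrupted repeats with no flanking sequence.

```python
MOTIF = "CAGATA"        # motif of the simulated TR (from this README)
NORMAL_COPIES = 3       # copy number of the normal allele
EXPANDED_COPIES = 100   # copy number of the expanded allele


def tr_sequence(motif, copies):
    """Return a pure tandem repeat: the motif repeated `copies` times."""
    return motif * copies


normal_allele = tr_sequence(MOTIF, NORMAL_COPIES)
expanded_allele = tr_sequence(MOTIF, EXPANDED_COPIES)

print(len(normal_allele))    # 6 bp motif x 3 copies = 18 bp
print(len(expanded_allele))  # 6 bp motif x 100 copies = 600 bp
```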