Personalized reference editor for somatic mutation discovery

theLongLab/PRESM

PRESM

PRESM stands for Personalized Reference Editor for Somatic Mutation discovery. Other reference genome editors generate a diploid reference genome, which may distribute reads between two sites and impair the soundness of the downstream statistical framework. PRESM instead provides two haploid reference genomes. The PRESM pipeline involves three steps. First, germline mutations are discovered by another tool, e.g., GATK, and are used to make personalized references for calling somatic mutations. Second, a reference genome composed of all personal variants (both heterozygous and homozygous sites) is used as a "decoy" to capture the heterozygous variants in reads. Third, PRESM modifies the reads by replacing all heterozygous alleles with the corresponding reference alleles and maps the modified reads back to another personalized reference genome that contains only homozygous changes. The output of this step is a BAM file ready for any somatic mutation caller to use. We intend to offer long-term maintenance for PRESM and to continue adding new functions to it.

Installation

PRESM is a batteries-included JAR executable, so no installation is needed aside from Java 8. Please download the executable PRESM.jar from the latest release and run it using the standard command for Java packages: java [-Xmx] -jar PRESM.jar

Building From Source

Building the project from source will require Apache Maven 3.6.1. First, clone the repository to a local folder.

git clone git@github.com:theLongLab/PRESM.git

The dependency JARs must then be downloaded separately and installed to the local repository.

# Refer to pom.xml for groupId, artifactId, and version;
# alternatively, use your own naming and change pom.xml to match.
mvn install:install-file \
    -Dfile=path-to-jar \
    -DgroupId=group-id \
    -DartifactId=artifact-id \
    -Dversion=version \
    -Dpackaging=jar

Afterwards, compile the project and the JAR will be in the target/ folder.

mvn package

Functions

  • Processing variant files generated by GATK, Pindel, or other variant-calling software: combining two variant files (for SNPs and indels, respectively); selecting homozygous or heterozygous variants; removing variants with duplicated coordinates.
  • Generating the personalized reference genome according to the germline mutations provided by the users.
  • Generating the modified background database files according to the personalized reference genomes. For example, the personalized dbSNP, db.Indel, and cosmic.vcf can be generated. (Several downstream somatic mutation callers require these files.)
  • Mapping the coordinates of somatic variants called against the personalized reference genome to the coordinates of the universal reference genome.
  • Replacing the alternative alleles with reference bases according to the heterozygous variants provided by the users.

Commands and options

All the functions are used as: java [-Xmx] -jar /path/to/presm.jar <options>
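Because every function shares the same invocation pattern, a small wrapper can assemble the argument list. The following is a minimal sketch, not part of PRESM: the helper name `presm_command` and its keyword convention are hypothetical, and only flags that appear in this README are used in the example.

```python
import shlex

def presm_command(function, jar="PRESM.jar", xmx=None, **options):
    """Build the argument list for one PRESM call (hypothetical helper)."""
    cmd = ["java"]
    if xmx:
        cmd.append(f"-Xmx{xmx}")       # standard JVM heap flag, e.g. -Xmx4g
    cmd += ["-jar", jar, "-F", function]
    for name, value in options.items():
        cmd.append(f"-{name}")
        if value is not True:          # True marks a bare switch such as -removeduplicates
            cmd.append(str(value))
    return cmd

cmd = presm_command("SortVariants", xmx="4g",
                    R="ref.fasta", variants="input.vcf", O="output.vcf")
print(shlex.join(cmd))
# java -Xmx4g -jar PRESM.jar -F SortVariants -R ref.fasta -variants input.vcf -O output.vcf
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once Java and the JAR are available.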

CombineVariants: Combine two variant call files according to the reference genome.

> -F CombineVariants -R ref.fasta -variant1 input1.vcf -variant2 input2.vcf -O output.vcf

Parameters:

  • -R: input the reference genome file
  • -variant1: input variant file 1 (in vcf format)
  • -variant2: input variant file 2 (in vcf format)
  • -O: output the combined variant call file in vcf format

SelectGenotype: Select homozygous or heterozygous variants in the variant call file provided by the users.

> -F SelectGenotype -genotype homo[heter] -variants input.vcf -O output.vcf

Parameters:

  • -genotype: specify the genotype of the variants to select (homo for homozygous, heter for heterozygous)
  • -variants: input the variants in vcf format
  • -O: output the specified genotype variants in vcf format
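To illustrate what selecting by genotype means, the sketch below classifies a single VCF data line from its GT field. It is an illustration under assumptions, not PRESM's own code: the function name `genotype_class` is hypothetical, the line is assumed to be tab-separated with a FORMAT column and at least one sample, and only diploid genotypes are handled.

```python
def genotype_class(vcf_line, sample_index=0):
    """Return 'homo', 'heter', or None for one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")              # FORMAT column, e.g. GT:DP
    sample = fields[9 + sample_index].split(":")
    gt = sample[fmt_keys.index("GT")]
    alleles = gt.replace("|", "/").split("/")    # treat phased and unphased alike
    if "." in alleles or len(alleles) != 2:
        return None                              # missing or non-diploid call
    if alleles[0] == alleles[1]:
        # 0/0 is homozygous reference, so it carries no variant to select
        return "homo" if alleles[0] != "0" else None
    return "heter"

line = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0|1:30"
print(genotype_class(line))
```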

RemoveOverlaps: Remove overlapping variants in a variant call file.

> -F RemoveOverlaps -R ref.fasta -variants input.vcf -O output.vcf

Parameters:

  • -R: input the reference genome file
  • -variants: input the variant in vcf format
  • -O: output the variants, with overlapping entries removed, in vcf format
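One way to think about overlap removal is in terms of the reference span each variant covers. The sketch below is an assumption-laden illustration rather than PRESM's actual rule: it takes variants as `(chrom, pos, ref, alt)` tuples already sorted by position within each chromosome, and it keeps the first variant of any overlapping group (which variant PRESM keeps is not stated in this README).

```python
def remove_overlaps(variants):
    """Drop variants whose reference span overlaps an earlier kept variant."""
    kept = []
    last_end = {}                         # chrom -> last reference base covered
    for chrom, pos, ref, alt in variants:
        end = pos + len(ref) - 1          # 1-based inclusive end of the REF allele
        if chrom in last_end and pos <= last_end[chrom]:
            continue                      # starts inside a span that is already kept
        kept.append((chrom, pos, ref, alt))
        last_end[chrom] = end
    return kept

variants = [
    ("chr1", 100, "ACGT", "A"),   # deletion covering positions 100-103
    ("chr1", 102, "G", "T"),      # falls inside the deletion
    ("chr1", 104, "C", "G"),
    ("chr1", 104, "C", "T"),      # duplicated coordinate
]
print(remove_overlaps(variants))
```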

SortVariants: Sort variants according to the reference genome coordinates.

> -F SortVariants -R ref.fasta -variants input.vcf -O output.vcf

Parameters:

  • -R: input the reference genome file
  • -variants: input the variant in vcf format
  • -O: output the sorted variant in vcf format
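Sorting "according to the reference genome coordinates" usually means ordering contigs as they appear in the reference FASTA rather than alphabetically. The sketch below shows that idea under stated assumptions: the function names are hypothetical, input is given as lists of text lines, and this is not taken from PRESM's source.

```python
def contig_order(fasta_lines):
    """Map each contig name to its rank of appearance in the FASTA."""
    order = {}
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].split()[0]    # name is the first token after '>'
            order.setdefault(name, len(order))
    return order

def sort_variants(vcf_lines, order):
    """Keep header lines first, then sort records by (contig rank, position)."""
    header = [l for l in vcf_lines if l.startswith("#")]
    records = [l for l in vcf_lines if not l.startswith("#")]

    def key(line):
        chrom, pos = line.split("\t")[:2]
        return (order[chrom], int(pos))

    return header + sorted(records, key=key)

order = contig_order([">chr2 example", "ACGT", ">chr10", "AC", ">chr1", "GG"])
vcf = ["#CHROM\tPOS", "chr1\t5", "chr10\t7", "chr2\t30", "chr2\t4"]
print(sort_variants(vcf, order))
```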

MakePersonalizedReference: Generate personalized reference genome according to the germline mutations provided by the users.

> -F MakePersonalizedReference -I ref.fasta -germlinemutations input.vcf -O output.fa [-intervals input.intervals] [-genotype homo/heter]

Parameters:

  • -I: input the reference genome file
  • -germlinemutations: input the germline mutations in vcf format
  • -O: output the personalized reference genome in fasta format

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants
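Conceptually, a personalized reference is the original sequence with each germline variant's REF allele replaced by its ALT allele. The sketch below applies this to a single sequence string. It is an illustration, not PRESM's implementation: the name `apply_variants` is hypothetical, and variants are assumed to be non-overlapping `(pos, ref, alt)` tuples with 1-based positions.

```python
def apply_variants(sequence, variants):
    """Return the sequence with each REF allele replaced by its ALT allele."""
    out = []
    cursor = 0                                    # 0-based index of next unused base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if sequence[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        out.append(sequence[cursor:start])        # unchanged bases before the variant
        out.append(alt)
        cursor = start + len(ref)
    out.append(sequence[cursor:])                 # unchanged tail
    return "".join(out)

variants = [(2, "C", "T"), (4, "T", "TGG"), (6, "CG", "C")]  # SNP, insertion, deletion
personalized = apply_variants("ACGTACGT", variants)
print(personalized)
# ATGTGGACT
```

Insertions and deletions change the sequence length, which is why coordinates on the personalized reference later need to be mapped back to the universal reference.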

MakePersonalizedVariantsDB: Generate personalized variants database files according to the germline mutations provided by the users.

> -F MakePersonalizedVariants -I input.vcf -O output.vcf -variants variant.vcf [-intervals input.intervals] [-genotype homo/heter] [-removeduplicates]

Parameters:

  • -I: input the variants database in vcf format
  • -O: output the personalized variants database in vcf format
  • -variants: input the mutations in vcf format

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants
  • -removeduplicates: remove duplicated variants

MapVariants: Map the personalized reference genome-based coordinates of the variants to their corresponding coordinates in the universal reference genome.

> -F MapVariants -I input.vcf -O output.vcf -germlinemutations variant.vcf [-intervals input.intervals] [-genotype homo/heter] [-removeduplicates]

Parameters:

  • -I: input the somatic mutations in vcf format
  • -O: output the somatic mutations being mapped to the universal reference genome in vcf format
  • -germlinemutations: input the germline mutations in vcf format

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants
  • -removeduplicates: remove duplicated variants
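The coordinate mapping can be pictured as undoing the cumulative length change introduced by upstream germline indels. The following sketch works on one chromosome under stated assumptions: the name `to_universal` is hypothetical, variants are non-overlapping `(pos, ref, alt)` tuples in universal 1-based coordinates, and positions inside an inserted segment are mapped to the anchor base of the insertion (a choice made here for illustration, not confirmed from PRESM).

```python
def to_universal(p, variants):
    """Map a 1-based personalized position back to the universal reference."""
    shift = 0                                 # personalized minus universal offset so far
    for pos, ref, alt in sorted(variants):
        pers_start = pos + shift              # where this variant starts after earlier shifts
        if p < pers_start:
            break                             # p lies before this variant
        if p < pers_start + len(alt):
            offset = p - pers_start           # p lies inside the ALT allele
            return pos + min(offset, len(ref) - 1)
        shift += len(alt) - len(ref)
    return p - shift

# Universal ACGTACGT becomes ATGTGGACT after these variants are applied.
variants = [(2, "C", "T"), (4, "T", "TGG"), (6, "CG", "C")]
mapped = [to_universal(p, variants) for p in range(1, 10)]
print(mapped)
# [1, 2, 3, 4, 4, 4, 5, 6, 8]
```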

ReplaceGenotype: Replacing the alternative alleles in the sequencing reads with reference bases according to the heterozygous variants provided by the users.

> -F ReplaceGenotype -I input.sam -germlinemutations germlinemutations.vcf -O output.sam -readlength len [-intervals input.intervals] [-genotype homo/heter]

Parameters:

  • -I: input the sequence alignment map file in sam format
  • -germlinemutations: input the germline mutations in vcf format
  • -O: output the replaced sequence alignment map file in sam format
  • -readlength: the sequencing read length

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants
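The core idea of the replacement step can be shown on a single read. The sketch below is deliberately simplified and is not PRESM's implementation: the name `replace_het_snps` is hypothetical, only heterozygous SNPs are handled (so read and reference coordinates line up one to one), and the read is assumed to be aligned without gaps starting at the 1-based position `read_start`.

```python
def replace_het_snps(read_seq, read_start, het_snps):
    """Swap ALT bases for REF bases at heterozygous SNP positions in a read.

    het_snps maps a 1-based reference position to (ref_base, alt_base).
    """
    bases = list(read_seq)
    for i, base in enumerate(bases):
        snp = het_snps.get(read_start + i)
        if snp and base == snp[1]:        # read carries the alternative allele
            bases[i] = snp[0]             # replace it with the reference base
    return "".join(bases)

het_snps = {102: ("A", "G"), 104: ("C", "T")}
print(replace_het_snps("ACGTT", 100, het_snps))
# ACATC
```

Reads that already carry the reference allele at these positions pass through unchanged, so after this step all reads look the same at heterozygous germline sites.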

ViewFasta: View specified region of sequence in reference genome.

> -F ViewFasta -R ref.fasta [-L input.list] [-region specified region]

Parameters:

  • -R: input the reference genome file
  • -L: input a region list file; use this to view multiple regions
  • -region: input one specified region; use this to view a single region

Example of region specifications format:

chr1: Output whole sequence of chromosome 1 in the reference genome.

chr2: 5000 Output the chromosome 2 sequence which begins at base position 5000 and ends at the end of chromosome 2.

chr3: 500-600 Output the chromosome 3 sequence which begins at base position 500 and ends at base position 600 of chromosome 3.
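The three region forms above can be parsed with a few string operations. This is a sketch under assumptions, not PRESM's parser: the function names are hypothetical, whitespace around the colon is tolerated, and positions are treated as 1-based and inclusive.

```python
def parse_region(spec):
    """Parse 'chr1', 'chr2: 5000', or 'chr3: 500-600' into (chrom, start, end).

    end is None when the region runs to the end of the chromosome.
    """
    chrom, _, rng = spec.partition(":")
    chrom, rng = chrom.strip(), rng.strip()
    if not rng:
        return chrom, 1, None                 # whole chromosome
    start, _, end = rng.partition("-")
    return chrom, int(start), (int(end) if end else None)

def slice_region(sequence, start, end):
    """Extract the 1-based inclusive interval [start, end] from a sequence."""
    return sequence[start - 1:end]            # end=None slices to the last base

print(parse_region("chr3: 500-600"))
# ('chr3', 500, 600)
```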

SomaticMutationsOnGermlineInsertion: Output the relative coordinate of somatic mutations located on germline insertions.

> -F SomaticMutationsOnGermlineInsertion -germlinemutations germlinemutation.vcf -I input.vcf -O output.txt [-intervals input.intervals] [-genotype homo/heter]

Parameters:

  • -germlinemutations: input the germline mutations in vcf format
  • -I: input the somatic mutations (using the personalized coordinate system) in vcf format
  • -O: output the locations of somatic mutations on germline insertions

Options:

  • -intervals: specify the region of variants
  • -genotype: specify the genotype of variants
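A somatic mutation that falls on inserted bases has no coordinate of its own in the universal reference, so it can only be described relative to the insertion. The sketch below shows one way to compute such a relative coordinate. It rests on assumptions: the name `locate_on_insertion` and the `(anchor position, offset)` return shape are hypothetical, variants are non-overlapping `(pos, ref, alt)` tuples in universal 1-based coordinates, and the actual output format of PRESM is not described in this README.

```python
def locate_on_insertion(p, variants):
    """Return (universal anchor position, 1-based offset into the inserted bases)
    if personalized position p falls on a germline insertion, else None."""
    shift = 0
    for pos, ref, alt in sorted(variants):
        pers_start = pos + shift
        if p < pers_start:
            return None                           # p lies before this variant
        inserted = len(alt) - len(ref)
        first_inserted = pers_start + len(ref)    # first base that exists only in ALT
        if inserted > 0 and first_inserted <= p < pers_start + len(alt):
            return pos, p - first_inserted + 1
        shift += inserted
    return None

# Universal ACGTACGT becomes ATGTGGACT; positions 5 and 6 are the inserted GG.
variants = [(2, "C", "T"), (4, "T", "TGG"), (6, "CG", "C")]
print(locate_on_insertion(5, variants), locate_on_insertion(7, variants))
# (4, 1) None
```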

Citation

If you find PRESM useful towards your project, please cite the publication as located here.

Contacts