Bystro genetic analysis (annotation, filtering, statistics)

Bystro

Using Bystro

For most users, we recommend https://bystro.io .

The web app gives full access to all of Bystro's capabilities, provides a convenient search/filtering interface, supports large data sets (tested up to 890GB uncompressed/129GB compressed), and has excellent performance.

Installing Bystro

Follow the instructions in INSTALL.md.

Bystro relies on pluggable pre-processors (configured via Bystro's YAML config) to normalize variant inputs (handling VCF issues such as padding), determine whether a site is a transition or transversion, calculate sample MAF, identify heterozygous, homozygous, and missing samples, and calculate heterozygosity, homozygosity, missingness, and more. The available pre-processors are:

  1. VCF format: Bystro-Vcf
  2. SNP format: Bystro-SNP
  3. Create your own to support other formats!
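As an illustration of one of the calculations listed above, the transition/transversion call for a single-nucleotide substitution can be sketched as follows. This is a minimal sketch of the standard definition (purine to purine or pyrimidine to pyrimidine is a transition, anything else is a transversion), not Bystro's actual pre-processor code; the function name `classify_snp` is hypothetical.

```python
# Hypothetical helper, not part of Bystro-Vcf or Bystro-SNP.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    # A transition stays within one chemical class (A<->G or C<->T).
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

A custom pre-processor for another input format would need to produce the same kind of per-site classification.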

Annotation (Output) Field Descriptions

Please read FIELDS.md.

The Bystro configuration file

  • The config file describes the state of both the database and the annotation. It is required for both annotating and building.
  • It has several keys:
    • tracks: The highest level organization for database values. Tracks have a name property, which must be unique, and a type, which must be one of:

      • sparse: Any bed file, or any file that can be mapped to chrom, chromStart, and chromEnd columns.
        • This is used for dbSNP and Clinvar records, but many files can be fit to this format.
        • Mapping fields can be managed by the fieldMap key
      • score: Accepts any wigFix file.
        • Used for phastCons, phyloP
      • cadd:
        • Accepts any CADD file, or Bystro's custom "bed-like" CADD file, which has 2 header lines, and chrom, chromStart, chromEnd columns, followed by standard CADD fields
      • gene: A UCSC gene track field (ex: knownGene, refGene, sgdGene).
        • The local_files for this are created using an sql_statement
        • Ex: SELECT * FROM hg38.refGene LEFT JOIN hg38.kgXref ON hg38.kgXref.refseq = hg38.refGene.name
    • chromosomes: The allowable chromosomes.

      • Each row of every track must be identified by these chromosomes (during building)
      • Each row of any input file submitted for annotation must also be identified by these chromosomes (during annotation)
      • However, Bystro is flexible about the chr prefix

      Ex: For the following config

      chromosomes:
      - chr1
      - chr2
      - chr3

      Only chr1, chr2, and chr3 will be accepted. However, Bystro tries to make your life easy:

      1. We currently follow UCSC conventions for chromosomes, meaning they should be prefixed with chr
      2. Bystro will automatically prepend chr to chromosomes read from an input file during annotation.
      3. Bystro allows the transformation of any field during building, configurable in the YAML config file for that assembly, making it easy to prepend chr to the source file chromosome field

      Ex: Clinvar doesn't have a chr prefix, so during building we specify:

      tracks:
        - name: clinvar
          build_field_transformations:
            chrom: chr .
          fieldMap:
            Chromosome: chrom

      Here fieldMap allows us to rename header fields, and build_field_transformations allows us to define a prepend operation (chr . can be interpreted as the Perl expression "chr" . $chrom, which concatenates the prefix with the field value)

      So: input files do not need to have their chromosomes prepended by chr. Bystro will normalize the name.

      In this example chromosomes 1 and chr1 will be built/annotated, but 1_rand will not.
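The chromosome handling described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not Bystro's implementation (Bystro itself is written in Perl): the function names are hypothetical, only the `PREFIX .` prepend form shown in the Clinvar example is handled, and applying fieldMap before build_field_transformations is inferred from the fact that the example transformation is keyed on the renamed field chrom.

```python
# The `chromosomes` list from the example config above.
ALLOWED = ["chr1", "chr2", "chr3"]


def rename_fields(row, field_map):
    """Rename header fields the way fieldMap does (e.g. Chromosome -> chrom)."""
    return {field_map.get(key, key): value for key, value in row.items()}


def prepend_transform(row, transformations):
    """Apply 'PREFIX .' transformations, read as PREFIX . $value in Perl."""
    out = dict(row)
    for field, spec in transformations.items():
        parts = spec.split()
        if len(parts) != 2 or parts[1] != ".":
            raise ValueError(f"only the 'PREFIX .' form is sketched here: {spec!r}")
        if field in out:
            out[field] = parts[0] + out[field]
    return out


def normalize_chrom(name, allowed=ALLOWED):
    """Prepend chr if it is missing, then accept only configured chromosomes."""
    candidate = name if name.startswith("chr") else "chr" + name
    return candidate if candidate in allowed else None


# Build time: a Clinvar-like row without the chr prefix.
row = {"Chromosome": "1", "Start": "100"}
built = prepend_transform(rename_fields(row, {"Chromosome": "chrom"}), {"chrom": "chr ."})
```

With these definitions, `built["chrom"]` is `"chr1"`, and at annotation time `normalize_chrom("1")` and `normalize_chrom("chr1")` both give `"chr1"`, while `normalize_chrom("1_rand")` gives `None` because chr1_rand is not in the configured list.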

Directories and Files

These describe where the Bystro database and any source files are located.

  1. files_dir : The parent folder within which each track's local_files are located
    • Bystro automatically checks for local_files at files_dir/trackName/file

    Ex: For the config file containing

    files_dir: /path/to/files/
    tracks:
      - name: refSeq
        local_files:
          - hg19.refGene.chr1.gz
          # and more files

    Bystro will expect files in /path/to/files/refSeq/hg19.refGene.chr1.gz

  2. database_dir : Each database is held within database_dir, in a folder named after the assembly

    Ex: For the config file containing

    assembly: hg19
    database_dir: /path/to/databases/

    Bystro will look for the database /path/to/databases/hg19
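The two lookups above amount to simple path joins. The following is a small sketch of that rule using the example values from this section; the function names `local_file_path` and `database_path` are hypothetical and only illustrate where Bystro is described as looking, not how it does so internally.

```python
from pathlib import PurePosixPath


def local_file_path(files_dir, track_name, file_name):
    """Expected location of one of a track's local_files: files_dir/trackName/file."""
    return PurePosixPath(files_dir) / track_name / file_name


def database_path(database_dir, assembly):
    """Expected location of the database: database_dir/assembly."""
    return PurePosixPath(database_dir) / assembly


# Values taken from the example configs above.
ref_seq_file = local_file_path("/path/to/files/", "refSeq", "hg19.refGene.chr1.gz")
hg19_db = database_path("/path/to/databases/", "hg19")
```

Here `ref_seq_file` is `/path/to/files/refSeq/hg19.refGene.chr1.gz` and `hg19_db` is `/path/to/databases/hg19`, matching the paths given in the examples.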