Skip to content
Branch: master
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
..
Failed to load latest commit information.
README.md
phaser_gene_ae.py

README.md

phASER Gene AE

Uses output from phASER to produce gene level haplotype counts for allelic expression studies. It does this by summing reads from both single variants and phASER haplotype blocks using their phase for each gene.

Developed by Stephane E. Castel in the Lappalainen Lab at the New York Genome Center and Columbia University Department of Systems Biology.

Runs on Python 2.7.x and has the following dependencies: pandas, IntervalTree

Usage

Requires phASER to have been run with a phased VCF as input with unphased_vars enabled. Takes an input BED format file containing the coordinates for genes (feautres) where haplotypic counts are to be measured. NOTE this version of phaser_gene_ae is only compatible with results from phASER v1.0.0+.

Important Note - If multiple input BAMs were used when running phASER, gene level haplotypic counts will be generated for each input BAM independently. In the output file, the column "bam" indicates which input BAM the haplotypic counts correspond to.

Useful files

The specific features to produce haplotypic counts for must be provided in BED format. This is most often genes. A file containing coordinates for gencode genes is included here for convenience. Note that these annotations have been filtered to be consistent with what was used for calling GTEx eQTLs. Please ensure that they contain annotations appropriate for your specific analysis.

hg19:

hg38:

Arguments

Required

  • --haplotypic_counts - Output file from phASER containing read counts for haplotype blocks.
  • --features - File in BED format (0 BASED COORDINATES - chr,start,stop,name) containing the features to produce counts for.
  • --o - Output file.

Optional

  • --id_separator (_) - Separator used for generating unique variant IDs when phASER was run.
  • --gw_cutoff (0.9) - Minimum genome wide phase confidence for phASER haplotype blocks.
  • --min_cov (0) - Minimum total coverage for a feature to be outputted.
  • --min_haplo_maf (0) - The minimum MAF used to phase a haplotype for it to be considered genome wide phased when generating gene level counts. Setting this number higher will result in more confident phasing if genotypes were population prephased. Value must be between 0 and 0.5.

Output File

Contains the haplotype counts (A = genome wide haplotype 0, B = genome wide haplotype 1) for each feature, for each input BAM. The column "gw_phased" indicates if a feature is genome wide phased, meaning that for two features that are both genome wide phased, reads from haplotype A for each would have come from the same DNA molecule.

  • 1 - contig - Feature contig.
  • 2 - start - Feature start (0 based).
  • 3 - stop - Feature stop (0 based).
  • 4 - name - Feature name.
  • 5 - aCount - Total allelic count for haplotype A.
  • 6 - bCount - Total allelic count for haplotype B.
  • 7 - totalCount - Total allelic coverage of this feature (aCount + bCount).
  • 8 - log2_aFC - Effect size for the allelic imbalance reported as allelic fold change (log2(aCount/bCount)) defined in our paper.
  • 9 - n_variants - Number of variants with allelic data in this feature.
  • 10 - variants - List of variants with allelic data in this feature (contig_position_ref_alt).
  • 11 - gw_phased - The feature is genome wide phased, meaning that for two features that are both genome wide phased, reads from haplotype A for each would have come from the same DNA molecule (0,1).
  • 12 - bam - Input BAM used when running phASER that these counts correspond to.
You can’t perform that action at this time.