## Canine GWAS Reference Alignment

This module aligns variants in the reference dataset to those in the target dataset. These operations have no analog in the UKBB analysis, presumably because the 1KG and UKBB genotyping datasets were already aligned to a common reference genome and used the same minor/major allele assignments.

Steps:

- Join the variants in both datasets by locus
- Check which variants have alleles in conflicting order or on the opposite strand
- Reorder or complement the alleles, and invert calls where necessary, in the reference data
- Re-check the allele co-occurrence between the datasets
- Export a reference dataset that contains only variants in the target dataset, all with the same orientation
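
The orientation check in the second step can be illustrated without Hail. The following is a minimal sketch in plain Python, assuming biallelic SNPs written as two-character strings with the reference allele first; the function name `classify_orientation` and the `COMPLEMENT` table are illustrative and not part of the notebook's helpers.

```python
# Complement lookup for single-nucleotide alleles
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_orientation(src: str, tgt: str):
    """Return the operation that maps the 'src' allele pair onto 'tgt', or None.

    Allele pairs are two-character strings such as "AC" (reference allele first).
    """
    if src == tgt:
        return "same"                    # AC vs AC
    if src == tgt[::-1]:
        return "order_flip"              # AC vs CA
    if src == tgt.translate(COMPLEMENT):
        return "strand_flip"             # AC vs TG
    if src == tgt.translate(COMPLEMENT)[::-1]:
        return "order_flip+strand_flip"  # AC vs GT
    return None                          # alleles cannot be reconciled


# Classify a few source allele pairs against the target pair "AC"
for src in ["AC", "CA", "TG", "GT", "AG"]:
    print(src, "->", classify_orientation(src, "AC"))
```

Note that A/T and C/G pairs are ambiguous under this scheme, because an order flip and a strand flip produce the same string; the checks are applied in order, so such pairs would be labelled `order_flip`. The target data in this notebook appears to contain only A/C and A/G pairs, so the ambiguity should not arise here.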

In [1]:
import hail as hl
import pandas as pd
import numpy as np
import plotnine as pn
import plotly.express as px
import os.path as osp
%run ../../nb.py
%run paths.py
%run common.py
gab.register_timeop_magic(get_ipython(), 'hail')
hl.init()

Running on Apache Spark version 2.4.4
SparkUI available at http://a783b4e25167:4041
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.30-2ae07d872f43
LOGGING: writing to /home/eczech/repos/gwas-analysis/notebooks/organism/canine/hail-20200213-0517-0.2.30-2ae07d872f43.log


### Load Target and Reference Data

In [2]:
%%capture
hl.ReferenceGenome(**load_reference_genome(REF_GENOME_FILE))

In [3]:
mt_ref = hl.read_matrix_table(osp.join(WORK_DIR, REF_QC_02_FILE + '.mt'))
mt_ref.count()

(36395, 1350)

In [4]:
mt_tgt = hl.import_plink(
    *plink_files(ORGANISM_CANINE_TGT_DIR, PLINK_FILE_TGT),
    skip_invalid_loci=False,
    reference_genome='canine'
)
mt_tgt.count()

2020-02-13 05:17:34 Hail: INFO: Found 4342 samples in fam file.
2020-02-13 05:17:34 Hail: INFO: Found 160727 variants in bim file.
2020-02-13 05:17:36 Hail: INFO: Coerced sorted dataset


(160727, 4342)

In [5]:
def get_alt_allele_freq(mt):
    """Count calls by number of alt alleles (0, 1 or 2) across all entries"""
    # Histogram with 3 bins over [0, 2], i.e. one bin per possible alt allele count
    cts = mt.aggregate_entries(hl.agg.hist(mt.GT.n_alt_alleles(), 0, 2, 3))
    cts = pd.Series(cts.bin_freq).rename('count').rename_axis('n_alt_alleles').reset_index()
    # Make sure that the most common count is 0 (homozygous reference)
    assert cts.sort_values('count')['n_alt_alleles'].tail(1).values[0] == 0
    return cts
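
The sanity check in this function can be sketched on toy data. The example below is a plain-Python illustration, assuming the alt allele is meant to be the minor allele so that homozygous-reference calls should be the most frequent; the `calls` list is hypothetical.

```python
from collections import Counter

# Hypothetical toy genotypes: number of alt alleles per call (None = missing call)
calls = [0, 0, 1, 2, 0, None, 1, 0]

# Tally the non-missing calls by alt allele count
counts = Counter(c for c in calls if c is not None)
most_common_value, _ = counts.most_common(1)[0]

# Same idea as the assertion above: 0 alt alleles should be the most frequent value
assert most_common_value == 0
print(dict(sorted(counts.items())))
```

If most calls in a dataset were homozygous for the alt allele instead, that would suggest the allele order is inverted relative to the expected convention.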

In [6]:
get_alt_allele_freq(mt_tgt)

2020-02-13 05:17:37 Hail: INFO: Coerced sorted dataset


| | n_alt_alleles | count |
|---|---|---|
| 0 | 0 | 422763319 |
| 1 | 1 | 181266267 |
| 2 | 2 | 93838364 |


In [7]:
get_alt_allele_freq(mt_ref)

| | n_alt_alleles | count |
|---|---|---|
| 0 | 0 | 26895996 |
| 1 | 1 | 14281635 |
| 2 | 2 | 7907143 |
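
Because the two datasets differ greatly in size, the raw counts are easier to compare as proportions. A small sketch, with the counts copied from the two outputs above and a hypothetical helper `proportions`:

```python
# Counts copied from the two outputs above (n_alt_alleles -> number of calls)
tgt_counts = {0: 422763319, 1: 181266267, 2: 93838364}
ref_counts = {0: 26895996, 1: 14281635, 2: 7907143}


def proportions(counts):
    """Convert a mapping of counts into a mapping of fractions that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


for name, counts in [("target", tgt_counts), ("reference", ref_counts)]:
    props = proportions(counts)
    print(name, {k: round(p, 3) for k, p in props.items()})
```

In both datasets, homozygous-reference calls make up more than half of the counted calls.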


### Check Alignment

In [23]:
def get_variant_orientation(mt_src, mt_tgt):
    """Join variants by locus in two datasets and categorize call status based on allele orientation
    
    See https://privefl.github.io/bigsnpr/reference/snp_match.html for a reference implementation
    including similar functionality
    """
    
    # Select only locus (as key) and alleles from a table
    def prep(ht, typ):
        c = 'alleles_' + typ
        ht = ht.key_by('locus').select('alleles').rename({'alleles': c})
        ht = ht.annotate(**{c: hl.delimit(ht[c], '')})
        return ht
    
    # Join dataset rows (i.e. variants)
    ht = prep(mt_src.rows(), 'src')\
        .join(prep(mt_tgt.rows(), 'tgt'), how='outer')
    
    # Determine what the orientation of the variants is with respect to the "src" dataset
    # (i.e. this value implies what needs to be done to the alleles/calls to align with "tgt")
    ht = ht.annotate(
        orientation=hl.case()
        .when(ht.alleles_src == ht.alleles_tgt, 'same') # AC = AC
        .when(ht.alleles_src == ht.alleles_tgt.reverse(), 'order_flip') # AC = CA
        .when(ht.alleles_src == hl.reverse_complement(ht.alleles_tgt).reverse(), 'strand_flip') # AC = TG
        .when(ht.alleles_src == hl.reverse_complement(ht.alleles_tgt), 'order_flip+strand_flip') # AC = GT
        .or_missing()
    )
    ht = ht.annotate(
        status=hl.case()
        .when(hl.is_defined(ht.orientation), 'in_both')
        .when(hl.is_defined(ht.alleles_tgt), 'only_tgt')
        .when(hl.is_defined(ht.alleles_src), 'only_src')
        .or_missing()
    )
    return ht

ht = get_variant_orientation(mt_ref, mt_tgt)
ht.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<canine> 
    'alleles_src': str 
    'alleles_tgt': str 
    'orientation': str 
    'status': str 
----------------------------------------
Key: ['locus']
----------------------------------------
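
The join and the `status` field can be mimicked on toy data. This is a rough plain-Python sketch of an outer join on locus; the loci, allele strings and the `compatible` helper are hypothetical and only stand in for the Hail tables.

```python
# Hypothetical toy variant tables: locus -> allele string (reference allele first)
src = {"1:100": "AC", "1:200": "TC", "1:300": "GA"}
tgt = {"1:100": "CA", "1:200": "AG", "1:400": "AG"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def compatible(a: str, b: str) -> bool:
    """True if 'a' equals 'b' up to allele order and/or strand."""
    comp = b.translate(COMPLEMENT)
    return a in {b, b[::-1], comp, comp[::-1]}


status = {}
for locus in sorted(set(src) | set(tgt)):  # outer join on locus
    a, b = src.get(locus), tgt.get(locus)
    if a is not None and b is not None and compatible(a, b):
        status[locus] = "in_both"
    elif b is not None:
        status[locus] = "only_tgt"
    else:
        status[locus] = "only_src"

print(status)
```

As in the function above, the checks are evaluated in order, so a locus present in both tables with irreconcilable alleles would fall through to `only_tgt`.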


Show the unaligned allele co-occurrence frequencies:

In [24]:
ht.to_pandas().groupby(['alleles_src', 'alleles_tgt']).size().unstack().fillna(0).astype(int)

2020-02-13 05:24:47 Hail: INFO: Coerced sorted dataset


| alleles_src (rows) / alleles_tgt (columns) | AC | CA | AG | GA |
|---|---|---|---|---|
| AC | 1449 | 61 | 0 | 0 |
| AG | 0 | 0 | 6180 | 307 |
| CA | 79 | 1600 | 0 | 0 |
| CT | 0 | 0 | 364 | 7995 |
| GA | 0 | 0 | 385 | 8130 |
| GT | 74 | 1631 | 0 | 0 |
| TC | 0 | 0 | 6219 | 343 |
| TG | 1437 | 78 | 0 | 0 |


Show how often variants are in one dataset or both:

In [25]:
ht.aggregate(hl.agg.counter(ht.status))

2020-02-13 05:24:52 Hail: INFO: Coerced sorted dataset


{'in_both': 36332, 'only_src': 63, 'only_tgt': 124395}
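
These counts can be cross-checked against the dataset sizes reported earlier. The sketch below assumes loci are unique within each dataset; the variable names are illustrative.

```python
# Status counts from the output above
status_counts = {"in_both": 36332, "only_src": 63, "only_tgt": 124395}

# Variant (row) counts reported when the datasets were loaded
n_ref_variants = 36395   # rows in mt_ref
n_tgt_variants = 160727  # rows in mt_tgt

# Every reference variant is either matched or reference-only
assert status_counts["in_both"] + status_counts["only_src"] == n_ref_variants
# Every target variant is either matched or target-only
assert status_counts["in_both"] + status_counts["only_tgt"] == n_tgt_variants

print("fraction of reference variants matched:",
      round(status_counts["in_both"] / n_ref_variants, 4))
```

Under the uniqueness assumption, the first equality also suggests that no locus is shared with irreconcilable alleles, since such loci would be labelled `only_tgt` and would be missing from the reference total.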

Show how many variants have a defined orientation, followed by the count per orientation:

In [26]:
ht.aggregate(hl.agg.sum(hl.is_defined(ht.orientation)))

2020-02-13 05:24:54 Hail: INFO: Coerced sorted dataset


36332

In [27]:
ht.aggregate(hl.agg.counter(ht.orientation))

2020-02-13 05:24:56 Hail: INFO: Coerced sorted dataset


{None: 124458,
 'order_flip+strand_flip': 859,
 'strand_flip': 17282,
 'order_flip': 832,
 'same': 17359}
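
The per-orientation counts can be reproduced from the unaligned co-occurrence table shown earlier. A minimal plain-Python sketch, with the non-zero cells copied from that table and a hypothetical `classify` helper applying the same four checks:

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify(src, tgt):
    """Label how the 'src' allele string relates to 'tgt' (None if unrelated)."""
    comp = tgt.translate(COMPLEMENT)
    options = [(tgt, "same"), (tgt[::-1], "order_flip"),
               (comp, "strand_flip"), (comp[::-1], "order_flip+strand_flip")]
    return next((label for alleles, label in options if src == alleles), None)


# Non-zero cells of the unaligned table: (alleles_src, alleles_tgt) -> count
cooccurrence = {
    ("AC", "AC"): 1449, ("AC", "CA"): 61,
    ("AG", "AG"): 6180, ("AG", "GA"): 307,
    ("CA", "AC"): 79,   ("CA", "CA"): 1600,
    ("CT", "AG"): 364,  ("CT", "GA"): 7995,
    ("GA", "AG"): 385,  ("GA", "GA"): 8130,
    ("GT", "AC"): 74,   ("GT", "CA"): 1631,
    ("TC", "AG"): 6219, ("TC", "GA"): 343,
    ("TG", "AC"): 1437, ("TG", "CA"): 78,
}

# Sum the cell counts per orientation label
totals = Counter()
for (src, tgt), n in cooccurrence.items():
    totals[classify(src, tgt)] += n

print(dict(totals))
```

The totals agree with the counter output above, which indicates that each cell of the table corresponds to exactly one orientation category.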

### Apply Alignment

In [28]:
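def align_variant_orientation(mt_src, ht_orientation):
    """Rewrite alleles and calls in `mt_src` so that they match the target orientation
    
    Assumes biallelic variants; flipped calls are returned as unphased diploid calls, and
    rows with no orientation (no compatible target variant) get missing alleles
    """
    # Add orientation as a row field
    mt = mt_src.annotate_rows(orientation=ht_orientation[mt_src.locus].orientation)
    # Orientations where ref and alt swap roles, so the calls must be inverted
    flipped_orientations = hl.set(['order_flip', 'order_flip+strand_flip'])
    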
    
    # Flip calls where appropriate
    mt = mt.annotate_entries(
        GT=hl.case()
        .when(
            flipped_orientations.contains(mt.orientation), 
            hl.unphased_diploid_gt_index_call(2 - mt.GT.n_alt_alleles())
        ).default(mt.GT)
    )
    
    # Flip allele arrays where appropriate
    keys = list(mt.row_key.keys())
    mt = mt.key_rows_by('locus')
    mt = mt.annotate_rows(
        alleles=hl.case()
        .when(mt.orientation == 'same', 
              mt.alleles) # AC -> AC
        .when(mt.orientation == 'order_flip', 
              hl.reversed(mt.alleles)) # AC -> CA
        .when(mt.orientation == 'strand_flip', 
              mt.alleles.map(lambda v: hl.reverse_complement(v))) # AC -> TG
        .when(mt.orientation == 'order_flip+strand_flip', 
              hl.reversed(mt.alleles.map(lambda v: hl.reverse_complement(v)))) # AC -> GT
        .or_missing()
    )
    mt = mt.key_rows_by(*keys)
    return mt
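
The effect of the alignment on a single variant can be sketched in plain Python. This is only an illustration of the idea, assuming single-base biallelic variants and calls expressed as the number of alt alleles; the `align` function is hypothetical and not part of the notebook.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def align(alleles, n_alt, orientation):
    """Return (alleles, n_alt) for one biallelic call, re-expressed in the target orientation."""
    if orientation in ("strand_flip", "order_flip+strand_flip"):
        # For single-base alleles the reverse complement is just the complement
        alleles = [a.translate(COMPLEMENT) for a in alleles]
    if orientation in ("order_flip", "order_flip+strand_flip"):
        alleles = alleles[::-1]
        # Swapping ref and alt turns n alt alleles into 2 - n (diploid, biallelic)
        n_alt = 2 - n_alt
    return alleles, n_alt


print(align(["A", "C"], 0, "order_flip"))              # (['C', 'A'], 2)
print(align(["T", "G"], 1, "strand_flip"))             # (['A', 'C'], 1)
print(align(["G", "T"], 2, "order_flip+strand_flip"))  # (['A', 'C'], 0)
```

Only an order flip changes the call: a strand flip renames the alleles but leaves the number of alt alleles untouched.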

In [29]:
mt_ref_nrm = align_variant_orientation(mt_ref, ht)
assert mt_ref_nrm.count() == mt_ref.count()
mt_ref_nrm.count()

(36395, 1350)

Keep only variants with a defined orientation, i.e. those for which a variant at the same locus with compatible alleles exists in the target data:

In [30]:
mt_ref_exp = mt_ref_nrm.filter_rows(hl.is_defined(mt_ref_nrm.orientation))
mt_ref_exp.count()

2020-02-13 05:25:45 Hail: INFO: Coerced sorted dataset


(36332, 1350)

Show the aligned allele co-occurrence frequencies:

In [36]:
df = get_variant_orientation(mt_ref_exp, mt_tgt).to_pandas()
df = df.groupby(['alleles_src', 'alleles_tgt']).size().unstack().fillna(0).astype(int)
df

2020-02-13 05:28:22 Hail: INFO: Coerced sorted dataset
2020-02-13 05:28:22 Hail: INFO: Coerced sorted dataset


| alleles_src (rows) / alleles_tgt (columns) | AC | AG | CA | GA |
|---|---|---|---|---|
| AC | 3039 | 0 | 0 | 0 |
| AG | 0 | 13148 | 0 | 0 |
| CA | 0 | 0 | 3370 | 0 |
| GA | 0 | 0 | 0 | 16775 |


In [43]:
# Ensure that the axis labels are equal
assert df.index.equals(df.columns)
# Ensure that the off-diagonal counts are all zero
assert np.all((df.values - np.diag(np.diag(df.values))) == 0)
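
The off-diagonal check can also be written without numpy, which may make the intent easier to see. A small sketch using the aligned counts shown above; `off_diagonal_total` is a hypothetical helper and the `misaligned` matrix is made up to show a failing case.

```python
def off_diagonal_total(matrix):
    """Sum of all entries outside the main diagonal of a square matrix."""
    return sum(v for i, row in enumerate(matrix) for j, v in enumerate(row) if i != j)


# Aligned co-occurrence counts (rows: alleles_src, columns: alleles_tgt)
aligned = [
    [3039, 0, 0, 0],
    [0, 13148, 0, 0],
    [0, 0, 3370, 0],
    [0, 0, 0, 16775],
]

# All variants sit on the diagonal, i.e. source and target alleles agree
assert off_diagonal_total(aligned) == 0
# The diagonal accounts for every exported variant
assert sum(aligned[i][i] for i in range(len(aligned))) == 36332

# Hypothetical counter-example: a single mismatched cell is detected
misaligned = [row[:] for row in aligned]
misaligned[0][2] = 61
assert off_diagonal_total(misaligned) == 61
```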

Export the result:

In [44]:
path = osp.join(WORK_DIR, REF_QC_03_FILE + '.mt')
mt_ref_exp.write(path, overwrite=True)
print('Final result written to', path)

2020-02-13 05:30:56 Hail: INFO: Coerced sorted dataset
2020-02-13 05:30:59 Hail: INFO: Coerced sorted dataset


Final result written to /home/eczech/data/gwas/tmp/canine/mt_ref_qc_03.mt


2020-02-13 05:31:11 Hail: INFO: wrote matrix table with 36332 rows and 1350 columns in 2 partitions to /home/eczech/data/gwas/tmp/canine/mt_ref_qc_03.mt
