# Locating Variants in NBS Gene Regions

Variants were annotated with the NBS gene table provided by Ben Solomon at ITMI. Variants located in NBS gene regions were annotated with Kaviar frequency, ClinVar clinical significance ratings, DANN predicted pathogenicity scores for SNVs, and coding consequence predictions from SnpEff.

Variants were filtered for the appropriate mode of inheritance and labeled as predictive when they carried mutation(s) consistent with that inheritance (e.g., two bi-allelic pathogenic mutations for a recessive disorder).

Mutations were defined strictly as either:  
- Annotated in ClinVar with a clinical significance of 4 or 5 but never 2 or 3, or labeled pathogenic in HGMD (to be added later)  

OR  
- Novel (in Kaviar with frequency lower than specified below, or not listed in Kaviar) but predicted to be disease-causing with a SnpEff impact score of 'high'
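
The two-part definition above can be read as a single predicate. The following is a minimal sketch under stated assumptions, not the notebook's implementation: the function name `is_candidate_mutation` and its arguments are hypothetical, `clin_patho` is assumed to already encode the "4 or 5, never 2 or 3" ClinVar rule, and the HGMD arm is omitted because it has not been added yet.

```python
def is_candidate_mutation(clin_patho, kaviar_freq, snpeff_impact, max_freq=0.03):
    """Return True when a variant meets either arm of the mutation definition.

    clin_patho    -- True if ClinVar rates the variant 4 or 5 and never 2 or 3
    kaviar_freq   -- Kaviar allele frequency, or None if absent from Kaviar
    snpeff_impact -- SnpEff impact string, e.g. 'HIGH' or 'MODERATE'
    max_freq      -- rarity threshold (0.03 mirrors kav_freq; an assumption here)
    """
    # Arm 1: non-conflicted pathogenic ClinVar rating
    if clin_patho:
        return True
    # Arm 2: novel (rare or unlisted in Kaviar) and predicted high impact
    is_novel = kaviar_freq is None or kaviar_freq < max_freq
    return is_novel and str(snpeff_impact).upper() == 'HIGH'
```

Treating a missing Kaviar frequency as novel matches the "or not listed in Kaviar" wording of the definition.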

## User Specified Parameters

In [12]:
import time
# enter kaviar frequency threshold
kav_freq = '.03'

# enter dictionary of samples to examine in format 'sample_id':['family_id','member_id'], or 'all'
sample_list = 'all'

# enter as NB, M, and/or F, or 'all' to include extended family if available
trio_member = 'NB','M','F'

# enter minimum dann score for predicted NBS
dann_score = '0.96'

# enter user database to output tables
out_db = 'users_selasady'
# enter database.name for variant table
variant_table = 'p7_platform.brady_variant'
# enter database.name of kaviar table
kaviar_table = 'p7_ref_grch37.kaviar'
# enter database.name of clinvar table
clinvar_table = 'p7_ref_grch37.clinvar'
# enter database.name of dbNSFP table (provides DANN scores)
dbnfsp_table = 'p7_ref_grch37.dbnsfp_variant'

# enter path to snpEff.jar on your computer
snpeff_path = '/Users/summerrae/tools/snpEff/snpEff.jar'
#snpeff_path = 'D:/Documents/tools/snpEff/snpEff.jar'
# enter desired basename for output files
out_name = 'nbs_' + time.strftime("%y%m%d")

## Transform of NBS Gene Table

The NBS Gene Table provided by ITMI was transformed as follows and uploaded to Impala:

- A 'level' column was added to represent the color coding found in the file, with the following options: red, yellow, null  
- All commas were replaced with colons to prevent errors while importing csv file  
- All semicolons were replaced with colons for consistency of data  
- Spaces in column names were replaced with underscores  
- Special characters were removed from column names
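
The text clean-up steps listed above could be sketched as follows. This is an illustration only: the helper names `clean_column_name` and `clean_cell` are hypothetical, the sample strings are made up, and the actual transformation may have been done by hand or with another tool.

```python
import re

def clean_column_name(name):
    """Replace spaces with underscores, then drop any remaining special characters."""
    name = name.strip().replace(' ', '_')
    return re.sub(r'[^0-9A-Za-z_]', '', name)

def clean_cell(value):
    """Replace commas and semicolons with colons so the value is safe inside a CSV field."""
    return re.sub(r'[,;]', ':', value)

print(clean_column_name('Gene (HGNC)'))
print(clean_cell('Hypothyroidism, congenital; nongoitrous'))
```

Replacing both delimiters with colons is why condition names in the results read like "Hypothyroidism: congenital: nongoitrous".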

## Adding positional information to the NBS gene list

Genomic position, gene and transcript information and strand were added to the NBS Gene list by joining it with the ensembl_genes table by gene name. 

        CREATE TABLE users_selasady.nbs_ensembl AS (
             SELECT DISTINCT nbs.*, ens.chrom, ens.start, ens.stop, ens.gene_id, 
             ens.transcript_id, ens.strand
             FROM users_selasady.nbs_genes nbs, p7_ref_grch37.ensembl_genes ens
             WHERE nbs.gene = ens.gene_name
           )

The results were saved as users_selasady.nbs_ensembl, which contains 14839 rows because each gene name can map to multiple transcripts.

## Finding rare variants with Kaviar

For all specified trio members, variants were annotated with Kaviar, labeling variants under the Kaviar frequency threshold, or not listed in Kaviar, as rare. Variants were then joined with the nbs_ensembl table to retain only variants falling in NBS gene regions.
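
The two tests applied in this step, region membership and rarity, can be sketched in plain Python. The function names and the dictionary layout are assumptions for illustration, and the coordinates in the usage lines are made up; the notebook itself performs both steps in the Impala query that follows.

```python
def in_gene_region(variant, region):
    """True when the variant is on the region's chromosome and its position
    falls between start and stop, inclusive (like SQL BETWEEN)."""
    return (variant['chrom'] == region['chrom']
            and region['start'] <= variant['pos'] <= region['stop'])

def kaviar_rare(allele_freq, threshold=0.03):
    """Return 'Y' for variants below the threshold or absent from Kaviar, else 'N'."""
    if allele_freq is None or allele_freq < threshold:
        return 'Y'
    return 'N'

# made-up coordinates, for illustration only
region = {'chrom': '2', 'start': 1000, 'stop': 2000}
variant = {'chrom': '2', 'pos': 1500}
print(in_gene_region(variant, region), kaviar_rare(None), kaviar_rare(0.25))
```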

In [53]:
# import needed modules
from impala.util import as_pandas
import pandas as pd
from parse_impala import parse_impala as pimp
from impala.dbapi import connect
# disable extraneous pandas warning
pd.options.mode.chained_assignment = None

In [9]:
# parse trio arg
member_arg = pimp.label_member('bv', trio_member)

# parse sample id argument
subject_list = pimp.select_samples('bv', sample_list)

# list of user args to join
arg_list = [subject_list, member_arg]
sample_args = pimp.join_args(arg_list)

print 'The following argument will be added to the query: \n'
print sample_args

In [50]:
nbs_query = ''' 
CREATE TABLE {}.{}_annot AS 
    WITH bv as (
       SELECT bv.sample_id, bv.chrom, bv.pos, bv.ref, bv.alt, bv.id as rs_id, bv.qual, bv.filter,
                regexp_replace(bv.gt, '1/2', '0/1') as geno,nbs.gene, nbs.transcript_id,
                nbs.gene_id, nbs.strand, nbs.condition, nbs.omim_phenotype, nbs.va_name, nbs.inheritance,
                nbs.allelic_conditions, nbs.comments, nbs.level, nbs.in_pbg
                FROM {}.nbs_ensembl nbs, {} bv
                WHERE nbs.chrom = bv.chrom
                AND (bv.pos BETWEEN nbs.start and nbs.stop)
                {}
                )
        SELECT DISTINCT bv.*, k.allele_freq,  
                  CASE WHEN (k.allele_freq < {} OR k.allele_freq IS NULL )then 'Y'
                  ELSE 'N'
                  END as kaviar_rare
             FROM bv
             LEFT JOIN /* +SHUFFLE */ {} k
                  ON bv.chrom = k.chrom
                  AND bv.pos = k.pos
                  AND bv.ref = k.ref
                  AND bv.alt = k.alt
            WHERE (bv.chrom <> 'X' AND bv.chrom <> 'Y')
'''.format(out_db,out_name, out_db, variant_table, sample_args, kav_freq, kaviar_table)

# run kaviar annotation query
pimp.run_query(nbs_query, 'annot', out_db, out_name)

Removing table if it already exists...
Running the following query on impala: 
 
CREATE TABLE users_selasady.nbs_150924_annot AS 
    WITH bv as (
       SELECT bv.sample_id, bv.chrom, bv.pos, bv.ref, bv.alt, bv.id as rs_id, bv.qual, bv.filter,
                regexp_replace(bv.gt, '1/2', '0/1') as geno,nbs.gene, nbs.transcript_id,
                nbs.gene_id, nbs.strand, nbs.condition, nbs.omim_phenotype, nbs.va_name, nbs.inheritance,
                nbs.allelic_conditions, nbs.comments, nbs.level, nbs.in_pbg
                FROM users_selasady.nbs_ensembl nbs, p7_platform.brady_variant bv
                WHERE nbs.chrom = bv.chrom
                AND (bv.pos BETWEEN nbs.start and nbs.stop)
                 AND (bv.sample_id LIKE '%03' OR bv.sample_id LIKE '%01' OR bv.sample_id LIKE '%02')
                )
        SELECT DISTINCT bv.*, k.allele_freq,  
                  CASE WHEN (k.allele_freq < .03 OR k.allele_freq IS NULL )then 'Y'
                  ELSE 'N'
                  END 

Please note that the 1/2 genotype was converted to 0/1 for downstream compatibility with SnpEff, and is only used for determining compound heterozygosity.

## Annotating with ClinVar pathogenicity and pathogenicity ratings from dbNSFP

Variants in NBS gene regions were then annotated with the ClinVar table by genomic position. Variants with a ClinVar significance rating of 4 or 5, but never 2 or 3, were marked with a 'Y' in the clin_patho column to denote non-conflicted pathogenically significant ratings.
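
The "4 or 5, but never 2 or 3" rule depends on two regular expressions that also have to cope with the ClinVar code 255 ("other"), which contains the digits 2 and 5. The sketch below reproduces the two patterns from the query with Python's `re` module so their behavior can be checked on sample strings. It assumes that `clin_sig` is a delimiter-separated string of numeric codes and that the REGEXP operator matches anywhere in the string, as `re.search` does; the function name `clin_patho` is borrowed from the output column for illustration.

```python
import re

# '3' anywhere, a '2' followed by anything other than '5', or a trailing '2'
CONFLICT = re.compile(r'3|2[^5]|2$')
# '4' anywhere, or a '5' that is not part of 255
PATHOGENIC = re.compile(r'4|[^25]5|^5')

def clin_patho(clin_sig):
    """Return 'Y' for non-conflicted pathogenic ratings, else 'N'."""
    if clin_sig is None:          # no ClinVar record for this variant
        return 'N'
    if CONFLICT.search(clin_sig):
        return 'N'
    return 'Y' if PATHOGENIC.search(clin_sig) else 'N'

for sig in ['5', '4', '255', '5|2', '2|5', '5|255', '0|5', '3']:
    print(sig, clin_patho(sig))
```

The loop prints Y for '5', '4', '5|255' and '0|5', and N for the rest.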

DANN scores are provided for SNVs and range between 0 and 1. The closer a score is to 1, the more likely the variant is predicted to be pathogenic. The DANN table was matched to variants on genomic position and alternate allele.

In [51]:
dbnsfp_query = '''
        CREATE TABLE IF NOT EXISTS {}.{}_annotated AS 
        WITH nbs as (
             SELECT DISTINCT nbs.*, clin.clin_hgvs, clin.clin_sig, clin.clin_dbn,
                 CASE WHEN (clin.clin_sig NOT REGEXP '3|2[^5]|2$'  
                     AND  clin.clin_sig REGEXP '4|[^25]5|^5') THEN 'Y'
                     ELSE 'N'
                 END AS clin_patho
             FROM {}.{}_annot nbs
             LEFT JOIN {} clin
                 ON nbs.chrom = clin.chrom
                 AND nbs.pos = clin.pos
                 AND nbs.alt = clin.alt
          )

        SELECT DISTINCT nbs.*, db.aa_alt, db.sift_score, db.sift_pred, db.polyphen2_hdiv_score,
            db.polyphen2_hdiv_pred, db.polyphen2_hvar_score, db.polyphen2_hvar_pred, 
            db.fathmm_score, db.fathmm_pred, db.cadd_raw, db.dann_score, db.mutation_taster_pred,
            db.mutation_assessor_score, db.mutation_assessor_pred, db.provean_score, db.provean_pred,
            db.clinvar_clnsig, db.interpro_domain, db.exac_af,
            CASE
                WHEN SUBSTRING(nbs.sample_id, -2) = '01' THEN 'M'
                WHEN SUBSTRING(nbs.sample_id, -2) = '02' THEN 'F'
                WHEN SUBSTRING(nbs.sample_id, -2) = '03' THEN 'NB'
                END as member,
             CASE
                WHEN (db.sift_pred LIKE '%D%' or db.polyphen2_hdiv_pred LIKE '%D%' or db.mutation_taster_pred LIKE '%D%'
                     or db.mutation_assessor_pred LIKE '%H%' or db.fathmm_pred LIKE '%D%' or db.provean_pred LIKE '%D%'
                     or db.dann_score >= {}) THEN 'Y'
                ELSE 'N'
                END as dbnfsp_predicted,
            SUBSTRING(nbs.sample_id, 1, (length(nbs.sample_id)-3)) as family_id,
            CONCAT(nbs.chrom, ':', CAST(nbs.pos AS STRING), ':', nbs.ref, ':', nbs.alt) as var_id 
        FROM nbs, {} db
            WHERE nbs.chrom = db.chrom
            AND nbs.pos = db.pos
            AND nbs.alt = db.alt
            '''.format(out_db, out_name, out_db, out_name, clinvar_table, dann_score, dbnfsp_table)

# run dbNSFP query on impala
pimp.run_query(dbnsfp_query, 'annotated', out_db, out_name)

Removing table if it already exists...
Running the following query on impala: 

        CREATE TABLE IF NOT EXISTS users_selasady.nbs_150924_annotated AS 
        WITH nbs as (
             SELECT DISTINCT nbs.*, clin.clin_hgvs, clin.clin_sig, clin.clin_dbn,
                 CASE WHEN (clin.clin_sig NOT REGEXP '3|2[^5]|2$'  
                     AND  clin.clin_sig REGEXP '4|[^25]5|^5') THEN 'Y'
                     ELSE 'N'
                 END AS clin_patho
             FROM users_selasady.nbs_150924_annot nbs
             LEFT JOIN p7_ref_grch37.clinvar clin
                 ON nbs.chrom = clin.chrom
                 AND nbs.pos = clin.pos
                 AND nbs.alt = clin.alt
          )

        SELECT DISTINCT nbs.*, db.aa_alt, db.sift_score, db.sift_pred, db.polyphen2_hdiv_score,
            db.polyphen2_hdiv_pred, db.polyphen2_hvar_score, db.polyphen2_hvar_pred, 
            db.fathmm_score, db.fathmm_pred, db.cadd_raw, db.dann_score, db.mutation_taster_pred,
            db.mu

### Reading the Impala results into Python

In [56]:
# query to select all annotated NBS variants
nbs_query = """
    SELECT * FROM 
    {}.{}_annotated
    """.format(out_db, out_name)

# run query on impala
conn=connect(host='glados19', port=21050, timeout=120)
cur = conn.cursor()
cur.execute(nbs_query)

# store results as pandas data frame
nbs_df = as_pandas(cur)
cur.close()
conn.close()

In [57]:
if len(nbs_df) > 0:
    print str(len(nbs_df)) + ' variants located in NBS regions.'
else: 
    print "No variants located in NBS regions."

13719 variants located in NBS regions.


## Predicting Coding Consequences

Coding consequences were predicted using SnpEff. 

In [47]:
reload(pimp)

<module 'parse_impala.parse_impala' from 'parse_impala/parse_impala.pyc'>

In [109]:
# rename 'geno' to 'gt' for vcf output
nbs_df = nbs_df.rename(columns = {'geno':'gt'})

vcf_header = """##fileformat=VCFv4.0
##fileDate={}
##reference=grch37v47
""".format(time.strftime("%y%m%d"))

# copy the slice so the new columns are added to an independent data frame
vcf = nbs_df[['chrom', 'pos', 'var_id', 'ref', 'alt', 'qual', 'filter']].copy()
vcf['INFO'] = '.'
vcf['FORMAT'] = 'GT'
vcf['GT'] = nbs_df['gt']
vcf.columns = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'GT']

# write the meta-information lines, then append the first few records as a test
with open('test.vcf', 'w') as f:
    f.write(vcf_header)

vcf.head().to_csv('test.vcf', sep='\t', mode='a', index=False)

In [58]:
# rename 'geno' to 'gt' for vcf output
nbs_df = nbs_df.rename(columns = {'geno':'gt'})

# run function on query results
try:
    pimp.df_to_snpeff(nbs_df, snpeff_path, out_name)
except Exception, e: 
    print str(e)

global name 'args' is not defined


Coding consequence predictions were read back into Python, parsed, and matched with each respective variant data frame.
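
The parsing step happens inside `pimp.parse_snpeff`, whose source is not shown here. As a rough illustration of what such parsing involves, the sketch below splits the `ANN` field that recent SnpEff versions write into the VCF INFO column. It assumes the standard ANN sub-field order; the function name `parse_ann` and the example INFO string (including its gene and transcript identifiers) are made up.

```python
# first eleven ANN sub-fields, in the order SnpEff writes them
ANN_FIELDS = ['allele', 'effect', 'impact', 'gene_name', 'gene_id', 'feature_type',
              'feature_id', 'transcript_biotype', 'rank', 'hgvs_c', 'hgvs_p']

def parse_ann(info):
    """Return one dict per annotation found in a VCF INFO string."""
    records = []
    for entry in info.split(';'):
        if not entry.startswith('ANN='):
            continue
        # annotations are comma-separated; sub-fields are pipe-separated
        for ann in entry[len('ANN='):].split(','):
            records.append(dict(zip(ANN_FIELDS, ann.split('|'))))
    return records

# made-up INFO string, for illustration only
info = ('DP=10;ANN=A|missense_variant|MODERATE|GENE1|GENE1_ID|transcript|TX1|'
        'protein_coding|3/12|c.100G>A|p.Val34Met')
for record in parse_ann(info):
    print(record['gene_name'], record['effect'], record['impact'])
```

Because one variant can carry an annotation per overlapping transcript, the number of parsed consequences can be much larger than the number of variants.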

In [60]:
# match coding consequences with nbs variants 
nbs_annot = pimp.parse_snpeff(with_tx, '{}_snpeff.vcf'.format(out_name))


print str(len(nbs_annot)) + " coding consequences found for this set of variants."

259407 variants annotated for coding consequences.


## Filtering for potentially pathogenic mutations

The dbNSFP variant table was used to add the following additional measures of possible pathogenicity.

A variant was marked as pathogenic if it was labeled as pathogenic in ClinVar, or if it was rare (Kaviar frequency below the user-specified threshold, 0.03 here, or not listed in Kaviar) and met any of the following criteria.

- SnpEff impact = High  
- sift_pred = D  
- polyphen2_hdiv_pred = D  
- polyphen2_hvar_pred = D  
- mutation_taster_pred = D  
- mutation_assessor_pred = H  
- fathmm_pred = D  
- provean_pred = D  
- cadd_raw = provided for reference only  
- dann_score = above user-specified cutoff  
- exac_af = provided for reference only  
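
The labeling rule described above can be sketched as a predicate over one variant record. This is an assumption-laden illustration rather than the body of `pimp.label_predictive`, which is not shown: the function name `is_predictive`, the dictionary-based record, and the substring test (mirroring the `LIKE '%D%'` comparisons in the annotation query) are all choices made for this example.

```python
# predictor column -> call that counts as damaging
DAMAGING_CALLS = {
    'sift_pred': 'D', 'polyphen2_hdiv_pred': 'D', 'polyphen2_hvar_pred': 'D',
    'mutation_taster_pred': 'D', 'mutation_assessor_pred': 'H',
    'fathmm_pred': 'D', 'provean_pred': 'D',
}

def is_predictive(variant, dann_cutoff=0.96):
    """Apply the ClinVar-or-(rare and damaging) rule to one variant dict."""
    # ClinVar pathogenic variants qualify regardless of frequency
    if variant.get('clin_patho') == 'Y':
        return True
    # everything else must be rare in Kaviar
    if variant.get('kaviar_rare') != 'Y':
        return False
    if str(variant.get('impact', '')).upper() == 'HIGH':
        return True
    for column, call in DAMAGING_CALLS.items():
        if call in str(variant.get(column) or ''):
            return True
    dann = variant.get('dann_score')
    return dann is not None and float(dann) >= dann_cutoff
```

`cadd_raw` and `exac_af` are deliberately absent from the predicate because they are provided for reference only.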

In [71]:
reload(pimp)

<module 'parse_impala.parse_impala' from 'parse_impala\parse_impala.py'>

In [62]:
# label variants considered pathogenic 
pimp.label_predictive(nbs_annot)
 
print str(len(nbs_annot[(nbs_annot['predictive'] == True)])) + " variants labeled as predictive."

36429 variants labeled as predictive.


## Finding newborns predicted to have NBS conditions

The dataframe was separated into mother, father, and newborn predictive het variants to search for compound heterozygosity.

In [68]:
mom_hets, dad_hets, nb_het1, nb_het2, nb_het3 = pimp.find_AR_cands(nbs_annot)

### Finding Dominant and Homozygous Recessive Disorders

The newborn and parent variants were subset to report:  
- All variants in regions of dominant disorders  
- All homozygous recessive variants in regions of autosomal recessive disorders
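
The two subsets can be sketched on plain Python records. The `pimp.find_dominant_nb` and `pimp.find_hom_rec` sources are not shown, so the function names below and the exact conditions (predictive newborn variants, `inheritance` of 'AD' or 'AR', genotype '1/1' for homozygous alt) are assumptions based on the column names and QA checks used elsewhere in this notebook.

```python
def find_dominant(variants):
    """Predictive newborn variants in genes with autosomal dominant inheritance."""
    return [v for v in variants
            if v['predictive'] and v['member'] == 'NB' and v['inheritance'] == 'AD']

def find_hom_recessive(variants):
    """Predictive newborn homozygous-alt variants in autosomal recessive genes."""
    return [v for v in variants
            if v['predictive'] and v['member'] == 'NB'
            and v['inheritance'] == 'AR' and v['gt'] == '1/1']
```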

In [72]:
nb_dominant = pimp.find_dominant_nb(nbs_annot)

nb_hom_recessive = pimp.find_hom_rec(nbs_annot)

### Looking for compound heterozygosity

Newborn heterozygous variants were subset to look for compound heterozygosity. This analysis was only performed on newborns where both parents were also sequenced.

In [114]:
# function to find matching parent variants
def find_parent_vars(nb_df, parent_df):
    # inner merge keeps newborn variants also seen in the parent (same chrom, pos, ref, alt)
    merged_df = pd.merge(nb_df, parent_df, on=['chrom', 'pos', 'ref', 'alt'], how='inner')
    # rename parent member column first so it survives the removal of '_y' cols
    merged_df.rename(columns = {'member_y':'from_parent'}, inplace=True)
    # drop the remaining parent ('_y') columns added by the merge
    merged_df.drop([col for col in merged_df.columns if col.endswith('_y')], axis=1, inplace=True)
    # strip the '_x' suffix from the newborn column names
    merged_df.rename(columns=lambda x: x[:-2] if x.endswith('_x') else x, inplace=True)
    return merged_df
    
# match newborn hets to parental hets and group them by gene and family
def match_parents(nb_df):
    if (len(mom_hets) > 0) and (len(dad_hets) > 0):
        nb_and_mom = find_parent_vars(nb_df, mom_hets)
        nb_and_dad = find_parent_vars(nb_df, dad_hets)
        # merge variants found in either mother or father
        het_cands = pd.concat([nb_and_mom,nb_and_dad]).drop_duplicates().reset_index(drop=True)
        # group variants by gene name and family
        by_gene = het_cands.groupby(['gene_name', 'family_id'])
        return by_gene
    else:
        print "No compound het variants"
        # return an empty list so downstream loops over the groups still work
        return []
        
# run function for each group of newborns
het1_grouped = match_parents(nb_het1)
het2_grouped = match_parents(nb_het2)
het3_grouped = match_parents(nb_het3)

After grouping the variants by gene and family, the groups are filtered to keep only those with at least one variant inherited from the mother and a different variant inherited from the father.
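
The rule can be illustrated on plain Python data, independent of pandas. This is a minimal sketch: the function name `is_compound_het` and the list-of-dicts input are hypothetical, and variants are compared by position only.

```python
def is_compound_het(rows):
    """Decide whether one gene in one family looks like a compound het.

    rows -- list of dicts with 'pos' and 'from_parent' ('M' or 'F')
    """
    mom = {r['pos'] for r in rows if r['from_parent'] == 'M'}
    dad = {r['pos'] for r in rows if r['from_parent'] == 'F'}
    # need a maternal and a paternal variant at different positions
    return any(m != d for m in mom for d in dad)
```

A gene where both parents only match the same single position returns False, because the two inherited alleles cannot be shown to carry different variants.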

In [115]:
# function to find compound hets
def find_comphets(gene_group, comphet_list_name):
    # nothing to do when no parent-matched groups are available
    if gene_group is None:
        return
    for name, group in gene_group:
        # count variants matched to each parent
        n_mom = len(group[group['from_parent'] == 'M'])
        n_dad = len(group[group['from_parent'] == 'F'])
        # keep the group when it spans more than one position and each parent
        # contributes at least one variant; together these conditions imply a
        # maternal and a paternal variant at different positions
        if group.pos.nunique() > 1 and n_mom > 0 and n_dad > 0:
            comphet_list_name.append(group)

# create empty list to store comp_hets
comp_hets = []

# find_comphets appends matching groups to comp_hets in place
find_comphets(het1_grouped, comp_hets)
find_comphets(het2_grouped, comp_hets)
find_comphets(het3_grouped, comp_hets)

# if any comp hets were found, concatenate them into one df
if len(comp_hets) > 0:
    comphet_df = pd.concat(comp_hets)
    comphet_df.name = 'comp_het'
else:
    print 'No comp hets found.'
    comphet_df = pd.DataFrame()
    comphet_df.name = 'comp_het'

## QA

In [154]:
review_these = []

# make sure all dominant variants are inherited as AD
not_ad = nb_dominant[~nb_dominant['inheritance'].isin(['AD'])]
if len(not_ad) > 0:
    review_these.append(not_ad)

# make sure all homozygous recessive are homozygous alt and AR
not_hom_ar = nb_hom_recessive[(~nb_hom_recessive['inheritance'].isin(['AR'])) | (nb_hom_recessive['gt'] != '1/1')]
if len(not_hom_ar) > 0:
    review_these.append(not_hom_ar)

# the compound het checks only apply when compound hets were found
if len(comphet_df) > 0:
    # make sure all compound hets are het and AR
    not_het_ar = comphet_df[(~comphet_df['inheritance'].isin(['AR'])) | (comphet_df['gt'] != '0/1')]
    if len(not_het_ar) > 0:
        review_these.append(not_het_ar)

    # check that there is more than one variant per gene within each family
    het_groupy = comphet_df.groupby(['gene', 'family_id'])

    for name, group in het_groupy:
        mom_pos = set(group[group['from_parent'] == 'M'].pos)
        dad_pos = set(group[group['from_parent'] == 'F'].pos)
        # check that there is a variant in more than one position
        if group.pos.nunique() < 2:
            review_these.append(group)
        # check that there is at least one variant from the mother
        if len(mom_pos) == 0:
            review_these.append(group)
        # check that there is at least one variant from the father
        if len(dad_pos) == 0:
            review_these.append(group)
        # check that the mother contributes at least one position the father does not
        if len(mom_pos - dad_pos) == 0:
            review_these.append(group)

## Results

In [None]:
reload(pimp)

In [117]:
# report variant counts
def report_result_counts(results_df):
    if len(results_df) > 0:
        print str(len(results_df)) + ' {} variants found.'.format(results_df.name)
        condition = results_df['condition'].value_counts()
        genes = results_df['gene'].unique()
        print 'Condition: \n', condition, '\n'
        print 'Affected gene(s): \n', genes, '\n'
    else:
         print "No {} variants found. \n".format(results_df.name)
        
print "Variants found:"
report_result_counts(nb_dominant)
report_result_counts(nb_hom_recessive)
report_result_counts(comphet_df)
print "\n"

Variants found:
28 dominant variants found.
Condition: 
Hypothyroidism: congenital: nongoitrous 2    16
Pallister-Hall syndrome 2                    12
dtype: int64 

Affected gene(s): 
['PAX8' 'GLI2'] 

580 hom_recessive variants found.
Condition: 
Hyperprolinemia: type I                                         400
Severe combined immunodeficiency: autosomal recessive: T-cell negative: B-cell positive: NK cell positive     60
Adrenal hyperplasia: congenital: due to 21-hydroxylase deficiency: Hyperandrogenism: nonclassic type: due to 21-hydroxylase deficiency     60
Tyrosinemia: type III                                            32
Maple syrup urine disease: type II                               18
Severe combined immunodeficiency                                 10
dtype: int64 

Affected gene(s): 
['IL7R' 'HPD' 'PRODH' 'CYP21A2' 'MTHFD1' 'DBT'] 

251 comp_het variants found.
Condition: 
Severe combined immunodeficiency: autosomal recessive: T-cell negative: B-cell positive: NK cell p

## Saving Output

In [155]:
# add from_parent column to dom and hom_rec so we can keep info for comp_het
nb_dominant['from_parent'] = 'NA'
nb_hom_recessive['from_parent'] = 'NA'

# merge and mark predicted variants
def merge_predictors(df, out_list):
    if len(df) > 0:
        df['var_type'] = df.name
        out_list.append(df)
        print "Saving {} {} variants to current working directory".format(len(df), df.name)
    else:
        print "No {} variants to save.".format(df.name)

# list to store patho output
predict_list = []            

merge_predictors(nb_dominant, predict_list)
merge_predictors(nb_hom_recessive, predict_list)
merge_predictors(comphet_df, predict_list)

# merge results
merged_patho = pd.concat(predict_list, axis=0)

# put the columns back in order because pandas
cols = ['sample_id', 'chrom', 'pos', 'id', 'ref', 'alt', 'gt', 'gene',  'gene_id', 'allele_freq', 
        'condition','inheritance', 'clin_sig', 'clin_dbn', 'transcript_id', 'strand',   
        'impact', 'effect', 'dbnfsp_predicted', 'dann_score', 'cadd_raw', 'qual', 'filter', 
        'clin_hgvs', 'omim_phenotype', 'va_name', 'allelic_conditions', 'comments', 'level',
        'aa_alt', 'sift_score', 'sift_pred', 'polyphen2_hdiv_score', 'polyphen2_hdiv_pred', 
        'polyphen2_hvar_score', 'polyphen2_hvar_pred', 'fathmm_score', 'fathmm_pred',  
        'mutation_taster_pred', 'mutation_assessor_score', 'mutation_assessor_pred', 'provean_score', 'provean_pred', 
        'interpro_domain', 'exac_af', 'rank', 'hgvs_c', 'hgvs_p', 'from_parent', 'var_type',   
        'member','family_id', 'feature_type', 'gene_name','id']

merged_out = merged_patho[cols]

# remove unnecessary columns
merged_out.drop(merged_out.columns[[50, 51, 52, 53, 54]], axis=1, inplace=True)

# save to file
merged_out.to_csv('predicted_NBS_{}.tsv'.format(time.strftime("%y%m%d")), sep='\t', header=True,
                  encoding='utf-8', index=False)

if len(review_these) > 0:
    print 'Some variants were flagged for review and saved to nbs_{}_review.tsv'.format(time.strftime("%y%m%d"))
    # review_these holds data frames, so concatenate them into one table
    for_review = pd.concat(review_these)
    for_review.to_csv('nbs_{}_review.tsv'.format(time.strftime("%y%m%d")), sep='\t', header=True,
                      encoding='utf-8', index=False)

Saving 28 dominant variants to current working directory
Saving 580 hom_recessive variants to current working directory
Saving 251 comp_het variants to current working directory


## To Do

- make package
- change location of tab2vcf.py function according to location in package
- report only nb vars
- mark with variant type
- why is sample id _22 showing up? 