# GWAS on Resting Heart Rate

Ravi Mandla


In [1]:
import pandas as pd
import os, subprocess
import numpy as np
import math

## Data Download and Parsing

We downloaded our data from the UK Biobank, an online collection of biological information on ~500,000 individuals. All non-genotyping data are stored in a single large CSV file, which we parsed to isolate only the characteristics of interest: heart rate, BMI, age, sex, and ethnicity. These are stored under data-field IDs 102, 21001, 21003, 31, and 21000, respectively.

In the CSV file, there are multiple columns per data-field ID, corresponding to repeat assessments of each characteristic (see https://biobank.ctsu.ox.ac.uk/~bbdatan/Repeat_assessment_doc_v1.0.pdf). To handle multiple data points per characteristic, we averaged all available repeat assessments for each individual, ignoring missing values.

To reduce bias introduced by population differences, we restricted our analysis to individuals who self-reported as "White" according to data-field 21000. This covers the codes for "White" (1), "British" (1001), "Irish" (1002), and "Any other white background" (1003).

In [2]:
pheno = pd.read_csv('../ukbiobank/ukb40068.csv', usecols=['eid','102-0.0','102-0.1','102-1.0','102-1.1','102-2.0','102-2.1','102-3.0','102-3.1','21001-0.0','21001-1.0','21001-2.0',
                                                          '21001-3.0','21003-0.0','21003-1.0','21003-2.0','21003-3.0','31-0.0','21000-0.0'])

In [3]:
# average heart rate
justhrs = pheno[['102-0.0','102-0.1','102-1.0','102-1.1','102-2.0','102-2.1','102-3.0','102-3.1']]
avhr = justhrs.mean(axis=1,skipna=True)

# organize data
heartrate = pd.DataFrame()
heartrate['FID'] = pheno['eid']
heartrate['IID'] = pheno['eid']
heartrate['hr'] = avhr
heartrate['ethnicity'] = pheno['21000-0.0']

# isolate only "White" individuals
heartrate = heartrate[heartrate['ethnicity'].isin([1,1001,1002,1003])]

# drop individuals without heart rate data
heartrate = heartrate.dropna(subset=['hr'])

In [4]:
covars = pd.DataFrame()

# average BMI data
bmi = pheno[['21001-0.0','21001-1.0','21001-2.0','21001-3.0']]
bmi = bmi.mean(axis=1,skipna=True)

# average age data
age = pheno[['21003-0.0','21003-1.0','21003-2.0','21003-3.0']]
age = age.mean(axis=1,skipna=True)

# organize data and drop individuals without covariate data
covars['FID'] = pheno['eid']
covars['IID'] = pheno['eid']
covars['age'] = age
covars['bmi'] = bmi
covars['sex'] = pheno['31-0.0']
covars = covars.dropna()

In [5]:
# identify individuals with heart rate and covariate data
heartrate_cov = heartrate[heartrate['FID'].isin(covars['FID'])]
covars_hr = covars[covars['FID'].isin(heartrate['FID'])]

Genotyping data was downloaded into PLINK BED/BIM/FAM files using [ukbgene](http://biobank.ndph.ox.ac.uk/showcase/download.cgi?id=665&ty=ut). There is one trio of PLINK files per chromosome, containing SNP data on all individuals with genotyping data (488,377 individuals).

In [6]:
# identify individuals with phenotype data and genotype data
genotype_indv = pd.read_table('/mnt/labshare/ravi/ukbiobank/notused/ukb_cal_chr10_v2.fam',header=None,sep=' ')[[0]]

In [7]:
heartrate_filt = heartrate_cov[heartrate_cov['FID'].isin(genotype_indv[0])]
covars_filt = covars_hr[covars_hr['FID'].isin(genotype_indv[0])]

In [8]:
heartrate_filt[['FID','IID']].to_csv('var_filt_ids.tsv',sep='\t',index=None)
heartrate_filt.to_csv('heartrate_id.tsv',sep='\t',index=None)

## PCA

To control for possible population stratification, PCA was used to generate 10 principal components to include as covariates in our analysis. To do so, individuals with genotyping, heart rate, BMI, sex, and age data who identified as "White" were isolated into separate BED/BIM/FAM files, for a total of 420,553 individuals.

Rather than run PCA on all SNPs, we chose to run it on a random sample of 100,000 SNPs to reduce the computational burden and processing time. All BIM files were merged, and 100,000 of the 805,426 SNPs stored in the UK Biobank were randomly selected from the merged set.

In [11]:
def filter_beds(directory, id_list):
    """Subset every PLINK fileset in `directory` to the individuals in `id_list`.

    `directory` must end with a path separator. `id_list` is a file of FID/IID
    pairs passed to plink2 --keep, so only those individuals are retained.
    The new BED/BIM/FAM files are written to the working directory under the
    same name with `_covfilt` appended.
    """
    for i in os.listdir(directory):
        if i.endswith('.fam'):
            name = i[:-len('.fam')]
            print('filtering.....')
            subprocess.run('~/bin/plink2 --bfile ' + directory+name + ' --keep ' + id_list + ' --make-bed --out ' + name+'_covfilt',shell=True,check=True)
            print('Finished filtering ' + name)

In [13]:
filter_beds('/mnt/labshare/ravi/ukbiobank/notused/','var_filt_ids.tsv')

filtering.....
Finished filtering ukb_cal_chr10_v2
filtering.....
Finished filtering ukb_cal_chr11_v2
filtering.....
Finished filtering ukb_cal_chr12_v2
filtering.....
Finished filtering ukb_cal_chr13_v2
filtering.....
Finished filtering ukb_cal_chr14_v2
filtering.....
Finished filtering ukb_cal_chr15_v2
filtering.....
Finished filtering ukb_cal_chr16_v2
filtering.....
Finished filtering ukb_cal_chr17_v2
filtering.....
Finished filtering ukb_cal_chr18_v2
filtering.....
Finished filtering ukb_cal_chr19_v2
filtering.....
Finished filtering ukb_cal_chr1_v2
filtering.....
Finished filtering ukb_cal_chr20_v2
filtering.....
Finished filtering ukb_cal_chr21_v2
filtering.....
Finished filtering ukb_cal_chr22_v2
filtering.....
Finished filtering ukb_cal_chr2_v2
filtering.....
Finished filtering ukb_cal_chr3_v2
filtering.....
Finished filtering ukb_cal_chr4_v2
filtering.....
Finished filtering ukb_cal_chr5_v2
filtering.....
Finished filtering ukb_cal_chr6_v2
filtering.....
Finished filtering ukb

In [14]:
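# combine BED files

## write the fileset prefixes to merge to a txt file
## (the chr1 fileset is the --bfile base below, so it is left out of the list)
with open('filenames.txt','w') as output:
    for i in sorted(os.listdir()):
        if i.endswith('_covfilt.bim') and i != 'ukb_cal_chr1_v2_covfilt.bim':
            output.write(i[:-len('.bim')])
            output.write('\n')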

## merge files
subprocess.run('~/bin/plink --bfile ukb_cal_chr1_v2_covfilt --merge-list filenames.txt --make-bed --out ukball_merged',shell=True,check=True)

CompletedProcess(args='~/bin/plink --bfile ukb_cal_chr1_v2_covfilt --merge-list filenames.txt --make-bed --out ukball_merged', returncode=0)
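
The random SNP selection described earlier is not shown in this notebook. A minimal sketch of one way to do it, assuming the merged BIM file is whitespace-delimited with the SNP ID in the second column, is given here. The function name `sample_snp_ids`, the output file name, and the seed are hypothetical, not the values used in the original analysis.

```python
import pandas as pd

def sample_snp_ids(bim_path, n_snps, out_path, seed=0):
    """Randomly pick `n_snps` SNP IDs from a PLINK BIM file.

    The IDs are written one per line to `out_path`, a format that can be
    passed to PLINK's --extract flag. Returns the sampled IDs as a list.
    """
    # BIM columns: chromosome, SNP ID, genetic distance, position, allele 1, allele 2
    bim = pd.read_csv(bim_path, sep=r'\s+', header=None,
                      names=['chrom', 'snp', 'cm', 'pos', 'a1', 'a2'],
                      usecols=['snp'], dtype={'snp': str})
    # sample without replacement; a fixed seed makes the draw reproducible
    sampled = bim['snp'].sample(n=n_snps, random_state=seed)
    sampled.to_csv(out_path, index=False, header=False)
    return sampled.tolist()

# Hypothetical usage (file names are assumptions):
# sample_snp_ids('ukball_merged.bim', 100000, 'pca_snps.txt')
# plink2 --bfile ukball_merged --extract pca_snps.txt --make-bed --out <subset prefix>
```

Sampling from the BIM file rather than the genotypes keeps this step cheap, since only the SNP IDs need to be read.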

Randomly sampled SNPs were compiled into one BED/BIM/FAM trio, and PCA was run using the command:

`plink2 --bfile ukball_merged --pca approx 10 --out pcavals`

The command output a TSV file (`pcavals.eigenvec`) containing two columns for FID and IID and one column per PC. These PC columns were appended to the rest of the covariate data.

In [None]:
# appending PCA results to covariate table

## read in table
pca = pd.read_table('pcavals.eigenvec')

## fix covariate indexes
covars_filt = covars_filt.reset_index(drop=True) 

## add PCA columns
for i in range(1,11):
    covars_filt['PC'+str(i)] = pca['PC'+str(i)]
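
The cell above attaches the PCs by row position, which is only correct if `pca` and `covars_filt` list individuals in the same order. A more defensive sketch, assuming both tables carry an `IID` column with one row per individual, joins on the ID instead. The helper name `attach_pcs` is hypothetical.

```python
import pandas as pd

def attach_pcs(covars, pca, n_pcs=10):
    """Join principal components onto a covariate table by individual ID."""
    # plink2 may prefix the first header field with '#'; strip it if present
    pca = pca.rename(columns=lambda c: c.lstrip('#'))
    pc_cols = ['PC' + str(i) for i in range(1, n_pcs + 1)]
    # an inner join keeps only individuals present in both tables;
    # validate raises an error if an IID is duplicated on either side
    return covars.merge(pca[['IID'] + pc_cols], on='IID',
                        how='inner', validate='one_to_one')
```

The merged table could then be written out with something like `attach_pcs(covars_filt, pca).to_csv('covariates.tsv', sep='\t', index=False)`, where `covariates.tsv` is the file name expected by the GWAS command.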

## GWAS

After all covariates were compiled, GWAS was conducted using plink2 against all UK Biobank SNPs on individuals with covariate and heart rate data using the following command:

`plink2 --bfile ukball_merged --pheno heartrate_id.tsv --pheno-name hr --covar covariates.tsv --glm no-x-sex --covar-variance-standardize --out ukb_rhr_results`

## Filtering SNPs of interest

By default, plink2 outputs raw p-values. Before correcting for multiple-hypothesis testing, we split our analysis into two separate tests: one with SNPs occurring in or within ±500 bp of mouse sinus node pacemaker cell (PC) and/or right atrial cardiomyocyte (RACM) ATAC peaks, and one with SNPs occurring in or within ±500 bp of ATAC peaks differentially open in PC compared to RACM. UCSC liftOver was used to convert the mm9 ATAC peak coordinates to hg19 so that they could be compared with the SNP positions.
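
The filtering code is not part of this notebook. As a rough sketch of the idea, assuming the SNPs and the lifted-over peaks are both in hg19 and loaded into data frames with the hypothetical column names `chrom`, `pos`, `start`, and `end`, SNPs near a peak could be selected as follows. Peak intervals are treated as closed here, so off-by-one conventions (for example 0-based BED starts) would need to be handled separately.

```python
import numpy as np
import pandas as pd

def snps_near_peaks(snps, peaks, window=500):
    """Return the rows of `snps` lying in or within `window` bp of any peak.

    snps  : DataFrame with columns 'chrom' and 'pos'
    peaks : DataFrame with columns 'chrom', 'start', 'end' (closed intervals)
    """
    keep = np.zeros(len(snps), dtype=bool)
    for chrom, chrom_peaks in peaks.groupby('chrom'):
        on_chrom = (snps['chrom'] == chrom).to_numpy()
        if not on_chrom.any():
            continue
        pos = snps.loc[on_chrom, 'pos'].to_numpy()
        # extend each peak by the window on both sides
        starts = np.sort(chrom_peaks['start'].to_numpy() - window)
        ends = np.sort(chrom_peaks['end'].to_numpy() + window)
        # a position is covered when more intervals have started at or
        # before it than have ended strictly before it
        n_started = np.searchsorted(starts, pos, side='right')
        n_ended = np.searchsorted(ends, pos, side='left')
        keep[on_chrom] = n_started > n_ended
    return snps[keep]
```

Counting started and ended intervals with `np.searchsorted` avoids a nested loop over SNPs and peaks and still gives the right answer when extended peaks overlap.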

After filtering, Bonferroni correction was applied to the p-values, using the number of SNPs in each analysis as the number of tests.
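
For reference, the Bonferroni adjustment multiplies each raw p-value by the number of tests and caps the result at 1. A minimal sketch, assuming `pvals` holds only the non-missing p-values for the SNPs in one analysis (the function name is hypothetical):

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1."""
    pvals = np.asarray(pvals, dtype=float)
    return np.minimum(pvals * pvals.size, 1.0)
```

Equivalently, a SNP is significant at level alpha when its raw p-value is below alpha divided by the number of SNPs in that analysis.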