# GWAS on Resting Heart Rate

Ravi Mandla


## Data Download and Parsing

We downloaded our data from the UK Biobank, an online collection of biological information on ~500,000 individuals. All non-genotyping data is stored in a single large CSV file, which we parsed to isolate only the characteristics of interest. We were specifically interested in heart rate, BMI, age, sex, and ethnicity. These data are stored under data-field IDs 102, 21001, 21003, 31, and 21000 respectively.

In this CSV file, there are multiple columns per data-field ID, corresponding to repeat assessments of each characteristic (see https://biobank.ctsu.ox.ac.uk/~bbdatan/Repeat_assessment_doc_v1.0.pdf). To handle multiple data points per characteristic, we averaged all repeat assessments of each characteristic for each individual.
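The averaging step can be sketched as below. This is a minimal illustration rather than the parsing code actually used: it assumes the CSV names its columns `<field>-<instance>.<array>` (for example `102-0.0`) with an `eid` identifier column, and `average_field` is a hypothetical helper name.

```python
import csv
import io
from statistics import mean


def average_field(row, field_id):
    """Average every non-empty repeat-assessment column of one data-field."""
    prefix = f"{field_id}-"  # assumed column naming: "<field>-<instance>.<array>"
    values = [
        float(value)
        for column, value in row.items()
        if column.startswith(prefix) and value not in ("", None)
    ]
    return mean(values) if values else None  # None when every assessment is missing


# Tiny stand-in for the UK Biobank CSV (made-up values, not real data).
sample_csv = io.StringIO(
    "eid,102-0.0,102-0.1,102-1.0,21001-0.0\n"
    "1000001,60,64,,27.5\n"
    "1000002,,,,31.0\n"
)

averaged = {
    row["eid"]: {"hr": average_field(row, 102), "bmi": average_field(row, 21001)}
    for row in csv.DictReader(sample_csv)
}
```

Individuals for whom a field comes back as `None` have no usable measurement and would be dropped before the later steps.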

To reduce bias introduced through population differences due to ethnicity, we restricted our analysis to only individuals who self-reported as "White" according to data-field 21000. This includes "White", "British", "Irish", or "Other White Background".
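A minimal sketch of this filter follows. The numeric codes are an assumption about how data-field 21000 is encoded (1 = White, 1001 = British, 1002 = Irish, 1003 = any other white background) and should be checked against the UK Biobank data-coding before use; `is_white` is a hypothetical helper name.

```python
# Assumed data-field 21000 codes for the "White" group; verify against the
# UK Biobank data-coding before relying on them.
WHITE_CODES = {1, 1001, 1002, 1003}


def is_white(raw_code):
    """Return True if a raw data-field 21000 value falls in the White group."""
    try:
        return int(float(raw_code)) in WHITE_CODES
    except (TypeError, ValueError):
        # Missing or non-numeric answers are treated as not matching.
        return False


# Made-up example values: participant ID -> raw ethnicity code from the CSV.
raw_ethnicity = {"1000001": "1001", "1000002": "4001", "1000003": ""}
kept_ids = [eid for eid, code in raw_ethnicity.items() if is_white(code)]
```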

Genotyping data was downloaded into PLINK BED/BIM/FAM files using [ukbgene](http://biobank.ndph.ox.ac.uk/showcase/download.cgi?id=665&ty=ut). There is one trio of PLINK files per chromosome, containing SNP data on all individuals with genotyping data (488,377 individuals).

## PCA

To control for possible population stratification, PCA was used to generate 10 principal components to include as covariates in our analysis. To do so, individuals with genotyping, heart rate, BMI, sex, and age data who identified as "White" were isolated into separate BED/BIM/FAM files, for a total of 420,553 individuals.

Rather than run PCA on all SNPs, we chose to run it on a random sample of 100,000 SNPs to reduce the computational burden and processing time. All BIM files were merged, and 100,000 of the 805,426 SNPs stored in the UK Biobank were randomly selected from the merged file.
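The random selection can be sketched as below. This is an illustration under stated assumptions, not the script that was used: it relies on the standard BIM layout in which the variant ID is the second whitespace-separated column, and `read_bim_ids`, `sample_snps`, and the fixed seed are hypothetical choices.

```python
import random


def read_bim_ids(lines):
    """Return the variant ID (second column) of every non-empty BIM line."""
    return [line.split()[1] for line in lines if line.strip()]


def sample_snps(variant_ids, n, seed=0):
    """Draw n distinct variant IDs; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(variant_ids, n)


# Made-up BIM lines: chromosome, variant ID, cM position, bp position, alleles.
bim_lines = [
    "1\trs1\t0\t1000\tA\tG",
    "1\trs2\t0\t2000\tC\tT",
    "2\trs3\t0\t1500\tG\tA",
    "2\trs4\t0\t2500\tT\tC",
    "3\trs5\t0\t3000\tA\tC",
]

all_ids = read_bim_ids(bim_lines)
chosen = sample_snps(all_ids, 3)  # the analysis itself drew 100,000 SNPs
```

Writing `chosen` to a text file with one ID per line gives a list that plink2's `--extract` option can use to subset the genotype files.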

Randomly sampled SNPs were compiled into one BED/BIM/FAM trio, and PCA was run using the command:

`plink2 --bfile readyforpca --pca approx 10 --out pcavals`

The command output a TSV file containing two columns for FID and IID, and one column per PC. These PC columns were appended to the rest of the covariate data.
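Appending the PC columns amounts to a join on the sample identifiers. The sketch below assumes the PCA output header starts with `#FID` and `IID`, and that the covariate table carries matching `FID` and `IID` columns; the covariate column names, the values, and the helper names are made up for illustration.

```python
import csv
import io


def read_tsv(handle):
    """Read a TSV into a list of dicts, dropping a leading '#' from header names."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{key.lstrip("#"): value for key, value in row.items()} for row in reader]


def append_pcs(covariates, eigenvec):
    """Add the PC columns to each covariate row that has a matching FID/IID pair."""
    pcs_by_id = {(row["FID"], row["IID"]): row for row in eigenvec}
    merged = []
    for cov in covariates:
        match = pcs_by_id.get((cov["FID"], cov["IID"]))
        if match is None:
            continue  # individuals without PCs are left out
        pcs = {key: value for key, value in match.items() if key.startswith("PC")}
        merged.append({**cov, **pcs})
    return merged


# Made-up stand-ins for the PCA output and the covariate table.
eigenvec_tsv = io.StringIO("#FID\tIID\tPC1\tPC2\n1\t1\t0.01\t-0.02\n2\t2\t0.03\t0.04\n")
covar_tsv = io.StringIO("FID\tIID\tage\tbmi\tsex\n1\t1\t55\t27.5\t1\n3\t3\t60\t24.0\t0\n")

merged = append_pcs(read_tsv(covar_tsv), read_tsv(eigenvec_tsv))
```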

## GWAS

After all covariates were compiled, GWAS was conducted using plink2 against all UK Biobank SNPs on individuals with covariate and heart rate data using the following command:

`plink2 --bfile ukb_cal_chr1_v2_covfilg --pheno heartrate-id.tsv --pheno-name hr --covar filtered-covariates.tsv --glm --threads 14 --covar-variance-standardize --out chr1`

This command was run once per chromosome, with the runs automated by a function.
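A minimal sketch of how the per-chromosome runs could be automated is shown below. It is not the function originally used. It assumes every chromosome follows the file naming of the chr1 command above and that only autosomes 1-22 are run; `build_glm_command` and `run_gwas` are hypothetical names.

```python
import subprocess


def build_glm_command(chrom, threads=14):
    """Build the plink2 argument list for one chromosome (assumed file naming)."""
    return [
        "plink2",
        "--bfile", f"ukb_cal_chr{chrom}_v2_covfilg",
        "--pheno", "heartrate-id.tsv",
        "--pheno-name", "hr",
        "--covar", "filtered-covariates.tsv",
        "--glm",
        "--threads", str(threads),
        "--covar-variance-standardize",
        "--out", f"chr{chrom}",
    ]


def run_gwas(chromosomes=range(1, 23), dry_run=True):
    """Return the per-chromosome commands; execute them only when dry_run is False."""
    commands = [build_glm_command(chrom) for chrom in chromosomes]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # stop at the first failing run
    return commands


commands = run_gwas()  # dry run: nothing is executed
```

Calling `run_gwas(dry_run=False)` would launch plink2 for each chromosome in turn, provided plink2 is on the `PATH` and the input files exist.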

## Filtering SNPs of interest

plink2 outputs raw p-values by default. Before correcting for multiple-hypothesis testing, we restricted our analysis to two separate sets of SNPs: one with SNPs occurring in or within +/- 500 bp of mouse sinus node pacemaker cell (PC) and/or right atrial cardiomyocyte (RACM) ATAC peaks, and one with SNPs occurring in or within +/- 500 bp of ATAC peaks differentially open in PC compared to RACM. UCSC liftOver was used to convert the mm9 peak coordinates to hg19 so that ATAC peaks could be compared with SNP positions.
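The +/- 500 bp filter can be sketched as an interval lookup. The example assumes the peaks have already been lifted over to hg19, that peaks and SNPs use the same chromosome naming, and that peak bounds are treated as inclusive; `build_index` and `near_peak` are hypothetical names and the coordinates are made up.

```python
from bisect import bisect_right


def build_index(peaks, flank=500):
    """Pad each (chrom, start, end) peak by `flank` bp and merge overlaps per chromosome."""
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((max(0, start - flank), end + flank))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlaps the previous padded peak: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def near_peak(index, chrom, pos):
    """Return True if a SNP position lies in or within the flank of any peak."""
    intervals = index.get(chrom, [])
    # Find the last padded interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


peaks = [("chr1", 1000, 2000), ("chr1", 2400, 2600), ("chr2", 5000, 5200)]
index = build_index(peaks)
snps = [("rsA", "chr1", 400), ("rsB", "chr1", 2300), ("rsC", "chr2", 5700), ("rsD", "chr2", 5701)]
kept = [name for name, chrom, pos in snps if near_peak(index, chrom, pos)]
```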

After filtering, Bonferroni correction was applied to the p-values in each analysis separately, using the number of SNPs in that analysis as the number of tests.
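As a brief illustration, Bonferroni correction multiplies each raw p-value by the number of tests in its analysis and caps the result at 1. The sketch below uses made-up p-values, and `bonferroni` is a hypothetical helper name.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Made-up raw p-values for the SNPs retained in one analysis.
raw_p = [0.001, 0.02, 0.3, 0.5]
adjusted = bonferroni(raw_p)
significant = [p < 0.05 for p in adjusted]
```

Equivalently, the raw p-values can be compared against a threshold of 0.05 divided by the number of SNPs in that analysis.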