# 02_Hail_part2_GWAS

**Authors:** Jennifer Zhang, Francis Ratsimbazafy, Jun Qian

**Contributors:** Christopher Lord, Nicole Deflaux, Kelsey Mayo, Lee Lichtenstein, *All of Us* Genomic Users

With modifications by Stephan Cordogan

## Set up Notebook

Import the necessary packages.

In [None]:
from datetime import datetime
import os
import pandas as pd
import hail as hl

In [None]:
start = datetime.now()

Save the workspace bucket path in a variable so it can be used to read and save files.

In [None]:
bucket = os.getenv('WORKSPACE_BUCKET')
bucket

In [None]:
hl.init(default_reference = "GRCh38")

## Load Hail MatrixTable containing the variants

The *All of Us* Data and Research Center provides eight Hail MatrixTables with variants from different genome regions. For most GWAS, one of three will be sufficient: the MatrixTable of ClinVar variants, the MatrixTable with multi-allelic variants, or the MatrixTable with multi-allelic variants split.

In [None]:
# mt_path = os.getenv("WGS_CLINVAR_SPLIT_HAIL_PATH")
mt_path = os.getenv("WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH")
# mt_path = os.getenv("WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH")
mt_path

Import the Hail MatrixTable.

In [None]:
mt = hl.read_matrix_table(mt_path)

The following code can be used to limit the GWAS to a specific chromosome, a set of chromosomes, or one or more chromosomal regions. Uncomment the desired `test_intervals` line in the following code block (and comment out the others) to apply the filter.

In [None]:
# test_intervals = ['chr6:55000000-57000000','chr12:122000000-124000000']
# test_intervals = ['chr6:55000000-57000000','chr13:49000000-51000000']
# test_intervals = ['chr6:55000000-57000000','chr4:80000000-81000000']
# test_intervals = ['chr6:55000000-57000000','chr6:7000000-8000000']
test_intervals = ['chr6:55000000-57000000']

# test_intervals = ['chr6:135000000-145000000','chr9:14000000-20000000', 'chr10:100000000-105000000']


# test_intervals = ['chr6', 'chr20']

# test_intervals = ['chrX:68112092-68112093','chrX:10198158-10198159','chrX:118782557-118782558'
#  ,'chrX:79170974-79170975' ,'chrX:21843316-21843317', 'chrX:40063106-40063107', 
#                  'chrX:53434412-53434413', 'chrX:110688628-110688629', 'chrX:101259997-101259998'] 

# test_intervals = ['chrY:13470103-13470104']

In [None]:
mt = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x)
     for x in test_intervals])
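
To make the interval strings above concrete, here is a minimal pure-Python sketch of how a string such as `'chr6:55000000-57000000'` or a bare contig such as `'chr6'` can be read as a region, and how a locus is tested against it. It is not Hail's parser: `Interval`, `parse_interval`, and `contains` are hypothetical helpers, and the left-inclusive, right-exclusive bounds are an assumption meant to mirror Hail's default.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    contig: str
    start: Optional[int]  # None means "from the start of the contig"
    end: Optional[int]    # None means "to the end of the contig"


def parse_interval(text: str) -> Interval:
    """Parse 'chr6:55000000-57000000' or a bare contig such as 'chr6'."""
    if ":" not in text:
        return Interval(text, None, None)
    contig, span = text.split(":", 1)
    start, end = span.split("-", 1)
    return Interval(contig, int(start), int(end))


def contains(interval: Interval, contig: str, position: int) -> bool:
    """Membership test, assuming left-inclusive and right-exclusive bounds."""
    if contig != interval.contig:
        return False
    if interval.start is None:
        return True  # whole-contig interval
    return interval.start <= position < interval.end


intervals = [parse_interval(s) for s in ["chr6:55000000-57000000", "chr20"]]

# Hypothetical loci: inside the region, exactly at its end, and on chr20.
loci = [("chr6", 56000000), ("chr6", 57000000), ("chr20", 123)]
kept = [locus for locus in loci
        if any(contains(iv, *locus) for iv in intervals)]
print(kept)
```

A locus is kept if it falls in any of the listed intervals, which is why several regions can be combined in one `test_intervals` list.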

## Load phenotypic data

Let’s read the pre-generated phenotypic data into a Hail Table for later use. We will use the function `import_table` in this step.

In the demographics dataframe that we created and saved to the bucket earlier, each row represents data for one person ID. The same person IDs also represent the columns in the matrix table.

- Read the phenotype file from your workspace bucket

In [None]:
phenotype_filename = f'{bucket}/data/genomics_phenotypes.tsv'
phenotype_filename

In [None]:
phenotypes = (hl.import_table(phenotype_filename,
                              types={'person_id':hl.tstr},
                              impute=True,
                              key='person_id')
             )

Before performing the variant QC steps, we need to filter the Hail MatrixTable so that it keeps only samples with phenotype values. We will use the function `semi_join_cols` to keep only the samples that appear in the pre-generated phenotype file.

In [None]:
mt = mt.semi_join_cols(phenotypes)
#mt.count()

## Link phenotypic data with genomic data

Before running a GWAS, we need to annotate the genomic data with the phenotype data. We will use the function `annotate_cols` to perform this step.

In [None]:
mt = mt.annotate_cols(pheno = phenotypes[mt.s])
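
The two steps above (keep only samples that have a phenotype row, then attach that row to each sample) can be pictured with plain Python containers. This is only an analogy for what `semi_join_cols` and `annotate_cols` do to the columns of the MatrixTable; the sample IDs and phenotype values are made up for illustration.

```python
# Hypothetical column keys of the MatrixTable (the values of mt.s).
sample_ids = ["1001", "1002", "1003", "1004"]

# Hypothetical phenotype table keyed by person_id (stored as strings so the
# keys have the same type as the sample IDs they are matched against).
phenotypes = {
    "1001": {"has_pheno": 1, "age_yrs": 54},
    "1003": {"has_pheno": 0, "age_yrs": 61},
}

# Analogue of semi_join_cols: keep only columns whose key is in the table.
kept_ids = [s for s in sample_ids if s in phenotypes]

# Analogue of annotate_cols(pheno=phenotypes[mt.s]): look up each remaining
# column key in the table and attach the matching row.
annotated = [{"s": s, "pheno": phenotypes[s]} for s in kept_ids]

# Without the semi-join, samples lacking a phenotype row would simply get a
# missing annotation instead of being dropped.
unfiltered = [{"s": s, "pheno": phenotypes.get(s)} for s in sample_ids]

print(kept_ids)
print(annotated[0])
```

The key types have to agree for the lookup to match, which is presumably why `person_id` is read as `hl.tstr` even though `impute=True` would otherwise infer an integer.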

# Pre-process the genomic data

## Sex discrepancy

Sex concordance is part of the *All of Us* upstream genomic data quality control process, and all samples have passed the sex concordance check, so we do not need to perform this step here. For more details about the sex concordance check, please refer to the [All of Us Genomic Quality Report](https://aousupporthelp.zendesk.com/hc/en-us/articles/4617899955092-All-of-Us-Beta-Release-Genomic-Quality-Report-).

## Relatedness

The *All of Us* Data and Research Center provides a list of related samples that should be removed from the full cohort. We will use the function `anti_join_cols` to drop every sample that appears in this list. For more details about the list of related samples, please refer to the support article [How the All of Us Genomic data are organized](https://aousupporthelp.zendesk.com/hc/en-us/articles/4614687617556-How-the-All-of-Us-Genomic-data-are-organized).

In [None]:
related_samples_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/relatedness/relatedness_flagged_samples.tsv"

In [None]:
related_remove = hl.import_table(related_samples_path,
                                 types={"sample_id": hl.tstr},
                                 key="sample_id")

#related_remove.count()

In [None]:
mt = mt.anti_join_cols(related_remove)
#mt.count()

## Population stratification

The *All of Us* Data and Research Center provides genetically predicted ancestry, including the principal components (PCs), for the WGS data. We will incorporate these PCs into the Hail MatrixTable and use them as covariates to account for population stratification during model building. In this step, we read the ancestry table and annotate the Hail MatrixTable with it.

For more information about the ancestry prediction table, please refer to this support article [How the All of Us Genomic data are organized](https://aousupporthelp.zendesk.com/hc/en-us/articles/4614687617556-How-the-All-of-Us-Genomic-data-are-organized).

In [None]:
ancestry_pred_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv"

In [None]:
ancestry_pred = hl.import_table(ancestry_pred_path,
                                key="research_id",
                                impute=True,
                                types={"research_id": hl.tstr,
                                       "pca_features": hl.tarray(hl.tfloat)})

In [None]:
mt = mt.annotate_cols(ancestry_pred = ancestry_pred[mt.s])

The code below filters the cohort by majority predicted ancestry. When running a multi-ancestry GWAS, results can be slightly improved by analyzing each ancestry group separately and then meta-analyzing the results.

In [None]:
mt_eur = mt.filter_cols(mt.ancestry_pred.ancestry_pred == "eur")
# mt_eur = mt.filter_cols(hl.literal(["eur", "eas"]).contains(mt.ancestry_pred.ancestry_pred))
# mt_eur.col.describe()
# mt_orig = mt
# mt = mt_eur

## Heterozygosity rate

Allele balance (AB), one of the “Hard Threshold Filters”, is part of the *All of Us* upstream data quality control process. Variants that do not meet the threshold AB >= 0.2 for heterozygotes are marked as `NO_HQ_GENOTYPES` in the `filters` field. Other criteria flag specific variants in the callset, which results in different values in the `filters` field.

Variants that do not pass the filters have already been removed from the MatrixTables for smaller regions. For more details about the `filters` field, please refer to this support article [How the All of Us Genomic data are organized](https://aousupporthelp.zendesk.com/hc/en-us/articles/4614687617556-How-the-All-of-Us-Genomic-data-are-organized).

## Minor allele frequency (MAF)

Since we have removed some samples, we need to recompute common variant statistics for the variant quality control metrics.

In [None]:
mt = hl.variant_qc(mt)

Smaller cohorts have little power to detect associations with rare variants, so applying a minor allele frequency (MAF) threshold can speed up computation with minimal impact on results. The filter below keeps variants whose MAF is above 0.05.

In [None]:
mt = mt.filter_rows(hl.min(mt.variant_qc.AF) > 0.05, keep = True)
#mt.count()
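
As a rough illustration of what the filter `hl.min(mt.variant_qc.AF) > 0.05` checks, the sketch below computes reference and alternate allele frequencies from per-sample alternate-allele counts and keeps a variant only when the smaller of the two exceeds the threshold. It assumes biallelic variants (as in the split MatrixTable); `allele_frequencies` and the genotype lists are hypothetical and stand in for the statistics that `hl.variant_qc` computes.

```python
def allele_frequencies(alt_allele_counts):
    """Return [ref_freq, alt_freq] from per-sample alt-allele counts.

    Each entry is 0, 1, or 2, or None for a missing genotype call.
    """
    called = [g for g in alt_allele_counts if g is not None]
    n_alleles = 2 * len(called)  # diploid: two alleles per called sample
    if n_alleles == 0:
        return None
    alt_freq = sum(called) / n_alleles
    return [1.0 - alt_freq, alt_freq]


MAF_THRESHOLD = 0.05

# Hypothetical variants: one common, one with a rare alternate allele, and
# one where the *reference* allele is the rare one.
variants = {
    "common":   [0, 1, 1, 2, 0, 1, None, 0],
    "rare_alt": [0] * 19 + [1],
    "rare_ref": [2] * 19 + [1],
}

kept = []
for name, genotypes in variants.items():
    af = allele_frequencies(genotypes)
    # Taking the minimum gives the *minor* allele frequency regardless of
    # whether the reference or the alternate allele is the rarer one.
    if af is not None and min(af) > MAF_THRESHOLD:
        kept.append(name)

print(kept)
```

Taking the minimum over both frequencies is what makes this a minor allele frequency filter: `rare_ref` is dropped even though its alternate allele is very common.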

## Deviations from Hardy–Weinberg equilibrium (HWE)

Our multi-ancestry cohort violates the assumptions of HWE, so the HWE filter has been disabled (it is left commented out below). An HWE filter can be applied post hoc if necessary.

In [None]:
#mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-20, keep = True)
# mt.count()
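
A small worked example may help show why pooling ancestry groups breaks HWE even when every group is in equilibrium on its own. The sketch below uses a chi-square goodness-of-fit test as a simple stand-in for the HWE p-value that Hail reports; `hwe_chi_square` is a hypothetical helper and the genotype counts are invented.

```python
import math


def hwe_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Chi-square test (1 degree of freedom) against HWE proportions.

    Returns (chi2, p_value). Assumes at least one genotyped sample.
    """
    n = n_hom_ref + n_het + n_hom_alt
    alt_freq = (n_het + 2 * n_hom_alt) / (2 * n)
    ref_freq = 1.0 - alt_freq
    if alt_freq in (0.0, 1.0):
        return 0.0, 1.0  # monomorphic site: nothing to test
    expected = [ref_freq ** 2 * n, 2 * ref_freq * alt_freq * n, alt_freq ** 2 * n]
    observed = [n_hom_ref, n_het, n_hom_alt]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Two hypothetical groups of 1000 samples, each exactly in HWE, but with
# different alternate-allele frequencies (0.1 and 0.9).
group_a = (810, 180, 10)
group_b = (10, 180, 810)
pooled = tuple(a + b for a, b in zip(group_a, group_b))  # (820, 360, 820)

chi2_a, p_a = hwe_chi_square(*group_a)
chi2_pooled, p_pooled = hwe_chi_square(*pooled)
print(p_a, p_pooled)
```

In the pooled counts the allele frequency is 0.5, so HWE predicts 1000 heterozygotes out of 2000 samples, yet only 360 are observed. A strict HWE filter applied to the pooled cohort would therefore discard variants for reasons unrelated to genotyping quality.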

# Genome-Wide Association Study (GWAS)

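The following code runs a logistic regression with our disease status as the dependent variable and sex, age, and the 16 available ancestry principal components as covariates. The leading `1.0` in the covariate list is the intercept term, which Hail requires to be included explicitly.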

In [None]:
# Intercept, sex, age, and the 16 ancestry PCs (indices 0 through 15).
covariates = [1.0, mt.pheno.is_male, mt.pheno.age_yrs] + [
    mt.ancestry_pred.pca_features[i] for i in range(16)
]

In [None]:
log_reg = hl.logistic_regression_rows(
    test='wald',
    y=mt.pheno.has_pheno,
    x=mt.GT.n_alt_alleles(),
    covariates=covariates
)
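
To give a sense of what a Wald test produces for each variant, here is a deliberately simplified pure-Python sketch: a logistic model with only an intercept and the alternate-allele count, fitted by Newton's method, followed by the Wald statistic for the genotype coefficient. It is not Hail's implementation and omits the other covariates; `fit_logistic_wald` and the case/control counts are hypothetical.

```python
import math


def fit_logistic_wald(genotypes, outcomes, max_iter=25, tol=1e-10):
    """Fit logit(P(y = 1)) = b0 + b1 * genotype by Newton's method.

    Returns (beta, standard_error, z_stat, p_value) for the genotype term.
    Assumes the genotype varies across samples (otherwise det is zero).
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0        # gradient of the log-likelihood
        a = b = d = 0.0      # Fisher information matrix [[a, b], [b, d]]
        for x, y in zip(genotypes, outcomes):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            a += w
            b += w * x
            d += w * x * x
        det = a * d - b * b
        step0 = (d * g0 - b * g1) / det
        step1 = (a * g1 - b * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    # Variance of b1 is the matching diagonal entry of the inverse
    # information, taken from the last iteration (the final step is tiny).
    standard_error = math.sqrt(a / det)
    z_stat = b1 / standard_error
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))  # two-sided normal p-value
    return b1, standard_error, z_stat, p_value


# Hypothetical counts per genotype: (controls, cases).
counts = {0: (40, 10), 1: (25, 25), 2: (10, 40)}
genotypes, outcomes = [], []
for g, (n_ctrl, n_case) in counts.items():
    genotypes += [g] * (n_ctrl + n_case)
    outcomes += [0] * n_ctrl + [1] * n_case

beta, standard_error, z_stat, p_value = fit_logistic_wald(genotypes, outcomes)
print(beta, standard_error, z_stat, p_value)
```

In this toy data the odds of being a case multiply by 4 with each additional alternate allele, so the fitted `beta` is the natural log of 4. The real analysis fits the same kind of model once per variant, with the full covariate list included.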

Export the GWAS results to the workspace bucket for downstream analysis.

In [None]:
log_reg = log_reg.flatten()

log_reg_save_path = f'{bucket}/data/log_reg.tsv.bgz'

log_reg.export(log_reg_save_path)

The following code can be run in another notebook within this workspace to load the GWAS results. In a new notebook, `bucket` and `log_reg_save_path` must be defined again before this cell is run.

In [None]:
gwas_result = hl.import_table(
    log_reg_save_path,
    types={"locus": hl.tlocus(reference_genome='GRCh38'),
           "alleles": hl.tarray(hl.tstr),
           "beta": hl.tfloat64,
           "p_value": hl.tfloat64,
           "fit.n_iterations": hl.tint32})
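
One common first step with the exported results is to list the variants that pass a significance threshold. The sketch below does this on a tiny in-memory TSV using only the standard library. The rows are invented, the `5e-8` cutoff is the conventional genome-wide significance level rather than something set in this notebook, and the `NA` marker for missing values is an assumption about the exported file.

```python
import csv
import io

# Hypothetical excerpt of an exported results file (tab-separated).
tsv_text = (
    "locus\talleles\tbeta\tp_value\n"
    'chr6:55100000\t["A","G"]\t0.41\t3.2e-09\n'
    'chr6:55200000\t["C","T"]\t-0.05\t0.47\n'
    'chr6:55300000\t["G","T"]\tNA\tNA\n'
)

GENOME_WIDE_SIGNIFICANCE = 5e-8  # conventional threshold (assumption here)


def parse_float(value):
    """Convert a TSV field to float, treating '' and 'NA' as missing."""
    return None if value in ("", "NA") else float(value)


# QUOTE_NONE keeps the double quotes inside the alleles field as literal text.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)

hits = []
for row in reader:
    p = parse_float(row["p_value"])
    if p is not None and p < GENOME_WIDE_SIGNIFICANCE:
        hits.append((row["locus"], parse_float(row["beta"]), p))

print(hits)
```

Rows with a missing p-value (for example, variants where the model did not converge) are skipped rather than treated as significant or non-significant.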