## INTRO

**Author:** Stephan Cordogan

This notebook generates the genotype matrices required to compute 10 principal components (PCs) for each of 5 broad, sufficiently represented ancestry groups. It uses a subset of high-quality sites pruned for linkage disequilibrium and further filtered to a minimum allele frequency of 1%. The resulting PCs for every individual in these ancestry groups, as well as the loadings used to generate them, will be saved to the workspace bucket for downstream use. Computing ancestry principal components is an expensive step in a GWAS and should not be done more than once if possible.
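
To make the 1% frequency filter concrete, here is a plain-Python sketch of the rule the notebook later applies with `hl.min(...info.AF) > 0.01`. The `AF` array from `hl.agg.call_stats` has one entry per allele, reference included, so taking the minimum requires every allele at a site to exceed the threshold. The function names and the toy genotypes are illustrative assumptions, not part of the pipeline, and returning zeros for a site with no calls is a choice made only for this sketch.

```python
from collections import Counter


def allele_frequencies(genotypes, n_alleles):
    """Return per-allele frequencies (index 0 = reference) from diploid calls.

    `genotypes` is a list of (a, b) allele-index pairs; None marks a missing call.
    """
    counts = Counter()
    for call in genotypes:
        if call is None:  # missing calls contribute no alleles
            continue
        counts.update(call)  # counts each of the two allele indices
    total = sum(counts.values())
    if total == 0:  # toy choice: a site with no calls gets all-zero frequencies
        return [0.0] * n_alleles
    return [counts[i] / total for i in range(n_alleles)]


def passes_min_af(genotypes, n_alleles, threshold=0.01):
    """Keep a site only if every allele, reference included, exceeds the threshold."""
    return min(allele_frequencies(genotypes, n_alleles)) > threshold


# 100 samples, one heterozygous carrier: alt AF = 1/200 = 0.005
rare = [(0, 0)] * 99 + [(0, 1)]
# 100 samples, ten heterozygous carriers: alt AF = 10/200 = 0.05
common = [(0, 0)] * 90 + [(0, 1)] * 10

print(passes_min_af(rare, 2), passes_min_af(common, 2))
```

For a biallelic site this is the usual minor-allele-frequency cutoff. For a multi-allelic site it is stricter, since a single rare alternate allele is enough to drop the whole site.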

A Hail Genomic Analysis environment is required for this notebook, as well as for Notebooks 1.12, 1.13, 2, 2.11, and 2.12. The resources required depend on the size of your cohort and the number of genetic variants. For the entire cohort, with roughly 30 million SNPs per ancestry group, I recommend 12 workers (6 of them preemptible) with 32 CPUs and 208 GB of RAM per worker, the same configuration for the head node, and 500 GB disks. You can monitor resource usage by clicking "View detailed spend report" in the About tab of your workspace and searching for Metrics Explorer on the following page.

## Import Necessary Packages

In [None]:
from datetime import datetime
import os
import pandas as pd
import hail as hl


In [None]:
start = datetime.now()
bucket = os.getenv('WORKSPACE_BUCKET')
print(bucket)
hl.init(default_reference = "GRCh38")
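
Because `os.getenv` returns `None` when a variable is unset, a missing `WORKSPACE_BUCKET` would only surface much later as an output path starting with `None/`. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here, not something the workspace provides.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, `bucket = require_env('WORKSPACE_BUCKET')` could replace the plain `os.getenv` call, and the same check would apply to the matrix table path variables read below.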

## Load Hail MT and filtered, pruned sites

In [None]:
# mt_path = os.getenv("WGS_CLINVAR_SPLIT_HAIL_PATH")
# mt_path = os.getenv("WGS_ACAF_THRESHOLD_SPLIT_HAIL_PATH")
mt_path = os.getenv("WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH")
mt_path

In [None]:
mt = hl.read_matrix_table(mt_path)
# mt.describe()

In [None]:
vcf_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/ancestry/merged_sites_only_intersection.vcf.bgz"
mt_qual = hl.import_vcf(vcf_path)

In [None]:
# mt_qual.describe()

Optionally, filter matrix table by region

In [None]:
# # test_intervals = ['chr6:45000000-46000000']
# test_intervals = ['chr2','chr3','chr4','chr5','chr6', 'chr7', 'chr8', 'chr9']

# mt = hl.filter_intervals(
#     mt,
#     [hl.parse_locus_interval(x,)
#      for x in test_intervals])

# mt_qual = hl.filter_intervals(
#     mt_qual,
#     [hl.parse_locus_interval(x,)
#      for x in test_intervals])

Subset matrix table for filtered, pruned sites

In [None]:
# Row table of the high-quality, LD-pruned sites, keyed by (locus, alleles)
qual_variants = mt_qual.rows()
# Keep only the rows of mt whose (locus, alleles) key is present in mt_qual
mt = mt.filter_rows(hl.is_defined(qual_variants[mt.locus, mt.alleles]))
# mt.count()
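
The filter above is a keyed lookup: a row survives only if its full `(locus, alleles)` key appears in the site list. The plain-Python model below illustrates this, with made-up variants and a hypothetical `filter_to_sites` helper. It does not use Hail, and it only shows the matching rule, not how Hail executes the join.

```python
def filter_to_sites(variants, site_keys):
    """Keep variants whose (contig, position, alleles) key is present in `site_keys`."""
    return [
        v for v in variants
        if (v["contig"], v["position"], tuple(v["alleles"])) in site_keys
    ]


# Toy site list standing in for the pruned, high-quality sites
site_keys = {("chr1", 1000, ("A", "G")), ("chr1", 2000, ("C", "T"))}

variants = [
    {"contig": "chr1", "position": 1000, "alleles": ["A", "G"]},       # exact match: kept
    {"contig": "chr1", "position": 2000, "alleles": ["C", "T", "G"]},  # same locus, different alleles: dropped
    {"contig": "chr1", "position": 3000, "alleles": ["G", "A"]},       # locus not in the site list: dropped
]

kept = filter_to_sites(variants, site_keys)
print([v["position"] for v in kept])
```

The second toy variant shows why the allele representation matters: a site at the right position is still dropped if its alleles are listed differently from the site list, which is worth checking with a row count when the input table is not split.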

## Remove related samples

In [None]:
related_samples_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/relatedness/relatedness_flagged_samples.tsv"
related_remove = hl.import_table(related_samples_path,
                                 types={"sample_id":"tstr"},
                                 key="sample_id")

#related_remove.count()
mt = mt.anti_join_cols(related_remove)
#mt.count()
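
The import above forces `sample_id` to a string type, presumably so that it matches the string-typed column key `s` of the matrix table; an integer-typed ID would not match any string key. The sketch below models the anti-join in plain Python with invented sample IDs and hypothetical helper names, to show both the removal and the type pitfall.

```python
import csv
import io


def read_flagged_ids(tsv_text, column="sample_id"):
    """Read one column of a tab-separated table and return its values as a set of strings."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader}


def drop_flagged(sample_ids, flagged):
    """Return the samples that are not in `flagged`, preserving order (an anti-join)."""
    return [s for s in sample_ids if s not in flagged]


# Toy table standing in for the flagged-samples file
toy_tsv = "sample_id\n1000002\n1000004\n"
flagged = read_flagged_ids(toy_tsv)

samples = ["1000001", "1000002", "1000003", "1000004"]
unrelated = drop_flagged(samples, flagged)
print(unrelated)

# Type pitfall: an integer key never equals a string key, so nothing is removed
print(drop_flagged(["1000002"], {1000002}))
```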

## Population stratification

Load in predicted ancestry to subset population by broad ancestral groups

In [None]:
ancestry_pred_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv"
ancestry_pred = hl.import_table(ancestry_pred_path,
                                key="research_id",
                                impute=True,
                                types={"research_id":"tstr","pca_features":hl.tarray(hl.tfloat)})
mt = mt.annotate_cols(ancestry_pred = ancestry_pred[mt.s])

In [None]:
# mt.describe()

In [None]:
used_ancestries = hl.literal({"eur", "afr", "amr", "eas", "sas"})

# Keep only samples whose predicted ancestry is one of the used groups
mt = mt.filter_cols(used_ancestries.contains(mt.ancestry_pred.ancestry_pred))
# mt = mt.annotate_rows(info = hl.agg.call_stats(mt.GT, mt.alleles))

### EUR Population ###
mt_eur = mt.filter_cols(mt.ancestry_pred.ancestry_pred == "eur")
mt_eur = mt_eur.annotate_rows(info = hl.agg.call_stats(mt_eur.GT, mt_eur.alleles))
mt_eur = mt_eur.filter_rows(hl.min(mt_eur.info.AF) > 0.01, keep=True)

mt_eur_filtered = mt_eur.key_rows_by().key_cols_by().select_entries('GT') 
mt_eur_filtered = mt_eur_filtered.select_cols('s', 'ancestry_pred')  
mt_eur_filtered = mt_eur_filtered.select_rows('locus', 'alleles', 'info') 

# Save the filtered MatrixTable for EUR
mt_eur_save_path = f'{bucket}/data/mt_eur_filtered.mt'
mt_eur_filtered.write(mt_eur_save_path, overwrite=True)

### AFR Population ###
mt_afr = mt.filter_cols(mt.ancestry_pred.ancestry_pred == "afr")
mt_afr = mt_afr.annotate_rows(info = hl.agg.call_stats(mt_afr.GT, mt_afr.alleles))
mt_afr = mt_afr.filter_rows(hl.min(mt_afr.info.AF) > 0.01, keep=True)

mt_afr_filtered = mt_afr.key_rows_by().key_cols_by().select_entries('GT')  
mt_afr_filtered = mt_afr_filtered.select_cols('s', 'ancestry_pred')  
mt_afr_filtered = mt_afr_filtered.select_rows('locus', 'alleles', 'info')  

# Save the filtered MatrixTable for AFR
mt_afr_save_path = f'{bucket}/data/mt_afr_filtered.mt'
mt_afr_filtered.write(mt_afr_save_path, overwrite=True)

### AMR Population ###
mt_amr = mt.filter_cols(mt.ancestry_pred.ancestry_pred == "amr")
mt_amr = mt_amr.annotate_rows(info = hl.agg.call_stats(mt_amr.GT, mt_amr.alleles))
mt_amr = mt_amr.filter_rows(hl.min(mt_amr.info.AF) > 0.01, keep=True)

mt_amr_filtered = mt_amr.key_rows_by().key_cols_by().select_entries('GT')  
mt_amr_filtered = mt_amr_filtered.select_cols('s', 'ancestry_pred')  
mt_amr_filtered = mt_amr_filtered.select_rows('locus', 'alleles', 'info')  

# Save the filtered MatrixTable for AMR
mt_amr_save_path = f'{bucket}/data/mt_amr_filtered.mt'
mt_amr_filtered.write(mt_amr_save_path, overwrite=True)

### EAS Population ###
mt_eas = mt.filter_cols(mt.ancestry_pred.ancestry_pred == "eas")
mt_eas = mt_eas.annotate_rows(info = hl.agg.call_stats(mt_eas.GT, mt_eas.alleles))
mt_eas = mt_eas.filter_rows(hl.min(mt_eas.info.AF) > 0.01, keep=True)

mt_eas_filtered = mt_eas.key_rows_by().key_cols_by().select_entries('GT')  
mt_eas_filtered = mt_eas_filtered.select_cols('s', 'ancestry_pred')  
mt_eas_filtered = mt_eas_filtered.select_rows('locus', 'alleles', 'info')  

# Save the filtered MatrixTable for EAS
mt_eas_save_path = f'{bucket}/data/mt_eas_filtered.mt'
mt_eas_filtered.write(mt_eas_save_path, overwrite=True)

### SAS Population ###
mt_sas = mt.filter_cols(mt.ancestry_pred.ancestry_pred == "sas")
mt_sas = mt_sas.annotate_rows(info = hl.agg.call_stats(mt_sas.GT, mt_sas.alleles))
mt_sas = mt_sas.filter_rows(hl.min(mt_sas.info.AF) > 0.01, keep=True)

mt_sas_filtered = mt_sas.key_rows_by().key_cols_by().select_entries('GT')  
mt_sas_filtered = mt_sas_filtered.select_cols('s', 'ancestry_pred')  
mt_sas_filtered = mt_sas_filtered.select_rows('locus', 'alleles', 'info')  

# Save the filtered MatrixTable for SAS
mt_sas_save_path = f'{bucket}/data/mt_sas_filtered.mt'
mt_sas_filtered.write(mt_sas_save_path, overwrite=True)
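
The five blocks above repeat the same steps with only the ancestry label changing, so they could be collapsed into one helper and a loop. The sketch below reuses exactly the calls from the cell above; `ANCESTRIES`, `filtered_mt_path`, and `write_ancestry_subset` are names introduced here for illustration. It assumes `mt` already carries the `ancestry_pred` column annotation, and I have not run it against the full cohort.

```python
ANCESTRIES = ("eur", "afr", "amr", "eas", "sas")


def filtered_mt_path(bucket, ancestry):
    """Build the output path used above for one ancestry group."""
    return f"{bucket}/data/mt_{ancestry}_filtered.mt"


def write_ancestry_subset(mt, ancestry, bucket, min_af=0.01):
    """Filter `mt` to one ancestry group and a frequency threshold, then write it."""
    import hail as hl  # imported lazily so the path helper works without Hail installed

    # Keep only samples predicted to belong to this ancestry group
    sub = mt.filter_cols(mt.ancestry_pred.ancestry_pred == ancestry)
    # Recompute call statistics within the group, then apply the frequency filter
    sub = sub.annotate_rows(info=hl.agg.call_stats(sub.GT, sub.alleles))
    sub = sub.filter_rows(hl.min(sub.info.AF) > min_af)
    # Drop the keys and keep only the fields that are written above
    sub = sub.key_rows_by().key_cols_by().select_entries("GT")
    sub = sub.select_cols("s", "ancestry_pred")
    sub = sub.select_rows("locus", "alleles", "info")

    path = filtered_mt_path(bucket, ancestry)
    sub.write(path, overwrite=True)
    return path
```

Usage would be `for ancestry in ANCESTRIES: write_ancestry_subset(mt, ancestry, bucket)`. Unlike the cell above, this does not leave per-ancestry variables such as `mt_eur_filtered` in the notebook namespace, so any later cell that relies on them would need to read the written tables back instead.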
