# Import pVCF Genomic Data with Hail

This notebook shows how to import genomic data from pVCFs into a Hail MatrixTable and save it to an Apollo database (dnax://) on the DNAnexus platform. See documentation for guidance on launch specs for the JupyterLab with Spark Cluster app for different data sizes: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Pre-conditions for running this notebook successfully:

   * pVCF(s) are uploaded to the project



## 1) Initiate Spark and Hail

In [None]:
# Running this cell will output a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.

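from pprint import pprint

import bokeh.layouts
from bokeh.io import output_notebook, show
from pyspark.sql import SparkSession
import hail as hl

# Render the Bokeh plots used later in the notebook inline
output_notebook()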

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

## 2) Locate and import data into a Hail MatrixTable

All data uploaded to the project before running the JupyterLab app is mounted (https://documentation.dnanexus.com/user/jupyter-notebooks?#accessing-data) and can be accessed in `/mnt/project/<path_to_data>`. The file URL follows the format: `file:///mnt/project/<path_to_data>`
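As a minimal sketch of the URL format described above, the helper below turns a path inside the project into a `file://` URL. The function name `project_file_url` and the example path are hypothetical and only serve as an illustration.

```python
from pathlib import PurePosixPath

# Mount point of the project inside the JupyterLab environment (from the text above)
MOUNT_ROOT = PurePosixPath("/mnt/project")


def project_file_url(path_in_project: str) -> str:
    """Return the file:// URL for a path given relative to the project root."""
    # Strip leading slashes so the path is always joined under the mount root
    local_path = MOUNT_ROOT / path_in_project.lstrip("/")
    return f"file://{local_path}"


# Hypothetical example path; replace it with the location of your own pVCF
example_url = project_file_url("/my_folder/my_cohort.vcf.gz")
print(example_url)
```

Stripping the leading slash before joining avoids accidentally producing a doubled slash after `/mnt/project`.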


Hail's `import_vcf` is used to import VCF-formatted data.

The first thing we do is import the VCF file with `import_vcf`. The resulting MatrixTable is later converted into Hail's native file format with the `write` method (see the final section of this notebook). The native format is **much faster** to process because it is scalable and easily parallelizable.

In [None]:
# Define variables used in import

file_url = "file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c22_b0_v1.vcf.gz" # a glob pattern (or a list of paths) can be used if genomic data is in multiple pVCFs

In [None]:
# Import genomic data into a MatrixTable (MT). Supported input formats include VCF, (B)GEN, PLINK, and TSV.

mt = hl.import_vcf(file_url, 
                   force_bgz=True, 
                   reference_genome="GRCh38", 
                   array_elements_required=False)

In [None]:
# View basic properties of MT

print(f"Num partitions: {mt.n_partitions()}")
mt.describe()

## 3) Basic operations

If this is your first time using Hail, we highly recommend exploring your `MatrixTable` with methods such as `show`, `summarize`, or `count`. One of our personal interactive favourites is:

In [None]:
mt.describe(widget=True)

In [None]:
### Look at the first 5 variants
mt.rows().select().show(5)

In [None]:
### Look at the first 5 samples
mt.s.show(5)

In [None]:
### Look at the locus
mt.locus.show()

In [None]:
### Look at the genotypes
mt.GT.show()

In [None]:
### Look at the first genotype calls 
mt.entry.take(5)

#mt.entry.show(5)

`summarize` prints (potentially) useful information about any field or object:

In [None]:
mt.DP.summarize()

In [None]:
mt.AD.summarize()

`MatrixTable.count` returns a tuple with the number of rows (variants) and number of columns (samples).

In [None]:
mt.count()

## Annotate MatrixTable with sample and phenotype annotations

Column fields are where you would annotate sample-level information such as phenotypes, ancestry, sex, and covariates.

Row fields can be used to store variant-level information such as gene membership and functional impact for use in QC or analysis.



In Hail, annotate methods refer to adding new fields.

   * MatrixTable's `annotate_cols` adds new column (sample) fields.
   * MatrixTable's `annotate_rows` adds new row (variant) fields.
   * MatrixTable's `annotate_entries` adds new entry (genotype) fields.
   * Table's `annotate` adds new row fields.


In [None]:
## To import a table with phenotypes, sex, etc
table = (hl.import_table('data/1kg_annotations.txt', impute=True)
         .key_by('Sample'))

In [None]:
## To see the structure of the table
table.describe()

In [None]:
## To see the contents of the table
table.show(width=100)

In [None]:
## Join phenotype table with matrix table
mt = mt.annotate_cols(pheno = table[mt.s])

In [None]:
mt.col.describe()

## 4) Gathering some statistics 

In [None]:
## counter counts the occurrences of each distinct value of a field
pprint(table.aggregate(hl.agg.counter(table.SuperPopulation)))

In [None]:
## stats computes summary statistics (e.g. mean, stdev, min, max) for a numeric field
pprint(table.aggregate(hl.agg.stats(table.CaffeineConsumption)))

In [None]:
## To get the count only in our cohort of interest
mt.aggregate_cols(hl.agg.counter(mt.pheno.SuperPopulation))

In [None]:
## stats only in our dataset
pprint(mt.aggregate_cols(hl.agg.stats(mt.pheno.CaffeineConsumption)))

## 5) Get histograms for DP

In [None]:
p = hl.plot.histogram(mt.DP, range=(0,30), bins=30, title='DP Histogram', legend='DP')
show(p)

## 6) Quality control

### Count before splitting multi-allelics.

In [None]:
## count operations are computationally very expensive 
n_variants, n_samples = mt.count()

print(f'n samples: {n_samples}')
print(f'n variants: {n_variants}')

In [None]:
hl.summarize_variants(mt)

In [None]:
## To split multi-allelics
mt = hl.split_multi_hts(mt)

In [None]:
## Get the new numbers after splitting
hl.summarize_variants(mt)

In [None]:
mt.col.describe()

### Sample QC

Hail's `hl.sample_qc` function computes a set of useful per-sample statistics from sequencing data. It adds a new column field, `sample_qc`, containing the computed statistics.

In [None]:
## Hail has the sample_qc function which produces some useful metrics and stores them in a column 
mt = hl.sample_qc(mt)

In [None]:
mt.col.describe()

In [None]:
# Plotting the QC metrics is a good place to start: the call rate distribution and its outliers
p = hl.plot.histogram(mt.sample_qc.call_rate, range=(.88,1), legend='Call Rate')
show(p)

In [None]:
## Plot sample genotype quality and outliers
p = hl.plot.histogram(mt.sample_qc.gq_stats.mean, range=(10,70), legend='Mean Sample GQ')
show(p) 

In [None]:
## Relationship between mean DP and call rate
p = hl.plot.scatter(x=mt.sample_qc.dp_stats.mean,
                    y=mt.sample_qc.call_rate,
                    xlabel='Mean DP',
                    ylabel='Call Rate',
                    hover_fields={'ID': mt.s},
                    size=8)
show(p)

In [None]:
mt.count()

In [None]:
## Applying a call rate filter of 97%
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.97)

In [None]:
p = hl.plot.scatter(x=mt.sample_qc.dp_stats.mean,
                    y=mt.sample_qc.call_rate,
                    xlabel='Mean DP',
                    ylabel='Call Rate',
                    hover_fields={'ID': mt.s},
                    size=8)
show(p)

In [None]:
mt.describe(widget=True)

In [None]:
# Number of variants and samples remaining after the call rate filter
mt.count()

#### Sex imputation

We suggest inferring sex using the Hail function `impute_sex`. This function should be run on common biallelic SNPs (AF > 0.05) with a high call rate (call rate > 0.97). 
Suggested thresholds for this function are listed below. We also recommend plotting the data to check that it falls within reasonable limits of the thresholds set:

   * aaf_threshold: 0.05
   * female_threshold: 0.5
   * male_threshold: 0.75
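To make the female and male thresholds concrete, here is a plain-Python sketch of how an inbreeding coefficient (F-statistic) maps to a sex call under those two cut-offs. It illustrates the thresholding rule only and is not Hail's implementation; the function name `classify_sex` and the example values are hypothetical.

```python
from typing import Optional

FEMALE_THRESHOLD = 0.5   # samples with F below this are called female
MALE_THRESHOLD = 0.75    # samples with F above this are called male


def classify_sex(f_stat: Optional[float]) -> Optional[str]:
    """Return 'female', 'male', or None when the call is missing or ambiguous."""
    if f_stat is None:
        return None
    if f_stat < FEMALE_THRESHOLD:
        return "female"
    if f_stat > MALE_THRESHOLD:
        return "male"
    # F lies between the two thresholds: sex is ambiguous
    return None


# Hypothetical F-statistics for three samples
calls = {f: classify_sex(f) for f in (0.02, 0.6, 0.98)}
print(calls)
```

Samples that fall between the two thresholds receive no call, which is why plotting the F-statistic distribution before fixing the thresholds is worthwhile.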

In [None]:
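## Filter for high quality calls for sex QC: biallelic SNPs with call rate > 0.97.
## Note: the allele frequency pre-filter used here is 0.001; the stricter cut-off (0.05)
## is applied by impute_sex itself through aaf_threshold.
mt = mt.filter_rows((hl.len(mt.alleles) == 2) & hl.is_snp(mt.alleles[0], mt.alleles[1]) &
                            (hl.agg.mean(mt.GT.n_alt_alleles()) / 2 > 0.001) &
                            (hl.agg.fraction(hl.is_defined(mt.GT)) > 0.97))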

In [None]:
mt.count()

In [None]:
# Impute sex with the thresholds defined above and store the result in a Hail Table
imputed_sex = hl.impute_sex(mt.GT, aaf_threshold=0.05, female_threshold=0.5, male_threshold=0.75)

In [None]:
imputed_sex.show()

In [None]:
## Annotate matrix table with imputed sex
mt = mt.annotate_cols(impute_sex = imputed_sex[mt.s])

#### Additional filters

We recommend removing samples with any of the following:
* Mean coverage < 20.0 
* Ambiguous sex 
* Aneuploidy 
* Call rate < 0.97 (97%)

In [None]:
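## Annotate sample-level flags derived from the imputed-sex F-statistic.
## The cut-offs are dataset-specific; check them against your own F-statistic distribution.
mt = mt.annotate_cols(
    aneuploid=((mt.impute_sex.f_stat >= 0.5)
               | hl.is_missing(mt.impute_sex.f_stat)
               | ((mt.impute_sex.f_stat >= 0.4) & (mt.impute_sex.f_stat <= 0.6))),
    sex_aneuploidy=(mt.impute_sex.f_stat < 0.4))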

In [None]:
mt.count()

In [None]:
mt = mt.filter_cols( (mt.sample_qc.call_rate >= 0.97) &
                    (mt.sample_qc.dp_stats.mean > 20) & (hl.is_defined(mt.aneuploid))  )

In [None]:
mt.count()

In [None]:
## Filtering based on mean DP and call rate
n_samples_before = mt.count_cols()
mt = mt.filter_cols((mt.sample_qc.dp_stats.mean >= 4) & (mt.sample_qc.call_rate >= 0.97))
print('After filter, %d/%d samples remain.' % (mt.count_cols(), n_samples_before))

#### Relatedness filter

Samples can be filtered to remove one of each pair of related samples by combining Hail's `pc_relate` (model-free relatedness estimation) with `maximal_independent_set`. We suggest filtering for samples with second-degree relatedness or closer, where one of each pair of samples with a kinship coefficient > 0.088 can be removed.

Run PC-Relate and compute pairs of closely related individuals. Note that the kinship cut-off used to select related pairs is the recommended 0.088.
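The idea of removing one sample from each related pair can be sketched in plain Python. The greedy strategy below (repeatedly drop the sample with the most remaining related partners) is a simplified illustration and is not claimed to be the algorithm behind Hail's `maximal_independent_set`; the function name `samples_to_remove` and the example pairs are hypothetical.

```python
from collections import defaultdict

KINSHIP_CUTOFF = 0.088  # second-degree relatedness or closer


def samples_to_remove(pairs):
    """pairs: iterable of (sample_i, sample_j, kinship) tuples."""
    neighbours = defaultdict(set)
    for i, j, kin in pairs:
        if kin > KINSHIP_CUTOFF:
            neighbours[i].add(j)
            neighbours[j].add(i)

    removed = set()
    # Keep going while at least one related pair is still intact
    while any(neighbours.values()):
        # Sample with the most related partners; ties resolved alphabetically
        worst = max(sorted(neighbours), key=lambda s: len(neighbours[s]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.add(worst)
    return removed


# Hypothetical kinship estimates: B is related to both A and C, C-D is below the cut-off
example_pairs = [("A", "B", 0.25), ("B", "C", 0.12), ("C", "D", 0.05)]
print(samples_to_remove(example_pairs))
```

In this example removing the single sample B breaks both related pairs, so A, C, and D can all be kept.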


In [None]:
pca_eigenvalues, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, k=10, compute_loadings=False)

In [None]:
# min_kinship is set to the recommended 0.088 so that no pair above the cut-off is dropped early
relatedness_ht = hl.pc_relate(mt.GT, min_individual_maf=0.01, scores_expr=pca_scores[mt.col_key].scores,
                                      block_size=4096, min_kinship=0.088, statistics='all')

In [None]:
pairs = relatedness_ht.filter(relatedness_ht['kin'] > 0.088)

In [None]:
related_samples_to_remove = hl.maximal_independent_set(pairs.i, pairs.j, False)

In [None]:
mt.count()

In [None]:
mt = mt.filter_cols(hl.is_defined(related_samples_to_remove[mt.col_key]), keep=False)

In [None]:
mt.count()

#### Population ancestry inference

Principal component analysis (PCA) is a very general statistical method for reducing high-dimensional data to a small number of dimensions that capture most of the variation in the data. Hail provides the function `hl.pca` for performing generic PCA.

PCA typically works best on normalized data (e.g. mean centered). Hail provides the specialized function `hwe_normalized_pca`, which first normalizes the genotypes according to the Hardy-Weinberg Equilibrium model.

In [None]:
pca_eigenvalues, pca_scores, pca_loadings = hl.hwe_normalized_pca(mt.GT, compute_loadings=True)

In [None]:
mt = mt.annotate_cols(pca = pca_scores[mt.s])

In [None]:
ht = pca_scores.select(PC1=pca_scores.scores[0],
                       PC2=pca_scores.scores[1],
                       PC3=pca_scores.scores[2],
                       PC4=pca_scores.scores[3])
# Join the phenotype table imported earlier (keyed by sample ID)
ht = ht.annotate(pheno = table[ht.s])


Visualize!

Let's plot several combinations of the first four principal components (PCs) against each other. This will help us visualize the population structure of the dataset and allow us to try to match our samples to different population ancestry clusters. Note that since the plots generated by the `hl.plot` module use the bokeh plotting library internally, we can use bokeh functions like `gridplot` to arrange plots.


In [None]:
# The label field matches the SuperPopulation column of the phenotype table
p1 = hl.plot.scatter(ht.PC1, ht.PC2, xlabel='PC1', ylabel='PC2', label=ht.pheno.SuperPopulation, size=6)
p2 = hl.plot.scatter(ht.PC1, ht.PC3, xlabel='PC1', ylabel='PC3', label=ht.pheno.SuperPopulation, size=6)
p3 = hl.plot.scatter(ht.PC2, ht.PC4, xlabel='PC2', ylabel='PC4', label=ht.pheno.SuperPopulation, size=6)


show(bokeh.layouts.gridplot([[p1], [p2], [p3]]))

Based on your visualization, you can then choose to cluster your samples by inferred ancestry using a code structure like the following:

In [None]:
# The PC cut-offs are placeholders; choose them from your own PCA plots
ht = ht.annotate(
    unmasked = hl.case()
        .when((ht.PC2 > 0.2) & (ht.PC1 < 0), 'EAS')
#         .when(..., 'AFR')
#         .when(..., 'AMR')
#         .when(..., 'EUR')
#         .when(..., 'SAS')
        .default(ht.pheno.SuperPopulation)
)


#### Outlier detection

Using the metrics computed by Hail's `sample_qc` method, we suggest removing samples that deviate from the median by more than a chosen multiple of the median absolute deviation (MAD); the median and MAD are the non-parametric equivalents of the mean and standard deviation. It is also important to note that the outlier detection metrics below need to be stratified by population ancestry (as determined in the population ancestry inference step above) and by sequencing platform:

`n_snp:` Number of SNP alternate alleles

`r_ti_tv:` Transition/transversion ratio

`r_insertion_deletion:` Insertion/Deletion allele ratio

`n_insertion:` Number of insertion alternate alleles

`n_deletion:` Number of deletion alternate alleles

`r_het_hom_var:` Heterozygous/homozygous-variant call ratio

Using the median and the median absolute deviation (MAD), we can define the bounds outside which samples are treated as outliers.

The following code blocks:

    1. outline what can be done separately for each population ancestry and sequencing platform.

    2. use the n_snp metric as an example; the same steps need to be repeated (replacing the metric in the code below) for r_ti_tv, r_insertion_deletion, n_insertion, n_deletion, and r_het_hom_var.
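The median/MAD rule can be checked on a small list of numbers before applying it to a MatrixTable. The sketch below is plain Python and uses the same constants as the Hail cells that follow (a 1.4826 scaling factor and bounds at 4 MADs); the function name `mad_bounds` and the example values are hypothetical.

```python
from statistics import median


def mad_bounds(values, n_mads=4.0):
    """Return (lower, upper) outlier bounds based on the median and scaled MAD."""
    med = median(values)
    # 1.4826 scales the MAD so it is comparable to a standard deviation
    # for normally distributed data
    mad = 1.4826 * median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad


# Hypothetical per-sample values of one metric within a single ancestry/platform stratum
metric_values = [100, 102, 98, 101, 99, 250]
lower, upper = mad_bounds(metric_values)
outliers = [v for v in metric_values if v < lower or v > upper]
print(lower, upper, outliers)
```

A sample is kept only when its value lies between the lower and upper bounds, so both conditions must hold at the same time.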



In [None]:
metric_values = hl.agg.collect(mt.sample_qc.n_snp)
metric_median = hl.median(metric_values)
metric_mad = 1.4826 * hl.median(hl.abs(metric_values - metric_median))
outlier_metric=hl.struct( median=metric_median,
            mad=metric_mad,
            upper=metric_median + 4 * metric_mad,
            lower=metric_median - 4 * metric_mad)


mt = mt.annotate_globals(metrics_stats=mt.aggregate_cols(outlier_metric))

In [None]:
mt.globals.metrics_stats.show()

Apply the filter for the selected metric. Remember that this step needs to be repeated for each:

    1. population

    2. sequencing platform

    3. metric (n_snp, r_ti_tv, r_insertion_deletion, n_insertion, n_deletion, and r_het_hom_var)



In [None]:
# Keep samples whose metric lies within BOTH bounds (logical AND)
mt = mt.filter_cols((mt.sample_qc.n_snp <= mt.metrics_stats.upper) &
                    (mt.sample_qc.n_snp >= mt.metrics_stats.lower))

In [None]:
mt.count()

## Genotype QC

High-quality genotypes can be retained by applying the following thresholds. We also recommend performing call rate filtering separately for cases and controls, because differential missingness is a typical source of false positives:

GQ >= 20

DP >= 10

AB >= 0.25 (for each allele in heterozygous calls)
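As a plain-Python illustration of these thresholds, the sketch below applies them to a single biallelic genotype. Allele balance (AB) is taken as AD[1] / sum(AD), and the homozygous bounds (AB <= 0.1 and AB >= 0.9) mirror the Hail filter cells in this section; the function names and the example values are hypothetical.

```python
from typing import Optional, Sequence

MIN_GQ = 20
MIN_DP = 10


def allele_balance(ad: Sequence[int]) -> Optional[float]:
    """Fraction of reads supporting the alternate allele (AD[1] / sum(AD))."""
    total = sum(ad)
    return ad[1] / total if total > 0 else None


def passes_genotype_qc(n_alt: int, gq: int, dp: int, ad: Sequence[int]) -> bool:
    """n_alt is the number of alternate alleles in the call (0, 1, or 2)."""
    ab = allele_balance(ad)
    if ab is None or gq < MIN_GQ or dp < MIN_DP:
        return False
    if n_alt == 0:                 # homozygous reference
        return ab <= 0.1
    if n_alt == 1:                 # heterozygous: each allele has AB >= 0.25
        return 0.25 <= ab <= 0.75
    return ab >= 0.9               # homozygous variant


# Hypothetical heterozygous call with balanced read support
print(passes_genotype_qc(n_alt=1, gq=30, dp=20, ad=[10, 10]))
```

For a heterozygous call, requiring AB >= 0.25 for each allele is equivalent to requiring the alternate-allele fraction to lie between 0.25 and 0.75.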

In [None]:
#create an allele balance annotation
mt= mt.annotate_entries(AB = (mt.AD[1] / hl.sum(mt.AD) ))

In [None]:
#set filter condition for AB
filter_condition_ab = ((mt.GT.is_hom_ref() & (mt.AB <= 0.1)) |
                        (mt.GT.is_het() & (mt.AB >= 0.25) & (mt.AB <= 0.75)) |
                        (mt.GT.is_hom_var() & (mt.AB >= 0.9)))
fraction_filtered = mt.aggregate_entries(hl.agg.fraction(~filter_condition_ab))
print(f'Filtering {fraction_filtered * 100:.2f}% entries out of downstream analysis.')

In [None]:
mt = mt.filter_entries( (mt.GQ>=20) &
                 (mt.DP >= 10) &
                 ((mt.GT.is_hom_ref() & (mt.AB <= 0.1)) |
                        (mt.GT.is_het() & (mt.AB >= 0.25) & (mt.AB <= 0.75)) |
                        (mt.GT.is_hom_var() & (mt.AB >= 0.9)))) 

## Variant QC

Upon completion of the sample QC described above, exomes should then be processed for variant QC, as described in this section. We recommend applying a PASS filter using the Variant Quality Score Recalibration (VQSR) metric.

Hail has the function `hl.variant_qc` to compute a list of useful statistics about variants from sequencing data.

In [None]:
## Use Hail's variant_qc function to compute per-variant statistics
mt = hl.variant_qc(mt)

In [None]:
show(hl.plot.cdf(mt.variant_qc.call_rate))

In [None]:
mt.describe(widget=True)

In [None]:
mt.row.describe()

In [None]:
# Flag variants whose FILTER set is empty (i.e. PASS) and count them
mt = mt.annotate_rows(pass_VQSR = hl.len(mt.filters) == 0)

In [None]:
mt.filter_rows(mt.pass_VQSR).count_rows()

In [None]:
mt.filters.show()

In [None]:
# Annotate variants with a flag indicating whether they failed VQSR. In this toy example, there is no information on VQSR, so everything is removed. Be wary of your data!

mt = mt.annotate_rows(fail_VQSR = hl.len(mt.filters) != 0)

In [None]:
fail_VQSR = mt.filter_rows(mt.fail_VQSR).count_rows()
print('n variants failing VQSR:')
pprint(fail_VQSR)

In [None]:
mt = mt.filter_rows(mt.fail_VQSR, keep=False)

In [None]:
## filter invariant rows
# hl.variant_qc stores its results in the row field `variant_qc`
mt = mt.filter_rows((mt.variant_qc.AF[0] > 0.0) & (mt.variant_qc.AF[0] < 1.0))

## 7) Store Hail MT in DNAnexus

In [None]:
# Define database and MT names

# Note: It is recommended to only use lowercase letters for the database name.
# If uppercase lettering is used, the database name will be lowercased when creating the database.
db_name = "database_name"
mt_name = "geno.mt"


In [None]:
# Create database in DNAX

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()

In [None]:
# Store MT in DNAX

import dxpy

# Find database ID of newly created database using dxpy method
db_uri = dxpy.find_one_data_object(name=f"{db_name}", classname="database")['id']
url = f"dnax://{db_uri}/{mt_name}" # Note: the dnax url must follow this format to properly save MT to DNAX

# Before this step, the Hail MatrixTable is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.write for additional documentation.
mt.write(url) # Note: output should describe size of MT (i.e. number of rows, columns, partitions) 