In [0]:
import pyspark
from delta import *

In [0]:
# Enable Delta Lake's SQL extension and catalog on the session builder
builder = pyspark.sql.SparkSession.builder.appName("MetadataDeltalakeETL") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

In [0]:
spark = configure_spark_with_delta_pip(builder).getOrCreate()

In [0]:
spark

In [0]:
# Page Break

# Cloud Native

* At this point, we could simply sync the whole `./lakehouse` directory to a Cloud object store.
* For example, on AWS we could perform these steps:
    * Upload our warehouse to S3: `aws s3 sync ./lakehouse s3://org-datalake-prod/v1/lakehouse`
    * Set up [AWS Glue to crawl Delta tables](https://aws.amazon.com/blogs/big-data/crawl-delta-lake-tables-using-aws-glue-crawlers/)
    * Then query with AWS Athena
* Similar Cloud services should be available on Azure and GCP.
* On other Cloud or private infrastructure, you could use PrestoDB, Trino, or Hive to front our Genomics data warehouse.
* For more advanced cases, we could use dedicated, high-throughput managed data warehouse services such as Redshift or Synapse.

> Key point: Our Genomic BigData warehouse is Cloud-native and can scale out by leveraging _state-of-the-art_ Cloud auto-scaling services.
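
As a minimal sketch of how the same notebook code could point at either the local directory or the synced bucket, the helper below builds a table location from a base URI. The function name `table_uri` is hypothetical, and the `s3a://` scheme is an assumption about how Spark would typically be configured to reach S3 (through the Hadoop S3A connector). Only the `bcbio/<table>` layout and the bucket path come from this notebook.

```python
def table_uri(base: str, pipeline: str, table: str) -> str:
    """Join a lakehouse base location with a pipeline and table name.

    Works for both local paths and object-store URIs because it only
    manipulates the string and never touches the filesystem.
    """
    return "/".join([base.rstrip("/"), pipeline.strip("/"), table.strip("/")])


# Local dev layout used in this notebook
LOCAL_BASE = "./lakehouse"
# Assumed S3A form of the bucket used in the `aws s3 sync` example above
CLOUD_BASE = "s3a://org-datalake-prod/v1/lakehouse"

print(table_uri(LOCAL_BASE, "bcbio", "germline_table"))
print(table_uri(CLOUD_BASE, "bcbio", "germline_table"))
```

Switching the base location is then the only change needed to move the load step from local dev to the object store, provided the Spark session has the relevant credentials and connector available.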

In [0]:
# Page Break

# Load Lakehouse Tables

* For now, in our local dev, we mimic this Cloud-native effect by loading each Delta table into a separate Spark DataFrame.
* Then we register each DataFrame as an in-memory temporary view and use Spark SQL to mimic a SQL-like query experience.
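
The two steps above can be sketched as a single loop. `register_delta_tables` is a hypothetical helper, not part of Spark or Delta Lake. It assumes `spark` is an active `SparkSession` with Delta Lake configured and that every table lives under one base directory.

```python
def register_delta_tables(spark, base, table_names):
    """Load each Delta table under `base` and expose it as a temp view.

    Returns a dict of name -> DataFrame so callers can still use the
    DataFrame API alongside Spark SQL.
    """
    frames = {}
    for name in table_names:
        df = spark.read.format("delta").load(f"{base}/{name}")
        # Temp views are session-scoped: they disappear on spark.stop()
        df.createOrReplaceTempView(name)
        frames[name] = df
    return frames


# Usage (requires a running, Delta-enabled SparkSession):
# frames = register_delta_tables(
#     spark, "./lakehouse/bcbio",
#     ["germline_table", "somatic_table", "metadata_table"],
# )
```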

In [0]:
# Page Break

In [0]:
germline_src = "./lakehouse/bcbio/germline_table"
somatic_src = "./lakehouse/bcbio/somatic_table"
metadata_src = "./lakehouse/bcbio/metadata_table"

In [0]:
germline_df = spark.read.format("delta").load(germline_src)
somatic_df = spark.read.format("delta").load(somatic_src)
metadata_df = spark.read.format("delta").load(metadata_src)

In [0]:
germline_df.createOrReplaceTempView("germline_table")
somatic_df.createOrReplaceTempView("somatic_table")
metadata_df.createOrReplaceTempView("metadata_table")

In [0]:
# Page Break

## Describe Table Schema

In [0]:
spark.sql("describe germline_table").show(1000, truncate=False)

In [0]:
spark.sql("describe somatic_table").show(1000, truncate=False)

In [0]:
spark.sql("describe metadata_table").show(1000, truncate=False)

In [0]:
# Page Break

# Querying Tables

In [0]:
spark.sql("""
    select
        m.SequenceRunName, m.SubjectID, m.Gender, m.Phenotype, m.StudyID, m.DiseaseCode, m.SNOMED, m.SampleID,
        s.contigName as CHROM, s.referenceAllele as REF, s.alternateAlleles as ALT, array_size(s.alternateAlleles) as ALT_cnt
    from metadata_table as m
    join somatic_table as s on s.genotypes_sampleId = m.SampleID
""").show()

In [0]:
spark.sql("""
    select
        m.SubjectID, m.Phenotype, m.StudyID, m.DiseaseCode, m.SNOMED, m.SampleID,
        s.genotypes_sampleId as GT_SampleID, s.referenceAllele as REF, s.alternateAlleles as ALT, count(*) as num_of_snps
    from metadata_table as m
    join somatic_table as s on s.genotypes_sampleId = m.SampleID
    where
        char_length(referenceAllele) = 1 and
        array_size(alternateAlleles) = 1 and
        char_length(alternateAlleles[0]) = 1
    group by m.SubjectID, m.Phenotype, m.StudyID, m.DiseaseCode, m.SNOMED, m.SampleID, s.genotypes_sampleId, s.referenceAllele, s.alternateAlleles
    order by num_of_snps desc
""").show()
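
The `where` clause in the query above keeps only single-nucleotide substitutions: a one-base reference allele, exactly one alternate allele, and that alternate also one base long. Below is the same predicate as a plain-Python sketch, useful for checking the logic on a few records. `is_simple_snp` is a hypothetical name and the sample alleles are made up for illustration.

```python
def is_simple_snp(reference_allele, alternate_alleles):
    """Mirror the SQL filter: 1-base REF, exactly one ALT, 1-base ALT."""
    return (
        len(reference_allele) == 1
        and len(alternate_alleles) == 1
        and len(alternate_alleles[0]) == 1
    )


# Illustrative records: (REF, list of ALT alleles)
records = [
    ("A", ["G"]),        # single-base substitution, kept
    ("A", ["G", "T"]),   # multi-allelic site, excluded
    ("AT", ["A"]),       # deletion, excluded
    ("C", ["CTT"]),      # insertion, excluded
]
print([is_simple_snp(ref, alts) for ref, alts in records])
```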

In [0]:
# Page Break

# Summary

* In a real-world scenario, this particular join may or may not make sense; it simply demonstrates what is possible.
* Often, warehouse tables span multiple sources; distributed query engines such as AWS Athena and Trino provide a "Federated Query" interface across them.
    * For example, such a Genomic data warehouse could be the backend of [GA4GH Beacon](https://github.com/ga4gh-beacon/beacon-v2/) and/or [GA4GH Data Connect](https://github.com/ga4gh-discovery/data-connect) REST endpoint interfaces.
* It opens the data up to more general information retrieval tooling, e.g. SQL.
* Such warehouse tables enable OLAP (Online Analytical Processing) and bridge further into data science such as ML/AI.
* Depending on how the data is arranged, we may treat a Genomic Table as the central Fact table, with the surrounding Metadata table(s) serving as dimensional look-ups.
* Or we may focus on the Genomic Tables alone for number crunching or aggregation for reporting.
* It brings data one step closer to the Cloud for computation, i.e. data and compute closer together. Hence, "Cloud-native".
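
To make the fact/dimension idea concrete, here is a minimal sketch that uses Python's built-in `sqlite3` in place of Spark so that it runs without a cluster. The rows are invented toy data; only the column names `SampleID`, `Phenotype`, `genotypes_sampleId`, and `contigName` are borrowed from the tables queried earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: one row per sample (toy data)
    create table metadata_table (SampleID text primary key, Phenotype text);
    insert into metadata_table values ('S1', 'tumor'), ('S2', 'normal');

    -- Fact: one row per variant call (toy data)
    create table somatic_table (genotypes_sampleId text, contigName text);
    insert into somatic_table values
        ('S1', 'chr1'), ('S1', 'chr1'), ('S1', 'chr2'), ('S2', 'chr1');
""")

# Aggregate the fact table, using the dimension only as a look-up
rows = conn.execute("""
    select m.Phenotype, s.contigName, count(*) as num_variants
    from somatic_table as s
    join metadata_table as m on m.SampleID = s.genotypes_sampleId
    group by m.Phenotype, s.contigName
    order by m.Phenotype, s.contigName
""").fetchall()

for row in rows:
    print(row)
```

The aggregation runs over the large fact table, while the small metadata table only contributes labels; the same shape of query would apply to the Spark temp views in this notebook.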

Note that:

* Certainly, an experienced, trained BioInformatician can leverage more efficient tools in a purpose-built ad-hoc setup.
* Sourcing a Genomic Table, and hence a data warehouse, directly from VCF could easily get out of hand.
* Typically, a "best practice" BioInformatics Pipeline should produce a finer, distilled, end-of-chain, "gold"-label VCF product.
* This VCF should contain a handful of records per patient that have already been analysed and annotated well for the given use case.
* If sourcing the Genomic Table from VCF is not appropriate, one could instead use more summarised table output from MultiQC or some Cancer Reporter, as one sees fit.

Hence, this kind of Genomic BigData warehouse still relies on high-quality output from the BioInformatics Pipeline, and on the continuation of the overall Data Pipeline as depicted below.

![GenomicBigData.png](./assets/GenomicBigData.png)

In [0]:
# Page Break

# Future Work

* This has yet to be evaluated on real-world data workloads (case by case) and on a decent-sized, cohort-wide data warehouse.
* Explore data partitioning within Delta tables or the chosen LakeHouse table framework.
* Benchmark the performance and study the feasibility of the various technologies that underpin the setup.
* More specialised Cloud-native BioInfo formats: `BioParquet` _à la_ [GeoParquet](https://github.com/opengeospatial/geoparquet)?
* Comparison with other LakeHouse table frameworks: Iceberg, Hudi
    * Similarly, does this entail something more specific to Genomics, such as `BioTable` or `GenomicTable`?
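
As a starting point for the partitioning item above, a partitioned write could look like the sketch below. `write_partitioned` is a hypothetical helper. It assumes `df` is a Spark DataFrame and that `contigName` (the chromosome column) is a sensible partition key, which is exactly what the benchmarking would need to confirm.

```python
def write_partitioned(df, path, partition_cols=("contigName",)):
    """Write `df` as a Delta table partitioned by the given columns.

    Assumption: `contigName` is only a candidate partition key; the
    best choice depends on the query patterns and data volume.
    """
    (
        df.write.format("delta")
        .mode("overwrite")
        .partitionBy(*partition_cols)
        .save(path)
    )


# Usage (requires a running, Delta-enabled SparkSession):
# write_partitioned(somatic_df, "./lakehouse/bcbio/somatic_table_by_contig")
```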

In [0]:
# Page Break

# Stop Spark Session

In [0]:
spark.stop()

In [0]:
# Continue to next notebook