In [1]:
import pyspark
from delta import *

In [2]:
builder = pyspark.sql.SparkSession.builder.appName("MetadataDeltalakeETL") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

In [None]:
spark = configure_spark_with_delta_pip(builder).getOrCreate()

In [None]:
spark

In [None]:
# Page Break

# Load Metadata and Save as Delta Table

* NOTE:
    * The `metadata.csv` file is entirely made-up mock data for demonstration.
    * To keep things simple, it is prepared as a denormalized flat table.

In [4]:
meta_src = "./data/metadata.csv"

In [5]:
meta_df = spark.read.csv(meta_src, header=True)

In [6]:
meta_df.printSchema()

root
 |-- SequenceRunName: string (nullable = true)
 |-- SequenceRunID: string (nullable = true)
 |-- Timestamp: string (nullable = true)
 |-- SubjectID: string (nullable = true)
 |-- LibraryID: string (nullable = true)
 |-- SampleID: string (nullable = true)
 |-- SampleDescription: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Ethnicity: string (nullable = true)
 |-- Proband: string (nullable = true)
 |-- ProjectOwner: string (nullable = true)
 |-- ProjectName: string (nullable = true)
 |-- StudyID: string (nullable = true)
 |-- StudyType: string (nullable = true)
 |-- Pipeline: string (nullable = true)
 |-- Phenotype: string (nullable = true)
 |-- Collection: string (nullable = true)
 |-- DiseaseCode: string (nullable = true)
 |-- SNOMED: string (nullable = true)
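
Every column above comes back as `string` because `spark.read.csv` does not infer types unless asked. A minimal sketch of supplying an explicit schema as a DDL-formatted string follows. Typing only `Timestamp` as `DATE` is an assumption made for illustration, and `read_metadata` is a hypothetical helper, not part of this notebook.

```python
# Column names copied from the printSchema() output above.
columns = [
    "SequenceRunName", "SequenceRunID", "Timestamp", "SubjectID", "LibraryID",
    "SampleID", "SampleDescription", "Gender", "Race", "Ethnicity", "Proband",
    "ProjectOwner", "ProjectName", "StudyID", "StudyType", "Pipeline",
    "Phenotype", "Collection", "DiseaseCode", "SNOMED",
]

# Assumption: only Timestamp needs a non-string type in this mock data.
typed_columns = {"Timestamp": "DATE"}

# Build a DDL-formatted schema string, e.g. "SequenceRunName STRING, ...".
schema_ddl = ", ".join(
    f"{name} {typed_columns.get(name, 'STRING')}" for name in columns
)


def read_metadata(spark, path):
    """Hypothetical helper: read the CSV with an explicit schema instead of all strings."""
    return spark.read.csv(path, header=True, schema=schema_ddl)
```

With a live session this would be called as `read_metadata(spark, meta_src)`.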



In [None]:
# Page Break

# Data Warehouse Models

> Consider: some ETL or complex transformation has been done here!! :)

* At this point, we can take a step back and think about leveraging Data Warehouse data models.
* Some examples include, but are not limited to:
    * Multi-dimensional data models such as Star schema or Snowflake schema
        * For example, treat our `somatic_table` as the Fact table and build the metadata as surrounding dimension table(s)
    * Data Vault modelling, i.e. arranging data into Hubs, Satellites and Links (depending on the data stage, i.e. Bronze, Silver, Gold, Platinum)
    * Or a simple flat table, a Data Mart, or even a simplified relational model if that suits the use case
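
The dimension-table idea can be sketched by de-duplicating groups of metadata columns and giving each distinct combination a surrogate key. The sketch below uses plain Python on two hypothetical rows so that it runs without Spark. The column groupings, the key names (`sample_key`, `library_key`), the `build_dimension` helper and the second row's `LibraryID` are all assumptions for illustration. In PySpark the same step would typically start from a `select(...).distinct()` on the dimension columns.

```python
# Two hypothetical flat rows; only a few of the metadata columns are shown.
flat_rows = [
    {"SubjectID": "SBJ00001", "SampleID": "NA12878", "LibraryID": "L0000001"},
    {"SubjectID": "SBJ00001", "SampleID": "NA12878", "LibraryID": "L0000002"},
]


def build_dimension(rows, columns, key_name):
    """Return de-duplicated dimension rows, each tagged with a surrogate key."""
    seen = {}  # tuple of dimension values -> surrogate key
    for row in rows:
        value = tuple(row[c] for c in columns)
        # Assign the next key only the first time this combination appears.
        seen.setdefault(value, len(seen) + 1)
    return [
        {key_name: key, **dict(zip(columns, value))}
        for value, key in seen.items()
    ]


sample_dim = build_dimension(flat_rows, ("SubjectID", "SampleID"), "sample_key")
library_dim = build_dimension(flat_rows, ("LibraryID",), "library_key")
```

A Fact table would then carry `sample_key` and `library_key` instead of repeating the descriptive columns.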

In [None]:
# Page Break

## QA TRANSFORMED DATAFRAME

In [7]:
meta_df.select("SampleID").distinct().show()

+--------+
|SampleID|
+--------+
| NA24385|
| NA12878|
+--------+



In [8]:
meta_df \
    .select("SequenceRunName", "Timestamp", "SubjectID", "LibraryID", "SampleID", "StudyID", "StudyType", "Pipeline", "DiseaseCode", "SNOMED") \
    .where("SampleID = 'NA12878'") \
    .show()

+--------------------+----------+---------+---------+--------+-------+---------+--------+-------------+---------+
|     SequenceRunName| Timestamp|SubjectID|LibraryID|SampleID|StudyID|StudyType|Pipeline|  DiseaseCode|   SNOMED|
+--------------------+----------+---------+---------+--------+-------+---------+--------+-------------+---------+
|221007_A00130_000...|2022-10-07| SBJ00001| L0000001| NA12878|NA12878|      WGS| Somatic|MONDO:0007254|429740004|
+--------------------+----------+---------+---------+--------+-------+---------+--------+-------------+---------+



In [9]:
meta_df \
    .dropna() \
    .cube("StudyID") \
    .count() \
    .show()

+-------+-----+
|StudyID|count|
+-------+-----+
|NA12878|    2|
|   null|    2|
+-------+-----+
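
The `null` row in the output above is not a missing `StudyID`: `dropna()` has already removed rows containing nulls, and `cube` adds a grand-total row across all groups. A minimal plain-Python sketch of what `cube("StudyID").count()` computes for a single column follows. The input values are hypothetical and `cube_count_single` is an illustrative helper, not a Spark API.

```python
from collections import Counter

# Hypothetical StudyID values remaining after dropna().
study_ids = ["NA12878", "NA12878"]


def cube_count_single(values):
    """Emulate cube(col).count() for one column: per-value counts plus a grand total."""
    counts = dict(Counter(values))
    counts[None] = len(values)  # grand-total row, which Spark displays as null
    return counts


result = cube_count_single(study_ids)
```

With more than one cube column, Spark also emits subtotals for every combination of the columns, which this single-column sketch does not cover.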



In [None]:
# Page Break

# Write to Delta Table

In [10]:
metadata_table = "./lakehouse/bcbio/metadata_table"

In [11]:
meta_df.write.format("delta").mode("overwrite").save(metadata_table)

In [12]:
!ls ./lakehouse/bcbio/metadata_table

_delta_log
part-00000-53abd47d-e378-4144-bd28-ba6c00b531a7-c000.snappy.parquet
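
The `_delta_log` directory holds the table's transaction log as numbered JSON commit files, with one JSON action per line. A minimal sketch of listing the data files recorded by `add` actions, using only the standard library, follows. `added_files` is a hypothetical helper. It assumes no checkpoint files need to be read, and it does not subtract `remove` actions, so after repeated overwrites it is not the table's current file list.

```python
import json
from pathlib import Path


def added_files(table_path):
    """Hypothetical helper: collect the paths recorded by 'add' actions in the JSON commits."""
    paths = []
    log_dir = Path(table_path) / "_delta_log"
    # Commit file names are zero-padded versions, so sorting by name sorts by version.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                paths.append(action["add"]["path"])
    return paths
```

In this notebook it would be called as `added_files(metadata_table)`. To read the table itself rather than its log, `spark.read.format("delta").load(metadata_table)` is the usual route.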


In [None]:
# Page Break

# Summary

* In a real-world scenario, this could involve a much more complex structure and/or integration with upstream systems.
    * For example, interfacing with systems such as REDCap, FHIR or Pathling, an ontology service, or some kind of clinical Phenotype lookup server
    * It may also be possible to look these up dynamically from such systems
* Consider: we need to ingest some minimal metadata, and need to ETL this minimal Phenotype information out of these systems
* We may or may not need to ingest this metadata into the data warehouse; it depends on the case.

In [None]:
# Page Break

# Stop Spark Session

In [13]:
spark.stop()

In [None]:
# Continue to next notebook