## Lab 6: Building a Data Lakehouse with the PySpark Structured Streaming Medallion Architecture
This lab introduces many of the software libraries and programming techniques needed for the final capstone project in **DS-2002: Data Systems**. The capstone asks students to demonstrate a practical, working understanding of each of the data systems and architectural principles covered throughout the session.

**These include:**
- Relational Database Management Systems (e.g., MySQL, Microsoft SQL Server, Oracle, IBM DB2)
  - Online Transaction Processing Systems (OLTP): *Optimized for High-Volume Write Operations; Normalized to 3rd Normal Form.*
  - Online Analytical Processing Systems (OLAP): *Optimized for Read/Aggregation Operations; Dimensional Model (i.e., Star Schema)*
- NoSQL *(Not Only SQL)* Systems (e.g., MongoDB, CosmosDB, Cassandra, HBase, Redis)
- File System *(Data Lake)* Source Systems (e.g., AWS S3, Microsoft Azure Data Lake Storage)
  - Various Datafile Formats (e.g., JSON, CSV, Parquet, Text, Binary)
- Massively Parallel Processing *(MPP)* Data Integration Systems (e.g., Apache Spark/PySpark, Databricks)
- Data Integration Patterns (e.g., Extract-Transform-Load, Extract-Load-Transform, Extract-Load-Transform-Load, Lambda & Kappa Architectures)

## Section I: Prerequisites

### 1.0. Import Required Libraries

In [7]:
import findspark
findspark.init()
print(findspark.find())

import os
import sys
import json
import time
import pymongo
import certifi
import shutil
import pandas as pd

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window as W

/opt/homebrew/Cellar/apache-spark/3.5.5/libexec


### 2.0. Instantiate Global Variables

In [62]:
# --------------------------------------------------------------------------------
# Specify MySQL Server Connection Information
# --------------------------------------------------------------------------------
mysql_args = {
    "host_name" : "localhost",
    "port" : "3306",
    "db_name" : "northwind_dw",
    "conn_props" : {
        "user" : "root",
        "password" : "Ashwaniis#1!",
        "driver" : "com.mysql.cj.jdbc.Driver"
    }
}

# --------------------------------------------------------------------------------
# Specify MongoDB Cluster Connection Information
# --------------------------------------------------------------------------------
mongodb_args = {
    "cluster_location" : "local", # "atlas"
    "user_name" : "vaneeshagupta10",
    "password" : "Fdztq26kWFlyBXiE",
    "cluster_name" : "cluster0",
    "cluster_subnet" : "koqso",
    "db_name" : "northwind",
    "collection" : "",
    "null_column_threshold" : 0.5
}

# --------------------------------------------------------------------------------
# Specify Directory Structure for Source Data
# --------------------------------------------------------------------------------
base_dir = os.path.join(os.getcwd(), 'lab_data')
data_dir = os.path.join(base_dir, 'northwind')
batch_dir = os.path.join(data_dir, 'batch')
stream_dir = os.path.join(data_dir, 'streaming')

orders_stream_dir = os.path.join(stream_dir, 'orders')
purchase_orders_stream_dir = os.path.join(stream_dir, 'purchase_orders')
inventory_trans_stream_dir = os.path.join(stream_dir, 'inventory_transactions')

# --------------------------------------------------------------------------------
# Create Directory Structure for Data Lakehouse Files
# --------------------------------------------------------------------------------
dest_database = "northwind_dlh"
sql_warehouse_dir = os.path.abspath('spark-warehouse')
dest_database_dir = f"{dest_database}.db"
database_dir = os.path.join(sql_warehouse_dir, dest_database_dir)

orders_output_bronze = os.path.join(database_dir, 'fact_orders', 'bronze')
orders_output_silver = os.path.join(database_dir, 'fact_orders', 'silver')
orders_output_gold = os.path.join(database_dir, 'fact_orders', 'gold')

purchase_orders_output_bronze = os.path.join(database_dir, 'fact_purchase_orders', 'bronze')
purchase_orders_output_silver = os.path.join(database_dir, 'fact_purchase_orders', 'silver')
purchase_orders_output_gold = os.path.join(database_dir, 'fact_purchase_orders', 'gold')

inventory_trans_output_bronze = os.path.join(database_dir, 'fact_inventory_transactions', 'bronze')
inventory_trans_output_silver = os.path.join(database_dir, 'fact_inventory_transactions', 'silver')
inventory_trans_output_gold = os.path.join(database_dir, 'fact_inventory_transactions', 'gold')
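
The nine output paths above follow one pattern: a fact table directory with one subdirectory per medallion layer. As a minimal sketch (the helper name `medallion_paths` is hypothetical and not part of the lab), the same layout could be generated in a loop, which may help when more fact tables are added:

```python
import os

LAYERS = ("bronze", "silver", "gold")

def medallion_paths(database_dir: str, table_name: str) -> dict:
    """Hypothetical helper: map each medallion layer to its output directory."""
    return {layer: os.path.join(database_dir, table_name, layer) for layer in LAYERS}

# Assumed example location; the lab derives database_dir from the Spark warehouse path.
example_database_dir = os.path.join("spark-warehouse", "northwind_dlh.db")
orders_paths = medallion_paths(example_database_dir, "fact_orders")

for layer, path in orders_paths.items():
    print(f"{layer}: {path}")
```

The lab keeps explicit variables instead, which is arguably easier to read in a teaching notebook.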

### 3.0. Define Global Functions

In [27]:
def get_file_info(path: str):
    '''Return a Pandas DataFrame with the name, size, and last modification time of each file in 'path'.'''
    file_sizes = []
    modification_times = []

    # Fetch each item in the directory, and filter out any directories.
    items = os.listdir(path)
    files = sorted([item for item in items if os.path.isfile(os.path.join(path, item))])

    # Populate lists with the Size and Last Modification DateTime for each file in the directory.
    for file in files:
        file_sizes.append(os.path.getsize(os.path.join(path, file)))
        modification_times.append(pd.to_datetime(os.path.getmtime(os.path.join(path, file)), unit='s'))

    data = list(zip(files, file_sizes, modification_times))
    column_names = ['name','size','modification_time']
    
    return pd.DataFrame(data=data, columns=column_names)


def wait_until_stream_is_ready(query, min_batches=1):
    while len(query.recentProgress) < min_batches:
        time.sleep(5)
        
    print(f"The stream has processed {len(query.recentProgress)} batches")


def remove_directory_tree(path: str):
    '''If it exists, remove the entire contents of a directory structure at a given 'path' parameter's location.'''
    try:
        if os.path.exists(path):
            shutil.rmtree(path)
            return f"Directory '{path}' has been removed successfully."
        else:
            return f"Directory '{path}' does not exist."
            
    except Exception as e:
        return f"An error occurred: {e}"
        

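def drop_null_columns(df, threshold):
    '''Drop Columns whose fraction of NULL values exceeds the given 'threshold' parameter value (e.g., 0.5).'''
    # Count the rows once, rather than once per column; an empty DataFrame has nothing to drop.
    total_rows = df.count()
    if total_rows == 0:
        return df

    # Use 'c' as the loop variable to avoid shadowing pyspark.sql.functions.col.
    columns_with_nulls = [c for c in df.columns if df.filter(df[c].isNull()).count() / total_rows > threshold]
    df_dropped = df.drop(*columns_with_nulls)
    
    return df_dropped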
    
    
def get_mysql_dataframe(spark_session, sql_query : str, **args):
    '''Create a JDBC URL to the MySQL Database'''
    jdbc_url = f"jdbc:mysql://{args['host_name']}:{args['port']}/{args['db_name']}"
    
    '''Invoke the spark.read.format("jdbc") function to query the database, and fill a DataFrame.'''
    dframe = spark_session.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("driver", args['conn_props']['driver']) \
    .option("user", args['conn_props']['user']) \
    .option("password", args['conn_props']['password']) \
    .option("query", sql_query) \
    .load()
    
    return dframe
    

def get_mongo_uri(**args):
    '''Validate proper input'''
    if args["cluster_location"] not in ['atlas', 'local']:
        raise Exception("You must specify either 'atlas' or 'local' for the 'cluster_location' parameter.")
        
    if args['cluster_location'] == "atlas":
        uri = f"mongodb+srv://{args['user_name']}:{args['password']}@"
        uri += f"{args['cluster_name']}.{args['cluster_subnet']}.mongodb.net/"
    else:
        uri = "mongodb://localhost:27017/"

    return uri


def get_spark_conf_args(spark_jars : list, **args):
    # Spark reads 'spark.jars' as a comma-separated list of jar paths.
    jars = ",".join(spark_jars)
    
    sparkConf_args = {
        "app_name" : "PySpark Northwind Data Lakehouse (Medallion Architecture)",
        "worker_threads" : f"local[{int(os.cpu_count()/2)}]",
        "shuffle_partitions" : int(os.cpu_count()),
        "mongo_uri" : get_mongo_uri(**args),
        "spark_jars" : jars,
        "database_dir" : sql_warehouse_dir
    }
    
    return sparkConf_args
    

def get_spark_conf(**args):
    sparkConf = SparkConf().setAppName(args['app_name'])\
    .setMaster(args['worker_threads']) \
    .set('spark.driver.memory', '4g') \
    .set('spark.executor.memory', '2g') \
    .set('spark.jars', args['spark_jars']) \
    .set('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1') \
    .set('spark.mongodb.input.uri', args['mongo_uri']) \
    .set('spark.mongodb.output.uri', args['mongo_uri']) \
    .set('spark.sql.adaptive.enabled', 'false') \
    .set('spark.sql.debug.maxToStringFields', 35) \
    .set('spark.sql.shuffle.partitions', args['shuffle_partitions']) \
    .set('spark.sql.streaming.forceDeleteTempCheckpointLocation', 'true') \
    .set('spark.sql.streaming.schemaInference', 'true') \
    .set('spark.sql.warehouse.dir', args['database_dir']) \
    .set('spark.streaming.stopGracefullyOnShutdown', 'true')
    
    return sparkConf


def get_mongo_client(**args):
    '''Get MongoDB Client Connection'''
    mongo_uri = get_mongo_uri(**args)
    if args['cluster_location'] == "atlas":
        client = pymongo.MongoClient(mongo_uri, tlsCAFile=certifi.where())

    elif args['cluster_location'] == "local":
        client = pymongo.MongoClient(mongo_uri)
        
    else:
        raise Exception("A MongoDB Client could not be created.")

    return client
    
    
# Superseded by set_mongo_collections_with_pyspark() below; the PyMongo version is kept for reference only.
""" def set_mongo_collections(mongo_client, db_name : str, data_directory : str, json_files : list):
    db = mongo_client[db_name]
    
    for file in json_files:
        db.drop_collection(file)
        json_file = os.path.join(data_directory, json_files[file])
        with open(json_file, 'r') as openfile:
            json_object = json.load(openfile)
            file = db[file]
            result = file.insert_many(json_object)
        
    mongo_client.close() """
    
def set_mongo_collections_with_pyspark(spark_session, data_directory: str, json_files: dict, **mongo_args):
    db_name = mongo_args["db_name"]
    mongo_uri = get_mongo_uri(**mongo_args)

    for collection_name, filename in json_files.items():
        json_file_path = os.path.join(data_directory, filename)

        df = spark_session.read \
            .option("multiline", "true") \
            .json(json_file_path)

        df.write \
            .format("com.mongodb.spark.sql.DefaultSource") \
            .mode("overwrite") \
            .option("uri", mongo_uri) \
            .option("database", db_name) \
            .option("collection", collection_name) \
            .save()

        print(f"✔ Loaded {filename} into MongoDB collection '{collection_name}'")




def get_mongodb_dataframe(spark_session, **args):
    '''Query MongoDB, and create a DataFrame'''
    dframe = spark_session.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("database", args['db_name']) \
        .option("collection", args['collection']).load()

    '''Drop the '_id' index column to clean up the response.'''
    dframe = dframe.drop('_id')
    
    '''Call the drop_null_columns() function passing in the dataframe.'''
    dframe = drop_null_columns(dframe, args['null_column_threshold'])
    
    return dframe
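
One detail worth noting for the Atlas branch of `get_mongo_uri()`: PyMongo expects a user name or password that contains reserved characters such as `@`, `:`, `/`, or `%` to be percent-encoded before it is placed in a connection string. A minimal sketch, assuming a hypothetical helper `build_atlas_uri` and made-up credentials (neither is part of the lab):

```python
from urllib.parse import quote_plus

def build_atlas_uri(user_name: str, password: str, cluster_name: str, cluster_subnet: str) -> str:
    """Hypothetical variant of the Atlas URI logic that percent-encodes the credentials."""
    user = quote_plus(user_name)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{cluster_name}.{cluster_subnet}.mongodb.net/"

# Made-up credentials and subnet, for illustration only.
uri = build_atlas_uri("lab_user", "p@ss:word/1", "cluster0", "abcde")
print(uri)
```

Credentials made up only of letters and digits pass through `quote_plus` unchanged, so the encoding step is safe to apply unconditionally.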

### 4.0. Initialize Data Lakehouse Directory Structure
Remove the Data Lakehouse Database Directory Structure to Ensure Idempotency
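
Idempotency here means the cleanup can run any number of times and leave the same end state, whether or not the directory exists. The sketch below demonstrates that property on a throwaway directory, using a simplified stand-in (`remove_tree`, a hypothetical name) for `remove_directory_tree()`:

```python
import os
import shutil
import tempfile

def remove_tree(path: str) -> str:
    """Simplified stand-in for remove_directory_tree(): delete 'path' if it exists."""
    if os.path.exists(path):
        shutil.rmtree(path)
        return "removed"
    return "missing"

# Build a throwaway directory tree that mimics a lakehouse database folder.
scratch = tempfile.mkdtemp()
target = os.path.join(scratch, "northwind_dlh.db")
os.makedirs(os.path.join(target, "fact_orders", "bronze"))

first = remove_tree(target)    # the tree exists, so it is deleted
second = remove_tree(target)   # nothing is left to delete, and no error is raised
print(first, second)

shutil.rmtree(scratch)  # clean up the scratch directory
```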

In [11]:
remove_directory_tree(database_dir)

"Directory '/Users/vaneeshagupta/DS-2002-2/04-PySpark/spark-warehouse/northwind_dlh.db' has been removed successfully."

### 5.0. Create a New Spark Session

In [12]:
worker_threads = f"local[{int(os.cpu_count()/2)}]"

jars = []
mysql_spark_jar = os.path.join(os.getcwd(), "mysql-connector-j-9.1.0", "mysql-connector-j-9.1.0.jar")
mssql_spark_jar = os.path.join(os.getcwd(), "sqljdbc_12.8", "enu", "jars", "mssql-jdbc-12.8.1.jre11.jar")

jars.append(mysql_spark_jar)
#jars.append(mssql_spark_jar)

sparkConf_args = get_spark_conf_args(jars, **mongodb_args)

sparkConf = get_spark_conf(**sparkConf_args)
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark.sparkContext.setLogLevel("OFF")
spark
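
The `local[N]` master string above uses half of the machine's logical CPUs. `os.cpu_count()` can return `None`, and halving a single core gives zero, so a slightly more defensive calculation may be useful on small machines or containers. A sketch, assuming a hypothetical helper named `local_master`:

```python
import os
from typing import Optional

def local_master(cpu_count: Optional[int]) -> str:
    """Hypothetical helper: build a Spark 'local[N]' master string with N >= 1."""
    threads = max(1, (cpu_count or 1) // 2)
    return f"local[{threads}]"

print(local_master(os.cpu_count()))
```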

### 6.0. Create a New Metadata Database

In [13]:
spark.sql(f"DROP DATABASE IF EXISTS {dest_database} CASCADE;")

sql_create_db = f"""
    CREATE DATABASE IF NOT EXISTS {dest_database}
    COMMENT 'DS-2002 Lab 06 Database'
    WITH DBPROPERTIES (contains_pii = true, purpose = 'DS-2002 Lab 6.0');
"""
spark.sql(sql_create_db)

DataFrame[]

## Section II: Populate Dimensions by Ingesting "Cold-path" Reference Data 
### 1.0. Fetch Data from the File System
#### 1.1. Verify the location of the source data files on the file system

In [14]:
get_file_info(batch_dir)

| | name | size | modification_time |
|---|---|---|---|
| 0 | northwind_customers.json | 10186 | 2025-04-21 01:31:03.085003138 |
| 1 | northwind_employees.csv | 2687 | 2025-04-21 01:31:03.085060358 |
| 2 | northwind_invoices.json | 5843 | 2025-04-21 01:31:03.085113764 |
| 3 | northwind_shippers.csv | 253 | 2025-04-21 01:31:03.085165501 |
| 4 | northwind_suppliers.json | 1380 | 2025-04-21 01:31:03.085221767 |


#### 1.2. Populate the <span style="color:darkred">Employees Dimension</span>
##### 1.2.1. Use PySpark to Read Data from a CSV File

In [15]:
employee_csv = os.path.join(batch_dir, 'northwind_employees.csv')
print(employee_csv)

df_dim_employees = spark.read.format('csv').options(header='true', inferSchema='true').load(employee_csv)
df_dim_employees.toPandas().head(2)

/Users/vaneeshagupta/DS-2002-2/04-PySpark/lab_data/northwind/batch/northwind_employees.csv


| | id | company | last_name | first_name | email_address | job_title | business_phone | home_phone | mobile_phone | fax_number | address | city | state_province | zip_postal_code | country_region | web_page | notes | attachments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Northwind Traders | Freehafer | Nancy | nancy@northwindtraders.com | Sales Representative | (123)555-0100 | (123)555-0102 | | (123)555-0103 | 123 1st Avenue | Seattle | WA | 99999 | USA | #http://northwindtraders.com# | | |
| 1 | 2 | Northwind Traders | Cencini | Andrew | andrew@northwindtraders.com | Vice President, Sales | (123)555-0100 | (123)555-0102 | | (123)555-0103 | 123 2nd Avenue | Bellevue | WA | 99999 | USA | http://northwindtraders.com#http://northwindtr... | Joined the company as a sales representative, ... | |


##### 1.2.2. Make Necessary Transformations to the New DataFrame

In [16]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'employee_id' ------------------------------------------
# ----------------------------------------------------------------------------------
df_dim_employees = df_dim_employees.withColumnRenamed("id", "employee_id")

# ----------------------------------------------------------------------------------
# Add Primary Key column using SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------
df_dim_employees.createOrReplaceTempView("employees")
sql_employees = f"""
    SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) AS employee_key
    FROM employees;
"""
df_dim_employees = spark.sql(sql_employees)

# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------
ordered_columns = ['employee_key', 'employee_id', 'first_name', 'last_name'
                   , 'company', 'job_title', 'business_phone', 'home_phone', 'fax_number'
                   , 'address', 'city', 'state_province', 'zip_postal_code', 'country_region']

df_dim_employees = df_dim_employees[ordered_columns]
df_dim_employees.toPandas().head(2)

| | employee_key | employee_id | first_name | last_name | company | job_title | business_phone | home_phone | fax_number | address | city | state_province | zip_postal_code | country_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Nancy | Freehafer | Northwind Traders | Sales Representative | (123)555-0100 | (123)555-0102 | (123)555-0103 | 123 1st Avenue | Seattle | WA | 99999 | USA |
| 1 | 2 | 2 | Andrew | Cencini | Northwind Traders | Vice President, Sales | (123)555-0100 | (123)555-0102 | (123)555-0103 | 123 2nd Avenue | Bellevue | WA | 99999 | USA |
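
`ROW_NUMBER() OVER (ORDER BY employee_id)` assigns a gap-free sequence 1..N in business-key order, regardless of the values or gaps in the source `id` column. The plain-Python sketch below mimics that behavior on made-up rows (the ids 7, 3, and 12 and the names are illustrative, not from the lab data):

```python
# Made-up source rows whose business keys are unordered and have gaps.
rows = [
    {"employee_id": 7, "last_name": "Alpha"},
    {"employee_id": 3, "last_name": "Beta"},
    {"employee_id": 12, "last_name": "Gamma"},
]

# Sort by the business key, then number the rows starting at 1,
# which is what ROW_NUMBER() OVER (ORDER BY employee_id) produces.
keyed = [
    {"employee_key": key, **row}
    for key, row in enumerate(sorted(rows, key=lambda r: r["employee_id"]), start=1)
]

for row in keyed:
    print(row)
```

Because the surrogate key depends only on sort order, re-running the load on the same source rows yields the same keys.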


##### 1.2.3. Save as the <span style="color:darkred">dim_employees</span> table in the Data Lakehouse

In [17]:
df_dim_employees.write.saveAsTable(f"{dest_database}.dim_employees", mode="overwrite")

##### 1.2.4. Unit Test: Describe and Preview Table

In [18]:
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_employees;").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_employees LIMIT 2").toPandas()

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|        employee_key|                 int|   NULL|
|         employee_id|                 int|   NULL|
|          first_name|              string|   NULL|
|           last_name|              string|   NULL|
|             company|              string|   NULL|
|           job_title|              string|   NULL|
|      business_phone|              string|   NULL|
|          home_phone|              string|   NULL|
|          fax_number|              string|   NULL|
|             address|              string|   NULL|
|                city|              string|   NULL|
|      state_province|              string|   NULL|
|     zip_postal_code|                 int|   NULL|
|      country_region|              string|   NULL|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|           

| | employee_key | employee_id | first_name | last_name | company | job_title | business_phone | home_phone | fax_number | address | city | state_province | zip_postal_code | country_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Nancy | Freehafer | Northwind Traders | Sales Representative | (123)555-0100 | (123)555-0102 | (123)555-0103 | 123 1st Avenue | Seattle | WA | 99999 | USA |
| 1 | 2 | 2 | Andrew | Cencini | Northwind Traders | Vice President, Sales | (123)555-0100 | (123)555-0102 | (123)555-0103 | 123 2nd Avenue | Bellevue | WA | 99999 | USA |
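
The `DESCRIBE` output shows that `inferSchema` typed `zip_postal_code` as `int`. That is harmless for the placeholder value 99999, but integer typing discards leading zeros in real postal codes. A small standard-library sketch of the effect, using a made-up CSV row:

```python
import csv
import io

# Made-up CSV content; '02134' is an illustrative ZIP code with a leading zero.
csv_text = "id,city,zip_postal_code\n1,Boston,02134\n"

row = next(csv.DictReader(io.StringIO(csv_text)))
as_text = row["zip_postal_code"]       # the csv module keeps every field as a string
as_int = int(row["zip_postal_code"])   # numeric typing drops the leading zero

print(as_text, as_int)
```

In Spark, supplying an explicit schema that declares the column as a string is one way to avoid this; that change is a suggestion and is not part of the lab as written.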


#### 1.3. Populate the <span style="color:darkred">Shippers Dimension</span>
##### 1.3.1. Use PySpark to Read Data from a CSV File

In [19]:
# 1). Get a reference to the 'northwind_shippers.csv' file.
shippers_csv = os.path.join(batch_dir, 'northwind_shippers.csv')
print(shippers_csv)

# 2). Use Spark to read the CSV file data into the 'df_dim_shippers' variable.
#     Remember to specify that the first row contains column names (header), and to infer the schema.
df_dim_shippers = spark.read.format('csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load(shippers_csv)

# 3). Unit Test: Convert the spark dataframe to a Pandas dataframe, and display the first two rows.
df_dim_shippers.toPandas().head(2)

/Users/vaneeshagupta/DS-2002-2/04-PySpark/lab_data/northwind/batch/northwind_shippers.csv


| | id | company | address | city | state_province | zip_postal_code | country_region |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Shipping Company A | 123 Any Street | Memphis | TN | 99999 | USA |
| 1 | 2 | Shipping Company B | 123 Any Street | Memphis | TN | 99999 | USA |


##### 1.3.2. Make Necessary Transformations to the New DataFrame

In [23]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'shipper_id' ------------------------------------------
# ----------------------------------------------------------------------------------
df_dim_shippers = df_dim_shippers.withColumnRenamed("id", "shipper_id")

# ----------------------------------------------------------------------------------
# Add Primary Key column using SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------
df_dim_shippers.createOrReplaceTempView("shippers")
sql_shippers = f"""
    SELECT 
        ROW_NUMBER() OVER (ORDER BY shipper_id) AS shipper_key,
        shipper_id, company, address, city, 
        state_province, zip_postal_code, country_region
    FROM shippers
"""
df_dim_shippers = spark.sql(sql_shippers)

# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------
ordered_columns = ['shipper_key', 'shipper_id', 'company', 'address', 'city', 
                   'state_province', 'zip_postal_code', 'country_region']

df_dim_shippers = df_dim_shippers[ordered_columns]
df_dim_shippers.toPandas().head(2)



| | shipper_key | shipper_id | company | address | city | state_province | zip_postal_code | country_region |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Shipping Company A | 123 Any Street | Memphis | TN | 99999 | USA |
| 1 | 2 | 2 | Shipping Company B | 123 Any Street | Memphis | TN | 99999 | USA |


##### 1.3.3. Save as the <span style="color:darkred">dim_shippers</span> table in the Data Lakehouse

In [24]:
df_dim_shippers.write.saveAsTable(f"{dest_database}.dim_shippers", mode="overwrite")


##### 1.3.4. Unit Test: Describe and Preview Table

In [25]:
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_shippers").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_shippers LIMIT 2").toPandas()

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|         shipper_key|                 int|   NULL|
|          shipper_id|                 int|   NULL|
|             company|              string|   NULL|
|             address|              string|   NULL|
|                city|              string|   NULL|
|      state_province|              string|   NULL|
|     zip_postal_code|                 int|   NULL|
|      country_region|              string|   NULL|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|             Catalog|       spark_catalog|       |
|            Database|       northwind_dlh|       |
|               Table|        dim_shippers|       |
|        Created Time|Mon Apr 21 03:51:...|       |
|         Last Access|             UNKNOWN|       |
|          Created By|         Spark 3.5.5|       |
|           

| | shipper_key | shipper_id | company | address | city | state_province | zip_postal_code | country_region |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Shipping Company A | 123 Any Street | Memphis | TN | 99999 | USA |
| 1 | 2 | 2 | Shipping Company B | 123 Any Street | Memphis | TN | 99999 | USA |


### 2.0. Fetch Reference Data from a MongoDB Atlas Database
#### 2.1. Create a New MongoDB Database, and Load Each JSON File into a New MongoDB Collection
**NOTE:** The following cell **can** be run more than once because the **set_mongo_collections_with_pyspark()** function **is** idempotent: it writes each collection in `overwrite` mode.

In [28]:
#client = get_mongo_client(**mongodb_args)

json_files = {"customers" : "northwind_customers.json",
              "invoices" : 'northwind_invoices.json',
              "suppliers" : 'northwind_suppliers.json'
             }

#set_mongo_collections(client, mongodb_args["db_name"], batch_dir, json_files) 
set_mongo_collections_with_pyspark(spark, batch_dir, json_files, **mongodb_args)

✔ Loaded northwind_customers.json into MongoDB collection 'customers'
✔ Loaded northwind_invoices.json into MongoDB collection 'invoices'
✔ Loaded northwind_suppliers.json into MongoDB collection 'suppliers'
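
`set_mongo_collections_with_pyspark()` reads each file with the `multiline` option, which Spark needs when a file holds one JSON document (such as an array) spread over several lines rather than one object per line. The sketch below contrasts the two layouts with Python's `json` module; the sample records are abbreviated and made up, and the assumption that the lab files use the array layout is inferred from that option:

```python
import json

# Layout 1: a single JSON array spanning several lines (needs multiline parsing in Spark).
array_text = """[
  {"id": 1, "company": "Company A"},
  {"id": 2, "company": "Company B"}
]"""

# Layout 2: JSON Lines, with one complete object per line (Spark's default expectation).
lines_text = '{"id": 1, "company": "Company A"}\n{"id": 2, "company": "Company B"}'

from_array = json.loads(array_text)
from_lines = [json.loads(line) for line in lines_text.splitlines()]

# Both layouts carry the same records; only the file structure differs.
print(from_array == from_lines)
```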


#### 2.2. Populate the <span style="color:darkred">Customers Dimension</span>
##### 2.2.1. Fetch Data from the New MongoDB <span style="color:darkred">Customers</span> Collection

In [29]:
mongodb_args["collection"] = "customers"

df_dim_customers = get_mongodb_dataframe(spark, **mongodb_args)
df_dim_customers.toPandas().head(2)

| | address | business_phone | city | company | country_region | fax_number | first_name | id | job_title | last_name | state_province | zip_postal_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123 1st Street | (123)555-0100 | Seattle | Company A | USA | (123)555-0101 | Anna | 1 | Owner | Bedecs | WA | 99999 |
| 1 | 123 2nd Street | (123)555-0100 | Boston | Company B | USA | (123)555-0101 | Antonio | 2 | Owner | Gratacos Solsona | MA | 99999 |
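
`get_mongodb_dataframe()` drops any column whose share of NULL values exceeds `null_column_threshold` (0.5 in this lab), which is why the preview has fewer columns than a typical Northwind customer record. The rule can be sketched in plain Python on made-up records; this illustrates the threshold logic only and is not the Spark implementation:

```python
def columns_over_null_threshold(records: list, threshold: float) -> list:
    """Return the column names whose fraction of None values exceeds 'threshold'."""
    total = len(records)
    if total == 0:
        return []
    columns = sorted({key for record in records for key in record})
    return [
        column for column in columns
        if sum(record.get(column) is None for record in records) / total > threshold
    ]

# Made-up records: 'notes' is NULL in 3 of 4 rows, 'fax_number' in 2 of 4.
records = [
    {"id": 1, "fax_number": "555-0101", "notes": None},
    {"id": 2, "fax_number": None, "notes": None},
    {"id": 3, "fax_number": "555-0103", "notes": "VIP"},
    {"id": 4, "fax_number": None, "notes": None},
]
to_drop = columns_over_null_threshold(records, threshold=0.5)
print(to_drop)
```

Note that the comparison is strict: a column that is exactly 50% NULL, like `fax_number` here, is kept.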


##### 2.2.2. Make Necessary Transformations to the New DataFrame

In [35]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'customer_id' ------------------------------------------
# ----------------------------------------------------------------------------------
df_dim_customers = df_dim_customers.withColumnRenamed("id", "customer_id")

# ----------------------------------------------------------------------------------
# Add Primary Key column using the SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------

df_dim_customers.createOrReplaceTempView("customers")
sql_customers = """
    SELECT 
        ROW_NUMBER() OVER (ORDER BY customer_id) AS customer_key,
        customer_id,
        company,
        last_name,
        first_name,
        job_title,
        business_phone,
        fax_number,
        address,
        city,
        state_province,
        zip_postal_code,
        country_region
    FROM customers
"""
df_dim_customers = spark.sql(sql_customers)


# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------
ordered_columns = ['customer_key', 'customer_id', 'company', 'last_name', 'first_name',
                   'job_title', 'business_phone', 'fax_number',
                   'address', 'city', 'state_province', 'zip_postal_code', 'country_region']

df_dim_customers = df_dim_customers.select(*ordered_columns)
df_dim_customers.toPandas().head(2)


| | customer_key | customer_id | company | last_name | first_name | job_title | business_phone | fax_number | address | city | state_province | zip_postal_code | country_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Company A | Bedecs | Anna | Owner | (123)555-0100 | (123)555-0101 | 123 1st Street | Seattle | WA | 99999 | USA |
| 1 | 2 | 2 | Company B | Gratacos Solsona | Antonio | Owner | (123)555-0100 | (123)555-0101 | 123 2nd Street | Boston | MA | 99999 | USA |


##### 2.2.3. Save as the <span style="color:darkred">dim_customers</span> table in the Data Lakehouse

In [36]:
df_dim_customers.write.saveAsTable(f"{dest_database}.dim_customers", mode="overwrite")


##### 2.2.4. Unit Test: Describe and Preview Table

In [37]:
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_customers").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_customers LIMIT 2").toPandas()


+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|        customer_key|                 int|   NULL|
|         customer_id|              bigint|   NULL|
|             company|              string|   NULL|
|           last_name|              string|   NULL|
|          first_name|              string|   NULL|
|           job_title|              string|   NULL|
|      business_phone|              string|   NULL|
|          fax_number|              string|   NULL|
|             address|              string|   NULL|
|                city|              string|   NULL|
|      state_province|              string|   NULL|
|     zip_postal_code|              string|   NULL|
|      country_region|              string|   NULL|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|             Catalog|       spark_catalog|       |
|           

| | customer_key | customer_id | company | last_name | first_name | job_title | business_phone | fax_number | address | city | state_province | zip_postal_code | country_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Company A | Bedecs | Anna | Owner | (123)555-0100 | (123)555-0101 | 123 1st Street | Seattle | WA | 99999 | USA |
| 1 | 2 | 2 | Company B | Gratacos Solsona | Antonio | Owner | (123)555-0100 | (123)555-0101 | 123 2nd Street | Boston | MA | 99999 | USA |


#### 2.3. Populate the <span style="color:darkred">Suppliers Dimension</span>
##### 2.3.1. Fetch Data from the New MongoDB <span style="color:darkred">Suppliers</span> Collection

In [38]:
mongodb_args["collection"] = "suppliers"
df_dim_suppliers = get_mongodb_dataframe(spark, **mongodb_args)
df_dim_suppliers.printSchema()
df_dim_suppliers.toPandas().head(2)


root
 |-- company: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- job_title: string (nullable = true)
 |-- last_name: string (nullable = true)



| | company | first_name | id | job_title | last_name |
|---|---|---|---|---|---|
| 0 | Supplier A | Elizabeth A. | 1 | Sales Manager | Andersen |
| 1 | Supplier B | Cornelia | 2 | Sales Manager | Weiler |


##### 2.3.2. Make Necessary Transformations to the New DataFrame

In [39]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'supplier_id' ------------------------------------------
# ----------------------------------------------------------------------------------
df_dim_suppliers = df_dim_suppliers.withColumnRenamed("id", "supplier_id")

# ----------------------------------------------------------------------------------
# Add Primary Key column using SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------
df_dim_suppliers.createOrReplaceTempView("suppliers")
sql_suppliers = """
    SELECT 
        ROW_NUMBER() OVER (ORDER BY supplier_id) AS supplier_key,
        supplier_id,
        company,
        first_name,
        last_name,
        job_title
    FROM suppliers
"""
df_dim_suppliers = spark.sql(sql_suppliers)

# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------
ordered_columns = ['supplier_key', 'supplier_id', 'company', 'first_name', 'last_name', 'job_title']
df_dim_suppliers = df_dim_suppliers.select(*ordered_columns)
df_dim_suppliers.toPandas().head(2)


| | supplier_key | supplier_id | company | first_name | last_name | job_title |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Supplier A | Elizabeth A. | Andersen | Sales Manager |
| 1 | 2 | 2 | Supplier B | Cornelia | Weiler | Sales Manager |


##### 2.3.3. Save as the <span style="color:darkred">dim_suppliers</span> table in the Data Lakehouse

In [40]:
df_dim_suppliers.write.saveAsTable(f"{dest_database}.dim_suppliers", mode="overwrite")

##### 2.3.4. Unit Test: Describe and Preview Table

In [41]:
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_suppliers").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_suppliers LIMIT 2").toPandas()

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|        supplier_key|                 int|   NULL|
|         supplier_id|              bigint|   NULL|
|             company|              string|   NULL|
|          first_name|              string|   NULL|
|           last_name|              string|   NULL|
|           job_title|              string|   NULL|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|             Catalog|       spark_catalog|       |
|            Database|       northwind_dlh|       |
|               Table|       dim_suppliers|       |
|        Created Time|Mon Apr 21 04:03:...|       |
|         Last Access|             UNKNOWN|       |
|          Created By|         Spark 3.5.5|       |
|                Type|             MANAGED|       |
|            Provider|             parquet|       |
|           

| | supplier_key | supplier_id | company | first_name | last_name | job_title |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Supplier A | Elizabeth A. | Andersen | Sales Manager |
| 1 | 2 | 2 | Supplier B | Cornelia | Weiler | Sales Manager |


#### 2.4. Populate the <span style="color:darkred">Invoices Dimension</span>
##### 2.4.1. Fetch Data from the New MongoDB <span style="color:darkred">Invoices</span> Collection

In [42]:
mongodb_args["collection"] = "invoices"
df_dim_invoices = get_mongodb_dataframe(spark, **mongodb_args)
df_dim_invoices.printSchema()
df_dim_invoices.toPandas().head(2)


root
 |-- amount_due: double (nullable = true)
 |-- id: long (nullable = true)
 |-- invoice_date: string (nullable = true)
 |-- order_id: long (nullable = true)
 |-- shipping: double (nullable = true)
 |-- tax: double (nullable = true)



Unnamed: 0,amount_due,id,invoice_date,order_id,shipping,tax
0,0.0,5,2006-03-22 16:08:59,31,0.0,0.0
1,0.0,6,2006-03-22 16:10:27,32,0.0,0.0


##### 2.4.2. Make Necessary Transformations to the New Dataframe

In [43]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'invoice_id' ------------------------------------------
# ----------------------------------------------------------------------------------
df_dim_invoices = df_dim_invoices.withColumnRenamed("id", "invoice_id")

# ----------------------------------------------------------------------------------
# Add Primary Key column using SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------
df_dim_invoices.createOrReplaceTempView("invoices")
sql_invoices = """
    SELECT 
        ROW_NUMBER() OVER (ORDER BY invoice_id) AS invoice_key,
        invoice_id,
        order_id,
        invoice_date,
        tax,
        shipping,
        amount_due
    FROM invoices
"""
df_dim_invoices = spark.sql(sql_invoices)


# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------
ordered_columns = ['invoice_key', 'invoice_id', 'order_id', 'invoice_date',
                   'tax', 'shipping', 'amount_due']

df_dim_invoices = df_dim_invoices.select(*ordered_columns)
df_dim_invoices.toPandas().head(2)

Unnamed: 0,invoice_key,invoice_id,order_id,invoice_date,tax,shipping,amount_due
0,1,5,31,2006-03-22 16:08:59,0.0,0.0,0.0
1,2,6,32,2006-03-22 16:10:27,0.0,0.0,0.0
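The effect of `ROW_NUMBER() OVER (ORDER BY invoice_id)` can be illustrated without Spark. The sketch below is a plain-Python analogy, not the notebook's implementation: `assign_surrogate_keys` is a hypothetical helper that sorts the business keys and numbers them from 1. The ids 5 and 6 come from the preview above; 8 is an illustrative value.

```python
def assign_surrogate_keys(business_ids):
    """Map each business key to a 1-based surrogate key, ordered by business key.

    Hypothetical helper that mimics ROW_NUMBER() OVER (ORDER BY id) for unique ids.
    """
    return {
        business_id: row_number
        for row_number, business_id in enumerate(sorted(business_ids), start=1)
    }


# The input order does not matter; the keys follow the sorted business ids.
invoice_keys = assign_surrogate_keys([6, 5, 8])
print(invoice_keys)
```

One design note: a window with an `ORDER BY` but no `PARTITION BY` makes Spark move all rows into a single partition to number them. That is usually acceptable for small dimension tables such as this one, but it may not scale to large fact tables.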


##### 2.4.3. Save as the <span style="color:darkred">dim_invoices</span> table in the Data Lakehouse

In [44]:
df_dim_invoices.write.saveAsTable(f"{dest_database}.dim_invoices", mode="overwrite")


##### 2.4.4. Unit Test: Describe and Preview Table

In [45]:
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_invoices").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_invoices LIMIT 2").toPandas()


+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|         invoice_key|                 int|   NULL|
|          invoice_id|              bigint|   NULL|
|            order_id|              bigint|   NULL|
|        invoice_date|              string|   NULL|
|                 tax|              double|   NULL|
|            shipping|              double|   NULL|
|          amount_due|              double|   NULL|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|             Catalog|       spark_catalog|       |
|            Database|       northwind_dlh|       |
|               Table|        dim_invoices|       |
|        Created Time|Mon Apr 21 04:05:...|       |
|         Last Access|             UNKNOWN|       |
|          Created By|         Spark 3.5.5|       |
|                Type|             MANAGED|       |
|           

Unnamed: 0,invoice_key,invoice_id,order_id,invoice_date,tax,shipping,amount_due
0,1,5,31,2006-03-22 16:08:59,0.0,0.0,0.0
1,2,6,32,2006-03-22 16:10:27,0.0,0.0,0.0


### 3.0. Fetch Reference Data from a MySQL Database
#### 3.1. Populate the <span style="color:darkred">Date Dimension</span>
##### 3.1.1. Fetch data from the <span style="color:darkred">dim_date</span> table in MySQL

In [65]:
sql_dim_date = f"SELECT * FROM {mysql_args['db_name']}.dim_date"
df_dim_date = get_mysql_dataframe(spark, sql_dim_date, **mysql_args)

##### 3.1.2. Save as the <span style="color:darkred">dim_date</span> table in the Data Lakehouse

In [66]:
df_dim_date.write.saveAsTable(f"{dest_database}.dim_date", mode="overwrite")

##### 3.1.3. Unit Test: Describe and Preview Table

In [67]:
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_date;").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_date LIMIT 2").toPandas()

+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|            date_key|      int|   NULL|
|           full_date|     date|   NULL|
|           date_name| char(11)|   NULL|
|        date_name_us| char(11)|   NULL|
|        date_name_eu| char(11)|   NULL|
|         day_of_week|  tinyint|   NULL|
|    day_name_of_week| char(10)|   NULL|
|        day_of_month|  tinyint|   NULL|
|         day_of_year|      int|   NULL|
|     weekday_weekend| char(10)|   NULL|
|        week_of_year|  tinyint|   NULL|
|          month_name| char(10)|   NULL|
|       month_of_year|  tinyint|   NULL|
|is_last_day_of_month|  char(1)|   NULL|
|    calendar_quarter|  tinyint|   NULL|
|       calendar_year|      int|   NULL|
| calendar_year_month| char(10)|   NULL|
|   calendar_year_qtr| char(10)|   NULL|
|fiscal_month_of_year|  tinyint|   NULL|
|      fiscal_quarter|  tinyint|   NULL|
+--------------------+---------+-------+
only showing top

Unnamed: 0,date_key,full_date,date_name,date_name_us,date_name_eu,day_of_week,day_name_of_week,day_of_month,day_of_year,weekday_weekend,...,is_last_day_of_month,calendar_quarter,calendar_year,calendar_year_month,calendar_year_qtr,fiscal_month_of_year,fiscal_quarter,fiscal_year,fiscal_year_month,fiscal_year_qtr
0,20000101,2000-01-01,2000/01/01,01/01/2000,01/01/2000,7,Saturday,1,1,Weekend,...,N,1,2000,2000-01,2000Q1,7,3,2000,2000-07,2000Q3
1,20000102,2000-01-02,2000/01/02,01/02/2000,02/01/2000,1,Sunday,2,2,Weekend,...,N,1,2000,2000-01,2000Q1,7,3,2000,2000-07,2000Q3
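The two preview rows suggest that `date_key` is the calendar date written as a `yyyymmdd` integer. Assuming that pattern holds for the whole table (the preview only confirms it for two rows), the key could be sketched in plain Python as follows; `to_date_key` is a hypothetical helper and is not part of the lab.

```python
from datetime import date


def to_date_key(d: date) -> int:
    """Return an integer key in yyyymmdd form, e.g. 2000-01-01 -> 20000101."""
    return d.year * 10000 + d.month * 100 + d.day


# The two dates shown in the dim_date preview.
print(to_date_key(date(2000, 1, 1)))
print(to_date_key(date(2000, 1, 2)))
```

The lab itself does not compute keys this way. It joins each fact date to `full_date` and reads `date_key` from the matching dimension row, which stays correct even if the key pattern differs.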


#### 3.2. Populate the <span style="color:darkred">Product Dimension</span>
##### 3.2.1. Fetch data from the <span style="color:darkred">Products</span> table in MySQL

In [73]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# ----------------------------------------------------------------------------------
# Fetch the 'products' table from the 'northwind' database
# ----------------------------------------------------------------------------------
mysql_args["db_name"] = "northwind"

sql_dim_products = f"SELECT * FROM {mysql_args['db_name']}.products"
df_dim_products = get_mysql_dataframe(spark, sql_dim_products, **mysql_args)

df_dim_products.printSchema()

# ----------------------------------------------------------------------------------
# Add Primary Key column using the SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------
window_spec = Window.orderBy("id")
df_dim_products = df_dim_products.withColumn("product_key", row_number().over(window_spec))



root
 |-- supplier_ids: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- product_code: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- standard_cost: decimal(19,4) (nullable = true)
 |-- list_price: decimal(19,4) (nullable = true)
 |-- reorder_level: integer (nullable = true)
 |-- target_level: integer (nullable = true)
 |-- quantity_per_unit: string (nullable = true)
 |-- discontinued: boolean (nullable = true)
 |-- minimum_reorder_quantity: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- attachments: binary (nullable = true)



##### 3.2.2. Perform any Necessary Transformations

In [74]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'product_id' 
# ----------------------------------------------------------------------------------
df_dim_products = df_dim_products.withColumnRenamed("id", "product_id")

# Note: monotonically_increasing_id() was not used to create 'product_key' because
# its values start at zero (0) and are not guaranteed to be consecutive:
#     df_dim_products = df_dim_products.withColumn("product_key", monotonically_increasing_id())


# ----------------------------------------------------------------------------------
# Drop unwanted columns (description and attachments)
# ----------------------------------------------------------------------------------
df_dim_products = df_dim_products.drop("description", "attachments")

# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------
ordered_columns = [
    "product_key", "product_id", "product_code", "product_name",
    "standard_cost", "list_price", "reorder_level", "target_level",
    "quantity_per_unit", "discontinued", "minimum_reorder_quantity", "category"
]

df_dim_products = df_dim_products.select(*ordered_columns)
df_dim_products.toPandas().head(2)

Unnamed: 0,product_key,product_id,product_code,product_name,standard_cost,list_price,reorder_level,target_level,quantity_per_unit,discontinued,minimum_reorder_quantity,category
0,1,1,NWTB-1,Northwind Traders Chai,13.5,18.0,10,40,10 boxes x 20 bags,False,10.0,Beverages
1,2,3,NWTCO-3,Northwind Traders Syrup,7.5,10.0,25,100,12 - 550 ml bottles,False,25.0,Condiments
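The cell above notes that `monotonically_increasing_id()` starts at zero and is not sequential. Spark's documentation describes the current implementation as placing the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below reproduces that layout to show where the gaps come from; `monotonic_id` is a hypothetical name, and the two-partition, three-row layout is illustrative.

```python
def monotonic_id(partition_index: int, row_in_partition: int) -> int:
    """Compose an ID from a partition index (upper bits) and a row position (lower 33 bits)."""
    return (partition_index << 33) + row_in_partition


# Two partitions with three rows each: IDs are increasing and unique,
# but they jump by 2**33 at the partition boundary.
ids = [monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)
```

Because `ROW_NUMBER()` produces a dense sequence starting at 1, it is the more readable choice for a small dimension's surrogate key.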


##### 3.2.3. Save as the <span style="color:darkred">dim_products</span> table in the Data Lakehouse

In [75]:
df_dim_products.write.saveAsTable(f"{dest_database}.dim_products", mode="overwrite")


##### 3.2.4. Unit Test: Describe and Preview Table

In [76]:
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_products").show(truncate=False)
spark.sql(f"SELECT * FROM {dest_database}.dim_products LIMIT 2").toPandas()

+----------------------------+----------------------------+-------+
|col_name                    |data_type                   |comment|
+----------------------------+----------------------------+-------+
|product_key                 |int                         |NULL   |
|product_id                  |int                         |NULL   |
|product_code                |varchar(25)                 |NULL   |
|product_name                |varchar(50)                 |NULL   |
|standard_cost               |decimal(19,4)               |NULL   |
|list_price                  |decimal(19,4)               |NULL   |
|reorder_level               |int                         |NULL   |
|target_level                |int                         |NULL   |
|quantity_per_unit           |varchar(50)                 |NULL   |
|discontinued                |boolean                     |NULL   |
|minimum_reorder_quantity    |int                         |NULL   |
|category                    |varchar(50)       

Unnamed: 0,product_key,product_id,product_code,product_name,standard_cost,list_price,reorder_level,target_level,quantity_per_unit,discontinued,minimum_reorder_quantity,category
0,1,1,NWTB-1,Northwind Traders Chai,13.5,18.0,10,40,10 boxes x 20 bags,False,10,Beverages
1,2,3,NWTCO-3,Northwind Traders Syrup,7.5,10.0,25,100,12 - 550 ml bottles,False,25,Condiments


### 4.0. Verify Dimension Tables

In [77]:
spark.sql(f"USE {dest_database};")
spark.sql("SHOW TABLES").toPandas()

Unnamed: 0,namespace,tableName,isTemporary
0,northwind_dlh,dim_customers,False
1,northwind_dlh,dim_date,False
2,northwind_dlh,dim_employees,False
3,northwind_dlh,dim_invoices,False
4,northwind_dlh,dim_products,False
5,northwind_dlh,dim_shippers,False
6,northwind_dlh,dim_suppliers,False
7,,customers,True
8,,employees,True
9,,invoices,True


## Section III: Integrate Reference Data with Real-Time Data
### 6.0. Use PySpark Structured Streaming to Process (Hot Path) <span style="color:darkred">Orders</span> Fact Data  
#### 6.1. Verify the location of the source data files on the file system

In [78]:
get_file_info(orders_stream_dir)

Unnamed: 0,name,size,modification_time
0,northwind_orders_01.json,9609,2025-04-21 01:31:03.085644722
1,northwind_orders_02.json,9103,2025-04-21 01:31:03.085706472
2,northwind_orders_03.json,9008,2025-04-21 01:31:03.085754871


#### 6.2. Create the Bronze Layer: Stage <span style="color:darkred">Orders Fact table</span> Data
##### 6.2.1. Read "Raw" JSON file data into a Stream

In [79]:
df_orders_bronze = (
    spark.readStream
    .option("schemaLocation", orders_output_bronze)
    .option("maxFilesPerTrigger", 1)
    .option("multiLine", "true")
    .json(orders_stream_dir)
)

df_orders_bronze.isStreaming

True
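The `maxFilesPerTrigger` option caps how many new files a file-based stream picks up in each micro-batch. With a value of 1, the three source files listed earlier should be processed as three separate batches. The plain-Python sketch below only illustrates that chunking; `plan_micro_batches` is a hypothetical helper, and it assumes the files are consumed in the order given.

```python
def plan_micro_batches(file_names, max_files_per_trigger=1):
    """Split the pending files into micro-batches of at most max_files_per_trigger files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    pending = list(file_names)
    return [
        pending[i:i + max_files_per_trigger]
        for i in range(0, len(pending), max_files_per_trigger)
    ]


# File names taken from the get_file_info() listing for the orders stream.
order_files = [
    "northwind_orders_01.json",
    "northwind_orders_02.json",
    "northwind_orders_03.json",
]
for batch_id, files in enumerate(plan_micro_batches(order_files)):
    print(batch_id, files)
```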

##### 6.2.2. Write the Streaming Data to a Parquet file

In [80]:
orders_checkpoint_bronze = os.path.join(orders_output_bronze, '_checkpoint')

orders_bronze_query = (
    df_orders_bronze
    # Add Current Timestamp and Input Filename columns for Traceability
    .withColumn("receipt_time", current_timestamp())
    .withColumn("source_file", input_file_name())
    .writeStream
    .format("parquet")
    .outputMode("append")
    .queryName("orders_bronze")
    .trigger(availableNow=True)
    .option("checkpointLocation", orders_checkpoint_bronze)
    .option("compression", "snappy")
    .start(orders_output_bronze)
)

##### 6.2.3. Unit Test: Implement Query Monitoring

In [81]:
print(f"Query ID: {orders_bronze_query.id}")
print(f"Query Name: {orders_bronze_query.name}")
print(f"Query Status: {orders_bronze_query.status}")

Query ID: 1275be83-3562-4862-8f69-ad7dcb9b1bd0
Query Name: orders_bronze
Query Status: {'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}


In [82]:
orders_bronze_query.awaitTermination()

#### 6.3. Create the Silver Layer: Integrate "Cold-path" Data & Make Transformations
##### 6.3.1. Prepare Role-Playing Dimension Primary and Business Keys

In [83]:
df_dim_order_date = df_dim_date.select(col("date_key").alias("order_date_key"), col("full_date").alias("order_full_date"))
df_dim_paid_date = df_dim_date.select(col("date_key").alias("paid_date_key"), col("full_date").alias("paid_full_date"))
df_dim_shipped_date = df_dim_date.select(col("date_key").alias("shipped_date_key"), col("full_date").alias("shipped_full_date"))
df_dim_shippers = df_dim_shippers.withColumnRenamed("shipper_id", "shipper_no")

##### 6.3.2. Define Silver Query to Join Streaming with Batch Data

In [84]:
df_orders_silver = spark.readStream.format("parquet").load(orders_output_bronze) \
    .join(df_dim_customers, "customer_id") \
    .join(df_dim_employees, "employee_id") \
    .join(df_dim_products, "product_id") \
    .join(df_dim_shippers, df_dim_shippers.shipper_no == col("shipper_id").cast(IntegerType()), "left_outer") \
    .join(df_dim_order_date, df_dim_order_date.order_full_date.cast(DateType()) == col("order_date").cast(DateType()), "inner") \
    .join(df_dim_shipped_date, df_dim_shipped_date.shipped_full_date.cast(DateType()) == col("shipped_date").cast(DateType()), "left_outer") \
    .join(df_dim_paid_date, df_dim_paid_date.paid_full_date.cast(DateType()) == col("paid_date").cast(DateType()), "left_outer") \
    .select(col("order_id").cast(LongType()), \
            col("order_detail_id").cast(LongType()), \
            df_dim_customers.customer_key.cast(LongType()), \
            df_dim_employees.employee_key.cast(LongType()), \
            df_dim_products.product_key.cast(LongType()), \
            df_dim_shippers.shipper_key.cast(IntegerType()), \
            df_dim_order_date.order_date_key.cast(LongType()), \
            df_dim_paid_date.paid_date_key.cast(LongType()), \
            df_dim_shipped_date.shipped_date_key.cast(LongType()), \
            col("quantity"), \
            col("unit_price"), \
            col("discount"), \
            col("shipping_fee"), \
            col("taxes"), \
            col("tax_rate"), \
            col("payment_type"), \
            col("order_status"), \
            col("order_details_status") \
           )

In [85]:
df_orders_silver.isStreaming

True
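The Silver query mixes `inner` and `left_outer` joins. An inner join drops a fact row when no dimension row matches, while a left outer join keeps the row and leaves the looked-up key empty. A likely reason for using `left_outer` on the shipped and paid dates is that an order may not have been shipped or paid yet; that reading is an assumption, not something the lab states. The plain-Python sketch below illustrates the difference with made-up rows; `join_dimension` and the sample values are hypothetical.

```python
def join_dimension(fact_rows, dim_lookup, fact_column, key_column, how="inner"):
    """Look up a dimension key for each fact row, mimicking an inner or left outer join."""
    if how not in ("inner", "left_outer"):
        raise ValueError(f"Unsupported join type: {how}")
    joined = []
    for row in fact_rows:
        # A missing (None) date never matches a dimension row, as with SQL NULL.
        key = dim_lookup.get(row[fact_column])
        if key is None and how == "inner":
            continue  # inner join: unmatched fact rows are dropped
        joined.append({**row, key_column: key})
    return joined


# Illustrative data: order 2 has not been shipped yet.
dim_date = {"2006-01-15": 20060115, "2006-01-22": 20060122}
orders = [
    {"order_id": 1, "shipped_date": "2006-01-22"},
    {"order_id": 2, "shipped_date": None},
]

print(join_dimension(orders, dim_date, "shipped_date", "shipped_date_key", "inner"))
print(join_dimension(orders, dim_date, "shipped_date", "shipped_date_key", "left_outer"))
```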

In [86]:
df_orders_silver.printSchema()

root
 |-- order_id: long (nullable = true)
 |-- order_detail_id: long (nullable = true)
 |-- customer_key: long (nullable = false)
 |-- employee_key: long (nullable = false)
 |-- product_key: long (nullable = false)
 |-- shipper_key: integer (nullable = true)
 |-- order_date_key: long (nullable = true)
 |-- paid_date_key: long (nullable = true)
 |-- shipped_date_key: long (nullable = true)
 |-- quantity: double (nullable = true)
 |-- unit_price: double (nullable = true)
 |-- discount: long (nullable = true)
 |-- shipping_fee: double (nullable = true)
 |-- taxes: double (nullable = true)
 |-- tax_rate: long (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_details_status: string (nullable = true)



##### 6.3.3. Write the Transformed Streaming data to the Data Lakehouse

In [87]:
orders_checkpoint_silver = os.path.join(orders_output_silver, '_checkpoint')

orders_silver_query = (
    df_orders_silver.writeStream
    .format("parquet")
    .outputMode("append")
    .queryName("orders_silver")
    .trigger(availableNow=True)
    .option("checkpointLocation", orders_checkpoint_silver)
    .option("compression", "snappy")
    .start(orders_output_silver)
)

##### 6.3.4. Unit Test: Implement Query Monitoring

In [88]:
print(f"Query ID: {orders_silver_query.id}")
print(f"Query Name: {orders_silver_query.name}")
print(f"Query Status: {orders_silver_query.status}")

Query ID: c2168002-5a49-4ce3-b389-c6a38bd8ff77
Query Name: orders_silver
Query Status: {'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}


In [89]:
orders_silver_query.awaitTermination()

#### 6.4. Create Gold Layer: Perform Aggregations
##### 6.4.1. Define a Query to Create a Business Report
Create a new Gold table using the PySpark API. The table should report the number of Products sold per Category each Month. The results should include the Month, the Product Category, and the Number of Products sold, sorted by the month in which the orders were placed (e.g., January, February, March).

In [90]:
df_orders_by_product_category_gold = (
    spark.readStream.format("parquet").load(orders_output_silver)
    .join(df_dim_products, "product_key")
    .join(df_dim_date, df_dim_date.date_key.cast(IntegerType()) == col("order_date_key").cast(IntegerType()))
    .groupBy("month_of_year", "category", "month_name")
    .agg(count("product_key").alias("product_count"))
    .orderBy(asc("month_of_year"), desc("product_count"))
)

In [91]:
df_orders_by_product_category_gold.printSchema()

root
 |-- month_of_year: byte (nullable = true)
 |-- category: string (nullable = true)
 |-- month_name: string (nullable = true)
 |-- product_count: long (nullable = false)



##### 6.4.2. Write the Streaming data to Memory in "Complete" mode

In [92]:
orders_gold_query = (
    df_orders_by_product_category_gold.writeStream \
    .format("memory") \
    .outputMode("complete") \
    .queryName("fact_orders_by_product_category")
    .start()
)
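In `complete` output mode, Spark writes the entire updated result table to the sink after every trigger, rather than only the rows that are new. The plain-Python sketch below imitates that behaviour for a running count per (month, category) pair. It is a conceptual analogy, not Spark code: `run_complete_mode` is a hypothetical helper and the micro-batches are made-up data.

```python
from collections import Counter


def run_complete_mode(micro_batches):
    """Return the full result table emitted after each micro-batch."""
    running_counts = Counter()
    emitted_tables = []
    for batch in micro_batches:
        running_counts.update(batch)  # fold the new rows into the running aggregate
        emitted_tables.append(dict(running_counts))  # emit the whole table, not a delta
    return emitted_tables


# Each micro-batch is a list of (month, category) pairs, one per product sold.
batches = [
    [("January", "Beverages"), ("March", "Beverages")],
    [("March", "Beverages"), ("March", "Candy")],
]
for table in run_complete_mode(batches):
    print(table)
```

Re-emitting the whole table is what allows the `memory` sink to be queried as an ordinary table once at least one batch has been processed.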

In [93]:
wait_until_stream_is_ready(orders_gold_query, 1)

The stream has processed 1 batchs
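The `wait_until_stream_is_ready` helper is defined earlier in the notebook and is not reproduced here. As a rough idea of how such a wait can work, the sketch below polls a query's `recentProgress` list until enough progress updates have been reported. `wait_for_batches` and its parameters are hypothetical; this is not the lab's helper, and it adds a timeout that the lab's helper may not have.

```python
import time


def wait_for_batches(query, min_batches=1, poll_seconds=1.0, timeout_seconds=60.0):
    """Block until the query has reported at least min_batches progress updates.

    'query' is expected to expose a 'recentProgress' list, as a PySpark
    StreamingQuery does. Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while len(query.recentProgress) < min_batches:
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Only {len(query.recentProgress)} batch(es) reported "
                f"after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)
    return len(query.recentProgress)
```

With a running query, a call might look like `wait_for_batches(orders_gold_query, min_batches=1)`.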


##### 6.4.3. Query the Gold Data from Memory

In [94]:
df_fact_orders_by_product_category = spark.sql("SELECT * FROM fact_orders_by_product_category")
df_fact_orders_by_product_category.printSchema()

root
 |-- month_of_year: byte (nullable = true)
 |-- category: string (nullable = true)
 |-- month_name: string (nullable = true)
 |-- product_count: long (nullable = false)



##### 6.4.4. Create the Final Selection

In [95]:
df_fact_orders_by_product_category_gold_final = df_fact_orders_by_product_category \
.select(col("month_name").alias("Month"), \
        col("category").alias("Product Category"), \
        col("product_count").alias("Product Count")) \
.orderBy(asc("month_of_year"), desc("Product Count"))

##### 6.4.5. Load the Final Results into a New Table and Display the Results

In [96]:
df_fact_orders_by_product_category_gold_final.write.saveAsTable(f"{dest_database}.fact_orders_by_product_category", mode="overwrite")
spark.sql(f"SELECT * FROM {dest_database}.fact_orders_by_product_category").toPandas()

Unnamed: 0,Month,Product Category,Product Count
0,March,Beverages,7
1,March,Sauces,1
2,March,Dried Fruit & Nuts,1
3,March,"Jams, Preserves",1
4,March,Candy,1
5,March,Condiments,1
6,March,Baked Goods & Mixes,1
7,January,Dried Fruit & Nuts,4
8,January,Beverages,3
9,January,Baked Goods & Mixes,1


### 7.0. Use PySpark Structured Streaming to Process (Hot Path) <span style="color:darkred">Inventory Transactions</span> Fact Data
#### 7.1. Verify the location of the source data files on the file system

In [97]:
get_file_info(inventory_trans_stream_dir)

Unnamed: 0,name,size,modification_time
0,northwind_inventory_transactions_01.json,7656,2025-04-21 01:31:03.085403919
1,northwind_inventory_transactions_02.json,7590,2025-04-21 01:31:03.085463047
2,northwind_inventory_transactions_03.json,7587,2025-04-21 01:31:03.085515976


#### 7.2. Create the Bronze Layer: Stage <span style="color:darkred">Inventory Transactions Fact table</span> Data
##### 7.2.1. Read "Raw" JSON file data into a Stream

In [99]:
df_inventory_trans_bronze = (
    spark.readStream
    # Load data from 'inventory_trans_stream_dir', one file per micro-batch
    .option("maxFilesPerTrigger", 1)
    .option("multiLine", "true")
    .json(inventory_trans_stream_dir)
)

df_inventory_trans_bronze.isStreaming

True

##### 7.2.2. Write the Streaming Data to a Parquet file

In [100]:
inventory_trans_checkpoint_bronze = os.path.join(inventory_trans_output_bronze, '_checkpoint')

inventory_trans_bronze_query = (
    df_inventory_trans_bronze
    # Add Current Timestamp and Input Filename columns for Traceability
    .withColumn("receipt_time", current_timestamp())
    .withColumn("source_file", input_file_name())
    # Write the stream to 'inventory_trans_output_bronze' in 'append' mode
    .writeStream
    .format("parquet")
    .outputMode("append")
    .queryName("inventory_trans_bronze")
    .trigger(availableNow=True)
    .option("checkpointLocation", inventory_trans_checkpoint_bronze)
    .option("compression", "snappy")
    .start(inventory_trans_output_bronze)
)

##### 7.2.3. Unit Test: Implement Query Monitoring

In [101]:
print(f"Query ID: {inventory_trans_bronze_query.id}")
print(f"Query Name: {inventory_trans_bronze_query.name}")
print(f"Query Status: {inventory_trans_bronze_query.status}")

Query ID: 66d7e64d-f4c5-409a-ac28-1af36c504f2e
Query Name: inventory_trans_bronze
Query Status: {'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}


In [102]:
inventory_trans_bronze_query.awaitTermination()

#### 7.3. Create the Silver Layer: Integrate "Cold-path" Data & Make Transformations
##### 7.3.1. Prepare Role-Playing Dimension Primary and Business Keys

In [103]:
# Copy df_dim_date once per role, renaming the 'date_key' and 'full_date' columns.
df_dim_created_date = df_dim_date.selectExpr("date_key as created_date_key", "full_date as created_full_date")
df_dim_modified_date = df_dim_date.selectExpr("date_key as modified_date_key", "full_date as modified_full_date")

##### 7.3.2. Define Silver Query to Join Streaming with Batch Data

In [105]:
df_inventory_trans_silver = (
    spark.readStream.format("parquet").load(inventory_trans_output_bronze)
    # Join to the dim_products dimension
    .join(df_dim_products.select("product_id", "product_key"), on="product_id", how="inner")
    # Join to the dim_created_date dimension
    .join(df_dim_created_date, expr("created_full_date = CAST(transaction_created_date AS DATE)"), "inner")
    # Join to the dim_modified_date dimension
    .join(df_dim_modified_date, expr("modified_full_date = CAST(transaction_modified_date AS DATE)"), "left_outer")
    # Select the appropriate columns
    .selectExpr(
        "inventory_transaction_id",
        "product_key",
        "transaction_type",
        "quantity",
        "created_date_key",
        "modified_date_key",
        "receipt_time",
        "source_file"
    )
)

In [106]:
df_inventory_trans_silver.isStreaming

True

In [107]:
df_inventory_trans_silver.printSchema()

root
 |-- inventory_transaction_id: long (nullable = true)
 |-- product_key: integer (nullable = false)
 |-- transaction_type: string (nullable = true)
 |-- quantity: long (nullable = true)
 |-- created_date_key: integer (nullable = true)
 |-- modified_date_key: integer (nullable = true)
 |-- receipt_time: timestamp (nullable = true)
 |-- source_file: string (nullable = true)



##### 7.3.3. Write the Transformed Streaming data to the Data Lakehouse

In [108]:
inventory_trans_checkpoint_silver = os.path.join(inventory_trans_output_silver, '_checkpoint')

inventory_trans_silver_query = (
    df_inventory_trans_silver.writeStream
    # Write the stream, in 'parquet' format, to 'inventory_trans_output_silver' in 'append' mode
    .format("parquet")
    .outputMode("append")
    .queryName("inventory_trans_silver")
    .trigger(availableNow=True)
    .option("checkpointLocation", inventory_trans_checkpoint_silver)
    .option("compression", "snappy")
    .start(inventory_trans_output_silver)
)

##### 7.3.4. Unit Test: Implement Query Monitoring

In [109]:
print(f"Query ID: {inventory_trans_silver_query.id}")
print(f"Query Name: {inventory_trans_silver_query.name}")
print(f"Query Status: {inventory_trans_silver_query.status}")

Query ID: fb0c5b6f-8aff-45d6-80cb-46b3ad4beff1
Query Name: inventory_trans_silver
Query Status: {'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}


In [110]:
inventory_trans_silver_query.awaitTermination()

#### 7.4. Create Gold Layer: Perform Aggregations
##### 7.4.1. Define a Query to Create a Business Report
Create a new Gold table using the PySpark API. The table should include the total quantity of the inventory transactions placed per Product, along with the Inventory Transaction Type and the Product Name.

In [111]:
df_fact_inventory_trans_by_product_gold = (
    spark.readStream.format("parquet").load(inventory_trans_output_silver)
    # Join to the df_dim_products dimension
    .join(df_dim_products.select("product_key", "product_name"), on="product_key")
    # Join to the df_dim_date dimension on the 'created_date_key'
    .join(df_dim_date.selectExpr("date_key as created_date_key", "calendar_quarter"), on="created_date_key")
    # Group by the 'calendar_quarter', 'transaction_type', and 'product_name' columns
    .groupBy("calendar_quarter", "transaction_type", "product_name")
    # Sum the 'quantity' column to create the 'Total_Quantity' column
    .agg(sum("quantity").alias("Total_Quantity"))
    # Order by the 'Total_Quantity' column, largest first
    .orderBy(col("Total_Quantity").desc())
)





##### 7.4.2. Write the Streaming data to Memory in "Complete" mode

In [112]:
inventory_trans_gold_query = (
    df_fact_inventory_trans_by_product_gold.writeStream
    # Create the new "fact_inventory_trans_by_product" in-memory query
    .format("memory")
    .outputMode("complete")
    .queryName("fact_inventory_trans_by_product")
    .start()
)

In [113]:
wait_until_stream_is_ready(inventory_trans_gold_query, 1)

The stream has processed 1 batchs


##### 7.4.3. Query the Gold Data from Memory

In [114]:
df_fact_inventory_trans_by_product = spark.sql("SELECT * FROM fact_inventory_trans_by_product")
df_fact_inventory_trans_by_product.printSchema()

root
 |-- calendar_quarter: byte (nullable = true)
 |-- transaction_type: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- Total_Quantity: long (nullable = true)



##### 7.4.4. Create the Final Selection

In [115]:
df_fact_inventory_trans_by_product_gold_final = (
    df_fact_inventory_trans_by_product
    # Select 'calendar_quarter' as 'Quarter_Created', 'transaction_type' as 'Transaction',
    # 'product_name' as 'Product', and 'Total_Quantity',
    # ordered by 'Total_Quantity' (largest first).
    .selectExpr(
        "calendar_quarter as Quarter_Created",
        "transaction_type as Transaction",
        "product_name as Product",
        "Total_Quantity"
    )
    .orderBy(col("Total_Quantity").desc())
)


##### 7.4.5. Load the Final Results into a New Table and Display the Results

In [116]:
df_fact_inventory_trans_by_product_gold_final.write.saveAsTable(f"{dest_database}.fact_inventory_trans_by_product", mode="overwrite")
spark.sql(f"SELECT * FROM {dest_database}.fact_inventory_trans_by_product").toPandas()

Unnamed: 0,Quarter_Created,Transaction,Product,Total_Quantity
0,2,Sold,Northwind Traders Clam Chowder,60
1,1,Purchased,Northwind Traders Mustard,60
2,2,Purchased,Northwind Traders Clam Chowder,50
3,1,Sold,Northwind Traders Chocolate Biscuits Mix,50
4,2,Purchased,Northwind Traders Mozzarella,50
...,...,...,...,...
72,1,Sold,Northwind Traders Green Tea,200
73,1,Purchased,Northwind Traders Chocolate,200
74,2,Sold,Northwind Traders Cajun Seasoning,30
75,2,Sold,Northwind Traders Dried Apples,30


### 8.0. Use PySpark Structured Streaming to Process (Hot Path) <span style="color:darkred">Purchase Orders</span> Fact Data
#### 8.1. Verify the location of the source data files on the file system

In [117]:
get_file_info(purchase_orders_stream_dir)

Unnamed: 0,name,size,modification_time
0,northwind_purchase_orders_01.json,11245,2025-04-21 01:31:03.085878849
1,northwind_purchase_orders_02.json,10651,2025-04-21 01:31:03.085936785
2,northwind_purchase_orders_03.json,10471,2025-04-21 01:31:03.085988998


#### 8.2. Create the Bronze Layer: Stage <span style="color:darkred">Purchase Orders Fact table</span> Data
##### 8.2.1. Read "Raw" JSON file data into a Stream

In [140]:
df_purchase_orders_bronze = (
    spark.readStream
    # Load data from 'purchase_orders_stream_dir' using an explicit schema
    .option("multiLine", "true")
    .schema("po_number STRING, po_detail_id STRING, product_id LONG, supplier_id LONG, submitted_by INT, created_by INT, approved_by INT, submitted_date STRING, creation_date STRING, approved_date STRING, date_received STRING, po_detail_quantity INT, po_detail_unit_cost DOUBLE, list_price DOUBLE")
    .json(purchase_orders_stream_dir)
)

df_purchase_orders_bronze.isStreaming

True

##### 8.2.2. Write the Streaming Data to a Parquet file

In [146]:
purchase_orders_checkpoint_bronze = os.path.join(purchase_orders_output_bronze, '_checkpoint')

purchase_orders_bronze_query = (
    df_purchase_orders_bronze
    # Rename 'list_price' so it is distinct from the 'list_price' column in dim_products
    .withColumnRenamed("list_price", "bronze_list_price")
    # Add Current Timestamp and Input Filename columns for Traceability
    .withColumn("receipt_time", current_timestamp())
    .withColumn("source_file", input_file_name())
    # Write the stream to 'purchase_orders_output_bronze' in 'append' mode
    .writeStream
    .format("parquet")
    .outputMode("append")
    .queryName("purchase_orders_bronze")
    .trigger(availableNow=True)
    .option("checkpointLocation", purchase_orders_checkpoint_bronze)
    .option("compression", "snappy")
    .start(purchase_orders_output_bronze)
)

##### 8.2.3. Unit Test: Implement Query Monitoring

In [147]:
print(f"Query ID: {purchase_orders_bronze_query.id}")
print(f"Query Name: {purchase_orders_bronze_query.name}")
print(f"Query Status: {purchase_orders_bronze_query.status}")

Query ID: 88844c1f-1eef-44da-9770-c4a54376846a
Query Name: purchase_orders_bronze
Query Status: {'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}


In [148]:
purchase_orders_bronze_query.awaitTermination()

#### 8.3. Create the Silver Layer: Integrate "Cold-path" Data & Make Transformations
##### 8.3.1. Prepare Role-Playing Dimension Primary and Business Keys

In [149]:
df_dim_created_by = df_dim_employees.selectExpr("employee_key as created_by_key", "employee_id as created_by_id")
df_dim_approved_by = df_dim_employees.selectExpr("employee_key as approved_by_key", "employee_id as approved_by_id")
df_dim_submitted_by = df_dim_employees.selectExpr("employee_key as submitted_by_key", "employee_id as submitted_by_id")

df_dim_submitted_date = df_dim_date.selectExpr("date_key as submitted_date_key", "full_date as submitted_full_date")
df_dim_creation_date = df_dim_date.selectExpr("date_key as creation_date_key", "full_date as creation_full_date")
df_dim_approved_date = df_dim_date.selectExpr("date_key as approved_date_key", "full_date as approved_full_date")
df_dim_date_received = df_dim_date.selectExpr("date_key as date_received_key", "full_date as date_received_full_date")
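
The three employee roles above follow one aliasing pattern, so the `selectExpr` strings can also be generated. This is only a sketch: `role_playing_exprs` and `employee_role_exprs` are hypothetical names, and the pattern is applied to the employee roles only because the date aliases above are not named uniformly.

```python
from typing import Dict, List


def role_playing_exprs(role: str, key_col: str, id_col: str) -> List[str]:
    """Build selectExpr strings that alias a dimension's keys for one role."""
    return [f"{key_col} as {role}_key", f"{id_col} as {role}_id"]


# One list of alias expressions per employee role used by the purchase orders
employee_roles = ["created_by", "approved_by", "submitted_by"]
employee_role_exprs: Dict[str, List[str]] = {
    role: role_playing_exprs(role, "employee_key", "employee_id")
    for role in employee_roles
}
```

With these in place, `df_dim_employees.selectExpr(*employee_role_exprs["created_by"])` would produce the same columns as `df_dim_created_by`.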


##### 8.3.2. Define Silver Query to Join Streaming with Batch Data

In [152]:
#df_purchase_orders_silver = (spark.readStream.format("parquet").load(purchase_orders_output_bronze) \
    # .join 'inner' to the df_dim_products dimension
    # .join 'inner' to the df_dim_suppliers
    # .join 'left_outer' to the df_dim_created_by dimension
    # .join 'left_outer' to the df_dim_approved_by dimension
    # .join 'left_outer' to the df_dim_submitted_by dimension
    # .join 'inner' to the df_dim_submitted_date dimension
    # .join 'inner' to the df_dim_creation_date
    # .join 'left_outer' to the df_dim_approved_date dimension
    # .join 'left_outer' to the df_dim_date_received dimension
    # .select() the appropriate columns from the 'purchase orders bronze' stream
  
from pyspark.sql.functions import col, expr

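# Re-read the Bronze Parquet output as a stream. This rebinds 'df_purchase_orders_bronze'
# from the raw JSON stream to the Bronze layer.
df_purchase_orders_bronze = (
    spark.readStream
        .format("parquet")
        .load(purchase_orders_output_bronze)
        # No-op when the Bronze files already contain 'bronze_list_price' (renamed in 8.2.2)
        .withColumnRenamed("list_price", "bronze_list_price")
)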

df_purchase_orders_silver = (
    df_purchase_orders_bronze
        .join(df_dim_products, "product_id", "inner")
        .join(df_dim_suppliers, "supplier_id", "inner")
        .join(df_dim_created_by, df_dim_created_by.created_by_id == col("created_by"), "left_outer")
        .join(df_dim_approved_by, df_dim_approved_by.approved_by_id == col("approved_by"), "left_outer")
        .join(df_dim_submitted_by, df_dim_submitted_by.submitted_by_id == col("submitted_by"), "left_outer")
        .join(df_dim_submitted_date, expr("CAST(submitted_date AS DATE) = submitted_full_date"), "inner")
        .join(df_dim_creation_date, expr("CAST(creation_date AS DATE) = creation_full_date"), "inner")
        .join(df_dim_approved_date, expr("CAST(approved_date AS DATE) = approved_full_date"), "left_outer")
        .join(df_dim_date_received, expr("CAST(date_received AS DATE) = date_received_full_date"), "left_outer")
        .selectExpr(
            "po_number",
            "po_detail_id",
            "product_key",
            "supplier_key",
            "created_by_key",
            "approved_by_key",
            "submitted_by_key",
            "creation_date_key",
            "approved_date_key",
            "submitted_date_key",
            "date_received_key",
            "po_detail_quantity",
            "po_detail_unit_cost",
            "list_price as product_list_price",
            "bronze_list_price",
            "receipt_time",
            "source_file"
        )
)



In [153]:
df_purchase_orders_silver.isStreaming

True

In [154]:
df_purchase_orders_silver.printSchema()

root
 |-- po_number: string (nullable = true)
 |-- po_detail_id: string (nullable = true)
 |-- product_key: integer (nullable = false)
 |-- supplier_key: integer (nullable = false)
 |-- created_by_key: integer (nullable = true)
 |-- approved_by_key: integer (nullable = true)
 |-- submitted_by_key: integer (nullable = true)
 |-- creation_date_key: integer (nullable = true)
 |-- approved_date_key: integer (nullable = true)
 |-- submitted_date_key: integer (nullable = true)
 |-- date_received_key: integer (nullable = true)
 |-- po_detail_quantity: integer (nullable = true)
 |-- po_detail_unit_cost: double (nullable = true)
 |-- product_list_price: decimal(19,4) (nullable = true)
 |-- bronze_list_price: double (nullable = true)
 |-- receipt_time: timestamp (nullable = true)
 |-- source_file: string (nullable = true)
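
The join types chosen in 8.3.2 affect how many rows reach the Silver layer. An `inner` join drops purchase orders that have no matching dimension row, while a `left_outer` join keeps them with null keys. The pure-Python sketch below shows that difference using made-up sample rows (not lab data) and a hypothetical `lookup_join` helper that assumes dimension keys are unique.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def lookup_join(facts: List[Row], dim: List[Row], fact_col: str,
                dim_col: str, how: str = "inner") -> List[Row]:
    """Join fact rows to a dimension with unique keys, SQL-style (NULL never matches)."""
    index = {d[dim_col]: d for d in dim}
    dim_cols = list(dim[0].keys()) if dim else []
    joined = []
    for fact in facts:
        key = fact.get(fact_col)
        match = index.get(key) if key is not None else None
        if match is not None:
            joined.append({**fact, **match})
        elif how == "left_outer":
            # Keep the fact row and fill the dimension columns with None
            joined.append({**fact, **{c: None for c in dim_cols}})
    return joined


# Made-up rows: PO-2 has not been approved yet
orders = [
    {"po_number": "PO-1", "approved_by": 2},
    {"po_number": "PO-2", "approved_by": None},
]
approvers = [{"approved_by_key": 10, "approved_by_id": 2}]

inner_rows = lookup_join(orders, approvers, "approved_by", "approved_by_id", "inner")
outer_rows = lookup_join(orders, approvers, "approved_by", "approved_by_id", "left_outer")
```

Here `inner_rows` contains only `PO-1`, whereas `outer_rows` keeps both orders and leaves `approved_by_key` as `None` for `PO-2`.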



##### 8.3.3. Write the Transformed Streaming data to the Data Lakehouse

In [155]:
purchase_orders_checkpoint_silver = os.path.join(purchase_orders_output_silver, '_checkpoint')

purchase_orders_silver_query = (
    df_purchase_orders_silver.writeStream
    # writeStream, in 'parquet' format, to 'purchase_orders_output_silver' in 'append' mode
    .format("parquet")
    .outputMode("append")
    .queryName("purchase_orders_silver")
    .trigger(availableNow=True)
    .option("checkpointLocation", purchase_orders_checkpoint_silver)
    .option("compression", "snappy")
    .start(purchase_orders_output_silver)
)

##### 8.3.4. Unit Test: Implement Query Monitoring

In [156]:
print(f"Query ID: {purchase_orders_silver_query.id}")
print(f"Query Name: {purchase_orders_silver_query.name}")
print(f"Query Status: {purchase_orders_silver_query.status}")

Query ID: 68166eb8-82dd-4c0b-8be7-8926472e24c0
Query Name: purchase_orders_silver
Query Status: {'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}


In [157]:
purchase_orders_silver_query.awaitTermination()

#### 8.4. Create Gold Layer: Perform Aggregations
##### 8.4.1. Define a Query to Create a Business Report
Create a new Gold table using the PySpark API. For each combination of Supplier and Product, the table should report the Supplier's Company Name, the Product Name, and the Total Quantity, Total Unit Cost, and Total List Price summed across all purchase orders.

In [171]:
from pyspark.sql.functions import sum as _sum  # alias keeps the aggregate distinct from Python's built-in sum

df_fact_pos_products_per_supplier_gold = (
    spark.readStream.format("parquet").load(purchase_orders_output_silver)
        # .join to the 'df_dim_products' dimension
        .join(df_dim_products, "product_key")
        # .join to the 'df_dim_suppliers' dimension
        .join(df_dim_suppliers, "supplier_key")
        # The watermark does not bound state here: 'complete' output mode keeps every aggregate group
        .withWatermark("receipt_time", "10 minutes")
        # .groupBy 'company' and 'product_name'
        .groupBy("company", "product_name")
        .agg(
            _sum("po_detail_quantity").alias("Total Quantity"),
            _sum("po_detail_unit_cost").alias("Total Unit Cost"),
            # 'bronze_list_price' is the purchase order's own list price carried over from Bronze
            _sum("bronze_list_price").alias("Total List Price")
        )
    # Note: the template's "orderBy 'Total Quantity' in descending order" step is not applied in this query
)

##### 8.4.2. Write the Streaming data to Memory in "Complete" mode

In [172]:
purchase_orders_checkpoint_gold = os.path.join(purchase_orders_output_gold, "_checkpoint")

purchase_orders_gold_query = (
    df_fact_pos_products_per_supplier_gold.writeStream
        # create the new "fact_pos_products_per_supplier" query
        .format("memory")
        .outputMode("complete")  # Re-emits the full aggregated result table on every trigger
        .option("checkpointLocation", purchase_orders_checkpoint_gold)
        .queryName("fact_pos_products_per_supplier")
        .trigger(availableNow=True)
        .start()
)

In [173]:
wait_until_stream_is_ready(purchase_orders_gold_query, 1)

The stream has processed 1 batchs


##### 8.4.3. Query the Gold Data from Memory

In [174]:
df_fact_pos_products_per_supplier = spark.sql("SELECT * FROM fact_pos_products_per_supplier")
df_fact_pos_products_per_supplier.printSchema()

root
 |-- company: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- Total Quantity: long (nullable = true)
 |-- Total Unit Cost: double (nullable = true)
 |-- Total List Price: double (nullable = true)



##### 8.4.4. Create the Final Selection

In [176]:
# .select() the 'company' column as 'Supplier', the 'product_name' column as 'Product',
# along with the 'Total Quantity', 'Total Unit Cost', and 'Total List Price' columns.
# Backticks are required because these column names contain spaces.
df_fact_pos_products_per_supplier_gold_final = df_fact_pos_products_per_supplier.selectExpr(
    "company as Supplier",
    "product_name as Product",
    "`Total Quantity`",
    "`Total Unit Cost`",
    "`Total List Price`"
)


##### 8.4.5. Load the Final Results into a New Table and Display the Results

In [177]:
df_fact_pos_products_per_supplier_gold_final.write.saveAsTable(f"{dest_database}.fact_pos_products_per_supplier", mode="overwrite")
spark.sql(f"SELECT * FROM {dest_database}.fact_pos_products_per_supplier").toPandas()

|   | Supplier | Product | Total Quantity | Total Unit Cost | Total List Price |
|---|----------|---------|----------------|-----------------|------------------|
| 0 | Supplier H | Northwind Traders Chai | | | |
| 1 | Supplier C | Northwind Traders Tomato Sauce | | | |
| 2 | Supplier B | Northwind Traders Scones | | | |
| 3 | Supplier C | Northwind Traders Cajun Seasoning | | | |
| 4 | Supplier B | Northwind Traders Fruit Cocktail | | | |
| 5 | Supplier B | Northwind Traders Boysenberry Spread | | | |
| 6 | Supplier E | Northwind Traders Gnocchi | | | |
| 7 | Supplier F | Northwind Traders Brownie Mix | | | |
| 8 | Supplier B | Northwind Traders Dried Apples | | | |
| 9 | Supplier B | Northwind Traders Almonds | | | |
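
The three totals are empty in the result above. `SUM` yields NULL when every input value in a group is NULL, so one possible way to investigate is to count the NULLs in the measure columns of the Silver data. The sketch below only builds the SQL text. `null_count_sql` and the `po_silver` view name are hypothetical, and whether NULL measures are the actual cause here is an assumption to verify.

```python
from typing import List


def null_count_sql(view_name: str, columns: List[str]) -> str:
    """Build a Spark SQL query that counts total rows and NULLs per column."""
    selects = [
        f"SUM(CASE WHEN `{c}` IS NULL THEN 1 ELSE 0 END) AS `{c}_nulls`"
        for c in columns
    ]
    return f"SELECT COUNT(*) AS total_rows, {', '.join(selects)} FROM {view_name}"


# Measure columns taken from the Silver schema; 'po_silver' is a hypothetical temp view name
query = null_count_sql(
    "po_silver",
    ["po_detail_quantity", "po_detail_unit_cost", "bronze_list_price"],
)
```

To run it before the Spark session is stopped, the Silver output could first be registered with `spark.read.parquet(purchase_orders_output_silver).createOrReplaceTempView("po_silver")`, followed by `spark.sql(query).show()`.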


### 9.0. Stop the Spark Session

In [178]:
spark.stop()