# Framework Ingestion Notebook Example

This notebook demonstrates the framework contract for ingestion tasks.
It reads from a source Delta table, transforms the data, and writes to a target Delta table.

**Framework Contract:**
- Accepts `task_key`, `control_table`, and `parameters` as inputs via widgets
- `parameters` is a JSON string containing `catalog`, `schema`, `source_table`, `target_table`, and `write_mode`
- In production, the framework sets these widgets when it runs the notebook as a task
- In this example, the widgets have default values so the notebook can run end-to-end without manual input
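
As a rough sketch of the caller side of this contract, the snippet below builds the three widget values as a plain dictionary. The helper name `build_widget_values` is hypothetical and not part of the framework; the values mirror this notebook's widget defaults.

```python
import json


def build_widget_values(task_key, control_table, parameters):
    """Return the widget name -> string value mapping this notebook expects.

    Hypothetical helper for illustration only. Widgets carry strings,
    so the parameters dict is serialized to a JSON string.
    """
    return {
        "task_key": task_key,
        "control_table": control_table,
        "parameters": json.dumps(parameters),
    }


# Values taken from this notebook's widget defaults
widget_values = build_widget_values(
    task_key="delta_table_ingestion",
    control_table="main.examples.etl_control",
    parameters={
        "catalog": "fe_ppark_demo",
        "schema": "lakeflow_job_metadata",
        "source_table": "source_customers",
        "target_table": "customers",
        "write_mode": "append",
    },
)
```

A mapping like this could be supplied as notebook task parameters or as the `arguments` of `dbutils.notebook.run`; how the framework actually passes the values is not shown in this notebook.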


In [None]:
import logging
import json
from pyspark.sql import functions as F

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Framework widgets - In production, these are set by the framework
# For this example, widgets have default values so it can run end-to-end
dbutils.widgets.text("task_key", "delta_table_ingestion", "Task Key")
dbutils.widgets.text("control_table", "main.examples.etl_control", "Control Table")
dbutils.widgets.text("parameters", '{"catalog": "fe_ppark_demo", "schema": "lakeflow_job_metadata", "source_table": "source_customers", "target_table": "customers", "write_mode": "append"}', "Parameters")

# Get widget values
task_key = dbutils.widgets.get("task_key")
control_table = dbutils.widgets.get("control_table")
parameters_str = dbutils.widgets.get("parameters")

if not task_key:
    raise ValueError("task_key widget is required")
if not control_table:
    raise ValueError("control_table widget is required")
if not parameters_str:
    raise ValueError("parameters widget is required")

logger.info(f"Processing task_key: {task_key}")

# Parse parameters JSON
try:
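    # dbutils.widgets.get always returns a string, so parse it unconditionally
    parameters = json.loads(parameters_str)
    if not isinstance(parameters, dict):
        raise ValueError(f"parameters for task_key '{task_key}' must be a JSON object")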
    
    # Extract catalog, schema, and table names
    catalog = parameters.get('catalog')
    schema = parameters.get('schema')
    source_table = parameters.get('source_table')
    target_table = parameters.get('target_table')
    write_mode = parameters.get('write_mode', 'overwrite')
    
    # Validate required fields
    if not catalog:
        raise ValueError(f"Missing 'catalog' in parameters for task_key '{task_key}'")
    if not schema:
        raise ValueError(f"Missing 'schema' in parameters for task_key '{task_key}'")
    if not source_table:
        raise ValueError(f"Missing 'source_table' in parameters for task_key '{task_key}'")
    if not target_table:
        raise ValueError(f"Missing 'target_table' in parameters for task_key '{task_key}'")
    
    logger.info("Successfully parsed parameters")
    logger.info(f"Catalog: {catalog}, Schema: {schema}")
    logger.info(f"Source table: {source_table}")
    logger.info(f"Target table: {target_table}")
    logger.info(f"Write mode: {write_mode}")
    
except Exception as e:
    logger.error(f"Failed to parse parameters: {str(e)}")
    raise


## Parse and Validate Configurations


In [None]:
# Configuration was already validated above, when the parameters widget was parsed
logger.info("Configuration validated successfully")
logger.info(f"Source: {catalog}.{schema}.{source_table}")
logger.info(f"Target: {catalog}.{schema}.{target_table}")


## Prepare Source Data

If the source table doesn't exist, create sample data for demonstration.
In production, this would read directly from the configured source table.


In [None]:
from pyspark.sql.types import StructType, StructField, StringType

source_table_full = f"{catalog}.{schema}.{source_table}"

logger.info(f"Reading from source table: {source_table_full}")

# Try to read from source table, if it doesn't exist, create sample data
df = None
record_count = 0

try:
    df = spark.table(source_table_full)
    record_count = df.count()
    logger.info(f"Successfully read {record_count} records from source table")
    df.show(5, truncate=False)
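except Exception as e:
    # Any read failure is treated as a missing table; the underlying error is logged for diagnosis
    logger.warning(f"Could not read source table {source_table_full} ({e}). Creating sample data for demonstration.")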
    
    # Create sample source data
    sample_data = [
        ("CUST001", "John", "Doe", "john.doe@example.com", "2024-01-15", "active"),
        ("CUST002", "Jane", "Smith", "jane.smith@example.com", "2024-01-16", "active"),
        ("CUST003", "Bob", "Johnson", "bob.johnson@example.com", "2024-01-17", "inactive"),
        ("CUST004", "Alice", "Williams", "alice.williams@example.com", "2024-01-18", "active"),
        ("CUST005", "Charlie", "Brown", "charlie.brown@example.com", "2024-01-19", "active")
    ]
    
    # Named sample_schema so it does not overwrite the `schema` name parsed from parameters,
    # which is still needed to build the target table name
    sample_schema = StructType([
        StructField("customer_id", StringType(), True),
        StructField("first_name", StringType(), True),
        StructField("last_name", StringType(), True),
        StructField("email", StringType(), True),
        StructField("registration_date", StringType(), True),
        StructField("status", StringType(), True)
    ])
    
    df = spark.createDataFrame(sample_data, sample_schema)
    
    # Create the source table for demonstration purposes
    df.write.format("delta").mode("overwrite").saveAsTable(source_table_full)
    record_count = df.count()
    logger.info(f"Created sample source table with {record_count} records")
    df.show(5, truncate=False)


## Transform Data


In [None]:
# Transform data: add metadata columns and apply business logic
df_transformed = (
    df.withColumn("ingestion_timestamp", F.current_timestamp())
    .withColumn("task_key", F.lit(task_key))
    .withColumn("full_name", F.concat(F.col("first_name"), F.lit(" "), F.col("last_name")))
    .filter(F.col("status") == "active")  # Example: keep only active customers
)

record_count_transformed = df_transformed.count()
logger.info(f"Transformed data: {record_count_transformed} records (filtered from {record_count} source records)")
logger.info("Sample transformed data:")
df_transformed.select("customer_id", "full_name", "email", "status", "task_key", "ingestion_timestamp").show(5, truncate=False)


## Write to Target

Write transformed data to the target Delta table.
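
Because `write_mode` comes from the parameters JSON, a typo would only surface when the write runs. One possible guard, sketched below, is to normalize and check the value first. The helper `normalize_write_mode` and the `ALLOWED_WRITE_MODES` set are assumptions for illustration and are not part of the framework; the set lists the common save-mode strings accepted by Spark's `DataFrameWriter.mode`.

```python
# Common save modes accepted by DataFrameWriter.mode(); assumed sufficient for this notebook
ALLOWED_WRITE_MODES = {"append", "overwrite", "error", "errorifexists", "ignore"}


def normalize_write_mode(write_mode, default="overwrite"):
    """Hypothetical helper: return a lower-cased, validated save mode.

    Falls back to `default` when write_mode is missing or empty, matching
    the notebook's default of "overwrite".
    """
    mode = (write_mode or default).strip().lower()
    if mode not in ALLOWED_WRITE_MODES:
        raise ValueError(
            f"Unsupported write_mode '{write_mode}'. "
            f"Expected one of: {sorted(ALLOWED_WRITE_MODES)}"
        )
    return mode
```

Calling `normalize_write_mode(write_mode)` before the write would turn a value such as `"upsert"` into an immediate `ValueError` rather than a failure inside the Spark write.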


In [None]:
target_table_full = f"{catalog}.{schema}.{target_table}"

logger.info(f"Writing to target table: {target_table_full}")
logger.info(f"Write mode: {write_mode}")
logger.info(f"Records to write: {record_count_transformed}")

# Write to target Delta table
try:
    df_transformed.write \
        .format("delta") \
        .mode(write_mode) \
        .option("mergeSchema", "true") \
        .saveAsTable(target_table_full)
    
    logger.info(f"✅ Successfully wrote {record_count_transformed} records to {target_table_full}")
    
    # Verify the write
    written_df = spark.table(target_table_full)
    written_count = written_df.count()
    logger.info(f"✅ Verified: {written_count} records in target table")
    written_df.select("customer_id", "full_name", "email", "status", "task_key", "ingestion_timestamp").show(5, truncate=False)
    
except Exception as e:
    logger.error(f"Failed to write to target table: {str(e)}")
    raise


## Summary

✅ Parameters parsed and validated  
✅ Source data read from Delta table (or created for demo)  
✅ Data transformed and enriched  
✅ Data written to target Delta table  

**Framework Contract:**  
This notebook demonstrates the expected contract:
- Accepts `task_key`, `control_table`, and `parameters` as inputs via widgets
- Parameters contain `catalog`, `schema`, `source_table`, `target_table`, and `write_mode`
- Validates configuration
- Reads from source Delta table
- Transforms data (adds metadata columns, applies business logic)
- Writes to target Delta table

**Widgets Used:**
- `task_key`: Unique identifier for this task
- `control_table`: Name of the control table containing job metadata
- `parameters`: JSON string with task parameters (catalog, schema, source_table, target_table, write_mode)

**Parameters Structure:**
The notebook expects parameters JSON with:
- `catalog`: Catalog name (e.g., "fe_ppark_demo")
- `schema`: Schema name (e.g., "lakeflow_job_metadata")
- `source_table`: Source table name (e.g., "source_customers")
- `target_table`: Target table name (e.g., "customers")
- `write_mode`: Write mode (defaults to "overwrite" if not specified)

The framework will set `task_key`, `control_table`, and `parameters` widgets when calling this notebook as a task.
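
The parameters structure above can also be captured in a small typed object, which keeps the required-field checks in one place. This is a minimal sketch under stated assumptions: `IngestionParameters` and `parse_parameters` are hypothetical names, not part of the framework, and the validation rules simply mirror the checks performed in this notebook.

```python
import json
from dataclasses import dataclass

REQUIRED_KEYS = ("catalog", "schema", "source_table", "target_table")


@dataclass(frozen=True)
class IngestionParameters:
    """Hypothetical container for the parsed `parameters` widget value."""
    catalog: str
    schema: str
    source_table: str
    target_table: str
    write_mode: str = "overwrite"

    @property
    def source_table_full(self):
        return f"{self.catalog}.{self.schema}.{self.source_table}"

    @property
    def target_table_full(self):
        return f"{self.catalog}.{self.schema}.{self.target_table}"


def parse_parameters(parameters_str):
    """Parse the parameters JSON string and check the required keys."""
    params = json.loads(parameters_str)
    if not isinstance(params, dict):
        raise ValueError("parameters must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if not params.get(key)]
    if missing:
        raise ValueError(f"Missing required parameters: {', '.join(missing)}")
    return IngestionParameters(
        catalog=params["catalog"],
        schema=params["schema"],
        source_table=params["source_table"],
        target_table=params["target_table"],
        write_mode=params.get("write_mode") or "overwrite",
    )


# Example using the same names as this notebook's widget defaults
example = parse_parameters(
    '{"catalog": "fe_ppark_demo", "schema": "lakeflow_job_metadata", '
    '"source_table": "source_customers", "target_table": "customers"}'
)
print(example.source_table_full)
print(example.write_mode)
```

Keeping the schema name as an attribute of one object also avoids reusing a bare `schema` variable for anything else later in the notebook.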
