#### Incremental Workflow Control



##### List of Scala Objects for Incremental Workflow Control


| Object | Description | Notebook | Framework Library |
|--------|-------------|----------|-------------------|
| SchemaResolver | Resolves env, version, etc. for schemas | SchemaResolver | Yes |
| | | | |
| QueryProcessor | Constructs the SQL query for the incremental logic, hiding env, version, and similar details | QueryProcessor | Yes |
| | | | |
| Watermarks | Snapshots the data version of ACS tables for checkpoints | Watermarks-Control | Yes |
| Checkpoints | Controls pipeline run status and recovery | Watermarks-Control | Yes |
| | | | |
| TableResolver | Resolves env, version, source/target, dataset, etc. for tables | table_resolver | Yes |
| ACS | Auto-generated case object for all ACS tables | table_resolver | Yes |
| DAP | Auto-generated case object for all DAP tables | table_resolver | Yes |
| UDM | Auto-generated case object for all UDM Appendix tables | table_resolver | Yes |
| | | | |
| CdcType2Reader | A unified CDC Type-2 read function for baseline and incremental reads, returning deduplicated upserts and deletes | cdc_reader | Yes |
| CdcType1Reader | A unified CDC Type-1 read function for baseline and incremental reads, returning deduplicated upserts and deletes | cdc_reader | Yes |
| | | | |
| Registry | Pipeline metadata/lineage operations | Database-Ops | Maybe |
| DapOps | OPS schema operations | Database-Ops | No |
| DapSchemaManager | Schema maintenance | Database-Ops | No |
| | | | |
| DeltaReader | Creates the incremental logic and watermark and reads delta data; for reference only | Pipeline-Example | No |
| PipelineRunner | Runs the pipeline core logic, handling checkpoints with the same approach; for reference only | Pipeline-Example | No |



#### Usage 

#### 1. SchemaResolver




This object is responsible for retrieving various parameters from the bundle configuration file, including source and target information, environment, data version, pipeline name, and related settings.

**Key Components**:

SchemaResolver.SCHEMA_MAP
- Used by the QueryProcessor to correctly bind and resolve schemas.

SchemaResolver.ACS_SCHEMAS
- Defines the schemas used for reading ACS tables.

SchemaResolver.DAP_SCHEMAS
- Defines the schemas used for reading from and writing to DAP tables.


SchemaResolver.PIPELINE
- Defines the pipeline name. This value is used to resolve the version map and to set table ownership during write operations.

SchemaResolver.OPS_SCHEMA
- Specifies the DAP schema used for workflow-related operational tables (for example, "dap_ops").

SchemaResolver.OPS_TABLE_PREFIX
- Defines an optional prefix for operational (OPS) tables, primarily used for testing. When set, the prefix is applied to OPS table names to isolate test data.
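
To make the effect of `OPS_TABLE_PREFIX` concrete, here is a minimal sketch of how an optional prefix could be folded into an OPS table name. The `OpsTableNaming` object, the underscore separator, and the `checkpoints` table name are assumptions made for illustration; they are not taken from the framework.

```scala
// Hypothetical helper, not part of the framework: shows one way an optional
// prefix could be applied to an OPS table name to isolate test data.
object OpsTableNaming {
  def qualifiedName(opsSchema: String, prefix: Option[String], table: String): String = {
    // Treat a missing or blank prefix as "no prefix"
    val prefixed = prefix.map(_.trim).filter(_.nonEmpty) match {
      case Some(p) => s"${p}_$table" // assumed separator: underscore
      case None    => table
    }
    s"$opsSchema.$prefixed"
  }
}
```

With a prefix such as `test`, the production table `dap_ops.checkpoints` and the test table `dap_ops.test_checkpoints` would live side by side in the same schema.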





#### 2. QueryProcessor



**Features**

- Dynamic table_changes for tables with version_param.
- Optional per-table filter applied automatically.
- Multi-column joins handled correctly.
- Union across multiple tables.
- Dynamic catalog replacements like ${entity} supported.
- Runtime version map allows flexible filtering of selected tables.
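
The catalog replacement feature can be pictured with a small sketch of `${name}` substitution. This is an illustration under stated assumptions, not the QueryProcessor implementation: the `PlaceholderSketch` object, its `render` function, and the choice to leave unknown placeholders untouched are all hypothetical.

```scala
import scala.util.matching.Regex

// Hypothetical sketch of ${name} placeholder substitution in a SQL template.
object PlaceholderSketch {
  // Matches placeholders such as ${entity}
  private val Placeholder: Regex = """\$\{(\w+)\}""".r

  def render(template: String, catalogMap: Map[String, String]): String =
    Placeholder.replaceAllIn(
      template,
      // Unknown placeholders are left as-is (an assumption); quoteReplacement
      // keeps '$' and '\' in the replacement from being interpreted.
      m => Regex.quoteReplacement(catalogMap.getOrElse(m.group(1), m.matched))
    )
}
```

Leaving unresolved placeholders visible in the output makes a missing map entry easy to spot when the rendered query is reviewed in a SQL editor.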

---

**1. Functions: renderSqlTemplate, renderSqlTemplateExtended**

These functions can be used to check the generated SQL query before an actual test run: copy the query into a SQL editor and run it there to validate it first.

```scala


// Basic function: reads the master table by start/end version and the appendix tables as of the end version
def renderSqlTemplate(
  sqlTemplateConfig: String,
  tableVersionMap: Map[String, (Long, Long)] = Map.empty
): Unit = { ... }

// Extended function (experimental): extends the logic to return the row of the latest version if multiple rows exist
def renderSqlTemplateExtended(
  sqlTemplateConfig: String,
  tableVersionMap: Map[String, (Long, Long)] = Map.empty
): Unit = { ... }


```

**Parameters**:
- sqlTemplateConfig: The SQL-query configuration that defines the SQL query. It contains placeholders (catalog) that need to be replaced.
- tableVersionMap: A map for upstream tables: tableName → (startVersion, endVersion), where `startVersion` and `endVersion` define the inclusive version range to process for each upstream table. The map can be created using Watermarks.
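
As a rough illustration of how a `tableVersionMap` entry could end up in the rendered query, the sketch below turns an inclusive version range into a Delta `table_changes` call. The `VersionRangeSketch` object and the fallback for tables missing from the map are assumptions; the real QueryProcessor may render these clauses differently.

```scala
// Hypothetical sketch: map an inclusive (startVersion, endVersion) range
// to a table_changes(...) source clause.
object VersionRangeSketch {
  // tableName -> (startVersion, endVersion), both inclusive
  type VersionMap = Map[String, (Long, Long)]

  def sourceClause(table: String, versions: VersionMap): String =
    versions.get(table) match {
      case Some((start, end)) => s"table_changes('$table', $start, $end)"
      case None               => table // assumed fallback: read the table as-is
    }
}
```

For example, a map entry `"d_orgmaster" -> (3L, 7L)` would yield a clause that reads the changes committed in versions 3 through 7.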


---

**2. Function: runSqlAndSave (Optional)**

This function creates the Delta affected-PK table specified in the SQL config and saves the data to that table.

```scala


def runSqlAndSave(
  sqlTemplateConfig: String,
  tableVersionMap: Map[String, (Long, Long)] = Map.empty,
  targetTableName: String = "",
  dryRun: Boolean = false
): Unit = { ... }

```

**Parameters**:
- sqlTemplateConfig: The SQL-query configuration that defines the SQL query. It contains placeholders (catalog) that need to be replaced.
- tableVersionMap: A map for upstream tables: tableName → (startVersion, endVersion), where `startVersion` and `endVersion` define the inclusive version range to process for each upstream table. The map can be created using Watermarks.
- targetTableName: The target Delta table in which to save the output DataFrame.
- dryRun: Data is saved to the Delta table only when `dryRun` is `false` (the default).


---


**3. Schema Map Definition: SchemaResolver.SCHEMA_MAP**

This object wraps the catalog, schema, environment, and version variable management information, which is used to render the SQL query with the dynamic catalog variables.


```scala
val catalogMap = SchemaResolver.SCHEMA_MAP
```


---




#### 3. Watermarks



Watermarks provides centralized management of batch-level processing state for pipelines. It tracks the latest batch ID, maintains version mappings for source tables, and manages watermark lifecycle events (initialization and completion). This class enables consistent, incremental processing by capturing stable snapshots of source data and coordinating checkpoints across pipelines.

**1. Functions**

**Watermarks.latestBatchId**
- Returns the latest batch ID.

**Watermarks.getWatermarkForTable**
- Retrieves version map information for tables.

The version map is represented as:
```scala
table_name -> (startVersion, endVersion)
```

Usage options:
- Pass a list of table names to retrieve version mappings for those tables, or
- If no table list is provided, the version mappings are retrieved by pipeline name by default from the registry table.

**Watermarks.initializeWatermark**
- Used by the automation pipeline (PPL) to take a snapshot of the latest versions for all ACS tables as processing watermarks.
- This method also creates checkpoints for each pipeline, assigning an incremented batch ID that corresponds to the current, up-to-date watermark state.

**Watermarks.completeWatermark**
- Marks the watermark as complete by recording a successful completion state.
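
The watermark lifecycle described above can be sketched in memory as follows. Everything here is hypothetical: the `WatermarkSketch` object, the `Watermark` case class, and in particular the version-range convention (first batch reads `(end, end)`, later batches read `(previousEnd + 1, end)`, unchanged tables are omitted) are assumptions chosen for illustration, not the behavior of `Watermarks`.

```scala
// Hypothetical in-memory model of the watermark lifecycle.
object WatermarkSketch {
  final case class Watermark(
      batchId: Long,
      endVersions: Map[String, Long], // tableName -> latest version at snapshot time
      completed: Boolean = false
  )

  // Snapshot the latest table versions under the next batch ID
  def initialize(previous: Option[Watermark], latestVersions: Map[String, Long]): Watermark =
    Watermark(previous.map(_.batchId + 1).getOrElse(1L), latestVersions)

  // Record successful completion
  def complete(w: Watermark): Watermark = w.copy(completed = true)

  // Derive tableName -> (startVersion, endVersion) under the assumed convention
  def versionMap(previous: Option[Watermark], current: Watermark): Map[String, (Long, Long)] =
    current.endVersions.collect {
      // Keep only tables that are new or have versions beyond the previous snapshot
      case (table, end) if !previous.exists(_.endVersions.get(table).exists(_ >= end)) =>
        val start = previous.flatMap(_.endVersions.get(table)).map(_ + 1).getOrElse(end)
        (table, (start, end))
    }
}
```

Taking the snapshot once, at the start of the workflow, means every downstream pipeline in the same batch sees the same end versions even if the source tables keep changing while the batch runs.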


**2. Example**


2.1 PPL Pipeline

```scala

// Run this function as the first task in the PPL workflow
Watermarks.initializeWatermark()

// Run this function as the last task in the PPL workflow
Watermarks.completeWatermark()
```


2.2 Data Pipeline
```scala

// Option 1: when upstream lineage data is available in the pipeline registry table
Watermarks.getWatermarkForTable()

// Option 2: pass a list of upstream tables as a parameter to retrieve the result
Watermarks.getWatermarkForTable(Seq("d_spmster", "d_orgmaster", "d_publication_spmaster_link"))


```





#### 4. Checkpoints


The Checkpoints object manages pipeline runtime state and execution control. It is used by each pipeline to track the active batch ID, monitor execution status, and update checkpoint states throughout the pipeline lifecycle (Ready, Running, Failed, Success). It also supports explicitly overriding checkpoint status when needed.

**1. Functions**

**Checkpoints.activeBatchId**
- Returns the currently active batch ID for the pipeline.

**Checkpoints.activeStatus**
- Returns the current execution status of the pipeline (Ready, Running, Failed, Success).

**Checkpoints.markCheckpointStarted**
- Marks the checkpoint as started and updates the pipeline status to Running.

**Checkpoints.markCheckpointCompleted**
- Marks the checkpoint as successfully completed and updates the status to Success.

**Checkpoints.markCheckpointFailed**
- Marks the checkpoint as failed and updates the status to Failed.

**Checkpoints.updateCheckpointStatus**
- Updates the checkpoint status explicitly, allowing forced status changes when required.
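
The status lifecycle can be read as a small state machine. The sketch below is an illustration only: the `CheckpointSketch` object, the allowed transitions (including `Failed -> Running` for recovery), and the `force` flag that mimics an explicit status override are assumptions, not the Checkpoints implementation.

```scala
// Hypothetical state machine for the checkpoint statuses named above.
object CheckpointSketch {
  sealed trait Status
  case object Ready   extends Status
  case object Running extends Status
  case object Failed  extends Status
  case object Success extends Status

  // Assumed transitions: Ready -> Running -> Success | Failed, Failed -> Running (retry)
  private def allowedNext(from: Status): Set[Status] = from match {
    case Ready   => Set(Running)
    case Running => Set(Success, Failed)
    case Failed  => Set(Running)
    case Success => Set.empty
  }

  // force = true bypasses the transition check, like an explicit status override
  def transition(from: Status, to: Status, force: Boolean = false): Either[String, Status] =
    if (force || allowedNext(from).contains(to)) Right(to)
    else Left(s"Illegal transition: $from -> $to")
}
```

Returning `Either` keeps an illegal transition visible to the caller instead of silently overwriting the stored status.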


**2. Example**


```scala

    // 1. Start the checkpoint
    Checkpoints.markRunStarted(pipelineName)

    try {
      // 2. Load
      val rawDF = DeltaReader.readDeltaDataWithWatermark(queryConfigPath)

      // optional - skipped
      if (rawDF.isEmpty) {
        Checkpoints.markCheckpointSkipped(pipelineName)
        return
      }

      // 3. Core pipeline logic 
      transformAndSave(rawDF)

      // 4. Mark success
      val params = CheckpointParams(
        rowsRead = 200,
        rowsWritten = 500
      )
      Checkpoints.markRunCompleted(Some(params))

    } catch {
      case e: Throwable =>
        // 5. Mark failure
        Checkpoints.markRunFailed(pipelineName, Some(e.getMessage))
        throw e
    }


```


#### 5. TableResolver


##### Description:

TableResolver provides a consistent way to use predefined case objects as table names for all ACS, DAP, and UDM tables. The case object names are derived from the actual catalog table names. All tables are defined centrally and reused across 20+ pipelines, ensuring standardization and easier maintenance.

##### Benefits:
- Automatically resolves catalog, environment, schema, and version.
- Centralized table naming reduces duplication and errors.
- Enables consistent references across multiple pipelines.
- Simplifies updates when underlying tables change, without modifying individual pipelines.
- Detects underlying table name changes automatically through validation.
- Prevents typos and simplifies table name maintenance.


##### Attributes:

- **tableName**: Actual Delta table name.
- **schema**: Optional override for schema name (bypasses resolver).
- **catalog**: Optional override for catalog name (bypasses resolver).
- **role**: Indicates source (ACS, UDM) or target (DAP) table; used by resolver to fetch correct parameters.
- **needsSuffix**: Flag to append suffix (e.g., "woscore") for WOS Core and ESCI datasets.
- **primaryKeyCols**: List of primary key columns for CDC Type-2 deduplication; auto-generated but requires validation.
- **fullName**: Complete Delta table name in the format {catalog}.{schema}.{table}.
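
To show how these attributes could fit together, here is a minimal sketch of a table descriptor whose `fullName` is derived from the other fields. The `TableRef` case class, the explicit `suffix` field, the underscore used to append it, and the catalog and schema values in the usage note are hypothetical; the framework's own metadata type may be shaped differently.

```scala
// Hypothetical descriptor that mirrors the attributes listed above.
object TableRefSketch {
  final case class TableRef(
      tableName: String,
      catalog: String,
      schema: String,
      primaryKeyCols: Seq[String] = Seq.empty,
      needsSuffix: Boolean = false,
      suffix: Option[String] = None // e.g. Some("woscore"); assumed to come from the dataset
  ) {
    // {catalog}.{schema}.{table}, with the suffix appended only when needsSuffix is set
    def fullName: String = {
      val resolved = suffix match {
        case Some(s) if needsSuffix => s"${tableName}_$s" // assumed separator: underscore
        case _                      => tableName
      }
      s"$catalog.$schema.$resolved"
    }
  }
}
```

Under these assumptions, `TableRef("f_publication", "acs", "wos").fullName` evaluates to `acs.wos.f_publication`.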


##### Examples:

1. Example - ACS Tables
```scala

val datasetList: Seq[Dataset] = Seq(Dataset.WosCore, Dataset.Pprn, Dataset.WosEsci)

datasetList.foreach { t =>

    TableResolver.forDataset(t)
    println("-------------------------")
    println(s"Dataset: $t")
    println(ACS.DArticleTotalCites)
    println(ACS.DAlmaOpenaccess)
    println(ACS.AuthorPublicationLink)

}
```

2. Example - DAP & UDM Tables
```scala
println("---------DAP----------------")
println(DAP.Alma)
println(DAP.ApArticle)
println(DAP.IncitesRiOrgGrants)

println("---------UDM----------------")
println(UDM.GrantsTopic)
println(UDM.ProfileGrantRelation)
println(UDM.ItemTopic)

```



3. Example  - get data by attribute

```scala
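  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col

  // Reads only the primary-key columns of a table as of a given Delta version.
  // The `version` parameter is an assumption: the original snippet used it without defining it.
  def read(tableName: TableMetadata, version: Long): DataFrame = {

        println(s"PK: ${tableName.primaryKeyCols.mkString(", ")}")
        spark.read
            .format("delta")
            .option("versionAsOf", version)
            .table(tableName.fullName)
            .select(tableName.primaryKeyCols.map(c => col(c)): _*)

  }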


```


#### 6. CdcType2Reader


A unified CDC Type-2 read function that handles both baseline and incremental data seamlessly. It reads Delta tables consistently and automatically returns a deduplicated dataset, including upserts and deletes, ensuring data integrity and simplifying downstream processing.



**1. Functions & Parameters**

**1.1 read**

- Reads baseline or incremental data consistently using start and end versions.
- Provides a unified interface for both baseline and incremental reads: a baseline read when start version = end version, otherwise an incremental read.

```scala
  def read(
      tableName: String,
      startVersion: Long,
      endVersion: Long,
      selectedCols: Seq[String] = Seq.empty,
      primaryKeyCols: Seq[String] =  Seq.empty
  ): (DataFrame, DataFrame)

```
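
The baseline-versus-incremental rule stated above can be sketched as a small dispatch function. The `ReadDispatchSketch` object, the `ReadMode` types, and the `require` check on the version order are hypothetical and only illustrate the rule; they are not taken from `CdcType2Reader`.

```scala
// Hypothetical sketch of the dispatch rule: equal versions mean a baseline read.
object ReadDispatchSketch {
  sealed trait ReadMode
  final case class Baseline(asOfVersion: Long) extends ReadMode
  final case class Incremental(startVersion: Long, endVersion: Long) extends ReadMode

  def resolveMode(startVersion: Long, endVersion: Long): ReadMode = {
    // Assumed precondition: the range must not be inverted
    require(startVersion <= endVersion, s"startVersion $startVersion > endVersion $endVersion")
    if (startVersion == endVersion) Baseline(endVersion)
    else Incremental(startVersion, endVersion)
  }
}
```

Encoding the decision in one place lets callers pass the watermark range unchanged for both the first full load and every later incremental run.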

**1.2  readBaseline**

- Reads baseline data, optionally at a specific version.
- Returns the baseline snapshot as of a given Delta version; if endVersion is None, the latest version is used.

```scala
  def readBaseline(
      tableName: String,
      endVersion: Option[Long] = None,
      selectedCols: Seq[String] = Seq.empty,
      primaryKeyCols: Seq[String] = Seq.empty
  ): DataFrame

```

**1.3  readIncremental**

- Reads incremental data between a start version and an end version.
- Uses the Delta Change Data Feed for the incremental read; if selectedCols is empty, all columns are read.

```scala
  def readIncremental(
      tableName: String,
      startVersion: Long,
      endVersion: Long,
      selectedCols: Seq[String] = Seq.empty,
      primaryKeyCols: Seq[String] = Seq.empty
  ): (DataFrame, DataFrame) 

```
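
To illustrate what "deduplicated upserts and deletes" can mean for a Change Data Feed read, the sketch below works on plain Scala collections rather than DataFrames. The `CdcDedupSketch` object and the `Change` record are hypothetical; the change-type values follow the Delta Change Data Feed `_change_type` column (`insert`, `update_preimage`, `update_postimage`, `delete`). The rule of keeping only the latest change per key is an assumption about the reader, not a description of its actual code.

```scala
// Hypothetical sketch: keep the latest change per primary key, then split
// the result into upserts and deletes.
object CdcDedupSketch {
  final case class Change(key: String, changeType: String, commitVersion: Long, payload: String)

  def split(changes: Seq[Change]): (Seq[Change], Seq[Change]) = {
    val latestPerKey = changes
      .filterNot(_.changeType == "update_preimage") // pre-images carry the old row state
      .groupBy(_.key)
      .values
      .map(_.maxBy(_.commitVersion))                // latest change wins
      .toSeq
      .sortBy(_.key)                                // deterministic order for the sketch
    // Everything that is not a delete is treated as an upsert
    latestPerKey.partition(_.changeType != "delete")
  }
}
```

A key that was inserted and later deleted inside the same version range therefore shows up only once, in the deletes.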


**2. Example**

Example with watermarks
```scala

  val versionMap = Watermarks.getWatermarkForTable()

  // Look up the (startVersion, endVersion) range for the table.
  // Assumption: the version map is keyed by the table name.
  val (startVersion, endVersion) = versionMap(ACS.FPublication.tableName)

  // Example 1: read
  val (upserts, deletes) = CdcType2Reader.read(
    ACS.FPublication.fullName,
    startVersion = startVersion,
    endVersion   = endVersion
  )

  // Example 2: readIncremental with the primaryKeyCols parameter
  val (incUpserts, incDeletes) = CdcType2Reader.readIncremental(
    ACS.FPublication.fullName,
    startVersion   = startVersion,
    endVersion     = endVersion,
    primaryKeyCols = Seq("uid")
  )

  // Example 3: readBaseline without a version parameter and with the selectedCols parameter
  val df = CdcType2Reader.readBaseline(
    ACS.FPublication.fullName,
    selectedCols = Seq("uid", "pub_year", "__END_AT")
  )

```



#### 7. CdcType1Reader


A unified CDC Type-1 read function that handles both baseline and incremental data seamlessly. It reads Delta tables consistently and automatically returns a deduplicated dataset, including upserts and deletes, ensuring data integrity and simplifying downstream processing.



**1. Functions & Parameters**

**1.1 read**

- Reads baseline or incremental data consistently using start and end versions.
- Provides a unified interface for both baseline and incremental reads: a baseline read when start version = end version, otherwise an incremental read.

```scala
  def read(
      tableName: String,
      startVersion: Long,
      endVersion: Long,
      selectedCols: Seq[String] = Seq.empty
  ): (DataFrame, DataFrame) 

```

**1.2  readBaseline**

- Reads baseline data, optionally at a specific version.
- Returns the baseline snapshot for Type-1; if selectedCols is empty, all columns are read.

```scala
  def readBaseline(
      tableName: String,
      endVersion: Option[Long] = None,
      selectedCols: Seq[String] = Seq.empty
  ): DataFrame

```

**1.3  readIncremental**

- Reads incremental data between a start version and an end version.
- Reads the incremental changes for Type-1 via CDF (if available) and returns (upserts, deletes).

```scala
  def readIncremental(
      tableName: String,
      startVersion: Long,
      endVersion: Long,
      selectedCols: Seq[String] = Seq.empty
  ): (DataFrame, DataFrame)

```


