# DBT Data Transformation Tutorial with DigitalHub

This notebook demonstrates how to build a data transformation pipeline using DBT (Data Build Tool) with the DigitalHub SDK. We'll work with employee data, apply SQL transformations, and orchestrate the process through a workflow.

## Overview
- **Extract**: Load employee data from a CSV source
- **Transform**: Use DBT to filter and process the data with SQL
- **Orchestrate**: Create a workflow pipeline to automate the transformation process

## Setup and Function Definitions

First, we'll create the `src` directory, which will later hold the pipeline source file. The SQL transformation used by DBT is defined further down, in the "Data Transformation with DBT" section.

In [None]:
from pathlib import Path

Path("src").mkdir(exist_ok=True)

### SQL Transformation Definition

Our DBT SQL transformation, defined in the "Data Transformation with DBT" section below, will:

- Reference the input employees table using DBT's `{{ ref('employees') }}` syntax
- Filter employees by department ID '50'
- Return all columns for employees in that specific department

The transformation is designed to work with DBT's templating system and will be executed as part of our data pipeline.
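
To get a feel for what the filter returns before running it through DBT, the sketch below executes the same query shape against an in-memory SQLite table. It is only an illustration: DBT resolves `{{ ref('employees') }}` itself, whereas here the placeholder is swapped in by hand with `str.replace`. The helper `filter_department_50`, the two-column table layout, and the sample rows are assumptions made for this example.

```python
import sqlite3

# Same query shape as the tutorial's DBT model.
MODEL_SQL = """
WITH tab AS (
    SELECT  *
    FROM    {{ ref('employees') }}
)
SELECT  *
FROM    tab
WHERE   tab."DEPARTMENT_ID" = '50'
"""


def filter_department_50(rows):
    """Run the model against an in-memory SQLite table built from `rows`."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, reduced schema: only DEPARTMENT_ID is named in the tutorial's SQL.
    conn.execute('CREATE TABLE employees ("EMPLOYEE_ID" TEXT, "DEPARTMENT_ID" TEXT)')
    conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)
    # Manual stand-in for DBT's ref() resolution (illustration only).
    query = MODEL_SQL.replace("{{ ref('employees') }}", "employees")
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Made-up rows: two employees in department 50, one in department 10.
sample_rows = [("198", "50"), ("200", "10"), ("199", "50")]
filtered = filter_department_50(sample_rows)
print(filtered)
```

Only the rows whose `DEPARTMENT_ID` equals `'50'` survive the filter; every column of those rows is returned unchanged.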

## Project Initialization

Now we'll initialize our DigitalHub project, using the same project name as the other tutorials.

In [None]:
import digitalhub as dh

p_name = "tutorial-project"
project = dh.get_or_create_project(p_name)

## Data Source Setup

We'll create a data item that points to a remote CSV file of employee data. The dataset contains employee information, including department assignments.

In [None]:
url = "https://gist.githubusercontent.com/kevin336/acbb2271e66c10a5b73aacf82ca82784/raw/e38afe62e088394d61ed30884dd50a6826eee0a8/employees.csv"
di = project.new_dataitem(name="employees-data", kind="table", path=url)
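
As a quick sanity check on what the later filter should return, it can help to count rows per department in the source data. The sketch below does this with the standard library on a small inline excerpt rather than the real file. The excerpt, its values, the `LAST_NAME` column, and the helper `count_by_department` are all hypothetical; only the `DEPARTMENT_ID` column name comes from the tutorial's SQL.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt standing in for the remote employees.csv file.
SAMPLE_CSV = """EMPLOYEE_ID,LAST_NAME,DEPARTMENT_ID
198,Rossi,50
199,Bianchi,50
200,Verdi,10
"""


def count_by_department(csv_text):
    """Return a Counter mapping each DEPARTMENT_ID value to its row count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["DEPARTMENT_ID"] for row in reader)


counts = count_by_department(SAMPLE_CSV)
# Number of rows the department-50 filter would keep in this excerpt.
print(counts["50"])
```

Note that `csv.DictReader` yields every value as a string, which matches the quoted `'50'` used in the SQL filter.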

## Data Transformation with DBT

Now we'll create and execute our DBT transformation function. This will filter the employee data to show only employees in department '50'.

In [None]:
sql = """
WITH tab AS (
    SELECT  *
    FROM    {{ ref('employees') }}
)
SELECT  *
FROM    tab
WHERE   tab."DEPARTMENT_ID" = '50'
"""

In [None]:
function = project.new_function(name="transform-employees", kind="dbt", code=sql)

In [None]:
run = function.run(
    "transform",
    inputs={"employees": di.key},
    outputs={"output_table": "department-50"},
    wait=True,
)

Let's examine the transformed data, which should contain only employees from department 50:

In [None]:
run.output("department-50").as_df().head()

## Pipeline Orchestration

Now let's create a workflow that orchestrates the DBT transformation. The pipeline is defined with Hera, a Python SDK for Argo Workflows, and contains a single step that runs our transformation function.

In [None]:
%%writefile "src/pipeline.py"
from hera.workflows import Workflow, DAG, Parameter
from digitalhub_runtime_hera.dsl import step


def pipeline():
    with Workflow(entrypoint="dag", arguments=Parameter(name="employees")) as w:
        with DAG(name="dag"):
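            # Single DAG step: run the "transform" action of the DBT function,
            # reading the input key from the workflow's "employees" parameter.
            A = step(template={"action": "transform",
                               "inputs": {"employees": "{{workflow.parameters.employees}}"},
                               "outputs": {"output_table": "department-50"}},
                     function="transform-employees")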
    return w
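
The step template above does not hard-code the input; it carries the placeholder `{{workflow.parameters.employees}}`, which is filled in with the value supplied when the workflow is run. The sketch below mimics that substitution locally to show what the step ends up receiving. It is not how Argo implements it: the helper `resolve_workflow_parameters` and the example key string are assumptions made for illustration.

```python
import re

# Matches placeholders such as {{workflow.parameters.employees}}.
PLACEHOLDER = re.compile(r"\{\{\s*workflow\.parameters\.([\w-]+)\s*\}\}")


def resolve_workflow_parameters(template, parameters):
    """Recursively replace workflow parameter placeholders in dicts and strings."""
    if isinstance(template, dict):
        return {k: resolve_workflow_parameters(v, parameters) for k, v in template.items()}
    if isinstance(template, str):
        # A function replacement avoids backslash-escape surprises in the value.
        return PLACEHOLDER.sub(lambda m: parameters[m.group(1)], template)
    return template


template = {
    "action": "transform",
    "inputs": {"employees": "{{workflow.parameters.employees}}"},
    "outputs": {"output_table": "department-50"},
}

# Hypothetical value standing in for di.key; the real key format may differ.
example_key = "example-key-for-employees-data"
resolved = resolve_workflow_parameters(template, {"employees": example_key})
print(resolved["inputs"])
```

After substitution, the step's `inputs` mirror the `inputs={"employees": di.key}` argument used in the direct `function.run` call earlier.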

### Execute the Complete Pipeline

Finally, let's create the workflow, build it, and then run the pipeline, passing the data item key as the `employees` parameter. This executes the transformation in an automated, orchestrated manner.

In [None]:
workflow = project.new_workflow(
    name="dbt-pipeline", kind="hera", code_src="src/pipeline.py", handler="pipeline"
)

In [None]:
workflow.run("build", wait=True)

In [None]:
workflow_run = workflow.run("pipeline", parameters={"employees": di.key}, wait=True)