# OAuth2 Client Credentials Demo

This notebook runs the ETL pipeline against an API endpoint protected by OAuth2, authenticating with the Client Credentials Grant via Keycloak.

Unlike the Password Grant, the Client Credentials Grant authenticates the application itself rather than acting on behalf of a user. This is the recommended flow for service-to-service communication.
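To make the flow concrete, here is a minimal sketch of the token request a client sends in this grant. It only builds the request and does not contact Keycloak. The realm name `demo` and the URL layout are assumptions for illustration (recent Keycloak releases use `/realms/<realm>/...`, older ones prefix the path with `/auth`); the pipeline itself takes these settings from its config file.

```python
from urllib.parse import urlencode

# Assumptions for illustration only: the realm name and the URL layout.
KEYCLOAK_BASE_URL = "http://localhost:8180"
REALM = "demo"  # hypothetical realm name


def build_token_request(client_id: str, client_secret: str) -> tuple:
    """Return (url, body, headers) for a client credentials token request."""
    url = f"{KEYCLOAK_BASE_URL}/realms/{REALM}/protocol/openid-connect/token"
    # The grant carries only the client's own identity: no username, no password.
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return url, body, headers


url, body, headers = build_token_request("etl-client", "etl-client-secret")
print(url)
print(body.decode("ascii"))
```

POSTing this body to the token endpoint should return a JSON document containing an `access_token` and an `expires_in` value, which is what the later steps rely on.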

## Prerequisites

Start Keycloak and the mock API service:
```bash
make up-keycloak
```

Services:
- Keycloak Admin Console: `http://localhost:8180` (admin/admin)
- Mock API: `http://mock-api:8000` (reachable from inside the Docker network)

## Client Credentials

- Client ID: `etl-client`
- Client Secret: `etl-client-secret`

In [1]:
import sys
from pathlib import Path

In [2]:
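# Project root inside the Spark container (alternative location left commented out).
# project_root = Path("/opt/spark/work")
project_root = Path("/opt/spark/app")
src_path = project_root / "src"

# Make the pipeline's packages importable from this notebook.
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))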

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, expr
from orchestration.orchestrator import run_pipeline

In [5]:
spark = (
    SparkSession.builder
    .appName("oauth2_client_credentials_demo_pipeline")
    .getOrCreate()
)

## Create Source DataFrame

Generate a DataFrame with unique tracking IDs that will be used to make API requests.

In [6]:
df = (
    spark.range(50)
         .repartition(4)
         .select(
             sha2(expr("uuid()"), 256).alias("tracking_id")
         )
)
df.show(10)

+--------------------+
|         tracking_id|
+--------------------+
|e117f769a601bd247...|
|410640310c211ab2b...|
|91039f56b47cfe059...|
|b79227527a1d8292a...|
|bcdf25ec7c2151bc6...|
|a8d99edbcfaff67ff...|
|b027173f1a92de59d...|
|ef0526890adcd8073...|
|7779357563bc897a0...|
|eaec91e7950433279...|
+--------------------+
only showing top 10 rows



## Run Pipeline

Execute the ETL pipeline using the OAuth2 Client Credentials configuration.

The pipeline will:
1. Authenticate with Keycloak using client credentials (no user involved)
2. Distribute the access token to all Spark executors
3. Make API requests with `Authorization: Bearer <token>` header
4. Automatically refresh tokens when they expire
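
Steps 3 and 4 can be sketched as a small token cache that renews the token shortly before it expires and builds the `Authorization` header. This is a hypothetical helper for illustration, not the pipeline's `DriverTokenManager`; the class name, the `fetch` callable, and the 30-second safety margin are all assumptions. For this grant, renewing usually means requesting a fresh token with the same client credentials rather than using a refresh token.

```python
import time
from typing import Callable, Optional, Tuple


class CachedToken:
    """Cache an access token and re-fetch it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        skew_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # returns (access_token, expires_in_seconds)
        self._skew = skew_seconds  # renew this many seconds before expiry
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._skew:
            token, expires_in = self._fetch()
            self._token = token
            self._expires_at = now + expires_in
        return self._token

    def auth_header(self) -> dict:
        return {"Authorization": f"Bearer {self.get()}"}


# Demo with a fake token endpoint and a fake clock, so nothing touches the network.
calls = []
fake_now = [0.0]


def fake_fetch() -> Tuple[str, float]:
    calls.append(1)
    return f"token-{len(calls)}", 300.0


cache = CachedToken(fake_fetch, clock=lambda: fake_now[0])
first_header = cache.auth_header()   # nothing cached yet, so this fetches
fake_now[0] = 100.0
cache.get()                          # still valid, served from the cache
fake_now[0] = 280.0
second_header = cache.auth_header()  # within 30 s of expiry, fetched again
print(first_header, second_header, len(calls))
```

Injecting the clock and the fetch function keeps the renewal logic testable without a running Keycloak instance.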

In [7]:
config_path = project_root / "configs" / "examples" / "oauth2_client_credentials_demo.yml"

In [8]:
run_pipeline(
    spark=spark,
    config_path=config_path,
    source_df=df,
    source_id="tracking_id"
)

2026-02-19 14:15:46,571 [INFO] [PipelineOrchestrator]: Starting driver-side authentication runtime service
2026-02-19 14:15:46,572 [INFO] [RpcBootstrapper]: Starting RPC token service...
2026-02-19 14:15:46,598 [INFO] AsyncBackgroundService[DriverTokenManager]: Background service started
2026-02-19 14:15:46,614 [INFO] AsyncBackgroundService[RpcService]: Background service started
2026-02-19 14:15:46,615 [INFO] [TokenRpcService]: Started at http://5e7dfb25a9a4:57717
2026-02-19 14:15:46,615 [INFO] [RpcBootstrapper]: TokenManager background refresh started.
2026-02-19 14:15:46,616 [INFO] [RpcBootstrapper]: RPC Token Service running at http://5e7dfb25a9a4:57717
2026-02-19 14:15:46,616 [INFO] [PipelineOrchestrator]: Adding authentication middleware
2026-02-19 14:15:46,616 [INFO] [PipelineOrchestrator]: Request from URL: /http://mock-api:8000/api/oauth2/data
2026-02-19 14:15:46,617 [INFO] [TableManager]: Creating database demo
2026-02-19 14:15:46,943 [INFO] [DriverTokenManager]: Background t

## Verify Results

Read the sink table to verify the API responses were captured.

In [9]:
response_df = spark.table("demo.oauth2_client_credentials_demo_response")
response_df.show(10)

+--------------------+--------------------+--------------------+------+--------------------+--------------+--------------------+-----------+--------------------+--------------------+-------+-------------+--------+--------------------+--------------------+
|          request_id|            row_hash|                 url|method|     request_headers|request_params|    request_metadata|status_code|    response_headers|           body_text|success|error_message|attempts|   response_metadata|       _request_time|
+--------------------+--------------------+--------------------+------+--------------------+--------------+--------------------+-----------+--------------------+--------------------+-------+-------------+--------+--------------------+--------------------+
|b79227527a1d8292a...|1ed7fc2d21679fbc1...|/http://mock-api:...|   GET|{"Accept": "appli...|            {}|{"vendor": "mock-...|        200|{"Date": "Thu, 19...|{"request_id":"eb...|   true|         NULL|       1|{"logs": ["-> GET..

In [10]:
# Summary statistics
print(f"Total records: {response_df.count()}")
response_df.groupBy("status_code").count().show()

Total records: 50
+-----------+-----+
|status_code|count|
+-----------+-----+
|        200|   50|
+-----------+-----+



In [11]:
# Sample response body - note the auth_method field shows "oauth2:bearer"
response_df.select("body_text").limit(3).show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|body_text                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+-----------------------------------------------------------------------------------------------------------

In [14]:
response_df.select("response_metadata").limit(1).collect()

[Row(response_metadata='{"logs": ["-> GET /http://mock-api:8000/api/oauth2/data", "[RetryMiddleware] Attempt 1/10 -> GET /http://mock-api:8000/api/oauth2/data", "<- 200 /http://mock-api:8000/api/oauth2/data"], "token_provider": {"provider": "RpcTokenProvider", "path": "rpc"}, "connection_warmup": {"warmed_up": false, "warmup_error": null, "warmup_timeout": 10}, "json": {"valid": true, "error": null}, "timing": {"total_seconds": 0.01}}')]
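
The `response_metadata` column holds a JSON string, so it is easier to inspect once parsed. The sketch below uses a copy of the value shown above and pulls out the fields most useful for debugging; the helper name `summarize_metadata` is hypothetical.

```python
import json

# Copy of the response_metadata value from the output above.
raw = (
    '{"logs": ["-> GET /http://mock-api:8000/api/oauth2/data", '
    '"[RetryMiddleware] Attempt 1/10 -> GET /http://mock-api:8000/api/oauth2/data", '
    '"<- 200 /http://mock-api:8000/api/oauth2/data"], '
    '"token_provider": {"provider": "RpcTokenProvider", "path": "rpc"}, '
    '"connection_warmup": {"warmed_up": false, "warmup_error": null, "warmup_timeout": 10}, '
    '"json": {"valid": true, "error": null}, '
    '"timing": {"total_seconds": 0.01}}'
)


def summarize_metadata(metadata_json: str) -> dict:
    """Extract a few debugging fields from a response_metadata JSON string."""
    meta = json.loads(metadata_json)
    return {
        "token_provider": meta["token_provider"]["provider"],
        "json_valid": meta["json"]["valid"],
        "total_seconds": meta["timing"]["total_seconds"],
        "log_lines": len(meta["logs"]),
    }


summary = summarize_metadata(raw)
print(summary)
```

In the notebook, the same helper could be applied to a live row, for example with `summarize_metadata(response_df.select("response_metadata").first()[0])`.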