# OAuth2 Client Credentials Demo

This notebook demonstrates running the ETL pipeline against an API endpoint using OAuth2 Client Credentials Grant authentication via Keycloak.

Unlike the Password Grant, the Client Credentials Grant authenticates the application itself rather than acting on behalf of a user. It is the recommended flow for service-to-service communication.

## Prerequisites

Start Keycloak and the mock API service:
```bash
make up-keycloak
```

Services:
- Keycloak Admin Console: `http://localhost:8180` (admin/admin)
- Mock API: `http://mock-api:8000` (reachable from inside the Docker network)

## Client Credentials

- Client ID: `etl-client`
- Client Secret: `etl-client-secret`
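
For orientation, the token request implied by this grant type can be sketched as follows. This is a minimal illustration, not the pipeline's own client: the helper `build_token_request` and the realm name `demo` are assumptions, and the path follows the token endpoint layout of recent Keycloak versions (older releases prefix it with `/auth`). The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(base_url: str, realm: str,
                        client_id: str, client_secret: str) -> Request:
    """Build (but do not send) a Client Credentials token request."""
    # Assumed endpoint layout; check the realm's OpenID Connect discovery
    # document for the authoritative token URL.
    token_url = f"{base_url}/realms/{realm}/protocol/openid-connect/token"
    # The grant is sent as a form-encoded body; no username or password is involved.
    form = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return Request(
        token_url,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# "demo" is a hypothetical realm name; the notebook does not state the real one.
request = build_token_request(
    "http://localhost:8180", "demo", "etl-client", "etl-client-secret"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would, against a correctly configured realm, return a JSON document with the standard OAuth2 `access_token` and `expires_in` fields.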

In [1]:
import sys
from pathlib import Path

In [2]:
project_root = Path("/opt/spark/work")
src_path = project_root / "src"

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, expr
from pipeline.orchestrator import run_pipeline

In [4]:
spark = (
    SparkSession.builder
    .appName("oauth2_client_credentials_demo_pipeline")
    .getOrCreate()
)

## Create Source DataFrame

Generate a DataFrame of 50 unique tracking IDs spread across 4 partitions. Each ID is the SHA-256 hash of a random UUID and identifies one API request.

In [5]:
df = (
    spark.range(50)
         .repartition(4)
         .select(
             sha2(expr("uuid()"), 256).alias("tracking_id")
         )
)
df.show(10)

+--------------------+
|         tracking_id|
+--------------------+
|bc5859ab6df7abcd8...|
|d4d4dedef53f77539...|
|e35e52621abd9389c...|
|2800fb0fe3e62b6c6...|
|763e42509d65b3445...|
|e3b8a35a4cef69268...|
|8111a6a550703069d...|
|4072b41a0d05b8d86...|
|35781dcaeb0eb79b4...|
|f903e9ffddf6d4bd4...|
+--------------------+
only showing top 10 rows



## Run Pipeline

Execute the ETL pipeline using the OAuth2 Client Credentials configuration.

The pipeline will:
1. Authenticate with Keycloak using client credentials (no user involved)
2. Distribute the access token to all Spark executors
3. Make API requests with an `Authorization: Bearer <token>` header
4. Automatically refresh tokens when they expire
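
Steps 3 and 4 can be illustrated with a small token cache that refreshes shortly before expiry and builds the bearer header. This is a minimal sketch under stated assumptions, not the pipeline's `DriverTokenManager`: `TokenCache`, `fetch_token`, and the 30-second refresh margin are hypothetical names and values, and the demo uses a fake clock and a fake fetcher so that no network access is needed.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# A fetcher returns (access_token, lifetime_in_seconds), mirroring the
# access_token / expires_in fields of an OAuth2 token response.
Fetcher = Callable[[], Tuple[str, float]]


class TokenCache:
    """Hypothetical cache that renews a token shortly before it expires."""

    def __init__(self, fetch_token: Fetcher, refresh_margin: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_token = fetch_token
        self._refresh_margin = refresh_margin
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Fetch when nothing is cached or the cached token is within
        # refresh_margin seconds of its expiry time.
        if self._token is None or self._clock() >= self._expires_at - self._refresh_margin:
            token, lifetime = self._fetch_token()
            self._token = token
            self._expires_at = self._clock() + lifetime
        return self._token

    def auth_header(self) -> Dict[str, str]:
        return {"Authorization": f"Bearer {self.get()}"}


# Demo: a controllable clock and a fetcher that issues numbered fake tokens.
now = [0.0]
issued: List[str] = []


def fake_fetch() -> Tuple[str, float]:
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1], 300.0  # each fake token is valid for 300 seconds


cache = TokenCache(fake_fetch, refresh_margin=30.0, clock=lambda: now[0])
first = cache.auth_header()
now[0] = 100.0
second = cache.auth_header()  # well before expiry: reuses the cached token
now[0] = 280.0
third = cache.auth_header()   # inside the 30-second margin: fetches a new token
```

Injecting the clock and the fetcher keeps the refresh rule testable without a running Keycloak instance; a real fetcher would call the token endpoint instead.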

In [6]:
config_path = project_root / "configs" / "examples" / "oauth2_client_credentials_demo.yml"

In [7]:
run_pipeline(
    spark=spark,
    config_path=config_path,
    source_df=df,
    source_id="tracking_id"
)

2026-02-08 14:29:44,108 [INFO] [PipelineOrchestrator]: Starting driver-side authentication runtime service
2026-02-08 14:29:44,109 [INFO] [RpcBootstrapper]: Starting RPC token service...
2026-02-08 14:29:44,143 [INFO] AsyncBackgroundService[DriverTokenManager]: Background service started
2026-02-08 14:29:44,151 [INFO] AsyncBackgroundService[RpcService]: Background service started
2026-02-08 14:29:44,152 [INFO] [TokenRpcService]: Started at http://89d602eb842e:53007
2026-02-08 14:29:44,153 [INFO] [RpcBootstrapper]: TokenManager background refresh started.
2026-02-08 14:29:44,153 [INFO] [RpcBootstrapper]: RPC Token Service running at http://89d602eb842e:53007
2026-02-08 14:29:44,153 [INFO] [PipelineOrchestrator]: Adding authentication middleware
2026-02-08 14:29:44,153 [INFO] [PipelineOrchestrator]: Request from URL: /http://mock-api:8000/api/oauth2/data
2026-02-08 14:29:44,154 [INFO] [TableManager]: Creating database demo
2026-02-08 14:29:44,680 [INFO] [DriverTokenManager]: Background t

## Verify Results

Read the sink table to verify that the API responses were captured.

In [8]:
response_df = spark.table("demo.oauth2_client_credentials_demo_response")
response_df.show(10)

+--------------------+--------------------+--------------------+------+--------------------+--------------+--------------------+-----------+--------------------+--------------------+-------+-------------+--------+--------------------+--------------------+
|          request_id|            row_hash|                 url|method|     request_headers|request_params|    request_metadata|status_code|    response_headers|           body_text|success|error_message|attempts|   response_metadata|       _request_time|
+--------------------+--------------------+--------------------+------+--------------------+--------------+--------------------+-----------+--------------------+--------------------+-------+-------------+--------+--------------------+--------------------+
|4aa701e27bbf1d133...|316198866e9e1cd3f...|/http://mock-api:...|   GET|{"Accept": "appli...|            {}|{"vendor": "mock-...|        200|{"Date": "Mon, 09...|{"request_id":"66...|   true|         NULL|       1|{"connection_warm..

In [9]:
# Summary statistics
print(f"Total records: {response_df.count()}")
response_df.groupBy("status_code").count().show()

Total records: 50
+-----------+-----+
|status_code|count|
+-----------+-----+
|        200|   50|
+-----------+-----+



In [10]:
# Sample response body - note the auth_method field shows "oauth2:bearer"
response_df.select("body_text").limit(3).show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|body_text                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------

----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 49074)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/socketserver.py", line 317, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/conda/lib/python3.11/socketserver.py", line 348, in process_request
    self.finish_request(request, client_address)
  File "/opt/conda/lib/python3.11/socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/conda/lib/python3.11/socketserver.py", line 755, in __init__
    self.handle()
  File "/usr/local/spark/python/pyspark/accumulators.py", line 295, in handle
    poll(accum_updates)
  File "/usr/local/spark/python/pyspark/accumulators.py", line 267, in poll
    if self.rfile in r and func():
                           ^^^^^^
  File "/usr/local/spark/python/pyspark/accumulators.py", line 271, in accum_updates
    num_updates =