# Static Bearer Token Demo

This notebook demonstrates running the ETL pipeline against an API endpoint that requires a static Bearer token.

Unlike the OAuth2 flow, this one uses a pre-configured token that does not expire or refresh, which makes it useful for APIs that issue long-lived API keys.

## Prerequisites

Start the mock API service:
```bash
make up-keycloak
```

The mock API will be available at `http://mock-api:8000` from within the Docker network.

In [1]:
import sys
from pathlib import Path

In [2]:
project_root = Path("/opt/spark/app")
src_path = project_root / "src"

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, expr
from pipeline.orchestrator import run_pipeline

In [4]:
spark = (
    SparkSession.builder
    .appName("bearer_token_demo_pipeline")
    .getOrCreate()
)

## Create Source DataFrame

Generate a DataFrame with unique tracking IDs that will be used to make API requests.

In [5]:
df = (
    spark.range(50)
         .repartition(4)
         .select(
             sha2(expr("uuid()"), 256).alias("tracking_id")
         )
)
df.show(10)

+--------------------+
|         tracking_id|
+--------------------+
|9af5e2303f556b43b...|
|00e79ecd2ebe66a26...|
|f5f4c8a4d99492d15...|
|4df5776d76b38feca...|
|0abea37718048fbdf...|
|14c3bdd21f162bd6c...|
|79721e8d6965fd739...|
|f9172d2c320644ded...|
|1343d3ba08d8fab0d...|
|daab7678ebf875123...|
+--------------------+
only showing top 10 rows
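
The ID scheme used above can be approximated in plain Python, without a Spark session. This is only a sketch: it assumes Spark's `uuid()` yields a standard random UUID string and that `sha2(..., 256)` returns its SHA-256 hex digest. `make_tracking_id` is a hypothetical helper, not part of the pipeline.

```python
import hashlib
import uuid


def make_tracking_id() -> str:
    """Hash a random UUID string with SHA-256 and return the hex digest."""
    # Hypothetical stand-in for sha2(expr("uuid()"), 256) in the Spark cell.
    return hashlib.sha256(str(uuid.uuid4()).encode("utf-8")).hexdigest()


# Build the same number of IDs as spark.range(50) produces rows.
ids = [make_tracking_id() for _ in range(50)]
```

Each ID is 64 hexadecimal characters long. `df.show(10)` shortens long strings by default, which is why the values in the table above end in `...`.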



## Run Pipeline

Execute the ETL pipeline using the static bearer token configuration.

The pipeline will automatically add the `Authorization: Bearer <token>` header to all requests.
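
What this header injection amounts to can be sketched in a few lines. The pipeline's actual middleware is not shown in this notebook, so `with_bearer_auth` and the placeholder token below are hypothetical and only illustrate the header format.

```python
def with_bearer_auth(headers: dict[str, str], token: str) -> dict[str, str]:
    """Return a copy of `headers` with a static Bearer token attached."""
    if not token:
        raise ValueError("a static bearer token must be non-empty")
    merged = dict(headers)  # copy so the caller's headers are not mutated
    merged["Authorization"] = f"Bearer {token}"
    return merged


# Placeholder token for illustration only; a real token would come from config.
base_headers = {"Accept": "application/json"}
auth_headers = with_bearer_auth(base_headers, "example-static-token")
```

Because the token is static, the same header value can be reused for every request, with no refresh step in between.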

In [6]:
config_path = project_root / "configs" / "examples" / "bearer_token_demo.yml"

In [7]:
run_pipeline(
    spark=spark,
    config_path=config_path,
    source_df=df,
    source_id="tracking_id"
)

2026-02-08 14:14:41,866 [INFO] [PipelineOrchestrator]: Authentication does not have a runtime service... skipping
2026-02-08 14:14:41,867 [INFO] [PipelineOrchestrator]: Adding authentication middleware
2026-02-08 14:14:41,867 [INFO] [PipelineOrchestrator]: Request from URL: /http://mock-api:8000/api/bearer/data
2026-02-08 14:14:41,868 [INFO] [TableManager]: Creating database demo
2026-02-08 14:14:46,907 [INFO] [TableManager]: Created Delta table: demo.bearer_token_demo_response
2026-02-08 14:14:49,189 [INFO] [BatchProcessor]: ➤ Attempt 1: Processing 1 batches
2026-02-08 14:14:49,190 [INFO] [BatchProcessor]:     → Processing batch 1/1
2026-02-08 14:14:50,250 [INFO] numexpr.utils: Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-02-08 14:14:50,251 [INFO] numexpr.utils: NumExpr defaulting to 8 threads.
✓ All API requests processed
2026-02-08 14:14:53,509 [INFO] [PipelineOrchestrator]: Pipeline run finished


## Verify Results

Read the sink table to verify the API responses were captured.

In [5]:
response_df = spark.table("demo.bearer_token_demo_response")
response_df.show(10)

+--------------------+--------------------+--------------------+------+--------------------+--------------+--------------------+-----------+--------------------+--------------------+-------+-------------+--------+--------------------+--------------------+
|          request_id|            row_hash|                 url|method|     request_headers|request_params|    request_metadata|status_code|    response_headers|           body_text|success|error_message|attempts|   response_metadata|       _request_time|
+--------------------+--------------------+--------------------+------+--------------------+--------------+--------------------+-----------+--------------------+--------------------+-------+-------------+--------+--------------------+--------------------+
|202c6673f2631e297...|32816f0de897d5c78...|/http://mock-api:...|   GET|{"Accept": "appli...|            {}|{"vendor": "mock-...|        200|{"Date": "Mon, 09...|{"request_id":"3c...|   true|         NULL|       1|{"connection_warm..

In [6]:
# Summary statistics
print(f"Total records: {response_df.count()}")
response_df.groupBy("status_code").count().show()

Total records: 100
+-----------+-----+
|status_code|count|
+-----------+-----+
|        200|  100|
+-----------+-----+



In [10]:
# Sample response body - note the auth_method field shows "bearer:static"
response_df.select("body_text").limit(3).show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|body_text                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+-------------------------------------------------------------------------------------------------------
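
Since `body_text` holds raw JSON, the `auth_method` check mentioned in the cell comment could also be done programmatically. The following is a minimal sketch on a hand-written sample. Only the `request_id` and `auth_method` field names and the `"bearer:static"` value come from this notebook; the sample values and the `used_static_bearer` helper are made up for illustration.

```python
import json


def used_static_bearer(body_text: str) -> bool:
    """Return True if a response body reports the static bearer auth method."""
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        # Bodies that are not valid JSON cannot confirm the auth method.
        return False
    return isinstance(body, dict) and body.get("auth_method") == "bearer:static"


# Hypothetical sample body; real rows would come from response_df.
sample_body = '{"request_id": "example-id", "auth_method": "bearer:static"}'
print(used_static_bearer(sample_body))
```

The same helper could be applied to values collected from `response_df.select("body_text")`, assuming every row carries a JSON object.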

----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 46182)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/socketserver.py", line 317, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/conda/lib/python3.11/socketserver.py", line 348, in process_request
    self.finish_request(request, client_address)
  File "/opt/conda/lib/python3.11/socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/conda/lib/python3.11/socketserver.py", line 755, in __init__
    self.handle()
  File "/usr/local/spark/python/pyspark/accumulators.py", line 295, in handle
    poll(accum_updates)
  File "/usr/local/spark/python/pyspark/accumulators.py", line 267, in poll
    if self.rfile in r and func():
                           ^^^^^^
  File "/usr/local/spark/python/pyspark/accumulators.py", line 271, in accum_updates
    num_updates =