# No Authentication Demo

This notebook demonstrates running the ETL pipeline against an API endpoint that requires no authentication.

## Prerequisites

Start the mock API service:
```bash
make up-keycloak
```

Once the service is up, the mock API is reachable at `http://mock-api:8000` from inside the Docker network.
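Before running the cells below, it can help to confirm that the endpoint actually answers. The following is a minimal sketch using only the standard library; `is_reachable` is a hypothetical helper, not part of the pipeline, and the `/api/noauth/data` path in the usage comment is taken from the pipeline log shown in this notebook. Whether that path answers a bare GET with a 2xx status is an assumption.

```python
import http.client
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to `url` completes with a 2xx status.

    Hypothetical helper for a quick connectivity check; it is not part of
    the pipeline code used in this notebook.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts, 4xx/5xx
        # responses (HTTPError is a URLError), and malformed URLs.
        return False


# Example call from inside the Docker network (path assumed from the log):
# is_reachable("http://mock-api:8000/api/noauth/data")
```

A `False` result only says that the request did not succeed from where the code runs; from the host machine the `mock-api` hostname will usually not resolve, since it is a Docker network name.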

In [1]:
import sys
from pathlib import Path

In [2]:
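project_root = Path("/opt/spark/work")
src_path = project_root / "src"

# Put the project's src directory on sys.path so that project modules
# can be imported; skip the insert if it is already there.
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))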

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, expr
from pipeline.orchestrator import run_pipeline

In [4]:
spark = (
    SparkSession.builder
    .appName("noauth_demo_pipeline")
    .getOrCreate()
)

## Create Source DataFrame

Generate a DataFrame of 50 unique tracking IDs, spread over 4 partitions, to drive the API requests. Each ID is the SHA-256 hex digest of a random UUID.

In [5]:
df = (
    spark.range(50)
         .repartition(4)
         .select(
             sha2(expr("uuid()"), 256).alias("tracking_id")
         )
)
df.show(10)

+--------------------+
|         tracking_id|
+--------------------+
|dc78aee4cb79e2e57...|
|ef56312324d7ff41d...|
|d47b2ca42cea5a982...|
|63fa18bebf8bf609f...|
|b30b10f17a5071965...|
|10615b5277b4e0f52...|
|b9b75d978e97f21d5...|
|5367c80ad617e4fba...|
|1ed8837d97e899882...|
|f01fc087545fa5c27...|
+--------------------+
only showing top 10 rows
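For readers without a Spark session at hand, the shape of these IDs can be reproduced in plain Python. This is only a sketch of the same idea as `sha2(expr("uuid()"), 256)`: hash the string form of a random UUID with SHA-256 and keep the hex digest. The function names `make_tracking_id` and `make_tracking_ids` are hypothetical and do not exist in the pipeline.

```python
import hashlib
import uuid


def make_tracking_id() -> str:
    """Return the SHA-256 hex digest of a random UUID string.

    The result is 64 lowercase hexadecimal characters, the same shape as
    the `tracking_id` values shown above.
    """
    random_uuid = str(uuid.uuid4())  # e.g. 36-character canonical form
    return hashlib.sha256(random_uuid.encode("utf-8")).hexdigest()


def make_tracking_ids(n: int) -> list[str]:
    """Return `n` independently generated tracking IDs."""
    return [make_tracking_id() for _ in range(n)]


ids = make_tracking_ids(50)
print(len(ids), len(ids[0]))
```

The values differ on every call because the UUIDs are random, so they will not match the IDs printed by `df.show(10)`.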



## Run Pipeline

Run the ETL pipeline with the no-auth configuration, passing `df` as the source DataFrame and `tracking_id` as its `source_id`.

In [6]:
config_path = project_root / "configs" / "examples" / "noauth_demo.yml"

In [7]:
run_pipeline(
    spark=spark,
    config_path=config_path,
    source_df=df,
    source_id="tracking_id"
)

2026-02-08 14:15:57,669 [INFO] [PipelineOrchestrator]: Authentication does not have a runtime service... skipping
2026-02-08 14:15:57,670 [INFO] [PipelineOrchestrator]: Request from URL: /http://mock-api:8000/api/noauth/data
2026-02-08 14:15:57,671 [INFO] [TableManager]: Creating database demo
2026-02-08 14:16:03,248 [INFO] [TableManager]: Created Delta table: demo.noauth_demo_response
2026-02-08 14:16:05,836 [INFO] [BatchProcessor]: ➤ Attempt 1: Processing 1 batches
2026-02-08 14:16:05,837 [INFO] [BatchProcessor]:     → Processing batch 1/1
2026-02-08 14:16:06,851 [INFO] numexpr.utils: Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-02-08 14:16:06,851 [INFO] numexpr.utils: NumExpr defaulting to 8 threads.
✓ All API requests processed
2026-02-08 14:16:10,106 [INFO] [PipelineOrchestrator]: Pipeline run finished


## Verify Results

Read the sink table `demo.noauth_demo_response` to verify that the API responses were captured.

In [8]:
response_df = spark.table("demo.noauth_demo_response")
response_df.show(10)

+--------------------+--------------------+--------------------+------+--------------------+--------------+--------------------+-----------+--------------------+--------------------+-------+-------------+--------+--------------------+--------------------+
|          request_id|            row_hash|                 url|method|     request_headers|request_params|    request_metadata|status_code|    response_headers|           body_text|success|error_message|attempts|   response_metadata|       _request_time|
+--------------------+--------------------+--------------------+------+--------------------+--------------+--------------------+-----------+--------------------+--------------------+-------+-------------+--------+--------------------+--------------------+
|aca33682944ffc0ae...|0322fd9df26a4a55e...|/http://mock-api:...|   GET|{"Accept": "appli...|            {}|{"vendor": "mock-...|        200|{"Date": "Mon, 09...|{"request_id":"87...|   true|         NULL|       1|{"connection_warm..

In [9]:
# Summary statistics
print(f"Total records: {response_df.count()}")
response_df.groupBy("status_code").count().show()

Total records: 100
+-----------+-----+
|status_code|count|
+-----------+-----+
|        200|  100|
+-----------+-----+
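The same check can be extended beyond status codes once the rows are on the driver. The sketch below works on plain dictionaries keyed by the column names visible in the table above (`request_id`, `status_code`, `success`, `attempts`); `summarize_responses` is a hypothetical helper, and the sample rows in the example are made up for illustration. Note also that the table holds 100 rows while the source DataFrame had 50; one possible explanation is that an earlier run wrote to the same table, but the output shown here does not confirm this.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def summarize_responses(rows: Iterable[Mapping[str, Any]]) -> dict:
    """Summarize response rows given as dict-like records.

    Hypothetical helper; expects the keys `request_id`, `status_code`,
    `success`, and `attempts`, as seen in the sink table's columns.
    """
    rows = list(rows)
    return {
        "total": len(rows),
        "by_status": dict(Counter(r["status_code"] for r in rows)),
        "failed_request_ids": [r["request_id"] for r in rows if not r["success"]],
        "retried": sum(1 for r in rows if r["attempts"] > 1),
    }


# Made-up rows, for illustration only.
example_rows = [
    {"request_id": "a", "status_code": 200, "success": True, "attempts": 1},
    {"request_id": "b", "status_code": 200, "success": True, "attempts": 2},
    {"request_id": "c", "status_code": 500, "success": False, "attempts": 3},
]
print(summarize_responses(example_rows))
```

With a live session, the rows could be supplied as `(r.asDict() for r in response_df.collect())`; collecting is reasonable for a demo-sized table but not for a large one.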



In [10]:
# Sample response body
response_df.select("body_text").limit(3).show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|body_text                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 39368)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/socketserver.py", line 317, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/conda/lib/python3.11/socketserver.py", line 348, in process_request
    self.finish_request(request, client_address)
  File "/opt/conda/lib/python3.11/socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/conda/lib/python3.11/socketserver.py", line 755, in __init__
    self.handle()
  File "/usr/local/spark/python/pyspark/accumulators.py", line 295, in handle
    poll(accum_updates)
  File "/usr/local/spark/python/pyspark/accumulators.py", line 267, in poll
    if self.rfile in r and func():
                           ^^^^^^
  File "/usr/local/spark/python/pyspark/accumulators.py", line 271, in accum_updates
    num_updates =