# Lakekeeper Bootstrap Process - Technical Deep Dive

This document provides a comprehensive explanation of the Lakekeeper bootstrap process, analyzing each step from the `02-01-Bootstrap.ipynb` notebook with technical details from the Keycloak realm configuration.

## 🎯 **Overview**

The bootstrap process initializes Lakekeeper with the first administrative user and sets up the foundational access control system. This is a **one-time operation** that can only be performed when the system is in an uninitialized state.

## 📋 **Prerequisites**

### Required Services
- **Keycloak** (Identity Provider) - Exposed on host port 30080 (reachable as `keycloak:8080` inside the Docker network)
- **Lakekeeper** (Data Catalog) - Running on port 8181
- **OpenFGA** (Authorization Backend) - Running internally
- **PostgreSQL** (Metadata Database) - Running internally

### Keycloak Configuration
From `realm.json`, the system is configured with:

#### **Realm Settings**
- **Realm Name**: `iceberg`
- **Access Token Lifespan**: 300 seconds (5 minutes)
- **SSO Session Timeout**: 1800 seconds (30 minutes)
- **Default Signature Algorithm**: RS256

#### **Pre-configured Users**
1. **Human Users**:
   - `peter` (peter@example.com) - Admin user
   - `anna` (anna@example.com) - Regular user with no initial permissions

2. **Service Accounts** (Technical Users):
   - `spark` - Used for bootstrap process
   - `trino` - Query engine access
   - `duckdb` - Query engine access  
   - `starrocks` - Query engine access
   - `openfga` - Authorization backend

## 🔧 **Step-by-Step Process Analysis**

### **Step 1: Install Dependencies**

In [1]:
!pip install -q pyjwt


**What Happens:**
- Installs the `pyjwt` library for JWT token handling
- This library allows us to decode JWT tokens without signature verification
- Essential for inspecting token contents during the bootstrap process

**Technical Details:**
- **Library Purpose**: JWT (JSON Web Token) decoding and validation
- **Security Note**: The notebook runs as root inside the Docker container, so pip may warn about installing as root; this is expected here
- **Usage**: We'll use this to inspect tokens received from Keycloak
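
To illustrate why a token can be inspected without any key, the following is a minimal sketch that decodes a JWT payload with the standard library only. It is not how `pyjwt` is implemented; the function name `decode_jwt_payload` and the demo token are hypothetical and exist only for this example.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the claims of a JWT without checking its signature."""
    # A JWT consists of three base64url segments: header.payload.signature
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    payload = parts[1]
    # base64url encoding drops '=' padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def _b64url(data: dict) -> str:
    """Helper used only to build a demo token for this example."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Demo token with a dummy signature; a real token comes from Keycloak.
demo_token = ".".join([_b64url({"alg": "none"}), _b64url({"azp": "spark"}), "sig"])
demo_claims = decode_jwt_payload(demo_token)
```

Because nothing here checks the signature, the decoded claims are suitable for inspection only, never for access decisions.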

### **Step 2: Import Libraries and Define URLs**

In [2]:
import requests, jwt
from IPython.display import JSON

CATALOG_URL = "http://lakekeeper:8181/catalog"
MANAGEMENT_URL = "http://lakekeeper:8181/management"
KEYCLOAK_TOKEN_URL = "http://keycloak:8080/realms/iceberg/protocol/openid-connect/token"

**What Happens:**
- Imports required libraries for HTTP requests and JWT handling
- Defines service endpoints for internal communication
- Sets up the OAuth2/OIDC token endpoint for Keycloak

**Technical Details:**
- **Internal Communication**: All services communicate via Docker network
- **OAuth2 Flow**: Using client credentials grant type
- **Service Discovery**: URLs point to internal Docker service names

### **Step 3: Authenticate with Keycloak**

#### Bootstrapping Lakekeeper
Initially, Lakekeeper needs to be bootstrapped.
During bootstrapping, the initial `admin` is set. Bootstrapping can only be performed once. The first user to call the bootstrap endpoint becomes the `admin`.

This notebook performs bootstrapping via Python `requests`. It only works if the server has not already been bootstrapped through the UI.

#### 1. Sign in
First, we need to obtain a token from our identity provider. In this example, Keycloak runs alongside Lakekeeper. A few users have been pre-created in Keycloak for this example. We now log in to Keycloak as the technical user (client) `spark`. If a human user bootstraps the catalog, we recommend using the UI.

Keycloak can be accessed at http://localhost:30080 in this example. Use `admin` as username and password. Then select the `iceberg` realm on the top left corner.

In [3]:
# Login to Keycloak
CLIENT_ID = "spark"
CLIENT_SECRET = "2OR3eRvYfSZzzZ16MlPd95jhLnOaLM52"

response = requests.post(
    url=KEYCLOAK_TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "lakekeeper"
    },
    headers={"Content-type": "application/x-www-form-urlencoded"},
)
response.raise_for_status()
access_token = response.json()['access_token']



**What Happens:**
- Authenticates as the `spark` service account using client credentials
- Requests an access token with `lakekeeper` scope
- Receives a JWT token for subsequent API calls

**Technical Details:**
- **Grant Type**: `client_credentials` (machine-to-machine authentication)
- **Client ID**: `spark` (pre-configured service account)
- **Scope**: `lakekeeper` (specific permission scope)
- **Token Type**: JWT (JSON Web Token)
- **Token Lifespan**: 300 seconds (5 minutes)
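
The notebook hard-codes the client secret for convenience. As a hedged alternative, the sketch below builds the same client credentials form but reads the secret from an environment variable. The function name `client_credentials_form` and the variable name `SPARK_CLIENT_SECRET` are assumptions made for this example, not part of the notebook.

```python
import os


def client_credentials_form(client_id, secret_env_var, scope="lakekeeper", env=None):
    """Build the form body for an OAuth2 client credentials request.

    The secret is read from `secret_env_var` instead of being hard-coded.
    `env` defaults to os.environ and can be replaced by a dict in tests.
    """
    env = os.environ if env is None else env
    secret = env.get(secret_env_var)
    if not secret:
        raise RuntimeError(f"environment variable {secret_env_var} is not set")
    return {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": secret,
        "scope": scope,
    }


# Hypothetical variable name; a plain dict stands in for the real environment.
form = client_credentials_form(
    "spark", "SPARK_CLIENT_SECRET", env={"SPARK_CLIENT_SECRET": "example-secret"}
)
```

The resulting dictionary can be passed as `data=form` to the `requests.post` call against `KEYCLOAK_TOKEN_URL` shown in the cell above.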

**From realm.json - Spark Client Configuration:**
```json
{
  "clientId": "spark",
  "secret": "2OR3eRvYfSZzzZ16MlPd95jhLnOaLM52",
  "serviceAccountsEnabled": true,
  "standardFlowEnabled": false,
  "directAccessGrantsEnabled": false
}
```
Now that we have the access token, we can query the server info endpoint.
On first launch, it reports `"bootstrapped": false`.
The full API documentation is part of the repository and is also served by Lakekeeper: [http://localhost:8181/swagger-ui/#/](http://localhost:8181/swagger-ui/#/)

### **Step 4: Inspect the JWT Token**

In [4]:
# Let's inspect the token we got to see that our application name is available:
JSON(jwt.decode(access_token, options={"verify_signature": False}))

**What Happens:**
- Decodes the JWT token to inspect its contents
- Shows user information, permissions, and token metadata
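- Confirms that Keycloak issued a token for the expected client; because `verify_signature` is `False`, this inspection does not validate the token cryptographically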

**Expected Token Structure:**
```json
{
  "acr": "1",
  "aud": ["lakekeeper", "account"],
  "azp": "spark",
  "client_id": "spark",
  "exp": 1754303226,
  "iat": 1754299626,
  "iss": "http://keycloak:8080/realms/iceberg",
  "preferred_username": "service-account-spark",
  "realm_access": {
    "roles": ["offline_access", "uma_authorization", "default-roles-iceberg"]
  },
  "scope": "email profile",
  "sub": "9410d0bf-4487-4177-a34f-af364cac0a59",
  "typ": "Bearer"
}
```

**Technical Details:**
- **Subject (sub)**: Unique user identifier
- **Issuer (iss)**: Keycloak realm URL
- **Audience (aud)**: Intended recipients (lakekeeper, account)
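- **Expiration (exp)**: Token expiration timestamp; in the sample above, `exp - iat` is 3600 seconds, so compare both claims to confirm the effective lifespan rather than assuming the 300-second realm setting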
- **Roles**: Default realm roles assigned to service account
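
The claims listed above can be checked programmatically once the token is decoded. The following sketch uses two hypothetical helpers, `token_lifetime_seconds` and `has_audience`, on a subset of the sample claims shown earlier. It assumes `aud` may be either a string or a list, which RFC 7519 permits.

```python
def token_lifetime_seconds(claims: dict) -> int:
    """Lifetime of the token as issued: expiration minus issued-at."""
    return claims["exp"] - claims["iat"]


def has_audience(claims: dict, expected: str) -> bool:
    """Check whether `expected` is among the token's audiences."""
    aud = claims.get("aud", [])
    # RFC 7519 allows a single string or a list of strings.
    if isinstance(aud, str):
        aud = [aud]
    return expected in aud


# Subset of the sample claims from the expected token structure.
sample_claims = {
    "aud": ["lakekeeper", "account"],
    "exp": 1754303226,
    "iat": 1754299626,
}

lifetime = token_lifetime_seconds(sample_claims)  # 3600 for this sample
for_lakekeeper = has_audience(sample_claims, "lakekeeper")
```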

### **Step 5: Check Server Status**

In [5]:
response = requests.get(
    url=f"{MANAGEMENT_URL}/v1/info",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
JSON(response.json())
# On first launch it shows "bootstrapped": False

**What Happens:**
- Queries Lakekeeper's server information endpoint
- Checks if the system has been bootstrapped
- Verifies the authentication token is valid

**Expected Response (Before Bootstrap):**
```json
{
  "authz-backend": "openfga",
  "aws-system-identities-enabled": false,
  "azure-system-identities-enabled": false,
  "bootstrapped": false,
  "default-project-id": "00000000-0000-0000-0000-000000000000",
  "gcp-system-identities-enabled": false,
  "queues": ["tabular_purge", "tabular_expiration"],
  "server-id": "00000000-0000-0000-0000-000000000000",
  "version": "0.9.3"
}
```

**Technical Details:**
- **Bootstrapped**: `false` indicates system needs initialization
- **Authz Backend**: OpenFGA is configured for authorization
- **Version**: Lakekeeper version information
- **Queues**: Background processing queues
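
Since bootstrap can only be performed once, a re-run of the notebook can use the `bootstrapped` flag to decide whether to call the bootstrap endpoint at all. The helper name `needs_bootstrap` below is hypothetical, and the sketch only interprets the info response shown above.

```python
def needs_bootstrap(info: dict) -> bool:
    """Return True when the server info reports an uninitialized server."""
    if "bootstrapped" not in info:
        # Fail loudly instead of guessing when the field is absent.
        raise KeyError("server info has no 'bootstrapped' field")
    return not info["bootstrapped"]


# Trimmed versions of the responses before and after bootstrap.
before = {"authz-backend": "openfga", "bootstrapped": False}
after = {"authz-backend": "openfga", "bootstrapped": True}

should_run = needs_bootstrap(before)
should_skip = not needs_bootstrap(after)
```

In the notebook, the argument would be `response.json()` from the `/v1/info` call.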

### **Step 6: Perform Bootstrap**

#### 2. Bootstrap

In [8]:
response = requests.post(
    url=f"{MANAGEMENT_URL}/v1/bootstrap",
    headers={
        "Authorization": f"Bearer {access_token}"
    },
    json={
        "accept-terms-of-use": True,
        # Optionally, we can override the name / type of the user:
        # "user-email": "user@example.com",
        # "user-name": "Roald Amundsen",
        # "user-type": "human"
    },
)
response.raise_for_status()

**What Happens:**
- Initiates the bootstrap process
- Creates the first administrative user
- Sets up the foundational system configuration
- **Critical**: This can only be done once!

**Technical Details:**
- **One-time Operation**: Bootstrap can only be performed once
- **Admin Creation**: The calling user becomes the system administrator
- **Terms Acceptance**: The request body must set `accept-terms-of-use` to `true`
- **System Initialization**: Sets up default projects and permissions
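
The optional overrides that are commented out in the bootstrap cell can be added conditionally. This sketch uses a hypothetical helper, `bootstrap_payload`, and only the field names that appear in the cell above.

```python
def bootstrap_payload(user_name=None, user_email=None, user_type=None):
    """Build the bootstrap request body, adding overrides only when given."""
    body = {"accept-terms-of-use": True}
    optional = {
        "user-name": user_name,
        "user-email": user_email,
        "user-type": user_type,
    }
    # Leave out unset fields so the server keeps its own defaults.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


default_body = bootstrap_payload()
human_body = bootstrap_payload(user_name="Roald Amundsen", user_type="human")
```

Either dictionary can be passed as `json=` to the `requests.post` call against `/v1/bootstrap`.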

### **Step 7: Grant Permissions to Human Users**

#### 3. Grant access to UI User
Keycloak already contains the user `peter`; we now assign this user the `admin` role as well.

Before executing the next cell, log in to the UI at http://localhost:8181 using:
* Username: `peter`
* Password: `iceberg`

You should see "You don't have any projects assignments".

Let's assign permissions to peter:

In [9]:

# Users will show up in the /v1/user endpoint after the first login via the UI
# or the first call to the /catalog/v1/config endpoint.
response = requests.get(
    url=f"{MANAGEMENT_URL}/v1/user",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
JSON(response.json())

In [10]:
response = requests.post(
    url=f"{MANAGEMENT_URL}/v1/permissions/server/assignments",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
      "writes": [
        {
          "type": "admin",
          "user": "oidc~cfb55bf6-fcbb-4a1e-bfec-30c6649b52f8"
        }
      ]
    }
)
response.raise_for_status()

**What Happens:**
- Grants server-level admin permissions to Peter
- Allows Peter to manage the entire Lakekeeper system
- Uses OpenID Connect user identifier format

**Technical Details:**
- **User ID Format**: `oidc~{keycloak-user-id}`
- **Permission Type**: `admin` (highest level access)
- **Scope**: Server-wide permissions
- **User ID**: `cfb55bf6-fcbb-4a1e-bfec-30c6649b52f8` (Peter's Keycloak ID)
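
Instead of copying Peter's ID by hand, it could be looked up in the user listing returned by `/v1/user`. The sketch below assumes that the listing has the shape `{"users": [{"id": ..., "name": ...}]}`; that shape is an assumption and should be checked against the actual response. The helper `find_user_id` is hypothetical.

```python
def find_user_id(listing: dict, name: str) -> str:
    """Return the ID of the single user called `name`.

    Assumes the listing looks like {"users": [{"id": ..., "name": ...}]};
    verify this against the real /v1/user response before relying on it.
    """
    matches = [
        user["id"] for user in listing.get("users", []) if user.get("name") == name
    ]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one user named {name!r}, found {len(matches)}")
    return matches[0]


# Illustrative listing built from IDs that appear in this document.
example_listing = {
    "users": [
        {"id": "oidc~9410d0bf-4487-4177-a34f-af364cac0a59", "name": "service-account-spark"},
        {"id": "oidc~cfb55bf6-fcbb-4a1e-bfec-30c6649b52f8", "name": "Peter Cold"},
    ]
}

peter_id = find_user_id(example_listing, "Peter Cold")
```

Raising when there are zero or several matches avoids granting `admin` to the wrong account.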

#### **7.2: Assign Project Admin Role**
You can now refresh the UI page and should see the default Lakehouse.

In [11]:
response = requests.post(
    url=f"{MANAGEMENT_URL}/v1/permissions/project/assignments",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
      "writes": [
        {
          "type": "project_admin",
          "user": "oidc~cfb55bf6-fcbb-4a1e-bfec-30c6649b52f8"
        }
      ]
    }
)
response.raise_for_status()

**What Happens:**
- Grants project-level admin permissions to Peter
- Allows Peter to manage data projects and resources
- Enables full access to data lake operations

**Technical Details:**
- **Permission Type**: `project_admin` (project-level administration)
- **Scope**: Project and data resource management
- **Hierarchy**: Project admin is below server admin in permission levels

### **Step 8: Register Query Engine Service Accounts**

#### 3.1 Grant Access to the trino, duckdb, and starrocks Users

In [12]:
# First we log in as each query engine client (trino, duckdb, starrocks),
# so that the users are known to Lakekeeper.

QUERY_ENGINE_CLIENTS = [
    ("trino", "AK48QgaKsqdEpP9PomRJw7l2T7qWGHdZ"),
    ("duckdb", "r2dHUlb7XrkSRcvrRqG5XZwQfnUS5NlL"),
    ("starrocks", "X5IWbfDJBTcU1F3PGZWgxDJwLyuFQmSf"),
]

for client_id, client_secret in QUERY_ENGINE_CLIENTS:
    response = requests.post(
        url=KEYCLOAK_TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": "lakekeeper"
        },
        headers={"Content-type": "application/x-www-form-urlencoded"},
    )
    response.raise_for_status()
    access_token_client = response.json()['access_token']
    
    response = requests.post(
        url=f"{MANAGEMENT_URL}/v1/user",
        headers={"Authorization": f"Bearer {access_token_client}"},
        json={"update-if-exists": True}
    )
    response.raise_for_status()

**What Happens:**
- Authenticates each query engine service account
- Registers them as users in Lakekeeper
- Enables query engines to access the data catalog

**Technical Details:**
- **Service Accounts**: Machine-to-machine authentication
- **Client Secrets**: Pre-configured in Keycloak realm
- **User Registration**: Creates user records in Lakekeeper
- **Scope**: `lakekeeper` for catalog access
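
The user IDs needed for the permission grants follow the `oidc~{keycloak-user-id}` format described earlier. Assuming the Keycloak user ID is the token's `sub` claim, which matches the sample token but should be confirmed against the `/v1/user` listing, the ID can be derived from each client's decoded token. The helper name `lakekeeper_user_id` is hypothetical.

```python
def lakekeeper_user_id(claims: dict, idp_prefix: str = "oidc") -> str:
    """Build a Lakekeeper user ID of the form '<prefix>~<sub>'.

    Assumption: the Keycloak user ID in the ID format is the 'sub' claim.
    """
    return f"{idp_prefix}~{claims['sub']}"


# 'sub' value taken from the sample spark token shown earlier.
spark_claims = {"sub": "9410d0bf-4487-4177-a34f-af364cac0a59"}
spark_user_id = lakekeeper_user_id(spark_claims)
```

Collecting these IDs inside the registration loop would remove the need to paste them into the next cell by hand.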

In [13]:
# Users will show up in the /v1/user endpoint after the first login via the UI
# or the first call to the /catalog/v1/config endpoint.
response = requests.get(
    url=f"{MANAGEMENT_URL}/v1/user",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
JSON(response.json())

In [14]:
response = requests.post(
    url=f"{MANAGEMENT_URL}/v1/permissions/project/assignments",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
      "writes": [
        {
            "type": "project_admin",
            "user": "oidc~94eb1d88-7854-43a0-b517-a75f92c533a5"
        },
        {
            "type": "project_admin",
            "user": "oidc~7515be4b-ce5b-4371-ab31-f40b97f74ec6"
        },
        {
            "type": "project_admin",
            "user": "oidc~7a5da0c5-24e2-4148-a8d9-71c748275928"
        }
      ]
    }
)
response.raise_for_status()

**What Happens:**
- Grants project admin permissions to all query engines
- Enables query engines to read/write data and metadata
- Allows engines to manage their own data resources
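
The `writes` body in the cell above repeats the same structure for each user. A small builder keeps the role and the user list separate. `assignment_writes` is a hypothetical helper that reproduces the request body from the notebook.

```python
def assignment_writes(role: str, user_ids: list) -> dict:
    """Build the body for a permissions assignment request."""
    return {"writes": [{"type": role, "user": user_id} for user_id in user_ids]}


# The three query engine user IDs used in the notebook cell.
engine_ids = [
    "oidc~94eb1d88-7854-43a0-b517-a75f92c533a5",
    "oidc~7515be4b-ce5b-4371-ab31-f40b97f74ec6",
    "oidc~7a5da0c5-24e2-4148-a8d9-71c748275928",
]
project_admin_body = assignment_writes("project_admin", engine_ids)
```

The same helper covers the earlier server-level grant, for example `assignment_writes("admin", [peter_id])` with a user ID of your own.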

### **Step 9: Validate Bootstrap Completion**

#### 4. Validate Bootstrap

In [15]:
# The server is now bootstrapped:
response = requests.get(
    url=f"{MANAGEMENT_URL}/v1/info",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
JSON(response.json())

**Expected Response (After Bootstrap):**
```json
{
  "authz-backend": "openfga",
  "bootstrapped": true,
  "default-project-id": "00000000-0000-0000-0000-000000000000",
  "server-id": "00000000-0000-0000-0000-000000000000",
  "version": "0.9.3"
}
```

**Technical Details:**
- **Bootstrapped**: Now `true` indicating successful initialization
- **System Ready**: Lakekeeper is now operational
- **Admin User**: Service account `spark` is the system administrator

#### **9.2: Verify User List**

In [16]:
# An initial user was created
response = requests.get(
    url=f"{MANAGEMENT_URL}/v1/user",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
JSON(response.json())


**Expected Users:**
1. **service-account-spark** (System Admin)
2. **Peter Cold** (Human Admin)
3. **service-account-trino** (Query Engine)
4. **service-account-duckdb** (Query Engine)
5. **service-account-starrocks** (Query Engine)

#### **9.3: Verify Admin Permissions**

In [17]:
# The "admin" role has been assigned:
response = requests.get(
    url=f"{MANAGEMENT_URL}/v1/permissions/server/assignments",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
user_id = response.json()['assignments'][0]['user']
JSON(response.json())

In [18]:
# This user is the global admin, which has all access rights to the server:
response = requests.get(
    url=f"{MANAGEMENT_URL}/v1/permissions/server/access",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
JSON(response.json())

In [21]:
# Let's see who this user is:
response = requests.get(
    url=f"{MANAGEMENT_URL}/v1/user/{user_id}",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
JSON(response.json())


**Expected Permissions:**
```json
{
  "allowed-actions": [
    "create_project",
    "update_users", 
    "delete_users",
    "list_users",
    "grant_admin",
    "provision_users",
    "read_assignments"
  ]
}
```
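
To turn the visual check of `allowed-actions` into an assertion, the response can be compared against the actions a caller is expected to hold. `missing_actions` is a hypothetical helper, and the response below is a shortened version of the expected permissions above.

```python
def missing_actions(access: dict, required: list) -> list:
    """Return the required actions absent from 'allowed-actions', sorted."""
    allowed = set(access.get("allowed-actions", []))
    return sorted(set(required) - allowed)


# Shortened version of the expected permissions response.
access_response = {
    "allowed-actions": ["create_project", "list_users", "grant_admin"]
}

none_missing = missing_actions(access_response, ["create_project", "grant_admin"])
some_missing = missing_actions(access_response, ["delete_users", "create_project"])
```

An empty result means every required action is allowed; anything else lists what still needs to be granted.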

## 🔐 **Security Architecture**

### **Authentication Flow**
1. **Service Account Authentication**: Using client credentials
2. **JWT Token Generation**: Keycloak issues signed tokens
3. **Token Validation**: Lakekeeper checks that each token was issued and signed by the `iceberg` realm in Keycloak
4. **Permission Checking**: OpenFGA handles authorization decisions

### **Authorization Model**
- **Server Admin**: Full system access
- **Project Admin**: Project and data resource management
- **Query Engines**: Data access and metadata management
- **Human Users**: Role-based access control

### **Keycloak Realm Configuration**
From `realm.json`:
- **Realm**: `iceberg`
- **Users**: 7 pre-configured users (2 human, 5 service accounts)
- **Clients**: 10 configured clients including query engines
- **Roles**: Default roles with UMA authorization support

## ⚠️ **Important Notes**

### **Bootstrap Limitations**
- **One-time Operation**: Bootstrap can only be performed once
- **Admin Assignment**: First user to bootstrap becomes system admin
- **UI vs. Notebook**: Bootstrap through either the UI or this notebook, not both; the notebook only works if the server has not already been bootstrapped via the UI

### **Security Considerations**
- **Service Account Secrets**: Pre-configured in realm.json
- **Token Lifespan**: 5 minutes for access tokens
- **Internal Communication**: All services communicate via Docker network
- **Permission Inheritance**: Server admin > Project admin > User permissions

### **Troubleshooting**
- **Token Expiration**: Re-authenticate if tokens expire
- **Service Dependencies**: Ensure all services are healthy before bootstrap
- **Permission Errors**: Verify user IDs and permission assignments
- **Network Issues**: Check Docker network connectivity
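
For the token expiration case, one option is a small cache that fetches a new token shortly before the old one expires. The sketch below is one possible approach, not part of the notebook: `TokenCache` is a hypothetical class, `fetch` stands for any function that returns a fresh access token, and the 300-second default mirrors the realm setting listed in the prerequisites.

```python
import time


class TokenCache:
    """Cache an access token and renew it shortly before it expires."""

    def __init__(self, fetch, lifetime_seconds=300, margin_seconds=30,
                 clock=time.monotonic):
        self._fetch = fetch              # callable returning a new token
        self._lifetime = lifetime_seconds
        self._margin = margin_seconds    # renew this long before expiry
        self._clock = clock              # injectable for testing
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._lifetime
        return self._token


# Demo with a fake clock and a fake fetch function (no network involved).
fake_time = [0.0]
issued = []


def fake_fetch():
    issued.append(len(issued) + 1)
    return f"token-{issued[-1]}"


cache = TokenCache(fake_fetch, clock=lambda: fake_time[0])
first = cache.get()        # fetches a new token
fake_time[0] = 100.0
second = cache.get()       # still valid, served from the cache
fake_time[0] = 271.0
third = cache.get()        # inside the 30-second margin, fetched again
```

In the notebook, `fetch` would wrap the `requests.post` call against `KEYCLOAK_TOKEN_URL`, and each API call would use `cache.get()` for its `Authorization` header.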

## 🎯 **Next Steps**

After successful bootstrap:
1. **Login to UI**: Use Peter's credentials (peter/iceberg)
2. **Create Warehouses**: Set up data storage locations
3. **Test Query Engines**: Verify Spark, Trino, DuckDB, and StarRocks access
4. **Configure Permissions**: Set up additional user access as needed

---

**The bootstrap process establishes the foundational security and access control system for the entire data lake platform.** 