
ai-tier v0.2 k0s #87

Merged
kupratyu-splunk merged 55 commits into main from ai-tier-v2-k0s
Apr 29, 2026

Conversation

@spl-arif (Collaborator)

Description

Related Issues

  • Related to #

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test improvement
  • CI/CD improvement
  • Chore (dependency updates, etc.)

Changes Made

Testing Performed

  • Unit tests pass (make test)
  • Linting passes (make lint)
  • Integration tests pass (if applicable)
  • E2E tests pass (if applicable)
  • Manual testing performed

Test Environment

  • Kubernetes Version:
  • Cloud Provider:
  • Deployment Method:

Test Steps

Documentation

  • Updated inline code comments
  • Updated README.md (if adding features)
  • Updated API documentation
  • Updated deployment guides
  • Updated CHANGELOG.md
  • No documentation needed

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published
  • I have updated the Helm chart version (if applicable)
  • I have updated CRD schemas (if applicable)

Breaking Changes

Impact:

Migration Path:

Screenshots/Recordings

Additional Notes

Reviewer Notes

Please pay special attention to:


Commit Message Convention: This PR follows Conventional Commits

kupratyu-splunk and others added 30 commits March 20, 2026 23:58
Version 3.0.0 does not exist in the splunk helm repo; 3.1.0 is the
latest available. Also regenerates Chart.lock with correct digest.
The splunkai_models_apps package no longer exists in ai-platform-models.
The Ray applications are now resolved relative to their working_dir zip,
so import paths should be bare module names (main:SERVE_APP / main:create_serve_app).
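For illustration only, a Ray Serve application entry of the shape this implies (placeholder app name; the real applications.yaml templates many more fields):

```yaml
applications:
  - name: ExampleApp                  # placeholder, not an actual app entry
    # Bare module name: 'main' sits at the root of the working_dir zip,
    # so no splunkai_models_apps package prefix is needed.
    import_path: main:create_serve_app
```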
…ersion into ApplicationParams

Without working_dir, Ray has no zip to load main from and fails with
'No module named main'. Added WorkingDirBase and ModelVersion fields to
ApplicationParams, computed from object storage path and MODEL_VERSION
env var, and templated working_dir into all 13 app entries in applications.yaml.
…b_storage

Two bugs causing NoSuchBucket when Ray downloads working_dir zips:

1. rayS3DownloadEnv() was missing AWS_S3_ADDRESSING_STYLE=path. Boto3
   defaults to virtual-hosted style (bucket.endpoint) for custom endpoints,
   which fails DNS resolution with MinIO. Path-style (endpoint/bucket/key)
   is required for all S3-compatible stores.

2. applications.yaml used 'object_storage' as the model_loader sub-field but
   ModelLoader in model_definition.py defines it as 'blob_storage' (renamed
   in commit e62d93da). Pydantic silently ignored the unknown key, leaving
   blob_storage=None and causing a model validation error at startup.
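A sketch of the field rename from point 2, with placeholder values (the real model_loader block carries more configuration):

```yaml
# Before: unknown key, silently ignored by Pydantic (blob_storage stays None)
model_loader:
  object_storage:          # wrong sub-field name
    bucket: models         # placeholder value
---
# After: matches the ModelLoader field renamed in commit e62d93da
model_loader:
  blob_storage:
    bucket: models         # placeholder value
```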
…handler

Ray's s3:// protocol handler (protocol.py _handle_s3_protocol) creates a
plain boto3.Session().client('s3') with no endpoint_url, so it always hits
AWS S3 regardless of AWS_ENDPOINT_URL set on the pod. This causes NoSuchBucket
when the bucket only exists in MinIO.

Replace rayRuntimeWorkingDirScheme() with rayWorkingDirBase() which, for
S3-compatible stores with a custom endpoint, builds the working_dir as a
direct HTTP URL to MinIO (endpoint/bucket/path). Ray's https handler uses
urllib which simply fetches the URL without any S3-specific boto3 logic.

Also remove the ineffective AWS_S3_ADDRESSING_STYLE env var added in the
previous commit.
…nIO zips

Ray's s3:// protocol handler creates a bare boto3.Session().client('s3')
with no endpoint_url, so it always hits AWS S3 regardless of any custom
endpoint config. Rather than fighting Ray internals, switch to file://
working_dir pointing to app source baked into the Ray image.

- applications.yaml: replace all 'minio-zip' working_dir templates with
  file:///home/ray/ray/applications/entrypoint (Entrypoint) and
  file:///home/ray/ray/applications/generic_application (all other apps)
- builder.go: remove WorkingDirBase, ModelVersion fields and rayWorkingDirBase()
  function — no longer needed since working_dir is a static file:// path
- builder_test.go: remove TestRayWorkingDirBase test for deleted function
…ote URL for others

PromptInjectionTfidf, PromptInjectionCrossEncoder, PromptInjectionClassifier are
baked into the Ray worker image at /home/ray/ray/applications/generic_application,
so they use file:// working_dir with no network dependency.

All other apps (UaeLarge, AllMinilmL6V2, BiEncoder, MbartTranslator, etc.) continue
to use {{.WorkingDirBase}}/AppName-{{.ModelVersion}}.zip resolved at runtime from
the configured object storage (s3, gs, azure, or s3compat/MinIO endpoint).
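Roughly, the resulting split in the applications template looks like this (entries abridged; exact nesting assumed):

```yaml
# Baked into the Ray worker image -> no network dependency at deploy time
- name: PromptInjectionTfidf
  runtime_env:
    working_dir: file:///home/ray/ray/applications/generic_application

# Downloaded at runtime from the configured object storage
- name: UaeLarge
  runtime_env:
    working_dir: "{{.WorkingDirBase}}/UaeLarge-{{.ModelVersion}}.zip"
```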
…cile

- saia/impl.go: bump default memory request 1Gi->2Gi, limits CPU 1->2 / memory 2Gi->4Gi
  to prevent kubelet OOMKill during SAIA startup
- reconciler.go: preserve existing AIService Resources on reconcile so user-set limits
  are not wiped back to defaults on every AIPlatform reconcile
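As a rough sketch, the new SAIA defaults correspond to a resources block like this (requests.cpu is assumed; only the values named in the commit are taken from it):

```yaml
resources:
  requests:
    cpu: "1"           # assumed unchanged
    memory: 2Gi        # was 1Gi
  limits:
    cpu: "2"           # was 1
    memory: 4Gi        # was 2Gi
```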
Ray requires file:// working_dir URIs to point to a .zip or .whl file.
Update the 3 prompt injection apps to reference generic_application.zip
which is built during the Docker image build in ai-platform-models.
…g_dir base

All 13 Ray Serve apps now use the generic_application.zip baked into the
Ray head image via file://, eliminating the need to upload versioned zips
to MinIO. Also fixes rayWorkingDirBase to return s3:// for all S3-compatible
backends (minio, seaweedfs, s3compat) so AWS_ENDPOINT_URL on the pods
redirects boto3 to the MinIO endpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MbartTranslatorDeployment hardcodes its blob_prefix and does not accept
model_definition as an init arg, so passing it via .bind() caused a
Ray pickle error.
Replace Llama31Instruct (8b) with GptOss20b and Llama3170bInstructAwq (70b)
with GptOss120b. L40S-only, tool_parser: openai, VLLM_ATTENTION_BACKEND:
TRITON_ATTN, 1 GPU for 20b and 4 GPUs for 120b.
…env override

Actor-level runtime_env.env_vars in gpu_type_options_override replaces the
app-level runtime_env, causing APPLICATION_NAME and other vars to be lost.
Move VLLM_ATTENTION_BACKEND: TRITON_ATTN to the top-level env_vars instead.
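A sketch of the fix, with the surrounding structure assumed rather than copied from applications.yaml:

```yaml
- name: GptOss20b
  runtime_env:
    env_vars:
      APPLICATION_NAME: GptOss20b           # preserved
      VLLM_ATTENTION_BACKEND: TRITON_ATTN   # moved up from the actor-level override
  gpu_type_options_override:
    L40S:
      num_gpus: 1
      # no runtime_env here: an actor-level runtime_env would replace the
      # app-level block above and drop APPLICATION_NAME
```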
…r terminating nodes during model load

Large model loading (e.g. gpt-oss-120b) takes several minutes. Without an
idle timeout, the Ray autoscaler terminates worker nodes after 60s (default),
killing the replica mid-load with SIGTERM.
…pec field

WorkerGroupSpec.idleTimeoutSeconds is rejected as unknown by the installed
KubeRay CRD version. AutoscalerOptions.IdleTimeoutSeconds is set at the
cluster level and read directly by the Ray autoscaler process, achieving
the same effect without requiring a CRD upgrade.

600s idle timeout prevents the autoscaler from terminating worker nodes
while large models (e.g. gpt-oss-120b) are loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
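In RayCluster terms, the setting moves from the rejected worker-group field to the cluster-level autoscaler options, roughly:

```yaml
spec:
  autoscalerOptions:
    idleTimeoutSeconds: 600      # read directly by the Ray autoscaler process
  workerGroupSpecs:
    - groupName: gpu-workers     # placeholder group name
      # idleTimeoutSeconds here is rejected as unknown by the installed
      # KubeRay CRD version
```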
The 120b model download/load exceeds the previous 50Gi ephemeral storage
limit, causing pod eviction. 200Gi matches the storage needed for large
model artifacts. Memory increased from 16Gi to 64Gi to support vLLM
process memory requirements for a 120b model.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The model is mxfp4 quantized at 65GB on disk, requiring more than a single
L40S (46Gi) GPU. Switch to num_gpus=2 / tensor_parallel_size=2 to use
2x L40S = 92Gi, comfortably fitting the model at runtime.

Also increase l40s-1-gpu ephemeral-storage to 200Gi and memory to 64Gi
to prevent pod eviction during large model downloads.
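Approximate shape of the override, using the override field names from the neighbouring commits (file placement and nesting assumed):

```yaml
# applications.yaml (assumed nesting)
GptOss120b:
  gpu_type_options_override:
    L40S: { num_gpus: 2 }
  gpu_type_model_config_override:
    L40S: { tensor_parallel_size: 2 }   # 2x L40S ~= 92Gi for the 65GB mxfp4 weights
---
# instance.yaml l40s-1-gpu tier (assumed field names)
memory: 64Gi
ephemeral-storage: 200Gi
```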
…e to 200Gi for gpt-oss-120b"

This reverts commit 03ef451.
…atorType

- Add H100 worker tiers to instance.yaml (h100-0-gpu, h100-1-gpu)
- Add H100 instanceScale block to features/saia.yaml
- Add AcceleratorType field to ApplicationParams in builder.go, populated
  from effectiveAcceleratorType(), so applications.yaml can template gpu_types
- Template gpu_types in applications.yaml for GptOss20b and GptOss120b
  using {{.AcceleratorType}} instead of hardcoded ["L40S"] (see the sketch
  after this commit message)
- Add H100 gpu_type_options_override and gpu_type_model_config_override
  entries for GptOss20b (0.5 GPU, tp=1) and GptOss120b (1 GPU, tp=1)
- Fix UaeLarge H100 num_gpus and gpu_memory_utilization: 0.025 -> 0.0375
- Require yq in preflight checks for eks and k0s scripts (fail instead
  of silently falling back to fragile grep/awk parsing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
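A sketch of the gpu_types templating described above (surrounding fields omitted; the override values are taken from earlier commit messages, the nesting is assumed):

```yaml
# applications.yaml (abridged)
- name: GptOss120b
  gpu_types: ["{{.AcceleratorType}}"]   # was hardcoded ["L40S"]
  gpu_type_options_override:
    L40S: { num_gpus: 4 }
    H100: { num_gpus: 1 }
  gpu_type_model_config_override:
    H100: { tensor_parallel_size: 1 }
```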
eks_cluster_with_stack.sh:
- Read GPU_CAPACITY_RESERVATION_ID/AZ and GPU_AVAILABILITY_ZONES from
  cluster-config.yaml in load_config()
- generate_node_groups(): skip standard GPU node group for H100 with
  capacity reservation; add availabilityZones support for other types
- New create_gpu_nodegroup_with_capacity_block(): CloudFormation-based
  H100 node group using CapacityType: CAPACITY_BLOCK, only invoked when
  defaultAcceleratorType=H100 and capacityReservation.id is set
- create_cluster_flow/reconcile_flow: gate capacity block creation on
  DEFAULT_ACCELERATOR=H100, idempotent GPU node count check
- main_install: export AWS_DEFAULT_REGION/AWS_REGION after load_config
- Add missing --region flag to 3 eksctl create iamserviceaccount calls

k0s_cluster_with_stack.sh:
- load_config: read defaultAcceleratorType from config, default L40S

cluster-config.yaml:
- GPU TYPE QUICK REFERENCE comment block: L40S/H100/H100_NVL instance
  types, when to use capacityReservation and availabilityZones
- H100-only capacityReservation and availabilityZones commented-out blocks
- defaultAcceleratorType comment cross-referencing instance types

k0s-cluster-config.yaml (new file):
- Config template for k0s script with GPU TYPE QUICK REFERENCE
- Documents L40S/H100/H100_NVL gpuWorker instance types alongside
  defaultAcceleratorType

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…enable, and Redis no-op

Three independent root causes were breaking the airgap SAIA v2 flow end-to-end.
Each is fixed at its actual origin so the operator's rendered manifests are
usable out-of-the-box on k0s without post-install patches.

* nginx CORS preflight short-circuit (pkg/ai/features/saia/impl.go):
  SAIA v2's TenantConversationKeyMiddleware rejects unauthenticated OPTIONS
  with 400 before FastAPI's CORSMiddleware can respond, so browsers block the
  real request with "No Access-Control-Allow-Origin header present". nginx now
  answers OPTIONS with 204 + CORS headers on both v1 and v2 locations.
  Access-Control-Allow-Headers is reflected from the request via a map{} so
  new client headers don't require nginx edits, and ACAO is emitted only on
  preflight to avoid duplicating FastAPI's real-response ACAO.

* ENABLE_AUTHZ=true (pkg/ai/features/saia/impl.go defaults):
  The CMP interactive-token branch in SAIAAuthorizer is the only path that
  sets request.state.cmp_splunk_url, which AdminCapabilityAuthorizer needs
  to bridge a Splunk.interactive bearer into an EC-equivalent token. With
  "false", /admin/* returned 403 "Admin endpoints require an authenticated
  EC user token." Even in airgap CMP mode the value must be "true" — there
  is no authz-skip value that preserves the CMP bridge.

* DISABLE_RESPONSES_API_REDIS=True on GptOss120b / GptOss20b
  (config/configs/applications.yaml):
  Pairs with ray-head/ray-worker-gpu:build-v2-002 which ships the
  NoOpOpenAIServingResponses implementation (ai-platform-models c1f9aef3).
  Without this flag the vLLM RedisOpenAIServingResponses constructor raises
  "Responses Redis URL not set" on every /v1/responses call, the SSE stream
  is empty, and the /query path fails with SearchStreamError. Airgap has no
  Redis; cloud stays on "False" with its in-namespace Redis StatefulSet (see
  the sketch after this commit message).

Supporting changes:
* Bump operator to v0.1.25 and all SAIA/Ray images to build-v2-002 in
  artifacts.yaml and k0s-cluster-config.yaml.
* Promote gopkg.in/yaml.v3 from indirect to direct in go.mod for the new
  raybuilder test.
* Add regression tests:
  - Test_reconcileSAIAConfigMap_EnablesAuthzForCMPBridging
  - Test_reconcileSAIAConfigMap_PreservesUserOverride (user override honored)
  - Test_reconcileNginxConfigMap_CORSPreflight (both locations, exactly two
    ACAO instances, dynamic header reflection)
  - pkg/ai/raybuilder/configmap_apps_test.go for the Redis no-op flag.
* Update AIPlatformUrl default test expectation to include the http:// scheme
  (matches the scheme-qualified URL that httpx/openai clients need).
* Add k0s-cluster-config-h100.yaml for the H100 lab topology.

Made-with: Cursor
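A sketch of the Redis no-op flag wiring described in the third root cause above (entry abridged):

```yaml
# config/configs/applications.yaml (abridged)
- name: GptOss120b
  runtime_env:
    env_vars:
      # Airgap has no Redis: route /v1/responses through NoOpOpenAIServingResponses
      DISABLE_RESPONSES_API_REDIS: "True"
# Cloud deployments keep "False" and use the in-namespace Redis StatefulSet.
```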
spl-arif marked this pull request as draft April 27, 2026 06:00
spl-arif marked this pull request as ready for review April 27, 2026 15:37
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the Splunk AI Operator stack to better support airgapped / k0s deployments and “bring-your-own” S3-compatible object storage (MinIO/SeaweedFS), while also extending SAIA v2 and Ray/Weaviate runtime wiring.

Changes:

  • Add first-class S3-compatible storage support (s3compat://, minio://, seaweedfs://) across CRDs, storage client, Ray builder templating, and docs/scripts.
  • Extend SAIA/AIService plumbing (serviceTemplate propagation, SAIA v2 fields/images) and update Weaviate to expose gRPC (50051) for v2 clients.
  • Add/refresh cluster setup assets (k0s configs, EKS README guidance, ECR secret refresh helper) and bump various build/dependency settings.
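For example, selecting an S3-compatible backend by scheme might look like this in an AIPlatform spec (only the path schemes are taken from this PR; the endpoint field name and values are assumptions):

```yaml
spec:
  objectStorage:
    # The scheme picks the backend: s3://, gs://, azure://, or an
    # S3-compatible store via s3compat:// / minio:// / seaweedfs://
    path: s3compat://models-bucket/ai-artifacts    # placeholder bucket/prefix
    endpoint: http://minio.example.internal:9000   # assumed field name
```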

Reviewed changes

Copilot reviewed 60 out of 64 changed files in this pull request and generated 12 comments.

Summary per changed file:
tools/cluster_setup/splunk-operator-cluster.yaml Update Splunk Enterprise image reference; adds an inline TODO comment.
tools/cluster_setup/refresh_ecr_credentials.sh New helper script to refresh ECR pull secret and restart stuck workloads.
tools/cluster_setup/k0s-cluster-config.yaml New k0s cluster config template for L40S; includes storage/operator/image settings.
tools/cluster_setup/k0s-cluster-config-h100.yaml New k0s cluster config template for H100.
tools/cluster_setup/cluster-config.yaml Expand EKS config template with object store selection and GPU guidance.
tools/cluster_setup/artifacts.yaml Update schema/patterns for new storage schemes and add SAIA v2 env/image wiring.
tools/cluster_setup/EKS_README.md Add object store selection docs and richer verification/troubleshooting steps.
tools/artifacts_download_upload_scripts/upload_to_seaweedfs.sh New script to install/run SeaweedFS and upload artifacts with retries/parallelism.
tools/artifacts_download_upload_scripts/upload_to_minio.sh Generalize MinIO uploader to S3-compatible and improve error handling/upload semantics.
tools/artifacts_download_upload_scripts/upload_splunk_app_to_seaweedfs.sh New helper to upload Splunk app tarball to SeaweedFS bucket apps/.
tools/artifacts_download_upload_scripts/test_minio_connection.sh Update MinIO connectivity test defaults (currently includes hardcoded endpoint/secret).
tools/artifacts_download_upload_scripts/seaweedfs.service New systemd unit for SeaweedFS all-in-one server.
tools/artifacts_download_upload_scripts/install_seaweedfs_systemd.sh New installer for the SeaweedFS systemd service (SELinux-safe).
tools/artifacts_download_upload_scripts/install_minio_ec2.sh New script to install MinIO on EC2 or launch+install in the EKS VPC.
tools/artifacts_download_upload_scripts/create_seaweedfs_folders.sh New helper to create expected “folder” prefixes via .keep objects.
tools/artifacts_download_upload_scripts/SEAWEEDFS_SYSTEMD.md New documentation for SeaweedFS systemd setup and troubleshooting.
tools/artifacts_download_upload_scripts/README.md Extend artifact scripts README for SeaweedFS + S3-compatible usage.
pkg/storage/storageclient_test.go Add tests for new schemes and stricter URI validation cases.
pkg/storage/storageclient.go Route minio://, seaweedfs://, s3compat:// to S3-compatible client; add bucket validation.
pkg/storage/s3compat.go New S3-compatible client implementation backed by AWS SDK with custom endpoint.
pkg/storage/minio.go Deprecate MinIO-specific client in favor of generic S3-compatible client.
pkg/storage/azure.go Add explicit validation for missing Azure container name.
pkg/ai/weaviate_test.go Update tests to assert http+grpc ports and GRPC_PORT env var.
pkg/ai/weaviate.go Expose Weaviate gRPC port (50051) in container + service; set GRPC_PORT env.
pkg/ai/reconciler_test.go Add test to ensure AIPlatform serviceTemplate propagates (and deep-copies) into AIService.
pkg/ai/reconciler.go Preserve patched AIService serviceTemplate/resources; propagate serviceTemplate; add SAIA v2 env-driven config.
pkg/ai/raybuilder/configmap_apps_test.go New regression tests ensuring apps YAML stays well-formed and sets DISABLE_RESPONSES_API_REDIS correctly.
pkg/ai/raybuilder/builder_test.go Minor whitespace/no-op change.
pkg/ai/raybuilder/builder_additional_test.go Update expectation to PullIfNotPresent for worker image pull policy.
pkg/ai/raybuilder/builder.go Expand app template parameters (provider/endpoint/creds/modelVersion/accelerator type), add S3-compatible env injection, autoscaling tuning.
pkg/ai/features/seca/seca.go Include scheme (http/https) when building AIPlatformUrl.
internal/webhook/v1/aiservice_webhook.go Update (commented) validation prefix list to include new schemes.
internal/webhook/v1/aiplatform_webhook.go Allow objectStorage.path updates; validate new path scheme prefixes.
helm-chart/splunk-ai-operator/values.yaml Add SAIA v2 and nginx image values; extend defaults.
helm-chart/splunk-ai-operator/templates/deployment.yaml Support image digest; add RELATED_IMAGE_SAIA_API_V2 and RELATED_IMAGE_NGINX env vars.
helm-chart/splunk-ai-operator/crds/ai.splunk.com_aiservices.yaml CRD schema updates for new storage schemes and SAIA v2 fields.
helm-chart/splunk-ai-operator/crds/ai.splunk.com_aiplatforms.yaml CRD schema updates for new storage schemes.
helm-chart/splunk-ai-operator/Chart.yaml Bump splunk-operator dependency to 3.1.0.
helm-chart/splunk-ai-operator/Chart.lock Lockfile update for splunk-operator 3.1.0.
go.mod Bump Go version and dependency set; add yaml.v3.
docs/troubleshooting.md Add detailed object-store/model artifact troubleshooting and verification steps.
docs/configuration/storage-artifacts.md Update storage provider descriptions and link to new object storage selection doc.
docs/configuration/object-storage.md New doc explaining scheme-based backend selection and examples.
config/manager/kustomization.yaml Update kustomize deployment image reference (currently points to docker.com/...).
config/crd/bases/ai.splunk.com_aiservices.yaml Base CRD schema updates for storage schemes and SAIA v2 fields.
config/crd/bases/ai.splunk.com_aiplatforms.yaml Base CRD schema updates for storage schemes.
config/configs/instance.yaml Adjust L40S resources and add H100 tiers.
config/configs/features/saia.yaml Update SAIA feature scaling for new model apps and add H100 instance scale.
config/configs/applications.yaml Major application template updates (working_dir, providers, new GPT OSS apps, disable responses Redis).
api/v1/zz_generated.deepcopy.go Deepcopy updates for new AIService v2 config fields.
api/v1/aiservice_types.go Add AIPlatformScheme; add AIService SAIA v2/v2Worker config structs.
api/v1/aiplatform_types.go Update ObjectStorageSpec validation for new schemes and add optional provider field.
Makefile Pass GO_VERSION build arg; add amd64 docker build target.
Dockerfile.k0s-runner New runner image with kubectl/helm/yq for k0s workflows.
Dockerfile.debug Use GO_VERSION build arg.
Dockerfile Use GO_VERSION build arg and change runtime user/group for OpenShift SCC compatibility.
.gitignore Ignore tmp/ and tools/cluster_setup/*.original byproducts.
.env Bump GO_VERSION.


Comment threads:
  • pkg/ai/raybuilder/builder.go (outdated)
  • tools/artifacts_download_upload_scripts/test_minio_connection.sh (outdated)
  • config/manager/kustomization.yaml (outdated)
  • tools/artifacts_download_upload_scripts/SEAWEEDFS_SYSTEMD.md
  • pkg/storage/storageclient.go
  • tools/artifacts_download_upload_scripts/README.md (outdated)
  • tools/cluster_setup/cluster-config.yaml (outdated)
  • tools/cluster_setup/cluster-config.yaml
  • Dockerfile (outdated)
  • tools/cluster_setup/k0s-cluster-config-h100.yaml (outdated)
spl-arif and others added 6 commits April 28, 2026 19:32
…ove in-cluster MinIO install requiring customer-managed object storage
…e cluster

If the existing-cluster detection (useExisting) flakes due to an SSH
timeout or transient k0s status error, install_k0s_cluster could
fall through and unconditionally rm -rf /var/lib/k0s, destroying all
cluster state (etcd/kine, CRs, PVCs). Add a pre-wipe check that
queries k0s kubectl for Ready nodes and aborts with a clear error
if any are found.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Comment threads:
  • tools/cluster_setup/k0s-cluster-config.yaml (outdated)
  • tools/cluster_setup/k0s-cluster-config.yaml (outdated)
  • tools/cluster_setup/EKS_README.md
@kupratyu-splunk (Collaborator) left a comment
LGTM

kupratyu-splunk merged commit 7d736a0 into main Apr 29, 2026
11 checks passed
kupratyu-splunk deleted the ai-tier-v2-k0s branch April 29, 2026 13:41