15 changes: 9 additions & 6 deletions .agents/building-and-testing.md
@@ -38,9 +38,12 @@ The React UI (`core/http/react-ui/`) has **no component/unit tests** — its onl
- **Browser:** the flake dev shell ships `chromium` and exports `PLAYWRIGHT_CHROMIUM_PATH`; `playwright.config.js` uses it via `launchOptions.executablePath`, and the Makefile skips `playwright install` when it's set. This avoids Playwright's downloaded browser, which can't resolve system libs (`libglib-2.0`, …) on NixOS. In CI (no `PLAYWRIGHT_CHROMIUM_PATH`) the Makefile falls back to `playwright install --with-deps chromium`.
- The app is a React SPA, so coverage accumulates across in-app navigation within a test; a full `page.goto`/reload resets it.
- `.nycrc.json` uses `all: true`, so **every `src/**` file is in the report**, including 0%-coverage ones — that's how you spot features with no test at all (sort the HTML report or `coverage-summary.json` by line% ascending).
- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` (default **1.0pp**) below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. **Why a tolerance (unlike the strict Go gate):** UI e2e line coverage is *non-deterministic* — async/debounced paths (e.g. the VRAM estimate's 500ms debounce) make identical specs vary ~0.5pp run-to-run, so a zero-tolerance gate would flake. Keep the tolerance just above the observed jitter. Run in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.

Rules:
- The gate is **strict — there is no tolerance**. Any decrease fails, regardless of how many lines a PR adds or deletes. `covermode=atomic` makes line coverage deterministic, so there's no run-to-run jitter to excuse.
- When a change legitimately **raises** coverage, run `make test-coverage-baseline` and **commit** the updated `coverage-baseline.txt` so the ratchet moves up. Never lower the baseline by hand.
- If you can't get coverage back to baseline, the fix is to **add tests**, not to edit the baseline.
- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. Runs in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
- **Why it has a tolerance (unlike the strict Go gate):** UI e2e coverage is *non-deterministic*. Specs that assert on state and end while async/lazy render work is still in flight collect those lines only when the render beats the coverage teardown — so the total drifts with machine speed/load (a fast local box reads higher than a slow CI runner), diffusely across many specs. The tolerance absorbs that drift, so set the baseline *below* the slow-CI floor, never to a fast-local `make test-ui-coverage-baseline` number, or CI flaps.
- **Raising coverage is cheap:** a *render-smoke* spec (navigate to a route, assert its header renders) mounts a lazy page and runs its full render + initial effects, capturing most of its lines in a few lines of test — see `e2e/page-render-smoke.spec.js`. Auth is disabled in the test server (`isAdmin=true`), so `RequireAdmin`/`RequireFeature` routes render without a mock. The most *deterministic* win is removing a race: make a spec `await` a rendered element before ending (see `e2e/agents.spec.js` → AgentCreate) so its lines count every run.

Rules (both gates):
- **Install the hooks:** `make install-hooks` once per clone so lint + coverage run pre-commit. Don't lean on CI for what the hook catches.
- **Don't work around the gate:** never `git commit --no-verify`, and never hand-lower a baseline or widen a tolerance to turn a red gate green. The ratchet only moves up.
- If a change drops coverage, **add tests** (sort `coverage-summary.json` by line% ascending to find untested code) rather than editing the baseline. When coverage legitimately rises, commit the regenerated baseline (`make test-coverage-baseline` / `test-ui-coverage-baseline`).
- The Go gate is **strict — no tolerance**; `covermode=atomic` keeps it deterministic. The UI gate keeps a small tolerance only because its e2e coverage isn't.
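
The difference between the two gates comes down to one comparison. The following is a minimal sketch of that comparison only; it is not the contents of `scripts/ui-coverage-check.sh`, and the `gatePasses` name and the 1.0 tolerance value are assumptions chosen for illustration.

```go
package main

import "fmt"

// gatePasses is a hypothetical helper (not taken from the repository).
// It reports whether current coverage is acceptable: the gate fails only
// when coverage falls more than `tolerance` percentage points below baseline.
func gatePasses(baseline, current, tolerance float64) bool {
	return baseline-current <= tolerance
}

func main() {
	// UI gate: a small tolerance absorbs run-to-run drift (1.0 is illustrative).
	fmt.Println(gatePasses(40.0, 39.5, 1.0)) // drop of 0.5 is absorbed
	fmt.Println(gatePasses(40.0, 38.5, 1.0)) // drop of 1.5 fails

	// Go gate: strict, which is the same check with a tolerance of zero.
	fmt.Println(gatePasses(40.0, 39.5, 0)) // any decrease fails
	fmt.Println(gatePasses(40.0, 41.0, 0)) // an increase passes
}
```

Framing the strict gate as the zero-tolerance case shows why widening the tolerance and hand-lowering the baseline are equivalent workarounds: both move the failure threshold down without adding any coverage.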
1 change: 1 addition & 0 deletions AGENTS.md
@@ -35,6 +35,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]

## Quick Reference

- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
- **Logging**: Use `github.com/mudler/xlog` (same API as slog)
- **Go style**: Prefer `any` over `interface{}`
- **Comments**: Explain *why*, not *what*
6 changes: 6 additions & 0 deletions CONTRIBUTING.md
@@ -266,6 +266,12 @@ The e2e tests run LocalAI in a Docker container and exercise the API:
make test-e2e
```

### React UI tests and coverage

The React UI (`core/http/react-ui/`) is covered by Playwright e2e specs, gated by a **monotonic line-coverage ratchet** (`make test-ui-coverage-check`, run in CI and pre-commit). The metric is non-deterministic — a fast local box reads higher than a slow CI runner for the same code — so a small tolerance is unavoidable.

**If your change lowers UI coverage, raise it back by adding specs — do not widen the tolerance or hand-lower the baseline.** A *render-smoke* spec (navigate to a page, assert its header is visible) cheaply covers an entire lazy page. See `core/http/react-ui/e2e/page-render-smoke.spec.js` and the full policy in [.agents/building-and-testing.md](.agents/building-and-testing.md#react-ui-coverage).

### Running E2E container tests

These tests build a standard LocalAI Docker image and run it with pre-configured model configs to verify that most endpoints work correctly:
2 changes: 1 addition & 1 deletion backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

LLAMA_VERSION?=d6588daa800058dfa54f1d7ea695b1a810c8ae18
LLAMA_VERSION?=399739d5c5978351f39e3454bfbfbab4f369088f
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

CMAKE_ARGS?=
2 changes: 1 addition & 1 deletion core/http/react-ui/coverage-baseline.txt
@@ -1 +1 @@
39.86
40.0
40 changes: 40 additions & 0 deletions core/http/react-ui/e2e/page-render-smoke.spec.js
@@ -0,0 +1,40 @@
import { test, expect } from './coverage-fixtures.js'

// Render-smoke coverage. Each page is lazy-loaded and runs its full render +
// initial effects on mount, so a bare visit captures the bulk of a page's
// lines — cheap, real coverage for pages that have no dedicated spec yet.
//
// This is the project's preferred way to keep the UI coverage gate green:
// raise the floor by covering more, rather than loosening the gate's
// tolerance (see CONTRIBUTING.md → "React UI tests and coverage"). Auth is disabled in
// the test server, so RequireAdmin/RequireFeature resolve to isAdmin=true and
// every gated route renders without an auth/capability mock.
//
// Asserts the page mounted (its .page-title header is visible) and that it did
// not bounce to a gate redirect (/login or back to /app home).
const PAGES = [
  ['/app/talk', 'Talk'],
  ['/app/usage', 'Usage'],
  ['/app/account', 'Account'],
  ['/app/studio', 'Studio'],
  ['/app/manage', 'Manage'],
  ['/app/backends', 'Backends'],
  ['/app/settings', 'Settings'],
  ['/app/nodes', 'Nodes'],
  ['/app/face', 'Face recognition'],
  ['/app/voice', 'Voice recognition'],
  ['/app/fine-tune', 'Fine-tuning'],
  ['/app/quantize', 'Quantize'],
]

test.describe('Page render smoke', () => {
  for (const [path, label] of PAGES) {
    test(`renders ${label} (${path})`, async ({ page }) => {
      await page.goto(path)
      // .page-title for the normal header; .empty-state-title for pages that
      // render a gated/empty state (e.g. Account when auth is disabled).
      await expect(page.locator('.page-title, .empty-state-title').first()).toBeVisible({ timeout: 15_000 })
      await expect(page).toHaveURL(new RegExp(path.replace(/\//g, '\\/') + '$'))
    })
  }
})
74 changes: 11 additions & 63 deletions core/services/nodes/replicapicker.go
@@ -1,69 +1,17 @@
package nodes

import "time"
import "github.com/mudler/LocalAI/pkg/clusterrouting"

// ReplicaCandidate is the minimum view of a loaded model replica needed to
// apply the routing policy. It is intentionally decoupled from the gorm models
// (BackendNode, NodeModel) so the same picker can run against fresh DB rows
// (SmartRouter.Route → FindAndLockNodeWithModel) and against an in-memory
// snapshot (the per-frontend rotating cache flagged in pkg/model — see TODO
// below).
type ReplicaCandidate struct {
NodeID string
Address string
ReplicaIndex int
InFlight int
LastUsed time.Time
AvailableVRAM uint64
}
// ReplicaCandidate aliases the canonical type in pkg/clusterrouting. The policy
// implementation moved there so the p2p federation server can share it without
// importing this package (which pulls in gorm). Because this is a type alias,
// existing references such as the LoadedReplicaStats interface method and the
// ReplicaCandidate(rw) row conversion in registry.go remain valid unchanged.
type ReplicaCandidate = clusterrouting.ReplicaCandidate

// PickBestReplica is the single source of truth for which loaded replica of a
// model serves the next request.
//
// Policy (ordered tiers, first non-tie wins):
// 1. Least in-flight wins — primary load-balancing signal.
// 2. Oldest last_used wins — round-robin between equally-loaded replicas.
// Every successful pick refreshes last_used (in FindAndLockNodeWithModel's
// transaction and in TouchNodeModel on cache hits), so the "oldest" tier
// naturally rotates through the candidate set without a separate cursor.
// 3. Largest available_vram wins — cold-start tiebreaker for replicas that
// have never been picked (identical last_used).
//
// Two callers must agree on this policy:
//
// - SmartRouter.Route, via the SQL ORDER BY in FindAndLockNodeWithModel
// (registry.go). That query MUST mirror this function — TestPickerSQLMirror
// asserts both sides agree on a representative dataset.
//
// - The per-frontend rotating-replica cache (NOT YET IMPLEMENTED — see
// pkg/model/loader.go and pkg/model/initializers.go for the integration
// point). When that cache lands, it will call PickBestReplica against an
// in-memory snapshot using locally-tracked in-flight counters and skip the
// per-request DB round-trip.
//
// Returns nil when the candidate list is empty. Does not allocate.
// PickBestReplica delegates to the canonical implementation in pkg/clusterrouting.
// The SQL ORDER BY in FindAndLockNodeWithModel (registry.go) must mirror it; the
// "policy mirror" spec in registry_test.go asserts they agree.
func PickBestReplica(candidates []ReplicaCandidate) *ReplicaCandidate {
if len(candidates) == 0 {
return nil
}
best := &candidates[0]
for i := 1; i < len(candidates); i++ {
c := &candidates[i]
if betterReplica(c, best) {
best = c
}
}
return best
}

// betterReplica reports whether candidate a is preferred over candidate b
// under the policy documented on PickBestReplica.
func betterReplica(a, b *ReplicaCandidate) bool {
if a.InFlight != b.InFlight {
return a.InFlight < b.InFlight
}
if !a.LastUsed.Equal(b.LastUsed) {
return a.LastUsed.Before(b.LastUsed)
}
return a.AvailableVRAM > b.AvailableVRAM
return clusterrouting.PickBestReplica(candidates)
}
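
The comment on the alias states that existing conversions and signatures stay valid because `ReplicaCandidate` is a type alias, not a new defined type. The sketch below illustrates that language rule in isolation. Every name in it (`canonical`, `aliased`, `defined`, `row`, `pick`) is a hypothetical stand-in, not a type or function from the repository.

```go
package main

import "fmt"

// canonical stands in for the type that now lives in the shared package.
type canonical struct{ NodeID string }

// aliased is a second name for the very same type (note the "=").
type aliased = canonical

// defined is a distinct type with the same underlying struct; a []defined
// could NOT be passed where a []canonical is expected.
type defined canonical

// row stands in for a storage row whose fields match the candidate's.
type row struct{ NodeID string }

// pick stands in for the canonical picker: it accepts []canonical only.
func pick(cs []canonical) *canonical {
	if len(cs) == 0 {
		return nil
	}
	return &cs[0]
}

func main() {
	// The row conversion keeps compiling against the alias, and the
	// resulting []aliased is a []canonical, so no copying or wrapping is needed.
	cs := []aliased{aliased(row{NodeID: "n1"})}
	fmt.Println(pick(cs).NodeID) // prints "n1"
}
```

Had the old package declared a new defined type instead (as `defined` does here), every call site would have needed an explicit element-by-element conversion before delegating.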
2 changes: 1 addition & 1 deletion docs/content/advanced/advanced-usage.md
@@ -273,7 +273,7 @@ A list of the environment variable that tweaks parallelism is the following:
```
### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls wether a backend can process multiple requests or not.
### This actually controls whether a backend can process multiple requests or not.

### Define the number of parallel LLAMA.cpp workers (Defaults to 1)

6 changes: 3 additions & 3 deletions docs/content/features/image-generation.md
@@ -199,15 +199,15 @@ Pipelines types available:

##### Advanced: Additional parameters

Additional arbitrarly parameters can be specified in the option field in key/value separated by `:`:
Additional arbitrary parameters can be specified in the option field in key/value separated by `:`:

```yaml
name: animagine-xl
options:
- "cfg_scale:6"
```

**Note**: There is no complete parameter list. Any parameter can be passed arbitrarly and is passed to the model directly as argument to the pipeline. Different pipelines/implementations support different parameters.
**Note**: There is no complete parameter list. Any parameter can be passed arbitrarily and is passed to the model directly as argument to the pipeline. Different pipelines/implementations support different parameters.

The example above, will result in the following python code when generating images:

@@ -342,4 +342,4 @@ diffusers:
```bash
(echo -n '{"prompt": "spiderman surfing","size": "512x512","model":"txt2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
```
```
4 changes: 2 additions & 2 deletions docs/content/features/text-generation.md
@@ -897,7 +897,7 @@ The backend will automatically download the required files in order to run the m
- `OVModelForCausalLM` requires OpenVINO IR [Text Generation](https://huggingface.co/models?library=openvino&pipeline_tag=text-generation) models from Hugging face
- `OVModelForFeatureExtraction` works with any Safetensors Transformer [Feature Extraction](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers,safetensors) model from Huggingface (Embedding Model)

Please note that streaming is currently not implemente in `AutoModelForCausalLM` for Intel GPU.
Please note that streaming is currently not implemented in `AutoModelForCausalLM` for Intel GPU.
AMD GPU support is not implemented.
Although AMD CPU is not officially supported by OpenVINO there are reports that it works: YMMV.

@@ -1008,4 +1008,4 @@ template:

completion: |
{{.Input}}
```
```
2 changes: 1 addition & 1 deletion docs/content/whats-new.md
@@ -105,7 +105,7 @@ It is now possible for single-devices with one GPU to specify `--single-active-b

#### Resources management

Thanks to the continous community efforts (another cool contribution from {{< github "dave-gray101" >}} ) now it's possible to shutdown a backend programmatically via the API.
Thanks to the continuous community efforts (another cool contribution from {{< github "dave-gray101" >}} ) now it's possible to shutdown a backend programmatically via the API.
There is an ongoing effort in the community to better handling of resources. See also the [🔥Roadmap](https://localai.io/#-hot-topics--roadmap).

#### New how-to section
13 changes: 13 additions & 0 deletions pkg/clusterrouting/clusterrouting_suite_test.go
@@ -0,0 +1,13 @@
package clusterrouting

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestClusterRouting(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "ClusterRouting Suite")
}
66 changes: 66 additions & 0 deletions pkg/clusterrouting/replica.go
@@ -0,0 +1,66 @@
// Package clusterrouting holds the transport-agnostic replica selection policy
// shared by the NATS distributed mode (core/services/nodes) and the p2p
// federation server (core/p2p). It deliberately depends on nothing heavier than
// the standard library so either transport can import it without pulling in a
// database driver or message bus.
package clusterrouting

import "time"

// ReplicaCandidate is the minimum view of a loaded model replica needed to
// apply the routing policy. It is intentionally decoupled from any storage
// model (gorm rows on the NATS side, gossiped NodeData on the p2p side) so the
// same picker runs against fresh DB rows, an in-memory snapshot, or p2p gossip.
type ReplicaCandidate struct {
	NodeID        string
	Address       string
	ReplicaIndex  int
	InFlight      int
	LastUsed      time.Time
	AvailableVRAM uint64
}

// PickBestReplica is the single source of truth for which loaded replica of a
// model serves the next request.
//
// Policy (ordered tiers, first non-tie wins):
// 1. Least in-flight wins: primary load-balancing signal.
// 2. Oldest last_used wins: round-robin between equally-loaded replicas.
// Every successful pick refreshes last_used (in the NATS
// FindAndLockNodeWithModel transaction and in TouchNodeModel on cache
// hits), so the "oldest" tier naturally rotates through the candidate set
// without a separate cursor.
// 3. Largest available_vram wins: cold-start tiebreaker for replicas that
// have never been picked (identical last_used).
//
// The NATS SQL ORDER BY in FindAndLockNodeWithModel (registry.go) MUST mirror
// this function; registry_test.go's "agrees with PickBestReplica on a seeded
// dataset (policy mirror)" spec asserts both sides agree on a representative
// dataset and fails fast if they drift.
//
// Returns nil when the candidate list is empty. Does not allocate.
func PickBestReplica(candidates []ReplicaCandidate) *ReplicaCandidate {
	if len(candidates) == 0 {
		return nil
	}
	best := &candidates[0]
	for i := 1; i < len(candidates); i++ {
		c := &candidates[i]
		if betterReplica(c, best) {
			best = c
		}
	}
	return best
}

// betterReplica reports whether candidate a is preferred over candidate b
// under the policy documented on PickBestReplica.
func betterReplica(a, b *ReplicaCandidate) bool {
	if a.InFlight != b.InFlight {
		return a.InFlight < b.InFlight
	}
	if !a.LastUsed.Equal(b.LastUsed) {
		return a.LastUsed.Before(b.LastUsed)
	}
	return a.AvailableVRAM > b.AvailableVRAM
}
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
package nodes
package clusterrouting

import (
"time"
@@ -36,7 +36,7 @@ var _ = Describe("PickBestReplica", func() {

It("uses oldest last_used as the tiebreaker when in_flight ties", func() {
// All three tied on in_flight=0. Without last_used, available_vram
// would pin every pick to the fattest node the exact bug
// would pin every pick to the fattest node: the exact bug
// fix(distributed): round-robin replicas of the same model addressed.
cs := []ReplicaCandidate{
{NodeID: "fat-recent", InFlight: 0, LastUsed: ref.Add(2 * time.Second), AvailableVRAM: 24_000_000_000},
@@ -47,7 +47,7 @@ var _ = Describe("PickBestReplica", func() {
})

It("uses largest available_vram as the final tiebreaker", func() {
// in_flight tied AND last_used tied pick the largest GPU.
// in_flight tied AND last_used tied: pick the largest GPU.
cs := []ReplicaCandidate{
{NodeID: "small", InFlight: 0, LastUsed: ref, AvailableVRAM: 8_000_000_000},
{NodeID: "fat", InFlight: 0, LastUsed: ref, AvailableVRAM: 24_000_000_000},