Vyomi v2.0.7
Bug-fix release dominated by a release-blocking fresh-install regression, plus a stack of fixes surfaced by actually running the appliance from a clean machine. The headline: every multipass-based install (brew/deb/rpm/scoop) has failed to launch since v2.0.3 because the source bundle omitted packaging/ (whose init scripts the compose file bind-mounts) — Docker stubbed them as directories, the backing containers crash-looped, and the simulator never started. Docker-Compose installs and pre-2.0.3 upgrades were unaffected, which is why dogfood appliances never caught it. This release fixes that (and hardens the bundle check against the whole class), plus: the disk-cleanup 422, idempotent default-space self-heal, live launcher cold-start progress, Azure VM create UX, Azure VM SSH connect, LXD stop→start recovery, and a swallowed-error fix. Most of these predate v2.0.6 (verified by git diff); none was caused by the v2.0.6 upgrade itself.
Fixed
- Disk cleanup on the /clouds workspaces page returned HTTP 422. The "Free up selected" action posted { ids: [...] } to POST /api/runtime/disk-cleanup/run, but the backend _DiskCleanupRequest required a field named categories, so every cleanup request failed Pydantic validation with a 422. Worse, the SPA derived each checkbox's value from it.id || it.path || label, but suggestion items are keyed by key (terminated_workspaces, tmp_and_apt, lxd_orphans, journald, lxd_image_cache), so even a renamed field would have sent human-readable labels that match no category. Fixed on both sides: static/clouds.html now reads it.key for the checkbox value and posts { categories: ids }; _DiskCleanupRequest now accepts categories (canonical) and a legacy ids alias, both defaulting to [], so a missing/renamed field is an empty no-op instead of a 422. The handler merges req.categories or req.ids.
- Fresh installs broken since v2.0.3: the appliance never finished launching (release-blocking). Every multipass-based install path (brew, deb, rpm, scoop) failed on first vyomi up: the launcher's source bundle (scripts/cloud-learn required list) omitted packaging/, but docker-compose.appliance.yml bind-mounts packaging/{firestore,vault,elasticmq}/* init files at runtime. Docker therefore created the missing mount sources as empty directories → firestore/vault/elasticmq crash-looped (exit 126: Is a directory) → the simulator depends_on them and never started → the health check timed out → "APPLIANCE LAUNCH FAILED." Fixed by adding packaging to the launcher bundle and to the deb/rpm build scripts (build-deb.sh, cloud-learn.spec, which also still omitted routes/setup_cython.py/cloudsim-backbone), and by extending scripts/verify-bundle.sh to assert every compose bind-mount source is bundled (not just Dockerfile COPYs), so this class fails CI instead of a user's first boot. Same recurring bug class as the routes/setup_cython.py omissions in v1.0.0–v1.1.9.
- Launcher showed multipass's blank "Waiting for initialization" spinner during cold start. multipass launch ran in the foreground, so its detail-free spinner sat for the 3–5 min LXD-snap-install cloud-init while the launcher's own progress poller only ran afterward. The new appliance_launch_and_track runs the launch in the background and streams live cloud-init progress ([48s] Installing LXD container runtime…) with a 30 s liveness heartbeat, and reaps a failed launch with its output instead of marching to the 10-min bail-out.
- Cold-start progress poller false-failed every fresh launch with a bogus "APPLIANCE LAUNCH FAILED" at ~26 s. The background-launch progress feature (above) tripped over the launcher's own safety net: the script runs under set -eE with an inherited ERR trap (on_err), and during the early-boot window the VM exists but isn't yet exec-able, so the first multipass exec cloud-init poll inside appliance_wait_for_cloud_init returned non-zero → the ERR trap fired → _die "command exited 2 inside Phase 4/8", even though the VM was coming up perfectly (cloud-init finished ~150 s later). appliance_wait_for_cloud_init now suspends errexit + the ERR trap around its best-effort poll loop (restoring both on exit), and appliance_launch_and_track judges success by the VM's real state (appliance_instance_state) rather than the launch process's exit code (since multipass launch can report non-zero while the instance is actually fine). Verified end-to-end on a fresh launch: progress still streams ([118s] /usr/bin/dockerd … → ✓ cloud-init finished (177s)) and it proceeds to the stack deploy instead of false-failing.
- Azure VM "Review + create" went silent and the new instance needed a hard refresh. In static/azure-console.html the Create button was silently disabled on any validation error (no feedback) and submitWizard had no try/catch. Now Create is always clickable and jumps to the first errored step with a toast; submit shows "Creating…", closes the blade immediately, re-fetches the list (no hard refresh needed), and toasts on any error instead of leaving a dead pane.
- Azure VM SSH connect was half-wired ("no SSH command", couldn't connect). core/vm_connect.connect_info() does the lazy SSH provisioning (key injection + lxc proxy) given a flat instance dict, but Azure VMs are ARM records that keep container_name/state under properties.runtime, so the connect path 409'd before provisioning anything (AWS/GCP pass a flat dict and worked). _connect_info_response now flattens the Azure record so it reaches the same path, and the Connect modal renders a real, copyable ssh -i ~/.ssh/vyomi_ed25519 -p <port> ubuntu@<host> command, at parity with AWS/GCP. (The LXD instance was always attached; only the SSH wiring + modal were incomplete.)
- LXD instance stop→start failed with "Missing source path … for disk workspace". LXD validates disk-device sources at start time, and an instance's workspace deployment dir could be removed while it was stopped (terminated-workspace reaper / disk-cleanup / a simulator restart), so lxc start 503'd. _start_lxd_instance now recreates the workspace host dir (_ensure_lxd_workspace_host_dir) before starting, making stop→start always recoverable for EC2/GCE/Azure VMs.
- Azure VM provisioning errors were silently swallowed. core/azure_dataplane.py wrapped the LXD provisioning in except Exception: pass. It now records status: provision_error + the message on the VM record, so a real provisioning failure surfaces instead of looking like a silent metadata-only VM.
- RDS databases weren't isolated per space on the shared engine, causing cross-space data clobber. AWS RDS provisioning (_rds_real_provision/_rds_real_deprovision) named the physical database from the raw db_instance_identifier and the login role from the verbatim master_username, with no space namespacing, unlike Cloud SQL + Azure DB, which already namespace via gcp_sql_engine.physical_name(space_id, …). Because all three clouds back onto the same cloudlearn-sql-postgres container, two spaces (or two users following the same tutorial) creating an RDS named mydb with master user admin collided: CREATE DATABASE was skipped (already exists) so the second space silently connected to the first space's database, and ALTER ROLE "admin" WITH PASSWORD ran unconditionally, resetting the shared role's password. RDS now namespaces the physical db + role per space exactly like Cloud SQL; verified live (same db_instance_identifier/master_username in two spaces → two isolated physical DBs, cl_sharedname_cdad9b3d vs cl_sharedname_f7c78009). The boto3-visible MasterUsername/DBInstanceIdentifier stay verbatim (SDK conformance preserved); the connectable physical creds are surfaced in the RDS view's new connection block, at parity with Cloud SQL's connectionInfo. The data-plane conformance tests (tests/conformance/ui/aws/test_rds_{postgres,mysql}.py) now connect with that connection block instead of the verbatim master username. Found while validating real app+DB deploys across AWS/GCP/Azure (the portal-as-real-app smoke test).
- Paste-a-key license activation never worked in appliance mode: every real portal JWT was rejected. POST /api/license/activate had two handlers for the same path: the correct RS256 one in server.py (req.license_key → _apply_license_jwt()), and a legacy HMAC one in routes/licensing.py (payload["token"] → _verify_license()). The licensing-route module registers before server.py's @app.post, so the legacy handler won, and it (a) read the wrong field name (token, while the pricing.html "Activate" modal posts license_key), (b) verified with the local HMAC scheme instead of RS256, and (c) gated the call behind require_admin_key, which the end-user UI never sends. A pasted portal JWT therefore failed with 401 Invalid license token: not enough values to unpack (expected 2, got 1) (empty token → one-part split). Rewrote the live routes/licensing.py handler to: accept license_key (and token for back-compat); route on segment count (a portal JWT has two dots → _apply_license_jwt() RS256 verify + apply, returning active_tier/issued_to/jti as the SPA expects; a one-dot legacy HMAC token → the old path, still require_admin_key-gated since those are locally forgeable); and drop the admin-key requirement for the JWT path, because the RS256 signature + install_id binding is the auth boundary, identical to the ungated device-flow path (/api/auth/poll-activation). Malformed input now returns a clean 400 license_key_required/401 license_invalid (with detail.reason) instead of a 500, so the modal shows a real error.
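The segment-count routing described in the license-activation fix can be sketched roughly as follows. This is an illustration only, not the actual routes/licensing.py handler: classify_activation and the shape of the returned dictionary are hypothetical, and no signature is verified here. Only the field fallback (license_key, then token), the dot-count routing, and the error codes are taken from the notes above.

```python
def classify_activation(payload: dict) -> dict:
    """Decide which verifier a pasted key should be sent to (hypothetical helper)."""
    # Accept the canonical field first, then the legacy one for back-compat.
    key = str(payload.get("license_key") or payload.get("token") or "").strip()
    if not key:
        return {"status": 400, "error": "license_key_required"}

    segments = key.split(".")
    if any(not segment for segment in segments):
        # Empty segments ("a..c", ".b") cannot be a well-formed token.
        return {"status": 401, "error": "license_invalid"}
    if len(segments) == 3:
        # Two dots: portal JWT. The RS256 signature is the auth boundary,
        # so no admin key is required on this path.
        return {"route": "jwt_rs256", "requires_admin_key": False}
    if len(segments) == 2:
        # One dot: legacy HMAC token, locally forgeable, so it stays admin-gated.
        return {"route": "legacy_hmac", "requires_admin_key": True}
    return {"status": 401, "error": "license_invalid"}
```

Classifying before verifying means an empty or malformed body is rejected with a specific error code instead of reaching a verifier that would fail while unpacking the token.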
Added
- Idempotent default-space self-heal. core/app_context.py previously seeded the aws-default/gcp-default/azure-default spaces only inside the if not spaces_state["spaces"]: fresh-install guard, so any install created before the v1.2.5 multi-default seeding (which only ever had the legacy AWS space) never gained the GCP/Azure defaults, not even across upgrades. The GCP/Azure ensure now runs on every init, guarded by provider presence (skips if a space for that provider already exists), so pre-v1.2.5 single-space installs get all three consoles after upgrading without duplicating any space the user (or an API re-seed) already created.
- Standard host-reachability for every VM. New _ensure_lxd_host_proxy, called from _start_lxd_instance (so EC2/GCE/Azure all inherit it), creates an LXD proxy device forwarding the instance's allocated host port → the VM's app port, making every new VM reachable from the host (and the user's machine) at http://<host>:<host_port>, not just SSH. App port defaults to 80, overridable per instance via instance['app_port'] or env CLOUDLEARN_LXD_VM_APP_PORT; the reachable port is surfaced as instance['host_app_port']. Recreated on every start (survives container recreation). Also added _ensure_lxd_workspace_host_dir (called before lxc start) so a stopped instance whose deployment dir was reaped can always restart.
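The provider-presence guard behind the default-space self-heal can be illustrated with a small sketch. The function name ensure_default_spaces and the id/provider record shape are assumptions made for the example; the real logic in core/app_context.py may store spaces differently.

```python
# Hypothetical mapping of provider -> default space id (ids taken from the notes).
DEFAULT_SPACE_IDS = {"gcp": "gcp-default", "azure": "azure-default"}


def ensure_default_spaces(spaces_state: dict) -> list:
    """Add a default space for each provider that has none; safe to run on every init."""
    spaces = spaces_state.setdefault("spaces", [])
    present = {space.get("provider") for space in spaces}
    added = []
    for provider, space_id in DEFAULT_SPACE_IDS.items():
        if provider in present:
            # Any existing space for this provider counts, including user-created ones.
            continue
        spaces.append({"id": space_id, "provider": provider})
        added.append(space_id)
    return added
```

Because the guard checks for a provider rather than a specific space id, calling it repeatedly adds nothing after the first run and never duplicates a space the user created by hand.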
Notes
- Neither issue was caused by upgrading to v2.0.6. The disk-cleanup code is byte-identical between v2.0.4.1 and v2.0.6 (the inner-validation appliance on 2.0.4.1 reproduces the same 422 live), and the default-space seeding logic is unchanged since v1.2.5. The differences observed between an inner (fresh-built) and outer (long-lived) appliance were a pre-existing bug exposed by exercising it, and an old-state artifact — not a release regression.
- Existing appliances missing the GCP/Azure defaults are self-healed on the next restart once running ≥2.0.7; before that, they can be re-seeded via the "+ Create Space" button (or POST /api/spaces).
- Validated end-to-end with a real app + DB on all three clouds. Using the portal itself as the test application (it needs a VM + Postgres), we provisioned managed Postgres (RDS / Cloud SQL / Azure SQL) + a compute VM (EC2 / GCE / Azure VM) on each cloud, deployed the portal onto the VM pointed at the managed DB, and confirmed it created its full schema over the wire, proving the Azure VM now attaches a real LXD container (the gap this release fixes) and that the host-reachability proxy works on every provider. This smoke test is what surfaced the RDS cross-space isolation bug above.
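The per-space namespacing behind the RDS isolation fix mentioned in the last note can be pictured with the sketch below. It is an assumption-laden illustration: the notes do not describe how gcp_sql_engine.physical_name derives its suffix, so the SHA-256 digest used here is a stand-in, and only the cl_<name>_<8 hex chars> shape is inferred from the verified names quoted in the Fixed section.

```python
import hashlib


def physical_name(space_id: str, logical_name: str) -> str:
    """Hypothetical stand-in for a per-space physical identifier.

    The digest choice is an assumption; the real scheme is not documented here.
    """
    suffix = hashlib.sha256(space_id.encode("utf-8")).hexdigest()[:8]
    return f"cl_{logical_name}_{suffix}"


# Two spaces following the same tutorial pick the same logical names...
db_a = physical_name("space-a", "mydb")
db_b = physical_name("space-b", "mydb")
# ...but end up with distinct physical databases and roles on the shared engine.
role_a = physical_name("space-a", "admin")
role_b = physical_name("space-b", "admin")
```

The user-facing identifier stays whatever the caller supplied; only the name used on the shared Postgres container is derived, which is why the connectable credentials need to be surfaced separately.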
Design & exploration (docs + spike only — no runtime impact)
This release also lands the design work from a long architecture session. None of
the below changes appliance behaviour — they are documents under docs/browser-lite/
and a throwaway proof under spikes/. Captured here so the decisions survive.
- LXD → Docker compute, de-risked by a passing spike: spikes/docker-instance/ (backend.py = a ComputeBackend seam + DockerComputeBackend; run_spike.sh; Dockerfile.instance; README.md). Ran end-to-end on the OUTER appliance VM (aarch64, Docker 29.1.3): create → boot (inner dockerd 16s) → shell → docker-in-instance (DinD) → persistence across stop/start → status/IP → terminate, all green. Finding: VM semantics are tractable on Docker (no systemd needed: tini + sshd + dockerd); the one real cost is --privileged for DinD. Direction: replace the ~45 _lxd_* / _multipass_* functions + core/runtime_bridge.py with a DockerComputeBackend.
- Vyomi Lite, full browser architecture: docs/browser-lite/ARCHITECTURE.md (master) plus VYOMI-LITE-DESIGN.md, VYOMI-LITE-MASTER-SPEC.md, IN-MEMORY-PLAN.md. A browser-native edition where server.py runs in Pyodide, every backing container is replaced by an in-memory/WASM engine (PGlite · moto · cedar-wasm · WebCrypto · BlobEngine+OPFS · in-proc pub/sub/queue/event), the SPA is served via a service-worker fetch⇄ASGI shim, and state persists to OPFS/IndexedDB.
- Networking as a fractal: tab = host, browser = Availability Zone (same-origin tabs share a SharedWorker hub enforcing VPC security groups over tcpip.js real TCP), federated browsers = VPC/region (WebRTC/overlay, e.g. Tailscale-wasm), devices = multi-region. Two tabs of WASM emulators form a real, SG-enforced VPC.
- Compute backends: ProviderFacade → InstanceManager (budget/quota) → RuntimeBackend: Simulated · Pyodide(Py) · WebContainers/BrowserPod(Node) · CheerpJ(Java) · Container2Wasm(any image) · RemoteDocker(real host). Per-language WASM matrix documented (§4a). Decision: integrate runtimes, do not build a Docker-in-WASM kernel; the moat stays in conformance + airgapped, not the runtime.
- CloudSim in the browser: keep the real engine via CheerpJ (CloudSimBackend with CheerpJBackend; gated on CheerpJ Java-17 support, since CloudSim Plus 8.5.7 needs Java 17), or default to the existing Python local-fallback.
- Product & licensing model: Vyomi Lite (B2C, browser, single-tenant) vs Vyomi Enterprise (B2B, real compute, airgapped) as two backends of one codebase (VYOMI_BACKEND flag). One emulator = one appliance = one browser profile = one tenant = one license seat, bound by a WebCrypto non-extractable keypair (extends core/license_remote.py); seat enforcement is server-side. Free tier = anonymous simulation-only, with a sign-in 30-day real-compute trial → simulation-only (account-anchored, server-enforced). Multi-tenant = multiple peered appliances (browser-as-tenant).
- MVP scope: AWS-only and single-tenant, with S3 (OPFS) · DynamoDB (moto) · IAM (cedar-wasm) as core; SQS · RDS (PGlite) · minimal EC2 compute as stretch; the other 8 AWS services + GCP/Azure + SDN + multi-AZ deferred. ~16 weeks / 2–3 engineers, with a week-2 GO/NO-GO gate on the Pyodide + fetch⇄ASGI shell spike. Conformance measured by running the existing 35/35 harness against the Lite build.
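To make the ComputeBackend seam from the first bullet of this section more concrete, here is a minimal sketch of what such an interface could look like. The method names and the in-memory implementation are hypothetical and are not taken from spikes/docker-instance/backend.py; they only mirror the lifecycle the spike exercised (create, stop/start, status, terminate).

```python
from abc import ABC, abstractmethod


class ComputeBackend(ABC):
    """Hypothetical seam: callers depend on this, not on LXD or Docker directly."""

    @abstractmethod
    def create(self, name: str) -> None: ...

    @abstractmethod
    def start(self, name: str) -> None: ...

    @abstractmethod
    def stop(self, name: str) -> None: ...

    @abstractmethod
    def status(self, name: str) -> str: ...

    @abstractmethod
    def terminate(self, name: str) -> None: ...


class InMemoryComputeBackend(ComputeBackend):
    """Stand-in backend for illustration; a Docker-backed class would shell out instead."""

    def __init__(self) -> None:
        self._states = {}

    def create(self, name: str) -> None:
        if name in self._states:
            raise ValueError(f"instance {name!r} already exists")
        self._states[name] = "running"

    def start(self, name: str) -> None:
        self._require(name)
        self._states[name] = "running"

    def stop(self, name: str) -> None:
        self._require(name)
        self._states[name] = "stopped"

    def status(self, name: str) -> str:
        # Unknown instances report "absent" rather than raising.
        return self._states.get(name, "absent")

    def terminate(self, name: str) -> None:
        self._require(name)
        del self._states[name]

    def _require(self, name: str) -> None:
        if name not in self._states:
            raise KeyError(name)
```

With a seam of this kind, provider code would call backend.start(name) without knowing which runtime sits underneath, which is what allows one implementation to be swapped for another.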
Artifacts
- SHA256SUMS
- cloud-learn-0.1.0.tar.gz
- cloud-learn-2.0.7-1.noarch.rpm
- cloud-learn-2.0.7.tar.gz
- cloud-learn_2.0.7_all.deb
Docker image: docker pull vyomi/appliance:2.0.7
Install: curl -fsSL https://raw.githubusercontent.com/vyomi-cloud/appliance/main/install.sh | bash
SHA256 checksums: see SHA256SUMS in attached artifacts.