solisoft · solisoft · May 9, 2026 · May 8, 2026
diff --git a/tasks/done/security-audit-1.md b/tasks/done/security-audit-1.md
@@ -0,0 +1,4 @@
+# security audit
+
+as a rust expert do a security audit.
+create tickets in tasks/todo for any security issue found
diff --git a/tasks/todo/SEC-001-resp-idle-timeout.md b/tasks/todo/SEC-001-resp-idle-timeout.md
@@ -0,0 +1,39 @@
+# SEC-001 — RESP server has no read/idle timeout, enabling connection exhaustion DoS
+
+**Severity:** High
+
+## Location
+
+- `crates/solikv-server/src/resp_server.rs:67-110` (`handle_connection`)
+- `crates/solikv-server/src/resp_server.rs:19` (`MAX_CONNECTIONS = 10_000`)
+
+## Issue
+
+`handle_connection` calls `socket.read_buf(&mut read_buf).await` (line 105) with no
+read timeout. A connection that sends nothing (or a partial RESP frame) sits forever
+in `read_buf` and holds one of the 10,000 semaphore permits.
+
+A single attacker opening 10,000 idle TCP connections (no bytes ever sent) drains
+the connection pool and locks every legitimate client out. The pre-pubsub-mode read
+loop, the pubsub-mode `socket.read_buf` (line 611), and the `auth_middleware`
+fast-fail path all share this property: no per-stage deadline.
+
+There is also no `tcp_keepalive` configuration, so half-open TCP connections are
+never reaped.
+
+## Fix
+
+1. Wrap the per-iteration `socket.read_buf(...)` calls in
+   `tokio::time::timeout(...)` with a configurable idle deadline (suggested
+   default: 60 s, matching Redis `timeout 0` override).
+2. On timeout, write a RESP error and close the connection.
+3. Apply the same timeout in `handle_pubsub_mode` (`resp_server.rs:611`).
+4. Optionally enable TCP keepalive on the accepted socket via
+   `socket2::SockRef::from(&stream).set_tcp_keepalive(...)`.
+
+## Verification
+
+- Unit: open a TCP connection, send 0 bytes, assert it is closed after the
+  configured timeout.
+- Manual: `nc -q -1 127.0.0.1 6379 &` ×10 000, then verify a fresh `redis-cli`
+  client can still connect within `timeout * 2`.
diff --git a/tasks/todo/SEC-002-resp-array-stack-overflow.md b/tasks/todo/SEC-002-resp-array-stack-overflow.md
@@ -0,0 +1,39 @@
+# SEC-002 — RESP parser recurses without depth limit, allowing single-packet stack overflow / crash
+
+**Severity:** High
+
+## Location
+
+- `crates/solikv-resp/src/codec.rs:227-269` (`decode_array`, recursive call to `decode_frame`)
+- `crates/solikv-resp/src/codec.rs:131` (`MAX_ARRAY_LEN = 1_048_576`)
+
+## Issue
+
+`decode_array` calls `decode_frame(&src[offset..])` for every element, and
+`decode_frame` dispatches back to `decode_array` for nested arrays
+(`*` prefix). There is no recursion-depth limit. An attacker connected to the
+RESP port can send a single buffer containing `*1\r\n*1\r\n*1\r\n…` repeated a
+few thousand times. Each level adds a stack frame, and the recursion eventually
+overflows the (default 2 MB) thread stack — the Tokio worker dies with SIGSEGV
+and the server's connection accept loop on that worker is gone.
+
+This is *pre-auth* (the parser runs before AUTH is checked) and unauthenticated
+remote callers can crash the server with a single TCP packet. `MAX_ARRAY_LEN`
+limits each level's element count but does not bound nesting depth.
+
+## Fix
+
+1. Add a `decode_frame_with_depth(src, depth)` helper. Bump `depth` on each
+   recursive call from `decode_array`. Reject (return `Err`) when `depth`
+   exceeds a hard cap (suggested: 32, matching Redis client/server typical
+   nesting use).
+2. Change the public `decode_frame(src)` to call
+   `decode_frame_with_depth(src, 0)`.
+3. Add a unit test that constructs an N-deep nested array (e.g. 10 000) and
+   asserts the parser returns an error rather than panicking / overflowing.
+
+## Verification
+
+- Unit: deep-nested input returns `Err`, not a stack overflow.
+- Existing parser tests still pass.
+- `cargo clippy --quiet -- -D warnings` clean.
diff --git a/tasks/todo/SEC-003-lua-sandbox-state-leak.md b/tasks/todo/SEC-003-lua-sandbox-state-leak.md
@@ -0,0 +1,67 @@
+# SEC-003 — Lua VM is pooled and reused without resetting global state, allowing sandbox poisoning across scripts
+
+**Severity:** High
+
+## Location
+
+- `crates/solikv-engine/src/lua.rs:103-138` (`LUA_POOL`, `take_lua`, `return_lua`)
+- `crates/solikv-engine/src/lua.rs:147-186` (`execute_script`)
+- `crates/solikv-engine/src/lua.rs:192-209` (`sandbox_lua` runs once on cold path only)
+
+## Issue
+
+`take_lua` reuses a previously-sandboxed `Lua` instance from a thread-local
+`LUA_POOL`. The sandbox (`sandbox_lua`) is applied once on creation, and only
+`KEYS` / `ARGV` are reset between invocations. Globals other than `KEYS` /
+`ARGV` — including the standard-library tables `string`, `table`, `math` —
+are *not* restored between scripts.
+
+A malicious script can therefore plant traps that affect future scripts running
+on the same worker thread:
+
+```lua
+-- attacker:  EVAL "string.byte = function() return 999 end ..."
+-- victim:    EVAL "return string.byte('x')"   →  999, not 120
+```
+
+Or worse, a script can monkey-patch `redis.call` itself:
+
+```lua
+local orig = redis.call
+redis.call = function(...) ... return "" end   -- silent data corruption
+```
+
+Because the VM is pooled per-thread and reused, this contaminates other tenants
+sharing the server. `redis` itself is set in `setup_redis_module` only on cold
+path (line 124), so a poisoned `redis.call` survives every subsequent
+`execute_script` on that thread until the VM is dropped (pool is full).
+
+## Fix
+
+Either (preferred):
+
+1. Re-snapshot the global table after `sandbox_lua` + `setup_redis_module`
+   completes (e.g., serialize a registry-stashed clean copy), and restore it on
+   each `take_lua`. mlua exposes `Lua::globals()` and `Table::clear()` /
+   pairs traversal to rebuild it.
+
+Or, simpler:
+
+2. Drop the pool entirely — create a fresh `Lua` per script. The cost is one
+   sandbox setup per EVAL, which dominates only for trivially small scripts.
+   Benchmark first.
+
+Or:
+
+3. Wrap user scripts in a fresh Lua function environment (`setfenv` /
+   `_ENV` swap in 5.2+) so each script gets its own `_ENV` with read-only
+   parent. Verify mlua exposes the necessary primitives.
+
+Option 1 or 3 preserves the perf benefit. Option 2 is simplest and safest.
+
+## Verification
+
+- New test: run two scripts back-to-back on the same engine; the first mutates
+  `string.byte`, the second calls `string.byte('x')` and must observe the
+  original value.
+- Existing Lua tests still pass.
diff --git a/tasks/todo/SEC-004-script-cache-unbounded.md b/tasks/todo/SEC-004-script-cache-unbounded.md
@@ -0,0 +1,43 @@
+# SEC-004 — Script cache (`SCRIPT LOAD` / `EVAL`) has no size cap, enabling memory-exhaustion DoS
+
+**Severity:** High
+
+## Location
+
+- `crates/solikv-engine/src/lua.rs:20-58` (`ScriptCache`)
+- `crates/solikv-engine/src/dispatch.rs:3236` (`EVAL` calls `script_cache.load(script)` on every invocation)
+- `crates/solikv-engine/src/dispatch.rs:3286-3303` (`SCRIPT LOAD`)
+
+## Issue
+
+`ScriptCache` is a `DashMap<String, String>` with no eviction, no maximum
+entry count, and no maximum source size. Any authenticated client can call:
+
+```
+SCRIPT LOAD "<random 1 MiB script body>"
+```
+
+repeatedly, and each insert is permanent. With ~1 KiB scripts the server holds
+~10 MB after 10 000 calls; with 1 MiB scripts, 10 GB after 10 000 calls. There
+is no command to flush only stale entries, and even `SCRIPT FLUSH` requires the
+attacker's cooperation.
+
+`EVAL` *also* re-inserts the script source on every execution (line 3236), so
+the cache grows on the regular EVAL hot path too — not just on `SCRIPT LOAD`.
+
+## Fix
+
+1. Cap script source size at a sane limit (suggested: 64 KiB, configurable).
+   Reject larger scripts with `ERR script too long`.
+2. Cap total script count (suggested: 8 192 entries) and switch the storage
+   from `DashMap` to a small LRU (e.g. `lru::LruCache` behind a `Mutex`,
+   contention is fine — script load is rare).
+3. In `EVAL`, only insert into the cache if the SHA is not already present
+   (cheap `contains_key` check), to avoid the rewrite-on-every-execution
+   churn.
+
+## Verification
+
+- Unit: load N scripts where N > cache cap, assert oldest is evicted.
+- Unit: `EVAL` of an oversize script returns `ERR script too long`.
+- Existing scripting tests still pass.
diff --git a/tasks/todo/SEC-005-secrets-on-cli.md b/tasks/todo/SEC-005-secrets-on-cli.md
@@ -0,0 +1,45 @@
+# SEC-005 — Passwords accepted only via CLI flags, exposing them in `ps` / `/proc/<pid>/cmdline` / shell history
+
+**Severity:** High
+
+## Location
+
+- `crates/solikv-server/src/main.rs:65-67` (`--requirepass`)
+- `crates/solikv-server/src/main.rs:97-99` (`--cluster-password`)
+
+## Issue
+
+The auth password and cluster password are accepted only as `clap` arguments
+with `value_name = "PASSWORD"`. On Linux:
+
+- `ps auxe`, `/proc/<pid>/cmdline`, and any monitoring tool that reads cmdlines
+  expose the password to other local users.
+- The launching shell's history (`.bash_history`, `.zsh_history`) records it.
+- The daemon code (`main.rs:140-149`) re-`spawn`s itself with
+  `std::env::args().skip(1)`, so the child process *also* has the password in
+  its cmdline.
+
+Production deployments commonly leak Redis passwords this way; SoliKV inherits
+the same risk and has no alternative input mechanism.
+
+## Fix
+
+Add support for password sources, in priority order:
+
+1. `--requirepass-file <path>` / `--cluster-password-file <path>`: read a file
+   (newline-trimmed). Recommended for production.
+2. `SOLIKV_PASSWORD` / `SOLIKV_CLUSTER_PASSWORD` environment variables (clap's
+   `env = "..."` attribute).
+3. Keep `--requirepass <PASS>` for dev convenience but **emit a warning to
+   stderr** when used so operators see the leak risk.
+
+Refuse to start if both flag and file/env are set, to avoid silent precedence
+confusion.
+
+When the file path is used, verify mode 0600 (warn loudly if world-readable).
+
+## Verification
+
+- Unit: parse args from a file successfully; warn when flag is used.
+- Manual: `ps -ef | grep solikv` after start with `--requirepass-file` shows
+  no password in cmdline.
diff --git a/tasks/todo/SEC-006-pidfile-tmp-symlink.md b/tasks/todo/SEC-006-pidfile-tmp-symlink.md
@@ -0,0 +1,53 @@
+# SEC-006 — Daemon mode uses world-writable `/tmp/solikv.pid` + `/tmp/solikv.log`, with weak process verification
+
+**Severity:** High
+
+## Location
+
+- `crates/solikv-server/src/main.rs:106` (`pidfile = PathBuf::from("/tmp/solikv.pid")`)
+- `crates/solikv-server/src/main.rs:108-130` (kill-old logic)
+- `crates/solikv-server/src/main.rs:132-138` (`/tmp/solikv.log` opened with `create + append`)
+
+## Issue
+
+Three problems compound:
+
+1. **Symlink attack on the log file.** `OpenOptions::new().create(true).append(true)
+   .open("/tmp/solikv.log")` follows existing symlinks. On a multi-user host an
+   attacker who controls `/tmp/solikv.log` (it is in a sticky-bit directory but
+   they can win the race or pre-place a symlink before solikv first runs) can
+   point it at any path the operator can write to (e.g. `/etc/something`,
+   `~/.bashrc`). solikv then appends its stdout/stderr there.
+
+2. **PID file race + weak verification.** Lines 113-122 read `/proc/<pid>/cmdline`
+   and only check `.contains("solikv")`. Any cmdline containing the substring
+   "solikv" passes — including a user's `vim /tmp/solikv-notes`, `grep solikv
+   …`, etc. Combined with the predictable `/tmp/solikv.pid` path (which a local
+   attacker can pre-write), this lets the attacker get solikv to send `kill` to
+   an arbitrary PID belonging to the operator (so long as that process's
+   cmdline mentions "solikv").
+
+3. **Use of `kill(1)` subprocess** for what is a built-in syscall — extra fork,
+   extra attack surface, and silently swallows errors with `let _ = ... .output()`.
+
+## Fix
+
+1. Move pidfile and log to `args.dir` (the operator-supplied data directory),
+   defaulting to `./data`. Refuse to operate on `/tmp` paths.
+2. Open the log with `OpenOptions::new().custom_flags(libc::O_NOFOLLOW)` (Unix)
+   to refuse symlinks; or open with `O_CREAT | O_EXCL` first and fall back to
+   append-after-stat-check.
+3. Strengthen the PID check: parse `/proc/<pid>/exe` (a symlink to the binary)
+   and compare to `std::env::current_exe()`. `cmdline.contains("solikv")` is
+   not a security check.
+4. Replace the `Command::new("kill")` subprocess with `nix::sys::signal::kill`
+   or `libc::kill` directly.
+5. After unlinking the stale pidfile, write the new one with `O_EXCL` so two
+   concurrent daemon starts cannot both succeed.
+
+## Verification
+
+- Unit (where feasible): given a PID whose `/proc/<pid>/exe` does not equal our
+  current exe, the kill path is a no-op.
+- Manual: `ln -s /tmp/victim /tmp/solikv.log`, run `solikv -d`, confirm solikv
+  refuses (or creates a new file separate from the symlink target).
diff --git a/tasks/todo/SEC-007-default-bind-no-auth.md b/tasks/todo/SEC-007-default-bind-no-auth.md
@@ -0,0 +1,51 @@
+# SEC-007 — Default bind is `0.0.0.0` with no password required, exposing an unauthenticated DB to the network
+
+**Severity:** High
+
+## Location
+
+- `crates/solikv-server/src/main.rs:38-39` (`#[arg(long, default_value = "0.0.0.0")] bind`)
+- `crates/solikv-server/src/main.rs:65-67` (`requirepass: Option<String>` — no default)
+- `crates/solikv-server/src/resp_server.rs:80` (`requires_auth = password.is_some()`)
+
+## Issue
+
+A user who runs `solikv` with no flags gets:
+
+- RESP server listening on `0.0.0.0:6379`
+- REST API on `0.0.0.0:5020`
+- No authentication required (`password = None` ⇒ `requires_auth = false`)
+
+This exposes the entire database to anyone on the same network. The same
+default tripped Redis for years and led to a documented mass-exploitation
+campaign ("Redis WCRY"). The README's quickstart and Docker example both run
+this way.
+
+The cluster-bus port at least binds to `127.0.0.1` only (`gossip.rs:336`), but
+the public RESP/REST surface does not.
+
+## Fix
+
+Pick one of these (in order of preference):
+
+1. **Refuse to start unauthenticated when bind is non-loopback.** If
+   `args.bind != "127.0.0.1"` and `args.requirepass.is_none()`, exit with
+   a clear error pointing at `--requirepass` and `--bind 127.0.0.1`. Provide
+   `--protected-mode no` opt-out for users who really mean it.
+2. Change the default `bind` to `127.0.0.1`. Document `--bind 0.0.0.0` for
+   network exposure.
+
+Approach 1 is closer to upstream Redis "protected mode" and is what most
+Redis-compatible projects do. Approach 2 is simpler but slightly less
+forgiving.
+
+Also update the README quickstart and Docker examples to set `--requirepass`
+and/or `--bind 127.0.0.1`, and note the new behavior in
+`www/app/views/docs/getting_started.html.slv`.
+
+## Verification
+
+- Unit: `Args::parse_from(["solikv", "--bind", "0.0.0.0"])` plus the
+  protected-mode check returns an error.
+- Manual: `cargo run` with no flags refuses to bind public; `cargo run --
+  --bind 0.0.0.0 --requirepass foo` works.