Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions tasks/done/security-audit-1.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# security audit

as a rust expert do a security audit.
create tickets in tasks/todo for any security issue found
39 changes: 39 additions & 0 deletions tasks/todo/SEC-001-resp-idle-timeout.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
# SEC-001 — RESP server has no read/idle timeout, enabling connection exhaustion DoS

**Severity:** High

## Location

- `crates/solikv-server/src/resp_server.rs:67-110` (`handle_connection`)
- `crates/solikv-server/src/resp_server.rs:19` (`MAX_CONNECTIONS = 10_000`)

## Issue

`handle_connection` calls `socket.read_buf(&mut read_buf).await` (line 105) with no
read timeout. A connection that sends nothing (or a partial RESP frame) sits forever
in `read_buf` and holds one of the 10,000 semaphore permits.

A single attacker opening 10,000 idle TCP connections (no bytes ever sent) drains
the connection pool and locks every legitimate client out. The pre-pubsub-mode read
loop, the pubsub-mode `socket.read_buf` (line 611), and the `auth_middleware`
fast-fail path all share this property: no per-stage deadline.

There is also no `tcp_keepalive` configuration, so half-open TCP connections are
never reaped.

## Fix

1. Wrap the per-iteration `socket.read_buf(...)` calls in
`tokio::time::timeout(...)` with a configurable idle deadline (suggested
default: 60 s, matching Redis `timeout 0` override).
2. On timeout, write a RESP error and close the connection.
3. Apply the same timeout in `handle_pubsub_mode` (`resp_server.rs:611`).
4. Optionally enable TCP keepalive on the accepted socket via
`socket2::SockRef::from(&stream).set_tcp_keepalive(...)`.

## Verification

- Unit: open a TCP connection, send 0 bytes, assert it is closed after the
configured timeout.
- Manual: `nc -q -1 127.0.0.1 6379 &` ×10 000, then verify a fresh `redis-cli`
client can still connect within `timeout * 2`.
39 changes: 39 additions & 0 deletions tasks/todo/SEC-002-resp-array-stack-overflow.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
# SEC-002 — RESP parser recurses without depth limit, allowing single-packet stack overflow / crash

**Severity:** High

## Location

- `crates/solikv-resp/src/codec.rs:227-269` (`decode_array`, recursive call to `decode_frame`)
- `crates/solikv-resp/src/codec.rs:131` (`MAX_ARRAY_LEN = 1_048_576`)

## Issue

`decode_array` calls `decode_frame(&src[offset..])` for every element, and
`decode_frame` dispatches back to `decode_array` for nested arrays
(`*` prefix). There is no recursion-depth limit. An attacker connected to the
RESP port can send a single buffer containing `*1\r\n*1\r\n*1\r\n…` repeated a
few thousand times. Each level adds a stack frame, and the recursion eventually
overflows the (default 2 MB) thread stack — the Tokio worker dies with SIGSEGV
and the server's connection accept loop on that worker is gone.

This is *pre-auth* (the parser runs before AUTH is checked) and unauthenticated
remote callers can crash the server with a single TCP packet. `MAX_ARRAY_LEN`
limits each level's element count but does not bound nesting depth.

## Fix

1. Add a `decode_frame_with_depth(src, depth)` helper. Bump `depth` on each
recursive call from `decode_array`. Reject (return `Err`) when `depth`
exceeds a hard cap (suggested: 32, matching Redis client/server typical
nesting use).
2. Change the public `decode_frame(src)` to call
`decode_frame_with_depth(src, 0)`.
3. Add a unit test that constructs an N-deep nested array (e.g. 10 000) and
asserts the parser returns an error rather than panicking / overflowing.

## Verification

- Unit: deep-nested input returns `Err`, not a stack overflow.
- Existing parser tests still pass.
- `cargo clippy --quiet -- -D warnings` clean.
67 changes: 67 additions & 0 deletions tasks/todo/SEC-003-lua-sandbox-state-leak.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# SEC-003 — Lua VM is pooled and reused without resetting global state, allowing sandbox poisoning across scripts

**Severity:** High

## Location

- `crates/solikv-engine/src/lua.rs:103-138` (`LUA_POOL`, `take_lua`, `return_lua`)
- `crates/solikv-engine/src/lua.rs:147-186` (`execute_script`)
- `crates/solikv-engine/src/lua.rs:192-209` (`sandbox_lua` runs once on cold path only)

## Issue

`take_lua` reuses a previously-sandboxed `Lua` instance from a thread-local
`LUA_POOL`. The sandbox (`sandbox_lua`) is applied once on creation, and only
`KEYS` / `ARGV` are reset between invocations. Globals other than `KEYS` /
`ARGV` — including the standard-library tables `string`, `table`, `math` —
are *not* restored between scripts.

A malicious script can therefore plant traps that affect future scripts running
on the same worker thread:

```lua
-- attacker: EVAL "string.byte = function() return 999 end ..."
-- victim: EVAL "return string.byte('x')" → 999, not 120
```

Or worse, a script can monkey-patch `redis.call` itself:

```lua
local orig = redis.call
redis.call = function(...) ... return "" end -- silent data corruption
```

Because the VM is pooled per-thread and reused, this contaminates other tenants
sharing the server. `redis` itself is set in `setup_redis_module` only on cold
path (line 124), so a poisoned `redis.call` survives every subsequent
`execute_script` on that thread until the VM is dropped (pool is full).

## Fix

Either (preferred):

1. Re-snapshot the global table after `sandbox_lua` + `setup_redis_module`
completes (e.g., serialize a registry-stashed clean copy), and restore it on
each `take_lua`. mlua exposes `Lua::globals()` and `Table::clear()` /
pairs traversal to rebuild it.

Or, simpler:

2. Drop the pool entirely — create a fresh `Lua` per script. The cost is one
sandbox setup per EVAL, which dominates only for trivially small scripts.
Benchmark first.

Or:

3. Wrap user scripts in a fresh Lua function environment (`setfenv` /
`_ENV` swap in 5.2+) so each script gets its own `_ENV` with read-only
parent. Verify mlua exposes the necessary primitives.

Option 1 or 3 preserves the perf benefit. Option 2 is simplest and safest.

## Verification

- New test: run two scripts back-to-back on the same engine; the first mutates
`string.byte`, the second calls `string.byte('x')` and must observe the
original value.
- Existing Lua tests still pass.
43 changes: 43 additions & 0 deletions tasks/todo/SEC-004-script-cache-unbounded.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# SEC-004 — Script cache (`SCRIPT LOAD` / `EVAL`) has no size cap, enabling memory-exhaustion DoS

**Severity:** High

## Location

- `crates/solikv-engine/src/lua.rs:20-58` (`ScriptCache`)
- `crates/solikv-engine/src/dispatch.rs:3236` (`EVAL` calls `script_cache.load(script)` on every invocation)
- `crates/solikv-engine/src/dispatch.rs:3286-3303` (`SCRIPT LOAD`)

## Issue

`ScriptCache` is a `DashMap<String, String>` with no eviction, no maximum
entry count, and no maximum source size. Any authenticated client can call:

```
SCRIPT LOAD "<random 1 MiB script body>"
```

repeatedly, and each insert is permanent. With ~1 KiB scripts the server holds
~10 MB after 10 000 calls; with 1 MiB scripts, 10 GB after 10 000 calls. There
is no command to flush only stale entries, and even `SCRIPT FLUSH` requires the
attacker's cooperation.

`EVAL` *also* re-inserts the script source on every execution (line 3236), so
the cache grows on the regular EVAL hot path too — not just on `SCRIPT LOAD`.

## Fix

1. Cap script source size at a sane limit (suggested: 64 KiB, configurable).
Reject larger scripts with `ERR script too long`.
2. Cap total script count (suggested: 8 192 entries) and switch the storage
from `DashMap` to a small LRU (e.g. `lru::LruCache` behind a `Mutex`,
contention is fine — script load is rare).
3. In `EVAL`, only insert into the cache if the SHA is not already present
(cheap `contains_key` check), to avoid the rewrite-on-every-execution
churn.

## Verification

- Unit: load N scripts where N > cache cap, assert oldest is evicted.
- Unit: `EVAL` of an oversize script returns `ERR script too long`.
- Existing scripting tests still pass.
45 changes: 45 additions & 0 deletions tasks/todo/SEC-005-secrets-on-cli.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
# SEC-005 — Passwords accepted only via CLI flags, exposing them in `ps` / `/proc/<pid>/cmdline` / shell history

**Severity:** High

## Location

- `crates/solikv-server/src/main.rs:65-67` (`--requirepass`)
- `crates/solikv-server/src/main.rs:97-99` (`--cluster-password`)

## Issue

The auth password and cluster password are accepted only as `clap` arguments
with `value_name = "PASSWORD"`. On Linux:

- `ps auxe`, `/proc/<pid>/cmdline`, and any monitoring tool that reads cmdlines
expose the password to other local users.
- The launching shell's history (`.bash_history`, `.zsh_history`) records it.
- The daemon code (`main.rs:140-149`) re-`spawn`s itself with
`std::env::args().skip(1)`, so the child process *also* has the password in
its cmdline.

Production deployments commonly leak Redis passwords this way; SoliKV inherits
the same risk and has no alternative input mechanism.

## Fix

Add support for password sources, in priority order:

1. `--requirepass-file <path>` / `--cluster-password-file <path>`: read a file
(newline-trimmed). Recommended for production.
2. `SOLIKV_PASSWORD` / `SOLIKV_CLUSTER_PASSWORD` environment variables (clap's
`env = "..."` attribute).
3. Keep `--requirepass <PASS>` for dev convenience but **emit a warning to
stderr** when used so operators see the leak risk.

Refuse to start if both flag and file/env are set, to avoid silent precedence
confusion.

When the file path is used, verify mode 0600 (warn loudly if world-readable).

## Verification

- Unit: parse args from a file successfully; warn when flag is used.
- Manual: `ps -ef | grep solikv` after start with `--requirepass-file` shows
no password in cmdline.
53 changes: 53 additions & 0 deletions tasks/todo/SEC-006-pidfile-tmp-symlink.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
# SEC-006 — Daemon mode uses world-writable `/tmp/solikv.pid` + `/tmp/solikv.log`, with weak process verification

**Severity:** High

## Location

- `crates/solikv-server/src/main.rs:106` (`pidfile = PathBuf::from("/tmp/solikv.pid")`)
- `crates/solikv-server/src/main.rs:108-130` (kill-old logic)
- `crates/solikv-server/src/main.rs:132-138` (`/tmp/solikv.log` opened with `create + append`)

## Issue

Three problems compound:

1. **Symlink attack on the log file.** `OpenOptions::new().create(true).append(true)
.open("/tmp/solikv.log")` follows existing symlinks. On a multi-user host an
attacker who controls `/tmp/solikv.log` (it is in a sticky-bit directory but
they can win the race or pre-place a symlink before solikv first runs) can
point it at any path the operator can write to (e.g. `/etc/something`,
`~/.bashrc`). solikv then appends its stdout/stderr there.

2. **PID file race + weak verification.** Lines 113-122 read `/proc/<pid>/cmdline`
and only check `.contains("solikv")`. Any cmdline containing the substring
"solikv" passes — including a user's `vim /tmp/solikv-notes`, `grep solikv
…`, etc. Combined with the predictable `/tmp/solikv.pid` path (which a local
attacker can pre-write), this lets the attacker get solikv to send `kill` to
an arbitrary PID belonging to the operator (so long as that process's
cmdline mentions "solikv").

3. **Use of `kill(1)` subprocess** for what is a built-in syscall — extra fork,
extra attack surface, and silently swallows errors with `let _ = ... .output()`.

## Fix

1. Move pidfile and log to `args.dir` (the operator-supplied data directory),
defaulting to `./data`. Refuse to operate on `/tmp` paths.
2. Open the log with `OpenOptions::new().custom_flags(libc::O_NOFOLLOW)` (Unix)
to refuse symlinks; or open with `O_CREAT | O_EXCL` first and fall back to
append-after-stat-check.
3. Strengthen the PID check: parse `/proc/<pid>/exe` (a symlink to the binary)
and compare to `std::env::current_exe()`. `cmdline.contains("solikv")` is
not a security check.
4. Replace the `Command::new("kill")` subprocess with `nix::sys::signal::kill`
or `libc::kill` directly.
5. After unlinking the stale pidfile, write the new one with `O_EXCL` so two
concurrent daemon starts cannot both succeed.

## Verification

- Unit (where feasible): given a PID whose `/proc/<pid>/exe` does not equal our
current exe, the kill path is a no-op.
- Manual: `ln -s /tmp/victim /tmp/solikv.log`, run `solikv -d`, confirm solikv
refuses (or creates a new file separate from the symlink target).
51 changes: 51 additions & 0 deletions tasks/todo/SEC-007-default-bind-no-auth.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# SEC-007 — Default bind is `0.0.0.0` with no password required, exposing an unauthenticated DB to the network

**Severity:** High

## Location

- `crates/solikv-server/src/main.rs:38-39` (`#[arg(long, default_value = "0.0.0.0")] bind`)
- `crates/solikv-server/src/main.rs:65-67` (`requirepass: Option<String>` — no default)
- `crates/solikv-server/src/resp_server.rs:80` (`requires_auth = password.is_some()`)

## Issue

A user who runs `solikv` with no flags gets:

- RESP server listening on `0.0.0.0:6379`
- REST API on `0.0.0.0:5020`
- No authentication required (`password = None` ⇒ `requires_auth = false`)

This exposes the entire database to anyone on the same network. The same
default tripped Redis for years and led to a documented mass-exploitation
campaign ("Redis WCRY"). The README's quickstart and Docker example both run
this way.

The cluster-bus port at least binds to `127.0.0.1` only (`gossip.rs:336`), but
the public RESP/REST surface does not.

## Fix

Pick one of these (in order of preference):

1. **Refuse to start unauthenticated when bind is non-loopback.** If
`args.bind != "127.0.0.1"` and `args.requirepass.is_none()`, exit with
a clear error pointing at `--requirepass` and `--bind 127.0.0.1`. Provide
`--protected-mode no` opt-out for users who really mean it.
2. Change the default `bind` to `127.0.0.1`. Document `--bind 0.0.0.0` for
network exposure.

Approach 1 is closer to upstream Redis "protected mode" and is what most
Redis-compatible projects do. Approach 2 is simpler but slightly less
forgiving.

Also update the README quickstart and Docker examples to set `--requirepass`
and/or `--bind 127.0.0.1`, and note the new behavior in
`www/app/views/docs/getting_started.html.slv`.

## Verification

- Unit: `Args::parse_from(["solikv", "--bind", "0.0.0.0"])` plus the
protected-mode check returns an error.
- Manual: `cargo run` with no flags refuses to bind public; `cargo run --
--bind 0.0.0.0 --requirepass foo` works.
Loading
Loading