14 changes: 14 additions & 0 deletions Makefile
@@ -25,6 +25,8 @@ help:
@echo " make db-dev-down - Stop the Docker dev database/Redis/Adminer"
@echo " make api-dev-docker-db - Run the local API against the Docker dev database (15432)"
@echo " make db-dev-status - Show Docker dev database/Redis/Adminer and API port status"
@echo " make metrics-check - Basic metrics consistency check (/health vs /metrics)"
@echo " make seed-bcrypt-user - Insert a bcrypt test user (triggers rehash on login)"

# Install dependencies
install:
@@ -214,6 +216,18 @@ db-dev-status:
@echo "🌿 /health:"
@curl -fsS http://localhost:$${API_PORT:-8012}/health 2>/dev/null || echo "(API not responding)"

# ---- Metrics & Dev Utilities ----
metrics-check:
@echo "Running metrics consistency script..."
@cd jive-api && ./scripts/check_metrics_consistency.sh || true
@echo "Key /metrics lines:" && curl -fsS http://localhost:$${API_PORT:-8012}/metrics | grep -E 'password_hash_|jive_build_info|export_requests_' || true

seed-bcrypt-user:
@echo "Inserting bcrypt test user (if missing)..."
@cd jive-api && cargo run --bin hash_password --quiet -- 'TempBcrypt123!' >/dev/null 2>&1 || true
# The DO block's dollar quoting is escaped for both make and the shell; gen_salt('bf') needs the pgcrypto extension.
@psql $${DATABASE_URL:-postgresql://postgres:postgres@localhost:5433/jive_money} -c "DO \$$\$$ BEGIN IF NOT EXISTS (SELECT 1 FROM users WHERE email='bcrypt_test@example.com') THEN INSERT INTO users (email,password_hash,name,is_active,created_at,updated_at) VALUES ('bcrypt_test@example.com', crypt('TempBcrypt123!', gen_salt('bf')), 'Bcrypt Test', true, NOW(), NOW()); END IF; END \$$\$$;" 2>/dev/null || echo "⚠️ Requires a local Postgres instance (5433)"
@echo "Test login: curl -X POST -H 'Content-Type: application/json' -d '{\"email\":\"bcrypt_test@example.com\",\"password\":\"TempBcrypt123!\"}' http://localhost:$${API_PORT:-8012}/api/v1/auth/login"

# Format code
format:
@echo "Formatting Rust code..."
84 changes: 84 additions & 0 deletions README.md
@@ -186,6 +186,76 @@ export JWT_SECRET=$(openssl rand -hex 32)

When unset (or left empty), the API falls back to an insecure placeholder in development and test environments and prints a warning. Do not rely on this default in production.

### Monitoring and Metrics

| Endpoint | Purpose | Auth | Notes |
|-------------|-------------------|------|------|
| `/health` | Liveness probe + snapshot | No | Lightweight JSON: hash distribution, rehash status, exchange-rate metrics, etc. |
| `/metrics` | Prometheus scraping | No | Text-format metrics (suited to long-term monitoring) |
Canonical metrics (recommended):
```
password_hash_bcrypt_total # bcrypt (2a+2b+2y)
password_hash_argon2id_total # number of argon2id hashes
password_hash_unknown_total # unrecognized prefix
password_hash_total_count # total
password_hash_bcrypt_variant{variant="2b"} X # per variant

Review comment (medium):

The 'X' in this line appears to be a placeholder for the metric's value. It should be removed or replaced with a more descriptive placeholder like <value> to avoid confusion for users reading the documentation.

jive_password_rehash_total # successful rehashes (bcrypt→argon2id)
jive_password_rehash_fail_total # failed rehashes (do not block login)
jive_password_rehash_fail_breakdown_total{cause="hash"|"update"} # rehash failures by cause
export_requests_buffered_total # buffered export requests (POST CSV/JSON)
export_requests_stream_total # streaming export requests (GET CSV streaming, feature=export_stream)
export_rows_buffered_total # cumulative rows, buffered export
export_rows_stream_total # cumulative rows, streaming export
jive_build_info{...} # build info (value=1)
auth_login_fail_total # login failures (unknown user / password mismatch)
auth_login_inactive_total # login attempts by inactive accounts
auth_login_rate_limited_total # logins rejected by rate limiting (429)
jive_build_info{commit,time,rustc,version} 1 # build info gauge
export_duration_buffered_seconds_* # buffered export duration histogram (bucket/sum/count)
export_duration_stream_seconds_* # streaming export duration histogram (bucket/sum/count)
process_uptime_seconds # process uptime (seconds)
jive_build_info{commit,time,rustc,version} 1 # build info gauge

Review comment (medium):

This line is a duplicate of line 214. To improve clarity and reduce redundancy in the documentation, please remove this repeated entry for the jive_build_info metric.

```

Legacy metrics kept for compatibility (DEPRECATED; to be removed after 2 release cycles, see docs/METRICS_DEPRECATION_PLAN.md):
```
jive_password_hash_users{algo="bcrypt_2b"}
```

Prometheus scrape example:
```yaml
scrape_configs:
  - job_name: jive-api
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["api-host:8012"]
```

Quick consistency check (does the bcrypt aggregate in /health match /metrics?):
```bash
H=$(curl -s http://localhost:8012/health)
M=$(curl -s http://localhost:8012/metrics)
echo "Health bcrypt sum:" \
$(echo "$H" | jq '.metrics.hash_distribution.bcrypt | (."2a"+."2b"+."2y")')
echo "Metrics bcrypt total:" \
$(grep '^password_hash_bcrypt_total' <<<"$M" | awk '{print $2}')
```
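
The same idea can be applied offline to a saved `/metrics` payload. The following is only a sketch, not the project's `check_metrics_consistency.sh`: the function name `check_bcrypt_consistency` is hypothetical, and it assumes the per-variant `password_hash_bcrypt_variant{...}` series should add up to `password_hash_bcrypt_total`, as the metric list describes (bcrypt = 2a+2b+2y).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): reads Prometheus text
# exposition from stdin and checks that the per-variant bcrypt counts add up
# to password_hash_bcrypt_total.
check_bcrypt_consistency() {
  awk '
    BEGIN { sum = 0; seen = 0 }
    /^password_hash_bcrypt_variant[{]/    { sum += $NF }             # 2a / 2b / 2y series
    /^password_hash_bcrypt_total([ {]|$)/ { total = $NF; seen = 1 }  # aggregate series
    END {
      if (!seen)        { print "MISSING password_hash_bcrypt_total"; exit 2 }
      if (sum == total) { print "OK " total; exit 0 }
      print "MISMATCH variants=" sum " total=" total
      exit 1
    }
  '
}
```

Usage against a running instance could look like `curl -s http://localhost:8012/metrics | check_bcrypt_consistency`; a non-zero exit status signals a mismatch or a missing aggregate.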

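Operational recommendations:
- For large user bases, consider a 30s in-memory cache for the hash query (planned).
- Remove the legacy jive_password_hash_users* series once all dashboards have migrated (target v1.2.0).
- Monitor `jive_password_rehash_fail_total`; sustained growth points to DB update or concurrency problems.
- Export duration histogram examples:
```promql
# P95 buffered export duration
histogram_quantile(0.95, sum(rate(export_duration_buffered_seconds_bucket[5m])) by (le))

# Average streaming export duration over the last minute
sum(rate(export_duration_stream_seconds_sum[1m])) / sum(rate(export_duration_stream_seconds_count[1m]))
```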

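### Password Rehash (bcrypt → Argon2id)

After a successful login, if a legacy bcrypt hash is detected, the system attempts a transparent upgrade to Argon2id unless `REHASH_ON_LOGIN` is explicitly disabled (it is enabled by default):
@@ -523,3 +593,17 @@ MIT License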
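## 📞 Contact

For questions, please open an Issue or contact the maintainers.
Environment variables (Metrics & Security):
```
AUTH_RATE_LIMIT=30/60 # at most 30 login attempts per 60-second window (default 30/60)
AUTH_RATE_LIMIT_HASH_EMAIL=1 # hash and truncate the email in the rate-limit key (default 1)
ALLOW_PUBLIC_METRICS=1 # set to 0 to enable the allowlist
METRICS_ALLOW_CIDRS=127.0.0.1/32 # comma-separated CIDR list (effective when ALLOW_PUBLIC_METRICS=0)
METRICS_DENY_CIDRS= # optional deny CIDRs (deny takes priority)
METRICS_CACHE_TTL=30 # /metrics cache lifetime in seconds (0 disables)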
```
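
To make the allow/deny semantics concrete, here is a minimal IPv4-only sketch of how such a check could be evaluated. It is an illustration, not the API's implementation: the function names (`ip_to_int`, `in_cidr`, `matches_any`, `metrics_allowed`) are hypothetical, and it assumes that any value of `ALLOW_PUBLIC_METRICS` other than `0` admits every client, while in restricted mode the deny list is consulted before the allow list.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the CIDR gate for /metrics (IPv4 only).

# Convert a dotted-quad address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed when address $1 falls inside CIDR $2 (a bare address means /32).
in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}"
  [[ "$2" == */* ]] || bits=32
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$ip") & mask) == ($(ip_to_int "$net") & mask) ))
}

# Succeed when address $1 matches any entry of the comma-separated list $2.
matches_any() {
  local ip="$1" cidr
  local -a cidrs
  IFS=, read -ra cidrs <<< "$2"
  for cidr in "${cidrs[@]}"; do
    [[ -n "$cidr" ]] && in_cidr "$ip" "$cidr" && return 0
  done
  return 1
}

# Assumed policy: public mode allows everyone; otherwise deny wins over allow.
metrics_allowed() {
  local ip="$1"
  [[ "${ALLOW_PUBLIC_METRICS:-1}" != "0" ]] && return 0
  matches_any "$ip" "${METRICS_DENY_CIDRS:-}" && return 1
  matches_any "$ip" "${METRICS_ALLOW_CIDRS:-127.0.0.1/32}"
}
```

With `ALLOW_PUBLIC_METRICS=0`, `METRICS_ALLOW_CIDRS=10.0.0.0/8` and `METRICS_DENY_CIDRS=10.1.2.0/24`, this sketch admits `10.9.9.9` and rejects `10.1.2.3`, which is what "deny takes priority" implies.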

Grafana dashboard: `docs/GRAFANA_DASHBOARD_TEMPLATE.json`
Alert rule examples: `docs/ALERT_RULES_EXAMPLE.yaml`
Security checklist: `docs/SECURITY_CHECKLIST.md`
Quick verification script: `scripts/verify_observability.sh`
39 changes: 39 additions & 0 deletions docs/ALERT_RULES_EXAMPLE.yaml
@@ -0,0 +1,39 @@
groups:
  - name: jive-api-alerts
    rules:
      - alert: RehashFailBurst
        expr: increase(jive_password_rehash_fail_total[10m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Password rehash failures detected
      - alert: LoginFailSurge
        expr: rate(auth_login_fail_total[5m]) > 3 * rate(auth_login_fail_total[30m])
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Sudden login failure surge
      - alert: ExportLatencyHigh
        expr: histogram_quantile(0.95,sum by (le)(rate(export_duration_buffered_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Buffered export P95 latency >2s
      - alert: RateLimitedSpike
        expr: increase(auth_login_rate_limited_total[5m]) > 50
        for: 5m
        labels:
          severity: info
        annotations:
          summary: Many logins being rate-limited (possible attack)
      - alert: ProcessRestarted
        expr: increase(process_uptime_seconds[5m]) < 60

Review comment (medium):

Using increase() on a gauge like process_uptime_seconds is not idiomatic in PromQL and can be confusing. A simpler and more direct way to detect a recent restart is to check whether the uptime is below the detection window.

For example, to detect a restart within the last 5 minutes (300 seconds), you could use:

process_uptime_seconds < 300

This is more readable and directly expresses the condition you want to alert on.

Suggested change:

        expr: process_uptime_seconds < 300

        for: 0m
        labels:
          severity: info
        annotations:
          summary: API process restarted recently

18 changes: 18 additions & 0 deletions docs/GRAFANA_DASHBOARD_TEMPLATE.json
@@ -0,0 +1,18 @@
{
"title": "Jive API Overview",
"panels": [
{"type":"stat","title":"Uptime (h)","targets":[{"expr":"process_uptime_seconds/3600"}]},
{"type":"stat","title":"Rehash Success","targets":[{"expr":"jive_password_rehash_total"}]},
{"type":"stat","title":"Rehash Fail","targets":[{"expr":"jive_password_rehash_fail_total"}]},
{"type":"graph","title":"Password Hash Distribution","targets":[{"expr":"password_hash_bcrypt_total"},{"expr":"password_hash_argon2id_total"}]},
{"type":"graph","title":"Rehash Fail Breakdown","targets":[{"expr":"sum by (cause)(increase(jive_password_rehash_fail_breakdown_total[5m]))"}]},
{"type":"graph","title":"Login Outcomes","targets":[{"expr":"rate(auth_login_fail_total[5m])"},{"expr":"rate(auth_login_inactive_total[5m])"},{"expr":"rate(auth_login_rate_limited_total[5m])"}]},
{"type":"graph","title":"Export Requests","targets":[{"expr":"rate(export_requests_buffered_total[5m])"},{"expr":"rate(export_requests_stream_total[5m])"}]},
{"type":"graph","title":"Export Rows","targets":[{"expr":"rate(export_rows_buffered_total[5m])"},{"expr":"rate(export_rows_stream_total[5m])"}]},
{"type":"graph","title":"Buffered Export P95","targets":[{"expr":"histogram_quantile(0.95,sum by (le)(rate(export_duration_buffered_seconds_bucket[5m])))"}]},
{"type":"graph","title":"Stream Export P95","targets":[{"expr":"histogram_quantile(0.95,sum by (le)(rate(export_duration_stream_seconds_bucket[5m])))"}]}
],
"schemaVersion": 38,
"version": 1
}

54 changes: 54 additions & 0 deletions docs/METRICS_DEPRECATION_PLAN.md
@@ -0,0 +1,54 @@
# Metrics Deprecation Plan

This document tracks deprecation and removal timelines for legacy metrics exposed by the API.

## Principles
- Provide at least two released versions of overlap before removal.
- Never silently change a metric's semantic meaning; prefer adding a new metric.
- Document target removal version and migration path here + README.

## Deprecated Metrics
| Metric | Status | Replacement | First Deprecated | Target Removal | Notes |
|--------|--------|-------------|------------------|----------------|-------|
| `jive_password_hash_users` (labels: bcrypt_2a,bcrypt_2b,bcrypt_2y,argon2id) | Deprecated | `password_hash_bcrypt_variant`, `password_hash_bcrypt_total`, `password_hash_argon2id_total` | v1.0.0 | v1.2.0 | Keep until majority dashboards migrated |
| `jive_password_rehash_fail_total` | Deprecated (aggregate) | `jive_password_rehash_fail_breakdown_total{cause}` | v1.0.X | v1.3.0 | Remove once dashboards use breakdown |

## Active Canonical Metrics (Password Hash & Auth)
- `password_hash_bcrypt_total`
- `password_hash_argon2id_total`
- `password_hash_unknown_total`
- `password_hash_total_count`
- `password_hash_bcrypt_variant{variant="2a"|"2b"|"2y"}`
- `jive_password_rehash_total`
- `jive_password_rehash_fail_total`
- `auth_login_fail_total`
- `auth_login_inactive_total`

## Export Metrics
- `export_requests_buffered_total`
- `export_requests_stream_total`
- `export_rows_buffered_total`
- `export_rows_stream_total`
- `export_duration_buffered_seconds_*` (histogram buckets/sum/count)
- `export_duration_stream_seconds_*` (histogram buckets/sum/count)

## Build / Operational
- `jive_build_info{commit,time,rustc,version}` (value always 1)
- `process_uptime_seconds`

## Future Candidates
| Proposed | Description | Status |
|----------|-------------|--------|
| `auth_login_fail_total` | Count failed login attempts (unauthorized) | Planned |
| `export_duration_seconds` (histogram) | Latency of export operations | Planned |
| `process_uptime_seconds` | Seconds since process start | Implemented |
Comment on lines +42 to +44

Review comment (medium):

The 'Future Candidates' table seems to be out of sync with the changes in this PR. Several metrics listed as 'Planned' or 'Implemented' are now fully available.

  • auth_login_fail_total is implemented, not 'Planned'.
  • export_duration_seconds is implemented as export_duration_buffered_seconds and export_duration_stream_seconds.

To ensure the documentation accurately reflects the current state of the project, please update this table to mark these metrics as 'Implemented' and adjust their descriptions accordingly.


## Removal Procedure
1. Mark metric here and in README as DEPRECATED with target version.
2. Announce in release notes for two consecutive releases.
3. After reaching target version, remove metric exposition code; update this file.
4. Provide simple one-shot conversion guidance for dashboards.

## Changelog
- v1.0.0: Introduced canonical password hash metrics + export metrics; deprecated legacy `jive_password_hash_users`.
- v1.0.X: Added login fail/inactive counters; export duration histograms; uptime gauge.
58 changes: 58 additions & 0 deletions docs/PR_SECURITY_METRICS_SUMMARY.md
@@ -0,0 +1,58 @@
## PR Security & Metrics Summary (Template)

### Overview
This PR strengthens API security and observability. Copy & adapt sections below for the final PR description.

### Key Changes
- Login rate limiting (IP + email key) with structured 429 JSON and `Retry-After` header.
- Metrics endpoint CIDR allow + deny lists (`ALLOW_PUBLIC_METRICS=0`, `METRICS_ALLOW_CIDRS`, `METRICS_DENY_CIDRS`).
- Password rehash failure breakdown: `jive_password_rehash_fail_breakdown_total{cause="hash"|"update"}`.
- Export performance histograms (buffered & streaming) and uptime metric.
- New security / monitoring docs: Grafana dashboard, alert rules, security checklist.
- Email-based rate limit key hashing (first 8 hex of SHA256) for privacy.
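
As an illustration of the email-based key hashing item, the sketch below derives a key from the client IP plus the first 8 hex characters of the email's SHA-256 digest. Only the "first 8 hex of SHA256" part comes from this summary; the function name `rate_limit_key`, the `ip:id` key layout, the lowercasing step, and the use of the coreutils `sha256sum` tool are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build a login rate-limit key from client IP and email.
rate_limit_key() {
  local ip="$1" email id
  # Assumption: emails are lowercased so that case variants share one bucket.
  email=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  if [ "${AUTH_RATE_LIMIT_HASH_EMAIL:-1}" = "1" ]; then
    # Keep only the first 8 hex characters of the SHA-256 digest.
    id=$(printf '%s' "$email" | sha256sum | cut -c1-8)
  else
    id="$email"
  fi
  printf '%s:%s\n' "$ip" "$id"
}
```

Truncating to 8 hex characters keeps raw addresses out of the in-memory key at the cost of a small collision probability, which only means two emails could occasionally share a bucket.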

### New / Modified Environment Variables
| Variable | Purpose | Default |
|----------|---------|---------|
| `AUTH_RATE_LIMIT` | Login attempts per window (N/SECONDS) | `30/60` |
| `AUTH_RATE_LIMIT_HASH_EMAIL` | Hash email in key (privacy) | `1` |
| `ALLOW_PUBLIC_METRICS` | If `0`, restrict metrics by CIDR | `1` |
| `METRICS_ALLOW_CIDRS` | Comma CIDR whitelist | `127.0.0.1/32` |
| `METRICS_DENY_CIDRS` | Comma CIDR deny (priority) | (empty) |
| `METRICS_CACHE_TTL` | Metrics base cache seconds | `30` |
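
The `N/SECONDS` format of `AUTH_RATE_LIMIT` can be parsed with plain parameter expansion. This is a sketch only: the helper name `parse_rate_limit` is hypothetical, and falling back to the documented default `30/60` on malformed input is an assumption about how such a parser might behave, not a statement about the API.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split an AUTH_RATE_LIMIT value ("N/SECONDS") into
# "N SECONDS". Assumption: malformed or missing values fall back to 30/60.
parse_rate_limit() {
  local raw="${1:-30/60}" count window
  case "$raw" in
    */*) count="${raw%%/*}"; window="${raw#*/}" ;;
    *)   echo "30 60"; return 0 ;;
  esac
  # Reject empty or non-numeric parts (this also rejects "3/60/5").
  case "$count"  in ''|*[!0-9]*) echo "30 60"; return 0 ;; esac
  case "$window" in ''|*[!0-9]*) echo "30 60"; return 0 ;; esac
  if [ "$count" -gt 0 ] && [ "$window" -gt 0 ]; then
    echo "$count $window"
  else
    echo "30 60"
  fi
}
```

For example, `read -r max_attempts window_secs <<< "$(parse_rate_limit "$AUTH_RATE_LIMIT")"` would populate two shell variables from the environment setting.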

### Prometheus Metrics Added
| Metric | Type | Notes |
|--------|------|-------|
| `auth_login_rate_limited_total` | counter | Rate-limited login attempts |
| `jive_password_rehash_fail_breakdown_total{cause}` | counter | Split hash/update failures |
| `export_duration_buffered_seconds_*` | histogram | Export latency (buffered) |
| `export_duration_stream_seconds_*` | histogram | Export latency (stream) |
| `process_uptime_seconds` | gauge | Runtime age |

Deprecated (pending removal): `jive_password_rehash_fail_total` (aggregate).

### Quick Local Verification
Run stack (example):
```bash
ALLOW_PUBLIC_METRICS=1 AUTH_RATE_LIMIT=3/60 cargo run --bin jive-api &
sleep 2
./scripts/verify_observability.sh
```

Expect PASS output and non-zero counters for `auth_login_fail_total` after simulated attempts.

### Reviewer Checklist
- [ ] 429 login response includes `Retry-After` and JSON structure
- [ ] `/metrics` reachable only when expected (toggle ALLOW_PUBLIC_METRICS)
- [ ] Rehash breakdown metrics appear
- [ ] Export histogram buckets present
- [ ] Uptime metric increasing across scrapes
- [ ] Security checklist file present (`docs/SECURITY_CHECKLIST.md`)

### Follow-up (Optional / Tracked)
- Audit logging for repeated rate-limit triggers
- Global unified error response model
- Redis/distributed rate limiting for multi-instance scaling
- Remove deprecated rehash aggregate metric (target v1.3.0)

27 changes: 27 additions & 0 deletions docs/SECURITY_CHECKLIST.md
@@ -0,0 +1,27 @@
## Production Security Checklist

1. Secrets
   - Set strong `JWT_SECRET` (>=32 random bytes). Never use dev default.
2. Metrics Exposure
   - `ALLOW_PUBLIC_METRICS=0`
   - Restrict `METRICS_ALLOW_CIDRS` to monitoring network.
3. Rate Limiting
   - Tune `AUTH_RATE_LIMIT` (e.g. 20/60 or 50/300 based on traffic).
   - Keep `AUTH_RATE_LIMIT_HASH_EMAIL=1` to avoid leaking raw emails in memory keys.
4. TLS / Reverse Proxy
   - Terminate TLS at trusted proxy; strip untrusted `X-Forwarded-For`.
5. Logging
   - Ensure logs exclude plaintext passwords/tokens.
   - Monitor `auth_login_rate_limited_total` + `auth_login_fail_total` anomalies.
6. Password Migration
   - Track reduction of bcrypt via `password_hash_bcrypt_total` trend.
   - Investigate any spike in `jive_password_rehash_fail_breakdown_total{cause}`.
7. Export Controls
   - Consider pagination/stream for large exports; watch P95 latency panels.
8. Dependency Hygiene
   - Run `cargo deny` (already in CI) before release.
9. Database
   - Use least-privilege DB role for API.
10. Incident Response
    - Create alerts using `docs/ALERT_RULES_EXAMPLE.yaml` as baseline.

1 change: 1 addition & 0 deletions jive-api/Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions jive-api/Cargo.toml
@@ -4,6 +4,7 @@ version = "1.0.0"
edition = "2021"
authors = ["Jive Money Team"]
description = "Jive Money API Server for category template management"
build = "build.rs"

[lib]
name = "jive_money_api"
@@ -44,6 +45,7 @@ base64 = "0.22"
# Make core optional; gate usage behind feature `core_export`
jive-core = { path = "../jive-core", package = "jive-core", features = ["server", "db"], default-features = false, optional = true }
bytes = "1"
sha2 = "0.10"

# WebSocket support
tokio-tungstenite = "0.24"
28 changes: 20 additions & 8 deletions jive-api/PR47_METRICS_VERIFICATION_REPORT.md
@@ -252,20 +252,32 @@ curl -s http://localhost:8014/health | jq '.metrics'
6. **✅ Consistency Validation**: Perfect consistency between `/health` and `/metrics` endpoints
7. **✅ Code Quality**: Meets project standards with only minor cosmetic warnings

-### Verified Metrics Available
-- `jive_password_rehash_total` - Counter of successful bcrypt→argon2id rehashes
-- `jive_password_hash_users{algo="bcrypt_2a|2b|2y|argon2id"}` - User count by hash type
-
-### Next Actions (Optional)
-1. Add monitoring documentation to README
-2. Create consistency verification scripts
+### Verified Metrics Available (Updated Post PR #48 Plan)
+Canonical (new) metrics:
+- `password_hash_bcrypt_total` – Users with any bcrypt variant (2a+2b+2y)
+- `password_hash_argon2id_total` – Users with argon2id hashes
+- `password_hash_unknown_total` – Users whose hash prefix not in (2a,2b,2y,argon2id)
+- `password_hash_total_count` – Total users counted
+- `password_hash_bcrypt_variant{variant="2a|2b|2y"}` – Per-variant bcrypt counts
+- `jive_password_rehash_total` – Successful bcrypt→argon2id rehash counter
+
+Legacy (DEPRECATED – retained temporarily for dashboards):
+- `jive_password_hash_users{algo="bcrypt_2a|bcrypt_2b|bcrypt_2y|argon2id"}`
+
+Deprecation Notice: legacy `jive_password_hash_users` will be removed after dashboards migrate to canonical metrics (target: two release cycles). Monitor usage before removal.
+
+### Next Actions (Optional / In Progress)
+1. Add monitoring documentation to README (IN PROGRESS)
+2. Create consistency verification scripts (IN PROGRESS)
3. Configure Prometheus scraping for production
4. Test rehash counter during actual password changes
+5. Migrate dashboards from legacy to canonical metrics
+6. Decide removal date for legacy metrics (propose: +2 releases)

### Final Status: 🎯 COMPLETE SUCCESS
**All verification requirements fulfilled. PR #47 is production-ready.**

---
*Final report completed: 2025-09-26T01:16:00Z*
*Runtime testing completed: 2025-09-26T01:15:30Z*
-*Merge verified: 2025-09-26T01:10:47Z*
+*Merge verified: 2025-09-26T01:10:47Z*