feat(api): security+metrics enhancements #50
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -186,6 +186,76 @@ export JWT_SECRET=$(openssl rand -hex 32) | |
|
|
||
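| When unset (or left empty), the API automatically falls back to an insecure placeholder in development / test and prints a warning; do not rely on this default in production. | ||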
|
|
||
| ### Monitoring and Metrics | ||
|
|
||
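| | Endpoint | Purpose | Auth | Notes | | ||
| |-------------|-------------------|------|------| | ||
| | `/health` | Liveness probe + snapshot | No | Lightweight JSON: hash distribution, rehash status, exchange-rate metrics, etc. | | ||
| | `/metrics` | Prometheus scrape | No | Text-format metrics (suited to long-term monitoring) | | ||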
|
|
||
| Canonical metrics (recommended): | ||
| ``` | ||
| password_hash_bcrypt_total # bcrypt (2a+2b+2y) | ||
| password_hash_argon2id_total # argon2id count | ||
| password_hash_unknown_total # unrecognized prefix | ||
| password_hash_total_count # total | ||
| password_hash_bcrypt_variant{variant="2b"} X # one series per variant | ||
| jive_password_rehash_total # successful rehashes (bcrypt→argon2id) | ||
| jive_password_rehash_fail_total # failed rehashes (do not block login) | ||
| jive_password_rehash_fail_breakdown_total{cause="hash"|"update"} # rehash failures by cause | ||
| export_requests_buffered_total # buffered export requests (POST CSV/JSON) | ||
| export_requests_stream_total # streaming export requests (GET CSV streaming, feature=export_stream) | ||
| export_rows_buffered_total # cumulative rows, buffered export | ||
| export_rows_stream_total # cumulative rows, streaming export | ||
| jive_build_info{...} # build info (value=1) | ||
| auth_login_fail_total # login failures (unknown user / password mismatch) | ||
| auth_login_inactive_total # login attempts by inactive accounts | ||
| auth_login_rate_limited_total # rate-limited logins (429) | ||
| jive_build_info{commit,time,rustc,version} 1 # build info gauge | ||
| export_duration_buffered_seconds_* # buffered export duration histogram (bucket/sum/count) | ||
| export_duration_stream_seconds_* # streaming export duration histogram (bucket/sum/count) | ||
| process_uptime_seconds # process uptime (seconds) | ||
| jive_build_info{commit,time,rustc,version} 1 # build info gauge | ||
| ``` | ||
|
|
||
| Legacy-compatible metrics (DEPRECATED, to be removed after 2 release cycles; see docs/METRICS_DEPRECATION_PLAN.md): | ||
| ``` | ||
| jive_password_hash_users{algo="bcrypt_2b"} | ||
| ``` | ||
|
|
||
| Prometheus scrape example: | ||
| ```yaml | ||
| scrape_configs: | ||
|   - job_name: jive-api | ||
|     metrics_path: /metrics | ||
|     scrape_interval: 15s | ||
|     static_configs: | ||
|       - targets: ["api-host:8012"] | ||
| ``` | ||
|
|
||
| Quick consistency check (does the bcrypt aggregate from /health match /metrics?): | ||
| ```bash | ||
| H=$(curl -s http://localhost:8012/health) | ||
| M=$(curl -s http://localhost:8012/metrics) | ||
| echo "Health bcrypt sum:" \ | ||
| $(echo "$H" | jq '.metrics.hash_distribution.bcrypt | (."2a"+."2b"+."2y")') | ||
| echo "Metrics bcrypt total:" \ | ||
| $(grep '^password_hash_bcrypt_total' <<<"$M" | awk '{print $2}') | ||
| ``` | ||
|
|
||
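| Operational recommendations: | ||
| - For large user bases, a 30s in-memory cache can be added to the hash query (planned). | ||
| - Remove the legacy jive_password_hash_users* series once all dashboards have been migrated (target v1.2.0). | ||
| - Monitor `jive_password_rehash_fail_total`; sustained growth points to DB update or concurrency problems. | ||
| - Export duration histogram examples: | ||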
| ```promql | ||
| # P95 buffered export duration | ||
| histogram_quantile(0.95, sum(rate(export_duration_buffered_seconds_bucket[5m])) by (le)) | ||
|
|
||
| # Average streaming export duration over the last 1 minute | ||
| sum(rate(export_duration_stream_seconds_sum[1m])) / sum(rate(export_duration_stream_seconds_count[1m])) | ||
| ``` | ||
|
|
||
| ### Password Rehash (bcrypt → Argon2id) | ||
|
|
||
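| After a successful login, if a legacy bcrypt hash is detected, the system attempts a transparent upgrade to Argon2id unless `REHASH_ON_LOGIN` is explicitly disabled (enabled by default): | ||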
|
|
@@ -523,3 +593,17 @@ MIT License | |
| ## 📞 Contact | ||
|
|
||
| If you have questions, please open an Issue or contact the maintainers. | ||
| Environment variables (metrics & security): | ||
| ``` | ||
| AUTH_RATE_LIMIT=30/60 # at most 30 login attempts per 60-second window (default 30/60) | ||
| AUTH_RATE_LIMIT_HASH_EMAIL=1 # hash and truncate the email in the rate-limit key (default 1) | ||
| ALLOW_PUBLIC_METRICS=1 # set to 0 to enable the whitelist | ||
| METRICS_ALLOW_CIDRS=127.0.0.1/32 # comma-separated CIDR list (takes effect when ALLOW_PUBLIC_METRICS=0) | ||
| METRICS_DENY_CIDRS= # optional deny CIDRs (deny takes priority) | ||
| METRICS_CACHE_TTL=30 # /metrics cache in seconds (0 disables) | ||
| ``` | ||
|
|
||
| Grafana dashboard: `docs/GRAFANA_DASHBOARD_TEMPLATE.json` | ||
| Alert rules example: `docs/ALERT_RULES_EXAMPLE.yaml` | ||
| Security checklist: `docs/SECURITY_CHECKLIST.md` | ||
| Quick verification script: `scripts/verify_observability.sh` | ||
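For readers wiring up `AUTH_RATE_LIMIT`: the README describes the value as `N/SECONDS` (at most N login attempts per window of that many seconds). A minimal Python sketch of parsing such a value follows. The helper name `parse_rate_limit` is hypothetical, and falling back to the documented default `30/60` on malformed input is an assumption, not behavior confirmed by this PR.

```python
import os
from typing import Tuple

DEFAULT_LIMIT = (30, 60)  # documented default: 30 attempts per 60 seconds


def parse_rate_limit(raw: str) -> Tuple[int, int]:
    """Parse "N/SECONDS" into (max_attempts, window_seconds).

    Hypothetical helper: returning the default on malformed or
    non-positive input is an assumption made for this sketch.
    """
    try:
        attempts_text, window_text = raw.strip().split("/", 1)
        attempts, window = int(attempts_text), int(window_text)
    except ValueError:
        # Covers a missing "/" as well as non-integer parts.
        return DEFAULT_LIMIT
    if attempts <= 0 or window <= 0:
        return DEFAULT_LIMIT
    return attempts, window


if __name__ == "__main__":
    print(parse_rate_limit(os.environ.get("AUTH_RATE_LIMIT", "30/60")))
```

Under these assumptions, `AUTH_RATE_LIMIT=3/60` (the value used in the local verification command) parses to 3 attempts per 60 seconds.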
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| groups: | ||
|   - name: jive-api-alerts | ||
|     rules: | ||
|       - alert: RehashFailBurst | ||
|         expr: increase(jive_password_rehash_fail_total[10m]) > 0 | ||
|         for: 5m | ||
|         labels: | ||
|           severity: warning | ||
|         annotations: | ||
|           summary: Password rehash failures detected | ||
|       - alert: LoginFailSurge | ||
|         expr: rate(auth_login_fail_total[5m]) > 3 * rate(auth_login_fail_total[30m]) | ||
|         for: 10m | ||
|         labels: | ||
|           severity: warning | ||
|         annotations: | ||
|           summary: Sudden login failure surge | ||
|       - alert: ExportLatencyHigh | ||
|         expr: histogram_quantile(0.95,sum by (le)(rate(export_duration_buffered_seconds_bucket[5m]))) > 2 | ||
|         for: 10m | ||
|         labels: | ||
|           severity: critical | ||
|         annotations: | ||
|           summary: Buffered export P95 latency >2s | ||
|       - alert: RateLimitedSpike | ||
|         expr: increase(auth_login_rate_limited_total[5m]) > 50 | ||
|         for: 5m | ||
|         labels: | ||
|           severity: info | ||
|         annotations: | ||
|           summary: Many logins being rate-limited (possible attack) | ||
|       - alert: ProcessRestarted | ||
|         expr: increase(process_uptime_seconds[5m]) < 60 | ||
Review comment: For example, to detect a restart within the last 5 minutes (300 seconds), you could use `process_uptime_seconds < 300`. This is more readable and directly expresses the condition you want to alert on. Suggested change: `expr: process_uptime_seconds < 300`
|         for: 0m | ||
|         labels: | ||
|           severity: info | ||
|         annotations: | ||
|           summary: API process restarted recently | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| { | ||
| "title": "Jive API Overview", | ||
| "panels": [ | ||
| {"type":"stat","title":"Uptime (h)","targets":[{"expr":"process_uptime_seconds/3600"}]}, | ||
| {"type":"stat","title":"Rehash Success","targets":[{"expr":"jive_password_rehash_total"}]}, | ||
| {"type":"stat","title":"Rehash Fail","targets":[{"expr":"jive_password_rehash_fail_total"}]}, | ||
| {"type":"graph","title":"Password Hash Distribution","targets":[{"expr":"password_hash_bcrypt_total"},{"expr":"password_hash_argon2id_total"}]}, | ||
| {"type":"graph","title":"Rehash Fail Breakdown","targets":[{"expr":"sum by (cause)(increase(jive_password_rehash_fail_breakdown_total[5m]))"}]}, | ||
| {"type":"graph","title":"Login Outcomes","targets":[{"expr":"rate(auth_login_fail_total[5m])"},{"expr":"rate(auth_login_inactive_total[5m])"},{"expr":"rate(auth_login_rate_limited_total[5m])"}]}, | ||
| {"type":"graph","title":"Export Requests","targets":[{"expr":"rate(export_requests_buffered_total[5m])"},{"expr":"rate(export_requests_stream_total[5m])"}]}, | ||
| {"type":"graph","title":"Export Rows","targets":[{"expr":"rate(export_rows_buffered_total[5m])"},{"expr":"rate(export_rows_stream_total[5m])"}]}, | ||
| {"type":"graph","title":"Buffered Export P95","targets":[{"expr":"histogram_quantile(0.95,sum by (le)(rate(export_duration_buffered_seconds_bucket[5m])))"}]}, | ||
| {"type":"graph","title":"Stream Export P95","targets":[{"expr":"histogram_quantile(0.95,sum by (le)(rate(export_duration_stream_seconds_bucket[5m])))"}]} | ||
| ], | ||
| "schemaVersion": 38, | ||
| "version": 1 | ||
| } | ||
|
|
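The P95 panels in this dashboard use `histogram_quantile` over `_bucket` series. As a rough illustration of what that function estimates, here is a simplified Python sketch of quantile estimation from cumulative buckets using linear interpolation. It is not Prometheus's implementation; the function name mirrors the PromQL one only for readability, and the bucket boundaries in the note below are invented for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a +Inf bucket,
    mirroring the `le` label on Prometheus histogram series.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, no meaningful quantile
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the open-ended bucket: report the
                # highest finite bound instead of infinity.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With hypothetical cumulative buckets `[(0.5, 50), (1.0, 80), (2.0, 95), (inf, 100)]`, the 0.95 rank lands at the top of the `le=2.0` bucket, so the estimate is 2.0 seconds, which sits exactly at (not above) the `> 2` threshold of the `ExportLatencyHigh` rule. The estimate can never be finer than the bucket layout allows.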
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| # Metrics Deprecation Plan | ||
|
|
||
| This document tracks deprecation and removal timelines for legacy metrics exposed by the API. | ||
|
|
||
| ## Principles | ||
| - Provide at least two released versions of overlap before removal. | ||
| - Never silently change a metric's semantic meaning; prefer adding a new metric. | ||
| - Document target removal version and migration path here + README. | ||
|
|
||
| ## Deprecated Metrics | ||
| | Metric | Status | Replacement | First Deprecated | Target Removal | Notes | | ||
| |--------|--------|-------------|------------------|----------------|-------| | ||
| | `jive_password_hash_users` (labels: bcrypt_2a,bcrypt_2b,bcrypt_2y,argon2id) | Deprecated | `password_hash_bcrypt_variant`, `password_hash_bcrypt_total`, `password_hash_argon2id_total` | v1.0.0 | v1.2.0 | Keep until majority dashboards migrated | | ||
| | `jive_password_rehash_fail_total` | Deprecated (aggregate) | `jive_password_rehash_fail_breakdown_total{cause}` | v1.0.X | v1.3.0 | Remove once dashboards use breakdown | | ||
|
|
||
| ## Active Canonical Metrics (Password Hash & Auth) | ||
| - `password_hash_bcrypt_total` | ||
| - `password_hash_argon2id_total` | ||
| - `password_hash_unknown_total` | ||
| - `password_hash_total_count` | ||
| - `password_hash_bcrypt_variant{variant="2a"|"2b"|"2y"}` | ||
| - `jive_password_rehash_total` | ||
| - `jive_password_rehash_fail_total` | ||
| - `auth_login_fail_total` | ||
| - `auth_login_inactive_total` | ||
|
|
||
| ## Export Metrics | ||
| - `export_requests_buffered_total` | ||
| - `export_requests_stream_total` | ||
| - `export_rows_buffered_total` | ||
| - `export_rows_stream_total` | ||
| - `export_duration_buffered_seconds_*` (histogram buckets/sum/count) | ||
| - `export_duration_stream_seconds_*` (histogram buckets/sum/count) | ||
|
|
||
| ## Build / Operational | ||
| - `jive_build_info{commit,time,rustc,version}` (value always 1) | ||
| - `process_uptime_seconds` | ||
|
|
||
| ## Future Candidates | ||
| | Proposed | Description | Status | | ||
| |----------|-------------|--------| | ||
| | `auth_login_fail_total` | Count failed login attempts (unauthorized) | Planned | | ||
| | `export_duration_seconds` (histogram) | Latency of export operations | Planned | | ||
| | `process_uptime_seconds` | Seconds since process start | Implemented | | ||
|
Review comment on lines +42 to +44: The 'Future Candidates' table seems to be out of sync with the changes in this PR. Several metrics listed as 'Planned' or 'Implemented' are now fully available. To ensure the documentation accurately reflects the current state of the project, please update this table to mark these metrics as 'Implemented' and adjust their descriptions accordingly.
||
|
|
||
| ## Removal Procedure | ||
| 1. Mark metric here and in README as DEPRECATED with target version. | ||
| 2. Announce in release notes for two consecutive releases. | ||
| 3. After reaching target version, remove metric exposition code; update this file. | ||
| 4. Provide simple one-shot conversion guidance for dashboards. | ||
|
|
||
| ## Changelog | ||
| - v1.0.0: Introduced canonical password hash metrics + export metrics; deprecated legacy `jive_password_hash_users`. | ||
| - v1.0.X: Added login fail/inactive counters; export duration histograms; uptime gauge. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| ## PR Security & Metrics Summary (Template) | ||
|
|
||
| ### Overview | ||
| This PR strengthens API security and observability. Copy & adapt sections below for the final PR description. | ||
|
|
||
| ### Key Changes | ||
| - Login rate limiting (IP + email key) with structured 429 JSON and `Retry-After` header. | ||
| - Metrics endpoint CIDR allow + deny lists (`ALLOW_PUBLIC_METRICS=0`, `METRICS_ALLOW_CIDRS`, `METRICS_DENY_CIDRS`). | ||
| - Password rehash failure breakdown: `jive_password_rehash_fail_breakdown_total{cause="hash"|"update"}`. | ||
| - Export performance histograms (buffered & streaming) and uptime metric. | ||
| - New security / monitoring docs: Grafana dashboard, alert rules, security checklist. | ||
| - Email-based rate limit key hashing (first 8 hex of SHA256) for privacy. | ||
|
|
||
| ### New / Modified Environment Variables | ||
| | Variable | Purpose | Default | | ||
| |----------|---------|---------| | ||
| | `AUTH_RATE_LIMIT` | Login attempts per window (N/SECONDS) | `30/60` | | ||
| | `AUTH_RATE_LIMIT_HASH_EMAIL` | Hash email in key (privacy) | `1` | | ||
| | `ALLOW_PUBLIC_METRICS` | If `0`, restrict metrics by CIDR | `1` | | ||
| | `METRICS_ALLOW_CIDRS` | Comma-separated CIDR whitelist | `127.0.0.1/32` | | ||
| | `METRICS_DENY_CIDRS` | Comma-separated CIDR deny list (takes priority) | (empty) | | ||
| | `METRICS_CACHE_TTL` | `/metrics` cache TTL in seconds (`0` disables) | `30` | | ||
|
|
||
| ### Prometheus Metrics Added | ||
| | Metric | Type | Notes | | ||
| |--------|------|-------| | ||
| | `auth_login_rate_limited_total` | counter | Rate-limited login attempts | | ||
| | `jive_password_rehash_fail_breakdown_total{cause}` | counter | Split hash/update failures | | ||
| | `export_duration_buffered_seconds_*` | histogram | Export latency (buffered) | | ||
| | `export_duration_stream_seconds_*` | histogram | Export latency (stream) | | ||
| | `process_uptime_seconds` | gauge | Runtime age | | ||
|
|
||
| Deprecated (pending removal): `jive_password_rehash_fail_total` (aggregate). | ||
|
|
||
| ### Quick Local Verification | ||
| Run stack (example): | ||
| ```bash | ||
| ALLOW_PUBLIC_METRICS=1 AUTH_RATE_LIMIT=3/60 cargo run --bin jive-api & | ||
| sleep 2 | ||
| ./scripts/verify_observability.sh | ||
| ``` | ||
|
|
||
| Expect PASS output and non-zero counters for `auth_login_fail_total` after simulated attempts. | ||
|
|
||
| ### Reviewer Checklist | ||
| - [ ] 429 login response includes `Retry-After` and JSON structure | ||
| - [ ] `/metrics` reachable only when expected (toggle ALLOW_PUBLIC_METRICS) | ||
| - [ ] Rehash breakdown metrics appear | ||
| - [ ] Export histogram buckets present | ||
| - [ ] Uptime metric increasing across scrapes | ||
| - [ ] Security checklist file present (`docs/SECURITY_CHECKLIST.md`) | ||
|
|
||
| ### Follow-up (Optional / Tracked) | ||
| - Audit logging for repeated rate-limit triggers | ||
| - Global unified error response model | ||
| - Redis/distributed rate limiting for multi-instance scaling | ||
| - Remove deprecated rehash aggregate metric (target v1.3.0) | ||
|
|
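The summary mentions hashing the email part of the rate-limit key down to the first 8 hex characters of its SHA-256 digest. A minimal Python sketch of that idea is shown here. The helper name `rate_limit_key`, the `ip:email` key layout, and the trim/lowercase normalization are all assumptions for illustration; the PR text only confirms the "first 8 hex of SHA256" part.

```python
import hashlib


def rate_limit_key(ip: str, email: str, hash_email: bool = True) -> str:
    """Build a per-(IP, email) rate-limit key.

    Hypothetical helper: the "ip:email" layout and the normalization
    step are assumptions, not taken from the PR.
    """
    identity = email.strip().lower()
    if hash_email:
        digest = hashlib.sha256(identity.encode("utf-8")).hexdigest()
        identity = digest[:8]  # first 8 hex characters = 32 bits
    return f"{ip}:{identity}"
```

Truncating to 32 bits means two different emails can occasionally map to the same key. For rate limiting that only makes the limiter slightly stricter for the colliding pair, in exchange for not keeping raw emails in memory.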
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| ## Production Security Checklist | ||
|
|
||
| 1. Secrets | ||
|    - Set strong `JWT_SECRET` (>=32 random bytes). Never use dev default. | ||
| 2. Metrics Exposure | ||
|    - `ALLOW_PUBLIC_METRICS=0` | ||
|    - Restrict `METRICS_ALLOW_CIDRS` to monitoring network. | ||
| 3. Rate Limiting | ||
|    - Tune `AUTH_RATE_LIMIT` (e.g. 20/60 or 50/300 based on traffic). | ||
|    - Keep `AUTH_RATE_LIMIT_HASH_EMAIL=1` to avoid leaking raw emails in memory keys. | ||
| 4. TLS / Reverse Proxy | ||
|    - Terminate TLS at trusted proxy; strip untrusted `X-Forwarded-For`. | ||
| 5. Logging | ||
|    - Ensure logs exclude plaintext passwords/tokens. | ||
|    - Monitor `auth_login_rate_limited_total` + `auth_login_fail_total` anomalies. | ||
| 6. Password Migration | ||
|    - Track reduction of bcrypt via `password_hash_bcrypt_total` trend. | ||
|    - Investigate any spike in `jive_password_rehash_fail_breakdown_total{cause}`. | ||
| 7. Export Controls | ||
|    - Consider pagination/stream for large exports; watch P95 latency panels. | ||
| 8. Dependency Hygiene | ||
|    - Run `cargo deny` (already in CI) before release. | ||
| 9. Database | ||
|    - Use least-privilege DB role for API. | ||
| 10. Incident Response | ||
|    - Create alerts using `docs/ALERT_RULES_EXAMPLE.yaml` as baseline. | ||
|
|
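Item 2 of the checklist relies on the CIDR allow and deny lists for `/metrics`. The decision order can be sketched in Python as follows. The function names are hypothetical, and applying the deny list even when `ALLOW_PUBLIC_METRICS=1` is an assumption; the PR only states that deny takes priority and that the allow list takes effect when `ALLOW_PUBLIC_METRICS=0`.

```python
import ipaddress
from typing import List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_cidrs(raw: str) -> List[Network]:
    """Parse a comma-separated CIDR list, skipping empty entries."""
    return [
        ipaddress.ip_network(part.strip(), strict=False)
        for part in raw.split(",")
        if part.strip()
    ]


def metrics_allowed(client_ip: str, allow_public: bool,
                    allow_cidrs: str, deny_cidrs: str) -> bool:
    """Decide whether a client may read /metrics (illustrative only)."""
    addr = ipaddress.ip_address(client_ip)
    # Deny wins over everything else (assumed to hold in public mode too).
    if any(addr in net for net in parse_cidrs(deny_cidrs)):
        return False
    if allow_public:
        return True
    # Restricted mode: the client must match at least one allowed CIDR.
    return any(addr in net for net in parse_cidrs(allow_cidrs))
```

With the documented defaults in restricted mode (`METRICS_ALLOW_CIDRS=127.0.0.1/32`, empty deny list), only loopback clients would pass this check. Behind a reverse proxy, the address passed in should be the proxy-verified client IP rather than an untrusted `X-Forwarded-For` value, in line with item 4.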
Review comment (on the `password_hash_bcrypt_variant{variant="2b"} X` line): The 'X' in this line appears to be a placeholder for the metric's value. It should be removed or replaced with a more descriptive placeholder like `<value>` to avoid confusion for users reading the documentation.