
Conversation

@zensgit zensgit commented Sep 26, 2025

PR Security & Metrics Summary (Template)

Overview

This PR strengthens API security and observability. Copy & adapt sections below for the final PR description.

Key Changes

  • Login rate limiting (IP + email key) with structured 429 JSON and Retry-After header.
  • Metrics endpoint CIDR allow + deny lists (ALLOW_PUBLIC_METRICS=0, METRICS_ALLOW_CIDRS, METRICS_DENY_CIDRS).
  • Password rehash failure breakdown: jive_password_rehash_fail_breakdown_total{cause="hash"|"update"}.
  • Export performance histograms (buffered & streaming) and uptime metric.
  • New security / monitoring docs: Grafana dashboard, alert rules, security checklist.
  • Email-based rate limit key hashing (first 8 hex of SHA256) for privacy.
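
The CIDR allow/deny behavior above (deny takes precedence) can be sketched with a std-only IPv4 matcher. This is a minimal illustration, not the PR's middleware: `in_cidr` and `allowed` are hypothetical names, and the actual implementation also handles IPv6.

```rust
use std::net::Ipv4Addr;

/// Check whether `ip` falls inside a CIDR block like "10.0.0.0/8".
fn in_cidr(ip: Ipv4Addr, cidr: &str) -> bool {
    let Some((net, prefix)) = cidr.split_once('/') else { return false };
    let (Ok(net), Ok(prefix)) = (net.parse::<Ipv4Addr>(), prefix.parse::<u32>()) else {
        return false;
    };
    if prefix > 32 {
        return false;
    }
    let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    (u32::from(ip) & mask) == (u32::from(net) & mask)
}

/// The deny list is evaluated first, mirroring the "deny precedence" rule.
fn allowed(ip: Ipv4Addr, allow: &[&str], deny: &[&str]) -> bool {
    if deny.iter().any(|c| in_cidr(ip, c)) {
        return false;
    }
    allow.iter().any(|c| in_cidr(ip, c))
}

fn main() {
    let ip: Ipv4Addr = "10.0.0.5".parse().unwrap();
    // Denied even though the allow list matches, because deny wins.
    println!("{}", allowed(ip, &["10.0.0.0/8"], &["10.0.0.0/24"])); // prints "false"
}
```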

New / Modified Environment Variables

| Variable | Purpose | Default |
|---|---|---|
| `AUTH_RATE_LIMIT` | Login attempts per window (`N/SECONDS`) | `30/60` |
| `AUTH_RATE_LIMIT_HASH_EMAIL` | Hash email in key (privacy) | `1` |
| `ALLOW_PUBLIC_METRICS` | If `0`, restrict metrics by CIDR | `1` |
| `METRICS_ALLOW_CIDRS` | Comma-separated CIDR allow list | `127.0.0.1/32` |
| `METRICS_DENY_CIDRS` | Comma-separated CIDR deny list (takes precedence) | (empty) |
| `METRICS_CACHE_TTL` | Metrics base cache seconds | `30` |
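
The `N/SECONDS` format used by `AUTH_RATE_LIMIT` can be parsed with a small helper; `parse_rate_limit` is a hypothetical name used only to illustrate the format, not the PR's actual code.

```rust
/// Parse a rate-limit spec like "30/60" into (max attempts, window seconds).
fn parse_rate_limit(spec: &str) -> Option<(u32, u64)> {
    let (max, window) = spec.split_once('/')?;
    let max: u32 = max.trim().parse().ok()?;
    let window: u64 = window.trim().parse().ok()?;
    if max == 0 || window == 0 {
        return None; // a zero limit or zero window is meaningless
    }
    Some((max, window))
}

fn main() {
    assert_eq!(parse_rate_limit("30/60"), Some((30, 60)));
    assert_eq!(parse_rate_limit("3/60"), Some((3, 60)));
    assert_eq!(parse_rate_limit("garbage"), None);
    println!("ok"); // prints "ok"
}
```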

Prometheus Metrics Added

| Metric | Type | Notes |
|---|---|---|
| `auth_login_rate_limited_total` | counter | Rate-limited login attempts |
| `jive_password_rehash_fail_breakdown_total{cause}` | counter | Split hash/update failures |
| `export_duration_buffered_seconds_*` | histogram | Export latency (buffered) |
| `export_duration_stream_seconds_*` | histogram | Export latency (stream) |
| `process_uptime_seconds` | gauge | Runtime age |

Deprecated (pending removal): jive_password_rehash_fail_total (aggregate).

Quick Local Verification

Run stack (example):

```shell
ALLOW_PUBLIC_METRICS=1 AUTH_RATE_LIMIT=3/60 cargo run --bin jive-api &
sleep 2
./scripts/verify_observability.sh
```

Expect PASS output and non-zero counters for auth_login_fail_total after simulated attempts.

Reviewer Checklist

  • 429 login response includes Retry-After and JSON structure
  • /metrics reachable only when expected (toggle ALLOW_PUBLIC_METRICS)
  • Rehash breakdown metrics appear
  • Export histogram buckets present
  • Uptime metric increasing across scrapes
  • Security checklist file present (docs/SECURITY_CHECKLIST.md)

Follow-up (Optional / Tracked)

  • Audit logging for repeated rate-limit triggers
  • Global unified error response model
  • Redis/distributed rate limiting for multi-instance scaling
  • Remove deprecated rehash aggregate metric (target v1.3.0)

…il breakdown & observability assets

- Login rate limiting (IP + email hash key) with structured 429 JSON + Retry-After
- /metrics CIDR allow & deny (ALLOW_PUBLIC_METRICS, METRICS_ALLOW_CIDRS, METRICS_DENY_CIDRS)
- Password rehash failure breakdown metric (hash vs update) + deprecate aggregate
- Export latency histograms (buffered & stream) + process_uptime_seconds
- New counters: auth_login_rate_limited_total, rehash_fail_breakdown
- Security checklist, Grafana dashboard, Prometheus alert rules, PR summary template
- Observability verification script (scripts/verify_observability.sh)
- Opportunistic limiter cleanup; IPv6 CIDR precise matching; deny precedence

DEPRECATION: jive_password_rehash_fail_total slated removal v1.3.0 (use breakdown).
@Copilot Copilot AI review requested due to automatic review settings September 26, 2025 05:31
@gemini-code-assist

Summary of Changes

Hello @zensgit, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the API's security and observability. It introduces robust login rate limiting to prevent brute-force attacks, implements granular access control for the metrics endpoint, and expands the suite of Prometheus metrics to provide deeper insights into authentication outcomes, password rehash processes, and data export performance. The changes are complemented by new documentation and utility scripts to streamline monitoring setup and security best practices.

Highlights

  • Login Rate Limiting: Implemented login rate limiting based on IP and hashed email, returning a structured 429 JSON response with a 'Retry-After' header for enhanced security.
  • Metrics Endpoint Access Control: Introduced CIDR-based allow and deny lists for the '/metrics' endpoint, configurable via environment variables, to restrict access to monitoring infrastructure.
  • Enhanced Password Rehash Metrics: Added a new metric, 'jive_password_rehash_fail_breakdown_total', to provide a detailed breakdown of password rehash failures by cause (e.g., hash generation or database update issues).
  • Export Performance Metrics: Integrated Prometheus histograms ('export_duration_buffered_seconds_' and 'export_duration_stream_seconds_') to track the latency of both buffered and streaming data export operations, along with counters for requests and rows exported.
  • New Observability Documentation: Added comprehensive documentation including a Grafana dashboard template, example Prometheus alert rules, a security checklist, and a metrics deprecation plan to guide monitoring and operational practices.
  • Build Information Metric: Introduced a 'jive_build_info' gauge metric that exposes build-time details such as Git commit, build timestamp, Rust compiler version, and package version, aiding in debugging and version tracking.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@zensgit zensgit merged commit 2e6a0dd into main Sep 26, 2025
6 of 8 checks passed
@zensgit zensgit deleted the feat/security-metrics-observability branch September 26, 2025 05:32
Copilot AI left a comment

Pull Request Overview

This PR enhances the Jive Money API with comprehensive security and observability features. It introduces login rate limiting with email-based keying, restricts metrics endpoint access via CIDR lists, expands metrics coverage with breakdown counters and performance histograms, and adds extensive documentation for monitoring and security best practices.

Key Changes

  • Login rate limiting with IP+email key combination and structured 429 responses
  • Metrics endpoint access control using CIDR allow/deny lists
  • Enhanced metrics including rehash failure breakdown, export performance histograms, and authentication counters

Reviewed Changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
|---|---|
| `scripts/verify_observability.sh` | Test script for validating core metrics presence |
| `scripts/check_metrics_consistency.sh` | Verification script for health vs metrics consistency |
| `jive-api/tests/integration/*.rs` | Integration tests for rate limiting, metrics, and export functionality |
| `jive-api/src/middleware/rate_limit.rs` | Complete rewrite implementing email-based rate limiting |
| `jive-api/src/middleware/metrics_guard.rs` | New CIDR-based access control for metrics endpoint |
| `jive-api/src/metrics.rs` | Expanded metrics with caching, histograms, and build info |
| `jive-api/src/main.rs` | Integration of rate limiting and metrics guard middleware |
| `jive-api/src/lib.rs` | Extended AppMetrics with new counters and histogram fields |
| `jive-api/src/handlers/transactions.rs` | Added export performance metrics tracking |
| `jive-api/src/handlers/auth.rs` | Added login failure and rate limiting metrics |
| `jive-api/build.rs` | Build script for capturing git commit and build metadata |
| `docs/*.md` | Comprehensive security, monitoring, and deprecation documentation |
| `README.md` | Updated with metrics documentation and environment variables |
| `Makefile` | Added metrics verification and test user seeding targets |


```rust
        records: Arc::new(RwLock::new(HashMap::new())),
    }

    pub fn new(max: u32, window_secs: u64) -> Self {
        let hash_email = std::env::var("AUTH_RATE_LIMIT_HASH_EMAIL")
            .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
            .unwrap_or(true);
```
Copilot AI Sep 26, 2025

[nitpick] The environment variable parsing logic is duplicated and could be extracted into a helper function for consistency across the codebase. Consider creating a utility function for parsing boolean environment variables.

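A minimal version of the helper this comment suggests might look like the following; the name `env_bool` and the `default` parameter are assumptions for illustration, not code from this PR.

```rust
/// Parse a boolean environment variable: a set value of "1" or "true"
/// (case-insensitive) is true, any other set value is false, and an
/// unset variable falls back to `default`.
fn env_bool(name: &str, default: bool) -> bool {
    std::env::var(name)
        .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
        .unwrap_or(default)
}

fn main() {
    // Unset variables fall back to the supplied default.
    println!("{}", env_bool("AUTH_RATE_LIMIT_HASH_EMAIL", true));
}
```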
Comment on lines +24 to 28:

```rust
// Opportunistic cleanup if map large
if map.len() > 10_000 {
    let window = self.window;
    map.retain(|_, (_c, start)| now.duration_since(*start) <= window);
}
```
Copilot AI Sep 26, 2025

The magic number 10,000 should be extracted as a named constant to improve maintainability and make it configurable if needed.

```rust
let norm = raw.trim().to_lowercase();
if norm.is_empty() { return None; }
if !hash { return Some(norm); }
let mut h = Sha256::new(); h.update(&norm); let hex = format!("{:x}", h.finalize());
```
Copilot AI Sep 26, 2025

[nitpick] This complex line should be split into multiple statements for better readability and debugging.

Suggested change:

```diff
-let mut h = Sha256::new(); h.update(&norm); let hex = format!("{:x}", h.finalize());
+let mut h = Sha256::new();
+h.update(&norm);
+let hex = format!("{:x}", h.finalize());
```

```diff
             family_id,
             // Handle potentially nullable base_currency field
-            base_currency: settings.base_currency.unwrap_or_else(|| "CNY".to_string()),
+            base_currency: if settings.base_currency.is_empty() { "CNY".to_string() } else { settings.base_currency },
```
Copilot AI Sep 26, 2025

The logic change from unwrap_or_else to is_empty() check alters the behavior. This will panic if base_currency is None, whereas the original code provided a default. The logic should handle the Option type properly.

Suggested change:

```diff
-base_currency: if settings.base_currency.is_empty() { "CNY".to_string() } else { settings.base_currency },
+base_currency: settings
+    .base_currency
+    .filter(|s| !s.is_empty())
+    .unwrap_or_else(|| "CNY".to_string()),
```

Comment on lines +12 to +17:

```rust
#[tokio::test]
async fn login_rate_limit_blocks_after_threshold() {
    let pool = create_test_pool().await;
    // Seed a user so we can attempt logins (with wrong password to avoid side effects)
    let email = format!("rl_{}@example.com", Uuid::new_v4());
    sqlx::query("INSERT INTO users (email,password_hash,name,is_active,created_at,updated_at) VALUES ($1,'$argon2id$v=19$m=4096,t=3,p=1$dGVzdHNhbHQAAAAAAAAAAA$Jr7Z5fakehashHashHashHashHashHash','RL User',true,NOW(),NOW())")
```
Copilot AI Sep 26, 2025


[nitpick] The hardcoded hash string is very long and makes the code hard to read. Consider extracting it to a constant or using a test fixture helper function.

Suggested change:

```diff
-#[tokio::test]
-async fn login_rate_limit_blocks_after_threshold() {
-    let pool = create_test_pool().await;
-    // Seed a user so we can attempt logins (with wrong password to avoid side effects)
-    let email = format!("rl_{}@example.com", Uuid::new_v4());
-    sqlx::query("INSERT INTO users (email,password_hash,name,is_active,created_at,updated_at) VALUES ($1,'$argon2id$v=19$m=4096,t=3,p=1$dGVzdHNhbHQAAAAAAAAAAA$Jr7Z5fakehashHashHashHashHashHash','RL User',true,NOW(),NOW())")
+const TEST_PASSWORD_HASH: &str = "$argon2id$v=19$m=4096,t=3,p=1$dGVzdHNhbHQAAAAAAAAAAA$Jr7Z5fakehashHashHashHashHashHash";
+
+#[tokio::test]
+async fn login_rate_limit_blocks_after_threshold() {
+    let pool = create_test_pool().await;
+    // Seed a user so we can attempt logins (with wrong password to avoid side effects)
+    let email = format!("rl_{}@example.com", Uuid::new_v4());
+    sqlx::query(&format!(
+        "INSERT INTO users (email,password_hash,name,is_active,created_at,updated_at) VALUES ($1,'{}','RL User',true,NOW(),NOW())",
+        TEST_PASSWORD_HASH
+    ))
```
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the API's security and observability by introducing login rate limiting, a metrics endpoint with CIDR-based access control, and a rich set of new Prometheus metrics. The changes are well-structured, including comprehensive documentation, new tests, and utility scripts. Overall, this is a high-quality contribution. I've identified a few issues, primarily related to metric correctness and documentation consistency, and have provided suggestions for improvement.

```rust
        println!("DEBUG[login]: failed to parse Argon2 hash: {:?}", e);
        ApiError::InternalServerError
    })?;
let parsed_hash = PasswordHash::new(hash).map_err(|e| { #[cfg(debug_assertions)] println!("DEBUG[login]: failed to parse Argon2 hash: {:?}", e); state.metrics.increment_login_fail(); ApiError::InternalServerError })?;
```


high

Incrementing auth_login_fail_total when a password hash fails to parse is incorrect. This metric should track user-attributable login failures (like wrong password), not internal server errors. A hash parsing failure indicates a data corruption issue or a bug, and mixing it with failed login attempts will pollute the metric and make it harder to distinguish between security events and system errors. Please remove the metric increment from this error handling path.

Suggested change:

```diff
-let parsed_hash = PasswordHash::new(hash).map_err(|e| { #[cfg(debug_assertions)] println!("DEBUG[login]: failed to parse Argon2 hash: {:?}", e); state.metrics.increment_login_fail(); ApiError::InternalServerError })?;
+let parsed_hash = PasswordHash::new(hash).map_err(|e| { #[cfg(debug_assertions)] println!("DEBUG[login]: failed to parse Argon2 hash: {:?}", e); ApiError::InternalServerError })?;
```

```rust
let argon2 = Argon2::default();
argon2
    .verify_password(req.password.as_bytes(), &parsed_hash)
    .map_err(|_| ApiError::Unauthorized)?;
```


high

The auth_login_fail_total metric is not being incremented here when Argon2 password verification fails. This is inconsistent with other password verification failure paths in this function. To ensure all failed login attempts are correctly tracked, please add the metric increment.

Suggested change:

```diff
-    .map_err(|_| ApiError::Unauthorized)?;
+    .map_err(|_| { state.metrics.increment_login_fail(); ApiError::Unauthorized })?;
```

```
password_hash_argon2id_total                   # argon2id count
password_hash_unknown_total                    # unrecognized prefix
password_hash_total_count                      # total count
password_hash_bcrypt_variant{variant="2b"} X   # per variant
```


medium

The 'X' in this line appears to be a placeholder for the metric's value. It should be removed or replaced with a more descriptive placeholder like <value> to avoid confusion for users reading the documentation.

```
export_duration_buffered_seconds_*             # buffered export duration histogram (bucket/sum/count)
export_duration_stream_seconds_*               # streaming export duration histogram (bucket/sum/count)
process_uptime_seconds                         # process uptime in seconds
jive_build_info{commit,time,rustc,version} 1   # build info gauge
```


medium

This line is a duplicate of line 214. To improve clarity and reduce redundancy in the documentation, please remove this repeated entry for the jive_build_info metric.

```yaml
    annotations:
      summary: Many logins being rate-limited (possible attack)
  - alert: ProcessRestarted
    expr: increase(process_uptime_seconds[5m]) < 60
```


medium

Using increase() on a gauge like process_uptime_seconds is not idiomatic in PromQL and can be confusing. A simpler and more direct way to detect a recent restart is to check if the uptime is less than your scrape interval window.

For example, to detect a restart within the last 5 minutes (300 seconds), you could use:

```
process_uptime_seconds < 300
```

This is more readable and directly expresses the condition you want to alert on.

Suggested change:

```yaml
        expr: process_uptime_seconds < 300
```

Comment on lines +42 to +44:

| Metric | Description | Status |
|---|---|---|
| `auth_login_fail_total` | Count failed login attempts (unauthorized) | Planned |
| `export_duration_seconds` (histogram) | Latency of export operations | Planned |
| `process_uptime_seconds` | Seconds since process start | Implemented |


medium

The 'Future Candidates' table seems to be out of sync with the changes in this PR. Several metrics listed as 'Planned' or 'Implemented' are now fully available.

  • auth_login_fail_total is implemented, not 'Planned'.
  • export_duration_seconds is implemented as export_duration_buffered_seconds and export_duration_stream_seconds.

To ensure the documentation accurately reflects the current state of the project, please update this table to mark these metrics as 'Implemented' and adjust their descriptions accordingly.

```rust
Err(e) => {
    tracing::warn!(user_id=%user.id, error=?e, "failed to generate Argon2id hash")
}
Err(e) => { tracing::warn!(user_id=%user.id, error=?e, "failed to generate Argon2id hash"); state.metrics.increment_rehash_fail(); state.metrics.inc_rehash_fail_hash(); }
```

medium

While one-liners can be compact, this line is quite long and contains multiple statements, which harms readability and makes debugging more difficult. For better maintainability, please format this closure over multiple lines.

```rust
                Err(e) => {
                    tracing::warn!(user_id=%user.id, error=?e, "failed to generate Argon2id hash");
                    state.metrics.increment_rehash_fail();
                    state.metrics.inc_rehash_fail_hash();
                }
```

```rust
"jive_build_info{{commit=\"{}\",time=\"{}\",rustc=\"{}\",version=\"{}\"}} 1\n",
build_commit,
build_time,
rustc_version.replace('"', "'"),
```


medium

Replacing " with ' is not a standard way to escape label values in Prometheus. The specification requires backslashes, double quotes, and newlines to be escaped with a backslash (e.g., \, \", \n). While your current approach might work if the rustc version string doesn't contain single quotes, it's safer to adhere to the standard to prevent generating invalid metrics output.

Suggested change:

```diff
-rustc_version.replace('"', "'"),
+rustc_version.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n"),
```
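
As a standalone sketch of the spec-compliant escaping described above (helper name assumed, not this PR's code):

```rust
/// Escape a Prometheus label value per the text exposition format:
/// backslash, double quote, and newline must be backslash-escaped.
/// Order matters: escape backslashes first.
fn escape_label_value(v: &str) -> String {
    v.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

fn main() {
    assert_eq!(escape_label_value("a\"b"), "a\\\"b");
    assert_eq!(escape_label_value("line1\nline2"), "line1\\nline2");
    println!("ok"); // prints "ok"
}
```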

Comment on lines +11 to +18:

```rust
#[tokio::test]
async fn metrics_v6_allowed_when_public() {
    std::env::remove_var("ALLOW_PUBLIC_METRICS");
    let dummy_pool = PgPool::connect_lazy("postgresql://ignored").unwrap_err();
    // Skip full state since test only checks routing; create minimal state is complex, so we just assert handler builds.
    // This test is a placeholder; full integration would need real AppState. Here we simply ensure no panic.
    assert!(true);
}
```


medium

This test is currently a placeholder that only asserts true and doesn't perform any actual verification. To make it useful, it should at least test that the application can be built with the metrics guard middleware and that the /metrics route is accessible when the guard is disabled. A more complete test would involve mocking ConnectInfo to test the CIDR filtering logic.

@zensgit zensgit mentioned this pull request Oct 15, 2025
