- 
                Notifications
    You must be signed in to change notification settings 
- Fork 0
feat(api): security+metrics enhancements #50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…il breakdown & observability assets - Login rate limiting (IP + email hash key) with structured 429 JSON + Retry-After - /metrics CIDR allow & deny (ALLOW_PUBLIC_METRICS, METRICS_ALLOW_CIDRS, METRICS_DENY_CIDRS) - Password rehash failure breakdown metric (hash vs update) + deprecate aggregate - Export latency histograms (buffered & stream) + process_uptime_seconds - New counters: auth_login_rate_limited_total, rehash_fail_breakdown - Security checklist, Grafana dashboard, Prometheus alert rules, PR summary template - Observability verification script (scripts/verify_observability.sh) - Opportunistic limiter cleanup; IPv6 CIDR precise matching; deny precedence DEPRECATION: jive_password_rehash_fail_total slated removal v1.3.0 (use breakdown).
| Summary of ChangesHello @zensgit, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the API's security and observability. It introduces robust login rate limiting to prevent brute-force attacks, implements granular access control for the metrics endpoint, and expands the suite of Prometheus metrics to provide deeper insights into authentication outcomes, password rehash processes, and data export performance. The changes are complemented by new documentation and utility scripts to streamline monitoring setup and security best practices. Highlights
 Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either  
 Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a  Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the Jive Money API with comprehensive security and observability features. It introduces login rate limiting with email-based keying, restricts metrics endpoint access via CIDR lists, expands metrics coverage with breakdown counters and performance histograms, and adds extensive documentation for monitoring and security best practices.
Key Changes
- Login rate limiting with IP+email key combination and structured 429 responses
- Metrics endpoint access control using CIDR allow/deny lists
- Enhanced metrics including rehash failure breakdown, export performance histograms, and authentication counters
Reviewed Changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description | 
|---|---|
| scripts/verify_observability.sh | Test script for validating core metrics presence | 
| scripts/check_metrics_consistency.sh | Verification script for health vs metrics consistency | 
| jive-api/tests/integration/*.rs | Integration tests for rate limiting, metrics, and export functionality | 
| jive-api/src/middleware/rate_limit.rs | Complete rewrite implementing email-based rate limiting | 
| jive-api/src/middleware/metrics_guard.rs | New CIDR-based access control for metrics endpoint | 
| jive-api/src/metrics.rs | Expanded metrics with caching, histograms, and build info | 
| jive-api/src/main.rs | Integration of rate limiting and metrics guard middleware | 
| jive-api/src/lib.rs | Extended AppMetrics with new counters and histogram fields | 
| jive-api/src/handlers/transactions.rs | Added export performance metrics tracking | 
| jive-api/src/handlers/auth.rs | Added login failure and rate limiting metrics | 
| jive-api/build.rs | Build script for capturing git commit and build metadata | 
| docs/*.md | Comprehensive security, monitoring, and deprecation documentation | 
| README.md | Updated with metrics documentation and environment variables | 
| Makefile | Added metrics verification and test user seeding targets | 
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| records: Arc::new(RwLock::new(HashMap::new())), | ||
| } | ||
| pub fn new(max: u32, window_secs: u64) -> Self { | ||
| let hash_email = std::env::var("AUTH_RATE_LIMIT_HASH_EMAIL").map(|v| v=="1" || v.eq_ignore_ascii_case("true")).unwrap_or(true); | 
    
      
    
      Copilot
AI
    
    
    
      Sep 26, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The environment variable parsing logic is duplicated and could be extracted into a helper function for consistency across the codebase. Consider creating a utility function for parsing boolean environment variables.
| // Opportunistic cleanup if map large | ||
| if map.len() > 10_000 { | ||
| let window = self.window; | ||
| map.retain(|_, (_c, start)| now.duration_since(*start) <= window); | ||
| } | 
    
      
    
      Copilot
AI
    
    
    
      Sep 26, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic number 10,000 should be extracted as a named constant to improve maintainability and make it configurable if needed.
| let norm = raw.trim().to_lowercase(); | ||
| if norm.is_empty() { return None; } | ||
| if !hash { return Some(norm); } | ||
| let mut h = Sha256::new(); h.update(&norm); let hex = format!("{:x}", h.finalize()); | 
    
      
    
      Copilot
AI
    
    
    
      Sep 26, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] This complex line should be split into multiple statements for better readability and debugging.
| let mut h = Sha256::new(); h.update(&norm); let hex = format!("{:x}", h.finalize()); | |
| let mut h = Sha256::new(); | |
| h.update(&norm); | |
| let hex = format!("{:x}", h.finalize()); | 
| family_id, | ||
| // Handle potentially nullable base_currency field | ||
| base_currency: settings.base_currency.unwrap_or_else(|| "CNY".to_string()), | ||
| base_currency: if settings.base_currency.is_empty() { "CNY".to_string() } else { settings.base_currency }, | 
    
      
    
      Copilot
AI
    
    
    
      Sep 26, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic change from unwrap_or_else to is_empty() check alters the behavior. This will panic if base_currency is None, whereas the original code provided a default. The logic should handle the Option type properly.
| base_currency: if settings.base_currency.is_empty() { "CNY".to_string() } else { settings.base_currency }, | |
| base_currency: settings | |
| .base_currency | |
| .filter(|s| !s.is_empty()) | |
| .unwrap_or_else(|| "CNY".to_string()), | 
| #[tokio::test] | ||
| async fn login_rate_limit_blocks_after_threshold() { | ||
| let pool = create_test_pool().await; | ||
| // Seed a user so we can attempt logins (with wrong password to avoid side effects) | ||
| let email = format!("rl_{}@example.com", Uuid::new_v4()); | ||
| sqlx::query("INSERT INTO users (email,password_hash,name,is_active,created_at,updated_at) VALUES ($1,'$argon2id$v=19$m=4096,t=3,p=1$dGVzdHNhbHQAAAAAAAAAAA$Jr7Z5fakehashHashHashHashHashHash','RL User',true,NOW(),NOW())") | 
    
      
    
      Copilot
AI
    
    
    
      Sep 26, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The hardcoded hash string is very long and makes the code hard to read. Consider extracting it to a constant or using a test fixture helper function.
| #[tokio::test] | |
| async fn login_rate_limit_blocks_after_threshold() { | |
| let pool = create_test_pool().await; | |
| // Seed a user so we can attempt logins (with wrong password to avoid side effects) | |
| let email = format!("rl_{}@example.com", Uuid::new_v4()); | |
| sqlx::query("INSERT INTO users (email,password_hash,name,is_active,created_at,updated_at) VALUES ($1,'$argon2id$v=19$m=4096,t=3,p=1$dGVzdHNhbHQAAAAAAAAAAA$Jr7Z5fakehashHashHashHashHashHash','RL User',true,NOW(),NOW())") | |
| const TEST_PASSWORD_HASH: &str = "$argon2id$v=19$m=4096,t=3,p=1$dGVzdHNhbHQAAAAAAAAAAA$Jr7Z5fakehashHashHashHashHashHash"; | |
| #[tokio::test] | |
| async fn login_rate_limit_blocks_after_threshold() { | |
| let pool = create_test_pool().await; | |
| // Seed a user so we can attempt logins (with wrong password to avoid side effects) | |
| let email = format!("rl_{}@example.com", Uuid::new_v4()); | |
| sqlx::query(&format!( | |
| "INSERT INTO users (email,password_hash,name,is_active,created_at,updated_at) VALUES ($1,'{}','RL User',true,NOW(),NOW())", | |
| TEST_PASSWORD_HASH | |
| )) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request significantly enhances the API's security and observability by introducing login rate limiting, a metrics endpoint with CIDR-based access control, and a rich set of new Prometheus metrics. The changes are well-structured, including comprehensive documentation, new tests, and utility scripts. Overall, this is a high-quality contribution. I've identified a few issues, primarily related to metric correctness and documentation consistency, and have provided suggestions for improvement.
| println!("DEBUG[login]: failed to parse Argon2 hash: {:?}", e); | ||
| ApiError::InternalServerError | ||
| })?; | ||
| let parsed_hash = PasswordHash::new(hash).map_err(|e| { #[cfg(debug_assertions)] println!("DEBUG[login]: failed to parse Argon2 hash: {:?}", e); state.metrics.increment_login_fail(); ApiError::InternalServerError })?; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incrementing auth_login_fail_total when a password hash fails to parse is incorrect. This metric should track user-attributable login failures (like wrong password), not internal server errors. A hash parsing failure indicates a data corruption issue or a bug, and mixing it with failed login attempts will pollute the metric and make it harder to distinguish between security events and system errors. Please remove the metric increment from this error handling path.
| let parsed_hash = PasswordHash::new(hash).map_err(|e| { #[cfg(debug_assertions)] println!("DEBUG[login]: failed to parse Argon2 hash: {:?}", e); state.metrics.increment_login_fail(); ApiError::InternalServerError })?; | |
| let parsed_hash = PasswordHash::new(hash).map_err(|e| { #[cfg(debug_assertions)] println!("DEBUG[login]: failed to parse Argon2 hash: {:?}", e); ApiError::InternalServerError })?; | 
| let argon2 = Argon2::default(); | ||
| argon2 | ||
| .verify_password(req.password.as_bytes(), &parsed_hash) | ||
| .map_err(|_| ApiError::Unauthorized)?; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The auth_login_fail_total metric is not being incremented here when Argon2 password verification fails. This is inconsistent with other password verification failure paths in this function. To ensure all failed login attempts are correctly tracked, please add the metric increment.
| .map_err(|_| ApiError::Unauthorized)?; | |
| .map_err(|_| { state.metrics.increment_login_fail(); ApiError::Unauthorized })?; | 
| password_hash_argon2id_total # argon2id 数量 | ||
| password_hash_unknown_total # 未识别前缀 | ||
| password_hash_total_count # 总数 | ||
| password_hash_bcrypt_variant{variant="2b"} X # 每个变体 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| export_duration_buffered_seconds_* # 缓冲导出耗时直方图 (bucket/sum/count) | ||
| export_duration_stream_seconds_* # 流式导出耗时直方图 (bucket/sum/count) | ||
| process_uptime_seconds # 进程运行时长(秒) | ||
| jive_build_info{commit,time,rustc,version} 1 # 构建信息 gauge | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| annotations: | ||
| summary: Many logins being rate-limited (possible attack) | ||
| - alert: ProcessRestarted | ||
| expr: increase(process_uptime_seconds[5m]) < 60 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using increase() on a gauge like process_uptime_seconds is not idiomatic in PromQL and can be confusing. A simpler and more direct way to detect a recent restart is to check if the uptime is less than your scrape interval window.
For example, to detect a restart within the last 5 minutes (300 seconds), you could use:
process_uptime_seconds < 300This is more readable and directly expresses the condition you want to alert on.
        expr: process_uptime_seconds < 300| | `auth_login_fail_total` | Count failed login attempts (unauthorized) | Planned | | ||
| | `export_duration_seconds` (histogram) | Latency of export operations | Planned | | ||
| | `process_uptime_seconds` | Seconds since process start | Implemented | | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'Future Candidates' table seems to be out of sync with the changes in this PR. Several metrics listed as 'Planned' or 'Implemented' are now fully available.
- auth_login_fail_totalis implemented, not 'Planned'.
- export_duration_secondsis implemented as- export_duration_buffered_secondsand- export_duration_stream_seconds.
To ensure the documentation accurately reflects the current state of the project, please update this table to mark these metrics as 'Implemented' and adjust their descriptions accordingly.
| Err(e) => { | ||
| tracing::warn!(user_id=%user.id, error=?e, "failed to generate Argon2id hash") | ||
| } | ||
| Err(e) => { tracing::warn!(user_id=%user.id, error=?e, "failed to generate Argon2id hash"); state.metrics.increment_rehash_fail(); state.metrics.inc_rehash_fail_hash(); } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While one-liners can be compact, this line is quite long and contains multiple statements, which harms readability and makes debugging more difficult. For better maintainability, please format this closure over multiple lines.
                Err(e) => {
                    tracing::warn!(user_id=%user.id, error=?e, "failed to generate Argon2id hash");
                    state.metrics.increment_rehash_fail();
                    state.metrics.inc_rehash_fail_hash();
                }| "jive_build_info{{commit=\"{}\",time=\"{}\",rustc=\"{}\",version=\"{}\"}} 1\n", | ||
| build_commit, | ||
| build_time, | ||
| rustc_version.replace('"', "'"), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacing " with ' is not a standard way to escape label values in Prometheus. The specification requires backslashes, double quotes, and newlines to be escaped with a backslash (e.g., \, \", \n). While your current approach might work if the rustc version string doesn't contain single quotes, it's safer to adhere to the standard to prevent generating invalid metrics output.
| rustc_version.replace('"', "'"), | |
| rustc_version.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n"), | 
| #[tokio::test] | ||
| async fn metrics_v6_allowed_when_public() { | ||
| std::env::remove_var("ALLOW_PUBLIC_METRICS"); | ||
| let dummy_pool = PgPool::connect_lazy("postgresql://ignored").unwrap_err(); | ||
| // Skip full state since test only checks routing; create minimal state is complex, so we just assert handler builds. | ||
| // This test is a placeholder; full integration would need real AppState. Here we simply ensure no panic. | ||
| assert!(true); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is currently a placeholder that only asserts true and doesn't perform any actual verification. To make it useful, it should at least test that the application can be built with the metrics guard middleware and that the /metrics route is accessible when the guard is disabled. A more complete test would involve mocking ConnectInfo to test the CIDR filtering logic.
PR Security & Metrics Summary (Template)
Overview
This PR strengthens API security and observability. Copy & adapt sections below for the final PR description.
Key Changes
Retry-Afterheader.ALLOW_PUBLIC_METRICS=0,METRICS_ALLOW_CIDRS,METRICS_DENY_CIDRS).jive_password_rehash_fail_breakdown_total{cause="hash"|"update"}.New / Modified Environment Variables
AUTH_RATE_LIMIT30/60AUTH_RATE_LIMIT_HASH_EMAIL1ALLOW_PUBLIC_METRICS0, restrict metrics by CIDR1METRICS_ALLOW_CIDRS127.0.0.1/32METRICS_DENY_CIDRSMETRICS_CACHE_TTL30Prometheus Metrics Added
auth_login_rate_limited_totaljive_password_rehash_fail_breakdown_total{cause}export_duration_buffered_seconds_*export_duration_stream_seconds_*process_uptime_secondsDeprecated (pending removal):
jive_password_rehash_fail_total(aggregate).Quick Local Verification
Run stack (example):
ALLOW_PUBLIC_METRICS=1 AUTH_RATE_LIMIT=3/60 cargo run --bin jive-api & sleep 2 ./scripts/verify_observability.shExpect PASS output and non-zero counters for
auth_login_fail_totalafter simulated attempts.Reviewer Checklist
Retry-Afterand JSON structure/metricsreachable only when expected (toggle ALLOW_PUBLIC_METRICS)docs/SECURITY_CHECKLIST.md)Follow-up (Optional / Tracked)