
Enable LRU eviction policy in ObjectCache #367

Merged
k82cn merged 1 commit into xflops:main from k82cn:flm_366 on Mar 10, 2026

Conversation


@k82cn k82cn commented Mar 10, 2026

fix #366

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical memory management issue in the ObjectCache by introducing a configurable LRU eviction policy. Previously, the cache would grow indefinitely, leading to potential system instability. The new policy intelligently removes least recently used objects from memory when configured limits are reached, while ensuring they remain persistently stored on disk and can be reloaded on demand. This change significantly improves the cache's stability and resource efficiency without altering its external API.

Highlights

  • LRU Eviction Policy Implementation: Implemented a Least Recently Used (LRU) eviction policy for the ObjectCache to manage in-memory object storage, preventing unbounded memory growth and potential Out-of-Memory (OOM) conditions. This includes a new EvictionPolicy trait and a LRUPolicy implementation.
  • Configurable Cache Limits: Introduced new configuration options for the ObjectCache under the cache.eviction section in flame-cluster.yaml, allowing users to specify policy (lru or none), max_memory (e.g., "1G", "512M"), and max_objects.
  • Memory Size Parsing with Binary Units: Added utility functions and integrated the bytesize crate to parse human-readable memory size strings (e.g., "1G", "512M") into bytes, ensuring that common SI unit suffixes (G, M, K) are interpreted as binary units (GiB, MiB, KiB) for accurate memory management.
  • Transparent Disk Reloading: Enhanced the ObjectCache to transparently reload evicted objects from disk when they are accessed, ensuring data availability even after being removed from memory.
  • Comprehensive Testing and Documentation: Included a detailed design document (docs/designs/RFE004-lru-policy/FS.md) outlining the motivation, specification, and implementation of the LRU policy, along with new end-to-end tests (e2e/tests/test_cache_lru.py) to validate the eviction behavior and data integrity.
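
The transparent reload described in the highlights can be pictured with a small sketch. It is not the PR's code: `TinyCache`, its fields, and the `.bin` file layout are hypothetical, and keys are assumed to be already validated. It only illustrates the split between resident objects and metadata that survives eviction.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical two-level cache: `objects` holds resident data, `metadata`
/// knows every object, whether resident or evicted.
struct TinyCache {
    dir: PathBuf,
    objects: HashMap<String, Vec<u8>>,
    metadata: HashMap<String, u64>, // key -> size in bytes
}

impl TinyCache {
    fn new(dir: PathBuf) -> io::Result<Self> {
        fs::create_dir_all(&dir)?;
        Ok(Self { dir, objects: HashMap::new(), metadata: HashMap::new() })
    }

    /// Assumption: `key` is a pre-validated single path component.
    fn file_for(&self, key: &str) -> PathBuf {
        self.dir.join(format!("{}.bin", key))
    }

    fn put(&mut self, key: &str, data: Vec<u8>) -> io::Result<()> {
        fs::write(self.file_for(key), &data)?; // persist before caching
        self.metadata.insert(key.to_string(), data.len() as u64);
        self.objects.insert(key.to_string(), data);
        Ok(())
    }

    /// Drops only the in-memory copy; the file and metadata entry remain.
    fn evict(&mut self, key: &str) {
        self.objects.remove(key);
    }

    fn get(&mut self, key: &str) -> io::Result<Vec<u8>> {
        if let Some(data) = self.objects.get(key) {
            return Ok(data.clone()); // memory hit
        }
        if !self.metadata.contains_key(key) {
            return Err(io::Error::new(io::ErrorKind::NotFound, "unknown object"));
        }
        let data = fs::read(self.file_for(key))?; // reload after eviction
        self.objects.insert(key.to_string(), data.clone());
        Ok(data)
    }
}
```

In this sketch a `get` after `evict` returns the same bytes as before and makes the object resident again; a real cache would also notify the eviction policy at that point.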

Changelog

  • Cargo.lock
    • Updated dependencies to include the bytesize crate.
  • common/Cargo.toml
    • Added bytesize as a dependency.
  • common/src/ctx.rs
    • Imported bytesize::ByteSize for memory unit parsing.
    • Defined DEFAULT_EVICTION_POLICY and DEFAULT_MAX_MEMORY constants.
    • Introduced FlameEvictionYaml and FlameEviction structs to model eviction configuration.
    • Implemented Default for FlameEviction to provide sensible defaults.
    • Added convert_to_binary_units function to standardize memory unit parsing to binary (IEC) units.
    • Implemented parse_memory_size function to convert memory strings (e.g., "1G") into u64 bytes.
    • Modified FlameCache struct to include the new eviction configuration.
    • Implemented TryFrom<FlameEvictionYaml> for FlameEviction to facilitate configuration parsing.
    • Added extensive unit tests for FlameClusterContext with various cache eviction configurations and memory size parsing scenarios.
  • docs/designs/RFE004-lru-policy/FS.md
    • Added a new design document detailing the LRU Eviction Policy for ObjectCache, covering motivation, function specification, implementation details, use cases, and references.
  • e2e/tests/test_cache_lru.py
    • Added new end-to-end tests to verify the functionality of the LRU eviction policy, including basic eviction, LRU order maintenance, reloading evicted objects, multiple access updates, handling of single large and many small objects, object updates, concurrent session isolation, and data integrity during reload.
  • installer/flame-cluster.yaml
    • Updated the example configuration file to include the new cache.eviction section with policy, max_memory, and max_objects settings.
  • object_cache/Cargo.toml
    • Added bytesize as a dependency.
  • object_cache/src/cache.rs
    • Imported bytesize::ByteSize and modules from eviction.
    • Defined EVICTION_BATCH_SIZE constant for batch eviction operations.
    • Refactored ObjectCache to separate in-memory objects (actual data) from metadata (all known objects, whether in memory or on disk).
    • Integrated eviction_policy into the ObjectCache struct.
    • Updated ObjectCache::new to accept and initialize the eviction policy.
    • Modified load_session_objects to populate both objects and metadata maps and notify the eviction policy.
    • Introduced run_eviction method to trigger eviction based on policy limits.
    • Added load_object_from_disk_internal helper for internal disk loading.
    • Updated put method to track object size, notify the eviction policy, and run eviction after adding an object.
    • Refactored load_object_from_disk to use the new internal loading helper.
    • Modified try_load_and_index to load objects into memory, update both maps, and notify the eviction policy.
    • Revised get method to first check in-memory objects, then reload from disk if an object is found in metadata but not in memory, and notify the eviction policy on access and add.
    • Updated update method to handle in-memory and metadata updates, and notify the eviction policy.
    • Modified delete method to notify the eviction policy about removed objects and clear them from both in-memory and metadata maps.
    • Changed list_all to retrieve object metadata from the dedicated metadata map.
    • Updated the run function to construct and pass the eviction configuration to the ObjectCache initialization.
  • object_cache/src/eviction.rs
    • Created a new module for eviction policies.
    • Defined EvictionConfig struct for policy configuration.
    • Introduced the EvictionPolicy trait, specifying methods for access tracking, victim selection, and object lifecycle events.
    • Implemented LRUNode and LRUPolicy for Least Recently Used eviction, utilizing a doubly-linked list and HashMap for efficient operations.
    • Implemented NoEvictionPolicy as a no-op policy for disabling eviction.
    • Provided new_policy function to instantiate policies based on configuration and environment variables.
    • Included unit tests for both LRUPolicy and NoEvictionPolicy to validate their behavior.
  • object_cache/src/lib.rs
    • Exported the new eviction module and its public types (EvictionConfig, EvictionPolicy, EvictionPolicyPtr, LRUPolicy, NoEvictionPolicy).
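
As a rough illustration of the "doubly-linked list and HashMap" approach listed for object_cache/src/eviction.rs, here is a minimal single-threaded sketch. It is not the PR's `LRUPolicy`: the type `LruPolicy`, its method names, and the index-based list are assumptions made for the example, and the real implementation sits behind the `EvictionPolicy` trait and uses locks.

```rust
use std::collections::HashMap;

struct Node {
    key: String,
    size: u64,
    prev: Option<usize>,
    next: Option<usize>,
}

/// Doubly-linked list stored in a Vec (indices instead of pointers),
/// plus a HashMap from key to slot for O(1) lookup.
struct LruPolicy {
    max_memory: u64,
    max_objects: usize,
    used: u64,
    slots: Vec<Option<Node>>,
    free: Vec<usize>,
    index: HashMap<String, usize>,
    head: Option<usize>, // most recently used
    tail: Option<usize>, // least recently used
}

impl LruPolicy {
    fn new(max_memory: u64, max_objects: usize) -> Self {
        Self { max_memory, max_objects, used: 0, slots: Vec::new(), free: Vec::new(),
               index: HashMap::new(), head: None, tail: None }
    }

    /// Detaches slot `i` from the list without freeing it.
    fn unlink(&mut self, i: usize) {
        let (prev, next) = {
            let n = self.slots[i].as_ref().expect("live slot");
            (n.prev, n.next)
        };
        match prev {
            Some(p) => self.slots[p].as_mut().expect("live slot").next = next,
            None => self.head = next,
        }
        match next {
            Some(n) => self.slots[n].as_mut().expect("live slot").prev = prev,
            None => self.tail = prev,
        }
    }

    /// Makes slot `i` the most recently used entry.
    fn push_front(&mut self, i: usize) {
        let old_head = self.head;
        let n = self.slots[i].as_mut().expect("live slot");
        n.prev = None;
        n.next = old_head;
        match old_head {
            Some(h) => self.slots[h].as_mut().expect("live slot").prev = Some(i),
            None => self.tail = Some(i),
        }
        self.head = Some(i);
    }

    fn on_add(&mut self, key: &str, size: u64) {
        self.on_remove(key); // replace any previous entry for this key
        let node = Node { key: key.to_string(), size, prev: None, next: None };
        let i = match self.free.pop() {
            Some(i) => { self.slots[i] = Some(node); i }
            None => { self.slots.push(Some(node)); self.slots.len() - 1 }
        };
        self.index.insert(key.to_string(), i);
        self.used += size;
        self.push_front(i);
    }

    fn on_access(&mut self, key: &str) {
        if let Some(i) = self.index.get(key).copied() {
            self.unlink(i);
            self.push_front(i);
        }
    }

    fn on_remove(&mut self, key: &str) {
        if let Some(i) = self.index.remove(key) {
            self.unlink(i);
            let node = self.slots[i].take().expect("live slot");
            self.used -= node.size;
            self.free.push(i);
        }
    }

    /// Least recently used key, but only while a limit is exceeded.
    fn victim(&self) -> Option<String> {
        let over = self.used > self.max_memory || self.index.len() > self.max_objects;
        if !over {
            return None;
        }
        self.tail.map(|t| self.slots[t].as_ref().expect("live slot").key.clone())
    }
}
```

A caller would loop on `victim()`, drop the in-memory copy of each returned key, and call `on_remove` until `victim()` yields `None`. Using indices into a `Vec` keeps the sketch free of `unsafe` and reference counting; whether the PR takes the same route is not stated in this conversation.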

Activity

  • The author k82cn created this pull request to enable the LRU eviction policy in ObjectCache, addressing issue #366 (Enable LRU policy in ObjectCache). No further activity, such as comments or reviews, has been recorded yet.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a significant and well-designed feature: an LRU eviction policy for the ObjectCache, including new configuration options, a flexible eviction policy trait, an LRU implementation, and updates to the cache logic. However, the refactoring of ObjectCache methods has exposed or propagated several path traversal vulnerabilities. User-supplied identifiers like session_id, object_id, and key are used to construct file system paths without proper validation or sanitization, which could allow an attacker to read, write, or delete arbitrary files. Beyond this critical security concern, further improvements are needed to ensure consistent configuration parsing between YAML files and environment variables, clarify the impact of default behavior changes in the documentation, and improve code clarity and test robustness.

Comment thread object_cache/src/cache.rs
    ) -> Result<ObjectMetadata, FlameError> {
        let object_id = object_id.unwrap_or_else(|| uuid::Uuid::new_v4().to_string());
        let key = format!("{}/{}", session_id, object_id);
        let size = object.data.len() as u64;

security-high high

The put_with_id function is vulnerable to path traversal. The session_id and object_id parameters, which are derived from user-supplied Arrow Flight descriptors, are used to construct file paths without validation. An attacker can use path traversal sequences (e.g., ..) to write files outside the intended storage directory.

Comment thread object_cache/src/cache.rs
Comment on lines 358 to +359
  let object_path = storage_path.join(format!("{}.arrow", key));
-
- let file = fs::File::open(&object_path)
-     .map_err(|e| FlameError::NotFound(format!("Object file not found: {}", e)))?;
- let reader = FileReader::try_new(file, None)
-     .map_err(|e| FlameError::Internal(format!("Failed to create reader: {}", e)))?;
-
- let batch = reader
-     .into_iter()
-     .next()
-     .ok_or_else(|| FlameError::Internal("No batches in file".to_string()))?
-     .map_err(|e| FlameError::Internal(format!("Failed to read batch: {}", e)))?;
-
- let object = batch_to_object(&batch)
-     .map_err(|e| FlameError::Internal(format!("Failed to parse batch: {}", e)))?;
-
- Ok(object)
+ self.load_object_from_disk_internal(&object_path)

security-high high

The load_object_from_disk function constructs a file path by joining storage_path with a user-supplied key. This is a classic path traversal vulnerability. You should validate that the resulting path is still within the storage_path.

Suggested change
- let object_path = storage_path.join(format!("{}.arrow", key));
- let file = fs::File::open(&object_path)
-     .map_err(|e| FlameError::NotFound(format!("Object file not found: {}", e)))?;
- let reader = FileReader::try_new(file, None)
-     .map_err(|e| FlameError::Internal(format!("Failed to create reader: {}", e)))?;
- let batch = reader
-     .into_iter()
-     .next()
-     .ok_or_else(|| FlameError::Internal("No batches in file".to_string()))?
-     .map_err(|e| FlameError::Internal(format!("Failed to read batch: {}", e)))?;
- let object = batch_to_object(&batch)
-     .map_err(|e| FlameError::Internal(format!("Failed to parse batch: {}", e)))?;
- Ok(object)
- self.load_object_from_disk_internal(&object_path)
+ let object_path = storage_path.join(format!("{}.arrow", key));
+ if !object_path.starts_with(storage_path) {
+     return Err(FlameError::InvalidConfig(format!("Invalid key: {}", key)));
+ }
+ self.load_object_from_disk_internal(&object_path)
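
One caveat when applying a prefix check like this: `Path::starts_with` compares path components lexically and does not resolve `..`, so a joined path such as `/data/../etc/x.arrow` still starts with `/data`. A stricter option is to reject any key that contains something other than plain path components before joining. The helper below is a hypothetical sketch of that idea (`safe_object_path` and its `String` error are not from the PR):

```rust
use std::path::{Component, Path, PathBuf};

/// Joins `key` onto `root`, rejecting keys that could escape `root`.
/// Hypothetical helper for illustration; the merged fix may differ.
fn safe_object_path(root: &Path, key: &str) -> Result<PathBuf, String> {
    if key.is_empty() {
        return Err("empty key".to_string());
    }
    for component in Path::new(key).components() {
        match component {
            // Only plain names such as "session-1" or "object-7" are allowed.
            Component::Normal(_) => {}
            // Rejects "..", ".", a leading "/", and Windows prefixes.
            _ => return Err(format!("invalid key: {}", key)),
        }
    }
    Ok(root.join(format!("{}.arrow", key)))
}
```

With this check, a key like `s1/o1` maps to `<root>/s1/o1.arrow`, while `../../etc/passwd` and `/etc/passwd` are rejected before any file system call is made. Canonicalizing the final path and comparing it against the canonical storage root would be an alternative, at the cost of requiring the file to exist.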

Comment thread object_cache/src/cache.rs
        Ok(Some(object))
    }

    async fn get(&self, key: String) -> Result<Object, FlameError> {

security-high high

The get function is vulnerable to path traversal. The key parameter, which is derived from user-supplied Arrow Flight tickets, is used to construct file paths in load_object_from_disk without validation. An attacker can provide a key like ../../etc/passwd to read arbitrary files on the system (with a .arrow extension).

Comment thread object_cache/src/cache.rs
        Err(FlameError::NotFound(format!("object <{}> not found", key)))
    }

    async fn update(&self, key: String, new_object: Object) -> Result<ObjectMetadata, FlameError> {

security-high high

The update function is vulnerable to path traversal. The key parameter is used to construct a file path for writing updated object data. Lack of validation allows an attacker to overwrite arbitrary files on the system.

Comment thread object_cache/src/cache.rs
        Ok(meta)
    }

    async fn delete(&self, session_id: SessionID) -> Result<(), FlameError> {

security-high high

The delete function is vulnerable to path traversal. The session_id parameter is used to construct a directory path for deletion using fs::remove_dir_all. An attacker can provide a session_id like .. to delete directories outside the intended storage path.

Comment on lines +48 to +54
    pub fn max_memory_bytes(&self) -> u64 {
        let max_memory_str = self.max_memory.as_deref().unwrap_or(DEFAULT_MAX_MEMORY);
        max_memory_str
            .parse::<ByteSize>()
            .map(|bs| bs.as_u64())
            .unwrap_or(1024 * 1024 * 1024) // 1G fallback
    }

high

There is an inconsistency in how memory size strings are parsed. The parse_memory_size function in common/src/ctx.rs is designed to interpret units like 'G' as binary (GiB), which is desirable for memory sizes. However, this max_memory_bytes function uses str::parse::<ByteSize>() directly, which treats 'G' as decimal (GB). This leads to inconsistent behavior between configuration from a file (which uses the custom parser) and configuration from environment variables (which would be parsed here).

To ensure consistent behavior, this function should use the common::ctx::parse_memory_size function. You will need to add use common::ctx::parse_memory_size; to the top of the file.

    pub fn max_memory_bytes(&self) -> u64 {
        let max_memory_str = self.max_memory.as_deref().unwrap_or(DEFAULT_MAX_MEMORY);
        common::ctx::parse_memory_size(max_memory_str).unwrap_or(1024 * 1024 * 1024) // 1G fallback
    }
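
To make the unit difference concrete: 1 GB in decimal units is 1,000,000,000 bytes, while 1 GiB is 1,073,741,824 bytes, so the two readings of "1G" differ by roughly 7%. The sketch below shows one way a binary-unit parser could behave. It is a standalone stand-in, not the actual `common::ctx::parse_memory_size` (which, per the changelog, builds on the bytesize crate); the function name and the set of accepted suffixes are assumptions.

```rust
/// Parses strings such as "1G", "512M", "64K", "2GiB" or "1024" into bytes,
/// treating K/M/G/T as powers of 1024. Hypothetical sketch only; the real
/// `parse_memory_size` may accept a different set of inputs.
fn parse_memory_size_sketch(s: &str) -> Result<u64, String> {
    let s = s.trim();
    // Split the leading digits from the unit suffix.
    let split = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (num, unit) = s.split_at(split);
    let value: u64 = num
        .parse()
        .map_err(|_| format!("invalid size: {:?}", s))?;
    let shift = match unit.trim().to_ascii_uppercase().as_str() {
        "" | "B" => 0,
        "K" | "KIB" => 10,
        "M" | "MIB" => 20,
        "G" | "GIB" => 30,
        "T" | "TIB" => 40,
        other => return Err(format!("unknown unit: {:?}", other)),
    };
    value
        .checked_mul(1u64 << shift)
        .ok_or_else(|| format!("size overflows u64: {:?}", s))
}
```

Routing both the YAML path and the environment-variable path through a single function of this kind is what keeps "1G" meaning the same number of bytes everywhere.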

Comment thread common/src/ctx.rs Outdated
Comment on lines +403 to +404
    let ctx = FlameClusterContext::from_file(Some(tmp_file.to_string_lossy().to_string()))
        .map_err(|e| FlameError::Internal(e.to_string()))?;

medium

Wrapping the result of FlameClusterContext::from_file with map_err obscures the original error type. If from_file returns a specific FlameError variant (like InvalidConfig), it gets converted into a generic Internal error. This makes debugging test failures harder. It would be better to use ? directly to propagate the original error.

Suggested change
- let ctx = FlameClusterContext::from_file(Some(tmp_file.to_string_lossy().to_string()))
-     .map_err(|e| FlameError::Internal(e.to_string()))?;
+ let ctx = FlameClusterContext::from_file(Some(tmp_file.to_string_lossy().to_string()))?;
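
The effect of the extra `map_err` can be reproduced in isolation. The example below uses a made-up `DemoError` enum and a made-up `from_file` function rather than the project's `FlameError` and `FlameClusterContext`, so it only illustrates the general pattern: re-wrapping collapses the variant, while a bare `?` keeps it.

```rust
use std::fmt;

/// Hypothetical error type standing in for the project's `FlameError`.
#[derive(Debug, PartialEq)]
enum DemoError {
    InvalidConfig(String),
    Internal(String),
}

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoError::InvalidConfig(m) => write!(f, "invalid config: {}", m),
            DemoError::Internal(m) => write!(f, "internal error: {}", m),
        }
    }
}

/// Hypothetical loader that fails with a specific variant.
fn from_file(path: &str) -> Result<u32, DemoError> {
    if path.is_empty() {
        return Err(DemoError::InvalidConfig("no path given".to_string()));
    }
    Ok(42)
}

/// Re-wrapping: the caller only ever sees `Internal`.
fn wrapped(path: &str) -> Result<u32, DemoError> {
    let v = from_file(path).map_err(|e| DemoError::Internal(e.to_string()))?;
    Ok(v)
}

/// Plain `?`: the original variant reaches the caller unchanged.
fn direct(path: &str) -> Result<u32, DemoError> {
    let v = from_file(path)?;
    Ok(v)
}
```

Here `wrapped("")` yields an `Internal` error whose message merely embeds the original text, whereas `direct("")` yields the `InvalidConfig` variant itself, which a test can match on directly.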

Comment thread docs/designs/RFE004-lru-policy/FS.md Outdated
Comment on lines +102 to +107
- Backward compatible: If eviction config is not set, defaults to LRU with 1G limit
- No changes to Arrow Flight protocol or ObjectRef structure
- Existing cached data on disk remains accessible

*Breaking Changes:*
- None

medium

The document states there are no breaking changes, but also that the default behavior changes from unbounded memory to an LRU policy with a 1G limit if no eviction configuration is provided. This change in default behavior could be considered a breaking change for users who rely on the previous unbounded behavior. It would be clearer to explicitly mention this as a behavioral change in the 'Breaking Changes' section or rephrase the 'Backward compatible' point to clarify this.

Comment on lines +230 to +250
    let mut nodes = match lock_ptr!(self.nodes) {
        Ok(n) => n,
        Err(e) => {
            tracing::error!("Failed to lock nodes: {}", e);
            return;
        }
    };
    let mut head = match lock_ptr!(self.head) {
        Ok(h) => h,
        Err(e) => {
            tracing::error!("Failed to lock head: {}", e);
            return;
        }
    };
    let mut tail = match lock_ptr!(self.tail) {
        Ok(t) => t,
        Err(e) => {
            tracing::error!("Failed to lock tail: {}", e);
            return;
        }
    };

medium

The pattern of locking nodes, head, and tail individually with error handling is repeated in on_access, on_evict, and on_add. This introduces boilerplate and could be simplified by creating a helper function to acquire all three locks and return them as a tuple, for example Result<(MutexGuard<...>, ...), ...>. This would make the code more concise and less error-prone.
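
A helper along those lines might look like the following sketch. It uses plain `std::sync::Mutex` fields and a `String` error instead of the project's `lock_ptr!` macro and `FlameError`, and the `LruState` type with its field types is invented for the example.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, MutexGuard};

/// Hypothetical stand-in for the three separately locked parts of the LRU state.
struct LruState {
    nodes: Mutex<HashMap<String, u64>>,
    head: Mutex<Option<String>>,
    tail: Mutex<Option<String>>,
}

type Guards<'a> = (
    MutexGuard<'a, HashMap<String, u64>>,
    MutexGuard<'a, Option<String>>,
    MutexGuard<'a, Option<String>>,
);

impl LruState {
    /// Acquires all three locks, always in the order nodes, head, tail.
    fn lock_all(&self) -> Result<Guards<'_>, String> {
        let nodes = self.nodes.lock().map_err(|e| format!("Failed to lock nodes: {}", e))?;
        let head = self.head.lock().map_err(|e| format!("Failed to lock head: {}", e))?;
        let tail = self.tail.lock().map_err(|e| format!("Failed to lock tail: {}", e))?;
        Ok((nodes, head, tail))
    }

    /// Example caller: one match instead of three.
    fn on_add(&self, key: &str, size: u64) {
        let (mut nodes, mut head, mut tail) = match self.lock_all() {
            Ok(guards) => guards,
            Err(e) => {
                eprintln!("{}", e);
                return;
            }
        };
        nodes.insert(key.to_string(), size);
        if tail.is_none() {
            *tail = Some(key.to_string());
        }
        *head = Some(key.to_string());
    }
}
```

Besides removing boilerplate, funnelling every caller through one function fixes the acquisition order in a single place, which helps avoid lock-order inversions between `on_access`, `on_evict`, and `on_add`.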

k82cn added a commit to k82cn/flame that referenced this pull request Mar 10, 2026
- Add path traversal validation to prevent directory traversal attacks
  in put_with_id, get, update, delete, and load_object_from_disk
- Use common::ctx::parse_memory_size in EvictionConfig::max_memory_bytes
  for consistent binary unit interpretation (G=GiB, M=MiB)
- Simplify test error handling by using ? operator directly
- Clarify behavioral change in FS.md: default LRU with 1GiB limit
- Fix clippy warnings (map_or -> is_some_and, wildcard pattern)
@k82cn k82cn force-pushed the flm_366 branch 2 times, most recently from ee7106c to e6aad4c Compare March 10, 2026 07:45
Implement memory management for ObjectCache with configurable eviction:

- Add EvictionPolicy trait with LRU and NoEviction implementations
- LRU uses doubly-linked list + HashMap for O(1) operations
- Support max_memory (e.g. "1G", "512M") and max_objects limits
- Evicted objects remain on disk, reloaded on demand
- Add path traversal validation for security
- Enable LRU eviction in CI configuration

Configuration example:
  cache:
    eviction:
      policy: "lru"
      max_memory: "1G"
      max_objects: 10000

Closes xflops#366
@k82cn k82cn merged commit f36e419 into xflops:main Mar 10, 2026
5 checks passed
@k82cn k82cn deleted the flm_366 branch March 10, 2026 08:08