Description
Problem Statement
The current `GeminiModel` implementation does not support Gemini's explicit context caching feature, which provides up to 90% cost reduction on cached tokens. While Gemini 2.5 models have implicit caching, it does not work reliably with Strands' request structure (the system prompt and tools are sent in `config` instead of `contents`).
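For context, here is a minimal sketch of the request shape being described, written directly against the google-genai SDK (the prompt and tool placeholders are illustrative, not the actual Strands code): the large, repeated material travels in `config`, not in `contents`.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

large_system_prompt = "..."   # placeholder: several thousand tokens in practice
tool_declarations = []        # placeholder: e.g. 30+ function declarations

# The system prompt and tools ride in `config`, while `contents` only carries the
# conversation turns -- the structure the issue says implicit caching handles poorly.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[types.Content(role="user", parts=[types.Part(text="latest user message")])],
    config=types.GenerateContentConfig(
        system_instruction=large_system_prompt,
        tools=[types.Tool(function_declarations=tool_declarations)],
    ),
)
```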
Current behavior:
- Every request sends the full system prompt + tools (e.g., 13,494 tokens)
- No visibility into cached tokens
- No control over cache lifecycle
- `cached_content_token_count` always returns `None`
Expected behavior:
- Ability to explicitly cache the system prompt + tools
- 75-90% discount on cached tokens
- Cache visibility via `usage_metadata.cached_content_token_count`
- Cache lifecycle management (create, delete, TTL)
Proposed Solution
Add explicit context caching support to `GeminiModel`, similar to how `BedrockModel` implements the `cache_prompt` parameter.
API design:
```python
from strands.models.gemini import GeminiModel

model = GeminiModel(
    model_id="gemini-2.5-flash",
    client_args={"api_key": "..."},
    enable_caching=True,  # Enable auto-caching
    cache_ttl="3600s",    # Cache TTL (default 1 hour)
)

# Or manual cache management
model.create_cache(system_prompt, tool_specs, ttl="7200s")
model.delete_cache()
```
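Under the hood, this could map onto the google-genai SDK's explicit caching API roughly as follows. This is a sketch under assumptions: `system_prompt`, `tool_specs`, and `contents` are placeholders, and the exact wiring inside `GeminiModel` is up to the implementation.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

# One-time cache creation for the stable prefix (system prompt + tools).
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction=system_prompt,  # placeholder
        tools=tool_specs,                  # placeholder: google-genai Tool objects
        ttl="3600s",
    ),
)

# Each subsequent request references the cache instead of resending the prefix.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=contents,  # placeholder: the conversation messages
    config=types.GenerateContentConfig(cached_content=cache.name),
)

# Cache hits become visible in usage metadata.
print(response.usage_metadata.cached_content_token_count)

# Cleanup when the cache is no longer needed.
client.caches.delete(name=cache.name)
```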
Key features:
- Auto-cache creation: automatically creates a cache on the first request when `enable_caching=True`
- Cache validation: reuses the cache when the system prompt + tools match
- Visibility: exposes `cachedTokens` in `metadata.usage`
- Cache lifecycle: methods to create/delete/manage the cache
Implementation Details
Changes needed in `strands/models/gemini.py` (a rough sketch of the new methods follows this list):
- Add `enable_caching` and `cache_ttl` to `GeminiConfig`
- Add `create_cache()` and `delete_cache()` methods
- Modify `_format_request_config()` to accept a `cached_content` parameter
- Add cache validation logic in `_format_request()`
- Expose `cached_content_token_count` in metadata
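A minimal sketch of what `create_cache()` / `delete_cache()` and the cache-validation logic could look like. Everything here is hypothetical: the attribute names, the hashing scheme, and the assumption that `tool_specs` are already google-genai `types.Tool` objects are illustrative choices, not the actual Strands internals.

```python
import hashlib
import json

from google.genai import types


class GeminiModel:  # illustrative subset only, not the real class
    def __init__(self, client, model_id):
        self._client = client      # assumed: a google-genai Client instance
        self._model_id = model_id
        self._cache = None
        self._cache_key = None

    def create_cache(self, system_prompt, tool_specs, ttl="3600s"):
        """Create (or reuse) an explicit cache for the system prompt + tools."""
        # Cache validation: key the cache on the exact prompt + tool specs so a
        # matching cache is reused instead of recreated.
        cache_key = hashlib.sha256(
            json.dumps(
                {"system": system_prompt, "tools": [t.model_dump() for t in tool_specs]},
                sort_keys=True,
                default=str,
            ).encode()
        ).hexdigest()
        if self._cache is not None and self._cache_key == cache_key:
            return self._cache

        self._cache = self._client.caches.create(
            model=self._model_id,
            config=types.CreateCachedContentConfig(
                system_instruction=system_prompt,
                tools=tool_specs,
                ttl=ttl,
            ),
        )
        self._cache_key = cache_key
        return self._cache

    def delete_cache(self):
        """Delete the current explicit cache, if any."""
        if self._cache is not None:
            self._client.caches.delete(name=self._cache.name)
            self._cache = None
            self._cache_key = None
```

`_format_request_config()` would then attach `cached_content=self._cache.name` to the generated `GenerateContentConfig` whenever a cache exists, and the usage metadata handler would surface `cached_content_token_count`.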
References
- [Gemini Context Caching Docs](https://ai.google.dev/gemini-api/docs/caching)
- [Python SDK Reference](https://googleapis.github.io/python-genai/)
- [Bedrock Cache Implementation](https://strandsagents.com/latest/documentation/docs/user-guide/concepts/model-providers/amazon-bedrock/) (for API consistency)
Alternative Solutions
- Do nothing: Users pay 5-10x more in token costs
- Rely on implicit caching: Unreliable, no visibility, no control
Additional Context
A tested implementation shows:
- 68% token reduction on a real workload (`cached_content_token_count`: 9,255 out of 13,564 total tokens)
- Works with 30+ tools and complex system prompts
- Compatible with the existing Strands agent loop
I'm happy to submit a PR with the implementation if this feature request is accepted.
Use Case
Agents with large system prompts or many tools (e.g., 30 tools = ~9K tokens) incur high costs on every request. For production workloads with 1,000+ messages/day, this becomes expensive quickly.
Example cost impact:
- Without caching: 13,564 tokens/msg × 30K msgs/month × $0.00000035/token ≈ $142/month
- With caching: ~4,300 uncached tokens/msg plus ~9,255 cached tokens billed at a 75% discount, × 30K msgs/month ≈ $69/month
- Savings: ~$73/month (~$876/year)
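A quick back-of-envelope check of those numbers (assuming the 75%-discount end of the range quoted earlier; actual Gemini pricing may differ):

```python
# Back-of-envelope check of the figures above. Assumes cached tokens are billed
# at 25% of the normal input price (the 75% discount end of the quoted range).
price_per_token = 0.00000035
msgs_per_month = 30_000
total_tokens, cached_tokens = 13_564, 9_255

without_caching = total_tokens * msgs_per_month * price_per_token          # ~$142
effective_tokens = (total_tokens - cached_tokens) + 0.25 * cached_tokens   # ~6,623 tokens/msg
with_caching = effective_tokens * msgs_per_month * price_per_token         # ~$70
print(round(without_caching), round(with_caching), round(without_caching - with_caching))
```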