
[FEATURE] Add explicit context caching support for Gemini models #1060

@JosephGabito


Problem Statement

The current GeminiModel implementation does not support Gemini's explicit context caching feature, which offers up to a 90% cost reduction on cached tokens. While Gemini 2.5 models have implicit caching, it does not work reliably with Strands' request structure (the system prompt and tools are sent in config rather than contents).

Current behavior:

  • Every request sends full system prompt + tools (e.g., 13,494 tokens)
  • No visibility into cached tokens
  • No control over cache lifecycle
  • cached_content_token_count always returns None

Expected behavior:

  • Ability to explicitly cache system prompt + tools
  • 75-90% discount on cached tokens
  • Cache visibility via usage_metadata.cached_content_token_count
  • Cache lifecycle management (create, delete, TTL)

Proposed Solution

Add explicit context caching support to GeminiModel, similar to how BedrockModel implements its cache_prompt parameter.

API design:

from strands.models.gemini import GeminiModel

model = GeminiModel(
    model_id="gemini-2.5-flash",
    client_args={"api_key": "..."},
    enable_caching=True,    # Enable auto-caching
    cache_ttl="3600s"       # Cache TTL (default 1 hour)
)

# Or manual cache management
model.create_cache(system_prompt, tool_specs, ttl="7200s")
model.delete_cache()
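
A minimal usage sketch of how a cached model would plug into the existing Strands agent loop, assuming the proposed enable_caching config above; LARGE_SYSTEM_PROMPT and my_tools are hypothetical placeholders for a real system prompt and tool list:

# Usage sketch, assuming the proposed enable_caching config above.
# LARGE_SYSTEM_PROMPT and my_tools are hypothetical placeholders for a real
# system prompt and tool list.
from strands import Agent

agent = Agent(model=model, system_prompt=LARGE_SYSTEM_PROMPT, tools=my_tools)

# The first call would create the cache (system prompt + tools); subsequent
# calls reuse it, so only the conversation contents are billed at the full rate.
agent("Summarize the open incidents for the payments service.")
agent("Which of those are older than 24 hours?")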

Key features:

  1. Auto-cache creation: Automatically creates a cache on the first request when enable_caching=True
  2. Cache validation: Reuses the cache when the system prompt + tools match
  3. Visibility: Exposes cachedTokens in metadata.usage
  4. Cache lifecycle: Methods to create, delete, and manage the cache (see the SDK sketch after this list)
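
Under the hood this would map onto the google-genai SDK's explicit caching API. A rough sketch of the underlying calls, treating SYSTEM_PROMPT and GEMINI_TOOLS as placeholders (exact config fields should be verified against the current google-genai release):

from google import genai
from google.genai import types

client = genai.Client(api_key="...")

# Create an explicit cache holding the system prompt + tool declarations.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="strands-agent-cache",
        system_instruction=SYSTEM_PROMPT,
        tools=GEMINI_TOOLS,
        ttl="3600s",
    ),
)

# Reference the cache on each request instead of resending the prompt + tools.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Hello",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.usage_metadata.cached_content_token_count)

# Remove the cache explicitly (otherwise it expires when the TTL elapses).
client.caches.delete(name=cache.name)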

Implementation Details

Changes needed in strands/models/gemini.py:

  1. Add enable_caching and cache_ttl to GeminiConfig
  2. Add create_cache() and delete_cache() methods (sketched after this list)
  3. Modify _format_request_config() to accept cached_content parameter
  4. Add cache validation logic in _format_request()
  5. Expose cached_content_token_count in metadata
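
A rough sketch of item 2, assuming GeminiModel keeps its google-genai client on self.client and its config dict on self.config; _format_tools is a hypothetical helper that converts Strands tool specs into Gemini tool declarations:

from google.genai import types

def create_cache(self, system_prompt, tool_specs, ttl="3600s"):
    """Create an explicit cache for the system prompt + tools and remember its name."""
    cache = self.client.caches.create(
        model=self.config["model_id"],
        config=types.CreateCachedContentConfig(
            system_instruction=system_prompt,
            tools=self._format_tools(tool_specs),
            ttl=ttl,
        ),
    )
    self._cache_name = cache.name
    return cache.name

def delete_cache(self):
    """Delete the active cache, if one exists."""
    if getattr(self, "_cache_name", None):
        self.client.caches.delete(name=self._cache_name)
        self._cache_name = None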


Alternative Solutions

  1. Do nothing: Users pay 5-10x more in token costs
  2. Rely on implicit caching: Unreliable, no visibility, no control

Additional Context

A tested implementation shows:

  • 68% token reduction on real workload
  • cached_content_token_count: 9,255 out of 13,564 total tokens
  • Works with 30+ tools and complex system prompts
  • Compatible with existing Strands agent loop

I'm happy to submit a PR with the implementation if this feature request is accepted.

Use Case

Agents with large system prompts or many tools (e.g., 30 tools = ~9K tokens) incur high costs on every request. For production workloads with 1,000+ messages/day, this becomes expensive quickly.

Example cost impact:

  • Without caching: 13,564 tokens/msg × 30K msgs/month × $0.00000035/token ≈ $142/month
  • With caching: ~4,300 uncached tokens/msg at full price + ~9,255 cached tokens/msg at a 75% discount (~6,600 full-price-equivalent tokens/msg) ≈ $69/month
  • Savings: ~$73/month (~$876/year)
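
For reference, the $69 figure assumes cached tokens are billed at roughly a 75% discount (cache storage charges not included); a quick sanity check of the arithmetic:

# Back-of-the-envelope check of the figures above (cache storage fees ignored).
price_per_token = 0.00000035                     # $0.35 per 1M input tokens
msgs_per_month = 30_000

total_tokens = 13_564                            # per message, no caching
cached_tokens = 9_255                            # per message, served from cache
uncached_tokens = total_tokens - cached_tokens   # ~4,309

no_cache = total_tokens * msgs_per_month * price_per_token                                  # ~$142
with_cache = (uncached_tokens + 0.25 * cached_tokens) * msgs_per_month * price_per_token    # ~$69.5

print(f"${no_cache:.2f}/month vs ${with_cache:.2f}/month")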
