Enable semantic cache with @CacheResult #659

Draft
wants to merge 6 commits into base: main

Conversation

andreadimaio
Contributor

In this PR, the idea is to implement the semantic cache as discussed in #637: the developer can use the @CacheResult annotation to decide which methods need to use the semantic cache.

The implementation creates different classes following what was done for ChatMemory and uses DefaultMemoryIdProvider to create an isolated cache for each user where possible (I don't know if this is enough to avoid possible data leakage).

An open point is what to do if the @Tools annotation is present.
Shouldn't @CacheResult be usable in this case?

TODO LIST:

  • Improve tests
  • Understand how to implement the cache concept for stream responses.
  • Write documentation

There is still work to be done, but I think there is enough code to understand if the direction taken is the right one or if changes need to be made.

@geoand
Collaborator

geoand commented Jun 6, 2024

An open point is what to do if the @Tools annotation is present.
Shouldn't @CacheResult be usable in this case?

I think it should

Understand how to implement the cache concept for stream responses.

This is a tricky one... Maybe we should skip it initially

@jmartisk
Collaborator

jmartisk commented Jun 6, 2024

If a prompt leads to tool executions, I think we shouldn't cache the response. Because if we then served the same response again without actually invoking the tools again, that would be really weird. (And re-invoking the tools is probably not a good idea either, they may be highly non-deterministic and non-idempotent and may change the outcome)

@geoand
Collaborator

geoand commented Jun 6, 2024

It's certainly weird either way

if (cache != null) {
    log.debug("Attempting to obtain AI response from cache");

    Embedding query = context.embeddingModel.embed(userMessage.text()).content();
Collaborator

IIUC here we should also check that there is no previous conversation (no user message in the memory), because we want to only support caching for the first message, right? Or are you checking this elsewhere already?
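
For illustration, the check suggested here could look at the chat memory and only consult the cache when no user message has been stored yet; the class and method names below are assumptions, not code from the PR.

import java.util.List;

import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;

// Hypothetical sketch of the "first prompt only" guard discussed above.
class FirstPromptGuard {

    // Returns true when the memory contains no prior user message,
    // i.e. the incoming prompt is the first one of the conversation.
    static boolean isFirstUserMessage(List<ChatMessage> memoryMessages) {
        return memoryMessages.stream().noneMatch(m -> m instanceof UserMessage);
    }
}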

Contributor Author

The cache is configurable with the property quarkus.langchain4j.cache.max-size (default 1). Should this be hidden from developers?

Collaborator

Maybe we're not understanding each other - cache max size is the maximum number of prompts stored in it, right?

But here I am talking about the fact that we should only attempt to use the cache when it's for the first prompt in the conversation, so that we avoid the situation that you described in #637 (comment)

Contributor Author

andreadimaio commented Jun 6, 2024

Yes, max-size is the maximum number of prompts stored, where a prompt is the combination of SystemMessage + UserMessage. Each cache has an id; when possible the id of the cache is unique (for example, if we are in a RequestScope context), otherwise the id will always be the value default.

Id with context:
<ContextId>#<Class>.<Method>

Id without context:
default

Having said that, I don't think we understand each other (but that's my problem 😵).
To avoid returning a wrong cached response like in the example in #637 (comment), is it not enough to have a max-size of 1?
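
For illustration, composing an id along the lines described above could look like this; the helper is hypothetical, not code from the PR.

// Hypothetical helper that builds the cache id described above.
class CacheIdSketch {

    static String cacheId(String contextId, String className, String methodName) {
        if (contextId == null) {
            return "default"; // no request/session context: the id falls back to "default"
        }
        return contextId + "#" + className + "." + methodName;
    }
}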

Collaborator

Having said that, I don't think we understand each other (but that's my problem 😵). To avoid returning a wrong cached response like in the example in #637 (comment), is it not enough to have a max-size of 1?

A single Cache instance contains prompts (system+user message pairs) from MULTIPLE conversations; and from each such conversation, it stores only the initial prompt, do I get that right? Therefore, the max-size refers to the number of conversations from which we store the initial prompt? But then that has nothing to do with how many messages have passed in the current conversation no?

Collaborator

jmartisk commented Jun 10, 2024

Hmm I thought we were going to make the cache shared between users, not scoped to each user individually.

Yes, that has potential security implications, but I don't find it very useful if we only consider cached results for each user separately - how often will a single user repeat the same questions again? Not very often, I think.

Scoping the cache to an HTTP request is too limiting too - one HTTP request usually involves just one prompt. We need to keep the cached responses for longer so that they can be reused in later conversations. If you scope it to a HTTP request, you dispose of it immediately after one request.

My understanding was we would have a cache for each AI service method, but its life would be application-scoped and it would store the first prompt of every conversation that starts by calling that method. How to avoid memory problems is an important question - set a reasonable TTL and periodically clean the cache, or maybe use weak references so that the GC can clean up entries over time (I'm not sure about that).

Contributor Author

My understanding was we would have a cache for each AI service method, but its life would be application-scoped and store the first prompt of every conversation that starts by calling that method.

Ok, so in this case the id of the cache will always be: #<ClassName>.<MethodName>
In the example above: #org.acme.example.LLMService.fiveLinesPoem and #org.acme.example.LLMService.tenLinesPoem for all the users.

How to avoid memory problems, that's an important question - set a reasonable TTL and periodically clean it, or I'm not sure we could use weak references to have the GC clean up stuff over time.

Yes, in this case it might be better to set a default TTL and, of course, document the fact that the developer can run into an out-of-memory problem if the TTL value is too high.
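
For illustration, with a default TTL in place the cache configuration might look like this (only quarkus.langchain4j.cache.max-size appears in the PR as written; the ttl key and values are assumptions based on this discussion):

quarkus.langchain4j.cache.max-size=2
quarkus.langchain4j.cache.ttl=5m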

Collaborator

Ok and the default max-size will surely be more than one, because we will store prompts from potentially different users. So actually, the max-size will probably be the main safeguard against eating too much memory

Contributor Author

andreadimaio commented Jun 10, 2024

Ok, let me go back to the example with the last changes we have in mind:

package org.acme.example;

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;

import io.quarkiverse.langchain4j.RegisterAiService;
// Note: @CacheResult is the annotation introduced by this PR; its package depends on where the PR finally places it.

@RegisterAiService
@CacheResult
@SystemMessage("You are a poet")
public interface LLMService {

    @UserMessage("Write a poem of 5 lines about {poem}")
    String fiveLinesPoem(String poem);

    @UserMessage("Write a poem of 10 lines about {poem}")
    String tenLinesPoem(String poem);
}

Configuration:

max-size: 2
ttl: 5m

ADM and JM want to create poems.

Two caches are created:

Cache#1:  #org.acme.example.LLMService.fiveLinesPoem
Cache#2:  #org.acme.example.LLMService.tenLinesPoem

Steps:

  1. ADM calls
    -> fiveLinesPoem("dog");
    -> No value in the cache;
    -> Calls the LLM;
    -> Stores the response in the Cache#1

  2. JM calls
    -> fiveLinesPoem("cat");
    -> No value in the cache;
    -> Calls the LLM;
    -> Stores the response in the Cache#1 [MAX-SIZE REACHED]

  3. ADM calls
    -> fiveLinesPoem("turtle");
    -> No value in the cache;
    -> Calls the LLM;
    -> The value is NOT cached [max size reached]

  4. ADM calls
    -> fiveLinesPoem("cat");
    -> Return the cached value. [Cache#1]

The two cached responses will be deleted after 5 minutes.
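
For illustration, a minimal sketch of the lookup/store behaviour walked through in these steps could look like the following; every name here is hypothetical, it ignores the system-message part of the prompt, and it is not the PR's actual implementation.

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.message.AiMessage;

// Hypothetical sketch: fixed max size, fixed TTL, similarity-based lookup.
class SemanticCacheSketch {

    private record Entry(Embedding query, AiMessage response, Instant storedAt) {}

    private final List<Entry> entries = new CopyOnWriteArrayList<>();
    private final int maxSize;      // 2 in the example above
    private final Duration ttl;     // 5 minutes in the example above
    private final double threshold; // minimum cosine similarity counted as a hit

    SemanticCacheSketch(int maxSize, Duration ttl, double threshold) {
        this.maxSize = maxSize;
        this.ttl = ttl;
        this.threshold = threshold;
    }

    // Steps 1-4: return a cached response if a semantically similar, non-expired query was stored.
    Optional<AiMessage> lookup(Embedding query) {
        Instant now = Instant.now();
        entries.removeIf(e -> e.storedAt().plus(ttl).isBefore(now)); // drop entries older than the TTL
        return entries.stream()
                .filter(e -> cosine(e.query().vector(), query.vector()) >= threshold)
                .map(Entry::response)
                .findFirst();
    }

    // Steps 1-3: store the response only while the max size has not been reached.
    void put(Embedding query, AiMessage response) {
        if (entries.size() >= maxSize) {
            return; // step 3: max size reached, the value is NOT cached
        }
        entries.add(new Entry(query, response, Instant.now()));
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}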

Collaborator

Yes this makes sense to me. But let's ask @cescoffier to confirm

@andreadimaio
Contributor Author

I'm also thinking about a cache property that allows the developer to decide which type of embedding to use, something like quarkus.langchain4j.cache.embedding-model=dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel or quarkus.langchain4j.cache.embedding-model=bam

@jmartisk
Collaborator

jmartisk commented Jun 6, 2024

I'm also thinking about a cache property that allows the developer to decide which type of embedding to use, something like quarkus.langchain4j.cache.embedding-model=dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel or quarkus.langchain4j.cache.embedding-model=bam

That should point not at the provider name, but at the name of the configured model, or whatever the proper term is here (I mean like the c1, c2, c3, ... in https://github.com/quarkiverse/quarkus-langchain4j/blob/0.15.1/integration-tests/multiple-providers/src/main/resources/application.properties). Hopefully it won't be too tricky to implement; our configuration model is quite complex already.
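
For illustration, assuming the named-model configuration style from that integration test also applies to embedding models, the proposal could end up looking like this (the property keys and values below are assumptions, not part of this PR):

# hypothetical named embedding model "e1"
quarkus.langchain4j.e1.embedding-model.provider=openai

# the proposed cache property would then reference the configured model by name
quarkus.langchain4j.cache.embedding-model=e1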

@geoand changed the title from "Enable sematic cache with @CacheResult" to "Enable semantic cache with @CacheResult" on Jun 6, 2024
@iocanel
Collaborator

iocanel commented Jun 6, 2024

An open point is what to do if the @Tools annotation is present. Shouldn't @CacheResult be usable in this case?

I think that the whole cache idea is that, in general, similar questions should produce similar results, so caching makes sense in some cases. When tools are involved this is less likely to happen, as we have absolutely no idea what response the tool will return even for the exact same input.

So, I'd say that it should not cache the result, unless the user somehow defines that the tool is idempotent (always produces the same output for a given input) or cacheable itself.

@andreadimaio
Contributor Author

Do you think it would make sense to enable caching for the Streaming API if the method or class has the @Blocking annotation?

@geoand
Collaborator

geoand commented Jun 14, 2024

I think we can leave streaming out for a first version and think about it some more
