Enable semantic cache with @CacheResult #659

Draft
wants to merge 6 commits into base: main

Conversation

andreadimaio
Contributor

In this PR, the idea is to implement the semantic cache as discussed in #637: the developer can use the @CacheResult annotation to decide which methods need to use the semantic cache.

The implementation creates different classes following what was done for ChatMemory and uses DefaultMemoryIdProvider to create an isolated cache for each user where possible (I don't know if this is enough to avoid possible data leakage).

An open point is what to do if the @Tools annotation is present.
Shouldn't @CacheResult be usable in this case?

TODO LIST:

  • Improve tests
  • Understand how to implement the cache concept for stream responses.
  • Write documentation

There is still work to be done, but I think there is enough code to understand if the direction taken is the right one or if changes need to be made.

@geoand
Collaborator

geoand commented Jun 6, 2024

An open point is what to do if the @Tools annotation is present.
Shouldn't @CacheResult be usable in this case?

I think it should

Understand how to implement the cache concept for stream responses.

This is a tricky one... Maybe we should skip it initially

@jmartisk
Collaborator

jmartisk commented Jun 6, 2024

If a prompt leads to tool executions, I think we shouldn't cache the response. Because if we then served the same response again without actually invoking the tools again, that would be really weird. (And re-invoking the tools is probably not a good idea either, they may be highly non-deterministic and non-idempotent and may change the outcome)

@geoand
Collaborator

geoand commented Jun 6, 2024

It's certainly weird either way

if (cache != null) {
    log.debug("Attempting to obtain AI response from cache");

    Embedding query = context.embeddingModel.embed(userMessage.text()).content();
Collaborator

IIUC here we should also check that there is no previous conversation (no user message in the memory), because we want to only support caching for the first message, right? Or are you checking this elsewhere already?
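
For illustration, the check suggested here could look at the chat memory and only consult the cache when no user message has been stored yet; the class and method names below are assumptions, not code from the PR.

import java.util.List;

import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;

// Hypothetical sketch of the "first prompt only" guard discussed above.
class FirstPromptGuard {

    // Returns true when the memory contains no prior user message,
    // i.e. the incoming prompt is the first one of the conversation.
    static boolean isFirstUserMessage(List<ChatMessage> memoryMessages) {
        return memoryMessages.stream().noneMatch(m -> m instanceof UserMessage);
    }
}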

Contributor Author

The cache is configurable with the property quarkus.langchain4j.cache.max-size (default 1). Should this be hidden from developers?

Collaborator

Maybe we're not understanding each other - cache max size is the maximum number of prompts stored in it, right?

But here I am talking about the fact that we should only attempt to use the cache when it's for the first prompt in the conversation, so that we avoid the situation that you described in #637 (comment)

Contributor Author

andreadimaio commented Jun 6, 2024

Yes, max-size is the maximum number of prompts stored, where a prompt is the combination of SystemMessage + UserMessage. Each cache has an id; when possible the id of the cache is unique (for example, if we are in a RequestScope context), otherwise the id will always be the value default.

Id with context:
<ContextId>#<Class>.<Method>

Id without context:
default

Having said that, I don't think we understand each other (but that's my problem 😵).
To avoid returning a wrong cached response like in the example in #637 (comment), is it not enough to have a max-size of 1?
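
For illustration, composing an id along the lines described above could look like this; the helper is hypothetical, not code from the PR.

// Hypothetical helper that builds the cache id described above.
class CacheIdSketch {

    static String cacheId(String contextId, String className, String methodName) {
        if (contextId == null) {
            return "default"; // no request/session context: the id falls back to "default"
        }
        return contextId + "#" + className + "." + methodName;
    }
}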

Collaborator

Having said that, I don't think we understand each other (but that's my problem 😵). To avoid returning a wrong cached response like in the example in #637 (comment), is it not enough to have a max-size of 1?

A single Cache instance contains prompts (system+user message pairs) from MULTIPLE conversations; and from each such conversation, it stores only the initial prompt, do I get that right? Therefore, the max-size refers to the number of conversations from which we store the initial prompt? But then that has nothing to do with how many messages have passed in the current conversation no?

Collaborator

jmartisk commented Jun 10, 2024

Hmm I thought we were going to make the cache shared between users, not scoped to each user individually.

Yes, that has potential security implications, but I don't find it very useful if we only consider cached results for each user separately - how often will a single user repeat the same questions again? Not very often, I think.

Scoping the cache to an HTTP request is too limiting too - one HTTP request usually involves just one prompt. We need to keep the cached responses for longer so that they can be reused in later conversations. If you scope it to a HTTP request, you dispose of it immediately after one request.

My understanding was we would have a cache for each AI service method, but its life would be application-scoped and it would store the first prompt of every conversation that starts by calling that method. How to avoid memory problems is an important question - set a reasonable TTL and periodically clean the cache, or maybe use weak references so that the GC can clean up entries over time (I'm not sure about that).

Contributor Author

My understanding was we would have a cache for each AI service method, but its life would be application-scoped and store the first prompt of every conversation that starts by calling that method.

Ok, so in this case the id of the cache will always be: #<ClassName>.<MethodName>
In the example above: #org.acme.example.LLMService.fiveLinesPoem and #org.acme.example.LLMService.tenLinesPoem for all the users.

How to avoid memory problems, that's an important question - set a reasonable TTL and periodically clean it, or I'm not sure we could use weak references to have the GC clean up stuff over time.

Yes, in this case it might be better to set a default TTL and, of course, document the fact that the developer can run into an out-of-memory problem if the TTL value is too high.
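
For illustration, with a default TTL in place the cache configuration might look like this (only quarkus.langchain4j.cache.max-size appears in the PR as written; the ttl key and values are assumptions based on this discussion):

quarkus.langchain4j.cache.max-size=2
quarkus.langchain4j.cache.ttl=5m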

Collaborator

Ok and the default max-size will surely be more than one, because we will store prompts from potentially different users. So actually, the max-size will probably be the main safeguard against eating too much memory

Contributor Author

andreadimaio commented Jun 10, 2024

Ok, let me go back to the example with the last changes we have in mind:

package org.acme.example;

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;

import io.quarkiverse.langchain4j.RegisterAiService;
// Note: @CacheResult is the annotation introduced by this PR; its package depends on where the PR finally places it.

@RegisterAiService
@CacheResult
@SystemMessage("You are a poet")
public interface LLMService {

    @UserMessage("Write a poem of 5 lines about {poem}")
    String fiveLinesPoem(String poem);

    @UserMessage("Write a poem of 10 lines about {poem}")
    String tenLinesPoem(String poem);
}

Configuration:

max-size: 2
ttl: 5m

ADM and JM want to create poems.

Two caches are created:

Cache#1:  #org.acme.example.LLMService.fiveLinesPoem
Cache#2:  #org.acme.example.LLMService.tenLinesPoem

Steps:

  1. ADM calls
    -> fiveLinesPoem("dog");
    -> No value in the cache;
    -> Calls the LLM;
    -> Stores the response in the Cache#1

  2. JM calls
    -> fiveLinesPoem("cat");
    -> No value in the cache;
    -> Calls the LLM;
    -> Stores the response in the Cache#1 [MAX-SIZE REACHED]

  3. ADM calls
    -> fiveLinesPoem("turtle");
    -> No value in the cache;
    -> Calls the LLM;
    -> The value is NOT cached [max size reached]

  4. ADM calls
    -> fiveLinesPoem("cat");
    -> Return the cached value. [Cache#1]

The two cached responses will be deleted after 5 minutes.
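
For illustration, a minimal sketch of the lookup/store behaviour walked through in these steps could look like the following; every name here is hypothetical, it ignores the system-message part of the prompt, and it is not the PR's actual implementation.

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.message.AiMessage;

// Hypothetical sketch: fixed max size, fixed TTL, similarity-based lookup.
class SemanticCacheSketch {

    private record Entry(Embedding query, AiMessage response, Instant storedAt) {}

    private final List<Entry> entries = new CopyOnWriteArrayList<>();
    private final int maxSize;      // 2 in the example above
    private final Duration ttl;     // 5 minutes in the example above
    private final double threshold; // minimum cosine similarity counted as a hit

    SemanticCacheSketch(int maxSize, Duration ttl, double threshold) {
        this.maxSize = maxSize;
        this.ttl = ttl;
        this.threshold = threshold;
    }

    // Steps 1-4: return a cached response if a semantically similar, non-expired query was stored.
    Optional<AiMessage> lookup(Embedding query) {
        Instant now = Instant.now();
        entries.removeIf(e -> e.storedAt().plus(ttl).isBefore(now)); // drop entries older than the TTL
        return entries.stream()
                .filter(e -> cosine(e.query().vector(), query.vector()) >= threshold)
                .map(Entry::response)
                .findFirst();
    }

    // Steps 1-3: store the response only while the max size has not been reached.
    void put(Embedding query, AiMessage response) {
        if (entries.size() >= maxSize) {
            return; // step 3: max size reached, the value is NOT cached
        }
        entries.add(new Entry(query, response, Instant.now()));
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}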

Collaborator

Yes this makes sense to me. But let's ask @cescoffier to confirm

@andreadimaio
Contributor Author

I'm also thinking about a cache property that allows the developer to decide which type of embedding to use, something like quarkus.langchain4j.cache.embedding-model=dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel or quarkus.langchain4j.cache.embedding-model=bam

@jmartisk
Collaborator

jmartisk commented Jun 6, 2024

I'm also thinking about a cache property that allows the developer to decide which type of embedding to use, something like quarkus.langchain4j.cache.embedding-model=dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel or quarkus.langchain4j.cache.embedding-model=bam

That should point not at the provider name, but at the name of the configured model, or whatever the proper term is here (I mean like the c1, c2, c3, ... in https://github.com/quarkiverse/quarkus-langchain4j/blob/0.15.1/integration-tests/multiple-providers/src/main/resources/application.properties). Hopefully it won't be too tricky to implement; our configuration model is quite complex already.
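
For illustration, assuming the named-model configuration style from that integration test also applies to embedding models, the proposal could end up looking like this (the property keys and values below are assumptions, not part of this PR):

# hypothetical named embedding model "e1"
quarkus.langchain4j.e1.embedding-model.provider=openai

# the proposed cache property would then reference the configured model by name
quarkus.langchain4j.cache.embedding-model=e1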

@geoand changed the title from "Enable sematic cache with @CacheResult" to "Enable semantic cache with @CacheResult" on Jun 6, 2024
@iocanel
Collaborator

iocanel commented Jun 6, 2024

An open point is what to do if the @Tools annotation is present. Shouldn't @CacheResult be usable in this case?

I think that the whole cache idea is that, in general, similar questions should produce similar results, so caching makes sense in some cases. When tools are involved this is less likely to happen, as we have absolutely no idea what response the tool will return even for the exact same input.

So, I'd say that it should not cache the result, unless the user somehow defines that the tool is idempotent (always produces the same output for a given input) or cacheable itself.

@andreadimaio
Contributor Author

Do you think it would make sense to enable caching for the Streaming API if the method or class has the @Blocking annotation?

@geoand
Collaborator

geoand commented Jun 14, 2024

I think we can leave streaming out for a first version and think about it some more
