Simple full-stack demo that shows a semantic cache layer using OpenAI embeddings.
- Embeddings: `text-embedding-3-small` (cheap and good for clustering/caching)
- LLM: `gpt-4o-mini` by default
- Cache: JSON file at `data/cache.json` (example entry below)
- Endpoints: `POST /api/query` first checks the cache, then calls the LLM on a miss
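For reference, the cache is just JSON on disk. A single entry might look roughly like the sketch below; this is a guess at the layout based on the stored `{ prompt, embedding, response }` shape described later, so the real file may differ.

```json
[
  {
    "prompt": "What is the weather like in New York today?",
    "embedding": [0.0123, -0.0456, 0.0789],   // truncated — text-embedding-3-small vectors have 1536 dimensions
    "response": "..."
  }
]
```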
1. Install: `npm install`
2. Configure env: `cp .env.example .env`, then edit `.env` and set `OPENAI_API_KEY` (see the `.env` sketch below)
3. Run: `npm start`
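A minimal `.env` needs only the API key; anything else in `.env.example` is project-specific. A sketch:

```bash
# .env — sketch only; start from .env.example and paste in your real key
OPENAI_API_KEY=sk-...
```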
- Cache lookup: the incoming prompt is embedded via `text-embedding-3-small`. We compute cosine similarity against the stored embeddings and pick the best of the top `topK` candidates. If the similarity is >= `threshold`, it's a cache hit (see the sketch after this list).
- Cache miss: we call the chat model, return the response, and store `{ prompt, embedding, response }` for future hits.
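To make the lookup step concrete, here is a minimal sketch of the similarity math in plain Node. The function and variable names (`cosineSimilarity`, `findCacheHit`) are illustrative, not the demo's actual source.

```js
// Sketch of the cache-lookup step (illustrative names, not the demo's actual code).

// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every cached entry, keep the top-K candidates by similarity,
// and treat the best one as a hit only if it clears the threshold.
function findCacheHit(queryEmbedding, cacheEntries, threshold = 0.9, topK = 3) {
  const candidates = cacheEntries
    .map(entry => ({ entry, similarity: cosineSimilarity(queryEmbedding, entry.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);

  const best = candidates[0];
  return best && best.similarity >= threshold ? best : null; // null => cache miss, call the LLM
}
```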
- `POST /api/query` (client sketch after the endpoint list)

  Request body:

  ```json
  {
    "prompt": "How's the weather in NYC right now?",
    "threshold": 0.9,
    "topK": 3
  }
  ```

  Response:

  ```json
  {
    "source": "cache" | "llm",
    "cacheHit": true | false,
    "similarity": 0.9432, // present on hits
    "matchedPrompt": "What is the weather like in New York today?", // on hits
    "response": "...",
    "prompt": "..."
  }
  ```
- `GET /api/cache` returns the cache content.
- `DELETE /api/cache` clears the cache.
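For quick testing outside the UI, the endpoints can be exercised from any JavaScript environment with `fetch` (an ES module on Node 18+, or a browser console, since the snippet uses top-level `await`). The base URL is an assumption; use whatever host and port `npm start` reports.

```js
// Sketch of client calls against the demo's API.
// Assumption: the server listens on http://localhost:3000 — adjust to your actual port.
const BASE = 'http://localhost:3000';

// Query: checks the semantic cache first, falls back to the LLM on a miss.
const res = await fetch(`${BASE}/api/query`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: "How's the weather in NYC right now?",
    threshold: 0.9,
    topK: 3,
  }),
});
const data = await res.json();
console.log(data.source, data.cacheHit, data.response);

// Inspect the current cache contents.
console.log(await (await fetch(`${BASE}/api/cache`)).json());

// Clear the cache.
await fetch(`${BASE}/api/cache`, { method: 'DELETE' });
```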
1. Clear the cache with the UI button.
2. Send: "What is the weather like in New York today?" → the first time will be a miss (the LLM is called).
3. Send: "How's the weather in NYC right now?" → should be a cache hit (similar meaning).
4. Try unrelated prompts → they miss and then build up the cache.
- You can tweak `threshold` and `topK` in the UI. Start with a threshold of 0.90: raising it reduces false hits; lowering it increases reuse.
- This demo stores responses verbatim in a local JSON file. For production, use a vector DB and handle privacy/security appropriately.