Persistent knowledge graph memory for AI agents — drop in with two lines, runs in the background, no external database required.
pocket-mem gives any AI agent a long-term memory that works like a knowledge graph. When your agent has a conversation, pocket-mem silently extracts the people, tools, decisions, and relationships mentioned and stores them in a structured local database. The next time your agent needs context — even sessions later — it can recall exactly who David is, what tools he recommended, and what you decided about the database last Tuesday.
It works with any agent, any LLM, and any Python project. There is no server to run, no cloud account to create, and no API key required beyond your LLM of choice.
- How it works
- System requirements
- Installation
- Setting up Ollama
- Choosing a model
- Quick start
- Wiring memory into your agent
- Recall modes
- Visualizing memory
- Sharing memory
- Storage options
- Identity
Every time your agent receives a message and produces a response, you call agent.observe(). This runs in a background thread and never blocks your agent. Under the hood it:
- Classifies the content into topics (like "People I Know", "Dev Tools", "Decisions")
- Extracts named entities — people, tools, projects — and the typed relationships between them
- Stores everything in a local SQLite knowledge graph with vector embeddings for semantic search
When your agent needs memory, you call agent.recall(). This runs a hybrid search — keyword matching plus semantic vector similarity — and returns results in whichever format you need.
The result is an agent that remembers across sessions without you managing any of it.
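Here is what that loop looks like in practice. A minimal sketch: the project name and conversation content are illustrative, and the Quick start below covers the full API.

```python
import time
from pocket_mem import MemoryAgent

agent = MemoryAgent(project="demo")

# observe() hands the turn to a background thread, so the call returns almost immediately
start = time.perf_counter()
agent.observe(
    user_input="Sarah suggested we ship the beta on Friday",
    agent_response="Noted, beta target is Friday.",
)
print(f"observe() returned in {time.perf_counter() - start:.3f}s")

# Later (in this session or a future one), recall() runs the hybrid keyword + vector search
print(agent.recall("When is the beta shipping?", mode="context"))
```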
**Minimum (CPU only):**

- Python 3.10+
- 8 GB RAM
- ~500 MB disk for the embedding model
- Any modern CPU (4+ cores recommended)

**Recommended (local GPU):**

- Python 3.10+
- 16 GB RAM
- NVIDIA GPU with at least 6 GB VRAM
- NVIDIA drivers 525+ and CUDA 12.1+

**Best performance:**

- 32 GB RAM
- NVIDIA GPU with 12 GB VRAM (RTX 3060, RTX 3070 Ti, RTX 4060 Ti 16GB, or better)
Important — all-or-nothing GPU rule: Ollama either loads the entire model onto your GPU or falls back to CPU. Partial offloading (splitting layers between GPU and CPU) is actually slower than pure CPU because of PCIe transfer overhead. If the model doesn't fit in your VRAM, see Choosing a model to pick a smaller model that does.
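To check where a model actually ended up, you can use standard Ollama and NVIDIA tooling (not part of pocket-mem; output columns vary by version):

```bash
# With a model loaded, the PROCESSOR column shows whether it is running
# fully on GPU, fully on CPU, or split between the two.
ollama ps

# On NVIDIA systems, compare the model size against available VRAM.
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
```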
```bash
pip install pocket-mem
```

That installs the package and the all-MiniLM-L6-v2 embedding model (~22 MB, runs locally on CPU). The only additional setup is an LLM — see the next section.
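To confirm the package installed correctly before setting up an LLM (a quick sanity check, nothing pocket-mem-specific):

```bash
pip show pocket-mem
python -c "import pocket_mem; print('pocket-mem imports cleanly')"
```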
Ollama is the recommended way to run a local LLM. It's free, runs entirely on your machine, and pocket-mem connects to it automatically with no configuration.
Linux or WSL2:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

macOS:

```bash
brew install ollama
```

Or download the app from ollama.com/download.
Windows: Download and run the installer from ollama.com/download.
Start the Ollama server:

```bash
ollama serve
```

Leave this running in a terminal. Ollama listens on http://localhost:11434. On Linux you can run it as a background service instead:

```bash
sudo systemctl enable ollama
sudo systemctl start ollama
```

Then pull the default model:

```bash
ollama pull qwen2.5:7b
```

This downloads the model (~4.7 GB compressed). You only need to do this once.
Verify the setup:

```bash
ollama list
# Should show qwen2.5:7b in the list

ollama run qwen2.5:7b "Respond with valid JSON: {\"status\": \"ok\"}"
# Should return a JSON response
```

If you get a JSON response, Ollama is set up correctly and pocket-mem will work.
The default model is qwen2.5:7b. It excels at structured JSON output — the most critical capability for accurately extracting entities and relationships from text. If your hardware can run it, use it.
```bash
ollama pull qwen2.5:7b
```

This runs fully on GPU with 6 GB+ VRAM, or falls back to CPU at 4–6 t/s. On Apple Silicon, your full system RAM is available so 16 GB+ M-series Macs are ideal.
If qwen2.5:7b doesn't fit in your VRAM or runs too slowly, you can step down:
```bash
ollama pull qwen2.5:3b
```

Warning: The 3B model will extract less detail from conversations and is more prone to hallucination during ingestion. Entity extraction and relationship mapping will be noticeably less accurate. Use it only if you cannot run the 7B model.
Then point pocket-mem at the smaller model:

```python
from pocket_mem import MemoryAgent, LLMConfig

agent = MemoryAgent(
    project="my-app",
    llm=LLMConfig(model="qwen2.5:3b")
)
```

You can use any cloud LLM for both ingestion and recall. pocket-mem uses the OpenAI-compatible chat completions API, so it works with any provider that exposes it.
Cloud models give you the highest extraction quality with no local GPU requirement.
```python
import os
from pocket_mem import MemoryAgent, LLMConfig

# Anthropic Claude Haiku — excellent JSON extraction, low cost
agent = MemoryAgent(
    project="my-app",
    llm=LLMConfig(
        base_url="https://api.anthropic.com/v1",
        model="claude-haiku-4-5-20251001",
        api_key=os.environ["ANTHROPIC_API_KEY"]
    )
)

# OpenAI GPT-4o Mini
agent = MemoryAgent(
    project="my-app",
    llm=LLMConfig(
        base_url="https://api.openai.com/v1",
        model="gpt-4o-mini",
        api_key=os.environ["OPENAI_API_KEY"]
    )
)

# Any OpenAI-compatible provider (Groq, Together AI, Mistral, etc.)
agent = MemoryAgent(
    project="my-app",
    llm=LLMConfig(
        base_url="https://api.groq.com/openai/v1",
        model="llama-3.1-8b-instant",
        api_key=os.environ["GROQ_API_KEY"]
    )
)
```

A minimal end-to-end example:

```python
from pocket_mem import MemoryAgent

# Creates a ./memory/ folder in your project directory automatically
agent = MemoryAgent(project="my-app")

# Store a conversation turn — non-blocking, returns immediately
agent.observe(
    user_input="My boss David recommended I try Cursor IDE for coding",
    agent_response="Got it, I'll keep that in mind."
)

# Recall memory as context — best method for injecting into system prompts
context = agent.recall("What tools has David recommended?", mode="context")
print(context)
# → "Entity: David (boss) — recommended Cursor IDE for coding."

# See what topics are stored in memory
print(agent.topics())
# → ["People I Know", "AI Tools"]

# Inspect the raw graph (useful for debugging)
import json
print(json.dumps(agent.recall("David", mode="raw"), indent=2))
```

There are two patterns for connecting pocket-mem to your existing LLM or agent. Both use the same API.
Pattern A: call recall() before every LLM call and inject the results into your system prompt. Your agent always has relevant memory in context without needing to explicitly ask for it.
```python
from pocket_mem import MemoryAgent

memory = MemoryAgent(project="my-app")

def chat(user_message: str) -> str:
    # 1. Retrieve relevant memory for this message
    context = memory.recall(user_message, mode="context")

    # 2. Inject into your system prompt
    system_prompt = f"""You are a helpful coding assistant.

## What you remember from past conversations
{context}

Use this context to give personalized, informed responses.
"""

    # 3. Call your LLM as normal
    response = your_llm.chat(system=system_prompt, user=user_message)

    # 4. Store this turn (non-blocking)
    memory.observe(user_input=user_message, agent_response=response)

    return response
```

This works with any LLM — OpenAI, Anthropic, local models, LangChain, anything that accepts a system prompt.
Pattern B: expose recall to your LLM as a callable tool. The model decides when memory is relevant and calls it on demand.
```python
from pocket_mem import MemoryAgent

memory = MemoryAgent(project="my-app")

# Get an OpenAI-compatible tool definition — works with any compatible API
tools = [memory.as_tool()]

# Handle the tool call in your agent loop
def handle_tool_call(tool_name: str, args: dict) -> str:
    if tool_name == "recall_memory":
        return memory.recall(
            query=args["query"],
            mode=args.get("mode", "context")
        )
```

| Agent type | Use |
|---|---|
| Conversational assistant | Pattern A |
| Coding assistant | Pattern A |
| Autonomous agent | Pattern B |
| Research agent | Pattern B |
| Not sure | Pattern A — simpler, works for most cases |
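If you wire Pattern B into an OpenAI-style chat completions loop, the dispatch might look roughly like the sketch below. The openai client, model name, and message plumbing are illustrative assumptions; only as_tool(), handle_tool_call(), and the recall_memory tool name come from the Pattern B example above.

```python
import json
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "What tools has David recommended?"}]

# Offer the recall tool to the model and let it decide whether to call it
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,  # [memory.as_tool()] from the Pattern B example
)

msg = response.choices[0].message
if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        # Dispatch to pocket-mem and feed the recalled memory back to the model
        result = handle_tool_call(call.function.name, json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(final.choices[0].message.content)
```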
The mode parameter controls what recall() returns.
mode="context" returns a formatted string of relevant memories ready to inject directly into a system prompt. No LLM call — purely graph retrieval and formatting. This is the best way to use pocket-mem. It's fast, deterministic, and works with any downstream LLM you're already using.
```python
context = memory.recall("What database did we decide on?", mode="context")
# → "Decision (Jan 14): Chose PostgreSQL over SQLite — needs concurrent writes."

# Inject directly into your system prompt:
system = f"You are a helpful assistant.\n\n## Memory\n{context}"
```

mode="answer" makes an LLM call to synthesize a natural language answer directly from memory. Best used when the user is asking a memory-specific question and you want pocket-mem to answer it directly rather than injecting context into another model.
```python
answer = memory.recall("Who recommended httpx?", mode="answer")
# → "David, your boss, mentioned httpx is better than requests for async HTTP work."
```

For best results, use Claude Haiku. In benchmarks against the Veloris dataset (50 scored questions across direct lookup, single-hop, and multi-hop categories, plus 10 unanswerable), using qwen2.5:7b for ingestion and Claude Haiku for synthesis achieves a 2-run average of 98% accuracy on answerable questions (peak 99%), with zero false positives on unanswerable ones. See tests/simulation/first_sim_test_50_q/BENCHMARK.md for the full results.

Local models like qwen2.5:7b can answer memory questions but are more prone to synthesizing plausible-sounding answers that aren't supported by the stored facts.
If you want to keep qwen2.5:7b for ingestion (fast, free, local) but use Claude Haiku only when mode="answer" is called, set the answer_* fields separately:
```python
import os
from pocket_mem import MemoryAgent, LLMConfig

agent = MemoryAgent(
    project="my-app",
    llm=LLMConfig(
        # Ingestion — local Ollama, used for observe() and recall(mode="context")
        base_url="http://localhost:11434/v1",
        model="qwen2.5:7b",

        # Answer model — only used when recall(mode="answer") is called
        answer_base_url="https://api.anthropic.com/v1",
        answer_model="claude-haiku-4-5-20251001",
        answer_api_key=os.environ["ANTHROPIC_API_KEY"],
    )
)

answer = agent.recall("What did we decide about the auth system?", mode="answer")
```

If you omit the answer_* fields, mode="answer" uses the same model as ingestion. If you set base_url and model directly to a cloud model, that model is used for everything including extraction.
mode="raw" returns the raw graph data as a Python list of nodes and edges. No LLM call, no formatting. Use this to debug what's actually stored.
```python
import json

data = memory.recall("David", mode="raw")
print(json.dumps(data, indent=2))
```

pocket-mem includes a built-in graph explorer that opens in your browser. It shows every node and edge in your memory graph, with filtering by topic, node type, date, and keyword search.
```bash
# Open the visualizer for the default memory path
pocket-mem show

# Specify a project
pocket-mem show --project my-app

# Specify a path
pocket-mem show --path 'path'

# Filter to a specific topic
pocket-mem show --project my-app --topic "People I Know"

# Filter by node type
pocket-mem show --project my-app --type entity

# Show only nodes updated in the last 7 days
pocket-mem show --project my-app --since 7d

# Pre-fill the search bar
pocket-mem show --project my-app --search David
```

The visualizer is read-only and requires no additional dependencies beyond the base install.
Memory is stored as a portable file. You can share your agent's full context with someone else.
Package your memory into a single .mempack file and send it to a colleague. Their agent picks up exactly where yours left off — all the people, decisions, tools, and relationships your agent has learned.
```python
# Export your memory
agent = MemoryAgent(project="my-project")
agent.export("project_memory.mempack")

# Your colleague imports it on their machine
their_agent = MemoryAgent(project="my-project")
their_agent.import_pack("project_memory.mempack")

# Their agent now has all your memory
print(their_agent.recall("What database did we decide on?", mode="answer"))
```

A .mempack file is a zip archive containing the SQLite database. It's self-contained and portable — you can email it, commit it to version control as a checkpoint, or back it up like any other file.
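Because the pack is an ordinary zip archive, you can peek inside one with the standard library to see what you're about to share (the exact file names inside depend on your project):

```python
import zipfile

# List the contents of an exported pack; it should contain the project's SQLite database
with zipfile.ZipFile("project_memory.mempack") as pack:
    for name in pack.namelist():
        print(name)
```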
By default pocket-mem creates a memory/ folder in your current directory and stores the database there. No configuration needed.
```
your-project/
├── main.py
├── memory/          ← created automatically
│   └── my-app.db
└── ...
```
Change the storage location:
```python
# Different local directory
agent = MemoryAgent(project="my-app", path="./data/memory/")

# Absolute path
agent = MemoryAgent(project="my-app", path="/home/user/shared-memory/")
```

Cloud storage (shared multi-user memory graphs) is planned for v2.
An identity tells pocket-mem who the agent is and what it should care about. When set, it shapes how memories are prioritized and surfaced at retrieval time — high-signal entities get boosted, and the memory graph is pre-seeded with topic buckets that match the agent's domain.
Identity is optional and backward-compatible. Agents without an identity work exactly as before.
```python
from pocket_mem import MemoryAgent, MemoryConfig, IdentityConfig

config = MemoryConfig(
    identity=IdentityConfig(
        description=(
            "Paralegal at a litigation law firm. I read all case notes to track "
            "clients, opposing parties, deadlines, damages, and filings so I can "
            "brief attorneys on case status."
        )
    )
)

agent = MemoryAgent(project="my-case-files", config=config)
```

One sentence describing the agent's role and what it pays attention to is enough. The library derives seed topics, entity priorities, and importance signals automatically.
Identity shaping happens at retrieval time, not at extraction. Every observation is stored the same way regardless of identity. When recall() is called, the importance scorer gives more weight to entities and relationships that match the configured role — a paralegal's agent surfaces client names and deadlines more prominently than passing remarks.
This design means you can add, change, or remove an identity at any time without re-ingesting your data.
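For instance, you could reopen the same project later with a different role and recall() simply re-weights what it surfaces; nothing is re-processed. A sketch reusing the config API shown above; the executive-assistant description and query are illustrative:

```python
from pocket_mem import MemoryAgent, MemoryConfig, IdentityConfig

# Same project, new identity: existing memories are re-ranked at retrieval time,
# not re-ingested.
new_config = MemoryConfig(
    identity=IdentityConfig(
        description="Executive assistant tracking people, projects, decisions, and risks."
    )
)
agent = MemoryAgent(project="my-case-files", config=new_config)
print(agent.recall("What deadlines are coming up?", mode="context"))
```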
For common roles, pocket-mem matches your description against a set of prebuilt configurations automatically — no API call needed:
| Role | What it tracks |
|---|---|
| Paralegal | Clients, opposing parties, filings, deadlines, damages, settlements |
| Executive Assistant | People, projects, decisions, costs, risks, technical initiatives |
| Personal AI Assistant | Schedule, reminders, contacts, shopping lists, financial tasks |
For roles not covered by a prebuilt, pass a derivation_api_key to derive a configuration via LLM:
```python
import os
from pocket_mem import MemoryAgent, MemoryConfig, IdentityConfig

config = MemoryConfig(
    identity=IdentityConfig(
        description="Customer support agent for a B2B SaaS product...",
        derivation_api_key=os.environ["GEMINI_API_KEY"],
    )
)
```

Derived configurations are cached in the memory store so the LLM call only happens once per unique description.
License: MIT