# Deep Memory trained on Synthetic Queries improves recall@10 by +20%

Training Deep Memory requires labelled data: pairs of queries and the documents relevant to them. Such data is often hard to obtain when you are starting from scratch.

In this tutorial, we take an existing dataset and use GPT to generate synthetic queries for training Deep Memory.
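
As a minimal sketch of what "query and relevance pairs" means here: each query is paired with a list of `(document_id, score)` tuples, the same shape that `generate_queries` builds later in this tutorial. The queries and ids below are invented for illustration, and `validate_pairs` is a hypothetical helper, not part of Deep Lake.

```python
# Illustrative only: these queries and ids are made up, not taken from the dataset.
queries = [
    "What did the author work on before college?",
    "Why did the author start painting?",
]

# One entry per query: a list of (document_id, relevance_score) tuples.
relevance = [
    [("node_0", 1)],
    [("node_17", 1)],
]


def validate_pairs(queries, relevance):
    """Check that every query has at least one well-formed (id, score) label."""
    if len(queries) != len(relevance):
        raise ValueError("queries and relevance must have the same length")
    for labels in relevance:
        if not labels:
            raise ValueError("each query needs at least one relevant document")
        for doc_id, score in labels:
            if not isinstance(doc_id, str) or score <= 0:
                raise ValueError(f"bad label: {(doc_id, score)!r}")
    return True


print(validate_pairs(queries, relevance))
```

Running a check like this before training can catch mismatched list lengths early, before any embeddings are computed.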

## 0. Setup packages and credentials
Install Necessary Packages

In [None]:
%pip install -q llama-index openai cohere llama-index-readers-wikipedia wikipedia llama-index-vector-stores-deeplake python-dotenv langchain-openai deeplake==3.9.27




Setup Activeloop and OpenAI

In [5]:
import os
from dotenv import load_dotenv


load_dotenv("../.env")
assert os.getenv("OPENAI_API_KEY")
assert os.getenv("ACTIVELOOP_TOKEN")
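
A bare `assert` fails without saying which variable is missing. As an optional sketch, the hypothetical helper `missing_env` below (not part of any library used here) lists the unset names instead:

```python
import os


def missing_env(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


required = ["OPENAI_API_KEY", "ACTIVELOOP_TOKEN"]
missing = missing_env(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```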

## 1. Load the dataset and create a Deep Lake vector store

We are going to use GPT-3.5 to generate questions based on the context provided by a chunk of text. First, we download the essay, split it into chunks, and index the chunks in a vector store.

In [None]:
!mkdir -p "data/paul_graham/"
!curl "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -o "data/paul_graham/paul_graham_essay.txt"



In [7]:
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core import SimpleDirectoryReader


documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# By default, node (chunk) ids are random UUIDs. To get the same ids on every run, we set them manually.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

print(f"Number of Documents: {len(documents)}")
print(f"Number of nodes: {len(nodes)} with the current chunk size of {node_parser.chunk_size}")

Number of Documents: 1
Number of nodes: 64 with the current chunk size of 512


In [8]:
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI


# Create a DeepLakeVectorStore locally to store the vectors
dataset_path = "./data/paul_graham/deep_lake_db"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)

# LLM that will answer questions with the retrieved context
llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = OpenAIEmbedding()

storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex(
    nodes,
    storage_context=storage_context,
    embed_model=embed_model,
    llm=llm,
    show_progress=True
)

Generating embeddings: 100%|██████████| 64/64 [00:01<00:00, 43.35it/s]

Uploading data to deeplake dataset.



100%|██████████| 64/64 [00:00<00:00, 709.24it/s]

Dataset(path='./data/paul_graham/deep_lake_db', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (64, 1)      str     None   
 metadata     json      (64, 1)      str     None   
 embedding  embedding  (64, 1536)  float32   None   
    id        text      (64, 1)      str     None   





Now let's upload the local vector store to Activeloop's platform and then convert it into a managed database.

In [None]:
import deeplake


local = "./data/paul_graham/deep_lake_db"
# Replace "yaroslava" with your own Activeloop organization or username
hub_path = "hub://yaroslava/optimization_paul_graham_0"
hub_managed_path = "hub://yaroslava/optimization_paul_graham_managed_0"


# First upload our local vector store
deeplake.deepcopy(local, hub_path, overwrite=True)
# Create a managed vector store under a different name
deeplake.deepcopy(hub_path, hub_managed_path, overwrite=True, runtime={"tensor_db": True})

Copying dataset: 96%|█████████▋| 27/28 [00:08<00:00


This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/yaroslava/optimization_paul_graham_0
Your Deep Lake dataset has been successfully created!


Copying dataset: 96%|█████████▋| 27/28 [00:15<00:00


This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/yaroslava/optimization_paul_graham_managed_0
Your Deep Lake dataset has been successfully created!


Dataset(path='hub://yaroslava/optimization_paul_graham_managed_0', tensors=['embedding', 'id', 'metadata', 'text'])

## 2. Generate a dataset of Queries and Documents

In [11]:
# Fetch the documents and ids from the existing managed dataset (alternatively, you could ingest them again)
db = DeepLakeVectorStore(dataset_path=hub_managed_path, overwrite=False, read_only=True)
docs = db.vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)["value"]
ids = db.vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)["value"]
print(len(docs))

Deep Lake Dataset in hub://yaroslava/optimization_paul_graham_managed_0 already exists, loading from the storage
64


In [13]:
docs[0]
ids[0]

'node_0'

In [14]:
from openai import OpenAI


client = OpenAI()


def generate_question(text: str) -> str:
    """Ask the model for one question that the given chunk of text can answer."""
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[
                {
                    "role": "system",
                    # Adjacent string literals avoid embedding indentation in the prompt
                    "content": (
                        "You are a world class expert for generating questions based on provided context. "
                        "You make sure the question can be answered by the text."
                    ),
                },
                {
                    "role": "user",
                    "content": text,
                },
            ],
        )
        return response.choices[0].message.content
    except Exception:
        # Sentinel value: callers should skip chunks for which generation failed
        return "No question generated"


In [15]:
import random
from tqdm import tqdm


def generate_queries(docs: list[str], ids: list[str], n: int) -> tuple[list[str], list[list[tuple[str, int]]]]:
    questions = []
    relevances = []
    pbar = tqdm(total=n)
    while len(questions) < n:
        # 1. Randomly draw a chunk of text together with its id
        r = random.randint(0, len(docs) - 1)
        text, label = docs[r], ids[r]

        # 2. Generate a question for the chunk; on failure, retry with a new chunk
        generated_qs = [generate_question(text)]
        if generated_qs == ["No question generated"]:
            print("No question generated")
            continue

        # 3. Label each question with the id of its source chunk (relevance score 1)
        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
        pbar.update(len(generated_qs))

    pbar.close()
    return questions[:n], relevances[:n]

# Here we choose to generate 40 questions
questions, relevances = generate_queries(docs, ids, n=40)
print(len(questions))
print(questions[0])


40
What led the founders of Y Combinator to create the Summer Founders Program?
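
Because the chunks are drawn at random with replacement, 40 draws from 64 chunks necessarily leave some chunks without a query, and some chunks may be drawn more than once. The sketch below shows one way to inspect that spread; `chunk_coverage` is a hypothetical helper and the toy values are made up, but the input shapes match `relevances` and `ids` above.

```python
from collections import Counter


def chunk_coverage(relevances, all_ids):
    """Summarise how the generated queries are spread over the chunks."""
    counts = Counter(label for labels in relevances for label, _ in labels)
    return {
        "distinct_chunks": len(counts),
        "total_chunks": len(all_ids),
        "max_queries_per_chunk": max(counts.values(), default=0),
        "uncovered": [i for i in all_ids if i not in counts],
    }


# Toy data in the same shape as `relevances` and `ids` (values are made up).
toy_relevances = [[("node_0", 1)], [("node_2", 1)], [("node_0", 1)]]
toy_ids = ["node_0", "node_1", "node_2"]
print(chunk_coverage(toy_relevances, toy_ids))
```

In the notebook, the same call would be `chunk_coverage(relevances, ids)`.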





## 3. Train Deep Memory

In [16]:
from langchain_openai import OpenAIEmbeddings


openai_embeddings = OpenAIEmbeddings()

job_id = db.vectorstore.deep_memory.train(
    queries=questions,
    relevance=relevances,
    embedding_function=openai_embeddings.embed_documents,
)

Starting DeepMemory training job
Your Deep Lake dataset has been successfully created!


 

Preparing training data for deepmemory:


Creating 40 embeddings in 1 batches of size 40:: 100%|██████████| 1/1 [00:09<00:00,  9.79s/it]


DeepMemory training job started. Job ID: 68b9696b2fed3f602c85a7e6


In [27]:
# Use the job id returned by train() instead of hard-coding it
db.vectorstore.deep_memory.status(job_id)

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/yaroslava/optimization_paul_graham_managed_0
--------------------------------------------------------------
|                  68b9696b2fed3f602c85a7e6                  |
--------------------------------------------------------------
| status                     | pending                       |
--------------------------------------------------------------
| progress                   | None                          |
--------------------------------------------------------------
| results                    | not available yet             |
--------------------------------------------------------------




Wait until the training status changes from `pending` to `completed` before moving on to evaluation.
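
If you would rather not re-run the status cell by hand, a generic polling loop is one option. The sketch below makes no assumption about what `deep_memory.status` returns: `is_done` is a hypothetical callable that you would have to supply yourself, and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until(is_done, timeout_s=1800.0, interval_s=30.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call is_done() every interval_s seconds until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.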

## 4. Evaluate Deep Memory

### 4.1 Manual

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI  # re-import: `from openai import OpenAI` in Section 2 shadowed this name


query = "What are the main things Paul worked on before college?"

llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = OpenAIEmbedding()

# Set the LLM globally; Settings replaces the deprecated ServiceContext
Settings.llm = llm

# Build the index on top of the managed vector store that Deep Memory was trained on
db = DeepLakeVectorStore(dataset_path=hub_managed_path, overwrite=False, read_only=True)
vector_index = VectorStoreIndex.from_vector_store(db, embed_model=embed_model, show_progress=True)

In [None]:
query_engine = vector_index.as_query_engine(similarity_top_k=3, vector_store_kwargs={"deep_memory": False})
response_vector = query_engine.query(query)
print(response_vector.response)


In [None]:
query_engine = vector_index.as_query_engine(similarity_top_k=3, vector_store_kwargs={"deep_memory": True})
response_vector = query_engine.query(query)
print(response_vector.response)
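
Comparing the two generated answers by eye can hide what actually changed in retrieval. As a rough sketch, the hypothetical helper below compares the chunk ids returned by each mode. The ids here are made up; in the notebook they could probably be collected with `[n.node.node_id for n in response_vector.source_nodes]` after each query (an assumption about the LlamaIndex response object), saved under different names because both cells reuse `response_vector`.

```python
def compare_retrievals(baseline_ids, deep_memory_ids):
    """Report which chunk ids the two retrieval modes share and where they differ."""
    baseline, boosted = set(baseline_ids), set(deep_memory_ids)
    return {
        "shared": sorted(baseline & boosted),
        "only_baseline": sorted(baseline - boosted),
        "only_deep_memory": sorted(boosted - baseline),
    }


# Made-up ids for illustration only.
print(compare_retrievals(["node_0", "node_1", "node_5"], ["node_0", "node_2", "node_5"]))
```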

### 4.2 Quantitative Evaluation on Synthetically generated queries

In [None]:
validation_questions, validation_relevances = generate_queries(docs, ids, n=40)

In [None]:
recalls = db.vectorstore.deep_memory.evaluate(
    queries=validation_questions,
    relevance=validation_relevances,
    embedding_function=openai_embeddings.embed_documents,
)
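
Since every synthetic query is labelled with exactly one relevant chunk, recall@k reduces to the fraction of queries whose source chunk appears among the top k results. The sketch below illustrates that definition on made-up rankings; it is not Deep Lake's implementation, and `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose relevant id appears in the top-k retrieved ids.

    retrieved: one ranked list of ids per query
    relevant:  the single relevant id for each query
    """
    if not relevant:
        return 0.0
    hits = sum(1 for ranked, rel in zip(retrieved, relevant) if rel in ranked[:k])
    return hits / len(relevant)


# Toy rankings (made up) for three queries.
relevant = ["node_0", "node_3", "node_7"]
baseline = [["node_1", "node_0"], ["node_4", "node_5"], ["node_7", "node_2"]]
deep_mem = [["node_0", "node_1"], ["node_3", "node_5"], ["node_7", "node_2"]]

for k in (1, 2):
    before = recall_at_k(baseline, relevant, k)
    after = recall_at_k(deep_mem, relevant, k)
    print(f"recall@{k}: {before:.2f} -> {after:.2f} ({after - before:+.2f})")
```

Note that the validation queries are generated from the same chunks as the training queries, so this evaluation measures improvement on the same document set rather than on unseen documents.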