# Minimal dependency RAG with DeepSeek and Qdrant

Large Language Models (LLMs) have made significant strides in understanding and generating human-like text. Their factual accuracy, however, improves considerably when they are integrated with external knowledge sources.

Retrieval Augmented Generation (RAG) is a framework that combines LLMs with real-time retrieval of relevant information, ensuring more accurate and contextually relevant outputs.

In this example, we'll showcase an implementation using the latest [DeepSeek-V3](https://www.deepseek.com) model. It leads the way among open-source models and competes with the best closed-source models worldwide.

## Prerequisites

Let's start by setting up all the pieces needed to implement the pipeline. We'll try to do this with minimal dependencies.

### Preparing the environment

In [None]:
%pip install "qdrant-client[fastembed]"

[Qdrant](https://qdrant.tech) will act as a knowledge base providing the context information for the prompts we'll be sending to the LLM.

You can get a free-forever Qdrant cloud instance at http://cloud.qdrant.io. Learn about setting up your instance from the [Quickstart](https://qdrant.tech/documentation/quickstart-cloud/).

In [None]:
QDRANT_URL = "https://xyz-example.eu-central.aws.cloud.qdrant.io:6333"
QDRANT_API_KEY = "<your-api-key>"

### Instantiating Qdrant Client

In [None]:
from qdrant_client import QdrantClient, models

client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

### Building the knowledge base

Qdrant will use vector embeddings of our facts to enrich the original prompt with some context. Thus, we need to store the vector embeddings and the facts used to generate them.
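To make the idea of "vectors plus the facts behind them" concrete, here is a minimal pure-Python sketch of what a vector store does conceptually. The three-dimensional vectors, the `points` list, and the `nearest` helper are toy assumptions for illustration only; Qdrant uses real embeddings and an indexed search rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy data: each point keeps its vector AND the original fact.
points = [
    {"id": 0, "vector": [0.9, 0.1, 0.0], "payload": {"document": "a fact about vector search"}},
    {"id": 1, "vector": [0.0, 0.2, 0.9], "payload": {"document": "a fact about job scheduling"}},
]


def nearest(query_vector, points, limit=1):
    """Brute-force ranking of stored points by similarity to the query."""
    ranked = sorted(
        points,
        key=lambda p: cosine_similarity(query_vector, p["vector"]),
        reverse=True,
    )
    return ranked[:limit]


best = nearest([1.0, 0.0, 0.0], points)[0]
print(best["payload"]["document"])
```

Storing the text in the payload is what lets us hand the matching fact, not just its vector, to the LLM later.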

We'll be using the [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) model via [FastEmbed](https://github.com/qdrant/fastembed/), a lightweight, fast Python library for generating embeddings.

The Qdrant client provides a handy integration with FastEmbed that makes building a knowledge base very straightforward.

First, we need to create a collection so that Qdrant knows what kind of vectors it will be dealing with. Then we pass our raw documents,
wrapped in `models.Document`, to compute and upload the embeddings.

In [5]:
collection_name = "knowledge_base"
model_name = "BAAI/bge-small-en-v1.5"
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE)
)

In [6]:
documents = [
    "Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!",
    "Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.",
    "PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.",
    "MySQL is an open-source relational database management system (RDBMS). A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract data from the relational database, as well as control user access to the database.",
    "NGINX is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.",
    "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.",
    "SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.",
    "The cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs (commands or shell scripts), also known as cron jobs, to run periodically at fixed times, dates, or intervals.",
]

In [9]:
client.upsert(
    collection_name=collection_name,
    points=[
        models.PointStruct(
            id=idx,
            vector=models.Document(text=document, model=model_name),
            payload={"document": document},
        )
        for idx, document in enumerate(documents)
    ],
)

Fetching 5 files: 100%|██████████████████████████████████████████████████| 5/5 [00:16<00:00,  3.39s/it]


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

## Retrieval Augmented Generation

RAG changes the way we interact with Large Language Models. It converts a knowledge-oriented task, in which the model may produce a factually incorrect answer, into a language-oriented task, in which the model extracts meaningful information from the supplied text and generates an answer. Language-oriented tasks are what LLMs are built to handle well.

The task starts with the original prompt sent by the user. The same prompt is then vectorized and used as a search query for the most relevant facts. Those facts are combined with the original prompt to build a longer prompt containing more information.
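The flow described above can be sketched independently of any particular database or model. In this rough outline, `answer_with_context`, `fake_retrieve`, and `fake_generate` are hypothetical names; the two fakes stand in for the Qdrant search and the LLM call so that the shape of the pipeline is visible on its own.

```python
from typing import Callable, List


def answer_with_context(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve facts for the question, build a longer prompt, and generate."""
    facts = retrieve(question)
    augmented = question.strip() + "\n\nContext:\n" + "\n".join(facts)
    return generate(augmented)


def fake_retrieve(question: str) -> List[str]:
    # Stand-in for a semantic search over the knowledge base.
    return ["Qdrant is a vector database."]


def fake_generate(prompt: str) -> str:
    # Stand-in for an LLM call; it only reports how much text it received.
    return f"(model saw {len(prompt.splitlines())} lines)"


print(answer_with_context("Which database stores vectors?", fake_retrieve, fake_generate))
```

Passing the retriever and generator as arguments keeps the pipeline easy to test without network access.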

But let's start simply by asking our question directly.

In [10]:
prompt = """
What tools should I need to use to build a web service using vector embeddings for search?
"""

Using the DeepSeek API requires an API key. You can obtain one from the [DeepSeek platform](https://platform.deepseek.com/api_keys).
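If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `DEEPSEEK_API_KEY`; that name and the `load_api_key` helper are our own convention here, not something the DeepSeek API requires.

```python
import os


def load_api_key(var_name: str = "DEEPSEEK_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage (assumes the variable was exported before starting the notebook):
# API_KEY = load_api_key()
```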

Now we can finally call the completion API.

In [11]:
import requests
import json

# Replace the placeholder below with your own DeepSeek API key
# See: https://platform.deepseek.com/api_keys
API_KEY = "<YOUR_DEEPSEEK_KEY>"

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}


def query_deepseek(prompt):
    data = {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

    response = requests.post(
        "https://api.deepseek.com/chat/completions", headers=HEADERS, data=json.dumps(data)
    )

    if response.ok:
        result = response.json()
        return result["choices"][0]["message"]["content"]
    else:
        raise Exception(f"Error {response.status_code}: {response.text}")
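The request body and the response parsing can also be factored into small pure functions, which makes them easy to check without calling the API. This is a sketch under assumptions: `build_chat_payload` and `extract_content` are hypothetical helper names, and the optional `system` message relies on the chat endpoint accepting a `system` role alongside `user`, as OpenAI-compatible chat APIs generally do.

```python
from typing import Optional


def build_chat_payload(
    prompt: str, system: Optional[str] = None, model: str = "deepseek-chat"
) -> dict:
    """Build the JSON body for a single-turn, non-streaming chat request."""
    messages = []
    if system:
        # Assumption: the endpoint accepts an optional system message.
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"model": model, "messages": messages, "stream": False}


def extract_content(result: dict) -> str:
    """Pull the reply text out of a decoded response, with a clearer error."""
    try:
        return result["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"Unexpected response shape: {result!r}") from exc
```

Raising a `ValueError` that includes the decoded body tends to be easier to debug than a bare `KeyError` when the API returns an error object instead of choices.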


In [12]:
query_deepseek(prompt)

'Building a web service that leverages **vector embeddings** for search involves several key components. Here\'s a breakdown of the tools and technologies you\'ll need:\n\n---\n\n### **1. Vector Database (Storage & Search)**\nThese databases are optimized for storing and querying high-dimensional vectors (embeddings).\n\n- **Pinecone** – Managed vector database, easy to use with API.\n- **Weaviate** – Open-source vector search with GraphQL API.\n- **Milvus** / **Zilliz** – High-performance open-source vector DB.\n- **Qdrant** – Rust-based, fast and scalable.\n- **Chroma** – Lightweight, in-memory (good for prototyping).\n- **FAISS** (by Meta) – Library for efficient similarity search (not a full DB).\n- **Redis with RedisSearch** – Supports vector similarity search.\n\n**Choose based on:** Scalability, latency, cost, and ease of integration.\n\n---\n\n### **2. Embedding Model (Generating Vectors)**\nConverts text/images into embeddings.\n\n- **OpenAI Embeddings** (`text-embedding-3-sma

### Extending the prompt

Even though the original answer sounds credible, it didn't answer our question correctly. Instead, it gave us a generic description of an application stack. One way to improve the results is to enrich the original prompt with descriptions of the tools available to us. Let's use a semantic knowledge base to augment the prompt with the descriptions of different technologies!

In [14]:
results = client.query_points(
    collection_name=collection_name,
    query=models.Document(text=prompt, model=model_name),
    limit=3,
)
results

QueryResponse(points=[ScoredPoint(id=0, version=0, score=0.67437416, payload={'document': 'Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=6, version=0, score=0.63144326, payload={'document': 'SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.'}, vector=None, shard_key=None, order_value=None), ScoredPoint(id=5, version=0, score=0.6064749, 

We used the original prompt to perform a semantic search over the set of tool descriptions. Now we can use these descriptions to augment the prompt and create more context.

In [15]:
context = "\n".join(r.payload['document'] for r in results.points)
context

'Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!\nSentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.\nFastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.'
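A semantic search always returns the closest points, even when none of them are really relevant. One possible refinement, sketched below, is to drop low-scoring hits before building the context. The `filter_by_score` helper, the `0.5` cut-off, and the toy hits are illustrative assumptions; a sensible threshold depends on the embedding model and the data. Qdrant's query API also documents a `score_threshold` parameter for applying a cut-off on the server side, so check its documentation before relying on client-side filtering.

```python
from types import SimpleNamespace


def filter_by_score(points, min_score: float):
    """Keep only hits whose similarity score reaches the threshold."""
    return [p for p in points if p.score >= min_score]


# Toy stand-ins for scored search hits (objects with .score and .payload).
hits = [
    SimpleNamespace(score=0.67, payload={"document": "Qdrant is a vector database."}),
    SimpleNamespace(score=0.31, payload={"document": "cron is a job scheduler."}),
]

relevant = filter_by_score(hits, min_score=0.5)
filtered_context = "\n".join(p.payload["document"] for p in relevant)
print(filtered_context)
```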

Finally, let's build a metaprompt: a combination of the assumed role of the LLM, the original question, and the results from our semantic search, which steers the LLM toward using the provided context.

By doing this, we effectively convert the knowledge-oriented task into a language task and hopefully reduce the chances of hallucinations. It should also make the response more relevant.

In [16]:
metaprompt = f"""
You are a software architect. 
Answer the following question using the provided context. 
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: {prompt.strip()}

Context: 
{context.strip()}

Answer:
"""

# Look at the full metaprompt
print(metaprompt)


You are a software architect. 
Answer the following question using the provided context. 
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: What tools should I need to use to build a web service using vector embeddings for search?

Context: 
Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!
SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.
FastAPI is a modern

Our current prompt is much longer, and we also used a couple of strategies to make the responses even better:

1. The LLM has the role of software architect.
2. We provide more context to answer the question.
3. If the context contains no meaningful information, the model shouldn't make up an answer.
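The three strategies above can be captured in a small reusable helper. This is one possible sketch: `build_metaprompt` and its `role` parameter are hypothetical names, and the wording mirrors the metaprompt used in this notebook.

```python
from typing import List


def build_metaprompt(
    question: str, documents: List[str], role: str = "software architect"
) -> str:
    """Combine a role, the question, and retrieved documents into one prompt."""
    context = "\n".join(doc.strip() for doc in documents)
    lines = [
        f"You are a {role}.",
        "Answer the following question using the provided context.",
        "If you can't find the answer, do not pretend you know it, "
        'but answer "I don\'t know".',
        "",
        f"Question: {question.strip()}",
        "",
        "Context:",
        context,
        "",
        "Answer:",
    ]
    return "\n".join(lines)


print(build_metaprompt("What is Qdrant?", ["Qdrant is a vector database."]))
```

Building the prompt from a list of lines avoids the stray indentation that a triple-quoted string picks up when it is defined inside a function.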

Let's find out if that works as expected.

In [17]:
query_deepseek(metaprompt)

'To build a web service using vector embeddings for search, you will need the following tools based on the provided context:\n\n1. **Qdrant**: A vector database and similarity search engine to store and retrieve high-dimensional vectors efficiently. It provides the API service for nearest-neighbor search.\n\n2. **SentenceTransformers**: A Python framework to generate state-of-the-art sentence or text embeddings. You can use this to convert your text data into vector representations (embeddings) for search.\n\n3. **FastAPI**: A modern Python web framework to build the API layer for your web service. It will handle HTTP requests, interact with Qdrant, and serve search results to clients.\n\n### Optional but helpful tools:\n- A machine learning model (e.g., from Hugging Face) if you need custom embeddings.\n- A frontend framework (e.g., React, Vue.js) if you need a user interface.\n- Docker for containerization and deployment.\n\nWould you like more details on any of these tools?'

### Testing out the RAG pipeline

By leveraging the semantic context we provided, our model does a better job of answering the question. Let's wrap the RAG pipeline in a function so we can call it more easily with different prompts.

In [18]:
def rag(question: str, n_points: int = 3) -> str:
    results = client.query_points(
        collection_name=collection_name,
        query=models.Document(text=question, model=model_name),
        limit=n_points,
    )

    context = "\n".join(r.payload["document"] for r in results.points)

    metaprompt = f"""
    You are a software architect. 
    Answer the following question using the provided context. 
    If you can't find the answer, do not pretend you know it, but only answer "I don't know".

    Question: {question.strip()}

    Context: 
    {context.strip()}

    Answer:
    """

    return query_deepseek(metaprompt)

Now it's easier to ask a broad range of questions.

In [19]:
rag("What can the stack for a web api look like?")

"Based on the provided context, a possible stack for a web API could include:\n\n1. **Framework**: FastAPI (for building the API with Python)\n2. **Web Server/Reverse Proxy**: NGINX (to handle HTTP requests and serve as a reverse proxy)\n3. **Database**: MySQL (as the relational database management system)\n\nThis stack would look like:  \n**FastAPI + NGINX + MySQL**  \n\nAdditional components (not mentioned in the context but often used in such stacks) might include:\n- A cloud provider or server (e.g., AWS, GCP, or a Linux server)\n- Docker for containerization\n- Redis for caching (though not mentioned in the context)\n\nThe context does not provide details about other layers (e.g., authentication, monitoring, or frontend), so I’ve limited the answer to the explicitly mentioned technologies.  \n\nIf you'd like a more detailed or alternative stack, let me know, but the context only covers these three components."

In [20]:
rag("Where is the nearest grocery store?")

"I don't know."

Our model can now:

1. Take advantage of the knowledge in our vector datastore.
2. State that it cannot provide an answer when the provided context doesn't contain one.
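An application built on top of this can branch on the fallback answer, for example to show a different message or to try another data source. The check below is a minimal sketch: `is_unknown` is a hypothetical helper, and it only recognizes replies that consist of the exact fallback phrase requested in the metaprompt, so a model that rephrases its refusal would not be detected.

```python
def is_unknown(answer: str) -> bool:
    """Return True when the reply is just the "I don't know" fallback."""
    normalized = (
        answer.replace("\u2019", "'")  # treat a curly apostrophe like a straight one
        .strip()
        .strip("\"'")
        .rstrip(".")
        .strip()
        .lower()
    )
    return normalized == "i don't know"


print(is_unknown("I don't know."))
print(is_unknown("FastAPI + NGINX + MySQL"))
```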

We have just shown a useful mechanism to mitigate the risks of hallucinations in Large Language Models.