# Code search with Qdrant

This notebook demonstrates how to implement code search using two different neural encoders: one general-purpose, and another trained specifically for code. Let's start by installing all the required dependencies.

In [None]:
!pip install qdrant-client inflection sentence-transformers optimum onnx python-dotenv

We have already generated a structured `jsonl` file from our example codebase, [codeqai](https://github.com/fynnfluegge/codeqai). The file [python_parser.py](python_parser.py) contains the code that takes the codebase as input and generates a structured representation of the source code as a `jsonl` file. The generated file can be found [here](./resources/codeqai_codebase_python_parsed.jsonl).
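The internals of `python_parser.py` are not shown in this notebook. As a rough illustration of what producing such records involves, here is a minimal sketch built on Python's standard `ast` module. The real parser may work differently; `parse_source` and the sample source are hypothetical names used only for this example, and the record contains just a subset of the fields.

```python
import ast
import json


def parse_source(source: str, module: str, file_name: str) -> list:
    """Extract one record per class/function definition (illustrative only)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "name": node.name,
                # The first line of the definition serves as the signature
                "signature": lines[node.lineno - 1].strip(),
                "code_type": "Class" if isinstance(node, ast.ClassDef) else "Function",
                "docstring": ast.get_docstring(node),
                "line_from": node.lineno,
                "line_to": node.end_lineno,
                "context": {
                    "module": module,
                    "file_name": file_name,
                    "snippet": "\n".join(lines[node.lineno - 1:node.end_lineno]),
                },
            })
    return records


# Hypothetical sample input, not taken from the codeqai codebase
SAMPLE = '''class Greeter:
    """Say hello."""

    def greet(self, name):
        return f"Hello, {name}"
'''

records = parse_source(SAMPLE, module="demo", file_name="demo.py")
# One JSON document per line is what makes the output a jsonl file
jsonl_lines = [json.dumps(record) for record in records]
print(jsonl_lines[0])
```

Each line of the resulting file is an independent JSON document, which is why the loading cell below can parse the file row by row.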

In [None]:
import json

structures = []
with open("./resources/codeqai_codebase_python_parsed.jsonl", "r", encoding="utf-8") as fp:
    for row in fp:
        entry = json.loads(row)
        structures.append(entry)

structures[0]

We will use two different neural encoders: `all-MiniLM-L6-v2` and `jina-embeddings-v2-base-code`. Since the first one is a general-purpose model trained mostly on natural language, we need to convert the code into a more human-friendly text representation. This normalization strips away language-specific syntax, so the output looks more like a description of the particular code structure.
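To make the idea concrete before the full function, here is a stdlib-only sketch that roughly approximates what the `inflection.underscore` and `inflection.humanize` calls used below do to an identifier. `humanize_identifier` is a hypothetical helper written for this illustration; it is not part of `inflection` and may differ from it on edge cases.

```python
import re


def humanize_identifier(identifier: str) -> str:
    """Turn a camel case or snake case identifier into plain lowercase words."""
    # Split an acronym from a following word: "HTTPServer" -> "HTTP_Server"
    snake = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", identifier)
    # Split at lower-to-upper boundaries: "LlmHost" -> "Llm_Host"
    snake = re.sub(r"([a-z\d])([A-Z])", r"\1_\2", snake)
    # Replace underscores with spaces, lowercase, and capitalize the first word
    words = snake.replace("_", " ").lower().split()
    return " ".join(words).capitalize()


print(humanize_identifier("LlmHost"))
print(humanize_identifier("env_loader"))
```

A general-purpose text model sees `Llm host` as two ordinary words, whereas `LlmHost` would likely be split into less meaningful sub-word tokens.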

In [None]:
import inflection
import re

from typing import Dict, Any

def textify(chunk: Dict[str, Any]) -> str:
    """
    Convert the code structure into natural language like representation.

    Args:
        chunk (dict): Dictionary-like representation of the code structure
            Example: {
                "name": "LlmHost",
                "signature": "class LlmHost(Enum):",
                "code_type": "Class",
                "docstring": null,
                "line": 31,
                "line_from": 31,
                "line_to": 36,
                "context": {
                    "module": "constants",
                    "file_path": "codeqai/constants.py",
                    "file_name": "constants.py",
                    "class_name": "LlmHost",
                    "function_name": null,
                    "snippet": "class LlmHost(Enum):    LLAMACPP = \"Llamacpp\"    OLLAMA = \"Ollama\"    OPENAI = \"OpenAI\"    AZURE_OPENAI = \"Azure-OpenAI\"    ANTHROPIC = \"Anthropic\""
                }
            }

    Returns:
        str: A simplified natural language like description of the structure with some context info
            Example: "Function Await ready for timeout that does Return true if ready false if timed out defined as Fn await ready for timeout self timeout duration bool defined in struct Isready module common file is_ready rs"
    """
    # Get rid of all the camel case / snake case
    # - inflection.underscore changes the camel case to snake case
    # - inflection.humanize converts the snake case to human readable form
    name = inflection.humanize(inflection.underscore(chunk["name"]))
    signature = inflection.humanize(inflection.underscore(chunk["signature"]))

    # Check if docstring is provided
    docstring = ""
    if chunk["docstring"]:
        docstring = f"that does {chunk['docstring']} "

    # Extract the location of that snippet of code
    context = (
        f"module {chunk['context']['module']} "
        f"file {chunk['context']['file_name']}"
    )

    #if chunk["context"]["class_name"]:
    #    struct_name = inflection.humanize(
    #        inflection.underscore(chunk["context"]["class_name"])
    #    )
    #    context = f"defined in struct {struct_name} {context}"

    # Combine all the bits and pieces together
    text_representation = (
        f"{chunk['code_type']} {name} "
        f"{docstring}"
        f"defined as {signature} "
        f"{context}"
    )

    # Split on non-word characters, drop empty tokens, and join with single spaces
    tokens = [token for token in re.split(r"\W", text_representation) if token]
    return " ".join(tokens)

Here is what the same structure looks like after the normalization step:

In [None]:
textify(structures[0])

Let's do it for all the structures at once:

In [None]:
text_representations = list(map(textify, structures))

In [None]:
text_representations[104]

The resulting text representations can be used directly as input to the `all-MiniLM-L6-v2` model.

In [13]:
from sentence_transformers import SentenceTransformer

nlp_model = SentenceTransformer("all-MiniLM-L6-v2")
nlp_embeddings = nlp_model.encode(
    text_representations, show_progress_bar=True,
)
nlp_embeddings.shape

Batches: 100%|██████████| 4/4 [00:00<00:00,  8.96it/s]


(115, 384)

As a next step, we are going to extract all the code snippets into a separate list. This will be the input to the second model we want to use.

In [14]:
code_snippets = [
    structure["context"]["snippet"]
    for structure in structures
]
code_snippets[104]

'class TreesitterTypescript(Treesitter):    def __init__(self):        super().__init__(            Language.TYPESCRIPT, "function_declaration", "identifier", "comment"        )'

The `jina-embeddings-v2-base-code` model is available for free, but accessing it may require accepting the terms on [the model page](https://huggingface.co/jinaai/jina-embeddings-v2-base-code) and generating an access token. If that is the case for you, do it first and put the token below.

In [15]:
from dotenv import load_dotenv
env_loaded = load_dotenv("../.env")
print(f"Env loaded: {env_loaded}")

Env loaded: True


In [None]:
# If the Jina embedding model asks you to accept its conditions, visit
# https://huggingface.co/jinaai/jina-embeddings-v2-base-code to accept them,
# then generate an access token in your account settings:
# https://huggingface.co/settings/tokens
#
# When this notebook was last run, the model could be downloaded without a
# token, so the `token` argument in the next cell is commented out.

HF_TOKEN = "THIS_IS_YOUR_TOKEN"  # or load it from the .env file

Once the token is ready, we can pass the code snippets through the second model. Note that we set the `trust_remote_code` flag to `True`, so the library can download and run code from the remote repository. This is required to run the model, so be aware of the potential security risks and make sure you trust the source.

In [16]:
code_model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-code",
    #token=HF_TOKEN,
    trust_remote_code=True
)
code_model.max_seq_length = 8192  # increase the context length window
code_embeddings = code_model.encode(
    code_snippets, batch_size=4, show_progress_bar=True,
)
code_embeddings.shape


Batches: 100%|██████████| 29/29 [01:02<00:00,  2.17s/it]


(115, 768)

The embeddings have to be indexed in a Qdrant collection. For that, we need a running instance. The easiest way is to deploy one on [Qdrant Cloud](https://cloud.qdrant.io/), which offers a free-tier 1GB cluster. Alternatively, you can use [a local Docker container](https://qdrant.tech/documentation/quick-start/), although running it in Google Colab might require installing Docker first.

In [None]:
# Will load from the .env file.
#QDRANT_URL = "https://my-cluster.cloud.qdrant.io:6333" # http://localhost:6333 for local instance
#QDRANT_API_KEY = "THIS_IS_YOUR_API_KEY" # None for local instance

In [17]:
from qdrant_client import QdrantClient, models
import os

client = QdrantClient(os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))
client.create_collection(
    "qdrant-sources",
    vectors_config={
        "text": models.VectorParams(
            size=nlp_embeddings.shape[1],
            distance=models.Distance.COSINE,
        ),
        "code": models.VectorParams(
            size=code_embeddings.shape[1],
            distance=models.Distance.COSINE,
        ),
    }
)

True

Our collection should now be created. As you can see, we configured so-called **[named vectors](https://qdrant.tech/documentation/concepts/points/)** to store two different embeddings in the same collection.
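To build some intuition for named vectors, the following toy sketch keeps two vectors of different sizes per point and ranks points by cosine similarity against whichever vector name is requested. It is a pure-Python illustration under simplified assumptions, not how Qdrant is implemented; `toy_points` and `toy_search` are hypothetical names and the vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Each point stores one vector per name; sizes may differ between names
toy_points = [
    {"id": 1, "vector": {"text": [1.0, 0.0], "code": [0.0, 1.0, 0.0]}},
    {"id": 2, "vector": {"text": [0.0, 1.0], "code": [1.0, 0.0, 0.0]}},
]


def toy_search(points, using, query, limit=5):
    """Rank points by similarity of the vector called `using` to the query."""
    scored = [(cosine(p["vector"][using], query), p["id"]) for p in points]
    return sorted(scored, reverse=True)[:limit]


text_hits = toy_search(toy_points, "text", [0.9, 0.1])
code_hits = toy_search(toy_points, "code", [0.9, 0.1, 0.0])
print(text_hits[0][1], code_hits[0][1])
```

The same point can rank first under one vector name and last under another, which is exactly why the two encoders are stored side by side.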

Let's finally index all the data.

In [19]:
import uuid

points = [
    models.PointStruct(
        id=uuid.uuid4().hex,
        vector={
            "text": text_embedding,
            "code": code_embedding,
        },
        payload=structure
    )
    for text_embedding, code_embedding, structure in zip(nlp_embeddings, code_embeddings, structures)
]
len(points)
points[0]

PointStruct(id='ab541d981b2048709bfbabe6e5564f9b', vector={'text': [-0.011631825938820839, -0.00014600614667870104, -0.09755951166152954, -0.006029199808835983, 0.007939223200082779, -0.09623466432094574, -0.0014058706583455205, 0.06285744160413742, 0.02438332512974739, -0.03742710500955582, 0.04625314474105835, -0.02422446571290493, 0.03547271341085434, 0.025804031640291214, 0.11759044229984283, 0.114201121032238, -0.024521520361304283, -0.026537170633673668, 0.0026248081121593714, 0.01264064759016037, 0.04780663549900055, 0.05898354575037956, 0.04669845849275589, -0.0014817004557698965, -0.07434895634651184, -0.13324540853500366, -0.0012306892313063145, 0.0507434718310833, -0.037363093346357346, -0.00982726737856865, 0.09743224084377289, -0.004054978024214506, -0.0375242717564106, 0.018598034977912903, 0.07011193782091141, 0.12876152992248535, -0.009478967636823654, -0.07346620410680771, -0.038087863475084305, 0.008240461349487305, 0.07880952954292297, -0.02419446036219597, -0.044855

In [20]:
client.upload_points(
    "qdrant-sources",
    points=points,
    batch_size=64,
)
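The cell above assigns random `uuid4` identifiers, so re-running the notebook would add a fresh copy of every point. As an alternative sketch (not what this notebook does), identifiers could be derived deterministically from each structure with `uuid5`, so the same snippet always maps to the same id. `NAMESPACE`, its URL, and `stable_point_id` are hypothetical names chosen for this example; the payload fields used are the ones shown in the `textify` docstring.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/code-search")


def stable_point_id(structure: dict) -> str:
    """Derive a repeatable UUID from the location and name of a code structure."""
    context = structure["context"]
    key = f"{context['file_path']}:{structure['line_from']}:{structure['name']}"
    return str(uuid.uuid5(NAMESPACE, key))


example = {
    "name": "LlmHost",
    "line_from": 31,
    "context": {"file_path": "codeqai/constants.py"},
}
first_id = stable_point_id(example)
second_id = stable_point_id(example)
print(first_id == second_id)
```

The trade-off is that the id changes whenever a definition moves to another line, so this scheme suits snapshots of a codebase better than continuously edited files.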

If you want to check whether all the points were sent, counting them is probably the easiest way.

In [21]:
client.count("qdrant-sources")

CountResult(count=115)

With the data indexed, we can start searching. Let's define a sample query that we will reuse with both models.

In [38]:
query = "openai embeddings"

First, let's use one model at a time, starting with the general-purpose one.

In [39]:
hits = client.search(
    "qdrant-sources",
    query_vector=(
        "text", nlp_model.encode(query).tolist()
    ),
    limit=5,
)
for hit in hits:
    print(
        "| ",
        hit.payload["context"]["module"], " | ",
        hit.payload["context"]["file_name"], " | ",
        hit.score, " | `",
        hit.payload["signature"], "` |"
    )

|  embeddings  |  embeddings.py  |  0.55871344  | ` def __init__(self, model=EmbeddingsModel.OPENAI_TEXT_EMBEDDING_ADA_002, deployment=None): ` |
|  embeddings  |  embeddings.py  |  0.37828666  | ` class Embeddings(): ` |
|  constants  |  constants.py  |  0.37065214  | ` class EmbeddingsModel(Enum): ` |
|  embeddings  |  embeddings.py  |  0.3259802  | ` def _install_instructor_embedding(self): ` |
|  app  |  app.py  |  0.30768472  | ` def env_loader(env_path, required_keys=None): ` |


The results obtained with the code-specific model should be different.

In [40]:
hits = client.search(
    "qdrant-sources",
    query_vector=(
        "code", code_model.encode(query).tolist()
    ),
    limit=5,
)
for hit in hits:
    print(
        "| ",
        hit.payload["context"]["module"], " | ",
        hit.payload["context"]["file_name"], " | ",
        hit.score, " | `",
        hit.payload["signature"], "` |"
    )

|  embeddings  |  embeddings.py  |  0.6875714  | ` def __init__(self, model=EmbeddingsModel.OPENAI_TEXT_EMBEDDING_ADA_002, deployment=None): ` |
|  embeddings  |  embeddings.py  |  0.6197308  | ` class Embeddings(): ` |
|  vector_store  |  vector_store.py  |  0.59301513  | ` def __init__(self, name, embeddings): ` |
|  constants  |  constants.py  |  0.5884485  | ` class EmbeddingsModel(Enum): ` |
|  bootstrap  |  bootstrap.py  |  0.49452206  | ` def bootstrap(config, repo_name, embeddings_model=None): ` |


We set the system up with two different models because we want to combine the results coming from both of them. We can do that with a batch request, so there is just a single call to Qdrant.

In [41]:
results = client.search_batch(
    "qdrant-sources",
    requests=[
        models.SearchRequest(
            vector=models.NamedVector(
                name="text",
                vector=nlp_model.encode(query).tolist()
            ),
            with_payload=True,
            limit=5,
        ),
        models.SearchRequest(
            vector=models.NamedVector(
                name="code",
                vector=code_model.encode(query).tolist()
            ),
            with_payload=True,
            limit=5,
        ),
    ]
)
for hits in results:
    for hit in hits:
        print(
            "| ",
            hit.payload["context"]["module"], " | ",
            hit.payload["context"]["file_name"], " | ",
            hit.score, " | `",
            hit.payload["signature"], "` |"
        )

|  embeddings  |  embeddings.py  |  0.55871344  | ` def __init__(self, model=EmbeddingsModel.OPENAI_TEXT_EMBEDDING_ADA_002, deployment=None): ` |
|  embeddings  |  embeddings.py  |  0.37828666  | ` class Embeddings(): ` |
|  constants  |  constants.py  |  0.37065214  | ` class EmbeddingsModel(Enum): ` |
|  embeddings  |  embeddings.py  |  0.3259802  | ` def _install_instructor_embedding(self): ` |
|  app  |  app.py  |  0.30768472  | ` def env_loader(env_path, required_keys=None): ` |
|  embeddings  |  embeddings.py  |  0.6875714  | ` def __init__(self, model=EmbeddingsModel.OPENAI_TEXT_EMBEDDING_ADA_002, deployment=None): ` |
|  embeddings  |  embeddings.py  |  0.6197308  | ` class Embeddings(): ` |
|  vector_store  |  vector_store.py  |  0.59301513  | ` def __init__(self, name, embeddings): ` |
|  constants  |  constants.py  |  0.5884485  | ` class EmbeddingsModel(Enum): ` |
|  bootstrap  |  bootstrap.py  |  0.49452206  | ` def bootstrap(config, repo_name, embeddings_model=None): ` |
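The batch request returns two separate ranked lists, and their scores come from different models, so they are not directly comparable. One common way to merge such lists is reciprocal rank fusion (RRF), which uses only the rank positions. The sketch below is an illustration of that technique and not something this notebook runs; `reciprocal_rank_fusion` is a hypothetical helper, `k=60` is a conventional default, and the labels are shorthand built from the module and name of each hit printed above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each item earns 1 / (k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, key in enumerate(ranking, start=1):
            scores[key] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism
    return sorted(scores, key=lambda key: (-scores[key], key))


text_ranking = [
    "embeddings.__init__",
    "embeddings.Embeddings",
    "constants.EmbeddingsModel",
    "embeddings._install_instructor_embedding",
    "app.env_loader",
]
code_ranking = [
    "embeddings.__init__",
    "embeddings.Embeddings",
    "vector_store.__init__",
    "constants.EmbeddingsModel",
    "bootstrap.bootstrap",
]

fused = reciprocal_rank_fusion([text_ranking, code_ranking])
for label in fused:
    print(label)
```

Items that appear in both lists rise to the top, while hits found by only one model are still kept further down the merged ranking.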


Last but not least, if we want to improve the diversity of the results, grouping them by module might be a good idea.

In [42]:
results = client.search_groups(
    "qdrant-sources",
    query_vector=(
        "code", code_model.encode(query).tolist()
    ),
    group_by="context.module",
    limit=5,
    group_size=1,
)
for group in results.groups:
    for hit in group.hits:
        print(
            "| ",
            hit.payload["context"]["module"], " | ",
            hit.payload["context"]["file_name"], " | ",
            hit.score, " | `",
            hit.payload["signature"], "` |"
        )

|  embeddings  |  embeddings.py  |  0.6875714  | ` def __init__(self, model=EmbeddingsModel.OPENAI_TEXT_EMBEDDING_ADA_002, deployment=None): ` |
|  vector_store  |  vector_store.py  |  0.59301513  | ` def __init__(self, name, embeddings): ` |
|  constants  |  constants.py  |  0.5884485  | ` class EmbeddingsModel(Enum): ` |
|  bootstrap  |  bootstrap.py  |  0.49452206  | ` def bootstrap(config, repo_name, embeddings_model=None): ` |
|  app  |  app.py  |  0.4623083  | ` def run(): ` |
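The effect of grouping can be reproduced on plain Python data: sort the hits by score and keep at most `group_size` hits per distinct value of the grouping key. This is a sketch of the observable behaviour only, not of how the Qdrant server implements it; `group_hits` is a hypothetical helper and the input reuses the top five scores printed by the earlier, ungrouped code search.

```python
def group_hits(hits, group_by, limit=5, group_size=1):
    """Keep the best `group_size` hits for each distinct `group_by` value."""
    groups = {}
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        bucket = groups.setdefault(hit[group_by], [])
        if len(bucket) < group_size:
            bucket.append(hit)
    # Dicts preserve insertion order, so groups are ordered by their best hit
    return list(groups.items())[:limit]


hits = [
    {"module": "embeddings", "score": 0.6875714},
    {"module": "embeddings", "score": 0.6197308},
    {"module": "vector_store", "score": 0.59301513},
    {"module": "constants", "score": 0.5884485},
    {"module": "bootstrap", "score": 0.49452206},
]

grouped = group_hits(hits, group_by="module")
for module, bucket in grouped:
    print(module, bucket[0]["score"])
```

With only five input hits the second `embeddings` result is dropped and four groups remain; the real `search_groups` call keeps looking further down the ranking, which is how the `app` module fills the fifth slot above.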


For a more detailed guide, please check our [code search tutorial](https://qdrant.tech/documentation/tutorials/code-search/) and [code search demo](https://github.com/qdrant/demo-code-search).