# Code search with Qdrant

This notebook demonstrates how to implement code search using two different neural encoders: one general-purpose, and one trained specifically for code. Let's start by installing all the required dependencies.

In [1]:
!pip install qdrant-client inflection sentence-transformers optimum onnx

Collecting qdrant-client
  Downloading qdrant_client-1.7.3-py3-none-any.whl (206 kB)
Collecting inflection
  Downloading inflection-0.5.1-py2.py3-none-any.whl (9.5 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
Collecting optimum
  Downloading optimum-1.17.1-py3-none-any.whl (407 kB)
Collecting onnx
  Downloading onnx-1.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.7 MB)

We are going to work with the [Qdrant source code](https://github.com/qdrant/qdrant), which has already been converted into chunks. If you want to do the same for a different project, consider using one of the [LSP implementations](https://microsoft.github.io/language-server-protocol/) for your programming language. It should be fairly easy to build similar structures with the help of these tools.

In [2]:
!wget https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl

--2024-03-05 11:08:28--  https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.119.207, 108.177.127.207, 172.217.218.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.119.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4921256 (4.7M) [application/json]
Saving to: ‘structures.jsonl’


2024-03-05 11:08:29 (20.4 MB/s) - ‘structures.jsonl’ saved [4921256/4921256]


In [3]:
import json

structures = []
with open("structures.jsonl", "r") as fp:
    for row in fp:
        entry = json.loads(row)
        structures.append(entry)

structures[0]

{'name': 'InvertedIndexRam',
 'signature': '# [doc = " Inverted flatten index from dimension id to posting list"] # [derive (Debug , Clone , PartialEq)] pub struct InvertedIndexRam { # [doc = " Posting lists for each dimension flattened (dimension id -> posting list)"] # [doc = " Gaps are filled with empty posting lists"] pub postings : Vec < PostingList > , # [doc = " Number of unique indexed vectors"] # [doc = " pre-computed on build and upsert to avoid having to traverse the posting lists."] pub vector_count : usize , }',
 'code_type': 'Struct',
 'docstring': '= " Inverted flatten index from dimension id to posting list"',
 'line': 15,
 'line_from': 13,
 'line_to': 22,
 'context': {'module': 'inverted_index',
  'file_path': 'lib/sparse/src/index/inverted_index/inverted_index_ram.rs',
  'file_name': 'inverted_index_ram.rs',
  'struct_name': None,
  'snippet': '/// Inverted flatten index from dimension id to posting list\n#[derive(Debug, Clone, PartialEq)]\npub struct InvertedIndexRam
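
Before encoding anything, it can help to see which kinds of structures the file contains. The following is a minimal sketch of that check, assuming each JSONL line carries a `code_type` field as in the record above. The three sample lines are hypothetical stand-ins for `structures.jsonl`, so the block runs without the download.

```python
import io
import json
from collections import Counter

# Hypothetical records standing in for lines of structures.jsonl
sample_jsonl = io.StringIO(
    '{"name": "InvertedIndexRam", "code_type": "Struct"}\n'
    '{"name": "await_ready_for_timeout", "code_type": "Function"}\n'
    '{"name": "count", "code_type": "Function"}\n'
)

# Parse one JSON document per non-empty line
records = [json.loads(line) for line in sample_jsonl if line.strip()]

# Count how many structures of each kind were parsed
code_type_counts = Counter(record["code_type"] for record in records)
print(code_type_counts.most_common())
```

On the real data, the same `Counter` expression can be applied to the `structures` list loaded above.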

We will use two different neural encoders: `all-MiniLM-L6-v2` and `jina-embeddings-v2-base-code`. Since the first one is a general-purpose model trained mostly on natural language, we need to convert the code into a more human-friendly text representation. This normalization strips language-specific syntax, so the output reads more like a description of the code structure.

In [4]:
import inflection
import re

from typing import Dict, Any


def textify(chunk: Dict[str, Any]) -> str:
    """
    Convert the code structure into natural language like representation.

    Args:
        chunk (dict): Dictionary-like representation of the code structure
            Example: {
                "name":"await_ready_for_timeout",
                "signature":"fn await_ready_for_timeout (& self , timeout : Duration) -> bool",
                "code_type":"Function",
                "docstring":"= \" Return `true` if ready, `false` if timed out.\"",
                "line":44,
                "line_from":43,
                "line_to":51,
                "context":{
                    "module":"common",
                    "file_path":"lib/collection/src/common/is_ready.rs",
                    "file_name":"is_ready.rs",
                    "struct_name":"IsReady",
                    "snippet":"    /// Return `true` if ready, `false` if timed out.\n    pub fn await_ready_for_timeout(&self, timeout: Duration) -> bool {\n        let mut is_ready = self.value.lock();\n        if !*is_ready {\n            !self.condvar.wait_for(&mut is_ready, timeout).timed_out()\n        } else {\n            true\n        }\n    }\n"
                }
            }

    Returns:
        str: A simplified natural-language-like description of the structure with some context info
            Example: "Function Await ready for timeout that does Return true if ready false if timed out defined as Fn await ready for timeout self timeout duration bool defined in struct Is ready module common file is_ready rs"
    """
    # Get rid of all the camel case / snake case
    # - inflection.underscore changes the camel case to snake case
    # - inflection.humanize converts the snake case to human readable form
    name = inflection.humanize(inflection.underscore(chunk["name"]))
    signature = inflection.humanize(inflection.underscore(chunk["signature"]))

    # Check if docstring is provided
    docstring = ""
    if chunk["docstring"]:
        docstring = f"that does {chunk['docstring']} "

    # Extract the location of that snippet of code
    context = (
        f"module {chunk['context']['module']} " f"file {chunk['context']['file_name']}"
    )
    if chunk["context"]["struct_name"]:
        struct_name = inflection.humanize(
            inflection.underscore(chunk["context"]["struct_name"])
        )
        context = f"defined in struct {struct_name} {context}"

    # Combine all the bits and pieces together
    text_representation = (
        f"{chunk['code_type']} {name} "
        f"{docstring}"
        f"defined as {signature} "
        f"{context}"
    )

    # Remove any special characters and concatenate the tokens
    tokens = re.split(r"\W", text_representation)
    tokens = filter(lambda x: x, tokens)
    return " ".join(tokens)

Here is what the same structure looks like after the normalization step:

In [5]:
textify(structures[0])

'Struct Inverted index ram that does Inverted flatten index from dimension id to posting list defined as doc inverted flatten index from dimension id to posting list derive debug clone partial eq pub struct inverted index ram doc posting lists for each dimension flattened dimension id posting list doc gaps are filled with empty posting lists pub postings vec posting list doc number of unique indexed vectors doc pre computed on build and upsert to avoid having to traverse the posting lists pub vector count usize module inverted_index file inverted_index_ram rs'
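
To make the identifier normalization more tangible, here is a standard-library approximation of what the `inflection.underscore` and `inflection.humanize` pair does to a single name. The helper `split_identifier` is hypothetical and only mimics the common cases. It is not a replacement for `inflection` and may differ on edge cases such as acronyms.

```python
import re


def split_identifier(identifier: str) -> str:
    """Turn a CamelCase or snake_case identifier into lowercase words."""
    # Mark lower-to-upper boundaries: "IndexRam" -> "Index_Ram"
    marked = re.sub(r"([a-z\d])([A-Z])", r"\1_\2", identifier)
    # Mark acronym boundaries: "HTTPServer" -> "HTTP_Server"
    marked = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", marked)
    # Split on underscores and any other non-word characters
    words = [word for word in re.split(r"[_\W]+", marked.lower()) if word]
    # Capitalize only the first word, as humanize does
    return " ".join(words).capitalize()


print(split_identifier("InvertedIndexRam"))
print(split_identifier("await_ready_for_timeout"))
```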

Let's do it for all the structures at once:

In [6]:
text_representations = list(map(textify, structures))

The resulting text representations can be used directly as input to the `all-MiniLM-L6-v2` model.

In [7]:
from sentence_transformers import SentenceTransformer

nlp_model = SentenceTransformer("all-MiniLM-L6-v2")
nlp_embeddings = nlp_model.encode(
    text_representations,
    show_progress_bar=True,
)
nlp_embeddings.shape

(4723, 384)

As a next step, we are going to extract all the code snippets into a separate list. This will be the input to the second, code-specific model.

In [8]:
code_snippets = [structure["context"]["snippet"] for structure in structures]
code_snippets[0]

'/// Inverted flatten index from dimension id to posting list\n#[derive(Debug, Clone, PartialEq)]\npub struct InvertedIndexRam {\n    /// Posting lists for each dimension flattened (dimension id -> posting list)\n    /// Gaps are filled with empty posting lists\n    pub postings: Vec<PostingList>,\n    /// Number of unique indexed vectors\n    /// pre-computed on build and upsert to avoid having to traverse the posting lists.\n    pub vector_count: usize,\n}\n'

The `jina-embeddings-v2-base-code` model is available for free, but you have to accept its terms on [the model page](https://huggingface.co/jinaai/jina-embeddings-v2-base-code). Please do that first, then put your access token below.

In [9]:
# You have to accept the conditions in order to be able to access Jina embedding
# model. Please visit https://huggingface.co/jinaai/jina-embeddings-v2-base-code
# to accept the rules and generate the access token in your account settings:
# https://huggingface.co/settings/tokens

HF_TOKEN = "THIS_IS_YOUR_TOKEN"

Once the token is ready, we can pass the code snippets through the second model. Note that we set the `trust_remote_code` flag to `True`, which lets the library download and run code from the remote repository. The model requires it, so be aware of the potential security risks and make sure you trust the source.

In [10]:
code_model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-code", token=HF_TOKEN, trust_remote_code=True
)
code_model.max_seq_length = 8192  # increase the context length window
code_embeddings = code_model.encode(
    code_snippets,
    batch_size=4,
    show_progress_bar=True,
)
code_embeddings.shape

(4723, 768)

The embeddings have to be indexed in a Qdrant collection, and for that we need a running instance. The easiest way is to deploy one using [Qdrant Cloud](https://cloud.qdrant.io/), which offers a free-tier 1GB cluster. Alternatively, you can use [a local Docker container](https://qdrant.tech/documentation/quick-start/), although running it in Google Colab might require installing Docker first.

In [11]:
QDRANT_URL = "https://my-cluster.cloud.qdrant.io:6333"  # http://localhost:6333 for local instance
QDRANT_API_KEY = "THIS_IS_YOUR_API_KEY"  # None for local instance
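
Hardcoding credentials in a notebook makes them easy to leak. As an alternative, here is a minimal sketch that reads the same two settings from environment variables and falls back to a local, unauthenticated instance. The function `load_qdrant_settings` and the variable names `QDRANT_URL` and `QDRANT_API_KEY` are assumptions made for illustration, not something the client requires.

```python
import os
from typing import Mapping, Optional, Tuple


def load_qdrant_settings(
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Return (url, api_key), defaulting to a local instance without a key."""
    url = env.get("QDRANT_URL", "http://localhost:6333")
    # Treat a missing or empty key as "no authentication"
    api_key = env.get("QDRANT_API_KEY") or None
    return url, api_key


# Explicit mappings make the behaviour easy to check without touching os.environ
local_settings = load_qdrant_settings({})
cloud_settings = load_qdrant_settings(
    {
        "QDRANT_URL": "https://my-cluster.cloud.qdrant.io:6333",
        "QDRANT_API_KEY": "THIS_IS_YOUR_API_KEY",
    }
)
```

The returned pair can then be unpacked into `QDRANT_URL` and `QDRANT_API_KEY` and passed to `QdrantClient` as in the next cell.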

In [12]:
from qdrant_client import QdrantClient, models

client = QdrantClient(QDRANT_URL, api_key=QDRANT_API_KEY)
client.create_collection(
    "qdrant-sources",
    vectors_config={
        "text": models.VectorParams(
            size=nlp_embeddings.shape[1],
            distance=models.Distance.COSINE,
        ),
        "code": models.VectorParams(
            size=code_embeddings.shape[1],
            distance=models.Distance.COSINE,
        ),
    },
)

True

Our collection should now be created. As you can see, we configured so-called **[named vectors](https://qdrant.tech/documentation/concepts/points/)** to store two different embeddings in the same collection.
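
Both vectors in the collection are compared with `Distance.COSINE`. As a reminder of what that metric measures, here is a pure-Python sketch of cosine similarity. It illustrates the formula only. Qdrant computes the metric internally, and the helper `cosine_similarity` is a hypothetical name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1.0 regardless of their length
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Orthogonal vectors score 0.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the metric ignores vector magnitude, the 384-dimensional and 768-dimensional embeddings can each be scored on the same scale within their own named vector.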

Let's finally index all the data.

In [13]:
import uuid

points = [
    models.PointStruct(
        id=uuid.uuid4().hex,
        vector={
            "text": text_embedding,
            "code": code_embedding,
        },
        payload=structure,
    )
    for text_embedding, code_embedding, structure in zip(
        nlp_embeddings, code_embeddings, structures
    )
]
len(points)

4723
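
Since `uuid.uuid4()` generates a fresh random ID on every run, executing the indexing cell twice would store each structure twice. One possible alternative, sketched below, derives a deterministic ID from the payload with `uuid.uuid5`. The helper `stable_point_id` is hypothetical, and it assumes that the combination of file path, starting line, and name identifies a structure uniquely.

```python
import uuid
from typing import Any, Dict


def stable_point_id(structure: Dict[str, Any]) -> str:
    """Derive a repeatable UUID from fields assumed to identify a structure."""
    key = "{}:{}:{}".format(
        structure["context"]["file_path"],
        structure["line_from"],
        structure["name"],
    )
    # uuid5 hashes the key, so the same key always yields the same UUID
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


example_structure = {
    "name": "InvertedIndexRam",
    "line_from": 13,
    "context": {
        "file_path": "lib/sparse/src/index/inverted_index/inverted_index_ram.rs"
    },
}
first_id = stable_point_id(example_structure)
second_id = stable_point_id(example_structure)
```

With IDs like these, re-uploading the same structures would overwrite the existing points instead of adding duplicates.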

In [14]:
client.upload_points(
    "qdrant-sources",
    points=points,
    batch_size=64,
)
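
The `batch_size=64` argument tells the client to send the points in chunks rather than in a single request. The client takes care of this on its own. The sketch below only illustrates the chunking idea with a hypothetical `batched` helper and does not reflect the client's actual implementation.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# 4723 items in batches of 64: 73 full batches plus a final batch of 51
batches = list(batched(range(4723), 64))
print(len(batches), len(batches[-1]))
```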

If you want to check that all the points were sent, counting them is probably the easiest way.

In [15]:
client.count("qdrant-sources")

CountResult(count=4723)

If, however, you want to know how the count endpoint works internally in the Qdrant server, that is exactly the kind of question we can ask our code search.

In [16]:
query = "How do I count points in a collection?"

First, let's use one model at a time, starting with the general-purpose one.

In [18]:
hits = client.search(
    "qdrant-sources",
    query_vector=("text", nlp_model.encode(query).tolist()),
    limit=5,
)
for hit in hits:
    print(
        "| ",
        hit.payload["context"]["module"],
        " | ",
        hit.payload["context"]["file_name"],
        " | ",
        hit.score,
        " | `",
        hit.payload["signature"],
        "` |",
    )

|  toc  |  point_ops.rs  |  0.59448624  | ` async fn count (& self , collection_name : & str , request : CountRequestInternal , read_consistency : Option < ReadConsistency > , shard_selection : ShardSelectorInternal ,) -> Result < CountResult , StorageError > ` |
|  operations  |  types.rs  |  0.5493385  | ` # [doc = " Count Request"] # [doc = " Counts the number of points which satisfy the given filter."] # [doc = " If filter is not provided, the count of all points in the collection will be returned."] # [derive (Debug , Deserialize , Serialize , JsonSchema , Validate)] # [serde (rename_all = "snake_case")] pub struct CountRequestInternal { # [doc = " Look only for points which satisfies this conditions"] # [validate] pub filter : Option < Filter > , # [doc = " If true, count exact number of points. If false, count approximate number of points faster."] # [doc = " Approximate count might be unreliable during the indexing process. Default: true"] # [serde (default = "default_exact_cou
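
The row-printing loop recurs in every search cell of this notebook. A small helper could replace it. The sketch below assumes the payload layout shown earlier (`context.module`, `context.file_name`, `signature`). The name `format_hit_row` is hypothetical, and the spacing differs slightly from the `print` output above.

```python
from typing import Any, Dict


def format_hit_row(payload: Dict[str, Any], score: float) -> str:
    """Render one search hit as a Markdown-style table row."""
    context = payload["context"]
    return (
        f"| {context['module']} | {context['file_name']} "
        f"| {score} | `{payload['signature']}` |"
    )


# Hypothetical payload mirroring the fields printed above
example_payload = {
    "signature": "fn count_indexed_points (& self) -> usize",
    "context": {"module": "field_index", "file_name": "geo_index.rs"},
}
example_row = format_hit_row(example_payload, 0.73)
print(example_row)
```

With real results, the loop body would become `print(format_hit_row(hit.payload, hit.score))`.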

The results obtained with the code-specific model should be different.

In [19]:
hits = client.search(
    "qdrant-sources",
    query_vector=("code", code_model.encode(query).tolist()),
    limit=5,
)
for hit in hits:
    print(
        "| ",
        hit.payload["context"]["module"],
        " | ",
        hit.payload["context"]["file_name"],
        " | ",
        hit.score,
        " | `",
        hit.payload["signature"],
        "` |",
    )

|  field_index  |  geo_index.rs  |  0.73278356  | ` fn count_indexed_points (& self) -> usize ` |
|  numeric_index  |  mod.rs  |  0.7254975  | ` fn count_indexed_points (& self) -> usize ` |
|  map_index  |  mod.rs  |  0.7124739  | ` fn count_indexed_points (& self) -> usize ` |
|  map_index  |  mod.rs  |  0.7124739  | ` fn count_indexed_points (& self) -> usize ` |
|  fixtures  |  payload_context_fixture.rs  |  0.7062038  | ` fn total_point_count (& self) -> usize ` |


In practice, we built the system with two different models because we want to combine the results coming from both of them. We can do that with a batch request, so there is just a single call to Qdrant.

In [20]:
results = client.search_batch(
    "qdrant-sources",
    requests=[
        models.SearchRequest(
            vector=models.NamedVector(
                name="text", vector=nlp_model.encode(query).tolist()
            ),
            with_payload=True,
            limit=5,
        ),
        models.SearchRequest(
            vector=models.NamedVector(
                name="code", vector=code_model.encode(query).tolist()
            ),
            with_payload=True,
            limit=5,
        ),
    ],
)
for hits in results:
    for hit in hits:
        print(
            "| ",
            hit.payload["context"]["module"],
            " | ",
            hit.payload["context"]["file_name"],
            " | ",
            hit.score,
            " | `",
            hit.payload["signature"],
            "` |",
        )

|  toc  |  point_ops.rs  |  0.59448624  | ` async fn count (& self , collection_name : & str , request : CountRequestInternal , read_consistency : Option < ReadConsistency > , shard_selection : ShardSelectorInternal ,) -> Result < CountResult , StorageError > ` |
|  operations  |  types.rs  |  0.5493385  | ` # [doc = " Count Request"] # [doc = " Counts the number of points which satisfy the given filter."] # [doc = " If filter is not provided, the count of all points in the collection will be returned."] # [derive (Debug , Deserialize , Serialize , JsonSchema , Validate)] # [serde (rename_all = "snake_case")] pub struct CountRequestInternal { # [doc = " Look only for points which satisfies this conditions"] # [validate] pub filter : Option < Filter > , # [doc = " If true, count exact number of points. If false, count approximate number of points faster."] # [doc = " Approximate count might be unreliable during the indexing process. Default: true"] # [serde (default = "default_exact_cou
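
The batch call returns two independent ranked lists, one per model, and the notebook prints them one after the other. If a single merged ranking is needed, reciprocal rank fusion (RRF) is one common technique. The notebook itself does not use it, so treat the following as a sketch under that assumption. The function name, the toy IDs, and the constant `k = 60` (a conventional default) are illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Merge ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda item_id: scores[item_id], reverse=True)


# Toy IDs standing in for point IDs returned by the "text" and "code" searches
text_ranking = ["a", "b", "c"]
code_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([text_ranking, code_ranking])
print(fused)
```

RRF works on ranks rather than raw scores, which sidesteps the fact that the two models produce scores on different ranges, as the outputs above suggest.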

Last but not least, if we want to improve the diversity of the results, grouping them by module might be a good idea.

In [21]:
results = client.search_groups(
    "qdrant-sources",
    query_vector=("code", code_model.encode(query).tolist()),
    group_by="context.module",
    limit=5,
    group_size=1,
)
for group in results.groups:
    for hit in group.hits:
        print(
            "| ",
            hit.payload["context"]["module"],
            " | ",
            hit.payload["context"]["file_name"],
            " | ",
            hit.score,
            " | `",
            hit.payload["signature"],
            "` |",
        )

|  field_index  |  geo_index.rs  |  0.73278356  | ` fn count_indexed_points (& self) -> usize ` |
|  numeric_index  |  mod.rs  |  0.7254975  | ` fn count_indexed_points (& self) -> usize ` |
|  map_index  |  mod.rs  |  0.7124739  | ` fn count_indexed_points (& self) -> usize ` |
|  fixtures  |  payload_context_fixture.rs  |  0.7062038  | ` fn total_point_count (& self) -> usize ` |
|  hnsw_index  |  graph_links.rs  |  0.6998417  | ` fn num_points (& self) -> usize ` |
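
Conceptually, `group_size=1` with `group_by="context.module"` keeps only the best hit for each distinct module. The sketch below reproduces that idea on plain `(module, score)` tuples, with scores rounded from the outputs above. It illustrates the effect only, not how the server implements grouping, and `top_hit_per_group` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (module, score)


def top_hit_per_group(hits: List[Hit], limit: int) -> List[Hit]:
    """Keep the highest-scoring hit per module, up to `limit` modules."""
    best: Dict[str, Hit] = {}
    for module, score in sorted(hits, key=lambda hit: hit[1], reverse=True):
        if module not in best:
            best[module] = (module, score)
            if len(best) == limit:
                break
    # Dicts preserve insertion order, so groups stay sorted by score
    return list(best.values())


ungrouped = [
    ("field_index", 0.7328),
    ("numeric_index", 0.7255),
    ("map_index", 0.7125),
    ("map_index", 0.7125),
    ("fixtures", 0.7062),
    ("hnsw_index", 0.6998),
]
grouped = top_hit_per_group(ungrouped, limit=5)
print([module for module, _ in grouped])
```

The duplicate `map_index` entry is collapsed, which frees a slot for the `hnsw_index` hit, mirroring the difference between the ungrouped and grouped outputs above.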


For a more detailed guide, please check our [code search tutorial](https://qdrant.tech/documentation/tutorials/code-search/) and [code search demo](https://github.com/qdrant/demo-code-search).