# Code search with Qdrant

This notebook demonstrates how to implement a code search mechanism using two different neural encoders: one general-purpose, and another trained specifically for code. Let's start by installing all the required dependencies.

In [None]:
%pip install "qdrant-client[fastembed]" inflection

We are going to work with the [Qdrant source code](https://github.com/qdrant/qdrant), which has already been converted into chunks. If you want to do the same for a different project, consider using one of the [LSP implementations](https://microsoft.github.io/language-server-protocol/) for your programming language. It should be fairly easy to build similar structures with the help of these tools.

In [3]:
!curl -O https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4805k  100 4805k    0     0   759k      0  0:00:06  0:00:06 --:--:-- 1531k


In [4]:
import json

# Each line of the JSONL file holds one JSON-encoded code structure
structures = []
with open("structures.jsonl", "r") as fp:
    for row in fp:
        entry = json.loads(row)
        structures.append(entry)

structures[0]

{'name': 'InvertedIndexRam',
 'signature': '# [doc = " Inverted flatten index from dimension id to posting list"] # [derive (Debug , Clone , PartialEq)] pub struct InvertedIndexRam { # [doc = " Posting lists for each dimension flattened (dimension id -> posting list)"] # [doc = " Gaps are filled with empty posting lists"] pub postings : Vec < PostingList > , # [doc = " Number of unique indexed vectors"] # [doc = " pre-computed on build and upsert to avoid having to traverse the posting lists."] pub vector_count : usize , }',
 'code_type': 'Struct',
 'docstring': '= " Inverted flatten index from dimension id to posting list"',
 'line': 15,
 'line_from': 13,
 'line_to': 22,
 'context': {'module': 'inverted_index',
  'file_path': 'lib/sparse/src/index/inverted_index/inverted_index_ram.rs',
  'file_name': 'inverted_index_ram.rs',
  'struct_name': None,
  'snippet': '/// Inverted flatten index from dimension id to posting list\n#[derive(Debug, Clone, PartialEq)]\npub struct InvertedIndexRam
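
To get a quick feel for the dataset, it can help to count how many structures there are of each `code_type`. The sketch below is only an illustration: `count_code_types` is a hypothetical helper, and the three sample rows are hand-written stand-ins that reuse names and `code_type` values seen in this notebook. In the notebook itself you would pass `structures` instead of `sample`.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_code_types(entries: Iterable[Dict[str, Any]]) -> Counter:
    """Count how many parsed structures there are per `code_type`."""
    return Counter(entry["code_type"] for entry in entries)


# Hand-written sample rows in the same shape as the entries of structures.jsonl
sample = [
    {"name": "InvertedIndexRam", "code_type": "Struct"},
    {"name": "await_ready_for_timeout", "code_type": "Function"},
    {"name": "count_indexed_points", "code_type": "Function"},
]

code_type_counts = count_code_types(sample)
print(code_type_counts.most_common())
```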

We will use two different neural encoders: `all-MiniLM-L6-v2` and `jina-embeddings-v2-base-code`. The first one is a general-purpose model trained mostly on natural language, so we need to convert the code into a more human-friendly text representation before encoding it. This normalization strips language-specific syntax, so the output reads more like a description of the particular code structure.

In [6]:
import inflection
import re

from typing import Dict, Any

def textify(chunk: Dict[str, Any]) -> str:
    """
    Convert the code structure into natural language like representation.

    Args:
        chunk (dict): Dictionary-like representation of the code structure
            Example: {
                "name":"await_ready_for_timeout",
                "signature":"fn await_ready_for_timeout (& self , timeout : Duration) -> bool",
                "code_type":"Function",
                "docstring":"= \" Return `true` if ready, `false` if timed out.\"",
                "line":44,
                "line_from":43,
                "line_to":51,
                "context":{
                    "module":"common",
                    "file_path":"lib/collection/src/common/is_ready.rs",
                    "file_name":"is_ready.rs",
                    "struct_name":"IsReady",
                    "snippet":"    /// Return `true` if ready, `false` if timed out.\n    pub fn await_ready_for_timeout(&self, timeout: Duration) -> bool {\n        let mut is_ready = self.value.lock();\n        if !*is_ready {\n            !self.condvar.wait_for(&mut is_ready, timeout).timed_out()\n        } else {\n            true\n        }\n    }\n"
                }
            }

    Returns:
        str: A simplified natural language like description of the structure with some context info
            Example: "Function Await ready for timeout that does Return true if ready false if timed out defined as Fn await ready for timeout self timeout duration bool defined in struct Isready module common file is_ready rs"
    """
    # Get rid of all the camel case / snake case
    # - inflection.underscore changes the camel case to snake case
    # - inflection.humanize converts the snake case to human readable form
    name = inflection.humanize(inflection.underscore(chunk["name"]))
    signature = inflection.humanize(inflection.underscore(chunk["signature"]))

    # Check if docstring is provided
    docstring = ""
    if chunk["docstring"]:
        docstring = f"that does {chunk['docstring']} "

    # Extract the location of that snippet of code
    context = (
        f"module {chunk['context']['module']} "
        f"file {chunk['context']['file_name']}"
    )
    if chunk["context"]["struct_name"]:
        struct_name = inflection.humanize(
            inflection.underscore(chunk["context"]["struct_name"])
        )
        context = f"defined in struct {struct_name} {context}"

    # Combine all the bits and pieces together
    text_representation = (
        f"{chunk['code_type']} {name} "
        f"{docstring}"
        f"defined as {signature} "
        f"{context}"
    )

    # Remove any special characters and concatenate the tokens
    tokens = re.split(r"\W", text_representation)
    tokens = filter(lambda x: x, tokens)
    return " ".join(tokens)

Here is what the same structure looks like after the normalization step:

In [7]:
textify(structures[0])

'Struct Inverted index ram that does Inverted flatten index from dimension id to posting list defined as doc inverted flatten index from dimension id to posting list derive debug clone partial eq pub struct inverted index ram doc posting lists for each dimension flattened dimension id posting list doc gaps are filled with empty posting lists pub postings vec posting list doc number of unique indexed vectors doc pre computed on build and upsert to avoid having to traverse the posting lists pub vector count usize module inverted_index file inverted_index_ram rs'
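
Most of the normalization comes from the two `inflection` calls. To make that step less of a black box, here is a rough standard-library approximation of what `inflection.humanize(inflection.underscore(...))` does to a single identifier. `humanize_identifier` is a hypothetical helper written for this illustration; it is not the library's implementation and may differ from it on edge cases.

```python
import re


def humanize_identifier(identifier: str) -> str:
    """Turn a CamelCase or snake_case identifier into lowercase words,
    capitalizing only the first letter."""
    # Insert an underscore at lower-to-upper boundaries: "IndexRam" -> "Index_Ram"
    snake = re.sub(r"([a-z\d])([A-Z])", r"\1_\2", identifier)
    # Split an acronym from a following capitalized word: "HTTPServer" -> "HTTP_Server"
    snake = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", snake)
    # Lowercase everything and turn underscores into spaces
    words = snake.lower().replace("_", " ").split()
    text = " ".join(words)
    return text[:1].upper() + text[1:]


print(humanize_identifier("InvertedIndexRam"))
print(humanize_identifier("await_ready_for_timeout"))
```

Both identifiers come from examples in this notebook, so the printed text can be compared with the beginning of the `textify` outputs shown above.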

Let's do it for all the structures at once:

In [8]:
text_representations = list(map(textify, structures))

As a next step, we are going to extract all the code snippets into a separate list. This will be the input to the second, code-specific model.

In [9]:
code_snippets = [
    structure["context"]["snippet"]
    for structure in structures
]
code_snippets[0]

'/// Inverted flatten index from dimension id to posting list\n#[derive(Debug, Clone, PartialEq)]\npub struct InvertedIndexRam {\n    /// Posting lists for each dimension flattened (dimension id -> posting list)\n    /// Gaps are filled with empty posting lists\n    pub postings: Vec<PostingList>,\n    /// Number of unique indexed vectors\n    /// pre-computed on build and upsert to avoid having to traverse the posting lists.\n    pub vector_count: usize,\n}\n'

The created embeddings have to be indexed in a Qdrant collection. For that, we need a running instance. The easiest way to get one is [Qdrant Cloud](https://cloud.qdrant.io/), which offers a free-tier 1GB cluster. Alternatively, you can use [a local Docker container](https://qdrant.tech/documentation/quick-start/), although running it in Google Colab might require installing Docker first.

In [10]:
QDRANT_URL = "https://my-cluster.cloud.qdrant.io:6333" # http://localhost:6333 for local instance
QDRANT_API_KEY = "THIS_IS_YOUR_API_KEY" # None for local instance
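
If you prefer not to hardcode credentials in the notebook, one option is to read them from environment variables. This is a minimal sketch, assuming the variables are called `QDRANT_URL` and `QDRANT_API_KEY` (names chosen here to mirror the constants above) and that a missing URL should fall back to a local instance. `load_qdrant_settings` is a hypothetical helper.

```python
import os
from typing import Optional, Tuple


def load_qdrant_settings() -> Tuple[str, Optional[str]]:
    """Read connection settings from the environment.

    QDRANT_URL and QDRANT_API_KEY are assumed variable names.
    """
    # Fall back to a local instance when no URL is configured
    url = os.environ.get("QDRANT_URL", "http://localhost:6333")
    # An empty or missing key becomes None, which suits a local instance
    api_key = os.environ.get("QDRANT_API_KEY") or None
    return url, api_key


QDRANT_URL, QDRANT_API_KEY = load_qdrant_settings()
```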

In [None]:
from qdrant_client import QdrantClient, models

client = QdrantClient(QDRANT_URL, api_key=QDRANT_API_KEY)
client.create_collection(
    "qdrant-sources",
    vectors_config={
        "text": models.VectorParams(
            size=384,
            distance=models.Distance.COSINE,
        ),
        "code": models.VectorParams(
            size=768,
            distance=models.Distance.COSINE,
        ),
    }
)

True

Our collection should be created by now. As you can see, we configured so-called **[named vectors](https://qdrant.tech/documentation/concepts/points/)**, so that two different embeddings can be stored in the same collection.
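
Conceptually, each point carries one vector per name, and every query picks the name it wants to compare against. The toy sketch below illustrates that idea with plain dictionaries and brute-force cosine similarity. It is not how Qdrant works internally; the tiny 2- and 3-dimensional vectors stand in for the real 384- and 768-dimensional embeddings, and `search` is a hypothetical helper.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(
    points: Dict[str, Dict[str, Vector]], query: Vector, using: str, limit: int = 5
) -> List[Tuple[str, float]]:
    """Score every point by the vector stored under the name `using`."""
    scored = [
        (point_id, cosine_similarity(vectors[using], query))
        for point_id, vectors in points.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Each toy point stores two vectors of different sizes under different names
toy_points = {
    "a": {"text": [1.0, 0.0], "code": [0.0, 1.0, 0.0]},
    "b": {"text": [0.0, 1.0], "code": [1.0, 0.0, 0.0]},
}

print(search(toy_points, [1.0, 0.1], using="text"))
print(search(toy_points, [1.0, 0.1, 0.0], using="code"))
```

The same point can rank first under one name and last under the other, which is why the two encoders can return different results for the same query.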

Let's finally index all the data.

In [None]:
import uuid

points = [
    models.PointStruct(
        id=uuid.uuid4().hex,
        vector={
            "text": models.Document(text=text, model="sentence-transformers/all-MiniLM-L6-v2"),
            "code": models.Document(text=code, model="jinaai/jina-embeddings-v2-base-code"),
        },
        payload=structure
    )
    for text, code, structure in zip(text_representations, code_snippets, structures)
]
len(points)

4723

In [14]:
client.upload_points(
    "qdrant-sources",
    points=points,
    batch_size=64,
)
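
The `batch_size=64` argument means the points are sent in chunks rather than in one huge request. The client handles this internally, so the helper below is only an illustration of the arithmetic. `iter_batches` is a hypothetical function, and `range(4723)` stands in for the list of points.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The last batch may be smaller than batch_size
    if batch:
        yield batch


batch_sizes = [len(batch) for batch in iter_batches(range(4723), 64)]
print(len(batch_sizes), batch_sizes[-1])  # prints "74 51"
```

With 4723 points and a batch size of 64, that gives 73 full batches and a final one with 51 points.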

If you want to check whether all the points were sent, the easiest way is to count them.

In [15]:
client.count("qdrant-sources")

CountResult(count=4723)

If, however, you want to know how the count endpoint works internally in the Qdrant server, that is a question we can ask our code search.

In [16]:
query = "How do I count points in a collection?"

First of all, let's use one model at a time, starting with the general-purpose one.

In [None]:
hits = client.query_points(
    "qdrant-sources",
    query=models.Document(text=query, model="sentence-transformers/all-MiniLM-L6-v2"),
    using="text",
    limit=5,
).points

for hit in hits:
    print(
        "| ",
        hit.payload["context"]["module"], " | ",
        hit.payload["context"]["file_name"], " | ",
        hit.score, " | `",
        hit.payload["signature"], "` |"
    )

|  toc  |  point_ops.rs  |  0.59448624  | ` async fn count (& self , collection_name : & str , request : CountRequestInternal , read_consistency : Option < ReadConsistency > , shard_selection : ShardSelectorInternal ,) -> Result < CountResult , StorageError > ` |
|  operations  |  types.rs  |  0.5493385  | ` # [doc = " Count Request"] # [doc = " Counts the number of points which satisfy the given filter."] # [doc = " If filter is not provided, the count of all points in the collection will be returned."] # [derive (Debug , Deserialize , Serialize , JsonSchema , Validate)] # [serde (rename_all = "snake_case")] pub struct CountRequestInternal { # [doc = " Look only for points which satisfies this conditions"] # [validate] pub filter : Option < Filter > , # [doc = " If true, count exact number of points. If false, count approximate number of points faster."] # [doc = " Approximate count might be unreliable during the indexing process. Default: true"] # [serde (default = "default_exact_cou

The results obtained with the code-specific model should be different.

In [None]:
hits = client.query_points(
    "qdrant-sources",
    query=models.Document(text=query, model="jinaai/jina-embeddings-v2-base-code"),
    using="code",
    limit=5,
).points

for hit in hits:
    print(
        "| ",
        hit.payload["context"]["module"], " | ",
        hit.payload["context"]["file_name"], " | ",
        hit.score, " | `",
        hit.payload["signature"], "` |"
    )

|  field_index  |  geo_index.rs  |  0.73278356  | ` fn count_indexed_points (& self) -> usize ` |
|  numeric_index  |  mod.rs  |  0.7254975  | ` fn count_indexed_points (& self) -> usize ` |
|  map_index  |  mod.rs  |  0.7124739  | ` fn count_indexed_points (& self) -> usize ` |
|  map_index  |  mod.rs  |  0.7124739  | ` fn count_indexed_points (& self) -> usize ` |
|  fixtures  |  payload_context_fixture.rs  |  0.7062038  | ` fn total_point_count (& self) -> usize ` |


We set up the system with two different models because we want to combine the results coming from both of them. A batch request lets us do that with a single call to Qdrant.

In [None]:
responses = client.query_batch_points(
    "qdrant-sources",
    requests=[
        models.QueryRequest(
            query=models.Document(text=query, model="sentence-transformers/all-MiniLM-L6-v2"),
            using="text",
            with_payload=True,
            limit=5,
        ),
        models.QueryRequest(
            query=models.Document(text=query, model="jinaai/jina-embeddings-v2-base-code"),
            using="code",
            with_payload=True,
            limit=5,
        ),
    ]
)

results = [response.points for response in responses]
for hits in results:
    for hit in hits:
        print(
            "| ",
            hit.payload["context"]["module"], " | ",
            hit.payload["context"]["file_name"], " | ",
            hit.score, " | `",
            hit.payload["signature"], "` |"
        )

|  toc  |  point_ops.rs  |  0.59448624  | ` async fn count (& self , collection_name : & str , request : CountRequestInternal , read_consistency : Option < ReadConsistency > , shard_selection : ShardSelectorInternal ,) -> Result < CountResult , StorageError > ` |
|  operations  |  types.rs  |  0.5493385  | ` # [doc = " Count Request"] # [doc = " Counts the number of points which satisfy the given filter."] # [doc = " If filter is not provided, the count of all points in the collection will be returned."] # [derive (Debug , Deserialize , Serialize , JsonSchema , Validate)] # [serde (rename_all = "snake_case")] pub struct CountRequestInternal { # [doc = " Look only for points which satisfies this conditions"] # [validate] pub filter : Option < Filter > , # [doc = " If true, count exact number of points. If false, count approximate number of points faster."] # [doc = " Approximate count might be unreliable during the indexing process. Default: true"] # [serde (default = "default_exact_cou
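
The batch call returns two separate ranked lists, and the notebook does not prescribe how to merge them. One common option, shown here purely as an example and not as the method used by this notebook, is reciprocal rank fusion (RRF): each item gets a score of `1 / (k + rank)` from every list it appears in, and the scores are summed. The function name, the constant `k=60`, and the single-letter IDs are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Merge several ranked lists of IDs into one list, best first."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Items near the top of any list contribute the most
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


# Hypothetical point IDs, ordered as returned by each model
text_ranking = ["a", "b", "c"]
code_ranking = ["c", "a", "d"]

fused = reciprocal_rank_fusion([text_ranking, code_ranking])
print(fused)  # prints ['a', 'c', 'b', 'd']
```

In the notebook, the rankings could be built from the point IDs in `results`, for example `[[hit.id for hit in hits] for hits in results]`.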

Last but not least, if we want to improve the diversity of the results, grouping them by module might be a good idea.

In [None]:
results = client.query_points_groups(
    collection_name="qdrant-sources",
    using="code",
    query=models.Document(text=query, model="jinaai/jina-embeddings-v2-base-code"),
    group_by="context.module",
    limit=5,
    group_size=1,
)

for group in results.groups:
    for hit in group.hits:
        print(
            "| ",
            hit.payload["context"]["module"], " | ",
            hit.payload["context"]["file_name"], " | ",
            hit.score, " | `",
            hit.payload["signature"], "` |"
        )

|  field_index  |  geo_index.rs  |  0.73278356  | ` fn count_indexed_points (& self) -> usize ` |
|  numeric_index  |  mod.rs  |  0.7254975  | ` fn count_indexed_points (& self) -> usize ` |
|  map_index  |  mod.rs  |  0.7124739  | ` fn count_indexed_points (& self) -> usize ` |
|  fixtures  |  payload_context_fixture.rs  |  0.7062038  | ` fn total_point_count (& self) -> usize ` |
|  hnsw_index  |  graph_links.rs  |  0.6998417  | ` fn num_points (& self) -> usize ` |
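
To see what `group_by` and `group_size=1` do to a ranked list, here is a client-side sketch that keeps only the best hit per module. It is an illustration of the idea, not Qdrant's server-side implementation. `group_by_module` is a hypothetical helper, and the input rows are the five ungrouped hits from the code-specific search earlier in this notebook.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (module, file_name, score)


def group_by_module(
    hits: List[Hit], limit: int = 5, group_size: int = 1
) -> Dict[str, List[Hit]]:
    """Keep at most `group_size` best hits for each of up to `limit` modules."""
    groups: Dict[str, List[Hit]] = {}
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        module = hit[0]
        if module not in groups:
            if len(groups) == limit:
                continue  # enough groups already, skip new modules
            groups[module] = []
        if len(groups[module]) < group_size:
            groups[module].append(hit)
    return groups


ungrouped_hits = [
    ("field_index", "geo_index.rs", 0.73278356),
    ("numeric_index", "mod.rs", 0.7254975),
    ("map_index", "mod.rs", 0.7124739),
    ("map_index", "mod.rs", 0.7124739),
    ("fixtures", "payload_context_fixture.rs", 0.7062038),
]

grouped = group_by_module(ungrouped_hits)
print(list(grouped))
```

Grouping only these five hits yields four modules, because `map_index` appears twice. The server-side call looks beyond the first five hits to fill the requested number of groups, which is why the output above also contains `hnsw_index`.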


For a more detailed guide, please check our [code search tutorial](https://qdrant.tech/documentation/tutorials/code-search/) and [code search demo](https://github.com/qdrant/demo-code-search).