# Qdrant

This notebook shows how to use [Qdrant](https://qdrant.tech/) to index paper embeddings and search for similar papers.

**Strengths:**
- Open-source vector database with customizable features.
- Supports complex queries, geospatial search, and filtering.
- Built-in support for incremental indexing and real-time updates.
- Suitable for various use cases like search, recommendation, and clustering.
- Community-driven development with potential for rapid feature enhancements.
- Can be deployed on-premises or in the cloud.
- Supports batch search.
- Provides Go, Rust, Python, and JavaScript client libraries.

**Weaknesses:**
- No significant weaknesses identified so far.

### Load OpenAI embeddings

In [4]:
import os

os.getcwd()

'/Users/patrick/Code/praktikum-ise-2023-patrick-zierahn/notebooks/vector_db'

In [5]:
import pandas as pd
import numpy as np

datafile_path = "data/software_architecture/bib-text_embeddings.csv.csv"
df = pd.read_csv(datafile_path)
# Each embedding is stored in the CSV as a string such as "[0.1, -0.2, ...]";
# parse it and convert it to a NumPy array.
df["embedding"] = df.embedding.apply(eval).apply(np.array)
df

| Unnamed: 0 | author | title | doi | classes | url | abstract | embedding |
|---|---|---|---|---|---|---|---|
| 0 | Alessio Bucaioni and Patrizio Pelliccione and ... | Aligning Architecture with Business Goals in t... | 10.1109/ICSA51549.2021.00020 | Meta Data{Research Level{Primary Research}, Ki... | https://doi.org/10.1109/ICSA51549.2021.00020 | When designing complex automotive systems in p... | [0.00016247970052063465, -0.006258538924157619... |
| 1 | H{\'{e}}ctor Cadavid and Vasilios Andrikopoulo... | System- and Software-level Architecting Harmon... | 10.1109/ICSA51549.2021.00010 | Meta Data{Kind{full}, Paper class{Evaluation R... | https://doi.org/10.1109/ICSA51549.2021.00010 | The problems caused by the gap between system-... | [0.01373602356761694, -0.007272415794432163, 0... |
| 2 | Joshua Garcia and Mehdi Mirakhorli and Lu Xiao... | Constructing a Shared Infrastructure for Softw... | 10.1109/ICSA51549.2021.00022 | Meta Data{Paper class{Evaluation Research}, Re... | https://doi.org/10.1109/ICSA51549.2021.00022 | Over the past three decades software engineeri... | [0.0022478506434708834, -0.0009823687141761184... |
| 3 | Holger Knoche and Wilhelm Hasselbring | Continuous {API} Evolution in Heterogenous Ent... | 10.1109/ICSA51549.2021.00014 | Meta Data{Research Level{Primary Research}, Ki... | https://doi.org/10.1109/ICSA51549.2021.00014 | The ability to independently deploy parts of a... | [0.0058586616069078445, -0.031336553394794464,... |
| 4 | Duc Minh Le and Suhrid Karthik and Marcelo Sch... | Architectural Decay as Predictor of Issue- and... | 10.1109/ICSA51549.2021.00017 | Meta Data{Paper class{Evaluation Research}, Re... | https://doi.org/10.1109/ICSA51549.2021.00017 | Architectural decay imposes real costs in term... | [0.015215540304780006, -0.02269367128610611, 0... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 148 | Keim, Jan and Schulz, Sophie and Fuch{\ss}, Do... | Trace {Link} {Recovery} for {Software} {Archit... | 10.1007/978-3-030-86044-8_7 | Meta Data{Research Level{Primary Research}, Ki... | | Software Architecture Documentation often cons... | [0.00465840520337224, 0.004803537856787443, 0.... |
| 149 | Shabelnyk, Oleksandr and Frangoudis, Pantelis ... | Updating {Service}-{Based} {Software} {Systems... | 10.1007/978-3-030-86044-8_10 | Meta Data{Paper class{Proposal of Solution}, R... | | Contemporary component-based systems often man... | [-0.004098787903785706, -0.006373200099915266,... |
| 150 | Stefan Kugele and David Hettler and Jan Peter | Data-Centric Communication and Containerizatio... | 10.1109/ICSA.2018.00016 | Meta Data{Kind{full}, Research Level{Primary R... | https://doi.org/10.1109/ICSA.2018.00016 | Context: The functional interconnection and da... | [-0.011806309223175049, 0.010385949164628983, ... |
| 151 | Banani Roy and Amit Kumar Mondal and Chanchal ... | Towards a Reference Architecture for Cloud-Bas... | 10.1109/ICSA.2017.42 | Meta Data{Paper class{Validation Research, Eva... | https://doi.org/10.1109/ICSA.2017.42 | The domain of plant genotyping and phenotyping... | [-0.00909524504095316, -0.012267849408090115, ... |
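
The loading cell parses the `embedding` column with `eval`, which executes whatever Python expression the CSV contains. For a file from an untrusted source, a stricter parser may be preferable. The following is a minimal sketch, assuming each cell holds a JSON-style list of numbers as in the output above; `parse_embedding` is a hypothetical helper and not part of the notebook.

```python
import json
from typing import List


def parse_embedding(cell: str) -> List[float]:
    """Parse one CSV cell such as "[0.1, -0.2]" into a list of floats."""
    values = json.loads(cell)  # rejects anything that is not valid JSON
    if not isinstance(values, list):
        raise ValueError("expected a JSON list of numbers")
    return [float(v) for v in values]


# Example with the first two components of row 0 shown above.
row = "[0.00016247970052063465, -0.006258538924157619]"
print(parse_embedding(row))
```

In the notebook this could stand in for `eval`, for example `df.embedding.apply(parse_embedding).apply(np.array)`.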


In [9]:
embeddings = np.array(df["embedding"].tolist())
print("Embeddings shape:", embeddings.shape)

Embeddings shape: (153, 1536)


### Install Qdrant

In [6]:
!docker pull qdrant/qdrant

Using default tag: latest
latest: Pulling from qdrant/qdrant

4f7dc01b: Pulling fs layer
4c9e3517: Pulling fs layer
79e24954: Pulling fs layer
21281235: Pulling fs layer
8dab8050: Pulling fs layer
3d7d85aa: Pulling fs layer
10b9cceb: Pulling fs layer
Digest: sha256:fe6155cde4854925e6aec1c9e7e12660443972d374e8f1095e118515b6d01075
Status: Downloaded newer image for qdrant/qd

In [11]:
!docker run --rm -p "6333:6333" -p "6334:6334" -d qdrant/qdrant

858c09e1c74910f32747c54b9c0ce95c99cf7d43fc0d23e1b870a1bcd8d31232
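
Because the container is started detached (`-d`), the next cell can run before Qdrant accepts connections. The sketch below polls the published REST port with the standard library only. `wait_for_port` is a hypothetical helper, and an open TCP port is only a rough readiness signal, not a guarantee that the service has finished starting.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not reachable yet, retry shortly
    return False


# 6333 is the REST port published by the `docker run` command above.
print("Qdrant reachable:", wait_for_port("localhost", 6333, timeout=3.0))
```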


### Create a collection

In [12]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient("localhost", port=6333)
client.recreate_collection(
    collection_name="architecture_classes",
    vectors_config=VectorParams(
        size=embeddings.shape[1],
        distance=Distance.DOT
    ),
)

True
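
The collection is created with `Distance.DOT`. For vectors of unit length, the dot product equals cosine similarity, so both metrics rank results identically in that case. The sketch below illustrates this with toy two-dimensional vectors in plain Python; the helper names are illustrative, and whether the stored embeddings are unit length is an assumption worth checking on the actual data.

```python
import math


def dot(a, b):
    """Dot product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by both vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]
a, b = normalize(raw_a), normalize(raw_b)

print("raw dot:       ", dot(raw_a, raw_b))  # depends on vector lengths
print("normalized dot:", dot(a, b))          # matches the cosine value
print("cosine:        ", cosine(raw_a, raw_b))
```

If the embeddings were not normalized, `Distance.COSINE` would be the closer match to the intent of a similarity search.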

In [20]:
import json
from src.utils.utils_json import print_json

collection_info = client.get_collection(collection_name="architecture_classes")

print_json(json.loads(collection_info.model_dump_json()))

 {
  "config": {
    "hnsw_config": {
      "ef_construct": 100,
      "full_scan_threshold": 10000,
      "m": 16,
      "max_indexing_threads": 0,
      "on_disk": false,
      "payload_m": null
    },
    "optimizer_config": {
      "default_segment_number": 0,
      "deleted_threshold": 0.2,
      "flush_interval_sec": 5,
      "indexing_threshold": 20000,
      "max_optimization_threads": 1,
      "max_segment_size": null,
      "memmap_threshold": null,
      "vacuum_min_vector_number": 1000
    },
    "params": {
      "on_disk_payload": true,
      "replication_factor": 1,
      "shard_number": 1,
      "vectors": {
        "distance": "Dot",
        "hnsw_config": null,
        "on_disk": null,
        "quantization_config": null,
        "size": 1536
      },
      "write_consistency_factor": 1
    },
    "quantization_config": null,
    "wal_config": {
      "wal_capacity_mb": 32,
      "wal_segments_ahead": 0
    }
  },
  "indexed_vectors_count": 0,
  "optimizer_status": "o

### Insert embeddings into the collection

In [23]:
from qdrant_client.http.models import PointStruct

operation_info = client.upsert(
    collection_name="architecture_classes",
    wait=True,
    points=[
        PointStruct(id=inx, vector=embeddings[inx].tolist())  # plain list of floats
        for inx in range(embeddings.shape[0])
    ]
)

print(operation_info)

operation_id=1 status=<UpdateStatus.COMPLETED: 'completed'>
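
With 153 points, a single `upsert` call is sufficient. For larger collections it may help to send points in batches so that each request stays small. The following is a sketch of a hypothetical `chunked` helper; the batch size of 64 is an arbitrary assumption, not a Qdrant recommendation.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = list(range(153))  # stand-in for the 153 point ids used above
batches = list(chunked(ids, 64))
print([len(batch) for batch in batches])  # [64, 64, 25]
```

In the notebook, each batch of `PointStruct` objects could then be passed to `client.upsert(collection_name="architecture_classes", wait=True, points=batch)`.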


### Find similar papers given an embedding

In [27]:
example = df.loc[0]

print("Title:", example["title"])
print("doi:", example["doi"])

query_vector = example["embedding"]
print("Query vector:", query_vector)
print()

search_result = client.search(
    collection_name="architecture_classes",
    query_vector=query_vector,
    limit=3
)

for result in search_result:
    print("Title:", df.loc[result.id]["title"])
    print("doi:", df.loc[result.id]["doi"])
    print("Score:", result.score)
    print()

Title: Aligning Architecture with Business Goals in the Automotive Domain
doi: 10.1109/ICSA51549.2021.00020
Query vector: [ 0.00016248 -0.00625854  0.02198316 ... -0.00174366 -0.00339407
 -0.01621425]

Title: Aligning Architecture with Business Goals in the Automotive Domain
doi: 10.1109/ICSA51549.2021.00020
Score: 0.9999999

Title: Technical Architectures for Automotive Systems
doi: 10.1109/ICSA47634.2020.00013
Score: 0.8932742

Title: On Service-Orientation for Automotive Software
doi: 10.1109/ICSA.2017.20
Score: 0.8728726
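
The query paper is itself stored in the collection, which is why it comes back as the top hit with a score of about 1. To make the scoring concrete, the sketch below ranks a toy set of vectors by dot product with an exhaustive scan. It is an exact baseline for illustration only: `top_k_dot` is a hypothetical helper, and Qdrant may answer the same query through its HNSW index rather than a full scan.

```python
import heapq


def top_k_dot(query, vectors, k=3):
    """Return [(index, score)] for the k vectors with the highest dot product."""
    scores = (
        (i, sum(q * x for q, x in zip(query, v)))
        for i, v in enumerate(vectors)
    )
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]

# Querying with a stored vector returns that vector first, as in the output above.
print(top_k_dot(vectors[0], vectors, k=3))  # [(0, 1.0), (1, 0.6), (2, 0.0)]
```

Comparing such a baseline against `client.search` on a small sample is one way to sanity-check ids and scores.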


### Find similar papers given a text input

In [31]:
import openai

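# Read the OpenAI API key from the environment and choose one of the available models:
# https://beta.openai.com/docs/models/overview
# Note: `openai.Embedding.create` is the interface of openai-python versions before 1.0.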
openai.api_key = os.environ['OPENAI_API_KEY']

response = openai.Embedding.create(
    input="Microservices",
    model="text-embedding-ada-002",
)

request_embedding = response["data"][0]["embedding"]
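
The cell above relies on the module-level `openai.Embedding.create` call, which is only available in openai-python versions before 1.0. The sketch below shows the equivalent request with the client object introduced in version 1.0, plus a small guard that the returned vector matches the collection's vector size (1536 in the configuration shown earlier). `embed_text` and `check_dimension` are hypothetical helper names.

```python
from typing import List, Sequence


def embed_text(text: str, model: str = "text-embedding-ada-002") -> List[float]:
    """Request one embedding via the openai>=1.0 client (needs OPENAI_API_KEY)."""
    # Imported lazily so this code loads even where `openai` is not installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment by default
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding


def check_dimension(vector: Sequence[float], expected: int) -> Sequence[float]:
    """Raise ValueError if the vector length differs from the expected size."""
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
    return vector


# Hypothetical usage (performs a network call, so it is left commented out):
# request_embedding = check_dimension(embed_text("Microservices"), 1536)
```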

In [32]:
search_result = client.search(
    collection_name="architecture_classes",
    query_vector=request_embedding,
    limit=3
)

for result in search_result:
    print("Title:", df.loc[result.id]["title"])
    print("doi:", df.loc[result.id]["doi"])
    print("Score:", result.score)
    print()

Title: From a Monolith to a Microservices Architecture: An Approach Based on Transactional Contexts
doi: 10.1007/978-3-030-29983-5\_3
Score: 0.8552163

Title: From Monolithic Architecture Style to Microservice one Based on a Semi-Automatic Approach
doi: 10.1109/ICSA47634.2020.00023
Score: 0.85248506

Title: Migrating Towards Microservice Architectures: An Industrial Survey
doi: 10.1109/ICSA.2018.00012
Score: 0.8496934
