<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/qdrant_arize.png" width="500"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Tuning a RAG Pipeline using Qdrant and Arize Phoenix</h1>

ℹ️ This notebook requires an OpenAI API key.

### **1. Import Relevant Packages**

In [1]:
import os

# Setup projects
SIMPLE_RAG_PROJECT = "simple-rag"
HYBRID_RAG_PROJECT = "hybrid-rag"
os.environ["PHOENIX_PROJECT_NAME"] = SIMPLE_RAG_PROJECT

In [2]:
import datetime
import json
import os
import pickle
import ssl
import time
import urllib
from getpass import getpass
from urllib.request import urlopen

import certifi
import nest_asyncio
import openai
import pandas as pd
import phoenix as px
import requests
from bs4 import BeautifulSoup
from llama_index.core import (
    ServiceContext, StorageContext, download_loader,
    load_index_from_storage, set_global_handler
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.graph_stores.simple import SimpleGraphStore
from llama_index.core.indices.vector_store.base import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from phoenix.evals import (
    HallucinationEvaluator, OpenAIModel, QAEvaluator,
    RelevanceEvaluator, run_evals
)
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from tqdm import tqdm

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models
from qdrant_client.http.models import PointStruct

nest_asyncio.apply()  # needed for concurrent evals in notebook environments
pd.set_option("display.max_colwidth", 1000)

### **2. Launch Phoenix**
You can run Phoenix in the background to collect trace data emitted by any LlamaIndex application that has been instrumented with the OpenInferenceTraceCallbackHandler. Phoenix supports LlamaIndex's one-click observability which will automatically instrument your LlamaIndex application! You can consult our integration guide for a more detailed explanation of how to instrument your LlamaIndex application.

Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI (the UI should be empty because we have yet to run the LlamaIndex application).

In [3]:
session = px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


Be sure to enable Phoenix as your global handler for tracing!

In [4]:
set_global_handler("arize_phoenix")

### **3. Set up your OpenAI key**

In [5]:
from dotenv import load_dotenv
load_dotenv()

True

In [6]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

### **4. Retrieve the documents / dataset to be used**

In [7]:
from datasets import load_dataset

# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("atitaarora/qdrant_doc", split="train")

In [8]:
dataset.info

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='csv', dataset_name='qdrant_doc', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1767967, num_examples=240, shard_lengths=None, dataset_name='qdrant_doc')}, download_checksums={'hf://datasets/atitaarora/qdrant_doc@8d859890840f65337c38e96d660b81b1441bbecd/documents.csv': {'num_bytes': 1777260, 'checksum': None}}, download_size=1777260, post_processing_size=None, dataset_size=1767967, size_in_bytes=3545227)

### **5. Definition of global chunk properties and chunk processing**
Each document is split using the chosen text-splitting algorithm together with the global **CHUNK_SIZE** and **CHUNK_OVERLAP** settings defined here.

In [11]:
## Global config for chunk processing
CHUNK_SIZE = 512 #1000
CHUNK_OVERLAP = 50
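
To build intuition for how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch of a fixed-width character splitter. It is not the splitter used in this notebook (that one also respects separators such as paragraph breaks); `sliding_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Toy input: 1200 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample, chunk_size=512, chunk_overlap=50)
print([len(c) for c in chunks])
```

With these settings each window advances by 462 characters, so the last 50 characters of one chunk reappear at the start of the next. A larger overlap reduces the chance of cutting a relevant sentence in half, at the cost of more chunks to embed and store.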

### **6. Process the dataset as LangChain (or LlamaIndex) documents for further processing**

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument
from llama_index.core import Document

## Split and process the document chunks from the given dataset

def process_document_chunks(dataset, chunk_size, chunk_overlap):
    """Split each dataset row into chunks and return them as LlamaIndex Documents."""
    langchain_docs = [
        LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
        for doc in tqdm(dataset)
    ]

    # could showcase another variation of processed documents
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in langchain_docs:
        docs_processed += text_splitter.split_documents([doc])

    ## Converting Langchain document chunks above into Llamaindex Document for ingestion
    llama_documents = [
        Document.from_langchain_format(doc)
        for doc in docs_processed
    ]
    return llama_documents

In [13]:
documents = process_document_chunks(dataset, CHUNK_SIZE, CHUNK_OVERLAP)
len(documents)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 240/240 [00:00<00:00, 8416.59it/s]


4431

### **7. Setting up Qdrant and Collection**

We first set up the Qdrant client, then create a collection in which to store our data.

In [14]:
## Option 1: in-memory Qdrant client (uncomment to use)
#client = qdrant_client.QdrantClient(
#    location=":memory:",
#)

## Option 2 (active): connect to Qdrant Cloud using QDRANT_URL and QDRANT_API_KEY
client = QdrantClient(
    os.environ.get("QDRANT_URL"),
    api_key=os.environ.get("QDRANT_API_KEY"),
)

## Option 3: local Qdrant instance (uncomment to use)
#client = qdrant_client.QdrantClient("http://localhost:6333")
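
Rather than commenting and uncommenting client definitions, the choice can be driven by environment variables. The sketch below is one possible approach, not part of the original notebook: `qdrant_connection_kwargs` is a hypothetical helper that falls back to an in-memory instance when `QDRANT_URL` is unset. It only builds keyword arguments (`location`, `url`, `api_key`), so it runs without a Qdrant server.

```python
import os


def qdrant_connection_kwargs(env=None) -> dict:
    """Choose QdrantClient keyword arguments from environment-style settings."""
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL")
    if not url:
        # No server configured: use an in-memory instance (data is lost on exit)
        return {"location": ":memory:"}
    kwargs = {"url": url}
    api_key = env.get("QDRANT_API_KEY")
    if api_key:
        # Only Qdrant Cloud or secured deployments need an API key
        kwargs["api_key"] = api_key
    return kwargs


print(qdrant_connection_kwargs({}))
print(qdrant_connection_kwargs({"QDRANT_URL": "http://localhost:6333"}))
```

The result could then be passed along as `QdrantClient(**qdrant_connection_kwargs())`.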

In [20]:
## Collection Name 
COLLECTION_NAME = "qdrant_docs_arize_dense"

In [21]:
## General Collection level operations

## Get information about existing collections 
client.get_collections()

## Get information about specific collection
#collection_info = client.get_collection(COLLECTION_NAME)
#print(collection_info)

## Deleting collection, if need be
#client.delete_collection(COLLECTION_NAME)

CollectionsResponse(collections=[CollectionDescription(name='qdrant_docs_arize_hybrid')])

In [12]:
## List the embedding models supported by FastEmbed
from fastembed.embedding import TextEmbedding

pd.DataFrame(TextEmbedding.list_supported_models())



index,model,dim,description,size_in_GB,sources
0,BAAI/bge-base-en,768,Base English model,0.42,{'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en.tar.gz'}
1,BAAI/bge-base-en-v1.5,768,"Base English model, v1.5",0.21,"{'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en-v1.5.tar.gz', 'hf': 'qdrant/bge-base-en-v1.5-onnx-q'}"
2,BAAI/bge-large-en-v1.5,1024,"Large English model, v1.5",1.2,{'hf': 'qdrant/bge-large-en-v1.5-onnx'}
3,BAAI/bge-small-en,384,Fast English model,0.13,{'url': 'https://storage.googleapis.com/qdrant-fastembed/BAAI-bge-small-en.tar.gz'}
4,BAAI/bge-small-en-v1.5,384,Fast and Default English model,0.067,{'hf': 'qdrant/bge-small-en-v1.5-onnx-q'}
5,BAAI/bge-small-zh-v1.5,512,Fast and recommended Chinese model,0.09,{'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz'}
6,sentence-transformers/all-MiniLM-L6-v2,384,"Sentence Transformer model, MiniLM-L6-v2",0.09,"{'url': 'https://storage.googleapis.com/qdrant-fastembed/sentence-transformers-all-MiniLM-L6-v2.tar.gz', 'hf': 'qdrant/all-MiniLM-L6-v2-onnx'}"
7,sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2,384,"Sentence Transformer model, paraphrase-multilingual-MiniLM-L12-v2",0.22,{'hf': 'qdrant/paraphrase-multilingual-MiniLM-L12-v2-onnx-Q'}
8,nomic-ai/nomic-embed-text-v1,768,8192 context length english model,0.52,{'hf': 'nomic-ai/nomic-embed-text-v1'}
9,nomic-ai/nomic-embed-text-v1.5,768,8192 context length english model,0.52,{'hf': 'nomic-ai/nomic-embed-text-v1.5'}
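
The `dim` column matters when choosing a model: a Qdrant collection stores vectors of a fixed size, so the collection's vector size has to match the output dimension of the embedding model. A minimal sketch of that check follows, using a few dimensions copied from the table above; `MODEL_DIMS` and `is_compatible` are hypothetical names for illustration.

```python
# Dimensions copied from a few rows of the supported-models table (illustration only)
MODEL_DIMS = {
    "BAAI/bge-small-en-v1.5": 384,
    "BAAI/bge-base-en-v1.5": 768,
    "BAAI/bge-large-en-v1.5": 1024,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
}


def is_compatible(model_name: str, collection_vector_size: int) -> bool:
    """True when the collection's vector size matches the model's output dimension."""
    return MODEL_DIMS[model_name] == collection_vector_size


print(is_compatible("BAAI/bge-small-en-v1.5", 384))
print(is_compatible("BAAI/bge-base-en-v1.5", 384))
```

Switching models on an existing collection therefore usually means creating a new collection and re-ingesting the documents.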


### **8. Document Embedding processing and Ingestion**

This example uses a `QdrantVectorStore` backed by a newly created Qdrant collection, but you can use whatever LlamaIndex application you like.

In [22]:
## Initializing the space to work with llama-index and related settings
import llama_index
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from phoenix.trace import suppress_tracing
## FastEmbed is used for embeddings in this notebook.
## For the complete list of supported models,
## please check https://qdrant.github.io/fastembed/examples/Supported_Models/
from llama_index.embeddings.fastembed import FastEmbedEmbedding

vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

## Embed with FastEmbed (comment this out if you switch to OpenAI embeddings below)
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

## Uncomment to use OpenAI embeddings instead of FastEmbed
#Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

Settings.llm = OpenAI(model="gpt-4-1106-preview", temperature=0.0)


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

In [23]:
## Note: Run this indexing block only when creating a new collection.
## If using an existing collection, skip this block and execute the next one instead.

from phoenix.trace import suppress_tracing
with suppress_tracing():
    dense_vector_index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )

Parsing nodes:   0%|          | 0/4431 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/335 [00:00<?, ?it/s]
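
The three "Generating embeddings" progress bars above (2048 + 2048 + 335) add up to the 4431 chunks, which suggests the nodes are embedded and inserted in batches of at most 2048. The sketch below reproduces that arithmetic; `batch_sizes` is a hypothetical helper, and the batch size of 2048 is read off the output rather than taken from any configuration in this notebook.

```python
def batch_sizes(total: int, batch_size: int) -> list[int]:
    """Return the size of each batch when total items are processed batch_size at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full, rest = divmod(total, batch_size)
    # 'full' complete batches, plus one partial batch if anything is left over
    return [batch_size] * full + ([rest] if rest else [])


print(batch_sizes(4431, 2048))
```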

### **8a. Connecting to existing Collection**

This step reuses the previously created collection through a `QdrantVectorStore`, without re-ingesting the documents.

In [24]:
## Note: Execute this block when using an existing collection
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core.indices.vector_store import VectorIndexRetriever
dense_vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)
dense_vector_index = VectorStoreIndex.from_vector_store(vector_store=dense_vector_store)

In [25]:
## Sanity check: the number of points in the collection should equal the number of processed document chunks
client.count(collection_name=COLLECTION_NAME)

CountResult(count=4431)

### **9. Running an example query and printing out the response**

In [26]:
##Initialise retriever to interact with the Qdrant collection
dense_retriever = VectorIndexRetriever(
    index=dense_vector_index,
    vector_store_query_mode=VectorStoreQueryMode.DEFAULT,
    similarity_top_k=2
)
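
Conceptually, a dense retriever with `similarity_top_k=2` embeds the query, scores every stored vector against it, and returns the two best matches. The sketch below shows that idea in plain Python, assuming cosine similarity as the metric; the toy vectors and the `cosine_similarity` and `top_k` helpers are made up for illustration, and a real vector database uses an approximate index rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Score every document against the query and keep the k highest scores."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions)
toy_docs = {
    "merge-optimizer": [0.9, 0.1, 0.0],
    "vacuum-optimizer": [0.7, 0.3, 0.1],
    "payload-index": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], toy_docs, k=2)
print([doc_id for doc_id, _ in results])
```

Raising `k` gives the response synthesizer more context to work with but also more opportunity for irrelevant chunks to slip in, which is what the retrieval evaluations later in the notebook are meant to measure.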

In [27]:
## let's try out a sample query using our dense retriever
response = dense_retriever.retrieve("What is a Merge Optimizer?")
for i, node in enumerate(response):
    print(i + 1, node.text, end="\n\n")

1 Such segments, for example, are created as copy-on-write segments during optimization itself.



It is also essential to have at least one small segment that Qdrant will use to store frequently updated data.

On the other hand, too many small segments lead to suboptimal search performance.



There is the Merge Optimizer, which combines the smallest segments into one large segment. It is used if too many segments are created.

2 ---

title: Optimizer

weight: 70

aliases:

  - ../optimizer

---



# Optimizer



It is much more efficient to apply changes in batches than perform each change individually, as many other databases do. Qdrant here is no exception. Since Qdrant operates with data structures that are not always easy to change, it is sometimes necessary to rebuild those structures completely.



Storage optimization in Qdrant occurs at the segment level (see [storage](../storage)).



In [28]:
# We can view the above data in the UI
px.active_session().view()

📺 Opening a view to the Phoenix app. The app is running at http://localhost:6006/


### **10. Run Your Query Engine and View Your Traces in Phoenix**

We've compiled a list of baseline questions about Qdrant. Let's download the sample queries and take a look.

In [29]:
## Loading the Eval dataset
from datasets import load_dataset
qdrant_qa = load_dataset("atitaarora/qdrant_doc_qna", split="train")
qdrant_qa_question = qdrant_qa.select_columns(['question'])

In [30]:
qdrant_qa_question['question'][:10]

['What is vaccum optimizer ?',
 'Tell me about ‘always_ram’ parameter?',
 'What is difference between scalar and product quantization?',
 'What is ‘best_score’ strategy?',
 'How does oversampling helps?',
 'What is the purpose of ‘CreatePayloadIndexAsync’?',
 'What is the purpose of ef_construct in HNSW ?',
 'How do you use ‘ordering’ parameter?',
 'What is significance of ‘on_disk_payload’ setting?',
 'What is the impact of ‘write_consistency_factor’ ?']

In [31]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# define response synthesizer
response_synthesizer = get_response_synthesizer()

# assemble query engine for dense retriever
dense_query_engine = RetrieverQueryEngine(
    retriever=dense_retriever,
    response_synthesizer=response_synthesizer,
)
#query_engine = index.as_query_engine()
for query in tqdm(qdrant_qa_question['question'][:10]):
    try:
        dense_query_engine.query(query)
    except Exception as e:
        # Report the failure instead of silently discarding it
        print(f"Query failed: {query!r} ({e})")

  0%|                                                                                                                                        | 0/10 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:15<00:00,  7.59s/it]
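
When running a batch of queries, it can help to record which ones failed so they can be inspected or retried later. The following is a minimal sketch of that pattern: `run_queries` is a hypothetical helper, and `fake_engine` is a stand-in for a real query engine's `query` method so that the example runs without any external service.

```python
def run_queries(query_fn, queries):
    """Run query_fn on each query; return (answers, failures) instead of hiding errors."""
    answers, failures = {}, {}
    for q in queries:
        try:
            answers[q] = query_fn(q)
        except Exception as exc:  # broad on purpose: keep the batch going
            failures[q] = repr(exc)
    return answers, failures


def fake_engine(question: str) -> str:
    # Hypothetical stand-in for a real query engine call
    if "fail" in question:
        raise RuntimeError("rate limited")
    return f"answer to: {question}"


answers, failures = run_queries(fake_engine, ["What is HNSW?", "please fail"])
print(len(answers), "succeeded;", len(failures), "failed")
```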


Check the Phoenix UI as your queries run. Your traces should appear in real time.

Open the Phoenix UI with the link below if you haven't already and click through the queries to better understand how the query engine is performing. For each trace you will see a breakdown of its spans.

Phoenix can be used to understand and troubleshoot your application by surfacing:
 - **Application latency** - highlights slow invocations of LLMs, retrievers, etc.
 - **Token Usage** - displays the breakdown of token usage with LLMs to surface your most expensive LLM calls
 - **Runtime Exceptions** - critical runtime exceptions such as rate-limiting are captured as exception events
 - **Retrieved Documents** - view all the documents retrieved during a retriever call and the score and order in which they were returned
 - **Embeddings** - view the embedding text used for retrieval and the underlying embedding model
 - **LLM Parameters** - view the parameters used when calling out to an LLM to debug things like temperature and the system prompts
 - **Prompt Templates** - figure out what prompt template is used during the prompting step and what variables were used
 - **Tool Descriptions** - view the description and function signature of the tools your LLM has been given access to
 - **LLM Function Calls** - if using OpenAI or another model with function calls, you can view the function selection and function messages in the input messages to the LLM

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/RAG_trace_details.png" alt="Trace Details View on Phoenix" style="width:100%; height:auto;">

In [20]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")

🚀 Open the Phoenix UI if you haven't already: http://localhost:6006/


### **11. Export and Evaluate Your Trace Data**
You can export your trace data as a pandas dataframe for further analysis and evaluation.

In this case, we will export our retriever spans into two separate dataframes:

 - `queries_df`, in which the retrieved documents for each query are concatenated into a single column
 - `retrieved_documents_df`, in which each retrieved document is "exploded" into its own row to enable the evaluation of each query-document pair in isolation

This will enable us to compute multiple kinds of evaluations, including:

 - **Relevance** - are the retrieved documents relevant to the corresponding query?
 - **Q&A correctness** - are your application's responses grounded in the retrieved context?
 - **Hallucinations** - is your application making up false information?

In [32]:
queries_df = get_qa_with_reference(px.Client())
retrieved_documents_df = get_retrieved_documents(px.Client())
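
To make the difference between the two shapes concrete, here is a small sketch in plain Python using made-up retrieval records. It is an illustration of the concatenated and exploded layouts, not how Phoenix builds its dataframes; the `retrievals` data and its field names are assumptions.

```python
# Made-up retrieval records: each query span returned one or more documents
retrievals = [
    {"span_id": "a1", "input": "What is HNSW?", "documents": ["doc A", "doc B"]},
    {"span_id": "b2", "input": "What is a segment?", "documents": ["doc C"]},
]

# Concatenated view: one row per query, all retrieved text joined into one reference
concatenated = [
    {"span_id": r["span_id"], "input": r["input"], "reference": "\n\n".join(r["documents"])}
    for r in retrievals
]

# Exploded view: one row per (query, document) pair, keeping the retrieval rank
exploded = [
    {"span_id": r["span_id"], "document_position": pos, "input": r["input"], "reference": doc}
    for r in retrievals
    for pos, doc in enumerate(r["documents"])
]

print(len(concatenated), len(exploded))
```

The concatenated layout suits response-level checks such as Q&A correctness, while the exploded layout lets each retrieved document be judged against its query on its own.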

In [33]:
queries_df

context.span_id,input,output,reference
14fcacd026e9f882,What is the impact of ‘write_consistency_factor’ ?,"The `write_consistency_factor` impacts the reliability and consistency of write operations in a distributed system. When this factor is increased, it requires more replicas to acknowledge a write operation before it is considered successful and a response is sent to the client. This makes write operations more tolerant to network partitions, as it ensures that a larger number of replicas have the write data before proceeding. However, it also means that a higher number of active replicas are needed to perform write operations, which could potentially lead to higher latency or even failures in write operations if the required number of replicas is not available.","- `write_consistency_factor` - defines the number of replicas that must acknowledge a write operation before responding to the client. Increasing this value will make write operations tolerant to network partitions in the cluster, but will require a higher number of replicas to be active to perform write operations.\n\n### Write consistency factor\n\n\n\nThe `write_consistency_factor` represents the number of replicas that must acknowledge a write operation before responding to the client. It is set to one by default.\n\nIt can be configured at the collection's creation time.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 300,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 6,\n\n ""replication_factor"": 2,\n\n ""write_consistency_factor"": 2,\n\n}\n\n```\n\n\n\n```python"
1a847e1aa16e4e15,What is significance of ‘on_disk_payload’ setting?,"The `on_disk_payload` setting is significant because it determines where payload data is stored. When set to `true`, it ensures that payload data is stored only on disk, rather than in RAM. This can be particularly useful for managing memory usage effectively when dealing with large payloads, as it helps to limit the amount of RAM required by not loading large payload data into memory.","* `on_disk_payload` - defines where to store payload data. If `true` - payload will be stored on disk only. Might be useful for limiting the RAM usage in case of large payload.\n\n* `quantization_config` - see [quantization](../../guides/quantization/#setting-up-quantization-in-qdrant) for details.\n\n\n\nDefault parameters for the optional collection parameters are defined in [configuration file](https://github.com/qdrant/qdrant/blob/master/config/config.yaml).\n\nThe payload data is loaded into RAM at service startup while disk and [RocksDB](https://rocksdb.org/) are used for persistence only.\n\nThis type of storage works quite fast, but it may require a lot of space to keep all the data in RAM, especially if the payload has large values attached - abstracts of text or even images.\n\n\n\nIn the case of large payload values, it might be better to use OnDisk payload storage."
a449f21559141637,How do you use ‘ordering’ parameter?,"The `ordering` parameter can be used with update and delete operations to ensure that these operations are executed in the same order on all replicas. To use this parameter, you would include it in your request to Qdrant. For example, when making an HTTP PUT request to update points in a collection, you would append `?ordering=strong` to the URL, like this:\n\n```\nPUT /collections/{collection_name}/points?ordering=strong\n```\n\nIn the body of the request, you would include the details of the points you want to update, such as their IDs, payloads, and vectors.\n\nWhen using the `ordering` parameter, Qdrant will route the operation to the leader replica of the shard and will not respond to the client until the operation has been completed. This ensures that all replicas process the update or delete operations in the same order, which is particularly useful for avoiding data inconsistencies when there are concurrent updates to the same documents.","- Write `ordering` param, can be used with update and delete operations to ensure that the operations are executed in the same order on all replicas. If this option is used, Qdrant will route the operation to the leader replica of the shard and wait for the response before responding to the client. This option is useful to avoid data inconsistency in case of concurrent updates of the same documents\n\n```http\n\nPUT /collections/{collection_name}/points?ordering=strong\n\n{\n\n ""batch"": {\n\n ""ids"": [1, 2, 3],\n\n ""payloads"": [\n\n {""color"": ""red""},\n\n {""color"": ""green""},\n\n {""color"": ""blue""}\n\n ],\n\n ""vectors"": [\n\n [0.9, 0.1, 0.1],\n\n [0.1, 0.9, 0.1],\n\n [0.1, 0.1, 0.9]\n\n ]\n\n }\n\n}\n\n```\n\n\n\n```python\n\nclient.upsert(\n\n collection_name=""{collection_name}"",\n\n points=models.Batch(\n\n ids=[1, 2, 3],"
313e98c658beac38,What is the purpose of ef_construct in HNSW ?,"The `ef_construct` parameter in HNSW is the number of neighbours to consider during the index building. The larger the value of `ef_construct`, the higher the precision of the search, but this also results in longer indexing time.","(""my_vector"".into()),\n\n VectorParamsDiff {\n\n hnsw_config: Some(HnswConfigDiff {\n\n m: Some(32),\n\n ef_construct: Some(123),\n\n ..Default::default()\n\n }),\n\n ..Default::default()\n\n },\n\n )]),\n\n },\n\n )),\n\n }),\n\nThe larger the value of it, the higher the precision of the search, but more space required. The `ef_construct` parameter is the number of \n\nneighbours to consider during the index building. Again, the larger the value, the higher the precision, but the longer the indexing time.\n\nThe default values of these parameters are `m=16` and `ef_construct=100`. Let's try to increase them to `m=32` and `ef_construct=200` and"
5734d6ff741beaee,What is the purpose of ‘CreatePayloadIndexAsync’?,"The purpose of the `createPayloadIndex` function (which could be referred to as `CreatePayloadIndexAsync` in an asynchronous context) is to create an index on a payload field within a collection in a database. This function specifies the field to be indexed and defines the schema for the index, including the type of data, tokenizer, minimum and maximum token length, and whether the text should be converted to lowercase. This indexing facilitates efficient querying and retrieval of data based on the specified field.","client.createPayloadIndex(""{collection_name}"", {\n\n field_name: ""name_of_the_field_to_index"",\n\n field_schema: {\n\n type: ""text"",\n\n tokenizer: ""word"",\n\n min_token_len: 2,\n\n max_token_len: 15,\n\n lowercase: true,\n\n },\n\n});\n\n```\n\n\n\n```rust\n\nuse qdrant_client::{\n\n client::QdrantClient,\n\n qdrant::{\n\n payload_index_params::IndexParams, FieldType, PayloadIndexParams, TextIndexParams,\n\n TokenizerType,\n\n },\n\n};\n\n},\n\n ""api"": {\n\n ""type"": ""openapi"",\n\n ""url"": ""https://your-application-name.fly.dev/.well-known/openapi.yaml"",\n\n ""has_user_authentication"": false\n\n },\n\n ""logo_url"": ""https://your-application-name.fly.dev/.well-known/logo.png"",\n\n ""contact_email"": ""email@domain.com"",\n\n ""legal_info_url"": ""email@domain.com""\n\n}\n\n```\n\n\n\nThat was the last step before running the final command. The command that will deploy \n\nthe application on the server:\n\n\n\n```bash\n\nflyctl deploy\n\n```"
10259e2e2f83a1b3,How does oversampling helps?,"Oversampling helps in two distinct ways:\n\n1. It improves the accuracy and performance of similarity search algorithms by allowing for significant compression of high-dimensional vectors in memory, while compensating for accuracy loss by re-scoring additional points with the original vectors.\n\n2. It equalizes the representation of classes in the training dataset, which enables more fair and accurate modeling of real-world scenarios.","### Oversampling for quantization\n\n\n\nWe are introducing [oversampling](/documentation/guides/quantization/#oversampling) as a new way to help you improve the accuracy and performance of similarity search algorithms. With this method, you are able to significantly compress high-dimensional vectors in memory and then compensate the accuracy loss by re-scoring additional points with the original vectors.\n\noversampling helps equalize the representation of classes in the training dataset, thus enabling more fair and accurate modeling of real-world scenarios."
7f854af390f89fcb,What is ‘best_score’ strategy?,"The 'best_score' strategy is a method used to find similar vectors by comparing each candidate against every example to identify the ones that are closer to a positive example while avoiding those closer to a negative one. This strategy involves selecting the best positive and best negative scores for each candidate, and then calculating the final score using a step formula. If the best positive score is greater than the best negative score, the final score is the best positive score. Otherwise, the final score is the negative of the best negative score squared. The performance of this strategy is affected linearly by the number of examples used.","This is the default strategy that's going to be set implicitly, but you can explicitly define it by setting `""strategy"": ""average_vector""` in the recommendation request.\n\n\n\n### Best score strategy\n\n\n\n*Available as of v1.6.0*\n\n\n\nA new strategy introduced in v1.6, is called `best_score`. It is based on the idea that the best way to find similar vectors is to find the ones that are closer to a positive example, while avoiding the ones that are closer to a negative one.\n\nThe way it works is that each candidate is measured against every example, then we select the best positive and best negative scores. The final score is chosen with this step formula:\n\n\n\n```rust\n\nlet score = if best_positive_score > best_negative_score {\n\n best_positive_score;\n\n} else {\n\n -(best_negative_score * best_negative_score);\n\n};\n\n```\n\n\n\n<aside role=""alert"">\n\nThe performance of <code>best_score</code> strategy will be linearly impacted by the amount of examples.\n\n</as..."
a9d2001e9ca7f6a2,What is difference between scalar and product quantization?,"Scalar quantization is a compression technique that reduces the number of bits used to represent each component of a vector. It involves converting the original floating-point representations of vector components into a lower bit representation, such as converting 32-bit floating numbers into 8-bit unsigned integers.\n\nProduct quantization, on the other hand, is a different approach to compressing high-dimensional vectors. It involves dividing the vector into smaller sub-vectors and quantizing each sub-vector separately. This method is not as friendly to SIMD (Single Instruction, Multiple Data) operations, which can make distance calculations slower compared to scalar quantization. Additionally, product quantization typically results in a loss of accuracy, and therefore, it is recommended for use with high-dimensional vectors where this trade-off is acceptable.","But there are some tradeoffs. Product quantization distance calculations are not SIMD-friendly, so it is slower than scalar quantization.\n\nAlso, product quantization has a loss of accuracy, so it is recommended to use it only for high-dimensional vectors.\n\n\n\nPlease refer to the [Quantization Tips](#quantization-tips) section for more information on how to optimize the quantization parameters for your use case.\n\n\n\n## How to choose the right quantization method\n\n*Available as of v1.1.0*\n\n\n\nScalar quantization, in the context of vector search engines, is a compression technique that compresses vectors by reducing the number of bits used to represent each vector component.\n\n\n\n\n\nFor instance, Qdrant uses 32-bit floating numbers to represent the original vector components. Scalar quantization allows you to reduce the number of bits used to 8.\n\nIn other words, Qdrant performs `float32 -> uint8` conversion for each vector component."
a70081ec7e95ccc3,Tell me about ‘always_ram’ parameter?,"The `always_ram` parameter is a configuration option that determines whether quantized vectors should be always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors. However, in certain setups, keeping quantized vectors in RAM can speed up the search process. If you want to store quantized vectors in RAM, you can set `always_ram` to `true`. This setting can be particularly useful if you experience a significant decrease in search quality and need to ensure faster search performance.","""compression"": ""x32"",\n\n ""always_ram"": true\n\n }\n\n },\n\n ""on_disk"": true\n\n }\n\n },\n\n ""hnsw_config"": {\n\n ""ef_construct"": 123\n\n },\n\n ""quantization_config"": {\n\n ""scalar"": {\n\n ""type"": ""int8"",\n\n ""quantile"": 0.8,\n\n ""always_ram"": false\n\n }\n\n }\n\n}\n\n```\n\n\n\n```bash\n\ncurl -X PATCH http://localhost:6333/collections/test_collection1 \\n\nIt might be worth tuning this parameter if you experience a significant decrease in search quality.\n\n\n\n`always_ram` - whether to keep quantized vectors always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors.\n\nHowever, in some setups you might want to keep quantized vectors in RAM to speed up the search process.\n\n\n\nIn this case, you can set `always_ram` to `true` to store quantized vectors in RAM.\n\n\n\n### Setting up Binary Quant..."
52a8df108660be63,What is vaccum optimizer ?,"The term ""vaccum optimizer"" does not appear directly in the provided context. However, the context does mention an ""optimizer_config"" with various parameters that could be related to the optimization process of a system. One of the parameters within this configuration is ""vacuum_min_vector_number,"" which suggests a threshold for a certain operation that could be part of an optimization or maintenance routine, possibly related to data storage or indexing. Without additional context, it's not possible to provide a detailed explanation of a ""vaccum optimizer.""","return optimizer\n\n```\n\n\n\nCaching in Quaterion is used for avoiding calculation of outputs of a frozen pretrained `Encoder` in every epoch.\n\nWhen it is configured, outputs will be computed once and cached in the preferred device for direct usage later on.\n\nIt provides both a considerable speedup and less memory footprint.\n\nHowever, it is quite a bit versatile and has several knobs to tune.\n\n},\n\n ""optimizer_config"": {\n\n ""deleted_threshold"": 0.2,\n\n ""vacuum_min_vector_number"": 1000,\n\n ""default_segment_number"": 0,\n\n ""max_segment_size"": null,\n\n ""memmap_threshold"": null,\n\n ""indexing_threshold"": 20000,\n\n ""flush_interval_sec"": 5,\n\n ""max_optimization_threads"": 1\n\n },\n\n ""wal_config"": {\n\n ""wal_capacity_mb"": 32,"


In [34]:
retrieved_documents_df

Unnamed: 0_level_0,Unnamed: 1_level_0,context.trace_id,input,reference,document_score
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
7cab8e400cabe8e7,0,2920c32bcf94359d7b2906641d8f2a10,What is the impact of ‘write_consistency_factor’ ?,"- `write_consistency_factor` - defines the number of replicas that must acknowledge a write operation before responding to the client. Increasing this value will make write operations tolerant to network partitions in the cluster, but will require a higher number of replicas to be active to perform write operations.",0.832443
7cab8e400cabe8e7,1,2920c32bcf94359d7b2906641d8f2a10,What is the impact of ‘write_consistency_factor’ ?,"### Write consistency factor\n\n\n\nThe `write_consistency_factor` represents the number of replicas that must acknowledge a write operation before responding to the client. It is set to one by default.\n\nIt can be configured at the collection's creation time.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 300,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 6,\n\n ""replication_factor"": 2,\n\n ""write_consistency_factor"": 2,\n\n}\n\n```\n\n\n\n```python",0.825855
36d7d3481d63eb2f,0,7817c0d936f861fa95d57d5965f6f831,What is significance of ‘on_disk_payload’ setting?,* `on_disk_payload` - defines where to store payload data. If `true` - payload will be stored on disk only. Might be useful for limiting the RAM usage in case of large payload.\n\n* `quantization_config` - see [quantization](../../guides/quantization/#setting-up-quantization-in-qdrant) for details.\n\n\n\nDefault parameters for the optional collection parameters are defined in [configuration file](https://github.com/qdrant/qdrant/blob/master/config/config.yaml).,0.821666
36d7d3481d63eb2f,1,7817c0d936f861fa95d57d5965f6f831,What is significance of ‘on_disk_payload’ setting?,"The payload data is loaded into RAM at service startup while disk and [RocksDB](https://rocksdb.org/) are used for persistence only.\n\nThis type of storage works quite fast, but it may require a lot of space to keep all the data in RAM, especially if the payload has large values attached - abstracts of text or even images.\n\n\n\nIn the case of large payload values, it might be better to use OnDisk payload storage.",0.787703
4f34361dbde57463,0,daa61b1e45237715bbd8eafcd09d6c2c,How do you use ‘ordering’ parameter?,"- Write `ordering` param, can be used with update and delete operations to ensure that the operations are executed in the same order on all replicas. If this option is used, Qdrant will route the operation to the leader replica of the shard and wait for the response before responding to the client. This option is useful to avoid data inconsistency in case of concurrent updates of the same documents",0.770652
4f34361dbde57463,1,daa61b1e45237715bbd8eafcd09d6c2c,How do you use ‘ordering’ parameter?,"```http\n\nPUT /collections/{collection_name}/points?ordering=strong\n\n{\n\n ""batch"": {\n\n ""ids"": [1, 2, 3],\n\n ""payloads"": [\n\n {""color"": ""red""},\n\n {""color"": ""green""},\n\n {""color"": ""blue""}\n\n ],\n\n ""vectors"": [\n\n [0.9, 0.1, 0.1],\n\n [0.1, 0.9, 0.1],\n\n [0.1, 0.1, 0.9]\n\n ]\n\n }\n\n}\n\n```\n\n\n\n```python\n\nclient.upsert(\n\n collection_name=""{collection_name}"",\n\n points=models.Batch(\n\n ids=[1, 2, 3],",0.740061
bfd965ae080f81ce,0,2e0440e73984c19e901106346153c857,What is the purpose of ef_construct in HNSW ?,"(""my_vector"".into()),\n\n VectorParamsDiff {\n\n hnsw_config: Some(HnswConfigDiff {\n\n m: Some(32),\n\n ef_construct: Some(123),\n\n ..Default::default()\n\n }),\n\n ..Default::default()\n\n },\n\n )]),\n\n },\n\n )),\n\n }),",0.78715
bfd965ae080f81ce,1,2e0440e73984c19e901106346153c857,What is the purpose of ef_construct in HNSW ?,"The larger the value of it, the higher the precision of the search, but more space required. The `ef_construct` parameter is the number of \n\nneighbours to consider during the index building. Again, the larger the value, the higher the precision, but the longer the indexing time.\n\nThe default values of these parameters are `m=16` and `ef_construct=100`. Let's try to increase them to `m=32` and `ef_construct=200` and",0.767757
e67bf6d53c9b6a51,0,c96de30824656d78982669c4dcfc97ec,What is the purpose of ‘CreatePayloadIndexAsync’?,"client.createPayloadIndex(""{collection_name}"", {\n\n field_name: ""name_of_the_field_to_index"",\n\n field_schema: {\n\n type: ""text"",\n\n tokenizer: ""word"",\n\n min_token_len: 2,\n\n max_token_len: 15,\n\n lowercase: true,\n\n },\n\n});\n\n```\n\n\n\n```rust\n\nuse qdrant_client::{\n\n client::QdrantClient,\n\n qdrant::{\n\n payload_index_params::IndexParams, FieldType, PayloadIndexParams, TextIndexParams,\n\n TokenizerType,\n\n },\n\n};",0.717572
e67bf6d53c9b6a51,1,c96de30824656d78982669c4dcfc97ec,What is the purpose of ‘CreatePayloadIndexAsync’?,"},\n\n ""api"": {\n\n ""type"": ""openapi"",\n\n ""url"": ""https://your-application-name.fly.dev/.well-known/openapi.yaml"",\n\n ""has_user_authentication"": false\n\n },\n\n ""logo_url"": ""https://your-application-name.fly.dev/.well-known/logo.png"",\n\n ""contact_email"": ""email@domain.com"",\n\n ""legal_info_url"": ""email@domain.com""\n\n}\n\n```\n\n\n\nThat was the last step before running the final command. The command that will deploy \n\nthe application on the server:\n\n\n\n```bash\n\nflyctl deploy\n\n```",0.698218


### **12. Define your evaluation model and your evaluators**

Next, define your evaluation model and your evaluators.

Evaluators are built on top of language models: they prompt the LLM to assess the quality of responses, the relevance of retrieved documents, and so on, which gives you a quality signal even without human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use for evaluation; each evaluator uses one of our battle-tested evaluation templates.

In [35]:
eval_model = OpenAIModel(
    model="gpt-4-turbo-preview",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
)
px.Client().log_evaluations(DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df))

run_evals |          | 0/22 (0.0%) | ⏳ 00:00<? | ?it/s

run_evals |          | 0/22 (0.0%) | ⏳ 00:00<? | ?it/s

Your evaluations should now appear as annotations on the appropriate spans in Phoenix.

![A view of the Phoenix UI with evaluation annotations](https://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/evals/traces_with_evaluation_annotations.png)
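
Besides browsing the annotations in the UI, it can help to get a quick aggregate view of an evaluation DataFrame. The sketch below assumes the DataFrame has `label` and `score` columns; the `summarize_evals` helper and the sample rows are hypothetical and only illustrate the idea, so check the columns of your own eval DataFrames before reusing it.

```python
import pandas as pd


def summarize_evals(eval_df: pd.DataFrame) -> dict:
    """Summarize one evaluation DataFrame: counts per label and the mean score."""
    return {
        "label_counts": eval_df["label"].value_counts().to_dict(),
        "mean_score": float(eval_df["score"].mean()),
    }


# Hypothetical sample rows standing in for a real eval DataFrame
sample_eval_df = pd.DataFrame(
    {
        "label": ["correct", "correct", "incorrect", "correct"],
        "score": [1, 1, 0, 1],
        "explanation": ["...", "...", "...", "..."],
    }
)

summary = summarize_evals(sample_eval_df)
print(summary)
```

Running the same helper on the evaluation DataFrames of two different projects gives a rough side-by-side comparison of two pipeline configurations.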

### **13. Let's try Hybrid search now**

In [36]:
## Define a new collection to store your hybrid embeddings
COLLECTION_NAME_HYBRID = "qdrant_docs_arize_hybrid"

In [26]:
##Reprocess documents with different settings if needed 
#documents = process_document_chunks(dataset , CHUNK_SIZE , CHUNK_OVERLAP)

In [27]:
#len(documents)

In [37]:
##List of supported sparse vector models
from fastembed.sparse.sparse_text_embedding import SparseTextEmbedding
SparseTextEmbedding.list_supported_models()

[{'model': 'prithvida/Splade_PP_en_v1',
  'vocab_size': 30522,
  'description': 'Misspelled version of the model. Retained for backward compatibility. Independent Implementation of SPLADE++ Model for English',
  'size_in_GB': 0.532,
  'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}},
 {'model': 'prithivida/Splade_PP_en_v1',
  'vocab_size': 30522,
  'description': 'Independent Implementation of SPLADE++ Model for English',
  'size_in_GB': 0.532,
  'sources': {'hf': 'Qdrant/SPLADE_PP_en_v1'}}]

### **14. Ingest Sparse and Dense vectors into Qdrant**

Ingest sparse and dense vectors into a Qdrant collection.
We use the SPLADE++ model for sparse vectors and the default FastEmbed model, bge-small-en-v1.5, for dense embeddings.
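
To make the sparse side concrete: a sparse embedding stores only its non-zero dimensions as parallel lists of indices and values, which is the shape the `compute_sparse_vectors` function in the next cell returns. The toy sketch below scores a query against a document by summing the products over the indices they share. The `sparse_dot` helper and the vectors are made up for illustration; real SPLADE++ vectors have many more entries.

```python
from typing import List


def sparse_dot(
    q_indices: List[int],
    q_values: List[float],
    d_indices: List[int],
    d_values: List[float],
) -> float:
    """Dot product of two sparse vectors given as (indices, values) pairs."""
    doc_weights = dict(zip(d_indices, d_values))
    # Only dimensions present in both vectors contribute to the score.
    return sum(v * doc_weights.get(i, 0.0) for i, v in zip(q_indices, q_values))


# Hypothetical token ids and weights, not real SPLADE++ output
query = ([101, 2054, 7781], [0.5, 1.0, 2.0])
doc = ([2054, 7781, 9000], [2.0, 1.5, 0.25])

score = sparse_dot(*query, *doc)
print(score)
```

Because only shared non-zero dimensions matter, a document that uses the exact query terms scores well even when its dense embedding is not the nearest neighbor, which is the motivation for combining both signals.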

In [38]:
## Initializing the space to work with llama-index and related settings
import llama_index
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from fastembed.sparse.sparse_text_embedding import SparseTextEmbedding, SparseEmbedding
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from typing import List, Tuple

sparse_model_name = "prithivida/Splade_PP_en_v1"

# This triggers the model download
sparse_model = SparseTextEmbedding(model_name=sparse_model_name, batch_size=32)

## Computing sparse vectors
def compute_sparse_vectors(
    texts: List[str],
    ) -> Tuple[List[List[int]], List[List[float]]]:
    indices, values = [], []
    for embedding in sparse_model.embed(texts):
        indices.append(embedding.indices.tolist())
        values.append(embedding.values.tolist())
    return indices, values

## Creating a vector store with Hybrid search enabled
hybrid_vector_store = QdrantVectorStore(
    client=client,
    collection_name=COLLECTION_NAME_HYBRID,
    enable_hybrid=True,
    sparse_doc_fn=compute_sparse_vectors,
    sparse_query_fn=compute_sparse_vectors)

storage_context = StorageContext.from_defaults(vector_store=hybrid_vector_store)

Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

In [None]:
## Note: Ingesting sparse and dense vectors into the Qdrant collection
## Use this block when you are creating a new collection. If you are using an existing collection,
## skip this block and execute the next one instead.
from phoenix.trace import suppress_tracing
with suppress_tracing():
    hybrid_vector_index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )

In [39]:
## Note: Execute this block when using an existing collection
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core.indices.vector_store import VectorIndexRetriever

hybrid_vector_index = VectorStoreIndex.from_vector_store(vector_store=hybrid_vector_store)

In [40]:
## collection level operations
client.get_collection(COLLECTION_NAME_HYBRID)
#client.delete_collection(COLLECTION_NAME_HYBRID)

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=8862, indexed_vectors_count=4429, points_count=4431, segments_count=2, config=CollectionConfig(params=CollectionParams(vectors={'text-dense': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None)}, shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors={'text-sparse': SparseVectorParams(index=SparseIndexParams(full_scan_threshold=None, on_disk=None))}), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_thread

In [41]:
## Check the number of documents matches the expected number of document chunks 
client.count(collection_name=COLLECTION_NAME_HYBRID)

CountResult(count=4431)

### **15. Hybrid Search with Qdrant**

In [42]:
## Before trying hybrid search, let's try the sparse vector search retriever
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.core.indices.vector_store import VectorIndexRetriever

sparse_retriever = VectorIndexRetriever(
    index=hybrid_vector_index,
    vector_store_query_mode=VectorStoreQueryMode.SPARSE,
    sparse_top_k=2,
)

## Pure sparse vector search
nodes = sparse_retriever.retrieve("What is a Merge Optimizer?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n")

1 Such segments, for example, are created as copy-on-write segments during optimization itself.



It is also essential to have at least one small segment that Qdrant will use to store frequently updated data.

On the other hand, too many small segments lead to suboptimal search performance.



There is the Merge Optimizer, which combines the smallest segments into one large segment. It is used if too many segments are created.
2 The criteria for starting the optimizer are defined in the configuration file.



Here is an example of parameter values:



```yaml

storage:

  optimizers:

    # If the number of segments exceeds this value, the optimizer will merge the smallest segments.

    max_segment_number: 5

```



## Indexing Optimizer



Qdrant allows you to choose the type of indexes and data storage methods used depending on the number of records.


In [43]:
## Now let's try the hybrid search retriever. The `alpha` parameter controls the weighting between
## the dense and sparse vector search scores.
# NOTE: alpha=0 is pure sparse search, alpha=1 is pure dense search

hybrid_retriever = VectorIndexRetriever(
    index=hybrid_vector_index,
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    sparse_top_k=1,
    similarity_top_k=2,
    alpha=0.1,
)
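
To build some intuition for `alpha`, here is a minimal sketch of one common way to blend dense and sparse results: min-max normalize each score list, then take a weighted sum. This is an illustration under stated assumptions, not necessarily the fusion function the vector store applies; the `min_max_normalize` and `fuse` helpers and all scores are hypothetical.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(dense: Dict[str, float], sparse: Dict[str, float], alpha: float) -> Dict[str, float]:
    """Blend normalized scores: alpha=1 is dense only, alpha=0 is sparse only."""
    dense_n, sparse_n = min_max_normalize(dense), min_max_normalize(sparse)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }


# Hypothetical raw scores for four chunks (dense and sparse scales differ)
dense_scores = {"a": 0.83, "b": 0.79, "c": 0.70}
sparse_scores = {"b": 19.4, "c": 12.3, "d": 11.0}

for alpha in (0.1, 0.9):
    fused = fuse(dense_scores, sparse_scores, alpha)
    ranking = sorted(fused, key=fused.get, reverse=True)
    print(alpha, ranking)
```

With a low `alpha` the chunk favored by the sparse scores ranks first, and with a high `alpha` the chunk favored by the dense scores does. Normalizing first matters because raw sparse and dense scores live on very different scales.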

In [44]:
## Let's try hybrid retriever 
nodes = hybrid_retriever.retrieve("What is a Merge Optimizer?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n")

1 Such segments, for example, are created as copy-on-write segments during optimization itself.



It is also essential to have at least one small segment that Qdrant will use to store frequently updated data.

On the other hand, too many small segments lead to suboptimal search performance.



There is the Merge Optimizer, which combines the smallest segments into one large segment. It is used if too many segments are created.
2 ---

title: Optimizer

weight: 70

aliases:

  - ../optimizer

---



# Optimizer



It is much more efficient to apply changes in batches than perform each change individually, as many other databases do. Qdrant here is no exception. Since Qdrant operates with data structures that are not always easy to change, it is sometimes necessary to rebuild those structures completely.



Storage optimization in Qdrant occurs at the segment level (see [storage](../storage)).


In [451]:
# We shouldn't be modifying the alpha parameter after the retriever has been created
# but that's the easiest way to show the effect of the parameter
#hybrid_retriever._alpha = 0.1
#hybrid_retriever._alpha = 0.9

#nodes = hybrid_retriever.retrieve("What is merge optimizer?")
#for i, node in enumerate(nodes):
#    print(i + 1, node.text, end="\n")

In [45]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# Define the response synthesizer
response_synthesizer = get_response_synthesizer()

# Assemble the query engine for the hybrid retriever
hybrid_query_engine = RetrieverQueryEngine(
    retriever=hybrid_retriever,
    response_synthesizer=response_synthesizer,
)

### **16. Re-Run Your Query Engine and View Your Traces in Phoenix**

Let's rerun the first 10 baseline questions about Qdrant, this time through the hybrid retriever.

In [46]:
## Switching phoenix project space
from phoenix.trace import using_project

# Switch project to run evals.
# All spans created within this context are associated with the `HYBRID_RAG_PROJECT` project.
with using_project(HYBRID_RAG_PROJECT):
    ## Reuse the previously loaded dataset `qdrant_qa_question`
    for query in tqdm(qdrant_qa_question['question'][:10]):
        try:
            hybrid_query_engine.query(query)
        except Exception as e:
            # Report the failure instead of silently skipping the query
            print(f"Query failed: {query!r} ({e})")



  0%|                                                                                                                                        | 0/10 [00:00<?, ?it/s]

 10%|████████████▊                                                                                                                   | 1/10 [00:05<00:47,  5.26s/it]

 20%|█████████████████████████▌                                                                                                      | 2/10 [00:12<00:52,  6.58s/it]

 30%|██████████████████████████████████████▍                                                                                         | 3/10 [00:21<00:51,  7.36s/it]

 40%|███████████████████████████████████████████████████▏                                                                            | 4/10 [00:29<00:47,  7.95s/it]

 50%|████████████████████████████████████████████████████████████████                                                                | 5/1

In [48]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")

🚀 Open the Phoenix UI if you haven't already: http://localhost:6006/


In [47]:
## Export the queries and retrieved documents recorded in the hybrid project
queries_df_hybrid = get_qa_with_reference(px.Client(), project_name=HYBRID_RAG_PROJECT)
retrieved_documents_df_hybrid = get_retrieved_documents(px.Client(), project_name=HYBRID_RAG_PROJECT)

In [48]:
queries_df_hybrid

Unnamed: 0_level_0,input,output,reference
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5c23a162c061df49,What is the impact of ‘write_consistency_factor’ ?,"The impact of the `write_consistency_factor` is that it defines the number of replicas that must acknowledge a write operation before the system responds to the client. If this value is increased, write operations will be more tolerant to network partitions within the cluster. However, this also means that a higher number of replicas need to be active for write operations to be performed successfully.","### Write consistency factor\n\n\n\nThe `write_consistency_factor` represents the number of replicas that must acknowledge a write operation before responding to the client. It is set to one by default.\n\nIt can be configured at the collection's creation time.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 300,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 6,\n\n ""replication_factor"": 2,\n\n ""write_consistency_factor"": 2,\n\n}\n\n```\n\n\n\n```python\n\n- `write_consistency_factor` - defines the number of replicas that must acknowledge a write operation before responding to the client. Increasing this value will make write operations tolerant to network partitions in the cluster, but will require a higher number of replicas to be active to perform write operations."
37c98aebc845e8be,What is significance of ‘on_disk_payload’ setting?,"The `on_disk_payload` setting is significant because it determines where payload data is stored. When set to `true`, it ensures that payload data is stored only on disk, rather than in RAM. This can be particularly useful for managing memory usage effectively when dealing with large payloads, as it helps to limit the amount of RAM required by not loading large payload values into memory.","* `on_disk_payload` - defines where to store payload data. If `true` - payload will be stored on disk only. Might be useful for limiting the RAM usage in case of large payload.\n\n* `quantization_config` - see [quantization](../../guides/quantization/#setting-up-quantization-in-qdrant) for details.\n\n\n\nDefault parameters for the optional collection parameters are defined in [configuration file](https://github.com/qdrant/qdrant/blob/master/config/config.yaml).\n\nThe payload data is loaded into RAM at service startup while disk and [RocksDB](https://rocksdb.org/) are used for persistence only.\n\nThis type of storage works quite fast, but it may require a lot of space to keep all the data in RAM, especially if the payload has large values attached - abstracts of text or even images.\n\n\n\nIn the case of large payload values, it might be better to use OnDisk payload storage."
f805dcdab7f8a853,How do you use ‘ordering’ parameter?,"The `ordering` parameter can be used with update and delete operations to ensure that these operations are executed in the same order on all replicas. When this option is utilized, the operation is routed to the leader replica of the shard, and the system waits for a response from the leader before responding to the client. This is particularly useful for avoiding data inconsistencies that might arise from concurrent updates or deletions of the same documents.","Since the `filter` parameter is specified, the search is performed only among those points that satisfy the filter condition.\n\nSee details of possible filters and their work in the [filtering](../filtering) section.\n\n\n\nExample result of this API would be\n\n\n\n```json\n\n{\n\n ""result"": [\n\n { ""id"": 10, ""score"": 0.81 },\n\n { ""id"": 14, ""score"": 0.75 },\n\n { ""id"": 11, ""score"": 0.73 }\n\n ],\n\n ""status"": ""ok"",\n\n ""time"": 0.001\n\n}\n\n```\n\n\n\nThe `result` contains ordered by `score` list of found point ids.\n\n- Write `ordering` param, can be used with update and delete operations to ensure that the operations are executed in the same order on all replicas. If this option is used, Qdrant will route the operation to the leader replica of the shard and wait for the response before responding to the client. This option is useful to avoid data inconsistency in case of concurrent updates of the same documents"
f18bd2ff51730c94,What is the purpose of ef_construct in HNSW ?,The purpose of `ef_construct` in HNSW is to specify the search range when building the index.,"In order to improve performance, HNSW limits the maximum degree of nodes on each layer of the graph to `m`. In addition, you can use `ef_construct` (when building index) or `ef` (when searching targets) to specify a search range.\n\n\n\nThe corresponding parameters could be configured in the configuration file:\n\n\n\n```yaml\n\nstorage:\n\n # Default parameters of HNSW Index. Could be overridden for each collection or named vector individually\n\n hnsw_index:\n\n # Number of edges per node in the index graph.\n\n(""my_vector"".into()),\n\n VectorParamsDiff {\n\n hnsw_config: Some(HnswConfigDiff {\n\n m: Some(32),\n\n ef_construct: Some(123),\n\n ..Default::default()\n\n }),\n\n ..Default::default()\n\n },\n\n )]),\n\n },\n\n ..."
aa11f9fef7241d75,What is the purpose of ‘CreatePayloadIndexAsync’?,"The purpose of `CreatePayloadIndexAsync` is to create an index on a specific field within a collection, which in this case is `group_id`. This indexing operation is designed to optimize the search queries that filter on the `group_id` field, making them faster by efficiently locating the nearest neighbors within each group. However, it is noted that global search requests that do not use the `group_id` filter may experience slower performance because they would require scanning across all groups.","await client.CreatePayloadIndexAsync(collectionName: ""{collection_name}"", fieldName: ""group_id"");\n\n```\n\n\n\n## Limitations\n\n\n\nOne downside to this approach is that global requests (without the `group_id` filter) will be slower since they will necessitate scanning all groups to identify the nearest neighbors.\n\nclient.createPayloadIndex(""{collection_name}"", {\n\n field_name: ""name_of_the_field_to_index"",\n\n field_schema: {\n\n type: ""text"",\n\n tokenizer: ""word"",\n\n min_token_len: 2,\n\n max_token_len: 15,\n\n lowercase: true,\n\n },\n\n});\n\n```\n\n\n\n```rust\n\nuse qdrant_client::{\n\n client::QdrantClient,\n\n qdrant::{\n\n payload_index_params::IndexParams, FieldType, PayloadIndexParams, TextIndexParams,\n\n TokenizerType,\n\n },\n\n};"
ee2bf20b47a300fb,How does oversampling helps?,"Oversampling helps in two distinct ways. In the context of training datasets, it helps to equalize the representation of classes, which facilitates more fair and accurate modeling of real-world scenarios. In the context of quantization for similarity search algorithms, oversampling allows for significant compression of high-dimensional vectors in memory while compensating for the accuracy loss by re-scoring additional points with the original vectors.","oversampling helps equalize the representation of classes in the training dataset, thus enabling more fair and accurate modeling of real-world scenarios.\n\n### Oversampling for quantization\n\n\n\nWe are introducing [oversampling](/documentation/guides/quantization/#oversampling) as a new way to help you improve the accuracy and performance of similarity search algorithms. With this method, you are able to significantly compress high-dimensional vectors in memory and then compensate the accuracy loss by re-scoring additional points with the original vectors."
aab780d492b0df42,What is ‘best_score’ strategy?,"The 'best_score' strategy is a method introduced in version 1.6.0 for finding similar vectors by identifying those that are closer to a positive example while avoiding those that are closer to a negative one. Unlike strategies that rely on averages, 'best_score' does not use a single query vector. Instead, it calculates the distance between a traversed point and each positive and negative example separately during each step of HNSW graph traversal. This approach is more flexible and allows for the passing of just negative samples. It employs a more sophisticated algorithm to determine the best score at every step.","### The new hotness - Best score\n\n\n\nThe new strategy is called `best_score`. It does not rely on averages and is more flexible. It allows you to pass just negative \n\nsamples and uses a slightly more sophisticated algorithm under the hood.\n\n\n\nThe best score is chosen at every step of HNSW graph traversal. We separately calculate the distance between a traversed point \n\nand every positive and negative example. In the case of the best score strategy, **there is no single query vector anymore, but a\n\nThis is the default strategy that's going to be set implicitly, but you can explicitly define it by setting `""strategy"": ""average_vector""` in the recommendation request.\n\n\n\n### Best score strategy\n\n\n\n*Available as of v1.6.0*\n\n\n\nA new strategy introduced in v1.6, is called `best_score`. It is based on the idea that the best way to find similar vectors is to find the ones that are closer to a positive example, while avoiding the ones that are closer to a negative one."
7bf4346a4098eada,What is difference between scalar and product quantization?,"Scalar Quantization and Product Quantization are both techniques used to compress data, but they differ in their approach and outcomes. Scalar Quantization compresses data by quantizing each component of the vector independently, which can be more straightforward and SIMD-friendly, leading to faster computations. On the other hand, Product Quantization achieves a higher compression rate by dividing the vector into sub-vectors and quantizing these sub-vectors independently. However, this method can result in a loss of accuracy and may not be as fast as Scalar Quantization when it comes to in-RAM search speed due to its non-SIMD-friendly nature of distance calculations.","</tr>\n\n </tbody>\n\n</table>\n\n\n\nIt turns out that in some cases, Product Quantization may not only reduce the memory usage, \n\nbut also the search time.\n\n\n\n## Good practices\n\n\n\nCompared to Scalar Quantization, Product Quantization offers a higher compression rate. However, this comes with considerable trade-offs in accuracy, and at times, in-RAM search speed.\n\n\n\nProduct Quantization tends to be favored in certain specific scenarios:\n\nBut there are some tradeoffs. Product quantization distance calculations are not SIMD-friendly, so it is slower than scalar quantization.\n\nAlso, product quantization has a loss of accuracy, so it is recommended to use it only for high-dimensional vectors.\n\n\n\nPlease refer to the [Quantization Tips](#quantization-tips) section for more information on how to optimize the quantization parameters for your use case.\n\n\n\n## How to choose the right quantization method"
6ff81acee6114830,Tell me about ‘always_ram’ parameter?,"The `always_ram` parameter is a setting that determines whether quantized vectors should be kept cached in RAM at all times. When set to `true`, it ensures that quantized vectors are stored in RAM, which can speed up the search process. By default, quantized vectors are not always kept in RAM; they are loaded as needed, similar to the original vectors. However, if you are experiencing a significant decrease in search quality or want to improve search performance, you might consider setting `always_ram` to `true` to keep the quantized vectors in RAM.","It might be worth tuning this parameter if you experience a significant decrease in search quality.\n\n\n\n`always_ram` - whether to keep quantized vectors always cached in RAM or not. By default, quantized vectors are loaded in the same way as the original vectors.\n\nHowever, in some setups you might want to keep quantized vectors in RAM to speed up the search process.\n\n\n\nIn this case, you can set `always_ram` to `true` to store quantized vectors in RAM.\n\n\n\n### Setting up Binary Quantization\n\n""compression"": ""x32"",\n\n ""always_ram"": true\n\n }\n\n },\n\n ""on_disk"": true\n\n }\n\n },\n\n ""hnsw_config"": {\n\n ""ef_construct"": 123\n\n },\n\n ""quantization_config"": {\n\n ""scalar"": {\n\n ""type"": ""int8"",\n\n ""quantile"": 0.8,\n\n ""always_ram"": false\n\n }\n\n }\n\n}\n\n```\n\n\n\n```bash\n\ncurl -X PATCH http://localhost:6333/collections/test_collec..."
f4257b93b605915d,What is vaccum optimizer ?,"The context provided does not contain any information about a ""vacuum optimizer."" It is possible that you are referring to a different concept or there might be a typo in your query. If you are looking for information on a specific feature or concept related to optimizers, please provide the correct term or additional context.","So, for example, if the number of points is less than 10000, using any index would be less efficient than a brute force scan.\n\n\n\nThe Indexing Optimizer is used to implement the enabling of indexes and memmap storage when the minimal amount of records is reached.\n\n\n\nThe criteria for starting the optimizer are defined in the configuration file.\n\n\n\nHere is an example of parameter values:\n\n\n\n```yaml\n\nstorage:\n\n optimizers:\n\n # Maximum size (in kilobytes) of vectors to store in-memory per segment.\n\nreturn optimizer\n\n```\n\n\n\nCaching in Quaterion is used for avoiding calculation of outputs of a frozen pretrained `Encoder` in every epoch.\n\nWhen it is configured, outputs will be computed once and cached in the preferred device for direct usage later on.\n\nIt provides both a considerable speedup and less memory footprint.\n\nHowever, it is quite a bit versatile and has several knobs to tune."


In [49]:
retrieved_documents_df_hybrid

Unnamed: 0_level_0,Unnamed: 1_level_0,context.trace_id,input,reference,document_score
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
483939ac09cf8106,0,05c01218d18e182252b4fefa2e82684d,What is the impact of ‘write_consistency_factor’ ?,"### Write consistency factor\n\n\n\nThe `write_consistency_factor` represents the number of replicas that must acknowledge a write operation before responding to the client. It is set to one by default.\n\nIt can be configured at the collection's creation time.\n\n\n\n```http\n\nPUT /collections/{collection_name}\n\n{\n\n ""vectors"": {\n\n ""size"": 300,\n\n ""distance"": ""Cosine""\n\n },\n\n ""shard_number"": 6,\n\n ""replication_factor"": 2,\n\n ""write_consistency_factor"": 2,\n\n}\n\n```\n\n\n\n```python",19.420935
483939ac09cf8106,1,05c01218d18e182252b4fefa2e82684d,What is the impact of ‘write_consistency_factor’ ?,"- `write_consistency_factor` - defines the number of replicas that must acknowledge a write operation before responding to the client. Increasing this value will make write operations tolerant to network partitions in the cluster, but will require a higher number of replicas to be active to perform write operations.",0.1
0a46933e8bfdf38d,0,663f1d1540de82a7ad575e205b5ff803,What is significance of ‘on_disk_payload’ setting?,* `on_disk_payload` - defines where to store payload data. If `true` - payload will be stored on disk only. Might be useful for limiting the RAM usage in case of large payload.\n\n* `quantization_config` - see [quantization](../../guides/quantization/#setting-up-quantization-in-qdrant) for details.\n\n\n\nDefault parameters for the optional collection parameters are defined in [configuration file](https://github.com/qdrant/qdrant/blob/master/config/config.yaml).,12.34962
0a46933e8bfdf38d,1,663f1d1540de82a7ad575e205b5ff803,What is significance of ‘on_disk_payload’ setting?,"The payload data is loaded into RAM at service startup while disk and [RocksDB](https://rocksdb.org/) are used for persistence only.\n\nThis type of storage works quite fast, but it may require a lot of space to keep all the data in RAM, especially if the payload has large values attached - abstracts of text or even images.\n\n\n\nIn the case of large payload values, it might be better to use OnDisk payload storage.",0.0
bc1217192fe1f4d8,0,5bea862dbb13dea05e5797a69d530e35,How do you use ‘ordering’ parameter?,"Since the `filter` parameter is specified, the search is performed only among those points that satisfy the filter condition.\n\nSee details of possible filters and their work in the [filtering](../filtering) section.\n\n\n\nExample result of this API would be\n\n\n\n```json\n\n{\n\n ""result"": [\n\n { ""id"": 10, ""score"": 0.81 },\n\n { ""id"": 14, ""score"": 0.75 },\n\n { ""id"": 11, ""score"": 0.73 }\n\n ],\n\n ""status"": ""ok"",\n\n ""time"": 0.001\n\n}\n\n```\n\n\n\nThe `result` contains ordered by `score` list of found point ids.",11.016777
bc1217192fe1f4d8,1,5bea862dbb13dea05e5797a69d530e35,How do you use ‘ordering’ parameter?,"- Write `ordering` param, can be used with update and delete operations to ensure that the operations are executed in the same order on all replicas. If this option is used, Qdrant will route the operation to the leader replica of the shard and wait for the response before responding to the client. This option is useful to avoid data inconsistency in case of concurrent updates of the same documents",0.1
6f3cfcf870268b23,0,851c0526ccef268680c9818a4dfebea3,What is the purpose of ef_construct in HNSW ?,"In order to improve performance, HNSW limits the maximum degree of nodes on each layer of the graph to `m`. In addition, you can use `ef_construct` (when building index) or `ef` (when searching targets) to specify a search range.\n\n\n\nThe corresponding parameters could be configured in the configuration file:\n\n\n\n```yaml\n\nstorage:\n\n # Default parameters of HNSW Index. Could be overridden for each collection or named vector individually\n\n hnsw_index:\n\n # Number of edges per node in the index graph.",15.599039
6f3cfcf870268b23,1,851c0526ccef268680c9818a4dfebea3,What is the purpose of ef_construct in HNSW ?,"(""my_vector"".into()),\n\n VectorParamsDiff {\n\n hnsw_config: Some(HnswConfigDiff {\n\n m: Some(32),\n\n ef_construct: Some(123),\n\n ..Default::default()\n\n }),\n\n ..Default::default()\n\n },\n\n )]),\n\n },\n\n )),\n\n }),",0.1
40a17423fd3bb548,0,4d1a0bfa050b72d8ea801b5182d0ba1e,What is the purpose of ‘CreatePayloadIndexAsync’?,"await client.CreatePayloadIndexAsync(collectionName: ""{collection_name}"", fieldName: ""group_id"");\n\n```\n\n\n\n## Limitations\n\n\n\nOne downside to this approach is that global requests (without the `group_id` filter) will be slower since they will necessitate scanning all groups to identify the nearest neighbors.",15.6317
40a17423fd3bb548,1,4d1a0bfa050b72d8ea801b5182d0ba1e,What is the purpose of ‘CreatePayloadIndexAsync’?,"client.createPayloadIndex(""{collection_name}"", {\n\n field_name: ""name_of_the_field_to_index"",\n\n field_schema: {\n\n type: ""text"",\n\n tokenizer: ""word"",\n\n min_token_len: 2,\n\n max_token_len: 15,\n\n lowercase: true,\n\n },\n\n});\n\n```\n\n\n\n```rust\n\nuse qdrant_client::{\n\n client::QdrantClient,\n\n qdrant::{\n\n payload_index_params::IndexParams, FieldType, PayloadIndexParams, TextIndexParams,\n\n TokenizerType,\n\n },\n\n};",0.1


### **17. Define your evaluation model and your evaluators for Hybrid Search**

Next, define your evaluation model and your evaluators.

Evaluators are built on top of language models: they prompt an LLM to assess the quality of responses, the relevance of retrieved documents, and so on, and they provide a quality signal even when no human-labeled data is available. Pick an evaluator type and instantiate it with the language model you want to use for the evaluations; each evaluator applies one of Phoenix's built-in evaluation templates.
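To make the idea concrete, here is a minimal, library-free sketch of what an LLM-based evaluator does: fill a prompt template, ask a model, and map the reply onto a fixed set of labels. This is an illustration only, not Phoenix's implementation. The template text, the `classify` and `fake_llm` names, and the `unparseable` label are all hypothetical, and `fake_llm` is a keyword-matching stand-in for a real model call.

```python
from typing import Callable

# Hypothetical template for illustration; Phoenix ships its own templates.
RELEVANCE_TEMPLATE = (
    "Question: {input}\n"
    "Reference text: {reference}\n"
    "Answer with a single word, 'relevant' or 'unrelated'."
)
ALLOWED_LABELS = ("relevant", "unrelated")


def classify(record: dict, llm: Callable[[str], str]) -> dict:
    """Fill the template, query the model, and snap the reply onto an allowed label."""
    prompt = RELEVANCE_TEMPLATE.format(**record)
    reply = llm(prompt).strip().lower()
    label = reply if reply in ALLOWED_LABELS else "unparseable"
    return {"label": label, "score": 1 if label == "relevant" else 0}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: a keyword match instead of an API request."""
    reference = prompt.split("Reference text:", 1)[1]
    return "Relevant" if "write_consistency_factor" in reference else "unrelated"


result = classify(
    {
        "input": "What is the impact of write_consistency_factor?",
        "reference": "`write_consistency_factor` defines the number of replicas "
        "that must acknowledge a write operation.",
    },
    fake_llm,
)
print(result)
```

Restricting the reply to a fixed label set is what lets the results be aggregated as numeric scores afterwards.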

In [50]:


# Evaluation model shared by all three evaluators; the results are logged to `HYBRID_RAG_PROJECT` below.
eval_model = OpenAIModel(
    model="gpt-4-turbo-preview",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

# run_evals returns one DataFrame per evaluator, in the order the evaluators are passed.
hallucination_eval_df_hybrid, qa_correctness_eval_df_hybrid = run_evals(
    dataframe=queries_df_hybrid,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
# A single evaluator still yields a one-element list, hence the [0].
relevance_eval_df_hybrid = run_evals(
    dataframe=retrieved_documents_df_hybrid,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df_hybrid),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df_hybrid),
    project_name=HYBRID_RAG_PROJECT,
)
px.Client().log_evaluations(
    DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df_hybrid),
    project_name=HYBRID_RAG_PROJECT,
)

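Once the evaluations have run, a quick aggregate can help compare the hybrid pipeline against the simple one. The sketch below assumes each evaluation DataFrame has `label` and `score` columns. The `summarize` helper is hypothetical, and `example_eval_df` holds made-up values purely so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for an evaluation DataFrame; these values are invented for illustration.
example_eval_df = pd.DataFrame(
    {
        "label": ["correct", "correct", "incorrect", "correct"],
        "score": [1, 1, 0, 1],
    }
)


def summarize(eval_df: pd.DataFrame) -> dict:
    """Return the mean score and the per-label counts of an evaluation DataFrame."""
    return {
        "mean_score": float(eval_df["score"].mean()),
        "label_counts": eval_df["label"].value_counts().to_dict(),
    }


summary = summarize(example_eval_df)
print(summary)
```

In the notebook, the same helper could be applied to `qa_correctness_eval_df_hybrid` or `relevance_eval_df_hybrid`, provided those frames expose the assumed columns.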