
Can't set indexing_threshold to 0 for bulk upload #620

Open
marcossilva opened this issue May 2, 2024 · 3 comments

Comments


I'm trying to bulk upload 4.5M points with Qdrant but have been struggling with the ingestion time. I tried running it in memory to speed things up and tried upload_collection as suggested in the points docs, but the main problem seems to be that I cannot set indexing_threshold to 0 as suggested in the bulk upload docs.
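For reference, the pattern described in the bulk upload docs looks roughly like this (a minimal sketch; the server URL and collection name below are placeholders):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, OptimizersConfigDiff, VectorParams

client = QdrantClient(url="http://localhost:6333")  # placeholder server URL

# Disable indexing while bulk uploading.
client.create_collection(
    collection_name="bulk_demo",  # placeholder collection name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    optimizers_config=OptimizersConfigDiff(indexing_threshold=0),
)

# ... upload the points here ...

# Re-enable indexing once the upload is done (20000 is the default threshold).
client.update_collection(
    collection_name="bulk_demo",
    optimizers_config=OptimizersConfigDiff(indexing_threshold=20000),
)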

Current Behavior

On v1.9.0, neither creating a new collection with optimizers_config setting indexing_threshold to 0, nor updating the collection afterwards to set indexing_threshold to 0, has any effect, as the snippet below demonstrates.

Steps to Reproduce

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    OptimizersConfigDiff,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
    VectorParams,
)

client = QdrantClient(":memory:", prefer_grpc=True)
if not client.collection_exists("title_vectors_simple"):
    client.create_collection(
        collection_name="title_vectors_simple",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
        optimizers_config=OptimizersConfigDiff(
            indexing_threshold=0,
        ),
        shard_number=1,
        quantization_config=ScalarQuantization(
            scalar=ScalarQuantizationConfig(
                type=ScalarType.INT8,
                quantile=0.99,
                always_ram=True,
            ),
        ),
    )

# Returns False
client.update_collection(
    collection_name="title_vectors_simple",
    optimizers_config=OptimizersConfigDiff(
        indexing_threshold=0,
    ),
)

print(client.get_collection("title_vectors_simple"))

Output:

status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>
vectors_count=None indexed_vectors_count=0 points_count=0 segments_count=1
config=CollectionConfig(
    params=CollectionParams(
        vectors=VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>,
            hnsw_config=None, quantization_config=None, on_disk=None, datatype=None),
        shard_number=None, sharding_method=None, replication_factor=None,
        write_consistency_factor=None, read_fan_out_factor=None,
        on_disk_payload=None, sparse_vectors=None),
    hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000,
        max_indexing_threads=0, on_disk=None, payload_m=None),
    optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000,
        default_segment_number=0, max_segment_size=None, memmap_threshold=None,
        **indexing_threshold**=20000, flush_interval_sec=5, max_optimization_threads=1),
    wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0),
    quantization_config=None)
payload_schema={}

Expected Behavior

I expected that creating the collection (or updating it afterwards) would apply the parameters I passed, but most of the configuration I set at collection creation was simply ignored.

generall commented May 2, 2024

Hey @marcossilva

It looks like you are using local mode: client = QdrantClient(":memory:")

Local mode doesn't build any index and is not intended for large workloads. Could you please try the server version with the same script?
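A minimal sketch of the server-mode initialization (assuming a Qdrant server reachable at localhost:6333):

from qdrant_client import QdrantClient

# Connect to a running Qdrant server instead of the in-process local mode;
# local mode (":memory:" or a path) does not build an index, so optimizer
# settings such as indexing_threshold have no effect there.
client = QdrantClient(url="http://localhost:6333", prefer_grpc=True)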

marcossilva commented May 3, 2024

Thanks for the quick reply @generall . I'm currently using an internally deployed Qdrant and ran into the same problem there. I tried both the in-memory and path-based client initializations to debug locally, and the issue occurs in every setup: locally (in-memory, path-based, and against a local Docker Qdrant server) as well as in the Qdrant deployed in our Kubernetes cluster.

joein commented May 21, 2024

Hi @marcossilva , not sure if the issue still persists for you.

Once you switch from local mode to server mode, your code should be able to set indexing_threshold.
However, switching indexing off is usually not required for this number of points.
Could you measure how long uploading this number of points currently takes, and what upload time you're trying to achieve?

What batch size are you using?
Are your embeddings already computed by the time you upload, or do you generate them during the upload process?
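For context, a sketch of how batch size and parallelism can be tuned via upload_collection (the vectors, batch size, and worker count below are illustrative, not recommendations):

import numpy as np
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333", prefer_grpc=True)

# Illustrative pre-computed embeddings; in practice this would be the 4.5M points.
vectors = np.random.rand(100_000, 384).astype(np.float32)

client.upload_collection(
    collection_name="title_vectors_simple",
    vectors=vectors,
    batch_size=256,  # illustrative: larger batches amortize per-request overhead
    parallel=4,      # illustrative: number of parallel upload workers
)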
