
Can't set indexing_threshold to 0 for bulk upload #620

Open
marcossilva opened this issue May 2, 2024 · 3 comments

Comments


I'm trying to bulk upload 4.5M points with Qdrant but have been struggling with the ingestion time. I tried running it in memory to speed things up and tried upload_collection as suggested in the points docs, but the main problem seems to be that I cannot set indexing_threshold to 0 as suggested in the bulk upload docs.
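For reference, the pattern described in the bulk upload docs looks roughly like this (a minimal sketch; the server URL and collection name below are placeholders):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, OptimizersConfigDiff, VectorParams

client = QdrantClient(url="http://localhost:6333")  # placeholder server URL

# Disable indexing while bulk uploading.
client.create_collection(
    collection_name="bulk_demo",  # placeholder collection name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    optimizers_config=OptimizersConfigDiff(indexing_threshold=0),
)

# ... upload the points here ...

# Re-enable indexing once the upload is done (20000 is the default threshold).
client.update_collection(
    collection_name="bulk_demo",
    optimizers_config=OptimizersConfigDiff(indexing_threshold=20000),
)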

Current Behavior

On v1.9.0, neither creating a new collection with optimizers_config setting indexing_threshold to 0, nor updating the collection afterwards to set indexing_threshold to 0, has any effect, as the snippet below demonstrates.

Steps to Reproduce

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    OptimizersConfigDiff,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
    VectorParams,
)

client = QdrantClient(":memory:", prefer_grpc=True)
if not client.collection_exists("title_vectors_simple"):
    client.create_collection(
        collection_name="title_vectors_simple",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
        optimizers_config=OptimizersConfigDiff(
            indexing_threshold=0,
        ),
        shard_number=1,
        quantization_config=ScalarQuantization(
            scalar=ScalarQuantizationConfig(
                type=ScalarType.INT8,
                quantile=0.99,
                always_ram=True,
            ),
        ),
    )

# Returns False
client.update_collection(
    collection_name="title_vectors_simple",
    optimizers_config=OptimizersConfigDiff(
        indexing_threshold=0,
    ),
)

print(client.get_collection("title_vectors_simple"))

Output:

status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>
vectors_count=None indexed_vectors_count=0 points_count=0 segments_count=1
config=CollectionConfig(
    params=CollectionParams(
        vectors=VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>,
            hnsw_config=None, quantization_config=None, on_disk=None, datatype=None),
        shard_number=None, sharding_method=None, replication_factor=None,
        write_consistency_factor=None, read_fan_out_factor=None,
        on_disk_payload=None, sparse_vectors=None),
    hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000,
        max_indexing_threads=0, on_disk=None, payload_m=None),
    optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000,
        default_segment_number=0, max_segment_size=None, memmap_threshold=None,
        **indexing_threshold**=20000, flush_interval_sec=5, max_optimization_threads=1),
    wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0),
    quantization_config=None)
payload_schema={}

Expected Behavior

I expected that creating the collection (or updating it afterwards) would apply the parameters I passed, but most of the configuration I set at collection creation was simply ignored.

generall commented May 2, 2024

Hey @marcossilva

It looks like you are using local mode: client = QdrantClient(":memory:")

Local mode doesn't build any index and is not intended for large workloads. Could you please try the server version with the same script?
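A minimal sketch of the server-mode initialization (assuming a Qdrant server reachable at localhost:6333):

from qdrant_client import QdrantClient

# Connect to a running Qdrant server instead of the in-process local mode;
# local mode (":memory:" or a path) does not build an index, so optimizer
# settings such as indexing_threshold have no effect there.
client = QdrantClient(url="http://localhost:6333", prefer_grpc=True)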

marcossilva commented May 3, 2024

Thanks for the quick reply @generall . I'm currently using an internally deployed Qdrant and ran into the same problem there. I tried both the in-memory and path-based client initializations to debug locally, and the issue occurs in every setup: locally (in-memory, path-based, and against a local Docker Qdrant server) as well as in the Qdrant deployed in our Kubernetes cluster.

joein commented May 21, 2024

Hi @marcossilva , not sure if the issue still persists for you.

Once you switch from local mode to server mode, your code should be able to set indexing_threshold.
However, switching indexing off is usually not required for this number of points.
Could you measure how long uploading this number of points currently takes, and what upload time you're trying to achieve?

What batch size are you using?
Are your embeddings already computed by the time you upload, or do you generate them during the upload process?
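For context, a sketch of how batch size and parallelism can be tuned via upload_collection (the vectors, batch size, and worker count below are illustrative, not recommendations):

import numpy as np
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333", prefer_grpc=True)

# Illustrative pre-computed embeddings; in practice this would be the 4.5M points.
vectors = np.random.rand(100_000, 384).astype(np.float32)

client.upload_collection(
    collection_name="title_vectors_simple",
    vectors=vectors,
    batch_size=256,  # illustrative: larger batches amortize per-request overhead
    parallel=4,      # illustrative: number of parallel upload workers
)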
