@matiasdaloia
Contributor

@matiasdaloia matiasdaloia commented Oct 13, 2025

Summary by CodeRabbit

  • New Features
    • Fast Import Mode: disables indexing during import, adds pre/post health checks, improved import lifecycle logs, and enables post-import background indexing.
  • Refactor
    • Batch processing reworked to reduce concurrency: sequential upserts, per-worker rate limiting, and safer cleanup on errors.
  • Chores
    • Updated Docker Compose mounts/ports and added a tuned Qdrant configuration for bulk imports.
    • Overhauled linter/CI configuration and expanded .gitignore.

@matiasdaloia matiasdaloia self-assigned this Oct 13, 2025
@matiasdaloia matiasdaloia added the enhancement New feature or request label Oct 13, 2025
@coderabbitai

coderabbitai bot commented Oct 13, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds Qdrant bulk-import changes: pre-import health checks, sequential per-language batch upserts with optional delays, import-time HNSW/indexing disabled and re-enabled post-import, plus new Qdrant config and docker-compose bind mounts. Also updates lint configs, minor API parameter renames, and small internal refactors.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Ignore updates: .gitignore | Adds target to ignored paths; keeps existing papi/ entry. |
| Importer / Qdrant import flow: cmd/import/main.go | Adds Qdrant health checks and cleanup handling; disables HNSW/indexing during import and re-enables it after import; uses sequential per-language batch upserts with a configurable BatchInsertDelay; adjusts logging and per-worker behavior; adds post-import indexing updater and stats reporting. |
| Qdrant runtime & compose: qdrant-config.yaml, docker-compose.qdrant.yml | Adds a new Qdrant config tuned for bulk import (storage, WAL, optimizer, performance, service limits); docker-compose switches to host bind mounts for data/config, exposes ports (6333, 6334, 6335), removes snapshots volume, and specifies bridge network driver. |
| Linting / CI: .golangci.yml, .github/workflows/golangci-lint.yml | Reworks golangci-lint config to formatter-based settings, curated linter set, expanded linters-settings and exclusions; updates GitHub Action to golangci-lint-action v8 and normalizes YAML formatting. |
| gRPC / REST server param rename: internal/protocol/grpc/server.go, internal/protocol/rest/server.go | Renames function parameter from config to cfg and updates internal references to use cfg.*; behavior unchanged aside from naming. |
| Handler constructor rename: internal/handler/scan_handler.go | Changes NewScanHandler second parameter name from mapper to scanMapper and assigns to the handler field accordingly; signature updated. |
| Internal service minor logic: internal/service/scan_service_impl.go | Simplifies boolean expression (logical equivalence) for threshold checking without changing behavior. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User/CLI
  participant IMP as Import CLI
  participant Q as Qdrant

  U->>IMP: start import
  IMP->>Q: health check (ListCollections)
  alt healthy
    IMP->>Q: create collections (indexing/HNSW disabled)
    loop per language
      loop per batch
        IMP->>Q: upsert batch (sequential per collection)
        IMP-->>IMP: optional BatchInsertDelay
      end
    end
    IMP->>Q: enable indexing / updateCollectionIndexing
    IMP->>Q: fetch collection stats
    IMP-->>U: report completion & stats
  else unhealthy
    IMP-->>U: log unhealthy and exit
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paws—import night!
Hush HNSW, we’ll index right.
Batch by batch, a tidy queue,
Qdrant hums with ports anew.
Config snug in burrowed file—🥕✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title succinctly summarizes the primary change of reducing resource overhead in the import script while remaining concise and specific to the core of the changeset. It's directly related to the main modifications, making it clear to a reviewer. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between faa7203 and 85e020d.

📒 Files selected for processing (7)
  • .github/workflows/golangci-lint.yml (2 hunks)
  • .golangci.yml (1 hunks)
  • cmd/import/main.go (11 hunks)
  • internal/handler/scan_handler.go (1 hunks)
  • internal/protocol/grpc/server.go (2 hunks)
  • internal/protocol/rest/server.go (2 hunks)
  • internal/service/scan_service_impl.go (1 hunks)


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/import/main.go (1)

374-381: Increment errorCount for read errors

Currently logged but not counted.

 		record, err := reader.Read()
 		if err != nil {
 			if err == io.EOF {
 				break
 			}
+			errorCount++
 			log.Printf("WARNING: Error reading line %d in file %s: %v", lineNumber, filePath, err)
 			continue
 		}
🧹 Nitpick comments (9)
.gitignore (1)

25-26: Fix typo in comment

"Ingore" → "Ignore".

-# Ingore the version text file. only used during build
+# Ignore the version text file. only used during build
cmd/import/main.go (6)

262-291: Double-check disabling HNSW via M=0 + IndexingThreshold=0

  • M=0 may be rejected by Qdrant depending on version; IndexingThreshold=0 already disables index building during import.
  • Consider dropping M override and rely on IndexingThreshold=0, or confirm M=0 is supported.

If you prefer to avoid risk, remove M here and keep OptimizersConfig.IndexingThreshold=0 to disable indexing during import.


266-267: Use Hamming distance for binary 0/1 vectors (more suitable than Manhattan).

For 0/1 vectors, Hamming matches intent and may be more efficient.

-			Distance: qdrant.Distance_Manhattan,
+			Distance: qdrant.Distance_Hamming,

Apply to dirs/names/contents.

Also applies to: 275-276, 284-285


396-419: Make batch insert delay configurable via flag

100ms may be too slow/fast depending on environment. Expose as flag (e.g., -batch-insert-delay=100ms) with sensible default.


612-624: Deterministic per-collection upsert order (optional)

Iterating maps yields random order. Sorting keys improves reproducibility of logs/troubleshooting.

Example:

// at top imports
import "sort"

// replace map iteration
keys := make([]string, 0, len(collectionPoints))
for k := range collectionPoints {
    keys = append(keys, k)
}
sort.Strings(keys)
for _, collectionName := range keys {
    points := collectionPoints[collectionName]
    ...
}

671-688: Remove redundant file open; use os.ReadFile(absPath) directly

File is opened and closed but unused; then os.ReadFile reads again.

-	file, err := os.Open(absPath)
-	if err != nil {
-		return nil, err
-	}
-	defer func() {
-		if err := file.Close(); err != nil {
-			log.Printf("Warning: Error closing file %s: %v", filename, err)
-		}
-	}()
-
-	data, err := os.ReadFile(filename)
+	data, err := os.ReadFile(absPath)

258-347: Prefer returning errors over log.Fatalf in helpers

createCollection uses log.Fatalf on failure, exiting the process from deep inside. Prefer returning errors and handling them in main. This also avoids skipping defers.

docker-compose.qdrant.yml (2)

19-22: Remove unused named volume or use it in service

Service now uses a bind mount ./qdrant_data:/qdrant/storage, but a named volume qdrant_data is still defined and unused.

Option A: Remove the named volume:

-volumes:
-  qdrant_data:
-    driver: local

Option B: Use the named volume in the service instead of the bind mount.


3-3: Pin Qdrant image to a stable version

Avoid :latest to ensure reproducible environments.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between d1d57ca and faa7203.

📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • cmd/import/main.go (7 hunks)
  • docker-compose.qdrant.yml (1 hunks)
  • qdrant-config.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cmd/import/main.go (1)
internal/repository/scan_repository_qdrant_impl.go (1)
  • VectorDim (35-35)
🪛 GitHub Actions: Golang CI Lint
cmd/import/main.go

[error] 113-113: golangci-lint (gocritic): exitAfterDefer - log.Fatalf will exit and defer will not run.

🪛 GitHub Check: build
cmd/import/main.go

[failure] 113-113:
exitAfterDefer: log.Fatalf will exit, and defer func(){...}(...) will not run (gocritic)

🔇 Additional comments (3)
cmd/import/main.go (1)

293-304: Indexing disabled is already enforced via IndexingThreshold=0

Good use of IndexingThreshold=0 to defer index building. If M=0 proves problematic at create-time, this setting alone achieves the “fast import” goal.

docker-compose.qdrant.yml (1)

11-14: Confirm whether exposing 6335 is needed

Port 6335 is typically the P2P/cluster port. If not running a cluster, it can be omitted.

qdrant-config.yaml (1)

7-51: Configuration structure is correct
performance, wal, and optimizers are correctly nested under storage per Qdrant’s official schema. No changes required.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
cmd/import/main.go (3)

176-195: Compile error in worker loop; also fix var naming to satisfy lint

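Ranging over an integer (for i := range n) only compiles on Go 1.22 or later. Use a classic for loop with an index if older toolchains must be supported; also rename workerId -> workerID.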

Apply:

-	for workerId := range MaxWorkers {
-		wg.Add(1)
-		go func(workerId int) {
+	for workerID := 0; workerID < MaxWorkers; workerID++ {
+		wg.Add(1)
+		go func(workerID int) {
 			defer wg.Done()
 			for file := range filesChan {
 				sectorName := filepath.Base(file)
 				sectorName = strings.TrimSuffix(sectorName, ".csv")
-				log.Printf("Worker %d: Processing sector %s", workerId, sectorName)
+				log.Printf("Worker %d: Processing sector %s", workerID, sectorName)
 
-				err := importCSVFile(ctx, client, file, sectorName)
-				if err != nil {
-					log.Printf("Worker %d: Error importing file %s: %v", workerId, file, err)
+				if err := importCSVFile(ctx, client, file, sectorName); err != nil {
+					log.Printf("Worker %d: Error importing file %s: %v", workerID, file, err)
 					errorsChan <- fmt.Errorf("error importing file %s: %w", file, err)
 				} else {
-					log.Printf("Worker %d: Successfully processed sector %s", workerId, sectorName)
+					log.Printf("Worker %d: Successfully processed sector %s", workerID, sectorName)
 				}
 			}
-		}(workerId)
+		}(workerID)
 	}

This addresses both the compile error and revive var-naming warning.


133-137: Avoid log.Fatalf after setting defers; it prevents client.Close from running (pipeline failure)

golangci-lint exitAfterDefer is flagging this. Replace Fatalf with logging + return to allow deferred client.Close to execute.

Apply:

-			err = client.DeleteCollection(ctx, collectionName)
-			if err != nil {
-				log.Fatalf("Error dropping collection %s: %v", collectionName, err)
-			}
+			err = client.DeleteCollection(ctx, collectionName)
+			if err != nil {
+				log.Printf("Error dropping collection %s: %v", collectionName, err)
+				return
+			}
-	if err != nil {
-		log.Fatalf("Error reading directory: %v", err)
-	}
+	if err != nil {
+		log.Printf("Error reading directory: %v", err)
+		return
+	}
-	if err != nil {
-		log.Fatalf("Error creating collection %s: %v", collectionName, err)
-	}
+	if err != nil {
+		log.Printf("Error creating collection %s: %v", collectionName, err)
+		return
+	}

Alternatively, refactor main into a run() error pattern and exit in main after defers run.
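
The run() error pattern could look roughly like this. It is a sketch under stated assumptions: `fakeClient` is a hypothetical stand-in for the Qdrant client, and the `fail` parameter only simulates an error such as a failed DeleteCollection call.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// fakeClient is a hypothetical stand-in for the real Qdrant client.
type fakeClient struct{ closed bool }

func (c *fakeClient) Close() error {
	c.closed = true
	return nil
}

// run holds the import logic and returns errors instead of exiting,
// so the deferred Close always executes before the process ends.
func run(client *fakeClient, fail bool) error {
	defer func() {
		if err := client.Close(); err != nil {
			fmt.Fprintln(os.Stderr, "warning: closing client:", err)
		}
	}()
	if fail {
		return errors.New("error dropping collection")
	}
	return nil
}

func main() {
	client := &fakeClient{}
	if err := run(client, false); err != nil {
		// os.Exit is called only here, after run's defers have finished.
		fmt.Fprintln(os.Stderr, "import failed:", err)
		os.Exit(1)
	}
	fmt.Println("client closed:", client.closed) // prints "client closed: true"
}
```

Keeping the only exit call in main also satisfies the gocritic exitAfterDefer check, because no defer is registered in the function that exits.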

Also applies to: 151-153, 313-315


372-381: Track CSV read errors

errorCount is never incremented; increment it when a non-EOF error occurs to make the later stats accurate.

Apply:

 		record, err := reader.Read()
 		if err != nil {
 			if err == io.EOF {
 				break
 			}
 			log.Printf("WARNING: Error reading line %d in file %s: %v", lineNumber, filePath, err)
+			errorCount++
 			continue
 		}
🧹 Nitpick comments (7)
qdrant-config.yaml (2)

52-65: Consider adding optimizers.indexing_threshold and confirm timeouts

  • You might want optimizers.indexing_threshold: 0 in the file to align with your “indexing disabled” import logic (keeps it consistent if collections inherit defaults).
  • max_timeout_sec: 120 is fine; ensure the importer also uses request-level context timeouts to respect this.

76-77: Telemetry setting contradicts the comment

Comment says “set to false to disable”, and you set false (disabled). If you intended to reduce overhead during bulk import, this is fine; otherwise set true.

docker-compose.qdrant.yml (3)

9-14: Fix volumes mismatch; remove redundant expose (6335) unless clustering

  • You declare a named volume qdrant_data but mount a bind path. Pick one. Recommended: use the named volume.
  • expose duplicates ports already published; 6335 is unused unless cluster mode.

Apply:

-      - ./qdrant_data:/qdrant/storage
+      - qdrant_data:/qdrant/storage
       - ./qdrant-config.yaml:/qdrant/config/production.yaml:ro
-    expose:
-      - 6333
-      - 6334
-      - 6335

3-3: Pin Qdrant image to a version

Avoid latest to ensure reproducibility and schema stability for your config.

Example: qdrant/qdrant:1.10.0 (verify current).


19-22: Named volume declared but unused with current bind mount

If you keep the bind mount approach, remove the top-level named volume. If you adopt the named volume (recommended), keep this section and adjust the service volumes as suggested above.

cmd/import/main.go (2)

242-254: Add a timeout to the health check

Protects CLI from hanging on network issues.

Apply:

-func verifyQdrantHealth(ctx context.Context, client *qdrant.Client) error {
-	log.Println("Performing Qdrant health check...")
+func verifyQdrantHealth(parent context.Context, client *qdrant.Client) error {
+	log.Println("Performing Qdrant health check...")
+	ctx, cancel := context.WithTimeout(parent, 10*time.Second)
+	defer cancel()

614-621: Consider request-level timeouts for Upsert

To avoid long hangs during import spikes, wrap Upsert in a context with timeout (e.g., 30–60s) and retry with backoff if needed.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between d1d57ca and c407b76.

📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • cmd/import/main.go (9 hunks)
  • docker-compose.qdrant.yml (1 hunks)
  • qdrant-config.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cmd/import/main.go (1)
internal/repository/scan_repository_qdrant_impl.go (1)
  • VectorDim (35-35)
🪛 GitHub Actions: Golang CI Lint
cmd/import/main.go

[error] 134-134: golangci-lint (gocritic): exitAfterDefer: log.Fatalf will exit, and defer func(){...}(...) will not run.

🪛 GitHub Check: build
cmd/import/main.go

[failure] 177-177:
var-naming: range var workerId should be workerID (revive)

🔇 Additional comments (5)
.gitignore (1)

46-47: LGTM

Ignoring papi/ and target is reasonable for local artifacts.

cmd/import/main.go (3)

260-289: Confirm Qdrant semantics for disabling and later enabling HNSW with named vectors

  • Setting HNSW M=0 per vector disables index building; later UpdateCollection sets HnswConfig at collection level. Confirm that collection-level HnswConfig overrides named vectors (or update per vector if required).
  • IndexingThreshold: 0 in CreateCollection and 100000 in UpdateCollection look consistent with your import strategy.

Also applies to: 297-304


559-587: misc_collection is included in supported collections
Verified that entities.GetAllSupportedCollections() returns "misc_collection" and that all supported collections (including misc_collection) are checked/created at startup.


478-490: BinaryQuantization in Qdrant only emulates dot-product/cosine, no native Hamming/L1 support
Qdrant’s BinaryQuantization binarizes floats for fast XOR+popcount dot-product approximations—it doesn’t provide native Hamming or Manhattan distances on dense vectors and no separate binary-vector type with selectable Hamming metric exists. Use dot-product (cosine) for quantized data or compute Hamming externally.

Likely an incorrect or invalid review comment.

qdrant-config.yaml (1)

7-51: Review comment incorrect: configuration placement matches Qdrant schema
The wal and optimizers blocks belong under storage (and storage.performance is supported). No changes needed.

Likely an incorrect or invalid review comment.

@matiasdaloia matiasdaloia merged commit 05320e7 into main Oct 13, 2025
2 of 3 checks passed
@matiasdaloia matiasdaloia deleted the fix/mdaloia/import-script-revamp branch October 13, 2025 16:29