-
Notifications
You must be signed in to change notification settings - Fork 0
fix: import script resource overhead #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Caution Review failedThe pull request is closed. WalkthroughAdds Qdrant bulk-import changes: pre-import health checks, sequential per-language batch upserts with optional delays, import-time HNSW/indexing disabled and re-enabled post-import, plus new Qdrant config and docker-compose bind mounts. Also updates lint configs, minor API parameter renames, and small internal refactors. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant U as User/CLI
participant IMP as Import CLI
participant Q as Qdrant
U->>IMP: start import
IMP->>Q: health check (ListCollections)
alt healthy
IMP->>Q: create collections (indexing/HNSW disabled)
loop per language
loop per batch
IMP->>Q: upsert batch (sequential per collection)
IMP-->>IMP: optional BatchInsertDelay
end
end
IMP->>Q: enable indexing / updateCollectionIndexing
IMP->>Q: fetch collection stats
IMP-->>U: report completion & stats
else unhealthy
IMP-->>U: log unhealthy and exit
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (7)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/import/main.go (1)
374-381: Increment errorCount for read errorsCurrently logged but not counted.
- record, err := reader.Read() + record, err := reader.Read() if err != nil { if err == io.EOF { break } + errorCount++ log.Printf("WARNING: Error reading line %d in file %s: %v", lineNumber, filePath, err) continue }
🧹 Nitpick comments (9)
.gitignore (1)
25-26: Fix typo in comment"Ingore" → "Ignore".
-# Ingore the version text file. only used during build +# Ignore the version text file. only used during buildcmd/import/main.go (6)
262-291: Double-check disabling HNSW via M=0 + IndexingThreshold=0
- M=0 may be rejected by Qdrant depending on version; IndexingThreshold=0 already disables index building during import.
- Consider dropping M override and rely on IndexingThreshold=0, or confirm M=0 is supported.
If you prefer to avoid risk, remove M here and keep OptimizersConfig.IndexingThreshold=0 to disable indexing during import.
266-267: Use Hamming distance for binary 0/1 vectors (more suitable than Manhattan).For 0/1 vectors, Hamming matches intent and may be more efficient.
- Distance: qdrant.Distance_Manhattan, + Distance: qdrant.Distance_Hamming,Apply to dirs/names/contents.
Also applies to: 275-276, 284-285
396-419: Make batch insert delay configurable via flag100ms may be too slow/fast depending on environment. Expose as flag (e.g., -batch-insert-delay=100ms) with sensible default.
612-624: Deterministic per-collection upsert order (optional)Iterating maps yields random order. Sorting keys improves reproducibility of logs/troubleshooting.
Example:
// at top imports import "sort" // replace map iteration keys := make([]string, 0, len(collectionPoints)) for k := range collectionPoints { keys = append(keys, k) } sort.Strings(keys) for _, collectionName := range keys { points := collectionPoints[collectionName] ... }
671-688: Remove redundant file open; use os.ReadFile(absPath) directlyFile is opened and closed but unused; then os.ReadFile reads again.
- file, err := os.Open(absPath) - if err != nil { - return nil, err - } - defer func() { - if err := file.Close(); err != nil { - log.Printf("Warning: Error closing file %s: %v", filename, err) - } - }() - - data, err := os.ReadFile(filename) + data, err := os.ReadFile(absPath)
258-347: Prefer returning errors over log.Fatalf in helperscreateCollection uses log.Fatalf on failure, exiting the process from deep inside. Prefer returning errors and handling them in main. This also avoids skipping defers.
docker-compose.qdrant.yml (2)
19-22: Remove unused named volume or use it in serviceService now uses a bind mount
./qdrant_data:/qdrant/storage, but a named volumeqdrant_datais still defined and unused.Option A: Remove the named volume:
-volumes: - qdrant_data: - driver: localOption B: Use the named volume in the service instead of the bind mount.
3-3: Pin Qdrant image to a stable versionAvoid
:latestto ensure reproducible environments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
.gitignore(1 hunks)cmd/import/main.go(7 hunks)docker-compose.qdrant.yml(1 hunks)qdrant-config.yaml(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cmd/import/main.go (1)
internal/repository/scan_repository_qdrant_impl.go (1)
VectorDim(35-35)
🪛 GitHub Actions: Golang CI Lint
cmd/import/main.go
[error] 113-113: golangci-lint (gocritic): exitAfterDefer - log.Fatalf will exit and defer will not run.
🪛 GitHub Check: build
cmd/import/main.go
[failure] 113-113:
exitAfterDefer: log.Fatalf will exit, and defer func(){...}(...) will not run (gocritic)
🔇 Additional comments (3)
cmd/import/main.go (1)
293-304: Indexing disabled is already enforced via IndexingThreshold=0Good use of IndexingThreshold=0 to defer index building. If M=0 proves problematic at create-time, this setting alone achieves the “fast import” goal.
docker-compose.qdrant.yml (1)
11-14: Confirm whether exposing 6335 is neededPort 6335 is typically the P2P/cluster port. If not running a cluster, it can be omitted.
qdrant-config.yaml (1)
7-51: Configuration structure is correct
performance,wal, andoptimizersare correctly nested understorageper Qdrant’s official schema. No changes required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cmd/import/main.go (3)
176-195: Compile error in worker loop; also fix var naming to satisfy lintUse a classic for with an index; also rename workerId -> workerID.
Apply:
- for workerId := range MaxWorkers { - wg.Add(1) - go func(workerId int) { + for workerID := 0; workerID < MaxWorkers; workerID++ { + wg.Add(1) + go func(workerID int) { defer wg.Done() for file := range filesChan { sectorName := filepath.Base(file) sectorName = strings.TrimSuffix(sectorName, ".csv") - log.Printf("Worker %d: Processing sector %s", workerId, sectorName) + log.Printf("Worker %d: Processing sector %s", workerID, sectorName) - err := importCSVFile(ctx, client, file, sectorName) - if err != nil { - log.Printf("Worker %d: Error importing file %s: %v", workerId, file, err) + if err := importCSVFile(ctx, client, file, sectorName); err != nil { + log.Printf("Worker %d: Error importing file %s: %v", workerID, file, err) errorsChan <- fmt.Errorf("error importing file %s: %w", file, err) } else { - log.Printf("Worker %d: Successfully processed sector %s", workerId, sectorName) + log.Printf("Worker %d: Successfully processed sector %s", workerID, sectorName) } } - }(workerId) + }(workerID) }This addresses both the compile error and revive var-naming warning.
133-137: Avoid log.Fatalf after setting defers; it prevents client.Close from running (pipeline failure)golangci-lint exitAfterDefer is flagging this. Replace Fatalf with logging + return to allow deferred client.Close to execute.
Apply:
- err = client.DeleteCollection(ctx, collectionName) - if err != nil { - log.Fatalf("Error dropping collection %s: %v", collectionName, err) - } + err = client.DeleteCollection(ctx, collectionName) + if err != nil { + log.Printf("Error dropping collection %s: %v", collectionName, err) + return + }- if err != nil { - log.Fatalf("Error reading directory: %v", err) - } + if err != nil { + log.Printf("Error reading directory: %v", err) + return + }- if err != nil { - log.Fatalf("Error creating collection %s: %v", collectionName, err) - } + if err != nil { + log.Printf("Error creating collection %s: %v", collectionName, err) + return + }Alternatively, refactor main into a run() error pattern and exit in main after defers run.
Also applies to: 151-153, 313-315
372-381: Track CSV read errorserrorCount is never incremented; increment it when a non-EOF error occurs to make the later stats accurate.
Apply:
record, err := reader.Read() if err != nil { if err == io.EOF { break } log.Printf("WARNING: Error reading line %d in file %s: %v", lineNumber, filePath, err) + errorCount++ continue }
🧹 Nitpick comments (7)
qdrant-config.yaml (2)
52-65: Consider adding service.indexing_threshold and confirm timeouts
- You might want optimizers.indexing_threshold: 0 in the file to align with your “indexing disabled” import logic (keeps it consistent if collections inherit defaults).
- max_timeout_sec: 120 is fine; ensure the importer also uses request-level context timeouts to respect this.
76-77: Telemetry setting contradicts the commentComment says “set to false to disable”, and you set false (disabled). If you intended to reduce overhead during bulk import, this is fine; otherwise set true.
docker-compose.qdrant.yml (3)
9-14: Fix volumes mismatch; remove redundant expose (6335) unless clustering
- You declare a named volume qdrant_data but mount a bind path. Pick one. Recommended: use the named volume.
- expose duplicates ports already published; 6335 is unused unless cluster mode.
Apply:
- - ./qdrant_data:/qdrant/storage + - qdrant_data:/qdrant/storage - ./qdrant-config.yaml:/qdrant/config/production.yaml:ro - expose: - - 6333 - - 6334 - - 6335
3-3: Pin Qdrant image to a versionAvoid latest to ensure reproducibility and schema stability for your config.
Example: qdrant/qdrant:1.10.0 (verify current).
19-22: Named volume declared but unused with current bind mountIf you keep the bind mount approach, remove the top-level named volume. If you adopt the named volume (recommended), keep this section and adjust the service volumes as suggested above.
cmd/import/main.go (2)
242-254: Add a timeout to the health checkProtects CLI from hanging on network issues.
Apply:
-func verifyQdrantHealth(ctx context.Context, client *qdrant.Client) error { - log.Println("Performing Qdrant health check...") +func verifyQdrantHealth(parent context.Context, client *qdrant.Client) error { + log.Println("Performing Qdrant health check...") + ctx, cancel := context.WithTimeout(parent, 10*time.Second) + defer cancel()
614-621: Consider request-level timeouts for UpsertTo avoid long hangs during import spikes, wrap Upsert in a context with timeout (e.g., 30–60s) and retry with backoff if needed.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
.gitignore(1 hunks)cmd/import/main.go(9 hunks)docker-compose.qdrant.yml(1 hunks)qdrant-config.yaml(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cmd/import/main.go (1)
internal/repository/scan_repository_qdrant_impl.go (1)
VectorDim(35-35)
🪛 GitHub Actions: Golang CI Lint
cmd/import/main.go
[error] 134-134: golangci-lint (gocritic): exitAfterDefer: log.Fatalf will exit, and defer func(){...}(...) will not run.
🪛 GitHub Check: build
cmd/import/main.go
[failure] 177-177:
var-naming: range var workerId should be workerID (revive)
🔇 Additional comments (5)
.gitignore (1)
46-47: LGTMIgnoring papi/ and target is reasonable for local artifacts.
cmd/import/main.go (3)
260-289: Confirm Qdrant semantics for disabling and later enabling HNSW with named vectors
- Setting HNSW M=0 per vector disables index building; later UpdateCollection sets HnswConfig at collection level. Confirm that collection-level HnswConfig overrides named vectors (or update per vector if required).
- IndexingThreshold: 0 in CreateCollection and 100000 in UpdateCollection look consistent with your import strategy.
Also applies to: 297-304
559-587: misc_collection is included in supported collections Verified that entities.GetAllSupportedCollections() returns "misc_collection" and that all supported collections (including misc_collection) are checked/created at startup.
478-490: BinaryQuantization in Qdrant only emulates dot-product/cosine, no native Hamming/L1 support
Qdrant’s BinaryQuantization binarizes floats for fast XOR+popcount dot-product approximations—it doesn’t provide native Hamming or Manhattan distances on dense vectors and no separate binary-vector type with selectable Hamming metric exists. Use dot-product (cosine) for quantized data or compute Hamming externally.Likely an incorrect or invalid review comment.
qdrant-config.yaml (1)
7-51: Review comment incorrect: configuration placement matches Qdrant schema
The wal and optimizers blocks belong under storage (and storage.performance is supported). No changes needed.Likely an incorrect or invalid review comment.
Summary by CodeRabbit