Summary
engine.search() blocks indefinitely on repositories with many files (~200K+) because the corpus file walker uses walkdir::WalkDir instead of the ignore crate's WalkBuilder, which means .gitignore rules are not respected during file discovery.
Environment
- sift: current main branch (git dep)
- OS: Linux (Arch)
- Rust: stable
Reproduction
use sift::{SearchInput, SearchOptions, Sift};

fn main() {
    // Point at any large repo with submodules/node_modules,
    // e.g. a monorepo with ~200K files (excluding .git/).
    let engine = Sift::builder().build();
    let options = SearchOptions::default()
        .with_limit(10)
        .with_strategy("lexical".to_string());
    let input = SearchInput::new("/path/to/large/repo", "main").with_options(options);
    // This blocks indefinitely
    let response = engine.search(input);
}
Test repo: a workspace with git submodules containing node_modules/, build outputs, etc. Total file count: ~1.87M files (excluding .git/), ~200K excluding node_modules/ and target/.
Observed: search() did not return within 40 seconds, at which point our external timeout wrapper gave up. We did not wait longer, so we cannot say whether it would eventually finish.
Expected: search completes in seconds by respecting .gitignore to skip node_modules/, build artifacts, etc.
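For anyone who wants to check how their own repo compares with the counts above, a std-only counter along these lines may help. This is a rough sketch: `count_files` and its skip list are illustrative names, not part of sift, and the function prunes directories by name only rather than parsing any ignore file.

```rust
use std::fs;
use std::path::Path;

/// Counts regular files under `root`, pruning any directory whose name
/// appears in `skip`. Symlinks are not followed, so link cycles cannot
/// cause an endless walk.
fn count_files(root: &Path, skip: &[&str]) -> u64 {
    let mut count = 0;
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        // Unreadable directories are skipped rather than treated as errors.
        let Ok(entries) = fs::read_dir(&dir) else { continue };
        for entry in entries.flatten() {
            let Ok(file_type) = entry.file_type() else { continue };
            if file_type.is_dir() {
                let name = entry.file_name();
                let skipped = name
                    .to_str()
                    .map_or(false, |n| skip.iter().any(|s| *s == n));
                if !skipped {
                    stack.push(entry.path());
                }
            } else if file_type.is_file() {
                count += 1;
            }
        }
    }
    count
}
```

Calling it once with `&[".git"]` and once with `&[".git", "node_modules", "target"]` should give figures comparable to the two totals quoted for the test repo.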
Root cause
In src/search/corpus.rs:283, collect_file_paths uses walkdir::WalkDir:
fn collect_file_paths(root: &Path, ignore: Option<&Ignore>) -> Vec<PathBuf> {
    // ...
    for entry in WalkDir::new(root).sort_by_file_name().into_iter().flatten() {
        // ...
    }
}
This walks every file in the directory tree. The Ignore struct (from src/config.rs) only checks against .siftignore files and two hardcoded exclusions (target/**, .git/**). It does not read or respect .gitignore files.
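The ignore crate (already a dependency) provides WalkBuilder, which natively respects .gitignore, global gitignore, and .ignore files. Replacing WalkDir with WalkBuilder inside collect_file_paths would fix this without changing the public API surface. It is not a literal one-token swap, though: WalkBuilder needs a .build() call to produce the iterator, its sort_by_file_name takes a comparator closure, and by default it also skips hidden files and only applies .gitignore rules inside a git repository.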
Workaround
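We run engine.search() on a separate thread and wait for the result with Receiver::recv_timeout (30-second default) to prevent our CLI from hanging:
use std::{sync::mpsc, thread, time::Duration};
// Assumes anyhow's `bail!` and `Result`; adjust for your error type.

fn run_search_with_timeout(engine: Sift, input: SearchInput, timeout_secs: u64) -> Result<SearchResponse> {
    // A timeout of 0 disables the wrapper and searches on the calling thread.
    if timeout_secs == 0 { return engine.search(input); }
    let (tx, rx) = mpsc::channel();
    // The worker is detached: after a timeout it keeps running until search() returns.
    thread::spawn(move || { let _ = tx.send(engine.search(input)); });
    match rx.recv_timeout(Duration::from_secs(timeout_secs)) {
        Ok(result) => result,
        Err(mpsc::RecvTimeoutError::Timeout) => bail!("search timed out after {}s", timeout_secs),
        Err(mpsc::RecvTimeoutError::Disconnected) => bail!("search thread exited without a result"),
    }
}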
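For others hitting the same hang, the same pattern can be written without any sift or error-crate types. The sketch below uses only std; `run_with_timeout` and `TimeoutError` are hypothetical names made up for this example. It reports a timeout separately from a worker that exited without sending a result (for example, because it panicked).

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum TimeoutError {
    /// The deadline passed before the worker produced a value.
    TimedOut,
    /// The worker exited (e.g. panicked) without sending a value.
    WorkerDied,
}

/// Runs `work` on a detached thread and waits at most `timeout` for its result.
/// On timeout the thread is NOT cancelled; it keeps running in the background.
fn run_with_timeout<T, F>(work: F, timeout: Duration) -> Result<T, TimeoutError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out; ignore that.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(RecvTimeoutError::Timeout) => Err(TimeoutError::TimedOut),
        Err(RecvTimeoutError::Disconnected) => Err(TimeoutError::WorkerDied),
    }
}
```

With sift this would be called roughly as `run_with_timeout(move || engine.search(input), Duration::from_secs(30))`. Note that this only protects the caller: the stuck walk still consumes CPU and I/O until the process exits, so it is a stopgap rather than a fix.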
Suggested fix
Replace WalkDir with the ignore crate's WalkBuilder in collect_file_paths. This would automatically respect .gitignore at every level of the tree, dramatically reducing the file set for typical repositories. The Ignore struct could then layer .siftignore on top of the git-native ignore rules.