Grep large folders for words using bloom filters
C Shell
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore
CALCULATIONS
FILE_FORMAT
LICENSE
README.md
bloomgrep.c
common.c
common.h
compile.sh
makebloom.c
wordgrep.sh

README.md

bloom-wordgrep

Grep large folders for words using bloom filters. This speeds up the search massively because the bloomfilter can, even just given one word, reduce the number of files that need to be searched by around 97-99%.

In this context, "word" means something that matches [a-zA-Z_] but is not surrounded by one of those characters (see make_bloom_from_file in common.c if this sounds too complicated.)

Compile with compile.sh, then link/copy the binaries and wordgrep.sh into your path.

Basic usage:

cd <some project folder>

# create the index, but skip all folders named "test" and "toolchain"
makebloom . test toolchain

# search for files that might contain the words "alert" and
# "error" - will print false positives!
bloomgrep . alert error | xargs grep "$1"

More practical:

# show all lines in the whole source tree that match "alert(.*error"
bloomgrep . alert error | xargs grep -w 'alert(.*error'

# simple wrapper: find all occurences of the word `indexArrayArgument` in the source
wordgrep.sh indexArrayArgument

This is how I use this for the Chromium source tree:

$ ls
makebloom.sh  src  wordgrep.sh

$ cat makebloom.sh
#!/bin/sh
makebloom . .git out test LayoutTests ManualTests toolchain hunspell_dictionaries theme test_data PerformanceTests

$ wordgrep.sh indexArrayArgument
stats: 93820 files searched, 2662 hits (2.837348%)
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:    const int indexArrayArgument = 1;
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:    if (V8Float32Array::hasInstance(info[indexArrayArgument], info.GetIsolate())) {
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:        Float32Array* array = V8Float32Array::toNative(info[indexArrayArgument]->ToObject());
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:    if (info[indexArrayArgument].IsEmpty() || !info[indexArrayArgument]->IsArray()) {
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:        exceptionState.throwTypeError(ExceptionMessages::argumentNullOrIncorrectType(indexArrayArgument + 1, "Array"));
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:    const int indexArrayArgumentIndex = 1;
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:    if (V8Int32Array::hasInstance(info[indexArrayArgumentIndex], info.GetIsolate())) {
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:        Int32Array* array = V8Int32Array::toNative(info[indexArrayArgumentIndex]->ToObject());
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:    if (info[indexArrayArgumentIndex].IsEmpty() || !info[indexArrayArgumentIndex]->IsArray()) {
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:        exceptionState.throwTypeError(ExceptionMessages::argumentNullOrIncorrectType(indexArrayArgumentIndex + 1, "Array"));
./src/third_party/WebKit/Source/bindings/v8/custom/V8WebGLRenderingContextCustom.cpp:    v8::Handle<v8::Array> array = v8::Local<v8::Array>::Cast(info[indexArrayArgumentIndex]);

Note all the exclusions - if you e.g. don't exclude .git, you'll experience a significant slowdown because all the git packfiles will trigger false positive matches because they're so big - this is caused by the bloomfilters having a fixed per-file length.

All the following timings were measured with warm caches.

With this, grepping the whole Chromium source tree just takes 4 seconds on my PC:

$ time wordgrep.sh indexArrayArgument
stats: 93820 files searched, 2662 hits (2.837348%)
[...]
real    0m3.516s
user    0m0.292s
sys     0m0.406s

Compare that to a traditional recursive grep:

$ time grep -RF --exclude-dir .git --exclude-dir out --exclude-dir test --exclude-dir LayoutTests --exclude-dir ManualTests --exclude-dir toolchain --exclude-dir hunspell_dictionaries --exclude-dir theme --exc$

real    1m28.150s
user    0m1.870s
sys     0m10.897s

But how long does it take to initially prepare the bloom filters? To "generate the index"? Just a bit more than grepping through the code:

$ time ./makebloom.sh >/dev/null 2>&1

real    2m8.220s
user    0m39.973s
sys     0m12.849s