
Address 2.0 performance issue #321

Closed

simsong opened this issue Jan 5, 2022 · 11 comments
Comments

simsong (Owner) commented Jan 5, 2022

BE2.0 is now highly reliable and compiles on AMD. Now it is time to address performance issues.

Something about the handling of data is much slower on 2.0. It may be all of the moderated sbuf access, and if so, we will need a way to disable all of the runtime checks.
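To make the suspicion concrete, here is a minimal sketch of how per-access runtime checks could be compiled out with a build flag. This is illustrative only: `checked_buf` and the `SBUF_NO_CHECKS` macro are assumed names, not bulk_extractor's actual sbuf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical sketch: a buffer whose per-access bounds validation can be
// compiled out, similar in spirit to disabling sbuf's runtime checks.
// SBUF_NO_CHECKS is an assumed flag name, not the project's actual one.
class checked_buf {
    std::vector<uint8_t> data_;
public:
    explicit checked_buf(std::vector<uint8_t> d) : data_(std::move(d)) {}
    uint8_t operator[](size_t i) const {
#ifndef SBUF_NO_CHECKS
        // Every access pays this branch unless the flag disables it.
        if (i >= data_.size()) throw std::out_of_range("checked_buf");
#endif
        return data_[i];
    }
    size_t size() const { return data_.size(); }
};
```

Building with `-DSBUF_NO_CHECKS` would turn `operator[]` into a plain unchecked load, which is the kind of toggle that makes it easy to measure how much the checks cost.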

Just reading the ubnist1 disk image with -x all (no scanners enabled), BE1.6 takes 4.2 seconds on my Mac mini and BE2.0 takes 4.0 seconds.

Running with -E json on 6 cores gives times of BE1.6: 5.5, 5.5, 5.4 seconds and BE2.0: 9.0, 9.0, 9.0 seconds.
Running with -E json -j1 gives BE1.6: 18.9, 18.7 and BE2.0: 42, 42 seconds.

So we have an easy test case with a scanner that's easy to work with. This should be relatively straightforward to debug. @jonstewart if you have any suggestions, please let me know. Otherwise I'll probably follow this approach:

  • Add a DEBUG flag for scan_json to have it just return, to have it compute the histogram with the sbuf machinery and return, and to have it access every byte with the sbuf machinery and return.
  • Create a performance benchmark script that can reliably benchmark two executables and produce summary results.
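The core of such a benchmark script is just timing a workload repeatedly and keeping the best run, which damps system noise when comparing two executables. A minimal sketch (the real script would invoke the two binaries; `best_of` is a hypothetical helper):

```cpp
#include <chrono>
#include <functional>

// Hypothetical benchmarking core: run a workload several times and return
// the fastest wall-clock time in seconds. Comparing best-of-N runs of two
// executables is more stable than comparing single runs.
double best_of(int runs, const std::function<void()>& work) {
    double best = 1e300;
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        if (secs < best) best = secs;
    }
    return best;
}
```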
jonstewart (Collaborator) commented Jan 5, 2022 via email

simsong (Owner, Author) commented Jan 5, 2022

gprof is the #2 approach if simple code inspection and some #ifdef statements fail. The overhead is a factor of 2 and it shows up when running single-threaded, so I suspect it is one thing and easy to find.

I really suspect it is the super-safe sbuf system I created. Too many runtime checks… But now I can work on efficiency, given that I have correctness.

simsong (Owner, Author) commented Jan 5, 2022

Well look at this:

configure: *************************************
configure: *************************************
configure:   PACKAGE_NAME:     BULK_EXTRACTOR
configure:   PACKAGE_VERSION:  2.0.0-beta3
configure:   CC:               gcc
configure:   CXX:              g++ -std=c++17
configure:   CPPFLAGS:         -I/opt/local/include  -I/usr/local/include -DUTC_OFFSET=+0000
configure:   CFLAGS:           -g -g -O3 -fPIC
configure:   CXXFLAGS:         -g  -pthread -Wall -MD -Wpointer-arith -Wshadow -Wwrite-strings -Wcast-align -Wredundant-decls -Wdisabled-optimization -Wfloat-equal -Wmultichar -Wmissing-noreturn -Woverloaded-virtual -Wsign-promo -funit-at-a-time -Weffc++ -fPIC
configure:   LIBS:             -lewf -lstdc++ -lexpat -lz -ldl -ltermcap -lsqlite3 -lexpat
configure:   LDFLAGS:          -L/opt/local/lib  -L/usr/local/lib

Turns out that the rewritten autoconf file was not putting the optimizer flag into CXXFLAGS, so the C++ files were compiled without optimization.

  • CXXFLAGS fixed.

I've now gone through scan_aes and discovered it was scanning to the end of the buffer rather than the end of the page. I changed that and it's going much faster now too.
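The fix amounts to clamping the scanner's loop bound. A sbuf covers the page plus a read-ahead margin; candidate match positions should stop at the page boundary, with the margin only read past for matches that begin inside the page. A sketch (names illustrative, not scan_aes's actual code):

```cpp
#include <cstddef>

// Hypothetical sketch of the scan_aes fix. bufsize is the full sbuf
// (page + margin); pagesize is where this worker's responsibility ends.
size_t scan_limit(size_t bufsize, size_t pagesize) {
    // Bug: the scan loop ran to bufsize, re-scanning the margin that the
    // next page will cover anyway. Fix: clamp the loop to the page.
    return pagesize < bufsize ? pagesize : bufsize;
}
```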

jonstewart (Collaborator) commented Jan 5, 2022 via email

simsong (Owner, Author) commented Jan 6, 2022

Discovery #2: margin size got inadvertently increased from 4MiB to 16MiB:

(base) simsong@nimi debug % ~/gits/bulk_extractor-1.6/src/bulk_extractor -h | grep margin
   -g NN        - specify margin (default 4194304)
(base) simsong@nimi debug %

vs.

~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o    23.51s user 0.44s system 104% cpu 22.889 total
(base) simsong@nimi ~ % ~/gits/bulk_extractor/src/bulk_extractor -h | grep margin
  -g, --marginsize arg          margin size in bytes (default: 16777216)
(base) simsong@nimi ~ %

And, get this --- the flex scanners don't respect the pagesize, and appear to scan to the end of the margin.

  • Modify flex scanners so that if a match starts inside the margin it throws an exception that ends the flex scanner.
  • Update BE2 codebase so that the default margin is 4MiB and not 16MiB.
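The first item can be sketched with a small check a flex rule action would make before recording a feature: if the match begins past the page boundary (i.e. inside the margin), throw to unwind out of yylex() and stop scanning. All names here are illustrative, not the actual flex scanner code:

```cpp
#include <cstddef>
#include <exception>

// Hypothetical sketch of the margin guard for a flex rule action.
// Throwing unwinds out of yylex(), ending the scan of this sbuf.
struct margin_stop : std::exception {};

inline void check_margin(size_t match_start, size_t pagesize) {
    if (match_start >= pagesize) throw margin_stop();
}

// A rule action would call it before recording a feature:
//   check_margin(pos, pagesize);  // aborts lexing inside the margin
```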

jonstewart (Collaborator) commented Jan 6, 2022 via email

simsong (Owner, Author) commented Jan 6, 2022

I just found a way to tell them to stop when the regular expression scanner crosses into the margin. You'll like it.

simsong (Owner, Author) commented Jan 6, 2022

Before implementing new margin hack (and before shortening the margin from 16MiB to 4MiB):

(base) simsong@nimi ~ % time ~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o ~/out/out-be20-xall-$(python -c 'import time;print(int(time.time()))') ~/corp/nps-2009-ubnist1/ubnist1.gen3.raw >/dev/null
~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o   > /dev/null  138.56s user 3.20s system 102% cpu 2:18.54 total

After:

(base) simsong@nimi ~ % time ~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o ~/out/out-be20-xall-$(python -c 'import time;print(int(time.time()))') ~/corp/nps-2009-ubnist1/ubnist1.gen3.raw
~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o    80.65s user 2.69s system 103% cpu 1:20.44 total

User time improved from 138.56s to 80.65s. It's still not as good as BE1.6, probably because of the larger margin size.

Check out 4007d30

simsong (Owner, Author) commented Jan 6, 2022

Now with the margin lowered from 16MiB to 4MiB:

~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o    76.66s user 2.25s system 103% cpu 1:16.38 total

LOOK AT THAT -- it's even faster than version 1.6. About time. That's the exception logic. Now to add it to scan_accts.

simsong (Owner, Author) commented Jan 6, 2022

Ick. Exception causes allocated values not to be freed by the lexer.

https://github.com/simsong/bulk_extractor/runs/4723066087?check_suite_focus=true

Looks like I'll need to just move the pointer to the end.
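The exception-free alternative is to let the lexer terminate normally: when a match begins inside the margin, advance the input position to the end of the buffer so yylex() hits EOF and frees its own state. A sketch of the position arithmetic only (the real integration goes through flex's buffer-management calls; these names are illustrative):

```cpp
#include <cstddef>

// Hypothetical sketch: instead of throwing out of yylex() (which leaks
// flex's allocations), push the input position to the end so the lexer
// sees EOF on the next read and shuts down cleanly.
struct lex_state { size_t pos; size_t end; };

inline void stop_at_margin(lex_state& st, size_t match_start, size_t pagesize) {
    if (match_start >= pagesize) st.pos = st.end;  // lexer terminates normally
}
```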

simsong (Owner, Author) commented Jan 7, 2022

I've run many performance tests now with BE16 vs. BE20beta4 (which includes the flex improvements noted above).

The table below is all done with the domexusers disk. All runs use -j1, which means one thread reading data and one thread processing, which allows very good measurements of time. Each is run with a single scanner. The first, -x all, has no scanners enabled. It shows that the underlying infrastructure is 2x faster on BE20b4. However, it's also clear that some of the scanners are much slower. Really, only three scanners: the net scanner and two of the flex-based scanners (but not the other two?).

scanner   BE16   BE20b4
none       301      141
json       307      274
aes        375      341
gzip       294      276
base64     301      267
zip        298      275
gps        239      239
accts      431     1510
net        461     1030
email      829     1882

With the recursive analysis, the extra time spent in the big scanners (accts, net, email) is multiplied several times over.

So I'm nearly there, and the new infrastructure is amazing, and by throwing a lot of cores at the thing, BE2 is faster than BE1. But once I get the net, email, and accts scanners faster, BE2 will be much faster than BE1. So I'm going to close this thread and open a thread for net, accts, and email.
