Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define an object-oriented scanner interface #28

Open
jonstewart opened this issue Jul 2, 2021 · 9 comments
Open

Define an object-oriented scanner interface #28

jonstewart opened this issue Jul 2, 2021 · 9 comments

Comments

@jonstewart
Copy link

jonstewart commented Jul 2, 2021

The concept of a scanner was built to be simple—just define a single function!

But, that function must adhere to a certain protocol. It must check the phase and then take appropriate action. These phases and actions correspond to lifetime concerns—initialization, config check, doing work, destruction. This is a good indication that there's an object-oriented design waiting to be defined for scanners, where a new scanner would be a class that inherits from a scanner base class. Inheritance from the base class would likely use the CRTP to provide for a common factory/construction method.

A wrinkle of lifetime management in a multithreaded processing application is that the first scanner may be initialized for the purpose of one-time setup and/or configuration/command-line checking, but then a scanner object of each type may be created for each thread in the thread pool. Typically this secondary initialization can be done via a clone() method that takes a "prototype" instance and then simply invokes the copy constructor. In this way a scanner instance can have state which is reused from sbuf to sbuf, generally to avoid reallocation but also perhaps useful when doing some kind of accumulate/reduce operation (like creating histograms)—each thread's instance can create its result of the partition of sbufs it receives and then those partial results can be aggregated at the end of processing, thread-safe and very efficient (presumably the operations would be associative and commutative).

While complicated framework APIs aren't fun, bulk_extractor isn't some terrible Java framework with a bewildering number of classes. A very simple base class API could be defined for scanners. The resulting code for scanners would be then be cleaner, simpler, and more idiomatic.

It may be possible to define such an interface that's then used by the existing function-based API, to avoid an all-or-nothing refactoring.

@simsong
Copy link
Owner

simsong commented Jul 2, 2021

bulk_extractor does have an object-oriented interface. What it doesn't have is object-oriented linkage down to the scanners. However, it has a clearly defined object-based API, in which an object is provided as the sole argument to the scanner function. The scanner function is called with "C" linkage. And the scanner calls object interfaces on objects that are passed in to the function.

The reason that the linkage is "C" and not C++ is two-fold. First and most important, it allows scanners can be loaded dynamically on a variety of platforms. The goal was for bulk_extractor to be run as a drop-in program in a certified environment, but to allow end-users to write their own modules and have them load at run-time.

The second reason is that I tried to do run-time loading of C++ shared libraries and I couldn't get it to work in a way that was reliable.

A third reason is that the current "C" linkage is easier to make work with things like Lambda, easier to call between languages (I was able to call C functions from Python, but not C++ functions, due to name mangling; YMMV), and the like.

However, you can certainly write a class that supports the C linkage but the uses virtual functions to call each of its scanner parts. This is left as an exercise for the reader. I've written a few of these; one of them is used for the flex-based scanners, but they don't replace the actual linkage section.

While working on this C++17 rewrite, I realized that the a scanner linkage prototype could be written with the C++ template system. Then you would instantiate the linkage for each of your named scanners and subclass it. Again, I tried some prototypes, but then gave up on it.

I have a very limited amount of time to work on this project, so I'm putting my efforts into making sure that BE2 supports C++17 and thereby supports modern operating systems. It's been a lot of work. I've done a complete refactoring of the BE13_API system and it is much cleaner now. At some point in the future this might be worth doing.

@jonstewart
Copy link
Author

jonstewart commented Jul 2, 2021

Agreed that C++ linkage across shared libraries is harmful/impossible. I always create C APIs when linking, and just try to make bigger libraries to minimize build products and keep the Interface:Implementation ratio small.

I'm surprised that passing in the scanner_params object works—I don't think it would necessarily work between libraries built by different compilers, or all language runtimes.

Most classes here in be13_api are much improved, but the basic API for writing a new scanner is similar. I think most developers who can write C/C++ would find implementing ~4 C functions (create/clone/do/destroy) easier than managing the phases coming from different entry points. And then if the scanners were all part of the executable build, that could be handled through inheritance from a base class—even easier.

That's my main feedback with the API. I'm still having a hard time wrapping my head around all the sbuf lifetime issues; I'll need to get my hands dirty, and then see if any PRs result.

As an aside with lambdas, a nuance that's easy to forget about—because things still compile even if completely wrong—is needing to use std::ref and std::cref with reference parameters. There's a decent example halfway down this page: https://www.learncpp.com/cpp-tutorial/lambda-captures/. Not sure whether you've encountered that, but forgetting to use std::ref accounts for the majority of bugs I've seen with lambdas.

@simsong
Copy link
Owner

simsong commented Jul 2, 2021 via email

@simsong
Copy link
Owner

simsong commented Feb 11, 2022

Do you think that this is still necessary? It's certainly a hassle to create all of the things needed for a new scanner, but I'm not sure that it's worth doing this.

@jonstewart
Copy link
Author

jonstewart commented Feb 12, 2022 via email

@simsong
Copy link
Owner

simsong commented Feb 12, 2022

I can certainly create a base class from which the scanners are subclassed. My first concern of this, though, is that I don't know how it would impact turning the scanner into a loadable shared library.

I hadn't thought of doing it with templates. I guess it could be done entirely with templates, but it's going to get gross.

@jonstewart
Copy link
Author

jonstewart commented Feb 12, 2022 via email

@simsong
Copy link
Owner

simsong commented Feb 12, 2022

Right. That would be easy to do. But you still need a way to get the scanner registered into either the scanner array or registered at runtime.

@simsong
Copy link
Owner

simsong commented Jan 16, 2024

Is this still an issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants