Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for reading BDB without the library #980

Merged
merged 1 commit into from Jan 13, 2020

Conversation

mlschroe
Copy link
Contributor

This commit implements a slow read-only backend that allows
accessing of BerkeleyDB databases without using the BerkeleyDB
library. The code supports btree version 9 and hash version 8
and 9.

There are two use cases for this:

  1. Conversion of an existing BerkeleyDB to a different
    backend.

  2. Allowing package scriptlets to do database queries while
    in a transaction that replaced rpm with a version that
    no longer links against BerkeleyDB.

Currently prefix searching is not supported (but it would be
easy to add).

Copy link
Member

@Conan-Kudo Conan-Kudo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how I feel about this, but...

configure.ac Outdated
@@ -594,6 +594,20 @@ AS_IF([test "$enable_ndb" = yes],[
])
AM_CONDITIONAL([NDB], [test "$enable_ndb" = yes])

#=================
# Process --enable-bdb-ro
AC_ARG_ENABLE([bdb-ro], [AS_HELP_STRING([--enable-ndb-ro (EXPERIMENTAL)],[enable the read-only Berkeley DB code])],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--enable-ndb-ro -> --enable-bdb-ro

Freudian slip? ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, copy'n'pasted from the ndb lines.

@mlschroe
Copy link
Contributor Author

Updated the code. It's now much faster and also implements prefix search.

@Conan-Kudo
Copy link
Member

@mlschroe I'm going to guess this probably doesn't work with a bdb6-based rpmdb (e.g. OpenMandriva's rpmdb)?

@mlschroe
Copy link
Contributor Author

Db6 just added support for large data blobs and a heap database type, both are not used by rpm. I tested it with a cooker container image, and the code could read the database without problems.

@mlschroe
Copy link
Contributor Author

Added a check for the hash/btree version.

Copy link
Contributor

@ignatenkobrain ignatenkobrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 5 files changed, 837 insertions(+)

:/

@Conan-Kudo
Copy link
Member

@mlschroe This should probably conflict with the option to enable the normal bdb backend. I see no reason for both to be enabled at the same time.

@pmatilai
Copy link
Member

pmatilai commented Jan 2, 2020

Oh my.

I'm not exactly overjoyed about the idea of having custom BDB reader code in rpm, but given the alternatives, it actually looks almost pretty 😁
Oh and our database team will love you forever for this.

@pmatilai
Copy link
Member

pmatilai commented Jan 2, 2020

 5 files changed, 837 insertions(+)

:/

To put it into perspective:

$ find db-5.3.28 -name "*.[ch]" |wc -l
1037

In other words, this is fewer lines than the BDB has source files.

@Conan-Kudo
Copy link
Member

In other words, this is fewer lines than the BDB has source files.

This is a wonderful and equally terrifying statistic.

@pmatilai
Copy link
Member

pmatilai commented Jan 2, 2020

It really is 😂

Actually the above stats are a bit off, because the Fedora sources include db-1.85 compat. But the point does hold even with a pristine BDB tarball: it has 894 *.[ch] source files. And that's missing quite a bit of other stuff, the tarball has 9241 files in total (including docs and all).

@Conan-Kudo
Copy link
Member

@pmatilai I'm pretty sure I'm going to want this for transitioning OpenMandriva away from BDB. We're using db6 (even though I didn't want to...), and with the latest versions of DNF okay with non-BDB, I can finally start considering it...

@Conan-Kudo
Copy link
Member

Conan-Kudo commented Jan 2, 2020

@mlschroe The code works surprisingly well for me (which is terrifying and awesome in itself), but I think I'd be more comfortable with this if it conflicted with the regular bdb backend option.

@pmatilai
Copy link
Member

pmatilai commented Jan 2, 2020

Greetings from our DB people: "truly like X-mas gift" 🎅

One possibility to handle the "conflict" might be making it an argument to --enable-bdb (eg --enable-bdb=readonly), which then skips the other variants. In that case it could technically be called "bdb" and avoid all the "configured to blabla, using blabla" warnings from backend detection.
OTOH for various purposes (testing etc) its useful to be able to support both from one rpm version.

@Conan-Kudo
Copy link
Member

One possibility to handle the "conflict" might be making it an argument to --enable-bdb (eg --enable-bdb=readonly), which then skips the other variants. In that case it could technically be called "bdb" and avoid all the "configured to blabla, using blabla" warnings from backend detection.

I think this makes sense. If @mlschroe does this, then I'm happy to approve this too.

@pmatilai
Copy link
Member

pmatilai commented Jan 7, 2020

Well, I only said "one possibility", doesn't necessarily mean I think it's the right one. Technically there's no reason to limit it that way, and like I said, especially for testing it's highly useful to be able to build both the "real" and the readonly backends in.

if (meta[4] < 9 || meta[4] > 10) {
fprintf(stderr, "unsupported btree version %d\n", meta[4]);
bdb_close(db);
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these early returns should be NULLs, not zeros as the function returns a pointer.
I'd also prefer the using the "goto err" idiom to minimize the number of exit points from the function.

if ((rdb->db_mode & O_ACCMODE) != O_RDONLY)
return EPERM;
if ((dbi = dbiNew(rdb, rpmtag)) == NULL)
return 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dbiNew() can't actually fail, so this is superfluous.

free(path);
dbi->dbi_flags |= DBI_RDONLY;
if (dbip)
*dbip = dbi;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would leak the dbi allocation if somebody actually called it with dbip as NULL. I don't think anything does, but the code checks for that case so it'll likely trip up false positives from static analyzers if nothing else.

Here too, as allocations are involved, I'd much prefer single point of cleanup + exit idiom that rpm generally uses. Not a showstopper though.

static int bdbro_Verify(dbiIndex dbi, unsigned int flags)
{
return 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is beyond the scope of this patch, but just an idea: we could have a generic db verify operation that does something in a backend independent way in the case the backend doesn't implement a specific verify operation on its own. "Something" could be simply open, read through all records and close.

AC_DEFINE(WITH_BDB_RO, 1, [Build with read-only Berkeley DB])
])
AM_CONDITIONAL([BDB_RO], [test "$enable_bdb_ro" = yes])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please enable this in CI (ie ci/Dockerfile)

On a related note to self/whoever gets to it first: for whatever reason ndb is not enabled there either, and the others are only implicitly enabled.

@pmatilai
Copy link
Member

pmatilai commented Jan 8, 2020

I think I'd be more comfortable with this if it conflicted with the regular bdb backend option.

Now that I looked atthis a bit closer, conflicting would be both unnecessary and plain draconian. If both are enabled in configure, real BDB always wins on autodetection but either can be manually configured. Which is just the way it should be, but it wouldn't hurt to mention the expected behavior in the commit message.

pmatilai added a commit to pmatilai/rpm that referenced this pull request Jan 8, 2020
BDBD, LMDB and SQLite were already implicitly enabled via build
dependencies, but NDB build has not been enabled at all.
Came up when discussing read-only BDB in rpm-software-management#980.
pmatilai added a commit that referenced this pull request Jan 8, 2020
BDBD, LMDB and SQLite were already implicitly enabled via build
dependencies, but NDB build has not been enabled at all.
Came up when discussing read-only BDB in #980.
This commit implements a read-only backend that allows accessing
of BerkeleyDB databases without using the BerkeleyDB library.
The code supports btree version 9-10 and hash version 8-10.

There are two use cases for this:

1) Conversion of an existing BerkeleyDB to a different
   backend.

2) Allowing package scriptlets to do database queries while
   in a transaction that replaced rpm with a version that
   no longer links against BerkeleyDB.

If both BerkeleyDB and the read-only backend are enabled, rpm will
default to BerkeleyDB.
@mlschroe
Copy link
Contributor Author

Updated the pull request:

  • about the dbiNew() call that cannot fail: this is copied from the other backends. I currently left it untouched so that a different commit can clean up all the backends.
  • about using NULL: I think NULL is some leftover from the old K&R days and should be avoided. Saying to use NULL as return value is like saying you need to use return 0ULL in functions that return unsigned long longs. Anyway, I've changed the code so that NULL is used.
  • about that goto exit idiom: I don't like it so I only use it if it really reduces the lines of code.
  • about the dbi allocation leak: yes, that was a bug. Fixed.
  • the ro backend is now also enabled in the ci.

@pmatilai
Copy link
Member

Thanks.

As for the stylistic issues:

0 long long is still an integer value, a NULL pointer is a rather special entity even if the technical value happens to be zero as well. If nothing else, it's a powerful clue to the user that it's supposed to be a pointer and not an integer zero, whereas '0' is ambiguous. When looking at code with tired eyes, I take every crutch I can get :)

As for goto exit, it might not always save lines initially, the main feature is avoiding redundancy and leaks in the long run. In a function with many return points, if a new feature or such needs to use malloc, you'll end up adding umptheen free()'s to unrelated error paths instead of just doing it once at the end. Not to preach or anything, it's just a lesson I got hammered into my head in the early years of cleaning up memleaks and the like in rpm.

@pmatilai pmatilai merged commit 7cc9eb8 into rpm-software-management:master Jan 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants