RPM backend performance is limited by arrays of hdrNum's #290

Closed
n3npq opened this issue Jul 29, 2017 · 0 comments
n3npq commented Jul 29, 2017

The following callgraphs for BDB/LMDB/NDB all show a common hotspot retrieving arrays of hdrNum's from indices.

The performance problem shows up worst on add/del operations, where a read-modify-write (RMW) loop has to be performed to add/del a hdrNum item to an array. The array is then re-sorted (and perhaps uniquified) by qsort(3) on every pass. Repeatedly re-sorting an almost sorted array is the worst case behavior for the algorithm; a merge sort or even a home rolled insertion loop would be less costly.
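
To make the "home rolled insertion loop" concrete, here is a minimal sketch in C. It assumes hdrNums are held as a sorted, duplicate-free array of uint32_t with spare capacity for one more element; the function names (`hdrnum_lower_bound`, `hdrnum_insert`, `hdrnum_remove`) are hypothetical and are not part of the RPM API. A binary search finds the slot and a single memmove(3) opens or closes the gap, so no qsort(3) pass is needed after each add/del.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: index of the first element in arr[0..n) that is >= key. */
size_t hdrnum_lower_bound(const uint32_t *arr, size_t n, uint32_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/*
 * Insert key into the sorted, duplicate-free array arr[0..*n).
 * Assumption: the caller has already reserved room for one more element.
 * Returns 1 if the key was inserted, 0 if it was already present.
 */
int hdrnum_insert(uint32_t *arr, size_t *n, uint32_t key)
{
    size_t pos;

    assert(arr != NULL && n != NULL);
    pos = hdrnum_lower_bound(arr, *n, key);
    if (pos < *n && arr[pos] == key)
        return 0;                       /* already present: array stays unique */
    /* Shift the tail up by one slot; the array stays sorted, no re-sort needed. */
    memmove(arr + pos + 1, arr + pos, (*n - pos) * sizeof(*arr));
    arr[pos] = key;
    (*n)++;
    return 1;
}

/* Remove key from the sorted array. Returns 1 if removed, 0 if not found. */
int hdrnum_remove(uint32_t *arr, size_t *n, uint32_t key)
{
    size_t pos;

    assert(arr != NULL && n != NULL);
    pos = hdrnum_lower_bound(arr, *n, key);
    if (pos == *n || arr[pos] != key)
        return 0;
    /* Shift the tail down by one slot to close the gap. */
    memmove(arr + pos, arr + pos + 1, (*n - pos - 1) * sizeof(*arr));
    (*n)--;
    return 1;
}
```

Each add/del still costs a memmove of the tail, but that is linear in the worst case with a very small constant, and it never touches the comparison callback machinery of qsort(3).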

Maintaining the hdrNum's endianness is another flaw; exposing the hdrNum's through the RPM API is yet another flaw because the values will change with every --rebuilddb (i.e. the hdrNum's are not persistent).

The fundamental architectural problem that needs solving for better performance is the nesting of per-header and then per-tag operations performed by rpmdbAdd(). Ideally, a batch mode update for each index of all the headers would remove the need to constantly reread/modify/rewrite.
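
A rough sketch of what such a batch mode could look like for a single index, in C. Everything here is an assumption for illustration: `struct idx_pair`, `batch_flush`, the `put_fn` callback and the `count_put` sample callback are hypothetical names, not RPM or backend APIs. The idea is to collect every (key, hdrNum) pair for one index across all headers, sort once, and then issue exactly one write per distinct key instead of one read-modify-write per header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: one (index key, hdrNum) pair collected while walking headers. */
struct idx_pair {
    const char *key;
    uint32_t hdrNum;
};

/* Hypothetical backend write: stores the complete hdrNum array for one key. */
typedef void (*put_fn)(const char *key, const uint32_t *hdrNums, size_t n, void *ctx);

/* Order pairs by key, then by hdrNum, so each key's values form a sorted run. */
static int pair_cmp(const void *a, const void *b)
{
    const struct idx_pair *x = a, *y = b;
    int c = strcmp(x->key, y->key);
    if (c != 0)
        return c;
    return (x->hdrNum > y->hdrNum) - (x->hdrNum < y->hdrNum);
}

/*
 * Sort all pairs once, then call put() once per distinct key with that key's
 * sorted, duplicate-free hdrNum array. Returns the number of put() calls
 * (0 if there is nothing to do or the scratch allocation fails).
 */
size_t batch_flush(struct idx_pair *pairs, size_t n, put_fn put, void *ctx)
{
    size_t writes = 0, i = 0;
    uint32_t *run;

    assert(put != NULL);
    if (n == 0 || (run = malloc(n * sizeof(*run))) == NULL)
        return 0;

    qsort(pairs, n, sizeof(*pairs), pair_cmp);      /* one sort for the whole batch */
    while (i < n) {
        size_t j = i, len = 0;
        /* Gather the run of pairs sharing pairs[i].key, dropping duplicates. */
        while (j < n && strcmp(pairs[j].key, pairs[i].key) == 0) {
            if (len == 0 || run[len - 1] != pairs[j].hdrNum)
                run[len++] = pairs[j].hdrNum;
            j++;
        }
        put(pairs[i].key, run, len, ctx);           /* single write for this key */
        writes++;
        i = j;
    }
    free(run);
    return writes;
}

/* Sample callback for experimentation: only counts keys and values written. */
struct put_stats {
    size_t keys;
    size_t values;
};

void count_put(const char *key, const uint32_t *hdrNums, size_t n, void *ctx)
{
    struct put_stats *st = ctx;
    (void)key;
    (void)hdrNums;
    st->keys++;
    st->values += n;
}
```

With this shape, the number of index writes scales with the number of distinct keys rather than with headers times tags, which is the reread/modify/rewrite cost described above. Whether it fits rpmdbAdd() as it stands is a separate question; the sketch only shows the data flow.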

One approach to removing the overhead associated with the array management that "works" with BerkeleyDB is to tie the secondary index to the primary store using db->associate. Then Berkeley DB can handle the caching/optimizations needed to handle indices transparently to RPM.

(aside)
I don't yet know how to do a db->associate-like optimization for NDB/LMDB. On the todo++ list ...

Using db->associate in Berkeley DB is essentially the same as using a SQL trigger to maintain indices derived from a primary store, a very common abstraction used with RDBMSs.

Here are the callgraphs that show the performance bottleneck for all of BDB/NDB/LMDB:

BDB

[jbj@ji rpm]$ /usr/bin/time sudo ./libtool --mode=execute /home/jbj/bin/cg ./rpmdb --rebuilddb
208.17user 3.31system 3:32.94elapsed 99%CPU (0avgtext+0avgdata 74000maxresident)k
0inputs+492608outputs (0major+37493minor)pagefaults 0swaps

bdb.cga.gz

NDB

/usr/bin/time sudo ./libtool --mode=execute /home/jbj/bin/cg ./rpmdb --rebuilddb --ndb
99.59user 3.67system 8:35.82elapsed 20%CPU (0avgtext+0avgdata 93888maxresident)k
0inputs+3315224outputs (0major+461509minor)pagefaults 0swaps

ndb.cga.gz

LMDB

[jbj@ji rpm]$ /usr/bin/time sudo ./libtool --mode=execute /home/jbj/bin/cg ./rpmdb --rebuilddb --lmdb
113.50user 1.57system 1:55.07elapsed 99%CPU (0avgtext+0avgdata 393692maxresident)k
0inputs+455720outputs (1103major+129040minor)pagefaults 0swaps

lmdb.cga.gz

@n3npq n3npq closed this as completed Aug 21, 2018