
How to speedup virtual tables? #465

Closed
fzakaria opened this issue Jul 30, 2023 · 29 comments


fzakaria commented Jul 30, 2023

Hi Roger!

I'm not sure if this is a function of SQLite or the C wrapper, but I'm finding the virtual table wrapper very slow.
I'm trying to run it over a table that has ~700k rows and it's taking a very long time.

I've boiled down the sample to:

import apsw
import apsw.ext

# 3 lines of code ...
def table_range(start=1, stop=700_000, step=1):
    for i in range(start, stop + 1, step):
        yield (i,)


# set column names
table_range.columns = ("value",)
# set how to access what table_range returns
table_range.column_access = apsw.ext.VTColumnAccess.By_Index

connection = apsw.Connection(":memory:")
# register it
apsw.ext.make_virtual_module(connection, "range", table_range)

# see it work.  we can provide both positional and keyword
# arguments
query = "SELECT * FROM range(90) WHERE step=2"
for row in connection.execute(query):
    # don't pay any cost for printing
    pass
print("done")

I'm not very proficient at profiling C extensions and Python together -- cProfile doesn't show anything helpful here.

@fzakaria

Using the C API I know I can insert ~700k rows incredibly fast (a few seconds).

@fzakaria

Hmm -- never mind, that actually runs pretty fast -- I can't figure out why my generator is so slow.


fzakaria commented Jul 30, 2023

I know it's coming from https://github.com/fzakaria/sqlelf/blob/main/sqlelf/elf/instruction.py
It uses the Capstone library, but I changed the code to generate all the instructions ahead of time and the 700k instructions only take a few seconds to turn into a list()

diff --git a/sqlelf/elf/instruction.py b/sqlelf/elf/instruction.py
index 2a0bed2..cafdc00 100644
--- a/sqlelf/elf/instruction.py
+++ b/sqlelf/elf/instruction.py
@@ -20,9 +20,10 @@ def elf_instructions(binaries: list[lief.Binary]):
                     # complicates the internal operations and slows down
                     # the engine a bit, so only do that if needed.
                     md.detail = False
-                    for (address, size, mnemonic, op_str) in md.disasm_lite(
-                        data, section.virtual_address
-                    ):
+                    print("Starting")
+                    instructions = list(md.disasm_lite(data, section.virtual_address))
+                    print("Ending: " + str(len(instructions)))
+                    for (address, size, mnemonic, op_str) in instructions:
                         yield {
                             "path": binary.name,
                             "section": section.name,

I just added a silly log statement and can see it finishes converting the instructions to a list pretty fast, then I am just 🕐 waiting for the query to finish.

sqlelf /usr/bin/python3 --sql "select COUNT(*) from elf_instructions where section = '.text'"

Starting
Ending: 7
Starting
Ending: 7
Starting
Ending: 1485
Starting
Ending: 693445


fzakaria commented Jul 30, 2023

Here is the output of cProfile (viewed with snakeviz) for the command:

> timeout -s INT 3m python -m cProfile -o sqlelf.prof \ 
$(which sqlelf) /usr/bin/python3 --sql "select COUNT(*) from elf_instructions where section = '.text'"

[snakeviz profile screenshot]

I'll close this issue pretty soon -- unless you have any feedback; sorry for the noise here.
I'll need to figure out what is going on :S

@rogerbinns

The visualization does show the time is spent in "your" code.

Virtual tables will be slower in the big picture though because they do more work. APSW/Python makes them slower still because the GIL has to be acquired and released, and Python evaluation is slower than C.

My advice is that if you are going to have a large amount of data and you are going to return all of it via virtual tables, then you'll find it quicker to load the data into a SQLite temporary table and just use that. SQLite's btrees and C code will be the most performant.

Virtual tables are great when they can avoid returning information, hence saving work.
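
For illustration, a minimal sketch of that suggestion using the table_range generator from the first comment (the temp table name here is made up):

import apsw

connection = apsw.Connection(":memory:")

def table_range(start=1, stop=700_000, step=1):
    for i in range(start, stop + 1, step):
        yield (i,)

# materialise the generator into a temp table once ...
connection.execute("CREATE TEMP TABLE range_cache(value INTEGER)")
with connection:  # one transaction for the whole insert
    connection.cursor().executemany(
        "INSERT INTO range_cache VALUES(?)", table_range()
    )

# ... then let SQLite's btrees and C code do the work
print(connection.execute("SELECT COUNT(*) FROM range_cache").fetchall())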

@fzakaria

Thank you for taking the time to answer.
I just wasn't sure if I was screwing up my profiling due to the C extension stuff.
It turns out the SQLite example above finishes in under a second -- so my use of the other library (Capstone) seems to be paying a heavy tax; it uses ctypes heavily.

Anyways -- that's on me to investigate; thanks again :)

@fzakaria

profiler.tar.gz

I've attached a profile with py-spy if anyone is interested.
You can load it with https://www.speedscope.app/ -- it includes C-extensions.

A lot of time is just spent in "Column" for apsw.
I don't think there is any more optimization to be had here -- too bad.

[speedscope profile screenshot]
(left-heavy view groups similar spans together to show where total time is being spent)

@rogerbinns

It doesn't look like there is anything I could do in APSW to make it faster. SQLite only asks for a single value at a time, so 10,000 rows of 10 columns means 10,000 calls to next and 100,000 calls to column. You really want to avoid returning large amounts of data because of the Python overhead.

The APSW Python code you are using could be made slightly faster by eliminating an if statement and replacing it with specialised methods, but I don't think it will make a meaningful difference.

It does look like the code you execute in response to next does a lot of work. I'm guessing it also releases the GIL. Perhaps what you could do is run that in a different thread and push the results to a queue, then in the next method just pull from that queue. It is hard to tell whether that would make a difference given the added overhead.
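
A rough sketch of that idea -- a hypothetical wrapper, not anything that exists in APSW -- which drains a slow generator on a worker thread while the virtual table pulls completed rows from a queue:

import queue
import threading

def threaded(gen, maxsize=1000):
    # Run `gen` on a worker thread and yield its rows from a bounded queue,
    # so the expensive work (which may release the GIL) overlaps with SQLite
    # consuming the rows.
    q = queue.Queue(maxsize=maxsize)
    done = object()  # sentinel marking the end of the generator

    def worker():
        for row in gen:
            q.put(row)
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while (row := q.get()) is not done:
        yield row

# toy usage: wrap the existing generator before handing it to the virtual table
def slow_rows():
    for i in range(1_000_000):
        yield (i, str(i))

for row in threaded(slow_rows()):
    pass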

I did realise that the code that does conversion between Python and SQLite C types isn't covered in speedtest, so it should be: #466


fzakaria commented Aug 1, 2023

I even batched all the work ahead of time, but the generator itself spends a lot of time just returning the results.
I may need to rewrite this in C at some point --

In any case it's "usable" as is right now with a ~5-10 second wait time.
(Would be cool if I could show some sort of progress meter.)

It's been a good exercise to explore the space, which is what Python is perfect for right now.
I'll leave it as is until the inevitable "rewrite it in Rust" issue pops up 🗡️
TY

@rogerbinns

I did my own quick virtual table benchmark with a million rows of 10 columns. Exercising it was SELECT * FROM vtable which meant SQLite had to retrieve every column of every row. The data is identical in each test, and calculated up front.
I got these times in seconds:

index         5.607
dict          6.400
dict_create   6.685

index was the data as tuples. dict was the same data as dicts. That shows a ~15% performance hit for accessing by dict versus tuple. For dict_create I created the dict in each yield, which is another ~5% performance hit. dict / index have an inner loop of yield row, whereas dict_create is yield {"col1": 1234, "col2": "abc", ...}.
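
A rough reconstruction of the three variants (not the actual benchmark code; column names and data are made up):

import apsw.ext

COLUMNS = tuple(f"col{i}" for i in range(10))
DATA = [tuple(range(n, n + 10)) for n in range(1_000_000)]  # calculated up front

def by_index():
    # "index": each row is a tuple, read positionally
    yield from DATA

def by_dict():
    # "dict": the same rows pre-built as dicts, read by column name
    rows = [dict(zip(COLUMNS, row)) for row in DATA]
    yield from rows

def by_dict_create():
    # "dict_create": the dict is built inside each yield
    for row in DATA:
        yield dict(zip(COLUMNS, row))

by_index.columns = by_dict.columns = by_dict_create.columns = COLUMNS
by_index.column_access = apsw.ext.VTColumnAccess.By_Index
by_dict.column_access = by_dict_create.column_access = apsw.ext.VTColumnAccess.By_Name

# each generator is registered with apsw.ext.make_virtual_module and timed
# with SELECT * FROM <module_name>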

Something else you are likely affected by is that you have zillions of duplicate values. It is something I've experienced before when doing analytics processing. eg every string PUSH rbp is a unique instance, so if that was disassembled 100,000 times then the same number of memory string copies exist. The same applies to repeated integers and other values. This causes the program memory footprint to be way larger and hence more memory thrashing.

Python does intern single-character strings and small integers (-5 through 256). What I've done before is make a dict to store all values and get a singleton for them, something like:

_uniques = {}
def unique_value(v):
    try:
        return _uniques[v]
    except KeyError:
        _uniques[v] = v
        return v

I did look at Capstone source to see if they call the Python intern or the C level intern but didn't find either.

Perhaps tweaks like the above will delay the rewrites by a few months :)

@rogerbinns

Looks like I do need to optimise the Column method at least. Its body is an if over the access types:

                if self.access == VTColumnAccess.By_Index:
                    v = self.current_row[which]
                elif self.access == VTColumnAccess.By_Name:
                    v = self.current_row[self.columns[which]]
                elif self.access == VTColumnAccess.By_Attr:
                    v = getattr(self.current_row, self.columns[which])

I reordered the if statements to be name, attr, index and then reran my benchmark:

index        6.878
dict         5.676
dict_create  5.907

That is a ~20% difference by index being first versus third in the if chain ordering. I then deleted the test for repr_invalid (False by default) and got these times:

index        6.771
dict         5.592
dict_create  5.785

Another ~2% difference. So ballpark I could make things ~25% faster by using specialised versions of the methods. Re-opening to do just that.
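
A sketch of what "specialised versions" means here (illustrative only, not the actual APSW change):

from apsw.ext import VTColumnAccess

# generic version: every Column call pays for the if/elif chain
def column_generic(self, which):
    if self.access == VTColumnAccess.By_Index:
        return self.current_row[which]
    elif self.access == VTColumnAccess.By_Name:
        return self.current_row[self.columns[which]]
    return getattr(self.current_row, self.columns[which])

# specialised versions: no branching at all
def column_by_index(self, which):
    return self.current_row[which]

def column_by_name(self, which):
    return self.current_row[self.columns[which]]

def column_by_attr(self, which):
    return getattr(self.current_row, self.columns[which])

SPECIALISED = {
    VTColumnAccess.By_Index: column_by_index,
    VTColumnAccess.By_Name: column_by_name,
    VTColumnAccess.By_Attr: column_by_attr,
}

# when a cursor is created, the right method is picked once, e.g.
#     self.Column = SPECIALISED[self.access].__get__(self)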

rogerbinns reopened this Aug 2, 2023

fzakaria commented Aug 2, 2023

@rogerbinns cool insights!
Can you explain unique_values a bit more?

Should I be passing unique sentinels back to apsw?
I'm getting distinct strings back from Capstone for each individual instruction, but your guess is that converting them to unique objects on the way out of the generator may help?
(I could also intern the column names?)

@rogerbinns

unique_values is only relevant if you are storing all the items, such as in #465 (comment) and end up with many items that have the same values. It is of no use if values are constructed, used, and then discarded immediately in the code.

I've attached a demo counter.zip that shows the approach in use. A list of 100,000 identical dictionaries is created three ways. The first just duplicates the same dictionary 100,000 times, the second creates each dictionary on the fly, and the third runs each key and value through unique_value with on-the-fly creation.

As expected the first results in 4 distinct objects, the second in 100,003 distinct objects, and the third in 4.

This approach is useful if you end up with lots of items in memory simultaneously that have the same value, but were separately allocated. I suspect that isn't something you actually want to do.

You should probably also be using tuples for your data instead of dictionaries, because dictionaries are a more complicated, higher-overhead data structure, and you aren't actually using the extra functionality they provide.
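
For the generator in the earlier diff, the tuple version would look roughly like this (a sketch with a simplified signature; the surrounding setup is elided):

import apsw.ext

def elf_instructions_tuples(md, data, section, binary_name, section_name):
    # yield plain tuples instead of dicts; the order must match columns below
    for (address, size, mnemonic, op_str) in md.disasm_lite(
        data, section.virtual_address
    ):
        yield (binary_name, section_name, mnemonic, address, op_str)

elf_instructions_tuples.columns = ("path", "section", "mnemonic", "address", "operands")
elf_instructions_tuples.column_access = apsw.ext.VTColumnAccess.By_Index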

@fzakaria

fzakaria commented Aug 2, 2023

The difference between (1) and (2) is subtle and I'm surprised they give a different number of objects:


def make_items1():
    return [{"columnA": "a string", "columnB": 1234}] * COUNT

def make_items2():
    return list({"column" + "A": "a" + " string", "column" + "B": i + 1234 - i} for i in range(COUNT))

In my example I have:

  for (address, size, mnemonic, op_str) in md.disasm_lite(
      data, section.virtual_address
  ):
      yield {
          "path": binary_name,
          "section": section_name,
          "mnemonic": mnemonic,
          "address": address,
          "operands": op_str,
      }

mnemonic, address, and op_str were all allocated by Capstone and likely have low cardinality.
I can wrap those values in unique_value before returning them to apsw, but the objects themselves were already paid for?

@rogerbinns

For (1) they are the same objects in the Python bytecode, so each repetition just adds another reference. For (2) new objects are constructed (that is why I deliberately made them expressions, not constants). For strings Python will look in the interned strings, and the numbers -5 through 256 are preallocated. Since the expression results are neither, you get new objects even though they have the same values.

Under the hood in the Python C APIs everything is a PyObject pointer. Everything is done by reference, not by value.

Unless you are keeping all the md.disasm_lite results in memory at the same time, unique_values is not worth it. unique_values is only about saving memory when storing large numbers of the same values, that are otherwise distinct Python objects.
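
A tiny demonstration: values built at runtime compare equal but are separate objects, until they are run through unique_value:

a, b = "push", "rbp"
s1 = a + " " + b           # built at runtime
s2 = a + " " + b           # equal value, separate object
print(s1 == s2, s1 is s2)  # True False

_uniques = {}
def unique_value(v):
    try:
        return _uniques[v]
    except KeyError:
        _uniques[v] = v
        return v

print(unique_value(s1) is unique_value(s2))  # True: only one copy is kept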

@fzakaria

fzakaria commented Aug 3, 2023

Well, I am passing all these objects back to SQLite, which does store them in memory to do the SQL analysis.
I'll just give it a shot later this week --
I'm expecting my 3rd kid to arrive any moment -- due date is August 5th :)

I'm working on a new object file format for Linux as well; I would love to chat about that with you sometime.
I can take that into email though.

@rogerbinns

SQLite has a significantly more compact storage format than Python, although it doesn't do deduplication. (You can do that manually, similar to normalization, but I doubt it is worth it.)

Best wishes on the potential future coder :)

I'd be delighted to discuss object file formats - contact details can be found via my github profile.

rogerbinns added a commit that referenced this issue Aug 8, 2023
This results in 5-10% less CPU consumption. The entrance method already ensures that VTColumnAccess is passed in.

Refs #465
rogerbinns added a commit that referenced this issue Aug 9, 2023
This avoids the if/elif chains. The result is about twice the throughput of the naive version.

Refs #465
@rogerbinns

The Python speedups above result in a doubling of throughput -- the benchmark went from ~950k Column calls per second to ~1.95 million. That was due to dropping if statements and overwriting the Column method at runtime with a specialised version. By_Index is still the fastest, but By_Name and By_Attr aren't that much worse because they are no longer lower down in a chain of if statements.

I've now done profiling on the C code (using callgrind) and there isn't anything more that can be done. It does turn out that SQLite still inserts each row into a database page, so the flow is: Python values get converted to SQLite/C types, they get inserted into a page, then the row is read back out of the page, requiring each SQLite/C value to be converted back to Python and put into a tuple.

It would still be quicker for this much data to use executemany and insert it into a temp table, especially if you are going to do operations on it at the SQL level (eg count of instructions).

@fzakaria

Wow, I should bump apsw in my package now -- I will see a 100x speedup for this part?
Nice!

@rogerbinns

The updated code isn't in a release yet. It will likely be a few weeks or more before I do another release, unless this is urgent :) In the short term you can copy the file into wherever you have apsw and get all the benefits.

If you change none of your code, you'll get up to a doubling of throughput - Amdahl's law applies.

@rogerbinns

The release incorporating this is now out. The good news is that another ~25% improvement is coming in #477

@fzakaria

Sweet! I have a TODO to revisit this and bump my revision post release.
I can try to give back some numbers afterwards.


fzakaria commented Aug 31, 2023

I just upgraded and I am seeing solid speedups. It used to be in the 5-7s range.

❯ time sqlelf /usr/bin/python3 --sql "select COUNT(*) from elf_instructions where section = '.text'"
┌──────────┐
│ COUNT(*) │
│ 693445   │
└──────────┘
sqlelf /usr/bin/python3 --sql   2.32s user 0.16s system 99% cpu 2.479 total

I also tried By_Index and saw an additional 10% speedup, getting it down to 2.0Xs, but I'm not sure if I'll commit that quite yet.

I noticed the output table has a lot of fancy terminal coloring now.

[screenshot of the coloured shell output]

@fzakaria

Is there an easy way to have virtual tables store their data in a temporary table?
I just tried the following, which isn't even completing... but it's only ~3000 rows?

SELECT *
FROM ELF_SYMBOLS caller, ELF_SYMBOLS callee
WHERE
caller.name = callee.name AND
caller.path != callee.path

sqlite> select path, COUNT(*) from ELF_SYMBOLS GROUP BY path;
┌──────────────────────────────────────────┬──────────┐
│                   path                   │ COUNT(*) │
│ /lib/x86_64-linux-gnu/libruby-3.1.so.3.1 │ 2124     │
│ /lib/x86_64-linux-gnu/libz.so.1          │ 125      │
│ /usr/bin/ruby                            │ 20       │
└──────────────────────────────────────────┴──────────┘

@rogerbinns

The terminal colours have always been there, based on the type of the value. The line drawing was added in 3.42.0.0 (May 2023).

You have to do the temporary table stuff yourself: CREATE TEMP TABLE cache_elf_instructions AS SELECT * FROM elf_instructions, and then reference cache_elf_instructions as the table name. I can't think of any practical way of doing this automagically, because the SQLite code needs to reference the temp table directly, with no APSW code as an intermediary.

#477 being completed should take about another second off your runtime above. You can try it out now by checking out the vectorcall branch of APSW git.

For your query above, prepend EXPLAIN QUERY PLAN to see what SQLite is doing. It looks O(N^2) to me, as there is no index.
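
For example, something along these lines against the same connection (or just type the EXPLAIN QUERY PLAN query into the shell):

# prepend EXPLAIN QUERY PLAN to see whether SQLite scans, uses an index,
# or has to build an automatic index for the join
for row in connection.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM elf_symbols caller, elf_symbols callee
    WHERE caller.name = callee.name AND caller.path != callee.path
"""):
    print(row)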


fzakaria commented Aug 31, 2023

Another idea was creating a new :memory: database, attaching the other database, and inserting the table into it.

Edit: or apsw could automatically create a TEMP table wrapper for each generator?

@rogerbinns

TEMP tables have the advantage of being automatically garbage collected when the main db closes. A pragma controls where they are stored, so there is no need for :memory:.

Making a temp table is as simple as CREATE TEMP TABLE ... AS SELECT ..., and isn't the difficult bit. The problem is that future queries have to reference the temp table, not the underlying one, which means the queries have to be rewritten. There is no way of doing that because I don't have a practical way of parsing SQLite-flavoured SQL. It is possible to get query details to at least find out what tables a query accesses, but that is about it.

Using EXPLAIN QUERY PLAN should have been helpful. Ultimately SQLite uses nested for loops. You may be able to rewrite your queries using nested selects to avoid combinatorial behaviour and pass parameters to nested selects.

@fzakaria

fzakaria commented Sep 1, 2023

Wow, I'm surprised how fast slurping it into a TEMP table is...
All I did was create an indirect temporary table:

❯ git patch --cached sqlelf/cli.py
diff --git a/sqlelf/cli.py b/sqlelf/cli.py
index 55409f5..8ccaa4f 100644
--- a/sqlelf/cli.py
+++ b/sqlelf/cli.py
@@ -52,6 +52,14 @@ def start():
     strings.register(connection, binaries)
     instruction.register(connection, binaries)
 
+    connection.execute("""
+    CREATE TEMP TABLE elf_instructions
+    AS SELECT * FROM raw_elf_instructions;
+                       
+    CREATE TEMP TABLE elf_symbols
+    AS SELECT * FROM raw_elf_symbols;
+    """)
+
     shell = apsw.shell.Shell(db=connection)
 
     if args.sql:

They execute so fast now :)
(Ideal would be to hide the raw_elf_* tables, but meh.)

For one-off executions you pay the cost of re-inserting everything into the :memory: database every time:

❯ time sqlelf /usr/bin/ruby /lib/x86_64-linux-gnu/libruby-3.1.so.3.1 /lib/x86_64-linux-gnu/libz.so.1  --sql "SELECT caller.path as 'caller.path',
       callee.path as 'calee.path',
       caller.name,
       caller.demangled_name
FROM ELF_SYMBOLS caller
INNER JOIN ELF_SYMBOLS callee
ON
caller.name = callee.name AND
caller.path != callee.path AND
caller.imported = TRUE AND
callee.exported = TRUE
LIMIT 25;"
WARNING:root:SQLITE_LOG: automatic index on elf_symbols(exported) (284) SQLITE_WARNING SQLITE_WARNING_AUTOINDEX
┌──────────────────────────────────────────┬──────────────────────────────────────────┬─────────────────┬─────────────────┐
│               caller.path                │                calee.path                │      name       │ demangled_name  │
│ /usr/bin/ruby                            │ /lib/x86_64-linux-gnu/libruby-3.1.so.3.1 │ ruby_run_node   │ ruby_run_node   │
│ /usr/bin/ruby                            │ /lib/x86_64-linux-gnu/libruby-3.1.so.3.1 │ ruby_init       │ ruby_init       │
│ /usr/bin/ruby                            │ /lib/x86_64-linux-gnu/libruby-3.1.so.3.1 │ ruby_options    │ ruby_options    │
│ /usr/bin/ruby                            │ /lib/x86_64-linux-gnu/libruby-3.1.so.3.1 │ ruby_sysinit    │ ruby_sysinit    │
│ /usr/bin/ruby                            │ /lib/x86_64-linux-gnu/libruby-3.1.so.3.1 │ ruby_init_stack │ ruby_init_stack │
│ /lib/x86_64-linux-gnu/libruby-3.1.so.3.1 │ /lib/x86_64-linux-gnu/libz.so.1          │ uncompress      │ uncompress      │
└──────────────────────────────────────────┴──────────────────────────────────────────┴─────────────────┴─────────────────┘
sqlelf /usr/bin/ruby /lib/x86_64-linux-gnu/libruby-3.1.so.3.1  --sql   4.83s user 0.24s system 99% cpu 5.076 total

But if you start the apsw shell, which I can easily wrap, then it's even better.

@rogerbinns

Did you notice:

WARNING:root:SQLITE_LOG: automatic index on elf_symbols ...

That shows you really should be looking at the query plan, and you will get more performance improvements from the query structure and creating indices. If you don't change SQLite defaults then the cache size is also worth looking at -- the default is 2MB.
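
A sketch of that, assuming the TEMP table setup from the diff above (the index names are made up):

connection.execute("""
    -- index the join/filter columns so SQLite doesn't build automatic indices
    CREATE INDEX temp.idx_elf_symbols_name ON elf_symbols(name);
    CREATE INDEX temp.idx_elf_symbols_exported ON elf_symbols(exported);
    -- a negative cache_size is in KiB; the default of -2000 is about 2MB
    PRAGMA cache_size = -32000;
""")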
