## Introducing DBZero (4/12). So how fast is it?

Hey there! In this notebook, we'll be exploring how fast DBZero really is. Unfortunately, there's no single answer to this question, as performance depends on factors such as hardware setup, available RAM, network latency (in the cloud version), and specific usage patterns. Still, the measurements below should give you a concrete feel for what to expect.

To give you a better idea of how it stacks up against other technologies, we'll take a look at a few real-world use cases. We'll also discuss some techniques you can use to improve the performance of specific operations. So, buckle up and let's dive into the world of DBZero performance!

In [1]:
import dbzero as db0
from performance_charts import performance_chart, performance_plot, add_measurement, init_chart
from bokeh.io import show, output_notebook
from demo_utils import Speedometer

To start, we'll create a bar chart that displays the number of operations that can be performed per second. As we run our test cases, the chart will be updated in real time to reflect the latest results.

In [2]:
init_chart(["python", "pg", "dbzero"], title="Performance Comparison of Object Creation")
output_notebook(resources=None, verbose=False, hide_banner=True)
show(performance_chart, notebook_url="http://192.168.8.125:8888", port=8889)

To keep things simple, we'll be using a dataset that contains just three columns: first name, surname, and address.

In [3]:
import pandas as pd
df = pd.read_csv("/src/dev/notebooks/data/identities_1M.csv.gzip", compression="gzip", nrows=40000)
df

| Unnamed: 0 | first_name | surname | address |
|---|---|---|---|
| 0 | Tawanda | CAIZA | 250 Rt 59 |
| 1 | Sandi | TERISSI | 555 East Main St |
| 2 | Gemma | PISITELLO | 900 Boston Post Road |
| 3 | Kate | BUTTERFLY | 1450 No Brindlee Mtn Pkwy |
| 4 | Marilee | NEAMȚU | 655 Boston Post Rd |
| ... | ... | ... | ... |
| 39995 | Beckham | HOLOTESCU | 656 New Haven Ave |
| 39996 | Cosmo | ŠOBER | 655 Boston Post Rd |
| 39997 | Durward | STRICKER-BAROLIN | 330 Sutton Rd |
| 39998 | Pricilla | BOCANERA | 100 Elm Ridge Center Dr |


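We want to make sure our measurements are accurate. A pandas data frame stores its data column by column, so pulling rows out of it inside the timed loop would add overhead that has nothing to do with object creation. Instead, let's build in-memory row tuples up front, selecting the three columns by name so that any extra index column is left out.

In [4]:
# Plain Python tuples of (first_name, surname, address), built once before timing
rows = list(df[["first_name", "surname", "address"]].itertuples(index=False, name=None))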

What if we wanted to create regular Python in-memory objects that hold this same data? Well, we can measure the performance of this operation by tracking how many objects we're able to create in a given unit of time.

To do this, I've included a function below that you can use to measure the performance of creating these objects. Once you run the code, be sure to check out the chart above to see the results (note that the chart displays the performance in kOPS, i.e. thousands of operations per second).

In [5]:
from demo_utils import Speedometer

class PyPerson:
    def __init__(self, *args):
        self.first_name = args[0]
        self.surname = args[1]
        self.address = args[2]
    
def test_load_objects_to_python():
    meter = Speedometer()
    meter.start()
    for row in rows:
        p = PyPerson(*row)
    meter.measure(len(df))
    add_measurement("python", meter.speed())

In [6]:
test_load_objects_to_python()

Pure Python: 287347.48466536024 operations/sec
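
The `Speedometer` helper comes from `demo_utils` and isn't shown in this notebook. If you'd like to reproduce the measurements without it, here is a minimal sketch of what such a throughput timer might look like. The class name `SimpleSpeedometer` and its internals are my own assumptions; only the `start()` / `measure(n)` / `speed()` call pattern is taken from the cells above.

```python
import time


class SimpleSpeedometer:
    """Hypothetical stand-in for demo_utils.Speedometer (not the real helper)."""

    def __init__(self):
        self._started_at = None
        self._elapsed = 0.0
        self._operations = 0

    def start(self):
        # perf_counter is a monotonic, high-resolution clock suited to benchmarks.
        self._started_at = time.perf_counter()

    def measure(self, operations):
        # Record how many operations completed since start().
        if self._started_at is None:
            raise RuntimeError("call start() before measure()")
        self._elapsed = time.perf_counter() - self._started_at
        self._operations = operations

    def speed(self):
        # Throughput in operations per second.
        if self._elapsed <= 0:
            raise RuntimeError("no measurement recorded")
        return self._operations / self._elapsed


meter = SimpleSpeedometer()
meter.start()
squares = [i * i for i in range(100_000)]
meter.measure(len(squares))
print(f"{meter.speed():.0f} operations/sec")
```

The number you get depends entirely on your machine, so treat it as a relative yardstick rather than an absolute one.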


Just to give you an idea, on my machine, I was able to create these in-memory objects at a rate of roughly 300k per second. Now, let's take a look at what happens when we try the same operation with PostgreSQL using SQLAlchemy (without any indexing at this stage).

In [7]:
# Only the names actually used below are imported
from sqlalchemy import create_engine, String
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class PgPerson(Base):
    __tablename__ = "Persons"
    
    id: Mapped[int] = mapped_column(primary_key=True)
    first_name: Mapped[str] = mapped_column(String(), index=False)
    surname: Mapped[str] = mapped_column(String(), index=False)
    address: Mapped[str] = mapped_column(String())
    
    def __repr__(self) -> str:
        return f"{self.surname}, {self.first_name}. Address: {self.address}"
    
engine = create_engine('postgresql+psycopg2://root:root@192.168.8.125/test_db', echo=False)
Base.metadata.create_all(engine)

def test_load_objects_to_postgres():
    meter = Speedometer()
    session = Session(engine) 
    # Clear any data remnants
    session.query(PgPerson).delete()
    meter.start()
    for row in rows:
        session.add(PgPerson(first_name=row[0], surname=row[1], address=row[2]))
    # Time without commit and session close
    meter.measure(len(df))
    session.commit()
    session.close()
    add_measurement("pg", meter.speed())

In [8]:
test_load_objects_to_postgres()

SQLAlchemy + PostgreSQL: 40428.21145158946 operations/sec


Interestingly, when using SQLAlchemy ORM with PostgreSQL, the throughput was only about one-seventh of the pure Python speed (roughly 40k versus 287k operations per second, excluding the commit time). This is understandable since we not only need to construct Python objects using the ORM, but we also need to translate them to SQL statements and send them to the engine, where they are finally persisted.

Now, let's switch gears and see how the same operation performs on DBZero.

In [9]:
db0.init()

DBZero initialized for tenant: itx


Our first approach is to use a Python-native syntax for working with DBZero. It's worth noting that this code differs from the pure Python code by just a single annotation.

In [10]:
@db0.keepit
class DB0Person:
    def __init__(self, *args):
        self.first_name = args[0]
        self.surname = args[1]
        self.address = args[2]
    
def test_load_objects_to_db0():
    meter = Speedometer()
    meter.start()
    for row in rows:
        p = DB0Person(*row)
    meter.measure(len(df))
    add_measurement("dbzero", meter.speed())

In [11]:
test_load_objects_to_db0()

DBZero: 145967.78836854364 operations/sec


When I ran this code, I was able to achieve a performance of around 146k objects per second, which is somewhere in between the performance of PostgreSQL and pure Python. This seems like a reasonable result, given that we need to create Python objects and put them into the DBZero space, which adds some additional time.

However, the good news is that DBZero offers a few additional techniques that can significantly improve the performance of object creation tasks. So, let's explore those next!

#### Ok, so the performance is not too bad. But you say it can be improved... how then?

Before we dive into those techniques, let's take a moment to understand how the default object initializers in DBZero work.

By default, if we omit the user-defined `__init__` method, DBZero will provide a built-in one. This is important to keep in mind as we explore ways to optimize object creation in DBZero.

In [12]:
@db0.keepit
class DB0DefaultPerson:
    def __repr__(self) -> str:
        return f"{self.surname}, {self.first_name}. Address: {self.address}"

def test_load_default_objects_to_db0():
    meter = Speedometer()
    meter.start()
    for row in rows:
        p = DB0DefaultPerson(first_name = row[0], surname=row[1], address=row[2])
    meter.measure(len(df))
    add_measurement("dbzero", meter.speed())

In [13]:
test_load_default_objects_to_db0()

DBZero: 116000.34588983137 operations/sec
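
To make the idea of a built-in initializer more concrete, here is a plain-Python analogy. It is only a sketch of the general pattern (a class decorator that supplies a keyword-based `__init__` when the class doesn't define one); the decorator name `with_default_init` is hypothetical, and this is not how DBZero implements `@db0.keepit`.

```python
def with_default_init(cls):
    """Hypothetical decorator: add a keyword-based initializer if the class has none."""
    if "__init__" not in cls.__dict__:
        def __init__(self, **fields):
            # Assign each keyword argument as an attribute, untouched.
            for name, value in fields.items():
                setattr(self, name, value)
        cls.__init__ = __init__
    return cls


@with_default_init
class Person:
    def __repr__(self) -> str:
        return f"{self.surname}, {self.first_name}. Address: {self.address}"


person = Person(first_name="Adam", surname="Fox", address="Dallas")
print(person)
```

Because the generated initializer does nothing but copy values into fields, whoever generates it knows exactly what happens during construction, and that knowledge is what leaves room for optimizations.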


#### While the performance of built-in initializers is similar, they can help avoid some boilerplate code. But are they useful in any other way?
Yes, they are. One key advantage is that default initializers can be fully controlled by DBZero. This means that DBZero knows that no special operations are performed on fields before they're assigned as object members, which enables optimizations.

For example, DBZero adds a static method called `__batch_init__` to classes with built-in initializers. This method can be used to initialize multiple objects at once, which can be much faster than initializing objects one at a time.

#### What does `__batch_init__` do and how can I use it?
Let's take a closer look at an example of how to use `__batch_init__`. This method takes two positional arguments: a tuple (or list) of field names and a source of rows (e.g. a list or a generator). The result is an iterable that must be iterated over to create new DB0 Python objects.

In [14]:
friends = [("Adam", "Fox", "Dallas"), ("Jacek", "Wasilewski", "Gdynia")]
for p in DB0DefaultPerson.__batch_init__(["first_name", "surname", "address"], friends):
    print(p)

Fox, Adam. Address: Dallas
Wasilewski, Jacek. Address: Gdynia
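
The "must be iterated over" part is worth a quick illustration. The sketch below uses an ordinary Python generator to mimic that lazy behaviour; `lazy_batch_init` and the `Friend` class are hypothetical names of mine, and the code says nothing about how DBZero implements `__batch_init__` internally.

```python
class Friend:
    created = 0  # counts how many instances have been constructed so far

    def __init__(self, **fields):
        for name, value in fields.items():
            setattr(self, name, value)
        Friend.created += 1


def lazy_batch_init(cls, field_names, rows):
    """Hypothetical generator: build one object per row, only when asked."""
    for row in rows:
        yield cls(**dict(zip(field_names, row)))


friends = [("Adam", "Fox", "Dallas"), ("Jacek", "Wasilewski", "Gdynia")]
batch = lazy_batch_init(Friend, ("first_name", "surname", "address"), friends)
created_before = Friend.created   # the generator has not been consumed yet
people = list(batch)              # iterating is what actually creates the objects
created_after = Friend.created
print(created_before, created_after)
```

Calling the function only sets up the work; no object exists until the loop (or `list(...)`) pulls it out.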


#### It's faster than the regular `__init__`, right?
Yes, you're right! Using `__batch_init__` can be much faster than using the regular `__init__` method.

Let's measure how much faster it is. We can do this by pulling out Python objects that were initialized with rows using `__batch_init__`.

In [15]:
def test_batch_load_objects_to_db0():
    meter = Speedometer()
    meter.start()
    for _ in DB0DefaultPerson.__batch_init__(("first_name", "surname", "address"), rows):
        pass
    meter.measure(len(df))
    add_measurement("dbzero", meter.speed())

In [16]:
test_batch_load_objects_to_db0()

DBZero: 230970.5340265098 operations/sec


On my machine, batch initialization reached about 231k objects per second: roughly 60% faster than the custom `__init__` version and about twice as fast as the built-in initializer (as shown in the chart at the top of this document). However, if the objects are not needed immediately, we can further improve performance by using the "express" mode. In this mode, the objects are created in DB0 but are not returned to Python.

In [17]:
def test_express_batch_load_objects_to_db0():
    meter = Speedometer()
    meter.start()
    DB0DefaultPerson.__batch_init__(("first_name", "surname", "address"), rows, express=True)
    meter.measure(len(df))
    add_measurement("dbzero", meter.speed())

In [18]:
test_express_batch_load_objects_to_db0()

DBZero: 297775.37969338667 operations/sec


#### Nice, so now it's almost as fast as in-memory Python.
Yes, it is, and we have even more techniques to further improve performance.

As already mentioned, a pandas data frame actually consists of columns, which are NumPy arrays stored in memory. We can take advantage of this by passing the entire pandas data frame to `__batch_init__`, which will pull the relevant columns by name. This avoids building intermediate row objects and can make the process even more efficient.

In [19]:
def test_batch_load_objects_from_columns_to_db0():
    meter = Speedometer()
    meter.start()
    for p in DB0DefaultPerson.__batch_init__(("first_name", "surname", "address"), data=df):
        pass
    meter.measure(len(df))
    add_measurement("dbzero", meter.speed())

In [20]:
test_batch_load_objects_from_columns_to_db0()

DBZero: 554914.6291588771 operations/sec
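
If the column-versus-row distinction feels abstract, the following plain-Python sketch may help. It models a data frame as a dictionary of lists (one list per column) and shows how rows can be assembled by picking columns by name and walking them in lockstep. The `rows_from_columns` helper and the extra `unused` column are assumptions made for illustration; pandas and DBZero do this natively and far more efficiently.

```python
# Column-oriented storage: one list per field, as a stand-in for a data frame.
columns = {
    "first_name": ["Tawanda", "Sandi", "Gemma"],
    "surname": ["CAIZA", "TERISSI", "PISITELLO"],
    "address": ["250 Rt 59", "555 East Main St", "900 Boston Post Road"],
    "unused": [1, 2, 3],  # hypothetical extra column that is simply never selected
}


def rows_from_columns(columns, field_names):
    """Select columns by name and walk them in lockstep, yielding row tuples."""
    selected = [columns[name] for name in field_names]
    return zip(*selected)


records = list(rows_from_columns(columns, ("first_name", "surname", "address")))
print(records[0])
```

Selecting by name means columns you don't ask for are never touched, which is part of why column access is cheaper than materializing full rows first.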


Express mode can also be used when creating objects from a pandas data frame, which can further boost performance.

In [21]:
def test_express_batch_load_objects_from_columns_to_db0():
    meter = Speedometer()
    meter.start()
    DB0DefaultPerson.__batch_init__(("first_name", "surname", "address"), data=df, express=True)
    meter.measure(len(df))
    add_measurement("dbzero", meter.speed())

In [22]:
test_express_batch_load_objects_from_columns_to_db0()

DBZero: 904560.3568689611 operations/sec
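
To put all of these figures side by side, here is a small calculation that expresses each result relative to the pure Python baseline. The numbers are copied (and rounded) from the cell outputs in this notebook; they come from my machine, so your ratios will differ.

```python
# Throughput figures (operations/sec) copied from the cell outputs above, rounded.
results = {
    "pure Python": 287_347,
    "SQLAlchemy + PostgreSQL": 40_428,
    "DBZero, custom __init__": 145_968,
    "DBZero, built-in __init__": 116_000,
    "DBZero, __batch_init__ (rows)": 230_971,
    "DBZero, __batch_init__ express (rows)": 297_775,
    "DBZero, __batch_init__ (data frame)": 554_915,
    "DBZero, __batch_init__ express (data frame)": 904_560,
}

# Express every result as a multiple of the pure Python throughput.
baseline = results["pure Python"]
relative = {name: ops / baseline for name, ops in results.items()}
for name, ratio in relative.items():
    print(f"{name:45s} {ratio:5.2f}x")
```

Read this way, the ORM run lands at roughly 0.14x of pure Python, while the express data-frame load comes out at a little over 3x.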


Ok, let's take another look at our charts and see the results below.

In [23]:
show(performance_plot())

In [24]:
db0.close()

Stay tuned: in the next episode, we'll dive into the performance of lookup queries.