## Introducing **dbzero** (3/12): Exploring the Limits

In [1]:
import dbzero as db0
from mem_charts import mem_usage_chart, random_string
from bokeh.io import show, output_notebook
import concurrent.futures

We've imported the 'bokeh' package to visualize current memory utilization on a chart.

Let's also create an executor to be able to run Python tasks in the background (in a separate thread).

In [2]:
executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
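As a quick aside, here is a minimal sketch of what submitting work to such an executor looks like. The names `slow_square` and `demo_executor` are made up for this illustration and are not part of the notebook; the point is only that `submit` returns a `Future` right away while the task runs in the background thread.

```python
import concurrent.futures
import time


def slow_square(x):
    """Hypothetical task: pretend to work for a moment, then return x squared."""
    time.sleep(0.1)
    return x * x


# A single worker thread, just like the executor created above.
demo_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

# submit() returns immediately with a Future; the call runs in the worker thread.
future = demo_executor.submit(slow_square, 7)

# result() blocks until the task has finished and hands back its return value.
value = future.result()
print(value)

demo_executor.shutdown()
```

In the notebook itself we never call `result()`, so the cells return immediately and the chart keeps updating while the task runs.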

The chart below presents live memory utilization of the current process, refreshed every 1 second.

In [3]:
output_notebook(resources=None, verbose=False, hide_banner=True)
show(mem_usage_chart, notebook_url="http://127.0.0.1:8888", port=8889)

Let's first see how memory utilization grows when running regular Python code. We append 50k random strings to a regular Python list in each of 100 iterations, for a total of 5M elements. Let's see how this performs...

In [4]:
result = []
def generate_sequence(result, length, batch):
    for _ in range(length):
        result.extend([random_string() for _ in range(batch)])
    print("Task finished")

In [5]:
task = executor.submit(generate_sequence, result, length=100, batch=50000)

Task finished
Task finished


The memory just keeps growing and would eventually crash the process with an "out of memory" error. In Python, we can release it by rebinding the name to a new, empty list, which lets the old list and its strings be freed once nothing else references them.

In [6]:
result = []
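If you want to check this effect without the chart, the standard `tracemalloc` module offers a rough way to do so. The snippet below is only a sketch: it uses a smaller list than the test above, and `tracemalloc` counts memory allocated by Python objects rather than the process footprint shown on the chart, so the two numbers will not match exactly.

```python
import tracemalloc

tracemalloc.start()

# Build a list of 100k short strings (a scaled-down stand-in for the test above).
data = [str(i) * 10 for i in range(100_000)]
used_with_list, _ = tracemalloc.get_traced_memory()

# Rebinding the name drops the last reference to the old list and its strings.
data = []
used_after_release, _ = tracemalloc.get_traced_memory()

tracemalloc.stop()

print(f"with list:     {used_with_list} bytes")
print(f"after release: {used_after_release} bytes")
```

Note that the operating system may not reclaim freed memory right away, so the process-level chart can lag behind what `tracemalloc` reports.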

### OK, so how does **dbzero** differ in this matter?
In **dbzero**, you mostly work as you would with regular Python code, but you no longer need to worry about memory limits. That's right: even with terabytes of data to deal with, your process will not exceed the limits you define yourself.

After initialization, **dbzero** will occupy a small amount of your memory...

In [7]:
db0.init(dbzero_root="/dbzero", prefix="data")

But you can control how much additional memory it uses by calling the 'set_cache_size' function.

In [8]:
db0.set_cache_size(128 << 20)
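In case the `128 << 20` notation is unfamiliar: shifting left by 20 bits multiplies by 2^20, so the value is 128 × 1,048,576. Assuming the argument is a size in bytes (the notebook does not state the unit), that works out to 128 MiB. The helper `mib` below is hypothetical and exists only to spell out the arithmetic.

```python
def mib(n):
    """Hypothetical helper: convert a number of mebibytes to bytes."""
    return n << 20  # same as n * 1024 * 1024


cache_bytes = 128 << 20

print(cache_bytes)             # 134217728
print(cache_bytes == mib(128))  # True
print(cache_bytes / 2**20)     # 128.0
```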

Let's now repeat the test using db0.list (a list object inside the dbzero space). Watch carefully how the memory utilization stops at some point (when the defined cache limit is reached) and does not grow no matter how much data you put into your list.

In [9]:
db0_result = db0.list()

In [10]:
def db0_generate_sequence(result, length, batch):
    for _ in range(length):
        result.extend([random_string() for _ in range(batch)])
    print("Task finished")

In [11]:
task = executor.submit(db0_generate_sequence, db0_result, length=100, batch=50000)

In [13]:
print(len(db0_result))
db0_result[12313]

5000000


'1pXfgYf8dYXN'

### Well, so where is the data actually stored in this case?
**dbzero** implements memory exchange algorithms. It fetches data on demand from the cloud or, in the case of the local version, from the filesystem, and retains it in a local cache to allow rapid access in the future. The process is completely transparent to the developer.
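To get a feel for the general idea, here is a toy sketch of a bounded cache sitting in front of a slower backing store. This is not how **dbzero** is implemented; the class `PagedStore`, the least-recently-used eviction policy, and the plain dictionary standing in for the filesystem are all assumptions made for illustration.

```python
from collections import OrderedDict


class PagedStore:
    """Toy bounded cache in front of a backing store (illustration only)."""

    def __init__(self, backing, capacity):
        self.backing = backing      # stands in for the filesystem or cloud
        self.capacity = capacity    # maximum number of items kept in memory
        self.cache = OrderedDict()  # ordered from least to most recently used
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            # Cache hit: mark the item as most recently used.
            self.cache.move_to_end(key)
            return self.cache[key]
        # Cache miss: fetch from the backing store and keep a local copy.
        self.misses += 1
        value = self.backing[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Over the limit: evict the least recently used item.
            self.cache.popitem(last=False)
        return value


backing = {i: f"item-{i}" for i in range(1000)}
store = PagedStore(backing, capacity=3)

for key in (1, 2, 3, 1, 4, 2):
    store.get(key)

print(sorted(store.cache), store.misses)
```

However many keys you read, the in-memory part never holds more than `capacity` items, which mirrors the flat line you saw on the chart once the cache limit was reached.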

In [14]:
db0.close()