# Lazy File Loading

ZnTrack versions newer than 0.3.5 include a lazy loading feature. It is particularly useful for graphs with many dependencies and large files.
Lazy file loading allows us to only load data when it is accessed.
This tutorial will show the benefits but also the difficulties that come with it.

By default, `zntrack.config.lazy` is `True`, which enables lazy file loading globally. See the Notes section for cases where this can cause problems. You can disable it by setting `zntrack.config.lazy = False`.
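The relationship between a global default and a per-call `lazy` argument can be pictured with a small plain-Python sketch. It does not use ZnTrack; `Config` and `resolve_lazy` are hypothetical names, and the assumption that an explicit argument takes precedence over the global setting is only inferred from the `load(lazy=...)` calls used later in this tutorial.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical stand-in for a global settings object."""

    lazy: bool = True  # global default: lazy loading enabled


config = Config()


def resolve_lazy(lazy=None):
    """Return the explicit argument if one was given, else the global default."""
    return config.lazy if lazy is None else lazy


default_choice = resolve_lazy()        # falls back to the global default
explicit_choice = resolve_lazy(False)  # the explicit argument wins
```

Keeping `None` as the "not specified" value lets a caller pass either `True` or `False` explicitly without the global setting interfering.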

In [1]:
from zntrack import config

# When using ZnTrack we can write our code inside a Jupyter notebook.
# We can make use of this functionality by setting the `nb_name` config as follows:
config.nb_name = "09_lazy.ipynb"

In [2]:
from zntrack.utils import cwd_temp_dir

temp_dir = cwd_temp_dir()

In [3]:
!git init
!dvc init

Initialized empty Git repository in C:/Users/fabia/AppData/Local/Temp/tmp2q_n25ir/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


Let's start by creating some example Nodes.

In [4]:
from zntrack import Node, zn
import random

We will now create a `PrintOption` that is identical to `zn.outs` but prints a message every time the data is read from files.

In [5]:
class PrintOption(zn.outs):
    def get_data_from_files(self, instance):
        print(f"Loading data from files for {instance.node_name}")
        return super().get_data_from_files(instance)

In [6]:
class RandomNumber(Node):
    start = zn.params()
    stop = zn.params()
    number = PrintOption()  # = zn.outs() + print

    def run(self):
        self.number = random.randrange(self.start, self.stop)

In this first example we will not use lazy loading.

In [7]:
RandomNumber(start=1, stop=1000).write_graph(run=True)

Submit issues to https://github.com/zincware/ZnTrack.
2022-02-17 17:04:18,484 utils (INFO): Running stage 'RandomNumber':
> python -c "from src.RandomNumber import RandomNumber; RandomNumber.load(name='RandomNumber').run_and_save()" 
Creating 'dvc.yaml'
Adding stage 'RandomNumber' in 'dvc.yaml'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

    git add dvc.yaml dvc.lock 'nodes\RandomNumber\.gitignore'

To enable auto staging, run:

	dvc config core.autostage true



In [8]:
random_number = RandomNumber.load(lazy=False)

Loading data from files for RandomNumber


As we can see, the data of `RandomNumber` is already loaded into memory.

In [9]:
random_number.number

515

Now let us do the same thing with `lazy=True`.

In [10]:
lazy_random_number = RandomNumber.load(lazy=True)
print(lazy_random_number.__dict__["number"])

<class 'zntrack.utils.LazyOption'>


We can see that the random number is not yet available, but as soon as we access the attribute it is loaded for us (and stored in memory).

In [11]:
lazy_random_number.number

Loading data from files for RandomNumber


515
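The behaviour seen here, a placeholder stored on the instance that is swapped for the real value on first access, can be sketched with a plain Python descriptor. This is only an illustration of the general technique and not ZnTrack's actual implementation; `LazyPlaceholder`, `LazyOutput`, `DemoNode` and `read_number` are hypothetical names, and `read_number` merely stands in for reading a file.

```python
class LazyPlaceholder:
    """Sentinel marking a value that has not been loaded yet."""


class LazyOutput:
    """Descriptor that defers loading until the attribute is first accessed."""

    def __init__(self, loader):
        self.loader = loader  # callable standing in for reading a file

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        value = instance.__dict__.get(self.name, LazyPlaceholder)
        if value is LazyPlaceholder:
            # First access: load the data and cache it on the instance.
            value = self.loader(instance)
            instance.__dict__[self.name] = value
        return value

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value


load_calls = []  # records every simulated file read


def read_number(instance):
    load_calls.append(instance.name)
    return 515


class DemoNode:
    number = LazyOutput(read_number)

    def __init__(self, name):
        self.name = name
        self.__dict__["number"] = LazyPlaceholder  # nothing loaded yet


node = DemoNode("RandomNumber")
before = node.__dict__["number"]  # still the placeholder class
first = node.number               # triggers read_number once
second = node.number              # served from the cached value
```

Because `LazyOutput` defines `__set__`, it is a data descriptor, so `__get__` runs on every attribute access even though the instance dictionary holds an entry with the same name.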

Let's build some dependencies to show where lazy loading is especially useful.

In [12]:
class AddOne(Node):
    deps = zn.deps()
    number = PrintOption()

    def __init__(self, deps=None, **kwargs):
        super().__init__(**kwargs)
        self.deps = deps

    def run(self):
        self.number = self.deps.number + 1

In [13]:
AddOne(deps=RandomNumber.load(), name="AddOne_0").write_graph(run=True)
for index in range(10):
    AddOne(
        deps=AddOne.load(name=f"AddOne_{index}"), name=f"AddOne_{index+1}"
    ).write_graph(run=True)

Submit issues to https://github.com/zincware/ZnTrack.
2022-02-17 17:04:23,394 utils (INFO): Running stage 'AddOne_0':
> python -c "from src.AddOne import AddOne; AddOne.load(name='AddOne_0').run_and_save()" 
Loading data from files for RandomNumber
Adding stage 'AddOne_0' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

    git add dvc.yaml 'nodes\AddOne_0\.gitignore' dvc.lock

To enable auto staging, run:

	dvc config core.autostage true

Submit issues to https://github.com/zincware/ZnTrack.
2022-02-17 17:04:28,459 utils (INFO): Running stage 'AddOne_1':
> python -c "from src.AddOne import AddOne; AddOne.load(name='AddOne_1').run_and_save()" 
Loading data from files for AddOne_0
Adding stage 'AddOne_1' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

    git add dvc.lock dvc.yaml 'nodes\AddOne_1\.gitignore'

To enable auto staging, run:

	dvc config core.autostage true

Submit issues to https:/

In [14]:
!dvc dag

+--------------+ 
| RandomNumber | 
+--------------+ 
        *        
        *        
        *        
  +----------+   
  | AddOne_0 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_1 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_2 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_3 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_4 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_5 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_6 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_7 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne

If we now load the last `AddOne` with `lazy=False`, we see that it loads every upstream Node into memory, although we might only be interested in the most recent number.

In [15]:
add_one = AddOne.load(name="AddOne_10", lazy=False)

Loading data from files for RandomNumber
Loading data from files for AddOne_0
Loading data from files for AddOne_1
Loading data from files for AddOne_2
Loading data from files for AddOne_3
Loading data from files for AddOne_4
Loading data from files for AddOne_5
Loading data from files for AddOne_6
Loading data from files for AddOne_7
Loading data from files for AddOne_8
Loading data from files for AddOne_9
Loading data from files for AddOne_10


It is rather unlikely that we need all of this data in memory, so we can use `lazy=True` to avoid loading it.

In [16]:
add_one_lazy = AddOne.load(name="AddOne_10", lazy=True)

We can check at an arbitrary depth of the dependency chain that both instances yield the same value.

In [17]:
add_one_lazy.deps.deps.deps.deps.deps.deps.deps.number

Loading data from files for AddOne_3


519

In [18]:
add_one.deps.deps.deps.deps.deps.deps.deps.number

519
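Why only `AddOne_3` was read in the lazy case can be illustrated with a toy chain in plain Python. This is a sketch under simplifying assumptions and not ZnTrack code: `LazyNode` is a hypothetical class, and the stored values are hard-coded to mirror the numbers shown above instead of being read from files.

```python
class LazyNode:
    """Toy node whose `number` is read only on first access."""

    loaded = []  # names of the nodes that actually read their data

    def __init__(self, name, stored_value, deps=None):
        self.name = name
        self.deps = deps
        self._stored_value = stored_value  # stands in for the file on disk
        self._cache = None
        self._is_loaded = False

    @property
    def number(self):
        if not self._is_loaded:
            LazyNode.loaded.append(self.name)
            self._cache = self._stored_value
            self._is_loaded = True
        return self._cache


# Build RandomNumber -> AddOne_0 -> ... -> AddOne_10 with stored results.
node = LazyNode("RandomNumber", 515)
for index in range(11):
    node = LazyNode(f"AddOne_{index}", 516 + index, deps=node)

# Walk seven steps up the chain from AddOne_10 and read a single value.
target = node
for _ in range(7):
    target = target.deps
value = target.number
```

Walking through `deps` touches only the node objects, so `LazyNode.loaded` ends up containing a single entry for the node whose `number` was requested.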

## Notes
When using ZnTrack to compare data from different versions, it is important either to avoid `lazy=True` or to load the data manually before loading another version. A lazily loaded attribute is read from disk only on first access, so it reflects whatever files are present at that moment.
In the following example we run `dvc repro` for three different experiments, load the result with and without `lazy=True` each time, and compare the results.

In [19]:
RandomNumber(start=0, stop=5000).write_graph()
!dvc repro
add_one_lazy_1 = AddOne.load(name="AddOne_10", lazy=True)
add_one_1 = AddOne.load(name="AddOne_10", lazy=False)

RandomNumber(start=0, stop=5001).write_graph()
!dvc repro
add_one_lazy_2 = AddOne.load(name="AddOne_10", lazy=True)
add_one_2 = AddOne.load(name="AddOne_10", lazy=False)

RandomNumber(start=0, stop=5002).write_graph()
!dvc repro
add_one_lazy_3 = AddOne.load(name="AddOne_10", lazy=True)
add_one_3 = AddOne.load(name="AddOne_10", lazy=False)

Submit issues to https://github.com/zincware/ZnTrack.
2022-02-17 17:05:18,420 utils (INFO): Modifying stage 'RandomNumber' in 'dvc.yaml'

To track the changes with git, run:

    git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true

Running stage 'RandomNumber':
> python -c "from src.RandomNumber import RandomNumber; RandomNumber.load(name='RandomNumber').run_and_save()" 
Updating lock file 'dvc.lock'

Running stage 'AddOne_0':
> python -c "from src.AddOne import AddOne; AddOne.load(name='AddOne_0').run_and_save()" 
Loading data from files for RandomNumber
Updating lock file 'dvc.lock'

Running stage 'AddOne_1':
> python -c "from src.AddOne import AddOne; AddOne.load(name='AddOne_1').run_and_save()" 
Loading data from files for AddOne_0
Updating lock file 'dvc.lock'

Running stage 'AddOne_2':
> python -c "from src.AddOne import AddOne; AddOne.load(name='AddOne_2').run_and_save()" 
Loading data from files for AddOne_1
Updating lock file 'dvc.lock'

Ru

In [20]:
# With lazy=True we get the same number for every run, which is not what we expect.
print(f"{add_one_lazy_1.number} == {add_one_lazy_2.number} == {add_one_lazy_3.number}")

Loading data from files for AddOne_10
Loading data from files for AddOne_10
Loading data from files for AddOne_10
579 == 579 == 579


In [21]:
# With lazy=False we get the results we expect.
print(f"{add_one_1.number} != {add_one_2.number} != {add_one_3.number}")

3864 != 2717 != 579


You can "lock" a value into place (loading it into memory) by accessing it, e.g. through `_ = add_one_lazy_1.number`. This way you can load only the values you want to compare and still keep the benefit of `lazy=True` for everything else.
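The pitfall and the "lock by access" workaround can be reproduced without ZnTrack in a few lines. The sketch below is a simplified model: `LazyValue` is a hypothetical class, the JSON file stands in for a Node's output file, and the two numbers are borrowed from the output above for illustration only.

```python
import json
import tempfile
from pathlib import Path


class LazyValue:
    """Reads a JSON file on first access and caches the result afterwards."""

    def __init__(self, path):
        self.path = Path(path)
        self._loaded = False
        self._value = None

    @property
    def value(self):
        if not self._loaded:
            self._value = json.loads(self.path.read_text())
            self._loaded = True
        return self._value


with tempfile.TemporaryDirectory() as tmp:
    result_file = Path(tmp) / "number.json"

    result_file.write_text(json.dumps(3864))  # result of the first run
    unpinned = LazyValue(result_file)         # nothing read yet
    pinned = LazyValue(result_file)
    _ = pinned.value                          # access now to keep 3864 in memory

    result_file.write_text(json.dumps(579))   # a later run overwrites the file

    unpinned_result = unpinned.value          # reads the file only now
    pinned_result = pinned.value              # served from memory
```

Here `pinned_result` keeps the value from the first run, while `unpinned_result` picks up the overwritten file, which mirrors the difference between the lazy and non-lazy results above.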

In [None]:
temp_dir.cleanup()