# Lazy File Loading

Lazy loading was introduced in ZnTrack versions after 0.3.5. It is essential for graphs with many dependencies and large files.
Lazy file loading means that data is only read from disk when it is accessed.
This tutorial shows the benefits, but also the difficulties, that come with it.

By default, `config.lazy == True`, which enables lazy file loading globally. See the Notes section for cases where this can cause problems. You can disable it by setting `zntrack.config.lazy = False`.

In [1]:
from zntrack import config

# When using ZnTrack we can write our code inside a Jupyter notebook.
# We can make use of this functionality by setting the `nb_name` config as follows:
config.nb_name = "09_lazy.ipynb"
config.lazy = False

In [2]:
from zntrack.utils import cwd_temp_dir

temp_dir = cwd_temp_dir()

In [3]:
!git init
!dvc init

Initialized empty Git repository in /tmp/tmp193rwpr3/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


Let's start by creating some example Nodes.

In [4]:
import zntrack
import random

We will now create a `PrintOption` that behaves like `zn.outs` but prints a message every time the data is read from files.

In [5]:
import typing


class PrintOption(zntrack.zn.Output):
    """Behaves like `zn.outs` but reports every read from disk."""

    def __init__(self):
        # zntrack will try dvc --PrintOption outs.json
        # we must tell it to use dvc --outs outs.json instead
        super().__init__(dvc_option="outs", use_repr=False)

    def _get_value_from_file(self, instance) -> typing.Any:
        print(f"Loading data from files for {instance.name}")
        return super()._get_value_from_file(instance)

In [6]:
class RandomNumber(zntrack.Node):
    start = zntrack.zn.params()
    stop = zntrack.zn.params()
    number = PrintOption()  # = zn.outs() + print

    def run(self):
        self.number = random.randrange(self.start, self.stop)

In this first example we will not use lazy loading.

In [7]:
with zntrack.Project() as project:
    random_number = RandomNumber(start=1, stop=1000)
project.run()

Running DVC command: 'stage add --name RandomNumber --force ...'


Creating 'dvc.yaml'
Adding stage 'RandomNumber' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/RandomNumber/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


Jupyter support is an experimental feature! Please save your notebook before running this command!
Submit issues to https://github.com/zincware/ZnTrack.
[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'repro'


Running stage 'RandomNumber':
> zntrack run src.RandomNumber.RandomNumber --name RandomNumber
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.


In [8]:
random_number.load(lazy=False)

Loading data from files for RandomNumber


As we can see, the `RandomNumber` output has already been loaded into memory.

In [9]:
random_number.number

639

Now let us do the same thing with `lazy=True`.

In [10]:
lazy_random_number = RandomNumber.from_rev(lazy=True)
print(lazy_random_number.__dict__["number"])

<class 'zntrack.utils.LazyOption'>


We can see that the random number is not yet available, but as soon as we access the attribute it is loaded for us (and stored in memory).

In [11]:
lazy_random_number.number

Loading data from files for RandomNumber


639
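The load-on-first-access behaviour can be illustrated without ZnTrack. The following is a minimal sketch, not ZnTrack's implementation: `LazyValue`, `Result` and the `reads` counter are hypothetical names introduced only for this illustration. The value is read from a JSON file the first time the attribute is accessed and is served from memory afterwards.

```python
import json
import pathlib
import tempfile


class LazyValue:
    """Hypothetical descriptor: read a JSON value from disk on first access."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        # First access: read the file and cache the value on the instance.
        # Later lookups find the cached value in instance.__dict__ directly.
        instance.reads += 1
        data = json.loads(instance.path.read_text())
        instance.__dict__[self.name] = data[self.name]
        return instance.__dict__[self.name]


class Result:
    number = LazyValue()

    def __init__(self, path):
        self.path = pathlib.Path(path)
        self.reads = 0  # counts how often the file was actually read


with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "outs.json"
    path.write_text(json.dumps({"number": 639}))

    result = Result(path)
    reads_before = result.reads  # nothing has been loaded yet
    first = result.number        # triggers the file read
    second = result.number       # served from memory
    reads_after = result.reads
```

Because the descriptor only defines `__get__`, the cached entry in `instance.__dict__` takes precedence on every later lookup, so the file is read once per instance.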

Let's build some dependencies to show where lazy loading is especially useful.

In [12]:
class AddOne(zntrack.Node):
    deps = zntrack.zn.deps()
    number = PrintOption()

    def run(self):
        self.number = self.deps.number + 1

In [13]:
with zntrack.Project() as project:
    random_number = RandomNumber(start=1, stop=100)

    add_one = AddOne(deps=random_number, name="AddOne_0")
    for index in range(10):
        add_one = AddOne(deps=add_one, name=f"AddOne_{index+1}")

project.run()

Running DVC command: 'stage add --name RandomNumber --force ...'


Modifying stage 'RandomNumber' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_0 --force ...'


Adding stage 'AddOne_0' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_0/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_1 --force ...'


Adding stage 'AddOne_1' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_1/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_2 --force ...'


Adding stage 'AddOne_2' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_2/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_3 --force ...'


Adding stage 'AddOne_3' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_3/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_4 --force ...'


Adding stage 'AddOne_4' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_4/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_5 --force ...'


Adding stage 'AddOne_5' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_5/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_6 --force ...'


Adding stage 'AddOne_6' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_6/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_7 --force ...'


Adding stage 'AddOne_7' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_7/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_8 --force ...'


Adding stage 'AddOne_8' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_8/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_9 --force ...'


Adding stage 'AddOne_9' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_9/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'stage add --name AddOne_10 --force ...'


Adding stage 'AddOne_10' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml nodes/AddOne_10/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'repro'


Running stage 'RandomNumber':
> zntrack run src.RandomNumber.RandomNumber --name RandomNumber
Updating lock file 'dvc.lock'

Running stage 'AddOne_0':
> zntrack run src.AddOne.AddOne --name AddOne_0
Loading data from files for RandomNumber
Updating lock file 'dvc.lock'

Running stage 'AddOne_1':
> zntrack run src.AddOne.AddOne --name AddOne_1
Loading data from files for AddOne_0
Updating lock file 'dvc.lock'

Running stage 'AddOne_2':
> zntrack run src.AddOne.AddOne --name AddOne_2
Loading data from files for AddOne_1
Updating lock file 'dvc.lock'

Running stage 'AddOne_3':
> zntrack run src.AddOne.AddOne --name AddOne_3
Loading data from files for AddOne_2
Updating lock file 'dvc.lock'

Running stage 'AddOne_4':
> zntrack run src.AddOne.AddOne --name AddOne_4
Loading data from files for AddOne_3
Updating lock file 'dvc.lock'

Running stage 'AddOne_5':
> zntrack run src.AddOne.AddOne --name AddOne_5
Loading data from files for AddOne_4
Updating lock file 'dvc.lock'

Running stage 'AddO

In [14]:
!dvc dag

+--------------+ 
| RandomNumber | 
+--------------+ 
        *        
        *        
        *        
  +----------+   
  | AddOne_0 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_1 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_2 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_3 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_4 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_5 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_6 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne_7 |   
  +----------+   
        *        
        *        
        *        
  +----------+   
  | AddOne

If we now load the latest `AddOne`, we will see that it loads everything into memory, although we might only be interested in the most recent number.

In [15]:
add_one = AddOne.from_rev(name="AddOne_10", lazy=False)

Loading data from files for RandomNumber
Loading data from files for AddOne_0
Loading data from files for AddOne_1
Loading data from files for AddOne_2
Loading data from files for AddOne_3
Loading data from files for AddOne_4
Loading data from files for AddOne_5
Loading data from files for AddOne_6
Loading data from files for AddOne_7
Loading data from files for AddOne_8
Loading data from files for AddOne_9
Loading data from files for AddOne_10


It is rather unlikely that we need all of this data in memory, so we can use `lazy=True` to avoid that.

In [16]:
add_one_lazy = AddOne.from_rev(name="AddOne_10", lazy=True)
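The difference between the two loading modes can be sketched with plain Python. This is only an illustration under simplifying assumptions: `ChainNode`, `build_chain` and `READS` are hypothetical names, and the sketch does not model how ZnTrack resolves `deps`. It only shows that eager construction reads every file in a chain, while lazy construction reads nothing until a value is requested.

```python
import json
import pathlib
import tempfile

READS = []  # names of the nodes whose files were read, in order


class ChainNode:
    """Hypothetical node whose value is stored in <name>.json."""

    def __init__(self, directory, name, dep=None, lazy=True):
        self.path = pathlib.Path(directory) / f"{name}.json"
        self.name = name
        self.dep = dep
        self._number = None
        if not lazy:
            self._load()  # eager: read the file immediately

    def _load(self):
        READS.append(self.name)
        self._number = json.loads(self.path.read_text())

    @property
    def number(self):
        if self._number is None:
            self._load()  # lazy: read the file on first access
        return self._number


def build_chain(directory, length, lazy):
    """Create node_0 <- node_1 <- ... and return the last node."""
    node = None
    for index in range(length):
        node = ChainNode(directory, f"node_{index}", dep=node, lazy=lazy)
    return node


with tempfile.TemporaryDirectory() as tmp:
    for index in range(5):
        (pathlib.Path(tmp) / f"node_{index}.json").write_text(json.dumps(index))

    build_chain(tmp, 5, lazy=False)
    eager_reads = len(READS)  # every file was read during construction

    READS.clear()
    last = build_chain(tmp, 5, lazy=True)
    lazy_reads_at_build = len(READS)  # no file has been read yet
    value = last.number
    lazy_reads_after_access = list(READS)  # only the accessed node was read
```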

We can check at an arbitrary depth of the dependency chain that both instances yield the same value.

In [17]:
add_one_lazy.deps.deps.deps.deps.deps.deps.deps.number

Loading data from files for RandomNumber
Loading data from files for AddOne_0
Loading data from files for AddOne_1
Loading data from files for AddOne_2
Loading data from files for AddOne_3
Loading data from files for AddOne_4
Loading data from files for AddOne_5
Loading data from files for AddOne_6
Loading data from files for AddOne_7
Loading data from files for AddOne_8
Loading data from files for AddOne_9


93

In [18]:
add_one.deps.deps.deps.deps.deps.deps.deps.number

93

## Notes
When using ZnTrack to compare data from different versions, it is important either not to use `lazy=True` or to load the data manually before loading another version of the data.
In the following example we run `dvc repro` for three different experiments, load the results with and without `lazy=True`, and compare them.

In [19]:
RandomNumber(start=0, stop=5000).write_graph(run=True)
random_number_lazy_1 = RandomNumber.from_rev(lazy=True)
random_number_1 = RandomNumber.from_rev(lazy=False)

RandomNumber(start=0, stop=5001).write_graph(run=True)
random_number_lazy_2 = RandomNumber.from_rev(lazy=True)
random_number_2 = RandomNumber.from_rev(lazy=False)

RandomNumber(start=0, stop=5002).write_graph(run=True)
random_number_lazy_3 = RandomNumber.from_rev(lazy=True)
random_number_3 = RandomNumber.from_rev(lazy=False)

Running DVC command: 'stage add --name RandomNumber --force ...'


Modifying stage 'RandomNumber' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'repro RandomNumber'


Running stage 'RandomNumber':
> zntrack run src.RandomNumber.RandomNumber --name RandomNumber
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
Loading data from files for RandomNumber


Running DVC command: 'stage add --name RandomNumber --force ...'


Modifying stage 'RandomNumber' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'repro RandomNumber'


Running stage 'RandomNumber':
> zntrack run src.RandomNumber.RandomNumber --name RandomNumber
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
Loading data from files for RandomNumber


Running DVC command: 'stage add --name RandomNumber --force ...'


Modifying stage 'RandomNumber' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


[NbConvertApp] Converting notebook 09_lazy.ipynb to script
[NbConvertApp] Writing 5527 bytes to 09_lazy.py
Running DVC command: 'repro RandomNumber'


Running stage 'RandomNumber':
> zntrack run src.RandomNumber.RandomNumber --name RandomNumber
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
Loading data from files for RandomNumber


In [20]:
# With lazy=True we get the same number for every run, which is not what we expect.
print(
    f"{random_number_lazy_1.number} == {random_number_lazy_2.number} =="
    f" {random_number_lazy_3.number}"
)
assert random_number_lazy_1.number == random_number_lazy_2.number
assert random_number_lazy_1.number == random_number_lazy_3.number

Loading data from files for RandomNumber
Loading data from files for RandomNumber
Loading data from files for RandomNumber
4780 == 4780 == 4780


In [21]:
# With lazy=False we get the results we expect. 
# (Except for some random scenarios, where two random numbers are the same.)
print(f"{random_number_1.number} != {random_number_2.number} != {random_number_3.number}")
assert random_number_1.number != random_number_2.number
assert random_number_1.number != random_number_3.number

1638 != 3575 != 4780


You can "lock" a value in place (load it into memory) by accessing it, e.g. through `_ = random_number_lazy_1.number`, before running the next experiment. This way you load only the values you want to compare and still keep the benefit of `lazy=True` for everything else.
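The pitfall and the "lock" workaround can be reproduced in a few lines of plain Python. This sketch assumes the cause is that every run overwrites the same output file, so a lazy instance that has not been accessed yet reads whatever is on disk at access time. `LazyResult` is a hypothetical stand-in, not a ZnTrack class, and the two numbers are reused from the output above purely as sample values.

```python
import json
import pathlib
import tempfile


class LazyResult:
    """Hypothetical stand-in for a lazily loaded node output."""

    def __init__(self, path):
        self.path = pathlib.Path(path)
        self._number = None

    @property
    def number(self):
        if self._number is None:
            # Read whatever is on disk at the time of the first access.
            self._number = json.loads(self.path.read_text())["number"]
        return self._number


with tempfile.TemporaryDirectory() as tmp:
    outs = pathlib.Path(tmp) / "outs.json"

    outs.write_text(json.dumps({"number": 1638}))  # first run
    unlocked = LazyResult(outs)  # created, but not read yet
    locked = LazyResult(outs)
    _ = locked.number            # "lock" the value by accessing it

    outs.write_text(json.dumps({"number": 3575}))  # second run overwrites the file
    latest = LazyResult(outs)

    unlocked_value = unlocked.number  # reads the file now: second run's value
    locked_value = locked.number      # still the first run's value
    latest_value = latest.number
```

Only the instance that was accessed before the second run keeps the first result; the untouched lazy instance silently reports the newer value.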

In [22]:
temp_dir.cleanup()