# Overview

PyTrack is designed as an object-oriented mapper for [DVC](https://dvc.org/).
DVC tracks large data files within a Git repository.
All PyTrack instances are therefore executed inside a Git repository.
Furthermore, DVC provides methods for building a dependency graph, tracking parameters, comparing metrics and querying multiple runs.

**Why does it need an object-oriented mapper?**

DVC provides all this functionality, but it is designed to be independent of any programming language. PyTrack is designed purely for building Python packages and is optimized for that purpose.

## Stages


DVC organizes its pipeline in multiple stages (see https://dvc.org/doc/start for more information).
In PyTrack, every stage is a class decorated with `pytrack.PyTrack`. The following cells first clear the working directory, initialize Git and DVC, and then define a first stage.

In [26]:
!rm -rf *
!rm -rf .*

rm: refusing to remove '.' or '..' directory: skipping '.'
rm: refusing to remove '.' or '..' directory: skipping '..'


In [2]:
!git init
!dvc init

Initialized empty Git repository in /tikhome/fzills/PycharmProjects/py-track/docs/source/Overview/01_Intro/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <

In [3]:
from pytrack import PyTrack, config

config.nb_name = "01_Intro.ipynb"

@PyTrack()
class Stage0:
    def run(self):
        pass

Submit issues to https://github.com/zincware/py-track.


[NbConvertApp] Converting notebook 01_Intro.ipynb to script
[NbConvertApp] Writing 8652 bytes to 01_Intro.py


Every DVC stage is organized as a Python class. The class must implement a `run` method, which is the entry point for the computation executed by DVC. If the class does not provide a `__call__` method, one is created automatically. When the stage is called, the DVC file is created and the parameters are passed on.
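How a decorator can add a `__call__` method only when the class does not define one can be sketched in plain Python. The decorator `stage` and the attribute `registered` below are hypothetical names chosen for illustration; this is not PyTrack's implementation.

```python
def stage(cls):
    """Hypothetical stand-in for a stage decorator, not PyTrack's code."""
    # hasattr(cls, "__call__") is always True for classes, because the
    # metaclass makes the class itself callable. We therefore look into the
    # class bodies along the MRO instead.
    defines_call = any("__call__" in vars(base) for base in cls.__mro__)
    if not defines_call:
        def default_call(self):
            # A real tool would write the stage definition at this point.
            self.registered = True
        cls.__call__ = default_call
    return cls


@stage
class Stage0:
    def run(self):
        pass


@stage
class WithCall:
    def __call__(self, maximum):
        self.maximum = maximum

    def run(self):
        pass


stage_0 = Stage0()
stage_0()            # uses the generated __call__
custom = WithCall()
custom(maximum=10)   # keeps the user-defined __call__
```

In this sketch a user-defined `__call__` is never overwritten, so a stage can accept its own arguments.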

In [4]:
stage_0 = Stage0()
stage_0()
!tree

2021-10-15 15:36:55,642 (INFO): Creating 'dvc.yaml'
Adding stage 'Stage0' in 'dvc.yaml'

To track the changes with git, run:

	git add outs/.gitignore dvc.yaml

.
├── 01_Intro.ipynb
├── config
│   └── pytrack.json
├── dvc.yaml
├── outs
└── src
    └── Stage0.py

3 directories, 4 files


As shown, a file `config/pytrack.json` is created, which stores the parameters of all stages. In addition, a directory `outs` is added. When working in a Jupyter Notebook, a directory `src` is added as well; it contains a `*.py` file converted from the notebook. These directories and files are managed by PyTrack. The file `dvc.yaml` is created by calling DVC in the background and is used by DVC to organize the pipelines.

We can now use `dvc repro` to execute our code, which does not do anything yet.

In [5]:
!dvc repro

Running stage 'Stage0':
> python3 -c "from src.Stage0 import Stage0; Stage0(load=True).run()"
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.

### PyTrack Results
We can see that the stage runs without issues.
Unfortunately, the stage we just created does not do anything.
In our first example we would like to create a random number in our stage and save it.
We can do this with `pytrack.DVC.result`, a special type of DVC outs file that is managed by PyTrack.
It is defined as a class-level attribute.
This is similar to a Python `@property`, in that `__get__` and `__set__` have custom handling assigned to them.
In contrast to `@property`, we do not need to write the getter and setter ourselves.
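The idea of a class-level attribute with custom `__get__` and `__set__` handling can be sketched with a plain Python descriptor. The class `Result`, the JSON file layout and the `directory` argument are assumptions made for this illustration and do not reflect how PyTrack stores its results.

```python
import json
import tempfile
from pathlib import Path


class Result:
    """Hypothetical descriptor that keeps its value in a JSON file."""

    def __set_name__(self, owner, name):
        self.name = name

    def _file(self, instance):
        return instance.directory / f"{type(instance).__name__}.json"

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class itself
        file = self._file(instance)
        if not file.exists():
            return None  # nothing has been computed yet
        return json.loads(file.read_text())[self.name]

    def __set__(self, instance, value):
        file = self._file(instance)
        data = json.loads(file.read_text()) if file.exists() else {}
        data[self.name] = value
        file.write_text(json.dumps(data))


class RandomNumber:
    number = Result()

    def __init__(self, directory):
        self.directory = Path(directory)

    def run(self):
        self.number = 4  # fixed value to keep the sketch deterministic


with tempfile.TemporaryDirectory() as tmp:
    before = RandomNumber(tmp).number   # None, run() has not been called
    RandomNumber(tmp).run()
    after = RandomNumber(tmp).number    # read back by a fresh instance
```

Because the value lives in a file rather than on the instance, a freshly created object can read what an earlier one wrote.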

In [6]:
from pytrack import DVC
from random import randrange

@PyTrack()
class RandomNumer:
    number = DVC.result()
    
    def run(self):
        self.number = randrange(10)

random_number = RandomNumer()
random_number()

2021-10-15 15:37:04,445 (INFO): Adding stage 'RandomNumer' in 'dvc.yaml'

To track the changes with git, run:

	git add outs/.gitignore dvc.yaml



We can access the outcome of our stage by passing `load=True`. Doing this now gives us a warning and simply returns `None`,
because we have not executed the `run` method yet. Again, this is done via `dvc repro`.

In [7]:
RandomNumer(load=True).number



In [8]:
!dvc repro

Running stage 'RandomNumer':
> python3 -c "from src.RandomNumer import RandomNumer; RandomNumer(load=True).run()"
Updating lock file 'dvc.lock'

Stage 'Stage0' didn't change, skipping

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.

Now we can have a look at our result and work with it.

In [9]:
RandomNumer(load=True).number

6

Because we are using DVC, rerunning the stage via `dvc repro` does not trigger a new computation; the cached value is used instead.
How to change this is explained later.
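The principle of skipping an unchanged stage can be illustrated with a small stand-alone sketch. The functions `fingerprint` and `repro` and the in-memory `lock` dictionary are hypothetical; DVC's own bookkeeping in `dvc.lock` is more involved and is not reproduced here.

```python
import hashlib
import json


def fingerprint(stage: dict) -> str:
    """Hash a stage definition (command and parameters) in a stable way."""
    payload = json.dumps(stage, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def repro(stage: dict, lock: dict, run) -> bool:
    """Run the stage only if its fingerprint differs from the locked one."""
    name, current = stage["name"], fingerprint(stage)
    if lock.get(name) == current:
        return False  # unchanged, reuse the cached result
    run()
    lock[name] = current
    return True


calls = []
lock = {}
stage = {
    "name": "RandomNumer",
    "cmd": 'python3 -c "from src.RandomNumer import RandomNumer; '
           'RandomNumer(load=True).run()"',
    "params": {},
}

first = repro(stage, lock, lambda: calls.append("run"))    # runs
second = repro(stage, lock, lambda: calls.append("run"))   # skipped
stage["params"]["maximum"] = 512                           # definition changes
third = repro(stage, lock, lambda: calls.append("run"))    # runs again
```

Only a change to the stage definition leads to a new run in this sketch, which mirrors the "didn't change, skipping" messages shown in the output.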

### PyTrack arguments
Currently our stage always yields a random number in the hard-coded range 0-9.
PyTrack stages become more interesting once custom parameters are introduced. We start by adding a maximum value to our stage.


In [10]:
@PyTrack()
class MaxRandomNumer:
    number = DVC.result()
    maximum = DVC.params()
    
    def __call__(self, maximum):
        self.maximum = maximum
    
    def run(self):
        self.number = randrange(self.maximum)

max_random_number = MaxRandomNumer()
max_random_number(maximum=512)

2021-10-15 15:37:13,309 (INFO): Adding stage 'MaxRandomNumer' in 'dvc.yaml'

To track the changes with git, run:

	git add outs/.gitignore dvc.yaml



In [11]:
!dvc repro

Running stage 'MaxRandomNumer':
> python3 -c "from src.MaxRandomNumer import MaxRandomNumer; MaxRandomNumer(load=True).run()"
Updating lock file 'dvc.lock'

Stage 'RandomNumer' didn't change, skipping
Stage 'Stage0' didn't change, skipping

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.

In [12]:
MaxRandomNumer(load=True).number

448

### Custom Types and Files

For arguments, PyTrack can handle the most basic Python types and also some more complex ones like `pathlib.Path`.
In the following example we use a path as an argument and write data to a custom output file.
For this we can use `DVC.outs`.

In [13]:
from pathlib import Path

@PyTrack()
class WriteToFile:
    filename: Path = DVC.outs()
    
    def __call__(self, filename: Path):
        self.filename = filename
    
    def run(self):
        self.filename.write_text('Lorem Ipsum')
        
    def read_from_file(self):
        print(self.filename.read_text())

write_to_file = WriteToFile()
write_to_file(filename=Path("outs", "example.txt"))

2021-10-15 15:37:22,503 (INFO): Adding stage 'WriteToFile' in 'dvc.yaml'

To track the changes with git, run:

	git add outs/.gitignore dvc.yaml



In [14]:
!dvc repro

Stage 'MaxRandomNumer' didn't change, skipping
Stage 'Stage0' didn't change, skipping
Stage 'RandomNumer' didn't change, skipping
Running stage 'WriteToFile':
> python3 -c "from src.WriteToFile import WriteToFile; WriteToFile(load=True).run()"
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.

A file with our filename has now been created in `outs`. The file can be placed anywhere inside the DVC repository, but using the already existing `outs` directory can be handy.
We can again load the stage, have a look at the filename and read from it.

In [15]:
WriteToFile(load=True).filename

PosixPath('outs/example.txt')

In [16]:
WriteToFile(load=True).read_from_file()

Lorem Ipsum
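Storing a `pathlib.Path` in a parameter file requires some form of type-aware serialization, since JSON has no path type. One possible approach is sketched below; the helper names `encode` and `decode` and the `_type` tag are assumptions for illustration and are not taken from PyTrack.

```python
import json
from pathlib import Path


def encode(value):
    """Turn values that JSON cannot store into tagged dictionaries."""
    if isinstance(value, Path):
        return {"_type": "pathlib.Path", "value": value.as_posix()}
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def decode(obj):
    """Reverse encode() while loading; called for every JSON object."""
    if obj.get("_type") == "pathlib.Path":
        return Path(obj["value"])
    return obj


params = {"filename": Path("outs", "example.txt"), "maximum": 512}
text = json.dumps(params, default=encode)          # Path becomes a tagged dict
restored = json.loads(text, object_hook=decode)    # tagged dict becomes a Path
```

After the round trip, `restored["filename"]` is a `Path` again, so methods such as `write_text` remain available on the loaded value.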


### PyTrack Init

As you may have noticed, we have not defined an `__init__` yet. Arguments are passed to `__call__`, and `PyTrackOptions` (`DVC.<...>`) are defined at class level. The following example illustrates why using `__init__` can lead to confusing results.
To understand it, we need to keep in mind that DVC runs the following command:

    python3 -c "from src.Stage0 import Stage0; Stage0(load=True).run()"

We will use this call to imitate `dvc repro` in the following.

In [17]:
@PyTrack()
class InitStage:
    def __init__(self, value = "Not defined"):
        self.value = value
        
    def run(self):
        print(self.value)

In [18]:
init_stage = InitStage(value='Lorem Ipsum')
init_stage()
print(init_stage.value)

2021-10-15 15:37:31,662 (INFO): Adding stage 'InitStage' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml outs/.gitignore

Lorem Ipsum


In [19]:
InitStage(load=True).run()

Not defined


We can see that the value we passed is not available during the command executed by DVC. This is important to keep in mind when using PyTrack. The issue can easily be solved by using `DVC.params()`. Although it is possible to define them within `__init__`, this should be avoided in favour of class-level definitions. Nevertheless, `__init__` can be used for, e.g., defining attributes or setting a `PyTrackOption`.
We can therefore extend our `MaxRandomNumer` by a constant minimum value in the following way:

In [20]:
@PyTrack()
class InitMaxRandomNumer:
    number = DVC.result()
    maximum = DVC.params()
    
    def __init__(self):
        self.minimum = 0
    
    def __call__(self, maximum):
        self.maximum = maximum
        
    def run(self):
        self.number = randrange(self.minimum, self.maximum)

init_max_random_number = InitMaxRandomNumer()
init_max_random_number(maximum=512)

2021-10-15 15:37:37,769 (INFO): Adding stage 'InitMaxRandomNumer' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml outs/.gitignore



In [21]:
!dvc repro

Running stage 'InitStage':
> python3 -c "from src.InitStage import InitStage; InitStage(load=True).run()"
Not defined
Updating lock file 'dvc.lock'

Stage 'Stage0' didn't change, skipping
Stage 'MaxRandomNumer' didn't change, skipping
Stage 'RandomNumer' didn't change, skipping
Stage 'WriteToFile' didn't change, skipping
Running stage 'InitMaxRandomNumer':
> python3 -c "from src.InitMaxRandomNumer import InitMaxRandomNumer; InitMaxRandomNumer(load=True).run()"
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.

In [22]:
InitMaxRandomNumer(load=True).number

208
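The difference between values passed to `__init__` and stored parameters can be imitated without PyTrack. In the sketch below, the class `Stage`, the file `params.json` and the `config` argument are hypothetical; only the pattern of creating a fresh instance with `load=True` is taken from the command shown above.

```python
import json
import tempfile
from pathlib import Path


class Stage:
    """Hypothetical stage: `maximum` is written to disk, `value` is not."""

    def __init__(self, config: Path, value="Not defined", load=False):
        self.config = config
        self.value = value          # lives only on this instance
        self.maximum = None
        if load:
            # A fresh process only sees what was stored on disk.
            self.maximum = json.loads(config.read_text())["maximum"]

    def __call__(self, maximum):
        self.maximum = maximum
        self.config.write_text(json.dumps({"maximum": maximum}))


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "params.json"
    stage = Stage(config, value="Lorem Ipsum")
    stage(maximum=512)

    loaded = Stage(config, load=True)   # imitates the instance DVC creates
    loaded_value, loaded_maximum = loaded.value, loaded.maximum
```

The loaded instance recovers `maximum` from the file but falls back to the default for `value`, which is the same effect as the "Not defined" output above.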

This is an essential property of PyTrack and differs from most other Python code. The following example therefore DOES NOT work: DVC will try to run `InitMaxRandomNumerWrong(load=True).run()` without passing a value to `maximum`, which results in an error!

```python

@PyTrack()
class InitMaxRandomNumerWrong:
    number = DVC.result()
    maximum = DVC.params()
    
    def __init__(self, maximum):
        self.minimum = 0
        self.maximum = maximum
        
    def run(self):
        self.number = randrange(self.minimum, self.maximum)
```
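The reason for the failure can be reproduced in plain Python. The classes `Wrong` and `Tolerant` below are simplified, hypothetical stand-ins that accept `load` directly in `__init__`; how PyTrack itself handles the `load` argument is not shown here.

```python
class Wrong:
    def __init__(self, maximum, load=False):   # `maximum` is required
        self.maximum = maximum


class Tolerant:
    def __init__(self, maximum=None, load=False):
        if maximum is not None:
            self.maximum = maximum


try:
    Wrong(load=True)          # instantiation without user arguments
    error = None
except TypeError as err:
    error = str(err)          # Python complains about the missing argument

loaded = Tolerant(load=True)  # works because `maximum` is optional
```

Any required `__init__` argument makes an instantiation without user arguments fail, while an optional argument with a `None` default does not.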

What does work is the following version. For code clarity, however, it should be avoided if possible and `__call__` should be used instead.
Sometimes a combined approach is inevitable, e.g., when a generated value has to be passed upon class instantiation and a user value later.

In [23]:
@PyTrack()
class InitMaxRandomNumerCheaty:
    number = DVC.result()
    maximum = DVC.params()
    
    def __init__(self, maximum = None):
        self.minimum = 0
        if maximum is not None:
            self.maximum = maximum
        
    def run(self):
        self.number = randrange(self.minimum, self.maximum)

In [24]:
init_max_random_number_cheaty = InitMaxRandomNumerCheaty(maximum=4096)
init_max_random_number_cheaty()
!dvc repro

2021-10-15 15:37:47,448 (INFO): Adding stage 'InitMaxRandomNumerCheaty' in 'dvc.yaml'

To track the changes with git, run:

	git add outs/.gitignore dvc.yaml

Stage 'Stage0' didn't change, skipping
Running stage 'InitMaxRandomNumerCheaty':
> python3 -c "from src.InitMaxRandomNumerCheaty import InitMaxRandomNumerCheaty; InitMaxRandomNumerCheaty(load=True).run()"
Updating lock file 'dvc.lock'

Stage 'InitStage' didn't change, skipping
Stage 'RandomNumer' didn't change, skipping
Stage 'InitMaxRandomNumer' didn't change, skipping
Stage 'WriteToFile' didn't change, skipping
Stage 'MaxRandomNumer' didn't change, skipping

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.

In [25]:
InitMaxRandomNumerCheaty(load=True).number

723