In [1]:
%%sh
# Small cleanup for reproducibility
rm -rf /tmp/dds

## Tutorial

The `dds` package solves the data integration problem in data science codebases. By using the `dds` package, you can safely assume that:

 - data consumed or produced is up to date with the current code
 - if a piece of data (machine learning models, datasets, ...) has already been calculated for the current version of the code, it is reused immediately instead of being recomputed, which can dramatically speed up a run

`dds` works by inspecting Python code and checking against a central store whether its output has already been calculated. In that sense, it can be thought of as a smart caching system that detects when it needs to rerun calculations.

In order to work, `dds` needs two pieces of information:

 - where to store all the pieces of data (called blobs in `dds` jargon) that have already been calculated. By default, this is `/tmp/dds/internal` (or the equivalent for your operating system).
 - where to store all the paths that are requested for evaluation. By default, this is `/tmp/dds/data`.
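The split between a blob store and a path store can be pictured with a small sketch. This is an illustration under assumptions, not the `dds` storage code: it works in a throwaway temporary directory, the directory names are placeholders chosen for this sketch, and the helper `publish` is hypothetical. It relies on symbolic links, so it assumes a POSIX-like system.

```python
import hashlib
import os
import tempfile

root = tempfile.mkdtemp()  # throwaway stand-in for the store root
blob_dir = os.path.join(root, "internal", "blobs")  # content stored by key
path_dir = os.path.join(root, "data")               # human-readable paths
os.makedirs(blob_dir)
os.makedirs(path_dir)


def publish(path, key, content):
    """Write content under its key in the blob store, then link the path to it."""
    blob = os.path.join(blob_dir, key)
    if not os.path.exists(blob):
        with open(blob, "w") as handle:
            handle.write(content)
    link = os.path.join(path_dir, path.lstrip("/"))
    if os.path.lexists(link):
        os.remove(link)  # repoint the path if it was linked to an older blob
    os.symlink(blob, link)
    return link


# The key here is just a hash of a placeholder string, not a real dds signature.
key = hashlib.sha256(b"hello_world source").hexdigest()
link = publish("/hello_world", key, "Hello, world")
```

Keeping content under a key and exposing it through a link means that several paths, or several versions of the same path over time, can point to blobs without copying them.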
 
Here is a simple "Hello world" example in `dds`:

In [2]:
import dds

@dds.dds_function("/hello_world")
def hello_world():
    print("hello_world() has been called")
    return "Hello, world"

hello_world()

hello_world() has been called


'Hello, world'

When we called the function, a few things happened:

 - `dds` calculated a unique fingerprint for this function and checked whether a blob was already associated with this fingerprint in its storage
 - since this is the first run, the function was executed and its result was stored in the blob store
 - also, because the output is associated with a path (`/hello_world`), the path `/hello_world` was filled with the content of the output.

We can in fact see all these outputs in the default store. Here is the file newly created with our welcoming content:

In [3]:
! cat /tmp/dds/data/hello_world

Hello, world

But that file is just a link to the blob stored under the unique signature associated with this piece of code:

In [4]:
! readlink -f /tmp/dds/data/hello_world

/tmp/dds/internal/blobs/7f96a7c7b03977a1b296381ce9ecdcd6a378e6f891ef7da1433d2bfb1f747f8c



The function prints a message whenever it executes. If we call it again, its body does not run because the code has not changed, and the stored result is returned directly:

In [5]:
hello_world()

'Hello, world'

In fact, because `dds` looks at the source code, if you redefine the function with the same content, it still does not recompute:

In [6]:
@dds.dds_function("/hello_world")
def hello_world():
    print("hello_world() has been called")
    return "Hello, world"

hello_world()

'Hello, world'

Functions can include arbitrary dependencies, as shown with this example. The function `f` has a dependency on an extra variable:



In [7]:
my_var = 1

@dds.dds_function("/f")
def f():
    print("Calling f")
    return my_var

f()

Calling f


1

If we call the function again, its body is not executed again, as we saw before:

In [8]:
f()

1

However, if we change any dependency of the function, such as `my_var`, then the function will get evaluated again:

In [9]:
my_var = 2
f()

Calling f


2

Interestingly, if we change the variable again to its previous value, the function does not get evaluated again! The signature of the function will match a signature that was calculated before, hence there is no need to recompute it.

In [10]:
my_var = 1
f()

1