In [1]:
%%sh
# Small cleanup for reproducibility
rm -rf /tmp/dds
echo "world" > "input.txt"

## Tutorial

The `dds` package solves the data integration problem in data science codebases. By using the `dds` package, you can safely assume that:

 - data consumed or produced is up to date with the current code
 - if a piece of data (machine learning models, datasets, ...) has already been calculated for a given version of the code, it is reused immediately instead of being recomputed, which can make runs much faster

`dds` works by inspecting Python code and checking against a central store whether its output has already been calculated. In that sense, it can be thought of as a smart caching system that detects when it should rerun calculations.
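
The caching idea can be illustrated with a few lines of standard-library Python. This is only a sketch, not how `dds` is implemented: the helper names `fingerprint` and `cached_call`, the in-memory dictionary standing in for the central store, and the choice of hashing the compiled bytecode are all assumptions made for illustration.

```python
import hashlib

def fingerprint(func):
    """Hash the compiled body of `func`: bytecode, constants and referenced names."""
    code = func.__code__
    payload = repr((code.co_code, code.co_consts, code.co_names)).encode()
    return hashlib.sha256(payload).hexdigest()

# Hypothetical in-memory stand-in for the central store.
_store = {}

def cached_call(func):
    """Run `func` only if no result is stored under its fingerprint."""
    key = fingerprint(func)
    if key not in _store:
        _store[key] = func()
    return _store[key]

calls = []

def greet():
    calls.append("run")  # records each real execution
    return "Hello, world"

first = cached_call(greet)   # executes greet and stores the result
second = cached_call(greet)  # same fingerprint: served from the store
```

After both calls, `calls` contains a single entry, because the second call found a stored result under the same fingerprint.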

In order to work, `dds` needs two pieces of information:

 - where to store all the pieces of data (called blobs in `dds` jargon) that have already been calculated. By default this is `/tmp/dds/internal` (or the equivalent for your operating system)
 - where to store all the paths that are being requested for evaluation. By default this is `/tmp/dds/data`.
 
Here is a simple "Hello world" example in `dds`:

In [2]:
import dds

@dds.dds_function("/hello_world")
def hello_world():
    print("hello_world() has been called")
    return "Hello, world"

hello_world()

hello_world() has been called


'Hello, world'

When we called the function, a few things happened:

 - `dds` calculated a unique fingerprint for this function and checked whether a blob was already associated with this fingerprint in its storage
 - since this is the first run, the function was executed and its result was saved in the store
 - also, because the output is associated with a path (`/hello_world`), the path `/hello_world` was filled with the content of the output.

We can in fact see all these outputs in the default store. Here is the file newly created with our welcoming content:

In [3]:
! cat /tmp/dds/data/hello_world

Hello, world

But that file is just a link to a blob named after the unique signature associated with this piece of code:

In [4]:
! readlink -f /tmp/dds/data/hello_world

/tmp/dds/internal/blobs/7f96a7c7b03977a1b296381ce9ecdcd6a378e6f891ef7da1433d2bfb1f747f8c
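
The layout shown above (a readable path that is a link to a blob named after a signature) can be imitated with the standard library. The sketch below works in a throwaway temporary directory rather than the real `dds` store, and the way the signature is computed here (a hash of a placeholder string) is purely an assumption for illustration. It relies on symbolic links, so it assumes a POSIX-like system.

```python
import hashlib
import os
import tempfile

# Throwaway directory that mimics the two areas of the store.
root = tempfile.mkdtemp()
blob_dir = os.path.join(root, "internal", "blobs")
data_dir = os.path.join(root, "data")
os.makedirs(blob_dir)
os.makedirs(data_dir)

# Placeholder signature: in this sketch it is just a hash of a fixed string.
signature = hashlib.sha256(b"source code of hello_world").hexdigest()

# Write the output once, under a file name equal to the signature.
blob_path = os.path.join(blob_dir, signature)
with open(blob_path, "wb") as blob:
    blob.write(b"Hello, world")

# Expose the blob under a human-readable path through a symbolic link.
link_path = os.path.join(data_dir, "hello_world")
os.symlink(blob_path, link_path)

with open(link_path, "rb") as linked:
    linked_content = linked.read()
resolved = os.path.realpath(link_path)
```

Reading `link_path` returns the blob's content, and `resolved` points into the `internal/blobs` directory, which mirrors what `cat` and `readlink -f` show above.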



This function prints a message whenever it executes. Now, if we try to run it again, it will actually not run, because the code has not changed.

In [5]:
hello_world()

'Hello, world'

In fact, because `dds` looks at the source code, if you redefine the function with the same content, it still does not recompute:

In [6]:
@dds.dds_function("/hello_world")
def hello_world():
    print("hello_world() has been called")
    return "Hello, world"

hello_world()

'Hello, world'

Functions can include arbitrary dependencies, as shown with this example. The function `f` has a dependency on an extra variable:



In [7]:
my_var = 1

@dds.dds_function("/f")
def f():
    print("Calling f")
    return my_var

f()

Calling f


1

If we call the function again, as seen before, the function does not get called again:

In [8]:
f()

1

However, if we change any dependency of the function, such as `my_var`, then the function will get evaluated again:

In [9]:
my_var = 2
f()

Calling f


2

Interestingly, if we change the variable again to its previous value, the function does not get evaluated again! The signature of the function will match a signature that was calculated before, hence there is no need to recompute it.

In [10]:
my_var = 1
f()

1

This mechanism covers all the basic structures in Python (functions, dictionaries, lists, basic types, ...).
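
To make the dependency tracking concrete, here is a toy version of a signature that also takes into account the current value of each global variable the function reads. It is a sketch under stated assumptions, not the `dds` algorithm: the names `signature`, `cached_call`, `store` and `misses` are hypothetical, and the code is meant to run at module level so that `my_var` is a true global.

```python
import hashlib

def signature(func):
    """Hash the bytecode plus the current value of every global the function names."""
    code = func.__code__
    globals_used = {
        name: func.__globals__[name]
        for name in code.co_names
        if name in func.__globals__
    }
    payload = repr((code.co_code, code.co_consts, sorted(globals_used.items())))
    return hashlib.sha256(payload.encode()).hexdigest()

store = {}   # hypothetical stand-in for the blob store
misses = []  # one entry per real execution

def cached_call(func):
    key = signature(func)
    if key not in store:
        misses.append(key)
        store[key] = func()
    return store[key]

my_var = 1

def f():
    return my_var

a = cached_call(f)  # first signature: executed
my_var = 2
b = cached_call(f)  # dependency changed: executed again
my_var = 1
c = cached_call(f)  # signature seen before: not executed
```

Three calls produce only two executions, because the third call computes the same signature as the first one.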

A function that is annotated with a `dds` annotation is called a _data function_. It is a function that has not only a name in code but also a data path associated with it, and for which the output is captured and stored in a data system.

The `dds_function` annotation requires little code change, but it only works for functions that take no arguments. How do we deal with more complicated functions?
This is the subject of the next section.

## Functions with arguments: keep() and eval()

`dds` can also wrap functions that have arguments using the `dds.keep()` function. Here is a simple example, in which the `hello` function expects an extra word to be provided:

In [11]:
def hello(name):
    print(f"Calling function hello on {name}")
    return f"Hello, {name}"

greeting = hello("world")
greeting

Calling function hello on world


'Hello, world'

In order to capture a specific call to this function with `dds`, the function call has to be wrapped with the `dds.keep` function:

In [12]:
greeting = dds.keep("/greeting", hello, "world")
greeting

Calling function hello on world


'Hello, world'

Again, try changing the argument of the function to see when the function is actually called. This substitution can be made everywhere `hello("world")` was called. The `dds.keep` call can also be placed inside a separate wrapper function; this is in fact how the decorator `dds_function` works.

This construct works well if the arguments can be summarized into a signature. It will fail for complex objects such as files, because `dds` needs to understand basic information about the input of a function to decide whether it has changed. As an example:

In [13]:
def hello_from_file(file):
    name = file.readline().strip()
    print("Calling hello_from_file")
    return f"Hello, {name}"

f = open("input.txt", "r")
hello_from_file(f)

Calling hello_from_file


'Hello, world'

In [14]:
# This line will trigger a NotImplementedError
# dds.keep("/greeting", hello_from_file, open("input.txt", "r"))
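
The restriction can be pictured with a small helper that summarizes an argument into a signature and refuses anything it does not know how to summarize. This is a hedged sketch: `arg_signature` is a hypothetical name, and the list of accepted types is an assumption chosen for the example, not the list that `dds` supports.

```python
import hashlib
import io

def arg_signature(value):
    """Return a stable hash for simple values and refuse everything else."""
    if isinstance(value, (str, int, float, bool, type(None))):
        payload = f"{type(value).__name__}:{value!r}"
    elif isinstance(value, (list, tuple)):
        payload = "seq:" + ",".join(arg_signature(v) for v in value)
    elif isinstance(value, dict):
        items = sorted((repr(k), arg_signature(v)) for k, v in value.items())
        payload = "dict:" + repr(items)
    else:
        # A file-like object has no obvious, cheap summary of its content.
        raise NotImplementedError(f"cannot summarize {type(value).__name__}")
    return hashlib.sha256(payload.encode()).hexdigest()

same = arg_signature("world") == arg_signature("world")
different = arg_signature("world") != arg_signature("there")

try:
    arg_signature(io.StringIO("world"))  # stands in for an open file
    refused = False
except NotImplementedError:
    refused = True
```

Simple values map to a reproducible signature, while the file-like object is rejected because nothing in its type says how to detect a change in its content.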

How do we still use files? `dds` does not need to understand the content passed to a function _if_ it is called as a sub-function within `dds`. More concretely in this example, we can create a wrapper function that contains the file call and the call to the function to keep:

In [15]:
def wrapper_hello():
    f = open("input.txt", "r")
    print(f"Opening file {f}")
    greeting = dds.keep("/greeting", hello_from_file, f)
    return greeting

dds.eval(wrapper_hello)

Opening file <_io.TextIOWrapper name='input.txt' mode='r' encoding='UTF-8'>
Calling hello_from_file


'Hello, world'

Calling the function again shows that:

 - we still open the file: the content of `wrapper_hello` is still executed.
 - `hello_from_file` is not called again: even though we pass a file to it, the source code that produces the arguments is unchanged and `hello_from_file` itself is unchanged, so `dds` assumes that the resulting `greeting` is the same.

As a result, `wrapper_hello` is run (it is just `eval`uated), but all the sub-calls to data functions are going to be cached.

In [16]:
dds.eval(wrapper_hello)

Opening file <_io.TextIOWrapper name='input.txt' mode='r' encoding='UTF-8'>


'Hello, world'
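
The behaviour observed in the last two cells can be modelled with a toy cache whose key is built from the code of the wrapper and of the kept function, never from the file object itself. This is only an illustration under that assumption: `keep_in` and `code_fingerprint` are hypothetical helpers, an `io.StringIO` stands in for the opened file, and the real `dds.keep` and `dds.eval` may work differently internally.

```python
import hashlib
import io

def code_fingerprint(func):
    """Hash the compiled body of a function."""
    code = func.__code__
    payload = repr((code.co_code, code.co_consts, code.co_names)).encode()
    return hashlib.sha256(payload).hexdigest()

store = {}  # hypothetical stand-in for the blob store
log = []    # records which functions really ran

def keep_in(wrapper, path, func, *args):
    """Cache func(*args) under a key derived from code only, not from args."""
    key = (path, code_fingerprint(wrapper), code_fingerprint(func))
    if key not in store:
        store[key] = func(*args)
    return store[key]

def hello_from_file(file):
    log.append("hello_from_file")
    name = file.readline().strip()
    return f"Hello, {name}"

def wrapper_hello():
    log.append("wrapper_hello")
    file = io.StringIO("world\n")  # stands in for open("input.txt")
    return keep_in(wrapper_hello, "/greeting", hello_from_file, file)

first = wrapper_hello()
second = wrapper_hello()
```

The wrapper runs on both calls, but `hello_from_file` runs only once, matching the outputs of the two `dds.eval(wrapper_hello)` cells above.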

In conclusion, `dds` provides three basic functions to track and cache pieces of data:

 * `dds_function` is an annotation for functions that take no arguments and return a piece of data that should be tracked
 * `keep` is a function that wraps function calls. It can be used standalone when the function uses basic types as arguments.
 * `eval` is used in conjunction with `keep` when data functions take complex arguments.
 
By building on these foundations, `dds` allows you to do many more things such as visualizing all the dependencies between data, speeding up Machine Learning pipelines, and parallelizing your code automatically. The other tutorials provide more information.