Description
> - what do you do for a living?
> - oh, different things, import, export, family business

(any Hollywood smuggler, and now a dvc dev)
This feature was first proposed by @dmpetrov and has been discussed several times here and there. Recent conversations/debates with @dmpetrov and @shcheklein resulted in a more or less defined approach, which I will lay out here.
The need: we want things to be easier to reuse across repos, so we want a higher-level API than just get()/read()/open(): some call that returns a ready-to-use dataframe, an instantiated model, etc.
The design
So here goes the design so far. The exporting side will need to list the available things and write some instantiating glue code, like:
```yaml
# dvc-artifacts.yml
artifacts:
  - name: face_rec_model
    description: Tensorflow face recognition model
    call: export.make_face_rec
    deps: ["data/face_rec_model.dat", "data/extra.dat"]
```

```python
# export.py
import tflearn

def make_face_rec(face_rec_model_data, extra_data):
    # ... construct a model
    return model
```

Then the importing side might simply do:
```python
import dvc.api

model = dvc.api.summon("face_rec_model", "https://path/to/repo", rev=...)
model.predict(...)
```

"Summon" is a good (to my liking) and short name, referring to summoning a genie or a Tesla :)
This will provide a flexible base to export/summon anything. Some questions arise about requirements, repetitive glue code, Python/R interoperability, and easing things for a publisher. I will go through these below.
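To make the flow concrete, here is a minimal sketch of what the resolution behind summon() could look like. Everything here is hypothetical, not real dvc API: `spec` stands for the parsed dvc-artifacts.yml contents, and `fetch` stands in for dvc downloading a dep and returning its local path.

```python
import importlib


def summon(name, spec, fetch):
    """Hypothetical resolution flow behind dvc.api.summon().

    `spec` is the parsed dvc-artifacts.yml (a dict), `fetch` stands in
    for dvc fetching a dep file and returning its local path.
    """
    # find the published artifact entry by name
    entry = next(a for a in spec["artifacts"] if a["name"] == name)

    # "export.make_face_rec" -> import export, look up make_face_rec
    module_name, _, func_name = entry["call"].rpartition(".")
    func = getattr(importlib.import_module(module_name), func_name)

    # deps become local paths and are passed to the glue function
    # as positional arguments
    return func(*(fetch(dep) for dep in entry["deps"]))
```

A real implementation would also honor `rev` and cache downloads, but the shape stays the same: look up the entry, import the `call` target, feed it the deps.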
Requirements
Instantiating a model or a dataframe will require some libraries installed, like pandas or PyTorch, possibly even particular versions of them. In the basic scenario we might ignore that: in most cases a user will know what this is about and will have them installed already. The thing is, we can't provide requirements seamlessly, since installing Python libs inside a dvc.api.summon() call would be surprising and might break things, so deps management will be separate anyway. A possible way:
```yaml
artifacts:
  - name: some_dataframe
    ...
    reqs:
      - pandas>=1.2.3,<2
      - otherlib>=1.2
```

```console
$ dvc summon-reqs <repo-url> <name>
pandas>=1.2.3,<2
otherlib>=1.2
$ dvc summon-reqs <repo-url> <name> | xargs pip install
```

Glue code
There are many common scenarios, like exporting a CSV file as a dataframe, which will produce repetitive glue code. This could be handled by providing common export functions in our dvcx (for "dvc export") library:
```yaml
artifacts:
  - name: some_dataframe
    call: dvcx.csv_to_df
    deps: ["path/to/some.csv"]
    reqs:
      - dvcx
      - pandas
```

In this particular scenario one might even use pandas.read_csv directly. There should be a way to parametrize those:
```yaml
artifacts:
  - name: some_dataframe
    call: pandas.read_csv
    args:
      sep: ";"
      true_values: ["yes", "oui", "ja"]
    deps: ["path/to/some.csv"]
    reqs:
      - dvcx
      - pandas
```

Note that deps are passed as positional arguments here.
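Under that convention, summoning some_dataframe reduces to an ordinary call where deps supply the positional arguments and args the keyword arguments. A self-contained illustration (the inline CSV content is made up just so the snippet runs):

```python
import pandas

# stand-in for the dep "path/to/some.csv" having been fetched locally
with open("some.csv", "w") as f:
    f.write("city;visited\nParis;oui\nOslo;non\n")

# what `call: pandas.read_csv` expands to:
# deps -> positional arguments, args -> keyword arguments
some_dataframe = pandas.read_csv(
    "some.csv",                       # from deps
    sep=";",                          # from args
    true_values=["yes", "oui", "ja"],
)
```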
Python/R interoperability
This is a big concern for @dmpetrov. Summoning across languages will mostly work for simpler cases like dataframes; there is probably no practical way to make PyTorch models usable in R, though there might be for simpler models like linear regression. To solve those cases I suggest providing a dvcx library for R that shares function and param names, so that the same yaml file will work for R too. To my mind this is a 2nd or even 3rd layer on top of the basic summon functionality, so it might be added later.
Making it easier for a publisher
Raised by @shcheklein. All these additional yaml/python files might stop someone from publishing things. I don't see this as an issue, since:
- making it easier for the publisher would make it harder for a user,
- a yaml file like the above is probably the easiest way to state what is meant to be reused from outside; the alternative is plain text (more on this below),
- we already have a way to use things not explicitly published, with .read()/.open()/.get().
So the "plain text alternative": one might just write in a readme "I have this and this" and provide snippets like:

```python
import dvc.api
import pandas

with dvc.api.open("cities.csv", repo="http://path/to/repo") as f:
    cities = pandas.read_csv(f)
```

This is easy to use, does not require any new concepts, and takes no additional effort on our side.
What's next?
In my opinion the basic summon functionality might be implemented in the near future, as it has a high value/effort ratio and a more or less clear design.
Still, all of this has only been discussed in my 1-1s with @dmpetrov and @shcheklein, so input from the rest of the team would be highly desirable. @efiop @MrOutis @pared @jorgeorpinel, please join.