Description
> - what do you do for a living?
> - oh, different things, import, export, family business

(any Hollywood smuggler, and now a dvc dev)
This feature was first proposed by @dmpetrov and has been discussed several times here and there. Recent conversations/debates with @dmpetrov and @shcheklein resulted in a more or less defined approach, which I will lay out here.
The need: we want things to be easier to reuse across repos, so we want a higher-level API than just get()/read()/open(): some call that returns a ready-to-use dataframe, an instantiated model, etc.
The design
So here goes the design so far. The exporting side will need to list the available things and write some instantiating glue code, like:
```yaml
# dvc-artifacts.yml
artifacts:
  - name: face_rec_model
    description: Tensorflow face recognition model
    call: export.make_face_rec
    deps: ["data/face_rec_model.dat", "data/extra.dat"]
```

```python
# export.py
import tflearn

def make_face_rec(face_rec_model_data, extra_data):
    # ... construct a model
    return model
```

Then the importing side might simply do:
```python
import dvc.api

model = dvc.api.summon("face_rec_model", "https://path/to/repo", rev=...)
model.predict(...)
```

"Summon" is a good (to my liking) and short name, referring to summoning a genie or a Tesla :)
This will provide a flexible base to export/summon anything. Some questions arise about requirements, repetitive glue code, Python/R interoperability, and easing things for a publisher. I will go through these below.
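To make the flow concrete, here is a minimal sketch of what the resolution behind summon() could look like. Everything here is hypothetical, not real dvc API: `spec` stands for the parsed dvc-artifacts.yml contents, and `fetch` stands in for dvc downloading a dep and returning its local path.

```python
import importlib


def summon(name, spec, fetch):
    """Hypothetical resolution flow behind dvc.api.summon().

    `spec` is the parsed dvc-artifacts.yml (a dict), `fetch` stands in
    for dvc fetching a dep file and returning its local path.
    """
    # find the published artifact entry by name
    entry = next(a for a in spec["artifacts"] if a["name"] == name)

    # "export.make_face_rec" -> import export, look up make_face_rec
    module_name, _, func_name = entry["call"].rpartition(".")
    func = getattr(importlib.import_module(module_name), func_name)

    # deps become local paths and are passed to the glue function
    # as positional arguments
    return func(*(fetch(dep) for dep in entry["deps"]))
```

A real implementation would also honor `rev` and cache downloads, but the shape stays the same: look up the entry, import the `call` target, feed it the deps.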
Requirements
Instantiating a model or a dataframe will require some libraries installed, like pandas or PyTorch, possibly even particular versions of them. In the basic scenario we might ignore that: in most cases a user will know what this is about and will have them installed already. The thing is, we can't provide requirements seamlessly, since installing Python libs inside a dvc.api.summon() call would be surprising and might break things, so deps management will be separate anyway. A possible way:
```yaml
artifacts:
  - name: some_dataframe
    ...
    reqs:
      - pandas>=1.2.3,<2
      - otherlib>=1.2
```

```console
$ dvc summon-reqs <repo-url> <name>
pandas>=1.2.3,<2
otherlib>=1.2
$ dvc summon-reqs <repo-url> <name> | xargs pip install
```

Glue code
There are many common scenarios, like exporting a CSV file as a dataframe, which will produce repetitive glue code. This could be handled by providing common export functions in our dvcx (for "dvc export") library:
```yaml
artifacts:
  - name: some_dataframe
    call: dvcx.csv_to_df
    deps: ["path/to/some.csv"]
    reqs:
      - dvcx
      - pandas
```

In this particular scenario one might even use pandas.read_csv directly. There should be a way to parametrize those:
```yaml
artifacts:
  - name: some_dataframe
    call: pandas.read_csv
    args:
      sep: ";"
      true_values: ["yes", "oui", "ja"]
    deps: ["path/to/some.csv"]
    reqs:
      - dvcx
      - pandas
```

Note that deps are passed as positional arguments here.
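Under that convention, summoning some_dataframe reduces to an ordinary call where deps supply the positional arguments and args the keyword arguments. A self-contained illustration (the inline CSV content is made up just so the snippet runs):

```python
import pandas

# stand-in for the dep "path/to/some.csv" having been fetched locally
with open("some.csv", "w") as f:
    f.write("city;visited\nParis;oui\nOslo;non\n")

# what `call: pandas.read_csv` expands to:
# deps -> positional arguments, args -> keyword arguments
some_dataframe = pandas.read_csv(
    "some.csv",                       # from deps
    sep=";",                          # from args
    true_values=["yes", "oui", "ja"],
)
```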
Python/R interoperability
This is a big concern for @dmpetrov. Summoning across languages will mostly work for simpler cases like dataframes; there is probably no practical way to make PyTorch models usable in R, though there might be for simpler models like linear regression. To solve those cases I suggest providing a dvcx library for R that shares function and param names, so that the same yaml file will work for R too. To my mind this is a 2nd or even 3rd layer on top of the basic summon functionality, so it might be added later.
Making it easier for a publisher
Raised by @shcheklein. All these additional yaml/python files might stop someone from publishing things. I don't see this as an issue, since:
- making it easier for the publisher would make it harder for a user,
- a yaml file like the above is probably the easiest way to state what is meant to be reused from outside; the alternative is plain text (more on this below),
- we already have a way to use things not explicitly published, with .read()/.open()/.get().
So the "plain text alternative": one might just write in a readme "I have this and this" and provide snippets like:

```python
import dvc.api
import pandas

with dvc.api.open("cities.csv", repo="http://path/to/repo") as f:
    cities = pandas.read_csv(f)
```

This is easy to use, does not require any new concepts, and takes no additional effort on our side.
What's next?
In my opinion the basic summon functionality might be implemented in the near future, as it has a high value/effort ratio and a more or less clear design.
Still, all of this has only been discussed in my 1-1s with @dmpetrov and @shcheklein, so input from the rest of the team would be highly desirable. @efiop @MrOutis @pared @jorgeorpinel, please join.