
Conversation

@efiop (Contributor) commented Sep 23, 2019

Signed-off-by: Ruslan Kuprieiev ruslan@iterative.ai

Related to #755

  • Have you followed the guidelines in our Contributing document?

  • Does your PR affect documented changes or does it add new functionality that should be documented? If yes, have you created a PR for dvc.org documenting it or at least opened an issue for it? If so, please add a link to it.


@efiop force-pushed the 755 branch 3 times, most recently from cb163d5 to c04c20f on September 23, 2019 19:44
@Suor (Contributor) commented Sep 24, 2019

How is this related to #755?

On the patch itself, it looks so fragile that it doesn't look like it is caching anything - it's too easy to call .reload() and burn the cache, and such a call is probably there already. .reload() itself doesn't make sense - it precalculates everything and then deletes it, so that it would be recalculated on access again.

This could be done the same as here, but:

  • .reset() instead of .reload(), which only dels things
  • .reset() is called after .tree assignment or other tree changes
  • cached properties are used as usual, not precalculated
  • consider replacing some or most cached properties with simple properties

Also, you are calculating graphs twice because of how .graph and .active_graph are implemented; you can do it in one pass:

@cached_property
def graph(self):
    # assigning the attributes here fills both caches in one pass
    # (funcy's cached_property stores values in the instance __dict__)
    self.graph, self.active_graph = self._graph()
    return self.graph

Another thing is that you resurrected the concepts of active stages and active pipelines, which we dropped some time ago, and altered the .reproduce() behaviour or implementation - I don't know what it does now, but let's not mix that change with the cache addition.

@efiop (Contributor, Author) commented Sep 24, 2019

How is this related to #755?

I need to implement a stage locker that needs to know what the DAG looks like in order to lock dependencies and such, and this patch helps avoid passing self.stages() everywhere.

On the patch itself, it looks so fragile that it doesn't look like it is caching anything - it's too easy to call .reload() and burn the cache, and such a call is probably there already. .reload() itself doesn't make sense - it precalculates everything and then deletes it, so that it would be recalculated on access again.

Yeah, reload's purpose is to burn the cache, correct - what is wrong with that? It pre-calculates as a workaround to avoid an AttributeError when you try to del some cached property to invalidate it and it hasn't been computed previously. So pre-calculation happens only when the cache is empty. Would love to know the correct way to del an empty funcy.cached_property 🙂
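
For illustration, a minimal sketch of the problem being described (not DVC's code; the class and helper below are made up):

from funcy import cached_property


class Repo:
    @cached_property
    def graph(self):
        return self._build()  # stands in for the expensive DAG collection

    def _build(self):
        return object()


repo = Repo()

# funcy's cached_property stores its value in the instance __dict__, so
# `del repo.graph` raises AttributeError while the cache is still empty:
try:
    del repo.graph
except AttributeError:
    pass

# popping from __dict__ invalidates the cache whether it is populated or not
repo.__dict__.pop("graph", None)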

This could be done the same as here, but:

  • .reset() instead of .reload(), which only dels things
  • .reset() is called after .tree assignment or other tree changes
  • cached properties are used as usual, not precalculated
  • consider replacing some or most cached properties with simple properties

That is precisely what reload is meant to do; I've just used cached_property, but should've used a good old read-only property with a _property inside it. reset indeed sounds better than reload, thanks, I'll rename it.

Also, you are calculating graphs twice because of how .graph and .active_graph are implemented; you can do it in one pass:

Yep, I should really start adding tickboxes to the PR description to list the issues I'm aware of and that I will fix before WIP becomes a final patch.

Another thing is that you resurrected the concepts of active stages and active pipelines, which we dropped some time ago, and altered the .reproduce() behaviour or implementation - I don't know what it does now, but let's not mix that change with the cache addition.

It was an incorrect decision to drop it everywhere. It is not needed for pull/push/etc, but is definitely needed for repro, so I've fixed that.

@Suor (Contributor) commented Sep 24, 2019

To remove an attribute/cached prop which you are not sure is there, you can:

self.__dict__.pop('graph', None)

Maybe I should consider adding silent deletion of cached_property to funcy. Switching to simple properties will also work, but will be a bit more verbose:

@property
def graph(self):
    if not getattr(self, '_graph', None):
        self._graph, self._active_graph = self._build_graph()
    return self._graph

On the active things: active_stages is not used at all, and the other things are used only in repro, so they should go there. Also, there is no need to cache everything - once the graph is built, querying it should be much faster, and at least it doesn't require fs access. So instead of cached props on Repo, simple vars in reproduce() should work.

As you can see, we are discussing two things here in parallel, which complicate each other. So I suggest moving the repro refactor to a separate PR.

@Suor (Contributor) commented Sep 24, 2019

BTW, I still don't understand why you are calling .reload() so often - the cache should be invalidated only on change.

@efiop (Contributor, Author) commented Sep 24, 2019

BTW, I still don't understand why you are calling .reload() so often - the cache should be invalidated only on change.

@Suor The cache should be reloaded after the repo lock is taken, as someone might've deleted/added/modified stages while we didn't have the lock. And it needs to be reloaded whenever we change self.tree.

@Suor (Contributor) commented Sep 24, 2019

@efiop So that should be synchronized in code somehow; simply repeating self.reload() everywhere is error-prone. We should also protect against graph collection while the lock is not held.

Summarizing:

  • check if we hold the lock before graph collection (we need a fast way to check that, e.g. add an is_locked prop on the lock or add that to Repo)
  • drop the cache on lock release (if it's accessed after that, it could be stale already) - see the sketch below
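
A rough sketch of how those two points could fit together (illustrative names only - the locked decorator shape, the is_locked flag and the _reset() helper are assumptions, not the merged implementation):

import functools
import threading


class RepoLock:
    """Stand-in for dvc's repo lock, exposing an is_locked flag."""

    def __init__(self):
        self._lock = threading.RLock()
        self.is_locked = False

    def __enter__(self):
        self._lock.acquire()
        self.is_locked = True

    def __exit__(self, *exc):
        self.is_locked = False
        self._lock.release()


def locked(f):
    @functools.wraps(f)
    def wrapper(repo, *args, **kwargs):
        with repo.lock:
            ret = f(repo, *args, **kwargs)
            # drop the cache before the lock is released, since another
            # dvc process may change stage files afterwards
            repo._reset()
        return ret

    return wrapper


class Repo:
    def __init__(self):
        self.lock = RepoLock()
        self._graph = None

    def _reset(self):
        self._graph = None

    @property
    def graph(self):
        # refuse to collect the graph without holding the lock, which
        # protects us from forgetting @locked on a command
        assert self.lock.is_locked, "graph accessed without the repo lock"
        if self._graph is None:
            self._graph = self._build_graph()  # expensive collection
        return self._graph

    def _build_graph(self):
        return "DAG"  # placeholder for the real stage/graph collection

With something like this, forgetting @locked on a command surfaces immediately as the assert firing, instead of as a stale graph.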

@Suor (Contributor) commented Sep 24, 2019

A side note: it looks like releasing the lock on runs may cost us some performance - the graph needs to be rebuilt if it is accessed again.

@efiop force-pushed the 755 branch 3 times, most recently from 43f6c36 to f208553 on September 24, 2019 06:33
@efiop changed the title from "[WIP] dvc: cache graph/stages/pipelines" to "dvc: cache graph/stages/pipelines" on Sep 24, 2019
@efiop (Contributor, Author) commented Sep 24, 2019

Btw, this also solves #2519 (comment). And I've found a few @locked that I missed during #2519.

@efiop requested review from a user, Suor and pared, and removed the request for Suor, on September 24, 2019 08:21
@Suor (Contributor) commented Sep 24, 2019

It's very hard to review while two things - caching and the reproduce refactor - are intermixed here. Would you mind separating them?

@efiop changed the title from "dvc: cache graph/stages/pipelines" to "[WIP] dvc: cache graph/stages/pipelines" on Sep 24, 2019
@efiop (Contributor, Author) commented Sep 24, 2019

@Suor it is not trivial to decouple, as they are closely tied together. I did try.

@efiop (Contributor, Author) commented Sep 24, 2019

@Suor Or are you specifically talking about the elif pipeline or all_pipelines branch and the fact that I'm using get_active_pipeline there? In that case, that one is a trivial quickfix.

@efiop changed the title from "[WIP] dvc: cache graph/stages/pipelines" to "dvc: cache graph/stages/pipelines" on Sep 24, 2019
@Suor (Contributor) commented Sep 24, 2019

I think we should remove all that "active" stuff. It makes no sense, especially in the Repo context; if some of it is needed in reproduce, then it can be calculated there.

@Suor (Contributor) commented Sep 24, 2019

So there will be disagreement on that part, while the cache stuff might be merged much faster.

@efiop (Contributor, Author) commented Sep 24, 2019

I think we should remove all that "active" stuff. It makes no sense, especially in the Repo context; if some of it is needed in reproduce, then it can be calculated there.

We shouldn't remove it, as it is a core feature for repro, and the only part where it was broken is the --all-pipelines and --pipeline options, where it was using self._get_pipeline instead of an active pipeline. In regular dvc repro cases without those options, dvc repro was and is using the active graph.

@shcheklein (Contributor) left a comment

Following the discussion: looks good to me. Just some minor comments to address/improve. I don't like the way active/non-active are returned and used - via 0 and 1 indexes - but this is cosmetic. I tend to agree with @Suor that it's better to keep this logic inside repro if it's the only consumer of the active stuff, but I don't "feel" the code strongly enough to have a strong opinion. For example, if it would require importing a lot of stuff into repro and repeating all this DAG logic, that would be bad. Also, since we are caching this stuff now, it's probably a good idea to keep it in the same place.

@efiop force-pushed the 755 branch 2 times, most recently from 5f99ff4 to 510f1bc on September 25, 2019 06:36
@efiop (Contributor, Author) commented Sep 25, 2019

@Suor @shcheklein You are right about the active stuff. I've refactored it so we take the original G and cut off locked deps to get the active graph, and now we only build and use that in reproduce. Looks much better now IMHO.
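
Roughly, the idea is something like the following sketch (not the exact merged code - it assumes, as in the diff snippets further down, that nodes carry a "stage" attribute and that edges point from a stage to its dependencies):

def get_active_graph(G):
    # Copy the full DAG and detach the dependencies of locked stages,
    # so that reproduce never walks past a locked stage.
    active = G.copy()
    for node, data in G.nodes(data=True):
        if not data["stage"].locked:
            continue
        active.remove_edges_from(G.out_edges(node))
        # a fuller version would also prune dep nodes that are no longer
        # reachable from any unlocked stage
    return active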

Ruslan Kuprieiev added 4 commits September 25, 2019 12:18
Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
treeverse#2529 (comment)

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
Ruslan Kuprieiev added 2 commits September 25, 2019 12:22
As noted by @Suor, this is much less prone to errors.
@Suor (Contributor) left a comment

So the big things are done. Except one - checking that the graph is not collected while the lock is not held. That will basically protect us from forgetting @locked. This could be done separately though.

The others are minor and cosmetic things.

stage = attrs[node]
if stage.path.startswith(target + os.sep):
    ret.append(stage)
return ret
Contributor:

Why don't we base this on stages?

stages = get_stages(G)
return [stage for stage in stages if stage.path.startswith(target + os.sep)]

Another thing: it looks like we were not collecting all the stages before, but now we do.

Contributor Author:

The order matters. E.g. for repro --recursive --single-item we need to run those in the proper order. Might be better to do that not in collect, but rather in reproduce itself. Need to take a look separately.

Contributor:

So this one can also be significantly simplified by using stages as nodes.

return ret

stage = Stage.load(self, target)
if not with_deps:
Contributor:

So with_deps is ignored if it's recursive && isdir and not ignored if it's recursive && !isdir. Maybe add:

assert not recursive or not with_deps  # recursive => not with_deps

At the start of the function, to guarantee that there is no such situation.

Contributor Author:

It is more of a bug though. --with-deps --recursive is a perfectly valid combination. I'll adjust it with a test later.

Contributor:

So this goes to a separate issue?

Contributor Author:

@Suor Yes.


active = G.copy()
stages = nx.get_node_attributes(G, "stage")
for node in G.nodes():
Contributor:

As far as I can see, you don't need to call .nodes() - it's possible to iterate the graph directly - and there is no need to prepare attrs: you may use stage = G.nodes[node]['stage'], or iterate G.nodes.items().

BTW, why don't we add stages to the graph directly? Then it would simply be:

for stage in G:
    if not stage.locked:
        continue
    # ...

Contributor Author:

Addressed the nodes() issue.

BTW, why don't we add stages to the graph directly? Then it would simply be:

Mostly legacy, I think. Might take a look at some point.

for node in G.nodes():
    if G.in_degree(node) == 0:
        targets.append(os.path.join(self.root_dir, node))
        targets.append(attrs[node])
Contributor:

This all could be:

targets = [stage for G in pipelines for stage in G if G.in_degree(stage) == 0]

Contributor Author:

It is terrible to do such long comprehensions. Much better to just leave them structured.

Contributor:

You can break the list comprehension into several lines; it would be the same but more efficient and descriptive. Collecting results with list.append() has always looked like a smell to me.
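
For example, the same comprehension as above, just reflowed across lines:

targets = [
    stage
    for G in pipelines
    for stage in G
    if G.in_degree(stage) == 0
]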

from dvc.repo import locked


@locked
Contributor:

I don't quite get why we have locked here. As I understand it, it is supposed to reset graph/stages/pipelines upon changing the state of the dvc repo. Can list do that?

Contributor Author:

It is supposed to reset it because we no longer have a guarantee that it will stay intact. But locked here is about locking the dvc repository, so that no other dvc instance can add/remove dvcfiles from the project, which would affect the self.collect used below.

Contributor:

Ok, thanks for the clarification - I only looked at locked as a mechanism for caching the graph. It makes sense.

@efiop (Contributor, Author) commented Sep 25, 2019

So the big things are done. Except one - checking that the graph is not collected while the lock is not held. That will basically protect us from forgetting @locked. This could be done separately though.

Definitely separately. Also, we do collect the graph without the lock on API/import/get calls when dealing with an external repo. These days we could bring that back for symmetry, as we have universal locks now and are not afraid of NFS and friends.

@efiop merged commit 381601f into treeverse:master on Sep 26, 2019
@efiop deleted the 755 branch on September 26, 2019 04:42
with repo.lock:
    return f(repo, *args, **kwargs)      # old
    ret = f(repo, *args, **kwargs)       # new
    # Our graph cache is no longer valid after we release the repo.lock
Contributor:

Why is it not valid? I would write that a concurrent dvc process might change stage files and this graph cache might become stale.

