
Conversation

@efiop (Contributor) commented Sep 23, 2019

Signed-off-by: Ruslan Kuprieiev ruslan@iterative.ai

Related to #755

  • Have you followed the guidelines in our Contributing document?

  • Does your PR affect documented changes or does it add new functionality that should be documented? If yes, have you created a PR for dvc.org documenting it or at least opened an issue for it? If so, please add a link to it.


@efiop force-pushed the 755 branch 3 times, most recently from cb163d5 to c04c20f on September 23, 2019 19:44
@Suor (Contributor) commented Sep 24, 2019

How is this related to #755?

On the patch itself, it looks so fragile that it doesn't look like it is caching anything - it's too easy to call .reload() and burn the cache, and such a call is probably there already. .reload() itself doesn't make sense - it precalculates everything and then deletes it, so that it would be recalculated on access again.

This could be done the same as here, but:

  • .reset() instead of .reload(), which only dels things
  • .reset() is called after .tree assignment or other tree changes
  • cached properties are used as usual, not precalculated
  • consider replacing some or most cached properties with simple properties

Also, you are calculating graphs twice because of how .graph and .active_graph are implemented; you can do it in one pass:

@cached_property
def graph(self):
    # assigning the attributes here fills both caches in one pass
    # (funcy's cached_property stores values in the instance __dict__)
    self.graph, self.active_graph = self._graph()
    return self.graph

Another thing is that you resurrected the concepts of active stages and active pipelines, which we dropped some time ago, and altered the .reproduce() behaviour or implementation - I don't know what it does now, but let's not mix that change with the cache addition.

@efiop (Contributor, Author) commented Sep 24, 2019

How is this related to #755?

I need to implement a stage locker that needs to know what the DAG looks like in order to lock dependencies and such, and this patch helps avoid passing self.stages() everywhere.

On the patch itself, it looks so fragile that it doesn't look like it is caching anything - it's too easy to call .reload() and burn the cache, and such a call is probably there already. .reload() itself doesn't make sense - it precalculates everything and then deletes it, so that it would be recalculated on access again.

Yeah, reload's purpose is to burn the cache, correct - what is wrong with that? It pre-calculates as a workaround to avoid an AttributeError when you try to del some cached property to invalidate it and it hasn't been computed previously. So pre-calculation happens only when the cache is empty. Would love to know the correct way to del an empty funcy.cached_property 🙂
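
For illustration, a minimal sketch of the problem being described (not DVC's code; the class and helper below are made up):

from funcy import cached_property


class Repo:
    @cached_property
    def graph(self):
        return self._build()  # stands in for the expensive DAG collection

    def _build(self):
        return object()


repo = Repo()

# funcy's cached_property stores its value in the instance __dict__, so
# `del repo.graph` raises AttributeError while the cache is still empty:
try:
    del repo.graph
except AttributeError:
    pass

# popping from __dict__ invalidates the cache whether it is populated or not
repo.__dict__.pop("graph", None)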

This could be done the same as here, but:

  • .reset() instead of .reload(), which only dels things
  • .reset() is called after .tree assignment or other tree changes
  • cached properties are used as usual, not precalculated
  • consider replacing some or most cached properties with simple properties

That is precisely what reload is meant to do; I've just used cached_property, but should've used a good old read-only property with a _property inside it. reset indeed sounds better than reload, thanks, I'll rename it.

Also, you are calculating graphs twice because of how .graph and .active_graph are implemented; you can do it in one pass:

Yep, I should really start adding tickboxes to the PR description to list the issues I'm aware of and that I will fix before WIP becomes a final patch.

Another thing is that you resurrected the concepts of active stages and active pipelines, which we dropped some time ago, and altered the .reproduce() behaviour or implementation - I don't know what it does now, but let's not mix that change with the cache addition.

It was an incorrect decision to drop it everywhere. It is not needed for pull/push/etc, but is definitely needed for repro, so I've fixed that.

@Suor (Contributor) commented Sep 24, 2019

To remove an attribute/cached prop which you are not sure is there, you can:

self.__dict__.pop('graph', None)

Maybe I should consider adding silent deletion of cached_property to funcy. Switching to simple properties will also work, but will be a bit more verbose:

@property
def graph(self):
    if not getattr(self, '_graph', None):
        self._graph, self._active_graph = self._build_graph()
    return self._graph

On the active things: active_stages is not used at all, and the other things are used only in repro, so they should go there. Also, there is no need to cache everything - once the graph is built, querying it should be much faster, and at least it doesn't require fs access. So instead of cached props on Repo, simple vars in reproduce() should work.

As you can see, we are discussing two things here in parallel, which complicate each other. So I suggest moving the repro refactor to a separate PR.

@Suor (Contributor) commented Sep 24, 2019

BTW, I still don't understand why you are calling .reload() so often - the cache should be invalidated only on change.

@efiop (Contributor, Author) commented Sep 24, 2019

BTW, I still don't understand why you are calling .reload() so often - the cache should be invalidated only on change.

@Suor The cache should be reloaded after the repo lock is taken, as someone might've deleted/added/modified stages while we didn't have the lock. And it needs to be reloaded whenever we change self.tree.

@Suor (Contributor) commented Sep 24, 2019

@efiop So that should be synchronized in code somehow; simply repeating self.reload() everywhere is error-prone. We should also protect against graph collection while the lock is not held.

Summarizing:

  • check if we hold the lock before graph collection (we need a fast way to check that, e.g. add an is_locked prop on the lock or add that to Repo)
  • drop the cache on lock release (if it's accessed after that, it could be stale already) - see the sketch below
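
A rough sketch of how those two points could fit together (illustrative names only - the locked decorator shape, the is_locked flag and the _reset() helper are assumptions, not the merged implementation):

import functools
import threading


class RepoLock:
    """Stand-in for dvc's repo lock, exposing an is_locked flag."""

    def __init__(self):
        self._lock = threading.RLock()
        self.is_locked = False

    def __enter__(self):
        self._lock.acquire()
        self.is_locked = True

    def __exit__(self, *exc):
        self.is_locked = False
        self._lock.release()


def locked(f):
    @functools.wraps(f)
    def wrapper(repo, *args, **kwargs):
        with repo.lock:
            ret = f(repo, *args, **kwargs)
            # drop the cache before the lock is released, since another
            # dvc process may change stage files afterwards
            repo._reset()
        return ret

    return wrapper


class Repo:
    def __init__(self):
        self.lock = RepoLock()
        self._graph = None

    def _reset(self):
        self._graph = None

    @property
    def graph(self):
        # refuse to collect the graph without holding the lock, which
        # protects us from forgetting @locked on a command
        assert self.lock.is_locked, "graph accessed without the repo lock"
        if self._graph is None:
            self._graph = self._build_graph()  # expensive collection
        return self._graph

    def _build_graph(self):
        return "DAG"  # placeholder for the real stage/graph collection

With something like this, forgetting @locked on a command surfaces immediately as the assert firing, instead of as a stale graph.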

@Suor (Contributor) commented Sep 24, 2019

A side note: it looks like releasing the lock on runs may cost us some performance - the graph needs to be rebuilt if it is accessed again.

@efiop force-pushed the 755 branch 3 times, most recently from 43f6c36 to f208553 on September 24, 2019 06:33
@efiop changed the title from "[WIP] dvc: cache graph/stages/pipelines" to "dvc: cache graph/stages/pipelines" on Sep 24, 2019
@efiop (Contributor, Author) commented Sep 24, 2019

Btw, this also solves #2519 (comment). And I've found a few @locked that I missed during #2519.

@efiop requested review from a user, Suor and pared, and removed the request for Suor, on September 24, 2019 08:21
@Suor (Contributor) commented Sep 24, 2019

It's very hard to review while two things - caching and the reproduce refactor - are intermixed here. Would you mind separating them?

@efiop changed the title from "dvc: cache graph/stages/pipelines" to "[WIP] dvc: cache graph/stages/pipelines" on Sep 24, 2019
@efiop (Contributor, Author) commented Sep 24, 2019

@Suor it is not trivial to decouple, as they are closely tied together. I did try.

@efiop (Contributor, Author) commented Sep 24, 2019

@Suor Or are you specifically talking about the elif pipeline or all_pipelines branch and the fact that I'm using get_active_pipeline there? In that case, that one is a trivial quickfix.

@efiop changed the title from "[WIP] dvc: cache graph/stages/pipelines" to "dvc: cache graph/stages/pipelines" on Sep 24, 2019
@Suor (Contributor) commented Sep 24, 2019

I think we should remove all that "active" stuff. It makes no sense, especially in the Repo context; if some of it is needed in reproduce, then it can be calculated there.

@Suor (Contributor) commented Sep 24, 2019

So there will be disagreement on that part, while the cache stuff might be merged much faster.

@efiop (Contributor, Author) commented Sep 24, 2019

I think we should remove all that "active" stuff. It makes no sense, especially in the Repo context; if some of it is needed in reproduce, then it can be calculated there.

We shouldn't remove it, as it is a core feature for repro, and the only part where it was broken is the --all-pipelines and --pipeline options, where it was using self._get_pipeline instead of an active pipeline. In regular dvc repro cases without those options, dvc repro was and is using the active graph.

@shcheklein (Contributor) left a comment

Following the discussion: looks good to me. Just some minor comments to address/improve. I don't like the way active/non-active are returned and used - via 0 and 1 indexes - but this is cosmetic. I tend to agree with @Suor that it's better to keep this logic inside repro if it's the only consumer of the active stuff, but I don't "feel" the code strongly enough to have a strong opinion. For example, if it would require importing a lot of stuff into repro and repeating all this DAG logic, that would be bad. Also, since we are caching this stuff now, it's probably a good idea to keep it in the same place.

@efiop force-pushed the 755 branch 2 times, most recently from 5f99ff4 to 510f1bc on September 25, 2019 06:36
@efiop (Contributor, Author) commented Sep 25, 2019

@Suor @shcheklein You are right about the active stuff. I've refactored it so we take the original G and cut off locked deps to get the active graph, and now we only build and use that in reproduce. Looks much better now IMHO.
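
Roughly, the idea is something like the following sketch (not the exact merged code - it assumes, as in the diff snippets further down, that nodes carry a "stage" attribute and that edges point from a stage to its dependencies):

def get_active_graph(G):
    # Copy the full DAG and detach the dependencies of locked stages,
    # so that reproduce never walks past a locked stage.
    active = G.copy()
    for node, data in G.nodes(data=True):
        if not data["stage"].locked:
            continue
        active.remove_edges_from(G.out_edges(node))
        # a fuller version would also prune dep nodes that are no longer
        # reachable from any unlocked stage
    return active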

Ruslan Kuprieiev added 4 commits September 25, 2019 12:18
Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
treeverse#2529 (comment)

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
Ruslan Kuprieiev added 2 commits September 25, 2019 12:22
As noted by @Suor, this is much less prone to errors.
@Suor (Contributor) left a comment

So the big things are done. Except one - checking that the graph is not collected while the lock is not held. That will basically protect us from forgetting @locked. This could be done separately though.

The others are minor and cosmetic things.

stage = attrs[node]
if stage.path.startswith(target + os.sep):
    ret.append(stage)
return ret
Contributor:

Why don't we base this on stages?

stages = get_stages(G)
return [stage for stage in stages if stage.path.startswith(target + os.sep)]

Another thing: it looks like we were not collecting all the stages before, but now we do.

Contributor Author:

The order matters. E.g. for repro --recursive --single-item we need to run those in the proper order. Might be better to do that not in collect, but rather in reproduce itself. Need to take a look separately.

Contributor:

So this one can also be significantly simplified by using stages as nodes.

return ret

stage = Stage.load(self, target)
if not with_deps:
Contributor:

So with_deps is ignored if it's recursive && isdir and not ignored if it's recursive && !isdir. Maybe add:

assert not recursive or not with_deps  # recursive => not with_deps

At the start of the function, to guarantee that there is no such situation.

Contributor Author:

It is more of a bug though. --with-deps --recursive is a perfectly valid combination. I'll adjust it with a test later.

Contributor:

So this goes to a separate issue?

Contributor Author:

@Suor Yes.


active = G.copy()
stages = nx.get_node_attributes(G, "stage")
for node in G.nodes():
Contributor:

As far as I can see, you don't need to call .nodes() - it's possible to iterate the graph directly - and there is no need to prepare attrs: you may use stage = G.nodes[node]['stage'], or iterate G.nodes.items().

BTW, why don't we add stages to the graph directly? Then it would simply be:

for stage in G:
    if not stage.locked:
        continue
    # ...

Contributor Author:

Addressed the nodes() issue.

BTW, why don't we add stages to the graph directly? Then it would simply be:

Mostly legacy, I think. Might take a look at some point.

for node in G.nodes():
    if G.in_degree(node) == 0:
        targets.append(os.path.join(self.root_dir, node))
        targets.append(attrs[node])
Contributor:

This all could be:

targets = [stage for G in pipelines for stage in G if G.in_degree(stage) == 0]

Contributor Author:

It is terrible to do such long comprehensions. Much better to just leave them structured.

Contributor:

You can break the list comprehension into several lines; it would be the same but more efficient and descriptive. Collecting results with list.append() has always looked like a smell to me.
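
For example, the same comprehension as above, just reflowed across lines:

targets = [
    stage
    for G in pipelines
    for stage in G
    if G.in_degree(stage) == 0
]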

from dvc.repo import locked


@locked
Contributor:

I don't quite get why we have locked here. As I understand it, it is supposed to reset graph/stages/pipelines upon changing the state of the dvc repo. Can list do that?

Contributor Author:

It is supposed to reset it because we no longer have a guarantee that it will stay intact. But locked here is about locking the dvc repository, so that no other dvc instance can add/remove dvcfiles from the project, which would affect the self.collect used below.

Contributor:

Ok, thanks for the clarification - I only looked at locked as a mechanism for caching the graph. It makes sense.

@efiop (Contributor, Author) commented Sep 25, 2019

So the big things are done. Except one - checking that the graph is not collected while the lock is not held. That will basically protect us from forgetting @locked. This could be done separately though.

Definitely separately. Also, we do collect the graph without the lock on API/import/get calls when dealing with an external repo. These days we could bring that back for symmetry, as we have universal locks now and are not afraid of NFS and friends.

@efiop merged commit 381601f into treeverse:master on Sep 26, 2019
@efiop deleted the 755 branch on September 26, 2019 04:42
with repo.lock:
    return f(repo, *args, **kwargs)      # old
    ret = f(repo, *args, **kwargs)       # new
    # Our graph cache is no longer valid after we release the repo.lock
Contributor:

Why is it not valid? I would write that a concurrent dvc process might change stage files and this graph cache might become stale.

