What are the plans for merge? #52

Open
schwern opened this Issue Apr 2, 2013 · 9 comments

Projects

None yet

2 participants

Collaborator
schwern commented Apr 2, 2013

I just noticed that merging stacks disappeared. This is an important part of the value of Pinto to me. Delete + copy is enough to do a dev -> staging -> production workflow, but if there's any hot patches they'll have to be moved back manually.

Owner
thaljef commented Apr 2, 2013

merge is another feature that fell out when the VCS was redesigned. I always thought it could be pretty useful too, but when I asked around, no one claimed to use it or even understand the use case for it. I'd be happy to bring it back if we can both agree on what it ought to do.

So in your mind, what does it mean to "merge" two stacks?

Collaborator
schwern commented Apr 9, 2013

In all cases B is being merged into A. * is the repository root. A and
B are branch heads.

First is the simple fast-forward case. No change has been made to B
since A branched from it.

*---B----A

becomes

*--------A
         B

The opposite is either a no-op or an error because B is not an ancestor
of A.

*---A----B

becomes

*---A----B
or error

Now for the interesting part, when both branches have diverged. A use
case is where you have multiple project branches (we'll call them P1 and
P2) branched from a baseline branch (A). For example, if there's a bug
fix in File::Spec or similarly universally important module that might
go into A and be pushed out to the projects.

       _____P1
      /
*-------A
     \____P2

You want to merge changes to the baseline set of modules (A) into the
project branches. What happens? To figure that out we have to break
down the concept of what can be automatically merged and what cannot be.

Merging conflicts are caused by two changes trying to change the same
line or nearby lines. This has two goals. First, with a direct
conflict the computer has no way to figure out which change takes
precedence, it needs a human to make that decision.

Second, checking the surrounding context hasn't changed is an attempt to
try to make sure the change is still relevant. That the context hasn't
changed to make this change make no sense.

To sum, a merge can be done automatically unless...

  1. Both sides are changing the same thing in different ways.
  2. One side has changed something in a way to make the other side's
    change not make sense.

What Pinto has going for it is most merge systems have no understanding
of the thing they're merging. It's just lines of text, not code. Pinto
does understand.

Final clarifying point. All "changes" referred to are since the last
time A was merged into P1. This is so that if A and P1 both touch
Foo::Bar, you merge and resolve the conflict, that the next merge
doesn't repeat the Foo::Bar conflict. This is where Pinto must do merge
tracking, to remember what was merged into who and when. This is where
Subversion fell flat on its face. It's not easy unless you have a tree
structure for your repository like Git. The Subversion merge tracking
model is crap, but it is relatively simple to implement.

For the first point, if both branches change the same module it is a
conflict. Pretty straight forward. This includes pins.

A merge cannot violate any pins, that is a conflict.

For the second point, there's a few ways to interpret an out of context
change. First, for each changed module Pinto must check that it doesn't
break the dependency requirements of anything else. Pinto can already
do this.

A first order merge algorithm for merging A into P1 might be...

  • @A_Changes = all the modules changed by A.
  • @P1_Changes = all the modules changed by P1.
  • If any of them are the same, conflict.
  • Add @A_Changes to P1.
  • If any dependencies are violated, conflict.
  • If any pins are violated, conflict.
  • Once resolved, record that A@123 was merged into P1@234.

Conflict resolution must be interactive and the user must be able to
continue the merge after resolving the conflict. This is to say it's
not OK for Pinto to just error out of the merge on conflict.

Also note that the dependency check must take into account ALL of
@A_Changes at once. That is to say, if A upgrades Foo::Bar which
requires an upgrade to Some::Module, Pinto should wait to check if
@A_Changes is going to upgrade Some::Module, too. I've run into this
problem with "pinto add".

How's that?

Owner
thaljef commented Apr 9, 2013

How's that?

I think we're getting close. I've gotta run, but I wanted to jot down some thoughts before I forget...

  • Fast-forward merge: no problem there.
  • Pinto stores history in a directed graph, just like Git.
  • Do we agree that merging is about merging packages, not distributions?
  • Unlike a text file, an index is an unordered list. The "context" doesn't mean anything.
  • I think the next step is to work through some cases with hypothetical indexes, rather than ancestry graphs.
Collaborator
schwern commented Apr 9, 2013

Merging packages vs distributions... if I understand you correctly "merge by package" would treat a change like these separate actions...

  • Replace Test::More 0.88 with 0.98
  • Replace Test::Simple 0.88 with 0.98
  • Replace Test::Builder 0.88 with 0.98

"Merge by distribution" treats the same thing as one change.

  • Replace MSCHWERN/Test-Simple-0.88.tar.gz with MSCHWERN/Test-Simple-0.98.tar.gz

This again hits at that core architectural problem of CPAN, it has no concept of a distribution. Ideally, CPAN and Pinto should never have two releases of the same distribution in the index. In that sense, merging by distribution is what you want. From another perspective, the Pinto user is always working with releases of distributions. It doesn't make sense to say "I want to upgrade JUST Test::More but leave Test::Builder as is even though they're part of the same distribution". Since Pinto presents the user with releases of distributions, merging should be by releases of distributions.

The analogy to adjacency and ordering in Pinto is the dependency graph.

Good to hear Pinto is a graph rather than a stack of pancakes. :) That will make this so much easier.

I'm not sure what a "hypothetical index" is. Example?

Owner
thaljef commented Apr 9, 2013

I'm not sure what a "hypothetical index" is. Example?

Using a couple of packages, describe what the index looks like at each point (the common ancestor, tip A, and tip B) and show what the new tip should look like after the merge. For example:

ancestor                               B   --------merges into------>  A  --------- yielding --------> A'
_________________________________________________________________________________________________________________
Author/Dist-1/PkgA~1        Author/Dist-1/PkgA~1              Author/Dist-1/PkgA~1           Author/Dist~1/PkgA~1
Author/Dist-2/PkgB~2        Author/Dist-2/PkgB~2              Author/Dist-2/PkgB~2           Author/Dist~2/PkgB~2
<none>                      Author/Dist-3/PkgC~3              <none>                         <conflict ???>
Author/Dist-4/PkgD~2        Author/Dist-5/PkgD~3              Author/Dist-6/PkgD~4           <conflict>

Something like that :)

Owner
thaljef commented Apr 9, 2013

Merging packages vs distributions...

We're going to have to sort that part out first. I'll get back to you on that tomorrow.

Owner
thaljef commented Apr 10, 2013

Here's my feeling at the moment...

Distributions are purely notional. They can't be reliably ordered or even related to each other. Pretending that they can goes against the grain of the toolchain and smells like a recipe for trouble. I don't believe it is an "architectural problem" of CPAN. Rather, it is just a natural side-effect from having module-level dependencies.

I don't agree that limiting Pinto to one release of a distribution per stack is necessarily "ideal." In your use case it might make perfect sense. But for others, it may not. Some may wish to leave old distributions in the index just like PAUSE does. Fortunately, Pinto lets you go either way. But I don't feel that I can say either strategy is "right."

A Pinto user is not "always working with releases of distributions" as you say. I think you may feel that way because you want to make a repository full of old versions of modules, and the only way you can do that with Pinto right now is to ask for a specific release of a distribution.

But if Pinto can precisely locate (module => version) pairs for you, then releases become largely invisible to you. In fact, that very same issue is discussed in #37 and is the focus of our crowd-funding campaign at http://tinyurl.com/gopinto.

So in terms of merging, I think it means that merges are done at the package level. However, I can probably still preserve the all-or-nothing behavior that you want for distributions. We'll just have to sketch out some test cases for that and make sure we both understand.

Collaborator
schwern commented Apr 15, 2013

On 4/10/13 5:34 AM, Jeffrey Ryan Thalhammer wrote:

Distributions are purely notional. They can't be reliably ordered or
even related to each other. Pretending that they can goes against the
grain of the toolchain and smells like a recipe for trouble. I don't
believe it is an "architectural problem" of CPAN. Rather, it is just a
natural side-effect from having module-level dependencies.

Returning from the Perl QA Hackathon, CPAN feels this is an
architectural problem and are moving to revise it. An installed
distributions database is in the works, distribution level CPAN meta
data, and new standards for Foo-Bar-1.23 to contain Foo::Bar.

For Pinto, if pull is going to treat things as releases then why not
merging?

In addition, Pinto is in the position where it knows what is currently
installed and what came with each release. It has an installed releases
database. It can determine if two releases intersect with each other
simply by parsing the archive name or by checking if they change the
same modules. For the purposes of merging (and many other things) it's
important to know if two tarballs are going to change the same package.

Let me go through an example of a merge to show you what I mean and how
it works out better by distribution.

Foo-1.22 contains...
    Foo @ 1.22
    Foo::Bar @ 1.22
    Foo::Baz @ 1.22
    Foo::Buf @ 1.22

Foo-1.23 contains...
    Foo @ 1.23
    Foo::Bar @ 1.23
    Foo::Baz @ 1.23

Foo-1.24 contains...
    Foo @ 1.24
    Foo::Bar @ 1.24
    Foo::Buf @ 1.24
  • master pulls Foo-1.22.
  • feature and fix branch from master
  • feature pulls Foo-1.23
  • fix pulls Foo-1.24
  • fix is merged into feature
  • feature is merged into master

What happens? Here's the state just before merging begins.

          master          feature        fix

Foo 1.22 1.23 1.24
Foo::Bar 1.22 1.23 1.24
Foo::Baz 1.22 1.23 1.22?
Foo::Buf 1.22 1.22? 1.24

The question marks indicate places where if we change pull to first
uninstall there will be no version installed. Pull would have
uninstalled it.

If we do it by package, then the user is presented with this...

+Foo @ 1.24
-Foo @ 1.23
+Foo::Bar @ 1.24
-Foo::Bar @ 1.23
+Foo::Baz @ 1.22
-Foo::Baz @ 1.23
+Foo::Buf @ 1.24
-Foo::Baf @ 1.22

Note the bogus downgrade of Foo::Baz. In this case it doesn't matter,
but I'm pretty sure I can contrive cases where it will.

In addition, there is really only two valid outcomes...

Equivalent to Foo-1.24

Foo @ 1.24
Foo::Bar @ 1.24
Foo::Baz @ 1.22
Foo::Buf @ 1.24

OR

Equivalent to Foo-1.23

Foo @ 1.23
Foo::Bar @ 1.23
Foo::Baz @ 1.23
Foo::Baf @ 1.22

I can think of no use case where you'd want to mix and match versions of
modules from distributions. Allowing the user to do so just begs their
making a mistake when merging.

In addition, it's a lot for the user to parse. And that's for a small
module. When merging, here's what I'd rather see...

+Foo-1.24
-Foo-1.23
+Foo @ 1.24
-Foo @ 1.23
+Foo::Bar @ 1.24
-Foo::Bar @ 1.23
+Foo::Baz @ 1.22
-Foo::Baz @ 1.23
+Foo::Buf @ 1.24
-Foo::Baf @ 1.22

What that clearly says is "Foo was upgraded from 1.23 to 1.24".

Conflicts should be determined by two branches changing the same module,
but this should cause a conflict at the archive level because it is the
archives which are in conflict.

Looking at it another way, the user isn't going to cherry pick Foo from
Foo-1.23 and Foo::Bar from Foo-1.24 to resolve this conflict. They're
going to pick Foo-1.23 or Foo-1.24. In fact, conflict resolution could
be presented like this...

Merging fix into feature...
Conflict: Both branches update the Foo distribution

    parent:  Foo-1.22
    fix:     Foo-1.24
    feature: Foo-1.23
                      parent          feature        fix
        Foo           1.22            1.23           1.24
        Foo::Bar      1.22            1.23           1.24
        Foo::Baz      1.22            1.23           1.22
        Foo::Buf      1.22            1.22           1.24

Which version would you like?

Nice, clean presentation and a clear choice for the user which they
cannot screw up.

(I'm not sure how any of the above would interact with prerequisites)

I don't agree that limiting Pinto to one release of a distribution per
stack is necessarily "ideal." In your use case it might make perfect
sense. But for others, it may not. Some may wish to leave "old"
distributions in the index just like PAUSE does. Fortunately, Pinto
let's you go either way. But I don't feel that I can say either strategy
is "right."

A) What's the use case?
B) Would you agree it's not a common case?
C) Can we make that a configurable option defaulting to no?

A Pinto user is not "always working with releases of distributions" as
you say. I think you may feel that way because you want to make a
repository full of old versions of modules, and the only way you can do
that with Pinto right now is to ask for a specific release of a
distribution.

But if Pinto can precisely locate (module => version) pairs /for you/,
then releases become largely invisible to you. In fact, that very same
issue is discussed in #37 #37
and is the focus of our crowd-funding campaign at
http://tinyurl.com/gopinto.

As I understand it, even if a user types in "pinto pull Foo::Bar" on the
command line, there is no difference between that and "pinto pull ". That is, it's just syntax
sugar for pulling a release.

Is there a commonly used command which allows the user to manipulate
Pinto at the individual package level which is not syntax sugar for
working with an archive? For example, can I change Test::More without
changing Test::Simple and Test::Builder and is this a commonly done thing?

So in terms of merging, I think it means that merges are done at the
package level. However, I can probably still preserve the all-or-nothing
behavior that you want for distributions. We'll just have to sketch out
some test cases for that and make sure we both understand.

Maybe I'm not sure what "merging at the package level" means. Because
merging happens between stacks, and everything else Pinto does appears
to internally happen at the archive level.

Owner
thaljef commented Oct 14, 2013

@schwern, I have come around to your point of view. Pinto manages distributions, not modules. So yes, I agree that a Pinto index should never contain just part of a distribution. When you add or pull a distribution to a stack, then any incumbent distributions which contain a package that exists in the incoming distribution should be removed from the index entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment