The Collect, Extract and Integration chain #16

Closed
BigRoy opened this issue Jul 5, 2015 · 8 comments
@BigRoy (Member) commented Jul 5, 2015

Goal

Decide on the way forward for building new family types and how to implement their Collector, Extractor and, possibly, Integrator.

The main goal is a simple, consistent and robust solution that can be used throughout Magenta and also allows the plug-ins to be easily reused in other packages.

Implementation

Our Integrator depends on knowing data about the Extracted content. It needs to know:

  • Where the files have been extracted to, the extractDir
  • Where the files need to be integrated to, the integrateDir

We also want to implement versioning, so that could mean additional required data.

The extractDir data is defined in the Extractor (see plugin.py) and is a temporary directory. This establishes the constraint that any instance has only one extractDir, and it requires Extractors to inherit from plugin.py.
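For illustration, here is a minimal sketch of what that shared base class could look like, written against the pyblish-base InstancePlugin API (the actual plugin.py in Magenta may differ):

```python
import tempfile

import pyblish.api


class Extractor(pyblish.api.InstancePlugin):
    """Shared base class: one extractDir per instance."""
    order = pyblish.api.ExtractorOrder

    def staging_dir(self, instance):
        # Re-use the directory if an earlier extractor already made one,
        # so all extractors for an instance share a single extractDir.
        extract_dir = instance.data.get("extractDir")
        if extract_dir is None:
            extract_dir = tempfile.mkdtemp(prefix="pyblish_")
            instance.data["extractDir"] = extract_dir
        return extract_dir
```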

The integrateDir is computed using the project's schema and data from the instance. That data is:

  • Any data that is required to format a path, like root, container and asset.
  • The family name, which defines which output template from the project's schema to use for formatting.
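For concreteness, a hypothetical sketch of that computation (the template strings and data keys are illustrative, not Magenta's actual schema):

```python
# Hypothetical output templates keyed by family; not Magenta's schema.
TEMPLATES = {
    "model": "{root}/{container}/{asset}/publish/model",
    "rig": "{root}/{container}/{asset}/publish/rig",
}


def compute_integrate_dir(instance):
    """Pick a template by family and format it with instance data."""
    template = TEMPLATES[instance.data["family"]]
    return template.format(root=instance.data["root"],
                           container=instance.data["container"],
                           asset=instance.data["asset"])
```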

Currently I've separated out how we inject data into the Instance by taking it from the Context, so that injecting this data does not need to be done from within each Collector. Have a look here: BigRoy/pyblish-magenta@b6e4d19

However, this will always override the data for any instance that has been Collected, which might be more annoying than what we gain from removing this duplication in code.


Thoughts?

@mottosso (Member) commented Jul 6, 2015

It's a broad topic; here are some broad thoughts.

The original intent of extraction was always to perform serialisation only, and not involve itself with location or interaction with databases. In this case, extracting into a temporary directory and passing this directory on to integration is exactly aligned with this.

Integration then is the complete opposite. It doesn't do any generation of data on its own, but merely "mediates" the data, and aligns it with the overall pipeline.

Where Collection represents the "input" of a processing graph, Integration then is the "output". In between, data may "fan out" and become divided into smaller tasks, but in the end it must all pass through integration, i.e. "fan in", if the content is ever to see the light of day.
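For the record, a minimal sketch of that chain, written against the pyblish-base ContextPlugin/InstancePlugin API for illustration (names like extractDir and integrateDir follow the conventions used in this thread):

```python
import os
import shutil
import tempfile

import pyblish.api


class CollectModel(pyblish.api.ContextPlugin):
    """Input: find publishable content and create instances."""
    order = pyblish.api.CollectorOrder

    def process(self, context):
        instance = context.create_instance("ben")
        instance.data["family"] = "model"


class ExtractModel(pyblish.api.InstancePlugin):
    """Serialisation only: write into a temporary directory."""
    order = pyblish.api.ExtractorOrder
    families = ["model"]

    def process(self, instance):
        extract_dir = tempfile.mkdtemp(prefix="pyblish_")
        instance.data["extractDir"] = extract_dir
        path = os.path.join(extract_dir, instance.name + ".mb")
        with open(path, "w") as f:
            f.write("serialised scene")  # stand-in for a real export


class IntegrateModel(pyblish.api.InstancePlugin):
    """Output: mediate the extracted files into the pipeline."""
    order = pyblish.api.IntegratorOrder
    families = ["model"]

    def process(self, instance):
        extract_dir = instance.data["extractDir"]
        integrate_dir = instance.data["integrateDir"]  # per project schema
        for fname in os.listdir(extract_dir):
            shutil.copy(os.path.join(extract_dir, fname),
                        os.path.join(integrate_dir, fname))
```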

> We also want to implement versioning, so that could mean additional required data.

Canonically, no process should ever know about existing assets, or the state of existing assets, until it comes to integration. In the case of versioning, which requires knowing the current highest version in order to increment it, this would have to happen solely during integration.

This means that an integrator is free not only to produce final outputs, but also to communicate and gather information (unrelated to validation and extraction) in order to make its final decision. An integrator is always assumed to be right, so no validation is ever required here, nor serialisation, which in most cases should converge into plain file-copying and persistence of data within each Instance and/or Context.
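As a minimal sketch of that idea, assuming a flat publish directory and a name_v###.ext naming convention (both assumptions for illustration, not Magenta's actual layout):

```python
import os
import re
import shutil


def next_version(publish_dir):
    """Scan existing publishes on disk and return the next version."""
    versions = [0]
    for name in os.listdir(publish_dir):
        match = re.match(r".*_v(\d{3})\.\w+$", name)
        if match:
            versions.append(int(match.group(1)))
    return max(versions) + 1


def integrate(extract_dir, publish_dir, asset):
    """Plain file-copying: the version is decided here, and only here."""
    version = next_version(publish_dir)
    for fname in os.listdir(extract_dir):
        _, ext = os.path.splitext(fname)
        dst = os.path.join(publish_dir, "%s_v%03d%s" % (asset, version, ext))
        shutil.copy(os.path.join(extract_dir, fname), dst)
```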

> However, this will always override the data for any instance that has been Collected, which might be more annoying than what we gain from removing this duplication in code.

Not sure what you mean here, but if you mean that the first instance will create a temporary directory, whereas subsequent instances would be written to an already existing temporary directory, then that's perfectly fine and intended.

The temporary directory is much like Git's "staging area" in that it holds an arbitrary amount of information, but does so temporarily until it all is converged, or integrated, with the rest of the data.

@BigRoy (Member, Author) commented Jul 6, 2015

> Canonically, no process should ever know about existing assets, or the state of existing assets, until it comes to integration. In the case of versioning, which requires knowing the current highest version in order to increment it, this would have to happen solely during integration.

> This means that an integrator is free not only to produce final outputs, but also to communicate and gather information (unrelated to validation and extraction) in order to make its final decision. An integrator is always assumed to be right, so no validation is ever required here, nor serialisation, which in most cases should converge into plain file-copying and persistence of data within each Instance and/or Context.

Why would it be up to the Integrator to acquire the data (e.g. about the current highest version), as opposed to the Collector?

This would also rule out Validations (e.g. for versioning) like this one: https://github.com/mkolar/pyblish-kredenc/blob/master/plugins/common/validate_version_number.py

I feel it might be nice to have the Selector provide data about the current highest published version of the asset. I was thinking of an Increment Version Integrator ordered at -0.1 that is toggled off by default; only when it is toggled on will it publish incrementally. It's up to the artist to ensure the changes they made won't break anything. What do you think?
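For concreteness, a rough sketch of what that opt-in plug-in could look like, using pyblish's optional/active plug-in attributes (the ordering and the version bookkeeping are assumptions, not an agreed design):

```python
import pyblish.api


class IncrementVersion(pyblish.api.InstancePlugin):
    """Opt-in: only bumps the version when the artist enables it."""
    label = "Increment Version"
    order = pyblish.api.IntegratorOrder - 0.1
    optional = True   # exposed as a toggle to the artist
    active = False    # toggled off by default

    def process(self, instance):
        # Hypothetical bookkeeping; where the current version actually
        # comes from is the subject of this discussion.
        instance.data["version"] = instance.data.get("version", 0) + 1
```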


Either way, I would love to see a simple pseudocode example of what the Collector does, what the Extractor does and what the Integrator does.

@mottosso (Member) commented Jul 6, 2015

> Why would it be up to the Integrator to acquire the data (e.g. about the current highest version), as opposed to the Collector?

Because it isn't related to the quality of what you are outputting. If a version on disk is faulty, then that is a fault carried over from a previous publish.

> Either way, I would love to see a simple pseudocode example of what the Collector does, what the Extractor does and what the Integrator does.

Sure, I'll have a look at this.

@mottosso (Member) commented Jul 6, 2015

> Either way, I would love to see a simple pseudocode example of what the Collector does, what the Extractor does and what the Integrator does.

I've mocked up an example for you here: https://gist.github.com/mottosso/863e97d6f9d08a0d9eee

@mottosso (Member) commented Jul 6, 2015

> I feel it might be nice to have the Selector provide data about the current highest published version of the asset. I was thinking of an Increment Version Integrator ordered at -0.1 that is toggled off by default; only when it is toggled on will it publish incrementally. It's up to the artist to ensure the changes they made won't break anything. What do you think?

It would be nice and convenient, but it would also break encapsulation. Think about it: that data doesn't need validation; it has already been saved to disk. The damage is already done.

Furthermore, that data isn't part of what an artist has produced; it's part of what previous Integrators have produced. If anyone should be warned about an invalid version or bad naming convention on already written files, it should be the developer who produced the integrator.

@BigRoy (Member, Author) commented Jul 6, 2015

> It would be nice and convenient, but it would also break encapsulation. Think about it: that data doesn't need validation; it has already been saved to disk. The damage is already done.

This isn't correct. The damage wouldn't have been done if the Validator caught it before Extraction. Plus, it wouldn't even be in a 'damaging' position if it were validated after Extraction; it would only be stored in the temporary location.

I think it's not that we're validating whether previous extractions went alright, but whether the version we are integrating now is up to par with our requirements.

Though, as you state, it's definitely not up to the artist to decide where it goes, unless there's user-defined data that influences "as what type of data it gets extracted". A good example is publishing shader variations (which we do a lot in our pipeline): for example, we build a red, a blue and a yellow bottle of wine. Each individual variation (for a single asset) could be validated for whether it's named correctly, already exists, etc. The point being that when a user can interact with data that influences Integration, we want it validated, because it's prone to human error.

But I think it's good to see where the ship leads us if we keep it purely implemented in Integration.


> I've mocked up an example for you here: https://gist.github.com/mottosso/863e97d6f9d08a0d9eee

Some questions that come to mind:

  1. How do we let the Extractors extract to the correct temporary location without having to redesign Extractors per pipeline? Should we add a Selector that sets up the extractDir data (or an Extractor that is ordered -0.1, whichever makes more sense)? Do we let multiple Extractors extract to the same directory? If so, what do we do about naming conflicts? Or how do we ensure there are none?
  2. What data do we provide so that the integrator knows how to rename a file in the end? This partially depends on the structure of how we want files to be integrated. Do we smash it all into a single published folder for an asset?

@mottosso (Member) commented Jul 6, 2015

> This isn't correct. The damage wouldn't have been done if the Validator caught it before Extraction.

Are we talking about looking at existing files on disk, and validating whether those files are valid, during the publish of a new file?

Here's what I'm hearing.

MyAsset
├── publish
│   ├── myasset_v001.ma
│   ├── myasset_v002.ma
│   └── myasset_v003.ma
└── dev

When we're about to publish MyAsset once more, it would then create myasset_v004.ma.

You would like to (1) include myasset_v001-3.ma during collection of MyAsset, and (2) validate these versions? I'm sure this isn't what you mean.

> A good example is publishing shader variations (which we do a lot in our pipeline): for example, we build a red, a blue and a yellow bottle of wine.

I guarantee you that there is a better way to solve this exact thing, one which doesn't require integration to be validated.

I invite you to produce this asset in the \Pyblish_sandbox\magenta directory and I'll gladly walk you through how this can happen without complicating integration.

> How do we let the Extractors extract to the correct temporary location without having to redesign Extractors per pipeline? Do we let multiple Extractors extract to the same directory? If so, what do we do about naming conflicts? Or how do we ensure there are none?

Yes, that's right, multiple extractors write to the same directory. That's what this is doing. The directory is a generic staging area; each extractor could create its own little subdirectory if needed, but in general, the data each extractor produces should be unique enough not to need that.

The way I handled this in Napoleon was to create one subdirectory per family, and typically I only extracted a single family via a single extractor.
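A small sketch of that per-family convention (the key names follow the extractDir convention from earlier in the thread):

```python
import os


def family_staging_dir(instance):
    """One subdirectory per family inside the shared staging area."""
    subdir = os.path.join(instance.data["extractDir"],
                          instance.data["family"])
    if not os.path.exists(subdir):
        os.makedirs(subdir)
    return subdir
```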

> What data do we provide so that the integrator knows how to rename a file in the end?

It depends on what file we're talking about.

Let's take the model of ben in The Deal as an example. Ben is extracted as e.g. ben.mb; his parent (temporary) directory is stored on his instance as e.g. commitDir.

/tmp
└── ben.mb

In this case, an integrator with support for model families would come to expect models to be stored in this manner (a name and a suffix) and could simply move this exact file into the appropriate directory and give it an appropriate name.

In case a playblast and a gif are also present...

/tmp
├── ben.mov
├── ben.gif
└── ben.mb

The integrator will now need to support gifs and playblasts to properly manage their final locations, and when it does, it will know what to do with files in whichever format they are expected to reside in; for example, it could make the distinction based on their suffix.

So you see, there needs to be an interplay between extractors and integrators. There needs to be an "API" or "contract" to which they have both agreed. Any extractor going rogue and producing things an integrator isn't expecting will simply not get integrated. No harm done.
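To illustrate that contract, a sketch of suffix-based dispatch (the destination subdirectories are hypothetical):

```python
import os
import shutil

# The "contract": suffixes this integrator has agreed to handle.
DESTINATIONS = {
    ".mb": "model",     # Maya binary scenes
    ".mov": "preview",  # playblasts
    ".gif": "preview",  # animated previews
}


def integrate_by_suffix(commit_dir, publish_dir):
    for fname in os.listdir(commit_dir):
        _, suffix = os.path.splitext(fname)
        destination = DESTINATIONS.get(suffix)
        if destination is None:
            # An extractor "going rogue": not part of the contract,
            # so its file simply does not get integrated. No harm done.
            continue
        dst_dir = os.path.join(publish_dir, destination)
        if not os.path.exists(dst_dir):
            os.makedirs(dst_dir)
        shutil.copy(os.path.join(commit_dir, fname), dst_dir)
```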

@BigRoy (Member, Author) commented Feb 17, 2016

So much has changed since this discussion that I'm not even sure how to "relate" this to the current state of Magenta. If this is still relevant, I think it would be great to see briefly outlined what exactly we need to fix or add; otherwise, let's close the discussion.
