Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataMigrate with a detached container #1

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

tyler-cai-microsoft
Copy link
Owner

@tyler-cai-microsoft tyler-cai-microsoft commented Aug 9, 2023

  • I used runtime factory chaining
  • Did not implement quorum proposals - the solution will need it and understand what the challenges there are
  • Did not bulletproof clients from submitting ops - quorum proposal pausing. There is a partial solution implemented.
  • Did not close all containers
  • Transitions a shared cell to a shared map
  • Need to consider the whole running summarizer locked action as double submitting a summary will result in the summary not being uploaded properly.
  • Using a readonly context (detached == readonly)
  • Need to consider vamping the whole summarizer system from the summarizer class to allow summarizing with a specific summary
  • Need a way to access the summarizer when it gets created, I hacked my way around it.

return readonlyContext;
}

class ContainerRuntimeFactoryWithDataMigration extends RuntimeFactoryHelper {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth adding comments above key classes to clarify their purpose.
If I got it right, this is an CRF adapter, that makes a call on what CRF instance to use depending on version and detached/attached state.

}
}

class SharedCellToSharedMapFactory implements IChannelFactory {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I got right, this is essentially on-the-fly transformer for node in the graph (dds), if using this kind of language.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

@@ -381,8 +387,11 @@ export class TestObjectProvider implements ITestObjectProvider {
entryPoint: fluidEntryPoint,
loaderProps?: Partial<ILoaderProps>,
requestHeader?: IRequestHeader,
packageEntries: Iterable<[IFluidCodeDetails, fluidEntryPoint]> = [
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine for a hack, but it's not great to have an argument that is not used if further arguments are present.
Not sure how many implementations of this interface are out there, but I'd consider adding new API instead on it

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was part of my attempt to do some mock of quorum proposals, feel free to ignore.

@@ -2825,6 +2891,16 @@ export class ContainerRuntime
);
}

public submitTransition() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe startProposal() or submitTransitionProposal()?


const loadV2Container = async (summaryVersion: string): Promise<IContainer> =>
provider.loadContainer(
dataMigrationRuntimeFactoryV2,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code is a bit confusing to read - if you keep current hack around, I'd suggest passing undefined here (as this arg is not used) - it's a bit easier to read it that way (without need to remind constantly that first arg is not used)

runtimeFactoryV1,
runtimeFactoryV1ToV2,
runtimeFactoryV2,
deferred,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest instead have a function on ContainerRuntimeFactoryWithDataMigration that returns this promise. Under cover (inside class) we can use same deferred approach, but having a function with a name and a call many lines below will be more readable as one could glimpse at least some context from the name - when I see where this differed is used 40 lines down, I have no context where it came from, and what are the semantics for it - need to drill into implementation of this class to get that.

summarizer.on("connected", () => {
waitForSummarizerRuntimeConnection.resolve();
});
await waitForSummarizerRuntimeConnection.promise;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think even in your PR we use this pattern of setting up listener to resolve differed multiple times. And it's somewhat wrong as you do not use once(), you use on() :) I suggest creating a general-purpose helper function that takes arbitrary EventEmitter, arbitrary event name and returns a promise when such event name is called. Implementation will create deferred and do what you do here, but for callers it would be a single line call - await waitEvent(emitter, "connected").

await waitForSummarizerRuntimeConnection.promise;
const migrateAck = await migrate(summarizer, readonlyRuntime);
const summaryVersion = migrateAck.handle;
transitionContainer.close();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using same prefix for different types of variables. For example, I think readonlyRuntime corresponds here to transitionContainer, if so - it would be great for both variables have either "transition" or "readonly" or both in their names. Same prefixes make it easier to follow relationships. Sometimes numbers are better, i.e. runtime1, runtime2, ... with some header at the top describing what 1/2/3 actually represent.

assert(codeDetails !== undefined, "get code details failed");

this._currentRuntimeFactory = codeDetails.package === "v1" ? this.v1 : this.v2;
if (this.clientDetailsType === detachedClientType) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something worth confirming (I think I made this mistake initially in my thinking, and I think you follow it):
I think CRF is supposed to be stateless object, and used to load many different files (i.e. there is one-many relationships for CFR-CR). If that's true, then I think we are in trouble with this approach.

If you look at IRuntimeFactory, it's simple - it has only one method. I think using RuntimeFactoryHelper here hurts more than it helps - if it's a single method, then that means all this code could be written in linear fashion, and then most of the state likely will be on stack, not as class remembers.

That said, I think the whole RuntimeFactoryHelper pattern is wrong. I'd break it into two classes (probably worth opening a task to do this work in the future, as it's outside of this work):

  1. Class A, that implements IRuntimeFactory, but also takes a factory to be able to create instances of class B for each instantiation.
  2. Class B, that has all the methods around pre-initializaiton, post-initializaiton, etc. And it's stateful, i.e. it's safe to have state because it's used to load only one instance of a runtime.
  3. I'd kill "provide" nonsense from it, and make it not implement IRuntimeFactory, but implement all the other methods it has today. And in places where we need IRuntimeFactory, I'd use different class implementation that would create

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not in love with this implementation either. Glad this clarified that part.

this.submitSignalFn = () => {};
}

this.loadDetachedAndTransitionFn = async () => {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit convoluted. On one hand I can see why you might want to put this code here - "/_detachedContainerRuntime" string seems to be ContainerRuntime business only. But otherwise this block seems like does not belong here - it belongs where loadDetachedAndTransitionFn() is called.

Maybe one reason why it's not there - this class has access to loader, which (if I got right) is relative loader (i.e. it operates only on current URI namespace). But do we need loader at all? Can we have detached context and runtime attached to it, and that's about it? I.e. could we avoid touching loader layer?

Your approach with having to use loader to create container object is fine as well, it's just so hard to follow the network of connections / code flow on how things are created and what is their state as functionality is spread out.

return newTree;
}

async function migrate(summarizerRuntime: ContainerRuntime, readonlyRuntime: ContainerRuntime) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I got lost here on what these runtimes are. Trying to follow the code, it feels like readonlyRuntime should be a detached runtime, but if so, I'm not sure why we call startSummary() on it, so I think I got it wrong (and thus got lost)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we call startSummary, we can call summarize and potentially get an incremental summary that is submittable.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't call startSummary, we can't summarize because the summarizerNode system would break down. All clients utilize the summarizerNode system from what I understand. This may change in the future, I'm not sure.

}

async function migrate(summarizerRuntime: ContainerRuntime, readonlyRuntime: ContainerRuntime) {
(readonlyRuntime as any).summarizerNode.startSummary(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple notes:

  • I think we need to do something here to ensure that summarizer does not produce a summary on its own, with V2 quorum state (as that's the state of it when we get here) but with V1 content.
  • In general, we should consider each object in the system to know what to do in response to various events (like quorum proposals), i.e. instead of poking at objects from outside and tell them what to do, I think we benefit a lot (where it's possible to accomplish) if objects (like summarizer) know what is expected from them in response to observing transition ops. This solves first point, but also reduces number of weird APIs we need to add to various layers to be able to pick through layers.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the notes.

const neverCancel = new Deferred<SummarizerStopReason>();
const runningSummarizer = (summarizerRuntime as any).summarizer.runningSummarizer;
await runningSummarizer.summarizingLock;
await runningSummarizer.lockedSummaryAction(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above - I think this kind of locking should happen immediately in response to V2 acceptance op, such that summarizer can't produce one more summary with ref seq number of V2 acceptance op. That's fine as prototype - this is mostly invitation for discussion on how to organize code in the future to have proper correctness gurantees.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the locking should happen on receival of the op.

readonlyRuntime.deltaManager.lastSequenceNumber,
new TelemetryNullLogger(),
);
const appSummary = await readonlyRuntime.summarize({ fullTree: true });
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think you would use createSummary() call instead - I think it's more suitable here.
That said, I think I see the problem with my original thinking - createSummary() will use refSeqNumber = 0, and I believe you are trying to fix it. I wonder if we could generalize it, i.e. can we change our code to be able to create detached containers from arbitrary seq number? I'm sure it's not easy - I'm sure zero is encoded in other places (like driver), though I remember I did some work to remove such code and rely strictly on what Container provides.
I think we need to start collecting notes like that into some readme.md - things that are hard / we have to hack around. As they are negatives for this approach, and having too many of these could outweigh the benefits of this approach over some other approach. I.e. we should always think about it as an experiment, in search for best solution.

That all said, do we care what is the seq number? The app tree should not care, no? Where does it show up?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some metadata field that I had to modify in the submit summary function on the summarizer client. Look at the containerRuntime submit summary function.

@@ -212,6 +212,9 @@ export enum ContainerMessageType {
* See the [IdCompressor README](./id-compressor/README.md) for more details.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is for whole review (putting it here to allow proper discussion thread).

I believe what you have here is not export / import flow, where arbitrary transformation over file is possible. Instead, you are essentially doing conversion on the fly, in-place, i.e., the fact of loading container with special container runtime results in transformation of content and a new state. And the only transformations allowed are those that transform individual notes irrespective of the rest of container (where node is DDS or data store).

There are couple directions of exploration from here:

  • If developer needs arbitrary transformation (let's say merge 2 DDSs into one instance), how would we go about it?
    • Here, we might say - we do not care about incrementality / preservation of IDs. It's a full rewrite where we do not care if most of the content is the same or not.
  • If we are fine with only node-to-node transformations, can we do it somehow in-place inside of summarizer client, such that it could produce incremental summary?
  • the most complex - both of above, i.e. relatively arbitrary transformations inside of container runtime that result in incremental summary
    • we need to be careful though with "arbitrary" part here. I think we should allow arbitrary transformations, but I believe they could be constrained to - replace a node (with all of its children) to another node. Here node could be DDS, DataStore, or ContainerRuntime (in later case it means - wipe out all data stores and start from scratch creating any arbitrary set of data stores). I think the key here - developer has choices, but small incremental changes are possible.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know if it adds clarify, but here is an example of something similar outside of our project:

  1. https://www.inkandswitch.com/cambria/
  2. https://github.com/stripe/dagon

The problem of changing data schema is very common in industry, and there is a bunch of interesting work out there. I'm not sure work itself is very applicable, but ideas are.

Here is my understanding how it works usually:

  1. There is a framework that allows walking / enumerating through DAG of objects (in our case it's CR/DS (Datastores)/DDSs
  2. a set of callbacks (or better - classes with methods, such that they can be stateful) that are called for nodes with ability to replace node with some other node.
  3. Such callbacks/classes
    • could replace a node with another node. For example, data store with another data store instance.
    • There could be multiple passes over the graph, such that these classes could collect info on first pass, and act on this info on second pass. I.e. imagine that on first pass it collects info about all Directory instances in container, and on second pass it created one SharedTree instance and patches all other DDSs / Datastores that had reference to Directory instance.
    • GC subsystem is invoked at the end to collect nodes that are not referenced any more.

That said, all such systems usually work on DAG. If we count handles, we deal with arbitrary graph, not DAG. And it's not clear if we want to support such arbitrary transformations that could leave dangling handles. We might want to say initially - it should be possible, and we will explore it later, but for now we allow only simplistic transformation - DDS node to DDS node.

I have no way how to do it, this is mostly discussion allowed to see if we can see some interesting next steps on that way - the way to more generalized system in the future but leveraging principles of that system even today to implement in-place incremental substitution of DDS instance.

Let's chat more :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my poor-man's implementation suggestion: imagine that all parent nodes (container runtime, data store) provide a method to enumerate children and replace child with another (new) child, or patch child in-place (like make changes inside of DDS through normal APIs, but this does not result in ops - though we do not need it now, so defer for later implementation of it). These methods are not exposed on any interfaces, only ContainerRuntime (and CRF) have access to them. For now, we will use only leaf node replacement (i.e. replace DataStore child) for our needs, and will expose transformer that walks the tree and asks transformer if a node (DDS) needs to be replaced. Our transformer will look only for specific factory type (Legacy Shared Tree) and will involve abstract method to replace it with another DDS instance.

Something like that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants