
Implement "event merging" #68

Closed
wimleers opened this issue Aug 1, 2011 · 7 comments

Comments

@wimleers
Owner

wimleers commented Aug 1, 2011

Whenever a change on the file system is detected, File Conveyor adds it to the "discover queue", after which it is moved ASAP to the "pipeline queue". The pipeline queue contains items of the form (input_file, event). Now, if a file is changed, that may happen in the form of "1) delete, 2) create". Those will be 2 separate events in the queue. I want to add functionality to merge such events.
Also, when a file is first created, then deleted, then created, then deleted, and so on, and the daemon has so many files to process that all those events are still queued, it is not smart enough: it will delete the file from the CDN, then create it, then delete it, and so on. That is, it replays the exact events instead of merging them into the single resulting event. Merging these events will make the daemon faster and more efficient.
This is the only major oversight I'm aware of.

All merging patterns:

  • new + modify = new
  • new + delete = remove from the queue!
  • modify + modify = modify
  • modify + delete = delete
  • delete + new = modify

This implies that a new data structure has to be developed: the current PersistentQueue data structure does not allow for this merging of events.
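As a minimal sketch of the merging rules above, assuming a plain dict keyed by file path as a stand-in for the new data structure (the event-name constants and the `enqueue` helper are illustrative, not File Conveyor's actual `PersistentQueue` API):

```python
# Illustrative event names; File Conveyor's real constants live on FSMonitor.
CREATED, MODIFIED, DELETED = "created", "modified", "deleted"

# (pending event, new event) -> merged event; None means "drop from queue".
MERGE_RULES = {
    (CREATED, MODIFIED): CREATED,    # new + modify = new
    (CREATED, DELETED): None,        # new + delete = remove from the queue!
    (MODIFIED, MODIFIED): MODIFIED,  # modify + modify = modify
    (MODIFIED, DELETED): DELETED,    # modify + delete = delete
    (DELETED, CREATED): MODIFIED,    # delete + new = modify
}

def enqueue(queue, path, event):
    """Add an event to a dict-backed queue keyed by path, merging it
    with any pending event for the same file."""
    if path in queue:
        merged = MERGE_RULES.get((queue[path], event), event)
        if merged is None:
            del queue[path]   # the events cancel out: nothing to sync
        else:
            queue[path] = merged
    else:
        queue[path] = event
```

For example, enqueuing CREATED and then DELETED for the same path leaves the queue empty, while DELETED followed by CREATED is stored as a single MODIFIED event. A dict loses FIFO ordering, which is why the real fix needs a queue structure that supports lookup by key as well.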

@ghost ghost assigned wimleers Aug 1, 2011
@wimleers
Owner Author

wimleers commented Aug 1, 2011

Related issues: #2, #12 & #30.

@tedfordgif

What is the reason to have new and modified be different states? It seems that state determination should be based on interrogation of the endpoint(s) of the transfer(s). In other words, a file is new if it does not exist at the destination, and modified if it does. If I have multiple endpoints, they might somehow get into different states.

@wimleers
Owner Author

wimleers commented Aug 1, 2011

The state at the endpoint is also taken into account.

However, this is purely about the state of file events in the pipeline queue. The pipeline queue contains files that have yet to be processed (and then transported). Nothing has happened to them yet, other than their events being detected, which is what put these files in the discovery queue.
Therefore, it makes sense to mark a new, unsynced file that is then modified (while still unsynced) as just "new", instead of creating both a "new" and a "modified" event for it in the pipeline queue (which is what currently happens). The result of the current (unfortunate and inefficient) implementation is that the file is synced and then synced again, even though only one version ever needed to go out: first for the creation ("new") event, and again because it was changed ("modify") before the first sync could happen.

That's what this is about.

The state at the endpoint is only taken into account as soon as we arrive at the transporting step.

I hope this answers your concerns.

@tedfordgif

I understand the purpose of this ticket; perhaps my question could have been clearer. Is there any need to track three file statuses (new, modified, deleted), or would two decision statuses suffice (needs-to-be-transferred, needs-to-be-deleted)? I suppose there is no need to throw away that information when the notification library supplies it, but there might not be a need to keep it either. If so, instead of worrying about merging events, it's just a binary decision:

if event == "deleted":
    queue_status = "delete"
else:  # "new" or "modified"
    queue_status = "transfer"

It doesn't change the decision, but simplifying like this might have made this design consideration more obvious in the first place.

@wimleers
Owner Author

wimleers commented Aug 2, 2011

First of all: these design decisions were made >2 years ago, so I can't immediately give you "the complete explanation".

Those two decision statuses are indeed all that's necessary to know whether to transfer the file to, or delete it from, one of its destinations.
However, this simplification of events only happens at the very last step. Before that, I keep all metadata around (you never know which edge case, optimization or piece of functionality might need additional metadata). File Conveyor's components (i.e. all code except arbitrator.py) are as generic as possible, allowing them to be interchanged fairly easily.

Also, I think there may be some edge cases deep in arbitrator.py that require this knowledge. They might not strictly need to, but for now, I'm not going to change this.

Finally, in the case of FSMonitorInotify, there is the fact that even a simple touch /monitored/directory triggers multiple events. When that file does not yet exist, it generates both the IN_CREATE (mapped to FSMonitor.CREATED) and IN_ATTRIB (mapped to FSMonitor.MODIFIED) events. If we can merge these into a single FSMonitor.CREATED event, we get the correct interpretation of the events without simplifying them to "transfer" so early in the process. I.e., other applications may be able to benefit from this as well.
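A sketch of that inotify case, folding a burst of events for one file into its net effect (the lower-case event names are illustrative stand-ins for the FSMonitor constants, and `collapse` is a hypothetical helper, not File Conveyor code):

```python
def collapse(events):
    """Reduce a list of events for one file to the single net event.
    Returns None when the events cancel out (new + delete)."""
    net = None
    for ev in events:
        if net is None:
            net = ev
        elif net == "created" and ev == "modified":
            net = "created"    # new + modify = new
        elif net == "created" and ev == "deleted":
            net = None         # new + delete = drop entirely
        elif net == "deleted" and ev == "created":
            net = "modified"   # delete + new = modify
        else:
            net = ev           # modify + modify, modify + delete: latest wins
    return net
```

Applied to the touch example, the IN_CREATE/IN_ATTRIB pair collapses to a single "created" event, so the file is uploaded once instead of twice.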

I'm afraid I can't give you a better answer than the above at this time.

@wimleers wimleers reopened this Aug 2, 2011
@wimleers
Owner Author

wimleers commented Aug 2, 2011

Oops, that commit should have closed #69! Reopening.

wimleers added a commit that referenced this issue Aug 2, 2011
wimleers added a commit that referenced this issue Aug 3, 2011
…lity to get an item for a key and to remove an item for a key. Necessary for #68.
@wimleers wimleers reopened this Aug 3, 2011
@wimleers
Owner Author

wimleers commented Aug 3, 2011

There's something stupid I've overlooked: the changed DB schema for PersistentQueue implies I need to provide an upgrade path! Reported by @unn. Will fix this.
