
Implement "event merging" #68

Closed
wimleers opened this issue Aug 1, 2011 · 7 comments

Comments

@wimleers
Owner

wimleers commented Aug 1, 2011

Whenever a change on the file system is detected, File Conveyor adds it to the "discover queue", after which it is moved ASAP to the "pipeline queue". The pipeline queue contains items of the form (input_file, event). Now, if a file is changed, that may happen in the form of "1) delete, 2) create". Those will be 2 separate events in the queue. I want to add functionality to merge such events.
Also, when a file is first created, then deleted, then created, then deleted, and so on, and the daemon has so many files to process that all those events are still queued, it is not smart enough: it will delete the file from the CDN, then create it, then delete it, and so on. That is, it replays the exact events instead of merging them into the single resulting event. Merging these events will make the daemon faster and more efficient.
This is the only major oversight I'm aware of.

All merging patterns:

  • new + modify = new
  • new + delete = remove from the queue!
  • modify + modify = modify
  • modify + delete = delete
  • delete + new = modify

This implies that a new data structure has to be developed: the current PersistentQueue data structure does not allow for this merging of events.
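As a minimal sketch of the merging rules above, assuming a plain dict keyed by file path as a stand-in for the new data structure (the event-name constants and the `enqueue` helper are illustrative, not File Conveyor's actual `PersistentQueue` API):

```python
# Illustrative event names; File Conveyor's real constants live on FSMonitor.
CREATED, MODIFIED, DELETED = "created", "modified", "deleted"

# (pending event, new event) -> merged event; None means "drop from queue".
MERGE_RULES = {
    (CREATED, MODIFIED): CREATED,    # new + modify = new
    (CREATED, DELETED): None,        # new + delete = remove from the queue!
    (MODIFIED, MODIFIED): MODIFIED,  # modify + modify = modify
    (MODIFIED, DELETED): DELETED,    # modify + delete = delete
    (DELETED, CREATED): MODIFIED,    # delete + new = modify
}

def enqueue(queue, path, event):
    """Add an event to a dict-backed queue keyed by path, merging it
    with any pending event for the same file."""
    if path in queue:
        merged = MERGE_RULES.get((queue[path], event), event)
        if merged is None:
            del queue[path]   # the events cancel out: nothing to sync
        else:
            queue[path] = merged
    else:
        queue[path] = event
```

For example, enqueuing CREATED and then DELETED for the same path leaves the queue empty, while DELETED followed by CREATED is stored as a single MODIFIED event. A dict loses FIFO ordering, which is why the real fix needs a queue structure that supports lookup by key as well.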

@ghost ghost assigned wimleers Aug 1, 2011
@wimleers
Owner Author

wimleers commented Aug 1, 2011

Related issues: #2, #12 & #30.

@tedfordgif

What is the reason to have new and modified be different states? It seems that state determination should be based on interrogation of the endpoint(s) of the transfer(s). In other words, a file is new if it does not exist at the destination, and modified if it does. If I have multiple endpoints, they might somehow get into different states.

@wimleers
Owner Author

wimleers commented Aug 1, 2011

The state at the endpoint is also taken into account.

However, this is purely about the state of file events in the pipeline queue. The pipeline queue contains files that have yet to be processed (and then transported). Nothing has happened to them yet, other than their events being detected, which is what put these files in the discovery queue.
Therefore, it makes sense to mark a new, unsynced file that is then modified (while still unsynced) as just "new", instead of creating both a "new" and a "modified" event for it in the pipeline queue (which is what currently happens). The result of the current (unfortunate and inefficient) implementation is that the file is synced and then synced again, even though only one version ever needed to go out: first for the creation ("new") event, and again because it was changed ("modify") before the first sync could happen.

That's what this is about.

The state at the endpoint is only taken into account as soon as we arrive at the transporting step.

I hope this answers your concerns.

@tedfordgif

I understand the purpose of this ticket; perhaps my question could have been clearer. Is there any need to track three file statuses (new, modified, deleted), or would two decision statuses suffice (needs-to-be-transferred, needs-to-be-deleted)? I suppose there is no need to throw away that information when the notification library supplies it, but there might not be a need to keep it either. If so, instead of worrying about merging events, it's just a binary decision:

if event == "deleted":
    queue_status = "delete"
else:  # "new" or "modified"
    queue_status = "transfer"

It doesn't change the decision, but simplifying like this might have made this design consideration more obvious in the first place.

@wimleers
Owner Author

wimleers commented Aug 2, 2011

First of all: these design decisions were made >2 years ago, so I can't immediately give you "the complete explanation".

Those two decision statuses are indeed all that's necessary to know whether to transfer the file to, or delete it from, one of its destinations.
However, this simplification of events only happens at the very last step. Before that, I keep all metadata around (you never know which edge case, optimization or piece of functionality might need additional metadata). File Conveyor's components (i.e. all code except arbitrator.py) are as generic as possible, allowing them to be interchanged fairly easily.

Also, I think there may be some edge cases deep in arbitrator.py that require this knowledge. They might not strictly need to, but for now, I'm not going to change this.

Finally, in the case of FSMonitorInotify, there is the fact that even a simple touch /monitored/directory triggers multiple events. When that file does not yet exist, it generates both the IN_CREATE (mapped to FSMonitor.CREATED) and IN_ATTRIB (mapped to FSMonitor.MODIFIED) events. If we can merge these into a single FSMonitor.CREATED event, we get the correct interpretation of the events without simplifying them to "transfer" so early in the process. I.e., other applications may be able to benefit from this as well.
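A sketch of that inotify case, folding a burst of events for one file into its net effect (the lower-case event names are illustrative stand-ins for the FSMonitor constants, and `collapse` is a hypothetical helper, not File Conveyor code):

```python
def collapse(events):
    """Reduce a list of events for one file to the single net event.
    Returns None when the events cancel out (new + delete)."""
    net = None
    for ev in events:
        if net is None:
            net = ev
        elif net == "created" and ev == "modified":
            net = "created"    # new + modify = new
        elif net == "created" and ev == "deleted":
            net = None         # new + delete = drop entirely
        elif net == "deleted" and ev == "created":
            net = "modified"   # delete + new = modify
        else:
            net = ev           # modify + modify, modify + delete: latest wins
    return net
```

Applied to the touch example, the IN_CREATE/IN_ATTRIB pair collapses to a single "created" event, so the file is uploaded once instead of twice.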

I'm afraid I can't give you a better answer than the above at this time.

@wimleers wimleers reopened this Aug 2, 2011
@wimleers
Owner Author

wimleers commented Aug 2, 2011

Oops, that commit should have closed #69! Reopening.

wimleers added a commit that referenced this issue Aug 2, 2011
wimleers added a commit that referenced this issue Aug 3, 2011
…lity to get an item for a key and to remove an item for a key. Necessary for #68.
@wimleers wimleers reopened this Aug 3, 2011
@wimleers
Owner Author

wimleers commented Aug 3, 2011

There's something stupid I've overlooked: the changed DB schema for PersistentQueue implies I need to provide an upgrade path! Reported by @unn. Will fix this.
