INT-4116: Introduce FileAggregator #3511

artembilan · 2021-03-12T19:22:38Z

JIRA: https://jira.spring.io/browse/INT-4116

* Implement FileSplitter.FileMarker-based aggregation strategies and use them in a general FileAggregator component
* Make HeaderAttributeCorrelationStrategy.attributeName final; add Assert.notEmpty()
* Fix AggregatorFactoryBean and AggregatorSpec to parse the provided processor for a possible CorrelationStrategy and/or ReleaseStrategy
* Introduce shortcut methods into the Java & Kotlin DSL for an aggregate() configuration
* Introduce a FileHeaders.LINE_COUNT header to be populated by the FileSplitter. We need this info in the FileAggregator to avoid the possible overhead of JSON deserialization of the FileSplitter.FileMarker messages
* Test and document the feature
* Improve the FileSplitter doc for code block switch (tabs)

garyrussell · 2021-03-12T21:21:12Z

...src/main/java/org/springframework/integration/file/aggregator/FileMarkerReleaseStrategy.java

+				.findAny()
+				.map((message) -> message.getHeaders().get(FileHeaders.LINE_COUNT, Long.class))
+				.map((lineCount) -> lineCount == messages.size() - 2)
+				.orElse(false);


This won't scale well, especially with large files; it would be better to add getLast() to message group.

Oh! I see your point. It looks like the ReleaseStrategy contract would be better as canRelease(MessageGroup group, Message<?> currentMessage), since we always invoke it whenever a new message arrives.
We can deprecate the current contract in favor of the new one and remove it in 6.0 😄

But getLast() API in the MessageGroup won't hurt!

Well, after looking into this closely, getLast() doesn't feel right. My test with the filter() and ExecutorChannel confirms that the last message is not always the END marker. So, I would say that canRelease(MessageGroup group, Message<?> currentMessage) makes sense for the main code flow, but forceComplete(MessageGroup) still needs to iterate the whole group, since there is no way to be sure that the last message added to the group is exactly a FileMarker message.
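To make the idea concrete, here is a minimal, self-contained sketch of a release check that inspects only the message being added plus the group size, instead of walking the group. It is not the Spring Integration API: `SketchMessage`, `CurrentMessageReleaseSketch`, and the header names `marker` and `lineCount` are hypothetical stand-ins for `Message<?>`, the proposed `canRelease(MessageGroup, Message<?>)` contract, `FileHeaders.MARKER`, and `FileHeaders.LINE_COUNT`. It also assumes the marker header carries the value `END` for the end-of-file marker.

```java
import java.util.Map;

// Hypothetical stand-in for a message: a payload plus a header map.
final class SketchMessage {
    final Object payload;
    final Map<String, Object> headers;

    SketchMessage(Object payload, Map<String, Object> headers) {
        this.payload = payload;
        this.headers = headers;
    }
}

final class CurrentMessageReleaseSketch {
    // Assumed header names, standing in for FileHeaders.MARKER / FileHeaders.LINE_COUNT.
    static final String MARKER = "marker";
    static final String LINE_COUNT = "lineCount";

    /**
     * Decides on release by looking only at the message just added.
     * groupSize counts every message in the group, including both markers.
     */
    static boolean canRelease(int groupSize, SketchMessage currentMessage) {
        if (!"END".equals(currentMessage.headers.get(MARKER))) {
            return false; // only the END marker carries the line count
        }
        Object lineCount = currentMessage.headers.get(LINE_COUNT);
        // The group holds the START marker, the END marker and all lines.
        return lineCount instanceof Long
                && ((Long) lineCount).longValue() == groupSize - 2;
    }
}
```

This sketch only releases when the END marker is the message that completes the group; it returns false for every other message, so it does not cover the forceComplete(MessageGroup) path.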

At least some optimization is on its way! 😄

Well, if they go async, all bets are off. If the last marker is processed before the size matches, we'll never release the group.

If we want to support that (async), we'd need another API on the group (e.g. condition1Passed, set to true if we ever received the last marker).

Also, the file lines would be re-assembled out of order.

We may not care about the order of those items, especially when we convert them to some domain objects for further batch inserts into a DB. The applySequence option on the FileSplitter and the Resequencer have always been there for us to restore the original order.
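Conceptually, restoring the order from sequence details amounts to sorting the received lines by their sequence number. The sketch below illustrates only that idea with hypothetical types (`SequencedLine`, `ResequenceSketch`); it is not how the Resequencer is implemented, and the `sequenceNumber` field merely stands in for the sequence header that applySequence would populate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a line message that carries its sequence number.
final class SequencedLine {
    final int sequenceNumber;
    final String text;

    SequencedLine(int sequenceNumber, String text) {
        this.sequenceNumber = sequenceNumber;
        this.text = text;
    }
}

final class ResequenceSketch {
    /** Returns the line texts sorted by sequence number, leaving the input untouched. */
    static List<String> inOriginalOrder(List<SequencedLine> arrived) {
        List<SequencedLine> sorted = new ArrayList<>(arrived);
        sorted.sort(Comparator.comparingInt(line -> line.sequenceNumber));
        List<String> texts = new ArrayList<>(sorted.size());
        for (SequencedLine line : sorted) {
            texts.add(line.text);
        }
        return texts;
    }
}
```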

I need to give the problem you have described more thought...

Have a good weekend!

garyrussell · 2021-03-12T21:24:26Z

...va/org/springframework/integration/file/aggregator/FileAggregatingMessageGroupProcessor.java

+				.filter((message) -> !message.getHeaders().containsKey(FileHeaders.MARKER))
+				.map(Message::getPayload)
+				.collect(Collectors.toList());
+	}


I thought we decided not to use Stream APIs in main code flows.

Yeah... Will rework.
Thanks for the reminder!
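For illustration, the filter/map/collect pipeline from the diff can be expressed as a plain loop along these lines. This is a sketch with hypothetical types, not the code that was pushed: `GroupedMessage` stands in for `Message<?>`, and the header name `marker` stands in for `FileHeaders.MARKER`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a message: a payload plus a header map.
final class GroupedMessage {
    final Object payload;
    final Map<String, Object> headers;

    GroupedMessage(Object payload, Map<String, Object> headers) {
        this.payload = payload;
        this.headers = headers;
    }
}

final class LoopPayloadCollector {
    // Assumed header name, standing in for FileHeaders.MARKER.
    static final String MARKER = "marker";

    /** Collects the payloads of all non-marker messages, in iteration order. */
    static List<Object> payloadsWithoutMarkers(Collection<GroupedMessage> messages) {
        List<Object> payloads = new ArrayList<>(messages.size());
        for (GroupedMessage message : messages) {
            if (!message.headers.containsKey(MARKER)) { // skip START/END markers
                payloads.add(message.payload);
            }
        }
        return payloads;
    }
}
```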


artembilan · 2021-03-15T13:38:06Z

Pushed a fix without Java Streams.

The feature you are asking for would probably look like this:

* The MessageGroup gets a new property String condition (or groupCondition, if some DB vendors have condition as a keyword)
* The MessageGroupStore gets a new option - conditionSupplier(Function<Message, String>)
* When we add a message to the group, we calculate a condition against that message and store it in the group entity
* The ReleaseStrategy can then consult this new group property without walking through the whole group.

I suggest making this condition a String since it is the only way to support any variety of data structure: JSON, SpEL or just a number representation. It is then the ReleaseStrategy's responsibility to parse such a condition properly for its purposes. Even if it is SpEL to parse, it is still faster than loading the whole group from the DB.
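The proposal could be sketched roughly as follows. Everything here is hypothetical and only models the idea: `ConditionMessage` and `ConditionGroupSketch` are not Spring Integration types, the header names `marker` and `lineCount` stand in for `FileHeaders.MARKER` and `FileHeaders.LINE_COUNT`, and keeping the previous condition when the supplier returns null is a choice made for this sketch, not something stated in the proposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a message: a payload plus a header map.
final class ConditionMessage {
    final Object payload;
    final Map<String, Object> headers;

    ConditionMessage(Object payload, Map<String, Object> headers) {
        this.payload = payload;
        this.headers = headers;
    }
}

// Hypothetical group that keeps a String condition alongside its messages.
final class ConditionGroupSketch {
    private final List<ConditionMessage> messages = new ArrayList<>();
    private final Function<ConditionMessage, String> conditionSupplier;
    private String condition;

    ConditionGroupSketch(Function<ConditionMessage, String> conditionSupplier) {
        this.conditionSupplier = conditionSupplier;
    }

    void add(ConditionMessage message) {
        messages.add(message);
        // Evaluate the condition once, when the message is added.
        String candidate = conditionSupplier.apply(message);
        if (candidate != null) { // sketch-only choice: null keeps the old condition
            condition = candidate;
        }
    }

    int size() {
        return messages.size();
    }

    String getCondition() {
        return condition;
    }

    /** Release check that reads the stored condition instead of iterating messages. */
    static boolean canRelease(ConditionGroupSketch group) {
        String stored = group.getCondition();
        // The condition holds the line count; the group also contains two markers.
        return stored != null && group.size() == Long.parseLong(stored) + 2;
    }
}
```

In this sketch the supplier would return the line count as a String for the END marker and null for every other message, so the check also works when the END marker arrives before some of the lines.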

I suggest addressing this feature in a separate issue/PR: the solution proposed in the current PR fully reflects what we have so far and what the original JIRA asks for.

Any feedback welcome!

Will take a look into that a bit later: I need to fix failing JMS tests and check TaskScheduler for errorChannel.
Thanks

garyrussell · 2021-03-16T18:16:20Z

...src/main/java/org/springframework/integration/file/aggregator/FileMarkerReleaseStrategy.java

+		int size = group.size();
+		if (size > 1) { // Need more than only a START marker
+			Collection<Message<?>> messages = group.getMessages();
+			for (Message<?> message : messages) {


This is still no good; we can't iterate over the whole collection each time, it just won't scale with large groups.

Correct, but see what we have so far in another place - SequenceAwareMessageGroup.
Not related to this one, but it does a similar iteration on each added message.

Nevertheless, I agree with your concern and really will address it, but I'd like to do that in a separate PR.
See my comment from yesterday.

OK; but let's open an issue so it's not forgotten.

If you agree with my proposal, I'll start working on that immediately, so this FileAggregator won't be a bottleneck anymore.

Yes; proposal looks good.

artembilan added type: enhancement in: file labels Mar 12, 2021

artembilan added this to the 5.5 M3 milestone Mar 12, 2021

garyrussell suggested changes Mar 12, 2021

artembilan added 2 commits March 12, 2021 16:43

* Rework FileAggregator do not use Java Streams

15346ce

artembilan force-pushed the INT-4116 branch from 29cd096 to 15346ce Compare March 15, 2021 13:29

garyrussell suggested changes Mar 16, 2021

garyrussell approved these changes Mar 16, 2021

garyrussell merged commit 4342586 into spring-projects:master Mar 16, 2021
