
Make KafkaItemWriter extensible and document its thread-safety #3970

Closed

SebastianGoeb opened this issue Aug 3, 2021 · 7 comments


SebastianGoeb commented Aug 3, 2021

I would like KafkaItemWriter to be thread-safe, considering that KafkaTemplate is thread-safe too. FlatFileItemWriter/AbstractFileItemWriter behaves similarly: it does nothing to ensure thread safety itself (beyond not sharing variables across thread boundaries) and instead relies on the underlying BufferedWriter to handle synchronization. The same should be possible for KafkaItemWriter, since KafkaTemplate is thread-safe.

KafkaItemWriter currently shares a List<ListenableFuture<SendResult<K, T>>> across threads, which is written to during write() and later clear()ed again. I could wrap the whole thing with a SynchronizedItemStreamWriter, but it would be a shame to lose multithreading capabilities, especially since KafkaTemplate can handle them anyway. Instead, I propose simply making the futures thread-local: ThreadLocal<...> listenableFutures = ThreadLocal.withInitial(ArrayList::new), which should take care of the problem.

Context:
We noticed ConcurrentModificationExceptions in KafkaItemWriter::flush in our integration tests, which configure KafkaTemplate to auto-flush, unlike the live application, which does not. Auto-flushing makes the write() call take longer, which allows two threads to interfere. The live application has 10 threads and a chunk size of 1000, processing about 1k items/s in total. That works out to about 100 items/s/thread, i.e. each thread executes write() every 10s or so. Since there is no auto-flush outside of tests, write() only dumps the items onto the KafkaProducer's RecordAccumulator and immediately exits. These calls seem to mostly miss each other, but the problem is still there; we would rather not synchronize everything, so for now we are using the following modified ThreadSafeKafkaItemWriter:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.springframework.batch.item.KeyValueItemWriter;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.util.Assert;
import org.springframework.util.concurrent.ListenableFuture;

/**
 * Variant of KafkaItemWriter that keeps the pending send futures in a
 * ThreadLocal, so concurrent write() calls never share mutable state.
 */
public class ThreadSafeKafkaItemWriter<K, T> extends KeyValueItemWriter<K, T> {

    protected KafkaTemplate<K, T> kafkaTemplate;
    // One futures list per thread, so concurrent write()/flush() calls
    // never touch each other's state.
    private final ThreadLocal<List<ListenableFuture<SendResult<K, T>>>> listenableFutures = ThreadLocal.withInitial(ArrayList::new);
    private long timeout = -1;

    @Override
    protected void writeKeyValue(K key, T value) {
        List<ListenableFuture<SendResult<K, T>>> futures = this.listenableFutures.get();
        if (this.delete) {
            futures.add(this.kafkaTemplate.sendDefault(key, null));
        } else {
            futures.add(this.kafkaTemplate.sendDefault(key, value));
        }
    }

    @Override
    protected void flush() throws Exception {
        // push buffered records to the broker, then wait for this thread's sends
        this.kafkaTemplate.flush();
        List<ListenableFuture<SendResult<K, T>>> futures = this.listenableFutures.get();
        for (ListenableFuture<SendResult<K, T>> future : futures) {
            if (this.timeout >= 0) {
                future.get(this.timeout, TimeUnit.MILLISECONDS);
            } else {
                future.get();
            }
        }
        futures.clear();
    }

    @Override
    protected void init() {
        Assert.notNull(this.kafkaTemplate, "KafkaTemplate must not be null.");
        Assert.notNull(this.kafkaTemplate.getDefaultTopic(), "KafkaTemplate must have the default topic set.");
    }

    /**
     * Set the {@link KafkaTemplate} to use.
     *
     * @param kafkaTemplate to use
     */
    public void setKafkaTemplate(KafkaTemplate<K, T> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    /**
     * The time limit to wait when flushing items to Kafka.
     *
     * @param timeout milliseconds to wait, defaults to -1 (no timeout).
     * @since 4.3.2
     */
    public void setTimeout(long timeout) {
        this.timeout = timeout;
    }

}
SebastianGoeb added the labels status: waiting-for-triage and type: feature on Aug 3, 2021
fmbenhassine (Contributor) commented:

Thank you for opening this issue. Instead of using ThreadLocal.withInitial(ArrayList::new) for listenableFutures, what about using a thread-safe data structure like CopyOnWriteArrayList? Could you try that and share your feedback? Otherwise, please share the test that fails with a ConcurrentModificationException and we can discuss the best way to fix the issue.
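
For illustration, that suggestion would amount to something like the following field declaration inside the writer (a sketch only, reusing the generics from the snippet above; CopyOnWriteArrayList lives in java.util.concurrent):

    // Sketch of the suggestion: a CopyOnWriteArrayList tolerates concurrent
    // add() calls and never throws ConcurrentModificationException during
    // iteration, because its iterators work on a snapshot of the array.
    private final List<ListenableFuture<SendResult<K, T>>> listenableFutures =
            new CopyOnWriteArrayList<>();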

we noticed ConcurrentModificationExceptions in KafkaItemWriter::flush in our integration tests, which set KafkaTemplate to auto-flush, unlike the live application, which does not.

I recommend disabling auto-flush in your tests as well; otherwise your test won't mimic the real-world behaviour of flushing items in bulk.

fmbenhassine added the label status: waiting-for-reporter on Aug 5, 2021
SebastianGoeb (Author) commented Aug 5, 2021

what about using a thread-safe data structure like CopyOnWriteArrayList for instance?

Is that a good idea? CopyOnWriteArrayList duplicates its entire contents every time it is written to; it's more useful for read-heavy workloads than write-heavy ones. It would turn the simple writeKeyValue() loop in KeyValueItemWriter into an O(n^2) situation. This would be especially problematic in highly multithreaded applications (which we are trying to address here), because now all threads dump their items into one combined list, which is copied num_threads * items_per_thread times per chunk.
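
To put a number on that (my illustration, not code from the thread): appending n futures one at a time to a CopyOnWriteArrayList clones the backing array on every add, so the total work is roughly 1 + 2 + ... + (n-1) ≈ n²/2 element copies.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CopyCostDemo {
    public static void main(String[] args) {
        List<Integer> futures = new CopyOnWriteArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            futures.add(i); // clones the whole backing array: O(i) per call
        }
        // Total: 0 + 1 + ... + 9_999 ≈ 50 million element copies, versus
        // ~10_000 amortized copies for a plain ArrayList.
    }
}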

I feel like it would also break the entire flush/clear logic, since COWAL only gives you a snapshot when you iterate it. If N threads are iterating the list, waiting on listenableFutures, and calling clear(), all interleaved, I'd be very nervous about guaranteeing that each thread still waits for the correct futures.

I'd feel much more comfortable if threads didn't have their data thrown into a shared data structure unless they're never going to read from it again (like KafkaProducer's own record buffers, which are read asynchronously by other threads).

I recommend disabling auto-flush

Yeah, probably a good idea in general, but in this case I think it would only mask the issue. In production we process about 2 million items per job, and this issue hasn't shown up yet, likely because delegating to KafkaTemplate is quick, so it's unlikely that two write()s ever collide. But that doesn't mean the issue isn't there; one day it'll crash. At least the auto-flushing flagged the problem 🤷.

SebastianGoeb (Author) commented Aug 5, 2021

About the test: it's a pretty big one, because it tests everything in succession and Kafka is only the last step. But it boils down to a Step that uses a JdbcPagingItemReader to grab IDs from the DB, a processor that turns IDs into JPA objects and JPA objects into Avro objects, and then the KafkaItemWriter that writes them to Kafka. The IT just runs the whole thing (with pageSize, chunkSize, and concurrency all == 2, and auto-flush == true) and then expects 11 records in Kafka. In prod it would be pageSize = chunkSize = 1000, concurrency = 10, auto-flush = false.

Something like this:


        int pageSize = 2;
        int concurrency = 2;
        int chunkSize = 2;
        SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor();
        executor.setConcurrencyLimit(concurrency);

        Step step = steps.get("myPublishStep")
                .<UUID, MyAvroType>chunk(chunkSize)
                .reader(new JdbcPagingItemReaderBuilder<UUID>()
                        // connection
                        .dataSource(dataSource)
                        // query
                        .selectClause("id")
                        .fromClause("mytable")
                        .whereClause("mycondition")
                        .sortKeys(Map.of("id", ASCENDING))
                        .parameterValues(Map.of("someparameter", "somevalue"))
                        // results
                        .rowMapper(new SingleColumnRowMapper<>())
                        // performance
                        .pageSize(pageSize)
                        .saveState(false) // must be false if multithreaded
                        .build())
                .processor(id -> mapIdToJpaObjectToAvroObject(id))
                .writer(new KafkaItemWriterBuilder<String, MyAvroType>()
                        .kafkaTemplate(kafkaTemplate)
                        .itemKeyMapper(myAvroObject -> myAvroObject.getSomeKeyProperty())
                        .delete(false)
                        .build())
                .taskExecutor(executor)
                .throttleLimit(concurrency)
                .build();

        // run the job with this step, then:
        List<ConsumerRecord<Object, Object>> records = testConsumer.waitForAvailableRecords(KAFKA_RECORD_WAIT_TIMEOUT);
        assertThat(records).hasSize(11);

fmbenhassine (Contributor) commented:

Thank you for your feedback, I understand the use case better now. You are right about the CopyOnWriteArrayList; it is indeed not a good idea here.

My concern with introducing thread-safety techniques (whatever the technique: thread locals, concurrent data structures, synchronization constructs, etc.) is that, in addition to making the code more complex, if we do it for KafkaItemWriter, we should do it everywhere else for consistency (ListItemWriter, FlatFileItemWriter, StaxEventItemWriter, etc.). Up until now, this has been avoided in favour of using a synchronized decorator, even if that is not optimal in some cases. The documentation mentions this in the Multi-threaded step section.

I believe we should document the class as being non-thread-safe, since this is currently missing (it was thread-safe until #3773), and leave the synchronization aspect to the user with a decorator, for consistency. You mentioned the SynchronizedItemStreamWriter, but it won't work here because KafkaItemWriter is not an ItemStream. We can introduce decorators for readers/writers that are not ItemStreams (as suggested in #3741 (comment)) so that users don't have to write them each time; a sketch of what that could look like follows. What do you think?
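
For concreteness, such a decorator could look roughly like this (a minimal sketch assuming the Spring Batch 4.x ItemWriter signature; the class name and eventual API are not decided in this thread):

import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import org.springframework.batch.item.ItemWriter;
import org.springframework.util.Assert;

// Sketch: serializes all calls to the delegate's write() method, making
// any plain (non-ItemStream) writer safe to use from a multi-threaded step.
public class SynchronizedItemWriter<T> implements ItemWriter<T> {

    private final ItemWriter<T> delegate;
    private final Lock lock = new ReentrantLock();

    public SynchronizedItemWriter(ItemWriter<T> delegate) {
        Assert.notNull(delegate, "The delegate must not be null");
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends T> items) throws Exception {
        this.lock.lock();
        try {
            this.delegate.write(items);
        }
        finally {
            this.lock.unlock();
        }
    }
}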

SebastianGoeb (Author) commented:

if we do it for the KafkaItemWriter, we should do it everywhere else for consistency (like ListItemWriter, FlatFileItemWriter, StaxEventItemWriter, etc).

Unfortunately, if we do nothing, it's also inconsistent, because some ItemWriters are thread-safe and some aren't. For example, FlatFileItemWriter is thread-safe, KafkaItemWriter isn't, and JpaItemWriter and JdbcBatchItemWriter are thread-safe again. I can't see any structure to this that would let me tell which item writer is thread-safe without reading the source. Some are documented, like Jpa/Jdbc, but if I just assume any undocumented writer is not thread-safe, then I wouldn't know that FlatFileItemWriter actually is. I think going through and documenting thread safety would be quite helpful to users, because otherwise they need to (like me) go through every writer they want to use and check its internals for thread safety, which isn't the greatest user experience and is somewhat error-prone (what if I used to rely on KafkaItemWriter being thread-safe, until #3741 broke that, like you said?).

I also couldn't find any ItemWriter that handles its own concurrency (like the proposed KafkaItemWriter change would). All the thread-safe ones I've found delegate this to the next layer down: for FlatFileItemWriter that's the BufferedWriter, for Jpa it's the EntityManager, and for Jdbc I think it's probably the SessionFactory deep inside NamedParameterJdbcTemplate somewhere. Those next layers handle concurrency, and the ItemWriter just makes sure not to get two calls to write(List) mixed up. However, this is exactly what KafkaItemWriter does not do, with that shared list of futures. So to me it seems like KafkaItemWriter is "doing something extra" that other writers aren't, and that, for me, would warrant making it thread-safe, because, as mentioned, it was thread-safe until this extra future bookkeeping broke it. But that's just my opinion. Maybe the following is a good compromise?

I don't think just slapping synchronized on everything and calling it a day is a solution for every use case, though. Sometimes that's fine, because we just want to control execution order and the thing we're writing to can't handle multiple threads anyway (FlatFileItemWriter / BufferedWriter's own synchronization). Other times we could lose a lot of parallelism. Maybe it would make sense to offer two sets of decorators: SynchronizedItemStream/-Writer for cases where we just want to fix the ordering, and a ThreadLocalItemStream/-Writer that makes the entire delegate writer thread-local, for cases where we want maximum parallelism. Or maybe there's a way to get rid of the future list in KafkaItemWriter again and let KafkaTemplate take care of this itself? Maybe that's unlikely, though.

So I've got these ideas:

  • Document thread safety.
  • Possibly also document it in code with a boolean isThreadSafe(), so the ChunkingTasklet can check whether someone is trying to combine throttle() with a non-thread-safe writer and throw an error.
  • Maybe offer a ThreadLocalItemWriter decorator for when we don't want to bottleneck everything through a synchronized method (see the sketch after this list).
  • Make as many writers thread-safe as possible, simply because Spring Batch offers throttle() and it's kind of dangerous to have that alongside non-thread-safe writers, especially if the thread safety is not part of the method contract and can change for other reasons. Users will rely on the thread safety after they've checked it, and then have broken writers in a future release.
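
A rough sketch of the ThreadLocalItemWriter idea from the list above (entirely hypothetical, including the name): each thread lazily builds its own delegate, so no state is shared and no locking is needed.

import java.util.List;
import java.util.function.Supplier;

import org.springframework.batch.item.ItemWriter;

// Hypothetical decorator: every worker thread gets its own delegate writer
// instance, trading memory for lock-free parallelism.
public class ThreadLocalItemWriter<T> implements ItemWriter<T> {

    private final ThreadLocal<ItemWriter<T>> delegate;

    public ThreadLocalItemWriter(Supplier<ItemWriter<T>> writerFactory) {
        this.delegate = ThreadLocal.withInitial(writerFactory);
    }

    @Override
    public void write(List<? extends T> items) throws Exception {
        this.delegate.get().write(items); // each thread writes through its own instance
    }
}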

fmbenhassine removed the label status: waiting-for-reporter on Aug 31, 2021
ElgunGasimzade commented:

Hi, I used a ConcurrentLinkedQueue for this case. What do you think about that; should I use ThreadLocal instead? Has there been any other update on this in the meantime, or have you changed your approach since then?
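
For reference, that variant presumably replaces the list with a concurrent queue, something like this (my reading of the comment, not code from the thread): the queue makes concurrent add() safe and avoids ConcurrentModificationException, but it is still shared, so flush() on one thread would wait on futures enqueued by other threads, whereas the ThreadLocal approach keeps each thread's futures isolated.

    // Presumed ConcurrentLinkedQueue variant (java.util.concurrent):
    private final Queue<ListenableFuture<SendResult<K, T>>> listenableFutures =
            new ConcurrentLinkedQueue<>();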

fmbenhassine (Contributor) commented May 9, 2023

I also couldn't find any ItemWriter that handles its own concurrency (like the proposed KafkaItemWriter change).

Yes, and for consistency with the other implementations, there is no plan to introduce any concurrency construct in the implementation of KafkaItemWriter either. If needed, this writer can be made synchronized with a decorator. If that is not efficient enough, the writer can be extended, or a custom one written, for more fine-grained control over concurrent access.

I will proceed with the following actions:

  • Document the thread-safety of KafkaItemWriter in its Javadoc, as this is currently missing
  • Make KafkaItemWriter#completableFutures protected so it can be used in extensions (see the sketch after this list)
  • Create an issue to add synchronized decorators for non-stream readers/writers, so the KafkaItemWriter can be decorated properly (using the SynchronizedItemStreamWriter is not possible since KafkaItemWriter is not an ItemStream). This is done (see Add synchronized decorators for non-stream item readers/writers #4368) and planned for v5.1
  • Add a column about thread-safety to the readers/writers appendix so the information is easy to find (Document Thread Safety of item readers and writers #3646)
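
As an illustration of what the second action enables (a sketch assuming the field keeps the name completableFutures and that flush() stays overridable; the subclass here is hypothetical):

import org.springframework.batch.item.kafka.KafkaItemWriter;

// Hypothetical extension: with the futures list protected, a subclass can
// inspect or handle the pending sends around the default flush logic.
public class AuditingKafkaItemWriter<K, T> extends KafkaItemWriter<K, T> {

    @Override
    protected void flush() throws Exception {
        // e.g. record how many sends are pending before waiting on them
        System.out.println("Flushing " + this.completableFutures.size() + " pending Kafka sends");
        super.flush();
    }
}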

fmbenhassine added this to the 5.0.2 milestone on May 9, 2023
fmbenhassine changed the title from "Make KafkaItemWriter Thread Safe" to "Make KafkaItemWriter extensible and document its thread-safety" on May 9, 2023