This repository has been archived by the owner on Apr 4, 2019. It is now read-only.

OutOfMemory error from S3 transfer manager #30

Open
dhamilton opened this issue Jan 16, 2017 · 13 comments

@dhamilton

dhamilton commented Jan 16, 2017

Hi, we've been seeing issues where the number of threads opened by the plugin grows without bound and the connector workers all eventually die with stack traces pointing to a failure to create threads in the S3 transfer manager. We do see data being uploaded to S3, but I'm wondering if this might be a case where we are consuming faster than we can upload. Has anyone else observed this in the past, and is there a known workaround? Thanks.

[2017-01-16 18:30:15,554] ERROR Task pixel_events_s3-3 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:713)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1360)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:132)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.<init>(UploadMonitor.java:129)
at com.amazonaws.services.s3.transfer.TransferManager.upload(TransferManager.java:449)
at com.amazonaws.services.s3.transfer.TransferManager.upload(TransferManager.java:382)
at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:127)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at java.io.FilterOutputStream.close(FilterOutputStream.java:160)
at java.io.FilterOutputStream.close(FilterOutputStream.java:160)
at org.apache.avro.file.DataFileWriter.close(DataFileWriter.java:376)
at io.confluent.connect.hdfs.avro.AvroRecordWriterProvider$1.close(AvroRecordWriterProvider.java:69)
at io.confluent.connect.hdfs.TopicPartitionWriter.closeTempFile(TopicPartitionWriter.java:514)
at io.confluent.connect.hdfs.TopicPartitionWriter.close(TopicPartitionWriter.java:319)
at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:299)
at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:110)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:301)
at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:432)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

@PraveenSeluka
Contributor

Hi @dhamilton, thanks for reporting it. This seems to be caused by a bug in s3a where it does not clean up threads, and instead keeps creating worker threads until it eventually runs out. We just found this issue and are working on a fix (which is to upgrade the AWS SDK).

Meanwhile, you could try using NativeS3FileSystem instead of S3A. I will have the fix ready soon and you could try it out then.

Thanks
Praveen
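
For anyone who wants to verify the NativeS3FileSystem path outside the connector first, here is a minimal, hypothetical Java sketch of the Hadoop settings involved in switching from S3A (s3a://) to NativeS3FileSystem (s3n://). The bucket name and the environment-variable credential handling are assumptions; with StreamX the same keys would normally go into the connector's Hadoop configuration rather than into code.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NativeS3Check {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // s3n:// URIs are served by NativeS3FileSystem rather than S3AFileSystem
        conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
        // s3n reads credentials from these keys (supplied via env vars here as an assumption)
        conf.set("fs.s3n.awsAccessKeyId", System.getenv("AWS_ACCESS_KEY_ID"));
        conf.set("fs.s3n.awsSecretAccessKey", System.getenv("AWS_SECRET_ACCESS_KEY"));

        // "my-bucket" is a placeholder; listing the bucket root confirms the filesystem works
        FileSystem fs = FileSystem.get(new URI("s3n://my-bucket/"), conf);
        for (FileStatus status : fs.listStatus(new Path("s3n://my-bucket/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```

If this lists the bucket, pointing the connector at an s3n:// URL with the same keys should exercise the same code path.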

@dhamilton
Author

Awesome, thank you so much for the quick reply; we'll try the NativeS3FileSystem for now.

Also, just so you're aware, we also hit confluentinc/schema-registry#363 in the schema registry client, which blocked us from syncing multiple topics with the same schema. We rebuilt your plugin with Confluent 3.0.1 and Kafka 0.10.0.1, and that update corrected the issue.

@PraveenSeluka
Contributor

@dhamilton Can you try it out now? I have pushed a commit which upgrades the AWS SDK (and Hadoop), and this should fix the thread leak issue. Please let us know if this helps. Thanks for using StreamX.

Regarding the Confluent + Kafka version upgrade, I will test it out.

@dhamilton
Author

Sure, we'll test this out today and let you know what we find.

@dhamilton
Author

We're seeing the following issue when trying to resume the job. Have you seen this before?

[2017-01-17 20:21:35,124] ERROR Task pixel_events_s3-3 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.NullPointerException
at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:110)
at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:301)
at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:432)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
[2017-01-17 20:21:35,124] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:143)

@dhamilton
Author

Just to follow up, that exception does not occur when using the NativeS3FileSystem. I've seen it in the past with the S3FileSystem, outside of this issue.

@PraveenSeluka
Contributor

This usually points to some wrong config related to s3a; check the access/secret keys and the filesystem impl configs. The task will try to create S3Storage, which fails when there is any Hadoop/S3 issue, and then it gets into the closePartitions flow, so the NullPointerException there is not the real issue.
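
As an illustration of that check, here is a minimal, hypothetical sketch of the s3a settings to verify; the bucket name and environment-variable credentials are placeholders. If this standalone check fails, creating S3Storage inside the task would fail in the same way.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3aConfigCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Filesystem impl config: s3a:// URIs must map to S3AFileSystem
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        // Access/secret keys for s3a (placeholders; normally set in the connector's Hadoop config)
        conf.set("fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"));
        conf.set("fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"));

        // "my-bucket" is a placeholder; a successful exists() call rules out basic credential/impl problems
        FileSystem fs = FileSystem.get(new URI("s3a://my-bucket/"), conf);
        System.out.println("bucket reachable: " + fs.exists(new Path("s3a://my-bucket/")));
        fs.close();
    }
}
```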

@PraveenSeluka
Contributor

@dhamilton I have been testing this and hitting a number of issues with S3A. In my testing I don't see any NativeS3FileSystem issues, so I suggest you use that instead. (I will spend some more time checking S3A.)

@dhamilton
Author

Yes, we've been having much better luck with NativeS3FileSystem, so we'll stick with that for now. Thanks for your help!

@phillipamann

Thanks for this discussion. I have hit this error as well.

@phillipamann

phillipamann commented Jan 25, 2017

I was unable to get NativeS3FileSystem working at first and was going to ask you to share your settings, but I have since got it to work. I will see if I still hit the OutOfMemory error now.

@PraveenSeluka
Contributor

PraveenSeluka commented Feb 1, 2017

Here is the summary of this issue

Let me know if you find anything.

@PraveenSeluka
Contributor

PraveenSeluka commented Feb 2, 2017

I spent some more time trying to get this working.

  • I had added aws-java-sdk-s3 version 1.11.86 (the latest version) as a Maven dependency and had been testing with that. I just found that the hadoop-aws dependency was still pulling in aws-java-sdk 1.7.4, which I had missed excluding. I excluded 1.7.4 and made sure only the later version of aws-java-sdk-s3 is on the classpath (a classpath check is sketched just after this list).
  • I found the following issue: https://issues.apache.org/jira/browse/HADOOP-12420. The AWS SDK has changed some method signatures, and S3AFileSystem can't work with the new signatures. So S3AFileSystem is compatible only with aws-java-sdk 1.7.4, and that version has the thread leak issue. Because of this, it is currently not possible to solve the leak on S3A.
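
As a quick way to confirm which AWS SDK actually wins on the classpath after the exclusion, a hypothetical check is to print where the TransferManager class is loaded from; if it still points at an aws-java-sdk-1.7.4 jar, the hadoop-aws transitive dependency is still being pulled in.

```java
import com.amazonaws.services.s3.transfer.TransferManager;

public class AwsSdkClasspathCheck {
    public static void main(String[] args) {
        // Prints the jar that TransferManager was loaded from, e.g. a path ending in the SDK jar name
        System.out.println(TransferManager.class
                .getProtectionDomain()
                .getCodeSource()
                .getLocation());
    }
}
```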

There are two possible approaches:

  1. Fix the thread leak in aws-java-sdk 1.7.4 and publish another artifact
  2. Change S3AFileSystem to work with the newer signatures of the AWS SDK

I guess neither of these is straightforward. The AWS SDK upgrade in Hadoop has been done, but that fix is only available in Hadoop 2.8 or 3.0 (not released yet); see https://issues.apache.org/jira/browse/HADOOP-12269. We will wait for it to become stable and then upgrade the Hadoop version, which will fix this issue.

Please use NativeS3FileSystem until then. Thanks.
