
NullPointerExceptions killing ToeThreads #16

Closed · anjackson opened this issue Jul 27, 2018 · 7 comments

@anjackson (Contributor)

We're seeing really odd fatal errors, killing off ToeThreads in crawls:

```
SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #99: https://www.andersonslimited.co.uk/robots.txt [Wed Jul 25 09:08:01 GMT 2018]
java.lang.NullPointerException
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.delete(BdbMultipleWorkQueues.java:484)
        at org.archive.crawler.frontier.BdbWorkQueue.deleteItem(BdbWorkQueue.java:88)
        at org.archive.crawler.frontier.WorkQueue.dequeue(WorkQueue.java:195)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:948)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:574)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)
```

Looking at the code, this shouldn't really be possible!

Going up the call tree, it appears that peekItem has become inconsistent with the state of the queue, i.e. it has been reset to null.

Note that NetArchive Suite have also seen this issue and patched it in this way.

We're also observing:

```
java.util.ConcurrentModificationException
        at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1247)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1242)
        at com.esotericsoftware.kryo.serialize.MapSerializer.writeObjectData(MapSerializer.java:90)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.writeObjectData(FieldSerializer.java:161)
        at com.esotericsoftware.kryo.Kryo.writeObjectData(Kryo.java:453)
        at com.esotericsoftware.kryo.ObjectBuffer.writeObjectData(ObjectBuffer.java:262)
        at org.archive.bdb.KryoBinding.objectToEntry(KryoBinding.java:81)
        at com.sleepycat.collections.DataView.useValue(DataView.java:549)
        at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:817)
        at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
        at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
        at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
        at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:122)
        at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
        at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
        at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
        at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
        at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
        at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:379)
        at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:690)
        at org.archive.crawler.frontier.AbstractFrontier.tally(AbstractFrontier.java:636)
        at org.archive.crawler.frontier.AbstractFrontier.doJournalAdded(AbstractFrontier.java:647)
        at org.archive.crawler.frontier.WorkQueueFrontier.sendToQueue(WorkQueueFrontier.java:410)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleAlways(WorkQueueFrontier.java:333)
        at org.archive.crawler.frontier.AbstractFrontier.receive(AbstractFrontier.java:554)
        at org.archive.crawler.util.SetBasedUriUniqFilter.add(SetBasedUriUniqFilter.java:85)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleIfUnique(WorkQueueFrontier.java:378)
        at org.archive.crawler.frontier.WorkQueueFrontier.schedule(WorkQueueFrontier.java:356)
        at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:189)
        at uk.bl.wap.crawler.frontier.KafkaUrlReceiver$CrawlMessageHandler.run(KafkaUrlReceiver.java:468)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
```

So, what seems to be happening, I think, is that occasionally, between this statement and this one, the WorkQueue gets updated by a separate thread in a way that forces it to be written out to disk and then read back in. As peekItem is transient, the round trip to disk drops its value and we're left with a null.
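To make the failure mode concrete, here is a minimal sketch (not Heritrix code) of a transient field being dropped across a write-out/read-back cycle. It uses plain Java serialization for brevity; Heritrix serializes queues via a Kryo binding, but transient fields are skipped in both cases, and the `Queue` class here is a hypothetical stand-in for `WorkQueue`:

```java
// Minimal sketch of how a transient field is lost when an object is
// flushed out and read back. Queue / peekItem are hypothetical
// stand-ins for WorkQueue.peekItem, not Heritrix classes.
import java.io.*;

public class TransientDemo {
    static class Queue implements Serializable {
        private static final long serialVersionUID = 1L;
        String name = "q1";
        transient String peekItem = "https://example.com/"; // not serialized
    }

    public static void main(String[] args) throws Exception {
        Queue q = new Queue();

        // Simulate the flush-to-disk / read-back cycle in memory.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(q);
        }
        Queue q2;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            q2 = (Queue) in.readObject();
        }

        System.out.println(q2.name);     // "q1" -- survived the round trip
        System.out.println(q2.peekItem); // null -- transient value dropped
    }
}
```

Any code that cached a value in the transient field before the flush will find null there afterwards, which is exactly the state the delete() call trips over.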

@anjackson

So, re-using the beginDisposition/endDisposition lock used by the ToeThreads should resolve this issue, although locking the whole frontier seems likely to cause contention and slow things down somewhat. However, I can see no way to implement WorkQueue-level locking with the current frontier implementation.
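For illustration, a hedged sketch of that approach: taking the frontier's disposition lock around externally-triggered scheduling so it cannot interleave with a ToeThread's finish-handling. The wrapper class is hypothetical; beginDisposition/endDisposition and runCandidateChain are the Heritrix methods discussed in this thread, assuming a two-argument runCandidateChain(candidate, source) form matching the CandidatesProcessor frame in the stack trace above.

```java
// Hedged sketch: guard external URI scheduling with the same
// frontier-level disposition lock the ToeThreads use. This class is a
// hypothetical wrapper, not part of Heritrix.
import org.archive.crawler.framework.Frontier;
import org.archive.crawler.postprocessor.CandidatesProcessor;
import org.archive.modules.CrawlURI;

public class LockedCandidateScheduler {
    private final Frontier frontier;
    private final CandidatesProcessor candidates;

    public LockedCandidateScheduler(Frontier frontier, CandidatesProcessor candidates) {
        this.frontier = frontier;
        this.candidates = candidates;
    }

    /** Schedule one externally-received URI under the frontier's disposition lock. */
    public void schedule(CrawlURI curi) throws InterruptedException {
        frontier.beginDisposition(curi); // excludes concurrent dispositions
        try {
            // Run the candidate chain; no source CrawlURI in this sketch.
            candidates.runCandidateChain(curi, null);
        } finally {
            frontier.endDisposition();   // always release, even on failure
        }
    }
}
```

The trade-off noted above applies: this serializes all dispositions through one frontier-wide lock, so throughput may drop when many consumer threads are submitting URIs at once.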

@anjackson

Note that my multi-threaded CrawlMessageHandler was also interfering with itself, for the same reason, which was the cause of the ConcurrentModificationException errors.
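A minimal sketch of one way to stop a handler interfering with itself, assuming the simplest remedy of funnelling all frontier-touching work through a single thread (class and method names are hypothetical, not the actual KafkaUrlReceiver):

```java
// Hedged sketch: funnel all frontier-touching work from message
// consumers through one worker thread, so submissions cannot race
// each other. Names are hypothetical.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SerializedMessageHandler {
    // A single worker thread applies frontier mutations one at a time.
    private final ExecutorService frontierWorker = Executors.newSingleThreadExecutor();

    /** May be called from many consumer threads concurrently. */
    public void onMessage(String url) {
        frontierWorker.submit(() -> scheduleInFrontier(url));
    }

    private void scheduleInFrontier(String url) {
        // ... build a CrawlURI and run the candidate chain here ...
    }

    public void shutdown() {
        frontierWorker.shutdown();
    }
}
```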

@anjackson

Re-running the domain crawl now, and fortunately the slow-down seems fairly minor. However, the ToeThreads are still not stable: after a day of crawling, 27 of 1,600 ToeThreads have died.

@anjackson

Oh dear, still not got it:

```
SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #82: https://irp-cdn.multiscreensite.com/af541788/dms3rep/multi/mobile/wash-n-fold_icon-113x113.svg [Tue Aug 28 20:37:21 GMT 2018]
java.lang.NullPointerException
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.delete(BdbMultipleWorkQueues.java:484)
        at org.archive.crawler.frontier.BdbWorkQueue.deleteItem(BdbWorkQueue.java:88)
        at org.archive.crawler.frontier.WorkQueue.dequeue(WorkQueue.java:195)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:948)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:574)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)
```

@anjackson commented Aug 29, 2018

Hmm, there are other points where WorkQueue.makeDirty() is called, which might be the cause. For example, when the controller asks for the next() URI, any ready queue with no elements in it could be marked via noteExhausted(), which could then call makeDirty() at the wrong moment. Perhaps we could track calls to makeDirty()?
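One lightweight way to do that tracking — a hedged sketch of a logging helper that could be called at the top of makeDirty() (the helper class is hypothetical, not part of Heritrix):

```java
// Hedged sketch: record the call site every time a queue is marked
// dirty, to spot calls that land at a bad moment.
import java.util.logging.Level;
import java.util.logging.Logger;

public final class DirtyCallTracer {
    private static final Logger LOG = Logger.getLogger(DirtyCallTracer.class.getName());

    private DirtyCallTracer() {
    }

    /** Log who marked the given queue dirty, with a full stack trace. */
    public static void traceMakeDirty(String queueKey) {
        LOG.log(Level.FINE, "makeDirty() on queue " + queueKey,
                new Throwable("makeDirty call site"));
    }
}
```

Calling DirtyCallTracer.traceMakeDirty(getClassKey()) as the first line of WorkQueue.makeDirty() would then let us correlate dirty-markings with the NPE timestamps in the crawl log.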

It does seem that the rate of ToeThread death is somewhat slower than before, but it's hard to remember, and somehow Prometheus appears not to have discarded the old metrics from the last run.

@anjackson

Looking back at older logs, the thread death rate was much worse before (e.g. around 30 out of 100 ToeThreads suffered a fatal error in the first day of the crawl). So these changes do appear to have helped, but some other part of the code is still interfering.

@anjackson

Okay, this appears to be resolved by internetarchive/heritrix3#213, so closing here; it may take longer to close upstream.
