
NullPointerExceptions killing ToeThreads #16

Closed · anjackson opened this issue Jul 27, 2018 · 7 comments

@anjackson (Contributor)

We're seeing really odd fatal errors, killing off ToeThreads in crawls:

```
SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #99: https://www.andersonslimited.co.uk/robots.txt [Wed Jul 25 09:08:01 GMT 2018]
java.lang.NullPointerException
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.delete(BdbMultipleWorkQueues.java:484)
        at org.archive.crawler.frontier.BdbWorkQueue.deleteItem(BdbWorkQueue.java:88)
        at org.archive.crawler.frontier.WorkQueue.dequeue(WorkQueue.java:195)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:948)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:574)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)
```

Looking at the code, this shouldn't really be possible!

Going up the call tree, it appears that peekItem has become inconsistent with the state of the queue, i.e. it has been reset to null.

Note that NetArchive Suite have also seen this issue and patched it in this way.

We're also observing:

```
java.util.ConcurrentModificationException
        at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1247)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1242)
        at com.esotericsoftware.kryo.serialize.MapSerializer.writeObjectData(MapSerializer.java:90)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.writeObjectData(FieldSerializer.java:161)
        at com.esotericsoftware.kryo.Kryo.writeObjectData(Kryo.java:453)
        at com.esotericsoftware.kryo.ObjectBuffer.writeObjectData(ObjectBuffer.java:262)
        at org.archive.bdb.KryoBinding.objectToEntry(KryoBinding.java:81)
        at com.sleepycat.collections.DataView.useValue(DataView.java:549)
        at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:817)
        at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
        at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
        at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
        at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:122)
        at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
        at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
        at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
        at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
        at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
        at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:379)
        at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:690)
        at org.archive.crawler.frontier.AbstractFrontier.tally(AbstractFrontier.java:636)
        at org.archive.crawler.frontier.AbstractFrontier.doJournalAdded(AbstractFrontier.java:647)
        at org.archive.crawler.frontier.WorkQueueFrontier.sendToQueue(WorkQueueFrontier.java:410)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleAlways(WorkQueueFrontier.java:333)
        at org.archive.crawler.frontier.AbstractFrontier.receive(AbstractFrontier.java:554)
        at org.archive.crawler.util.SetBasedUriUniqFilter.add(SetBasedUriUniqFilter.java:85)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleIfUnique(WorkQueueFrontier.java:378)
        at org.archive.crawler.frontier.WorkQueueFrontier.schedule(WorkQueueFrontier.java:356)
        at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:189)
        at uk.bl.wap.crawler.frontier.KafkaUrlReceiver$CrawlMessageHandler.run(KafkaUrlReceiver.java:468)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
```

So, what seems to be happening, I think, is that occasionally, between this statement and this one, the WorkQueue gets updated by a separate thread in a way that forces it to be written out to disk and then read back in. As peekItem is transient, the round trip to disk drops its value and we're left with a null.
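To make the failure mode concrete, here is a minimal sketch (not Heritrix code) of a transient field being dropped across a write-out/read-back cycle. It uses plain Java serialization for brevity; Heritrix serializes queues via a Kryo binding, but transient fields are skipped in both cases, and the `Queue` class here is a hypothetical stand-in for `WorkQueue`:

```java
// Minimal sketch of how a transient field is lost when an object is
// flushed out and read back. Queue / peekItem are hypothetical
// stand-ins for WorkQueue.peekItem, not Heritrix classes.
import java.io.*;

public class TransientDemo {
    static class Queue implements Serializable {
        private static final long serialVersionUID = 1L;
        String name = "q1";
        transient String peekItem = "https://example.com/"; // not serialized
    }

    public static void main(String[] args) throws Exception {
        Queue q = new Queue();

        // Simulate the flush-to-disk / read-back cycle in memory.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(q);
        }
        Queue q2;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            q2 = (Queue) in.readObject();
        }

        System.out.println(q2.name);     // "q1" -- survived the round trip
        System.out.println(q2.peekItem); // null -- transient value dropped
    }
}
```

Any code that cached a value in the transient field before the flush will find null there afterwards, which is exactly the state the delete() call trips over.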

@anjackson

So, re-using the beginDisposition/endDisposition lock used by the ToeThreads should resolve this issue, although locking the whole frontier seems likely to cause contention and slow things down somewhat. However, I can see no way to implement WorkQueue-level locking with the current frontier implementation.
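For illustration, a hedged sketch of that approach: taking the frontier's disposition lock around externally-triggered scheduling so it cannot interleave with a ToeThread's finish-handling. The wrapper class is hypothetical; beginDisposition/endDisposition and runCandidateChain are the Heritrix methods discussed in this thread, assuming a two-argument runCandidateChain(candidate, source) form matching the CandidatesProcessor frame in the stack trace above.

```java
// Hedged sketch: guard external URI scheduling with the same
// frontier-level disposition lock the ToeThreads use. This class is a
// hypothetical wrapper, not part of Heritrix.
import org.archive.crawler.framework.Frontier;
import org.archive.crawler.postprocessor.CandidatesProcessor;
import org.archive.modules.CrawlURI;

public class LockedCandidateScheduler {
    private final Frontier frontier;
    private final CandidatesProcessor candidates;

    public LockedCandidateScheduler(Frontier frontier, CandidatesProcessor candidates) {
        this.frontier = frontier;
        this.candidates = candidates;
    }

    /** Schedule one externally-received URI under the frontier's disposition lock. */
    public void schedule(CrawlURI curi) throws InterruptedException {
        frontier.beginDisposition(curi); // excludes concurrent dispositions
        try {
            // Run the candidate chain; no source CrawlURI in this sketch.
            candidates.runCandidateChain(curi, null);
        } finally {
            frontier.endDisposition();   // always release, even on failure
        }
    }
}
```

The trade-off noted above applies: this serializes all dispositions through one frontier-wide lock, so throughput may drop when many consumer threads are submitting URIs at once.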

@anjackson

Note that my multi-threaded CrawlMessageHandler was also interfering with itself, for the same reason, which was the cause of the ConcurrentModificationException errors.
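A minimal sketch of one way to stop a handler interfering with itself, assuming the simplest remedy of funnelling all frontier-touching work through a single thread (class and method names are hypothetical, not the actual KafkaUrlReceiver):

```java
// Hedged sketch: funnel all frontier-touching work from message
// consumers through one worker thread, so submissions cannot race
// each other. Names are hypothetical.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SerializedMessageHandler {
    // A single worker thread applies frontier mutations one at a time.
    private final ExecutorService frontierWorker = Executors.newSingleThreadExecutor();

    /** May be called from many consumer threads concurrently. */
    public void onMessage(String url) {
        frontierWorker.submit(() -> scheduleInFrontier(url));
    }

    private void scheduleInFrontier(String url) {
        // ... build a CrawlURI and run the candidate chain here ...
    }

    public void shutdown() {
        frontierWorker.shutdown();
    }
}
```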

@anjackson

Re-running the domain crawl now, and fortunately the slow-down seems fairly minor. However, the ToeThreads are still not stable: after a day of crawling, 27 of 1,600 ToeThreads have died.

@anjackson

Oh dear, still not got it:

```
SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #82: https://irp-cdn.multiscreensite.com/af541788/dms3rep/multi/mobile/wash-n-fold_icon-113x113.svg [Tue Aug 28 20:37:21 GMT 2018]
java.lang.NullPointerException
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.delete(BdbMultipleWorkQueues.java:484)
        at org.archive.crawler.frontier.BdbWorkQueue.deleteItem(BdbWorkQueue.java:88)
        at org.archive.crawler.frontier.WorkQueue.dequeue(WorkQueue.java:195)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:948)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:574)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)
```

@anjackson commented Aug 29, 2018

Hmm, there are other points where WorkQueue.makeDirty() is called, which might be the cause. For example, when the controller asks for the next() URI, any ready queue with no elements in it could be marked via noteExhausted(), which could then call makeDirty() at the wrong moment. Perhaps we could track calls to makeDirty()?
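One lightweight way to do that tracking — a hedged sketch of a logging helper that could be called at the top of makeDirty() (the helper class is hypothetical, not part of Heritrix):

```java
// Hedged sketch: record the call site every time a queue is marked
// dirty, to spot calls that land at a bad moment.
import java.util.logging.Level;
import java.util.logging.Logger;

public final class DirtyCallTracer {
    private static final Logger LOG = Logger.getLogger(DirtyCallTracer.class.getName());

    private DirtyCallTracer() {
    }

    /** Log who marked the given queue dirty, with a full stack trace. */
    public static void traceMakeDirty(String queueKey) {
        LOG.log(Level.FINE, "makeDirty() on queue " + queueKey,
                new Throwable("makeDirty call site"));
    }
}
```

Calling DirtyCallTracer.traceMakeDirty(getClassKey()) as the first line of WorkQueue.makeDirty() would then let us correlate dirty-markings with the NPE timestamps in the crawl log.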

It does seem that the rate of ToeThread death is somewhat slower than before, but it's hard to remember, and somehow Prometheus appears not to have discarded the old metrics from the last run.

@anjackson

Looking back at older logs, the thread death rate was much worse before (e.g. around 30 out of 100 ToeThreads suffered a fatal error in the first day of the crawl). So these changes do appear to have helped, but some other part of the code is still interfering.

@anjackson

Okay, this appears to be resolved by internetarchive/heritrix3#213, so closing here; it may take longer to close upstream.
