Test failures on Windows caused by unclosed data files #32

hhoffstaette · 2013-02-07T11:17:09Z

Running only RedoUndoTest (to isolate from fallout, via mvn test -Dtest=journal.io.api.RedoUndoTest) yields:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running journal.io.api.RedoUndoTest
Tests run: 19, Failures: 10, Errors: 2, Skipped: 0, Time elapsed: 0.601 sec <<< FAILURE!

Results :

Failed tests:   testRedoLargeChunksOfData(journal.io.api.RedoUndoTest): expected:<127> but was:<0>
  testRemoveThroughRedo(journal.io.api.RedoUndoTest): expected:<3> but was:<130>
  testUndoBackwardOrder(journal.io.api.RedoUndoTest)
  testUndoLargeChunksOfData(journal.io.api.RedoUndoTest): arrays first differed at element [0]; expected:<-1> but was:<67>
  testRemoveThroughUndo(journal.io.api.RedoUndoTest): expected:<3> but was:<133>
  testRedoForwardOrder(journal.io.api.RedoUndoTest): expected:<[B]> but was:<[A]>
  testRedoEmptyJournal(journal.io.api.RedoUndoTest): expected:<0> but was:<5>
  testUndoEmptyJournal(journal.io.api.RedoUndoTest): expected:<0> but was:<5>
  testRedoTakesNewWritesIntoAccount(journal.io.api.RedoUndoTest): expected:<[B]> but was:<[A]>
  testUndoDoesntTakeNewWritesIntoAccount(journal.io.api.RedoUndoTest)

Tests in error:
  testNoSuchElementExceptionWithRedoIterator(journal.io.api.RedoUndoTest): Unexpected exception, expected<java.util.NoSuchElementException>
but was<java.lang.AssertionError>
  testNoSuchElementExceptionWithUndoIterator(journal.io.api.RedoUndoTest): Unexpected exception, expected<java.util.NoSuchElementException>
but was<java.lang.AssertionError>

This is repeatable in both maven and eclipse. Apparently this test failure also has enough fallout to prevent surefire from terminating, as it seems the exploded child VM never terminates.

Related issues:
#33

The text was updated successfully, but these errors were encountered:

sbtourist · 2013-02-07T11:22:33Z

Thanks for the bug report!

I can't see the exceptions you get there, could you write down the stack traces, if any?

hhoffstaette · 2013-02-07T11:23:26Z

Fixed, sorry. Fighting with markdown eating maven output..

sbtourist · 2013-02-07T11:26:05Z

Still can't see the stack traces, do you have any?

hhoffstaette · 2013-02-07T11:40:16Z

Almost all tests (in this test class) fail with various asserts, but as I have just found out the exact failed assert in each test depends on which test ran before. I can also create different assertion failures depending on whether I remove the journal between each isolated test. So I could give you one or more stack traces, but they would not be indicative of what's really going on. I will try to isolate one test that repeatably fails.

hhoffstaette · 2013-02-07T12:55:18Z

Ha! Culprit found, it's in redo.

After disabling all test methods and adding assert to the file cleanup logic in the superclass I confirmed my suspicion that this is due to a stale data files between tests: the f.delete() in AbstractJournalTest.deleteFilesInDirectory fails for most tests. Wrapping that in a retry loop ran forever. Since I have no Virus checker crap running we can rule that out as well (besides they work correctly these days). Since nobody else has the journal file open it must be the journal.

So I disabled all tests, wrote a simple one that only does a simple sync write and nothing else. This worked: the cleanup in tearDown succeeded. Opening the redo() makes it fail immediately.

@Test
public void testSimple() throws Exception {
    journal.write("A".getBytes("UTF-8"), Journal.WriteType.SYNC);
    // this makes it keep its hands on the opened log between tests
    // journal.redo();
}

After looking through the code path from journal() onwards it seems that closing the journal must invalidate & close all DataFile instances, not just clear the Map.

sbtourist · 2013-02-07T13:01:54Z

Thanks for the detailed explanation.

I think no one tested it on Windows before, and everything is green on Unix, but I'll have a look at the close method as you suggested.

In the meantime, feel free to open a pull request if you already spotted the solution ;)

hhoffstaette · 2013-02-07T13:04:05Z

DataFileAccessor.getOrCreateRaf() picks up RAFs but never closes them in time. So instead of

void close() {
    // No-op as of now...
}

would iterating through the perThreadRAFs and close()ing them be the right thing to do?

hhoffstaette · 2013-02-07T13:07:28Z

With a long-enough test suite you would ultimately run out of files on Linux (Unix) as well. Testing on Windows is an excellent way to make sure stuff really works.. :)

sbtourist · 2013-02-07T13:27:44Z

The ResourceDisposer thread is in charge of closing RAFs, but you're right they might remain open: RAF disposing can definitely be improved there, so I'll have a shot at it.

hhoffstaette · 2013-02-07T13:40:45Z

So after staring at the internals of the DataFileAccessor I added in Journal.close():

    for (DataFile df: dataFiles.values()) {
        accessor.dispose(df);
    }

right before accessor.close() and lo! all tests pass, the file is cleaned up between tests. Doesn't look so "nice" compared to just making the RAF clean up in close(), but it seems more symmetric in the sense that the Journal has acquired the DataFiles and now is responsible to clean them up as well.

sbtourist · 2013-02-07T13:49:53Z

Good point, but I'd prefer to have that happen inside DataFileAccessor#close().
Also, we have to consider what happens to concurrent reads: they may end up creating another RAF just after you closed it; moving all the close/dispose/shutdown logic into DataFileAccessor would help to deal with that.

hhoffstaette · 2013-02-07T15:09:54Z

At the moment concurrent reads (esp. from Locations) will fail anyway since they are not fenced by closed() status anywhere. Agreed that moving all that into DFA will help later on.
Btw this is only the first step: fixing this makes RedoUndoTest work, but there are many similar failures (and NPEs) in the other tests later on (unclosed Journals leaving stale files, various NPEs etc.), but one thing after another.
I'll have a stab at DAF.close() tomorrow.

sbtourist · 2013-02-07T15:19:05Z

Yes, concurrent reads doesn't play that nicely with close right now: pausing the accessor before closing and then checking at each read operation if the DAF has been closed should probably work.

Talking about the other failures you see, are those related to the stale open files, or not? Which NPEs do you see?

hhoffstaette · 2013-02-07T15:38:00Z

As far as I can tell the other failures are mostly also fallout from stale files. Quick glance at the NPEs seems to indicate that the "journal" instance is sometimes not where it's supposed to be. But that can all be random followup state stale, esp. after I added the asserts in the superclass.

hhoffstaette · 2013-02-07T15:45:24Z

So how about this (ignoring the concurrent reads for now). To avoid too much duplication we move the removeRaf() stuff into a shared method and redirect dispose(Datafile) and close() onto it. This is probably quite racy, but still closes all files. :)
Later on we can sprinkle DAF.isClosed fences everyhwere.

void dispose(DataFile dataFile) {
    Integer dataFileId = dataFile.getDataFileId();
    for (Entry<Thread, ConcurrentMap<Integer, RandomAccessFile>> threadRafs : perThreadDataFileRafs.entrySet()) {
        for (Entry<Integer, RandomAccessFile> raf : threadRafs.getValue().entrySet()) {
            if (raf.getKey().equals(dataFileId)) {
                dispose(threadRafs.getKey(), dataFileId);
            }
        }
    }
}

private void dispose(Thread t, Integer dataFileId) {
    Lock lock = getOrCreateLock(t, dataFileId);
    lock.lock();
    try {
        removeRaf(t, dataFileId);
    } catch (IOException ex) {
        warn(ex, ex.getMessage());
    } finally {
        lock.unlock();
    }
}

void close() {
    for (Entry<Thread, ConcurrentMap<Integer, RandomAccessFile>> threadRafs : perThreadDataFileRafs.entrySet()) {
        for (Entry<Integer, RandomAccessFile> raf : threadRafs.getValue().entrySet()) {
            dispose(threadRafs.getKey(), raf.getKey());
        }
    }
}

FWIW originally there was a return statement after the removeRaf() call in the original dispose(Datafile) method (https://github.com/sbtourist/Journal.IO/blob/master/src/main/java/journal/io/api/DataFileAccessor.java#L197). That seems wrong as it would short-circuit any outer loop and only clean up a found data file for the first affected thread, not any subsequent ones. Due to the lack of docs and comments I have no idea whether that is an actual problem, but looping further doesn't seem wrong.

sbtourist · 2013-02-07T16:54:14Z

Yes, calling disposeDataFile for each file on close is inefficient (but close shouldn't happen often, so no big worries), but not racy provided we pause the accessor before, and then we put guards on the open/close status.

I'd say go with that and send a pull request, we can do more fine grained reviews from there :)

hhoffstaette · 2013-02-07T17:11:50Z

Slightly updated and pushed into my clone, see request. dispose(Thread t, Integer dataFileId) now looks suspiciously like removeRaf(Thread thread, Integer file) which is only called by the Disposer. Maybe we can clean that up too.

sbtourist · 2013-02-07T18:48:39Z

Reviewed and found just a minor issue, looks good otherwise.

Are you going to address the concurrency issues too?

hhoffstaette · 2013-02-08T09:03:16Z

I first wanted to take a look at the other tests.

sbtourist · 2013-02-08T09:40:16Z

Sure, just let me know.
Otherwise I'll merge your pull request and take care of that :)

hhoffstaette · 2013-02-08T09:49:59Z

Better pull as it is now - I was just looking at the other tests (no surprise that they fail) and for them I'll have to do both more cleanups in the abstract test superclass and better interaction with individual test classes and -methods. Right now this is not only confusing but also often plain wrong. Meanwhile if you can take a look at the races that would be nice.

sbtourist · 2013-02-08T09:54:00Z

I'll pull and take care of the rest.

Please report back any single exception/problem you get by the way: I'd like to understand why you get all of those and I (and others) don't get any.

hhoffstaette · 2013-02-08T10:09:45Z

As a quick example: every test that creates its own Journal (i.e. not using the one created in the superclass) does no close it, leaking files. The rest is just test state management fragility: as soon as something explodes, everything after that is bound to go downhill as well. That's why I also added the additional asserts in the superclass: to explode as loudly as fast as possible.
External test state isolation is a noriously difficult problem, and since you use OSX you just don't get the root cause (undeleted journal files with stale state).

sbtourist · 2013-02-08T10:16:05Z

So it looks like everything is related to unclosed/leaky journals.

Agreed, current tests are pretty fragile: I slightly improved that recently, but there's still a long way to go.

hhoffstaette · 2013-02-08T10:39:26Z

Another great example is the CompactionTest, which "succeeds" but logs warnings that the old uncompacted files could not be deleted (because of open readers). One could argue that the logging is not really helpful since it does not give any instructions what to do, whether the action successfully completed or manual intervention is necessary.
I completely understand that all this is just a consequence of devs implicitly abusing file system behaviour because it's "convenient", but the bigger picture is more complicated: in the above example it's actually a hidden (and probably unconsciously made) assumption about concurrency and resource management.

sbtourist · 2013-02-08T11:53:14Z

The reason why the case above happens is more probably a leaky RAF again, as it looks like removeDataFile doesn't ask to close them.
No assumptions about concurrency or resource management: just a bug ... which is why I was saying to report any odd things you see.

The point is, in my opinion, about tradeoffs: you can either set things for max correctness and fail fast behaviour, or good enough solutions (I know, sounds terrible).
In the case above, the latter was chosen: provide a log message (could be improved, but how? not easy) and go ahead, as the undeleted log file will be ignored from that point on and eventually compacted again at the next Journal restart, relying on users to report suspicious warn messages ... like you did :)

It's a bit sad maybe, but that's how software development (and yes, engineering too) works most of the time.

sbtourist · 2013-02-08T21:38:59Z

Ok, I fixed a few (well, hopefully all) things around journal close operations and resource leaks.
Could have a look and give it a try?
Thanks :)

hhoffstaette · 2013-02-11T09:35:09Z

So I just merged your latest changes (after figuring out remotes..sorry, hg user ;) and can report that at least the CompactionTest now runs fine without log warnings. Nice!

sbtourist · 2013-02-11T09:46:36Z

Cool, thanks much for your tests and patches.

I plan to close your issue by the end of the day, but if you have any other problems with that, feel free to report them at any time :)

hhoffstaette · 2013-02-11T09:49:40Z

Well, strictly speaking the remaining failures are also test failures, but (as far as I can tell for now) they are "only" bugs in the tests. I'll see if I can get them fixed by evening, say ~8pm; if not just close and I'll throw them your way when I'm done.

sbtourist · 2013-02-11T09:53:11Z

Cool, if you'll report them back here, I'll try to help, otherwise a pull request will be great :)

hhoffstaette · 2013-02-11T14:27:11Z

So as far as I can tell #37 says it all and this can now be closed. :)

ghost assigned sbtourist Feb 17, 2013

sbtourist closed this as completed Feb 17, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test failures on Windows caused by unclosed data files #32

Test failures on Windows caused by unclosed data files #32

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 8, 2013

sbtourist commented Feb 8, 2013

hhoffstaette commented Feb 8, 2013

sbtourist commented Feb 8, 2013

hhoffstaette commented Feb 8, 2013

sbtourist commented Feb 8, 2013

hhoffstaette commented Feb 8, 2013

sbtourist commented Feb 8, 2013

sbtourist commented Feb 8, 2013

hhoffstaette commented Feb 11, 2013

sbtourist commented Feb 11, 2013

hhoffstaette commented Feb 11, 2013

sbtourist commented Feb 11, 2013

hhoffstaette commented Feb 11, 2013

Test failures on Windows caused by unclosed data files #32

Test failures on Windows caused by unclosed data files #32

Comments

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 7, 2013

sbtourist commented Feb 7, 2013

hhoffstaette commented Feb 8, 2013

sbtourist commented Feb 8, 2013

hhoffstaette commented Feb 8, 2013

sbtourist commented Feb 8, 2013

hhoffstaette commented Feb 8, 2013

sbtourist commented Feb 8, 2013

hhoffstaette commented Feb 8, 2013

sbtourist commented Feb 8, 2013

sbtourist commented Feb 8, 2013

hhoffstaette commented Feb 11, 2013

sbtourist commented Feb 11, 2013

hhoffstaette commented Feb 11, 2013

sbtourist commented Feb 11, 2013

hhoffstaette commented Feb 11, 2013