Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failures on Windows caused by unclosed data files #32

Closed
hhoffstaette opened this issue Feb 7, 2013 · 32 comments
Closed

Test failures on Windows caused by unclosed data files #32

hhoffstaette opened this issue Feb 7, 2013 · 32 comments
Assignees

Comments

@hhoffstaette
Copy link
Contributor

Running only RedoUndoTest (to isolate from fallout, via mvn test -Dtest=journal.io.api.RedoUndoTest) yields:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running journal.io.api.RedoUndoTest
Tests run: 19, Failures: 10, Errors: 2, Skipped: 0, Time elapsed: 0.601 sec <<< FAILURE!

Results :

Failed tests:   testRedoLargeChunksOfData(journal.io.api.RedoUndoTest): expected:<127> but was:<0>
  testRemoveThroughRedo(journal.io.api.RedoUndoTest): expected:<3> but was:<130>
  testUndoBackwardOrder(journal.io.api.RedoUndoTest)
  testUndoLargeChunksOfData(journal.io.api.RedoUndoTest): arrays first differed at element [0]; expected:<-1> but was:<67>
  testRemoveThroughUndo(journal.io.api.RedoUndoTest): expected:<3> but was:<133>
  testRedoForwardOrder(journal.io.api.RedoUndoTest): expected:<[B]> but was:<[A]>
  testRedoEmptyJournal(journal.io.api.RedoUndoTest): expected:<0> but was:<5>
  testUndoEmptyJournal(journal.io.api.RedoUndoTest): expected:<0> but was:<5>
  testRedoTakesNewWritesIntoAccount(journal.io.api.RedoUndoTest): expected:<[B]> but was:<[A]>
  testUndoDoesntTakeNewWritesIntoAccount(journal.io.api.RedoUndoTest)

Tests in error:
  testNoSuchElementExceptionWithRedoIterator(journal.io.api.RedoUndoTest): Unexpected exception, expected<java.util.NoSuchElementException>
but was<java.lang.AssertionError>
  testNoSuchElementExceptionWithUndoIterator(journal.io.api.RedoUndoTest): Unexpected exception, expected<java.util.NoSuchElementException>
but was<java.lang.AssertionError>

This is repeatable in both maven and eclipse. Apparently this test failure also has enough fallout to prevent surefire from terminating, as it seems the exploded child VM never terminates.

Related issues:
#33

@sbtourist
Copy link
Owner

Thanks for the bug report!

I can't see the exceptions you get there, could you write down the stack traces, if any?

@hhoffstaette
Copy link
Contributor Author

Fixed, sorry. Fighting with markdown eating maven output..

@sbtourist
Copy link
Owner

Still can't see the stack traces, do you have any?

@hhoffstaette
Copy link
Contributor Author

Almost all tests (in this test class) fail with various asserts, but as I have just found out the exact failed assert in each test depends on which test ran before. I can also create different assertion failures depending on whether I remove the journal between each isolated test. So I could give you one or more stack traces, but they would not be indicative of what's really going on. I will try to isolate one test that repeatably fails.

@hhoffstaette
Copy link
Contributor Author

Ha! Culprit found, it's in redo.

After disabling all test methods and adding assert to the file cleanup logic in the superclass I confirmed my suspicion that this is due to a stale data files between tests: the f.delete() in AbstractJournalTest.deleteFilesInDirectory fails for most tests. Wrapping that in a retry loop ran forever. Since I have no Virus checker crap running we can rule that out as well (besides they work correctly these days). Since nobody else has the journal file open it must be the journal.

So I disabled all tests, wrote a simple one that only does a simple sync write and nothing else. This worked: the cleanup in tearDown succeeded. Opening the redo() makes it fail immediately.

@Test
public void testSimple() throws Exception {
    journal.write("A".getBytes("UTF-8"), Journal.WriteType.SYNC);
    // this makes it keep its hands on the opened log between tests
    // journal.redo();
}

After looking through the code path from journal() onwards it seems that closing the journal must invalidate & close all DataFile instances, not just clear the Map.

@sbtourist
Copy link
Owner

Thanks for the detailed explanation.

I think no one tested it on Windows before, and everything is green on Unix, but I'll have a look at the close method as you suggested.

In the meantime, feel free to open a pull request if you already spotted the solution ;)

@hhoffstaette
Copy link
Contributor Author

DataFileAccessor.getOrCreateRaf() picks up RAFs but never closes them in time. So instead of

void close() {
    // No-op as of now...
}

would iterating through the perThreadRAFs and close()ing them be the right thing to do?

@hhoffstaette
Copy link
Contributor Author

With a long-enough test suite you would ultimately run out of files on Linux (Unix) as well. Testing on Windows is an excellent way to make sure stuff really works.. :)

@sbtourist
Copy link
Owner

The ResourceDisposer thread is in charge of closing RAFs, but you're right they might remain open: RAF disposing can definitely be improved there, so I'll have a shot at it.

@hhoffstaette
Copy link
Contributor Author

So after staring at the internals of the DataFileAccessor I added in Journal.close():

    for (DataFile df: dataFiles.values()) {
        accessor.dispose(df);
    }

right before accessor.close() and lo! all tests pass, the file is cleaned up between tests. Doesn't look so "nice" compared to just making the RAF clean up in close(), but it seems more symmetric in the sense that the Journal has acquired the DataFiles and now is responsible to clean them up as well.

@sbtourist
Copy link
Owner

Good point, but I'd prefer to have that happen inside DataFileAccessor#close().
Also, we have to consider what happens to concurrent reads: they may end up creating another RAF just after you closed it; moving all the close/dispose/shutdown logic into DataFileAccessor would help to deal with that.

@hhoffstaette
Copy link
Contributor Author

At the moment concurrent reads (esp. from Locations) will fail anyway since they are not fenced by closed() status anywhere. Agreed that moving all that into DFA will help later on.
Btw this is only the first step: fixing this makes RedoUndoTest work, but there are many similar failures (and NPEs) in the other tests later on (unclosed Journals leaving stale files, various NPEs etc.), but one thing after another.
I'll have a stab at DAF.close() tomorrow.

@sbtourist
Copy link
Owner

Yes, concurrent reads doesn't play that nicely with close right now: pausing the accessor before closing and then checking at each read operation if the DAF has been closed should probably work.

Talking about the other failures you see, are those related to the stale open files, or not? Which NPEs do you see?

@hhoffstaette
Copy link
Contributor Author

As far as I can tell the other failures are mostly also fallout from stale files. Quick glance at the NPEs seems to indicate that the "journal" instance is sometimes not where it's supposed to be. But that can all be random followup state stale, esp. after I added the asserts in the superclass.

@hhoffstaette
Copy link
Contributor Author

So how about this (ignoring the concurrent reads for now). To avoid too much duplication we move the removeRaf() stuff into a shared method and redirect dispose(Datafile) and close() onto it. This is probably quite racy, but still closes all files. :)
Later on we can sprinkle DAF.isClosed fences everyhwere.

void dispose(DataFile dataFile) {
    Integer dataFileId = dataFile.getDataFileId();
    for (Entry<Thread, ConcurrentMap<Integer, RandomAccessFile>> threadRafs : perThreadDataFileRafs.entrySet()) {
        for (Entry<Integer, RandomAccessFile> raf : threadRafs.getValue().entrySet()) {
            if (raf.getKey().equals(dataFileId)) {
                dispose(threadRafs.getKey(), dataFileId);
            }
        }
    }
}

private void dispose(Thread t, Integer dataFileId) {
    Lock lock = getOrCreateLock(t, dataFileId);
    lock.lock();
    try {
        removeRaf(t, dataFileId);
    } catch (IOException ex) {
        warn(ex, ex.getMessage());
    } finally {
        lock.unlock();
    }
}

void close() {
    for (Entry<Thread, ConcurrentMap<Integer, RandomAccessFile>> threadRafs : perThreadDataFileRafs.entrySet()) {
        for (Entry<Integer, RandomAccessFile> raf : threadRafs.getValue().entrySet()) {
            dispose(threadRafs.getKey(), raf.getKey());
        }
    }
}

FWIW originally there was a return statement after the removeRaf() call in the original dispose(Datafile) method (https://github.com/sbtourist/Journal.IO/blob/master/src/main/java/journal/io/api/DataFileAccessor.java#L197). That seems wrong as it would short-circuit any outer loop and only clean up a found data file for the first affected thread, not any subsequent ones. Due to the lack of docs and comments I have no idea whether that is an actual problem, but looping further doesn't seem wrong.

@sbtourist
Copy link
Owner

Yes, calling disposeDataFile for each file on close is inefficient (but close shouldn't happen often, so no big worries), but not racy provided we pause the accessor before, and then we put guards on the open/close status.

I'd say go with that and send a pull request, we can do more fine grained reviews from there :)

@hhoffstaette
Copy link
Contributor Author

Slightly updated and pushed into my clone, see request. dispose(Thread t, Integer dataFileId) now looks suspiciously like removeRaf(Thread thread, Integer file) which is only called by the Disposer. Maybe we can clean that up too.

@sbtourist
Copy link
Owner

Reviewed and found just a minor issue, looks good otherwise.

Are you going to address the concurrency issues too?

@hhoffstaette
Copy link
Contributor Author

I first wanted to take a look at the other tests.

@sbtourist
Copy link
Owner

Sure, just let me know.
Otherwise I'll merge your pull request and take care of that :)

@hhoffstaette
Copy link
Contributor Author

Better pull as it is now - I was just looking at the other tests (no surprise that they fail) and for them I'll have to do both more cleanups in the abstract test superclass and better interaction with individual test classes and -methods. Right now this is not only confusing but also often plain wrong. Meanwhile if you can take a look at the races that would be nice.

@sbtourist
Copy link
Owner

I'll pull and take care of the rest.

Please report back any single exception/problem you get by the way: I'd like to understand why you get all of those and I (and others) don't get any.

@hhoffstaette
Copy link
Contributor Author

As a quick example: every test that creates its own Journal (i.e. not using the one created in the superclass) does no close it, leaking files. The rest is just test state management fragility: as soon as something explodes, everything after that is bound to go downhill as well. That's why I also added the additional asserts in the superclass: to explode as loudly as fast as possible.
External test state isolation is a noriously difficult problem, and since you use OSX you just don't get the root cause (undeleted journal files with stale state).

@sbtourist
Copy link
Owner

So it looks like everything is related to unclosed/leaky journals.

Agreed, current tests are pretty fragile: I slightly improved that recently, but there's still a long way to go.

@hhoffstaette
Copy link
Contributor Author

Another great example is the CompactionTest, which "succeeds" but logs warnings that the old uncompacted files could not be deleted (because of open readers). One could argue that the logging is not really helpful since it does not give any instructions what to do, whether the action successfully completed or manual intervention is necessary.
I completely understand that all this is just a consequence of devs implicitly abusing file system behaviour because it's "convenient", but the bigger picture is more complicated: in the above example it's actually a hidden (and probably unconsciously made) assumption about concurrency and resource management.

@sbtourist
Copy link
Owner

The reason why the case above happens is more probably a leaky RAF again, as it looks like removeDataFile doesn't ask to close them.
No assumptions about concurrency or resource management: just a bug ... which is why I was saying to report any odd things you see.

The point is, in my opinion, about tradeoffs: you can either set things for max correctness and fail fast behaviour, or good enough solutions (I know, sounds terrible).
In the case above, the latter was chosen: provide a log message (could be improved, but how? not easy) and go ahead, as the undeleted log file will be ignored from that point on and eventually compacted again at the next Journal restart, relying on users to report suspicious warn messages ... like you did :)

It's a bit sad maybe, but that's how software development (and yes, engineering too) works most of the time.

@sbtourist
Copy link
Owner

Ok, I fixed a few (well, hopefully all) things around journal close operations and resource leaks.
Could have a look and give it a try?
Thanks :)

@hhoffstaette
Copy link
Contributor Author

So I just merged your latest changes (after figuring out remotes..sorry, hg user ;) and can report that at least the CompactionTest now runs fine without log warnings. Nice!

@sbtourist
Copy link
Owner

Cool, thanks much for your tests and patches.

I plan to close your issue by the end of the day, but if you have any other problems with that, feel free to report them at any time :)

@hhoffstaette
Copy link
Contributor Author

Well, strictly speaking the remaining failures are also test failures, but (as far as I can tell for now) they are "only" bugs in the tests. I'll see if I can get them fixed by evening, say ~8pm; if not just close and I'll throw them your way when I'm done.

@sbtourist
Copy link
Owner

Cool, if you'll report them back here, I'll try to help, otherwise a pull request will be great :)

@hhoffstaette
Copy link
Contributor Author

So as far as I can tell #37 says it all and this can now be closed. :)

@ghost ghost assigned sbtourist Feb 17, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants