Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault learnerSGDEOnOffTest #47

Closed
valentjn opened this issue Mar 11, 2019 · 18 comments
Closed

Segfault learnerSGDEOnOffTest #47

valentjn opened this issue Mar 11, 2019 · 18 comments
Labels
bug Crashes, freezes, compilation/run-time errors, undesirable behavior, wrong info in help/docs fixed Issues that have been fixed

Comments

@valentjn
Copy link
Member

In GitLab by @lettrich on Jul 2, 2017, 12:30

Durch Commit b8f5768 kommt es beim Aufruf von
datadriven/examples/learnerSGDEOnOffTest zu einem Segfault.

Der Fehler kann bis zu LearnerSGDEOnOff.cpp : 352 verfolgt werden:
densityFunction.first->eval(data, perClassDensities.back(), true);

Reported by
@waegemans
@krsgpp

@valentjn valentjn added the bug Crashes, freezes, compilation/run-time errors, undesirable behavior, wrong info in help/docs label Mar 11, 2019
@valentjn
Copy link
Member Author

In GitLab by @lettrich on Jul 2, 2017, 12:30

created branch 42-segfault-learnersgdeonofftest

@valentjn
Copy link
Member Author

In GitLab by @lettrich on Jul 3, 2017, 08:51

First of all I was able to reproduce the error on my system and I am very surprised that I did not see it in the first place. Also my tests have shown, that it seems to be older then commit b8f5768 . I will need more time to look into that but I'm working on it and will keep you updated about my progress.

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 3, 2017, 09:23

You know about git bisect? It will tell you the right commit. If you have a script which tells good and bad commits apart, it even works automagically.

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 3, 2017, 09:24

Is it possible to add a test (after you've found/fixed it) to avoid regression bugs?

@valentjn
Copy link
Member Author

In GitLab by @lettrich on Jul 3, 2017, 09:47

I will see what I can do. I'm officially out of the game, but since it feels now that sgpp is also partially my baby I feel a certain responsibility ;) Nevertheless I want to point out, that tests for those classes should have been @maierjn 's work in the first place. My additions all provide unit tests. (however it is my stupidity in the second place to refactor untested code and wonder that its not bug free :D )

@valentjn
Copy link
Member Author

In GitLab by @lettrich on Jul 22, 2017, 17:39

OK, so bisecting (very cool feature btw) hints me at commit merge commit 67ad260. This is something to start with but on the other hand terrible news, especially as both branches had working code before. So its back to opening up the debugger and finding out what went wrong

@valentjn
Copy link
Member Author

In GitLab by @lettrich on Jul 22, 2017, 19:33

Ok so the first thing that looks very fishy is, that the pointer to the bounding box of the grid that's evaluated when this thing segfaults looks like it is invalid - the member variables all contain garbage values.

@valentjn
Copy link
Member Author

In GitLab by @lettrich on Jul 22, 2017, 20:34

I found the source of the bug and have to admit, this is extremely neatly hidden. The base::Grid::clone() member function is buggy. It was edited by me during the coding days to fix a shortcoming of the initial code by @fabianfranzelin. Despite writing unit tests for this function, the simple test was not enough to detect the error.

The quick fix for this is to revert the clone function to the state before 23ac235. Having tried this, I can confirm that it works. However this is only a quick fix, since just copying the grid points of the grid may work for most cases but doesn't work with stretching or bounding box. Unfortunately as far as I can see, there seems to be something wrong with the copy constructor of base::HashGridStorage or one of its members. I suggest that this is done together while dealing with some of the issues I pointed out in #36.

I am sorry for taking so long to find this, but this took almost an entire day of intense work to find, trace back and debug, since I assumed the error anywhere but in a flawed copy constructor in code I did not write...

Please also note that while I am very happy to have worked with all of you, I have to admit that I am not very keen to implement the "real fix", since I am no longer affiliated with SGpp.

@waegemans, please tell me if the quickfix works for you, so I can sleep better at night ;)

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 24, 2017, 10:49

Wow--thank you so much for the thorough investigation!

Let's look at what you did in Grid::clone...

newGrid->storage = HashGridStorage{this->storage};

Good lord! 🤦 Never ever do that. Seriously!

What happens if this line gets executed is the following:

  1. Create a temporary HashGridStorage object with the copy constructor, with this->storage as argument.
  2. Assign the temporary object to newGrid->storage, either using operator= of HashGridStorage or, if this doesn't exist (like in this case), using the implicit assignment operation. This copies pointers 1:1 to the new object.
  3. Destruct the temporary object as its lifetime is limited to this single line. Member pointers of HashGridStorage are deleted.
  4. newGrid->storage will now contain dangling pointers as the pointers were the same as in the temporary object.
  5. Crash.

I've put together a simple test program if you want to experiment [1].

The real culprit is that HashGridStorage doesn't have a proper assignment operator (the error would've been the same if you did newGrid->storage = this->storage;). Does anybody know why? I'll just implement a straightforward one.

But even if there would be an operator=, the line above is still bad code, as a temporary object gets created which could create much unnecessary overhead.

since I assumed the error anywhere but in a flawed copy constructor in code I did not write...

The copy constructor seems to be fine. You should've been looking what that clone fix you wrote is actually doing, so you're not completely innocent 😉 On the other hand, it's quite astonishing that nobody noticed so far...

[1] http://cpp.sh/7t24r

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 24, 2017, 10:54

mentioned in commit 24935f6

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 24, 2017, 11:06

mentioned in merge request !21

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 24, 2017, 11:06

Implemented HashGridStorage::operator=, the example seems to work again. Discuss the fix in MR !21.

@valentjn
Copy link
Member Author

In GitLab by @lettrich on Jul 24, 2017, 11:07

thank you so much for the feedback. Undoubtably you're right about the explicit call of the copy constructor, followed by an assignment. Also sorry for not taking a very detailed look at all of my code before starting to blame others.

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 24, 2017, 11:09

You shall be forgiven 😃

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 24, 2017, 16:27

closed via commit 24935f6

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 24, 2017, 16:27

closed via commit ca2b9b9

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 24, 2017, 16:27

closed via merge request !21

@valentjn
Copy link
Member Author

In GitLab by @valentjn on Jul 26, 2017, 13:41

mentioned in commit ca2b9b9

@valentjn valentjn added the fixed Issues that have been fixed label Mar 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Crashes, freezes, compilation/run-time errors, undesirable behavior, wrong info in help/docs fixed Issues that have been fixed
Development

No branches or pull requests

1 participant