
Memory leaks with LP Solvers are back #17320

Closed
ypfmde opened this issue Nov 12, 2014 · 32 comments

ypfmde commented Nov 12, 2014

Three years ago, a memory leak in the LP solvers was fixed in ticket #11917. It seems the problem is back in Sage 6.3. The code

while True:
    P = MixedIntegerLinearProgram(solver="Coin")

consumes several GB of memory within a few seconds! The same happens with solver="gurobi", but not with solver="glpk".

Simon King confirmed the leak at sage-support, including some analysis and details.

CC: @simon-king-jena @nathanncohen

Component: linear programming

Author: Nils Bruin, Nathann Cohen

Branch/Commit: 5480f4e

Reviewer: Nils Bruin, Nathann Cohen

Issue created by migration from https://trac.sagemath.org/ticket/17320

ypfmde added this to the sage-6.4 milestone Nov 12, 2014
nbruin commented Nov 12, 2014

comment:3

When I try to replicate Simon's memory object analysis, I do not find a significant leak:

sage: import gc
sage: from collections import Counter
sage: gc.collect()
0
sage: gc.set_debug(gc.DEBUG_STATS)
sage: pre={id(a) for a in gc.get_objects()}
sage: for i in [1..100000]:
....:     P = MixedIntegerLinearProgram(solver="Coin")
....:     del P
....:
sage: gc.collect()
gc: collecting generation 2...
gc: objects in each generation: 117 0 89955
gc: done, 0.0464s elapsed.
0
sage: gc.collect()
gc: collecting generation 2...
gc: objects in each generation: 48 0 90013
gc: done, 0.0397s elapsed.
0
sage: post=Counter(str(type(a)) for a in gc.get_objects() if id(a) not in pre)
sage: post
Counter({"<type 'method_descriptor'>": 32, "<type 'list'>": 23, "<type 'dict'>": 13, "<type 'tuple'>": 9, "<class '_ast.Name'>": 7, "<type 'weakref'>": 5, "<class '_ast.Call'>": 4, "<type 'frame'>": 2, "<type 'builtin_function_or_method'>": 2, "<type 'listiterator'>": 1, "<type 'set'>": 1, "<type 'instancemethod'>": 1, "<class '_ast.comprehension'>": 1, "<class '_ast.Compare'>": 1, "<class '_ast.Attribute'>": 1, "<class '_ast.Interactive'>": 1, "<type 'module'>": 1})

As you can see, the garbage collector is not triggered during the loop (indicating that reference counting alone keeps the number of Python objects bounded during the loop). Also, after the loop there are no alarming objects in "post".

I can confirm that memory usage explodes, but I think this indicates the leak is outside of Python-managed memory.

[The problem with Simon's code might be that he doesn't update the set S when he counts a new object, so there's a possibility for objects that get created on the first pass to get counted every time.]

nbruin commented Nov 12, 2014

comment:4

The following is fishy in coin_backend.pyx:

    def __cinit__(self, maximization = True):
...
        self.si = new OsiClpSolverInterface()
        self.model =  new CbcModel(self.si[0])
...
    def __dealloc__(self):
        del self.si

If deleting self.si is required in __dealloc__, shouldn't self.model be deleted as well? (This indicates to me that Cython has grown enough understanding of C++ to know how to translate a "del" on a C++ object.)

Indeed, I've tried and simply inserting

        del self.model

makes memory consumption constant with the given test. I haven't tested if it affects operations otherwise. Responsible ticket seems to be #12220, where new CbcModel was added. In that same file we have

    cpdef int solve(self) except -1:
...
        model = new CbcModel(si[0])
...(*)
        del self.model
        self.model = model

which seems to suggest that unceremoniously deleting a model is just fine.

Incidentally, that routine can leak too, because there are a whole bunch of "raise" statements in (*) that can lead to self.model not being replaced by model, but model itself not being deleted.

This suggests that the whole statement here should be guarded by some try/except/finally to ensure that del model is executed if self.model=model is not.
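In plain Python terms, the guard being proposed looks roughly like this. It is a toy sketch: `FakeModel` and the `state` dict stand in for the C++ `CbcModel` and the backend object, and none of these names appear in coin_backend.pyx:

```python
class FakeModel:
    """Stand-in for a heap-allocated CbcModel that must be freed exactly once."""
    freed = 0
    def free(self):
        FakeModel.freed += 1

def solve(state, infeasible=False):
    model = FakeModel()            # analogue of `model = new CbcModel(si[0])`
    try:
        if infeasible:
            raise ValueError("problem is infeasible")  # one of the raises in (*)
        old = state.get("model")
        state["model"] = model     # commit: ownership moves into the backend
        model = None               # the local no longer owns anything
        if old is not None:
            old.free()             # analogue of `del self.model` on the old model
    finally:
        if model is not None:      # reached only on an early exit
            model.free()           # free the orphan instead of leaking it

state = {}
solve(state)                       # success path: model is now owned by state
try:
    solve(state, infeasible=True)  # failure path: new model is freed in finally
except ValueError:
    pass
```

The point of the `finally` clause is exactly the one made above: whichever raise fires, the freshly allocated model either ends up owned by the backend or is deleted, never silently dropped.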

nbruin commented Nov 13, 2014

Commit: bc20667

nbruin commented Nov 13, 2014

comment:6

Something like this should do the trick. Feel free to amend as needed.


New commits:

bc20667: trac #17320: fix some memory management for coin_backend

nathanncohen commented Nov 13, 2014

comment:7

Hello !

I do not understand why you put all this in a "try" block. Wouldn't it be easier to just set self.si and self.model to NULL at the beginning, and add "if self.model != NULL" in the dealloc ?

Nathann
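For comparison, the NULL-initialization pattern Nathann suggests can be sketched as a pure-Python analogue. The `allocate`/`free` registry and the `CoinBackendSketch` class are illustrative stand-ins for C++ `new`/`del` and the real backend; `__init__` plays the role of `__cinit__` and `close()` the role of `__dealloc__`:

```python
# Registry standing in for the C++ heap: the handles currently allocated.
live = set()

def allocate(name):
    live.add(name)
    return name

def free(handle):
    live.discard(handle)

class CoinBackendSketch:
    def __init__(self):
        # Set both handles to None (NULL) first, so cleanup can always
        # tell what was actually allocated, even after a partial init.
        self.si = None
        self.model = None
        self.si = allocate("si")
        self.model = allocate("model")

    def close(self):  # analogue of __dealloc__
        if self.model is not None:
            free(self.model)
            self.model = None
        if self.si is not None:
            free(self.si)
            self.si = None

backend = CoinBackendSketch()
backend.close()
backend.close()  # safe to call twice: the None guards make it idempotent
```

In Cython this matters because `__dealloc__` runs even when `__cinit__` raised partway through, so deleting an unconditionally assumed pointer would be unsafe without the NULL guards.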

nbruin commented Nov 13, 2014

comment:8

Replying to @nathanncohen:

I do not understand why you put all this in a "try" block. Wouldn't it be easier to just set self.si and self.model to NULL at the beginning, and add "if self.model != NULL" in the dealloc ?

Quite possibly, but that's not how that code was written and I wasn't planning on delving into the program logic. There's a "new ..." there and there are uncontrolled exits via explicit raises and quite possibly by other code that can generate errors. If the new model gets inserted into self.model there's no leak because the old value of self.model gets properly deleted, but all the error condition exits would lead to an allocated "model" that simply falls out of scope, i.e., a memory leak (unless cython generates an implicit try/finally to cleanly delete any local variables that point to C++ objects--it doesn't, otherwise the original code wouldn't have worked properly).

I just expressed it as a try/finally to ensure we always have a chance to see if "model" needs to be deleted. I'm sure there are other solutions.

nathanncohen commented Nov 16, 2014

comment:9

Hello !

It seems that only moving the "del" above was sufficient. I created a branch at public/17320 that does that, tell me what you think !

Nathann

ypfmde commented Nov 16, 2014

comment:10

Dear Nathann,

ok, seems to fix the leak issue. Is there a similar easy fix for the gurobi backend?

Best wishes, Peter

nathanncohen commented Nov 16, 2014

comment:11

Helloooooooo !

ok, seems to fix the leak issue. Is there a similar easy fix for the gurobi backend?

No idea, I don't have a gurobi license anymore. The fact that they make licenses machine-dependent complicates things a lot.

Nathann

ypfmde commented Nov 16, 2014

comment:12

Nathann,

what do you mean? On whichever machine I use gurobi, it is a matter of a few seconds to receive the free academic license.

-- Peter

nathanncohen commented Nov 16, 2014

comment:13

Hello !

what do you mean? On whichever machine I use gurobi, it is a matter of a few seconds to receive the free academic license.

Oh? Well, last time I tried I had to set up a weird proxy and start the authentication software in that proxy jail, because my lab was not recognised as a 'research institution' and stuff. And right now I am in India for two months.

Well, I will give it a try tomorrow but I am sure that I will lose at least one hour on that :-P

Nathann

nbruin commented Nov 16, 2014

comment:14

Replying to @nathanncohen:

It seems that only moving the "del" above was sufficient. I created a branch at public/17320 that does that, tell me what you think !

It seems to me that only committing to the new model by assigning it to self.model at the very end is a design feature: it means that if the computation fails for any reason, the object self is unaffected and hence still in a consistent state. It's also better for future development, because as long as model is private, there is no lock required to modify it. Once it's exported to self.model, you'd need to lock self before self.model is changed. Python currently doesn't support such multiprocessing programming anyway, but having changes in global state as atomic as possible is generally a good feature, because it reduces the possibility for unexpected interactions between multiple threads (the GIL does get released every now and again ...)

With your change, the model that just experienced an error is now stored in self. It could be convenient to have it available for further analysis, so a priori it's not clear which design is more useful. It depends on the details of what the code does and how it does it. Changing the design just to avoid a try/finally is probably not a good idea, however.

Furthermore, there is still this code between model allocation and committing to self.model=model.

        model.setLogLevel(self.model.logLevel())

        # multithreading
        import multiprocessing
        model.setNumberThreads(multiprocessing.cpu_count())

        model.branchAndBound()

which looks like it's quite nontrivial and hence could raise errors by itself. You'd still have a leak then. So either guard that with a try/finally, in which case you might as well stick with the guard all the way, or move the self.model=model even higher.

nathanncohen commented Nov 17, 2014

comment:15

Hello,

It seems to me that only committing to the new model by assigning it to self.model at the very end is a design feature:

I wrote this file myself, and I do not think that this had the slightest importance to me at that time.

it means that if the computation fails for any reason,

When an exception is raised in that code, it is almost always because the problem is infeasible. This is not a "failure", it is the actual answer the user is looking for.

the object self is unaffected and hence still in a consistent state. It's also better for future development, because as long as model is private, there is no lock required to modify it. Once it's exported to self.model, you'd need to lock self before self.model is changed. Python currently doesn't support such multiprocessing programming anyway, but having changes in global state as atomic as possible is generally a good feature, because it reduces the possibility for unexpected interactions between multiple threads (the GIL does get released every now and again ...)

If your worry is that in a multithreaded Python program with a MixedIntegerLinearProgram instance p, the function p.solve() may be called simultaneously from different processors, I believe that you can forget it. Even if that made the slightest sense, you would have many other places in the code to re-read and adapt.

With your change, the model that just experienced an error is now stored in self. It could be convenient to have it available for further analysis, so a priori it's not clear which design is more useful. It depends on the details of what the code does and how it does it.

Well, again, what you call an 'error' is just a certificate of infeasibility, and this is definitely good to have around. But the fact that all doctests pass (even in the graph/ folder) indicates that this change brings no real problem.

Furthermore, there is still this code between model allocation and committing to self.model=model.

        model.setLogLevel(self.model.logLevel())

        # multithreading
        import multiprocessing
        model.setNumberThreads(multiprocessing.cpu_count())

        model.branchAndBound()

which looks like it's quite nontrivial and hence could raise errors by itself. You'd still have a leak then. So either guard that with a try/finally, in which case you might as well stick with the guard all the way, or move the self.model=model even higher.

Please move that line higher if you care. I believe that you are being worried unnecessarily.

Nathann

nathanncohen commented Nov 17, 2014

comment:16

Well, I will give it a try tomorrow but I am sure that I will lose at least one hour on that :-P

I just tried it, but here we have no direct connection to the internet: everything has to go through an HTTP proxy that obviously will not handle their grbgetkey utility.

I will try again tonight from another connection, but then I will not be on a university network anymore.... >_<

Nathann

nathanncohen commented Nov 17, 2014

Changed branch from u/nbruin/memory_leaks_with_lp_solvers_are_back to public/17320

nathanncohen commented Nov 17, 2014

Changed commit from bc20667 to 351f3e1

nathanncohen commented Nov 17, 2014

comment:17

Hello again !

Turns out that their license thing was greatly simplified since (or that I was lucky). Either way it went smoothly after I got a real internet connection.

I did the fix for Gurobi, though for some reason I don't quite get why it works... I actually removed a call to GRBfreeenv and the memory leak seems to have disappeared. While debugging I noticed that GRBmodel returned an error code.

Well, here it is :-)

Nathann


New commits:

4da9aa3: trac #17320: Broken doctests
4fcbc89: trac #17320: Memory leak
520c6f1: trac #17320: Merged with 6.5.beta0
351f3e1: trac #17320: Same fix for Gurobi

nathanncohen added the s: needs review label Nov 17, 2014
sagetrac-git commented Nov 17, 2014

Changed commit from 351f3e1 to 5480f4e

sagetrac-git commented Nov 17, 2014

Branch pushed to git repo; I updated commit sha1. New commits:

5480f4e: trac #17320: tighten window for memory leak in CoinBackend.solve and ensure that function declared "int" actually has a return statement

nbruin commented Nov 17, 2014

comment:19

Replying to @nathanncohen:

Please move that line higher if you care. I believe that you are being worried unnecessarily.

OK, done. If it's possible to avoid obvious memory leaks at virtually no cost, then we should, particularly because Sage library code is often used as a template for new contributions.

Also, I ensured that CoinBackend.solve has a return statement: in C, falling off the end of a function with a declared return type without executing an explicit return causes undefined behaviour, which could include returning -1 by accident (the one value that indicates erroneous behaviour). It seems Cython does set the return value to 0 in the absence of a return statement, but I wasn't able to find that stated explicitly in Cython's documentation. Being explicit about it is safer. (Certainly, if the function signature says that returning -1 means an exception has occurred, then it is instructive to explicitly not do that when no exception has occurred.)
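In Cython terms, the convention being described amounts to something like the following sketch (not the actual coin_backend.pyx code; `feasible` is a hypothetical placeholder for the real status check):

```cython
cpdef int solve(self) except -1:
    ...
    if not feasible:
        # Raising maps to a C return value of -1, per the `except -1` clause;
        # the generated wrapper then propagates the Python exception.
        raise MIPSolverException("problem is infeasible")
    ...
    # Explicit success value: never fall off the end of a function with a
    # declared C return type, and never return -1 without an exception set.
    return 0
```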

nathanncohen commented Nov 18, 2014

comment:20

Hellooooooo !!

OK, done. If it's possible to avoid obvious memory leaks at virtually no cost, then we should, particularly because Sage library code is often used as a template for new contributions.

Okayokay. Well, there is obviously nothing wrong with what you did !

Right now I am on a very bad internet connection. It goes through a weird proxy; I can't use git or ssh, and I keep getting disconnected from trac, so I have to re-send every comment 10 times. I will be able to test and review your commit tonight. Do you have anything against the previous changes? If not, I will set this ticket to positive_review later, once I am able to test your commit.

Thanks,

Nathann

nathanncohen commented Nov 18, 2014

comment:21

Hello ! I was finally able to test your latest commit and Sage reported nothing wrong. You can set this ticket to positive_review if you agree with my commits.

Thanks,

Nathann

nathanncohen commented Nov 18, 2014

Author: Nils Bruin, Nathann Cohen

nathanncohen commented Nov 18, 2014

Reviewer: Nils Bruin, Nathann Cohen

nbruin commented Nov 18, 2014

comment:22

Replying to @nathanncohen:

You can set this ticket to positive_review if you agree with my commits.

No objections here. However, someone who uses gurobi should probably check and sign off on that change: especially because it's optional, the buildbot won't have anything useful to say about it either.

nathanncohen commented Nov 18, 2014

comment:23

No objections here. However, someone who uses gurobi should probably check and sign off on that change: especially because it's optional, the buildbot won't have anything useful to say about it either.

Okay... Perhaps Peter then, since he raised the problem ? :-P

Nathann

ypfmde commented Nov 18, 2014

comment:24

Replying to @nathanncohen:

Okay... Perhaps Peter then, since he raised the problem ? :-P

Well, I tried the patch in version 6.4.rc2. My original program now runs fine. There, however, all MIPs are feasible, so I also tried a silly test where every other MIP is infeasible. Again, memory usage is constant.

So the Coin and gurobi backends now seem to work fine.

(By the way, MIPSolverException cannot be used in a try/except environment. Not sure if this is a bug. Anyway, this was the same in the unpatched version.)

simon-king-jena commented

comment:26

Replying to @ypfmde:

(By the way, MIPSolverException cannot be used in a try/except environment. Not sure if this is a bug. Anyway, this was the same in the unpatched version.)

Why not? Did you perhaps not import the error first?

sage: from sage.numerical.mip import MIPSolverException
sage: try:
....:     raise MIPSolverException("Problem")
....: except MIPSolverException:
....:     print "problem caught"
....: 
problem caught

ypfmde commented Nov 18, 2014

comment:27

Replying to @simon-king-jena:

Replying to @ypfmde:

(By the way, MIPSolverException cannot be used in a try/except environment. Not sure if this is a bug. Anyway, this was the same in the unpatched version.)

Why not? Did you perhaps not import the error first?

Indeed, I didn't know that one has to import this error. Hmm, still doesn't look user-friendly to me that the class MixedIntegerLinearProgram is known to Sage without an import, while the exception belonging to it is not. -- Peter Mueller

simon-king-jena commented

comment:28

Replying to @ypfmde:

Why not? Did you perhaps not import the error first?

Indeed, I didn't know that one has to import this error. Hmm, still doesn't look user-friendly to me that the class MixedIntegerLinearProgram is known to Sage without an import, while the exception belonging to it is not. -- Peter Mueller

I'd say that having MixedIntegerLinearProgram in the global namespace is not clearly a good thing, since most Sage users wouldn't use it. So it is already a courtesy to some users that it can be used without an import.

And MIPSolverException is a technicality that is only relevant to developers or expert users, and thus clearly shouldn't pollute the global name space.

vbraun commented Nov 20, 2014

Changed branch from public/17320 to 5480f4e
