
Fix memory leak for 0mq transport in case of TCP DDOS #36618

Merged — 1 commit merged into saltstack:develop on Sep 28, 2016

Conversation

@onorua (Contributor) commented Sep 27, 2016

What does this PR do?

When someone attempts to "DDoS" the ZeroMQ transport with raw TCP messages,
the ZeroMQPubServerChannel process grew to more than 8 GB of RAM.
With this patch, ZeroMQPubServerChannel stabilized at around 300 MB
for > 500 nodes, with no performance penalty. It does not affect the performance of 0mq minions, as it performs gc.collect() only when an exception is raised.

What issues does this PR fix or reference?

#36612

Tests written?

No

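The change described in this PR can be sketched roughly as follows (a minimal sketch with hypothetical names, not Salt's actual channel code):

```python
import gc

def handle_frame(payload, decode):
    """Decode one inbound frame with the GC paused (hypothetical helper
    illustrating the pattern this PR describes)."""
    gc.disable()  # collection stays off while decoding, as before the patch
    try:
        return decode(payload)
    except Exception:
        # A malformed (e.g. DDoS) frame failed to decode: force a collection
        # so garbage produced while the GC was off is reclaimed instead of
        # accumulating across many bad frames.
        gc.collect()
        raise
    finally:
        gc.enable()

print(handle_frame(b"hello", len))  # well-formed frame decodes normally: 5
```

Because `gc.collect()` runs only on the exception path, well-behaved minions never pay for it.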
@ahammond (Contributor) commented

@cachedout can you please take a look? I'd desperately like this to be in 2016.3.4 if at all possible.

@rallytime rallytime added the bugfix-bckport label (will be back-ported to an older release branch by creating a PR against that branch) Sep 27, 2016
@cachedout (Contributor) commented

@ahammond @onorua In your testing of this, did the gc.enable() not happen? Since this is in an exception block, I think a collect() is probably fine but they can be pretty expensive so I'm a little curious about what's really happening here.

@onorua (Contributor, Author) commented Sep 28, 2016

@cachedout gc.enable() doesn't do any collecting; it merely sets a flag allowing it, as the source code shows. The default thresholds for Python's GC are (700, 10, 10): the youngest generation is collected after 700 allocations, the middle generation after 10 young collections, and the oldest after 10 middle collections.
The function to pay attention to:

static PyObject *
gc_enable(PyObject *self, PyObject *noargs)
{
    enabled = 1;
    Py_INCREF(Py_None);
    return Py_None;
}

This means that when you disable garbage collection and re-enable it later, the accumulated object counts are lost and counting starts over.
To prove it, here is a small exercise:

In [626]: gc.disable()

In [627]: gc.get_count()
Out[627]: (213, 10, 10)
....
In [683]: gc.get_count()
Out[683]: (737, 10, 10)

In [684]: gc.get_count()
Out[684]: (746, 10, 10)

In [685]: gc.enable()

In [686]: gc.get_count()
Out[686]: (30, 11, 10)

In [687]: gc.get_count()
Out[687]: (37, 11, 10)

In [688]: gc.get_count()
Out[688]: (45, 11, 10)

In [689]: gc.get_count()
Out[689]: (53, 11, 10)

In [690]: gc.get_count()
Out[690]: (61, 11, 10)

In [691]:

In [691]: gc.get_threshold
Out[691]: <function gc.get_threshold>

In [692]: gc.get_threshold()
Out[692]: (700, 10, 10)

As you can see, even though my thresholds are set to (700, 10, 10), the gen-0 count that had climbed to 746 dropped back to double digits — (61, 11, 10) — just from disabling GC and enabling it again.
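The same behaviour can be reproduced in a plain script (a sketch against CPython's default thresholds; exact counts will vary with interpreter state):

```python
import gc

gc.collect()                    # a full collection zeroes all three counters
gc.set_threshold(700, 10, 10)   # the CPython defaults discussed above
gc.disable()

junk = [[i] for i in range(2000)]   # ~2000 tracked allocations, well past 700

before, _, _ = gc.get_count()
assert before > 700   # counter kept climbing: nothing was collected while off

gc.enable()
trigger = []          # first tracked allocation after re-enabling
after, _, _ = gc.get_count()
assert after < before  # the accumulated count is gone; counting started over
```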

Generally speaking, disabling the GC in Python just to process one message is overkill, but I trust that you ran tests and it works better with gc.disable(); that is why I proposed running gc.collect() only in case of an exception.

@cachedout (Contributor) commented

@onorua Thanks for the explanation. If I recall correctly, this was enabled at the advice of the msgpack project. They have a note on it here: https://github.com/msgpack/msgpack-python#gc

Some testing on this was done on the Salt side but it was a long time ago and I don't have the results handy right now.
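The msgpack note boils down to the pattern below (sketched with json as a stand-in so the snippet is self-contained; with msgpack-python the same wrapper would go around `unpackb` or an `Unpacker`):

```python
import gc
import json  # stand-in for msgpack so the sketch runs without extra deps

def unpack_large(buf):
    # Unpacking a large message allocates many container objects, and each
    # tracked allocation can tip the gen-0 threshold and trigger a useless
    # collection mid-decode; pausing the GC avoids those pauses.
    gc.disable()
    try:
        return json.loads(buf)
    finally:
        gc.enable()

msg = json.dumps(list(range(10000)))
assert unpack_large(msg) == list(range(10000))
```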

@onorua (Contributor, Author) commented Sep 28, 2016

@cachedout how big is the message? From the msgpack statement I understood that this applies only to large messages. We could try commenting out gc.disable() and gc.enable() and measure the performance impact, but it would be better if you could test it in an isolated environment (in our case that would be production, where somebody could run a job that invalidates the test).
Do you think you can run those tests, or would you prefer to leave gc.collect() in the exception handler and let it be?

@cachedout (Contributor) commented

@onorua I am fine with putting this gc.collect() in place for now. If, later on, tests conclude that under normal operation we can remove the .enable and .disable, then that is even better.

@cachedout cachedout merged commit e41d6a2 into saltstack:develop Sep 28, 2016
@rallytime rallytime added the ZZZ[Done]-back-ported-bf RETIRED label (the pull request has been back-ported to an older branch) and removed the bugfix-bckport label Sep 28, 2016
cachedout pushed a commit that referenced this pull request Sep 29, 2016