Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of NLM_F_DUMP_INTR is apparently missing #874

Closed
crosser opened this issue Jan 24, 2022 · 16 comments
Closed

Handling of NLM_F_DUMP_INTR is apparently missing #874

crosser opened this issue Jan 24, 2022 · 16 comments

Comments

@crosser
Copy link
Contributor

crosser commented Jan 24, 2022

We got a hard to reproduce condition in our tests, when a dump of network interfaces and addresses returned incomplete information. Because we have already encountered a related problem with nftables that turned out to be related to improper reaction NLM_F_DUMP_INTR condition, I suspect that the same thing is happening in pythoute2.

The point is, when the kernel is producing a dump of a kernel structure over multiple netlink messages, and the structure changes in the mid-way, NLM_F_DUMP_INTR is added to the header flags, that must be treated by userspace as an indication that the requested dump contains inconsistent data and must be re-requested. See function nl_dump_check_consistent() in the kernel source include/net/netlink.h.

I cannot find related functionality in pyroute2, which may explain the error that we encountered. I suspect that the necessary check for the flag NLM_F_DUMP_INTR in the received netlink messages is missing.

@svinota
Copy link
Owner

svinota commented Jan 24, 2022

Yep, looks like that. Thanks, investigating!

@svinota
Copy link
Owner

svinota commented Jan 27, 2022

Yep. What kind of the feedback would you expect here — a simple flag in the response, or an exception raised?

@crosser
Copy link
Contributor Author

crosser commented Jan 27, 2022

Either of those will do, though in my opinion it will be fine to simply retry the request until it succeeds completes with anything else than NLM_F_DUMP_INTR, without exposing it to the user. It is not, logically, and "error situation", just a "normal situation" that is exposed to userspace only because kernel cannot recover from it transparently, it needs to retroactively invalidate data that it has already passed over.

svinota added a commit that referenced this issue Jan 29, 2022
flags added:
        * NLM_F_DUMP_INTR
        * NLM_F_DUMP_FILTERED

Bug-Url: #874
@svinota
Copy link
Owner

svinota commented Jan 29, 2022

Things get more complicated. I've reproduced the 0x10 flag NLM_F_DUMP_INTR responses having ~50k IP addresses on the system, but look.

By default the library reads the response and returns it to the user as a list, but it's only the default historical behaviour. Having nlm_generator = True the library returns generators instead of lists, thus making transparent subsequent retries impossible: the response may be partly returned at the time we get a message with the flag 0x10 in the header.

So I leave it as is for now — only adding the flag, thanks — and as a workaround I can propose checking the response's msg['header']['flags'] for NLM_F_DUMP_INTR:

from pyroute2.netlink import NLM_F_DUMP_INTR

for msg in ipr.addr('dump'):
    if msg['header']['flags'] & NLM_F_DUMP_INTR:
        the_retry_logic()

If you can propose any improvements and shortcuts that will not break the compatibility — pls ping me. Thanks!

@crosser
Copy link
Contributor Author

crosser commented Jan 29, 2022

I realize that if the procedure that returns the data is a generator, than it is impossible to make retry logic invisible to the caller.

However just hoping that all users of the library will know that they need to check an obscure flag, in my opinion, will not work. The next person who, like me, suddenly encounters truncated data will have no idea what's going on.

In addition to that, such approach will make it impossible to use list comprehensions like this (taken from real code):

def _get_all_ifs(ipr):
    return [
        (
            el["index"],
            el.get_attr("IFLA_IFNAME"),
            el.get_nested("IFLA_LINKINFO", "IFLA_INFO_KIND"),
        )
        for el in ipr.get_links()
    ]

that I, for one, rather like! 😁

I think that it may be more "user-friendly" to raise an exception with a suggestive name. Then when this (rare) situation happens, at least the user will get a clue what happened, rather than struggle with mysteriously truncated result. With exception, I would write the above code as

def _get_all_ifs(ipr):
    while True:
        try:
            return [
                (
                    el["index"],
                    el.get_attr("IFLA_IFNAME"),
                    el.get_nested("IFLA_LINKINFO", "IFLA_INFO_KIND"),
                )
                for el in ipr.get_links()
            ]
        except NetlinkDumpInterrupted:
            continue

I my eyes, this looks less ugly than checking the flag in every returned element.

What do you think?

P.S. I am not 100% sure, but it may be necessary to consume potentially remaining messages waiting in the socket, after the one with raised NLM_F_DUMP_INTR flag, unless you close and reopen the socket. See this patch as an example. "Exception" approach allows to do it in the library, before raising the exception, and deny the user the opportunity to make another error.

svinota added a commit that referenced this issue Feb 4, 2022
svinota added a commit that referenced this issue Feb 12, 2022
@crosser
Copy link
Contributor Author

crosser commented Feb 15, 2022

As it seems to work now, are there plans to tag a release?.. We'd build and internal package, it's nicer to have it anchored on a tagged version than on a commit id... Thanks!

@svinota
Copy link
Owner

svinota commented Feb 15, 2022

Yep, sure. Let's tag it on Thursday — I would like to have one more feature in the release :) And hope to add it tonight + some time for testing.

@svinota
Copy link
Owner

svinota commented Feb 16, 2022

Ok, closing #878 and tagging the release.

NB: pls take look at #881

@svinota
Copy link
Owner

svinota commented Feb 16, 2022

tagged

@svinota
Copy link
Owner

svinota commented Feb 18, 2022

Closing the ticket, pls don't hesitate to re-open in the case of any issues.

@svinota svinota closed this as completed Feb 18, 2022
svinota added a commit that referenced this issue Feb 22, 2022
the main event loop should not make any sync calls on the sources,
so change the source restart to be triggered by the state

Bug-Url: #883
Bug-Url: #874
@elajkat
Copy link
Contributor

elajkat commented Mar 9, 2022

Hi, in our CI we see issues with this new exception (NetlinkDumpInterrupted):
TypeError: init() takes 1 positional argument but 3 were given (see bug report for Openstack Neutron: https://bugs.launchpad.net/neutron/+bug/1962608 )

It seems to be coming from here:
https://github.com/svinota/pyroute2/blob/0.6.7/pyroute2.core/pr2modules/netlink/exceptions.py#L61-L67

This can be fixed with something like this diff:

  • def init(self):
  •    super(NetlinkDumpInterrupted, self).__init__(-1, 'dump interrupted')
    
  • def init(self, code=None, msg=None):
  •    code = code if code else -1
    
  •    msg = msg if msg else 'dump interrupted'
    
  •    super(NetlinkDumpInterrupted, self).__init__(code, msg)
    

As I see this will not introduce extra problems, and works in our test env also, what do you think?

@crosser
Copy link
Contributor Author

crosser commented Mar 9, 2022

@elajkat , would it make sense to submit a pull request with this test for inclusion in pyroute2 testssuite? I think everyone would benefit from it...

@svinota
Copy link
Owner

svinota commented Mar 9, 2022

@elajkat hej Lajos! pls look at e29ff25 if it fixes the issue

@elajkat
Copy link
Contributor

elajkat commented Mar 9, 2022

Thanks, looks perfect.
A question: it seems that we have this exception quite often (at least in our functional test env, in which Neutron creates parralel namespaces and things like that to make test execution as parrlalel as possible), is it worth to retry the operation after a NetlinkDumpInterrupted?

@crosser
Copy link
Contributor Author

crosser commented Mar 9, 2022

@elajkat well, if you get this exception, it means that the data is incomplete. So yes, it's worth and I would say mandatory to retry, if you care about the accuracy of Neutron's view of the landscape. There is some example code in one of my comments above for inspiration.

@elajkat
Copy link
Contributor

elajkat commented Mar 9, 2022

Thanks, I will add retry logic then

openstack-mirroring pushed a commit to openstack/requirements that referenced this issue Mar 14, 2022
pyroute 0.6.7 introduced a new exception NetlinkDumpInterrupted which
is raised when NLM_F_DUMP_INTR is set in the flags during dump of
devices.
Final solution for it should be to use retry as suggested by pyroute2
developers (see [1]), a possible patch for the retry logic is [2].

[1]: svinota/pyroute2#874 (comment)
[2]: https://review.opendev.org/c/openstack/neutron/+/833015
Related-Bug: #1962608

Change-Id: I757a2d03e60772c716ddaa5fa63f36cce10af984
openstack-mirroring pushed a commit to openstack/openstack that referenced this issue Mar 14, 2022
* Update requirements from branch 'master'
  to 9d80db06bcaebea774dca6325a10921221ec31d4
  - Merge "Limit pyroute2 to be <0.6.6"
  - Limit pyroute2 to be <0.6.6
    
    pyroute 0.6.7 introduced a new exception NetlinkDumpInterrupted which
    is raised when NLM_F_DUMP_INTR is set in the flags during dump of
    devices.
    Final solution for it should be to use retry as suggested by pyroute2
    developers (see [1]), a possible patch for the retry logic is [2].
    
    [1]: svinota/pyroute2#874 (comment)
    [2]: https://review.opendev.org/c/openstack/neutron/+/833015
    Related-Bug: #1962608
    
    Change-Id: I757a2d03e60772c716ddaa5fa63f36cce10af984
openstack-mirroring pushed a commit to openstack/openstack that referenced this issue Apr 12, 2022
* Update neutron from branch 'master'
  to ec4cf672f11a812a475d051970069e5b3a304189
  - Merge "Add retry for privsep get_link_devices"
  - Add retry for privsep get_link_devices
    
    pyroute 0.6.6 introduced a new exception NetlinkDumpInterrupted which
    is raised when NLM_F_DUMP_INTR is set in the flags during dump of
    devices.
    The suggestion from pyroute developers is to retry in case of this
    exception (see [1]).
    
    [1]: svinota/pyroute2#874 (comment)
    
    Closes-Bug: #1962608
    
    Change-Id: Ie195ad596fd148708fc30946bde964d52444afee
openstack-mirroring pushed a commit to openstack/neutron that referenced this issue Apr 12, 2022
pyroute 0.6.6 introduced a new exception NetlinkDumpInterrupted which
is raised when NLM_F_DUMP_INTR is set in the flags during dump of
devices.
The suggestion from pyroute developers is to retry in case of this
exception (see [1]).

[1]: svinota/pyroute2#874 (comment)

Closes-Bug: #1962608

Change-Id: Ie195ad596fd148708fc30946bde964d52444afee
openstack-mirroring pushed a commit to openstack/neutron that referenced this issue Jun 1, 2022
pyroute 0.6.6 introduced a new exception NetlinkDumpInterrupted which
is raised when NLM_F_DUMP_INTR is set in the flags during dump of
devices.
The suggestion from pyroute developers is to retry in case of this
exception (see [1]).

[1]: svinota/pyroute2#874 (comment)

Closes-Bug: #1962608

Change-Id: Ie195ad596fd148708fc30946bde964d52444afee
(cherry picked from commit 74a9e83)
openstack-mirroring pushed a commit to openstack/neutron that referenced this issue Jun 10, 2022
pyroute 0.6.6 introduced a new exception NetlinkDumpInterrupted which
is raised when NLM_F_DUMP_INTR is set in the flags during dump of
devices.
The suggestion from pyroute developers is to retry in case of this
exception (see [1]).

[1]: svinota/pyroute2#874 (comment)

Closes-Bug: #1962608

Depends-On: https://review.opendev.org/c/openstack/requirements/+/845268
Change-Id: Ie195ad596fd148708fc30946bde964d52444afee
(cherry picked from commit 74a9e83)
(cherry picked from commit 541d1fc)
openstack-mirroring pushed a commit to openstack/neutron that referenced this issue Jun 14, 2022
pyroute 0.6.6 introduced a new exception NetlinkDumpInterrupted which
is raised when NLM_F_DUMP_INTR is set in the flags during dump of
devices.
The suggestion from pyroute developers is to retry in case of this
exception (see [1]).

[1]: svinota/pyroute2#874 (comment)

Closes-Bug: #1962608

Depends-On: https://review.opendev.org/c/openstack/requirements/+/845268
Change-Id: Ie195ad596fd148708fc30946bde964d52444afee
(cherry picked from commit 74a9e83)
(cherry picked from commit 541d1fc)
(cherry picked from commit ea85053)
Conflicts: neutron/privileged/agent/linux/ip_lib.py
 	requirements.txt
damiandabrowski5 pushed a commit to damiandabrowski5/openstack-neutron that referenced this issue Jun 22, 2022
pyroute 0.6.6 introduced a new exception NetlinkDumpInterrupted which
is raised when NLM_F_DUMP_INTR is set in the flags during dump of
devices.
The suggestion from pyroute developers is to retry in case of this
exception (see [1]).

[1]: svinota/pyroute2#874 (comment)

Closes-Bug: #1962608

Depends-On: https://review.opendev.org/c/openstack/requirements/+/845268
Change-Id: Ie195ad596fd148708fc30946bde964d52444afee
(cherry picked from commit 74a9e83)
(cherry picked from commit 541d1fc)
damiandabrowski5 pushed a commit to damiandabrowski5/openstack-neutron that referenced this issue Jun 22, 2022
pyroute 0.6.6 introduced a new exception NetlinkDumpInterrupted which
is raised when NLM_F_DUMP_INTR is set in the flags during dump of
devices.
The suggestion from pyroute developers is to retry in case of this
exception (see [1]).

[1]: svinota/pyroute2#874 (comment)

Closes-Bug: #1962608

Depends-On: https://review.opendev.org/c/openstack/requirements/+/845268
Change-Id: Ie195ad596fd148708fc30946bde964d52444afee
(cherry picked from commit 74a9e83)
(cherry picked from commit 541d1fc)
nectar-gerrit pushed a commit to NeCTAR-RC/neutron that referenced this issue Nov 8, 2022
pyroute 0.6.6 introduced a new exception NetlinkDumpInterrupted which
is raised when NLM_F_DUMP_INTR is set in the flags during dump of
devices.
The suggestion from pyroute developers is to retry in case of this
exception (see [1]).

[1]: svinota/pyroute2#874 (comment)

Closes-Bug: #1962608

Depends-On: https://review.opendev.org/c/openstack/requirements/+/845268
Change-Id: Ie195ad596fd148708fc30946bde964d52444afee
(cherry picked from commit 74a9e83)
(cherry picked from commit 541d1fc)
(cherry picked from commit ea85053)
Conflicts: neutron/privileged/agent/linux/ip_lib.py
 	requirements.txt
openstack-mirroring pushed a commit to openstack/ovn-bgp-agent that referenced this issue May 3, 2023
pyroute 0.6.6 introduced a new exception NetlinkDumpInterrupted which
is raised when NLM_F_DUMP_INTR is set in the flags during dump of
devices.
The suggestion from pyroute developers is to retry in case of this
exception (see [1]).

For reference, this is how it was handled in neutron side: [2]

[1] svinota/pyroute2#874 (comment)
[2] https://review.opendev.org/c/openstack/neutron/+/833015

Change-Id: I6b5604266f29a585bb1e0f8e605ee5df523b5951
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants