xcon hub recv buffer #231
I was having issues when using xcon in hub mode if there was too much traffic on the hub: the packets would get eaten and the BGP sessions would go down with hold time expired. Upping the receive buffer to 16384 fixed my issue. Not sure if anyone else cares about this; if there is interest I can submit an MR.
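The change in question amounts to increasing the size passed to recv() in the hub's relay loop. A minimal sketch of that kind of change, using illustrative names rather than the actual identifiers in xcon.py:

```python
import socket

# Illustrative sketch only; names are assumptions, not the actual xcon.py code.
RECV_BUF_LEN = 16384  # was 2048; too small to absorb bursts of BGP UPDATEs


def pump(src_sock: socket.socket, dst_socks: list) -> None:
    """Relay whatever arrived on one hub endpoint to all the others."""
    data = src_sock.recv(RECV_BUF_LEN)
    if not data:
        raise ConnectionError("endpoint disconnected")
    for dst in dst_socks:
        dst.sendall(data)
```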
I may have jumped the gun here... this helped but didn't fully fix my problem.
Upping the value to 65535 seems to work (for now). FWIW, the trigger seems to be when my script collects the received routes from the vr-bgp instances.
The hub is implemented in Python and probably not in the most efficient way possible, so I wouldn't be surprised if it can't keep up, but it shouldn't hang completely (as you indicated per mail). Did you verify that by ping or so? If you run with --debug and it prints packets to the log, can you then observe that it hangs when you do this large RIB fetch?

Do you do your RIB fetch in-band? I interact with the vr-bgp API out-of-band, i.e. across docker0 (or whatever network you have underneath), whereas only the BGP traffic is in-band across the vr-xcon overlay. If you are doing it in-band, can you change to out-of-band, so the RIB fetch itself is not going across the hub? Just to try out at least (should be easy to repeat by just curling the API, right!?). How large is the RIB? Full table?

topo-machine uses a single vr-xcon to run all connectivity in the entire topology. You could instead divide it up, so that you have one vr-xcon instance per hub, and even one per p2p connection. That should just be a few line changes to topo-machine. If vr-xcon actually hangs, that must be a bug, and separating it over multiple instances would probably just be a workaround rather than a true fix. Nonetheless, it could get you up and running again. It might also be that having multiple hubs / p2p connections in the same vr-xcon instance is what has us end up in some weird deadlock state.
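For the out-of-band fetch, something along these lines should work. The sketch below assumes a vr-bgp container reachable at 172.17.0.3 on docker0 and a hypothetical /rib endpoint and port; substitute whatever the vr-bgp API actually exposes:

```python
import json
import urllib.request

VR_BGP_ADDR = "172.17.0.3"  # container address on docker0 (assumed)
API_PORT = 8080             # hypothetical API port


def fetch_rib(addr: str = VR_BGP_ADDR, port: int = API_PORT) -> dict:
    """Fetch received routes across docker0, bypassing the vr-xcon hub."""
    url = f"http://{addr}:{port}/rib"  # hypothetical endpoint path
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


print(len(fetch_rib().get("routes", [])), "routes received")
```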
After some more in-depth investigation, my initial thoughts were incorrect. The trigger seems to be announcing the routes; it just so happens that my script sleeps for 90s to allow for convergence before fetching the RIB, which happens to be the hold timer value, so it only seemed like fetching the RIB was the culprit. RIB fetching is indeed done out-of-band on docker0, and the RIB isn't that large, a few hundred routes max (my normal test environment).

Below is a scaled-down version (vr-bgp instance wise) to make debugging easier and to see if I could get it to break with just 2 neighbors. I have gotten the setup into a fairly interesting state: there are 2 vr-bgp instances connected to a single hub, and I've broken it in such a way that 1 is reachable and 1 is not. I have turned on all of the debugging, and on the hub I can see traffic coming in and going out to both vr-bgp instances (.2 is the virtual router, .3 and .4 are the vr-bgp instances, and .7 is the hub):
On the vr-bgp instance that is working, I can see traffic on both the tap interface and the eth interface.
tap0:
On the vr-bgp instance that isn't working, I see packets on eth0 but not on tap0.
tap0:
I also have the docker logs from the vr-bgp instances and the hub instance. The vr-bgp instances both show this:
So they appear to at least be getting packets from the hub. On the hub I see packets coming in and going out to both vr-bgp instances:
At this point I'm not entirely sure what is going on or what to poke at next.
Also, for what it's worth, I wasn't able to break the hub with a single vr-bgp instance connected to it.
One thing I've noticed on the vr-bgp instance that isn't working is the tcp_buf len (I reverted back to 2048 in the xcon.py code, so this isn't me messing with the buffer in any way; both instances are using the same image):
On the instance that's working it looks like this:
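An aside on why that tcp_buf length is meaningful: QEMU's socket netdev prefixes each Ethernet frame with a 4-byte big-endian length header, so a relay that reads fixed-size chunks can be left holding a partial frame whenever a frame straddles the buffer boundary. A framing-aware read loop (a sketch of the technique, not the actual xcon.py code) would look like this:

```python
import socket
import struct


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping over partial recv() results."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf


def recv_frame(sock: socket.socket) -> bytes:
    """Read one length-prefixed frame: 4-byte big-endian length, then payload."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```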
I've been struggling with the […]
I've re-implemented the […]
A potential area of improvement would be active socket monitoring. I have seen one rare occurrence where it took up to a full minute for an endpoint on the hub to recover from a failure. That was probably just a TCP socket timing out at the OS level. Can we do anything to speed up detection?
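One way to speed that up (a sketch of a general technique, not something the current hub is known to do) is to enable TCP keepalives with aggressive timers on every hub socket, so the kernel declares a silent peer dead within seconds instead of hours:

```python
import socket


def enable_fast_keepalive(sock: socket.socket) -> None:
    """Detect dead peers quickly via TCP keepalives (Linux-specific options).

    With these values a silent peer is declared dead after roughly
    5 + 3 * 2 = 11 seconds, instead of the OS default of two hours or more.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 2)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # failed probes before reset
```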
@bdreisbach I have since figured out that the real root cause for all of this is likely in QEMU.
99% of my traffic flow is between VM1 and VM2, but the problem actually occurs towards VM3. Comparing packet captures of a continuous ping between VM1 and VM3, I have seen that:
Communication can be restored by either:
Closing in favor of #238.