RouterOS 6.43.8 instability (testing to get 6.43.13 functioning) #7

Closed
nathanfaber opened this issue Jan 10, 2019 · 22 comments

Comments


nathanfaber commented Jan 10, 2019

There seems to be an issue with 6.43.8. Pairs have been unstable and rebooting periodically. Occasionally, I have had issues accessing the RouterOS console (web, ssh, serial). It accepts the login and then stalls and never gives the prompt (last line "/command Use command at the base level").

Unclear what is happening. I do not recommend running 6.43.8 outside of testing.

nathanfaber commented Jan 11, 2019

Also seeing:

Topics | snmp, warning
-- | --
Message | timeout while waiting for program 48

nathanfaber commented Jan 14, 2019

6.43.8 seems to be completely unstable with ha-mikrotik. I am now testing with 6.42.11 (long-term), which appears to be stable so far.
 

@nathanfaber nathanfaber added bug workaround and removed bug labels Jan 14, 2019

the-nicolas commented Mar 19, 2019

What about newer versions like 6.44.1? Any experience so far?

nathanfaber commented Mar 24, 2019

> What about newer versions like 6.44.1? Any experience so far?

I have only tested up to 6.42.11. I will try to test a newer version in the next week or two and report back.

nathanfaber commented Mar 25, 2019

I have deployed 6.43.13 (long-term) to a pair and I will report back if it appears stable.

nathanfaber commented Mar 25, 2019

6.43.13 worked well overnight and has now been deployed to another production pair.

I am also running a reboot sync/check loop and am currently at 28 iterations. So far, so good.

Reboot loop code:

:for pushCount from=1 to=10000 do={
   :put "$pushCount pushing"
   $HAPushStandby
   :put "$pushCount push done"
   :delay 120
   $HASyncStandby;
   :put "$pushCount sync ok"
   :delay 10
}

nathanfaber commented Mar 25, 2019

@the-nicolas Are you actually interested in 6.44.1 or will the 6.43.13 that I am testing suffice? I don't have any features that I am seeking in 6.44 right now so I'd stay on 6.43 for testing unless there is demand for the other branch.

nathanfaber commented Mar 25, 2019

Using 6.43.13, I believe I have caught a scheduler event with start-time=startup firing even though it was added after boot and should not execute until the next reboot. This is bad, as I rely on this event firing only after a reboot, and it causes a role switch when there shouldn't be one. I am going to post on the Mikrotik forums to see if they can look into it, and I am continuing to try to reproduce it.

This is one of the problems I had with 6.43.8, hopefully someone at Mikrotik can confirm why this is happening.

nathanfaber commented Mar 25, 2019

Added a patch ecf52e8 to prevent ha_startup from running twice. I am continuing to test. Running 6.43.13 with the existing releases is not recommended; you will need to wait for the next release or run the current master.
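
For readers wondering what a run-once guard might look like, here is a minimal sketch. It is not the contents of ecf52e8. It assumes the guard can key off system uptime, and the 10 minute threshold is a made-up value for illustration; the log text is borrowed from the protection message quoted later in this thread.

```
# Hypothetical guard, placed at the top of a startup script.
# Assumption: a genuine boot-time run always happens within the first 10 minutes of uptime.
:local uptime [/system resource get uptime]
:if ($uptime > 00:10:00) do={
   /log error "ha_startup: ERROR ATTEMPTED TO RUN AGAIN at uptime $uptime"
   :error "ha_startup refused: not a fresh boot"
}
```

Unlike a flag kept in a global variable, a check of this kind does not depend on /environment surviving, which may matter if the scripting process restarts.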

nathanfaber commented Mar 26, 2019

It looks like the routers are stable once booted and properly initialized, but there is another rare race with interfaces not showing up during initialization. Periodically, this lookup does not find the interface:

/interface ethernet get [find default-name="$haInterface"] orig-mac-address

I suspect it vanishes briefly during the enable a few lines earlier and then comes back. I have caught the standby in this state 2 times during 100 or so reboots, so it isn't that easy to run into.

I am doing a reboot loop to try to catch it with some additional logging but I suspect the fix will be similar to what I did earlier in the initialization (wait for it to show up again):

:while ([:len [/interface find where name="$haInterface"]]!=1) do={
   /log error "ha_startup: delaying for hardware...cant find $haInterface"
   :delay .1
}

@nathanfaber nathanfaber changed the title RouterOS 6.43.8 instability RouterOS 6.43.8 instability (testing to get 6.43.13 functioning) Mar 26, 2019

nathanfaber commented Mar 26, 2019

I am spinning 3 pairs on $HALoopPushStandby using a8e378f.
One pair is on 6.42.11 and the other two are on 6.43.13.

If everything stays stable over the next 24 hours with these, I will stamp a release.

nathanfaber commented Mar 27, 2019

Ended up reworking the initialization code into a retry loop. Tested it on 3 pairs for 12 hours. Now testing 64a7c8a on 5 pairs. Feeling pretty good about the current build for final release but continuing to test.
5 pairs:
2 x 6.42.11
3 x 6.43.13
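
A bounded retry loop around the interface lookup could be sketched roughly as follows. This is an illustration under assumptions, not the code in 64a7c8a: the interface name, the retry limit, the delay, and the haMac variable are all hypothetical.

```
# Hypothetical values; in ha-mikrotik, haInterface would come from the HA configuration.
:local haInterface "ether8"
:local maxTries 50
:local tries 0
:local haMac ""
:while (($tries < $maxTries) && ([:len $haMac] = 0)) do={
   :set tries ($tries + 1)
   :do {
      # The get fails while the interface is briefly missing, which lands in on-error
      :set haMac [/interface ethernet get [find default-name="$haInterface"] orig-mac-address]
   } on-error={
      /log warning "ha_startup: $haInterface not visible yet, attempt $tries"
      :delay 200ms
   }
}
:if ([:len $haMac] = 0) do={
   :error "ha_startup: gave up waiting for $haInterface"
}
```

Capping the number of attempts keeps a missing interface from hanging initialization forever, at the cost of having to pick a limit.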

nathanfaber commented Mar 27, 2019

Extending pairs test to include 6.44.1:
5 pairs
2 x 6.42.11
2 x 6.43.13
1 x 6.44.1

nathanfaber commented Mar 28, 2019

Completed over 100 standby cycles on each of those 5 pairs without issue. Now testing general stability without any manual forced pushes. Assuming they all survive the next 24h, I will stamp the release.

nathanfaber commented Mar 28, 2019

Release is stamped for rc1. Assuming no other issues, this will be the final release of 0.6.
If you can test it, please do.
https://github.com/svlsResearch/ha-mikrotik/releases/tag/v0.6rc1

nathanfaber commented Mar 29, 2019

Do not proceed with the upgrade. There is an issue after ~24 hours of runtime with the new RouterOS that I am trying to debug.

The problem is with RouterOS: older versions still appear stable with the new ha-mikrotik, but newer ones do not.

nathanfaber commented Mar 29, 2019

After around 16 hours of uptime (16:23-16:33) I am seeing "ERROR ATTEMPTED TO RUN AGAIN" protection logging on all pairs running 6.43.13 and 6.44.1 (6.42.11 continues to be fine and stable). Somehow, RouterOS is deciding to run the startup scheduler events that should be running on next boot.

Forgot to say...after this happens, we lose the entire environment (/environment print), which is very bad.

I suspect the cause is another component that I run on RouterOS, one that isn't part of ha-mikrotik, so it may not be an issue for others, but I want to confirm this before I let it go wild.

nathanfaber commented Mar 29, 2019

99% convinced that the environment loss issue is due to another script that fires every 1m on my systems and checks the health of various netwatches. The script has not been completing, and nothing checks whether another instance is already running, so they accumulate in running jobs. There were over 200 of them running on the routers I was able to observe that had lost the environment. I think the RouterOS scripting/interpreter process is crashing due to out of memory and restarting, causing the environment to be lost and the startup to happen again.

This component has nothing to do with ha-mikrotik and has some other issue with 6.43.13/6.44.1.

Based on this, I still believe the current ha-mikrotik is functional and stable with all versions but until I can do some more testing with this other component disabled, I am going to hold off.
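
A scheduled script can protect itself against this kind of pile-up by checking the running job list before doing any work. The following is a minimal sketch, assuming the script is stored under /system script and that its jobs are reported by that name; "netwatch-health" is a hypothetical name, not the actual script from my systems.

```
# Hypothetical script name, for illustration only.
:local selfName "netwatch-health"
# The current run is itself listed as a job, so more than one match means an older run is still active.
:if ([:len [/system script job find where script=$selfName]] > 1) do={
   /log warning "$selfName: previous run still active, skipping this run"
   :error "already running"
}
```

With a guard like this, a hung run blocks later runs instead of stacking up hundreds of jobs, which makes the hang visible in the log without exhausting memory.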

nathanfaber commented Mar 29, 2019

Further confirmation of the memory exhaustion theory: over the last 5 days since the RouterOS upgrade, memory has been slowly ramping up. I have disabled the script that I believe is responsible to see if memory stays stable.

image

nathanfaber commented Mar 29, 2019

The issue with my other script has been tracked down to:
:put [/ip route check 1.2.3.4 without-paging as-value]
On 6.42.11, this returns immediately, as if once had been added.
On 6.43.13 and 6.44.1, this never returns.
The fix is simple: add once.
:put [/ip route check 1.2.3.4 without-paging as-value once]

So this other script was definitely continuing to spawn until something went wrong with the RouterOS interpreter.

Again, this problem had nothing to do with ha-mikrotik, but this script runs on machines where I also run ha-mikrotik, which caused an overall problem. Testing continues, still expecting to stamp this current master for release without change.

nathanfaber commented Mar 30, 2019

Crossed the 16h mark on all the pairs, they look much better. Release will be stamped on Monday.
image

nathanfaber commented Apr 1, 2019

Everything remains stable. Stamping the release and closing this issue.

@nathanfaber nathanfaber closed this Apr 1, 2019